Tutorial - Predicting mood from raw audio data

How to solve a multi-label classification problem

This tutorial will show you how to build a model that will solve a multi-label classification problem. This means that the model tries to decide for each class whether the example belongs to that class or not.

You will create a solution on the Platform to a complex problem using state-of-the-art machine learning models.

Target audience: Data scientists and Developers

Preread: This tutorial is based on an adaptation of FCN-6, a content-based automatic music tagging algorithm using fully convolutional neural networks, from the paper AUTOMATIC TAGGING USING DEEP CONVOLUTIONAL NEURAL NETWORKS. If you want to, you can dig into the paper before you start this tutorial.

The problem

Can you figure out the beat, feeling and mood of a song by "looking" at its signature? We're going to find out! In this case we're converting music file segments to "log scaled mel spectrograms". For great detail on how to do this yourself, check out our GitHub repo with the Jupyter Notebook ready to go: https://github.com/Peltarion/tagger-tutorial/.

All the spectrograms are tagged with the songs' moods. For example, one song in the dataset, I'm Your Ride (Instrumental Version), is tagged with "Happy". Listen here to see if you agree:

I'm Your Ride (Instrumental Version)

The illustration shows a log scaled mel spectrogram of the first 30 s of "I'm Your Ride (Instrumental Version)". X-axis: time, y-axis: frequency (high Hz at the bottom and low Hz on top)
Definition of log scaled mel spectrogram: The magnitude of the short-time Fourier transform (STFT) transformed to mel scale. This is then transformed to log scale.
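
The definition above can be sketched in plain Python. This is only a minimal illustration, not the code from the notebook in the GitHub repo: the function names, the Hann window, the HTK-style mel formula, and the tiny frame sizes are all assumptions chosen to keep the example short and free of dependencies.

```python
import cmath
import math


def hz_to_mel(hz):
    # HTK-style mel formula (an assumption; other mel variants exist).
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def stft_magnitude(signal, frame_len, hop):
    """Magnitude of a naive short-time Fourier transform with a Hann window."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        bins = []
        for k in range(frame_len // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            bins.append(abs(acc))
        frames.append(bins)
    return frames  # frames[time][frequency_bin]


def mel_filterbank(n_mels, frame_len, sample_rate):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    bin_hz = [k * sample_rate / frame_len for k in range(frame_len // 2 + 1)]
    mel_max = hz_to_mel(sample_rate / 2)
    edges = [mel_to_hz(i * mel_max / (n_mels + 1)) for i in range(n_mels + 2)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def log_mel_spectrogram(signal, sample_rate, frame_len=64, hop=32, n_mels=8):
    magnitudes = stft_magnitude(signal, frame_len, hop)
    bank = mel_filterbank(n_mels, frame_len, sample_rate)
    spectrogram = []
    for bins in magnitudes:
        mel = [sum(w * b for w, b in zip(row, bins)) for row in bank]
        # The small offset avoids log(0) for silent bands.
        spectrogram.append([math.log(v + 1e-6) for v in mel])
    return spectrogram


# Usage: a short synthetic 440 Hz tone instead of a real song segment.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(512)]
spec = log_mel_spectrogram(tone, sample_rate)
```

Each entry of `spec` is one time frame holding one log-scaled value per mel band, which is the kind of 2D array the model later treats as an image.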

Possible applications

If our model can predict and then tag a song, it could be used for a number of use cases such as:

  • Automate tagging of new songs. This will remove subjective opinions on what mood a new song has, making the tagging more consistent.
  • Improve the quality and consistency of existing metadata. Since the manual assignments are not perfect, the model can be used to identify tags on existing songs that are most likely erroneous and suggest alternative tags.
  • Tag songs at a finer level of granularity, allowing more detailed queries when searching for songs.
  • Find related songs by ranking songs according to mood similarity.

Understanding the data

One of the biggest challenges with machine learning is the quality of the input data. Quite often it’s not good enough. In this tutorial, the ground truth, that is the labels for the data, comes from a manual assignment of moods, based partially on subjective opinions. This will make it difficult for the model to identify consistent patterns in the training data.

For example: Did you agree that the song I'm Your Ride (Instrumental Version) really is "Happy"? What about "Hopeful"? These tags were created by hand, by different people. Now imagine you are a data scientist working for this music company and your goal is to improve consistency of search results when searching by mood.

Challenge accepted!

Create project

First, create a project and name it so you know what kind of project it is. Naming is important.

A project combines all of the steps in solving a problem, from pre-processing of datasets to model building, evaluation, and deployment. Using projects makes it easy to collaborate with others.

Add the dataset

Please note that by working with this dataset, you accept the author's license in the Dataset licenses section of the Knowledge center.

You'll find the link to download the Deep tagger dataset at the bottom of this page.

When you have downloaded the dataset, navigate to the Datasets view and click New dataset. Add the downloaded zip file to the Upload files tab.

You can also import the dataset without downloading it. To do this, copy the link to the dataset and paste that link in the Import files tab.

It may take some time to import the spectrogram file since it’s so large. When all files are imported, click Next, name the dataset, and click Done.

Create your subsets

By default, all samples in the dataset are split into an 80% training subset and a 20% validation subset. Use these defaults to train a real working tagger. However, since this dataset is so large and, in real life, you may want to test your model first, we will use only 8% of the dataset for training and 2% for validation.

  1. In the Subsets section, click New subset.
  2. Set Value to 0.08 and Operator to Less than.

    Name the subset Training (8%), since it will now contain 8% of the dataset's examples.

    Keep the type and the seed. Click Create.
  3. Click New subset again.
  4. Set Value to 0.98 and Operator to Greater than.

    Name the subset Validation (2%), since it will now contain 2% of the dataset's examples.

    Keep the type and the seed. Click Create.
  5. Set Normalize on subset to Training (8%).
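
The thresholds in the steps above can be pictured with a short sketch. It assumes that each example is given a seeded uniform random number between 0 and 1 and that a subset keeps the examples whose number passes the Value and Operator test. That is an illustration of the idea only, and the function name and seed are made up for the example.

```python
import random


def split_subsets(n_examples, seed=0):
    """Assign each example a random value and apply the two threshold rules."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    training, validation = [], []
    for index in range(n_examples):
        value = rng.random()
        if value < 0.08:       # Value 0.08, Operator "Less than"
            training.append(index)
        elif value > 0.98:     # Value 0.98, Operator "Greater than"
            validation.append(index)
    return training, validation


training, validation = split_subsets(100_000, seed=42)
print(len(training) / 100_000, len(validation) / 100_000)
```

With a large number of examples the two fractions land close to 8% and 2%, and the two subsets never share an example because no value can be below 0.08 and above 0.98 at the same time.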

Create a combined feature for moods

Click New combined feature, name the feature Moods, and select all the moods if you want a real working tagger. As a first quick stab at this problem, however, let's choose only these five moods:

  • Angry (1)
  • Countryside (1)
  • Dark (1)
  • Epic (1)
  • Happy (1)

Five moods are enough to test the model. Choose all the moods to build a real tagger.
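
One way to picture the combined feature is as a vector with one 0/1 entry per mood, where several entries can be 1 at the same time. The sketch below assumes that layout and that mood order. The helper name is hypothetical and not part of the platform.

```python
MOODS = ["Angry", "Countryside", "Dark", "Epic", "Happy"]


def encode_moods(tags):
    """Return one 0/1 entry per mood; a song may carry several moods at once."""
    return [1 if mood in tags else 0 for mood in MOODS]


print(encode_moods({"Happy"}))         # [0, 0, 0, 0, 1]
print(encode_moods({"Dark", "Epic"}))  # [0, 0, 1, 1, 0]
```

This is what makes the problem multi-label: the target is not a single class but a yes/no answer for every mood.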

Save the dataset

You’ve now created a dataset ready to be used in the platform. Click Save version and navigate to the Modeling view.

Design experiment

Now it’s time to build a model. We’ll build an adaptation of FCN-6, a content-based automatic music tagging algorithm using fully convolutional neural networks, from the paper AUTOMATIC TAGGING USING DEEP CONVOLUTIONAL NEURAL NETWORKS.

In the Modeling view create a new experiment.

Select the 8%-training and 2%-validation subsets

Navigate to the Settings tab in the Inspector. Since this is a test run of our model, select the 8% training and 2% validation subsets in the Dataset section of your new experiment.

Navigate the Modeling canvas

Tip! Use the zooming tools if the model doesn't fit the Modeling canvas. You'll find more navigation tips in the topic Modeling canvas controls.

Add blocks to the model

This is the model you are going to build:

Click on the Input block in the Inspector. This will add an Input block to the Modeling canvas, and the Information center pop-up will appear with error messages. Don't worry, these error messages are very descriptive and easy to solve.

Navigate to the Blocks tab in the Inspector and set the Feature to spectrogram. First error message fixed.

Now we will stack five convolutional network layers, one after the other.

  1. Add a 2D Convolution block.


    Activation: Linear

    This block is used to detect spatial features in an image.

    In this particular case we want to set the activation function outside the 2D Convolution block, i.e., in a separate Activation block. Setting Activation function to Linear is equivalent to no activation function at all.
  2. Add a Batch normalization block.

    This normalizes all input features to a similar range of values which will speed up learning.
  3. Add Activation.

    An activation function determines the output of a node given its input.

    Set ReLU (rectified linear unit) activation.

    A ReLU activation function has output 0 if the input is less than 0, and the same output as input if the input is greater than 0.
  4. Add a 2D Max pooling block.

    Horizontal stride: 4

    Vertical stride: 2

    This layer reduces the size of the data. You can say that 2D max pooling is similar to scaling down the size of an image.
  5. Add a Dropout block.


    This block helps prevent overfitting by randomly setting a fraction of its input values to 0 during training.
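
Three of the blocks in the list above are simple enough to sketch in plain Python. The code is only an illustration of the ideas, not the platform's implementation. It assumes that the pooling window has the same size as the stride, and the dropout rate of 0.5 in the usage lines is a made-up value, since the tutorial does not state either one.

```python
import random


def relu(x):
    """Output 0 for negative input, otherwise pass the input through."""
    return x if x > 0 else 0.0


def max_pool_2d(image, vertical_stride, horizontal_stride):
    """Keep the largest value in each window (window size = stride, assumed)."""
    rows = len(image) // vertical_stride
    cols = len(image[0]) // horizontal_stride
    return [
        [
            max(
                image[r * vertical_stride + i][c * horizontal_stride + j]
                for i in range(vertical_stride)
                for j in range(horizontal_stride)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def dropout(values, rate, rng):
    """Zero each value with probability `rate`; rescale the survivors."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


# A 4 x 8 "image" pooled with vertical stride 2 and horizontal stride 4.
image = [[r * 8 + c for c in range(8)] for r in range(4)]
pooled = max_pool_2d(image, vertical_stride=2, horizontal_stride=4)
print(pooled)  # [[11, 15], [27, 31]]

print([relu(v) for v in (-2.0, 0.0, 3.0)])  # [0.0, 0.0, 3.0]
print(dropout([1.0] * 8, rate=0.5, rng=random.Random(0)))
```

The pooled output is 2 x 2, which shows how a horizontal stride of 4 and a vertical stride of 2 shrink the data in each direction.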

Add 2nd network layer

Copy and paste the 1st convolutional network layer, but set:

  • 256 filters in the 2D Convolution block.

Connect the second network to the first.

Add 3rd network layer

Copy and paste the 1st convolutional network layer, but set:

  • 512 filters in the 2D Convolution block.
  • In the 2D Max pooling block:

    Horizontal stride: 2

    Vertical stride: 1

Connect the third network to the second.

Add 4th network layer

Copy and paste the 1st convolutional network layer, but set:

  • 1024 filters in the 2D Convolution block.
  • In the 2D Max pooling block:

    Horizontal stride: 4

    Vertical stride: 4

Connect the fourth network to the third.

Add 5th network layer

Copy and paste the 1st convolutional network layer, but set:

  • 2048 filters in the 2D Convolution block.
  • In the 2D Max pooling block:

    Horizontal stride: 2

    Vertical stride: 2

Connect the fifth network to the fourth.
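
To see how much the five max pooling steps shrink the data, the strides listed above can be multiplied together. The sketch below assumes that each pooling step divides the size by its stride and rounds down, and that the second layer keeps the strides of the first layer it was copied from. The 1024 x 128 input size is a made-up example, not the size of the spectrograms in the dataset.

```python
# (horizontal stride, vertical stride) for network layers 1 to 5.
POOL_STRIDES = [(4, 2), (4, 2), (2, 1), (4, 4), (2, 2)]


def output_size(width, height, strides):
    """Apply each pooling step as an integer division of width and height."""
    for horizontal, vertical in strides:
        width //= horizontal
        height //= vertical
    return width, height


def total_downsampling(strides):
    """Multiply the strides to get the overall shrink factor per direction."""
    horizontal_total, vertical_total = 1, 1
    for horizontal, vertical in strides:
        horizontal_total *= horizontal
        vertical_total *= vertical
    return horizontal_total, vertical_total


print(total_downsampling(POOL_STRIDES))      # (256, 32)
print(output_size(1024, 128, POOL_STRIDES))  # (4, 4)
```

The time axis is reduced far more than the frequency axis, so the last convolutional layer sees a very compact summary of the whole segment.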

Last part of the model

Add the following blocks:

  1. Flatten. This block flattens the output of the last convolutional layer to a vector. You do this to give the following Dense block the right input size.
  2. Dense. A densely connected neural network layer. Leave Initializer at its default, Glorot uniform. The initializer defines how the initial random weights are set.

    Set Activation to ReLU, which returns the positive part of its argument. The activation defines the output of a node given its input.
  3. Dense.

    Nodes: 5. One node for each mood (Angry, Countryside, Dark, Epic, and Happy).

    Set Initializer to Glorot uniform.

    Set Activation to Sigmoid, since we need to squash each value into the range between 0 and 1.
  4. Target block.

    Set the Feature to Moods.

    Set the Loss function to Binary crossentropy. This loss treats each mood as its own yes/no decision, so the model outputs a separate score for each tag for each song. We can use these scores to decide which tags a song should have, for example "Happy", "Dark", or "Angry".
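
The last Dense block and the loss function can be sketched in a few lines of plain Python. This is a minimal illustration rather than the platform's code: the raw output values are invented, and the 0.5 threshold used to turn scores into tags is an assumption, not a setting from the tutorial.

```python
import math

MOODS = ["Angry", "Countryside", "Dark", "Epic", "Happy"]


def sigmoid(z):
    """Squash any real number into the range between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # this branch avoids overflow for very negative z
    return e / (1.0 + e)


def binary_crossentropy(targets, scores, eps=1e-7):
    """Average the per-mood yes/no losses for one song."""
    total = 0.0
    for t, s in zip(targets, scores):
        s = min(max(s, eps), 1.0 - eps)  # keep log() away from 0
        total += -(t * math.log(s) + (1 - t) * math.log(1.0 - s))
    return total / len(targets)


def tags_from_scores(scores, threshold=0.5):
    """Pick every mood whose score passes the (assumed) threshold."""
    return [mood for mood, s in zip(MOODS, scores) if s >= threshold]


raw_outputs = [-3.0, -1.5, -2.0, 0.2, 2.5]  # invented values for one song
scores = [sigmoid(z) for z in raw_outputs]
target = [0, 0, 0, 0, 1]                    # the song is tagged "Happy"

print(tags_from_scores(scores))             # ['Epic', 'Happy']
print(binary_crossentropy(target, scores))
```

Because every mood gets its own sigmoid score, a song can receive several tags or none at all, which is the difference from a softmax output that picks exactly one class.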

Run experiment

The model is now built and ready to be trained. In the Inspector, click the Settings tab and change the batch size to 16.

Click Run to start training the model.

Navigate to the Evaluation view. As the training of the model advances, epoch by epoch, the training and validation performance metrics are visualized in the evaluation graphs. It's a large experiment so it will take some time.

Analyze experiment

When analyzing an experiment we are looking for, among other things, overfitting. An overfitted model has almost memorized the training data and then can't figure out how to tag a new song. In the evaluation graphs, overfitting shows up as the training and validation loss curves growing further apart as training goes on, which is what we do not want to see.
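
The same check can be written down as a small rule of thumb. The sketch below flags the first epoch from which the validation loss rises while the training loss keeps falling for a few epochs in a row. The function name, the `patience` value, and the loss numbers are all invented for illustration and are not results from this experiment.

```python
def overfitting_start(train_loss, val_loss, patience=2):
    """Return the first epoch of a run of `patience` diverging epochs."""
    streak = 0
    for epoch in range(1, len(val_loss)):
        diverging = (val_loss[epoch] > val_loss[epoch - 1]
                     and train_loss[epoch] < train_loss[epoch - 1])
        streak = streak + 1 if diverging else 0
        if streak >= patience:
            return epoch - patience + 1
    return None  # the curves never drifted apart for long enough


# Invented per-epoch losses, only to show the shape of the problem.
train_loss = [0.70, 0.55, 0.45, 0.38, 0.32, 0.27]
val_loss = [0.72, 0.60, 0.55, 0.57, 0.61, 0.66]

print(overfitting_start(train_loss, val_loss))  # 3
```

In this made-up run the model stops improving on new songs at epoch 3 (counting from 0), even though the training loss continues to drop.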

Improve experiment

Now it's time to find out if you can improve the model. Try to duplicate the experiment and then add blocks and change settings in the model. As long as you keep the same loss function (in this case binary crossentropy) you can compare the models' result and see which one is the best in the Evaluation view.

If there is a large discrepancy between training and validation losses, try to introduce Dropout and/or Batch normalization blocks to improve generalization. If the training loss is very high, the model is not learning well enough.

Last iteration

In this tutorial we’ve kept things small just to test out our ideas: we’ve used only 5 out of 46 moods and the 8% training and 2% validation subsets. When you’re satisfied with your model, you should train it on the complete dataset with all moods to see what happens. This will take much longer, though, since we’re using all the data.

To do this, navigate to the Datasets view and create a new version of the dataset. In this version, create a new combined feature with all moods. Make sure you set Normalize on subset to the subset Training (80%).

Then duplicate your favorite model in the Modeling view and select the 80% training and 20% validation subsets. Set the target feature to the one that includes all moods and change the number of nodes in the last Dense block to 46. Then run the experiment and watch it train in the Evaluation view.

Tutorial recap

It took time, but you've just built something fundamentally cool. Think of other dense data types you can represent as a mel spectrogram, or other datasets where each example can belong to several classes at once. Achievement unlocked!
