Detailed experiment description

This is a detailed review of the experiment.
For a summarized review, see the experiment overview.

 

The code for the entire experiment can be found in this GitHub project.

Dataset

Data Preparation

Step 1: Each chorus is converted to WAV audio using the Python library Pydub. We export it as a single-channel WAV file with a 44,100 Hz sample rate, which we then read back with SciPy's wavfile module:


# Load the MP3 chorus and down-mix it to a single channel
sound = AudioSegment.from_mp3(os.sep.join([AUDIO_FOLDER, audio_file])).set_channels(1)

# Export to WAV and read it back as a NumPy array of samples
audio_file_wave = sound.export(format="wav", bitrate=RATE)

sample_rate, samples = wavfile.read(audio_file_wave)

Step 2: Each WAV audio is converted to a Mel-scale spectrogram with 128 frequency buckets. The mel scale is a scale of pitches judged by most listeners to be equally spaced from one another, so it reflects how humans perceive pitch better than a plain linear Hz axis. I chose 128 frequency buckets because, on the one hand, it is enough to place different half-tones in different buckets (within the human hearing range), and on the other hand, it is small enough to keep the network compact. Since the input size is the number of buckets times the chunk size, fewer buckets means less computation when the machine is trained.

For this conversion we use PyTorch's torchaudio library, specifically its MelSpectrogram transform:


# n_fft = 2 * 128 - 2 = 254, so the underlying STFT has 254 // 2 + 1 = 128 frequency bins;
# the hop length defaults to win_length // 2 = 112 samples for the 224-sample window,
# and n_mels is left at its default of 128, matching our 128 mel buckets
spectrogrammer = transforms.MelSpectrogram(sample_rate=RATE, n_fft=(MEL_SPECTROGRAM_BUCKETS * 2 - 2), win_length=MEL_SPECTROGRAM_WINDOW_LENGTH, power=2, normalized=True)

 

We then apply the transform to each audio file:


# Normalize the 16-bit PCM samples to the range [-1, 1] before computing the spectrogram
spectrogram = spectrogrammer(torch.from_numpy(samples/(2**15)).float().reshape((1, -1)))


Step 3: Each spectrogram is divided into chunks of 196 windows, so that each chunk matches 0.5 seconds of the original audio. I chose this size because it lines up with the 0.5-second resolution of the tagging, and because 0.5 seconds is long enough to perceive a tone while being short enough to avoid too many emotion changes within a single chunk. The arithmetic: the spectrogram windows are 224 samples long with a hop of 112 samples (half a window). At 44,100 samples per second, 196 non-overlapping windows would cover one second of audio (44,100 / 224 ≈ 196); but since consecutive windows overlap by half, 196 windows advance only half that far, so a 196-window chunk spans roughly 0.5 seconds.

In short, each 128 X 196 spectrogram chunk represents 0.5 seconds of the original audio.
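
To make the chunking concrete, here is a minimal sketch (not the project's exact code) that splits the mel spectrogram computed above into consecutive 128 X 196 chunks; the name CHUNK_SIZE is illustrative:

CHUNK_SIZE = 196  # frames per chunk, about 0.5 seconds of audio

def split_into_chunks(spectrogram, chunk_size=CHUNK_SIZE):
    # spectrogram has shape (1, 128, N); drop the channel dimension and cut into full chunks
    mel = spectrogram.squeeze(0)
    n_chunks = mel.shape[1] // chunk_size  # the trailing partial chunk is discarded
    return [mel[:, i * chunk_size:(i + 1) * chunk_size] for i in range(n_chunks)]

chunks = split_into_chunks(spectrogram)  # each element is a 128 x 196 tensor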

Step 4: Each spectrogram chunk is matched with its appropriate Thayer-model tagging. Since each chunk now represents exactly 0.5 seconds of audio, and our dataset is tagged per 0.5 seconds, we can match the dataset's tagging to our chunks directly.
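
A minimal sketch of this matching, assuming the annotations are available as a list of (valence, arousal) pairs, one per 0.5 seconds (the names tags and labeled_chunks are illustrative, not from the project):

# tags[i] is the (valence, arousal) annotation of the i-th 0.5-second slice of the chorus
labeled_chunks = [(chunk, tags[i]) for i, chunk in enumerate(chunks) if i < len(tags)]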

Neural Network Architecture

The architecture of the neural network can be divided into 3 sections - the CNN, the LSTM, and the ANN (fully connected layers). The input of the machine is a spectrogram chunk of size 128 X 196 (196 frames of 128 frequency buckets).

The data first goes through two CNN tiers, each tier composed of two convolutional layers and one pooling layer, with a ReLU activation after each convolutional layer.

The data then goes through three more CNN tiers, each tier composed of one convolutional layer, a ReLU activation, one pooling layer and a Dropout layer with a 0.25 drop ratio.

The data continues into an LSTM layer, and then through 5 Linear layers that gradually reduce the data size to 2. After the first and second Linear layers there is a Dropout layer with a 0.5 drop ratio.

The final activation function is the Identity, since the output is a regression over the two Thayer-model axes (valence and arousal).


The PyTorch code for the machine:

 

import torch
import torch.nn as nn


class AudioLSTMCNN2(nn.Module):
  def __init__(self, out_size: int = 2, cnn_channels: int = 64):
     """
     For spectrograms with 128 buckets and a chunk size of 196 frames, the input shape is (128, 196).
     """
     # call the parent constructor
     super(AudioLSTMCNN2, self).__init__()

     self.conv11 = nn.Conv2d(in_channels=1, out_channels=cnn_channels, kernel_size=(3, 3), stride=(1, 1), padding=1)
     self.relu11 = nn.ReLU()
     self.conv12 = nn.Conv2d(in_channels=cnn_channels, out_channels=cnn_channels, kernel_size=(3, 3), stride=(1, 1),
                     padding=1)
     self.relu12 = nn.ReLU()
     self.maxpool1 = nn.MaxPool2d(kernel_size=(2, 2), stride=(2, 2))

     self.conv21 = nn.Conv2d(in_channels=cnn_channels, out_channels=cnn_channels, kernel_size=(3, 3), stride=(1, 1),
                     padding=1)
     self.relu21 = nn.ReLU()
     self.conv22 = nn.Conv2d(in_channels=cnn_channels, out_channels=cnn_channels, kernel_size=(3, 3), stride=(1, 1),
                     padding=1)
     self.relu22 = nn.ReLU()
     self.maxpool2 = nn.MaxPool2d(kernel_size=(2, 2), stride=(2, 2))

     self.conv3 = nn.Conv2d(in_channels=cnn_channels, out_channels=cnn_channels * 2, kernel_size=(3, 3),
                    stride=(1, 1), padding=1)
     self.relu3 = nn.ReLU()
     self.maxpool3 = nn.MaxPool2d(kernel_size=(3, 3), stride=(3, 3))
     self.dropout3 = nn.Dropout(p=0.25)

     self.conv4 = nn.Conv2d(in_channels=cnn_channels * 2, out_channels=cnn_channels * 4, kernel_size=(3, 3),
                    stride=(1, 1), padding=1)
     self.relu4 = nn.ReLU()
     self.maxpool4 = nn.MaxPool2d(kernel_size=(3, 3), stride=(3, 3))
     self.dropout4 = nn.Dropout(p=0.25)

     self.conv5 = nn.Conv2d(in_channels=cnn_channels * 4, out_channels=cnn_channels * 4, kernel_size=(3, 3),
                    stride=(1, 1), padding=1)
     self.relu5 = nn.ReLU()
     self.maxpool5 = nn.MaxPool2d(kernel_size=(3, 3), stride=(3, 3))
     self.dropout5 = nn.Dropout(p=0.25)

     self.lstm6 = nn.LSTM(cnn_channels * 4, cnn_channels * 4) # , batch_first=True)
     self.hidden = (torch.zeros(1, 1, cnn_channels * 4),
               torch.zeros(1, 1, cnn_channels * 4))
     self.fc6 = nn.Linear(in_features=cnn_channels * 4, out_features=cnn_channels * 2)
     self.dropout6 = nn.Dropout(p=0.5)

     self.fc7 = nn.Linear(in_features=cnn_channels * 2, out_features=cnn_channels)
     self.dropout7 = nn.Dropout(p=0.5)

     self.fc8 = nn.Linear(in_features=cnn_channels, out_features=cnn_channels//2)
     self.fc9 = nn.Linear(in_features=cnn_channels//2, out_features=cnn_channels//4)
     self.fc10 = nn.Linear(in_features=cnn_channels//4, out_features=out_size)
     self.final = nn.Identity()

  def forward(self, x):
     x = x.reshape((1, 1, x.shape[0], -1))

     x = self.conv11(x)
     x = self.relu11(x)
     x = self.conv12(x)
     x = self.relu12(x)
     x = self.maxpool1(x)

     x = self.conv21(x)
     x = self.relu21(x)
     x = self.conv22(x)
     x = self.relu22(x)
     x = self.maxpool2(x)

     x = self.conv3(x)
     x = self.relu3(x)
     x = self.maxpool3(x)
     x = self.dropout3(x)

     x = self.conv4(x)
     x = self.relu4(x)
     x = self.maxpool4(x)
     x = self.dropout4(x)

     x = self.conv5(x)
     x = self.relu5(x)
     x = self.maxpool5(x)
     x = self.dropout5(x)

     x = x.view(x.size(0), x.size(1), -1)
     x = x.permute(0, 2, 1)

     x, self.hidden = self.lstm6(x, self.hidden)

     x = x.view(x.size(0), -1)
     x = self.fc6(x)
     x = self.dropout6(x)

     x = self.fc7(x)
     x = self.dropout7(x)

     x = self.fc8(x)
     x = self.fc9(x)
     x = self.fc10(x)

     final_x = self.final(x.reshape((-1)))

     return final_x
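
A quick sanity check of the expected input and output shapes (a minimal sketch, not part of the original training code):

model = AudioLSTMCNN2()
model.eval()  # disable the Dropout layers
dummy_chunk = torch.randn(128, 196)  # one spectrogram chunk: 128 buckets X 196 frames
with torch.no_grad():
    prediction = model(dummy_chunk)
print(prediction.shape)  # torch.Size([2]) -> (valence, arousal)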

 



The description of the machine's layers, with their parameters and input sizes:

| Name | Description and parameters | Input size |
| --- | --- | --- |
| conv11 | Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) | 1 X 1 X 128 X 196 |
| relu11 | ReLU() | 1 X 64 X 128 X 196 |
| conv12 | Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) | 1 X 64 X 128 X 196 |
| relu12 | ReLU() | 1 X 64 X 128 X 196 |
| maxpool1 | MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=0, dilation=1, ceil_mode=False) | 1 X 64 X 128 X 196 |
| conv21 | Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) | 1 X 64 X 64 X 98 |
| relu21 | ReLU() | 1 X 64 X 64 X 98 |
| conv22 | Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) | 1 X 64 X 64 X 98 |
| relu22 | ReLU() | 1 X 64 X 64 X 98 |
| maxpool2 | MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=0, dilation=1, ceil_mode=False) | 1 X 64 X 64 X 98 |
| conv3 | Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) | 1 X 64 X 32 X 49 |
| relu3 | ReLU() | 1 X 128 X 32 X 49 |
| maxpool3 | MaxPool2d(kernel_size=(3, 3), stride=(3, 3), padding=0, dilation=1, ceil_mode=False) | 1 X 128 X 32 X 49 |
| dropout3 | Dropout(p=0.25, inplace=False) | 1 X 128 X 10 X 16 |
| conv4 | Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) | 1 X 128 X 10 X 16 |
| relu4 | ReLU() | 1 X 256 X 10 X 16 |
| maxpool4 | MaxPool2d(kernel_size=(3, 3), stride=(3, 3), padding=0, dilation=1, ceil_mode=False) | 1 X 256 X 10 X 16 |
| dropout4 | Dropout(p=0.25, inplace=False) | 1 X 256 X 3 X 5 |
| conv5 | Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) | 1 X 256 X 3 X 5 |
| relu5 | ReLU() | 1 X 256 X 3 X 5 |
| maxpool5 | MaxPool2d(kernel_size=(3, 3), stride=(3, 3), padding=0, dilation=1, ceil_mode=False) | 1 X 256 X 3 X 5 |
| dropout5 | Dropout(p=0.25, inplace=False) | 1 X 256 X 1 X 1 |
| lstm6 | LSTM(256, 256) | 1 X 1 X 256 |
| fc6 | Linear(in_features=256, out_features=128, bias=True) | 1 X 256 |
| dropout6 | Dropout(p=0.5, inplace=False) | 1 X 128 |
| fc7 | Linear(in_features=128, out_features=64, bias=True) | 1 X 128 |
| dropout7 | Dropout(p=0.5, inplace=False) | 1 X 64 |
| fc8 | Linear(in_features=64, out_features=32, bias=True) | 1 X 64 |
| fc9 | Linear(in_features=32, out_features=16, bias=True) | 1 X 32 |
| fc10 | Linear(in_features=16, out_features=2, bias=True) | 1 X 16 |
| final | Identity() | 1 X 2 |

We use MSE to calculate the loss, since we want to measure the distance between the ground truth and the prediction on the two-dimensional Thayer-model axes.
We use the Adam optimizer, starting with a learning rate of 0.0001, which is halved whenever the mean loss has not improved for 3 epochs.
We train with a batch size of 50 audio files for a fixed number of epochs (6 in the snippet below), stopping early once the mean loss drops below 0.01 or the learning rate falls below 0.000001.

 

The code:

 

model_c = AudioLSTMCNN2().cuda()
LEARNING_RATE = 0.0001
criterion = nn.MSELoss().cuda()
optimizer = torch.optim.Adam(model_c.parameters(), lr=LEARNING_RATE)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min', patience = 3, factor=0.5)
start_time = time.time()
EPOCS = 6
PRINT_MARK = 500
BATCH_SIZE = 50
STOP_LOSS = 0.01
MIN_LEARNING_RATE = 0.000001


model_c.train()
model_c.hidden = (model_c.hidden[0].cuda(), model_c.hidden[1].cuda())
for epoc in range(EPOCS):
  losses = list()
  quadrants = list()
  real_quadrants = list()
  train_key_sample = [key for keys in [random.sample(trainset_quadrant_to_keys[i+1], BATCH_SIZE//4) for i in range(4)] for key in keys]
  random.shuffle(train_key_sample)
  train_sample = [datum for sample_key in train_key_sample for datum in trainset[sample_key]]
  
  for batch_i, (X_train, (valence, arousal)) in enumerate(train_sample):
    model_c.hidden = tuple([each.data for each in model_c.hidden])

    optimizer.zero_grad()
    
    y_train = torch.Tensor((valence, arousal)).cuda()
    # Apply the model
    y_pred = model_c(X_train.cuda()) # we don't flatten X-train here
    loss = criterion(y_pred, y_train)

    # Update parameters
    loss.backward(retain_graph=True)
    optimizer.step()

    losses.append(loss.cpu().item())
    real_quadrants.append(get_quadrant(y_train[0].item(), y_train[1].item()))
    quadrants.append(get_quadrant(y_pred[0].item(), y_pred[1].item()))
    
    # Print interim results
    if (batch_i > 0 or epoc == 0) and batch_i%PRINT_MARK == 0:
      print(f'{epoc:2}-{batch_i:4} | loss: {np.mean(losses):.5f} | [{quadrants.count(1):4}({real_quadrants.count(1):4}), {quadrants.count(2):4}({real_quadrants.count(2):4}), {quadrants.count(3):4}({real_quadrants.count(3):4}), {quadrants.count(4):4}({real_quadrants.count(4):4})]   lr: {optimizer.param_groups[0]["lr"]}')
  
  scheduler.step(np.mean(losses))

  if np.mean(losses) < STOP_LOSS or optimizer.param_groups[0]["lr"] < MIN_LEARNING_RATE:
    break
    
print(f'\nDuration: {time.time() - start_time:.0f} seconds') # print the time elapsed      
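
The helper get_quadrant used above is not shown in this section; here is a minimal sketch of what it could look like, assuming the standard quadrant numbering and valence/arousal values centered at zero (an illustration, not necessarily the project's implementation):

def get_quadrant(valence: float, arousal: float) -> int:
    # Assumed numbering: 1 = +valence/+arousal, 2 = -valence/+arousal,
    # 3 = -valence/-arousal, 4 = +valence/-arousal
    if valence >= 0:
        return 1 if arousal >= 0 else 4
    return 2 if arousal >= 0 else 3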

 

Curated Testset

For creating the audio test files, we use the midiutil, PyDub and mido libraries. I created an AudioData class that makes it easy to create audio files based on our required parameters (pitch, length, volume and beat).


For example, a script to create 25 seconds of a continuous A note (MIDI pitch 69):


one_tone_A = AudioData()
one_tone_A.add_sound([69], 0, 50, 100)
one_tone_A.save_to_wav(os.sep.join(["..", "data", "test_audio", "one_tone_A.wav"]))