Detailed experiment results description

This is a detailed review of the experiment's results.
For a summarized review, see experiment results.

 

Neural Network

The neural network was trained on 80% of the dataset. The dataset was randomly split into train and test sets at the song level, keeping each song intact (including the order of its spectrograms).
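
A minimal sketch of such a song-level split is shown below; the data layout and names here are illustrative assumptions, not the project's actual code:

```python
import random

def split_songs(songs, train_fraction=0.8, seed=0):
    """Randomly split whole songs into a trainset and a testset.

    `songs` is assumed to be a list where each element holds one song's
    spectrograms in their original order, so a song never straddles the
    split and the spectrogram order inside it is untouched.
    """
    indices = list(range(len(songs)))
    random.Random(seed).shuffle(indices)
    cut = int(len(songs) * train_fraction)
    train = [songs[i] for i in indices[:cut]]
    test = [songs[i] for i in indices[cut:]]
    return train, test
```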

The loss function used for training was MSELoss, and the optimizer was Adam.

The initial learning rate was set to 0.0001, with a ReduceLROnPlateau strategy that reduced the learning rate by a factor of 0.5 after 3 epochs in which the average loss did not improve. The network was trained in batches, where each batch consisted of 50 songs randomly selected from the trainset. Songs were not sampled uniformly from the dataset, as that might cause overfitting on one of the quadrants of the Thayer-model 2-dimensional space (most songs fall in the first quadrant). Instead, each song was assigned a quadrant based on the average Thayer score of its spectrograms, and a similar number of songs was randomly chosen from each quadrant every epoch.
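
The corresponding PyTorch setup, and one way the quadrant-balanced batches could be drawn, might look roughly like the following; only the hyperparameters (MSELoss, Adam, learning rate, factor, patience, batch size) come from the description above, while the placeholder model, the `songs_by_quadrant` mapping, and the helper names are assumptions:

```python
import random
import torch

model = torch.nn.Linear(128, 2)  # placeholder; stands in for the actual network
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Halve the learning rate after 3 epochs without improvement in the mean loss.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3)

def balanced_batch(songs_by_quadrant, batch_size=50):
    """Draw a similar number of songs from each Thayer quadrant.

    `songs_by_quadrant` is assumed to map a quadrant index (1-4) to the list
    of training songs whose average Thayer score falls in that quadrant.
    """
    per_quadrant = batch_size // len(songs_by_quadrant)
    batch = []
    for songs in songs_by_quadrant.values():
        batch.extend(random.sample(songs, min(per_quadrant, len(songs))))
    random.shuffle(batch)
    return batch
```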

These songs were fed to the network with each song's spectrograms in their original order, in order to preserve the network's notion of memory within a song. The network was trained for 5 epochs in total (this low number was chosen to minimize overfitting on the trainset), which took 164 seconds on a GPU. The learning curve is steep at first and plateaus fairly quickly.
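
Assuming the network is recurrent and exposes a hidden state (its exact interface below is an illustrative guess, not the project's code), one epoch of this song-ordered training could look like this sketch:

```python
def train_one_epoch(model, batch, criterion, optimizer, device="cuda"):
    """Train on one quadrant-balanced batch of songs, song by song.

    Each song is assumed to be a list of (spectrogram, target) pairs in
    their original order, so the hidden state accumulates context within
    the song.
    """
    model.train()
    total_loss, steps = 0.0, 0
    for song in batch:
        hidden = None  # reset the "memory" at each song boundary
        for spectrogram, target in song:
            spectrogram, target = spectrogram.to(device), target.to(device)
            prediction, hidden = model(spectrogram, hidden)
            loss = criterion(prediction, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # Keep the memory for the next spectrogram, but cut the autograd graph.
            if isinstance(hidden, tuple):
                hidden = tuple(h.detach() for h in hidden)
            else:
                hidden = hidden.detach()
            total_loss += loss.item()
            steps += 1
    return total_loss / max(steps, 1)
```

The mean loss returned here is what would be passed to scheduler.step() at the end of each epoch so the plateau detection can take effect.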

This is the training progress log. The quadrants' classification column shows, for each quadrant of the Thayer-model 2-dimensional space, how many predictions have been placed in it so far in the epoch, with the number of ground-truth tags in that quadrant in parentheses. For example, [ 0( 0), 0( 0), 1( 0), 0( 1)] means that 1 spectrogram was predicted to be in the 3rd quadrant, while 1 spectrogram's true tag was in the 4th quadrant. This check was done to make sure no quadrant overfitting was happening.
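
One way such a tally could be produced, assuming predictions and tags are 2-D points on the Thayer plane centred at the origin (the quadrant numbering convention and function names here are assumptions):

```python
def quadrant(point):
    """Map a 2-D Thayer-plane point to its quadrant, numbered 1-4 counter-clockwise."""
    x, y = point
    if x >= 0 and y >= 0:
        return 1
    if x < 0 and y >= 0:
        return 2
    if x < 0 and y < 0:
        return 3
    return 4

def quadrant_tally(predictions, tags):
    """Format counts as predicted(tagged) per quadrant, like the log below."""
    pred_counts, tag_counts = [0] * 4, [0] * 4
    for p, t in zip(predictions, tags):
        pred_counts[quadrant(p) - 1] += 1
        tag_counts[quadrant(t) - 1] += 1
    cells = ", ".join(f"{p:4d}({t:4d})" for p, t in zip(pred_counts, tag_counts))
    return f"[{cells}]"
```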

 

| epoch-spectrogram | mean loss over epoch | quadrants' classification | learning rate |
|-------------------|----------------------|---------------------------|---------------|
| 0-0    | 0.16111 | [ 0( 0), 0( 0), 1( 0), 0( 1)]                 | 0.0001 |
| 0-500  | 0.03283 | [ 168( 217), 71( 83), 187( 57), 75( 144)]     | 0.0001 |
| 0-1000 | 0.02853 | [ 333( 348), 183( 276), 354( 81), 131( 296)]  | 0.0001 |
| 0-1500 | 0.02834 | [ 489( 523), 292( 407), 504( 180), 216( 391)] | 0.0001 |
| 0-2000 | 0.02439 | [ 576( 601), 385( 529), 809( 328), 231( 543)] | 0.0001 |
| 1-500  | 0.01711 | [ 144( 104), 107( 207), 213( 106), 37( 84)]   | 0.0001 |
| 1-1000 | 0.02364 | [ 324( 318), 131( 213), 378( 240), 168( 230)] | 0.0001 |
| 1-1500 | 0.02103 | [ 433( 405), 241( 371), 588( 324), 239( 401)] | 0.0001 |
| 1-2000 | 0.02151 | [ 550( 508), 245( 371), 879( 603), 327( 519)] | 0.0001 |
| 1-2500 | 0.02269 | [ 729( 667), 280( 500), 1077( 687), 415( 647)] | 0.0001 |
| 2-500  | 0.01569 | [ 145( 117), 180( 212), 62( 9), 114( 163)]    | 0.0001 |
| 2-1000 | 0.01263 | [ 236( 205), 239( 289), 327( 191), 199( 316)] | 0.0001 |
| 2-1500 | 0.01154 | [ 345( 323), 262( 342), 573( 413), 321( 423)] | 0.0001 |
| 2-2000 | 0.0167  | [ 439( 499), 332( 425), 884( 570), 346( 507)] | 0.0001 |
| 3-500  | 0.0219  | [ 147( 182), 167( 176), 176( 125), 11( 18)]   | 0.0001 |
| 3-1000 | 0.01776 | [ 326( 268), 181( 322), 362( 161), 132( 250)] | 0.0001 |
| 3-1500 | 0.02337 | [ 426( 364), 228( 415), 640( 270), 207( 452)] | 0.0001 |
| 3-2000 | 0.02241 | [ 595( 531), 265( 503), 874( 439), 267( 528)] | 0.0001 |
| 4-500  | 0.01744 | [ 184( 179), 53( 14), 105( 166), 159( 142)]   | 0.0001 |
| 4-1000 | 0.01646 | [ 269( 269), 71( 84), 376( 339), 285( 309)]   | 0.0001 |
| 4-1500 | 0.01982 | [ 532( 403), 191( 269), 443( 441), 335( 388)] | 0.0001 |
| 4-2000 | 0.01648 | [ 641( 427), 288( 464), 653( 606), 419( 504)] | 0.0001 |
| 5-500  | 0.01493 | [ 276( 198), 152( 124), 47( 88), 26( 91)]     | 0.0001 |
| 5-1000 | 0.01876 | [ 562( 281), 234( 291), 47( 115), 158( 314)]  | 0.0001 |
| 5-1500 | 0.02086 | [ 786( 384), 416( 484), 133( 245), 166( 388)] | 0.0001 |
| 5-2000 | 0.02101 | [ 957( 524), 426( 508), 442( 415), 176( 554)] | 0.0001 |


Curated Testset Results

These are the output graphs for all audio files in the curated testset: