Detailed experiment results description

This is a detailed review of the experiment's results.
For a summarized review, see experiment results.

 

Neural Network

The neural network was trained on 80% of the dataset. The dataset was randomly split into train and test sets at the song level, keeping each song intact (including the order of its spectrograms).
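
A minimal sketch of such a song-level split is shown below; the data layout and names here are illustrative assumptions, not the project's actual code:

```python
import random

def split_songs(songs, train_fraction=0.8, seed=0):
    """Randomly split whole songs into a trainset and a testset.

    `songs` is assumed to be a list where each element holds one song's
    spectrograms in their original order, so a song never straddles the
    split and the spectrogram order inside it is untouched.
    """
    indices = list(range(len(songs)))
    random.Random(seed).shuffle(indices)
    cut = int(len(songs) * train_fraction)
    train = [songs[i] for i in indices[:cut]]
    test = [songs[i] for i in indices[cut:]]
    return train, test
```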

The loss function used for training was MSELoss, and the optimizer was Adam.

The initial learning rate was set to 0.0001, with a ReduceLROnPlateau strategy that reduced the learning rate by a factor of 0.5 after 3 epochs in which the average loss did not improve. The network was trained in batches, where each batch consisted of 50 songs randomly selected from the trainset. Songs were not sampled uniformly from the dataset, as that might cause overfitting on one of the quadrants of the Thayer-model 2-dimensional space (most songs fall in the first quadrant). Instead, each song was assigned a quadrant based on the average Thayer score of its spectrograms, and a similar number of songs was randomly chosen from each quadrant every epoch.
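
The corresponding PyTorch setup, and one way the quadrant-balanced batches could be drawn, might look roughly like the following; only the hyperparameters (MSELoss, Adam, learning rate, factor, patience, batch size) come from the description above, while the placeholder model, the `songs_by_quadrant` mapping, and the helper names are assumptions:

```python
import random
import torch

model = torch.nn.Linear(128, 2)  # placeholder; stands in for the actual network
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Halve the learning rate after 3 epochs without improvement in the mean loss.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3)

def balanced_batch(songs_by_quadrant, batch_size=50):
    """Draw a similar number of songs from each Thayer quadrant.

    `songs_by_quadrant` is assumed to map a quadrant index (1-4) to the list
    of training songs whose average Thayer score falls in that quadrant.
    """
    per_quadrant = batch_size // len(songs_by_quadrant)
    batch = []
    for songs in songs_by_quadrant.values():
        batch.extend(random.sample(songs, min(per_quadrant, len(songs))))
    random.shuffle(batch)
    return batch
```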

These songs were fed to the network with each song's spectrograms in their original order, in order to preserve the network's notion of memory within a song. The network was trained for 5 epochs in total (this low number was chosen to minimize overfitting on the trainset), which took 164 seconds on a GPU. The learning curve is steep at first and plateaus fairly quickly.
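
Assuming the network is recurrent and exposes a hidden state (its exact interface below is an illustrative guess, not the project's code), one epoch of this song-ordered training could look like this sketch:

```python
def train_one_epoch(model, batch, criterion, optimizer, device="cuda"):
    """Train on one quadrant-balanced batch of songs, song by song.

    Each song is assumed to be a list of (spectrogram, target) pairs in
    their original order, so the hidden state accumulates context within
    the song.
    """
    model.train()
    total_loss, steps = 0.0, 0
    for song in batch:
        hidden = None  # reset the "memory" at each song boundary
        for spectrogram, target in song:
            spectrogram, target = spectrogram.to(device), target.to(device)
            prediction, hidden = model(spectrogram, hidden)
            loss = criterion(prediction, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # Keep the memory for the next spectrogram, but cut the autograd graph.
            if isinstance(hidden, tuple):
                hidden = tuple(h.detach() for h in hidden)
            else:
                hidden = hidden.detach()
            total_loss += loss.item()
            steps += 1
    return total_loss / max(steps, 1)
```

The mean loss returned here is what would be passed to scheduler.step() at the end of each epoch so the plateau detection can take effect.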

This is the training progress log. The quadrants' classification column shows, for each quadrant of the Thayer-model 2-dimensional space, how many predictions have been placed in it so far in the epoch, with the number of ground-truth tags in that quadrant in parentheses. For example, [ 0( 0), 0( 0), 1( 0), 0( 1)] means that 1 spectrogram was predicted to be in the 3rd quadrant, while 1 spectrogram's true tag was in the 4th quadrant. This check was done to make sure no quadrant overfitting was happening.
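
One way such a tally could be produced, assuming predictions and tags are 2-D points on the Thayer plane centred at the origin (the quadrant numbering convention and function names here are assumptions):

```python
def quadrant(point):
    """Map a 2-D Thayer-plane point to its quadrant, numbered 1-4 counter-clockwise."""
    x, y = point
    if x >= 0 and y >= 0:
        return 1
    if x < 0 and y >= 0:
        return 2
    if x < 0 and y < 0:
        return 3
    return 4

def quadrant_tally(predictions, tags):
    """Format counts as predicted(tagged) per quadrant, like the log below."""
    pred_counts, tag_counts = [0] * 4, [0] * 4
    for p, t in zip(predictions, tags):
        pred_counts[quadrant(p) - 1] += 1
        tag_counts[quadrant(t) - 1] += 1
    cells = ", ".join(f"{p:4d}({t:4d})" for p, t in zip(pred_counts, tag_counts))
    return f"[{cells}]"
```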

 

| epoch-spectrogram | mean loss over epoch | quadrants' classification | learning rate |
|-------------------|----------------------|---------------------------|---------------|
| 0-0    | 0.16111 | [ 0( 0), 0( 0), 1( 0), 0( 1)]                 | 0.0001 |
| 0-500  | 0.03283 | [ 168( 217), 71( 83), 187( 57), 75( 144)]     | 0.0001 |
| 0-1000 | 0.02853 | [ 333( 348), 183( 276), 354( 81), 131( 296)]  | 0.0001 |
| 0-1500 | 0.02834 | [ 489( 523), 292( 407), 504( 180), 216( 391)] | 0.0001 |
| 0-2000 | 0.02439 | [ 576( 601), 385( 529), 809( 328), 231( 543)] | 0.0001 |
| 1-500  | 0.01711 | [ 144( 104), 107( 207), 213( 106), 37( 84)]   | 0.0001 |
| 1-1000 | 0.02364 | [ 324( 318), 131( 213), 378( 240), 168( 230)] | 0.0001 |
| 1-1500 | 0.02103 | [ 433( 405), 241( 371), 588( 324), 239( 401)] | 0.0001 |
| 1-2000 | 0.02151 | [ 550( 508), 245( 371), 879( 603), 327( 519)] | 0.0001 |
| 1-2500 | 0.02269 | [ 729( 667), 280( 500), 1077( 687), 415( 647)] | 0.0001 |
| 2-500  | 0.01569 | [ 145( 117), 180( 212), 62( 9), 114( 163)]    | 0.0001 |
| 2-1000 | 0.01263 | [ 236( 205), 239( 289), 327( 191), 199( 316)] | 0.0001 |
| 2-1500 | 0.01154 | [ 345( 323), 262( 342), 573( 413), 321( 423)] | 0.0001 |
| 2-2000 | 0.0167  | [ 439( 499), 332( 425), 884( 570), 346( 507)] | 0.0001 |
| 3-500  | 0.0219  | [ 147( 182), 167( 176), 176( 125), 11( 18)]   | 0.0001 |
| 3-1000 | 0.01776 | [ 326( 268), 181( 322), 362( 161), 132( 250)] | 0.0001 |
| 3-1500 | 0.02337 | [ 426( 364), 228( 415), 640( 270), 207( 452)] | 0.0001 |
| 3-2000 | 0.02241 | [ 595( 531), 265( 503), 874( 439), 267( 528)] | 0.0001 |
| 4-500  | 0.01744 | [ 184( 179), 53( 14), 105( 166), 159( 142)]   | 0.0001 |
| 4-1000 | 0.01646 | [ 269( 269), 71( 84), 376( 339), 285( 309)]   | 0.0001 |
| 4-1500 | 0.01982 | [ 532( 403), 191( 269), 443( 441), 335( 388)] | 0.0001 |
| 4-2000 | 0.01648 | [ 641( 427), 288( 464), 653( 606), 419( 504)] | 0.0001 |
| 5-500  | 0.01493 | [ 276( 198), 152( 124), 47( 88), 26( 91)]     | 0.0001 |
| 5-1000 | 0.01876 | [ 562( 281), 234( 291), 47( 115), 158( 314)]  | 0.0001 |
| 5-1500 | 0.02086 | [ 786( 384), 416( 484), 133( 245), 166( 388)] | 0.0001 |
| 5-2000 | 0.02101 | [ 957( 524), 426( 508), 442( 415), 176( 554)] | 0.0001 |


Curated Testset Results

These are the output graphs for all audio files in the curated testset: