AIMPATHY
/ˈempəTHē/


Exploring the distance between conceived and perceived emotions using Artificial Intelligence

Amit Yungman (3349217) | Early Music Masters - Voice | Royal Conservatoire The Hague | Research advisor: Johannes Boer

Introduction

As a classical singer, I found that while our technique and theory are well founded and fairly uniform, the requirements for our emotional expression are based almost exclusively on the instructor’s opinion or personal experience. This creates discrepancies between teaching methods, and even between teachers within the same method. The resulting confusion can lead to an ineffective, sometimes entirely failed, education in emotional expression.

With my knowledge and experience in Data Science and Artificial Intelligence, I intend to slightly demystify the factors that contribute to a musician’s expression. More concretely, I want to use Data Science methods to pinpoint factors which correlate with the difference between a musician’s conceived expression (the emotion the musician intends to convey) and the audience’s perceived expression (the emotion the audience ultimately experiences).

“ When asked what consciousness is, we have no better answer than Louis Armstrong's when a reporter asked him what jazz is: "Lady, if you have to ask, you'll never know." But even consciousness is not as thoroughgoing a mystery as it used to be. Parts of the mystery have been pried off and turned into ordinary scientific problems. „ 3

We can find numerous treatises and essays on the subject of emotional expression in rhetoric and music, whose conclusions rely more on anecdotal evidence and rationalism than on broad statistical or empirical research. Since the last century we have been witnessing a rise in empirical research on the subject, but almost exclusively from a psychological or commercial point of view. From the Data Science community, we see many successful experiments and insightful research on the subject. Most relevantly, researchers are able to automatically predict and analyze human emotional responses to music with higher accuracy, a wider variety of input, and greater complexity.

I proposed an experiment in which I train an Artificial Neural Network to predict the emotion a human audience perceives from a short musical phrase. While this has been done successfully before, I propose a novel model which combines an LSTM (Long Short Term Memory) and a CNN (Convolutional Neural Network). I believe this previously untested approach more closely represents the way the human mind perceives music.

I then fed the trained model carefully curated input (such as single tones, quick dynamic changes, etc.), in order to examine which factors it treats as important for the perceived emotion, and the emotional gradient. With this I was able to examine the factors that affect the simulated audience’s emotional perception in response to carefully isolated changes in the musical input.

While I am intrigued by the capabilities of Artificial Intelligence to predict (and maybe modify) human emotional perception, I find the questions this topic raises to be even more interesting and relevant. I am most fascinated by the great contradiction between our inability to believe that emotions are no more than physical factors, our refusal to attribute our conscious experience to mere input and signals, and yet our determination to claim academically that one form of expression is superior, or that a certain way of expression is more effective than another.

Research question

As artists, our biggest power is evoking emotions, yet it is the least practiced element of our craft. While we are able to use our imagination, empathy and personal experience to improve our emotional expression, our only means of testing our progress is with a live audience - and such test settings are few and far between.

As an experienced Artificial Intelligence algorithms engineer and a classically trained singer, this under-tested approach to expression bothered me, and so I decided to dedicate my Master’s research to exploring our understanding of emotional expression in music, using Artificial Intelligence and empirical methods.

To keep my research focused, I've formulated this research question:

 

What are the main quantifiable or qualitative factors which influence the difference between a musician’s emotional expression and their audience’s experience?

In order to fully answer my research question, I believe the objective of my research must be twofold:

 

  1. To have a better understanding of the main factors which influence the difference between a musician’s emotional expression and their audience’s experience. 
  2. To create a training tool which will allow musicians to perfect their expression, by showing them which emotion a human listener would experience (without the need for a human audience).

An important thing to keep in mind when reading this research is the subjective nature of music and of emotions.

We like to say that music is a universal language, but the fact of the matter is that most people have very different tastes and sensibilities in music. Many people are born without being able to carry a tune or even a rhythm, and those who can need extensive training to use those abilities in a favorable fashion.1

“ Music is an enigma. […] In all cultures, certain rhythmic sounds give listeners intense pleasure and heartfelt emotions. What benefit could there be to diverting time and energy to the making of plinking noises, or to feeling sad when no one has died? „ 2

Technical concepts and definitions

This is a summarized and simplified explanation of the concepts and definitions relevant to the research.

Dataset

A dataset is simply a set of inputs.

A tagged dataset is a set of inputs, and their desired output (which we call the “tag”).

A trainset is the name for a dataset used to train an algorithm.

A testset is the name for a dataset used to test an algorithm’s performance.


For example, if our problem is

“what is the sum of two numbers?”,

our dataset can be:

(1, 1)

(1, 4)

(9, 3)

(5, 6)

….

(8, 1)

 

 

 

An appropriate tagged dataset would then be:

(1, 1)  →  2  

(1, 4)  →  5 

(9, 3)  →  12 

(5, 6)  →  11

….

(8, 1)  →  9 
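To make this concrete, here is a minimal sketch in Python (a hypothetical example, not code from this project) that builds such a tagged dataset and splits it into a trainset and a testset:

```python
# A minimal sketch: a tagged dataset for the "sum of two numbers" problem,
# split into a trainset and a testset.
import random

# Each entry is (input, tag): a pair of numbers and their desired output (the sum).
tagged_dataset = [((a, b), a + b) for a in range(10) for b in range(10)]

random.shuffle(tagged_dataset)
split = int(0.8 * len(tagged_dataset))
trainset = tagged_dataset[:split]   # used to train the algorithm
testset = tagged_dataset[split:]    # used to test the algorithm's performance

print(trainset[0])                  # e.g. ((9, 3), 12)
```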

Artificial Neural Network

An Artificial Neural Network (ANN) is a software architecture based on the structure of the brain. ANNs are simply networks of nodes, sometimes called artificial neurons. The architect decides on the architecture of the network - which nodes are connected to which, and in what way (e.g. a simple directional network, directional connections between layers of nodes, etc.).

Each node contains a mathematical function, and each connection holds a weight.


In order to train an ANN to solve a problem, a tagged dataset is fed into the machine, and the machine’s output is compared with the dataset’s tagging. The machine’s weights are then updated in a process called “backward propagation” based on the difference between output and tag.

With the appropriate architectures and trainsets, ANNs can be trained to solve many diverse problems, such as solving differential equations, automatic translation, speech-to-text transcription and many more.
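As an illustration, here is a minimal sketch of training a tiny ANN on the “sum of two numbers” dataset from above (assuming the PyTorch library; the layer sizes and learning rate are arbitrary choices, not a prescription):

```python
# A minimal sketch: train a tiny ANN on the "sum of two numbers" tagged dataset.
import torch
import torch.nn as nn

# Tagged dataset: inputs (pairs of numbers) and tags (their sums).
inputs = torch.tensor([[a, b] for a in range(10) for b in range(10)], dtype=torch.float32)
tags = inputs.sum(dim=1, keepdim=True)

# A small network of connected nodes: 2 inputs -> 8 hidden nodes -> 1 output.
model = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(2000):
    prediction = model(inputs)           # the machine's output
    loss = loss_fn(prediction, tags)     # difference between output and tag
    optimizer.zero_grad()
    loss.backward()                      # backward propagation
    optimizer.step()                     # update the weights

print(model(torch.tensor([[8.0, 1.0]])))  # should be close to 9
```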


Convolutional Neural Network

A Convolutional Neural Network (CNN) is the name for an ANN with a specific structure - where the layers of the neurons are considered in a multidimensional way - usually in 2D or 3D matrices.

For example, let’s say we are building a machine to detect if a photo is of a dog or not. If we put the photo into the machine as a string of bits, it would lose the additional layer of information that comes from its colors and 2-dimensionality. That means it’ll be harder for the machine to learn and solve the problem. So instead, we adjust the architecture (the mathematical function and connections between nodes) to better suit a multi-dimensional input.
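A minimal sketch of what such an architecture can look like in code (hypothetical layer sizes, assuming 64x64 colour photos and the PyTorch library):

```python
# A minimal sketch: a CNN for the "is this photo a dog?" example.
# The convolutional layers treat the input as a 2D image instead of a flat string of bits.
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # looks at small 3x3 patches of the image
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 64x64 -> 32x32
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 32x32 -> 16x16
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 1),
    nn.Sigmoid(),                                # probability that the photo is a dog
)

photo = torch.rand(1, 3, 64, 64)                 # a dummy photo (batch of 1)
print(cnn(photo))                                # e.g. tensor([[0.52]])
```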

Long Short Term Memory

A Long Short Term Memory (LSTM) is the name for a specific type of Recurrent Neural Network, which is a family of ANN architectures. While in regular (memoryless) ANNs each input is considered separately, with no relation to previous inputs, in an LSTM we save some memory of past inputs as part of the ANN architecture.

For example, problems like multiplication and classification of dogs in pictures have no need for memory. If anything, memory might “confuse” the machine. But what about problems like predicting how a sentence will end? If we feed the machine with one word at a time, it will be able to predict the next word much better if we let it remember the last 10 words it got as input.
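A minimal sketch of the sentence example (dummy data and arbitrary sizes, assuming PyTorch): the LSTM reads the words one at a time, and its hidden state carries the memory of the words seen so far.

```python
# A minimal sketch: an LSTM reading a sentence one word at a time.
import torch
import torch.nn as nn

vocab_size, embedding_dim, hidden_dim = 1000, 32, 64
embed = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
next_word = nn.Linear(hidden_dim, vocab_size)

sentence = torch.randint(0, vocab_size, (1, 10))  # 10 word ids (dummy data)
outputs, _ = lstm(embed(sentence))                # one output per word, each aware of the words before it
logits = next_word(outputs[:, -1])                # predict the 11th word from the memory of the first 10
print(logits.shape)                               # torch.Size([1, 1000])
```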

Spectrogram

A spectrogram is a 3-dimensional representation of audio, where the X-axis is time, the Y-axis is wave frequencies (in hertz), and the Z-axis is the strength of the signal (sometimes in decibels).

A spectrogram is usually obtained by applying a Fourier Transform to the audio, converting its waveform (audioform) into its component sine waves. In a spectrogram we can more easily see which frequencies the sound wave contains.
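A minimal sketch of turning an audio file into a spectrogram and displaying it (assuming the librosa and matplotlib libraries; “my_audio.wav” is a placeholder filename):

```python
# A minimal sketch: WAV file -> spectrogram via a Fourier transform.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

waveform, sample_rate = librosa.load("my_audio.wav")        # placeholder filename
spectrogram = np.abs(librosa.stft(waveform))                # frequency content over time
decibels = librosa.amplitude_to_db(spectrogram, ref=np.max) # signal strength in decibels

librosa.display.specshow(decibels, sr=sample_rate, x_axis="time", y_axis="hz")
plt.colorbar(format="%+2.0f dB")
plt.show()
```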

Thayer Emotion Model4

The Thayer model is a model proposed in 1989 to quantifiably characterize emotions on a 2-dimensional scale - the dimensions being Valence (stress) and Arousal (energy). For example, happiness would be high valence and high arousal, while sadness would be low valence and low arousal.
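A small sketch of how emotions can be placed on these two axes (the numeric values are illustrative only, with both axes normalised to the range 0-1):

```python
# A minimal sketch: example emotions as points on the Thayer-model plane.
thayer_examples = {
    "happiness": {"valence": 0.9, "arousal": 0.8},  # high valence, high arousal
    "sadness":   {"valence": 0.2, "arousal": 0.2},  # low valence, low arousal
    "anger":     {"valence": 0.2, "arousal": 0.9},  # low valence, high arousal
    "calm":      {"valence": 0.8, "arousal": 0.2},  # high valence, low arousal
}

for emotion, point in thayer_examples.items():
    print(f"{emotion}: valence={point['valence']}, arousal={point['arousal']}")
```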


Of the several models considered, this model was the best fit for our problem, since it has the broadest representation of emotions while still showing good results in previous research.

Experiment overview

This is a summarized and simplified review of the experiment done in this research.
For a deeper dive, see the detailed experiment description.

Overview

The purpose of this experiment is to get a better understanding of the leading factors in an audience’s emotional perception of a musical phrase. In this experiment we train a neural network to predict the emotion an audience will experience from a short musical phrase; this network serves as our simulated “audience”. We then get the network’s predictions on simple, curated musical phrases, to see which factors affect its decision making.

Dataset

The dataset we will be using to create the machine is called PMEmo5.

The dataset contains “emotion annotations of 794 songs as well as the simultaneous electrodermal activity (EDA) signals. A Music Emotion Experiment was well-designed for collecting the affective-annotated music corpus of high quality, which recruited 457 subjects.”6


For our experiment, we will use these parts of the dataset:

  1. The MP3 audio of the chorus of each song 
  2. Dynamic (Thayer-model) labels for each 0.5-second segment 

While this dataset is not ideally suited to our experiment, it is the best one I could find within the constraints. For more on this, see the caveats section.

Data Preparation

Step 1: Each chorus is converted to WAV audio.

Step 2: Each WAV file is converted to a Mel-scale spectrogram with 128 frequency buckets.

Step 3: Each spectrogram is divided into chunks of 196 windows, so that each chunk matches exactly 0.5 seconds of the original audio.

Step 4: Each spectrogram chunk is matched with its appropriate Thayer-model tag.


Our final dataset consists of spectrogram chunks of 128 frequency buckets by 196 frames, each with its Thayer-model tag.
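A minimal sketch of Steps 2-4 (assuming the librosa library; the function name, the tag format and the implicit hop length are illustrative rather than the project’s exact settings):

```python
# A minimal sketch: WAV -> mel spectrogram with 128 buckets -> 196-frame chunks,
# each matched with its 0.5-second Thayer-model tag.
import librosa
import numpy as np

def prepare_chunks(wav_path, tags_per_half_second, n_mels=128, frames_per_chunk=196):
    waveform, sample_rate = librosa.load(wav_path)
    mel = librosa.feature.melspectrogram(y=waveform, sr=sample_rate, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel)

    dataset = []
    n_chunks = min(mel_db.shape[1] // frames_per_chunk, len(tags_per_half_second))
    for i in range(n_chunks):
        chunk = mel_db[:, i * frames_per_chunk:(i + 1) * frames_per_chunk]  # 128 x 196
        valence, arousal = tags_per_half_second[i]                          # Thayer tag for this 0.5 s
        dataset.append((chunk, (valence, arousal)))
    return dataset
```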

Neural Network Architecture

For the input of the neural network, I wanted something as “similar” to the human musical experience as possible. For that reason, I did not want the input to be extracted features like tempo, pitch, etc.; I chose instead the natural physical depiction of music - the different frequencies the brain experiences with each musical phrase - which a spectrogram represents with minimal loss.

For the output, I have two values, which correspond to coordinates on the Thayer-model axes. To read the 2-dimensional spectrogram most “naturally”, the main architecture of the network is a CNN, as was tested most successfully before, in 2020 7.

I proposed a variation on that architecture that allows memory. Since the human musical experience relies greatly on short-term memory across harmonies and phrases, I found this part of the architecture necessary for a good simulation. To achieve this, I added an LSTM component to the network.
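For illustration, here is a minimal sketch of how a CNN feature extractor can be combined with an LSTM in this way (assuming PyTorch; the layer sizes are illustrative and not the exact architecture used in this research):

```python
# A minimal sketch: a CNN reads each 128x196 spectrogram chunk, an LSTM carries
# memory across consecutive chunks, and two outputs give valence and arousal.
import torch
import torch.nn as nn

class CnnLstmPredictor(nn.Module):
    def __init__(self, hidden_dim=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
            nn.Flatten(),                      # 32 * 8 * 12 = 3072 features per chunk
        )
        self.lstm = nn.LSTM(32 * 8 * 12, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 2)   # valence and arousal

    def forward(self, chunks):                 # chunks: (batch, time, 128, 196)
        batch, time, mels, frames = chunks.shape
        features = self.cnn(chunks.reshape(batch * time, 1, mels, frames))
        memory, _ = self.lstm(features.reshape(batch, time, -1))
        return self.head(memory)               # one (valence, arousal) per 0.5 s chunk

model = CnnLstmPredictor()
print(model(torch.rand(1, 4, 128, 196)).shape)  # torch.Size([1, 4, 2])
```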

Curated Testset

In order to answer the research question, I created a curated testset of audio clips. To keep the analysis simple, I focused on 3 main attributes of music:

  • Tempo (measured in Beats Per Minute)
  • Pitch (measured in Hz)
  • Volume (Measured in relative levels between 0 and 127)


The created files are these:

 

File ID | File name | Test name | Length (sec) | Pitch (Hz) | Tempo (BPM) | Volume (level)
1 | one_tone_A | Single tone A no beat | 25 | 440 (concert A) | - | 100
2 | one_tone_C | Single tone C no beat | 25 | 261.63 (Middle C) | - | 100
3 | chromatic_8_scale_up | Chromatic scale up | 200 | Starts at 82.41 (E2), increases by a half tone every 4 seconds up to 1479.98 (F#6/Gb6) | 15 | 100
4 | chromatic_8_scale_down | Chromatic scale down | 200 | Starts at 1479.98 (F#6/Gb6), decreases by a half tone every 4 seconds down to 82.41 (E2) | 15 | 100
5 | one_tone_A_120_bpm | Single tone A with beat | 25 | 440 (concert A) | 120 | 100
6 | one_tone_C_120_bpm | Single tone C with beat | 25 | 261.63 (Middle C) | 120 | 100
7 | Harmony_A_C_one_tone | Single minor harmony no beat | 25 | 440 (concert A) and 523.25 (C5) | - | 100
8 | Harmony_A_C_120_bpm | Single minor harmony with beat | 25 | 440 (concert A) and 523.25 (C5) | 120 | 100
9 | Harmony_A_Cs_one_tone | Single major harmony no beat | 25 | 440 (concert A) and 554.37 (C#5) | - | 100
10 | Harmony_A_Cs_120_bpm | Single major harmony with beat | 25 | 440 (concert A) and 554.37 (C#5) | 120 | 100
11 | volume_decrease_fade | Single tone volume fade decrease | 100 | 440 (concert A) | 30 | Starts at 100, decreases by 2 every 2 seconds
12 | volume_increase_fade | Single tone volume fade increase | 100 | 440 (concert A) | 30 | Starts at 0, increases by 2 every 2 seconds
13 | volume_decrease_sharp | Single tone volume sharp decrease | 20 | 440 (concert A) | 30 | Starts at 100, decreases by 20 every 4 seconds
14 | volume_increase_sharp | Single tone volume sharp increase | 20 | 440 (concert A) | 30 | Starts at 0, increases by 20 every 4 seconds
15 | A_decreasing_bpm | Single tone tempo decrease | 56 | 440 (concert A) | Decreases every 8 seconds: 240, 120, 80, 60, 48, 40, 30 | 100
16 | A_increasing_bpm | Single tone tempo increase | 56 | 440 (concert A) | Increases every 8 seconds: 30, 40, 48, 60, 80, 120, 240 | 100
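For illustration, a clip such as one_tone_A_120_bpm could be generated along these lines (a hypothetical sketch, not the exact generation script; the numpy and soundfile libraries and the simple on/off pulsing used for the beat are assumptions):

```python
# A hypothetical sketch: generate a pulsed sine tone and write it to a WAV file.
import numpy as np
import soundfile as sf

def generate_tone(path, pitch_hz=440.0, bpm=120, volume=100, length_sec=25, sample_rate=44100):
    t = np.arange(int(length_sec * sample_rate)) / sample_rate
    tone = np.sin(2 * np.pi * pitch_hz * t)

    if bpm:                                        # pulse the tone on and off to create a beat
        beat_period = 60.0 / bpm
        tone *= ((t % beat_period) < beat_period / 2).astype(float)

    tone *= volume / 127.0                         # volume as a relative level between 0 and 127
    sf.write(path, tone, sample_rate)

generate_tone("one_tone_A_120_bpm.wav")
```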


Experiment results

This is a summarized and simplified review of the experiment's results.
For a deeper dive, see the detailed experiment results description.

Neural Network

For the predictor (our neural network), we measure performance as the average (mean) distance between each predicted emotion (a point on the Thayer-model 2-dimensional space) and the true tagged emotion. A perfect predictor therefore scores 0.0, and lower is better.
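As a sketch, the metric can be computed like this (a hypothetical helper, not the project’s code):

```python
# A minimal sketch: mean Euclidean distance between predicted and tagged points
# on the Thayer plane. A perfect predictor scores 0.0.
import numpy as np

def mean_distance(predicted, tagged):
    predicted, tagged = np.asarray(predicted), np.asarray(tagged)
    return float(np.mean(np.linalg.norm(predicted - tagged, axis=1)))

# Example: two predictions vs. their tags, as (valence, arousal) pairs.
print(mean_distance([(0.50, 0.60), (0.30, 0.40)],
                    [(0.52, 0.58), (0.35, 0.38)]))   # ~0.04
```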

We trained the machine over 80% of the dataset, and tested it on the rest (20%).
For a random predictor (a machine that randomly selects an emotion), we get a mean distance of 0.13162.
Our network achieved a mean distance of 0.02101 over the trainset, and 0.04508 over the testset. In other words, the predictor matched the tagged emotion over 6 times better than random on the trainset, and around 3 times better on the testset.
This clearly shows that the neural network was properly configured and was able to learn the problem.

Noteworthy Results over the Curated Testset

For each audio file in the curated testset, we ran the predictor twice - once as is, and once with the memory disabled. Since the neural network is built with an LSTM layer, it was interesting to see how much of its prediction is based on its memory, and how much on other, absolute factors.
Running the curated testset through the predictor, we noticed several interesting factors that affected the predictor’s decisions:

Single continuous tone with no beat

As expected, when no memory is used, the predictor treats a single continuous tone the same for every frame (the dip at the beginning is due to the tone picking up from silence, as the spectrogram frames are built with sliding windows).

But when the memory is used, we can see the predictor experience excitement (movement towards higher values) in both valence and arousal, which plateaus and starts dipping after a peak around the 8-second mark. The machine predicts more excitement with the first appearance and sustaining of a note, and then gradually less excitement over time (it “gets bored”). This is interesting, because it may suggest that the timespan for excitement from a sustained note or harmony is around 8 seconds.

Continuous tone vs. continuous harmony

It is interesting to notice that, while the spike in the machine’s prediction is similar for a harmonized note and a single note (regardless of the mode of the harmony), the dip is not. With the harmony, the predictor maintains a steadier plateau, and barely dips below it for the rest of the audio.

Chromatic descent and ascent

Another interesting prediction can be seen with the chromatic scales, when using memory. We see a clear downward trend. We are in the second quadrant of the Thayer model (anger and annoyance), since arousal is above the middle (0.5) and valence is below it. This trend can be read as an intensification of the irritation as time passes. It happens in a similar manner in both the descending and the ascending scale, suggesting the predictor cares less about the absolute pitch of the note or the direction of the melodic interval, and puts a stronger emphasis on the interval’s size.
Another interesting observation is that the ascending scale is considered both less energetic (lower arousal values) and less positive (lower valence values).

Tempo change

From this graph we can see that a change in tempo, when the memory is on, affects the prediction only minimally. It seems the machine almost completely ignores the tempo when the memory is used.

On the other hand, when the memory is not used, we can see large fluctuations. The oscillations in the graph are probably due to the way the tempo is produced (by adding silent gaps between notes). When the memoryless machine hears a silent part, it reacts the same as when hearing the A tone.
But when it hears the A tone dipping in and out (right at the cusp of each “beat”), it reacts differently:
when the A tone fades out, the valence score drops; when the tone fades back in, the valence score shoots up. For arousal, either a fade-in or a fade-out creates a spike in the score.
We can hypothesize that both the appearance and the disappearance of a tone increase the arousal, while the appearance is perceived as more positive and the disappearance as more negative.

Volume change

Similarly to the change in tempo, this graph shows that a change in volume, when the memory is on, does not affect the prediction.

Caveats

The experiment shows some clear and interesting results, but a few caveats should be noted:

Dataset

When reviewing the results of the experiment it is important to take the dataset the predictor was trained on into consideration. Just like the human brain, the neural network learns by example and correction. There were a few features of our chosen dataset which were suboptimal:


The size - for proper training of a neural network, very large and diverse datasets are required; usually, the more complex the problem, the bigger the required dataset. Since our dataset was composed of only 794 choruses of popular songs, the diversity was limited, and since the audio snippets were not very long, we could not build a truly robust dataset.


The noise - this dataset consists of popular songs, rich with harmonies and beats, and generally very noisy. This means the network had to digest a great multitude of parameters. While this is a good representation of popular music, it also means the machine had a hard time learning the subtleties and intricacies of the music. Had the dataset been much larger, this would not have been a problem; but given its relatively small size, it is very likely that the network had to neglect the finer details and focus on the bigger ones.


The relevance - since what we wanted to examine were very simple, deconstructed audio phrases, the best source for a trainset would have been simpler, less polyphonic pieces. With our dataset, the machine learned to predict for audio that is very different from what we were testing.

Unfortunately, attempts to procure funding for creating a better-suited dataset were unsuccessful.

Neural Network

When training a deep neural network on a small yet complex dataset, we face a very real risk of under- or overfitting. Underfitting means the machine does not really take the input into consideration, but rather produces much the same output regardless of the input. Overfitting means the machine learns its trainset exactly, and performs very poorly on any other input. In our case we tried to avoid underfitting by starting with a small enough learning rate, random batching, and plenty of dropout layers in the network’s architecture. We tried to avoid overfitting by running for a very limited number of epochs, and by verifying at each epoch that all 4 quadrants were receiving a reasonable share of predictions.

It is important to mention that, since our purpose was not creating a predictor but rather inspecting the network's deciding factors, overfitting was not an issue we needed to fear.
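For illustration, the measures described above look roughly like this in code (a sketch with arbitrary values, assuming PyTorch; not the exact training configuration):

```python
# A minimal sketch: dropout layers, a small learning rate and a limited number of epochs.
import torch.nn as nn
import torch.optim as optim

# Dropout randomly silences nodes during training, which discourages the network
# from memorising its trainset.
model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Dropout(p=0.3),
    nn.Linear(64, 2),                                 # two outputs: valence and arousal
)

optimizer = optim.Adam(model.parameters(), lr=1e-4)   # small learning rate
num_epochs = 20                                       # deliberately limited number of epochs
```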

Adaptations

The scope of this research was not enough to fully explore all the possibilities of this approach. I hope the preparations I’ve made will allow future researchers to take this research into various interesting directions.

I have made all of the code and data that I’ve used available openly in this GitHub Project.

I will describe here a few possible adaptations:

Research Tool

As described in the caveats section, the results of this experiment rely heavily (and not surprisingly) on the dataset. A dataset of violin melodies will provide wonderful insight into the factors of emotional expression in violin playing, but will not be very beneficial for EDM. Similarly, a dataset of harmonic melodies will produce results that are less relevant to monodic melodies. I leave this code open and well documented, to make it simple for succeeding research to be done on various datasets and for various purposes.

Generative Adversarial Network

A Generative Adversarial Network (GAN) is an interesting, relatively new architecture for neural networks. In a GAN, we train two networks in an adversarial manner - one a predictor, the other a generator. We start by training the predictor (in our case, a network that predicts an emotion from a musical spectrogram). We then train a second network to generate spectrograms from an emotional prompt (for example, we give it a point on the Thayer-model 2-dimensional space, and it outputs a spectrogram). We train the generator until it is able to successfully fool the predictor - that is, generate spectrograms that the predictor maps back to the prompted emotion. We then continue training the two networks against each other. At the end of this process we have two networks: one that can predict emotions from a musical phrase (as we did in this experiment), and another that can generate musical phrases from emotional prompts.
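A heavily simplified sketch of the generator-training half of that loop (assuming PyTorch; illustrative shapes, a frozen stand-in predictor, and flattened “spectrograms” - none of this is the project’s code):

```python
# A minimal sketch: train a generator to turn a Thayer prompt (valence, arousal)
# into a "spectrogram" that a frozen predictor maps back to the same prompt.
import torch
import torch.nn as nn

spectrogram_size = 128 * 196   # a flattened 128 x 196 spectrogram chunk

# Stand-in for the already-trained predictor; kept frozen here.
predictor = nn.Sequential(nn.Linear(spectrogram_size, 64), nn.ReLU(), nn.Linear(64, 2))
for parameter in predictor.parameters():
    parameter.requires_grad_(False)

# The generator maps an emotional prompt (valence, arousal) to a spectrogram.
generator = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, spectrogram_size))
gen_optimizer = torch.optim.Adam(generator.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(100):
    prompt = torch.rand(8, 2)                        # random (valence, arousal) prompts
    fake_spectrogram = generator(prompt)             # the generator tries to realise the prompt
    predicted_emotion = predictor(fake_spectrogram)  # the predictor judges the result
    loss = loss_fn(predicted_emotion, prompt)        # train the generator to "fool" the predictor
    gen_optimizer.zero_grad()
    loss.backward()
    gen_optimizer.step()
```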

This novel approach would allow us to explore the factors the machines take into consideration in a much less “supervised” way. Removing the stage of a curated testset would reduce human intervention in the examination stage and give us more direct answers to our queries. We could simply ask the generator for a happy musical phrase, and it would generate one. We could then look at the generated pitches, harmonies, volume, tempi, etc.

Practice Tool

Seeing the potential of our predictor, I have also developed an interface that allows musicians to practice their emotional expression.

In this practice tool, a musician gets live, immediate feedback on the emotion an audience would have experienced from the phrase they just played, without needing to perform in front of an audience. Since the times a young musician spends in front of an audience are few and far between, this allows the musician to get ample feedback on their emotional expression from the comfort of their own practice space. The following video is a live demonstration of the practice tool, which is also available through the GitHub Project I have publicly shared:

In this video we can see the 4 sections of the tool:

Audioform - The sound wave of the live audio

Thayer model - The live position of the emotional prediction on the Thayer-model 2-dimensional space

Spectrogram - The spectrogram of the live audio, shown with 5 seconds of history

Perceived emotion graph - The graph of the emotion predicted by the predictor since the beginning of the session

In this demonstration, we see the prediction of a very basic predictor. To get a properly functioning practice tool, the predictor would have to be trained on a large dataset of relevant recordings, depending on the desired usage. For example, for singers, a trainset of monodic a cappella singing would be best; for violinists, a trainset of monodic violin lines. To understand this better, see the caveats section.

Conclusions

In this research I was looking for novel empirical ways to better understand the contributing factors to our emotional experience of music. Within the scope of this research I successfully gave preliminary answers to the research question, and supplied the groundwork for follow-up research, including practical and thematic infrastructures.

The main factors contributing to our emotional experience of music, as were found in the analysis of the experiment, were:

  • Melodic relativity (short-term memory) - whether in a single pitch, a single harmony, or a progression, we could clearly see a preference for change over sustaining. We could also see that the attention span for a sustained pitch or harmony was around 8 seconds, after which the suspense fades. 
  • Pitch variation - while tempo and volume changes had minor to no effect on the emotional prediction, the chromatic scales showed a clear change in the prediction, meaning that melodic pitch variation is a leading factor. 
  • Tempo - when analyzing predictions without short-term memory, we see that tempo is an important factor. 


I also demonstrated possible adaptations or implications to the research, such as powerful research tools, and an emotional expression practice tool for musicians.


To me, the most interesting conclusion was how absolutely we depend on our “trainset” in our experience of emotions in music. While genetics and natural inclinations may dictate the architecture of our brain (our neural network); our human brain, just like the neural network, does its learning from experience and correction. We hear countless musical phrases throughout our lives, and are “trained” (by cues from lyrics, society, corresponding art etc.) to interpret them in a certain way. With enough musical experience, we create deep subconscious connections between complicated factors and patterns, and emotional experiences.


We are each unique in our experience, because our brains were “trained” differently!
That’s what gives us the unique, indescribable, unequivocal human experience.