A Proposed Embodied Mapping Strategy for IoT Network Monitoring


In order to demonstrate how some of the parameters and approaches discussed thus far in data-driven music might be applied in a sonic information design context, we will now consider a recently devised mapping strategy for live monitoring of traffic activity in a large-scale internet of things (IoT) network.


Whilst Worrall (2015) has demonstrated how sonification could be successfully deployed for representing metadata in an organization’s own internal network, a different set of factors is at play in the sonification of IoT data. An IoT network comprises physical objects, machines and devices that have been enabled for Internet connectivity. The mapping strategy presented here is intended for use with the Pervasive Nation, Ireland’s national-scale IoT test-bed. The network consists of a diverse set of devices spread across the country, monitoring everything from water levels for flood detection to agricultural applications. These devices relay data through a system of gateways (or base stations) spread around the country. A log of all messages shared across the network is maintained by the network server. Pervasive Nation is a Low-Power Wide-Area Network (LPWAN), which means that it operates at low throughput, processing very few data packets when compared to a modern cellular network. This is more than enough to support messages from IoT devices, whose transfer speeds usually fall below 27 kb per second (Adelantado, Vilajosana, Tuset-Peiro, Martinez, Melia-Segui and Watteyne 2017). As a result, IoT network data of this nature contains no continuous variables. However, given the number of devices online, the data can still become quite dense and complex.


IoT networks are generally concerned with machine-to-machine (M2M) communication and, as such, device payloads (sensor measurements) are encrypted and inaccessible. Network monitoring practices tend to focus on maintaining the overall “health” and integrity of the network. To this end, a number of behaviors need to be detected: devices that continually fail to connect to the network server, devices that exhibit irregular behavior (e.g. erratic switching across frequency positions, constantly reconnecting to the network server), and devices with low signal strength or a poor signal-to-noise ratio. Monitoring for these anomalies generally consists of visually scanning large tables which describe the activity of each node over some predefined time period. If a problem is identified, a visual representation of the data from individual devices can be accessed. Given the large amount of data involved, this process can be slow and inefficient. Furthermore, it is all carried out after the fact, with the result that problems in the network can continue undetected for some time. These issues could be addressed by designing an auditory display to represent the data with sound in real time.

The future is not going to be people talking to people; it's not going to be people accessing information. It's going to be about using machines to talk to other machines on behalf of people. (Tan and Wang 2010)

A recurrent metaphor employed across the IoT literature to describe M2M communication, reflected in the above quote from Tan and Wang, is that of “machines talking to each other.” Drawing from Imaz and Benyon’s (2007) recommendations to structure HCI design on the basis of conceptual metaphors and blends, this metaphor can be adopted as a frame of reference for our auditory display design. The auditory display can be conceptualized as a blend between the data and sound, framed in terms of a conversation between machines (see figure 2). Designing this interpretation of M2M communication into our auditory display might help to make it more intelligible to the listener and support them in understanding and reasoning about the data. Other relevant work in fields related to embodied cognition can be called upon throughout the design process to further inform and refine design choices.

Figure 2: M2M Communication as Machines Talking to One Another

Formant synthesis (see figure 3) has been applied effectively in a number of auditory display contexts (Hermann et al. 2006; Chafe et al. 2013). This synthesis approach creates speech-like sounds using a source-filter model in which the source simulates the action of human vocal folds, and the filter models the resonances of the vocal tract (Smith 2010). It was also the basis of the synthesis method used to create vocal sounds in “The Human Cost.” Vocal sounds can communicate rich information to a listener because the auditory system has evolved to interpret and extract information from the human voice (Armstrong, Stokoe and Wilcox 1995; Armstrong 2002; Gentilucci and Corballis 2006; Fogassi and Ferrari 2007). This capability extends beyond language to the highly communicative prosodic dimensions of human speech (Hirschberg 2002; Juslin and Laukka 2003; Grieser and Kuhl 1988; Grandjean et al. 2005; Elordieta and Prieto 2012; Alba-Ferrara, Hausmann, Mitchell and Weis 2011).
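As an illustration of the source-filter principle, the sketch below generates a short vowel-like tone in Python. The original display was built in Reaktor 5, so this is only an approximation of the approach, not the actual patch; the sample rate and the formant centre frequencies and bandwidths are illustrative assumptions.

```python
import numpy as np
from scipy.signal import lfilter

SR = 16000  # illustrative sample rate in Hz

def glottal_source(f0, dur, sr=SR):
    """Impulse train standing in for the action of the vocal folds."""
    n = int(dur * sr)
    src = np.zeros(n)
    src[:: int(sr / f0)] = 1.0
    return src

def formant_filter(x, freq, bw, sr=SR):
    """Second-order resonator modelling one vocal-tract formant."""
    r = np.exp(-np.pi * bw / sr)      # pole radius from bandwidth
    theta = 2 * np.pi * freq / sr     # pole angle from centre frequency
    b, a = [1.0 - r], [1.0, -2 * r * np.cos(theta), r * r]
    return lfilter(b, a, x)

def vowel(f0, formants, dur=0.2):
    """Source-filter model: excite a cascade of formant resonators."""
    y = glottal_source(f0, dur)
    for freq, bw in formants:
        y = formant_filter(y, freq, bw)
    return y / (np.max(np.abs(y)) + 1e-12)

# Rough formant centres and bandwidths for an "a" vowel (illustrative)
A_FORMANTS = [(700, 80), (1200, 90), (2600, 120)]
tone = vowel(220.0, A_FORMANTS)
```

Swapping in a different set of formant centres changes the perceived vowel while the source, and hence the perceived "voice," stays the same.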

Figure 3: Formant Synthesis


The mapping strategy presented in table 1 was implemented in the Reaktor 5 audio programming environment using formant synthesis techniques. The mapping strategy was informed by the metaphor of M2M communication discussed previously.

Table 1: Proposed Mapping Strategy for IoT Network Monitoring

There are a number of key message types relevant to IoT network monitoring in the context of the Pervasive Nation network. Join requests are messages sent by devices when they are ready to share data across the network. They are met with an accept or reject message from the network server. To follow our “conversation” metaphor, we can adopt a call-and-response structure to represent this data. The application of an “a” vowel formant profile lends the request messages a speech-like timbre. Each request message is 200ms in length and moves along a frequency contour from -50 cents below A4 to +50 cents above A4. This rise in pitch is a culturally dependent strategy (drawn from variants of the English language) intended to simulate the high rising terminal (HRT), wherein speakers modulate their intonation so that the fundamental frequency of their voice rises in pitch from the beginning of the final accented syllable, especially when posing a question. Because the sounds involved are much shorter in length and different in nature to a full sentence, our mapping exaggerates the HRT by beginning the rise at the very start of the sound for maximum effect.
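The cent offsets involved can be converted to frequency with the standard relation f = 440 · 2^(cents/1200). A minimal sketch of the rising 200ms contour (the sample rate is an illustrative assumption):

```python
import numpy as np

SR = 16000  # illustrative sample rate
A4 = 440.0

def cents_to_hz(cents, ref=A4):
    """Frequency at a signed cent offset from a reference pitch."""
    return ref * 2.0 ** (cents / 1200.0)

def hrt_sweep(dur=0.2, start_cents=-50.0, end_cents=50.0, sr=SR):
    """200ms tone whose pitch rises through A4, exaggerating the HRT."""
    n = int(dur * sr)
    freq = cents_to_hz(np.linspace(start_cents, end_cents, n))
    phase = 2 * np.pi * np.cumsum(freq) / sr  # integrate instantaneous frequency
    return np.sin(phase)

tone = hrt_sweep()
```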

AudioObject 4: Join Request Message


The response messages are designed to sound more machine-like in origin. They have a two-part structure: reject messages move from C#2 down to a low A1, while accept messages move from C#4 up to a high A4.

AudioObject 5: Reject Message

AudioObject 6: Accept Message


Each tone is roughly 200ms in length. The vowel formant profile for the reject messages is an “o,” intended to sound similar to “No,” while the accept messages have an “e” formant profile, intended to sound similar to “Yes.” These mappings are influenced by Candace Brower’s (2000) cognitive theory of meaning in music. She argues that on the basis of the “image schemas that lend coherence to our bodily experience,” listeners experience a range of melodic “forces.” One of these is a sense of tonal attraction, whereby a series of notes is experienced as achieving a “stable” state when it reaches its tonic, the harmonic center of attraction. Outside the tonic there are varying degrees of instability. In the above mapping, where the key is established in A major, the sound for the accept message resolves upward to the tonic, becoming stable, while the reject sound falls away from the established register, landing on a low A that is experienced as unstable.
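In equal temperament, the note pairs above translate into fixed frequency targets. A small sketch of this call-and-response pitch mapping (MIDI note numbers follow the standard convention, A4 = 69 = 440 Hz; the function names are illustrative):

```python
# Standard MIDI note numbers for the pitches used in the mapping
NOTE_TO_MIDI = {"A1": 33, "C#2": 37, "C#4": 61, "A4": 69}

def midi_to_hz(m):
    """Equal-tempered frequency, with A4 (MIDI 69) tuned to 440 Hz."""
    return 440.0 * 2.0 ** ((m - 69) / 12.0)

RESPONSE_MAPPING = {
    "accept": ("C#4", "A4"),  # resolves up to the tonic of A major
    "reject": ("C#2", "A1"),  # falls away to a low A
}

def response_tone_freqs(kind):
    """Start and end frequencies for an accept or reject tone."""
    start, end = RESPONSE_MAPPING[kind]
    return midi_to_hz(NOTE_TO_MIDI[start]), midi_to_hz(NOTE_TO_MIDI[end])
```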

Gateway status requests are sent to gateways to make sure they are online and functioning correctly. If the gateway is not in working order, it can be rebooted. In this mapping strategy, gateway reboots are signified by sweeping a vowel formant filter across the sound signal. The sound is 700ms in length and pitched to an F0. The data is redundantly mapped to modulate both the cutoff frequency and the vowel shape, which moves from an “i” when the filter is high to an “a” as it closes. The sweep moves down the frequency spectrum before coming back up, which is intended to simulate the process of a reboot where the system first closes down and then boots back up.
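The down-then-up sweep can be sketched as a cutoff trajectory with a linked vowel-morph position. The 3000 Hz and 300 Hz endpoints and the sample rate below are illustrative assumptions, not values from the original Reaktor implementation:

```python
import numpy as np

SR = 16000  # illustrative sample rate

def reboot_cutoff_envelope(dur=0.7, hi=3000.0, lo=300.0, sr=SR):
    """Cutoff trajectory: sweep down (shutting down) then back up (booting)."""
    n = int(dur * sr)
    half = n // 2
    down = np.geomspace(hi, lo, half)
    up = np.geomspace(lo, hi, n - half)
    return np.concatenate([down, up])

def vowel_morph_position(cutoff, hi=3000.0, lo=300.0):
    """Redundant mapping: 1.0 = 'i' (filter open) down to 0.0 = 'a' (closed)."""
    return (np.log(cutoff) - np.log(lo)) / (np.log(hi) - np.log(lo))

env = reboot_cutoff_envelope()
morph = vowel_morph_position(env)
```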

AudioObject 7: GW Reboot


This pattern of cyclical closing down/booting up is informed by two image schemata, described by Johnson (1987) as the cycle schema, the topological pattern underlying experiences of cycles, and the up-down (verticality) schema, the topological pattern underlying experience of movement along a vertical axis, with different positions on the vertical axis corresponding to stability or instability. In this case, the higher/filter open position (in which the network is functioning normally or has booted back up) is taken to be the stable, nominal one (related to a metaphor of the network “standing upright”).


The device data within this network is complex, consisting of the encrypted payload from the device along with twenty or more additional parameters, depending on the number of gateways involved in relaying the data. Fortunately for the present purposes, only a few messages are relevant to basic monitoring of the overall health of the network. The message integrity code (MIC) is used to authenticate the message; a MIC that fails to validate can indicate a security problem. The Received Signal Strength Indicator (RSSI) and Signal to Noise Ratio (SNR) are largely self-explanatory, and frequency refers to the frequency band on which the message is received. Each message also contains information about the Spreading Factor (SF), an important variable, alongside bandwidth, in determining the data rate: the rate at which devices transmit data. The SF determines the amount of time a device is allotted to send its message across the network; the further a device is from a gateway, the larger the SF, on a scale from 7 to 12. Radio spectrum is heavily regulated (Levin 2013), and in Europe the total amount of “time on air” a transmission may take is limited to a 1% device duty cycle per channel. For example, if a device sends data for 1 second every 100 seconds, it has a duty cycle of 1%. Radio frequency (RF) and bandwidth are also regulated: in Europe, LoRaWAN networks operate in the RF range of 868–870 MHz with a channel bandwidth of 125–250 kHz. Legally, all nodes and gateways must comply with duty cycle regulations, so these are important factors to monitor, and gateways and nodes that exhibit suspect patterns of activity need to be easily identifiable. Each message also carries information about the sequence of gateways that relayed it to the server. Pertinent information here includes MIC, SNR, RSSI, RF, and SF.
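The duty-cycle arithmetic above, together with the basic LoRa relation between spreading factor and symbol time (a symbol lasts 2^SF / bandwidth seconds, which is why larger SFs mean longer time on air), can be sketched as follows; the function names are illustrative:

```python
def duty_cycle(airtime_s, period_s):
    """Fraction of time a device spends transmitting on a channel."""
    return airtime_s / period_s

def is_compliant(airtime_s, period_s, limit=0.01):
    """EU regulations limit LPWAN channels to a 1% duty cycle."""
    return duty_cycle(airtime_s, period_s) <= limit

def lora_symbol_time_s(sf, bandwidth_hz):
    """LoRa symbol duration 2**SF / BW: each SF step doubles time on air."""
    return (2 ** sf) / bandwidth_hz

# The example from the text: 1 s of transmission every 100 s is exactly 1%
assert duty_cycle(1.0, 100.0) == 0.01
```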
In designing a mapping strategy for these messages, we return to our conceptual metaphor for M2M communication. Each message is broken into two “sections”: the first contains a single phrase pertaining to the node data, while the second contains phrases relating to each of the gateways that the message has travelled through.

SNR represents a continuous variable that can be simulated in a direct manner through the addition of noise to the original signal. As SNR decreases, the amplitude of the added noise increases; as SNR improves, the noise level falls.
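One direct way to realize this mapping is to scale white noise so that the measured SNR value is reproduced in the audio itself. The sketch below takes that approach; the actual scaling curve used in the design is not specified, so this is an assumption:

```python
import numpy as np

def mix_noise_for_snr(signal, snr_db, rng=None):
    """Add white noise scaled so that lower SNR readings sound noisier."""
    rng = np.random.default_rng(0) if rng is None else rng
    sig_power = np.mean(signal ** 2)
    noise_power = sig_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), len(signal))
    return signal + noise
```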

AudioObject 8: SNR Lo

AudioObject 9: SNR Hi


RSSI is also a continuous signal that, in this case, can be mapped sonically using vowel shapes and behaviors. This mapping strategy draws again from our conceptual metaphor for M2M as a conversation between machines. It is intended to represent strong RSSI with controlled and relaxed speech-like patterns and weak RSSI with chaotic and tense patterns.

AudioObject 10: RSSI Strong

AudioObject 11: RSSI Weak


The generation of an "i" vowel requires more tension in the throat and facial musculature of the speaker than the more relaxed "a" (Durand 2005), hence a straightforward conceptual link can be made between controlled, relaxed sounding speech patterns and stability, and chaotic, tense patterns of speech and instability. When RSSI is at its strongest, the vowel formant position is constant and has an “a” profile; when it becomes weaker, the vowel formant profile begins to transform into an “i,” and the position of a vowel filter rapidly shifts in a random fashion across a range of ± 12 semitones.

The MIC code is evaluated to test message integrity. A MIC code which fails verification can be perceptualized as an error in the message. Drawing from sonic representations of failure explored in our earlier discussion on glitching, the amplitude of a sound signal can be modulated by a randomized square wave generator so that it changes amplitude in a random and abrupt manner, switching itself off completely for short periods.
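One way to realize this random gating is to chop the signal into short segments and assign each a random gain, with some probability of a segment switching off entirely. The segment length and probabilities below are illustrative assumptions:

```python
import numpy as np

SR = 16000  # illustrative sample rate

def glitch_gate(signal, seg_ms=20, off_prob=0.4, sr=SR, rng=None):
    """Randomly vary amplitude per short segment, sometimes muting entirely."""
    rng = np.random.default_rng(0) if rng is None else rng
    seg = int(sr * seg_ms / 1000)
    n_segs = int(np.ceil(len(signal) / seg))
    gains = rng.uniform(0.2, 1.0, n_segs)       # abrupt random amplitudes
    gains[rng.random(n_segs) < off_prob] = 0.0  # switch off for short periods
    return signal * np.repeat(gains, seg)[: len(signal)]
```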

AudioObject 12: Bad MIC


RF can also be mapped in quite a direct manner to the fundamental frequency of the voice, representing each of the twenty-one possible frequency bands between 868–870 MHz (i.e. 868.0, 868.1, 868.2, 868.3, etc.). This data is mapped to a chromatic scale extending from C3 to G#4, keeping it distinct from the other harmonic material used.
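This band-to-pitch mapping can be expressed directly: the twenty-one 0.1 MHz steps index twenty-one chromatic steps from C3 (MIDI 48) to G#4 (MIDI 68). A sketch:

```python
BASE_MHZ = 868.0
BASE_MIDI = 48  # C3; twenty chromatic steps up lands on G#4 (MIDI 68)

def band_to_midi(freq_mhz):
    """Map the 868.0-870.0 MHz bands (0.1 MHz steps) onto C3-G#4."""
    band = round((freq_mhz - BASE_MHZ) * 10)
    if not 0 <= band <= 20:
        raise ValueError("frequency outside the 868-870 MHz range")
    return BASE_MIDI + band

def midi_to_hz(m):
    """Equal-tempered frequency with A4 (MIDI 69) at 440 Hz."""
    return 440.0 * 2.0 ** ((m - 69) / 12.0)
```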

AudioObject 13: RF Lo

AudioObject 14: RF Hi

SF comprises six levels: SF7 to SF12. Given that the spreading factor is a measure of the length of time taken to send a message, we created another, more direct mapping strategy: the length of the sound is controlled to reflect SF, with larger SFs corresponding to longer phrases. The Just Noticeable Difference (JND) for a tempo change in speech is estimated at roughly 5%, which suggests that a listener will detect a change between a 500ms vocalization and a 525ms vocalization (Quené 2007). The timings used here comfortably exceed that threshold. SF7 is assigned a length of 700ms, and each subsequent SF is incremented by 300ms, up to 2.2 seconds for SF12.
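The duration mapping is a simple linear function of SF:

```python
def sf_to_duration_ms(sf):
    """SF7 -> 700 ms; each step up adds 300 ms, reaching 2200 ms at SF12."""
    if not 7 <= sf <= 12:
        raise ValueError("spreading factor must lie between 7 and 12")
    return 700 + (sf - 7) * 300
```

Note that every 300ms step is far larger than 5% of the preceding duration, so each pair of adjacent SF levels remains discriminable.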

AudioObject 15: SF 7


AudioObject 16: SF 12


This increment was chosen to support the listener in distinguishing timings while adhering to the JND for tempo in speech and not exceeding the limits of echoic memory, roughly 4 seconds (Darwin, Turvey and Crowder 1972).


The second section of each message, which represents the gateway data, uses the same mapping strategy as the first phrase, with the addition of a descending pitch contour and reverb. The descending contour is intended to help listeners discern gateway messages from device messages, while reverb indicates the position of the gateway in the original relay sequence. Reverb tails decay within 10ms to avoid interfering with other messages. The first gateway in the relay sequence is represented with a 100% wet reverb level, while the most recent is represented with no reverb.
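The relay-position-to-wet-level relationship amounts to a linear interpolation. A sketch, with the caveat that the handling of a single-gateway relay is an assumption not specified in the design:

```python
def gateway_wet_level(position, n_gateways):
    """First relay (position 0) is fully wet; the most recent is fully dry."""
    if n_gateways < 2:
        return 0.0  # assumption: a lone gateway is treated as "closest"
    return 1.0 - position / (n_gateways - 1)
```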

AudioObject 17: GW Far

AudioObject 18: GW Close

Furthermore, these messages have a descending pitch contour from +50 cents above their designated pitch level at the beginning of the message to -50 cents below at the end. This is to distinguish them from the other message types. The entire mapping strategy is formalized in Table 1 (see above).