3.    Sound Synthesis and Processing Methods

3.1. Design considerations


 

Sonic information design for the display of proteomic data must take into account the perceptual attributes that will vary with the changes in the multivariate proteomic data matrices that occur with neurodegeneration. The most straightforward approach to this design problem is to identify a few select components of variation in the data, values that can then be used to modulate sound synthesis parameters associated with variation in selected perceptual attributes. If this approach is taken, the design would be identified as a parameter-mapping sonification (Grond and Berger 2011), in which ideally a one-to-one mapping is established between the data domain and the parameter domain describing the synthesis of the sound to be presented to the listener. Whether the variation in the parameter domain is associated with orthogonal variation in the perceptual domain is difficult to determine without solid foundational knowledge in psychoacoustics. Typically, the manipulated perceptual attributes are chosen because they can be easily distinguished, such as pitch ranging from low to high while timbre ranges from dark to bright. Such variation, while not strictly orthogonal, nonetheless admits of two perceptually salient dimensions with easily identifiable anchoring points. If a single sound is isolated at each of these anchoring points, it can be identified in terms of pitch-timbre pairings such as low-dark, low-bright, high-dark, and high-bright. The analogy to musical performance on a trumpet may be helpful here: playing from low to high pitch while muting and then unmuting the trumpet moves through these two perceptually salient dimensions.
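To make the parameter-mapping idea concrete, the following minimal sketch (not the synthesis system described later in this paper) maps two normalized data values onto the two perceptual anchors just discussed, pitch and brightness; the function and parameter names are hypothetical illustrations only.

```python
# Minimal illustration (not the authors' implementation) of a one-to-one
# parameter mapping from two normalized data values onto two synthesis
# parameters: fundamental frequency (low-high pitch) and a brightness
# coefficient (dark-bright timbre). All names are hypothetical.

def map_to_synthesis_params(x_pitch, x_brightness, f_low=110.0, f_high=440.0):
    """Map two data values in [0, 1] to a frequency in Hz and a brightness value."""
    # Pitch is mapped on a logarithmic (musical) scale between f_low and f_high.
    freq = f_low * (f_high / f_low) ** min(max(x_pitch, 0.0), 1.0)
    # Brightness is mapped linearly; 0 corresponds to dark, 1 to bright.
    brightness = min(max(x_brightness, 0.0), 1.0)
    return freq, brightness

# The four anchoring points mentioned above correspond to the corners of the
# unit square in the data domain:
anchors = {"low-dark": (0, 0), "low-bright": (0, 1),
           "high-dark": (1, 0), "high-bright": (1, 1)}
for label, (xp, xb) in anchors.items():
    print(label, map_to_synthesis_params(xp, xb))
```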

 

Note that pitch and timbre are not strictly independent perceptual dimensions, since pitch has been shown to influence judgements of timbral differences. Although timbral relationships between recorded musical instrument tones are similar at different pitches within an octave range (Marozeau, de Cheveigné, McAdams and Winsberg 2003), if pitch is allowed to vary over a wider range, higher-pitched tones will be heard as brighter than lower-pitched tones (Marozeau and de Cheveigné 2007). It might be thought that such interactions between two perceptual attributes would be less likely if one were a timbral attribute and the other a spatial attribute, such as the apparent direction of a spectrally rich tone (e.g., a plucked-string sound). However, in everyday listening, if the incidence angle of a tone varies in azimuth from frontal incidence (0 degrees) to an extreme lateral angle (90 degrees), the tone will naturally increase in perceived brightness because of the increased high-frequency emphasis in the acoustical response of the head (as measured by the head-related transfer function [HRTF]). This influence of the HRTF on the timbre of sound sources usually goes unnoticed, most likely because it is a common feature of direction-dependent spectral variation that is habitually interpreted in a spatial mode in everyday listening. HRTF-based sound processing is often used to control the apparent direction of sound sources (see Martens 2003) and is particularly useful in headphone-based auditory display. However, potential difficulties arise here, especially when synthetic rather than recorded natural sounds are displayed. A demonstration of the interaction between timbral modulation and spatial processing of synthetic string sounds is given in section 3.5.

 


3.2. Data preparation prior to sonification


 

A critical aspect of parameter-mapping sonification identified by Grond and Berger (2011) is data preparation. This is because the complex datasets of interest (such as proteomic data) are typically not ready for direct input to a sonification system in their raw form; it is quite rare to obtain good results with a “drag and drop” approach to data input. Therefore, the data are often subjected to preliminary processing that makes them more amenable to sonification (the same would of course be true if the data were to be submitted for visualization). For the sake of the current discussion of design considerations, it will suffice to say that the multivariate complexity of the data to be sonified needs to be reduced to variation along the principal components that can potentially distinguish between the three cell types examined here. Describing this data reduction process is beyond the scope of this essay; the reader may refer to the process description in the ICAD paper by Martens et al. (2016). What is pertinent here is that values on 1815 variables could be reduced to values along three principal dimensions. Moreover, the variation along these dimensions not only captured a large proportion of the total variance in the multivariate data but was also found to be potentially revealing of gross differences between the three cell types to be compared.
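As a rough indication of what such preparation can involve, the following sketch applies a standard principal component analysis to a placeholder data matrix of the same shape as that described above; it is offered only as a generic, assumed example and does not reproduce the reduction procedure reported in Martens et al. (2016).

```python
# Minimal sketch of the kind of data reduction described above, assuming a
# standard principal component analysis (PCA). Array names and the use of
# random placeholder values are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA

# proteomic_data: rows are cases (e.g., nine cell samples), columns are the
# 1815 measured variables.
proteomic_data = np.random.rand(9, 1815)          # placeholder data

# Standardize each variable before extracting components.
standardized = (proteomic_data - proteomic_data.mean(axis=0)) / \
               (proteomic_data.std(axis=0) + 1e-12)

pca = PCA(n_components=3)
scores = pca.fit_transform(standardized)          # shape (9, 3)

# The three component scores per case become the control values that are
# later mapped onto synthesis parameters.
print(scores.shape, pca.explained_variance_ratio_)
```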



3.3. Sound synthesis for the sonifications


 

To generate a sonification for the available proteomic data of interest, a parameter-mapping strategy for synthesis was formulated that took into account the complexity of the large multivariate dataset. For nine distinct cases, an assembly of short-duration, temporally-overlapping “grains” of sound was created, with timing parameters selected to approach the minimum perceivable event time for distinct percepts of duration, frequency, and amplitude (i.e., approaching the resolution at which human observers can still discriminate identifiable attributes of loudness, pitch, and those component auditory attributes generally regarded as belonging to one of the two collections termed timbral or spatial attributes). The “hypothesis-driven” design approach taken here required sound synthesis technology that could offer independent variation of many sound synthesis parameters so as to provide identifiable variation in distinct auditory attributes. In the initial stage of this work, synthesis based upon a simple physical model (Karplus and Strong 1983) was tested for its versatility in producing a wide range of short sounds exhibiting audibly identifiable timbral variations, each showing potential for evoking physical referents in the minds of the listeners (such as the “plucked-string” tones presented in Audio Object 1).

Audio Object 1: A sequence of “plucked-string” tones, varying in both pitch and duration, generated using the Karplus-Strong (1983) algorithm. The pitches were constrained to the range from A4 (440 Hz) down to A2 (110 Hz), two octaves below.

 

Whereas data sonification for a multivariate dataset usually proceeds by mapping data values onto several different synthesis parameters, in this demonstration of the Karplus-Strong output only one variable (the molecular weight of proteins found in a given cell) was mapped to the frequency of vibration of the physically-modelled string. Given that synthesized string sounds with lower fundamental frequencies (longer wavelengths) also decayed more slowly, the duration of each synthesized tone varied with its pitch.
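For readers unfamiliar with the Karplus-Strong algorithm, the following compact sketch shows the delay-line-with-averaging structure that produces such plucked-string tones, together with a hypothetical mapping of molecular weight onto the A2 to A4 frequency range used in Audio Object 1; the scaling and direction of that mapping are illustrative assumptions only.

```python
# Compact sketch of Karplus-Strong (1983) plucked-string synthesis, plus a
# hypothetical mapping of protein molecular weight onto fundamental frequency.
import numpy as np

def karplus_strong(freq, duration, sr=44100, damping=0.5):
    """Generate a plucked-string tone at the given fundamental frequency."""
    n_samples = int(sr * duration)
    delay = int(round(sr / freq))              # delay-line length sets the pitch
    buf = np.random.uniform(-1, 1, delay)      # noise-burst excitation
    out = np.empty(n_samples)
    for i in range(n_samples):
        out[i] = buf[i % delay]
        # Averaging adjacent samples acts as the low-pass loop filter that damps
        # the string; lower fundamentals (longer delay lines) decay more slowly,
        # as noted above.
        buf[i % delay] = damping * (buf[i % delay] + buf[(i + 1) % delay])
    return out

def weight_to_frequency(weight, w_min, w_max, f_low=110.0, f_high=440.0):
    """Hypothetical logarithmic mapping of molecular weight to pitch (A2-A4)."""
    x = (weight - w_min) / (w_max - w_min)
    return f_low * (f_high / f_low) ** x
```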

 

The synthesis technology that was ultimately adopted for this project resembles granular synthesis (described in Dutilleux, de Poli, von dem Knesebeck and Zölzer 2011) in that a multitude of short sound sources formed an ensemble output (likened to a “swarm”) rather than forming clearly separable events that might be heard as distinct in time and space. Unlike the quintessential “Gaussian envelope” approach to granular synthesis, the “grains” employed here had percussive amplitude envelopes with the character of the highly-damped plucked strings typical of the algorithms developed by Karplus and Strong (1983). At high grain density, it is the streams of grains, rather than any individual grain, that stand out as distinct events. It is the natural capacity of the human auditory system to perform such automatic streaming (or grouping) of incoming sounds that is expected to play a central role in the apprehension of meaningful patterns in the complex proteomic data. Careful selection of pitch range and onset timing for the grains is required if stream segregation is to allow listeners to distinguish between rhythmic patterns. Fortunately, some guidelines for such design considerations have been developed by Barrass and Best (2008), who provided a palette of stream-based sonification diagrams showing the influence of brightness, pitch, loudness, and spatial panning on streaming.
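A minimal sketch of how such a swarm might be assembled is given below, reusing the karplus_strong() helper sketched above; the grain density, grain duration, and normalization shown here are illustrative values rather than the settings used in the reported sonifications.

```python
# Minimal sketch of assembling short percussive grains into an overlapping
# "swarm" by overlap-add. All timing and density values are illustrative.
import numpy as np

def grain_swarm(frequencies, total_dur=4.0, grain_dur=0.08,
                grains_per_second=60, sr=44100, seed=0):
    """Scatter short plucked-string grains across an output buffer."""
    rng = np.random.default_rng(seed)
    out = np.zeros(int(sr * total_dur))
    n_grains = int(total_dur * grains_per_second)
    for _ in range(n_grains):
        freq = rng.choice(frequencies)                 # pitch drawn from the mapped set
        onset = int(rng.uniform(0, total_dur - grain_dur) * sr)
        grain = karplus_strong(freq, grain_dur, sr)
        end = min(onset + len(grain), len(out))
        out[onset:end] += grain[:end - onset]          # overlap-add into the swarm
    return out / np.max(np.abs(out))                   # normalize to avoid clipping
```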

 

In all sonifications designed for multivariate data display using the approach described here, hypotheses needed to be tested regarding which data variables were “mapped” to which synthesis parameters. It is beyond the scope of this paper to present the details of the synthesis technology that was developed and refined, for the available multidimensional proteomic data, through experimentation and iterative hypothesis testing. Suffice it to say that swarms of percussive grains (Dutilleux et al. 2011) were synthesized with “parameter-mapped” control over spatial and timbral auditory attributes, which varied over time and space according to the multivariate distribution of proteomic data observed for the nine cases that were examined. A flowchart representing the sound processing utilized in the sonifications is shown in Figure 1, with attention to the dimensions of the sonification that the user might interactively control in the search for the best results (such as duration, pitch range, and angular extent, highlighted in red).


In the next two sections, the timbral modulations and spatial positioning of synthesized grains are explained in turn.

Audio Object 2: A plucked-string tone processed successively by the six comb-filter resonance structures shown in Figure 2. The pitch and duration remain constant because the same input is applied to the comb-filter module with six different peak-frequency patterns.

Figure 1. Flowchart representing the sound processing employed to produce the examined sonifications. Trapezoidal boxes show the data inputs to the system, including the input multivariate proteomic data and user control data; pentagonal symbols represent the potentially complex “parameter-mapping” operations in which user control data modifies the mapping of the multivariate data into audio signal processing; and rectangular boxes represent signal processing modules, which pass audio signals rightward in the flowchart toward adjacent modules, ultimately for auditory display. The number of arrows connecting signal processing modules indicates the typical number of audio signals that are passed in parallel; thus, the spatial processing module shows only two outputs, since headphones are assumed to be the target reproduction system. (The alternative output to loudspeaker arrays might include common main channel counts of 5, 7, 10, or 22, with the addition of a variable number of low-frequency channels, most commonly only 1 or 2.)



3.4. Timbral modulations for the sonifications


 

Again, in an effort to create timbral variations that might evoke physical referents in the minds of the listeners, the tonal coloration of the plucked-string tones was modified via spectral shaping that mimics the resonances of musical instruments based on a cylindrical tube closed at one end, such as the clarinet. These timbral modulations were realized using a simple recursive comb filter controlled by just a few parameters. Figure 2 shows a family of gain-versus-frequency functions illustrating one dimension of the spectral variation that could be imposed on the synthesized signals using the comb filter. The timbral variation afforded here mimics the tonal coloration imparted to vowel sounds by manipulation of the human vocal tract, and hence can present readily recognizable differences to which meaning could be ascribed. Audio Object 2 provides an example of the spectral variation illustrated in Figure 2, with a sequence of six comb-filter outputs generated from a single Karplus-Strong plucked-string tone as their input.

Figure 2. A family of gain-versus-frequency functions showing the variations in comb-filter patterns that were imposed on the synthesized signals (labelled “tonal processing” in Figure 1). Note that the 0-dB value on the y-axis indicates gain relative to input, and thus indicates unity gain here. The frequency-dependent gain of the filter shows regularly-spaced peaks at odd harmonics (at frequencies of F, 3F, 5F, etc.), typical of the resonances of a tube closed at one end. The blue solid curve shows the comb-filter resonance pattern with its first resonant peak at F = 440 Hz (the highest fundamental frequency of the input to be applied to the filter). The dashed lines show the remaining five featured comb filters, whose resonance patterns begin with a value of F set to progressively higher frequencies.
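The following sketch shows one way such an odd-harmonic resonance pattern can be obtained with a recursive comb filter, namely by placing negative feedback around a delay of half the resonance period; the feedback gain and the commented placeholder frequencies are illustrative assumptions, not the parameter values behind Figure 2.

```python
# Minimal sketch of a recursive comb filter whose resonant peaks fall at odd
# multiples of a chosen frequency F (F, 3F, 5F, ...), as in Figure 2. The
# feedback gain is an illustrative value only.
import numpy as np

def odd_harmonic_comb(x, peak_freq, sr=44100, feedback=0.9):
    """Negative-feedback comb filter: first resonant peak at peak_freq, then
    at 3*peak_freq, 5*peak_freq, and so on."""
    delay = int(round(sr / (2.0 * peak_freq)))   # half-period delay line
    y = np.zeros(len(x))
    for n in range(len(x)):
        fb = y[n - delay] if n >= delay else 0.0
        y[n] = x[n] - feedback * fb              # negative feedback -> odd-harmonic peaks
    return y

# Analogue of Audio Object 2: one plucked-string tone (fundamental f0) passed
# through six combs with progressively higher first-peak frequencies
# (f0 and the list of first-peak frequencies are placeholders).
# tone = karplus_strong(f0, 1.0)
# outputs = [odd_harmonic_comb(tone, f) for f in first_peak_frequencies]
```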

 

 

3.5. Spatial sound processing for headphone display



Although discrimination of frontward from rearward incidence of sonification components could be well supported if binaural processing were coupled with head-tracking technology (see Martens 2003), the experimental stimuli generated in the current auditory display system were not modified by active sensing of the listener’s head turning. Without such tracking of head movements, the sonification designer should not expect the listener to be able to clearly identify whether a virtual source presented at a given lateral angle has frontward or rearward incidence. Because reliable front/rear distinctions are difficult to support using static (i.e., uncoupled) binaural processing for headphone-based spatial auditory display (see Martens 2003), only a simplified model of head acoustics was employed here to move sonification components along the listener’s interaural axis (the axis passing through the listener’s ears). The acoustical cues that were simulated to accomplish this manipulation of the incidence angle of a headphone-displayed virtual source were the interaural time delay (ITD) and the head shadow, which generally grows larger at the listener’s contralateral ear as the incidence angle of the source is offset laterally from the listener’s median plane. The simulation did not include any of the spectral features that are known to influence perceived elevation, even though HRTF-based processing is often used for such angular control. The issue here concerns the relative salience of differences in perceived elevation: angular offsets in elevation are associated with smaller perceptual differences than comparable angular offsets in source azimuth (Pereira and Martens 2018).


The simpler simulation approach offered an advantage over a single-user headphone display in that several listeners could use the system simultaneously, without the unexpected variation that would occur if the spatial processing were coupled to the head movements of just one of multiple listeners. Of course, head-coupled updating of the headphone-based binaural rendering could be added for single-user exploration of the spatial configuration of sonification components (including sensitivity to the listener’s translational movements as well as to changes in head orientation); however, for the initial studies reported here, only non-head-tracked headphone-based spatial audio processing was employed. First, an ITD that depended upon the desired lateral angle of the virtual source was introduced between the ipsilateral and contralateral ear signals; then, the “head-shadow” filter shown in Figure 3 was applied to the contralateral ear signal to mimic the attenuation of high-frequency signals at the ear on the side of the head opposite to the virtual source. While not appropriate for loudspeaker-based sound spatialization, this simple spatial processing produces a very clear lateral shift of virtual sources in headphone reproduction. Note, however, that the results of this spatial processing are not independent of the timbral modulation introduced above.

Figure 3. A family of gain-versus-frequency functions showing the output results for the “head-shadow” filter that was applied to the contralateral ear signal (as a key component process within the “spatial processing” module shown in Figure 1). The frequency-dependent gain of the filter shows increasingly greater attenuation at higher frequencies as the lateral angle of the input source grows from 0 degrees (showing no attenuation) to 90 degrees (exhibiting attenuation greater than 18 dB at the highest frequency shown). This 90-degree range of virtual-source lateral angles appears within the inset graphic showing a top view of the listener’s head. The blue solid curves show the filter gain for five specified lateral angles in 20-degree steps from 0 to 80 degrees. To aid in distinguishing adjacent curves, the line style alternates between blue solid lines and dashed red lines, which show the filter gain functions for intermediate lateral angle values, ranging from 10 to 90 degrees, again in 20-degree steps.
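A minimal sketch of this two-cue processing is given below; the spherical-head ITD approximation and the one-pole low-pass filter used to stand in for the head shadow are illustrative simplifications, not the filter plotted in Figure 3.

```python
# Minimal sketch of the two simulated cues described above: an angle-dependent
# interaural time delay (ITD) plus a low-pass "head-shadow" filter on the
# contralateral ear signal. Both are illustrative stand-ins.
import numpy as np

def spatialize(x, lateral_angle_deg, sr=44100, head_radius=0.0875, c=343.0):
    """Return (ipsilateral, contralateral) ear signals for a virtual source
    offset from the median plane by lateral_angle_deg (0-90 degrees)."""
    theta = np.radians(lateral_angle_deg)

    # Woodworth-style ITD approximation for a rigid spherical head.
    itd = (head_radius / c) * (theta + np.sin(theta))
    delay = int(round(itd * sr))

    ipsi = x.copy()
    contra = np.concatenate([np.zeros(delay), x])[:len(x)]   # delayed copy

    # Head shadow: one-pole low-pass whose smoothing increases with angle, so
    # high frequencies at the contralateral ear are attenuated more as the
    # source moves laterally (cf. Figure 3). The coefficient law is assumed.
    alpha = 0.95 * np.sin(theta)
    y = np.zeros(len(contra))
    state = 0.0
    for n in range(len(contra)):
        state = (1.0 - alpha) * contra[n] + alpha * state
        y[n] = state
    return ipsi, y
```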

As a demonstration of the interdependence of headphone-based virtual source positioning and the timbral modulation afforded by the recursive comb filter introduced in section 3.4, Audio Object 3 presents a sequence of plucked-string tones that should be heard to move from a small lateral angle (near the median plane) to a large lateral angle (well to the side) and then back again, repeating the sequence of six tones while reversing the sequence of lateral angles. What should be noticed when listening to Audio Object 3 is that the apparent angular extent of the sequence is greater for the first presentation than for the second. In the first presentation the tones move from a relatively central position out to an extreme lateral position, whereas in the second presentation they appear to start from a less extreme lateral position and to traverse a smaller angular extent on their return (because the timbral variation, now reversed relative to the spatial trajectory, works in opposition to the head-related filtering).

Audio Object 3: A sequence of plucked-string tones exhibiting variation in both tonal color and spatial position. The tones produced by the six different peak-frequency patterns demonstrated in Audio Object 2 are progressively moved from a central to a lateral position in 10-degree steps (demonstrating positioning at lateral angles of 10, 20, 30, 40, 50, and 60 degrees), and then from a lateral to a central position for the same sequence of six tones.
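To summarize how the pieces fit together in the order shown in Figure 1, the following schematic sketch chains the helpers sketched in sections 3.3 through 3.5 for a single case; the particular mapping of the three principal-component scores onto pitch, comb-filter frequency, and lateral angle is entirely hypothetical and serves only to illustrate the processing chain.

```python
# Schematic sketch chaining the helpers sketched above in the order shown in
# Figure 1 (synthesis -> tonal processing -> spatial processing). The mapping
# from component scores to synthesis parameters is a hypothetical placeholder.
import numpy as np

def render_case(scores, sr=44100):
    """scores: three principal-component values scaled to [0, 1] for one case."""
    pitch_score, timbre_score, spatial_score = scores

    # Parameter mapping (hypothetical): score -> synthesis parameters.
    f0 = 110.0 * (440.0 / 110.0) ** pitch_score           # A2-A4 pitch range
    comb_freq = 440.0 * (1.0 + 2.0 * timbre_score)        # first comb peak
    lateral_angle = 90.0 * spatial_score                  # degrees off median plane

    grain = karplus_strong(f0, 0.5, sr)                   # synthesis
    colored = odd_harmonic_comb(grain, comb_freq, sr)     # tonal processing
    ipsi, contra = spatialize(colored, lateral_angle, sr) # spatial processing
    return np.stack([ipsi, contra])                       # two-channel output
```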


3.6. Spatial versus timbral emphasis in sonification



It is worth asking whether emphasis should be placed upon spatial positioning or upon timbral modulation in producing the most effective sonifications. Based upon the results of the experimental investigation of sonifications for proteomic datasets associated with ALS (Martens, Poronik and Saunders 2016), sonic information design decisions could be made about the relative emphasis on one or both of these approaches. Support for those decisions might also be provided by the audio examples associated with the current paper. In that investigation, nine sonifications were created from the factorial combination of three synthesis solutions applied to proteomic data from the three cell types.


Thus, for each of the three cell types, each of which should produce an identifiably different sound, each of three unique parameter-mapping synthesis solutions was applied for presentation. The first of these synthesis solutions was termed the “timbral-only” approach, which placed emphasis upon timbral differences resulting from spectral variation between grains. The second was termed the “spatial-only” approach, which held grain spectra constant and only distributed the grains spatially along the listener’s interaural axis. The third was termed the “spatial-timbral” approach, which combined redundant variation in the output sound through the simultaneous application of both of these parameter mappings. These sonifications, which were chosen as the best candidates for allowing the differences between cells to be appreciated by human listeners, are demonstrated in Audio Objects 4, 5, and 6, respectively.