The Data-driven Voice-Body in Performance: AI Voices
as Materials, Mediators, and Gifts
Supplemental Documentation
Data Ecologies and Voice Transfer Models
The five target voices described below collectively scaffold the vocal data ecology of iː ɡoʊ weɪ and represent a deliberate part of the artistic research and design process of the performance work. They were designed in part to create a dynamic vocal playing field, allowing the performer to experiment with a fluid set of target voice-like phenomena in real time, and in part to explore different socio-technical relationships between the performer and other vocalizing bodies implicated in the process of dataset creation.
SELFIE
The "Self" target voice models include datasets of approximately two hours of recordings of the performer's own voice, recorded over the course of a month in the studio of the Intelligent Instruments Lab in 2022. The aim of this dataset was to capture a wide variety of vocalizations, a "snap shot" of the performer's vocal production across varied techniques both phonetic and paralinguistic.
In producing these recordings, the performer prepared a selection of vocal warmup and training exercises, alongside specific scores and phonetically balanced texts to provide comprehensive coverage of diverse vocal sounds, including:
- Vocal Techniques: Vocal warm-ups and technical exercises in blues and gospel singing, death metal growling, and the hollers, shouts, and warbles characteristic of Appalachian folk singing.
- Paralinguistic Expressions: Humming, breath sounds, percussive lip smacking.
- Speech Sounds: Readings of the International Phonetic Alphabet (IPA) at different speeds, a complete performance of Kurt Schwitters' abstract sound poem Ursonate, and English scripts commonly used in voice research and voice banking such as the Harvard Sentences and the Rainbow Passage.
Evaluation of the resulting model was done subjectively by the performer, based on the model's ability to create a real-time voice clone that was as perceptually transparent as possible, meaning that the model could faithfully reproduce speech and nuanced timbral and expressive qualities when the input voice was the performer's own. Beyond transparency, a secondary evaluation metric of the SELFIE models was their capacity for being "pushed" through latent space manipulations into producing new vocal gestures not present in the dataset, as sketched below. This metric was chosen so that the model could produce subtle but uncanny disjunctures between the performer's biological voice and a "not quite right" reproduction.
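To make this "pushing" concrete, the following is a minimal sketch of latent-space manipulation, assuming a SELFIE model exported as a TorchScript file exposing encode() and decode() methods (as RAVE exports typically do); the file names, the chosen latent dimensions, and the bias and scaling values are hypothetical.

```python
# Minimal sketch: encode the performer's voice, "push" the latent
# trajectory, and decode both the transparent clone and the pushed
# version. File names and latent offsets are hypothetical.
import torch
import torchaudio

model = torch.jit.load("selfie.ts").eval()

audio, sr = torchaudio.load("performer_input.wav")
x = audio.mean(dim=0, keepdim=True).unsqueeze(0)   # mono, shape (1, 1, samples)

with torch.no_grad():
    z = model.encode(x)                # latent trajectory, (1, latent_dim, frames)
    z_pushed = z.clone()
    z_pushed[:, 0] += 1.5              # bias one latent dimension
    z_pushed[:, 1] *= 2.0              # exaggerate movement in another
    y_clone = model.decode(z)          # "transparent" reconstruction
    y_pushed = model.decode(z_pushed)  # "not quite right" vocal gestures

torchaudio.save("clone.wav", y_clone.squeeze(0), sr)
torchaudio.save("pushed.wav", y_pushed.squeeze(0), sr)
```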
TUTTI
The TUTTI target voice includes a mixture of choral ensemble and soloist recordings from open choral singing datasets (Cuesta et al., 2018; Rosenzweig et al., 2020; Cuesta and Gómez, 2022) and approximately 1.5 hours of recordings made in collaboration with MUSILON, the student choir of the University of Twente, recorded with full consent for AI model training as part of an artist residency there. All of these mixed recordings were brought into a single DAW project for editing and mastering, with band-specific EQ and compression applied to ensure that the recordings shared a similar dynamic range. A final layer of compression, EQ, and reverb was added to the project as a whole to enhance the richness of the recordings and situate them within a shared reverberant space.
The aim of this dataset was to capture a wide range of solo and choral singing befitting commonplace Western conceptions of choral aesthetics, and particularly to allow the performer to inhabit voices of other genders often thought of as "pristine" and "beautiful", such as that of a female soprano. The mixture of solo and ensemble vocal parts was intentional, enabling the performer to control both solo and choral singing within the scope of a single model. The results of model training were evaluated subjectively by the performer based on his ability to navigate this single versus multi-vocal dynamic by deploying different vocal techniques and manipulations of the latent space while performing.
There is more to tell about the data ecology of TUTTI than its constituent audio files and model training goals, which is relevant to later discussions around ethics. The relationship between the author and the MUSILON choir began in 2022 through an artist residency at the Human-Media Interaction lab at the University of Twente. During the residency the author provided thesis mentorship for one of the lead choristers of the group, a student in Human-Media Interaction who was researching the choral phenomenon of "blend" in human-machine vocal collaboration. Through this mentorship the author was able to develop an intimate relationship with the MUSILON choir, attending multiple rehearsals and building interpersonal relationships with a number of the choristers who were also interested in the intersection of voice and computer science. It was through this slow building of genuine trust and shared research interests that the collaboration of dataset creation was formed.
THE ZOO
THE ZOO target voice was trained on approximately one hour of curated non-human vocalizations, including gibbons, howler monkeys, capuchins, dogs, wild parrots, and cicadas. These recordings were all made by the author over the past decade at various locations, including the Apenheul Primate Park in Apeldoorn, the Vondelpark in Amsterdam, and rural areas near Kumasi in Ghana.
The aim of this dataset was to include a wide variety of non-human vocal sounds from species that diverge to increasing degrees from Homo sapiens. The inclusion of non-human vocalizations was also central to the performance's artistic intention, as an exploration of the expansive and permeable boundaries of "voice" itself, challenging it as a fixed and anthropocentric concept. Model training for THE ZOO was evaluated subjectively by the performer based on how accessible species-specific subsets of the training data were, given his vocal capabilities and his ability to deploy specialized vocal techniques to evoke sonic material from a wide range of the latent space.
BLONK
The "BLONK" target voice includes approximately three hours of solo voice recordings provided by Jaap Blonk, the prolific Dutch vocal improviser and sound poet. Blonk is internationally recognized for his idiosyncratic explorations of vocal sounds and live performance presence. His work is rooted in the 1970s live poetry scene and traditions of extended vocal improvisation and the sound poetry lineage emerging from the Dada art movement. His contributions to the dataset included, amongst others, solo voice works from the albums Vocalor and klinkt, alongside performances of sound poems by Kurt Schwitters and Theo van Doesburg.
The targeting of Blonk's voice performances is significant for several reasons. Beyond his artistic stature, Blonk has collaborated with the author as both an artistic mentor and vocal coach. This mentorship was in many ways coupled with the process of dataset curation and model training for Blonk's target voice, making the process deeply interpersonal and collaborative.
The aim of the dataset itself was similar to that of SELFIE: to achieve wide coverage of Blonk's impressively large range of vocal techniques by collecting material from across his performative oeuvre. In model training, the guiding quality metric was once again the desire for transparent voice transformation from the performer's voice to the target voice, meaning that, given a suitable vocal signal from the performer, the AI model could convincingly emulate Blonk's unique vocal characteristics in real time.
A CRY
The A CRY models are trained on approximately 45 minutes of sound material taken from videos posted on the social media platform Instagram in late October 2023, shortly after the Al-Ahli Arab Hospital explosion in Gaza. The material was taken specifically from the Instagram feed of the performer, representing a collage of the performer's personal media landscape during this period. The audio recordings comprise sonic material from social media videos chosen for their notable emphasis on the human voice, consisting mostly of audible cries, screams, and pleas from Israeli and Palestinian citizens expressing grief, fear, desperation, and rage.
This dataset was created as an immediate artistic and personal response to the escalation of violence and the growing humanitarian crisis in the Middle East following the October 7th Hamas-led attacks on Southern Israel, and to the accompanying role of social media in the global representation of this conflict. iː ɡoʊ weɪ had been included in the performing arts line-up for the "Not to Be Senseless" series in The Hague, a concert scheduled for October 20th 2023, just 13 days after the initial attacks on Southern Israel. The decision to create a model based on this target voice was made specifically for this event, shortly after the October 17th Al-Ahli Arab Hospital explosion, at a time when the social media landscape around the conflict became increasingly saturated with graphic depictions of human suffering (Abbas et al., 2024; Ghosh, 2024). This left three days to curate a dataset and train a model before the performance, while the two-stage training process of the RAVE v3 architecture being used would on average require at least a full week of training time. Crucially, the decision was made to intentionally underfit the model for the performance, deliberately training a voice transfer model that was incapable of accurately reproducing audio from the domain of the training data.
RAVE models are trained using a two-stage process: first, a Variational Autoencoder (VAE) stage performs general feature learning for a default of one million training steps; then a Generative Adversarial Network (GAN) stage is used to improve audio fidelity. Having prior familiarity with the aesthetics of RAVE models at all stages of training, the author recognized that the time constraint of the upcoming performance also presented an opportunity to work in an artistically and ethically appropriate way with such a sensitive dataset. In general, A CRY models are trained only within the VAE stage, in the range of 200k-800k training steps. This underfitting in the VAE stage results in a voice transfer model that reproduces prominent features of the input voice, such as dynamics, but loses most timbral nuances, producing a sea of reconstruction noise and artifacts, and therefore obfuscating any audible link that could be used to identify the voices present in the training data. Unlike the evaluative goals of previous target voices, this and later models of the A CRY target voice were evaluated based on the capacity for voice-like sounds to be "pulled out" from the reconstruction noise of the model through emotive vocalisations such as cries, shouts and hollers.
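The schedule below is an illustrative sketch of this two-stage budget, not RAVE's actual training code; the step counts mirror those described above, and the A_CRY_STOP value is a hypothetical example of an intentional early stop inside the VAE phase.

```python
# Illustrative sketch of the two-stage training budget described above
# (not RAVE's actual implementation): a VAE/representation phase for
# the first million steps, then an adversarial (GAN) phase that
# restores audio fidelity. A CRY checkpoints stop inside the VAE phase.
VAE_PHASE_STEPS = 1_000_000   # default representation-learning budget
A_CRY_STOP = 400_000          # hypothetical early-stop point, deliberately underfit

def active_losses(step: int, vae_phase_steps: int = VAE_PHASE_STEPS) -> str:
    """Return which losses would be active at a given training step."""
    if step < vae_phase_steps:
        return "vae"       # reconstruction + KL regularisation only
    return "vae+gan"       # adversarial and feature-matching losses added

for step in (100_000, A_CRY_STOP, 1_200_000):
    print(f"step {step:>9,}: {active_losses(step)}")
# All A CRY checkpoints fall inside the "vae" phase, which is why their
# reconstructions keep dynamics but wash out identifying timbral detail.
```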
iː ɡoʊ weɪ (2021–present) is a series of performances by JC Reus that attempt to unravel the idea of the singular, individual vocal self. Through live performance with neural real-time voice transfer models, the work stages a progressive dissolution of the singular voice into multi-voiced, polyphonic, and alien configurations. This trajectory aims to unsettle the voice as a stable marker of personal identity, treating voice instead as a dynamic boundary in which multiple subjectivities can inter-be.
The performances combine traditions of dadaist phonetic poetry, extended vocal technique, and inter-genre experimentation with technical infrastructures drawn from contemporary speech synthesis and voice conversion research. The most deeply explored voice transfer architecture has been RAVE (Realtime Audio Variational autoEncoder), which the artist first encountered at its public release at the Neural Audio Synthesis Hackathon (NASH) in 2021. These models operate in low-latency configurations suitable for performance, enabling continuous transformation between the performer's voice and composite models incorporating multiple human voices.
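As an illustration of such a low-latency configuration, the following is a minimal sketch of block-by-block live processing in Python, assuming a RAVE model exported in streaming mode as a TorchScript file; the file name, block size, and sample rate are assumptions, and in practice such models are often hosted inside an audio environment (for example nn~ for Max/MSP) rather than a standalone script.

```python
# Minimal sketch of real-time block-based voice transfer, assuming a
# streaming TorchScript export of a RAVE model (file name hypothetical).
import sounddevice as sd
import torch

model = torch.jit.load("tutti_streaming.ts").eval()
BLOCK = 2048          # samples per block; latency is roughly BLOCK / RATE
RATE = 44100          # assumed to match the model's training sample rate

def callback(indata, outdata, frames, time, status):
    # indata/outdata are float32 arrays of shape (frames, channels).
    x = torch.from_numpy(indata.T).float().unsqueeze(0)  # (1, 1, frames)
    with torch.no_grad():
        y = model(x)                                     # (1, 1, frames)
    outdata[:] = y.squeeze(0).T.numpy()[:frames]

# Full-duplex stream: microphone in, transformed voice out.
with sd.Stream(samplerate=RATE, blocksize=BLOCK, channels=1,
               dtype="float32", callback=callback):
    sd.sleep(60 * 1000)  # run for one minute
```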
The performance arc begins with a self-trained voice transfer model, establishing a perceptual baseline of vocal identity. This model is incrementally morphed into antiphonal, choric and polyphonic relationships with the voices of others - including the sound poet Jaap Blonk, the student choir of the University of Twente, voices from the artist's social media feed, and field recordings of non-human vocalisations such as gibbons, howler monkeys, parrots and bee hives - resulting in fluid hybridizations.
As in other data-driven AI works by Reus, the broader data ecology of dataset creation and collection is treated as an embodied, situated act of being vocal. Datasets, conceived here as both archival and agential, are interlocutors between bodies at a particular moment in time. Once integrated into generative models, these traces become capable of expansion and of unforeseen forms of recombination with a living body. Voice Data Ecologies are a generator for potential musical forms to emerge, but also a place where important musical-social relationships are made. All models used in iː ɡoʊ weɪ are trained on datasets created by the artist through direct engagement with his own body and environment, or through collaborative and meaningful relationship building with other musicians, leading to the creation and collection of recorded voice. The dataset, while functioning as a technical substrate for model training, is simultaneously an intimate biometric record - a high-resolution acoustic trace of a specific physiological state - which acquires unpredictable circulation once embedded in AI systems.
The work resonates with philosophical and theoretical writings on the complexity of voice as simultaneously singular and relational, alongside phenomenological and cognitive accounts of perception of the voice-body gestalt. In recent iterations, iː ɡoʊ weɪ has incorporated the recorded cries of those affected by contemporary political violence, framing these sonic fragments not as symbolic quotation but as material for embodied resonance. Here, AI-mediated voice functions as a form of empathetic attunement, in which the performer's voice-body gestalt becomes a mode of witness.
By bringing together technical experimentation, theoretical inquiry, and politically attentive performance practice, iː ɡoʊ weɪ contributes to interdisciplinary debates on mediated embodiment, the ethics of vocal data, and the possibilities of collective and distributed subjectivities in the age of machine listening. It proposes that AI voice transformation, far from being merely a tool of imitation, can be mobilized as a critical and affective practice for reimagining the relational space between bodies, voices, and the technologies that connect them.