Gained In Translation

words prompt sounds

sounds prompt words 

Gained In Translation is a music & text performance piece that I made in May 2025 for the Tech, Tea & Exchange residency at Tate.* The piece comprises different Machine Learning (ML) models strung together to form a recursive loop. The loop begins with human input: a short musical phrase that I play on viola in response to a text score.

text as score

I generate a text description from the viola audio, then generate audio using the generated text as a prompt... and the loop continues. Each performance is unique in how it unfolds and evolves, because the initial input is improvised and there is a degree of stochasticity (chance) in the ML model architectures.
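The recursive loop described above can be sketched in plain Python. The function names `describe_audio` and `generate_audio` are hypothetical stand-ins for the actual model calls (sound classification plus a language model on one side, a text-to-audio model on the other); this is a minimal sketch of the loop's shape, not the piece's implementation.

```python
# Sketch of the recursive translation loop. The two inner functions
# are made-up stand-ins for the real model calls.

def describe_audio(audio: bytes) -> str:
    """Stand-in for the sound -> words step (classifier + language model)."""
    return f"a recording resembling {len(audio)} bytes of viola"

def generate_audio(prompt: str) -> bytes:
    """Stand-in for the words -> sound step (text-to-audio model)."""
    return prompt.encode("utf-8")

def translation_loop(initial_audio: bytes, iterations: int) -> list[str]:
    """Run the audio -> text -> audio loop, collecting each prompt."""
    audio = initial_audio
    prompts = []
    for _ in range(iterations):
        prompt = describe_audio(audio)   # sound prompts words
        audio = generate_audio(prompt)   # words prompt sound
        prompts.append(prompt)
    return prompts
```

Because each generated output becomes the next input, small divergences at any step compound over iterations, which is where the piece's gradual transformation comes from.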

failure to copy

cross-modal translation as meta-composition

I use translation between modalities (sound and text) as a meta-compositional technique, where the failure to reproduce gradually transforms a single musical idea. While reproduction accuracy is an important element of evaluation in generative ML model training, the goal of artworks is usually not to reproduce, but to create something physically or conceptually new. I place these two contradictory viewpoints on the nature and goals of reproduction side by side. In this way, Gained In Translation grew from the points of tension between artistic practice and technical practice.

what is gained in translation?

*The Tech, Tea & Exchange residency was funded by Anthropic and Gucci.

models & datasets

YAMNet - a sound classification model from TensorFlow

The model is trained to predict the most probable sound in an audio waveform. We can predict sounds from a pre-recorded sound file (as I do for the tech demo version of Gained In Translation) or from an incoming stream of audio (as I do in the Gained In Translation performance).

dataset - "AudioSet consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos" https://research.google.com/audioset/

Claude 3.7 Sonnet - a large language model created by Anthropic
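YAMNet returns a matrix of per-frame scores over its 521 AudioSet classes, and a common way to pick "the most probable sound" is to average those scores across frames and take the highest-scoring class. The sketch below shows that postprocessing step with NumPy; the score values and class names are illustrative stand-ins, not real model output.

```python
import numpy as np

def top_class(scores: np.ndarray, class_names: list[str]) -> str:
    """Average per-frame class scores and return the best class name."""
    mean_scores = scores.mean(axis=0)          # average over time frames
    return class_names[int(mean_scores.argmax())]

# Illustrative example: 3 frames scored over 3 classes
class_names = ["Violin, fiddle", "Speech", "Silence"]
scores = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.8, 0.1, 0.1],
])
```

The same averaging works whether the scores come from a pre-recorded file or from a buffered live stream, which is why the piece can run in both demo and performance modes.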

dataset - "Training data includes public internet information, non-public data from third-parties, contractor-generated data, and internally created data. When Anthropic's general purpose crawler obtains data by crawling public web pages, we follow industry practices with respect to robots.txt instructions that website operators use to indicate whether they permit crawling of the content on their sites. We did not train this model on any user prompt or output data submitted to us by users or customers." - https://www.anthropic.com/transparency

Stable Audio 2.0 (text & audio-to-audio)

dataset - "AudioSparx is an industry-leading music library and stock audio web site that brings together a world of music and sound effects from thousands of independent music artists, producers, bands and publishers in a hot online marketplace." https://www.audiosparx.com/