Making the Music of the Mazg
Text by Robin Sloan
Here’s how it came to pass that the Sourdough audiobook is the first, as far as I know, to include, tucked into its human narration, a contribution from a creative machine.
In case you don’t know: audiobooks are amazing now! They’re the fastest-growing part of the publishing industry, and their production has kept pace with their popularity. Audiobooks circa 2017 aren’t flat recitations of the words on the page; they’re a genre all their own, often enhanced with material that’s not available anywhere else.
So, the Sourdough audiobook contains several short chapters I wrote expressly for its listeners. And, in addition, something stranger.
First, I need to explain a bit about the story.
All of the action in this novel is precipitated by a sourdough starter given to the narrator by a pair of mysterious brothers who belong to a made-up culture called the Mazg. The Mazg have their own language, and in that language they sing their own songs.
Sourdough’s narrator listens to a CD of those songs and describes them like this:
Some were sung by groups of women, others by groups of men, and one was a mixed chorus. The style was all the same: sad, so very sad, but matter-of-factly so. These songs did not blubber. They calmly asserted that life was tragic, but at least there was wine in it.
Elsewhere, she observes that the language
sounded Slavic, but every so often there was a hard stop, like the hitch of a sob, or an ear-bending slide between notes that spun the sound into some other, more distant dimension.
This is the great affordance of fiction: I can write “ear-bending slide,” and it just happens in your brain. No singing required!
This music is so pivotal to the story, and I make such a big deal of its mysteriousness, that I knew we had to include it in the audiobook, even — or maybe especially — if only as a ghostly whisper.
So then, the question: how?
Let’s begin at the root of my own imaginings. When I was writing about this made-up music, I was thinking of something real: a genre of Croatian a cappella singing called klapa. I encountered it online years ago and I’ve been listening avidly ever since.
Here’s a tiny sample:
But my imaginary choruses don’t sing in Croatian, and to blithely drop some klapa into the audiobook as makeshift Mazg music would have been both inaccurate and, I think, offensive.
I considered the tools at my disposal.
I’ve spent the past couple of years fascinated by the creative potential of machine learning, a set of techniques that allow computers to accomplish interesting tasks without being explicitly programmed to do so. Turns out, one of the things a machine learning system can do is extract patterns from a bundle of source material, then produce new material from those patterns. It can learn, very roughly, to generalize and create.
Most of my experiments with this generalization have been focused on text. For example, I’ve trained a machine learning system to produce prose in the style of old science fiction stories.
But the approach works for audio, too, and there’s a recent paper called SampleRNN that puts this into practice. Trained on a library of audio, SampleRNN can in turn produce new, never-before-heard sounds.
Now, I need to point out that these techniques are still very, very limited. Trained on all the Beatles albums, SampleRNN won’t spit out new era-defining hits. Mostly, it will spit out garble — but, the garble will undeniably be veined with Beatles-ish sounds! Sadly, the garble never blooms into a chorus or a bridge; SampleRNN doesn’t yet have any sense of a song’s structure. For now, machine learning is no substitute for a piano and a brain.
But! Under certain, very constrained, conditions, SampleRNN’s output can be interesting and maybe even useful. It seems to work best when its input is relatively uniform in style, and when its output can be just a short snippet. Turns out: it works pretty well with klapa!
I used an open-source implementation of SampleRNN created by the programmer Richard Assar, training it on a bundle of klapa MP3s I’ve collected over the years. Then, I asked the system to produce new music using the patterns it had extracted.
Here’s what I got.
It’s… really weird!! But, to my ear, it’s weird in all the right ways. It actually sounds something like what I’d imagined as I was writing about the Mazg and their songs.
And here’s what’s cool: this isn’t Croatian anymore. It’s no language at all, though it has the unmistakable phonetic bounce of language. It seems to me to dance just on the edge of understanding. This is singing that all listeners — speakers of English, of Croatian, of Cantonese — will find equally foreign. I am really, really into that.
I generated a hundred of these songs, selected the ones that sounded the best, and sent them over to the producers of the Sourdough audiobook, truly hoping that these were among the weirdest attachments ever to land in their inboxes. Those producers crafted the sound, finessed it, and now you can hear the result in the audiobook.
One more thing.
I’ve taken pains to call this “machine learning,” not “artificial intelligence,” because, in this case, there’s really nothing intelligent about it. These systems — whether they’re working on text, images, or audio — learn statistical models that allow them to mimic the structure of the material they’re trained on. For example, the system I trained on old science fiction stories knows that if it has generated the character T, followed by the character H, it ought next to generate E, or maybe A, or O — but almost certainly not F, or J, or Q. If you can imagine many, many of these probabilities linked in a sprawling mesh: you have a rough model of the English language!
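That sprawling mesh of probabilities is easy to sketch. Here is a toy illustration in Python, emphatically not the system I trained (that was a recurrent network, and the corpus below is an invented scrap of text), just a character-level counting model that arrives at exactly the kind of T-H-then-E expectation described above:

```python
from collections import Counter, defaultdict

# An invented scrap of sci-fi-ish text to learn from.
corpus = (
    "the robots of the outer moons gathered their thoughts "
    "and thundered through the thin atmosphere"
)

# For each two-character context, count which character follows it.
counts = defaultdict(Counter)
for i in range(len(corpus) - 2):
    context, following = corpus[i : i + 2], corpus[i + 2]
    counts[context][following] += 1

def next_char_probs(context):
    """Return the learned probability of each character following `context`."""
    c = counts[context]
    total = sum(c.values())
    return {ch: n / total for ch, n in c.items()}

# After "t", "h", the model strongly expects "e" (with "o", "u", "r",
# and "i" as rarer possibilities) and gives "f", "j", "q" no chance at all.
print(next_char_probs("th"))
```

Link enough of these little tables together over a big enough corpus and the mimicry starts to look uncanny.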
It works the same way with audio, and as the system learns, a surprising principle emerges.
SampleRNN quickly determines that if it has generated a sample of silence, followed by another sample of silence, it ought next to generate yet another sample of silence. This is because blocks of silence are a common occurrence in all recorded music: beginnings, endings, pauses between sections. As a result, it’s often the case that when you ask SampleRNN to generate new audio, it churns for a few minutes and then proudly produces… pure silence. It’s simply obeying the statistical model it learned.
Working on this project, I deleted a lot of files full of silence.
But sometimes, it’s different. Sometimes, the system’s output begins with silence, and is followed by silence, and on and on, establishing a high-probability void. But remember, these probabilities aren’t absolute. At each step, there’s always a vanishingly small chance that, instead of silence, we ought to hear something different. And sometimes, one of those small chances sparks to life — the dominoes fall a different way — and from silence, we get…
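That dynamic fits in a few lines of toy Python. To be clear, this is not SampleRNN; the probability below is invented for illustration. The point is only that “almost always predict more silence” is not the same as “always”:

```python
import random

def generate(steps=5000, p_stay_silent=0.998):
    """Emit fake audio samples: silence persists, until one day it doesn't."""
    out = []
    silent = True
    for _ in range(steps):
        if silent and random.random() < p_stay_silent:
            out.append(0)  # another sample of silence, as the model expects
        else:
            silent = False  # the dominoes fall a different way
            out.append(random.randint(-32768, 32767))  # sound begins
    return out

samples = generate()
quiet = next((i for i, s in enumerate(samples) if s != 0), len(samples))
print(f"{quiet} samples of silence, then (usually) sound")
```

Run it a few times: mostly you get a long quiet stretch before anything happens, and occasionally the whole run is silence, which is exactly the file I kept deleting.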
Listen for yourself. The audio below begins with silence, and that’s intentional. Keep listening. Wait for the moment when something in the computer’s heart trembles just enough to make a difference.
You know what?
I think that’s what having an idea sounds like.