The Strange Acoustic Phenomenon Behind These Wacked-Out Versions of Pop Songs



Why the brain perceives the "auditory illusions" created when converting an MP3 to MIDI and back.

Earlier this week on Twitter, I came across quite the sonic phenomenon. If you take an MP3 file, convert it to MIDI, then convert it back to an MP3, it sounds totally nuts. Just give a couple of these songs a quick listen so you know what you're dealing with:

Weird, right? Andy Baio, an old-school blogger and journalist, former CTO of Kickstarter, and founder of the XOXO Festival, remixed a couple of songs in this style (after discovering the Mariah Carey one on Tumblr) and wrote about it on his blog, calling the version of "All I Want for Christmas Is You" a "terrible and amazing thing to listen to."


"The resulting version sounds like Mariah as a player piano—none of the original recording is preserved, only a series of hyperactive notes matching the frequencies of the original song," Baio wrote. "Incredibly, you can still make out the lyrics and music, though likely only if you're familiar with the original song."

Baio is right. If you spend your days wandering the mall or listening to Christmas radio, you'll probably recognize Mariah's impressive vocal intro, though it sounds like she's screaming it from somewhere underwater through an alien voice recorder. If you haven't heard the song before, it probably sounds like a stampede of children jumping on an FAO Schwarz floor xylophone.

Before I get into why this isn't merely a "cool thing" but potentially a useful discovery for the field of psychoacoustics, let's discuss what's happening here, technologically speaking.

MIDI stands for Musical Instrument Digital Interface—it's a file format with a complex and important history, but I mostly just remember it as the format countless bros in college used to make bad songs in GarageBand. It serves more or less as a common language that lets instruments talk to each other and to a computer—play a 'C' note on a MIDI keyboard, and the computer will transcribe it so that you can then play it back as a different computerized instrument altogether (these instrument sounds are called "soundfonts").


"MIDI is a description of the music—there's nothing in MIDI that says 'play this note in this way,' it says 'here's the note to play—it's a C-note, and play it using this type of instrument,'" Baio told me. "It's recording the fact you hit these notes at this specific time, but it's not recording the sound wave performance of those notes."
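In other words, a MIDI file is a timed list of instructions rather than a recording. As a rough sketch of that idea (the class and field names here are invented for illustration, not taken from any real MIDI library), one event might look like this:

```python
from dataclasses import dataclass

# A MIDI file stores instructions, not sound: which note, when, how hard,
# and on which instrument ("program," in MIDI terms).
@dataclass
class NoteEvent:
    note: int        # MIDI note number; middle C is 60
    velocity: int    # how hard the key was struck, 0-127
    time: float      # when the note starts, in seconds
    duration: float  # how long it is held, in seconds
    program: int     # instrument number, e.g. 0 = acoustic grand piano

# "Play a C, on a piano, at the start of the song, for half a second."
middle_c = NoteEvent(note=60, velocity=96, time=0.0, duration=0.5, program=0)
```

Nothing in that structure describes the waveform itself, which is exactly Baio's point: the "sound wave performance" is gone, and whatever synthesizer plays the file back supplies its own.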

For this reason, MIDI files are really small and were popular in the early days of the web as a way of putting music on websites.

In this case, the MIDI converter is essentially flattening all of the instruments and vocals into a single track, detecting the frequencies present, and spitting them out as notes. Baio then used a free online MIDI-to-MP3 converter to turn the result back into an MP3 file, played by a grand piano "font."
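That pitch-matching step comes down to a standard formula: MIDI note 69 is A4 at 440 Hz, and each semitone multiplies the frequency by 2^(1/12), so a detected frequency f maps to note 69 + 12·log2(f/440). A minimal sketch:

```python
import math

def freq_to_midi_note(freq_hz: float) -> int:
    """Map a frequency to the nearest MIDI note number.
    MIDI note 69 is A4 = 440 Hz; each semitone is a factor of 2**(1/12)."""
    return round(69 + 12 * math.log2(freq_hz / 440.0))

def midi_note_to_freq(note: int) -> float:
    """Inverse mapping: the exact pitch the piano 'font' plays back."""
    return 440.0 * 2 ** ((note - 69) / 12)

print(freq_to_midi_note(440.0))   # 69 (A4)
print(freq_to_midi_note(261.63))  # 60 (middle C)
```

Every frequency the converter detects—a sung vowel, a bell, the overtones of a drum hit—gets snapped to one of these discrete piano keys, which is why the timbre of the voice vanishes while the pitch contour survives.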

Basically, a piano is playing Carey's vocal notes, the backing vocals, the shimmering drums, the bass, the guitar, the Christmas bells, and everything else in that song, very, very quickly. An artist did this with an actual piano back in 2009, and the idea can be visualized in this YouTube video:

So, that's how it works. But why does it sound like the vocals are actually there? To be clear, they're not: there is no vocal or verbal information there whatsoever. You're merely perceiving notes hit by a digital piano as not only a human voice, but a recognizable one singing actual words. It is, as the video above suggests, an "auditory illusion."


"If you run a song that you don't know and you try to make out the words, you can't. Your brain is filling in the gaps. If you know the words, the lyrics and the song, you can hear the points very very clearly—the piano is enunciating these words," Baio said. "I think it's your brain filling in the blanks with what you're familiar with."

That makes sense, but it's not quite that simple. I emailed James Dias, a researcher at the University of California, Riverside's Audiovisual Speech and Audition Laboratory, to learn what's actually going on here. It seems Baio may have stumbled onto a (slightly) new area of focus for speech researchers.

"This is really cool! I've not run into this specific case before. However, the phenomenon seems similar to other acoustic transformations used in speech research," Dias wrote.

Dias said that research on something called sine-wave speech may help explain what's happening with Baio's MIDI phenomenon. He pointed to the experiment I embedded just above this paragraph, in which a person's voice was converted into a series of bleeps and bloops—"think R2D2," Dias said—to determine whether any meaning could be pulled from them.

When listened to out of context, humans can't decipher much of anything from the sample in the video. But listening to a human speak the "words" that are being converted before hearing the sine-waves changes the context immediately.
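Sine-wave speech is made by replacing each formant (a resonant frequency band of the vocal tract) with a single sine tone whose frequency follows that formant over time. Here's a toy sketch of one such tone track, using only the standard library, with made-up frame frequencies standing in for a real formant measurement:

```python
import math

SAMPLE_RATE = 8000  # Hz; telephone quality is plenty for sine-wave speech

def sine_track(freqs_hz, frame_dur=0.05, rate=SAMPLE_RATE):
    """Render a sequence of per-frame frequencies as one continuous sine tone.
    Phase carries across frames so the tone doesn't click at frame boundaries."""
    samples, phase = [], 0.0
    for f in freqs_hz:
        for _ in range(int(frame_dur * rate)):
            samples.append(math.sin(phase))
            phase += 2 * math.pi * f / rate
    return samples

# Hypothetical first-formant track drifting downward as a vowel changes:
track = sine_track([700, 650, 600, 550, 500])
```

Stack two or three of these tracks, one per formant, and you get the characteristic R2-D2 warble: all of the voice's frequency movement with none of its timbre, much like the MIDI piano version of a song.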


"The most impressive improvements in identifying speech from sine-wave signals are observed after presenting people with the untransformed signal first and then asking them to identify the speech within the sine-wave replica of that same signal," he added. "Years of studying sine-wave speech have revealed that people can identify not only the speech spoken, but also characteristics of the speaker, such as gender, age, and even accent."

Dias says that MIDI is perhaps doing the same thing here. Your brain is only able to recognize the vocals (and the lyrics) once you know what the vocals and lyrics are.

"Knowledge of the songs themselves can improve the experience of actually hearing the singer's voice," he wrote. "The best example of the effect was, for me, the video of Smash Mouth's 'All Star,' where you can often see Steve Harwell mouthing the words along with the audio," which adds a "visual advantage" to the effect.

"I myself have never heard of MIDI transformations as a topic of speech research," he added. "It looks like you have stumbled upon something that may be related to research investigating the influence of other acoustic transformations on speech perception."

And that's how a silly little audio transformation that sounds cool and terrible and otherworldly all at the same time could—and maybe should—become the topic of serious research.