Machine learning is a vast field, but a huge part of it exists for what may seem to be a relatively narrow subset of problems. These are problems involving visual processing: character recognition, facial recognition, the generation of trippy images dominated by populations of dogslugs, birdlegs, and spidereyes.
This isn't accidental. Image data is uniquely suited to machine learning tasks. It naturally occurs as multidimensional arrays—tensors, really—of pixel data. Audio data, by contrast, mostly gets its turn at the fringes of machine learning. Part of the problem is that, despite the vast amounts of digital audio that exist in the world, there is a relative lack of openly accessible computational datasets. There's pretty much just one, actually: the Million Song Dataset, which offers some 280 GB of feature data extracted from 1 million audio tracks. Musicology remains largely old-school.
A group of computer scientists based at University College London and Queen Mary University of London wants to change this. To that end, they've developed the Digital Music Lab (DML) system, a large-scale open-source framework for the analysis of musical data that, crucially, allows for the remote analysis of copyrighted materials (read: most music). "Our first installation has access to a collection of over 1.2 million audio recordings from multiple sources across a wide range of musical cultures and styles, of which over 250,000 have been analysed so far, producing over 3TB of features and aggregated data," they write in ACM Journal on Computing and Cultural Heritage.
The project started in earnest in 2014 with an initial workshop attended by 48 musicologists of varying technical backgrounds. It was basically a big discussion of the problem laid out above (the lack of computational resources for analyzing music data) and what might be done to solve it. The first part of that was just the dataset itself, which should contain basic metadata about individual samples and information about where they can be accessed. The second is the actual analysis.
This is the hard part. Each individual recording contains millions of data points that need to be analyzed for musical content, and the upshot is that analyzing a recording takes roughly as much processor time as the recording itself lasts.
"The data are richly structured: there is not just one analysis needed but several different ones, for example, for melody, harmony, rhythm, timbre, similarity structure, and so on," the researchers explain. "These create a rich set of relations within and between works and collections that is more complex than, for instance, typical textual data."
You're much better off just playing with the thing than having me paraphrase a technical paper about DML. It has a handy web interface here.
Or watch the demo:
Of course, musical recordings are fundamentally different from images. The vast ImageNet image recognition dataset doesn't consist of millions of careful compositions (or "compositions" at all), but instead of all manner of images gathered from the internet. Image recognition is ultimately a means for computers to better interface with the real world. So, no, it's not quite a fair comparison, but the need for better musical analysis tools in the era of big data seems clear enough.