We all use Shazam to find out what song is playing at a restaurant, and expect Spotify to recommend a genre-based playlist. But have you ever wondered how that “magic” works? Who is behind all the playlists on YouTube, Spotify and Apple Music?
Who are those funny little people, sorting rock songs into categories just for our convenience?
An online service that offers a vast amount of music has to sort it somehow. In most cases, the enigmatic field of ‘Music Information Retrieval’ is called upon for help.
Let’s say you’re a guitar player looking for chords to a new song you just heard on the radio. You probably think somebody out there already found out about that particular song, and was kind enough to transcribe the chords. Well, not always. Sometimes there are no chords at all, and an automated solution is needed.
"Hum the Tune to a Microphone and Let the Computer Figure It Out"
At utab we wanted to create the biggest repository of synced chords in the world. So we turned to technology for a solution. We incorporated an algorithm so smart it is capable of establishing a chord timeline for every song in the world, with surprising success rates. It takes just under 15 seconds, regardless of recording quality, release year or music genre.
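utab has not published how its algorithm works internally, but a common textbook approach to chord recognition is template matching on chroma vectors: 12 numbers per audio frame, one per pitch class, matched against the note patterns of known chords. A minimal sketch in plain Python, with a hypothetical chroma frame as input:

```python
import math

# 12 pitch classes: C, C#, D, ..., B
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def chord_templates():
    """Binary templates for the 12 major and 12 minor triads."""
    templates = {}
    for root in range(12):
        major = [0.0] * 12
        for interval in (0, 4, 7):          # root, major third, fifth
            major[(root + interval) % 12] = 1.0
        templates[NOTE_NAMES[root]] = major
        minor = [0.0] * 12
        for interval in (0, 3, 7):          # root, minor third, fifth
            minor[(root + interval) % 12] = 1.0
        templates[NOTE_NAMES[root] + "m"] = minor
    return templates

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def label_frames(chroma_frames):
    """Assign the best-matching chord label to each chroma frame."""
    templates = chord_templates()
    return [max(templates, key=lambda name: cosine(frame, templates[name]))
            for frame in chroma_frames]

# A frame where the pitch classes C, E and G dominate is labelled "C".
frame = [1.0, 0, 0, 0, 0.9, 0, 0, 0.8, 0, 0, 0, 0]
print(label_frames([frame])[0])  # C
```

Labelling every frame of a song this way, then merging runs of identical labels, yields exactly the kind of chord timeline described above; production systems add smoothing, key awareness and richer chord vocabularies on top.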
But, what if you’re not a musician, and you just want to find a certain song in your favorite record store? Wouldn’t it be nice to just hum the tune into a microphone, let the computer figure it out and suggest the desired song based on what you just hummed?
This, too, is part of the constantly evolving, multidisciplinary field of study called ‘Music Information Retrieval’, or MIR for short.
Dr. Holger Kirchhoff from ‘zplane.development’, the developers of ‘Kort’, the algorithm behind utab explains: “Traditionally, ‘Information Retrieval’ is the science of finding information in large data collections. ‘Multimedia Retrieval’ and more specifically ‘Music Information Retrieval’ focuses on extracting all kinds of information directly from the music, such as tempo, time signature, beat times, melody, chords, structure, instrumentation, etc.”
Search engines like Google and Bing will ultimately help musicologists find out how composers influenced one another, police units could analyse audio recordings made by surveillance equipment for suspicious sounds, and even film makers will benefit from being able to find the best-suited sound effect from a vast library of audio recordings in no time.
Although ‘Music Information Retrieval’ has the word ‘music’ in it, it is not a music-centric technology. It is already at the core of many applications in other fields, such as agriculture, home security and mobile phones.
Scientists all over the world “aim at extending the understanding and usefulness of music data”, says Prof. Juan Pablo Bello from the New York University, and explains, that it could be done “through the research, development and application of computational approaches and tools.”
In Bello’s opinion, there is still much to learn about this field of study, since it’s “grounded in the combined use of theories, concepts and techniques from music, computer science, signal processing and cognition” (MPATE-GE 2623 ‘Music Information Retrieval’).
MIR scientists encounter many difficulties on a day-to-day basis. The wide variety of aspects in this field of study makes it almost impossible for one person to master, which is why healthy collaboration is a necessity.
While the rest of the world is only now starting to understand the advantages of knowledge sharing and agile frameworks (teams with diverse fields of expertise), MIR was forced to work this way from the very beginning in order to accomplish the extraction of data from a given audio track.
Psychoacoustics for Computers
Today’s scientists have an unprecedented amount of computing power available at their fingertips. They can execute tremendous amounts of calculations within seconds, and also scan and analyse large amounts of data in no time. But was it always like this? Could they do it with outdated computers? The first and obvious answer is no, and yet, early scientists managed to lay the foundation for ‘Music Information Retrieval’.
In the beginning, MIR was a programming language with musical expressions, meaning that common music-theoretical expressions appeared both as part of the programming language and as the output of a MIR computer program.
In simple words, ‘Music Information Retrieval’ was a programming language you had to punch onto a special card and feed into a huge computer, hoping the whole thing wouldn’t fall apart.
Nowadays, MIR is all about connections: psychology, computer science, signal processing, machine learning, information science and human-computer interaction. When you combine all these fields, you get a remarkable learning process that evolves into places we wouldn’t think of otherwise.
The MIR pipeline comprises three distinct yet related stages:
1. Hearing Representation
2. Understanding Analysis
3. Acting Interaction
Hearing Representation is the first stage of it all. It simply means “listening” to a given audio source (an original song or a cover version, for example), applying signal processing methods and finally extracting the musical content: pitch/harmonic analysis, rhythm analysis and sound-characteristic analysis.
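As a toy illustration of the pitch-analysis part of this stage, the sketch below synthesizes a pure 440 Hz tone and finds which of several candidate frequencies carries the most energy, using the Goertzel algorithm (a standard way to evaluate a single DFT frequency). The sample rate and the candidate list are arbitrary choices for the example:

```python
import math

SAMPLE_RATE = 8000  # Hz; arbitrary for this example

def sine(freq, n_samples, rate=SAMPLE_RATE):
    """Synthesize a pure tone at `freq` Hz."""
    return [math.sin(2 * math.pi * freq * n / rate) for n in range(n_samples)]

def goertzel_power(samples, freq, rate=SAMPLE_RATE):
    """Signal power at one target frequency (Goertzel algorithm)."""
    k = 2 * math.cos(2 * math.pi * freq / rate)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s = x + k * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - k * s_prev * s_prev2

def estimate_pitch(samples, candidates):
    """Return the candidate frequency with the most energy."""
    return max(candidates, key=lambda f: goertzel_power(samples, f))

tone = sine(440.0, 2048)  # a pure A4
print(estimate_pitch(tone, [220.0, 330.0, 440.0, 660.0]))  # 440.0
```

Real recordings are, of course, nothing like a pure sine wave, which is why this stage also has to model how human hearing deals with messy, overlapping sounds.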
This stage is named “Hearing Representation” because the MIR program tries to mimic the human ear and the “glitches” in our physiological system. Psychoacoustics, our perception of sound, is affected directly by the physical structure of our ear canal and by our brain’s interpretation.
One example is hearing two different instruments playing the exact same rhythm with similar tonal characteristics. Our brain combines the information, and we unconsciously hear only one instrument: the louder one with the lower frequency.
This psychoacoustic artifact is known as “masking”, and any computer that doesn’t account for it will have a hard time representing the data the way a human being perceives it.
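A crude way to give a computer such an account of masking is a rule of thumb: treat a tone as masked when a sufficiently louder tone falls within the same critical band. The sketch below uses the Glasberg–Moore ERB formula for the band width; the 10 dB loudness threshold is an illustrative assumption, not a measured value:

```python
def erb(freq_hz):
    """Equivalent rectangular bandwidth (Glasberg & Moore, 1990), in Hz."""
    return 24.7 * (4.37 * freq_hz / 1000.0 + 1.0)

def is_masked(tone_hz, tone_db, masker_hz, masker_db, threshold_db=10.0):
    """Toy rule: a tone is masked by a much louder tone within one ERB.

    The 10 dB threshold is an assumption made for illustration only.
    """
    within_band = abs(tone_hz - masker_hz) <= erb(masker_hz)
    return within_band and (masker_db - tone_db) >= threshold_db

print(is_masked(1000, 50, 1020, 70))  # True: close in frequency, much louder
print(is_masked(1000, 50, 2000, 70))  # False: outside the critical band
```

Real psychoacoustic models, such as those inside MP3 encoders, replace this single rule with frequency-dependent spreading curves, but the principle is the same.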
Another interesting gap between humans and computers is the perception of rhythm. While listening to your favorite track, it’s very easy to keep moving to the beat even when there is no steady pulse. The continuous motion we feel is hard for a computer to represent, and this simple act of “beat calculation” can be fairly hard to execute.
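One simple (and far from complete) approach to beat calculation is to autocorrelate an onset envelope, a curve that spikes whenever a note starts, and pick the lag at which the signal best lines up with itself. A sketch on a synthetic envelope:

```python
def estimate_period(onset_env, min_lag, max_lag):
    """Pick the lag whose autocorrelation is highest: the beat period in frames."""
    def autocorr(lag):
        return sum(onset_env[i] * onset_env[i - lag]
                   for i in range(lag, len(onset_env)))
    return max(range(min_lag, max_lag + 1), key=autocorr)

# Synthetic onset envelope: a pulse every 8 frames.
env = [1.0 if i % 8 == 0 else 0.0 for i in range(128)]
print(estimate_period(env, 4, 16))  # 8
```

On real music the envelope is noisy, the tempo drifts, and listeners happily tap along anyway, which is exactly the human-computer gap described above.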
Understanding Analysis is a complex stage that utilizes machine learning to process and retrieve data in a form a computer can understand. This is a crucial step if, for example, you wish to search for a song that you heard, love or hummed.
For us, to ‘Google’ a certain song by singing it means, for a computer, “go fetch the song you think we just sang or uploaded, retrieve this information and show us related songs that you think will interest us”.
This whole process is accomplished without any textual help or other input parameters. The program needs to sonically ‘hear’ and understand what is inside the given song: to determine the genre and approximate time of recording, and to compare it against a database by statistical properties such as timbral texture, rhythmic structure and harmonic content, all within seconds.
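At its simplest, that database comparison is a nearest-neighbour search over feature vectors. The sketch below uses a made-up three-number descriptor per song; real systems use far richer features and approximate search structures, but the idea is the same:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def most_similar(query, library):
    """Return the title whose feature vector is closest to the query."""
    return min(library, key=lambda title: euclidean(query, library[title]))

# Hypothetical descriptors: (tempo, brightness, harmonic complexity),
# each scaled to 0..1. Purely invented numbers for illustration.
library = {
    "Song A": (0.9, 0.8, 0.3),
    "Song B": (0.2, 0.3, 0.7),
    "Song C": (0.5, 0.5, 0.5),
}
hummed = (0.85, 0.75, 0.35)  # features extracted from the user's humming
print(most_similar(hummed, library))  # Song A
```

The hard part, of course, is the previous stage: turning raw audio or humming into feature vectors robust enough that this comparison is meaningful.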
Acting Interaction is the third and final stage of MIR. It harnesses the data from the two former stages and puts it to use in different ways, such as pitch modification and audio-to-MIDI conversion, to name a few.
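The audio-to-MIDI part of this stage ultimately rests on one standard mapping: MIDI note number 69 is A4 at 440 Hz, and each semitone is a factor of 2^(1/12) in frequency. Once the previous stages have estimated pitches, converting them to notes is a one-liner:

```python
import math

def freq_to_midi(freq_hz):
    """Map a frequency to the nearest MIDI note number (A4 = 440 Hz = 69)."""
    return round(69 + 12 * math.log2(freq_hz / 440.0))

print(freq_to_midi(440.0))   # 69  (A4)
print(freq_to_midi(261.63))  # 60  (middle C)
```

The formula is exact; the difficulty in practice is everything before it: deciding which frequencies in a polyphonic recording are actual notes.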
Could this technology eventually replace real-life musicians? Yotam Laufer, VP R&D at utab.com, does not think this is possible at this stage: “We need to remember that music-wise, there is no substitute for a simple human ear.
"In today's world, the fact that you, as an individual, hear 10 different music genres a day makes our work a bit harder, since we need to come up with a ‘one stop shop’ solution for each song we may encounter. I can say that analysing and providing chords for pop songs is 99% accurate and perfectly synced with the song itself. That said, with the major advances in AI, new solutions and ideas will surely emerge. We already witness computers writing music and poems that are in some cases indistinguishable from human creation. The ability to transcribe songs to the note is a logical next step.”

utab.com is a working music platform that answers the constant need of musicians across the world by providing simple yet accurate chords for every song, with the help of music analysis that utilizes the three MIR pipeline stages in an elegant, user-friendly way.
What is MIR Good For Anyway?
Thomas A. Edison once said, “Just because something doesn’t do what you planned it to do doesn’t mean it is useless.” From traffic jams to respiratory infections in pigs, ‘Music Information Retrieval’ was planned to do one thing, but now it is everywhere.
As human beings we constantly use our senses in various situations. While you can’t smell or see music, you can get fairly excited whenever you recognize a good tune. Music has more to it than brief changes in atmospheric pressure: the sounds we hear are highly informative and can remind us of a place, a time and even our loved ones.
The same happens in our everyday life. Advertisers know all the secrets when it comes to luring in potential consumers. They manage to do so by engaging only 2 senses, sight and hearing. Through sophisticated commercials, advertisers can make us feel a need and cause us to act upon it. An important part of these commercials is using the correct music.
If you think about it, we are surrounded by many different sounds all day long. We can separate spoken words from a noisy street and spot a familiar voice in a loud space easily. How can computers help us organize such vast amounts of sound?
I interviewed Mark Plumbley, who holds a Ph.D. in neural networks and is now Professor of Signal Processing at the University of Surrey, UK. He introduced me to some mind-blowing things that are happening in MIR right now.
Prof. Plumbley pointed out the company Audio Analytic, which developed a program that can detect telling noises such as a crying baby, breaking glass or a gunshot. After detecting the sound, the program sends a text message saying, “Hi, something went wrong”. If you are somewhere else in the house, too far away to hear the baby, this could be extremely beneficial. Audio Analytic can even detect aggression in a human voice.
Additionally, according to research published by Ferrari, Silva, Guarino, Aerts and Berckmans in 2008, computers can detect respiratory infections in pigs by ‘listening’ to their coughs (Computers and Electronics in Agriculture, 64(2), 318–325). Amazing as it sounds, that’s just the beginning.
What about a user interface to harness the powers of MIR? Xavier Serra, Director of the Music Technology Group at the Universitat Pompeu Fabra in Barcelona and coordinator of the ‘AudioCommons’ initiative, is glad to answer.
“The AudioCommons initiative aims to develop MIR technologies to facilitate the use of open audio content by the creative industries”. In other words, they want to make the lives of sound creators and sound professionals much easier by letting them collaborate through a one-stop-shop solution that will hold all of the world’s music and sound effects, along with the creative rights to both, in one place.
This is a great effort from the AudioCommons people. Jamendo and freesound.org are already a working part of this initiative. They intend to grow and improve this industry with the help of some very talented people.
“Imagine exploring Google Earth and zooming in to hear the actual sounds”
In the future of the MIR industry, we will see better music recommendations. These recommendations will improve the experience by focusing more on how the music sounds and less on user-driven data. This way we will discover new bands, new artists and new music in a more natural way.
Another interesting point Mark Plumbley mentioned is that our participation in music as a community will grow. Soon, music enthusiasts with no musical background at all will be able to take part alongside the industry’s leading producers by sharing knowledge with one another, similar to the way Instagram affected the world of photography.
Plumbley gave another example of future implementation of MIR: “Imagine exploring Google Earth and zooming in to hear the actual sounds of a specific place in real time. This can happen only with a ‘smart city’, capable of retrieving sound information about traffic jams or a mass gathering happening across town from the simplest microphones already implemented in today’s surveillance cameras”.
The technology is out there, and many scientists are working hard on scaling up the database. It is not that easy: research in MIR is not yet as advanced as speech or vision in using methods like ‘deep learning’, because there is not enough valuable information to work with. There is, in fact, a lot of information available, but MIR needs to “hear” isolated sounds first; only then will it be able to find those sounds in real-life situations.
“On YouTube, for example, in a football [soccer] game you do not know the individual sounds occurring all the time,” says Prof. Mark Plumbley. That is why the best way to ‘teach’ the machine is through crowdsourcing. That way, “where the machine is unsure, it asks the human to help”. This is what we call ‘Active Learning’.
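The active-learning loop Plumbley describes can be sketched in a few lines: label automatically when confident, ask a human otherwise. The classifier, the clip names and the confidence threshold below are all made up for illustration:

```python
def label_sounds(clips, classifier, ask_human, confidence_threshold=0.8):
    """Label each clip automatically; defer to a human when unsure."""
    labels = []
    for clip in clips:
        label, confidence = classifier(clip)
        if confidence < confidence_threshold:
            label = ask_human(clip)       # the machine asks the human to help
        labels.append(label)
    return labels

# Hypothetical classifier: confident about whistles, unsure about everything else.
def toy_classifier(clip):
    return ("whistle", 0.95) if "whistle" in clip else ("unknown", 0.3)

def toy_human(clip):
    return "crowd cheering"              # the human's answer for unsure clips

print(label_sounds(["whistle_01", "noise_17"], toy_classifier, toy_human))
# ['whistle', 'crowd cheering']
```

The human answers then become new training data, so the machine asks fewer questions over time; that feedback step is what makes the learning “active”.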
Can an algorithm extract the chords of any given song in under 15 seconds?
We went through a lot of information in this article. We now understand the importance, and the inevitable progress, of ‘Music Information Retrieval’ technology. The majority of this research is shared among only about 200 scientists and is mostly not available to the public, not because it is a secret, but because it is only in its early stages.
Don’t forget to listen closely!