Compressed Sensing Audio Demonstration

by Laura Balzano, Robert Nowak and Jordan Ellenberg
a demo developed for NPR; listen at the All Tech Considered blog.

Compressed Sensing is a signal acquisition and reconstruction technique. It can be used to reconstruct a signal when the signal is known to be compressible; in particular, for those of you familiar with basis functions (the Fourier basis, the wavelet basis, or generalized basis functions), the signal must be sparse in some known basis. Compressed Sensing tries to fit samples of a signal to a sum of the known basis functions, with a preference for using as few basis functions as possible to match the samples. If a sum of a very small number of basis functions reconstructs the signal exactly, we say the signal is sparse in that basis.
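As a toy illustration (a Python sketch with a made-up cosine basis, not the piano-note basis used in the demo below), "sparse in a basis" just means that only a handful of the basis coefficients are nonzero:

```python
import numpy as np

# A hypothetical basis of 100 cosine functions over 256 time points,
# and a signal built from only 2 of them.
n = 256
t = np.arange(n) / n
basis = np.stack([np.cos(2 * np.pi * k * t) for k in range(1, 101)], axis=1)

coeffs = np.zeros(100)
coeffs[[3, 40]] = [1.0, 0.5]          # only 2 of 100 coefficients are nonzero
signal = basis @ coeffs

# Fitting the fully sampled signal back to the basis (here by least
# squares) recovers those coefficients, confirming the representation
# really is sparse: 2 nonzeros out of 100.
recovered, *_ = np.linalg.lstsq(basis, signal, rcond=None)
print(int(np.sum(np.abs(recovered) > 1e-8)))   # -> 2
```

Compressed Sensing asks a harder question than this fit: can we find that same sparse coefficient vector from far fewer than 256 samples of the signal?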

Recently Professor Jordan Ellenberg wrote an article in Wired Magazine, 'Fill in the Blanks', describing how Compressed Sensing is being applied to signals such as MRI. In order to illustrate the concept, the article had a demonstration of how Compressed Sensing (CS) reconstruction works on an image of President Barack Obama. On this page, we give a demonstration of how Compressed Sensing might work on audio signals.

First, a disclaimer: CS is not practical for visible imaging or audio signals, because either (a) current technologies already provide excellent compression, (b) the signals are not sparse enough in any currently known basis for CS to be useful, or (c) even when the signals are sparse in some known basis, current technologies do not capture them in a way that facilitates CS reconstruction.


Image from Jordan Ellenberg's article in Wired Magazine. These images were simulated by Jarvis Haupt and Robert Nowak.
Here we have a demo with three different audio files. The first is a very simple midi-sounding version of Mary Had a Little Lamb. The second is a snippet from a piano sonata by Domenico Scarlatti: the first few seconds of Sonata in D minor K. 9, as played by A. Benedetti Michelangeli. The last is Professor Ellenberg saying "You are listening to NPR, National Public Radio."

For all three songs, we used a very simple basis consisting of piano notes, each lasting a quarter of a second. Each note is generated as a sine wave at the note's frequency, multiplied by a rising exponential and a falling exponential to give it the kind of 'ding' sound you might hear when striking a piano key.
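A sketch of that kind of note synthesis, in Python rather than the Matlab of the actual demo code; the envelope time constants here are illustrative guesses, not the demo's actual parameters:

```python
import numpy as np

def piano_note(freq_hz, fs=44100, dur=0.25):
    """A quarter-second 'ding': a sine at the note's frequency, shaped
    by a fast rising exponential (the strike) and a slower falling
    exponential (the decay). The 5 ms / 100 ms constants are guesses."""
    t = np.arange(int(fs * dur)) / fs
    attack = 1.0 - np.exp(-t / 0.005)
    decay = np.exp(-t / 0.1)
    return np.sin(2 * np.pi * freq_hz * t) * attack * decay

note = piano_note(440.0)   # A above middle C
print(len(note))           # 11025 samples = a quarter second at 44100 hz
```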

We constructed Mary Had a Little Lamb directly from the basis of piano notes. The song is very simple and thus has a very sparse representation in that basis, which makes it ideal for Compressed Sensing reconstruction. The Scarlatti piece can also be represented very well with the piano notes; however, because the notes are only a quarter-second long, you cannot hear the fast note transitions in the song. Lastly, Jordan's voice is very poorly represented by the basis of piano notes, as you will hear.

Playback of these .wav files has only been tested in Safari; we have had trouble with Firefox on the Mac. If you have trouble, please download the files to your computer to listen to them.
The first audio examples are for Mary Had a Little Lamb. We generated the song using our basis; it can be heard in the first audio file. Mary Had a Little Lamb, 44100 hz
Then we undersampled the song uniformly, at only about 170 samples per second. You can listen to three reconstructions. The first uses the Compressed Sensing technique -- what I call the l1-reconstruction, because it penalizes the solution with the l1-norm in order to prefer sparsity. The l1-reconstruction is perfect!

Second you can hear the l2-reconstruction: a simple least squares reconstruction. When there aren't enough samples to distinguish the notes, this technique simply chooses many notes to try to make the best fit. It has no preference for finding a sparse representation.
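Here is a toy version of that comparison in Python. The demo itself used the GPSR solver in Matlab; the iterative soft-thresholding loop below is just a stand-in l1 solver, and the random sensing matrix and dimensions are assumptions for illustration. From 30 samples of a 3-sparse coefficient vector, the l1 solution stays sparse while the minimum-norm least squares solution smears energy across many coefficients:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 100, 30                            # 100 coefficients, 30 samples
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[[5, 37, 80]] = [1.0, -0.7, 0.5]    # 3-sparse coefficient vector
y = A @ x_true                            # the undersampled measurements

# l2: minimum-norm least squares. It matches the samples, but has no
# preference for sparsity, so it uses many coefficients.
x_l2 = np.linalg.pinv(A) @ y

# l1: iterative soft-thresholding (a basic l1 solver). The threshold
# step drives most coefficients exactly to zero.
x_l1 = np.zeros(n)
step = 1.0 / np.linalg.norm(A, 2) ** 2
lam = 0.01
for _ in range(5000):
    x_l1 = x_l1 + step * A.T @ (y - A @ x_l1)
    x_l1 = np.sign(x_l1) * np.maximum(np.abs(x_l1) - step * lam, 0.0)

print(int(np.sum(np.abs(x_l2) > 1e-3)))   # many nonzero coefficients
print(int(np.sum(np.abs(x_l1) > 1e-3)))   # only a few nonzeros
```

The l1 solution concentrates on the three true coefficients, which is exactly the "use as few notes as possible" behavior described above.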

Lastly, you can hear a linearly interpolated file. Because the samples are so few, the resulting signal is very low frequency, and it is almost inaudible. You can also hear aliasing, a phenomenon where high notes wrap around and sound like low notes. To understand this, think about watching a car's hub caps or a fan spinning: often what you see instead looks like the wheel spinning at a very low rate, or even backwards. That is aliasing.

l1-reconstruction of Mary, 170 hz
l2-reconstruction of Mary, 170 hz
linear interpolation of Mary, 170 hz
Next we undersampled even more: we took only 456 uniform samples, about 63 per second. Now the l1-reconstruction is not perfect; however, you can hear that it is still very sparse -- it tries to use as few notes as possible to match the samples it gets. The l2-reconstruction, on the other hand, gets even noisier; it uses as many notes as it needs in order to match the very few samples as exactly as possible.

I did not post the linearly interpolated signal because it was inaudible.

l1-reconstruction of Mary, 63 hz
l2-reconstruction of Mary, 63 hz
A neat thing about CS reconstruction is that we don't need uniformly spaced samples to reconstruct the signal. In fact, randomness in the sample locations tends to allow better l1-reconstruction. That can be heard in this file, where we reconstruct from 456 randomly spaced samples -- still about 63 per second on average, but the reconstruction is better. The l2-reconstruction is still noisy, and the linearly interpolated signal is still inaudible. l1-reconstruction of Mary, 63 hz random
l2-reconstruction of Mary, 63 hz random
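One way to see why randomness helps, sketched in Python with illustrative frequencies: under uniform sampling at 63 samples per second, two different tones can produce identical samples, so no reconstruction method could tell the corresponding basis functions apart; random sample times break that tie.

```python
import numpy as np

rng = np.random.default_rng(2)
uniform_t = np.arange(63) / 63               # 63 uniform samples in one second
random_t = np.sort(rng.uniform(0, 1, 63))    # 63 random sample times

f1, f2 = 10.0, 73.0                          # 73 = 10 + 63: f2 aliases onto f1
alias_uniform = np.allclose(np.sin(2 * np.pi * f1 * uniform_t),
                            np.sin(2 * np.pi * f2 * uniform_t))
alias_random = np.allclose(np.sin(2 * np.pi * f1 * random_t),
                           np.sin(2 * np.pi * f2 * random_t))

# Uniform sampling cannot distinguish the two tones; random sampling can.
print(alias_uniform, alias_random)           # -> True False
```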
Okay, so far we've been able to illustrate what Compressed Sensing can do on this simple song. What does it do on a more complicated one? Next we looked at the first few seconds of Sonata in D minor K. 9, the piano sonata by Domenico Scarlatti, as played by A. Benedetti Michelangeli. Here is the original clip that we used. Scarlatti song clip, 44100 hz
The song is beautiful and full of quick sounds -- can our basis do a good job of capturing it even with full information? This file is the best least squares (l2) fit of the entire, fully sampled song to our basis of quarter-second piano notes. It does a nice job, but it definitely misses the detail of the Scarlatti piece. Also, it doesn't sound as sparse in our basis as Mary Had a Little Lamb did -- in fact, if you look at the coefficients of the fit, they aren't particularly sparse. Best fit of the Scarlatti song clip, 44100 hz
We next undersampled Scarlatti-- we took every 40th sample, resulting in about 1100 samples per second.

The Compressed Sensing l1-reconstruction does an okay job of matching the best fit to our basis. The l2-reconstruction, once again, uses too many notes and results in a very noisy signal reconstruction-- beware, this one is ear-piercing.

This time, there are enough samples for linear interpolation to give something audible; however, we can still hear the aliasing in this clip. So none of our techniques works very well on such an undersampled version of Scarlatti's sonata.

l1-reconstruction of Scarlatti, 1100 hz
l2-reconstruction of Scarlatti, 1100 hz
linear interpolation of Scarlatti, 1100 hz
Finally, we recorded Jordan saying "You are listening to NPR, National Public Radio." The first file is the original, our recording of Jordan's voice, with 44100 samples per second. Jordan saying "NPR"
Now, our basis consists of quarter-second piano notes -- can that be used to sound like Jordan's voice? It turns out the answer is no. Here is the best approximation of Jordan's voice using our basis of piano notes -- the least squares (l2) fit of all the samples to the basis. As you can hear, it doesn't capture much. Best fit of Jordan saying "NPR"
Here we have the CS l1-reconstruction and an l2-reconstruction from 8268 samples -- that's only about 2200 samples per second. These again use the piano note basis. The point is that without a basis that can represent the signal well, CS cannot reconstruct it.

Lastly, we have a reconstruction of Jordan's voice using linear interpolation. Though the audio sounds scratchy, at least we can make out what Jordan is saying. This time, linear interpolation wins! Why? Jordan's voice is low enough in frequency, and our samples fast enough, that we don't experience aliasing. Also, our ears are very attuned to hearing speech even in a noisy signal. So in this example, linear interpolation is good enough for us to hear Jordan's voice.

l1-reconstruction of "NPR", 2200 hz
l2-reconstruction of "NPR", 2200 hz
linear interpolation of "NPR", 2200 hz
This concludes our demonstration of Compressed Sensing on audio.
But wait! If you are curious about how all this works, there's more! I prepared everything you heard here in Matlab, and here is the code. For the Compressed Sensing reconstructions I used the algorithm GPSR, Gradient Projection for Sparse Reconstruction; its code is available at the link.

To first create the basis of our piano notes, run this file:

createBasis.m

Next, to create the song of Mary Had a Little Lamb, run this file:

generateMarySong.m

Next you can use these three files to redo what I did above for the Mary, Scarlatti, and NPR song files. You will need the Scarlatti original song clip and the NPR original sound clip in the same directory as these m-files. Read the comments to see what sampling rate is being used; at the end of each file, you can give the output song files a name to be written to your computer.

maryCS.m
scarlattiCS.m
nprCS.m

Thank you for your interest in Compressed Sensing! If you have any questions feel free to email me, Laura Balzano. My email address is sunbeam, followed by the 'at' sign, followed by ece.wisc.edu.