This is the first of what I hope will be a series. We’ll call this part, Only Getting Halfway with AS3’s Built-In Sound Capabilities. I’ve spent close to a week working on a little application, which has so far turned out pretty nice. The end result is an AIR app that reads sound files, draws their waveforms, and then saves out a PNG of each sound’s waveform.
Easy peasy, right?
Sure enough, we thought this would be a pretty quick job. Just use the built-in SoundChannel.leftPeak and rightPeak properties, or perhaps SoundMixer.computeSpectrum(). Load an MP3, set up a loop to play a Sound from an incremented position, and grab the peak values. Should take a few minutes, tops!
Except that that approach has several flaws.
The first problem is that the loop/play/read approach doesn’t work the way you’d think it does. When you start playing a Sound and then grab the left/rightPeak, you get 0 back as the first result. Always. So, you’ll next try setting up a Timer to alternately start playing the Sound, then read the peak a few moments later. And you’ll get the same result: 0. Same sort of deal with computeSpectrum(). The only way I was able to get this approach to work was to make a second peak read an interval of time after the first. So, my Timer would drive a function that would alternately play the Sound from an incremented position and immediately read the peak, and then read the peak again on the next Timer interval. And repeat.
That was a hassle, but at least we can now grab peak information from the MP3. Well, sort of. leftPeak and rightPeak return a value from 0 to 1. We don’t get negative values. This was a minor issue, as you could certainly argue that when zoomed out enough on the graphic, you can get away with simply drawing a mirror image of the positive value. Except…what if we did want that data? We’re drawing waveforms here, and I won’t get into why we need to do so, but we might need that detail. Which brings me to gripe #3.
There’s going to be a limit to how much detail we can get out of a Sound, using left/rightPeak. Given that telling a Sound to play from a certain position seems to be an inexact science, and that we can only grab that peak data after waiting for a TimerEvent to fire, which is also an inexact science, we’re limited as to how accurate we can really be.
But even if we’re OK with a certain degree of inaccuracy and resolution limitation, there’s something else that I started to realize as I was experimenting with trying to get this to work. Imagine a drum track, by itself (no other instruments). Drums produce highly transient sounds; that is, they peak quickly, last just a few milliseconds, and then decay quickly. You’ve probably seen a drum waveform, with its extra-spiky, comb-like appearance. You’ve probably also seen a, say, rock guitar waveform, which tends to have a lot more “body.” In the image below, a drum track is on top, and a guitar track is on bottom.
So, imagine that we’re gathering sample data at a certain rate. Also imagine that the sampling rate (of the peak data gathering, not the audio file itself) is at a lower resolution (for a more “zoomed out” waveform) and, most importantly, not in sync with the tempo of the sound. In the image below, the situation is exaggerated, but it’s a drum track (a little more zoomed-in than the last one) with peak data samples happening where the white lines occur. Notice that the samples have completely failed to capture the fact that two large peaks happened, and managed to sample peak data that is actually very low energy. This is an inherent problem with simple polling of highly dynamic data.
Ideally we’d have a “sample window” in which we take all of the actual samples as they exist in the audio file for a certain duration, and average them (or something) into a bit of data for waveform drawing. That is, instead of taking a single data sample at each of the above white lines, we’d take all of the data between the lines, and come up with a value to represent that duration of time. And here’s about where we hit the limits of using the SoundChannel and SoundMixer classes, as those represent only instantaneous data, not continuous data.
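To make the polling-versus-window difference concrete, here is a minimal sketch. It is written in Python rather than AS3, and everything in it is hypothetical: the synthetic "drum" track, the function names, and the choice of a min/max pair as the value that represents each window (the post only says "average them (or something)"). It assumes we already have the raw samples as a list of numbers from -1 to 1, which is exactly what SoundChannel and SoundMixer don’t give us.

```python
import math

def make_drum_track(length=1000, hits=(130, 630), burst=40):
    """Hypothetical test signal: silence with two short, decaying bursts."""
    track = [0.0] * length
    for hit in hits:
        for n in range(burst):
            track[hit + n] = math.exp(-n / 8.0) * math.cos(n * 0.9)
    return track

def point_samples(samples, step):
    """Poll a single value every `step` samples (the timer-style approach)."""
    return [abs(samples[i]) for i in range(0, len(samples), step)]

def window_peaks(samples, step):
    """Reduce every window of `step` samples to its (min, max) pair."""
    peaks = []
    for start in range(0, len(samples), step):
        window = samples[start:start + step]
        peaks.append((min(window), max(window)))
    return peaks

track = make_drum_track()
polled = point_samples(track, 100)    # polls land between the drum hits
windowed = window_peaks(track, 100)   # each window sees everything inside it
```

With this made-up track, every polled value falls in the silence between hits, while the windowed version keeps both the positive and negative extremes of each burst, which is also the detail a mirrored 0-to-1 peak can’t provide.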
And to top it all off, the deal-killer is that in order to get the peak data (using either left/rightPeak or computeSpectrum()), you have to be actually playing the song. Seems like a “no doi,” I know, but it’s not until you actually try this that you realize that annoyance levels rise very quickly to dangerous, even lethal, amounts when you have to listen to the sound playing back in a stuttered fashion. All those play() calls from the Timer actually play the sound. But only for 50 ms or so until you stop it in the next TimerEvent. But then you play it again 50 ms later…and so on.
“Gee Dru,” I hear you saying, “just set the Sound’s volume to 0.” Yes, good point. I tried that. Turns out that both left/rightPeak and computeSpectrum() are “post fader,” that is, they get the sound energy level after the Sound’s volume(s) is adjusted. Set the Sound’s volume to 0, and you’ll get 0 out of every peak read operation.
“Well can’t you just turn your computer speakers down while you do this?” Um, no.
What’s a boy to do? My next thought was to look into reading the actual sound file and see if we can get the sample data from the ByteArray. Stay tuned for the next post for more on that.



15 comments
August 1, 2008 at 9:06 am
dub
Hi,
I’ve been doing a lot of work recently experimenting with sample data and ByteArrays.
In particular editing and mixing sound.
Would you be interested in this type of thing? Might be cool to share our “A-HA” moments?
August 1, 2008 at 9:51 am
Joa Ebert
You could use Sound.extract() now with Flash 10. Using computeSpectrum() or the left/right peak values is a bad idea because the spectrum data is only updated every 44ms or so.
With sound extract() you can be accurate. Remember that when displaying a waveform and zooming out you have to make sure that you do not miss a peak if you squish the waveform.
August 1, 2008 at 9:46 pm
davr
Might wanna check out Flash 10…it has some very handy functions for dealing with sound that you don’t have in Flash 9. It would make this a lot easier.
August 2, 2008 at 6:04 am
drukepple
We’ve been playing around with Flash 10, and are really excited about it…but unfortunately we needed to get something built today that uses Flash 9. In fact, the whole extract() thing pretty much solves the biggest issue we’ve been having with another audio piece we’ve been working on. Lucas blogged about it a week or so ago
We’ll definitely remember that, though, when we go to upgrade the piece.
And thanks for the advice! I’ve looked at the Pop Forge stuff for a little inspiration on this, so it’s interesting to end up full circle, as it were.
August 11, 2008 at 10:49 pm
Tutorials | Flash/AS3 Tutorials Roundup « Flash Enabled Blog
[...] Wave Theory in ActionScript 3 [...]
August 12, 2008 at 10:06 am
Seb
What a nightmare. Just goes to show how we should never imagine a job taking “a few minutes” – especially when it comes to AS3.
I spent nearly a week with similar levels of issues using Papervision3D on roll over/out logic on moving items.
Glad that you took the time to log and share these issues with us though, in case we should fall foul of them also.
October 24, 2008 at 10:41 pm
My code blog. » Blog Archive » Flash Actionscript 3 Waveform Generation Class
[...] to the Summit Projects Flash Blog and Thibault Imbert at ByteArray for their comments and input on the different techniques that went [...]
December 6, 2008 at 2:24 am
webtry
Hi,
I’m working on a visual spectrum analyzer and I think I have a similar sort of a problem. I’ll try to explain it as briefly as I can.
As you know there are two types of analyzers you can have (switching FFTMode parameter) with computeSpectrum method: a waveform-type and a frequency-type which I’ll call ‘wave-type’ and ‘bar-type’ respectively.
Now the visualization seems at least acceptable with the wave type.
But in the bar-type analyzer, the lower frequencies (bass) are calculated way too high and are almost always non-zero. The higher frequencies (treble), on the other hand, look very low at all times.
I have compared it to winamps’s default bar type visualizer and it’s awful.
I have detected a reason for trebles being low at all times.
I’ll post this temporary solution, which I’m not sure about its correctness, in the next comment as this is getting too long.
December 6, 2008 at 2:44 am
webtry
Continuing from my previous comment:
Some quick notes on my bar-type analyzer properties:
-computeSpectrum computes 512 values: 256 for the left channel and 256 for the right. I take the average of the left and right values, since we’ll display a single analyzer for both channels.
-Only a limited number of bars will be displayed, not one bar for each of the 256 values. So I again take the average of some ‘groups’ of values, based on the total number of bars displayed on screen (if we display 32 bars, for instance, we need groups of 8 values: 8×32=256).
So here is what I inspected:
-I displayed each of the 256 numeric values on screen while playing a song. Two sets of data: one calculated for wave-type (FFTMode=false) and one for bar-type (FFTMode=true).
-As I expected, there were almost never zero values in wave-type, but I got lots and lots of zero values (mostly in the high frequencies) in bar-type. So this means that when I take the average of 8 values in bar-type, I get 5 or 6 zeros and just 2 or 3 non-zeros, for instance. And this is what makes the higher frequencies look so low.
-So if I omit zero values (or set a similar threshold) while taking the average, I thought I might have a better solution.
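The grouping described in these notes can be sketched roughly as follows. This is Python rather than AS3, and the function names, the `skip_below` parameter, and the sample numbers are all made up for illustration; it is not the code from the swf files mentioned below.

```python
def average_channels(left, right):
    """Average the left and right values into a single list."""
    return [(l + r) / 2.0 for l, r in zip(left, right)]

def group_into_bars(values, bar_count, skip_below=None):
    """Average equal-sized groups of values into `bar_count` bars.

    If `skip_below` is given, values at or below it are left out of the
    average (the 'omit zero values or set a threshold' idea).
    """
    group_size = len(values) // bar_count
    bars = []
    for b in range(bar_count):
        group = values[b * group_size:(b + 1) * group_size]
        if skip_below is not None:
            group = [v for v in group if v > skip_below]
        # An empty group (everything skipped) is drawn as a zero-height bar.
        bars.append(sum(group) / len(group) if group else 0.0)
    return bars

# Hypothetical spectrum: full bass, and a top group that is mostly zeros.
spectrum = [1.0] * 248 + [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.5]
plain = group_into_bars(spectrum, 32)
thresholded = group_into_bars(spectrum, 32, skip_below=0.0)
```

In this made-up case the last bar averages to 0.125 when the zeros are counted and 0.5 when they are skipped, which is the effect described above. Whether skipping zeros gives a more truthful picture of the sound is a separate question.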
Still not sure why there are nearly no zero values in the low frequencies but many in the high ones. I’m not a sound engineer, so I don’t know much about the physical properties of a sound wave in nature. But this makes me think that (if Flash’s calculation is correct) lower frequencies produce more continuous peaks while higher frequencies produce ‘interrupted’, ‘separated’ ones.
PS: I have coded a few very simple swf files to demonstrate the experiments I’ve explained above, so mail me if you are interested at all. I’d love to hear any thoughts on these.
You may contact me at (add letters in single quotes together) ’serdar’ ’soy’ ‘@gmail’ ‘.com’
December 8, 2008 at 5:57 pm
Dru Kepple
Thanks for your input, webtry! It’s possible that your source material is just bass-heavy. Depending on the sounds I’ve fed to a FFT, I get values in the higher frequencies or not. But the bass frequencies are always heavier.
Adobe doesn’t seem to supply information on what the “bins” are for the FFT, but you can try to supply a value to the “stretch factor” parameter in computeSpectrum(). It’s the third parameter, and at 0 you get data out at 44.1 kHz (the default). At 1, you get data at 22.05 kHz, which effectively eliminates the upper octave of frequencies. Try it out, you might find more information in all of the “bins” to work with.
December 8, 2008 at 11:49 pm
webtry
Thanks for the reply. I’ve been studying sound wave theory and FFT basics for a while.
Here are some conclusions I have, hoping I’m correct:
-Flash’s default spectrum range is 0kHz-22.05kHz not 0kHz-44.1kHz. See ‘Nyquist frequency’ article in Wikipedia. ( “An audio CD can represent frequencies up to 22.05 kHz—the Nyquist frequency of the 44.1 kHz sample rate.”
http://en.wikipedia.org/wiki/Red_Book_(audio_CD_standard) )
-Frequencies higher than 12kHz could be represented by fewer bars, as human hearing is less sensitive to those frequencies.
-I would be nice to know which exact frequency is represented by each of the 26 values flash computes. I made some tests with pure test tones in 100Hz, 440Hz and 1000Hz. Here are the resulting numeric data:
(I ignored values smaller than 1, to show the peak more clearly)
http://img236.imageshack.us/img236/5043/100hzmr5.gif
http://img224.imageshack.us/img224/600/440hzki4.gif
http://img124.imageshack.us/img124/5919/1000hzro0.gif
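As a rough way to reason about which value a test tone should land in, here is a small Python sketch. It rests on an assumption that nothing in this thread confirms: that the 256 values are evenly spaced bins from 0 Hz up to the Nyquist frequency, and that each step of the stretch factor halves that range. The names `bin_width` and `bin_for_frequency` are hypothetical.

```python
SAMPLE_RATE = 44100.0
NYQUIST = SAMPLE_RATE / 2   # 22.05 kHz for a 44.1 kHz sample rate
BIN_COUNT = 256             # values per channel

def bin_width(stretch_factor=0):
    """Width of one bin in Hz, ASSUMING evenly spaced bins up to Nyquist."""
    return NYQUIST / (2 ** stretch_factor) / BIN_COUNT

def bin_for_frequency(freq_hz, stretch_factor=0):
    """Index of the bin a pure tone would fall into under that assumption."""
    index = int(freq_hz // bin_width(stretch_factor))
    return min(index, BIN_COUNT - 1)

tone_bins = [bin_for_frequency(f) for f in (100.0, 440.0, 1000.0)]
```

Under that assumption each bin is about 86 Hz wide, so all three test tones would sit within the first dozen or so of the 256 values. Comparing that prediction against measured data like the images above is one way to check whether the assumption holds.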
December 8, 2008 at 11:52 pm
webtry
Correction:
-’It’ would be nice to know … each of the ‘256′ values flash computes.
December 12, 2008 at 5:39 pm
Dru Kepple
Not only is human hearing less sensitive to higher frequencies, but if you think about frequencies in musical terms rather than as a linear set of numbers, an entire half of the spectrum is dedicated to just one-tenth of what you can hear.
That is, musically speaking, an octave tone is defined as being twice (or half) the frequency of the original tone. So, if you press the A key on a piano above middle C (that’s A 440…I think), and then you press the A one octave above that, then the higher A is 880 Hz. The A above that is 1760 Hz. Musically, we hear equal distances. But physically, there are twice as many frequency “buckets” between the higher two A notes than between the lower two A notes.
So, if human hearing is from more-or-less 20 Hz to 20 kHz, we can view it as starting at 20 Hz, with the first octave of hearing going to 40 Hz. Then the second going to 80Hz, etc, for a list of numbers that goes:
20
40
80
160
320
640
1280
2560
5120
10240
20480
That’s 10 octaves of hearing. But 9 of those octaves fit into 20 Hz to 10240 Hz, and the 10th octave requires 10240 Hz to 20480 Hz. So, frequencies above 10 kHz or 12 kHz are about half of the frequency spectrum, but only 1/10 of what you hear.
So, if the FFT from computeSpectrum() has an even distribution of frequency “buckets,” it stands to reason that the upper half of the values might be somewhat empty.
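The octave arithmetic above is easy to check with a few lines of Python. The sketch assumes, as this comment does, that the 256 values are evenly spaced "buckets" from 0 Hz to 22.05 kHz; the function names are made up, and the real layout of Flash’s bins is not documented here.

```python
def octave_edges(low_hz=20.0, octaves=10):
    """Frequencies that bound each octave: 20, 40, 80, ... 20480."""
    return [low_hz * 2 ** n for n in range(octaves + 1)]

def bins_per_octave(edges, bin_count=256, nyquist=22050.0):
    """Count how many evenly spaced bin centers fall inside each octave."""
    width = nyquist / bin_count
    centers = [(i + 0.5) * width for i in range(bin_count)]
    counts = []
    for low, high in zip(edges, edges[1:]):
        counts.append(sum(1 for c in centers if low <= c < high))
    return counts

edges = octave_edges()
counts = bins_per_octave(edges)   # one count per octave, lowest first
```

Under this assumption the top octave alone collects as many bins as the nine octaves below it put together, and the lowest octaves get one bin or none, which fits the "one big lump" of bass people see in their displays.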
December 15, 2008 at 2:07 pm
webtry
Yes I believe the FFT data of computeSpectrum() is distributed evenly, as you said.
In order to design a well-responding analyzer, there are a few things to keep in mind besides the facts you mentioned above:
-A weighting function to modify frequencies so that the displayed graph looks more natural to human hearing. (http://en.wikipedia.org/wiki/Weighting_filter).
-An enveloping / equalizing function to get a smoother (and more sensitive to the ‘beats’) graph. (this is still somewhat unclear for me to implement.)
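For the weighting idea, one common choice is the standard A-weighting curve described on the linked Wikipedia page. The sketch below is Python, not AS3, and it is only an assumption that A-weighting is what a good analyzer would use. The `apply_weighting` helper and its evenly-spaced-bin assumption are hypothetical as well.

```python
import math

def a_weighting_db(freq_hz):
    """Standard A-weighting gain in dB (close to 0 dB at 1 kHz)."""
    f2 = freq_hz * freq_hz
    numerator = (12194.0 ** 2) * f2 * f2
    denominator = ((f2 + 20.6 ** 2)
                   * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
                   * (f2 + 12194.0 ** 2))
    return 20.0 * math.log10(numerator / denominator) + 2.0

def apply_weighting(magnitudes, bin_width_hz):
    """Scale each bin by the A-weighting gain at its center frequency.

    ASSUMES evenly spaced bins starting at 0 Hz, each bin_width_hz wide.
    """
    weighted = []
    for i, magnitude in enumerate(magnitudes):
        center = (i + 0.5) * bin_width_hz
        gain = 10 ** (a_weighting_db(center) / 20.0)
        weighted.append(magnitude * gain)
    return weighted

flat = [1.0] * 256
weighted = apply_weighting(flat, 22050.0 / 256)
```

Applied to a flat spectrum, this pulls the lowest bins down hard and leaves the region around 1 kHz to a few kHz almost untouched, which goes some way toward taming an over-tall bass end. It does nothing for the enveloping or smoothing point.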
In fact there is a brilliant (non-flash) analyzer, which produces excellent results, and is the best analyzer I’ve seen in popular media players:
http://www.winamp.com/plugins/details/165966
If I hadn’t seen this analyzer, I would be happy with my own compared to most of the analyzers around.
Too bad there is no documentation I could find about it.
PS: It’s good to see this new template in your blog; a lot better.
December 28, 2008 at 6:43 pm
darkglove
Hi all,
I just stumbled across your conversation as I have just built a visual player and I too noticed the “bunching” of frequencies.
I also thought, maybe just the track, but I too checked it in winamp and although the bottom end seems to be more active than the higher, it does seem to be calculated differently so as to spread them out.
Bass notes can be seen to move up and down the spectrum, whilst in mine they sit in one big lump.
Thanks Dru, that has made it a lot clearer.
Looks like some clever adjustment curve is in order to recreate a visual that looks “more real”.
If anyone wants to see my visual, feel free.
The frequency blocks represent how many of the 512 blocks are displayed from bottom to top. If you set it to 10 then only the first (lowest) 10 blocks of the left channel will be shown. If you put 512 then the right channel mirrors inward for aesthetics.
http://www.mjnbrown.com/Music.html