Hunt for hifi X: There are ten types of people who understand binary…
Digital audio has been around for a very long time – the CD format is over 30 years old now – yet enormous leaps are still being made, and the latest development is undeniably server-based playback. The days of swapping CDs seem to be over for me, and I’m not looking back. I still buy all my music as physical releases, though! As of this writing, I still have not purchased a single electronically distributed piece of music. But that’s another story.
I normally just write stuff here for my own enjoyment, as a way to flush my system of thoughts on a certain topic. But with digital audio I’ll see if I can fire up my inner philanthropist and start at the very beginning, going through things I’ve already known for 15 years. I think it will be an interesting challenge, and it will make this series a lot more complete. I will attempt to put everything in layman’s terms, so even fundamentalist vinyl cavemen can follow my reasoning.

The first thing I need to define is what digital actually is, especially in terms of audio. To do that, you need to know how sound works, but I won’t go into the basic physics here. So, assuming that you’re familiar with the fundamental principles of waveforms (like the sine wave in the oscilloscope above), let’s look at how they are stored as data rather than as a groove in vinyl or magnetization on reel-to-reel tape.
Imagine a chess board. It has 64 squares arranged in an 8×8 grid, labelled A-H along its X axis and 1-8 along its Y axis. In chess you cannot place your pieces on E4-and-a-half or anything like that; they have to obey the grid, so you’re either on E4 or on E5. The same type of rule exists in digital audio: a sine wave cannot be drawn as a smooth curve like the one on our analogue oscilloscope screen above. In fact, the image above has a faint square pattern on the screen, so imagine trying to approximate the sine wave by tracing the dark lines of this grid. This is how sound is stored as data: a series of coordinates with time along the X axis and amplitude along the Y axis. The horizontal distance, or time, between these coordinates is constant, so it doesn’t have to be stored for each point; a single frequency, the sample rate (44.1 kHz for CD), describes all of them. This frequency is kept by a clock, and we’ll get back to that later.
As you may guess, the finer the grid, the closer you get to analogue sound. Conversely, a coarser resolution will reveal more of the unnatural-sounding side effects of digital sound. There is no information of how the waveform is supposed to look between the points of data. There is simply nothing there, so after each sample point the waveform is drawn straight ahead along the X axis until the clock says it’s time for a new data point. Once there, the waveform is drawn straight up or down to reach the new coordinates. Because of this, the waveform isn’t drawn as a connect-the-dots chart, but with right angles:
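To make the “constant horizontal distance” idea concrete, here is a minimal Python sketch (the function and variable names are illustrative, not taken from any real player). Only the amplitudes are kept, and the time of sample n is recomputed as n divided by the sample rate. The 1 kHz test tone is an arbitrary choice, and rounding the amplitudes to the grid is left out here.

```python
import math

SAMPLE_RATE = 44_100  # CD sample rate in Hz: the clock that spaces points along the X axis


def sample_sine(freq_hz, n_samples, sample_rate=SAMPLE_RATE):
    """Return only the amplitudes; sample n implicitly sits at time n / sample_rate."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate) for n in range(n_samples)]


samples = sample_sine(1_000, 5)
for n, amplitude in enumerate(samples):
    t_ms = 1000 * n / SAMPLE_RATE  # time is derived from the index, never stored
    print(f"sample {n}: t = {t_ms:.4f} ms, amplitude = {amplitude:+.4f}")
```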
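The staircase described above is usually called a zero-order hold. Here is a rough Python sketch of it. It uses the toy amplitudes 0, 5 and -4 and a made-up rate of 4 samples per second so the numbers stay readable, and the function name is hypothetical. Each sample becomes a vertical jump followed by a flat line until the next clock tick.

```python
def staircase(samples, sample_rate):
    """Corner points of a zero-order hold: jump to each amplitude, then hold it flat."""
    points = []
    for n, value in enumerate(samples):
        start = n / sample_rate        # clock tick where this sample begins
        end = (n + 1) / sample_rate    # next clock tick
        points.append((start, value))  # vertical jump to the new amplitude
        points.append((end, value))    # horizontal line until the next tick
    return points


print(staircase([0, 5, -4], sample_rate=4))
# [(0.0, 0), (0.25, 0), (0.25, 5), (0.5, 5), (0.5, -4), (0.75, -4)]
```

Real D/A converters follow this hold stage with a low-pass reconstruction filter that smooths the corners. This sketch stops before that step.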

The vertical axis, which denotes amplitude, is stored as a 16-bit value in CD audio. Now what the hell does that mean, you might ask. A bit in digital terms is a single one or zero, so at 1 bit you would have full amplitude or silence – 1 or 0, respectively. At two bits, you can have four combinations: 00, 01, 10, 11, meaning you could have four different amplitudes in your waveform. Moving all the way up to 16 bits, there are 65,536 (2 to the power of 16) possible combinations of amplitude. While it might be difficult to hear the difference in loudness between 47361 and 47362 in amplitude, there are other reasons to desire even higher bit depth.
So that’s the basics of CD audio, really – a clock-timed series of amplitudes stored as 16-bit values. When converting something to digital, the strength of the electric signal from the microphone is measured at the rate of this clock and stored as amplitude values. When playing it back from digital, the amplitude values are read at the pace of a new clock (hopefully indistinguishable from the recording one!) and a new electric signal is shaped according to these amplitudes. It’s a bit more complicated than that, but I’m sure you get the point.
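Here is a small Python sketch of the bit-depth arithmetic. It assumes the usual signed integer layout of CD audio (codes from -32768 to 32767) and maps the analog range -1.0 to 1.0 onto it. The helper names and that scaling are illustrative choices. It also shows the common rule of thumb that every bit adds roughly 6 dB of dynamic range, which puts 16 bits at about 96 dB.

```python
import math


def levels(bits):
    """Number of distinct amplitude values a sample with this many bits can hold."""
    return 2 ** bits


def quantize(x, bits=16):
    """Round an amplitude in [-1.0, 1.0] to the nearest signed integer code."""
    max_code = 2 ** (bits - 1) - 1  # 32767 for 16 bits
    min_code = -(2 ** (bits - 1))   # -32768 for 16 bits
    return max(min_code, min(max_code, round(x * max_code)))  # clip out-of-range input


def dynamic_range_db(bits):
    """Full scale versus one quantization step, in decibels."""
    return 20 * math.log10(2 ** bits)


for bits in (1, 2, 16):
    print(f"{bits:>2} bits -> {levels(bits)} levels, ~{dynamic_range_db(bits):.1f} dB")
```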
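To see why the playback clock should match the recording one, here is a rough Python sketch of a constant clock mismatch. If samples recorded at one rate are played back at another, every frequency is scaled by the ratio of the two rates. The 100 ppm error is a hypothetical figure. The sketch models only a steady rate difference, not jitter (clock timing that wobbles from sample to sample).

```python
def played_back_frequency(recorded_hz, record_clock_hz, playback_clock_hz):
    """Frequency heard when samples captured at one clock rate are replayed at another."""
    # Samples come out 1/playback_clock_hz apart instead of 1/record_clock_hz apart,
    # so every frequency in the recording is scaled by the ratio of the two clocks.
    return recorded_hz * playback_clock_hz / record_clock_hz


fast_clock = 44_100 * (1 + 100e-6)  # hypothetical playback clock running 100 ppm fast
print(played_back_frequency(1_000, 44_100, fast_clock))  # very close to 1000.1 Hz
```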
Now that the basics are laid down, my next post will go through the ways these conversions can get messed up, focusing mainly on digital to analogue, and how to avoid those pitfalls.
A slight disclaimer: the above explanation of digital audio is for pulse-code modulation, or PCM for short. There is another, perhaps less intuitive, way of storing sound as data called pulse-density modulation (PDM), which is used on Super Audio CDs under the name DSD (Direct Stream Digital). Since it’s trickier and far less common, I won’t describe it further here.
Hi there! What you wrote about digital sound is not correct in all parts. The figure is not correct, and neither is the assumption “There is no information of how the waveform is supposed to look between the points of data.” After the low-pass filtering (about 22 kHz for 44.1 kHz sampling), all information in the analog signal is there.
Regards,
I beg to differ. The data is exactly that, data. Since it’s stored as sample points, there is nothing in between the sample points. I don’t see how there’s any debate there. And “all information in the analog signal is there” is just flat-out wrong. A 30 kHz tone stored as 44.1 kHz sample rate PCM data can never be retrieved. And even if we step into the human hearing range and go with, say, a 15000 Hz square wave, it will not be identical after A/D-D/A conversion. Close, but certainly not “all information (…) is there”.
Digital audio does not work like connect-the-dots drawing games, hence the right angles in my diagram. These are indeed smoothed out by the filter to form something resembling the red line, but there are many different types of filter with different side effects. Issues like pre-ringing and phase distortion are very real and have audible effects on music.
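To put numbers on the 15 kHz square wave example: an ideal square wave is a sum of odd harmonics whose amplitudes fall off as 1/k. A rough Python sketch can therefore list which harmonics lie below the CD's Nyquist frequency of 22.05 kHz (half of 44.1 kHz). The helper name is made up for illustration.

```python
import math

NYQUIST_HZ = 44_100 / 2  # highest frequency 44.1 kHz sampling can represent


def square_wave_harmonics(fundamental_hz, max_hz):
    """(frequency, amplitude) of an ideal unit square wave's harmonics up to max_hz."""
    harmonics = []
    k = 1
    while k * fundamental_hz <= max_hz:
        harmonics.append((k * fundamental_hz, 4 / (math.pi * k)))
        k += 2  # an ideal square wave contains only odd harmonics
    return harmonics


full = square_wave_harmonics(15_000, 100_000)
kept = [(f, a) for f, a in full if f < NYQUIST_HZ]
print([f for f, _ in full])  # [15000, 45000, 75000]
print([f for f, _ in kept])  # [15000]
```

Only the fundamental lies below 22.05 kHz, so after band-limiting, the 15 kHz square wave comes out as a 15 kHz sine wave.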
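Here is a rough illustration of pre-ringing as a Python sketch. It feeds a sudden step into a linear-phase (symmetric) low-pass FIR filter built from a Hann-windowed sinc. The cutoff (0.2 of the sample rate) and the 31 taps are arbitrary values picked for illustration. This is a generic textbook filter, not the one in any particular player or converter. Because its impulse response is symmetric, ripples appear before the (delayed) edge as well as after it.

```python
import math


def windowed_sinc_lowpass(cutoff, num_taps):
    """Linear-phase FIR low-pass: an ideal sinc response shaped by a Hann window.

    cutoff is a fraction of the sample rate (0.5 would be Nyquist); num_taps should be odd.
    """
    center = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - center
        ideal = 2 * cutoff if x == 0 else math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.5 - 0.5 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]  # normalize for unity gain at DC


def fir_filter(signal, taps):
    """Direct-form convolution: out[i] = sum of taps[j] * signal[i - j]."""
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, tap in enumerate(taps):
            if 0 <= i - j < len(signal):
                acc += tap * signal[i - j]
        out.append(acc)
    return out


taps = windowed_sinc_lowpass(cutoff=0.2, num_taps=31)
step = [0.0] * 40 + [1.0] * 40  # silence, then a sudden jump at sample 40
out = fir_filter(step, taps)
edge = 40 + (len(taps) - 1) // 2  # a symmetric filter delays the edge by half its length
print(min(out[:edge]))  # below zero: the output dips before the edge arrives
```

Minimum-phase filters are the usual alternative. They move the ringing after the edge, at the price of phase distortion.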
“Since it’s stored as sample points, there is nothing in between the sample points.”
Well, there is, actually. Since the signal is low-pass filtered through a brick-wall sinc-function filter, all information is stored and replicated, even the information between the sampling points. I repeat: all information between the sampling points is stored and replicated.
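The reconstruction described here is usually written as the Whittaker-Shannon interpolation formula, x(t) = Σ x[n] · sinc(fs·t - n), with the normalized sinc(u) = sin(πu)/(πu). The Python sketch below assumes a signal already band-limited below fs/2 (here a 1 kHz sine sampled at 44.1 kHz). It evaluates the formula halfway between two samples and compares the result with the true analog value. With only a finite run of samples the sum is truncated, so the match is close rather than exact. The function names are illustrative.

```python
import math


def sinc(u):
    """Normalized sinc: 1 at u = 0 and 0 at every other integer."""
    return 1.0 if u == 0 else math.sin(math.pi * u) / (math.pi * u)


def reconstruct(samples, sample_rate, t):
    """Whittaker-Shannon interpolation: value at time t (seconds) from the samples."""
    return sum(x_n * sinc(t * sample_rate - n) for n, x_n in enumerate(samples))


FS = 44_100
TONE_HZ = 1_000
samples = [math.sin(2 * math.pi * TONE_HZ * n / FS) for n in range(2001)]

t_between = 1000.5 / FS  # halfway between sample 1000 and sample 1001
estimate = reconstruct(samples, FS, t_between)
truth = math.sin(2 * math.pi * TONE_HZ * t_between)
print(f"reconstructed {estimate:.5f}, analog {truth:.5f}")
```

Practical converters cannot use an infinitely long sinc, so they approximate it with finite-length filters.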
And of course a 30 kHz tone is not retrieved with 44.1 kHz sample rate, but that is because it was filtered away in the 22.05 kHz LP filter before the A/D conversion. It was never stored to begin with.
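For contrast, here is a small Python sketch of what happens when that low-pass filter is missing. A pure tone above half the sample rate folds back (aliases) into the audio band. Sampling cannot tell f from fs - f, so at 44.1 kHz a 30 kHz tone produces exactly the same sample values as a 14.1 kHz tone. Cosines are used because with sines the alias comes out phase-inverted. The function name is made up for illustration.

```python
import math


def alias_frequency(tone_hz, sample_rate_hz):
    """Frequency at which a pure tone shows up after sampling with no anti-aliasing filter."""
    folded = tone_hz % sample_rate_hz            # f and f + k*fs give identical samples
    return min(folded, sample_rate_hz - folded)  # ...and so do f and fs - f


FS = 44_100
print(alias_frequency(30_000, FS))  # 14100

high = [math.cos(2 * math.pi * 30_000 * n / FS) for n in range(100)]
low = [math.cos(2 * math.pi * 14_100 * n / FS) for n in range(100)]
print(max(abs(a - b) for a, b in zip(high, low)))  # tiny: rounding error only
```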
Here you can read more about the sinc: http://en.wikipedia.org/wiki/Sinc_function
Ah, I think I’m beginning to see what you’re referring to. Let me know if I understand you right:
Let’s say an analog signal results in sample points A, B and C (a very short signal, in order to simplify this argument). These have amplitudes of 0, 5 and -4 respectively. What I was saying is that sample point B, with a value of 5, would be 5 regardless of what happened in the analog domain between the clock ticks at sample points A and B. But what you are saying is that B is a representation of everything that happened between these two ticks, and that a transient at +20 between A and B would have given B another value even if the amplitude was at the equivalent of 5 for that clock tick.
If this is what you mean, I understand your argument but I’m still very skeptical: a sonic event between two sample points at 44.1 kHz would by definition be above 22.05 kHz, wouldn’t it? And as such irretrievable from the data, just as the 30 kHz tone in my example.
I am unfortunately no mathematician (as I ended up following the desires of the right side of my brain instead), so the Wikipedia article is a little difficult to digest. Nonetheless, I appreciate the discussion and would like to emphasize that I desire to learn the truth, rather than to claim I already know the truth.
“I appreciate the discussion and would like to emphasize that I desire to learn the truth, rather than to claim I already know the truth.”
That was my impression from reading this site, otherwise I wouldn’t have bothered. :-)
“But what you are saying is that B is a representation of everything that happened between these two ticks, and that a transient at +20 between A and B would have given B another value even if the amplitude was at the equivalent of 5 for that clock tick.
If this is what you mean…”
Well, not quite. The point is that the analog signal is filtered through a sinc filter (and thus transformed) BEFORE it is sampled. Imagine a very (infinitely, you might say) narrow pulse between two sampling points. This would indeed be “invisible” to the sampling function. But since it is first sinc filtered, it is transformed into another shape (see the curve on the wiki page), and this curve has values ON the sample points, and so it is “visible”.
It is not easy to explain in words only. I have some links to quite informative articles, if you are interested.
Here is more on wikipedia: http://en.wikipedia.org/wiki/Sinc_filter
Ah, very interesting!
I was not aware that there was filtering at this step; I have only read up on D/A, not A/D. I have some work to do…
I am somewhat confused by the Wikipedia graph though:
http://en.wikipedia.org/wiki/File:Sinc_function_%28both%29.svg
Wouldn’t this indicate that the resulting signal, which is to be digitized, is in effect detuned from the original signal? The red line has different frequencies, and it’s not halved either. I’m in over my head now, but in this example it looks as though this particular filter implementation would be detrimental to the music rather than helping the listener approximate signal data in between sample points.
You could see “x” as 2*pi*f, where f is the highest allowed frequency, which is 22.05 kHz for the CD system. The curve is also the shape of the signal out of the filter if the signal in is a delta function, that is, an ideal pulse. So, if the pulse is situated exactly on a sampling point, after the filter it will have the value “1” at this point and “0” at all other sampling points (crossing the x axis at zero). If the pulse is situated between sampling points, after the filter it will have defined values at a number of sampling points – these are not approximations, but exact values, that will recreate the original signal.
Sorry, x is “2*pi*f*t”. The time t is the variable, of course.
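Here is a quick numerical check of the last two comments, as a Python sketch. With f = 22.05 kHz (half of 44.1 kHz), x = 2*pi*f*t equals pi times t measured in sample periods. The filtered pulse is therefore the normalized sinc, sin(pi*u)/(pi*u), centered wherever the pulse sits. The sketch samples it at the clock ticks for a pulse exactly on tick 0 and for one halfway between ticks 0 and 1. The function name is hypothetical, and the filter is assumed to be ideal.

```python
import math


def sinc(u):
    """Normalized sinc: sin(pi*u) / (pi*u), with the limit value 1 at u = 0."""
    return 1.0 if u == 0 else math.sin(math.pi * u) / (math.pi * u)


def filtered_pulse_at_ticks(pulse_position, ticks):
    """Values at integer clock ticks of an ideal pulse at pulse_position (in sample
    periods) after an ideal low-pass filter at half the sample rate."""
    return [sinc(k - pulse_position) for k in ticks]


ticks = range(-3, 4)
on_tick = filtered_pulse_at_ticks(0.0, ticks)
between = filtered_pulse_at_ticks(0.5, ticks)
print([round(v, 4) for v in on_tick])   # 1 at tick 0, (numerically) 0 at the other ticks
print([round(v, 4) for v in between])   # non-zero at every tick shown
```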
My follow-up questions are forking into an ever-widening delta right now, so I’ll do some reading and perhaps make a proper post on this, combined with some other A/D things I’ve been contemplating, such as how the hell Meridian’s 808.2 CD player can filter out pre-ringing that originates in the A/D process.
Basically, how can the D/A know how the A/D was made? Jitter is just one issue of many here, and I’m starting to think this filtering business will prove very interesting.
Please do! I am not an expert on this, far from it, so this is very beneficial for me, too. It’s always interesting to try to explain things you think you know. :-)
Basically, I think that most A/D-D/As are conceived with the assumption of the sinc, although there may be exceptions to this. With digital filters, I think the pre-ringing is no problem.
Here is some stuff…
An article in Swedish about digitization and the sinc function: http://www.lts.a.se/artiklar/sincen.pdf
A discussion in Swedish on a forum about filters and pre-ringing: http://www.faktiskt.se/modules.php?name=Forums&file=viewtopic&p=251444
Excellent digging – this’ll make for good couch reading tonight. Thanks!