Feature: Video Compression 5: Audio - Can You Hear the Difference? By Jim Farmer
So far in this series on intuitive approaches to digital television, we have talked about "lossy" video compression. But there’s another component of your program that needs to be compressed—your audio.
"Lossy" video compression refers to the portion of the video that gets lost during the process of doing the compression. The trick, of course, is to find a way to lose only information that the eye could not detect anyway. Moving Picture Experts Group (MPEG) video compression offers a toolkit full of standards for decompressing material, but it does not explain how to do the compression. The thinking is that compression algorithms will improve with time, and yet be compatible with all decompression circuits. We hope. There are also loss-less compression tricks, which we’ll cover later in this series.
But now we need to talk about the other part of the program, without which the video would be pretty useless. We have to have audio to go with the video, and that audio must also be compressed.
We can get a lot of compression out of video—you end up with about 2 percent to 3 percent of the bits you start with. Audio is not quite so nice. You can get down to maybe 7 percent to 8 percent of the bits you started with. The ear is just too sensitive to allow much more compression than this, at least at the present state of the art.
Remember that we are not going to make you a compression guru with this series, but we will give you an appreciation of the basics. And we provide several references to other sources to take you further.
What’s in it?
The audio compression used in the United States is AC-3, developed by Dolby Labs, the same company that gave you the noise reduction technology used for years in tape recording. The compression used with the digital video broadcast (DVB) transmission system in Europe is usually called MPEG audio compression. It works on similar principles, but the two are not compatible. AC-3 is sometimes confused with the famous, or maybe infamous, MP3 audio files, but they are different animals: MP3 is shorthand for MPEG-1 Audio Layer III, a member of the MPEG audio family, not a contraction of MPEG and AC-3.
AC-3 encodes the so-called 5.1 channel surround sound, where the five channels are center (used for dialogue), front left and right, and rear left and right. The ".1" comes from a sub-low channel used to reproduce the lowest frequencies that are perhaps felt as much as they are heard.
There are a number of different service types defined for AC-3 audio by the Advanced Television Systems Committee (ATSC):
- Complete main (CM): dialogue, music and effects are included in the 5.1 channels. The CM signal is constrained to a data rate of 384 kbps or less.
- Music and effects (ME): dialogue is separate from the music and effects. Several dialogue channels may be sent, and one is selected to be combined with the music and effects. This would be done, for example, where several languages must be transmitted: one set of music and effects is transmitted, along with several dialogue channels.
- Dialogue (D): The dialogue to accompany ME. A single dialogue channel must use a bit rate of 128 kbps or less, and two dialogue channels must use no more than 192 kbps combined. The two dialogue channels would typically carry two languages; they do not include music and effects, which are sent on separate channels.
- Visually impaired (VI): A narrative description of the video is sent.
- Hearing impaired (HI): The dialogue may be processed for improved intelligibility.
- Commentary (C), voice over (VO), emergency (E), karaoke.
A main channel, plus the associated services intended for simultaneous decoding, must use a total data rate of no more than 512 kbps.
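As a rough illustration, here is a minimal sketch in Python that checks a service combination against the bit-rate limits quoted above. The function name, constant names and structure are mine, invented for illustration; they do not come from the ATSC A/52 documents.

```python
# Sketch: check an AC-3 service combination against the ATSC bit-rate
# limits quoted in the text. Names are illustrative, not A/52 terminology.

CM_MAX_KBPS = 384          # complete main service
D_SINGLE_MAX_KBPS = 128    # one dialogue channel
D_PAIR_MAX_KBPS = 192      # two dialogue channels combined
TOTAL_MAX_KBPS = 512       # main plus simultaneously decoded associated services

def check_service_combination(main_kbps, dialogue_kbps=()):
    """Return a list of constraint violations (an empty list means OK)."""
    problems = []
    if main_kbps > CM_MAX_KBPS:
        problems.append(f"main service at {main_kbps} kbps exceeds {CM_MAX_KBPS} kbps")
    if len(dialogue_kbps) == 1 and dialogue_kbps[0] > D_SINGLE_MAX_KBPS:
        problems.append("single dialogue channel exceeds 128 kbps")
    if len(dialogue_kbps) == 2 and sum(dialogue_kbps) > D_PAIR_MAX_KBPS:
        problems.append("two dialogue channels exceed 192 kbps combined")
    total = main_kbps + sum(dialogue_kbps)
    if total > TOTAL_MAX_KBPS:
        problems.append(f"total {total} kbps exceeds {TOTAL_MAX_KBPS} kbps")
    return problems

print(check_service_combination(384, (128,)))   # [] -> within every limit
```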
The audio sampling rate used in ATSC implementations of AC-3 is 48 ksps (thousand samples per second), locked to the 27 MHz system clock. The basic AC-3 channel sampling is done at a minimum of 16 bits and a maximum of 24 bits. AC-3 as defined by Dolby supports 32 ksps and 44.1 ksps sampling as well as 48 ksps, but these are not supported in the ATSC system.
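A quick back-of-the-envelope calculation shows why audio only gets down to the 7 percent to 8 percent mentioned earlier. The sketch below treats the sub-low channel as a full-rate channel to keep the arithmetic simple, which it is not in practice.

```python
# Back-of-the-envelope arithmetic: raw PCM rate for 5.1 audio at 48 ksps
# versus the 384 kbps complete-main budget.

SAMPLE_RATE = 48_000     # samples per second per channel
BITS_PER_SAMPLE = 16     # AC-3 input is sampled at 16 to 24 bits
CHANNELS = 6             # 5 full channels plus the ".1" sub-low channel

raw_bps = SAMPLE_RATE * BITS_PER_SAMPLE * CHANNELS
compressed_bps = 384_000

print(f"raw PCM: {raw_bps / 1e6:.3f} Mbps")                  # 4.608 Mbps
print(f"compressed: {compressed_bps / 1e3:.0f} kbps")
print(f"ratio: {100 * compressed_bps / raw_bps:.1f} percent")  # about 8.3 percent
```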
The basic principle
The audio in each channel is sampled and converted to digital representation. This is done many times each second by breaking the signal into blocks of 512 samples. At 48 ksps, each block lasts 10.67 ms, though later in the processing, more than one block may be combined when possible. When a transient sound is encountered, the block size is halved to improve the reproduction of the transient. (An example of a transient sound is a drum beat or a cymbal crash.)
The block is taken every 256 samples, so each block overlaps the preceding and succeeding blocks by 50 percent. This reduces the amount of compression possible, but is necessary because of the extreme sensitivity of the human ear to errors—an example of why audio compression is more complex than video compression.
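A minimal sketch of this blocking scheme, using only the numbers given above (512-sample blocks taken every 256 samples at 48 ksps), might look like this; the windowing and transient handling of the real encoder are omitted.

```python
import numpy as np

# Sketch: split a mono channel into 512-sample blocks taken every 256
# samples, giving 50 percent overlap. Real AC-3 also windows each block
# and halves the block length around transients; that logic is omitted.

SAMPLE_RATE = 48_000
BLOCK_LEN = 512
HOP = 256  # a new block every 256 samples -> 50 percent overlap

def overlapping_blocks(signal):
    blocks = []
    for start in range(0, len(signal) - BLOCK_LEN + 1, HOP):
        blocks.append(signal[start:start + BLOCK_LEN])
    return np.array(blocks)

audio = np.random.randn(48_000)            # one second of stand-in audio
blocks = overlapping_blocks(audio)
print(blocks.shape)                        # about 186 blocks of 512 samples
print(f"block duration: {1000 * BLOCK_LEN / SAMPLE_RATE:.2f} ms")  # 10.67 ms
```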
The basic principle behind both AC-3 and MPEG audio compression, and this is really basic, is sub-band encoding. For each audio channel, the audio stream from 20 Hz to 20 kHz is digitized and then filtered through a filter bank to determine its spectral content.
You may have used audio spectrum analyzers in consumer electronics equipment that display the signal level in each of a number of frequency bands across the audio spectrum. An audio spectrum analyzer contains crude filters doing the same thing as the filter bank used in compression.
Initially, the spectrum is divided into bands roughly 93 Hz wide, but later in the processing some bands may be combined when they can be treated as one. The exact number of bands ultimately used depends on the characteristics of the audio.
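As a crude stand-in for the filter bank, the sketch below uses a plain FFT on one 512-sample block to measure the power in each of 256 bands, each 93.75 Hz wide (48,000/512). AC-3 actually uses a carefully designed transform filter bank, so treat this only as an illustration of the idea.

```python
import numpy as np

# Sketch: measure the power in each narrow frequency band of one block.
# A 512-point FFT at 48 ksps yields 256 bands, each 93.75 Hz wide.
# This FFT is only a stand-in for the real transform filter bank.

SAMPLE_RATE = 48_000
BLOCK_LEN = 512

def band_powers(block):
    windowed = block * np.hanning(BLOCK_LEN)       # reduce spectral leakage
    spectrum = np.fft.rfft(windowed)
    return np.abs(spectrum[:BLOCK_LEN // 2]) ** 2  # one power value per band

block = np.sin(2 * np.pi * 1000 * np.arange(BLOCK_LEN) / SAMPLE_RATE)
powers = band_powers(block)
print(np.argmax(powers) * SAMPLE_RATE / BLOCK_LEN)  # loudest band near 1000 Hz
```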
Analyzing spectral bands
After the audio spectrum is divided into a large number of narrow bands, the contents of each are analyzed. If during a particular block, no significant sound power exists in a band, then it is not necessary to transmit any information regarding that band, and we save bits. It is a well-known principle about human hearing that a loud sound in one frequency band will mask softer sounds in adjacent bands. So where we have a band with high power, we don’t need to transmit softer sounds in bands that are close to the loud one. Considerable effort has gone into determining just how much masking we get, and you can find graphs published on the subject.
It is often the case that the ear can hear softer sounds in some bands near, but not necessarily adjacent to, loud bands. But the ear does not need a great signal-to-noise ratio in those bands. In this case, we can save some bits by encoding the signal in those bands to lower resolution. That is, rather than transmit maybe 18 bits per sample, we can get away with transmitting perhaps 8 bits or some other smaller number of bits.
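The sketch below shows what transmitting at lower resolution amounts to: requantizing a band's values to fewer bits. The uniform quantizer here is the simplest possible choice, used only to show where the bit savings come from; it is not the quantizer the standard uses.

```python
import numpy as np

# Sketch: requantize a band's values to fewer bits. Values are assumed to
# lie between -1.0 and +1.0; the uniform quantizer below is only meant to
# show where the bit savings come from.

def requantize(values, bits):
    levels = 2 ** (bits - 1) - 1           # e.g. 127 steps each side for 8 bits
    return np.round(values * levels) / levels

band = np.array([0.2013, -0.0457, 0.7331, -0.5124])
coarse = requantize(band, 8)
print(coarse)                               # close to the original values
print(np.max(np.abs(band - coarse)))        # quantization error, under 0.004
```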
Thus, we have several possibilities for handling each of the spectral bands. The encoder may decide not to transmit information from a band, or it may transmit it at lower resolution. It may decide to combine bands or to combine blocks. In each case, we save bits. The process of deciding what bits to send is known as adaptive bit allocation.
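A toy version of that decision process is sketched below: bands well below a crude masking threshold are skipped, bands near it are sent coarsely, and the rest get full resolution. The thresholds and bit counts are invented for illustration; the real allocation in A/52 is far more elaborate.

```python
import numpy as np

# Toy adaptive bit allocation: per band, decide whether to skip it, send
# it coarsely or send it at full resolution, based on how far its power
# sits above a crude masking threshold. All numbers here are invented.

def allocate_bits(band_powers_db, full_bits=16, coarse_bits=8):
    # Crude masking threshold: 30 dB below the loudest band in the block.
    threshold_db = np.max(band_powers_db) - 30.0
    allocation = []
    for power in band_powers_db:
        if power < threshold_db:
            allocation.append(0)            # masked or silent: send nothing
        elif power < threshold_db + 12.0:
            allocation.append(coarse_bits)  # audible but near the threshold
        else:
            allocation.append(full_bits)    # clearly audible: full resolution
    return allocation

powers_db = np.array([-60.0, -35.0, -10.0, -32.0, -70.0])
print(allocate_bits(powers_db))             # [0, 8, 16, 8, 0]
```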
Obviously, we must communicate the bit allocation to the decoder. There are two ways to communicate this allocation. With forward adaptive bit allocation, the encoder explicitly tells the decoder what the bit allocation is. This is preferable from the standpoint that the encoder has knowledge of just what the original signal was, so it can optimize bit allocation. Also, as improved psycho-acoustic models become available, they may be incorporated in encoders. This, again, is the principle that we allow improvements in encoding while preserving the functionality of the decoders in consumers’ hands.
The problem with forward allocation is that it requires extra bits to be transmitted, which works against optimum compression. The alternative is backward adaptive bit allocation, where the decoder infers the bit allocation from the data it receives. This saves bits, but is limited in that the decoder does not have perfect knowledge of the original signal. Also, it does not lend itself to incorporation of improved psychoacoustic models. The encoder performs the same decoding process as is done in the decoder, so it knows how well the decoder will infer the proper bit allocation. If backward allocation does not yield good enough performance on a particular block, then forward allocation is used.
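In sketch form, that encoder-side check might look like the code below, where backward_infer_allocation stands in for whatever inference the decoder would perform. The function names and the mismatch test are mine, not terminology from the standard.

```python
# Sketch of the encoder-side decision: run the same inference the decoder
# would run (backward allocation), compare it with the allocation the
# encoder's psychoacoustic model wants, and fall back to explicitly
# signalling the allocation (forward allocation) when the inference is not
# good enough. Helper names are placeholders, not A/52 terminology.

def choose_allocation_mode(band_powers, psychoacoustic_allocation,
                           backward_infer_allocation, max_mismatched_bands=4):
    inferred = backward_infer_allocation(band_powers)
    mismatches = sum(1 for a, b in zip(inferred, psychoacoustic_allocation) if a != b)
    if mismatches <= max_mismatched_bands:
        return "backward", inferred              # decoder can work it out itself
    return "forward", psychoacoustic_allocation  # spend bits to signal it explicitly

mode, alloc = choose_allocation_mode(
    [0.9, 0.1, 0.0, 0.5],
    psychoacoustic_allocation=[16, 8, 0, 16],
    backward_infer_allocation=lambda p: [16 if x > 0.3 else 0 for x in p],
)
print(mode, alloc)   # backward allocation is close enough in this toy case
```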
Other tricks
While adaptive sub-band encoding is the main way audio bit-rate reduction is achieved, you have other tricks at your disposal. It is not necessary to transmit all frequencies on every channel to achieve good location. Very high and very low frequencies do not help locate a sound, so they may be combined between the various channels and transmitted only once. Where two channels have nearly the same information except for the phase, you can force the phase to be common between the channels as long as the location is not affected. These tricks are called coupling strategies.
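A simplified picture of one coupling strategy: above some frequency, transmit a single combined set of band values plus one scale factor per channel, instead of every channel's own high bands. The real AC-3 coupling machinery is more involved; the sketch below only illustrates the bit-saving idea.

```python
import numpy as np

# Sketch of high-frequency coupling: above a chosen band index, send one
# shared set of band values plus a single level correction per channel,
# instead of every channel's own high bands.

def couple_high_bands(channel_bands, start_band):
    """channel_bands: array of shape (channels, bands) of band magnitudes."""
    high = channel_bands[:, start_band:]
    coupled = high.mean(axis=0)                       # one shared high-band set
    # One scale factor per channel restores each channel's overall level.
    scale = high.sum(axis=1) / max(coupled.sum(), 1e-12)
    return coupled, scale

left = np.array([1.0, 0.8, 0.3, 0.2, 0.1])
right = np.array([0.9, 0.7, 0.6, 0.4, 0.2])
coupled, scale = couple_high_bands(np.vstack([left, right]), start_band=2)
print(coupled)   # shared high-band values
print(scale)     # per-channel level corrections
```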
Location refers to preserving the characteristics of sound that allow you to locate the sound relative to where you are. Close your eyes and have someone stand in front of or behind you, or to one side. When that person speaks or makes any sound, you can point to them even though you can’t see them. You know where they are because your psychoacoustic system can differentiate sounds arriving from different directions.
The object of a surround sound system is to bring this experience to you in audio reproduction, which is why we have five full sound channels. (The center, or dialogue, channel actually is used more to allow proper location over a wide range of places in the listening room.) Stereo is a huge improvement over mono, but it can only locate sounds in front of you. Surround sound gives more of a sense of being there, by allowing sounds to come to you from all sides. (I wonder when they will add the third dimension...)
It is interesting, though off the subject, to think about what properties of the sound are used to allow you to locate sound at all angles around you. Intuitively, you might expect the brain to use relative amplitude and phase of sounds arriving at each ear to distinguish where the sound originates. But this does not explain how you can differentiate sounds directly in front from sounds directly behind. This ability is thought to have something to do with the shape of the outer ear—it filters the signal in a way that allows the brain to locate sound. A lot of work has been conducted in recent years toward understanding this filtering.
A humorous conclusion
Last January, in our sister publication, CT International (then known as International Cable), I addressed a different type of audio compression used in IP telephony and elsewhere. This compression algorithm is called linear predictive coding (LPC), and it essentially works by modeling speech as an excitation signal shaped by a filter that mimics the vocal tract. It works fairly well for speech but does not work well for music and other sounds. While researching that column, I ran across an interesting book, and I was reminded of it while preparing this material.
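For the curious, here is a bare-bones sketch of the LPC idea: predict each sample as a weighted sum of the previous few, solve for the weights, and note how small the leftover residual is for a predictable signal. It is only an illustration of the principle, not the codec used in IP telephony.

```python
import numpy as np

# Bare-bones linear prediction: model each sample as a weighted sum of the
# previous `order` samples, solve for the weights by least squares, and
# inspect the leftover (residual). Real speech codecs add much more.

def lpc_coefficients(signal, order=8):
    # Build the prediction problem: rows of past samples -> next sample.
    rows = np.array([signal[i:i + order] for i in range(len(signal) - order)])
    targets = signal[order:]
    coeffs, *_ = np.linalg.lstsq(rows, targets, rcond=None)
    residual = targets - rows @ coeffs
    return coeffs, residual

t = np.arange(480) / 48_000
speechlike = np.sin(2 * np.pi * 200 * t) + 0.3 * np.sin(2 * np.pi * 700 * t)
coeffs, residual = lpc_coefficients(speechlike)
print(np.max(np.abs(residual)))   # tiny: two sinusoids are very predictable
```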
Michael D. Alder, a professor at the University of Western Australia, has posted a book covering pattern recognition—An Introduction to Pattern Recognition: Statistical, Neural Net and Syntactic Methods of Getting Robots to See and Hear—on the Web.1 He covers LPC in this book, and in the introduction to that chapter makes an observation that is most apropos, and with which this author heartily agrees:
"Once the reader understands that this is desperation city, and that things are done this way because they can be, rather than because there is a solid rationale, he or she may feel much more cheerful about things. For speech, there is a theory that regards the vocal tract as a sequence of resonators made up out of something deformable, and which can, in consequence, present some sort of justification for linear predictive coding. In general, the innocent beginner finds an extraordinary emphasis on linear models throughout physics, engineering and statistics, and may innocently believe that this is because life is generally linear. It is actually because we know how to do the sums in these cases. Sometimes, it more or less works.
And they say that RF engineering is black magic!
Additional references
There are a number of references available on the subject of audio compression. The basic specification is ATSC A/52. The entire suite of ATSC specifications is one of the best bargains in the whole world of standards: you can download them for free from www.atsc.org. The problem with reading the specifications themselves is that they are not intended to provide understanding of the principles, but rather assume you understand what is going on, and that you need the gory details.
Dr. Michael Isnardi of Sarnoff Labs is an outstanding teacher and writer on the subject of compression. He has presented several superb tutorials on the subject at the IEEE Consumer Electronics Society’s annual conference—the International Conference on Consumer Electronics (www.icce.org). Unfortunately, I don’t believe he is going to be presenting at the 2001 conference—in a rare moment of weakness, he agreed to chair the program committee. I have used several of his past presentations in preparing this material. If you have access to any of his writings, grab them and study them because he offers invaluable information.
The Digital Consumer Electronics Handbook, by Ronald K. Jurgen (McGraw-Hill, ISBN #0-07-034143-5), covers audio compression in a fairly understandable and readable manner. The coverage is a bit brief, but is useful.
Jim Farmer is chief technical officer of Wave7 Optics, Inc. He may be reached at .
The Art of Audio
Audio compression as practiced in North America uses a technique developed by Dolby Labs called AC-3. It works on the principle of sub-band encoding, where the signal is broken into narrow bands, and each band is analyzed for content. Bands that contribute little to the content are either not transmitted or are transmitted at lower resolution.