[These notes will eventually become part of a tech talk on video encoding. List of all articles in this series.]
Unless you’re going to stick to films made before 1927 or so, you’re going to want an audio track. A future article will talk about how to pick the audio codec that’s right for you, but for now I just want to introduce the concept and describe the playing field. (This information is likely to go out of date quickly; future readers, be aware that this was written in December 2008.)
Like video codecs, audio codecs are algorithms by which an audio stream is encoded. Like video codecs, there are lossy and lossless audio codecs. Today’s article will only deal with lossy audio codecs. Actually, it’s even narrower than that, because there are different categories of lossy audio codecs. Audio is used in many places where video is not (telephony, for example), and there is an entire category of audio codecs optimized for encoding speech. You wouldn’t rip a music CD with these codecs, because the result would sound like a 4-year-old singing into a speakerphone. But you would use them in an Asterisk PBX, because bandwidth is precious, and these codecs can compress human speech into a fraction of the size of general-purpose codecs.
And that’s all I have to say about speech-optimized audio codecs. Onward…
As I mentioned in part 2: lossy video codecs, when you “watch a video,” your player software is doing several things at once:
1. Interpreting the container format
2. Decoding the video stream
3. Decoding the audio stream and sending the sound to your speakers
The audio codec specifies how to do #3 — decoding the audio stream and turning it into digital waveforms that your speakers then turn into sound. As with video codecs, there are all sorts of tricks to minimize the amount of information stored in the audio stream. And since we’re talking about lossy audio codecs, information is being lost during the recording → encoding → decoding → listening lifecycle. Different audio codecs throw away different things, but they all have the same purpose: to trick your ears into not noticing the parts that are missing.
One concept that audio has that video does not is channels. We’re sending sound to your speakers, right? Well, how many speakers do you have? If you’re sitting at your computer, you may only have two: one on the left and one on the right. My desktop has three: left, right, and one more on the floor. So-called “surround sound” systems can have six or more speakers, strategically placed around the room. Each speaker is fed a particular channel of the original recording. The theory is that you can sit in the middle of the six speakers, literally surrounded by six separate channels of sound, and your brain synthesizes them and feels like you’re in the middle of the action. Does it work? A multi-billion-dollar industry seems to think so.
Most general-purpose audio codecs can handle two channels of sound. During recording, the sound is split into left and right channels; during encoding, both channels are stored in the same audio stream; during decoding, both channels are decoded and each is sent to the appropriate speaker. Some audio codecs can handle more than two channels, and they keep track of which channel is which and so your player can send the right sound to the right speaker.
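To make the "both channels in the same stream" idea concrete, here is a minimal sketch of how uncompressed stereo samples are commonly interleaved (left, right, left, right, and so on) and then split apart again. This is only an illustration of keeping track of which channel is which. Real lossy codecs store their data very differently, and the function names and sample values below are made up for the example.

```python
def interleave(left, right):
    """Merge two equal-length channel sample lists into one L,R,L,R,... stream."""
    if len(left) != len(right):
        raise ValueError("both channels must have the same number of samples")
    stream = []
    for l_sample, r_sample in zip(left, right):
        stream.extend((l_sample, r_sample))
    return stream


def deinterleave(stream, channels=2):
    """Split an interleaved stream back into one sample list per channel."""
    if len(stream) % channels != 0:
        raise ValueError("stream length is not a multiple of the channel count")
    # Channel i owns every `channels`-th sample, starting at offset i.
    return [stream[i::channels] for i in range(channels)]


# Hypothetical sample values, just to show the round trip.
left = [10, 11, 12]
right = [-10, -11, -12]
stream = interleave(left, right)
decoded_left, decoded_right = deinterleave(stream)
```

The same slicing trick works for any channel count, which is roughly what "keeping track of which channel is which" amounts to once you go beyond two speakers.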
There are lots of audio codecs. Did I say there were lots of video codecs? Forget that. There are a metric fuck-ton of audio codecs. These are the ones you need to know about:
MPEG-1 Audio Layer 3
…colloquially known as “MP3.” If you haven’t heard of MP3s, I don’t know what to do with you. Walmart sells portable music players and calls them “MP3 players.” Walmart. Anyway…
MP3s can contain up to 2 channels of sound. They can be encoded at different bitrates: 64 kbps, 128 kbps, 192 kbps, and a variety of others from 32 to 320. Higher bitrates mean larger file sizes and better quality audio, although the relationship between audio quality and bitrate is not linear. (128 kbps sounds more than twice as good as 64 kbps, but 256 kbps doesn’t sound twice as good as 128 kbps.) Furthermore, the MP3 format allows for variable bitrate encoding, which means that some parts of the encoded stream are compressed more than others. For example, silence between notes can be encoded at a very low bitrate, then the bitrate can spike up a moment later when multiple instruments start playing a complex chord. MP3s can also be encoded with a constant bitrate, which, unsurprisingly, is called constant bitrate encoding.
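The bitrate arithmetic is simple enough to sketch. The snippet below estimates the size of a constant bitrate stream and the average bitrate of a variable bitrate one. It assumes "kbps" means 1000 bits per second and ignores container overhead and metadata. The function names and the example durations are invented for illustration, and real VBR encoders vary the bitrate frame by frame rather than in two tidy segments.

```python
def encoded_size_bytes(bitrate_kbps, seconds):
    """Approximate size of a constant-bitrate stream (1 kbps = 1000 bits/s)."""
    return bitrate_kbps * 1000 * seconds // 8


def average_bitrate_kbps(segments):
    """Average bitrate of a stream described as (bitrate_kbps, seconds) pairs."""
    total_bits = sum(kbps * 1000 * secs for kbps, secs in segments)
    total_secs = sum(secs for _, secs in segments)
    return total_bits / total_secs / 1000


song_seconds = 240  # a hypothetical four-minute song
for kbps in (64, 128, 320):
    print(kbps, "kbps:", encoded_size_bytes(kbps, song_seconds), "bytes")

# Hypothetical VBR stream: a quiet minute at 32 kbps, three busy minutes at 256 kbps.
vbr_average = average_bitrate_kbps([(32, 60), (256, 180)])
print("VBR average:", vbr_average, "kbps")
```

Doubling the bitrate exactly doubles the size, which is why the nonlinear quality payoff matters: past a certain point you are paying full price in bytes for a much smaller improvement in sound.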
The MP3 standard doesn’t define exactly how to encode MP3s (although it does define exactly how to decode them); different encoders use different psychoacoustic models that produce wildly different results, but are all decodable by the same players. The open source LAME project is the best free encoder, and arguably the best encoder period for all but the lowest bitrates.
The MP3 format was standardized in 1991 and is patent-encumbered, which explains why Linux can’t play MP3 files out of the box. Pretty much every portable music player supports standalone MP3 files, and MP3 audio streams can be embedded in any video container. Adobe Flash can play both standalone MP3 files and MP3 audio streams within an MP4 video container.
Advanced Audio Coding
…affectionately known as “AAC.” Standardized in 1997, it lurched into prominence when Apple chose it as their default format for the iTunes Store. Originally, all AAC files “bought” from the iTunes Store were encrypted with Apple’s proprietary DRM scheme, called FairPlay. Many songs in the iTunes Store are now available as unprotected AAC files, which Apple calls “iTunes Plus” because it sounds so much better than calling everything else “iTunes Minus.” The AAC format is patent-encumbered; licensing rates are available online.
AAC was designed to provide better sound quality than MP3 at the same bitrate, and it can encode audio at any bitrate. (MP3 is limited to a fixed number of bitrates, with an upper bound of 320 kbps.) AAC can encode up to 48 channels of sound, although in practice no one does that. The AAC format also differs from MP3 in defining multiple profiles, in much the same way as H.264, and for the same reasons. The “low-complexity” profile is designed to be playable in real-time on devices with limited CPU power, while higher profiles offer better sound quality at the same bitrate at the expense of slower encoding and decoding.
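Here is a small sketch of what "a fixed number of bitrates" means in practice. The list below is the set of MPEG-1 Layer III bitrates as I understand it from the MPEG-1 audio specification. It is not taken from this article, so treat it as an assumption, along with the helper name. Given a target bitrate, an MP3 encoder has to settle for a value from the list, whereas AAC can aim for the target directly.

```python
# Assumed MPEG-1 Layer III bitrate table, in kbps (not from the article itself).
MP3_BITRATES_KBPS = [32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320]


def nearest_mp3_bitrate(target_kbps):
    """Return the allowed MP3 bitrate closest to the requested one.

    Ties go to the lower bitrate, because min() keeps the first best match
    and the table is sorted in ascending order.
    """
    return min(MP3_BITRATES_KBPS, key=lambda kbps: abs(kbps - target_kbps))


for target in (100, 200, 500):
    print(target, "kbps requested ->", nearest_mp3_bitrate(target), "kbps available")
```

Variable bitrate encoding softens this restriction somewhat, since the average over a whole file can land between two table entries even though each individual frame cannot.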
All current Apple products, including iPods, AppleTV, and QuickTime, support certain profiles of AAC in standalone audio files and in audio streams in an MP4 video container. Adobe Flash supports all profiles of AAC in MP4, as do the open source mplayer and VLC video players. For encoding, the FAAC library is the open source option; support for it is a compile-time option in mencoder and ffmpeg. (I’ll dive into all the different encoding tools in a future article.)
Windows Media Audio
…a.k.a. “WMA.” As you might guess from the name, Windows Media Audio was developed by Microsoft. The acronym “WMA” has historically referred to many different things: a lossless audio codec (“WMA Lossless”), a speech-optimized codec (“WMA Voice”), and several different lossy audio codecs (“WMA 1″, “WMA 2″, “WMA 7″, “WMA 8″, “WMA 9″, and “WMA Pro”). It is also (incorrectly) used to refer to the Advanced Systems Format, because WMA-encoded audio streams are usually embedded in an ASF container. Roughly speaking, the lossy audio codecs (WMA 1-9) compete with MP3 and low-complexity AAC; WMA Lossless competes with Apple Lossless and FLAC; WMA Pro competes with high-complexity AAC, Vorbis, AC-3, and DTS.
All the different codecs under the “WMA” brand are playable with Windows Media Player, which comes pre-installed on desktops and laptops running Microsoft Windows XP and Vista. Portable devices like the Zune and the ironically named “PlaysForSure” devices can play WMA 1-9; stores that allow you to “purchase” WMA files generally encrypt them with a Microsoft-proprietary DRM scheme. The open source ffmpeg project can play WMA 1-9, and Flip4Mac offers a commercial QuickTime component to encode and decode WMA audio on Mac OS X.
WMA 1-9 support up to 2 channels of sound; WMA Pro supports up to 8 channels of sound. All WMA formats are patent-encumbered; licensing information is available from Microsoft.
Vorbis
…known to many as “Ogg Vorbis,” although for some reason that pisses off both Ogg and Vorbis advocates. (Technically, “Ogg” is a container format, and Vorbis audio streams can be embedded in other containers.) Vorbis is not encumbered by any known patents and is therefore supported out-of-the-box by all major Linux distributions and by portable devices running the open source Rockbox firmware. Mozilla Firefox 3.1 will support Vorbis audio files in an Ogg container, or Ogg videos with a Vorbis audio track. Android mobile phones can also play standalone Vorbis audio files. Vorbis audio streams are usually embedded in an Ogg container, but they can also be embedded in an MP4 or MKV container (or, with some hacking, in AVI).
There are open source Vorbis encoders and decoders, including OggConvert (encoder), ffmpeg (decoder), aoTuV (encoder), and libvorbis (decoder). There are also QuickTime components for Mac OS X and DirectShow filters for Windows.
Vorbis supports an arbitrary number of sound channels.
Dolby Digital
…a.k.a. “AC-3.” AC-3 was developed by Dolby Laboratories. AC-3 is most well-known for being a mandatory format in the DVD standard; all DVD players must be able to decode AC-3 audio streams. It is also mandatory for Blu-Ray players, and many digital TV broadcasts send AC-3 audio streams as well. AC-3 supports up to 6 channels of sound and bitrates of up to 640 kbps, although its most popular application — audio on DVDs — is officially limited to 448 kbps. (Blu-Ray discs may use the maximum 640 kbps.)
There are open source encoders and decoders for AC-3, including liba52 (decoding), AC3Filter (decoding), and Aften (encoding). ffmpeg has a compile-time option to include liba52, which will allow all ffmpeg-based players and plugin chains (like GStreamer) to play AC-3 audio streams. However, the AC-3 format is patent-encumbered; licensing is brokered by Dolby Laboratories.
AC-3 is rarely seen in standalone audio files; it is designed to be embedded in a video container. Other than DVDs and Blu-Ray discs (which use a video container format I haven’t talked about yet), you can embed AC-3 audio streams in MKV, AVI, and — just standardized earlier this year — in MP4 files (discussion). Apple’s AppleTV set-top box is the only hardware device I know of that supports AC-3 in MP4; you can encode AppleTV-compatible AC3-in-MP4 videos with HandBrake, or manually insert AC-3 audio into existing MP4 files with this Windows-only fork of mp4creator.
Digital Theater System
…a.k.a. “DTS.” As you might guess from the name, DTS is designed for real-life movie theaters. Like WMA, “DTS” is a brand name for a family of different audio formats. The “core” DTS format supports up to six channels; later extensions like DTS-HD support up to eight channels. There is also DTS-HD Master Audio, a lossless variant by the same company. Core DTS is designed for high bitrates (up to 1536 kbps, which is virtually indistinguishable from being there in the first place). DTS-HD Master Audio bitrates can go even higher, although at some point even audiophiles will wonder why they should bother.
Core DTS was not originally part of the DVD specification, so early DVD players did not support it. Most recent DVD players support natively decoding core DTS audio or passing the audio stream through to an external speaker system which decodes it, but relatively few DVDs include a DTS stream due to size constraints. Core DTS is a mandatory part of the Blu-Ray specification, and many Blu-Ray discs include a DTS audio track — sometimes the exact same stream that was originally played in the movie theater. (DTS-HD Master Audio is an optional part of the Blu-Ray specification, but few Blu-Ray discs include it due to — you guessed it — size constraints.)
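A quick back-of-the-envelope calculation shows why "size constraints" keep DTS off most DVDs. The sketch below compares a DVD-limit AC-3 track (448 kbps) with a full-rate core DTS track (1536 kbps) for a hypothetical two-hour film. The 4.7 GB single-layer disc capacity is my assumption, not a figure from this article, and the function names are invented for the example. Video, menus, and other audio tracks all have to fit in whatever is left.

```python
DVD_SINGLE_LAYER_BYTES = 4_700_000_000  # assumption: nominal 4.7 GB single-layer DVD


def track_bytes(bitrate_kbps, minutes):
    """Approximate size of an audio track (1 kbps = 1000 bits/s)."""
    return bitrate_kbps * 1000 * minutes * 60 // 8


def share_of_disc(bitrate_kbps, minutes, disc_bytes=DVD_SINGLE_LAYER_BYTES):
    """Fraction of the disc that a single audio track would occupy."""
    return track_bytes(bitrate_kbps, minutes) / disc_bytes


movie_minutes = 120  # hypothetical two-hour film
for name, kbps in [("AC-3 at 448 kbps", 448), ("core DTS at 1536 kbps", 1536)]:
    size_mb = track_bytes(kbps, movie_minutes) / 1_000_000
    share = share_of_disc(kbps, movie_minutes)
    print(f"{name}: {size_mb:.0f} MB, {share:.1%} of the disc")
```

Under these assumptions the AC-3 track takes under a tenth of the disc, while the DTS track takes more than a quarter of it before a single frame of video has been stored.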
DTS is patent-encumbered; licensing is brokered by DTS, Inc.
As with everything else in this series, this article barely scratches the surface. (Really!) If you like, you can read about other audio codecs: ATRAC, Musepack, MP2, RealAudio, AMR, ADPCM, and so forth and so on. Wikipedia has a comparison of common audio codecs, HydrogenAudio has lots of technical details, and wiki.multimedia.cx is always your friend too.
Tomorrow: subtitles!
§
I think there is a typo here: “DTS is patent-encumbered; licensing is brokered by Dolby Laboratories.“
— Bilgehan
Another AAC encoder is available from Nero’s website. It’s both free and legal since I guess Nero is paying the license fees, and the sound quality has generally tested higher than FAAC. (Windows executables only though.)
Subtitles? Wow, I didn’t think you’d get into that rathole. This is what happens when there are no good standards available, so bunches of anime junkies create their own.
Thanks for the nice overview, I’ll bookmark it to refer friends to when they ask. One thing I’d perhaps suggest is adding a reference to http://wiki.hydrogenaudio.org/index.php?title=Category:Codecs – a much richer resource than wiki.multimedia.cx. Best wishes!
— yungchin
These articles are awesome. Clear and concise, just the thing to send to my friends when they ask about this stuff.
On a side note, there’s a small typo in this one: “durring encoding”.
@yungchin: added that link in the final paragraph.
@Alex: fixed, thanks.
— Mark
I originally learnt a lot of this stuff from this page: http://mewiki.project357.com/wiki/Computer_movie_files
I do this for creating videos to play in flash. Thank god it’s not as hard as it used to be :D
— nine
Rather major nitpick: DTS is owned and licensed by DTS Inc., not Dolby. Separate company, but still collecting a nickel for every DVD player sold (quite a good business to be in!)
@Rod: fixed, thanks!
— Mark
@Paul: actually, the Nero AAC Codec includes Linux binaries as well.
A funny note on DTS: plain DTS WAV files are almost always specified as PCM WAV files in the header. This is because old, broken CD-burning programs would choke on any non-PCM WAV files, and so now the de facto standard is to sniff the WAV file to make sure it’s not really a DTS WAV file, because the header field that was intended to make that unnecessary is worthless. I bet there’s a lesson in there somewhere.
Also, you mentioned LAME being the best available encoder; there’s some impressive-looking data on how well the encoder avoids certain types of artifacts which are endemic to MP3 encoding. The screenshots are quite nice.
I’m also a bit surprised that you didn’t even mention the craven actions of Nokia in squashing Vorbis inclusion in the HTML5 standard. I suppose it’s not that relevant, since it’s a done deal already, but it would have made the use of Vorbis for nearly any project a no-brainer.
Please strongly consider an article on lossless audio codecs. It seems like FLAC is the mindshare winner over SHN, but it is far from clear why. And then there is our good friend WAV….
I’m not sure if this is right, and it probably isn’t good advice anyway: “Adobe Flash can play both standalone MP3 files and MP3 audio streams within an MP4 video container.”
Usually for Flash people create H.263+MP3 in FLV, VP6+MP3 in FLV, or H.264+AAC in MP4. MP3 in MP4 might work, but why bother.
Paul Hoffman: It seems like FLAC is the mindshare winner over SHN, but it is far from clear why.
According to this comparison (admittedly two years old, but the gap can only widen), the top reasons are likely speed and compression rate, which are the primary metrics by which a lossless codec can be measured. The installed base is significant, of course, but for new applications, it makes sense to go with FLAC.
Also, Shorten, the encoder, is no longer actively developed, and is only free for noncommercial use, which violates DFSG #6.
I wasn’t around at the time, but I would guess that grendelkhan’s last sentence was the clincher. There’s a market for a free-as-in-speech lossless codec, and there’s a market for proprietary lossless codecs that come pre-installed on various platforms (Apple has ALAC, Microsoft has WMA Lossless). But once a free-as-in-speech codec exists and doesn’t suck, I can’t imagine there would be much of a market for an only-free-as-in-beer codec.
— Mark
Free-as-in-beer and free-as-in-speech considered harmful. Also, where’s the preview button?
James must be new here.
James: Also, where’s the preview button?
I’m sure it’s in the chrome for people using cool browsers. People clearly not like me.
(On a less relevant note, since it’s a matter of personal taste, I always thought that the speech/beer distinction was intuitive, because I didn’t think simply of “beer” and “speech”, but rather “free beer” and “free speech”; the former made me think of material objects which cost no money, while the latter brought to mind various political and legal constructs.)
§
© 2001–9 Mark Pilgrim