[Part of an ongoing series.]
The first thing you need to know about captions and subtitles is that captions and subtitles are different. The second thing you need to know about captions and subtitles is that you can safely ignore the differences unless you’re creating your own from scratch. I’m going to use the terms interchangeably throughout this article, which will probably drive you crazy if you happen to know and care about the difference.
Historically, captioning has been driven by the needs of deaf and hearing impaired consumers, and captioning technology has been designed around the technical quirks of broadcast television. In the United States, so-called “closed captions” are embedded into a part of the NTSC video source (“Line 21”) that is normally outside the viewing area on televisions. In Europe, they use a completely different system that is embeddable in the PAL video source. Over time, each new medium (VHS, DVD, and now online digital video) has dealt a blow to the accessibility gains of the previous medium. For example:
And accessible online video is just fucking hopeless. (And no, it won’t change unless new regulation forces it to change. When it comes to captioning, Joe Clark has been right longer than many of you have been alive.)
So even in broadcast television, captioning technology was fractured by different broadcast technologies in different countries. Digital video had the capability of unifying the technologies and learning from their mistakes. Of course, exactly the opposite happened. Early caption formats split along company lines; each major video software platform (RealPlayer, QuickTime, Windows Media, Adobe Flash) implemented captioning in its own way, with levels of adoption ranging from nil to zilch. At the same time, an entire subculture developed around “fan-subbing,” i.e. using captioning technology to provide translations of foreign language videos. For example, non-Japanese-speaking consumers wanted to watch Japanese anime films, so amateur translators stepped up to publish their own English captions that could be overlaid onto the original film. In the 1980s, fansubbers would actually take VHS tapes and overlay the English captions onto a new tape, which they would then (illegally) distribute. Nowadays, translators can simply publish their work on the Internet as a standalone file. English-speaking consumers can have their DVDs shipped directly from Japan, and they use software players that can overlay standalone English caption files while playing their Japanese-only DVDs. The legality of distributing these unofficial translations (even separately, in the form of standalone caption files) has been disputed in recent years, but the fansubbing community persists.
Technically, there is a lot of variation in captioning formats. At their core, captions are a combination of text to display, start and end times to display it, information about where to position the text on a screen, fonts, styling, alignment, and so on. Some captions roll up from the bottom of the screen, others simply appear and disappear at the appropriate time. Some caption formats mandate where each caption should be placed and how it should be styled; others merely suggest position and styling; others leave all display attributes entirely up to the player. Almost every conceivable combination of these variables has been tried. Some forms of media try multiple combinations at once. DVDs, for example, can have two entirely distinct forms of captioning — closed captioning (as used in NTSC broadcast television) embedded in the video stream, and one or more subtitle tracks. DVD subtitle tracks are used for many different things, including subtitles (just the words being spoken, in the same language as the audio), captions for the hearing impaired (which include extra notations of background noises and such), translations into other languages, and director’s commentary. Oh, and they’re stored on the DVD as images, not text, so the end user has no control over fonts or font size.
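If that list of ingredients sounds abstract, here is a minimal sketch in Python of what a single caption boils down to. The `Cue` class, its field names, and the `active_cues` helper are all made up for illustration; no real caption format or player defines them this way. The optional fields stand in for formats that merely suggest position and styling, or leave them entirely up to the player.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cue:
    """One caption: what to show, when, and (optionally) how."""
    start_ms: int                    # when the text appears
    end_ms: int                      # when the text disappears
    text: str                        # the words on screen
    position: Optional[str] = None   # e.g. "top"; None means the player decides
    style: Optional[str] = None      # e.g. "italic"; None means the player decides


def active_cues(cues, now_ms):
    """Return the cues that should be on screen at time now_ms."""
    return [c for c in cues if c.start_ms <= now_ms < c.end_ms]


cues = [
    Cue(1000, 3000, "Hello."),
    Cue(2500, 5000, "[door slams]", position="top"),
]
print([c.text for c in active_cues(cues, 2700)])
```

Everything else in a caption format is, roughly speaking, an argument about how many of those optional fields exist and who gets the final say over them.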
Beyond DVDs, most caption formats store the captions as text, which inevitably raises the issue of character encoding. Some caption formats explicitly specify the character encoding, others only allow UTF-8, others don’t specify any encoding at all. On the player side, most players respect the character encoding if present (but may only support specific encodings); in its absence, some players assume UTF-8, some guess the encoding, and some allow the user to override the encoding. Obviously standalone caption files can be in any format, but if you want to embed your captions as a track within a video container, your choices are limited to the caption formats that the video container supports.
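The player-side behavior described above (respect a declared encoding, otherwise assume UTF-8, otherwise guess) can be sketched in a few lines of Python. This is not how any particular player does it; `decode_captions` is a hypothetical helper, and falling back to Windows-1252 is my assumption about what a reasonable "guess" looks like.

```python
from typing import Optional


def decode_captions(raw: bytes, declared: Optional[str] = None) -> str:
    """Turn the bytes of a text-based caption file into a string."""
    if declared:
        # The format (or the user) named an encoding, so respect it.
        return raw.decode(declared)
    try:
        # No declaration: assume UTF-8 ("utf-8-sig" also drops a leading BOM).
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # ASSUMPTION: guess Windows-1252; undefined bytes become U+FFFD.
        return raw.decode("cp1252", errors="replace")


print(decode_captions("café".encode("utf-8")))
print(decode_captions("café".encode("cp1252")))
```

Trying UTF-8 first works because text in most legacy 8-bit encodings rarely happens to be valid UTF-8, so a clean decode is a decent hint that the guess was right.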
And remember when I said that there were a metric fuck-ton of audio codecs? Forget that. There are an imperial fuck-ton of caption formats (i.e. multiply by 9/5 and add 32). Here is a partial list of caption formats, taken from the list of formats supported by Subtitle Workshop, which I used to caption my short-lived video podcast series:
Adobe Encore DVD, Advanced SubStation Alpha, AQTitle, Captions 32, Captions DAT, Captions DAT Text, Captions Inc., Cheetah, CPC-600, DKS Subtitle Format, DVD Junior, DVD Studio Pro, DVD Subtitle System, DVDSubtitle, FAB Subtitler, IAuthor Script, Inscriber CG, JACOSub 2.7+, Karaoke Lyrics LRC, Karaoke Lyrics VKT, KoalaPlayer, MacSUB, MicroDVD, MPlayer, MPlayer2, MPSub, OVR Script, Panimator, Philips SVCD Designer, Phoenix Japanimation Society, Pinnacle Impression, PowerDivX, PowerPixel, QuickTime Text, RealTime, SAMI Captioning, Sasami Script, SBT, Sofni, Softitler RTF, SonicDVD Creator, Sonic Scenarist, Spruce DVDMaestro, Spruce Subtitle File, Stream SubText Player, Stream SubText Script, SubCreator 1.x, SubRip, SubSonic, SubStation Alpha, SubViewer 1.0, SubViewer 2.0, TMPlayer, Turbo Titler, Ulead DVD Workshop 2.0, ViPlay Subtitle File, ZeroG.
Which of these formats are important? The answer will depend on whom you ask, and more specifically, how you’re planning to distribute your video. This series is primarily focused on videos delivered as files to be played on PCs or other computing devices, so my choices here will reflect that. These are some of the most well-supported caption formats:
SubRip is the AVI of caption formats, in the sense that its basic functionality is supported everywhere but various people have tried to extend it in mostly incompatible ways and the result is a huge mess. As a standalone file, SubRip captions are most commonly seen with a .srt extension. SubRip is a text-based format which can include font, size, and position information, as well as a limited set of HTML formatting tags, although most of these features are poorly supported. Its “official” specification is a doom9 forum post from 2004. Most players assume that .srt files are encoded in Windows-1252 (what Windows programs frequently call “ANSI”), although some can detect and switch to UTF-8 encoding automatically.
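To give a feel for the "basic functionality" that is supported everywhere, here is a rough Python sketch that reads the baseline layout you will find in most .srt files in the wild: a counter line, a timing line, one or more lines of text, then a blank line. That layout is my description of common practice, not a quote from any spec, and `parse_srt` is a hypothetical helper. It ignores position information and leaves formatting tags untouched in the text.

```python
import re

# Timing line, e.g. "00:00:01,000 --> 00:00:03,500"
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def to_ms(h, m, s, ms):
    """Convert hours, minutes, seconds, milliseconds to milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def parse_srt(text):
    """Parse basic SubRip text into (start_ms, end_ms, text) tuples."""
    cues = []
    # Caption blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.replace("\r\n", "\n").strip()):
        lines = block.split("\n")
        # Find the timing line; the counter line before it is ignored.
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                g = match.groups()
                body = "\n".join(lines[i + 1:])
                cues.append((to_ms(*g[:4]), to_ms(*g[4:]), body))
                break
    return cues


sample = """1
00:00:01,000 --> 00:00:03,500
Hello, world.

2
00:00:04,000 --> 00:00:06,000
<i>Second caption,</i>
on two lines.
"""
print(parse_srt(sample))
```

Note that the sketch takes text, not bytes, so the encoding guesswork has to happen before you get here.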
Because .srt files are so often published separately from the video files they describe, the most common use case is to put your .srt file in the same directory as your video file and give them the same name (up to the file extensions). But it is also possible to embed SubRip captions directly into AVI files with AVI-Mux GUI, into MKV files with mkvmerge, and into MP4 files with MP4Box.
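The same-name convention is simple enough to sketch in Python. `find_sidecar_captions` is a hypothetical helper, not part of any player; real players may also match case-insensitively or look for language suffixes, which this does not attempt.

```python
import tempfile
from pathlib import Path


def find_sidecar_captions(video_path, extensions=(".srt",)):
    """Return caption files that sit next to the video and share its name."""
    video = Path(video_path)
    matches = []
    for ext in extensions:
        # Same directory, same name, different extension.
        candidate = video.with_suffix(ext)
        if candidate.is_file():
            matches.append(candidate)
    return matches


# Demo in a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "movie.avi").touch()
    (Path(tmp) / "movie.srt").touch()
    print([p.name for p in find_sidecar_captions(Path(tmp) / "movie.avi")])
```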
You can play SubRip captions in Windows Media Player or other DirectShow-based video players after installing VSFilter; in QuickTime after installing Perian; on Linux, both mplayer and VLC support it natively.
SubStation Alpha and its successor, Advanced SubStation Alpha, are the preferred caption formats of the fansubbing community. As standalone files, they are commonly seen with .ssa or .ass extensions. They have a spec longer than three paragraphs. They are actually miniature scripting languages. A .ass file contains a series of commands to control position, scrolling, animation, font, size, scaling, letter spacing, borders, text outline, text shadow, alignment, and so on; and a series of time-coded events for displaying text given the current styling parameters. It has support for multiple character encodings.
The playing requirements for SubStation Alpha captions are almost identical to SubRip. The same plugins are required for Windows and Mac OS X. On Linux, mplayer prides itself on having the most complete SSA/ASS implementation.
MPEG-4 Timed Text (hereafter “MP4TT”), a.k.a. “MPEG-4 Part 17,” a.k.a. ISO 14496-17, is the one and only caption format for the MP4 container. It is not a file format; it is only defined in terms of a track within an MP4 container. As such, it can not be embedded in any other video container, and it can not exist as a separate file. (Note: the last sentence was a lie; the MPEG-4 Timed Text format is really the 3GPP Timed Text format, and it can very much be embedded in a 3GPP container. What I meant to say is that the format can not be embedded in any of the other popular video container formats like AVI, MKV, or OGG. I could go on about the subtle differences between MPEG-4 Timed Text in an MP4 container and 3GPP Timed Text in a 3GPP container, but it would just make you cry, and besides, technical accuracy is for pussies.)
MP4TT defines detailed information on text positioning, fonts, styles, scrolling, and text justification. These details are encoded into the track at authoring time, and can not be changed by the end user’s video player. The most readable description of its features is actually the documentation for GPAC, an open source implementation of much of the MPEG-4 specification (including MP4TT). Since MP4TT doesn’t define a text-based serialization, GPAC invented one for their own use; since their format is designed to capture all the possible information in an MP4TT track, it turns out to be an easy way to read about all of MP4TT’s features.
MP4Box, part of the GPAC project, can take an .srt file and convert it into a MPEG-4 Timed Text track and embed it in an existing MP4 file. It can also reverse the process — extract a Timed Text track from an MP4 file and output a .srt file.
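For the round trip described above, here is a small Python sketch that only builds the command lines and prints them, so nothing is executed. The `-add` and `-srt` options are as I understand them from the GPAC documentation; treat them as an assumption and check `MP4Box -h` for your version. The helper names, file names, and track ID 3 are placeholders.

```python
import shlex


def embed_srt_command(srt_path, mp4_path):
    """Command to import an .srt file as a text track in an existing MP4."""
    # ASSUMPTION: "-add" imports the given file as a new track.
    return ["MP4Box", "-add", str(srt_path), str(mp4_path)]


def extract_srt_command(mp4_path, track_id):
    """Command to dump one text track from an MP4 file as .srt."""
    # ASSUMPTION: "-srt <trackID>" extracts that track in SubRip form.
    return ["MP4Box", "-srt", str(track_id), str(mp4_path)]


# Hypothetical file names and track ID, printed rather than run.
print(shlex.join(embed_srt_command("movie.srt", "movie.mp4")))
print(shlex.join(extract_srt_command("movie.mp4", 3)))
```

To actually run one of these, pass the list to `subprocess.run(..., check=True)` on a machine where MP4Box is installed.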
On Mac OS X, QuickTime supports MP4TT tracks within an MP4 container, but only if you rename the file from .mp4 to .3gp or .m4v. I shit you not. (On the plus side, changing the file extension will allow you to sync compatible video to an iPod or iPhone, which will actually display the captions. Still not kidding.) On Windows, any DirectShow-based video player (such as Windows Media Player or Media Player Classic) supports MP4TT tracks once you install Haali Media Splitter. On Linux, VLC has supported MP4TT tracks for several years.
SAMI was Microsoft’s first attempt to create a captioning format for PC video files (as opposed to broadcast television or DVDs). As such, it is natively supported by Microsoft video players, including Windows Media Player, without the need for third-party plugins. It has a specification on MSDN. It is a text-based format that supports a large subset of HTML formatting tags. SAMI captions are almost always embedded in an ASF container, along with Windows Media video and Windows Media audio.
Don’t use SAMI for new projects; it has been superseded by SMIL. For historical purposes, you may enjoy reading about creating SAMI captions and embedding them in an ASF container, as long as you promise to never, ever try it at home.
SMIL (Synchronized Multimedia Integration Language) is not actually a captioning format. It is “an XML-based language that allows authors to write interactive multimedia presentations.” It also happens to have a timing and synchronization module that can, in theory, be used to display text on a series of moving pictures. That is to say, if you think of SMIL as a way to provide captions for a video, you’re doing it wrong. You need to invert your thinking — your video and your captions are each merely components of a SMIL presentation. SMIL captions are not embedded into a video container; the video and its captions are referenced from a SMIL document.
SMIL is a W3C standard; the most recent revision, SMIL 3.0, was just published in December 2008. If you printed out the SMIL 3.0 specification on US-Letter-sized paper, it would weigh in at 395 pages. So don’t do that.
QuickTime supports a subset of SMIL 1.0. WebAIM provides a nice tutorial on using SMIL to add captions to a QuickTime movie.
§
CMML and Kate are the subtitle technologies promoted by Xiph.org Foundation.
Jeez, you are not making this any easier you know… I was about to get started on some captioning recommendations but I’m back to the drawing board thanks to you:-)
@Mark, I’ve been told that 3GPP TimedText encodes a subset of W3C TimedText in MP4. So wouldn’t that make (a subset of) DFXP the standalone file version of 3GPP TimedText?
@Matthew, it seems to me that neither CMML nor Kate is really properly promoted as a subtitle solution by Xiph. Instead, there’s an ongoing effort to solve the captioning / subtitling issue in Ogg. See the archives of the mailing list called accessibility at xiph dot org.
My understanding is that CMML defines how to put metadata text into Ogg but doesn’t properly specify how to reuse the format for subtitles, and I hear that the single implementation of Kate is currently tied to Linux-oriented libraries, which might not be nice if you want to play it in a Windows or Mac OS X app.
P.S. I’d have included URLs, but the commenting system appeared to barf on URLs.
You’ve got a trailing ‘(‘ at the end of the second paragraph on SubRip. Is something missing there?
I just checked my VHS recordings and they do, in fact, retain the closed captions. Apparently, the bandwidth limitation does not apply to Super VHS. Who knew?
There’s a rogue round bracket in your SubRip section.
All NTSC home video formats (VHS, Beta, 8mm, laserdisc) could and did store closed captions. It was PAL formats that couldn’t (except for Super VHS). I have hundreds of closed-captioned videotapes. So does anyone who ever taped programming off the TV.
More care, please, Mark.
Indeed, VHS keeps NTSC field 21 intact. I used to tape ST:TNG in the late ’80s and rewatch them with captions (being deaf).
I recently discovered that, in terms of screen resolution, DVDs displayed in 480p, 720p or 1080p cannot carry captions, and I’m still not certain why this is the case. I have two players that can handle DVDs: a Philips DVP640 and a nicer, upscaling Toshiba HD-A2. The former is connected over component, the latter over HDMI. The former is the only one that can play captions, IFF the display is set to 480i. The latter doesn’t seem to be able to display 480i, so, basically, when I get a DVD from Netflix that doesn’t have English subtitles but has captions, I put it in the lesser DVD player. It’s a terrible arrangement.
I completely agree that the only solution that’ll stick is legislation.
@grendelkhan: fixed, thanks.
— Mark
@Joe: fixed, thanks.
— Mark
@Henri, re DFXP: I have no idea. http://gpac.sourceforge.net/doc_ttxt.php states “There is no official textual representation of a text stream. Moreover, the specification relies on IsoMedia knowledge for most structure descriptions.” I have no reason not to believe them.
— Mark
@Matthew: Henri is correct, the wiki pages that describe those formats make it clear that they are not endorsed by Xiph.org. For example, http://wiki.xiph.org/index.php/OggKate starts with “This is not a Xiph codec, though it may be embedded in Ogg alonside other Xiph codecs, such as Vorbis and Theora. As such, please do not assume that Xiph has anything to do with this, much less responsibility.”
— Mark
Based on my experience, I’ve only ever seen Koreans using SAMI or SMIL and anime-fansubbers using SSA or ASS. The rest of the non-commercial world uses SubRip and MicroDVD formats. Having dealt with many of these formats, both watching videos with subtitles and writing code to read and write subtitle files, I feel the need to rant about how much they suck.
SubRip is actually the only format that does not suck. It’s very simple, easy enough to edit with a text editor if needed. It does not contain anything excessive and is fairly easy to parse. No specification, but the syntax is obvious and of the markup tags, I’ve only seen italics and fonts used. The Extended SubRip format that specifies corner coordinates in pixels for each subtitle is rather mad and useless though.
MicroDVD is a decent frame based format with a bit more style definitions than SubRip. The big problem with MicroDVD is that markup tags affect either the whole subtitle or one entire line; it’s impossible to, for example, italicize only one word.
SSA and ASS are far too complex formats with a horrible syntax. If you take a look at the spec (google for ‘ass-specs.doc’) you’ll find things like fields that are not used, fields that even the spec writer does not know what it does, two different linebreaks: ‘\n’ and ‘\N’, their own version of the hexadecimal color string: “hexadecimal RGB value, but in reverse order. Leading zeroes are not required.” and even functions for drawing squares and circles.
In addition to the above, I have dealt with TMPlayer format that does not specify any end times for subtitles and MPL2, which has its own set of markup tags, but can also use markup tags of the MicroDVD format (this was the upgrade from MPL1). SAMI and SMIL I have managed to stay away from.
Many of these formats are completely lacking specs and often they come in different versions or other varieties with undocumented and incompatible differences. TMPlayer is the craziest; it apparently exists in five different varieties, all under the same name, with completely insignificant syntactical differences.
The main problem with most of these subtitle file formats is that they try to define far too much what the subtitles should look like. With subtitles that quickly disappear, readability is very important. The viewer should be able to choose the font, size, color, outline, default placement, etc. of the subtitles. The only effects I consider usable in subtitles are italics and possibly placement exceptions to avoid overlapping with other text on the screen. Even the placement should be defined just as Top, Bottom, Left, Right or something similar instead of pixel coordinates (as most formats do) which break when viewing with a different resolution or even a different font size. Stuff like rotation angle and letter-spacing should have nothing to do with subtitles.
HDMI cables kill Line 21 captions. There is no workaround.
If we’re doing corrections, let’s note that the example given of uncaptioned DVDs (some TV series) is true almost exclusively of B- and C-tier series. Due to company policies and a class-action lawsuit, TV series released by major labels always have captioning. Now, they may pointlessly recaption the episodes instead of using the original captions, but they’re there nearly all the time. This does not apply to Canada, which, as in so many things, remains half-assed.
QuickTime Player will also display 3GPP Timed Text streams if you rename the file’s extension to .m4v. These m4v files will offer working captions on iPhones and iPod touches, which leads to the salutary side effect of the captions not being overlaid over the content on wide aspect-ratio video.
Timed Text subtitles can be stripped into MP4 containers on Mac OS X using a program called Muxo, which is perpetually pre-release, does very poor bounds-checking (leading to broken video files until the busted text track is stripped out), and is generally not very polished, but which does the job.
Perian’s interpolation of subtitles from accompanying .ass, .ssa or .srt files is not 100% reliable — in fact, I haven’t been able to get it to work for months, using the latest stable release of Perian.
@Joe: I added a note about HDMI. Also changed “television shows” to “low-budget television shows.” Thanks.
— Mark
@Forrest: I added a note about .m4v and captions on iPhones. Thanks.
— Mark
@Forrest: there are unofficial binaries of MP4Box for Mac OS X/Darwin. Surely that’s a better solution than some pre-alpha hack that corrupts files.
— Mark
re: Xiph and subtitles
CMML has a version with explicit caption tags, but it didn’t get implemented, because we are now working on a better, more inclusive format for time-aligned text.
As for Kate being not endorsed – Xiph is very careful in promoting codecs that are immature. Kate has now been around for more than a year and has seen massive development. There are patches for almost all open source video players. Of the new codecs in development, Kate is probably the closest to getting endorsed. But it doesn’t mean that Kate is the one and only time-aligned text solution supported by Xiph. Kate is awesome when trying to get all time-aligned text into the Ogg container. It is not so awesome as an authoring and display format for the Web. Which is why there may well be an alternative soon. So you get a better choice depending on your aims.
Thanks, Silvia. That makes a lot of sense.
— Mark
Hey, Mark. Really enjoying this series, keep it up!
Can you cite a source for the fact that HDMI drops Line 21? Ran into this issue the other day at work and we couldn’t figure out if it was the playout device, the monitor, or the cable that was stripping the CC data. Would be nice to be able to point to something and say “Yup, it’s the cable.”
Thanks!
Lucas: Years of discussion on the Captioning mailing list. Not everything in the world has a permalink yet.
Maybe someone could add a mention on the error-strewn Wikipedia article on captioning (or on HDMI or both). But then you’d need a source like a small-town newspaper or Jimmy Wales.
§
© 2001–9 Mark Pilgrim