1% -*- mode: latex; TeX-master: "Vorbis_I_spec"; -*-
2%!TEX root = Vorbis_I_spec.tex
3% $Id$
4\section{Introduction and Description} \label{vorbis:spec:intro}
5
6\subsection{Overview}
7
8This document provides a high level description of the Vorbis codec's
9construction.  A bit-by-bit specification appears beginning in
10\xref{vorbis:spec:codec}.
11The later sections assume a high-level
12understanding of the Vorbis decode process, which is
13provided here.
14
15\subsubsection{Application}
16Vorbis is a general purpose perceptual audio CODEC intended to allow
17maximum encoder flexibility, thus allowing it to scale competitively
18over an exceptionally wide range of bitrates.  At the high
19quality/bitrate end of the scale (CD or DAT rate stereo, 16/24 bits)
20it is in the same league as MPEG-2 and MPC.  Similarly, the 1.0
21encoder can encode high-quality CD and DAT rate stereo at below 48kbps
22without resampling to a lower rate.  Vorbis is also intended for
23lower and higher sample rates (from 8kHz telephony to 192kHz digital
24masters) and a range of channel representations (monaural,
25polyphonic, stereo, quadraphonic, 5.1, ambisonic, or up to 255
26discrete channels).
27
28
29\subsubsection{Classification}
30Vorbis I is a forward-adaptive monolithic transform CODEC based on the
31Modified Discrete Cosine Transform.  The codec is structured to allow
32addition of a hybrid wavelet filterbank in Vorbis II to offer better
33transient response and reproduction using a transform better suited to
34localized time events.
35
36
37\subsubsection{Assumptions}
38
39The Vorbis CODEC design assumes a complex, psychoacoustically-aware
40encoder and simple, low-complexity decoder. Vorbis decode is
41computationally simpler than mp3, although it does require more
42working memory as Vorbis has no static probability model; the vector
43codebooks used in the first stage of decoding from the bitstream are
44packed in their entirety into the Vorbis bitstream headers. In
45packed form, these codebooks occupy only a few kilobytes; the extent
46to which they are pre-decoded into a cache is the dominant factor in
47decoder memory usage.
48
49
50Vorbis provides none of its own framing, synchronization or protection
51against errors; it is solely a method of accepting input audio,
52dividing it into individual frames and compressing these frames into
53raw, unformatted 'packets'. The decoder then accepts these raw
54packets in sequence, decodes them, synthesizes audio frames from
55them, and reassembles the frames into a facsimile of the original
56audio stream. Vorbis is a free-form variable bit rate (VBR) codec and packets have no
57minimum size, maximum size, or fixed/expected size.  Packets
58are designed that they may be truncated (or padded) and remain
59decodable; this is not to be considered an error condition and is used
60extensively in bitrate management in peeling.  Both the transport
61mechanism and decoder must allow that a packet may be any size, or
62end before or after packet decode expects.
63
64Vorbis packets are thus intended to be used with a transport mechanism
65that provides free-form framing, sync, positioning and error correction
66in accordance with these design assumptions, such as Ogg (for file
67transport) or RTP (for network multicast).  For purposes of a few
68examples in this document, we will assume that Vorbis is to be
69embedded in an Ogg stream specifically, although this is by no means a
70requirement or fundamental assumption in the Vorbis design.
71
72The specification for embedding Vorbis into
73an Ogg transport stream is in \xref{vorbis:over:ogg}.
74
75
76
77\subsubsection{Codec Setup and Probability Model}
78
79Vorbis' heritage is as a research CODEC and its current design
80reflects a desire to allow multiple decades of continuous encoder
81improvement before running out of room within the codec specification.
82For these reasons, configurable aspects of codec setup intentionally
83lean toward the extreme of forward adaptive.
84
85The single most controversial design decision in Vorbis (and the most
86unusual for a Vorbis developer to keep in mind) is that the entire
87probability model of the codec, the Huffman and VQ codebooks, is
88packed into the bitstream header along with extensive CODEC setup
89parameters (often several hundred fields).  This makes it impossible,
90as it would be with MPEG audio layers, to embed a simple frame type
91flag in each audio packet, or begin decode at any frame in the stream
92without having previously fetched the codec setup header.
93
94
95\begin{note}
96Vorbis \emph{can} initiate decode at any arbitrary packet within a
97bitstream so long as the codec has been initialized/setup with the
98setup headers.
99\end{note}
100
101Thus, Vorbis headers are both required for decode to begin and
102relatively large as bitstream headers go.  The header size is
103unbounded, although for streaming a rule-of-thumb of 4kB or less is
104recommended (and Xiph.Org's Vorbis encoder follows this suggestion).
105
106Our own design work indicates the primary liability of the
107required header is in mindshare; it is an unusual design and thus
108causes some amount of complaint among engineers as this runs against
109current design trends (and also points out limitations in some
110existing software/interface designs, such as Windows' ACM codec
111framework).  However, we find that it does not fundamentally limit
112Vorbis' suitable application space.
113
114
115\subsubsection{Format Specification}
116The Vorbis format is well-defined by its decode specification; any
117encoder that produces packets that are correctly decoded by the
118reference Vorbis decoder described below may be considered a proper
119Vorbis encoder.  A decoder must faithfully and completely implement
120the specification defined below (except where noted) to be considered
121a proper Vorbis decoder.
122
123\subsubsection{Hardware Profile}
124Although Vorbis decode is computationally simple, it may still run
125into specific limitations of an embedded design.  For this reason,
126embedded designs are allowed to deviate in limited ways from the
127`full' decode specification yet still be certified compliant.  These
128optional omissions are labelled in the spec where relevant.
129
130
131\subsection{Decoder Configuration}
132
133Decoder setup consists of configuration of multiple, self-contained
134component abstractions that perform specific functions in the decode
135pipeline.  Each different component instance of a specific type is
136semantically interchangeable; decoder configuration consists both of
137internal component configuration, as well as arrangement of specific
138instances into a decode pipeline.  Componentry arrangement is roughly
139as follows:
140
141\begin{center}
142\includegraphics[width=\textwidth]{components}
143\captionof{figure}{decoder pipeline configuration}
144\end{center}
145
146\subsubsection{Global Config}
147Global codec configuration consists of a few audio related fields
148(sample rate, channels), Vorbis version (always '0' in Vorbis I),
149bitrate hints, and the lists of component instances.  All other
150configuration is in the context of specific components.
151
152\subsubsection{Mode}
153
154Each Vorbis frame is coded according to a master 'mode'.  A bitstream
155may use one or many modes.
156
157The mode mechanism is used to encode a frame according to one of
158multiple possible methods with the intention of choosing a method best
159suited to that frame.  Different modes are, e.g. how frame size
160is changed from frame to frame. The mode number of a frame serves as a
161top level configuration switch for all other specific aspects of frame
162decode.
163
164A 'mode' configuration consists of a frame size setting, window type
165(always 0, the Vorbis window, in Vorbis I), transform type (always
166type 0, the MDCT, in Vorbis I) and a mapping number.  The mapping
167number specifies which mapping configuration instance to use for
168low-level packet decode and synthesis.
169
170
171\subsubsection{Mapping}
172
173A mapping contains a channel coupling description and a list of
174'submaps' that bundle sets of channel vectors together for grouped
175encoding and decoding. These submaps are not references to external
176components; the submap list is internal and specific to a mapping.
177
178A 'submap' is a configuration/grouping that applies to a subset of
179floor and residue vectors within a mapping.  The submap functions as a
180last layer of indirection such that specific special floor or residue
181settings can be applied not only to all the vectors in a given mode,
182but also specific vectors in a specific mode.  Each submap specifies
183the proper floor and residue instance number to use for decoding that
184submap's spectral floor and spectral residue vectors.
185
186As an example:
187
188Assume a Vorbis stream that contains six channels in the standard 5.1
189format.  The sixth channel, as is normal in 5.1, is bass only.
190Therefore it would be wasteful to encode a full-spectrum version of it
191as with the other channels.  The submapping mechanism can be used to
192apply a full range floor and residue encoding to channels 0 through 4,
193and a bass-only representation to the bass channel, thus saving space.
194In this example, channels 0-4 belong to submap 0 (which indicates use
195of a full-range floor) and channel 5 belongs to submap 1, which uses a
196bass-only representation.
197
198
199\subsubsection{Floor}
200
201Vorbis encodes a spectral 'floor' vector for each PCM channel.  This
202vector is a low-resolution representation of the audio spectrum for
203the given channel in the current frame, generally used akin to a
204whitening filter.  It is named a 'floor' because the Xiph.Org
205reference encoder has historically used it as a unit-baseline for
206spectral resolution.
207
208A floor encoding may be of two types.  Floor 0 uses a packed LSP
209representation on a dB amplitude scale and Bark frequency scale.
210Floor 1 represents the curve as a piecewise linear interpolated
211representation on a dB amplitude scale and linear frequency scale.
212The two floors are semantically interchangeable in
213encoding/decoding. However, floor type 1 provides more stable
214inter-frame behavior, and so is the preferred choice in all
215coupled-stereo and high bitrate modes.  Floor 1 is also considerably
216less expensive to decode than floor 0.
217
218Floor 0 is not to be considered deprecated, but it is of limited
219modern use.  No known Vorbis encoder past Xiph.org's own beta 4 makes
220use of floor 0.
221
222The values coded/decoded by a floor are both compactly formatted and
223make use of entropy coding to save space.  For this reason, a floor
224configuration generally refers to multiple codebooks in the codebook
225component list.  Entropy coding is thus provided as an abstraction,
226and each floor instance may choose from any and all available
227codebooks when coding/decoding.
228
229
230\subsubsection{Residue}
231The spectral residue is the fine structure of the audio spectrum
232once the floor curve has been subtracted out.  In simplest terms, it
233is coded in the bitstream using cascaded (multi-pass) vector
234quantization according to one of three specific packing/coding
235algorithms numbered 0 through 2.  The packing algorithm details are
236configured by residue instance.  As with the floor components, the
237final VQ/entropy encoding is provided by external codebook instances
238and each residue instance may choose from any and all available
239codebooks.
240
241\subsubsection{Codebooks}
242
243Codebooks are a self-contained abstraction that perform entropy
244decoding and, optionally, use the entropy-decoded integer value as an
245offset into an index of output value vectors, returning the indicated
246vector of values.
247
248The entropy coding in a Vorbis I codebook is provided by a standard
249Huffman binary tree representation.  This tree is tightly packed using
250one of several methods, depending on whether codeword lengths are
251ordered or unordered, or the tree is sparse.
252
253The codebook vector index is similarly packed according to index
254characteristic.  Most commonly, the vector index is encoded as a
255single list of values of possible values that are then permuted into
256a list of n-dimensional rows (lattice VQ).
257
258
259
260\subsection{High-level Decode Process}
261
262\subsubsection{Decode Setup}
263
264Before decoding can begin, a decoder must initialize using the
265bitstream headers matching the stream to be decoded.  Vorbis uses
266three header packets; all are required, in-order, by this
267specification. Once set up, decode may begin at any audio packet
268belonging to the Vorbis stream. In Vorbis I, all packets after the
269three initial headers are audio packets.
270
271The header packets are, in order, the identification
272header, the comments header, and the setup header.
273
274\paragraph{Identification Header}
275The identification header identifies the bitstream as Vorbis, Vorbis
276version, and the simple audio characteristics of the stream such as
277sample rate and number of channels.
278
279\paragraph{Comment Header}
280The comment header includes user text comments (``tags'') and a vendor
281string for the application/library that produced the bitstream.  The
282encoding and proper use of the comment header is described in \xref{vorbis:spec:comment}.
283
284\paragraph{Setup Header}
285The setup header includes extensive CODEC setup information as well as
286the complete VQ and Huffman codebooks needed for decode.
287
288
289\subsubsection{Decode Procedure}
290
291The decoding and synthesis procedure for all audio packets is
292fundamentally the same.
293\begin{enumerate}
294\item decode packet type flag
295\item decode mode number
296\item decode window shape (long windows only)
297\item decode floor
298\item decode residue into residue vectors
299\item inverse channel coupling of residue vectors
300\item generate floor curve from decoded floor data
301\item compute dot product of floor and residue, producing audio spectrum vector
302\item inverse monolithic transform of audio spectrum vector, always an MDCT in Vorbis I
303\item overlap/add left-hand output of transform with right-hand output of previous frame
304\item store right hand-data from transform of current frame for future lapping
305\item if not first frame, return results of overlap/add as audio result of current frame
306\end{enumerate}
307
308Note that clever rearrangement of the synthesis arithmetic is
309possible; as an example, one can take advantage of symmetries in the
310MDCT to store the right-hand transform data of a partial MDCT for a
31150\% inter-frame buffer space savings, and then complete the transform
312later before overlap/add with the next frame.  This optimization
313produces entirely equivalent output and is naturally perfectly legal.
314The decoder must be \emph{entirely mathematically equivalent} to the
315specification, it need not be a literal semantic implementation.
316
317\paragraph{Packet type decode}
318
319Vorbis I uses four packet types. The first three packet types mark each
320of the three Vorbis headers described above. The fourth packet type
321marks an audio packet. All other packet types are reserved; packets
322marked with a reserved type should be ignored.
323
324Following the three header packets, all packets in a Vorbis I stream
325are audio.  The first step of audio packet decode is to read and
326verify the packet type; \emph{a non-audio packet when audio is expected
327indicates stream corruption or a non-compliant stream. The decoder
328must ignore the packet and not attempt decoding it to
329audio}.
330
331
332
333
334\paragraph{Mode decode}
335Vorbis allows an encoder to set up multiple, numbered packet 'modes',
336as described earlier, all of which may be used in a given Vorbis
337stream. The mode is encoded as an integer used as a direct offset into
338the mode instance index.
339
340
341\paragraph{Window shape decode (long windows only)} \label{vorbis:spec:window}
342
343Vorbis frames may be one of two PCM sample sizes specified during
344codec setup.  In Vorbis I, legal frame sizes are powers of two from 64
345to 8192 samples.  Aside from coupling, Vorbis handles channels as
346independent vectors and these frame sizes are in samples per channel.
347
348Vorbis uses an overlapping transform, namely the MDCT, to blend one
349frame into the next, avoiding most inter-frame block boundary
350artifacts.  The MDCT output of one frame is windowed according to MDCT
351requirements, overlapped 50\% with the output of the previous frame and
352added.  The window shape assures seamless reconstruction.
353
354This is easy to visualize in the case of equal sized-windows:
355
356\begin{center}
357\includegraphics[width=\textwidth]{window1}
358\captionof{figure}{overlap of two equal-sized windows}
359\end{center}
360
361And slightly more complex in the case of overlapping unequal sized
362windows:
363
364\begin{center}
365\includegraphics[width=\textwidth]{window2}
366\captionof{figure}{overlap of a long and a short window}
367\end{center}
368
369In the unequal-sized window case, the window shape of the long window
370must be modified for seamless lapping as above.  It is possible to
371correctly infer window shape to be applied to the current window from
372knowing the sizes of the current, previous and next window.  It is
373legal for a decoder to use this method. However, in the case of a long
374window (short windows require no modification), Vorbis also codes two
375flag bits to specify pre- and post- window shape.  Although not
376strictly necessary for function, this minor redundancy allows a packet
377to be fully decoded to the point of lapping entirely independently of
378any other packet, allowing easier abstraction of decode layers as well
379as allowing a greater level of easy parallelism in encode and
380decode.
381
382A description of valid window functions for use with an inverse MDCT
383can be found in \cite{Sporer/Brandenburg/Edler}.  Vorbis windows
384all use the slope function
385\[ y = \sin(.5*\pi \, \sin^2((x+.5)/n*\pi)) . \]
386
387
388
389\paragraph{floor decode}
390Each floor is encoded/decoded in channel order, however each floor
391belongs to a 'submap' that specifies which floor configuration to
392use.  All floors are decoded before residue decode begins.
393
394
395\paragraph{residue decode}
396
397Although the number of residue vectors equals the number of channels,
398channel coupling may mean that the raw residue vectors extracted
399during decode do not map directly to specific channels.  When channel
400coupling is in use, some vectors will correspond to coupled magnitude
401or angle.  The coupling relationships are described in the codec setup
402and may differ from frame to frame, due to different mode numbers.
403
404Vorbis codes residue vectors in groups by submap; the coding is done
405in submap order from submap 0 through n-1.  This differs from floors
406which are coded using a configuration provided by submap number, but
407are coded individually in channel order.
408
409
410
411\paragraph{inverse channel coupling}
412
413A detailed discussion of stereo in the Vorbis codec can be found in
414the document \href{stereo.html}{Stereo Channel Coupling in the
415Vorbis CODEC}.  Vorbis is not limited to only stereo coupling, but
416the stereo document also gives a good overview of the generic coupling
417mechanism.
418
419Vorbis coupling applies to pairs of residue vectors at a time;
420decoupling is done in-place a pair at a time in the order and using
421the vectors specified in the current mapping configuration.  The
422decoupling operation is the same for all pairs, converting square
423polar representation (where one vector is magnitude and the second
424angle) back to Cartesian representation.
425
426After decoupling, in order, each pair of vectors on the coupling list,
427the resulting residue vectors represent the fine spectral detail
428of each output channel.
429
430
431
432\paragraph{generate floor curve}
433
434The decoder may choose to generate the floor curve at any appropriate
435time.  It is reasonable to generate the output curve when the floor
436data is decoded from the raw packet, or it can be generated after
437inverse coupling and applied to the spectral residue directly,
438combining generation and the dot product into one step and eliminating
439some working space.
440
441Both floor 0 and floor 1 generate a linear-range, linear-domain output
442vector to be multiplied (dot product) by the linear-range,
443linear-domain spectral residue.
444
445
446
447\paragraph{compute floor/residue dot product}
448
449This step is straightforward; for each output channel, the decoder
450multiplies the floor curve and residue vectors element by element,
451producing the finished audio spectrum of each channel.
452
453% TODO/FIXME: The following two paragraphs have identical twins
454%   in section 4 (under "dot product")
455One point is worth mentioning about this dot product; a common mistake
456in a fixed point implementation might be to assume that a 32 bit
457fixed-point representation for floor and residue and direct
458multiplication of the vectors is sufficient for acceptable spectral
459depth in all cases because it happens to mostly work with the current
460Xiph.Org reference encoder.
461
462However, floor vector values can span \~{}140dB (\~{}24 bits unsigned), and
463the audio spectrum vector should represent a minimum of 120dB (\~{}21
464bits with sign), even when output is to a 16 bit PCM device.  For the
465residue vector to represent full scale if the floor is nailed to
466$-140$dB, it must be able to span 0 to $+140$dB.  For the residue vector
467to reach full scale if the floor is nailed at 0dB, it must be able to
468represent $-140$dB to $+0$dB.  Thus, in order to handle full range
469dynamics, a residue vector may span $-140$dB to $+140$dB entirely within
470spec.  A 280dB range is approximately 48 bits with sign; thus the
471residue vector must be able to represent a 48 bit range and the dot
472product must be able to handle an effective 48 bit times 24 bit
473multiplication.  This range may be achieved using large (64 bit or
474larger) integers, or implementing a movable binary point
475representation.
476
477
478
479\paragraph{inverse monolithic transform (MDCT)}
480
481The audio spectrum is converted back into time domain PCM audio via an
482inverse Modified Discrete Cosine Transform (MDCT).  A detailed
483description of the MDCT is available in \cite{Sporer/Brandenburg/Edler}.
484
485Note that the PCM produced directly from the MDCT is not yet finished
486audio; it must be lapped with surrounding frames using an appropriate
487window (such as the Vorbis window) before the MDCT can be considered
488orthogonal.
489
490
491
492\paragraph{overlap/add data}
493Windowed MDCT output is overlapped and added with the right hand data
494of the previous window such that the 3/4 point of the previous window
495is aligned with the 1/4 point of the current window (as illustrated in
496the window overlap diagram). At this point, the audio data between the
497center of the previous frame and the center of the current frame is
498now finished and ready to be returned.
499
500
501\paragraph{cache right hand data}
502The decoder must cache the right hand portion of the current frame to
503be lapped with the left hand portion of the next frame.
504
505
506
507\paragraph{return finished audio data}
508
509The overlapped portion produced from overlapping the previous and
510current frame data is finished data to be returned by the decoder.
511This data spans from the center of the previous window to the center
512of the current window.  In the case of same-sized windows, the amount
513of data to return is one-half block consisting of and only of the
514overlapped portions. When overlapping a short and long window, much of
515the returned range is not actually overlap.  This does not damage
516transform orthogonality.  Pay attention however to returning the
517correct data range; the amount of data to be returned is:
518
519\begin{Verbatim}[commandchars=\\\{\}]
520window_blocksize(previous_window)/4+window_blocksize(current_window)/4
521\end{Verbatim}
522
523from the center of the previous window to the center of the current
524window.
525
526Data is not returned from the first frame; it must be used to 'prime'
527the decode engine.  The encoder accounts for this priming when
528calculating PCM offsets; after the first frame, the proper PCM output
529offset is '0' (as no data has been returned yet).
530