MPEG Background


Introduction.

MPEG-1 is an ISO/IEC (International Organization for Standardization/International Electrotechnical Commission) standard for medium-quality, medium-bit-rate video and audio compression. It allows video to be compressed by ratios in the range of 50:1 to 100:1, depending on image sequence type and desired quality (but it all depends on how you count; see the MPEG-2 FAQ for a discussion). The encoded data rate is targeted at 1.5Mb/s, audio and video combined, because this is a reasonable transfer rate for a double-speed CD-ROM player. VHS-quality playback is expected from this level of compression. The Moving Picture Experts Group (MPEG) also established the MPEG-2 standard for high-quality video playback at higher data rates.

Software-based MPEG-1 video decoding is considerably CPU-intensive. However, the performance bottleneck in current implementations is still file input, especially when video is played back over a remote channel. Nevertheless, the performance required of a video decoder is fundamentally bounded by so-called real-time performance, which is 30 frames per second. There is no practical reason to want a video decoder to run faster than real time, except for fast-forward playback. In fast-forward playback, the limited file input rate and the video output rate become the major concerns.

Software-based MPEG-1 video encoding is an entirely different story. While the encoder needs a decent input rate to read in the raw, uncompressed video source, the encoding process itself is extremely CPU-intensive. To achieve real-time encoding (30 frames per second), several GFLOPS are needed. Sequential software encoders are unavoidably slow. The Berkeley MPEG Encoder can compress video at a rate of 1.2 frames per second for 352x288 (CIF) images on a Sun SPARCstation 10. Hardware encoders can encode video in real time by using multiple custom-designed video processor chips, but they are expensive. The C-Cube real-time MPEG-1 video encoder uses 8 custom-designed video processor chips, and a complete system is sold for $120,000. Much cheaper boards are available ($4,000 is the cheapest full MPEG-1 board we know of).

To obtain better software encoding performance, a cluster of workstations can be used to distribute the workload. The Berkeley Parallel MPEG Encoder can compress 4.7 frames per second on 6 Sun workstations connected by Ethernet.

MPEG-1 Video Overview.

The basic idea behind MPEG video compression is to remove spatial redundancy within a video frame and temporal redundancy between video frames. As in JPEG, the standard for still image compression, DCT-based (Discrete Cosine Transform) compression is used to reduce spatial redundancy. Motion compensation is used to exploit temporal redundancy. The images in a video stream usually do not change much within small time intervals, so the idea of motion compensation is to encode a video frame based on other video frames temporally close to it.

Video Format.

A video stream is a sequence of video frames. Each frame is a still image. A video player displays one frame after another, usually at a rate close to 30 frames per second (common rates are 23.976, 24, 25, 29.97, and 30).

Frames are digitized in a standard RGB format, 24 bits per pixel (8 bits each for Red, Green, and Blue). MPEG-1 is designed to produce bit rates of 1.5Mb/s or less, and is intended to be used with images of size 352x288 at 24-30 frames per second. Uncompressed, this source format amounts to data rates of 55.7-69.6Mb/s (counting 1Mb as 2^20 bits), so reaching the 1.5Mb/s target requires substantial compression.
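The uncompressed data rate quoted above can be reproduced with a few lines of arithmetic. This is only a sketch: the function name is made up for illustration, and the choice of 1Mb = 2^20 bits is an assumption, made because it is the convention that reproduces the figures in the text.

```python
def raw_bit_rate(width, height, bits_per_pixel, fps):
    """Uncompressed video bit rate in bits per second."""
    return width * height * bits_per_pixel * fps


# Assumption: the figures in the text count 1 Mb as 2**20 bits.
MEBIBIT = 2 ** 20

low = raw_bit_rate(352, 288, 24, 24) / MEBIBIT
high = raw_bit_rate(352, 288, 24, 30) / MEBIBIT
print(f"{low:.1f}-{high:.1f} Mb/s")  # prints 55.7-69.6 Mb/s
```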

The MPEG-1 algorithm operates on images represented in YUV color space (Y Cr Cb). If an image is stored in RGB format, it must first be converted to YUV format. Before subsampling, YUV images are also represented in 24 bits per pixel (8 bits for the luminance information (Y) and 8 bits each for the two chrominance components (U and V)). MPEG-1 then subsamples the YUV data: all luminance information is retained, but chrominance information is subsampled 2:1 in both the horizontal and vertical directions. Thus, there are on average 2 bits per pixel each of U and V information. This subsampling does not drastically affect quality because the eye is more sensitive to luminance than to chrominance information. Subsampling is a lossy step. The 24-bit RGB information is reduced to 12 bits of YUV information per pixel, which automatically gives 2:1 compression. Technically speaking, MPEG-1 is 4:2:0 YCrCb.
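The conversion and subsampling steps can be sketched as follows. Several things here are assumptions rather than anything stated above: the conversion uses the full-range JFIF-style coefficients for illustration (a real MPEG-1 encoder typically follows the studio-range variant of the same matrix), chrominance is subsampled by plain 2x2 averaging, and all function names are hypothetical. Cb and Cr correspond to the U and V components in the text.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to (Y, Cb, Cr).

    Assumption: full-range JFIF-style coefficients, for illustration only.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return tuple(min(255, max(0, round(v))) for v in (y, cb, cr))


def subsample_420(plane):
    """Halve a chrominance plane in both directions by 2x2 averaging.

    `plane` is a list of rows with even width and height.
    """
    height, width = len(plane), len(plane[0])
    return [
        [
            (plane[y][x] + plane[y][x + 1]
             + plane[y + 1][x] + plane[y + 1][x + 1] + 2) // 4
            for x in range(0, width, 2)
        ]
        for y in range(0, height, 2)
    ]


def bits_per_pixel_420(width, height, depth=8):
    """Average bits per pixel once both chrominance planes are subsampled."""
    luma_bits = width * height * depth
    chroma_bits = 2 * (width // 2) * (height // 2) * depth
    return (luma_bits + chroma_bits) / (width * height)
```

For a 352x288 frame, `bits_per_pixel_420(352, 288)` gives 12.0, matching the 24-to-12-bit reduction described above.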

Frame Encoding.

Frames are divided into 16x16 pixel macroblocks. Each macroblock consists of four 8x8 luminance blocks and two 8x8 chrominance blocks (1 U and 1 V). Macroblocks are the units for motion-compensated compression. Blocks are the units for DCT compression.
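As a rough illustration of this partitioning, the sketch below counts the macroblocks in a frame and splits the 16x16 luminance area of one macroblock into its four 8x8 blocks. The function names are hypothetical, and the ordering of the four blocks (top-left, top-right, bottom-left, bottom-right) is an assumption made for the example.

```python
def macroblock_grid(width, height, mb_size=16):
    """Number of macroblock columns and rows needed to cover a frame."""
    cols = (width + mb_size - 1) // mb_size
    rows = (height + mb_size - 1) // mb_size
    return cols, rows


def split_luma_macroblock(mb):
    """Split a 16x16 luminance macroblock (list of 16 rows) into four 8x8 blocks.

    Assumed order: top-left, top-right, bottom-left, bottom-right.
    """
    return [
        [row[x:x + 8] for row in mb[y:y + 8]]
        for y in (0, 8)
        for x in (0, 8)
    ]
```

For a 352x288 frame, `macroblock_grid(352, 288)` returns `(22, 18)`, that is, 396 macroblocks, each carrying four luminance blocks plus one U and one V block.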

Frames can be encoded in three types: intra-frames (I-frames), forward predicted frames (P-frames), and bi-directional predicted frames (B-frames).

An I-frame is encoded as a single image, with no reference to any past or future frames. The encoding scheme used is similar to JPEG compression. Each 8x8 block is encoded independently, with one exception explained below. The block is first transformed from the spatial domain into a frequency domain using the DCT (Discrete Cosine Transform), which separates the signal into independent frequency bands. Most of the signal energy ends up in the low-frequency coefficients, which sit in the upper left corner of the resulting 8x8 block. After this, the data is quantized. Quantization can be thought of as ignoring lower-order bits (though the process is slightly more complicated). Quantization is the only lossy part of the whole compression process other than subsampling. The resulting data is then run-length encoded in a zig-zag ordering to optimize compression. This zig-zag ordering produces longer runs of 0's by taking advantage of the fact that there should be little high-frequency information (more 0's as one zig-zags from the upper left corner towards the lower right corner of the 8x8 block). The aforementioned exception to independence is that the coefficient in the upper left corner of the block, called the DC coefficient, is encoded relative to the DC coefficient of the previous block (DPCM coding).
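The block pipeline described above (DCT, quantization, zig-zag scan, run-length coding, and differential coding of the DC coefficient) can be sketched in plain Python. This is a simplified illustration, not the standard's exact procedure: quantization here divides every coefficient by one uniform step, whereas MPEG-1 uses a quantization matrix and a scale factor; the DC predictor is assumed to start at 0; and the final variable-length (Huffman-style) coding stage is left out. All function names are hypothetical.

```python
import math

N = 8  # block size


def dct_2d(block):
    """Direct (slow) 8x8 two-dimensional DCT-II of a list-of-rows block."""
    def c(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / 16)
                              * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * c(u) * c(v) * total
    return out


def quantize(coeffs, step=16):
    """Assumption: one uniform step for all coefficients (illustration only)."""
    return [[round(v / step) for v in row] for row in coeffs]


# Zig-zag scan order: walk the anti-diagonals, alternating direction.
ZIGZAG = sorted(
    ((r, c) for r in range(N) for c in range(N)),
    key=lambda p: (p[0] + p[1], p[0] if (p[0] + p[1]) % 2 else p[1]),
)


def zigzag(block):
    """Flatten an 8x8 block into a 64-entry list in zig-zag order."""
    return [block[r][c] for r, c in ZIGZAG]


def run_length(ac):
    """Encode AC coefficients as (zero_run, level) pairs.

    Trailing zeros are dropped; an end-of-block marker would imply them.
    """
    pairs, run = [], 0
    for level in ac:
        if level == 0:
            run += 1
        else:
            pairs.append((run, level))
            run = 0
    return pairs


def dc_dpcm(dc_values):
    """Code each DC coefficient as a difference from the previous block's DC."""
    prev, diffs = 0, []  # assumption: predictor starts at 0
    for dc in dc_values:
        diffs.append(dc - prev)
        prev = dc
    return diffs
```

A perfectly flat block illustrates why the scheme compresses well: all of its energy lands in the DC coefficient, every AC coefficient quantizes to 0, and `run_length` returns an empty list.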

A P-frame is encoded relative to the past reference frame. A reference frame is a P- or I-frame. The past reference frame is the closest preceding reference frame. Each macroblock in a P-frame can be encoded either as an I-macroblock or as a P-macroblock. An I-macroblock is encoded just like a macroblock in an I-frame. A P-macroblock is encoded as a 16x16 area of the past reference frame, plus an error term. To specify the 16x16 area of the reference frame, a motion vector is included. A motion vector (0, 0) means that the 16x16 area is in the same position as the macroblock we are encoding. Other motion vectors are relative to that position. Motion vectors may include half-pixel values, in which case pixels are averaged. The error term is encoded using the DCT, quantization, and run-length encoding. A macroblock may also be skipped, which is equivalent to a (0, 0) vector and an all-zero error term. The search for a good motion vector (one that gives a small error term and good compression) is the heart of any MPEG-1 video encoder, and it is the primary reason why encoders are slow.
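To make the cost of the motion vector search concrete, here is a minimal sketch of an exhaustive search. It rests on assumptions not stated in the text: the match criterion is the sum of absolute differences (SAD), only whole-pixel vectors are tried (no half-pixel refinement), and candidates that would fall outside the reference frame are skipped. The function names are hypothetical, and frames are plain lists of rows of luminance values.

```python
def sad(cur, ref, bx, by, dx, dy, size=16):
    """Sum of absolute differences between the current macroblock at
    (bx, by) and the reference area displaced by (dx, dy)."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, size=16, search=7):
    """Try every whole-pixel vector within +/-search; return (vector, cost).

    Assumption: SAD is the cost; ties keep the first candidate found.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidates that reach outside the reference frame.
            if not (0 <= bx + dx and bx + dx + size <= width
                    and 0 <= by + dy and by + dy + size <= height):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return (best[1], best[2]), best[0]
```

With a search range of 7 pixels, each macroblock costs up to 225 candidate positions of 256 pixel differences each, which gives a sense of why practical encoders replace the exhaustive search with faster heuristics.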

A B-frame is encoded relative to the past reference frame, the future reference frame, or both frames. The future reference frame is the closest following reference frame (I or P). The encoding for B-frames is similar to P-frames, except that motion vectors may refer to areas in the future reference frame. For macroblocks that use both past and future reference frames, the two 16x16 areas are averaged.
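The averaging of the two reference areas can be sketched as below. The rounding rule (halves rounded up) and the use of the summed absolute error to pick among forward, backward, and averaged prediction are assumptions made for the example, and the function names are hypothetical. The blocks passed in are assumed to be the already motion-compensated areas.

```python
def average_prediction(past_block, future_block):
    """Pixel-wise average of two equally sized blocks, halves rounded up."""
    return [
        [(p + f + 1) // 2 for p, f in zip(past_row, future_row)]
        for past_row, future_row in zip(past_block, future_block)
    ]


def residual(actual, prediction):
    """Error term: what is left to encode after prediction."""
    return [
        [a - p for a, p in zip(actual_row, pred_row)]
        for actual_row, pred_row in zip(actual, prediction)
    ]


def choose_mode(actual, past_block, future_block):
    """Pick the prediction with the smallest summed absolute error.

    Assumption: a real encoder may use a different decision rule.
    """
    candidates = {
        "forward": past_block,
        "backward": future_block,
        "interpolated": average_prediction(past_block, future_block),
    }

    def cost(pred):
        return sum(abs(v) for row in residual(actual, pred) for v in row)

    return min(candidates, key=lambda mode: cost(candidates[mode]))
```

When the content of a B-frame lies roughly halfway between its two references (a slow fade, for instance), the interpolated mode leaves the smallest error term.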

Figure 1. Typical MPEG-1 Encoding Pattern and Dependencies.

A typical IPB sequence is shown in Figure 1. The arrows represent the inter-frame dependencies. Frames do not need to follow a static IPB pattern; each individual frame can be of any type. Often, however, a fixed IPB sequence is used throughout the entire video stream for simplicity. The typical data rate of an I-frame is 1 bit per pixel, while that of a P-frame is 0.1 bit per pixel and that of a B-frame is 0.015 bit per pixel. The order of the frames in the output sequence is rearranged so that an MPEG decoder can decompress the frames with minimum frame buffering (a maximum of 3 frame buffers). For example, an input sequence of IBBPBBP will be arranged in the output sequence as IPBBPBB.
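The reordering rule follows from the dependencies: a B-frame cannot be decoded until the reference frame that follows it in display order has arrived. A minimal sketch, with a hypothetical function name, holds back each run of B-frames until the next I- or P-frame has been emitted:

```python
def display_to_coded(pattern):
    """Reorder a display-order frame-type string into coded (output) order.

    Returns a list of (display_index, frame_type) pairs.
    """
    coded, pending_b = [], []
    for index, frame_type in enumerate(pattern):
        if frame_type == "B":
            # Hold B-frames until their future reference has been emitted.
            pending_b.append((index, frame_type))
        else:
            coded.append((index, frame_type))
            coded.extend(pending_b)
            pending_b = []
    # Trailing B-frames have no following reference in this sequence.
    coded.extend(pending_b)
    return coded


order = display_to_coded("IBBPBBP")
print("".join(frame_type for _, frame_type in order))  # prints IPBBPBB
```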

Layered Structure.

An MPEG-1 video sequence is an ordered stream of bits, with special bit patterns marking the beginning and end of each logical section.

Each video sequence is composed of a series of Groups of Pictures (GOPs). A GOP is composed of a sequence of pictures (frames). A frame is composed of a series of slices. A slice is composed of a series of macroblocks, and a macroblock is composed of 6 or fewer blocks (4 for luminance and 2 for chrominance) and possibly a motion vector.
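The layers down to the slice level can be located in a raw MPEG-1 video stream by scanning for start codes: the byte prefix 00 00 01 followed by one byte identifying the section. The sketch below classifies only the codes that correspond to the layers named above and lumps the rest together as "other"; the function names are hypothetical, and it assumes a bare video stream rather than a multiplexed system stream.

```python
START_CODE_PREFIX = b"\x00\x00\x01"


def classify(code):
    """Map the byte after the start code prefix to a layer name."""
    if code == 0x00:
        return "picture"
    if 0x01 <= code <= 0xAF:
        return "slice"
    if code == 0xB3:
        return "sequence_header"
    if code == 0xB7:
        return "sequence_end"
    if code == 0xB8:
        return "group_of_pictures"
    return "other"


def scan_start_codes(data):
    """Return (byte_offset, layer_name) for every start code in `data`."""
    found = []
    i = data.find(START_CODE_PREFIX)
    while i != -1 and i + 3 < len(data):
        found.append((i, classify(data[i + 3])))
        i = data.find(START_CODE_PREFIX, i + 3)
    return found
```

Macroblocks and blocks have no start codes of their own; they can only be reached by decoding the slice that contains them.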

The GOP structure is intended to assist random access into a sequence. A GOP is an independently decodable unit that can be of any size as long as it begins with an I-frame. (There is one caveat here: SEQUENCES are a higher-level structure than GOPs, and may contain information about quantization tables. Their information is needed to decode all following GOPs.) GOPs are independently decodable only if they are closed. For example, a GOP with the pattern IBBP is closed, but one with the pattern IB is not.

Each slice is (in a sense) an independently decodable unit too. There can be 1 slice per frame, 1 slice per macroblock, or anything in between. The slice structure is intended to allow decoding in the presence of errors. Serendipitously, it also allows parallel encoding/decoding at the slice level.

The basis (and much of the credit) for this page goes to: Siddhartha Devadhar, Cederic Krumbein, and Kim Man Liu. Corrections, additions and deletions by Steve Smoot.
