Understanding H.264 Video Compression

How the new H.264 standard for surveillance video compression works, and when to use it


Relationship to MPEG-4, Part 2

MPEG-4 (ISO/IEC 14496) is a collection of standards defining the coding of audio-visual objects. The collection is divided into a number of parts: some describe video and audio compression standards, while system-level parts describe features such as the MPEG-4 file format. The video compression standard found in many products today is the traditional DCT-based MPEG-4 Part 2 (ISO/IEC 14496-2) standard.

The H.264 video compression standard has been incorporated into MPEG-4 as MPEG-4 Part 10 (ISO/IEC 14496-10). This means MPEG-4 now has two video compression standards available. However, the two standards are not interoperable: each uses different methods to compress and represent the data, so an MPEG-4 Part 10 (H.264) decoder cannot decode an MPEG-4 Part 2 bitstream, and vice versa.

IP Video and H.264

The best way to see the benefits of H.264 in IP Video solutions is to look at an actual implementation of the standard, in this case IndigoVision's new 9000 series DVR.

Inside a 9000 transmitter, frames of video are captured from the camera and sent to the internal H.264 encoder to be compressed. Each frame of video is then compressed in one of two ways: as an I-frame or as a P-frame.

An I-frame is a video frame that has been encoded without reference to any other frame of video. A video stream or recording will always start with an I-frame and will typically contain regular I-frames throughout the stream. These regular I-frames, also called intra frames, key frames or access points, are crucial for random access to recorded H.264 files, such as rewind and seek operations during playback. The spacing between these regular I-frames is known as the I-frame interval. The disadvantage of I-frames is that they tend to be much larger than P-frames.
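As an illustration of the I-frame interval, the following minimal sketch lays out frame types for a stream and finds where decoding must begin for a seek. The function names and the interval of 4 are hypothetical, chosen only for illustration; they do not describe any particular product's settings.

```python
def frame_types(num_frames, i_frame_interval):
    """Label each frame 'I' or 'P', with an I-frame every i_frame_interval frames."""
    return ["I" if n % i_frame_interval == 0 else "P" for n in range(num_frames)]


def seek_start(target_frame, i_frame_interval):
    """Index of the nearest I-frame at or before target_frame.

    A P-frame depends on earlier frames, so playback from an arbitrary
    position has to start decoding at the preceding I-frame.
    """
    return (target_frame // i_frame_interval) * i_frame_interval


print("".join(frame_types(10, 4)))  # IPPPIPPPIP
print(seek_start(7, 4))             # 4
```

A shorter interval gives finer-grained seeking but spends more bits on the larger I-frames; a longer interval does the opposite.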

P-frames are motion-compensated frames: the encoder makes use of the difference between the current frame being processed and a previous frame of video, ensuring that information that does not change, e.g. a static background, is not repeatedly transmitted. To put it simply, a P-frame tells the codec what changed relative to a previous frame (which may be an I-frame or another P-frame), and since it is not a full frame of video, it takes much less space to store. Unlike purely difference-based codecs, such as delta-MJPEG, H.264 not only looks for differences but also searches for motion that has occurred in the video. This means that motion-compensated codecs will typically outperform simple difference-based codecs when there is motion. The process of searching for motion is known as motion estimation.
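The contrast between simple differencing and motion estimation can be sketched with a toy block-matching search. This is a simplified illustration, not the H.264 algorithm itself: it uses a single fixed-size block, whole-pixel displacements and an exhaustive search, and all names (`sad`, `best_vector`, `make_frame`) and frame sizes are hypothetical.

```python
def sad(prev, cur, bx, by, dx, dy, size):
    """Sum of absolute differences between the block of cur at (bx, by)
    and the block of prev displaced by (dx, dy)."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[by + y][bx + x] - prev[by + y + dy][bx + x + dx])
    return total


def best_vector(prev, cur, bx, by, size, search):
    """Exhaustively search displacements within +/- search pixels."""
    height, width = len(prev), len(prev[0])
    best, best_cost = (0, 0), sad(prev, cur, bx, by, 0, 0, size)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidates that fall outside the previous frame.
            if by + dy < 0 or by + dy + size > height:
                continue
            if bx + dx < 0 or bx + dx + size > width:
                continue
            cost = sad(prev, cur, bx, by, dx, dy, size)
            if cost < best_cost:
                best, best_cost = (dx, dy), cost
    return best, best_cost


def make_frame(ox, oy, width=12, height=12, size=4):
    """Black frame containing one textured size x size object at (ox, oy)."""
    frame = [[0] * width for _ in range(height)]
    for y in range(size):
        for x in range(size):
            frame[oy + y][ox + x] = 50 + 10 * y + x
    return frame


prev_frame = make_frame(2, 3)  # object at x=2, y=3
cur_frame = make_frame(4, 4)   # same object moved right 2, down 1

# Plain differencing (zero motion) leaves a non-zero residual to encode.
zero_motion_cost = sad(prev_frame, cur_frame, 4, 4, 0, 0, 4)
vector, cost = best_vector(prev_frame, cur_frame, 4, 4, 4, 3)
print(zero_motion_cost > 0)  # True
print(vector, cost)          # (-2, -1) 0
```

With the motion vector found, the block is predicted perfectly from the previous frame and only the vector needs to be sent, whereas a difference-only coder would have to encode the whole residual.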

Within the codec, the motion estimation unit is one of the most computationally expensive parts and is critical to the performance of the H.264 encoder. Motion estimation is a complex procedure, and encoders, especially real-time software or DSP-based encoders, will often use reduced search areas or a restrictive search algorithm in order to achieve real-time performance. However, this can often result in poor-quality video and significantly reduced compression.
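A rough count of the work involved shows why encoders are tempted to shrink the search area. The sketch below assumes an exhaustive whole-pixel search over 16x16 macroblocks, a 704x576 frame and search ranges of 16 and 4 pixels; these figures and the function name are assumptions for illustration, and real H.264 encoders add further cost through smaller block sizes and sub-pixel refinement.

```python
def sad_ops_per_second(width, height, search_range, fps, block=16):
    """Pixel comparisons per second for an exhaustive block-matching search."""
    blocks_per_frame = (width // block) * (height // block)
    candidates_per_block = (2 * search_range + 1) ** 2
    comparisons_per_candidate = block * block
    return blocks_per_frame * candidates_per_block * comparisons_per_candidate * fps


full = sad_ops_per_second(704, 576, search_range=16, fps=25)
reduced = sad_ops_per_second(704, 576, search_range=4, fps=25)
print(full)            # 11039846400
print(reduced)         # 821145600
print(full / reduced)  # the wider search costs roughly 13 times more
```

The cost grows with the square of the search range, so a narrow search is far cheaper, but motion larger than the range goes undetected and must be coded as residual.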

An example of the savings that can be achieved on a scene, such as the one opposite, is demonstrated in the following graph. In this example the same video sequence has been encoded using four different encoders: MPEG-4 (using the IndigoVision 8000), H.264 (using the IndigoVision 9000), an MPEG-4 encoder with no motion estimation, and an MJPEG encoder. All were encoded at 25 fps (with the exception of MJPEG, at 5 fps) to the same subjective video quality.

The graph shows that, compared to MPEG-4, H.264 can typically achieve savings of between 20 and 25 percent in bandwidth usage, and in excess of 50 percent during periods of scene inactivity, i.e. when there is no moving traffic. Not only does this reduce the overall bandwidth requirements of the IP video system but, more importantly, it can significantly reduce the amount of storage required for recording the video, often one of the most expensive items in the system.
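To see what such savings mean for storage, the sketch below converts an average bit rate into recording size. The 2 Mbit/s MPEG-4 bit rate and the 30-day retention period are hypothetical figures chosen for illustration; only the 20 to 25 percent savings range comes from the text above.

```python
def storage_gb(bitrate_bps, days):
    """Storage in gigabytes (10**9 bytes) for continuous recording."""
    seconds = days * 24 * 60 * 60
    return bitrate_bps * seconds / 8 / 1e9


mpeg4_bitrate = 2_000_000  # assumed average MPEG-4 bit rate, bits per second
days = 30                  # assumed retention period

baseline = storage_gb(mpeg4_bitrate, days)
print(f"MPEG-4: {baseline:.1f} GB per camera")
for saving in (0.20, 0.25):
    h264 = storage_gb(mpeg4_bitrate * (1 - saving), days)
    print(f"H.264 at {saving:.0%} saving: {h264:.1f} GB per camera")
```

Under these assumptions a single camera needs about 648 GB with MPEG-4 and roughly 486 to 518 GB with H.264; the difference multiplies across every camera in the system.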