Real words or Buzzwords?: H.264 and I-frames, P-frames and B-frames – Part 2

Feb. 4, 2020
A look at the different H.264 video frame types and how they relate to intended uses of video
This is the 50th article in the “Real Words or Buzzwords?” series from SecurityInfoWatch.com contributor Ray Bernard about how real words can become empty words and stifle technology progress. This is Part 2 of a three-part article.

This article continues the discussion about the H.264/H.265 MPEG video standards, including the manufacturer innovation of H.264 smart codecs, and how the configuration settings for camera video streams impact computing and networking requirements for security video deployments. The reasons for discussing this particular topic are presented in the Part 1 article.

The two overall reasons to have a clear understanding of these video compression issues are that they impact video system cost and performance:

  • Higher levels of compression reduce network bandwidth requirements and video storage requirements, but increase video server processing requirements. In some deployments, this has degraded video quality when camera counts and/or higher compression levels have increased to the point where excessive server CPU processing demands result in dropped video frames and video image artifacts.
  • It’s often easier to upgrade video server hardware than it is to expand network capacity, especially in the case where some or all of the video traffic is being carried by a corporate business network, rather than a physically separate network for video surveillance and other security applications. Determining the server requirements takes a good understanding of the levels of CPU processing power that will be required over the intended life of the server – recommended to be only two to three years, given the increasingly rapid advancements in server and GPU hardware.

Today’s Codecs

The previous generation of video codec technology (codec being short for coder/decoder) dealt with improvements in compressing each video image, but still sent visually the same image over and over again when the scene the camera was viewing didn’t change. This single-image compression is called spatial compression – compression that analyses blocks of space within a video image to determine how they can be represented using less data. Some of the color and shading nuances in an image can’t be detected by the human eye. They can, for example, be averaged, without reducing the useful quality of the image.
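As a rough illustration of the averaging idea, here is a minimal Python sketch that replaces each small block of a grayscale image with its average value. The function name, the 2x2 block size and the sample pixel values are all assumptions made for illustration. Real codecs such as H.264 use block transforms and quantization rather than plain averaging, so this only shows the principle of representing a block with less data.

```python
def average_blocks(image, block=2):
    """Replace each block x block tile of a grayscale image with its mean value."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for top in range(0, height, block):
        for left in range(0, width, block):
            # Collect the pixel values that fall inside this tile.
            tile = [image[y][x]
                    for y in range(top, min(top + block, height))
                    for x in range(left, min(left + block, width))]
            mean = sum(tile) // len(tile)  # integer average of the tile
            for y in range(top, min(top + block, height)):
                for x in range(left, min(left + block, width)):
                    out[y][x] = mean
    return out


# Hypothetical 4x4 image: four regions with slight shading differences.
image = [
    [100, 102, 200, 198],
    [101, 103, 199, 201],
    [50, 52, 10, 12],
    [51, 53, 11, 13],
]
smoothed = average_blocks(image)
distinct_before = len({v for row in image for v in row})
distinct_after = len({v for row in smoothed for v in row})
print(distinct_before, "distinct values before,", distinct_after, "after")
```

Each tile now needs only one value instead of four, at the cost of losing the small shading differences inside it.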

Modern codecs use temporal compression – which is compression based on how a sequence of images varies over time. Only the changes from one video frame to the next are transmitted. If a bird flies across a clear blue sky, only the small areas where the bird is and was need to be updated.
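The bird example can be sketched in a few lines of Python. This is a simplified illustration, not how an H.264 encoder actually works (real encoders use motion estimation and residual coding). The frame size, block size, pixel values and function name are assumptions chosen for the example.

```python
def changed_blocks(prev, curr, block=2):
    """Return (top, left) coordinates of blocks that differ between two frames."""
    height, width = len(curr), len(curr[0])
    changed = []
    for top in range(0, height, block):
        for left in range(0, width, block):
            differs = any(
                prev[y][x] != curr[y][x]
                for y in range(top, min(top + block, height))
                for x in range(left, min(left + block, width))
            )
            if differs:
                changed.append((top, left))
    return changed


SKY, BIRD = 200, 30  # hypothetical pixel values
frame1 = [[SKY] * 8 for _ in range(4)]
frame2 = [row[:] for row in frame1]
frame1[1][2] = BIRD  # where the bird was in the first frame
frame2[1][5] = BIRD  # where the bird is in the second frame

updates = changed_blocks(frame1, frame2)
total_blocks = (4 // 2) * (8 // 2)
print(len(updates), "of", total_blocks, "blocks need updating:", updates)
```

Only the block the bird left and the block it entered are flagged; the rest of the sky needs no new data.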

These types of compression are referred to as lossy compression, because the decoded (i.e. rebuilt) video images are not identical to the original image. The less identical they are, the worse the video quality, but also the smaller the total amount of information that is transmitted and stored. This is where the trade-offs occur between video quality and the requirements for network bandwidth and storage.

Consumer Video Impacts

H.264 uses both types of compression – spatial and temporal – and has become the predominant type of video compression in the video industry, which includes online video streaming and television video broadcast. It is these highly profitable commercial uses of video that have driven the advances in video compression. The physical security industry does not drive these advances – the consumer video industry does.

Thus, you can find all kinds of advice online about configuring camera video for H.264 video compression – but only some of it applies to security video. The fact that security video is used for investigation purposes and courtroom evidence gives it quality requirements that differ from those of entertainment video. A primary consideration for commercial video is, “What can we fit into a single video disc?”

Security Video Considerations

The primary considerations regarding security video are the cost of the video system infrastructure (video transmission and storage) and the quality of the video evidence. Today, quality is no longer driven just by visual evidence image requirements. Ongoing advancements in AI-based video analytics are adding new requirements to video collection, with several new industry entrants saying that full facility situational awareness requires applying analytics to all cameras all the time, to provide a sound basis for anomaly detection. Video tracking of people and objects requires not just having evidence of the area where a critical asset was compromised, or an individual was hurt, but coverage of the entry and exit travel paths of all the individuals involved, including witnesses.

Video analytics for retail operations require much more detailed coverage of store merchandise displays, and 100% coverage of customer traffic areas with sufficient quality to determine customer facial responses so that AI can identify frustration, displeasure, or delight resulting from merchandise and promotional displays. Retail video is just one of the directions where video applications are going, and there are many of them for industrial/occupational safety, security and business operations reasons.

Need for Understanding Video Compression

Thus, individuals responsible for selecting video technology, as well as those deploying it, need to understand how video compression works and what the quality requirements are for all the various intended uses of the video. It also impacts camera placement and lighting design, which is why I included links to video quality guides in the previous article.

I-Frames, B-Frames and P-Frames

The term “frame” has been historically used for animation and video still images, to refer to the single snapshot in time that it presents. Today, “frame” is even more accurate than “image” because image most accurately refers not to what the camera’s sensor saw, or the compressed full or partial video image being transmitted, but to what the end user sees when viewing a video clip or still image.

In this article we’ll look at what these different H.264 video frame types are and how they relate to the intended uses of video. In the next article we’ll examine smart codecs and their significance to the different uses of video.

I-Frames

I-frame is short for intra-coded frame, meaning that the compression is done using only the information contained within that frame, the way a JPEG image is compressed.  Intra is Latin for within. I-frames are also called keyframes, because each one contains the full image information. This is spatial compression.

P-Frames

P-frame is short for predicted frame and holds only the changes in the image from the previous frame. This is temporal compression. Except for video with high amounts of scene change, the approach of combining I-frames and P-frames can result in compression levels between 50% and 90% compared to sending only I-frames. One significance of this is that higher frame rates can be supported, which improves the quality of the video. Many megapixel cameras today can transmit 60 frames per second whereas 10 years ago, five frames per second was a common limit for megapixel cameras.
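To see how savings in that range can arise, here is a small worked calculation in Python. The frame sizes (100 KB per I-frame, 10 KB per P-frame), the 30 frames-per-second rate and the function name are assumptions picked for illustration only. Actual sizes depend heavily on the scene, resolution and encoder settings.

```python
def kb_per_second(fps, i_frame_kb, p_frame_kb, i_frames_per_second):
    """Data rate in KB/s for a stream mixing I-frames and P-frames."""
    p_frames_per_second = fps - i_frames_per_second
    return i_frames_per_second * i_frame_kb + p_frames_per_second * p_frame_kb


fps = 30
# Every frame is an I-frame.
i_only = kb_per_second(fps, 100, 10, i_frames_per_second=fps)
# One I-frame per second; the other 29 frames are P-frames.
mixed = kb_per_second(fps, 100, 10, i_frames_per_second=1)
savings = 1 - mixed / i_only

print(f"I-frames only: {i_only} KB/s")
print(f"I + P frames:  {mixed} KB/s")
print(f"Reduction:     {savings:.0%}")
```

Under these assumed sizes the mixed stream needs 390 KB/s instead of 3,000 KB/s, a reduction of 87%. A busier scene would produce larger P-frames and a smaller reduction.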

B-Frames

B-frame is short for bidirectional predicted frame, because it uses differences between the current frame and both the preceding and following frames to determine its content. Figure 1 from Wikipedia shows the relationships between the frame types.

Figure 1. H.264 Frame Types

Image source: Wikimedia Commons

The more P-frames and B-frames there are between I-frames, the greater the compression, but also the greater the possibility of video image loss for the time interval between I-frames. For video conferencing, for example, you might have only one I-frame every five or ten seconds, which means the loss of a single I-frame could result in the loss of video for several seconds, as the reference for the following P-frames wouldn’t exist. Typically, a video conferencing system would simply display the last-received I-frame until the next I-frame was received, and that would have minimal impact on the conference call. The loss of many seconds of security video could be a different story. In the Figure 2 illustration below, the blue arrows from the B-frames point only to I-frames and P-frames.

Figure 2 below shows what the frame sequence might be for a six-frame-per-second video stream with one I-frame per second.

Figure 2. H.264 Stream Example 

Image source: Wikimedia Commons
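The effect of a lost frame can be sketched with a short Python simulation. To keep it simple, it assumes a stream of only I-frames and P-frames, where each P-frame depends on the frame before it. B-frames and any decoder error concealment are left out. The six-frame GOP, the function name and the choice of which frames are lost are illustrative assumptions, not a reproduction of Figure 2.

```python
def decodable_frames(frame_types, lost):
    """Return a list of booleans: can each frame be fully decoded?"""
    result = []
    chain_ok = False  # True while an unbroken chain back to an I-frame exists
    for index, frame_type in enumerate(frame_types):
        received = index not in lost
        if frame_type == "I":
            chain_ok = received  # an I-frame starts a fresh chain
        else:
            chain_ok = chain_ok and received  # a P-frame needs its predecessor
        result.append(chain_ok)
    return result


stream = ["I", "P", "P", "P", "P", "P"] * 2  # two GOPs of six frames each

lost_i = decodable_frames(stream, lost={0})  # first I-frame is lost
lost_p = decodable_frames(stream, lost={3})  # a mid-GOP P-frame is lost

print("Frames unusable after losing the I-frame:", lost_i.count(False))
print("Frames unusable after losing one P-frame:", lost_p.count(False))
```

In this model, losing the I-frame makes all six frames of its GOP unusable, while losing the fourth frame affects only that frame and the two after it. In both cases video recovers at the next I-frame, which is why the interval between I-frames sets the worst-case gap.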

The Wikipedia Inter frame article explains some important aspects of this compression encoding:

Because video compression only stores incremental changes between frames (except for keyframes), it is not possible to fast forward or rewind to any arbitrary spot in the video stream. That is because the data for a given frame only represents how that frame was different from the preceding one. For that reason, it is beneficial to include keyframes at arbitrary intervals while encoding video.

For example, a keyframe may be output once for each 10 seconds of video, even though the video image does not change enough visually to warrant the automatic creation of the keyframe. That would allow seeking within the video stream at a minimum of 10-second intervals. The down side is that the resulting video stream will be larger in size because many keyframes are added when they are not necessary for the frame's visual representation. This drawback, however, does not produce significant compression loss when the bitrate is already set at a high value for better quality (as in the DVD MPEG-2 format).

This is one reason why Eagle Eye’s cloud VMS marks selected frames as key frames for its search function – it facilitates better forward and backward video searching while still allowing the original video stream size to be small by containing fewer I-frames.
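The seeking limitation described in the Wikipedia excerpt can be illustrated with a short Python sketch. It is a generic illustration and not a description of how any particular VMS implements search. The keyframe timestamps and the function name are assumptions.

```python
from bisect import bisect_right


def seek_start(keyframe_times, target):
    """Return the timestamp of the last keyframe at or before target."""
    position = bisect_right(keyframe_times, target)
    if position == 0:
        raise ValueError("target is before the first keyframe")
    return keyframe_times[position - 1]


keyframes = [0, 10, 20, 30]  # seconds; one keyframe every 10 seconds (sorted)
target = 27                  # the moment the viewer wants to see
start = seek_start(keyframes, target)
seconds_to_decode_first = target - start

print(f"Decoding must start at {start}s, "
      f"then roll forward {seconds_to_decode_first}s to reach {target}s")
```

With keyframes every 10 seconds, a player that wants second 27 has to start decoding at second 20 and work forward. More frequent keyframes shorten that roll-forward at the cost of a larger stream.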

Hanwha Techwin America provides a very informative and well-illustrated PDF presentation file titled H.265 vs. H.264 that discusses many of the points in this article and relates them to video settings and video quality.

Group of Pictures (GOP) / Group of Video Pictures (GOV)

These two terms come from the MPEG video compression standards, and for purposes of this discussion mean the same thing. A Group of Pictures begins with an I-frame, followed by some number of P-frames and B-frames. Figure 2 above shows a video stream with a GOP length of six: one I-frame followed by five P-frames and B-frames. Most security video documentation uses GOP Length and GOV Length interchangeably, which only causes confusion if you don’t know about it and the camera settings say “GOV Length” while the VMS settings say “GOP Length.” So, a shorter GOP length results in more I-frames, while a longer GOP length results in fewer I-frames and greater compression.
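The relationship between GOP length and I-frame count can be made concrete with a small Python sketch. The function names and the `b_frames_between` parameter are hypothetical, and the patterns are shown in display order. Real encoders choose frame types adaptively and may transmit frames in a different order, so treat this as an illustration of the arithmetic only.

```python
def gop_pattern(gop_length, b_frames_between=0):
    """Build one GOP: an I-frame, then P-frames optionally preceded by B-frames."""
    frames = ["I"]
    while len(frames) < gop_length:
        remaining = gop_length - len(frames)
        # Always leave room for the P-frame that closes each run of B-frames.
        b_run = min(b_frames_between, remaining - 1)
        frames.extend(["B"] * b_run)
        frames.append("P")
    return "".join(frames)


def i_frames_per_minute(fps, gop_length):
    """Number of I-frames sent per minute at a fixed GOP length."""
    return fps * 60 / gop_length


print(gop_pattern(6))                      # I-frame followed by five P-frames
print(gop_pattern(6, b_frames_between=2))  # same length, with B-frames mixed in
print(i_frames_per_minute(fps=6, gop_length=6))   # short GOP
print(i_frames_per_minute(fps=6, gop_length=60))  # long GOP
```

At six frames per second, a GOP length of 6 yields 60 I-frames per minute, while a GOP length of 60 yields only 6. That drop is where the extra compression of a longer GOP comes from.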

Quality Implications of Frame Types

Video encoding includes methods that average out the properties of very small blocks of pixels in the image. Earlier versions of video standards have stricter requirements relating to the types of compression, such as averaging, that can be used for encoding the pixel blocks in each frame type. Later standards allow more compression options.

H.264 Slices

What’s also significant about H.264 is that the standard introduced the idea of separately encoded “slices.” A slice is a spatially distinct region of a frame that is encoded separately from any other region in the same frame. I-slices, P-slices, and B-slices take the place of I, P, and B frames.

That part of the standard is what has led to smart codecs, where the type of compression used – such as high, medium or low quality – can vary within the same frame. Thus, for example, a plain wall can receive high-compression low-quality treatment, while the person walking in front of the wall gets low-compression high-quality treatment.
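The wall-and-person example can be sketched as follows. This is a simplified illustration of varying compression strength by region and does not represent any vendor's smart codec. The quantization step sizes, the pixel values, the foreground mask and the function names are all assumptions.

```python
def quantize(value, step):
    """Round a pixel value to the nearest multiple of step (larger step = more loss)."""
    return (value + step // 2) // step * step


def encode_regions(pixels, is_foreground, fine_step=2, coarse_step=32):
    """Use a fine step for foreground pixels and a coarse step for background."""
    return [
        quantize(value, fine_step if foreground else coarse_step)
        for value, foreground in zip(pixels, is_foreground)
    ]


# One hypothetical row of pixels: plain wall, then a person, then wall again.
row = [117, 119, 121, 64, 70, 118, 120]
foreground = [False, False, False, True, True, False, False]

encoded = encode_regions(row, foreground)
print(encoded)
```

The wall pixels all collapse to the same value, which compresses very well, while the pixels belonging to the person keep their original detail.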

Smart codecs utilize multiple types of compression within a single image, providing higher quality and higher compression for each type of frame, depending, of course, on the content of the frame.

The next article in this series will provide more detail on smart codecs, which is important, because the “smartness” of such codecs can vary significantly between vendors and even within a vendor’s own camera line.

About the Author:

Ray Bernard, PSP CHS-III, is the principal consultant for Ray Bernard Consulting Services (RBCS), a firm that provides security consulting services for public and private facilities (www.go-rbcs.com). In 2018 IFSEC Global listed Ray as #12 in the world’s top 30 Security Thought Leaders. He is the author of the Elsevier book Security Technology Convergence Insights available on Amazon. Mr. Bernard is a Subject Matter Expert Faculty of the Security Executive Council (SEC) and an active member of the ASIS International member councils for Physical Security and IT Security. Follow Ray on Twitter: @RayBernardRBCS.