Eye on Video: Adding audio intelligence

The movie industry introduced its first "talkie" back in 1927. Yet video surveillance, for the most part, has remained oddly silent. Given that what we hear adds as much to our understanding of events as the images we see, the lack of an audio component can seriously impact the ability of security personnel to effectively protect property and people.

Consider a video surveillance system sans audio. A cry for help, the sound of breaking glass, a gunshot, or an explosion in the vicinity of a camera - but outside the field of view - would escape notice. Even in a parking lot under visual surveillance, without audio support security staff might never know that a vehicle's alarm had gone off.

Audio covers a 360-degree area, enabling a video surveillance system to extend its coverage beyond a camera's field of view. Intelligent audio can instruct a pan/tilt/zoom (PTZ) or dome camera or an operator of the camera to visually verify an audio alarm, giving remote security personnel additional information about the environment on which to base their response.

In addition to serving as a listening post, audio lets security personnel communicate with visitors or intruders. If a person in a camera's field of view exhibits suspicious behavior - such as loitering near a bank machine or entering a restricted area - a remote security guard can send a verbal warning to the individual. In cases where the camera reveals a person who is injured, the guard can remotely assure the victim that help is on the way. Audio naturally dovetails with a variety of security applications. In access control, intercom technology is a strong fit. A doorman can remotely greet visitors before buzzing them into a building. In an unmanned garage, patrons can request assistance from a remote security attendant.

Deployment obstacles: analog vs. network video systems

The industry expects audio adoption to increase as network video systems become more commonplace, primarily because audio is easier to implement in network video systems than in analog CCTV systems.

Analog systems require users to install separate audio and video cables from the camera and microphone location to the recording and viewing station. If the distance is too long, you need to add balanced audio equipment, which increases installation difficulty and cost. A simpler way would be to tie the analog cameras into a network video system, using video encoders with built-in audio support.

Network video systems equipped with audio support process the audio and send both the audio and video over the same network cable used for monitoring and/or recording. This eliminates the need for extra cabling and makes synchronizing the audio and video much easier. (For more information, see the previous article in this series, Eye on Video: Intelligent Video Architecture, What to consider when deciding upon centralized or distributed surveillance analytics.)

Selecting audio equipment

Network cameras or encoders that support audio generally include a built-in microphone, but very rarely a built-in speaker. While a built-in microphone may be adequate for some applications, others may require a more sensitive external microphone. External microphones fall into four main categories: condenser, electret condenser, dynamic and directional.

Condenser microphones offer the highest audio sensitivity and quality. These are the same microphones used in professional sound studios.

Electret condenser microphones offer a high level of sensitivity and are less expensive than condenser microphones.

Dynamic microphones are rarely used in security or video surveillance because they typically do not possess sufficient audio sensitivity.

Directional microphones pick up sound based on a particular pattern. An omni-directional microphone picks up audio equally well in all directions. Unidirectional microphones have audio sensitivity in one specific direction.

Adding audio detection alarms

Audio can be analyzed by a network camera in much the same way as video. Audio detection nicely complements video motion detection since it can react to events in areas outside the camera's view or too dark for video motion detection to function properly. When intelligent audio detects a suspicious sound - such as a pane of glass breaking or voices in a room that should be unoccupied - it can trigger a response in much the same way intelligent motion detection or door contact systems can. The system can instruct the network camera to record and send audio and video, send an e-mail or other alert, or activate alarms or other external devices. In systems with PTZ or network dome cameras, audio alarm detection can direct a camera to automatically turn to a preset location, such as a specific window or doorway.

If you use directional microphones, the audio system can even ascertain which direction the sound is coming from and point a PTZ camera in that direction. This feature is particularly useful in city center surveillance projects, where operators often monitor a large array of fixed and PTZ cameras.

Audio detection offers a number of deployment options. You can enable audio detection all the time or only during specific times, or disable it during certain events, such as closed-door meetings. You can set it to trigger a sequence of responses if the incoming sound level rises above, falls below or passes a certain level of sound intensity.
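As a rough illustration of the "rises above, falls below or passes" options, the sketch below compares the sound level of two consecutive audio frames against a threshold. It is a minimal example under stated assumptions, not how any particular camera implements audio detection: the function names, the mode strings and the use of RMS level on samples scaled to the range -1.0 to 1.0 are all hypothetical choices made for the example.

```python
import math


def rms_level(samples):
    """Root-mean-square level of a frame of samples scaled to -1.0..1.0."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def detect_crossing(previous_level, current_level, threshold, mode="rises_above"):
    """Return True when the level crosses the threshold in the configured way.

    The mode names are hypothetical labels for the three options in the text.
    """
    rose = previous_level <= threshold < current_level
    fell = previous_level >= threshold > current_level
    if mode == "rises_above":
        return rose
    if mode == "falls_below":
        return fell
    if mode == "passes":
        return rose or fell
    raise ValueError(f"unknown mode: {mode}")


# Illustrative frames: a quiet room followed by a sudden loud sound.
quiet = [0.01, -0.02, 0.015, -0.01]
loud = [0.6, -0.7, 0.65, -0.55]
threshold = 0.3  # assumed level, chosen only for this example

triggered = detect_crossing(rms_level(quiet), rms_level(loud), threshold)
print("alarm triggered:", triggered)
```

In a real deployment the trigger would start whatever sequence of responses has been configured, such as recording or sending an alert.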

Choosing an audio compression algorithm

For efficient transmission and storage, analog audio signals must be converted into digital audio through a sampling process and then compressed to reduce the file size.

Sampling rate refers to the number of times per second a sample of an audio signal is taken. Generally, the sample rate must be at least twice the maximum required frequency. For example, if you want to capture human speech, which is normally below 4 kHz, you need a sample rate of at least 8 kHz. In general, the higher the sampling frequency, the better the audio quality and the greater the bandwidth and storage required.
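The sampling rule can be worked through in a few lines. The sketch below is illustrative only, with hypothetical function names; it computes the minimum sample rate for a given maximum frequency and the bit rate that results before any further compression, assuming a fixed number of bits per sample.

```python
def min_sample_rate_hz(max_frequency_hz):
    """Lowest sample rate able to represent content up to max_frequency_hz."""
    return 2 * max_frequency_hz


def uncompressed_bit_rate(sample_rate_hz, bits_per_sample, channels=1):
    """Bits per second produced by sampling, before any further compression."""
    return sample_rate_hz * bits_per_sample * channels


speech_rate = min_sample_rate_hz(4_000)
print(speech_rate)  # 8000 (Hz)

# 8-bit samples at 8 kHz give 64 kbit/s, the bit rate quoted for G.711.
eight_bit_speech = uncompressed_bit_rate(speech_rate, bits_per_sample=8)
print(eight_bit_speech)  # 64000 (bit/s)
```

Doubling either the sample rate or the bits per sample doubles the bit rate, which is why higher audio quality costs more bandwidth and storage.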

Compression is defined by bit rate. The higher the compression level, the lower the bit rate and the lower the audio quality, especially for more complex sounds. Higher compression levels may also introduce more transmission delay, but they save on bandwidth and storage.

There are a number of coding and decoding (codec) algorithms for audio data, each with different sampling frequencies, bit rates and levels of compression. All of these factors affect audio quality and file size.
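To get a feel for how bit rate drives file size, the following sketch estimates the storage needed for one hour of continuous audio at several constant bit rates. It is a back-of-the-envelope calculation with a hypothetical helper name, assuming 1 kbit = 1,000 bits and 1 MB = 1,000,000 bytes, and ignoring container and network overhead.

```python
def storage_megabytes(bit_rate_kbit_s, hours):
    """Approximate storage in MB for audio at a constant bit rate."""
    bits = bit_rate_kbit_s * 1000 * hours * 3600
    return bits / 8 / 1_000_000


# Bit rates in kbit/s, chosen only to illustrate the calculation.
per_hour = {rate: storage_megabytes(rate, 1) for rate in (64, 40, 32, 24, 16)}

for rate, megabytes in per_hour.items():
    print(f"{rate} kbit/s -> {megabytes:.1f} MB per hour")
```

At 64 kbit/s this works out to about 28.8 MB per hour per audio channel, and at 16 kbit/s to about 7.2 MB.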


Audio compression

Compression standard | Sampling frequency | Bit rate
(standard not named) | 8-96 kHz | 2-300+ kbit/s
G.711 PCM | 8 kHz | 64 kbit/s
(standard not named) | 8 kHz | 16, 24, 32 and 40 kbit/s
G.722.2 or AMR-WB | 16 kHz | 6.60-23.85 kbit/s

Tips for proper deployment

There are a number of factors to consider when deploying audio to ensure that you achieve the best quality from the installation.

Audio equipment and placement. Place the microphone as close as possible to the source of the sound. For two-way communication, point the microphone away from the speaker and keep it some distance away to reduce feedback.

Signal amplification. Amplify the signal as early as possible to minimize noise in the signal chain. Set the signal level as close as possible to, but not over, the clipping level, which is the level at which audio becomes distorted.

Acoustical adjustments. Adjust the input gain and use different features such as echo cancellation and speech filter to reduce distortion, eliminate feedback and screen background noise.

Codec and bit rate selection. Choosing a variable bit rate that adjusts to the complexity of the audio will help you achieve a higher quality stream than a constant bit rate file of the same size.

Shielded cabling. Use shielded audio cable to minimize disturbance and noise. Avoid running the cable near power cables and cables carrying high frequency switching signals. Keep the audio cables as short as possible. If you need to run a long audio cable, be sure that the cable, amplifier and microphone are balanced to reduce noise.

Legal restrictions. Some countries place restrictions on audio and video surveillance. Check with local authorities to determine what is allowable before you start.
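The signal amplification tip above can be checked numerically. The sketch below is a minimal illustration with hypothetical helper names: it measures the peak level of a frame in dB relative to full scale (dBFS), flags clipping, and computes the gain that would bring the peak to a chosen margin below the clipping level. It assumes samples scaled to the range -1.0 to 1.0; the 6 dB margin is an arbitrary value for the example, not a recommendation from this article.

```python
import math


def peak_dbfs(samples):
    """Peak level in dB relative to full scale for samples in -1.0..1.0."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return float("-inf")
    return 20 * math.log10(peak)


def is_clipping(samples, full_scale=1.0):
    """True if any sample reaches or exceeds full scale."""
    return any(abs(s) >= full_scale for s in samples)


def gain_for_headroom(samples, headroom_db):
    """Linear gain that places the peak headroom_db below full scale."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        raise ValueError("cannot compute gain for a silent frame")
    target_peak = 10 ** (-headroom_db / 20)
    return target_peak / peak


frame = [0.1, -0.25, 0.2]  # illustrative low-level input
gain = gain_for_headroom(frame, headroom_db=6.0)
boosted = [s * gain for s in frame]

print(f"peak before: {peak_dbfs(frame):.1f} dBFS")
print(f"peak after:  {peak_dbfs(boosted):.1f} dBFS, clipping: {is_clipping(boosted)}")
```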

Where audio intelligence is going

Industry experts predict that advances in audio analytics will soon rival those of video. Audio applications will support two audio compression formats simultaneously to allow users to take advantage of different strengths of each for different purposes. Improvements will be ongoing in areas such as synchronization, real-time and post-event analytics, audio tracking with PTZ cameras, audio quality and the ability to search for certain sounds. These algorithms are becoming so sophisticated that they can detect certain word usage or even the tone of voice. Since incidents often start with verbal aggression, audio surveillance offers enormous benefit in alerting security personnel early on to prevent a conflict from escalating into violence.

Fredrik Nilsson is general manager of Axis Communications, a provider of IP-based network video solutions that include network cameras and servers for surveillance. This story is part of Mr. Nilsson’s “Eye on Video” series appearing in ST&D and on SecurityInfoWatch.com and IPSecurityWatch.com.