Ambarella's New SoC Makes AI Processing a Reality

Jan. 26, 2022
Chips born of the auto industry's needs could revolutionize video surveillance
There were many introductions of Systems on Chip (SoCs), AI Accelerators and Software Development Kits (SDKs) supporting the latest IoT Neural Networks, or “brains,” at CES 2022. Some use natural language detection, speech processing and energy waveform analysis; one even extracted barely perceptible speech amid the noise of a trade show like CES.

At CES 2022, Ambarella unveiled CV5, an artificial intelligence (AI) vision processor born of the needs of the automobile industry and capable of recording 8K video or four 4K video streams. Its new AI Neural Network-based Image Signal Processor (ISP) enhances color imaging and applies HDR in ultra-low-light conditions.

Among all of the SoC innovations, Ambarella’s seemed to turn the most heads – processing visual streams at a fraction of the power required only a few years ago, with up to 100X improved resolution.

Additionally, just prior to the show, Ambarella completed its acquisition of Oculii, a maker of radar perception AI algorithms that use current production radar chips to achieve significantly higher (up to 100X) resolution, longer range and greater accuracy. These improvements augment Ambarella’s visual intelligence processing suite of 8K, 4K, 3D and LiDAR streams, and they eliminate the need for specialized high-resolution radar chips, which consume significantly more power and cost more than conventional radar solutions.

Changing the Radar/LiDAR Paradigm

Radar used in the security industry and in advanced driver-assistance systems (ADAS) uses high-frequency radio waves to determine the range, direction and velocity of objects that appear as “blobs” – even through inclement weather and fog. With Oculii, those “blobs” become objects with detail.
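The range and velocity measurements described above fall out of two standard radar relations: time of flight for range and Doppler shift for radial velocity. A minimal sketch of both (the 77-GHz carrier is an assumed example typical of automotive radar, not a figure from the article):

```python
C = 299_792_458.0  # speed of light, m/s

def radar_range(round_trip_s):
    """Range from echo time of flight: r = c * t / 2 (the wave travels out and back)."""
    return C * round_trip_s / 2

def doppler_velocity(doppler_hz, carrier_hz=77e9):
    """Radial velocity from Doppler shift: v = f_d * c / (2 * f_c)."""
    return doppler_hz * C / (2 * carrier_hz)
```

A 1-microsecond echo corresponds to a target about 150 m away, and a roughly 5.1-kHz Doppler shift at 77 GHz corresponds to about 10 m/s of closing speed. Oculii's contribution is in software that raises the angular resolution of such measurements, not in changing these physics.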

The fusion of Ambarella’s camera technology and Oculii’s radar software will provide an all-weather, low-cost and scalable perception solution – enabling higher levels of autonomy for vehicles, surveillance and drone OEMs.

The facility security and public safety markets have already begun the transition to “alternative” visual sensors and processing using 3D imaging, radar, LiDAR and more. At CES, Ambarella showed off those sensors, with AI processing, on its existing CV2 and new CV3 Vision Processors – rendering stunningly detailed three-dimensional images of people, faces, vehicle make and model, vehicle occupants and license plates in real time at significant cost savings.

If privacy is required, the same “camera” with these sensors is capable of providing another stream without the visible light imagery. In other words, it provides the greatest detail of the person and what they are carrying, without facial imagery, thus preserving privacy.

Ambarella’s products are already used in a variety of human and computer vision applications, including video security devices, ADAS, electronic mirrors, drive recorders, driver/cabin monitoring, autonomous driving and robotics. For example, the high-end Ring Doorbell Pro 2 (Model 5AT2S2) delivers enhanced 1536p HD video with an expanded head-to-toe view, a bird’s-eye view with intruder motion history and dual-band Wi-Fi, and it runs on the Ambarella CV25M SoC. Other IP cameras are also based on Ambarella platforms, including several 4K-resolution dashcams used in demanding environments.

The Best of Object Rendering and Behavior Recognition Applications

There were nearly 50 different individual demonstrations of object rendering and behavior recognition paired with sensors and processors at Ambarella’s CES exhibit. Here are a handful that caught my attention:

Immersive visualization: A building lobby in a multi-tenant facility might require several kinds of screening, performed continuously and accurately. In this demonstration, four 4K visible-light RGB cameras were paired with 3D imagers, producing eight streams – four of which used Convolutional Neural Networks (CNNs) to process skeletal poses, object detection, face detection and something the person has with them. Skeletal poses can quickly alert on slip-and-falls or crowding; object detection can flag package theft, visitors wearing masks and weapons. A single CV5 AI vision processor supporting four 4K RGB full-color imagers – with an option to include 3D and LiDAR sensors – will lower the cost of immersive surveillance at stadiums, concerts, conventions and borders.

About six years ago, IP security camera OEMs began scaling back on products that process multiple sensors with a common SoC. Now that specialized AI processing through CNNs achieves more accurate real-time recognition of many different object classes and behaviors, CV5 vision processors paired with different sensor groups can lower cost, improve recognition accuracy and significantly reduce power consumption.

The IoT “endgame” will be realized with these sensor groups, a vision processor like Ambarella’s CV5, and Neural Networks so efficient that they can operate for years on battery power.

AI-based low-light and HDR processing: HDR, WDR and low-light color imaging are currently offered by nearly all IP camera OEMs with varying effectiveness, but this is a leap forward, using CNNs to process a scene more accurately.

The CVflow AI engine, Ambarella’s new AI-based ISP architecture, uses neural networks to augment image processing. This approach enables color imaging with minimal noise at very low lux levels – a 10 to 100X improvement over state-of-the-art traditional ISPs – and new levels of high-dynamic-range (HDR) processing with more natural color reproduction. At the exhibit, the “low light” was very close to total darkness, even when I viewed the demonstration area through a port and waited until my eyes adjusted to the lack of illumination.

Biometric access control: Legacy face-matching algorithms can often be spoofed by 2D images of the person on file. When a 3D time-of-flight (ToF) camera is used together with an RGB camera and an SoC capable of fusing both streams via CNNs, false positives are statistically eliminated and trusted personnel entry is achieved.
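One way a fused depth stream defeats 2D spoofing is a simple planarity check: a printed photo or phone screen presents an almost flat depth profile, while a real face has tens of millimetres of relief. A minimal NumPy sketch of that idea (the function name, threshold and percentile choices are illustrative assumptions, not Ambarella's algorithm):

```python
import numpy as np

def is_live_face(depth_patch_mm, min_relief_mm=15.0):
    """Hypothetical liveness gate on a ToF depth patch covering a matched face.

    A flat spoof (photo or screen) shows almost no depth spread; a real face
    has noticeable relief (nose tip vs. cheeks). Percentiles make the check
    robust to a few noisy pixels."""
    relief = np.percentile(depth_patch_mm, 95) - np.percentile(depth_patch_mm, 5)
    return relief >= min_relief_mm

rng = np.random.default_rng(0)
# A photo held 600 mm away: flat apart from ~1 mm of sensor noise.
flat_spoof = np.full((32, 32), 600.0) + rng.normal(0.0, 1.0, (32, 32))
# A real face at the same distance: the nose region sits 30 mm closer.
real_face = np.full((32, 32), 600.0)
real_face[12:20, 12:20] -= 30.0
```

Here `is_live_face(flat_spoof)` is false while `is_live_face(real_face)` is true; in practice such a gate would run only after the RGB face match succeeds, so the two streams veto each other's failure modes.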

Anonymous occupancy sensing: Detailed visual imaging is not always necessary to maintain safe occupancy in a building or space. 3D imaging, radar or LiDAR streams can be processed by Neural Networks, giving an accurate real-time occupancy count for a given space – or even a projection by time of day – while maintaining privacy.
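The counting itself can run entirely on the depth stream. A minimal sketch, assuming an overhead depth sensor where people appear as blobs of pixels closer than the floor (the thresholds and blob sizes are illustrative, not from the article):

```python
from collections import deque

import numpy as np

def count_occupants(depth_mm, floor_mm=4000.0, min_blob_px=20):
    """Count people in an overhead depth frame; no visible-light imagery involved.

    Pixels closer than the floor are foreground; each 4-connected blob of at
    least min_blob_px pixels counts as one occupant."""
    mask = depth_mm < floor_mm
    seen = np.zeros(mask.shape, dtype=bool)
    h, w = mask.shape
    count = 0
    for y in range(h):
        for x in range(w):
            if mask[y, x] and not seen[y, x]:
                size, queue = 0, deque([(y, x)])
                seen[y, x] = True
                while queue:  # BFS flood fill over one connected blob
                    cy, cx = queue.popleft()
                    size += 1
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            queue.append((ny, nx))
                if size >= min_blob_px:  # reject isolated noise pixels
                    count += 1
    return count

frame = np.full((40, 40), 4500.0)  # empty floor 4.5 m below the sensor
frame[5:11, 5:11] = 3000.0         # person one
frame[25:31, 20:26] = 3000.0       # person two
frame[0, 39] = 3500.0              # single noisy pixel, filtered out
```

On this frame, `count_occupants(frame)` returns 2. Because only distances are processed, the count carries no identifying imagery at all.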

Delivery via vehicle; vehicle theft: Vehicles are often targets of fast “smash and grab” activity or sophisticated gray-market parts distribution. A successful delivery of goods at a residence or business involves some common behaviors, but it also needs to be distinguished from anomalies and recorded. In this demonstration, multiple visible-light cameras with Ambarella SoCs and CNNs processed all of these behaviors in real time.

When processing images and video, a CNN recognizes pixels containing image data, takes advantage of adjacent pixels to recognize one or more objects, and downscales the images (known as pooling) for simpler recognition at scale. In general, a “basic” DNN would return similar analyses even if all the pixels in an image were shuffled, whereas a CNN recognizes patches of pixels together as meaningful parts of an object, thereby reducing the number of processing layers.
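The locality-and-pooling behavior described above can be seen in a few lines of NumPy: a hand-rolled convolution whose every output value depends on a small patch of neighboring pixels, followed by max pooling to downscale. This is a toy illustration of the general technique, not Ambarella's CVflow implementation:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation: each output value summarizes one local patch."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Pooling: downscale by keeping the strongest response in each block."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size
    fm = feature_map[:h, :w]
    return fm.reshape(h // size, size, w // size, size).max(axis=(1, 3))

image = np.zeros((8, 8))
image[:, 4:] = 1.0                              # a vertical edge down the middle
edge_kernel = np.array([[-1.0, 0.0, 1.0]] * 3)  # fires only on left-to-right transitions

features = conv2d(image, edge_kernel)  # 6x6 map: strong response only along the edge
pooled = max_pool(features)            # downscaled to 3x3 for cheaper later layers
```

Shuffling the image's pixels would destroy the edge response, whereas a pixel-order-agnostic statistic such as the image mean would be unchanged – which is exactly why a CNN, unlike a "basic" fully connected network, cares about where pixels are.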

Distance inference: When multiple people are in a video doorbell’s field of view, it can be useful to see what each person was doing before the doorbell was pressed, as well as their distance from the user at home or from a commercial entry.

Steve Surfaro is Chairman of the Public Safety Working Group for the Security Industry Association (SIA) and has more than 30 years of security industry experience. He is a subject matter expert in smart cities and buildings, cybersecurity, forensic video, data science, command center design and first responder technologies. Follow him on Twitter, @stevesurf.