AI is coming to video surveillance, but what kind of intelligence do end-users need?

Oct. 4, 2017
An in-depth look at the benefits of, and differences between, convolutional and spiking neural networks

When IBM's Deep Blue computer won its first game of chess against world champion Garry Kasparov in 1996, the public got a real taste of how powerful computers had become in competing with human intelligence. Since then, not only has computing power grown exponentially, but the cost of processing power has fallen dramatically. These trends, combined with advances in artificial intelligence algorithms, have enabled the development of systems that can, in some instances, perform tasks better than human beings.

Video surveillance is one of these tasks, and it presents a large market opportunity: the ability to analyze video has barely improved, despite the massive growth in surveillance and in the storage of video data. According to IHS, 127 million surveillance cameras and 400,000 body-worn cameras will ship in 2017 - in addition to the estimated 300 million cameras already deployed - and approximately 2.5 exabytes of data will be created every day.

Major Challenges

One problem for surveillance operators is directed attention fatigue. The brain naturally alternates between periods of attention and distraction. In surveillance, distractions can result in dire consequences. So, what if we could have a surveillance system that never got distracted - one that worked with humans to cut out the errors? That's the promise of artificial intelligence in video surveillance.

But here's the challenge: computers don't really work like the human brain. For example, they separate processing and memory, which the brain does not, and while computers are purely digital systems, the brain demonstrates both analog and digital characteristics, making it more complicated to model. Neuromorphic computing is the science that seeks to understand how the human brain works and to apply some of its characteristics to computers to make them better at certain functions. For some time, computers have been better than most of us at crunching numbers; a cellphone processor, for example, can complete 100 billion such operations in a second. The brain did not evolve to do that kind of arithmetic, but it is extremely good at sensing, processing and reacting to streams of information gathered from our environment. For video surveillance, artificial intelligence that demonstrates the latter characteristic is most relevant. In addition, AI systems based on computers have the added advantage of reliable memory – something that often eludes the human brain.

Early Development of AI in Computer Vision

Until 2012, computers couldn't recognize many different types of images, but an algorithm developed by Alex Krizhevsky changed that. He demonstrated that object recognition and classification could be achieved by simulating and training a network of computational elements. The topology of this network resembled that of brain cells (neurons), hence the name artificial neural network. Krizhevsky's base computational element was a convolution, a type of math function that performs filtering, so such networks became known as convolutional neural networks (CNNs). CNNs are a powerful addition to the computer vision arsenal, but they have two key limitations in video surveillance.
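
For readers unfamiliar with the term, the sketch below shows the core operation a CNN layer repeats millions of times: sliding a small grid of weights (a kernel) across an image and computing a weighted sum at each position. It is a minimal NumPy illustration using a hand-picked 3 x 3 edge-detection kernel; in a trained network such as Krizhevsky's, the kernel weights are learned rather than chosen by hand.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a small kernel over a 2-D image and return the filtered result."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # Each output pixel is a weighted sum of the patch under the kernel.
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# Illustrative 3x3 edge-detection kernel; a trained CNN learns such weights itself.
edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]], dtype=float)

frame = np.random.rand(24, 24)          # stand-in for a grayscale video patch
edges = convolve2d(frame, edge_kernel)  # highlights intensity changes (edges)
```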

First, their learning process requires a great deal of computationally intensive iteration; even powerful cloud computers can take days or weeks to complete the task. Second, a large set of training data is needed. In image recognition, this means collecting a lot of images in which each object has been labeled, so that an error function can be calculated at the end of each pass through the neural network. Millions of training cycles and millions of labeled images may be needed to recognize all the objects relevant to the system's required function.
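
To make that iteration concrete, here is a minimal sketch of the supervised training loop described above, written with the PyTorch library. The tiny network, the random stand-in data and the hyper-parameters are illustrative assumptions only; a surveillance-grade classifier would involve a far deeper network, millions of labeled images and many more passes.

```python
import torch
import torch.nn as nn

# A deliberately tiny CNN classifier for 24x24 grayscale inputs.
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 22 * 22, 10),           # 10 hypothetical object classes
)
loss_fn = nn.CrossEntropyLoss()            # the "error function" computed each pass
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Placeholder for a large labeled dataset: (image, label) pairs.
images = torch.rand(64, 1, 24, 24)
labels = torch.randint(0, 10, (64,))

for epoch in range(100):                   # real training runs far more cycles
    optimizer.zero_grad()
    predictions = model(images)
    loss = loss_fn(predictions, labels)    # compare predictions to the human labels
    loss.backward()                        # propagate the error back through the network
    optimizer.step()                       # nudge every weight to reduce the error
```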

Other limitations of the technology include poor noise immunity, particularly when random pixels appear in an image due to noisy sensors or lens contamination. What's more, false classifications can arise if the network gets confused - for example, by someone wearing glasses - and the network cannot find a new face in a crowd unless a large set of labeled images of that face is added to the database. The network parameters of CNNs need careful adjustment, and even then the accuracy rate for correct image classification can be less than ideal for surveillance applications.

To summarize, CNNs can be used to enhance video surveillance, but only with substantial processing power and an enormous amount of training data on hand, both of which add considerable cost. The time needed to train such networks and their inability to learn on the fly are also inhibiting factors.

Spiking Neural Networks (SNNs) and Video Surveillance

SNNs seek to simulate different aspects of the way the brain works. Our brains generate brief energy bursts, or "spikes," which occur at precise times relative to each other; billions of them flow through our neurons in parallel. Our brains convert visual (and other) stimuli, including colors and image segments, into pulse trains of spikes, which are processed by our neurons. Synapses connect neurons together, with the brain using chemical and electrical potentials as messengers. Each synapse has a tiny 'memory' that stores a value set by the electrical energy in a spike. Each neuron sums all the values arriving at its input synapses, and fires off its own spike if that sum reaches a critical value. Feedback determines which input spikes contributed to an output event and promotes the significance of those synapses' signals, while other synapses' signals are demoted. In this way, the neuron becomes sensitized to a specific pattern of spikes at its input. This is in stark contrast to CNNs, which rely on complex math functions; SNNs actually model the functionality of neurons.
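
A rough software sketch can make this mechanism concrete. The code below simulates a single leaky integrate-and-fire neuron: weighted input spikes accumulate in a membrane potential, the neuron fires when a threshold is crossed, and the synapses that just contributed are strengthened while the others are slightly weakened. All the constants are illustrative assumptions and this is not BrainChip's implementation, but it captures the summing, thresholding and feedback described above.

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs = 100
weights = rng.uniform(0.0, 1.0, n_inputs)   # the per-synapse 'memory' values
threshold = 30.0                             # critical value at which the neuron fires
leak = 0.9                                   # membrane potential decays between steps
potential = 0.0

for t in range(200):
    # Incoming spikes this time step (1 = a spike arrived on that synapse).
    spikes_in = (rng.random(n_inputs) < 0.05).astype(float)

    # Sum the weighted contributions of all arriving spikes, with leak.
    potential = leak * potential + np.dot(weights, spikes_in)

    if potential >= threshold:
        print(f"t={t}: output spike")
        # Crude stand-in for the feedback described above: synapses that just
        # contributed are promoted, the rest are slightly demoted.
        weights = np.clip(weights + 0.01 * spikes_in - 0.001 * (1.0 - spikes_in), 0.0, 1.0)
        potential = 0.0                      # reset after firing
```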

So, what does this all mean for image classification in video? Today's SNN technology can find patterns and people in video from a single image. For instance, a police department looking for a suspect in live video streams does not have thousands of images of that suspect, nor does it have weeks to train a CNN system. In an SNN-based system, the reference image can be as small as 24 x 24 pixels – it doesn't need to be high definition. The technology learns in real time, requires only modest processing power (typically an x86 desktop computer or server) and consumes little energy. All this means it can be used with legacy systems without requiring expensive hardware or infrastructure upgrades. SNN technology can be implemented as a software-only solution, or accelerated using an FPGA-based PCIe add-in card.
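
For illustration, one common technique in SNN research for feeding such a small image into a spiking network is latency coding, where brighter pixels produce earlier spikes. The sketch below shows that generic encoding; the 20 ms window and the 24 x 24 patch are assumptions for the example, and this is not necessarily how any particular commercial system encodes video.

```python
import numpy as np

def image_to_spike_times(image, window_ms=20.0):
    """Latency coding: brighter pixels spike earlier within the time window."""
    norm = image.astype(float) / 255.0            # scale intensities to 0..1
    # Bright pixel (1.0) -> spike at t = 0; dark pixel (0.0) -> spike at end of window.
    return (1.0 - norm) * window_ms

patch = np.random.randint(0, 256, (24, 24), dtype=np.uint8)  # stand-in 24x24 face crop
spike_times = image_to_spike_times(patch)
print(spike_times.shape, spike_times.min(), spike_times.max())
```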

Real-World Performance of SNNs

In a casino trial, a software-only implementation of an SNN running on x86-based servers was tasked with recognizing all 52 cards in a pack, plus jokers, as well as the gambling chips (See Figure 1 above). The SNN was trained in the real-world environment, with its poor lighting and low camera resolution.

With the cards dealt naturally, they were recognized as long as they were dealt face up (as in Baccarat). Across 29 tables, card recognition accuracy was 99.76 percent and chip recognition accuracy was 99.21 percent.

But what about facial classification, a key concern in civil surveillance applications? To evaluate this, the SNN was tested on a dataset of web images collected by the California Institute of Technology (Caltech). It correctly identified all occurrences of the same person (See Figure 2 above) in a dataset of 450 faces, without false positives. Recognition performance was not affected by adding 68 percent noise (Fig. 2b), noise plus a 52 percent gamma offset (Fig. 2c), or noise plus blur (Fig. 2d).

In a further trial, the system extracted and tracked more than 500,000 facial images from eight high-definition cameras in three-and-a-half hours, using one x86 server. In another, it processed 36 hours of video in less than two hours and extracted more than 150,000 facial images.

Tasks that seemed impossible for machines just a few years ago are becoming routine. CNNs are a major step forward but SNNs perhaps have the greatest potential to bring valuable new capabilities into mainstream video surveillance today.

About the Author:

BrainChip Senior Vice President of Marketing and Business Development Bob Beachler is a Silicon Valley veteran with over 30 years of success in developing and marketing cutting-edge technologies. He served as Vice President of Marketing, Operations, and Systems Design at Stretch Inc., a provider of embedded video processing solutions, until its acquisition by Exar Corporation in 2014. Most recently, he served at Xilinx Corporation, the leading worldwide independent provider of FPGA products, where he led the marketing of imaging, video and machine learning solutions for Xilinx's industrial, scientific, and medical markets. He holds a Bachelor of Science in Electrical Engineering from The Ohio State University. BrainChip develops and markets SNNs for civil surveillance applications.