AI’s Evolution in the Security Industry Begins to Unleash Its Potential

Note: This is the first installment of a two-part feature. In the second part of this article on AI and the evolution of the Security Industry, the author will focus on the symbiosis of humans and machines. Part two will be published in the upcoming May/June issue of STE.

Artificial Intelligence contains the nested subfields of Machine Learning (ML) and Deep Learning (DL), bringing the potential of security technology automation, or workforce augmentation, to broad industries. This article takes a different approach from generalizations about AI: taking on popular myths, offering a closer view into ML that justifies the need for low-power AI processing, and discussing popular use cases. In our journey to create the cognitive security assistant, questions are a good place to start.

Will practitioners understand where to apply AI advancements, where to rely on human decision-making and/or labor, and where to use a combination of the two? How will practitioners resolve conflicting outcomes from automated systems? Are they using the appropriate data sets to run the automation? Are there geographic privacy issues preventing companies from using certain datasets due to potential bias?

Going Where No Man Has Gone Before

If AI can be described as a science, Deep Learning and Neural Networks (NN) use mathematical concepts and rapid estimation to approach independent reasoning. In the movie Star Trek IV: The Voyage Home, the humpback whale is extinct in the future, and the time-traveling crew must transport the sea creatures in a tank of seawater built from the fictional transparent aluminum:

Captain Kirk: Mr. Spock, have you accounted for the variable mass of whales and water in your time re-entry program?

Mr. Spock: Mr. Scott cannot give me exact figures, Admiral, so… I will make a guess.

Kirk: A guess? You, Spock? That’s extraordinary.

Spock: [to Dr. McCoy] I don’t think he understands.

Dr. McCoy: No, Spock. He means that he feels safer about your guesses than most other people’s facts.

Spock: Then you’re saying… It is a compliment?

McCoy: It is.

Spock: Ah. Then, I will try to make the best guess I can.

McCoy: Please do.

Mr. Spock’s mission was to use many “known” parameters, including the density of seawater and the habitat’s mass, forming data sets for his “AI.” In ML, hidden relationships in data sets are found without being explicitly pre-programmed, without knowing where to look or what to conclude in advance, as legacy video analytics required. In other words, ML “guesses” are accurate estimates based on data sets that potentially hold relationships in multiple dimensions, producing outcomes that continually improve.

Kirk felt more confident in Spock’s “guess,” based on the millions of data sets Spock could process, than in an ordinary human’s “facts” derived from a familiar analytical model. Because Spock could process representations of many datasets, his “guess” was accurate enough for the crew and whales to survive the journey, even though they materialized right in front of the Golden Gate Bridge and nearly crashed. This is a simplified illustration of the difference between a Neural Network (NN) and a basic video analytic algorithm.

Brandon Reich, Founder and CEO of Secure Business Intelligence (securebi.com), summarizes three key concepts:

  • Artificial Intelligence: “A program that can sense, reason, act and adapt – ‘the science of making things smart’”
  • Machine Learning: “Algorithms whose performance improves as they are exposed to more data over time – ‘training’”
  • Deep Learning: “ML technique based on neural networks able to learn from vast amounts of data and recognize complex patterns”

A Brief Journey into Machine Learning

In “Basics of Linear Algebra for Machine Learning,” Jason Brownlee, Ph.D., the Melbourne-based author of more than 1,000 tutorials at MachineLearningMastery.com, describes a starting point. In ML, you fit a model to a dataset: a table-like set of numbers where each row represents an observation and each column represents a feature of that observation. An observation can be an object, visual behavior, speech, temperature, a gunshot energy wave, a thermal imaging representation of a firearm, or the number of SARS-CoV-2 pathogen particles in a known space.
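For readers new to the notation, here is a minimal sketch of that table structure in Python; the sensor features and labels below are hypothetical illustrations:

```python
import numpy as np

# A minimal sketch of an ML dataset: each row is one observation,
# each column is one feature of that observation.
# Hypothetical features for a sensor event: temperature (C),
# sound energy (dB), and object count in frame.
X = np.array([
    [22.5, 61.0, 3],   # observation 1
    [23.1, 85.5, 1],   # observation 2
    [21.9, 40.2, 0],   # observation 3
])
y = np.array([0, 1, 0])  # labels: 1 = alert-worthy event, 0 = normal

print(X.shape)  # (3, 3): 3 observations, 3 features each
```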

To read and understand ML, you must be able to read and understand linear algebra and the statistical representation of an entire data set. In one development case of a face recognition algorithm, a data set holds many face images, from which characteristic images are extracted using principal component analysis (PCA). These characteristic images are used to create a weight vector for any seen or unseen image. In this case, a weight vector represents a class of images as a list of numbers, and vector algebra comprises the operations performed on those numbers.

In ML, the weight vectors of different images are compared for their similarity and may be placed into another array or matrix that still maintains a relationship with the whole face image in question, or with just a few pixels.
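As a rough illustration of this PCA workflow, here is a minimal sketch using scikit-learn; the random “faces,” image size and component count are illustrative stand-ins for a real face dataset:

```python
import numpy as np
from sklearn.decomposition import PCA

# Sketch of the PCA ("eigenface") workflow described above.
# Each face image is flattened to a 1-D vector of pixels.
rng = np.random.default_rng(0)
faces = rng.random((100, 64 * 64))      # 100 hypothetical 64x64 face images

pca = PCA(n_components=20)              # extract 20 "characteristic images"
pca.fit(faces)

def weight_vector(image):
    """Project an image onto the characteristic images: a compact
    list of numbers representing the face in the learned subspace."""
    return pca.transform(image.reshape(1, -1))[0]

def similarity(a, b):
    """Cosine similarity between two weight vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

known = weight_vector(faces[0])
probe = weight_vector(faces[0] + 0.01 * rng.random(64 * 64))  # noisy re-capture
print(similarity(known, probe))  # close to 1.0 for the same face
```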

The formation and execution of a Deep Neural Network (DNN) involve many layers of data structures multiplied and added together using linear algebra. Scaled up to multiple dimensions, deep learning methods work with vectors, matrices and tensors of inputs and coefficients, where a tensor is a matrix of data with more than two dimensions. We begin to see the complexity of running DNNs and the need for low-power, highly efficient processors purpose-built for AI.
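A minimal sketch of that layered linear algebra, with illustrative shapes; a real DNN learns its weight matrices rather than drawing them at random:

```python
import numpy as np

# One dense layer of a neural network is just linear algebra:
# multiply inputs by a weight matrix, add a bias vector, apply
# a non-linearity. Stacking many such layers forms a DNN.
rng = np.random.default_rng(1)

x = rng.random(128)            # input vector (128 features)
W1 = rng.random((64, 128))     # weight matrix, layer 1
b1 = rng.random(64)            # bias vector, layer 1
W2 = rng.random((10, 64))      # weight matrix, layer 2
b2 = rng.random(10)

h = np.maximum(0, W1 @ x + b1)   # ReLU(W1 @ x + b1)
out = W2 @ h + b2                # second layer

# A batch of color video frames is a 4-D tensor:
# (frames, height, width, color channels) -- more than two dimensions.
video = rng.random((30, 64, 64, 3))
print(out.shape, video.shape)
```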

Convolutional Neural Network

Computer vision, a class of AI, relies on pattern recognition and deep learning to recognize what’s in a picture or video. When machines can process, analyze and understand images, they can capture images or videos in real-time and interpret their surroundings.

A Convolutional Neural Network (CNN) is a class of DNN most often used for visual imagery, speech recognition, and natural language processing. A CNN analyzes whether or not a given pixel is part of an object (such as a gun). If multiple guns of different types are in the same image, a Regional CNN (R-CNN) will virtually separate the objects believed to be guns, resize, zoom, and analyze them separately. CNNs are widely used in AI Systems on Chip (SoC) and AI Processors.
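To make the convolution step concrete, here is a minimal sketch using SciPy; the hand-written edge-detector kernel stands in for the learned kernels of a trained CNN:

```python
import numpy as np
from scipy.signal import convolve2d

# The core CNN operation: slide a small kernel over the image,
# producing a feature map that scores how strongly each pixel region
# matches the pattern the kernel encodes. A trained CNN learns its
# kernels; this Sobel-style vertical-edge kernel is illustrative.
image = np.zeros((8, 8))
image[:, 4:] = 1.0                       # a vertical edge

kernel = np.array([[ 1, 0, -1],
                   [ 2, 0, -2],
                   [ 1, 0, -1]])

feature_map = convolve2d(image, kernel, mode="valid")
print(feature_map)   # strong responses along the edge, zeros elsewhere
```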

CNNs are versatile enough for natural language processing (NLP), the ability to analyze, understand and generate human language, including speech. The next stage of NLP is natural language interaction, which allows humans to communicate with systems using normal, everyday language to perform tasks. Because our interactions can be via speech, keyword, keyboard, pointing device, game controller and, in the near future, gestures, the fusion of these modes is known as the User Experience (UX).

Training versus Inferencing AI

Training is a computationally intensive process, often taking days, in which the Machine Learning model in the security system learns to perform a task. The result is a trained ML model that can be leveraged across many distinct types of algorithms used in security.
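A minimal sketch of the two phases using scikit-learn on a synthetic dataset; in a deployed security system, training would run on historical data and inference on live sensor feeds:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Training: the expensive phase, done once (or periodically) on lots of data.
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)   # "training"

# Inference: the cheap phase, run continuously on new observations.
new_observation = X[:1]
print(model.predict(new_observation))   # "inference"
```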

Apart from the public safety use case, facial recognition allows extremely young children, autistic children, elderly patients suffering from neurodegenerative diseases, and ER patients to accurately convey symptoms via facial expressions.

The inference process for face recognition may include the following steps (sketched in code after this list):

  • Face detection, as there could be multiple people in the ER
  • Location of facial landmarks like eyes, eyebrows, nose, mouth, jawline (as many as 68 key points of interest)
  • Face alignment
  • Tiling of aligned face data with scaled facial landmarks
  • Detecting all visually discernible facial movements and micro-expressions (example: Facial Action Coding System) 
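Here is a minimal sketch of that pipeline as composed stages; every function name below is a hypothetical placeholder rather than a real library API:

```python
# Hypothetical sketch of the inference pipeline above. Every function
# is a named placeholder, not a real library call; a production system
# would back each stage with a trained model.

def detect_faces(frame):
    """Stage 1: find every face region (an ER frame may contain several)."""
    return []          # placeholder: no detector wired in

def locate_landmarks(face):
    """Stage 2: locate up to 68 key points (eyes, brows, nose, mouth, jaw)."""
    return []

def align_face(face, landmarks):
    """Stage 3: rotate and scale so landmarks sit in canonical positions."""
    return face

def tile_face(aligned_face, landmarks):
    """Stage 4: tile the aligned face data with scaled facial landmarks."""
    return []

def score_micro_expressions(tiles):
    """Stage 5: detect facial movements/micro-expressions (e.g., FACS)."""
    return {}

def infer_expressions(frame):
    results = []
    for face in detect_faces(frame):
        landmarks = locate_landmarks(face)
        aligned = align_face(face, landmarks)
        tiles = tile_face(aligned, landmarks)
        results.append(score_micro_expressions(tiles))
    return results
```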

The result is that a new trauma unit of a children’s hospital may be remotely located, yet still draw on this collective “training through inference” to deliver more accurate care.

Brandon Reich states, “A myth exists that AI is inherently unfair.” Consider, for example, a NN trained to mimic the behavior of a human decision-maker who associates a flat affect with poor attention. Some mental health conditions can affect facial expression, yet the model did not consider the societal context in which a person may lack expression or have a hearing disability. Having a diverse group of individuals test the training and exercise inference for the desired output avoids such potential bias.

He continues, “The biggest myth about AI is that it is inherently superior to all other technologies. AI is dependent on the quality of its inputs – the developers, architecture, training data and the deployed environment. While AI certainly has the potential to improve many technologies, it’s not automatic.”

Sometimes models used for inference become invalid due to parameters not considered in the original data set. User behavior predicted from ML models created in 2019 differed greatly from actual behavior during the pandemic in 2020, requiring retraining. Data sets need to be periodically validated, much as a chef’s dessert recipe is no longer valid when an ingredient changes.
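A minimal sketch of such periodic validation; the accuracy metric and the 0.85 retraining threshold are illustrative assumptions:

```python
from sklearn.metrics import accuracy_score

# Periodically re-validate a deployed model against freshly labeled data.
# If accuracy drifts below a chosen threshold (0.85 is illustrative),
# flag the model for retraining -- the 2019-model vs. 2020-behavior case.
RETRAIN_THRESHOLD = 0.85

def needs_retraining(model, recent_X, recent_y):
    score = accuracy_score(recent_y, model.predict(recent_X))
    return score < RETRAIN_THRESHOLD
```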

The AI SoC – Multiple IoT Device Functions on a Single Assembly

There is a positive trend in the low-power AI-edge processor market. Power consumption is a key factor for AI-edge applications where the entire system is powered by a battery. An ultra-low-power microcontroller with a dedicated Convolutional Neural Network (CNN) accelerator and camera support can be equipped with active, sleep and low-power modes, allowing it to perform complex face identification periodically, as is typical of entry screening at an outdoor event.
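A minimal sketch of that duty-cycled pattern; the device calls are hypothetical placeholders, not a real vendor SDK:

```python
import time

# Hypothetical duty-cycle loop for a battery-powered AI-edge device:
# rest in a low-power mode, wake periodically, run one CNN inference,
# and go back to sleep. All device calls are placeholders.

WAKE_INTERVAL_S = 30          # illustrative screening cadence

def capture_frame():          # placeholder for the camera interface
    return None

def run_face_id(frame):       # placeholder for the CNN accelerator call
    return "no_match"

while True:
    frame = capture_frame()            # active mode: sensor + CNN powered
    result = run_face_id(frame)
    if result != "no_match":
        print("identified:", result)   # e.g., open gate / alert operator
    time.sleep(WAKE_INTERVAL_S)        # stand-in for hardware sleep mode
```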

The modern “core” of the IoT device used for Security, Safety and Sustainability is the System-on-Chip (SoC), which may incorporate some or all of the following: CPU, memory, Graphics Processing Unit (GPU), I/O control for HDMI port(s), Ethernet, power via PoE and Power Sourcing Equipment (PSE), USB port(s), Wi-Fi and Bluetooth connectivity, and sound/visual sensor fusion.

Future IoT devices must balance power consumption with processing optimized for CNNs. When the processor strains to process complex video streams, such as vehicles on a multi-lane highway, excessive power draw and lost data packets can produce a “choppy” effect as entire video frames go unprocessed.

Ambarella, a leading AI vision silicon company, completed its acquisition of Oculii, whose adaptive AI software algorithms are designed to enable radar perception using current production radar chips to achieve significantly higher (up to 100X) resolution, longer range and greater accuracy. Radar used in the Security Industry and in advanced driver-assistance systems (ADAS) uses high-frequency radio waves to determine the range, direction and velocity of objects.

These AI vision processors are already used in a wide variety of human and computer vision applications, including video security devices, ADAS, electronic mirrors and robotics. For example, the high-end Ring Video Doorbell Pro 2 delivers enhanced 1536p HD video with an expanded Head-to-Toe view, Bird’s Eye View with intruder motion history, and dual-band Wi-Fi, and operates on the low-power, high-performance Ambarella CV25M SoC.

AI Processors

What if you wish to run “Edge AI” algorithms on IP Cameras that you do not wish to upgrade? Streaming UHD for lengthy periods also presents challenges. An IP camera has to initiate the video stream to the decoding application, often a Video Management System (VMS). Should the VMS be on an underpowered server, itself connected to a poorly performing network, or even under a DDoS cyber-attack, the IP camera may not be powerful enough to maintain the stream while managing all these other tasks. Add any AI algorithm running at the camera edge, and the cost per channel grows with the cost of the more powerful camera required. The individual cameras have to work harder, which generates more heat and consumes more power.

Quickly gaining popularity is the AI Processor Unit, which contains an AI Accelerator capable of running NNs for tasks such as pose (skeletal) detection, weapons detection, vehicle identification, face matching, instance detection, occupancy (with privacy), and multiple object behaviors preceding an event.

The Foxconn AI Processor with Hailo-8™ M.2 AI Acceleration Module delivers a sustained 26 tera-operations per second (TOPS) and is capable of processing 15 UHD streams from IP Cameras at extremely low power. The AI Processor can be placed between the IP Cameras and a VMS, where an additional video stream is processed, delivering far more actionable real-time visual data in a quickly deployed upgrade.

Consolidating AI stream processing of a suite of visual sensors at a 15:1 ratio drastically reduces the cost per channel to purchase and operate. A small city with 500 IP cameras and video analytics applications at an Emergency Operations Center (EOC) can present an upgrade challenge. Locating approximately 40 AI Processors close to clusters of existing cameras delivers the benefits of multiple AI algorithms without depending on increased network traffic; the sizing arithmetic is sketched below. In addition, multiple output streams from the AI Processor can serve Mobile Command Centers quickly and with the same UX as the EOC.
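The arithmetic behind that estimate, assuming the 15:1 consolidation ratio above and an illustrative 15% spare capacity:

```python
import math

cameras = 500
streams_per_processor = 15                            # 15:1 consolidation ratio
minimum = math.ceil(cameras / streams_per_processor)  # 34 processors, bare minimum
with_headroom = math.ceil(minimum * 1.15)             # ~40 with ~15% spare capacity
print(minimum, with_headroom)
```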

Sensor Fusion

The Autonomous Vehicle market and the push for greater safety make 3D imaging, LiDAR, radar and thermal sensors affordable alternatives to visible-light imaging for security surveillance. Any combination of this wide range of sensors provides a visual “fusion” of data while preserving privacy or displaying the spectrum best suited to recognizing a potential threat.

Even the latest iPhone uses time-of-flight (ToF) sensors to enhance mobile device security. Facility security and public safety teams have already begun the transition to “alternative” visual sensors and processing using 3D imaging, radar, LiDAR and more. At the Ambarella exhibition, those sensors, paired with AI processing on the company’s existing CV2 or new CV3 vision processors, rendered detailed three-dimensional images of many people, their faces, vehicles (make and model), their occupants and vehicle plates in real-time, at significant cost savings. The similar size and form factor of the 3D imaging, radar and LiDAR sensors on the CV3 development platform illustrates Ambarella’s continuous improvement in power management.

If privacy is required, the same “camera” with these sensors can provide another stream of detailed wireframe or point-cloud renderings without the visible-light imagery. In other words, it captures the greatest detail about a person and what they are carrying, without facial imagery, preserving privacy.

LiDAR uses pulsed lasers to build a point cloud, which is then used to construct a large 3D map or image. A ToF sensor in a 3D camera can reliably reconstruct individual objects in 3D, in real-time, in detail and at fast frame rates. The resulting “depth maps” can be colorized, or even merged with the RGB visible-light camera.
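A minimal sketch of the back-projection from a depth map to a point cloud, using the standard pinhole-camera model; the camera intrinsics and the flat two-meter depth map are illustrative:

```python
import numpy as np

# Back-project a ToF depth map into a 3-D point cloud with the standard
# pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
# The intrinsics (fx, fy, cx, cy) below are illustrative values.
H, W = 480, 640
fx = fy = 525.0
cx, cy = W / 2, H / 2

depth = np.full((H, W), 2.0)          # fake depth map: 2 m everywhere

v, u = np.indices((H, W))             # pixel coordinates
Z = depth
X = (u - cx) * Z / fx
Y = (v - cy) * Z / fy
points = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)   # the point cloud
print(points.shape)                   # (307200, 3)
```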

Thermal imaging sensors, together with AI algorithms trained on weapons and IED data, can improve public safety, as recent years have demonstrated an increasing trend toward suicide-initiated IED attacks.

Use Case: AI-based Entry Screening: With COVID-19, entry screening now has three or more parts: one step to make sure you should be visiting the facility, another to verify your identity, and yet another for a health check.

For the converged and optimized security team that recognizes the value of AI, this process has been significantly streamlined. A visitor arriving by vehicle can be screened by an AI-based automated number plate identification device incorporating visible and IR light detection and multi-core imaging processors, capturing vital data such as vehicle make/model, number of passengers, vehicle speed/behavior, vehicle tags, and contraband/explosives detection. On entry into the facility, security staff no longer need to perform repetitive entry screening tasks manually. At an entry portal, the visitor’s biometric factors may be verified, a facial recognition capture performed for emergency location, concealed-weapons detection run, and a health check performed using the subject’s basal temperature.

Athena Security’s walk-through metal detector solution leverages AI processing of a suite of sensors, including magnetometer, induction, LiDAR, thermal and visual camera, to scan one person at a time walking at normal speed through unobtrusive pillars. At a maximum flow rate of 3,600 people per hour, it is 10X faster than legacy metal detection, and faster still when legacy secondary screening methods are considered.

Use Case: Delivery via Vehicle; Vehicle Theft: Vehicles are often targets of fast “smash and grab” activity or sophisticated gray-market parts distribution. Successful delivery of goods at a residence or business might involve some recognized behaviors; however, visual sensor fusion used with an Edge AI Computer Vision Perception SoC or AI Processor can process unclassified behaviors in real-time.

Use Case: Pre-Entry Behavior: When multiple people are in a video doorbell’s field of view, it can be useful to see what they were doing before one of them pressed the doorbell, along with their distances relative to the user at home or at a commercial entry. This is also known as AI Distance Inference.

Use Case: Complex Building Lobby: A building lobby of a multi-tenant facility might require different kinds of screening, performed continuously and accurately. In one demonstration, each of four 4K visible-light RGB cameras was paired with a 3D imager, producing eight streams, four of which used CNNs to process skeletal poses, object detection, face detection and something the person has with them (a body part). Skeletal poses can quickly alert on slip-and-falls or crowding; object detection of package theft, visitors wearing masks, and weapons detection all yield significant, time-sensitive data.

Use Case: Face Matching Access Control: Legacy face matching algorithms can often be spoofed by 2D images of the person on file. When a 3D ToF camera is used together with an RGB camera and an SoC capable of fusing both streams and images via CNNs, false positives are statistically eliminated and trusted personnel entry is achieved.
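A minimal sketch of that fusion logic; the thresholds and function are hypothetical illustrations, not a vendor algorithm:

```python
# Hypothetical sketch of fusing an RGB face-match score with a ToF
# liveness check to defeat 2-D photo spoofing. A flat photo has almost
# no depth variation across the face region; a real face does.

DEPTH_VARIATION_MIN_M = 0.01   # illustrative liveness threshold (1 cm)
MATCH_SCORE_MIN = 0.9          # illustrative match threshold

def grant_entry(match_score, face_depth_values):
    """match_score: similarity from the RGB face matcher (0..1).
    face_depth_values: ToF depth samples across the detected face (m)."""
    is_live = (max(face_depth_values) - min(face_depth_values)
               > DEPTH_VARIATION_MIN_M)
    return is_live and match_score > MATCH_SCORE_MIN

print(grant_entry(0.95, [2.00, 2.00, 2.00]))   # flat photo -> False
print(grant_entry(0.95, [1.98, 2.03, 2.07]))   # real face  -> True
```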

Use Case: Anonymous Occupancy Sensing: Detailed visual imaging is not always necessary to maintain safe occupancy in a building or space. 3D imaging, radar or LiDAR streams can be processed by Neural Networks, and accurate occupancy for given spaces can be resolved in real-time, or even projected over time, while maintaining privacy.
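A minimal sketch of anonymous occupancy counting by clustering a synthetic point cloud with scikit-learn’s DBSCAN; the scene and cluster parameters are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical sketch of anonymous occupancy: cluster a 3-D point cloud
# (from LiDAR/radar/ToF, no visible-light imagery) and count person-sized
# clusters. eps/min_samples are illustrative tuning values.
rng = np.random.default_rng(2)
# Fake scene: three "people", each a blob of ~200 points.
centers = np.array([[1.0, 1.0, 0.9], [3.0, 2.0, 0.9], [5.0, 4.0, 0.9]])
cloud = np.concatenate(
    [c + 0.15 * rng.standard_normal((200, 3)) for c in centers])

labels = DBSCAN(eps=0.3, min_samples=20).fit_predict(cloud)
occupancy = len(set(labels) - {-1})   # clusters, excluding noise points
print(occupancy)                      # 3 -- no identities, just a count
```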

Privacy considerations and AI legislation will be among the subjects covered in Part 2 of this article. However, the most significant use of AI, now and in the future, according to market analysts, is the birth of a symbiotic relationship between human and machine, advancing the augmented worker and team as an augmented workforce.

About the author: Steve Surfaro is Chairman of the Public Safety Working Group for the Security Industry Association (SIA) and has more than 30 years of security industry experience. He is a subject matter expert in smart cities and buildings, cybersecurity, forensic video, data science, command center design and first responder technologies. Follow him on Twitter, @stevesurf.