Technical Approach - Visual and Acoustic Surveillance and Monitoring
- Site-model based image stabilization.
Prior research in collaboration with ARL on image stabilization
for automatic target acquisition (ATA) revealed
that high-accuracy image alignment is
very important for subsequent detection and tracking tasks. For some applications,
even subpixel misalignment between image frames may
cause ATA algorithms to fail.
In wide-area surveillance,
a person or small vehicle in a surveillance image may be only several pixels across and
the camera platform may vibrate due to wind and/or strong impacts on
the ground. We expect a high-accuracy camera stabilization capability
to be even more critical in this situation. Additionally, in surveillance systems in which the
visual sensors are truly mobile, stabilization is again a critical step in the
detection of people and vehicles.
We have developed several image stabilization algorithms over the past
several years. As part of the DARPA UGV/RSTA project, we developed
a 2D feature-based multi-resolution camera motion estimation algorithm.
These stabilization algorithms will be extended through the use of site models for
surveillance applications. Using a 3D site model for image stabilization has the
following advantages:
- Camera location and
orientation can be accurately determined by camera resection based on the
image domain locations of several control points whose 3D coordinates
with respect to the site are known.
- Task-dependent information (e.g., monitoring a building entrance for
activity) can be used to choose the points and/or surfaces stabilized
by the algorithm.
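The camera resection step described above can be illustrated with the classical Direct Linear Transform (DLT). This is only a sketch under simplifying assumptions (an ideal pinhole camera, noise-free correspondences, synthetic control points), not the algorithm the system will necessarily use:

```python
import numpy as np

def resect_camera(pts3d, pts2d):
    """Estimate a 3x4 projection matrix P from six or more 3D-to-2D
    control-point correspondences using the Direct Linear Transform:
    stack two linear constraints per point and take the null-space
    vector of the resulting system via SVD."""
    A = []
    for (X, Y, Z), (u, v) in zip(pts3d, pts2d):
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z, -u])
        A.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z, -v])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    return Vt[-1].reshape(3, 4)  # smallest singular vector = flattened P

def project(P, X):
    """Project a 3D point with P and normalize homogeneous coordinates."""
    x = P @ np.append(np.asarray(X, dtype=float), 1.0)
    return x[:2] / x[2]
```

With the projection matrix in hand, camera location and orientation follow from its standard decomposition into intrinsic and extrinsic parts.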
In the surveillance system, the acquired video sequence will be registered to
the site model. Task-specific control points whose 3D coordinates are known from the
site model will be tracked from frame to frame and used in the registration
of the video sequence to the site model. We expect the registration
procedure to achieve high accuracy. We will develop a real-time,
high-accuracy feature tracking algorithm for registering the video
sequence to the site model.
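The per-frame registration step can be sketched as a least-squares fit of a motion model to the tracked control points. The 2D affine model below is a simplifying assumption chosen for illustration; the actual registration will exploit the full 3D site model:

```python
import numpy as np

def estimate_affine(src, dst):
    """Least-squares 2x3 affine warp mapping tracked control points in
    the current frame (src) onto their reference locations (dst)."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    A = np.hstack([src, np.ones((len(src), 1))])  # rows: [x, y, 1]
    M, *_ = np.linalg.lstsq(A, dst, rcond=None)   # solve A @ M ~= dst
    return M.T                                    # 2x3 affine matrix

def warp(M, pts):
    """Apply the affine warp to an array of 2D points."""
    pts = np.asarray(pts, dtype=float)
    return (M[:, :2] @ pts.T).T + M[:, 2]
```

Applying the inverse of the estimated warp to each frame stabilizes the sequence with respect to the reference image.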
- Wide area detection of people and vehicles using
visual and acoustic sensors.
Detection, recognition, and motion analysis of people and vehicles form the core
tasks of wide-area surveillance and monitoring.
For the detection task, we plan to use
- changes between video frames
to detect moving objects,
- changes between the acquired video and
reference images stored in the site model to detect slowly moving or
recently arrived objects, and
- unexpected acoustic signals to detect potential events of interest.
Recognition will involve fusing cues from the site model with information from visual and acoustic
features of the target(s), while motion analysis of detected people and vehicles will
use 3D cues from the site model combined with feature tracking results
to estimate location, heading, and speed.
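The change-based detection step above can be sketched as simple frame differencing followed by a centroid computation; the threshold and the toy 5x5 frames are illustrative assumptions:

```python
def frame_difference(prev, curr, thresh=20):
    """Flag pixels whose intensity changed by more than `thresh`
    (an illustrative value) between consecutive frames."""
    return [[abs(c - p) > thresh for p, c in zip(prow, crow)]
            for prow, crow in zip(prev, curr)]

def centroid(mask):
    """Centroid (row, col) of the flagged pixels, or None if none."""
    pts = [(r, c) for r, row in enumerate(mask)
           for c, on in enumerate(row) if on]
    if not pts:
        return None
    return (sum(r for r, _ in pts) / len(pts),
            sum(c for _, c in pts) / len(pts))
```

Successive centroids give image-plane displacement per frame; combining that displacement with 3D cues from the site model converts it into ground-plane location, heading, and speed.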
- Narrow-area recognition of activities.
Our research on narrow-area recognition of human activities
addresses the problems of:
- Detection, segmentation and tracking of people (and their
parts) in color video and
- Developing structural models for single-person and multi-person
activities, with emphasis on entering, exiting, and carrying
activities.
Our proposed research on detection and tracking of people is based
on segmenting that part of each image predicted to contain the moving
person, and finding as many natural body parts as possible using a combination
of motion-based tracking, and shape and color analysis.
We plan to employ a novel hierarchical, region-based background subtraction
method to focus on the part of the image predicted to contain the
person. Current versions of
these programs operate at 10-15 frames per second on a PC system.
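The per-pixel core of background subtraction can be sketched as follows. The proposed method is hierarchical and region-based, so this shows only the underlying idea, with an assumed update rate `alpha` and threshold:

```python
def update_background(bg, frame, alpha=0.05):
    """Exponentially weighted running-average background model:
    each background pixel drifts toward the current frame."""
    return [[(1 - alpha) * b + alpha * f for b, f in zip(brow, frow)]
            for brow, frow in zip(bg, frame)]

def foreground_mask(bg, frame, thresh=25):
    """Pixels far from the background model are candidate person pixels."""
    return [[abs(f - b) > thresh for b, f in zip(brow, frow)]
            for brow, frow in zip(bg, frame)]
```

The hierarchical, region-based version would group these per-pixel decisions into regions before handing them to the part-finding search.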
Our proposed research involves integration of this segmentation process into a tracking
framework that employs a generic model of the human body and its parts to find (through
limited searches over the combinatorial space defined by the hierarchical segmentations)
body parts and to track them both through the image sequence and, when possible, in 3-D.
Tracking will be based on identifying short-term stable features on the surfaces of the
visible body parts, dynamically updating this set as the sequence progresses. Tracking will
employ models of the dynamics of the motion being observed (walking, bending,
reaching, etc.) under control of the high-level system described below.
- Recognizing human activity.
We propose to develop a mixed statistical and structural approach to the
recognition of human actions.
The models will be grounded in statistical primitive action
models, in which the movements of individual body parts in a body-centered
frame of reference are associated with ``primitive'' body part actions.
These primitive action models are augmented with a theory of
attachment that is used to determine how and when a person is moving with
an object (as opposed to a chance visual coincidence between the
instantaneous views of a human and an object in the scene). We will
specifically be concerned with three types of attachment corresponding,
roughly, to objects with handles (briefcases, suitcases) that are carried
in one hand, bags and boxes that are carried with both arms and hands, and
stick-like objects that are carried in one or two hands. The theory will
also model how such objects are ``picked up'' and ``put down.''
Structural models describe composite activities involving the
coordinated action of many body parts and sequential constraints on
primitive or constituent composite body actions.
Primitive body actions will be recognized using a robust estimation
algorithm for indexing into databases of such linear models while
simultaneously extrapolating motion descriptions to previously unencountered
motions.
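The indexing step can be illustrated with ordinary least squares standing in for the robust estimator; the action names and basis trajectories below are invented for the example:

```python
import numpy as np

def classify_action(traj, models):
    """Fit the observed body-part trajectory against each linear action
    model (the columns of `basis` span the model's trajectory space) and
    return the name of the model with the smallest residual."""
    best_name, best_err = None, float("inf")
    for name, basis in models.items():
        coeffs, *_ = np.linalg.lstsq(basis, traj, rcond=None)
        err = float(np.linalg.norm(basis @ coeffs - traj))
        if err < best_err:
            best_name, best_err = name, err
    return best_name
```

A robust estimator would replace the plain least-squares fit so that occluded or mistracked body parts do not dominate the residual.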
The theory of attachment, used to recognize carrying activities,
will be based both on constraints on body part
motion that result from different types of carrying activity, and on the
recognition of ``objects'' in a time-varying image that move with a person
and whose position and motion are consistent with the hypothesized type of
carrying. Our computational approach will
be strongly motion- and body-pose-based, and will not rely on any general
vision capabilities for finding and tracking objects that humans
might be carrying. Instead, we will hypothesize, from body pose and motion,
the type of carrying action, and then search in the image sequence for collections
of features (regions, markings, etc.) that move with the body in a manner
consistent with the hypothesized carrying action.
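The consistency test in the last sentence can be sketched as follows. The constant-offset criterion and tolerance are simplifying assumptions for a 2D illustration; the actual test would reason about 3D motion and articulated body parts:

```python
def moves_with_body(body_traj, feat_traj, tol=1.0):
    """Attachment test sketch: a candidate feature is declared
    'attached' when its offset from the tracked body stays nearly
    constant over the image sequence."""
    offsets = [(fx - bx, fy - by)
               for (bx, by), (fx, fy) in zip(body_traj, feat_traj)]
    xs = [dx for dx, _ in offsets]
    ys = [dy for _, dy in offsets]
    return max(xs) - min(xs) <= tol and max(ys) - min(ys) <= tol
```

A feature that merely crosses the person's path fails the test, which is how chance visual coincidences are rejected.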
- High-level system for control and operator interaction.
There is a wide variety of control knowledge that should be represented in a general way
in order to control the activities of the surveillance system, and the interactions
between the surveillance system and human operators. This control knowledge
will, generally, make reference to both spatial and temporal attributes of the
surveillance site being monitored. We propose to represent this control knowledge
using temporal logic programs, taking advantage of the general database capabilities of
logic programming to
- support the insertion of ancillary data that can
be integrated into situation assessments by the surveillance system,
- specify the conditions
under which control passes from wide-area surveillance to narrow-area surveillance to
requests for human assistance, through queries posed to the temporal logic programming
system.
This high-level system will also draw upon the research being conducted under
a DARPA AASERT grant.
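In lieu of actual temporal logic program clauses, a toy Python rule can illustrate the intended escalation behavior; the event names and the dwell window are invented for the example:

```python
def control_mode(events, now, window=5.0):
    """Escalation sketch: events recent enough to fall inside `window`
    seconds drive the transition from wide-area surveillance to
    narrow-area surveillance to a request for operator assistance."""
    recent = {e["kind"] for e in events if now - e["time"] <= window}
    if "loiter_at_entrance" in recent:
        return "request_operator"
    if "motion_detected" in recent:
        return "narrow_area"
    return "wide_area"
```

In the temporal logic programming system, each branch would instead be a clause whose body constrains the timestamps of the relevant events, so ancillary data can be joined in through the same query mechanism.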