M. Hofmann Prof. Dr. D. M. Gavrila

Looking at People M. Hofmann, Prof. Dr. D. M. Gavrila Intelligent Systems Laboratory, Informatics Institute, Faculty of Science, University of Amsterdam Web: www.gavrila.net

Outline • Introduction • Research context, applications • Human Body Detection and Tracking • 2D vs. 3D approaches • Object level vs. body part-specific (pose estimation) • The CASSANDRA smart surveillance system

Possible applications • pedestrian protection • surveillance • motion capture for animation and games • smart homes, elderly care • robotic pets • motion analysis (sports, medical)

Looking-at-People: visual information channels • Faces • Whole bodies • Hands • Crowds

Why is finding people in images difficult? • Variation in size and clothing • Variation in pose and viewpoint • Cluttered, dynamic backgrounds • Fast changing scenes • Depth ambiguities • Reduced observation (occlusion, loose clothing) • Real-time analysis often required

2D, object level, detection • Bypass a pose recovery step and describe human overall appearance in terms of simple, low-level, 2-D image features. • Follow pattern classification approach • training stage: use particular classifier/features to derive an implicit representation of the target class. • test stage: use sliding window approach to determine presence of target class. At each location, extract features and apply classifier. Typically test image at multiple scales. • Due to efficiency considerations: determine ROIs by background subtraction, skin color or motion detection. Follow up optionally with morphological operations to reduce noise. Apply any scene constraints (e.g. ground plane constraint).
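
The test stage described above (slide a window over the image at several scales, extract features at each location, apply the classifier) can be sketched as follows. This is a minimal illustration, not the method of any cited paper: the mean-intensity feature, the threshold classifier and the toy image are assumptions chosen only to keep the example self-contained.

```python
from typing import Callable, Iterator, List, Tuple

Image = List[List[float]]          # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]    # (x, y, width, height)


def sliding_windows(img_w: int, img_h: int, base_w: int, base_h: int,
                    scales: List[float], stride: int) -> Iterator[Box]:
    """Yield candidate boxes for every scale and every raster-scan position."""
    for s in scales:
        w, h = int(round(base_w * s)), int(round(base_h * s))
        if w > img_w or h > img_h:
            continue  # window no longer fits in the image
        for y in range(0, img_h - h + 1, stride):
            for x in range(0, img_w - w + 1, stride):
                yield (x, y, w, h)


def mean_intensity(img: Image, box: Box) -> float:
    """Toy feature (assumption): mean pixel value inside the box."""
    x, y, w, h = box
    total = sum(sum(row[x:x + w]) for row in img[y:y + h])
    return total / (w * h)


def detect(img: Image, classify: Callable[[float], bool], base_w: int,
           base_h: int, scales: List[float], stride: int) -> List[Box]:
    """Extract the feature at each window and keep windows the classifier accepts."""
    img_h, img_w = len(img), len(img[0])
    return [box
            for box in sliding_windows(img_w, img_h, base_w, base_h, scales, stride)
            if classify(mean_intensity(img, box))]


# Toy image: a bright blob, 2 pixels wide and 4 pixels high, on a dark 8x8 background.
image = [[0.0] * 8 for _ in range(8)]
for yy in range(2, 6):
    for xx in range(3, 5):
        image[yy][xx] = 1.0

detections = detect(image, lambda f: f > 0.9, base_w=2, base_h=4,
                    scales=[1.0, 2.0], stride=1)
```

In a real detector the toy feature would be replaced by the learned feature set and classifier, and the candidate boxes could be restricted to ROIs from background subtraction or scene constraints, as the slide notes.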

Learning from examples: person detection • Training: person / non-person examples → feature extraction → pattern classifier (learned in feature space) • Testing: feature extraction → pattern classifier → person / non-person

Learning from examples: open questions • What are good features? • How to deal with the “curse of dimensionality”? • What pattern classifier to use?

Detecting pedestrians w. wavelets Papageorgiou and Poggio, IJCV, 2000 • ROI: none (sliding window technique on static images, size 128 x 64 pixels) • low level features: Haar wavelets (differential operators applied at various locations, orientations, scales) • 29 - 1326 Haar wavelets used • pattern classification by Support Vector Machine • 100 ms - 2 min per image • Pipeline (training and testing): 2-D wavelets → overcomplete representation → SVM classifier → person / non-person
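
To make "differential operators applied at various locations, orientations, scales" concrete, here is a small sketch of 2-D Haar wavelet responses for one square window. The three orientations (vertical, horizontal, diagonal) and the function names are assumptions made for illustration; the sketch does not reproduce the exact feature set or normalization of the cited paper.

```python
from typing import Dict, List

Image = List[List[float]]


def region_sum(img: Image, x: int, y: int, w: int, h: int) -> float:
    """Sum of pixel values in the rectangle with top-left (x, y) and size w x h."""
    return sum(sum(row[x:x + w]) for row in img[y:y + h])


def haar_responses(img: Image, x: int, y: int, size: int) -> Dict[str, float]:
    """Haar wavelet responses of a size x size window at (x, y); size must be even.

    Each response is a difference of quadrant sums, i.e. a coarse
    derivative of the image at the scale given by `size`.
    """
    half = size // 2
    tl = region_sum(img, x, y, half, half)                 # top-left quadrant
    tr = region_sum(img, x + half, y, half, half)          # top-right quadrant
    bl = region_sum(img, x, y + half, half, half)          # bottom-left quadrant
    br = region_sum(img, x + half, y + half, half, half)   # bottom-right quadrant
    return {
        "vertical": (tr + br) - (tl + bl),    # left/right contrast (vertical edge)
        "horizontal": (bl + br) - (tl + tr),  # top/bottom contrast (horizontal edge)
        "diagonal": (tl + br) - (tr + bl),    # diagonal contrast
    }


# Toy window: dark left half, bright right half, i.e. one vertical edge.
patch = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
responses = haar_responses(patch, 0, 0, 4)
```

Evaluating such responses at many positions and sizes inside the detection window gives the overcomplete representation that is fed to the classifier.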

2D, object level, detection: AdaBoost Viola, Jones and Snow, ICCV, 2003 • A weak learner is a classifier with accuracy only slightly better than chance. It is based on a single feature. • AdaBoost: continue adding weak learners until some desired low training error is reached by ensemble. • Key idea is to reweight examples based on whether they are classified correctly or not (decrease and increase weight, respectively). • Final classification decision (strong learner) involves setting a threshold on the weighted sum of the outputs of the weak learners. • 24x24 windows applied at multiple scales, in raster scan • simple Haar-like features • 45,396 possible features in each window
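
The reweighting loop can be sketched with single-feature threshold classifiers (decision stumps) as weak learners. This is a generic discrete AdaBoost written for illustration, not the training code of the cited paper; the stump form, the 1-D toy data and all function names are assumptions.

```python
import math
from typing import List, Tuple

Stump = Tuple[int, float, int]  # (feature index, threshold, polarity)


def stump_predict(stump: Stump, x: List[float]) -> int:
    """Weak learner on a single feature: polarity if x[j] < t, else -polarity."""
    j, t, p = stump
    return p if x[j] < t else -p


def best_stump(X, y, w) -> Tuple[Stump, float]:
    """Exhaustively pick the stump with the lowest weighted training error."""
    best, best_err = None, float("inf")
    for j in range(len(X[0])):
        vals = sorted({x[j] for x in X})
        thresholds = ([vals[0] - 0.5]
                      + [(a + b) / 2 for a, b in zip(vals, vals[1:])]
                      + [vals[-1] + 0.5])
        for t in thresholds:
            for p in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict((j, t, p), xi) != yi)
                if err < best_err:
                    best, best_err = (j, t, p), err
    return best, best_err


def strong_predict(ensemble, x, threshold: float = 0.0) -> int:
    """Strong learner: threshold on the weighted sum of weak-learner outputs."""
    score = sum(alpha * stump_predict(s, x) for alpha, s in ensemble)
    return 1 if score > threshold else -1


def adaboost(X, y, max_rounds: int = 10, target_error: float = 0.0):
    """Add weak learners until the ensemble reaches the target training error."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(max_rounds):
        stump, err = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)   # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Correctly classified examples get smaller weights, misclassified ones larger.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
        train_err = sum(strong_predict(ensemble, xi) != yi
                        for xi, yi in zip(X, y)) / n
        if train_err <= target_error:
            break
    return ensemble


# Toy 1-D data that no single stump can separate.
X = [[float(v)] for v in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(X, y)
predictions = [strong_predict(ensemble, xi) for xi in X]
```

On this toy set no single stump does better than 30% error, yet a small weighted ensemble classifies every training example correctly, which is the point of boosting weak learners.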

2D, object level, detection: cascaded AdaBoost Viola, Jones and Snow, ICCV, 2003 • For efficiency, use a cascade: hypotheses pass through classifier stage 1, stage 2, ..., stage N; each stage either rejects a hypothesis or passes it on, and hypotheses that survive all stages are accepted (detections) • very fast: 4 Hz on 360x240 images • detection rate: 80% with 1 false positive in 2 images • OpenCV implementation available for full-body, face, etc.
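
The efficiency gain of a cascade comes from early rejection: most windows are discarded by the first, cheapest stages and never reach the later ones. A minimal sketch under stated assumptions: each stage is a hypothetical (score function, threshold) pair, and the feature names are invented for the example rather than taken from the paper or from OpenCV.

```python
from typing import Callable, Dict, List, Tuple

Features = Dict[str, float]
Stage = Tuple[Callable[[Features], float], float]  # (score function, threshold)


def cascade_classify(features: Features, stages: List[Stage]) -> Tuple[bool, int]:
    """Run the stages in order; reject as soon as one stage scores below its threshold.

    Returns (accepted, number_of_stages_evaluated).
    """
    for i, (score_fn, threshold) in enumerate(stages, start=1):
        if score_fn(features) < threshold:
            return False, i          # rejected hypothesis, later stages are skipped
    return True, len(stages)         # accepted hypothesis (detection)


# Hypothetical three-stage cascade, cheapest test first.
stages: List[Stage] = [
    (lambda f: f["contrast"], 0.2),
    (lambda f: f["symmetry"], 0.5),
    (lambda f: f["edge_score"], 0.7),
]

background = {"contrast": 0.1, "symmetry": 0.9, "edge_score": 0.9}
person = {"contrast": 0.5, "symmetry": 0.8, "edge_score": 0.9}
lookalike = {"contrast": 0.5, "symmetry": 0.6, "edge_score": 0.3}

results = [cascade_classify(w, stages) for w in (background, person, lookalike)]
```

In a boosted cascade each stage would itself be a strong classifier tuned for a very high detection rate, so that true positives are rarely lost while the false positive rate shrinks stage by stage.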

2D pose recovery: detection • Global detectors are useful for locating people at larger distances. • Do not provide pose information • Do not scale up to close range because of possible articulations (speed, storage) • Are not very robust to partial occlusion => Detect body parts individually, assemble to whole-body configurations using consistency constraints. Decompose structure and appearance variation.

2D pose recovery, detection: pictorial structures Felzenszwalb and Huttenlocher, IJCV 2005 • Pictorial structure • Collection of parts arranged in a deformable configuration • Appearance of each part is modeled separately, deformable structure represented by spring-like connections between parts • Statistical approach • Let • I denote a given image • θ = (u, E, c) denote model parameters: u are appearance parameters, the set of edges E denotes which parts are connected, and c are connection parameters • L denote the configuration of the object (location of each part) • Posterior by Bayes' rule: p(L | I, θ) ∝ p(I | L, θ) p(L | θ) • Two approaches: MAP estimation, sampling from the posterior distribution
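
MAP estimation can be read as minimizing an energy (a negative log posterior up to a constant): a per-part appearance cost plus a spring-like deformation cost for each connected pair of parts. The sketch below finds the minimum by brute force over a handful of candidate locations. It is an illustration only: the cited work uses an efficient algorithm for tree-structured models rather than exhaustive search, and the part names, costs, offsets and stiffness values here are invented.

```python
from itertools import product
from typing import Dict, Tuple

Loc = Tuple[int, int]
MatchCost = Dict[str, Dict[Loc, float]]                    # part -> location -> appearance cost
Edges = Dict[Tuple[str, str], Tuple[Tuple[int, int], float]]  # (a, b) -> (ideal offset, stiffness)


def energy(config: Dict[str, Loc], match_cost: MatchCost, edges: Edges) -> float:
    """Appearance cost of every part plus quadratic spring cost of every edge."""
    e = sum(match_cost[part][loc] for part, loc in config.items())
    for (a, b), (offset, stiffness) in edges.items():
        dx = config[b][0] - config[a][0] - offset[0]
        dy = config[b][1] - config[a][1] - offset[1]
        e += stiffness * (dx * dx + dy * dy)
    return e


def map_configuration(match_cost: MatchCost, edges: Edges) -> Dict[str, Loc]:
    """Exhaustively search all combinations of candidate locations (toy sizes only)."""
    parts = list(match_cost)
    candidates = product(*(match_cost[p].keys() for p in parts))
    best = min(candidates,
               key=lambda locs: energy(dict(zip(parts, locs)), match_cost, edges))
    return dict(zip(parts, best))


# Two parts, two candidate locations each (hypothetical appearance costs).
match_cost: MatchCost = {
    "torso": {(5, 5): 0.2, (9, 9): 0.1},
    "head": {(5, 2): 0.3, (9, 2): 0.25},
}
# The head is expected 3 pixels above the torso.
edges: Edges = {("torso", "head"): ((0, -3), 1.0)}

best_config = map_configuration(match_cost, edges)
best_energy = energy(best_config, match_cost, edges)
```

Note that the torso candidate with the best appearance cost on its own is not selected: the spring term makes the jointly consistent configuration win, which is the reason for modeling structure and appearance together.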

2D pose recovery, detection: some results Felzenszwalb and Huttenlocher, IJCV 2005 (examples of correct and erroneous results) • chamfer matching of body parts on binary foreground image => alternative: AdaBoost [Micilotta2005, Mikolajczyk2004] or SVM [Ronfard2002] • space of possible locations: 70 x 70 x 10 x 32 (x, y, s, θ) • (Comment: running time quite high, 1 min/frame, for an “efficient algorithm”, considering that multiple pictorial structures might be needed to account for various general poses)
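
Chamfer matching scores a template placement by the average distance from each template point to the nearest image feature point; lower is better. The sketch below computes this by brute force for a few translations. It is an assumption-laden illustration: practical implementations precompute a distance transform of the image instead of searching for the nearest point each time, and the template, image points and offset grid are invented.

```python
import math
from typing import Iterable, List, Tuple

Point = Tuple[int, int]


def chamfer_distance(template: List[Point], image_pts: List[Point],
                     offset: Point) -> float:
    """Mean distance from each shifted template point to its nearest image point."""
    ox, oy = offset
    total = 0.0
    for tx, ty in template:
        total += min(math.hypot(tx + ox - ix, ty + oy - iy) for ix, iy in image_pts)
    return total / len(template)


def best_offset(template: List[Point], image_pts: List[Point],
                offsets: Iterable[Point]) -> Point:
    """Return the translation with the lowest chamfer distance."""
    return min(offsets, key=lambda o: chamfer_distance(template, image_pts, o))


# Template: a short vertical segment. Image: the same segment at (4, 3) plus one clutter point.
template = [(0, 0), (0, 1), (0, 2)]
image_pts = [(4, 3), (4, 4), (4, 5), (7, 7)]
offsets = [(x, y) for y in range(8) for x in range(8)]

match = best_offset(template, image_pts, offsets)
match_score = chamfer_distance(template, image_pts, match)

# Size of the search space quoted on the slide: x, y, scale, orientation.
n_locations = 70 * 70 * 10 * 32
```

The last line spells out the arithmetic behind the quoted location space: 70 x 70 x 10 x 32 is about 1.57 million placements per part, which helps explain the running time noted in the comment.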

2D pose recovery, detection and tracking Ramanan et al., PAMI 2007 Automatic system for tracking multiple people. Two stages: • Model-building: opportunistically detect subset of “easy” poses/motions using “detuned” detectors (i.e. based on shape only). Derive texture-based appearance models for body parts. • Track: by detection of arbitrary poses with the now richer (more discriminative) appearance models.

2D pose recovery, detection and tracking Ramanan et al., PAMI 2007 Model building • 1. Bottom-up: by clustering ribbon-like structures that move (figure panels: detected torso patches, detected torso patches over time, clustered detections) • 2. Top-down: by stylized pose detector (legs in “scissor” pose)

2D pose recovery: detection and tracking Ramanan et al., PAMI 2007

Motion Capture – The Conventional Way De-facto industry standard: VICON™ • Applications: life sciences, animation, engineering • Highly accurate 3D motion capture • but • Intrusive (markers attached to human body) • Inflexible (controlled background and lighting) • Expensive (VICON ~ EUR 100,000) Move from laboratory to real-world setting.

Model-based approach • 2-D vs. 3-D • 3-D human body model = “skeleton” + “flesh” • “Skeleton”: in terms of joint angles. Typical models have 20–30 DOFs. • “Flesh”: body parts in terms of cylinders, ellipsoids, super-quadrics with optional global deformations • Trade-off between complexity and accuracy of model. 3D models “Ellen” and “Dariu” (1996) 22 DOF skeleton (6 DOF for position and orientation of torso, 4 DOF for each limb), each body part tapered superquadric (7 parameters)

Use given 3D human models: 3D pose recovery • 3D pose recovery: align human model in 3D space such that its 2D projection(s) match the available 2D camera view(s). • Pose recovery formulated as search in high-dimensional pose parameter space. • Correlation • Edge detection: remove effect of clothing, maintain structure

Hierarchical 3D pose recovery (multi-camera) Work with M. Hofmann (TNO/UvA)

Qualitative comparison • Learning from 2D examples: works for simple or constrained recognition tasks (e.g. person/face detection, lateral walking, gestures); relatively easy to implement; to address: scalability problems (e.g. views, different movements, occlusion) • Matching with prior 3D models: can deal with arbitrary, unconstrained movement; difficult to implement; to address: model matching, occlusion, use of single camera • Fuse data- and model-driven approaches: model adaptation. Example: texture mapping; opportunistic model adaptation [Ramanan2007]

Experimental test bed: Surveillance • Motivation • Public safety is a major societal/political issue. • Surveillance technology is increasingly applied to secure public spaces (e.g. train stations, shopping malls, street corners). • Human operators monitor vast amounts of data for events that occur rarely (inefficient, many “missed” events). • Goal • Develop intelligent system that extracts from surveillance data relevant parts for verification by a human operator.

CASSANDRA project (2005-2009) • Detect complex human activity (aggression) • Fuse (complementary) audio and video modality • From signal processing to cognitive modeling • Realistic experimental validation • Amstel station 03/2006: different scenarios played out by professional actors: from “normal”, “lively” to “aggressive” • Comparison CASSANDRA-estimated aggression level with that of human

Prototype Overview • Video-based Aggression Detection • Audio-based Aggression Detection • Context Information • Inference (High-level Fusion)

CASSANDRA: video processing Work with W. Zajdel (UvA) • Tracking foreground region • Computation of basic motion features

Aggression detection by audio–video fusion • “critical” • “serious” • “attention” • “normal” • “none”

Next Steps CASSANDRA Need more meaningful visual features! 2D body part detection and pose estimation (DOAS project)
