
Linking Video Analysis to Annotation Technologies


Presentation Transcript


  1. Linking Video Analysis to Annotation Technologies • Presentation for the BMVA’s Computer Vision Methods for Ambient Intelligence, 31st May 2006 • Dimitrios Makris (Kingston University) & Bogdan Vrusias (University of Surrey)

  2. REVEAL project • EPSRC-funded project, initiated in 2004 • Academic partners • Kingston University, University of Surrey • Industrial partners/observers • SIRA Ltd, Ipsotek Ltd, CrowdDynamics Ltd, Overview Ltd • End-Users • PITO, PSDB (Home Office), Surrey Police • Aim: “to promote those key technologies which will enable automated extraction of evidence from CCTV archives”

  3. See No Evil, Hear No Evil, Speak No Evil

  4. Scope of REVEAL • See Evil • Computer Vision • Input: Video Streams • Hear Evil • Natural Language Processing • Input: Annotations of Video Streams • Speak Evil • Link together • Output: Automatic Video Annotations

  5. Challenges • Development of Visual Evidence Thesaurus • Automatic Extraction of Surveillance Ontology • Extracting Visual Semantics • Motion Detection & Tracking • Geometric and Colour Constancy • Object Classification, Behaviour Analysis, Semantic Landscape • Analysing Crowds • Development of Surveillance Meta-Data Model • Multimodal Data Fusion • Fusion of Visual Semantics and Annotations • Video Summarisation

  6. reveal@dirc.kingston

  7. Video Analysis Overview • Motion Analysis • Motion Detection • Motion Tracking • Crowd Analysis • Automatic Camera Calibration • Colour • Geometric • Visual Semantics extraction • Object Classification • Behaviour Analysis • Semantic landscape

  8. Motion Analysis (1/2) • Motion Detection • Novel Technique for handling rapid light variations, based on correlating changes in YUV (Renno et al, VS2006) • Motion Tracking • Blob-based Kalman filter for tackling partial occlusion (Xu&Ellis, BMVC2002)
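
A minimal sketch of the tracking side, assuming a constant-velocity Kalman filter over a single blob centroid with made-up noise parameters; the published tracker (Xu & Ellis, BMVC2002) additionally handles partial occlusion and multiple blobs.

```python
import numpy as np

# Constant-velocity Kalman filter over a blob centroid state (x, y, vx, vy).
# Noise parameters are illustrative only.
dt = 1.0
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)     # state transition
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)     # only the centroid is observed
Q = np.eye(4) * 0.01                          # process noise (assumed)
R = np.eye(2) * 4.0                           # measurement noise (assumed)

def kalman_step(x, P, z):
    """One predict/update cycle; z is the measured blob centroid."""
    x = F @ x                                  # predict state
    P = F @ P @ F.T + Q                        # predict covariance
    y = z - H @ x                              # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)             # Kalman gain
    x = x + K @ y                              # update state
    P = (np.eye(4) - K @ H) @ P                # update covariance
    return x, P

# Usage: feed the centroid re-detected in each frame.
x, P = np.array([100.0, 50.0, 0.0, 0.0]), np.eye(4) * 10.0
for z in [np.array([102.0, 51.0]), np.array([104.5, 52.2])]:
    x, P = kalman_step(x, P, z)
```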

  9. Motion Analysis (2/2) Example

  10. Crowd Analysis (1/3) • Problem: Detect and Track Individuals in Crowded situations Original Frame Foreground Mask

  11. Crowd Analysis (2/3) • Combine edges of original image with edges of foreground mask Original Frame Edges Foreground Mask Edges
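
A rough sketch of this edge-combination step, assuming OpenCV and illustrative Canny thresholds: only the image edges that coincide with (slightly dilated) foreground-mask edges are kept, and head candidates can then be sought along the surviving edges.

```python
import cv2

def crowd_edges(frame, fg_mask):
    """Keep image edges that lie on the boundary of the foreground mask."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    image_edges = cv2.Canny(gray, 50, 150)        # edges of the original frame
    mask_edges = cv2.Canny(fg_mask, 50, 150)      # edges of the foreground mask
    mask_edges = cv2.dilate(mask_edges, None, iterations=2)  # tolerate small offsets
    return cv2.bitwise_and(image_edges, mask_edges)
```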

  12. Crowd Analysis (3/3) • Fit a head-shoulder (Omega) model Head Candidates on the boundaries of the foreground Head Candidates in the scene Head Candidates within the foreground

  13. Automatic Geometric Calibration (1/3) • Pedestrian height model • Estimate linear pedestrian height model from observations (Renno et al, ICIP2002) [Figure: person, vehicle and large-vehicle heights (pixels) against image position (pixels), with gradient g and the horizon where the height falls to zero]
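
As a hedged illustration, assuming the linear model has the form h = g · (i − i_horizon), with h the pedestrian height in pixels and i the image row of the feet, the gradient and horizon can be recovered by least squares from tracked detections (the numbers below are invented):

```python
import numpy as np

# Illustrative observations: foot image row (pixels) and blob height (pixels).
positions = np.array([420.0, 380.0, 330.0, 290.0, 250.0])
heights   = np.array([150.0, 128.0, 101.0,  80.0,  59.0])

# Fit h = g * i + c, then the horizon is the row where the height vanishes.
g, c = np.polyfit(positions, heights, 1)
i_horizon = -c / g
print(f"gradient g = {g:.3f}, horizon at image row {i_horizon:.1f}")

# A detection at row i with height far above g * (i - i_horizon) is then a
# candidate large vehicle rather than a pedestrian.
```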

  14. Automatic Geometric Calibration (2/3) • Ground Plane Estimation • Use the pedestrian linear model to estimate ground plane. (Renno et al, BMVC2002)

  15. Automatic Geometric Calibration (3/3) • Scene Depth Map • Use the estimated depths of moving objects to determine the scene depth map (Renno et al, BMVC2004) [Figures: occlusion edges; depth map]

  16. Automatic Colour Calibration (1/3) • Variation of colour responses is significant! • A real-time colour constancy algorithm is required

  17. Automatic Colour Calibration (2/3) • Grey World and Gamut Mapping algorithms were tested. • Automatic method for selecting the reference frame. • Gamut Mapping performs better, but Grey World can operate in real time. (Renno et al, VS-PETS2005)
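
For reference, a minimal grey-world sketch (the generic algorithm, not the exact reference-frame method of Renno et al, VS-PETS2005): each channel is scaled so that its mean matches the overall grey level, which is cheap enough to run in real time.

```python
import numpy as np

def grey_world(image_bgr):
    """Scale each colour channel so its mean equals the overall grey level."""
    img = image_bgr.astype(np.float32)
    channel_means = img.reshape(-1, 3).mean(axis=0)      # per-channel means
    grey = channel_means.mean()                          # target grey level
    gains = grey / np.maximum(channel_means, 1e-6)       # per-channel gains
    return np.clip(img * gains, 0, 255).astype(np.uint8)
```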

  18. Automatic Colour Calibration (3/3) • Real Time Colour Constancy

  19. Visual Semantics (Makris et al, ECOVISION 2004) • Targets: pedestrians, cars, large vehicles • Actions: move, stop, enter/exit, accelerate, turn left/right • Static features: road/corridor, door/gate, ATM, desk, bus stop

  20. Visual Semantics • Object Classification (ongoing work) • Behaviour Analysis (ongoing work) • Semantic landscape • Label static scene by observing activity (Makris&Ellis, AVSS 2003)

  21. Reverse Engineering

  22. Entry/Exit Zones Detected by an EM-based algorithm
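
A hedged sketch of this step, assuming zones are learned from the first (entry) and last (exit) points of observed trajectories by fitting a Gaussian mixture with EM; scikit-learn is used for brevity, and the number of zones is fixed rather than selected automatically.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def learn_zones(trajectories, n_zones=4):
    """trajectories: list of (N_i, 2) arrays of image coordinates."""
    entries = np.array([t[0] for t in trajectories])     # first point of each track
    exits   = np.array([t[-1] for t in trajectories])    # last point of each track
    entry_gmm = GaussianMixture(n_components=n_zones).fit(entries)
    exit_gmm  = GaussianMixture(n_components=n_zones).fit(exits)
    return entry_gmm, exit_gmm   # component means/covariances describe the zones
```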

  23. Detected Routes

  24. Segmentation of Routes to Paths & Junctions

  25. Possible extensions • Use target labels • paths: traffic road or pavements • pedestrian crossing: junction of • pedestrian route • vehicle route • More complicated rules • bus stop • pedestrians stop • vehicle stop • pedestrians merge with vehicle
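
To make the flavour of such rules concrete, a hypothetical sketch in which each observed event is a simple record with a target, an action and a region; the logic is illustrative only, not the project's rule set.

```python
def label_region(events, region):
    """Assign a semantic label to a region from co-occurring events."""
    local = [e for e in events if e["region"] == region]
    ped_stop = any(e["target"] == "pedestrian" and e["action"] == "stop" for e in local)
    veh_stop = any(e["target"] == "vehicle" and e["action"] == "stop" for e in local)
    merge    = any(e["action"] == "merge" for e in local)
    if ped_stop and veh_stop and merge:
        return "bus stop"       # pedestrians and a vehicle stop, then merge
    return "unlabelled"
```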

  26. Data Hierarchy in Video Analysis: Pixels → Blobs → Trajectories → Actor / Scene / Action labels → Textual Summary

  27. Natural Language Processing (Surrey) Hypothesis: Experts are using a common language/keywords to describe crime scene/video evidence. • Visual Evidence Thesaurus • Data acquisition (workshops) • Data analysis • Automatic ontology extraction

  28. Video Annotation Workshops • Two workshops were organised and run to test the hypothesis and construct a domain thesaurus. • Different experts: • Police Forces (Surrey, West Yorkshire). • Forensic Services (London, Birmingham). • Private video evidence expert (Luton). • Several data collection tasks. • Purpose: gather knowledge and feedback from experts in order to understand the way videos are observed and perceived. • Task: validate the hypothesis and extract the common keywords used and the description pattern.

  29. Video Annotation Workshops Selected sample from the Workshop

  30. Video Evidence Thesaurus • Workshop Outputs: • Initial descriptions from experts, for analysis. • Useful feedback and comments. • Workshop Feedback: • Strong interest in the project. • Willing to help (within the legal limits).

  31. Analysis • Same video clip. • Description from 3 different people. • Pattern (Identify, Elaborate, Location)

  32. Video Evidence Thesaurus • Analysis: • Description from different people, for same video clips. • Pattern (Identify <I>, Elaborate <E>, Location <L>) . • Grammar: • <Description>: <I><E|L><L|E|{Φ}><Description|{Φ}>. • <I>: <Single|Group> • <Single> : <{Person}|{Male}|{Female}|….> • <Group> : <2|3|…..|n><{People}|Single>
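
As an illustration only, the top-level production can be checked over the sequence of category tags (I = Identify, E = Elaborate, L = Location) with a small pattern:

```python
import re

# <Description> ::= <I><E|L><L|E|empty><Description|empty>
DESCRIPTION = re.compile(r"^(I(E|L)(L|E)?)+$")

def is_valid(tags):
    return bool(DESCRIPTION.match("".join(tags)))

print(is_valid(["I", "E", "L"]))   # True, e.g. "Man / in blue t-shirt / by the door"
print(is_valid(["E", "I"]))        # False: a description must start with Identify
```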

  33. Thesaurus Construction Methodology • We adopt a text-driven and bottom-up method: starting from a collection of texts in a specialist domain, together with a representative general language corpus • Use a five-step algorithm for identifying discourse patterns with more or less unique meanings, without any overt access to an external knowledge base

  34. Thesaurus Construction Methodology • Select training corpora: CCTV-Related Corpus and a general language corpus. • Extract key words; • Extract key collocates; • Extract local grammar using collocation and relevance feedback; • Assert the grammar as a finite state automaton.

  35. Development of Visual Evidence Thesaurus • Once the single terms, especially the weird terms, are identified, we find candidate compound terms by computing collocation statistics between the single terms and other open-class words in the entire corpus.
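
A hedged sketch of this collocation step, using pointwise mutual information over a fixed word window as an illustrative association score; the statistic actually computed in the project may differ.

```python
import math
from collections import Counter

def collocates(tokens, term, window=5):
    """Rank words that co-occur with `term` within a +/- window of tokens."""
    unigrams = Counter(tokens)
    pairs = Counter()
    for i, tok in enumerate(tokens):
        if tok == term:
            for other in tokens[max(0, i - window): i + window + 1]:
                if other != term:
                    pairs[other] += 1
    n = len(tokens)
    scores = {}
    for other, joint in pairs.items():
        # Pointwise mutual information of (term, other)
        scores[other] = math.log((joint / n) /
                                 ((unigrams[term] / n) * (unigrams[other] / n)))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```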

  36. Development of Visual Evidence Thesaurus • Collocates of the weird term EARPRINT, with collocation statistics

  37. Development of Visual Evidence Thesaurus • Collocates of the weird term EARPRINT IDENTIFICATION

  38. Development of Visual Evidence Thesaurus • An inheritance hierarchy of EARPRINT collocates exported to the knowledge representation system PROTEGE:

  39. Development of Visual Evidence Thesaurus • A multiple inheritance hierarchy of EARPRINT collocates, now exported to the knowledge representation workbench PROTEGE: Rubbish!

  40. Experiments and Evaluation • I. Select training corpora • Training corpus • The British National Corpus, comprising 100 million tokens distributed over 4,124 texts (Aston and Burnard 1998); • Crime Alerts Corpus (FBI Crime Alerts, Wanted by the Royal Canadian Mounted Police (RCMP) & Polizei Bayern, journal/conference papers), comprising 109 articles and containing 214,437 words

  41. Experiments and Evaluation • II. Extract key words • The frequencies of individual words in the Crime Alerts Corpus were computed using System Quirk;
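
One way to turn these frequencies into key words, in the spirit of the specialist-versus-general comparison described earlier, is a "weirdness" ratio: the word's relative frequency in the Crime Alerts Corpus divided by its relative frequency in the general corpus. A minimal sketch with toy counts and assumed add-one smoothing follows; the exact statistic computed by System Quirk may differ.

```python
from collections import Counter

def weirdness(word, domain_counts, general_counts, domain_total, general_total):
    """Relative frequency in the specialist corpus over that in the general corpus."""
    domain_rel = domain_counts[word] / domain_total
    general_rel = (general_counts[word] + 1) / general_total   # +1 avoids division by zero
    return domain_rel / general_rel

# Toy counts: a domain-specific term scores far above 1, a common word near 1.
domain = Counter({"earprint": 42, "the": 12000})
general = Counter({"earprint": 0, "the": 6000000})
print(weirdness("earprint", domain, general, 214_437, 100_000_000))
print(weirdness("the", domain, general, 214_437, 100_000_000))
```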

  42. Experiments and Evaluation

  43. Experiments and Evaluation

  44. Experiments and Evaluation • III. Extract key collocates

  45. Experiments and Evaluation • IV. Extract local grammar using collocation and relevance feedback

  46. Experiments and Evaluation • V. Assert the grammar as a finite state automaton • The (re-)collocation patterns can then be asserted as finite state automata, one for each of the movement verbs and spatial-preposition metaphors
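
A toy illustration of asserting one such pattern as an automaton; the verb and preposition sets and the "subject, movement verb, spatial preposition, location" pattern are invented for the example.

```python
MOVEMENT_VERBS = {"runs", "walks", "enters", "leaves"}
SPATIAL_PREPS = {"into", "from", "towards", "across"}

def accepts(tokens):
    """Accept token sequences of the form: subject, movement verb, preposition, location."""
    state = "START"
    for tok in tokens:
        if state == "START":
            state = "SUBJECT"                       # any token can be the subject
        elif state == "SUBJECT" and tok in MOVEMENT_VERBS:
            state = "VERB"
        elif state == "VERB" and tok in SPATIAL_PREPS:
            state = "PREP"
        elif state == "PREP":
            state = "LOCATION"                      # any token can be the location
        else:
            return False
    return state == "LOCATION"

print(accepts(["man", "runs", "towards", "exit"]))   # True
print(accepts(["man", "exit", "runs"]))              # False
```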

  47. Describing Videos • An experiment: • 16 videos in the CAVIAR data set were shown to 4 different surveillance experts; • The experts were asked to describe the videos in their own words – in English surveillance speak • Experts will describe videos in a succinct manner using terminology of their domain and framing the description in a ‘local grammar’ • The interviews were transcribed and sentences and phrases were marked up using a basic ontology: Action, Location, Result, Miscellaneous.

  48. Describing Videos One of our experts described the frame on the left as Man in blue t-shirt, centre of scene, facing camera, raises white card high above his head. Second man wearing a dark top with white stripes down the sleeve enters scene from above. Meets a third individual with a dark top and pale trousers and an altercation occurs in the centre of the open space. Individuals meet briefly and leave scene in opposite directions. The original person with the white sleeves -- white stripes on the sleeves leaves scene below camera. Second person in the altercation leaves scene by the red chairs. That was an assault, I’d say.

  49. Describing Videos One of our experts described the frame on the left as Event 1: Man in blue t-shirt, centre of scene, facing camera, raises white card high above his head. Event 1: Miscellaneous: Man in blue t-shirt, Location: centre of scene, Action: facing camera, Result: raises white card high above his head.

  50. Describing Videos One of our experts described the frame on the left as Event 1: M Man in blue t-shirt, L centre of scene, A facing camera, R raises white card high above his head. Event 2: A Second man wearing a dark top with white stripes down the sleeve enters scene L from above. A Meets a third individual with a dark top and pale trousers and an altercation occurs L in the centre of the open space. A Individuals meet briefly R and leave scene in opposite directions.
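
A small, hypothetical record type for the marked-up events, keeping phrases under the four ontology categories (Miscellaneous, Location, Action, Result) used in the transcripts:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class EventAnnotation:
    event_id: int
    miscellaneous: List[str] = field(default_factory=list)
    location: List[str] = field(default_factory=list)
    action: List[str] = field(default_factory=list)
    result: List[str] = field(default_factory=list)

event1 = EventAnnotation(
    event_id=1,
    miscellaneous=["Man in blue t-shirt"],
    location=["centre of scene"],
    action=["facing camera"],
    result=["raises white card high above his head"],
)
```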
