How Machines Learn to Talk
Amitabha Mukerjee, IIT Kanpur
Work done with: Computer Vision: Profs. C. Venkatesh, Pabitra Mitra; Prithvijit Guha, A. Ramakrishna Rao, Pradeep Vaghela. Natural Language: Prof. Achla Raina, V. Shreeniwas.
Robotics Collaborations: IGCAR Kalpakkam; Sanjay Gandhi PG Medical Hospital
Visual Robot Navigation: Time-to-Collision-based Robot Navigation
Hyper-Redundant Manipulators The same manipulator can work in changing workspaces • Reconfigurable Workspaces / Emergency Access • Optimal Design of Hyper-Redundant Systems – SCARA and 3D
Planar Hyper-Redundancy: 4-link Planar Robot Motion Planning
Micro-Robots • Micro Soccer Robots (1999-) • 8cm Smart Surveillance Robot – 1m/s • Autonomous Flying Robot (2004) • Omni-directional platform (2002) Omni-Directional Robot Sponsor: DST. amit@iitk.ac.in
Flying Robot Start-Up at IIT Kanpur: Whirligig Robotics. Test flight of UAV; Inertial Measurement Unit (IMU) under commercial production.
Tracheal Intubation Device Assists the surgeon while inserting the breathing tube during general anaesthesia: ball-and-socket joint, aperture for oxygenation tube, aperture for fibre-optic video cable, endotracheal tube aperture, hole for suction tube, control cables, attachment points. Device for Intubation during General Anaesthesia Sponsor: DST / SGPGM. sens@iitk.ac.in
Draupadi’s Swayamvar: Can the arrow hit the rotating mark? Sponsor: Media Lab Asia
High DOF Motion Planning • Accessing Hard to Reach spaces • Design of Hyper-Redundant Systems • Parallel Manipulators • Sponsor: BRNS / MHRD • dasgupta@iitk.ac.in 10-link 3D Robot – Optimal Design
Multimodal Language Acquisition Consider a child observing a scene together with adults talking about it. Grounded language: symbols are grounded in perceptual signals. Uses simple videos with boxes and simple shapes, a standard stimulus in social psychology.
Objective To develop a computational framework for Multimodal Language Acquisition • acquiring the perceptual structure corresponding to verbs • using Recurrent Neural Networks as a biologically plausible model for temporal abstraction • adapting the learned model to interpret activities in real videos
Visually Grounded Corpus • Two psychological research films, one based on the classic Heider & Simmel (1944) and the other based on Hide & Seek • These animations portray motion paths of geometric figures (big square, small square & circle) • Example clip: ChaseAlt
Cognate clustering • Similarity clustering: different expressions for the same action, e.g. “move away from center” vs “go to a corner” • Frequency: remove infrequent lexical units • Synonymy: sets of lexical units used consistently in the same intervals, to mark the same action, for the same set of agents.
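The clustering steps above can be sketched in a few lines. This is only an illustrative toy (the function name, threshold, and greedy merge strategy are assumptions, not the slides' actual algorithm): filter out infrequent lexical units, then group units whose sets of utterance intervals overlap strongly, as a crude synonymy test.

```python
def cluster_cognates(usages, min_freq=2, overlap=0.5):
    """usages: {lexical_unit: set of interval ids where it was uttered}.
    Drop infrequent units, then greedily merge units whose interval
    sets overlap (Jaccard similarity) above a threshold."""
    units = {u: ivs for u, ivs in usages.items() if len(ivs) >= min_freq}
    clusters = []
    for u, ivs in units.items():
        for c in clusters:
            rep = units[c[0]]                      # compare to cluster representative
            if len(rep & ivs) / len(rep | ivs) >= overlap:
                c.append(u)
                break
        else:
            clusters.append([u])
    return clusters

usages = {
    "move away": {1, 2, 3},
    "go to corner": {1, 2, 4},   # co-occurs with "move away" -> merged
    "chase": {7, 8, 9},
    "hit": {5},                  # too infrequent -> dropped
}
clusters = cluster_cognates(usages)
```

With these toy intervals, "move away" and "go to corner" land in one cluster and "chase" in its own.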
Perceptual Process: Video → Feature Extraction → Features, and Descriptions → Cognate Clustering → Events; the two streams form the multimodal input to a trained Simple Recurrent Network (VICES).
Design of Feature Set • The features selected here relate to spatial aspects of conceptual primitives in children, such as position, relative pose, velocity, etc. • Use features that are kinematic in nature: temporal derivatives or simple transforms of the basic ones.
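A minimal sketch of such kinematic features, computed from two agents' trajectories (the function name and exact feature set are illustrative assumptions): positions, their temporal derivatives, relative distance, and the rate at which the agents approach each other.

```python
import numpy as np

def kinematic_features(pos_a, pos_b, dt=1.0):
    """Per-frame features from (T, 2) position arrays: position and
    velocity of agent A, inter-agent distance, and approach rate."""
    vel_a = np.gradient(pos_a, dt, axis=0)   # temporal derivative of position
    rel = pos_b - pos_a
    dist = np.linalg.norm(rel, axis=1)       # relative distance
    approach = np.gradient(dist, dt)         # negative while closing in
    return np.column_stack([pos_a, vel_a, dist, approach])

# Toy example: A stays put while B moves straight toward it
A = np.zeros((5, 2))
B = np.column_stack([np.linspace(4, 0, 5), np.zeros(5)])
F = kinematic_features(A, B)                 # one 6-d feature vector per frame
```

Feature vectors like these, one per frame, are what a temporal classifier would consume.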
Video → Feature Extraction → Features, and Descriptions → Cognate Clustering → Events, feeding a trained Simple Recurrent Network. VICES: VIdeo and Commentary for Event Structures.
The classification problem • The problem is one of time-series classification • Possible methodologies include: • Logic-based methods • Hidden Markov Models • Recurrent Neural Networks
Elman Network • Commonly a two-layer network with feedback from the first-layer output to the first-layer input • Elman networks detect and generate time-varying patterns • They can also learn spatial patterns
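A minimal forward pass of an Elman network in NumPy, to make the feedback loop concrete. The sizes and weights here are illustrative assumptions, not the network from the slides; the key point is that the hidden (context) state from the previous step feeds back into the hidden layer.

```python
import numpy as np

def srn_step(x, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One step of an Elman / Simple Recurrent Network: the previous
    hidden state h_prev re-enters the hidden layer via W_hh."""
    h = np.tanh(W_xh @ x + W_hh @ h_prev + b_h)    # context feedback
    y = 1.0 / (1.0 + np.exp(-(W_hy @ h + b_y)))    # sigmoid output
    return h, y

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 6, 8, 1                       # toy sizes
W_xh = rng.normal(0, 0.1, (n_hid, n_in))
W_hh = rng.normal(0, 0.1, (n_hid, n_hid))
W_hy = rng.normal(0, 0.1, (n_out, n_hid))
b_h, b_y = np.zeros(n_hid), np.zeros(n_out)

h = np.zeros(n_hid)                                # initial context
for x in rng.normal(size=(10, n_in)):              # a 10-frame feature sequence
    h, y = srn_step(x, h, W_xh, W_hh, W_hy, b_h, b_y)
```

Because h carries information forward across frames, the output at each step depends on the history of the sequence, which is what makes the architecture suitable for time-series classification.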
Feature Extraction in Abstract Videos • Each image is read into a 2D matrix • Connected component analysis is performed • A bounding box is computed for each connected component • Dynamic tracking is used to follow each object across frames
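The first three steps can be sketched with `scipy.ndimage` (the toy frame below is an assumed stand-in for a frame of the abstract video, not data from the actual corpus):

```python
import numpy as np
from scipy import ndimage

# Toy binary frame containing two "shapes"
frame = np.zeros((10, 10), dtype=int)
frame[1:4, 1:4] = 1          # a small square
frame[6:9, 5:9] = 1          # a rectangle

labels, n = ndimage.label(frame)        # connected component analysis
slices = ndimage.find_objects(labels)   # one slice pair per component
# Bounding box per component as (row_min, col_min, row_max, col_max)
bboxes = [(s[0].start, s[1].start, s[0].stop, s[1].stop) for s in slices]
```

Tracking then amounts to associating each frame's boxes with those of the previous frame, e.g. by nearest centroid.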
Working with Real Videos • Challenges • Noise in real-world videos • Illumination changes • Occlusions • Extracting depth information • Our Setup • Camera fixed at head height • Angle of depression approximately 0 degrees
Background Subtraction • Learn on still background images • Find per-pixel intensity distributions • Classify pixel (x, y) as background if |P(x, y) − µ(x, y)| < k·σ(x, y) • Remove shadows: a special case of reduced illumination, S = k·P where k < 1.0
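A minimal sketch of this per-pixel Gaussian background model (function names, the threshold k = 2.5, and the synthetic frames are assumptions for illustration):

```python
import numpy as np

def fit_background(frames):
    """Per-pixel mean and std-dev from still background frames."""
    stack = np.stack(frames).astype(float)
    return stack.mean(axis=0), stack.std(axis=0) + 1e-6

def foreground_mask(frame, mu, sigma, k=2.5):
    """A pixel is foreground when |P - mu| >= k * sigma."""
    return np.abs(frame - mu) >= k * sigma

rng = np.random.default_rng(1)
# 20 noisy frames of a static background around intensity 100
bg_frames = [100 + rng.normal(0, 2, (8, 8)) for _ in range(20)]
mu, sigma = fit_background(bg_frames)

frame = 100 + rng.normal(0, 2, (8, 8))
frame[2:4, 2:4] = 200                    # a bright moving "object"
mask = foreground_mask(frame, mu, sigma)
```

The shadow rule from the slide would be applied afterwards: pixels whose intensity is a constant fraction k < 1 of the background value are reclassified as background.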
Contd. • Extract human blobs by connected component analysis • A bounding box is computed for each person • Track human blobs: each is tracked using a mean-shift tracking algorithm
Depth Estimation • Two approximations • Using Gibson’s affordances • Camera geometry • Affordances: visual cues • A human’s action is triggered by the environment itself • A floor offers walk-on ability • Every object affords certain actions to perceive, along with anticipated effects • A cup’s handle affords grasping-lifting-drinking
Contd. • Gibson’s model • The horizon is fixed at the head height of the observer • Monocular depth cues • Interposition: an object that occludes another is closer • Height in the visual field: the higher an object appears, the farther away it is
Pinhole Camera Model Mapping (X, Y, Z) to (x, y): x = f·X / Z, y = f·Y / Z. For the point of contact with the ground: Z ∝ 1 / y and X ∝ x / y. Depth Estimation
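With a level camera at height H above the floor, a ground-contact point at depth Z projects to y = f·H / Z below the horizon, giving Z = f·H / y and X = x·Z / f. A small sketch (the focal length, camera height, and pixel coordinates below are assumed example values, not the paper's calibration):

```python
def ground_depth(x_img, y_img, f, cam_height):
    """Depth of a ground-contact point under the pinhole model with a
    level camera. y_img is measured in pixels downward from the horizon
    (image center); f is the focal length in pixels."""
    Z = f * cam_height / y_img   # Z proportional to 1/y
    X = x_img * Z / f            # X proportional to x/y
    return X, Z

# Feet of a person imaged 50 px below the horizon,
# f = 500 px, camera at head height 1.6 m:
X, Z = ground_depth(x_img=20, y_img=50, f=500, cam_height=1.6)
# Z = 500 * 1.6 / 50 = 16.0 m, X = 20 * 16 / 500 = 0.64 m
```

This also explains the "height in the visual field" cue: smaller y (closer to the horizon) means larger Z.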
Depth plot for “A chases B” • Top view (Z-X plane)
Results • Separate-SRN-for-each-action • Trained & tested on different parts of the abstract video • Trained on abstract video and tested on real video • Single-SRN-for-all-actions • Trained on synthetic video and tested on real video
Basis for Comparison Let the total time of the visual sequence for each verb be t time units
Separate SRN for each action Real video (action recognition only)
Conclusions & Future Work • The sparse nature of the videos makes visual analysis easier • Event structures are learned directly from the perceptual stream • Extensions: learn fine nuances between event structures of related action words • Learn morphological variations • Extend the work to Long Short-Term Memory (LSTM) networks • Hierarchical acquisition of higher-level action verbs