
How Machines Learn to Talk



Presentation Transcript


  1. How Machines Learn to Talk. Amitabha Mukerjee, IIT Kanpur. Work done with: Computer Vision: Profs. C. Venkatesh, Pabitra Mitra; Prithvijit Guha, A. Ramakrishna Rao, Pradeep Vaghela. Natural Language: Prof. Achla Raina, V. Shreeniwas

  2. Robotics Collaborations: IGCAR Kalpakkam, Sanjay Gandhi PG Medical Hospital

  3. Visual Robot Navigation: Time-to-Collision based Robot Navigation

  4. Hyper-Redundant Manipulators The same manipulator can work in changing workspaces • Reconfigurable Workspaces / Emergency Access • Optimal Design of Hyper-Redundant Systems – SCARA and 3D

  5. Planar Hyper-Redundancy 4-link Planar Robot Motion Planning

  6. Micro-Robots • Micro Soccer Robots (1999-) • 8cm Smart Surveillance Robot – 1m/s • Autonomous Flying Robot (2004) • Omni-directional platform (2002) Omni-Directional Robot Sponsor: DST • amit@iitk.ac.in

  7. Flying Robot Start-Up at IIT Kanpur: Whirligig Robotics. Test Flight of UAV. Inertial Measurement Unit (IMU) under commercial production

  8. Tracheal Intubation Device Assists the surgeon while inserting the breathing tube during general anaesthesia. [Figure: device for intubation during general anaesthesia, with labelled components – ball & socket joint, aperture for oxygenation tube, aperture for fibre-optic video cable, endotracheal tube aperture, hole for suction tube, control cables, attachment points.] Sponsor: DST / SGPGM • sens@iitk.ac.in

  9. Draupadi’s Swayamvar Can the Arrow hit the rotating mark? Sponsor: Media Lab Asia

  10. High DOF Motion Planning • Accessing Hard to Reach spaces • Design of Hyper-Redundant Systems • Parallel Manipulators • Sponsor: BRNS / MHRD • dasgupta@iitk.ac.in 10-link 3D Robot – Optimal Design

  11. Multimodal Language Acquisition Consider a child observing a scene together with adults talking about it. Grounded Language: symbols are grounded in perceptual signals. We use simple videos with boxes and simple shapes, of the kind standardly used in social psychology.

  12. Objective To develop a computational framework for Multimodal Language Acquisition • acquiring the perceptual structure corresponding to verbs • using Recurrent Neural Networks as a biologically plausible model for temporal abstraction • adapting the learned model to interpret activities in real videos

  13. Visually Grounded Corpus • Two psychological research films, one based on the classic Heider & Simmel (1944) and the other based on Hide & Seek • These animations portray motion paths of geometric figures (big square, small square & circle) • ChaseAlt

  14. Cognate clustering • Similarity Clustering: different expressions for the same action, e.g. “move away from center” vs “go to a corner” • Frequency: remove infrequent lexical units • Synonymy: a set of lexical units being used consistently in the same intervals, to mark the same action, for the same set of agents.
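
A minimal Python sketch of the cognate-clustering step described above, on hypothetical interval data: lexical units used consistently over the same video intervals are merged, and infrequent units are dropped. The usage sets, frequency cutoff and Jaccard threshold are illustrative assumptions, not the actual corpus or parameters.

from itertools import combinations

# Hypothetical data: lexical unit -> set of (start, end) frame intervals it labels.
usage = {
    "chase":          {(10, 40), (95, 130), (200, 240)},
    "run after":      {(10, 40), (95, 130)},
    "move away":      {(60, 90), (160, 190)},
    "go to a corner": {(60, 90), (160, 190)},
    "wobble":         {(150, 155)},            # infrequent, will be dropped
}

MIN_FREQ = 2      # frequency filter: drop rarely used lexical units
SYNONYMY = 0.6    # interval overlap (Jaccard) needed to call two units synonymous

frequent = {w: ivs for w, ivs in usage.items() if len(ivs) >= MIN_FREQ}

def jaccard(a, b):
    return len(a & b) / len(a | b)

# Greedily merge units whose interval sets overlap strongly (same action, same agents).
clusters = [[w] for w in frequent]
for w1, w2 in combinations(frequent, 2):
    if jaccard(frequent[w1], frequent[w2]) >= SYNONYMY:
        c1 = next(c for c in clusters if w1 in c)
        c2 = next(c for c in clusters if w2 in c)
        if c1 is not c2:
            c1.extend(c2)
            clusters.remove(c2)

print(clusters)   # e.g. [['chase', 'run after'], ['move away', 'go to a corner']]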

  15. [Block diagram: video feature extraction (features) and cognate clustering of descriptions (events) form the multimodal input to a trained Simple Recurrent Network – the VICES perceptual process.]

  16. Design of Feature Set • The features selected here relate to spatial aspects of conceptual primitives in children, such as position, relative pose, velocity, etc. • Use features that are kinematic in nature: temporal derivatives or simple transforms of the basic ones.

  17. Monadic Features

  18. Dyadic Predicates
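
The monadic feature and dyadic predicate tables are not reproduced in the transcript, but a small Python sketch of the kind of kinematic features described on slide 16 might look like this. The specific choices (speed from centroid velocity, inter-object distance, rate of approach) and the toy tracks are illustrative assumptions.

import numpy as np

def monadic_features(track, fps=25.0):
    """track: (T, 2) array of (x, y) centroids of one object over T frames."""
    pos = np.asarray(track, dtype=float)
    vel = np.gradient(pos, axis=0) * fps         # frame-to-frame velocity
    speed = np.linalg.norm(vel, axis=1)
    return {"position": pos, "velocity": vel, "speed": speed}

def dyadic_features(track_a, track_b, fps=25.0):
    """Pairwise (dyadic) features between two objects' tracks."""
    a = np.asarray(track_a, dtype=float)
    b = np.asarray(track_b, dtype=float)
    dist = np.linalg.norm(a - b, axis=1)         # inter-object distance
    approach = -np.gradient(dist) * fps          # > 0 while A closes in on B (chase-like)
    return {"distance": dist, "approach_rate": approach}

# Toy example: A moves towards a stationary B, so the approach rate stays positive.
a = np.stack([np.linspace(0, 10, 6), np.zeros(6)], axis=1)
b = np.tile([12.0, 0.0], (6, 1))
print(dyadic_features(a, b)["approach_rate"])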

  19. VIdeo and Commentary for Event Structures [VICES] [Block diagram as on slide 15: video feature extraction (features) and cognate clustering of descriptions (events) form the multimodal input to the trained Simple Recurrent Network.]

  20. The classification problem • The problem is of time series classification • Possible methodologies include: • Logic based methods • Hidden Markov Models • Recurrent Neural Networks

  21. Elman Network • Commonly a two-layer network with feedback from the first-layer output to the first-layer input • Elman networks detect and generate time-varying patterns • They are also able to learn spatial patterns
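
A minimal numpy sketch of the Elman architecture described above, under illustrative sizes and random weights: the hidden activations of the previous time step act as a context layer that is fed back into the hidden layer together with the current input. Training (backpropagation through time) is omitted.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 12, 20, 4                 # e.g. features -> hidden -> event classes

W_xh = rng.normal(scale=0.1, size=(n_hidden, n_in))
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))   # context-layer feedback
W_hy = rng.normal(scale=0.1, size=(n_out, n_hidden))
b_h, b_y = np.zeros(n_hidden), np.zeros(n_out)

def elman_forward(x_seq):
    """x_seq: (T, n_in) feature sequence; returns (T, n_out) per-frame scores."""
    h = np.zeros(n_hidden)                        # context starts empty
    outputs = []
    for x_t in x_seq:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)  # hidden state depends on past hidden state
        outputs.append(W_hy @ h + b_y)
    return np.array(outputs)

scores = elman_forward(rng.normal(size=(50, n_in)))   # 50 time steps
print(scores.shape)                                   # (50, 4)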

  22. Feature Extraction in Abstract Videos • Each image is read into a 2D matrix • Connected Component Analysis is performed • Bounding box is computed for each such connected component • Dynamic tracking is used to keep track of each object
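
A sketch of this per-frame step using OpenCV's connected-component analysis; the binarisation threshold, minimum component area and toy frame are illustrative assumptions, and the dynamic tracking across frames is omitted here.

import cv2
import numpy as np

def extract_boxes(frame_gray, min_area=20):
    """Return an (x, y, w, h) bounding box for each foreground blob in a grayscale frame."""
    _, binary = cv2.threshold(frame_gray, 10, 255, cv2.THRESH_BINARY)
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(binary)
    boxes = []
    for i in range(1, n):                          # label 0 is the background
        x, y, w, h, area = stats[i]
        if area >= min_area:                       # ignore tiny speckle components
            boxes.append((int(x), int(y), int(w), int(h)))
    return boxes

# Toy frame: two bright rectangles on a black background.
frame = np.zeros((100, 100), np.uint8)
frame[10:30, 10:30] = 255
frame[60:80, 50:90] = 255
print(extract_boxes(frame))                        # e.g. [(10, 10, 20, 20), (50, 60, 40, 20)]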

  23. Working with Real Videos • Challenges • Noise in real world videos • Illumination Changes • Occlusions • Extracting Depth Information • Our Setup • Camera is fixed at head height. • Angle of depression is 0 degrees (approx.). • Video

  24. Background Subtraction • Learn on still background images • Find pixel intensity distributions • Classify each pixel as background if |P(x,y) − µ(x,y)| < k·σ(x,y) • Remove Shadows: special case of reduced illumination, S = k·P where k < 1.0
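
A minimal numpy sketch of this per-pixel Gaussian background model; the threshold k and the shadow-ratio band are illustrative assumptions.

import numpy as np

def learn_background(frames):
    """frames: (N, H, W) stack of still grayscale background frames."""
    stack = np.asarray(frames, dtype=float)
    return stack.mean(axis=0), stack.std(axis=0) + 1e-6     # per-pixel mu, sigma

def foreground_mask(frame, mu, sigma, k=2.5):
    """Mark a pixel as foreground when |P(x,y) - mu(x,y)| >= k * sigma(x,y)."""
    frame = np.asarray(frame, dtype=float)
    fg = np.abs(frame - mu) >= k * sigma
    # Shadow removal: a shadowed pixel is roughly the background dimmed,
    # P ~ s * mu with s < 1, so drop foreground pixels whose ratio falls in that band.
    ratio = frame / (mu + 1e-6)
    shadow = fg & (ratio > 0.5) & (ratio < 0.95)
    return fg & ~shadow

# Usage: mu, sigma = learn_background(background_frames); mask = foreground_mask(frame, mu, sigma)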

  25. Contd.. • Extract Human Blobs • By Connected Component Analysis • Bounding box is computed for each person • Track Human Blobs • Each object is tracked using a mean-shift tracking algorithm.
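
A sketch of the mean-shift tracking step with OpenCV, in the spirit of this slide: a hue histogram of the person blob is back-projected onto each new frame and cv2.meanShift moves the window towards the densest region. The video file name and the initial bounding box are hypothetical.

import cv2

cap = cv2.VideoCapture("real_video.avi")            # hypothetical input video
ok, frame = cap.read()
if not ok:
    raise SystemExit("could not read video")
x, y, w, h = 200, 120, 60, 160                       # initial person bounding box (hypothetical)

roi_hsv = cv2.cvtColor(frame[y:y+h, x:x+w], cv2.COLOR_BGR2HSV)
hist = cv2.calcHist([roi_hsv], [0], None, [16], [0, 180])    # hue histogram of the blob
cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)

term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
window = (x, y, w, h)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    backproj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
    _, window = cv2.meanShift(backproj, window, term)        # shifted (x, y, w, h)
    print(window)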

  26. Contd..

  27. Depth Estimation • Two approximations • Using Gibson’s affordances • Camera geometry • Affordances: visual cues • A human’s action is triggered by the environment itself. • A floor offers walk-on-ability • Every object affords certain actions to perceive, along with anticipated effects • A cup’s handle affords grasping, lifting, drinking

  28. Contd.. • Gibson’s model • The horizon is fixed at the head height of the observer. • Monocular Depth Cues • Interposition: an object that occludes another is closer. • Height in the visual field: the higher an object is in the visual field, the further away it is.

  29. Depth Estimation Pinhole Camera Model Mapping (X, Y, Z) to (x, y): x = X · f / Z, y = Y · f / Z. For the point of contact with the ground: Z ∝ 1 / y, X ∝ x / y
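
A small worked example of this ground-contact depth estimate in Python: with the optical axis horizontal and the camera at head height H, the image row y of the point where a person touches the floor gives Z = f·H / y, and the lateral position follows from x. The focal length and camera height below are illustrative assumptions.

def ground_point_depth(x_img, y_img, f=700.0, cam_height=1.7):
    """(x_img, y_img): ground-contact point in pixels relative to the image centre,
    with y_img > 0 below the horizon; returns (X, Z) in metres on the ground plane."""
    Z = f * cam_height / y_img        # depth is inversely proportional to y
    X = x_img * Z / f                 # lateral offset scales like x / y
    return X, Z

# A contact point lower in the image (larger y) is closer to the camera.
print(ground_point_depth(50.0, 100.0))   # approx. (0.85, 11.9)
print(ground_point_depth(50.0, 300.0))   # approx. (0.28, 3.97)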

  30. Depth plot for A chase B • Top view (Z-X plane)

  31. Results (contd.)

  32. Results (contd.)

  33. Results (contd.)

  34. Results • Separate-SRN-for-each-action • Trained & tested on different parts of the abstract video • Trained on abstract video and tested on real video • Single-SRN-for-all-actions • Trained on synthetic video and tested on real video

  35. Basis for Comparison Let the total time of the visual sequence for each verb be t time units

  36. Separate SRN for each action Framework: Abstract video

  37. Time Line comparison for Chase

  38. Separate SRN for each action Real video (action recognition only)

  39. Single SRN for all actions Framework: Real video

  40. Conclusions & Future Work • The sparse nature of the video eases visual analysis • Event structures are learned directly from the perceptual stream. • Extensions: learn fine nuances between event structures of related action words. • Learn morphological variations. • Extend the work towards Long Short-Term Memory (LSTM) networks. • Hierarchical acquisition of higher-level action verbs.
