How Machines Learn to Talk
Amitabha Mukerjee, IIT Kanpur
Work done with: Computer Vision: Profs. C. Venkatesh, Pabitra Mitra; Prithvijit Guha, A. Ramakrishna Rao, Pradeep Vaghela. Natural Language: Prof. Achla Raina, V. Shreeniwas.
Robotics Collaborations: IGCAR Kalpakkam; Sanjay Gandhi PG Medical Hospital
Visual Robot Navigation: Time-to-Collision-based Robot Navigation
Hyper-Redundant Manipulators The same manipulator can work in changing workspaces • Reconfigurable Workspaces / Emergency Access • Optimal Design of Hyper-Redundant Systems – SCARA and 3D
Planar Hyper-Redundancy: 4-link Planar Robot Motion Planning
Micro-Robots • Micro Soccer Robots (1999-) • 8cm Smart Surveillance Robot – 1m/s • Autonomous Flying Robot (2004) • Omni-directional platform (2002) Omni-Directional Robot Sponsor: DST. amit@iitk.ac.in
Flying Robot Start-Up at IIT Kanpur: Whirligig Robotics. Test flight of UAV; Inertial Measurement Unit (IMU) under commercial production.
Tracheal Intubation Device Assists the surgeon while inserting the breathing tube during general anaesthesia: ball-and-socket joint, aperture for oxygenation tube, aperture for fibre-optic video cable, endotracheal tube aperture, hole for suction tube, control cables, attachment points. Device for Intubation during General Anaesthesia Sponsor: DST / SGPGM. sens@iitk.ac.in
Draupadi’s Swayamvar: Can the arrow hit the rotating mark? Sponsor: Media Lab Asia
High DOF Motion Planning • Accessing Hard to Reach spaces • Design of Hyper-Redundant Systems • Parallel Manipulators • Sponsor: BRNS / MHRD • dasgupta@iitk.ac.in 10-link 3D Robot – Optimal Design
Multimodal Language Acquisition Consider a child observing a scene together with adults talking about it. Grounded language: symbols are grounded in perceptual signals. Uses simple videos with boxes and simple shapes, a standard stimulus in social psychology.
Objective To develop a computational framework for Multimodal Language Acquisition • acquiring the perceptual structure corresponding to verbs • using Recurrent Neural Networks as a biologically plausible model for temporal abstraction • adapting the learned model to interpret activities in real videos
Visually Grounded Corpus • Two psychological research films, one based on the classic Heider & Simmel (1944) and the other based on Hide & Seek • These animations portray motion paths of geometric figures (big square, small square & circle) • Example clip: ChaseAlt
Cognate clustering • Similarity clustering: different expressions for the same action, e.g. “move away from center” vs “go to a corner” • Frequency: remove infrequent lexical units • Synonymy: sets of lexical units used consistently in the same intervals, to mark the same action, for the same set of agents.
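The clustering steps above can be sketched in a few lines. This is only an illustrative toy (the function name, threshold, and greedy merge strategy are assumptions, not the slides' actual algorithm): filter out infrequent lexical units, then group units whose sets of utterance intervals overlap strongly, as a crude synonymy test.

```python
def cluster_cognates(usages, min_freq=2, overlap=0.5):
    """usages: {lexical_unit: set of interval ids where it was uttered}.
    Drop infrequent units, then greedily merge units whose interval
    sets overlap (Jaccard similarity) above a threshold."""
    units = {u: ivs for u, ivs in usages.items() if len(ivs) >= min_freq}
    clusters = []
    for u, ivs in units.items():
        for c in clusters:
            rep = units[c[0]]                      # compare to cluster representative
            if len(rep & ivs) / len(rep | ivs) >= overlap:
                c.append(u)
                break
        else:
            clusters.append([u])
    return clusters

usages = {
    "move away": {1, 2, 3},
    "go to corner": {1, 2, 4},   # co-occurs with "move away" -> merged
    "chase": {7, 8, 9},
    "hit": {5},                  # too infrequent -> dropped
}
clusters = cluster_cognates(usages)
```

With these toy intervals, "move away" and "go to corner" land in one cluster and "chase" in its own.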
Perceptual Process: Video → Feature Extraction → Features, and Descriptions → Cognate Clustering → Events; the two streams form the multimodal input to a trained Simple Recurrent Network (VICES).
Design of Feature Set • The features selected here relate to spatial aspects of conceptual primitives in children, such as position, relative pose, velocity, etc. • Use features that are kinematic in nature: temporal derivatives or simple transforms of the basic ones.
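A minimal sketch of such kinematic features, computed from two agents' trajectories (the function name and exact feature set are illustrative assumptions): positions, their temporal derivatives, relative distance, and the rate at which the agents approach each other.

```python
import numpy as np

def kinematic_features(pos_a, pos_b, dt=1.0):
    """Per-frame features from (T, 2) position arrays: position and
    velocity of agent A, inter-agent distance, and approach rate."""
    vel_a = np.gradient(pos_a, dt, axis=0)   # temporal derivative of position
    rel = pos_b - pos_a
    dist = np.linalg.norm(rel, axis=1)       # relative distance
    approach = np.gradient(dist, dt)         # negative while closing in
    return np.column_stack([pos_a, vel_a, dist, approach])

# Toy example: A stays put while B moves straight toward it
A = np.zeros((5, 2))
B = np.column_stack([np.linspace(4, 0, 5), np.zeros(5)])
F = kinematic_features(A, B)                 # one 6-d feature vector per frame
```

Feature vectors like these, one per frame, are what a temporal classifier would consume.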
Video → Feature Extraction → Features, and Descriptions → Cognate Clustering → Events, feeding a trained Simple Recurrent Network. VICES: VIdeo and Commentary for Event Structures.
The classification problem • The problem is one of time-series classification • Possible methodologies include: • Logic-based methods • Hidden Markov Models • Recurrent Neural Networks
Elman Network • Commonly a two-layer network with feedback from the first-layer output to the first-layer input • Elman networks detect and generate time-varying patterns • They can also learn spatial patterns
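A minimal forward pass of an Elman network in NumPy, to make the feedback loop concrete. The sizes and weights here are illustrative assumptions, not the network from the slides; the key point is that the hidden (context) state from the previous step feeds back into the hidden layer.

```python
import numpy as np

def srn_step(x, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One step of an Elman / Simple Recurrent Network: the previous
    hidden state h_prev re-enters the hidden layer via W_hh."""
    h = np.tanh(W_xh @ x + W_hh @ h_prev + b_h)    # context feedback
    y = 1.0 / (1.0 + np.exp(-(W_hy @ h + b_y)))    # sigmoid output
    return h, y

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 6, 8, 1                       # toy sizes
W_xh = rng.normal(0, 0.1, (n_hid, n_in))
W_hh = rng.normal(0, 0.1, (n_hid, n_hid))
W_hy = rng.normal(0, 0.1, (n_out, n_hid))
b_h, b_y = np.zeros(n_hid), np.zeros(n_out)

h = np.zeros(n_hid)                                # initial context
for x in rng.normal(size=(10, n_in)):              # a 10-frame feature sequence
    h, y = srn_step(x, h, W_xh, W_hh, W_hy, b_h, b_y)
```

Because h carries information forward across frames, the output at each step depends on the history of the sequence, which is what makes the architecture suitable for time-series classification.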
Feature Extraction in Abstract Videos • Each image is read into a 2D matrix • Connected component analysis is performed • A bounding box is computed for each connected component • Dynamic tracking is used to follow each object across frames
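The first three steps can be sketched with `scipy.ndimage` (the toy frame below is an assumed stand-in for a frame of the abstract video, not data from the actual corpus):

```python
import numpy as np
from scipy import ndimage

# Toy binary frame containing two "shapes"
frame = np.zeros((10, 10), dtype=int)
frame[1:4, 1:4] = 1          # a small square
frame[6:9, 5:9] = 1          # a rectangle

labels, n = ndimage.label(frame)        # connected component analysis
slices = ndimage.find_objects(labels)   # one slice pair per component
# Bounding box per component as (row_min, col_min, row_max, col_max)
bboxes = [(s[0].start, s[1].start, s[0].stop, s[1].stop) for s in slices]
```

Tracking then amounts to associating each frame's boxes with those of the previous frame, e.g. by nearest centroid.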
Working with Real Videos • Challenges • Noise in real-world videos • Illumination changes • Occlusions • Extracting depth information • Our Setup • Camera fixed at head height • Angle of depression approximately 0 degrees
Background Subtraction • Learn on still background images • Find per-pixel intensity distributions • Classify pixel (x, y) as background if |P(x, y) − µ(x, y)| < k·σ(x, y) • Remove shadows: a special case of reduced illumination, S = k·P where k < 1.0
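A minimal sketch of this per-pixel Gaussian background model (function names, the threshold k = 2.5, and the synthetic frames are assumptions for illustration):

```python
import numpy as np

def fit_background(frames):
    """Per-pixel mean and std-dev from still background frames."""
    stack = np.stack(frames).astype(float)
    return stack.mean(axis=0), stack.std(axis=0) + 1e-6

def foreground_mask(frame, mu, sigma, k=2.5):
    """A pixel is foreground when |P - mu| >= k * sigma."""
    return np.abs(frame - mu) >= k * sigma

rng = np.random.default_rng(1)
# 20 noisy frames of a static background around intensity 100
bg_frames = [100 + rng.normal(0, 2, (8, 8)) for _ in range(20)]
mu, sigma = fit_background(bg_frames)

frame = 100 + rng.normal(0, 2, (8, 8))
frame[2:4, 2:4] = 200                    # a bright moving "object"
mask = foreground_mask(frame, mu, sigma)
```

The shadow rule from the slide would be applied afterwards: pixels whose intensity is a constant fraction k < 1 of the background value are reclassified as background.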
Contd. • Extract human blobs by connected component analysis • A bounding box is computed for each person • Track human blobs: each is tracked using a mean-shift tracking algorithm
Depth Estimation • Two approximations • Using Gibson’s affordances • Camera geometry • Affordances: visual cues • A human’s action is triggered by the environment itself • A floor offers walk-on ability • Every object affords certain actions to perceive, along with anticipated effects • A cup’s handle affords grasping-lifting-drinking
Contd. • Gibson’s model • The horizon is fixed at the head height of the observer • Monocular depth cues • Interposition: an object that occludes another is closer • Height in the visual field: the higher an object appears, the farther away it is
Pinhole Camera Model Mapping (X, Y, Z) to (x, y): x = f·X / Z, y = f·Y / Z. For the point of contact with the ground: Z ∝ 1 / y and X ∝ x / y. Depth Estimation
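With a level camera at height H above the floor, a ground-contact point at depth Z projects to y = f·H / Z below the horizon, giving Z = f·H / y and X = x·Z / f. A small sketch (the focal length, camera height, and pixel coordinates below are assumed example values, not the paper's calibration):

```python
def ground_depth(x_img, y_img, f, cam_height):
    """Depth of a ground-contact point under the pinhole model with a
    level camera. y_img is measured in pixels downward from the horizon
    (image center); f is the focal length in pixels."""
    Z = f * cam_height / y_img   # Z proportional to 1/y
    X = x_img * Z / f            # X proportional to x/y
    return X, Z

# Feet of a person imaged 50 px below the horizon,
# f = 500 px, camera at head height 1.6 m:
X, Z = ground_depth(x_img=20, y_img=50, f=500, cam_height=1.6)
# Z = 500 * 1.6 / 50 = 16.0 m, X = 20 * 16 / 500 = 0.64 m
```

This also explains the "height in the visual field" cue: smaller y (closer to the horizon) means larger Z.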
Depth plot for “A chases B” • Top view (Z-X plane)
Results • Separate-SRN-for-each-action • Trained & tested on different parts of the abstract video • Trained on abstract video and tested on real video • Single-SRN-for-all-actions • Trained on synthetic video and tested on real video
Basis for Comparison Let the total time of the visual sequence for each verb be t time units
Separate SRN for each action Real video (action recognition only)
Conclusions & Future Work • The sparse nature of the videos makes visual analysis easier • Event structures are learned directly from the perceptual stream • Extensions: learn fine nuances between event structures of related action words • Learn morphological variations • Extend the work to Long Short-Term Memory (LSTM) networks • Hierarchical acquisition of higher-level action verbs