
Vision Strategies and “Tools You Can Use”


Presentation Transcript


  1. Vision Strategies and “Tools You Can Use” RSS II Lecture 5 September 16, 2005 Prof. Teller

  2. Today • Strategies • What do we want from a vision system? • Tools • Pinhole camera model • Vision algorithms • Development • Carmen Module APIs, brainstorming

  3. Vision System Capabilities • Material identification • Are there bricks in vicinity? If so, where? • Motion freedom • What local motion freedom does robot have? • Manipulation support • Is brick pose amenable to grasping/placement? • Is robot pose correct for grasping/placement? • Localization • Where is robot, with respect to provided map? • Which way to home base? To a new region? • Has robot been in this region before?

  4. Material Identification • Detecting bricks when they’re present • How? • Locate bricks • In which coordinate system? • Estimate range and bearing – how?

  5. Gauge distance from apparent size? • Yes – under what assumption? [Figure: pinhole camera with axes v, y, z; image plane at z = +1]

  6. Pinhole Camera Model (physical) • Pinhole at world origin O; world coordinates (x, y, z) • World point P = (0, y, z) projects to image point p = (0, -y/z) on the physical image plane at z = -1, inside the camera enclosure, with image coordinates (u, v) • Notes: • Diagram is drawn in the plane x = 0 • Image-space u-axis points out of diagram • World-space x-axis points out of diagram • Both coordinate systems are left-handed

  7. Pinhole Camera Model (virtual) • “Virtual” image plane placed 1 unit in front of pinhole (at z = +1); no inversion • World points P1 = (0, y1, z1) and P2 = (0, y2, z2) project to image point p = (0, y/z) • All points along ray Op project to image point p!
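A minimal sketch (not part of the original slides) of the slide-6/7 projection: a pinhole at the origin maps P = (x, y, z) to p = (x/z, y/z) on the virtual image plane z = +1, so every point along the ray Op lands on the same image point and depth is lost.

```python
import numpy as np

def project(P):
    """Project a world point P = (x, y, z), z > 0, through a pinhole at the
    origin onto the virtual image plane at z = +1."""
    x, y, z = P
    if z <= 0:
        raise ValueError("point must lie in front of the pinhole (z > 0)")
    return np.array([x / z, y / z])

# All points along the ray Op project to the same image point:
print(project((0.0, 2.0, 4.0)))   # [0.  0.5]
print(project((0.0, 1.0, 2.0)))   # [0.  0.5]  -- same image point, different depth
```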

  8. Perspective: Apparent Size • Apparent object size decreases with depth (perp. distance from camera image plane)

  9. Perspective: Apparent Size

  10. What assumptions yield depth?

  11. Ground plane assumption • Requires additional “metric” information • Think of this as a constraint on camera, world structure • Plane in scene, with two independent marked lengths • Can measure distance to, or size of, objects on the plane • … but where do the marked lengths come from? [Figure: ground plane with lengths marked at 1 m, 2 m, 3 m, 4 m]

  12. Camera Calibration • Maps 3D world points wP to 2D image points Ip • Map can be factored into two operations: • Extrinsic (rigid-body) calibration A (situates camera in world) • Intrinsic calibration K (warps rays through optics onto image) • A3x4 = (R3x3 t3x1) • Ip = K (1/cZ) A wP = K (1/cZ) (R t) wP • Ip3x1 = K3x3 (1/cZ) A3x4 wP4x1 [Figure: world coordinates wX, wY, wZ (arbitrary choice); camera coordinates cX, cY, cZ with origin cO (e.g., cm); image coordinates u, v (pixels), width x height, with principal point (u0, v0)]

  13. World-to-Camera Transform • Relabels world-space points w.r.t. camera body • Extrinsic (rigid-body) calibration (situates camera in world) • A3x4 = (R3x3 t3x1) • cP = (1/cZ) (R t) wP • cP3x1 = (1/cZ) A3x4 wP4x1 • Note effect of division by cZ; no scaling necessary! [Figure: world point wP re-expressed as camera point cP on the plane cZ = 1; world coordinates arbitrary, camera coordinates e.g. cm]

  14. Camera-to-Image Transform • Maps 2D camera points to 2D image plane • Models ray path through camera optics and body to CCD • Ip = K cP; Ip3x1 = K3x3 cP3x1 • Matrix K captures the camera’s intrinsic parameters: • a, b: horizontal, vertical scale factors (equal iff pixel elements are square) • u0, v0: principal point, i.e., point at which optical axis pierces image plane • c: image element (CCD) skew, usually ~0 • K3x3 = [ a c u0 ; 0 b v0 ; 0 0 1 ] [Figure: camera point cP on the plane cZ = 1 mapped to image point Ip; image coordinates u, v (pixels), width x height, principal point (u0, v0)]
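A sketch of the slide-14 camera-to-image step, assuming the standard intrinsic-matrix layout written out above; the numeric values of a, b, u0, v0 are invented for illustration.

```python
import numpy as np

a, b, c = 800.0, 800.0, 0.0      # square pixels => a == b; CCD skew c ~ 0
u0, v0 = 320.0, 240.0            # principal point (pixels)

K = np.array([[a,   c,   u0],
              [0.0, b,   v0],
              [0.0, 0.0, 1.0]])

def camera_to_image(cP):
    """Map a camera-frame point cP = (cX, cY, cZ) to pixels: Ip = K (cP / cZ)."""
    cP = np.asarray(cP, dtype=float)
    Ip = K @ (cP / cP[2])        # divide by depth, then apply intrinsics
    return Ip[:2]                # (u, v)

print(camera_to_image((0.1, -0.05, 2.0)))   # -> [360. 220.]
```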

  15. End-to-End Transformation • A3x4 = (R3x3 t3x1) • Ip = K (1/cZ) A wP = K (1/cZ) (R t) wP • Ip3x1 = K3x3 (1/cZ) A3x4 wP4x1 [Figure: world point wP (world coordinates, arbitrary choice) relabeled via A into camera coordinates (e.g., cm, plane cZ = 1), then mapped via K to image point Ip (pixels), with principal point (u0, v0)]
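Putting the two factors together, a sketch of the slide-15 end-to-end map Ip = K (1/cZ) (R t) wP; the extrinsics here are the identity (camera frame coincident with the world frame) purely for illustration.

```python
import numpy as np

def world_to_image(wP, K, R, t):
    """Project a world point wP = (wX, wY, wZ) to pixel coordinates."""
    cP = R @ np.asarray(wP, dtype=float) + t    # extrinsics A = (R t)
    Ip = K @ (cP / cP[2])                       # normalize by cZ, apply K
    return Ip[:2]

K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
R, t = np.eye(3), np.zeros(3)                   # camera frame = world frame
print(world_to_image((0.1, -0.05, 2.0), K, R, t))
```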

  16. Example: Metric Ground Plane • Make camera-frame and world-frame coincident. Thus R = I3x3, t = 03x1, A3x4 = (R t) as before • Lay out a tape measure on the line x = 0, y = -h • Mark off points at (e.g.) 50-cm intervals • What is the functional form of the map (u, v) = f(wX, wY, wZ)? • Ip = K (1/cZ) A wP = K (1/cZ) I wP • For a tape point (0, -h, cZ): Ip = K (1/cZ) (0, -h, cZ)T = K (0, -h/cZ, 1)T = (u0, -b h/cZ + v0, 1)T • So v = v0 - b h/cZ • Measure h; observe cZi, vi; repeatedly solve for u0, b, v0 [Figure: camera at height h above the ground plane y = -h; image plane at z = +1; marked points at z1 = 2, z2 = 3, z3 = 4]
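A sketch of the slide-16 fit, using the relation v = v0 - b h/cZ derived above: given the camera height h, the tape-measure depths cZi, and the observed image rows vi, b and v0 drop out of a linear least-squares fit (u0 would come from the column coordinate in the same way). The measurements below are synthetic, generated with v0 = 240 and b = 800.

```python
import numpy as np

h = 0.30                                         # camera height above plane (m)
cZ = np.array([2.0, 3.0, 4.0, 5.0])              # marked depths along the tape
v_obs = np.array([120.0, 160.0, 180.0, 192.0])   # observed image rows (pixels)

# v = v0 + b * (-h / cZ)  ->  least-squares system  [1, -h/cZ] [v0, b]^T = v
A = np.column_stack([np.ones_like(cZ), -h / cZ])
(v0, b), *_ = np.linalg.lstsq(A, v_obs, rcond=None)
print(f"v0 ~ {v0:.1f} px, b ~ {b:.1f} px")       # recovers 240, 800

# Once calibrated, depth of a ground-plane point follows from its image row:
# cZ = b * h / (v0 - v)
```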

  17. Vision System Capabilities • Material identification • Are there bricks in vicinity? If so, where? • Motion freedom • What local motion freedom does robot have? • Manipulation support • Is brick pose amenable to grasping/placement? • Is robot pose correct for grasping/placement? • Localization • Where is robot, with respect to provided map? • Which way to home base? To a new region? • Has the robot been in this region before?

  18. Motion Freedom • What can be inferred from image?

  19. Freespace Map • Discretize bearing; classify surface type

  20. Freespace Map Ideas • Use simple color classifier • Train on road, sidewalk, grass, leaves etc. • Training could be done offline, or in a start-of-mission calibration phase adapted from RSS II Lab 2 • For each wedge of disk, could report distance to nearest obstruction • Careful: how will your code deal with varying lighting conditions? • Finally: can fuse (or confirm) with laser data
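A rough sketch of the slide-20 wedge idea, assuming OpenCV is available; the HSV threshold rule is an invented placeholder for a trained road/sidewalk/grass classifier, and the per-wedge "fraction traversable" stands in for a real distance-to-obstruction estimate.

```python
import numpy as np
import cv2

def freespace_by_wedge(bgr_image, n_wedges=16):
    """For each bearing wedge (approximated here by a column band), return the
    fraction of pixels classified as traversable."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    s, v = hsv[:, :, 1], hsv[:, :, 2]
    # Placeholder rule: low saturation, mid brightness ~ asphalt / sidewalk.
    traversable = (s < 60) & (v > 80) & (v < 200)
    bands = np.array_split(np.arange(bgr_image.shape[1]), n_wedges)
    return [float(traversable[:, cols].mean()) for cols in bands]
```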

  21. Vision System Capabilities • Material identification • Are there bricks in vicinity? If so, where? • Motion freedom • What local motion freedom does robot have? • Manipulation support • Is brick pose amenable to grasping/placement? • Is robot pose correct for grasping/placement? • Localization • Where is robot, with respect to provided map? • Which way to home base? To a new region? • Has the robot been in this region before?

  22. Manipulation Support • Two options • Manipulate brick into appropriate grasp pose • Plan motion to approach the (fixed-pose) brick • How? Hint: compute moments • How to disambiguate edge-on, end-on? [Figure: initial pose → manipulation and/or motion plan → desired pose]
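One way to act on the "compute moments" hint, as a sketch: second-order central moments of a segmented brick mask give its centroid and principal-axis orientation, and the ratio of the principal moments (elongation) is one handle on the edge-on vs. end-on ambiguity. Segmentation itself is assumed done elsewhere.

```python
import numpy as np

def blob_pose(mask):
    """Return centroid (row, col), orientation (rad, w.r.t. image columns),
    and elongation of a binary blob."""
    rows, cols = np.nonzero(mask)
    r0, c0 = rows.mean(), cols.mean()
    dr, dc = rows - r0, cols - c0
    mu_rr, mu_cc, mu_rc = (dr * dr).mean(), (dc * dc).mean(), (dr * dc).mean()
    theta = 0.5 * np.arctan2(2.0 * mu_rc, mu_cc - mu_rr)   # principal axis
    elongation = max(mu_rr, mu_cc) / max(min(mu_rr, mu_cc), 1e-9)
    return (r0, c0), theta, elongation
```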

  23. Vision System Capabilities • Material identification • Are there bricks in vicinity? If so, where? • Motion freedom • What local motion freedom does robot have? • Manipulation support • Is brick pose amenable to grasping/placement? • Is robot pose correct for grasping/placement? • Localization • Where is robot, with respect to provided map? • Which way to home base? To a new region? • Has the robot been in this region before?

  24. Localization support • Localization w.r.t. a known map • See localization lecture from RSS I • Features: curb cuts, vertical building edges • Map format not yet defined – one of your tasks [Figure: landmarks L1, L2, L3 constrain the locus of likely poses {P, θ}]

  25. Localization support • Weaker localization model • Create (virtual) landmarks at intervals • Chain each landmark to predecessor • Recognize when landmark is revisited • Record direction from landmark to its neighbors • … Is this map topological or metrical? • … Does it support homing? Exploration?
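A minimal sketch of the slide-25 structure (names invented, not a defined interface): each virtual landmark stores a visual signature, a link to its predecessor, and the recorded directions to neighbors, giving a topological map that supports homing by walking the predecessor chain.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Landmark:
    lid: int
    signature: List[float]              # e.g., a coarse hue histogram
    predecessor: Optional[int]          # link back toward home (None = home base)
    neighbor_dirs: Dict[int, float] = field(default_factory=dict)  # lid -> bearing (rad)

def homeward_chain(landmarks: Dict[int, Landmark], start_lid: int) -> List[int]:
    """Follow predecessor links from start_lid back to the home landmark."""
    path, lid = [], start_lid
    while lid is not None:
        path.append(lid)
        lid = landmarks[lid].predecessor
    return path
```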

  26. Visual basis for landmarks • Desire a visual property that: • Is nearly invariant to large robot rotations • Is nearly invariant to small robot translations • Has a definable scalar distance d(b,c) [why?] • Possible approaches: • Hue histograms (coarsely discretized) • Ordered hue lists (e.g., of vertical strips) • Skylines (must segment ground from sky) • Some hybrid of vision, laser data • Careful: think about ambiguity in scene [Figure: landmarks a, b and a new view c – which earlier landmark, if any, does c match?]
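A sketch of the first approach on slide 26: a coarsely discretized hue histogram as the landmark signature, with an L1 histogram difference as the scalar distance d(b, c). The bin count and the choice of distance are assumptions; OpenCV represents hue on 0..179.

```python
import numpy as np
import cv2

def hue_signature(bgr_image, n_bins=12):
    """Coarse, normalized hue histogram of a whole image."""
    hue = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)[:, :, 0]
    hist, _ = np.histogram(hue, bins=n_bins, range=(0, 180))
    return hist / max(hist.sum(), 1)

def signature_distance(sig_b, sig_c):
    """Scalar distance d(b, c) between two signatures."""
    return float(np.abs(np.asarray(sig_b) - np.asarray(sig_c)).sum())
```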

  27. Today • Strategies • What do we want from a vision system? • Tools • Pinhole camera model • Vision algorithms • Development • Carmen Module APIs, brainstorming

  28. Carmen Module APIs • Vision module handles (processes) image stream • Must export more compact representation than images • What representation(s) should module export? • Features? Distances? Landmarks? Directions? Maps? • What questions should module answer? • Are there collectable blocks nearby? Where? • Has robot been here before? With what confidence? • Which direction(s) will get me closer to home? • Which direction(s) will explore new regions? • Which directions are physically possible for robot? • What non-vision data should module accept? • Commands to establish a new visual landmark? • Notification of rotation in place? Translation? • Spiral development • Put simple APIs in place, even if performance is stubbed • Get someone else to exercise them; revise appropriately
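A hypothetical sketch of what the exported representation might look like; these names and fields are invented here and are not the actual Carmen message definitions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class FreespaceWedge:
    bearing: float                    # radians, robot frame
    clearance: float                  # meters to nearest obstruction

@dataclass
class VisionSummary:
    timestamp: float
    blocks: List[Tuple[float, float]]           # (range_m, bearing_rad) of candidate bricks
    freespace: List[FreespaceWedge] = field(default_factory=list)
    landmark_id: Optional[int] = None           # matched landmark, if any
    revisit_confidence: float = 0.0             # 0..1
    homeward_bearing: Optional[float] = None    # radians, if known

# Non-vision inputs the module might accept (again hypothetical): a command to
# establish a new landmark, and notifications of rotation or translation so the
# freespace map can be counter-rotated or invalidated.
```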

  29. Conclusion • One plausible task decomposition • Bottom-up operating scenario • Related existing, new vision tools • Functional view: what are APIs? • Exhortation to spiral development

  30. Vision lecture plan (Seth)
  • Review pinhole camera model
  • Show that, since depth is unknown, the scale of a scene object is unknown (show picture of Rob and me?)
  • But: under certain assumptions, we can measure an object in the scene
    • Simplest such assumption is the "ground plane" assumption
    • Show how the extent of the object can be "read off" from the image
    • Show intuitive trig, then show in terms of calibration matrix K
  • How do we know that the object isn't just a "flash in the pan"? I.e., it could be a transient object: a person, animal, something blowing in the wind
    • Answer: look for confirmatory evidence over multiple image frames, and from multiple vantage points
    • Check whether the thing appears to have roughly constant: position, size, color
    • Can get arbitrarily more demanding, looking for constant: shape, orientation, texture (surface color variation)
    • But of course the cost of cranking the validation criteria up is fewer stable objects inferred in the world
  • Instance of the "structure from motion" problem (motion refers to monocular camera motion, providing an effective stereo baseline): given a collection of features fij (jth feature observed by camera i), estimate:
    • positions xj, yj of the jth feature in world coordinates
    • poses Xi, Yi, thetai of the ith camera in world coordinates
    • ... such that the xy and XY are expressed in the same coordinates!
  • But where does knowledge of camera coordinates come from? Answer: we assume it! Old trick: iterative algorithm. Take a reasonable initial guess for the cameras, from e.g. odometry, then repeat in alternation:
    • fix camera estimates, and from bearings, solve for feature locations
    • fix feature locations, and from bearings, solve for camera poses
  • But where does knowledge of matching features across frames come from? This is called the "correspondence" or "data association" problem, a fundamental problem in computer vision. Our visual system solves it effortlessly, with a massively parallel analog processor, but we cannot introspect about how it is done, and no machine vision algorithm comes even close to human performance even under ideal conditions (high contrast, textured, unambiguous shapes). Under poor conditions (dark, fuzzy) vision algorithms fail. Another example of the AI adage: "the easy problems are hard" (Pinker, 1995, The Language Instinct)
  • How to tie the world coordinates back to an external reference? Could provide, e.g., GPS coordinates for the first camera or for the first feature, or could express everything in terms of the first camera in the sequence, i.e., define its pose w.l.o.g. to be x, y, theta = 0, 0, 0
  • Brainstorm: what sort of vision operations would you like to have available?
    • Segmentation: identify object(s) of interest in scene; use HSV color model, and some kind of clustering technique
    • Measure size of object(s): use the ground plane assumption; this gives a bearing-and-range sensor
    • Confirmation: theta map of free space, e.g., angles into which the bot can roll; requires training on color of asphalt, sidewalk, grass, etc.; perhaps this could be done offline, or in a first, high-latency calibration phase (adapt from Lab 2?); note: for each wedge of disk, could report how many meters to obstruction; careful: how will you deal with varying lighting conditions?
    • Semantic map of free space: make a "best guess" as to local neighborhood
  • Revisitation probability: say the bot puts down waypoints periodically, and retains a chain/tree of waypoints rooted at (and leading back to) home. How can the vision system infer, with high probability, that it is reoccupying an earlier-created waypoint, rather than occupying a region for the first time?
    • Idea: compute a visual/geometric "signature" at each location; can combine vision and laser data (or use either alone); here I will concentrate on vision
    • Possibility: HSV pixel histogram. Careful: want invariance to modest rotations, translations (why?). Histogram the whole scene? Or divide the world into vertical strips? Or histogram a disk, half-disk, or wedge on the ground plane, inside the camera FOV? Group that's excited about vision: think!
    • Or could look for vertical strips (think Fourier or DCT) of texture, corresponding to built structure off of the ground plane
    • Another possibility is a geometric signature: look for a constellation of 2D or 3D points or edges in the world
    • Or could build on the free-space map: write a distance function that relates two free-space maps; periodically save a map, and compare newly acquired maps to it; when variation grows, put a new waypoint down
    • Another possibility: if you can reliably segment sky from ground/built structure, you can build a "skyline" from any position, by rotating in place and concatenating; this skyline could also serve as a reliable signature. How to segment sky from ground? 1) texture frequency (careful: clouds vs. blue sky) 2) color/hue
  • How do you compute revisitation? Compute a map, then search the stored maps for the one that is closest. Gotchas:
    • First, the map may not change much over a long distance, e.g., going down a narrow roadway. Incorporate a color histogram?
    • Second, don't want to search O(t) stuff, where t is the length of the mission so far. How to avoid? Keep an estimate of current position and uncertainty; search only waypoints within this envelope. Identify in O(log t) time; hopefully come up with O(1) relevant waypoints/signatures to search
  • Also think in terms of spiral development: how can you bring up a useful architecture and API for a capability, even if its performance is poor at the start? Need some way for the vision module to handle the image stream, and export a much more compact representation to other modules! Think about what the others want to "know" from the vision system; that leads to what sorts of questions they'll ask: Are there any collectable blocks in the vicinity? Is this somewhere we've been before? In which directions can I proceed in the short term, and for what distance?
  • Also, how do you tell the module about your motion? Example: if you rotate in place pi/6 radians, should the free-space map simply counter-rotate by the same amount? Or should that be the responsibility of a higher-level module? If some behavior results in robot translation, need you tell the vision module about this?
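A sketch of the revisitation check described in the notes: compare the current signature only against waypoints inside the current pose-uncertainty envelope rather than all O(t) of them (a spatial index such as a k-d tree would give the O(log t) lookup; the linear filter below is just for clarity). The signature distance and acceptance threshold are placeholders.

```python
import numpy as np

def revisited_waypoint(current_sig, current_xy, sigma, waypoints, threshold=0.2):
    """waypoints: iterable of (wid, xy, signature). Returns the id of the best
    matching nearby waypoint, or None if nothing matches well enough."""
    best_wid, best_d = None, float("inf")
    for wid, xy, sig in waypoints:
        if np.linalg.norm(np.subtract(xy, current_xy)) > 3.0 * sigma:
            continue                   # outside the uncertainty envelope
        d = float(np.abs(np.subtract(sig, current_sig)).sum())
        if d < best_d:
            best_wid, best_d = wid, d
    return best_wid if best_d < threshold else None
```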
