As humans, we perceive the three-dimensional structure of the world around us with apparent ease. Think of how vivid the three-dimensional percept is when you look at a vase of flowers sitting on the table next to you. You can tell the shape and translucency of each petal through the subtle patterns of light and shading that play across its surface and effortlessly segment each flower from the background of the scene. Looking at a framed group portrait, you can easily count (and name) all of the people in the picture and even guess at their emotions from their facial appearance. Perceptual psychologists have spent decades trying to understand how the visual system works and, even though they can devise optical illusions to tease apart some of its principles, a complete solution to this puzzle remains elusive (Marr 1982; Palmer 1999; Livingstone 2008).
Researchers in computer vision have been developing, in parallel, mathematical techniques for recovering the three-dimensional shape and appearance of objects in imagery. We now have reliable techniques for accurately computing a partial 3D model of an environment from thousands of partially overlapping photographs. Given a large enough set of views of a particular object or façade, we can create accurate dense 3D surface models using stereo matching. We can track a person moving against a complex background. We can even, with moderate success, attempt to find and name all of the people in a photograph using a combination of face, clothing, and hair detection and recognition. However, despite all of these advances, the dream of having a computer interpret an image at the same level as a two-year-old (for example, counting all of the animals in a picture) remains elusive. Why is vision so difficult? In part, it is because vision is an inverse problem, in which we seek to recover some unknowns given insufficient information to fully specify the solution. We must therefore resort to physics-based and probabilistic models to disambiguate between potential solutions. However, modeling the visual world in all of its rich complexity is far more difficult than, say, modeling the vocal tract that produces spoken sounds.
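To make the inverse-problem point concrete, here is a minimal numerical sketch in Python with NumPy (the 1-D "image", the Gaussian blur operator, the noise level, and the regularization weight are all illustrative assumptions, not a method prescribed by this text). The forward direction, blurring a signal, is easy; naively inverting it amplifies noise enormously, and only adding a prior, here a simple Tikhonov term, disambiguates among the many candidate solutions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Forward model: a 1-D "image" x is blurred by a known matrix A and
# corrupted by noise -- this is the easy, well-posed direction.
n = 50
x_true = np.zeros(n)
x_true[20:30] = 1.0                      # a simple box signal
A = np.array([[np.exp(-0.5 * ((i - j) / 2.0) ** 2) for j in range(n)]
              for i in range(n)])        # Gaussian blur operator
y = A @ x_true + 0.01 * rng.standard_normal(n)

# Naive inversion is swamped by noise: A is nearly singular, so many
# very different signals x explain the observation y almost equally well.
x_naive = np.linalg.solve(A, y)

# Tikhonov regularization adds a prior (prefer small solutions) that
# disambiguates among the candidates: min ||Ax - y||^2 + lam * ||x||^2.
lam = 1e-2
x_reg = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)

print("naive error:", np.linalg.norm(x_naive - x_true))
print("regularized error:", np.linalg.norm(x_reg - x_true))
```

The same pattern recurs throughout vision: the data alone under-determine the answer, and a physics-based or probabilistic prior supplies the missing constraints.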
The forward models that we use in computer vision are usually developed in physics (radiometry, optics, and sensor design) and in computer graphics. Both of these fields model how objects move and animate, how light reflects off their surfaces, is scattered by the atmosphere, is refracted through camera lenses (or human eyes), and is finally projected onto a flat (or curved) image plane. While computer graphics are not yet perfect (no fully computer-animated movie with human characters has yet succeeded at crossing the uncanny valley that separates real humans from android robots and computer-animated humans), in limited domains, such as rendering a still scene composed of everyday objects or animating extinct creatures such as dinosaurs, the illusion of reality is perfect.
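The simplest such forward model is the ideal pinhole camera, which projects a 3-D point onto the image plane by dividing by its depth. Below is a minimal sketch (the focal length and principal point are made-up values for illustration):

```python
import numpy as np

def project(points_3d, f=500.0, cx=320.0, cy=240.0):
    """Project 3-D camera-frame points onto the image plane with an
    ideal pinhole model: u = f*X/Z + cx, v = f*Y/Z + cy."""
    K = np.array([[f, 0, cx],
                  [0, f, cy],
                  [0, 0, 1.0]])          # intrinsic calibration matrix
    uvw = (K @ points_3d.T).T            # homogeneous image coordinates
    return uvw[:, :2] / uvw[:, 2:3]      # perspective divide by depth Z

# Two points at different depths project to different pixel offsets --
# exactly the depth cue that inverse (vision) algorithms must undo.
pts = np.array([[0.1, 0.0, 1.0],
                [0.1, 0.0, 2.0]])
print(project(pts))   # the farther point lands nearer the principal point
```

Running the forward model is a few lines; inverting it, recovering depth from the projected pixels, is where the difficulty of vision lies.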
In computer vision, we are trying to do the inverse, i.e., to describe the world that we see in one or more images and to reconstruct its properties, such as shape, illumination, and color distributions. It is amazing that humans and animals do this so effortlessly, while computer vision algorithms are so error-prone. People who have not worked in the field often underestimate the difficulty of the problem.
Real-world applications of computer vision
- Optical character recognition (OCR): reading handwritten postal codes on letters and automatic number plate recognition (ANPR);
- Machine inspection: rapid parts inspection for quality assurance using stereo vision with specialized illumination to measure tolerances on aircraft wings or auto body parts, or looking for defects in steel castings using X-ray vision;
- Retail: object recognition for automated checkout lanes;
- 3D model building (photogrammetry): fully automated construction of 3D models from aerial photographs used in systems such as Bing Maps;
- Medical imaging: registering pre-operative and intra-operative imagery or performing long-term studies of people’s brain morphology as they age;
- Match move: merging computer-generated imagery (CGI) with live-action footage by tracking feature points in the source video to estimate the 3D camera motion and shape of the environment (a point-tracking sketch follows this list). Such techniques are widely used in Hollywood;
- Motion capture (mocap): using retro-reflective markers viewed from multiple cameras or other vision-based techniques to capture actors for computer animation;
- Surveillance: monitoring for intruders, analyzing highway traffic, and monitoring pools for drowning victims;
- Fingerprint recognition and biometrics: for automatic access authentication as well as forensic applications.
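Match move, stabilization, and motion capture all rest on reliably tracking reference points from frame to frame. The sketch below illustrates one common approach, Shi-Tomasi corners followed with pyramidal Lucas-Kanade optical flow via OpenCV; the input filename is a placeholder, and the printed median shift merely stands in for the downstream motion-estimation step:

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("shot.mp4")       # hypothetical source footage
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

# Pick up to 200 strong corners (Shi-Tomasi) as reference points to track.
pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                              qualityLevel=0.01, minDistance=10)

while True:
    ok, frame = cap.read()
    if not ok or pts is None or len(pts) == 0:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Pyramidal Lucas-Kanade optical flow follows each corner into the
    # new frame; `status` flags the points that were found again.
    new_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
    good_new = new_pts[status.flatten() == 1]
    good_old = pts[status.flatten() == 1]

    # The per-point displacements feed camera-motion estimation
    # (match move) or a smoothing filter (stabilization).
    if len(good_new):
        print("median shift:", np.median(good_new - good_old, axis=0))

    prev_gray, pts = gray, good_new.reshape(-1, 1, 2)

cap.release()
```

A production match-move system would go further, fitting a full 3D camera model to the tracked points, but the frame-to-frame tracking shown here is the common foundation.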
Consumer-level applications of computer vision
- Stitching: turning overlapping photos into a single seamlessly stitched panorama
- Exposure bracketing: merging multiple exposures taken under challenging lighting conditions (strong sunlight and shadows) into a single perfectly exposed image
- Morphing: turning a picture of one of your friends into another, using a seamless morph transition
- 3D modeling: converting one or more snapshots into a 3D model of the object or person you are photographing
- Video match move and stabilization: inserting 2D pictures or 3D models into your videos by automatically tracking nearby reference points, or using motion estimates to remove shake from your videos
- Photo-based walkthroughs: navigating a large collection of photographs, such as the interior of your house, by flying between different photos in 3D
- Face detection: for improved camera focusing as well as more relevant image searching (see the sketch after this list)
- Visual authentication: automatically logging family members onto your home computer as they sit down in front of the webcam
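As an example of how accessible some of these capabilities have become, the sketch below runs OpenCV's bundled Haar-cascade face detector over a photograph; the image filenames and the detector parameters are illustrative assumptions:

```python
import cv2

# Load OpenCV's bundled frontal-face Haar cascade (ships with opencv-python).
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("group_portrait.jpg")          # hypothetical input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Multi-scale sliding-window detection; tune scaleFactor/minNeighbors
# to trade off recall against false positives.
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("faces.jpg", img)
print(f"found {len(faces)} face(s)")
```

A dozen lines suffice for a serviceable detector, which is precisely why face detection has migrated from research labs into cameras and photo-management software.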