Multiple View Geometry in Computer Vision

Richard Hartley, Andrew Zisserman


A basic problem in computer vision is to understand the structure of a real world scene given several images of it. Techniques for solving this problem are taken from projective geometry and photogrammetry. Here, the authors cover the geometric principles and their algebraic representation in terms of camera projection matrices, the fundamental matrix and the trifocal tensor. The theory and methods of computation of these entities are discussed with real examples, as is their use in the reconstruction of scenes from multiple images. The new edition features an extended introduction covering the key ideas in the book (which itself has been updated with additional examples and appendices) and significant new results which have appeared since the first edition. Comprehensive background material is provided, so readers familiar with linear algebra and basic numerical methods can understand the projective geometry and estimation algorithms presented, and implement the algorithms directly from the book.


Mentioned in questions and answers.

I'm an undergrad who finds computer vision to be fascinating. Where should somebody brand new to computer vision begin?

As with most other things at school, start by taking a course with a good amount of project work. Explore ideas and implement the algorithms you find interesting in those projects. Wikipedia is a good beginner's resource, as usual. If you want books, the most popular ones are:

  1. http://www.amazon.com/Multiple-View-Geometry-Computer-Vision/dp/0521540518
  2. http://www.amazon.com/Computer-Vision-Approach-David-Forsyth/dp/0130851981/
  3. http://research.microsoft.com/en-us/um/people/szeliski/book/drafts/SzeliskiBook_20100423_draft.pdf

But before you jump into books, I would suggest taking a course, or going through course slides from one of the top universities or via iTunesU.

Suppose I take a picture with a camera and I know the distance from the camera to the object, such as a scale model of a house. I would like to turn this into a 3D model that I can maneuver around, so I can comment on different parts of the house.

If I sit down and think about taking more than one picture and labeling direction and distance, I should be able to figure out how to do this, but I thought I would ask whether someone knows of a paper that explains it in more detail.

What language you explain in doesn't matter, as I am looking for the best approach.

Right now I am considering showing the house and letting the user provide some assistance with heights, such as the distance from the camera to the top of a given part of the model. Given enough of this, it should be possible to start calculating heights for the rest, especially if there is a top-down image plus pictures taken at angles from the four sides, from which relative heights can be calculated.

I also expect that parts will need to differ in color to help separate out the various parts of the model.

As mentioned, the problem is very hard and is often also referred to as multi-view object reconstruction. It is usually approached by solving the stereo-view reconstruction problem for each pair of consecutive images.

Performing stereo reconstruction requires pairs of images with a good amount of visible overlap of physical points. If you have a set of matching points between the two images (at least eight correspondences), then the stereo reconstruction problem can be solved for those matching points using only matrix theory. However, this requires a fair amount of theory about coordinate projections with homogeneous coordinates, as well as knowledge of the pinhole camera model and the camera matrix. More specifically, you would need to calculate the fundamental and essential matrices and then use triangulation to find the 3D coordinates of the points.

This whole stereo reconstruction would then be repeated for each pair of consecutive images (implying that you need an order to the images or at least knowledge of which images have many overlapping points). For each pair you can consider the matching positions of any 8 points.
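If the camera intrinsics are known, or can at least be approximated, a minimal sketch of this pipeline for one image pair in Python with OpenCV might look like the following. The file names and the intrinsic matrix K are placeholder assumptions, not something given in the question:

```python
import numpy as np
import cv2

# Placeholder inputs: pts1, pts2 are Nx2 arrays of matching pixel
# coordinates in the two images; K is an assumed 3x3 intrinsic matrix.
pts1 = np.loadtxt("matches_img1.txt")
pts2 = np.loadtxt("matches_img2.txt")
K = np.array([[700., 0., 320.],
              [0., 700., 240.],
              [0., 0., 1.]])

# Fundamental matrix via the (normalized) eight-point algorithm with RANSAC.
F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC)

# Essential matrix from F and the intrinsics, then the relative pose.
E = K.T @ F @ K
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)

# Triangulate with camera 1 at the origin and camera 2 at [R | t].
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R, t])
pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
pts3d = (pts4d[:3] / pts4d[3]).T  # homogeneous -> Euclidean, up to scale
```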

In your case you probably want a method that works without needing the camera parameters so that it works for unknown camera set-ups. For this you should probably look into methods for uncalibrated stereo reconstruction. My knowledge is actually quite thin on most of the theory, so the best I can do is to further provide you with some references that are hopefully useful (in order of relevance):

I'm not sure how helpful all of this is, but hopefully it includes enough useful terminology and references to find further resources. As you can see, this is far from a solved problem and is still actively researched. The less you want to do in an automated manner the more well-defined the problem becomes, but even in these cases quite a bit of theory is required to get started.

I am really interested in image processing. I downloaded OpenCV and started playing with it, but I think I lack the background knowledge behind image processing. I would like to learn the fundamentals of image processing.

I searched for open courses from MIT and other universities but didn't seem to find any good tutorial. I did find some slides, but they seem useless without the actual presentation. I searched for online tutorials, but most of them are not for beginners.

Is there a good online tutorial for image processing for beginners?

I really like Rich Szeliski's Computer Vision book which has a nice mix of theory and practice. You can also access the electronic drafts for free.

Other good ones are Hartley and Zisserman's Multiple View Geometry in Computer Vision and David Forsyth's Computer Vision: A Modern Approach.

In MATLAB I have calculated the fundamental matrix (of two images) using the normalized eight-point algorithm. From that I need to triangulate the corresponding image points in 3D space. From what I understand, to do this I would need the rotation and translation between the images' cameras. The easiest way, of course, would be to calibrate the cameras first and then take the images, but this is too constraining for my application, as it would require that extra step.

So that leaves me with auto (self) camera calibration. I see mention of bundle adjustment; however, in An Invitation to 3D Vision it seems to require an initial translation and rotation, which makes me think that either a calibrated camera is needed or my understanding falls short.

So my question is: how can I automatically extract the rotation/translation so that I can reproject/triangulate the image points into 3D space? Any MATLAB code or pseudocode would be fantastic.

Peter's MATLAB code should be very helpful to you, I think:

http://www.csse.uwa.edu.au/~pk/research/matlabfns/

Peter has posted a number of fundamental matrix solutions. The original algorithms are described in the Hartley and Zisserman book:

http://www.amazon.com/exec/obidos/tg/detail/-/0521540518/qid=1126195435/sr=8-1/ref=pd_bbs_1/103-8055115-0657421?v=glance&s=books&n=507846
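If you can at least approximate the camera intrinsics, the pose-extraction step looks roughly like the sketch below (in Python with OpenCV rather than MATLAB). The file names and the matrix K are placeholder assumptions; without any estimate of the intrinsics, the reconstruction is only determined up to a projective ambiguity:

```python
import numpy as np
import cv2

# Placeholder inputs: F is the 3x3 fundamental matrix from the normalized
# eight-point algorithm, pts1/pts2 are the Nx2 matched pixel coordinates
# used to estimate it, and K is an (approximate) intrinsic matrix.
F = np.loadtxt("F.txt")
pts1 = np.loadtxt("pts1.txt")
pts2 = np.loadtxt("pts2.txt")
K = np.array([[800., 0., 640.],
              [0., 800., 360.],
              [0., 0., 1.]])

E = K.T @ F @ K  # essential matrix from the fundamental matrix

# decomposeEssentialMat yields two rotations and a translation direction,
# i.e. four (R, t) candidates; recoverPose selects the one that places the
# triangulated points in front of both cameras (the cheirality check).
R1, R2, t_dir = cv2.decomposeEssentialMat(E)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)

# R and t (up to scale) define the camera matrices P1 = K[I | 0] and
# P2 = K[R | t], which can be passed to cv2.triangulatePoints.
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R, t])
```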

Also, while you are at it don't forget to see the fundamental matrix song :

http://danielwedge.com/fmatrix/

One fine composition, in my honest opinion!

Does anyone know any good book or web resource for geometric and mathematical fundamentals of augmented reality?

Thanks!

I'd recommend the following two books. Both are pricey but contain a lot of really useful material on projective geometry, which is what you need to know.

It's hard going, though, so unless you really want to understand the maths behind it, you may want to use a third-party library, as suggested above.

Multiple View Geometry in Computer Vision by Hartley and Zisserman

and

Three Dimensional Computer Vision: A Geometric Viewpoint by Faugeras

I am having quite a bit of trouble understanding the workings of plane-to-plane homography. In particular, I would like to know how the OpenCV method works.

Is it like ray tracing? How does a homogeneous coordinate differ from a scale*vector?

Everything I read is written as if you already know what they're talking about, so it's hard to grasp!

Googling "homography estimation" returns this as the first link (at least for me): http://cseweb.ucsd.edu/classes/wi07/cse252a/homography_estimation/homography_estimation.pdf. This is definitely a poor description, and a lot has been omitted. If you want to learn these concepts, reading a good book like Multiple View Geometry in Computer Vision would be far better than reading short articles. These short articles often contain several serious mistakes, so be careful.

In short, a cost function is defined, and the parameters (the elements of the homography matrix) that minimize this cost function are the answer we are looking for. A meaningful cost function is geometric, that is, it has a geometric interpretation. For the homography case, we want to find H such that, when points are transformed from one image to the other, the distance between all the points and their correspondences is minimized. This geometric cost function is nonlinear, which means: (1) an iterative method is generally needed to solve it, and (2) the iterative method requires an initial starting point.

This is where algebraic cost functions come in. These cost functions have no meaningful geometric interpretation. Designing them is often more of an art, and for a given problem you can usually find several algebraic cost functions with different properties. The benefit of algebraic costs is that they lead to linear optimization problems, so a closed-form (one-shot, non-iterative) solution exists. The downside is that the solution found is not optimal. Therefore, the general approach is to first minimize an algebraic cost and then use that solution as the starting point for an iterative geometric optimization. If you google these cost functions for the homography case, you will find how they are usually defined.
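As an illustration of the algebraic, closed-form step, here is a minimal normalized-DLT sketch in Python/NumPy. It is only a sketch of the idea (assuming at least four correspondences in Nx2 arrays src and dst), not OpenCV's actual implementation:

```python
import numpy as np

def normalize(pts):
    """Hartley normalization: zero mean, average distance sqrt(2)."""
    mean = pts.mean(axis=0)
    scale = np.sqrt(2) / np.mean(np.linalg.norm(pts - mean, axis=1))
    T = np.array([[scale, 0, -scale * mean[0]],
                  [0, scale, -scale * mean[1]],
                  [0, 0, 1]])
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    return (T @ pts_h.T).T, T

def dlt_homography(src, dst):
    """Estimate H (dst ~ H src) from >= 4 point pairs via the algebraic cost."""
    src_n, Ts = normalize(src)
    dst_n, Td = normalize(dst)
    A = []
    for (x, y, _), (u, v, _) in zip(src_n, dst_n):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # Minimize ||A h|| subject to ||h|| = 1: the last right singular vector.
    _, _, Vt = np.linalg.svd(np.asarray(A))
    H_n = Vt[-1].reshape(3, 3)
    H = np.linalg.inv(Td) @ H_n @ Ts  # undo the normalization
    return H / H[2, 2]
```

Such a closed-form estimate is the kind of starting point the iterative geometric refinement is then seeded with.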

In case you want to know which method is used in OpenCV, you simply need to have a look at the code: http://code.opencv.org/projects/opencv/repository/entry/trunk/opencv/modules/calib3d/src/fundam.cpp#L81 This is the algebraic cost function, the DLT defined in the book mentioned above; if you google "homography DLT" you should find some relevant documents. And then here: http://code.opencv.org/projects/opencv/repository/entry/trunk/opencv/modules/calib3d/src/fundam.cpp#L165 an iterative procedure minimizes the geometric cost function. It seems the Gauss-Newton method is implemented: http://en.wikipedia.org/wiki/Gauss%E2%80%93Newton_algorithm

All of the above assumes you have correspondences between the two images. If some points are matched to incorrect points in the other image, then you have outliers, and the results of the methods above can be completely off. This is where robust (outlier-resistant) methods come in. OpenCV gives you two options: RANSAC and LMedS. Google is your friend here.
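If you just want the OpenCV call rather than a re-implementation, the robust path looks roughly like this in Python (the point arrays below are only placeholders standing in for real feature matches):

```python
import numpy as np
import cv2

# Placeholder correspondences; in practice these come from feature matching.
src = (np.random.rand(50, 2) * 640).astype(np.float32)
dst = src + 5.0  # fake "matches" so the call runs

# RANSAC discards outlier matches, solves the algebraic (DLT) problem on
# inlier subsets, and finally refines H by minimizing the geometric
# reprojection error on the inliers.
H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, ransacReprojThreshold=3.0)
```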

Hope that helps.

I have one Kinect camera and one webcam, I'm trying to find the rotation/translation matrix between the Kinect and the webcam using OpenCV. Here is the setup:

[image: camera setup]

The two cameras are facing in the same direction. I can get the intrinsic matrix for both cameras, but I'm not sure how to get the relative position between them.

I did some research and found the findEssentialMat() function. Apparently it returns an essential matrix (but this function seems unsuitable, since it assumes that the focal length and principal point are the same for both cameras), which can be used with:

  1. recoverPose()
  2. decomposeEssentialMat() -> if I understood correctly, it returns 4 different solutions; should I use this function?

Thank you very much !

EDIT: What about the stereoCalibrate() function? But my setup does not really correspond to a stereo camera...

EDIT2: I gave the "stereo_calib.cpp" example provided with OpenCV a try. Here is my result; I don't really know how to interpret it:

[image: stereo calibration result]

Also, it produces an "extrinsics.yml" file where I can find the R and T matrices, but I don't know in which units they are expressed. I changed the squareSize variable in the source code several times, but the matrices do not seem to change at all.

I think stereoCalibrate is the way to go if you are interested in the depth map and in aligning the two images (and I think this matters even though I don't know exactly what you're trying to do, and even though you already have a depth map from the Kinect).

But if I understand correctly, you also want to find the position of the cameras in the world. You can do that by having the same known geometry visible in both views. This is normally achieved with a chessboard pattern lying on the floor, seen by both (fixed-position) cameras.

Once you have the 3D points of a known geometry and the corresponding 2D points projected onto the image plane, you can independently find the 3D position of each camera relative to the 3D world, taking the world origin to be one corner of the chessboard.

In this way, what you would achieve is something like this image:

[image: two cameras and their poses shown relative to a chessboard that defines the world origin]

To find the 3D position of each camera relative to the chessboard, you can use cv::solvePnP to compute the extrinsic matrix for each camera independently. There are some issues regarding the viewing direction of the camera (the ray pointing from the camera to the world origin) that you have to handle (again, independently for each camera) if you want to visualise the cameras (for example in OpenGL), plus some matrix algebra and angle handling.
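As a rough sketch of that idea in Python (the image file names, board size, square size, and the intrinsics/distortion K1, d1, K2, d2 are placeholder assumptions):

```python
import numpy as np
import cv2

# Placeholder inputs: one chessboard view per camera plus known intrinsics.
gray_kinect = cv2.imread("kinect_view.png", cv2.IMREAD_GRAYSCALE)
gray_webcam = cv2.imread("webcam_view.png", cv2.IMREAD_GRAYSCALE)
K1 = K2 = np.array([[525., 0., 320.],
                    [0., 525., 240.],
                    [0., 0., 1.]])
d1 = d2 = np.zeros(5)

# Chessboard model: 9x6 inner corners, 2.5 cm squares, lying in the Z=0 plane.
square = 0.025
objp = np.zeros((9 * 6, 3), np.float32)
objp[:, :2] = np.mgrid[0:9, 0:6].T.reshape(-1, 2) * square

def camera_pose(gray, K, dist):
    """World->camera transform (R, t) from one view of the chessboard."""
    found, corners = cv2.findChessboardCorners(gray, (9, 6))
    assert found, "chessboard not detected"
    _, rvec, tvec = cv2.solvePnP(objp, corners, K, dist)
    R, _ = cv2.Rodrigues(rvec)  # rotation vector -> 3x3 rotation matrix
    return R, tvec

R1, t1 = camera_pose(gray_kinect, K1, d1)
R2, t2 = camera_pose(gray_webcam, K2, d2)

# Relative pose taking camera-1 coordinates to camera-2 coordinates:
# x2 = R2 R1^T x1 + (t2 - R2 R1^T t1)
R_rel = R2 @ R1.T
t_rel = t2 - R_rel @ t1  # expressed in the chessboard units (metres here)
```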

For a detailed description of the maths, I can refer you to the famous Multiple View Geometry.

See also my previous answer on augmented reality and the integration between OpenCV and OpenGL (i.e., how to use the extrinsic matrix, and the T and R matrices that can be decomposed from it, which represent the position and orientation of the camera in the world).

Just out of curiosity: why are you using a normal camera PLUS a Kinect? The Kinect already gives you the depth map that we would otherwise try to obtain with two stereo cameras. I don't understand exactly what additional data a normal camera can give you beyond what a calibrated Kinect, with good use of the extrinsic matrix, already provides.

PS: the image is taken from a nice OpenCV introductory blog post, but I don't think that post is very relevant to your question, because it is about the intrinsic matrix and distortion parameters, which it seems you already have. Just to clarify.

EDIT: regarding the units of the extrinsic data, they are normally expressed in the same units as the 3D points of the chessboard. So if you identify the corner points of one chessboard square in 3D as P(0,0), P(1,0), P(1,1), P(0,1) and use them with solvePnP, the translation of the camera will be measured in units of "chessboard square size": if a square is 1 metre long, the unit of measure will be metres. For rotations, the units are normally angles in radians, but it depends on how you extract them with cv::Rodrigues and how you obtain the three yaw-pitch-roll angles from the rotation matrix.

I have been working on augmented reality for quite a few months. I have used third-party tools like Unity/Vuforia to create augmented reality applications for Android.

I would like to create my own framework in which I will create my own AR apps. Can someone point me to the right tutorials/links to achieve this goal? At a higher level, my plan is to create an application that can recognize multiple markers and match them against models stored in the cloud.

That seems like a massive undertaking: model recognition is not an easy task. I recommend looking at OpenCV (which has some standard algorithms you can use as a starting point) and then looking at a good computer vision book (e.g., Richard Szeliski's book or Hartley and Zisserman).

But you are going to run into a host of practical problems. Consider that systems like Vuforia provide camera calibration data for most Android devices, and it's hard to do computer vision without it. Then, of course, there's efficiently managing the whole pipeline which (again) companies like Qualcomm and Metaio invest huge amounts of $$ in.
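For instance, if no vendor-supplied calibration is available for a device, you can calibrate it yourself with a printed chessboard. A hedged Python/OpenCV sketch (the image path pattern and the 9x6 board size are placeholder assumptions):

```python
import glob
import numpy as np
import cv2

# Chessboard model: 9x6 inner corners; units here are "squares".
objp = np.zeros((9 * 6, 3), np.float32)
objp[:, :2] = np.mgrid[0:9, 0:6].T.reshape(-1, 2)

obj_points, img_points, size = [], [], None
for path in glob.glob("calib_images/*.jpg"):  # placeholder image set
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, (9, 6))
    if found:
        obj_points.append(objp)
        img_points.append(corners)
        size = gray.shape[::-1]

# K is the intrinsic matrix, dist the lens distortion coefficients.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, size, None, None)
print("RMS reprojection error:", rms)
```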

Can someone point out the math involved in getting the 3D points of an image from its disparity values? I have the image(i,j) and the disparity at each of these points. What I want are the true 3D coordinates x, y, z, using mathematical equations.

Long answer - http://www.amazon.com/Multiple-View-Geometry-Computer-Vision/dp/0521540518/

Short answer: you have the pixel scale, so for a given pixel disparity you can get an angular difference. With the baseline between the cameras and that angle, you have a distance.
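Concretely, for a rectified stereo pair the standard back-projection is the following sketch (assuming a pinhole model with focal length f in pixels, baseline B, and principal point (cx, cy) known from calibration):

```python
import numpy as np

def disparity_to_3d(i, j, d, f, B, cx, cy):
    """Rectified-stereo back-projection under an assumed pinhole model.

    i, j : pixel column/row; d : disparity in pixels;
    f : focal length in pixels; B : baseline; cx, cy : principal point.
    """
    Z = f * B / d          # depth: larger disparity -> closer point
    X = (i - cx) * Z / f   # lateral offset
    Y = (j - cy) * Z / f   # vertical offset
    return np.array([X, Y, Z])
```

OpenCV's reprojectImageTo3D does the same thing for a whole disparity image, given the Q matrix produced by stereoRectify.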

PS: take a look at the OpenCV book; it has a couple of good chapters on stereo.

Is it possible to get good reconstructed surfaces from Bumblebee cameras (produced by Point Grey Research)? Does anyone have any information regarding this? I am looking for a fairly simple solution that is easy to implement.

I have worked with the Bumblebee camera before. I can tell you that the built-in stereo depth software can be very sensitive to color/material properties and can thus confuse shading with shape, etc. Reconstruction from a single shot is therefore likely not feasible.

If you're willing to move the camera around to do structure from motion, then try any number of SfM packages, such as Bundler or the Voodoo camera tracker. These can produce a 3D point cloud.

Once you have the 3D point cloud, you can use dense multi-view stereo software like CMVS to produce a colored, dense point cloud. Note that CMVS is meant to work with Bundler; you'll have to do some format conversion if you use another SfM package.

If you prefer to roll your own, I'd suggest reading the Hartley and Zisserman classic.