The Perspective-n-Point (PnP) problem

A moving rigid body needs to be ``pinned down'' using $ n$ observed features. This is called the Perspective-$ n$-Point (or PnP) problem. We can borrow much of the math from Chapter 3; however, here we consider the placement of bodies in the real world, rather than the virtual world. Furthermore, we have an inverse problem, which is to determine the body placement based on points in the image. Up until now, the opposite problem was considered. For visual rendering in Chapter 7, an image was produced based on the known body placement in the (virtual) world.

The features could be placed on the body or in the surrounding world, depending on the sensing method. Suppose for now that they are on the body. Each feature corresponds to a point $ p = (x,y,z)$ with coordinates defined in the frame of the body. Let $ T_{rb}$ be a homogeneous transformation matrix that contains the pose parameters, which are assumed to be unknown. Applying the transform $ T_{rb}$ to the point $ p$ as in (3.22) could place it anywhere in the real world. Recall the chain of transformations (3.41), which furthermore determines where each point on the body would appear in an image. The matrix $ T_{eye}$ held the camera pose, whereas $ T_{vp}$ and $ T_{can}$ contained the perspective projection and transformed the projected point into image coordinates.
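The forward direction of this chain can be sketched in a few lines. The sketch below is only an illustration under simplifying assumptions: a pinhole camera at the origin looking along the positive $ z$ axis stands in for the full chain (3.41), and the names `rotation_z`, `apply_rigid`, and `project` are hypothetical helpers, not notation from earlier chapters.

```python
import math

def rotation_z(theta):
    """3x3 rotation about the z axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]

def apply_rigid(R, t, p):
    """Rotate body-frame point p by R, then translate by t (the role of T_rb)."""
    return tuple(sum(R[r][k] * p[k] for k in range(3)) + t[r] for r in range(3))

def project(q, focal=1.0):
    """Assumed pinhole model: camera at the origin looking along +z."""
    x, y, z = q
    if z <= 0:
        raise ValueError("point is not in front of the camera")
    return (focal * x / z, focal * y / z)

# Hypothetical pose: quarter turn about z, then 4 units in front of the camera.
R = rotation_z(math.pi / 2)
t = (0.0, 0.0, 4.0)
p = (1.0, 0.0, 0.0)  # feature in body coordinates
i, j = project(apply_rigid(R, t, p))
```

Here the pose is known and the image location $ (i,j)$ is computed from it; the PnP problem runs this computation in reverse.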

Figure 9.14: Each feature that is visible eliminates $ 2$ DOFs. On the left, a single feature is visible, and the resulting rigid body has only $ 4$ DOFs remaining. On the right, two features are visible, resulting in only $ 2$ DOFs. This can be visualized as follows. The edge that touches both segments can be moved back and forth while preserving its length if some rotation is also applied. Rotation about the axis along the edge provides the second DOF.

Now suppose that a feature has been observed to be at location $ (i,j)$ in image coordinates. If $ T_{rb}$ is unknown, but all other transforms are given, then there are six independent parameters to estimate, corresponding to the $ 6$ DOFs. Observing $ (i,j)$ provides two independent constraints on the chain of transforms (3.41), one for $ i$ and one for $ j$. The rigid body therefore loses $ 2$ DOFs, as shown in Figure 9.14. This is the P1P problem because $ n$, the number of features, is one.
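One way to see the two constraints is to write them as residuals that must both vanish. The sketch below assumes the same simplified pinhole camera (at the origin, looking along the positive $ z$ axis); `project` and `residuals` are hypothetical names. Every point along the ray through the observation satisfies both constraints, so the feature can still slide along that ray ($ 1$ DOF) and the body can still rotate about the feature ($ 3$ DOFs), which accounts for the $ 4$ remaining DOFs.

```python
def project(q, focal=1.0):
    """Assumed pinhole model: camera at the origin looking along +z."""
    x, y, z = q
    return (focal * x / z, focal * y / z)

def residuals(camera_point, observed):
    """Two constraint residuals contributed by one feature: one for i, one for j."""
    i, j = project(camera_point)
    return (i - observed[0], j - observed[1])

observed = (0.1, -0.2)

# Points at depths 1, 2, and 5 along the ray through the observation.
on_ray = [(0.1 * d, -0.2 * d, d) for d in (1.0, 2.0, 5.0)]
all_zero = all(
    max(abs(r) for r in residuals(q, observed)) < 1e-12 for q in on_ray
)

# A point off the ray violates at least one of the two constraints.
off_ray = residuals((0.3, -0.2, 1.0), observed)
```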

The P2P problem corresponds to observing two features in the image and results in four constraints. In this case, each feature eliminates two DOFs, resulting in only two remaining DOFs; see Figure 9.14. Continuing further, if three features are observed, then for the P3P problem, zero DOFs remain (except for the case in which collinear features are chosen on the body). It may seem that the problem is completely solved; however, zero DOFs allows for multiple solutions (they are isolated points in the space of solutions). The P3P problem corresponds to trying to place a given triangle into a pyramid formed by rays so that each triangle vertex touches a different ray. This can be generally accomplished in four ways, which are hard to visualize. Imagine trying to slice a tall, thin pyramid (simplex) made of cheese so that four different slices have the exact same triangular size and shape. The cases of P4P and P5P also result in ambiguous solutions. Finally, in the case of P6P, unique solutions are always obtained if no four features are coplanar. All of the mathematical details are worked out in [361].
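The DOF bookkeeping for small $ n$ amounts to simple arithmetic, sketched below. The function `remaining_dofs` is a hypothetical helper, and the count assumes features in general position (for example, not collinear).

```python
def remaining_dofs(n_features, total_dofs=6, constraints_per_feature=2):
    """DOFs left after observing n features in general position.

    Zero remaining DOFs means the solutions are isolated points,
    not that the solution is unique.
    """
    return max(0, total_dofs - constraints_per_feature * n_features)

counts = {n: remaining_dofs(n) for n in range(5)}
```

The count reaches zero at $ n = 3$ and stays there; the additional features in P4P through P6P serve to rule out the isolated alternatives rather than to remove further DOFs.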

The PnP problem has been described in the ideal case of having perfect coordinate assignments to the feature points on the body and the perfect observation of those through the imaging process. In practice, small errors are made due to factors such as sensor noise, image quantization, and manufacturing tolerances. This results in ambiguities and errors in the estimated pose, which could deviate substantially from the correct answer [286]. Therefore, many more features may be used in practice to improve accuracy. Furthermore, a calibration procedure, such as bundle adjustment [112,285,331], may be applied before the device is used so that the feature point locations can be more accurately assigned before pose estimation is performed. Robustness may be improved by employing RANSAC [78].
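With noisy observations, no pose reproduces every image point exactly, so candidate poses are commonly compared by their reprojection error. The sketch below only scores candidates; it is not a PnP solver. It makes strong simplifying assumptions that are labeled in the code: the rotation is fixed to the identity so that a pose is just a translation, the camera is a pinhole at the origin looking along the positive $ z$ axis, and the observations are synthetic. All names are hypothetical.

```python
import math

def project(q, focal=1.0):
    """Assumed pinhole model: camera at the origin looking along +z."""
    x, y, z = q
    return (focal * x / z, focal * y / z)

def rms_reprojection_error(t, body_points, observations):
    """RMS image-plane error for a candidate pose (translation only, by assumption)."""
    total = 0.0
    for p, (oi, oj) in zip(body_points, observations):
        i, j = project((p[0] + t[0], p[1] + t[1], p[2] + t[2]))
        total += (i - oi) ** 2 + (j - oj) ** 2
    return math.sqrt(total / len(body_points))

# Feature coordinates in the body frame (made up for illustration).
body_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.5)]
true_t = (0.0, 0.0, 5.0)

# Synthetic observations: exact projections plus small fixed offsets
# standing in for sensor noise.
offsets = [(0.001, -0.002), (-0.001, 0.001), (0.002, 0.0), (0.0, -0.001)]
observations = []
for p, (di, dj) in zip(body_points, offsets):
    i, j = project((p[0] + true_t[0], p[1] + true_t[1], p[2] + true_t[2]))
    observations.append((i + di, j + dj))

# The first feature alone cannot separate the first two candidates,
# because it projects to the image center at either depth.
candidates = [(0.0, 0.0, 5.0), (0.0, 0.0, 6.0), (0.2, 0.0, 5.0)]
errors = [rms_reprojection_error(t, body_points, observations) for t in candidates]
best = candidates[errors.index(min(errors))]
```

Scoring over all features selects the candidate at the true translation, even though its error is not exactly zero.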

Steven M LaValle 2020-01-06