The probabilistic approach is most common in pattern classification.
This results in a *Bayesian classifier*. Here it is assumed that
$P(y \mid \theta)$ and $P(\theta)$ are given. The distribution of features
for a given class is indicated by $P(y \mid \theta)$. The overall
frequency of class occurrences is given by $P(\theta)$. If large,
preclassified data sets are available, then these distributions can be
reliably learned. The feature space is often continuous, which
results in a density $p(y \mid \theta)$, even though $P(\theta)$ remains a
discrete probability distribution. An optimal classifier, $\gamma$, is
designed according to (9.26). It performs classification
by receiving a feature vector, $y$, and then declaring that the class
is $\gamma(y)$. The expected cost using (9.32) is
the probability of error.
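
Since (9.26) and (9.32) are not reproduced in this excerpt, the following sketch assumes the usual conditional expected-cost form for (9.26) and a 0-1 cost for (9.32); the function names are illustrative only.

```python
# A minimal sketch of an optimal classifier under the assumptions above.

def optimal_classifier(y, classes, posterior, cost):
    """Return the class u minimizing sum_theta cost(u, theta) * P(theta | y)."""
    def expected_cost(u):
        return sum(cost(u, theta) * posterior(theta, y) for theta in classes)
    return min(classes, key=expected_cost)

def zero_one_cost(u, theta):
    """With this cost, the expected cost equals the probability of error."""
    return 0.0 if u == theta else 1.0
```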

Suppose that there are three different image processing algorithms:

- **Shape extractor:** This returns $s = 1$ if the letter is composed of straight edges only, and $s = 0$ if it contains at least one curve.
- **End counter:** This returns $e$, the number of segment ends. For example, $O$ has none and $X$ has four.
- **Hole counter:** This returns $h$, the number of holes enclosed by the character. For example, $X$ has none and $O$ has one.

Imagine doing classification under the nondeterministic model, with the assumption that the features always provide correct information. For $y = (1, 2, 1)$, the only possible letter is $A$. For $y = (0, 0, 2)$, the only letter is $B$. If each $y$ is consistent with only one or no letters, then a perfect classifier can be constructed. Unfortunately, $y = (1, 3, 0)$ is consistent with both $E$ and $F$. In the worst case, the cost of using (9.32) is $1$.
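
As an illustrative sketch (not part of the original example), the feature triples can be tabulated and the nondeterministic classifier becomes a consistency check. The six entries below follow from the letters discussed in the text and the feature definitions; a real classifier would tabulate all 26 letters.

```python
# Feature vectors y = (s, e, h) for a few letters.
FEATURES = {
    "A": (1, 2, 1),
    "B": (0, 0, 2),
    "E": (1, 3, 0),
    "F": (1, 3, 0),
    "O": (0, 0, 1),
    "X": (1, 4, 0),
}

def consistent_letters(y):
    """Nondeterministic model: return every letter consistent with y."""
    return {letter for letter, f in FEATURES.items() if f == y}

print(consistent_letters((1, 2, 1)))  # {'A'}: uniquely determined
print(consistent_letters((1, 3, 0)))  # {'E', 'F'}: ambiguous
```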

One way to fix this is to introduce a new feature. Suppose that an image processing algorithm is used to detect corners. These are places at which two segments meet at a right ($90$ degrees) angle. Let $c$ denote the number of corners, and let the new feature vector be $y = (s, e, h, c)$. The new algorithm nicely distinguishes $E$ from $F$, for which $c = 2$ and $c = 1$, respectively. Now all letters can be correctly classified without errors.
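
Continuing the sketch, appending the corner count $c$ removes the ambiguity; only the corner values for $E$ and $F$ are given in the text, and the other letters would be extended in the same way.

```python
# Extended feature vectors y = (s, e, h, c); only E and F are shown.
FEATURES4 = {
    "E": (1, 3, 0, 2),
    "F": (1, 3, 0, 1),
}

print({k for k, f in FEATURES4.items() if f == (1, 3, 0, 2)})  # {'E'}
```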

Of course, in practice, the image processing algorithms occasionally
make mistakes. A Bayesian classifier can be designed to maximize the
probability of success. Assume conditional independence of the
observations, which means that the classifier can be considered *naive*. Suppose that the four image
processing algorithms are run over a training data set and the
results are recorded. In each case, the correct classification is
determined by hand to obtain probabilities
$P(s \mid \theta)$, $P(e \mid \theta)$, $P(h \mid \theta)$, and
$P(c \mid \theta)$. For example, suppose that the hole counter receives
the letter $A$ as input. After running the algorithm over many
occurrences of $A$ in text, it may be determined that
$P(h = 1 \mid A) = 0.9$, which is the correct answer. With smaller
probabilities, perhaps $P(h = 0 \mid A) = 0.09$ and
$P(h = 2 \mid A) = 0.01$. Assuming that the output of each image
processing algorithm is independent given the input letter, a joint
probability can be assigned as

$$P(y \mid \theta) = P(s, e, h, c \mid \theta) = P(s \mid \theta)\, P(e \mid \theta)\, P(h \mid \theta)\, P(c \mid \theta) \tag{9.33}$$
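
In code, evaluating (9.33) is a product over per-feature lookup tables. In the sketch below, only the hole-counter entries for $A$ come from the numbers above; all other tables are hypothetical and would be filled in from training data.

```python
# Naive Bayes likelihood: the product in (9.33). Each table maps
# (feature value, letter) -> learned conditional probability.
P_h = {(1, "A"): 0.90, (0, "A"): 0.09, (2, "A"): 0.01}

def likelihood(y, theta, P_s, P_e, P_h, P_c):
    """P(y | theta) = P(s | theta) P(e | theta) P(h | theta) P(c | theta)."""
    s, e, h, c = y
    return (P_s.get((s, theta), 0.0) * P_e.get((e, theta), 0.0) *
            P_h.get((h, theta), 0.0) * P_c.get((c, theta), 0.0))
```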

The value of the prior $P(\theta)$ can be obtained from large amounts of hand-classified text by recording the relative numbers of occurrences of each letter. It is interesting to note that some context-specific information can be incorporated. If the text is known to be written in Spanish, then $P(\theta)$ should be different from the prior for text written in English. Tailoring $P(\theta)$ to the type of text that will appear improves the performance of the resulting classifier.
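
Estimating the prior is then a matter of counting. A minimal sketch, assuming the training sample has already been hand-classified into a sequence of letters:

```python
from collections import Counter

def estimate_prior(letters):
    """Estimate P(theta) as the relative frequency of each letter in a
    large, hand-classified sample (e.g., English vs. Spanish text)."""
    counts = Counter(letters)
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.items()}
```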

The classifier makes its decisions by choosing the letter that minimizes the probability of error or, equivalently, maximizes the posterior $P(\theta \mid y)$. This posterior is proportional to

$$P(s \mid \theta)\, P(e \mid \theta)\, P(h \mid \theta)\, P(c \mid \theta)\, P(\theta) \tag{9.34}$$

by neglecting the constant $P(y)$ in the denominator of Bayes' rule in (9.26).
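
Putting the pieces together, the decision rule evaluates (9.34) for each letter and returns the maximizer. A sketch, where `likelihood(y, theta)` is any function returning $P(y \mid \theta)$ (such as the naive Bayes product above with its tables bound) and `prior` maps letters to $P(\theta)$:

```python
def classify(y, classes, likelihood, prior):
    """Return the letter maximizing (9.34). The constant P(y) from
    Bayes' rule is dropped; it does not affect the argmax."""
    return max(classes, key=lambda theta: likelihood(y, theta) * prior[theta])
```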
