Feature Extraction by Multiobjective Genetic Programming

It has long been recognised that mapping a vector of raw pattern attributes to another (decision) space can often improve the labelling performance of pattern classifiers. In fact, support vector machines - and more generally, kernel methods – rely on just this `trick´.

A block diagram showing the feature extraction process

Rather than using some pre-determined kernel, this project seeks to generate an optimised feature mapping using genetic programming (GP) – a variant of genetic algorithms which typically use a tree-like chromosomal structure to encode a program, or more generally, a sequence of transformations. An example tree is shown below:

A tree diagram showing the described structure

Using a wrapper approach, we can perform evolutionary search for the feature mapping which minimises the error rate in conjunction with a standard classifier.

Although GP is well-suited to this task, one of the drawbacks of the technique is its tendency to produce ever-more complex mappings with no real increase in generalisation performance over an independent test set: the so-called bloat phenomenon. To address, this we minimise multiple objectives: The first is the usual 0/1 misclassification loss. The second is a measure of mapping complexity. (In fact, for simplicity we use total tree node count.) This vector of objectives is minimised in a Pareto framework which uses the concept of vector dominance to sort members of the GP population, thus facilitating selective breeding and also removes the need to arbitrarily weight the contributions of non-commensurate quantities of error rate and tree size.

We have made comparison over a series of 13 benchmark machine learning datasets using the following popular classifiers: Radial basis function neural networks, Logistic regression, Nearest neighbours, Bayesian Networks, ADT and C4.5 Decision Trees, Support Vector Machines.

In 116 pairwise comparisons, our multiobjective GP (MOGP) technique was superior in 98 and statistically identical in 16; in no case was the performance of our MOGP technique inferior to those of the conventional classifiers.

In addition, we have extended the work to derive edge detectors for image processing. For comparison, we have used the well-known Canny edge detector with its threshold set for minimum loss at equal costs.

The leftmost image shows the original image. The middle image shows the Canny result. (This is not a mistake! Unless edges/non-edges are assigned highly asymmetric costs, the Canny detector has been shown to be a very poor classifier [1].) The rightmost image shows the edge map produced by our MOGP-derived detector.

The original image, the canny detector image, our MOGP derived detector image

[1] Y. Zhang and P. I. Rockett, "The Bayesian operating point of the Canny edge detector", IEEE Transactions on Image Processing, vol. 15, no. 11, pp. 3409- 3416 2006.