Face detection on low powered devices
Abstract
Face detection is a computer technology being used in a variety of applications that identifies human faces in digital images. The first realtime face detection machine learning algorithm was proposed by Paul Viola and Michael Jones. This work is distinguished by three key contributions. The first is the introduction of a new image representation called the “Integral Image” which allows the features used by our detector to be computed very quickly. This is just the sum of the pixel values to the left and above a certain pixel location (x,y).This helps in constant time calculation of our features using simple additions and subtractions.This way we calculate a number of features of different scale sizes(~160,000 for a 24X24 window).These features resemble the Haar wavelets and thus are called Haar features. The second contribution is a learning algorithm,based on AdaBoost,which selects a small number of critical visual features from this huge set and yields extremely efficient classifiers. It involves taking the weighted sum of a number of weak classifiers(single feature classifier) to make a strong classifier. This helps in fast classification and is the main learning algorithm.The third contribution is a method for combining increasingly more complex classifiers in a “cascade”. This makes use of our own experience that in an image, a rather limited number of subwindows deserve more attention than others. The algorithm thus deploys more resources to work on those windows more likely to contain a face while spending as little effort as possible on the rest. Our implementation was done in Matlab/Octave. The next objective is to employ this on lowpowered hardware devices.
Keywords: machine learning, haar features, integral image, adaboost, classifier cascade
INTRODUCTION
Background
In the past few years, face recognition owned significant consideration and appreciated as one of the most promising applications in the field of image analysis. Face detection is considered a substantial part of face recognition operations. The method of face detection in pictures is complicated because of variability present across human faces such as pose, expression, position and orientation, skin colour, the presence of glasses or facial hair, differences in camera gain, lighting conditions, and image resolution.
Face Detection is the first and essential step for face recognition, and it is used to detect faces in the images. It is a part of object detection and can use in many areas such as security, biometrics, law enforcement, motion capture, marketing etc. The primary aim of face detection algorithms is to determine whether there is any face in an image or not.
In recent times, a lot of study work proposed in the field of Face Recognition and Face Detection to make it more advanced and accurate, but it made a revolution in this field when ViolaJones came with its RealTime Face Detector, which is capable of detecting the faces in realtime with high accuracy. ^{[1]}^{ }
The problem to be solved is detection of faces in an image. A human can do this easily, but a computer needs precise instructions and constraints. To make the task more manageable, Viola–Jones requires full view frontal upright faces. Thus in order to be detected, the entire face must point towards the camera and should not be tilted to either side. While it seems these constraints could diminish the algorithm's utility somewhat, because the detection step is most often followed by a recognition step, in practice these limits on pose are quite acceptable. ^{[1]}^{ }
Currently,we have even better methods for face detection, but they primarily involve tools such as Deep Learning,which require implementing Convolutional Neural Networks(CNNs) etc. Although this approach provides higher accuracy in the long run,it takes a longer time to train and is more resource intensive. Thus for designing an application for detecting faces using low powered hardware like FPGAs, the Viola Jones algorithm is the best option.
Statement of Problems
Implement the Viola Jones algorithm for face detection in a manner that the results of training can be used for detection using low powered devices.
Scope
The training was done assuming a fixed target size of 19X19 or 24X24 face sizes. This can be improved by allowing more flexibility in the feature extraction process.
METHODOLOGY
Concepts
The paper has three important contributions/aspects.The first is the introduction of a new image representation called the “Integral Image” which allows the features used by the detector to be computed very quickly. The second is a learning algorithm,based on AdaBoost,which selects a small number of critical visual features from a larger set and yields extremely efficient classifiers.The third contribution is a method for combining increasingly more complex classifiers in a “cascade” which allows background regions of the image to be quickly discarded while spending more computation on promising objectlike regions.The cascade can be viewed as an object specific focusofattention mechanism which unlike previous approaches provides statistical guarantees that discarded regions are unlikely to contain the object of interest. ^{[1]}^{ }
Feature extraction and integral images
Most approaches use the pixel values as input for the algorithm.But Viola and Jones introduced these new types of features,based on Haar wavelets:
The sum of the pixels in the unshaded rectangles are subtracted from the sum of the pixels in the shaded rectangles. It is easy to see that even for small images, there are a lot of features (162336 for a 24 x 24 image considering all positions and scales) ^{[2]}^{ }. To calculate so many features efficiently,the integral image representation was defined as follows:
s(x,y) denotes the cumulative row sum at pixel location (x,y),while ii(x,y) is the integral image value at the same location,and i(x,y) is the pixel value at that point.Thus,the integral image is nothing but the sum of pixel values above and to the left of a location.This sort of recursive relation has the added advantage of being employed in the following manner:
To calculate the sum of pixel values in the region D,we simply need to compute ii(1)+ii(4)ii(2)ii(3). It can be easily verified that this general procedure works for all cases.
Builiding classifiers using Adaboost
For training, ViolaJones uses a variant of Adaboost. The general idea of boosting is for each subsequent weak classifier to correct the mistakes of the previous classifier. To do this, it assigns a weight to each training example, trains the classifiers, chooses the best classifier, and updates the weights according to the error of the classifier. Incorrectly labeled examples will get larger weights so they are correctly classified by the next classifier chosen.A weak classifier is simply a single feature classifier.
So we have the following algorithm :
1. Initialize the weights
2. Normalize the weights
3. Select the best weak classifier (based on the weighted error)
4. Update the weights based on the chosen classifiers error
5. Repeat steps 2–4 T times where T is the desired number of weak classifiers
Assigning weights
Initially we have the following weights for the i^{th} image
where p is the number of positive examples and n is the number of negative examples.
Building Weak classifiers
Each weak classifier is “weak” because by itself, it cannot accurately fulfill the classification task. Each weak classifier looks at a single feature( f ). It has both a threshold ( θ ) and a polarity ( p ) to determine the classification of a training example.The hypothesis function is as follows:
Polarity can be either 1 or 1. When p = 1, the weak classifier outputs a positive result when f(x) < θ, or the feature value is less than the threshold. When p = 1, the weak classifier outputs a positive result when f(x) > θ.
Selecting the threshold
Training the weak classifiers is the most computationally expensive piece of the algorithm because each time a new weak classifier is selected as the best one, all of them have to be retrained since the training examples are weighted differently. However, there is an efficient way to find the optimal threshold and polarity for a single weak classifier using the weights. First, sort the weights according to the feature value that they correspond to. Now iterate through the array of weights, and compute the error if the threshold was chosen to be that feature. Find the threshold and polarity with the minimum error. The possible values for a threshold are the values of the feature on each training example. The error can be measured by :
T is the sum total of all weights,while S is the sum of weights of examples seen so far. The superscripts + and  denote positive and negative samples respectively. This method is known as a decision stump.
Selecting best classifier and updating weights
This procedure can be summarised as follows:
Epsilon represents the error of the best classifier, w is the weight of the i^{th} example, and beta is the factor why which to change the weight. The exponent of beta is 1e where the e is 0 if the training example was classified correctly and 1 if classified incorrectly.
Building strong classifier
Now we train several of these weak classifiers and take the weighted sum of their hypotheses to compile a strong classifier.
The weighted sum of the weak classifiers’ decisions is compared to half the sum of the alphas because it is akin to saying “At least half the weak classifiers agree with a positive classification.”
The complete Adaboost training can be summarised as follows:
Attentional cascade
In theory, Adaboost can produce a single committee of decision stumps that generalizes well. However, to achieve that, an enormous negative training set is needed at the outset to gather all possible negative patterns. In addition, a single committee implies that all the windows inside an image have to go through the same lengthy decision process. There has to be another more costefficient way.
The prior probability for a face to appear in an image bears little relevance to the presented classifier construction because it requires both the empirical false negative and false positive rate to approach zero. However, our own experience tells us that in an image, a rather limited number of subwindows deserve more attention than others. This is true even for faceintensive group photos. Hence the idea of a multilayer attentional cascade which embodies a principle akin to that of Shannon coding: the algorithm should deploy more resources to work on those windows more likely to contain a face while spending as little effort as possible on the rest.
Each layer in the attentional cascade is expected to meet a training target expressed in false positive and false negative rates: among n negative examples declared positive by all of its preceding layers, layer l ought to recognize at least (1−γl)*n as negative and meanwhile try not to sacrifice its performance on the positives: the detection rate should be maintained above 1−βl. Here βl is the false positive rate and γl is the false negative rate ^{[2]}^{ }. This can be achieved with the following procedure proposed by Viola Jones:
Some better methods have also been suggested to deal with some ambiguities in this algorithm.Particularly, Tavalalli et al have devised a train set selection method based on histograms generated from AdaBoost and also, a simple method to select few features in beginning cascades are proposed.A simple method to find only few features which can satisfy the conditions at first cascades is presented. This solution consists of changing initial weights. A method based on histograms generated from AdaBoost is introduced which clarifies how to select threshold, training samples and finally results in decreasing computational complexity and having less features while achieving the required conditions on each cascade. ^{[4]}^{ }
Implementation
For the image processing part, I used OpenCV with C++ . For the machine learning part,I used MATLAB/Octave .Octave is basically a free open source version of MATLAB.
Preprocessing:
Before feature extraction,all images were resized to 19X19 (or 24X24) and were changed to grayscale scheme.
To compensate the effect of different lighting conditions, all the images should be mean and variance normalized beforehand. Those images with variance lower than one, having little information of interest in the first place, are left out of consideration. ^{[3]}^{ }
Thus feature scaling is employed. This is also useful as now the threshold reduction step gets easier, as all the features are brought down to the same range.
Datasets
For this project,we made use of various open source databases labelled as face and nonface images
1. MIT Center For Biological and Computation Learning ^{[3]}^{ }
Training set: 2,429 faces, 4,548 nonfaces
Test set: 472 faces, 23,573 nonfaces
2. Georgia Tech face database
640X480 pixel images(750 in number) with frontal or tilted faces with varied scale, lighting conditions and facial expressions. Average size of faces in the images is 150X150 .
3. Labelled Faces in the Wild(LFW) database University of Massachusetts Amherst
13233 images of 5749 people
4. Google Things database
520 images of varying sizes
RESULTS
Feature extraction
For convenience, the feature values were stored in a single table of 1396X63961,where the number of rows denotee the number of samples and the number of columns denote the number of features.
Cascade
Later layers of the cascade are increasingly more complex,utilizing a lot of weak classifiers:
The above code is executed in the testing phase and answers the question, “Does the image have a face or not?”. We also keep track of the total number of false positives, false negatives and the total number of images correctly predicted to calculate the false positive rate, the false negative rate and the accuracy.
CONCLUSION
In this project,an effort was made to understand the face detection framework proposed by Paul Viola and Michael Jones,and implement it in MATLAB and OpenCV/C++. The training process yielded us coefficients and parameters for a cascade of classifiers. These can now be programmed into low powered hardware like FPGAs and thus efficiently classify images as face or nonface.
ACKNOWLEDGEMENTS
I wish to express my sincere gratitude to my guide and mentor, Dr Chetan Singh Thakur for guiding and encouraging me during the course of my fellowship in Indian Institute of Science, while working on the project on “Face detection on low powered devices”.I have been extremely fortunate to have worked under his supervision.
I also take the opportunity to thank Mr Abhishek Ramdas Nair(Ph.D scholar) and Ms Shilpa Sangappa(M.Tech student) for helping me in carrying out this project with their suggestions and advice.
References

Wikipedia, Viola–Jones object detection framework, [Online] Available at: https://en.wikipedia.org/wiki/Viola%E2%80%93Jones_object_detection_framework

Viola, P. and Jones, M. (2001) 'Rapid Object Detection using a Boosted Cascade of Simple Features', ACCEPTED CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION 2001,[Online]. Available at: https://www.cs.cmu.edu/~efros/courses/LBMV07/Papers/violacvpr 01.pdf

Y.Q. Wang, “An Analysis of the ViolaJones Face Detection Algorithm”, Image Processing On Line IPOL, 2014.

Anmol Parande,"Understanding and Implementing the ViolaJones Image Classification Algorithm",[Online]. Available at: https://medium.com/datadriveninvestor/understandingandimplementingtheviolajonesimageclassificationalgorithm85621f7fe20b

P. Tavalalli et al, "An Efficient Training Procedure for ViolaJones Face Detector" ,2017 International Conference on Computational Science and Computational Intelligence
Post your comments
Please try again.