Traditional marketing research tools like surveys and focus groups are used to evaluate customer responses to various aspects of a product, but they do not fully capture what’s going on in the customer’s mind, and the responses are not always truthful. This is where Neuromarketing becomes useful. Non-verbal consumer responses and brain signals are a more reliable source because they are spontaneous and aren’t altered by the conscious mind. The most widely used methods are eye tracking, facial emotion detection, EEG data, and fMRI signals.
In this blog, we’ll talk about Facial Emotion Detection.
Facial Emotion Recognition: Facial expressions and emotions are the most prominent form of non-verbal communication. A lot that cannot be conveyed by words can easily be expressed with the face. The 7 most prominent emotions are anger, disgust, fear, happiness, sadness, surprise, and neutral. We experimented with 2 of the most popular machine learning models: CNN and SVM.
1. CNN
Convolutional Neural Networks are known to work well for image analysis tasks. Before building the model, we need to prepare the data for the network.
Dataset:
The dataset is acquired from Kaggle and consists of the 7 classes mentioned above (0=Angry, 1=Disgust, 2=Fear, 3=Happy, 4=Sad, 5=Surprise, 6=Neutral). There are 13,690 color images of faces, each 350×350 pixels, exhibiting the 7 different expressions.
Distribution of images amongst the classes –
Notice that the dataset has many more images for the happiness and neutral classes than for the others.
Sample image from the dataset –
Pre-Processing:
The following pre-processing is performed on the dataset:
- Facial emotion is independent of the color of the image, so we can convert the image from RGB to grayscale, reducing the number of channels from 3 to 1. This reduces the amount of data by a factor of 3.
- Extra background can add noise and misleading information to the input, so we extract only the face region using a Cascade Classifier (OpenCV API) and resize it to a fixed 48×48 pixels for consistency.
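To make the data reduction concrete, here is a minimal pure-Python sketch of the grayscale step and the resulting value counts. It is an illustration only: the luminance weights are the common BT.601 ones, the tiny 2×2 image is hypothetical, and the face-cropping step (done with the Cascade Classifier in our pipeline) is not shown.

```python
def rgb_to_gray(pixel):
    """Convert one (R, G, B) pixel to a single luminance value (BT.601 weights)."""
    r, g, b = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def to_grayscale(image):
    """Convert an image given as rows of (R, G, B) tuples to rows of gray values."""
    return [[rgb_to_gray(p) for p in row] for row in image]


def value_count(height, width, channels):
    """Number of stored values for an image of the given shape."""
    return height * width * channels


# A tiny hypothetical 2x2 color image standing in for a real face photo.
tiny = [[(255, 0, 0), (0, 255, 0)],
        [(0, 0, 255), (255, 255, 255)]]
gray = to_grayscale(tiny)

rgb_values = value_count(350, 350, 3)    # original color image
gray_values = value_count(350, 350, 1)   # after grayscale conversion
model_values = value_count(48, 48, 1)    # after face crop and resize

print(gray)
print(rgb_values, gray_values, model_values)
```

Going from 3 channels to 1 cuts the values per image by exactly a factor of 3, and the crop to 48×48 shrinks the network input much further.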
Our emotion labels are categorical, so we perform one-hot encoding, which converts each label into a vector of 0’s with a single 1 at the index position of its class.
One hot encoding:
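As a small illustration, one-hot encoding for the 7 classes can be sketched in plain Python as follows. The helper name `one_hot` is ours for this example; in practice a library utility such as Keras's `to_categorical` can do the same job.

```python
# Class indices as used in the dataset (0=Angry ... 6=Neutral).
CLASSES = ["Angry", "Disgust", "Fear", "Happy", "Sad", "Surprise", "Neutral"]


def one_hot(label, num_classes=len(CLASSES)):
    """Return a list of 0s with a single 1 at position `label`."""
    if not 0 <= label < num_classes:
        raise ValueError(f"label {label} is outside 0..{num_classes - 1}")
    vector = [0] * num_classes
    vector[label] = 1
    return vector


encoded_happy = one_hot(CLASSES.index("Happy"))
print(encoded_happy)  # [0, 0, 0, 1, 0, 0, 0]
```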
Now, the data is ready to be used to train the model.
Architecture: The architecture is built using the Keras framework.
Input: 48×48 grayscale image
Output: The probability of each expression class
The network architecture comprises:
- 5 convolutional layers
- 3 sub-sampling layers and
- 1 fully-connected layer.
Layers:
- The first layer of the CNN is a convolutional layer that applies a 3 × 3 convolution kernel and outputs 64 images of 48 × 48 pixels. This is followed by a sub-sampling layer that uses max-pooling (with kernel size 3 × 3) to reduce each side of the image to a third of its length.
- The second convolutional layer has an output of 64 images of 16 × 16 pixels, followed by a sub-sampling layer with kernel size 2 × 2.
- The third layer is the same as the second.
- The fourth convolutional layer outputs 128 images of size 8 × 8 pixels and uses max pooling with kernel 2 × 2.
- The fifth layer is the same as the fourth.
- The output from the last convolutional layer is passed to a fully connected hidden layer. This layer maps 1024 nodes to 7 output nodes, one for each expression, each giving the probability of that expression.
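The image sizes quoted above can be checked with a short shape trace. This is a sketch under stated assumptions, not the actual Keras model: we assume "same" padding for every convolution, pooling with a stride equal to its kernel size, and the three sub-sampling layers placed after the first, third, and fifth convolutions, which is one reading that reproduces the sizes listed.

```python
def conv_same(shape, filters):
    """3x3 convolution with 'same' padding: spatial size kept, depth = filters."""
    height, width, _ = shape
    return (height, width, filters)


def max_pool(shape, k):
    """Non-overlapping k x k pooling (stride k): each side divided by k."""
    height, width, channels = shape
    return (height // k, width // k, channels)


# (name, operation, argument); the pooling placement is an assumption.
layers = [
    ("conv1", conv_same, 64),
    ("pool1", max_pool, 3),
    ("conv2", conv_same, 64),
    ("conv3", conv_same, 64),
    ("pool2", max_pool, 2),
    ("conv4", conv_same, 128),
    ("conv5", conv_same, 128),
    ("pool3", max_pool, 2),
]

shape = (48, 48, 1)  # 48x48 grayscale input
trace = [("input", shape)]
for name, op, arg in layers:
    shape = op(shape, arg)
    trace.append((name, shape))

flattened = shape[0] * shape[1] * shape[2]
for name, s in trace:
    print(f"{name:6s} {s}")
print("flattened", flattened)
```

Under these assumptions the last pooling layer produces a 4 × 4 × 128 volume, which is flattened before it reaches the fully connected part of the network.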
Hyper-parameter Tuning:
Multiple values for the hyper-parameters were tested and the following gave us the best accuracies:
- Optimizer – Adam with a learning rate of 1e-3 and a decay of 3.125e-5
- Batch size – 32
- Loss – Categorical cross-entropy
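For readers new to the loss function, categorical cross-entropy for a single sample can be sketched in a few lines. The predicted probability vectors below are made-up values for illustration, and the `eps` guard is our addition to avoid taking the logarithm of zero.

```python
import math


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Loss for one sample: -sum(t * log(p)) over the classes."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


y_true = [0, 0, 0, 1, 0, 0, 0]  # one-hot label for class 3 (Happy)

# Hypothetical network outputs: a confident guess and a uniform guess.
confident = [0.01, 0.01, 0.01, 0.94, 0.01, 0.01, 0.01]
unsure = [1 / 7] * 7

loss_confident = categorical_cross_entropy(y_true, confident)
loss_unsure = categorical_cross_entropy(y_true, unsure)
print(round(loss_confident, 4), round(loss_unsure, 4))
```

Because the label is one-hot, only the probability assigned to the true class contributes, so a confident correct prediction gives a loss close to 0 while a uniform guess over 7 classes gives ln(7).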
Results:
The ratio of samples in the training set, validation set, and test set is 6.8 : 2.9 : 0.2.
The CNN model achieved a test accuracy of 85.15%.
2. SVM
A Support Vector Machine (SVM) finds hyperplanes in high-dimensional space to classify the data points.
Here H3 is the best hyperplane
Dataset:
We use a different dataset for our SVM model. It contains 1071 images, split into 803 images for training and 268 for testing, and has 6 classes (Anger, Disgust, Happiness, Sadness, Surprise, Neutral).
The previous dataset was very skewed, which could bias the predictions, so we chose a much less skewed dataset.
Pre-processing:
In deciding the facial expression, the points on the eyebrows, eyes, nose, lips, and jawline are of the utmost importance. All these points are extracted and referenced to the nose tip. The distances of these points from the nose tip (green lines) are used as features for the SVM model.
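The feature computation can be sketched as follows. The landmark coordinates are hypothetical values chosen for the example; in the real pipeline they come from a facial landmark detector, which is not shown here.

```python
import math


def nose_distances(landmarks, nose_tip):
    """Euclidean distance from each (x, y) landmark to the nose tip."""
    nose_x, nose_y = nose_tip
    return [math.hypot(x - nose_x, y - nose_y) for x, y in landmarks]


# Hypothetical pixel coordinates, for illustration only.
nose_tip = (50, 60)
landmarks = [
    (20, 20),   # left eyebrow point
    (80, 20),   # right eyebrow point
    (50, 90),   # lip point
    (50, 100),  # jawline point
]

features = nose_distances(landmarks, nose_tip)
print(features)  # [50.0, 50.0, 30.0, 40.0]
```

Each face is then represented by one such list of distances, which serves as the input vector for the SVM.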
Model architecture:
Scikit-learn is used to implement the SVM with C=1 and a linear kernel.
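To show what a linear kernel and the C parameter mean, here is a minimal pure-Python sketch of a binary linear SVM's decision rule and soft-margin objective. It is not the Scikit-learn implementation: the weights, bias, and toy points are hypothetical, and a multi-class problem like ours is handled by combining several such binary classifiers.

```python
def decision(w, b, x):
    """Signed score w.x + b; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, b, x):
    """Class +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision(w, b, x) >= 0 else -1


def soft_margin_objective(w, b, X, y, C=1.0):
    """0.5 * ||w||^2 + C * sum of hinge losses, the quantity a linear SVM minimizes."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge = sum(max(0.0, 1.0 - yi * decision(w, b, xi)) for xi, yi in zip(X, y))
    return margin_term + C * hinge


# Toy 2-D data and a hand-picked (not trained) hyperplane.
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (1.0, 0.0)]
y = [1, 1, -1, -1]
w, b = (0.5, 0.5), -1.5

predictions = [predict(w, b, x) for x in X]
objective = soft_margin_objective(w, b, X, y, C=1.0)
print(predictions, objective)  # [1, 1, -1, -1] 0.75
```

A larger C penalizes points that fall inside the margin more heavily, while a smaller C tolerates them in exchange for a wider margin.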
Results:
The test accuracy was found to be 61.69%.
Conclusion:
The CNN gives us the better accuracy of 85.15%. It is worth noting that it is also more complex and computationally expensive than the SVM model.
Sneha Bahl
Data Scientist Intern