Being able to predict and understand human emotions is important in many areas, such as education, games, and movies. In education in particular, it is crucial to detect when a learner is in a bad mood, because this can reduce the learning gain. We recently conducted a large user study in which we collected emotion ratings and video recordings of users while they solved math tasks (active part) and looked at pictures (passive part). In this thesis, we analyze this dataset to build predictive models of user emotions from video recordings (facial expression, eye gaze, head movement) and investigate whether such a model generalizes across domains (pictures and math).
The main task of the thesis is to develop a data-driven model that predicts a person's emotional state from facial expression, eye gaze, head movement, and distance to the screen. In addition, we investigate whether the resulting model generalizes across domains (from pictures to math and vice versa). Finally, an analysis of the best-performing models should provide insights into how to improve the prediction further. The four main tasks of this project are
The participants were recorded with two different video cameras: the front camera of the tablet and an external GoPro. Each participant looked at 40 pictures and solved around 60 math tasks. After each image and math task, the participant rated their emotion on two scales, valence and arousal, each with a value between 1 (low) and 9 (high). We first analyzed the log files with respect to the ratings, the delay in ratings, the change of ratings, and the confidence of the extracted features. We then defined a valid timestamp synchronization between the two video sources and the underlying database. Some smaller adjustments were made to the videos to ensure optimal facial recognition by the third-party libraries (Affectiva and OpenFace). Based on these frameworks, we extracted the following facial features:
- Action Units
  - Statistics on single action units
  - Statistics on combined action units
- Eye Aspect Ratio (EAR) and blinks
- Mouth Aspect Ratio (MAR)
- Eye-Gaze features
  - On a single axis
  - In the 3D space
- Statistics of the extracted fidgeting index
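As an illustration of one of the listed features, the eye aspect ratio is commonly defined over six eye landmarks as the sum of the two vertical eyelid distances divided by twice the horizontal eye width. The sketch below follows that common definition. The landmark ordering, the function names, and the blink threshold of 0.2 are assumptions made for illustration and are not taken from the thesis pipeline.

```python
from math import dist


def eye_aspect_ratio(eye):
    """Return the EAR for six (x, y) landmarks ordered p1..p6.

    Assumed ordering: p1 and p4 are the horizontal eye corners,
    (p2, p6) and (p3, p5) are the upper/lower eyelid pairs.
    """
    p1, p2, p3, p4, p5, p6 = eye
    vertical = dist(p2, p6) + dist(p3, p5)
    horizontal = dist(p1, p4)
    return vertical / (2.0 * horizontal)


def count_blinks(ear_series, threshold=0.2, min_frames=2):
    """Count runs of at least `min_frames` consecutive frames below `threshold`.

    Both default values are illustrative assumptions.
    """
    blinks, run = 0, 0
    for ear in ear_series:
        if ear < threshold:
            run += 1
        else:
            if run >= min_frames:
                blinks += 1
            run = 0
    if run >= min_frames:  # a blink that is still ongoing at the end
        blinks += 1
    return blinks
```

A mouth aspect ratio can be computed analogously from mouth landmarks; which landmarks to use depends on the landmark model.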
We normalized all features with the features extracted from a baseline video. To find the optimal baseline from a 7-minute video, we analyzed different statistics and performed various tests. We also analyzed the feature correlations to ensure the correct behavior of some estimators.
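The text does not state which normalization was used. One plausible reading, shown here purely as an assumption, is a per-feature z-score relative to the mean and standard deviation observed in the baseline video. The feature name `AU12` and the function names in the sketch are hypothetical.

```python
from statistics import mean, pstdev


def baseline_stats(baseline_frames):
    """Compute per-feature (mean, std) over the baseline frames.

    `baseline_frames` is a list of dicts mapping feature name -> value.
    """
    stats = {}
    for name in baseline_frames[0]:
        values = [frame[name] for frame in baseline_frames]
        stats[name] = (mean(values), pstdev(values))
    return stats


def normalize(frame, stats, eps=1e-8):
    """Express each feature as a z-score relative to the baseline.

    `eps` avoids a division by zero for features that are constant
    in the baseline. The z-score itself is an assumed choice.
    """
    return {
        name: (value - stats[name][0]) / (stats[name][1] + eps)
        for name, value in frame.items()
    }


# Hypothetical usage with a single feature.
stats = baseline_stats([{"AU12": 0.0}, {"AU12": 2.0}])
normalized = normalize({"AU12": 3.0}, stats)
```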
We wrote a learning pipeline to extract the frames relative to a given point of interest. For the pictures, we extracted the frames relative to the moment the image was shown, and for the math tasks, we extracted the frames relative to the end of the math task. In addition, a duration could be specified to obtain only a defined number of frames. With this extraction, we first searched for the optimal shift (position relative to the event) and size (duration) of the window. For the shift, we used values from -5 to 5 seconds, and for the duration, we used values between 0.5 and 10 seconds. The labels consisted of low (1-3) and high (7-9) valence and low (1-3) and high (7-9) arousal. Even without further hyperparameter optimization, a clear trend emerged: the most informative frames lie around the submission of the current affective state, for pictures as well as for math tasks. For pictures, we conclude that the strongest influence on the facial expression is not the picture itself but the moment when the participant compares the image with personal experience and memory in order to rate it. For math tasks, the strongest influence is the success message, which informs the participant about the correctness of the given solution.
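The window extraction and the labeling can be sketched as follows. This is a minimal illustration, assuming that the window starts at the event time plus the shift and that ratings in the middle range (4-6) are left out of the binary task; neither detail is stated explicitly above, and all function names are hypothetical.

```python
def frames_in_window(frames, event_time, shift, duration):
    """Return the frames with a timestamp in [start, start + duration).

    `frames` is a list of (timestamp_seconds, features) tuples and
    `start` is assumed to be `event_time + shift`.
    """
    start = event_time + shift
    end = start + duration
    return [frame for frame in frames if start <= frame[0] < end]


def binary_label(rating):
    """Map a 1-9 rating to 'low' (1-3) or 'high' (7-9).

    Ratings from 4 to 6 return None, i.e. they are assumed to be
    excluded from the two-class problem.
    """
    if 1 <= rating <= 3:
        return "low"
    if 7 <= rating <= 9:
        return "high"
    return None
```

Searching for the best window then amounts to repeating the training and evaluation for each candidate pair of shift and duration and comparing the resulting scores.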
Once we had found the optimal window, we ran a hyperparameter optimization with a random search over several parameters of the five most promising estimators. In the end, we reached an accuracy of around 0.8 for predicting valence, whereas current solutions achieve only slightly above chance level. As the participants were more involved during the math tasks (the active part), the performance there was also marginally better. Arousal prediction performed slightly worse for the pictures (around 0.75) and considerably better for the math tasks (0.9). The high performance for the math tasks is probably because many participants used only one class in their ratings throughout the experiment.
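The estimators, parameter ranges, and tooling of the random search are not named above. The following library-free sketch only illustrates the idea: draw a fixed number of random configurations, score each one, and keep the best. The parameter names and ranges in `space` are hypothetical, and in practice `score_fn` would typically return a cross-validated accuracy.

```python
import random


def random_search(param_space, score_fn, n_iter=20, seed=0):
    """Sample `n_iter` configurations and return (best_score, best_params).

    `param_space` maps a parameter name to a sampler that takes a
    `random.Random` instance and returns one value.
    """
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for _ in range(n_iter):
        params = {name: sample(rng) for name, sample in param_space.items()}
        score = score_fn(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params


# Hypothetical search space; names and ranges are illustrative only.
space = {
    "n_estimators": lambda rng: rng.randint(50, 500),
    "max_depth": lambda rng: rng.choice([None, 5, 10, 20]),
    "learning_rate": lambda rng: 10 ** rng.uniform(-3, 0),  # log-uniform
}
```

Libraries such as scikit-learn provide the same idea ready-made (for example `RandomizedSearchCV`); whether the thesis used it is not stated here.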
We detected a few possible problem sources during an analysis of the misclassified samples:
- Little Facial Expression
- Inconsistent Rating
- Noise in Extracted Features
To verify the extracted features and the approach, we applied the same pipeline to the RAVDESS dataset. For this dataset, 24 professional actors were recorded while acting out the emotion specified by the given label. In contrast, the participants of the ETH dataset labeled, based on their own understanding, an emotion elicited by external stimuli. Another difference is that the RAVDESS labels consist of discrete emotions rather than the valence and arousal space. For the pipeline with the RAVDESS dataset, we used all eight given labels, and the accuracy was around 0.95. This result made us confident that the overall pipeline works and that we extract useful features. For the ETH dataset, the limitations listed above probably lead to the lower accuracy.
When exercises are solved at home on a tablet, the software has to rely on the built-in camera, which most likely films the ceiling most of the time. With a small mirror attached to the tablet and our video restoration pipeline to reconstruct the face, we improved the facial recognition so that it is valid almost all of the time. Without these adjustments, only a few sequences from the front camera would have been usable.
We introduced and implemented the following steps for the facial restoration pipeline with Python and OpenCV:
- Mirror Handling: Convert the raw input of the video source to have a consistent orientation of the face.
- Face Detection: With some parts of the face occluded in the video after the first step, many face detection algorithms had difficulty detecting the face at all. In this step, the face is inpainted with the most recent information, and the face detection is run afterwards. As a result, twice as many frames became valid and could be used for the feature extraction.
- Face Restoration: In this step, the face is restored based on the information of the previous face detection.
  - We used a generative neural network (inpainting model).
  - We trained the inpainting model on the CelebA-HQ dataset with the expected occlusions.
  - Based on the face detection, the correct input was generated for the inpainting model, and the output could be used to restore the frame.
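The idea of a restoration pipeline built from configurable steps can be sketched as below. This is a dependency-free illustration, not the actual implementation: frames are represented as nested lists of pixel values, only the mirror-handling step is shown, and the function names are hypothetical. With OpenCV, a horizontal flip of a frame would correspond to `cv2.flip(frame, 1)`.

```python
def flip_frame(frame, horizontal=True, vertical=False):
    """Flip a frame given as a list of pixel rows.

    Which flip the mirror setup requires is an assumption left to
    the caller via the two flags.
    """
    rows = frame[::-1] if vertical else frame
    if horizontal:
        return [row[::-1] for row in rows]
    return [list(row) for row in rows]


def run_pipeline(frames, steps):
    """Apply each (function, kwargs) step in order to every frame."""
    for frame in frames:
        for func, kwargs in steps:
            frame = func(frame, **kwargs)
        yield frame


# Hypothetical per-video configuration: one list of steps per video.
steps = [(flip_frame, {"horizontal": True, "vertical": False})]
restored = list(run_pipeline([[[1, 2], [3, 4]]], steps))
```

Further steps, such as the inpainting before face detection and the final face restoration, would be added to `steps` as additional `(function, kwargs)` pairs.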
With this restoration pipeline, we restored all front-camera videos for each participant and applied the same machine learning pipeline as for the GoPro videos. The resulting models performed remarkably similarly, which means that further experiments can use the tablet alone, without an external video source.
My finished tasks (Summary)
- Analysis of a big dataset:
  - Protocols: 29.9 GB
  - Video files: 1.44 TB
  - Extracted features from OpenFace: 46.0 GB
- Feature extraction:
  - Apply existing frameworks (Affectiva and OpenFace).
  - Build a smooth feature extraction pipeline that can be applied to many videos.
  - Search and analyze existing studies/papers for useful features.
  - Analyze the significance of the features.
- Machine Learning:
  - Implementation of a learning pipeline that considers the window of interest.
  - Hyperparameter optimization
- Restoration Pipeline:
  - Implement a pipeline to edit video files easily. For each video, different steps and parameters can be defined.
  - Include OpenCV functionality for single steps.
- Inpainting Model:
  - Training of an existing implementation of an inpainting model based on Keras/TensorFlow. I used the CelebA-HQ dataset (30 GB) and the expected occlusions of our problem. Training took eight days on my notebook with an Nvidia GTX 1070.
  - Include the model in the restoration pipeline, which requires a correct extraction of the video part for the model input.