Aditya Sundar - Waseda University
Table of Contents
- Table of Contents
- 1. Introduction
- 2. Methodology
- 2.1 Video Downloading and Processing
- 2.2 Audio Classification and Background Noise Removal
- 2.2.1 Audio Classification
- 2.2.2 Background Noise Removal and Speech Isolation
- 2.3 Unique Face Detection and Video Trimming
- 2.4 Face Cropping and Clip Generation
- 2.4.1 Optional Background Removal for Subject Isolation
- 2.5 Face Pose Estimation, Direction Classification, and Emotion Analysis
- 2.5.1 Face Detection and Landmark Estimation
- 2.5.2 Pose Smoothing and Direction Classification
- 2.5.3 Emotion Classification
- 2.5.4 Display and Annotation
- 3. Results
- 3.1 Audio Classification
- 3.2 Face Cropping and Clip Generation
- 3.2.1 Background Removal
- 3.3 Face Pose Estimation and Emotion Analysis
- 3.3.1 Pose Estimation Summary
- 3.3.2 Emotion Classification Summary
- 3.4 Overall Final Result
- 4. Next Steps and Future Directions
1. Introduction
The rise of video-based generative models, such as those used in facial animation and speech generation, has created a need for robust, preprocessed video datasets. While datasets like CelebV-HQ provide a standardized collection of facial clips for training, they still require manual preparation and are limited to their specific source material. This presents a challenge for anyone who wants to quickly build custom datasets from diverse and dynamic video sources.
The goal of this project was to create a pipeline that automates the preprocessing of video and audio data from a given video dataset, extracting key information such as unique faces, emotional states, and relevant speech segments that may be useful for training video generation models.
The project aimed to preprocess and analyze video and audio data to:
- Automatically classify and recognize unique faces,
- Detect facial emotions and head poses across time,
- Classify audio for background music and isolate speech if required, and
- Trim and refine videos for use in generative models.
The overarching goal was to automate the process of building a dataset for training video-based AI models that could later generate speaking faces. The following article provides a summary of the methods employed so far and the progress made in attempting to build a comprehensive video preprocessing tool.
2. Methodology
The following flowchart illustrates the entire pipeline used to preprocess and analyze video data. The pipeline begins with video retrieval, followed by a series of steps designed to detect and extract relevant features from both the audio and visual content. Each stage plays a crucial role in preparing the dataset for further use in video generation models:
- Audio Classification: The process begins by classifying the audio component from the video to identify the presence of speech and isolate it from background noise if necessary.
- Face Detection and Unique Face Identification: This next step involves detecting faces in each frame of the video. If more than a single face is detected, the video is trimmed to ensure that only the desired face remains in the clips to be processed.
- Face Cropping and Clip Generation: Once the face is isolated, each frame is processed to focus only on the face, and short clips are generated. An optional step for background removal can also be applied to further isolate the subject.
- Pose Estimation and Direction Classification: Using facial landmarks, the subject’s head pose is estimated. The system smooths the pose angles to classify the overall direction the subject is facing in each clip.
- Emotion Classification: Emotions are detected in each frame, and the inferred emotions are normalized to produce reliable results. This data is aggregated across clips and saved for further analysis.
All these methods are explained in detail in the following sections.
2.1 Video Downloading and Processing
In order to obtain simple videos for testing, the yt-dlp Python library was used to download YouTube videos programmatically. The library allows for easy configuration to specify the desired video format and save location. The code used fetches the best video format and stores it in a structured directory for further analysis.
import os
import yt_dlp

def download_video(url, output_dir):
    ydl_opts = {
        'format': 'best',
        'outtmpl': os.path.join(output_dir, '%(title)s-%(id)s.%(ext)s'),
        'quiet': False,
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.download([url])
The function retrieves the video in the best available format and stores it in the specified directory with the appropriate name format. Once the video(s) were downloaded, the frames were extracted for further facial and audio analysis.
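For completeness, the frame extraction mentioned above is straightforward with OpenCV; the following is a minimal sketch (the function name and sampling parameter are illustrative, not part of the original pipeline):

import cv2

def extract_frames(video_path, every_n=1):
    """Yield every n-th frame of a video as a NumPy array (BGR)."""
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % every_n == 0:
            yield frame
        index += 1
    cap.release()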
2.2 Audio Classification and Background Noise Removal
In the case of video data, the presence of background noise, such as ambient sounds or music, can interfere with the accuracy of speech-based analysis or ML models. To ensure clean and usable audio data, the next stage involved identifying the presence of background noise by detecting and classifying different audio components.
2.2.1 Audio Classification
The audio extracted from the video was processed using a pre-trained audio classification model, the Audio Spectrogram Transformer (AST). The audio is converted into a spectrogram, after which a Vision Transformer-style model classifies it across a large set of audio event labels.
import librosa
import torch
import torch.nn.functional as F

def load_and_process_audio(audio_path, feature_extractor):
    # Load audio with librosa at 16 kHz and convert to mono
    speech, rate = librosa.load(audio_path, sr=16000, mono=True)
    print(f"Loaded waveform shape: {speech.shape}, Sample rate: {rate}")
    # Convert the NumPy array to a tensor
    waveform = torch.tensor(speech)
    print(f"Tensor waveform shape: {waveform.shape}")
    # Prepare the waveform for the model
    inputs = feature_extractor(waveform, return_tensors="pt", padding=True, sampling_rate=16000)
    return inputs

def classify_audio(audio_inputs, model):
    with torch.no_grad():
        outputs = model(**audio_inputs)
        logits = outputs.logits
    probabilities = F.softmax(logits, dim=-1)
    class_probabilities = [prob.item() for prob in probabilities[0]]
    # Pair each probability with its human-readable label from the model config
    predicted_labels = [(model.config.id2label[idx], prob) for idx, prob in enumerate(class_probabilities)]
    return predicted_labels
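For context, the two functions above assume a feature extractor and model loaded from the Hugging Face checkpoint listed in the references; a minimal usage sketch might look like the following (the audio file path and the top-5 printout are purely illustrative):

from transformers import AutoFeatureExtractor, ASTForAudioClassification

model_name = "MIT/ast-finetuned-audioset-10-10-0.4593"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = ASTForAudioClassification.from_pretrained(model_name)
model.eval()

inputs = load_and_process_audio("extracted_audio.wav", feature_extractor)  # hypothetical path
predicted_labels = classify_audio(inputs, model)
for label, prob in sorted(predicted_labels, key=lambda x: x[1], reverse=True)[:5]:
    print(f"{label}: {prob * 100:.2f}%")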
Here are three example audio classification results:
- A YouTube video with ordinary commentary accompanied by background music:
Classification Probabilities:
Speech: 50.28%
Music: 37.20%
Drum kit: 1.33%
Drum: 1.24%
Musical instrument: 1.21%
...
- A YouTube video of a live stage performance:
Classification Probabilities:
Music: 46.54%
New-age music: 7.62%
Ambient music: 6.93%
Mechanisms: 3.58%
Inside, small room: 1.89%
Speech: 1.50%
...
- A YouTube video of an interview from a news channel:
Classification Probabilities:
Speech: 82.64%
Female speech, woman speaking: 7.36%
Narration, monologue: 4.10%
Conversation: 3.69%
Inside, small room: 0.42%
...
By comparing the results from various other examples, a threshold of around 20% on the music-related scores was chosen to decide whether a video contains background noise.
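As a rough illustration of this heuristic (the label name and threshold check are a simplified assumption, not the exact logic used):

def needs_speech_isolation(predicted_labels, threshold=0.20):
    """Flag a clip for source separation if the 'Music' label exceeds the threshold."""
    scores = dict(predicted_labels)  # label names follow the AudioSet vocabulary
    return scores.get("Music", 0.0) > threshold

Under this check, the commentary example above (Music at ~37%) would be flagged for separation, while the news-interview example (where music does not appear among the top labels) would not.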
2.2.2 Background Noise Removal and Speech Isolation
To then isolate the vocals/speech from the audio track, a music source separation model, Demucs, was used. Although the model was originally designed to separate songs into stems, applying it to video commentary or interviews with background music also yielded valuable results.
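The separation can be driven from Python by invoking the Demucs command-line interface. Below is a minimal sketch; the two-stem mode and the output layout (including the default model name) are assumptions that depend on the installed Demucs version:

import subprocess
from pathlib import Path

def isolate_vocals(audio_path, output_dir="separated"):
    """Split a track into 'vocals' and 'no_vocals' stems and return the vocals path."""
    subprocess.run(
        ["python", "-m", "demucs", "--two-stems=vocals", "-o", output_dir, audio_path],
        check=True,
    )
    # Demucs writes stems under <output_dir>/<model name>/<track name>/;
    # "htdemucs" is assumed to be the default model name here.
    return Path(output_dir) / "htdemucs" / Path(audio_path).stem / "vocals.wav"

The resulting vocals stem can then be re-run through the classifier above to verify that the music component has been suppressed.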
In the previous example for a YouTube video with ordinary commentary accompanied by background music, the following are the classification results after music source separation:
Classification Probabilities:
Speech: 86.69%
Speech synthesizer: 3.35%
Music: 2.28%
Narration, monologue: 1.32%
Male speech, man speaking: 0.78%
...
As observed, the confidence score for music in the audio dropped drastically, demonstrating the effectiveness of music source separation in enhancing the quality of the extracted speech.
By reducing noise and isolating speech, this process attempts to ensure that the audio is clean and suitable for further analysis in subsequent stages, such as emotion recognition or even speech-based model training.
2.3 Unique Face Detection and Video Trimming
Videos taken directly from YouTube or other datasets may contain more than a single person in a given frame at a particular time. This could introduce noise or irrelevant data if the goal is to analyze and train models on a single individual at a time. To ensure that only a given individual is visible in all frames, it is essential to detect and isolate the presence of the desired face, effectively trimming the video down to sections where only the target individual is featured.
This method involved using a pre-trained face detection model, such as the face_recognition API, which works by analyzing each frame of the video and detecting the presence of multiple faces. By detecting all faces in a frame, the system generates an encoding for each face and compares it with the encoding of the reference image of the desired individual. Frames where the desired face is present are retained, while the other frames are excluded.
This process results in a trimmed video where only the target individual remains visible throughout all the frames, thus reducing the amount of irrelevant data to be processed in subsequent steps. This step was crucial for ensuring that later stages, such as pose estimation and emotion detection, would only be applied to the relevant individual, streamlining the overall analysis process.
import face_recognition
import moviepy.editor as mpe
import numpy as np
from tqdm import tqdm

def crop_video_to_face_frames(reference_image_path, video_path, output_video_path):
    reference_image = face_recognition.load_image_file(reference_image_path)
    reference_face_encoding = face_recognition.face_encodings(reference_image)[0]
    video = mpe.VideoFileClip(video_path)
    clips = []
    current_clip_start = None
    for time in tqdm(np.arange(0, video.duration, 1.0 / video.fps), desc="Processing Video", unit="frame"):
        frame = video.get_frame(time)
        # MoviePy frames are already RGB; make the array contiguous for face_recognition
        rgb_frame = np.ascontiguousarray(frame)
        face_locations = face_recognition.face_locations(rgb_frame)
        face_encodings = face_recognition.face_encodings(rgb_frame, face_locations)
        face_found = any(face_recognition.compare_faces([reference_face_encoding], face, tolerance=0.6) for face in face_encodings)
        if face_found:
            if current_clip_start is None:
                current_clip_start = time
        else:
            if current_clip_start is not None:
                clips.append(video.subclip(current_clip_start, max(current_clip_start, time)))
                current_clip_start = None
    # Ensure the last segment is added if the video ends with a face
    if current_clip_start is not None:
        clips.append(video.subclip(current_clip_start, video.duration))
    if clips:
        final_clip = mpe.concatenate_videoclips(clips)
        final_clip.write_videofile(output_video_path, codec='libx264', audio_codec='aac')
face_recognition was used because this pipeline is intended for initial testing: it provides an easy-to-use Python interface and bundles pre-trained models known for accuracy and reliability. Additionally, the library's low computational overhead made it possible to test different features and iterate quickly in the early stages without significant processing delays.
Generally, if the provided video only contains a single predominant face, then there is no need to execute this step.
2.4 Face Cropping and Clip Generation
For effective model training, it is often necessary to focus just on the face of the subject in a video, removing any unnecessary background information that might introduce noise into the analysis. To achieve this, the project included a method for creating short face-only video clips from a given full-length video.
The goal of this face cropping process was to generate 3-10 second video clips of only the subject’s face, allowing for a model to be trained on cleaner data. The focus on the face ensures that the training data remains consistent, eliminating other potential distractions from background elements or other individuals that might appear in the video.
To crop the face from each frame, a pre-trained YuNet face detection model was used. The process functions in the following manner:
- Face Detection: For each frame, the model identifies all the possible faces and selects the largest face in the frame as a subject.
- Cropping: After detecting the largest face, the corresponding bounding box is used to crop the image, isolating just the face from the rest of the frame.
- Clip Generation: The cropped face images are used to generate short clips of 3 to 10 seconds. This ensures that the processed video is focused exclusively on a single subject’s face throughout the clip.
# Inside the per-frame loop: (x, y, w, h) is the bounding box of the largest detected face
face_size = w * h
largest_face_size = max(largest_face_size, face_size)
smallest_face_size = min(smallest_face_size, face_size)
# Build a square crop of side crop_size centred on the face, clamped to the frame borders
center_x, center_y = x + w // 2, y + h // 2
x1 = max(0, center_x - crop_size // 2)
y1 = max(0, center_y - crop_size // 2)
x2 = min(frame.shape[1], x1 + crop_size)
y2 = min(frame.shape[0], y1 + crop_size)
cropped_face = frame[y1:y2, x1:x2]
if cropped_face.shape[0] != crop_size or cropped_face.shape[1] != crop_size:
    cropped_face = cv2.resize(cropped_face, (crop_size, crop_size))
face_sequence.append(cropped_face)
… # Save the frame and append it to the new video
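For reference, the largest-face selection that feeds the cropping snippet above could be implemented with OpenCV's YuNet interface roughly as follows; the ONNX model file name and detection thresholds are assumptions, not the exact values used in the pipeline:

import cv2

# Assumed model file and thresholds (score, NMS, top-k), for illustration only
detector = cv2.FaceDetectorYN.create(
    "face_detection_yunet_2023mar.onnx", "", (320, 320), 0.7, 0.3, 5000
)

def largest_face(frame):
    """Return (x, y, w, h) of the largest detected face in a BGR frame, or None."""
    frame_h, frame_w = frame.shape[:2]
    detector.setInputSize((frame_w, frame_h))
    _, faces = detector.detect(frame)
    if faces is None:
        return None
    # Each detection row is [x, y, w, h, landmarks..., score]; keep the biggest box
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])[:4].astype(int)
    return x, y, w, h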
The face cropping method not only provides cleaner input data for model training but also significantly reduces the size of the video files, making it easier to handle and process multiple clips. By focusing on shorter clips (3-10 seconds) rather than the full video, the project takes advantage of more manageable segments that can be processed and analyzed more efficiently.
This approach ensures that the face detection and cropping processes are performed at scale while maintaining high accuracy and precision. The resulting face-cropped clips are high-quality and ready for further analysis in tasks such as emotion detection, facial landmark extraction, or video-based speech synthesis.
2.4.1 Optional Background Removal for Subject Isolation
In certain scenarios, it may be beneficial to remove the background from video frames, leaving only the subject visible. This step could be valuable in removing any additional distractions and noise from the surroundings.
In this step, rembg, a tool built on pre-trained deep learning models for background removal, is used to process each frame of the video. It effectively separates the subject from the background by generating a mask that removes the irrelevant areas.
import cv2
import numpy as np
from PIL import Image
from rembg import remove, new_session

def process_frame_batch(frames):
    providers = ['CPUExecutionProvider']  # Use 'CUDAExecutionProvider' for GPU acceleration
    session = new_session(providers=providers)
    processed_frames = []
    for frame in frames:
        pil_image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        output = remove(pil_image, session=session)
        processed_frame = cv2.cvtColor(np.array(output), cv2.COLOR_RGBA2BGRA)
        processed_frames.append(processed_frame)
    return processed_frames
Here, each frame is converted into a PIL image, processed by rembg's remove function to strip the background, and returned with only the subject remaining. The result is a series of video frames where the background has been removed, leaving only the subject in focus.
This step was treated as largely optional, since it could oversimplify the visual data used to train a model and reduce its effectiveness in complex environments. The main reason, however, was the computational cost: given the project's resource limitations, this step was not deemed strictly necessary.
2.5 Face Pose Estimation, Direction Classification, and Emotion Analysis
In this stage of the video analysis, the goal was to detect the head pose and classify the direction the subject faces in each video frame. Additionally, emotion classification was performed based on the face to identify the most likely emotion in each frame.
2.5.1 Face Detection and Landmark Estimation
The face detection process is initialized using a face detection model (YuNet) that localizes the face in each frame. Once a face is detected, facial landmarks (68 key points) are extracted using the face_alignment library. This library was used as it provides 68 facial landmarks which include key features like eyes, nose, mouth, and jawline, which are crucial for calculating the head pose.
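As an illustrative sketch (not the exact code used in the pipeline), extracting the 68 landmarks with the face_alignment library looks roughly like this; note that the enum name differs across library versions:

import face_alignment

# Newer versions expose LandmarksType.TWO_D; older releases used LandmarksType._2D
fa = face_alignment.FaceAlignment(face_alignment.LandmarksType.TWO_D, device='cpu')

def get_face_landmarks(rgb_frame):
    """Return a (68, 2) array of landmark coordinates for the first detected face, or None."""
    landmarks = fa.get_landmarks(rgb_frame)
    return landmarks[0] if landmarks else None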
By modifying a pre-existing head-pose estimation method (yinguobing/head-pose-estimation; see References), the positions of these landmarks are compared against a pre-defined 3D facial model to estimate the yaw, pitch, and roll angles of the head; a simplified sketch of this approach follows the list below. These angles represent the orientation of the head in three-dimensional space:
- Yaw: The left-right rotation (turning left or right)
- Pitch: The up-down rotation (looking up or down)
- Roll: The tilt of the head (head tilt left or right)
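The sketch below illustrates the general solvePnP-based approach; the 3D model points, the camera approximation, and the six-landmark subset are simplified assumptions rather than the exact values used in the pipeline:

import cv2
import numpy as np

# Generic 3D reference points for six facial landmarks (assumed values,
# loosely based on common head-pose-estimation examples)
MODEL_POINTS = np.array([
    (0.0, 0.0, 0.0),           # nose tip
    (0.0, -330.0, -65.0),      # chin
    (-225.0, 170.0, -135.0),   # left eye, outer corner
    (225.0, 170.0, -135.0),    # right eye, outer corner
    (-150.0, -150.0, -125.0),  # left mouth corner
    (150.0, -150.0, -125.0),   # right mouth corner
], dtype=np.float64)

def estimate_head_pose(image_points, frame_size):
    """Estimate (yaw, pitch, roll) in degrees from a (6, 2) float array of
    2D landmark coordinates given in the same order as MODEL_POINTS."""
    frame_h, frame_w = frame_size
    focal_length = frame_w  # crude focal-length approximation
    camera_matrix = np.array([[focal_length, 0, frame_w / 2],
                              [0, focal_length, frame_h / 2],
                              [0, 0, 1]], dtype=np.float64)
    dist_coeffs = np.zeros((4, 1))  # assume no lens distortion
    _, rvec, _ = cv2.solvePnP(MODEL_POINTS, image_points, camera_matrix, dist_coeffs)
    rotation_matrix, _ = cv2.Rodrigues(rvec)
    # Decompose the rotation matrix into Euler angles (degrees) around x, y, z
    angles, *_ = cv2.RQDecomp3x3(rotation_matrix)
    pitch, yaw, roll = angles
    return yaw, pitch, roll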
2.5.2 Pose Smoothing and Direction Classification
To ensure smooth and consistent head pose classification, yaw, pitch, and roll values from multiple consecutive frames are stored in a buffer. This allows smoothing over recent values, avoiding sudden jumps or noise in the pose estimates. The average of the buffered values is used as the smoothed pose.
def smooth_pose_angles(self, yaw, pitch, roll):
    """Smooth the yaw, pitch, and roll values using a buffer."""
    self.yaw_buffer.append(yaw)
    self.pitch_buffer.append(pitch)
    self.roll_buffer.append(roll)
    smoothed_yaw = np.mean(self.yaw_buffer)
    smoothed_pitch = np.mean(self.pitch_buffer)
    smoothed_roll = np.mean(self.roll_buffer)
    return smoothed_yaw, smoothed_pitch, smoothed_roll

def classify_pose_majority(self):
    """Classify the majority head pose direction from buffered yaw and pitch values."""
    yaw_directions = [("Right" if yaw > 10 else "Left" if yaw < -10 else "Forward") for yaw in self.yaw_buffer]
    pitch_directions = [("Up" if pitch > 10 else "Down" if pitch < -10 else "") for pitch in self.pitch_buffer]
    most_common_yaw = max(set(yaw_directions), key=yaw_directions.count)
    most_common_pitch = max(set(pitch_directions), key=pitch_directions.count)
    return f"{most_common_yaw} {most_common_pitch}"
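The buffers referenced above are not shown in the excerpt; they are presumably fixed-length structures such as deques created in the class constructor. The class name and buffer size below are assumptions for illustration:

from collections import deque

class PoseTracker:
    def __init__(self, buffer_size=15):
        # Fixed-length buffers so old frames automatically fall out of the average
        self.yaw_buffer = deque(maxlen=buffer_size)
        self.pitch_buffer = deque(maxlen=buffer_size)
        self.roll_buffer = deque(maxlen=buffer_size)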
The current code classifies head pose into general directions based on the smoothed yaw and pitch angles:
- Yaw:
- If the yaw is greater than 10°, the face is classified as “Looking Right”
- If the yaw is less than -10°, the face is classified as “Looking Left”
- Otherwise, the face is “Looking Forward”
- Pitch:
- If the pitch is greater than 10°, the face is classified as “Looking Up”
- If the pitch is less than -10°, the face is classified as “Looking Down”
By using both yaw and pitch, the majority head pose direction over recent frames is calculated to determine where the subject is predominantly looking throughout the video. This is essential for understanding the subject's focus relative to the camera or the surrounding context.
2.5.3 Emotion Classification
To further gain insight into a given video, emotion detection is applied to each frame using a pre-trained emotion classification model, facial_emotions_image_detection, from Hugging Face. The detected face is cropped, converted into a PIL image, and passed into the emotion classifier which outputs a set of probabilities for various emotions, such as happiness, sadness, fear, neutral, and anger.
def normalize_emotions(emotions):
    total_score = sum(emotion['score'] for emotion in emotions)
    if total_score == 0:
        return emotions
    return [{'label': emotion['label'], 'score': emotion['score'] / total_score} for emotion in emotions]
...
# Convert cropped_face (NumPy array) to a PIL image for the emotion classifier
cropped_face_pil = Image.fromarray(cropped_face)
# Run emotion detection for the cropped face
emotions = emotion_classifier(cropped_face_pil)
# Normalize emotion scores so they sum to 100%
normalized_emotions = normalize_emotions(emotions)
most_likely_emotion = normalized_emotions[0]['label']
most_likely_score = normalized_emotions[0]['score']
...
The emotion scores are normalized to ensure they sum to 100% for each frame, allowing us to identify the most dominant emotion in every frame of the video. These emotions are tracked and averaged across the entire video to provide a summary of the subject’s emotional state throughout the video. This can be valuable as it provides a contextual understanding of the subject, offering additional layers of meaning to the content.
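The emotion_classifier used in the snippet above is assumed to be a Hugging Face image-classification pipeline built on the checkpoint listed in the references; a minimal setup could look like this (by default the pipeline returns the top labels sorted by score, which is why the first normalized entry is treated as the most likely emotion):

from transformers import pipeline

emotion_classifier = pipeline(
    "image-classification",
    model="dima806/facial_emotions_image_detection",
)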
2.5.4 Display and Annotation
For additional information/verification, an annotated video is also generated with the following information (a minimal sketch of the overlay drawing follows the list):
- Yaw, pitch, and roll angles are drawn on each frame of the video, allowing visualization of the subject’s head orientation.
- The smoothed pose direction is also displayed.
- The most likely emotion and its confidence score are also shown.
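Such an overlay could be drawn on each frame with OpenCV roughly as follows; the text positions and formatting are illustrative, not the exact values used:

import cv2

def annotate_frame(frame, yaw, pitch, roll, direction, emotion, score):
    """Draw pose angles, the smoothed direction, and the top emotion onto a frame in place."""
    lines = [
        f"Yaw: {yaw:.1f}  Pitch: {pitch:.1f}  Roll: {roll:.1f}",
        f"Direction: {direction}",
        f"Emotion: {emotion} ({score:.0%})",
    ]
    for i, text in enumerate(lines):
        cv2.putText(frame, text, (10, 30 + 30 * i),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 0), 2)
    return frame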
3. Results
In this section, the results of applying the preprocessing and analysis pipeline to a selected example video are presented step by step, including intermediate outputs. The main goal was to evaluate the effectiveness of these techniques for segmenting and analyzing videos, with an emphasis on isolating facial regions.
The selected example video features a well-known actor discussing his movie. This video was chosen because of its brevity and the consistent presence of a single face in the frame, with the facial features clearly visible throughout, making it a suitable example to test the different pre-processing techniques.
3.1 Audio Classification
The audio classification process successfully identified the dominant presence of speech with minimal background noise. The high confidence in the speech-related categories indicates that the audio is primarily focused on the subject's dialogue, with no significant interference from environmental sounds. As a result, there was no need for additional speech isolation. Below are the classification probabilities:
Classification Probabilities:
Speech: 88.89%
Rustling leaves: 1.12%
Rustle: 0.74%
Outside, rural or natural: 0.59%
Male speech, man speaking: 0.58%
...
3.2 Face Cropping and Clip Generation
The face cropping and clip generation process was performed on the input video to extract shorter segments focused on the subject’s face. Given that the video contains only a single face throughout, the process efficiently detected and cropped the single face in every frame. Below is the summary of the results from this process:
{
"input_video": "Hacksaw Ridge Interview - Andrew Garfield (2016) - Drama-_sSeRemLpa4.mp4",
"total_frames": 6024,
"fps": 23.976023976023978,
"face_detection_rate": 98.2569721115538,
"avg_faces_per_frame": 1.0,
"largest_face_size": 20352,
"smallest_face_size": 13720,
"crop_size": 200,
"min_duration": 3,
"max_duration": 10,
"total_clips": 26,
"clips": [... description of each clip ...]
}
A total of 26 clips were generated from the original video, which is 4 minutes and 11 seconds long. The cropped clips provide a consistent view of the subject's face, isolating it for further analysis. By breaking down the video into shorter clips, the data becomes more manageable, allowing for easier processing in subsequent steps such as pose estimation or emotion classification.
3.2.1 Background Removal
As an optional step, background removal can be applied to isolate the subject in each clip. Below are the results of applying background removal to all the clips:
3.3 Face Pose Estimation and Emotion Analysis
The face pose estimation and emotion analysis were performed on each of the 26 clips generated during the face cropping stage. Below is a summary of the key findings, with selected representative clips showcasing the overall results.
3.3.1 Pose Estimation Summary
For each clip, the subject’s head pose was analyzed to determine the predominant orientation in terms of yaw, pitch, and roll. Most clips were either classified as ‘Looking Forward’ or ‘Looking Right’.
Here are two examples from two clips where the majority direction is different:
"video": "clip_Hacksaw Ridge Interview - Andrew Garfield (2016) - Drama-_sSeRemLpa4_s148.73_e158.70.mp4",
"average_pose_angles": {
"yaw": 0.6485263134542517,
"pitch": 4.067361513267401,
"roll": -84.15095088190242
},
"majority_direction": "Forward ",
"average_emotions": {
"happy": 0.27890650758955365,
"sad": 0.2207855023376018,
"angry": 0.21036808651447347,
"neutral": 0.16581701156305875,
"fear": 0.13855400448644115,
"disgust": 0.11735546569752066,
"surprise": 0.11485576324799043
}
"video": "clip_Hacksaw Ridge Interview - Andrew Garfield (2016) - Drama-_sSeRemLpa4_s45.34_e55.31.mp4",
"average_pose_angles": {
"yaw": 10.363011304670724,
"pitch": -1.2197863376722897,
"roll": -79.94518617480344
},
"majority_direction": "Right ",
"average_emotions": {
"angry": 0.279351381243619,
"happy": 0.20172418331974284,
"sad": 0.18258986628127324,
"neutral": 0.17996235064119007,
"disgust": 0.1612432542923294,
"fear": 0.15300598093552292
}
The current majority-direction classification after smoothing the angles appears to work, but it needs further tuning, as there are cases where the direction may be inaccurately classified.
This may require further refinement, either by improving the landmark-based calculation process or adjusting the classification thresholds after testing with a broader set of examples.
3.3.2 Emotion Classification Summary
Emotion classification was performed for each clip, with the inference results recorded for each frame. However, the results revealed that the model was never strongly confident in predicting any single dominant emotion for most frames. The classification model assigned similar probabilities to several emotions, indicating uncertainty in its predictions.
For instance, the following are the results for a randomly selected clip:
"average_emotions": {
"angry": 0.24729668899285473,
"happy": 0.22430150593786655,
"sad": 0.2051394251327069,
"neutral": 0.19470320521696338,
"fear": 0.13737549817498176,
"disgust": 0.11678542926529709,
"surprise": 0.09303550507150424
}
These average probabilities suggest that the model struggled to differentiate between emotional states confidently, with the highest probability barely reaching 25%.
This lack of confidence could stem from multiple factors such as the model’s limited ability to handle subtle expressions or the complexity of the subject’s expressions in the video making it difficult to classify.
3.4 Overall Final Result
The final output of the pipeline consists of a few output files and the generated video clips, as showcased below. The annotated clip provides a visual confirmation of the intermediate processing, while the main output includes the face-cropped clip (with background removed), a text file containing the audio classification output, and two JSON files containing detailed information about the clips. For this example, 26 short clips were generated and analyzed from the given video (251 seconds in length).
The two clips displayed correspond to the same portion of the video: clip_Hacksaw Ridge Interview - Andrew Garfield (2016) - Drama-_sSeRemLpa4_s194.44_e204.41.mp4
On the left is the final cropped clip with the background removed, and on the right is the intermediate annotated result showing facial landmarks, pose estimation, and emotion detection.
- Left: Final clip with face cropped and background removed.
- Right: Intermediate annotated clip displaying head pose and emotion information for reference.
4. Next Steps and Future Directions
The progress so far represents an initial attempt to create a video preprocessing pipeline that could be useful for training models in various domains, including speech recognition, video generation, and face generation. The methods employed, such as face cropping, pose estimation, and emotion classification, provide key information that can enhance training processes. However, several other classifications and improvements could be explored to extend the capabilities of the pipeline.
For instance, perhaps lip reading or action/gesture detection could be useful in providing further insight into a given video dataset. While these are just examples, such additions could increase the depth and robustness of the dataset for training more sophisticated models.
Within the current progress, several limitations had to be accounted for, such as having to run all models on a CPU, which influenced the choice of methods and models. As a result, the project relied heavily on pre-existing trained models that were not guaranteed to be fine-tuned for the specific tasks in this pipeline, which may have contributed to the inaccuracy of certain results, such as emotion classification. Fine-tuning these models or developing custom models could lead to much more accurate outcomes.
Additionally, emotion classification based solely on static facial images has proven to be insufficient. The model used demonstrated significant uncertainty, which is not an unexpected result: emotions are inherently complex to analyze, so other detection methods employing more powerful, fine-tuned emotion models should be considered.
Ultimately, this pipeline was intended to serve as a foundation for further development; the next steps would involve refining the techniques, improving model accuracy, and expanding the range of classifications that can be derived from video data.
References:
Adrianb. (n.d.). GitHub - 1adrianb/face-alignment: 2D and 3D Face alignment library built using pytorch. GitHub. https://github.com/1adrianb/face-alignment
Ageitgey. (n.d.). GitHub - ageitgey/face_recognition: The world’s simplest facial recognition API for Python and the command line. GitHub. https://github.com/ageitgey/face_recognition
CelebV-HQ. (n.d.). GitHub - CelebV-HQ/CelebV-HQ: [ECCV 2022] CelebV-HQ: A Large-Scale Video Facial Attributes Dataset. GitHub. https://github.com/CelebV-HQ/CelebV-HQ
Danielgatis. (n.d.). GitHub - danielgatis/rembg: Rembg is a tool to remove images background. GitHub. https://github.com/danielgatis/rembg
dima806/facial_emotions_image_detection · Hugging Face. (n.d.). https://huggingface.co/dima806/facial_emotions_image_detection
Facebookresearch. (n.d.). GitHub - facebookresearch/demucs: Code for the paper Hybrid Spectrogram and Waveform Source Separation. GitHub. https://github.com/facebookresearch/demucs
MIT/ast-finetuned-audioset-10-10-0.4593 · Hugging Face. (n.d.). https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593
Rotten Tomatoes Coming Soon. (2016, November 3). Hacksaw Ridge Interview - Andrew Garfield (2016) - drama [Video]. YouTube. https://www.youtube.com/watch?v=_sSeRemLpa4
Yinguobing. (n.d.). head-pose-estimation/pose_estimation.py · yinguobing/head-pose-estimation. GitHub. https://github.com/yinguobing/head-pose-estimation/blob/master/pose_estimation.py