arXiv:2102.11126v3 [cs.CV] 11 Mar 2021

Deepfake Video Detection Using Convolutional Vision Transformer

Deressa Wodajo
Jimma University
[email protected]

Solomon Atnafu
Addis Ababa University
[email protected]

Abstract

The rapid advancement of deep learning models that can generate and synthesize hyper-realistic videos known as Deepfakes, together with their ease of access, has raised concerns about possible malicious use. Deep learning techniques can now generate faces, swap faces between two subjects in a video, alter facial expressions, change gender, and alter facial features, to list a few. These powerful video manipulation methods have potential uses in many fields. However, they also pose a looming threat to everyone if used for harmful purposes such as identity theft, phishing, and scams. In this work, we propose a Convolutional Vision Transformer for the detection of Deepfakes. The Convolutional Vision Transformer has two components: a Convolutional Neural Network (CNN) and a Vision Transformer (ViT). The CNN extracts learnable features, while the ViT takes the learned features as input and categorizes them using an attention mechanism. We trained our model on the DeepFake Detection Challenge Dataset (DFDC) and achieved 91.5 percent accuracy, an AUC value of 0.91, and a loss value of 0.32. Our contribution is that we have added a CNN module to the ViT architecture and have achieved a competitive result on the DFDC dataset.

1. Introduction

Technologies for altering images, videos, and audio are developing rapidly [12, 62]. Techniques and technical expertise to create and manipulate digital content are also easily accessible. Currently, it is possible to seamlessly generate hyper-realistic digital images [28] with few resources and easy how-to instructions available online [30, 9]. Deepfake is a technique which aims to replace the face of a targeted person with the face of someone else in a video [1]. It is created by splicing a synthesized face region into the original image [62]. The term can also refer to the final output, the hyper-realistic video itself. Deepfakes can be used for the creation of hyper-realistic Computer Generated Imagery (CGI), Virtual Reality (VR) [7], Augmented Reality (AR), Education, Animation, Arts, and Cinema [13]. However, since Deepfakes are deceptive in nature, they can also be used for malicious purposes.

Since the emergence of the Deepfake phenomenon, various authors have proposed different mechanisms to differentiate real videos from fake ones. As pointed out by [10], even though each proposed mechanism has its strengths, current detection methods lack generalizability. The authors noted that existing models tackle Deepfake creation tools by studying their supposed behaviors. For instance, Yuezun et al. [33] and TackHyun et al. [25] used inconsistencies in eye blinking to detect Deepfakes. However, using the work of Konstantinos et al. [58] and Hai et al. [46], it is now possible to mimic eye blinking. The authors in [58] presented a system that generates videos of talking heads with natural facial expressions such as eye blinking. The authors in [46] proposed a model that can generate facial expressions from a portrait. Their system can animate a still picture to express emotions, including a hallucination of eye-blinking motions.

We base our work on two weaknesses of Deepfake detection methods pointed out by [10, 11]: data preprocessing and generality. Polychronis et al. [11] noted that current Deepfake detection systems focus mostly on presenting their proposed architecture and give less emphasis to data preprocessing and its impact on the final detection model. The authors stressed the importance of data preprocessing for Deepfake detection.
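To make the preprocessing step concrete, the following is a minimal sketch of a typical face-extraction pipeline for video Deepfake detection, assuming OpenCV and its bundled Haar-cascade face detector. The sampling rate, crop size, and detector choice here are illustrative assumptions, not the pipeline used in this work.

# Minimal face-extraction sketch for Deepfake detection preprocessing.
# Detector, sampling rate, and crop size are illustrative assumptions.
import cv2

def extract_face_crops(video_path, every_nth_frame=10, crop_size=224):
    """Sample frames from a video and return cropped, resized face regions."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    capture = cv2.VideoCapture(video_path)
    crops, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % every_nth_frame == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            for (x, y, w, h) in detector.detectMultiScale(gray, 1.3, 5):
                face = cv2.resize(frame[y:y + h, x:x + w],
                                  (crop_size, crop_size))
                crops.append(face)
        index += 1
    capture.release()
    return crops

Choices made at this stage, such as how many frames are sampled and how tightly faces are cropped and aligned, directly shape what artifacts the downstream classifier can learn.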
Joshua et al. [10] focused on the generality of facial forgery detection and found that most proposed systems lacked generality. The authors defined generality as reliably detecting multiple spoofing techniques and reliably detecting unseen spoofing techniques.

Umur et al. [13] proposed a generalized Deepfake detector called FakeCatcher using biological signals (internal representations of image generators and synthesizers). They used a simple Convolutional Neural Network (CNN) classifier with only three layers. The authors used 3000 videos for training and testing. However, they did not specify in detail how they preprocessed their data. From [31, 52, 21], it is evident that very deep CNNs perform better than shallow CNNs on image classification tasks. Hence, there is still room for another generalized Deepfake detector that has an extensive data preprocessing pipeline and is also trained on a very deep neural network model to catch as many Deepfake artifacts as possible.

Therefore, we propose a generalized Convolutional Vision Transformer (CViT) architecture to detect Deepfake videos using Convolutional Neural Networks and the Transformer architecture. We call our approach generalized for three main reasons. 1) Our proposed model can learn local and global image features using the CNN and the Transformer architecture by using the attention mechanism of the Transformer [6]. 2) We give equal emphasis to our data preprocessing during training and classification. 3) We propose to train our model on a diverse set of face images, using the largest dataset currently available, to detect Deepfakes created in different settings, environments, and orientations.
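A minimal sketch of this CNN-plus-Transformer idea is shown below, assuming PyTorch. The layer sizes, token handling, and classification head are illustrative assumptions and do not reproduce the exact CViT configuration; positional embeddings are omitted for brevity.

# Sketch of a CNN feature extractor feeding a Transformer encoder.
# All dimensions below are illustrative, not the CViT configuration.
import torch
import torch.nn as nn

class ConvolutionalViT(nn.Module):
    def __init__(self, embed_dim=512, num_heads=8, depth=6, num_classes=2):
        super().__init__()
        # CNN backbone: learns local features and downsamples the input.
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, embed_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Transformer encoder: relates feature "tokens" via self-attention.
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                      # x: (B, 3, H, W)
        f = self.features(x)                   # (B, C, H', W')
        tokens = f.flatten(2).transpose(1, 2)  # (B, H'*W', C)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        out = self.encoder(torch.cat([cls, tokens], dim=1))
        return self.head(out[:, 0])            # classify from the CLS token

logits = ConvolutionalViT()(torch.randn(2, 3, 224, 224))  # shape (2, 2)

The key point is the division of labor: the convolutional stack learns local features, and the self-attention layers relate those features globally before classification.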
2. Related Work

With the rapid advancement of CNNs [4, 20], Generative Adversarial Networks (GANs) [18], and their variants [22], it is now possible to create hyper-realistic images [32], videos [61], and audio signals [53, 15] that are much harder to detect and distinguish from real, untampered audiovisual content. The ability to create seemingly real sounds, images, and videos has caused a stir among various concerned stakeholders seeking to deter such developments from being used by adversaries for malicious purposes [12]. To this effect, there is currently an urge in the research community to come up with Deepfake detection mechanisms.

2.1. Deep Learning Techniques for Deepfake Video Generation

Deepfakes are generated and synthesized by deep generative models such as GANs and Autoencoders (AEs) [18, 37]. A Deepfake is created by swapping between two identities of subjects in an image or video [56]. Deepfakes can also be created using different techniques such as face swap [43], puppet-master [53], lip-sync [49, 47], face reenactment [14], synthetic image or video generation, and speech synthesis [48]. Supervised [45, 24, 51] and unsupervised image-to-image translation [19] and video-to-video translation [59, 35] can be used to create highly realistic Deepfakes.

The first Deepfake technique was FakeApp [42], which used two AE networks. An AE is a Feedforward Neural Network (FFNN) with an encoder-decoder architecture that is trained to reconstruct its input data [60]. FakeApp's encoder extracts the latent face features, and its decoder reconstructs the face images. The two AE networks share the same encoder to swap between the source and target faces, and use different decoders for training.

Most of the Deepfake creation mechanisms focus on the face region, in which face swapping and pixel-wise editing are commonly used [28]. In face swap, the face of a source image is swapped onto the face of a target image. In puppet-master, the person creating the video controls the person in the video. In lip-sync, the source person controls the mouth movement in the target video, and in face reenactment, facial features are manipulated [56]. The Deepfake creation mechanisms commonly use feature map representations of a source image and a target image. Some of the feature map representations are the Facial Action Coding System (FACS), image segmentation, facial landmarks, and facial boundaries [37]. FACS is a taxonomy of human facial expressions that defines 32 atomic facial muscle actions named Action Units (AU) and 14 Action Descriptors (AD) for miscellaneous actions. Facial landmarks are a set of defined positions on the face, such as eye, nose, and mouth positions [36].

2.1.1 Face Synthesis

Image synthesis deals with generating unseen images from sample training examples [23]. Face image synthesis techniques are used in face aging, face frontalization, and pose-guided generation. GANs are mainly used in face synthesis. GANs are generative models designed to learn a model of the data distribution from samples [3, 18]. GANs contain two adversarial networks, a generative model G and a discriminative model D. The generator and the discriminator act as adversaries with respect to each other to produce real-like samples [22]. The generator's goal is to capture the data distribution, while the goal of the discriminator is to determine whether a sample is from the model distribution or the data distribution [18]. Face frontalization GANs change the face orientation in an image. Pose-guided face image generation maps the pose of an input image to another image. GAN architectures such as StyleGAN [26] and FSGAN [43] synthesize highly realistic-looking images.

2.1.2 Face Swap

Face swap or identity swap is a GAN-based method that creates realistic Deepfake videos. The face swap process inserts the face of a source image into a target image in which the subject has never appeared [56]. It is most popularly used to insert famous actors in a variety of movie clips [2]. Face swaps can be synthesized using GANs and traditional computer vision techniques such as FaceSwap (an application for swapping faces) and ZAO (a Chinese mobile application that swaps anyone's face onto any video clip) [56]. Face Swapping GAN (FSGAN) [43] and Region-Separative GAN (RSGAN) [39] are used for face swapping, face reenactment, attribute editing, and face part synthesis. Deepfake FaceSwap uses two AEs with a shared encoder that reconstructs training images of the source and target faces [56]. The process involves a face detector that crops and aligns the face using facial landmark information [38]. A trained encoder and decoder of the source face swap the features of the source image onto the target face. The autoencoder output is then blended with the rest of the image using Poisson editing [38].
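The following is a minimal sketch of this shared-encoder, two-decoder scheme, assuming PyTorch. The fully connected shapes and the 64x64 input are illustrative assumptions, not the actual FaceSwap implementation.

# Sketch of the shared-encoder / two-decoder face-swap scheme.
# Shapes and training details are illustrative assumptions.
import torch
import torch.nn as nn

def make_decoder(latent_dim=256):
    # One decoder per identity; both train against the shared encoder.
    return nn.Sequential(nn.Linear(latent_dim, 64 * 64 * 3), nn.Sigmoid())

class FaceSwapAE(nn.Module):
    def __init__(self, latent_dim=256):
        super().__init__()
        # A single shared encoder learns identity-agnostic face features.
        self.encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(64 * 64 * 3, latent_dim), nn.ReLU())
        self.decoder_src = make_decoder(latent_dim)  # reconstructs identity A
        self.decoder_dst = make_decoder(latent_dim)  # reconstructs identity B

    def forward(self, x, identity):              # x: (B, 3, 64, 64)
        z = self.encoder(x)
        decoder = self.decoder_src if identity == "src" else self.decoder_dst
        return decoder(z).view(-1, 3, 64, 64)

# At swap time, encode a target (B) face but decode with A's decoder:
# swapped = model(target_face, identity="src")

The decoded face is then blended back into the target frame; OpenCV's cv2.seamlessClone, for example, implements this kind of Poisson-style blending.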
Facial expression (face reenactment) swap alters one's facial expression or transfers facial expressions between persons. Expression reenactment turns an identity into a puppet [37]. Using facial expression swap, one can transfer the expression of one person to another [27]. Various facial reenactment methods have been proposed through the years. CycleGAN, proposed by Jun-Yan et al. [63], performs facial reenactment between two video sources without any paired training examples. Face2Face manipulates the facial expression of a source image and projects it onto another target face in real time [54]. Face2Face creates a dense reconstruction between the source image and the target image that is used for the synthesis of the face images under different lighting settings [38].

2.2. Deep Learning Techniques for Deepfake Video Detection

Deepfake detection methods fall into