Using the term paper proposal, prepare two documents: 1) a draft term paper and 2) a final term paper.


arXiv:2102.11126v3 [cs.CV] 11 Mar 2021

Deepfake Video Detection Using Convolutional Vision Transformer

Deressa Wodajo, Jimma University
Solomon Atnafu, Addis Ababa University

Abstract

The rapid advancement of deep learning models that can generate and synthesize hyper-realistic videos, known as Deepfakes, and their ease of access have raised concerns about possible malicious use. Deep learning techniques can now generate faces, swap faces between two subjects in a video, alter facial expressions, change gender, and alter facial features, to list a few. These powerful video manipulation methods have potential uses in many fields. However, they also pose a looming threat to everyone if used for harmful purposes such as identity theft, phishing, and scams. In this work, we propose a Convolutional Vision Transformer for the detection of Deepfakes. The Convolutional Vision Transformer has two components: a Convolutional Neural Network (CNN) and a Vision Transformer (ViT). The CNN extracts learnable features while the ViT takes the learned features as input and categorizes them using an attention mechanism. We trained our model on the DeepFake Detection Challenge Dataset (DFDC) and achieved 91.5 percent accuracy, an AUC value of 0.91, and a loss value of 0.32. Our contribution is that we have added a CNN module to the ViT architecture and achieved a competitive result on the DFDC dataset.

1. Introduction

Technologies for altering images, videos, and audio are developing rapidly [12, 62]. Techniques and technical expertise to create and manipulate digital content are also easily accessible. Currently, it is possible to seamlessly generate hyper-realistic digital images [28] with few resources and easy how-to instructions available online [30, 9]. Deepfake is a technique that aims to replace the face of a targeted person with the face of someone else in a video [1]. It is created by splicing a synthesized face region into the original image [62]. The term can also refer to the final hyper-realistic video output. Deepfakes can be used for the creation of hyper-realistic Computer Generated Imagery (CGI), Virtual Reality (VR) [7], Augmented Reality (AR), education, animation, arts, and cinema [13]. However, since Deepfakes are deceptive in nature, they can also be used for malicious purposes.

Since the Deepfake phenomenon emerged, various authors have proposed different mechanisms to differentiate real videos from fake ones. As pointed out by [10], even though each proposed mechanism has its strengths, current detection methods lack generalizability. The authors noted that existing models focus on tackling the Deepfake creation tools by studying their supposed behaviors. For instance, Yuezun et al. [33] and TackHyun et al. [25] used inconsistencies in eye blinking to detect Deepfakes. However, using the work of Konstantinos et al. [58] and Hai et al. [46], it is now possible to mimic eye blinking. The authors in [58] presented a system that generates videos of talking heads with natural facial expressions such as eye blinking. The authors in [46] proposed a model that can generate facial expressions from a portrait. Their system can make a still picture express emotions, including hallucinated eye-blinking motions.

We base our work on two weaknesses of Deepfake detection methods pointed out by [10, 11]: data preprocessing and generality. Polychronis et al. [11] noted that current Deepfake detection systems focus mostly on presenting their proposed architecture and give less emphasis to data preprocessing and its impact on the final detection model. The authors stressed the importance of data preprocessing for Deepfake detection. Joshua et al. [10] focused on the generality of facial forgery detection and found that most proposed systems lacked generality. The authors defined generality as reliably detecting multiple spoofing techniques and reliably spoofing unseen detection techniques.

Umur et al. [13] proposed a generalized Deepfake detector called FakeCatcher using biological signals (internal representations of image generators and synthesizers). They used a simple Convolutional Neural Network (CNN) classifier with only three layers. The authors used 3000 videos for training and testing. However, they did not specify in detail how they preprocessed their data. From [31, 52, 21], it is evident that very deep CNNs have superior performance over shallow CNNs in image classification tasks. Hence, there is still room for another generalized Deepfake detector that has an extensive data preprocessing pipeline and is also trained on a very deep neural network model to catch as many Deepfake artifacts as possible.

Therefore, we propose a generalized Convolutional Vision Transformer (CViT) architecture to detect Deepfake videos using Convolutional Neural Networks and the Transformer architecture. We call our approach generalized for three main reasons. 1) Our proposed model can learn local and global image features using the CNN and the Transformer architecture by using the attention mechanism of the Transformer [6]. 2) We give equal emphasis to our data preprocessing during training and classification. 3) We propose to train our model on a diverse set of face images using the largest dataset currently available to detect Deepfakes created in different settings, environments, and orientations.
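
To make the two-component design concrete, here is a minimal PyTorch sketch of a CNN stem feeding a Transformer encoder for binary real/fake classification. The layer sizes, token handling, and class names are illustrative assumptions, not the authors' exact CViT configuration.

    import torch
    import torch.nn as nn

    class ConvolutionalViT(nn.Module):
        """Illustrative CNN + Transformer classifier (not the authors' exact CViT)."""
        def __init__(self, embed_dim=256, num_heads=8, num_layers=4, num_classes=2):
            super().__init__()
            # CNN stem: extracts a grid of learnable features from a 224x224 face crop.
            self.cnn = nn.Sequential(
                nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(128, embed_dim, 3, stride=2, padding=1), nn.ReLU(),
            )
            # Transformer encoder: attends over the CNN feature tokens.
            encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                                       batch_first=True)
            self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
            self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
            self.head = nn.Linear(embed_dim, num_classes)

        def forward(self, x):                          # x: (B, 3, 224, 224)
            feats = self.cnn(x)                        # (B, C, H, W)
            tokens = feats.flatten(2).transpose(1, 2)  # (B, H*W, C) feature tokens
            cls = self.cls_token.expand(x.size(0), -1, -1)
            tokens = torch.cat([cls, tokens], dim=1)
            out = self.transformer(tokens)
            return self.head(out[:, 0])                # classify from the class token

    model = ConvolutionalViT()
    logits = model(torch.randn(2, 3, 224, 224))        # -> shape (2, 2)
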
2. Related Work

With the rapid advancement of CNNs [4, 20], Generative Adversarial Networks (GANs) [18], and their variants [22], it is now possible to create hyper-realistic images [32], videos [61], and audio signals [53, 15] that are much harder to detect and distinguish from real, untampered audiovisuals. The ability to create seemingly real sound, images, and videos has caused a stir among concerned stakeholders seeking to deter such developments from being used by adversaries for malicious purposes [12]. To this effect, there is currently an urge in the research community to come up with Deepfake detection mechanisms.

2.1. Deep Learning Techniques for Deepfake Video Generation

Deepfakes are generated and synthesized by deep generative models such as GANs and Autoencoders (AEs) [18, 37]. A Deepfake is created by swapping between two identities of subjects in an image or video [56]. Deepfakes can also be created using different techniques such as face swap [43], puppet-master [53], lip-sync [49, 47], face reenactment [14], synthetic image or video generation, and speech synthesis [48]. Supervised [45, 24, 51] and unsupervised image-to-image translation [19] and video-to-video translation [59, 35] can be used to create highly realistic Deepfakes.

The first Deepfake technique was FakeApp [42], which used two AE networks. An AE is a Feedforward Neural Network (FFNN) with an encoder-decoder architecture that is trained to reconstruct its input data [60]. FakeApp's encoder extracts the latent face features, and its decoder reconstructs the face images. The two AE networks share the same encoder to swap between the source and target faces, and use different decoders for training.
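
As a rough illustration of this shared-encoder scheme, the sketch below pairs one encoder with two identity-specific decoders; at swap time a source face is encoded and then decoded with the target's decoder. The network shapes are assumptions chosen only to keep the example self-contained.

    import torch
    import torch.nn as nn

    def make_decoder(latent_dim=256):
        # Each identity gets its own decoder that reconstructs a 64x64 RGB face.
        return nn.Sequential(
            nn.Linear(latent_dim, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    class SharedEncoder(nn.Module):
        def __init__(self, latent_dim=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
                nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
                nn.Flatten(),
                nn.Linear(64 * 16 * 16, latent_dim),
            )
        def forward(self, x):
            return self.net(x)

    encoder = SharedEncoder()
    decoder_src, decoder_tgt = make_decoder(), make_decoder()
    # Training: reconstruct each identity through the shared encoder and its own decoder.
    face_src = torch.rand(1, 3, 64, 64)
    recon_src = decoder_src(encoder(face_src))
    # Swapping: encode a source face, then decode it with the *target* decoder.
    swapped = decoder_tgt(encoder(face_src))
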
Most Deepfake creation mechanisms focus on the face region, for which face swapping and pixel-wise editing are commonly used [28]. In a face swap, the face of a source image is swapped onto the face of a target image. In puppet-master, the person creating the video controls the person in the video. In lip-sync, the source person controls the mouth movement in the target video, and in face reenactment, facial features are manipulated [56]. Deepfake creation mechanisms commonly use feature map representations of a source image and a target image. Some of these feature map representations are the Facial Action Coding System (FACS), image segmentation, facial landmarks, and facial boundaries [37]. FACS is a taxonomy of human facial expression that defines 32 atomic facial muscle actions named Action Units (AU) and 14 Action Descriptors (AD) for miscellaneous actions. Facial landmarks are a set of defined positions on the face, such as eye, nose, and mouth positions [36].

2.1.1 Face Synthesis

Image synthesis deals with generating unseen images from sample training examples [23]. Face image synthesis techniques are used in face aging, face frontalization, and pose-guided generation. GANs are used mainly in face synthesis. GANs are generative models designed to learn a model of the data from samples [3, 18]. A GAN contains two adversarial networks, a generative model G and a discriminative model D. The generator and the discriminator act as adversaries with respect to each other to produce real-like samples [22]. The generator's goal is to capture the data distribution. The goal of the discriminator is to determine whether a sample is from the model distribution or the data distribution [18].
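
This adversarial game is commonly written as the minimax objective of [18], where D is trained to assign high scores to real samples and G is trained to fool it:

    \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
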
Face frontalization GANs change the face orientation in an image. Pose-guided face image generation maps the pose of an input image onto another image. GAN architectures such as StyleGAN [26] and FSGAN [43] synthesize highly realistic-looking images.

2.1.2 Face Swap

Face swap or identity swap is a GAN-based method that creates realistic Deepfake videos. The face swap process inserts the face of a source image into a target image in which the subject has never appeared [56]. It is most popularly used to insert famous actors into a variety of movie clips [2]. Face swaps can be synthesized using GANs and traditional computer vision techniques such as FaceSwap (an application for swapping faces) and ZAO (a Chinese mobile application that swaps anyone's face onto any video clip) [56]. Face Swapping GAN (FSGAN) [43] and Region-Separative GAN (RSGAN) [39] are used for face swapping, face reenactment, attribute editing, and face part synthesis. Deepfake FaceSwap uses two AEs with a shared encoder that reconstructs training images of the source and target faces [56]. The process involves a face detector that crops and aligns the face using facial landmark information [38]. A trained encoder and decoder of the source face swap the features of the source image to the target face. The autoencoder output is then blended with the rest of the image using Poisson editing [38].

Facial expression (face reenactment) swap alters one's facial expression or transforms facial expressions among persons. Expression reenactment turns an identity into a puppet [37]. Using facial expression swap, one can transfer the expression of a person to another [27]. Various facial reenactment methods have been proposed through the years. CycleGAN was proposed by Jun-Yan et al. [63] for facial reenactment between two video sources without any pairs of training examples. Face2Face manipulates the facial expression of a source image and projects it onto another target face in real time [54]. Face2Face creates a dense reconstruction between the source image and the target image that is used for the synthesis of the face images under different light settings [38].

2.2. Deep Learning Techniques for Deepfake Video Detection

Deepfake detection methods fall into

Amar Kumar answered on Apr 06 2022
Deepfake Video Detection Using Convolutional Vision Transformer
1. Introduction
In the current computer vision field, deepfake detection is becoming a much more prominent problem. Deepfakes are videos or photographs in which an actor's performance is superimposed onto a target person so that the target appears to perform the same behaviors as the actor. Recent AI/ML advances have made it possible to create deepfakes, and current deepfakes are nearly imperceptible to human sight. With this technique, politicians can be made to appear to deliver remarks they never gave, archival material can be doctored, and celebrities can be placed into explicit footage. It is therefore critical that there be reliable algorithms for distinguishing real photos and footage from deepfakes. Deepfake detection is an interesting problem because deepfakes are quickly becoming more common in today's society, have a high potential for harm, and are extremely difficult for people to detect without assistance.
2. Prior Work
The authors of Sabir et al. [1] use a recurrent convolutional network to examine video frames, followed by a simple feed-forward network to decide whether the video is genuine or fake. We intend to modify this approach by experimenting with alternative networks, as the original paper's backbone networks were ResNet and DenseNet. In addition, to detect manipulated still images, Hsu et al. [2] use a Common Fake Feature Network to extract face attributes from the photographs, followed by a classification network to identify deepfakes.
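
For reference, that recurrent-convolutional pipeline can be sketched as a CNN backbone producing per-frame features that an LSTM aggregates before a small feed-forward classifier. The backbone (ResNet-18 here), hidden sizes, and frame count are illustrative assumptions rather than the cited papers' exact settings.

    import torch
    import torch.nn as nn
    from torchvision import models

    class RecurrentConvDetector(nn.Module):
        """Illustrative CNN + LSTM video classifier in the spirit of Sabir et al. [1]."""
        def __init__(self, hidden_dim=256):
            super().__init__()
            backbone = models.resnet18(weights=None)   # per-frame feature extractor
            backbone.fc = nn.Identity()                # keep the 512-d pooled features
            self.backbone = backbone
            self.rnn = nn.LSTM(512, hidden_dim, batch_first=True)
            self.classifier = nn.Sequential(nn.Linear(hidden_dim, 64), nn.ReLU(),
                                            nn.Linear(64, 2))   # real vs. fake logits

        def forward(self, clips):                      # clips: (B, T, 3, 224, 224)
            b, t = clips.shape[:2]
            feats = self.backbone(clips.flatten(0, 1)) # (B*T, 512)
            feats = feats.view(b, t, -1)
            _, (h, _) = self.rnn(feats)                # last hidden state summarizes the clip
            return self.classifier(h[-1])

    model = RecurrentConvDetector()
    logits = model(torch.randn(2, 6, 3, 224, 224))     # 6 face frames per clip -> (2, 2)
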
3. Data
The dataset comes from Kaggle, which is currently hosting a competition offering a 500 GB dataset containing authentic and deepfake videos, along with a label indicating whether each video is a deepfake. We are currently working with smaller chunks of the dataset in the interest of computational feasibility.
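
In the Kaggle DFDC release, each training chunk ships with a metadata.json that maps video filenames to labels; the snippet below shows one way to collect (path, label) pairs from such a chunk. The exact field names are assumed from that competition layout and may need adjusting.

    import json
    from pathlib import Path

    def load_labels(chunk_dir):
        """Collect (video_path, is_fake) pairs from a DFDC-style chunk directory.

        Assumes a metadata.json of the form {"abc.mp4": {"label": "FAKE", ...}, ...},
        as in the Kaggle DFDC training chunks.
        """
        chunk = Path(chunk_dir)
        with open(chunk / "metadata.json") as f:
            meta = json.load(f)
        return [(chunk / name, info["label"] == "FAKE") for name, info in meta.items()]

    # Example: samples = load_labels("dfdc_train_part_0")
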
For our baseline, which trains on a single frame per video, we extract the first frame of each video. For our main model, which trains on multiple frames within a single video, we extract every sixth frame. Since all of the videos in the training and test sets are 10 seconds long at 30 frames per second, each video has 300 frames in total, of which we extract 50 per video. Each video is 1920x1080 (landscape) or 1080x1920 (portrait). Since this is more data than we need, we only locate the faces of the subjects in the footage and rescale them to 224x224 pixels for training and testing.
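
A minimal sketch of the sampling just described, grabbing every sixth frame, detecting the largest face with OpenCV's bundled Haar cascade, and resizing the crop to 224x224, is shown below; the detector choice and its parameters are illustrative assumptions, not the exact pipeline we used.

    import cv2

    # OpenCV's bundled frontal-face Haar cascade (an illustrative detector choice).
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def extract_face_crops(video_path, every_n=6, size=224, max_crops=50):
        """Sample every n-th frame, crop the largest detected face, resize to size x size."""
        crops, idx = [], 0
        cap = cv2.VideoCapture(video_path)
        while len(crops) < max_crops:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % every_n == 0:
                gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
                faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
                if len(faces) > 0:
                    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # largest face
                    crops.append(cv2.resize(frame[y:y + h, x:x + w], (size, size)))
            idx += 1
        cap.release()
        return crops
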
We use a 60 GB subset of the dataset to train our final model. Since our model uses the Inception network for facial embeddings, we run video frames through a face detector and rescale the highest-confidence face to 299x299 pixels. We select every 30th frame of video, which means we train on 10 frames per video clip. This was done to ensure that we had a varied enough dataset to train a robust model on while still preserving sufficient input...
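
To illustrate the embedding step, the snippet below runs a 299x299 face crop through torchvision's Inception-v3 with the classification head removed, yielding a 2048-dimensional feature vector per face. Treating a pretrained torchvision Inception-v3 this way is an assumption about the setup, not a description of the final model.

    import torch
    import torch.nn as nn
    from torchvision import models, transforms

    # Inception-v3 expects 299x299 inputs; drop the final classifier to get 2048-d embeddings.
    inception = models.inception_v3(weights=models.Inception_V3_Weights.DEFAULT)
    inception.fc = nn.Identity()
    inception.eval()

    preprocess = transforms.Compose([
        transforms.ToTensor(),                      # HxWx3 uint8 -> 3xHxW float in [0, 1]
        transforms.Resize((299, 299)),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    def embed_face(face_rgb):
        """face_rgb: HxWx3 uint8 array of a cropped face -> 2048-d embedding tensor."""
        with torch.no_grad():
            return inception(preprocess(face_rgb).unsqueeze(0)).squeeze(0)
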