Paper:《First Order Motion Model for Image Animation》Translation and Interpretation

Contents

《First Order Motion Model for Image Animation》Translation and Interpretation

Abstract

1 Introduction  

2 Related work  

3 Method

3.1 Local Affine Transformations for Approximate Motion Description  

3.2 Occlusion-aware Image Generation 

3.3 Training Losses

3.4 Testing Stage: Relative Motion Transfer  

4 Experiments


 

 

Updating…

《First Order Motion Model for Image Animation》Translation and Interpretation

Related Paper

《First Order Motion Model for Image Animation》

https://papers.nips.cc/paper/8935-first-order-motion-model-for-image-animation

Aliaksandr Siarohin (DISI, University of Trento)
Stéphane Lathuilière (DISI, University of Trento; LTCI, Télécom Paris, Institut polytechnique de Paris)
Sergey Tulyakov (Snap Inc.)
Elisa Ricci (DISI, University of Trento; Fondazione Bruno Kessler)
Nicu Sebe (DISI, University of Trento; Huawei Technologies Ireland)

GitHub https://github.com/AliaksandrSiarohin/first-order-model

 

Abstract

Image animation consists of generating a video sequence so that an object in a source image is animated according to the motion of a driving video. Our framework addresses this problem without using any annotation or prior information about the specific object to animate. Once trained on a set of videos depicting objects of the same category (e.g. faces, human bodies), our method can be applied to any object of this class. To achieve this, we decouple appearance and motion information using a self-supervised formulation. To support complex motions, we use a representation consisting of a set of learned keypoints along with their local affine transformations. A generator network models occlusions arising during target motions and combines the appearance extracted from the source image and the motion derived from the driving video. Our framework scores best on diverse benchmarks and on a variety of object categories.

 

1 Introduction  

Generating videos by animating objects in still images has countless applications across areas of  interest including movie production, photography, and e-commerce. More precisely, image animation  refers to the task of automatically synthesizing videos by combining the appearance extracted from  a source image with motion patterns derived from a driving video. For instance, a face image of a  certain person can be animated following the facial expressions of another individual (see Fig. 1). In  the literature, most methods tackle this problem by assuming strong priors on the object representation  (e.g. 3D model) [4] and resorting to computer graphics techniques [6, 33]. These approaches can  be referred to as object-specific methods, as they assume knowledge about the model of the specific  object to animate.    
 
Recently, deep generative models have emerged as effective techniques for image animation and video retargeting [2, 41, 3, 42, 27, 28, 37, 40, 31, 21]. In particular, Generative Adversarial Networks (GANs) [14] and Variational Auto-Encoders (VAEs) [20] have been used to transfer facial expressions [37] or motion patterns [3] between human subjects in videos. Nevertheless, these approaches usually rely on pre-trained models in order to extract object-specific representations such as keypoint locations. Unfortunately, these pre-trained models are built using costly ground-truth data annotations [2, 27, 31] and are not available in general for an arbitrary object category. To address these issues, Siarohin et al. [28] recently introduced Monkey-Net, the first object-agnostic deep model for image animation. Monkey-Net encodes motion information via keypoints learned in a self-supervised fashion. At test time, the source image is animated according to the corresponding keypoint trajectories estimated in the driving video. The major weakness of Monkey-Net is that it poorly models object appearance transformations in the keypoint neighborhoods, assuming a zeroth-order model (as we show in Sec. 3.1). This leads to poor generation quality in the case of large object pose changes (see Fig. 4). To tackle this issue, we propose to use a set of self-learned keypoints together with local affine transformations to model complex motions. We therefore call our method a first-order motion model. Second, we introduce an occlusion-aware generator, which adopts an automatically estimated occlusion mask to indicate object parts that are not visible in the source image and that should be inferred from the context. This is especially needed when the driving video contains large motion patterns and occlusions are typical. Third, we extend the equivariance loss commonly used for keypoint detector training [18, 44] to improve the estimation of local affine transformations. Fourth, we experimentally show that our method significantly outperforms state-of-the-art image animation methods and can handle high-resolution datasets where other approaches generally fail. Finally, we release a new high-resolution dataset, Tai-Chi-HD, which we believe could become a reference benchmark for evaluating frameworks for image animation and video generation.

 

 

2 Related work  

Video Generation. Earlier works on deep video generation discussed how spatio-temporal neural  networks could render video frames from noise vectors [36, 26]. More recently, several approaches  tackled the problem of conditional video generation. For instance, Wang et al. [38] combine a  recurrent neural network with a VAE in order to generate face videos. Considering a wider range  of applications, Tulyakov et al. [34] introduced MoCoGAN, a recurrent architecture adversarially  trained in order to synthesize videos from noise, categorical labels or static images. Another typical  case of conditional generation is the problem of future frame prediction, in which the generated video  is conditioned on the initial frame [12, 23, 30, 35, 44]. Note that in this task, realistic predictions can  be obtained by simply warping the initial video frame [1, 12, 35]. Our approach is closely related to these previous works since we use a warping formulation to generate video sequences. However,  in the case of image animation, the applied spatial deformations are not predicted but given by the  driving video.  
 

Image Animation. Traditional approaches for image animation and video re-targeting [6, 33, 13] were designed for specific domains such as faces [45, 42], human silhouettes [8, 37, 27] or gestures [31], and required a strong prior on the animated object. For example, in face animation, the method of Zollhofer et al. [45] produced realistic results at the expense of relying on a 3D morphable model of the face. In many applications, however, such models are not available. Image animation can also be treated as a translation problem from one visual domain to another. For instance, Wang et al. [37] transferred human motion using the image-to-image translation framework of Isola et al. [16]. Similarly, Bansal et al. [3] extended conditional GANs by incorporating spatio-temporal cues in order to improve video translation between two given domains. However, in order to animate a single person, such approaches require hours of videos of that person labelled with semantic information, and therefore have to be retrained for each individual. In contrast to these works, we neither rely on labels, prior information about the animated objects, nor on specific training procedures for each object instance. Furthermore, our approach can be applied to any object within the same category (e.g., faces, human bodies, robot arms etc.).

 
Several approaches were proposed that do not require priors about the object. X2Face [40] uses  a dense motion field in order to generate the output video via image warping. Similarly to us  they employ a reference pose that is used to obtain a canonical representation of the object. In our  formulation, we do not require an explicit reference pose, leading to significantly simpler optimization  and improved image quality. Siarohin et al. [28] introduced Monkey-Net, a self-supervised framework  for animating arbitrary objects by using sparse keypoint trajectories. In this work, we also employ  sparse trajectories induced by self-supervised keypoints. However, we model object motion in the  neighbourhood of each predicted keypoint by a local affine transformation. Additionally, we explicitly  model occlusions in order to indicate to the generator network the image regions that can be generated  by warping the source image and the occluded areas that need to be inpainted.  

 

 

3 Method

We are interested in animating an object depicted in a source image S based on the motion of a similar object in a driving video D. Since direct supervision is not available (pairs of videos in which objects move similarly), we follow a self-supervised strategy inspired by Monkey-Net [28]. For training, we employ a large collection of video sequences containing objects of the same object category. Our model is trained to reconstruct the training videos by combining a single frame and a learned latent representation of the motion in the video. Observing frame pairs, each extracted from the same video, it learns to encode motion as a combination of motion-specific keypoint displacements and local affine transformations. At test time, we apply our model to pairs composed of the source image and each frame of the driving video and perform image animation of the source object.
An overview of our approach is presented in Fig. 2. Our framework is composed of two main modules: the motion estimation module and the image generation module. The purpose of the motion estimation module is to predict a dense motion field from a frame D ∈ R^{3×H×W} of dimension H × W of the driving video D to the source frame S ∈ R^{3×H×W}. The dense motion field is later used to align the feature maps computed from S with the object pose in D. The motion field is modeled by a function T_{S←D}: R^2 → R^2 that maps each pixel location in D to its corresponding location in S. T_{S←D} is often referred to as backward optical flow. We employ backward optical flow, rather than forward optical flow, since back-warping can be implemented efficiently in a differentiable manner using bilinear sampling [17]. We assume there exists an abstract reference frame R. We independently estimate two transformations: from R to S (T_{S←R}) and from R to D (T_{D←R}). Note that unlike X2Face [40], the reference frame is an abstract concept that cancels out in our derivations later. Therefore it is never explicitly computed and cannot be visualized. This choice allows us to independently process D and S. This is desired since, at test time, the model receives pairs of a source image and driving frames sampled from a different video, which can be very different visually. Instead of directly predicting T_{D←R} and T_{S←R}, the motion estimator module proceeds in two steps.
In the first step, we approximate both transformations from sets of sparse trajectories, obtained by using keypoints learned in a self-supervised way. The locations of the keypoints in D and S are separately predicted by an encoder-decoder network. The keypoint representation acts as a bottleneck resulting in a compact motion representation. As shown by Siarohin et al. [28], such a sparse motion representation is well-suited for animation since, at test time, the keypoints of the source image can be moved using the keypoint trajectories in the driving video. We model motion in the neighbourhood of each keypoint using local affine transformations. Compared to using keypoint displacements only, the local affine transformations allow us to model a larger family of transformations. We use a Taylor expansion to represent T_{D←R} by a set of keypoint locations and affine transformations. To this end, the keypoint detector network outputs keypoint locations as well as the parameters of each affine transformation.
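As a rough illustration of such an output head (not necessarily the authors' exact architecture; the layer sizes, the number of keypoints K = 10, and the soft-argmax pooling below are assumptions), a keypoint detector of this kind can predict K heatmaps whose soft-argmax gives the keypoint locations, plus 4K extra channels pooled into per-keypoint 2×2 affine (Jacobian) parameters:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeypointHead(nn.Module):
    """Illustrative head: K heatmaps -> keypoint coordinates, plus 4K channels -> 2x2 Jacobians."""
    def __init__(self, in_channels=64, num_kp=10):
        super().__init__()
        self.num_kp = num_kp
        self.kp_conv = nn.Conv2d(in_channels, num_kp, kernel_size=7, padding=3)
        self.jac_conv = nn.Conv2d(in_channels, 4 * num_kp, kernel_size=7, padding=3)

    def forward(self, feats):                      # feats: (B, C, H, W) decoder features
        B, _, H, W = feats.shape
        heat = self.kp_conv(feats).view(B, self.num_kp, -1)
        heat = F.softmax(heat, dim=-1).view(B, self.num_kp, H, W)
        # soft-argmax: expected (x, y) coordinates in [-1, 1]
        ys = torch.linspace(-1, 1, H, device=feats.device)
        xs = torch.linspace(-1, 1, W, device=feats.device)
        grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
        kp_x = (heat * grid_x).sum(dim=(2, 3))
        kp_y = (heat * grid_y).sum(dim=(2, 3))
        kp = torch.stack([kp_x, kp_y], dim=-1)     # (B, K, 2) keypoint locations
        # per-keypoint 2x2 Jacobian, pooled with the same heatmap weights
        jac = self.jac_conv(feats).view(B, self.num_kp, 4, H, W)
        jac = (jac * heat.unsqueeze(2)).sum(dim=(3, 4)).view(B, self.num_kp, 2, 2)
        return kp, jac
```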
During the second step, a dense motion network combines the local approximations to obtain the resulting dense motion field T̂_{S←D}. Furthermore, in addition to the dense motion field, this network outputs an occlusion mask Ô_{S←D} that indicates which image parts of D can be reconstructed by warping of the source image and which parts should be inpainted, i.e. inferred from the context.
Finally, the generation module renders an image of the source object moving as provided in the driving video. Here, we use a generator network G that warps the source image according to T̂_{S←D} and inpaints the image parts that are occluded in the source image. In the following sections we detail each of these steps and the training procedure.
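The full forward pass can therefore be sketched as follows (a minimal sketch only; the module names and signatures are illustrative, not the released repository's API):

```python
def animate_frame(source_img, driving_frame, kp_detector, dense_motion_net, generator):
    """Illustrative end-to-end pass: keypoints/affines -> dense flow + occlusion -> generation."""
    kp_src, jac_src = kp_detector(source_img)       # step 1: keypoints and local affines for S
    kp_drv, jac_drv = kp_detector(driving_frame)    # step 1: keypoints and local affines for D
    flow, occlusion = dense_motion_net(source_img, kp_src, jac_src, kp_drv, jac_drv)  # step 2
    return generator(source_img, flow, occlusion)   # warp source features and inpaint occlusions
```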
 

 

3.1 Local Affine Transformations for Approximate Motion Description  

The motion estimation module estimates the backward optical flow T_{S←D} from a driving frame D to the source frame S. As discussed above, we propose to approximate T_{S←D} by its first-order Taylor expansion in a neighborhood of the keypoint locations. In the rest of this section, we describe the motivation behind this choice and detail the proposed approximation of T_{S←D}.
We assume there exists an abstract reference frame R. Therefore, estimating T_{S←D} consists in estimating T_{S←R} and T_{R←D}. Furthermore, given a frame X, we estimate each transformation T_{X←R} in the neighbourhood of the learned keypoints. Formally, given a transformation T_{X←R}, we consider its first-order Taylor expansions in K keypoints p_1, . . . , p_K. Here, p_1, . . . , p_K denote the coordinates of the keypoints in the reference frame R. Note that, for the sake of simplicity, in the following the point locations in the reference pose space are all denoted by p, while the point locations in the X, S or D pose spaces are denoted by z. We obtain:

T_{X←R}(p) = T_{X←R}(p_k) + (d/dp T_{X←R}(p) |_{p=p_k}) (p − p_k) + o(‖p − p_k‖)    (1)
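Composing the two expansions (T_{S←D} = T_{S←R} ∘ T_{R←D}) yields, near each keypoint, a displacement plus a 2×2 Jacobian applied to the offset from the keypoint. Below is a small numerical sketch of this first-order approximation in NumPy; all values are invented for illustration only:

```python
import numpy as np

# Invented example values for one keypoint k.
kp_in_S = np.array([0.10, -0.20])          # T_{S<-R}(p_k): keypoint location in the source frame
kp_in_D = np.array([0.25, -0.05])          # T_{D<-R}(p_k): keypoint location in the driving frame
J_S = np.array([[1.1, 0.0], [0.1, 0.9]])   # d/dp T_{S<-R}(p) evaluated at p_k
J_D = np.array([[1.0, 0.2], [0.0, 1.0]])   # d/dp T_{D<-R}(p) evaluated at p_k

# First-order approximation of T_{S<-D} near the keypoint:
# T_{S<-D}(z) ~= T_{S<-R}(p_k) + J_S @ inv(J_D) @ (z - T_{D<-R}(p_k))
J_k = J_S @ np.linalg.inv(J_D)

def approx_T_S_from_D(z):
    return kp_in_S + J_k @ (z - kp_in_D)

z = np.array([0.30, 0.00])                  # a pixel location near the keypoint in D
print(approx_T_S_from_D(z))                 # its approximate location in S
```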

 

 

3.2 Occlusion-aware Image Generation 

As mentioned in Sec. 3, the source image S is not pixel-to-pixel aligned with the image to be generated D̂. In order to handle this misalignment, we use a feature warping strategy similar to [29, 28, 15]. More precisely, after two down-sampling convolutional blocks, we obtain a feature map ξ ∈ R^{H′×W′} of dimension H′ × W′. We then warp ξ according to T̂_{S←D}. In the presence of occlusions in S, optical flow may not be sufficient to generate D̂. Indeed, the occluded parts in S cannot be recovered by image-warping and thus should be inpainted. Consequently, we introduce an occlusion map Ô_{S←D} ∈ [0, 1]^{H′×W′} to mask out the feature map regions that should be inpainted. Thus, the occlusion mask diminishes the impact of the features corresponding to the occluded parts. The transformed feature map is written as:

ξ′ = Ô_{S←D} ⊙ f_w(ξ, T̂_{S←D}),

where f_w(·, ·) denotes the back-warping operation and ⊙ the Hadamard product.
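A minimal PyTorch sketch of this warp-and-mask step (the tensor shapes and the identity sampling grid are placeholder assumptions; F.grid_sample expects sampling locations in normalized [−1, 1] coordinates with shape (B, H′, W′, 2)):

```python
import torch
import torch.nn.functional as F

B, C, Hp, Wp = 1, 256, 64, 64                  # assumed feature size after two down-sampling blocks
xi = torch.randn(B, C, Hp, Wp)                 # feature map computed from the source image S
occlusion = torch.rand(B, 1, Hp, Wp)           # O_hat_{S<-D}, values in [0, 1]

# Placeholder for T_hat_{S<-D}: for each output pixel, the (x, y) location to sample in xi,
# expressed in normalized [-1, 1] coordinates. Here it is just the identity mapping.
ys = torch.linspace(-1, 1, Hp)
xs = torch.linspace(-1, 1, Wp)
grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
flow_grid = torch.stack([grid_x, grid_y], dim=-1).unsqueeze(0).expand(B, -1, -1, -1)

warped = F.grid_sample(xi, flow_grid, align_corners=True)   # back-warping f_w(xi, T_hat_{S<-D})
masked = occlusion * warped                                  # occluded regions are down-weighted
```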

 

 

3.3 Training Losses

We train our system in an end-to-end fashion combining several losses. First, we use a reconstruction loss based on the perceptual loss of Johnson et al. [19], computed with a pre-trained VGG-19 network, as our main driving loss. The loss follows the implementation of Wang et al. [37]. With the input driving frame D and the corresponding reconstructed frame D̂, the reconstruction loss is written as:

L_rec(D̂, D) = Σ_i |N_i(D̂) − N_i(D)|,

where N_i(·) denotes the i-th channel feature extracted from a specific VGG-19 layer.
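A hedged sketch of such a VGG-19 perceptual reconstruction loss in PyTorch (the chosen feature layers and equal weighting are assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class PerceptualLoss(nn.Module):
    """L1 distance between VGG-19 features of the reconstruction and the driving frame."""
    def __init__(self, layer_ids=(2, 7, 12, 21, 30)):   # assumed feature-layer indices
        super().__init__()
        self.vgg = vgg19(pretrained=True).features.eval()
        for p in self.vgg.parameters():
            p.requires_grad = False
        self.layer_ids = set(layer_ids)

    def forward(self, reconstructed, driving):
        # inputs assumed to be (B, 3, H, W), ImageNet-normalized
        loss, x, y = 0.0, reconstructed, driving
        for i, layer in enumerate(self.vgg):
            x, y = layer(x), layer(y)
            if i in self.layer_ids:
                loss = loss + torch.mean(torch.abs(x - y))
            if i >= max(self.layer_ids):
                break
        return loss
```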

 

 

3.4 Testing Stage: Relative Motion Transfer  

At this stage our goal is to animate an object in a source frame S_1 using the driving video D_1, . . . , D_T. Each frame D_t is independently processed to obtain S_t. Rather than transferring the motion encoded in T_{S_1←D_t}(p_k) to S_1, we transfer the relative motion between D_1 and D_t to S_1. In other words, we apply a transformation T_{D_t←D_1}(p) to the neighbourhood of each keypoint p_k:
   
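As a rough sketch of how this relative transfer could look in code (only a plausible reading of the idea; the per-frame keypoints, 2×2 Jacobians, and all names are illustrative):

```python
import torch

def relative_kp(kp_source, jac_source, kp_drv_init, jac_drv_init, kp_drv_t, jac_drv_t):
    """Transfer the motion between the first and current driving frames onto the source keypoints."""
    # shift the source keypoints by the driving-keypoint displacement D_1 -> D_t
    kp_new = kp_source + (kp_drv_t - kp_drv_init)                        # (K, 2)
    # compose the change of the local affine transformations onto the source Jacobians
    jac_new = jac_drv_t @ torch.linalg.inv(jac_drv_init) @ jac_source    # (K, 2, 2)
    return kp_new, jac_new
```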

 

4 Experiments

Datasets. We train and test our method on four different datasets containing various objects. In all our experiments, our model renders videos at a much higher resolution than [28].