
High Resolution Face Age Editing

Materials
Empty
Moderator
황중원
Other
Empty
Date
2021/05/15
Week
Week 11
Abstract
Face age editing has become a crucial task in film post-production, and is also becoming popular for general purpose photography. Recently, adversarial training has produced some of the most visually impressive results for image manipulation, including the face aging/de-aging task. In spite of considerable progress, current methods often present visual artifacts and can only deal with low-resolution images. In order to achieve aging/de-aging with the high quality and robustness necessary for wider use, these problems need to be addressed. This is the goal of the present work. We present an encoder-decoder architecture for face age editing. The core idea of our network is to create both a latent space containing the face identity, and a feature modulation layer corresponding to the age of the individual. We then combine these two elements to produce an output image of the person with a desired target age. Our architecture is greatly simplified with respect to other approaches, and allows for continuous age editing on high resolution images in a single unified model. Source codes are available at https://github.com/InterDigitalInc/HRFAE.
Face age editing is becoming popular for film production and general-purpose photography.
Adversarial training has produced impressive results, but current solutions only handle low-resolution images; achieving high-resolution face age editing requires addressing the limitations of previous methods.
The two key ideas of this paper:
1. Build a latent space containing the face identity.
2. Build a feature modulation layer corresponding to the age of the individual.
1. Introduction
Learning to manipulate face age is an important topic both in industry and academia. In the movie post-production industry, many actors are retouched in some way, either for beautification or texture editing. More specifically, synthetic aging or de-aging effects are usually generated by makeup or special visual effects. Although impressive results can be obtained digitally, as in Martin Scorsese's recent movie The Irishman, the underlying processes are extremely time consuming. Thus, robust, high-quality algorithms for performing automatic age modification are highly desirable. Nevertheless, editing faces is an intrinsically difficult task. Indeed, the human brain is particularly good at perceiving faces' attributes in order to detect, recognize or analyze them, for instance to infer identity or emotions. Consequently, even small artifacts are immediately perceived and ruin the perception of results. For this reason, our goal is to produce artifact-free, sharp and photorealistic results on high-resolution face images.
The digital de-aging in the recent movie The Irishman looks impressive, but the underlying process is extremely time-consuming.
This motivates the need for automatic, high-quality face age editing.
Face editing is intrinsically difficult: people immediately notice even small artifacts.
The goal of this paper is to produce artifact-free, sharp, and photorealistic results on high-resolution face images.
With the success of Generative Adversarial Networks (GANs) [7] in high quality image generation, GAN-based models have been widely used for image-to-image translation [35,40]. Despite having set new standards for natural image synthesis, GANs are known to suffer from two major flaws: an abundance of small artifacts and strong instability of the training process. The latest face aging studies [9,20,33,36,39] also adopt GAN-based models. Specifically, they divide face datasets into different age groups, feed young images into the generator, and rely on the discriminator to map output images to older age distributions. There are multiple limitations to this approach. Firstly, as can be expected, these approaches inherit the drawbacks of GAN-based methods - blurry background, small parasite structures, instability of training. Secondly, as the aging effect is generated by matching the output image distribution to the target group, these methods are limited to coarse aging/de-aging. To achieve fine-grained transformation, a separate model needs to be trained between each pair of ages.
GANs set a new standard for image synthesis, but suffer from two major flaws:
1. An abundance of small artifacts
2. Unstable training
Recent face aging studies also adopt GAN-based models: the dataset is divided into age groups (young/old), young images are fed to the generator, and the discriminator maps the output images to the old-age distribution.
This approach has two limitations:
1. It inherits the drawbacks of GAN-based methods (blurry backgrounds, small parasite structures, training instability).
2. It is limited to coarse aging/de-aging between predefined groups; fine-grained transformation requires a separate model per pair of ages.
Fig. 1: Age editing results on 1024 × 1024 images. We propose a single deep age transformer network able to perform both face aging and de-aging, producing high quality images that are sharp and with few artifacts. Using the face images indicated by a yellow frame as input, our network can output a photo-realistic image of the same person at any required target age in the range {20, . . . , 69}.
In this work, we propose an encoder-decoder architecture for the problem of face age editing with high visual quality on high resolution images. In order to address the aforementioned limitations, namely the tendency to produce visual artifacts and training instability, we endeavour to keep the architecture as simple as possible. Firstly, we use a single network for both aging and de-aging. This is reasonable since the encoder part of our model is assumed to encode identity, emotion or details in the input image that are not related to age, so that the same latent space can be used for both tasks of aging and de-aging. Secondly, we rely on a feature modulation layer, that is compact, acts directly on the latent space and allows for continuous age transitions. Thirdly, unlike in competing methods where the discriminator used during adversarial training is conditioned on the target age, we use a discriminator which is not conditioned and concentrates solely on the photorealism of the output images to reduce editing artifacts. The discriminator can be considered as a regularizer which imposes photorealism, rather than a traditional discriminator trying to match two distributions. Thanks to this design, our model achieves efficient disentanglement of age attributes and face identity. We present experimental results on high resolution images with qualitative and quantitative evaluations. In particular, these experiments provide clear evidence that the visual quality achieved by our results outperforms state of the art methods. Experiments on alternative datasets further illustrate the generalization capacity of the method.
To address these problems, an encoder-decoder architecture is proposed:
1. A single network handles both aging and de-aging.
a. Latent space: the encoder encodes identity, emotion and image details unrelated to age, so the same latent space serves both aging and de-aging.
b. Feature modulation layer: compact, acts directly on the latent space, and enables continuous age transitions.
c. Unconditioned discriminator: unlike discriminators conditioned on the target age, it focuses solely on photorealism to suppress artifacts ⇒ age attributes and face identity become disentangled.
2. Related Works
Face aging The survey work [6] gives an exhaustive overview of the traditional age synthesis algorithms. In this work, we are more interested in deep learning based methods, which have made impressive progress on face aging tasks during the last few years. A conditional GAN [24] model is first introduced for the face aging task by [1,39]. They encode the face image to the latent space, manipulate the latent code, and decode it to an aged face with the generator. However, the identity information is damaged during this process. This is further improved by [36,38], by adding an identity preserving term to the objective. Despite the improvement, their results are over-smoothed compared with the input images. To capture texture details, wavelet-based generative models are introduced by [19,20]. Their complex models increase the training difficulty and still yield strong artifacts. All the aforementioned models only enable face aging from one age group to another, e.g., from 20s to 40s, lacking flexibility. Recently, [9] proposed an encoder-decoder network, in which a personalized aging basis is synthesized and an age-specific transform is applied. Their model also relies on a conditional discriminator to distinguish aging patterns between age groups. Different from other methods, our model is designed for age editing with an arbitrary target age. Moreover, our approach produces far fewer artifacts, making age editing on images of high resolution (1024 × 1024) possible.
Deep learning based face aging works
Conditional GAN:
face image → latent space → aged face image
Identity information is damaged in this process.
Face aging with identity-preserved conditional generative adversarial network / Exchanging latent encodings with GAN for transferring multiple face attributes:
Add an identity-preserving term to the objective.
The output face images are over-smoothed compared with the input images.
Wavelet-based generative models (proposed to capture texture details):
Higher model complexity, and strong artifacts remain.
The three families above can only translate from one age group to another,
e.g. 20s → 40s.
S2GAN (2019, enables personalized age editing):
Still relies on a conditional discriminator to distinguish aging patterns between age groups.
The method proposed in this paper:
supports an arbitrary target age,
produces fewer artifacts than previous methods,
and enables age editing at 1024 × 1024 resolution.
Image-to-image translation Face aging can be considered as an image-to-image translation problem, i.e. translating images between young age and old age domains. An optimization based method is proposed by [34], showing the possibility to use linear interpolation of deep features from pretrained convnets to transform images. GAN based methods [13,40,11] further enable real-time translation, by training a feed forward generator. Existing image-to-image translation studies [3,4,18,29,30,37] on face images also yield impressive results in manipulating facial attributes. Lample et al. [18] design an autoencoder architecture to reconstruct images, and isolate single image characteristics in a latent component via a discriminator. These characteristics can then be modified directly in the latent space. Choi et al. [4] propose a method to perform image-to-image translation for multiple domains using only a single model. Pumarola et al. [29] introduce an attention based model, which enables face animation by simple interpolation.
Deep feature interpolation for image content changes
Uses linear interpolation of deep features from a pretrained convnet to transform images from the young age group to the old age group.
GAN-based methods: Image-to-image translation with conditional adversarial networks (2017), Unpaired image-to-image translation using cycle-consistent adversarial networks (2017), Multimodal unsupervised image-to-image translation (2018)
Enable real-time translation with a feed-forward generator.
Existing image-to-image translation studies: Semantic component decomposition for face attribute manipulation (2019), Stargan: Unified generative adversarial networks for multi-domain image-to-image translation (2018), Fader networks: Manipulating images by sliding attributes (2017), Ganimation: Anatomically-aware facial animation from a single image (2018), Make a face: Towards arbitrary high fidelity face manipulation (2019), Exchanging latent encodings with gan for transferring multiple face attributes (2018)
Yield impressive results in manipulating facial attributes.
Fader networks: Manipulating images by sliding attributes (2017)
Reconstructs images with an autoencoder architecture.
A discriminator isolates single image characteristics in a latent component ⇒ these characteristics can then be modified directly in the latent space.
Stargan: Unified generative adversarial networks for multi-domain image-to-image translation (2018)
Proposes a single model that performs image-to-image translation across multiple domains.
Ganimation: Anatomically-aware facial animation from a single image (2018)
An attention-based model that enables face animation by simple interpolation.
Fig. 2: Training process: each input image x0 is edited by the age transformer G using the initial age α0 (reconstruction task) and the target age α1 (editing task). The reconstructed image G(x0, α0) should be identical to the input image. The edited image G(x0, α1) is further passed in a discriminator D that ensures photorealism of the transformed image, and an age-classifier V that ensures age-accurate transformation. The age transformer G contains three sub networks: an encoder, a modulating network and a decoder. The encoder maps the input image x0 to an age-invariant deep feature space. The modulating network maps a target age α to a 128-dimensional modulating vector. This vector is used to modulate each channel of the encoded features, hence applying the desired age transformation. The modulated features are finally passed in the decoder to obtain the transformed image. Two skip connections are added between the encoder and the decoder in order to better preserve the age-irrelevant details.
High-resolution image synthesis In spite of the considerable progress of recent methods, manipulating/editing natural images of high resolution has not yet been achieved. Nevertheless, in another task - image generation, high quality results at high resolution are now available. Image generation at 1024 × 1024 resolution is first achieved by [15], with a progressive growing of GAN architectures. The quality of their results is further improved by StyleGAN [16,17], which learns a separation of high-level attributes automatically during the training. Based on this work, Shen et al. [32] propose an effective way to interpret the latent space learned by the generator and achieve high visual fidelity face manipulation on synthesized images. However, according to our experiments, only a fraction of natural images can be accurately reconstructed with a latent code, which makes this type of method impractical. In contrast, our proposed method achieves face age editing on 1024 × 1024 images, with great simplicity of architecture and loss design. The age editing is achieved only by an auxiliary modulating network, which could be potentially generalized to other face manipulation tasks.
Despite this progress, image-to-image translation methods still struggle with high-resolution images.
However, in a different task, image generation, high resolution has become achievable.
Progressive growing of GANs for improved quality, stability, and variation (2018)
First to reach high-resolution (1024 × 1024) generation.
StyleGAN (further improves quality)
Learns a separation of high-level attributes automatically during training.
Interpreting the latent space of gans for semantic face editing (2019)
Proposes a way to interpret the latent space learned by the generator.
Enables high-fidelity face manipulation on synthesized images.
According to the experiments in this paper, only a fraction of natural images can be accurately reconstructed from a latent code, which makes this type of method impractical.
The proposed method achieves face age editing at 1024 × 1024 resolution with a simple network architecture and loss design.
Age editing is performed by an auxiliary modulating network, which could potentially generalize to other face attribute manipulation tasks.
3. Method
Figure 2 illustrates the age transformer and the training procedure.
3.1 Overview
Let $x_0$ be an image drawn randomly from a face dataset. We denote by $\alpha_0$ the age of the person in $x_0$. Our goal is to transform $x_0$ so that the person in this image looks like someone at $\alpha_1$ years old. We want the aged version of $x_0$ to share many age-unrelated characteristics with $x_0$: identity, emotion, haircut, background, etc. That is to say: the facial attributes not relevant to age, as well as the background, need to be preserved during age transformation. Therefore, we assume that a face aging model and a face de-aging model can share most of their parameters. In this setting, we consider a single age transformer $G$ and assume that $G$ can transform any face image to any target age. The inputs of our model are the face image $x_0$ and the target age $\alpha_1$. The output is denoted by $G(x_0, \alpha_1)$, which depicts $x_0$ at the target age $\alpha_1$.
$x_0$: a face image drawn randomly from the dataset
$\alpha_0$: the age of the person in $x_0$
Goal: transform $x_0$ so that the person appears $\alpha_1$ years old
Age-unrelated facial attributes (identity, emotion, haircut, etc.) and the background must be preserved during the transformation
Assume that the aging and de-aging models can share most of their parameters
Under this assumption, a single age transformer $G$ can transform any face image to any target age
3.2 Age transformer
The proposed age transformer shown in Figure 2 employs an auto-encoder architecture and is made of an encoder, a feature modulation block and a decoder. The encoder consists of three strided convolutional layers (the first one of stride 1, the other two of stride 2) and four residual blocks [8], while the decoder contains two nearest-neighbour upsampling layers and three convolutional layers, similar to the architecture used in [14,40]. The main difference compared to these works is our feature modulation block, in which the output features of the encoder are modulated by an age-specific vector (see details below). This idea is inspired by recent works on style transfer [5,10] which show the possibility to represent different styles using the parameters of normalization layers.
Structure: encoder, feature modulation block, decoder
Encoder: 3 strided conv layers (strides 1, 2, 2), 4 residual blocks
Decoder: 2 nearest-neighbour upsampling layers, 3 conv layers
Feature modulation block
Role: re-weights the encoder output channel-wise with an age-specific modulation vector
- Encoder: The face image $x_0$ is the input of the encoder. The output features are denoted by $C \in \mathbb{R}^{n \times c}$, where $c = 128$ is the number of channels and $n$ is the product of the two spatial dimensions.
- Feature modulation for age selection: The target age $\alpha_1$ is encoded as a one-hot vector, denoted by $z_1$, and passed to the modulating network. This network consists of a single fully connected layer with a sigmoid activation. It outputs a modulation vector $w \in [0,1]^{c}$, which is used to re-weight the features $C$ before passing them into the decoder and obtaining the face image at the desired age. The modulated features are $C\,\mathrm{diag}(w)$, where $\mathrm{diag}(w)$ is the diagonal matrix with diagonal $w$.
- Decoder: The decoder takes the modulated features $C\,\mathrm{diag}(w)$ as input, together with two skip connections used to preserve the finer details of the input image. The final output is denoted by $G(x_0, \alpha_1)$.
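As a concrete illustration, here is a minimal PyTorch sketch of the modulation step described above, assuming $c = 128$ channels and a one-hot age vector over the 50 ages of Q; class and function names are illustrative, not the authors' implementation (see the official repository for that).

```python
import torch
import torch.nn as nn

class ModulatingNetwork(nn.Module):
    """Maps a one-hot target age to a per-channel weight w in [0, 1]^c."""
    def __init__(self, num_ages: int = 50, channels: int = 128):
        super().__init__()
        self.fc = nn.Linear(num_ages, channels)  # single fully connected layer

    def forward(self, z1: torch.Tensor) -> torch.Tensor:
        # z1: (batch, num_ages) one-hot encoding of the target age
        return torch.sigmoid(self.fc(z1))        # (batch, channels)

def modulate(C: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Channel-wise re-weighting C diag(w).

    C: (batch, channels, H, W) encoder features
    w: (batch, channels) modulation vector
    """
    return C * w.unsqueeze(-1).unsqueeze(-1)

# Usage sketch, with `encoder` and `decoder` standing in for the actual sub-networks:
# C   = encoder(x0)                     # (B, 128, H, W)
# w   = ModulatingNetwork()(z1)         # (B, 128)
# out = decoder(modulate(C, w))         # edited image G(x0, alpha1)
```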
3.3 Training
The age transformer is trained to produce age-accurate transformations, while the discriminator enforces photorealism.
The strongest artifacts appear when $\alpha_0$ and $\alpha_1$ are far apart, so training deliberately targets such cases and the discriminator learns to suppress these artifacts.
Concretely, $\alpha_1$ is sampled so that it differs from $\alpha_0$ by at least a fixed age interval (see the sketch below).
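A small sketch of this sampling rule, assuming the age range Q = {20, ..., 69} and the minimum gap $\alpha^* = 25$ given in Section 4.2; purely illustrative.

```python
import random

Q = range(20, 70)   # admissible ages {20, ..., 69}
ALPHA_STAR = 25     # minimum gap between source and target age

def sample_target_age(alpha0: int) -> int:
    """Sample alpha1 from Q with |alpha1 - alpha0| >= ALPHA_STAR."""
    candidates = [a for a in Q if abs(a - alpha0) >= ALPHA_STAR]
    return random.choice(candidates)
```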
Classification loss
Classifier $V$: input $G(x_0, \alpha_1)$, output: age classes 0 to 100
$p(x)$: distribution of the training images
$\ell$: categorical cross-entropy loss
$z_1$: one-hot encoding of $\alpha_1$
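Putting these symbols together, the classification loss takes the form below; this is a reconstruction from the definitions listed above, so minor notational details may differ from the paper.

```latex
\mathcal{L}_{\mathrm{class}}
  = \mathbb{E}_{x_0 \sim p(x)}\,
    \ell\!\left( V\!\big(G(x_0, \alpha_1)\big),\; z_1 \right)
```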
Adversarial loss
The adversarial loss follows PatchGAN combined with LSGAN.
The discriminator makes a prediction per patch and averages over all patches to produce its output.
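For reference, the standard PatchGAN/LSGAN least-squares objectives the notes refer to look as follows, written with the notation of this section; the full generator objective additionally combines a reconstruction term on $G(x_0, \alpha_0)$ and the classification term, weighted by $\lambda_{\mathrm{recon}}$ and $\lambda_{\mathrm{class}}$ (Section 4.2), which is not reproduced here.

```latex
\mathcal{L}_{\mathrm{adv}}(D)
  = \mathbb{E}_{x_0 \sim p(x)}\big[(D(x_0) - 1)^2\big]
  + \mathbb{E}_{x_0 \sim p(x)}\big[D\big(G(x_0, \alpha_1)\big)^2\big],
\qquad
\mathcal{L}_{\mathrm{adv}}(G)
  = \mathbb{E}_{x_0 \sim p(x)}\big[\big(D\big(G(x_0, \alpha_1)\big) - 1\big)^2\big]
```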
4. Experiments
In this section, we introduce our training setup and present the experimental results. We further evaluate the quality of our results using quantitative metrics.
4.1 Data augmentation with synthetic images
Our training dataset is built upon FFHQ [16], a high resolution dataset which contains 70,000 face images at 1024 × 1024 resolution. The dataset includes large variations in age, ethnicity, pose, lighting, and image background. However, the dataset contains only unlabeled raw images collected from Flickr.
To obtain the age information, we use an age classifier pretrained on IMDB-WIKI [31]. We observe that FFHQ contains many more samples of young faces than of old ones. This data imbalance is challenging since the aging and de-aging tasks would not be treated equally during training: most faces being young, the age transformer would be trained to perform aging much more often than de-aging, failing to yield satisfying de-aging results. To compensate for this imbalance in the age distribution, we propose to perform data augmentation using StyleGAN - a state-of-the-art high resolution image generation model [16]. We use the StyleGAN model pretrained on FFHQ to generate 300,000 synthetic images. A quick visual inspection shows that most of the generated images have no significant artifacts and are nearly indistinguishable from real images by a human. Therefore, we use them for data augmentation to obtain a quasi-uniform age distribution over Q: for any age bin with less than 1,000 samples in the original FFHQ dataset, we complete this bin with some of the generated synthetic face images; for any age bin with more than 1,000 samples, we select randomly 1,000 face images from the original FFHQ dataset. The age-equalized dataset contains 47,990 images over the range Q = {20, . . . , 69}.
FFHQ contains many more young faces than old ones.
To compensate for this imbalance, 300,000 synthetic images were generated with a StyleGAN pretrained on FFHQ.
Each age bin is brought up to roughly 1,000 images: bins with fewer than 1,000 real samples are completed with synthetic images, and bins with more than 1,000 real samples are randomly subsampled to 1,000.
The final equalized dataset contains 47,990 images over Q = {20, ..., 69} (a sketch of this equalization is given below).
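A minimal sketch of the bin-equalization rule, assuming per-age lists of real FFHQ images and synthetic StyleGAN images with predicted ages; variable names are illustrative, and bins without enough synthetic samples simply stay smaller (consistent with the total of 47,990 images).

```python
import random

Q = range(20, 70)          # age range {20, ..., 69}
TARGET_PER_BIN = 1000      # desired number of images per age bin

def equalize(real_by_age: dict, synthetic_by_age: dict) -> list:
    """real_by_age / synthetic_by_age: dict mapping age -> list of image paths."""
    dataset = []
    for age in Q:
        real = real_by_age.get(age, [])
        if len(real) >= TARGET_PER_BIN:
            # enough real images: keep a random subset of 1,000
            dataset += random.sample(real, TARGET_PER_BIN)
        else:
            # too few real images: top up with synthetic StyleGAN samples
            missing = TARGET_PER_BIN - len(real)
            dataset += real + synthetic_by_age.get(age, [])[:missing]
    return dataset
```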
4.2 Implementation details
Our model is implemented in PyTorch [28]. We take 95% of the equalized dataset as our training set and the rest as test set. For the age transformer and the discriminator, spectral normalisation [25] is applied on all the convolution layers except the last one of the age transformer. All the activation layers use Leaky ReLU [21] with a negative slope of 0.2.
We consider age transformation only in the age range Q = {20, . . . , 69}. The constant $\alpha^*$ is set to 25. We have observed that the most significant artifacts appear when the gap between the source and target age is large. By choosing $\alpha^*$ large enough, we force the discriminator D to suppress these artifacts during adversarial training. The weights $\lambda_{\mathrm{recon}}$ and $\lambda_{\mathrm{class}}$ are set to 10 and 0.1, respectively. We use the Adam optimizer with a learning rate of $10^{-4}$. The age transformer G is updated once after each discriminator update. Our model is trained for 20 epochs to achieve face age editing on high resolution images. The first 10 epochs are trained on 512 × 512 images with a batch size of 4. The next 10 epochs are trained on 1024 × 1024 images, for which we reduce the batch size to 2, the learning rate to $10^{-5}$ and $\lambda_{\mathrm{recon}}$ to 1.
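The two-stage schedule above can be summarized in a small configuration sketch (names are illustrative; values are the ones stated in this subsection, and $\lambda_{\mathrm{class}}$ is assumed unchanged in the second stage since the text does not mention altering it).

```python
TRAINING_CONFIG = {
    "age_range": (20, 69),
    "alpha_star": 25,              # minimum source/target age gap
    "optimizer": "Adam",
    "g_updates_per_d_update": 1,
    "stage_1": {                   # epochs 1-10
        "resolution": 512,
        "batch_size": 4,
        "learning_rate": 1e-4,
        "lambda_recon": 10.0,
        "lambda_class": 0.1,
    },
    "stage_2": {                   # epochs 11-20
        "resolution": 1024,
        "batch_size": 2,
        "learning_rate": 1e-5,
        "lambda_recon": 1.0,
        "lambda_class": 0.1,       # assumed unchanged
    },
}
```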
4.3 Qualitative evaluation
Figure 3 presents age editing results on 1024 × 1024 input images in different age groups. Our approach yields visually satisfying results with sharp details (best viewed when zooming on the results) and without introducing significant artifacts. Only the age relevant facial features are modified, while the identity, haircut, emotion and background are well preserved. This is all the more satisfying given that no mask has been used to isolate the face from the rest of the image. Figure 4 presents age editing results with a smooth evolution of the target age. The difference between two adjacent results is nearly invisible, which illustrates the smoothness of the aging process.
We compare our method to the two most recent state-of-the-art methods on face aging for which the official codes are released - IPCGAN [36] and PAGGAN [38]. We also compare our results to those obtained with FaderNet [18], which allows one to manipulate several facial attributes including the age.
Figure 5 presents the face aging results of IPCGAN, PAGGAN and our method on CACD [2]. The output size of each method is: 128 × 128 for IPCGAN, 224 × 224 for PAGGAN, 256 × 256 for our method. IPCGAN generates satisfying aging results and preserves well the identity of input images. However, as can be seen e.g. in Figure 5(a) row 1 column 4, the generated image presents noticeable artifacts. PAGGAN generates impressive aging effects but also introduces colored artifacts as shown in Figure 5(b) row 1 column 2. IPCGAN and PAGGAN both degrade the quality of input images. Our method is able to generate consistent aging effects, and preserves well the fine details of the input images.
Fig. 3: Age editing results on 1024 × 1024 images on FFHQ [16]. On each row, the yellow frame indicates the original image. Each column corresponds to a target age of: 25, 35, 45, 55, 65. Our approach yields visually satisfying results without introducing significant artifacts. Only age relevant features are modified, while the identity, haircut, emotion and background are perfectly preserved.
Generalisation capacity for images in unseen dataset For fair comparison and also to reduce the possible effect of overfitting on the training data, we evaluate all methods on a dataset not viewed at training time by any of the methods. We chose CelebA-HQ [15], a high resolution version of the CelebA dataset. The input images are at 1024 × 1024 resolution, and are further downsampled at the resolution at which each method was trained using their official codes. The output size of each method is: 224 × 224 for PAGGAN, 128 × 128 for IPCGAN, 256 × 256 for FaderNet, and 1024 × 1024 for our method. We compare only the face aging results from young age group to old age group, since PAGGAN and IPCGAN are trained only for aging. Figure 6 shows the results obtained with the different methods. FaderNet [18] introduces little modifications. PAGGAN [38] generates satisfying age progression effects. However, noticeable artifacts are present on the face edges and hairs. IPCGAN [36] is limited to low resolution and thus introduces a strong degradation on the quality of the image.
Fig. 4: Continuous face age editing results on FFHQ [16]. As can be observed, the difference between two adjacent results is nearly invisible, which demonstrates the smoothness of the aging process.
Table 1: Quantitative evaluation using online face recognition API [12]. We compare our method against three methods: Fader Network [18], PAGGAN [38] and IPCGAN [36]. Images are transferred to the oldest age group (50+) for all the methods. The second column presents the average predicted age. The third column indicates the blurriness of the results (lower value means less blurry). The fourth column is the gender preservation rate, meaning to which percentage the original gender is preserved. The fifth column refers to expression preservation - smiling preservation rate. The last two columns indicate the emotion preservation rate.
In comparison to these results, our approach introduces far fewer artifacts and preserves the fine details of the face and the background better.
4.4 Quantitative evaluation
Quantitative evaluation of image-to-image translation tasks is still an open question and there is no universal metric to measure photorealism or quantify artifacts in an image. The recent works [9,20,38] on face aging use an online face recognition API to estimate the age and the identity preservation accuracy of the modified images. We thus employ a similar evaluation process.
In our evaluation, the first 1,000 images with true “Young” label of the CelebA-HQ dataset are extracted as test images. Using this test set, we make a quantitative comparison with FaderNet [18], IPCGAN [36] and PAGGAN [38]. Each image is transferred to the oldest age group using their official released models. For IPCGAN and PAGGAN, the oldest age groups are 50+ and [51, 60], respectively. For FaderNet, the old attribute is set to be the default largest value for aging in their official code. To have a fair comparison with groupwise methods, and since 50+ is considered as the oldest age group, we choose a target age of 60 (the mean of the age range {51, . . . , 69} ⊂ Q) for our age transformer.
Fig. 5: Comparison with IPCGAN [36] and PAGGAN [38] on CACD [2]. For each subfigure in (a), the top row corresponds to the aging results of IPCGAN. The second row shows the images generated by our method. For each subfigure in (b), the top row corresponds to the aging results of PAGGAN. The second row shows the images generated by our method.
Thus we get 1,000 modified images for each method. We further evaluate these output images using the online face recognition API of Face++ [12]. From the detect API, we obtain the following interesting metrics: age, gender, blurriness (whether the face is blurry or not, larger values mean blurrier), smiling and emotion estimation. The emotion estimation contains a series of emotions: sadness, neutral, disgust, anger, surprise, fear and happiness. With a preliminary analysis on the results, 94.20% of the input images are classified as neutral or happiness. Thus we just keep these two terms for emotion preservation comparison. We have also compared the identity preservation rate using the API to compare the modified images with the original inputs. However, since all methods achieve a nearly 100% accuracy, this metric is not reported here.
Table 1 shows the quantitative evaluation results. All the methods are given the oldest age group as aging target, and we notice that our method has the highest average predicted age. The gender preservation rate is calculated by comparing the estimated gender with the original CelebA annotations. Using this metric, FaderNet achieves the best performance, followed by our method. For expression preservation (smiling) and emotion preservation (neutral, happiness), our approach yields the best results. It is to be noted however that all methods have similar results. For the blur evaluation, results are much more contrasted. Our method performs much better in generating sharper images, which is in agreement with the visual comparisons.
Fig. 6: Comparison of face aging results on CelebA-HQ [15]. The first column are the input images. The second to fifth column are outputs from Fader Network [18], PAG-GAN [38], IPC-GAN [36] and our method. Our results reach the highest resolution without introducing significant artifacts. Our method preserves the background better compared to other techniques, see for instance the letters on the third row. In addition, compared to other techniques, our method leads to a result without artefacts nor blur.
Fig. 7: Face age editing results with different types of discriminator. (a) Conditional discriminator. (b) Two separate discriminators. One receives images only from old age groups, the other receives images from young age groups. (c) Our proposed method - using one single discriminator. Compared to the results in (a) and (b), the proposed method (c), which uses a single discriminator, generates reliable face aging/de-aging effects with the least artifacts.
4.5 Discussion
Ablation study on discriminator We have explored three different types of discriminators to train the age transformer. Figure 7 presents the face age editing results corresponding to the different settings.
- Conditional discriminator. We adopt a patch discriminator [13] with a label projection applied on the features before the last convolutional layer, similar to the settings in [26]. The discriminator is conditioned on four age groups: 20-35, 35-45, 45-55, 55-70. At the training stage we find it essential to give the same number of real and fake images from each class to the discriminator to make the training successful. If we sample a target age $\alpha_1$ from the set $Q_{\alpha_0} = \{\alpha \in Q : |\alpha - \alpha_0| \geq \alpha^*\}$ at training time, the discriminator will receive more manipulated images in the youngest and oldest group. Thus it tends to classify all the images in these two groups as fake. The conditional discriminator is very sensitive to the original data distribution and needs much more hyper-parameter fine-tuning to converge. Figure 7(a) presents the age editing results with the conditional discriminator. Strong artifacts can be observed in the aging results.
- Two separate discriminators. One discriminator receives manipulated and real images with a desired age lying in the old age group (45-70), while the other one takes manipulated and real images in the young age group (20-44). With this setting, the task of generating aging/de-aging effects is shared between the classifier and the discriminators. Although the results in 7(b) are better than those in 7(a), over-smoothing artifacts are perceived in the de-aging results and colored artifacts appear in the aging results.
- One single discriminator. This is our proposed method. The discriminator can be considered as a regularizer which imposes photorealism, as it takes all the manipulated and real images as input. The generation of aging/de-aging effects is solely dictated by the age classifier. We are able to achieve high resolution results only with this last setting.
Fig. 8: Images reconstructed from a latent code optimization. We analyze the possibility of encoding natural images to the latent space of StyleGAN [16], through optimization in the latent space minimizing the distance between the generated image and the input image. Each image is then reconstructed from this optimized latent code. The relatively low quality of the reconstruction strongly suggests that editing performed in the latent space cannot lead to a sharp and artifact-free result.
Image reconstructed from a latent code optimization As mentioned in Section 2, the recent work of Shen et al. [32] proposes an effective way to manipulate the latent code of an image generator to achieve high visual quality manipulation of synthesized images. It is therefore tempting to manipulate the latent code directly to produce face manipulation (and thus age editing) on natural images with this approach. However, finding such a latent code for an arbitrary face image is still a challenging problem. According to our experiments using StyleGAN [16], only a fraction of natural face images can be accurately reconstructed from the latent code by [27]. Consequently, this type of method is impractical until a better StyleGAN encoder is made available. Figure 8 is meant to support this claim, where reconstruction results of natural face images can be assessed. We notice that the reconstructed images have painting-like artifacts, blurry backgrounds, and sometimes fail to preserve the identity of the person in the input image. Indeed, StyleGAN is much more efficient at sampling random faces from the latent space than at approximating a given face image. This is due to the fact that a GAN is not necessarily invertible. Hence, an editing method based on this latent code reconstruction will struggle to correctly handle natural images and to achieve the high visual quality of our method.
Weakly supervised training To the best of our knowledge, our work is the first to use unlabeled data for training among recent face aging studies [9,20,33,36,39]. A classifier pretrained on IMDB-WIKI [31], a low resolution face dataset, is used to provide age information. Moreover, the discriminator in our method is used only to distinguish real and manipulated images. Relying solely on the classifier, we successfully extract the age specific features and further realize age transform on high resolution images. This reveals the capacity of the classifier, even trained on low quality images. Our method could be potentially generalized to other face attribute manipulation tasks, by using a separate pair of modulating network and classifier for each attribute.
5. Conclusion
In this paper, we have proposed an age transformer architecture, enabling continuous face age editing with a single network, which we have endeavoured to keep as simple as possible. We believe that this approach, combined with an encoder-decoder architecture, rather than relying on a complex GAN, is the best path towards high quality, high resolution face editing results. We have demonstrated the capacity of our model to produce photorealistic and sharp results, without introducing significant artifacts, on images of resolution 1024 × 1024. The proposed feature modulation block appears to achieve efficient separation of age and identity information. Given the performance achieved, this design can be potentially useful for other face attribute manipulation tasks.
A Network architecture
Table 2 presents the hyperparameters of the proposed network architecture. The discriminator is a 142 × 142 patch discriminator. Each element of the output feature map corresponds to a receptive field of 142 × 142 on the original input image.
B Age classifier
To obtain the age information of FFHQ dataset [16], we use the age classifier [31], which has been pretrained on IMDB-WIKI. This dataset contains 523,051 face images of 20,284 celebrities collected from the IMDB and Wikipedia websites. The dataset mostly covers the [20, 65] age interval, and has only very few samples for the younger and older age intervals. Consequently, the age classifier might yield less accurate age estimation for faces of people younger than 20 years old or much older than 65 years old. We therefore choose to use images in the age range Q = {20, . . . , 69} for training. We pass the images of FFHQ dataset into the age classifier and observe that FFHQ contains many more samples of young faces than of old ones. We then augment the dataset with synthetic images generated by StyleGAN [16] to achieve a quasi-uniform age distribution over the age range Q, as described in Section 4.1 of the paper.
C Additional results
In this section, we present supplementary results on 1024 × 1024 images.
C.1 Results on FFHQ dataset
More age transform results on 1024 × 1024 images of FFHQ dataset are presented in Figures 9 and 10.
C.2 Comparison with other methods
In Figure 13, we show additional comparison of face aging results on CelebA-HQ [15]. As mentioned in the paper, we compare our method against the two most recent state-of-the-art methods on face aging for which the official codes are released - PAGGAN [38] and IPCGAN [36]. We also compare our results to those obtained with Fader Network [18], which allows one to manipulate several facial attributes including the age. Each input image is transformed to the oldest age group using their official released models. For IPCGAN and PAGGAN, the oldest age groups are 50+ and [51, 60], respectively. For Fader Network, the age attribute is set to be the default largest value for aging in their official code. To have a fair comparison with groupwise methods, and since 50+ is considered as the oldest age group, we choose a target age of 60 (the mean of the age range {51, . . . , 69} ⊂ Q) for our age transformer.
Table 2: Hyperparameters of the proposed network architecture. The input size is 1024 × 1024 × 3. For the age transformer, except the last one, each convolution is followed by an instance normalization and a LeakyReLU activation. For the discriminator, except the first and the last one, each convolution is followed by a batch normalization and a LeakyReLU activation.
Fig. 9: Age transformation on 1024 × 1024 images. On each row, the yellow frame indicates the original image. Each column corresponds to a target age of: 25, 35, 45, 55, 65. Our approach yields visually satisfying results without introducing significant artifacts. Only age relevant features are modified, while the identity, haircut, emotion and background are perfectly preserved.
Fig. 10: Age transformation on 1024 × 1024 images. On each row, the yellow frame indicates the original image. Each column corresponds to a target age of: 25, 35, 45, 55, 65. Our approach yields visually satisfying results without introducing significant artifacts. Only age relevant features are modified, while the identity, haircut, emotion and background are perfectly preserved.
Fig. 11: Age transformation on 1024 × 1024 images. On each row, the yellow frame indicates the original image. Each column corresponds to a target age of: 25, 35, 45, 55, 65. Our approach yields visually satisfying results without introducing significant artifacts. Only age relevant features are modified, while the identity, haircut, emotion and background are perfectly preserved.
Fig. 12: Age transformation on 1024 × 1024 images. On each row, the yellow frame indicates the original image. Each column corresponds to a target age of: 25, 35, 45, 55, 65. Our approach yields visually satisfying results without introducing significant artifacts. Only age relevant features are modified, while the identity, haircut, emotion and background are perfectly preserved.
Fig. 13: Comparison of face aging results on CelebA-HQ [15]. The first column are the input images. The second to fifth column are outputs from Fader Network [18], PAG-GAN [38], IPC-GAN [36] and our method. Our results reach the highest resolution without introducing significant artifacts. Our method preserves the background better compared to other techniques, see for instance the letters on the third row. In addition, compared to other techniques, our method leads to results that are free of artefacts and blur.
Fig. 14: Comparison of face aging results on CelebA-HQ [15]. The first column are the input images. The second to fifth column are outputs from Fader Network [18], PAG-GAN [38], IPC-GAN [36] and our method. Our results reach the highest resolution without introducing significant artifacts. Our method preserves the background better compared to other techniques, see for instance the letters on the third row. In addition, compared to other techniques, our method leads to results that are free of artefacts and blur.