Abstract
Editing and manipulating facial features in videos is an interesting and important field of research with a plethora of applications, ranging from movie post-production and visual effects to realistic avatars for video games and virtual assistants. Our method supports semantic video manipulation based on neural rendering and 3D-based facial expression modelling. We focus on interactive manipulation of videos by altering and controlling the facial expressions, achieving promising photorealistic results. The proposed method is based on a disentangled representation and estimation of the 3D facial shape and activity, providing the user with intuitive and easy-to-use control of the facial expressions in the input video. We also introduce a user-friendly, interactive AI tool that processes human-readable semantic labels about the desired expression manipulations in specific parts of the input video and synthesizes photorealistic manipulated videos. We achieve this by mapping the emotion labels to points in the Valence-Arousal space (where Valence quantifies how positive or negative an emotion is and Arousal quantifies the power of the emotion's activation), which in turn are mapped to disentangled 3D facial expressions through a specially designed and trained expression decoder network. The paper presents detailed qualitative and quantitative experiments that demonstrate the effectiveness of our system and the promising results it achieves.
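As a rough illustration of the first step described above (emotion labels mapped to points in the Valence-Arousal space for specific parts of a video), the following minimal sketch may help. It is not the authors' implementation: the table `LABEL_TO_VA`, its coordinate values, and the helpers `label_to_va` and `va_track` are hypothetical names and numbers chosen for illustration only, and the expression decoder network that would consume these points is not modelled here.

```python
from typing import Dict, List, Tuple

# Hypothetical label-to-(valence, arousal) table. The coordinates are
# illustrative placements in [-1, 1] x [-1, 1], not values from the paper.
LABEL_TO_VA: Dict[str, Tuple[float, float]] = {
    "neutral": (0.0, 0.0),
    "happy": (0.8, 0.5),
    "sad": (-0.7, -0.4),
    "angry": (-0.6, 0.7),
    "calm": (0.5, -0.6),
}


def label_to_va(label: str, intensity: float = 1.0) -> Tuple[float, float]:
    """Return the VA point for a label, scaled toward neutral by intensity in [0, 1]."""
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")
    valence, arousal = LABEL_TO_VA[label.lower()]
    return (valence * intensity, arousal * intensity)


def va_track(
    segments: List[Tuple[int, int, str]], num_frames: int
) -> List[Tuple[float, float]]:
    """Build a per-frame VA track.

    Frames inside a (start, end, label) segment (end exclusive) receive that
    label's VA point; all other frames stay at the neutral point.
    """
    track = [LABEL_TO_VA["neutral"]] * num_frames
    for start, end, label in segments:
        for frame in range(max(start, 0), min(end, num_frames)):
            track[frame] = label_to_va(label)
    return track


if __name__ == "__main__":
    # Label frames 2 and 3 of a 5-frame clip as "sad".
    for frame, point in enumerate(va_track([(2, 4, "sad")], num_frames=5)):
        print(frame, point)
```

In a full pipeline, each per-frame VA point would then be passed to a trained decoder to obtain 3D expression parameters; that stage depends on the trained network and is deliberately left out of this sketch.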
Acknowledgments
A. Roussos was supported by HFRI under the ‘1st Call for HFRI Research Projects to support Faculty members and Researchers and the procurement of high-cost research equipment’ Project I.C. Humans, Number 91.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Solanki, G.K., Roussos, A. (2023). Deep Semantic Manipulation of Facial Videos. In: Karlinsky, L., Michaeli, T., Nishino, K. (eds) Computer Vision – ECCV 2022 Workshops. ECCV 2022. Lecture Notes in Computer Science, vol 13806. Springer, Cham. https://doi.org/10.1007/978-3-031-25075-0_8
DOI: https://doi.org/10.1007/978-3-031-25075-0_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-25074-3
Online ISBN: 978-3-031-25075-0
eBook Packages: Computer Science, Computer Science (R0), Springer Nature Proceedings Computer Science

