Hierarchical Cross-modal Talking Face Generation with Dynamic Pixel-wise Loss （ATVGnet）

Last update: Dec 27, 2022

By Lele Chen, Ross K Maddox, Zhiyao Duan, Chenliang Xu.

University of Rochester.

Introduction
Citation
Running
Model
Results
Disclaimer and known issues

Introduction

This repository contains the original models (AT-net, VG-net) described in the paper Hierarchical Cross-modal Talking Face Generation with Dynamic Pixel-wise Loss. The demo video is available at https://youtu.be/eH7h_bDRX2Q. The code can be applied directly to LRW and GRID. The model outputs are visualized here: the first is the landmark sequence synthesized by ATnet; the rest are the attention map, the motion map and the final result from VGnet.

Citation

If you use any code, models or ideas from this repo in your research, please cite:

@inproceedings{chen2019hierarchical,
  title={Hierarchical cross-modal talking face generation with dynamic pixel-wise loss},
  author={Chen, Lele and Maddox, Ross K and Duan, Zhiyao and Xu, Chenliang},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={7832--7841},
  year={2019}
}

Running

This code was tested under Python 2.7. The provided model is trained on LRW, but it also works well on GRID, VoxCeleb and other datasets. You can directly compare this model with your own model on other datasets; we consider this a fair comparison.
PyTorch environment: PyTorch 0.4.1 (conda install pytorch=0.4.1 torchvision cuda90 -c pytorch).
Install the dependencies (pip install -r requirement.txt).
Download the pretrained ATnet and VGnet weights from Google Drive and put them in the model folder.
Run the demo code: python demo.py
- -device_ids: gpu id
- -cuda: using cuda or not
- -vg_model: pretrained VGnet weight
- -at_model: pretrained ATnet weight
- -lstm: use lstm or not
- -p: input example image
- -i: input audio file
- -sample_dir: folder to save the outputs
- ...
Download and unzip the training data from LRW.
Preprocess the data (extract the landmarks and crop the images with dlib).
Train the ATnet model: python atnet.py
- -device_ids: gpu id
- -batch_size: batch size
- -model_dir: folder to save weights
- -lstm: use lstm or not
- -sample_dir: folder to save visualized images during training
- ...
Test the model: python atnet_test.py
- -device_ids: gpu id
- -batch_size: batch size
- -model_name: pretrained weights
- -sample_dir: folder to save the outputs
- -lstm: use lstm or not
- ...
Train the VGnet: python vgnet.py
- -device_ids: gpu id
- -batch_size: batch size
- -model_dir: folder to save weights
- -sample_dir: folder to save visualized images during training
- ...
Test the VGnet: python vgnet_test.py
- -device_ids: gpu id
- -batch_size: batch size
- -model_name: pretrained weights
- -sample_dir: folder to save the outputs
- ...
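The invocations above can be assembled programmatically, which helps when scripting several runs. The following is a minimal sketch, not part of the repository: the helper build_command, the weight paths and the example file names are hypothetical, and the single-dash flag spelling is copied from the option lists above. If a script expects a different prefix (for example two dashes), pass it through flag_prefix. The sketch is written to run under both Python 2.7 and Python 3.

```python
from __future__ import print_function


def build_command(script, options, flag_prefix="-", python="python"):
    """Return an argv list for one of the scripts listed above.

    options is a list of (name, value) pairs; a value of None emits
    the bare flag without an argument.
    """
    argv = [python, script]
    for name, value in options:
        argv.append(flag_prefix + name)
        if value is not None:
            argv.append(str(value))
    return argv


# Hypothetical file names: substitute your own weights, image and audio.
demo_cmd = build_command("demo.py", [
    ("device_ids", 0),
    ("vg_model", "model/vgnet.pth"),
    ("at_model", "model/atnet.pth"),
    ("p", "example.jpg"),
    ("i", "example.wav"),
    ("sample_dir", "sample"),
])
print(" ".join(demo_cmd))

# To launch the demo from the repository root, uncomment:
# import subprocess
# subprocess.check_call(demo_cmd)
```

The same helper can build the training and test commands, e.g. build_command("atnet.py", [("batch_size", 16), ("model_dir", "model")]), where the batch size is only an illustrative value.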

Model

Overall ATVGnet
Regression-based discriminator network

Results

Result visualization on different datasets:
Results compared with other SOTA methods:
Study of image robustness with respect to landmark accuracy:
Quantitative results:

Disclaimer and known issues

The code is implemented in PyTorch.
In the paper, we train on LRW and GRID separately.
The model is sensitive to input images. Please use the correct preprocessing code.
I have not finished the data processing code yet; I will release it soon. In the meantime, you can try the model with your own image.
If you want to train these models using this version of PyTorch without modifications, please note that:
- You need at least 12 GB of GPU memory.
- There might be some other untested issues.
There is another interesting and useful work on audio-to-landmark generation. Please check it out at https://github.com/eeskimez/Talking-Face-Landmarks-from-Speech.
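Until the official data processing code is released, the cropping part of the preprocessing step can be approximated as follows. This is only a sketch under stated assumptions and not the authors' preprocessing: the function square_crop_box, the margin value and the example points are hypothetical, and the landmarks are assumed to be (x, y) pixel coordinates already extracted with a detector such as dlib. Because the model is sensitive to input images, results with this approximation may differ from those in the paper.

```python
def square_crop_box(landmarks, image_width, image_height, margin=0.3):
    """Return (left, top, right, bottom) of a box around the landmarks.

    The box is square and centred on the landmark bounding box, enlarged
    by `margin`. It is clamped to the image, so it can become non-square
    when the face is close to the image border.
    """
    xs = [float(x) for x, _ in landmarks]
    ys = [float(y) for _, y in landmarks]
    center_x = (min(xs) + max(xs)) / 2.0
    center_y = (min(ys) + max(ys)) / 2.0
    side = max(max(xs) - min(xs), max(ys) - min(ys)) * (1.0 + margin)
    half = side / 2.0
    left = max(0, int(round(center_x - half)))
    top = max(0, int(round(center_y - half)))
    right = min(image_width, int(round(center_x + half)))
    bottom = min(image_height, int(round(center_y + half)))
    return left, top, right, bottom


# Three made-up landmark points inside a 200x200 image.
box = square_crop_box([(40, 50), (80, 50), (60, 100)], 200, 200, margin=0.2)
print(box)
```

The returned box can then be used to slice the image array (rows top to bottom, columns left to right) before resizing it to the input size the model expects.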

Todos

Release training data

License

MIT
