Semi-Autoregressive Transformer for Image Captioning

Last update: Dec 09, 2022

Requirements

Python 3.6
PyTorch 1.6

Prepare data

Clone this repository with git clone --recurse-submodules, then follow the initialization steps in coco-caption/README.md.
Download the preprocessed dataset from this link and extract it to data/.
Follow this instruction to prepare the adaptive bottom-up features and place them under data/mscoco/. For online test evaluation, follow this instruction to prepare the test features and place them under data/cocotest/.
Download the provided checkpoints (only some of the models are released) from here and extract them to save/.
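
Before running evaluation or training, it can help to confirm that the directories named in the steps above exist. The following is a minimal sketch, not part of the repository: the helper name missing_dirs and the EXPECTED_DIRS list are assumptions built only from the paths mentioned above, and the script checks for directories, not for the files inside them.

```python
from pathlib import Path

# Directories named in the preparation steps (relative to the repository root).
# This list is an assumption derived from the README text, not from the code.
EXPECTED_DIRS = ["coco-caption", "data", "data/mscoco", "data/cocotest", "save"]


def missing_dirs(root="."):
    """Return the expected directories that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs()
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("All expected directories are present.")
```

Run it from the repository root; an empty result only means the folders exist, so the downloaded archives still need to be extracted into them.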

Offline Evaluation

To reproduce the reported results, for example SATIC (K=2, bw=1) after self-critical training, run

python3 eval.py --model save/nsc-sat-2-from-nsc-seqkd/model-best.pth --infos_path save/nsc-sat-2-from-nsc-seqkd/infos_nsc-sat-2-from-nsc-seqkd-best.pkl --batch_size 1 --beam_size 1 --id nsc-sat-2-from-nsc-seqkd
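
The model path, infos path, and id in the command above all repeat the same checkpoint name. A small sketch of building that command from a single name is shown below. The function offline_eval_command is hypothetical, and it assumes every checkpoint under save/ follows the same naming pattern as the two shown in this README; it only uses the flags that appear in the command above.

```python
import shlex


def offline_eval_command(model_id, batch_size=1, beam_size=1):
    """Build the eval.py argument list for a checkpoint stored in save/<model_id>/.

    Assumes the layout save/<model_id>/model-best.pth and
    save/<model_id>/infos_<model_id>-best.pkl seen in the README command.
    """
    return [
        "python3", "eval.py",
        "--model", f"save/{model_id}/model-best.pth",
        "--infos_path", f"save/{model_id}/infos_{model_id}-best.pkl",
        "--batch_size", str(batch_size),
        "--beam_size", str(beam_size),
        "--id", model_id,
    ]


if __name__ == "__main__":
    args = offline_eval_command("nsc-sat-2-from-nsc-seqkd")
    # Quote each argument so the printed line can be pasted into a shell.
    print(" ".join(shlex.quote(a) for a in args))
```

The returned list can also be passed directly to subprocess.run from the repository root.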

Online Evaluation

Please first run

python3 eval_cocotest.py --input_json data/cocotest.json --input_fc_dir data/cocotest/cocotest_bu_fc --input_att_dir data/cocotest/cocotest_bu_att --input_label_h5 data/cocotalk_label.h5 --num_images -1 --language_eval 0 \
  --model save/nsc-sat-4-from-nsc-seqkd/model-best.pth --infos_path save/nsc-sat-4-from-nsc-seqkd/infos_nsc-sat-4-from-nsc-seqkd-best.pkl --batch_size 32 --beam_size 3 --id captions_test2014_alg_results

and then follow the instruction to upload the results.

Training

The first training stage uses sequence-level distillation and weight initialization. For example, to train the SATIC (K=2) model, run

python3 train.py --noamopt --noamopt_warmup 20000 --label_smoothing 0.0 --seq_per_img 5 --batch_size 10 --beam_size 1 --learning_rate 5e-4 --num_layers 6 --input_encoding_size 512 --rnn_size 2048 --learning_rate_decay_start 0 --scheduled_sampling_start 0 --save_checkpoint_every 3000 --language_eval 1 --val_images_use 5000 --max_epochs 15 --input_label_h5 data/cocotalk_seq-kd-from-nsc-transformer-baseline-b5_label.h5 --checkpoint_path save/sat-2-from-nsc-seqkd --id sat-2-from-nsc-seqkd --K 2

Then, for the second training stage, first copy the pretrained model from the first stage

cd save
./copy_model.sh sat-2-from-nsc-seqkd nsc-sat-2-from-nsc-seqkd
cd ..

and then run

python3 train.py --seq_per_img 5 --batch_size 10 --beam_size 1 --learning_rate 1e-5 --num_layers 6 --input_encoding_size 512 --rnn_size 2048 --save_checkpoint_every 3000 --language_eval 1 --val_images_use 5000 --self_critical_after 10 --max_epochs 40 --input_label_h5 data/cocotalk_label.h5 --start_from save/nsc-sat-2-from-nsc-seqkd --checkpoint_path save/nsc-sat-2-from-nsc-seqkd --id nsc-sat-2-from-nsc-seqkd --K 2
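
Both training commands pass --K 2. The README does not define this flag, so the sketch below rests on an assumption: that K is the group size of the semi-autoregressive decoder, meaning up to K words are produced per decoding pass while the groups themselves are generated one after another. Under that assumption, the illustration shows how K changes the number of decoder passes for a caption. The function names are hypothetical and nothing here is taken from the repository code.

```python
import math


def decoding_steps(caption_length, K):
    """Decoder passes needed if up to K words are emitted per pass (assumption)."""
    if K < 1:
        raise ValueError("K must be a positive integer")
    return math.ceil(caption_length / K)


def split_into_groups(words, K):
    """Split a caption into consecutive groups of at most K words."""
    return [words[i:i + K] for i in range(0, len(words), K)]


if __name__ == "__main__":
    caption = "a man rides a horse".split()
    for K in (1, 2, 4):
        print(K, decoding_steps(len(caption), K), split_into_groups(caption, K))
```

With K=1 the count equals the caption length, which corresponds to ordinary word-by-word autoregressive decoding; larger K reduces the number of passes.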

Citation

@article{zhou2021semi,
  title={Semi-Autoregressive Transformer for Image Captioning},
  author={Zhou, Yuanen and Zhang, Yong and Hu, Zhenzhen and Wang, Meng},
  journal={arXiv preprint arXiv:2106.09436},
  year={2021}
}

Acknowledgements

This repository is built upon self-critical.pytorch. Thanks to its authors for releasing the code.

Owner

YE Zhou
