METER: Multimodal End-to-end TransformER

Overview

Code and pre-trained models will be released soon.

Citation

@article{dou2021meter,
  title={An Empirical Study of Training End-to-End Vision-and-Language Transformers},
  author={Dou, Zi-Yi and Xu, Yichong and Gan, Zhe and Wang, Jianfeng and Wang, Shuohang and Wang, Lijuan and Zhu, Chenguang and Peng, Nanyun and Liu, Zicheng and Zeng, Michael},
  journal={arXiv},
  year={2021},
  url={https://arxiv.org/abs/2111.02387},
}

Acknowledgements

The code is based on ViLT, and parts of it are borrowed from CLIP and Swin Transformer.

Comments
  • questions about VQA

    Hi, could you share the VQAv2 result when fine-tuning with an image resolution of 384? The result I obtained is 76.52, based on your checkpoint pretrained on COCO, SBU, VG, and CC3M.

    opened by Henry9805 20
  • Some questions for the paper

    What is the difference between the scores in Table 5 and Table 8? Table 5 reports 77.19 on the VQAv2 test-dev set, while Table 8 reports 77.68 on the same set.

    opened by wanng-ide 17
  • What is the per-GPU batch size?

    What is the per-GPU batch size? The total batch size is 4096 and there are 8 GPUs, so is the per-GPU batch size 512? But on my A100 GPUs, the batch size can only be set to 16.
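
    A note on how the trainer resolves this: the effective batch size is the per-GPU batch size times the world size times the number of gradient-accumulation steps, so a small per_gpu_batchsize can still realize batch_size=4096. A minimal sketch of the arithmetic; the config names (batch_size, per_gpu_batchsize, num_gpus, num_nodes) are assumed from the command lines quoted in these issues, not confirmed internals:

    batch_size = 4096        # target effective batch size from the config
    per_gpu_batchsize = 16   # what fits on one A100 in this report
    num_gpus, num_nodes = 8, 1

    # Gradient accumulation makes up the difference:
    accumulate_grad_batches = max(
        batch_size // (per_gpu_batchsize * num_gpus * num_nodes), 1
    )
    print(accumulate_grad_batches)  # 32 micro-batches per optimizer step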

    opened by qiao1025566574 5
  • pretraining task

    Hello, great work! I'm curious whether you have tried adding image-text contrastive (ITC) learning as a pretraining task. In the ALBEF paper, they reported that the ITC task had a large impact on the experimental results.
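
    For context, here is a minimal sketch of the image-text contrastive (ITC) objective the question refers to, in the CLIP/ALBEF formulation; this is illustrative only and not part of METER's released pretraining code:

    import torch
    import torch.nn.functional as F

    def itc_loss(image_feats, text_feats, temperature=0.07):
        # image_feats, text_feats: (batch, dim) projected embeddings;
        # matched pairs lie on the diagonal of the similarity matrix.
        image_feats = F.normalize(image_feats, dim=-1)
        text_feats = F.normalize(text_feats, dim=-1)
        logits = image_feats @ text_feats.t() / temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric InfoNCE over image-to-text and text-to-image directions.
        return (F.cross_entropy(logits, targets)
                + F.cross_entropy(logits.t(), targets)) / 2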

    opened by mactavish91 4
  • Inference with Fine-tuned SNLI Model

    Hi,

    Thank you for the great work and the fine-tuned models, but I just wanted to ask how I should go about running inference with the fine-tuned model. Currently, I run into this error in my notebook:

    1 model = METERTransformerSS(cfg)
    ----> 2 model.load_state_dict(torch.load("/content/meter_clip16_288_roberta_snli.ckpt")['state_dict'])
    
    /usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in load_state_dict(self, state_dict, strict)
       1050         if len(error_msgs) > 0:
       1051             raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
    -> 1052                                self.__class__.__name__, "\n\t".join(error_msgs)))
       1053         return _IncompatibleKeys(missing_keys, unexpected_keys)
       1054 
    
    RuntimeError: Error(s) in loading state_dict for METERTransformerSS:
    	Unexpected key(s) in state_dict: "vit_model.token_embedding.weight". 
    	size mismatch for vit_model.visual.positional_embedding: copying a param with shape torch.Size([577, 768]) from checkpoint, the shape in current model is torch.Size([197, 768]).
    

    I wonder whether this is due to how I configured the model. Is there a specific way I should create the config for inference? Thank you in advance.
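
    A plausible diagnosis, given the shapes in the error: 197 position embeddings corresponds to a 224px ViT-B/16 ((224/16)^2 + 1), while the checkpoint's 577 corresponds to 384px ((384/16)^2 + 1), so the config likely built the model at the wrong resolution. A sketch of a fix, assuming the config exposes an image_size key the way the command-line flags elsewhere in these issues suggest:

    import torch

    cfg["image_size"] = 384  # match the checkpoint's fine-tuning resolution (assumed key);
                             # cfg is the same config dict as in the snippet above
    model = METERTransformerSS(cfg)
    state_dict = torch.load("/content/meter_clip16_288_roberta_snli.ckpt")["state_dict"]
    # strict=False skips leftover CLIP text-tower keys such as
    # "vit_model.token_embedding.weight" that this model never uses.
    model.load_state_dict(state_dict, strict=False)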

    opened by sramshetty 4
  • The model meter_clip16_288_roberta_flickr.ckpt is inconsistent with the network weight parameter dimension

    Hi, thank you for your excellent work. I am using the model "METER-CLIP16-RoBERTa fine-tuned on Flickr30k IR/TR (resolution: 384^2)" as meter_clip16_288_roberta_flickr.ckpt. Why does the code report an error about inconsistent weight dimensions? Thank you for answering my question.

    opened by attutude 4
  • Unable to train models faster with more GPUs

    Hi, I am facing an issue where, on increasing the number of GPUs and nodes, the number of steps per epoch does not change. For example, if I run

    python run.py with data_root=/data/datasets/meter_data_combined num_gpus=4 num_nodes=8 task_mlm_itm_clip_bert per_gpu_batchsize=64 clip16 text_roberta image_size=224 precision=16 datasets='["vg"]'

    the number of steps per epoch is nearly 150k. I observe that the number of steps is 150k both when num_gpus=1 num_nodes=1 and when num_gpus=4 num_nodes=8. I made sure that all GPUs were being utilized when I set num_gpus=4 num_nodes=8. I also observe that with num_gpus=4 num_nodes=8 the time per epoch is ~160 hours in my case, while it is ~30 hours with num_gpus=1 num_nodes=1.

    Do you have any suggestions for this problem?
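
    One pattern worth checking (a generic DDP illustration, not METER-specific code): under distributed data parallel, each rank should see only 1/world_size of the dataset via a DistributedSampler, so steps per epoch should shrink as GPUs are added. If the step count stays at ~150k for every world size, the sampler is probably not being applied to the pretraining dataloaders.

    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.data.distributed import DistributedSampler

    dataset = TensorDataset(torch.arange(150_000))
    # Explicit num_replicas/rank so this runs without a process group:
    sampler = DistributedSampler(dataset, num_replicas=32, rank=0)  # 4 GPUs x 8 nodes
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)
    # 74 steps per rank; without the sampler it would be ceil(150000/64) = 2344
    print(len(loader))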

    opened by HarmanDotpy 3
  • GPU OOM when pretraining

    Hi, I'm trying to pre-train METER using 8 A100 GPUs with the recommended config:

    python run.py with num_gpus=8 num_nodes=1 task_mlm_itm_clip_bert per_gpu_batchsize=32 clip16 text_roberta image_size=288
    

    but a GPU OOM error occurred.

    So what is the exact per_gpu_batchsize? And how can I pre-train the model in about 8 days, as mentioned in the paper?

    By the way, will mixed-precision training (precision=16) cause a performance drop?

    Many thanks!
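
    Two notes, offered as general PyTorch behaviour rather than METER specifics: (1) if per_gpu_batchsize=32 overflows memory at 288px, lowering it should be safe as long as gradient accumulation keeps the effective batch size at the configured value (see the arithmetic sketch in the batch-size issue above); (2) precision=16 in Lightning enables automatic mixed precision, which mainly trades activation memory for speed and typically has little accuracy impact on Transformer training. A standalone sketch of the mechanism:

    import torch

    model = torch.nn.Linear(4096, 4096).cuda()
    x = torch.randn(32, 4096, device="cuda")
    scaler = torch.cuda.amp.GradScaler()  # rescales gradients to avoid fp16 underflow
    with torch.cuda.amp.autocast():       # runs eligible ops in half precision
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()
    print(torch.cuda.max_memory_allocated() // 2**20, "MiB")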

    opened by hi-zhenyu 3
  • Training settings for different pretraining datasets

    When I tried to reproduce the results in Table 17, I found that using the default learning rate with only the COCO pretraining dataset worked extremely poorly on downstream tasks.

    So I would like to ask: do you set different training parameters (e.g., learning rate, batch size, max epochs) for different pre-training datasets?

    opened by ShiYaya 2
  • question about the pre-trained weights

    Dear authors, thanks for the great work! I have downloaded the pre-trained weights of the ViT-B-16 (224) + RoBERTa checkpoint from https://github.com/zdou0830/METER/releases/download/checkpoint2/meter_clip16_224_roberta_pretrain.ckpt, and found that the last layer of the visual encoder, "vit_model.visual.transformer.resblocks.11...", is not included in the ckpt file. Did I miss something? Could you please help me check it?
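
    A quick way to verify what the file actually contains (standalone PyTorch; the key prefix follows the question above):

    import torch

    sd = torch.load("meter_clip16_224_roberta_pretrain.ckpt", map_location="cpu")["state_dict"]
    # Collect the indices of all visual-encoder residual blocks in the checkpoint.
    blocks = sorted({int(k.split("resblocks.")[1].split(".")[0])
                     for k in sd if "vit_model.visual.transformer.resblocks." in k})
    print(blocks)  # e.g. [0, 1, ..., 10] here would confirm block 11 is absent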

    opened by Junction4Nako 2
  • About license

    Thanks for the great work! The codebase is released under an MIT license (https://github.com/zdou0830/METER/blob/main/LICENSE) and an Apache License (https://github.com/zdou0830/METER/blob/main/ViLT_LICENSE).

    Are the pre-trained models also released under the same licenses? Thanks.

    opened by WangWenhao0716 2
  • Pretrained weights of CLIP-ViT-224/32

    Hi,

    Thanks for the code! I wonder if you plan to release the pretrained weights of CLIP-ViT-224/32 (e.g., METER-CLIP32-RoBERTa (resolution: 224^2) pre-trained on GCC+SBU+COCO+VG)? It would be helpful for those who want to play with your model but don't have enough computational resources. Thanks!

    opened by bfshi 0
  • The last checkpoint or the best one on the Val split?

    Hi, I'm confused about which checkpoint to use for testing on the downstream tasks.

    I wonder which checkpoint I should use for evaluation: the last checkpoint, or the top-1 checkpoint saved on the val split?

    opened by hi-zhenyu 3
  • Why are the test results different when using the same data?

    I used pl.seed_everything to set the seed,

    pl.seed_everything(_config["seed"], workers=True)
    

    but I still got different results when testing the Flickr30k image-to-text retrieval task on a model I trained myself. First run:

    (tensor(0.7382), tensor(0.9274), tensor(0.9638), tensor(0.8965), tensor(0.9814), tensor(0.9941)) 0
    

    Second run:

    (tensor(0.7366), tensor(0.9294), tensor(0.9656), tensor(0.8975), tensor(0.9814), tensor(0.9941)) 0
    

    I made sure the config files were the same. Have you encountered this problem?
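
    For what it's worth, seed_everything fixes the RNG streams but not nondeterministic CUDA kernels, which is a common source of run-to-run drift this small. A generic sketch of the extra switches usually needed for bit-identical results (at some speed cost); this is standard PyTorch/Lightning, not METER-specific configuration:

    import torch
    import pytorch_lightning as pl

    pl.seed_everything(0, workers=True)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # PyTorch >= 1.8; some ops also need CUBLAS_WORKSPACE_CONFIG=:4096:8 set
    # in the environment before CUDA initializes.
    torch.use_deterministic_algorithms(True)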

    opened by qiao1025566574 1
  • ValueError and AttributeError

    Hi, I'm trying to make run.py work for pre-training, but I got a ValueError and an AttributeError and couldn't find a solution. Can you help me check it? Thank you very much!

    Traceback (most recent call last):
      File "/home/T3090U3/anaconda3/envs/hxf/lib/python3.8/site-packages/sacred/experiment.py", line 312, in run_commandline
        return self.run(
      File "/home/T3090U3/anaconda3/envs/hxf/lib/python3.8/site-packages/sacred/experiment.py", line 276, in run
        run()
      File "/home/T3090U3/anaconda3/envs/hxf/lib/python3.8/site-packages/sacred/run.py", line 238, in __call__
        self.result = self.main_function(*args)
      File "/home/T3090U3/anaconda3/envs/hxf/lib/python3.8/site-packages/sacred/config/captured_function.py", line 42, in captured_function
        result = wrapped(*args, **kwargs)
      File "run.py", line 20, in main
        dm = MTDataModule(_config, dist=True)
      File "/home/T3090U3/PycharmProjects/hxf/METER/METER-main/meter/datamodules/multitask_datamodule.py", line 19, in __init__
        self.dm_dicts = {key: _datamodules[key](_config) for key in datamodule_keys}
      File "/home/T3090U3/PycharmProjects/hxf/METER/METER-main/meter/datamodules/multitask_datamodule.py", line 19, in <dictcomp>
        self.dm_dicts = {key: _datamodules[key](_config) for key in datamodule_keys}
      File "/home/T3090U3/PycharmProjects/hxf/METER/METER-main/meter/datamodules/coco_caption_karpathy_datamodule.py", line 7, in __init__
        super().__init__(*args, **kwargs)
      File "/home/T3090U3/PycharmProjects/hxf/METER/METER-main/meter/datamodules/datamodule_base.py", line 60, in __init__
        self.tokenizer = get_pretrained_tokenizer(tokenizer)
      File "/home/T3090U3/PycharmProjects/hxf/METER/METER-main/meter/datamodules/datamodule_base.py", line 25, in get_pretrained_tokenizer
        return RobertaTokenizer.from_pretrained(from_pretrained)
      File "/home/T3090U3/anaconda3/envs/hxf/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1672, in from_pretrained
        resolved_vocab_files[file_id] = cached_path(
      File "/home/T3090U3/anaconda3/envs/hxf/lib/python3.8/site-packages/transformers/file_utils.py", line 1271, in cached_path
        output_path = get_from_cache(
      File "/home/T3090U3/anaconda3/envs/hxf/lib/python3.8/site-packages/transformers/file_utils.py", line 1494, in get_from_cache
        raise ValueError(
    ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "run.py", line 16, in <module>
        def main(_config):
      File "/home/T3090U3/anaconda3/envs/hxf/lib/python3.8/site-packages/sacred/experiment.py", line 190, in automain
        self.run_commandline()
      File "/home/T3090U3/anaconda3/envs/hxf/lib/python3.8/site-packages/sacred/experiment.py", line 347, in run_commandline
        print_filtered_stacktrace()
      File "/home/T3090U3/anaconda3/envs/hxf/lib/python3.8/site-packages/sacred/utils.py", line 493, in print_filtered_stacktrace
        print(format_filtered_stacktrace(filter_traceback), file=sys.stderr)
      File "/home/T3090U3/anaconda3/envs/hxf/lib/python3.8/site-packages/sacred/utils.py", line 528, in format_filtered_stacktrace
        return "".join(filtered_traceback_format(tb_exception))
      File "/home/T3090U3/anaconda3/envs/hxf/lib/python3.8/site-packages/sacred/utils.py", line 568, in filtered_traceback_format
        current_tb = tb_exception.exc_traceback
    AttributeError: 'TracebackException' object has no attribute 'exc_traceback'
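
    The second traceback is just sacred failing while printing the first one; the root cause is the ValueError: the RoBERTa tokenizer could not be downloaded and was not in the local cache. If the training machine has no internet access, one workaround is to cache the tokenizer once on a connected machine and load it from disk; the local path below is only an example:

    from transformers import RobertaTokenizer

    # On a machine with internet access, cache the tokenizer once:
    tok = RobertaTokenizer.from_pretrained("roberta-base")
    tok.save_pretrained("./roberta-base-local")

    # On the offline machine (optionally with TRANSFORMERS_OFFLINE=1 set),
    # point the tokenizer path in the config at the saved directory:
    tok = RobertaTokenizer.from_pretrained("./roberta-base-local")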

    opened by huhuhud 3
  • Pre-trained models for the Merged Attention Model?

    Thanks for the amazing repository; the code is really clean. If I understand correctly, the current implementation is the co-attention model, and the same holds for the pre-trained weights. I wanted to know if you have plans to release the merged-attention model weights as well. Thanks in advance!
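
    For readers unfamiliar with the distinction: in the paper's terminology, merged attention runs self-attention over the concatenated text and image tokens, while co-attention keeps two streams that cross-attend to each other. A schematic sketch in plain PyTorch, heavily simplified and not METER's actual module:

    import torch
    import torch.nn as nn

    attn = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)
    txt, img = torch.randn(1, 16, 768), torch.randn(1, 9, 768)

    # Merged attention: one self-attention pass over the joint sequence.
    joint = torch.cat([txt, img], dim=1)
    merged_out, _ = attn(joint, joint, joint)

    # Co-attention: text queries attend to image keys/values (and vice versa
    # in a second, symmetric stream, omitted here for brevity).
    co_out, _ = attn(txt, img, img)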

    opened by TheShadow29 1
Owner

Zi-Yi Dou (窦子轶)