LAVT: Language-Aware Vision Transformer for Referring Image Segmentation

Overview

LAVT: Language-Aware Vision Transformer for Referring Image Segmentation

Where we are ?

12.27 目前和原论文仍有1%左右得差距,但已经力压很多SOTA了

ckpt__448_epoch_25.pth mIoU Overall IoU [email protected]
Refcoco val 70.743 71.671 82.26
Refcoco testA 73.679 74.772 -
Refcoco testB 67.582 67.339 -

12.29 45epoch的结果又上升了大约1%

ckpt__448_epoch_45.pth mIoU Overall IoU
Refcoco val 71.949 72.246
Refcoco testA 74.533 75.467
Refcoco testB 67.849 68.123

the pretrain model will be released soon

对原论文的复现

论文链接: https://arxiv.org/abs/2112.02244

官方实现: https://github.com/yz93/LAVT-RIS

Architecture

Features

  • 将不同模态feature的fusion提前到Image Encoder阶段

  • 思路上对这两篇论文有很多借鉴

    • Vision-Language Transformer and Query Generation for Referring Segmentation

    • Locate then Segment: A Strong Pipeline for Referring Image Segmentation

  • 采用了比较新的主干网络 Swin-Transformer

Usage

详细参数设置可以见args.py

for training

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 --master_port 12345 main.py --batch_size 2 --cfg_file configs/swin_base_patch4_window7_224.yaml --size 448

for evaluation

CUDA_VISIBLE_DEVICES=4,5,6,7 python -m torch.distributed.launch --nproc_per_node 4 --master_port 23458 main.py --size 448 --batch_size 1 --resume --eval --type val --eval_mode cat --pretrain ckpt_448_epoch_20.pth --cfg_file configs/swin_base_patch4_window7_224.yaml

*.pth 都放在./checkpoint

for resume from checkpoint

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 --master_port 12346 main.py --batch_size 2 --cfg_file configs/swin_base_patch4_window7_224.yaml --size 448 --resume --pretrain ckpt_448_epoch_10.pth

for dataset preparation

please get details from ./data/readme.md

Need to be finished

由于我在复现的时候,官方的code还没有出来,所以一些细节上的设置可能和官方code不同

  • Swin Transformer 我选择的是 swin_base_patch4_window12_384_22k.pth,具体代码可以参考官方代码 https://github.com/microsoft/Swin-Transformer/blob/main/get_started.md 原论文中的图像resize的尺寸是480*480,可是我目前基于官方的代码若想调到这个尺寸,总是会报错,查了一下觉得可能用object detection 的swin transformer的code比较好

    12.27 这个问题目前也已经得到了较好的解决,目前训练用的是 swin_base_patch4_window7_224_22k.pth, 输入图片的尺寸调整到448*448

    解决方案可以参考:

    https://github.com/microsoft/Swin-Transformer/issues/155

  • 原论文中使用的lr_scheduler是polynomial learning rate decay, 没有给出具体的参数手动设置了一下

    12.21 目前来看感觉自己设置的不是很好

    12.27 调整了一下设置,初始学习率的设置真的很重要,特别是根据batch_size 去scale你的 inital learning rate

  • 原论文中的batch_size=32,基于自己的实验我猜想应该是用了8块GPU,每一块的batch_size=4, 由于我第一次写DDP code,训练时发现,程序总是会在RANK0上给其余RANK开辟类似共享显存的东西,导致我无法做到原论文相同的配置,需要改进

  • 仔细观察Refcoco的数据集,会发现一个target会对应好几个sentence,training时我设计的是随机选一个句子,evaluate时感觉应该要把所有句子用上会更好,关于这一点我想了两种evaluate的方法

    目前eval 只能支持 batch_size=1

    • 将所有句子concatenate成为一个句子,送入BERT,Input 形式上就是(Image,cat(sent_1,sent_2,sent_3)) => model => pred

    实验发现这种eval_mode 下的mean IOU 会好不少, overall_IOU 也会好一点

    • 对同一张图片处理多次处理,然后将结果进行平均,Input 形式上就是 ((Image,sent_1),(Image,sent_2),(Image,sent_3)) => model => average(pred_1,pred_2,pred_3)

Visualization

详细见inference.ipynb

input sentences

  1. right girl
  2. closest girl on right

results

Failure cases study

AnalysisFailure.ipynb 提供了一个研究model不work的途径,主要是筛选了IoU < 0.5的case,并在这些case中着重查看了一下IoU < 0.10.4 < IoU < 0.5 的例子

目前我只看了一些有限的failure cases,做了如下总结

  • 模型对于similar,dense object在language guide下定位不精确
  • 模型对于language的理解不分主次
  • refcoco本身标记的一些问题
Owner
zichengsaber
CVer
zichengsaber
Source code of AAAI 2022 paper "Towards End-to-End Image Compression and Analysis with Transformers".

Towards End-to-End Image Compression and Analysis with Transformers Source code of our AAAI 2022 paper "Towards End-to-End Image Compression and Analy

37 Dec 21, 2022
Rank 1st in the public leaderboard of ScanRefer (2021-03-18)

InstanceRefer InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring

63 Dec 07, 2022
The code for paper Efficiently Solve the Max-cut Problem via a Quantum Qubit Rotation Algorithm

Quantum Qubit Rotation Algorithm Single qubit rotation gates $$ U(\Theta)=\bigotimes_{i=1}^n R_x (\phi_i) $$ QQRA for the max-cut problem This code wa

SheffieldWang 0 Oct 18, 2021
This package contains a PyTorch Implementation of IB-GAN of the submitted paper in AAAI 2021

The PyTorch implementation of IB-GAN model of AAAI 2021 This package contains a PyTorch implementation of IB-GAN presented in the submitted paper (IB-

Insu Jeon 9 Mar 30, 2022
A pure PyTorch implementation of the loss described in "Online Segment to Segment Neural Transduction"

ssnt-loss ℹ️ This is a WIP project. the implementation is still being tested. A pure PyTorch implementation of the loss described in "Online Segment t

張致強 1 Feb 09, 2022
Attention Probe: Vision Transformer Distillation in the Wild

Attention Probe: Vision Transformer Distillation in the Wild Jiahao Wang, Mingdeng Cao, Shuwei Shi, Baoyuan Wu, Yujiu Yang In ICASSP 2022 This code is

Wang jiahao 3 Oct 31, 2022
Source code for Adaptively Calibrated Critic Estimates for Deep Reinforcement Learning

Adaptively Calibrated Critic Estimates for Deep Reinforcement Learning Official implementation of ACC, described in the paper "Adaptively Calibrated C

3 Sep 16, 2022
An implementation of the proximal policy optimization algorithm

PPO Pytorch C++ This is an implementation of the proximal policy optimization algorithm for the C++ API of Pytorch. It uses a simple TestEnvironment t

Martin Huber 59 Dec 09, 2022
PyTorch implementation for MINE: Continuous-Depth MPI with Neural Radiance Fields

MINE: Continuous-Depth MPI with Neural Radiance Fields Project Page | Video PyTorch implementation for our ICCV 2021 paper. MINE: Towards Continuous D

Zijian Feng 325 Dec 29, 2022
Flickr-Faces-HQ (FFHQ) is a high-quality image dataset of human faces, originally created as a benchmark for generative adversarial networks (GAN)

Flickr-Faces-HQ Dataset (FFHQ) Flickr-Faces-HQ (FFHQ) is a high-quality image dataset of human faces, originally created as a benchmark for generative

NVIDIA Research Projects 2.9k Dec 28, 2022
Code for paper "Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs"

This is the codebase for the paper: Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs Directory Structur

Peter Hase 19 Aug 21, 2022
An integration of several popular automatic augmentation methods, including OHL (Online Hyper-Parameter Learning for Auto-Augmentation Strategy) and AWS (Improving Auto Augment via Augmentation Wise Weight Sharing) by Sensetime Research.

An integration of several popular automatic augmentation methods, including OHL (Online Hyper-Parameter Learning for Auto-Augmentation Strategy) and AWS (Improving Auto Augment via Augmentation Wise

45 Dec 08, 2022
Train a deep learning net with OpenStreetMap features and satellite imagery.

DeepOSM Classify roads and features in satellite imagery, by training neural networks with OpenStreetMap (OSM) data. DeepOSM can: Download a chunk of

TrailBehind, Inc. 1.3k Nov 24, 2022
'Solving the sampling problem of the Sycamore quantum supremacy circuits

solve_sycamore This repo contains data, contraction code, and contraction order for the paper ''Solving the sampling problem of the Sycamore quantum s

Feng Pan 29 Nov 28, 2022
Vertical Federated Principal Component Analysis and Its Kernel Extension on Feature-wise Distributed Data based on Pytorch Framework

VFedPCA+VFedAKPCA This is the official source code for the Paper: Vertical Federated Principal Component Analysis and Its Kernel Extension on Feature-

John 9 Sep 18, 2022
codes for "Scheduled Sampling Based on Decoding Steps for Neural Machine Translation" (long paper of EMNLP-2022)

Scheduled Sampling Based on Decoding Steps for Neural Machine Translation (EMNLP-2021 main conference) Contents Overview Background Quick to Use Furth

Adaxry 13 Jul 25, 2022
This code finds bounding box of a single human mouth.

This code finds bounding box of a single human mouth. In comparison to other face segmentation methods, it is relatively insusceptible to open mouth conditions, e.g., yawning, surgical robots, etc. T

iThermAI 4 Nov 27, 2022
Simple PyTorch implementations of Badnets on MNIST and CIFAR10.

Simple PyTorch implementations of Badnets on MNIST and CIFAR10.

Vera 75 Dec 13, 2022
Multi-Template Mouse Brain MRI Atlas (MBMA): both in-vivo and ex-vivo

Multi-template MRI mouse brain atlas (both in vivo and ex vivo) Mouse Brain MRI atlas (both in-vivo and ex-vivo) (repository relocated from the origin

8 Nov 18, 2022
Deploy recommendation engines with Edge Computing

RecoEdge: Bringing Recommendations to the Edge A one stop solution to build your recommendation models, train them and, deploy them in a privacy prese

NimbleEdge 131 Jan 02, 2023