This repo provides code for QB-Norm (Cross Modal Retrieval with Querybank Normalisation)

Last update: Dec 29, 2022

Overview

This repo provides code for QB-Norm (Cross Modal Retrieval with Querybank Normalisation)

Usage example

python dynamic_inverted_softmax.py --sims_train_test_path msrvtt/tt-ce-train-captions-test-videos-seed0.pkl --sims_test_path msrvtt/tt-ce-test-captions-test-videos-seed0.pkl --test_query_masks_path msrvtt/tt-ce-test-query_masks.pkl

To test QB-Norm on your own data, you need to:

1. Extract the similarity matrix between the captions from the training split and the videos from the testing split (path/to/sims/train/test).
2. Extract the testing-split similarity matrix, i.e. the similarities between the testing captions and the testing videos (path/to/sims/test).
3. Run QB-Norm:

python dynamic_inverted_softmax.py --sims_train_test_path path/to/sims/train/test --sims_test_path path/to/sims/test
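The two inputs above can be related to each other with a short NumPy sketch. This is a minimal illustration of a dynamic inverted softmax as we understand the idea, not the code in dynamic_inverted_softmax.py: the function signature, the (queries x gallery) matrix layout, and the defaults for beta and k are all assumptions made for the example.

```python
import numpy as np


def dynamic_inverted_softmax(sims_train_test, sims_test, beta=20.0, k=1):
    """Hypothetical sketch of querybank normalisation.

    sims_train_test: (num_train_queries, num_gallery) querybank similarities.
    sims_test:       (num_test_queries, num_gallery) test similarities.
    beta, k:         illustrative defaults, not values taken from this README.
    """
    sims_train_test = np.asarray(sims_train_test, dtype=float)
    out = np.array(sims_test, dtype=float)  # copy, so the input is not modified

    # Gallery items ranked in the top-k by at least one querybank query are
    # treated as potential hubs (the "activation set").
    topk = np.argsort(-sims_train_test, axis=1)[:, :k]
    activation_set = np.unique(topk)

    # One normaliser per gallery item: summed exponentiated querybank scores.
    normaliser = np.exp(beta * sims_train_test).sum(axis=0)

    # Only test queries whose top-1 item is a potential hub are normalised.
    # Other rows keep their raw scores; ranking is done per row, so mixing
    # the two scales across rows does not affect retrieval.
    top1 = np.argmax(out, axis=1)
    needs_norm = np.isin(top1, activation_set)
    out[needs_norm] = np.exp(beta * out[needs_norm]) / normaliser
    return out


# Toy data: gallery item 0 is a hub for every querybank query.
querybank = np.array([[0.9, 0.1, 0.2],
                      [0.8, 0.2, 0.1]])
test = np.array([[0.60, 0.55, 0.10],
                 [0.10, 0.20, 0.70]])
normalised = dynamic_inverted_softmax(querybank, test)
print(np.argmax(test, axis=1), np.argmax(normalised, axis=1))
```

In the toy data, the first test query initially ranks the hub first; after normalisation the hub is penalised by its large querybank normaliser, while the second query is left untouched because its top-1 item is not in the activation set.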

Data

The similarity matrices for each method were extracted using the official repositories: CE+, TT-CE+, CLIP2Video, CLIP4Clip, CLIP, MMT, Audio-Retrieval. For CLIP4Clip, we trained new models from scratch with the official repo, since it does not provide pre-trained weights.

You can download the extracted similarity matrices for training and testing here: MSRVTT, MSVD, DiDeMo, LSMDC.

Text-Video retrieval results

QB-Norm Results on MSRVTT Benchmark

Model	Split	Task	R@1	R@5	R@10	MdR	Geom
CE+	Full	t2v	14.4 (0.1)	37.4 (0.1)	50.2 (0.1)	10.0 (0.0)	30.0 (0.1)
CE+ (+QB-Norm)	Full	t2v	16.4 (0.0)	40.3 (0.1)	52.9 (0.1)	9.0 (0.0)	32.7 (0.1)
TT-CE+	Full	t2v	14.9 (0.1)	38.3 (0.1)	51.5 (0.1)	10.0 (0.0)	30.9 (0.1)
TT-CE+ (+QB-Norm)	Full	t2v	17.3 (0.0)	42.1 (0.2)	54.9 (0.1)	8.0 (0.0)	34.2 (0.1)
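The columns in these tables can be computed from a similarity matrix roughly as follows. This is a minimal sketch under stated assumptions: each query has exactly one correct gallery item (the function name and the `gt_indices` argument are hypothetical), ties are ranked optimistically, and Geom is taken to be the geometric mean of R@1, R@5 and R@10, which is consistent with the values reported in the tables. Datasets with several captions per video would need additional handling, such as the query masks used in the usage example.

```python
import numpy as np


def retrieval_metrics(sims, gt_indices):
    """Compute R@1, R@5, R@10, MdR and Geom for one similarity matrix.

    sims:       (num_queries, num_gallery) similarity scores.
    gt_indices: gt_indices[i] is the correct gallery index for query i
                (hypothetical input format, assumed for this sketch).
    """
    sims = np.asarray(sims, dtype=float)
    gt_indices = np.asarray(gt_indices)

    # Rank of the correct item = 1 + number of items scored strictly higher.
    gt_scores = sims[np.arange(len(sims)), gt_indices]
    ranks = 1 + (sims > gt_scores[:, None]).sum(axis=1)

    # Recall at K, as a percentage of queries.
    metrics = {f"R@{k}": 100.0 * float(np.mean(ranks <= k)) for k in (1, 5, 10)}
    # Median rank of the correct item (lower is better).
    metrics["MdR"] = float(np.median(ranks))
    # Geometric mean of the three recall values.
    metrics["Geom"] = float(
        np.cbrt(metrics["R@1"] * metrics["R@5"] * metrics["R@10"])
    )
    return metrics


# Toy example: three queries, three gallery items, query i matches item i.
toy_sims = [[0.9, 0.1, 0.0],
            [0.2, 0.1, 0.7],
            [0.3, 0.5, 0.4]]
print(retrieval_metrics(toy_sims, [0, 1, 2]))
```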

QB-Norm Results on MSVD Benchmark

Model	Split	Task	R@1	R@5	R@10	MdR	Geom
TT-CE+	Full	t2v	25.4 (0.3)	56.9 (0.4)	71.3 (0.2)	4.0 (0.0)	46.9 (0.3)
TT-CE+ (+QB-Norm)	Full	t2v	26.6 (1.0)	58.6 (1.3)	71.8 (1.1)	4.0 (0.0)	48.2 (1.2)
CLIP2Video	Full	t2v	47.0	76.8	85.9	2.0	67.7
CLIP2Video (+QB-Norm)	Full	t2v	48.0	77.9	86.2	2.0	68.5

QB-Norm Results on DiDeMo Benchmark

Model	Split	Task	R@1	R@5	R@10	MdR	Geom
TT-CE+	Full	t2v	21.6 (0.7)	48.6 (0.4)	62.9 (0.6)	6.0 (0.0)	40.4 (0.4)
TT-CE+ (+QB-Norm)	Full	t2v	24.2 (0.7)	50.8 (0.7)	64.4 (0.1)	5.3 (0.5)	43.0 (0.2)
CLIP4Clip	Full	t2v	43.0	70.5	80.0	2.0	62.4
CLIP4Clip (+QB-Norm)	Full	t2v	43.5	71.4	80.9	2.0	63.1

QB-Norm Results on LSMDC Benchmark

Model	Split	Task	R@1	R@5	R@10	MdR	Geom
TT-CE+	Full	t2v	17.2 (0.4)	36.5 (0.6)	46.3 (0.3)	13.7 (0.5)	30.7 (0.3)
TT-CE+ (+QB-Norm)	Full	t2v	17.8 (0.4)	37.7 (0.5)	47.6 (0.6)	12.7 (0.5)	31.7 (0.3)
CLIP4Clip	Full	t2v	21.3	40.0	49.5	11.0	34.8
CLIP4Clip (+QB-Norm)	Full	t2v	22.4	40.1	49.5	11.0	35.4

QB-Norm Results on VaTeX Benchmark

Model	Split	Task	R@1	R@5	R@10	MdR	Geom
TT-CE+	Full	t2v	53.2 (0.2)	87.4 (0.1)	93.3 (0.0)	1.0 (0.0)	75.7 (0.1)
TT-CE+ (+QB-Norm)	Full	t2v	54.8 (0.1)	88.2 (0.1)	93.8 (0.1)	1.0 (0.0)	76.8 (0.0)
CLIP2Video	Full	t2v	57.4	87.9	93.6	1.0	77.9
CLIP2Video (+QB-Norm)	Full	t2v	58.8	88.3	93.8	1.0	78.7

QB-Norm Results on QuerYD Benchmark

Model	Split	Task	R@1	R@5	R@10	MdR	Geom
CE+	Full	t2v	13.2 (2.0)	37.1 (2.9)	50.5 (1.9)	10.3 (1.2)	29.1 (2.2)
CE+ (+QB-Norm)	Full	t2v	14.1 (1.8)	38.6 (1.3)	51.1 (1.6)	10.0 (0.8)	30.2 (1.7)
TT-CE+	Full	t2v	14.4 (0.5)	37.7 (1.7)	50.9 (1.6)	9.8 (1.0)	30.3 (0.9)
TT-CE+ (+QB-Norm)	Full	t2v	15.1 (1.6)	38.3 (2.4)	51.2 (2.8)	10.3 (1.7)	30.9 (2.3)

Text-Image retrieval results

QB-Norm Results on MSCoCo Benchmark

Model	Split	Task	R@1	R@5	R@10	MdR	Geom
CLIP	5k	t2i	30.3	56.1	67.1	4.0	48.5
CLIP (+QB-Norm)	5k	t2i	34.8	59.9	70.4	3.0	52.8
MMT-Oscar	5k	t2i	52.2	80.2	88.0	1.0	71.7
MMT-Oscar (+QB-Norm)	5k	t2i	53.9	80.5	88.1	1.0	72.6

Text-Audio retrieval results

QB-Norm Results on AudioCaps Benchmark

Model	Split	Task	R@1	R@5	R@10	MdR	Geom
AR-CE	Full	t2a	23.1 (0.6)	55.1 (0.7)	70.7 (0.6)	4.7 (0.5)	44.8 (0.7)
AR-CE (+QB-Norm)	Full	t2a	23.9 (0.2)	57.1 (0.3)	71.6 (0.4)	4.0 (0.0)	46.0 (0.3)

References

If you find this code useful or use the extracted similarity matrices, please consider citing:

@misc{bogolin2021cross,
      title={Cross Modal Retrieval with Querybank Normalisation}, 
      author={Simion-Vlad Bogolin and Ioana Croitoru and Hailin Jin and Yang Liu and Samuel Albanie},
      year={2021},
      eprint={2112.12777},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
