WACV 2022 Paper - Is An Image Worth Five Sentences? A New Look into Semantics for Image-Text Matching

Last update: Dec 17, 2022

Overview

Is An Image Worth Five Sentences? A New Look into Semantics for Image-Text Matching

Code based on our WACV 2022 Accepted Paper: https://arxiv.org/pdf/2110.02623.pdf

Project is built on top of the [CVSE] (https://github.com/BruceW91/CVSE) in PyTorch. However, it is easy to adapt to different Image-Text Matching models (SCAN, VSRN, SGRAF). Regarding the proposed metric code and evaluation, please visit: https://github.com/furkanbiten/ncs_metric.

Introduction

The task of image-text matching aims to map representations from different modalities into a common joint visual-textual embedding. However, the most widely used datasets for this task, MSCOCO and Flickr30K, are actually image captioning datasets that offer a very limited set of relationships between images and sentences in their ground-truth annotations. This limited ground truth information forces us to use evaluation metrics based on binary relevance: given a sentence query we consider only one image as relevant. However, many other relevant images or captions may be present in the dataset. In this work, we propose two metrics that evaluate the degree of semantic relevance of retrieved items, independently of their annotated binary relevance. Additionally, we incorporate a novel strategy that uses an image captioning metric, CIDEr, to define a Semantic Adaptive Margin (SAM) to be optimized in a standard triplet loss. By incorporating our formulation to existing models, a large improvement is obtained in scenarios where available training data is limited. We also demonstrate that the performance on the annotated image-caption pairs is maintained while improving on other non-annotated relevant items when employing the full training set. The code for our new metric can be found at https://github.com/furkanbiten/ncs_metric and model https://github.com/andrespmd/semantic_adaptive_margin

Install Environment

Git clone the project.

Create Conda environment:

$ conda env create -f env.yml

Activate the environment:

$ conda activate pytorch12

Download Metric Data

Please download the following compressed file from:

Uncompress the downloaded file under the main project folder. The uncompressed folder name should be "cider".

WACV 2022 Paper - Is An Image Worth Five Sentences? A New Look into Semantics for Image-Text Matching

Related tags

Overview

Is An Image Worth Five Sentences? A New Look into Semantics for Image-Text Matching

Introduction

Install Environment

Download Metric Data

Owner

Andres

Generate a list of papers with publicly available source code in the daily arxiv

零样本学习测评基准，中文版

A small C++ implementation of LSTM networks, focused on OCR.

Corner-based Region Proposal Network

Using computer vision method to recognize and calcutate the features of the architecture.

A community-supported supercharged version of paperless: scan, index and archive all your physical documents

CTPN + DenseNet + CTC based end-to-end Chinese OCR implemented using tensorflow and keras

Visual Attention based OCR

TextBoxes: A Fast Text Detector with a Single Deep Neural Network https://github.com/MhLiao/TextBoxes 基于SSD改进的文本检测算法，textBoxes_note记录了之前整理的笔记。

Python package for handwriting and sketching in Jupyter cells

An Implementation of the alogrithm in paper IncepText: A New Inception-Text Module with Deformable PSROI Pooling for Multi-Oriented Scene Text Detection

Discord QR Scam Code Generator + Token grab mobile device.

This can be use to convert text in a file to handwritten text.

Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.

Python Computer Vision application that allows users to draw/erase on the screen using their webcam.

Semantic-based Patch Detection for Binary Programs

Rubik's Cube in pygame with OpenGL

This project proposes a camera vision based cursor control system, using hand moment captured from a webcam through a landmarks of hand by using Mideapipe module

🖺 OCR using tensorflow with attention

Zoom , GoogleMeets에서 Vtuber 데뷔하기