A Persian Image Captioning model based on Vision Encoder Decoder Models of the transformers🤗.

Last update: Aug 25, 2022

Overview

Persian-Image-Captioning

We fine-tuning the Vision Encoder Decoder Model for the task of image captioning on the coco-flickr-farsi dataset. The implementation of our model is in PyTorch with transformers library by Hugging Face( 🤗 ).

You can choose any pretrained vision model and any language model to use in the Vision Encoder Decoder model. Here we use ViT as the encoder, and ParsBERT (v2.0) as the decoder. The encoder and decoder are loaded separately via from_pretrained()function. Cross-attention layers are randomly initialized and added to the decoder.

You may refer to the Vision Encoder Decoder Model for more information.

How to use

You can generate caption of an image using this model using the code below:

import torch
import urllib
import PIL
import matplotlib.pyplot as plt
from transformers import ViTFeatureExtractor, AutoTokenizer, \
                         VisionEncoderDecoderModel

def show_img(image):
    # show image
    plt.axis("off")
    plt.imshow(image)
    
if torch.cuda.is_available():
    device = 'cuda'
else:
    device = 'cpu'


#pass the url of any image to generate a caption for it    
urllib.request.urlretrieve("https://images.unsplash.com/photo-1628191011227-522c7c3f0af9?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=870&q=80", "sample.png")
image = PIL.Image.open("sample.png")


#Load the model you trained for inference 
model_checkpoint = 'MahsaShahidi/Persian-Image-Captioning'
model = VisionEncoderDecoderModel.from_pretrained(model_checkpoint).to(device)

feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224-in21k')
tokenizer = AutoTokenizer.from_pretrained('HooshvareLab/bert-fa-base-uncased-clf-persiannews')

sample = feature_extractor(image, return_tensors="pt").pixel_values.to(device)
caption_ids = model.generate(sample, max_length = 30)[0]
caption_text = tokenizer.decode(caption_ids, skip_special_tokens=True)
print(caption_text)
show_img(image)

Inference

Following are the reslts of 3 captions generated on free stock photos after 2 epochs of training.

Image	Caption
	Generated Caption: زنی در آشپزخانه در حال اماده کردن غذا است.
	Generated Caption: گروهی از مردم در حال پرواز بادبادک در یک زمین چمنزار.
	Generated Caption: مردی در ماشین نشسته و به ماشین نگاه می کند.

Credits

A huge thanks to Kaggle for providing free access to GPU, and to the creators of Huggingface, ViT, and ParsBERT!

References

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

A Persian Image Captioning model based on Vision Encoder Decoder Models of the transformers🤗.

Related tags

Overview

Persian-Image-Captioning

How to use

Inference

Credits

References

Owner

Hamtech-ai

Just Another Telegram Ai Chat Bot Written In Python With Pyrogram.

Mkdocs + material + cool stuff

A framework for training and evaluating AI models on a variety of openly available dialogue datasets.

A natural language modeling framework based on PyTorch

Beta Distribution Guided Aspect-aware Graph for Aspect Category Sentiment Analysis with Affective Knowledge. Proceedings of EMNLP 2021

Higher quality textures for the Metal Gear Solid series.

SentAugment is a data augmentation technique for semi-supervised learning in NLP.

Reading Wikipedia to Answer Open-Domain Questions

Revisiting Pre-trained Models for Chinese Natural Language Processing (Findings of EMNLP 2020)

Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

A Domain Specific Language (DSL) for building language patterns. These can be later compiled into spaCy patterns, pure regex, or any other format

A number of methods in order to perform Natural Language Processing on live data derived from Twitter

End-to-End Speech Processing Toolkit

The RWKV Language Model

Chatbot with Pytorch, Python & Nextjs

ProtFeat is protein feature extraction tool that utilizes POSSUM and iFeature.

A PyTorch implementation of paper "Learning Shared Semantic Space for Speech-to-Text Translation", ACL (Findings) 2021

A Survey of Natural Language Generation in Task-Oriented Dialogue System (TOD): Recent Advances and New Frontiers

Flexible interface for high-performance research using SOTA Transformers leveraging Pytorch Lightning, Transformers, and Hydra.

Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents