Image captioning

End-to-end image captioning with EfficientNet-b3 + LSTM with Attention

The model is a seq2seq (encoder-decoder) model. The encoder uses a pretrained EfficientNet-b3 to extract image features. The decoder is an LSTM with Bahdanau attention.
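To illustrate what Bahdanau (additive) attention computes at a single decoding step, here is a minimal pure-Python sketch. It is not the repository's implementation: the function names, the weight matrices `w_enc`, `w_dec`, `v` and the toy numbers are all hypothetical, and a real model would use learned tensors on top of EfficientNet feature maps.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(encoder_features, decoder_hidden, w_enc, w_dec, v):
    """Additive attention for one decoding step.

    score_i = v . tanh(w_enc @ feature_i + w_dec @ decoder_hidden)
    weights = softmax(scores)
    context = sum_i weights_i * feature_i
    """
    projected_hidden = matvec(w_dec, decoder_hidden)
    scores = []
    for feature in encoder_features:
        projected_feature = matvec(w_enc, feature)
        energy = [math.tanh(a + b)
                  for a, b in zip(projected_feature, projected_hidden)]
        scores.append(sum(vi * ei for vi, ei in zip(v, energy)))
    weights = softmax(scores)
    feature_dim = len(encoder_features[0])
    context = [sum(w * f[d] for w, f in zip(weights, encoder_features))
               for d in range(feature_dim)]
    return context, weights


# Toy inputs (hypothetical): 3 image regions with 2 features each,
# a decoder hidden state of size 3, and an attention dimension of 2.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [0.5, -0.5, 0.25]
w_enc = [[0.1, 0.2], [0.3, -0.1]]
w_dec = [[0.2, 0.0, 0.1], [-0.1, 0.3, 0.0]]
v = [1.0, -1.0]

context, weights = additive_attention(features, hidden, w_enc, w_dec, v)
```

The attention weights sum to 1 across image regions, and the context vector is the weighted average of the region features that the decoder would consume at that step.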

Dataset

The dataset is available on Kaggle and contains 8,000 images, each paired with five different captions.
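Since every image has several captions, a common first step is to group captions by image. The sketch below assumes the captions file is a CSV with a header row and the columns `image` and `caption`; that layout is an assumption, not something this README states, and the sample rows are invented for illustration.

```python
import csv
import io
from collections import defaultdict


def group_captions(lines):
    """Map each image file name to the list of its captions.

    `lines` is any iterable of text lines, e.g. an open file.
    Assumes a header row with the columns 'image' and 'caption'.
    """
    captions = defaultdict(list)
    for row in csv.DictReader(lines):
        captions[row["image"]].append(row["caption"].strip())
    return dict(captions)


# Hypothetical sample standing in for the real captions file.
sample = io.StringIO(
    "image,caption\n"
    "img_001.jpg,A dog runs across the grass .\n"
    'img_001.jpg,"A brown dog, mid-stride, on a lawn ."\n'
    "img_002.jpg,Two children play on a beach .\n"
)
grouped = group_captions(sample)
```

Using the `csv` module rather than splitting on commas keeps captions that themselves contain commas intact.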

Usage

Run in a terminal: python -m img_caption

Config

The user interface consists of a single file:

config.yaml - general configuration with data and model parameters

Default config.yaml:

data:
  path_to_data_folder: "data"
  caption_file_name: "captions.txt"
  images_folder_name: "Images"
  output_folder_name: "output"
  logging_file_name: "logging.txt"
  model_file_name: "model.pt"

batch_size: 32
num_worker: 1
gensim_model_name: "glove-wiki-gigaword-200"

model:
  embedding_dimension: 200
  decoder_hidden_dimension: 300
  learning_rate: 0.0001
  momentum: 0.9
  n_epochs: 50
  clip: 5
  fine_tune_encoder: false
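One thing worth checking in this configuration is that `embedding_dimension` agrees with the vector size of the chosen gensim model: `glove-wiki-gigaword-200` provides 200-dimensional vectors, matching the default of 200. The sketch below is a hypothetical sanity check, not part of the package. It assumes the YAML has already been parsed into a dictionary shaped like the listing above and that the model name ends in its vector size.

```python
import re


def embedding_dim_from_name(gensim_model_name):
    """Return the trailing integer of a name such as 'glove-wiki-gigaword-200'."""
    match = re.search(r"-(\d+)$", gensim_model_name)
    if match is None:
        raise ValueError(f"no dimension suffix in {gensim_model_name!r}")
    return int(match.group(1))


def check_embedding_dimension(config):
    """Raise ValueError if the configured dimension and the model name disagree."""
    expected = embedding_dim_from_name(config["gensim_model_name"])
    actual = config["model"]["embedding_dimension"]
    if actual != expected:
        raise ValueError(
            f"embedding_dimension is {actual}, but "
            f"{config['gensim_model_name']} provides {expected}-dimensional vectors"
        )
    return actual


# Dictionary mirroring the relevant keys of the default config.yaml.
config = {
    "gensim_model_name": "glove-wiki-gigaword-200",
    "model": {"embedding_dimension": 200, "decoder_hidden_dimension": 300},
}
dimension = check_embedding_dimension(config)
```

Switching to a different pretrained embedding, for example a 100-dimensional one, would then require changing `embedding_dimension` to match.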

Output

After training, the pipeline writes the following file:

model.pt - checkpoint with:
- epoch - the last completed epoch
- model_state_dict - model parameters
- optimizer_state_dict - the state of the optimizer
- train_history - training history of the model
- valid_history - validation history of the model
- best_valid_loss - the best validation loss
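Once loaded, the checkpoint is a dictionary with the keys listed above. The following sketch summarizes such a dictionary. It assumes `train_history` and `valid_history` are lists holding one loss value per epoch, which this README does not confirm; the helper name and the sample values are hypothetical.

```python
def summarize_checkpoint(checkpoint):
    """Report the last epoch and the epoch with the lowest validation loss.

    Assumes the history entries are lists with one loss value per epoch.
    """
    valid_history = checkpoint["valid_history"]
    # Zero-based index of the epoch with the smallest validation loss.
    best_epoch = min(range(len(valid_history)), key=valid_history.__getitem__)
    return {
        "last_epoch": checkpoint["epoch"],
        "best_epoch": best_epoch,
        "best_valid_loss": checkpoint["best_valid_loss"],
        "final_train_loss": checkpoint["train_history"][-1],
    }


# With PyTorch installed, the dictionary would typically come from something like
# torch.load("output/model.pt", map_location="cpu"); the path is an assumption
# based on the default config. A hand-made stand-in is used here instead.
checkpoint = {
    "epoch": 3,
    "model_state_dict": {},
    "optimizer_state_dict": {},
    "train_history": [4.1, 3.2, 2.8, 2.5],
    "valid_history": [4.3, 3.5, 3.1, 3.3],
    "best_valid_loss": 3.1,
}
summary = summarize_checkpoint(checkpoint)
```

Comparing `best_epoch` with `last_epoch` gives a quick indication of whether validation loss was still improving when training stopped.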
