chaii - hindi & tamil question answering

Overview

chaii - hindi & tamil question answering

This is the solution for rank 5th in Kaggle competition: chaii - Hindi and Tamil Question Answering. The competition can be found here: https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering

Datasets required

Download squadv2 data from https://rajpurkar.github.io/SQuAD-explorer/

$ mkdir input && cd input
$ wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
$ wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json

Download tydiqa data in the input folder:

$ wget https://storage.googleapis.com/tydiqa/v1.1/tydiqa-goldp-v1.1-train.json
$ wget https://storage.googleapis.com/tydiqa/v1.1/tydiqa-goldp-v1.1-dev.json

Download data from https://www.kaggle.com/tkm2261/google-translated-squad20-to-hindi-and-tamil to input folder

Download original competition dataset to input folder: https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering/data

Download outputs of this kernel: https://www.kaggle.com/rhtsingh/external-data-mlqa-xquad-preprocessing/ to input folder

Now, you have all the data needed to train the model. We will first create folds and munge the data a bit.

To create folds, please use the following command:

$ cd src
$ python create_folds.py

To munge the datasets and prepare for training, please run the following command:

$ cd src
$ python munge_data.py

Training

There are two GPU models and one model needs TPUs.

GPU models: XLM-Roberta & Rembert TPU model: Muril-Large

XLM-Roberta:

$ cd src
$ TOKENIZERS_PARALLELISM=false python xlm_roberta.py --fold 0
$ TOKENIZERS_PARALLELISM=false python xlm_roberta.py --fold 1
$ TOKENIZERS_PARALLELISM=false python xlm_roberta.py --fold 2
$ TOKENIZERS_PARALLELISM=false python xlm_roberta.py --fold 3
$ TOKENIZERS_PARALLELISM=false python xlm_roberta.py --fold 4

Rembert:

$ cd src
$ TOKENIZERS_PARALLELISM=false python rembert.py --fold 0
$ TOKENIZERS_PARALLELISM=false python rembert.py --fold 1
$ TOKENIZERS_PARALLELISM=false python rembert.py --fold 2
$ TOKENIZERS_PARALLELISM=false python rembert.py --fold 3
$ TOKENIZERS_PARALLELISM=false python rembert.py --fold 4

Muril-Large

** please note that training this model needs TPUs **

$ cd src
$ TOKENIZERS_PARALLELISM=false python muril_large.py --fold 0
$ TOKENIZERS_PARALLELISM=false python muril_large.py --fold 1
$ TOKENIZERS_PARALLELISM=false python muril_large.py --fold 2
$ TOKENIZERS_PARALLELISM=false python muril_large.py --fold 3
$ TOKENIZERS_PARALLELISM=false python muril_large.py --fold 4

Inference

After training all the models, the outputs were pushed to Kaggle Datasets.

The final model datasets can be found here:

- https://www.kaggle.com/abhishek/xlmrobertalargewithsquadv2tydiqasqdtrans384f
- https://www.kaggle.com/ubamba98/modelsrembertwithsquadv2tydiqa384
- https://www.kaggle.com/ubamba98/murillargecasedchaii

And the final inference kernel can be found here: https://www.kaggle.com/abhishek/chaii-xlm-roberta-x-muril-x-rembert-score-based

Solution writeup: https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering/discussion/288049

Owner
abhishek thakur
Kaggle: www.kaggle.com/abhishek
abhishek thakur
Simple python code to fix your combo list by removing any text after a separator or removing duplicate combos

Combo List Fixer A simple python code to fix your combo list by removing any text after a separator or removing duplicate combos Removing any text aft

Hamidreza Dehghan 3 Dec 05, 2022
Search msDS-AllowedToActOnBehalfOfOtherIdentity

前言 现在进行RBCD的攻击手段主要是搜索mS-DS-CreatorSID,如果机器的创建者是我们可控的话,那就可以修改对应机器的msDS-AllowedToActOnBehalfOfOtherIdentity,利用工具SharpAllowedToAct-Modify 那我们索性也试试搜索所有计算机

Jumbo 26 Dec 05, 2022
本项目是作者们根据个人面试和经验总结出的自然语言处理(NLP)面试准备的学习笔记与资料,该资料目前包含 自然语言处理各领域的 面试题积累。

【关于 NLP】那些你不知道的事 作者:杨夕、芙蕖、李玲、陈海顺、twilight、LeoLRH、JimmyDU、艾春辉、张永泰、金金金 介绍 本项目是作者们根据个人面试和经验总结出的自然语言处理(NLP)面试准备的学习笔记与资料,该资料目前包含 自然语言处理各领域的 面试题积累。 目录架构 一、【

1.4k Dec 30, 2022
Creating an LSTM model to generate music

Music-Generation Creating an LSTM model to generate music music-generator Used to create basic sin wave sounds music-ai Contains the functions to conv

Jerin Joseph 2 Dec 02, 2021
Ceaser-Cipher - The Caesar Cipher technique is one of the earliest and simplest method of encryption technique

Ceaser-Cipher The Caesar Cipher technique is one of the earliest and simplest me

Lateefah Ajadi 2 May 12, 2022
多语言降噪预训练模型MBart的中文生成任务

mbart-chinese 基于mbart-large-cc25 的中文生成任务 Input source input: text + /s + lang_code target input: lang_code + text + /s Usage token_ids_mapping.jso

11 Sep 19, 2022
Translates basic English sentences into the Huna language (hoo-NAH)

huna-translator The Huna Language Translates basic English sentences into the Huna language (hoo-NAH). The Huna constructed language was developed in

Miles Smith 0 Jan 20, 2022
PortaSpeech - PyTorch Implementation

PortaSpeech - PyTorch Implementation PyTorch Implementation of PortaSpeech: Portable and High-Quality Generative Text-to-Speech. Model Size Module Nor

Keon Lee 276 Dec 26, 2022
This repo contains simple to use, pretrained/training-less models for speaker diarization.

PyDiar This repo contains simple to use, pretrained/training-less models for speaker diarization. Supported Models Binary Key Speaker Modeling Based o

12 Jan 20, 2022
Code for our paper "Transfer Learning for Sequence Generation: from Single-source to Multi-source" in ACL 2021.

TRICE: a task-agnostic transferring framework for multi-source sequence generation This is the source code of our work Transfer Learning for Sequence

THUNLP-MT 9 Jun 27, 2022
Unsupervised text tokenizer for Neural Network-based text generation.

SentencePiece SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabu

Google 6.4k Jan 01, 2023
A program that uses real statistics to choose the best times to bet on BloxFlip's crash gamemode

Bloxflip Smart Bet A program that uses real statistics to choose the best times to bet on BloxFlip's crash gamemode. https://bloxflip.com/crash. THIS

43 Jan 05, 2023
customer care chatbot made with Rasa Open Source.

Customer Care Bot Customer care bot for ecomm company which can solve faq and chitchat with users, can contact directly to team. 🛠 Features Basic E-c

Dishant Gandhi 23 Oct 27, 2022
This repository serves as a place to document a toy attempt on how to create a generative text model in Catalan, based on GPT-2

GPT-2 Catalan playground and scripts to train a GPT-2 model either from scrath or from another pretrained model.

Laura 1 Jan 28, 2022
The code for the Subformer, from the EMNLP 2021 Findings paper: "Subformer: Exploring Weight Sharing for Parameter Efficiency in Generative Transformers", by Machel Reid, Edison Marrese-Taylor, and Yutaka Matsuo

Subformer This repository contains the code for the Subformer. To help overcome this we propose the Subformer, allowing us to retain performance while

Machel Reid 10 Dec 27, 2022
Python library to make development of portfolio analysis faster and easier

Trafalgar Python library to make development of portfolio analysis faster and easier Installation 🔥 For the moment, Trafalgar is still in beta develo

Santosh Passoubady 641 Jan 01, 2023
Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.

Pytorch-NLU,一个中文文本分类、序列标注工具包,支持中文长文本、短文本的多类、多标签分类任务,支持中文命名实体识别、词性标注、分词等序列标注任务。 Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classifi

186 Dec 24, 2022
Uncomplete archive of files from the European Nopsled Team

European Nopsled CTF Archive This is an archive of collected material from various Capture the Flag competitions that the European Nopsled team played

European Nopsled 4 Nov 24, 2021
PyTorch implementation of NATSpeech: A Non-Autoregressive Text-to-Speech Framework

A Non-Autoregressive Text-to-Speech (NAR-TTS) framework, including official PyTorch implementation of PortaSpeech (NeurIPS 2021) and DiffSpeech (AAAI 2022)

760 Jan 03, 2023
Signature remover is a NLP based solution which removes email signatures from the rest of the text.

Signature Remover Signature remover is a NLP based solution which removes email signatures from the rest of the text. It helps to enchance data conten

Forges Alterway 8 Jan 06, 2023