L3Cube-MahaCorpus is a Marathi monolingual dataset scraped from different internet sources.

Last update: Dec 17, 2022

Overview

L3Cube-MahaCorpus

L3Cube-MahaCorpus is a Marathi monolingual dataset scraped from different internet sources. We expand the existing Marathi monolingual corpus with 24.8M sentences and 289M tokens. We also present MahaBERT, MahaAlBERT, and MahaRoBERTa, all BERT-based masked language models, and MahaFT, a set of fastText word embeddings. Both the language models and the embeddings are trained on the full Marathi corpus with 752M tokens. The evaluation details are given in our paper (link).

Dataset Statistics

L3Cube-MahaCorpus(full) = L3Cube-MahaCorpus(news) + L3Cube-MahaCorpus(non-news)

The full Marathi corpus incorporates all existing sources.

Dataset	#tokens(M)	#sentences(M)	Link
L3Cube-MahaCorpus(news)	212	17.6	link
L3Cube-MahaCorpus(non-news)	76.4	7.2	link
L3Cube-MahaCorpus(full)	289	24.8	link
Full Marathi Corpus(all sources)	752	57.2	link
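
As a quick consistency check on the table above, the news and non-news rows should add up to the L3Cube-MahaCorpus(full) row. The sketch below uses only the figures from the table; the dictionary keys and the rounding tolerances are illustrative assumptions.

```python
# Figures copied from the statistics table (in millions).
stats = {
    "news": {"tokens": 212.0, "sentences": 17.6},
    "non_news": {"tokens": 76.4, "sentences": 7.2},
    "full": {"tokens": 289.0, "sentences": 24.8},
    "all_sources": {"tokens": 752.0, "sentences": 57.2},
}


def combined(field):
    """Sum a field over the news and non-news subsets."""
    return stats["news"][field] + stats["non_news"][field]


# The table rounds its figures, so compare with a small tolerance
# (assumed: 1M tokens, 0.05M sentences).
tokens_match = abs(combined("tokens") - stats["full"]["tokens"]) <= 1.0
sentences_match = abs(combined("sentences") - stats["full"]["sentences"]) < 0.05

# Share of the full Marathi corpus contributed by L3Cube-MahaCorpus.
token_share = stats["full"]["tokens"] / stats["all_sources"]["tokens"]

print(tokens_match, sentences_match, round(token_share, 2))
```

By these figures, L3Cube-MahaCorpus accounts for roughly 38% of the tokens in the full Marathi corpus.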

Marathi BERT models and Marathi Fast Text model

The full Marathi corpus is used to train the BERT-based language models, which are available on the Hugging Face model hub.

Model	Description	Link
MahaBERT	Base-BERT	link
MahaRoBERTa	RoBERTa	link
MahaAlBERT	AlBERT	link
MahaFT	Fast Text	bin vec
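
One way to try a masked language model from the table is the `fill-mask` pipeline of the `transformers` library. The following is a minimal sketch, not the authors' own usage: the helper names are hypothetical, and `model_id` has to be the hub identifier behind the corresponding link in the table, which is not reproduced in this text.

```python
def mask_word(sentence, word, mask_token):
    """Replace the first occurrence of `word` with the tokenizer's mask token."""
    if word not in sentence:
        raise ValueError(f"{word!r} not found in sentence")
    return sentence.replace(word, mask_token, 1)


def load_fill_mask(model_id):
    """Build a fill-mask pipeline for a masked language model on the hub.

    `model_id` is deliberately not hard-coded: take it from the model's
    page on the hub.
    """
    # Imported lazily so that mask_word works without transformers installed.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)


def top_predictions(fill_mask, sentence, word, k=3):
    """Mask `word` in `sentence` and return the k most likely replacements."""
    masked = mask_word(sentence, word, fill_mask.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in fill_mask(masked, top_k=k)]
```

Reading the mask token from the tokenizer, rather than hard-coding `[MASK]`, keeps the same helper usable for BERT-style and RoBERTa-style models, which use different mask tokens.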

L3CubeMahaSent

L3CubeMahaSent is the largest publicly available Marathi sentiment analysis dataset to date. It consists of manually labelled Marathi tweets. The annotation guidelines are given in our paper (link).

Dataset Statistics

The dataset contains a total of 18,378 tweets, classified into three classes: Positive (1), Negative (-1), and Neutral (0). All tweets are present in their original form, without any preprocessing.

Out of these, 15,864 tweets are split into train (tweets-train.csv), test (tweets-test.csv), and validation (tweets-valid.csv) sets. This subset is used so that the three classes are balanced within each split.
The remaining 2,514 tweets are provided in a separate sheet (tweets-extra.csv).
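
To confirm the class balance of a split, the labels in a CSV file can be counted with the standard library. This is a sketch under stated assumptions: the header names `tweet` and `label` are hypothetical (check the actual header of the files), and a small in-memory sample stands in for a real split.

```python
import csv
import io
from collections import Counter


def label_counts(csv_file, label_column):
    """Count rows per integer label in an open CSV file object."""
    reader = csv.DictReader(csv_file)
    return Counter(int(row[label_column]) for row in reader)


# Tiny in-memory stand-in for a split file; the header names are assumptions.
sample = io.StringIO(
    "tweet,label\n"
    "example one,1\n"
    "example two,-1\n"
    "example three,0\n"
    "example four,1\n"
)
counts = label_counts(sample, "label")
print(counts[1], counts[-1], counts[0])
```

For a real split, pass a file opened with `open("tweets-train.csv", newline="", encoding="utf-8")` and the actual name of the label column.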

The statistics of the dataset are as follows:

Split	Total tweets	Tweets per class
Train	12114	4038
Test	2250	750
Validation	1500	500

The extra sheet contains 2,355 positive and 159 negative tweets. These tweets were not used in the baseline experiments.
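
The figures quoted in this section can be cross-checked against each other. The sketch below uses only numbers stated above; the variable names are illustrative.

```python
# (total tweets, tweets per class) for each split, from the table.
splits = {
    "train": (12114, 4038),
    "test": (2250, 750),
    "validation": (1500, 500),
}
NUM_CLASSES = 3  # positive, negative, neutral

# Each split is balanced: its total is three times the per-class count.
balanced = all(total == per_class * NUM_CLASSES for total, per_class in splits.values())

# The three splits together account for the 15,864 balanced tweets.
split_total = sum(total for total, _ in splits.values())

# The extra sheet holds the remaining positive and negative tweets.
extra = {"positive": 2355, "negative": 159}
extra_total = sum(extra.values())

grand_total = split_total + extra_total
print(balanced, split_total, extra_total, grand_total)
```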

Baseline Experiments

Two-class (positive, negative) and three-class (positive, negative, neutral) sentiment classification was performed on the dataset.
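
The two setups differ only in which labels are kept. The sketch below assumes that the two-class data is obtained by dropping the neutral tweets from the three-class data; this text does not spell out that step, so treat it as an assumption. The helper name and the sample rows are hypothetical.

```python
# Label encoding as given in the dataset statistics.
LABEL_NAMES = {1: "positive", -1: "negative", 0: "neutral"}


def to_two_class(examples):
    """Keep only positive and negative examples (assumed two-class setup)."""
    return [(text, label) for text, label in examples if label != 0]


# Placeholder rows of the form (tweet text, label).
three_class = [
    ("tweet a", 1),
    ("tweet b", 0),
    ("tweet c", -1),
    ("tweet d", 0),
]
two_class = to_two_class(three_class)

print([LABEL_NAMES[label] for _, label in two_class])
```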

Models

Some of the models used for the baseline experiments were:

CNN, BiLSTM
- fastText embeddings provided by IndicNLP and Facebook are used with both of the above models. The embeddings are used in two variations: static and trainable.
BERT-based models:
- Multilingual BERT
- IndicBERT

Results

Details of the best performing models are given in the following table:

Model	3-class	2-class
CNN IndicFT trainable	83.24	93.13
BiLSTM IndicFT trainable	82.89	91.80
IndicBERT	84.13	92.93

The fine-tuned IndicBERT model is available on Hugging Face (here). Further details about the dataset and baseline experiments can be found in the paper (pdf).

License

L3Cube-MahaCorpus and L3CubeMahaSent are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Citing

@article{joshi2022l3cube,
  title={L3Cube-MahaCorpus and MahaBERT: Marathi Monolingual Corpus, Marathi BERT Language Models, and Resources},
  author={Joshi, Raviraj},
  journal={arXiv preprint arXiv:2202.01159},
  year={2022}
}

@inproceedings{kulkarni2021l3cubemahasent,
  title={L3CubeMahaSent: A Marathi Tweet-based Sentiment Analysis Dataset},
  author={Kulkarni, Atharva and Mandhane, Meet and Likhitkar, Manali and Kshirsagar, Gayatri and Joshi, Raviraj},
  booktitle={Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis},
  pages={213--220},
  year={2021}
}

@inproceedings{kulkarni2022experimental,
  title={Experimental evaluation of deep learning models for {M}arathi text classification},
  author={Kulkarni, Atharva and Mandhane, Meet and Likhitkar, Manali and Kshirsagar, Gayatri and Jagdale, Jayashree and Joshi, Raviraj},
  booktitle={Proceedings of the 2nd International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and Applications},
  pages={605--613},
  year={2022},
  organization={Springer}
}
