DziriBERT: a Pre-trained Language Model for the Algerian Dialect

Last update: Jan 07, 2023

Related tags

Overview

DziriBERT

DziriBERT is the first Transformer-based Language Model that has been pre-trained specifically for the Algerian Dialect. It handles Algerian text contents written using both Arabic and Latin characters. It sets new state of the art results on Algerian text classification datasets, even if it has been pre-trained on much less data (~1 million tweets).

The model is publicly available at: https://huggingface.co/alger-ia/dziribert.

For more information, please visit our paper: https://arxiv.org/pdf/2109.12346.pdf

Evaluation

The Twifil dataset was used to compare DziriBERT with current multilingual, standard Arabic and dialectal Arabic models:

Model	Sentiment acc.	Emotion acc.
bert-base-multilingual-cased	73.6 %	59.4 %
aubmindlab/bert-base-arabert	72.1 %	61.2 %
CAMeL-Lab/bert-base-arabic-camelbert-mix	77.1 %	65.7 %
qarib/bert-base-qarib	77.7 %	67.6 %
UBC-NLP/MARBERT	80.1 %	68.4 %
alger-ia/dziribert	80.3 %	69.3 %

In order to reproduce these results, please install the following requirements:

pip install -r requirements.txt

Then, run the following evaluation script:

python3 evaluate_model.py

These results have been obtained on a Tesla K80 GPU.

Pretrained DziriBERT

DziriBERT has been uploaded to the HuggingFace hub in order to facilitate its use: https://huggingface.co/alger-ia/dziribert.

It can be easily downloaded and loaded using the transformers library:

from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("alger-ia/dziribert")
model = BertForMaskedLM.from_pretrained("alger-ia/dziribert")

How to cite

@article{dziribert,
  title={DziriBERT: a Pre-trained Language Model for the Algerian Dialect},
  author={Abdaoui, Amine and Berrimi, Mohamed and Oussalah, Mourad and Moussaoui, Abdelouahab},
  journal={arXiv preprint arXiv:2109.12346},
  year={2021}
}

Contact

Please contact [email protected] for any question, feedback or request.

DziriBERT: a Pre-trained Language Model for the Algerian Dialect

Related tags

Overview

DziriBERT

Evaluation

Pretrained DziriBERT

How to cite

Contact

Owner

End-to-End Speech Processing Toolkit

ChessCoach is a neural network-based chess engine capable of natural-language commentary.

gaiic2021-track3-小布助手对话短文本语义匹配复赛rank3、决赛rank4

Code Generation using a large neural network called GPT-J

A Fast Sequence Transducer Implementation with PyTorch Bindings

Python library for Serbian Natural language processing (NLP)

Disfl-QA: A Benchmark Dataset for Understanding Disfluencies in Question Answering

Code-autocomplete, a code completion plugin for Python

Easy to start. Use deep nerual network to predict the sentiment of movie review.

Azure Text-to-speech service for Home Assistant

Code for the paper "Flexible Generation of Natural Language Deductions"

NLP - Machine learning

Code to reproduce the results of the paper 'Towards Realistic Few-Shot Relation Extraction' (EMNLP 2021)

Paddlespeech Streaming ASR GUI

Machine Learning Course Project, IMDB movie review sentiment analysis by lstm, cnn, and transformer

One Stop Anomaly Shop: Anomaly detection using two-phase approach: (a) pre-labeling using statistics, Natural Language Processing and static rules; (b) anomaly scoring using supervised and unsupervised machine learning.

A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)

[EMNLP 2021] LM-Critic: Language Models for Unsupervised Grammatical Error Correction

Trex is a tool to match semantically similar functions based on transfer learning.

Pretrain CPM - 大规模预训练语言模型的预训练代码