Official source for Spanish language models and resources made at BSC-TeMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).

Last update: Dec 20, 2022

Overview

Spanish Language Models 💃🏻

Corpora 📃

Corpus	Number of documents	Size (GB)
BNE	201,080,084	570

Models 🤖

RoBERTa-base BNE: https://huggingface.co/BSC-TeMU/roberta-base-bne
RoBERTa-large BNE: https://huggingface.co/BSC-TeMU/roberta-large-bne
Other models: (WIP)

Word embeddings 🔤

300-dimensional word embeddings trained with FastText:

CBOW Word embeddings: https://zenodo.org/record/5044988
Skip-gram Word embeddings: https://zenodo.org/record/5046525
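
As a rough illustration of how such embeddings can be consumed, the sketch below parses the plain-text word2vec/FastText vector format (a header line with the vocabulary size and dimension, then one word per line followed by its values) and compares two words by cosine similarity. It assumes the Zenodo archives contain a file in that text format, which this page does not confirm; the `SAMPLE` data, the 2-dimensional toy vectors, and the helper names `load_vec` and `cosine` are hypothetical and only stand in for the real 300-dimensional files.

```python
import io
import math


def load_vec(handle, limit=None):
    """Parse text-format vectors: header 'count dim', then 'word v1 ... vd'."""
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed rows
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        if limit is not None and len(vectors) >= limit:
            break  # optionally keep only the first `limit` words
    return dim, vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy stand-in for a real embeddings file (hypothetical values, 2 dimensions).
SAMPLE = "3 2\nhola 1.0 0.0\nadiós 0.0 1.0\nbuenas 1.0 1.0\n"

dim, vectors = load_vec(io.StringIO(SAMPLE))
print(dim, sorted(vectors))
print(round(cosine(vectors["hola"], vectors["buenas"]), 4))
```

With a real file, `io.StringIO(SAMPLE)` would be replaced by `open(path, encoding="utf-8")`, and `limit` helps keep memory use down when only the most frequent words are needed.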

Evaluation ✅

Dataset	Metric	RoBERTa-b	RoBERTa-l	BETO	mBERT	BERTIN
UD-POS	F1	0.9907	0.9901	0.9900	0.9886	0.9904
Conll-NER	F1	0.8851	0.8772	0.8759	0.8691	0.8627
Capitel-POS	F1	0.9846	0.9851	0.9836	0.9839	0.9826
Capitel-NER	F1	0.8959	0.8998	0.8771	0.8810	0.8741
STS	Combined	0.8423	0.8420	0.8216	0.8249	0.7822
MLDoc	Accuracy	0.9595	0.9600	0.9650	0.9560	0.9673
PAWS-X	F1	0.9035	0.9000	0.8915	0.9020	0.8820
XNLI	Accuracy	0.8016	WiP	0.8130	0.7876	WiP
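
To make the table easier to read at a glance, the following sketch copies the scores above into a dictionary and reports the top-scoring model per dataset. It is only a reading aid: entries marked WiP are stored as `None` and skipped, and the names `MODELS`, `SCORES`, and `best_model` are introduced here for illustration.

```python
MODELS = ["RoBERTa-b", "RoBERTa-l", "BETO", "mBERT", "BERTIN"]

# Scores copied from the evaluation table; None marks a WiP entry.
SCORES = {
    "UD-POS": [0.9907, 0.9901, 0.9900, 0.9886, 0.9904],
    "Conll-NER": [0.8851, 0.8772, 0.8759, 0.8691, 0.8627],
    "Capitel-POS": [0.9846, 0.9851, 0.9836, 0.9839, 0.9826],
    "Capitel-NER": [0.8959, 0.8998, 0.8771, 0.8810, 0.8741],
    "STS": [0.8423, 0.8420, 0.8216, 0.8249, 0.7822],
    "MLDoc": [0.9595, 0.9600, 0.9650, 0.9560, 0.9673],
    "PAWS-X": [0.9035, 0.9000, 0.8915, 0.9020, 0.8820],
    "XNLI": [0.8016, None, 0.8130, 0.7876, None],
}


def best_model(row):
    """Return the name of the highest-scoring model, ignoring missing scores."""
    scored = [(score, name) for score, name in zip(row, MODELS) if score is not None]
    return max(scored)[1]


for dataset, row in SCORES.items():
    print(f"{dataset}: {best_model(row)}")
```

Many of the differences are in the third or fourth decimal place, so a per-dataset winner should be read as indicative rather than conclusive.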

Usage example ⚗️

For RoBERTa-base:

from pprint import pprint

from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline

# Load the tokenizer and the masked-language model from the Hugging Face Hub
tokenizer_hf = AutoTokenizer.from_pretrained('BSC-TeMU/roberta-base-bne')
model = AutoModelForMaskedLM.from_pretrained('BSC-TeMU/roberta-base-bne')
model.eval()  # inference mode (disables dropout)

# Predict the most likely tokens for the <mask> position
pipeline = FillMaskPipeline(model, tokenizer_hf)
text = "¡Hola <mask>!"
res_hf = pipeline(text)
pprint([r['token_str'] for r in res_hf])

For RoBERTa-large:

from pprint import pprint

from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline

# Load the tokenizer and the masked-language model from the Hugging Face Hub
tokenizer_hf = AutoTokenizer.from_pretrained('BSC-TeMU/roberta-large-bne')
model = AutoModelForMaskedLM.from_pretrained('BSC-TeMU/roberta-large-bne')
model.eval()  # inference mode (disables dropout)

# Predict the most likely tokens for the <mask> position
pipeline = FillMaskPipeline(model, tokenizer_hf)
text = "¡Hola <mask>!"
res_hf = pipeline(text)
pprint([r['token_str'] for r in res_hf])
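
The two snippets above differ only in the model name, so they can be folded into one helper. The sketch below is one possible way to do that, using the `pipeline("fill-mask", ...)` factory from `transformers` and a small function that filters and sorts the predictions by score. The helper names `load_fill_mask` and `rank_candidates` are hypothetical, and `fake_results` is a hand-written stand-in with made-up scores (not real model output) so the ranking logic can be tried without downloading any weights.

```python
def load_fill_mask(model_name="BSC-TeMU/roberta-base-bne"):
    """Build a fill-mask pipeline; downloads the model on first use."""
    from transformers import pipeline  # imported lazily: heavy dependency

    return pipeline("fill-mask", model=model_name)


def rank_candidates(results, min_score=0.0):
    """Return (token, score) pairs from fill-mask output, best first."""
    pairs = [
        (r["token_str"].strip(), r["score"])
        for r in results
        if r["score"] >= min_score
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hand-written stand-in for pipeline output (made-up tokens and scores).
fake_results = [
    {"token_str": " mundo", "score": 0.40},
    {"token_str": " amigo", "score": 0.05},
    {"token_str": " guapa", "score": 0.25},
]

print(rank_candidates(fake_results, min_score=0.1))
```

With the library installed, `rank_candidates(load_fill_mask()("¡Hola <mask>!"))` would apply the same ranking to real predictions; passing `"BSC-TeMU/roberta-large-bne"` to `load_fill_mask` selects the large model.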

Other Spanish Language Models 👩‍👧‍👦

We are developing domain-specific language models:

Legal Language Model

Cite 📣

@misc{gutierrezfandino2021spanish,
      title={Spanish Language Models}, 
      author={Asier Gutiérrez-Fandiño and Jordi Armengol-Estapé and Marc Pàmies and Joan Llop-Palao and Joaquín Silveira-Ocampo and Casimiro Pio Carrino and Aitor Gonzalez-Agirre and Carme Armentano-Oller and Carlos Rodriguez-Penagos and Marta Villegas},
      year={2021},
      eprint={2107.07253},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contact 📧

📋 We are interested in (1) extending our corpora to train larger models and (2) training and evaluating the models on other tasks.

For questions regarding this work, contact Asier Gutiérrez-Fandiño ([email protected])

Owner

PlanTL-SANIDAD
