KR-BERT-SimCSE

Implementing SimCSE (paper, official repository) using TensorFlow 2 and KR-BERT.

Training

Unsupervised

python train_unsupervised.py --mixed_precision
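The --mixed_precision flag enables float16 compute. A minimal sketch of how such a flag can be wired up in TensorFlow 2 (the argument parsing here is illustrative, not copied from the training script):

import argparse

import tensorflow as tf

parser = argparse.ArgumentParser()
parser.add_argument("--mixed_precision", action="store_true")
args = parser.parse_args()

if args.mixed_precision:
    # Run computations in float16 while keeping variables in float32.
    tf.keras.mixed_precision.set_global_policy("mixed_float16")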

I used the Korean Wikipedia corpus, which is already split into sentences. (See the tfds-korean catalog page for details.)
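For reference, a minimal sketch of loading the corpus through tfds-korean; the dataset id korean_wikipedia_corpus and the content feature key are assumptions, so check the catalog page for the exact names:

import tensorflow_datasets as tfds
import tfds_korean.korean_wikipedia_corpus  # registers the dataset with tfds

# "korean_wikipedia_corpus" and the "content" feature are assumed names;
# see the tfds-korean catalog page for the exact ids.
ds = tfds.load("korean_wikipedia_corpus", split="train")
for example in ds.take(1):
    print(example["content"])  # the pre-split sentences of one article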

  • Settings
    • KR-BERT character
    • peak learning rate 3e-5
    • batch size 64
    • Total steps: 25,000
    • warmup rate 0.05, with a linear-decay learning rate schedule
    • temperature 0.05
    • evaluate on KLUE STS and KorSTS every 250 steps
    • max sequence length 64
    • use pooled outputs for training and the [CLS] token's representation for inference

The hyperparameters were not tuned and mostly followed the values in the paper.
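For readers unfamiliar with unsupervised SimCSE: each sentence is encoded twice, and because dropout is active during training, the two passes give two different views that form a positive pair against in-batch negatives. A minimal sketch of that loss with temperature 0.05 (the encoder is assumed to return pooled sentence embeddings; all names are illustrative, not the repository's):

import tensorflow as tf

def unsupervised_simcse_loss(encoder, inputs, temperature=0.05):
    # Two forward passes with dropout enabled yield two views of each sentence.
    z1 = tf.math.l2_normalize(encoder(inputs, training=True), axis=-1)
    z2 = tf.math.l2_normalize(encoder(inputs, training=True), axis=-1)

    # Cosine similarity of every view-1 row against every view-2 row.
    sim = tf.matmul(z1, z2, transpose_b=True) / temperature  # [batch, batch]

    # Row i's positive is column i; every other column is an in-batch negative.
    labels = tf.range(tf.shape(sim)[0])
    loss = tf.keras.losses.sparse_categorical_crossentropy(labels, sim, from_logits=True)
    return tf.reduce_mean(loss)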

Supervised

python train_supervised.py --mixed_precision

I used KorNLI for supervised training. (See the tfds-korean catalog page for details.)

  • Settings
    • KR-BERT character
    • batch size 128
    • epoch 3
    • peak learning rate 5e-5
    • warmup rate 0.05, with a linear-decay learning rate schedule
    • temperature 0.05
    • evaluate on KLUE STS and KorSTS every 125 steps
    • max sequence length 48
    • use pooled outputs for training and the [CLS] token's representation for inference

The hyperparameters were not tuned and mostly followed the values in the paper.
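The supervised variant draws (premise, entailment hypothesis, contradiction hypothesis) triples from the NLI data: the entailment is the positive and the contradiction is a hard negative. A minimal sketch of that loss under the same assumptions as above:

import tensorflow as tf

def supervised_simcse_loss(anchor, positive, hard_negative, temperature=0.05):
    # Pooled embeddings ([batch, dim]) of the premise, its entailment
    # hypothesis, and its contradiction hypothesis.
    a = tf.math.l2_normalize(anchor, axis=-1)
    p = tf.math.l2_normalize(positive, axis=-1)
    n = tf.math.l2_normalize(hard_negative, axis=-1)

    # Candidates are all positives plus all hard negatives: [batch, 2 * batch].
    candidates = tf.concat([p, n], axis=0)
    sim = tf.matmul(a, candidates, transpose_b=True) / temperature

    # Row i's correct class is its own entailment (column i); the second half
    # of the columns are contradictions acting as hard negatives.
    labels = tf.range(tf.shape(a)[0])
    loss = tf.keras.losses.sparse_categorical_crossentropy(labels, sim, from_logits=True)
    return tf.reduce_mean(loss)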

Results

KorSTS (dev set results)

model                            training                      encoding        100 × Spearman correlation
KR-BERT base SimCSE              unsupervised                  bi encoding     79.99
KR-BERT base SimCSE-supervised   trained on KorNLI             bi encoding     84.88
SRoBERTa base*                   unsupervised                  bi encoding     63.34
SRoBERTa base*                   trained on KorNLI             bi encoding     76.48
SRoBERTa base*                   trained on KorSTS             bi encoding     83.68
SRoBERTa base*                   trained on KorNLI -> KorSTS   bi encoding     83.54
SRoBERTa large*                  trained on KorNLI             bi encoding     77.95
SRoBERTa large*                  trained on KorSTS             bi encoding     84.74
SRoBERTa large*                  trained on KorNLI -> KorSTS   bi encoding     84.21

KorSTS (test set results)

model                            training                      encoding        100 × Spearman correlation
KR-BERT base SimCSE              unsupervised                  bi encoding     73.25
KR-BERT base SimCSE-supervised   trained on KorNLI             bi encoding     80.72
SRoBERTa base*                   unsupervised                  bi encoding     48.96
SRoBERTa base*                   trained on KorNLI             bi encoding     74.19
SRoBERTa base*                   trained on KorSTS             bi encoding     78.94
SRoBERTa base*                   trained on KorNLI -> KorSTS   bi encoding     80.29
SRoBERTa large*                  trained on KorNLI             bi encoding     75.46
SRoBERTa large*                  trained on KorSTS             bi encoding     79.55
SRoBERTa large*                  trained on KorNLI -> KorSTS   bi encoding     80.49
SRoBERTa base*                   trained on KorSTS             cross encoding  83.00
SRoBERTa large*                  trained on KorSTS             cross encoding  85.27

KLUE STS (dev set results)

model                            training                      encoding        100 × Pearson's correlation
KR-BERT base SimCSE              unsupervised                  bi encoding     74.45
KR-BERT base SimCSE-supervised   trained on KorNLI             bi encoding     79.42
KR-BERT base*                    supervised                    cross encoding  87.50
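All scores above are 100 × the correlation (Spearman for KorSTS, Pearson for KLUE STS) between the cosine similarity of each sentence pair's embeddings and the gold similarity score. A minimal sketch of that evaluation (the embed function is a stand-in for the model's inference path, e.g. the [CLS] representations):

import numpy as np
from scipy import stats

def sts_scores(embed, sentences1, sentences2, gold_scores):
    # embed: callable mapping a list of sentences to an [n, dim] array.
    e1, e2 = embed(sentences1), embed(sentences2)
    cos = np.sum(e1 * e2, axis=-1) / (
        np.linalg.norm(e1, axis=-1) * np.linalg.norm(e2, axis=-1)
    )
    spearman = stats.spearmanr(cos, gold_scores).correlation * 100
    pearson = stats.pearsonr(cos, gold_scores)[0] * 100
    return spearman, pearson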

References

@misc{gao2021simcse,
    title={SimCSE: Simple Contrastive Learning of Sentence Embeddings},
    author={Tianyu Gao and Xingcheng Yao and Danqi Chen},
    year={2021},
    eprint={2104.08821},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
@misc{ham2020kornli,
    title={KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding},
    author={Jiyeon Ham and Yo Joong Choe and Kyubyong Park and Ilji Choi and Hyungjoon Soh},
    year={2020},
    eprint={2004.03289},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
@misc{park2021klue,
    title={KLUE: Korean Language Understanding Evaluation},
    author={Sungjoon Park and Jihyung Moon and Sungdong Kim and Won Ik Cho and Jiyoon Han and Jangwon Park and Chisung Song and Junseong Kim and Yongsook Song and Taehwan Oh and Joohong Lee and Juhyun Oh and Sungwon Lyu and Younghoon Jeong and Inkwon Lee and Sangwoo Seo and Dongjun Lee and Hyunwoo Kim and Myeonghwa Lee and Seongbo Jang and Seungwon Do and Sunkyoung Kim and Kyungtae Lim and Jongwon Lee and Kyumin Park and Jamin Shin and Seonghyun Kim and Lucy Park and Alice Oh and Jung-Woo Ha and Kyunghyun Cho},
    year={2021},
    eprint={2105.09680},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}