Codes for our paper "SentiLARE: Sentiment-Aware Language Representation Learning with Linguistic Knowledge" (EMNLP 2020)

Last update: Dec 30, 2022

Overview

SentiLARE: Sentiment-Aware Language Representation Learning with Linguistic Knowledge

Introduction

SentiLARE is a sentiment-aware pre-trained language model enhanced by linguistic knowledge. You can read our paper for more details. This project is a PyTorch implementation of our work.

Dependencies

Python 3
NumPy
Scikit-learn
PyTorch >= 1.3.0
PyTorch-Transformers (Huggingface) 1.2.0
TensorboardX
Sentence Transformers 0.2.6 (Optional, used for linguistic knowledge acquisition during pre-training and fine-tuning)
NLTK (Optional, used for linguistic knowledge acquisition during pre-training and fine-tuning)

Quick Start for Fine-tuning

Datasets of Downstream Tasks

Our experiments contain sentence-level sentiment classification (e.g. SST / MR / IMDB / Yelp-2 / Yelp-5) and aspect-level sentiment analysis (e.g. Lap14 / Res14 / Res16). You can download the pre-processed datasets (Google Drive / Tsinghua Cloud) of the downstream tasks. The detailed description of the data formats is attached to the datasets.

Fine-tuning

To quickly conduct the fine-tuning experiments, you can directly download the checkpoint (Google Drive / Tsinghua Cloud) of our pre-trained model. We show the example of fine-tuning SentiLARE on SST as follows:

cd finetune
CUDA_VISIBLE_DEVICES=0,1,2 python run_sent_sentilr_roberta.py \
          --data_dir data/sent/sst \
          --model_type roberta \
          --model_name_or_path pretrain_model/ \
          --task_name sst \
          --do_train \
          --do_eval \
          --max_seq_length 256 \
          --per_gpu_train_batch_size 4 \
          --learning_rate 2e-5 \
          --num_train_epochs 3 \
          --output_dir sent_finetune/sst \
          --logging_steps 100 \
          --save_steps 100 \
          --warmup_steps 100 \
          --eval_all_checkpoints \
          --overwrite_output_dir

Note that data_dir is set to the directory of pre-processed SST dataset, and model_name_or_path is set to the directory of the pre-trained model checkpoint. output_dir is the directory to save the fine-tuning checkpoints. You can refer to the fine-tuning codes to get the description of other hyper-parameters.

More details about fine-tuning SentiLARE on other datasets can be found in finetune/README.MD.

POS Tagging and Polarity Acquisition for Downstream Tasks

During pre-processing, we tokenize the original datasets with NLTK, tag the sentences with Stanford Log-Linear Part-of-Speech Tagger, and obtain the sentiment polarity with Sentence-BERT.

Pre-training

If you want to conduct pre-training by yourself instead of directly using the checkpoint we provide, this part may help you pre-process the pre-training dataset and run the pre-training scripts.

Dataset

We use Yelp Dataset Challenge 2019 as our pre-training dataset. According to the Term of Use of Yelp dataset, you should download Yelp dataset on your own.

POS Tagging and Polarity Acquisition for Pre-training Dataset

Similar to fine-tuning, we also conduct part-of-speech tagging and sentiment polarity acquisition on the pre-training dataset. Note that since the pre-training dataset is quite large, the pre-processing procedure may take a long time because we need to use Sentence-BERT to obtain the representation vectors of all the sentences in the pre-training dataset.

Pre-training

Refer to pretrain/README.MD for more implementation details about pre-training.

Citation

@inproceedings{ke-etal-2020-sentilare,
    title = "{S}enti{LARE}: Sentiment-Aware Language Representation Learning with Linguistic Knowledge",
    author = "Ke, Pei  and Ji, Haozhe  and Liu, Siyang  and Zhu, Xiaoyan  and Huang, Minlie",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    pages = "6975--6988",
}

Please kindly cite our paper if this paper and the codes are helpful.

Thanks

Many thanks to the GitHub repositories of Transformers and BERT-PT. Part of our codes are modified based on their codes.

Codes for our paper "SentiLARE: Sentiment-Aware Language Representation Learning with Linguistic Knowledge" (EMNLP 2020)

Related tags

Overview

SentiLARE: Sentiment-Aware Language Representation Learning with Linguistic Knowledge

Introduction

Dependencies

Quick Start for Fine-tuning

Datasets of Downstream Tasks

Fine-tuning

POS Tagging and Polarity Acquisition for Downstream Tasks

Pre-training

Dataset

POS Tagging and Polarity Acquisition for Pre-training Dataset

Pre-training

Citation

Thanks

Owner

End-To-End Memory Network using Tensorflow

This is the code repository implementing the paper "TreePartNet: Neural Decomposition of Point Clouds for 3D Tree Reconstruction".

Official PyTorch implementation of "Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble" (NeurIPS'21)

Speedy Implementation of Instance-based Learning (IBL) agents in Python

Web-interface + rest API for classification and regression (https://jeff1evesque.github.io/machine-learning.docs)

Neural network for digit classification powered by cuda

Official implementation of AAAI-21 paper "Label Confusion Learning to Enhance Text Classification Models"

Learning to Draw: Emergent Communication through Sketching

Codebase for Attentive Neural Hawkes Process (A-NHP) and Attentive Neural Datalog Through Time (A-NDTT)

Python port of R's Comprehensive Dynamic Time Warp algorithm package

Codebase for ECCV18 "The Sound of Pixels"

Implementation of CVPR 2021 paper "Spatially-invariant Style-codes Controlled Makeup Transfer"

A PyTorch implementation of "SimGNN: A Neural Network Approach to Fast Graph Similarity Computation" (WSDM 2019).

TensorFlow Tutorials with YouTube Videos

Codebase for the paper titled "Continual learning with local module selection"

Must-read Papers on Physics-Informed Neural Networks.

Implementation of the Remixer Block from the Remixer paper, in Pytorch

Roach: End-to-End Urban Driving by Imitating a Reinforcement Learning Coach

Council-GAN - Implementation for our paper Breaking the Cycle - Colleagues are all you need (CVPR 2020)

Code for "Neural Body: Implicit Neural Representations with Structured Latent Codes for Novel View Synthesis of Dynamic Humans" CVPR 2021 best paper candidate