When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset of 53,000+ Legal Holdings

This is the repository for the paper, When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset of 53,000+ Legal Holdings (Zheng and Guha et al., 2021), accepted to ICAIL 2021.

It includes models, datasets, and code for computing pretrain loss and finetuning Legal-BERT, Custom Legal-BERT, and BERT (double) models on legal benchmark tasks: Overruling, Terms of Service, CaseHOLD.

Download Models & Datasets

The legal benchmark task datasets and Legal-BERT, Custom Legal-BERT, and BERT (double) model files can be downloaded from the casehold Google Drive folder. For more information, see the Description of the folder.

The models can also be accessed directly from the Hugging Face model hub. To load a model from the model hub in a script, pass its Hugging Face model repository name to the model_name_or_path script argument. See demo.ipynb for more details.
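As a rough sketch of how a value for model_name_or_path might be chosen: prefer a local copy under models/ and otherwise fall back to a hub repository name. The helper resolve_model_name_or_path and the hub_namespace parameter are hypothetical, and the assumption that a hub repository is named after its local folder is ours; check the model hub for the actual repository names.

```python
from pathlib import Path

# Local folder names follow the directory layout described in this README.
MODEL_DIRS = ("bert-double", "custom-legalbert", "legalbert")


def resolve_model_name_or_path(model, root=".", hub_namespace=None):
    """Return a value suitable for the model_name_or_path script argument.

    Prefers a local copy under <root>/models/<model>. Otherwise falls back to
    "<hub_namespace>/<model>", where hub_namespace is a placeholder that the
    caller must supply (assumption: the hub repository shares the folder name).
    """
    if model not in MODEL_DIRS:
        raise ValueError(f"unknown model: {model!r}")
    local_dir = Path(root) / "models" / model
    if (local_dir / "config.json").is_file():
        return str(local_dir)
    if hub_namespace is None:
        raise FileNotFoundError(f"{local_dir} not found and no hub namespace given")
    return f"{hub_namespace}/{model}"
```

The returned string can then be passed unchanged as the model_name_or_path argument of the scripts.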

Hugging Face Model Repositories

Download the legal benchmark task datasets and the models (optional, scripts can directly load models from Hugging Face model repositories) from the casehold Google Drive folder and unzip them under the top-level directory like:

reglab/casehold
├── data
│   ├── casehold.csv
│   └── overruling.csv
├── models
│   ├── bert-double
│   │   ├── config.json
│   │   ├── pytorch_model.bin
│   │   ├── special_tokens_map.json
│   │   ├── tf_model.h5
│   │   ├── tokenizer_config.json
│   │   └── vocab.txt
│   ├── custom-legalbert
│   │   ├── config.json
│   │   ├── pytorch_model.bin
│   │   ├── special_tokens_map.json
│   │   ├── tf_model.h5
│   │   ├── tokenizer_config.json
│   │   └── vocab.txt
│   └── legalbert
│       ├── config.json
│       ├── pytorch_model.bin
│       ├── special_tokens_map.json
│       ├── tf_model.h5
│       ├── tokenizer_config.json
│       └── vocab.txt
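To confirm that the archives were unzipped into the layout above, a small check along these lines may help. It is only a sketch: the helper name missing_files is hypothetical, and the expected file list is copied from the tree shown here rather than from the repository's own tooling.

```python
from pathlib import Path

# File names copied from the directory tree shown in this README.
MODEL_FILES = (
    "config.json", "pytorch_model.bin", "special_tokens_map.json",
    "tf_model.h5", "tokenizer_config.json", "vocab.txt",
)
EXPECTED = [f"data/{name}" for name in ("casehold.csv", "overruling.csv")] + [
    f"models/{model}/{name}"
    for model in ("bert-double", "custom-legalbert", "legalbert")
    for name in MODEL_FILES
]


def missing_files(root="."):
    """List expected files (as relative paths) that are absent under root."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]


if __name__ == "__main__":
    for rel in missing_files():
        print("missing:", rel)
```

If the models are loaded directly from the Hugging Face model hub, the entries under models/ are expected to be reported as missing.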

Requirements

This code was tested with Python 3.7 and PyTorch 1.8.1.

Install required packages and dependencies:

pip install -r requirements.txt

Install transformers from source (required for tokenizers dependencies):

pip install git+https://github.com/huggingface/transformers

Model Descriptions

Legal-BERT

Training Data

The pretraining corpus was constructed by ingesting the entire Harvard Law case corpus from 1965 to the present. The size of this corpus (37GB) is substantial, representing 3,446,187 legal decisions across all federal and state courts, and is larger than the size of the BookCorpus/Wikipedia corpus originally used to train BERT (15GB). We randomly sample 10% of decisions from this corpus as a holdout set, which we use to create the CaseHOLD dataset. The remaining 90% is used for pretraining.
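The 90/10 split described above can be illustrated with a short sketch. The function name split_holdout, the fixed seed, and the use of integer IDs are assumptions for illustration; the paper does not specify how the sampling was implemented.

```python
import random


def split_holdout(decision_ids, holdout_fraction=0.10, seed=0):
    """Randomly split decision IDs into (pretraining, holdout) lists."""
    ids = list(decision_ids)
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the split reproducible
    rng.shuffle(ids)
    n_holdout = round(len(ids) * holdout_fraction)
    return ids[n_holdout:], ids[:n_holdout]


pretrain_ids, holdout_ids = split_holdout(range(1000))
print(len(pretrain_ids), len(holdout_ids))  # 900 100
```

Keeping the two sets disjoint is what prevents holdings used in CaseHOLD from appearing in the pretraining text.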

Training Objective

This model is initialized with the base BERT model (uncased, 110M parameters), bert-base-uncased, and trained for an additional 1M steps on the MLM and NSP objectives, with tokenization and sentence segmentation adapted for legal text (cf. the paper).

Custom Legal-BERT

Training Data

Custom Legal-BERT uses the same pretraining corpus as Legal-BERT.

Training Objective

This model is pretrained from scratch for 2M steps on the MLM and NSP objectives, with tokenization and sentence segmentation adapted for legal text (cf. the paper).

The model also uses a custom domain-specific legal vocabulary. The vocabulary set is constructed using SentencePiece on a subsample (approx. 13M) of sentences from our pretraining corpus, with the number of tokens fixed to 32,000.
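For readers who want to build a comparable vocabulary, the sketch below shows how the two stated settings (32,000 tokens, a subsample of roughly 13M sentences) map onto options of the SentencePiece Python trainer. The helper names and file paths are hypothetical, and every other setting (model type, normalization, conversion to vocab.txt) is left at library defaults because it is not documented here, so this is not necessarily the configuration the authors used.

```python
def sentencepiece_train_args(corpus_path, model_prefix):
    """Keyword arguments for sentencepiece.SentencePieceTrainer.train."""
    return {
        "input": corpus_path,               # text file with one sentence per line
        "model_prefix": model_prefix,       # writes <prefix>.model and <prefix>.vocab
        "vocab_size": 32000,                # number of tokens stated above
        "input_sentence_size": 13_000_000,  # subsample of roughly 13M sentences
        "shuffle_input_sentence": True,     # draw the subsample at random
    }


def train_vocab(corpus_path, model_prefix):
    """Train a vocabulary; requires `pip install sentencepiece`."""
    import sentencepiece as spm  # imported lazily so the helper above has no dependency

    spm.SentencePieceTrainer.train(**sentencepiece_train_args(corpus_path, model_prefix))
```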

BERT (double)

Training Data

BERT (double) is pretrained using the same English Wikipedia corpus that the base BERT model (uncased, 110M parameters), bert-base-uncased, was pretrained on. For more information on the pretraining corpus, refer to the BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding paper.

Training Objective

This model is initialized with the base BERT model (uncased, 110M parameters), bert-base-uncased, and trained for an additional 1M steps on the MLM and NSP objectives.

This facilitates a direct comparison to our BERT-based models for the legal domain, Legal-BERT and Custom Legal-BERT, which are also pretrained for 2M total steps.

Legal Benchmark Task Descriptions

Overruling

We release the Overruling dataset in conjunction with Casetext, the creators of the dataset.

The Overruling dataset corresponds to the task of determining when a sentence is overruling a prior decision. This is a binary classification task, where positive examples are overruling sentences and negative examples are non-overruling sentences extracted from legal opinions. In law, an overruling sentence is a statement that nullifies a previous case decision as a precedent, by a constitutionally valid statute or a decision by the same or higher ranking court which establishes a different rule on the point of law involved. The Overruling dataset consists of 2,400 examples.

Terms of Service

We provide a link to the Terms of Service dataset, created and made publicly accessible by the authors of CLAUDETTE: an automated detector of potentially unfair clauses in online terms of service (Lippi et al., 2019).

The Terms of Service dataset corresponds to the task of identifying whether contractual terms are potentially unfair. This is a binary classification task, where positive examples are potentially unfair contractual terms (clauses) from the terms of service in consumer contracts. Article 3 of the Directive 93/13 on Unfair Terms in Consumer Contracts defines an unfair contractual term as follows. A contractual term is unfair if: (1) it has not been individually negotiated; and (2) contrary to the requirement of good faith, it causes a significant imbalance in the parties' rights and obligations, to the detriment of the consumer. The Terms of Service dataset consists of 9,414 examples.

CaseHOLD

We release the CaseHOLD dataset, created by the authors of our paper, When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset of 53,000+ Legal Holdings (Zheng and Guha et al., 2021).

The CaseHOLD dataset (Case Holdings On Legal Decisions) provides 53,000+ multiple choice questions, each consisting of a prompt from a judicial decision and multiple potential holdings that could be cited, one of which is correct. Holdings are central to the common law system. They represent the governing legal rule when the law is applied to a particular set of facts. A holding is what is precedential and what litigants can rely on in subsequent cases. The CaseHOLD task derived from the dataset is a multiple choice question answering task, with five candidate holdings (one correct, four incorrect) for each citing context.
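One way to picture a single CaseHOLD question is as a citing context, five candidate holdings, and the index of the correct one. The class and field names below are hypothetical and do not necessarily match the columns of casehold.csv; the sketch only illustrates the task structure and how accuracy over predictions could be computed.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class HoldingQuestion:
    citing_context: str  # prompt taken from a judicial decision
    holdings: List[str]  # five candidate holdings
    label: int           # index (0-4) of the correct holding

    def __post_init__(self):
        if len(self.holdings) != 5:
            raise ValueError("expected exactly five candidate holdings")
        if not 0 <= self.label < 5:
            raise ValueError("label must index one of the five holdings")


def accuracy(questions: Sequence[HoldingQuestion], predictions: Sequence[int]) -> float:
    """Fraction of questions whose predicted index matches the label."""
    if len(questions) != len(predictions):
        raise ValueError("need exactly one prediction per question")
    correct = sum(q.label == p for q, p in zip(questions, predictions))
    return correct / len(questions)
```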

For more details on the construction of these legal benchmark task datasets, please see our paper.

Hyperparameters for Downstream Tasks

We split each task dataset into a train and test set with an 80/20 split for hyperparameter tuning. For the baseline model, we performed a random search with batch sizes of 16 and 32 over learning rates in the bounded domain 1e-5 to 1e-2, training for a maximum of 20 epochs.

To set the hyperparameters for fine-tuning our BERT and Legal-BERT models, we used the suggested ranges for batch size, learning rate, and number of epochs in Devlin et al. as a reference point and performed two rounds of grid search for each task. We performed the coarse round of grid search with batch size set to 16 for Overruling and Terms of Service and batch size set to 128 for CaseHOLD, over learning rates 1e-6, 1e-5, and 1e-4, training for a maximum of 4 epochs. From the coarse round, we discovered that the optimal learning rates for the legal benchmark tasks were smaller than the lower end of the range suggested in Devlin et al., so we performed a finer round of grid search over a range that included smaller learning rates.

For Overruling and Terms of Service, we performed the finer round of grid search over batch sizes (16, 32) and learning rates (5e-6, 1e-5, 2e-5, 3e-5, 5e-5), training for a maximum of 4 epochs. For CaseHOLD, we performed the finer round of grid search with batch size set to 128 over learning rates (1e-6, 3e-6, 5e-6, 7e-6, 9e-6), training for a maximum of 4 epochs. We report the hyperparameters used for evaluation in the table below.
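The finer grid-search round can be enumerated as follows. The batch sizes, learning rates, and epoch limit are the ones stated above; the dictionary keys and the helper grid_configs are hypothetical names, and the training call itself is omitted.

```python
from itertools import product

SMALL_TASK_GRID = {
    "batch_size": (16, 32),
    "learning_rate": (5e-6, 1e-5, 2e-5, 3e-5, 5e-5),
}
FINE_GRIDS = {
    "overruling": SMALL_TASK_GRID,
    "terms_of_service": SMALL_TASK_GRID,
    "casehold": {
        "batch_size": (128,),
        "learning_rate": (1e-6, 3e-6, 5e-6, 7e-6, 9e-6),
    },
}
MAX_EPOCHS = 4


def grid_configs(task):
    """Return every (batch size, learning rate) combination for a task."""
    grid = FINE_GRIDS[task]
    return [
        {"batch_size": b, "learning_rate": lr, "max_epochs": MAX_EPOCHS}
        for b, lr in product(grid["batch_size"], grid["learning_rate"])
    ]


for task in FINE_GRIDS:
    print(task, len(grid_configs(task)))
# overruling 10
# terms_of_service 10
# casehold 5
```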

Hyperparameter Table

Results

The results from the paper for the baseline BiLSTM, base BERT model (uncased, 110M parameters), BERT (double), Legal-BERT, and Custom Legal-BERT, finetuned on the legal benchmark tasks, are displayed below.

Demo

demo.ipynb provides examples of how to run the scripts to compute pretrain loss and finetune Legal-BERT/Custom Legal-BERT models on the legal benchmark tasks. These examples should be able to run on a GPU that has 16GB of RAM using the hyperparameters specified in the examples.

See demo.ipynb for details on calculating domain specificity (DS) scores for tasks or task examples by taking the difference in pretrain loss on BERT (double) and Legal-BERT. DS score may be readily extended to estimate domain specificity of tasks in other domains using BERT (double) and existing pretrained models (e.g., SciBERT).
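As a minimal sketch of the DS score arithmetic, assuming the pretrain losses have already been computed per example: the function names are hypothetical, and both the sign convention (BERT (double) loss minus Legal-BERT loss) and the averaging over examples for a task-level score are assumptions; demo.ipynb remains the reference.

```python
from statistics import mean


def ds_score(bert_double_loss, legalbert_loss):
    """Difference in pretrain loss for one example (assumed sign convention)."""
    return bert_double_loss - legalbert_loss


def task_ds_score(bert_double_losses, legalbert_losses):
    """Average the per-example DS scores over a task (assumed aggregation)."""
    if len(bert_double_losses) != len(legalbert_losses):
        raise ValueError("loss lists must have the same length")
    return mean(ds_score(b, l) for b, l in zip(bert_double_losses, legalbert_losses))
```

Under this convention, a larger score means Legal-BERT models the text better than BERT (double), which would suggest a more domain-specific example or task.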

Citation

If you are using this work, please cite it as:

@inproceedings{zhengguha2021,
	title={When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset},
	author={Lucia Zheng and Neel Guha and Brandon R. Anderson and Peter Henderson and Daniel E. Ho},
	year={2021},
	eprint={2104.08671},
	archivePrefix={arXiv},
	primaryClass={cs.CL},
	booktitle={Proceedings of the 18th International Conference on Artificial Intelligence and Law},
	publisher={Association for Computing Machinery},
	note={(in press)}
}

Lucia Zheng, Neel Guha, Brandon R. Anderson, Peter Henderson, and Daniel E. Ho. 2021. When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset. In Proceedings of the 18th International Conference on Artificial Intelligence and Law (ICAIL '21), June 21-25, 2021, São Paulo, Brazil. ACM Inc., New York, NY, (in press). arXiv: 2104.08671 [cs.CL].

Owner
RegLab