Legal text retrieval for Python

Last update: Dec 06, 2022

legal-text-retrieval

Overview

This system has two steps:

1. Generate training data in which the negative samples are selected with a mixture score of cosine(tfidf) + bm25, taken from the 150 law articles most similar to each question.
2. Fine-tune the PhoBERT model (and, optionally, the NlpHUST model) on the generated data.
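
The negative-sampling idea in the first step can be sketched as follows. This is a minimal, standard-library illustration and not the code in src/data_generator.py: the function names, the whitespace tokenizer (standing in for VnCoreNLP), the equal weighting `alpha=0.5`, and the max-normalization of BM25 scores are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace split stands in for the VnCoreNLP word segmenter.
    return text.lower().split()


def make_tfidf(docs):
    # Build a function mapping a token list to a sparse TF-IDF vector (dict).
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return lambda tokens: {t: f * idf[t] for t, f in Counter(tokens).items() if t in idf}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def bm25_scores(query, docs, k1=1.5, b=0.75):
    # Okapi BM25 of the query against every document (docs must be non-empty).
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in set(query):
            if t in tf:
                idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
                s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores


def top_k_candidates(question, articles, k=150, alpha=0.5):
    # Rank articles by alpha * cosine(tfidf) + (1 - alpha) * normalized BM25.
    docs = [tokenize(a) for a in articles]
    q = tokenize(question)
    vec = make_tfidf(docs)
    qv = vec(q)
    cos = [cosine(qv, vec(d)) for d in docs]
    bm = bm25_scores(q, docs)
    top = max(bm) or 1.0  # avoid dividing by zero when nothing matches
    mixed = [alpha * c + (1 - alpha) * s / top for c, s in zip(cos, bm)]
    ranked = sorted(range(len(articles)), key=lambda i: mixed[i], reverse=True)
    return ranked[:k]


def hard_negatives(question, articles, relevant_ids, k=150):
    # Top-k candidates that are NOT labeled relevant become negative samples.
    return [i for i in top_k_candidates(question, articles, k) if i not in relevant_ids]
```

Calling `hard_negatives(question, articles, {gold_index})` returns the indices of high-scoring but non-relevant articles. These are harder negatives than randomly drawn articles, which is the usual motivation for retrieving candidates before fine-tuning.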

Environments

git clone https://github.com/vncorenlp/VnCoreNLP.git vncorenlp_data # for the VnCoreNLP tokenizer library

conda create -n legal_retrieval_env python=3.8
conda activate legal_retrieval_env
pip install -r requirements.txt

Run

Generate data from the folder data/zac2021-ltr-data/, which contains public_test_question.json and train_question_answer.json:
```
python3 src/data_generator.py --path_folder_base data/zac2021-ltr-data/ --test_file public_test_question.json --topk 150  --tok --path_output_dir data/zalo-tfidfbm25150-full
```
Note:
- --test_file public_test_question.json is optional. If it is omitted, the test set is a random 33% of train_question_answer.json.
- --path_output_dir is the folder where the 3 output files (train.csv, dev.csv, test.csv) and the tfidf classifier (tfidf_classifier.pkl) used to find the top k most relevant documents are saved.
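
The fallback split described in the first note can be illustrated with a short sketch. It only shows what a random 33% hold-out looks like; the function name `split_questions`, the fixed seed, and the assumption that the questions are available as a plain list are hypothetical and not taken from src/data_generator.py.

```python
import random


def split_questions(items, test_ratio=0.33, seed=42):
    # Shuffle a copy so the caller's list is left untouched.
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # The first test_ratio share becomes the test set, the rest is for training.
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


train_items, test_items = split_questions(range(100))
```

A fixed seed keeps the split reproducible between runs, which matters when results from different model checkpoints are compared on the same held-out questions.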

Train model

bash scripts/run_finetune_bert.sh "magic"  vinai/phobert-base  ../  data/zalo-tfidfbm25150-full Tfbm150E5-full 5

Predict
```
python3 src/infer.py 
```
Note: This script loads the model and runs prediction. To change its settings, edit the variable model_configs in src/infer.py.

License

MIT-licensed.

Citation

Please cite as:

@article{DBLP:journals/corr/abs-2106-13405,
  author    = {Ha{-}Thanh Nguyen and
               Phuong Minh Nguyen and
               Thi{-}Hai{-}Yen Vuong and
               Quan Minh Bui and
               Chau Minh Nguyen and
               Tran Binh Dang and
               Vu Tran and
               Minh Le Nguyen and
               Ken Satoh},
  title     = {{JNLP} Team: Deep Learning Approaches for Legal Processing Tasks in
               {COLIEE} 2021},
  journal   = {CoRR},
  volume    = {abs/2106.13405},
  year      = {2021},
  url       = {https://arxiv.org/abs/2106.13405},
  eprinttype = {arXiv},
  eprint    = {2106.13405},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2106-13405.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

@article{DBLP:journals/corr/abs-2011-08071,
  author    = {Ha{-}Thanh Nguyen and
               Hai{-}Yen Thi Vuong and
               Phuong Minh Nguyen and
               Tran Binh Dang and
               Quan Minh Bui and
               Vu Trong Sinh and
               Chau Minh Nguyen and
               Vu D. Tran and
               Ken Satoh and
               Minh Le Nguyen},
  title     = {{JNLP} Team: Deep Learning for Legal Processing in {COLIEE} 2020},
  journal   = {CoRR},
  volume    = {abs/2011.08071},
  year      = {2020},
  url       = {https://arxiv.org/abs/2011.08071},
  eprinttype = {arXiv},
  eprint    = {2011.08071},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2011-08071.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Owner

Nguyễn Minh Phương
