Korean extractive summarization. Code from team 화성갈끄니까 for the 2021 AI Text Summarization Online Hackathon.

Last update: Aug 10, 2022

Overview

Korean extractive summarization

Leaderboard

Notice

The model follows the BertSumExt architecture from "Text Summarization with Pretrained Encoders" (BERT with additional inter-sentence layers stacked on top for extractive summarization), with BERT replaced by klue/roberta-large.
Built on uoneway's KoBertSum repository.
Changed: updated from PyTorch 1.1 to support PyTorch 1.7.1.
Changed: updated to support transformers 4.0; ported klue/roberta-large.
Changed: removed or modified unnecessary parts.
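
For readers unfamiliar with BertSumExt, the sketch below illustrates the input layout described in the paper: every sentence gets its own [CLS] token, segment ids alternate between sentences, and the encoder outputs at the [CLS] positions are what the inter-sentence layers score. This is a minimal sketch, not this repository's preprocessing code. The function name `build_ext_input` and the literal token strings are assumptions for illustration, and whether segment ids are actually used with klue/roberta-large is not stated here.

```python
from typing import List, Tuple

CLS, SEP = "[CLS]", "[SEP]"  # special-token strings assumed for illustration


def build_ext_input(sentences: List[List[str]]) -> Tuple[List[str], List[int], List[int]]:
    """Flatten tokenized sentences into one sequence with a [CLS] per sentence."""
    tokens: List[str] = []
    segment_ids: List[int] = []
    cls_positions: List[int] = []
    for i, sent in enumerate(sentences):
        cls_positions.append(len(tokens))          # index of this sentence's [CLS]
        piece = [CLS] + sent + [SEP]
        tokens.extend(piece)
        segment_ids.extend([i % 2] * len(piece))   # interval segments: 0, 1, 0, ...
    return tokens, segment_ids, cls_positions


# Three toy "sentences" that are already split into tokens.
tokens, segment_ids, cls_positions = build_ext_input([["a", "b"], ["c"], ["d", "e", "f"]])
print(cls_positions)
```

The encoder's hidden states at `cls_positions` serve as sentence representations; the inter-sentence layers then produce one score per sentence.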

Process

Environment Setting

pip install -r requirements.txt
python src/others/install_mecab.py # install MeCab

Preprocess (requires ./ext/data/raw/train.jsonl and ./ext/data/raw/test.jsonl)

python main.py -task make_data -n_cpus 5
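
Since preprocessing expects both raw files to be in place, a quick check before launching it can save a failed run. The helper below is a minimal sketch and not part of the repository; `missing_raw_files` is a hypothetical name, and only the two file paths come from the text above.

```python
from pathlib import Path
from typing import List

# File names taken from the preprocessing requirement above.
REQUIRED_RAW_FILES = ("train.jsonl", "test.jsonl")


def missing_raw_files(raw_dir: str = "./ext/data/raw") -> List[str]:
    """Return the required raw file names that are absent from raw_dir."""
    raw = Path(raw_dir)
    return [name for name in REQUIRED_RAW_FILES if not (raw / name).is_file()]


if __name__ == "__main__":
    missing = missing_raw_files()
    if missing:
        print("Missing raw files:", ", ".join(missing))
    else:
        print("Raw data found; ready for -task make_data")
```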

Train

python main.py -task train -target_summary_sent abs -visible_gpus 0

Validation (runs validation on every model file under the given path)

python main.py -task valid -model_path 1209_1236
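
To see which checkpoints such a validation run would cover, the sketch below lists checkpoint files in a directory ordered by training step. It assumes checkpoints are named `model_step_<N>.pt`, which is inferred from the test command in this README. `list_checkpoints` is a hypothetical helper, and the directory that holds the `1209_1236` folder is not specified here, so the caller passes the full path.

```python
import re
from pathlib import Path
from typing import List, Tuple

# Naming pattern assumed from "1209_1236/model_step_500.pt".
STEP_PATTERN = re.compile(r"^model_step_(\d+)\.pt$")


def list_checkpoints(model_dir: str) -> List[Tuple[int, Path]]:
    """Return (step, path) pairs for every checkpoint in model_dir, sorted by step."""
    found = []
    for path in Path(model_dir).iterdir():
        match = STEP_PATTERN.match(path.name)
        if match:
            found.append((int(match.group(1)), path))
    return sorted(found, key=lambda pair: pair[0])
```

Sorting by the integer step rather than by file name keeps `model_step_1000.pt` after `model_step_500.pt`.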

Test and submission file generation

python main.py -task test -test_from 1209_1236/model_step_500.pt -visible_gpus 0
cd ext/results/
python get_submission.py -filename result_1209_1236_step_500.candidate.jsonl

Not included

In the competition, ensembling raised ROUGE-L from 53.15 to 53.5. It is simple to implement, so anyone who needs it can add it and will likely see a performance gain.
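
The ensemble method used in the competition is not described, so the following is only one plausible approach: average the per-sentence scores from several models and keep the top three sentences. The function name `ensemble_top_k`, the score values, and the choice to return indices in document order are all assumptions for illustration.

```python
from typing import List, Sequence


def ensemble_top_k(score_lists: Sequence[Sequence[float]], k: int = 3) -> List[int]:
    """Average sentence scores across models and return the top-k sentence indices."""
    if not score_lists:
        raise ValueError("need scores from at least one model")
    n_sents = len(score_lists[0])
    if any(len(scores) != n_sents for scores in score_lists):
        raise ValueError("all models must score the same number of sentences")
    # Mean score per sentence across models.
    mean = [sum(column) / len(score_lists) for column in zip(*score_lists)]
    ranked = sorted(range(n_sents), key=lambda i: mean[i], reverse=True)
    return sorted(ranked[:k])  # document order (an assumption)


# Made-up scores for a five-sentence article from two hypothetical models.
model_a = [0.1, 0.9, 0.3, 0.8, 0.2]
model_b = [0.2, 0.4, 0.9, 0.7, 0.1]
selected = ensemble_top_k([model_a, model_b])
print(selected)
```

Other schemes, such as majority voting over each model's selected sentences, would fit the same interface.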
In addition, each line of the jsonl dataset has the following form (three-sentence summary dataset):

{"category": "none", "id": 0, "article_original": ["","","","",""], "extractive": [2, 3, 4], "abstractive": "", "extractive_sents": ["", "", ""]}
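
The sketch below shows how one such record could be read and checked. It assumes that `extractive` holds 0-based indices into `article_original` and that `extractive_sents` contains the sentences at those indices; neither is stated explicitly above. `extractive_sentences` is a hypothetical helper and the sentence strings are placeholders.

```python
import json
from typing import Any, Dict, List


def extractive_sentences(record: Dict[str, Any]) -> List[str]:
    """Look up the sentences that the 'extractive' indices point to."""
    article = record["article_original"]
    indices = record["extractive"]
    bad = [i for i in indices if not 0 <= i < len(article)]
    if bad:
        raise ValueError(f"extractive indices out of range: {bad}")
    return [article[i] for i in indices]


# One jsonl line in the format above, with placeholder sentences.
line = json.dumps({
    "category": "none",
    "id": 0,
    "article_original": ["s0", "s1", "s2", "s3", "s4"],
    "extractive": [2, 3, 4],
    "abstractive": "",
    "extractive_sents": ["s2", "s3", "s4"],
})
record = json.loads(line)
print(extractive_sentences(record) == record["extractive_sents"])
```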
