Korean extractive summarization. 2021 AI 텍스트 요약 온라인 해커톤 화성갈끄니까팀 코드

Overview

korean extractive summarization

2021 AI 텍스트 요약 온라인 해커톤 화성갈끄니까팀 코드

Leaderboard

1

Notice

  • Text Summarization with Pretrained Encoders에 나오는 bertsumext모델(extractive summarization을 위해 bert위에 추가적으로 inter-sentence 레이어를 얹은구조)의 bert를 klue/roberta-large모델로 대체하여 구성하였음.
  • uoneway님 KoBertSum 레포지토리를 기반으로 만들어짐.
  • 수정된 부분 - pytorch 1.1 ->pytorch 1.7.1버전 지원하도록 수정.
  • 수정된 부분 - transformers 4.0 버전 지원하도록 수정, klue/roberta-large 포팅
  • 수정된 부분 - 불필요한 부분 삭제 or 수정

Process

  1. Environment Setting
pip install -r requirements.txt
python src/others/install_mecab.py # mecab설치
  1. Preprocess( ./ext/data/raw/train.jsonl, ./ext/data/raw/test.jsonl이 있어야 함)
python main.py -task make_data -n_cpus 5
  1. Train
python main.py -task train -target_summary_sent abs -visible_gpus 0
  1. Validation(path에 있는 모델파일 전부 validation하는 코드임.)
python main.py -task valid -model_path 1209_1236
  1. Test and submission 파일 생성
python main.py -task test -test_from 1209_1236/model_step_500.pt -visible_gpus 0
cd ext/results/
python get_submission.py -filename result_1209_1236_step_500.candidate.jsonl

포함되지 않은 부분

  • 대회에선, ensemble 이용해서 rouge-L 53.15 -> 53.5 으로 끌어올렸는데, 간단하니까 필요하신 분들은 구현해서 사용하시면 성능향상에 도움이 될 듯.

  • 추가로 데이터셋 폼(jsonl각 line)은 이렇게 구성됨(세줄요약 데이터셋)

{"category": "none", "id": 0, "article_original": ["","","","",""], "extractive": [2, 3, 4], "abstractive": "", "extractive_sents": ["", "", ""]}

Reference

Owner
NLP, ML research
In this workshop we will be exploring NLP state of the art transformers, with SOTA models like T5 and BERT, then build a model using HugginFace transformers framework.

Transformers are all you need In this workshop we will be exploring NLP state of the art transformers, with SOTA models like T5 and BERT, then build a

Aymen Berriche 8 Apr 13, 2022
Transformation spoken text to written text

Transformation spoken text to written text This model is used for formatting raw asr text output from spoken text to written text (Eg. date, number, i

Nguyen Binh 16 Dec 28, 2022
Accurately generate all possible forms of an English word e.g "election" --> "elect", "electoral", "electorate" etc.

Accurately generate all possible forms of an English word Word forms can accurately generate all possible forms of an English word. It can conjugate v

Dibya Chakravorty 570 Dec 31, 2022
Tool to add main subject to items on Wikidata using a WMFs CirrusSearch for named entity recognition or a manually supplied list of QIDs

ItemSubjector Tool made to add main subject statements to items based on the title using a home-brewed CirrusSearch-based Named Entity Recognition alg

Dennis Priskorn 9 Nov 17, 2022
PyTorch implementation of Microsoft's text-to-speech system FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.

An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"

Chung-Ming Chien 1k Dec 30, 2022
To classify the News into Real/Fake using Features from the Text Content of the article

Hoax-Detector Authenticity of news has now become a major problem. The Idea is to classify the News into Real/Fake using Features from the Text Conten

Aravindhan 1 Feb 09, 2022
Deduplication is the task to combine different representations of the same real world entity.

Deduplication is the task to combine different representations of the same real world entity. This package implements deduplication using active learning. Active learning allows for rapid training wi

63 Nov 17, 2022
Codes to pre-train Japanese T5 models

t5-japanese Codes to pre-train a T5 (Text-to-Text Transfer Transformer) model pre-trained on Japanese web texts. The model is available at https://hug

Megagon Labs 37 Dec 25, 2022
An Open-Source Package for Neural Relation Extraction (NRE)

OpenNRE We have a DEMO website (http://opennre.thunlp.ai/). Try it out! OpenNRE is an open-source and extensible toolkit that provides a unified frame

THUNLP 3.9k Jan 03, 2023
PG-19 Language Modelling Benchmark

PG-19 Language Modelling Benchmark This repository contains the PG-19 language modeling benchmark. It includes a set of books extracted from the Proje

DeepMind 161 Oct 30, 2022
Code for paper "Role-oriented Network Embedding Based on Adversarial Learning between Higher-order and Local Features"

Role-oriented Network Embedding Based on Adversarial Learning between Higher-order and Local Features Train python main.py --dataset brazil-flights C

wang zhang 0 Jun 28, 2022
A simple implementation of N-gram language model.

About A simple implementation of N-gram language model. Requirements numpy Data preparation Corpus Training data for the N-gram model, a text file lik

4 Nov 24, 2021
Phrase-Based & Neural Unsupervised Machine Translation

Unsupervised Machine Translation This repository contains the original implementation of the unsupervised PBSMT and NMT models presented in Phrase-Bas

Facebook Research 1.5k Dec 28, 2022
Application for shadowing Chinese.

chinese-shadowing Simple APP for shadowing chinese. With this application, it is very easy to record yourself, play the sound recorded and listen to s

Thomas Hirtz 5 Sep 06, 2022
This script just scrapes the most recent Nepali news from Kathmandu Post and notifies the user about current events at regular intervals.It sends out the most recent news at random!

Nepali-news-notifier This script just scrapes the most recent Nepali news from Kathmandu Post and notifies the user about current events at regular in

Sachit Yadav 1 Feb 11, 2022
NAACL 2022: MCSE: Multimodal Contrastive Learning of Sentence Embeddings

MCSE: Multimodal Contrastive Learning of Sentence Embeddings This repository contains code and pre-trained models for our NAACL-2022 paper MCSE: Multi

Saarland University Spoken Language Systems Group 39 Nov 15, 2022
Need: Image Search With Python

Need: Image Search The problem is that a user needs to search for a specific ima

Surya Komandooru 1 Dec 30, 2021
LCG T-TEST USING EUCLIDEAN METHOD

This project has been created for statistical usage, purposing for determining ATL takers and nontakers using LCG ttest and Euclidean Method, especially for internal business case in Telkomsel.

2 Jan 21, 2022
SEJE is a prototype for the paper Learning Text-Image Joint Embedding for Efficient Cross-Modal Retrieval with Deep Feature Engineering.

SEJE is a prototype for the paper Learning Text-Image Joint Embedding for Efficient Cross-Modal Retrieval with Deep Feature Engineering. Contents Inst

0 Oct 21, 2021
Parrot is a paraphrase based utterance augmentation framework purpose built to accelerate training NLU models

Parrot is a paraphrase based utterance augmentation framework purpose built to accelerate training NLU models. A paraphrase framework is more than just a paraphrasing model.

Prithivida 681 Jan 01, 2023