To encode text longer than 512 tokens (for example, 800 tokens), set max_pos to 800 during both preprocessing and training.

Last update: Nov 02, 2022

Related tags

Deep Learning SciBERTSUM

Overview

LongScientificFormer

To encode text longer than 512 tokens (for example, 800 tokens), set max_pos to 800 during both preprocessing and training.
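Standard BERT checkpoints come with 512 learned position embeddings, so a larger max_pos needs a longer position table. The following is a minimal sketch of one common way to build it, assuming (this is not confirmed by this repository) that the pretrained rows are copied and the last pretrained row is repeated for the extra positions. `extend_position_table` is a hypothetical helper, and plain Python lists stand in for the embedding tensor.

```python
def extend_position_table(table, max_pos):
    """Return a copy of `table` with exactly `max_pos` rows.

    Hypothetical helper: rows beyond the pretrained length are filled with
    copies of the last pretrained row (an assumption, not this repo's code).
    """
    if max_pos <= len(table):
        # Nothing to extend; keep only the first max_pos rows.
        return [row[:] for row in table[:max_pos]]
    last = table[-1]
    extra = [last[:] for _ in range(max_pos - len(table))]
    return [row[:] for row in table] + extra


# Toy table: 512 positions with 4-dimensional embeddings.
pretrained = [[float(i)] * 4 for i in range(512)]
extended = extend_position_table(pretrained, 800)
print(len(extended), extended[799] == pretrained[511])
```

Because the table size is baked into both the preprocessed inputs and the model, the same max_pos value has to be used in both stages.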

Some code is borrowed from ONMT (https://github.com/OpenNMT/OpenNMT-py).

Data Preparation

Option 1: Download the processed data

Pre-processed data

Put all files into the raw_data directory.

Step 2. Download Stanford CoreNLP

Stanford CoreNLP is needed to tokenize the data. Download it here and unzip it. Then add the following line to your bash_profile:

export CLASSPATH=/path/to/stanford-corenlp-4.2.2/stanford-corenlp-4.2.2.jar

Replace /path/to/ with the path where you saved the stanford-corenlp-4.2.2 directory.
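A wrong CLASSPATH is easy to miss until tokenization fails, so a quick check can help. This is a small sketch, not part of the repository: `corenlp_jars` is a hypothetical helper that lists CLASSPATH entries whose file name starts with `stanford-corenlp` and ends in `.jar`, and the script then reports whether each one exists on disk.

```python
import os


def corenlp_jars(classpath):
    """Return CLASSPATH entries whose file name looks like a CoreNLP jar."""
    entries = [e for e in classpath.split(os.pathsep) if e]
    return [
        e for e in entries
        if os.path.basename(e).startswith("stanford-corenlp")
        and e.endswith(".jar")
    ]


if __name__ == "__main__":
    jars = corenlp_jars(os.environ.get("CLASSPATH", ""))
    if not jars:
        print("CLASSPATH has no stanford-corenlp jar entry")
    for jar in jars:
        status = "found" if os.path.isfile(jar) else "missing on disk"
        print(f"{jar}: {status}")
```

Run it in a new shell after editing bash_profile so that the exported value is picked up.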

Step 3. Extract sections from GROBID XML files

python preprocess.py -mode extract_pdf_sections -log_file ../logs/extract_section.log

Step 4. Extract text from TIKA XML files

python preprocess.py -mode get_text_clean_tika -log_file ../logs/extract_tika_text.log

Step 5. Tokenize texts from papers and slides using Stanford CoreNLP

python preprocess.py -mode tokenize  -save_path ../temp -log_file ../logs/tokenize_by_corenlp.log

Step 6. Extract source, section, and target from tokenized files

python preprocess.py -mode clean_paper_jsons -save_path ../json_data/  -n_cpus 10 -log_file ../logs/build_json.log

Step 7. Generate BERT `.pt` files from source, sections and targets

python preprocess.py -mode format_to_bert -raw_path ../json_data/ -save_path ../bert_data  -lower -n_cpus 40 -log_file ../logs/build_bert_files.log
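Steps 3 to 7 must run in order, and each one consumes the output of the previous step. The sketch below chains them with `subprocess`, using the commands exactly as listed above. `PIPELINE_STEPS` and `run_pipeline` are hypothetical names, and the script assumes it is started from the directory that contains preprocess.py.

```python
import shlex
import subprocess

# Commands copied from steps 3-7 above, in the order they must run.
PIPELINE_STEPS = [
    "python preprocess.py -mode extract_pdf_sections -log_file ../logs/extract_section.log",
    "python preprocess.py -mode get_text_clean_tika -log_file ../logs/extract_tika_text.log",
    "python preprocess.py -mode tokenize -save_path ../temp -log_file ../logs/tokenize_by_corenlp.log",
    "python preprocess.py -mode clean_paper_jsons -save_path ../json_data/ -n_cpus 10 -log_file ../logs/build_json.log",
    "python preprocess.py -mode format_to_bert -raw_path ../json_data/ -save_path ../bert_data -lower -n_cpus 40 -log_file ../logs/build_bert_files.log",
]


def run_pipeline(steps=PIPELINE_STEPS, dry_run=True):
    """Print (dry run) or execute each step in order.

    With dry_run=False, check=True makes subprocess raise on the first
    non-zero exit code, so later steps never run on incomplete input.
    """
    commands = [shlex.split(step) for step in steps]
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)
    return commands


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

Switch to `run_pipeline(dry_run=False)` once the printed commands look right for your directory layout.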

Model Training

First run: The first time you train, use a single GPU so that the code can download the BERT model. Use -visible_gpus -1; after the download finishes, you can kill the process and rerun the code with multiple GPUs.

Train

python train.py  -ext_dropout 0.1 -lr 2e-3  -visible_gpus 1,2,3 -report_every 200 -save_checkpoint_steps 1000 -batch_size 1 -train_steps 100000 -accum_count 2  -log_file ../logs/ext_bert -use_interval true -warmup_steps 10000
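The training command pairs -lr 2e-3 with -warmup_steps 10000. The sketch below shows how such a pair behaves under a warmup-then-inverse-square-root schedule, which is common in BERT-based summarizers. Whether train.py uses exactly this formula is an assumption, and `noam_lr` is a hypothetical name.

```python
def noam_lr(step, base_lr=2e-3, warmup_steps=10000):
    """Learning rate at `step` (1-based).

    Assumed schedule: linear warmup for `warmup_steps` steps, then decay
    proportional to step ** -0.5. Not taken from this repository's code.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return base_lr * min(step ** -0.5, step * warmup_steps ** -1.5)


# The rate rises until step 10000 and falls afterwards.
for step in (1, 1000, 10000, 100000):
    print(step, noam_lr(step))
```

Under this assumption the effective rate stays far below the nominal 2e-3, which is why the flag value looks large for fine-tuning BERT.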

To continue training from a checkpoint

python train.py  -ext_dropout 0.1 -lr 2e-3  -train_from ../models/model_step_99000.pt -visible_gpus 1,2,3 -report_every 200 -save_checkpoint_steps 1000 -batch_size 1 -train_steps 100000 -accum_count 2  -log_file ../logs/ext_bert -use_interval true -warmup_steps 10000
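With -save_checkpoint_steps 1000, the models directory fills up with files named like model_step_99000.pt. A small sketch for choosing the value of -train_from follows. `latest_checkpoint` is a hypothetical helper, and it assumes every checkpoint follows the model_step_<N>.pt pattern seen in the command above.

```python
import glob
import re

_STEP_RE = re.compile(r"model_step_(\d+)\.pt$")


def latest_checkpoint(paths):
    """Return the path with the highest step number, or None if none match."""
    best_path, best_step = None, -1
    for path in paths:
        match = _STEP_RE.search(path)
        if match and int(match.group(1)) > best_step:
            best_path, best_step = path, int(match.group(1))
    return best_path


if __name__ == "__main__":
    print(latest_checkpoint(glob.glob("../models/model_step_*.pt")))
```

Comparing the parsed step numbers avoids the trap of sorting file names as strings, where model_step_9000.pt would sort after model_step_10000.pt.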

Test

python train.py -mode test  -test_batch_size 1 -bert_data_path ../bert_data -log_file ../logs/ext_bert_test -test_from ../models/model_step_99000.pt -model_path ../models -sep_optim true -use_interval true -visible_gpus 1,2,3 -alpha 0.95 -result_path ../results/ext

Owner

Athar Sefid
