GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training

Overview

Code and model from our AAAI 2021 paper, Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training.

Updates

[2021/02/05] Added support for running the model on your own databases and queries. Check out the notebook.

Abstract

Recently, there has been significant interest in learning contextual representations for various NLP tasks by leveraging large-scale text corpora to train large neural language models with self-supervised learning objectives, such as Masked Language Modeling (MLM). However, based on a pilot study, we observe three issues with existing general-purpose language models when they are applied to text-to-SQL semantic parsers: they fail to detect column mentions in the utterances, fail to infer column mentions from cell values, and fail to compose complex SQL queries. To mitigate these issues, we present a model pre-training framework, Generation-Augmented Pre-training (GAP), that jointly learns representations of natural language utterances and table schemas by leveraging generation models to generate pre-training data. GAP MODEL is trained on 2M utterance-schema pairs and 30K utterance-schema-SQL triples, whose utterances are produced by generative models. Based on experimental results, neural semantic parsers that leverage GAP MODEL as a representation encoder obtain new state-of-the-art results on both the SPIDER and CRITERIA-TO-SQL benchmarks.

Setup

conda create --name gap-text2sql python=3.7
source activate gap-text2sql
conda install pytorch=1.5 cudatoolkit=10.2 -c pytorch
pip install -r requirements.txt
python -c "import nltk; nltk.download('stopwords'); nltk.download('punkt')"

Download the dataset

pip install gdown
cd rat-sql-gap
gdown --id 1_AckYkinAnhqmRQtGsQgUKAnTHxxX5J0
unzip spider.zip
bash data/spider/generate.sh ./spider
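
To verify the download unpacked correctly, you can count the schemas and training examples; tables.json and train_spider.json in the official Spider release are plain JSON arrays:

python -c "import json; print(len(json.load(open('spider/tables.json'))), 'schemas,', len(json.load(open('spider/train_spider.json'))), 'training examples')"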

Build dataset directory

mkdir data/spider-bart
cp ./spider/tables.json data/spider-bart/
cp ./spider/train_spider.json data/spider-bart/
cp ./spider/train_others.json data/spider-bart/
cp ./spider/dev.json data/spider-bart/
ln -s $(pwd)/spider/database data/spider-bart/database
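
At this point, data/spider-bart should contain the four JSON files plus the database symlink; a quick listing confirms the layout before preprocessing:

ls -l data/spider-bart/
# expected: database (symlink), dev.json, tables.json, train_others.json, train_spider.json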

Download the Stanford CoreNLP library

mkdir third_party
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip
unzip stanford-corenlp-full-2018-10-05.zip -d third_party/

Start the Stanford CoreNLP server

pushd third_party/stanford-corenlp-full-2018-10-05
nohup java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 8999 -timeout 15000 > server.log &
popd
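
To confirm the server is up before preprocessing, you can send it a small test request; the query-string properties format below is standard for the CoreNLP server, and port 8999 matches the launch command above:

curl --data 'The quick brown fox jumped over the lazy dog.' 'http://localhost:8999/?properties={"annotators":"tokenize","outputFormat":"json"}'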

Download the checkpoint

mkdir -p logdir/bart_run_1/bs\=12\,lr\=1.0e-04\,bert_lr\=1.0e-05\,end_lr\=0e0\,att\=1/
mkdir ie_dirs
aws s3 cp s3://gap-text2sql-public/checkpoint-artifacts/gap-finetuned-checkpoint logdir/bart_run_1/bs\=12\,lr\=1.0e-04\,bert_lr\=1.0e-05\,end_lr\=0e0\,att\=1/model_checkpoint-00041000

mkdir -p pretrained_checkpoint
aws s3 cp s3://gap-text2sql-public/checkpoint-artifacts/pretrained-checkpoint pretrained_checkpoint/pytorch_model.bin

Alternatively, if you don't have awscli, you can download both checkpoints directly over HTTPS:

curl https://gap-text2sql-public.s3.amazonaws.com/checkpoint-artifacts/gap-finetuned-checkpoint -o logdir/bart_run_1/bs\=12\,lr\=1.0e-04\,bert_lr\=1.0e-05\,end_lr\=0e0\,att\=1/model_checkpoint-00041000
curl https://gap-text2sql-public.s3.amazonaws.com/checkpoint-artifacts/pretrained-checkpoint -o pretrained_checkpoint/pytorch_model.bin
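
Whichever route you use, you can sanity-check the downloads by loading them with PyTorch. This is a minimal check, assuming the files are standard PyTorch checkpoints; map_location='cpu' avoids needing a GPU:

python -c "import torch; sd = torch.load('pretrained_checkpoint/pytorch_model.bin', map_location='cpu'); print(len(sd), 'entries')"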

Preprocess dataset

python run.py preprocess experiments/spider-configs/gap-run.jsonnet

Inference

python run.py eval experiments/spider-configs/gap-run.jsonnet

The inference results and the evaluation results are then written to ie_dirs/bart_run_1_true_1-step41000.infer and ie_dirs/bart_run_1_true_1-step41000.eval, respectively.
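
To inspect the predictions programmatically, here is a minimal sketch. It assumes the .infer file follows the convention of the rat-sql codebase this repository builds on: one JSON object per line, with the top prediction under beams[0]['inferred_code']. Adjust the field names if your version differs:

python -c "
import json
for line in open('ie_dirs/bart_run_1_true_1-step41000.infer'):
    ex = json.loads(line)
    if ex.get('beams'):
        print(ex['beams'][0]['inferred_code'])
"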

Training

python run.py train experiments/spider-configs/gap-run.jsonnet

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.
