PICARD - Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models

This is the official implementation of the following paper:

Torsten Scholak, Nathan Schucher, Dzmitry Bahdanau. PICARD - Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP).

If you use this code, please cite:

@inproceedings{Scholak2021:PICARD,
  author = {Torsten Scholak and Nathan Schucher and Dzmitry Bahdanau},
  title = {PICARD - Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models},
  booktitle = {Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
  year = {2021},
  publisher = {Association for Computational Linguistics},
}

Overview

This code implements:

  • The PICARD algorithm for constrained decoding from language models.
  • A text-to-SQL semantic parser based on pre-trained sequence-to-sequence models and PICARD achieving state-of-the-art performance on both the Spider and the CoSQL datasets.

About PICARD

TL;DR: We introduce PICARD -- a new method for simple and effective constrained decoding from large pre-trained language models. On the challenging Spider and CoSQL text-to-SQL datasets, PICARD significantly improves the performance of fine-tuned but otherwise unmodified T5 models. Using PICARD, our T5-3B models achieved state-of-the-art performance on both Spider and CoSQL.

In text-to-SQL translation, the goal is to translate a natural language question into a SQL query. There are two main challenges to this task:

  1. The generated SQL needs to be semantically correct, that is, correctly reflect the meaning of the question.
  2. The SQL also needs to be valid, that is, it must not result in an execution error.
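The second requirement can be checked mechanically. The following is a minimal sketch, assuming a SQLite database and a made-up one-table schema; the helper name `is_executable` and the `singer` table are illustrative and not part of this repository. It only tests whether a query executes, and says nothing about whether the query answers the question.

```python
import sqlite3


def is_executable(sql: str, schema_ddl: str) -> bool:
    """Return True if `sql` runs without error on an empty in-memory database."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)  # create the (hypothetical) tables
        conn.execute(sql).fetchall()    # raises sqlite3.Error on invalid SQL
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


# Hypothetical schema, used for illustration only.
SCHEMA = "CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, age INTEGER);"

print(is_executable("SELECT name FROM singer WHERE age > 30", SCHEMA))  # True
print(is_executable("SELECT nme FROM singer", SCHEMA))                  # False: unknown column
print(is_executable("SELECT name FROM singer WHERE", SCHEMA))           # False: incomplete query
```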

So far, there has been a trade-off between these two goals: The second problem can be solved by using a special decoder architecture that -- by construction -- always produces valid SQL. This is the approach taken by most prior work. Those decoders are called "constrained decoders", and they need to be trained from scratch on the text-to-SQL dataset. However, this limits the generality of the decoders, which is a problem for the first goal.

A better approach would be to use a pre-trained encoder-decoder model and to constrain its decoder to produce valid SQL after fine-tuning the model on the text-to-SQL task. This is the approach taken by the PICARD algorithm.

How is PICARD different from existing constrained decoders?

  • It’s an incremental parsing algorithm that integrates with ordinary beam search.
  • It doesn’t require any training.
  • It doesn’t require modifying the model.
  • It works with any model that generates a sequence of tokens (including language models).
  • It doesn’t require a special vocabulary.
  • It works with character-, sub-word-, and word-level language models.

How does PICARD work?

The following picture shows how PICARD is integrated with beam search.



Decoding starts from the left and proceeds to the right. The algorithm begins with a single token (usually <s>), and then keeps expanding the beam with hypotheses generated token-by-token by the decoder. At each decoding step and for each hypothesis, PICARD checks whether the next top-k tokens are valid. In the image above, only 3 token predictions are shown, and k is set to 2. Valid tokens (☑) are added to the beam. Invalid ones (☒) are discarded. The (k+1)-th, (k+2)-th, ... tokens are discarded, too. As in normal beam search, the beam is pruned to contain only the top-n hypotheses. n is the beam size, and in the image above it is set to 2 as well. Hypotheses that are terminated with the end-of-sentence token (usually </s>) are not expanded further. The algorithm stops when all hypotheses are terminated or when the maximum number of tokens has been reached.
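The procedure described above can be sketched in a few lines of Python. This is a simplified illustration, not the code used in this repository: `constrained_beam_search`, `toy_model`, and `is_valid_prefix` are hypothetical names, the "model" is a hand-written lookup table, and validity is decided by comparing against two hard-coded queries instead of running a parser.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[List[str], float]  # (tokens, cumulative log-probability)


def constrained_beam_search(
    next_token_logprobs: Callable[[List[str]], Dict[str, float]],
    is_valid_prefix: Callable[[List[str]], bool],
    beam_size: int = 2,
    top_k: int = 2,
    max_len: int = 10,
    bos: str = "<s>",
    eos: str = "</s>",
) -> List[Hypothesis]:
    beam: List[Hypothesis] = [([bos], 0.0)]
    for _ in range(max_len):
        if all(tokens[-1] == eos for tokens, _ in beam):
            break  # every hypothesis is terminated
        candidates: List[Hypothesis] = []
        for tokens, score in beam:
            if tokens[-1] == eos:
                candidates.append((tokens, score))  # terminated: keep, do not expand
                continue
            logprobs = next_token_logprobs(tokens)
            # Only the k most likely tokens are checked; all others are dropped.
            top = sorted(logprobs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
            for token, logprob in top:
                if is_valid_prefix(tokens + [token]):  # the validity check
                    candidates.append((tokens + [token], score + logprob))
        if not candidates:
            break  # nothing valid to extend; return the previous beam
        # Ordinary beam pruning: keep the n best hypotheses.
        beam = sorted(candidates, key=lambda h: h[1], reverse=True)[:beam_size]
    return beam


# Toy stand-ins (assumptions for illustration only).
ALLOWED = [
    ["select", "name", "from", "singer", "</s>"],
    ["select", "age", "from", "singer", "</s>"],
]
TABLE = {
    "<s>": {"select": 0.9, "from": 0.1},
    "select": {"nme": 0.5, "name": 0.3, "age": 0.2},
    "name": {"from": 0.9, "where": 0.1},
    "from": {"singer": 0.8, "</s>": 0.2},
    "singer": {"</s>": 0.9, "where": 0.1},
}


def toy_model(tokens: List[str]) -> Dict[str, float]:
    """Next-token log-probabilities that depend only on the last token."""
    return {tok: math.log(p) for tok, p in TABLE[tokens[-1]].items()}


def is_valid_prefix(tokens: List[str]) -> bool:
    body = tokens[1:]  # strip <s>
    return any(full[: len(body)] == body for full in ALLOWED)


result = constrained_beam_search(toy_model, is_valid_prefix)
print(" ".join(result[0][0]))  # <s> select name from singer </s>
```

In this toy run the most likely second token, "nme", is rejected, so the less likely but valid "name" enters the beam. The token "age" is never considered because it is ranked third and k is 2.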

How does PICARD know whether a token is valid?

In PICARD, checking, accepting, and rejecting of tokens and token sequences is achieved through parsing. Parsing means that we attempt to assemble a data structure from the tokens that are currently in the beam or are about to be added to it. This data structure, together with the parsing rules that are used to build it, encodes the constraints we want to enforce.

In the case of SQL, the data structure we parse to is the abstract syntax tree (AST) of the SQL query. The parsing rules are defined in a computer program called a parser. Database engines, such as PostgreSQL, MySQL, and SQLite, have their own built-in parsers that they use internally to process SQL queries. For Spider and CoSQL, we have implemented a parser that supports a subset of the SQLite syntax and that checks additional constraints on the AST. In our implementation, the parsing rules are composed of simpler rules and primitives that are provided by a third-party parsing library.

PICARD uses a parsing library called attoparsec that supports incremental input. This is a special capability that is not available in many other parsing libraries. You can feed attoparsec a string that represents only part of the expected input to parse. When parsing reaches the end of an input fragment, attoparsec will return a continuation function that can be used to continue parsing. Think of the continuation function as a suspended computation that can be resumed later. Input fragments can be parsed one after the other when they become available until the input is complete.
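To make the idea of a continuation concrete, here is a small Python analogy. It is not attoparsec and not code from this repository; the `Partial`, `Done`, and `Fail` classes only mirror the three kinds of results an incremental parser can return, and `parse_number` is a hypothetical parser that accepts a non-empty run of digits. As in attoparsec, feeding an empty string signals the end of the input.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Done:
    value: int  # parsing finished successfully


@dataclass
class Fail:
    message: str  # parsing failed and cannot be resumed


@dataclass
class Partial:
    resume: Callable[[str], "Result"]  # suspended parser waiting for more input


Result = Union[Done, Fail, Partial]


def parse_number(consumed: str = "") -> Partial:
    """Incrementally parse a non-empty run of ASCII digits."""

    def resume(fragment: str) -> Result:
        if fragment == "":  # an empty fragment means "no more input"
            return Done(int(consumed)) if consumed else Fail("expected a digit")
        if not (fragment.isascii() and fragment.isdigit()):
            return Fail(f"unexpected input {fragment!r}")
        # Remember what was consumed so far and wait for the next fragment.
        return parse_number(consumed + fragment)

    return Partial(resume)


state = parse_number()      # nothing consumed yet
state = state.resume("12")  # first fragment: still Partial
state = state.resume("34")  # second fragment: still Partial
print(state.resume(""))     # Done(value=1234)
print(parse_number().resume("1").resume("x"))  # Fail(message="unexpected input 'x'")
```

Each `Partial` captures everything parsed so far, so the same suspended state can be resumed with several different fragments without re-parsing the earlier input.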

Herein lies the key to PICARD: Incremental parsing of input fragments is exactly what we need to check tokens one by one during decoding.

In PICARD, parsing is initialized with an empty string, and attoparsec will return the first continuation function. We then call that continuation function with all the token predictions we want to check in the first decoding step. For those tokens that are valid, the continuation function will return a new continuation function that we can use to continue parsing in the next decoding step. For those tokens that are invalid, the continuation function will return a failure value which cannot be used to continue parsing. Such failures are discarded and never end up in the beam. We repeat the process until the end of the input is reached. The input is complete once the model predicts the end-of-sentence token. When that happens, we finalize the parsing by calling the continuation function with an empty string. If the parsing is successful, it will return the final AST. If not, it will return a failure value.
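The loop described in this paragraph might look as follows in Python. This is a deliberately reduced sketch under several assumptions: the "parser" `start` accepts only two hard-coded queries, it returns the query string where the real parser would return an AST, the per-step token predictions in `steps` are made up, and only the single most likely valid token is kept at each step instead of a full beam.

```python
class Failure:
    """Result for input that cannot be extended to a valid query."""


# Toy stand-in for the SQL grammar (assumption for illustration only).
QUERIES = ("select name from singer", "select age from singer")


def start(consumed: str = ""):
    """Return a continuation that remembers the text consumed so far."""

    def resume(fragment: str):
        if fragment == "":  # end of input: finalize
            return consumed if consumed in QUERIES else Failure()
        text = consumed + fragment
        if any(query.startswith(text) for query in QUERIES):
            return start(text)  # valid so far: hand back a new continuation
        return Failure()

    return resume


def check_candidates(continuation, candidates):
    """Feed every candidate token to the same continuation; keep the survivors."""
    survivors = {}
    for token in candidates:
        outcome = continuation(token)
        if not isinstance(outcome, Failure):
            survivors[token] = outcome
    return survivors


# Hypothetical sub-word predictions per decoding step, most likely first.
steps = [
    ["sel", "from"],
    ["ect ", "ct"],
    ["nme", "name"],
    [" from", " where"],
    [" singer", " sing"],
]

continuation = start()  # the first continuation
decoded = []
for candidates in steps:
    survivors = check_candidates(continuation, candidates)
    if not survivors:
        break  # every candidate was rejected
    # Keep the most likely valid token (dicts preserve insertion order).
    token, continuation = next(iter(survivors.items()))
    decoded.append(token)

result = continuation("")  # the model predicted end-of-sentence: finalize
print(decoded)  # ['sel', 'ect ', 'name', ' from', ' singer']
print(result)   # select name from singer
```

Note that " sing" survives the check in the last step because it is still a prefix of a valid query. It would only be rejected at finalization: `start("select name from sing")("")` returns a `Failure`.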

The parsing rules are described at a high level in the PICARD paper. For details, see the PICARD code, specifically the Language.SQL.SpiderSQL.Parse module.

How well does PICARD work?

Let's look at the numbers:

On Spider

Model                           Exact-set Match Accuracy    Execution Accuracy
                                Dev        Test             Dev        Test
tscholak/cxmefzzi w PICARD      75.5 %     71.9 %           79.3 %     75.1 %
tscholak/cxmefzzi w/o PICARD    71.5 %     68.0 %           74.4 %     70.1 %

Click on the links to download the model.

On CoSQL Dialogue State Tracking

Model                           Question Match Accuracy     Interaction Match Accuracy
                                Dev        Test             Dev        Test
tscholak/2e826ioa w PICARD      56.9 %     54.6 %           24.2 %     23.7 %
tscholak/2e826ioa w/o PICARD    53.8 %     51.4 %           21.8 %     21.7 %

Click on the links to download the model.

Quick Start

Prerequisites

This repository uses git submodules. Clone it like this:

$ git clone git@github.com:ElementAI/picard.git
$ cd picard
$ git submodule update --init --recursive

Training

The training script is located in seq2seq/run_seq2seq.py. You can run it with:

$ make train

The model will be trained on the Spider dataset by default. You can also train on CoSQL by running make train-cosql.

The training script will create the directory train in the current directory. Training artifacts like checkpoints will be stored in this directory.

The default configuration is stored in configs/train.json. The settings are optimized for a GPU with 40GB of memory.

These training settings should result in a model with at least 71% exact-set-match accuracy on the Spider development set. With PICARD, the accuracy should go up to at least 75%.

We have uploaded a model trained on the Spider dataset to the Hugging Face model hub as tscholak/cxmefzzi. A model trained on the CoSQL dialogue state tracking dataset is also available as tscholak/2e826ioa.

Evaluation

The evaluation script is located in seq2seq/run_seq2seq.py. You can run it with:

$ make eval

By default, the evaluation will be run on the Spider evaluation set. Evaluation on the CoSQL evaluation set can be run with make eval-cosql.

The evaluation script will create the directory eval in the current directory. The evaluation results will be stored there.

The default configuration is stored in configs/eval.json.

Docker

There are three docker images that can be used to run the code:

  • tscholak/text-to-sql-dev: Base image with development dependencies. Use this for development. Pull it with make pull-dev-image from the docker hub. Rebuild the image with make build-dev-image.
  • tscholak/text-to-sql-train: Training image with development dependencies but without PICARD dependencies. Use this for fine-tuning a model. Pull it with make pull-train-image from the docker hub. Rebuild the image with make build-train-image.
  • tscholak/text-to-sql-eval: Training/evaluation image with all dependencies. Use this for evaluating a fine-tuned model with PICARD. This image can also be used for training if you want to run evaluation during training with PICARD. Pull it with make pull-eval-image from the docker hub. Rebuild the image with make build-eval-image.

All images are tagged with the current commit hash. The images are built with the buildx tool which is available in the latest docker-ce. Use make init-buildkit to initialize the buildx tool on your machine. You can then use make build-dev-image, make build-train-image, etc. to rebuild the images. Local changes to the code will not be reflected in the docker images unless they are committed to git.
