《K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters》(2020)

Overview

K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters

This repository is the implementation of the paper "K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters".

In the K-adapter paper, we present a flexible approach that supports continual knowledge infusion into large pre-trained models (e.g. RoBERTa in this work). We infuse factual knowledge and linguistic knowledge, and show that adapters for both kinds of knowledge work well on downstream tasks.

For more details, please check the latest version of the paper: https://arxiv.org/abs/2002.01808

Prerequisites

  • Python 3.6
  • PyTorch 1.3.1
  • tensorboardX
  • transformers

We use huggingface/transformers framework, the environment can be installed with:

conda create -n kadapter python=3.6
pip install -r requirements.txt

Pre-training Adapters

In the pre-training procedure, we train each knowledge-specific adapter on different pre-training tasks individually.

1. Process Dataset

  • ./scripts/clean_T_REx.py: clean raw T-Rex dataset (32G), and save the cleaned T-Rex to JSON format
  • ./scripts/create_subdataset-relation-classification.ipynb: create the dataset from T-REx for pre-training factual adapter on relation classification task. This sub-dataset can be found here.
  • refer to this code to get the dependency parsing dataset : create the dataset from Book Corpus for pre-training the linguistic adapter on dependency parsing task.

2. Factual Adapter

To pre-train fac-adapter, run

bash run_pretrain_fac-adapter.sh

3. Linguistic Adapter

To pre-train lin-adapter, run

bash run_pretrain_lin-adapter.sh

The pre-trained fac-adapter and lin-adapter models can be found here.

Fine-tuning on Downstream Tasks

Adapter Structure

  • The fac-adapter (lin-adapter) consists of two transformer layers (L=2, H=768, A = 12)
  • The RoBERTa layers where adapters plug in: 0,11,23 or 0,11,22
  • For using only single adapter
    • Use the concatenation of the last hidden feature of RoBERTa and the last hidden feature of the adapter as the input representation for the task-specific layer.
  • For using combine adapter
    • For each adapter, first concat the last hidden feature of RoBERTa and the last hidden feature of every adapter and feed into a linear layer separately, then concat the representations as input for task-specific layer.

About how to load pretrained RoBERTa and pretrained adapter

  • The pre-trained adapters are in ./pretrained_models/fac-adapter/pytorch_model.bin and ./pretrained_models/lin-adapter/pytorch_model.bin. For using only single adapter, for example, fac-adapter, then you can set the argument meta_fac_adaptermodel= and set meta_lin_adaptermodel=””. For using both adapters, just set the arguments meta_fac_adaptermodel and meta_lin_adaptermodel as the path of adapters.
  • The pretrained RoBERTa will be downloaded automaticly when you run the pipeline.

1. Entity Typing

1.1 OpenEntity

One single 16G P100

(1) run the pipeline

bash run_finetune_openentity_adapter.sh

(2) result

  • with fac-adapter dev: (0.7967123287671233, 0.7580813347236705, 0.7769169115682607) test: (0.7929708951125755, 0.7584033613445378, 0.7753020134228187)
  • with lin-adapter dev: (0.8071672354948806, 0.7398331595411888, 0.7720348204570185) test:(0.8001135718341851, 0.7400210084033614, 0.7688949522510232)
  • with fac-adapter + lin-adapter dev: (0.8001101321585903, 0.7575599582898853, 0.7782538832351366) test: (0.7899568034557235, 0.7627737226277372, 0.7761273209549072)

the results may vary when running on different machines, but should not differ too much. I just search results from per_gpu_train_batch_sizeh: [4, 8] lr: [1e-5, 5e-6], warmup[0,200,500,1000,1200], maybe you can change other parameters and see the results. For w/fac-adapter, the best performance is achieved at gpu_num=1, per_gpu_train_batch_size=4, lr=5e-6, warmup=500(it takes about 2 hours to get the best result running on singe 16G P100) For w/lin-adapter, the best performance is achieved at gpu_num=1, per_gpu_train_batch_size=4, lr=5e-6, warmup=1000(it takes about 2 hours to get the best result running on singe 16G P100)

(3) Data format

Add special token "@" before and after a certain entity, then the first @ is adopted to perform classification. 9 entity categories: ['entity', 'location', 'time', 'organization', 'object', 'event', 'place', 'person', 'group'], each entity can be classified to several of them or none of them. The output is represented as [0,1,1,0,1,0,0,0,0], 0 represents the entity does not belong to the type, while 1 belongs to.

1.2 FIGER

(1) run the pipeline

bash run_finetune_figer_adapter.sh

The detailed hyperparamerters are listed in the running script.

2. Relation Classification

4*16G P100

(1) run the pipeline

bash run_finetune_tacred_adapter.sh

(2) result

  • with fac-adapter

    • 'dev': (0.6686945083853996, 0.7481604120676968, 0.7061989928807085)
    • 'test': (0.693900391717963, 0.7458646616541353, 0.7189447746050153)
  • with lin-adapter

    • 'dev': (0.6679165308118683, 0.7536791758646063, 0.7082108902333621),
    • 'test': (0.6884615384615385, 0.7536842105263157, 0.7195979899497488)
  • with fac-adapter + lin-adapter

    • 'dev': (0.6793893129770993, 0.7367549668874173, 0.7069102462271645)
    • 'test': (0.7014245014245014, 0.7404511278195489, 0.7204096561814192)
  • the results may vary when running on different machines, but should not differ too much.

  • I just search results from per_gpu_train_batch_sizeh: [4, 8] lr: [1e-5, 5e-6], warmup[0,200,1000,1200], maybe you can change other parameters and see the results.

  • The best performance is achieved at gpu_num=4, per_gpu_train_batch_size=8, lr=1e-5, warmup=200 (it takes about 7 hours to get the best result running on 4 16G P100)

  • The detailed hyperparamerters are listed in the running script.

(3) Data format

Add special token "@" before and after the first entity, add '#' before and after the second entity. Then the representations of @ and # are concatenated to perform relation classification.

3. Question Answering

3.1 CosmosQA

One single 16G P100

(1) run the pipeline

bash run_finetune_cosmosqa_adapter.sh

(2) result

CosmosQA dev accuracy: 80.9 CosmosQA test accuracy: 81.8

The best performance is achieved at gpu_num=1, per_gpu_train_batch_size=64, GRADIENT_ACC=32, lr=1e-5, warmup=0 (it takes about 8 hours to get the best result running on singe 16G P100) The detailed hyperparamerters are listed in the running script.

(3) Data format

For each answer, the input is contextquestionanswer, and will get a score for this answers. After getting four scores, we will select the answer with the highest score.

3.2 SearchQA and Quasar-T

The source codes for fine-tuning on SearchQA and Quasar-T dataset are modified based on the code of paper "Denoising Distantly Supervised Open-Domain Question Answering".

Use K-Adapter just like RoBERTa

  • You can use K-Adapter (RoBERTa with adapters) just like RoBERTa, which almost have the same inputs and outputs. Specifically, we add a class RobertawithAdapter in pytorch_transformers/my_modeling_roberta.py.
  • A demo code [run_example.sh and examples/run_example.py] about how to use “RobertawithAdapter”, do inference, save model and load model. You can leave the arguments of adapters as default.
  • Now it is very easy to use Roberta with adapters. If you only want to use single adapter, for example, fac-adapter, then you can set the argument meta_fac_adaptermodel='./pretrained_models/fac-adapter/pytorch_model.bin'' and set meta_lin_adaptermodel=””. If you want to use both adapters, just set the arguments meta_fac_adaptermodel and meta_lin_adaptermodel as the path of adapters.
bash run_example.sh

TODO

  • Remove and merge redundant codes
  • Support other pre-trained models, such as BERT...

Contact

Feel free to contact Ruize Wang ([email protected]) if you have any further questions.

Owner
Microsoft
Open source projects and samples from Microsoft
Microsoft
CR-FIQA: Face Image Quality Assessment by Learning Sample Relative Classifiability

This is the official repository of the paper: CR-FIQA: Face Image Quality Assessment by Learning Sample Relative Classifiability A private copy of the

Fadi Boutros 33 Dec 31, 2022
DeepLabv3+:Encoder-Decoder with Atrous Separable Convolution语义分割模型在tensorflow2当中的实现

DeepLabv3+:Encoder-Decoder with Atrous Separable Convolution语义分割模型在tensorflow2当中的实现 目录 性能情况 Performance 所需环境 Environment 注意事项 Attention 文件下载 Download

Bubbliiiing 31 Nov 25, 2022
[SIGMETRICS 2022] One Proxy Device Is Enough for Hardware-Aware Neural Architecture Search

One Proxy Device Is Enough for Hardware-Aware Neural Architecture Search paper | website One Proxy Device Is Enough for Hardware-Aware Neural Architec

10 Dec 16, 2022
ObsPy: A Python Toolbox for seismology/seismological observatories.

ObsPy is an open-source project dedicated to provide a Python framework for processing seismological data. It provides parsers for common file formats

ObsPy 979 Jan 07, 2023
This repository contains the source code of our work on designing efficient CNNs for computer vision

Efficient networks for Computer Vision This repo contains source code of our work on designing efficient networks for different computer vision tasks:

Sachin Mehta 386 Nov 26, 2022
This repository allows the user to automatically scale a 3D model/mesh/point cloud on Agisoft Metashape

Metashape-Utils This repository allows the user to automatically scale a 3D model/mesh/point cloud on Agisoft Metashape, given a set of 2D coordinates

INSCRIBE 4 Nov 07, 2022
Adversarial Self-Defense for Cycle-Consistent GANs

Adversarial Self-Defense for Cycle-Consistent GANs This is the official implementation of the CycleGAN robust to self-adversarial attacks used in pape

Dina Bashkirova 10 Oct 10, 2022
Optimized primitives for collective multi-GPU communication

NCCL Optimized primitives for inter-GPU communication. Introduction NCCL (pronounced "Nickel") is a stand-alone library of standard communication rout

NVIDIA Corporation 2k Jan 09, 2023
A tensorflow/keras implementation of StyleGAN to generate images of new Pokemon.

PokeGAN A tensorflow/keras implementation of StyleGAN to generate images of new Pokemon. Dataset The model has been trained on dataset that includes 8

19 Jul 26, 2022
ARAE-Tensorflow for Discrete Sequences (Adversarially Regularized Autoencoder)

ARAE Tensorflow Code Code for the paper Adversarially Regularized Autoencoders for Generating Discrete Structures by Zhao, Kim, Zhang, Rush and LeCun

19 Nov 12, 2021
Numerical-computing-is-fun - Learning numerical computing with notebooks for all ages.

As much as this series is to educate aspiring computer programmers and data scientists of all ages and all backgrounds, it is also a reminder to mysel

EKA foundation 758 Dec 25, 2022
PyTorch implementation of "Supervised Contrastive Learning" (and SimCLR incidentally)

PyTorch implementation of "Supervised Contrastive Learning" (and SimCLR incidentally)

Yonglong Tian 2.2k Jan 08, 2023
An executor that loads ONNX models and embeds documents using the ONNX runtime.

ONNXEncoder An executor that loads ONNX models and embeds documents using the ONNX runtime. Usage via Docker image (recommended) from jina import Flow

Jina AI 2 Mar 15, 2022
Local Attention - Flax module for Jax

Local Attention - Flax Autoregressive Local Attention - Flax module for Jax Install $ pip install local-attention-flax Usage from jax import random fr

Phil Wang 16 Jun 16, 2022
Certified Patch Robustness via Smoothed Vision Transformers

Certified Patch Robustness via Smoothed Vision Transformers This repository contains the code for replicating the results of our paper: Certified Patc

Madry Lab 35 Dec 14, 2022
This repo contains the implementation of YOLOv2 in Keras with Tensorflow backend.

Easy training on custom dataset. Various backends (MobileNet and SqueezeNet) supported. A YOLO demo to detect raccoon run entirely in brower is accessible at https://git.io/vF7vI (not on Windows).

Huynh Ngoc Anh 1.7k Dec 24, 2022
High-Resolution Image Synthesis with Latent Diffusion Models

Latent Diffusion Models arXiv | BibTeX High-Resolution Image Synthesis with Latent Diffusion Models Robin Rombach*, Andreas Blattmann*, Dominik Lorenz

CompVis Heidelberg 5.6k Dec 30, 2022
A Python framework for developing parallelized Computational Fluid Dynamics software to solve the hyperbolic 2D Euler equations on distributed, multi-block structured grids.

pyHype: Computational Fluid Dynamics in Python pyHype is a Python framework for developing parallelized Computational Fluid Dynamics software to solve

Mohamed Khalil 21 Nov 22, 2022
AdelaiDepth is an open source toolbox for monocular depth prediction.

AdelaiDepth is an open source toolbox for monocular depth prediction.

Adelaide Intelligent Machines (AIM) Group 743 Jan 01, 2023
Dense matching library based on PyTorch

Dense Matching A general dense matching library based on PyTorch. For any questions, issues or recommendations, please contact Prune at

Prune Truong 399 Dec 28, 2022