Contact Extraction with Question Answering.

Last update: Apr 20, 2022

Related tags

Text Data & NLP contactsQA

Overview

contactsQA

Extraction of contact entities from address blocks and imprints with Extractive Question Answering.

Goal

Input:

Dr. Max Mustermann
Hauptstraße 123
97070 Würzburg

Output:

entities = {
  "city" : "Würzburg",
  "email" : "",
  "fax" : "",
  "firstName" : "Max",
  "lastName" : "Mustermann",
  "mobile" : "",
  "organization" : "",
  "phone" : "",
  "position" : "",
  "street" : "Hauptstraße 123",
  "title" : "Dr.",
  "website" : "",
  "zip" : "97070"
}
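
As a rough sketch of how the output above could be assembled, the snippet below treats each entity key as one question and collects the answers into the dictionary shown. The names `FIELDS` and `assemble_entities` are hypothetical and not taken from the repository, and mapping the `EMPTY` marker (used for unanswerable questions in the dataset format) to an empty string is an assumption.

```python
# Hypothetical helper, not part of the repository: collect one answer per
# entity key into the output dictionary shown above.
FIELDS = [
    "city", "email", "fax", "firstName", "lastName", "mobile",
    "organization", "phone", "position", "street", "title", "website", "zip",
]


def assemble_entities(answers):
    """Return a dict with every entity key; missing answers become ""."""
    entities = {}
    for field in FIELDS:
        text = answers.get(field, "")
        # Assumption: the "EMPTY" marker means "no answer found".
        entities[field] = "" if text == "EMPTY" else text
    return entities


if __name__ == "__main__":
    predicted = {
        "title": "Dr.",
        "firstName": "Max",
        "lastName": "Mustermann",
        "street": "Hauptstraße 123",
        "zip": "97070",
        "city": "Würzburg",
        "phone": "EMPTY",
    }
    print(assemble_entities(predicted))
```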

Getting started

Creating a dataset

For data protection reasons, no dataset is included in this repository. You need to create a dataset in the SQuAD format (see https://huggingface.co/datasets/squad). Store it as JSONL, with one record per line. A single record looks like this (spread over several lines here for readability):

    {
        "id": "123",
        "title": "mustermanns address",
        "context": "Meine Adresse ist folgende: \n\nDr. Max Mustermann \nHauptstraße 123 \n97070 Würzburg \n Schicken Sie mir bitte die Rechnung zu.",
        "fixed": "Dr. Max Mustermann \nHauptstraße 123 \n97070 Würzburg",
        "question": "firstName",
        "answers": {
            "answer_start": [4],
            "text": ["Max"]
        }
    }

A question without an answer uses an answer_start of -1 and the text EMPTY:

    {
        "id": "123",
        "title": "mustermanns address",
        "context": "Meine Adresse ist folgende: \n\nDr. Max Mustermann \nHauptstraße 123 \n97070 Würzburg \n Schicken Sie mir bitte die Rechnung zu.",
        "fixed": "Dr. Max Mustermann \nHauptstraße 123 \n97070 Würzburg",
        "question": "phone",
        "answers": {
            "answer_start": [-1],
            "text": ["EMPTY"]
        }
    }
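
The two record shapes above can be produced with a small helper. This is a minimal sketch, not code from the repository: `make_record` and `to_jsonl_line` are hypothetical names, and measuring the offset inside the `fixed` text is an assumption inferred from the example, where 4 is the position of "Max" in `fixed`.

```python
import json


def make_record(record_id, title, context, fixed, question, answer):
    """Build one SQuAD-style record for a single entity question."""
    # Assumption: answer_start is a character offset into `fixed`.
    start = fixed.find(answer) if answer else -1
    if start == -1:
        # No answer given, or answer not found in the text: mark as unanswerable.
        answers = {"answer_start": [-1], "text": ["EMPTY"]}
    else:
        answers = {"answer_start": [start], "text": [answer]}
    return {
        "id": record_id,
        "title": title,
        "context": context,
        "fixed": fixed,
        "question": question,
        "answers": answers,
    }


def to_jsonl_line(record):
    """Serialize a record as one line of JSON, keeping umlauts readable."""
    return json.dumps(record, ensure_ascii=False)


if __name__ == "__main__":
    fixed = "Dr. Max Mustermann \nHauptstraße 123 \n97070 Würzburg"
    context = "Meine Adresse ist folgende: \n\n" + fixed
    for question, answer in [("firstName", "Max"), ("phone", "")]:
        record = make_record("123", "mustermanns address", context, fixed, question, answer)
        print(to_jsonl_line(record))
```

`json.dumps` escapes the newlines inside the strings, so each record stays on a single line as JSONL requires.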

Split the dataset into training, validation, and test sets and save them in a directory named crawl, email, or expected, like this:

├── data
│   ├── crawl
│   │   ├── crawl-test.jsonl
│   │   ├── crawl-train.jsonl
│   │   ├── crawl-val.jsonl

If you allow unanswerable questions, as in SQuAD v2.0, append -na to the directory name and to the file name prefix, like this:

├── data
│   ├── crawl-na
│   │   ├── crawl-na-test.jsonl
│   │   ├── crawl-na-train.jsonl
│   │   ├── crawl-na-val.jsonl
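
One way to produce this layout is sketched below. The helper names, the 80/10/10 split, and the fixed seed are assumptions made for illustration; only the directory and file naming follow the trees above.

```python
import json
import random
import tempfile
from pathlib import Path


def split_records(records, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle the records and cut them into train, val, and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    n_test = int(len(shuffled) * test_fraction)
    return {
        "test": shuffled[:n_test],
        "val": shuffled[n_test:n_test + n_val],
        "train": shuffled[n_test + n_val:],
    }


def write_splits(records, data_dir, name, no_answers=False):
    """Write <name>[-na]/<name>[-na]-{train,val,test}.jsonl below data_dir."""
    dataset = f"{name}-na" if no_answers else name
    target = Path(data_dir) / dataset
    target.mkdir(parents=True, exist_ok=True)
    paths = {}
    for split, rows in split_records(records).items():
        path = target / f"{dataset}-{split}.jsonl"
        with path.open("w", encoding="utf-8") as handle:
            for row in rows:
                handle.write(json.dumps(row, ensure_ascii=False) + "\n")
        paths[split] = path
    return paths


if __name__ == "__main__":
    # Placeholder records; real ones follow the SQuAD-style format above.
    records = [{"id": str(i)} for i in range(20)]
    with tempfile.TemporaryDirectory() as data_dir:
        for split, path in write_splits(records, data_dir, "crawl", no_answers=True).items():
            print(split, path.name)
```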

Training a model

Example command for training and evaluating a model on the dataset in the crawl-na directory:

python app/qa-pipeline.py \
--batch_size 4 \
--checkpoint xlm-roberta-base \
--dataset_name crawl \
--dataset_path="../data/" \
--deactivate_map_caching \
--doc_stride 128 \
--epochs 3 \
--gpu_device 0 \
--learning_rate 0.00002 \
--max_answer_length 30 \
--max_length 384 \
--n_best_size 20 \
--n_jobs 8 \
--no_answers \
--overwrite_output_dir;
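
The flags --max_length and --doc_stride are not explained here. Assuming they carry the same meaning as in the Hugging Face question-answering examples (a maximum window size in tokens and the overlap between consecutive windows), which this document does not confirm, the windowing of a long context can be sketched as follows. The function `sliding_windows` is hypothetical and simplified: it ignores the question and special tokens that a real tokenizer adds to each window.

```python
def sliding_windows(tokens, max_length=384, doc_stride=128):
    """Split tokens into windows of at most max_length items.

    Consecutive windows share doc_stride tokens (the assumed meaning of
    the flag), so an answer near a window border is fully contained in at
    least one window.
    """
    if not 0 <= doc_stride < max_length:
        raise ValueError("doc_stride must be >= 0 and smaller than max_length")
    step = max_length - doc_stride
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_length])
        if start + max_length >= len(tokens):
            break
        start += step
    return windows


if __name__ == "__main__":
    tokens = list(range(600))  # stand-in for 600 token ids
    for window in sliding_windows(tokens):
        print(window[0], window[-1], len(window))
```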

Virtual Environment Setup

Create and activate the environment (you can choose a different Python version and environment name):

$ python3.9 -m venv .env
$ source .env/bin/activate
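
To confirm that the environment is active before installing anything, a short check like the following can help. It is only a sketch and `in_virtualenv` is a hypothetical name. It relies on the standard-library behaviour that `sys.prefix` differs from `sys.base_prefix` inside an environment created with venv.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv environment."""
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter prefix:", sys.prefix)
```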

To install the project's dependencies, activate the virtual environment and run the following (requires Poetry):

$ poetry install

Alternatively, install them with pip:

$ pip install -r requirements.txt

Deactivate the environment:

$ deactivate

Troubleshooting

Common error:

ModuleNotFoundError: No module named 'setuptools'

To fix it, upgrade setuptools, then run poetry install or poetry update again:

pip install --upgrade setuptools
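
To check whether the module is visible to the active interpreter, before or after the upgrade, a small sketch such as this can be used. The helper name `has_module` is hypothetical. `importlib.util.find_spec` returns None when a top-level module cannot be found.

```python
import importlib.util


def has_module(name):
    """Return True if the top-level module `name` can be imported."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    if has_module("setuptools"):
        print("setuptools is available")
    else:
        print("setuptools is missing; run: pip install --upgrade setuptools")
```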

Owner

Jan
