Python package for performing Entity and Text Matching using Deep Learning.

Overview

DeepMatcher

https://travis-ci.org/anhaidgroup/deepmatcher.svg?branch=master

DeepMatcher is a Python package for performing entity and text matching using deep learning. It provides built-in neural networks and utilities that enable you to train and apply state-of-the-art deep learning models for entity matching in less than 10 lines of code. The models are also easily customizable - the modular design allows any subcomponent to be altered or swapped out for a custom implementation.

As an example, given labeled tuple pairs such as the following:

https://raw.githubusercontent.com/anhaidgroup/deepmatcher/master/docs/source/_static/match_input_ex.png

DeepMatcher uses labeled tuple pairs and trains a neural network to perform matching, i.e., to predict match / non-match labels. The trained network can then be used to obtain labels for unlabeled tuple pairs.

Paper and Data

For details on the architecture of the models used, take a look at our paper Deep Learning for Entity Matching (SIGMOD '18). All public datasets used in the paper can be downloaded from the datasets page.

Quick Start: DeepMatcher in 30 seconds

There are four main steps in using DeepMatcher:

  1. Data processing: Load and process labeled training, validation and test CSV data.
import deepmatcher as dm
train, validation, test = dm.data.process(path='data_directory',
    train='train.csv', validation='validation.csv', test='test.csv')
  1. Model definition: Specify neural network architecture. Uses the built-in hybrid model (as discussed in section 4.4 of our paper) by default. Can be customized to your heart's desire.
model = dm.MatchingModel()
  1. Model training: Train neural network.
model.run_train(train, validation, best_save_path='best_model.pth')
  1. Application: Evaluate model on test set and apply to unlabeled data.
model.run_eval(test)

unlabeled = dm.data.process_unlabeled(path='data_directory/unlabeled.csv', trained_model=model)
model.run_prediction(unlabeled)

Installation

We currently support only Python versions 3.5 and 3.6. Installing using pip is recommended:

pip install deepmatcher

Note that during installation you may see an error message that says "Failed building wheel for fasttextmirror". You can safely ignore this - it does NOT mean that there are any problems with installation.

Tutorials

Using DeepMatcher:

  1. Getting Started: A more in-depth guide to help you get familiar with the basics of using DeepMatcher.
  2. Data Processing: Advanced guide on what data processing involves and how to customize it.
  3. Matching Models: Advanced guide on neural network architecture for entity matching and how to customize it.

Entity Matching Workflow:

End to End Entity Matching: A guide to develop a complete entity matching workflow. The tutorial discusses how to use DeepMatcher with Magellan to perform blocking, sampling, labeling and matching to obtain matching tuple pairs from two tables.

DeepMatcher for other matching tasks:

Question Answering with DeepMatcher: A tutorial on how to use DeepMatcher for question answering. Specifically, we will look at WikiQA, a benchmark dataset for the task of Answer Selection.

API Reference

API docs are here.

Support

Take a look at the FAQ for common issues. If you run into any issues or have questions not answered in the FAQ, please file GitHub issues and we will address them asap.

The Team

DeepMatcher was developed by University of Wisconsin-Madison grad students Sidharth Mudgal and Han Li, under the supervision of Prof. AnHai Doan and Prof. Theodoros Rekatsinas.

2021 AI CUP Competition on Traditional Chinese Scene Text Recognition - Intermediate Contest

繁體中文場景文字辨識 程式碼說明 組別:這就是我 成員:蔣明憲 唐碩謙 黃玥菱 林冠霆 蕭靖騰 目錄 環境套件 安裝方式 資料夾布局 前處理-製作偵測訓練註解檔 前處理-製作分類訓練樣本 part.py : 從 json 裁切出分類訓練樣本 Class.py : 將切出來的樣本按照文字分類到各資料夾

HuanyueTW 3 Jan 14, 2022
Resources for "Natural Language Processing" Coursera course.

Natural Language Processing course resources This github contains practical assignments for Natural Language Processing course by Higher School of Eco

Advanced Machine Learning specialisation by HSE 1.1k Jan 01, 2023
Unsupervised text tokenizer focused on computational efficiency

YouTokenToMe YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte Pair Encoding (BPE)

VK.com 847 Dec 19, 2022
Python interface for converting Penn Treebank trees to Stanford Dependencies and Universal Depenencies

PyStanfordDependencies Python interface for converting Penn Treebank trees to Universal Dependencies and Stanford Dependencies. Example usage Start by

David McClosky 64 May 08, 2022
An official repository for tutorials of Probabilistic Modelling and Reasoning (2021/2022) - a University of Edinburgh master's course.

PMR computer tutorials on HMMs (2021-2022) This is a repository for computer tutorials of Probabilistic Modelling and Reasoning (2021/2022) - a Univer

Vaidotas Šimkus 10 Dec 06, 2022
Using context-free grammar formalism to parse English sentences to determine their structure to help computer to better understand the meaning of the sentence.

Sentance Parser Executing the Program Make sure Python 3.6+ is installed. Install requirements $ pip install requirements.txt Run the program:

Vaibhaw 12 Sep 28, 2022
Türkçe küfürlü içerikleri bulan bir yapay zeka kütüphanesi / An ML library for profanity detection in Turkish sentences

"Kötü söz sahibine aittir." -Anonim Nedir? sinkaf uygunsuz yorumların bulunmasını sağlayan bir python kütüphanesidir. Farkı nedir? Diğer algoritmalard

KaraGoz 4 Feb 18, 2022
A python script to prefab your scripts/text files, and re create them with ease and not have to open your browser to copy code or write code yourself

Scriptfab - What is it? A python script to prefab your scripts/text files, and re create them with ease and not have to open your browser to copy code

DevNugget 3 Jul 28, 2021
A modular Karton Framework service that unpacks common packers like UPX and others using the Qiling Framework.

Unpacker Karton Service A modular Karton Framework service that unpacks common packers like UPX and others using the Qiling Framework. This project is

c3rb3ru5 45 Jan 05, 2023
Convolutional 2D Knowledge Graph Embeddings resources

ConvE Convolutional 2D Knowledge Graph Embeddings resources. Paper: Convolutional 2D Knowledge Graph Embeddings Used in the paper, but do not use thes

Tim Dettmers 586 Dec 24, 2022
This is the source code of RPG (Reward-Randomized Policy Gradient)

RPG (Reward-Randomized Policy Gradient) Zhenggang Tang*, Chao Yu*, Boyuan Chen, Huazhe Xu, Xiaolong Wang, Fei Fang, Simon Shaolei Du, Yu Wang, Yi Wu (

40 Nov 25, 2022
Shirt Bot is a discord bot which uses GPT-3 to generate text

SHIRT BOT · Shirt Bot is a discord bot which uses GPT-3 to generate text. Made by Cyclcrclicly#3420 (474183744685604865) on Discord. Support Server EX

31 Oct 31, 2022
:mag: Transformers at scale for question answering & neural search. Using NLP via a modular Retriever-Reader-Pipeline. Supporting DPR, Elasticsearch, HuggingFace's Modelhub...

Haystack is an end-to-end framework that enables you to build powerful and production-ready pipelines for different search use cases. Whether you want

deepset 6.4k Jan 09, 2023
Data and code to support "Applied Natural Language Processing" (INFO 256, Fall 2021, UC Berkeley)

anlp21 Course materials for "Applied Natural Language Processing" (INFO 256, Fall 2021, UC Berkeley) Syllabus: http://people.ischool.berkeley.edu/~dba

David Bamman 48 Dec 06, 2022
REST API for sentence tokenization and embedding using Multilingual Universal Sentence Encoder.

What is MUSE? MUSE stands for Multilingual Universal Sentence Encoder - multilingual extension (16 languages) of Universal Sentence Encoder (USE). MUS

Dani El-Ayyass 47 Sep 05, 2022
Prithivida 690 Jan 04, 2023
An example project using OpenPrompt under pytorch-lightning for prompt-based SST2 sentiment analysis model

pl_prompt_sst An example project using OpenPrompt under the framework of pytorch-lightning for a training prompt-based text classification model on SS

Zhiling Zhang 5 Oct 21, 2022
ASCEND Chinese-English code-switching dataset

ASCEND (A Spontaneous Chinese-English Dataset) introduces a high-quality resource of spontaneous multi-turn conversational dialogue Chinese-English code-switching corpus collected in Hong Kong.

CAiRE 11 Dec 09, 2022
A PyTorch implementation of VIOLET

VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling A PyTorch implementation of VIOLET Overview VIOLET is an implementati

Tsu-Jui Fu 119 Dec 30, 2022