Deduplication is the task of combining different representations of the same real-world entity.

Last update: Nov 17, 2022

DedupliPy

Deduplication is the task of combining different representations of the same real-world entity. This package implements deduplication using active learning, which allows for rapid training without requiring a large, manually labelled dataset.

DedupliPy is an end-to-end solution with advantages over existing solutions:

Active learning: no large, manually labelled dataset is required.
During active learning, the user is notified once the model has converged and training can be stopped.
Works out of the box, while advanced users can adjust settings as desired (custom blocking rules, custom metrics, interaction features).
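To make the terms "blocking rule" and "metric" concrete, here is a minimal sketch using only the Python standard library. It is not DedupliPy's API: the sample records, `blocking_key`, `similarity` and `candidate_pairs` are hypothetical names chosen for illustration. A blocking rule limits which record pairs are compared at all, and a metric turns each compared pair into a numeric feature.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample records; not taken from DedupliPy.
records = [
    {"id": 0, "name": "Mary Jones", "address": "1 Main Street"},
    {"id": 1, "name": "Mary Jonse", "address": "1 Main St"},
    {"id": 2, "name": "Peter Smith", "address": "22 Oak Avenue"},
    {"id": 3, "name": "Mark Johnson", "address": "9 Hill Road"},
]


def blocking_key(record):
    """Illustrative blocking rule: first three characters of the name, lowercased."""
    return record["name"][:3].lower()


def similarity(a, b):
    """Illustrative metric: difflib similarity ratio in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_pairs(rows):
    """Group records by blocking key and yield pairs within each block only."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[blocking_key(row)].append(row)
    for block in blocks.values():
        yield from combinations(block, 2)


# Only records sharing a blocking key are compared, instead of all possible pairs.
pairs = list(candidate_pairs(records))

# One similarity feature per column for every candidate pair.
features = {
    (a["id"], b["id"]): [similarity(a[col], b[col]) for col in ("name", "address")]
    for a, b in pairs
}
```

With these four records, blocking reduces the six possible pairs to the three pairs that share the key `mar`. A classifier would then be trained on the per-column similarity features.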

Developed by Frits Hermans

Documentation

Documentation can be found here

Installation

Normal installation

Install directly from PyPI:

pip install deduplipy

Install to contribute

Clone this GitHub repo and install in editable mode:

python -m pip install -e ".[dev]"
python setup.py develop

Usage

Apply deduplication to your pandas DataFrame df as follows:

from deduplipy.deduplicator import Deduplicator  # assumed import path
myDedupliPy = Deduplicator(col_names=['name', 'address'])
myDedupliPy.fit(df)

This starts an interactive learning session in which you indicate whether each presented pair is a match (y) or not (n). Once training has converged, you are notified that the session may be finished. Predictions on (new) data are obtained as follows:

result = myDedupliPy.predict(df)
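The interactive session described above is an instance of active learning. As a rough illustration of why it needs few labels, the sketch below runs uncertainty sampling on a toy one-dimensional problem. It does not reproduce DedupliPy's internals: `fit_threshold`, `active_learning`, the scores and the `oracle` function (which stands in for the user's y/n answers) are all hypothetical.

```python
def fit_threshold(labelled):
    """Toy model: place the decision threshold midway between the two classes."""
    matches = [score for score, is_match in labelled if is_match]
    non_matches = [score for score, is_match in labelled if not is_match]
    return (min(matches) + max(non_matches)) / 2


def active_learning(scores, oracle, seed, n_queries):
    """Repeatedly ask the oracle about the pair the current model is least sure of."""
    labelled = list(seed)   # must contain at least one match and one non-match
    pool = list(scores)     # unlabelled similarity scores of candidate pairs
    for _ in range(n_queries):
        threshold = fit_threshold(labelled)
        # Uncertainty sampling: the score closest to the threshold is most ambiguous.
        query = min(pool, key=lambda score: abs(score - threshold))
        pool.remove(query)
        labelled.append((query, oracle(query)))
    return fit_threshold(labelled), labelled


def oracle(score):
    """Stand-in for the user's answer; assumes pairs scoring 0.5 or more are matches."""
    return score >= 0.5


scores = [0.1, 0.2, 0.35, 0.45, 0.6, 0.7, 0.9]
seed = [(0.05, False), (0.95, True)]
threshold, labelled = active_learning(scores, oracle, seed, n_queries=3)
queried = [score for score, _ in labelled[len(seed):]]
```

In this toy run only three of the seven pairs are queried, all near the decision boundary, and the learned threshold ends up between the highest non-match and the lowest match.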
