This project provides an unsupervised framework for mining and tagging quality phrases on text corpora with pretrained language models (KDD'21).

Last update: Dec 22, 2022

Overview

UCPhrase: Unsupervised Context-aware Quality Phrase Tagging

To appear at KDD'21...[pdf]

This project provides an unsupervised framework for mining and tagging quality phrases in text corpora. In this work, we recognize the power of pretrained language models in identifying the structure of a sentence. The attention matrices generated by a Transformer model are informative for distinguishing quality phrases from ordinary spans.

With a lightweight CNN model that captures inter-word relationships at various ranges, we can effectively treat phrase tagging as a multi-channel image classification problem.
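To make the "multi-channel image" view concrete, here is a minimal sketch of how the attention maps for one candidate span could be assembled into an image-like array. It assumes the attention tensor has shape (layers, heads, seq_len, seq_len) and that each layer/head pair becomes one channel; the function name `span_attention_image` and the random input are hypothetical and are not taken from the UCPhrase code base.

```python
import numpy as np

def span_attention_image(attention, start, end):
    """Crop attention maps to a candidate span [start, end).

    attention: array of shape (layers, heads, seq_len, seq_len)
    returns:   array of shape (layers * heads, span_len, span_len),
               i.e. one "image channel" per (layer, head) pair.
    """
    layers, heads, _, _ = attention.shape
    # Keep only the attention among the tokens inside the span.
    crop = attention[:, :, start:end, start:end]
    # Merge the layer and head axes into a single channel axis.
    return crop.reshape(layers * heads, end - start, end - start)

# Made-up input: 3 layers, 4 heads, a 10-token sentence.
rng = np.random.default_rng(0)
attention = rng.random((3, 4, 10, 10))

image = span_attention_image(attention, start=2, end=5)
print(image.shape)  # (12, 3, 3)
```

An array of this shape could then be fed to a small CNN classifier that decides whether the span is a quality phrase; the classifier itself is omitted here.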

For model training, we seek to alleviate the need for human annotation and external knowledge bases. Instead, we show that sufficient supervision can be mined directly from a large-scale unlabeled corpus. Specifically, we mine frequent max patterns with each document as context, since by definition, high-quality phrases are sequences that are consistently used in context. Compared with labels generated by distant supervision, silver labels mined from the corpus itself preserve better diversity, coverage, and contextual completeness. This advantage is supported by comparisons on two public datasets.
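The idea of per-document pattern mining can be sketched as follows. This is a simplified illustration, not the project's implementation: it assumes a "max pattern" is a contiguous n-gram that occurs at least `min_freq` times in one document and is not contained in any longer frequent n-gram. The function name and the `min_freq` and `max_len` parameters are hypothetical, and any further filtering (for example of stopword-only patterns) is left out.

```python
from collections import Counter

def mine_max_patterns(tokens, min_freq=2, max_len=5):
    """Return the maximal frequent n-grams of a single tokenized document."""
    # Count every contiguous n-gram up to max_len tokens.
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1

    frequent = {gram for gram, c in counts.items() if c >= min_freq}

    def contained_in(short, long):
        # True if `short` occurs as a contiguous sub-sequence of `long`.
        k = len(short)
        return any(long[i:i + k] == short for i in range(len(long) - k + 1))

    # Keep only patterns that no longer frequent pattern contains.
    return {
        g for g in frequent
        if not any(len(h) > len(g) and contained_in(g, h) for h in frequent)
    }

doc = ("deep learning models use deep learning for "
       "phrase tagging and phrase tagging helps").split()
print(sorted(mine_max_patterns(doc)))
# [('deep', 'learning'), ('phrase', 'tagging')]
```

In this toy document, "deep" and "learning" are each frequent on their own, but only the longer pattern "deep learning" is kept, which is what lets the mined labels preserve complete phrases rather than fragments.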

We compare our method with existing ones on the KP20k dataset (publications from the CS domain) and the KPTimes dataset (news articles). UCPhrase significantly outperforms prior unsupervised methods. Compared with off-the-shelf phrase tagging tools, UCPhrase also shows unique advantages, especially in its ability to generalize to specific domains without relying on manually curated labels or KBs. We provide comprehensive case studies comparing the different tagging methods, and report further findings in the discussion sections.

We aim to build UCPhrase as a practical tool for phrase tagging, though it is certainly far from perfect. Please feel free to try it on your own corpus and give us feedback if you have any ideas that can help build better phrase tagging tools!

Facts: UCPhrase is a joint work by researchers from the University of Illinois at Urbana-Champaign and the University of California San Diego.

Quick Start

Step 1: Download and unzip the data folder

wget "https://www.dropbox.com/s/1bv7dnjawykjsji/data.zip?dl=0" -O data.zip
unzip -n data.zip

Step 2: Install and compile dependencies

bash build.sh

Step 3: Run experiments

cd src
python exp.py --gpu 0 --dir_data ../data/devdata

Model checkpoints and output files will be stored under the generated "experiments" folder.

Citation

If you find the implementation useful, please consider citing the following paper:

Xiaotao Gu*, Zihan Wang*, Zhenyu Bi, Yu Meng, Liyuan Liu, Jiawei Han, Jingbo Shang, "UCPhrase: Unsupervised Context-aware Quality Phrase Tagging", in Proc. of 2021 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'21), Aug. 2021

Owner

Xiaotao Gu
