Explainable Zero-Shot Topic Extraction

Related tags

Deep LearningZeSTE
Overview

Zero-Shot Topic Extraction with Common-Sense Knowledge Graph

This repository contains the code for reproducing the results reported in the paper "Explainable Zero-Shot Topic Extraction with Common-Sense Knowledge Graph" (pdf) at the LDK 2021 Conference.

A user-friendly demo is available at: http://zeste.tools.eurecom.fr/

ZeSTE

Based on ConceptNet's common sense knowledge graph and embeddings, ZeSTE generates explainable predictions for a document topical category (e.g. politics, sports, video_games ..) without reliance on training data. The following is a high-level illustration of the approach:

API

ZeSTE can also be accessed via a RESTful API for easy deployment and use. For further information, please refer to the documentation: https://zeste.tools.eurecom.fr/doc

Dependencies

Before running any code in this repo, please install the following dependencies:

  • numpy
  • pandas
  • matplotlib
  • nltk
  • sklearn
  • tqdm
  • gensim

Code Overview

This repo is organized as follows:

  • generate_cache.py: this script processes the raw ConceptNet dump to produce cached files for each node in ConceptNet to accelerate the label neighborhood generation. It also transforms ConceptNet Numberbatch text file into a Gensim word embedding that we pickle for quick loading.
  • zeste.py: this is the main script for evaluation. It takes as argument the dataset to process as well as model configuration parameters such as neighborhood depth (see below). The results (classification report, confusion matrix, and classification metrics) are persisted into text files.
  • util.py: contains the functions that are used in zeste.py
  • label_mappings: contains the tab-separated mappings for the studied datasets.

Reproducing Results

1. Downloads

The two following files need to be downloaded to bypass the use of ConceptNet's web API: the dump of ConceptNet triplets, and the ConceptNet Numberbatch pre-computed word embeddings. You can download them from ConceptNet's and Numberbatch's repos, respectively.

# wget https://s3.amazonaws.com/conceptnet/downloads/2019/edges/conceptnet-assertions-5.7.0.csv.gz
# wget https://conceptnet.s3.amazonaws.com/downloads/2019/numberbatch/numberbatch-19.08.txt.gz
# gzip -d conceptnet-assertions-5.7.0.csv.gz
# gzip -d numberbatch-19.08.txt.gz

2. generate_cache.py

This script takes as input the two just-downloaded files and the cache path to where precomputed 1-hop label neighborhoods will be saved. This can take up to 7.2G of storage space.

usage: generate_cache.py [-h] [-cnp CONCEPTNET_ASSERTIONS_PATH] [-nbp CONCEPTNET_NUMBERBATCH_PATH] [-zcp ZESTE_CACHE_PATH]

Zero-Shot Topic Extraction

optional arguments:
  -h, --help            show this help message and exit
  -cnp CONCEPTNET_ASSERTIONS_PATH, --conceptnet_assertions_path CONCEPTNET_ASSERTIONS_PATH
                        Path to CSV file containing ConceptNet assertions dump
  -nbp CONCEPTNET_NUMBERBATCH_PATH, --conceptnet_numberbatch_path CONCEPTNET_NUMBERBATCH_PATH
                        Path to W2V file for ConceptNet Numberbatch
  -zcp ZESTE_CACHE_PATH, --zeste_cache_path ZESTE_CACHE_PATH
                        Path to the repository where the generated files will be saved

3. zeste.py

This script uses the precomputed 1-hop label neighborhoods to recursively generate label neighborhoods of any given depth (-d). It takes also as parameters the path to the dataset CSV file (which should have two columns: text and label). The rest of the arguments are for model experimentation.

usage: zeste.py [-h] [-cp CACHE_PATH] [-pp PREFETCH_PATH] [-nb NUMBERBATCH_PATH] [-dp DATASET_PATH] [-lm LABELS_MAPPING] [-rp RESULTS_PATH]
                [-d DEPTH] [-f FILTER] [-s {simple,compound,depth,harmonized}] [-ar ALLOWED_RELS]

Zero-Shot Topic Extraction

optional arguments:
  -h, --help            show this help message and exit
  -cp CACHE_PATH, --cache_path CACHE_PATH
                        Path to where the 1-hop word neighborhoods are cached
  -pp PREFETCH_PATH, --prefetch_path PREFETCH_PATH
                        Path to where the precomputed n-hop neighborhoods are cached
  -nb NUMBERBATCH_PATH, --numberbatch_path NUMBERBATCH_PATH
                        Path to the pickled Numberbatch
  -dp DATASET_PATH, --dataset_path DATASET_PATH
                        Path to the dataset to process
  -lm LABELS_MAPPING, --labels_mapping LABELS_MAPPING
                        Path to the mapping between the dataset labels and ZeSTE labels (multiword labels are comma-separated)
  -rp RESULTS_PATH, --results_path RESULTS_PATH
                        Path to the directory where to store the results
  -d DEPTH, --depth DEPTH
                        How many hops to generate the neighborhoods
  -f FILTER, --filter FILTER
                        Filtering method: top[N], top[P]%, thresh[T], all
  -s {simple,compound,depth,harmonized}, --similarity {simple,compound,depth,harmonized}
  -ar ALLOWED_RELS, --allowed_rels ALLOWED_RELS
                        Which relationships to use (comma-separated or all)

Cite this work

@InProceedings{harrando_et_al_zeste_2021,
  author ={Harrando, Ismail and Troncy, Rapha\"{e}l},
  title ={{Explainable Zero-Shot Topic Extraction Using a Common-Sense Knowledge Graph}},
  booktitle ={3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages ={17:1--17:15},
  year ={2021},
  volume ={93},
  publisher ={Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  URL ={https://drops.dagstuhl.de/opus/volltexte/2021/14553},
  URN ={urn:nbn:de:0030-drops-145532},
  doi ={10.4230/OASIcs.LDK.2021.17},
}
Owner
D2K Lab
Data to Knowledge Virtual Lab (LINKS Foundation - EURECOM)
D2K Lab
Open source annotation tool for machine learning practitioners.

doccano doccano is an open source text annotation tool for humans. It provides annotation features for text classification, sequence labeling and sequ

7.1k Jan 01, 2023
šŸ•µ Artificial Intelligence for social control of public administration

Non-tech crash course into OperaĆ§Ć£o Serenata de Amor Tech crash course into OperaĆ§Ć£o Serenata de Amor Contributing with code and tech skills Supportin

Open Knowledge Brasil - Rede pelo Conhecimento Livre 4.4k Dec 31, 2022
PyTorch implementation of a Real-ESRGAN model trained on custom dataset

Real-ESRGAN PyTorch implementation of a Real-ESRGAN model trained on custom dataset. This model shows better results on faces compared to the original

Sber AI 160 Jan 04, 2023
Restricted Boltzmann Machines in Python.

How to Use First, initialize an RBM with the desired number of visible and hidden units. rbm = RBM(num_visible = 6, num_hidden = 2) Next, train the m

Edwin Chen 928 Dec 30, 2022
Official PyTorch code for the paper: "Point-Based Modeling of Human Clothing" (ICCV 2021)

Point-Based Modeling of Human Clothing Paper | Project page | Video This is an official PyTorch code repository of the paper "Point-Based Modeling of

Visual Understanding Lab @ Samsung AI Center Moscow 64 Nov 22, 2022
Official code repository for the EMNLP 2021 paper

Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization PyTorch code for the EMNLP 2021 paper "Integrating Visuospatia

Adyasha Maharana 23 Dec 19, 2022
RODD: A Self-Supervised Approach for Robust Out-of-Distribution Detection

RODD Official Implementation of 2022 CVPRW Paper RODD: A Self-Supervised Approach for Robust Out-of-Distribution Detection Introduction: Recent studie

Umar Khalid 17 Oct 11, 2022
Cognate Detection Repository

Cognate Detection Repository Details This repository contains the data for two publications: Challenge Dataset of Cognates and False Friend Pairs from

Diptesh Kanojia 1 Apr 26, 2022
Official code for On Path Integration of Grid Cells: Group Representation and Isotropic Scaling (NeurIPS 2021)

On Path Integration of Grid Cells: Group Representation and Isotropic Scaling This repo contains the official implementation for the paper On Path Int

Ruiqi Gao 39 Nov 10, 2022
PASTRIE: A Corpus of Prepositions Annotated with Supersense Tags in Reddit International English

PASTRIE Official release of the corpus described in the paper: Michael Kranzlein, Emma Manning, Siyao Peng, Shira Wein, Aryaman Arora, and Nathan Schn

NERT @ Georgetown 4 Dec 02, 2021
Code for AA-RMVSNet: Adaptive Aggregation Recurrent Multi-view Stereo Network (ICCV 2021).

AA-RMVSNet Code for AA-RMVSNet: Adaptive Aggregation Recurrent Multi-view Stereo Network (ICCV 2021) in PyTorch. paper link: arXiv | CVF Change Log Ju

Qingtian Zhu 97 Dec 30, 2022
TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification

TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification [NeurIPS 2021] Abstract Multiple instance learn

132 Dec 30, 2022
Learning to Prompt for Continual Learning

Learning to Prompt for Continual Learning (L2P) Official Jax Implementation L2P is a novel continual learning technique which learns to dynamically pr

Google Research 207 Jan 06, 2023
Implementation of Continuous Sparsification, a method for pruning and ticket search in deep networks

Continuous Sparsification Implementation of Continuous Sparsification (CS), a method based on l_0 regularization to find sparse neural networks, propo

Pedro Savarese 23 Dec 07, 2022
Bayesian-Torch is a library of neural network layers and utilities extending the core of PyTorch to enable the user to perform stochastic variational inference in Bayesian deep neural networks

Bayesian-Torch is a library of neural network layers and utilities extending the core of PyTorch to enable the user to perform stochastic variational inference in Bayesian deep neural networks. Bayes

Intel Labs 210 Jan 04, 2023
Official pytorch implementation of paper "Image-to-image Translation via Hierarchical Style Disentanglement".

HiSD: Image-to-image Translation via Hierarchical Style Disentanglement Official pytorch implementation of paper "Image-to-image Translation

364 Dec 14, 2022
Simple tools for logging and visualizing, loading and training

TNT TNT is a library providing powerful dataloading, logging and visualization utilities for Python. It is closely integrated with PyTorch and is desi

1.5k Jan 02, 2023
Neural Dynamic Policies for End-to-End Sensorimotor Learning

This is a PyTorch based implementation for our NeurIPS 2020 paper on Neural Dynamic Policies for end-to-end sensorimotor learning.

Shikhar Bahl 47 Dec 11, 2022
Stochastic Tensor Optimization for Robot Motion - A GPU Robot Motion Toolkit

STORM Stochastic Tensor Optimization for Robot Motion - A GPU Robot Motion Toolkit [Install Instructions] [Paper] [Website] This package contains code

NVIDIA Research Projects 101 Dec 12, 2022
This is an official implementation of CvT: Introducing Convolutions to Vision Transformers.

Introduction This is an official implementation of CvT: Introducing Convolutions to Vision Transformers. We present a new architecture, named Convolut

Bin Xiao 175 Jan 08, 2023