Hyperbolic Hierarchical Clustering.

Last update: Dec 15, 2022

Related tags

Overview

Hyperbolic Hierarchical Clustering (HypHC)

This code is the official PyTorch implementation of the NeurIPS 2020 paper:

From Trees to Continuous Embeddings and Back: Hyperbolic Hierarchical Clustering
Ines Chami, Albert Gu, Vaggos Chatziafratis and Christopher Ré
Stanford University
Paper: https://arxiv.org/abs/2010.00402

Abstract. Similarity-based Hierarchical Clustering (HC) is a classical unsupervised machine learning algorithm that has traditionally been solved with heuristic algorithms like Average-Linkage. Recently, Dasgupta reframed HC as a discrete optimization problem by introducing a global cost function measuring the quality of a given tree. In this work, we provide the first continuous relaxation of Dasgupta's discrete optimization problem with provable quality guarantees. The key idea of our method, HypHC, is showing a direct correspondence from discrete trees to continuous representations (via the hyperbolic embeddings of their leaf nodes) and back (via a decoding algorithm that maps leaf embeddings to a dendrogram), allowing us to search the space of discrete binary trees with continuous optimization. Building on analogies between trees and hyperbolic space, we derive a continuous analogue for the notion of lowest common ancestor, which leads to a continuous relaxation of Dasgupta's discrete objective. We can show that after decoding, the global minimizer of our continuous relaxation yields a discrete tree with a (1+epsilon)-factor approximation for Dasgupta's optimal tree, where epsilon can be made arbitrarily small and controls optimization challenges. We experimentally evaluate HypHC on a variety of HC benchmarks and find that even approximate solutions found with gradient descent have superior clustering quality than agglomerative heuristics or other gradient based algorithms. Finally, we highlight the flexibility of HypHC using end-to-end training in a downstream classification task.

Installation

This code has been tested with python3.7. First, create a virtual environment (or conda environment) and install the dependencies:

python3 -m venv hyphc_env

source hyphc_env/bin/activate

pip install -r requirements.txt

Then install the mst and unionfind packages which are used to decode embeddings into trees and compute the discrete Dasgupta Cost efficiently:

cd mst; python setup.py build_ext --inplace

cd unionfind; python setup.py build_ext --inplace

Datasets

source download_data.sh

This will download the zoo, iris and glass datasets from the UCI machine learning repository. Please refer to the paper for the download links of the other datasets used in the paper.

Code Usage

Train script

To use the code, first set environment variables in each shell session:

source set_env.sh

To train the HypHC mode, use the train script:

python train.py
    optional arguments:
      -h, --help            show this help message and exit
      --seed SEED
      --epochs EPOCHS
      --batch_size BATCH_SIZE
      --learning_rate LEARNING_RATE
      --eval_every EVAL_EVERY
      --patience PATIENCE
      --optimizer OPTIMIZER
      --save SAVE
      --fast_decoding FAST_DECODING
      --num_samples NUM_SAMPLES
      --dtype DTYPE
      --rank RANK
      --temperature TEMPERATURE
      --init_size INIT_SIZE
      --anneal_every ANNEAL_EVERY
      --anneal_factor ANNEAL_FACTOR
      --max_scale MAX_SCALE
      --dataset DATASET

Examples

We provide examples of training commands for the zoo, iris and glass datasets. For instance, to train HypHC on zoo, run:

source examples/run_zoo.sh

This will create an embedding directory and save training logs, embeddings and the configuration parameters in a embedding/zoo/[unique_id] where the unique id is based on the configuration parameters used to train the model.

Citation

If you find this code useful, please cite the following paper:

@inproceedings{NEURIPS2020_ac10ec1a,
 author = {Chami, Ines and Gu, Albert and Chatziafratis, Vaggos and R\'{e}, Christopher},
 booktitle = {Advances in Neural Information Processing Systems},
 editor = {H. Larochelle and M. Ranzato and R. Hadsell and M. F. Balcan and H. Lin},
 pages = {15065--15076},
 publisher = {Curran Associates, Inc.},
 title = {From Trees to Continuous Embeddings and Back: Hyperbolic Hierarchical Clustering},
 url = {https://proceedings.neurips.cc/paper/2020/file/ac10ec1ace51b2d973cd87973a98d3ab-Paper.pdf},
 volume = {33},
 year = {2020}
}

Hyperbolic Hierarchical Clustering.

Related tags

Overview

Hyperbolic Hierarchical Clustering (HypHC)

Installation

Datasets

Code Usage

Train script

Examples

Citation

Owner

HazyResearch

PaRT: Parallel Learning for Robust and Transparent AI

All the code and files related to the MI-Lab of UE19CS305 course in sem 5

This is the code for our KILT leaderboard submission to the T-REx and zsRE tasks. It includes code for training a DPR model then continuing training with RAG.

PyTorch implementation of our ICCV 2019 paper: Liquid Warping GAN: A Unified Framework for Human Motion Imitation, Appearance Transfer and Novel View Synthesis

Implementation of the GVP-Transformer, which was used in the paper "Learning inverse folding from millions of predicted structures" for de novo protein design alongside Alphafold2

Generative Adversarial Networks(GANs)

Tightness-aware Evaluation Protocol for Scene Text Detection

Unofficial pytorch implementation of 'Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization'

Submission to Twitter's algorithmic bias bounty challenge

GuideDog is an AI/ML-based mobile app designed to assist the lives of the visually impaired, 100% voice-controlled

[CVPR 2020] Local Class-Specific and Global Image-Level Generative Adversarial Networks for Semantic-Guided Scene Generation

TransMVSNet: Global Context-aware Multi-view Stereo Network with Transformers.

Implementation of "A Deep Learning Loss Function based on Auditory Power Compression for Speech Enhancement" by pytorch

This repository contains the code and models necessary to replicate the results of paper: How to Robustify Black-Box ML Models? A Zeroth-Order Optimization Perspective

The PASS dataset: pretrained models and how to get the data - PASS: Pictures without humAns for Self-Supervised Pretraining

All the essential resources and template code needed to understand and practice data structures and algorithms in python with few small projects to demonstrate their practical application.

Semi-supervised Semantic Segmentation with Directional Context-aware Consistency (CVPR 2021)

DeepLab2: A TensorFlow Library for Deep Labeling

PyTorch implementation of Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation.

Real Time Object Detection and Classification using Yolo Algorithm.