Open-source Laplacian Eigenmaps for dimensionality reduction of large data in python.

Last update: Jul 09, 2022

Overview

Fast Laplacian Eigenmaps in python

Open-source Laplacian Eigenmaps for dimensionality reduction of large data in python. Comes with an wrapper for NMSlib to compute approximate-nearest-neighbors. Performs several times faster than the default scikit-learn implementation.

Installation

You'll need NMSlib for using this package properly. Installing it with no binaries is recommended if your CPU supports advanced instructions (it problably does):

pip3 install --no-binary :all: nmslib
# Along with requirements:
pip3 install numpy pandas scipy scikit-learn

Then you can install this package with pip:

pip3 install fastlapmap

Usage

See the following example with the handwritten digits data. Here, I visually compare results from the scikit-learn Laplacian Eigenmaps implementation to those from my implementation.

Note that this implementation contains two similarity-learning algorithms: anisotropic diffusion maps and fuzzy simplicial sets.

# Import libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import SpectralEmbedding
from fastlapmap import LapEigenmap

# Load some data
from sklearn.datasets import load_digits
digits = load_digits()
data = digits.data

# Define hyperparameters
N_EIGS=2
N_NEIGHBORS=10
N_JOBS=10

sk_se = SpectralEmbedding(n_components=N_EIGS, n_neighbors=N_NEIGHBORS, n_jobs=N_JOBS).fit_transform(data)

flapmap_diff = LapEigenmap(data, n_eigs=2, similarity='diffusion', norm_laplacian=True, k=N_NEIGHBORS, n_jobs=N_JOBS)
flapmap_fuzzy = LapEigenmap(data, n_eigs=2, similarity='fuzzy', norm_laplacian=True, k=N_NEIGHBORS, n_jobs=N_JOBS)

fig, (ax1, ax2, ax3) = plt.subplots(1, 3)
fig.suptitle('Handwritten digits data:', fontsize=24)
ax1.scatter(sk_se[:, 0], sk_se[:, 1], c=digits.target, cmap='Spectral', s=5)
ax1.set_title('Sklearn\'s Laplacian Eigenmaps', fontsize=20)
ax2.scatter(flapmap_diff[:, 0], flapmap_diff[:, 1], c=digits.target, cmap='Spectral', s=5)
ax2.set_title('Fast Laplacian Eigenmaps with diffusion harmonics', fontsize=20)
ax3.scatter(flapmap_fuzzy[:, 0], flapmap_fuzzy[:, 1], c=digits.target, cmap='Spectral', s=5)
ax3.set_title('Fast Laplacian Eigenmaps with fuzzy simplicial sets', fontsize=20)
plt.show()

As we can see, results are nearly identical.

Benchmark

See the runtime comparison between this implementation and scikit-learn:

## Load benchmark function:
from fastlapmap.benchmark import runtime_benchmark

# Load data
from sklearn.datasets import load_digits
digits = load_digits()
data = digits.data

# Define hyperparameters
N_EIGS = 2
N_NEIGHBORS = 10
N_JOBS = 10
SIZES = [1000, 5000, 10000, 25000, 50000, 100000]
N_RUNS = 3

runtime_benchmark(data,
                  n_eigs=N_EIGS,
                  n_neighbors=N_NEIGHBORS,
                  n_jobs=N_JOBS,
                  sizes=SIZES,
                  n_runs=N_RUNS)

As you can see, the diffusion harmoics model is the fastest, followed closely by fuzzy simplicial sets. Both outperform scikit-learn default implementation and escalate linearly with sample size.

Open-source Laplacian Eigenmaps for dimensionality reduction of large data in python.

Related tags

Overview

Fast Laplacian Eigenmaps in python

Installation

Usage

Benchmark

Owner

A tool to compare differences between dataframes and create a differences report in Excel

Gaussian processes in TensorFlow

Retentioneering: product analytics, data-driven customer journey map optimization, marketing analytics, web analytics, transaction analytics, graph visualization, and behavioral segmentation with customer segments in Python.

ped-crash-techvol: Texas Ped Crash Tech Volume Pack

The lastest all in one bombing tool coded in python uses tbomb api

MIR Cheatsheet - Survival Guidebook for MIR Researchers in the Lab

Catalogue data - A Python Scripts to prepare catalogue data

PySpark Structured Streaming ROS Kafka ApacheSpark Cassandra

Snakemake workflow for converting FASTQ files to self-contained CRAM files with maximum lossless compression.

PipeChain is a utility library for creating functional pipelines.

A lightweight, hub-and-spoke dashboard for multi-account Data Science projects

pipeline for migrating lichess data into postgresql

Pyspark project that able to do joins on the spark data frames.

Python data processing, analysis, visualization, and data operations

Transform-Invariant Non-Negative Matrix Factorization

Extract data from a wide range of Internet sources into a pandas DataFrame.

PyTorch implementation for NCL (Neighborhood-enrighed Contrastive Learning)

Hangar is version control for tensor data. Commit, branch, merge, revert, and collaborate in the data-defined software era.

A data parser for the internal syncing data format used by Fog of World.

The repo for mlbtradetrees.com. Analyze any trade in baseball history!