A library for building and serving multi-node distributed faiss indices.

Overview

About

Distributed faiss index service. A lightweight library that lets you work with FAISS indexes which don't fit into a single server memory. It follows a simple concept of a set of index server processes runing in a complete isolation from each other. All the coordination is done at the client side. This siplified many-vs-many client-to-server relationship architecture is flexible and is specifically designed for research projects vs more complicated solutions that aims mostly at production usage and transactionality support. The data is sharded over several indexes on different servers in RAM. The search client aggregates results from different servers during retrieval. The service is model-independent and operates with supplied embeddings and metadatas.

Features:

  • Multiple clients connect to all servers via RPC.
  • At indexing time: clients balance data across servers. The client sends the next available batch of embeddings to a server that is selected in a round-robin fashion.
  • The index client aggregates results from different servers during retrieval. It queries all the servers and uses a heap to find final results.
  • The API allows to send and store any additional metadata (e.g. raw bpe, language information, etc).
  • Launch servers with submitit.
  • Save/load the index/metadata periodically. Can restore from a stopped index state.
  • Supports several indexes at the same time (e.g. one index per language, or different versions of the same index).
  • The API is trying to optimize for network bandwidth.
  • Flexible index configuration.

Installation

pip install -e .

Testing

python -m unittest discover tests

or

pip install pytest
pytest tests

Code formatting

black --line-length 100 .

Usage

Starting the index servers

distributed-faiss consist of server and client parts which are supposed to be launched as separate services. The set of server processes can be launched either by using its API or the provided lauch tool that uses submitit library that works on clusters with SLURM cluster management and job scheduling system

Launching servers with submitit on SLURM managed clusters

Example:

python scripts/server_launcher.py \
    --log-dir /logs/distr-faiss/ \
    --discovery-config /tmp/discover_config.txt \
    --save-dir $HOME/dfaiss_data \
    --num-servers 64 \
    --num-servers-per-node 32 \
    --timeout-min 4320 \
    --mem-gb 400 \
    --base-port 12033 \
    --partition dev &

Clients can now read /tmp/discover_config.txt to discover servers.

Will launch a job running 64 servers in the background. To view logs (which are verbose but informative) run something like: watch 'tail /logs/distr-faiss/34785924_0_log.err' where the 34785924 will be the slurm job id you are allocated.

Launching servers using API

You can run each index server process indepentently using the following API:

server = IndexServer(global_rank, index_storage_dir)
server.start_blocking(port, load_index=True)

The rank of the server node is needed for reading/writing its own part of the index from/to files. Index are dumped to files for persistent storage. The filesytem path convetion is that there is a shared folder for the entire logical index with each server node working on its own sub-folder inside it. index_storage_dir is the default parameter to store indexes. Can be overrided for each logic index by specifing this attribute in the index configuration object (see client code examples below) When you start a server node on a specific machine and port, you need to write the host, port line to a specific file which can later be used to start a client.

Client API

Each client process is supposed to work with all the server nodes and does all the data balancing among them. Client processes can be run independently of each other and work with the same set of server nodes simulateously.

index_client = IndexClient(discovery_config)

discovery_config is the path to the shared FS file which was used to start the set of servers and contains all (host, port) info to connect to all of them.

Creating an index

Each client & server nodes can work with multiple logical indexes (consider them as fully separate tables in an SQL database). Each logical index can have its own faiss-related configuration, FS location and other parameters which affect its creation logic. Example of creating a simle IVF index:

index_client = IndexClient(discovery_config)
idx_cfg = IndexCfg(
    index_builder_type='ivf_simple',
    dim=128,
    train_num=10000,
    centroids=64,
    metric='dot',
    nprobe=12,
    index_storage_dir='path/to/your/index',
)
index_id = 'your logic index str id'
index_client.create_index(index_id, idx_cfg)

Index configuration

IndexCfg has multiple attributes to set the FAISS index type. List of values for index_builder_type attribute:

  • flat,
  • ivf_simple,
  • knnlm, corresponds to IndexIVFPQ,
  • hnswsq, corresponds to IndexHNSWSQ,
  • ivfsq, corresponds to IndexIVFScalarQuantizer,
  • ivf_gpu is a gpu version of IVF.

Alternatively, if index_builder_type is not specified, one can set faiss_factory just like in FAISS API factory call faiss.index_factory(...)

The following attributes defined the way the index is created:

  • train_num - if specified, sets the number of samples are used for the index training.
  • train_ratio - the same as train_num but as a ratio of total data size.

Data sent for indexing will be aggregated in memory until train_num threshold is exceeded. Please refer to the diagram below about the server and client side interactions and steps.

Client side operations

Once the index has been created, one can send batches of numpy arrays coupled with arbitrarily metadata (should be piackable)

index.add_index_data(index_id, vector_chunk, list_of_metadata)

The index training and creation are done asynchronously with the add() operation the index processing may take a lot of time after all the data are sent. In order to check if all server nodes have finished index building, it is recommended to use the following snippet:

while index.get_state(self.index_id) != IndexState.TRAINED:
    time.sleep(some_time)

Once the index is ready, one can query it:

scores, meta = index.search(query, topk=10, index_id, return_embeddings=False)

query is a query vector batch as a numpy array. return_embeddings enables to return the search result vectors in addition to metadata. If it is set to true, the result tuple will return vectors as the 3-rd element.

Loading Data

The following two commands load a medium sized mmap into distributed-faiss in about 1 minute:

First launch 64 servers in the background

python scripts/server_launcher.py \
    --log-dir /logs/distr-faiss/ \
    --discovery-config /tmp/discover_config.txt \
    --save-dir $HOME/dfaiss_data \
    --num-servers 64 \
    --num-servers-per-node 32 \
    --timeout-min 4320 \
    --mem-gb 400 \
    --base-port 12033 \
    --partition dev &

Once you receive your allocation, load in the data with

python scripts/load_data.py \
    --discover /tmp/discover_config.txt \
    --mmap $HOME/dfaiss_data/random_1000000000_768_fp16.mmap \
    --mmap-size 1000000000 \
    --dimension 768 \
    --dstore-fp16 \
    --cfg scripts/idx_cfg.json \
    --dstore-fp16

modify scripts/load_data.py to load other data formats.

Reference

Reference to cite when using distributed-faiss in a research paper:

@article{DBLP:journals/corr/abs-2112-09924,
  author    = {Aleksandra Piktus and
               Fabio Petroni and
               Vladimir Karpukhin and
               Dmytro Okhonko and
               Samuel Broscheit and
               Gautier Izacard and
               Patrick Lewis and
               Barlas Oguz and
               Edouard Grave and
               Wen{-}tau Yih and
               Sebastian Riedel},
  title     = {The Web Is Your Oyster - Knowledge-Intensive {NLP} against a Very
               Large Web Corpus},
  journal   = {CoRR},
  volume    = {abs/2112.09924},
  year      = {2021},
  url       = {https://arxiv.org/abs/2112.09924},
  eprinttype = {arXiv},
  eprint    = {2112.09924},
  timestamp = {Tue, 04 Jan 2022 15:59:27 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2112-09924.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

You can access the paper here.

License

distributed-faiss is released under the CC-BY-NC 4.0 license. See the LICENSE file for details.

Owner
Meta Research
Meta Research
Differentiable Optimizers with Perturbations in Pytorch

Differentiable Optimizers with Perturbations in PyTorch This contains a PyTorch implementation of Differentiable Optimizers with Perturbations in Tens

Jake Tuero 54 Jun 22, 2022
PECOS - Prediction for Enormous and Correlated Spaces

PECOS - Predictions for Enormous and Correlated Output Spaces PECOS is a versatile and modular machine learning (ML) framework for fast learning and i

Amazon 387 Jan 04, 2023
Use tensorflow to implement a Deep Neural Network for real time lane detection

LaneNet-Lane-Detection Use tensorflow to implement a Deep Neural Network for real time lane detection mainly based on the IEEE IV conference paper "To

MaybeShewill-CV 1.9k Jan 08, 2023
Vehicle detection using machine learning and computer vision techniques for Udacity's Self-Driving Car Engineer Nanodegree.

Vehicle Detection Video demo Overview Vehicle detection using these machine learning and computer vision techniques. Linear SVM HOG(Histogram of Orien

hata 1.1k Dec 18, 2022
Python Library for Signal/Image Data Analysis with Transport Methods

PyTransKit Python Transport Based Signal Processing Toolkit Website and documentation: https://pytranskit.readthedocs.io/ Installation The library cou

24 Dec 23, 2022
Data loaders and abstractions for text and NLP

torchtext This repository consists of: torchtext.datasets: The raw text iterators for common NLP datasets torchtext.data: Some basic NLP building bloc

3.2k Jan 08, 2023
CLOOB training (JAX) and inference (JAX and PyTorch)

cloob-training Pretrained models There are two pretrained CLOOB models in this repo at the moment, a 16 epoch and a 32 epoch ViT-B/16 checkpoint train

Katherine Crowson 64 Nov 27, 2022
Rotated Box Is Back : Accurate Box Proposal Network for Scene Text Detection

Rotated Box Is Back : Accurate Box Proposal Network for Scene Text Detection This material is supplementray code for paper accepted in ICDAR 2021 We h

NCSOFT 30 Dec 21, 2022
ObjectDetNet is an easy, flexible, open-source object detection framework

Getting started with the ObjectDetNet ObjectDetNet is an easy, flexible, open-source object detection framework which allows you to easily train, resu

5 Aug 25, 2020
Assessing the Influence of Models on the Performance of Reinforcement Learning Algorithms applied on Continuous Control Tasks

Assessing the Influence of Models on the Performance of Reinforcement Learning Algorithms applied on Continuous Control Tasks This is the master thesi

Giacomo Arcieri 1 Mar 21, 2022
My solution for the 7th place / 245 in the Umoja Hack 2022 challenge

Umoja Hack 2022 : Insurance Claim Challenge My solution for the 7th place / 245 in the Umoja Hack 2022 challenge Umoja Hack Africa is a yearly hackath

Souames Annis 17 Jun 03, 2022
naked is a Python tool which allows you to strip a model and only keep what matters for making predictions.

naked is a Python tool which allows you to strip a model and only keep what matters for making predictions. The result is a pure Python function with no third-party dependencies that you can simply c

Max Halford 24 Dec 20, 2022
《Deep Single Portrait Image Relighting》(ICCV 2019)

Ratio Image Based Rendering for Deep Single-Image Portrait Relighting [Project Page] This is part of the Deep Portrait Relighting project. If you find

62 Dec 21, 2022
Deep Reinforced Attention Regression for Partial Sketch Based Image Retrieval.

DARP-SBIR Intro This repository contains the source code implementation for ICDM submission paper Deep Reinforced Attention Regression for Partial Ske

2 Jan 09, 2022
DETReg: Unsupervised Pretraining with Region Priors for Object Detection

DETReg: Unsupervised Pretraining with Region Priors for Object Detection Amir Bar, Xin Wang, Vadim Kantorov, Colorado J Reed, Roei Herzig, Gal Chechik

Amir Bar 283 Dec 27, 2022
Testbed of AI Systems Quality Management

qunomon Description A testbed for testing and managing AI system qualities. Demo Sorry. Not deployment public server at alpha version. Requirement Ins

AIST AIRC 15 Nov 27, 2021
DeepMind Alchemy task environment: a meta-reinforcement learning benchmark

The DeepMind Alchemy environment is a meta-reinforcement learning benchmark that presents tasks sampled from a task distribution with deep underlying structure.

DeepMind 188 Dec 25, 2022
Learning Energy-Based Models by Diffusion Recovery Likelihood

Learning Energy-Based Models by Diffusion Recovery Likelihood Ruiqi Gao, Yang Song, Ben Poole, Ying Nian Wu, Diederik P. Kingma Paper: https://arxiv.o

Ruiqi Gao 41 Nov 22, 2022
The 2nd Version Of Slothybot

SlothyBot Go to this website: "https://bitly.com/SlothyBot" The 2nd Version Of Slothybot. The Bot Has Many Features, Such As: Moderation Commands; Kic

Slothy 0 Jun 01, 2022
This repository contains the source code and data for reproducing results of Deep Continuous Clustering paper

Deep Continuous Clustering Introduction This is a Pytorch implementation of the DCC algorithms presented in the following paper (paper): Sohil Atul Sh

Sohil Shah 197 Nov 29, 2022