An Indexer that works out-of-the-box when you have fewer than 100K stored Documents

Last update: Mar 15, 2022

Overview

U100KIndexer

An Indexer that works out-of-the-box when you have fewer than 100K stored Documents. U100K means "under 100K". At 100K stored Documents with 768-dim embeddings, you can expect about 300 ms for a single query, or 20~120 QPS for batch queries. Results are full Documents.

U100KIndexer uses jina.DocumentArrayMemmap as the storage backend and .match() to conduct nearest-neighbour search. It returns the full Documents as-is, so there is no need to pair it with another key-value indexer to retrieve Documents.
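To make the idea of exhaustive nearest-neighbour search concrete, here is a minimal NumPy sketch. It is not the Indexer's actual code and does not call .match(); the function name `exhaustive_match`, the choice of cosine distance, and the random data are all assumptions made for illustration.

```python
import numpy as np


def exhaustive_match(queries, stored, top_k=3):
    """Return (indices, distances) of the top_k nearest stored rows per query.

    Hypothetical helper: cosine distance is an illustrative choice, and
    every stored row is scored against every query (no approximation).
    """
    # Normalise rows so that a dot product equals cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    s = stored / np.linalg.norm(stored, axis=1, keepdims=True)
    dist = 1.0 - q @ s.T                       # shape: (n_queries, n_stored)
    idx = np.argsort(dist, axis=1)[:, :top_k]  # full sort, kept simple
    return idx, np.take_along_axis(dist, idx, axis=1)


rng = np.random.default_rng(0)
stored = rng.standard_normal((1000, 768)).astype(np.float32)
# Queries are slightly perturbed copies of stored rows 10 and 20.
noise = 0.01 * rng.standard_normal((2, 768)).astype(np.float32)
queries = stored[[10, 20]] + noise

idx, dist = exhaustive_match(queries, stored)
print(idx[:, 0])  # nearest stored row for each query: rows 10 and 20
```

Because every stored row is compared with every query, the true nearest neighbours are always found, but the work per query grows with the number of stored rows.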

Pros & cons

Pros

Exhaustive search: highest recall
Fast indexing
Acceptable query performance under 100K
Always returns full Documents
No extra dependencies

Cons

Slow query time once you go well beyond 100K stored Documents, since every query is an exhaustive search

Performance

The indexing and query performance on 768-dim embeddings is as follows (all times in seconds):

Stored data	Indexing time	Query size=1	Query size=8	Query size=64
10000	0.256	0.019	0.029	0.086
50000	1.156	0.147	0.177	0.314
100000	2.329	0.297	0.332	0.536
200000	4.704	0.656	0.744	1.050
400000	11.105	1.289	1.536	2.793

The benchmark script can be found in benchmark.py.
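The QPS range quoted in the overview can be derived from the table by dividing the query batch size by the time per batch. The short sketch below does that arithmetic; the names `QUERY_SECONDS` and `queries_per_second` are made up for this example, and the numbers are copied from the table rather than re-measured.

```python
# Query times in seconds, copied from the benchmark table above:
# {stored Documents: {query batch size: seconds per batch}}
QUERY_SECONDS = {
    10_000: {1: 0.019, 8: 0.029, 64: 0.086},
    50_000: {1: 0.147, 8: 0.177, 64: 0.314},
    100_000: {1: 0.297, 8: 0.332, 64: 0.536},
    200_000: {1: 0.656, 8: 0.744, 64: 1.050},
    400_000: {1: 1.289, 8: 1.536, 64: 2.793},
}


def queries_per_second(stored, batch_size):
    """Throughput implied by one table cell: queries in batch / seconds."""
    return batch_size / QUERY_SECONDS[stored][batch_size]


for batch_size in (1, 8, 64):
    qps = queries_per_second(100_000, batch_size)
    print(f"batch size {batch_size}: {qps:.1f} QPS")
```

At 100K stored Documents this gives roughly 3 QPS for single queries, 24 QPS for batches of 8, and 119 QPS for batches of 64, which matches the 20~120 QPS quoted for batch queries.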

Tips

To change the workspace, pass metas when you create the Indexer:

U100KIndexer(metas={'workspace': './my'})

Or use .add(..., uses_metas={'workspace': './my'}) when you run it in a Flow.

Owner

Jina AI
