Super Simple Similarities Service

Last update: Dec 25, 2022

simsity

Simsity is a Super Simple Similarities Service[tm].
It's all about building a neighborhood. Literally!

This repository contains simple tools for similarity retrieval. It provides a convenient wrapper around encoding strategies and nearest neighbor approaches. Typical use cases include early-stage bulk labelling and duplicate discovery.

Install

You can install simsity via pip.

python -m pip install simsity

Quickstart

This is the basic setup for this package.

from simsity.service import Service
from simsity.datasets import fetch_clinc
from simsity.indexer import PyNNDescentIndexer
from simsity.preprocessing import Identity, ColumnLister

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer

# The encoder defines how we encode the data going in.
encoder = make_pipeline(
    ColumnLister(column="text"),
    CountVectorizer()
)

# The indexer handles the nearest neighbor lookup.
indexer = PyNNDescentIndexer(metric="euclidean", n_neighbors=2)

# The service combines the two into a single object.
service_clinc = Service(
    encoder=encoder,
    indexer=indexer,
)

# We can now train the service using this data.
df_clinc = fetch_clinc()

# Important for later: we're only passing the 'text' column to encode
service_clinc.train_from_dataf(df_clinc, features=["text"])

# Query the datapoints
# Note that the keyword argument here refers to the 'text' column
service_clinc.query(text="give me directions", n_neighbors=20)
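To make the split between encoder and indexer concrete, here is a minimal sketch that does not use simsity at all. It assumes a plain word-count encoding and an exact (brute-force) Euclidean search; the function names `encode`, `build_index` and `query` are made up for illustration and are not part of the simsity API.

```python
import math
from collections import Counter


def encode(text, vocabulary):
    """Encoder step: turn a text into word counts, one entry per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_index(texts):
    """'Training': learn the vocabulary and store one vector per text."""
    vocabulary = sorted({word for text in texts for word in text.lower().split()})
    vectors = [encode(text, vocabulary) for text in texts]
    return vocabulary, vectors


def query(text, texts, vocabulary, vectors, n_neighbors=2):
    """Indexer step: exact nearest-neighbor lookup by scanning every vector."""
    target = encode(text, vocabulary)
    scored = sorted(
        (euclidean(target, vector), item) for vector, item in zip(vectors, texts)
    )
    return [(item, dist) for dist, item in scored[:n_neighbors]]


texts = [
    "give me directions to the airport",
    "what is my account balance",
    "directions to the nearest bank",
]
vocabulary, vectors = build_index(texts)
results = query("give me directions", texts, vocabulary, vectors, n_neighbors=2)
for item, dist in results:
    print(f"{dist:.2f}  {item}")
```

An approximate indexer trades the full scan in `query` for a faster lookup that may occasionally miss a true neighbor.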

If you'd like, you can also save the service to disk and load it again.

# Save the entire system
service_clinc.save("/tmp/simple-model")

# You can also load the model now.
reloaded = Service.load("/tmp/simple-model")

You could even run it as a webservice if you were so inclined.

reloaded.serve(host='0.0.0.0', port=8080)

You can now POST to http://0.0.0.0:8080/query with payload:

{"query": {"text": "hello there"}, "n_neighbors": 20}

Note that the query content here refers to the "text" column once again.
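As a sketch of what such a request could look like from Python, the snippet below builds the POST with the standard library only. The `Content-Type` header and the assumption that the response body is JSON are guesses, not documented behaviour; `build_query_request` and `send` are hypothetical helper names.

```python
import json
import urllib.request


def build_query_request(text, n_neighbors=20, base_url="http://0.0.0.0:8080"):
    """Build (but do not send) a POST request carrying the payload shown above."""
    payload = {"query": {"text": text}, "n_neighbors": n_neighbors}
    return urllib.request.Request(
        url=f"{base_url}/query",
        data=json.dumps(payload).encode("utf-8"),
        # Assumption: the service expects a JSON content type.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=5):
    """Send the request; assumes the service answers with a JSON body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_query_request("hello there")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# send(request) performs the actual call and needs the service to be running.
```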

Examples

Check the examples folder for some interesting use-cases and tool integrations.

In particular:

benchmark.ipynb shows how you might benchmark simsity
votes-example.ipynb shows how to label similar data using pigeon and simsity
text-widget-example.ipynb shows how to add interactivity with ipywidgets

Comments

Add support for pretrained encoders and transformed data

First of all, this project looks great! I've taken an initial stab at #12 and also tried to add support for querying data that has already been transformed. If you have data that you've already transformed (e.g. a UMAP embedding), you probably don't want to rerun encoder.transform again. In this case you want to index the transformed data and query it directly.

This is just a first crack so happy to incorporate any feedback you might have!

opened by gclen 10

embetter: better embeddings

This is conceptual work in progress. The maintainer is actively researching this, so please do not work on it.

Problem Statement

When you submit "where is my phoone" and retrieve similar items, you may get results like:

where is my phone

where is my credit card

Depending on your task, either the "where is" part of the sentence is more important or the "phone" part is more important. The encoder, however, may be very brittle when it comes to spelling errors. To put it more generally:

The similarity in an embedded space in our case is very much "general". I'm using "general" here, as opposed to "specific" to indicate that these similarities have been constructed without having a task in mind.
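The brittleness around spelling errors can be illustrated with a small sketch. It compares two simple encodings under Jaccard similarity: whole words versus character trigrams. Both encodings are stand-ins chosen for illustration; they are not what simsity or any particular encoder does internally.

```python
def word_tokens(text):
    """Encode a text as its set of lowercase words."""
    return set(text.lower().split())


def char_ngrams(text, n=3):
    """Encode a text as its set of overlapping character n-grams."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


query = "where is my phoone"  # the misspelling is deliberate
candidates = ["where is my phone", "where is my credit card"]

scores = {}
for candidate in candidates:
    scores[candidate] = {
        "words": jaccard(word_tokens(query), word_tokens(candidate)),
        "char_trigrams": jaccard(char_ngrams(query), char_ngrams(candidate)),
    }
    print(candidate, scores[candidate])
```

With whole words, the misspelled token matches nothing, so only the shared "where is my" prefix tells the two candidates apart. Character trigrams still recover most of "phone". Neither encoding knows which part of the sentence matters for the task at hand.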

Similar Issue

Suppose that we are deduplicating and we have a zipcode, city, first-, and last-name. How would our encoding be able to understand that having the same city is not a strong signal while having the first name certainly is? Can we really expect a standard encoding to understand this? Without labels ... I think not.
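A rough sketch of the deduplication concern: if per-field weights were available, a match on first name could count for more than a match on city. The weights below are hand-picked purely for illustration; in practice they are exactly the kind of task-specific knowledge that would have to come from labels. The function `weighted_match`, the field names and the records are all hypothetical.

```python
def weighted_match(record_a, record_b, weights):
    """Share of the total weight carried by fields on which both records agree."""
    total = sum(weights.values())
    matched = sum(
        weight
        for field, weight in weights.items()
        if record_a.get(field) == record_b.get(field)
    )
    return matched / total


# Assumption: made-up weights that a labelled dataset might suggest.
weights = {"zipcode": 1.0, "city": 0.5, "first_name": 3.0, "last_name": 3.0}

person = {"zipcode": "1011", "city": "Amsterdam", "first_name": "alice", "last_name": "smith"}
same_city = {"zipcode": "1012", "city": "Amsterdam", "first_name": "bob", "last_name": "jones"}
same_name = {"zipcode": "3011", "city": "Rotterdam", "first_name": "alice", "last_name": "smith"}

print(weighted_match(person, same_city, weights))
print(weighted_match(person, same_name, weights))
```

An encoder that treats every field equally would score both pairs the same way whenever they agree on the same number of fields.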

opened by koaning 3

Add `Identity` as default encoder for Service.

As mentioned in https://github.com/koaning/simsity/pull/13:

I think the refit parameter should go in the Service() call. I think there should also be a parameter somewhere to avoid calling .transform() if the data has already been transformed. Do you think it is worth adding an additional parameter to Service() and keeping the indexed_from_transformed_data method?

It's a fair remark. I think preventing a transform() is fair, but the solution would be to have an Identity() transformer that just keeps the data as-is. This would also make a great default value for the encoder.

Made this issue to track progress and to discuss the approach.
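A pass-through transformer of the kind described here could look roughly like the sketch below. It follows the scikit-learn fit/transform convention by duck typing and is not the code that ships in simsity.preprocessing.

```python
class Identity:
    """Pass-through transformer: returns the data exactly as it was given."""

    def fit(self, X, y=None):
        # Nothing to learn, so fitting is a no-op.
        return self

    def transform(self, X):
        # Already-transformed data (e.g. precomputed embeddings) is kept as-is.
        return X

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


# Hypothetical precomputed embeddings that should be indexed directly.
embedded = [[0.1, 0.2], [0.3, 0.4]]
unchanged = Identity().fit_transform(embedded)
print(unchanged is embedded)
```

Using such an object as the encoder means the indexer sees the input vectors untouched, which is the behaviour asked for in the pull request discussion.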

opened by koaning 2

Codecalm tutorial on simsity

Hi Vincent. Since I discovered you my barrier towards Python has eroded! Thank you. I'm a Data Scientist who wants to check if simsity can help with retrieving similar regions based on environmental variables.

opened by FrancyJGLisboa 2

Update indexer

Hi! Are there any plans to add support for updating the indexer, i.e. adding new documents without retraining the entire pipeline? That would be a very useful feature.

from simsity.service import Service

service = Service(
    indexer=indexer,
    encoder=encoder
)

service.train_from_dataf(df, features=["text"])

....

service.update(new_docs, features=["text"])  # <- this
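To illustrate what an incremental update means at the index level, here is a sketch with an exact, brute-force index, where adding documents is just appending vectors. The class `BruteForceIndex` and its methods are hypothetical and not part of simsity. Whether an approximate indexer can be updated in place depends on the underlying library.

```python
import math


class BruteForceIndex:
    """Exact nearest-neighbor index that supports adding items after creation."""

    def __init__(self):
        self.vectors = []
        self.items = []

    def add(self, vectors, items):
        """Append new vectors and their items; nothing needs to be rebuilt."""
        self.vectors.extend(vectors)
        self.items.extend(items)

    def query(self, vector, n_neighbors=1):
        """Scan all stored vectors and return the closest items with distances."""
        scored = sorted(
            (math.dist(vector, stored), position)
            for position, stored in enumerate(self.vectors)
        )
        return [(self.items[position], dist) for dist, position in scored[:n_neighbors]]


index = BruteForceIndex()
index.add([[0.0, 0.0], [1.0, 1.0]], ["a", "b"])
index.add([[5.0, 5.0]], ["c"])  # later update, no retraining
nearest = index.query([4.5, 5.0], n_neighbors=1)
print(nearest)
```

This only covers the index. If the encoder itself is fitted on the data (a vocabulary, for example), new documents with unseen tokens may still call for refitting.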

opened by nthomsencph 1

New API

I think the original design was flawed and this project should stick to the scikit-learn API more.

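from sklearn.pipeline import make_pipeline, make_union

from simsity.preprocessing import Grab
from simsity.service import Service
from simsity.indexer import (AnnoyIndexer, PynnDescentIndexed, NMSlibIndexer,
                             PineconeIndexer, QdrantIndexer, WeviateIndexer)

# `SentenceEncoder`, `indexer` and `X` are placeholders; the proposal does not define them.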


encoder = make_pipeline(
    make_union(
        make_pipeline(Grab("text"), SentenceEncoder()),
        make_pipeline(Grab("title"), SentenceEncoder())
    )
)

service = Service(encoder, indexer, batch_size=50)
service.index(X)
items, dists = service.query(X, n=10)

opened by koaning 0

Education Day Goals

[x] add typing + type checker

[x] add tests for the minhash tools

[ ] collect more useful datasets

[x] automate the benchmarking

[x] write getting started guides

[ ] record a quick demo for colleagues

[ ] add github actions stash

opened by koaning 0

added-components

Adding the MinHash components. This is also an amazing opportunity to:

[ ] add types and a type checker

[ ] add some standard tests for indexers

[ ] add a script to run some benchmarks on the clinc dataset

opened by koaning 0

Releases (0.1.1)

0.1.1 (Nov 4, 2021)

Thanks to @gclen you can now re-use scikit-learn pipelines without refitting them internally.

Owner

vincent d warmerdam

Solving problems involving data. Mostly NLP these days. AskMeAnything[tm].

GitHub Repository https://koaning.github.io/simsity/

Deep Image Search - AI-Based Image Search Engine

Deep Image Search is an AI-based image search engine that includes deep transfer learning features Extraction and tree-based vectorized search technique.

144 Jan 05, 2023

Super Simple Similarities Service

95 Dec 25, 2022

A library for fast import of Windows NT Registry(REGF) into Elasticsearch.

3 Apr 01, 2022

Pythonic search engine based on PyLucene.

Lupyne is a search engine based on PyLucene, the Python extension for accessing Java Lucene. Lucene is a relatively low-level toolkit, and PyLucene wr

83 Jan 02, 2023

Python script for finding duplicate images within a folder.

194 Dec 31, 2022

Simple algorithm search engine like google in python using function

Mini-Search-Engine-Like-Google I have created the simple algorithm search engine like google in python using function. I am matching every word with w

5 Sep 24, 2021

solrpy is a Python client for Solr

solrpy solrpy is a Python client for Solr, an enterprise search server built on top of Lucene. solrpy allows you to add documents to a Solr instance,

37 Jul 22, 2021

a Telegram bot written in Python for searching files in Drive. Based on SearchX-bot

Drive Search Bot This is a Telegram bot written in Python for searching files in Drive. Based on SearchX-bot How to deploy? Clone this repo: git clone

25 Dec 09, 2022

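A command-line search tool for the PwnWiki database; it works somewhat like the searchsploit command, except that it searches PwnWiki entries instead of the Exploit Database

PWSearch A command-line search tool for the PwnWiki database. It works somewhat like the searchsploit command, except that it searches PwnWiki entries instead of the Exploit Database.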

72 Dec 20, 2022

Google Search Engine Results Pages (SERP) in locally, no API key, no signup required

Local SERP Google Search Engine Results Pages (SERP) in locally, no API key, no signup required Make sure the chromedriver and required package are in

4 Jun 29, 2021

A Python web searcher library with different search engines

Robert A simple Python web searcher library with different search engines. Install pip install roberthelper Usage from robert import GoogleSearcher

1 Dec 23, 2021

A sphinx extension for designing beautiful, screen-size responsive web components.

sphinx-design A sphinx extension for designing beautiful, view size responsive web components. Created with inspiration from Bootstrap (v5), Material

109 Jan 01, 2023

esguard provides a Python decorator that waits for processing while monitoring the load of Elasticsearch.

esguard esguard provides a Python decorator that waits for processing while monitoring the load of Elasticsearch. Quick Start You need to launch elast

5 Dec 08, 2021

An image inline search telegram bot.

Image-Search-Bot An image inline search telegram bot. Note: Use Telegram picture bot. That is better. Not recommending to deploy this bot. Made with P

24 Oct 21, 2022

Full text search for flask.

flask-msearch Installation To install flask-msearch: pip install flask-msearch # when MSEARCH_BACKEND = "whoosh" pip install whoosh blinker # when MSE

197 Dec 29, 2022

Searches for MAC addresses in a text file of a Cisco "show IP arp" in any address format

show-ip-arp-mac-lookup Searches for MAC addresses in a text file of a Cisco "show IP arp" in any address format What it does: Takes a text file with t

0 Dec 24, 2022

An open source, non-profit search engine implemented in python

Mwmbl: No ads, no tracking, no cruft, no profit Mwmbl is a non-profit, ad-free, free-libre and free-lunch search engine with a focus on useability and

639 Jan 04, 2023

document organizer with tags and full-text-search, in a simple and clean sqlite3 schema

152 Oct 29, 2022

Senginta is All in one Search Engine Scrapper for used by API or Python Module. It's Free!

Senginta is All in one Search Engine Scrapper. With traditional scrapping, Senginta can be powerful to get result from any Search Engine, and convert to Json. Now support only for Google Product Sear

33 Nov 21, 2022

Home for Elasticsearch examples available to everyone. It's a great way to get started.

Introduction This is a collection of examples to help you get familiar with the Elastic Stack. Each example folder includes a README with detailed ins

2.5k Jan 03, 2023
