Simple Similarities Service

Overview

simsity

Simsity is a Super Simple Similarities Service[tm].
It's all about building a neighborhood. Literally!

This repository contains simple tools to help in similarity retreival scenarios by making a convient wrapper around encoding strategies as well as nearest neighbor approaches. Typical usecases include early stage bulk labelling and duplication discovery.

Warning

Alpha software. Expect things to break. Do not use in production.

Quickstart

This is the basic setup for this package.

import pandas as pd

from simsity.service import Service
from simsity.indexer import PyNNDescentIndexer
from simsity.preprocessing import Identity, ColumnLister


# The Indexer handles the nearest neighbor search
# The Encoder handles the encoding of the datapoints
service = Service(
    indexer=PyNNDescentIndexer(metric="euclidean"),
    encoder=CountVectorizer()
)

# The encoder defines how we encode the data going in.
encoder = make_pipeline(
    ColumnLister(column="text"),
    CountVectorizer()
)

# The indexer handles the nearest neighbor lookup.
indexer = PyNNDescentIndexer(metric="euclidean", n_neighbors=2)

# The service combines the two into a single object.
service_clinc = Service(
    encoder=encoder,
    indexer=indexer,
)

# We can now train the service.
df_clinc = pd.read_csv("tests/data/clinc-data.csv")
service_clinc.train_from_dataf(df_clinc, features=["text"])

# Query the datapoints
service.query("give me directions", n_neighbors=20)

# Save the entire system
service.save("/tmp/simple-model")

# You can also load the model now.
reloaded = Service.load("/tmp/simple-model")

# We can also host it as a web service
reloaded.serve(host='0.0.0.0', port=8080)

# You can now POST to http://0.0.0.0:8080/query with payload:
# {"query": {"text": "hello there"}, "n_neighbors": 20}
Comments
  • Add support for pretrained encoders and transformed data

    Add support for pretrained encoders and transformed data

    First of all this project looks great! I've taken an initial stab at #12 and also tried to add support querying data that has already been transformed. If you have data that you've already transformed (e.g. a UMAP embedding), you probably don't want to rerun encoder.transform again. In this case you want to index the transformed data and query it directly.

    This is just a first crack so happy to incorporate any feedback you might have!

    opened by gclen 10
  • embetter: better embeddings

    embetter: better embeddings

    This is conceptual work in progress. The maintainer is actively researching this, please do not work on it.

    Problem Statement

    When you submit where is my phoone and you get similarities you may get things like:

    • where is my phone
    • where is my credit card

    Depending on your task, either the "where is" part of the sentence is more important or the "phone" part is more important. The encoder, however, may be very brittle when it comes to spelling errors. So to put it more generally;

    image

    The similarity in an embedded space in our case is very much "general". I'm using "general" here, as opposed to "specific" to indicate that these similarities have been constructed without having a task in mind.

    Similar Issue

    Suppose that we are deduplicating and we have a zipcode, city, first-, and last-name. How would our encoding be able to understand that having the same city is not a strong signal while having the first name certainly is? Can we really expect a standard encoding to understand this? Without labels ... I think not.

    opened by koaning 3
  • Add `Identity` as default encoder for Service.

    Add `Identity` as default encoder for Service.

    As mentioned in https://github.com/koaning/simsity/pull/13:

    I think the refit parameter should go in the Service() call. I think there should also be a parameter somewhere to avoid calling .transform() if the data has already been transformed. Do you think it is worth adding an additional parameter to Service() and keeping the indexed_from_transformed_data method?

    It's a fair remark. I think preventing a transfrom() is fair, but the solution would be to have an Identity() transformer that just keeps the data as-is. This would also make a great default value for the encoder.

    Made this issue to track progress and to discuss the approach.

    opened by koaning 2
  • Codecalm tutorial on simsity

    Codecalm tutorial on simsity

    Hi Vincent. Since I discovered you my barrier towards Python has eroded! Thank you. I'm a Data Scientist who wants to check if simsity can help with retrieving similar regions based on environmental variables.

    opened by FrancyJGLisboa 2
  • Update indexer

    Update indexer

    Hi! Are there any plans to add support for updating the indexer, i.e. add new documents without retraining the entire pipeline? Would be a very useful feature .

    from simsity.service import Service
    
    service = Service(
        indexer=indexer,
        encoder=encoder
    )
    
    service.train_from_dataf(df, features=["text"])
    
    ....
    
    service.update(new_docs, features=["text"])  # <- this
    
    
    opened by nthomsencph 1
  • New API

    New API

    I think the original design was flawed and this project should stick to the scikit-learn API more.

    from simsity.preprocessing import Grab
    from simsity.service import Service
    from simsity.indexer import (AnnoyIndexer, PynnDescentIndexed, NMSlibIndexer,
                                 PineconeIndexer, QdrantIndexer, WeviateIndexer)
    
    
    encoder = make_pipeline(
        make_union(
            make_pipeline(Grab("text"), SentenceEncoder()),
            make_pipeline(Grab("title"), SentenceEncoder())
        )
    )
    
    service = Service(encoder, indexer, batch_size=50)
    service.index(X)
    items, dists = service.query(X, n=10)
    
    opened by koaning 0
  • Education Day Goals

    Education Day Goals

    • [x] add typing + type checker
    • [x] add tests for the minhash tools
    • [ ] collect more useful datasets
    • [x] automate the benchmarking
    • [x] write getting started guides
    • [ ] record a quick demo for colleagues
    • [ ] add github actions stash
    opened by koaning 0
  • added-components

    added-components

    Adding the MinHash components. This is also an amazing opportunity to:

    • [ ] add types and a type checker
    • [ ] add some standard tests for indexers
    • [ ] add a script to run some benchmarks on the clinc dataset
    opened by koaning 0
Releases(0.1.1)
Owner
vincent d warmerdam
Solving problems involving data. Mostly NLP these days. AskMeAnything[tm].
vincent d warmerdam
:snake: Python SDK to query Scaleway APIs.

Scaleway SDK Python SDK to query Scaleway's APIs. Stable release: Development: Installation The package is available on pip. To install it in a virtua

Scaleway 114 Dec 11, 2022
Python bindings for ArrayFire: A general purpose GPU library.

ArrayFire Python Bindings ArrayFire is a high performance library for parallel computing with an easy-to-use API. It enables users to write scientific

ArrayFire 402 Dec 20, 2022
Deep reinforcement learning library built on top of Neural Network Libraries

Deep Reinforcement Learning Library built on top of Neural Network Libraries NNablaRL is a deep reinforcement learning library built on top of Neural

Sony 100 Dec 14, 2022
Male' Map Telegram Bot

Male' Map TelegramBot A simple TelegramBot to fetch residential addresses in Male', Maldives. The bot can be queried inline or directly. sample .env f

Naail Abdul Rahman 12 Nov 25, 2022
Currency And Gold Prices - Currency And Gold Prices For Python

Currency_And_Gold_Prices Photos from the app New Update Show range Change better

Ali HemmatNia 4 Sep 19, 2022
A discord nuking tool made by python, this also has nuke accounts, inbuilt Selfbot, Massreport, Token Grabber, Nitro Sniper and ALOT more!

Disclaimer: Rage Multi Tool was made for Educational Purposes This project was created only for good purposes and personal use. By using Rage, you agr

†† 50 Jul 19, 2022
Ulaavi for nuke, helps to keep our stocl elements organised.

Ulaavi Ulaavi for nuke, helps to keep our stock elements organised. Installation Downlaod ffmpeg from ffmpeg.org linux : https://johnvansickle.com/ffm

Arun Subramaniyam 17 Aug 24, 2022
The programm for collecting data from Tinkoff API and building Excel table.

tinkproject The program for portfolio analysis via Tinkoff API Hello! This is my first project, please, don't judge me. This project was developed for

214 Dec 02, 2022
A Bot to Track Kernel Upstreams from kernel.org and Post it on Telegram Channel

Channel Kernel Tracker is the channel where the bot will be sending the updates in. Introduction This is a Telegram Bot to Track Kernel Upstreams kern

Kartikeya Hegde 3 Oct 05, 2021
KTUN Öğrenci Bilgi Sistemine bağlanıp her 15 dakikada notları kontrol eden ve değişiklik olduğu zaman size Discord Webhook ile mesaj atan uygulama.

KTUN_Obis KTUN Öğrenci Bilgi Sistemi KTUN Öğrenci Bilgi Sistemine selenium kullanarak girip setttings.py dosyasında verdiğiniz bilgeri doldurup ardınd

İbrahim Uysal 5 Oct 27, 2022
A discord bot with a leveling system (similar to mee6).

Discord.py A discord bot with a leveling system (like mee6) Pre-requisites Knowing how to get create an app/bot via discord's developer portal. Websit

26 Dec 11, 2022
This app is providing you to track some online products' prices via GMAIL.

Price Tracking App variables and descriptions of that code is in Turkish language. but we're working on translate them into English. This app is provi

Abdullah Aslan 1 Dec 11, 2021
L3DAS22 challenge supporting API

L3DAS22 challenge supporting API This repository supports the L3DAS22 IEEE ICASSP Grand Challenge and it is aimed at downloading the dataset, pre-proc

L3DAS 38 Dec 25, 2022
ELiza music is a telegram music bot project, allow you to play music on voice chat group telegram.

❤️ 𝗘𝗹𝗶𝘇𝗮 𝗠𝘂𝘀𝗶𝗰 ❤️ Unmaintained. The new repo of @MrsElizaRobot is private. (It is no longer based on this source code. The completely rewrit

Team Eliza 2 Dec 08, 2022
A command line interface for accessing google drive

Drive Cli Get the ability to access Google Drive without leaving your terminal. Inspiration Google Drive has become a vital part of our day to day lif

Chirag Shetty 538 Dec 12, 2022
Innocent-Bot - A Discord client self-bot for destroying, nuking and causing mischief in servers

Innocent-bot A Discord client self-bot for destroying, nuking and causing mischi

†† 5 Jan 26, 2022
A cool discord bot, called Fifi

Fifi A cool discord bot, called Fifi This bot is the official server bot of Meme Studios discord server. This github repo is the code we use for the b

Fifi Discord Bot 3 Jun 08, 2021
A fork of discord.py for anime enjoyers

A modern, easy to use, feature-rich, and async ready API wrapper for Discord written in Python. Key Features Modern Pythonic API using async and await

Senpai Development 4 Nov 05, 2021
Step by Step Guide To Install Discord Py Master Branch on Replit

Guide to Install Discord Py Master Branch on Replit Step 1 Create an empty repl on replit Step 2 Add this Basic Code to the file main.py so as to chec

Pranav Saxena 7 Nov 18, 2022
Flask extension that provides integration with Azure Storage

Flask-Azure-Storage A Flask extension that provides integration with Azure Storage Table of Contents Flask-Azure-Storage Install Usage Examples Create

Alejo Arias 17 Nov 14, 2021