Simple Similarities Service

Overview

simsity

Simsity is a Super Simple Similarities Service[tm].
It's all about building a neighborhood. Literally!

This repository contains simple tools to help in similarity retrieval scenarios by providing a convenient wrapper around encoding strategies as well as nearest neighbor approaches. Typical use cases include early-stage bulk labelling and duplicate discovery.

Warning

Alpha software. Expect things to break. Do not use in production.

Quickstart

This is the basic setup for this package.

import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer

from simsity.service import Service
from simsity.indexer import PyNNDescentIndexer
from simsity.preprocessing import ColumnLister

# The encoder defines how we encode the data going in.
encoder = make_pipeline(
    ColumnLister(column="text"),
    CountVectorizer()
)

# The indexer handles the nearest neighbor lookup.
indexer = PyNNDescentIndexer(metric="euclidean", n_neighbors=2)

# The service combines the two into a single object.
service_clinc = Service(
    encoder=encoder,
    indexer=indexer,
)

# We can now train the service.
df_clinc = pd.read_csv("tests/data/clinc-data.csv")
service_clinc.train_from_dataf(df_clinc, features=["text"])

# Query the datapoints. The keyword argument refers to the `text` column.
service_clinc.query(text="give me directions", n_neighbors=20)

# Save the entire system
service_clinc.save("/tmp/simple-model")

# You can also load the model now.
reloaded = Service.load("/tmp/simple-model")

# We can also host it as a web service
reloaded.serve(host='0.0.0.0', port=8080)

# You can now POST to http://0.0.0.0:8080/query with payload:
# {"query": {"text": "hello there"}, "n_neighbors": 20}
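
Once the service is being served, any HTTP client can send the payload shown above. The following is a minimal sketch using only the Python standard library. It assumes the server from the quickstart is reachable at http://localhost:8080 and that the /query endpoint answers with JSON. The helper names build_query_request and query_service are made up for this illustration and are not part of simsity.

```python
import json
import urllib.request


def build_query_request(text, n_neighbors=20, base_url="http://localhost:8080"):
    """Build a POST request carrying the payload shown in the quickstart."""
    payload = {"query": {"text": text}, "n_neighbors": n_neighbors}
    return urllib.request.Request(
        url=f"{base_url}/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def query_service(text, n_neighbors=20, base_url="http://localhost:8080"):
    """Send the request and decode the response (assumed here to be JSON)."""
    request = build_query_request(text, n_neighbors, base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage, once a service is running (for example via `reloaded.serve(...)`):
# neighbors = query_service("hello there", n_neighbors=20)
```

Keeping the request construction separate from the network call makes the payload easy to inspect without a running server.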
Comments
  • Add support for pretrained encoders and transformed data

    First of all this project looks great! I've taken an initial stab at #12 and also tried to add support querying data that has already been transformed. If you have data that you've already transformed (e.g. a UMAP embedding), you probably don't want to rerun encoder.transform again. In this case you want to index the transformed data and query it directly.

    This is just a first crack so happy to incorporate any feedback you might have!

    opened by gclen 10
  • embetter: better embeddings

    This is conceptual work in progress. The maintainer is actively researching this, please do not work on it.

    Problem Statement

    When you submit "where is my phoone" as a query, the similar items you get back may include:

    • where is my phone
    • where is my credit card

    Depending on your task, either the "where is" part of the sentence is more important or the "phone" part is more important. The encoder, however, may be very brittle when it comes to spelling errors. To put it more generally:

    The similarity in an embedded space in our case is very much "general". I'm using "general" here, as opposed to "specific" to indicate that these similarities have been constructed without having a task in mind.

    Similar Issue

    Suppose that we are deduplicating and we have a zipcode, city, first-, and last-name. How would our encoding be able to understand that having the same city is not a strong signal while having the first name certainly is? Can we really expect a standard encoding to understand this? Without labels ... I think not.

    opened by koaning 3
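
The brittleness described in the problem statement above can be made concrete with a small sketch. It uses plain word counts and cosine similarity, which is an assumption standing in for whatever encoder is actually in use (the quickstart's CountVectorizer also counts words). The function names are illustrative only.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count whitespace-separated, lowercased tokens."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two token-count mappings."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = bag_of_words("where is my phoone")
candidates = ["where is my phone", "where is my credit card"]
scores = {c: cosine_similarity(query, bag_of_words(c)) for c in candidates}

for candidate, score in scores.items():
    print(f"{candidate!r}: {score:.3f}")
```

Both scores (roughly 0.75 and 0.67) come entirely from the shared words "where is my". The misspelled "phoone" shares no token with "phone", so the word that may matter most for the task contributes nothing to the similarity.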
  • Add `Identity` as default encoder for Service.

    As mentioned in https://github.com/koaning/simsity/pull/13:

    I think the refit parameter should go in the Service() call. I think there should also be a parameter somewhere to avoid calling .transform() if the data has already been transformed. Do you think it is worth adding an additional parameter to Service() and keeping the indexed_from_transformed_data method?

    It's a fair remark. I think preventing a transform() call is fair, but the solution would be to have an Identity() transformer that just keeps the data as-is. This would also make a great default value for the encoder.

    Made this issue to track progress and to discuss the approach.

    opened by koaning 2
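
The Identity() idea discussed in the issue above can be sketched as a transformer whose transform returns its input unchanged. This is a duck-typed illustration of the scikit-learn fit/transform convention, not simsity's actual implementation. The class name IdentitySketch is made up here, and a real version would probably subclass scikit-learn's BaseEstimator and TransformerMixin.

```python
class IdentitySketch:
    """Pass-through transformer: useful when the data is already encoded."""

    def fit(self, X, y=None):
        # Nothing to learn; return self so calls can be chained.
        return self

    def transform(self, X):
        # Hand the data back untouched, so no re-encoding takes place.
        return X

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


# Vectors that were already produced by some earlier embedding step.
precomputed = [[0.1, 0.9], [0.8, 0.2]]
encoded = IdentitySketch().fit_transform(precomputed)
print(encoded is precomputed)
```

With an encoder like this, already-transformed data (a UMAP embedding, for example) could be indexed and queried directly without a separate code path.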
  • Codecalm tutorial on simsity

    Hi Vincent. Since I discovered you my barrier towards Python has eroded! Thank you. I'm a Data Scientist who wants to check if simsity can help with retrieving similar regions based on environmental variables.

    opened by FrancyJGLisboa 2
  • Update indexer

    Hi! Are there any plans to add support for updating the indexer, i.e. adding new documents without retraining the entire pipeline? It would be a very useful feature.

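    from simsity.service import Service

    # `indexer`, `encoder`, `df` and `new_docs` are assumed to be defined elsewhere.
    service = Service(
        indexer=indexer,
        encoder=encoder
    )

    service.train_from_dataf(df, features=["text"])

    # ... later, when new documents arrive ...

    service.update(new_docs, features=["text"])  # <- this is the requested method
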
    opened by nthomsencph 1
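
How cheap such an update can be depends on the underlying indexer. As an illustration only, an exact brute-force index can support it by simply appending vectors. The class BruteForceIndex and its methods are hypothetical and not part of simsity.

```python
import math


class BruteForceIndex:
    """Toy exact nearest-neighbour index that supports incremental updates."""

    def __init__(self):
        self.vectors = []

    def index(self, vectors):
        # (Re)build from scratch: replace everything stored so far.
        self.vectors = [list(v) for v in vectors]
        return self

    def update(self, new_vectors):
        # Incremental step: append without touching existing entries.
        self.vectors.extend(list(v) for v in new_vectors)
        return self

    def query(self, vector, n_neighbors=1):
        # Rank all stored vectors by Euclidean distance to the query.
        scored = [(math.dist(vector, v), i) for i, v in enumerate(self.vectors)]
        return sorted(scored)[:n_neighbors]


index = BruteForceIndex().index([[0.0, 0.0], [1.0, 1.0]])
index.update([[5.0, 5.0]])
print(index.query([4.0, 4.0], n_neighbors=1))
```

The query returns (distance, position) pairs, so the vector added through update is found without re-indexing the first two.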
  • New API

    I think the original design was flawed and this project should stick to the scikit-learn API more.

    from sklearn.pipeline import make_pipeline, make_union

    from simsity.preprocessing import Grab
    from simsity.service import Service
    from simsity.indexer import (AnnoyIndexer, PyNNDescentIndexer, NMSlibIndexer,
                                 PineconeIndexer, QdrantIndexer, WeaviateIndexer)

    # Proposed API: `SentenceEncoder`, `indexer` and `X` are placeholders here.
    encoder = make_pipeline(
        make_union(
            make_pipeline(Grab("text"), SentenceEncoder()),
            make_pipeline(Grab("title"), SentenceEncoder())
        )
    )

    service = Service(encoder, indexer, batch_size=50)
    service.index(X)
    items, dists = service.query(X, n=10)

    opened by koaning 0
  • Education Day Goals

    • [x] add typing + type checker
    • [x] add tests for the minhash tools
    • [ ] collect more useful datasets
    • [x] automate the benchmarking
    • [x] write getting started guides
    • [ ] record a quick demo for colleagues
    • [ ] add github actions stash
    opened by koaning 0
  • added-components

    Adding the MinHash components. This is also an amazing opportunity to:

    • [ ] add types and a type checker
    • [ ] add some standard tests for indexers
    • [ ] add a script to run some benchmarks on the clinc dataset
    opened by koaning 0
Releases(0.1.1)
Owner
vincent d warmerdam
Solving problems involving data. Mostly NLP these days. AskMeAnything[tm].