Simple Similarities Service

Overview

simsity

Simsity is a Super Simple Similarities Service[tm].
It's all about building a neighborhood. Literally!

This repository contains simple tools to help in similarity retreival scenarios by making a convient wrapper around encoding strategies as well as nearest neighbor approaches. Typical usecases include early stage bulk labelling and duplication discovery.

Warning

Alpha software. Expect things to break. Do not use in production.

Quickstart

This is the basic setup for this package.

import pandas as pd

from simsity.service import Service
from simsity.indexer import PyNNDescentIndexer
from simsity.preprocessing import Identity, ColumnLister


# The Indexer handles the nearest neighbor search
# The Encoder handles the encoding of the datapoints
service = Service(
    indexer=PyNNDescentIndexer(metric="euclidean"),
    encoder=CountVectorizer()
)

# The encoder defines how we encode the data going in.
encoder = make_pipeline(
    ColumnLister(column="text"),
    CountVectorizer()
)

# The indexer handles the nearest neighbor lookup.
indexer = PyNNDescentIndexer(metric="euclidean", n_neighbors=2)

# The service combines the two into a single object.
service_clinc = Service(
    encoder=encoder,
    indexer=indexer,
)

# We can now train the service.
df_clinc = pd.read_csv("tests/data/clinc-data.csv")
service_clinc.train_from_dataf(df_clinc, features=["text"])

# Query the datapoints
service.query("give me directions", n_neighbors=20)

# Save the entire system
service.save("/tmp/simple-model")

# You can also load the model now.
reloaded = Service.load("/tmp/simple-model")

# We can also host it as a web service
reloaded.serve(host='0.0.0.0', port=8080)

# You can now POST to http://0.0.0.0:8080/query with payload:
# {"query": {"text": "hello there"}, "n_neighbors": 20}
Comments
  • Add support for pretrained encoders and transformed data

    Add support for pretrained encoders and transformed data

    First of all this project looks great! I've taken an initial stab at #12 and also tried to add support querying data that has already been transformed. If you have data that you've already transformed (e.g. a UMAP embedding), you probably don't want to rerun encoder.transform again. In this case you want to index the transformed data and query it directly.

    This is just a first crack so happy to incorporate any feedback you might have!

    opened by gclen 10
  • embetter: better embeddings

    embetter: better embeddings

    This is conceptual work in progress. The maintainer is actively researching this, please do not work on it.

    Problem Statement

    When you submit where is my phoone and you get similarities you may get things like:

    • where is my phone
    • where is my credit card

    Depending on your task, either the "where is" part of the sentence is more important or the "phone" part is more important. The encoder, however, may be very brittle when it comes to spelling errors. So to put it more generally;

    image

    The similarity in an embedded space in our case is very much "general". I'm using "general" here, as opposed to "specific" to indicate that these similarities have been constructed without having a task in mind.

    Similar Issue

    Suppose that we are deduplicating and we have a zipcode, city, first-, and last-name. How would our encoding be able to understand that having the same city is not a strong signal while having the first name certainly is? Can we really expect a standard encoding to understand this? Without labels ... I think not.

    opened by koaning 3
  • Add `Identity` as default encoder for Service.

    Add `Identity` as default encoder for Service.

    As mentioned in https://github.com/koaning/simsity/pull/13:

    I think the refit parameter should go in the Service() call. I think there should also be a parameter somewhere to avoid calling .transform() if the data has already been transformed. Do you think it is worth adding an additional parameter to Service() and keeping the indexed_from_transformed_data method?

    It's a fair remark. I think preventing a transfrom() is fair, but the solution would be to have an Identity() transformer that just keeps the data as-is. This would also make a great default value for the encoder.

    Made this issue to track progress and to discuss the approach.

    opened by koaning 2
  • Codecalm tutorial on simsity

    Codecalm tutorial on simsity

    Hi Vincent. Since I discovered you my barrier towards Python has eroded! Thank you. I'm a Data Scientist who wants to check if simsity can help with retrieving similar regions based on environmental variables.

    opened by FrancyJGLisboa 2
  • Update indexer

    Update indexer

    Hi! Are there any plans to add support for updating the indexer, i.e. add new documents without retraining the entire pipeline? Would be a very useful feature .

    from simsity.service import Service
    
    service = Service(
        indexer=indexer,
        encoder=encoder
    )
    
    service.train_from_dataf(df, features=["text"])
    
    ....
    
    service.update(new_docs, features=["text"])  # <- this
    
    
    opened by nthomsencph 1
  • New API

    New API

    I think the original design was flawed and this project should stick to the scikit-learn API more.

    from simsity.preprocessing import Grab
    from simsity.service import Service
    from simsity.indexer import (AnnoyIndexer, PynnDescentIndexed, NMSlibIndexer,
                                 PineconeIndexer, QdrantIndexer, WeviateIndexer)
    
    
    encoder = make_pipeline(
        make_union(
            make_pipeline(Grab("text"), SentenceEncoder()),
            make_pipeline(Grab("title"), SentenceEncoder())
        )
    )
    
    service = Service(encoder, indexer, batch_size=50)
    service.index(X)
    items, dists = service.query(X, n=10)
    
    opened by koaning 0
  • Education Day Goals

    Education Day Goals

    • [x] add typing + type checker
    • [x] add tests for the minhash tools
    • [ ] collect more useful datasets
    • [x] automate the benchmarking
    • [x] write getting started guides
    • [ ] record a quick demo for colleagues
    • [ ] add github actions stash
    opened by koaning 0
  • added-components

    added-components

    Adding the MinHash components. This is also an amazing opportunity to:

    • [ ] add types and a type checker
    • [ ] add some standard tests for indexers
    • [ ] add a script to run some benchmarks on the clinc dataset
    opened by koaning 0
Releases(0.1.1)
Owner
vincent d warmerdam
Solving problems involving data. Mostly NLP these days. AskMeAnything[tm].
vincent d warmerdam
Bot facebook

botfb Bot facebook Login via cookies cara install $pkg update && pkg upgrade $pkg install git python $git clone https://github.com/Ainx-BOT/botfb $cd

Fahmi Dev 12 Dec 18, 2022
Georeferencing large amounts of data for free.

Geolocate Georeferencing large amounts of data for free. Special thanks to @brunodepauloalmeida and the whole team for the contributions. How? It's us

Gabriel Gazola Milan 23 Dec 30, 2022
Discord-RAID-Tool - Hacks/tools

How to use Python must be installed run install-config If you dont have python installed, download python 3.7.6 and make sure you click on the 'ADD TO

1 Jan 01, 2022
An unofficial library for discord components (under-development)

discord-components An unofficial library for discord components (under-development) Welcome! Discord components are cool, but discord.py will support

11 Jun 14, 2021
Change the name and pfp of ur accounts, uses tokens.txt for ur tokens.

Change the name and pfp of ur accounts, uses tokens.txt for ur tokens. Also scrapes the pfps+names from a server chosen by you. For hq tokens go to discord.gg/tokenshop or t.me/praisetelegram

cChimney 36 Dec 09, 2022
A simple url uploader bot with permenent thumbnail support

URL-Uploader A simple url uploader bot with permenent thumbnail support Scrapped some code from @SpEcHIDe's AnyDLBot Repository Please fork this repos

Fayas Noushad 40 Nov 29, 2021
This project, search all entities related to A2P in twilio

Mirror A2P Twilio This project, search all entities related to A2P in twilio (phone numbers, messaging services, campaign, A2P brand information and P

Iván Cárdenas 2 Nov 03, 2022
Bill is a bot capable to Chat with you, search everything on web to you, and send message to yours contacts for you.

Bill Bot The inteligent Bot Bill is a intelligent bot, it can chat, search and send messages to you. Chat with You Send messages on WhatsApp for you S

João Assalim 3 Sep 12, 2021
Simple Telegram Bot to Download and Upload Files From Mega.nz

Mega.nz-Bot Simple Telegram Bot to Download Files From Mega.nz and Upload It to Telegram Features All Mega.nz File Links supported No login required A

I'm Not A Bot #Left_TG 245 Jan 01, 2023
A simple object model for the Notion SDK.

A simplified object model for the Notion SDK. This is loosely modeled after concepts found in SQLAlchemy.

Jason Heddings 54 Jan 02, 2023
A python API wrapper for temp-mail.org

temp-mail Python API Wrapper for temp-mail.ru service. Temp-mail is a service which lets you use anonymous emails for free. You can view full API spec

Denis Veselov 91 Nov 19, 2022
Flood discord webhooks

Webhook-Spammer Flood discord webhooks Asynchronous webhook spammer Fast & Efficient Usage - Use it with atleast 500 threads Put a valid webhook Use a

trey 1 Apr 22, 2022
Telegram bot with various Sticker Tools

Sticker Tools Bot @Sticker_Tools_Bot A star ⭐ from you means a lot to us! Telegram bot with various Sticker Tools Usage Deploy to Heroku Tap on above

Stark Bots 20 Dec 08, 2022
A Python module for communicating with the Twilio API and generating TwiML.

twilio-python The default branch name for this repository has been changed to main as of 07/27/2020. Documentation The documentation for the Twilio AP

Twilio 1.6k Jan 05, 2023
Watches your earnings on EarnApp and notifies you when you earned balance or received an payout.

EarnApp-Earning-Monitor Watches your earnings on EarnApp and notifies you when you earned balance or received an payout. Installation Install Python3

Yariya 21 Oct 17, 2022
Nonebot2 简易群管

简易群管 ✨ NoneBot2 简易群管 ✨ _ 踢 改 禁 欢迎issue pr 权限说明:permission=SUPERUSER 安装 💿 pip install nonebot-plugin-admin 导入 📲 在bot.py 导入,语句: nonebot.load_plugin("n

幼稚园园长 74 Dec 22, 2022
🦈 Blahaj is a discord bot that shares random images of himself on discord.

🦈 Blahaj Bot Blahaj is a discord bot that shares random images of himself on discord. ⚙️ Developer's Guide To use the bot, follow along the steps Hea

Atul Anand 3 Oct 21, 2022
multi-purpose discord bot

virus multi-purpose discord bot ⚠️ WARNING This project is incomplete and may not work as expected. Download & Run Install Python =3.10 Clone the sou

miten 2 Jan 17, 2022
SQS + Lambda를 활용한 문자 메시지 및 이메일, Voice call 호출을 간단하게 구현하는 serverless 템플릿

AWS SQS With Lambda notification 서버 구축을 위한 Poc TODO serverless를 통해 sqs 관련 리소스(람다, sqs) 배포 가능한 템플릿 작성 및 배포 poc차원에서 간단한 rest api 호출을 통한 sqs fifo 큐에 메시지

김세환 4 Aug 08, 2021
Um simples bot público para todos usarem no discord!

Discord Bot - Código Público Características: Linguagem de Programação: Python Quantidade de comandos: 17 Comandos: Prefixo do bot: O prefixo desse bo

Kevin 3 Dec 31, 2021