Cherche (search in French) allows you to create a neural search pipeline using retrievers and pre-trained language models as rankers.

Overview

Cherche

Neural search



Cherche (search in French) allows you to create a neural search pipeline using retrievers and pre-trained language models as rankers. Cherche is meant to be used with small to medium sized corpora. Cherche's main strength is its ability to build diverse and end-to-end pipelines.

Alt text

Installation 🤖

pip install cherche

To install the development version:

pip install git+https://github.com/raphaelsty/cherche

Documentation 📜

Documentation is available here. It provides details about retrievers, rankers, pipelines, question answering, summarization, and examples.

QuickStart 💨

Documents 📑

Cherche allows findings the right document within a list of objects. Here is an example of a corpus.

from cherche import data

documents = data.load_towns()

documents[:3]
[{'id': 0,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'Paris is the capital and most populous city of France.'},
 {'id': 1,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': "Since the 17th century, Paris has been one of Europe's major centres of science, and arts."},
 {'id': 2,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'The City of Paris is the centre and seat of government of the region and province of Île-de-France.'
  }]

Retriever ranker 🔍

Here is an example of a neural search pipeline composed of a TfIdf that quickly retrieves documents, followed by a ranking model. The ranking model sorts the documents produced by the retriever based on the semantic similarity between the query and the documents.

from cherche import data, retrieve, rank
from sentence_transformers import SentenceTransformer

# List of dicts
documents = data.load_towns()

# Retrieve on fields title and article
retriever = retrieve.TfIdf(key="id", on=["title", "article"], documents=documents, k=30)

# Rank on fields title and article
ranker = rank.Encoder(
    key = "id",
    on = ["title", "article"],
    encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode,
    k = 3,
    path = "encoder.pkl"
)

# Pipeline creation
search = retriever + ranker

search.add(documents=documents)

search("Bordeaux")
[{'id': 57, 'similarity': 0.69513476},
 {'id': 63, 'similarity': 0.6214991},
 {'id': 65, 'similarity': 0.61809057}]

Map the index to the documents to access their contents.

search += documents
search("Bordeaux")
[{'id': 57,
  'title': 'Bordeaux',
  'url': 'https://en.wikipedia.org/wiki/Bordeaux',
  'article': 'Bordeaux ( bor-DOH, French: [bɔʁdo] (listen); Gascon Occitan: Bordèu [buɾˈðɛw]) is a port city on the river Garonne in the Gironde department, Southwestern France.',
  'similarity': 0.69513476},
 {'id': 63,
  'title': 'Bordeaux',
  'url': 'https://en.wikipedia.org/wiki/Bordeaux',
  'article': 'The term "Bordelais" may also refer to the city and its surrounding region.',
  'similarity': 0.6214991},
 {'id': 65,
  'title': 'Bordeaux',
  'url': 'https://en.wikipedia.org/wiki/Bordeaux',
  'article': "Bordeaux is a world capital of wine, with its castles and vineyards of the Bordeaux region that stand on the hillsides of the Gironde and is home to the world's main wine fair, Vinexpo.",
  'similarity': 0.61809057}]

Retrieve 👻

Cherche provides different retrievers that filter input documents based on a query.

  • retrieve.Elastic
  • retrieve.TfIdf
  • retrieve.Lunr
  • retrieve.BM25Okapi
  • retrieve.BM25L
  • retrieve.Flash
  • retrieve.Encoder

Rank 🤗

Cherche rankers are compatible with SentenceTransformers models, Hugging Face sentence similarity models, Hugging Face zero shot classification models, and of course with your own models.

Summarization and question answering

Cherche provides modules dedicated to summarization and question answering. These modules are compatible with Hugging Face's pre-trained models and can be fully integrated into neural search pipelines.

Acknowledgements 👏

The BM25 models available in Cherche are wrappers around rank_bm25. Elastic retriever is a wrapper around Python Elasticsearch Client. TfIdf retriever is a wrapper around scikit-learn's TfidfVectorizer. Lunr retriever is a wrapper around Lunr.py. Flash retriever is a wrapper around FlashText. DPR and Encode rankers are wrappers dedicated to the use of the pre-trained models of SentenceTransformers in a neural search pipeline. ZeroShot ranker is a wrapper dedicated to the use of the zero-shot sequence classifiers of Hugging Face in a neural search pipeline.

See also 👀

Cherche is a minimalist solution and meets a need for modularity. Cherche is the way to go if you start with a list of documents as JSON with multiple fields to search on and want to create pipelines. Also ,Cherche is well suited for middle sized corpora.

Do not hesitate to look at Haystack, Jina, or TxtAi which offer very advanced solutions for neural search and are great.

Dev Team 💾

The Cherche dev team is made up of Raphaël Sourty and François-Paul Servant 🥳

Comments
  • Added spelling corrector object

    Added spelling corrector object

    Hello ! I added a spelling corrector base class as well as the original implementation of the Norvig spelling corrector. The spelling corrector can be fitted directly on the pipeline's documents with the '.add(documents)' method. I also provided an optional (defaults to False) external dictionary, the one originally used by Norvig.

    I have no issue updating my code for improvements, so feel free to suggest any modification !

    opened by NicolasBizzozzero 4
  • 0.0.5

    0.0.5

    Pull request for Cherche version 0.0.5

    • RAG: add RAG generator for open domain question answering
    • RapidFuzzy: New blazzing fast retriever
    • Retrievers: Provide similarities for each retriever
    • Union & Intersection: Keep similarity scores
    opened by raphaelsty 1
  • Batch processing

    Batch processing

    Retrieving documents with batch of queries can significantly speed up things. It is now available for few models using the development version via the batch method.

    Models involved are:

    • TfIdf retriever
    • Encoder retriever (milvus + faiss)
    • Encoder ranker (milvus)
    • DPR retriever (milvus + faiss)
    • DPR ranker (milvus)
    • Recommend retriever

    Batch is not yet compatible with pipelines.

    enhancement 
    opened by raphaelsty 0
  • Cherche 1.0.0

    Cherche 1.0.0

    Here is an essential update for Cherche. The update retains the previous API and is compatible with previous versions. 🥳

    Main additions:

    • Added compatibility with two new open-source retrievers: Meilisearch and TypeSense.
    • Compatibility with the Milvus index to use the retriever.Encoder and retriever.DPR models on massive corpora.
    • Compatibility with the Milvus index to store ranker embeddings in a database rather than in memory.
    • Progress bar when pre-computing embeddings by Encoder, DPR retrievers and Encoder, DPR rankers.
    • All pipelines (voting, intersection, concatenation) produce a similarity score. To do so, the pipeline object applies a softmax to normalize the scores, thus allowing us to "compare" the scores of two distinct models.
    • Integration of collaborative filtering models via adding a Recommend retriever and a Recommend ranker (indexation via Faiss and compatible with Milvus) to consider users' preferences in the search.
    opened by raphaelsty 0
  • "IndexError: index out of range in self "While adding documents to cherche pipeline

    I'm using a cherche pipline built of a tfidf retriever with a sentencetransformer ranker as follows : search = (retriever + ranker) While trying to add documents to the pipeline (search.add(documents=documents), I got this error :

    """/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse) 2181 # remove once script supports set_grad_enabled 2182 no_grad_embedding_renorm(weight, input, max_norm, norm_type) -> 2183 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse) 2184 2185

    IndexError: index out of range in self"""

    opened by delmetni 0
  • incomplete doc about metrics

    incomplete doc about metrics

    opened by fpservant 0
Releases(1.0.1)
  • 1.0.1(Oct 27, 2022)

  • 1.0.0(Oct 26, 2022)

    What's Changed

    Here is an essential update for Cherche! 🥳

    • Added compatibility with two new open-source retrievers: Meilisearch and TypeSense.
    • Compatibility with the Milvus index to use the retriever.Encoder and retriever.DPR models on massive corpora.
    • Compatibility with the Milvus index to store ranker embeddings in a database rather than in memory.
    • Progress bar when pre-computing embeddings by Encoder, DPR retrievers and Encoder, DPR rankers.
    • The path parameter is no longer used.
    • All pipelines (voting, intersection, concatenation) produce a similarity score. To do so, the pipeline object applies a softmax to normalize the scores, thus allowing us to "compare" the scores of two distinct models.
    • Integration of collaborative filtering models via adding a Recommend retriever and a Recommend ranker (indexation via Faiss and compatible with Milvus) to consider users' preferences in the search.

    Cherche is now fully compatible with large-scale corpora and deeply integrates collaborative filtering. Updates retains the previous API and is compatible with previous versions.

    Source code(tar.gz)
    Source code(zip)
  • 0.1.0(Jun 16, 2022)

    Added compatibility with the ONNX environment and quantization to significantly speed up sentence transformers and question answering models. 🏎

    It is now possible to choose the type of index for the Encoder and DPR retrievers in order to process the largest corpora while using the GPU.

    Source code(tar.gz)
    Source code(zip)
  • 0.0.9(Apr 13, 2022)

  • 0.0.8(Mar 7, 2022)

  • 0.0.7(Mar 7, 2022)

  • 0.0.6(Mar 3, 2022)

    • Update documentation
    • Update retriever Encoder and DPR, path is optionnal
    • Add deployment documentation
    • Update similarity type
    • Avoid round similarity
    Source code(tar.gz)
    Source code(zip)
  • 0.0.5(Feb 8, 2022)

    • Loading and Saving tutorial
    • Fuzzy retriever
    • Similarities everywhere (retrievers, union, intersection provide similarity scores)
    • RAG generation
    Source code(tar.gz)
    Source code(zip)
  • 0.0.4(Jan 20, 2022)

    Update of the encoder retriever and the DPR retriever. Documents in the Faiss index will not be duplicated. Query embeddings can now be pre-computed for ranker Encoder and ranker DPR to speed up evaluation without having to compute it again.

    Source code(tar.gz)
    Source code(zip)
  • 0.0.3(Jan 13, 2022)

  • 0.0.2(Jan 12, 2022)

    Update of the Cherche dependencies. The previous dependencies were too strict and restrictive as they were limited to a specific version for each package.

    Source code(tar.gz)
    Source code(zip)
Owner
Raphael Sourty
PhD Student @ IRIT and Renault
Raphael Sourty
Source code for the paper "TearingNet: Point Cloud Autoencoder to Learn Topology-Friendly Representations"

TearingNet: Point Cloud Autoencoder to Learn Topology-Friendly Representations Created by Jiahao Pang, Duanshun Li, and Dong Tian from InterDigital In

InterDigital 21 Dec 29, 2022
ttslearn: Library for Pythonで学ぶ音声合成 (Text-to-speech with Python)

ttslearn: Library for Pythonで学ぶ音声合成 (Text-to-speech with Python) 日本語は以下に続きます (Japanese follows) English: This book is written in Japanese and primaril

Ryuichi Yamamoto 189 Dec 29, 2022
The Internet Archive Research Assistant - Daily search Internet Archive for new items matching your keywords

The Internet Archive Research Assistant - Daily search Internet Archive for new items matching your keywords

Kay Savetz 60 Dec 25, 2022
Easy Language Model Pretraining leveraging Huggingface's Transformers and Datasets

Easy Language Model Pretraining leveraging Huggingface's Transformers and Datasets What is LASSL • How to Use What is LASSL LASSL은 LAnguage Semi-Super

LASSL: LAnguage Self-Supervised Learning 116 Dec 27, 2022
Traditional Chinese Text Recognition Dataset: Synthetic Dataset and Labeled Data

Traditional Chinese Text Recognition Dataset: Synthetic Dataset and Labeled Data Authors: Yi-Chang Chen, Yu-Chuan Chang, Yen-Cheng Chang and Yi-Ren Ye

Yi-Chang Chen 5 Dec 15, 2022
Built for cleaning purposes in military institutions

Ferramenta do AL Construído para fins de limpeza em instituições militares. Instalação Requer python = 3.2 pip install -r requirements.txt Usagem Exe

0 Aug 13, 2022
PyTranslator é simultaneamente um editor e tradutor de texto com diversos recursos e interface feito com coração e 100% em Python

PyTranslator O Que é e para que serve o PyTranslator? PyTranslator é simultaneamente um editor e tradutor de texto em com interface gráfica que usa a

Elizeu Barbosa Abreu 1 May 12, 2022
Header-only C++ HNSW implementation with python bindings

Hnswlib - fast approximate nearest neighbor search Header-only C++ HNSW implementation with python bindings. NEWS: version 0.6 Thanks to (@dyashuni) h

2.3k Jan 05, 2023
Materials (slides, code, assignments) for the NYU class I teach on NLP and ML Systems (Master of Engineering).

FREE_7773 Repo containing material for the NYU class (Master of Engineering) I teach on NLP, ML Sys etc. For context on what the class is trying to ac

Jacopo Tagliabue 90 Dec 19, 2022
Extract city and country mentions from Text like GeoText without regex, but FlashText, a Aho-Corasick implementation.

flashgeotext ⚡ 🌍 Extract and count countries and cities (+their synonyms) from text, like GeoText on steroids using FlashText, a Aho-Corasick impleme

Ben 57 Dec 16, 2022
NLP command-line assistant powered by OpenAI

NLP command-line assistant powered by OpenAI

Axel 16 Dec 09, 2022
Fidibo.com comments Sentiment Analyser

Fidibo.com comments Sentiment Analyser Introduction This project first asynchronously grab Fidibo.com books comment data using grabber.py and then sav

Iman Kermani 3 Apr 15, 2022
NL-Augmenter 🦎 → 🐍 A Collaborative Repository of Natural Language Transformations

NL-Augmenter 🦎 → 🐍 The NL-Augmenter is a collaborative effort intended to add transformations of datasets dealing with natural language. Transformat

684 Jan 09, 2023
PUA Programming Language written in Python.

pua-lang PUA Programming Language written in Python. Installation git clone https://github.com/zhaoyang97/pua-lang.git cd pua-lang pip install . Try

zy 4 Feb 19, 2022
NLP project that works with news (NER, context generation, news trend analytics)

СоАвтор СоАвтор – платформа и открытый набор инструментов для редакций и журналистов-фрилансеров, который призван сделать процесс создания контента ма

38 Jan 04, 2023
KoBERTopic은 BERTopic을 한국어 데이터에 적용할 수 있도록 토크나이저와 BERT를 수정한 코드입니다.

KoBERTopic 모델 소개 KoBERTopic은 BERTopic을 한국어 데이터에 적용할 수 있도록 토크나이저와 BERT를 수정했습니다. 기존 BERTopic : https://github.com/MaartenGr/BERTopic/tree/05a6790b21009d

Won Joon Yoo 26 Jan 03, 2023
Stand-alone language identification system

langid.py readme Introduction langid.py is a standalone Language Identification (LangID) tool. The design principles are as follows: Fast Pre-trained

2k Jan 04, 2023
Wrapper to display a script output or a text file content on the desktop in sway or other wlroots-based compositors

nwg-wrapper This program is a part of the nwg-shell project. This program is a GTK3-based wrapper to display a script output, or a text file content o

Piotr Miller 94 Dec 27, 2022
BERT-based Financial Question Answering System

BERT-based Financial Question Answering System In this example, we use Jina, PyTorch, and Hugging Face transformers to build a production-ready BERT-b

Bithiah Yuan 61 Sep 18, 2022
Spokestack is a library that allows a user to easily incorporate a voice interface into any Python application with a focus on embedded systems.

Welcome to Spokestack Python! This library is intended for developing voice interfaces in Python. This can include anything from Raspberry Pi applicat

Spokestack 133 Sep 20, 2022