Postgres full text search options (tsearch, trigram) examples

Overview

postgres-full-text-search

Postgres full text search options (tsearch, trigram) examples.

Create DB

>> CREATE DATABASE ftdb;

To feed db with an example dataset (dataset.txt, 100k rows, 15 words each one) I used python init_db.py script.

Full text search using simple ilike

>> EXPLAIN ANALYZE
   SELECT text, language
   FROM public.document
   WHERE
      text ilike '%field%'
      AND text ilike '%window%'
      AND text ilike '%lamp%'
      AND text ilike '%research%'
      AND language = 'en'
    LIMIT 1;
                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..3734.02 rows=1 width=105) (actual time=87.473..87.474 rows=0 loops=1)
   ->  Seq Scan on document  (cost=0.00..3734.02 rows=1 width=105) (actual time=87.466..87.466 rows=0 loops=1)
         Filter: ((text ~~* '%field%'::text) AND (text ~~* '%window%'::text) AND (text ~~* '%lamp%'::text) AND (text ~~* '%research%'::text))
         Rows Removed by Filter: 100001
 Planning Time: 2.193 ms
 Execution Time: 87.500 ms

Full text search using ilike supported by trigram index

What is a trigram? See this example:

>> CREATE EXTENSION pg_trgm;
CREATE EXTENSION
>> select show_trgm('fielded');
                show_trgm
-----------------------------------------
 {"  f"," fi",ded,"ed ",eld,fie,iel,lde}

We can improve ilike performance using trigram index, e.g. gin_trgm_ops.

>> CREATE INDEX  ix_document_text_trigram ON document USING gin (text gin_trgm_ops) where language = 'en';
CREATE INDEX

>> EXPLAIN ANALYZE SELECT text, language
   FROM public.document
   WHERE
      text ilike '%field%'
      AND text ilike '%window%'
      AND text ilike '%lamp%'
      AND text ilike '%research%'
      AND language = 'en'
    LIMIT 1;
                                                                                       QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=176.00..180.02 rows=1 width=105) (actual time=1.473..1.474 rows=0 loops=1)
   ->  Bitmap Heap Scan on document  (cost=176.00..180.02 rows=1 width=105) (actual time=1.470..1.471 rows=0 loops=1)
         Recheck Cond: ((text ~~* '%field%'::text) AND (text ~~* '%window%'::text) AND (text ~~* '%lamp%'::text) AND (text ~~* '%research%'::text) AND ((language)::text = 'en'::text))
         ->  Bitmap Index Scan on ix_document_text_trigram  (cost=0.00..176.00 rows=1 width=0) (actual time=1.466..1.466 rows=0 loops=1)
               Index Cond: ((text ~~* '%field%'::text) AND (text ~~* '%window%'::text) AND (text ~~* '%lamp%'::text) AND (text ~~* '%research%'::text))
 Planning Time: 2.389 ms
 Execution Time: 1.524 ms

Create non-default language configuration for tsearch full text search

Postgres does not provide support for many languages by default. However, you can setup the configuration quite easily. You just need additional dictionary files. Here is an example for polish language. Polish dictionary files can be downloaded from: https://github.com/judehunter/polish-tsearch.

polish.affix, polish.stop and polish.dict files should be copied to postgresql tsearch_data location, e.g. /usr/share/postgresql/13/tsearch_data.

There also must be created a configuration (see the docs) inside database:

>> DROP TEXT SEARCH DICTIONARY IF EXISTS polish_hunspell CASCADE;
   CREATE TEXT SEARCH DICTIONARY polish_hunspell (
    TEMPLATE  = ispell,
    DictFile  = polish,
    AffFile   = polish,
    StopWords = polish
  );
  CREATE TEXT SEARCH CONFIGURATION public.polish (
    COPY = pg_catalog.english
  );
  ALTER TEXT SEARCH CONFIGURATION polish
    ALTER MAPPING
    FOR
        asciiword, asciihword, hword_asciipart,  word, hword, hword_part
    WITH
        polish_hunspell, simple;

You need these files and configuration because full text search engine uses lexeme comparing to find best matches (both query pattern and stored text are lexemized):

>> SELECT to_tsquery('english', 'fielded'), to_tsvector('english', text)
   FROM document
   LIMIT 1;
 to_tsquery |                                                                    to_tsvector
------------+----------------------------------------------------------------------------------------------------------------------------------------------------
 'field'    | '19':16 'bat':12 'dead':8 'degre':1 'depth':5 'field':15 'lamp':13 'men':6 'put':14 'ranch':2 'tall':4 'time':3 'underlin':11 'wast':10 'window':9

If you cannot provide dictionary files you can use full text in "simple" form (without transformation to lexeme):

>> SELECT to_tsquery('simple', 'fielded'), to_tsvector('simple', text)
   FROM document
   LIMIT 1;
 to_tsquery |                                                                             to_tsvector
------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 'fielded'  | '19':16 'bat':12 'below':7 'dead':8 'degree':1 'depth':5 'field':15 'lamp':13 'men':6 'putting':14 'ranch':2 'tall':4 'time':3 'underline':11 'waste':10 'window':9

Tsearch full text search without stored index

>> EXPLAIN ANALYZE SELECT text, language
   FROM public.document
   WHERE to_tsvector('english', text) @@ to_tsquery('english', 'fielded & window & lamp & depth & test ')
   LIMIT 1;
                                                                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1000.00..18298.49 rows=1 width=103) (actual time=489.802..491.352 rows=0 loops=1)
   ->  Gather  (cost=1000.00..18298.49 rows=1 width=103) (actual time=489.800..491.349 rows=0 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         ->  Parallel Seq Scan on document  (cost=0.00..17298.39 rows=1 width=103) (actual time=486.644..486.644 rows=0 loops=2)
               Filter: (((language)::text = 'en'::text) AND (to_tsvector('english'::regconfig, text) @@ '''field'' & ''window'' & ''lamp'' & ''depth'' & ''test'''::tsquery))
               Rows Removed by Filter: 50000
 Planning Time: 0.272 ms
 Execution Time: 491.376 ms
(9 rows)

Tsearch full text search with stored partial index ('en')

>> CREATE INDEX ix_en_document_tsvector_text ON public.document USING gin (to_tsvector('english'::regconfig, text)) WHERE language = 'en';
CREATED INDEX
>> EXPLAIN ANALYZE SELECT text, language
   FROM public.document
   WHERE to_tsvector('english', text) @@ to_tsquery('english', 'fielded & window & lamp & depth & test ')
   LIMIT 1;
                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1000.00..18151.43 rows=1 width=103) (actual time=487.120..488.569 rows=0 loops=1)
   ->  Gather  (cost=1000.00..18151.43 rows=1 width=103) (actual time=487.117..488.567 rows=0 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         ->  Parallel Seq Scan on document  (cost=0.00..17151.33 rows=1 width=103) (actual time=484.418..484.419 rows=0 loops=2)
               Filter: (to_tsvector('english'::regconfig, text) @@ '''field'' & ''window'' & ''lamp'' & ''depth'' & ''test'''::tsquery)
               Rows Removed by Filter: 50000
 Planning Time: 0.193 ms
 Execution Time: 488.596 ms

No difference? Index has not been used... Why is it not working? Ohh, looks to the partial index docs:

However, keep in mind that the predicate must match the conditions used in the queries that are supposed to benefit from the index. To be precise, a partial index can be used in a query only if the system can recognize that the WHERE condition of the query mathematically implies the predicate of the index. PostgreSQL does not have a sophisticated theorem prover that can recognize mathematically equivalent expressions that are written in different forms. (Not only is such a general theorem prover extremely difficult to create, it would probably be too slow to be of any real use.) The system can recognize simple inequality implications, for example "x < 1" implies "x < 2"; otherwise the predicate condition must exactly match part of the query's WHERE condition or the index will not be recognized as usable. Matching takes place at query planning time, not at run time. As a result, parameterized query clauses do not work with a partial index.

We have to add to query a condition that was used to create partial index: document.language = 'en':

>> EXPLAIN ANALYZE SELECT text, language
   FROM public.document
   WHERE
      to_tsvector('english', text) @@ to_tsquery('english', 'fielded & window & lamp & depth & test ')
      AND language = 'en'
   LIMIT 1;                                                                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=64.00..68.27 rows=1 width=103) (actual time=0.546..0.548 rows=0 loops=1)
   ->  Bitmap Heap Scan on document  (cost=64.00..68.27 rows=1 width=103) (actual time=0.544..0.545 rows=0 loops=1)
         Recheck Cond: ((to_tsvector('english'::regconfig, text) @@ '''field'' & ''window'' & ''lamp'' & ''depth'' & ''test'''::tsquery) AND ((language)::text = 'en'::text))
         ->  Bitmap Index Scan on ix_en_document_tsvector_text  (cost=0.00..64.00 rows=1 width=0) (actual time=0.540..0.540 rows=0 loops=1)
               Index Cond: (to_tsvector('english'::regconfig, text) @@ '''field'' & ''window'' & ''lamp'' & ''depth'' & ''test'''::tsquery)
 Planning Time: 0.244 ms
 Execution Time: 0.590 ms

Tsearch full text search with partial words

>> EXPLAIN ANALYZE SELECT text, language
   FROM public.document
   WHERE
      to_tsvector('english', text) @@ to_tsquery('english', 'fielded & window & l:*')
      AND language = 'en'
   LIMIT 1;
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on document  (cost=168.00..172.27 rows=1 width=102) (actual time=5.207..5.210 rows=4 loops=1)
   Recheck Cond: ((to_tsvector('english'::regconfig, text) @@ '''field'' & ''window'' & ''l'':*'::tsquery) AND ((language)::text = 'en'::text))
   Heap Blocks: exact=4
   ->  Bitmap Index Scan on ix_en_document_tsvector_text  (cost=0.00..168.00 rows=1 width=0) (actual time=5.202..5.202 rows=4 loops=1)
         Index Cond: (to_tsvector('english'::regconfig, text) @@ '''field'' & ''window'' & ''l'':*'::tsquery)
 Planning Time: 0.240 ms
 Execution Time: 5.240 ms

>> SELECT id,  text
   FROM public.document
   WHERE
      to_tsvector('english', text) @@ to_tsquery('english', 'fielded & window & l:*')
      AND language = 'en'
   LIMIT 20;
  id   |                                                   text
-------+-----------------------------------------------------------------------------------------------------------
     1 | degree ranch time tall depth men below dead window waste underline bat lamp putting field               +
 20152 | Law pony follow memory star whatever window sets oxygen longer word whom glass field actual              +
 21478 | Dried symbol willing design managed shade window pick share faster education drive field land everybody  +
 30293 | Pencil seen engineer labor image entire smallest serve field should riding smaller window imagine traffic+

Tsearch full text search results ranking

>> SELECT
     id,
     ts_rank_cd(to_tsvector('english', text), to_tsquery('english', 'fielded & wind:*')) rank,
     text
    FROM public.document
    WHERE to_tsvector('english', text) @@ to_tsquery('english', 'fielded & wind:*')
    ORDER BY rank DESC
    LIMIT 20;
   id   |    rank     |                                                   text
--------+-------------+-----------------------------------------------------------------------------------------------------------
 100002 |         0.1 | fielded window
   9376 |        0.05 | Own mouse girl effect surprise physical newspaper forgot eat upper field element window simply unhappy   +
  96597 |        0.05 | Opinion fastened pencil rear more theory size window heading field understanding farm up position attack +
  44626 | 0.033333335 | Symbol each halfway window swam spider field page shinning donkey chose until cow cabin congress         +
  80922 | 0.033333335 | Victory famous field shelter girl wind adventure he divide rear tip few studied ruler judge              +
  30293 |       0.025 | Pencil seen engineer labor image entire smallest serve field should riding smaller window imagine traffic+
      1 | 0.016666668 | degree ranch time tall depth men below dead window waste underline bat lamp putting field               +
  21478 | 0.016666668 | Dried symbol willing design managed shade window pick share faster education drive field land everybody  +
  60059 | 0.016666668 | However hungry make proud kids come willing field officer row above highest round wind mile              +
  26001 | 0.014285714 | Earth earlier pocket might sense window way frog fire court family mouth field somebody recognize        +
  20152 | 0.014285714 | Law pony follow memory star whatever window sets oxygen longer word whom glass field actual              +
  37470 |      0.0125 | Farm weight balloon buried wind water donkey grain pig week should damage field was he                   +
  49433 |        0.01 | Wind scientist leaving atom year bad child drink shore spirit field facing indicate wagon here           +
  37851 | 0.007142857 | Field cloud you wife rhythm upward applied weigh continued property replace ahead forgotten trip window  +

text='fielded window' record was added manually to show best match result.

GIST vs GIN

We have created GIN index. But there is also GIST index option. Which one is better? It depends...

>> EXPLAIN ANALYZE SELECT text, language
   FROM public.document
   WHERE
      to_tsvector('english', text) @@ to_tsquery('english', 'fielded & window & lamp & depth & test ')
      AND language = 'en'
   LIMIT 1;
                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.28..8.30 rows=1 width=103) (actual time=2.699..2.700 rows=0 loops=1)
   ->  Index Scan using ix_en_document_tsvector_text on document  (cost=0.28..8.30 rows=1 width=103) (actual time=2.697..2.697 rows=0 loops=1)
         Index Cond: (to_tsvector('english'::regconfig, text) @@ '''field'' & ''window'' & ''lamp'' & ''depth'' & ''test'''::tsquery)
 Planning Time: 0.274 ms
 Execution Time: 2.730 ms

GIN seems to be a little bit faster. I don't think I could explain it better than the docs already does:

In choosing which index type to use, GiST or GIN, consider these performance differences:

  • GIN index lookups are about three times faster than GiST
  • GIN indexes take about three times longer to build than GiST
  • GIN indexes are moderately slower to update than GiST indexes, but about 10 times slower if fast-update support was disabled (see Section 58.4.1 for details)
  • GIN indexes are two-to-three times larger than GiST indexes
Owner
Jarosław Orzeł
Backend developer with special interest in software design, architecture and system modelling.
Jarosław Orzeł
Python function to query SQLite files stored on S3

sqlite-s3-query Python function to query a SQLite file stored on S3. It uses multiple HTTP range requests per query to avoid downloading the entire fi

Michal Charemza 189 Dec 27, 2022
This is a simple graph database in SQLite, inspired by

This is a simple graph database in SQLite, inspired by "SQLite as a document database".

Denis Papathanasiou 1.2k Jan 03, 2023
Code for a db backend that relies on bash tools (grep, cat, echo, etc)

Simple-nosql-db is a python backend for a database that relies on unix tools such as cat, echo and grep. Funny enough I got the idea from this discuss

Sebastian Alonso 10 Aug 13, 2019
A Persistent Embedded Graph Database for Python

Cog - Embedded Graph Database for Python cogdb.io New release: 2.0.5! Installing Cog pip install cogdb Cog is a persistent embedded graph database im

Arun Mahendra 214 Dec 30, 2022
ClutterDB - Extremely simple JSON database made for infrequent changes which behaves like a dict

extremely simple JSON database made for infrequent changes which behaves like a dict this was made for ClutterBot

Clutter Development 1 Jan 12, 2022
Shelf DB is a tiny document database for Python to stores documents or JSON-like data

Shelf DB Introduction Shelf DB is a tiny document database for Python to stores documents or JSON-like data. Get it $ pip install shelfdb shelfquery S

Um Nontasuwan 35 Nov 03, 2022
A Painless Simple Way To Create Schema and Do Database Operations Quickly In Python

PainlessDB - Taking Your Pain away to the moon 🚀 Contribute · Community · Documentation 🎫 Introduction : PainlessDB is a Python-based free and open-

Aiden Ellis 3 Jul 15, 2022
Decentralised graph database management system

Decentralised graph database management system To get started clone the repo, and run the command below. python3 database.py Now, create a new termina

Omkar Patil 2 Apr 18, 2022
Лабораторные работы по Postgresql за 5 семестр

Практикум по Postgresql ERD для заданий 2.x: ERD для заданий 3.x: Их делал вот тут Ниже есть 2 инструкции — по установке postgresql на manjaro и по пе

Danila 10 Oct 31, 2022
A very simple document database

DockieDb A simple in-memory document database. Installation Build the Wheel Fork or clone this repository and run python setup.py bdist_wheel in the r

1 Jan 16, 2022
Codeqlcompile - 自动反编译闭源应用,创建codeql数据库

codeql_compile 自动反编译闭源应用,创建codeql数据库 准备 首先下载ecj.jar和idea提供反编译的java-decompiler.ja

236 Jan 05, 2023
EmployeeDB - Advanced Redis search functionalities on Python applied on an Employee management backend app

EmployeeDB - Advanced Redis search functionalities on Python applied on an Employee management backend app

Ahmad Bazzi 58 Oct 10, 2022
Metrics-advisor - Analyze reshaped metrics from TiDB cluster Prometheus and give some advice about anomalies and correlation.

metrics-advisor Analyze reshaped metrics from TiDB cluster Prometheus and give some advice about anomalies and correlation. Team freedeaths mashenjun

3 Jan 07, 2022
Oh-My-PickleDB is an open source key-value store using Python's json module.

OH-MY-PICKLEDB oh-my-pickleDB is a lightweight, fast, and intuitive data manager written in python 📝 Table of Contents About Getting Started Deployme

Adrián Toral 6 Feb 20, 2022
Tiny local JSON database for Python.

Pylowdb Simple to use local JSON database 🦉 # This is pure python, not specific to pylowdb ;) db.data['posts'] = ({ 'id': 1, 'title': 'pylowdb is awe

Hussein Sarea 3 Jan 26, 2022
securedb is a fast and lightweight Python framework to easily interact with JSON-based encrypted databases.

securedb securedb is a Python framework that lets you work with encrypted JSON databases. Features: newkey() to generate an encryption key write(key,

Filippo Romani 2 Nov 23, 2022
Migrate data from SQL to NoSQL easily

Migrate data from SQL to NoSQL easily Installation 💯 pip install sql2nosql --upgrade Dependencies 📢 For the package to work, it first needs "clients

Facundo Padilla 43 Mar 26, 2022
A Python wrapper API for operating and working with the Neo4j Graph Data Science (GDS) library

gdsclient This repo hosts the sources for gdsclient, a Python wrapper API for operating and working with the Neo4j Graph Data Science (GDS) library. g

Neo Technology 101 Jan 05, 2023
Given a metadata file with relevant schema, an SQL Engine can be run for a subset of SQL queries.

Mini-SQL-Engine Given a metadata file with relevant schema, an SQL Engine can be run for a subset of SQL queries. The query engine supports Project, A

Prashant Raj 1 Dec 03, 2021
MyReplitDB - the most simplistic and easiest wrapper to use for replit's database system.

MyReplitDB is the most simplistic and easiest wrapper to use for replit's database system. Installing You can install it from the PyPI Or y

kayle 4 Jul 03, 2022