Convert a collection of features to a fixed-dimensional matrix using the hashing trick.

Overview

FeatureHasher

Convert a collection of features to a fixed-dimensional matrix using the hashing trick.

Note, this requires Jina>=2.2.4.

Example

Here I use FeatureHasher to hash each sentence of Pride and Prejudice into a 128-dim vector, and then use .match to find top-K similar sentences.

from jina import Document, DocumentArray, Flow

# load 
   
d = Document(uri='https://www.gutenberg.org/files/1342/1342-0.txt').convert_uri_to_text()

# cut into non-empty sentences store in a DA
da = DocumentArray(Document(text=s.strip()) for s in d.text.split('\n') if s.strip())

# use FeatureHasher in a Flow
f = Flow().add(uses='jinahub://FeatureHasher')

embed_da = DocumentArray()
with f:
    f.post('/', da, on_done=lambda req: embed_da.extend(req.docs), show_progress=True)

print('self-matching...')
embed_da.match(embed_da, exclude_self=True, limit=5, normalization=(1, 0))
print('total sentences: ', len(embed_da))
for d in embed_da:
    print(d.text)
    for m in d.matches:
        print(m.scores['cosine'], m.text)
    input()
           [email protected][I]:🎉 Flow is ready to use!
	🔗 Protocol: 		GRPC
	🏠 Local access:	0.0.0.0:52628
	🔒 Private network:	192.168.178.31:52628
	🌐 Public address:	217.70.138.123:52628
⠹       DONE ━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:00:01 100% ETA: 0 seconds 40 steps done in 1 second
total sentences:  12153
The Project Gutenberg eBook of Pride and Prejudice, by Jane Austen

   
     *** END OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***

    
      *** START OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***

     
       production, promotion and distribution of Project Gutenberg-tm

      
        Pride and Prejudice

       
         By Jane Austen This eBook is for the use of anyone anywhere in the United States and 
        
          This eBook is for the use of anyone anywhere in the United States and 
         
           by the awkwardness of the application, and at length wholly 
          
            Elizabeth passed the chief of the night in her sister’s room, and 
           
             the happiest memories in the world. Nothing of the past was 
            
              charities and charitable donations in all 50 states of the United 
            
           
          
         
        
       
      
     
    
   

In practice, you can implement matching and storing via an indexer inside Flow. This example is only for demo purpose so any non-feature hashing related ops are implemented outside the Flow to avoid distraction.

Owner
Jina AI
A Neural Search Company. We provide the cloud-native neural search solution powered by state-of-the-art AI technology.
Jina AI
Profil3r is an OSINT tool that allows you to find potential profiles of a person on social networks, as well as their email addresses 🕵️

Profil3r is an OSINT tool that allows you to find potential profiles of a person on social networks, as well as their email addresses. This program also alerts you to the presence of a data leak for

1.1k Aug 24, 2021
PoC encrypted diary in Python 3

Encrypted diary Sample program to store confidential data. Provides encryption in the form of AES-256 with bcrypt KDF. Does not provide authentication

1 Dec 25, 2021
Installation of hacking tools

Tools-Spartan This is a program that makes it easy for you to download and install tools used in Kali Linux, there are tons of tools available.

1 Nov 10, 2021
Static Token And Credential Scanner

Static Token And Credential Scanner What is it? STACS is a YARA powered static credential scanner which suports binary file formats, analysis of neste

STACS 81 Dec 27, 2022
Deltaspy - an advanced keylogger that can send keylogs and screenshots to gmail

Deltaspy Deltaspy is a advanced keylogger which sends keylogs and screenshot to

Praanesh S 1 Dec 31, 2021
🔐 A simple command-line password manager.

PassVault What Is It? It is a command-line password manager, for educational purposes, that stores localy, in AES encryption, your sensitives datas in

5 Aug 15, 2022
Proof of concept to check if hosts are vulnerable to CVE-2021-41773

CVE-2021-41773 PoC Proof of concept to check if hosts are vulnerable to CVE-2021-41773. Description (https://cve.mitre.org/cgi-bin/cvename.cgi?name=CV

Jordan Jay 43 Nov 09, 2022
CVE-2021-22205& GitLab CE/EE RCE

Vuln Impact An issue has been discovered in GitLab CE/EE affecting all versions starting from 11.9. GitLab was not properly validating image files tha

Al1ex 213 Dec 30, 2022
🔍 IRIS: An open-source intelligence framework

IRIS is an open-source OSINT framework, consisting of modules to find information about a target by scraping sites and fetching data from APIs.

IRIS 79 Dec 20, 2022
A python based tool that executes various CVEs to gain root privileges as root on various MAC OS platforms.

MacPer A python based tool that executes various CVEs to gain root privileges as root on various MAC OS platforms. Not all of the exploits directly sp

20 Nov 30, 2022
Bypass ReCaptcha: A Python script for dealing with recaptcha

Bypass ReCaptcha Bypass ReCaptcha is a Python script for dealing with recaptcha.

Marcos Camargo 1 Jan 11, 2022
Lite version of my Gatekeeper backdoor for public use.

MayorSec Backdoor Fully functioning bind-type backdoor This backdoor is a fully functioning bind shell and lite version of my full functioning Gatekee

Joe Helle 56 Mar 25, 2022
USSR-Scanner - USSR Scanner with python

Purposes ? Hey there is abosolutely no need to do this we do it only to irritate

Binary.club 2 Jan 24, 2022
Caretaker 2 Jun 06, 2022
自动化爆破子域名,并遍历所有端口寻找http服务,并使用crawlergo、dirsearch、xray等工具扫描并集成报告;支持动态添加扫描到的域名至任务;

AutoScanner AutoScanner是什么 AutoScanner是一款自动化扫描器,其功能主要是遍历所有子域名、及遍历主机所有端口寻找出所有http服务,并使用集成的工具进行扫描,最后集成扫描报告; 工具目前有:oneforall、masscan、nmap、crawlergo、dirse

633 Dec 30, 2022
Spray365 is a password spraying tool that identifies valid credentials for Microsoft accounts (Office 365 / Azure AD).

What is Spray365? Spray365 is a password spraying tool that identifies valid credentials for Microsoft accounts (Office 365 / Azure AD). How is Spray3

Mark Hedrick 246 Dec 28, 2022
DNS hijacking via dead records automation tool

DeadDNS Multi-threaded DNS hijacking via dead records automation tool How it works 1) Dig provided subdomains file for dead DNS records. 2) Dig the fo

45 Dec 20, 2022
Confluence Server Webwork OGNL injection

CVE-2021-26084 - Confluence Server Webwork OGNL injection An OGNL injection vulnerability exists that would allow an authenticated user and in some in

Fellipe Oliveira 295 Jan 06, 2023
BOF-Roaster is an automated buffer overflow exploit machine which is begin written with Python 3.

BOF-Roaster is an automated buffer overflow exploit machine which is begin written with Python 3. On first release it was able to successfully break many of the most well-known buffer overflow exampl

Kaan Caglan 5 Nov 23, 2021
🎻 Modularized exploit generation framework

Modularized exploit generation framework for x86_64 binaries Overview This project is still at early stage of development, so you might want to come b

ᴀᴇꜱᴏᴘʜᴏʀ 30 Jan 17, 2022