Convert a collection of features to a fixed-dimensional matrix using the hashing trick.

Overview

FeatureHasher

Convert a collection of features to a fixed-dimensional matrix using the hashing trick.

Note, this requires Jina>=2.2.4.

Example

Here I use FeatureHasher to hash each sentence of Pride and Prejudice into a 128-dim vector, and then use .match to find top-K similar sentences.

from jina import Document, DocumentArray, Flow

# load 
   
d = Document(uri='https://www.gutenberg.org/files/1342/1342-0.txt').convert_uri_to_text()

# cut into non-empty sentences store in a DA
da = DocumentArray(Document(text=s.strip()) for s in d.text.split('\n') if s.strip())

# use FeatureHasher in a Flow
f = Flow().add(uses='jinahub://FeatureHasher')

embed_da = DocumentArray()
with f:
    f.post('/', da, on_done=lambda req: embed_da.extend(req.docs), show_progress=True)

print('self-matching...')
embed_da.match(embed_da, exclude_self=True, limit=5, normalization=(1, 0))
print('total sentences: ', len(embed_da))
for d in embed_da:
    print(d.text)
    for m in d.matches:
        print(m.scores['cosine'], m.text)
    input()
           [email protected][I]:πŸŽ‰ Flow is ready to use!
	πŸ”— Protocol: 		GRPC
	🏠 Local access:	0.0.0.0:52628
	πŸ”’ Private network:	192.168.178.31:52628
	🌐 Public address:	217.70.138.123:52628
β Ή       DONE ━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:00:01 100% ETA: 0 seconds 40 steps done in 1 second
total sentences:  12153
ο»ΏThe Project Gutenberg eBook of Pride and Prejudice, by Jane Austen

   
     *** END OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***

    
      *** START OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***

     
       production, promotion and distribution of Project Gutenberg-tm

      
        Pride and Prejudice

       
         By Jane Austen This eBook is for the use of anyone anywhere in the United States and 
        
          This eBook is for the use of anyone anywhere in the United States and 
         
           by the awkwardness of the application, and at length wholly 
          
            Elizabeth passed the chief of the night in her sister’s room, and 
           
             the happiest memories in the world. Nothing of the past was 
            
              charities and charitable donations in all 50 states of the United 
            
           
          
         
        
       
      
     
    
   

In practice, you can implement matching and storing via an indexer inside Flow. This example is only for demo purpose so any non-feature hashing related ops are implemented outside the Flow to avoid distraction.

Owner
Jina AI
A Neural Search Company. We provide the cloud-native neural search solution powered by state-of-the-art AI technology.
Jina AI
BOF-Roaster is an automated buffer overflow exploit machine which is begin written with Python 3.

BOF-Roaster is an automated buffer overflow exploit machine which is begin written with Python 3. On first release it was able to successfully break many of the most well-known buffer overflow exampl

Kaan Caglan 5 Nov 23, 2021
Internationalized Domain Names for Python (IDNA 2008 and UTS #46)

Internationalized Domain Names in Applications (IDNA) Support for the Internationalised Domain Names in Applications (IDNA) protocol as specified in R

Kim Davies 204 Dec 13, 2022
IDA scripts for hypervisor (Hyper-v) analysis and reverse engineering automation

Re-Scripts IA32-VMX-Helper (IDA-Script) IA32-MSR-Decoder (IDA-Script) IA32 VMX Helper It's an IDA script (Updated IA32 MSR Decoder) which helps you to

Behrooz Abbassi 16 Oct 08, 2022
Web Headers Security Scanner

Web Headers Security Scanner

Emre Koybasi 3 Dec 16, 2022
A semi-automatic osint/recon framework.

Smog Framework A semi-automatic osint/recon framework. Requirements git Python = 3.8 How to use it

toast 22 Oct 17, 2022
A python module for retrieving and parsing WHOIS data

pythonwhois A WHOIS retrieval and parsing library for Python. Dependencies None! All you need is the Python standard library. Instructions The manual

Sven Slootweg 384 Dec 23, 2022
Get important strings inside [Info.plist] & and Binary file also all output of result it will be saved in [app_binary].json , [app_plist_file].json file

Get important strings inside [Info.plist] & and Binary file also all output of result it will be saved in [app_binary].json , [app_plist_file].json file

12 Sep 28, 2022
A signature parser for hikari's command handler tanjun.

tanchi A signature parser for hikari's command handler tanjun. Finally be able to define your commands without those bloody decorator chains! Example

sadru 11 Nov 17, 2022
Static Token And Credential Scanner

Static Token And Credential Scanner What is it? STACS is a YARA powered static credential scanner which suports binary file formats, analysis of neste

STACS 81 Dec 27, 2022
The Modern Hash Identification System

πŸ”— Don't know what type of hash it is? Name That Hash will name that hash type! πŸ€– Identify MD5, SHA256 and 3000+ other hashes β˜„ Comes with a neat web app πŸ”₯

1.2k Dec 28, 2022
Colin O'Flynn's Hacakday talk at Remoticon 2021 support repo.

Hardware Hacking Resources This repo holds some of the examples used in Colin's Hardware Hacking talk at Remoticon 2021. You can see the very sketchy

Colin O'Flynn 19 Sep 12, 2022
A simple python code for hacking profile views

This code for hacking profile views. Not recommended to adding profile views in profile. This code is not illegal code. This code is for beginners.

Fayas Noushad 3 Nov 28, 2021
A Burp Pro extension that adds log4shell checks to Burp Scanner

scan4log4shell A Burp Pro extension that adds log4shell checks to Burp Scanner, written by Daniel Crowley of IBM X-Force Red. Installation To install

X-Force Red 26 Mar 15, 2022
CVE-2021-26855: PoC (Not a HoneyPoC for once!)

Exch-CVE-2021-26855 ProxyLogon is the formally generic name for CVE-2021-26855, a vulnerability on Microsoft Exchange Server that allows an attacker b

ZephrFish 24 Nov 14, 2022
log4j2 dos exploit,CVE-2021-45105 exploit,Denial of Service poc

说明 about author: ζˆ‘θΆ…ζ€•ηš„ blog: https://www.cnblogs.com/iAmSoScArEd/ github: https://github.com/iAmSOScArEd/ date: 2021-12-20 log4j2 dos exploit log4j2 do

3 Aug 13, 2022
A Static Analysis Tool for Detecting Security Vulnerabilities in Python Web Applications

This project is no longer maintained March 2020 Update: Please go see the amazing Pysa tutorial that should get you up to speed finding security vulne

2.1k Dec 25, 2022
LdapRelayScan - Check for LDAP protections regarding the relay of NTLM authentication

LDAP Relay Scan A tool to check Domain Controllers for LDAP server protections r

315 Dec 18, 2022
Js File Scanner This is Js File Scanner

Js File Scanner This is Js File Scanner . Which are scan in js file and find juicy information Toke,Password Etc.

122 Dec 12, 2022
telegram bug that discloses user's hidden phone number (still unpatched) (exploit included)

CVE-2019-15514 Type: Information Disclosure Affected Users, Versions, Devices: All Telegram Users Still not fixed/unpatched. brute.py is available exp

Gray Programmerz 66 Dec 08, 2022