Improving Representations via Similarities

embetter

warning

I like to build in public, but please don't expect anything yet. This is alpha stuff!

notes

Improving Representations via Similarities

The object to implement:

Embetter(multi_output=True, epochs=50, sampling_kwargs)
  .fit(X, y)
  .fit_sim(X1, X2, y_sim, weights)
  .partial_fit(X, y, classes, weights)
  .partial_fit_sim(X1, X2, y_sim, weights)
  .predict(X)
  .predict_proba(X)
  .predict_sim(X1, X2)
  .transform(X)
  .translate_X_y(X, y, classes=None)
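
The notes do not say how `predict_sim(X1, X2)` would score a pair. As a rough sketch only, assuming (this is not in the notes) that the score is the cosine similarity between the two representations, the pairwise call could look like this. The `cosine_similarity` helper is hypothetical, and the real object would presumably compare learned representations rather than raw inputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def predict_sim(X1, X2):
    """Score row i of X1 against row i of X2, one similarity per pair."""
    return [cosine_similarity(a, b) for a, b in zip(X1, X2)]


# Two pairs: an orthogonal pair and a pair pointing in the same direction.
sims = predict_sim([[1.0, 0.0], [1.0, 1.0]], [[0.0, 1.0], [2.0, 2.0]])
```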

Observation: especially when multi_output=True, there's an opportunity with regard to NaN y-values. We can simply choose which values to translate and which to ignore.
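
One possible reading of this observation, sketched in plain Python: for every output column, keep only the rows whose label is known and ignore the rest. The helper names `is_missing` and `split_known_labels` are made up for illustration and are not part of the planned API.

```python
import math


def is_missing(value):
    """Return True for None or a float NaN label."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def split_known_labels(X, y):
    """For each output column, keep only the rows whose label is known.

    X is a list of feature rows, y a list of label rows (one entry per output).
    Returns one (X_subset, y_subset) pair per output column.
    """
    n_outputs = len(y[0])
    per_output = []
    for j in range(n_outputs):
        keep = [i for i, row in enumerate(y) if not is_missing(row[j])]
        per_output.append(([X[i] for i in keep], [y[i][j] for i in keep]))
    return per_output


X = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
y = [[1, float("nan")], [0, 1], [float("nan"), 0]]
pairs = split_known_labels(X, y)
```

With this split, a row that lacks a label for one output can still contribute to the other outputs.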

Comments
  • [WIP] Feature/progress bar

    Fixes issue #20

    • [x] Adds progress bar to all text and image embedders.
    • [x] Tests for SentenceEncoder.
    • [ ] Use perfplot for progress bar?
    • [ ] Can we ensure fast NumPy vectorization while using a progress bar?
    opened by CarloLepelaars 5
  • [BUG] `device` should be attribute on `SentenceEncoder`

    The device argument in SentenceEncoder is not defined as an attribute. This leads to bugs when using it with sklearn. I encountered attribute errors when trying to print out a Pipeline representation that has SentenceEncoder as a component.

    Should be easy to fix by just adding self.device in SentenceEncoder.__init__. We can consider adding tests for text encoders so we can catch these errors beforehand.

    The scikit-learn development docs make it clear every argument should be defined as an attribute:

    every keyword argument accepted by __init__ should correspond to an attribute on the instance. Scikit-learn relies on this to find the relevant attributes to set on an estimator when doing model selection.

    Error message: AttributeError: 'SentenceEncoder' object has no attribute 'device'.
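
To illustrate why the missing attribute surfaces as an AttributeError, here is a simplified stand-in for the parameter lookup: read the names in the `__init__` signature, then fetch an attribute of the same name. This is a sketch of the idea, not scikit-learn's actual implementation, and `BrokenEncoder` and `FixedEncoder` are hypothetical classes.

```python
import inspect


def get_params(estimator):
    """Collect parameters by reading __init__ argument names off the instance."""
    signature = inspect.signature(type(estimator).__init__)
    names = [name for name in signature.parameters if name != "self"]
    return {name: getattr(estimator, name) for name in names}


class BrokenEncoder:
    def __init__(self, name="all-MiniLM-L6-v2", device=None):
        self.name = name  # device is accepted but never stored


class FixedEncoder:
    def __init__(self, name="all-MiniLM-L6-v2", device=None):
        self.name = name
        self.device = device  # every __init__ argument is now an attribute


try:
    get_params(BrokenEncoder())
    broken_error = None
except AttributeError as exc:
    broken_error = str(exc)

fixed_params = get_params(FixedEncoder())
```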

    Reproduction: Python 3.8 with embetter = "^0.2.2"

    from embetter.text import SentenceEncoder

    se = SentenceEncoder()
    repr(se)  # raises the AttributeError shown above
    

    Fix:

    Add self.device on SentenceEncoder

    class SentenceEncoder(EmbetterBase):
        .
        .
        def __init__(self, name="all-MiniLM-L6-v2", device=None):
            if not device:
                device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
            self.device = device
            self.name = name
            self.tfm = SBERT(name, device=self.device)
    
    opened by CarloLepelaars 4
  • Color Histograms - Additional Tricks

    The scikit-image approach described here could work pretty well as an implementation: https://danielmuellerkomorowska.com/2020/06/17/analyzing-image-histograms-with-scikit-image/

    The goal is to do something similar to what is explained here: https://www.pinecone.io/learn/color-histograms/

    opened by koaning 4
  • Support for word embeddings

    Hi,

    Do you think it would be a good idea to add support for static word embeddings (word2vec, glove, etc.)? The embedder would need:

    • A filename to a local embedding file (e.g., glove.6b.100d.txt)
    • Either a callable tokenizer or a regex string (i.e., the way scikit-learn's TfidfVectorizer splits words).
    • A (name of a) pooling function (e.g., "mean", "max", "sum").

    The second and third parameters could easily have sensible defaults, of course. If you think it's a good idea, I can open the PR sometime next week.

    Stéphan
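
A minimal sketch of the three ingredients listed in this proposal, in plain Python. It assumes a GloVe-style text format (a word followed by space-separated floats), uses the default token pattern of scikit-learn's TfidfVectorizer as the regex, and falls back to a zero vector when no token is known. The function names are hypothetical and the in-memory `toy_file` stands in for a real embedding file.

```python
import re


def load_vectors(lines):
    """Parse GloVe-style lines: a word followed by space-separated floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors


# Pooling functions applied to one embedding dimension at a time.
POOLERS = {
    "mean": lambda column: sum(column) / len(column),
    "max": max,
    "sum": sum,
}


def embed(text, vectors, dim, token_pattern=r"(?u)\b\w\w+\b", pooling="mean"):
    """Tokenize with a regex, look up each known token, pool per dimension."""
    tokens = re.findall(token_pattern, text.lower())
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return [0.0] * dim  # assumed fallback when no token has a vector
    pool = POOLERS[pooling]
    return [pool(column) for column in zip(*found)]


toy_file = ["good 1.0 0.0", "movie 0.0 2.0"]
vectors = load_vectors(toy_file)
mean_vec = embed("Good movie!", vectors, dim=2)
sum_vec = embed("Good movie!", vectors, dim=2, pooling="sum")
```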

    opened by stephantul 3
  • [FEATURE] SpaCyEmbedder

    I think it would be nice to add an embedder that can easily vectorize text through spaCy. I already have an implementation class for this and would be happy to contribute it here.

    spaCy docs on Doc.vector: https://spacy.io/api/doc#vector

    Example code for single string:

    import spacy
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("This here text")
    doc.vector
    
    opened by CarloLepelaars 2
  • `get_feature_names_out` for encoders

    I would be happy to implement get_feature_names_out for all the Embetter objects. I will implement them by just adding a new method (without a Mixin).

    opened by CarloLepelaars 1
  • Remove the classification layer in timm models

    I was playing a bit with the library and found out that the TimmEncoder returns 1000-dimensional vectors for all the models I selected. That is caused by returning the state of the last FC classification layer and the fact all of the models were trained on ImageNet with 1000 classes. In practice, it's typically replaced with identity.

    Are there any reasons for returning the state of that last layer as an embedding? I'd be happy to submit a PR fixing that.

    opened by kacperlukawski 1
  • xception mobilenet

    https://keras.io/api/applications/

    https://www.tensorflow.org/api_docs/python/tf/keras/applications/mobilenet_v2/MobileNetV2

    https://www.tensorflow.org/api_docs/python/tf/keras/applications/xception/Xception

    opened by koaning 0
  • 'SentenceEncoder' object has no attribute 'device'

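    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    from embetter.grab import ColumnGrabber
    from embetter.text import SentenceEncoder

    # Grab the "text" column and embed it with a sentence encoder.
    text_emb_pipeline = make_pipeline(
      ColumnGrabber("text"),
      SentenceEncoder('all-MiniLM-L6-v2')
    )

    # This pipeline can also be trained to make predictions, using
    # the embedded features.
    text_clf_pipeline = make_pipeline(
      text_emb_pipeline,
      LogisticRegression()
    )

    dataf = pd.DataFrame({
      "text": ["positive sentiment", "super negative"],
      "label_col": ["pos", "neg"]
    })

    X = text_emb_pipeline.fit_transform(dataf, dataf['label_col'])
    text_clf_pipeline.fit(dataf, dataf['label_col'])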
    

    This code gives this error: 'SentenceEncoder' object has no attribute 'device'

    opened by nicholas-dinicola 6
Releases(0.2.2)
Owner
vincent d warmerdam
Solving problems involving data. Mostly NLP these days. AskMeAnything[tm].