Improving Representations via Similarities

Related tags

Miscellaneousembetter
Overview

embetter

warning

I like to build in public, but please don't expect anything yet. This is alpha stuff!

notes

Improving Representations via Similarities

The object to implement:

Embetter(multi_output=True, epochs=50, sampling_kwargs)
  .fit(X, y)
  .fit_sim(X1, X2, y_sim, weights)
  .partial_fit(X, y, classes, weights)
  .partial_fit_sim(X1, X2, y_sim, weights)
  .predict(X)
  .predict_proba(X)
  .predict_sim(X1, X2)
  .transform(X)
  .translate_X_y(X, y, classes=none)

Observation: especially when multi_output=True there's an opportunity with regards to NaN y-values. We can simply choose with values to translate and which to ignore.

Comments
  • [WIP] Feature/progress bar

    [WIP] Feature/progress bar

    Fixes issue #20

    • [x] Adds progress bar to all text and image embedders.
    • [x] Tests for SentenceEncoder.
    • [ ] Use perfplot for progress bar?
    • [ ] Can we ensure fast NumPy vectorization while using a progress bar?
    opened by CarloLepelaars 5
  • [BUG] `device` should be attribute on `SentenceEncoder`

    [BUG] `device` should be attribute on `SentenceEncoder`

    The device argument in SentenceEncoder is not defined as an attribute. This leads to bugs when using it with sklearn. I encountered attribute errors when trying to print out a Pipeline representation that has SentenceEncoder as a component.

    Should be easy to fix by just adding self.device in SentenceEncoder.__init__. We can consider adding tests for text encoders so we can catch these errors beforehand.

    The scikit-learn development docs make it clear every argument should be defined as an attribute:

    every keyword argument accepted by init should correspond to an attribute on the instance. Scikit-learn relies on this to find the relevant attributes to set on an estimator when doing model selection.

    Error message: AttributeError: 'SentenceEncoder' object has no attribute 'device'.

    Reproduction: Python 3.8 with embetter = "^0.2.2"

    se = SentenceEncoder()
    repr(se)
    

    Fix:

    Add self.device on SentenceEncoder

    class SentenceEncoder(EmbetterBase):
        .
        .
        def __init__(self, name="all-MiniLM-L6-v2", device=None):
            if not device:
                device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
            self.device = device
            self.name = name
            self.tfm = SBERT(name, device=self.device)
    
    opened by CarloLepelaars 4
  • Color Histograms - Additional Tricks

    Color Histograms - Additional Tricks

    This approach could work pretty well as an implementation: https://danielmuellerkomorowska.com/2020/06/17/analyzing-image-histograms-with-scikit-image/

    To do something similar to what is explained here: https://www.pinecone.io/learn/color-histograms/

    opened by koaning 4
  • Support for word embeddings

    Support for word embeddings

    Hi,

    Do you think it would be a good idea to add support for static word embeddings (word2vec, glove, etc.)? The embedder would need:

    • A filename to a local embedding file (e.g., glove.6b.100d.txt)
    • Either a callable tokenizer or regex string (i.e., the way sci-kit learn's TfIdfVectorizer splits words).
    • A (name of a) pooling function (e.g., "mean", "max", "sum").

    The second and third parameters could easily have sensible defaults, of course. If you think it's a good idea, I can do the PR somewhere next week.

    Stéphan

    opened by stephantul 3
  • [FEATURE] SpaCyEmbedder

    [FEATURE] SpaCyEmbedder

    I think it would be a nice addition to add an embedder that can easily vectorize text through SpaCy. I already have an implementation class for this and would be happy to contribute it here.

    SpaCy Docs on vector: https://spacy.io/api/doc#vector

    Example code for single string:

    import spacy
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("This here text")
    doc.vector
    
    opened by CarloLepelaars 2
  • `get_feature_names_out` for encoders

    `get_feature_names_out` for encoders

    I would be happy to implement get_feature_names_out for all the Embetter objects. I will implement them by just adding a new method (without a Mixin).

    opened by CarloLepelaars 1
  • Remove the classification layer in timm models

    Remove the classification layer in timm models

    I was playing a bit with the library and found out that the TimmEncoder returns 1000-dimensional vectors for all the models I selected. That is caused by returning the state of the last FC classification layer and the fact all of the models were trained on ImageNet with 1000 classes. In practice, it's typically replaced with identity.

    Are there any reasons for returning the state of that last layer as an embedding? I'd be happy to submit a PR fixing that.

    opened by kacperlukawski 1
  • xception mobilenet

    xception mobilenet

    https://keras.io/api/applications/

    https://www.tensorflow.org/api_docs/python/tf/keras/applications/mobilenet_v2/MobileNetV2 https://www.tensorflow.org/api_docs/python/tf/keras/applications/xception/Xception

    opened by koaning 0
  • 'SentenceEncoder' object has no attribute 'device'

    'SentenceEncoder' object has no attribute 'device'

    text_emb_pipeline = make_pipeline(
      ColumnGrabber("text"),
      SentenceEncoder('all-MiniLM-L6-v2')
    )
    
    # This pipeline can also be trained to make predictions, using
    # the embedded features. 
    text_clf_pipeline = make_pipeline(
      text_emb_pipeline,
      LogisticRegression()
    )
    
    dataf = pd.DataFrame({
      "text": ["positive sentiment", "super negative"],
      "label_col": ["pos", "neg"]
    })
    
    X = text_emb_pipeline.fit_transform(dataf, dataf['label_col'])
    text_clf_pipeline.fit(dataf, dataf['label_col'])
    

    This code gives this error: 'SentenceEncoder' object has no attribute 'device'

    opened by nicholas-dinicola 6
Releases(0.2.2)
Owner
vincent d warmerdam
Solving problems involving data. Mostly NLP these days. AskMeAnything[tm].
vincent d warmerdam
Distribute PySPI jobs across a PBS cluster

Distribute PySPI jobs across a PBS cluster This repository contains scripts for distributing PySPI jobs across a PBS-type cluster. Each job will conta

Oliver Cliff 1 Feb 10, 2022
Tool to automate the enumeration of a website (CTF)

had4ctf Tool to automate the enumeration of a website (CTF) DISCLAIMER: THE TOOL HAS BEEN DEVELOPED SOLELY FOR EDUCATIONAL PURPOSE ,I WILL NOT BE LIAB

Had 2 Oct 24, 2021
A beautiful and useful prompt for your shell

A Powerline style prompt for your shell A beautiful and useful prompt generator for Bash, ZSH, Fish, and tcsh: Shows some important details about the

Buck Ryan 6k Jan 08, 2023
Direct Multi-view Multi-person 3D Human Pose Estimation

Implementation of NeurIPS-2021 paper: Direct Multi-view Multi-person 3D Human Pose Estimation [paper] [video-YouTube, video-Bilibili] [slides] This is

Sea AI Lab 253 Jan 05, 2023
This is a pretty basic but relatively nice looking Python Pomodoro Timer.

Python Pomodoro-Timer This is a pretty basic but relatively nice looking Pomodoro Timer. Currently its set to a very basic mode, but the funcationalit

EmmHarris 2 Oct 18, 2021
Viewer for NFO files

NFO Viewer NFO Viewer is a simple viewer for NFO files, which are "ASCII" art in the CP437 codepage. The advantages of using NFO Viewer instead of a t

Osmo Salomaa 114 Dec 29, 2022
skimpy is a light weight tool that provides summary statistics about variables in data frames within the console.

skimpy Welcome Welcome to skimpy! skimpy is a light weight tool that provides summary statistics about variables in data frames within the console. Th

267 Dec 29, 2022
Python meta class and abstract method library with restrictions.

abcmeta Python meta class and abstract method library with restrictions. This library provides a restricted way to validate abstract methods. The Pyth

Morteza NourelahiAlamdari 8 Dec 14, 2022
App to get data from popular polish pages with job offers

Job board parser I written simple app to get me data from popular pages with job offers, because I wanted to knew immidietly if there is some new offe

0 Jan 04, 2022
Python Commodore BBS multi-client

python-cbm-bbs-petscii Python Commodore BBS multi-client This is intended for commodore 64, c128 and most commodore compatible machines (as the new Co

7 Sep 16, 2022
Easy way to build a SaaS application using Python and Dash

EasySaaS This project will be attempt to make a great starting point for your next big business as easy and efficent as possible. This project will cr

xianhu 3 Nov 17, 2022
A python library with various gambling and gaming classes

gamble is a simple library that implements a collection of some common gambling-related classes Features die, dice, d-notation cards, decks, hands pok

Jacobi Petrucciani 16 May 24, 2022
Whatsapp Messenger master

Whatsapp Messenger master

Swarup Kharul 5 Nov 21, 2021
pybicyclewheel calulates the required spoke length for bicycle wheels

pybicyclewheel pybicyclewheel calulates the required spoke length for bicycle wheels. (under construcion) - homepage further readings wikipedia bicyc

karl 0 Aug 24, 2022
An esoteric programming language that supports concurrency, regex, and web requests.

The Hofstadter Esoteric Programming Language Hofstadter's Law: It always takes longer than you expect, even when you take into account Hofstadter's La

Austin Henley 19 Dec 27, 2022
Google Foobar challenge solutions from my experience and other's on the web.

Google Foobar challenge Google Foobar challenge solutions from my experience and other's on the web. Note: Problems indicated with "Mine" are tested a

Islam Ayman 6 Jan 20, 2022
Простенький ботик для троллинга с интерфейсом #Yakima_Visus

Bot-Trolling-Vk Простенький ботик для троллинга с интерфейсом #Yakima_Visus Установка pip install vk_api pip install requests если там еще чото будет

Yakima Visus 4 Oct 11, 2022
🍏 Make Thinc faster on macOS by calling into Apple's native Accelerate library

🍏 Make Thinc faster on macOS by calling into Apple's native Accelerate library

Explosion 81 Nov 26, 2022
Make your Discord Account Online 24/7!

Online-Forever Make your Discord Account Online 24/7! A Code written in Python that helps you to keep your account 24/7. The main.py is the main file.

SealedSaucer 0 Mar 16, 2022
A server shell for you to play with Powered by Django + Nginx + Postgres + Bootstrap + Celery.

A server shell for you to play with Powered by Django + Nginx + Postgres + Bootstrap + Celery.

Mengting Song 1 Jan 10, 2022