Improving Representations via Similarities

embetter

warning

I like to build in public, but please don't expect anything yet. This is alpha stuff!

notes

Improving Representations via Similarities

The object to implement:

Embetter(multi_output=True, epochs=50, **sampling_kwargs)
  .fit(X, y)
  .fit_sim(X1, X2, y_sim, weights)
  .partial_fit(X, y, classes, weights)
  .partial_fit_sim(X1, X2, y_sim, weights)
  .predict(X)
  .predict_proba(X)
  .predict_sim(X1, X2)
  .transform(X)
  .translate_X_y(X, y, classes=None)

Observation: especially when multi_output=True, there's an opportunity with regard to NaN y-values. We can simply choose which values to translate and which to ignore.
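One way to read this observation, as a minimal sketch: with several output columns, each output keeps only the rows that actually carry a label, and NaN entries are skipped. The helper name `split_labelled_rows` and the list-of-rows layout are assumptions made for illustration; they are not part of embetter.

```python
import math


def is_missing(value):
    """Return True for None or a float NaN label."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def split_labelled_rows(X, y):
    """Hypothetical helper: for each output column of y, keep only labelled rows.

    X is a list of samples, y is a list of rows with one label per output.
    Returns {output_index: (X_subset, y_subset)}.
    """
    n_outputs = len(y[0])
    per_output = {}
    for j in range(n_outputs):
        # Indices of the rows whose label for output j is present.
        keep = [i for i, row in enumerate(y) if not is_missing(row[j])]
        per_output[j] = ([X[i] for i in keep], [y[i][j] for i in keep])
    return per_output


X = ["good", "bad", "meh"]
y = [[1.0, float("nan")], [0.0, 1.0], [float("nan"), 0.0]]
result = split_labelled_rows(X, y)
```

Each output then trains on its own subset, so a row with a partial label still contributes to the outputs it does describe.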

Comments
  • [WIP] Feature/progress bar

    Fixes issue #20

    • [x] Adds progress bar to all text and image embedders.
    • [x] Tests for SentenceEncoder.
    • [ ] Use perfplot for progress bar?
    • [ ] Can we ensure fast NumPy vectorization while using a progress bar?
    opened by CarloLepelaars 5
  • [BUG] `device` should be attribute on `SentenceEncoder`

    The device argument in SentenceEncoder is not defined as an attribute. This leads to bugs when using it with sklearn. I encountered attribute errors when trying to print out a Pipeline representation that has SentenceEncoder as a component.

    Should be easy to fix by just adding self.device in SentenceEncoder.__init__. We can consider adding tests for text encoders so we can catch these errors beforehand.

    The scikit-learn development docs make it clear every argument should be defined as an attribute:

    every keyword argument accepted by init should correspond to an attribute on the instance. Scikit-learn relies on this to find the relevant attributes to set on an estimator when doing model selection.
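The convention can be illustrated with a simplified stand-in, assuming nothing beyond the standard library. `MiniEstimator` below is not scikit-learn's code; it only imitates the idea that parameter names are read from the `__init__` signature and then looked up as attributes of the same name. `BrokenEncoder` and `FixedEncoder` are hypothetical classes for the demonstration.

```python
import inspect


class MiniEstimator:
    """Simplified stand-in for an estimator base class (not scikit-learn's code)."""

    def get_params(self):
        # Discover parameter names from the __init__ signature...
        names = [
            name
            for name in inspect.signature(type(self).__init__).parameters
            if name != "self"
        ]
        # ...and read each one back from the instance under the same name.
        return {name: getattr(self, name) for name in names}

    def __repr__(self):
        args = ", ".join(f"{k}={v!r}" for k, v in self.get_params().items())
        return f"{type(self).__name__}({args})"


class BrokenEncoder(MiniEstimator):
    def __init__(self, name="all-MiniLM-L6-v2", device=None):
        self.name = name  # `device` is accepted but never stored


class FixedEncoder(MiniEstimator):
    def __init__(self, name="all-MiniLM-L6-v2", device=None):
        self.name = name
        self.device = device  # stored under the same name as the argument


try:
    repr(BrokenEncoder())
    broken_error = None
except AttributeError as exc:
    broken_error = str(exc)

fixed_repr = repr(FixedEncoder())
```

In this sketch the broken class fails as soon as its representation is built, which mirrors the failure described in this issue when a Pipeline is printed.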

    Error message: AttributeError: 'SentenceEncoder' object has no attribute 'device'.

    Reproduction: Python 3.8 with embetter = "^0.2.2"

    from embetter.text import SentenceEncoder

    se = SentenceEncoder()
    repr(se)
    

    Fix:

    Add self.device on SentenceEncoder

    class SentenceEncoder(EmbetterBase):
        .
        .
        def __init__(self, name="all-MiniLM-L6-v2", device=None):
            if not device:
                device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
            self.device = device
            self.name = name
            self.tfm = SBERT(name, device=self.device)
    
    opened by CarloLepelaars 4
  • Color Histograms - Additional Tricks

    This approach could work pretty well as an implementation: https://danielmuellerkomorowska.com/2020/06/17/analyzing-image-histograms-with-scikit-image/

    To do something similar to what is explained here: https://www.pinecone.io/learn/color-histograms/
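As a rough illustration of the idea, and not the implementation from either link, a color histogram embedding can be built by binning each channel separately and concatenating the normalised counts. The function name `color_histogram`, the `(r, g, b)` tuple input, and the default bin count are assumptions for this sketch.

```python
def color_histogram(pixels, bins=4):
    """Hypothetical helper: concatenate normalised per-channel histograms.

    pixels is an iterable of (r, g, b) tuples with integer values in 0..255.
    The result has 3 * bins entries: red bins, then green bins, then blue bins.
    """
    pixels = list(pixels)
    hist = [0.0] * (3 * bins)
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            # Map 0..255 onto bin indices 0..bins-1.
            bin_index = min(value * bins // 256, bins - 1)
            hist[channel * bins + bin_index] += 1.0
    n = max(len(pixels), 1)  # avoid dividing by zero on an empty image
    return [count / n for count in hist]


# Two red pixels and two blue pixels.
image = [(255, 0, 0), (255, 0, 0), (0, 0, 255), (0, 0, 255)]
features = color_histogram(image, bins=2)
```

Normalising by the pixel count keeps images of different sizes comparable.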

    opened by koaning 4
  • Support for word embeddings

    Hi,

    Do you think it would be a good idea to add support for static word embeddings (word2vec, glove, etc.)? The embedder would need:

    • A filename to a local embedding file (e.g., glove.6b.100d.txt)
    • Either a callable tokenizer or a regex string (i.e., the way scikit-learn's TfidfVectorizer splits words).
    • A (name of a) pooling function (e.g., "mean", "max", "sum").

    The second and third parameters could easily have sensible defaults, of course. If you think it's a good idea, I can open a PR sometime next week.
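The three parameters listed above could fit together roughly as follows. This is a standard-library sketch under stated assumptions: the function names `load_vectors`, `pool`, and `embed` are hypothetical, the file format is assumed to be GloVe-style text (a word followed by its components), and the default token pattern is the one scikit-learn's text vectorizers use.

```python
import re


def load_vectors(lines):
    """Parse GloVe-style text lines: a word followed by its float components."""
    vectors = {}
    for line in lines:
        word, *values = line.split()
        vectors[word] = [float(v) for v in values]
    return vectors


def pool(rows, how):
    """Combine word vectors column by column with the named pooling function."""
    funcs = {"mean": lambda col: sum(col) / len(col), "max": max, "sum": sum}
    return [funcs[how](col) for col in zip(*rows)]


def embed(text, vectors, dim, token_pattern=r"(?u)\b\w\w+\b", pooling="mean"):
    """Hypothetical embedder: tokenize, look up known words, pool the vectors."""
    tokens = re.findall(token_pattern, text.lower())
    rows = [vectors[t] for t in tokens if t in vectors]
    if not rows:
        # No known word in the text: fall back to a zero vector.
        return [0.0] * dim
    return pool(rows, pooling)


# In practice the lines would be read from a local embedding file.
toy_file = ["good 1.0 0.0", "bad 0.0 1.0"]
vectors = load_vectors(toy_file)
mean_vec = embed("Good or bad?", vectors, dim=2)
sum_vec = embed("good good bad", vectors, dim=2, pooling="sum")
empty_vec = embed("unknown words", vectors, dim=2)
```

Returning a zero vector for texts without known words is one possible choice; an actual implementation might prefer to raise or to return NaN.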

    Stéphan

    opened by stephantul 3
  • [FEATURE] SpaCyEmbedder

    I think it would be a nice addition to add an embedder that can easily vectorize text through SpaCy. I already have an implementation class for this and would be happy to contribute it here.

    SpaCy Docs on vector: https://spacy.io/api/doc#vector

    Example code for single string:

    import spacy
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("This here text")
    doc.vector
    
    opened by CarloLepelaars 2
  • `get_feature_names_out` for encoders

    I would be happy to implement get_feature_names_out for all the Embetter objects. I will implement them by just adding a new method (without a Mixin).

    opened by CarloLepelaars 1
  • Remove the classification layer in timm models

    I was playing a bit with the library and found out that the TimmEncoder returns 1000-dimensional vectors for all the models I selected. That is caused by returning the state of the last FC classification layer and the fact all of the models were trained on ImageNet with 1000 classes. In practice, it's typically replaced with identity.

    Are there any reasons for returning the state of that last layer as an embedding? I'd be happy to submit a PR fixing that.

    opened by kacperlukawski 1
  • xception mobilenet

    https://keras.io/api/applications/

    https://www.tensorflow.org/api_docs/python/tf/keras/applications/mobilenet_v2/MobileNetV2

    https://www.tensorflow.org/api_docs/python/tf/keras/applications/xception/Xception

    opened by koaning 0
  • 'SentenceEncoder' object has no attribute 'device'

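    import pandas as pd
    from sklearn.pipeline import make_pipeline
    from sklearn.linear_model import LogisticRegression

    from embetter.grab import ColumnGrabber
    from embetter.text import SentenceEncoder

    # This pipeline grabs the `text` column and embeds it.
    text_emb_pipeline = make_pipeline(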
      ColumnGrabber("text"),
      SentenceEncoder('all-MiniLM-L6-v2')
    )
    
    # This pipeline can also be trained to make predictions, using
    # the embedded features. 
    text_clf_pipeline = make_pipeline(
      text_emb_pipeline,
      LogisticRegression()
    )
    
    dataf = pd.DataFrame({
      "text": ["positive sentiment", "super negative"],
      "label_col": ["pos", "neg"]
    })
    
    X = text_emb_pipeline.fit_transform(dataf, dataf['label_col'])
    text_clf_pipeline.fit(dataf, dataf['label_col'])
    

    This code gives this error: 'SentenceEncoder' object has no attribute 'device'

    opened by nicholas-dinicola 6
Releases (0.2.2)
Owner
vincent d warmerdam
Solving problems involving data. Mostly NLP these days. AskMeAnything[tm].