Improving Representations via Similarities

Related tags

Miscellaneousembetter
Overview

embetter

warning

I like to build in public, but please don't expect anything yet. This is alpha stuff!

notes

Improving Representations via Similarities

The object to implement:

Embetter(multi_output=True, epochs=50, sampling_kwargs)
  .fit(X, y)
  .fit_sim(X1, X2, y_sim, weights)
  .partial_fit(X, y, classes, weights)
  .partial_fit_sim(X1, X2, y_sim, weights)
  .predict(X)
  .predict_proba(X)
  .predict_sim(X1, X2)
  .transform(X)
  .translate_X_y(X, y, classes=none)

Observation: especially when multi_output=True there's an opportunity with regards to NaN y-values. We can simply choose with values to translate and which to ignore.

Comments
  • [WIP] Feature/progress bar

    [WIP] Feature/progress bar

    Fixes issue #20

    • [x] Adds progress bar to all text and image embedders.
    • [x] Tests for SentenceEncoder.
    • [ ] Use perfplot for progress bar?
    • [ ] Can we ensure fast NumPy vectorization while using a progress bar?
    opened by CarloLepelaars 5
  • [BUG] `device` should be attribute on `SentenceEncoder`

    [BUG] `device` should be attribute on `SentenceEncoder`

    The device argument in SentenceEncoder is not defined as an attribute. This leads to bugs when using it with sklearn. I encountered attribute errors when trying to print out a Pipeline representation that has SentenceEncoder as a component.

    Should be easy to fix by just adding self.device in SentenceEncoder.__init__. We can consider adding tests for text encoders so we can catch these errors beforehand.

    The scikit-learn development docs make it clear every argument should be defined as an attribute:

    every keyword argument accepted by init should correspond to an attribute on the instance. Scikit-learn relies on this to find the relevant attributes to set on an estimator when doing model selection.

    Error message: AttributeError: 'SentenceEncoder' object has no attribute 'device'.

    Reproduction: Python 3.8 with embetter = "^0.2.2"

    se = SentenceEncoder()
    repr(se)
    

    Fix:

    Add self.device on SentenceEncoder

    class SentenceEncoder(EmbetterBase):
        .
        .
        def __init__(self, name="all-MiniLM-L6-v2", device=None):
            if not device:
                device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
            self.device = device
            self.name = name
            self.tfm = SBERT(name, device=self.device)
    
    opened by CarloLepelaars 4
  • Color Histograms - Additional Tricks

    Color Histograms - Additional Tricks

    This approach could work pretty well as an implementation: https://danielmuellerkomorowska.com/2020/06/17/analyzing-image-histograms-with-scikit-image/

    To do something similar to what is explained here: https://www.pinecone.io/learn/color-histograms/

    opened by koaning 4
  • Support for word embeddings

    Support for word embeddings

    Hi,

    Do you think it would be a good idea to add support for static word embeddings (word2vec, glove, etc.)? The embedder would need:

    • A filename to a local embedding file (e.g., glove.6b.100d.txt)
    • Either a callable tokenizer or regex string (i.e., the way sci-kit learn's TfIdfVectorizer splits words).
    • A (name of a) pooling function (e.g., "mean", "max", "sum").

    The second and third parameters could easily have sensible defaults, of course. If you think it's a good idea, I can do the PR somewhere next week.

    Stéphan

    opened by stephantul 3
  • [FEATURE] SpaCyEmbedder

    [FEATURE] SpaCyEmbedder

    I think it would be a nice addition to add an embedder that can easily vectorize text through SpaCy. I already have an implementation class for this and would be happy to contribute it here.

    SpaCy Docs on vector: https://spacy.io/api/doc#vector

    Example code for single string:

    import spacy
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("This here text")
    doc.vector
    
    opened by CarloLepelaars 2
  • `get_feature_names_out` for encoders

    `get_feature_names_out` for encoders

    I would be happy to implement get_feature_names_out for all the Embetter objects. I will implement them by just adding a new method (without a Mixin).

    opened by CarloLepelaars 1
  • Remove the classification layer in timm models

    Remove the classification layer in timm models

    I was playing a bit with the library and found out that the TimmEncoder returns 1000-dimensional vectors for all the models I selected. That is caused by returning the state of the last FC classification layer and the fact all of the models were trained on ImageNet with 1000 classes. In practice, it's typically replaced with identity.

    Are there any reasons for returning the state of that last layer as an embedding? I'd be happy to submit a PR fixing that.

    opened by kacperlukawski 1
  • xception mobilenet

    xception mobilenet

    https://keras.io/api/applications/

    https://www.tensorflow.org/api_docs/python/tf/keras/applications/mobilenet_v2/MobileNetV2 https://www.tensorflow.org/api_docs/python/tf/keras/applications/xception/Xception

    opened by koaning 0
  • 'SentenceEncoder' object has no attribute 'device'

    'SentenceEncoder' object has no attribute 'device'

    text_emb_pipeline = make_pipeline(
      ColumnGrabber("text"),
      SentenceEncoder('all-MiniLM-L6-v2')
    )
    
    # This pipeline can also be trained to make predictions, using
    # the embedded features. 
    text_clf_pipeline = make_pipeline(
      text_emb_pipeline,
      LogisticRegression()
    )
    
    dataf = pd.DataFrame({
      "text": ["positive sentiment", "super negative"],
      "label_col": ["pos", "neg"]
    })
    
    X = text_emb_pipeline.fit_transform(dataf, dataf['label_col'])
    text_clf_pipeline.fit(dataf, dataf['label_col'])
    

    This code gives this error: 'SentenceEncoder' object has no attribute 'device'

    opened by nicholas-dinicola 6
Releases(0.2.2)
Owner
vincent d warmerdam
Solving problems involving data. Mostly NLP these days. AskMeAnything[tm].
vincent d warmerdam
A module comment generator for python

Module Comment Generator The comment style is as a tribute to the comment from the RA . The comment generator can parse the ast tree from the python s

飘尘 1 Oct 21, 2021
Location of public benchmarking; primarily final results

CSL_public_benchmark This repo is intended to provide a periodically-updated, public view into genome sequencing benchmarks managed by HudsonAlpha's C

HudsonAlpha Institute for Biotechnology 15 Jun 13, 2022
The repository is about 100+ python programming exercise problem discussed, explained, and solved in different ways

Break The Ice With Python A journey of 100+ simple yet interesting problems which are explained, solved, discussed in different pythonic ways Introduc

Abdullah Al Masud Tushar 2.2k Jan 04, 2023
京东自动入会获取京豆

京东入会领京豆 要求 有一定的电脑知识 or 有耐心爱折腾 需要Chrome(推荐)、Edge(Chromium)、Firefox 操作系统需是Mac(本人没在m1上测试)、Linux(在deepin上测试过)、Windows 安装方法 脚本采用Selenium遍历京东入会有礼界面,由于遍历了200

Vanke Anton 500 Dec 22, 2022
Albert launcher extension for rolling dice.

dice-roll-albert-ext Extension for rolling dice in Albert launcher Installation Locate the modules directory in the Python extension data directory. T

Jonah Lawrence 1 Nov 18, 2021
EasyBuild is a software build and installation framework that allows you to manage (scientific) software on High Performance Computing (HPC) systems in an efficient way.

EasyBuild is a software build and installation framework that allows you to manage (scientific) software on High Performance Computing (HPC) systems in an efficient way.

EasyBuild community 87 Dec 27, 2022
PORTSCANNING-IN-PYTHON - A python threaded portscanner to scan websites and ipaddresses

PORTSCANNING-IN-PYTHON This is a python threaded portscanner to scan websites an

1 Feb 16, 2022
Subscribe, listen and (in the future) download your favorite podcasts, quickly and easily.

Minimal Podcasts Player https://github.com/son-link/minimal-podcasts-player Subscribe, listen and (in the future) download your favorite podcasts, qui

Alfonso Saavedra 14 Nov 11, 2022
Python Multilingual Ucrel Semantic Analysis System

PymUSAS Python Multilingual Ucrel Semantic Analysis System, it currently is a rule based token level semantic tagger which can be added to any spaCy p

UCREL 13 Nov 18, 2022
A jokes python module

Made with Python3 (C) @FayasNoushad Copyright permission under MIT License License - https://github.com/FayasNoushad/Jokes/blob/main/LICENSE Deploy

Fayas Noushad 3 Nov 28, 2021
Hypothesis strategies for generating Python programs, something like CSmith

hypothesmith Hypothesis strategies for generating Python programs, something like CSmith. This is definitely pre-alpha, but if you want to play with i

Zac Hatfield-Dodds 73 Dec 14, 2022
In this project we will be using OpenCV to virtually drag a rectangle and drop it at a different location. It will be further used for Virtual Mouse.

Virtual Drag & Drog using OpenCV In this project we will be using OpenCV to virtually drag a rectangle and drop it at a different location. It will be

Hassan Shahzad 5 Sep 27, 2021
RISE allows you to instantly turn your Jupyter Notebooks into a slideshow

RISE RISE allows you to instantly turn your Jupyter Notebooks into a slideshow. No out-of-band conversion is needed, switch from jupyter notebook to a

Damian Avila 3.4k Jan 04, 2023
Powerful virtual assistant in python

Virtual assistant in python Powerful virtual assistant in python Set up Step 1: download repo and unzip Step 2: pip install requirements.txt (if py au

Arkal 3 Jan 23, 2022
LinkScope allows you to perform online investigations by representing information as discrete pieces of data, called Entities.

LinkScope Client Description This is the repository for the LinkScope Client Online Investigation software. LinkScope allows you to perform online inv

108 Jan 04, 2023
WATTS provides a set of Python classes that can manage simulation workflows for multiple codes where information is exchanged at a coarse level

WATTS (Workflow and Template Toolkit for Simulation) provides a set of Python classes that can manage simulation workflows for multiple codes where information is exchanged at a coarse level.

13 Dec 23, 2022
Python with braces. Because Python is awesome, but whitespace is awful.

Bython Python with braces. Because Python is awesome, but whitespace is awful. Bython is a Python preprosessor which translates curly brackets into in

1 Nov 04, 2021
Archive, organize, and watch for changes to publicly available information.

0. Overview The Trapper Keeper is a collection of scripts that support archiving information from around the web to make it easier to study and use. I

Bill Fitzgerald 9 Oct 26, 2022
A python script to turn tabs into spaces the right way.

detab A python script to turn tabs into spaces the right way. detab turns all tabs into spaces, not just leading tabs. Not all tabs have the same leng

1 Jan 26, 2022
This is a modified variation of abhiTronix's vidgear. In this variation, it is possible to write the output file anywhere regardless the permissions.

Info In order to download this package: Windows 10: Press Windows+S, Type PowerShell (cmd in older versions) and hit enter, Type pip install vidgear_n

Ege Akman 3 Jan 30, 2022