Improving Representations via Similarities

Last update: Jan 08, 2023


Overview

embetter

warning

I like to build in public, but please don't expect anything yet. This is alpha stuff!

notes

Improving Representations via Similarities

The object to implement:

Embetter(multi_output=True, epochs=50, sampling_kwargs)
  .fit(X, y)
  .fit_sim(X1, X2, y_sim, weights)
  .partial_fit(X, y, classes, weights)
  .partial_fit_sim(X1, X2, y_sim, weights)
  .predict(X)
  .predict_proba(X)
  .predict_sim(X1, X2)
  .transform(X)
  .translate_X_y(X, y, classes=None)

Observation: especially when multi_output=True, there's an opportunity regarding NaN y-values. We can simply choose which values to translate and which to ignore.
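One way to read the observation above is that a partially labelled row can still contribute its known outputs. The following is a minimal sketch in plain Python, not embetter's implementation; `valid_label_pairs` is a hypothetical helper and the label values are made up for illustration.

```python
import math


def valid_label_pairs(y_rows):
    """Yield (row_index, output_index, label) for every non-missing label.

    Missing labels are represented as float('nan') or None; both are skipped,
    so a partially labelled row still contributes its known outputs.
    """
    for i, row in enumerate(y_rows):
        for j, label in enumerate(row):
            if label is None:
                continue
            if isinstance(label, float) and math.isnan(label):
                continue
            yield i, j, label


# Hypothetical multi-output targets: two label columns, some unknown.
y = [
    ["pos", float("nan")],
    [float("nan"), "spam"],
    ["neg", "ham"],
]
pairs = list(valid_label_pairs(y))
```

Only the pairs that survive this filter would need to be translated into similarity examples; the NaN entries are simply never visited.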

Comments

[WIP] Feature/progress bar
Fixes issue #20

[x] Adds progress bar to all text and image embedders.

[x] Tests for SentenceEncoder.

[ ] Use perfplot for progress bar?

[ ] Can we ensure fast NumPy vectorization while using a progress bar?
opened by CarloLepelaars 5
[BUG] `device` should be attribute on `SentenceEncoder`
The device argument in SentenceEncoder is not defined as an attribute. This leads to bugs when using it with sklearn. I encountered attribute errors when trying to print out a Pipeline representation that has SentenceEncoder as a component.

Should be easy to fix by just adding self.device in SentenceEncoder.__init__. We can consider adding tests for text encoders so we can catch these errors beforehand.

The scikit-learn development docs make it clear every argument should be defined as an attribute:

every keyword argument accepted by __init__ should correspond to an attribute on the instance. Scikit-learn relies on this to find the relevant attributes to set on an estimator when doing model selection.
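To make the convention concrete, here is a simplified imitation of the lookup, written with the standard library only. It is a sketch of the idea (read the `__init__` signature, then fetch an attribute of the same name), not scikit-learn's actual code; `get_params`, `BrokenEncoder` and `FixedEncoder` are hypothetical names used for illustration.

```python
import inspect


def get_params(estimator):
    """Simplified imitation of scikit-learn's parameter lookup.

    Reads the keyword names from __init__ and fetches the attribute of the
    same name from the instance, which is why each one must exist.
    """
    signature = inspect.signature(type(estimator).__init__)
    names = [name for name in signature.parameters if name != "self"]
    return {name: getattr(estimator, name) for name in names}


class BrokenEncoder:
    # Hypothetical encoder: `device` is accepted but never stored.
    def __init__(self, name="all-MiniLM-L6-v2", device=None):
        self.name = name


class FixedEncoder:
    # Same signature, but every argument is stored under its own name.
    def __init__(self, name="all-MiniLM-L6-v2", device=None):
        self.name = name
        self.device = device


try:
    get_params(BrokenEncoder())
    error = None
except AttributeError as exc:
    error = str(exc)

params = get_params(FixedEncoder(device="cpu"))
```

The broken variant fails as soon as the lookup reaches `device`, which is the same kind of failure that printing a Pipeline representation triggers.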

Error message: AttributeError: 'SentenceEncoder' object has no attribute 'device'.

Reproduction: Python 3.8 with embetter = "^0.2.2"

se = SentenceEncoder()
repr(se)

Fix:

Add self.device on SentenceEncoder

class SentenceEncoder(EmbetterBase):
    ...
    def __init__(self, name="all-MiniLM-L6-v2", device=None):
        if not device:
            device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.device = device
        self.name = name
        self.tfm = SBERT(name, device=self.device)
opened by CarloLepelaars 4
Color Histograms - Additional Tricks

This approach could work pretty well as an implementation: https://danielmuellerkomorowska.com/2020/06/17/analyzing-image-histograms-with-scikit-image/

To do something similar to what is explained here: https://www.pinecone.io/learn/color-histograms/
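As a rough illustration of the idea behind those links, a color histogram turns an image into a fixed-length vector by counting how many pixels fall into each intensity bin per channel. The sketch below uses plain Python on a list of RGB tuples; `color_histogram` and the toy pixel values are assumptions for illustration, and a real implementation would more likely operate on NumPy arrays.

```python
def color_histogram(pixels, n_bins=4):
    """Return a normalized, concatenated per-channel histogram.

    `pixels` is an iterable of (r, g, b) tuples with values in 0..255.
    The result has 3 * n_bins entries: red bins, then green, then blue.
    """
    bin_width = 256 / n_bins
    counts = [[0] * n_bins for _ in range(3)]
    total = 0
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            # Clamp to the last bin to guard against rounding at 255.
            index = min(int(value // bin_width), n_bins - 1)
            counts[channel][index] += 1
        total += 1
    if total == 0:
        return [0.0] * (3 * n_bins)
    return [count / total for channel in counts for count in channel]


# Toy "image" of four pixels.
pixels = [(255, 0, 0), (250, 10, 0), (0, 0, 255), (0, 128, 255)]
features = color_histogram(pixels, n_bins=2)
```

Normalizing by the pixel count keeps the feature vector comparable across images of different sizes.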

opened by koaning 4
Support for word embeddings
Hi,

Do you think it would be a good idea to add support for static word embeddings (word2vec, glove, etc.)? The embedder would need:

A filename to a local embedding file (e.g., glove.6b.100d.txt)

Either a callable tokenizer or a regex string (i.e., the way scikit-learn's TfidfVectorizer splits words).

A (name of a) pooling function (e.g., "mean", "max", "sum").

The second and third parameters could easily have sensible defaults, of course. If you think it's a good idea, I can do the PR somewhere next week.

Stéphan
opened by stephantul 3
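The three ingredients listed in this proposal (an embedding file, a tokenizer pattern, and a pooling function) can be sketched as follows. This is a minimal plain-Python illustration, not the proposed PR: `load_vectors`, `embed` and `POOLERS` are hypothetical names, the in-memory file stands in for a real embedding file, and returning a zero vector for text with no known tokens is an assumption. The default pattern mirrors the default `token_pattern` of scikit-learn's TfidfVectorizer.

```python
import io
import re


def load_vectors(lines):
    """Parse GloVe-style text lines: a word followed by its float values."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


# Each pooler reduces one embedding dimension across all tokens.
POOLERS = {
    "mean": lambda column: sum(column) / len(column),
    "max": max,
    "sum": sum,
}


def embed(text, vectors, token_pattern=r"(?u)\b\w\w+\b", pooling="mean"):
    """Pool the vectors of all known tokens in `text` into one vector."""
    dim = len(next(iter(vectors.values())))
    tokens = re.findall(token_pattern, text.lower())
    found = [vectors[token] for token in tokens if token in vectors]
    if not found:
        # Assumption: unknown-only text maps to a zero vector.
        return [0.0] * dim
    pool = POOLERS[pooling]
    return [pool(column) for column in zip(*found)]


# Toy two-dimensional embedding "file" held in memory.
toy_file = io.StringIO("good 1.0 2.0\nmovie 3.0 0.0\n")
vectors = load_vectors(toy_file)
mean_vec = embed("Good movie!", vectors)
sum_vec = embed("Good movie!", vectors, pooling="sum")
max_vec = embed("Good movie!", vectors, pooling="max")
```

Passing the pooling function by name keeps the object easy to configure in a scikit-learn style constructor, since a string is trivially stored as an attribute.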
[FEATURE] SpaCyEmbedder
I think it would be a nice addition to add an embedder that can easily vectorize text through SpaCy. I already have an implementation class for this and would be happy to contribute it here.

SpaCy Docs on vector: https://spacy.io/api/doc#vector

Example code for single string:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This here text")
doc.vector
opened by CarloLepelaars 2
`get_feature_names_out` for encoders

I would be happy to implement get_feature_names_out for all the Embetter objects. I will implement them by just adding a new method (without a Mixin).
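A method-only version of this could look roughly like the sketch below. `ToyEncoder` and its `n_features_out` argument are hypothetical; the naming scheme (lowercased class name plus column index) follows the convention scikit-learn uses for transformers whose outputs have no natural names. Scikit-learn itself returns an array of strings; a list is used here to keep the sketch dependency-free.

```python
class ToyEncoder:
    """Hypothetical stand-in for an embedder with a fixed output width."""

    def __init__(self, n_features_out=4):
        self.n_features_out = n_features_out

    def get_feature_names_out(self, input_features=None):
        # Embeddings have no meaningful column names, so generate them
        # from the lowercased class name plus a column index.
        prefix = type(self).__name__.lower()
        return [f"{prefix}{i}" for i in range(self.n_features_out)]


names = ToyEncoder(n_features_out=3).get_feature_names_out()
```

A real encoder would need to know its output dimensionality, which for pretrained models is typically available once the underlying model has been loaded.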

opened by CarloLepelaars 1
Remove the classification layer in timm models

I was playing a bit with the library and found out that the TimmEncoder returns 1000-dimensional vectors for all the models I selected. That is caused by returning the state of the last FC classification layer and the fact all of the models were trained on ImageNet with 1000 classes. In practice, it's typically replaced with identity.

Are there any reasons for returning the state of that last layer as an embedding? I'd be happy to submit a PR fixing that.

opened by kacperlukawski 1
xception mobilenet

https://keras.io/api/applications/

https://www.tensorflow.org/api_docs/python/tf/keras/applications/mobilenet_v2/MobileNetV2

https://www.tensorflow.org/api_docs/python/tf/keras/applications/xception/Xception

opened by koaning 0

'SentenceEncoder' object has no attribute 'device'

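import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

from embetter.grab import ColumnGrabber
from embetter.text import SentenceEncoder

# This pipeline grabs the text column and embeds it.
text_emb_pipeline = make_pipeline(
  ColumnGrabber("text"),
  SentenceEncoder('all-MiniLM-L6-v2')
)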

# This pipeline can also be trained to make predictions, using
# the embedded features. 
text_clf_pipeline = make_pipeline(
  text_emb_pipeline,
  LogisticRegression()
)

dataf = pd.DataFrame({
  "text": ["positive sentiment", "super negative"],
  "label_col": ["pos", "neg"]
})

X = text_emb_pipeline.fit_transform(dataf, dataf['label_col'])
text_clf_pipeline.fit(dataf, dataf['label_col'])

This code gives this error: 'SentenceEncoder' object has no attribute 'device'

opened by nicholas-dinicola 6

Releases(0.2.2)

0.2.2(Dec 20, 2022)

Adds GPU support for Sentence Encoders.
0.2.1(Dec 5, 2022)

Fixed some error messages related to installing extra dependencies.
0.2.0(Oct 10, 2022)

Fixes a bug related to the Timm vision models.
0.1.0(Sep 19, 2022)

The first original release. Should have enough components to be interesting.

Owner

vincent d warmerdam

Solving problems involving data. Mostly NLP these days. AskMeAnything[tm].

