A PyTorch-based open-source framework that provides methods for improving weakly annotated data and allows researchers to efficiently develop and compare their own methods.

Overview

Knodle (Knowledge-supervised Deep Learning Framework) is a new framework for weak supervision with neural networks. It provides a modularization that separates weak data annotations, powerful deep learning models, and methods for improving weakly supervised training.

More details about Knodle are in our recent paper.


Latest news

Installation

pip install knodle

Usage

Knodle offers various methods for denoising weak supervision sources and improving them. Examples can be found in the tutorials folder.

There are four mandatory inputs for knodle:

  1. model_input_x: Your model features (e.g. TF-IDF values) without any labels. Shape: (n_instances x features)
  2. mapping_rules_labels_t: This matrix maps all weak rules to a label. Shape: (n_rules x n_classes)
  3. rule_matches_z: This matrix shows all applied rules on your dataset. Shape: (n_instances x n_rules)
  4. model: A PyTorch model which can take your provided model_input_x as input. Examples are in the model folder.
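To make the shapes above more concrete, here is a minimal sketch using plain NumPy arrays. All values are made up for illustration, and the treatment of unmatched instances (the placeholder label -1) is an assumption of this sketch, not a statement about how Knodle handles them. It shows how multiplying rule_matches_z by mapping_rules_labels_t yields per-class vote counts, which is the idea behind a majority vote over weak rules.

```python
import numpy as np

# Toy data: 4 instances, 3 features, 3 rules, 2 classes (all values are made up).
model_input_x = np.array([
    [0.1, 0.0, 0.7],
    [0.0, 0.5, 0.2],
    [0.9, 0.1, 0.0],
    [0.3, 0.3, 0.3],
])  # shape: (n_instances, n_features)

# Each rule (row) votes for exactly one class (column).
mapping_rules_labels_t = np.array([
    [1, 0],  # rule 0 -> class 0
    [0, 1],  # rule 1 -> class 1
    [0, 1],  # rule 2 -> class 1
])  # shape: (n_rules, n_classes)

# Entry (i, j) is 1 if rule j matched instance i.
rule_matches_z = np.array([
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 1],
    [0, 0, 0],  # no rule matched this instance
])  # shape: (n_instances, n_rules)

# Z @ T counts, per instance, how many matched rules vote for each class.
class_votes = rule_matches_z @ mapping_rules_labels_t  # shape: (n_instances, n_classes)

# Majority vote; instances without any match get the placeholder label -1
# (an assumption of this sketch).
has_match = class_votes.sum(axis=1) > 0
majority_labels = np.where(has_match, class_votes.argmax(axis=1), -1)

print(majority_labels)  # [ 0  1  1 -1]
```

Keeping the rule matches (Z) separate from the rule-to-label mapping (T) means the same matches can be reused if the label assigned to a rule changes.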

If you know which denoising method you want to use, you can directly call the corresponding module (the list of currently supported methods is provided below).

Example for training the baseline classifier:

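# Assumes model_input_x, mapping_rules_labels_t, rule_matches_z, X_dev, Y_dev,
# X_test and Y_test have already been prepared in the format described above.
from knodle.model.logistic_regression_model import LogisticRegressionModel
from knodle.trainer.baseline.majority import MajorityVoteTrainer

NUM_OUTPUT_CLASSES = 2

# The input dimension of the model is the number of features per instance.
model = LogisticRegressionModel(model_input_x.shape[1], NUM_OUTPUT_CLASSES)

trainer = MajorityVoteTrainer(
  model=model,
  mapping_rules_labels_t=mapping_rules_labels_t,
  model_input_x=model_input_x,
  rule_matches_z=rule_matches_z,
  dev_model_input_x=X_dev,
  dev_gold_labels_y=Y_dev
)

# Train on the weak labels, then evaluate on the test data.
trainer.train()

trainer.test(X_test, Y_test)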

A more detailed example of classifier training is here.

Main Principles

The framework provides a simple tensor-driven abstraction based on PyTorch, allowing researchers to efficiently develop and compare their methods. The emergence of machine learning software frameworks is the biggest enabler of the widespread adoption of machine learning and its speed of development. With Knodle we want to empower researchers in a similar fashion.

Knodle's main goals:

  • Data abstraction. The interface is a tensor-driven data abstraction which unifies a large number of input variants and is applicable to a large number of tasks.
  • Method independence. We distinguish between weak supervision and the prediction model. This enables comparability and accounts for domain-specific inductive biases.
  • Accessibility. High-level access to the library makes it easy to test existing methods, incorporate new ones and benchmark them against each other.

Apart from that, Knodle includes a selection of well-known datasets from prior work in weak supervision. The Knodle ecosystem provides modular access to datasets and denoising methods (which can, in turn, be combined with arbitrary deep learning models), enabling easy experimentation.

Datasets currently provided in Knodle:

  • Spam Dataset - a dataset based on the YouTube comments dataset from Alberto et al. (2015). Here, the task is to classify whether a text is relevant to the video or holds spam, such as advertisement.
  • Spouse Dataset - a relation extraction dataset based on the Signal Media One-Million News Articles Dataset from Corney et al. (2016).
  • IMDb Dataset - a dataset that consists of short movie reviews. The task is to determine whether a review holds a positive or negative sentiment.
  • TAC-based Relation Extraction Dataset - a dataset built over Knowledge Base Population challenges in the Text Analysis Conference. For development and test purposes, the corpus annotated via crowdsourcing and human labeling from KBP is used (Zhang et al. (2017)). Training is done on a weakly supervised noisy dataset based on TAC KBP corpora (Surdeanu (2013)).

All datasets are added to the Knodle framework in the tensor format described above and can be downloaded here. To see how the datasets were created, please have a look at the dedicated tutorial.

Denoising Methods

There are several denoising methods available.

| Trainer Name | Module | Description |
|---|---|---|
| MajorityVoteTrainer | knodle.trainer.baseline | This builds the baseline for all methods. No denoising takes place. The final label is decided by a simple majority vote, and the provided model is trained with these labels. |
| AutoTrainer | knodle.trainer | This incorporates all denoising methods currently provided in Knodle. |
| KNNAggregationTrainer | knodle.trainer.knn_aggregation | This method looks at the similarities in sentence values. The intuition behind it is that similar samples should be activated by the same rules, which is allowed by a smoothness assumption on the target space. Similar sentences will receive the same label matches of the rules. This counteracts the problem of missing rules for certain labels. |
| WSCrossWeighTrainer | knodle.trainer.wscrossweigh | This method weighs the training samples based on how reliable their labels are. The less reliable sentences (i.e. sentences whose weak labels are possibly wrong) are detected using a DS-CrossWeigh method, which is similar to k-fold cross-validation, and receive reduced weights in further training. This counteracts the problem of wrongly classified sentences. |
| SnorkelTrainer | knodle.trainer.snorkel | A wrapper of the Snorkel system, which incorporates both generative and discriminative Snorkel steps in a single call. |
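The intuition behind the kNN aggregation described above can be illustrated with a small NumPy sketch. This is not Knodle's implementation: the function name knn_aggregate_matches, the squared Euclidean distance, and the element-wise maximum used to merge neighbour matches are all assumptions made for illustration. It only shows how rule matches can be propagated between similar instances.

```python
import numpy as np


def knn_aggregate_matches(features, rule_matches, k):
    """Give each instance the union of the rule matches of its k nearest
    neighbours (the instance itself counts as its own nearest neighbour).

    Hypothetical helper for illustration; not part of the Knodle API.
    """
    # Pairwise squared Euclidean distances, shape (n_instances, n_instances).
    diffs = features[:, None, :] - features[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    # Indices of the k closest instances for every row.
    neighbours = np.argsort(dists, axis=1, kind="stable")[:, :k]
    # rule_matches[neighbours] has shape (n_instances, k, n_rules);
    # the maximum over the neighbour axis is the union of binary matches.
    return rule_matches[neighbours].max(axis=1)


# Two tight clusters; only one instance per cluster is matched by a rule.
features = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
rule_matches_z = np.array([[1, 0], [0, 0], [0, 1], [0, 0]])

denoised_z = knn_aggregate_matches(features, rule_matches_z, k=2)
print(denoised_z.tolist())  # [[1, 0], [1, 0], [0, 1], [0, 1]]
```

In this toy example the two previously unmatched instances inherit the rule match of their closest neighbour, which is the effect the method relies on to counteract missing rules.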

Each of the methods has its own default config file, which will be used in training if no custom config is provided.
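To make the reweighting idea behind WSCrossWeighTrainer more tangible, the following is a heavily simplified sketch, not the DS-CrossWeigh algorithm as implemented in Knodle. It assumes a single pass over rule folds, a nearest-centroid classifier as a stand-in for the real model, and a hypothetical function name and parameters (cross_weigh_sample_weights, n_folds, reducing_rate). Samples matched by held-out rules are predicted by a classifier trained on the remaining samples, and their weight is reduced whenever the prediction disagrees with the weak label.

```python
import numpy as np


def cross_weigh_sample_weights(features, rule_matches, rules_to_labels,
                               n_folds=2, reducing_rate=0.5):
    """Simplified cross-validation-style sample reweighting (illustrative only)."""
    # Weak label per instance via majority vote (unmatched rows default to class 0).
    weak_labels = (rule_matches @ rules_to_labels).argmax(axis=1)
    weights = np.ones(len(features))
    rule_ids = np.arange(rule_matches.shape[1])

    for held_out_rules in np.array_split(rule_ids, n_folds):
        # Instances matched by at least one held-out rule are left out of training.
        held_out = rule_matches[:, held_out_rules].any(axis=1)
        train = ~held_out
        classes = np.unique(weak_labels[train])
        if not held_out.any() or len(classes) == 0:
            continue
        # Stand-in classifier: one centroid per class seen in the training part.
        centroids = np.stack(
            [features[train & (weak_labels == c)].mean(axis=0) for c in classes]
        )
        dists = ((features[held_out][:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        predicted = classes[dists.argmin(axis=1)]
        # Reduce the weight of held-out instances whose weak label looks unreliable.
        disagree = predicted != weak_labels[held_out]
        weights[np.flatnonzero(held_out)[disagree]] *= reducing_rate
    return weights


# Toy data: the last instance lies near class 0 but is matched by a class-1 rule.
features = np.array([[0.0], [0.2], [1.0], [1.2], [0.1]])
rules_to_labels = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])
rule_matches_z = np.array([
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
])

weights = cross_weigh_sample_weights(features, rule_matches_z, rules_to_labels)
print(weights.tolist())  # [1.0, 1.0, 1.0, 1.0, 0.5]
```

The resulting weights could then be used to scale per-sample losses during training, so that instances with suspicious weak labels contribute less.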

Details about negative samples

Tutorials

We also provide basic tutorials that explain how to use the framework. All of them are stored in the examples folder and fall into two groups:

  • tutorials that demonstrate how to prepare the input data for the Knodle framework...
    • ... using the well-known IMDb dataset as an example. A weakly supervised dataset is created by incorporating keywords as weak sources (link).
    • ... using a TAC-based dataset in .conll format as an example. A relation extraction dataset is created using entity pairs from Freebase as weak sources (link).
  • tutorials that show how to work with the Knodle framework...
    • ... using AutoTrainer as an example. Call this trainer when you want to train a weak classifier without committing to a specific denoising method and would rather try all methods currently provided in Knodle (link).
    • ... using WSCrossWeighTrainer as an example. This trainer trains a weak classifier with the WSCrossWeigh denoising method (link).

Compatibility

Currently the package is tested on Python 3.7. Further versions can be added, but the CI/CD pipeline needs to be updated in that case.

Structure

The structure of the code is as follows:

knodle
├── knodle
│    ├── evaluation
│    ├── model
│    ├── trainer
│          ├── baseline
│          ├── knn_aggregation
│          ├── snorkel
│          ├── wscrossweigh
│          └── utils
│    ├── transformation
│    └── utils
├── tests
│    ├── data
│    ├── evaluation
│    ├── trainer
│          ├── baseline
│          ├── wscrossweigh
│          ├── snorkel
│          └── utils
│    └── transformation
└── examples
     ├── data_preprocessing
           ├── imdb_dataset
           └── tac_based_dataset
     └── training
           ├── simple_auto_trainer
           └── wscrossweigh

License

Licensed under the Apache 2.0 License.

Contact

If you notice a problem in the code, you can report it by submitting an issue.

If you want to share your feedback with us or take part in the project, contact us via [email protected].

And don't forget to follow @knodle_ai on Twitter :)

Authors

Citation

@misc{sedova2021knodle,
      title={Knodle: Modular Weakly Supervised Learning with PyTorch}, 
      author={Anastasiia Sedova and Andreas Stephan and Marina Speranskaya and Benjamin Roth},
      year={2021},
      eprint={2104.11557},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

Acknowledgments

This research was funded by the WWTF through the project "Knowledge-infused Deep Learning for Natural Language Processing" (WWTF Vienna Research Group VRG19-008).

Releases(v0.1.3)