Data pipelines for both TensorFlow and PyTorch!

Overview

rapidnlp-datasets

Python package PyPI version Python

Data pipelines for both TensorFlow and PyTorch !

If you want to load public datasets, try:

If you want to load local, personal dataset with minimized boilerplate, use rapidnlp-datasets!

installation

pip install -U rapidnlp-datasets

If you work with PyTorch, you should install PyTorch first.

If you work with TensorFlow, you should install TensorFlow first.

Usage

Here are few examples to show you how to use this library.

sequence-classification-quickstart

In PyTorch,

>>> import torch
>>> from rapidnlp_datasets.pt import DatasetForSequenceClassification
>>> dataset = DatasetForSequenceClassification.from_jsonl_files(
        input_files=["testdata/sequence_classification.jsonl"],
        vocab_file="testdata/vocab.txt",
    )
>>> dataloader = torch.utils.data.DataLoader(dataset, shuffle=True, batch_size=32, collate_fn=dataset.batch_padding_collate)
>>> for idx, batch in enumerate(dataloader):
...     print("No.{} batch: \n{}".format(idx, batch))
... 

In TensorFlow,

>>> from rapidnlp_datasets.tf import TFDatasetForSequenceClassifiation
>>> dataset, d = TFDatasetForSequenceClassifiation.from_jsonl_files(
        input_files=["testdata/sequence_classification.jsonl"],
        vocab_file="testdata/vocab.txt",
        return_self=True,
    )
>>> for idx, batch in enumerate(iter(dataset)):
...     print("No.{} batch: \n{}".format(idx, batch))
... 

Especially, you can save dataset to tfrecord format when working with TensorFlow, and then build dataset from tfrecord files directly!

>>> d.save_tfrecord("testdata/sequence_classification.tfrecord")
2021-12-08 14:52:41,295    INFO             utils.py  128] Finished to write 2 examples to tfrecords.
>>> dataset = TFDatasetForSequenceClassifiation.from_tfrecord_files("testdata/sequence_classification.tfrecord")
>>> for idx, batch in enumerate(iter(dataset)):
...     print("No.{} batch: \n{}".format(idx, batch))
... 

question-answering-quickstart

In PyTorch:

>>> import torch
>>> from rapidnlp_datasets.pt import DatasetForQuestionAnswering
>>>
>>> dataset = DatasetForQuestionAnswering.from_jsonl_files(
        input_files="testdata/qa.jsonl",
        vocab_file="testdata/vocab.txt",
    )
>>> dataloader = torch.utils.data.DataLoader(dataset, shuffle=True, batch_size=32, collate_fn=dataset.batch_padding_collate)
>>> for idx, batch in enumerate(dataloader):
...     print("No.{} batch: \n{}".format(idx, batch))
... 

In TensorFlow,

>>> from rapidnlp_datasets.tf import TFDatasetForQuestionAnswering
>>> dataset, d = TFDatasetForQuestionAnswering.from_jsonl_files(
        input_files="testdata/qa.jsonl",
        vocab_file="testdata/vocab.txt",
        return_self=True,
    )
2021-12-08 15:09:06,747    INFO question_answering_dataset.py  101] Read 3 examples in total.
>>> for idx, batch in enumerate(iter(dataset)):
        print()
        print("NO.{} batch: \n{}".format(idx, batch))
... 

Especially, you can save dataset to tfrecord format when working with TensorFlow, and then build dataset from tfrecord files directly!

>>> d.save_tfrecord("testdata/qa.tfrecord")
2021-12-08 15:09:31,329    INFO             utils.py  128] Finished to write 3 examples to tfrecords.
>>> dataset = TFDatasetForQuestionAnswering.from_tfrecord_files(
        "testdata/qa.tfrecord",
        batch_size=32,
        padding="batch",
    )
>>> for idx, batch in enumerate(iter(dataset)):
        print()
        print("NO.{} batch: \n{}".format(idx, batch))
... 

token-classification-quickstart

masked-language-models-quickstart

simcse-quickstart

You might also like...
In this project we use both Resnet and Self-attention layer for cat, dog and flower classification.
In this project we use both Resnet and Self-attention layer for cat, dog and flower classification.

cdf_att_classification classes = {0: 'cat', 1: 'dog', 2: 'flower'} In this project we use both Resnet and Self-attention layer for cdf-Classification.

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

Master status: Development status: Package information: TPOT stands for Tree-based Pipeline Optimization Tool. Consider TPOT your Data Science Assista

🤗 Push your spaCy pipelines to the Hugging Face Hub
🤗 Push your spaCy pipelines to the Hugging Face Hub

spacy-huggingface-hub: Push your spaCy pipelines to the Hugging Face Hub This package provides a CLI command for uploading any trained spaCy pipeline

AI pipelines for Nvidia Jetson Platform

Jetson Multicamera Pipelines Easy-to-use realtime CV/AI pipelines for Nvidia Jetson Platform. This project: Builds a typical multi-camera pipeline, i.

This is a repository for a No-Code object detection inference API using the OpenVINO. It's supported on both Windows and Linux Operating systems.
This is a repository for a No-Code object detection inference API using the OpenVINO. It's supported on both Windows and Linux Operating systems.

OpenVINO Inference API This is a repository for an object detection inference API using the OpenVINO. It's supported on both Windows and Linux Operati

Machine learning framework for both deep learning and traditional algorithms
Machine learning framework for both deep learning and traditional algorithms

NeoML is an end-to-end machine learning framework that allows you to build, train, and deploy ML models. This framework is used by ABBYY engineers for

CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation

CPT This repository contains code and checkpoints for CPT. CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Gener

A transformer which can randomly augment VOC format dataset (both image and bbox) online.
A transformer which can randomly augment VOC format dataset (both image and bbox) online.

VocAug It is difficult to find a script which can augment VOC-format dataset, especially the bbox. Or find a script needs complex requirements so it i

Official repository for GCR rerank, a GCN-based reranking method for both image and video re-ID

Official repository for GCR rerank, a GCN-based reranking method for both image and video re-ID

Releases(v0.2.0)
An AFL implementation with UnTracer (our coverage-guided tracer)

UnTracer-AFL This repository contains an implementation of our prototype coverage-guided tracing framework UnTracer in the popular coverage-guided fuz

113 Dec 17, 2022
Game Agent Framework. Helping you create AIs / Bots that learn to play any game you own!

Serpent.AI - Game Agent Framework (Python) Update: Revival (May 2020) Development work has resumed on the framework with the aim of bringing it into 2

Serpent.AI 6.4k Jan 05, 2023
This is the implementation of "SELF SUPERVISED REPRESENTATION LEARNING WITH DEEP CLUSTERING FOR ACOUSTIC UNIT DISCOVERY FROM RAW SPEECH" submitted to ICASSP 2022

CPC_DeepCluster This is the implementation of "SELF SUPERVISED REPRESENTATION LEARNING WITH DEEP CLUSTERING FOR ACOUSTIC UNIT DISCOVERY FROM RAW SPEEC

LEAP Lab 2 Sep 15, 2022
Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields.

This repository contains the code release for Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields. This implementation is written in JAX, and is a fork of Google's JaxNeRF

Google 625 Dec 30, 2022
PyTorch implementation of PP-LCNet: A Lightweight CPU Convolutional Neural Network

PyTorch implementation of PP-LCNet Reproduction of PP-LCNet architecture as described in PP-LCNet: A Lightweight CPU Convolutional Neural Network by C

Quan Nguyen (Fly) 47 Nov 02, 2022
exponential adaptive pooling for PyTorch

AdaPool: Exponential Adaptive Pooling for Information-Retaining Downsampling Abstract Pooling layers are essential building blocks of Convolutional Ne

Alexandros Stergiou 55 Jan 04, 2023
This repository contains code demonstrating the methods outlined in Path Signature Area-Based Causal Discovery in Coupled Time Series presented at Causal Analysis Workshop 2021.

signed-area-causal-inference This repository contains code demonstrating the methods outlined in Path Signature Area-Based Causal Discovery in Coupled

Will Glad 1 Mar 11, 2022
NuPIC Studio is an all­-in-­one tool that allows users create a HTM neural network from scratch

NuPIC Studio is an all­-in-­one tool that allows users create a HTM neural network from scratch, train it, collect statistics, and share it among the members of the community. It is not just a visual

HTM Community 93 Sep 30, 2022
McGill Physics Hackathon 2021: Reaction-Diffusion Models for the Generation of Biological Patterns

DiffuseAnimals: Reaction-Diffusion Models for the Generation of Biological Patterns Introduction Reaction-diffusion equations can be utilized in order

Austin Szuminsky 2 Mar 07, 2022
Emotion Recognition from Facial Images

Reconhecimento de Emoções a partir de imagens faciais Este projeto implementa um classificador simples que utiliza técncias de deep learning e transfe

Gabriel 2 Feb 09, 2022
[CVPR 2016] Unsupervised Feature Learning by Image Inpainting using GANs

Context Encoders: Feature Learning by Inpainting CVPR 2016 [Project Website] [Imagenet Results] Sample results on held-out images: This is the trainin

Deepak Pathak 829 Dec 31, 2022
Python scripts for performing road segemtnation and car detection using the HybridNets multitask model in ONNX.

ONNX-HybridNets-Multitask-Road-Detection Python scripts for performing road segemtnation and car detection using the HybridNets multitask model in ONN

Ibai Gorordo 45 Jan 01, 2023
Gym environment for FLIPIT: The Game of "Stealthy Takeover"

gym-flipit Gym environment for FLIPIT: The Game of "Stealthy Takeover" invented by Marten van Dijk, Ari Juels, Alina Oprea, and Ronald L. Rivest. Desi

Lisa Oakley 2 Dec 15, 2021
[NeurIPS 2021] "G-PATE: Scalable Differentially Private Data Generator via Private Aggregation of Teacher Discriminators"

G-PATE This is the official code base for our NeurIPS 2021 paper: "G-PATE: Scalable Differentially Private Data Generator via Private Aggregation of T

AI Secure 14 Oct 12, 2022
How Effective is Incongruity? Implications for Code-mix Sarcasm Detection.

Code for the paper: How Effective is Incongruity? Implications for Code-mix Sarcasm Detection - ICON ACL 2021

2 Jun 05, 2022
FewBit — a library for memory efficient training of large neural networks

FewBit FewBit — a library for memory efficient training of large neural networks. Its efficiency originates from storage optimizations applied to back

24 Oct 22, 2022
(CVPR2021) DANNet: A One-Stage Domain Adaptation Network for Unsupervised Nighttime Semantic Segmentation

DANNet: A One-Stage Domain Adaptation Network for Unsupervised Nighttime Semantic Segmentation CVPR2021(oral) [arxiv] Requirements python3.7 pytorch==

W-zx-Y 85 Dec 07, 2022
Data Augmentation with Variational Autoencoders

Documentation Pyraug This library provides a way to perform Data Augmentation using Variational Autoencoders in a reliable way even in challenging con

112 Nov 30, 2022
The official implementation of CircleNet: Anchor-free Detection with Circle Representation, MICCAI 2030

CircleNet: Anchor-free Detection with Circle Representation The official implementation of CircleNet, MICCAI 2020 [PyTorch] [project page] [MICCAI pap

The Biomedical Data Representation and Learning Lab 45 Nov 18, 2022
This repository contains several image-to-image translation models, whcih were tested for RGB to NIR image generation. The models are Pix2Pix, Pix2PixHD, CycleGAN and PointWise.

RGB2NIR_Experimental This repository contains several image-to-image translation models, whcih were tested for RGB to NIR image generation. The models

5 Jan 04, 2023