Several simple examples for popular neural network toolkits calling custom CUDA operators.

Last update: Jan 01, 2023

Overview

Neural Network CUDA Example

Several simple examples for neural network toolkits (PyTorch, TensorFlow, etc.) calling custom CUDA operators.

The CUDA kernels and their C++ wrappers can be compiled in several ways, including JIT, setuptools, and CMake.

Python scripts are also provided to call the CUDA kernels, covering kernel timing and model training.

For more accurate timing, run the code under a profiler such as nvprof or nsys.
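The timing scripts themselves are not reproduced here. As a rough sketch of how repeated-run timing is commonly done, the hypothetical helper `time_calls` below (not taken from the repository) discards a few warm-up runs and reports the mean and standard deviation. For a CUDA kernel, the timed callable should end with a device synchronization such as `torch.cuda.synchronize()`, because kernel launches are asynchronous.

```python
import statistics
import time


def time_calls(fn, warmup=3, repeats=10):
    """Return (mean, stdev) wall-clock seconds of fn() over `repeats` runs.

    Hypothetical helper for illustration; not part of the repository.
    """
    for _ in range(warmup):
        fn()  # warm-up runs are not measured
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()  # for GPU work, fn should synchronize the device before returning
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)


mean_s, std_s = time_calls(lambda: sum(range(10_000)))
print(f"mean {mean_s * 1e6:.1f} us, stdev {std_s * 1e6:.1f} us")
```

Wall-clock timing of this kind includes launch and Python overhead, which is one reason a profiler gives a cleaner picture of kernel time.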

Environments

NVIDIA Driver: 418.116.00
CUDA: 11.0
Python: 3.7.3
PyTorch: 1.7.0+cu110
TensorFlow: 2.4.1
CMake: 3.16.3
Ninja: 1.10.0
GCC: 8.3.0

The examples are not guaranteed to run in other environments.
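One quick way to see how far a local setup is from the tested one is to compare release numbers. The sketch below is only an illustration: the helper names are hypothetical, and matching major and minor versions is an assumed rule of thumb, not a compatibility guarantee.

```python
import platform

# Versions copied from the list above.
TESTED = {"Python": "3.7.3", "PyTorch": "1.7.0+cu110", "TensorFlow": "2.4.1"}


def major_minor(version):
    """'1.7.0+cu110' -> (1, 7); drops the patch level and any local build tag."""
    core = version.split("+")[0]
    parts = core.split(".")
    return int(parts[0]), int(parts[1])


def same_release(tested, installed):
    """True when both version strings share major and minor numbers."""
    return major_minor(tested) == major_minor(installed)


print("Python matches tested release:",
      same_release(TESTED["Python"], platform.python_version()))
```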

Code structure

├── include
│   └── add2.h # header file of add2 cuda kernel
├── kernel
│   └── add2_kernel.cu # add2 cuda kernel
├── pytorch
│   ├── add2_ops.cpp # torch wrapper of add2 cuda kernel
│   ├── time.py # time comparison of cuda kernel and torch
│   ├── train.py # training using custom cuda kernel
│   ├── setup.py
│   └── CMakeLists.txt
├── tensorflow
│   ├── add2_ops.cpp # tensorflow wrapper of add2 cuda kernel
│   ├── time.py # time comparison of cuda kernel and tensorflow
│   ├── train.py # training using custom cuda kernel
│   └── CMakeLists.txt
├── LICENSE
└── README.md
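The tree does not state what add2 computes. Assuming from its name that it adds two equally sized arrays element by element, a pure-Python reference to compare a kernel's output against might look like the following (`add2_reference` is a hypothetical name, not a function in the repository):

```python
def add2_reference(a, b):
    """Element-wise sum of two equal-length sequences (assumed add2 semantics)."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return [x + y for x, y in zip(a, b)]


print(add2_reference([1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # [11.0, 22.0, 33.0]
```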

PyTorch

Compile cpp and cuda

JIT
Run the Python scripts directly; no separate build step is needed.
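JIT builds of PyTorch extensions are usually done with `torch.utils.cpp_extension.load`, which compiles the sources on the first call and returns the resulting module. The sketch below assumes the source and include paths listed under Code structure; whether the repository's scripts pass exactly these arguments is an assumption, and both helper names are hypothetical.

```python
from pathlib import Path


def jit_sources(repo_root="."):
    """Paths of the C++ wrapper and CUDA kernel, taken from the code structure."""
    root = Path(repo_root)
    return [
        str(root / "pytorch" / "add2_ops.cpp"),
        str(root / "kernel" / "add2_kernel.cu"),
    ]


def load_add2_jit(repo_root="."):
    """Compile and load the extension at run time (hypothetical helper)."""
    # Imported lazily so jit_sources() can be used without PyTorch installed.
    from torch.utils.cpp_extension import load

    return load(
        name="add2",
        sources=jit_sources(repo_root),
        extra_include_paths=[str(Path(repo_root) / "include")],
        verbose=True,  # print the compiler commands while building
    )
```

Calling `load_add2_jit()` from the repository root needs PyTorch, a CUDA toolchain, and Ninja; the first call is slow because it compiles, and later calls reuse the cached build.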

Setuptools

python3 pytorch/setup.py install

CMake

mkdir build
cd build
cmake ../pytorch
make
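How the library produced by this build is loaded depends on how add2_ops.cpp registers the operator, which is not shown here. If it registers through `TORCH_LIBRARY`, `torch.ops.load_library` is the usual call. The sketch below makes that assumption and searches the build directory rather than guessing the library's file name; both helper names are hypothetical.

```python
from pathlib import Path


def find_shared_libraries(build_dir="build"):
    """Return sorted paths of .so files under build_dir (no file name is assumed)."""
    return sorted(str(p) for p in Path(build_dir).rglob("*.so"))


def load_first_library(build_dir="build"):
    """Load the first shared library found into PyTorch and return its path."""
    libs = find_shared_libraries(build_dir)
    if not libs:
        raise FileNotFoundError(
            f"no .so file found under {build_dir!r}; run cmake and make first"
        )
    import torch  # imported lazily so the search helper works without PyTorch

    torch.ops.load_library(libs[0])
    return libs[0]
```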

Run python

Compare kernel running time

python3 pytorch/time.py --compiler jit
python3 pytorch/time.py --compiler setup
python3 pytorch/time.py --compiler cmake
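The `--compiler` flag in these commands takes one of three values. A minimal sketch of how such a flag can be parsed with `argparse` follows; the default value and help text are assumptions, not taken from the repository's scripts.

```python
import argparse


def parse_args(argv=None):
    """Parse the --compiler option; argv=None reads from sys.argv."""
    parser = argparse.ArgumentParser(
        description="Select how the add2 extension was built"
    )
    parser.add_argument(
        "--compiler",
        choices=["jit", "setup", "cmake"],
        default="jit",  # assumed default
        help="build method used for the custom CUDA operator",
    )
    return parser.parse_args(argv)


args = parse_args(["--compiler", "cmake"])
print(args.compiler)  # cmake
```

Restricting the flag with `choices` makes a typo fail at start-up instead of after a long import or build.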

Train model

python3 pytorch/train.py --compiler jit
python3 pytorch/train.py --compiler setup
python3 pytorch/train.py --compiler cmake
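Training through a custom operator needs a backward rule as well as a forward kernel. Assuming add2 computes c = a + b element by element (an assumption based on its name), the gradient with respect to each input equals the incoming gradient. The pure-Python sketch below checks that rule against a finite difference; all function names are hypothetical.

```python
def add2_forward(a, b):
    """Assumed forward pass: element-wise sum."""
    return [x + y for x, y in zip(a, b)]


def add2_backward(grad_c):
    """For c = a + b, dL/da = dL/dc and dL/db = dL/dc."""
    return list(grad_c), list(grad_c)


def loss(c):
    """Toy scalar loss: sum of squares."""
    return sum(v * v for v in c)


a, b = [1.0, 2.0], [0.5, -1.0]
c = add2_forward(a, b)
grad_c = [2.0 * v for v in c]  # derivative of the toy loss with respect to c
grad_a, grad_b = add2_backward(grad_c)

# Finite-difference check on the first element of a.
eps = 1e-6
bumped = loss(add2_forward([a[0] + eps, a[1]], b))
numeric = (bumped - loss(c)) / eps
print(grad_a[0], round(numeric, 3))
```

In PyTorch, a rule like this would normally live in the `backward` method of a `torch.autograd.Function` subclass whose `forward` calls the custom kernel.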

TensorFlow

Compile cpp and cuda

CMake

mkdir build
cd build
cmake ../tensorflow
make
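A custom TensorFlow op built as a shared library is normally loaded with `tf.load_op_library`. The sketch below assumes this build produces such a library; its file name is not given here, so the path is left as a parameter, and `load_add2_tf` is a hypothetical helper.

```python
from pathlib import Path


def load_add2_tf(lib_path):
    """Load a custom-op shared library into TensorFlow and return the op module."""
    path = Path(lib_path)
    if not path.is_file():
        raise FileNotFoundError(
            f"{lib_path} not found; build it with cmake and make first"
        )
    import tensorflow as tf  # imported lazily; only needed once the file exists

    return tf.load_op_library(str(path))
```

The returned module exposes each registered op as a snake_case function; the exact name depends on the `REGISTER_OP` call in add2_ops.cpp, which is not shown here.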

Run python

Compare kernel running time

python3 tensorflow/time.py --compiler cmake

Train model

python3 tensorflow/train.py --compiler cmake

Implementation details (in Chinese)

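Tutorial on custom CUDA operators in PyTorch, with running-time analysis
Three ways to compile and call custom CUDA operators in PyTorch, explained in detail
Custom backpropagation in PyTorch in three minutes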

F.A.Q

Q. ImportError: libc10.so: cannot open shared object file: No such file or directory

A. Import torch before importing add2. libc10.so ships with PyTorch and is loaded when torch is imported.


Owner

WeiYang
