An exploration of log domain "alternative floating point" for hardware ML/AI accelerators.

Overview

This repository contains the SystemVerilog RTL, C++, HLS (Intel FPGA OpenCL to wrap RTL code) and Python needed to reproduce the numerical results in "Rethinking floating point for deep learning" [1].

There are two types of floating point implemented:

  • N-bit (N, l, alpha, beta, gamma) log with ELMA [1]
  • N-bit (N, s) (linear) posit [2]

There are also partial implementations of IEEE-style (e, s) floating point (likely quite buggy) and non-posit tapered log.

8-bit (8, 1, 5, 5, 7) log is the format described in "Rethinking floating point for deep learning" [1]. The paper shows it to be more energy efficient than int8/32 integer multiply-add at 28 nm, and an effective drop-in replacement for IEEE 754 binary32 single-precision floating point (via round to nearest even) for CNN inference with ResNet-50 on ImageNet.
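For readers new to log-domain arithmetic, the following is a minimal Python sketch of the general idea: values are stored as a sign plus a fixed-point log2 magnitude, multiplication becomes an integer addition of the logs, and products are converted back to the linear domain before being accumulated. It is not the repository's RTL or its exact number format. The function names and the frac_bits parameter are hypothetical, and the floating-point exponentiation stands in for the log-to-linear lookup tables a hardware design would use.

```python
import math

def to_log(x, frac_bits=3):
    """Encode x as (sign, is_zero, fixed-point log2 magnitude).

    frac_bits is a hypothetical parameter: the number of fractional
    bits kept in the log2 magnitude.
    """
    if x == 0.0:
        return (0, True, 0)
    sign = 1 if x < 0 else 0
    # Round the scaled log2 value to the nearest integer (ties to even).
    log_fixed = round(math.log2(abs(x)) * (1 << frac_bits))
    return (sign, False, log_fixed)

def to_linear(v, frac_bits=3):
    """Decode a log-domain value back to a Python float."""
    sign, is_zero, log_fixed = v
    if is_zero:
        return 0.0
    mag = 2.0 ** (log_fixed / (1 << frac_bits))
    return -mag if sign else mag

def log_mul(a, b):
    """Multiply two log-domain values: XOR the signs, add the logs."""
    if a[1] or b[1]:
        return (0, True, 0)
    return (a[0] ^ b[0], False, a[2] + b[2])

def log_dot(xs, ys, frac_bits=3):
    """Dot product: multiply in the log domain, accumulate in linear."""
    acc = 0.0
    for x, y in zip(xs, ys):
        product = log_mul(to_log(x, frac_bits), to_log(y, frac_bits))
        acc += to_linear(product, frac_bits)
    return acc
```

The multiply itself introduces no rounding error beyond the initial encoding, since adding two fixed-point logs is exact. The sketch accumulates in a Python float for brevity, which is a simplification.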

[1] Johnson, J. "Rethinking floating point for deep learning." (2018). https://arxiv.org/abs/1811.01721

[2] Gustafson, J. and Yonemoto, I. "Beating floating point at its own game: Posit arithmetic." Supercomputing Frontiers and Innovations 4.2 (2017): 71-86.

Requirements

You will need:

  • a PyTorch CPU installation
  • a C++11-compatible compiler to build the PyTorch C++ extension module
  • the ImageNet ILSVRC12 image validation set
  • an Intel OpenCL for FPGA compatible board
  • a Quartus Prime Pro installation with the Intel OpenCL for FPGA compiler

rtl contains the SystemVerilog modules needed for the design.

bitstream contains the OpenCL that wraps the RTL modules.

cpp contains host CPU-side code for interacting with the FPGA OpenCL design.

py contains the top-level functionality to compile the CPU code and run networks.

Flow

In bitstream, run

./build_lib.sh <design>

followed by

./build_afu.sh <design> (this will take several hours to synthesize the FPGA design)

where <design> is one of loglib or positlib. The aoc/aocl tools, Quartus, Quartus license file, OpenCL BSP etc. must be in your path/environment. loglib is configured to generate a design with 8-bit (8, 1, 5, 5, 7) log arithmetic, and positlib is configured to generate a design with 8-bit (8, 1) posit arithmetic by default.

The aoc build seems to require a Python 2.x interpreter in the path; otherwise it will fail.
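Before starting a multi-hour build, it can help to confirm that the tools mentioned above are actually on the path. The following is a small sketch of such a check, not part of the repository. The executable name python2 is an assumption (on some systems the Python 2.x interpreter is simply python), and the which parameter exists only to make the function easy to test.

```python
import shutil

def missing_tools(required=("aoc", "aocl", "python2"), which=shutil.which):
    """Return the names in `required` that cannot be found on PATH.

    "python2" is an assumed executable name; adjust it to match how
    Python 2.x is installed on your system.
    """
    return [name for name in required if which(name) is None]

if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools found.")
```

This only checks that the executables can be found; it does not verify the Quartus license file or the OpenCL BSP.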

Update the aocx_file in py/run_fpga_resnet.py to your choice of design.

Update valdir towards the end of py/validate.py to point to a Torch dataset loader compatible installation of the ImageNet validation set.
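A quick sanity check of the validation directory can catch layout problems before running against the FPGA. The sketch below assumes that "Torch dataset loader compatible" means the one-subdirectory-per-class layout used by torchvision.datasets.ImageFolder; that is an assumption, not something stated here, and the function name summarize_valdir is hypothetical.

```python
import os

def summarize_valdir(valdir):
    """Count class subdirectories and the files inside them.

    Assumes an ImageFolder-style layout: valdir/<class_name>/<image files>.
    Returns (number of class directories, total number of files).
    """
    num_classes = 0
    num_files = 0
    for entry in os.scandir(valdir):
        if entry.is_dir():
            num_classes += 1
            num_files += sum(1 for f in os.scandir(entry.path) if f.is_file())
    return num_classes, num_files
```

Under that assumption, the full ILSVRC12 validation set should yield (1000, 50000): 1000 classes with 50 images each.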

Using a Python environment with PyTorch available, in py run:

python run_fpga_resnet.py

If successful, this will run the complete validation set against the FPGA design. This requires a Python 3.x interpreter.

RTL comments

The modules used by the OpenCL design reside in rtl/log/operators and rtl/posit/operators. You can see how they are assembled here.

rtl/paper_syn contains the modules used in the paper's 28 nm synthesis results (Paper*Top.sv are the top-level modules). Waves_*.sv are the testbench programs used to generate switching activity for power analysis output.

You will have to provide your own Synopsys Design Compiler scripts/flow/cell libraries/PDK/etc. for synthesis, as we are not allowed to share details on which 28 nm semiconductor process was used or our Design Compiler synthesis scripts.

Other comments

The posit encoding implemented herein represents negative values with a sign bit rather than with two's complement encoding. It is a TODO to change it, but the cost either way is largely dwarfed by other concerns in my opinion.
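To illustrate the difference between the two encodings, here is a sketch of converting an N-bit word between a sign-plus-magnitude form and the two's complement form used by standard posits [2], where negating a value means taking the two's complement of the whole word. This is an illustration only and is not taken from the RTL. The function names are hypothetical, and the pattern with the sign bit set and a zero magnitude is passed through unchanged because how the design treats that pattern is not covered here.

```python
def sign_magnitude_to_twos_complement(word, nbits=8):
    """Convert a sign-magnitude posit word to two's complement form."""
    mask = (1 << nbits) - 1
    sign_bit = 1 << (nbits - 1)
    if not 0 <= word <= mask:
        raise ValueError("word does not fit in nbits")
    magnitude = word & (sign_bit - 1)
    # Non-negative words, and the sign-only pattern, are left unchanged.
    if not (word & sign_bit) or magnitude == 0:
        return word
    return (-magnitude) & mask

def twos_complement_to_sign_magnitude(word, nbits=8):
    """Convert a two's complement posit word to sign-magnitude form."""
    mask = (1 << nbits) - 1
    sign_bit = 1 << (nbits - 1)
    if not 0 <= word <= mask:
        raise ValueError("word does not fit in nbits")
    # Non-negative words, and the sign-only pattern, are left unchanged.
    if not (word & sign_bit) or word == sign_bit:
        return word
    return sign_bit | ((-word) & mask)
```

For example, with 8 bits the positive word 0x41 negates to 0xC1 in sign-magnitude form and to 0xBF in two's complement form, and the two functions map between those.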

The FPGA design itself is not yet flexible enough to support bit widths other than 8. loglib is restricted to N <= 8 bits at the moment, while positlib should be ok for N <= 16 bits, though some of the larger designs may run into FPGA resource issues if synthesized for the FPGA.

Contributions

This repo currently exists as a proof of concept. Contributions may be considered, but the design is mostly that which is needed to reproduce the results from the paper.

License

This code is licensed under CC-BY-NC 4.0.

This code also includes and uses the Simple Python Fixed-Point Module for LUT SystemVerilog log-to-linear and linear-to-log mapping module generation in rtl/log/luts, which is licensed under the Python-2.4.2 license.

Owner
Facebook Research