Source code and Dataset creation for the paper "Neural Symbolic Regression That Scales"

Overview

NeuralSymbolicRegressionThatScales

Pytorch implementation and pretrained models for the paper "Neural Symbolic Regression That Scales", presented at ICML 2021. Our deep-learning based approach is the first symbolic regression method that leverages large scale pre-training. We procedurally generate an unbounded set of equations, and simultaneously pre-train a Transformer to predict the symbolic equation from a corresponding set of input-output-pairs.

For details, see Neural Symbolic Regression That Scales. [arXiv]

Installation

Please clone and install this repository via

git clone https://github.com/SymposiumOrganization/NeuralSymbolicRegressionThatScales.git
cd NeuralSymbolicRegressionThatScales/
pip3 install -e src/

This library requires python>3.7

Pretrained models

We offer two models, "10M" and "100M". Both are trained with parameter configuration showed in dataset_configuration.json (which contains details about how datasets are created) and scripts/config.yaml (which contains details of how models are trained). "10M" model is trained with 10 million datasets and "100M" model is trained with 100 millions dataset.

  • Link to 100M: [Link]
  • Link to 10M: [Link]

If you want to try the models out, look at jupyter/fit_func.ipynb. Before running the notebook, make sure to first create a folder named "weights" and to download the provided checkpoints there.

Dataset Generation

Before training, you need a dataset of equations. Here the steps to follow

Raw training dataset generation

The equation generator scripts are based on [SymbolicMathematics] First, if you want to change the defaults value, configure the dataset_configuration.json file:

{
    "max_len": 20, #Maximum length of an equation
    "operators": "add:10,mul:10,sub:5,div:5,sqrt:4,pow2:4,pow3:2,pow4:1,pow5:1,ln:4,exp:4,sin:4,cos:4,tan:4,asin:2", #Operator unnormalized probability
    "max_ops": 5, #Maximum number of operations
    "rewrite_functions": "", #Not used, leave it empty
    "variables": ["x_1","x_2","x_3"], #Variable names, if you want to add more add follow the convention i.e. x_4, x_5,... and so on
    "eos_index": 1,
    "pad_index": 0
}

There are two ways to generate this dataset:

  • If you are running on linux, you use makefile in terminal as follows:
export NUM=${NumberOfEquations} #Export num of equations
make data/raw_datasets/${NUM}: #Launch make file command

NumberOfEquations can be defined in two formats with K or M suffix. For instance 100K is equal to 100'000 while 10M is equal to 10'0000000 For example, if you want to create a 10M dataset simply:

export NUM=10M #Export num variable
make data/raw_datasets/10M: #Launch make file command
  • Run this script:
python3 scripts/data_creation/dataset_creation.py --number_of_equations NumberOfEquations --no-debug #Replace NumberOfEquations with the number of equations you want to generate

After this command you will have a folder named data/raw_data/NumberOfEquations containing .h5 files. By default, each of this h5 files contains a maximum of 5e4 equations.

Raw test dataset generation

This step is optional. You can skip it if you want to use our test set used for the paper (located in test_set/nc.csv). Use the same commands as before for generating a validation dataset. All equations in this dataset will be remove from the training dataset in the next stage, hence this validation dataset should be small. For our paper it constisted of 200 equations.

#Code for generating a 150 equation dataset 
python3 scripts/data_creation/dataset_creation.py --number_of_equations 150 --no-debug #This code creates a new folder data/raw_datasets/150

If you want, you can convert the newly created validation dataset in a csv format. To do so, run: python3 scripts/csv_handling/dataload_format_to_csv.py raw_test_path=data/raw_datasets/150 This command will create two csv files named test_nc.csv (equations without constants) and test_wc.csv (equation with constants) in the test_set folder.

Remove test and numerical problematic equations from the training dataset

The following steps will remove the validation equations from the training set and remove equations that are always nan, inf, etc.

  • path_to_data_folder=data/raw_datasets/100000 if you have created a 100K dataset
  • path_to_csv=test_set/test_nc.csv if you have created 150 equations for validation. If you want to use the one in the paper replace it with nc.csv
python3 scripts/data_creation/filter_from_already_existing.py --data_path path_to_data_folder --csv_path path_to_csv #You can leave csv_path empty if you do not want to create a validation set
python3 scripts/data_creation/apply_filtering.py --data_path path_to_data_folder 

You should now have a folder named data/datasets/100000. This will be the training folder.

Training

Once you have created your training and validation datasets run

python3 scripts/train.py

You can configure the config.yaml with the necessary options. Most important, make sure you have set train_path and val_path correctly. If you have followed the 100K example this should be set as:

train_path:  data/datasets/100000
val_path: data/raw_datasets/150
Sparse-dense operators implementation for Paddle

Sparse-dense operators implementation for Paddle This module implements coo, csc and csr matrix formats and their inter-ops with dense matrices. Feel

北海若 3 Dec 17, 2022
Hard cater examples from Hopper ICLR paper

CATER-h Honglu Zhou*, Asim Kadav, Farley Lai, Alexandru Niculescu-Mizil, Martin Renqiang Min, Mubbasir Kapadia, Hans Peter Graf (*Contact: honglu.zhou

NECLA ML Group 6 May 11, 2021
A PyTorch implementation of "From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network" (ICCV2021)

From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network The official code of VisionLAN (ICCV2021). VisionLAN successfully a

81 Dec 12, 2022
A computer vision pipeline to identify the "icons" in Christian paintings

Christian-Iconography A computer vision pipeline to identify the "icons" in Christian paintings. A bit about iconography. Iconography is related to id

Rishab Mudliar 3 Jul 30, 2022
ScaleNet: A Shallow Architecture for Scale Estimation

ScaleNet: A Shallow Architecture for Scale Estimation Repository for the code of ScaleNet paper: "ScaleNet: A Shallow Architecture for Scale Estimatio

Axel Barroso 34 Nov 09, 2022
Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue

Realtime Unsupervised Depth Estimation from an Image This is the caffe implementation of our paper "Unsupervised CNN for single view depth estimation:

Ravi Garg 227 Nov 28, 2022
Python library containing BART query generation and BERT-based Siamese models for neural retrieval.

Neural Retrieval Embedding-based Zero-shot Retrieval through Query Generation leverages query synthesis over large corpuses of unlabeled text (such as

Amazon Web Services - Labs 35 Apr 14, 2022
Improving Compound Activity Classification via Deep Transfer and Representation Learning

Improving Compound Activity Classification via Deep Transfer and Representation Learning This repository is the official implementation of Improving C

NingLab 2 Nov 24, 2021
Single-step adversarial training (AT) has received wide attention as it proved to be both efficient and robust.

Subspace Adversarial Training Single-step adversarial training (AT) has received wide attention as it proved to be both efficient and robust. However,

15 Sep 02, 2022
An implementation of DeepMind's Relational Recurrent Neural Networks in PyTorch.

relational-rnn-pytorch An implementation of DeepMind's Relational Recurrent Neural Networks (Santoro et al. 2018) in PyTorch. Relational Memory Core (

Sang-gil Lee 241 Nov 18, 2022
Spatial color quantization in Rust

rscolorq Rust port of Derrick Coetzee's scolorq, based on the 1998 paper "On spatial quantization of color images" by Jan Puzicha, Markus Held, Jens K

Collyn O'Kane 37 Dec 22, 2022
Physics-informed convolutional-recurrent neural networks for solving spatiotemporal PDEs

PhyCRNet Physics-informed convolutional-recurrent neural networks for solving spatiotemporal PDEs Paper link: [ArXiv] By: Pu Ren, Chengping Rao, Yang

Pu Ren 11 Aug 23, 2022
Unoffical reMarkable AddOn for Firefox.

reMarkable for Firefox (Download) This repo converts the offical reMarkable Chrome Extension into a Firefox AddOn published here under the name "Unoff

Jelle Schutter 45 Nov 28, 2022
Official PyTorch implementation of "Synthesis of Screentone Patterns of Manga Characters"

Manga Character Screentone Synthesis Official PyTorch implementation of "Synthesis of Screentone Patterns of Manga Characters" presented in IEEE ISM 2

Tsubota 2 Nov 20, 2021
The Pytorch implementation for "Video-Text Pre-training with Learned Regions"

Region_Learner The Pytorch implementation for "Video-Text Pre-training with Learned Regions" (arxiv) We are still cleaning up the code further and pre

Rui Yan 0 Mar 20, 2022
A project that uses optical flow and machine learning to detect aimhacking in video clips.

waldo-anticheat A project that aims to use optical flow and machine learning to visually detect cheating or hacking in video clips from fps games. Che

waldo.vision 542 Dec 03, 2022
Dynamic Neural Representational Decoders for High-Resolution Semantic Segmentation

Dynamic Neural Representational Decoders for High-Resolution Semantic Segmentation Requirements This repository needs mmsegmentation Training To train

Adelaide Intelligent Machines (AIM) Group 7 Sep 12, 2022
Matching python environment code for Lux AI 2021 Kaggle competition, and a gym interface for RL models.

Lux AI 2021 python game engine and gym This is a replica of the Lux AI 2021 game ported directly over to python. It also sets up a classic Reinforceme

Geoff McDonald 74 Nov 03, 2022
PyTorch implementation of normalizing flow models

PyTorch implementation of normalizing flow models

Vincent Stimper 242 Jan 02, 2023
Code for visualizing the loss landscape of neural nets

Visualizing the Loss Landscape of Neural Nets This repository contains the PyTorch code for the paper Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer

Tom Goldstein 2.2k Jan 09, 2023