A neural-based binary analysis tool

Related tags

Data Analysisnbref
Overview

A neural-based binary analysis tool

Introduction

This directory contains the demo of a neural-based binary analysis tool. We test the framework using multiple binary analysis tasks: (i) vulnerability detection. (ii) code similarity measures. (iii) decompilations. (iv) malware analysis (coming later).

Requirements

  • Python 3.7.6
  • Python packages
    • dgl 0.6.0
    • numpy 1.18.1
    • pandas 1.2.0
    • scipy 1.4.1
    • sklearn 0.0
    • tensorboard 2.2.1
    • torch 1.5.0
    • torchtext 0.2.0
    • tqdm 4.42.1
    • wget 3.2
  • C++14 compatible compiler
  • Clang++ 3.7.1

Tasks and Dataset preparation

Binary code similarity measures

  1. Download dataset
    • Download POJ-104 datasets from here and extract them into data/.
  2. Compile and preprocess
    • Run python extract_obj.py -a data/obj (clang++-3.7.1 required)
    • Run python preprocess/split_dataset.py -i data/obj -m p -o data/split.pkl to split the dataset into train/valid/test sets.
    • Run python preprocess/sim_preprocess.py to compile the binary code into graphs data.
    • *(part of the preprocessing code are from [1])

Binary Vulnerability detections

  1. Cramming the binary dataset
    • The dataset is built on top of Devign. We compile the entire library based on the commit id and dump the binary code of the vulnerable functions. The cramming code is given in preprocess/cram_vul_dataset.
  2. Download Preprocessed data
    • Run ./preprocess.sh (clang++-3.7.1 required), or
    • You can directly download the preprocessed datasets from here and extract them into data/.
    • Run python preprocess/vul_preprocess.py to compile the binary code into graphs data

Binary decompilation [N-Bref]

  1. Download dataset
    • Download the demo datasets (raw and preprocessed data) from here and extract them into data/. (More datasets to come.)
    • No need to compile the code into graph again as the data has already been preprocessed.

Training and Evaluation

Binary code similarity measures

  • Run cd baseline_model && python run_similarity_check.py

Binary Vulnerability detections

  • Run cd baseline_model && python run_vulnerability_detection.py

Binary decompilation [N-Bref]

  1. Dump the trace of tree expansion:
    • To accelerate the online processing of the tree output, we will dump the trace of the trea data by running python -m preprocess.dump_trace
  2. Training scripts:
    • First, cd baseline model.
    • To train the model using torch parallel, run python run_tree_transformer.py.
    • To train it on multi-gpu using distribute pytorch, run python run_tree_transformer_multi_gpu.py
    • To evaluate, run python run_tree_transformer.py --eval
    • To evaluate a multi-gpu trained model, run python run_tree_transformer_multi_gpu.py --eval

References

[1] Ye, Fangke, et al. "MISIM: An End-to-End Neural Code Similarity System." arXiv preprint arXiv:2006.05265 (2020).

[2] Zhou, Yaqin, et al. "Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks." Advances in Neural Information Processing Systems. 2019.

[3] Shi, Zhan, et al. "Learning Execution through Neural Code Fusion.", ICLR (2019).

License

This repo is CC-BY-NC licensed, as found in the LICENSE file.

Owner
Facebook Research
Facebook Research
Functional tensors for probabilistic programming

Funsor Funsor is a tensor-like library for functions and distributions. See Functional tensors for probabilistic programming for a system description.

208 Dec 29, 2022
Gaussian processes in TensorFlow

Website | Documentation (release) | Documentation (develop) | Glossary Table of Contents What does GPflow do? Installation Getting Started with GPflow

GPflow 1.7k Jan 06, 2023
Single machine, multiple cards training; mix-precision training; DALI data loader.

Template Script Category Description Category script comparison script train.py, loader.py for single-machine-multiple-cards training train_DP.py, tra

2 Jun 27, 2022
An Aspiring Drop-In Replacement for NumPy at Scale

Legate NumPy is a Legate library that aims to provide a distributed and accelerated drop-in replacement for the NumPy API on top of the Legion runtime. Using Legate NumPy you do things like run the f

Legate 502 Jan 03, 2023
Numerical Analysis toolkit centred around PDEs, for demonstration and understanding purposes not production

Numerics Numerical Analysis toolkit centred around PDEs, for demonstration and understanding purposes not production Use procedure: Initialise a new i

George Whittle 1 Nov 13, 2021
CubingB is a timer/analyzer for speedsolving Rubik's cubes, with smart cube support

CubingB is a timer/analyzer for speedsolving Rubik's cubes (and related puzzles). It focuses on supporting "smart cubes" (i.e. bluetooth cubes) for recording the exact moves of a solve in real time.

Zach Wegner 5 Sep 18, 2022
International Space Station data with Python research 🌎

International Space Station data with Python research 🌎 Plotting ISS trajectory, calculating the velocity over the earth and more. Plotting trajector

Facundo Pedaccio 41 Jun 16, 2022
Full automated data pipeline using docker images

Create postgres tables from CSV files This first section is only relate to creating tables from CSV files using postgres container alone. Just one of

1 Nov 21, 2021
Stochastic Gradient Trees implementation in Python

Stochastic Gradient Trees - Python Stochastic Gradient Trees1 by Henry Gouk, Bernhard Pfahringer, and Eibe Frank implementation in Python. Based on th

John Koumentis 2 Nov 18, 2022
A Numba-based two-point correlation function calculator using a grid decomposition

A Numba-based two-point correlation function (2PCF) calculator using a grid decomposition. Like Corrfunc, but written in Numba, with simplicity and hackability in mind.

Lehman Garrison 3 Aug 24, 2022
Using Python to scrape some basic player information from www.premierleague.com and then use Pandas to analyse said data.

PremiershipPlayerAnalysis Using Python to scrape some basic player information from www.premierleague.com and then use Pandas to analyse said data. No

5 Sep 06, 2021
Pyspark Spotify ETL

This is my first Data Engineering project, it extracts data from the user's recently played tracks using Spotify's API, transforms data and then loads it into Postgresql using SQLAlchemy engine. Data

16 Jun 09, 2022
Python dataset creator to construct datasets composed of OpenFace extracted features and Shimmer3 GSR+ Sensor datas

Python dataset creator to construct datasets composed of OpenFace extracted features and Shimmer3 GSR+ Sensor datas

Gabriele 3 Jul 05, 2022
A variant of LinUCB bandit algorithm with local differential privacy guarantee

Contents LDP LinUCB Description Model Architecture Dataset Environment Requirements Script Description Script and Sample Code Script Parameters Launch

Weiran Huang 4 Oct 25, 2022
Python-based Space Physics Environment Data Analysis Software

pySPEDAS pySPEDAS is an implementation of the SPEDAS framework for Python. The Space Physics Environment Data Analysis Software (SPEDAS) framework is

SPEDAS 98 Dec 22, 2022
Python scripts aim to use a Random Forest machine learning algorithm to predict the water affinity of Metal-Organic Frameworks

The following Python scripts aim to use a Random Forest machine learning algorithm to predict the water affinity of Metal-Organic Frameworks (MOFs). The training set is extracted from the Cambridge S

1 Jan 09, 2022
In this project, ETL pipeline is build on data warehouse hosted on AWS Redshift.

ETL Pipeline for AWS Project Description In this project, ETL pipeline is build on data warehouse hosted on AWS Redshift. The data is loaded from S3 t

Mobeen Ahmed 1 Nov 01, 2021
HyperSpy is an open source Python library for the interactive analysis of multidimensional datasets

HyperSpy is an open source Python library for the interactive analysis of multidimensional datasets that can be described as multidimensional arrays o

HyperSpy 411 Dec 27, 2022
A library to create multi-page Streamlit applications with ease.

A library to create multi-page Streamlit applications with ease.

Jackson Storm 107 Jan 04, 2023
This repo contains a simple but effective tool made using python which can be used for quality control in statistical approach.

This repo contains a powerful tool made using python which is used to visualize, analyse and finally assess the quality of the product depending upon the given observations

SasiVatsal 8 Oct 18, 2022