A neural-based binary analysis tool

Related tags

Data Analysisnbref
Overview

A neural-based binary analysis tool

Introduction

This directory contains the demo of a neural-based binary analysis tool. We test the framework using multiple binary analysis tasks: (i) vulnerability detection. (ii) code similarity measures. (iii) decompilations. (iv) malware analysis (coming later).

Requirements

  • Python 3.7.6
  • Python packages
    • dgl 0.6.0
    • numpy 1.18.1
    • pandas 1.2.0
    • scipy 1.4.1
    • sklearn 0.0
    • tensorboard 2.2.1
    • torch 1.5.0
    • torchtext 0.2.0
    • tqdm 4.42.1
    • wget 3.2
  • C++14 compatible compiler
  • Clang++ 3.7.1

Tasks and Dataset preparation

Binary code similarity measures

  1. Download dataset
    • Download POJ-104 datasets from here and extract them into data/.
  2. Compile and preprocess
    • Run python extract_obj.py -a data/obj (clang++-3.7.1 required)
    • Run python preprocess/split_dataset.py -i data/obj -m p -o data/split.pkl to split the dataset into train/valid/test sets.
    • Run python preprocess/sim_preprocess.py to compile the binary code into graphs data.
    • *(part of the preprocessing code are from [1])

Binary Vulnerability detections

  1. Cramming the binary dataset
    • The dataset is built on top of Devign. We compile the entire library based on the commit id and dump the binary code of the vulnerable functions. The cramming code is given in preprocess/cram_vul_dataset.
  2. Download Preprocessed data
    • Run ./preprocess.sh (clang++-3.7.1 required), or
    • You can directly download the preprocessed datasets from here and extract them into data/.
    • Run python preprocess/vul_preprocess.py to compile the binary code into graphs data

Binary decompilation [N-Bref]

  1. Download dataset
    • Download the demo datasets (raw and preprocessed data) from here and extract them into data/. (More datasets to come.)
    • No need to compile the code into graph again as the data has already been preprocessed.

Training and Evaluation

Binary code similarity measures

  • Run cd baseline_model && python run_similarity_check.py

Binary Vulnerability detections

  • Run cd baseline_model && python run_vulnerability_detection.py

Binary decompilation [N-Bref]

  1. Dump the trace of tree expansion:
    • To accelerate the online processing of the tree output, we will dump the trace of the trea data by running python -m preprocess.dump_trace
  2. Training scripts:
    • First, cd baseline model.
    • To train the model using torch parallel, run python run_tree_transformer.py.
    • To train it on multi-gpu using distribute pytorch, run python run_tree_transformer_multi_gpu.py
    • To evaluate, run python run_tree_transformer.py --eval
    • To evaluate a multi-gpu trained model, run python run_tree_transformer_multi_gpu.py --eval

References

[1] Ye, Fangke, et al. "MISIM: An End-to-End Neural Code Similarity System." arXiv preprint arXiv:2006.05265 (2020).

[2] Zhou, Yaqin, et al. "Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks." Advances in Neural Information Processing Systems. 2019.

[3] Shi, Zhan, et al. "Learning Execution through Neural Code Fusion.", ICLR (2019).

License

This repo is CC-BY-NC licensed, as found in the LICENSE file.

Owner
Facebook Research
Facebook Research
MoRecon - A tool for reconstructing missing frames in motion capture data.

MoRecon - A tool for reconstructing missing frames in motion capture data.

Yuki Nishidate 38 Dec 03, 2022
Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

Amundsen 3.7k Jan 03, 2023
An experimental project I'm undertaking for the sole purpose of increasing my Python knowledge

5ePy is an experimental project I'm undertaking for the sole purpose of increasing my Python knowledge. #Goals Goal: Create a working, albeit lightwei

Hayden Covington 1 Nov 24, 2021
Python scripts aim to use a Random Forest machine learning algorithm to predict the water affinity of Metal-Organic Frameworks

The following Python scripts aim to use a Random Forest machine learning algorithm to predict the water affinity of Metal-Organic Frameworks (MOFs). The training set is extracted from the Cambridge S

1 Jan 09, 2022
Functional tensors for probabilistic programming

Funsor Funsor is a tensor-like library for functions and distributions. See Functional tensors for probabilistic programming for a system description.

208 Dec 29, 2022
Random dataframe and database table generator

Random database/dataframe generator Authored and maintained by Dr. Tirthajyoti Sarkar, Fremont, USA Introduction Often, beginners in SQL or data scien

Tirthajyoti Sarkar 249 Jan 08, 2023
Data analysis and visualisation projects from a range of individual projects and applications

Python-Data-Analysis-and-Visualisation-Projects Data analysis and visualisation projects from a range of individual projects and applications. Python

Tom Ritman-Meer 1 Jan 25, 2022
Pyspark Spotify ETL

This is my first Data Engineering project, it extracts data from the user's recently played tracks using Spotify's API, transforms data and then loads it into Postgresql using SQLAlchemy engine. Data

16 Jun 09, 2022
Fancy data functions that will make your life as a data scientist easier.

WhiteBox Utilities Toolkit: Tools to make your life easier Fancy data functions that will make your life as a data scientist easier. Installing To ins

WhiteBox 3 Oct 03, 2022
Analytical view of olist e-commerce in Brazil

Analysis of E-Commerce Public Dataset by Olist The objective of this project is to propose an analytical view of olist e-commerce in Brazil. For this

Gurpreet Singh 1 Jan 11, 2022
Mortgage-loan-prediction - Show how to perform advanced Analytics and Machine Learning in Python using a full complement of PyData utilities

Mortgage-loan-prediction - Show how to perform advanced Analytics and Machine Learning in Python using a full complement of PyData utilities. This is aimed at those looking to get into the field of D

Joachim 1 Dec 26, 2021
Senator Trades Monitor

Senator Trades Monitor This monitor will grab the most recent trades by senators and send them as a webhook to discord. Installation To use the monito

Yousaf Cheema 5 Jun 11, 2022
Autopsy Module to analyze Registry Hives based on bookmarks provided by EricZimmerman for his tool RegistryExplorer

Autopsy Module to analyze Registry Hives based on bookmarks provided by EricZimmerman for his tool RegistryExplorer

Mohammed Hassan 13 Mar 31, 2022
My first Python project is a simple Mad Libs program.

Python CLI Mad Libs Game My first Python project is a simple Mad Libs program. Mad Libs is a phrasal template word game created by Leonard Stern and R

Carson Johnson 1 Dec 10, 2021
BErt-like Neurophysiological Data Representation

BENDR BErt-like Neurophysiological Data Representation This repository contains the source code for reproducing, or extending the BERT-like self-super

114 Dec 23, 2022
A highly efficient and modular implementation of Gaussian Processes in PyTorch

GPyTorch GPyTorch is a Gaussian process library implemented using PyTorch. GPyTorch is designed for creating scalable, flexible, and modular Gaussian

3k Jan 02, 2023
Example Of Splunk Search Query With Python And Splunk Python SDK

SSQAuto (Splunk Search Query Automation) Example Of Splunk Search Query With Python And Splunk Python SDK installation: ➜ ~ git clone https://github.c

AmirHoseinTangsiriNET 1 Nov 14, 2021
This is a tool for speculation of ancestral allel, calculation of sfs and drawing its bar plot.

superSFS This is a tool for speculation of ancestral allel, calculation of sfs and drawing its bar plot. It is easy-to-use and runing fast. What you s

3 Dec 16, 2022
Calculate multilateral price indices in Python (with Pandas and PySpark).

IndexNumCalc Calculate multilateral price indices using the GEKS-T (CCDI), Time Product Dummy (TPD), Time Dummy Hedonic (TDH), Geary-Khamis (GK) metho

Dr. Usman Kayani 3 Apr 27, 2022