A Comprehensive Study on Learning-Based PE Malware Family Classification Methods

Overview

This repository accompanies the paper "A Comprehensive Study on Learning-Based PE Malware Family Classification Methods".

Datasets

Due to copyright restrictions, the MalwareBazaar and MalwareDrift datasets contain only each sample's SHA-256 hash and the related metadata, all of which can be found in the Datasets folder. You can download the raw malware samples from the corresponding open-source malware release website by applying for an API key (see the download sketch after the list below), and then use a disassembly tool to convert each sample into binary and disassembly files.

  • The MalwareBazaar dataset: you can download the samples from MalwareBazaar.
  • The MalwareDrift dataset: you can download the samples from VirusShare.
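As a hedged illustration, the snippet below sketches how one might fetch a single sample by its SHA-256 hash from MalwareBazaar. The endpoint, the `query=get_file` form field, and the `API-KEY` header follow abuse.ch's publicly documented API at the time of writing, but verify them (and the terms of use) before running; the hash and key values are placeholders, and downloaded archives are conventionally ZIP files encrypted with the password "infected".

```python
import requests

API_URL = "https://mb-api.abuse.ch/api/v1/"  # MalwareBazaar query endpoint
API_KEY = "YOUR_API_KEY"                     # placeholder: your abuse.ch API key
SHA256 = "0" * 64                            # placeholder: a hash from the Datasets folder

# Request the sample; the response body is a ZIP archive containing the PE file.
resp = requests.post(
    API_URL,
    data={"query": "get_file", "sha256_hash": SHA256},
    headers={"API-KEY": API_KEY},
    timeout=60,
)
resp.raise_for_status()

with open(f"{SHA256}.zip", "wb") as f:
    f.write(resp.content)
```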

Experimental Settings

| Model | Training Strategy | Optimizer | Learning Rate | Batch Size | Input Format |
|---|---|---|---|---|---|
| ResNet-50 | From Scratch | Adam | 1e-3 | 64 | 224×224 color image |
| ResNet-50 | Transfer | Adam | 1e-3 | All data* | 224×224 color image |
| VGG-16 | From Scratch | SGD | 5e-6** | 64 | 224×224 color image |
| VGG-16 | Transfer | SGD | 5e-6 | 64 | 224×224 color image |
| Inception-V3 | From Scratch | Adam | 1e-3 | 64 | 224×224 color image |
| Inception-V3 | Transfer | Adam | 1e-3 | All data | 224×224 color image |
| IMCFN | From Scratch | SGD | 5e-6*** | 32 | 224×224 color image |
| IMCFN | Transfer | SGD | 5e-6*** | 32 | 224×224 color image |
| CBOW+MLP | - | SGD | 1e-3 | 128 | CBOW: byte sequences; MLP: 256×256 matrix |
| MalConv | - | SGD | 1e-3 | 32 | 2MB raw byte values |
| MAGIC | - | Adam | 1e-4 | 10 | ACFG |
| Word2Vec+KNN | - | - | - | - | Word2Vec: opcode sequences; KNN distance measure: WMD |
| MCSC | - | SGD | 5e-3 | 64 | Opcode sequences |

* The batch size is set to 128 for the MalwareBazaar dataset.
** The learning rate is set to 5e-5 for the Malimg dataset and 1e-5 for the MalwareBazaar dataset.
*** The learning rate is set to 1e-5 for the MalwareBazaar dataset.
CBOW uses the default parameters of the Word2Vec implementation in Python's Gensim library.
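To make the "Input Format" column concrete, here is a minimal sketch of two of the preprocessing steps, assuming the raw PE bytes and the extracted opcode sequences are already on disk; the file path, helper name, and toy corpus are hypothetical. The CBOW model is built with Gensim's Word2Vec using `sg=0` (CBOW) and otherwise default parameters, as noted above.

```python
import numpy as np
from PIL import Image
from gensim.models import Word2Vec

def bytes_to_color_image(path: str, size: int = 224) -> Image.Image:
    """Render a PE file's raw bytes as a 224x224 RGB image
    (the input format used by the four image-based models)."""
    data = np.frombuffer(open(path, "rb").read(), dtype=np.uint8)
    side = int(np.ceil(np.sqrt(data.size / 3)))      # smallest square RGB canvas
    padded = np.zeros(side * side * 3, dtype=np.uint8)
    padded[: data.size] = data                       # zero-pad the tail
    img = Image.fromarray(padded.reshape(side, side, 3), mode="RGB")
    return img.resize((size, size))                  # rescale to 224x224

# CBOW embeddings over token sequences with Gensim defaults.
# `corpus` is a list of token lists (e.g. opcode mnemonics per sample);
# this toy corpus just repeats one sequence so tokens pass the default min_count.
corpus = [["push", "mov", "call", "mov", "ret"]] * 20
cbow = Word2Vec(sentences=corpus, sg=0)  # sg=0 selects CBOW; other params default
vec = cbow.wv["mov"]                     # 100-dim vector under Gensim defaults
```

For the Word2Vec+KNN row, Gensim's `KeyedVectors.wmdistance` can supply the WMD distance between two token sequences (it needs an optimal-transport backend such as pyemd or POT, depending on the Gensim version).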

Graphical Analysis of Table 4 and Table 5

Below is a more detailed figure-by-figure analysis of Table 4 and Table 5, intended to make the raw numbers in the paper easier to digest.

Table 4

  • The classification performance (F1-Score) of each approach on the three datasets

    The figure shows the classification performance (F1-Score) of each method on the three datasets. Note that the Malimg dataset contains only malware images, so it can only be used to evaluate the 4 image-based methods.

  • The average classification performance (F1-Score) of each approach across the three datasets

    The figure shows the average classification performance (F1-Score) of each method across the three datasets. For each model, the reported F1-Score is the average of its F1-Scores on the three datasets, representing its average performance.

  • The train time and resource overhead of each method on the three datasets

    The figure shows the training time (left subplot) and resource overhead (right subplot) required by each method on the three datasets. The bar immediately to the right of each training-time bar is the memory overhead of that model. As before, only the 4 image-based models are evaluated on the Malimg dataset.

Table 5

  • The classification performance (F1-Score) of transfer learning for image-based approaches on the three datasets

    This figure shows the F1-Score obtained by each image-based model when trained from scratch and with 10%, 50%, 80%, and 100% transfer learning, respectively. The subplots correspond to the BIG-15, Malimg, and MalwareBazaar datasets, respectively.

  • The train time and resource overhead of transfer learning for image-based approaches on the three datasets

    Each row corresponds to the BIG-15, Malimg, and MalwareBazaar dataset, respectively, and covers 4 models (ResNet-50, VGG-16, Inception-V3, and IMCFN). Each model has 8 bars: the left 4 give the training time under 10%, 50%, 80%, and 100% transfer learning, and the right 4 give the memory overhead under the same settings. A minimal fine-tuning sketch for the transfer-learning configuration follows below.
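For reference, the sketch below shows one way to set up the ResNet-50 transfer-learning configuration from the settings table (Adam, learning rate 1e-3, 224×224 color-image input), starting from torchvision's ImageNet-pretrained weights via the newer `weights=` API. The number of malware families and the training step are placeholders; the 10%/50%/80%/100% variants differ only in how much of the training split is used for fine-tuning.

```python
import torch
import torch.nn as nn
from torchvision import models

num_families = 9  # placeholder: number of malware families in the target dataset

# Start from ImageNet-pretrained ResNet-50 and replace the classifier head.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, num_families)

# Settings from the table: Adam, lr 1e-3 (batch size 64, 224x224 color inputs).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One fine-tuning step on a batch of 224x224 color images."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```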
