A Guide for Feature Engineering and Feature Selection, with implementations and examples in Python.

Overview

Feature Engineering & Feature Selection

A comprehensive guide [pdf] [markdown] for Feature Engineering and Feature Selection, with implementations and examples in Python.

Motivation

Feature Engineering & Selection is the most essential part of building a useable machine learning project, even though hundreds of cutting-edge machine learning algorithms coming in these days like deep learning and transfer learning. Indeed, like what Prof Domingos, the author of  'The Master Algorithm' says:

“At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.”

— Prof. Pedro Domingos

001

Data and feature has the most impact on a ML project and sets the limit of how well we can do, while models and algorithms are just approaching that limit. However, few materials could be found that systematically introduce the art of feature engineering, and even fewer could explain the rationale behind. This repo is my personal notes from learning ML and serves as a reference for Feature Engineering & Selection.

Download

Download the PDF here:

Same, but in markdown:

PDF has a much readable format, while Markdown has auto-generated anchor link to navigate from outer source. GitHub sucks at displaying markdown with complex grammar, so I would suggest read the PDF or download the repo and read markdown with Typora.

What You'll Learn

Not only a collection of hands-on functions, but also explanation on Why, How and When to adopt Which techniques of feature engineering in data mining.

  • the nature and risk of data problem we often encounter
  • explanation of the various feature engineering & selection techniques
  • rationale to use it
  • pros & cons of each method
  • code & example

Getting Started

This repo is mainly used as a reference for anyone who are doing feature engineering, and most of the modules are implemented through scikit-learn or its communities.

To run the demos or use the customized function, please download the ZIP file from the repo or just copy-paste any part of the code you find helpful. They should all be very easy to understand.

Required Dependencies:

  • Python 3.5, 3.6 or 3.7
  • numpy>=1.15
  • pandas>=0.23
  • scipy>=1.1.0
  • scikit_learn>=0.20.1
  • seaborn>=0.9.0

Table of Contents and Code Examples

Below is a list of methods currently implemented in the repo.

1. Data Exploration

2. Feature Cleaning

3. Feature Engineering

4. Feature Selection

Key Links and Resources

  • Udemy's Feature Engineering online course

https://www.udemy.com/feature-engineering-for-machine-learning/

  • Udemy's Feature Selection online course

https://www.udemy.com/feature-selection-for-machine-learning

  • JMLR Special Issue on Variable and Feature Selection

http://jmlr.org/papers/special/feature03.html

  • Data Analysis Using Regression and Multilevel/Hierarchical Models, Chapter 25: Missing data

http://www.stat.columbia.edu/~gelman/arm/missing.pdf

  • Data mining and the impact of missing data

http://core.ecu.edu/omgt/krosj/IMDSDataMining2003.pdf

  • PyOD: A Python Toolkit for Scalable Outlier Detection

https://github.com/yzhao062/pyod

  • Weight of Evidence (WoE) Introductory Overview

http://documentation.statsoft.com/StatisticaHelp.aspx?path=WeightofEvidence/WeightofEvidenceWoEIntroductoryOverview

  • About Feature Scaling and Normalization

http://sebastianraschka.com/Articles/2014_about_feature_scaling.html

  • Feature Generation with RF, GBDT and Xgboost

https://blog.csdn.net/anshuai_aw1/article/details/82983997

  • A review of feature selection methods with applications

https://ieeexplore.ieee.org/iel7/7153596/7160221/07160458.pdf

Owner
Yimeng.Zhang
I'm a lovely machine learning learner~
Yimeng.Zhang
Roblox Limited Sniper For Python

Info this is version 2.1 version 3 will support more options (install python: https://www.python.org) the program will buy any limited item with a pri

1 Dec 09, 2021
Types for the Rasterio package

types-rasterio Types for the rasterio package A work in progress Install Not yet published to PyPI pip install types-rasterio These type definitions

Kyle Barron 7 Sep 10, 2021
A Python 3 client for the beanstalkd work queue

Greenstalk Greenstalk is a small and unopinionated Python client library for communicating with the beanstalkd work queue. The API provided mostly map

Justin Mayhew 67 Dec 08, 2022
Users can read others' travel journeys in addition to being able to upload and delete posts detailing their own experiences

Users can read others' travel journeys in addition to being able to upload and delete posts detailing their own experiences! Posts are organized by country and destination within that country.

Christopher Zeas 1 Feb 03, 2022
Web app to find your chance of winning at Texas Hold 'Em

poker_mc Web app to find your chance of winning at Texas Hold 'Em A working version of this project is deployed at poker-mc.ue.r.appspot.com. It's run

Aadith Vittala 7 Sep 15, 2021
Convert your Gyrosco.pe travels to GPX files

gyroscope2gpx This little python joint will do you a favor of taking your "Travel" export from Gyroscope (https://gyrosco.pe) and turn it into a bunch

nick g 4 Oct 02, 2022
Small C-like language compiler for the Uxn assembly language

Pyuxncle is a single-pass compiler for a small subset of C (albeit without the std library). This compiler targets Uxntal, the assembly language of the Uxn virtual computer. The output Uxntal is not

CPunch 13 Jun 28, 2022
Analisador de strings feito em Python // String parser made in Python

Este é um analisador feito em Python, neste programa, estou estudando funções e a sua junção com "if's" e dados colocados pelo usuário. Neste código,

Dev Nasser 1 Nov 03, 2021
Zapiski za ure o C++-u

cpp-notes Zapiski o C++-u. Objavljena verzija je na https://e6.ijs.si/~jslak/c++/ Generating the notes The setup assumes you are working in a Linux en

Jure Slak 1 Jan 05, 2022
calculadora financiera hecha en python

Calculadora financiera Calculadora de factores financieros basicos, puede calcular tanto factores como expresiones algebraicas en funcion de dichos fa

crudo 5 Nov 10, 2021
Export transactions for an algorand wallet to a CSV file

algorand_txn_csv_exporter - (Algorand transaction CSV exporter) This script will export transactions for an algorand wallet to a CSV file. It is inten

TeneoPython01 5 Jun 19, 2022
Syarat.ID Source Code - Syarat.ID is a content aggregator website

Syarat.ID is a content aggregator website that gathering all informations with the specific keyword: "syarat" from the internet.

Syarat.ID 2 Oct 15, 2021
Basic Hspice runner with Python

HSpicePy Bilgisayarınıza PATH değişkenlerine eklediğiniz HSPICE programını python ile çalıştırmanızı sağlayan basit bir araç. A simple tool that allow

1 Nov 16, 2021
Multiple GNOME terminals in one window

Terminator by Chris Jones [email protected] and others. Description Terminator was

GNOME Terminator 1.5k Jan 01, 2023
Slotscheck - Find mistakes in your slots definitions

🎰 Slotscheck Adding __slots__ to a class in Python is a great way to reduce mem

Arie Bovenberg 67 Dec 31, 2022
Mata kuliah Bahasa Pemrograman

praktikum2 MENGHITUNG LUAS DAN KELILING LINGKARAN FLOWCHART : OUTPUT PROGRAM : PENJELASAN : Tetapkan nilai pada variabel sesuai inputan dari user :

2 Nov 09, 2021
Import Apex legends mprt files exported from Legion

Apex-mprt-importer-for-Blender Import Apex legends mprt files exported from Legion. REQUIRES CAST IMPORTER Usage: Use a VPK extracter to extract the m

15 Dec 18, 2022
TickerRain is an open-source web app that stores and analysis Reddit posts in a transparent and semi-interactive manner.

TickerRain is an open-source web app that stores and analysis Reddit posts in a transparent and semi-interactive manner

GonVas 180 Oct 08, 2022
Get the stats of a (or more) Hypixel player(s)

Hypixel_Stats Get the statistics of a (or more) Hypixel player(s) Who needs this? Everyone who plays a lot of Minecraft and often plays on mc.hypixel.

Finnomator 1 Feb 12, 2022
Lightweight library for accessing data and configuration

accsr This lightweight library contains utilities for managing, loading, uploading, opening and generally wrangling data and configurations. It was ba

appliedAI Initiative 7 Mar 09, 2022