🤖 ⚡ scikit-learn tips

Overview

🤖 ⚡ scikit-learn tips

New tips are posted on LinkedIn, Twitter, and Facebook.

👉 Sign up to receive 2 video tips by email every week! 👈

List of all tips

Click to discuss the tip on LinkedIn, click to view the Jupyter notebook for a tip, or click to watch the tip video on YouTube:

# Description Links
1 Use ColumnTransformer to apply different preprocessing to different columns
2 Seven ways to select columns using ColumnTransformer
3 What is the difference between "fit" and "transform"?
4 Use "fit_transform" on training data, but "transform" (only) on testing/new data
5 Four reasons to use scikit-learn (not pandas) for ML preprocessing
6 Encode categorical features using OneHotEncoder or OrdinalEncoder
7 Handle unknown categories with OneHotEncoder by encoding them as zeros
8 Use Pipeline to chain together multiple steps
9 Add a missing indicator to encode "missingness" as a feature
10 Set a "random_state" to make your code reproducible
11 Impute missing values using KNNImputer or IterativeImputer
12 What is the difference between Pipeline and make_pipeline?
13 Examine the intermediate steps in a Pipeline
14 HistGradientBoostingClassifier natively supports missing values
15 Three reasons not to use drop='first' with OneHotEncoder
16 Use cross_val_score and GridSearchCV on a Pipeline
17 Try RandomizedSearchCV if GridSearchCV is taking too long
18 Display GridSearchCV or RandomizedSearchCV results in a DataFrame
19 Important tuning parameters for LogisticRegression
20 Plot a confusion matrix
21 Compare multiple ROC curves in a single plot
22 Use the correct methods for each type of Pipeline
23 Display the intercept and coefficients for a linear model
24 Visualize a decision tree two different ways
25 Prune a decision tree to avoid overfitting
26 Use stratified sampling with train_test_split
27 Two ways to impute missing values for a categorical feature
28 Save a model or Pipeline using joblib
29 Vectorize two text columns in a ColumnTransformer
30 Four ways to examine the steps of a Pipeline
31 Shuffle your dataset when using cross_val_score
32 Use AUC to evaluate multiclass problems
33 Use FunctionTransformer to convert functions into transformers
34 Add feature selection to a Pipeline
35 Don't use .values when passing a pandas object to scikit-learn
36 Most parameters should be passed as keyword arguments
37 Create an interactive diagram of a Pipeline in Jupyter
38 Get the feature names output by a ColumnTransformer
39 Load a toy dataset into a DataFrame
40 Estimators only print parameters that have been changed
41 Drop the first category from binary features (only) with OneHotEncoder
42 Passthrough some columns and drop others in a ColumnTransformer
43 Use OrdinalEncoder instead of OneHotEncoder with tree-based models
44 Speed up GridSearchCV using parallel processing
45 Create feature interactions using PolynomialFeatures
46 Ensemble multiple models using VotingClassifer or VotingRegressor
47 Tune the parameters of a VotingClassifer or VotingRegressor
48 Access part of a Pipeline using slicing
49 Tune multiple models simultaneously with GridSearchCV
50 Adapt this pattern to solve many Machine Learning problems

You can interact with all of these notebooks online using Binder:

Note: Some of the tips do not include any code, and can only be viewed on LinkedIn.

Who creates these tips?

Hi! I'm Kevin Markham, the founder of Data School. I've been teaching data science in Python since 2014. I create these tips because I love using scikit-learn and I want to help others use it more effectively.

How can I get better at scikit-learn?

I teach three courses:

👉 Find out which course is right for you! 👈

Do you have any other tips?

Yes! In 2019, I posted 100 pandas tricks. I also created a video featuring my top 25 pandas tricks.

© 2020-2021 Data School. All rights reserved.

Owner
Kevin Markham
Founder of Data School
Kevin Markham
Quantum Machine Learning

The Machine Learning package simply contains sample datasets at present. It has some classification algorithms such as QSVM and VQC (Variational Quantum Classifier), where this data can be used for e

Qiskit 364 Jan 08, 2023
Causal Inference and Machine Learning in Practice with EconML and CausalML: Industrial Use Cases at Microsoft, TripAdvisor, Uber

Causal Inference and Machine Learning in Practice with EconML and CausalML: Industrial Use Cases at Microsoft, TripAdvisor, Uber

EconML/CausalML KDD 2021 Tutorial 124 Dec 28, 2022
LiuAlgoTrader is a scalable, multi-process ML-ready framework for effective algorithmic trading

LiuAlgoTrader is a scalable, multi-process ML-ready framework for effective algorithmic trading. The framework simplify development, testing, deployment, analysis and training algo trading strategies

Amichay Oren 458 Dec 24, 2022
Uber Open Source 1.6k Dec 31, 2022
This machine-learning algorithm takes in data from the last 60 days and tries to predict tomorrow's price of any crypto you ask it.

Crypto-Currency-Predictor This machine-learning algorithm takes in data from the last 60 days and tries to predict tomorrow's price of any crypto you

Hazim Arafa 6 Dec 04, 2022
Pragmatic AI Labs 421 Dec 31, 2022
An open source framework that provides a simple, universal API for building distributed applications. Ray is packaged with RLlib, a scalable reinforcement learning library, and Tune, a scalable hyperparameter tuning library.

Ray provides a simple, universal API for building distributed applications. Ray is packaged with the following libraries for accelerating machine lear

23.3k Dec 31, 2022
MooGBT is a library for Multi-objective optimization in Gradient Boosted Trees.

MooGBT is a library for Multi-objective optimization in Gradient Boosted Trees. MooGBT optimizes for multiple objectives by defining constraints on sub-objective(s) along with a primary objective. Th

Swiggy 66 Dec 06, 2022
Made in collaboration with Chris George for Art + ML Spring 2019.

Deepdream Eyes Made in collaboration with Chris George for Art + ML Spring 2019.

Francisco Cabrera 1 Jan 12, 2022
This project impelemented for midterm of the Machine Learning #Zoomcamp #Alexey Grigorev

MLProject_01 This project impelemented for midterm of the Machine Learning #Zoomcamp #Alexey Grigorev Context Dataset English question data set file F

Hadi Nakhi 1 Dec 18, 2021
Toolkit for building machine learning models that generalize to unseen domains and are robust to privacy and other attacks.

Toolkit for Building Robust ML models that generalize to unseen domains (RobustDG) Divyat Mahajan, Shruti Tople, Amit Sharma Privacy & Causal Learning

Microsoft 149 Jan 06, 2023
A scikit-learn based module for multi-label et. al. classification

scikit-multilearn scikit-multilearn is a Python module capable of performing multi-label learning tasks. It is built on-top of various scientific Pyth

802 Jan 01, 2023
BASTA: The BAyesian STellar Algorithm

BASTA: BAyesian STellar Algorithm Current stable version: v1.0 Important note: BASTA is developed for Python 3.8, but Python 3.7 should work as well.

BASTA team 16 Nov 15, 2022
Kaggle Tweet Sentiment Extraction Competition: 1st place solution (Dark of the Moon team)

Kaggle Tweet Sentiment Extraction Competition: 1st place solution (Dark of the Moon team)

Artsem Zhyvalkouski 64 Nov 30, 2022
Graphsignal is a machine learning model monitoring platform.

Graphsignal is a machine learning model monitoring platform. It helps ML engineers, MLOps teams and data scientists to quickly address issues with data and models as well as proactively analyze model

Graphsignal 143 Dec 05, 2022
ThunderGBM: Fast GBDTs and Random Forests on GPUs

Documentations | Installation | Parameters | Python (scikit-learn) interface What's new? ThunderGBM won 2019 Best Paper Award from IEEE Transactions o

Xtra Computing Group 648 Dec 16, 2022
Iris-Heroku - Putting a Machine Learning Model into Production with Flask and Heroku

Puesta en Producción de un modelo de aprendizaje automático con Flask y Heroku L

Jesùs Guillen 1 Jun 03, 2022
Transpile trained scikit-learn estimators to C, Java, JavaScript and others.

sklearn-porter Transpile trained scikit-learn estimators to C, Java, JavaScript and others. It's recommended for limited embedded systems and critical

Darius Morawiec 1.2k Jan 05, 2023
pandas, scikit-learn, xgboost and seaborn integration

pandas, scikit-learn and xgboost integration.

299 Dec 30, 2022
Python library which makes it possible to dynamically mask/anonymize data using JSON string or python dict rules in a PySpark environment.

pyspark-anonymizer Python library which makes it possible to dynamically mask/anonymize data using JSON string or python dict rules in a PySpark envir

6 Jun 30, 2022