Data exploration done quick.

Overview

Pandas Tab

Implementation of Stata's tabulate command in Pandas for extremely easy to type one-way and two-way tabulations.

Support:

  • Python 3.7 and 3.8: Pandas >=0.23.x
  • Python 3.9: Pandas >=1.0.x

Background & Purpose

As someone who made the move from Stata to Python, one thing I noticed is that I end up doing fewer tabulations of my data when working in Pandas. I believe that the reason for this has a lot to do with API differences that make it slightly less convenient to run tabulations extremely quickly.

For example, if you want to look at values counts in column "foo", in Stata it's merely tab foo. In Pandas, it's df["foo"].value_counts(). This is over twice the amount of typing.

It's not just a brevity issue. If you want to add one more column and to go from one-way to two-way tabulation (e.g. look at "foo" and "bar" together), this isn't as simple as adding one more column:

  • df[["foo", "bar"]].value_counts().unstack() requires one additional transformation to move away from a multi-indexed series.
  • pd.crosstab(df["foo"], df["bar"]) is a totally different interface from the one-way tabulation.

Pandas Tab attempts to solve these issues by creating an interface more similar to Stata: df.tab("foo") and df.tab("foo", "bar") give you, respectively, your one-way and two-way tabulations.

Example

# using IPython integration:
# ! pip install pandas-tab[full]
# ! pandas_tab init

import pandas as pd

df = pd.DataFrame({
    "foo":  ["a", "a", "b", "a", "b", "c", "a"],
    "bar":  [4,   5,   7,   6,   7,   7,   5],
    "fizz": [12,  63,  23,  36,  21,  28,  42]
})

# One-way tabulation
df.tab("foo")

# Two-way tabulation
df.tab("foo", "bar")

# One-way with aggregation
df.tab("foo", values="fizz", aggfunc=pd.Series.mean)

# Two-way with aggregation
df.tab("foo", "bar", values="fizz", aggfunc=pd.Series.mean)

Outputs:

>> # Two-way tabulation >>> df.tab("foo", "bar") bar 4 5 6 7 foo a 1 2 1 0 b 0 0 0 2 c 0 0 0 1 >>> # One-way with aggregation >>> df.tab("foo", values="fizz", aggfunc=pd.Series.mean) mean foo a 38.25 b 22.00 c 28.00 >>> # Two-way with aggregation >>> df.tab("foo", "bar", values="fizz", aggfunc=pd.Series.mean) bar 4 5 6 7 foo a 12.0 52.5 36.0 NaN b NaN NaN NaN 22.0 c NaN NaN NaN 28.0 ">
>>> # One-way tabulation
>>> df.tab("foo")

     size  percent
foo               
a       4    57.14
b       2    28.57
c       1    14.29

>>> # Two-way tabulation
>>> df.tab("foo", "bar")

bar  4  5  6  7
foo            
a    1  2  1  0
b    0  0  0  2
c    0  0  0  1

>>> # One-way with aggregation
>>> df.tab("foo", values="fizz", aggfunc=pd.Series.mean)

      mean
foo       
a    38.25
b    22.00
c    28.00

>>> # Two-way with aggregation
>>> df.tab("foo", "bar", values="fizz", aggfunc=pd.Series.mean)

bar     4     5     6     7
foo                        
a    12.0  52.5  36.0   NaN
b     NaN   NaN   NaN  22.0
c     NaN   NaN   NaN  28.0

Setup

Full Installation (IPython / Jupyter Integration)

The full installation includes a CLI that adds a startup script to IPython:

pip install pandas-tab[full]

Then, to enable the IPython / Jupyter startup script:

pandas_tab init

You can quickly remove the startup script as well:

pandas_tab delete

More on the startup script in the section IPython / Jupyter Integration.

Simple installation:

If you don't want the startup script, you don't need the extra dependencies. Simply install with:

pip install pandas-tab

IPython / Jupyter Integration

The startup script auto-loads pandas_tab each time you load up a new IPython kernel (i.e. each time you fire up or restart your Jupyter Notebook).

You can run the startup script in your terminal with pandas_tab init.

Without the startup script:

# WITHOUT STARTUP SCRIPT
import pandas as pd
import pandas_tab

df = pd.read_csv("foo.csv")
df.tab("x", "y")

Once you install the startup script, you don't need to do import pandas_tab:

# WITH PANDAS_TAB STARTUP SCRIPT INSTALLED
import pandas as pd

df = pd.read_csv("foo.csv")
df.tab("x", "y")

The IPython startup script is convenient, but there are some downsides to using and relying on it:

  • It needs to load Pandas in the background each time the kernel starts up. For typical data science workflows, this should not be a problem, but you may not want this if your workflows ever avoid Pandas.
  • The IPython integration relies on hidden state that is environment-dependent. People collaborating with you may be unable to replicate your Jupyter notebooks if there are any df.tab()'s in there and you don't import pandas_tab manually.

For that reason, I recommend the IPython integration for solo exploratory analysis, but for collaboration you should still import pandas_tab in your notebook.

Limitations / Known Issues

  • No tests or guarantees for 3+ way cross tabulations. Both pd.crosstab and pd.Series.value_counts support multi-indexing, however this behavior is not yet tested for pandas_tab.
  • Behavior for dropna kwarg mimics pd.crosstab (drops blank columns), not pd.value_counts (include NaN/None in the index), even for one-way tabulations.
  • No automatic hook into Pandas; you must import pandas_tab in your code to register the extensions. Pandas does not currently search entry points for extensions, other than for plotting backends, so it's not clear that there's a clean way around this.
  • Does not mimic Stata's behavior of taking unambiguous abbreviations of column names, and there is no option to turn this on/off.
  • Pandas 0.x is incompatible with Numpy 1.20.x. If using Pandas 0.x, you need Numpy 1.19.x.
  • (Add more stuff here?)
Owner
W.D.
memes
W.D.
Generates a simple report about the current Covid-19 cases and deaths in Malaysia

Generates a simple report about the current Covid-19 cases and deaths in Malaysia. Results are delay one day, data provided by the Ministry of Health Malaysia Covid-19 public data.

Yap Khai Chuen 7 Dec 15, 2022
Create HTML profiling reports from pandas DataFrame objects

Pandas Profiling Documentation | Slack | Stack Overflow Generates profile reports from a pandas DataFrame. The pandas df.describe() function is great

10k Jan 01, 2023
A Big Data ETL project in PySpark on the historical NYC Taxi Rides data

Processing NYC Taxi Data using PySpark ETL pipeline Description This is an project to extract, transform, and load large amount of data from NYC Taxi

Unnikrishnan 2 Dec 12, 2021
A neural-based binary analysis tool

A neural-based binary analysis tool Introduction This directory contains the demo of a neural-based binary analysis tool. We test the framework using

Facebook Research 208 Dec 22, 2022
scikit-survival is a Python module for survival analysis built on top of scikit-learn.

scikit-survival scikit-survival is a Python module for survival analysis built on top of scikit-learn. It allows doing survival analysis while utilizi

Sebastian Pölsterl 876 Jan 04, 2023
An easy-to-use feature store

A feature store is a data storage system for data science and machine-learning. It can store raw data and also transformed features, which can be fed straight into an ML model or training script.

ByteHub AI 48 Dec 09, 2022
vartests is a Python library to perform some statistic tests to evaluate Value at Risk (VaR) Models

gg I wasn't satisfied with any of the other available Gemini clients, so I wrote my own. Requires Python 3.9 (maybe older, I haven't checked) and opti

RAFAEL RODRIGUES 5 Jan 03, 2023
Analysis of a dataset of 10000 passwords to find common trends and mistakes people generally make while setting up a password.

Analysis of a dataset of 10000 passwords to find common trends and mistakes people generally make while setting up a password.

Aryan Raj 7 Sep 04, 2022
A real data analysis and modeling project - restaurant inspections

A real data analysis and modeling project - restaurant inspections Jafar Pourbemany 9/27/2021 This project represents data analysis and modeling of re

Jafar Pourbemany 2 Aug 21, 2022
A Python package for Bayesian forecasting with object-oriented design and probabilistic models under the hood.

Disclaimer This project is stable and being incubated for long-term support. It may contain new experimental code, for which APIs are subject to chang

Uber Open Source 1.6k Dec 29, 2022
General Assembly's 2015 Data Science course in Washington, DC

DAT8 Course Repository Course materials for General Assembly's Data Science course in Washington, DC (8/18/15 - 10/29/15). Instructor: Kevin Markham (

Kevin Markham 1.6k Jan 07, 2023
Improving your data science workflows with

Make Better Defaults Author: Kjell Wooding [email protected] This is the git re

Kjell Wooding 18 Dec 23, 2022
DaDRA (day-druh) is a Python library for Data-Driven Reachability Analysis.

DaDRA (day-druh) is a Python library for Data-Driven Reachability Analysis. The main goal of the package is to accelerate the process of computing estimates of forward reachable sets for nonlinear dy

2 Nov 08, 2021
The Dash Enterprise App Gallery "Oil & Gas Wells" example

This app is based on the Dash Enterprise App Gallery "Oil & Gas Wells" example. For more information and more apps see: Dash App Gallery See the Dash

Austin Caudill 1 Nov 08, 2021
Python implementation of Principal Component Analysis

Principal Component Analysis Principal Component Analysis (PCA) is a dimension-reduction algorithm. The idea is to use the singular value decompositio

Ignacio Darago 1 Nov 06, 2021
Time ranges with python

timeranges Time ranges. Read the Docs Installation pip timeranges is available on pip: pip install timeranges GitHub You can also install the latest v

Micael Jarniac 2 Sep 01, 2022
💬 Python scripts to parse Messenger, Hangouts, WhatsApp and Telegram chat logs into DataFrames.

Chatistics Python 3 scripts to convert chat logs from various messaging platforms into Pandas DataFrames. Can also generate histograms and word clouds

Florian 893 Jan 02, 2023
Sensitivity Analysis Library in Python (Numpy). Contains Sobol, Morris, Fractional Factorial and FAST methods.

Sensitivity Analysis Library (SALib) Python implementations of commonly used sensitivity analysis methods. Useful in systems modeling to calculate the

SALib 663 Jan 05, 2023
Kennedy Institute of Rheumatology University of Oxford Project November 2019

TradingBot6M Kennedy Institute of Rheumatology University of Oxford Project November 2019 Run Change api.txt to binance api key: https://www.binance.c

Kannan SAR 2 Nov 16, 2021
A project consists in a set of assignements corresponding to a BI process: data integration, construction of an OLAP cube, qurying of a OPLAP cube and reporting.

TennisBusinessIntelligenceProject - A project consists in a set of assignements corresponding to a BI process: data integration, construction of an OLAP cube, qurying of a OPLAP cube and reporting.

carlo paladino 1 Jan 02, 2022