An extension to pandas dataframes describe function.

Overview

pandas_summary

An extension to pandas dataframes describe function.

The module contains DataFrameSummary object that extend describe() with:

  • properties
    • dfs.columns_stats: counts, uniques, missing, missing_perc, and type per column
    • dsf.columns_types: a count of the types of columns
    • dfs[column]: more in depth summary of the column
  • function
    • summary(): extends the describe() function with the values with columns_stats

Installation

The module can be easily installed with pip:

> pip install pandas-summary

This module depends on numpy and pandas. Optionally you can get also some nice visualisations if you have matplotlib installed.

Tests

To run the tests, execute the command python setup.py test

Usage

The module contains one class:

DataFrameSummary

The DataFrameSummary expect a pandas DataFrame to summarise.

from pandas_summary import DataFrameSummary

dfs = DataFrameSummary(df)

getting the columns types

dfs.columns_types


numeric     9
bool        3
categorical 2
unique      1
date        1
constant    1
dtype: int64

getting the columns stats

dfs.columns_stats


                      A            B        C              D              E 
counts             5802         5794     5781           5781           4617   
uniques            5802            3     5771            128            121   
missing               0            8       21             21           1185   
missing_perc         0%        0.14%    0.36%          0.36%         20.42%   
types            unique  categorical  numeric        numeric        numeric 

getting a single column summary, e.g. numerical column

# we can also access the column using numbers A[1]
dfs['A']

std                                                                 0.2827146
max                                                                  1.072792
min                                                                         0
variance                                                           0.07992753
mean                                                                0.5548516
5%                                                                  0.1603367
25%                                                                 0.3199776
50%                                                                 0.4968588
75%                                                                 0.8274732
95%                                                                  1.011255
iqr                                                                 0.5074956
kurtosis                                                            -1.208469
skewness                                                            0.2679559
sum                                                                  3207.597
mad                                                                 0.2459508
cv                                                                  0.5095319
zeros_num                                                                  11
zeros_perc                                                               0,1%
deviating_of_mean                                                          21
deviating_of_mean_perc                                                  0.36%
deviating_of_median                                                        21
deviating_of_median_perc                                                0.36%
top_correlations                         {u'D': 0.702240243124, u'E': -0.663}
counts                                                                   5781
uniques                                                                  5771
missing                                                                    21
missing_perc                                                            0.36%
types                                                                 numeric
Name: A, dtype: object

Future development

Summary analysis between columns, i.e. dfs[[1, 2]]

Owner
Mourad
engineer, startup enthusiast, philosophy and music lover, coffeeholic... and more
Mourad
PyNHD is a part of HyRiver software stack that is designed to aid in watershed analysis through web services.

A part of HyRiver software stack that provides access to NHD+ V2 data through NLDI and WaterData web services

Taher Chegini 23 Dec 14, 2022
Data collection, enhancement, and metrics calculation.

l3_data_collection Data collection, enhancement, and metrics calculation. Summary Repository containing code for QuantDAO's JDT data collection task.

Ruiwyn 3 Dec 23, 2022
Statistical Rethinking course winter 2022

Statistical Rethinking (2022 Edition) Instructor: Richard McElreath Lectures: Uploaded Playlist and pre-recorded, two per week Discussion: Online, F

Richard McElreath 3.9k Dec 31, 2022
track your GitHub statistics

GitHub-Stalker track your github statistics 👀 features find new followers or unfollowers find who got a star on your project or remove stars find who

Bahadır Araz 34 Nov 18, 2022
API>local_db>AWS_RDS - Disclaimer! All data used is for educational purposes only.

APIlocal_dbAWS_RDS Disclaimer! All data used is for educational purposes only. ETL pipeline diagram. Aim of project By creating a fully working pipe

0 Apr 25, 2022
This is an analysis and prediction project for house prices in King County, USA based on certain features of the house

This is a project for analysis and estimation of House Prices in King County USA The .csv file contains the data of the house and the .ipynb file con

Amit Prakash 1 Jan 21, 2022
Evaluation of a Monocular Eye Tracking Set-Up

Evaluation of a Monocular Eye Tracking Set-Up As part of my master thesis, I implemented a new state-of-the-art model that is based on the work of Che

Pascal 19 Dec 17, 2022
Finding project directories in Python (data science) projects, just like there R rprojroot and here packages

Find relative paths from a project root directory Finding project directories in Python (data science) projects, just like there R here and rprojroot

Daniel Chen 102 Nov 16, 2022
Picka: A Python module for data generation and randomization.

Picka: A Python module for data generation and randomization. Author: Anthony Long Version: 1.0.1 - Fixed the broken image stuff. Whoops What is Picka

Anthony 108 Nov 30, 2021
A Python package for modular causal inference analysis and model evaluations

Causal Inference 360 A Python package for inferring causal effects from observational data. Description Causal inference analysis enables estimating t

International Business Machines 506 Dec 19, 2022
CINECA molecular dynamics tutorial set

High Performance Molecular Dynamics Logging into CINECA's computer systems To logon to the M100 system use the following command from an SSH client ss

J. W. Dell 0 Mar 13, 2022
Analyse the limit order book in seconds. Zoom to tick level or get yourself an overview of the trading day.

Analyse the limit order book in seconds. Zoom to tick level or get yourself an overview of the trading day. Correlate the market activity with the Apple Keynote presentations.

2 Jan 04, 2022
signac-flow - manage workflows with signac

signac-flow - manage workflows with signac The signac framework helps users manage and scale file-based workflows, facilitating data reuse, sharing, a

Glotzer Group 44 Oct 14, 2022
An ETL Pipeline of a large data set from a fictitious music streaming service named Sparkify.

An ETL Pipeline of a large data set from a fictitious music streaming service named Sparkify. The ETL process flows from AWS's S3 into staging tables in AWS Redshift.

1 Feb 11, 2022
BinTuner is a cost-efficient auto-tuning framework, which can deliver a near-optimal binary code that reveals much more differences than -Ox settings.

BinTuner is a cost-efficient auto-tuning framework, which can deliver a near-optimal binary code that reveals much more differences than -Ox settings. it also can assist the binary code analysis rese

BinTuner 42 Dec 16, 2022
An Indexer that works out-of-the-box when you have less than 100K stored Documents

U100KIndexer An Indexer that works out-of-the-box when you have less than 100K stored Documents. U100K means under 100K. At 100K stored Documents with

Jina AI 7 Mar 15, 2022
Integrate bus data from a variety of sources (batch processing and real time processing).

Purpose: This is integrate bus data from a variety of sources such as: csv, json api, sensor data ... into Relational Database (batch processing and r

1 Nov 25, 2021
Hatchet is a Python-based library that allows Pandas dataframes to be indexed by structured tree and graph data.

Hatchet Hatchet is a Python-based library that allows Pandas dataframes to be indexed by structured tree and graph data. It is intended for analyzing

Lawrence Livermore National Laboratory 14 Aug 19, 2022
Create HTML profiling reports from pandas DataFrame objects

Pandas Profiling Documentation | Slack | Stack Overflow Generates profile reports from a pandas DataFrame. The pandas df.describe() function is great

10k Jan 01, 2023
A project consists in a set of assignements corresponding to a BI process: data integration, construction of an OLAP cube, qurying of a OPLAP cube and reporting.

TennisBusinessIntelligenceProject - A project consists in a set of assignements corresponding to a BI process: data integration, construction of an OLAP cube, qurying of a OPLAP cube and reporting.

carlo paladino 1 Jan 02, 2022