Describing statistical models in Python using symbolic formulas

Overview

Patsy is a Python library for describing statistical models (especially linear models, or models that have a linear component) and building design matrices. Patsy brings the convenience of R "formulas" to Python.
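
As a minimal sketch of what this looks like in practice (the variable names and data below are made up for illustration), a formula string and a dict of columns are enough to build the outcome and design matrices:

```python
import numpy as np
from patsy import dmatrices

# Hypothetical toy data: a numeric outcome, a numeric predictor,
# and a categorical predictor.
data = {
    "y": [1.0, 2.0, 3.0, 4.0],
    "x": [0.5, 1.5, 2.5, 3.5],
    "g": ["a", "b", "a", "b"],
}

# The left-hand side of "~" becomes the outcome matrix,
# the right-hand side becomes the design matrix.
y, X = dmatrices("y ~ x + g", data)

print(X.design_info.column_names)  # an intercept, a dummy column for g, and x
print(np.asarray(X).shape)
```

The categorical column `g` is expanded into a dummy column automatically, and an intercept is added unless the formula removes it.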

Documentation:
https://patsy.readthedocs.io/
Downloads:
http://pypi.python.org/pypi/patsy/
Dependencies:
  • Python (2.6, 2.7, or 3.3+)
  • six
  • numpy
Optional dependencies:
  • nose: needed to run tests
  • scipy: needed for spline-related functions like bs
Install:
pip install patsy (or, for traditionalists: python setup.py install)
Code and bug tracker:
https://github.com/pydata/patsy
Mailing list:
License:
2-clause BSD, see LICENSE.txt for details.
Comments
  • Capture vars not env

    This is the current state of my work to capture only the required variables for patsy formulas instead of the whole environment.

    Current status:

    • [x] Add new methods to EvalEnvironment to create an environment that contains only a subset of variables.
    • [x] Refactor EvalFactor (and update all users) to move the eval_env parameter out of its initializer method and into the state.
    • [x] Additions to doc/changes.rst (esp. to document the incompatible changes!)
    • [x] Do a quick pass through the docs to catch anything that's now out-of-date.
    • [x] Add "end-to-end" test or two -- something that tests the user-visible behaviour directly, like running a formula through dmatrix, then modifying the original environment, and then running some more data through the same builder (like we were doing predictions) and checking that it is unaffected by our changes to the env. (test_highlevel.py is where the other tests like this go.)
    • [x] Create right subset EvalEnvironment and stash it into the state dict for EvalFactor.
    • [x] ~~Figure out what the right thing to do is when variables are shadowed between the EvalFactor's environment and the data.~~ (Can be done together with #13 instead. Not a blocker for this.)
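
The end-to-end check described in the list above could look roughly like the sketch below. It is not the test from the pull request: `transform` and the data are hypothetical, and it assumes a patsy release that already includes this change.

```python
import numpy as np
from patsy import dmatrix, build_design_matrices


def transform(x):
    # Hypothetical user-defined function referenced from the formula.
    return x * 2


train = {"x": np.array([1.0, 2.0, 3.0])}
design = dmatrix("transform(x) - 1", train)


def transform(x):  # rebind the name after the design matrix was built
    return x * 100


# Reuse the stored design information on new data, as in prediction.
(pred,) = build_design_matrices(
    [design.design_info], {"x": np.array([10.0])}
)

# If only the needed variables were captured at build time,
# the original transform is still the one being applied.
print(np.asarray(pred))
```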

    When this is complete, it will fix bug #25.

    opened by chrish42 33
  • Fix: Add compat code for pd.Categorical in pandas>=0.15

    pandas renamed pd.Categorical.labels to pd.Categorical.codes. It's also now possible to have Categoricals as blocks, so Series can contain Categoricals.
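
A compatibility shim for the rename might look like the sketch below. `categorical_codes` is a hypothetical helper name, not patsy's actual function, and the two stub classes stand in for old and new pandas Categoricals so that the example runs without pandas.

```python
def categorical_codes(cat):
    """Return the integer codes of a Categorical-like object."""
    # pandas >= 0.15 exposes .codes; older versions only had .labels.
    if hasattr(cat, "codes"):
        return cat.codes
    return cat.labels


class OldStyleCategorical:
    # Stand-in for a Categorical from pandas < 0.15.
    labels = [0, 1, 0]


class NewStyleCategorical:
    # Stand-in for a Categorical from pandas >= 0.15.
    codes = [0, 1, 0]


print(categorical_codes(OldStyleCategorical()))
print(categorical_codes(NewStyleCategorical()))
```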

    opened by jankatins 28
  • Mixed effects formulas

    Dear Nathaniel,

    What do you think it would take to implement support for mixed effects formulas (parsing and design matrix construction) as they are used in R's lme4? In terms of effort, does it appear to you like it could be easily achieved by composing/reusing existing patsy features, or is it way more involved?

    moved to formulaic 
    opened by andportnoy 20
  • MAINT: Reorder imports to avoid deprecation warning importing Mapping from collections.

    Line 13 (from collections import Mapping) in constraint.py gives the following warning on Python versions > 3.3:

    DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.9 it will stop working

    A change on line 13 in constraint.py from:

    from collections import Mapping

    to:

    from collections.abc import Mapping

    should fix this and make sure that patsy also works with Python 3.9.
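
If Python 2 still has to be supported, a guarded import is one common way to make the same change. This is a sketch of that pattern, not necessarily what was merged:

```python
try:
    # Python 3.3 and later: the ABCs live in collections.abc.
    from collections.abc import Mapping
except ImportError:
    # Python 2 fallback, where collections.abc does not exist.
    from collections import Mapping

# Plain dicts are registered as Mappings under either import.
print(isinstance({}, Mapping))
```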

    enhancement 
    opened by jeroenbouwman 13
  • How to do a polynomial?

    I've seen https://groups.google.com/forum/#!topic/pystatsmodels/96cMRgFXBaA, but why doesn't something like this work?

    from patsy import dmatrices
    data = dict(y=range(1,11), x1=range(21,31), x2=range(11,21))
    
    dmatrices("y ~ x1 + x2**2", data)
    

    or

    dmatrices("y ~ x1 + I(x2**2)", data)
    

    This works:

    dmatrices("y ~ x1 + np.power(x2, 2)", data)
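
One possible explanation, offered as an assumption rather than a confirmed diagnosis: in a patsy formula, `**` is a formula operator rather than arithmetic, and inside `I(...)` the expression is evaluated by Python on the raw data, where a `range` object does not support `**`. Under that assumption, passing NumPy arrays lets `I(x2**2)` work:

```python
import numpy as np
from patsy import dmatrices

# Same values as in the question, but stored as NumPy arrays
# so that x2**2 is valid element-wise arithmetic.
data = {
    "y": np.arange(1, 11),
    "x1": np.arange(21, 31),
    "x2": np.arange(11, 21),
}

y, X = dmatrices("y ~ x1 + I(x2**2)", data)

print(X.design_info.column_names)
print(np.asarray(X)[:3])
```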
    
    opened by jseabold 13
  • Python. Titanic Data Error

    I am trying out some machine learning algorithms to predict which of the people aboard the Titanic survived. I am following this example: https://github.com/mlakhavani/titanic/blob/master/TitanixFinal.ipynb

    However, on running

    from patsy import dmatrices

    y, x = dmatrices('survived ~ sex + age + sibsp + parch + pclass + fare + C + Q + S + Col + Dr + Master + Miss + Mr + Mrs + Rev',
                     titanic_train, return_type="dataframe")
    
    y_test, x_test = dmatrices('survived ~ sex + age + sibsp + parch + pclass + fare + + C + Q + S + Col + Dr + Master + Miss + Mr + Mrs + Rev',
                               titanic_test, return_type="dataframe")
    I get this error:
    PatsyError                                Traceback (most recent call last)
    <ipython-input-153-63b2f538454b> in <module>()
          1 y_test, x_test = dmatrices('Survived ~ Sex + Age + SibSp + Parch + Pclass + Fare + + C + Q + S ++ Dr + Master + Miss + Mr + Mrs + Rev',
    ----> 2                            titanic_test, return_type="dataframe")
    
    C:\Anaconda\lib\site-packages\patsy\highlevel.pyc in dmatrices(formula_like, data, eval_env, NA_action, return_type)
        295     eval_env = EvalEnvironment.capture(eval_env, reference=1)
        296     (lhs, rhs) = _do_highlevel_design(formula_like, data, eval_env,
    --> 297                                       NA_action, return_type)
        298     if lhs.shape[1] == 0:
        299         raise PatsyError("model is missing required outcome variables")
    
    C:\Anaconda\lib\site-packages\patsy\highlevel.pyc in _do_highlevel_design(formula_like, data, eval_env, NA_action, return_type)
        154         return build_design_matrices(builders, data,
        155                                      NA_action=NA_action,
    --> 156                                      return_type=return_type)
        157     else:
        158         # No builders, but maybe we can still get matrices
    
    C:\Anaconda\lib\site-packages\patsy\build.pyc in build_design_matrices(builders, data, NA_action, return_type, dtype)
        945         for evaluator in builder._evaluators:
        946             if evaluator not in evaluator_to_values:
    --> 947                 value, is_NA = evaluator.eval(data, NA_action)
        948                 evaluator_to_isNAs[evaluator] = is_NA
        949                 # value may now be a Series, DataFrame, or ndarray
    
    C:\Anaconda\lib\site-packages\patsy\build.pyc in eval(self, data, NA_action)
        161         result = self.factor.eval(self._state, data)
        162         result = categorical_to_int(result, self._levels, NA_action,
    --> 163                                     origin=self.factor)
        164         assert result.ndim == 1
        165         return result, np.asarray(result == -1)
    
    C:\Anaconda\lib\site-packages\patsy\categorical.pyc in categorical_to_int(data, levels, NA_action, origin)
        270     if hasattr(data, "shape") and len(data.shape) > 1:
        271         raise PatsyError("categorical data must be 1-dimensional",
    --> 272                          origin)
        273     if (not iterable(data)
        274         or isinstance(data, (six.text_type, six.binary_type))):
    
    PatsyError: categorical data must be 1-dimensional
        Survived ~ Sex + Age + SibSp + Parch + Pclass + Fare + + C + Q + S ++ Dr + Master + Miss + Mr + Mrs + Rev
    
    

    How can I solve this issue?

    opened by pintolx 12
  • 0.2.2 or 0.3(?) release please

    The echo of the API change in IPython 2.0 has finally reached Debian (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=751045), so I am wondering whether I should patch, take a snapshot of current master, or just wait for a new release.

    opened by yarikoptic 11
  • WIP: parse . in formulas

    Adds basic support for . in formulas like '~ .' to the highlevel interface, to indicate all otherwise unused variables.

    It does not support embedding . in arbitrary python strings like '~ np.log(.)'. I suppose that would be nice, in theory, but it would require rebuilding Python expressions, instead of just using patsy's formula language.

    Let me know what you think. It definitely needs more tests, documentation and exploration of the various edge cases.

    Apologies for all the extra lines in the pull request -- my editor automatically trims trailing whitespace when saving a file.

    CC: #10

    opened by shoyer 11
  • patsy folder licenses

    Hello,

    Regarding patsy 0.4.1:

    Can you please specify whether all files with Copyright Nathaniel Smith are under Python 2? LICENSE.txt mentions only the module python.compat, which I presume refers to python/compat.py.

    Can you please specify which license applies to mgcv_cubic_splines.py? Since this is the only file where a copyright other than Nathaniel Smith's appears, it is not clear which license it falls under. Is it the BSD 2-Clause or Python 2?

    Thanks, Silviu

    opened by Silviu-Caprar 7
  • Pickling

    Took a stab at implementing pickling for EvalFactor. This is mainly to show the approach. I think the verbose error message might be overkill but it just shows the flexibility of this pattern.

    opened by louispotok 7
  • two fatal errors during test

    system versions:

    Python 2.7.6
    >>> patsy.__version__
    '0.3.0'
    >>> numpy.__version__
    '1.8.1'
    >>> pandas.__version__
    '0.16.1'
    

    test results:

    .............py276/lib/python2.7/site-packages/pandas/core/categorical.py:472: FutureWarning: Accessing 'levels' is deprecated, use 'categories'
    warn("Accessing 'levels' is deprecated, use 'categories'", FutureWarning)
    py276/lib/python2.7/site-packages/pandas/core/categorical.py:420: FutureWarning: 'labels' is deprecated. Use 'codes' instead
    warnings.warn("'labels' is deprecated. Use 'codes' instead", FutureWarning)
    ...FF.................
    ======================================================================
    FAIL: patsy.test_highlevel.test_formula_likes
    ----------------------------------------------------------------------
    Traceback (most recent call last):
    File "py276/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
        self.test(*self.arg)
    File "py276/lib/python2.7/site-packages/patsy/test_highlevel.py", line 202, in test_formula_likes
        [[1], [2], [3]], ["x"])
    File "py276/lib/python2.7/site-packages/patsy/test_highlevel.py", line 104, in t
        expected_lhs_values, expected_lhs_names)
    File "py276/lib/python2.7/site-packages/patsy/test_highlevel.py", line 32, in check_result
        assert rhs.design_info.column_names == expected_rhs_names
    AssertionError
    
    ======================================================================
    FAIL: patsy.test_highlevel.test_return_pandas
    ----------------------------------------------------------------------
    Traceback (most recent call last):
    File "py276/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
        self.test(*self.arg)
    File "py276/lib/python2.7/site-packages/patsy/test_highlevel.py", line 348, in test_return_pandas
        assert np.array_equal(df4.columns, ["AA"])
    AssertionError
    
    ----------------------------------------------------------------------
    Ran 35 tests in 65.391s
    
    FAILED (failures=2)
    
    opened by braindevices 7
  • Does python have an analogue to R's splines::ns()

    See my stackoverflow question here https://stackoverflow.com/questions/71550468/does-python-have-an-analogue-to-rs-splinesns

    I am thinking that there is currently no way to make this happen in patsy.

    I am requesting a feature for a ns function much like the cr function, but where ns matches the behavior of R's splines::ns

    opened by alexhallam 0
  • Wrote a convenience function for getting variable names from formula

    I am using patsy as a key dependency in a stats project, and I found myself needing to identify which variables are categorical after constructing a dataframe using patsy formulas.

    After an attempt using regexps ("...now you have two problems..."), I read Model specification for experts and computers a few times, and spent a lot of time poking around in X.design_info (where y, X=dmatrices(formula, data, return_type='dataframe')). Thankfully I ended up with something much shorter and more robust than my regexps attempt.

    I have two questions:

    1. I'm still not sure if I've used the internal details of X.design_info correctly -- it does what I want, but there are places where multiple things provide the same info. I'd love to have someone "in the know" look at the function and tell me if I should make a different choice. Is there a way to do this? (Counting comments, the function is ~60 lines; not counting comments, it is about 30 lines.)

    2. Is there any interest in having something like this contributed back to the project? I've commented and unit tested the function already, and happy to make sure final comments/tests conform to your norms & standards. I skimmed the issues before posting, and for example it appears this issue #155: patsy equivalent of R's all.vars might benefit from my function (not exactly the same but perhaps close enough).
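
The poster's function is not shown here. A much smaller sketch of the same idea, assuming the `factor_infos` attribute that `DesignInfo` exposes in recent patsy versions and using made-up data, is:

```python
from patsy import dmatrix

# Hypothetical toy data with one numeric and one categorical column.
data = {
    "x": [1.0, 2.0, 3.0, 4.0],
    "g": ["a", "b", "a", "b"],
}
X = dmatrix("x + g", data)

# factor_infos maps each factor in the formula to a FactorInfo
# whose .type is either "numerical" or "categorical".
categorical = sorted(
    factor.name()
    for factor, info in X.design_info.factor_infos.items()
    if info.type == "categorical"
)
print(categorical)
```

Note that factor names are the expressions as written in the formula, so a term such as `C(g)` would be reported as `C(g)` rather than as the bare column name.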

    opened by compumetrika 1
  • When passing DataFrame, dmatrices returns design matrix with zero rows using standardize in formula

    When passing a pandas.DataFrame, dmatrices is returning a design matrix with no rows in at least two cases I could find.

    Here are some minimal examples.

    Case 1: All column values are the same

    import pandas as pd
    from patsy import dmatrices
    
    df = pd.DataFrame({'a': [1, 1, 1], 'b': [0, 1, 0]})
    formula = 'b ~ standardize(a)'
    dmatrices(formula, data=df)
    

    gives

    DesignMatrix with shape (0, 2)
      Intercept  standardize(a)
      Terms:
        'Intercept' (column 0)
        'standardize(a)' (column 1)
    

    Case 2: Column values are different but contain np.nan

    import pandas as pd
    import numpy as np
    from patsy import dmatrices
    
    df = pd.DataFrame({'a': [2, 3, np.nan], 'b': [0, 1, 0]})
    formula = 'b ~ standardize(a)'
    dmatrices(formula, data=df)
    

    gives the same

    DesignMatrix with shape (0, 2)
      Intercept  standardize(a)
      Terms:
        'Intercept' (column 0)
        'standardize(a)' (column 1)
    

    patsy version is the latest on conda, 0.5.1
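
A plausible reading of both cases, not confirmed in this thread: centering and scaling turn every entry of the column into NaN, and patsy's default handling of missing values then drops every row. The NumPy sketch below reproduces only the arithmetic, under the assumption that `standardize` amounts to subtracting the mean and dividing by the standard deviation:

```python
import numpy as np

# Case 1: a constant column has zero standard deviation, so 0 / 0 gives NaN.
a = np.array([1.0, 1.0, 1.0])
with np.errstate(invalid="ignore", divide="ignore"):
    z_constant = (a - a.mean()) / a.std()

# Case 2: a single NaN makes the mean NaN, which spreads to every entry.
b = np.array([2.0, 3.0, np.nan])
z_with_nan = (b - b.mean()) / b.std()

print(np.isnan(z_constant).all())
print(np.isnan(z_with_nan).all())
```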

    opened by rmwenzel 0
  • Is there a way to get dmatrix to drop all-zero columns?

    I have an experiment design that does not include all combinations of its categorical variables, and ran into some difficulties getting a full-rank design matrix for statsmodels. I included a simplified version below.

    import numpy as np
    import numpy.linalg as la
    import pandas as pd
    import patsy
    
    index_vals = tuple("abc")
    level_names = list("ABD")
    n_samples = 2
    
    
    def describe_design_matrix(design_matrix):
        print("Shape:", design_matrix.shape)
        print("Rank: ", la.matrix_rank(design_matrix))
        print(
            "Approximate condition number: {0:.2g}".format(
                np.divide(*la.svd(design_matrix)[1][[0, -1]])
            )
        )
    
    
    ds_simple = pd.DataFrame(
        index=pd.MultiIndex.from_product(
            [index_vals] * len(level_names) + [range(n_samples)],
            names=level_names + ["sample"],
        ),
        columns=["y"],
        data=np.random.randn(len(index_vals) ** len(level_names) * n_samples),
    ).reset_index()
    
    print("All sampled")
    simple_X = patsy.dmatrices("y ~ (A + B + D) ** 3", ds_simple)[1]
    describe_design_matrix(simple_X)
    
    print("Only some sampled")
    simple_X = patsy.dmatrices(
        "y ~ (A + B + D) ** 3", ds_simple.query("A != 'a' or B == 'a'")
    )[1]
    describe_design_matrix(simple_X)
    
    print("Reduced X")
    simple_X = patsy.dmatrices(
        "y ~ (A + B + D) ** 3",
        ds_simple.query("A != 'a' or B == 'a'"),
        return_type="dataframe",
    )[1]
    reduced_X = simple_X.loc[
        :, [col for col in simple_X.columns if not col.startswith("A[T.b]:B")]
    ]
    describe_design_matrix(reduced_X)
    
    print("Only some sampled: alternate method")
    simple_X = patsy.dmatrices(
        "y ~ (C(A, Treatment('b')) + B + D) ** 3", ds_simple.query("A != 'a' or B == 'a'")
    )[1]
    describe_design_matrix(simple_X)
    print("Number of nonzero elements:", (simple_X != 0).sum(axis=0))
    print("Number of all-zero columns:", np.count_nonzero((simple_X != 0).sum(axis=0) == 0))
    
    print("Reduced X: alternate method")
    simple_X = patsy.dmatrices(
        "y ~ (C(A, Treatment('b')) + B + D) ** 3",
        ds_simple.query("A != 'a' or B == 'a'"),
        return_type="dataframe",
    )[1]
    reduced_X = simple_X.loc[
        :,
        [
            col
            for col in simple_X.columns
            if not col.startswith("C(A, Treatment('b'))[T.a]:B")
        ],
    ]
    describe_design_matrix(reduced_X)
    

    produces as output

    All sampled
    Shape: (54, 27)
    Rank:  27
    Approximate condition number: 52
    Only some sampled
    Shape: (42, 27)
    Rank:  21
    Approximate condition number: 3.8e+16
    Reduced X
    Shape: (42, 21)
    Rank:  21
    Approximate condition number: 37
    Only some sampled: alternate method
    Shape: (42, 27)
    Rank:  21
    Approximate condition number: 3.4e+16
    Number of nonzero elements: [42  6 18 12 12 14 14  0  6  0  6  2  6  2  6  4  4  4  4  0  2  0  2  0
      2  0  2]
    Number of all-zero columns: 6
    Reduced X: alternate method
    Shape: (42, 21)
    Rank:  21
    Approximate condition number: 39
    

    I don't mind spending the time to find the representation that produces all-zero columns, but there doesn't seem to be a way within patsy to say "I know some of these columns are going to be all zeros" or "These columns will be linearly dependent on others". Since some statsmodels functions require the formula information from patsy.DesignInfo objects, I wanted to see what could be done within patsy.

    matthewwardrop/formulaic#19 is a related issue, with some discussion of how to generalize the "Reduced X" method in the script.
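
Outside patsy, the "Reduced X" step can be generalized without matching column names. The sketch below uses plain NumPy; `drop_all_zero_columns` is a hypothetical helper, and it returns a bare array, so the `DesignInfo` metadata mentioned above is not preserved.

```python
import numpy as np


def drop_all_zero_columns(matrix):
    """Return the matrix without all-zero columns, plus the keep mask."""
    matrix = np.asarray(matrix)
    # A column is kept if at least one of its entries is nonzero.
    keep = (matrix != 0).any(axis=0)
    return matrix[:, keep], keep


# Small made-up matrix whose middle column is entirely zero.
example = np.array([
    [1.0, 0.0, 2.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 3.0],
])
reduced, keep = drop_all_zero_columns(example)
print(reduced.shape, keep)
```

The mask can be applied to the list of column names as well, which keeps the labels aligned with the remaining columns.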

    opened by DWesl 1
  • "+ -1" equivalence with "-1" in patsy formula?

    As per patsy's documentation, the following results in a design matrix with no intercept:

    >>> patsy.dmatrix('a + b - 1',
           pandas.DataFrame({'a':[1,2], 'b':[3,4]}))
    
    DesignMatrix with shape (2, 2)
      a  b
      1  3
      2  4
      Terms:
        'a' (column 0), 'b' (column 1)
    

    The documentation seems to imply that appending + -1 to the formula a + b should have the same effect as appending '-1'; however, it doesn't seem that the intercept is removed for the former:

    >>> patsy.dmatrix('a + b + -1',
           pandas.DataFrame({'a':[1,2],'b':[3,4]}))
    
    DesignMatrix with shape (2, 3)
      Intercept  a  b
              1  1  3
              1  2  4
      Terms:
        'Intercept' (column 0)
        'a' (column 1)
        'b' (column 2)
    

    Is the above expected?

    I'm using patsy 0.5.1 with python 3.7.6 and pandas 1.0.5.

    bug 
    opened by lebedov 2
  • Introducing `formulaic`, a high-performance `patsy` "competitor"

    Greetings all,

    Late last year I had the need to generate sparse model matrices from large pandas DataFrames (dense model matrices would not fit in memory for the dataset I was using). I originally set about trying to patch patsy, but the code was not set up to allow overriding individual methods, and since I felt it would be a didactic experience in any case, I decided to rewrite something like patsy from scratch. The result is Formulaic.

    I wasn't expecting much more than the addition of sparse matrix support, but it seems I've also managed to improve the performance of model matrix generation by (in many cases) orders of magnitude, even beating R in many cases. I'm in the process of writing up documentation, and there is some low-hanging fruit in terms of improvements, but I'd love to get some eyes on the project, and would welcome feedback.

    opened by matthewwardrop 2
Releases(v0.5.3)
Owner
Python for Data