Dustin Lennon


Applied Scientist
dlennon.org

 
pandas extension categorical data metadata

Metadata by Design

We describe a pandas extension of the categorical dtype that better encapsulates metadata.


April 2021
https://dlennon.org/20210402_xcatvars



One of the common frustrations of data analysis is confronting poorly documented datasets. Categorical variables can be particularly painful. They are often coded and stored as integers and, without proper metadata or a priori domain knowledge, can easily be ingested as covariates that wrongly enjoy the same properties as the natural numbers. This has the potential to silently break statistical and machine learning models.
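To make the failure mode concrete, here is a minimal sketch on illustrative values (not taken from any particular dataset). Read as integers, the codes accept arithmetic without complaint; declared as categorical, pandas refuses to average them.

```python
import pandas as pd

# Illustrative integer codes for a three-level categorical variable.
party = pd.Series([1, 3, 3, 3, 2], name="partyid3")

# Read as integers, pandas happily computes arithmetic summaries.
numeric_mean = party.mean()  # 2.4, a number with no categorical meaning

# Declared as categorical, the same data supports counting but not averaging.
party_cat = party.astype("category")
counts = party_cat.value_counts().sort_index()

try:
    party_cat.mean()
    mean_allowed = True
except TypeError:
    # pandas raises TypeError for arithmetic reductions on categorical data
    mean_allowed = False
```

The point is only that the integer version fails silently while the categorical version fails loudly.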

This post introduces a pandas extension that more tightly couples categorical codings with their metadata. It guides users toward better data design.

Prerequisites

This notebook requires the pydlennon Python package, available on GitHub.

pip install git+https://github.com/dustinlennon/pydlennon

This package is in an early development stage, and any suggestions or other constructive feedback would be very much appreciated.

Pandas ‘category’ dtype

Let’s start with the existing pandas functionality and a simple dataset extracted from the National Election Study. We’ll treat this like a CSV file.

import pandas as pd
from pydlennon.extensions.pandas.ext_categorical import ExtCategorical, ExtCategoricalDtype
import io

testdata = """
,gender,partyid3,race
19641,1,1,2
19642,2,3,1
19643,2,3,
19644,2,3,1
19645,2,2,2
""".strip()

def read_csv(**_kw):
    fp = io.StringIO( testdata )
    kw = {
        'index_col' : 'record_id',
        'header' : 0,
        'names' : [
            'record_id',
            'gender',
            'partyid3',
            'race',
        ],
    }
    kw.update(**_kw)
    df = pd.read_csv(fp, **kw)

    return df

df = read_csv()
display( df.head() )
display( df.dtypes )
gender partyid3 race
record_id
19641 1 1 2.0
19642 2 3 1.0
19643 2 3 NaN
19644 2 3 1.0
19645 2 2 2.0
gender        int64
partyid3      int64
race        float64
dtype: object

For a tiny dataset like this one, it’s easy to surmise that gender should map ‘1’ and ‘2’ to ‘male’ and ‘female’, but it’s ambiguous as to which one is which. The intent of partyid3 is less clear. At first glance, race seems similar to gender, but what other ethnicities were considered? Surely, there should be more than just the two represented here.

These problems of interpretation persist even if we specify the variables as categorical. In pandas, this would take the form:

kw = {
    'dtype' : {
        'gender' : 'category',
        'partyid3' : 'category',
        'race' : 'category'
    }
}
df = read_csv( **kw )
df.dtypes
gender      category
partyid3    category
race        category
dtype: object

This is an improvement, as downstream tools will be able to treat these variables in a more meaningful way. However, there is additional ambiguity lurking just below the surface. In particular, pandas has its own internal coding of the data.

# our expectation
display( df.race.dtype.categories )

# internal coding
df.race.cat.codes
Index(['1', '2'], dtype='object')
record_id
19641    1
19642    0
19643   -1
19644    0
19645    1
dtype: int8

The issue here is that the internal coding doesn’t respect the intuitive coding, and it’s easy to mentally mangle the two representations. This is particularly true if the analyst wants to recenter an ordered categorical variable.
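The relationship between the two codings can be spelled out with plain pandas. This is a small sketch on values that mirror the race column above; the internal codes are zero-based positions into the categories, with -1 reserved for missing entries.

```python
import pandas as pd

# Illustrative values mirroring the race column; None marks a missing entry.
race = pd.Categorical(["2", "1", None, "1", "2"])

# Internal codes are zero-based positions into `categories`; -1 flags missing.
codes = race.codes.tolist()            # [1, 0, -1, 0, 1]
categories = race.categories.tolist()  # ['1', '2']

# Recover the original values from the codes, keeping missing entries as None.
recovered = [categories[c] if c >= 0 else None for c in codes]
```

So a raw value of ‘2’ is stored internally as 1, which is exactly the off-by-one that is easy to trip over.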

Ordered Categorical Variables

Recall partyid3. Here, ‘1’ represents a preference for the Democratic candidate; ‘2’, no preference; and ‘3’, a preference for the Republican candidate. This is an ordered categorical variable, and it is natural to prefer a design matrix that maps these values to -1, 0, and 1 respectively. We might accomplish this as follows:

kw = {
    'dtype' : {
        'gender' : 'category',
        'partyid3' : pd.CategoricalDtype([1,2,3], ordered=True),
        'race' : 'category'
    }
}
df = read_csv( **kw )

df['partyid3'] = df.partyid3.map({1:-1,2:0,3:1})

# an ordered categorical admits min/max
(df.partyid3.min(), df.partyid3.max())
(-1, 1)

This feels ugly. A single idea, coding the categorical variable, has to be split into two distinct steps that occur in separate places in the code. Furthermore, there are now three representations to consider (the raw values, the pandas internal codes, and the recentered values), and we’ve discarded the original coding. This will make it more difficult to cross-check against the original data.
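One plain-pandas way to avoid discarding the original coding, sketched here on illustrative data, is to keep the raw column and add a recoded copy with rename_categories. The recoding dictionary still lives apart from the dtype definition, so the two-step problem remains.

```python
import pandas as pd

# Ordered dtype over the raw codes, as in the example above.
dtype = pd.CategoricalDtype([1, 2, 3], ordered=True)
partyid3 = pd.Series([1, 3, 3, 3, 2], dtype=dtype)

# The recoding is still a separate object from the dtype definition.
recode = {1: -1, 2: 0, 3: 1}

df_both = pd.DataFrame({
    "partyid3": partyid3,                                      # raw coding, kept
    "partyid3_coded": partyid3.cat.rename_categories(recode),  # recentered copy
})

# The ordering carries over, so min/max still work on the recoded column.
coded_range = (df_both.partyid3_coded.min(), df_both.partyid3_coded.max())
```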

Reference categories

For a popular class of regression models, namely linear and generalized linear models, it is crucial that the design matrix be full rank. This requires constraints on how categorical variables are coded. Usually, this is handled by introducing reference categories.
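The rank issue is easy to check numerically. In this sketch, which uses illustrative labels and an explicit intercept column, one indicator per level makes the design matrix rank deficient, and dropping the first level (the reference category) restores full rank.

```python
import numpy as np
import pandas as pd

# Illustrative three-level categorical; listing "white" first makes it the reference.
race = pd.Series(pd.Categorical(
    ["white", "black", "other", "white", "black"],
    categories=["white", "black", "other"],
))
intercept = np.ones(len(race))

# One indicator per level plus an intercept: the indicators sum to the intercept.
full = pd.get_dummies(race, dtype=float).to_numpy()
X_full = np.column_stack([intercept, full])
rank_full = np.linalg.matrix_rank(X_full)        # 3, but X_full has 4 columns

# Dropping the first level removes the redundancy.
reduced = pd.get_dummies(race, drop_first=True, dtype=float).to_numpy()
X_reduced = np.column_stack([intercept, reduced])
rank_reduced = np.linalg.matrix_rank(X_reduced)  # 3, matching 3 columns
```

Which level is dropped depends on the category order, which is why the order in which levels are declared matters.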

For example, in their 2007 textbook, Data Analysis Using Regression and Multilevel/Hierarchical Models, Gelman and Hill define the reference category to be white and male and introduce two boolean variables, one for female and one for black. In the context of the fitted model, the coefficients associated with the two boolean variables are interpreted as level shifts from the white and male reference group. Social scientists often define reference groups based on the specific questions to be addressed by the study.

We can encode this as follows:

kw = {
    'dtype' : {
        'gender' : pd.CategoricalDtype([1,2]),
        'partyid3' : pd.CategoricalDtype([1,2,3], ordered=True),
        'race' : pd.CategoricalDtype([1,2,3])
    }
}
df = read_csv( **kw )

# map: internal codes -> gender
gender_map = dict( enumerate( df.gender.cat.categories ) )

# map: internal codes -> race
race_map = dict( enumerate( df.race.cat.categories ) )

# reference levels are at zero index
[gender_map[0], race_map[0]]
[1, 1]

The ExtCategorical dtype

Our contribution is to extend the pandas Categorical dtype by explicitly including the metadata at the point of ingestion. This winds up being only marginally more work than the above. Specifically, this takes the form:

kw = {
    'dtype' : {
        'gender' : ExtCategoricalDtype([
            (1, 'male'),
            (2, 'female')
        ]),
        'partyid3' : ExtCategoricalDtype([
            (1, 'prefers Democratic party', -1),
            (2, 'no preference', 0),
            (3, 'prefers Republican party', 1)
        ], ordered = True),        
        'race' : ExtCategoricalDtype([
            (1, 'white'),
            (2, 'black'),
            (3, 'other')
        ])
    }
}
df = read_csv( **kw )
df.dtypes
gender      ext_category
partyid3    ext_category
race        ext_category
dtype: object

This is an improvement because metadata at the code level is necessarily self-documenting! For larger datasets, the ExtCategoricalDtype could be populated dynamically, which presupposes that the metadata exists in a format that can be ingested as such. But that’s a win, too, because it means that the metadata must exist in a structured format before an analysis begins.
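As a rough sketch of dynamic population, suppose the metadata arrives as a codebook table. The layout and the column names (variable, code, label) are assumptions made for this example, not a format the package prescribes. The code below only builds the lists of tuples and stops short of constructing the dtypes.

```python
import io
import pandas as pd

# Hypothetical codebook layout; the column names are assumptions for this sketch.
codebook_csv = """
variable,code,label
gender,1,male
gender,2,female
race,1,white
race,2,black
race,3,other
""".strip()

codebook = pd.read_csv(io.StringIO(codebook_csv))

# Collect (code, label) tuples per variable, preserving codebook order.
levels = {
    variable: list(zip(group["code"].tolist(), group["label"].tolist()))
    for variable, group in codebook.groupby("variable", sort=False)
}

# Each list has the same shape as the tuples passed to ExtCategoricalDtype
# above, e.g. levels["gender"] == [(1, 'male'), (2, 'female')].
```

Each entry of levels could then be handed to ExtCategoricalDtype in the same way as the hand-written lists above.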

New features in ExtCategorical

Let’s investigate the new features. The main utility comes from the following accessor, useful during ad hoc exploratory analysis or when constructing design matrices.

First, an ext_category looks and feels like a normal categorical variable:

# the usual display
df.partyid3
record_id
19641    1
19642    3
19643    3
19644    3
19645    2
Name: partyid3, dtype: ext_category

But we can ask it to masquerade as though the metadata were explicitly available:

# act like a categorical with categories that match the metadata
partyid3_metadata = df.partyid3.xcat.relevel(1)
partyid3_metadata
record_id
19641    prefers Democratic party
19642    prefers Republican party
19643    prefers Republican party
19644    prefers Republican party
19645               no preference
dtype: ext_category

Or as a categorical variable that uses the design matrix codings:

# act like a categorical with categories that match the design matrix codings
partyid3_design = df.partyid3.xcat.relevel(2)
partyid3_design
record_id
19641   -1
19642    1
19643    1
19644    1
19645    0
dtype: ext_category

Or convert back to the values found in the original data:

partyid3_raw = partyid3_design.xcat.relevel(0)
partyid3_raw
record_id
19641    1
19642    3
19643    3
19644    3
19645    2
dtype: ext_category

The argument to relevel is zero-indexed and selects a position within the tuples used to define the ExtCategoricalDtype in the dictionary passed to pandas’ read_csv function: 0 is the raw code, 1 is the label, and 2, where defined, is the design matrix coding.
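The idea behind relevel can be mimicked with plain pandas. The helper below, relevel_sketch, is hypothetical: it is not the package’s implementation, it does not handle missing values, and it needs the source position spelled out, which the real accessor does not.

```python
import pandas as pd

# Level tuples as defined for partyid3: (raw code, label, design matrix coding).
levels = [
    (1, "prefers Democratic party", -1),
    (2, "no preference", 0),
    (3, "prefers Republican party", 1),
]

def relevel_sketch(values, levels, source, target, ordered=True):
    """Hypothetical helper: translate values from tuple position `source` to `target`."""
    lookup = {lvl[source]: lvl[target] for lvl in levels}
    categories = [lvl[target] for lvl in levels]
    # Missing values are not handled here; an unknown value raises KeyError.
    return pd.Categorical([lookup[v] for v in values],
                          categories=categories, ordered=ordered)

raw = [1, 3, 3, 3, 2]
labels = relevel_sketch(raw, levels, source=0, target=1)
design = relevel_sketch(raw, levels, source=0, target=2)
back = relevel_sketch(design, levels, source=2, target=0)
```

Because every tuple carries all representations, any of them can be reached from any other without consulting an external lookup.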

OOP Design

To downstream libraries that anticipate categorical variables, the ExtCategoricalDtype enjoys the following property: it’s a subclass of CategoricalDtype, and instances of ExtCategoricalDtype behave accordingly:

isinstance(partyid3_design.values, pd.Categorical)
True

This means that well-written packages should be able to handle an ExtCategorical transparently, exactly as they would handle a Categorical dtype.

Additional Functionality

To date, we’ve implemented basic functionality. Methods like dropna should just work as expected:

display(df)
df.dropna()
gender partyid3 race
record_id
19641 1 1 2.0
19642 2 3 1.0
19643 2 3 NaN
19644 2 3 1.0
19645 2 2 2.0
gender partyid3 race
record_id
19641 1 1 2
19642 2 3 1
19644 2 3 1
19645 2 2 2

And we can add and replace columns in a data frame:

df['race_m'] = df.race.xcat.relevel(1)
df['partyid3_coded'] = df.partyid3.xcat.relevel(2).astype(int)
df
gender partyid3 race race_m partyid3_coded
record_id
19641 1 1 2.0 black -1
19642 2 3 1.0 white 1
19643 2 3 NaN NaN 1
19644 2 3 1.0 white 1
19645 2 2 2.0 black 0

Final words

We’ve only handled a few use cases; I’m sure there are plenty of code paths that we haven’t walked down. That means your mileage may vary. However, we’re interested in improving this as time permits, so be in touch if there are bugs or feature requests. Thanks!