pydlennon: pandas extensions
An extension of the pandas categorical dtype that better encapsulates metadata
One of the common frustrations of data analysis is confronting poorly documented datasets. Categorical variables can be particularly painful. They are often coded and stored as integers, and without proper metadata or a priori domain knowledge, they can easily be ingested as covariates that wrongly enjoy the same properties as the natural numbers. This has the potential to silently break statistical and machine learning models.
This post introduces a pandas extension that more tightly couples categorical codings with their metadata. It guides users toward better data design.
This notebook requires the pydlennon Python package, available on GitHub.
pip install git+https://github.com/dustinlennon/pydlennon
This package is in an early development stage, and any suggestions or other constructive feedback would be very much appreciated.
Note: the code below was eventually deprecated because simpler alternatives exist. The commit that supports this workflow is 6cd56b0.
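If you'd like to follow along with exactly that version, pip should be able to install it directly from the commit:
pip install git+https://github.com/dustinlennon/pydlennon@6cd56b0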
Let’s start with the existing pandas functionality and a simple dataset extracted from the National Election Study. We’ll treat this like a CSV file.
import pandas as pd
from pydlennon.extensions.pandas.ext_categorical import ExtCategorical, ExtCategoricalDtype
import io
testdata = """
,gender,partyid3,race
19641,1,1,2
19642,2,3,1
19643,2,3,
19644,2,3,1
19645,2,2,2
""".strip()
def read_csv(**_kw):
    fp = io.StringIO( testdata )
    kw = {
        'index_col' : 'record_id',
        'header'    : 0,
        'names'     : [
            'record_id',
            'gender',
            'partyid3',
            'race',
        ],
    }
    kw.update(**_kw)
    df = pd.read_csv(fp, **kw)
    return df
df = read_csv()
display( df.head() )
display( df.dtypes )
record_id | gender | partyid3 | race
---|---|---|---
19641 | 1 | 1 | 2.0
19642 | 2 | 3 | 1.0
19643 | 2 | 3 | NaN
19644 | 2 | 3 | 1.0
19645 | 2 | 2 | 2.0
gender int64
partyid3 int64
race float64
dtype: object
For a tiny data set like this one, it’s easy to surmise that gender maps ‘1’ and ‘2’ to ‘male’ and ‘female’, but it’s ambiguous which is which. The intent of partyid3 is less clear. At first glance, race seems similar to gender, but what other ethnicities were considered? Surely there should be more than the two represented here.
These problems of interpretation persist even if we specify the variables as categorical. In pandas, this would take the form:
kw = {
    'dtype' : {
        'gender'   : 'category',
        'partyid3' : 'category',
        'race'     : 'category'
    }
}
df = read_csv( **kw )
df.dtypes
gender category
partyid3 category
race category
dtype: object
This is an improvement, as downstream tools will be able to treat these variables in a more meaningful way. However, there is additional ambiguity lurking just below the surface. In particular, pandas has its own internal coding of the data.
# our expectation
display( df.race.dtype.categories )
# internal coding
df.race.cat.codes
Index(['1', '2'], dtype='object')
record_id
19641 1
19642 0
19643 -1
19644 0
19645 1
dtype: int8
The issue here is that the internal coding doesn’t respect the intuitive coding, and it’s easy to mentally mangle the two representations. This is particularly true if the analyst wants to recenter an ordered categorical variable.
Recall partyid3. Here, ‘1’ represents a preference for the Democratic candidate; ‘2’, no preference; and ‘3’, a preference for the Republican candidate. This is an ordered categorical variable, and it is natural to prefer a design matrix that maps these values to -1, 0, and 1 respectively. We might accomplish this as follows:
kw = {
    'dtype' : {
        'gender'   : 'category',
        'partyid3' : pd.CategoricalDtype([1,2,3], ordered=True),
        'race'     : 'category'
    }
}
df = read_csv( **kw )
df['partyid3'] = df.partyid3.map({1:-1,2:0,3:1})
# an ordered categorical admits min/max
(df.partyid3.min(), df.partyid3.max())
(-1, 1)
This feels ugly. In particular, it requires a single idea, coding the categorical variable, to be split into two distinct steps occurring in separate places in the code. Furthermore, there are now three representations to consider, and we’ve discarded the original coding. This will make it more difficult to cross-check against the original data.
For a popular class of simple regression models, namely linear and generalized linear regressions, it is crucial that the design matrix be full rank. This requires constraints on how categorical variables are coded. Usually, this is handled by introducing reference categories.
For example, in their 2007 textbook, Data Analysis Using Regression and Multilevel/Hierarchical Models, Gelman and Hill define the reference category to be white and male and introduce two boolean variables, one for female and one for black. In the context of the fitted model, the coefficients associated with the two boolean variables are interpreted as level shifts from the white and male reference group. Social scientists often define reference groups based on the specific questions to be addressed by the study.
We can encode this as follows:
kw = {
    'dtype' : {
        'gender'   : pd.CategoricalDtype([1,2]),
        'partyid3' : pd.CategoricalDtype([1,2,3], ordered=True),
        'race'     : pd.CategoricalDtype([1,2,3])
    }
}
df = read_csv( **kw )
# map: internal codes -> gender
gender_map = dict( enumerate( df.gender.cat.categories ) )
# map: internal codes -> race
race_map = dict( enumerate( df.race.cat.categories ) )
# reference levels are at zero index
[gender_map[0], race_map[0]]
[1, 1]
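With the reference levels sitting at internal code zero, a Gelman-and-Hill-style design matrix could be sketched with stock pandas tooling. This is only a sketch; the dummy column names (gender_2, race_2, race_3 under get_dummies’ default naming) are an artifact of get_dummies, not part of the dataset itself.
# one indicator column per non-reference level; category 1 is the dropped reference for each variable
X = pd.get_dummies(df[['gender', 'race']], drop_first=True)
X.head()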
Our contribution is to extend the pandas Categorical dtype by explicitly including the metadata at the point of ingestion. This winds up being only marginally more work than the above. Specifically, this takes the form:
kw = {
    'dtype' : {
        'gender' : ExtCategoricalDtype([
            (1, 'male'),
            (2, 'female')
        ]),
        'partyid3' : ExtCategoricalDtype([
            (1, 'prefers Democratic party', -1),
            (2, 'no preference', 0),
            (3, 'prefers Republican party', 1)
        ], ordered = True),
        'race' : ExtCategoricalDtype([
            (1, 'white'),
            (2, 'black'),
            (3, 'other')
        ])
    }
}
df = read_csv( **kw )
df.dtypes
gender ext_category
partyid3 ext_category
race ext_category
dtype: object
This is an improvement because metadata at the code level is necessarily self-documenting! For larger datasets, the ExtCategoricalDtype could be populated dynamically; this presupposes that the metadata exists in a format that can be ingested as such. But that’s a win, too, because it means the metadata must exist in a structured format before an analysis begins.
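As a rough sketch of what dynamic population might look like, suppose the codebook lives in a plain Python dict. The codebook structure below is hypothetical, not something the package provides:
# hypothetical codebook: variable -> list of (raw code, label[, design code]) tuples
codebook = {
    'gender'   : [ (1, 'male'), (2, 'female') ],
    'partyid3' : [ (1, 'prefers Democratic party', -1),
                   (2, 'no preference', 0),
                   (3, 'prefers Republican party', 1) ],
    'race'     : [ (1, 'white'), (2, 'black'), (3, 'other') ],
}
ordered_vars = {'partyid3'}

# build the dtype mapping programmatically rather than by hand
dtypes = {
    name : ExtCategoricalDtype(levels, ordered = (name in ordered_vars))
    for name, levels in codebook.items()
}

df = read_csv(dtype = dtypes)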
Let’s investigate the new features. The main utility comes from the xcat accessor introduced below, useful during ad hoc exploratory analysis or when constructing design matrices.
First, an ext_category looks and feels like a normal categorical variable:
# the usual display
df.partyid3
record_id
19641 1
19642 3
19643 3
19644 3
19645 2
Name: partyid3, dtype: ext_category
But we can ask it to masquerade as though the metadata were explicitly available:
# act like a categorical with categories that match the metadata
partyid3_metadata = df.partyid3.xcat.relevel(1)
partyid3_metadata
record_id
19641 prefers Democratic party
19642 prefers Republican party
19643 prefers Republican party
19644 prefers Republican party
19645 no preference
dtype: ext_category
Or as a categorical variable that uses the design matrix codings:
# act like a categorical with categories that match the design matrix codings
partyid3_design = df.partyid3.xcat.relevel(2)
partyid3_design
record_id
19641 -1
19642 1
19643 1
19644 1
19645 0
dtype: ext_category
Or convert back to the values found in the original data:
partyid3_raw = partyid3_design.xcat.relevel(0)
partyid3_raw
record_id
19641 1
19642 3
19643 3
19644 3
19645 2
dtype: ext_category
The argument to relevel is zero indexed and corresponds to the order in which we defined the ExtCategoricalDtype in the dictionary passed to pandas read_csv function.
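For example, given the gender dtype defined above, level 0 should recover the raw 1/2 survey coding and level 1 the ‘male’/‘female’ labels (gender was defined without a third level). The expected values noted in the comment follow from the data and metadata above:
# level 0: raw survey coding; level 1: human-readable labels
df.gender.xcat.relevel(1)    # expected: male, female, female, female, female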
For downstream libraries that anticipate categorical variables, ExtCategoricalDtype enjoys a useful property: it is a subclass of CategoricalDtype, and instances of ExtCategoricalDtype behave accordingly:
isinstance(partyid3_design.values, pd.Categorical)
True
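Since ExtCategoricalDtype is a subclass of CategoricalDtype, we’d also expect the dtype-level check to hold:
# the extension dtype itself registers as a CategoricalDtype
issubclass(ExtCategoricalDtype, pd.CategoricalDtype)    # expected: True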
This means that well-written packages should be able to handle an ExtCategorical transparently, exactly as they would handle a Categorical dtype.
To date, we’ve implemented basic functionality. Methods like dropna should just work as expected:
display(df)
df.dropna()
record_id | gender | partyid3 | race
---|---|---|---
19641 | 1 | 1 | 2.0
19642 | 2 | 3 | 1.0
19643 | 2 | 3 | NaN
19644 | 2 | 3 | 1.0
19645 | 2 | 2 | 2.0
record_id | gender | partyid3 | race
---|---|---|---
19641 | 1 | 1 | 2
19642 | 2 | 3 | 1
19644 | 2 | 3 | 1
19645 | 2 | 2 | 2
And we can add and replace columns in a data frame:
df['race_m'] = df.race.xcat.relevel(1)
df['partyid3_coded'] = df.partyid3.xcat.relevel(2).astype(int)
df
record_id | gender | partyid3 | race | race_m | partyid3_coded
---|---|---|---|---|---
19641 | 1 | 1 | 2.0 | black | -1
19642 | 2 | 3 | 1.0 | white | 1
19643 | 2 | 3 | NaN | NaN | 1
19644 | 2 | 3 | 1.0 | white | 1
19645 | 2 | 2 | 2.0 | black | 0
We’ve only handled a few use cases; I’m sure there are plenty of code paths we haven’t walked down, so your mileage may vary. However, we’re interested in improving this as time permits, so please be in touch if there are bugs or feature requests. Thanks!