When working with data, it is very useful to be able to group and aggregate data by multiple columns to understand the various segments of our data.
With pandas, we can easily find the frequencies of columns in a dataframe using the pandas value_counts() function, and we can do cross tabulations very easily using the pandas crosstab() function.
However, what is not obvious is how to use pandas to create a crosstab for 3 columns or a crosstab for an arbitrary number of columns and make it easy to work with.
In the rest of this post, you’ll find exactly the code you need to be able to produce a pandas crosstab on multiple columns.
Pandas Crosstab with Multiple Columns
Let’s say I have the following data:
animal_type gender weight age state trained
0 cat male 10 1 CA no
1 dog male 20 4 FL no
2 dog male 30 5 NY no
3 cat female 40 3 FL yes
4 cat female 10 2 NY yes
5 dog female 20 4 TX yes
6 cat female 50 6 TX yes
7 dog male 60 1 CA no
8 dog male 70 5 NY no
9 cat female 80 4 FL yes
10 cat female 90 3 TX yes
11 cat male 100 2 TX no
12 dog female 80 4 FL no
To calculate a simple two column crosstab, we can do the following:
pd.crosstab(data["gender"],data["state"])
#output
state CA FL NY TX
gender
female 0 3 1 3
male 2 1 2 1
While this is nice, it would be better to have something that we can work with more easily.
The key to creating a generalized function which will create a crosstab for any number of columns using pandas and Python is to utilize the pandas unstack() function with the pandas reset_index() function like shown below:
pd.crosstab(data["gender"],data["state"]).unstack().reset_index().rename(columns={0:'Count'})
#output:
state gender Count
0 CA female 0
1 CA male 2
2 FL female 3
3 FL male 1
4 NY female 1
5 NY male 2
6 TX female 3
7 TX male 1
With this in mind, we can easily create a function which will produce a crosstab for multiple columns.
Below is a function which takes a dataframe and a list of column names and produces the frequencies for each of the groups we want.
def frequency(ds, vars):
if len(vars) > 1:
c1 = ds[vars[0]]
c2 = []
for i in range(1,len(vars)):
c2.append(ds[vars[i]])
dfs = []
dfs.append(pd.crosstab(c1,c2).unstack().reset_index().rename(columns={0:'Count'}))
dfs.append(pd.crosstab(c1,c2, normalize='all').unstack().reset_index().rename(columns={0:'Percent'}))
dfs = [df.set_index(vars) for df in dfs]
df = dfs[0].join(dfs[1:]).reset_index()
return df
First, we need to create a list of columns which we will do the crosstab with. Then, we will call the pandas crosstab() function, unstack the result, and reset the index.
I’ve also included the ability to calculate the percentages for each group which is easily done by passing the “normalize=’all'” option to the crosstab() function.
Here’s the output of our frequency function which gives us the counts and percentages of each segment from our crosstab:
frequency(data,["animal_type","gender","state","trained"])
#output
animal_type gender state trained Count Percent
0 cat female FL no 0 0.000000
1 dog female FL no 1 0.076923
2 cat female FL yes 2 0.153846
3 dog female FL yes 0 0.000000
4 cat female NY yes 1 0.076923
5 dog female NY yes 0 0.000000
6 cat female TX yes 2 0.153846
7 dog female TX yes 1 0.076923
8 cat male CA no 1 0.076923
9 dog male CA no 1 0.076923
10 cat male FL no 0 0.000000
11 dog male FL no 1 0.076923
12 cat male NY no 0 0.000000
13 dog male NY no 2 0.153846
14 cat male TX no 1 0.076923
15 dog male TX no 0 0.000000
Finding Row Percentages and Column Percentages in Crosstab
Many times when we are analyzing data, we want to find the row percentages and column percentages in a crosstab.
To do this, we can add two lines to our function and utilize the “normalize” option of the crosstab() function.
Below is the new function which will allow us to calculate the row and column percentages in a crosstab using pandas and Python:
def frequency(ds, vars):
if len(vars) > 1:
c1 = ds[vars[0]]
c2 = []
for i in range(1,len(vars)):
c2.append(ds[vars[i]])
dfs = []
dfs.append(pd.crosstab(c1,c2).unstack().reset_index().rename(columns={0:'Count'}))
dfs.append(pd.crosstab(c1,c2, normalize='all').unstack().reset_index().rename(columns={0:'Percent'}))
dfs.append(pd.crosstab(c1,c2, normalize='columns').unstack().reset_index().rename(columns={0:'Column Percent'}))
dfs.append(pd.crosstab(c1,c2, normalize='index').unstack().reset_index().rename(columns={0:'Row Percent'}))
dfs = [df.set_index(vars) for df in dfs]
df = dfs[0].join(dfs[1:]).reset_index()
return df
Here’s the output of our frequency function which gives us the counts and percentages of each segment in our dataframe, and also the row and column percentages in our crosstab:
frequency(data,["animal_type","gender","state","trained"])
#output
animal_type gender state trained Count Percent Column Percent Row Percent
0 cat female FL no 0 0.000000 0.000000 0.000000
1 dog female FL no 1 0.076923 1.000000 0.166667
2 cat female FL yes 2 0.153846 1.000000 0.285714
3 dog female FL yes 0 0.000000 0.000000 0.000000
4 cat female NY yes 1 0.076923 1.000000 0.142857
5 dog female NY yes 0 0.000000 0.000000 0.000000
6 cat female TX yes 2 0.153846 0.666667 0.285714
7 dog female TX yes 1 0.076923 0.333333 0.166667
8 cat male CA no 1 0.076923 0.500000 0.142857
9 dog male CA no 1 0.076923 0.500000 0.166667
10 cat male FL no 0 0.000000 0.000000 0.000000
11 dog male FL no 1 0.076923 1.000000 0.166667
12 cat male NY no 0 0.000000 0.000000 0.000000
13 dog male NY no 2 0.153846 1.000000 0.333333
14 cat male TX no 1 0.076923 1.000000 0.142857
15 dog male TX no 0 0.000000 0.000000 0.000000
Hopefully this article has been helpful for you to learn how to use pandas to create a crosstab across multiple columns in Python.
Hello,
is it possible to put 2 values in a column under 2 levels in a crosstab.
example
2022
Male Female
age income age income
25 1000 35 1500
So far i have for now
data2 = pd.crosstab([SQL_query.State],[SQL_query.year,SQL_query.sex], values = SQL_query.bedrag , aggfunc=np.sum)
Hi Tonnie, I don’t think you need to be using crosstab to accomplish what you are doing. A crosstab is just doing A x B and then finding the count or aggregation of a single variable for each of the groups. From what I see here, it looks like you are trying to just groupby and then find the sums of the two variables.
Also from doing some research, I don’t know how to you can use two different values for “values” without getting an error.
Instead, it’s possible you should be using groupby(), sum() and stack().
This gets me pretty close…
It’s not what you were going for but it is horizontal with the information you want vs. vertical. I can’t find something that will do it vertical.
This might be a relevant article: http://theprogrammingexpert.com/pandas-group-by-aggregate-multiple-columns/