• Skip to primary navigation
  • Skip to main content
  • Skip to primary sidebar

The Programming Expert

Solving All of Your Programming Headaches

  • HTML
  • JavaScript
  • jQuery
  • PHP
  • Python
  • SAS
  • Ruby
  • About
You are here: Home / Python / Pandas Crosstab on Multiple Columns

Pandas Crosstab on Multiple Columns

December 7, 2021 2 Comments

When working with data, it is very useful to be able to group and aggregate data by multiple columns to understand the various segments of our data.

With pandas, we can easily find the frequencies of columns in a dataframe using the pandas value_counts() function, and we can do cross tabulations very easily using the pandas crosstab() function.

However, what is not obvious is how to use pandas to create a crosstab for 3 columns or a crosstab for an arbitrary number of columns and make it easy to work with.

In the rest of this post, you’ll find exactly the code you need to be able to produce a pandas crosstab on multiple columns.

Pandas Crosstab with Multiple Columns

Let’s say I have the following data:

   animal_type  gender  weight  age state trained
0          cat    male      10    1    CA      no
1          dog    male      20    4    FL      no
2          dog    male      30    5    NY      no
3          cat  female      40    3    FL     yes
4          cat  female      10    2    NY     yes
5          dog  female      20    4    TX     yes
6          cat  female      50    6    TX     yes
7          dog    male      60    1    CA      no
8          dog    male      70    5    NY      no
9          cat  female      80    4    FL     yes
10         cat  female      90    3    TX     yes
11         cat    male     100    2    TX      no
12         dog  female      80    4    FL      no

To calculate a simple two column crosstab, we can do the following:

pd.crosstab(data["gender"],data["state"])

#output
state   CA  FL  NY  TX
gender
female   0   3   1   3
male     2   1   2   1

While this is nice, it would be better to have something that we can work with more easily.

The key to creating a generalized function which will create a crosstab for any number of columns using pandas and Python is to utilize the pandas unstack() function with the pandas reset_index() function like shown below:

pd.crosstab(data["gender"],data["state"]).unstack().reset_index().rename(columns={0:'Count'})

#output:
  state  gender  Count
0    CA  female      0
1    CA    male      2
2    FL  female      3
3    FL    male      1
4    NY  female      1
5    NY    male      2
6    TX  female      3
7    TX    male      1

With this in mind, we can easily create a function which will produce a crosstab for multiple columns.

Below is a function which takes a dataframe and a list of column names and produces the frequencies for each of the groups we want.

def frequency(ds, vars):
    if len(vars) > 1:
        c1 = ds[vars[0]]
        c2 = []
        for i in range(1,len(vars)):
            c2.append(ds[vars[i]])
        dfs = []
        dfs.append(pd.crosstab(c1,c2).unstack().reset_index().rename(columns={0:'Count'}))
        dfs.append(pd.crosstab(c1,c2, normalize='all').unstack().reset_index().rename(columns={0:'Percent'}))
        dfs = [df.set_index(vars) for df in dfs]
        df = dfs[0].join(dfs[1:]).reset_index()
        return df

First, we need to create a list of columns which we will do the crosstab with. Then, we will call the pandas crosstab() function, unstack the result, and reset the index.

I’ve also included the ability to calculate the percentages for each group which is easily done by passing the “normalize=’all'” option to the crosstab() function.

Here’s the output of our frequency function which gives us the counts and percentages of each segment from our crosstab:

frequency(data,["animal_type","gender","state","trained"])

#output
   animal_type  gender state trained  Count   Percent
0          cat  female    FL      no      0  0.000000
1          dog  female    FL      no      1  0.076923
2          cat  female    FL     yes      2  0.153846
3          dog  female    FL     yes      0  0.000000
4          cat  female    NY     yes      1  0.076923
5          dog  female    NY     yes      0  0.000000
6          cat  female    TX     yes      2  0.153846
7          dog  female    TX     yes      1  0.076923
8          cat    male    CA      no      1  0.076923
9          dog    male    CA      no      1  0.076923
10         cat    male    FL      no      0  0.000000
11         dog    male    FL      no      1  0.076923
12         cat    male    NY      no      0  0.000000
13         dog    male    NY      no      2  0.153846
14         cat    male    TX      no      1  0.076923
15         dog    male    TX      no      0  0.000000

Finding Row Percentages and Column Percentages in Crosstab

Many times when we are analyzing data, we want to find the row percentages and column percentages in a crosstab.

To do this, we can add two lines to our function and utilize the “normalize” option of the crosstab() function.

Below is the new function which will allow us to calculate the row and column percentages in a crosstab using pandas and Python:

def frequency(ds, vars):
    if len(vars) > 1:
        c1 = ds[vars[0]]
        c2 = []
        for i in range(1,len(vars)):
            c2.append(ds[vars[i]])
        dfs = []
        dfs.append(pd.crosstab(c1,c2).unstack().reset_index().rename(columns={0:'Count'}))
        dfs.append(pd.crosstab(c1,c2, normalize='all').unstack().reset_index().rename(columns={0:'Percent'}))
        dfs.append(pd.crosstab(c1,c2, normalize='columns').unstack().reset_index().rename(columns={0:'Column Percent'}))
        dfs.append(pd.crosstab(c1,c2, normalize='index').unstack().reset_index().rename(columns={0:'Row Percent'}))
        dfs = [df.set_index(vars) for df in dfs]
        df = dfs[0].join(dfs[1:]).reset_index()
        return df

Here’s the output of our frequency function which gives us the counts and percentages of each segment in our dataframe, and also the row and column percentages in our crosstab:

frequency(data,["animal_type","gender","state","trained"])

#output
   animal_type  gender state trained  Count   Percent  Column Percent  Row Percent
0          cat  female    FL      no      0  0.000000        0.000000     0.000000
1          dog  female    FL      no      1  0.076923        1.000000     0.166667
2          cat  female    FL     yes      2  0.153846        1.000000     0.285714
3          dog  female    FL     yes      0  0.000000        0.000000     0.000000
4          cat  female    NY     yes      1  0.076923        1.000000     0.142857
5          dog  female    NY     yes      0  0.000000        0.000000     0.000000
6          cat  female    TX     yes      2  0.153846        0.666667     0.285714
7          dog  female    TX     yes      1  0.076923        0.333333     0.166667
8          cat    male    CA      no      1  0.076923        0.500000     0.142857
9          dog    male    CA      no      1  0.076923        0.500000     0.166667
10         cat    male    FL      no      0  0.000000        0.000000     0.000000
11         dog    male    FL      no      1  0.076923        1.000000     0.166667
12         cat    male    NY      no      0  0.000000        0.000000     0.000000
13         dog    male    NY      no      2  0.153846        1.000000     0.333333
14         cat    male    TX      no      1  0.076923        1.000000     0.142857
15         dog    male    TX      no      0  0.000000        0.000000     0.000000

Hopefully this article has been helpful for you to learn how to use pandas to create a crosstab across multiple columns in Python.

Other Articles You'll Also Like:

  • 1.  Python issuperset() Function – Check if Set is Superset of Another Set
  • 2.  Clear File Contents with Python
  • 3.  Remove None From List Using Python
  • 4.  Creating a List of Zeros in Python
  • 5.  How to Rotate String in Python
  • 6.  Find Maximum of Three Numbers in Python
  • 7.  Remove Leading and Trailing Characters from String with strip() in Python
  • 8.  Python Square Root Without Math Module – ** or Newton’s Method
  • 9.  Python power function – Exponentiate Numbers with math.pow()
  • 10.  pandas mode – Find Mode of Series or Columns in DataFrame

About The Programming Expert

The Programming Expert is a compilation of a programmer’s findings in the world of software development, website creation, and automation of processes.

Programming allows us to create amazing applications which make our work more efficient, repeatable and accurate.

At the end of the day, we want to be able to just push a button and let the code do it’s magic.

You can read more about us on our about page.

Reader Interactions

Comments

  1. Tonnie says

    June 23, 2022 at 5:22 am

    Hello,
    is it possible to put 2 values in a column under 2 levels in a crosstab.
    example

    2022
    Male Female
    age income age income
    25 1000 35 1500

    So far i have for now
    data2 = pd.crosstab([SQL_query.State],[SQL_query.year,SQL_query.sex], values = SQL_query.bedrag , aggfunc=np.sum)

    Reply
    • Erik says

      June 23, 2022 at 2:01 pm

      Hi Tonnie, I don’t think you need to be using crosstab to accomplish what you are doing. A crosstab is just doing A x B and then finding the count or aggregation of a single variable for each of the groups. From what I see here, it looks like you are trying to just groupby and then find the sums of the two variables.

      Also from doing some research, I don’t know how to you can use two different values for “values” without getting an error.

      Instead, it’s possible you should be using groupby(), sum() and stack().

      This gets me pretty close…

      d = [{"sex":"male","income":1000, "age":25, "year":2022}, {"sex":"female","income":1500, "age":30, "year":2022},{"sex":"female","income":2000, "age":35, "year":2022}]
      
      df = pd.DataFrame.from_dict(d)
      
      df.groupby(["year","sex"]).sum(["age","income"]).stack()
      
      #Output:
      year  sex
      2022  female  income    3500
                    age         65
            male    income    1000
                    age         25

      It’s not what you were going for but it is horizontal with the information you want vs. vertical. I can’t find something that will do it vertical.

      This might be a relevant article: http://theprogrammingexpert.com/pandas-group-by-aggregate-multiple-columns/

      Reply

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Primary Sidebar

About The Programming Expert

the programming expert main image

Welcome to The Programming Expert. We are a group of US-based programming professionals who have helped companies build, maintain, and improve everything from simple websites to large-scale projects.

We built The Programming Expert to help you solve your programming problems with useful coding methods and functions in various programming languages.

Search

Learn Coding from Experts on Udemy

Looking to boost your skills and learn how to become a programming expert?

Check out the links below to view Udemy courses for learning to program in the following languages:

Copyright © 2023 · The Programming Expert · About · Privacy Policy