
The Programming Expert

Solving All of Your Programming Headaches


Drop Duplicates pandas – Remove Duplicate Rows in DataFrame

January 13, 2022

The easiest way to drop duplicate rows from a pandas DataFrame (or duplicate values from a Series) is to call its drop_duplicates() method.

df.drop_duplicates()

When working with data, it’s important to be able to spot problems in it. Duplicate records are one common problem that we may need to find and remove.

With Python, we can find and remove duplicate rows in data very easily using the pandas package and the pandas drop_duplicates() function.

Let’s say we have the following DataFrame:

import pandas as pd

df = pd.DataFrame({'Name': ['Jim','Jim','Jim','Sally','Bob','Sue','Sue','Larry'],
                   'Weight':['100','100','200','100','200','150','150','200']})

print(df)

# Output:
    Name Weight
0    Jim    100
1    Jim    100
2    Jim    200
3  Sally    100
4    Bob    200
5    Sue    150
6    Sue    150
7  Larry    200

Let’s find the duplicate rows in this DataFrame. We can do this easily using the pandas duplicated() function, which returns a boolean Series that is True for each row that repeats an earlier row. By default, the first occurrence of a row is marked False and every later copy is marked True.

print(df.duplicated())

# Output:
0    False
1     True
2    False
3    False
4    False
5    False
6     True
7    False
dtype: bool
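
The boolean Series returned by duplicated() can also be used as a mask, for example to count the duplicates or to look at them before removing anything. The following is a minimal sketch that rebuilds the same DataFrame so it runs on its own; the names n_dupes and dupes_all are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jim','Jim','Jim','Sally','Bob','Sue','Sue','Larry'],
                   'Weight':['100','100','200','100','200','150','150','200']})

# True counts as 1, so summing the mask counts the rows flagged as duplicates
n_dupes = int(df.duplicated().sum())
print(n_dupes)

# keep=False flags every row of a duplicated group, including the first copy,
# which makes it easy to inspect all of the rows involved
dupes_all = df[df.duplicated(keep=False)]
print(dupes_all)
```

Here the mask selects both Jim/100 rows and both Sue/150 rows, since keep=False does not spare the first occurrence.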

We see above that we have 2 duplicate rows, at index 1 and index 6. If we want to remove these duplicate rows, we can use the pandas drop_duplicates() function as in the following Python code:

print(df.drop_duplicates())

# Output:
    Name Weight
0    Jim    100
2    Jim    200
3  Sally    100
4    Bob    200
5    Sue    150
7  Larry    200

The default setting for drop_duplicates() is keep="first", which drops every duplicate except the first occurrence. Passing keep="last" keeps the last occurrence instead, and passing keep=False drops every row that has a duplicate, including the first occurrence.

print(df.drop_duplicates(keep="last"))
print(df.drop_duplicates(keep=False))

# Output:
    Name Weight
1    Jim    100
2    Jim    200
3  Sally    100
4    Bob    200
6    Sue    150
7  Larry    200

    Name Weight
2    Jim    200
3  Sally    100
4    Bob    200
7  Larry    200

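The pandas drop_duplicates() function returns a new DataFrame that keeps the original index labels. If you want the result renumbered from 0, pass ignore_index=True (available in pandas 1.0 and later). Additionally, as with many other pandas functions, you can pass inplace=True to modify the DataFrame itself, in which case the function returns None.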

print(df.drop_duplicates(keep=False, ignore_index=True))

# Output:
    Name Weight
0    Jim    200
1  Sally    100
2    Bob    200
3  Larry    200
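
As a rough illustration of the inplace option, and of reset_index(drop=True) as another way to renumber the result, here is a small sketch. It rebuilds the same DataFrame so it runs on its own, and the names deduped, result, and renumbered are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jim','Jim','Jim','Sally','Bob','Sue','Sue','Larry'],
                   'Weight':['100','100','200','100','200','150','150','200']})

# Work on a copy so the original df is left untouched
deduped = df.copy()

# With inplace=True the DataFrame is modified and the call returns None
result = deduped.drop_duplicates(inplace=True)
print(result)
print(len(deduped))

# reset_index(drop=True) renumbers the rows from 0, like ignore_index=True
renumbered = df.drop_duplicates(keep=False).reset_index(drop=True)
print(renumbered)
```

Because the in-place call returns None, writing df = df.drop_duplicates(inplace=True) would replace df with None, so use either assignment or inplace=True, not both.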

Drop Duplicate Rows based on Column Using Pandas

By default, the drop_duplicates() function removes duplicates based on all columns of a DataFrame. We can remove duplicate rows based on just one column or multiple columns using the subset parameter.

Let’s say we have the same DataFrame as above. We can drop every row that repeats an earlier value in the Name column by passing subset=["Name"] to the drop_duplicates() function. Only the first row for each name is kept, even when the Weight values differ.

print(df.drop_duplicates(subset=["Name"]))

# Output:
    Name Weight
0    Jim    100
3  Sally    100
4    Bob    200
5    Sue    150
7  Larry    200
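
The subset parameter can be combined with keep, and it accepts either a single column label or a list of labels. The sketch below rebuilds the same DataFrame so it runs on its own; the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jim','Jim','Jim','Sally','Bob','Sue','Sue','Larry'],
                   'Weight':['100','100','200','100','200','150','150','200']})

# Keep the last row for each name instead of the first
by_name_last = df.drop_duplicates(subset=["Name"], keep="last")
print(by_name_last)

# A single column label works as well as a list
by_weight = df.drop_duplicates(subset="Weight")
print(by_weight)

# Listing every column gives the same result as the default behavior
by_both = df.drop_duplicates(subset=["Name", "Weight"])
print(by_both)
```

With keep="last", the row kept for Jim is the one with Weight 200 (index 2) rather than the one at index 0.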

Hopefully this article has been beneficial for you to understand how to use the pandas drop_duplicates() function to remove duplicate rows in your data in Python.
