Saturday, 11 June 2016

Python Pandas - Find difference between two data frames



I have two data frames df1 and df2, where df2 is a subset of df1. How do I get a new data frame (df3) which is the difference between the two data frames?



In other word, a data frame that has all the rows/columns in df1 that are not in df2?




enter image description here


Answer



By using drop_duplicates



pd.concat([df1,df2]).drop_duplicates(keep=False)





Update :





Above method only working for those dataframes they do not have duplicate itself, For example




df1=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})
df2=pd.DataFrame({'A':[1],'B':[2]})


It will output like below , which is wrong





Wrong Output :




pd.concat([df1, df2]).drop_duplicates(keep=False)
Out[655]:
A B
1 2 3




Correct Output




Out[656]: 
A B
1 2 3
2 3 4
3 3 4







How to achieve that?




Method 1: Using isin with tuple




df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]
Out[657]:
A B
1 2 3
2 3 4
3 3 4


Method 2: merge with indicator




df1.merge(df2,indicator = True, how='left').loc[lambda x : x['_merge']!='both']
Out[421]:
A B _merge
1 2 3 left_only
2 3 4 left_only
3 3 4 left_only

No comments:

Post a Comment

c++ - Does curly brackets matter for empty constructor?

Those brackets declare an empty, inline constructor. In that case, with them, the constructor does exist, it merely does nothing more than t...