Concat DataFrame Reindexing only valid with uniquely valued Index objects

pd.concat requires that the indices be unique. To remove rows with duplicate indices, use

df = df.loc[~df.index.duplicated(keep='first')]

import pandas as pd
from pandas import Timestamp

df1 = pd.DataFrame(
    {'price': [0.7286, 0.7286, 0.7286, 0.7286],
     'side': [2, 2, 2, 2],
     'timestamp': [1451865675631331, 1451865675631400,
                  1451865675631861, 1451865675631866]},
    index=pd.DatetimeIndex(['2000-1-1', '2000-1-1', '2001-1-1', '2002-1-1']))


df2 = pd.DataFrame(
    {'bid': [0.7284, 0.7284, 0.7284, 0.7285, 0.7285],
     'bid_size': [4000000, 4000000, 5000000, 1000000, 4000000],
     'offer': [0.7285, 0.729, 0.7286, 0.7286, 0.729],
     'offer_size': [1000000, 4000000, 4000000, 4000000, 4000000]},
    index=pd.DatetimeIndex(['2000-1-1', '2001-1-1', '2002-1-1', '2003-1-1', '2004-1-1']))


df1 = df1.loc[~df1.index.duplicated(keep='first')]
#              price  side         timestamp
# 2000-01-01  0.7286     2  1451865675631331
# 2001-01-01  0.7286     2  1451865675631861
# 2002-01-01  0.7286     2  1451865675631866

df2 = df2.loc[~df2.index.duplicated(keep='first')]
#                bid  bid_size   offer  offer_size
# 2000-01-01  0.7284   4000000  0.7285     1000000
# 2001-01-01  0.7284   4000000  0.7290     4000000
# 2002-01-01  0.7284   5000000  0.7286     4000000
# 2003-01-01  0.7285   1000000  0.7286     4000000
# 2004-01-01  0.7285   4000000  0.7290     4000000

result = pd.concat([df1, df2], axis=0)
print(result)
               bid  bid_size   offer  offer_size   price  side     timestamp
2000-01-01     NaN       NaN     NaN         NaN  0.7286     2  1.451866e+15
2001-01-01     NaN       NaN     NaN         NaN  0.7286     2  1.451866e+15
2002-01-01     NaN       NaN     NaN         NaN  0.7286     2  1.451866e+15
2000-01-01  0.7284   4000000  0.7285     1000000     NaN   NaN           NaN
2001-01-01  0.7284   4000000  0.7290     4000000     NaN   NaN           NaN
2002-01-01  0.7284   5000000  0.7286     4000000     NaN   NaN           NaN
2003-01-01  0.7285   1000000  0.7286     4000000     NaN   NaN           NaN
2004-01-01  0.7285   4000000  0.7290     4000000     NaN   NaN           NaN

Note there is also pd.join, which can join DataFrames based on their indices, and handle non-unique indices based on the how parameter. Rows with duplicate index are not removed.

In [94]: df1.join(df2)
Out[94]: 
             price  side         timestamp     bid  bid_size   offer  \
2000-01-01  0.7286     2  1451865675631331  0.7284   4000000  0.7285   
2000-01-01  0.7286     2  1451865675631400  0.7284   4000000  0.7285   
2001-01-01  0.7286     2  1451865675631861  0.7284   4000000  0.7290   
2002-01-01  0.7286     2  1451865675631866  0.7284   5000000  0.7286   

            offer_size  
2000-01-01     1000000  
2000-01-01     1000000  
2001-01-01     4000000  
2002-01-01     4000000  

In [95]: df1.join(df2, how='outer')
Out[95]: 
             price  side     timestamp     bid  bid_size   offer  offer_size
2000-01-01  0.7286     2  1.451866e+15  0.7284   4000000  0.7285     1000000
2000-01-01  0.7286     2  1.451866e+15  0.7284   4000000  0.7285     1000000
2001-01-01  0.7286     2  1.451866e+15  0.7284   4000000  0.7290     4000000
2002-01-01  0.7286     2  1.451866e+15  0.7284   5000000  0.7286     4000000
2003-01-01     NaN   NaN           NaN  0.7285   1000000  0.7286     4000000
2004-01-01     NaN   NaN           NaN  0.7285   4000000  0.7290     4000000

Related Posts:

Leave a Comment Cancel reply