Description
How can you create a copy of the DataFrame without copying the actual data, but having a new DataFrame that when updated (not in place) does not modify the original ("shallow copy")? And how is this expected to behave?
I suppose that in technical terms, this would be a new BlockManager that references the same arrays?
I ran into the questions above and didn't know a clear answer. The context: I wanted to replace one column of a DataFrame without modifying the original, and wondered whether I could do that without making a full copy of the DataFrame (in theory a full copy is not needed; I just wanted to update one object column before serializing).
So you can do something like this with copy(deep=False). Let's explore this somewhat:
Making a normal (deep) and shallow copy:
In [1]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [.1, .2, .3]})
In [2]: df_copy = df.copy()
In [3]: df_shallow = df.copy(deep=False)
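One way to check what the two copies actually share is to compare the underlying arrays with NumPy's `np.shares_memory`. This is only a minimal sketch, assuming plain NumPy-backed int64/float64 columns as in the example; `shares_column` is a hypothetical helper, not a pandas function.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [.1, .2, .3]})
df_copy = df.copy()                # deep copy: new arrays
df_shallow = df.copy(deep=False)   # shallow copy: new object, same arrays


def shares_column(left, right, col):
    """Hypothetical helper: True if column `col` of both frames overlaps in memory."""
    return bool(np.shares_memory(left[col].to_numpy(), right[col].to_numpy()))


deep_shares = shares_column(df, df_copy, 'a')
shallow_shares = shares_column(df, df_shallow, 'a')
print(deep_shares, shallow_shares)
```

Right after copying, the deep copy should report no shared memory and the shallow copy should report shared memory. This says nothing yet about what happens after an assignment.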
Modifying values in place works as expected: for the copy it does not change the original df, for the shallow copy it does:
In [4]: df_copy.iloc[0,0] = 10
In [5]: df_shallow.iloc[1,0] = 20
In [6]: df
Out[6]:
a b
0 1 0.1
1 20 0.2
2 3 0.3
Overwriting a full column, however, is trickier (due to our BlockManager ...):
# this updates the original df
In [7]: df_shallow['a'] = [10, 20, 30]
In [8]: df
Out[8]:
a b
0 10 0.1
1 20 0.2
2 30 0.3
# this does not update the original
In [9]: df_shallow['b'] = [100, 200, 300]
In [10]: df_shallow
Out[10]:
a b
0 10 100
1 20 200
2 30 300
In [11]: df
Out[11]:
a b
0 10 0.1
1 20 0.2
2 30 0.3
This is of course somewhat expected if you know the internals: if the new column has the same dtype, the assignment seems to modify the block's array in place, while if it needs to create a new block (because the dtype changed on assignment), the reference to the old data is broken and the original dataframe is not modified.
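The dtype condition described above can be written down as a small check. This is a sketch only: `assignment_keeps_dtype` is a hypothetical helper that compares the dtype pandas infers for the new values with the dtype of the existing column. It mirrors the condition, not the BlockManager logic itself, and whether the original is really modified may depend on the pandas version.

```python
import pandas as pd


def assignment_keeps_dtype(frame, col, values):
    """Hypothetical helper: True if `values` is inferred with the same dtype as frame[col]."""
    new_dtype = pd.Series(values).dtype
    return bool(new_dtype == frame[col].dtype)


df = pd.DataFrame({'a': [1, 2, 3], 'b': [.1, .2, .3]})

# Integers into an integer column: same dtype, the case that updated the original above.
same_for_a = assignment_keeps_dtype(df, 'a', [10, 20, 30])

# Integers into a float column: dtype changes, the case that left the original untouched.
same_for_b = assignment_keeps_dtype(df, 'b', [100, 200, 300])

print(same_for_a, same_for_b)
```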
While writing this down, I am realizing that my question is maybe more: should assigning a column (df['a'] = ..) be seen as an in-place modification of your dataframe that has impact through shallow copies?
Because in reality, df['a'] = .. cannot always happen in place (for example, when overwriting with a different dtype), this gives rather inconsistent and surprising behaviour depending on the dtypes.
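To see how a given pandas installation behaves, the experiment from this report can be wrapped in a small function. This is a sketch under the assumption that comparing column values before and after is enough to detect propagation; `assignment_propagates` is a hypothetical name. No expected output is given because the result for the same-dtype case may differ between pandas versions.

```python
import pandas as pd


def assignment_propagates(frame, col, values):
    """Hypothetical helper: assign `values` to `col` on a shallow copy
    and report whether the frame it was copied from changed.

    The experiment runs on a private deep copy, so `frame` itself is never touched.
    """
    original = frame.copy()              # private deep copy to experiment on
    before = original[col].tolist()      # snapshot of the column values
    shallow = original.copy(deep=False)
    shallow[col] = values                # full-column assignment under test
    return original[col].tolist() != before


df = pd.DataFrame({'a': [1, 2, 3], 'b': [.1, .2, .3]})
for col, values in [('a', [10, 20, 30]), ('b', [100, 200, 300])]:
    print(col, assignment_propagates(df, col, values))
```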