API: how to create a "shallow copy" of a DataFrame? #29309

@jorisvandenbossche

How can you create a copy of a DataFrame that does not copy the actual data, but gives you a new DataFrame that, when updated (not in place), does not modify the original (a "shallow copy")? And how is this expected to behave?
I suppose that, in technical terms, this would be a new BlockManager that references the same arrays?

I ran into the above questions and didn't actually know a clear answer. The context was: I wanted to replace one column of a DataFrame without modifying the original one, and so was wondering whether I could do that without making a full copy of the DataFrame (in theory a full copy is not needed, as I just wanted to update one object column before serializing).


You can do something like this with copy(deep=False). Let's explore this a bit:

Making a normal (deep) and shallow copy:

In [1]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [.1, .2, .3]}) 

In [2]: df_copy = df.copy() 

In [3]: df_shallow = df.copy(deep=False)
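
One way to check that the shallow copy really references the same arrays as the original is to compare the underlying buffers. This is only a minimal sketch: going through to_numpy() and np.shares_memory is my own choice for illustration, and it does not inspect the BlockManager directly.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [.1, .2, .3]})
df_copy = df.copy()
df_shallow = df.copy(deep=False)

# Compare the memory behind column 'a' of the original with each copy
shares_with_deep = bool(
    np.shares_memory(df['a'].to_numpy(), df_copy['a'].to_numpy())
)
shares_with_shallow = bool(
    np.shares_memory(df['a'].to_numpy(), df_shallow['a'].to_numpy())
)

print("deep copy shares memory:", shares_with_deep)
print("shallow copy shares memory:", shares_with_shallow)
```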

Modifying values in place works as expected: for the copy it does not change the original df, for the shallow copy it does:

In [4]: df_copy.iloc[0,0] = 10  

In [5]: df_shallow.iloc[1,0] = 20  

In [6]: df    
Out[6]: 
    a    b
0   1  0.1
1  20  0.2
2   3  0.3

Overwriting a full column, however, is trickier (due to our BlockManager ...):

# this updates the original df
In [7]: df_shallow['a'] = [10, 20, 30] 

In [8]: df
Out[8]: 
    a    b
0  10  0.1
1  20  0.2
2  30  0.3

# this does not update the original
In [9]: df_shallow['b'] = [100, 200, 300]  

In [10]: df_shallow  
Out[10]: 
    a    b
0  10  100
1  20  200
2  30  300

In [11]: df  
Out[11]: 
    a    b
0  10  0.1
1  20  0.2
2  30  0.3

This is of course somewhat expected if you know the internals: if the new column has the same dtype, it seems to modify the array of the block in place, while if a new block needs to be created (because the dtype changed on assignment), the reference to the old data is broken and the original dataframe is not modified.
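
Since this depends on the internals, the outcome may well differ between pandas versions. A small sketch that reports, rather than assumes, whether each assignment reaches the original (the two flag names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [.1, .2, .3]})
df_shallow = df.copy(deep=False)

# Same dtype (int -> int): may or may not be written into the shared array
df_shallow['a'] = [10, 20, 30]
# Different dtype (float -> int): the float array cannot hold the new column as-is
df_shallow['b'] = [100, 200, 300]

# Check whether the original still holds its initial values
original_a_changed = df['a'].tolist() != [1, 2, 3]
original_b_changed = df['b'].tolist() != [0.1, 0.2, 0.3]

print("pandas version:", pd.__version__)
print("original 'a' changed:", original_a_changed)
print("original 'b' changed:", original_b_changed)
```

Running this on the pandas version you care about shows directly whether the same-dtype and changed-dtype cases behave differently there.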

While writing this down, I am realizing that my question is maybe more: should assigning a column (df['a'] = ..) be seen as an in-place modification of your dataframe that has an impact through shallow copies?
Because in reality df['a'] = .. cannot always happen in place (for example, if you are overwriting with a different dtype), this gives rather inconsistent and surprising behaviour depending on the dtypes.
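
For the original use case (replacing one column without touching the original), one option that sidesteps the in-place question is DataFrame.assign, which returns a new DataFrame. This is a sketch of the semantics only: whether the untouched columns get copied under the hood depends on the pandas version, so it does not by itself answer the "without a full copy" part.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [.1, .2, .3]})

# assign returns a new DataFrame; df itself is not modified
df_new = df.assign(a=[10, 20, 30])

original_a = df['a'].tolist()
new_a = df_new['a'].tolist()

print("original:", original_a)
print("new:", new_a)
```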
