Views and Copies in Pandas#
In our readings on numpy, we discussed in detail how, when one subsets an array, what numpy returns is often not an entirely new array, but rather a view of the original array. These views share the underlying data of the array from which they were spawned, meaning changes to either the original array or the view have the potential to impact one another.
Since pandas Series and DataFrames are backed by numpy arrays, it will probably come as no surprise that something similar sometimes happens in pandas. As of pandas 2.0, however, I’m delighted to report the behavior of pandas is much more intuitive than numpy.
In this reading, we’ll first quickly review how views work in numpy before turning to how this behavior looks in pandas in the next reading.
A Review of Views in numpy#
As a reminder of how views work in numpy, let’s create a simple array, subset it with slicing, and then diagram what’s going on.
import numpy as np
my_vector = np.array([1, 2, 3, 4])
my_vector
array([1, 2, 3, 4])
my_subset = my_vector[1:3]
my_subset
array([2, 3])
Now while it may appear that my_subset
is just a new array holding the values 2
and 3
, that’s actually not quite what’s going on behind the scenes. Rather, what numpy has done is create a reference to the selected subset of my_vector
. We call this type of reference a view, and we can visualize what’s going on like this:
This reference is called a view because it’s not a copy of the data in the original array, but an easy way to referring back to the original array – it provides a view onto a subset of the original array.
Why is this distinction important? It’s important because it means that both variables – a
and new
are actually both referencing the same data, and so changes make through one variable will propagate to the other.
TO illustrate, suppose we change the first entry of my_subset
to be -99
:
my_subset[0] = -99
Since the first entry in my_subset
is just a reference to the second entry in my_vector
, the change I made to my_subset
will also propagate to my_vector
:
my_vector
array([ 1, -99, 3, 4])
And just as edits to my_subset
will propagate to my_vector
, so too will edits to my_vector
propagate forward to my_subset
:
my_vector[2] = 42
my_subset
array([-99, 42])
Language and Symmetry#
It’s worth pausing for a moment to point out a bit of a problem with the language of views and copies. It is common, in numpy circles, to look at the example above and talk about my_vector
being the original data, and my_subset
as a view. And it is true that, because my_vector
came first, there is a difference between my_vector
and my_subset
in terms of how numpy is creating and managing these objects.
But from your perspective as a user, it is important to recognize that there is a symmetric dependency between my_vector
and my_subset
in the example above. Yes, one may be “the original,” but once a view has been created, changes to either array have the potential to propagate to the other: changes to the my_subset
may resultant changes to my_vector
, and changes to my_vector
can impact the my_subset
(if they impact the portion of the array referenced by the subset).
So when you think about views, always remember that what we’re talking about is multiple objects sharing the same data, even if we tend to only talk about one of our arrays as “a view.”
A Reminder on When numpy Returns Views#
While a full discussion of the rules of when numpy returns a view and when it returns a copy is a little too complicated to fit into this review reading, just a reminder that not all numpy subsets are views. For example, if you subset an array with “fancy indexing” (e.g. passing a list of indices—my_array[[1, 2]]
—instead of a range of sequential indices separated by a colon as we did above), you will always get a copy. We cover this in detail in Week 3 of Course 2 of this specialization, and you can also find a summary in the official numpy documentation here.