Views and Copies#
Before we dive into matrices, there’s one nuance to how vectors (and indeed all numpy arrays) work that we need to cover: how numpy manages memory allocation when we take subsets of arrays.
Because this reading relates to the nuances of how numpy is actually managing how it writes 1s and 0s in memory, it may seem a little intimidating at first—don’t worry! This topic is definitely a little mind-bending, and many learners may need to read this more than once to develop a good understanding of what’s going on, and most learners won’t develop an intuitive sense of what’s going on until they have been working with numpy for a while. But this isn’t the start of a big transition of the course into abstract computer science theory—we will get back to working with data quickly. This is just one kinda esoteric topic that numpy users have to wrestle with that we can’t avoid introducing around here.
Memory Allocation & Subsetting#
In our previous reading, we talked about how we could not just look at subsets of vectors, but also store those subsets in a new variable. For example, we could pull the middle entry of one vector and assign it to a new vector:
import numpy as np
a = np.array([42, 47, -1])
a
array([42, 47, -1])
new = a[1]
new
47
At the time, we illustrated what was happening with this set of pictures:
And that was close to the truth about what was going on, but it wasn’t quite the full truth.
The reality is that when we create a subset in numpy and assign it to a new variable, what is actually happening is not that the variable is being assigned a copy of the values being the subset, but rather the variable is being assigned a reference to the subset, something that looks more like this:
When numpy creates a reference to a subset of an existing array, that reference is called a view, because it’s not a copy of the data in the original array, but an easy way to refer back to the original array—it provides a view onto a subset of the original array.
Why is this distinction important? It’s important because it means that both variables—a
and new
are actually both referencing the same data, and so changes made to one variable propagate to the other.
To illustrate in more detail, let’s create two new vectors: my_vector
and my_subset
, where my_subset
(as the name implies) is just a subset of my_vector
:
my_vector = np.array([1, 2, 3, 4])
my_vector
array([1, 2, 3, 4])
my_subset = my_vector[1:3]
my_subset
array([2, 3])
Now suppose we change the first entry of my_subset
to -99
:
my_subset[0] = -99
Since the first entry in my_subset
is just a reference to the second entry in my_vector
, the change I made to my_subset
will also propagate to my_vector
:
my_vector
array([ 1, -99, 3, 4])
Note that because my_subset
is a view of the second and third elements of my_vector
, our change to the first element of my_subset
resulted in a change to the second element of my_vector
.
And just as edits to my_subset
propagate to my_vector
, edits to my_vector
will also propagate to my_subset
. Suppose we changed the second entry of my_vector
to 42
. What would happen to my_subset
? Which entry would change?
Well if we return to our illustration, we see:
And indeed we can confirm the change occurs to the second element of my_subset
:
my_vector[2] = 42
my_subset
array([-99, 42])
Language and Symmetry#
It’s worth pausing for a moment to point out a bit of a problem with the language of views and copies. It is common, in numpy circles, to look at the example above and talk about my_vector
being the original data and my_subset
as a view. And it is true that, because my_vector
came first, there is a difference between my_vector
and my_subset
in terms of how numpy is creating and managing these objects.
But from your perspective as a user, it is important to recognize that there is a symmetric dependency between my_vector
and my_subset
in the example above. Yes, one may be “the original,” but once a view has been created, changes to either array have the potential to propagate to the other: changes to the my_subset
may resultant changes to my_vector
, and changes to my_vector
can impact the my_subset
(if they impact the portion of the array referenced by the subset).
So when you think about views, always remember that what we’re talking about is multiple objects sharing the same data, even if we tend to only talk about one of our arrays as “a view.”
Why? Why Would numpy Do This?!#
It is not uncommon for students to feel a little betrayed by numpy when they are first introduced to this behavior. “Why,” they ask, “why would numpy do something that makes it so much harder to keep track of the consequences of changes I make to my data?”
The short answer, as with most things in numpy, is that it’s all about speed. Creating a new copy of the data contained in the subset of a vector takes time (it literally requires your computer to write lots of 1s and 0s to memory), and so creating views instead of copies makes numpy faster.
How much faster? The short answer is a lot faster. The longer answer? Well, let’s talk a little more about how views and copies work, then in our next reading, we can do an experiment to measure the speed difference below.