# Big Data Strategies
## Step 1: Do I Have “Big Data”?
Oddly, this is not actually a straightforward question for two reasons:
- Programs often copy data to manipulate it. When you, say, sort an array, what your computer actually does is create a new empty array, then move entries over from your original array into the new array one at a time (in sorted order). While it does then erase the original array, while it’s working it effectively has two copies of your array. As a result, the fact that you have 16GB of RAM doesn’t mean you can easily work with a 14GB file. As a general rule, you need at least twice as much memory as your file takes up when first loaded.
- The size of data changes as you change its format. For example, if you’re reading data from an inefficient file format (like a .csv text file), a file that is 10GB on disk may end up being only a few GB once you load it into a program like Python that knows how to store data more efficiently (a quick way to compare these two sizes is sketched below).
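A quick (hypothetical) way to make that comparison in pandas (the file name is just for illustration):

```python
import os

import pandas as pd

df = pd.read_csv("example.csv")

size_on_disk = os.path.getsize("example.csv")       # bytes on disk
size_in_memory = df.memory_usage(deep=True).sum()   # bytes once loaded

print(f"On disk:   {size_on_disk / 1e6:.1f} MB")
print(f"In memory: {size_in_memory / 1e6:.1f} MB")
```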
### So how do I check to see if my data is fitting in memory?
If your program starts using more space than you have main memory, your operating system will usually just start using your hard drive for extra space without telling you (this is called “virtual memory”, and is nice in that it prevents your computer from crashing, though it will slow things down a lot). As a result, you won’t always get an error message if you try to load a file that is much bigger than main memory.
To avoid this, use one of the tools provided by your operating system to monitor whether it’s using the hard drive (or check from inside Python, as sketched after this list):

- OSX: OSX comes with a program called Activity Monitor in the Applications > Utilities folder. The memory tab of Activity Monitor has a graph at the bottom called “Memory Pressure”. If this is green, you’re good. If it’s yellow or red, you’re actively going back and forth to your hard drive. (Note your computer may be using “virtual memory” without affecting performance if it’s just storing away things you aren’t actively using.)
- Windows: Windows has a program called “Resource Monitor” that shows data on memory use. If you get near 100% Used Physical Memory, your computer is probably using the hard drive.
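One way to do that programmatic check is with the third-party `psutil` library (a quick sketch, assuming `psutil` is installed):

```python
import psutil  # third-party: pip install psutil

# Report how much physical memory is available and in use right now.
mem = psutil.virtual_memory()
print(f"Total RAM:     {mem.total / 1e9:.1f} GB")
print(f"Available RAM: {mem.available / 1e9:.1f} GB")
print(f"Percent used:  {mem.percent}%")
```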
## Step 2: Pick a Strategy
If you have big data, you basically have four options:

1. Use chunking to trim and thin out your data so it does fit in memory (this works if the data you were given is huge, but the final analysis dataset you want to work with is small).
2. Buy more memory. Seriously, consider it.
3. Minimize the penalties of working off your hard drive (not usually practical for data science).
4. Break your job into pieces and distribute them over multiple machines.
### Strategy 1: Trim till it fits
At least in my experience, most of the time when people have a dataset that’s too big to load into main memory, the raw data they want to work with is too big, but the final analysis dataset they are trying to create is not. This can happen when the raw data you’re working with:
- has lots of variables you don’t need for your analysis,
- has lots of observations you don’t need for your analysis,
- is at a more granular level than you need for analysis (e.g. it’s individual level, but you only need averages for different groups), or
- is formatted in an inefficient manner, where better data management would allow it to fit.
In these situations, the go-to strategy is chunking, which is the practice of:

1. loading only a small part of your data, preprocessing it (dropping extra variables, dropping extra observations, taking averages, and/or storing the data in more efficient formats), and saving it away, then
2. recombining all the now substantially smaller bits into a new analysis dataset that has everything you want and fits into main memory.
Chunking is pretty easy with most data formats. In pandas, for example, you can specify which rows you want to import using `nrows` (to specify that you only want the first *n* rows read into memory), or you can use the combination of `skiprows` and `nrows` to get a specific set of rows (i.e. start right after the number of rows in `skiprows`, then take `nrows` from there). Or if you really want to be fancy, you can use `read_csv` as an iterator by passing `iterator=True` and `chunksize`, which will return an object that you can loop over and which pulls in the number of rows you specify in `chunksize` on each pass of the loop. We’ll also discuss a tool (`dask`) that helps automate this process, though using it properly requires understanding how it works, so we won’t introduce it till later.
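For example, here is a minimal sketch of the chunking pattern using pandas (the file name, column names, and filter are hypothetical):

```python
import pandas as pd

pieces = []

# Read the raw file 100,000 rows at a time; each chunk is a regular DataFrame.
for chunk in pd.read_csv("big_raw_data.csv", chunksize=100_000):
    # Keep only the variables and observations you actually need.
    chunk = chunk[["user_id", "state", "income"]]
    chunk = chunk[chunk["state"] == "NC"]
    pieces.append(chunk)

# Recombine the now much smaller pieces into a single analysis dataset.
analysis_df = pd.concat(pieces, ignore_index=True)
```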
Types of thinning you may wish to do during import (a sketch of several of these appears after this list):

- Drop unneeded variables
- Convert string variables to numeric or categorical/factor datatypes (strings almost always have the largest memory footprint)
- Drop unneeded observations
- Collapse observations
- Change numeric datatypes to more compressed formats (e.g. use `Float32` instead of `Float64` when you know you don’t need all the precision offered by `Float64`). See numbers in machines for a refresher on the tradeoffs between data types. **Be especially careful compressing integers from `Int64` to `Int32`: it’s hard to overflow `Int64`, but very easy to overflow `Int32`!**
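A rough sketch of a few of these in pandas (the column names are hypothetical, and the right dtypes depend on your data):

```python
import pandas as pd

df = pd.read_csv("big_raw_data.csv")

# Strings usually have the largest memory footprint; if a string column has
# a limited set of repeated values, a categorical is far more compact.
df["state"] = df["state"].astype("category")

# Downcast floats when you don't need float64's full precision.
df["income"] = df["income"].astype("float32")

# Check the effect; deep=True measures actual string/categorical storage.
print(df.memory_usage(deep=True))
```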
Finally, note that some import functions can do this automatically without you having to chunk your data manually. For example, `read_csv` has a `usecols` option (where you can specify a subset of columns you want to keep) and a `dtype` option (for specifying the data format in which you want each column stored as it’s streaming in from disk).
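For example, a minimal sketch of doing this thinning at import time (again, the file and column names are hypothetical):

```python
import pandas as pd

# Only the listed columns are ever read into memory, and each column's
# dtype is set as the data streams in from disk.
df = pd.read_csv(
    "big_raw_data.csv",
    usecols=["user_id", "state", "income"],
    dtype={"user_id": "int64", "state": "category", "income": "float32"},
)
```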
**Note on memory management in Python:** If you’re chunking, you need to know a little about how Python manages memory.
In general, we never think about memory management. When you create an object, Python allocates it memory, and when no one is using an object anymore, or when your code finishes running, Python throws it away. Great.
But when you’re pushing the limits of your computer, it’s sometimes important to intervene to encourage Python to delete things sooner than it might otherwise.
In general, Python will only remove something from memory when there is no longer any way for the programmer to access that data. So for example, if you ran the code:
```python
x = [1, 2, 3]
x = "hello!"
```
In line 1, Python would create your list in memory. Then in line 2, it would delete that list, since the only way you previously had to access that list (the variable `x`) has been reassigned, leaving you incapable of getting to the list.
Why does this matter to you? Because if you have a big object that you’re done with, and you want to delete it, the only way to do it is to reassign all the variables that previously pointed to it to something else (by convention, you often reassign them to `None`), or to use the `del` command:
```python
x = [1, 2, 3]
del x
```
Just remember that `del` removes the variable, not the object. The object will only disappear if nothing points to it. So in this situation:
```python
x = [1, 2, 3]
y = x
del x
```
Python won’t delete the list, because while you deleted `x`, the list is still accessible via `y`!
### Strategy 2: Pay for More Memory
Don’t overlook this option: buying RAM doesn’t cost a fortune (especially for desktops), and renting a single big virtual machine on the cloud can be similarly cost-effective.
For example, you can get 100GB of RAM new for about $300 (and this will only go down over time), which means you can easily buy a desktop PC with all the RAM and processing power you want for well under a thousand dollars. That’s not a small amount of money, but your time as a data scientist is also valuable (both to you and to whoever is paying you), and if you can sidestep all these kinds of problems by just buying a bigger computer (especially if it helps you avoid the strategies described below), it can be an incredibly good investment.
And you can rent a Virtual Machine with hundreds of GB of RAM for a couple bucks an hour, something we’ll discuss in a later lesson.
### Strategy 3: Minimize disk-access penalties
There are a number of tools designed to minimize the penalties associated with moving data off-disk. `HDF5`, for example, is a file format that does things like the following (a sketch appears after this list):

- compresses data (it’s actually faster to move a small amount of compressed data from the hard drive to main memory and then decompress it than it is to move a large amount of uncompressed data off-disk; that’s how big the penalty is for reading data off disk),
- can give you select rows or select columns from a dataset without loading the whole thing into memory first, and
- stores data in efficient memory formats.
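A minimal sketch using pandas’ HDF5 interface (which requires the PyTables package; the file name, key, and columns are hypothetical):

```python
import pandas as pd

df = pd.DataFrame(
    {"user_id": [1, 2, 3], "state": ["NC", "CA", "NC"], "income": [50.0, 62.0, 48.0]}
)

# 'table' format supports on-disk queries; data_columns makes 'state'
# queryable, and complevel/complib turn on compression.
df.to_hdf(
    "data.h5", key="records", format="table",
    data_columns=["state"], complevel=9, complib="zlib",
)

# Pull only selected columns and rows without loading the whole file first.
subset = pd.read_hdf(
    "data.h5", key="records",
    columns=["user_id", "income"], where='state == "NC"',
)
```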
SQL, similarly, is designed to maximize the efficiency with which data can be pulled from disk. But in general, unless (a) lots of people are going to be accessing the same data at the same time, or (b) you plan to do tons of different analyses using the same dataset over a long period, the upfront costs of setting up a SQL database overwhelm the benefits it provides at the time you want to run a query.
But it’s also worth emphasizing that the strategies described above are generally only efficient if your goal is to pull a subset of data from disk. This makes them useful for things like medical databases, where doctors want to be able to quickly pull the records of individual patients. But in data science, we usually want to load a lot of data at once, so the ability to quickly pull subsets of data isn’t always super useful.
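To make that concrete, here is a rough sketch of the kind of subset query SQL is good at, using Python’s built-in sqlite3 and pandas (the database file, table, and columns are hypothetical):

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect("patients.db")

# The filtering happens on disk inside the database, so only the small
# result set ever has to fit in memory.
record = pd.read_sql_query(
    "SELECT patient_id, visit_date, diagnosis FROM visits WHERE patient_id = ?",
    conn,
    params=(12345,),
)

conn.close()
```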
### Strategy 4: Distributed Computing
If absolutely everything else has failed, it’s time to turn to distributed computing!
Distributed computing (using lots of individual computers working in concert in a “cluster”) is the practice of taking your task and your data, breaking it into smaller pieces, sending those pieces to different computers, running the “smaller pieces” on those different computers, then bringing back the results and recombining them into a single output.
Distributed computing is amazing, and makes possible many things that would never be possible otherwise. But it comes with two very substantial costs:
- Programmer time: it takes a lot of time to learn to use these systems, to set them up, and to find a way to take code that works easily on one computer and parallelize and distribute it across lots of computers. Note that not all jobs can be parallelized! See parallelism.
- Computer time: there is a huge fixed cost to breaking up jobs, moving those jobs between computers, and recombining the results. As a result, if a job can be done on one computer, it will usually run faster on one computer than distributed across lots of computers.
For these reasons, you should only use distributed computing if the following apply:
- Your job is highly parallelizable, so you can distribute the job across lots of computers.
- You can’t do the job on one computer, either because there isn’t enough space in memory, or because your job is parallelizable and you need far more processors than you can get on one computer to run the job in a reasonable period of time.
One obvious case for this is a situation where your data doesn’t even fit on the hard drives of one computer. This is a real problem for companies like Google and Facebook, and is actually one of the main reasons that distributed computing exists.
So if you’re in this domain, what options are open to you?
The Cloud! There are a few platforms for managing a distributed cluster of computers, including things you may have heard of like Apache Hadoop or Apache Spark, which you can kind of think of as Hadoop v2.0.
In this class, though, we’ll primarily focus on how to use a program called `dask`, which has the same syntax as `pandas`, but works like Spark. We’ll cover `dask` in a later class.
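As a tiny preview, and assuming `dask` is installed (the file pattern and column names are hypothetical), a `dask` version of a pandas workflow looks like this:

```python
import dask.dataframe as dd

# dask reads the files in partitions and builds a task graph; nothing is
# actually computed until you call .compute().
ddf = dd.read_csv("raw_data_*.csv")

avg_income_by_state = ddf.groupby("state")["income"].mean().compute()
print(avg_income_by_state)
```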