Solving Performance Issues#

So: you’ve done all the things suggested in the last page on Performance basics that you can do within the constraints of your project, and you still have a performance problem. Now what?

Do you need to optimize?#

Before you start investing in optimizing your code, ask yourself whether you really need to.

In programming, there is often a trade-off between developer time (i.e. the time you spend actively working on your code) and hardware time (the amount of computer time your code occupies when running). If you have a block of code you need to run every day, then investing your time in making it fast may really help make you more productive, and so may be worth the investment.

If you have a block of code you only plan to run once or a couple times (say, the initial ingest of a big dataset), then it probably isn’t worth investing days making it 10x faster if the running time is less than insane (i.e. if you can just let it run overnight and it’ll be done by morning).

So when deciding whether to optimize something, ask first “Will investing my time in optimizing this code really pay off in future productivity?”
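To make that question concrete, here is a rough back-of-the-envelope sketch. The function name and all of the numbers are hypothetical, and it assumes an hour of waiting costs you as much as an hour of work, which overstates the benefit for jobs that can simply run overnight.

```python
def break_even_runs(dev_hours, old_runtime_hours, speedup):
    """Number of runs needed before the time saved exceeds the time invested."""
    saved_per_run = old_runtime_hours - old_runtime_hours / speedup
    return dev_hours / saved_per_run


# Hypothetical numbers: 16 hours of tuning makes an 8-hour job 10x faster.
runs = break_even_runs(dev_hours=16, old_runtime_hours=8, speedup=10)
print(f"break-even after about {runs:.1f} runs")
```

With these made-up numbers the investment pays off after a few runs, so it is worthwhile for a daily job but probably not for a one-off ingest.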

Profiling Code#

If you take nothing else away from this page, please read and remember this section!

There’s no reason to tune a line of code that is only responsible for 1/100 of your running time, so before you invest in speeding up your code, figure out exactly what in your code is causing it to be slow – a process known as “profiling”.
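The arithmetic behind this advice is usually called Amdahl's law. The sketch below is only an illustration (the function name and the example fractions are my own), but it shows why the share of the running time matters more than how much faster you make a given piece of code.

```python
def overall_speedup(fraction, local_speedup):
    """Whole-program speed-up when `fraction` of the running time
    is made `local_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# A line responsible for 1/100 of the running time, made 10x faster:
small = overall_speedup(0.01, 10)
# A block responsible for 80% of the running time, made 10x faster:
large = overall_speedup(0.80, 10)
print(f"{small:.3f}x vs {large:.2f}x")  # prints "1.009x vs 3.57x"
```

Making the 1% line ten times faster leaves the program less than 1% faster overall, which is why you want to know where the time goes before touching anything.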

Thankfully, because this is so important, there are lots of tools (called profilers) for measuring exactly how long your computer is spending doing each step in a block of code. Here are a couple, with some demonstrations below:

  • Profiling in R: the two packages I’ve seen used most are Rprof and lineprof.

  • Profiling in Python: if you use Jupyter Notebooks or JupyterLab, you can use the %prun magic. If for some reason you’re not using Jupyter, here’s a guide to a few other tools.

Profiling Example#

To illustrate, let’s write a function (called my_analysis) which we can pretend is a big analysis that’s causing me problems. Within this analysis we’ll place several functions, most of which are fast, but one of which is slow. To make it really easy to see what is fast and what is slow, these functions will just call the time.sleep() function, which literally just tells the computer to pause for a given number of seconds (i.e. time.sleep(10) makes execution pause for 10 seconds).

import time


def a_slow_function():
    time.sleep(5)
    return 1


def a_medium_function():
    time.sleep(1)
    return 1


def a_fast_function():
    return 1


def my_analysis():
    x = 0
    x = x + a_slow_function()
    x = x + a_medium_function()
    x = x + a_fast_function()
    print(f"the result of my analysis is: {x}")


my_analysis()
the result of my analysis is: 3

Now we can profile this code with the IPython magic %prun:

%prun my_analysis()
the result of my analysis is: 3
 
         44 function calls in 6.007 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    6.007    3.003    6.007    3.003 {built-in method time.sleep}
        3    0.000    0.000    0.000    0.000 socket.py:337(send)
        1    0.000    0.000    6.007    6.007 <ipython-input-1-2718bcdb1d57>:14(my_analysis)
        3    0.000    0.000    0.000    0.000 iostream.py:197(schedule)
        2    0.000    0.000    0.000    0.000 iostream.py:384(write)
        1    0.000    0.000    6.007    6.007 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.print}
        3    0.000    0.000    0.000    0.000 threading.py:1080(is_alive)
        1    0.000    0.000    1.002    1.002 <ipython-input-1-2718bcdb1d57>:7(a_medium_function)
        3    0.000    0.000    0.000    0.000 threading.py:1038(_wait_for_tstate_lock)
        1    0.000    0.000    5.005    5.005 <ipython-input-1-2718bcdb1d57>:3(a_slow_function)
        2    0.000    0.000    0.000    0.000 iostream.py:309(_is_master_process)
        3    0.000    0.000    0.000    0.000 {method 'acquire' of '_thread.lock' objects}
        3    0.000    0.000    0.000    0.000 iostream.py:93(_event_pipe)
        2    0.000    0.000    0.000    0.000 iostream.py:322(_schedule_flush)
        2    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
        2    0.000    0.000    0.000    0.000 {built-in method posix.getpid}
        3    0.000    0.000    0.000    0.000 threading.py:507(is_set)
        1    0.000    0.000    6.007    6.007 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 <ipython-input-1-2718bcdb1d57>:11(a_fast_function)
        3    0.000    0.000    0.000    0.000 {method 'append' of 'collections.deque' objects}

The output shows a number of things, but the most important are tottime and cumtime. tottime is the time spent inside a function itself, excluding time spent in any functions it calls; cumtime is the cumulative time spent in a function, including everything it calls.

From tottime we can see that 6 seconds were spent in time.sleep() itself.

From cumtime, you can see which of our functions those calls to time.sleep() were made from. time.sleep() has a cumtime of 6.007 because a total of 6 seconds were spent while that function ran, but a_slow_function (listed as <ipython-input-1-2718bcdb1d57>:3(a_slow_function)) also has a cumtime of 5 seconds, because that function was in the process of executing when time.sleep() paused for 5 seconds. Because the time spent in a function is counted again in every function that called it, the cumtime column does not add up to the total running time.

From this, we can deduce that time.sleep() was slowing down our code, and that the occurrence of time.sleep() that slowed down our code the most was in a_slow_function.
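If you are not working in Jupyter, much the same report can be produced with the standard library's cProfile and pstats modules. The sketch below uses shorter sleeps and stand-in function names (slow_step, fast_step and analysis are illustrative, not the functions above), and sorts by cumulative time so the functions that contain the slow calls float to the top.

```python
import cProfile
import io
import pstats
import time


def slow_step():
    time.sleep(0.2)


def fast_step():
    time.sleep(0.01)


def analysis():
    slow_step()
    fast_step()


# Record every function call made while analysis() runs.
profiler = cProfile.Profile()
profiler.enable()
analysis()
profiler.disable()

# Write the report to a string, sorted by cumulative time.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

For a whole script, running python -m cProfile -s cumulative my_script.py from the command line gives a similar table without changing the code (my_script.py is a placeholder name).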

Speeding Code with Cython#

There are two libraries designed to allow you to massively speed up Python code. The first is called Cython, and it is a way of writing code that is basically Python with type declarations. For example, if you wanted to average all the numbers up to a million in Python, you could write something like the following (obviously not the most concise way to do it, but you get the idea):

def avg_numbers_up_to(N):
    adding_total = 0
    for i in range(N):
        adding_total = adding_total + i

    avg = adding_total / N

    return avg

But in Cython, you would write:

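    def avg_numbers_up_to(long N):
        # long rather than int: the running total quickly outgrows 32 bits
        cdef long adding_total = 0
        cdef long i

        for i in range(N):
            adding_total = adding_total + i

        # cast to double so the division is not truncated to an integer
        return <double>adding_total / N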

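Then to integrate this into your Python code, you would save this function definition in a new file with the suffix .pyx (say, avg_numbers.pyx), and put the following build code in a file called setup.py in the same folder:

from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize('avg_numbers.pyx'))

Running python setup.py build_ext --inplace compiles the module, after which from avg_numbers import avg_numbers_up_to works like any other import.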

Then you can call your Cythonized function (avg_numbers_up_to) in your normal Python script, but you’ll now find it runs ~10x - 100x faster! (Note that this speedup is only likely when compared to pure Python code. If you’re comparing Cython to a library function that was already written in C, your Cythonized Python is unlikely to be any faster, and may be slower, than that library function.)

Also, note that in Cythonized code, loops are just as fast as vectorized code!

To illustrate, we can actually compile Cython code directly in Jupyter Notebooks with the %%cython magic. This is good for teaching, but is not how you should plan on using Cython yourself – it’s just not robust to rely on Jupyter magics for core functionality in your programs (and you should only cythonize something if it’s a core part of your code!).

%load_ext Cython
%%cython

def cython_avg_numbers_up_to(long N):
    cdef long adding_total
    adding_total = 0

    cdef long i
    for i in range(N):
        adding_total += i

    return adding_total
%timeit avg_numbers_up_to(1_000_000_000)
45.4 s ± 1.4 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit cython_avg_numbers_up_to(1_000_000_000)
46.1 ns ± 0.751 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
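Outside a notebook, the %timeit comparison can be approximated with the standard library's timeit module. The sketch below needs no Cython: it times a hand-written loop against the built-in sum(), which is implemented in C, to illustrate the earlier note that library functions written in C are already fast. The function names and the value of N are my own choices, and the timings will vary from machine to machine.

```python
import timeit


def loop_sum(n):
    # Pure-Python loop: every iteration goes through the interpreter.
    total = 0
    for i in range(n):
        total += i
    return total


def builtin_sum(n):
    # The built-in sum() does the looping in C.
    return sum(range(n))


N = 100_000
assert loop_sum(N) == builtin_sum(N)

# Take the best of 3 repeats, each running the function 10 times.
loop_time = min(timeit.repeat(lambda: loop_sum(N), number=10, repeat=3))
builtin_time = min(timeit.repeat(lambda: builtin_sum(N), number=10, repeat=3))
print(f"loop:  {loop_time / 10 * 1e3:.2f} ms per call")
print(f"sum(): {builtin_time / 10 * 1e3:.2f} ms per call")
```

Taking the minimum over a few repeats is a common way to reduce noise from whatever else the computer is doing at the time.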

But while Cython is nice, it does have lots of limitations!

The biggest is that you can get integer overflows in Cython! Moreover, be aware that int is a C integer (32 bit, which overflows at about 2 billion), so it is very prone to overflowing. long is what you need to use to get a 64-bit integer (at least on Linux and macOS; on Windows a C long is also 32 bits, so you would need long long there). So if I made the accumulator an int rather than a long:

%%cython

def cython_w_overflow_avg_numbers_up_to(long N):
    cdef int adding_total
    adding_total = 0

    cdef long i
    for i in range(N):
        adding_total += i

    return adding_total
cython_w_overflow_avg_numbers_up_to(1_000_000_000)
-1243309312

Yup… we reached the limit of the 32-bit integer, then kept adding things, making it wrap around to a negative number!
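The exact value can be reproduced in plain Python, whose integers never overflow. This is a sketch that assumes the C int wraps around in two's complement, which is what the result above suggests happened; wrap_to_int32 is a helper written for this illustration, not part of Cython.

```python
def wrap_to_int32(value):
    """Reinterpret an arbitrary Python int as a signed 32-bit integer."""
    return (value + 2**31) % 2**32 - 2**31


N = 1_000_000_000
true_total = N * (N - 1) // 2  # closed form for 0 + 1 + ... + (N - 1)
wrapped = wrap_to_int32(true_total)

print(true_total)  # 499999999500000000
print(wrapped)     # -1243309312
```

The true total needs about 59 bits, so it fits comfortably in a 64-bit integer but not in a 32-bit one.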

Cython limitations#

There are a few limitations to be aware of, however:

  • Cython is its own language. It’s very similar to Python, but you’ll have to invest a little in learning its quirks.

  • You can get integer overflows in Cython! Moreover, be aware that int is a C integer (32 bit, which overflows at about 2 billion), so super prone to overflowing. long is what you need to use to get a 64-bit integer!

  • Cython only really works with (a) native Python and (b) NumPy (numpy instructions here). Some other Python libraries are / can be supported, but it’s not nearly as straightforward as the example above. So if your code uses other libraries, all bets are off.

  • The function you write will not be dynamically typed, so if you said the function would accept integers, you can only give it integers.

  • Distributing code you write with Cython can be tricky…

Cython Advantages#

  • Can make C libraries directly accessible from Python (if you know how to work with C libraries)

  • It is a very mature tool, so well supported and documented. Indeed, a lot of both pandas and libraries like scikit-learn are built on Cython to ensure performance!

  • Definitely easier to write one function in Cython than move all your code to C!

Speeding Code with Numba#

Another tool you can use is numba. Numba is a program that, when it works, is super easy and kinda magic, but can also be rather finicky.

The idea of numba is that it treats each function like its own little program, and tries to compile it to make it faster.

It can operate in two modes. In the first (“python mode”, which the numba documentation calls object mode), it achieves its speed-up by saving the machine code that was used the first time a function is run. The speed benefits of this aren’t huge – Python still has to do all the work of doing type checking and de-referencing, but it only has to actually convert what it’s doing to machine code once, so if you plan on using a function over and over, it can be beneficial.

The second mode (“nopython mode”) is blazing fast. In nopython mode, numba analyzes your function, and then makes inferences about the types of variables that it will encounter. For example, if you gave the following code to numba:

def my_big_loop(N):
    accumulator = 0
    for i in range(N):
        accumulator = accumulator + i
    return accumulator

And then you ran that code with N=1000000, then numba would look at the function and think to itself “ok, so N is an integer. And accumulator starts as an integer. And if I add integers, they will always stay integers. So accumulator + i will always be integer addition. So I don’t have to think about types at every step of this loop!”

The only catch is that numba can’t always compile code in nopython mode. For example, numba isn’t compatible with pandas, so if you put pandas code in a function you pass to numba, it can’t work in nopython mode.

But when it does work, it’s magic, because instead of making a new file that has to be compiled separately and which won’t “just work” on other computers, to make numba work you just add a “decorator” to the function you want to speed up. For example:

# Without numba
def my_big_loop(N):
    accumulator = 0
    for i in range(N):
        accumulator = accumulator + i
    return accumulator
%timeit my_big_loop(1_000_000)
45.3 ms ± 598 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# With numba
from numba import jit

# We just add this "decorator" (A line that starts with @ just above a function)
# The "nopython=True" option says to jit "tell me if you can't work in nopython mode,
# don't just silently revert to Python mode."
@jit(nopython=True)
def my_big_loop(N):
    accumulator = 0
    for i in range(N):
        accumulator = accumulator + i
    return accumulator

%timeit my_big_loop(1_000_000)
114 ns ± 0.883 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Not bad, huh? Add one line of code, and in this benchmark the function runs roughly 400,000x faster (45.3 ms versus 114 ns)!

But as I said, it doesn’t work with everything – not even all regular Python datatypes (e.g. if you define your own class, you’re apparently out of luck with Numba). Here’s an intro to numba, and here’s a full list of things it can handle in nopython mode, and things it can’t.

Also, as with Cython, note that in nopython Numba functions, loops are just as fast as vectorized code!
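One practical consequence of compiling a function the first time it is called is that the first call pays for the compilation, so it is worth timing the first and second calls separately. The sketch below uses njit, numba's shorthand for jit(nopython=True). As an assumption made purely so the example runs anywhere, it falls back to plain Python when numba is not installed, in which case both calls take about the same time.

```python
import time

try:
    from numba import njit  # shorthand for jit(nopython=True)
except ImportError:
    # Assumption for this sketch: run as plain Python if numba is missing.
    def njit(func):
        return func


@njit
def my_big_loop(N):
    accumulator = 0
    for i in range(N):
        accumulator = accumulator + i
    return accumulator


start = time.perf_counter()
first = my_big_loop(1_000_000)   # with numba, this call includes compilation
first_call = time.perf_counter() - start

start = time.perf_counter()
second = my_big_loop(1_000_000)  # later calls reuse the compiled machine code
second_call = time.perf_counter() - start

print(f"first call: {first_call:.4f} s, second call: {second_call:.6f} s")
```

This is also why numba tends to pay off for functions you call many times or that run for a long time, rather than for something quick that runs once.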

Type Stability#

If you want numba to work well, there is one overriding rule: your code must be “type-stable”. Type stability means that conditional on the types of the input arguments, the types of all variables in your code must be fully predictable.

For example, consider the following code:

def my_doubler(x):
    y = x * 2
    return y

If x is an integer, then we know that doubling it will also generate an integer, so the type of y will always be “integer”.

Similarly, if x is a float, then we know that doubling it will also generate a float, so the type of y will always be “float”.

In other words, conditional on the type of the input arguments, this function is type stable.

But now consider the following code:

def is_this_even(x):
    if x % 2 == 0:
        y = x
    else:
        y = "not even, sorry!"
    return y

This code is not type stable because the fact that x is an integer does not guarantee the type of y. If x is an even integer, then y will also be an integer; and if x is an odd integer, then y will be a string. As a result, numba can’t predict the type of y in advance of running the code, and so can’t be as efficient.
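The difference between the two functions can be checked in plain Python, before numba is involved at all. The sketch below simply calls each function with two integers and prints the types that come back.

```python
def my_doubler(x):
    y = x * 2
    return y


def is_this_even(x):
    if x % 2 == 0:
        y = x
    else:
        y = "not even, sorry!"
    return y


# Same input type, same output type: type-stable.
print(type(my_doubler(2)).__name__, type(my_doubler(3)).__name__)      # int int
# Same input type, different output types: not type-stable.
print(type(is_this_even(2)).__name__, type(is_this_even(3)).__name__)  # int str
```

In nopython mode this mismatch is reported as an error rather than just a slowdown, as the traceback that follows shows.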

@jit(nopython=True)
def is_this_even(x):
    if x % 2 == 0:
        y = x
    else:
        y = "not even, sorry!"
    return y
is_this_even(3)
---------------------------------------------------------------------------
TypingError                               Traceback (most recent call last)
<ipython-input-16-d6ab8c8f87e7> in <module>
----> 1 is_this_even(3)

~/miniconda3/lib/python3.7/site-packages/numba/dispatcher.py in _compile_for_args(self, *args, **kws)
    374                 e.patch_message(msg)
    375 
--> 376             error_rewrite(e, 'typing')
    377         except errors.UnsupportedError as e:
    378             # Something unsupported is present in the user code, add help info

~/miniconda3/lib/python3.7/site-packages/numba/dispatcher.py in error_rewrite(e, issue_type)
    341                 raise e
    342             else:
--> 343                 reraise(type(e), e, None)
    344 
    345         argtypes = []

~/miniconda3/lib/python3.7/site-packages/numba/six.py in reraise(tp, value, tb)
    656             value = tp()
    657         if value.__traceback__ is not tb:
--> 658             raise value.with_traceback(tb)
    659         raise value
    660 

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Cannot unify int64 and Literal[str](not even, sorry!) for 'y', defined at <ipython-input-15-9cd2f2682352> (4)

File "<ipython-input-15-9cd2f2682352>", line 4:
def is_this_even(x):
    <source elided>
    if x % 2 == 0:
        y = x
        ^

[1] During: typing of assignment at <ipython-input-15-9cd2f2682352> (6)

File "<ipython-input-15-9cd2f2682352>", line 6:
def is_this_even(x):
    <source elided>
    else:
        y = 'not even, sorry!'
        ^

This is not usually a problem with Numba itself but instead often caused by
the use of unsupported features or an issue in resolving types.

To see Python/NumPy features supported by the latest release of Numba visit:
http://numba.pydata.org/numba-doc/latest/reference/pysupported.html
and
http://numba.pydata.org/numba-doc/latest/reference/numpysupported.html

For more information about typing errors and how to debug them visit:
http://numba.pydata.org/numba-doc/latest/user/troubleshoot.html#my-code-doesn-t-compile

If you think your code should work with Numba, please report the error message
and traceback, along with a minimal reproducer at:
https://github.com/numba/numba/issues/new
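One way to avoid this error is to make every branch assign the same type. The rewrite below is a hypothetical variant (the name is_this_even_stable and the -1 sentinel are my own choices) in which y is an integer on every path. It is shown as plain Python; I would expect numba to accept it in nopython mode because both branches are integers, but that is not tested here.

```python
def is_this_even_stable(x):
    # Signal "odd" with the integer -1 instead of a string,
    # so y has the same type whichever branch runs.
    if x % 2 == 0:
        y = x
    else:
        y = -1
    return y


print(is_this_even_stable(4))  # 4
print(is_this_even_stable(3))  # -1
```

Since -1 is itself odd, an even input can never produce it, so the sentinel stays unambiguous.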

Use Modin#

If the code you want to speed up is pandas code, consider trying modin. It’s designed as a drop-in replacement for pandas (you literally just write import modin.pandas as pd instead of import pandas as pd), and it will give you a parallelized implementation of your code. Basically it’s a really easy-to-use wrapper for some other distributed libraries (especially dask, which we’ll cover soon), and I hear it’s quite nice.

Use Julia#

I would be remiss at this point to not mention one other option for getting more speed: use the programming language Julia. Julia is a very new language that has syntax that is very similar to Python, but which runs tens or hundreds of times faster out of the box. Basically, it’s kinda like an entire language built around the technology also used by numba, but where numba is kind of finicky because it’s been tacked on to a language that was never built for speed, Julia was designed from the ground up for speed.

If you want to know why I love Julia, you can find a talk I gave on it here. It’s a little old (I refer to Julia 1.0 not being out yet, but Julia’s up to 1.2 now), but the core arguments all still apply.

To be clear, I wouldn’t recommend jumping languages if you just have one function you need to speed up, but if you’re doing work that causes you to have performance issues regularly, consider Julia.

(Note: Julia also reaches peak speed when code is type stable, but is also super fast even if your code isn’t type stable.)

Parallelization#

If you’ve done all this and your code is still too slow, it’s time to look into parallelization, which we’ll be doing soon!