Prickly Pythons: speeding up numerical code in Python

By Derek Mendez (damende3@asu.edu)

Contents

Intro / method 1

Method 2: Embarrassingly parallel

Profiling

Numpy approach

Method 3: numpy

Method 4: numexpr

OpenCL

Method 5: pyopencl

Small data speed tests

Large data speed tests

Setting up OpenCL and pyopencl example

Note: If you manage to optimize the GPU kernel at the end of this document, please post an answer to my code-review question here: https://codereview.stackexchange.com/q/159748/78230!

Intro / Method 1

Evaluate this:

\[ I(\vec{q}\,) = \Big|\, \sum_j e^{\,i\,\vec{q}\cdot \vec{r}_j} \,\Big|^2 = \big|\, A(\vec{q}\,) \,\big|^2 \] for a given set of (not necessarily gridded) vectors \(\vec{q}\), where the \(\vec{r}_j\) are the atom position vectors.

First, generate the test data (random numbers)

import numpy as np
# pick a random set of qx,qy,qz (10,000 q vectors)
q_vecs = np.random.random((10000, 3))
# pick a random set of ax,ay,az (100 atom vectors)
atom_vecs = np.random.random((100, 3))

Now, compute.

Below is a naive implementation to compute \(I(\vec{q}\,)\)

# how to calculate A(q) ??
# let's try the naive way first...
def method1(q_vecs, atom_vecs):
    Nq = q_vecs.shape[0]
    ampsR = np.zeros( Nq ) 
    ampsI = np.zeros( Nq )
    for i_q, q in enumerate( q_vecs):
        qx,qy,qz = q
        for i_atom, atom in enumerate( atom_vecs):
            ax,ay,az = atom
            phase = qx*ax + qy*ay + qz*az
            ampsR[i_q] += np.cos( -phase)
            ampsI[i_q] += np.sin( -phase)
    I = ampsR**2 + ampsI**2 
    return I

Method 2: Embarrassingly parallel

One way to speed up any Python code is to use multiple processors. Here is an embarrassingly parallel version of method1, built on joblib:

from joblib import Parallel, delayed
def method2(q_vecs, atom_vecs, my_method, n_jobs=4,):
    q_vecs_split = np.array_split(q_vecs, n_jobs)
    results = Parallel(n_jobs=n_jobs)(
        delayed(my_method)(qs,atom_vecs) 
        for qs in q_vecs_split)
    return np.concatenate(results,0) 
    

One would execute e.g. method2(q_vecs, atom_vecs, my_method=method1, n_jobs=4) for a modest speedup. This approach can be really useful for RAM-intensive computations that run for days.
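
As a rough comparison (a sketch only; it assumes q_vecs, atom_vecs, method1, and method2 from above are defined, and the timings will vary by machine), one can time the two approaches with time.time:

import time

# time the serial loop
t0 = time.time()
I1 = method1(q_vecs, atom_vecs)
t1 = time.time()

# time the parallel version with 4 worker processes
I2 = method2(q_vecs, atom_vecs, my_method=method1, n_jobs=4)
t2 = time.time()

print("method1: %.2f s   method2: %.2f s" % (t1 - t0, t2 - t1))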


Profiling

We can profile the code to see where the time is spent. For this you can try line_profiler (https://github.com/rkern/line_profiler). Once it is installed, you can profile a script as follows. Create a file named profileme.py:

import numpy as np
q_vecs = np.random.random((10000, 3))
atom_vecs = np.random.random((100, 3))
 
@profile  # this bit is important!
def method1(q_vecs, atom_vecs):
    Nq = q_vecs.shape[0]
    ampsR = np.zeros( Nq ) 
    ampsI = np.zeros( Nq )
    for i_q, q in enumerate( q_vecs):
        qx,qy,qz = q
        for i_atom, atom in enumerate( atom_vecs):
            ax,ay,az = atom
            phase = qx*ax + qy*ay + qz*az
            ampsR[i_q] += np.cos( -phase)
            ampsI[i_q] += np.sin( -phase)
    I = ampsR**2 + ampsI**2 
    return I

method1(q_vecs, atom_vecs)

Now in the terminal run kernprof -l profileme.py, which will generate a file called profileme.py.lprof. To view it, execute python -m line_profiler profileme.py.lprof, which prints:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    14                                           @profile
    15                                           def method1(q_vecs, atom_vecs):
    16         1            5      5.0      0.0      Nq = q_vecs.shape[0]
    17         1           36     36.0      0.0      ampsR = np.zeros( Nq ) 
    18         1           30     30.0      0.0      ampsI = np.zeros( Nq )
    19     10001         6507      0.7      0.1      for i_q, q in enumerate( q_vecs):
    20     10000        15171      1.5      0.2          qx,qy,qz = q
    21   1010000       638603      0.6      8.7          for i_atom, atom in enumerate( atom_vecs):
    22   1000000      1534379      1.5     20.8              ax,ay,az = atom
    23   1000000       864692      0.9     11.7              phase = qx*ax + qy*ay + qz*az
    24   1000000      2180374      2.2     29.6              ampsR[i_q] += np.cos( -phase)
    25   1000000      2125909      2.1     28.9              ampsI[i_q] += np.sin( -phase)
    26         1          205    205.0      0.0      I = ampsR**2 + ampsI**2 
    27         1            1      1.0      0.0      return I

The np.cos and np.sin calls inside the inner loop dominate (together nearly 60% of the runtime). Let's see how we can speed this up.

Numpy approach


Notice how for each atom_vec and q_vec we calculate a phase term. With 10,000 q vectors and 100 atom vectors, that is 1,000,000 phase terms. We can actually compute all of them in a single line using a matrix product. Let \(Q\) be the Nq x 3 matrix of q_vecs and let \(A\) be the Natom x 3 matrix of atom_vecs. Then the matrix product \(Q\cdot A^T\) gives us the 10000 x 100 matrix of phase terms. Numpy can do this in a single call, and the algorithm will be fairly well optimized, depending on the numpy configuration. You should check your numpy configuration and make sure you see openblas, atlas, or mkl links; if not, consider reinstalling numpy. See https://stackoverflow.com/a/14391693/2077270 for an example of linking openblas to numpy. Note that Anaconda's numpy typically ships with MKL, a set of libraries designed by Intel for speeding up computations on their chips.

import numpy as np
np.show_config()

My output is the following, because I installed openblas prior to setting up numpy:

lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/opt/openblas/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
    runtime_library_dirs = ['/usr/local/opt/openblas/lib']
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/opt/openblas/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
    runtime_library_dirs = ['/usr/local/opt/openblas/lib']
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/opt/openblas/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
    runtime_library_dirs = ['/usr/local/opt/openblas/lib']
blis_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/opt/openblas/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
    runtime_library_dirs = ['/usr/local/opt/openblas/lib']
lapack_mkl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE

Method 3: numpy

Here is a method that makes full use of numpy. Note that this method consumes extra memory, since it stores all the phase terms at once rather than computing them on the fly:

def method3(q_vecs, atom_vecs):
    phases = np.dot( q_vecs, atom_vecs.T) # T is the transpose, hence (Nq x 3) dotted with (3 x Natom)
    # This results in an (Nq x Natom) array
    
    #In the end we want Nq amplitudes (both real and imaginary)
    cosph = np.cos( -phases) 
    sinph = np.sin( -phases)
    ampsR = np.sum( cosph, axis=1) # summing over the atoms axis... 
    ampsI = np.sum( sinph, axis=1)

    #take the mod-squared and return 
    I = ampsR**2 + ampsI**2 
    return I
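
As a quick sanity check (assuming method1 and method3 above are both in scope), the vectorized version should agree with the naive loop to floating-point precision; using a small slice of q_vecs keeps the naive loop fast:

# compare the two implementations on a subset of the q vectors
I_loop  = method1(q_vecs[:200], atom_vecs)
I_numpy = method3(q_vecs[:200], atom_vecs)
print(np.allclose(I_loop, I_numpy))   # expect True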

Method 4: numexpr

Can we make it faster? Let's try numexpr, which uses multi-threading to speed up element-wise numpy operations. See https://stackoverflow.com/a/24498248/2077270 for some discussion of numpy backend speedups vs. numexpr.

import numexpr as ne
def method4(q_vecs, atom_vecs):
    phases = np.dot( q_vecs, atom_vecs.T) # T is the transpose, hence (Nq x 3) dotted with (3 x Natom)
    # This results in an (Nq x Natom) array
    
#   In the end we want Nq amplitudes (both real and imaginary)
    cosph = ne.evaluate('cos( -phases)')
    sinph = ne.evaluate('sin( -phases)')
    ampsR = np.sum( cosph, axis=1) # summing over the atoms axis... 
    ampsI = np.sum( sinph, axis=1)

#   take the mod-squared and return 
    #I = ne.evaluate('ampsR**2 + ampsI**2')
    I = ampsR**2 + ampsI**2 
    return I
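
numexpr picks its thread count automatically, but you can also set it yourself with ne.set_num_threads. A minimal sketch (the best value depends on your core count):

import numexpr as ne

print("cores detected:", ne.detect_number_of_cores())
ne.set_num_threads(4)   # returns the previous thread count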

We might be able to squeeze out some extra speedups, but it's getting increasingly hard.

OpenCL


Profiling method4 reveals that the element-wise trig operations are still the slowest part of the code. This is where the GPU can speed things up. There is an important thing to remember with GPU coding: computations are fast, memory management is slow. Therefore, when coding on a GPU, seek to maximize the computation per processing unit and minimize the memory transfer to and from the GPU device. Coding on a CPU and a GPU simultaneously requires a little more care. One should always be thinking about the following two questions (a quick check for both is shown after the list):

  1. What are my data types (e.g. np.float32)?
  2. Are my arrays contiguous in memory?
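
For example, here is a quick way to check and enforce both on the test data from above (a sketch; q_vecs is assumed to be the array defined earlier):

import numpy as np

# cast to 32-bit floats and force a C-contiguous layout in one call
q32 = np.ascontiguousarray(q_vecs, dtype=np.float32)
print(q32.dtype)                    # float32
print(q32.flags['C_CONTIGUOUS'])    # True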

There are many ways to utilize a GPU from Python; here we will talk about the Python wrapper for OpenCL, known as pyopencl (see https://wiki.tiker.net/PyOpenCL and the installation instructions therein). On a Mac, pyopencl can easily be installed with pip install pyopencl, because OpenCL ships with macOS. On Linux and (I think) Windows you will need to install OpenCL yourself, by installing either CUDA or the AMD APP SDK depending on whether you have an NVIDIA or AMD GPU, respectively. You will then need to locate the OpenCL headers/libs in order to build pyopencl. See the end of this doc for a brief example of setting things up in Ubuntu (for an AMD chip).

OpenCL jargon includes platforms, devices, contexts, command queues, kernels, work items (the individual GPU workers), and host versus device buffers; all of these show up in the code below.

This is how you can create a context and a queue using pyopencl:

import pyopencl as cl
def get_context_queue():
#   list the platforms
    platforms = cl.get_platforms()
    print("Found platforms (will use first listed):", platforms)
#   select the gpu
    my_gpu = platforms[0].get_devices(
        device_type=cl.device_type.GPU)
    assert( my_gpu)
    print("Found GPU(s):", my_gpu)
#   create the context for the gpu, and the corresponding queue
    context = cl.Context(devices=my_gpu)
    queue = cl.CommandQueue(context)
    return context, queue

Executing context, queue = get_context_queue() in a Python terminal on my Mac laptop yielded:

('Found platforms (will use first listed):', [<pyopencl.Platform 'Apple' at 0x7fff0000>])
('Found GPU(s):', [<pyopencl.Device 'Iris Pro' on 'Apple' at 0x1024500>])

Once you have a context and a queue, you are ready to write an OpenCL kernel, which is basically the set of instructions that will be given to each GPU worker unit. Each worker should do roughly the same thing, so try to avoid e.g. conditional statements that lead to workers executing different instructions (I could be wrong on this). Below I will write the new method which uses pyopencl, but first I want to discuss the flow of things. Our test data q_vecs and atom_vecs are np.ndarray objects that live on the CPU. We want to transfer those data to the GPU, which we do in pyopencl using pyopencl.array. "Host buffer" refers to the numpy arrays and "device buffer" refers to the GPU arrays. This is why it's important to be explicit about data types, as well as to work with contiguous memory objects.
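
Before diving into the full method, here is a minimal host-to-device round trip with pyopencl.array (a sketch, assuming context and queue come from get_context_queue above):

import numpy as np
import pyopencl.array as clarray

host_arr = np.arange(10, dtype=np.float32)    # host buffer (a numpy array)
dev_arr = clarray.to_device(queue, host_arr)  # copy it to a device buffer
print(dev_arr.get())                          # .get() copies back to the host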

Method 5: pyopencl

Here is an OpenCL example:

import pyopencl as cl
import pyopencl.array as clarray
def method5( q_vecs, atom_vecs, context, queue):
    Nato = atom_vecs.shape[0]
    Nq = q_vecs.shape[0]
    # these are the output host buffers; they will be transferred to the device,
    # updated on the device, and then copied back to the host
    ampsR = np.ascontiguousarray(np.zeros( Nq, dtype=np.float32) )
    ampsI = np.ascontiguousarray(np.zeros( Nq, dtype=np.float32) )
    # OpenCL C source code , this is the instructions to the GPU workers
    # note this code is essentially the inner-most loop of method 1, such that
    # each worker is doing a single iteration of the outer-loop of method 1.
    # this allows for insane speedups, without the need to use excess memory
    # as in the numpy case (although GPUs have limited memory)
    kernel = """ 
    __kernel void sim_amps(__global float* q_vecs,
                            __global float* atom_vecs,
                             __global float* ampsR,
                             __global float* ampsI,
                             int Nq, int Natoms){
    //  this is the unique ID of each worker, and each worker will be loading a single q vec
        int g_i = get_global_id(0);
        float qx,qy,qz,ax,ay,az, cph, sph, phase;
    //  we pass 1D arrays to openCL, in row-major order
        qx = q_vecs[g_i*3];
        qy = q_vecs[g_i*3+1];
        qz = q_vecs[g_i*3+2];
        for(int i =0; i < Natoms; i++){
            ax = atom_vecs[i*3];
            ay = atom_vecs[i*3+1];
            az = atom_vecs[i*3+2];
            phase = ax*qx + ay*qy + az*qz;
            cph = native_cos(phase); // native openCL trig functions
            sph = native_sin(phase);
            ampsR[g_i] += cph;
            ampsI[g_i] += sph;
            
        }
    }
    """
    #   setup opencl, compile bugs will show up here
    program = cl.Program(context, kernel).build()

#   move host arrays to the GPU device; q_vecs and atom_vecs are forced to be
#   contiguous float32 arrays, while ampsR and ampsI are already contiguous
    qs_dev = clarray.to_device(queue, np.ascontiguousarray(q_vecs.astype(np.float32)))
    atoms_dev = clarray.to_device(queue, np.ascontiguousarray(atom_vecs.astype(np.float32)))
    ampsR_dev = clarray.to_device(queue, ampsR)
    ampsI_dev = clarray.to_device(queue, ampsI)

#   specify scalar args (just tell openCL which kernel args are scalar)
    program.sim_amps.set_scalar_arg_dtypes(
            [None, None, None,None,np.int32, np.int32])
#   run the kernel
#   note there are 3 pre-arguments to our kernel, these are the queue, 
#   the total number of workers, and the desired worker-group size. 
#   Leaving worker-group size as None lets openCL decide a value (I think)
    program.sim_amps(queue, (Nq,), None, qs_dev.data, atoms_dev.data, 
        ampsR_dev.data, ampsI_dev.data, np.int32(Nq), np.int32(Nato))

#   transfer data from device back to host
#    you can try to optimize enqueue_copy by passing different flags 
    cl.enqueue_copy(queue, ampsR, ampsR_dev.data)
    cl.enqueue_copy(queue, ampsI, ampsI_dev.data)
    
    I = ampsR**2 + ampsI**2
    return I

Running the above method shows little to no improvement over method4, but that is because q_vecs and atom_vecs are relatively small. If we make them much larger, e.g. 1e6 q_vecs and 1e4 atom_vecs, then we will notice that the GPU version kicks some serious butt.
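
A sketch of how one might set up that larger test (sizes as in the previous sentence; context and queue come from get_context_queue, and timings will of course vary with the GPU):

import time

q_vecs_big = np.random.random((1000000, 3))
atom_vecs_big = np.random.random((10000, 3))

t0 = time.time()
I_big = method5(q_vecs_big, atom_vecs_big, context, queue)
print("method5 on the large data: %.2f s" % (time.time() - t0))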

If you manage to optimize the above kernel, please post an answer to my code-review question here: https://codereview.stackexchange.com/q/159748/78230.

Small data speed tests

Below are timing results of the different methods for a small data set of 10,000 q-vectors and 100 atom vectors:

On a MacBook Pro (16 GB RAM / El Capitan / Intel i7 / Iris Pro):

method                            timing result
--------------------------------  -------------
method 1                          5.41 s
method 2 (method 1 / n_jobs=4)    1.37 s
method 3                          30.9 ms
method 4                          10.6 ms
method 5                          9.38 ms

And on a Dell server (64 GB RAM / Scientific Linux / 16 cores / Tesla K40m):

method                            timing result
--------------------------------  -------------
method 1                          3.04 s
method 2 (method 1 / n_jobs=8)    569 ms
method 3                          47.6 ms
method 4                          20 ms
method 5                          7.44 ms

Large data speed tests

Below are timing results of the different methods for a large data set of 1,000,000 q-vectors and 1,000 atom vectors:

On the Dell server (64 GB RAM / Scientific Linux / 16 cores / Tesla K40m):

method                            timing result
--------------------------------  -------------
method 1                          too impatient to test
method 2 (method 1 / n_jobs=8)    8 min 23 s
method 3                          29.1 s
method 4                          6.91 s
method 5                          650 ms

As the data gets bigger the GPU wins by far, and eventually, if the data gets sufficiently large, methods 3 and 4 will require too much memory to complete.
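
To see why, consider the intermediate (Nq x Natom) arrays that methods 3 and 4 allocate for the large test case; a back-of-the-envelope estimate:

Nq, Natom = 1000000, 1000
bytes_per_float64 = 8
gb = Nq * Natom * bytes_per_float64 / 1e9
print(gb)   # 8.0 GB for the phases alone; cosph and sinph add another ~16 GB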

Setting up OpenCL for an AMD chip in Ubuntu server

On my Linux machine I first installed the AMD APP SDK (I was using an AMD Radeon HD 5770 in an old Mac Pro tower from 2010, running Ubuntu 14 server). Before that, I installed the proprietary GPU drivers; for the Radeon I followed the instructions at https://help.ubuntu.com/community/BinaryDriverHowto/AMD#InstallationviatheUbunturepositories (section 4). Don't worry if fglrx doesn't work, especially if you are running in server mode.

Installing the SDK is straightforward; just download the installer and run it, e.g.:

chmod ugo+x AMD-APP-SDK-v3.0.130.136-GA-linux64.sh
sudo ./AMD-APP-SDK-v3.0.130.136-GA-linux64.sh

For me, this installed the sdk into

/opt/AMDAPPSDK-3.0/lib/x86_64/sdk

Now we want to check that OpenCL installed correctly and that it can locate the GPU. Do this by executing the clinfo program:

/opt/AMDAPPSDK-3.0/bin/x86_64/clinfo | less    # piping to less because the output is long

Now you should see your GPU listed as a hardware device, as well as the CPU.

Note: if you have already installed CUDA or the SDK, you can try searching for the clinfo binary in order to help locate the libs, as the libs will generally live nearby. Try e.g. find /usr -name clinfo or find /opt -name clinfo, stuff like that. You might also try sudo dpkg -L libOpenCL.

Next, prior to installing pyopencl, I installed the following using pip (some might not be needed):

pip install mako
pip install pytools
pip install pytest
pip install decorator
pip install cffi
pip install appdirs
pip install six

From there I was ready to install pyopencl (see https://wiki.tiker.net/PyOpenCL/Installation/Linux for additional details). I grabbed a recent pyopencl source release (pyopencl-2016.2.1.tar.gz). After unpacking it and navigating into the unpacked folder, run:

python configure.py --cl-inc-dir=/opt/AMDAPPSDK-3.0/include --cl-lib-dir=/opt/AMDAPPSDK-3.0/lib/x86_64/sdk --cl-libname=OpenCL
su -c "make install"

The --cl-inc-dir should contain a folder called CL/, which itself contains the header files cl.h and opencl.h, amongst others. The --cl-lib-dir should contain the library, e.g. libOpenCL.so, and hence the libname in this case is OpenCL.

If it worked you should be able to run

python -c "import pyopencl"

without any error messages. Now that you are done, you can begin using pyopencl.
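
As a final sanity check (assuming the install succeeded), you can ask pyopencl to list the platforms it sees; your GPU's platform should appear:

import pyopencl as cl
print(cl.get_platforms())   # e.g. the AMD platform on the setup described above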