Basic NumPy Refresher

May, 2014

Chang Y. Chung

Introduction

  • NumPy, Numerical Python, is the package for scientific computing and data analysis, implementing the multidimensional array object, ndarray.
  • Elements of an ndarray are homogeneous (all of the same dtype) and are indexed by a tuple of positive integers.
  • Dimensions are called axes. The number of axes is rank.
In [1]:
import numpy as np

a = np.array([[0, 1, 2], [3, 4, 5]])   # a 2D, 2 x 3, array
a
Out[1]:
array([[0, 1, 2],
       [3, 4, 5]])
In [2]:
a.dtype
Out[2]:
dtype('int64')
In [3]:
a.shape
Out[3]:
(2, 3)

Get and set an element

Get a (element) value via integer index.

In [4]:
a = np.array([[0, 1, 2], [3, 4, 5]])
a[0, 2]
Out[4]:
2

Set an element.

In [5]:
a[0, 2] = 99
a                  
Out[5]:
array([[ 0,  1, 99],
       [ 3,  4,  5]])

Why NumPy ndarray?

Because lists are not good for numerical calculations.

Multiplication just repeats.

In [6]:
L1 = [2, 3]
2 * L1
Out[6]:
[2, 3, 2, 3]

Addition concatenates.

In [7]:
L1 = [2, 3]
L2 = [5, 6]
L1 + L2
Out[7]:
[2, 3, 5, 6]

NumPy ndarray provides element-wise operations.

In [8]:
x = np.array([2, 3])   # error if (2, 3) 
2 * x
Out[8]:
array([4, 6])
In [9]:
y = np.array([5, 6])
x + y
Out[9]:
array([7, 9])

Familiar mathematical functions (ufunc's) operate element-wise, as well.

In [10]:
x = np.array([1.0, 2.0, 3.0])
np.exp(x), np.sqrt(x), np.log(x)
Out[10]:
(array([  2.71828183,   7.3890561 ,  20.08553692]),
 array([ 1.        ,  1.41421356,  1.73205081]),
 array([ 0.        ,  0.69314718,  1.09861229]))

See the list of available ufunc's here.

NumPy array creation

In [11]:
a = np.array([2, 3, 4], dtype=float)  # from list
a, a.dtype, a.shape            
Out[11]:
(array([ 2.,  3.,  4.]), dtype('float64'), (3,))
In [12]:
b = np.arange(10)   # short-cut for np.array(range(...))
b
Out[12]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Some convenient functions.

In [13]:
Z = np.zeros((2, 3))
Z
Out[13]:
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])
In [14]:
Ones = np.ones((10,)) 
Ones
Out[14]:
array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])
In [15]:
I = np.eye(4)
I
Out[15]:
array([[ 1.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.]])

5 evenly spaced numbers from 0.0 to 2.0, inclusive at both ends.

In [16]:
x = np.linspace(0.0, 2.0, 5)
x
Out[16]:
array([ 0. ,  0.5,  1. ,  1.5,  2. ])
In [17]:
%matplotlib inline
import matplotlib.pyplot as plt

x = np.linspace(0.0, 2.0, 100)
y = np.sin(x * np.pi) + x ** 2 - np.pi
plt.plot(x, y)
Out[17]:
[<matplotlib.lines.Line2D at 0x105e26f10>]

Many unary operations are implemented as methods.

In [18]:
np.random.seed(12345)
a = np.random.random((2, 4))
a
Out[18]:
array([[ 0.92961609,  0.31637555,  0.18391881,  0.20456028],
       [ 0.56772503,  0.5955447 ,  0.96451452,  0.6531771 ]])

Sum of all elements, regardless of shape.

In [19]:
a.sum()
Out[19]:
4.4154320862971987

Sum of each column.

In [20]:
a.sum(axis=0)
Out[20]:
array([ 1.49734112,  0.91192026,  1.14843333,  0.85773738])

Sum of each row.

In [21]:
a.sum(axis=1)
Out[21]:
array([ 1.63447074,  2.78096135])

Changing the shape

In [22]:
c = np.arange(10) 
c
Out[22]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [23]:
c.shape = (5, 2)
c
Out[23]:
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
In [24]:
d = c.T # transpose() also works
d
Out[24]:
array([[0, 2, 4, 6, 8],
       [1, 3, 5, 7, 9]])
In [25]:
f = d.flatten() # ravel() also flattens
f
Out[25]:
array([0, 2, 4, 6, 8, 1, 3, 5, 7, 9])
In [26]:
f = np.array([0, 2, 4, 6, 8, 1, 3, 5, 7, 9])
g = f.reshape(5, 2)
g
Out[26]:
array([[0, 2],
       [4, 6],
       [8, 1],
       [3, 5],
       [7, 9]])

Array elements order is, by default, rightmost index changes the fastest (C-style). This applies both when the array is flattened and is (re)shaped.

Slicing works as expected, but returns a view.

In [27]:
a = np.arange(15).reshape(3, 5)
a
Out[27]:
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

A view of the third row and all columns.

In [28]:
v = a[2, :]
v
Out[28]:
array([10, 11, 12, 13, 14])

Change a value of a view like so:

In [29]:
v[0] = 999
v
Out[29]:
array([999,  11,  12,  13,  14])

You change the underlying array data as well.

In [30]:
a
Out[30]:
array([[  0,   1,   2,   3,   4],
       [  5,   6,   7,   8,   9],
       [999,  11,  12,  13,  14]])

In order to create an independent copy, use copy() method.

In [31]:
b = a.copy()
b[2, 0] = 888
b
Out[31]:
array([[  0,   1,   2,   3,   4],
       [  5,   6,   7,   8,   9],
       [888,  11,  12,  13,  14]])

a remains unchanged.

In [32]:
a 
Out[32]:
array([[  0,   1,   2,   3,   4],
       [  5,   6,   7,   8,   9],
       [999,  11,  12,  13,  14]])

Boolean indexing returns an array and works great for setting some cells to certain values.

In [33]:
b = a.copy()
idx = np.logical_or(b == 9, b == 999)
idx
Out[33]:
array([[False, False, False, False, False],
       [False, False, False, False,  True],
       [ True, False, False, False, False]], dtype=bool)
In [34]:
b[idx] = 0
b
Out[34]:
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  0],
       [ 0, 11, 12, 13, 14]])

Selecting the first and the third rows.

In [35]:
idx = np.array([True, False, True])
r = b[idx, :]
r
Out[35]:
array([[ 0,  1,  2,  3,  4],
       [ 0, 11, 12, 13, 14]])

For matrix multiplication, use np.dot.

In [36]:
M = np.arange(6.0).reshape(3,2)
M
Out[36]:
array([[ 0.,  1.],
       [ 2.,  3.],
       [ 4.,  5.]])
In [37]:
N = 2 * M.T
N
Out[37]:
array([[  0.,   4.,   8.],
       [  2.,   6.,  10.]])
In [38]:
M * N
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-38-d0d446ebbecd> in <module>()
----> 1 M * N

ValueError: operands could not be broadcast together with shapes (3,2) (2,3) 
In [39]:
np.dot(M, N)
Out[39]:
array([[  2.,   6.,  10.],
       [  6.,  26.,  46.],
       [ 10.,  46.,  82.]])

One dimensional array is a one dimensional array.

In [40]:
v = np.ones(2.,)
v
Out[40]:
array([ 1.,  1.])
In [41]:
np.all(v == v.T)
Out[41]:
True
In [42]:
M, v
Out[42]:
(array([[ 0.,  1.],
        [ 2.,  3.],
        [ 4.,  5.]]), array([ 1.,  1.]))
In [43]:
np.dot(M, v)
Out[43]:
array([ 1.,  5.,  9.])

Reference for all the NumPy methods and routines found at the reference page

An Example: Random Walks

This example is mostly based on a section in McKinney (2012, Ch.4), except that a step can be taken one of the three, not two, directions. All the code has been re-written.

Let's simulate a simple random walk.

Let \(t = 0, 1, \ldots\) denote (discrete) time. Starting from the initial position of 0, a step is taken randomly either to a positive direction (+1), to a neutral direction (0), or to a negative direction (-1) at each time. A \(walk\) is a list of positions indexed by time, \(t\).

Without NumPy

The function, randint(a, b), returns an integer from the range a to b, inclusive at both ends.

In [44]:
import random

rnd = random.Random(1234)
T = range(21)            
walk = []

for t in T:              
    if t == 0:  
        pos = 0
    else:
        step = rnd.randint(0, 2) - 1   
        pos += step
    walk.append(pos)

print "walk = ", walk
plt.plot(T, walk)                              
walk =  [0, 1, 1, 0, 1, 2, 2, 3, 2, 3, 2, 1, 2, 2, 2, 2, 1, 0, -1, -2, -2]

Out[44]:
[<matplotlib.lines.Line2D at 0x106184210>]

Using np.cumsum()

Notice that each element in the \(walk\) is simply a cumulative sum of the preceeding (and the current) steps. That is, letting \(Step_t\) be the random step taken at time \(t\), then:

\[ \begin{align} walk[t] & = & walk[t-1] + Step_t \\ & = & walk[t-2] + Step_{t-1} + Step_t \\ & \ldots & \\ & = & walk[0] + Step_1 + \ldots + Step_{t-1} + Step_t \end{align} \]

Let \(walk[0] = Step_0 = 0\) for simplicity, then we define a \(walk\) with cummulative sums of steps.

Conveniently, NumPy ndarray provides a method, cumsum(). Let's take advantage of the method and simplify the simulation as so:

The NumPy randint(low, high) returns an integer from low (inclusive) to high (exclusive).

In [45]:
np.random.seed(1234) 
T = range(21)        

steps = np.random.randint(0, 3, size=len(T)) - 1 
steps[0] = 0
walk  = steps.cumsum()

print walk
plt.plot(T, walk)
[ 0  0 -1 -2 -3 -3 -3 -3 -2 -1  0 -1 -2 -1  0  1  0 -1 -2 -2 -3]

Out[45]:
[<matplotlib.lines.Line2D at 0x1061423d0>]

Notice that we are using a different random number generator. Even with the same random seed, we get a different sequence of random numbers.

First Crossing Time

Say we would like to know how long it took a longer random walk to get at least 10 steps away from the origin, 0, in either direction.

NumPy ndarray's argmax() function returns the first index of the maximum value. We are re-using the above code, but are setting the arguments so that we are simulating a longer walk.

In [46]:
np.random.seed(1235)
T = range(101)

steps = np.random.randint(0, 3, size=len(T)) - 1
steps[0] = 0
walk  = steps.cumsum()

faraway = (np.abs(walk) >= 10) 
cross = faraway.argmax() 

print cross 
plt.plot(T, walk)
plt.axhline(y = 10.0, ls='dashed')
plt.axvline(x = cross, color='red')
63

Out[46]:
<matplotlib.lines.Line2D at 0x106178250>

Simulating Many Random Walks at Once

Usually we would like to simulate repeatedly.

In our context, this means generating multiple walks. NumPy ndarray can have multiple axes or dimensions. We just introduce the second axis representing the repetition.

In [47]:
np.random.seed(1235) 
T = xrange(1000)
num_walks = 5000                              

steps = np.random.randint(0, 3, size=(num_walks, len(T))) - 1
steps[:, 0] = 0
walks = steps.cumsum(1)

import time
from IPython.display import display, clear_output 

f, ax = plt.subplots()
for i in xrange(0, num_walks, 160):   
    time.sleep(0.2)
    ax.plot(T, walks[i, :], color='darkgrey')
    clear_output(wait=True)
    display(f)
plt.close()

Quiz.

Simulate 5,000 random walks, each of which runs from the position of 0 at \(t == 0\) to \(t == 700\). For each walk, find out the "crossing time" to +30 or -30, that is the first time when the position reaches either +30 or -30. Calculate the mean crossing time over all 3,000 random walks. Notice that not all the walks may reach \(\pm 30\). In that case, use only those walks that ever reach \(\pm 30\) within the given time \(t <= 700\). Make sure to use a random seed so that the simulation can be replicated exactly.

In [48]:
# an answer
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(123)
T = xrange(700)
num_walks = 5000                              

steps = np.random.randint(0, 3, size=(num_walks, len(T))) - 1
steps[:, 0] = 0
walks = steps.cumsum(1)

# filter, get crossing times, and print the mean crossing time
hits30 = (np.abs(walks) >= 30).any(1)
xtimes = (np.abs(walks[hits30]) >= 30).argmax(1)
print "mean crossing time (for walks that reach +30 or -30): ", xtimes.mean()
mean crossing time (for walks that reach +30 or -30):  443.027027027

References and Resources

  • Bressert, Eli (2012) SciPy and NumPy: An Overview for Developers O'Reilly. ISBN:149305466. at Amazon
  • McKinney, Wes (2012) Chapter 4. "NumPy Basics: Arrays and Vectorized Computation" in Python for Data Analysis:Data Wrangling with Pandas, NumPy, and IPython O'Reilly. ISBN:1449319793. at Amazon
  • NumPy Reference at NumPy site
  • NumPy User Guide at NumPy site
  • Tentative Tutorial at SciPy site