Data Management with pandas (Python) 1

May, 2014

Chang Y. Chung

Overview

Pandas provide data structures that are flexible containers for lower dimensional data. Two main objects are Series and DataFrame:

  • Series: 1D labeled NumPy ndarray
  • DataFrame: 2D labeled, table of potentially heterogeneously-typed Series

Also, there are TimeSeries (Series indexed by datetimes) and Panel (3D labeled table of DataFrames).

Series is a NumPy ndarray with an axis index and name.

This is an ndarray.

In [1]:
import numpy as np
import pandas as pd

data = np.array([23, 31, 2, 3])
data
Out[1]:
array([23, 31,  2,  3])

This is an ndarray, super-charged.

In [2]:
s = pd.Series(data, index=['a', 'b', 'c', 'd'], name='mySeries')
s
Out[2]:
a    23
b    31
c     2
d     3
Name: mySeries, dtype: int64
In [3]:
s.index
Out[3]:
Index([u'a', u'b', u'c', u'd'], dtype='object')

Series is like ndarray

Get by an integer index:

In [4]:
s[0]
Out[4]:
23

Slicing returns a view:

In [5]:
s3 = s[:3]
s3[0] = 999
s
Out[5]:
a    999
b     31
c      2
d      3
Name: mySeries, dtype: int64
In [6]:
s
Out[6]:
a    999
b     31
c      2
d      3
Name: mySeries, dtype: int64

Boolean indexing works, as well.

In [7]:
above_median = s[s > s.median()]
above_median
Out[7]:
a    999
b     31
Name: mySeries, dtype: int64

Mathematical operations and functions are vectorized.

In [8]:
s ** 0.5
Out[8]:
a    31.606961
b     5.567764
c     1.414214
d     1.732051
Name: mySeries, dtype: float64
In [9]:
np.sqrt(s)
Out[9]:
a    31.606961
b     5.567764
c     1.414214
d     1.732051
Name: mySeries, dtype: float64

Series behaves like a dictionary, as well.

You can create a Series from a dictionary.

In [10]:
s = pd.Series({
    'Tom': 0,
    'Mike': 1,
    'Jane': 2,
    'Mary': 3,
    'Claudia': 4
})
s
Out[10]:
Claudia    4
Jane       2
Mary       3
Mike       1
Tom        0
dtype: int64

In-place sort by value. Default is ascending=True.

In [11]:
s.sort()
s
Out[11]:
Tom        0
Mike       1
Jane       2
Mary       3
Claudia    4
dtype: int64
In [12]:
s
Out[12]:
Tom        0
Mike       1
Jane       2
Mary       3
Claudia    4
dtype: int64

sort_index() returns a new Series, sorted by index.

In [13]:
t = s.sort_index()
t
Out[13]:
Claudia    4
Jane       2
Mary       3
Mike       1
Tom        0
dtype: int64

Getting and setting are just like a dictionary.

In [14]:
s['Claudia'] = 23
s
Out[14]:
Tom         0
Mike        1
Jane        2
Mary        3
Claudia    23
dtype: int64
In [15]:
'Tom' in s
Out[15]:
True

Use np.nan for default value for get(). Otherwise, get() will return None, when not found.

In [16]:
s.get('NOBODY', np.nan)
Out[16]:
nan

Series automatically aligns data based on index.

In [17]:
s1 = pd.Series({'a':1, 'b':2, 'c':3})
s1
Out[17]:
a    1
b    2
c    3
dtype: int64
In [18]:
s2 = pd.Series({'b':20, 'c':10, 'd':9})
s2
Out[18]:
b    20
c    10
d     9
dtype: int64

The resulting index is a union of the input indices.

In [19]:
s = s1 + s2
s
Out[19]:
a   NaN
b    22
c    13
d   NaN
dtype: float64

Missing values (np.nan and None) can be dropped easily.

In [20]:
s.dropna()
Out[20]:
b    22
c    13
dtype: float64

Example: Age-Specific Mortality

Following German Rodriguez's nice Stata code available at his web page, let's graph Age-Specific Mortality.

We are going to read a bit newer data on White Population in US, 2009, from CDC's ftp server address found in the publication:

Arias, Elizabeth (2014) "United States Life Tables, 2009" National Vital Statistics Reports Vol. 62, No. 7. Hyattsville, MD: National Center for Health Statistics. Available here.

Download an excel file from CDC's ftp site and write it locally. urllib.urlretrieve() returns a tuple of (local) filename, and the header information, if successful.

In [21]:
import urllib

nchs = r'ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS' 
ftp  = r'{0}/Publications/NVSR/62_07/Table04.xls'.format(nchs)
xls = 'white2009.xls'
urllib.urlretrieve(ftp, xls)
Out[21]:
('white2009.xls', <mimetools.Message instance at 0x10697a8c0>)

pandas.read_excel function can read the excel file

Use zero-based indices for column and row numbers.

In [22]:
xls = 'white2009.xls'
options = {'header': 2, 'parse_cols': [2], 'skiprows': 6, 'skip_footer': 2}
df = pd.read_excel(xls, 'Sheet1', **options)

We then copy only one column (Series) out of the DataFrame returned from read_excel(). Let's see first a few rows.

In [23]:
lx = df['l(x)'].copy()
lx.head()
Out[23]:
0    100000.000000
1     99472.078125
2     99434.945312
3     99409.906250
4     99390.539062
Name: l(x), dtype: float64

And the last a few rows.

In [24]:
lx.tail()
Out[24]:
96     7127.180664
97     5392.454102
98     3973.862549
99     2848.187988
100    1982.758057
Name: l(x), dtype: float64

We do some wrangling (munging, recoding, ...) and graphing

In [25]:
%matplotlib inline

# convert to per-person
lx /= 100000.0

# cummulative hazard
Hx = - np.log(lx)

# shift(-1) brings up the value of the next row
hx = Hx.shift(-1) - Hx               

# take the mid-range value for age
hx.index += 0.5

# finally
hx.plot(logy=True)
Out[25]:
<matplotlib.axes.AxesSubplot at 0x107dc8e90>

Let's see how the Series, hx, looks:

In [26]:
hx.head()
Out[26]:
0.5    0.005293
1.5    0.000373
2.5    0.000252
3.5    0.000195
4.5    0.000148
Name: l(x), dtype: float64
In [27]:
hx.tail()
Out[27]:
96.5     0.278915
97.5     0.305262
98.5     0.333056
99.5     0.362194
100.5         NaN
Name: l(x), dtype: float64

Fit a line for those over 30 years old.

In [28]:
import statsmodels.api as sm

# more munging
loghx = pd.Series(np.log(hx), name='loghx')[30:-1]
am30 = pd.Series(hx.index, index=hx.index, name='am30')[30:-1] - 30.0

# model fit
model = sm.OLS(loghx, sm.add_constant(am30))
result = model.fit()
print result.params
print 'R^2 : {:6.4f}'.format(result.rsquared)
const   -7.308683
am30     0.087695
dtype: float64
R^2 : 0.9949

Finally, we graph.

In [29]:
# predicted value
pred = model.predict(result.params).astype(np.float64, copy=False)
hf = pd.Series(np.exp(pred), index=am30.index, name='hf')

# plot
hx.plot(logy=True)
hf.plot(logy=True)
Out[29]:
<matplotlib.axes.AxesSubplot at 0x107ed1e10>

Quiz

There are other life tables in the same report:

  • Total population
  • Males
  • Females
  • and so on ...

String Methods

Series has the built-in string method equivalents. They, however:

  • are vectorized, so that it can be called with a whole Series;
  • made aware of the missing value (i.e., np.nan); and
  • have names that starts with .str
In [30]:
s = pd.Series(['Aaba', 'Baca', np.nan, 'CcDD'])
s
Out[30]:
0    Aaba
1    Baca
2     NaN
3    CcDD
dtype: object
In [31]:
lowered = s.str.lower()
lowered
Out[31]:
0    aaba
1    baca
2     NaN
3    ccdd
dtype: object

str.replace and str.findall take regular expression, as well!

  • Finding out why Gracie could not medal in Sochi! :-) Notice that the str.findall returns a Series, whose elements are a list.
In [32]:
s = pd.Series(['Adelina', 'Yuna', 'Carolina', 'Gracie'])
ends_with_na = s.str.findall(r'.+na$')
ends_with_na
Out[32]:
0     [Adelina]
1        [Yuna]
2    [Carolina]
3            []
dtype: object
  • str.replace() relies on re.sub().
In [33]:
s
Out[33]:
0     Adelina
1        Yuna
2    Carolina
3      Gracie
dtype: object
In [34]:
s.str.replace(r'(.+na)', r'\g<0> medals')
Out[34]:
0     Adelina medals
1        Yuna medals
2    Carolina medals
3             Gracie
dtype: object

Summary

  • Important data structures in pandas
  • Series has indexed values and with a *name.
  • Series is (like) a NumPy ndarray.
  • Series is (like) a dictionary, as well.
  • Series automatically aligns data based on index.
  • Series has many .str vectorized methods.