Data Management with pandas (Python), Part 2

May 2014

Chang Y. Chung

What is a DataFrame?

The reference documentation defines it:

Two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns)

Here is a longer way to approach it from the perspective of data structures.

  • A NumPy ndarray is an array of elements, all of one dtype, arranged in a shape.
In [103]:
import numpy as np

rid = np.array([0, 1, 2, 3])
rid, rid.dtype, rid.shape
Out[103]:
(array([0, 1, 2, 3]), dtype('int64'), (4,))
  • A pandas Series is an ndarray with an index and a name.
In [104]:
import pandas as pd

s1 = pd.Series(
    ['Abe', 'Babe', 'Colombe', 'Daube'],
    index = rid,
    name='firstname')
s1
Out[104]:
0        Abe
1       Babe
2    Colombe
3      Daube
Name: firstname, dtype: object

Let's create another Series, age.

In [105]:
s2 = pd.Series([23, 20, 22], index=[0, 1, 3], name='age')
s2
Out[105]:
0    23
1    20
3    22
Name: age, dtype: int64
  • A DataFrame is an aligned dictionary of Series.
In [106]:
df = pd.DataFrame({s1.name: s1, s2.name: s2})
df
Out[106]:
   age firstname
0   23       Abe
1   20      Babe
2  NaN   Colombe
3   22     Daube

4 rows × 2 columns

We now have two columns nicely aligned by their matching indices. The age Series did not have a value indexed by 2, so when it was brought into the DataFrame, pandas aligned the indices and inserted the missing value, np.nan, there.
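
To see the alignment explicitly, here is a minimal sketch of what the constructor effectively does, reusing the s1 and s2 defined above: reindexing s2 on s1's index fills in np.nan for the label that s2 lacks.

# label 2 does not exist in s2, so reindexing inserts NaN there
s2.reindex(s1.index)
# 0    23
# 1    20
# 2   NaN
# 3    22
# Name: age, dtype: float64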

  • Important properties of a DataFrame include: index (the row labels, or 'observation ids') and columns (the column labels, or 'variable names').
In [107]:
df.index
Out[107]:
Int64Index([0, 1, 2, 3], dtype='int64')
In [108]:
df.columns
Out[108]:
Index([u'age', u'firstname'], dtype='object')
In [109]:
df.dtypes
Out[109]:
age          float64
firstname     object
dtype: object

A NumPy ndarray or a Series has a dtype. A DataFrame has dtypes, since each column has its own dtype and they can differ. The dtypes attribute is itself a Series.

  • Notice that the age entry in dtypes became float64. This is because pandas does not have a proper missing value for integer types. We will talk about this more later, when we cover missing values; a quick demonstration follows.
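
Here is a small sketch of that promotion: introducing np.nan into an otherwise integer Series forces a float64 dtype.

pd.Series([1, 2, 3]).dtype       # dtype('int64')
pd.Series([1, 2, np.nan]).dtype  # dtype('float64'), promoted because of the NaN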

Quiz

Create a data frame of Age-Specific Fertility Rates (ASFR) for women in the Republic of Korea (South Korea) and the US, for 2000-2005. Age ranges and ASFRs for five-year age groups for both countries are given below. Data are from the UN World Population Prospects 2006.

In [110]:
start = pd.Series(range(15, 46, 5))
finish = start + 4
age_range = ['{0}-{1}'.format(s, f) for s, f in zip(start, finish)]
age_range
Out[110]:
['15-19', '20-24', '25-29', '30-34', '35-39', '40-44', '45-49']
In [111]:
korea = [3.2, 33.1, 119.9, 72.8, 16.8, 2.5, 0.2]
usa = [43.7, 104.5, 114.5, 93.5, 42.8, 8.5, 0.5]
In [112]:
# create a DataFrame here.
df = pd.DataFrame({'korea': korea, 'usa':usa}, index=age_range)
df
Out[112]:
       korea    usa
15-19    3.2   43.7
20-24   33.1  104.5
25-29  119.9  114.5
30-34   72.8   93.5
35-39   16.8   42.8
40-44    2.5    8.5
45-49    0.2    0.5

7 rows × 2 columns
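
As a quick check on the answer, the idxmax() method returns the index label of the maximum in each column, i.e. the peak-fertility age range in each country.

# age range with the highest ASFR, per country
df.idxmax()
# korea    25-29
# usa      25-29
# dtype: object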

Creating DataFrame and I/O

There are many different ways to create a data frame.

  • The constructor is flexible and takes different things as data arguments, not just a Series:
    • a dictionary of columns
    • a list of dictionaries
    • a list of tuples
    • another DataFrame
    • ...

Below is an example of the DataFrame constructor taking in a list of tuples.

In [113]:
df2 = pd.DataFrame([
  ('SNSD', 'female', 9),
  ('Big Bang', 'male', 5)
], columns=['band', 'gender', 'size'])
df2
Out[113]:
       band  gender  size
0      SNSD  female     9
1  Big Bang    male     5

2 rows × 3 columns
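
For comparison, here is a sketch of the constructor taking a list of dictionaries instead; a key that is missing from one of the dictionaries simply becomes NaN in that row.

pd.DataFrame([
    {'band': 'SNSD', 'gender': 'female', 'size': 9},
    {'band': 'Big Bang', 'gender': 'male'}  # no 'size' key, so NaN appears
])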

  • There are many ways to create a DataFrame, since pandas has many importing functions that return a DataFrame.
data in this format    import function
-------------------    ---------------
pickled object         read_pickle
delimited text         read_table
CSV                    read_csv
fixed-width text       read_fwf
clipboard              read_clipboard
Excel                  read_excel
JSON string            read_json
HTML table             read_html
HDFStore (HDF5)        read_hdf
SQL database           read_sql
Google BigQuery        read_gbq
Stata                  read_stata
  • DataFrame also has many corresponding methods that go the other way, that is, to serialize itself (write itself out) in various formats, including to_pickle, to_csv, to_hdf, to_excel, to_json, to_stata, to_clipboard, ... A round-trip sketch follows.
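
For instance, a to_csv / read_csv round trip might look like this sketch (the file name df2.csv is just an example):

# write df2 out as CSV without the row labels, then read it back
df2.to_csv('df2.csv', index=False)
df2_again = pd.read_csv('df2.csv')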

I/O into and from a DataFrame

Here is an example of:

1. Importing a Stata data file (auto.dta) into a pandas DataFrame
2. Saving only the foreign rows into a JSON file (foreign.json); and then
3. Reading it back into a DataFrame (df_foreign)
In [114]:
auto = pd.read_stata('auto.dta')
auto.head(3)
Out[114]:
          make  price  mpg  rep78  headroom  trunk  weight  length  turn  displacement  gear_ratio   foreign
0  AMC Concord   4099   22      3       2.5     11    2930     186    40           121        3.58  Domestic
1    AMC Pacer   4749   17      3       3.0     11    3350     173    40           258        2.53  Domestic
2   AMC Spirit   3799   22    NaN       3.0     12    2640     168    35           121        3.08  Domestic

3 rows × 12 columns

Save the column labels for later use.

In [115]:
cols = auto.columns.copy()
cols
Out[115]:
Index([u'make', u'price', u'mpg', u'rep78', u'headroom', u'trunk', u'weight', u'length', u'turn', u'displacement', u'gear_ratio', u'foreign'], dtype='object')

Select foreign makes only.

In [116]:
foreign = auto[auto.foreign == 'Foreign']
foreign.head(3)
Out[116]:
         make  price  mpg  rep78  headroom  trunk  weight  length  turn  displacement  gear_ratio  foreign
52  Audi 5000   9690   17      5       3.0     15    2830     189    37           131        3.20  Foreign
53   Audi Fox   6295   23      3       2.5     11    2070     174    36            97        3.70  Foreign
54   BMW 320i   9735   25      4       2.5     12    2650     177    34           121        3.64  Foreign

3 rows × 12 columns

And then write it out as a JSON string.

In [117]:
foreign.to_json('foreign.json')

Read it back using the read_json function. Writing to JSON changed the order of the columns; we fix that by re-indexing along the second axis (i.e., the columns), and then show the first 3 rows.

In [118]:
df_foreign = pd.read_json('foreign.json')
df_foreign.reindex_axis(cols, axis=1).head(3)
Out[118]:
         make  price  mpg  rep78  headroom  trunk  weight  length  turn  displacement  gear_ratio  foreign
52  Audi 5000   9690   17      5       3.0     15    2830     189    37           131        3.20  Foreign
53   Audi Fox   6295   23      3       2.5     11    2070     174    36            97        3.70  Foreign
54   BMW 320i   9735   25      4       2.5     12    2650     177    34           121        3.64  Foreign

3 rows × 12 columns
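
If preserving the column order matters, one alternative worth trying (assuming your pandas version supports it, as 0.13 does) is the orient='split' JSON format, which stores index, columns, and data separately.

# orient='split' records the column labels explicitly,
# so their order survives the round trip
foreign.to_json('foreign_split.json', orient='split')
pd.read_json('foreign_split.json', orient='split').columns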

Getting and Setting

Once you have your data in a DataFrame, then you may naturally want to retrieve (get) and/or to change (set) the value or values.

Recall that when we get/set a value in a NumPy ndarray, we use a zero-based integer position index.

In [119]:
import numpy as np

a = np.array([
    [0, 1, 2],
    [3, 4, 5]
])
a[1,2] # get
Out[119]:
5
In [120]:
a[1, 2] = 99 # set
a
Out[120]:
array([[ 0,  1,  2],
       [ 3,  4, 99]])

pandas data structures have labels, which can be non-integers. This permits getting/setting via labels, in addition to doing so via integer positions. For instance,

In [121]:
s = pd.Series([0, 1, 2], index=['Abe', 'Betty', 'Caroline'])
s
Out[121]:
Abe         0
Betty       1
Caroline    2
dtype: int64
In [122]:
print 'via label.            s[\'Betty\']=', s['Betty']
print 'via integer position. s[1]=', s[1]
via label.            s['Betty']= 1
via integer position. s[1]= 1

Getting a value works either way. But considering that an index label can itself be an integer, the better practice is to tell pandas explicitly how you intend the key to be understood.

If you mean the key to be:

  • a label (or index), then use .loc;
  • an integer position, then use .iloc.
In [123]:
print 'via label.            s.loc[\'Betty\']=', s.loc['Betty']
print 'via integer position. s.iloc[1]=', s.iloc[1]
via label.            s.loc['Betty']= 1
via integer position. s.iloc[1]= 1

.ix permits both: it tries the key as a label first and, if not found, falls back to an integer position. It is better to avoid it, if possible.

In [124]:
s.ix['Betty'], s.ix[1]
Out[124]:
(1, 1)

This works the same way for a DataFrame, except that it is a 2D object, so we pass a 2-tuple of integer positions to .iloc or a 2-tuple of labels to .loc.

That is, given the following DataFrame.

In [125]:
df = pd.DataFrame([
    (23, 'Abe'),
    (20, 'Babe'),
    (22, 'Daube')
], index=['R1', 'R2', 'R3'], columns=['age', 'firstname'])
df
Out[125]:
    age firstname
R1   23       Abe
R2   20      Babe
R3   22     Daube

3 rows × 2 columns

In [126]:
df.loc['R1', 'age'], df.iloc[0, 0]
Out[126]:
(23, 23)

Setting a value is straightforward as well.

In [127]:
df.loc['R1', 'age'] = 100
df
Out[127]:
    age firstname
R1  100       Abe
R2   20      Babe
R3   22     Daube

3 rows × 2 columns

In [128]:
df.iloc[0,1] = 'Abraham'
df
Out[128]:
    age firstname
R1  100   Abraham
R2   20      Babe
R3   22     Daube

3 rows × 2 columns

.loc can actually do more:

  • Raises a KeyError exception when the label is not found.
  • Can take a list or array of labels, in either dimension or both.
  • Can take a slice with labels, e.g. 'age':'firstname'. Notice that both the start and the stop parts are inclusive!
  • Can take a boolean array (illustrated after the slicing examples below).
In [129]:
df.loc['R1', 'age']
Out[129]:
100
In [130]:
df.loc[['R1', 'R3'], ['age', 'firstname']]
Out[130]:
    age firstname
R1  100   Abraham
R3   22     Daube

2 rows × 2 columns

Remember that the stop part of a label slice is inclusive. That is, the 'R2' row is included below.

In [131]:
df.loc['R1':'R2', 'age':'age']
Out[131]:
    age
R1  100
R2   20

2 rows × 1 columns
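
The boolean-array capability deserves a quick illustration as well; this sketch keeps only the rows where age exceeds 50.

# boolean mask: True for rows where age > 50, which selects only R1 here
df.loc[df['age'] > 50, :]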

.iloc is also powerful:

  • Raises an IndexError exception when the integer position is out of bounds (0 to length - 1).
  • Can take an integer list or array.
  • Can take a slice.

Let's see the df again.

In [132]:
df
Out[132]:
    age firstname
R1  100   Abraham
R2   20      Babe
R3   22     Daube

3 rows × 2 columns

In [133]:
df.iloc[0, 1]
Out[133]:
'Abraham'
In [134]:
df.iloc[[0, 2], [0]]
Out[134]:
    age
R1  100
R3   22

2 rows × 1 columns

Quiz

How many rows does the expression below return? Assume that the DataFrame df has more than three rows.

In [135]:
df.iloc[0:2, :]
Out[135]:
    age firstname
R1  100   Abraham
R2   20      Babe

2 rows × 2 columns
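
Two rows: unlike a label slice with .loc, an integer slice with .iloc excludes the stop position, just as ordinary Python slicing does. Compare:

df.iloc[0:2, :]       # rows at positions 0 and 1; stop position 2 is excluded
df.loc['R1':'R2', :]  # rows 'R1' through 'R2'; the stop label is included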

Specifying the key explicitly with .loc or .iloc is the best practice. For the sake of convenience, however, a simple bracketed syntax is often used.

  • A simple bracketed slice gets/sets the rows.
In [136]:
df2 = df[:2]
df2
Out[136]:
    age firstname
R1  100   Abraham
R2   20      Babe

2 rows × 2 columns

  • A simple bracketed column name (or list of names) gets/sets the columns.
In [137]:
df2 = df[['firstname', 'age']] # col order changed
df2
Out[137]:
   firstname  age
R1   Abraham  100
R2      Babe   20
R3     Daube   22

3 rows × 2 columns

Here is a Pythonic idiom for swapping columns in a DataFrame. In this case, it is rather silly. :-)

In [138]:
df[['firstname', 'age']] = df[['age', 'firstname']]
df
Out[138]:
        age firstname
R1  Abraham       100
R2     Babe        20
R3    Daube        22

3 rows × 2 columns

Quiz

Given a DataFrame df1, create another, df2, which looks like df3.

In [139]:
rows = range(3)
cols = list('ABCDE')
df1 = pd.DataFrame(np.arange(15).reshape(3,5), index=rows, columns=cols)
df1
Out[139]:
    A   B   C   D   E
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14

3 rows × 5 columns

In [140]:
df3 = pd.DataFrame([
    ( 0,  1,  -2, 23),
    (10, 11, -12, 45),
], index=[0,2], columns=list('ABCF'))
df3
Out[140]:
    A   B    C   F
0   0   1   -2  23
2  10  11  -12  45

2 rows × 4 columns

An answer:

In [141]:
df2 = df1.loc[0:2:2, 'A':'C']
df2['C'] *= -1
df2['F'] = [23, 45]
(df2 == df3).all().all()  # element-wise check; all(df2 == df3) would only iterate column labels
Out[141]:
True
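
One caveat: df2 above is derived from a slice of df1, so modifying it in place can raise pandas' SettingWithCopyWarning, and whether df1 is affected is not guaranteed. A safer sketch takes an explicit copy first.

df2 = df1.loc[0:2:2, 'A':'C'].copy()  # .copy() makes df2 independent of df1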

Missing Data

  • Recall that None is Python's universal placeholder object. It evaluates as False in a boolean context, but it compares equal to itself.
In [142]:
None, type(None), None == None
Out[142]:
(None, NoneType, True)
  • np.nan is also a placeholder. It is of float type, representing 'Not a Number', something like log(-1). It always compares False, even against itself, which is its defining characteristic. Use np.isnan() to identify whether a value is np.nan.
In [143]:
np.log(-1), type(np.nan)
Out[143]:
(nan, float)
In [144]:
np.nan < 1.0, np.nan >= 1.0, np.nan == np.nan
Out[144]:
(False, False, False)
In [145]:
np.nan != np.nan, np.isnan(np.nan)
Out[145]:
(True, True)
  • pandas uses np.nan (and None) to represent missing data. They are, by default, excluded from calculations.

pandas automatically promotes an int dtype to float when there is an np.nan in the Series. Similarly, a boolean dtype is cast to object. Here is an example of the former; a sketch of the latter follows the output.

In [146]:
left = pd.DataFrame({'a': [23, 13, 25, 42]}, index=range(4))
right = pd.DataFrame({'b': [11, 11, 11]}, index=[0, 2, 3])

together = pd.merge(left, right, how='left', left_index=True, right_index=True)

print together
print together.dtypes
    a   b
0  23  11
1  13 NaN
2  25  11
3  42  11

[4 rows x 2 columns]
a      int64
b    float64
dtype: object
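
And here is the boolean case sketched: introducing a missing value into a boolean Series casts it to the object dtype.

pd.Series([True, False]).dtype          # dtype('bool')
pd.Series([True, False, np.nan]).dtype  # dtype('O'), i.e. object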

  • pandas provides the isnull() and notnull() functions to identify missing values. They return booleans.
In [147]:
df = pd.DataFrame(np.log(-np.eye(3)))
df
Out[147]:
      0     1     2
0   NaN  -inf  -inf
1  -inf   NaN  -inf
2  -inf  -inf   NaN

3 rows × 3 columns

In [148]:
pd.isnull(df)
Out[148]:
       0      1      2
0   True  False  False
1  False   True  False
2  False  False   True

3 rows × 3 columns

In [149]:
pd.notnull(df)
Out[149]:
       0      1      2
0  False   True   True
1   True  False   True
2   True   True  False

3 rows × 3 columns

  • dropna() removes rows (axis=0) or columns (axis=1) that contain any missing values.
In [150]:
df.iloc[1,1] = 23
df
Out[150]:
      0          1     2
0   NaN       -inf  -inf
1  -inf  23.000000  -inf
2  -inf       -inf   NaN

3 rows × 3 columns

In [151]:
dropped = df.dropna(axis=0)
dropped
Out[151]:
      0   1     2
1  -inf  23  -inf

1 rows × 3 columns
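
With axis=1, the columns containing a NaN are dropped instead, and the thresh parameter gives finer control: keep only the rows (or columns) with at least that many non-missing values. A sketch:

df.dropna(axis=1)    # drops columns 0 and 2 (each contains a NaN), keeping column 1
df.dropna(thresh=3)  # keeps only row 1, the only row with 3 non-missing values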

  • When filling in missing values, fillna() and interpolate() come in handy.
In [152]:
df = pd.DataFrame(np.arange(18.).reshape(6, 3), columns=list('abc'))
df.iloc[2:4, [0, 2]] = np.nan
df.iloc[1, 1] = np.nan
df
Out[152]:
     a    b    c
0    0    1    2
1    3  NaN    5
2  NaN    7  NaN
3  NaN   10  NaN
4   12   13   14
5   15   16   17

6 rows × 3 columns

In [153]:
df['a'] = df['a'].interpolate(method='polynomial', order=3)
df['b'] = df['b'].fillna(999)
df
Out[153]:
    a    b    c
0   0    1    2
1   3  999    5
2   6    7  NaN
3   9   10  NaN
4  12   13   14
5  15   16   17

6 rows × 3 columns
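
fillna() can also propagate values instead of substituting a constant; for example, method='ffill' carries the last valid observation forward. A sketch for the remaining NaNs in column c:

# forward-fill: the NaNs at rows 2 and 3 take the value 5 from row 1
df['c'] = df['c'].fillna(method='ffill')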

Some Data Munging Tools: Append, Concat, Group By

Let's download some of the World Bank's World Development Indicators.

This example is largely based on the "World Bank" section of the pandas 0.13.1 documentation, expanded here to demonstrate more methods and functions.

First, we download a GDP per capita series and a fertility rate series. The search function shows the available series.

In [154]:
from pandas.io import wb

wb.search('fertility').iloc[:, :2]
Out[154]:
                  id                                               name
5917     SP.ADO.TFRT  Adolescent fertility rate (births per 1,000 wo...
5937  SP.DYN.TFRT.IN          Fertility rate, total (births per woman)
5940     SP.DYN.WFRT           Wanted fertility rate (births per woman)

3 rows × 2 columns

Let's choose two series: one for GDP per capita and another for the Total Fertility Rate. We request all available countries and some years.

In [155]:
ind = ['NY.GDP.PCAP.KD', 'SP.DYN.TFRT.IN']
df = wb.download(indicator=ind, country='all', start=1950, end=2014)

Shorten the column labels, and let's look at the DataFrame. It has a MultiIndex (or hierarchical index).

In [156]:
df.columns = ['gdp', 'tfr']
df.head()
Out[156]:
                         gdp       tfr
country    year                       
Arab World 2013          NaN       NaN
           2012  4621.266629  3.233402
           2011  4519.897714  3.266215
           2010  4367.718774  3.297409
           2009  4236.984875  3.325365

5 rows × 2 columns

Before we do anything else, let's drop any rows that have missing values, and convert both columns to numbers.

In [157]:
df = df.dropna()
df = df.convert_objects(convert_numeric=True)
df.to_pickle('df.pkl')
df.dtypes
Out[157]:
gdp    float64
tfr    float64
dtype: object

How many records do we have? We can get summary statistics using the describe() method.

In [158]:
df.describe()
Out[158]:
                 gdp          tfr
count    9132.000000  9132.000000
mean     7917.940221     3.929455
std     12389.688012     1.962027
min        50.042206     0.836000
25%       723.931595     2.090000
50%      2436.573682     3.525500
75%      9498.393125     5.758250
max    117493.640235     8.667000

8 rows × 2 columns

Let's now try a group-by using the MultiIndex we have. Let's aggregate our data at the country level by calculating a mean over years within each country. Amazingly, it is just one line.

In [159]:
country = df.groupby(level=['country']).mean()
print country.describe()
                 gdp         tfr
count     226.000000  226.000000
mean     8536.483605    3.758308
std     13785.155100    1.721368
min       156.179889    1.223333
25%       815.803895    2.136105
50%      2703.505162    3.729304
75%      9309.100281    5.078099
max    108089.599455    7.585679

[8 rows x 2 columns]
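
groupby can also compute several statistics in one pass via agg(); here is a small sketch for tfr.

# mean, min, and max of tfr per country, all at once
df.groupby(level='country')['tfr'].agg(['mean', 'min', 'max']).head(3)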

We save the country data locally, but only after sorting it by gdp.

In [160]:
country.sort(columns=['gdp'], axis=0, inplace=True)
country.to_pickle('country.pkl')

We can graph gdp and tfr together.

In [161]:
country = pd.read_pickle('country.pkl')
country['log_gdp'] = np.log(country['gdp'])
country = country.drop('gdp', axis=1)
import matplotlib.pyplot as plt
plt.scatter(country['tfr'], country['log_gdp'])
Out[161]:
<matplotlib.collections.PathCollection at 0x10dab9890>

Another way is to aggregate over countries, so that we end up with yearly data. Aggregation is simple using group by again.

In [162]:
df = pd.read_pickle('df.pkl')
year = df.groupby(level='year').mean()
year['ln_gdp'] = np.log(year['gdp'])
year = year.drop('gdp', axis=1)
year.sort(columns=['ln_gdp'], axis=0, inplace=True)
year.to_pickle('year.pkl')
year.plot()
Out[162]:
<matplotlib.axes.AxesSubplot at 0x10ac75950>

Suppose that we separate the years up to 2000 and after 2000 into two datasets, like so. Note that the index happens to be a list of Unicode strings.

In [163]:
import pandas as pd
upto2000 = year.loc[:u'2000']
after2000 = year.loc[u'2001':]
print upto2000.tail(3), '\n', after2000.head(3)
           tfr    ln_gdp
year                    
1998  3.287430  9.056870
1999  3.262645  9.071454
2000  3.189638  9.191046

[3 rows x 2 columns] 
           tfr    ln_gdp
year                    
2001  3.139298  9.199501
2002  3.097953  9.211897
2003  3.068448  9.217876

[3 rows x 2 columns]

Either pd.concat() or DataFrame.append() can put them back together.

In [164]:
yearAgain = pd.concat([upto2000, after2000])
print yearAgain.head(3), '\n', yearAgain.tail(3)
           tfr    ln_gdp
year                    
1961  5.464729  8.218653
1960  5.423905  8.235099
1963  5.441211  8.291204

[3 rows x 2 columns] 
           tfr    ln_gdp
year                    
2006  2.968326  9.322465
2008  2.924269  9.347202
2007  2.948572  9.354581

[3 rows x 2 columns]

In [165]:
yearAgain2 = upto2000.append(after2000)
print yearAgain2.head(3), '\n', yearAgain2.tail(3)
           tfr    ln_gdp
year                    
1961  5.464729  8.218653
1960  5.423905  8.235099
1963  5.441211  8.291204

[3 rows x 2 columns] 
           tfr    ln_gdp
year                    
2006  2.968326  9.322465
2008  2.924269  9.347202
2007  2.948572  9.354581

[3 rows x 2 columns]
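
Note that the rows are still in the ln_gdp sort order from earlier; a sort_index() call restores chronological order.

# put the years back in chronological (index) order after the concat
yearAgain = yearAgain.sort_index()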

Summary

Many other powerful functions and methods remain in pandas and related packages. For example, we have not had a chance to talk about merge and join. Nor have we covered the powerful 'reshaping' tools (stack and pivot tables) or Time Series topics.

We did study together, though, the following topics:

  • What is a DataFrame?
  • Creating DataFrame and I/O
  • Getting and Setting values
  • Missing Data
  • Append, Concat, Group By

References

  • McKinney, Wes (2012). Chapter 4, "NumPy Basics: Arrays and Vectorized Computation," in Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. O'Reilly. ISBN 1449319793.
  • pandas development team (2014). "pandas: powerful Python data analysis toolkit," Version 0.13.1. Available at the pandas site.