The Office of Population Research at Princeton University

November 7, 2009


Administration
Faculty
Staff
Students
Jobs


Projects
Seminars
Working Papers
Publications
Dissertations


  Prospective  Students
Programs
Courses
Course Schedule


Data Archive
Library
Pop Index
NIH Public
    Access Policy


Calendar


CRCW
CHW
CMD
PUM
OPR Mail

Search

 

Manipulating Files

Now that we know how to create and manipulate variables in Stata, we’ll learn how to manipulate entire files. This will entail learning about the append, merge, collapse/contract, expand and reshape commands. First though, a few terms we’ll need to know:

  • Master dataset – this refers to the dataset you currently have loaded into Stata.
  • Using dataset – this is a dataset not in memory, but will used to append or merge to the master dataset.
  • Wide format – a dataset that has one variable for each time period in the study, i.e., twelve variables for income in each month of a year. These are sometimes referred to as “repeated measures” datasets.
  • Long format – the converse of a wide dataset, there is one observation per unit for each time period in the study. These are often referred to as “person-year”, “person-month”, or time-series datasets.

Append

The append command is used when you have two datasets with (mostly) the same variables, but different observations. These observations may be additional respondents or additional years, months, etc. for the same respondents.  Be sure that you have enough memory to hold all the datasets you want to combine before you start. 

append using newdata

 

append using newdata, keep(age income weight)

The append command is very easy. Here, we are simply appending the dataset “newdata” to the data that are already in memory. You can only append one file at a time.

You do not need to append all of the variables from the using data set if you use the "keep" option.

 

Merge

The merge command can be used to add new variables to existing observations.  Like append, you need to make sure that you have enough memory to hold all the datasets you want to merge.  Additionally, your master and using data must be sorted according to the id or key variables you will use to match the observations.  By defualt, Stata will not change any of the values in your master dataset and any observations in the using dataset that do not match anyhting in the master dataset will simply be appended with all of the variables from the master data set to missing.

merge perid year using newdata

 

merge perid year using newdata, keep(age weight)

Here, we simply merge the data from the file newdata to the master data, matching observations with the variables perid and year.

Just like with the "append" command, you can merge only the variables you want from the using data set by using the "keep" option.

merge perid year using newdata, update

The update option tells Stata to replace missing data in the master data with values from the using data.

merge perid year using newdata, update replace

Including the replace option will cause Stata to replace all of the master dataset values with values from the using dataset except if the using dataset value is missing.  The update option must be included if you use replace.

merge perid year using newdata, nokeep

The nokeep option will cause Stata to drop any observations in the using dataset that do not have matching observations in the master dataset.

When Stata merges two datasets, it automatically creates a variable called “_merge.”  This variable can be used to determine if the merge produced the results you wanted.  It can take up to five values, depending on the options you use:

If you use neither the update nor replace options:

            1 = an observation from the master dataset that did NOT have a corresponding observation in the using dataset.

            2 = an observation from the using dataset that did NOT have a corresponding observation in the master dataset.

            3 = observations that matched

 

If you use the update option:  

 

            1 = an observation from the master dataset that did NOT have a corresponding observation in the using dataset.

            2 = an observation from the using dataset that did NOT have a corresponding observation in the master dataset.

3 = an observation that matched and had all values the same

            4 = an observation that matched, but the missing values in the master dataset were updated

            5 = an observation that matched, but the values in the master dataset were not the same as those in the using dataset.

Collapse and Contract

The collapse and contract commands are used to create a new dataset from the one in memory by summarizing the data.  In other words, collapse and contract will create a dataset of means, medians, and frequencies.  Be aware that the newly created dataset will replace the one in memory, so be sure to save any changes you have made before executing the command.

 

collapse age income

Without specifying any options or statistics, Stata will produce a one-observation dataset of means.

collapse age income, by(year)

Most often, you will use the “by” option to produce statistics for each group.

collapse (mean) age income (max) rand, by(year)

You can have several different types of statistics in your output dataset.

collapse (mean) age income (max) max_age=age , by(year)

By default, the variables in the new dataset will have the same name as those in your original dataset.  If you want more than one statistic for a particular variable, then you must supply a new name for it:  “new name” = “original name”

collapse (mean) age income, by(year) cw

By default, Stata will use all possible observations when calculating the statistics.  This may result in the mean for age being based on a different number of observations than the mean for income.  The “cw” option performs a case-wise deletion meaning that any observation that has a missing value for any variable specified will be dropped.

contract employed mugged cityname

Contract will create a dataset containing the frequencies of all the combinations of the listed variables.

contract emlpoyed mugged cityname, zero

The “zero” option tells Stata to include any combinations with zero frequencies.

contract employed mugged cityname, nomiss

The “nomiss” option tells Stata to not use any observation that has a missing value for any of the listed variables.

 

 

Expand

 

You may not use “expand” very often, but it is included here because it is the complement of contract.  Namely, “expand” will create copies of observations based on an expression or, more commonly, a variable.

 

expand freq

Assuming you have a variable named “freq”, Stata will create duplicates of each observation, the number of duplicates being equal to the value of freq for that observation.

 

 

Reshape

 

The reshape command is one of the more complex commands in Stata.  Reshape is used to re-format a dataset from wide to long and vice-versa.  One thing to remember is that Stata likes to have the variables named in a logical and consistent manner, like “income90”, “income91”, “income92” and so on.  There are ways of working with variables that are not named like this, but it may be easier in the long run to just take the time and rename them.

 

In these examples, we’ll assume that we have a dataset with the variables id, income90-income95, score90-score95, and month

 

reshape wide income , i(id) j(year)

This form of the command takes data in long format and makes it wide.  The “i” variable still tells Stata how to identify unique observations.  The “j” variable tells Stata what variable’s values to use when creating the new variables.

reshape long income , i(id) j(year)

In the reshape command, we tell Stata that we want to convert the file from its current wide format to long format.  The variables we want reshaped are the income variables and the score variables.  The “i” variable(s) tell Stata how to identify an individual observation.  The “j” variable(s) tells Stata how to identify unique variables in the time series.

 

On to the next lesson, Time Series Commands
top
Mail: Office of Population Research, Princeton University, Wallace Hall, Princeton NJ 08544
Phone: (609) 258-4870  •  Fax: (609) 258-1039  •  Email: webmaster@opr.princeton.edu