|
November 7, 2009

Administration
Faculty
Staff
Students
Jobs

Projects
Seminars
Working Papers
Publications
Dissertations

Programs
Courses
Course Schedule

Data Archive
Library
Pop Index
NIH Public Access Policy

Calendar

CRCW
CHW
CMD
PUM
OPR Mail
Search
|
 |
|
Manipulating Files
Now that we know how to create and manipulate variables in
Stata, we’ll learn how to manipulate entire files. This will entail learning about the append, merge, collapse/contract, expand and reshape commands. First though, a few terms we’ll need to know:
- Master dataset – this refers to the dataset you currently have
loaded into Stata.
- Using dataset – this is a dataset not in memory, but will used
to append or merge to the master dataset.
- Wide format – a dataset that has one variable for each time
period in the study, i.e., twelve variables for income in each month of a
year. These are sometimes referred to as “repeated measures” datasets.
- Long format – the converse of a wide dataset, there is one
observation per unit for each time period in the study. These are often referred to as
“person-year”, “person-month”, or time-series datasets.
Append
The append command is used when you have two datasets with
(mostly) the same variables, but different observations. These observations may be additional
respondents or additional years, months, etc. for the same respondents. Be sure that you have enough memory to hold
all the datasets you want to combine before you start.
|
append using newdata
append using newdata, keep(age income weight)
|
The append command is very easy. Here, we are simply appending the dataset “newdata” to the data that are already in memory. You can only append one file at a time.
You do not need to append all of the variables from the using data set if you use the "keep" option.
|
Merge
The merge command can be used to add new variables to
existing observations. Like append, you
need to make sure that you have enough memory to hold all the datasets you want
to merge. Additionally, your master and
using data must be sorted according to the id or key variables you will use to
match the observations. By defualt,
Stata will not change any of the values in your master dataset and any
observations in the using dataset that do not match anyhting in the master
dataset will simply be appended with all of the variables from the master data
set to missing.
|
merge perid year using newdata
merge perid year using newdata, keep(age weight)
|
Here, we simply merge the
data from the file newdata to the master data, matching observations with the
variables perid and year.
Just like with the "append" command, you can merge only the variables you want from the using data set by using the "keep" option.
|
|
merge perid year using
newdata, update
|
The update option tells
Stata to replace missing data in the master data with values from the using
data.
|
|
merge perid year using
newdata, update replace
|
Including the replace
option will cause Stata to replace all of the master dataset values with
values from the using dataset except if the using dataset value is
missing. The update option must be
included if you use replace.
|
|
merge perid year using newdata, nokeep
|
The nokeep option will cause Stata to drop any
observations in the using dataset that do not have matching observations in
the master dataset.
|
When Stata merges two datasets, it automatically creates a
variable called “_merge.” This variable
can be used to determine if the merge produced the results you wanted. It can take up to five values, depending on
the options you use:
If you use neither the update nor replace options:
1 = an
observation from the master dataset that did NOT have a corresponding
observation in the using dataset.
2 = an
observation from the using dataset that did NOT have a corresponding
observation in the master dataset.
3 =
observations that matched
If you use the update option:
1 = an
observation from the master dataset that did NOT have a corresponding
observation in the using dataset.
2 = an
observation from the using dataset that did NOT have a corresponding
observation in the master dataset.
3 = an observation that matched and
had all values the same
4 = an
observation that matched, but the missing values in the master dataset were
updated
5 = an
observation that matched, but the values in the master dataset were not the
same as those in the using dataset.
Collapse and Contract
The collapse and contract commands are used to create a new
dataset from the one in memory by summarizing the data. In other words, collapse and contract will
create a dataset of means, medians, and frequencies. Be aware that the newly created dataset will replace the one in
memory, so be sure to save any changes you have made before executing the
command.
|
collapse age income
|
Without specifying any
options or statistics, Stata will produce a one-observation dataset of means.
|
|
collapse age income, by(year)
|
Most often, you will use
the “by” option to produce statistics for each group.
|
|
collapse (mean) age income
(max) rand, by(year)
|
You can have several
different types of statistics in your output dataset.
|
|
collapse (mean) age income
(max) max_age=age , by(year)
|
By default, the variables
in the new dataset will have the same name as those in your original
dataset. If you want more than one
statistic for a particular variable, then you must supply a new name for it: “new name” = “original name”
|
|
collapse (mean) age income,
by(year) cw
|
By default, Stata will use
all possible observations when calculating the statistics. This may result in the mean for age being
based on a different number of observations than the mean for income. The “cw” option performs a case-wise
deletion meaning that any observation that has a missing value for any
variable specified will be dropped.
|
|
contract employed mugged
cityname
|
Contract will create a
dataset containing the frequencies of all the combinations of the listed
variables.
|
|
contract emlpoyed mugged
cityname, zero
|
The “zero” option tells
Stata to include any combinations with zero frequencies.
|
|
contract employed mugged cityname, nomiss
|
The “nomiss” option tells Stata to not use any observation
that has a missing value for any of the listed variables.
|
Expand
You may not use “expand” very often, but it is included here
because it is the complement of contract.
Namely, “expand” will create copies of observations based on an
expression or, more commonly, a variable.
|
expand freq
|
Assuming you have a variable named “freq”, Stata will
create duplicates of each observation, the number of duplicates being equal
to the value of freq for that observation.
|
Reshape
The reshape command is one of the more complex commands in
Stata. Reshape is used to re-format a
dataset from wide to long and vice-versa.
One thing to remember is that Stata likes to have the variables named in
a logical and consistent manner, like “income90”, “income91”, “income92” and so
on. There are ways of working with
variables that are not named like this, but it may be easier in the long run to
just take the time and rename them.
In these examples, we’ll assume that we have a dataset with
the variables id, income90-income95, score90-score95, and month
|
reshape wide income , i(id) j(year)
|
This form of the command takes data in long format and
makes it wide. The “i” variable still
tells Stata how to identify unique observations. The “j” variable tells Stata what variable’s values to use when
creating the new variables.
|
|
reshape long income ,
i(id) j(year)
|
In the reshape command, we
tell Stata that we want to convert the file from its current wide format to
long format. The variables we want
reshaped are the income variables and the score variables. The “i” variable(s) tell Stata how to
identify an individual observation.
The “j” variable(s) tells Stata how to identify unique variables in
the time series.
|
On to the next lesson, Time Series Commands
|