November 7, 2009
Sorting, By Groups and Observation Indexing
Many of Stata's commands can be executed on a group-by-group
basis. To do so, however, the data must first be sorted by the variable or
variables that define those groups. Sorting can surprise you, so you always
need to be aware of exactly what you are asking Stata to do. Chief among the
surprises is that Stata places observations with equal values of the sort
variables (ties) in arbitrary order, unless you use the "stable" option. We will demonstrate this point in a moment.
Another of Stata's more powerful features is its observation indexing.
Observation indexing simply means keeping track of what observation you are
working on and being able to move back and forth through the data with ease.
Internally, Stata numbers each observation 1 through however many observations
you have. The "number" assigned to each observation depends on the
current sorting order. Stata allows you to make use of the observation number
to create other variables such as "lags" or "leads."
Sorting
Sorting in Stata is very easy - there are only two commands
to use.
|
sort hhid famid perid
sort hhid famid, stable
|
The sort command orders your data, in ascending sequence,
according to the variables you specify. Unless you use the "stable" option, observations that tie on those variables may end up in arbitrary order. In the second example, without the option, Stata might shuffle the order of "perid" within each combination of hhid and famid.
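The effect of "stable" can be seen on a small made-up file. This is only a sketch: the data values and the helper variable "orig" are assumptions for illustration and are not part of the lesson's dataset.

```stata
* Toy data (hypothetical values), deliberately out of order
clear
input hhid famid perid
2 1 3
1 1 2
1 1 1
2 1 1
end

* Record each observation's original position
gen long orig = _n

* Sort on hhid and famid only; stable keeps tied observations
* in the order they had before the sort
sort hhid famid, stable
list hhid famid perid orig
```

Within each hhid and famid, "orig" stays in ascending order because ties are never reshuffled. Dropping the option could leave those tied rows in either order.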
|
|
gsort -hhid famid
gsort hhid -famid
|
The "gsort" command can be used to sort
observations in descending sequence, ascending sequence, or both. All you need
to do is place a minus sign in front of the variable or variables you want
sorted in descending sequence.
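A minimal sketch of a mixed sort, using made-up values for hhid and famid (the data are an assumption for illustration):

```stata
* Toy data (hypothetical values)
clear
input hhid famid
1 2
2 1
1 1
2 2
end

* Descending hhid, then ascending famid within each hhid
gsort -hhid famid
list
```

The listing should start with hhid 2 (famid 1, then 2) and end with hhid 1 (famid 1, then 2).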
|
|
duplicates report
duplicates examples hhid famid perid
|
The "duplicates" command can be used to find duplicate observations in your data. The "report" form produces a table showing how many observations occur once, twice, and so on. Without a variable list, it compares observations on all variables. The "examples" form lists one example from each group of duplicated observations, here judged on hhid, famid and perid.
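As a sketch of how the two forms behave, the toy file below contains one repeated person record. The data are made up, and the extra "duplicates tag" step (a further subcommand of the same command, not used in the lesson) and the variable name "dup" are additions for illustration.

```stata
* Toy data (hypothetical values): rows 2 and 3 are identical
clear
input hhid famid perid
1 1 1
1 1 2
1 1 2
2 1 1
end

* Table of how many observations appear once, twice, ...
duplicates report hhid famid perid

* One example row from each group of duplicates
duplicates examples hhid famid perid

* Flag duplicates: dup holds the number of other observations
* that share the same hhid, famid and perid
duplicates tag hhid famid perid, generate(dup)
list
```

A flag variable like "dup" is handy when you want to inspect or drop the repeated records afterwards.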
|
By Groups
You can execute almost any command on each level of a
variable by prefacing it with "by." Unfortunately, "by:"
requires data sorted in ascending order, as produced by "sort". Data arranged in descending order with "gsort" will not do.
|
by citynum: sum income
by citynum year: sum income
|
The "by citynum:" command prefix simply tells Stata
to execute the "sum income" command on each citynum separately. You
can list more than one variable in the "by...:" prefix. Stata
assumes the data are already sorted in this order and returns an error if
they are not.
|
|
by citynum, sort: sum income
|
You can use the "sort" option to tell Stata to
sort the data if they are not already sorted.
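The difference between the plain prefix and the "sort" option can be sketched on a toy file that is deliberately unsorted. The data values are assumptions for illustration, and "capture noisily" is used here only so the example keeps running after the expected error.

```stata
* Toy data (hypothetical values), not sorted by citynum
clear
input citynum year income
2 2000 20
1 2000 10
2 2001 25
1 2001 12
end

* The data are not sorted, so by: refuses to run;
* capture traps the error so the do-file can continue
capture noisily by citynum: summarize income
display "return code without sorting: " _rc

* The sort option sorts by citynum first, then runs the command
by citynum, sort: summarize income
```

After the last command the data are left sorted by citynum, so later "by citynum:" commands work without the option.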
|
|
by citynum, rc0: reg income year
|
Finally, the "rc0" option tells Stata not to
stop if it encounters an error along the way. Some statistical analyses
require a minimum number of observations. If one or more of your groups does
not have enough observations, Stata will stop executing the command unless
you specify this option.
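A small sketch of "rc0" at work, assuming a made-up file in which the first city has only one observation, too few for a regression:

```stata
* Toy data (hypothetical values): citynum 1 has a single observation
clear
input citynum year income
1 2000 10
2 2000 20
2 2001 25
2 2002 27
end
sort citynum

* The regression cannot be estimated for citynum 1;
* rc0 lets by: carry on and fit the model for citynum 2
by citynum, rc0: regress income year
```

Without "rc0", the command would halt at citynum 1 and the regression for citynum 2 would never run.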
|
|
by citynum (year): gen income_lag=income[_n-1]
|
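Enclosing a variable in parentheses tells Stata to verify that the data are sorted by
citynum and year, while the command is still executed separately for each citynum only. As with any "by...:" prefix, Stata returns an error if the data are not sorted this way, unless you add the "sort" option.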
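The parenthesized form can be sketched as follows, on made-up data that start out unsorted. The values are assumptions for illustration, and the "sort" option is added so the example does not depend on a prior sort.

```stata
* Toy data (hypothetical values), years out of order within city
clear
input citynum year income
1 2001 12
1 2000 10
2 2000 20
2 2001 25
end

* Sort by citynum and year, but form groups on citynum only,
* so the lag never reaches back into a different city
by citynum (year), sort: gen income_lag = income[_n-1]
list
```

The first year of each city gets a missing lag, since there is no earlier observation within that city.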
|
Observation Indexing
Observation indexing is one of Stata's coolest features. It
is also one of its more esoteric features. As mentioned before, Stata numbers the
observations in your dataset internally from 1 to N in the current sort order. This number is available as a built-in system variable called "_n", which you
can use in expressions like any other variable. It is not saved with your data and you won't see it in your variable list, but you can create your own variable using it.
|
gen num=_n
|
Here is the simplest way to use "_n": we are
simply creating a variable that numbers the
observations. If you later re-sort the data, sorting on "num" puts them back in their original order.
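A minimal sketch of the round trip, with made-up income values (an assumption for illustration):

```stata
* Toy data (hypothetical values) in their original order
clear
input income
30
10
20
end

* Remember the original position of each observation
gen long num = _n

* Re-sort for some other purpose ...
sort income

* ... then restore the original order
sort num
list
```

Because "num" is unique, sorting on it has no ties and always restores exactly the original order.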
|
|
gen lagyear=year[_n-1]
gen diff=year-year[_n-1]
|
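These examples really show the power of observation
indexing. The first command creates a lag variable for year, and the second computes the difference between each year and the one before it. To create
a lead variable, simply use "_n+1". The subscript "[_n-1]" means "the observation one before the current one," and "[_n+1]" means "the one after." Because nothing comes before the first observation, both new variables are missing there.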
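The lag, lead and difference can be sketched side by side on three made-up years. The values and the variable name "leadyear" are assumptions for illustration.

```stata
* Toy data (hypothetical values), already in year order
clear
input year
2000
2001
2003
end

gen lagyear  = year[_n-1]         // previous observation's year
gen leadyear = year[_n+1]         // next observation's year
gen diff     = year - year[_n-1]  // gap since the previous year
list
```

The lag and the difference are missing in the first observation, and the lead is missing in the last, because the subscript points outside the data.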
|
|
by id: gen diffinc2=income-income[_n-1]
|
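Often, when you create variables this way, you need to make sure that you
don't use information from a different person. When combined with "by...:", "_n" is computed
within each id, so the numbering restarts at 1
within each level of the grouping variable. The first observation for each id therefore gets a missing value rather than a difference from the previous person's income.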
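To see why the prefix matters, the sketch below computes the difference both ways on a made-up two-person file. The data values and the variable name "diff_all" are assumptions for illustration.

```stata
* Toy data (hypothetical values): two persons, two years each
clear
input id year income
1 2000 10
1 2001 15
2 2000 40
2 2001 38
end
sort id year

* Without by: the first row of person 2 is compared with person 1
gen diff_all = income - income[_n-1]

* With by: _n restarts within each id, so that row is missing
by id: gen diffinc2 = income - income[_n-1]
list
```

In the third row, "diff_all" holds 40 minus 15, a comparison across two different people, while "diffinc2" is correctly missing.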
|
|
gen diffinc=income-income[1]
|
You can also tell Stata to always use a specific observation. Here, of course, we are telling Stata to use the first observation, but we could use any observation.
|
|
gen fancyschmancy=income[_n-months]
|
You can also use another variable to determine
how far back or forward to move. Any valid mathematical expression can be
used inside the brackets.
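A sketch of a variable-length lag, using made-up values for income and months (the data and the variable name "income_back" are assumptions for illustration):

```stata
* Toy data (hypothetical values): months says how far back to look
clear
input income months
10 0
20 2
30 2
40 1
end

* Each observation looks back by its own value of months
gen income_back = income[_n-months]
list
```

The second observation asks for a row before the start of the data, so it gets a missing value. The others pick up income from row 1, row 1 and row 3 respectively.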
|
|
by citynum: gen bigN=_N
by citynum: gen lastyear=year[_N]
|
Finally, "_N" is the total number of
observations; under "by citynum:" it is the number of observations within each citynum. The first example gives the same result as
egen....count(id), by(citynum), provided id is never missing.
The second example sets the value of lastyear to the value of year for the last observation within each citynum. Remember, the sort order is important!
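Both uses of "_N" can be sketched on a made-up file, together with the egen comparison. The data values and the variable name "cnt" are assumptions for illustration.

```stata
* Toy data (hypothetical values): three rows for city 1, two for city 2
clear
input citynum year id
1 2000 1
1 2001 2
1 2002 3
2 2000 4
2 2001 5
end

* Sort by year within city so the last row is the latest year
sort citynum year

by citynum: gen bigN = _N             // group size
by citynum: gen lastyear = year[_N]   // year in the group's last row

* Same count via egen (id has no missing values here)
egen cnt = count(id), by(citynum)
list
```

If the data were sorted by citynum only, "lastyear" would pick up whichever year happened to sit last in each city, which is why the sort on year comes first.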
|
On to the next lesson, Manipulating Files
|