The Office of Population Research at Princeton University

August 28, 2008



Sorting, By Groups and Observation Indexing

Many of Stata's commands can be executed on a group-by-group basis. To do so, however, the data must first be sorted by the variable or variables that define those groups. Sorting can produce surprises, so you always need to know exactly what you are asking Stata to do. Chief among these surprises is that Stata places observations with identical values of the sort variables in random order relative to one another, unless you use the "stable" option. We will demonstrate this point in a moment.

Another of Stata's more powerful features is its observation indexing. Observation indexing simply means keeping track of what observation you are working on and being able to move back and forth through the data with ease. Internally, Stata numbers each observation 1 through however many observations you have. The "number" assigned to each observation depends on the current sorting order. Stata allows you to make use of the observation number to create other variables such as "lags" or "leads."

Sorting

Sorting in Stata is very easy - there are only two commands to use, "sort" and "gsort".

sort hhid famid perid

sort hhid famid, stable

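The sort command orders your data, in ascending sequence, according to the variables you specify. Unless you use the "stable" option, observations that have the same values on all of the sort variables may end up in random order relative to one another. In the second example, observations with the same "hhid" and "famid" keep their original relative order; without the option, Stata might shuffle them, so the order of "perid" within a family would not be guaranteed.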
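To make the effect of "stable" concrete, here is a minimal sketch on a made-up four-person dataset (the values are invented for illustration). With the option, the three people in household 1, family 1 should stay in the order in which they were entered, so perid should list as 2, 1, 3.

```stata
* Hypothetical toy data: three people in household 1, one in household 2
clear
input hhid famid perid
1 1 2
1 1 1
2 1 1
1 1 3
end

* Ties on (hhid, famid) keep their original relative order
sort hhid famid, stable
list, noobs
```

If the order of perid matters for later commands, the safer choice is usually to sort on it explicitly, as in the first example.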

gsort -hhid famid

gsort hhid -famid

The "gsort" command can sort observations in descending sequence, ascending sequence, or a mix of both. All you need to do is place a minus sign in front of each variable you want sorted in descending sequence. The first example sorts by "hhid" descending and then "famid" ascending; the second does the reverse.
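As a small illustration, assuming an invented dataset of two households with two families each, the following sorts households from highest to lowest while keeping families in ascending order within each household.

```stata
* Hypothetical toy data
clear
input hhid famid
1 2
2 1
1 1
2 2
end

* Households in descending order, families ascending within household
gsort -hhid famid
list, noobs
```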

duplicates report

duplicates examples hhid famid perid

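The "duplicates" command can be used to find duplicate observations in your data. The "report" form of the command produces a table showing how many observations are duplicated and how many copies exist; with no variable list, observations count as duplicates only if they match on every variable. The "examples" form lists one example from each group of duplicates, here defined by "hhid", "famid" and "perid", so you can see which observations are duplicated.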
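The sketch below, again on invented data, shows one way to move from counting duplicates to actually marking them. It uses the "tag" form of the command, which is not covered above; the variable name dup is arbitrary.

```stata
* Hypothetical toy data: person 2 of household 1 was entered twice
clear
input hhid famid perid
1 1 1
1 1 2
1 1 2
2 1 1
end

* How many observations share the same hhid-famid-perid combination?
duplicates report hhid famid perid

* Flag them: dup is 0 for unique observations, 1 for each member of a pair
duplicates tag hhid famid perid, generate(dup)
list if dup > 0, noobs
```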

By Groups

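You can execute almost any command separately for each level of a variable by prefacing it with "by varname:". Unfortunately, "by...:" requires data sorted in ascending order, as produced by the "sort" command; it does not accept a descending order produced by "gsort".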

by citynum: sum income

by citynum year: sum income

The "by citynum:" command prefix simply tells Stata to execute the "sum income" command on each citynum separately. You can list more than one variable in the "by...:" prefix, in which case the command runs once for each combination of their values. Stata assumes the data are already sorted in this order and stops with a "not sorted" error if they are not.

by citynum, sort: sum income

You can use the "sort" option to tell Stata to sort the data if they are not already sorted.
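The following sketch, using a made-up city-year dataset, shows both behaviors: a bare "by...:" on unsorted data is expected to fail with a nonzero return code, while the "sort" option (or its shorthand "bysort") sorts first and then runs the command.

```stata
* Hypothetical toy data, deliberately entered out of order
clear
input citynum year income
2 2001 40
1 2001 30
2 2000 38
1 2000 28
end

* Data are not sorted, so a bare by: prefix is expected to fail;
* capture noisily shows the message but lets the do-file continue
capture noisily by citynum: summarize income
display _rc

* Let by sort first; bysort is shorthand for the same thing
by citynum, sort: summarize income
bysort citynum: summarize income
```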

by citynum, rc0: reg income year

Finally, the "rc0" option tells Stata not to stop if it encounters an error in one of the groups. Some statistical analyses require a minimum number of observations. If one or more of your groups does not have enough observations, Stata will stop executing the command at that group unless you specify this option.

by citynum (year): gen income_lag=income[_n-1]

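By enclosing a variable in parentheses, you tell Stata that the data must be sorted by both citynum and year, but that the groups are defined by citynum only. Here this guarantees that the lag is taken from the previous year within the same city.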
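A minimal sketch of the parentheses syntax, on invented data and using "bysort" so that the sorting happens in the same step: the first year of each city should get a missing lag, because there is no earlier observation within that city.

```stata
* Hypothetical toy data, entered out of order
clear
input citynum year income
1 2001 30
2 2000 38
1 2000 28
2 2001 40
end

* Sort by citynum and year, but form groups on citynum only
bysort citynum (year): generate income_lag = income[_n-1]
list, sepby(citynum) noobs
```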

 


Observation Indexing

Observation indexing is one of Stata's coolest features. It is also one of its more esoteric features. As mentioned before, Stata numbers the observations in your dataset internally from 1 to N in the current sort order. This number is available as a built-in system variable called "_n", which you can use in expressions like any other variable. It is not saved with your data and you won't see it in your variable list, but you can create your own variable using it.

gen num=_n

Here is the simplest way to use "_n". We are simply creating a variable that numbers the observations in their current order. Later, after the data have been re-sorted, sorting on "num" gets your data back into the original order.
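As a short sketch of that round trip, on an invented three-row dataset:

```stata
* Hypothetical toy data
clear
input id income
3 50
1 20
2 35
end

* Record the current order before changing it
generate num = _n

* Work in some other order ...
sort income

* ... then restore the original order
sort num
list, noobs
```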

gen lagyear=year[_n-1]

gen diff=year-year[_n-1]

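These examples really show the power of observation indexing. The first command creates a lag variable for year, and the second computes the change in year from the previous observation. To create a lead variable, simply use "_n+1". What the subscript is actually saying is: "the current observation number minus (or plus) one." For the first observation there is no previous observation, so the result is missing.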
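The sketch below applies lag, lead and difference to three invented years so the missing values at the ends are easy to see; leadyear is a name chosen here for illustration.

```stata
* Hypothetical toy data
clear
input year
1990
1995
2005
end

generate lagyear  = year[_n-1]
generate leadyear = year[_n+1]
generate diff     = year - year[_n-1]

* lagyear and diff are missing in the first row, leadyear in the last
list, noobs
```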

by id: gen diffinc2=income-income[_n-1]

 

Often, when you create variables this way, you need to make sure that you don't use information from a different person. When combined with "by...:", "_n" is computed within each id: the numbering restarts at 1 for each level of the grouping variable, so the first observation of each person gets a missing value instead of borrowing the last income of the previous person.
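To see the difference, the following sketch computes the same change in income with and without the "by id:" prefix on made-up data for two people (diff_all and diff_by are illustrative names).

```stata
* Hypothetical toy data: two people, two years each
clear
input id year income
1 2000 10
1 2001 12
2 2000 50
2 2001 55
end

sort id year

* Ungrouped: the first row of id 2 is compared with the last row of id 1
generate diff_all = income - income[_n-1]

* Grouped: _n restarts at 1 within each id, so that row is missing instead
by id: generate diff_by = income - income[_n-1]
list, sepby(id) noobs
```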

gen diffinc=income-income[1]

You can also tell Stata to always use a specific observation. Here, of course, we are telling Stata to use the first observation, but we could use any observation.

gen fancyschmancy=income[_n-months]

You can also use another variable to determine how far back or forward to move. Any valid mathematical expression can be used inside the brackets.
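Here is a small sketch of a variable offset, assuming an invented "months" variable that says how many rows to look back. When the computed position falls outside the data, as in the last row here, Stata returns a missing value rather than an error.

```stata
* Hypothetical toy data: months is how many rows to look back
clear
input period income months
1 100 0
2 110 1
3 125 2
4 130 5
end

* Look back a different number of rows for each observation
generate income_back = income[_n - months]

* Row 4 asks for observation -1, which does not exist, so it is missing
list, noobs
```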

by citynum: gen bigN=_N

by citynum: gen lastyear=year[_N]

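Finally, "_N" is the total number of observations; combined with "by citynum:", it is the number of observations within each citynum. The first example gives the same result as egen....count(id), by(citynum), provided id is never missing, because count() counts only nonmissing values.

The second example sets the value of lastyear to the value of year for the last observation within each citynum - remember the sort order is important!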
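The sketch below puts both "_N" examples together on invented data and adds the egen comparison. One id is deliberately left missing to show where the two counts can diverge; n_id is an illustrative variable name.

```stata
* Hypothetical toy data: one id in city 2 is missing
clear
input citynum year id
1 2000 101
1 2001 102
2 2000 201
2 2001 .
2 2002 203
end

* Sorting by year within city makes year[_N] the latest year
bysort citynum (year): generate bigN = _N
by citynum: generate lastyear = year[_N]

* count() skips missing values of id, so it can fall short of _N
egen n_id = count(id), by(citynum)
list, sepby(citynum) noobs
```

In city 2, bigN should be 3 while n_id should be 2, so the two approaches agree only when the counted variable has no missing values.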

 

On to the next lesson, Manipulating Files