November 7, 2009

Creating and Manipulating Variables
Stata would be pretty useless if you didn't have any
variables to analyze. Although we have talked some about variables so far,
in this module we will talk about them in much more detail. By the end of the lesson you
will know the different types of variables there are, how to create them and
how to manipulate them.
In Stata, there are three basic kinds of variables: numeric, string and
date. Although dates are technically stored as numeric data, their use is
different from regular numeric data, so we will discuss them separately.
Numeric variables are just what you would think: numbers. They can be
integers, decimals, negative and positive. In the output from the describe
command, numeric variables can show up with several different "Storage
Types." The specific meaning of these is nothing you need to be concerned
with right now, just know that anything other than "str" is a number.
Many analysis commands like "reg" and "sum" will work only
on numeric variables. If you receive a "type mismatch" error, then
you are probably trying to do an analysis on a string variable.
String variables, often referred to as "character" or
"alphanumeric" variables, are variables whose values may have letters or other special characters in
them. It is possible to store numbers as though they were letters (a
common source of the "type mismatch" error).
Date variables are a special case of numeric variables. Although they are
often entered as strings (e.g., 01JAN1992 or 01/01/92), they must be stored
in Stata as numbers to make them useful. Stata has several commands for working
with dates and time-dependent data. Briefly, Stata stores all dates as the
number of days (or months or quarters, etc.) from January 1, 1960. Dates before
then are negative, and dates after are positive. SAS uses the same date as its origin, but other packages do not:
SPSS counts from October 14, 1582, and Excel uses January 1, 1900 by default (it can use other dates, so you need to make sure).
If you are importing dates from an Excel spreadsheet, you must re-format them
and import them as strings.
There are four basic commands for creating and manipulating variables:
"gen," "egen," "replace," and "recode."
The gen and egen (short for "generate" and "extended
generate", respectively) commands are used to create new variables. The
replace and recode commands are used to change the values of existing
variables. You will often use gen and replace together.
Rules for Variables
- Variable names can have up to
32 characters (8 or fewer is better, though) and must begin with a letter or an underscore. Stata's own variables use a leading underscore, so a letter is the safer choice.
- Variable names are
case-sensitive.
- Use descriptive names –
“var1” means nothing. The
question number from a survey is a good choice.
- When listing variables in
commands, you can use the "?" and "*" to represent any
single character or any number of characters, respectively.
- Values for string variables
can be up to 80 characters in Intercooled and 244 characters in Special Edition. Anything over these limits will be dropped.
- Values for string variables
are enclosed in double quotes.
- Missing values for numeric
variables are represented with a dot: "." Stata also supports "extended" missing values: ".a", ".b", ".c" up to ".z". These can be useful for coding responses such as "Refused" or "Not Applicable". Remember: missing is
not the same thing as zero!
- Missing values for string
variables are represented by two double quotes with nothing in between:
"". This is not the same as a blank!
- Missing values are considered
to be greater than any number: every number < . < .a < ... < .z. This is very important when using
"if" statements and when sorting your data.
String Variables
We'll begin with string variables since they are the easiest
to work with. As in any package, their values are case-sensitive.
|
gen firstname="paul"
gen initial="abcdefghij"
|
To create a string variable, use the gen command. Enclose the values themselves in double
quotes.
|
|
replace firstname="bob" if employed==4
replace firstname="sue" if mugged != 3
replace firstname="none" if firstname==""
|
Not everyone's first name is Paul, so we will need to
change the values for some observations. The "if" clause allows us
to do this. In the first example, only those observations whose value for the
variable "employed" is 4 will be changed; all other observations
are left alone. Note the double "=", which tests for equality; a single "=" assigns a value.
The second example will change all observations whose
values for mugged are NOT 3. Because a missing value is not equal to 3, observations where mugged is missing are changed too.
Finally, we can even change values of a variable based on
itself - in this case, we change all the missing names to "none."
|
Numeric Variables
Creating and manipulating numeric variables is just as easy
as string variables.
|
gen numvar1=1
gen numvar2=numvar1+income
gen numvar3=(numvar1/income)*100
|
Just like with string variables, you can create new
numeric variables with the gen command. Any valid mathematical expression is
allowed.
|
|
replace numvar1=5 if mugged==3
replace numvar2=income/rand if numvar3>.05
replace numvar2=income/rand if numvar3>.05 & numvar3 != .
replace numvar2=. if citynum==2 | citynum==5 | citynum==7
replace numvar2=. if inlist(citynum,2,5,7)
|
Replacing values in numeric variables works much the same
way as for string variables.
We can use "if" clauses in replacing numeric values as well.
One caveat that often comes up is how Stata treats missing values. Since
missing values are treated as larger than any number, the expression
"numvar3>.05" in the second example will include missing values. This may not be what
you really want, so the third example adds "& numvar3 != ." to
exclude them.
The fourth example will make numvar2 missing if citynum is equal to 2, 5 or 7.
The fifth does the same thing with "inlist", a very useful function that lets you simply list the
values you want to match.
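The effect of missing values on an "if" clause is easy to check on a toy dataset. The sketch below is only an illustration: the variable name and the four values are made up, and the "//" comments assume you run it from a do-file.

```stata
* Toy data: two real values, one ordinary missing, one extended missing
clear
input income
1000
5000
.
.a
end

count if income > 3000               // 3: 5000, . and .a
count if income > 3000 & income < .  // 1: only 5000
count if missing(income)             // 2: . and .a
```

Writing "income < ." rather than "income != ." also screens out the extended missing values, since they all sort above the plain dot.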
|
|
recode mugged 1=2
recode mugged 1=2 3=4
recode mugged 1=2 *=5
recode mugged 1 2 3 4=5
recode mugged 1 2 3 4=5, gen(mugged2)
|
The recode command can be an easy way of changing the
values of a numeric variable (recode only works with numeric variables). All
you need to do is provide a list of the values you want to change.
The "*" means all other values not explicitly listed - including
missing! Listing several values before the "=", as in the fourth example, is one way of collapsing values. Finally, the "gen()" option tells Stata to create a new variable that will be the recoded version of the original. This is highly recommended so that you do not destroy your original variable!
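One way to confirm that a recode did what you intended is to cross-tabulate the old variable against the new one. The sketch below assumes mugged2 does not exist yet and uses the parenthesized form of the rule, which current versions of Stata document as the standard way to write it.

```stata
* Collapse 1 through 4 into 5 and keep the original variable intact
recode mugged (1 2 3 4 = 5), gen(mugged2)

* Every old value should line up with exactly one new value;
* the missing option shows what happened to missing values
tab mugged mugged2, missing
```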
|
|
gen income_dummy=.
* "& income<." keeps missing incomes missing (missing counts as larger than any number)
replace income_dummy=1 if income>=6000 & income<.
replace income_dummy=0 if income<6000
tab mugged, gen(mugged_dummy)
|
Dummy variables are numeric variables whose values are 0
and 1. There are two basic ways
of creating dummy variables: one for a
continuous variable (the gen and replace lines),
and one for a categorical variable (the gen() option of tab, which creates one dummy for each category).
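The dummy for a continuous variable can also be written in one line, because a logical expression in Stata evaluates to 1 when true and 0 when false. This is a sketch of that idiom; the name income_dummy2 is made up so that it does not clash with the variable created above.

```stata
* (income>=6000) is 1 or 0; the if clause leaves missing incomes missing
gen income_dummy2 = (income>=6000) if income<.

* Check the result, including how many observations are missing
tab income_dummy2, missing
```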
|
Extended Generate (egen)
Egen is one of Stata's most powerful and useful commands.
Like generate, it is used to create new variables, but it is much more than
that. Egen can create variables that would be difficult and tedious to create
on your own. Some examples are variables whose values are the mean of another
variable for each group such as income for males and females. Egen can also
create other variables that count the number of observations that fit a certain
criteria, or even simply number observations. The only way to truly see how
powerful egen can be is to show a few examples and then have you explore the
other available functions on your own.
|
egen age_cat = cut(age), at(10,15,20,25,30,35)
egen age_cat = cut(age), group(6)
|
"cut" is very useful for collapsing variables.
You can either specify the lowest value for each new group with the
"at()" option, or simply specify the number of groups you want with
"group()". With "at()", each group is coded with its lowest value. Any observations with a value less than 10 will be given a missing value for age_cat. The last number (35) acts only as an upper bound, so observations at 35 or above are also set to missing; make the last breakpoint larger than the highest age if you want to keep them.
|
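Since the handling of values outside the breakpoints is easy to get wrong, it is worth checking the groups that "cut" actually produced. The sketch below uses the auto dataset that ships with Stata rather than the age variable from the lesson; the breakpoints are arbitrary and were chosen only to bracket the prices in that file.

```stata
sysuse auto, clear

* Groups start at 3000, 5000 and 10000; 20000 is the upper bound
egen price_cat = cut(price), at(3000,5000,10000,20000)

* Smallest and largest price, and number of cars, in each group
tabstat price, by(price_cat) statistics(min max n)

* Any observation that fell outside the breakpoints shows up as missing
tab price_cat, missing
```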
|
egen age_mean = mean(age), by(year)
|
This creates a variable that is the mean of age for each year. In addition to mean, there are min, max, sd, and several
other statistics.
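A group mean is usually an intermediate step. As a small sketch, assuming age_mean was created as above, the first line below measures how far each person is from the mean for that year. The second shows the same kind of egen call written with a by prefix; age_mean2 and age_dev are made-up names.

```stata
* Deviation of each observation from its year's mean age
gen age_dev = age - age_mean

* Same group mean, written with the bysort prefix instead of the by() option
bysort year: egen age_mean2 = mean(age)
```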
|
|
egen numobs = count(personid), by(personid year)
|
"Count" simply counts the number of observations
with a nonmissing personid within each combination of personid and year. This can be
used to make sure that you have the same number of observations for each
respondent in each year.
|
|
egen city_yr = group(cityname year)
egen city_yr = group(cityname year), label
|
"Group" numbers the groups formed by crossing
cityname and year. The groups
are numbered consecutively which makes this a good variable to use in
analysis. The "label" option causes Stata to use the value labels (if any) of cityname and year in creating city_yr.
|
|
egen comp_id=concat(householdid familyid personid), decode p(/)
|
The "concat" function is very useful when you
have two or more variables that you want to combine to form one variable but
adding or multiplying them would not make sense. The "decode"
option works like the "decode" command in that it uses the value
labels to create the new variable. The "p()" option allows you to
put a separator character between the values.
|
Converting Between String and Numeric Variables
Before we get into date variables, it will be useful to
learn how to convert string variables into numeric and vice-versa. Sometimes,
for various reasons, a number will get read into Stata as a string. We must
convert it before we can do any analyses on it. There may even be times when we
want to treat a numeric variable as a string (such as Social Security Numbers or other ID variables), although not as often. There are
four ways to make these different conversions: the
"destring," "decode," and "encode" commands, and the
"real" and "string" functions used with the gen command.
|
destring d_income, gen(inc_pct_num) ignore("$")
destring inc_pct, gen(inc_pct_num) percent
destring inc_pct, gen(inc_pct_num) percent force
|
The "destring" command will convert a string
variable into a numeric variable. It is used particularly when you have data
that include special characters such as dollar or percent signs. The general
form of the command is to specify the string variable, generate a new numeric
variable, and the character or characters you want to remove.
If you have a percent variable with a percent sign, you
can use the "percent" option. This has the same effect as
specifying ignore("%") and then dividing the result by 100, so "45%" becomes .45.
Using the "force" option tells Stata that if it
can't make a proper conversion, then the new variable should have a missing
value.
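The options can be tried out on a few typed-in values before you use them on real data. In this sketch the variable names and values are invented for illustration, and commas stand in for the special character to be removed.

```stata
* Toy data: numbers stored as strings, with commas, percent signs and one bad value
clear
input str8 inc_str str6 pct_str
"1,200" "45%"
"950"   "7.5%"
"n/a"   "12%"
end

* Strip commas; force turns the unconvertible "n/a" into missing
destring inc_str, gen(inc_num) ignore(",") force

* Remove the percent sign and convert to a proportion
destring pct_str, gen(pct_num) percent

list
```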
|
|
gen numvar = real(str_num)
|
The "real" function simply tells Stata to
convert all numbers in str_num into numeric data. Anything that is not a
number will be made missing. Use the real function only when you do not have
special characters.
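Because "real" quietly turns anything it cannot read into missing, it is a good idea to look at what was lost. A minimal check, assuming numvar was created from str_num as above:

```stata
* Values that were not blank in the string but failed to convert
list str_num if missing(numvar) & str_num != ""
```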
|
|
encode city, gen(citynum)
|
Sometimes you have a legitimate string variable such as
city names. To use this variable in a statistical analysis it must be
numeric. The "encode" command will accomplish this. A nice feature
of this is that the character values will be used to automatically create
value labels for the new numeric variable.
|
|
decode citynum2, gen(cityname)
|
To convert a number into a string, you can use the
"decode" command. One caveat to the decode command is that the
numeric variable must have value labels assigned.
|
|
gen city_str2 = string(city_num)
|
If you have too many values to bother making labels for,
you can still make the numeric to string conversion using the
"string" function with the generate command.
|
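For long ID numbers, the "string" function can be given a display format as a second argument, which avoids the rounded or scientific-notation results the default format can give for large numbers. The sketch below assumes a hypothetical nine-digit numeric variable called id_num.

```stata
* Fixed format: no decimals, padded with leading zeros to 9 digits
gen id_str = string(id_num, "%09.0f")
```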
Date Variables
Date variables in Stata are a special case of numeric
variables. As mentioned before, dates in Stata are the number of days (or
months or quarters) from January 1, 1960. Treating dates this way makes it easy
to compute the time between two dates. Stata has many
functions for working with dates and many display formats for them as well. We
only have time to discuss the most common or useful ones, so you are encouraged
to read about them on your own.
Often, dates are entered into data files as string variables:
"01JAN1958", "Feb. 25, 1990", or "19/5/93". We
must, of course, convert these into numeric data, but it's not as
straightforward a conversion as simply removing a dollar sign. Fortunately,
Stata makes these conversions rather easy. One sticking point with Stata
though, is that it really likes the years to be four digits. This is not always
the case, but we can still deal with it.
|
gen datevar=date(str_date, "MDY")
gen datevar=date(string_date19, "MD19Y")
gen datevar=date(string_date00, "MDY", 2010)
|
Whenever you have a date that has been entered as a single
string variable, you can use the "date" function with the gen
command to convert it. The string form of the date must have some kind of
delimiter separating the month, day and year. Generally, if it is obvious to
you what the month, day and year are, then Stata will be able to make the
conversion. The "MDY" portion tells Stata the order of the month,
day and year in the string variable. Since Stata 10 it is written in capital letters; earlier versions used lowercase ("mdy").
If you have only two-digit years, and they are all in the
same century, then you can specify that century before the "Y".
If, on the other hand, your dates are not all in the
same century, then you must specify what the latest year might be. With 2010, each two-digit year is read as the latest matching year that is not after 2010, so "09" becomes 2009 and "11" becomes 1911.
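Strings that the "date" function cannot read become missing without any error message, so it pays to check the conversion. The sketch below assumes Stata 10 or later (hence the capitalized "MDY") and uses the made-up name datevar2.

```stata
gen datevar2 = date(str_date, "MDY")

* Show the number of days as a readable date
format datevar2 %td

* Strings that were not blank but could not be converted
list str_date datevar2 if missing(datevar2) & str_date != ""
```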
|
|
gen birthdate=mdy(b_month,b_day,b_year)
gen intvdate=mdy(i_month,i_day,1999)
|
Sometimes, dates are entered as separate variables for the
month, day and year. The "mdy" function allows us to convert these
to date variables.
|
|
todate strdate, gen(num_date) p(yymmdd)
|
The "todate" command is not part of official
Stata, you must install it yourself. The todate command lets us convert dates
that do not have any kind of delimiter. As you might guess, the
"p(yymmdd)" tells Stata the pattern of digits in the variable.
There are other options so you should install it and check the help file.
|
|
gen age_today=d(17sep2002)-birthdate
|
We can also enter dates directly with "d()". Subtract the earlier date from the later one; the result here is the age in days.
|
|
gen yearvar=year(birthdate)
gen monthvar=month(birthdate)
gen dayvar=day(birthdate)
|
Stata also has functions to extract different parts of
a date from a date variable. There are a few others you may find useful (help
dexfcns).
|
|
gen age_intv=(intvdate-birthdate)/365.25
|
Now that we have two date variables, we can determine how
old each person was at the time of the interview.
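The division by 365.25 gives age with a decimal part. If you need age in completed years, one common approach, sketched here with the made-up name age_intv_yrs, is to round down:

```stata
* Completed years of age at interview (approximate, based on 365.25-day years)
gen age_intv_yrs = floor((intvdate-birthdate)/365.25)
```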
|
|
gen calendardate=ym(year,month)
gen age_calendar=(calendardate-ym(b_year,b_month))/12
|
Often your data may only be in monthly intervals. Remember, “12” in monthly data means
12 months after January 1, 1960.
Now we can also calculate age at any point in the
calendar.
|
|
format birthdate %d
format calendardate %tmmcy
|
Stata provides many display formats for your convenience.
The "%d" will make this value display as "09SEP2002". For
monthly data, we can use the "%tmmcy" format.
|
Just a quick note about "century-months." Century-months were invented to facilitate analysis of monthly data and are computed by multiplying the number of years since 1900 (the two-digit year, for dates in the 1900s) by 12 and adding the number of the month, where January=1, February=2 and so on. Thus, January of 1900 is the century-month 1 and January 1960 is the century-month 60*12+1 = 721. Since January 1960 is month 0 in Stata, to convert from a century-month to a Stata month, you only need to subtract 721 from the century-month:
. gen statamonth=centurymonth-721
. format statamonth %tmmcy
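The arithmetic can be checked with the display command. The month chosen below, May 1990, is just an example; the "//" comments assume a do-file.

```stata
display 60*12 + 1          // 721, the century-month for January 1960
display ym(1960, 1)        // 0, the Stata month for January 1960
display (90*12 + 5) - 721  // 364, May 1990 converted from its century-month
display ym(1990, 5)        // 364, the same month computed by Stata
```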
On to the next lesson, Sorting, By-groups and Indexing
|