The Office of Population Research at Princeton University

August 28, 2008


Creating and Manipulating Variables

Stata would be pretty useless if you didn't have any variables to analyze. Although we have talked some about variables so far, in this module we will talk about them in much more detail. By the end of the lesson you will know the different types of variables there are, how to create them and how to manipulate them.

In Stata, there are three basic kinds of variables: numeric, string and date. Although dates are technically stored as numeric data, their use is different from regular numeric data, so we will discuss them separately.

Numeric variables are just what you would think: numbers. They can be integers, decimals, negative and positive. In the output from the describe command, numeric variables can show up with several different "Storage Types." The specific meaning of these is nothing you need to be concerned with right now, just know that anything other than "str" is a number. Many analysis commands like "reg" and "sum" will work only on numeric variables. If you receive a "type mismatch" error, then you are probably trying to do an analysis on a string variable.

String variables, often referred to as "character" or "alphanumeric" variables, are variables whose values may have letters or other special characters in them. It is possible to store numbers as though they were letters (a common source of the "type mismatch" error).

Date variables are a special case of numeric variables. Although they are often entered as strings (e.g., 01JAN1992 or 01/01/92), they must be stored in Stata as numbers to make them useful. Stata has several commands for working with dates and time-dependent data. Briefly, Stata stores all dates as the number of days (or months or quarters, etc.) from January 1, 1960. Dates before then are negative, and dates after are positive. SAS uses the same date as its origin, but other packages do not: SPSS counts seconds from October 14, 1582, and Excel uses January 1, 1900 by default (it can use other dates, so you need to make sure). If you are importing dates from an Excel spreadsheet, you should re-format them as text and import them as strings, then convert them in Stata.

There are four basic commands for creating and manipulating variables: "gen," "egen," "replace," and "recode." The gen and egen (short for "generate" and "extended generate", respectively) commands are used to create new variables. The replace and recode commands are used to change the values of existing variables. You will often use gen and replace together.

Rules for Variables

  • Variable names can have up to 32 characters (8 or fewer is better, though) and must begin with a letter or an underscore. Avoid a leading underscore, since Stata uses such names for its own variables.
  • Variable names are case-sensitive.
  • Use descriptive names – “var1” means nothing.  The question number from a survey is a good choice.
  • When listing variables in commands, you can use the "?" and "*" to represent any single character or any number of characters, respectively.
  • Values for string variables can be up to 80 characters in Intercooled and 244 characters in Special Edition. Anything over these limits will be dropped.
  • Values for string variables are enclosed in double quotes.
  • Missing values for numeric variables are represented with a dot: "." Stata also supports "extended" missing values: ".a", ".b", ".c" up to ".z". These can be useful for coding responses such as "Refused" or "Not Applicable". Remember: missing is not the same thing as zero!
  • Missing values for string variables are represented by two double quotes with nothing in between: "". This is not the same as a blank!
  • Missing values are considered to be greater than any number: all numbers < . < .a < ... < .z. This is very important when using "if" statements and when sorting your data.
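To see why the ordering of missing values matters, here is a minimal sketch using made-up data. The variable names and values are purely illustrative and do not come from any real dataset.

```stata
clear
input id income
1 5000
2 .
3 7000
4 .a
end

* Missing values count as larger than any number,
* so this counts ids 2, 3 and 4 (three observations)
count if income > 6000

* Excluding missing values leaves only id 3
count if income > 6000 & !missing(income)

* Sorting in ascending order puts the missing values last
sort income
list
```

The missing() function used here is true for "." and for all of the extended missing values.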

String Variables

We'll begin with string variables since they are the easiest to work with. As in any package, their values are case-sensitive.

gen firstname="paul"

gen initial="abcdefghij"

To create a string variable, use the gen command. Enclose the values themselves in double quotes.

replace firstname="bob" if employed==4

replace firstname="sue" if mugged != 3

replace firstname="none" if firstname==""

Not everyone's first name is Paul, so we will need to change the values for some observations. The "if" clause allows us to do this. In the first example, only those observations whose value for the variable "employed" is 4 will be changed; all other observations are left as they are. Note the double "=".

The second example will change all observations whose values for mugged are NOT 3. Because a missing value is not equal to 3, this includes observations where mugged is missing.

Finally, we can even change values of a variable based on itself. In this case, we change all the missing names to "none." Note the double equal sign in the if clause!
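The following sketch runs the same kind of commands on three made-up observations so you can see which rows each "if" clause touches. The data values are invented for illustration.

```stata
clear
input employed mugged
4 3
1 1
2 3
end

* Everyone starts out as "paul"
gen firstname = "paul"

* Only the first observation has employed equal to 4
replace firstname = "bob" if employed == 4

* Only the second observation has mugged different from 3
replace firstname = "sue" if mugged != 3

* Result: bob, sue, paul
list
```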

 

Numeric Variables

Creating and manipulating numeric variables is just as easy as working with string variables.

gen numvar1=1

gen numvar2=numvar1+income

gen numvar3=(numvar1/income)*100

Just like with string variables, you can create new numeric variables with the gen command. Any valid mathematical expression is allowed.

replace numvar1=5 if mugged==3

Replacing values in numeric variables works much the same way as for string variables, and we can use "if" clauses here as well.

replace numvar2=income/rand if numvar3>.05

replace numvar2=income/rand if numvar3>.05 & numvar3 != .

One caveat that often comes up is how Stata treats missing values. Since missing values are treated as larger than any number, the expression "numvar3>.05" is also true when numvar3 is missing. This may not be what you really want, so you must include the "& numvar3 != ." to exclude missing values. If your data use extended missing values such as ".a", write "& numvar3 < ." instead, which excludes all of them.

replace numvar2=. if citynum==2 | citynum==5 | citynum==7

This will make numvar2 missing if citynum is equal to 2, 5 or 7.

replace numvar2=. if inlist(citynum,2,5,7)

A very useful function is "inlist", which allows you to simply list the values you want to match.
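As a quick check that the long "|" condition and inlist() select the same observations, here is a small sketch on invented data. The assert command stops with an error if its expression is false for any observation.

```stata
clear
input citynum
2
3
5
.
end

* Both conditions pick out exactly the same observations
assert (citynum==2 | citynum==5 | citynum==7) == inlist(citynum,2,5,7)

* A missing citynum matches neither form, so this counts 2 observations
count if inlist(citynum,2,5,7)
```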

recode mugged 1=2

recode mugged 1=2 3=4

recode mugged 1=2 *=5

recode mugged 1 2 3 4=5

recode mugged 1 2 3 4=5, gen(mugged2)

The recode command can be an easy way of changing the values of a numeric variable (recode only works with numeric variables). All you need to do is provide a list of the values you want to change.

The "*" means all other values not explicitly listed, including missing!

Listing several values before the equal sign, as in "1 2 3 4=5", maps all of them to the same new value. This is one way of collapsing values.

Finally, the "gen()" option tells Stata to create a new variable that will be the recoded version of the original. This is highly recommended so that you do not destroy your original variable!
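Here is a minimal sketch of collapsing values with recode while keeping the original variable, using invented data. It writes the rule in parentheses, which is the form shown in current Stata documentation.

```stata
clear
input mugged
1
2
3
4
5
.
end

* Collapse 1 through 4 into 5 and store the result in a new variable
recode mugged (1 2 3 4 = 5), gen(mugged2)

* Cross-tabulate old against new to confirm the recode did what you meant
tab mugged mugged2, missing
```

Cross-tabulating the old variable against the new one is a cheap way to catch a mistyped rule before it affects an analysis.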

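gen income_dummy=.

replace income_dummy=1 if income>=6000 & income != .

replace income_dummy=0 if income<6000

tab mugged, gen(mugged_dummy)

Dummy variables are numeric variables whose values are 0 and 1. There are two basic ways of creating dummy variables. For a continuous variable, start with a variable that is missing everywhere and use replace with "if" clauses to fill in the 1s and 0s; the "& income != ." keeps observations with missing income from being coded 1. For a categorical variable, the "gen()" option of tab creates one dummy for each category (mugged_dummy1, mugged_dummy2, and so on).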
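A dummy for a continuous variable can also be built in one line, because a logical expression in Stata evaluates to 1 when true and 0 when false. This sketch uses invented income values.

```stata
clear
input income
4000
6000
9000
.
end

* (income >= 6000) is 1 or 0; the if clause leaves missing incomes missing
gen income_dummy = (income >= 6000) if !missing(income)

* Result: 0, 1, 1, .
list
```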


Extended Generate (egen)

Egen is one of Stata's most powerful and useful commands. Like generate, it is used to create new variables, but it is much more than that. Egen can create variables that would be difficult and tedious to create on your own. Some examples are variables whose values are the mean of another variable for each group, such as income for males and females. Egen can also create variables that count the number of observations that fit a certain criterion, or that simply number the observations. The only way to truly see how powerful egen can be is to show a few examples and then have you explore the other available functions on your own.

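egen age_cat = cut(age), at(10,15,20,25,30,35)

egen age_cat = cut(age), group(6)

"cut" is very useful for collapsing variables. You can either specify the lowest value for each new group with the "at()" option, or simply specify the number of groups you want with "group()", which splits the data into groups of roughly equal size. With "at()", each observation is coded with the lower bound of its interval. Any observation with a value less than 10 will be given a missing value for age_cat, and so will any observation with a value of 35 or more, because the last number in "at()" only closes the final interval. Add a break above your largest value if you want those observations kept. The two commands are alternatives: run one or the other, since age_cat can only be created once.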

egen age_mean = mean(age), by(year)

This creates a variable that is the mean of age for each year. In addition to mean, there are min, max, sd, and several other statistics.
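A minimal sketch of a group mean, on four invented observations, shows that every observation in a year receives that year's mean.

```stata
clear
input year age
2000 20
2000 30
2001 40
2001 50
end

* Each observation gets the mean age of its own year
egen age_mean = mean(age), by(year)

* Result: 25, 25, 45, 45
list
```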

egen numobs = count(personid), by(personid year)

"Count" counts the number of nonmissing values of personid within each combination of personid and year, that is, the number of observations for each respondent in each year. This can be used to make sure that you have the same number of observations for each respondent in each year.

egen city_yr = group(cityname year)

egen city_yr = group(cityname year), label

"Group" numbers the groups formed by crossing cityname and year. The groups are numbered consecutively which makes this a good variable to use in analysis. The "label" option causes Stata to use the value labels (if any) of cityname and year in creating city_yr.

egen comp_id=concat(householdid familyid personid),decode p(/)

The "concat" function is very useful when you have two or more variables that you want to combine to form one variable but adding or multiplying them would not make sense. The "decode" option works like the "decode" command in that it uses the value labels to create the new variable. The "p()" option allows you to put a separator character between the values.
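The group and concat functions can be tried on a few invented observations. The id variables below are illustrative and have no value labels, so the "label" and "decode" options are left out.

```stata
clear
input householdid personid year
1 1 2000
1 2 2000
2 1 2001
end

* Number the household-by-year combinations consecutively: 1, 1, 2
egen hh_yr = group(householdid year)

* Join the two ids into one string id: "1/1", "1/2", "2/1"
egen comp_id = concat(householdid personid), p(/)

list
```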


Converting Between String and Numeric Variables

Before we get into date variables, it will be useful to learn how to convert string variables into numeric and vice-versa. Sometimes, for various reasons, a number will get read into Stata as a string. We must convert it before we can do any analyses on it. There may even be times when we want to treat a numeric variable as a string (such as Social Security Numbers or other ID variables), although not as often. There are four ways to make these conversions: the "destring," "decode" and "encode" commands, and the "real" and "string" functions used with the gen command.

destring d_income, gen(income_num) ignore("$")

The "destring" command will convert a string variable into a numeric variable. It is particularly useful when your data include special characters such as dollar or percent signs. The general form of the command is to specify the string variable, a new numeric variable to generate, and the character or characters you want to remove.

destring inc_pct, gen(inc_pct_num) percent

If you have a percent variable with a percent sign, you can use the "percent" option. This removes the percent sign and divides the result by 100, so "45%" becomes .45.

destring inc_pct, gen(inc_pct_num) percent force

Using the "force" option tells Stata that if it can't make a proper conversion, then the new variable should have a missing value for that observation. This is an alternative to the previous command, not a follow-up, since inc_pct_num can only be generated once.
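This sketch applies "percent" and "force" together to three invented string values, one of which cannot be converted.

```stata
clear
input str6 inc_pct
"45%"
"7.5%"
"n/a"
end

* percent strips the % sign and divides by 100;
* force turns the unconvertible "n/a" into a missing value
destring inc_pct, gen(inc_pct_num) percent force

* Result: .45, .075, .
list
```

Without "force", destring would refuse to create inc_pct_num at all because of the "n/a" value.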

gen numvar = real(str_num)

The "real" function simply tells Stata to convert all numbers in str_num into numeric data. Anything that is not a number will be made missing. Use the real function only when you do not have special characters.

encode city, gen(citynum)

Sometimes you have a legitimate string variable such as city names. To use this variable in a statistical analysis it must be numeric. The "encode" command will accomplish this. A nice feature of this is that the character values will be used to automatically create value labels for the new numeric variable.

decode citynum2, gen(cityname)

To convert a number into a string, you can use the "decode" command. One caveat to the decode command is that the numeric variable must have value labels assigned.
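Here is a small round trip through encode and decode on invented city names. It shows that the codes are assigned in alphabetical order and that decode recovers the original strings.

```stata
clear
input str10 city
"Trenton"
"Newark"
"Trenton"
end

* encode numbers the distinct strings alphabetically: Newark=1, Trenton=2
encode city, gen(citynum)

* Show the underlying numbers rather than the value labels
list, nolabel

* decode uses the value labels to rebuild the string
decode citynum, gen(cityname)
assert cityname == city
```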

gen city_str2 = string(city_num)

If you have too many values to bother making labels for, you can still make the numeric to string conversion using the "string" function with the generate command.
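The "real" and "string" functions can be tried directly with display, without any data in memory. The values below are arbitrary.

```stata
* A string that holds a number becomes a number we can do arithmetic on: 4.5
display real("3.5") + 1

* A string that is not a number becomes missing: .
display real("12a")

* A number becomes a string we can join to other strings: 42!
display string(42) + "!"
```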


Date Variables

Date variables in Stata are a special case of numeric variables. As mentioned before, dates in Stata are the number of days (or months or quarters) from January 1, 1960. Treating dates this way makes it easy to compute the time between two dates. Stata has many functions for working with dates and many display formats for them as well. We only have time to discuss the most common or useful ones, so you are encouraged to read about them on your own.
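The numbers behind daily dates are easy to inspect with display and the mdy() function, which takes a month, a day and a year.

```stata
* January 1, 1960 is day 0
display mdy(1, 1, 1960)

* Dates before 1960 are negative: December 31, 1959 is day -1
display mdy(12, 31, 1959)

* Subtracting two dates gives the days between them: 29, since 1960 is a leap year
display mdy(3, 1, 1960) - mdy(2, 1, 1960)
```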

Often, dates are entered into data files as string variables: "01JAN1958", "Feb. 25, 1990", or "19/5/93". We must, of course, convert these into numeric data, but it's not as straightforward a conversion as simply removing a dollar sign. Fortunately, Stata makes these conversions rather easy. One sticking point with Stata though, is that it really likes the years to be four digits. This is not always the case, but we can still deal with it.

gen datevar=date(str_date, "mdy")

Whenever you have a date that has been entered as a single string variable, you can use the "date" function with the gen command to convert it. The string form of the date must have some kind of delimiter separating the month, day and year. Generally, if it is obvious to you what the month, day and year are, then Stata will be able to make the conversion. The "mdy" portion tells Stata the order of the month, day and year in the string variable. (Stata 10 and later expect this pattern in uppercase, as in "MDY".)

gen datevar=date(string_date19, "md19y")

If you have only two-digit years, and they are all in the same century, then you can specify that century before the "y".

gen datevar=date(string_date00, "mdy", 2010)

If, on the other hand, your dates are not all in the same century, then you must specify what the latest year might be. Each two-digit year is then read as the latest year that ends in those digits and does not exceed that limit, so with 2010 here "05" becomes 2005 and "93" becomes 1993.
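The three forms of the date() function can be checked against mdy() with display. This sketch assumes Stata 10 or later, where the pattern is written in uppercase; the date strings are invented.

```stata
* Four-digit year: each comparison displays 1 when the two dates agree
display date("01/25/1990", "MDY") == mdy(1, 25, 1990)

* Two-digit year, all dates known to be in the 1900s
display date("1/25/90", "MD19Y") == mdy(1, 25, 1990)

* Two-digit year with a latest possible year of 2010
display date("1/25/05", "MDY", 2010) == mdy(1, 25, 2005)
```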

gen birthdate=mdy(b_month,b_day,b_year)

gen intvdate=mdy(i_month,i_day,1999)

Sometimes, dates are entered as separate variables for the month, day and year. The "mdy" function allows us to convert these to date variables.

todate strdate, gen(num_date) p(yymmdd)

The "todate" command is not part of official Stata; you must install it yourself. The todate command lets us convert dates that do not have any kind of delimiter. As you might guess, the "p(yymmdd)" tells Stata the pattern of digits in the variable. There are other options, so you should install it and check the help file.

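gen age_today=(d(17sep2002)-birthdate)/365.25

We can also enter dates directly with "d()". Subtracting the birth date from the later date gives a positive number of days, and dividing by 365.25 converts it to years.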

gen yearvar=year(birthdate)

gen monthvar=month(birthdate)

gen dayvar=day(birthdate)

Stata also has functions to extract different parts of a date from a date variable. There are a few others you may find useful (see help dexfcns).

gen age_intv=(intvdate-birthdate)/365.25

Now that we have two date variables, we can determine how old each person was at the time of the interview.
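Putting the pieces together, this sketch builds both dates from separate month, day and year variables for one invented respondent and computes age at interview.

```stata
clear
input b_month b_day b_year i_month i_day
9 17 1970 9 17
end

* Build daily dates from their parts; all interviews took place in 1999
gen birthdate = mdy(b_month, b_day, b_year)
gen intvdate = mdy(i_month, i_day, 1999)

* Days between the two dates, converted to years (just under 29 here)
gen age_intv = (intvdate - birthdate)/365.25

* Show the dates in readable form instead of as day counts
format birthdate intvdate %d
list
```

Dividing by 365.25 is an approximation, so a respondent interviewed on a birthday can come out a tiny fraction below a whole number of years.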

gen calendardate=ym(year,month)

Often your data may only be in monthly intervals. The "ym" function converts a year and a month into a monthly date. Remember, "12" in monthly data means 12 months after January 1960, that is, January 1961.

gen age_calendar=(calendardate-ym(b_year,b_month))/12

Now we can also calculate age at any point in the calendar.
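Monthly dates can be inspected with display in the same way as daily dates. The years and months below are arbitrary.

```stata
* January 1960 is month 0, so January 1961 is month 12
display ym(1961, 1)

* Age in years at June 2000 for someone born in June 1970: 30
display (ym(2000, 6) - ym(1970, 6))/12
```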

format birthdate %d

format calendardate %tmmcy

Stata provides many display formats for your convenience. The "%d" format will make a date display as "09sep2002". For monthly data, we can use the "%tmmcy" format.

Just a quick note about "century-months." Century-months were invented to facilitate analysis of monthly data and are computed by multiplying the number of years since 1900 (the two-digit year, for dates in the 1900s) by 12 and adding the number of the month, where January=1, February=2 and so on. Thus, January of 1900 is century-month 1 and January 1960 is century-month 721. Since January 1960 is month 0 in Stata, to convert from a century-month to a Stata month you only need to subtract 721 from the century-month:

gen statamonth=centurymonth-721
format statamonth %tmmcy
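The conversion can be verified by hand for a single month. For September 2002, the century-month is (2002-1900)*12 + 9 = 1233, and subtracting 721 should give the same number as ym(2002, 9).

```stata
* Century-month for September 2002: 1233
display (2002 - 1900)*12 + 9

* Stata month for September 2002: 512
display ym(2002, 9)

* The two agree after subtracting 721, so this displays 1
display 1233 - 721 == ym(2002, 9)
```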

 

 
