The Office of Population Research at Princeton University

August 20, 2008



Stata 8 Refresher

Table of Contents


Introduction

Log and Do Files
   Log files: Keep a record of what you are doing
   Do Files: How to re-run all your commands

Data Management:
   Reading Data
   Insheet: Reading data from a delimited file (ie, from Excel)
   Infile: Reading data where variables are separated by spaces
   Infix: Reading data where the variables are in specific columns

Examining Your Data
   Describe: Listing all your variables
   Codebook: Detailed information about your data
   List: Displaying your data

Creating Variables
   Generate: Many ways to create new variables
   Destring: Change string variables to numeric
   String Variables: How to work with character data
   Date Variables: How to work with dates
   Dummy Variables: Creating dummy variables
   Extended Generate: Creating complex variables such as the mean of a group
   Recode: Changing the values of variables
   Encode: Creating a numeric variable from a string
   Variable/Value Labels: Make your output easier to read
   Variable Notes: Attach more information on your variables
   Drop/Rename: Deleting variables and changing their names
   Order, aorder and move: Re-arrange the order of your variables

File Management:
   Sort: Put all your observations in a certain order
   Append: Combining two datasets by "stacking" them
   Merge: Combining two datasets by matching their observations
   Collapse: Creating a dataset of means, sums, etc
   Contract: Creating a dataset of frequencies
   Expand: Duplicating observations
   Reshape: Converting variables to observations and vice-versa

Basic Commands
   Summarize: Calculating means, standard deviations, etc
   Inspect: More information on your variables
   Tab: Display frequencies, crosstabs, chi-square, etc
   Tabstat: Create tables of summary statistics
   "By" Group Processing: Execute commands on a group-by-group basis

Fancy Commands
   Macro Variables: Shorthand way of referring to many variables
   For: Execute the same command on many variables
   Foreach: Execute the same command on many variables
   While: Do repetitive tasks
   Xi: Create dummy variables for interactions of categorical variables

Useful Commands
   Duplicates: Search for duplicate observations
   ds3: A very detailed version of "describe"
   codebook2: A different version of "codebook"
   bigtab: For three-way crosstabs and tables when you get "too many values" errors
   xcollapse: An extended version of collapse
   xcontract: An extended version of contract
   dmerge: For merging two data sets
   dsconcat: To append several data sets in one command
   log2do2: Convert a log file to a do file
   log2html: Convert a log file to an html file

Introduction

This document is meant as a refresher for persons who already have some experience with Stata. It is geared toward those who use the windowing version (whether MS Windows on a PC or XWindows on Unix). The majority of commands described here will work in a non-windowed environment as well.

You can click on the : to see what the output from a particular command or commands looks like. There is also a page with all the output so you can print it out easily. You can also click on the command names to see the Stata help page. Use the "back" button on your browser to return to this page.

We also have a printer-friendly version of this document.

Only the basic forms of the various commands are described here. You can find more detail in the Stata manuals and by using the "help" command in Stata. You can also find the help text on the web at:

http://www.stata.com/help.cgi?command

Just replace command with the name of the command you need help for. This is often easier than using the help command since it is easier to read and won't clutter your output window.

A few of the commands described here are part of what's known as the SJ (Stata Journal) or user-submitted programs. If you try a command and get a message that the command doesn't exist, it is most likely due to this (or due to a typo). In this case, please see our instructions on how to install these SJ and user-written programs.

Last, but certainly not least, are the Stata FAQs. The Stata FAQs are a wonderful resource on how to do many things in Stata - they're worth looking through just for the heck of it!

Log and Do Files

Log files

Log and do files are very useful. Logs keep a record of what commands you have issued and their results during your Stata session. Do files are good for long series of commands that may need to be "tweaked" to work properly. They are also necessary to replicate things that you have done on new or modified datasets.

  You can create a log file with the command:

 . log using filename.log

Where filename is any name you wish to give the file. You will find it helpful to use names that will help you to remember what you did during that session. Using the ".log" extension will automatically create the log file in plain text format, which you can then import into a word processor. If you do not supply this extension, Stata will create the file in its own format (SMCL) and you will then have to translate the file to text format before you can use it in any other program. If you do not want to use the .log extension but still want the file in text format, you can use the "set logtype" command:

    . set logtype text
    . set logtype text, permanently
The first example sets the log type for the current session only. The second sets the log type for all sessions. You can, of course, change the permanent setting again later.

By default, everything displayed on the screen will be recorded in the log file.

If you work for a long time, this file can get to be very large and have quite a bit of unnecessary output. So, if you are trying different things to "see what happens," you may want to start and stop the logging several times in one session. You can stop the logging with the command:

     . log off

and restart it with the command:

     . log on

You can also continue or replace a previously written log file with the append or replace option:

    . log using filename, append
    . log using filename, replace

The append option simply adds more information to the file, whereas the replace option erases anything that was already in the file. Be careful!  It’s often helpful to put comments in the log file to help you remember why you did something. You can do this by simply preceding anything you type with a "*".

But what happens if you forget to start a log file? In the Windows version of Stata, you can click on the icon in the top left corner of the Review window and save the contents of the window to a file. You can then run the file as a do-file and save the output.

You can also automatically create a log file of just the commands you run from the Command window with the command:

     . cmdlog using filename.log

Be sure to use a different name from the regular log file though!!

Regular log files and command log files can be run at the same time.
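
As a rough sketch of how these commands fit together in one session (the file name "analysis1" and the comment text are just placeholders):

    . log using analysis1.log, replace
    . * first look at the data
    . d
    . log off
    . * anything typed here is not recorded
    . log on
    . log close

The "log close" command stops logging and closes the file.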

 Do Files

A "do" file is a set of commands just as you would type them in one-by-one during a regular Stata session. Any command you use in Stata can be part of a do file. Do files are very useful, particularly when you have many commands to issue repeatedly, or to reproduce results with minor or no changes.

You can use any editor you like to create do-files, although Stata has its own, which we recommend using. If you use another editor such as MS Word, be sure to save the file in a plain text format.

To use Stata's editor, simply click on the do-file button at the top of the Stata window and start typing your commands. If you are beginning a new program, save the file immediately so it has a name. Also, in the do-file editor, under the "Edit" menu, go to "Preferences" and make sure that "Auto-save on Do/Run" is checked. This way, every time you run the program it will be saved automatically.

You do not need to run the entire program every time you make a change to it. You can run just selected commands or all commands from a certain point (the cursor) to the end of the file. This can be very useful when testing different forms of commands. Simply highlight the command or commands you wish to run, go to the "Tools" menu and select "Do selection." You will also see "Run" and "Do" in the Tools menu. The difference between them is that Run executes the commands but does not show any output. Run is useful only when you need to execute many commands (or a command that produces a lot of output) before running the one in which you are most interested. Typically, if you do not want or need to see the output of a command, you are better off using the "quietly" command prefix.

The only thing you need to remember is that if a command is longer than one line, then you need to use the #delimit command. In Stata, hitting the return key tells Stata to execute the command. In a .do file, the return key is at the end of every line, so you need a way of telling Stata that the command is longer than one line. Here is an example of a short do file we’ll call mydofile.do:

    log using mydofile               /* start a log file */
    * the next line sets the delimiter to ";"
    #delimit ;
    use mydata                       /* open the data file */ ;
    des                              /* describe the file */ ;
    collapse (mean) var1 var2 var3 (median) medvar=var1,
        by(var4)                     /* collapse the data */ ;
    save mydata2                     /* save the collapsed data */ ;
    #delimit cr
    * the delimiter is now back to the "enter" key
    clear                            /* clear the data from memory */
    log close                        /* stop logging */
     

Data Management

Data management is perhaps the most important part of your work in Stata. It is how you get your data into a format in which you can run your analyses properly. Major mistakes have been the result of poor data management rather than poor analysis.

 

Reading Data

There are three basic ways of reading data into Stata. "Insheet" is for spreadsheets saved as "CSV" or "tab-delimited" files from a package such as Excel, "infile" is for data in what are called "flat" files, and "infix" is for when you have data that has to be read in using special formats. Sometimes data will come with a SAS and/or SPSS program to read the data. In these instances, it is better and easier to use the program provided and then convert the data to Stata with DBMS/Copy or Stat/Transfer. Otherwise, you will have to write your own program. Be aware that Stata can only read data that are in ASCII (also called "raw" or "text") format. If the codebook mentions anything like "zoned decimal" or "binary integer," then you must use SAS or SPSS.

 

 

This is an example of a spreadsheet from Excel. It cannot be brought into Stata without some editing and saving it as a ".csv" file.

 

Insheet

The insheet command is very useful in reading in data from a spreadsheet. There are some conventions required though:

 

    1. The first line should have Stata variable names and the second line begins the data. Stata variable names can have up to 32 characters and must begin with a letter or underscore and cannot have spaces. Sometimes the first row of your spreadsheet has descriptions that do not conform to this. In these instances, Stata will use the first 32 characters, not including the spaces, to create the variable name and use the description as the variable label. If two or more variables have the same first 32 characters, Stata will choose its own names (usually something like "var5") for the subsequent variables.
    2. Missing numeric data should be coded as an empty cell, not a space, dot, or any other non-numeric data. Often, 0, 9, or 99 is used to code missing numeric data; this is fine as long as these are not also valid values for that variable.
    3. When some spreadsheets create a Comma Separated Values or Tab-Delimited file they do not add commas or tabs to the end of a line if the cells at the end of that line are empty. This will confuse Stata which relies on the commas/tabs to tell it where the values are. You can avoid this problem by adding another column of 1’s (or any other character) to your spreadsheet. You can drop this variable once you have read it into Stata.
    4. The file must be specifically saved as a "comma separated values" or "tab-delimited" file in Excel. You can do this by going to "File", then "Save As…", then choosing either option. If your data contain commas (such as "Princeton, NJ") then you MUST choose tab-delimited otherwise Stata will get confused as to which commas separate the variables and which are part of the value. When you do this, Excel will tell you that it can only save the current sheet; click "OK." Then it tells you that the sheet may have features that are not compatible with .txt files; click "Yes." You must close the spreadsheet before you can use the file in Stata. When you close the spreadsheet and Excel asks if you want to save the changes, say "No." This is counter-intuitive, but the changes it’s asking you about are the changes it needs to make the spreadsheet a regular Excel spreadsheet again.
 

 

 

 

Once you have completed the changes and you are in Stata, give the command:

    . insheet using filename.csv

to read in the file. If you get an error message about "wrong number of values," then you have a problem with the number of commas or tabs in the file; see #3 above. You may also get a "No room to add observations" or "No room to add variables" error. In these instances, you must increase the amount of memory Stata uses with the "set mem" command.
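
If the file already follows the conventions above, a session might look like the sketch below. The file names are placeholders. The "clear" option removes any data already in memory, and "comma" or "tab" tells Stata which delimiter to expect rather than letting it guess:

    . insheet using filename.csv, comma clear
    . insheet using filename.txt, tab clear
    . d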
 

Infile

"Infile" may be useful if you have downloaded a data file from the web and it conforms to these specifications:

    1. The file should NOT have variable names on the first line.
    2. Character variables that have spaces in them, such as full names, must be enclosed in quotes.
    3. Numbers can have commas and minus signs, but not dollar or percent signs (although this can be fixed with "destring").
    4. Infile assumes that the variables have spaces between them and that there are no blank spaces where it expects data (missing data need to be represented by something).

The command to read a file with infile is:
    . infile var1 var2 var3 using mydata.txt

Essentially, insheet assumes that commas or tabs separate the variables, and infile assumes that blanks separate the variables. It may be easier, though, to bring a file into Excel first, save it as a csv, then use insheet.
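
To make the four conditions concrete, here is a sketch of a small file that infile can read. The file contents and the variable names (name, age, income) are made up for illustration. Suppose mydata.txt contains:

    "John Smith" 34 52000
    "Mary Jones" 41 .
    Lee 29 38000

String variables must be declared with a "str" type and length before the variable name, and the missing income is represented by a "." so that every line has three values:

    . infile str20 name age income using mydata.txt
    . l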

Infix

Infix is used to read raw data in specific columns into Stata. This is common when you have data from a particular source and a codebook or data dictionary describing the data.

There are two basic forms of infix: one that allows you to specify the columns on the command line and one in which you write your own Stata dictionary to describe the file. The first form looks like this:

    . infix var1 1-3 var2 4 str name 10-20 using mydata.raw

This form is good when you only want to read a few variables from the file. The syntax is rather simple. You give the variable name followed by the column or columns in which it can be found. If the variable is a string (often called "alphanumeric" or "character") then you must precede the variable name with the "str" designation. Lastly, you tell Stata the name of the raw data file. You can also specify which observations to read by adding an "if" condition after the file name:

    . infix var1 1-3 var2 4 str name 10-20 using mydata.raw if var2==1

The second form is similar; however, the information must be entered in a separate file and the infix command is a little different. The file into which you enter your specifications should have the extension ".dct". The file name can be anything, but it makes sense to give it the same name as the data file it is intended to read. A simple dictionary file, named "mydata.dct" for example, looks like this:

    infix dictionary using mydata.raw {
        var1 1-3
        var2 4
        str name 10-20
    }

You would save this file, then from the command line in Stata type:

    . infix using mydata.dct

or

    . infix using mydata.dct if var2==1

Sometimes raw data files have more than one line of data for each observation, often referred to as "multiple records" or "multiple cards." Stata can read these types of files as well with just a slight modification of the dictionary file:

    infix dictionary using mydata.raw {
        3 lines
        1: var1 1-3
           var2 4
           str name 10-20
        2: var3 2-5
           var4 6
        3: var5 1-2
           str var6 24-50
    }

Here, the "3 lines" tells Stata how many lines of data there are for each observation. The "1:", "2:" and "3:" tell Stata on which specific lines certain variables can be found. You do not need to read data from all lines. If, in the above example you did not need var3 and var4, you could leave the designation for line 2 out altogether. The "3 lines" and "3:" remain, though:

    infix dictionary using mydata.raw {
        3 lines
        1: var1 1-3
           var2 4
           str name 10-20
        3: var5 1-2
           str var6 24-50
    }

Please note that "multiple records" data files are not the same as "hierarchical" data files. Hierarchical files have different types of information on different lines. A common example is the Current Population Survey which has Household, Family, and Person records in each file. To process these kinds of files in Stata, you must read each type of record separately - being sure to include the proper identification variables - then merge the files together.

Examining your Data

Describe

Once you have the data in Stata, you will want to make sure that all the variables are there and that they are in the format you need. You can do this with the "describe" command. Describe, which can be abbreviated as simply "d," will provide basic information about the file and the variables. If you do not list variables, all the variables in the file will be described. You don’t have to load the data into Stata to be able to describe it, though. The first command below describes three variables in the data currently in memory; the second and third describe a file on disk without loading it:

    . d var1 var2 var3  :
    . d using census
    . d var1 var2 var3 using census

This can be very useful if the file is too large to fit into memory. By default, Stata starts with 1 megabyte of memory. Often, your dataset will be larger than this and you will need to increase the amount of memory Stata uses. To do this, look at the "size" line in the describe output; this is the size of the file in bytes. Since 1,000,000 bytes is a megabyte, files larger than this will not be loaded into Stata unless you increase the memory with the "set memory" command:

    . set mem 10m

This example increases the memory to 10 megabytes. Make sure that you give yourself some extra memory so you can create new variables and/or add observations.

  See also: ds3

Codebook

Stata can produce a rather detailed codebook for your data. The command:

    . codebook state  :
will produce some useful information about the state variable. By default, though, any variable that has more than nine unique values will only have "examples" displayed. To override this, use the "tab(#)" option:
    . codebook state, tab(50)
The codebook command can also be used to identify potential problems in your data:
    . codebook, problems
Finally, using the "all" option will produce a great amount of information about your data:
    . codebook, all
The "all" option can be combined with the "tab(#)" option but be aware that the "tab()" option will apply to all your variables.

See also: codebook2

List

It is often useful to just look at the data without doing any kind of analysis. The list command, abbreviated as "l", will let you do this. Simply giving the "l" command will display all of the data on your screen; if you specify certain variables in the command, then only those variables will be displayed:
 

    . l  :
    . l state pop death-divorce   :

By default, Stata will insert lines every five observations. Sometimes this is fine, sometimes it can be confusing. There are three ways of overriding the default:

    . l state pop death-divorce, separator(10)  :
    . l state pop death-divorce, sepby(region)  :
    . l state pop death-divorce, clean  :
The first option, "separator(10)", tells Stata to draw the lines every ten observations. You can also use "separator(0)" to suppress the lines altogether. The "sepby(region)" option tells Stata to draw a line between each value of region; usually, you will want to sort your data by region first. Finally, the "clean" option suppresses all lines around the display.

Stata will also display the labelled values of your variables. Sometimes you want to see the actual values instead. In these instances, use the "nol" option:

    . l state pop death-divorce, nol   :

Creating Variables

 
Stata can store data as either numbers or characters, but it will allow you to do most analyses only on numeric data. Sometimes, when you use insheet or infile, a numeric variable gets read in as a character, or string, variable. This may be due to a ‘space’ character or a ‘.’ in one of the cells. In these cases, you will need to convert the variable to numeric before you can analyze it.

There are different types of numeric variables - byte, int, long, float and double - and the differences among them are simply how much space they take up in the file. In most cases, you will not need to concern yourself with these differences.
 

Generate

Often, you will need to create new variables based on the ones you have already. The two most common ways of creating new variables are "gen" and "egen." Here are some examples of gen:

    . gen poplt18 = poplt5 + pop5_17
    . gen cumpop = sum(poplt18)
    . gen id=_n   :

In the first example, we generate a new variable called "poplt18" which is simply the sum of poplt5 and pop5_17. In the next example, we create a cumulative sum of poplt18 (the value of cumpop for each observation is the sum of poplt18 for that observation and all previous observations). Finally, we create an "id" variable by simply setting it equal to "_n", which is Stata’s way of numbering the observations in a dataset. This id variable can be very useful when you need to sort your data in different ways (see sorting below). The variable "_n" can be used at any time, but you must remember that it refers to the number of the observation in the current order, not the original order. Here are some more examples of gen and _n:

    . gen newpop = 0 if _n==20
    . gen lagpop = poplt18[_n-1]   :

In the first example we simply set newpop equal to a specific value for the 20th observation (all other observations are set to missing). In the second, we create what is called a "lagged" variable, lagpop, which is the value of poplt18 from the previous observation, designated by the "[_n-1]." You can use any valid arithmetic expression inside the brackets. Here are two last examples:

    . gen living = cumpop - death if state=="Louisiana"
    . replace living = pop - death if state!="Louisiana"

Here, we generate a new variable, "living", which is the difference of cumpop and death for Louisiana. Note two things here: first is the double "=" after "if", and second, the use of the quotes around Louisiana. When you refer to string variables in equality expressions such as this, you must put the quotes around the value so Stata will not confuse it with a variable name. Had we left the quotes out, Stata would have thought we wanted to generate living for observations where the variable state equals the variable "Louisiana." In the second example we use "replace" instead of "gen." The reason for this is that in the first example, Stata creates values only for those observations fulfilling the "if" condition. All other observations get "missing" values. We want, then, to replace those missing values with something. The "!=" in Stata means "not equal to"; the "!" means "not" and takes the place of the first "=."
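
The same variable can also be built in one step with the "cond()" function, which returns its second argument when the condition is true and its third argument otherwise. This is only an alternative sketch; the name "living2" is made up so that it does not clash with the variable created above:

    . gen living2 = cond(state=="Louisiana", cumpop - death, pop - death)
    . count if living != living2

If the two approaches agree, the count should be 0.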

Destring

Sometimes a numeric variable will be read in as a character variable by mistake. This happens most commonly when you "insheet" a file that may have stray characters in it or perhaps "." was used to indicate missing values. The "destring" command will convert the character form of the variable into a numeric one. The simplest way to use the command is:

    . destring varlist, replace force   :

which will convert all variables in varlist to numeric. The "replace" option tells Stata to simply replace the variable with the same name. The "force" option tells Stata to make the conversion and convert any data that contain non-numeric values to missing. An example of this is when your data contain percent and/or dollar signs. Since you usually want to keep the numbers and just get rid of the signs, you can tell Stata to ignore certain characters:

    . destring income, ignore("$") replace

    . destring poppct, replace percent

In the first example, the "$" is simply removed from the value and converted to numeric format. In the second example, the "percent" option tells Stata to remove the "%" and divide the value by 100 to create a decimal value. If you do not want the decimal value, then use the same syntax as for the dollar sign. In some cases, you may want to keep the original form of the variable for reference or other purposes. One example might be if you have symbols for different currencies depending on what country the observation is from. In this instance you can have Stata create a new variable:

    . destring income, generate(income2) ignore("$")

Here, Stata will keep income as is and create a new variable, income2, which is the numeric representation of income.

Other times, a number may be stored as a string even though there are no special characters in it. This happens most commonly when reading a .csv file that may have had an actual space for a missing value or some other stray character in that column. In these instances, you can use the "real" function with the generate command to convert the variable to numeric:

    . gen inc_num = real(inc_str)   :

Character Variables

Stata assumes that the variable you are creating is numeric unless you tell it otherwise. Sometimes, though, you will want to create a variable whose values are strings. Here’s how:

    . gen name="John"
    . replace name="Albert" in 1/5

As above, the values for string variables must be enclosed in quotes. The maximum length of string variables in Intercooled Stata is 80; in Stata SE, it is 244.

The "in" expression simply tells Stata to use observations 1 to 5, inclusive, in the order they appear in the dataset. You can specify any consecutive range. The "in" expression can also be used with numeric variables, and with most other commands.

    . gen yn="yes" if team=="Atlanta" & name=="john"
    . replace yn="no" if team=="New York" | team=="Boston"

Here we extend the use of the "if" clause to include two expressions. As you may have guessed, the "&" means "and." The "|" means "or." One thing you must be wary of is that the values of character variables are case-sensitive, so "John" is not the same as "john."

    . gen day=substr(code,1,2)

The substr function can be helpful; it specifies certain parts of a character variable to be used. The first argument specifies the variable to be picked apart (code), the second argument is the starting character, and the last argument is the number of characters to be pulled out. In this example, starting with the first character in the variable "code," two characters will be extracted.
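
As a small worked sketch, suppose "code" holds values such as "15JAN" (a made-up layout: two digits for the day followed by three letters for the month). The commands below pull the two pieces apart, convert the day to a number with "real," and use "lower" to get around the case-sensitivity problem mentioned above. The new variable names are placeholders:

    . gen dd = substr(code,1,2)
    . gen mon = substr(code,3,3)
    . gen daynum = real(dd)
    . gen yn2 = "yes" if lower(name)=="john"

For a code of "15JAN", dd is "15", mon is "JAN", and daynum is the number 15. The last command matches "John", "john", and "JOHN" alike.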
 

Date Variables

Dates are a special type of numeric variable. Dates are actually stored as a number which counts the number of days, weeks, months, etc. from January 1, 1960, commonly referred to as "elapsed dates". So, January 1, 1960 = 0, dates before that are negative numbers, and dates after that are positive numbers. Often, dates must be entered as string variables because they need to have a "/" or have the month as letters instead of numbers. In these instances, you can use the "date" function to generate an actual date variable:

    . gen date = date(strdate,"mdy")  :

The "mdy" tells Stata the order of month, day and year as they appear in the values. Stata assumes that you have a four-digit year; if you have a two-digit year, then replace the "mdy" with "md19y" or "md20y", depending on what century your data are in. One caveat is that Stata needs to have some kind of separator between the month, day and year so values like 01/01/1998, 01-01-1998, and 01jan1998 are all valid, but 010198 is not. If there is no delimiter, then you can use the "todate" command:

    . todate olddate, gen(newdate) p(yyyymmdd)

This tells Stata to take the variable "olddate" and create a new one called "newdate" which is a Stata date variable. The "p(yyyymmdd)" indicates the pattern of numbers in olddate.

    . gen newdate = mdy(monthvar,dayvar,yearvar)

The "mdy" function can put three variables together to create a new date variable. The variables do not have to be called "month," "day," and "year," but they do have to follow that order. One more caveat is that the year must be stored as a four-digit number, otherwise Stata will not know what century you want. You may need to do some more programming to add "1900" to the value. Once you have your date variable created you will probably want to display it on the screen to make sure you did it correctly. The problem is that Stata will display the elapsed date, which is just a number, not a date as we are used to seeing them. You will need to format the variable so it displays as a "normal" date, but is still used as an elapsed date:

    . format date %d

This will display dates as "01jan1998." Other formats can be used, check the Stata manual. Here are some other useful functions for working with dates:

    . gen day = day(date)            (numeric day of the month)
    . gen weekday = dow(date)        (numeric day of the week)
    . gen month = month(date)        (numeric month)
    . gen year = year(date)          (year)
    . gen qtr = quarter(date)        (numeric quarter of year)
    . gen week = week(date)          (numeric week of year)

All of these examples assume that the date variable is an elapsed date. The "day" function returns the number of the day, e.g., 23. The "dow" function returns a number from 0 to 6: 0 if it is a Sunday and 6 if it is a Saturday. "Month" returns a number from 1 to 12. "Year" returns the year as a four-digit number.
 

Often we need to use the date as a value, say in an if .... expression. Since Stata stores dates as actual numbers, we would (in the olden days) have to calculate how many days from Jan.1, 1960 our particular date was. In Stata, we can identify a date with a function:

    . gen before=1 if date<=d(10Feb1995)
    . gen target=d(05Aug1989)

The d() function allows us to enter the date in words rather than as numbers.
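
Since an elapsed date is just a count of days, you can check the numbering yourself with the "display" command and do arithmetic directly on date variables. In this sketch, "startdate" and "enddate" are hypothetical date variables:

    . display d(01jan1960)
    . display d(02jan1960)
    . display d(31dec1959)
    . gen duration = enddate - startdate

The three display commands show 0, 1, and -1, and "duration" is the number of days between the two dates.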

This is only a small subset of what Stata can do with dates. We have a more complete tutorial on Time Series commands.

A note on "Century-months": You can easily convert century-months to Stata monthly dates by simply subtracting 721 from the century-month then formatting the result as a monthly date:

    . gen statamonth=centmonth-721
    . format statamonth %tmmcy
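
The 721 comes from the two starting points. A century-month counts months from 1900 with January 1900 = 1, while a Stata monthly date counts months from 1960 with January 1960 = 0. As a worked check for January 1998 (the "ym" function returns the Stata monthly date for a given year and month):

    . display (1998-1900)*12 + 1
    . display 1177 - 721
    . display ym(1998,1)

The first command gives the century-month, 1177; the second and third both give 456.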

Dummy variables

Sometimes we need to generate a "dummy" variable, or variables. Stata makes this very easy:

    . tab region, gen(region_dummy)   :

Here, Stata will create a dummy variable for each value found in region.  So, region_dummy1 = 1 if it is in region 1, 0 otherwise; region_dummy2 = 1 if it is in region 2, 0 otherwise; region_dummy3 = 1 if it is in region 3, 0 otherwise. See also the discussion of the xi: command prefix below.
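
If you need only one dummy, you can also create it directly from a logical expression, which Stata evaluates as 1 when true and 0 when false. In this sketch we assume, purely for illustration, that region 3 is the South; the "if region<." keeps observations with a missing region missing instead of coding them 0:

    . gen south = (region==3) if region<.
    . tab region south, missing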

Extended Generate (egen)

Egen, or "extended generate," is useful when you need a new variable that is the mean, median, etc. of another variable, for all observations or for groups of observations. Egen is also useful when you need to simply number groups of observations based on some classification variable. Here are some examples:

    . egen sumpop= sum(pop)
    . egen meanpop=mean(pop), by(region)
    . egen num_st=count(id), by(region)
    . egen popcat=cut(pop), at(671742.5,3066433,1.11e+07)
    . egen group = group(region popcat), label missing

In the first example, we simply create a variable whose value for each observation is equal to the sum of pop for all observations (all observations will have the same value). The second example shows how to create a variable that is the mean of another variable for each group designated by region (all observations within a group will have the same value). The third example simply counts the number of non-missing id values in each region (this basically counts the number of observations you have for each region). The fourth example cuts pop into categories at the breakpoints listed in at(): each observation is assigned the lower breakpoint of the interval it falls in, and values below the first breakpoint or at or above the last one are set to missing. The last example simply assigns a number to each of the groups created by the combination of region and popcat. The "label" option tells Stata to create a value label for the created variable; if either of the variables used to create the new variable has a value label assigned, that label will be included. Finally, the "missing" option tells Stata to treat missing values as a group of their own when creating the groups, rather than leaving the new variable missing for those observations.

One thing to watch out for when using egen is that some functions treat missing data as zero. This is very different from regular functions and analysis commands in Stata.
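
As an illustration of this point, compare adding two variables directly with using egen's "rsum" (row sum) function. The tiny dataset and the variable names a, b, tot1 and tot2 are made up for this sketch:

    . clear
    . input a b
      1 2
      3 .
      end
    . gen tot1 = a + b
    . egen tot2 = rsum(a b)
    . list

In the second observation tot1 is missing, because b is missing, but tot2 is 3, because rsum treats the missing b as zero.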
 

Encode

Use encode when the original variable is, indeed, a character variable (such as gender being coded as "m" and "f") and you need numbers instead. The encode command does not produce dummy variables; it just assigns a number to each group defined by the character variable and attaches the original strings as value labels. In this example, state was the original character variable and st is the new numeric variable:

    . encode state, gen(st)

Recoding Variables

Using the "gen ... replace" commands is okay if we are creating new variables or want to change the values of a few specific observations. Often, however, we don't need or want to create a new variable, or we want to change all occurrences of a specific value. For instance, many surveys will have "999" to indicate a missing value rather than just a blank and an "888" to indicate that the question was not applicable to this particular respondent. Since Stata will include "999" and "888" as valid values in any calculations, we would want to change all of them to actual missing values. To do this, we use the "recode" command:

    . recode x 999=., gen(new_x)
    . recode x 999 888=., test
    . recode x y z 1/5=1, pre(new_)
    . recode x 1/5=1 6/10=2 11 12=3 *=4

The first example will change all "999" to missing and create a new variable called "new_x". The second will change all "888" and "999" to missing, but before doing so, will test to make sure you have not defined overlapping recodes (such as 3=1 1/5=2, where "3" would be recoded twice). The third example will change all values in the variables "x", "y" and "z" from 1 to 5 (inclusive) to 1, and create new variables called "new_x", "new_y" and "new_z". In the last example, the "*" tells Stata "all other values not previously stated".
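
One way to check a recode is to create a new variable and cross-tabulate it against the original. This is only a sketch: "x_clean" is a name made up for the illustration, and the rule is written in the parenthesized form that recode also accepts:

    . recode x (999 888=.), gen(x_clean)
    . tab x x_clean, missing

The "missing" option on tab makes the newly missing values show up in the table, so you can confirm that only 999 and 888 were changed.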

Variable and Value labels

Now that we’ve created all these new variables, we’ll want some way of keeping track of what each one is and what its values mean. We can do this by creating variable labels and value labels. These labels are not necessary in Stata; they just make the output easier to read. Variable labels correspond to the variable names, whereas value labels correspond to the different values a variable may have. Here’s how to create them:

    . label variable sumpop "Sum of Pop. for region"

This assigns a label to the variable, sumpop, so whenever the variable name sumpop is displayed on the screen, the description "Sum of Pop. for region" will be displayed as well. Variable labels must be 80 characters or less. Often, though, it’s more helpful to label the values a variable can take so you don’t have to memorize them or keep referring to a codebook.

    . label define grp 1 "NE, mid" 2 "NE, low" 3 "NE, hi"
    . label values group grp
    . label define grp 4 "N.ctrl, low",add
    . label define grp 5 "N.Cntrl, low" 4 "N.Ctrl, mid" .a "Missing",modify

Value labels correspond to the actual numbers in the data. They are used so that the printout will show "NE, mid," "NE, low," and "NE, hi" instead of "1," "2," and "3". First you have to "define" the label, then associate that label with a variable. Each label can be associated with more than one variable, so if there are several "yes/no/maybe" questions, you just have to define one value label and use it for all the questions. You may associate only one value label with a particular variable, however. The maximum length for a label in Intercooled Stata is 80 and in Stata SE, 244 characters. Rarely, if ever, will you want to make labels this long; most output in Stata will only display the first 12 characters, anyhow. Finally, only integers and extended missing values can have value labels assigned.

Stata has some new commands to help you manage your labels:

The "label list" command will list all your value labels or just the ones you specify.

    . label list
    . label list cenreg

"numlabel" will add the numeric value being labelled to the label itself. This can make output easier to use.

    . numlabel cenreg, add

The labelbook command produces a "codebook" for your labels.

    . labelbook
    . labelbook cenreg

Variable Notes

Similar to variable labels, Stata allows you to attach "notes" to variables. The notes are displayed only when you specifically list them or in the codebook. Variable notes can be useful for keeping track of information such as skip patterns or formulas for constructed variables. To create a note for a variable use the command:

    . note state: this variable is the full state name
    . note death: created on TS
    . note _dta: this is useless data

To list all the notes in a file:

    . notes

Drop and Rename

Well, now we’ve created many new variables and converted some old ones. Since we no longer need all of these variables, we’ll want to eliminate some of the ones we don’t really need and perhaps rename some of the ones we keep.

    . drop strdate strdate2 olddate
    . rename num_st state_num

Here we drop the variables, "strdate," "strdate2," and "olddate". Be careful! Once they are dropped, they can’t be picked up again unless you clear the data and use it again. You can drop as many variables as you want in one command. The second example renames the variable "num_st" to "state_num." You can rename only one variable at a time. You can also drop observations:

    . drop if size==1

will drop all observations whose size is equal to 1.
 


order, move, aorder:

The commands "order," "move" and "aorder" can be used to change the order of variables in your data. This can be particularly useful when you have variables that are conceptually grouped but not necessarily grouped within your data. An example might be when you have information collected on the same variables at different times; all of the time 1 variables are followed by all of the time 2 variables and so on. Specifying a range of variables such as "var1-var4" would give you ALL of the variables between "var1" and "var4," not just those that begin with "var". To circumvent this, you can re-order the variables:

    . order var1 var2 var3 var4

would move these four variables to the beginning of your data in the order specified. You could also simply switch the order of two variables:

    . move var3 var2

this moves var3 to the position occupied by var2. Sometimes the exact order doesn't matter, but if you have many variables, it's easier to search the variable list if they are in alphabetical order:

    . aorder

will put all of the variables in alphabetical order. You can also specify a variable list with this command.

File Management

 Once you have all of your variables in the format you want, you’ll need to get the entire file in a format that will make it easier to use. You can do this by sorting, appending, merging and collapsing.

 
Sort

Sort puts the observations in a data set in a specific order. Some procedures require the file to be sorted before they can work. You can sort a file based on more than one variable. One thing you must be careful of is that, by default, Stata will randomize the order of observations that have the same values of the sort variables. This is why creating the id variable mentioned earlier is important. With the id variable, you will always be able to go back to the original order and start over. You can override this behavior by using the "stable" option, which keeps tied observations in their current relative order.

    . sort state
    . sort region state
    . sort region state, stable
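
The advice about an id variable can be sketched as follows. This assumes the variable is called "id" and does not already exist in your data (the name is only an example):

    . gen long id = _n
    . sort region state
    . sort id

The last command returns the data to the order they were in when id was created.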

Append

Sometimes, you have more than one file of data which you need to analyze. One case may be that you have two files with the same variables but different observations. The other case is when you have two files with the same observations, but different variables.

 Use append when you simply want to add more observations, in other words you already have data for 1999, and now you want to include new observations from 2000 for the same variables. For example:

    . use ds1
    . append using ds2
    . append using ds2, keep(var1 var2 var3)

will add the observations from the file, ds2 (what Stata refers to as the "using" dataset), to the end of ds1 (what Stata refers to as the "master" dataset). Any variables with different names in the two files will have missing values for the observations from the other dataset. By default, all variables from the "using" dataset will be included in the new file. You can use the "keep()" option to specify which variables you want to keep (from the using data set).

See also: dsconcat

Merge

If you have two files that have the same observations, but different variables, then you’ll want to "merge" them so you can use all of the variables at once. When you merge datasets, you are adding new variables to existing observations rather than adding observations to existing variables. There are two basic kinds of merges, a one-to-one and a match.

A one-to-one merge simply takes the two files and puts them side-by-side, regardless of whether the observations in each dataset are in the same order. A simple one-to-one merge isn’t very common, and isn’t recommended, even if you think all the observations match. Sometimes there may be a "glitch" that can throw off the order of the data. It’s just as easy to do a match-merge, and you’ll be able to check that all the observations matched correctly.

There are some "rules" when doing a match-merge. First, each dataset must have a "key" variable or variables by which the observations can be matched - social security number is a good example. Second, both datasets must be sorted by this key variable. You can use more than one key variable if you want, such as month and year. These key variables must be of the same type (string or numeric) in both datasets. By default, if a variable is present in both datasets, then the values in the master dataset will remain unchanged. If the variables in the using dataset have the same names as ones in the master dataset, but represent additional information, then you will need to rename them in one of the datasets before you can merge them.

    . use ds1
    . sort a
    . merge a using ds2

In this example we simply match observations using the variable "a" as the key. The using dataset, ds2, must already have been sorted by "a" and saved in that order.

    . merge a using ds2, keep(var1 var2 var3)

The "keep()" option allows you to merge only certain variables from the using dataset instead of all of them.

Sometimes, you will want to update the data you have, namely, fill in missing values. Using the update option will accomplish this:

    . merge a using ds2, update

Values in the master dataset will be changed only if they are missing.

In some instances, you may want to replace non-missing values in the master dataset. Use the replace option in addition to the update option to do this:

    . merge a using ds2, update replace

"Replace" will not work by itself; it must always be used with update. Stata will not, under any circumstances, change a non-missing value in the master dataset with a missing value from the using dataset. You must use the "replace" command discussed earlier to do this.

Stata will automatically create a variable called "_merge" which will indicate the results of the merge. Always check this variable to make sure that you got what you wanted. Here are what the possible values of _merge mean:

1 = Observations from the master dataset that did not match observations from the using dataset.
2 = Observations from the using dataset that did not match observations from the master dataset.
3 = Observations from both datasets that did match.
4 = Observations from both datasets that did match, missing values in the master dataset were updated.
5 = Observations from both datasets that matched, values in the master dataset disagree with those in the using dataset.
Usually, you will want all the observations to have a value of 3. Values of 4 and 5 occur only when you use the "update" option. With "update" alone, a value of 5 means the conflicting master value was kept; with "update replace," it means the master value was replaced by the value from the using dataset. If you need to merge another dataset, you have to rename or drop _merge first, otherwise you will get a "_merge already defined" error.
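
Putting the rules above together, a complete match-merge might look like the following sketch. It assumes, as in the examples above, two files named ds1 and ds2 that share the key variable "a". Note that it re-saves ds2 in sorted order, so work on a copy if you need the original file left untouched:

    . use ds2, clear
    . sort a
    . save ds2, replace
    . use ds1, clear
    . sort a
    . merge a using ds2
    . tab _merge
    . drop _merge

The "tab _merge" step is the check recommended above; drop _merge only after you are satisfied with the result.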

    See also: dmerge

Collapse

Collapse is used when you want to create a dataset containing the means, sums, etc., of the various groups in the data. One example might be when you have one dataset of monthly data and another of yearly data and you need to analyze both sets of information together.

    . collapse (mean) pop income, by(region)

This will create a dataset of the means of pop and income by region. If you have four regions in your original dataset, then you will have four observations in the collapsed dataset. You must be careful, though, because Stata will compute the statistics on a variable-by-variable basis. If one variable has more missing observations than another, the means (and any other statistics you request) will be based on a different number of observations. This is not always an acceptable practice. To avoid this, you must use the "cw" option for "casewise deletion." This means that Stata will drop any observation that does not have data for both pop AND income, thereby ensuring that all the statistics will be based on the same number of observations.

    . collapse (mean) pop income, by(region) cw  
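
Collapse can also compute several different statistics at once. In the sketch below, the new variable names (meanpop, totpop, n_pop) are made up for the illustration; each "newname=oldname" pair tells Stata what to call the result. Remember that collapse replaces the data in memory, so save your file first:

    . collapse (mean) meanpop=pop (sum) totpop=pop (count) n_pop=pop, by(region)
    . list

Here n_pop holds the number of non-missing values of pop in each region, which is a handy way to see how many observations each mean is based on.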

See also: xcollapse  

Contract

Contract works similarly to Collapse; however, it creates a file of frequencies or crosstabulations. It is similar to outputting the results of a "tab x y" command to a new data file. The syntax is:

    . contract region, freq(reg_freq)

Where "region" is the variable we wish to contract and "reg_freq" is the name of the new variable that will contain the frequency of each value of region. By default, Stata will not output observations for which there are zero frequencies; you must use the "zero" option to include them. You can include more than one variable in the command, in which case you get the frequencies of each combination of values, as in a crosstabulation.

See also: xcollapse.

Expand

As you might guess, "expand" is the opposite of "contract" - sort-of, anyway. Expand will duplicate observations based on a specific value you give it or a variable:

    . expand 5

will create four additional observations for each observation you have in your dataset. On the other hand:

    . expand popvar

will replace each observation with as many copies as the value of popvar for that observation.
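
To see how the two commands relate, you can contract a dataset and then expand it again using the frequency variable. This is just a sketch using the region example from above; only the variables named in contract survive, so it is not a true "undo":

    . contract region, freq(reg_freq)
    . expand reg_freq
    . tab region

The final tab should show the same frequencies for region as the original data did.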

Reshape

Reshape is one of Stata's more complicated but necessary commands. It is used to convert data from "wide" to "long" format and vice-versa. These formats are sometimes referred to as "repeated measures" and "time-series" formats, respectively. Here are a couple of examples:

Wide:

id    sex   inc90   inc91   inc92
1     M     20      22      25
2     M     33      37      42
3     F     24      24      26
4     F     55      60      65

Long:

id    sex   year  inc
1     M     90    20
1     M     91    22
1     M     92    25
2     M     90    33
2     M     91    37
2     M     92    42
3     F     90    24
3     F     91    24
3     F     92    26
4     F     90    55
4     F     91    60
4     F     92    65

The long format is also referred to as "person-years". To convert the wide format to long, you would use the command:

    . reshape long inc, i(id) j(year)

To convert the long format to wide, you would use the command:

    . reshape wide inc, i(id) j(year)

In the reshape command, the "i" indicates what variable(s) identify the rows (observations), and the "j" indicates the columns (variables). In the first example, going from wide to long, the unique rows of data are identified by the variable "id," the variables we want reshaped begin with the prefix "inc" and we want the new variable indicating what column the value was from to be called "year". You should notice two things about these commands and the data. First, we did not specify "sex" in either command; Stata will automatically shift all other variables accordingly. Second, and more importantly, look at the names of the variables that were transposed: "inc90", "inc91" and "inc92." As you can tell, they all have the same prefix, "inc," and have the year as a suffix. This is the easiest way for Stata to work with your data. The years (or whatever the numbers represent) do not have to be consecutive, nor do they even have to be in numerical order.

Sometimes, though, the data do not represent years, rather they represent different types of observations. For example, instead of "inc90," "inc91" and "inc92" the variables were named "mominc," "dadinc" and "kidinc."

 id  dadinc    kidinc    mominc  
  1   25        20        22      
  2   42        33        37     
  3   26        24        24     
  4   65        55        60     

There are no numbers, and the "inc" is now the suffix rather than the prefix. Stata can still handle this:

    . reshape long @inc, i(id) j(member) string

would produce:


id       member    inc        
1        dad        25          
1        kid        20          
1        mom        22          
2        dad        42          
2        kid        33          
2        mom        37          
3        dad        26          
3        kid        24          
3        mom        24          
4        dad        65          
4        kid        55          
4        mom        60          

There are two new items here: The "@" and "string." The "@" tells Stata where the "j" identifier is within the variable name. If you had a variable name such as "mom90inc," then you would use "mom@inc." "String" tells Stata that the identifier is a string rather than a number. The "@" and "string" can be used by themselves as well.
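
If you would like to try reshape before using it on real data, the following sketch types in the first two rows of the wide example from above (leaving out "sex" to keep the input numeric) and converts them to long format and back again:

    . clear
    . input id inc90 inc91 inc92
      1 20 22 25
      2 33 37 42
      end
    . reshape long inc, i(id) j(year)
    . list
    . reshape wide inc, i(id) j(year)
    . list

The first list should show six observations (three years for each of two ids), and the second should show the original two.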

Basic Commands

Now that you have your data in a format you want, check it before doing any analyses. This can save you quite a bit of frustration later on.

Summarize

Sum, short for summarize, will give you the means, sd’s, etc. of the variables listed. If you don’t list any variables, it will give you the information for all variables. If a variable you thought was numeric shows up as having 0 observations and no mean, then, most likely, Stata still thinks it’s a character variable.

    . sum
    . sum pop income

The "detail" option gives you additional information about the distribution of the variable.

    . sum pop , detail  

Inspect

Inspect is another easy way to eyeball the distribution of a variable.

    . inspect pop

Tab

Tab, short for tabulate, will produce frequency tables. By specifying two variables, you will get a crosstab. There are other options to get the row, column and cell percentages as well as chi-square and other statistics; check the manual.

    . tab region
    . tab region group
    . tab region group, row col chi2
    . tab1 region pop state

The "tab1" command will produce one-way tables for each of the variables listed.

See also: bigtab, xcontract


tabstat:

"Tabstat" is similar to "sum" except that it allows you to specify which (and more) statistics are to be displayed in the table, as well as how the table is oriented. Here are a couple of examples:

    . tabstat pop income, stats(n mean sd range iqr)

    . tabstat pop income, stats(n mean sd range iqr) c(var)

    . tabstat pop income, stats(n mean sd range iqr) c(stat)

    . tabstat pop income, stats(n mean sd range iqr) by(region)

"By" group processing

Sometimes you’ll want to run a command or analysis on different groups of observations. The "by variable:" prefix is the same thing as running the command with separate "if" statements for each group. You must sort the data before you can use the "by:" prefix.

    . sort region
    . by region: summ pop
    . by region: gen poplag=pop[_n-1]

One alternative to sorting first would be to use the "bysort" version of the "by:" prefix, which will sort the data for you:

    . bysort region: summ pop

Often when you have time-series data (i.e., monthly data for several persons) you want to execute a command for each person, but need to ensure that the data are in the proper time order. One example would be if you want to calculate the change in income from one month to the next for each person. To do this, you may be tempted to use:

    . by person month: gen inc_diff=income-income[_n-1]

What this will do, however, is treat each combination of person and month as its own group. Since each group then has only one observation, there is never a previous observation to subtract, and you will get all missing data! Instead you should use:

    . by person (month): gen inc_diff=income-income[_n-1]

Putting "month" in parentheses tells Stata to check that the data are sorted by month within person, but to execute the command at the person level.
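
If the data are not already sorted by person and month, the "bysort" form can do the sorting and the by-group processing in one step. A brief sketch, using the same variable names as above:

    . bysort person (month): gen inc_diff=income-income[_n-1]
    . list person month income inc_diff

The first month for each person will have a missing inc_diff, since there is no earlier month to subtract.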

Sometimes, the "by" prefix can produce more output on the screen than you would care to look at. To suppress output to the screen, but not to the log file, use "quietly."

    . quietly by region: gen poplag=pop[_n-1]

Last, but not least, some commands require a minimum number of observations to execute correctly. If one of your "by" groups does not have enough observations, Stata will stop. To circumvent this, you can use the "rc0" option (that's a zero, not the letter "O"):

    . by person, rc0: logit income month

Fancy commands

 

Macro variables

Sometimes you need to use many variables the same way many times. One example would be if you want to run regressions on different dependent variables using the same set of independent variables. This can mean a lot of typing. One way around this is to create a macro variable. A macro variable is simply a variable that has as its value a particular string. This string can be anything you specify: a list of variables, a particular command, or whatever. Whenever Stata comes across the macro variable, it will interpret it to mean whatever string you set the variable to.

    . local macvar var1 var2 lagtot lag2
    . reg depvar `macvar'

In this example, when Stata "sees" `macvar' in the regression command, it replaces it with the string "var1 var2 lagtot lag2." Pay careful attention to the two different quotation marks: the first quote in `macvar' is the left single quote (backquote), usually found under the ~. The second is the ordinary right single quote (apostrophe), usually found under the ".
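
The case mentioned at the start of this section, several dependent variables with one set of independent variables, might look like the sketch below. The variable names (age, educ, income, wealth) are hypothetical:

    . local indep age educ
    . reg income `indep'
    . reg wealth `indep'
    . display "`indep'"

The display command is a handy way to check what the macro currently contains. Keep in mind that a local macro defined inside a do file exists only while that do file is running.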

 
For

Other times, you will want to perform the same command on several variables. Again, this can mean a lot of typing. The "for:" command can also be very useful in these situations. It can use several types of variable lists, and with a little ingenuity, can be very powerful. The general syntax is:

    for [id in] listtype list : stata command

where "id" is a symbol used to identify variables or values, "listtype" is what kind of entries are in the "list" to be changed, and the Stata command you want applied. There are four types of lists:

  • var - a list of existing variable names
  • new - a list of names to be used in creating new variables
  • num - a list of numbers
  • any - a list of words, numbers, or symbols

Here are a few examples of the for command:

    . for var var1-var25: replace X=. if X==99
    . for var m*: replace X=. if X==99
    . for new var1-var3 : gen X=0
    . for any a b c: gen str3 X="aaa"
    . for new v2-v5 \ num 2/5: gen X = myvar^Y

The X in the for command represents the entries in the first list, one by one; when there is a second list, Y represents its entries. In the first two examples, Stata would interpret the command as though you had typed replace var1=. if var1==99, replace var2=. if var2==99, and so on. The "m*" in the second command simply tells Stata to change any variable beginning with the letter "m". The third example creates "new" variables named var1 var2 and var3 and sets them all equal to 0. The fourth example does basically the same thing, but since we are not using consecutive names as in the third example, we must use the list type of "any." The fifth example shows how to use two lists to create new variables. It creates the variables v2, v3, v4, and v5 and sets them to the 2nd, 3rd, 4th and 5th powers of "myvar" (an existing variable), respectively. You could even repeat several Stata commands on the right hand side of the ":" by separating them with a "\" as we did on the left side.

Sometimes using an "X" as the id might be confusing - especially if our variable names have an "X" in them. If this is the case, we can change what Stata uses as the id:

    . for @ in var X1-X25: replace @=. if @==99

will accomplish the same thing as the first example above.

Sometimes it's a good idea to test the command before you actually run it. The "dryrun" option will allow you to do this:

    . for var var1-var10, dryrun: replace X=. if X==99

will show the commands that will be executed without actually executing them.


Foreach

Closely related to For is "Foreach." Foreach allows you to run several commands in the same fashion as For:

    . foreach var of varlist var1-var10 {
         replace `var'=. if `var'==99
         replace `var'=.a if `var'==88
      }

Note the use of the macro variable `var', which takes the name of each variable in the list in turn. Here 99 is changed to missing and 88 to the extended missing value .a. Foreach can use the same types of lists as For.


While

"While" is very similar in purpose to "for"; however, it can be much more powerful (and complicated). "While" is used mostly in programs to perform several tasks many times by iterating through the data. For example, let's say you need to perform a regression on each of 100 companies and save the betas as variables. If you had nothing better to do with your life, you could issue the following commands:

    . reg y x if company==1
    . replace beta=_b[x] if company==1

100 times each, changing the company number each time. Or, you could write a quick program using "while" to do it for you:

    1)  program define regit
    2)    local i=1
    3)    while `i'<=100 {
    4)       reg y x if company==`i'
    5)       replace beta=_b[x] if company==`i'
    6)    local i = `i' + 1
    7)    }
    8) end
    

We are assuming here that you have a variable called "company" that identifies each company sequentially from 1 to 100. It's important that the companies are identified in this way so Stata will not issue an error when it comes to a missing id. To create a variable like this, use the egen group() command explained above. We have also assumed that you created a variable called "beta" with missing values for all observations.

Let's look at this program line-by-line:

    1) This tells Stata that we are defining a "program", which we will call "regit." Once we are done writing the program, we can execute it by entering "regit" on the Stata command line like any other command.

    2) We must first define a macro variable (see the explanation above on macro variables) which we set to 1 to begin the iterations. You do not have to start at 1, it's just convenient.

    3) Here's the "while" command. It tells Stata to repeat the following commands (up to the "}" on line 7) as long as i is less than or equal to 100. Note the ` and ' around the i; the actual value used here will change each time the program loops through the commands.

    4) and 5) These are the actual commands we want to execute on our data. As in line 3, the `i' will change its value each time we loop through the program.

    6) This line increments the value of i each time it is executed.

    7) The "}" closes the block of commands that "while" repeats.

    8) This simply tells Stata that this is the end of the program.

Here's what Stata "sees" as it executes the commands in the program. The first iteration:

    3) while 1<=100 {
    4)    reg y x if company==1
    5)    replace beta=_b[x] if company==1
    6)    local i = 1 + 1

The second iteration:

    3) while 2<=100 {
    4)    reg y x if company==2
    5)    replace beta=_b[x] if company==2
    6)    local i = 2 + 1

The third iteration:

    3) while 3<=100 {
    4)    reg y x if company==3
    5)    replace beta=_b[x] if company==3
    6)    local i = 3 + 1

Get the idea? This will continue until it gets to the 101st iteration and then stop because, of course, 101 is not less than or equal to 100. You would be very smart to write a program like this in a do file as you will most likely need to tweak it a few times to get it to work the way you want.
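
For a simple counter like this one, Stata's "forvalues" command is a possible alternative that handles the incrementing for you. The following is a sketch only, under the same assumptions as the program above: companies numbered 1 to 100, an existing variable "beta" filled with missing values, and variables named y and x:

    . forvalues i = 1/100 {
         quietly reg y x if company==`i'
         quietly replace beta=_b[x] if company==`i'
      }

The "quietly" prefix keeps 100 regression tables from scrolling past; remove it if you want to see them. Here _b[x] is the coefficient on x from the most recent regression.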


XI:

The xi: command prefix is used for interaction expansion. Interaction expansion means creating dummy variables for interactions of categorical variables. There may be instances in which you will run a regression (or some other analysis) when you will want to estimate the interaction terms for these variables. Normally, you would have to create new dummy variables for each term, which would be rather tedious. The xi: prefix does this for you. Here's an example:

    . xi: reg income i.region i.popcat
    . xi: reg income i.region*i.popcat

The i.region and i.popcat tell Stata that you want each of those variables expanded, so you would have the dummy variables created for you. The first example simply creates these dummies for you and uses them in the regression. The second example will also produce the interaction and main effect terms for them. You can use this to interact categorical variables with continuous ones as well:

    . xi: reg income i.region*popcat
    . xi: reg income i.region|popcat

The first example will give you the interactions and main effects of region and popcat, whereas the second example will leave out the main effect of popcat.

As with any regression using dummy variables, one category must be left out. By default, xi: will leave out the group with the lowest value. If this is not what you want, then you can change it by issuing the command:

    . char region[omit] 3

which would leave out the region group coded 3 instead. If the variable is a string variable, simply substitute the string value for the number in the example.

Useful Commands

Some of the commands in this section are part of the STB collection of commands and may not be installed on your machine. They are, however, very easy to install yourself.


duplicates:

The "duplicates" command is used to find and, optionally, delete duplicate observations. This can be very useful when you are cleaning data from several sources. There are several forms of the command, so you are encouraged to look at the manual for an explanation for them; we'll look at just a couple here.

    . duplicates report
    . duplicates report region

The "report" form of the duplicates command simply tells you how many duplicate observations there are. If you do not specify any variables, then all variables are used and all variables must be the same for the observation to be considered a duplicate. By specifying a variable or variables, only those variables are used to determine duplicate observations. This can be useful when you need to use more than one variable to identify a single observation.

    . duplicates examples region

The "examples" form of the command will list examples of the duplicate observations.

    . duplicates drop region, force

The "drop" form of the command will drop all but the first observation in each group of duplicates. The "force" option is required whenever you list variables, as a reminder that the dropped observations may differ on the variables you did not list. Be very careful in using this!
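
Before dropping anything, it may help to flag the duplicates and look at them first. The sketch below uses the "tag" form of the command together with preserve and restore; the variable name "dup" is made up for illustration:

    . * dup = number of other observations with the same value of region
    . duplicates tag region, generate(dup)
    . * look at the observations that have at least one duplicate
    . list if dup > 0
    . * keep a copy of the data so the drop can be undone
    . preserve
    . duplicates drop region, force
    . * ... check the result, then bring the original data back
    . restore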


ds3:

The "ds3" command is a very versatile version of describe. Unlike describe, though, ds3 can list different types of variables (strings, numeric, byte, float, etc.) as well as variables that have labels or other attributes. It can also do case-insensitive searches. Here are some examples:

    . ds3, str detail

This will list all string variables.

    . ds3, has(vallabel)

This will list all variables that have value labels defined for them. Perhaps more useful:

    . ds3, not(vallabel)

will list all variables that do NOT have value labels defined. You can even find variables that have a specific value label defined:

    . ds3, has(vallabel yesno)

which will find all variables that have been assigned the "yesno" value label. To do the same, but ignoring case:

    . ds3, has(vallabel yesno) case

Last, but not least, you can use ds3 to select certain variables for other commands:

    . ds3, num

    . sum `r(varlist)'

This will do a "sum" on all numeric variables - leaving out all the string variables.
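
If ds3 is not installed, later versions of Stata offer similar selections through the built-in "ds" command. This is only a sketch, assuming a Stata release whose "ds" accepts the has() and not() options; check "help ds" on your machine first:

    . * select the numeric variables only, then summarize them
    . ds, has(type numeric)
    . summarize `r(varlist)'
    . * list the variables that have no value labels
    . ds, not(vallabel)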


codebook2 and vartyp

The codebook2 command is similar to the standard "codebook" command, but takes a different approach to determining what information to display for the variables. The codebook2 command should be used in conjunction with the "vartyp" command. Both of these commands were written at OPR and can be installed from within Stata with the commands "ssc install codebook2" and "ssc install vartyp".

The codebook2 command will display information on a variable or variables based on whether the variable is an "id", continuous, discrete or a date variable. You can specify the type in the codebook2 command, or by using the vartyp command. To specify the type using the vartyp command:

    . vartyp person house, s(id)
    . vartyp weight height, s(cont)
    . vartyp sex race state, s(disc)

These commands will set the type for the variables to "id", "continuous" and "discrete", respectively. If you do not list any variables, then all variables receive the type. Once the type(s) have been set, you can use the codebook2 command:

    . codebook2
    . codebook2 person weight sex
    . codebook2 educ, t(disc)

The first example will produce information on all variables in memory. The second will display information for just the variables listed. The third shows that you can, for the duration of the command, set or re-set the type for a particular variable or variables.

You can also produce codebook information for variables in another dataset:

    . codebook2 age ethnicity state using(mydata)

This will display codebook information for the listed variables in the "mydata" dataset. The "t()" option can also be used.


bigtab

The "bigtab" command is for when you want three-way crosstabs and/or you get a "too many values" error from the "tabulate" command. "Bigtab" was written at OPR and can be installed from within Stata with the command "ssc install bigtab".

The bigtab command can produce one-, two-, and three-way frequency tables with an unlimited number of values for each variable. It can also produce row, column and cumulative frequencies and percentages, and it can save the results in a separate dataset. Here are some examples:

    . bigtab height weight
    . bigtab sex race agecat, sep(sex)
    . bigtab sex race, saving(sexracefreq)
    . bigtab sex race, all

The first example simply produces a two-way crosstab; the "tabulate" command would produce a "too many values" error for this. The second adds the "separator()" option to produce separator lines after each level of "sex" (see the list command). The third example saves the results of the crosstabs (the various frequencies and percentages) in a separate dataset called "sexracefreq". The last example tells Stata to include all labeled values in the output, even if those values do not appear in the data. In other words, if you defined a value label for the variable race as 5 "Martian" but had no Martians in your data, the output would still include a line for the Martians with a frequency of zero.


xcollapse

The xcollapse command is an extension of the collapse command. The most useful feature of this command is the ability to save the results in a separate data file without replacing the data already in memory.

    . xcollapse sex race agecat, saving(new) list(*)

will create a dataset called "new" and list it to the output screen. Xcollapse also has other options for how the data are displayed.
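
For comparison, getting the same effect with the standard collapse command takes several steps. A minimal sketch, assuming variables named income, sex and race and an output file called "new":

    . * set the current data aside
    . preserve
    . * replace the data in memory with mean income by sex and race
    . collapse (mean) income, by(sex race)
    . list
    . save new, replace
    . * bring back the original data
    . restore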


xcontract

The xcontract command is an extension to the contract command in that it has many more options. Perhaps the most useful option is the one to save the resulting data set in another file without destroying the one currently in memory.

    . xcontract sex race agecat, saving(new) list(*)

will create a dataset called "new" and list it to the output screen. Xcontract also has other options for how the data are displayed.


dsconcat

The dsconcat command will allow you to append several files in one step. You can also create a variable that will identify which file a particular observation came from:

    . dsconcat file1 file2 file3, dsid(filenum) obsseq
    . dsconcat file1 file2 file3, dsname(filename)

In the first example, the data sets "file1," "file2," and "file3" will be appended, in the order listed, to the data set in memory. The "dsid(filenum)" option tells Stata to create a new variable called "filenum" that will identify from which file that observation came. In other words, all the observations from the file "file1" will have the value "1" for the variable "filenum," all the observations from the file "file2" will have the value "2" for the variable "filenum" and so on. The "obsseq" option numbers the observations from 1 to however many observations are in the particular dataset from which they came.

In the second example, the "dsname(filename)" option is like the "dsid()" option except that the variable "filename" takes on the name of the data set rather than simply being numbered sequentially. Essentially, "dsid()" creates a numeric identifier and "dsname()" creates a string identifier.
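
For comparison, a rough equivalent of the "dsid()" idea using only the standard append command might look like the following. It is a sketch that starts from "file1" rather than from data already in memory, and it assumes none of the files already contains a variable called "filenum":

    . use file1, clear
    . generate filenum = 1
    . append using file2
    . * observations that were just appended have filenum missing
    . replace filenum = 2 if filenum == .
    . append using file3
    . replace filenum = 3 if filenum == .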


dmerge

"dmerge" is just like the merge command with a couple of useful exceptions: If the variable "_merge" already exists in either the using or master data sets, it is automatically dropped. Also, if the data sets are not sorted by the key variables, "dmerge" will sort them first. This is particularly nice if your using data set is not sorted.

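By way of comparison, here is roughly what the standard merge command requires in Stata 8 to do the same job. The file names "master" and "other" and the key variable "id" are made up for illustration, and later Stata releases use a different merge syntax:

    . * sort the using data by the key and save it
    . use other, clear
    . sort id
    . save other, replace
    . * sort the master data and remove any old _merge variable
    . use master, clear
    . sort id
    . capture drop _merge
    . merge id using other
    . tabulate _merge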

log2do2

The "log2do2" command can be a big time-saver if you have not done all of your work in a do file (but you did do all your work in a do file, didn't you???) or if, for whatever reason, you have a log file but no corresponding do file. log2do2 extracts all of the commands from a log file and creates a do file.

    . log2do2 somelogfile.log , saving(somelogfile.do) replace

You must look at the resulting do file before you run it, though. Sometimes unusual output may confuse Stata into thinking that results are commands, and if a command stretches over more than one line, it may not be written correctly to the do file. Also, if there are any commands in the log file that did not work the first time around, they won't work now, either!


log2html

Like log2do2, log2html converts a log file, in this case to an HTML (web) document. Unlike log2do2, however, log2html MUST use a .smcl, or default Stata, log file as its input. This is because .smcl files have a markup language of their own, which is converted into HTML, another markup language. So, if you have been creating your logs as plain text (using the .log extension or the ", text" option), you will need to re-run your program and change the type of log file.

    . log2html somelogfile.smcl, replace

By default, log2html will create a file with the same filename as your log file but with the extension ".html". It assumes that the log file is in your current working directory and will create the html file there as well.
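
Putting the pieces together, a session that produces a suitable log might look like the sketch below. The file name simply follows the example above:

    . * the .smcl extension gives the default (SMCL) log format
    . log using somelogfile.smcl, replace
    . * ... your commands here ...
    . log close
    . log2html somelogfile.smcl, replace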