Stata 8 Refresher

Table of Contents

Introduction
Log and Do Files
    Log files: Keep a record of what you are doing
    Do Files: How to re-run all your commands
Data Management:
Reading Data
    Insheet: Reading data from a delimited file (ie, from Excel)
    Infile: Reading data where variables are separated by spaces
    Infix: Reading data where the variables are in specific columns
Examining Your Data
    Describe: Listing all your variables
    Codebook: Detailed information about your data
    List: Displaying your data
Creating Variables
    Generate: Many ways to create new variables
    Destring: Change string variables to numeric
    String Variables: How to work with character data
    Date Variables: How to work with dates
    Dummy Variables: Creating dummy variables
    Extended Generate: Creating complex variables such as the mean of a group
    Recode: Changing the values of variables
    Encode: Creating a numeric variable from a string
    Variable/Value Labels: Make your output easier to read
    Variable Notes: Attach more information on your variables
    Drop/Rename: Deleting variables and changing their names
    Order, aorder and move: Re-arrange the order of your variables
File Management:
    Sort: Put all your observations in a certain order
    Append: Combining two datasets by "stacking" them
    Merge: Combining two datasets by matching their observations
    Collapse: Creating a dataset of means, sums, etc
    Contract: Creating a dataset of frequencies
    Expand: Duplicating observations
    Reshape: Converting variables to observations and vice-versa
Basic Commands
    Summarize: Calculating means, standard deviations, etc
    Inspect: More information on your variables
    Tab: Display frequencies, crosstabs, chi-square, etc
    Tabstat: Create tables of summary statistics
    "By" Group Processing: Execute commands on a group-by-group basis
Fancy Commands
    Macro Variables: Shorthand way of referring to many variables
    For: Execute the same command on many variables
    Foreach: Execute the same command on many variables
    While: Do repetitive tasks
    Xi: Create dummy variables for interactions of categorical variables
Useful Commands
    Duplicates: Search for duplicate observations
    ds3: A very detailed version of "describe"
    codebook2: A different version of "codebook"
    bigtab: For three-way crosstabs and tables when you get "too many values" errors
    xcollapse: An extended version of collapse
    xcontract: An extended version of contract
    dmerge: For merging two data sets
    dsconcat: To append several data sets in one command
    log2do2: Convert a log file to a do file
    log2html: Convert a log file to an html file

Introduction

This document is meant as a refresher for people who already have some experience with Stata. It is geared toward those who use the windowing version (whether MS Windows on a PC or X Windows on Unix). The majority of commands described here will work in a non-windowed environment as well.

Only the basic forms of the various commands are described here. You can find more detail in the Stata manuals and by using the "help" command in Stata. You can also find the help text on the web at: http://www.stata.com/help.cgi?command Just replace command with the name of the command you need help for. This is often easier than using the help command since it is easier to read and won't clutter your output window.

A few of the commands described here are part of what's known as the SJ (Stata Journal) or user-submitted programs.
If you try a command and get a message that the command doesn't exist, it is most likely because the command is one of these programs (or because of a typo). In this case, please see our instructions on how to install these SJ and user-written programs.

Last, but certainly not least, are the Stata FAQs. The Stata FAQs are a wonderful resource on how to do many things in Stata - they're worth looking through just for the heck of it!

Log and Do Files

Log and do files are very useful. Logs keep a record of the commands you have issued and their results during your Stata session. Do files are good for long series of commands that may need to be "tweaked" to work properly. They are also necessary to replicate things that you have done on new or modified datasets.

You can create a log file with the "log using filename" command, where filename is any name you wish to give the file. You will find it helpful to use names that will help you to remember what you did during that session. Using the ".log" extension will automatically create the log file in plain text format, which you can then import into a word processor. If you do not supply this extension, Stata will create the file in its own format and you will then have to translate the file to text format before you can use it in any other program. If you do not want to use the .log extension but still want the file in text format, you can use the "set logtype" command:
. set logtype text, permanently

By default, everything displayed on the screen will be recorded in the log file. If you work for a long time, this file can get very large and contain quite a bit of unnecessary output. So, if you are trying different things to "see what happens," you may want to start and stop the logging several times in one session. You can stop the logging with the "log off" command and restart it with the "log on" command.

You can also continue or replace a previously written log file with the append and replace options:

. log using filename, append
. log using filename, replace

The append option simply adds more information to the file, whereas the replace option erases anything that was already in the file. Be careful!

It’s often helpful to put comments in the log file to help you remember why you did something. You can do this by simply preceding anything you type with a "*".

But what happens if you forget to start a log file? In the Windows version of Stata, you can click on the icon in the top left corner of the Review window and save the contents of the window to a file. You can then run the file as a do-file and save the output. You can also automatically create a log file of just the commands you run from the Command window with the "cmdlog using filename" command.
Be sure to use a different name from the regular log file, though! Regular log files and command log files can be run at the same time.

A "do" file is a set of commands just as you would type them in one-by-one during a regular Stata session. Any command you use in Stata can be part of a do file. Do files are very useful, particularly when you have many commands to issue repeatedly, or to reproduce results with minor or no changes.

You can use any editor you like to create do-files, although Stata has its own, which we recommend using. If you use another editor such as MS Word, be sure to save the file in a plain text format. To use Stata's editor, simply click on the do-file button at the top of the Stata window and start typing your commands. If you are beginning a new program, save the file immediately so it has a name. Also, in the do-file editor, under the "Edit" menu, go to "Preferences" and make sure that "Auto-save on Do/Run" is checked. This way, every time you run the program it will be saved automatically.

You do not need to run the entire program every time you make a change to it. You can run just selected commands or all commands from a certain point (the cursor) to the end of the file. This can be very useful when testing different forms of commands. Simply highlight the command or commands you wish to run, go to the "Tools" menu and select "Do selection." You will also see "Run" and "Do" in the Tools menu. The difference between Run and Do is that Run executes the commands, but does not show any output. Run is useful only when you need to execute many commands (or a command that produces a lot of output) before running the one in which you are most interested. Typically, if you do not want or need to see the output of a command, you are better off using the "quietly" command prefix.
The only thing you need to remember is that if a command is longer than one line, then you need to use the #delimit command. When you type a command interactively, hitting the return key tells Stata to execute it. In a .do file, there is a return at the end of every line, so you need a way of telling Stata that the command is longer than one line. Here is an example of a short do file we’ll call mydofile.do:
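A minimal sketch of such a file, assuming a hypothetical dataset mydata.dta with numeric variables var1, var2 and var3 (none of these names come from the original example). It opens a log, changes the delimiter so that one command can span several lines, and closes the log at the end:

```stata
* mydofile.do: a minimal sketch (file and variable names are hypothetical)
log using mysession.log, replace

* Read the data and take a first look
use mydata, clear
describe

* Switch the command delimiter to ";" so a command can span several lines
#delimit ;
summarize var1 var2
    var3 if var1 < 100,
    detail ;
#delimit cr

* Run a command without showing its output
quietly summarize var1

log close
```

The "#delimit cr" line switches back to the default, where the end of a line ends the command.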
Data Management

Data management is perhaps the most important part of your work in Stata. It is how you get your data into a format in which you can run your analyses properly. Major mistakes have been the result of poor data management rather than poor analysis.

There are three basic ways of reading data into Stata. "Insheet" is for spreadsheets saved as "CSV" or "tab-delimited" files from a package such as Excel, "infile" is for data in what are called "flat" files, and "infix" is for data that have to be read in using special formats. Sometimes data will come with a SAS and/or SPSS program to read the data. In these instances, it is better and easier to use the program provided and then convert the data to Stata with DBMSCopy or Stat Transfer. Otherwise, you will have to write your own program. Be aware that Stata can only read data that are in ASCII (also called "raw" or "text") format. If the codebook mentions anything like "zoned decimal" or "binary integer," then you must use SAS or SPSS.

A spreadsheet from Excel cannot be brought into Stata without some editing and saving it as a ".csv" file.
The insheet command is very useful in reading in data from a spreadsheet. There are some conventions required though:
Once you have completed the changes and you are in Stata, use the "insheet using" command, followed by the name of the file, to read in the file. If you get an error message about "wrong number of values," then you have a problem with the number of commas or tabs in the file (see #3 above). You may also get a "No room to add observations" or "No room to add variables" error. In these instances, you must increase the amount of memory Stata uses with the "set mem" command.
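As a rough illustration of the steps just described, the sketch below assumes a hypothetical comma-separated file called mydata.csv whose first row holds the variable names. The 10 megabyte figure is only an example value:

```stata
* Hypothetical file: mydata.csv, with variable names in its first row
clear
set memory 10m
insheet using mydata.csv, comma clear
describe
```

Memory can only be changed when no data are loaded, which is why "clear" comes first.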
"Infile" may be useful if you have downloaded a data file from the web and it conforms to these specifications:
Essentially, insheet assumes that commas or tabs separate the variables, and infile assumes that blanks separate the variables. It may be easier, though, to bring a file into Excel first, save it as a csv, then use insheet.
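A minimal sketch of infile for a blank-separated file, assuming a hypothetical file mydata.raw with two numeric columns followed by a name of up to 20 characters:

```stata
* Hypothetical file: mydata.raw, values separated by blanks,
* one observation per line, for example:   101 1 Smith
infile var1 var2 str20 name using mydata.raw, clear
list in 1/5
```

String variables need a storage type such as str20 in front of the name; numeric variables do not.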
Infix is used to read raw data in specific columns into Stata. This is common when you have data from a particular source and a codebook or data dictionary describing the data. There are two basic forms of infix: one that allows you to specify the columns on the command line and one in which you write your own Stata dictionary to describe the file.

The first form is good when you only want to read a few variables from the file. The syntax is rather simple. You give the variable name followed by the column or columns in which it can be found. If the variable is a string (often called "alphanumeric" or "character") then you must precede the variable name with the "str" designation. Lastly, you tell Stata the name of the raw data file after the word "using". You can also specify which observations to read by adding an "in" range to the command.
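A sketch of the command-line form, borrowing the variable names, columns and file name from the dictionary examples in this section. The range 1/100 is an arbitrary illustration:

```stata
* Read three variables by column position from a fixed-column file
infix var1 1-3 var2 4 str name 10-20 using mydata.raw, clear

* Read only the first 100 observations
infix var1 1-3 var2 4 str name 10-20 using mydata.raw in 1/100, clear
```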
The second form is similar; however, the specifications must be entered in a separate file and the infix command is a little different. The file into which you enter your specifications should have the extension ".dct". The file name can be anything, but it makes sense to give it the same name as the data file it is intended to read. A simple dictionary file, named "mydata.dct" for example, begins with an "infix dictionary using" line that names the raw data file and then lists each variable and its columns between braces. You would save this file, then give the "infix using" command followed by the name of the dictionary file from the command line in Stata.
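A sketch of the dictionary form, reusing the variable names and columns from the multi-line examples in this section; the file names are assumptions. The dictionary file mydata.dct might contain:

```stata
infix dictionary using mydata.raw {
    var1      1-3
    var2      4
    str name  10-20
}
```

and it could then be read from the Stata command line with:

```stata
infix using mydata.dct, clear
```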
Sometimes raw data files have more than one line of data for each observation, often referred to as "multiple records" or "multiple cards." Stata can read these types of files as well with just a slight modification of the dictionary file:
infix dictionary using mydata.raw {
3 lines
1: var1 1-3
var2 4
str name 10-20
2: var3 2-5
var4 6
3: var5 1-2
str var6 24-50
}
Here, the "3 lines" tells Stata how many lines of data there are for each observation. The "1:", "2:" and "3:" tell Stata on which specific lines certain variables can be found. You do not need to read data from all lines. If, in the above example you did not need var3 and var4, you could leave the designation for line 2 out altogether. The "3 lines" and "3:" remain, though:
infix dictionary using mydata.raw {
3 lines
1: var1 1-3
var2 4
str name 10-20
3: var5 1-2
str var6 24-50
}
Please note that "multiple records" data files are not the same as "hierarchical" data files. Hierarchical files have different types of information on different lines. A common example is the Current Population Survey which has Household, Family, and Person records in each file. To process these kinds of files in Stata, you must read each type of record separately - being sure to include the proper identification variables - then merge the files together.
Examining your Data

Once you have the data in Stata, you will want to make sure that all the variables are there and that they are in the format you need. You can do this with the "describe" command. Describe, which can be abbreviated as simply "d," will provide basic information about the file and the variables. If you do not list variables, all the variables in the file will be described. You don’t have to call the data into Stata to be able to describe it, though. The commands:

. d using census
. d var1 var2 var3 using census

will accomplish this. This can be very useful if the file is too large to fit into memory. By default, Stata starts with 1 megabyte of memory. Often, your dataset will be larger than this and you will need to increase the amount of memory Stata uses. To do this, look at the "size" line in the describe output; this is the size of the file in bytes. Since roughly 1,000,000 bytes is a megabyte, files larger than this will not be loaded into Stata unless you increase the memory with the "set memory" command:

. set memory 10m

This example increases the memory to 10 megabytes. Make sure that you give yourself some extra memory so you can create new variables and/or add observations.

See also: ds3

Stata can produce a rather detailed codebook for your data with the "codebook" command.

See also: codebook2

It is often useful to just look at the data without doing any kind of analysis. The list command, abbreviated as "l", will let you do this. Simply giving the "l" command will display all of the data on your screen; if you specify certain variables in the command, then only those variables will be printed.
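The commands in this section can be tried together in a short sketch. It assumes a dataset named census containing the variables state, pop, region and the run of variables from death to divorce that appear in the examples of this document:

```stata
* Assumes a dataset named census with the variables used in this section
use census, clear
describe
codebook region

* List every variable for the first ten observations
list in 1/10

* List only selected variables
l state pop death-divorce
```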
By default, Stata will insert lines every five observations. Sometimes this is fine, sometimes it can be confusing. There are three ways of overriding the default:

. l state pop death-divorce, sepby(region)
. l state pop death-divorce, clean

Stata will also display the labelled values of your variables. Sometimes you want to see the actual values instead. In these instances, use the "nol" option.
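A short sketch of the "nol" option, plus the "separator()" option of list for changing how often a line is drawn. The variables are the census ones used above, and the value 10 is arbitrary:

```stata
* Show the underlying numeric codes instead of the value labels
l state region, nol

* Draw a separator line every ten observations instead of every five
l state pop, separator(10)
```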
Creating Variables
There are different types of numeric variables - byte, int, long, float and double - and the difference among them is simply how much space they take up in the file (and therefore the range and precision of the values they can hold). In most cases, you will not need to concern yourself with these differences.
Often, you will need to create new variables based on the ones you have already. The two most common ways of creating new variables are "gen" and "egen." Here are some examples of gen:

. gen poplt18 = poplt5 + pop5_17
. gen cumpop = sum(poplt18)
. gen id=_n

In the first example, we generate a new variable called "poplt18" which is simply the sum of poplt5 and pop5_17. In the next example, we create a cumulative sum of poplt18 (the value of cumpop for this observation is the sum of poplt18 for this and all previous observations). Finally, we create an "id" variable by simply setting it equal to "_n", which is Stata’s way of numbering the observations in a dataset. This id variable can be very useful when you need to sort your data in different ways (see sorting below). The variable "_n" can be used at any time, but you must remember that it refers to the number of the observation in the current order, not the original order. Here are some more examples of gen and _n:

. gen lagpop = poplt18[_n-1]

In the first example we simply set newpop equal to a specific value for the 20th observation. In the second, we create what is called a "lagged" variable, lagpop, which is the value of poplt18 from the previous observation, designated by the "[_n-1]." You can use any valid arithmetic expression inside the brackets. Here are two last examples:
. gen living = cumpop - death if state=="Louisiana"
. replace living = pop - death if state!="Louisiana"

Here, we generate a new variable, "living", which is the difference of cumpop and death for Louisiana. Note two things here: first, the double "=" after "if", and second, the quotes around Louisiana. When you refer to string values in equality expressions such as this, you must put quotes around the value so Stata will not confuse it with a variable name. Had we left the quotes out, Stata would have thought we wanted to generate living for observations where the variable state equals the variable "Louisiana." In the second example we use "replace" instead of "gen." The reason for this is that in the first example, Stata creates values only for those observations fulfilling the "if" condition. All other observations get "missing" values. We want, then, to replace those missing values with something. The "!" in Stata means "not," so "!=" means "not equal to" and no second "=" is needed.

Sometimes a numeric variable will be read in as a character variable by mistake. This happens most commonly when you "insheet" a file that may have stray characters in it or perhaps "." was used to indicate missing values. The "destring" command will convert the character form of the variable into a numeric one. The simplest way to use the command is to give it a list of variables and the "replace" option, which will convert all variables in varlist to numeric. The "replace" option tells Stata to simply replace the variable with the same name. The "force" option tells Stata to make the conversion and convert any values that contain non-numeric characters to missing. An example of this is when your data contain percent and/or dollar signs. Since you usually want to keep the numbers and just get rid of the signs, you can tell Stata to ignore certain characters with the "ignore()" option, or handle percent signs with the "percent" option. With ignore("$"), the "$" is simply removed from the value, which is then converted to numeric format. The "percent" option tells Stata to remove the "%" and divide the value by 100 to create a decimal value. If you do not want the decimal value, then use the same syntax as for the dollar sign.

In some cases, you may want to keep the original form of the variable for reference or other purposes. One example might be if you have symbols for different currencies depending on what country the observation is from. In this instance you can have Stata create a new variable with the "generate()" option instead of "replace". For example, Stata can keep income as is and create a new variable, income2, which is the numeric representation of income.

Other times, a number may be stored as a string even though there are no special characters in it. This happens most commonly when reading a .csv file that may have had an actual space for a missing value or some other stray character in that column. In these instances, you can use the "real" function with the generate command to convert the variable to numeric.
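The destring variations described above can be sketched as follows. Apart from income and income2, the variable names (pctpoor, zipcode, agestr, agenum) are made up for illustration, and the example values in the comments are assumptions:

```stata
* Hypothetical string variables: income holds values like "$1,200",
* pctpoor holds values like "12%"

* Strip "$" and "," and keep the original by writing the result to income2
destring income, generate(income2) ignore("$,")

* Remove "%" and divide by 100, overwriting the string variable
destring pctpoor, replace percent

* Convert whatever can be converted; non-numeric values become missing
destring zipcode, replace force

* The real() function does a similar job inside generate
gen agenum = real(agestr)
```

Using "generate()" rather than "replace" while testing leaves the original string variable available for comparison.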
Stata assumes that the variable you are creating is numeric unless you tell it otherwise. Sometimes, though, you will want to create a variable whose values are strings. Here’s how:

. replace name="Albert" in 1/5

As above, the values for string variables must be enclosed in quotes. The maximum length of string variables in Intercooled Stata is 80; in Stata SE, it is 244. The "in" expression simply tells Stata to use observations 1 to 5, inclusive, in the order they appear in the dataset. You can specify any consecutive range. The "in" expression can also be used with numeric variables, as well as with other commands.

. replace yn="no" if team=="New York" | team=="Boston"

Here we extend the use of the "if" clause to include two expressions. As you may have guessed, the "&" means "and." The "|" means "or." One thing you must be wary of is that the values of character variables are case-sensitive, so "John" is not the same as "john."
The substr function can be helpful; it specifies certain parts of a character
variable to be used. The first argument specifies the variable to be picked
apart (code), the second argument is the starting character, and the last
argument is the number of characters to be pulled out. In the first example,
starting with the first character in the variable "code," two characters
will be extracted.
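A short sketch pulling the string examples together. It assumes that team and code are string variables already in the data; the new variable names (st, cnty) and the starting values "John" and "yes" are illustrative assumptions:

```stata
* Declare the storage type explicitly when creating a string variable
gen str10 name = "John"
replace name = "Albert" in 1/5

gen str3 yn = "yes"
replace yn = "no" if team=="New York" | team=="Boston"

* First two characters of code, then three characters starting at position 3
gen str2 st = substr(code, 1, 2)
gen str3 cnty = substr(code, 3, 3)
```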
Dates are a special type of numeric variable. Dates are actually stored as a number which counts the number of days, weeks, months, etc. from January 1, 1960, commonly referred to as "elapsed dates". So, January 1, 1960 = 0, dates before that are negative numbers and dates after that are positive numbers. Often, dates must be entered as string variables because they need to have a "/" or have the month as letters instead of numbers. In these instances, you can use the "date" function to generate an actual date variable. The function takes the string variable and a pattern such as "mdy".
The "mdy" tells Stata the order of month, day and year as they appear in
the values. Stata assumes that you have a four-digit year; if you have
a two-digit year, then replace the "mdy" with "md19y" or "md20y", depending on what century your data are in.
One caveat is that Stata needs to have some kind of separator between the month, day and year, so values like 01/01/1998, 01-01-1998, and 01jan1998 are all valid, but 010198 is not. If there is no delimiter, then you can use the user-written "todate" command:

. todate olddate, gen(newdate) p(yyyymmdd)

This tells Stata to take the variable "olddate" and create a new one called "newdate" which is a Stata date variable. The "p(yyyymmdd)" indicates the pattern of numbers in olddate.
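A sketch of the "date" function as described above, in its Stata 8 form. It assumes string variables strdate (values such as "01/15/1998") and strdate2 (two-digit years such as "01/15/98"); the new variable names are made up. Newer Stata releases document the pattern in upper case, so the lower-case patterns here should be read as specific to this version:

```stata
* Four-digit years: month, day, year order
gen newdate2 = date(strdate, "mdy")

* Two-digit years, all assumed to fall in the 1900s
gen newdate3 = date(strdate2, "md19y")
```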
The "mdy" function can put three variables together to create a new date variable. The variables do not have to be called "month," "day," and "year," but they do have to follow that order. One more caveat is that the year must be stored as a four-digit number, otherwise Stata will not know what century you want. You may need to do some more programming to add 1900 to the value. Once you have your date variable created you will probably want to display it on the screen to make sure you did it correctly. The problem is that Stata will display the elapsed date, which is just a number, not a date as we are used to seeing them. You will need to format the variable so it displays as a "normal" date, but is still used as an elapsed date:

. format date %d

This will display dates as "01jan1998." Other formats can be used; check the Stata manual. Here are some other useful functions for working with dates:

. gen day = day(date) - numeric day of the month
. gen weekday = dow(date) - numeric day of the week
. gen month = month(date) - numeric month
. gen year = year(date) - year
. gen qtr = quarter(date) - numeric quarter of year
. gen week = week(date) - numeric week of year
All of these examples assume that the date variable is an elapsed date. The "day" function returns the number of the day, i.e., 23. The "dow" function returns a number from 0 to 6: 0 if it is a Sunday and 6 if it is a Saturday. "Month" returns a number from 1 to 12. "Year" returns the year as a four-digit number.
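The pieces above fit together as in the following sketch. It assumes three hypothetical numeric variables named month, day and year (with a four-digit year); the weekend variable and the cutoff date are illustrations only:

```stata
* Build an elapsed date from three numeric variables
gen date = mdy(month, day, year)

* Show the elapsed date in the familiar 01jan1998 style
format date %d

* Pull pieces back out of the elapsed date
gen weekday = dow(date)
gen weekend = (weekday==0 | weekday==6) if weekday < .

* A date literal in an "if" expression
list date weekday if date >= d(01jan1998)
```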
Often we need to use the date as a value, say in an if .... expression. Since Stata stores dates as actual numbers, we would (in the olden days) have to calculate how many days from Jan.1, 1960 our particular date was. In Stata, we can identify a date with a function:
. gen target=d(05Aug1989)

The d() function allows us to enter the date in words rather than as numbers. This is only a small subset of what Stata can do with dates. We have a more complete tutorial on Time Series commands.

A note on "Century-months": You can easily convert century-months to Stata monthly dates by simply subtracting 721 from the century-month, storing the result in a new variable (statamonth here), then formatting the result as a monthly date:

. format statamonth %tmmcy

Sometimes we need to generate a "dummy" variable, or variables. Stata makes this very easy with the "gen()" option of the "tab" command:

. tab region, gen(region_dummy)

Here, Stata will create a dummy variable for each value found in region. So, region_dummy1 = 1 if it is in region 1, 0 otherwise; region_dummy2 = 1 if it is in region 2, 0 otherwise; region_dummy3 = 1 if it is in region 3, 0 otherwise. See also the discussion of the xi: command prefix below.

Egen, or "extended generate", is useful when you need a new variable that is the mean, median, etc. of another variable, for all observations or for groups of observations. Egen is also useful when you need to simply number groups of observations based on some classification variable. Here are some examples:
. egen meanpop=mean(pop), by(region)
. egen num_st=count(id), by(region)
. egen popcat=cut(pop), at(671742.5,3066433,1.11e+07)
. egen group = group(region popcat), label missing

Used with the sum() function and no "by()" option, egen creates a variable whose value for each observation is equal to the sum of pop for all observations (all observations will have the same value). The meanpop example shows how to create a variable that is the mean of another variable for each group designated by region (all observations within a group will have the same value). The num_st example simply counts the number of non-missing id values in each region (this basically counts the number of observations you have for each region). The popcat example cuts pop into categories at the listed values. The last example simply assigns a number to each of the groups created by the combination of region and popcat. The "label" option tells Stata to create a value label for the created variable; if either of the variables used to create the new variable has a value label assigned, that label will be included. Finally, the "missing" option tells Stata to treat missing values as a group of their own when creating the groups.

One thing to watch out for when using egen is that some functions treat missing data as zero. This is very different from regular functions and analysis commands in Stata.
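A small sketch of egen in use, assuming the census-style variables pop, region and id from this section. The new variable names are chosen so they do not clash with the examples above:

```stata
* One value for the whole dataset
egen totpop = sum(pop)

* One value per region
egen regmean = mean(pop), by(region)
egen regcount = count(id), by(region)

* Number the groups formed by region
egen regnum = group(region)

* Each state's share of its region's population
egen regpop = sum(pop), by(region)
gen share = pop / regpop
```

The last two lines show a common pattern: egen builds the group total, and an ordinary gen then uses it observation by observation.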
Use encode when the original variable is, indeed, a character variable (such as gender being coded as "m" and "f") and you need numbers instead. The encode command does not produce dummy variables; it just assigns numbers to each group defined by the character variable. In this example, state was the original character variable and st is the new numeric variable:

. encode state, gen(st)
Using "gen" and "replace" is fine if we are creating new variables or want to change the values of a few specific observations. Often, however, we don't need or want to create a new variable, or we want to change all occurrences of a specific value. For instance, many surveys will have "999" to indicate a missing value rather than just a blank and an "888" to indicate that the question was not applicable to this particular respondent. Since Stata will include "999" and "888" as valid values in any calculations, we would want to change all of them to actual missing values. To do this, we use the "recode" command:

. recode x 999 888=., test
. recode x y z 1/5=1, pre(new_)
. recode x 1/5=1 6/10=2 11 12=3 *=4

A rule such as 999=. used with the "gen(new_x)" option will change all "999" to missing and create a new variable called "new_x". The first example above will change all "888" and "999" to missing, but before doing so, will test to make sure you have not defined overlapping recodes (such as 3=1 1/5=2, where "3" would be recoded twice). The second example will change all values from 1 to 5 (inclusive) in the variables "x", "y" and "z" to 1, and create new variables called "new_x", "new_y" and "new_z". In the last example, the "*" tells Stata "all other values not previously stated".

Now that we’ve created all these new variables, we’ll want some way of keeping track of what each one is and what its values mean. We can do this by creating variable labels and value labels. These labels are not necessary in Stata; they just make the output easier to read. Variable labels correspond to the variable names, whereas value labels correspond to the different values a variable may have. Here’s how to create them:
. label var sumpop "Sum of Pop. for region"

This assigns a label to the variable sumpop, so whenever the variable name sumpop is displayed on the screen, the description "Sum of Pop. for region" will be displayed as well. Variable labels must be 80 characters or fewer. Often, though, it’s more helpful to label the values a variable can take so you don’t have to memorize them or keep referring to a codebook.
. label values group grp
. label define grp 4 "N.ctrl, low", add
. label define grp 5 "N.Cntrl, low" 4 "N.Ctrl, mid" .a "Missing", modify

Value labels correspond to the actual numbers in the data. They are used so that the printout will show "NE, mid," "NE, low," and "NE, hi" instead of "1," "2," and "3". First you have to "define" the label with "label define", then associate that label with a variable with "label values". Each label can be associated with more than one variable, so if there are several "yes/no/maybe" questions, you just have to define one value label and use it for all the questions. You may associate only one value label with a particular variable, however. The maximum length for a label in Intercooled Stata is 80 and in Stata SE, 244 characters. Rarely, if ever, will you want to make labels this long; most output in Stata will only display the first 12 characters, anyhow. Finally, only integers and extended missing values can have value labels assigned.

Stata has some new commands to help you manage your labels. The "label list" command will list all your value labels or just the ones you specify, and "labelbook" gives a more detailed report:

. label list cenreg
. labelbook cenreg

Similar to variable labels, Stata allows you to attach "notes" to variables. The notes are displayed only when you specifically list them or in the codebook. Variable notes can be useful for keeping track of information such as skip patterns or formulas for constructed variables. To create a note for a variable use the command:

. note death: created on TS
. note _dta: this is useless data

To list all the notes in a file, use the "notes" command.
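The "yes/no/maybe" case mentioned above might look like the following sketch. The survey items q1 to q3, the label name ynm, the label text and the note are all hypothetical:

```stata
* Hypothetical yes/no/maybe survey items q1-q3 coded 1, 2, 3
label define ynm 1 "yes" 2 "no" 3 "maybe"
label values q1 ynm
label values q2 ynm
label values q3 ynm
label var q1 "Owns a car"

* Review what has been defined
label list ynm

* Attach a note to a variable and list all notes
note q1: asked only of respondents aged 16 or older
notes
```

One label definition serves all three variables, so a later change to ynm shows up everywhere it is attached.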
Well, now we’ve created many new variables and converted some old ones. Since we no longer need all of these variables, we’ll want to eliminate some of the ones we don’t really need and perhaps rename some of the ones we keep.
. drop strdate strdate2 olddate
. rename num_st state_num

Here we drop the variables "strdate," "strdate2," and "olddate". Be careful! Once they are dropped, they can’t be picked up again unless you clear the data and use the saved file again. You can drop as many variables as you want in one command. The second example renames the variable "num_st" to "state_num." You can rename only one variable at a time. You can also drop observations:

. drop if size==1

will drop all observations whose size is equal to 1.
The commands "order," "move" and "aorder" can be used to change the order of variables in your data. This can be particularly useful when you have variables that are conceptually grouped but not necessarily grouped within your data. An example might be when you have information collected on the same variables at different times; all of the time 1 variables are followed by all of the time 2 variables and so on. Specifying a range of variables such as "var1-var4" would give you ALL of the variables between "var1" and "var4", not just those that begin with "var". To circumvent this, you can re-order the variables:
. order var1 var2 var3 var4
would move these four variables to the beginning of your data in the order specified. You could also simply switch the order of two variables:
. move var3 var2
this moves var3 to the position occupied by var2. Sometimes the exact order doesn't matter, but if you have many variables, it's easier to search the variable list if they are in alphabetical order:
. aorder
will put all of the variables in alphabetical order. You can also specify a variable list with this command.

File Management
Once you have all of your variables in the format you want, you’ll need to get the entire file in a format that will make it easier to use. You can do this by sorting, appending, merging and collapsing.
Sort puts the observations in a data set in a specific order. Some procedures require the file to be sorted before they can work. You can sort a file based on more than one variable. One thing you must be careful of is that, by default, Stata will randomize the order of observations that have the same values of the sort variables. This is why creating the id variable mentioned earlier is important. With the id variable, you will always be able to go back to the original order and start over. You can override this behavior by using the "stable" option, which keeps tied observations in their current relative order.
. sort region state
. sort region state, stable
Sometimes, you have more than one file of data which you need to analyze. One case may be that you have two files with the same variables but different observations. The other case is when you have two files with the same observations, but different variables.
Use append when you simply want to add more observations, in other words you already have data for 1999, and now you want to include new observations from 2000 for the same variables. For example:
. append using ds2
. append using ds2, keep(var1 var2 var3)
will add the observations from the file ds2 (what Stata refers to as the "using" dataset) to the end of ds1 (what Stata refers to as the "master" dataset). Any variables with different names in the two files will have missing values for the observations from the other dataset. By default, all variables from the "using" dataset will be included in the new file. You can use the "keep()" option to specify which variables you want to keep (from the using data set).
See also: dsconcat
If you have two files that have the same observations, but different variables, then you’ll want to "merge" them so you can use all of the variables at once. When you merge datasets, you are adding new variables to existing observations rather than adding observations to existing variables. There are two basic kinds of merges, a one-to-one and a match. A one-to-one merge simply takes the two files and puts them side-by-side, regardless of whether the observations in each dataset are in the same order. A simple one-to-one merge isn’t very common, and isn’t recommended, even if you think all the observations match. Sometimes there may be a "glitch" that can throw off the order of the data. It’s just as easy to do a match-merge, and you’ll be able to check that all the observations matched correctly.
There are some "rules" when doing a match-merge. First, each dataset must have a "key" variable or variables by which the observations can be matched - social security number is a good example. Second, both datasets must be sorted by this key variable. You can use more than one key variable if you want, such as month and year. These key variables must be of the same type (string or numeric) in both datasets. By default, if a variable is present in both datasets, then the values in the master dataset will remain unchanged.
If the variables in the using dataset have the same names as ones in the master dataset, but represent additional information, then you will need to rename them in one of the datasets before you can merge them.
. sort a
. merge a using ds2
In this example we simply match observations using the variable "a" as the key.
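The whole match-merge routine can be sketched from start to finish as follows. The two small datasets and the variables "pop" and "income" are made up for illustration, and the syntax is the Stata 8 form used throughout this page (later versions of Stata introduced a different merge syntax). Note that the sketch writes a file called ds2.dta to the current directory.

```stata
* hypothetical using dataset: the key a plus one extra variable
clear
input a income
1 100
2 200
4 400
end
sort a
save ds2, replace

* hypothetical master dataset with the same key
clear
input a pop
1 10
2 20
3 30
end
sort a

* match-merge on the key, then check how the observations matched
merge a using ds2
tab _merge
```

Tabulating _merge right after the merge is a cheap way to see how many observations matched and how many came from only one of the files.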
Sometimes, you will want to update the data you have, namely, fill in missing values. Using the update option will accomplish this:
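A minimal sketch of an update merge, assuming two small made-up datasets that share the key "a" and the variable "pop" (the file name ds2 follows the earlier example and is written to the current directory):

```stata
* hypothetical using dataset holding the values to fill in
clear
input a pop
1 11
2 20
end
sort a
save ds2, replace

* hypothetical master dataset: pop is missing for a==2
clear
input a pop
1 10
2 .
end
sort a

* update: the missing value for a==2 is filled in; the 10 for a==1 is kept
merge a using ds2, update
list a pop _merge
```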
Values in the master dataset will be changed only if they are missing. In some instances, you may want to replace non-missing values in the master dataset. Use the replace option in addition to the update option to do this:
. merge a using ds2, update replace
"Replace" will not work by itself; it must always be used with update. Stata will not, under any circumstances, replace a non-missing value in the master dataset with a missing value from the using dataset. You must use the "replace" command discussed earlier to do this.
Stata will automatically create a variable called "_merge" which will indicate the results of the merge. Always check this variable to make sure that you got what you wanted. Here is what the possible values of _merge mean:
1 = Observations from the master dataset that did not match observations from the using dataset.
2 = Observations from the using dataset that did not match observations from the master dataset.
3 = Observations from both datasets that did match.
4 = Observations from both datasets that did match, missing values in the master dataset were updated.
5 = Observations from both datasets that matched, values in the master dataset disagree with those in the using dataset.
See also: dmerge
Collapse is used when you want to create a dataset containing the means, sums, etc., of the various groups in the data. One example might be when you have one dataset of monthly data and another of yearly data and you need to analyze both sets of information together.
. collapse (mean) pop income, by(region)
This will create a dataset of the means of pop and income by region. If you have four regions in your original dataset, then you will have four observations in the collapsed dataset. You must be careful, though, because Stata will compute the statistics on a variable-by-variable basis. If one variable has more missing observations than another, the means (and any other statistics you request) will be based on a different number of observations. This is not always an acceptable practice. To avoid this, you must use the "cw" option for "casewise deletion." This means that Stata will drop any observation that does not have data for both pop AND income, thereby ensuring that all the statistics will be based on the same number of observations.
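A minimal sketch of the "cw" option, using a made-up dataset in which one observation is missing income:

```stata
* hypothetical data: income is missing for one observation in region 1
clear
input region pop income
1 100 20
1 200 .
2 300 40
2 500 60
end
* cw drops the observation with missing income before any statistic is
* computed, so both means in region 1 are based on the same single row
collapse (mean) pop income, by(region) cw
list
```

Without "cw", the mean of pop for region 1 would use both observations while the mean of income would use only one.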
See also: xcollapse
Contract works similarly to Collapse; however, it creates a file of frequencies or crosstabulations. It is similar to outputting the results of a "tab x y" command to a new data file. The syntax is:
. contract region, freq(reg_freq)
where "region" is the variable we wish to contract and "reg_freq" is the name of the new variable that will contain the frequencies. By default, Stata will not output observations for which there are zero frequencies; you must use the "zero" option to include them. You can include more than one variable in the command. See also: xcontract.
As you might guess, "expand" is the opposite of "contract" - sort of, anyway. Expand will duplicate observations based on a specific value you give it or a variable:
. expand 5
will create four additional observations for each observation you have in your dataset. On the other hand:
. expand popvar
will create a number of observations according to the value of popvar.
Reshape is one of Stata's more complicated but necessary commands. It is used to convert data from "wide" to "long" format and vice-versa. These formats are sometimes referred to as "repeated measures" and "time-series" formats, respectively. Here are a couple of examples:
Wide:
id  sex  inc90  inc91  inc92
1   M    20     22     25
2   M    33     37     42
3   F    24     24     26
4   F    55     60     65
Long:
id  sex  year  inc
1   M    90    20
1   M    91    22
1   M    92    25
2   M    90    33
2   M    91    37
2   M    92    42
3   F    90    24
3   F    91    24
3   F    92    26
4   F    90    55
4   F    91    60
4   F    92    65
The long format is also referred to as "person-years". To convert the wide format to long, you would use the command:
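A minimal sketch of the wide-to-long conversion, typing in the wide table shown above so the example runs on its own:

```stata
* the wide data from the table above
clear
input id str1 sex inc90 inc91 inc92
1 M 20 22 25
2 M 33 37 42
3 F 24 24 26
4 F 55 60 65
end
* inc is the common prefix; the suffixes 90, 91, 92 go into the new variable year
reshape long inc, i(id) j(year)
list
```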
To convert the long format to wide, you would use the command:
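A minimal sketch of the long-to-wide conversion, using the first two persons from the long table above so the example stays short:

```stata
* part of the long data from the table above
clear
input id str1 sex year inc
1 M 90 20
1 M 91 22
1 M 92 25
2 M 90 33
2 M 91 37
2 M 92 42
end
* each value of year becomes a suffix on inc: inc90, inc91, inc92
reshape wide inc, i(id) j(year)
list
```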
In the reshape command, the "i" indicates what variable(s) identify the rows (observations), and the "j" indicates the columns (variables). In the first example, going from wide to long, the unique rows of data are identified by the variable "id," the variables we want reshaped begin with the prefix "inc" and we want the new variable indicating what column the value was from to be called "year".
You should notice two things about these commands and the data: First, we did not specify "sex" in either command. Stata will automatically shift all other variables accordingly. Second, and more importantly, note the names of the variables that were transposed: "inc90", "inc91" and "inc92." As you can tell, they all have the same prefix, "inc," and have the year as a suffix. This is the easiest way for Stata to work with your data. The years (or whatever the numbers represent) do not have to be consecutive, nor do they even have to be in numerical order.
Sometimes, though, the data do not represent years; rather, they represent different types of observations. For example, suppose that instead of "inc90," "inc91" and "inc92" the variables were named "mominc," "dadinc" and "kidinc."
id  dadinc  kidinc  mominc
1   25      20      22
2   42      33      37
3   26      24      24
4   65      55      60
There are no numbers, and the "inc" is now the suffix rather than the prefix. Stata can still handle this:
. reshape long @inc, i(id) j(member) string
would produce:
id  member  inc
1   dad     25
1   kid     20
1   mom     22
2   dad     42
2   kid     33
2   mom     37
3   dad     26
3   kid     24
3   mom     24
4   dad     65
4   kid     55
4   mom     60
There are two new items here: The "@" and "string." The "@" tells Stata where the "j" identifier is within the variable name. If you had a variable name such as "mom90inc," then you would use "mom@inc." "String" tells Stata that the identifier is a string rather than a number. The "@" and "string" can be used by themselves as well.

Basic Commands
Now that you have your data in a format you want, check it before doing any analyses. This can save you quite a bit of frustration later on.
Sum, short for summarize, will give you the means, sd’s, etc. of the variables listed. If you don’t list any variables, it will give you the information for all numeric variables. If a variable you thought was numeric shows up as having 0 observations and a mean of 0, then, most likely, Stata still thinks it’s a character variable.
. sum pop income
The "detail" option gives you additional information about the distribution of the variable.
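A minimal sketch of the "detail" option on a small made-up dataset (the values are illustrative only):

```stata
* hypothetical data, for illustration only
clear
input pop income
100 20
200 30
300 40
500 60
end
* adds percentiles, variance, skewness and kurtosis to the usual output
summarize pop income, detail
```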
Inspect is another easy way to eyeball the distribution of a variable.
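A minimal sketch, again on a made-up variable called "pop":

```stata
* hypothetical data, for illustration only
clear
input pop
100
200
300
500
end
* small histogram plus counts of negative, zero, positive, integer and missing values
inspect pop
```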
Tab, short for tabulate, will produce frequency tables. By specifying two variables, you will get a crosstab. There are other options to get the row, column and cell percentages as well as chi-square and other statistics; check the manual.
. tab region group
. tab region group, row col chi2
. tab1 region pop state
"Tabstat" is similar to "sum" except that it allows you to specify which statistics are displayed in the table (including some that "sum" does not report), as well as how the table is oriented. Here are a couple of examples:
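As a minimal sketch, assuming a small made-up dataset with the variables "region," "pop" and "income," two tabstat calls might look like this:

```stata
* hypothetical data, for illustration only
clear
input region pop income
1 100 20
1 200 30
2 300 40
2 500 60
end
* chosen statistics for each variable, reported separately by region
tabstat pop income, statistics(mean sd min max) by(region)
* the same statistics for the whole sample, with the statistics as columns
tabstat pop income, statistics(mean sd min max) columns(statistics)
```

The statistics() option controls what is reported and columns() controls the orientation of the table.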
"By" group processing Sometimes you’ll want to run a command or analysis on different groups of observations. The "by variable:" subcommand is the same thing as running the command with separate "if" statements for each group. You must sort the data before you can use the "by:"
. sort region
. by region: summ pop
. by region: gen poplag=pop[_n-1]
Often when you have time-series data (e.g., monthly data for several persons) you want to execute a command for each person, but need to ensure that the data are in the proper time order. One example would be if you want to calculate the change in income from one month to the next for each person. To do this, you may be tempted to use:
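A minimal sketch of the pitfall and one way around it, assuming hypothetical variables "id," "month" and "income." Grouping on id alone does not guarantee the month order within each person, whereas "by id (month):" requires the data to be sorted by month within id.

```stata
* hypothetical data, deliberately out of time order for id 1
clear
input id month income
1 2 110
1 1 100
2 1 200
2 2 230
end
* tempting: group on id only; rows keep whatever order they have within id
sort id
by id: gen change_bad = income - income[_n-1]

* safer: sort by id and month, group on id, and insist on month order within id
sort id month
by id (month): gen change = income - income[_n-1]
list
```

The first observation for each person has no previous month, so "change" is missing there.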
Sometimes, the "by" prefix can produce more output on the screen than you would care to look at. To suppress the output, use "quietly." Keep in mind that output suppressed this way does not appear in the log file either.
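A minimal sketch, assuming a made-up dataset with "region" and "pop":

```stata
* hypothetical data, for illustration only
clear
input region pop
1 100
1 200
2 300
2 500
end
sort region
* runs summarize for every region but prints nothing
quietly by region: summarize pop
* the saved results from the last group are still available afterwards
display r(mean)
```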
Fancy commands
Macro variables
Sometimes you need to use many variables the same way many times. One example would be if you want to run regressions on different dependent variables using the same set of independent variables. This can mean a lot of typing. One way around this is to create a macro variable. A macro variable is simply a variable that has as its value a particular string. This string can be anything you specify: a list of variables, a particular command, or whatever. Whenever Stata comes across the macro variable, it will interpret it to mean whatever string you set the variable to.
. local macvar "att att2 itt itt2 date"
. reg depvar `macvar'
In this example, when Stata "sees" `macvar' in the regression command, it replaces it with the string "att att2 itt itt2 date." Pay careful attention to the two different quotation marks: the first quote in `macvar' is the opening left quote, usually found under the ~ key. The second is the closing right, or single, quote, usually found under the " key.
Other times, you will want to perform the same command on several variables. Again, this can mean a lot of typing. The "for:" command can be very useful in these situations. It can use several types of lists and, with a little ingenuity, can be very powerful. The general syntax is:
. for [id in] listtype list : command
where "id" is a symbol used to identify variables or values, "listtype" is what kind of entries are in the "list" to be changed, and "command" is the Stata command you want applied. There are four types of lists:
varlist - a list of existing variables
newlist - a list of new variables
numlist - a list of numbers
any - a list of any words
Here are a few examples of the for command:
. for var m*: replace X=. if X==99
. for new var1-var3 : gen X=0
. for any a b c: gen str3 X="aaa"
. for new v2-v5 \ num 2/5: gen X = myvar^Y
The X in the for command represents the entries in the list, one by one. In the first example, if the variables beginning with "m" were m1 and m2, Stata would interpret the command as though you had typed replace m1=. if m1==99, replace m2=. if m2==99, and so on; the "m*" simply tells Stata to change any variable beginning with the letter "m". The second example creates "new" variables named var1, var2 and var3 and sets them all equal to 0. The third example does basically the same thing, but since we are not using consecutive names as in the second example, we must use the list type of "any." The fourth example shows how to use two lists to create new variables. It creates the variables v2, v3, v4 and v5 and sets them to the 2nd, 3rd, 4th and 5th powers of "myvar" (an existing variable), respectively. You could even repeat several Stata commands on the right hand side of the ":" by separating them with a "\" as we did on the left side.
Sometimes using an "X" as the id might be confusing - especially if our variable names have an "X" in them. If this is the case, we can change what Stata uses as the id:
. for ZZ in var m*: replace ZZ=. if ZZ==99
will accomplish the same thing as the first example above. Sometimes it's a good idea to test the command before you actually run it. The "dryrun" option will allow you to do this:
. for var m*, dryrun: replace X=. if X==99
will show the commands that will be executed without actually executing them.
Closely related to For is "Foreach." Foreach allows you to run several commands in the same fashion as For:
. foreach var of varlist m* {
replace `var'=. if `var'==99
replace `var'=.a if `var'==88
}
Note the use of the macro variable. Foreach can use the same types of lists as For.
"While" is very similar in purpose to "for"; however, it can be much more powerful (and complicated). "While" is used mostly in programs to perform several tasks many times by iterating through the data. For example, let's say you need to perform a regression on each of 100 companies and save the betas as variables. If you had nothing better to do with your life, you could issue the following commands:
. reg y x if company==1
. replace beta=_b[x] if company==1
100 times each, changing the company number each time. Or, you could write a quick program using "while" to do it for you:
1) program define regit
2) local i=1
3) while `i'<=100 {
4) reg y x if company==`i'
5) replace beta=_b[x] if company==`i'
6) local i = `i' + 1
7) }
8) end
We are assuming here that you have a variable called "company" that identifies each company sequentially from 1 to 100. It's important that the companies are identified in this way so Stata will not issue an error when it comes to a missing id. To create a variable like this, use the egen group() command explained above. We have also assumed that you created a variable called "beta" with missing values for all observations. Let's look at this program line-by-line:
1) This line names the program "regit."
2) We must first define a macro variable (see the explanation above on macro variables) which we set to 1 to begin the iterations. You do not have to start at 1, it's just convenient.
3) Here's the "while" command. It tells Stata to repeat the following commands (up to the "}" on line 7) as long as i is less than or equal to 100. Note the ` and ' around the i; the actual value used here will change each time the program loops through the commands.
4) and 5) These are the actual commands we want to execute on our data. As in line 3, the `i' will change its value each time we loop through the program.
6) This line increments the value of i each time it is executed.
8) This simply tells Stata that this is the end of the program.
Here's what Stata "sees" as it executes the commands in the program:
The first iteration:
. reg y x if company==1
. replace beta=_b[x] if company==1
The second iteration:
. reg y x if company==2
. replace beta=_b[x] if company==2
The third iteration:
. reg y x if company==3
. replace beta=_b[x] if company==3
Get the idea? This will continue until it gets to the 101st iteration and then stop because, of course, 101 is not less than or equal to 100. You would be very smart to write a program like this in a do file as you will most likely need to tweak it a few times to get it to work the way you want. The xi: command prefix is used for interaction expansion. Interaction expansion means creating dummy variables for interactions of categorical variables. There may be instances in which you will run a regression (or some other analysis) when you will want to estimate the interaction terms for these variables. Normally, you would have to create new dummy variables for each term, which would be rather tedious. The xi: prefix does this for you. Here's an example:
. xi: reg income i.region i.popcat
. xi: reg income i.region*i.popcat
The i.region and i.popcat tell Stata that you want each of those variables expanded, so you would have the dummy variables created for you. The first example simply creates these dummies for you and uses them in the regression. The second example will also produce the interaction and main effect terms for them. You can use this to interact categorical variables with continuous ones as well:
. xi: reg income i.region*popcat
. xi: reg income i.region|popcat
The first example will give you the interactions and main effects of region and popcat, whereas the second example will leave out the main effect of popcat. As with any regression using dummy variables, one category must be left out. By default, xi: will leave out the group with the lowest value. If this is not what you want, then you can change it by issuing the command:
. char region[omit] 3
which would leave out the region group coded 3 instead. If the variable is a string variable, simply substitute the string value for the number in the example.

Useful Commands
Some of the commands in this section are part of the STB collection of commands and may not be installed on your machine. They are, however, very easy to install yourself.
The "duplicates" command is used to find and, optionally, delete duplicate observations. This can be very useful when you are cleaning data from several sources. There are several forms of the command, so you are encouraged to look at the manual for an explanation of them; we'll look at just a couple here.
. duplicates report region
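A minimal sketch of a few forms of the command, run on a made-up dataset that contains one fully duplicated observation:

```stata
* hypothetical data: the first two observations are identical
clear
input region pop
1 100
1 100
2 300
2 500
end
* count how often each value of region is repeated
duplicates report region
* list observations that are duplicates on every variable
duplicates list
* drop the extra copies of fully duplicated observations
duplicates drop
```

It is usually worth running the report and list forms before dropping anything, so you can see exactly what will be removed.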
The "ds3" command is a very versatile version of describe. Unlike describe, though, ds3 can list different types of variables (strings, numeric, byte, float, etc.) as well as variables that have labels or other attributes. It can also do case-insensitive searches. Here are some examples:
This will list all string variables.
This will list all variables that have value labels defined for them. Perhaps more useful:
will list all variables that do NOT have value labels defined. You can even find variables that have a specific value label defined:
which will find all variables that have been assigned the "yesno" value label. To do the same, but ignoring case:
Last, but not least, you can use ds3 to select certain variables for other commands:
This will do a "sum" on all numeric variables - leaving out all the string variables.
The codebook2 command is similar to the standard "codebook" command, but takes a different approach to determining what information to display for the variables. The codebook2 command should be used in conjunction with the "vartyp" command. Both of these commands were written at OPR and can be installed from within Stata with the commands "ssc install codebook2" and "ssc install vartyp". The codebook2 command will display information on a variable or variables based on whether the variable is an "id", continuous, discrete or a date variable. You can specify the type in the codebook2 command, or by using the vartyp command. To specify the type using the vartyp command:
. vartyp weight height, s(cont)
. vartyp sex race state, s(disc)
. codebook2 person weight sex
. codebook2 educ, t(disc)
You can also produce codebook information for variables in another dataset:
The "bigtab" command is for when you want three-way crosstabs and/or you get a "too many values" error from the "tabulate" command. "Bigtab" was written at OPR and can be installed from within Stata with the command "ssc install bigtab". The bigtab command can produce one-, two-, and three-way frequency tables with an unlimited number of values for each variable. It can also produce row, column and cumulative frequencies and percentages, as well as save the results in a separate dataset. Here are some examples:
. bigtab sex race agecat, sep(sex)
. bigtab sex race, saving(sexracefreq)
. bigtab sex race, all
The xcollapse command is an extension of the collapse command. The most useful feature of this command is the ability to save the results in a separate data file without replacing the data already in memory.
The xcontract command is an extension to the contract command in that it has many more options. Perhaps the most useful option is the one to save the resulting data set in another file without destroying the one currently in memory.
The dsconcat command will allow you to append several files in one step. You can also create a variable that will identify which file a particular observation came from:
. dsconcat file1 file2 file3, dsname(filename)
"dmerge" is just like the merge command with a couple of useful exceptions: If the variable "_merge" already exists in either the using or master data sets, it is automatically dropped. Also, if the data sets are not sorted by the key variables, "dmerge" will sort them first. This is particularly nice if your using data set is not sorted.
The "log2do2" command can be a big time-saver if you have not done all of your work in a do file (but you did do all your work in a do file, didn't you???) or if, for whatever reason, you have a log file but no corresponding do file. log2do2 extracts all of the commands from a log file and creates a do file.
Like log2do2, log2html converts a log file, in this case to an HTML (web) document. Unlike log2do2, however, log2html MUST use a .smcl file, the default Stata log format, as its input. This is because .smcl files have a markup language of their own which is converted into HTML, another markup language. So, if you have been creating your logs as plain text (using the .log extension or the ,text option), you will need to re-run your program and change the type of log file.