The Office of Population Research at Princeton University

August 20, 2008


Administration
Faculty
Staff
Students
Jobs


Projects
Seminars
Working Papers


Prospective  Students
Programs
Courses
Course Schedule


Data Archive
Library
Pop Index


Calendar


CRCW
CHW
CMD
PUM
OPR Mail
OPR Mail - New

Search

  Basic Stata: Reading Data Into Stata

Reading Data Into Stata

This lesson will cover in more detail how to get data into Stata. By the end of the lesson you should be able to read a raw data file into Stata using the "infile" and "infix" commands. We will also briefly talk about the "insheet" command for reading data from a spreadsheet program such as Excel.

Often, the data we want to analyze is not already in Stata (.dta) format. What we will work on now is how to get data that are not in Stata format into Stata.

Before we get into the details, let's just note that sometimes data come in SAS or SPSS format, or with SAS or SPSS programs to read the data. In these instances, it is much easier to use a program like StatTransfer or DBMS/Copy to convert the data from one of these formats to Stata.

You may even have data in an Excel or other spreadsheet program. You can use StatTransfer or DBMS/Copy for these files as well, or you can save the spreadsheet as a tab-delimited file and use the "insheet" command to read it into Stata. This is pretty easy to do, so you can try it on your own with the instructions from our refresher lesson

If you do have to write your own program to read data into Stata, then you have two basic choices: "infile" and "infix". The infile command has more capabilities than infix, but is also more complex. Essentially, the choice comes down to this: if your codebook has "start" and "length" information for your variables, or your variables are separated by spaces (not tabs or commas) then use infile. If your codebook has "start" and "end" column information, then use infix.

Some Terminology

  • Codebook: A document that describes in detail the data with which you are working. Although there is no standard format for a codebook, a good one has the full wording of the questions and answers, a list of all the codes or values used to enter the data, and either the begin and end columns or the begin column and the length of the variable.
  • Dictionary: There are two types of dictionaries: a data dictionary and a Stata dictionary. A data dictionary is a document that lists all the variables and their locations (columns) in the data file, and, sometimes, the values for those variables. These are appropriate for any statistical package. A Stata dictionary is a program or file that Stata uses to read a raw data file - the information in this program can be obtained from either a data dictionary or a codebook. For our purposes, "dictionary" will refer only to a Stata dictionary.
  • Observation: An observation is a unit of analysis - a respondent in a survey, for example.
  • Record: A record is one line in a raw data file. Sometimes there is more than one record per observation or there are different types of records in a single file. Records are sometimes referred to as "cards."
  • Column: a column actually refers to a single character, including spaces, in a raw data file. This is not the same as a column in a spreadsheet.
  • Length: The number of columns (or characters) a variable in a raw data file will occupy. Sometimes this also means the number of bytes the same variable will take in a system file.
  • Raw data: Raw data are data that have not been read into Stata (or some other package). If you were to open this file you would see numbers and, perhaps, letters. You need a codebook or data dictionary to be able to read the data into Stata. These are sometimes called "text" or "ascii" files.
    • Flat: a flat file is a raw data file that has one record for each observation. These are sometimes called "rectangular" files.
    • Multiple Records: Sometimes there is more than one record or line of data for each observation in a raw data file. Each record has a different set of variables on it, so each record must be read differently.
    • Hierarchical: Sometimes a raw data file will have different types or levels of information on different records. An example is when a survey collects information about a household, each family within that household, and each person within each family. The different types of variables will be on different records within the raw data file.
  • String variable: a string variable is one that has letters and/or numbers as opposed to just numbers. An example would be a person's name. Numbers can be treated as strings, but strings cannot be treated as numbers. Strings are also referred to as "character" or "alphanumeric" variables.
  • Numeric variable: These are numbers, plain and simple. Decimals, commas, and minus signs are the only acceptable non-number characters allowed.
  • Variable label: A variable label is a short description of a variable. These appear in the Variable window in Stata and also in the output. They are not required, but make the output easier to understand.
  • Value label: Like variable labels, value labels make the output easier to read.  Instead of printing “1”, “2”, “3”, Stata will print “Yes”, “No”, “Maybe”.

Infile

This is a sample dictionary using the "infile" command. These statements are entered into a separate file called "demo.dct".

infile dictionary {
    _column(1)           month     %2f   "Calendar Month"
    _column(3)           year      %4f   "Calendar Year"
    _column(7)           employed  %1f   "Employment status that month"
    _column(8)           mugged    %1f   "Perpetrator/victim/witness of assualt"
    _column(9)           gender    %1f   "Gender"
   _column(10)           income    %5f   "That month's income"
   _column(15)    str6   d_income  %6s   "Income with a dollar sign"
   _column(21)    str13  str_num   %13s  "Number read in as a string"
   _column(34)           hhid      %3f   "Household ID"
   _column(37)           famid     %1f   "Family ID"
   _column(38)           perid     %1f   "Person ID"
   _column(39)           citynum   %1f   "City ID"
   _column(40)    str15  cityname  %15s  "City name"
   _column(55)           b_day     %2f   "Day of R's Birth"
   _column(57)           b_month   %2f   "Month of R's birth"
   _column(59)           b_year    %4f   "Year of R's birth"
   _column(63)           i_day     %2f   "Interview day"
   _column(65)           i_month   %2f   "Interview month"
   _column(67)    str10  str_date  %10s  "String Date"
   _column(77)           rand      %21f  "Just a random number"
}

The first line tells Stata where the data file is; this must be the very first line in the dictionary file. The next several lines provide Stata with a "map" of the data file. The "_column()" statement tells Stata in which column the variable starts. You can skip columns/variables if you want. Next, if the variable is a string, then we must tell Stata that it is a string as well as how long it is. Next is the variable name. Then we tell Stata how long the variable is. Note that for string variables we use an "s" and for numeric we use an "f". Finally, we have the variable label. The label is not necessary, but it is helpful.

infile using demo.dct, using(demo.dat)

infile using demo.dct if gender==1, using(demo.dat)

infile using demo.dct in 1/100 , using(demo.dat)

To execute the dictionary, we use the "infile using" command. The first form reads the entire data file into memory. The second form reads only those records where the variable "gender" is "1" (note the double equal sign). The last form only reads the first 100 records; this is good for testing your program before executing it on the full data file.

infile month year employ mugged gender income str6 d_income using demo.dat

This form of the infile command can be used only if your variables are separated by spaces. If you have string variables with embedded blanks, then they must be enclosed in double quotes.

 


Infix

Here is a sample infix dictionary. Like the infile dictionary, it is entered into a separate file called "demo.dct".

infix dictionary{
      month 1-2 
      year  3-6
      employed 7
      mugged 8
      gender 9
      income 10-14
str6  d_income 15-20
str13 str_num 21-33
      hhid 34-36
      famid 37
      perid 38
      citynum 39
str15 cityname 40-54
      b_day 55-56
      b_month 57-58
      b_year 59-62
      i_day 63-64
      i_month 65-66
str10 str_date 67-76
      rand 77-96
}

In the infix dictionary we can simply tell Stata in what columns the variables are and specify the string variables as we did in the Infile dictionary. Note that the first line of an Infix dictionary begins with the word "infix".

infix using infix_demo.dct

infix using infix_demo.dct if year==1997

infix using infix_demo.dct in 1/100

Executing an infix dictionary is done in the same manner as for an infile dictionary.

Saving Your Data

Of course, once you’ve read your data into Stata, you’ll want to save it.

save mydata

This is the simplest form of the save command. It simply saves the data with the name “mydata”.Stata will automatically attach the “.dta” extension.

save mydata, replace

Stata will not let you save a dataset over itself without you explicitly telling it to do so.  If you have made many changes to your data, you may want to save it with a new name instead.

saveold myolddata

If you intend to give the data to someone who has an older version of Stata, then you must use the "saveold" command. You should be aware that older versions of Stata have smaller limits for the size of the dataset as well as length of variable names and labels. The same holds true for the "Intercooled" and "Special Edition" versions: Intercooled has smaller limits than Special Edition.

 

Documenting Your Data

Once you have your data loaded into Stata, you'll want to document it so you know what the variables are and how they are coded. You can do this with variable and value labels, as well as having good records of the data using Stata's "codebook" and "describe" commands.

In this lesson you will learn how to create and modify variable and value labels, create notes for your data and variables as well as make good use of the codebook and describe commands.

Variable and Value Labels

The output from the describe and codebook commands can be very useful, but it can be even more useful if you have defined variable and value labels. Variable and value labels also make the output from other commands easier to read as well. A variable label is a short description of the variable itself and what it is intended to represent. Value labels are a short description of the individual values a variable may have. Used together, they document your program as well as your output.

Here are a few rules and tips for labels:

  • Variable labels can be up to 80 characters long in Intercooled Stata, 244 in Special Edition, anything beyond that gets truncated.
  • Often, a variable label is simply the text of a question in a survey. If the question is more than 80 characters, you must re-word the question be sure that you do not change the meaning of the question by doing so!!! This may seem obvious, but it has happened.
  • If you do not use question numbers as your variable names, then include the question number in the variable label.
  • Value labels have the same limits as variable labels, but since many procedures only print the first 12 characters, you would be wise to keep them rather short. Again, be sure not to change the meaning.
  • Only integers and special missing values can be labeled.
  • One label can be applied to many variables; if you have several yes/no questions, you only have to define one label and then apply it to all your yes/no variables.
  • You can define more values in a label than you have in your variable. For example, if you have some yes/no questions and some yes/no/maybe questions, you can define a yes/no/maybe label and apply it to the yes/no variables, provided "yes" and "no" are coded with the same numbers.
  • Each variable can have only one variable or value label applied at a time.

label var employ "Employment Status"

label var str_num `"This has "quotes" in the middle"’

Defining variable labels is pretty simple. Be sure to use the double quotes.
If you need quote marks in the label itself, this is how you can do it.  The first character is the accent mark above your “tab” key.

label define sex 0 "Male" 1 "Female"

label values gender sex

First you need to define the value label. Simply list the values and the text of the label in quotes.
Then, you apply that label to a variable. Unfortunately, you can list only one variable at a time in the label values command.

label define sex 2 "Other", add

label define sex 0 "0: No" 1 "1: Yes" 2 "2: Maybe", modify

You may need to change a label definition at a later time. Here are two ways of doing so. You can use the "modify" option to add new values as well. Once you have re-defined the label, you do not need to re-apply it.

label dir

label list

label list sex

label drop sex

Here are some commands to help you work with labels.

numlabel sex,add

numlabel sex,remove

The numlabel command adds (or removes) the numeric value of a label to the label itself. For this example, the "sex" label would become "1. Yes", "2. No" and "3. Maybe"

labelbook

labelbook sex

The labelbook command produces a "codebook" for your value labels

label _dta “This is Employment and Assault data”

You can also label your entire dataset.

 


Notes

Variable and value labels go a long way toward documenting your data. Sometimes, though you may need or want to include information that you wouldn't necessarily want in a label. Information on how you constructed a variable and in which do-file you did it or reminding yourself of skip patterns are examples. This is where variable "notes" are useful. Notes are not displayed in any output by default, but are a good way of keeping a record of what you did. The only drawback to using notes is the fact that only Stata has them; other packages like SAS or SPSS do not have notes, so if you convert your data, the notes will not be converted as well (variable and value labels are, though). You can have a total of 9,999 notes in your data and you can have more than one note for any variable.

note: i started this on monday

note: i started this on TS

note income: i constructed this variable in day2b.do

To define a note for the dataset as a whole, simply use the note command - be sure to use the colon.
By adding "TS" to the note, you can include a time stamp in any of your notes as well.

Finally, you can attach notes to specific variables.

notes

notes _dta

notes mugged

To display all of your notes, just give the "notes" command.

You can display just the general notes

And you can display the notes specific to a variable or variables.

notes drop _all

notes drop mugged

notes drop _dta 3

To drop notes, just give the "notes drop" command.

If you have more than one note attached to your data or a variable, you can also drop just certain ones.

Describe and Codebook

The describe and codebook commands provide information about your data. Codebook provides much more detail than describe, but you will probably use describe more often.

describe

describe is the most basic form of the command. It produces a short description of the file: name, number of observations, number of variables, and size. It also lists all of the variables in the dataset, as well as information about those variables.

d,s

Describe can be abbreviated as "d". The "s" option - an abbreviation for "short" - only lists the information about the whole file.

d col*

You can list information for just certain variables as well.

d using demo

d var1 var var3 using demo

You can list information for a dataset that is not loaded into memory. This is useful for determining how high to set the memory before loading it.

 

The describe command lists some useful information, but sometimes you need even more detail about the variables. For this, we have the codebook command.

codebook

Codebook lists a very detailed description of each of your variables. Like describe, you can list this information for just certain variables.

codebook, header

The header option lists information similar to what you get from using the "short" option in describe.

codebook, tab(15)

By default, the codebook command will list frequencies for variables that have nine discrete values or less and means for those with ten or more. The tab option allows you to specify a higher or lower threshold.

codebook, notes

The notes option simply includes the data and variable notes, if any, in the output.

codebook, problems

The problems option is good for identifying potential problems such as variables that are constant or all missing, variables that have non-existent value labels assigned, and several others.

 


Examining Your Data

Now that you have some data loaded, you’ll want to take a good look at it to make sure it was read in correctly and that it was what you expected.  There are four basic commands you can use to do this:  “list,” “browse,” “tab” and “sum”. 

 

l

l id month year employ

l id month year employ in 500/1000

l id month year employ if mugged==1

l id month year employ,clean

l id month year employ, sepby(year)

“l” is short for “list” which simly prints out all your data.  You can specify which variables to print, as well as limit the observations.

 

 

The "clean" option prevents the lines from being drawn around the data

The "sepby()" option tells Stata where to put the separator lines

browse

browse id month year employ

browse id month year employ in 500/1000

browse id month year employ if mugged==1

Sometimes you need to look at more variables than can fit in the Results window.  The “browse” command will open a new window which allows you to scroll left and right as well as up and down through your data.

tab employ

tab mugged employ

tab1 mugged employ year month

“tab” is short for tabulate and produces frequency tables of the variables you specify.  If you list two variables, then you get a “crosstab.”  If you simpy want several one-way tables, then use the “tab1” command and list the variables.

sum

sum income

sum income, detail

“sum” is short for “summarize”.  This command produces means, standard deviations, minimum and maximum values.  The “detail” option produces some other useful information as well.

 

 

On to the next tutorial, Creating and Manipulating Variables
top
Mail: Office of Population Research, Princeton University, Wallace Hall, Princeton NJ 08544
Phone: (609) 258-4870  •  Fax: (609) 258-1039  •  Email: webmaster@opr.princeton.edu