|
November 7, 2009

Administration
Faculty
Staff
Students
Jobs

Projects
Seminars
Working Papers
Publications
Dissertations

Programs
Courses
Course Schedule

Data Archive
Library
Pop Index
NIH Public Access Policy

Calendar

CRCW
CHW
CMD
PUM
OPR Mail
Search
|
 |
|
Basic Stata: Reading Data Into Stata
Reading Data Into Stata
This lesson will cover in more detail how to get data into
Stata. By the end of the lesson you should be able to read a raw data file
into Stata using the "infile" and "infix" commands. We will
also briefly talk about the "insheet" command for reading data from a
spreadsheet program such as Excel.
Often, the data we want to analyze is not already in Stata (.dta) format.
What we will work on now is how to get data that are not in Stata format into
Stata.
Before we get into the details, let's just note that sometimes data come in
SAS or SPSS format, or with SAS or SPSS programs to read the data. In these
instances, it is much easier to use a program like StatTransfer or DBMS/Copy to
convert the data from one of these formats to Stata.
You may even have data in an Excel or other spreadsheet program. You can use
StatTransfer or DBMS/Copy for these files as well, or you can save the
spreadsheet as a tab-delimited file and use the "insheet" command to
read it into Stata. This is pretty easy to do, so you can try it on your own
with the instructions from our refresher lesson
If you do have to write your own program to read data into Stata, then you
have two basic choices: "infile" and "infix". The infile
command has more capabilities than infix, but is also more complex.
Essentially, the choice comes down to this: if your codebook has
"start" and "length" information for your variables, or
your variables are separated by spaces (not tabs or commas) then use infile. If
your codebook has "start" and "end" column information,
then use infix.
Some Terminology
- Codebook: A document that
describes in detail the data with which you are working. Although there is
no standard format for a codebook, a good one has the full wording of the
questions and answers, a list of all the codes or values used to enter the
data, and either the begin and end columns or the begin column and the
length of the variable.
- Dictionary: There are two
types of dictionaries: a data dictionary and a Stata dictionary. A data
dictionary is a document that lists all the variables and their locations
(columns) in the data file, and, sometimes, the values for those
variables. These are appropriate for any statistical package. A Stata
dictionary is a program or file that Stata uses to read a raw data file -
the information in this program can be obtained from either a data
dictionary or a codebook. For our purposes, "dictionary" will
refer only to a Stata dictionary.
- Observation: An observation is
a unit of analysis - a respondent in a survey, for example.
- Record: A record is one line
in a raw data file. Sometimes there is more than one record per
observation or there are different types of records in a single file.
Records are sometimes referred to as "cards."
- Column: a column actually
refers to a single character, including spaces, in a raw data file. This
is not the same as a column in a spreadsheet.
- Length: The number of columns
(or characters) a variable in a raw data file will occupy. Sometimes this
also means the number of bytes the same variable will take in a system
file.
- Raw data: Raw data are data
that have not been read into Stata (or some other package). If you were to
open this file you would see numbers and, perhaps, letters. You need a
codebook or data dictionary to be able to read the data into Stata. These
are sometimes called "text" or "ascii" files.
- Flat: a flat file is a
raw data file that has one record for each observation. These are
sometimes called "rectangular" files.
- Multiple Records:
Sometimes there is more than one record or line of data for each
observation in a raw data file. Each record has a different set of
variables on it, so each record must be read differently.
- Hierarchical:
Sometimes a raw data file will have different types or levels of
information on different records. An example is when a survey collects
information about a household, each family within that household, and
each person within each family. The different types of variables will be
on different records within the raw data file.
- String variable: a string
variable is one that has letters and/or numbers as opposed to just
numbers. An example would be a person's name. Numbers can be treated as
strings, but strings cannot be treated as numbers. Strings are also
referred to as "character" or "alphanumeric"
variables.
- Numeric variable: These are
numbers, plain and simple. Decimals, commas, and minus signs are the only
acceptable non-number characters allowed.
- Variable label: A variable
label is a short description of a variable. These appear in the Variable
window in Stata and also in the output. They are not required, but make
the output easier to understand.
- Value label: Like variable
labels, value labels make the output easier to read. Instead of printing “1”, “2”, “3”,
Stata will print “Yes”, “No”, “Maybe”.
Infile
This is a sample dictionary using the "infile"
command. These statements are entered into a separate file called
"demo.dct".
infile dictionary {
_column(1) month %2f "Calendar Month"
_column(3) year %4f "Calendar Year"
_column(7) employed %1f "Employment status that month"
_column(8) mugged %1f "Perpetrator/victim/witness of assualt"
_column(9) gender %1f "Gender"
_column(10) income %5f "That month's income"
_column(15) str6 d_income %6s "Income with a dollar sign"
_column(21) str13 str_num %13s "Number read in as a string"
_column(34) hhid %3f "Household ID"
_column(37) famid %1f "Family ID"
_column(38) perid %1f "Person ID"
_column(39) citynum %1f "City ID"
_column(40) str15 cityname %15s "City name"
_column(55) b_day %2f "Day of R's Birth"
_column(57) b_month %2f "Month of R's birth"
_column(59) b_year %4f "Year of R's birth"
_column(63) i_day %2f "Interview day"
_column(65) i_month %2f "Interview month"
_column(67) str10 str_date %10s "String Date"
_column(77) rand %21f "Just a random number"
}
|
The first line tells Stata where the data file is; this
must be the very first line in the dictionary file. The next several lines
provide Stata with a "map" of the data file. The
"_column()" statement tells Stata in which column the variable
starts. You can skip
columns/variables if you want. Next, if the variable is a string, then we
must tell Stata that it is a string as well as how long it is. Next is the
variable name. Then we tell Stata how long the variable is. Note that for
string variables we use an "s" and for numeric we use an
"f". Finally, we have the variable label. The label is not
necessary, but it is helpful.
|
|
infile using demo.dct, using(demo.dat)
infile using demo.dct if gender==1, using(demo.dat)
infile using demo.dct in 1/100 , using(demo.dat)
|
To execute the dictionary, we use the "infile
using" command. The first form reads the entire data file into memory.
The second form reads only those records where the variable
"gender" is "1" (note the double equal sign). The last
form only reads the first 100 records; this is good for testing your program
before executing it on the full data file.
|
|
infile month year employ mugged gender income str6
d_income using demo.dat
|
This form of the infile command can be used only if your
variables are separated by spaces. If you have string variables with embedded
blanks, then they must be enclosed in double quotes.
|
Infix
Here is a sample infix dictionary. Like the infile
dictionary, it is entered into a separate file called
"demo.dct".
infix dictionary{
month 1-2
year 3-6
employed 7
mugged 8
gender 9
income 10-14
str6 d_income 15-20
str13 str_num 21-33
hhid 34-36
famid 37
perid 38
citynum 39
str15 cityname 40-54
b_day 55-56
b_month 57-58
b_year 59-62
i_day 63-64
i_month 65-66
str10 str_date 67-76
rand 77-96
}
|
In the infix dictionary we can simply tell Stata in what
columns the variables are and specify the string variables as we did in the
Infile dictionary. Note that the first line of an Infix dictionary begins
with the word "infix".
|
|
infix using infix_demo.dct
infix using infix_demo.dct if year==1997
infix using infix_demo.dct in 1/100
|
Executing an infix dictionary is done in the same manner
as for an infile dictionary.
|
Saving Your Data
Of course, once you’ve read your data into Stata, you’ll
want to save it.
|
save mydata
|
This is the simplest form of the save command. It simply saves the data with the name
“mydata”.Stata will automatically
attach the “.dta” extension.
|
|
save mydata, replace
|
Stata will not let you save a dataset over itself without
you explicitly telling it to do so.
If you have made many changes to your data, you may want to save it
with a new name instead.
|
|
saveold myolddata
|
If you intend to give the data to someone who has
an older version of Stata, then you must use the "saveold" command. You should be aware that older versions of Stata have smaller limits for the size of the dataset as well as length of variable names and labels. The same holds true for the "Intercooled" and "Special Edition" versions: Intercooled has smaller limits than Special Edition.
|
Documenting Your Data
Once you have your data loaded into Stata, you'll want to
document it so you know what the variables are and how they are coded. You can
do this with variable and value labels, as well as having good records of the
data using Stata's "codebook" and "describe" commands.
In this lesson you will learn how to create and modify variable and value
labels, create notes for your data and variables as well as make good use of
the codebook and describe commands.
Variable and Value Labels
The output from the describe and codebook commands can be
very useful, but it can be even more useful if you have defined variable and
value labels. Variable and value labels also make the output from other
commands easier to read as well. A variable label is a short description of the
variable itself and what it is intended to represent. Value labels are a short
description of the individual values a variable may have. Used together, they
document your program as well as your output.
Here are a few rules and tips for labels:
- Variable labels can be up to
80 characters long in Intercooled Stata, 244 in Special Edition, anything beyond that gets truncated.
- Often, a variable label is
simply the text of a question in a survey. If the question is more than 80
characters, you must re-word the question be sure that you do not
change the meaning of the question by doing so!!! This may seem
obvious, but it has happened.
- If you do not use question
numbers as your variable names, then include the question number in the
variable label.
- Value labels have the same limits as variable labels, but since many procedures only print the first 12 characters, you would be wise to keep them rather short. Again, be sure not to change the meaning.
- Only integers and special missing values can be labeled.
- One label can be applied to
many variables; if you have several yes/no questions, you only have to
define one label and then apply it to all your yes/no variables.
- You can define more values in
a label than you have in your variable. For example, if you have some
yes/no questions and some yes/no/maybe questions, you can define a
yes/no/maybe label and apply it to the yes/no variables, provided "yes" and "no" are coded with the same numbers.
- Each variable can have only
one variable or value label applied at a time.
|
label var employ "Employment Status"
label var str_num `"This has "quotes" in the
middle"’
|
Defining variable labels is pretty simple. Be sure to use
the double quotes.
If you need quote marks in the label itself, this is how you can do it. The first character is the accent mark
above your “tab” key.
|
|
label define sex 0 "Male" 1 "Female"
label values gender sex
|
First you need to define the value label. Simply list the
values and the text of the label in quotes.
Then, you apply that label to a variable. Unfortunately, you can list only
one variable at a time in the label values command.
|
|
label define sex 2 "Other", add
label define sex 0 "0: No" 1 "1: Yes" 2 "2:
Maybe", modify
|
You may need to change a label definition at a later time.
Here are two ways of doing so. You can use the "modify" option to
add new values as well. Once you have re-defined the label, you do not need
to re-apply it.
|
|
label dir
label list
label list sex
label drop sex
|
Here are some commands to help you work with labels.
|
|
numlabel sex,add
numlabel sex,remove
|
The numlabel command adds (or removes) the numeric value of a label to the label itself.
For this example, the "sex" label would become "1. Yes", "2. No" and "3. Maybe"
|
|
labelbook
labelbook sex
|
The labelbook command produces a "codebook" for your value labels
|
|
label _dta “This is Employment and Assault data”
|
You can also label your entire dataset.
|
Notes
Variable and value labels go a long way toward documenting
your data. Sometimes, though you may need or want to include information that
you wouldn't necessarily want in a label. Information on how you constructed a
variable and in which do-file you did it or reminding yourself of skip patterns
are examples. This is where variable "notes" are useful. Notes are
not displayed in any output by default, but are a good way of keeping a record
of what you did. The only drawback to using notes is the fact that only Stata
has them; other packages like SAS or SPSS do not have notes, so if you convert
your data, the notes will not be converted as well (variable and value labels
are, though). You can have a total of 9,999 notes in your data and you can have
more than one note for any variable.
|
note: i started this on monday
note: i started this on TS
note income: i constructed this variable in day2b.do
|
To define a note for the dataset as a whole, simply use
the note command - be sure to use the colon.
By adding "TS" to the note, you can include a time stamp in any of your notes as well.
Finally, you can attach notes to specific variables.
|
|
notes
notes _dta
notes mugged
|
To display all of your notes, just give the
"notes" command.
You can display just the general notes
And you can display the notes specific to a variable or variables.
|
|
notes drop _all
notes drop mugged
notes drop _dta 3
|
To drop notes, just give the "notes drop"
command.
If you have more than one note attached to your data or a variable, you
can also drop just certain ones.
|
Describe and Codebook
The describe and codebook commands provide information about
your data. Codebook provides much more detail than describe, but you will
probably use describe more often.
|
describe
|
describe is the most basic form of the command. It
produces a short description of the file: name, number of observations,
number of variables, and size. It also lists all of the variables in the
dataset, as well as information about those variables.
|
|
d,s
|
Describe can be abbreviated as "d". The "s"
option - an abbreviation for "short" - only lists the information
about the whole file.
|
|
d col*
|
You can list information for just certain variables as
well.
|
|
d using demo
d var1 var var3 using demo
|
You can list information for a dataset that is not loaded
into memory. This is useful for determining how high to set the memory before
loading it.
|
The describe command lists some useful information, but sometimes you need
even more detail about the variables. For this, we have the codebook command.
|
codebook
|
Codebook lists a very detailed description of each of your
variables. Like describe, you can list this information for just certain
variables.
|
|
codebook, header
|
The header option lists information similar to what you
get from using the "short" option in describe.
|
|
codebook, tab(15)
|
By default, the codebook command will list frequencies for
variables that have nine discrete values or less and means for those with ten
or more. The tab option allows you to specify a higher or lower threshold.
|
|
codebook, notes
|
The notes option simply includes the data and variable
notes, if any, in the output.
|
|
codebook, problems
|
The problems option is good for identifying potential problems such as variables that are constant or all missing, variables that have non-existent value labels assigned, and several others.
|
Examining Your Data
Now that you have some data loaded, you’ll want to take a
good look at it to make sure it was read in correctly and that it was what you
expected. There are four basic commands
you can use to do this: “list,”
“browse,” “tab” and “sum”.
|
l
l id month year employ
l id month year employ in 500/1000
l id month year employ if mugged==1
l id month year employ,clean
l id month year employ, sepby(year)
|
“l” is short for “list” which simly prints out all your
data. You can specify which variables
to print, as well as limit the observations.
The "clean" option prevents the lines from being drawn around the data
The "sepby()" option tells Stata where to put the separator lines
|
|
browse
browse id month year employ
browse id month year employ in 500/1000
browse id month year employ if mugged==1
|
Sometimes you need to look at more variables than can fit
in the Results window. The “browse”
command will open a new window which allows you to scroll left and right as
well as up and down through your data.
|
|
tab employ
tab mugged employ
tab1 mugged employ year month
|
“tab” is short for tabulate and produces frequency tables
of the variables you specify. If you
list two variables, then you get a “crosstab.” If you simpy want several one-way tables, then use the “tab1”
command and list the variables.
|
|
sum
sum income
sum income, detail
|
“sum” is short for “summarize”. This command produces means, standard deviations, minimum and
maximum values. The “detail” option
produces some other useful information as well.
|
On to the next tutorial, Creating and Manipulating Variables
|