Computing and Statistics Workshops



May 2017

Dawn Koffman, Boriana Pratt
5/09/2017 from 9:30 AM to 12:00 PM ~ WWS 015

This workshop will show how the Stata commands margins and marginsplot can be used for model interpretation and visualization, and will present ways to compute adjusted predictions and marginal effects, as well as ways to compare predictions for levels of a factor variable.

The workshop will also demonstrate how the user-written command coefplot, by Ben Jann, can be applied to any estimation results to graphically display regression coefficients or other statistics of interest.  One nice feature of coefplot is that it can be used to very easily display results from several models on one graph.  Another nice feature is that most of the options available for other twoway plot types can also be used with the coefplot command.

Angel Brady
5/10/2017 from 10:30 AM to 12:30 PM ~ WWS 015

This workshop provides an introduction to OpenScholar, a website building and content management tool for hosting professional profiles. The workshop will go over the basics of creating a website, adding content, and creating an appealing site layout.  Attendees are encouraged to bring content such as a CV and images, so that they are ready to upload content to their website.  A valid Princeton University netID is needed to log in to OpenScholar.

Dawn Koffman
5/11/2017 from 9:30 AM to 12:00 PM ~ 290 Wallace Hall

This workshop provides a discussion of issues to consider when designing statistical graphs. Topics include:

  1. tables vs graphs
  2. audience and setting
  3. representing data accurately
  4. highlighting comparisons of interest
  5. simplicity and clarity
  6. color

Several sets of graphs are examined that attempt to "tell the same story" and discussion will center on why one display may be preferable to another. While the workshop goes over the pros and cons of using bar graphs, dot plots, line graphs, box plots, violin plots and several other graph types, the discussion is implementation tool independent, and is intended to be useful for those building graphs using R Base Graphics, ggplot2, and Stata Graphics, as well as many other tools.


January 2017

Matthew J. Salganik
1/10/2017 from 9:30 AM to 12:00 PM ~ 300 Wallace

If you want to increase the quality and impact of your work, you should consider doing open and reproducible research. In this workshop, I will begin by providing a working definition for what it means for your research to be open and reproducible. Then, I will describe the ways that you can overcome the obstacles that may be preventing you from being open and reproducible. The workshop will be illustrated by some of my own struggles with these issues during my career. Because there are many complicated technical, legal, professional, and ethical issues involved, there will be lots of time for questions and discussion.

Boriana Pratt
1/10/2017 from 1:30 PM to 4:00 PM ~ 300 Wallace

This workshop introduces concepts, software tools and best practices for making research reproducible. Topics include version control (Git/GitHub), managing file dependencies (make), and tools for creating dynamic documents. Most of the tools presented are in R through RStudio (Sweave, R Markdown, knitr, R Notebook); two Stata commands (Weaver, Stata markdown) will be presented towards the end.

Dawn Koffman
1/12/2017 from 9:30 AM to 12:00 PM ~ 300 Wallace

This workshop shows how you can access Princeton's high performance computing resources. Discussion includes an overview of the Linux systems that are available at Princeton and how to obtain accounts, connect and transfer files, run R and Stata programs on these systems, and submit jobs using a job scheduler called SLURM. In addition, time will be spent showing Linux commands for managing files and explaining how to write Linux shell scripts to automate repetitive tasks.


September 2016

Germán Rodríguez
9/20/2016 from 5:00 PM to 6:00 PM ~ Wallace 217

This workshop provides a brief introduction to Stata. Attendance is limited to first-year OPR graduate students and post-docs.

Boriana Pratt
9/21/2016 from 5:00 PM to 6:00 PM ~ Wallace 217

This workshop provides a brief introduction to Stata Data Management. Attendance is limited to first-year OPR graduate students and post-docs.

Dawn Koffman
9/22/2016 from 5:00 PM to 6:00 PM ~ Wallace 217

This workshop provides a brief introduction to Stata Graphics. Attendance is limited to first-year OPR graduate students and post-docs.

May 2016

Brandon M. Stewart
5/03/2016 from 9:30 AM to 12:00 PM ~ Wallace 300

The Structural Topic Model is a general framework for topic modeling with document-level covariate information. The covariates can improve inference and qualitative interpretability and are allowed to affect topical prevalence, topical content or both. The stm software package implements the estimation algorithms for the model and also includes tools for every stage of a standard workflow, from reading in and processing raw text through making publication-quality figures. The workshop will provide a hands-on introduction to using the stm package, which currently includes functionality to:

  • ingest and manipulate text data
  • estimate Structural Topic Models
  • calculate covariate effects on latent topics with uncertainty
  • estimate a graph of topic correlations
  • compute model diagnostics and summary measures
  • create the plots used in various papers about stm

Dawn Koffman
5/09/2016 from 9:30 AM to 12:00 PM ~ Wallace 300

This workshop introduces two modern R packages, both written by Hadley Wickham, that provide intuitive tools for handling common data management tasks. The first package, tidyr, provides functions that reshape data so it conforms to a specific “tidy” structure where each variable is saved in its own column, each observation is saved in its own row, and each type of observational unit is stored in a separate table. The second package, dplyr, provides a set of functions (referred to as “verbs”) that allow you to easily subset observations, re-order observations, select specific variables, add new variables, group observations, and summarize groups of observations.
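
As a rough illustration of the "tidy" structure described above, the sketch below reshapes a small wide table into one row per observation and then applies two verb-like operations. It uses plain Python rather than tidyr or dplyr; the data, the column names, and the helper to_long are hypothetical and are not taken from the workshop materials.

```python
# Hypothetical wide table: one row per country, one column per year.
wide = [
    {"country": "A", "2014": 10, "2015": 12},
    {"country": "B", "2014": 7, "2015": 9},
]


def to_long(rows, id_key, var_name, value_name):
    """Reshape wide rows into long ("tidy") rows: one observation per row."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key == id_key:
                continue  # the identifier is repeated on every output row
            long_rows.append(
                {id_key: row[id_key], var_name: key, value_name: value}
            )
    return long_rows


tidy = to_long(wide, "country", "year", "count")

# Verb-like operations written as plain Python expressions:
subset = [r for r in tidy if r["count"] >= 9]      # subset observations
ordered = sorted(tidy, key=lambda r: r["count"])   # re-order observations
```

Each row of tidy now holds a single country-year observation, which is the shape that makes the subsetting and ordering steps one-liners.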


January 2016

Matthew J. Salganik
1/06/2016 from 9:30 AM to 12:00 PM ~ 300 Wallace

If you want to increase the quality and impact of your work, you should consider doing open and reproducible research. In this workshop, I will begin by providing a working definition for what it means for your research to be open and reproducible. Then, I will describe the ways that you can overcome the obstacles that may be preventing you from being open and reproducible. The workshop will be illustrated by some of my own struggles with these issues during my career. Because there are many complicated technical, legal, professional, and ethical issues involved, there will be lots of time for questions and discussion.

David Robinson
1/06/2016 from 1:30 PM to 4:00 PM ~ 300 Wallace

The concept of "tidy data" offers a powerful framework for structuring data to ease manipulation, modeling and visualization. However, most R functions, both those built-in and those found in third-party packages, produce output that is not tidy, and that is therefore difficult to reshape, recombine, and otherwise manipulate. This workshop introduces the broom package, which turns the output of model objects into tidy data frames that are well-suited to further analysis, manipulation, and visualization with input-tidy tools such as ggplot2 and dplyr.

Dawn Koffman
1/11/2016 from 9:30 AM to 12:00 PM ~ 300 Wallace

This workshop provides an introduction to the R graphics package ggplot2. Because ggplot2 is based on Wilkinson's Grammar of Graphics (2005), time is spent both (1) describing the main concepts of the grammar that define the graphical building blocks and (2) exploring many examples that show how to use ggplot2's layered approach to create basic and more complex graphs.

Dawn Koffman
1/19/2016 from 9:30 AM to 12:00 PM ~ 217 Wallace - OPR Computer Lab

This workshop provides a discussion of issues to consider when designing statistical graphs. Topics include:

  1. tables vs graphs
  2. audience and setting 
  3. representing data accurately 
  4. highlighting comparisons of interest 
  5. simplicity and clarity
  6. color

Several sets of graphs are examined that attempt to "tell the same story" and discussion will center on why one display may be preferable to another. While the workshop goes over the pros and cons of using bar graphs, dot plots, line graphs, box plots, violin plots and several other graph types, the discussion is implementation tool independent, and is intended to be useful for those building graphs using R Base Graphics, ggplot2, and Stata Graphics, as well as many other tools.


September 2015

Germán Rodríguez
9/22/2015 from 5:00 PM to 6:00 PM ~ 217 Wallace
This workshop provides a brief introduction to Stata. Attendance is limited to first-year OPR graduate students and post-docs.
Dawn Koffman
9/23/2015 from 5:00 PM to 6:00 PM ~ 217 Wallace
This workshop provides a brief introduction to Stata Data Management. Attendance is limited to first-year OPR graduate students and post-docs.
Dawn Koffman
9/24/2015 from 5:00 PM to 6:00 PM ~ 217 Wallace
This workshop provides a brief introduction to Stata Graphics. Attendance is limited to first-year OPR graduate students and post-docs.

May 2015

Dawn Koffman
5/05/2015 from 9:30 AM to 12:00 PM ~ 300 Wallace Hall
This workshop provides introductory and advanced techniques for making graphs in Stata. Topics include: Pros and cons of Stata graphics; Graphics syntax; Setting overall look of graphs; Making simple line graphs and scatter plots; Overlaying plot types; Adding text, titles and legends; Showing linear fit lines and confidence intervals; Labeling points and axes; Generating separate graphs for subsets of data; Storing graphs in memory and on disk; Combining graphs; and Using Stata graphs in other documents.
Dawn Koffman
5/06/2015 from 9:30 AM to 12:00 PM ~ 300 Wallace Hall
This workshop provides the nuts-and-bolts information you need to be able to *start* doing your work in an environment that's more powerful than your laptop. Topics include: an overview of the Linux environments available at Princeton, how to obtain accounts, connecting and transferring files, running R and Stata in these environments, submitting jobs via SLURM, Linux commands for managing your files, and a quick introduction to writing Linux shell scripts to automate repetitive tasks.
Chang Y. Chung
5/07/2015 from 9:30 AM to 4:00 PM ~ 217 Wallace Hall
5/08/2015 from 9:30 AM to 4:00 PM ~ 217 Wallace Hall
5/11/2015 from 9:30 AM to 4:00 PM ~ 217 Wallace Hall

Python is a very popular, general-purpose, multi-paradigm, open-source, scripting language. It is designed to emphasize code readability and has a clean syntax with high level data types. It is well-suited for interactive work and quick prototyping, yet it is powerful enough for writing large applications. In this full-day workshop, attendees are introduced to basic Python syntax and to its ecosystem. See the workshop syllabus for objectives.

Samuel Henry
5/12/2015 from 1:30 PM to 2:30 PM ~ 300 Wallace Hall
Amazon Mechanical Turk (mturk.com) is a web service that provides an on-demand, scalable workforce to complete tasks that people can do better than computers, such as providing survey response data. This workshop gives a unique opportunity to hear from a member of the Mechanical Turk software development team about using MTurk for survey research. Discussion will focus on MTurk's history, capabilities and ecosystems, with an emphasis on its capabilities.

January 2015

Dawn Koffman
1/05/2015 from 9:30 AM to 12:00 PM ~ 300 Wallace Hall
This workshop provides a discussion of issues to consider when designing statistical graphs. Topics include:

  1. tables vs graphs
  2. audience and setting
  3. representing data accurately
  4. highlighting comparisons of interest
  5. simplicity and clarity
  6. color

Several sets of graphs are examined that attempt to "tell the same story" and discussion will center on why one display may be preferable to another. While the workshop goes over the pros and cons of using bar graphs, dot plots, line graphs, box plots, violin plots and several other graph types, the discussion is implementation tool independent, and is intended to be useful for those building graphs using R Base Graphics, ggplot2, and Stata Graphics, as well as many other tools.
Dawn Koffman
1/07/2015 from 9:30 AM to 12:00 PM ~ 300 Wallace Hall
This workshop provides an introduction to the R graphics package ggplot2. Because ggplot2 is based on Wilkinson's Grammar of Graphics (2005), time is spent both (1) describing the main concepts of the grammar that define the graphical building blocks and (2) exploring many examples that show how to use ggplot2's layered approach to create basic and more complex graphs.
Matthew J. Salganik
1/08/2015 from 9:30 AM to 12:00 PM ~ Bowl 001 Robertson Hall (Lower Level)
If you want to increase the quality and impact of your work, you should consider doing open and reproducible research. In this workshop, I will begin by providing a working definition for what it means for your research to be open and reproducible. Then, I will describe the ways that you can overcome the obstacles that may be preventing you from being open and reproducible. The workshop will be illustrated by some of my own struggles with these issues during my career. Because there are many complicated technical, legal, professional, and ethical issues involved, we will have a team of experts on hand to help answer your questions, and we will leave lots of time for discussion.
Chang Y. Chung
1/08/2015 from 1:30 PM to 4:00 PM ~ 217 Wallace Hall
1/09/2015 from 1:30 PM to 4:00 PM ~ 217 Wallace Hall
This workshop introduces concepts, software tools, and best practices for making research more open and reproducible using R. Topics include:

  1. Motivations for open and reproducible research
  2. Replication and Reproduction
  3. Research Pipeline
  4. Version control: Git and GitHub
  5. Literate programming and presentation tools: LaTeX (.Rnw) and R Markdown (.Rmd)
  6. Dependency management: make
  7. Reproducible research checklist

September 2014

Germán Rodríguez
9/16/2014 from 5:00 PM to 6:00 PM ~ 217 Wallace Hall

This workshop provides a brief introduction to Stata. Attendance is limited to first-year OPR graduate students and post-docs.

Chang Y. Chung
9/17/2014 from 5:00 PM to 6:00 PM ~ 217 Wallace Hall
This workshop provides a brief introduction to Stata Data Management. Attendance is limited to first-year OPR graduate students and post-docs.
Dawn Koffman
9/18/2014 from 5:00 PM to 6:00 PM ~ 217 Wallace Hall
This workshop provides a brief introduction to Stata Graphics. Attendance is limited to first-year OPR graduate students and post-docs.

May 2014

Dawn Koffman
5/05/2014 from 9:30 AM to 12:00 PM ~ 300 Wallace Hall
You'll be a better data scientist if you're comfortable working in a Unix (or Linux or Mac OS X) command-line environment and are able to make use of command-line tools. For example, as we all know, most data needs to be cleaned, and often reshaped and combined with other data, before it can be easily viewed or used to obtain descriptive statistics or estimate multivariate models. Command-line tools provide flexible and efficient ways to handle these cleaning and data management tasks, regardless of how big the data is. The specific topics included in the Tour of the Terminal workshop are: using an interactive shell; file system structure, pathnames and permissions; pipelines, sequential execution, background execution and i/o redirection; emacs text editor; commands, options and arguments; building shell scripts; regular expressions; and the Unix stream editor (sed). So consider taking a break from the point-and-click interface and enhance your data science toolset.
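
To give a flavor of the regular-expression cleaning mentioned above, here is a minimal sketch that applies sed-style substitutions to a few messy lines. It is written in Python with the standard re module rather than in sed itself, and the sample lines and the helper clean are hypothetical.

```python
import re

# Hypothetical messy input lines with inconsistent spacing.
raw_lines = [
    "  id=001 ; name=Smith,  John ",
    "id=002;name=Doe,Jane",
]


def clean(line):
    """Apply a short pipeline of substitutions to one line."""
    line = line.strip()                    # trim leading/trailing whitespace
    line = re.sub(r"\s*;\s*", ";", line)   # roughly like sed 's/ *; */;/g'
    line = re.sub(r",\s*", ", ", line)     # exactly one space after each comma
    return line


cleaned = [clean(line) for line in raw_lines]
```

Each re.sub call plays the role of one substitution command in a stream-editing pipeline: the line flows through the steps in order and comes out normalized.
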
Chang Y. Chung
5/08/2014 from 9:30 AM to 12:00 PM ~ 217 Wallace Hall
5/09/2014 from 9:30 AM to 12:00 PM ~ 217 Wallace Hall
Python is a popular, general-purpose, multi-paradigm, open-source scripting language. It is designed to emphasize code readability and has a clean syntax with high level data types. It is well-suited for interactive work and quick prototyping, yet it is powerful enough for writing large applications. Python has a large number of available and well-written modules for everything from abstract syntax trees to ZIP file manipulation. Its ecosystem features an extensive set of tools including a JIT compiler and fancy IDEs. In this half-day workshop, attendees are introduced to basic Python syntax and to its ecosystem.
Chang Y. Chung
5/08/2014 from 1:30 PM to 4:00 PM ~ 217 Wallace Hall
5/09/2014 from 1:30 PM to 4:00 PM ~ 217 Wallace Hall
pandas is an open-source Python package that enables users to handle table-like ("relational") and key-value paired ("labeled") data, large and small, easily, intuitively, and quickly. Designed for practical, real-world data handling and analysis in Python, pandas is considered one of the new killer apps of the Python language in the Big Data era, and is one of the six packages of the SciPy core stack, which itself is rapidly gaining popularity among scientific communities. Specific data management problems/topics that we will discuss include: handling missing data; fast insertion and deletion; (automatic) data aligning; group by (like SQL) or split-apply-combine (like plyr); efficient slicing, indexing, and subsetting larger data based on hierarchical labels; using intuitive merge and join operations on multiple datasets; and utilizing robust and extensive I/O tools that interact well with many data formats, including CSV, Excel, SQL databases, HDF5, JSON, and even Stata.
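
The split-apply-combine and merge ideas listed above can be sketched without pandas. The example below uses only the Python standard library to show the logic; it is not the pandas API, and the records, the regions lookup table, and all field names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical records; None marks a missing value.
records = [
    {"state": "NJ", "year": 2013, "births": 100},
    {"state": "NJ", "year": 2014, "births": None},
    {"state": "NY", "year": 2013, "births": 250},
    {"state": "NY", "year": 2014, "births": 240},
]
# Hypothetical lookup table to merge in.
regions = {"NJ": "Mid-Atlantic", "NY": "Mid-Atlantic"}

# Split: group values by key, dropping missing values.
groups = defaultdict(list)
for row in records:
    if row["births"] is not None:
        groups[row["state"]].append(row["births"])

# Apply and combine: one mean per group.
summary = {state: sum(vals) / len(vals) for state, vals in groups.items()}

# Merge (left join): attach the region to every record by its state key.
merged = [dict(row, region=regions.get(row["state"])) for row in records]
```

In pandas the same steps would typically be expressed with a group-by followed by an aggregation and then a merge on the shared key; the point here is only to make the three stages visible.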

January 2014

Dawn Koffman
1/09/2014 from 9:00 AM to 12:00 PM ~ 300 Wallace Hall
This workshop provides an introduction to the R package ggplot2. Because ggplot2 is based on Wilkinson's Grammar of Graphics (2005), time is spent both (1) describing the main concepts of the grammar that define the graphical building blocks and (2) exploring many examples that show how to use ggplot2's layered approach to create basic and more complex graphs.
Chang Y. Chung
1/14/2014 from 1:30 PM to 4:30 PM ~ Bowl 001 Robertson Hall (Lower Level)
Python is a popular, general-purpose, multi-paradigm, open-source scripting language. It is designed to emphasize code readability and has a clean syntax with high level data types. It is well-suited for interactive work and quick prototyping, yet it is powerful enough for writing large applications. Python has a large number of available and well-written modules for everything from abstract syntax trees to ZIP file manipulation. Its ecosystem features an extensive set of tools including a JIT compiler and fancy IDEs. In this half-day workshop, attendees are introduced to basic Python syntax and to its ecosystem.

September 2013

Germán Rodríguez
9/17/2013 from 6:00 PM to 7:00 PM ~ 217 Wallace Hall
This workshop provides a brief introduction to Stata. Attendance is limited to first-year OPR graduate students and post-docs.
Chang Y. Chung
9/18/2013 from 5:30 PM to 7:30 PM ~ 217 Wallace Hall
This workshop provides a brief introduction to Stata Data Management. Attendance is limited to first-year OPR graduate students and post-docs.
Dawn Koffman
9/19/2013 from 5:00 PM to 6:00 PM ~ 217 Wallace Hall
This workshop provides a brief introduction to Stata Graphics. Attendance is limited to first-year OPR graduate students and post-docs.

May 2013

Dawn Koffman
5/07/2013 from 2:30 PM to 4:00 PM ~ B01 Fisher Hall
This workshop provides introductory and advanced techniques for making graphs in Stata 12. Topics include: Pros and cons of Stata graphics; Graphics syntax; Setting overall look of graphs; Making simple line graphs and scatter plots; Overlaying plot types; Adding text, titles and legends; Showing linear fit lines and confidence intervals; Labeling points and axes; Generating separate graphs for subsets of data; Storing graphs in memory and on disk; Combining graphs; and Using Stata graphs in other documents. The format is example-based. Attendees are provided with a Stata do file containing code for the graphs presented during the workshop, and are encouraged to modify and/or execute the code as graphs are discussed.
Felix Elwert
5/21/2013 from 9:00 AM to 5:00 PM ~ 300 Wallace Hall
5/22/2013 from 9:00 AM to 5:00 PM ~ 300 Wallace Hall
This course offers an applied introduction to the theory and practice of directed acyclic graphs (DAGs) for causal inference. DAGs offer rigorous yet intuitive tools for handling complicated causal problems in the observational social and biomedical sciences. The two primary uses of DAGs are: (1) determining the identifiability of causal effects from observed data, and (2) deriving the testable implications of a causal model. DAGs are also useful for illuminating the causal assumptions implicit in widely used statistical estimation strategies. This course introduces the essential elements for causal reasoning with DAGs and exemplifies these insights with social science examples.
Topics include: non-parametric identification by adjustment; d-separation; the difference between overcontrol bias, confounding bias, and selection bias; what variables to control for and what variables not to control for in observational research; effect heterogeneity; structural assumptions in instrumental variables identification; and recent work on causal mediation analysis.
Please note that this course focuses on spotting and understanding causal opportunities and causal problems. It is not a course on statistical methods (no software component). Students will discuss numerous exercises in class and solve a short homework assignment for the second day.
Rachel Schutt
5/24/2013 from 9:00 AM to 5:00 PM ~ 300 Wallace Hall
This workshop covers Data Science as a field and cultural phenomenon; Overall landscape of the scope of problems Data Scientists work on and the set of skills they tend to have; Data Scientist as a job title rising out of Silicon Valley tech companies; Data Science having potential to be a deep rigorous field; Using an API to get data; Building a classifier; Using Naive Bayes to classify documents; Exploratory data analysis; Anomalies in data, messy data, what can go wrong and how cleaning data can distort; How not bothering to do basic sanity checks can cause big problems; Modeling and evaluating models; A good case study of model not working and model working; Showcasing what statistical metrics are useful for and that "error" is not an isolated number; Interesting open research problems that Social Scientists might want to consider; Data Science tools and methods that Social Scientists can apply to their own research.