Data Science for Social Scientists


Rachel Schutt is a Senior Research Scientist at Johnson Research Labs, an Adjunct Assistant Professor of Statistics at Columbia University, and a founding member of the Education Committee of the Columbia Institute for Data Sciences and Engineering. Prior to her work at Johnson Research Labs, Schutt was a Senior Statistician at Google in the New York office. Schutt is co-authoring a book (with Cathy O'Neil) called "Doing Data Science." Her interests include statistical modeling, exploratory data analysis, machine learning algorithms, and social networks, as well as the ethical dimensions of data science and using data science to do good. She earned her Ph.D. in Statistics from Columbia University and Master's degrees in mathematics and engineering from NYU and Stanford University.

5/24/2013 from 9:00 AM to 5:00 PM ~ 300 Wallace Hall
This workshop covers data science as a field and cultural phenomenon; the overall landscape of problems data scientists work on and the skills they tend to have; "data scientist" as a job title rising out of Silicon Valley tech companies; the potential of data science to be a deep, rigorous field; using an API to get data; building a classifier; using Naive Bayes to classify documents; exploratory data analysis; anomalies in data, messy data, what can go wrong, and how cleaning data can distort it; how skipping basic sanity checks can cause big problems; modeling and evaluating models; a case study of a model not working and a model working; what statistical metrics are useful for, and why "error" is not an isolated number; interesting open research problems that social scientists might want to consider; and data science tools and methods that social scientists can apply to their own research.
Students should have a basic working knowledge of R.
Hardware/Software Requirements
Students will need to bring a laptop with R installed. Several R packages will also be required; they will be announced at a later date.
Approximate Schedule
9:00 - 10:00 am  "Intro to Data Science" as a field and cultural phenomenon.
10:00 - 10:20 am  Break
10:20 am - 12:30 pm  Using an API to get data and building a classifier (Naive Bayes) to classify documents. Participants are walked through building and running it themselves.
12:30 - 1:30 pm  Lunch
1:30 - 2:20 pm  Exploratory Data Analysis. Anomalies in the data. Messy data. What can go wrong. How cleaning data sets can distort them.
2:20 - 2:35 pm  Break
2:35 - 3:30 pm  Modeling and Evaluating Models: a case study of a model not working and a model working.
3:30 - 4:00 pm  Break
4:00 - 5:00 pm  Bringing this back to your research. Break-out sessions and group discussion. Interesting open research problems that social scientists might want to consider. Data science tools or methods that participants could apply to their own research.