Welcome to Unifying Data Science!

Hello, and welcome to the course site for Duke IDS 701!

This is a two-part course. The first portion provides an application-focused introduction to causal inference. It introduces the potential outcomes framework and methods of causal statistical inference, including randomized experiments, pre-post analysis, and differences-in-differences, through both readings and hands-on exercises. We pay particular attention to concepts like internal and external validity, and the limitations of estimating Average Treatment Effects.

The second portion of the course is focused on learning to execute a full data science project from conceptualization through execution and presentation. I introduce a question-first, backwards design framework for systematically designing a data science project. Through exercises, students practice each step of this approach, from working with stakeholders to properly articulate the problem they are seeking to address, to picking a question (which, if answered, will help the stakeholder solve their problem), selecting the appropriate methodological approach to answering that question, and developing a concrete strategy for generating an answer.

In addition to completing a number of exercises related to project design, over the semester students conduct a complete data science project themselves. Data science is a fundamentally applied field, and there is no substitute for learning to put project design principles into action through practice. These projects are developed incrementally over the course of the semester with instructor guidance. By the end of the semester, students will have picked a topic area, developed a (tractable) question, decided what an answer to that question would actually look like, developed a work plan for generating that answer, executed and presented their project, and then iterated on the project based on feedback from their initial presentation. For MIDS students, this will serve as a “capstone project with training wheels” to prepare them for their second-year Capstone projects with external partners. This project also provides all students with a portfolio piece they can present to potential future employers.

Throughout the course, we will also be consistently returning to a few themes, chief among them the importance of developing a skeptical mindset. This is a core data science skill, but one that students do not always have the opportunity to practice. In this course, we will discuss and practice approaching our data, our code, our statistical models, our problem statements, and the work of others from a constructive but skeptical perspective.

Pre-Requisites for Non-MIDS Students

This course is primarily designed for students in the Duke Masters in Interdisciplinary Data Science (MIDS) program, but students from other programs are more than welcome if they have the appropriate pre-requisite training. Data Science is a fundamentally interdisciplinary field, so the more perspectives we have represented in the classroom the better!

This course will assume that enrolled students have a good grasp of inferential statistics and statistical modelling (e.g. a course in linear models), though no prior experience with causal inference is expected. In addition, MIDS students will be taking a concurrent course in applied machine learning, and so incoming students will also be expected to have some basic experience with machine learning, or be concurrently enrolled in an applied machine learning course.

This course will also assume students are comfortable manipulating real-world data in either Python or R. The substantive content of this course is language-independent, but because students will be required to work on their projects in teams, comfort with one of these two languages will be required to facilitate collaboration. (Note that while MIDS students are generally “bilingual” in R and Python, they tend to prefer Python, so life will be a little easier if you have a background in pandas.) Where code examples are provided in class, they will use Python (pandas), but both the instructor and our TAs are also capable of providing support in R.

Finally, students will also be expected to be comfortable collaborating using git and github. If you meet the other requirements for this course but are not familiar with git and github, this is a skill you should be able to pick up on your own in advance of the course without too much difficulty. You can read more about git and github here. The Duke Center for Data and Visualization Science also hosts git and github workshops if you are a Duke student.

Syllabus

To learn more about the course, please read the course syllabus available here, and see a preliminary schedule and the texts we’ll be using here. Note that our schedule is subject to change, but it should give you a good sense of the material we will cover.