Welcome to Unifying Data Science!
Hello, and welcome to the course site for Duke IDS 701!
All too often, students learn data science by taking a course on machine learning from a computer scientist, a course on statistical modelling from a statistician, and a course on causal inference from a social scientist. As a result, graduating students find themselves with a toolbox of techniques, but no clear idea of how to use them to solve problems.
The aim of this course is to overcome this fragmentation and to provide students with a unified approach to using data science to solve real world problems. To that end, we will introduce a question-first, backwards design framework for systematically designing a data science project. Through exercises, students will practice each step of this approach, from working with stakeholders to properly articulate the problem they are seeking to address, to picking a question (which, if answered, will help the stakeholder solve their problem), selecting the appropriate methodological approach to answering that question, and developing a concrete strategy for generating an answer.
Having established this framework for solving data science problems, the class will then pivot to providing an application-focused introduction to causal inference, the art and science of using statistical data to make causal statements about the world. Our approach will be rooted in the potential outcomes framework, and will cover a range of methods of statistical inference including randomized experiments, pre-post analysis, difference-in-differences, and instrumental variables. We will also discuss concepts like the distinction between internal and external validity, and the limitations of estimating Average Treatment Effects.
Finally, towards the end of the semester—once we have covered causal inference in this class and MIDS students have covered machine learning in detail in IDS 705—we will return to our more general investigation of how best to use data science to solve problems, now with a focus on when (supervised) machine learning approaches are appropriate and when causal approaches are preferable.
In addition to completing a number of exercises related to project design, over the semester students will conduct a complete data science project themselves. Data science is a fundamentally applied field, and there is no substitute for learning to put these project design principles into action through practice. These projects will be developed incrementally over the course of the semester with instructor guidance. By the end of the semester, students will have picked a topic area, developed a (tractable) question, decided what an answer to that question would actually look like, developed a work plan for generating that answer, executed and presented their project, and then iterated on the project based on feedback from their initial presentation. For MIDS students, this will serve as a “capstone-project with training wheels” to prepare them for their second-year Capstone projects with external partners. The project should also provide all students with a portfolio piece they can present to potential future employers.
Throughout the course, we will also be consistently returning to a few themes, chief among them the importance of developing a skeptical mindset. This is a core data science skill, but one that students do not always have the opportunity to practice. In this course, we will discuss and practice approaching our data, our code, our statistical models, our problem statements, and the work of others from a constructive but skeptical perspective.
Pre-Requisites for Non-MIDS Students
This course is primarily designed for students in the Duke Masters in Interdisciplinary Data Science (MIDS) program, but students from other programs are more than welcome if they have the appropriate pre-requisite training. Data Science is a fundamentally interdisciplinary field, so the more perspectives we have represented in the classroom the better!
This course will assume that enrolled students have a good grasp of inferential statistics and statistical modelling (e.g. a course in linear models), though no prior experience with causal inference is expected. In addition, MIDS students will be taking a concurrent course in applied machine learning, and so incoming students will also be expected to have some basic experience with machine learning, or be concurrently enrolled in an applied machine learning course.
This course will also assume students are comfortable manipulating real-world data in either Python or R. The substantive content of this course is language-independent, but because students will be required to work on their projects in teams, comfort with one of these two languages will be required to facilitate collaboration. (Note that while MIDS students are generally “bilingual” in R and Python, they tend to prefer Python, so life will be a little easier if you have a background in pandas.) Where code examples are provided in class, they will use Python (pandas), but both the instructor and our TAs are also capable of providing support in R.
Finally, students will also be expected to be comfortable collaborating using git and GitHub. If you meet the other requirements for this course but are not familiar with git and GitHub, this is a skill you should be able to pick up on your own in advance of the course without too much difficulty. You can read more about git and GitHub here. The Duke Center for Data and Visualization Science also hosts git and GitHub workshops if you are a Duke student.
Syllabus
To learn more about the course, please read the course syllabus, available here, and see a preliminary schedule and the texts we’ll be using here. Note that our schedule is subject to change, but should give you a good sense of the material we will cover.