Final Project Resources

Below are some places to find public data that may be useful for your projects. This list is NOT exhaustive; it is meant to help generate ideas and give you a sense of what exists. If you find that you are interested in a question that would require data that isn’t on this list, ask me and I may be able to help find it!

Data That ISN’T Allowed

One note up front: data from Kaggle or other competition sites is not allowed. Part of the goal of this class is to give you experience working with messy real-world data: picking your own questions and then sorting through all the data to find the variables you need. Kaggle data and the like are pre-cleaned, and come with JUST the variables needed to answer a specific pre-determined question. So absolutely no pre-curated data from competitions.

Good Starting Points

  • IPUMS: Just look at that list of data sources! IPUMS is amazing, and includes (just to list a few):

    • US American Community Survey: the US runs regular surveys (called the American Community Surveys). Results are available for many years, spanning a few decades, for small geographic units. They include things like income, race and ethnicity, along with geographic identifiers (in tabular formats) and GIS shapefiles (for GIS manipulations). Can be very useful for things like estimating consumer demographics in different areas.

    • International Census Data: Similarly, census data from over 102 countries is also easily accessible. This data often isn’t quite as rich as the US ACS data (censuses have to be shorter than surveys), but still often has questions on health, income, race and ethnicity, etc. Also includes a lot of GIS data.

    • Health Surveys

    • NHGIS: Probably the MOST useful resource for US census and demographic data that comes with GIS information.

  • Wharton Research Data Services: A huge database of business data. This data is NOT public (and is very expensive), but is accessible to Duke students through a Duke subscription. Note that if you use this data, you’ll be able to share code and reports, but you won’t be able to share your full project publicly, since the data can’t be public. The data, as far as I can tell, is primarily a compilation of public company financial data (board memberships, M&As, company financials, etc.). Much of what it has to offer is à la carte, and Duke only has some subscriptions, so if you need something you’ll have to explore to see if it’s covered.

  • The Microsoft Planetary Computer has some great resources for environmental questions, like this database of labelled images of wildlife!

  • The AWS Open Data Registry, while poorly organized, has great genomics and health data from the NIH, all the environmental data noted above, space telescope data, and more!

  • AirBnB Listings

Data with a Spatial / GIS Component

Government census data is often the underpinning of spatial analyses, because it’s available almost everywhere, is free, and has tons of information about… well, everyone!

The best resources for spatial census data are NHGIS (for US data) and IHGIS (for international data). These projects are run by the same folks, IPUMS, who we’ve gone to in the past for individual-level census data in the US or internationally. They’re amazing. You go to their site, tell them the geographic level at which you want data, and they will provide you with a list of available data. A few notes about using these services:

  • The larger the geographic area of aggregation, the more data they will be able to provide – privacy concerns mean that when geographic areas get really small, some data may be withheld to protect respondents.

  • They provide data in three files: a shapefile with a column called GISJOIN, a tabular dataset with all your data and a GISJOIN column, and a README that tells you what all the poorly named variables in the tabular data mean. So your first step with this data will almost always be to merge the tabular data with the shapefile using GISJOIN, then rename things based on the information in the README.
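
To make that merge-then-rename step concrete, here is a minimal sketch using pandas. Everything in it is made up for illustration: the GISJOIN values, the variable codes (ABC001, ABC002), and the readable names are hypothetical stand-ins for what your own extract and README will contain. To keep the example self-contained, the shapefile is represented by a plain DataFrame; in a real project you would likely load it with geopandas.read_file(...) and call .merge on the resulting GeoDataFrame in the same way, which keeps the geometry column.

```python
import pandas as pd

# Stand-in for the shapefile's attribute table (hypothetical values).
# A real shapefile loaded with geopandas would also have a geometry column.
shapes = pd.DataFrame(
    {
        "GISJOIN": ["G3700630", "G3701350", "G3701830"],
        "NAME": ["Durham", "Orange", "Wake"],
    }
)

# Stand-in for the tabular file, with opaque variable codes
# (codes and numbers are invented for this example).
table = pd.DataFrame(
    {
        "GISJOIN": ["G3700630", "G3701350", "G3701830"],
        "ABC001": [100, 50, 300],
        "ABC002": [60000, 70000, 80000],
    }
)

# Mapping you would write by hand after reading the README (hypothetical).
readable_names = {"ABC001": "total_population", "ABC002": "median_income"}

# validate="1:1" raises an error if GISJOIN is duplicated on either side;
# indicator=True adds a _merge column so unmatched rows are easy to spot.
merged = shapes.merge(
    table, on="GISJOIN", how="outer", validate="1:1", indicator=True
)
assert (merged["_merge"] == "both").all(), "some units did not match"

# Drop the bookkeeping column and swap in the human-readable names.
merged = merged.drop(columns="_merge").rename(columns=readable_names)
print(merged)
```

Using an outer merge with the indicator column is a deliberate choice here: it lets you confirm that every geographic unit matched before you move on, rather than silently dropping rows.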

Public Satellite Data

Another great spatial data resource is satellite data! We aren’t covering raster data in detail in this class, but that’s not because it isn’t useful – NASA has satellite data for the whole world with information on things like elevation, flood risk, air pollution, what kinds of plants are growing in different places (by looking at what wavelengths plants reflect, satellites can identify crops!), satellite imagery (used for things like studying energy infrastructure, or for “financial intelligence” firms doing things like studying factory activity to predict company earnings ahead of official announcements), and more. It’s… obscene how much data they have.

While most of this comes from NASA or NOAA, in the same way most people get their census data from IPUMS (not govt census bureaus), most people I know actually get their satellite data from either the Microsoft Planetary Computer or the AWS Open Data Registry.

Other Lists of Data That Are Great

Have Something In Mind I Didn’t Cover?

It’s hard to overstate how much data is freely available online. There are datasets on armed conflicts and terrorism (here and here), air pollution anywhere in the world, flooding and natural disaster data from satellites (e.g. here), measures of democratic institutions and freedom (here), data on elections, trade, shipping traffic, and oh so much more.