Selecting topics for analysis

Presenter Notes

Before everything was digitized

  • Solemn, trusted bodies collected data for indicators deemed feasible and relevant
    • UN
    • US Census Bureau
    • World Bank
    • OECD
    • ...
  • Life for the data analyst was in many ways easier
    • Response variables were determined by experts
    • Explanatory variables also typically chosen a priori
    • Data summaries were of manageable size
    • Data quality less of an issue (with exceptions, e.g., crime statistics)

Presenter Notes

Once everything became natively digital

  • Explosion of data generated as a byproduct of transactions
  • Can be made easily available if data creator has incentive to share
    • Many governments view data sharing as part of their core mission
    • Many firms share because they see value in joining their data with others
  • Why new sources of data are compelling but hard to deal with
    • Much more detailed information
    • Much finer level of granularity (location, time)
    • Raw data is usually not collected with the examiner's question in mind
    • Response variables must be constructed by examiner
    • Explanatory variables may not be directly present in the data set

Presenter Notes

Main tasks in digital era for data analysts

  1. Exploring and summarizing newly available raw data
    • Much raw data is simply put "out there" in hopes that someone will come along to distill insights
    • One option is to summarize the entire dataset
    • Another is to focus on just the subset relevant to a narrower question
  2. Linking new data with "old" data sources
    • New data can provide explanatory variables that were not previously available
  3. Linking new data with other new resources
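A minimal sketch of task 2, linking a "new" transaction-level source to an "old" curated indicator. All names and values here (`literacy_rate`, `transactions`, `link`, the region codes) are invented for illustration; the point is only that raw records are first aggregated to the unit of the traditional source, then joined on a shared key.

```python
from collections import Counter

# Hypothetical "old" data: one curated indicator per region (values invented).
literacy_rate = {"A": 0.91, "B": 0.78, "C": 0.85}

# Hypothetical "new" data: raw transaction-level records, one per event.
transactions = [
    {"region": "A", "amount": 12.0},
    {"region": "A", "amount": 7.5},
    {"region": "B", "amount": 3.0},
    {"region": "D", "amount": 9.0},  # region with no matching indicator
]


def link(indicator, records, key="region"):
    """Aggregate raw records per key, then join onto the indicator table.

    The record count per region becomes a candidate explanatory variable.
    Regions missing from the raw records get a count of 0; records whose
    region has no indicator are dropped (a left join on the indicator).
    """
    counts = Counter(r[key] for r in records)
    return [
        {key: k, "indicator": v, "n_records": counts.get(k, 0)}
        for k, v in sorted(indicator.items())
    ]


linked = link(literacy_rate, transactions)
for row in linked:
    print(row)
```

A left join is only one choice. Whether unmatched regions such as "D" should be dropped, kept, or investigated depends on the question being asked.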

Presenter Notes

Different ways to access new data

  1. Web scraping
  2. Complete data dumps
  3. APIs
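The three access routes tend to hand you data in different shapes. The sketch below works on small inline strings rather than live requests, so the page, dump, and response contents are all made up. It shows one plausible way to get each shape into plain Python rows using only the standard library.

```python
import csv
import io
import json
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell in an HTML page."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.cells.append(data.strip())


# 1. Web scraping: pull values out of markup meant for human readers.
page = "<table><tr><td>Cafe Uno</td><td>4.5</td></tr></table>"  # invented page
parser = CellCollector()
parser.feed(page)
scraped = parser.cells

# 2. Complete data dump: a whole table at once, here as CSV text.
dump = "name,rating\nCafe Uno,4.5\nCafe Dos,3.9\n"  # invented dump
dumped = list(csv.DictReader(io.StringIO(dump)))

# 3. API: a structured answer to one specific query, here as JSON text.
response = '{"results": [{"name": "Cafe Uno", "rating": 4.5}]}'  # invented
api_rows = json.loads(response)["results"]

print(scraped, dumped, api_rows, sep="\n")
```

Note that the scraped and CSV values arrive as strings, while the JSON response already carries a numeric rating. Type cleanup is usually part of the work with the first two routes.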

Presenter Notes

Some compelling "new" data resources

Presenter Notes

Summarizing data from new resources

  1. If you have access to an entire data set
    • Pick variables of interest and construct response variable
    • If location information is provided, consider overlaying on a map
    • If time information is provided, plot response variable over time
    • We will go over techniques for this beginning in a couple of weeks
  2. If you only have API (or some other query-level) access
    • It is up to you to craft queries of interest
    • Often best to focus on a few well-chosen queries
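As a rough illustration of case 1, the sketch below constructs a response variable (events per day) from raw timestamped records and lists it over time. The records, the field names `timestamp` and `category`, and the helper `daily_counts` are all hypothetical. A real data set would need its own parsing and cleaning.

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw records, one per event (values invented).
records = [
    {"timestamp": "2012-02-20T08:15:00", "category": "noise"},
    {"timestamp": "2012-02-20T17:40:00", "category": "pothole"},
    {"timestamp": "2012-02-21T09:05:00", "category": "noise"},
    {"timestamp": "2012-02-23T11:30:00", "category": "noise"},
]


def daily_counts(rows, category=None):
    """Return sorted (date, count) pairs, optionally for one category."""
    days = (
        datetime.fromisoformat(r["timestamp"]).date()
        for r in rows
        if category is None or r["category"] == category
    )
    return sorted(Counter(days).items())


# Response variable: number of "noise" events per day.
series = daily_counts(records, category="noise")
for day, n in series:
    print(day.isoformat(), n)
```

Days with no records simply do not appear in the result, so a plot over time would need those gaps filled with zeros first.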

Presenter Notes

Example: Google Places API

Presenter Notes

Exercise 1: explore new data source

Presenter Notes

Opportunities for analysis

  1. Combining traditional indicators in previously unexpected ways
  2. Combining traditional indicators with newly available data
  3. Examining new sources of data

Presenter Notes

Exercise 2

Break up into groups to work out a strategy for using new data sources to better explain the following traditional data sources

Choose one of the new sources, and explain how data could be collected from the new source to construct a helpful explanatory variable for the other source.

Presenter Notes

Project

  • Timeline
    • Groups should be formed by the evening of Wednesday Feb 22
    • Post to Piazza if you have an idea for a topic and are looking for partners, or if you are looking for partners and ideas
    • We will sort out anyone still "groupless" by then after class on Thursday Feb 23
    • P0 due Friday Mar 2
    • Groups will meet with me to discuss P0 the following week
  • Topic selection
    • If you are working with a "traditional" data source (e.g., indicators of literacy rates and economic development collected from the World Bank), then you must complement the study with data collected from "new" sources to serve as an additional explanatory variable
    • If you are collecting data from a "new" source, then collecting data from another source to link with it is still strongly encouraged (see me if you think this will be infeasible or will not make sense)

Presenter Notes