Resources
Websites of interest
Here are some links to interesting websites about data analysis, as well as sources of data. Email me if you find a useful resource that is not listed here.
Blogs
Data repositories and APIs
A number of websites make data available for download, either through an end-user interface or via an API aimed at developers.
- Freebase -- wide variety of community-sourced data
- Infochimps
- Numbrary
- Google Fusion Tables Search
- UN Data Sources
- ITU Statistics on ICT usage
- World Bank
- Sunlight Foundation
- Data.gov -- US-based data
- NYC data
- MBTA Transit Data
- San Francisco Data
- Google n-grams corpus
- Security breach database
- Alexa top 1 million websites
- Chilling Effects -- Database of reported DMCA notice-and-takedown requests for alleged copyright infringements. The raw database may be available from the maintainers. Email me if you are interested in working with this data.
- Open Net Initiative Assessment of Country-Level Internet Filtering
- Guardian Data Blog
- List of Guardian data resources through January 2011
- Harvard IQSS Data Repository
Data collection opportunities
These sources are not as immediately accessible as the repositories above; instead, they provide interfaces to potentially dynamic data. They can nonetheless be queried systematically to construct data sets of interest, which may require a combination of API lookups and crawling. For example, a call to a search API returns links to websites relevant to the topic under investigation, which can then be crawled and parsed to answer the question of interest.
- Bing Search API
- Google Insights for Search -- gives estimates of visits and ad prices for arbitrary search terms, plus lists of most popular terms across many categories
- Google Hot Trends
- DoubleClick Ad Planner -- Detailed demographic information on many websites
- Google Top Sites by Country
- Google Keyword Tool -- gives search popularity and ad price estimates for arbitrary search terms, plus suggests similar terms
- Weather Underground -- daily weather almanac history for many world cities, crawlable by inferring the URL structure
- Twitter Streaming API -- Near real-time searches on tweets filtered by keyword, user id, location, or random sampling
- Scribd online document repository -- Enables searching and browsing of documents uploaded to Scribd
- Docstoc online document repository -- Enables searching and browsing of documents uploaded to Docstoc, a lesser-known competitor to Scribd
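The search-then-crawl pattern described at the start of this section can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the API of any service listed here: the list of result URLs is assumed to have been obtained already from some search API, and the names LinkAndTextParser, fetch_html, and count_term_on_pages are hypothetical. Only the Python standard library is used.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, Iterable, List
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkAndTextParser(HTMLParser):
    """Collects outgoing links and text nodes from one HTML page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []
        self.text_parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        # Record absolute URLs for every <a href="..."> on the page.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        # Keep non-empty text nodes (this naive version also keeps
        # the contents of <script> and <style> elements).
        if data.strip():
            self.text_parts.append(data.strip())


def fetch_html(url: str) -> str:
    """Download one page; a real crawler should also rate-limit requests."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def count_term_on_pages(
    result_urls: Iterable[str],
    term: str,
    fetch: Callable[[str], str] = fetch_html,
) -> Dict[str, int]:
    """Crawl each URL returned by a search and count occurrences of `term`.

    `result_urls` is assumed to come from an earlier search API call.
    """
    counts: Dict[str, int] = {}
    for url in result_urls:
        parser = LinkAndTextParser(url)
        parser.feed(fetch(url))
        text = " ".join(parser.text_parts).lower()
        counts[url] = text.count(term.lower())
    return counts
```

Passing the download function as the fetch argument keeps the parsing logic testable without network access; for example, a dictionary of saved pages can stand in for the live web while the analysis is being developed.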
Exemplary papers and projects
- ProPublica Investigations -- Example studies include:
- Guardian UK Riot Investigation -- Including methodology and Twitter data
- National Obesity Comparison Tool
- Crowdsourcing black-market jobs -- Two studies of how freelance-labor websites have been extensively abused by spammers and other online criminals. The measurements were gathered by crawling the publicly available history of the website.
- Measuring commoditized malware -- this paper includes nice time-based measurements of the availability of collected data
- Measuring the prevalence of typosquatted websites
- Measuring the prevalence of illicit online pharmacies in web search results
- Measuring abuse of trending search terms
- Measuring speed of removal of phishing websites
- How digital distribution of television affects Internet piracy
- How Twitter was used during the Arab Spring uprisings in Egypt and Tunisia
- Tracking memes on Twitter
- Culturomics: tracking the prevalence of phrases in books over time
- Gapminder
Coding resources
Linux
Python
MongoDB
R
- Wellesley-only PDF copies of chapters from useful R books
- Beginner's guide to R -- Brief Introduction to Using R
- Official introduction to R -- I found this harder to follow than the Beginner's Guide to R, but it is shorter, so it may work better for some readers.
- UCLA Statistics Department R Resource Page -- Click on the R Help link in the upper left (no direct link available due to heavy-handed JavaScript)
LaTeX
- UCLA Statistics Department LaTeX Resource Page -- Click on the LaTeX Help link in the upper left (no direct link available due to heavy-handed JavaScript)
Writing resources
- Elements of Style -- This book offers pithy advice on how to write effectively. I especially encourage students to read Chapter III, Elementary Principles of Composition.