P0: Project topic proposal
The purpose of P0 is two-fold. First, it should identify the intended project topic and explain why it is interesting to you and should be of interest to others. Second, it should set out the blueprint for the data-collection phase (P1) and data-analysis phase (P2).
Selecting a topic
The first task is to identify a topic worthy of a semester-long investigation. See the resources page for links to blogs on data analysis and websites where data is available. Also included are links to exemplary data projects on a variety of applications. If you are stumped, see the list of topic ideas at the bottom of this page.
Requirements for topic
If you are working with a "traditional" data source (e.g., indicators of literacy rates and economic development collected from the World Bank), then you must complement the study with data collected from a "new" source that supplies an additional explanatory variable.
If you are collecting data from a "new" source, then collecting data from another source to link with it is still strongly encouraged (see me if you think this will be infeasible or will not make sense).
Describing the topic
You should first succinctly describe the topic of interest. List several questions that you would like to answer about the topic (list even those questions you anticipate will be hard to answer).
What will the response variable(s) be? What about explanatory variables?
Do you have any hypotheses about the relationship between response variables and explanatory variables? For example, do you hypothesize that the response variable will be positively correlated with explanatory variables?
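A hypothesis of positive correlation can be checked later with a correlation coefficient. The following is a minimal sketch, assuming Pearson's r is an appropriate measure for your variables; the function name, variable names, and figures are all made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)

# Made-up figures: a hypothetical explanatory variable (comments on a review)
# and a hypothetical response variable (weekly sales, in thousands).
comments = [12, 30, 45, 60, 85]
sales = [1.0, 2.1, 2.9, 4.2, 5.1]

r = pearson_r(comments, sales)
print(f"r = {r:.3f}")  # a value near +1 would be consistent with the hypothesis
```

A value of r close to +1 or -1 indicates a strong linear relationship; it does not by itself show that the explanatory variable causes the response.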
Data collection plan
From where will you collect the data? For each data resource, explain the following:
- Name and URL of resource
- Category of resource: web scraping, API access, manually-downloaded file, etc.
- Anticipated variables of interest (identify which are response variables and which are explanatory variables, and whether each is numerical or categorical).
- Data collection frequency: one-off or repeated over a specific time period.
- Brief explanation of compliance with terms of service.
- Method of data storage: flat files, pickle files, or MongoDB database
- If the data is to be joined up with other resources, list which ones and explain any anticipated issues in connecting the datasets.
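A common issue when connecting datasets is that the shared key is written differently in each source. The sketch below illustrates one way to surface such mismatches early, assuming each source has been loaded as a list of dictionaries; the field names, helper functions, and figures are hypothetical.

```python
def normalize_key(name):
    """Lowercase and collapse whitespace so 'brazil ' matches 'Brazil'."""
    return " ".join(name.lower().split())

def join_on_key(left_rows, right_rows, key):
    """Inner-join two lists of dicts on `key`.

    Returns (joined rows, unmatched keys from left, unmatched keys from right).
    """
    right_index = {normalize_key(row[key]): row for row in right_rows}
    joined, left_only, matched = [], [], set()
    for row in left_rows:
        k = normalize_key(row[key])
        if k in right_index:
            # On conflicting field names, the left row's values are kept.
            joined.append({**right_index[k], **row})
            matched.add(k)
        else:
            left_only.append(row[key])
    right_only = [row[key] for row in right_rows
                  if normalize_key(row[key]) not in matched]
    return joined, left_only, right_only

# Made-up figures for a "traditional" source and a "new" source.
traditional = [
    {"country": "Brazil", "literacy_rate": 90.0},
    {"country": "South Korea", "literacy_rate": 98.0},
    {"country": "Kenya", "literacy_rate": 80.0},
]
new_source = [
    {"country": "brazil ", "mentions": 120},
    {"country": "Korea, Rep.", "mentions": 75},
    {"country": "Kenya", "mentions": 40},
]

joined, left_only, right_only = join_on_key(traditional, new_source, "country")
print(joined)
print("Unmatched in traditional source:", left_only)
print("Unmatched in new source:", right_only)
```

Here simple normalization is enough to match the two spellings of Brazil, but "South Korea" and "Korea, Rep." would need a manual mapping. Listing unmatched keys like this is the kind of anticipated issue worth describing in the plan.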
Creating derivative data for data analysis
After collecting the raw data sources, you will likely need to create derivative measures for the response and explanatory variables. Explain what these derivative measures will be. If you are joining up multiple sources, then indicate which values are to be included. In P1 you will eventually create a CSV file with fields for each variable of interest. Here you should list the field names you anticipate including, along with a representative example of what the data might look like. You can use made-up figures for the example.
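To illustrate what such a CSV might look like, here is a minimal sketch using Python's standard csv module. The field names and rows are hypothetical, made-up figures for a book-bestseller topic, not a required layout.

```python
import csv
import io

# Hypothetical field names for the derived dataset.
FIELDNAMES = ["title", "amazon_rank", "nyt_rank", "review_comments"]

# Made-up example rows; an empty string marks a missing value.
rows = [
    {"title": "Example Book A", "amazon_rank": 1, "nyt_rank": 3, "review_comments": 42},
    {"title": "Example Book B", "amazon_rank": 2, "nyt_rank": "", "review_comments": 0},
]

def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = rows_to_csv(rows, FIELDNAMES)
print(csv_text)
```

Writing to a file instead of a string only requires replacing the buffer with a file opened via open(path, "w", newline="").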
Dividing responsibility
This project is intended to be a team effort. I strongly recommend that you appoint one person to be the leader on each of the major tasks outlined in the data collection plan and the creation of derivative data. You should each contribute to the tasks, but it will help organize the project if each team member has primary responsibility for different aspects.
What to turn in
Turn in a document in PDF format describing each of the tasks as requested above. Turn in a single document for your team. Include the names of the team members, along with a team user name (one word, up to 8 characters), at the top of the document. Include four sections:
- Topic description
- Data collection plan
- Derivative data for analysis
- Assignment of leaders to tasks
Please try to make the document as concise as possible while still conveying the key points requested. Email the completed document to qtw@cs.wellesley.edu with subject P0.
Also, please sign up for a time for your team to meet with me to discuss your project proposal P0 after you have turned it in. This will be a great opportunity for me to give feedback and to answer any questions you have before beginning to work on P1. Here is the Google Doc where you can add in the time your team can meet with me.
You will have the opportunity to revise your proposal following our meeting.
Ideas for potential topics
You are encouraged to come up with your own ideas for topics, but here are some ideas of mine that may be of interest.
- Investigating the prevalence of websites that link to copyright violations on web search
  - SOPA has been in the news lately.
  - You could scrape the list of top-selling MP3s or movies from Amazon, then issue search queries for the names and collect information on the linked websites to see which ones point to websites that likely link to infringements.
- Chilling Effects
  - Link to popular media searches to compare comprehensiveness of take-down requests and delays
  - See blog for example analysis and questions considered by the founders
- Compare trending terms on Twitter to word frequencies on Media Cloud.
  - Compare the types of phrases that are popular in left vs. right political blogs, and in mainstream media and popular blogs.
  - The Twitter data goes back up to one month, and the Media Cloud data goes back further. One could compare the timeliness of word frequencies on Twitter versus the different groups on Media Cloud. Which words appear as trending on Twitter before appearing in other sources, and vice versa?
- Compare the Amazon and New York Times Lists of Book Bestsellers
  - Identify books that have the biggest divergence across lists
  - Compare ebook sales on NYTimes and Amazon. Do the differences suggest what Nook readers might prefer?
  - Match up to the corresponding New York Times Book review, if one exists. Look at the article's comments. Is the number of comments on an article correlated with the sales rank?
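As an illustration of the bestseller idea, the sketch below shows one possible way to measure divergence across two ranked lists: the absolute difference in rank for titles that appear on both. The function name and the titles are made up, and other divergence measures may suit your questions better.

```python
def rank_divergence(list_a, list_b):
    """Return (title, rank_a, rank_b, gap) for titles on both lists, largest gap first.

    Each input is a list of titles ordered from rank 1 downward.
    Ties in the gap are broken alphabetically by title.
    """
    rank_a = {title: i for i, title in enumerate(list_a, start=1)}
    rank_b = {title: i for i, title in enumerate(list_b, start=1)}
    shared = rank_a.keys() & rank_b.keys()
    rows = [(t, rank_a[t], rank_b[t], abs(rank_a[t] - rank_b[t])) for t in shared]
    return sorted(rows, key=lambda row: (-row[3], row[0]))

# Made-up lists standing in for the two bestseller lists.
amazon = ["Book A", "Book B", "Book C", "Book D"]
nyt = ["Book C", "Book A", "Book E", "Book B"]

result = rank_divergence(amazon, nyt)
for title, a, b, gap in result:
    print(f"{title}: Amazon #{a}, NYT #{b}, gap {gap}")
```

Titles that appear on only one list (here "Book D" and "Book E") are excluded from this comparison and could be reported separately as another form of divergence.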