H0: Web Scraper
So-called Super PACs are an important political issue during this presidential election. These committees can accept unlimited financial contributions, provided that they operate independently of a candidate. However, the law still requires that the name, address, and amount of each large contribution be publicly disclosed. These disclosures are then made available online. For more information see this New York Times page. Your task is to write scripts that will gather the latest reports of contributions from the Federal Election Commission (FEC)'s website.
Tasks
Step 1: Parse a web page and download CSV files.
Download http://cs.wellesley.edu/~qtw/assign/h0/superpac.html. This file is a local copy of an HTML page produced by a query on the Federal Election Commission's website. Using BeautifulSoup as a parser, extract the links labeled "Download" and visit those pages programmatically using Python. Note that the URLs are relative, meaning the server portion of the URL (i.e., everything before the first '/') is missing. Consequently, you must construct the absolute URL by prepending http://query.nictusa.com to the relative URL. For example, if the relative URL is /cgi-bin/dcdev/forms/C00487744/, then the absolute URL that you should visit is http://query.nictusa.com/cgi-bin/dcdev/forms/C00487744/.
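One way to build the absolute URL (a minimal sketch; the helper name absolutize is illustrative, not part of the assignment) is with urllib.parse.urljoin from the Python 3 standard library, which handles site-relative paths correctly:

```python
from urllib.parse import urljoin

BASE = "http://query.nictusa.com"

def absolutize(relative_url):
    """Prepend the FEC query server's name to a site-relative URL."""
    return urljoin(BASE, relative_url)

print(absolutize("/cgi-bin/dcdev/forms/C00487744/"))
# http://query.nictusa.com/cgi-bin/dcdev/forms/C00487744/
```

Plain string concatenation also works here, since every relative URL begins with '/'.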
You will then have to extract the link to the CSV file from each page.
The CSV files should be stored in the directory ~/qtw/h0/data/. Each file should have the same name as the file being downloaded, with .csv appended to the end (e.g., 789234.fec.csv). In order to keep a record of where each file originated, you should create a file called ~/qtw/h0/data/file2url.csv with the complete file location and the URL where it was downloaded from, for example:

[tmoore@trout h0] head file2url.csv
/home/tmoore/qtw/h0/data/762710.fec.csv,http://query.nictusa.com/showcsv/nicweb127500/762710.fec
/home/tmoore/qtw/h0/data/762417.fec.csv,http://query.nictusa.com/showcsv/nicweb127615/762417.fec
/home/tmoore/qtw/h0/data/762107.fec.csv,http://query.nictusa.com/showcsv/nicweb127638/762107.fec

Finally, wait 15 seconds between queries in order to minimize load on the FEC server. Place all of the code to execute this step in a function called fetchSuperPAC.
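The download loop, the 15-second pause, and the file2url.csv record can be sketched as follows. This is only an outline, not the assignment's fetchSuperPAC: the function name fetch_reports and the injected download callable are illustrative, and the BeautifulSoup scraping that produces url_list is elided. Passing the downloader in as a parameter (e.g., a thin wrapper around urllib.request.urlretrieve) makes the loop easy to test without hitting the FEC server:

```python
import csv
import os
import time

def fetch_reports(url_list, data_dir, download, pause=15):
    """Save each FEC report URL into data_dir and record the
    file-to-URL mapping in data_dir/file2url.csv.

    url_list : CSV-file URLs already scraped from the Download pages
    download : callable(url, path) that saves url's content to path
    """
    rows = []
    for url in url_list:
        # e.g. .../762710.fec  ->  data_dir/762710.fec.csv
        path = os.path.join(data_dir, url.rsplit("/", 1)[-1] + ".csv")
        download(url, path)
        rows.append((path, url))
        time.sleep(pause)  # be polite to the server between queries
    with open(os.path.join(data_dir, "file2url.csv"), "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return rows
```

Your fetchSuperPAC would wrap a loop like this around the link extraction from Step 1.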
Step 2: Extract relevant information from the stored files
Build a dictionary called r2d that maps from the ID included in the file name (the FEC ID) to a list containing the committee number, the committee name, and a list of tuples, one per contribution, giving the contribution amount, date of contribution, state, 5-digit ZIP code, and whether the contribution came from an individual or an organization:

r2d[FEC ID] = [committee number, name, [(contribution, date, state, zip, individual or organization), ...]]
For example, here is the value of the dictionary for one Super PAC (and here is its HTML entry and corresponding CSV):
>>> r2d['763133']
['C00489799', 'Planned Parenthood Votes', [(1000000.0, '20111230', 'NY', '10019', 'IND'), (50000.0, '20110816', 'NY', '11021', 'IND'), (865.14, '20111230', 'NY', '10001', 'ORG')]]
Note that the CSV entries are surrounded by double-quote (") marks. This is done because some values can themselves contain commas; for instance, some of the committee names do. Consequently, you will need to do more than simply split each line on commas. Instead, you should use Python's csv package (in particular its reader method), which can deal with issues like these. See the documentation for guidance.
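To see why csv.reader is needed, here is a minimal example on a made-up record (the committee name and fields are invented for illustration). The quoted comma survives the parse, whereas a naive split(',') would break the name into two fields:

```python
import csv
import io

# A fabricated CSV line: the third field contains a comma inside quotes.
line = '"SA17","C00489799","Some Committee, Inc.","NY"'

fields = next(csv.reader(io.StringIO(line)))
print(fields[2])   # Some Committee, Inc.
print(len(fields)) # 4 -- naive line.split(',') would give 5 pieces
```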
Data is often poorly documented. It is frequently up to you to piece things together. The CSV file includes the complete report, not only the itemized contributions. It is up to you to extract the name and committee number from the summary entry, and then extract only the contributions (i.e., those listed in Schedule A). Here is a screenshot of the HTML version of one Super PAC, with the Schedule A boxed:
I suggest you look at a few example HTML filings and compare them to the CSV versions. This way you can identify the records that correspond to contributions. Here's a hint: look at the codes in the first field. Identify the pattern that indicates that the record is a contribution, not an expense or summary. Do some testing to justify your selection, and indicate the rule you will use to identify just the contributions.
Here is a complete list of codes used in the CSV files:
F3XT SA11AI SC/10 SB21b HDR SD10 SA11C TEXT SB28a SA16 SB21B SB23 SB29 SB28C SA11B SA17 F3XA SA15 F3XN SA13 SE
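One way to do the suggested testing (this is an aid, not a required part of the assignment) is to tally how often each code appears in a file and compare the counts against the itemized contributions you see in the corresponding HTML filing. A sketch using collections.Counter, on toy rows invented for illustration:

```python
from collections import Counter

def code_counts(rows):
    """Tally the record-type code (first field) across all parsed rows."""
    return Counter(row[0] for row in rows if row)

# Toy data: made-up rows using a few of the codes listed above.
rows = [["HDR", "..."], ["F3XN", "..."],
        ["SA11AI", "..."], ["SA11AI", "..."], ["SB23", "..."]]

print(code_counts(rows))  # SA11AI appears twice, the rest once each
```

If a code's count matches the number of itemized contributions in the HTML filing, that is good evidence for your rule.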
Store your code in a function called parseReports
.
Step 3. Perform calculations on the data.
Now that you have the dictionary, write a few lines of code that summarize the data. Print the results to a file called FECSummaryStatsPAC.txt located in the h0/data directory. Include the code in a function called computeStatsPAC().
- Report the total number of Super PACs.
- Report the number of Super PACs that received no contributions during the period.
- Report the names of committees with at least 50 itemized contributions (compute using a list comprehension).
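These statistics reduce to short expressions over r2d. A sketch on a toy two-entry dictionary (both Super PACs are fabricated; your computeStatsPAC would run the same expressions on the real r2d and write the results to the file):

```python
# Toy r2d: one fake Super PAC with 60 contributions, one with none.
r2d = {
    "111111": ["C00000001", "PAC With Donors",
               [(500.0, "20111101", "MA", "02481", "IND")] * 60],
    "222222": ["C00000002", "PAC Without Donors", []],
}

total_pacs = len(r2d)
no_contrib = len([fec for fec, v in r2d.items() if len(v[2]) == 0])
big_names  = [v[1] for v in r2d.values() if len(v[2]) >= 50]

print(total_pacs, no_contrib, big_names)
# 2 1 ['PAC With Donors']
```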
Step 4. Create a derivative CSV file
Using the r2d dictionary, create a file named h0/data/allcont.csv listing all individual contributions to Super PACs in the following form (with example entries):

FEC ID,Committee Number,Committee Name,Contribution Amount,Date,State,ZIP code,Indiv or Org
763133,C00489799,Planned Parenthood Votes,1000000.0,20111230,NY,10019,IND
763133,C00489799,Planned Parenthood Votes,50000.0,20110816,NY,11021,IND
Because committee names can include commas, be sure to use the writer function in the csv package. You should wrap every non-numeric field in double quotes.
You should also create a list of tuples called allcont of the form:

allcont=[("763133","C00489799","Planned Parenthood Votes",1000000.0,"20111230","NY",10019,"IND"),...]
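The quoting requirement is exactly what csv.writer's QUOTE_NONNUMERIC mode does: string fields are wrapped in double quotes while numeric fields are left bare. A minimal sketch, writing one of the example rows to an in-memory buffer rather than to allcont.csv:

```python
import csv
import io

buf = io.StringIO()
w = csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC)
w.writerow(["763133", "C00489799", "Planned Parenthood Votes",
            1000000.0, "20111230", "NY", "10019", "IND"])
print(buf.getvalue())
# Every string field is quoted; the float 1000000.0 is left unquoted.
```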
Step 5. Perform calculations on the data
Using the allcont list, write a few lines of code that summarize the data. Print the results to a file called FECSummaryStatsAll.txt located in the h0/data directory. Include the code in a function called computeStatsAll().
- Report the total number of contributions.
- Report the total number of contributions of at least $1 million (calculate using a list comprehension).
- Report the single largest contribution (hint: use the max() function and a list comprehension).
- Report the total dollar amount of contributions (hint: use the sum() function and a list comprehension).
- Report the fraction of contributions that exceed $100K (first as a fraction of the number of contributions, second as a fraction of the total dollar amount of contributions).
- Report the total number and dollar amount of contributions from Massachusetts (use 2 list comprehensions).
- Perform the same calculation for Massachusetts using the r2d dictionary you built earlier. Think about which data structure made the calculation easier. Briefly explain (in no more than 2-3 sentences) which types of questions are easier to answer using the allcont list, and which are easier to answer using the r2d dictionary.
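The flat allcont list makes these aggregates one-liners. A sketch on a three-tuple toy list (the third tuple is fabricated; the first two follow the earlier example). Your computeStatsAll would evaluate the same expressions on the real list:

```python
# Toy allcont: two entries from the earlier example plus one invented MA row.
allcont = [
    ("763133", "C00489799", "Planned Parenthood Votes",
     1000000.0, "20111230", "NY", 10019, "IND"),
    ("763133", "C00489799", "Planned Parenthood Votes",
     50000.0, "20110816", "NY", 11021, "IND"),
    ("111111", "C00000001", "Fake PAC",
     250000.0, "20111201", "MA", 2481, "IND"),
]

n_total   = len(allcont)                                    # all contributions
n_million = len([c for c in allcont if c[3] >= 1_000_000])  # >= $1M
largest   = max([c[3] for c in allcont])                    # single largest
total_usd = sum([c[3] for c in allcont])                    # grand total
ma_count  = len([c for c in allcont if c[5] == "MA"])       # MA, by count
ma_total  = sum([c[3] for c in allcont if c[5] == "MA"])    # MA, by dollars
```

The r2d version of the Massachusetts calculation needs a nested loop over every committee's contribution list, which is why per-contribution questions are more natural on allcont.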
What to turn in
Include all code in a Python file named h0.py. Create a directory under qtw called h0, along with subdirectories labeled code and data. Place h0.py in the code directory. Place the downloaded files in the data directory, along with the files FECSummaryStatsAll.txt and FECSummaryStatsPAC.txt.
Your code should be well documented. Additionally, you should generate HTML documentation using the pydoc command with the -w parameter set.
I will be able to read any files or directories placed in the qtw directory. This means there is nothing you need to do to submit the homework beyond putting it in the right place.
Please do not modify h0.py after the submission deadline; I will check the modification timestamp to verify an on-time submission. Also, be sure to put your name at the top of h0.py. If you work with a partner, submit only one version of the code and data, with both names on the code.