H0: Web Scraper
So-called Super PACs are an important political issue during this presidential election. These committees can accept unlimited financial contributions, provided that they operate independently of a candidate. However, the law still requires that the name, address, and amount of each large contribution be publicly disclosed. These disclosures are then made available online. For more information see this New York Times page. Your task is to write scripts that will gather the latest reports of contributions from the Federal Election Commission (FEC)'s website.
Tasks
Step 1: Parse a web page and download CSV files.
Download http://cs.wellesley.edu/~qtw/assign/h0/superpac.html. This file is a local copy of an HTML page produced by a query on the Federal Election Commission's website. Using BeautifulSoup as a parser, extract the links labeled "Download" and visit those pages programmatically using Python. Note that the URLs are relative, meaning the server portion of the URL (i.e., everything before the first '/') is missing. Consequently, you must construct the absolute URL by prepending http://query.nictusa.com to the relative URL. For example, if the relative URL is /cgi-bin/dcdev/forms/C00487744/, then the absolute URL that you should visit is http://query.nictusa.com/cgi-bin/dcdev/forms/C00487744/.
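One way to build the absolute URL (a minimal sketch; the helper name absolutize is illustrative, not part of the assignment) is with urllib.parse.urljoin from the Python 3 standard library, which handles site-relative paths correctly:

```python
from urllib.parse import urljoin

BASE = "http://query.nictusa.com"

def absolutize(relative_url):
    """Prepend the FEC query server's name to a site-relative URL."""
    return urljoin(BASE, relative_url)

print(absolutize("/cgi-bin/dcdev/forms/C00487744/"))
# http://query.nictusa.com/cgi-bin/dcdev/forms/C00487744/
```

Plain string concatenation also works here, since every relative URL begins with '/'.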
You will then have to extract the link to the CSV file from each page.
The CSV files should be stored in the directory ~/qtw/h0/data/. Each file should have the same name as the file being downloaded, with .csv appended to the end (e.g., 789234.fec.csv). In order to keep a record of where each file originated, you should create a file called ~/qtw/h0/data/file2url.csv with the complete file location and the URL where it was downloaded from, for example:

[tmoore@trout h0] head file2url.csv
/home/tmoore/qtw/h0/data/762710.fec.csv,http://query.nictusa.com/showcsv/nicweb127500/762710.fec
/home/tmoore/qtw/h0/data/762417.fec.csv,http://query.nictusa.com/showcsv/nicweb127615/762417.fec
/home/tmoore/qtw/h0/data/762107.fec.csv,http://query.nictusa.com/showcsv/nicweb127638/762107.fec

Finally, wait 15 seconds between queries in order to minimize load on the FEC server. Place all of the code to execute this step in a function called fetchSuperPAC.
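The download loop, the 15-second pause, and the file2url.csv record can be sketched as follows. This is only an outline, not the assignment's fetchSuperPAC: the function name fetch_reports and the injected download callable are illustrative, and the BeautifulSoup scraping that produces url_list is elided. Passing the downloader in as a parameter (e.g., a thin wrapper around urllib.request.urlretrieve) makes the loop easy to test without hitting the FEC server:

```python
import csv
import os
import time

def fetch_reports(url_list, data_dir, download, pause=15):
    """Save each FEC report URL into data_dir and record the
    file-to-URL mapping in data_dir/file2url.csv.

    url_list : CSV-file URLs already scraped from the Download pages
    download : callable(url, path) that saves url's content to path
    """
    rows = []
    for url in url_list:
        # e.g. .../762710.fec  ->  data_dir/762710.fec.csv
        path = os.path.join(data_dir, url.rsplit("/", 1)[-1] + ".csv")
        download(url, path)
        rows.append((path, url))
        time.sleep(pause)  # be polite to the server between queries
    with open(os.path.join(data_dir, "file2url.csv"), "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return rows
```

Your fetchSuperPAC would wrap a loop like this around the link extraction from Step 1.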
Step 2: Extract relevant information from the stored files
Build a dictionary called r2d that maps from the ID included in the file name (the FEC ID) to a list containing the committee number, the committee name, and a list of tuples, one per contribution, giving the contribution amount, date of contribution, state, 5-digit ZIP code, and whether the contribution came from an individual or an organization:

r2d[FEC ID] = [committee number, name, [(contribution, date, state, zip, individual or organization), ...]]
For example, here is the value of the dictionary for one Super PAC (and here is its HTML entry and corresponding CSV):
>>> r2d['763133']
['C00489799', 'Planned Parenthood Votes', [(1000000.0, '20111230', 'NY', '10019', 'IND'), (50000.0, '20110816', 'NY', '11021', 'IND'), (865.14, '20111230', 'NY', '10001', 'ORG')]]
Note that the CSV entries are surrounded by double-quote (") marks. This is done because some values can themselves contain commas; for instance, some of the committee names do. Consequently, you will need to do more than simply split each line on commas. Instead, you should use Python's csv package (in particular its reader method), which can deal with issues like these. See the documentation for guidance.
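To see why csv.reader is needed, here is a minimal example on a made-up record (the committee name and fields are invented for illustration). The quoted comma survives the parse, whereas a naive split(',') would break the name into two fields:

```python
import csv
import io

# A fabricated CSV line: the third field contains a comma inside quotes.
line = '"SA17","C00489799","Some Committee, Inc.","NY"'

fields = next(csv.reader(io.StringIO(line)))
print(fields[2])   # Some Committee, Inc.
print(len(fields)) # 4 -- naive line.split(',') would give 5 pieces
```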
Data is often poorly documented. It is frequently up to you to piece things together. The CSV file includes the complete report, not only the itemized contributions. It is up to you to extract the name and committee number from the summary entry, and then extract only the contributions (i.e., those listed in Schedule A). Here is a screenshot of the HTML version of one Super PAC, with the Schedule A boxed:
I suggest you look at a few example HTML filings and compare them to the CSV versions. This way you can identify the records that correspond to contributions. Here's a hint: look at the codes in the first field. Identify the pattern that indicates that the record is a contribution, not an expense or summary. Do some testing to justify your selection, and indicate the rule you will use to identify just the contributions.
Here is a complete list of codes used in the CSV files:
F3XT SA11AI SC/10 SB21b HDR SD10 SA11C TEXT SB28a SA16 SB21B SB23 SB29 SB28C SA11B SA17 F3XA SA15 F3XN SA13 SE
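One way to do the suggested testing (this is an aid, not a required part of the assignment) is to tally how often each code appears in a file and compare the counts against the itemized contributions you see in the corresponding HTML filing. A sketch using collections.Counter, on toy rows invented for illustration:

```python
from collections import Counter

def code_counts(rows):
    """Tally the record-type code (first field) across all parsed rows."""
    return Counter(row[0] for row in rows if row)

# Toy data: made-up rows using a few of the codes listed above.
rows = [["HDR", "..."], ["F3XN", "..."],
        ["SA11AI", "..."], ["SA11AI", "..."], ["SB23", "..."]]

print(code_counts(rows))  # SA11AI appears twice, the rest once each
```

If a code's count matches the number of itemized contributions in the HTML filing, that is good evidence for your rule.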
Store your code in a function called parseReports
.
Step 3. Perform calculations on the data.
Now that you have the dictionary, write a few lines of code that summarize the data. Print the results to a file called FECSummaryStatsPAC.txt located in the h0/data directory. Include the code in a function called computeStatsPAC().
- Report the total number of Super PACs.
- Report the number of Super PACs that received no contributions during the period.
- Report the names of committees with at least 50 itemized contributions (compute using a list comprehension).
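These statistics reduce to short expressions over r2d. A sketch on a toy two-entry dictionary (both Super PACs are fabricated; your computeStatsPAC would run the same expressions on the real r2d and write the results to the file):

```python
# Toy r2d: one fake Super PAC with 60 contributions, one with none.
r2d = {
    "111111": ["C00000001", "PAC With Donors",
               [(500.0, "20111101", "MA", "02481", "IND")] * 60],
    "222222": ["C00000002", "PAC Without Donors", []],
}

total_pacs = len(r2d)
no_contrib = len([fec for fec, v in r2d.items() if len(v[2]) == 0])
big_names  = [v[1] for v in r2d.values() if len(v[2]) >= 50]

print(total_pacs, no_contrib, big_names)
# 2 1 ['PAC With Donors']
```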
Step 4. Create a derivative CSV file
Using the r2d dictionary, create a file named h0/data/allcont.csv listing all individual contributions to Super PACs in the following form (with example entries):

FEC ID,Committee Number,Committee Name,Contribution Amount,Date,State,ZIP code,Indiv or Org
763133,C00489799,Planned Parenthood Votes,1000000.0,20111230,NY,10019,IND
763133,C00489799,Planned Parenthood Votes,50000.0,20110816,NY,11021,IND
Because committee names can include commas, be sure to use the writer function in the csv package. You should wrap every non-numeric field in double quotes.
You should also create a list of tuples called allcont of the form:

allcont=[("763133","C00489799","Planned Parenthood Votes",1000000.0,"20111230","NY",10019,"IND"),...]
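The quoting requirement is exactly what csv.writer's QUOTE_NONNUMERIC mode does: string fields are wrapped in double quotes while numeric fields are left bare. A minimal sketch, writing one of the example rows to an in-memory buffer rather than to allcont.csv:

```python
import csv
import io

buf = io.StringIO()
w = csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC)
w.writerow(["763133", "C00489799", "Planned Parenthood Votes",
            1000000.0, "20111230", "NY", "10019", "IND"])
print(buf.getvalue())
# Every string field is quoted; the float 1000000.0 is left unquoted.
```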
Step 5. Perform calculations on the data
Using the allcont list, write a few lines of code that summarize the data. Print the results to a file called FECSummaryStatsAll.txt located in the h0/data directory. Include the code in a function called computeStatsAll().
- Report the total number of contributions.
- Report the total number of contributions of at least $1 million (calculate using a list comprehension).
- Report the single largest contribution (hint: use the max() function and a list comprehension).
- Report the total dollar amount of contributions (hint: use the sum() function and a list comprehension).
- Report the fraction of contributions that exceed $100K (first as a fraction of the number of contributions, second as a fraction of the total dollar amount of contributions).
- Report the total number and dollar amount of contributions from Massachusetts (use 2 list comprehensions).
- Perform the same calculation for Massachusetts using the r2d dictionary you built earlier. Think about which data structure made the calculation easier. Briefly explain (in no more than 2-3 sentences) which types of questions are easier to answer using the allcont list, and which are easier to answer using the r2d dictionary.
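The flat allcont list makes these aggregates one-liners. A sketch on a three-tuple toy list (the third tuple is fabricated; the first two follow the earlier example). Your computeStatsAll would evaluate the same expressions on the real list:

```python
# Toy allcont: two entries from the earlier example plus one invented MA row.
allcont = [
    ("763133", "C00489799", "Planned Parenthood Votes",
     1000000.0, "20111230", "NY", 10019, "IND"),
    ("763133", "C00489799", "Planned Parenthood Votes",
     50000.0, "20110816", "NY", 11021, "IND"),
    ("111111", "C00000001", "Fake PAC",
     250000.0, "20111201", "MA", 2481, "IND"),
]

n_total   = len(allcont)                                    # all contributions
n_million = len([c for c in allcont if c[3] >= 1_000_000])  # >= $1M
largest   = max([c[3] for c in allcont])                    # single largest
total_usd = sum([c[3] for c in allcont])                    # grand total
ma_count  = len([c for c in allcont if c[5] == "MA"])       # MA, by count
ma_total  = sum([c[3] for c in allcont if c[5] == "MA"])    # MA, by dollars
```

The r2d version of the Massachusetts calculation needs a nested loop over every committee's contribution list, which is why per-contribution questions are more natural on allcont.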
What to turn in
Include all code in a Python file named h0.py. Create a directory under qtw called h0, along with subdirectories labeled code and data. Place h0.py in the code directory. Place the downloaded files in the data directory, along with the files FECSummaryStatsAll.txt and FECSummaryStatsPAC.txt.
Your code should be well documented. Additionally, you should generate HTML documentation using the pydoc command with the -w parameter set.
I will be able to read any files or directories placed in the qtw directory. This means there is nothing you need to do to submit the homework beyond putting it in the right place.
Please do not modify h0.py after the submission deadline; I will check the modification timestamp to verify an on-time submission. Also, be sure to put your name at the top of h0.py. If you work with a partner, submit only one version of the code and data, with both names on the code.