Data Wrangling

Presenter Notes

First, a review from last time

There are many ways to run a Python script: http://cs.wellesley.edu/~qtw/lectures/python.html#section_12

Presenter Notes

Overview of Data Wrangling

  1. Find data sources of interest
  2. Write programs to fetch the resources (may need HTML parsing to find links of interest)
  3. If the files are HTML, parse them to extract the information of value (steps 2 and 3 are sketched below)
  4. Clean up the collected data and test it for consistency
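
A minimal sketch of steps 2 and 3, using the demo page that appears later in this lecture (the rest of the lecture fills in the details):

import urllib2
from BeautifulSoup import BeautifulSoup

#step 2: fetch the resource; step 3: parse the HTML to pull out a piece of information
html = urllib2.urlopen("http://cs.wellesley.edu/~qtw/data/htmldemo.html").read()
soup = BeautifulSoup(html)
print soup.title.text   #prints: Example HTML file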

Presenter Notes

HTML

HTML consists of tags, some of which have attribute-value pairs:

 1 <html>
 2 <head>
 3     <title>Example HTML file</title>
 4 </head>
 5 <body>
 6     <h1>Important things</h1>
 7     <h2>Summary</h2>
 8     <p>Here is an example <a href="http://www.w3schools.com/html/">HTML</a> file.</p>
 9     <h2 class="section">Motivation</h2>
10     <div id="motives">
11         <p>I'm not sure why you'd want to download it. Here are some possible reasons:</p>
12         <ol>
13             <li>It contains very important information.</li>
14             <li>You are the <a href="http://www.google.com/">Google bot</a>, and you'll visit <em>anything</em>.</li>
15             <li>It's the example we are using in class.</li>
16         </ol>
17     </div>

Presenter Notes

Firebug

Presenter Notes

Beautiful Soup
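
Beautiful Soup parses an HTML document into a tree of tag objects that you can search and navigate. A minimal sketch, assuming the BeautifulSoup 3 package used throughout these slides:

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("<p>Hello <a href='http://example.com/'>world</a></p>")
>>> soup.a["href"]
u'http://example.com/'
>>> soup.p.text
u'Hello world'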

Presenter Notes

findAll() function

Let's pull the file and extract the headings:

1 >>> import urllib2
2 >>> from BeautifulSoup import BeautifulSoup
3 >>> soup = BeautifulSoup(urllib2.urlopen("http://cs.wellesley.edu/~qtw/data/htmldemo.html").read())
4 >>> [x.text for x in soup.findAll("h2")]
5 [u'Summary', u'Motivation', u'Results', u'Conclusion']
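
The related find() method, which we will use later in this lecture, returns only the first matching tag (or None if there is no match) rather than a list:

>>> soup.find("h2")     #just the first match, not a list
<h2>Summary</h2>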

Presenter Notes

Attribute-value pairs can be accessed like a dictionary

Here's the code:

1 >>> anchors=soup.findAll("a")
2 >>> len(anchors)
3 2
4 >>> anchors[0]          #here's the first link
5 <a href="http://www.w3schools.com/html/">HTML</a>
6 >>> anchors[0]["href"]  #here's how you get the URL out of an anchor tag
7 u'http://www.w3schools.com/html/'
8 >>> anchors[0].text     #and here's the text inside the tag
9 u'HTML'
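
Indexing a tag for an attribute it doesn't have raises a KeyError. If an attribute might be missing, BeautifulSoup 3 tags also support a dictionary-style get(), which returns None instead (shown here on the same anchor as above):

>>> anchors[0].get("href")          #same result as anchors[0]["href"] when the attribute exists
u'http://www.w3schools.com/html/'
>>> print anchors[0].get("title")   #this anchor has no title attribute, so get() returns None
None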

Presenter Notes

You can even select tags based on attribute-value pairs

1 >>> sectionheaders=soup.findAll("h2",attrs={"class":"section"})
2 >>> sectionheaders
3 [<h2 class="section">Motivation</h2>, <h2 class="section">Results</h2>, <h2 class="section">Conclusion</h2>]

Presenter Notes

Exercise

  1. Can you get the div with id motives and print it out?
  2. Can you get the text out of the list items inside the div?

Presenter Notes

Solution to Exercise

Here's what you do:

 1 >>> mtag=soup.find("div",attrs={"id":"motives"})
 2 >>> mtag
 3 <div id="motives">
 4 <p>I'm not sure why you'd want to download it. Here are some possible reasons:</p>
 5 <ol>
 6 <li>It contains very important information.</li>
 7 <li>You are the <a href="http://www.google.com/">Google bot</a>, and you'll visit <em>anything</em>.</li>
 8 <li>It's the example we are using in class.</li>
 9 </ol>
10 </div>
11 >>> lisText=[t.text for t in mtag.findAll("li")]
12 >>> lisText
13 [u'It contains very important information.', u"You are theGoogle bot, and you'll visitanything.", u"It's the example we are using in class."]
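
(The missing spaces in "theGoogle bot" and "visitanything" are not typos in the file: BeautifulSoup 3's .text property joins the text of everything inside a tag with no separator, so text on either side of a nested tag like <a> or <em> runs together.)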

Presenter Notes

Scraping Strategy (Fetching Multiple URLs)

  • Scraping Case 1: the content is encoded in the HTML itself, so it is up to you to dig it out.
  • Scraping Case 2: the content is spread across many files, so collecting them should be automated by following the links on a web page (see the sketch below).
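
For Case 2, a minimal sketch of automating the collection by following links (the index URL and file-naming scheme below are placeholders for illustration, not real course resources):

import time, urllib2, urlparse
from BeautifulSoup import BeautifulSoup

indexurl = 'http://www.example.com/reports/index.html'    #placeholder index page
soup = BeautifulSoup(urllib2.urlopen(indexurl).read())
for i, a in enumerate(soup.findAll('a')):
    href = a.get('href')
    if href is None:
        continue
    fileurl = urlparse.urljoin(indexurl, href)            #resolve relative links
    localname = 'report-%03d.html' % i                    #rename local copies for uniqueness
    f = open(localname, 'w')
    f.write(urllib2.urlopen(fileurl).read())
    f.close()
    time.sleep(10)                                        #be polite to the server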

Presenter Notes

Scraping Example: Weather Underground

Boston Weather

Presenter Notes

Goal: collect a month's worth of history and extract high/low temps

Download a copy of the code:

[tmoore@trout ~] pwd
/home/tmoore
[tmoore@trout ~] cd cs349b/code
[tmoore@trout code] mkdir weather
[tmoore@trout code] cd weather
[tmoore@trout weather] wget http://cs.wellesley.edu/~qtw/code/weather/weather.py
--2012-02-09 07:37:38--  http://cs.wellesley.edu/~qtw/code/weather/weather.py
Resolving cs.wellesley.edu... 149.130.136.19
Connecting to cs.wellesley.edu|149.130.136.19|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4837 (4.7K) [text/plain]
Saving to: weather.py

100%[==========================================================================>] 4,837       --.-K/s   in 0s

2012-02-09 07:37:38 (195 MB/s) - weather.py
[tmoore@trout weather] python
Python 2.7 (r27:82500, Sep 16 2010, 18:03:06)
[GCC 4.5.1 20100907 (Red Hat 4.5.1-3)] on linux2
>>> import weather

Presenter Notes

Fetching the files

Here is code to fetch the weather for multiple days:

 1 def fetchWeather(month,day,year,numdays=0,filedir='/home/qtw/public_html/data/weather/'):
 2     """
 3     This function fetches the historical weather for Boston on the requested day and stores it in a local file
 4     arguments:
 5         month: number from 1-12
 6         day: number from 1-31
 7         year: 4-digit year
 8         numdays: number of days of history to fetch, counting back from (and including) the
 9             requested date; the default of 0 fetches nothing, so pass at least 1
10         filedir: location of the directory where weather data files should be stored.
11             Default value is /home/qtw/public_html/data/weather/, but that only
12             works for the user qtw
13     """
14     import datetime, os, time, urllib2   #modules used here and in the rest of the function
15     startday=datetime.datetime(year,month,day)
16     #create the data directory (taken from the filedir argument) if it doesn't exist yet
17     if not os.path.isdir(filedir):
18         os.mkdir(filedir)
19     ...

Presenter Notes

Fetching the files

Here is the rest of the code to fetch the weather for multiple days (part 2):

 1 #finishing off the fetchWeather function
 2 for daydiff in range(numdays):
 3     day2check=startday-datetime.timedelta(daydiff)
 4     url2check='http://www.wunderground.com/history/airport/KBOS/%i/%i/%i/DailyHistory.html'%(day2check.year,day2check.month,day2check.day)
 5     #note that we must rename our local copy, because every page on the site is named DailyHistory.html
 6     urlfileloc=os.path.join(filedir,'weather-%s.html' % day2check.strftime("%Y-%m-%d"))
 7     #first check whether the file already exists (i.e., has already been downloaded)
 8     if os.path.isfile(urlfileloc):
 9         print 'file already exists at %s, so skipping' % urlfileloc
10         continue
11     #store a local copy of the file (we only reach this point if it hasn't been stored yet)
12     htmls = urllib2.urlopen(url2check).read()
13     f=open(urlfileloc,'w')
14     f.write(htmls)
15     f.close()
16     #keep a record of the original URL and the corresponding file name
17     f=open(os.path.join(filedir,'file2url.csv'),'a')
18     f.write('%s,%s\n'%(urlfileloc,url2check))
19     f.close()
20     #wait a few seconds before checking the previous day, out of courtesy
21     time.sleep(10)
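
A possible way to call it from the interpreter (the filedir value here is just an example; the default only works for the user qtw, as the docstring notes):

>>> import weather
>>> #fetch the requested date plus the 29 days before it
>>> weather.fetchWeather(2, 7, 2012, numdays=30, filedir='./weatherdata/')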

Presenter Notes

Let's review what happened here

  1. We inferred the URL structure for different days.
  2. We made a local copy of the HTML file.
  3. We renamed our local copy to reflect the date collected.
  4. We created a separate file, file2url.csv, to keep track of the original URL each file was obtained from.
  5. We waited 10 seconds between website requests.
  6. We checked that we didn't already have the file before downloading a new copy.
  7. We wrote the function with parameters, but included sensible defaults.

Presenter Notes

Best practices for scraping

  1. Download a file once into a data directory, then parse the local copy.
  2. Rename the local file to ensure uniqueness, but keep track of the original URL.
  3. Pause between requests to the same web server (a helper combining these practices is sketched below).
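
These practices can be bundled into a small helper. A sketch along those lines (fetchOnce is a made-up name, not part of weather.py):

import os, time, urllib2

def fetchOnce(url, localpath, logpath='file2url.csv', pause=10):
    """Download url to localpath unless a local copy already exists; log where each file came from."""
    if os.path.isfile(localpath):                 #practice 1: download each file only once
        return localpath
    html = urllib2.urlopen(url).read()
    f = open(localpath, 'w')
    f.write(html)
    f.close()
    f = open(logpath, 'a')                        #practice 2: keep track of the original URL
    f.write('%s,%s\n' % (localpath, url))
    f.close()
    time.sleep(pause)                             #practice 3: pause between requests to the server
    return localpath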

Presenter Notes

Next up: parsing

We can use Firebug to inspect the pages and find which tags hold the data we want

Presenter Notes

Extracting max and min temperature

 1 >>> #soup here comes from one of the downloaded weather pages
 2 >>> ht=soup.find('table',attrs={'id':'historyTable'})
 3 >>> #looks like the max and min temp are in spans
 4 >>> tableSpans=ht.findAll('span')
 5 >>> for s in tableSpans:
 6 ...     print s
 7 ...
 8 >>> #from printing them out, we see the max temp is in tableSpans[6]
 9 >>> tableSpans[6]
10 <span class="nobr"><span class="b">50</span>&nbsp;&deg;F</span>
11 >>> tableSpans[6].text
12 u'50&nbsp;&deg;F'
13 >>> float(tableSpans[6].text)
14 Traceback (most recent call last):
15   File "<stdin>", line 1, in <module>
16 ValueError: invalid literal for float(): 50&nbsp;&deg;F
17 >>> tableSpans[6].span.text
18 u'50'
19 >>> maxtemp=float(tableSpans[6].span.text)
20 >>> #again for the min temp (span 13, found the same way)
21 >>> mintemp=float(tableSpans[13].span.text)

Presenter Notes

Exercise: can you extract the max wind speed?

Here's a hint: print out the tableSpans list and indices:

>>> for i in range(len(tableSpans)):
...     print '%i %s'%(i,tableSpans[i])
...
0 <span>Mean Temperature</span>
1 <span class="nobr"><span class="b">39</span>&nbsp;&deg;F</span>
2 <span class="b">39</span>
3 <span class="nobr"><span class="b">30</span>&nbsp;&deg;F</span>
4 <span class="b">30</span>
5 <span>Max Temperature</span>
6 <span class="nobr"><span class="b">50</span>&nbsp;&deg;F</span>
7 <span class="b">50</span>
8 <span class="nobr"><span class="b">37</span>&nbsp;&deg;F</span>
9 <span class="b">37</span>
10 <span class="nobr"><span class="b">57</span>&nbsp;&deg;F</span>
11 <span class="b">57</span>
12 <span>Min Temperature</span>
13 <span class="nobr"><span class="b">28</span>&nbsp;&deg;F</span>
14 <span class="b">28</span>
15 <span class="nobr"><span class="b">23</span>&nbsp;&deg;F</span>
    ...

Presenter Notes

Now go from one-off scraping to scraping all pages

Here is the general-purpose function:

 1 def parseWeather(makeFile=False,filedir='/home/qtw/public_html/data/weather/'):
 2     """
 3     Goes through all the weather files that have been fetched, extracts 
 4     the max and min temp, and returns a dictionary mapping days to [max temp, min temp]
 5     arguments:
 6         makeFile (default False): if True, creates CSV file of the form date,max,min
 7         filedir (default /home/qtw/public_html/data/weather/): directory for CSV file
 8     """
 9     #create a dictionary mapping the date to the max and min temp.
10     date2temp={}
11     for filename in os.listdir(filedir):
12         #first make sure we're only looking at the weather html files
13         if filename[:7]!='weather' or filename[-4:]!='html': continue
14         soup=BeautifulSoup(open(os.path.join(filedir,filename)).read())
15         #after inspecting element in Firebug, we see that the relevant 
16         #information is in the historyTable.
17         ht=soup.find('table',attrs={'id':'historyTable'})
18         #looks like the max and min temp are in spans.
19         tableSpans=ht.findAll('span')
20         maxtemp=float(tableSpans[6].span.text)
21         mintemp=float(tableSpans[13].span.text)
22         #get the date from the file name
23         day=filename[8:18]
24         date2temp[day]=[maxtemp,mintemp]
25     ...

Presenter Notes

Finishing off the parseWeather() function

Here's the code that writes the dictionary to a file:

1 if makeFile:
2     f=open(os.path.join(filedir,'temps.csv'),'w')
3     f.write('day,hi,low\n')
4     tempdays=date2temp.keys()
5     tempdays.sort() #this ensures that the dates are listed in order
6     for day in tempdays:
7         f.write('%s,%.1f,%.1f\n'%(day,date2temp[day][0],date2temp[day][1]))
8     f.close()
9 return date2temp
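
Calling it is then a one-liner. A possible usage (hypothetical, assuming fetchWeather() has already populated filedir):

#parse everything fetchWeather() downloaded, and also write temps.csv into filedir
d2t = parseWeather(makeFile=True)
print 'parsed %i days of weather' % len(d2t)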

Presenter Notes

Let's review what just happened

Suppose you see a function in the code that you don't recognize, say os.listdir(). What can you do about it?

One option is to try it in the interpreter and look at what it returns:

>>> filedir='/home/qtw/public_html/data/weather/'
>>> import os
>>> os.listdir(filedir)
['weather-2012-01-25.html', 'weather-2012-01-11.html', 'weather-2012-01-14.html',
'weather-2012-01-20.html', 'weather-2012-01-21.html', 'weather-2012-01-16.html',
'weather-2012-02-03.html', 'weather-2012-01-19.html', 'weather-2012-02-01.html', 'temps.csv',
'weather-2012-01-23.html', 'weather-2012-01-09.html', 'weather-2012-01-28.html',
'weather-2012-01-15.html', 'weather-2012-01-13.html', 'file2url.csv', 'weather-2012-02-05.html',
...]
  • Alternatively, type pydoc os.listdir at the Linux prompt (not inside the interpreter)
  • Or Google python os.listdir, which will lead you to the official Python documentation

Presenter Notes

Let's review what just happened

  1. We iterated over all the HTML files downloaded by fetchWeather()
  2. We copied the code that worked for a single day and applied it to all days
  3. We created a dictionary and CSV file, just as in earlier exercises

Presenter Notes

A word on data debugging

  • In our example, we were lucky that the HTML structure was identical for all pages.
  • Sometimes data is messy and inconsistent, so we have to run checks to catch problems.
  • Often code that works on one example will throw errors when extended to many pages.
  • Don't despair: inspect the variables in the interpreter to see where the problem comes from and investigate.
  • A typical workflow is to hit a few special cases (discovered through bugs), which you then account for with conditional statements (see the sketch below).
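
For example, the per-page parsing could be pulled into a small helper that checks its assumptions before indexing into the spans. This is only a sketch: the name safeParse and its checks are made up here, not part of weather.py.

def safeParse(soup, filename):
    """Return [maxtemp, mintemp] for one page's soup, or None if the page doesn't match the expected structure."""
    ht = soup.find('table', attrs={'id': 'historyTable'})
    if ht is None:
        print 'no historyTable in %s, skipping' % filename
        return None
    tableSpans = ht.findAll('span')
    if len(tableSpans) < 14 or tableSpans[6].span is None or tableSpans[13].span is None:
        print 'unexpected table layout in %s, skipping' % filename
        return None
    return [float(tableSpans[6].span.text), float(tableSpans[13].span.text)]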

Presenter Notes

Data sanity checking

The best place to put data-testing code is under the if __name__ == '__main__' condition:

 1 if __name__ == '__main__':
 2     #fetch 30 days' worth of weather pages
 3     fetchWeather(2,7,2012,30)
 4     d2t=parseWeather()
 5     #time for sanity checking
 6     #does the min temp ever exceed or equal max?
 7     mingmax=[d for d in d2t if d2t[d][1]>=d2t[d][0]]
 8     if mingmax==[]:
 9         print 'Sanity check 1: no min temps >= max temps'
10     else:
11         print 'Sanity check 1 failed : min temps >= max temps on days: %s' %mingmax
12     print "Temperatures for manual inspection"
13     for d in ['2012-01-15','2012-01-31','2012-02-07']:
14         print 'On %s, max: %i, min: %i'% (d,d2t[d][0],d2t[d][1])
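
Note that this block only runs when the module is executed directly (python weather.py); it is skipped when the module is pulled in with import weather, as we did earlier, so the sanity checks never get in the way of reusing the functions.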

Presenter Notes

Review: data-scraping code development

  1. Find the series of pages or resources that you need to download.
  2. Separate the data fetching from the parsing.
  3. Make local copies of downloaded files, and parse the local files
  4. Try to get a single example parsed properly first.
  5. Expect bugs to reveal inconsistencies in the gathered data
  6. Once you think all bugs are squashed, run extra data sanity checks to increase your confidence.

Presenter Notes

Be careful what you scrape

Weather Underground's terms of service permit data collection

Weather Underground Terms of Service

Presenter Notes

Many websites forbid this behavior

Here is an excerpt from Amazon's terms of service:

Amazon grants you a limited license to access and make personal use of this site and not to download (other than page caching) or modify it, or any portion of it, except with express written consent of Amazon. This license does not include any resale or commercial use of this site or its contents; any collection and use of any product listings, descriptions, or prices; any derivative use of this site or its contents; any downloading or copying of account information for the benefit of another merchant; or any use of data mining, robots, or similar data gathering and extraction tools.

  • The safest route is to stick to websites that don't prohibit scraping
  • Fortunately, this includes nearly all government websites

Presenter Notes

More complex cases

Sometimes simply fetching the URL doesn't work
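
One common case (an example here, not necessarily the one covered next) is a server that refuses requests carrying urllib2's default User-Agent header. urllib2 lets you send a browser-like one explicitly:

import urllib2

#a sketch: some servers only answer requests that look like they come from a browser
req = urllib2.Request('http://www.example.com/page.html',
                      headers={'User-Agent': 'Mozilla/5.0'})
html = urllib2.urlopen(req).read()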

Presenter Notes
