There are many ways to run a Python script: http://cs.wellesley.edu/~qtw/lectures/python.html#section_12
HTML consists of tags, some of which have attribute-value pairs:
<html>
<head>
<title>Example HTML file</title>
</head>
<body>
<h1>Important things</h1>
<h2>Summary</h2>
<p>Here is an example <a href="http://www.w3schools.com/html/">HTML</a> file.</p>
<h2 class="section">Motivation</h2>
<div id="motives">
<p>I'm not sure why you'd want to download it. Here are some possible reasons:</p>
<ol>
<li>It contains very important information.</li>
<li>You are the <a href="http://www.google.com/">Google bot</a>, and you'll visit <em>anything</em>.</li>
<li>It's the example we are using in class.</li>
</ol>
</div>
The findAll() function: let's pull the file and extract the headers.
>>> import urllib2
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(urllib2.urlopen("http://cs.wellesley.edu/~qtw/data/htmldemo.html").read())
>>> [x.text for x in soup.findAll("h2")]
[u'Summary', u'Motivation', u'Results', u'Conclusion']
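findAll() (and its single-result cousin find()) works the same way for any tag name; for instance, pulling the lone h1 header out of this page should look something like this:
>>> soup.find("h1").text
u'Important things'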
Here's how to extract the anchor (link) tags and pull out their attributes:
>>> anchors=soup.findAll("a")
>>> len(anchors)
2
>>> anchors[0] #here's the first link
<a href="http://www.w3schools.com/html/">HTML</a>
>>> anchors[0]["href"] #here's how you get the URL out of an anchor tag
u'http://www.w3schools.com/html/'
>>> anchors[0].text #and here's the inner HTML
u'HTML'
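To collect every link's URL at once, a list comprehension over the anchors does the trick:
>>> [a["href"] for a in anchors]
[u'http://www.w3schools.com/html/', u'http://www.google.com/']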
We can also restrict findAll() to tags carrying particular attributes; here are just the h2 headers with class section:
>>> sectionheaders=soup.findAll("h2",attrs={"class":"section"})
>>> sectionheaders
[<h2 class="section">Motivation</h2>, <h2 class="section">Results</h2>, <h2 class="section">Conclusion</h2>]
How do we find the div with id motives and print it out? Here's what you do:
>>> mtag=soup.find("div",attrs={"id":"motives"})
>>> mtag
<div id="motives">
<p>I'm not sure why you'd want to download it. Here are some possible reasons:</p>
<ol>
<li>It contains very important information.</li>
<li>You are the <a href="http://www.google.com/">Google bot</a>, and you'll visit <em>anything</em>.</li>
<li>It's the example we are using in class.</li>
</ol>
</div>
>>> lisText=[t.text for t in mtag.findAll("li")]
>>> lisText
[u'It contains very important information.', u"You are theGoogle bot, and you'll visitanything.", u"It's the example we are using in class."]
Download a copy of the code:
[tmoore@trout ~] pwd
/home/tmoore
[tmoore@trout ~] cd cs349b/code
[tmoore@trout code] mkdir weather
[tmoore@trout code] cd weather
[tmoore@trout weather] wget http://cs.wellesley.edu/~qtw/code/weather/weather.py
--2012-02-09 07:37:38-- http://cs.wellesley.edu/~qtw/code/weather/weather.py
Resolving cs.wellesley.edu... 149.130.136.19
Connecting to cs.wellesley.edu|149.130.136.19|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4837 (4.7K) [text/plain]
Saving to: weather.py
100%[==========================================================================>] 4,837 --.-K/s in 0s
2012-02-09 07:37:38 (195 MB/s) - weather.py
[tmoore@trout weather] python
Python 2.7 (r27:82500, Sep 16 2010, 18:03:06)
[GCC 4.5.1 20100907 (Red Hat 4.5.1-3)] on linux2
>>> import weather
Here is code to fetch the weather for multiple days:
def fetchWeather(month,day,year,numdays=0,filedir='/home/qtw/public_html/data/weather/'):
    """
    This function fetches the historical weather for Boston on the requested day and stores it in a local file
    arguments:
      month: number from 1-12
      day: number from 1-31
      year: 4-digit year
      numdays: number of days to go back in time. Default value is 0 (only get the requested date)
      filedir: location of the directory where weather data files should be stored.
               Default value is /home/qtw/public_html/data/weather/, but that only
               works for the user qtw
    """
    import datetime
    startday=datetime.datetime(year,month,day)
    #create the appropriate directory for the data if it doesn't exist
    #(os, urllib2, and time are assumed to be imported at the top of weather.py)
    if not os.path.isdir(filedir):
        os.mkdir(filedir)
    ...
Here is the rest of the code to fetch the weather for multiple days (part 2):
    #finishing off the fetchWeather function
    for daydiff in range(numdays):
        day2check=startday-datetime.timedelta(daydiff)
        url2check='http://www.wunderground.com/history/airport/KBOS/%i/%i/%i/DailyHistory.html'%(day2check.year,day2check.month,day2check.day)
        #note that we must rename the HTML file, because every page is named DailyHistory.html
        urlfileloc=os.path.join(filedir,'weather-%s.html' % day2check.strftime("%Y-%m-%d"))
        #first check to see if the file already exists (i.e., has been downloaded)
        if os.path.isfile(urlfileloc):
            print 'file already exists at %s, so skipping' % urlfileloc
            continue
        #store a local copy of the file, but only fetch after confirming it isn't already stored
        htmls = urllib2.urlopen(url2check).read()
        f=open(urlfileloc,'w')
        f.write(htmls)
        f.close()
        #keep a record of the original URL and the corresponding file name
        f=open(os.path.join(filedir,'file2url.csv'),'a')
        f.write('%s,%s\n'%(urlfileloc,url2check))
        f.close()
        #wait a few seconds before checking the previous day, out of courtesy
        time.sleep(10)
Each download is also logged to file2url.csv, to keep track of the original URL the file was obtained from.
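If you later need to map a cached file back to its source URL, a minimal sketch of a reader for file2url.csv might look like this (loadFile2Url is not part of weather.py, and it assumes neither field contains a comma, which holds for the paths and Wunderground URLs above):
import os

def loadFile2Url(filedir='/home/qtw/public_html/data/weather/'):
    #map each local file path back to the URL it was downloaded from
    file2url={}
    for line in open(os.path.join(filedir,'file2url.csv')):
        fileloc,url=line.strip().split(',')
        file2url[fileloc]=url
    return file2url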
We can use Firebug to figure out where the data we need lives inside the downloaded pages. Start from a parsed page (weather.py includes a test helper):
soup=weather.testWeatherParse()
ht=soup.find('table',attrs={'id':'historyTable'})
#looks like the max and min temp are in spans.
tableSpans=ht.findAll('span')
for s in tableSpans:
    print s
#from printing them out, we see it's in tableSpans[6]
>>> tableSpans[6]
<span class="nobr"><span class="b">50</span> °F</span>
>>> tableSpans[6].text
u'50 °F'
>>> float(tableSpans[6].text)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for float(): 50 °F
>>> tableSpans[6].span.text
u'50'
maxtemp=float(tableSpans[6].span.text)
#again for the min temp
mintemp=float(tableSpans[13].span.text)
Here's a hint: print out the tableSpans list along with the indices:
>>> for i in range(len(tableSpans)):
...     print '%i %s'%(i,tableSpans[i])
...
0 <span>Mean Temperature</span>
1 <span class="nobr"><span class="b">39</span> °F</span>
2 <span class="b">39</span>
3 <span class="nobr"><span class="b">30</span> °F</span>
4 <span class="b">30</span>
5 <span>Max Temperature</span>
6 <span class="nobr"><span class="b">50</span> °F</span>
7 <span class="b">50</span>
8 <span class="nobr"><span class="b">37</span> °F</span>
9 <span class="b">37</span>
10 <span class="nobr"><span class="b">57</span> °F</span>
11 <span class="b">57</span>
12 <span>Min Temperature</span>
13 <span class="nobr"><span class="b">28</span> °F</span>
14 <span class="b">28</span>
15 <span class="nobr"><span class="b">23</span> °F</span>
...
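If you'd rather not hard-code the indices 6 and 13, one alternative (a sketch, assuming the 'Max Temperature' and 'Min Temperature' label spans always directly precede their value spans, as in the listing above) is to look the labels up first:
labels=[s.text for s in tableSpans]
maxtemp=float(tableSpans[labels.index('Max Temperature')+1].span.text)
mintemp=float(tableSpans[labels.index('Min Temperature')+1].span.text)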
Here is the general-purpose function:
def parseWeather(makeFile=False,filedir='/home/qtw/public_html/data/weather/'):
    """
    Goes through all the weather files that have been fetched, extracts
    the max and min temp, and returns a dictionary mapping days to [max temp, min temp]
    arguments:
      makeFile (default False): if True, creates a CSV file of the form date,max,min
      filedir (default /home/qtw/public_html/data/weather/): directory for the CSV file
    """
    #create a dictionary mapping the date to the max and min temp
    date2temp={}
    for filename in os.listdir(filedir):
        #first make sure we're only looking at the weather html files
        if filename[:7]!='weather' or filename[-4:]!='html': continue
        soup=BeautifulSoup(open(os.path.join(filedir,filename)).read())
        #after inspecting the element in Firebug, we see that the relevant
        #information is in the historyTable
        ht=soup.find('table',attrs={'id':'historyTable'})
        #looks like the max and min temp are in spans
        tableSpans=ht.findAll('span')
        maxtemp=float(tableSpans[6].span.text)
        mintemp=float(tableSpans[13].span.text)
        #get the date from the file name
        day=filename[8:18]
        date2temp[day]=[maxtemp,mintemp]
    ...
Finishing off the parseWeather() function, here's the code that writes the dictionary to a file:
    if makeFile:
        f=open(os.path.join(filedir,'temps.csv'),'w')
        f.write('day,hi,low\n')
        tempdays=date2temp.keys()
        tempdays.sort() #this ensures that the dates are listed in order
        for day in tempdays:
            f.write('%s,%.1f,%.1f\n'%(day,date2temp[day][0],date2temp[day][1]))
        f.close()
    return date2temp
Suppose you see a function in the code that you don't recognize, say os.listdir(). What can you do about it?
You could try it out in the interpreter and infer what it returns:
>>> filedir='/home/qtw/public_html/data/weather/'
>>> import os
>>> os.listdir(filedir)
['weather-2012-01-25.html', 'weather-2012-01-11.html', 'weather-2012-01-14.html',
'weather-2012-01-20.html', 'weather-2012-01-21.html', 'weather-2012-01-16.html',
'weather-2012-02-03.html', 'weather-2012-01-19.html', 'weather-2012-02-01.html', 'temps.csv',
'weather-2012-01-23.html', 'weather-2012-01-09.html', 'weather-2012-01-28.html',
'weather-2012-01-15.html', 'weather-2012-01-13.html', 'file2url.csv', 'weather-2012-02-05.html',
...]
You could also run pydoc os.listdir at the Linux terminal (not the interpreter), or search the web for "python os.listdir", which will take you to the Python documentation.
Where should the code that exercises fetchWeather() and sanity-checks the data go? The best place is inside the __main__ condition:
if __name__ == '__main__':
    #fetch 30 days' worth of weather pages
    fetchWeather(2,7,2012,30)
    d2t=parseWeather()
    #time for sanity checking
    #does the min temp ever exceed or equal the max?
    mingmax=[d for d in d2t if d2t[d][1]>=d2t[d][0]]
    if mingmax==[]:
        print 'Sanity check 1: no min temps >= max temps'
    else:
        print 'Sanity check 1 failed: min temps >= max temps on days: %s' % mingmax
    print "Temperatures for manual inspection"
    for d in ['2012-01-15','2012-01-31','2012-02-07']:
        print 'On %s, max: %i, min: %i' % (d,d2t[d][0],d2t[d][1])
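Because these calls sit under the __main__ check, they run only when the module is executed directly, not when it is imported; from the terminal that would be:
[tmoore@trout weather] python weather.py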
Weather Underground's terms of service permit this kind of data collection, but not every site's terms do. Here is an excerpt from Amazon's terms of service:
Amazon grants you a limited license to access and make personal use of this site and not to download (other than page caching) or modify it, or any portion of it, except with express written consent of Amazon. This license does not include any resale or commercial use of this site or its contents; any collection and use of any product listings, descriptions, or prices; any derivative use of this site or its contents; any downloading or copying of account information for the benefit of another merchant; or any use of data mining, robots, or similar data gathering and extraction tools.
Sometimes simply fetching the URL doesn't work:
mechanize will help you submit forms and automate the request steps you want
selenium drives a JavaScript-capable browser that can be automated
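As a hedged sketch (the page URL and the form field name 'q' below are hypothetical placeholders, not from the lecture code), filling in and submitting a form with mechanize might look like this:
import mechanize

br = mechanize.Browser()
br.open('http://www.example.com/search')   #hypothetical page containing a search form
br.select_form(nr=0)                       #select the first form on the page
br['q'] = 'KBOS weather history'           #fill in the (hypothetical) field named 'q'
response = br.submit()
html = response.read()                     #the resulting page, ready for BeautifulSoup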