Leveraging APIs to Collect Data
Readings
- These lecture notes
- Sign up for a New York Times API key. We will be doing some in-class exercises using this API as an example, and so I want to be sure that everyone can query the API.
API overview
APIs offer an interface to resources maintained by someone else. Sometimes this resource can be an operation, such as Facebook's authentication API, which lets website operators rely on Facebook to provide logins to users. Other times, APIs offer access to data under the control of a third party, such as the New York Times article search. This second case will be the focus of how we use APIs in this course.
In many respects, web-service APIs simply offer a web-based alternative for accessing databases. In principle, a company could set up an SQL database and allow users across the Internet to log into the database and issue queries on the data. This could introduce substantial efficiency and security complications, however. Instead, web-service APIs route queries through URLs, and so they are passed through the HTTP protocol, which is universally available in today's Internet-connected computers. This also means that the richness of queries available is entirely dependent on what the API allows, and it also means that there are fewer standards across APIs. Consequently, mastering the use of one API does not mean that you can effortlessly reapply the code when developing for a new API. For this reason, the most important skill for you to acquire is the ability to quickly examine an API's documentation in order to craft the appropriate queries to get the data you're after.
APIs are being developed for just about anything. ProgrammableWeb maintains an extensive directory of publicly available web-service APIs (5,000 and counting as of this writing).
Most APIs that we will encounter are RESTful. For our purposes, this means that an API request is issued as an HTTP GET request. The most important consequence is that you can issue the entire request by fetching a carefully constructed URL.
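In other words, once you know how to construct the right URL, issuing an API request is no different from fetching any other web page. Here is a minimal sketch (the URL and key below are made-up placeholders, not a real endpoint):

import urllib2
# a RESTful API call is just an HTTP GET on a carefully constructed URL
# (placeholder URL and key -- substitute a real endpoint and your own API key)
response = urllib2.urlopen("http://api.example.com/search?q=wellesley&api-key=YOUR_KEY")
raw = response.read()   # the response body, typically JSON or XML, which we parse later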
URL structure
URLs follow a common structure: scheme://netloc/path;parameters?query#fragment. Consider the URL:
http://www.bing.com/search?q=tyler+moore&go=&qs=n&form=QBLH&pq=tyler%2520moore&sc=8-10&sp=-1&sk=
We can parse out the constituent parts using Python's urlparse module:
>>> import urlparse
>>> res=urlparse.urlparse("http://www.bing.com/search?q=tyler+moore&go=&qs=n&form=QBLH&pq=tyler%2520moore&sc=8-10&sp=-1&sk=")
>>> print res
ParseResult(scheme='http', netloc='www.bing.com', path='/search', params='', query='q=tyler+moore&go=&qs=n&form=QBLH&pq=tyler%2520moore&sc=8-10&sp=-1&sk=', fragment='')
>>> res.scheme
'http'
>>> res.query
'q=tyler+moore&go=&qs=n&form=QBLH&pq=tyler%2520moore&sc=8-10&sp=-1&sk='
>>> qs=urlparse.parse_qs(res.query)
>>> qs
{'qs': ['n'], 'pq': ['tyler%20moore'], 'form': ['QBLH'], 'sp': ['-1'], 'q': ['tyler moore'], 'sc': ['8-10']}
Notice also that while I typed "tyler moore" into Bing for the query, what was passed in the URL was q=tyler+moore. The space was encoded as +. Also there is a parameter named pq=tyler%2520moore, which has encoded a percent sign as %25. Fortunately, encoding and decoding the escaped characters for URLs is also built in to Python using the urllib module. To replace spaces and other special characters with their escaped equivalents, use the urllib.quote_plus() method. To reverse the process, use the urllib.unquote_plus() method.
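For example, a quick check at the interpreter (note that unquoting the pq value only removes one level of escaping, since it was doubly encoded):

>>> import urllib
>>> urllib.quote_plus("tyler moore")
'tyler+moore'
>>> urllib.unquote_plus("tyler%2520moore")
'tyler%20moore'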
Finally, to convert a dictionary mapping query attributes to values into a query string, use urllib.urlencode(). For example, to convert the query dictionary qs above:
>>> urllib.urlencode(qs)
'qs=%5B%27n%27%5D&pq=%5B%27tyler%2520moore%27%5D&form=%5B%27QBLH%27%5D&sp=%5B%27-1%27%5D&q=%5B%27tyler+moore%27%5D&sc=%5B%278-10%27%5D'
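The output looks odd because parse_qs returns each value as a list, and urlencode then escapes the brackets and quotes of those Python lists. If you want the lists flattened back into plain key=value pairs, you can pass urlencode's doseq flag (a sketch; the exact ordering of the pairs may differ on your machine):

>>> urllib.urlencode(qs, doseq=True)
'qs=n&pq=tyler%2520moore&form=QBLH&sp=-1&q=tyler+moore&sc=8-10'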
Now that we know a bit more about how URLs are structured and processed in Python, we can take a closer look at how to build API queries.
Crafting API queries
We will work with the New York Times Article Search API as our running example for working with web-resource APIs.
From the documentation, we can see this guidance on how to construct URLs for queries:
http://api.nytimes.com/svc/search/v1/article?query=(field:)keywords (facet:[value])(&params)&api-key=your-API-key
Note that there is a base URL which is the same for all queries: http://api.nytimes.com/svc/search/v1/article. Then there is an attribute called query, which expects field and keywords, plus something called facets. Finally, there is another attribute called api-key, which expects your API key.
You'll find that all APIs have base URLs, and most require API keys. You sign up for API keys with each site. These keys let the API operator keep track of how many requests you issue, so that you don't exceed query limits.
Notice too that there are all these parentheses. You may wonder: are those supposed to be literally included? What about the space? Well, you have to dig a bit deeper in the documentation, in particular the Constructing a Search Query section, to find out. I suggest you read that section now before continuing with the lecture notes.
Looking back at the Requests section, we can see that the only required parameters are query and api-key, but there are a host of optional parameters.
Let's start with building a minimal query. Suppose we want to query for the term wellesley in news articles. The URL should consist of the base URL + the query + the API key:
http://api.nytimes.com/svc/search/v1/article?query=wellesley&api-key=your-API-key
We can break that URL up into its constituent parts in Python:
apikey="[removed API key for security reasons]"
baseurl="http://api.nytimes.com/svc/search/v1/article?"
q={"query":"wellesley","api-key":apikey}
url2check=baseurl+urllib.urlencode(q)
Suppose we wanted to ask a slightly more complicated query, perhaps searching for the phrase "wellesley college" appearing in either the title or byline of the article. We can just make a more complicated dictionary:
q2={"query":'"wellesley college"',"fields":"title,byline","api-key":apikey} #remember to include the API key here too
url2check2=baseurl+urllib.urlencode(q2)
Handling results
Now that we have crafted the query, the next step is to fetch the results. This starts out just like any other URL request:
>>> import urllib2
>>> result=urllib2.urlopen(url2check).read()
>>> print result
{"offset" : "0" , "results" : [{"body" : "PHILADELPHIA BETSEY STEVENSON and Justin Wolfers might sound like almost any upscale couple. They have impressive degrees and serious careers and the social markers that go with them. They have one child, but there are two strollers, a Bugaboo and a Bob baby jogger, parked in the front hall of their stylish home here. Their daughter, Matilda, who" , "byline" : "By MOTOKO RICH" , "date" : "20120212" , "title" : "Economics of Family Life, as Taught by a Power Couple" , "url" : "http:\/\/www.nytimes.com\/2012\/02\/12\/busin ...
However, what is returned is not HTML. Instead, it looks a bit like a Python dictionary with nested lists. Well, this is JSON (JavaScript Object Notation). It is a simple format for expressing data structures that can be passed via the web. Fortunately for us, there is a Python library called json which can convert JSON-encoded strings into Python data structures, and vice versa. Here it is in action:
>>> import json
>>> resd=json.loads(result)
>>> print resd
{u'tokens': [u'wellesley'], u'total': 4409, u'results': [{u'body': u'PHILADELPHIA BETSEY STEVENSON and Justin Wolfers might sound like almost any upscale couple. They have impressive degrees and serious careers and the social markers that go with them. They have one child, but there are two strollers, a Bugaboo and a Bob baby jogger, parked in the front hall of their stylish home here. Their daughter, Matilda, who', u'date': u'20120212', u'byline': u'By MOTOKO RICH', u'url': u'http://www.nytimes.com/2012/02/12/business/economics-of-family-life-as-taught-by-a-power-couple.html', u'title': u'Economics of Family Life, as Taught by a Power Couple'}, {u'body': u"LIPPINCOTT--Rosemond, 97, on January 16, 2012, at Mayflower Place Nursing Center, West Yarmouth, MA. Born Summit, NJ, to Dr. Henry M. and Mary O'Reilly. Kent Place School, '32; We ...
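The reverse direction works as well: json.dumps turns a Python structure back into a JSON-encoded string, which is handy if you want to write results back out in their original format. A tiny illustration:

>>> json.dumps({"query": "wellesley", "total": 4409}, sort_keys=True)
'{"query": "wellesley", "total": 4409}'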
You can read about the meaning of the fields in the documentation, but you'll likely find it easier to simply explore the structure directly at the interpreter:
>>> for r in resd:
...     print r, resd[r]
...
tokens [u'wellesley']
total 4409
results [{u'body': u'PHILADELPHIA BETSEY STEVENSON and Justin Wolfers might sound like almost any upscale couple. They have impressive degrees and serious careers and the social markers that go with them. They have one child, but there are two strollers, a Bugaboo and a Bob baby jogger, parked in the front hall of their stylish home here. Their daughter, Matilda, who', u'date': u'20120212', u'byline': u'By MOTOKO RICH', u'url': u'http://www.nytimes.com/2012/02/12/business/economics-of-family-life-as-taught-by-a-power-couple.html', u'title': u'Economics of Family Life, as Taught by a Power Couple'}, {u'body': u"LIPPINCOTT--Rosemond, 97, on January 16, 2012, at Mayflower Place Nursing Center, West Yarmouth, MA. Born Summit, NJ, to Dr. Henry M. and Mary O'Reilly. Kent Place School, '32; Wellesley College, '36. Predeceased by husband Job H. Lippincott. Resided Chatham, NJ, 1937-77; Nantucket, MA, 1977-85; and thereafter on Cape Cod. She was a generous", u'date': u'20120129', u'url': u'http://query.nytimes.com/gst/fullpage.html?res=9800E2DA133AF93AA15752C0A9649D8B63', u'title': u'Paid Notice: Deaths LIPPINCOTT, ROSEMOND'}, {u'body': u'HOWARD--Barnaby J. The son of a British lord, who grew up to be a pilot with the British and United States Navies during World War II and later a farmer in Southern Rhodesia (now Zimbabwe) before returning to America to set up a successful investment company (CAIMS), died December 18 at home in Orange Park, FL at age 86 after a courageous battle', u'date': u'20120129', u'url': u'http://query.nytimes.com/gst/fullpage.html?res=9803E3DA133AF93AA15752C0A9649D8B63', u'title': u'Paid Notice: Deaths HOWARD, BARNABY J'}, {u'body': u"Two hundred fifty-two consecutive matches won over 13 years. Thirteen national titles. The longest winning streak in college sports. Trinity College has been a squash dynasty under Coach Paul Assaiante. But two weeks ago in New Haven, Yale overthrew that dynasty in a 5-4 victory. Yale's coach, David Talbott, called it ''a long time coming.'' The", u'date': u'20120129', u'byline': u'By MING TSAI', u'url': u'http://www.nytimes.com/2012/01/29/sports/chef-ming-tsai-devoted-player-and-cooker-of-squash.html', u'title': u'Squash, a Growing Sport, And Nutritious, Too'}, {u'body': u"To the Editor: Hendrik Hartog has it right in ''Bargaining for a Child's Love'' (Sunday Review, Jan. 15). That Republicans disparage entitlement programs astounds me. I don't know of any who have refused Social Security or Medicare for themselves or their parents or grandparents. My mother, born in 1918, often said that it was President Franklin D.", u'date': u'20120124', u'url': u'http://www.nytimes.com/2012/01/24/opinion/benefits-for-the-elderly.html', u'title': u'LETTER; Benefits for the Elderly'}, {u'body': u"CRAWFORD--John Charlton, composer, pianist, professor, beloved father and husband, died on January 5, 2012, at age 80 in his 23rd year of Parkinson's disease in Cambridge, MA. Born the son of academic parents in 1931 in Philadelphia, he was gifted in music and languages. He graduated from Germantown Friends School and the Yale School of Music, and", u'date': u'20120122', u'url': u'http://query.nytimes.com/gst/fullpage.html?res=9C00E2DE133AF931A15752C0A9649D8B63', u'title': u'Paid Notice: Deaths CRAWFORD, JOHN CHARLTON'}, {u'body': u"IT'S show time for Anne M. Finucane. Her co-star on this day, Bill Clinton, is waiting offstage. The audience shifts in its seats. The spotlight goes up and ... action! It's a Thursday in early December, at a conference center near Orlando, and Ms. Finucane is busy shaping an image. Or, rather, trying to reshape one. This choreographed interview", u'date': u'20120115', u'byline': u'By LOUISE STORY and GRETCHEN MORGENSON', u'url': u'http://www.nytimes.com/2012/01/15/business/at-bank-of-america-the-image-officer-has-a-lot-to-fix.html', u'title': u'The Image Officer With a Lot to Fix'}, {u'body': u'KNEUBUHL--James Pritchard of Southbury, CT, formerly of New Canaan, CT and San Marino, CA, died December 30, 2011, at the age of 95. Husband of the late Margaret Woodard Kneubuhl, Jim leaves his daughters, Janet Schloat of Pound Ridge, NY and Barbara Kneubuhl of Wellesley, MA; three grandsons, David, Benjamin, and Michael Schloat and their wives;', u'date': u'20120112', u'url': u'http://query.nytimes.com/gst/fullpage.html?res=9404E5D8123AF931A25752C0A9649D8B63', u'title': u'Paid Notice: Deaths KNEUBUHL, JAMES PRITCHARD OF SOUTHBURY'}, {u'body': u'EDELMAN--Eleanor L. died peacefully in her sleep at her home in Bronxville, New York on January 7, 2012. For 53 years, she was the wife of Albert I. Edelman, an attorney who predeceased her. She was born Eleanor Louise Weisman in 1924 in St. Louis, Missouri and was known to her friends as Elly. Along with her beloved sisters, Beryl and Nanette, she', u'date': u'20120112', u'url': u'http://query.nytimes.com/gst/fullpage.html?res=9E03E6D8123AF931A25752C0A9649D8B63', u'title': u'Paid Notice: Deaths EDELMAN, ELEANOR L'}, {u'body': u'Nina Bich-Phuong Xuan Ha and Stephen Michael Girasuolo were married Friday evening at the Harvard Club of New York. Marylin G. Diamond, a retired acting justice of State Supreme Court in New York, officiated. On Thursday, the Rev. Thich Nguyen Hanh, a Buddhist priest, performed a ceremony that incorporated Vietnamese traditions at the Unitarian', u'date': u'20120108', u'url': u'http://www.nytimes.com/2012/01/08/fashion/weddings/nina-ha-stephen-girasuolo-weddings.html', u'title': u'Nina Ha, Stephen Girasuolo'}]
offset 0
OK, we can tell that most of the information is in resd['results']. We can see that len(resd['results']) is 10, so there are 10 results returned. Each result is itself a dictionary:
>>> for k in resd['results'][0]:
...     print k, resd['results'][0][k]
...
body PHILADELPHIA BETSEY STEVENSON and Justin Wolfers might sound like almost any upscale couple. They have impressive degrees and serious careers and the social markers that go with them. They have one child, but there are two strollers, a Bugaboo and a Bob baby jogger, parked in the front hall of their stylish home here. Their daughter, Matilda, who
date 20120212
byline By MOTOKO RICH
url http://www.nytimes.com/2012/02/12/business/economics-of-family-life-as-taught-by-a-power-couple.html
title Economics of Family Life, as Taught by a Power Couple
The most common format for API results is JSON, and it's also the most convenient for us as Python programmers. However, sometimes APIs return results in XML format. In this case, you should use BeautifulStoneSoup to parse the XML files, which behaves almost identically to BeautifulSoup.
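As a minimal sketch of what that looks like (the XML snippet here is made up; BeautifulStoneSoup comes with the BeautifulSoup 3.x package used in the scraping lectures):

from BeautifulSoup import BeautifulStoneSoup
# a tiny, made-up XML document standing in for an API's XML response
xml = "<results><article><title>Example story</title><byline>By SOMEONE</byline></article></results>"
soup = BeautifulStoneSoup(xml)
print soup.find("title").string                        # Example story
print [a.title.string for a in soup.findAll("article")] # [u'Example story']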
Reading exercise
Can you write a list comprehension to extract just the URLs from resd?
Here's the answer (only check after you've tried)
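One possible answer (a sketch; it assumes every result dictionary has a url key, which is true for the results shown above):

# pull the url field out of each result dictionary
urls = [r['url'] for r in resd['results']]
print urls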
In-class exercise 1: Construct more queries and print results
Download the file http://cs.wellesley.edu/~qtw/code/apiex.py.
Your task is to create a dictionary q that will include the appropriate parameters to answer the following query: get articles written by David Pogue in 2011 that mention "iphone" and "android".
Use a list comprehension to extract just the titles of the articles from the results.
Bonus task if you finish the rest of the exercise early: find out how many articles written by David Pogue in 2011 don't mention iPhone or Android, and compare this to the number that do.
Storing results
Recall that when we discussed web scraping, the suggested strategy was to first download the HTML file, store it locally, and then parse the local copy of the file. A similar strategy is recommended for API data collections, where possible. One reason is that APIs limit the number of times you can query them in a given period, and so if you are going to be submitting a lot of queries it is essential that you make the most of each allowed query. Another is that APIs are often dynamic in nature, and so a query's results can change from one day to the next (e.g., new articles are added to the New York Times continuously).
Fortunately, when the results come in JSON format, parsing is less error-prone than with HTML, which is one reason why it may not be as important to store the raw results for later re-parsing.
Storing files locally, as done for HTML, is a viable strategy provided that you don't issue too many queries. You would still need to create a file2url.csv file, but this time include the requesting URL, the time of issue, and the stored file, e.g.:
"http://api.nytimes.com/svc/search/v1/article?query=iphone+android+byline%3A%22david+pogue%22&begin_date=20110101&api-key=[removed API key for security reasons]&end_date=20111231",2012-02-14 10:05:15, /home/tmoore/qtw/inclass/data/iphone_android_david_pogue_2011.json
Unfortunately, due to the way filesystems work, it can be very inefficient to store thousands of files in a directory. In fact, doing so can substantially degrade the overall performance of the filesystem. Consequently, in most cases you won't want to store all the JSON results individually.
Instead, you can create a composite data structure in Python that stores the information. Suppose you wanted to search the New York Times for articles with "Obama", "Romney", "Santorum", "Gingrich" and "Paul" in the titles. You create five different queries, then store the results in a dictionary indexed by a tuple consisting of the query URL and the time of the search:
import datetime, time
rightnow=datetime.datetime.now()
queries=[{"query":"title:"+politician,"api-key":apikey} for politician in ["Obama", "Romney", "Santorum", "Gingrich", "Paul"]]
apiResults={}
for q in queries:
    #these 3 lines are just the same as before: encoding and grabbing the URL
    url2check=baseurl+urllib.urlencode(q)
    result=urllib2.urlopen(url2check).read()
    resd=json.loads(result)
    #OK now store the json result in the apiResults dictionary
    apiResults[(url2check,rightnow)]=resd
    time.sleep(1)
So now apiResults is a dictionary whose keys are 2-element tuples of the URL requested plus the time of the search:
>>> apiResults.keys()
[('http://api.nytimes.com/svc/search/v1/article?query=title%3AGingrich&api-key=[removed API key for security reasons]', datetime.datetime(2012, 2, 14, 15, 3, 39, 685928)), ('http://api.nytimes.com/svc/search/v1/article?query=title%3AObama&api-key=[removed API key for security reasons]', datetime.datetime(2012, 2, 14, 15, 3, 39, 685928)), ('http://api.nytimes.com/svc/search/v1/article?query=title%3APaul&api-key=[removed API key for security reasons]', datetime.datetime(2012, 2, 14, 15, 3, 39, 685928)), ('http://api.nytimes.com/svc/search/v1/article?query=title%3ASantorum&api-key=[removed API key for security reasons]', datetime.datetime(2012, 2, 14, 15, 3, 39, 685928)), ('http://api.nytimes.com/svc/search/v1/article?query=title%3ARomney&api-key=[removed API key for security reasons]', datetime.datetime(2012, 2, 14, 15, 3, 39, 685928))]
OK, so creating a dictionary of the results solves the problem of too many files floating around. But what happens after the Python session ends? Won't the dictionary just be cleared from memory? Enter the Python pickle module! This module allows us to freeze any Python object and store a copy in permanent storage, rather than only in main memory. The fancy term for this is "object serialization".
We can pickle the apiResults dictionary as follows:
import os
import cPickle as pickle #we use cPickle, a C implementation of the pickle module that runs faster
pf=open(os.path.expanduser("~/qtw/inclass/data/apiex.pkl"),"wb") #wb = write to a binary file; expanduser turns ~ into your home directory
pickle.dump(apiResults,pf,True)
pf.close()
Then we can close our Python session, think about other classes, come back in a week, fire up the Python interpreter and type:
import os
import cPickle as pickle #a fresh session needs the imports again
pf=open(os.path.expanduser("~/qtw/inclass/data/apiex.pkl"),"rb")
apiRes=pickle.load(pf)
pf.close()
Voila! Now we have our original dictionary back and ready for use.
I suggest that you use this as a backup method for your data collection, rather than as a workflow in which you first collect all the data and only then extract the information you want out of the pickled structure.
Eventually, if you issue many thousands of requests, even Python pickle files can grow too large. In this case, your best bet is to store the results in a MongoDB collection. We'll talk about how to do that next time!