Pizza, Twitter and APIs, Part 1

How do people in New York feel about their pizza? Despite the obvious critical importance of this issue, no research has been published to date that seeks to answer the question. This might be because editors at top journals are actively suppressing the research, or because the data was not easily available. While the first issue may still thwart our progress in the field, the second issue is quite tractable, thanks to Twitter’s Application Programming Interface, or API. Fortunately, even if pizza isn’t your area of interest, the same method we will use here applies to studying a host of contemporary attitudes and relationships, such as how often your students are advertising their intentions to plagiarize. This post will review the basics of accessing Twitter’s search API and then analyzing the resulting text data in Python. A basic familiarity with Python, acquired from the prior posts or elsewhere, is assumed.

In general, Twitter is a great resource for social scientists. One reason is that the default on Twitter is to share your information with the world. Anyone can follow you and you can follow just about anyone. This is in contrast to Facebook, where the default is to share information only with those you authorize to view it. Twitter’s public approach also greatly reduces any ethical concerns associated with data collection, as Twitter status updates are the online equivalent of shouting out your window with a megaphone (while Facebook updates are more analogous to a conversation in the cafeteria).

Twitter also provides much of its data in a format useful for researchers. This system isn’t set up with researchers in mind, but rather for web and application developers who want to build an app that interacts with Twitter. Instead of providing the data as a web page, it is provided as a special sort of text file called a JSON, an acronym for JavaScript Object Notation. Think of it as a more flexible version of a CSV. You can view a sample of a JSON by copying and pasting the following line into your browser window:

http://search.twitter.com/search.json?q=pizza

Your results might look something like this (the sample below is abridged and hand-formatted for readability; the actual tweets and numbers will differ every time you search):
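
{"completed_in": 0.087, "page": 1, "query": "pizza", "results_per_page": 5,
"max_id": 191860196279193601, "max_id_str": "191860196279193601",
"since_id": 0, "since_id_str": "0",
"next_page": "?page=2&max_id=191860196279193601&q=pizza",
"refresh_url": "?since_id=191860196279193601&q=pizza",
"results": [{"created_at": "Mon, 16 Apr 2012 12:06:50 +0000",
"from_user": "Mr_o_soflii", "text": "Drink  my coke ...", "geo": null,
"iso_language_code": "en", ...}, ...]}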

While at first glance this probably looks like an impenetrable stream of text, you can begin to make sense of it if you look for words in quotes. For example, the text following "from_user" is the user name of a person who just tweeted about pizza. In fact, since this search is drawn from the live Twitter stream, this might be the most recent person in the world to tweet about pizza. If you search for "text" you will find the text of all the status updates. But don’t spend too much time trying to decode this file because Python can help!

Returning to the URL that you copied, note that it ends with q=pizza. If you changed that to q=obama, you would get tweets that mention the word “obama”, as used in an earlier tutorial. If you typed in q="french fries", your browser would automatically convert it to q=%22french%20fries%22, which is useful to know in case you want to do a two-word search: the quotation mark became %22 and the space became %20. Everything after the ? is called a query string. If you pay attention to the URL after you enter search terms on many sites, you will often see a pattern of a letter or word, followed by an equals sign, followed by the value that you typed, followed by an ampersand, and then more of the same. In this Twitter example, the q stands for query. As described more fully on the search page of the Twitter developer’s manual, there are several additional ways you can modify your results, which I will discuss as we go through this tutorial.
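
Incidentally, you don’t have to work out these encodings by hand. Python’s urllib module, which we will import below anyway, can produce the encoded version for you:

>>> import urllib
>>> print urllib.quote('"french fries"')
%22french%20fries%22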

While the default is for Twitter to return five status updates, you can increase that to any number (up to 100) by typing:

http://search.twitter.com/search.json?q=pizza&rpp=100

Here, the rpp stands for “results per page”. You can retrieve up to 1,400 more posts, 100 at a time, by paging through the results. For example, to get the second page:

http://search.twitter.com/search.json?q=pizza&rpp=100&page=2

At around the 15th page, this stops working, which is the major limitation of the Twitter API for researchers. The API doesn’t really care about the past. Want to study the tweets from the beginning of the Arab Spring or Occupy? Too late. Twitter won’t provide them. While this makes sense from a business point of view (who cares about the 99.9% of tweets that happened six months ago?), it can be frustrating for researchers. Twitter does have a pretty great way to access a random sample of all tweets as they are happening, and a way to access specific search terms as they happen. So, for ongoing research, the streaming API is a great tool. It’s a little more complicated, so I’ll save discussion of it for another tutorial. I use it to track and graph the relationships among popular hashtags.

In addition to the search and streaming APIs, you can also look up specific Twitter users, and find all the people they follow and the people who follow them. These other APIs could be a great resource for network folks, amongst others.

Returning to our critical analysis of New Yorkers’ attitudes towards pizza, I noted above that the first status update we saw in our quick search “might” be the most recent tweet on the subject. The “might” is because the default for the search is to return a mix of recent and popular results. To get a better sample, that is, one not shaped by Twitter’s algorithm for selecting popular tweets, let’s focus on just the most recent updates:

http://search.twitter.com/search.json?q=pizza&rpp=100&result_type=recent

Finally, we want to limit our search, for now, to New York. Twitter geocodes tweets based on the user’s GPS location if the update is sent from a smartphone or, if that isn’t available, from the user’s own profile. The streaming API allows some flexibility in how you limit by geographic area, but the search API limits you to choosing a point, given by its latitude and longitude, and then specifying a radius around that point. Using this technique, we can capture the broader New York area within a 30-mile radius around Manhattan:

http://search.twitter.com/search.json?q=pizza&rpp=100&result_type=recent&lang=en&geocode=40.76,-73.99,30mi

You might have noticed that I also snuck in a parameter to limit the results to just those in English. If you want, you can also omit the query value pizza to get a list of all the recent tweets coming from the New York area. While you normally can’t do a blank search, you can once you geographically restrict it.

Opening the JSON file in Python is quite similar to opening a CSV file, although extracting the information is quite different. To begin, start Python, and then import the module that handles accessing files from the Internet and the module for reading and writing data in the JSON format.

>>> import urllib
>>> import json

As before, you can store the URL in a string, but instead of downloading the file with urllib.urlretrieve, try accessing it directly from the Internet with urllib.urlopen.

>>> url='http://search.twitter.com/search.json?q=pizza&rpp=5&result_type=recent&geocode=40.76,-73.99,50mi&lang=en'
>>> file=urllib.urlopen(url)

To import the data as a JSON file, type:

>>> search = json.load(file)

You can print search to see the data but, at this stage, it won’t be much more interpretable than it was when you looked at it in the browser. Instead, loop over it like we did earlier with the items in a list:

>>> for item in search:
...     print item
...
next_page
completed_in
max_id_str
since_id_str
refresh_url
results
since_id
results_per_page
query
max_id
page

These are the different categories of data that Twitter produced in response to your search. To access the content of the categories, you write the name of the item in quotation marks (because it is a string) inside brackets. For example:

>>> print search['query']
pizza

In this case, Python is looking up what is stored in the query location inside search. Here, the answer was the string pizza. That was what we were searching for, and Twitter stored it in the query field. Think of search as a type of dictionary: we looked up the entry for query and found the value associated with it. If you’ll remember, with lists we could only access specific positions, and we couldn’t jump immediately to a specific value without knowing where it was located. This dictionary method of storing data can be quite useful, especially for computing the frequency of each word, as we’ll see in the next part.
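
If dictionaries are new to you, here is a minimal illustration (the names menu, slice and pie are made up for this example):

>>> menu = {'slice': 2.50, 'pie': 18.00}
>>> print menu['slice']
2.5

Just as with search['query'], we jumped straight to the value we wanted by its name, rather than by its position.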

You can print out the values associated with each of the items, and perhaps review the developer’s guide if you want to know more about each of them, but for now, you can jump to the good stuff by looking at the contents of search['results']. While query stored what we searched for, results stores all the status updates and the associated data about who sent each one and when. If you printed search['results'], it wouldn’t do you much good, because instead of storing a string, like the other entries in our dictionary, this entry contains a list of entirely new dictionaries, one per status update. Just like lists can hold all different sorts of objects (like strings, integers and lists), so can dictionaries. To see what these entries look like, you can loop over them:

>>> for entry in search['results']:
...     print entry
...

The first few lines of my results look like:

{u'iso_language_code': u'en', u'to_user_id': None, u'to_user_id_str': None, u'profile_image_url_https': u'https://si0.twimg.com/profile_images/2064450537/ja0ncjyY_normal', u'from_user_id_str': u'451705132',
u'text': u'Drink  my coke \nEatt my pizza \nBut wenn u kiss my mom datt wen I qett mad haha (stolen)',
u'from_user_name': u"'Mrwronq'(RudeBoy)", u'profile_image_url': u'http://a0.twimg.com/profile_images/2064450537/ja0ncjyY_normal', u'id': 191860196279193601, u'to_user': None, u'source': u'<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>', u'to_user_name': None, u'location': u'JERSEY CITY NEWJERSEY', u'from_user': u'Mr_o_soflii', u'from_user_id': 451705132, u'metadata': {u'result_type': u'recent'}, u'geo': None, u'created_at': u'Mon, 16 Apr 2012 12:06:50 +0000', u'id_str': u'191860196279193601'}

Scanning this, you might be able to discern a pattern: specific fields of interest, such as “from_user_name” or “text”, appear in quotes, followed by a colon and then a value, such as “'Mrwronq'(RudeBoy)” and “Drink my coke \nEatt my pizza \nBut wenn u kiss my mom datt wen I qett mad haha (stolen).” You might remember that the \n signifies a line break. Also, each text entry is preceded by the letter “u”, which is one of the ways that Python handles strings that might contain characters that aren’t English letters or numbers. We can ignore that for now.
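
To see for yourself what the \n does, try printing a string that contains one:

>>> print u'Eatt my pizza\nBut wenn u kiss my mom'
Eatt my pizza
But wenn u kiss my mom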

The important thing is that each of these rows being printed out acts like a dictionary, and we can loop through each row to extract just the information we need. Since we only care about the text of the tweet, we can print just that by typing:

>>> for entry in search['results']:
...     print entry['text']
...

Based on scanning the contents of search['results'] and the API manual, we know that 'text' contains the words of the tweet, so these two lines instruct Python to loop over each entry in the 'results' portion of our original JSON object and print out whatever is stored in 'text'. If you wanted to print out the time each tweet was posted, you could just swap in 'created_at' for 'text'. It’s important to note that these names were determined by Twitter. In contrast, I made up search and entry because I thought they would be easy to remember, and in a different context, you might use other names for your dictionaries.
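
For example, to print when each tweet was sent alongside who sent it (both field names appear in the raw results above):

>>> for entry in search['results']:
...     print entry['created_at'], entry['from_user']
...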

As always, printing out the entries is just a way to make sure things are going as planned. We want to store them somewhere for use later on. So, like with the Obama tweets, we can put them in a list:

>>> pizza_tweets=[]
>>> for entry in search['results']:
...     pizza_tweets.append(entry['text'])
...
>>> print pizza_tweets
[u'Drink  my coke \nEatt my pizza \nBut wenn u kiss my mom datt wen I qett mad haha (stolen)', u'@mmm_pizza @famfriendsfood thanks for the RT!', u'RT @mattymic914: @OMGitskellbell ahh thank you it was awesome n so was the pizza', u"Pizza & Pottery Thursday, April 26 - We'll Sculpt Crazy Heads...Join Us! #constantcontact http://t.co/jWCvTCSZ", u'Stainless Steel Kenmore 1.1 cu. ft. Countertop Microwave & Pizza Oven: Kenmore Combination CMO with Microwave Ov... http://t.co/zHv0fw0c']

As a fun exercise, you can change the number of tweets you get to 100 by altering the value after rpp and then feeding that data into the sentiment program to see whether people have positive or negative feelings about pizza.
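
The sentiment program itself isn’t reproduced here, but as a rough stand-in, here is a minimal sketch that tallies a few positive and negative words (these word lists are made-up examples, not the ones from the earlier tutorial):

>>> positive_words = ['good', 'great', 'love', 'awesome', 'delicious']
>>> negative_words = ['bad', 'hate', 'gross', 'awful', 'soggy']
>>> positive_count = 0
>>> negative_count = 0
>>> for tweet in pizza_tweets:
...     for word in tweet.lower().split():
...         if word in positive_words:
...             positive_count = positive_count + 1
...         if word in negative_words:
...             negative_count = negative_count + 1
...
>>> print positive_count, negative_count

The counts you get will depend, of course, on whichever tweets you happened to collect.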

Twitter lets you grab the last 1,500 tweets for a particular search, although if your geographic area is small or your search very specific, Twitter will produce fewer results. At a maximum of 100 per page, this means that you can get up to 15 pages of results. We can download all the status updates and put them in a list by looping over the code we’ve developed so far:

>>> pizza_tweets=[]
>>> for page in range(1,16):
...     url='http://search.twitter.com/search.json?q=pizza&rpp=100&result_type=recent&geocode=40.76,-73.99,50mi&lang=en&page='+str(page)
...     file=urllib.urlopen(url)
...     search = json.load(file)
...     for entry in search['results']:
...         pizza_tweets.append(entry['text'])
...

Previously, we looped over a list. But in order to loop over a sequence of numbers, we now use the range function. This takes two values: the number you want to be the first in the loop, and the number one greater than the number you want to be the last in the loop. Confusing, but Python treats the second value as a “less than”, where you or I might expect a “less than or equal to”. Note that on the second line, the url now ends with a page parameter set to the current value of our page counter. If you’ll remember, Python won’t let us combine numbers and strings directly, so we use the str function to convert the page number before attaching it to the url. The rest of the code is identical to what we wrote before. Since we didn’t tell Python to print anything, you’ll know this script worked if it doesn’t give you an error message.
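
If the endpoint rule is hard to remember, you can always ask Python to show you what range produces:

>>> print range(1,16)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]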

If you had a long list of search terms you wanted to gather data on, you could store them in a list, and then loop over them:

>>> search_terms=['pizza','bagel','%22french%20fries%22','yogurt']
>>> pizza_tweets=[]
>>> for search_term in search_terms:
...     for page in range(1,16):
...         url='http://search.twitter.com/search.json?q='+search_term+'&rpp=100&result_type=recent&geocode=40.76,-73.99,50mi&lang=en&page='+str(page)
...         file=urllib.urlopen(url)
...         search = json.load(file)
...         for entry in search['results']:
...             pizza_tweets.append([search_term,entry['text']])
...
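
Again, nothing will print if everything worked. To confirm that the loop actually collected something, you can check how many entries landed in the list (the exact count will vary with what Twitter returns):

>>> print len(pizza_tweets)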

So far, we’ve covered the basics of APIs and reading the data into Python. In the next part, we’ll look at actually computing the word frequencies.
