Two degrees of Tina Fetner, Part 2

This is the continuation of Two degrees of Tina Fetner, Part 1.

In order to create a program that downloads a list of all the followers of Tina Fetner on Twitter, and who follows each of them, we first create a list of all the people directly connected to @fetner, and then loop through that list to find all the people following them. For safekeeping, we’ll store the resulting edge list as a .csv file for later use in Python or any other network or statistical package. There are other ways to store the data, such as pickling it, but storing it as a CSV text file makes it easy to use in both Python and other programs. The basic code for creating an ego network for @fetner is:

>>> import urllib2
>>> import json
>>> fetner_ego=[]
>>> url='https://api.twitter.com/1/followers/ids.json?user_id=4111281'
>>> fetner_followers=urllib2.urlopen(url)
>>> fetner_followers=json.load(fetner_followers)
>>> fetner_followers=fetner_followers['ids']
>>> for follower_id in fetner_followers:
...     print follower_id
...     edge=(4111281,follower_id) #Remember to change this number if you aren't pulling @fetner's network
...     fetner_ego.append(edge)
...     url='https://api.twitter.com/1/followers/ids.json?user_id='+str(follower_id)
...     followers=urllib2.urlopen(url)
...     followers=json.load(followers)
...     for follower_id_id in followers['ids']:
...         edge=(follower_id,follower_id_id)
...         fetner_ego.append(edge)
>>> import csv
>>> writer = csv.writer(open('fetner_2_step.csv', 'wb'))
>>> writer.writerows(fetner_ego)

While this script is syntactically correct, it will crash on you before it finishes. Although it works fine when Twitter is responding the way you expect it to, it will stop at the first hiccup. The two problems this script will definitely encounter are private accounts and rate limiting. When a person has placed privacy protections on their Twitter account, you won’t be able to see their list of followers, and the urllib2.urlopen line will spit out an ugly-looking error. The JSON file is equally intimidating:
{"error":"Not authorized","request":"/1/followers/ids.json?user_id=180667973"}
Fortunately, Python has a useful way to handle commands that you think might not work. You split what you want to write into three parts: what you think might fail, what you want to happen if it does fail, and what you want to happen if it doesn’t fail. For example:
>>> try:
...     followers=urllib2.urlopen(url)
... except:
...     print 'Problem with '+str(follower_id)
... else:
...     print 'Everything fine with '+str(follower_id)
What you think might not work is indented and put after a line with try followed by a colon. What you want to have happen in the event of a failure goes after except:, and all the things that you want to happen when it does work go after else:. except is more powerful than just this purpose, and can be used to print out the exact error message. That way you could distinguish between, for example, a failure because your Internet connection or Twitter was down and a failure because you were getting an error page. You can also put more than one line of code in each of these sections–you aren’t just limited to a statement about whether it worked or not. More generally, try:, except:, and else: are quite useful for handling the multiple possible issues that might arise when you are working with data that may have unexpected attributes, and in other situations where Things Could Go Wrong.
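Here is the same three-part pattern as a standalone sketch you can run without a Twitter connection–the operation that might fail is a dictionary lookup instead of a urlopen call, and the function name describe is just for illustration:

```python
def describe(d, key):
    #The operation that might fail goes in the try block
    try:
        value = d[key]
    except KeyError as e:
        #Runs only if the try block failed; e holds the exact error
        return 'Problem with ' + str(e)
    else:
        #Runs only if the try block succeeded
        return 'Everything fine with ' + str(value)
```

Calling describe({'a': 1}, 'a') takes the else branch, while describe({}, 'b') takes the except branch and shows you exactly which key was missing.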

The second reason your script will fail is that Twitter limits how often you can make these sorts of requests to 150 per hour. Using a more complicated method you can bump this number up to 350, but it is still less than the number of calls you would make if you didn’t tell Python to slow down and take a break. If each request takes a second to download and process, you would be sending Twitter 3,600 requests in an hour. In order to continue getting results from Twitter, we need to slow things down so that we are averaging a page every 24 seconds (60 seconds*60 minutes/150 requests per hour=24). If it took a long time to manipulate the data between calls, we might want to account for that, but since that happens almost instantaneously in this case, we can discount that time and simply instruct Python to rest for 24 seconds after each call to Twitter’s API. Resting in Python is done with the sleep command:

>>> from time import sleep
>>> sleep(24)

This instructs Python to do nothing for 24 seconds. This also means that the script has moved from something that you can watch work to something that you should leave in the background. Additionally, it means that scraping the information on all 21 million of Justin Bieber’s followers would take about two years, so don’t expect to create a massive network overnight. But over a weekend, you could probably get a fairly complete look at the network of active sociologists or any other medium sized Twitter community.
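That back-of-the-envelope arithmetic generalizes to any hourly cap. As a sketch (the names delay_for and paced_fetch are mine, not part of the script above):

```python
from time import sleep

def delay_for(requests_per_hour):
    #Seconds to pause between requests to stay under an hourly cap:
    #3600 seconds / 150 requests = 24 seconds per request
    return 60 * 60 / float(requests_per_hour)

def paced_fetch(ids, fetch, requests_per_hour=150):
    #Apply fetch to each id, pausing between calls so the loop
    #averages no more than requests_per_hour calls to the API
    results = []
    for user_id in ids:
        results.append(fetch(user_id))
        sleep(delay_for(requests_per_hour))
    return results
```

Passing in the fetch function keeps the pacing logic separate from the downloading logic, which makes it easy to reuse the same loop at the authenticated 350-per-hour rate.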

The final script, accounting for the solutions to these two potential problems, is:

>>> import urllib2
>>> import json
>>> from time import sleep

>>> fetner_ego=[]

>>> url='https://api.twitter.com/1/followers/ids.json?user_id=4111281'
>>> fetner_followers=urllib2.urlopen(url)
>>> fetner_followers=json.load(fetner_followers)
>>> fetner_followers=fetner_followers['ids']

>>> for follower_id in fetner_followers:
...     edge=(4111281,follower_id) #note that this hardcoding of the number is probably a bad idea.
...     fetner_ego.append(edge)
...     url='https://api.twitter.com/1/followers/ids.json?user_id='+str(follower_id)
...     sleep(24) #pause before every request so we stay under the rate limit
...     try:
...         followers=urllib2.urlopen(url)
...     except:
...         print 'Problem with '+str(follower_id)
...     else:
...         print 'No problem '+str(follower_id)
...         followers=json.load(followers)
...         if 'ids' in followers:
...             for follower_id_id in followers['ids']:
...                 edge=(follower_id,follower_id_id)
...                 fetner_ego.append(edge)

>>> #Output the results to a CSV file.
>>> import csv
>>> writer = csv.writer(open('fetner_2_step.csv', 'wb'))
>>> writer.writerows(fetner_ego)

When I ran the script, it took about three hours and I ended up with a total of 386 of @fetner’s followers who themselves had publicly accessible followers. Between them, there were a total of 195,407 edges in the network, connecting 139,419 different Twitter users. Below, you can find the script I wrote that computed these numbers.

Besides @fetner, who else did these people follow? Here’s a list of the Twitter users who were being followed most by @fetner’s followers. I list the username, the number of @fetner followers who followed them, and the total number of followers (again, script below):

That’s a who’s who of sociology twitter feeds! @SAGEsociology, publisher of all the ASA journals who posts frequently about new articles, has the most @fetner followers at 199. @socio_log and @RebeccaEHall both have a high proportion of total followers who are also @fetner followers.

Given Twitter’s rate limiting, it would be impractical to collect the followers of each of the 139,419 users, but it wouldn’t be difficult or time consuming to loop over a subset of @fetner’s followers in an attempt to map the network of sociology folk using Twitter. You probably would want to tweak the code further so that you save each JSON file to your hard drive, downloading each one only once in case you ever have to start over. One way to accomplish this is to use try to open the file from your hard drive and put retrieving it from Twitter in the branch that is called only if the file opening fails. You would also want some mechanism for making sure that you only looped over each person once. Presumably, you could store a list of the people whose information you had already downloaded and then only download information if they weren’t in the list (e.g. if user_id not in already_looked_up_list:). Your script would also need to handle the fact that the current version maxes out at 5,000 followers–adding some if statement to check the value of next_cursor and act appropriately. You might also want to seed your starting list with some known sociologists who aren’t in @fetner’s list to make sure you are getting any completely disconnected cliques, or you might want to expand your analysis in the other direction, looking at the people who are being followed, rather than followers. My guess is that these networks will quickly explode, however, as you are much more likely to be following @justinbieber than to have him follow you. With these small tweaks, you would be well on your way to an article in Social Networks, or at least Footnotes.
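The save-each-JSON-file idea can be sketched as follows. The fetch argument stands in for whatever function actually calls Twitter (and sleeps), so the caching logic can be seen, and tested, on its own; the function and directory names here are illustrative, not from the original script:

```python
import json
import os

def get_followers(user_id, fetch, cache_dir='cache'):
    #Return the follower ids for user_id, reading a saved JSON file
    #if one exists and calling fetch(user_id) only on a cache miss
    path = os.path.join(cache_dir, str(user_id) + '.json')
    try:
        f = open(path)
    except IOError:
        #No saved copy yet: download it, then save it for next time
        data = fetch(user_id)
        if not os.path.isdir(cache_dir):
            os.makedirs(cache_dir)
        f = open(path, 'w')
        json.dump(data, f)
        f.close()
    else:
        data = json.load(f)
        f.close()
    return data.get('ids', [])
```

If your script crashes halfway through, rerunning it skips everything already on disk, so only the missing accounts cost you another 24-second wait each.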

Here’s the script for computing the stats mentioned above and list of top followers, minus the >>> to make it easier for copying and pasting:

import csv
import json
import urllib2

#load up the saved csv file
fetner_ego = csv.reader(open('fetner_2_step.csv'))
fetner_ego=[row for row in fetner_ego]

#Total number of edges
print len(fetner_ego)

#Create frequency of followers dictionary
target_dict={}
for edge in fetner_ego:
    if edge[1] in target_dict:
        target_dict[edge[1]]=target_dict[edge[1]]+1
    else:
        target_dict[edge[1]]=1

#Total number of unique followers of followers
print len(target_dict)

#Look up the screen names of the most-followed ids
for id in sorted(target_dict, key=target_dict.get, reverse=True)[0:21]:
    url='https://api.twitter.com/1/users/lookup.json?user_id='+str(id)
    lookup=json.load(urllib2.urlopen(url))
    print lookup[0]['screen_name']+' had '+str(target_dict[id])+' people following'
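For what it’s worth, the frequency-counting loop above can also be written with collections.Counter from the standard library. This sketch uses a toy edge list in place of the saved CSV, and the name top_targets is mine:

```python
from collections import Counter

def top_targets(edges, n=20):
    #Count how many edges point at each target id and return
    #the n most-followed ids as (id, count) pairs
    counts = Counter(target for source, target in edges)
    return counts.most_common(n)
```

most_common handles the sorting for you, which replaces both the if/else dictionary bookkeeping and the sorted(..., key=target_dict.get, reverse=True) line.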

About Neal Caren

