An introduction to text analysis with Python, Part 3

Two earlier tutorials looked at the basics of using Python to analyze text data. This post explains how to expand the code written earlier so that you can use it to explore the positive and negative sentiment of any set of texts. Specifically, we’ll look at looping over more than one tweet, incorporating a more complete dictionary, and exporting the results. [If you just want the final Python script, you can download it.]

Earlier, we used a pretty haphazard list of words to measure positive sentiment. While the study in Science used the commercial LIWC dictionary, an alternate sentiment dictionary is produced by Theresa Wilson, Janyce Wiebe, and Paul Hoffmann at the University of Pittsburgh and is freely available. In both cases, the sentiment dictionaries are used in a fairly straightforward way: the more positive words in the text, the higher the text scores on the positive sentiment scale. While this has some drawbacks, the method is quite popular: the LIWC database has over 1,000 cites in Google Scholar, and the Wilson et al. database has more than 600.

Downloading

Since the Wilson et al. list combines negative and positive polarity words in one list, and includes both words and word stems, I’ve cleaned it up a little bit. You can download the positive list and the negative list using your browser, but you don’t have to. Python can do that.

First, you need to import one of the modules that Python uses to communicate with the Internet:

>>> import urllib

Like many commands, this one won’t return anything unless something went wrong. In this case, it should just respond with >>>, which means that the module was successfully brought into memory. Next, store the web address that you want to access in a string. You don’t have to do this, but it’s the type of thing that makes your code easier to read and allows you to scale up quickly when you want to download thousands of URLs.

>>> url='http://www.unc.edu/~ncaren/haphazard/negative.txt'

You can also create a string with the name you want the file to have on your hard drive:

>>> file_name='negative.txt'

To download and save the file:

>>> urllib.urlretrieve(url,file_name)

This will download the file into your current directory. If you want it to go somewhere else, you can put the full path in the file_name string. You didn’t have to enter the url and the file name in the prior lines. Something like the following would have worked exactly the same:

>>> urllib.urlretrieve('http://www.unc.edu/~ncaren/haphazard/negative.txt','negative.txt')

Note that the location and filename are both surrounded by quotation marks because you want Python to use this information literally; they aren’t references to string objects, as in our previous code. This line of code is actually quite readable, and in most circumstances this would be the most efficient thing to do. But there are actually three files that we want to get: the negative list, the positive list, and the list of tweets. And we can download all three using a pretty simple loop:

>>> files=['negative.txt','positive.txt','obama_tweets.txt']
>>> path='http://www.unc.edu/~ncaren/haphazard/'
>>> for file_name in files:
...     urllib.urlretrieve(path+file_name,file_name)
...

The first line creates a new list with three items, the names of the three files to be downloaded. The second line creates a string object that stores the url path that they all share. The third line starts a loop over each of the items in the files list using file_name to reference each item in turn. The fourth line is indented, because it happens once for each item in the list as a result of the loop, and downloads the file. This is the same as the original download line, except the URL is now the combination of two strings, path and file_name. As noted previously, Python can combine strings with a plus sign, so the result from the first pass through the loop will be http://www.unc.edu/~ncaren/haphazard/negative.txt, which is where the file can be found. Note that this takes advantage of the fact that we don’t mind reusing the original file name. If we wanted to change it, or if there were different paths to each of the files, things would get slightly trickier.
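
For instance, if you did want each file saved under a different local name, a minimal sketch of that trickier case could map remote names to local ones with a dictionary (the local names here are made up for illustration):

>>> file_names={'negative.txt':'neg_words.txt','positive.txt':'pos_words.txt'}
>>> for remote_name, local_name in file_names.items():
...     urllib.urlretrieve(path+remote_name,local_name)
...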

More fun with lists

Let’s take a look at the list of Tweets that we just downloaded. First, open the file:

>>> tweets = open("obama_tweets.txt").read()

As you might have guessed, this line is actually doing double duty. It opens the file and reads it into memory before it is stored in tweets. Since the file has one tweet on each line, we can turn it into a list of tweets by splitting it at the end-of-line character. The file was originally created on a Mac, so the end-of-line character is \n (think n for new line). On a Windows computer, the end-of-line character is \r\n (think r for return and n for new line). So if the file had been created on a Windows computer, you might need to strip out the extra character with something like windows_file=windows_file.replace('\r','') before you split the lines, but you don’t need to worry about that here, no matter what operating system you are using. The end-of-line character comes from the computer that made the file, not the computer you are currently using. To split the tweets into a list:

>>> tweets_list = tweets.split('\n')
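
As an aside, Python strings also have a splitlines() method that recognizes both kinds of end-of-line characters, so something like the line below would work no matter where a file came from; for the rest of this tutorial, though, stick with the split('\n') version above.

>>> tweets_list = tweets.splitlines()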

As always, you can check how many items are in the list:

>>> len(tweets_list)
1365

You can print the entire list by typing print tweets_list, but it will scroll by very fast. A more useful way to look at it is to print just some of the items. Since it’s a list, we can loop through the first few items so they each print on their own line.

>>> for tweet in tweets_list[0:5]:
...     print tweet
...
Obama has called the GOP budget social Darwinism. Nice try, but they believe in social creationism.
In his teen years, Obama has been known to use marijuana and cocaine.
IPA Congratulates President Barack Obama for Leadership Regarding JOBS Act: WASHINGTON, Apr 05, 2012 (BUSINESS W... http://t.co/8le3DC8E
RT @Professor_Why: #WhatsRomneyHiding - his connection to supporters of Critical Race Theory.... Oh wait, that was Obama, not Romney...
RT @wardollarshome: Obama has approved more targeted assassinations than any modern US prez; READ & RT: http://t.co/bfC4gbBW

Note the new [0:5] after the tweets_list but before the : that begins the loop. The first number tells Python where to make the first cut in the list. The potentially counterintuitive part is that this number doesn’t reference an actual item in the list, but rather a position between each item in the list–think about where the comma goes when lists are created or printed. Adding to the confusion, the position at the start of the list is 0. So, in this case, we are telling Python we want to slice our list starting at the beginning and continuing until the fifth comma, which is after the fifth item in the list.

So, if you wanted to just print the second item in the list, you could type:

>>> print tweets_list[1:2]
['Obama has called the GOP budget social Darwinism. Nice try, but they believe in social creationism.']

This slices the list from the first comma to the second comma, so the result is the second item in the list. Unless you have a computer science background, this may be confusing as it’s not the common way to think of items in lists.

As a shorthand, you can leave out the first number in the pair if you want to start at the very beginning, or leave out the last number if you want to go until the end. So, if you want to print out the first five tweets, you could just type print tweets_list[:5]. There are several other shortcuts along these lines; we will cover some of them in other tutorials.
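
For example, both of these shorthand forms work on the list we just built (the first prints the same five tweets as above, the second prints everything from item 1,360 to the end, so the output is omitted here):

>>> print tweets_list[:5]
>>> print tweets_list[1360:]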

Now that we have our expanded list of tweets, let’s load up the positive sentiment list and print out the first few entries:

>>> pos_sent = open("positive.txt").read()
>>> positive_words=pos_sent.split('\n')
>>> print positive_words[:10]
['abidance', 'abidance', 'abilities', 'ability', 'able', 'above', 'above-average', 'abundant', 'abundance', 'acceptance']

Like the tweet list, this file contained each entry on its own line, so it loads exactly the same way. If you typed len(positive_words) you would find out that this list has 2,230 entries.
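
You can confirm this yourself:

>>> len(positive_words)
2230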

Preprocessing

In the earlier post, we explored how to preprocess the tweets: remove the punctuation, convert to lower case, and examine whether or not each word was in the positive sentiment list. We can use this exact same code here with our long list. The one alteration is that instead of having just one tweet, we now have a list of 1,365 tweets, so we have to loop over that list.

>>> for tweet in tweets_list:
...     positive_counter=0
...     tweet_processed=tweet.lower()
...     for p in list(punctuation):
...         tweet_processed=tweet_processed.replace(p,'')
...     words=tweet_processed.split(' ')
...     for word in words:
...         if word in positive_words:
...             print word
...             positive_counter=positive_counter+1
...     print positive_counter/len(words)
...

If you saw a stream of words and numbers roll past, it worked! To review: we start by looping over each item of the list. We set up a counter to hold the running total of the number of positive words found in the tweet. Then we make everything lower case and store it in tweet_processed. To strip out the punctuation, we loop over every punctuation mark, swapping each one out for nothing. If you haven’t already typed from string import punctuation in your current Python session, that loop will produce an error, so make sure to include the import. The cleaned tweet is then converted to a list of words, split at the white spaces. Finally, we loop through each word in the tweet, and if the word is in our new and expanded list of positive words, we print it and increase the counter by one. After cycling through each of the tweet’s words, the proportion of positive words is computed and printed. If you just get zeros for the proportions, you might need to type from __future__ import division again, so that the result isn’t rounded down to zero.
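
If you are starting from a fresh Python session, the two imports this loop relies on (both covered in the earlier posts) are:

>>> from __future__ import division
>>> from string import punctuation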

The major problem with this script is that it is currently useless: it prints the positive sentiment results, but then doesn’t do anything with them. A more practical solution would be to store the results somehow. In a standard statistical package, we would generate a new variable that held our results. We can do something similar here by storing the results in a new list. Before we start the tweet loop, we add the line:

>>> positive_counts=[]

Then, instead of printing the proportion, we can append it to the list:

>>> positive_counts.append(positive_counter/len(words))
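
Putting these two changes together, the revised loop looks like this (the same code as above, with the print statements swapped out for the append):

>>> positive_counts=[]
>>> for tweet in tweets_list:
...     positive_counter=0
...     tweet_processed=tweet.lower()
...     for p in list(punctuation):
...         tweet_processed=tweet_processed.replace(p,'')
...     words=tweet_processed.split(' ')
...     for word in words:
...         if word in positive_words:
...             positive_counter=positive_counter+1
...     positive_counts.append(positive_counter/len(words))
...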

The next time we run through the loop, it won’t produce any output, but it will build up a list of the proportions. This still isn’t that useful on its own. You can use Python to do most of your statistical analysis and plotting, but at this point you are probably ready to get your data out of Python and back into your statistical package.

The most convenient way to store data for use in multiple packages is as a plain text file where each case is its own row and variables are separated by commas. This file type commonly has a “csv” extension, and Python can read and write these files quite easily.

First, import the csv module:

>>> import csv

To write to csv file, you first “open” the file with the csv writer:

>>> writer = csv.writer(open('tweet_sentiment.csv', 'wb'))

In the 'open' part of the command, the first item is the name of the file you want to create, and the 'wb' tells Python that this is a file you want to write to. Be careful with your file name, because if there is already a file with this name, Python will write over it. If you wanted to read a csv file, you would just swap reader for writer and 'rb' for 'wb', which creates a nice symmetry.
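
For example, reading the finished file back in later would look something like this (just an illustration; you don’t need it for this tutorial):

>>> reader = csv.reader(open('tweet_sentiment.csv', 'rb'))
>>> for row in reader:
...     print row
...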

Sending your list of positive sentiment values to the file requires just one more line. Since the csv writer expects each row to be a sequence of values, each proportion is wrapped in its own one-item list so that it gets its own row:

>>> writer.writerows([[value] for value in positive_counts])

You can now import this file into your statistical software package or just take a peek at it in Excel. Of course, having just one variable is not the most useful thing. Usually, you will have more than one that you want to export, but for now we just have the one. At a minimum, you might also want to export the text of the original tweets. To combine more than one list together, you can zip them into one list. This is different from appending one list to the other, which would just make the one list twice as long.

>>> output=zip(tweets_list,positive_counts)

In this case, zip creates a new list output that is the same length as our tweets_list, but each entry has two items: the tweet and the positive count. You can use zip to combine as many lists as you like, although they all need to be the same length. Technically, each item in the new list is a tuple, an ordered collection of elements that works much like a list but can’t be changed once it is created, which makes it generally less useful for textual analysis.
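
You can check the pairing yourself by looking at the first entry of output (the printed results are omitted here):

>>> print output[0][0]    # the text of the first tweet
>>> print output[0][1]    # the proportion of positive words in it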

To write our final version of the output, we need to repeat the line that created our writer and then write the output list:

>>> writer = csv.writer(open('tweet_sentiment.csv', 'wb'))
>>> writer.writerows(output)

That’s it. If you searched every day for tweets mentioning President Obama and ran this script, my guess is that your data would tell a pretty interesting story about trends over time. Or, if you had your own text data arranged so that each text was on its own line, you could just update the file name and compute the sentiment scores.
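
For instance, if your file were named my_texts.txt (a made-up name), the only lines that change are the ones that read the data in:

>>> tweets = open("my_texts.txt").read()
>>> tweets_list = tweets.split('\n')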

In case you were wondering, the two most negative tweets were “Hatch Makes Startling Accusation Against Obama http://t.co/HVQfUzgr ..shocking headline…NOT” and “We need to tag Obama & define him for Nov battle. #Obama #failedleader #incompetent #wasteful #divisive #desperate #flexible #arrogant #lazy”, which gives our little study some face validity.

You have probably noticed that our code for this project has swelled to about 40 lines. Not horrible, but not that easy to copy and paste. And if you mess up in a loop, you have to start all over again. While typing in commands this way is useful for playing around with new code and commands, most of the time it’s not the most efficient way to do things. Just as Stata has .do files, you can save a series of Python commands as a text file and then run them all together. These Python files use a .py extension.

I’ve compiled all the code for our sentiment analysis into one file, and you can download it as sentiment.py using your browser. At this point, you might want to make a directory for yourself where you can store all your Python files.

You can quit Python by typing exit(), which should bring you back to your operating system’s prompt. Now, assuming you are in the directory where you downloaded sentiment.py, you can run the entire program by typing:

$ python sentiment.py

Remember the $ sign means that we are out of Python. This command tells your computer that you want Python to run the program sentiment.py. If all works according to plan, your computer should think for a couple of seconds and then display the operating system prompt. Python displays fewer things when run this way: only things with a print statement in front of them are displayed, so don’t expect your output to be as verbose as when you typed in each command. In fact, you will probably want to add some print statements along the way so that you know everything is working.
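
For example, a line like the following at the very end of sentiment.py (an optional addition, not part of the downloaded script) would confirm that everything ran:

print 'Scored', len(positive_counts), 'tweets; results written to tweet_sentiment.csv'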

Assuming you didn’t get an error message, there should be a new file called tweet_sentiment.csv in your current directory. You can confirm this by typing ls -l on a Mac or dir in Windows. This should display the contents of the current directory, and you should see tweet_sentiment.csv listed along with the current time–which means that the file was just created. Perfect.

There are easier ways to run your .py files which I’ll discuss at a later point, and ways to improve the script, such as adding comments as notes to ourselves, speeding it up, and allowing different types of input files. But if you made it this far, you can proudly call yourself a “beginning Python programmer.” Congratulations.
