An introduction to text analysis with Python, Part 1

Note: This is the first in a series of tutorials designed to provide social scientists with the skills to collect and analyze text data using the Python programming language. The tutorials assume no prior knowledge of Python or text analysis.

In September of 2011, Science magazine printed an article by Cornell sociologists Scott Golder and Michael Macy that examined how trends in positive and negative attitudes varied over the day and the week. To do this, they collected 500 million Tweets produced by more than two million people. They found fascinating daily and weekly trends in attitudes. It’s a great example of the sort of interesting things social scientists can do with online social network data. More generally, the growth of what computer scientists call “big data” presents social scientists with unique opportunities for researching old questions, along with empowering us to ask new questions. While some of this big data is only numbers, much of it also consists of text. Sociologists have long had tools to assist us in coding and analyzing dozens or even hundreds of text documents, but many of these tools are less useful when the number of documents is in the tens of thousands or millions. Every sociology professor, graduate student and undergraduate in the United States working together couldn’t code even the 1% daily sample of Tweets that Twitter provides free to researchers. Luckily, computer scientists have been working for quite a while on exactly this data problem–how do we collect, categorize and understand massive text databases.

It turns out that while the volume of data in a study such as Golder and Macy’s is intimidating, doing a project of this sort isn’t that complicated for the typical social scientist. The major challenges are (1) collecting and managing the data, (2) turning the text into numbers of some sort, and (3) analyzing the numbers. The third step involves techniques familiar to many quantitative researchers. Based on their supplementary file, it appears Golder and Macy used Stata to analyze the data.

Getting the Twitter data isn’t that difficult, although it does involve dealing with the Twitter Application Programming Interface, or API, a task most social scientists have not been trained to do. If you’re wondering, Facebook also has an API, and you can use what are called “web scraping” techniques to gather data from blogs and other websites too. I’ll discuss these topics in other tutorials.

In this post and the next, I’ll walk through the basics of a popular way to convert text into meaningful numbers, using the same analytic strategy that Golder and Macy used. While you can do this sort of analysis using one of several different programs or languages, one commonly used for this sort of quantitative text analysis is Python. It is free, used by millions (so there are lots of resources available), and relatively straightforward to learn. If you have a Mac, it’s already on your computer! There are no pull down menus in Python, though, so learning by fumbling around isn’t the best option. That’s what led me to write this series of tutorials.

This initial tutorial is aimed at social scientists who may be familiar with some statistical package like SPSS, Stata or SAS, but haven’t used Python. It walks through the basics of one type of text analysis using some sample text data, but swapping in your own data once you’ve got this up and running isn’t much harder.

Python

For the purposes of this walkthrough, I’m going to assume you are using a Mac. If you aren’t (and even if you are), a great place to get Python is through the Enthought Python Distribution, which is free for those in academia. In another post, I’ll talk about other ways to get Python up and running. Or, you can just Google it.

[As a side note, if you know what you are doing in Python, some of the code here might make you scratch your head. This could be because I’ve only been using Python for a short time. It could also be that sometimes I’m doing things the long way to demonstrate some Python fundamentals. I’m also trying to code using the fewest number of new commands and options to keep things simple. Finally, while trying to demonstrate a pythonic style of coding, I’m also trying to make it familiar to those who are used to analyzing data with Stata or SPSS. But, if I include any mistakes, please leave a comment or email me. And if you just want the code for this sentiment analysis, feel free to download it.]

There are two ways to run programs in Python: by running a script you wrote, or through the command line. We’ll start with the command line. To begin, use Spotlight (the magnifying glass in the upper right hand corner of your Mac) to find “Terminal”. Double click on the “Terminal” entry, and a potentially intimidating little screen will greet you. Type python:

mid-campus-02203:~ nealcaren$ python

And your computer should return something like:

Python 2.7.1 (r271:86882M, Nov 30 2010, 10:35:34)
[GCC 4.2.1 (Apple Inc. build 5664)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

The “$” is where you type commands in Terminal. The words before your dollar sign will be different from mine, depending on your current directory and other factors. On a Windows machine, you are likely to see something like C:\Users\Neal Caren> instead. After you type python, a few lines describing your version of Python will appear, followed by the Python prompt, “>>>”. As you can see above, I’m running Python 2.7, which is the current version as of this post. You might have Python 2.6, which is fine. You probably don’t have Python 3.2, which is different and should be avoided for compatibility reasons. But no matter which operating system or version of Python you have, “>>>” is the Python prompt.

Before we go any further, you might want to know how to get out of Python. Just type exit, followed by an open and close parenthesis:

>>> exit()

This will bring you back to your operating system prompt. On a Windows machine, you can then type exit (without the parentheses) at that prompt, and the command line window will go away. On a Mac, you can quit Terminal like you would any other application, with either command-Q or by selecting “Terminal-Quit Terminal” with your mouse.

Strings

Pretend that the first tweet you wanted to analyze was, “We have some delightful new food in the cafeteria. Awesome!!!” You might have a million more of these that you want to analyze, and we’ll get there eventually, but my coding style is to start simple and then slowly add complexity. In this case, a simple way to start is with one tweet. To tell Python about your tweet, type:

>>> tweet='We have some delightful new food in the cafeteria. Awesome!!!'

There are three things to note. First, don’t type or copy the “>>>”. That is just to show you where you should type this command in Python, and that it shouldn’t be indented. Second, there is a space between the “>>>” and the word “tweet”. Don’t type or copy that space. It’s just there. Python is very particular about spacing at the beginning of lines. So start typing or copying at the word “tweet”. Third, the text is surrounded by a single quote (i.e. ') on each side. You can also use double quotes (i.e. ") or even triple single quotes (i.e. '''), but single quotes are the default Python style for entering a string.

To make sure that you typed the tweet correctly, you can type:

>>> tweet

Python will respond with:

'We have some delightful new food in the cafeteria. Awesome!!!'

You can get almost the same response using a print statement:

>>> print tweet
We have some delightful new food in the cafeteria. Awesome!!!

The only difference is that the first response was wrapped in single quotes and the second wasn’t. As a side note, the single quotes weren’t because you put them there. If you used double quotes, you would get the same thing:

>>> tweet="We have some delightful new food in the cafeteria. Awesome!!!"
>>> tweet
'We have some delightful new food in the cafeteria. Awesome!!!'
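
As a quick check of the point about quotation marks, the sketch below builds the same tweet with three quoting styles and confirms that Python treats them as identical strings. The names tweet_single, tweet_double, tweet_triple and apostrophe_tweet are made up for this illustration. print is written with parentheses here so the lines should behave the same under Python 2.7 and later versions.

```python
# Three ways of quoting the same text (variable names are illustrative only).
tweet_single = 'We have some delightful new food in the cafeteria. Awesome!!!'
tweet_double = "We have some delightful new food in the cafeteria. Awesome!!!"
tweet_triple = '''We have some delightful new food in the cafeteria. Awesome!!!'''

# The quote marks are not part of the string, so all three are equal.
print(tweet_single == tweet_double)   # True
print(tweet_double == tweet_triple)   # True

# Double quotes are handy when the text itself contains an apostrophe.
apostrophe_tweet = "It's awesome!"
print(apostrophe_tweet)
```

If you copy this, leave out nothing but the comments; the spacing at the start of each line matters.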

Lists

Now, following Golder and Macy, we need to decide if this is a positive or negative opinion. If we had a large sample of the Tweets already coded by sentiment, we could try to figure out which words appeared more often in Tweets we considered positive, and which words appeared more often in Tweets we considered negative. In sociology, we might think about this in a regression framework. We want to predict whether the sentence is positive, negative, or neither, and we could use the presence or absence of words as predictors. In computer science, this would be considered a supervised learning classification problem. But we don’t have a sample precoded, so let’s save classification for another day.

One straightforward way to approach the problem is to count the proportion of words that usually have a positive connotation and the proportion of words that have a negative connotation. This is a common analytic strategy in many fields, especially psychology. Golder and Macy’s Twitter study used the lists of positive and negative words that are part of the Linguistic Inquiry and Word Count (LIWC) project. This data is only available commercially, so I won’t include it in this tutorial. There’s a similar dictionary that’s freely available, but we won’t use that just yet.

For now, you can just make your own list of positive words. We’ll swap in the official list before we are done. Off the top of my head, “awesome”, “good”, “nice”, “super”, and “fun” are words that I use when I’m trying to be positive. To put this list into Python:

>>> positive_words=['awesome','good','nice','super','fun']

“positive_words” is now the name of our list. There are only a few restrictions on what you can name your list (e.g., it can’t start with a number, or have spaces). To tell Python that we are creating a list, you put everything in brackets. Since the items in the list are strings, each goes in single quotes. The list form in Python is roughly analogous to a variable in statistical programs.

If you wanted to add an item to your list, you append the list:

>>> positive_words.append('delightful')

In this case, you start with the list name, followed by “.append”, and then in parentheses write the item that you want to add to your list. If you are adding a string, you put it in quotes. Otherwise, Python will think you are referring to something you have already named, and will complain if it can’t find it:

>>> positive_words.append(like)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'like' is not defined
>>> newword='like'
>>> positive_words.append(newword)

After I got the error message, I created a new string called “newword” which contained the word that I wanted to append. I then added the word “like” to our list by using the string’s name, “newword”. This is a pretty inefficient way to do things in this case, but useful in many other situations.

Lists are much more flexible than represented above. Items can be longer than a single word (e.g. 'Super fun'); strings and numbers can be in the same list (e.g. [3, 'swell']); and you can even put additional lists inside your list.
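
To make that flexibility concrete, here is a small sketch of a single list holding a two-word string, a number, a one-word string, and another list. The name mixed_list is invented for this example, and len (which counts the items in a list) is used a little ahead of where the tutorial introduces it.

```python
# A list can mix strings, numbers, and even other lists.
# (mixed_list is a made-up name for illustration.)
mixed_list = ['Super fun', 3, 'swell', ['good', 'nice']]

print(mixed_list)

# The inner list counts as a single item, so this list has four items.
print(len(mixed_list))   # 4
```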

If we want to see what was in the list we created above, we can print it:

>>> print positive_words
['awesome', 'good', 'nice', 'super', 'fun', 'delightful', 'like']

The brackets remind you that this is a list, and the items in the list are separated by commas.

Now create a list of negative words:

>>> negative_words=['awful','lame','horrible','bad']
>>> print negative_words
['awful', 'lame', 'horrible', 'bad']

If you wanted to measure whether or not any emotion was expressed, you might create one list that combines all the positive and negative words. Rather than retyping them, you can just combine the lists with a plus sign:

>>> emotional_words=negative_words+positive_words
>>> print emotional_words
['awful', 'lame', 'horrible', 'bad', 'awesome', 'good', 'nice', 'super', 'fun', 'delightful', 'like']
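
One detail worth seeing in action: the plus sign builds a brand-new list and leaves the two originals alone, while append changes the list it is attached to. The sketch below starts from fresh copies of the lists so that it runs on its own; print is written with parentheses so it should work in Python 2.7 and in later versions.

```python
positive_words = ['awesome', 'good', 'nice', 'super', 'fun']
negative_words = ['awful', 'lame', 'horrible', 'bad']

# "+" creates a new list; negative_words and positive_words are unchanged.
emotional_words = negative_words + positive_words
print(len(emotional_words))   # 9
print(len(negative_words))    # 4

# append changes positive_words itself, but not the list we built earlier.
positive_words.append('delightful')
print(len(positive_words))    # 6
print(len(emotional_words))   # 9
```

So if you add words to one of the original lists later, you would need to combine the lists again to pick up the change.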

From Strings to Lists

Later on, we’ll create a better list of positive and negative words, but for now let’s return to the original Tweet. The default strategy for this sort of analysis is to examine each word in the sentence on its own, regardless of word ordering. This is called a “bag of words” model. It has some obvious drawbacks (e.g. “This was not fun.” will show up as positive because of the presence of the word “fun” unless you somehow model its negation), but, with a few tweaks, these models can be about as good at classification as an undergraduate RA.

Since our unit of analysis is the word and not the sentence, we want to split our sentence into words. We can do that by using the split option:

>>> words= tweet.split(' ')
>>> print words
['We', 'have', 'some', 'delightful', 'new', 'food', 'in', 'the', 'cafeteria.', 'Awesome!!!']

Here, we’ve split our string “tweet”, making a cut every time there was a space. This new object is stored as “words”. As you can see from the results of the print command, the new object is displayed in brackets, so Python has created “words” as a list. To see how many words are in the sentence, you use the len command (short for length), which returns the number of items in a list.

>>> len(words)
10

This only works because we’ve split the sentence into a list of words. If we ask for the length of the original tweet, we get something less useful:

>>> len(tweet)
61

Python doesn’t know that you only care about words, so it defaulted to counting the number of characters.
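
The two counts can be put side by side, along with one quirk of cutting on a single space. In the sketch below, messy_tweet is a hypothetical tweet with two spaces in a row; splitting on ' ' leaves an empty string where the extra space was, while calling split with nothing in the parentheses cuts on any run of whitespace.

```python
tweet = 'We have some delightful new food in the cafeteria. Awesome!!!'

words = tweet.split(' ')
print(len(words))   # 10 (words)
print(len(tweet))   # 61 (characters, counting spaces and punctuation)

# A made-up tweet with two spaces after "Awesome".
messy_tweet = 'Awesome  food'
print(messy_tweet.split(' '))   # ['Awesome', '', 'food']
print(messy_tweet.split())      # ['Awesome', 'food']
```

For real Tweets, where spacing is unpredictable, the version without an argument is probably the safer choice, although the rest of this post sticks with split(' ').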

Loops

Our first goal is to go through our list of words and see if any of them show up in our list of positive words. For starters, we can loop over each of the words in our sentence with a “for” loop:

>>> for word in words:
...     print word
...
We
have
some
delightful
new
food
in
the
cafeteria.
Awesome!!!

The “for” tells Python that we are going to cycle through each element of a list. “word” is the name that I just made up that will hold each of the words. “in words” tells Python which list we want to iterate through, and the colon ends a line that declares a loop. Note that the second line begins with “...” instead of “>>>“. Python did that to remind you that you are inside a loop. The new line is also indented. I used a tab; others put four spaces. If you don’t indent, Python will report an error:

>>> for word in words:
... print words
  File "<stdin>", line 2
    print words
        ^
IndentationError: expected an indented block

Note also the ... on the third line of the successful loop. When typing in the Python command line you signal that you are done with a loop by hitting return without typing anything.
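
For comparison, here is roughly how the same loop might look when saved in a script rather than typed at the prompt. There is no “>>>” or “...” in a script, and the loop ends as soon as a line goes back to the left margin. The file name loop_example.py and the list word_lengths are invented for this sketch, and print uses parentheses so the script should run under Python 2.7 or later versions.

```python
# Contents of a hypothetical script file, e.g. loop_example.py
tweet = 'We have some delightful new food in the cafeteria. Awesome!!!'
words = tweet.split(' ')

word_lengths = []
for word in words:
    # Every indented line below "for" runs once per word.
    print(word)
    word_lengths.append(len(word))

# Back at the left margin: this line runs once, after the loop has finished.
print(word_lengths)
```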

Conditionals

While this loop prints out each word (when the second line is appropriately tabbed), what we actually want to do is see if that word is in the list of positive (or negative) words.

>>> for word in words:
...     if word in positive_words:
...         print word
...
delightful

Here, we include a conditional: we only move to the “print word” statement if the value of “word” is in our list of positive words. The first time the loop cycles through, the value of “word” is “We”, which is not in the list, so the loop skips the “print word” line. The “if” line also ends in a colon, and the lines that should only run when the condition is met are doubly indented: once as a result of the “for” and once because of the “if”.
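
The same loop can count matches instead of printing them, which gets us to the proportion of positive words mentioned earlier. This is only a sketch of one way to do it, not necessarily how Part 2 proceeds; the names positive_count and positive_share are my own. It starts from fresh copies of the tweet and the word list so it runs by itself.

```python
tweet = 'We have some delightful new food in the cafeteria. Awesome!!!'
positive_words = ['awesome', 'good', 'nice', 'super', 'fun', 'delightful', 'like']

words = tweet.split(' ')

# Start a counter at zero and add one each time a word is in the list.
positive_count = 0
for word in words:
    if word in positive_words:
        positive_count = positive_count + 1

# float() keeps Python 2 from rounding the division down to zero.
positive_share = float(positive_count) / len(words)

print(positive_count)   # 1
print(positive_share)   # 0.1
```

Only “delightful” is counted, so one word in ten is scored as positive.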

Not bad work, although we’ll have to track down why “awesome” didn’t show up in the output, even though it is in our list and in the tweet. You’ll want to continue on to Part 2 of the Tutorial. It would be part of this post, but the blog software seems to think I’ve written too much already and won’t display any more lines.

About Neal Caren

Sociology