An introduction to text analysis with Python, Part 2

An earlier tutorial looked at the basics of using Python to analyze text data. In this post, we’ll actually compute the proportion of positive words in a tweet, after cleaning up the data a bit.

The other tutorial ended with a couple of stored strings and lists, so if you are starting a fresh Python session, you’ll want to store them again:

>>> tweet='We have some delightful new food in the cafeteria. Awesome!!!'
>>> positive_words=['awesome','good', 'nice', 'super', 'fun', 'delightful']
>>> negative_words=['awful','lame','horrible','bad']
>>> words= tweet.split(' ')

The prior post ended by printing out the positive words that it found in the tweet. We could make the print line a little more informative by adding some text that explains why it is randomly printing out the word, “delightful”.

>>> for word in words:
...     if word in positive_words:
...         print word+' is a positive word'
...
delightful is a positive word

When Python sees a “+”, it attempts to combine the two items. In this case, since both “word” and “is a positive word” are strings, the result is a longer string. This is the same logic that we used earlier to combine the two lists of words to create a longer list. This also works for adding two or more numbers, but you can’t use this strategy to combine a string and a number:

>>> 2+'delightful'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'int' and 'str'

Nicely, Python tells you the line number where there was a problem and gives a semi-informative error message: you were trying to combine an integer and a string, which Python can’t do because it wouldn’t know what data type you would want to store the result in. You could make this work by converting the number to a string:

>>> str(2)+'delightful'
'2delightful'

You aren’t limited to combining just two items; any number of like objects can be put together with the +.
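As a small sketch of that last point (the variable names and values here are made up for illustration), str() lets you mix a number into a longer message, and several strings can be chained with + in a single expression. The snippet is written as a script rather than typed at the prompt, and it calls print with parentheses so that it should behave the same under Python 2 and Python 3.

```python
# Hypothetical values, chosen only for illustration.
word = 'delightful'
count = 2

# str() converts the integer so that + can join everything into one string.
message = 'Found ' + str(count) + ' copies of ' + word
print(message)
```

Leaving out the str() call would raise the same TypeError shown above.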

Preprocessing

You might have noticed that while our loop matched “delightful”, it didn’t find “awesome”. Looking back at the list of words that was printed when we printed every word in our tweet provides a clue as to why. While we have “awesome” in our positive words list, we don’t have “Awesome!!!”, and Python is looking for an exact match. In order to get the two versions to match, we need to make the “A” lower case and remove the exclamation marks. This is called preprocessing or cleaning the data. Shifting everything to lower case and stripping punctuation are among the most common preprocessing tasks in natural language processing. Other common steps are stemming, which attempts to find the root of each word (e.g. “running” and “runs” both get reduced to “run”), and removing little words like “the”, “and”, or “if”, which are known as stop words.
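To make the idea of stop words concrete, here is a minimal sketch of filtering them out of an already cleaned word list. The stop word list is a tiny, made-up one for illustration only (real lists are much longer), and the names stop_words and content_words are not used elsewhere in this tutorial. The append method, which adds an item to the end of a list, is also something we have not covered yet.

```python
# A tiny, hypothetical stop word list; real lists are much longer.
stop_words = ['the', 'and', 'if', 'in', 'we', 'have', 'some']

words = ['we', 'have', 'some', 'delightful', 'new', 'food',
         'in', 'the', 'cafeteria', 'awesome']

# Keep only the words that are not in the stop word list.
content_words = []
for word in words:
    if word not in stop_words:
        content_words.append(word)

print(content_words)
```

Stemming is harder to do by hand and usually relies on an add-on library, so it is not sketched here.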

Since removing capitalization and punctuation involves throwing away potentially meaningful variation, you should proceed with caution. For example, you might think that “Awesome!!!” is different from “awesome”, that “WOW” is different from “wow”, or that “Cool!” is different from “Cool?”. In machine learning (a technique I will discuss in more detail at a later point), this is part of the art of “feature selection”. Social scientists have independent or explanatory variables that they use to explain outcomes in their models, while computer scientists try to find the “features” with the most predictive power. In natural language processing, features can be more than the absence or presence of specific words. Word count, presence of parts of words, sentence complexity, use of the passive voice, presence of emoticons, or any other text attribute that can be expressed as a number can be included as a feature. I’m a fan of starting with just the words to get a baseline model, and then seeing if you can improve on it. And in this case, we don’t have punctuation or non-lower-case words coded in our list of emotional words, so the decision is made for us.
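As a rough illustration of what “expressed as a number” can mean, the sketch below computes three simple features from the raw tweet before any cleaning. The feature names are invented for this example, and nothing here is used in the rest of the tutorial.

```python
tweet = 'We have some delightful new food in the cafeteria. Awesome!!!'

# Each feature is just a number computed from the raw, uncleaned text.
word_count = len(tweet.split(' '))      # number of space-separated words
exclamation_count = tweet.count('!')    # number of exclamation marks
has_question_mark = int('?' in tweet)   # 1 if a "?" appears, otherwise 0

print(word_count)
print(exclamation_count)
print(has_question_mark)
```

Notice that the exclamation count is exactly the kind of information that is lost once the punctuation is stripped.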

However, making strings lower case in Python is simple:

>>> tweet.lower()
'we have some delightful new food in the cafeteria. awesome!!!'

But we can’t do it with our list of words:

>>> words.lower()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'lower'

So we either have to make it lower case when it is a full sentence, or we can do it to each individual word:

>>> for word in words:
...     print word.lower()
...
we
have
some
delightful
new
food
in
the
cafeteria.
awesome!!!
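If you want to keep the lower-case words rather than just print them, one option is to build a new list. The sketch below uses a list comprehension, a compact form of loop that this tutorial has not introduced; the name lower_words is made up for the example.

```python
tweet = 'We have some delightful new food in the cafeteria. Awesome!!!'
words = tweet.split(' ')

# Build a new list holding the lower-case version of every word.
lower_words = [word.lower() for word in words]
print(lower_words)
```

The original words list is left unchanged, so both versions remain available.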

Even after updating our loop, we still don’t find “awesome”:

>>> for word in words:
...     if word.lower() in positive_words:
...         print word.lower()+' is a positive word'
...
delightful is a positive word

This is because we have not removed the exclamation marks. If that were the only punctuation we wanted to remove, we could replace it with nothing:

>>> print tweet.replace('!','')
We have some delightful new food in the cafeteria. Awesome

Python will let you use this technique to create a new string:

>>> tweet_noex=tweet.replace('!','')
>>> print tweet_noex
We have some delightful new food in the cafeteria. Awesome

The replace method takes two arguments. The first is what you are looking for (in this case, the exclamation mark). The second is what you want to replace it with (in this case, nothing, written as an empty string). As always, strings should be in quotation marks.
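Two details of replace are easy to miss, and the short sketch below illustrates both: it returns a new string rather than changing the original, and it accepts an optional third argument that limits how many matches are replaced. The variable names are made up for the example.

```python
tweet = 'We have some delightful new food in the cafeteria. Awesome!!!'

# replace() returns a new string; tweet itself is left untouched.
no_exclamations = tweet.replace('!', '')

# An optional third argument limits how many matches are replaced.
one_removed = tweet.replace('!', '', 1)

print(tweet)
print(no_exclamations)
print(one_removed)
```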

The new string could even have the same name as your old string:

>>> tweet=tweet.replace('!','')
>>> print tweet
We have some delightful new food in the cafeteria. Awesome

We’ve lost the original message, so this isn’t always the best policy. You might want to store your original string away some place for safe keeping, or create a new string name, such as “tweet_processed” that you update with each of your different preprocessing steps.

More than one string operation can be included in the same statement, so we could remove all the punctuation from the Tweet with something like:

>>> tweet_processed=tweet.replace('!','').replace('.','')
>>> print tweet_processed
We have some delightful new food in the cafeteria Awesome

You could even append the “.lower()” operation to this and do all the cleaning in one line, but if you combine different types of operations you might have trouble figuring out what you did when you come back to your code a month later. A more readable option is to do the cleaning one step at a time, like this:

>>> tweet_processed=tweet.replace('!','').replace('.','')
>>> tweet_processed=tweet_processed.lower()
>>> words=tweet_processed.split(' ')
>>> print words
['we', 'have', 'some', 'delightful', 'new', 'food', 'in', 'the', 'cafeteria', 'awesome']

The first line creates a new string tweet_processed that holds our original tweet minus the punctuation. Note that the second line has “tweet_processed” on both sides of the equal sign. If you kept “tweet.lower()” on the right hand side you would just be throwing away the punctuation stripping that you did in the first line.

While removing the period and exclamation mark works for this tweet, it isn’t a very good general solution, because it ignores the 30(!) other punctuation marks that could be used in a sentence. Since we want to develop a script that works more generally, we want a technique that is flexible enough to handle more than periods and exclamation marks.

Importing Modules

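Python comes with a built-in string containing the standard ASCII punctuation characters, so you don’t have to type them all out yourself. (It does not include non-ASCII marks such as curly quotes.) You can access it by typing: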

>>> from string import punctuation
>>> print punctuation
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

Most of Python’s usefulness isn’t available to you when you start up the program. You need to selectively bring modules into memory. In this case, we are accessing the “string” module, which comes with your Python. Other modules are available from the web, and to do anything interesting with natural language processing, you’ll have to download and set some of them up.
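There are two common ways to bring something in from a module, and the sketch below shows both for the “string” module. It is written as a script with print called in parentheses so that it should run under either Python 2 or Python 3.

```python
# Import the whole module and reach into it with a dot...
import string
print(string.punctuation)

# ...or import just the one name you need.
from string import punctuation
print(punctuation)
```

Both lines print the same 32 characters; the second style simply saves you from typing “string.” each time.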

There are faster and more elegant solutions, but a straightforward way to remove the punctuation is to loop through our new punctuation string and replace each instance in our sentence with nothing. Like this:

>>> tweet_processed=tweet.lower()
>>> for p in list(punctuation):
...     tweet_processed=tweet_processed.replace(p,'')
...
>>> print tweet_processed
we have some delightful new food in the cafeteria awesome
>>> words=tweet_processed.split(' ')
>>> for word in words:
...     if word in positive_words:
...         print word + ' is a positive word'
...
delightful is a positive word
awesome is a positive word

It worked! The first line created a new string that contained a lower-case version of the original tweet. The second line began a loop over all the punctuation marks that could potentially be in the sentence. Since the punctuation item that we imported was originally stored as a string, we convert it to a list with list(), which splits a string into its individual characters. (A for loop over a string also visits one character at a time, so the conversion is optional.) The remainder of the script is the same as used above.
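As one example of a more compact alternative, the sketch below keeps only the characters that are not punctuation and joins them back together. It is an illustration rather than the approach used in the rest of this tutorial, and it is written so that it should run under either Python 2 or Python 3.

```python
from string import punctuation

tweet = 'We have some delightful new food in the cafeteria. Awesome!!!'

# Keep every character of the lower-cased tweet that is not punctuation,
# then glue the surviving characters back together.
tweet_processed = ''.join(ch for ch in tweet.lower() if ch not in punctuation)
words = tweet_processed.split(' ')
print(words)
```

This makes a single pass over the tweet instead of one replace call per punctuation mark.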

Putting it all together

The original quantity of interest was the fraction of positive words in the sentence. We already computed the denominator of the fraction when we computed the length of the list words using the “len” function. One straightforward way to compute the numerator is with a counter that starts at zero and increases by one each time the loop finds a positive word.

>>> positive_counter=0
>>> tweet_processed=tweet.lower()
>>> for p in list(punctuation):
...     tweet_processed=tweet_processed.replace(p,'')
...
>>> print tweet_processed
we have some delightful new food in the cafeteria awesome
>>> words=tweet_processed.split(' ')
>>> for word in words:
...     if word in positive_words:
...         print word+ ' is a positive word'
...         positive_counter=positive_counter+1
...
delightful is a positive word
awesome is a positive word
>>> print positive_counter
2
>>> positive_counter/len(words)
0

That worked out well, except at the end. Python’s default for division is to round down to the nearest integer when both numbers involved are integers. (This is the behavior of Python 2, which this tutorial uses; in Python 3 the / operator already returns a decimal result.) While frustrating, it doesn’t actually impact you much because it has an easy fix: importing a different division behavior from the built-in “__future__” module.

>>> from __future__ import division
>>> positive_counter/len(words)
0.2

Note that future has two underscores in front of it and two underscores behind it.
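For reference, here is one way the steps from this tutorial might be collected into a single script. This is a sketch, not the only way to organize it: the function name positive_fraction is made up, functions have not been introduced in this tutorial, and float() is used in place of the __future__ import so that the division gives a decimal result under both Python 2 and Python 3.

```python
from string import punctuation


def positive_fraction(tweet, positive_words):
    """Return the share of words in tweet that appear in positive_words."""
    # Lower-case the tweet and strip punctuation, as in the steps above.
    tweet_processed = tweet.lower()
    for p in punctuation:
        tweet_processed = tweet_processed.replace(p, '')
    words = tweet_processed.split(' ')

    # Count the words that appear in the positive word list.
    positive_counter = 0
    for word in words:
        if word in positive_words:
            positive_counter = positive_counter + 1

    # float() forces a decimal result even under Python 2.
    return float(positive_counter) / len(words)


positive_words = ['awesome', 'good', 'nice', 'super', 'fun', 'delightful']
tweet = 'We have some delightful new food in the cafeteria. Awesome!!!'
print(positive_fraction(tweet, positive_words))
```

For the example tweet this finds 2 positive words out of 10, matching the result above.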

Now that we have the complete code for calculating positive sentiment from a tweet, we just need to build it up so that it is useful for research purposes. This includes computing negative sentiment, looping over more than one tweet, incorporating a more complete dictionary, and exporting the results. We will look at those items in the next tutorial. For now, you can just exit Python by typing:

>>> exit()

About Neal Caren

Sociology