Big Data

Note: I’m slowly transitioning this to a new site. My GitHub page has the most recent information.


I’m publishing a series of tutorials that teach the fundamentals of quantitative text analysis for social scientists. The emphasis is on application. How can you collect and analyze thousands of web pages or Tweets? What are the best practices for turning words into numbers?

The tutorials are designed for people who may be familiar with a standard statistical program, such as Stata or SPSS, or perhaps a qualitative analysis program like NVivo, but who haven’t done any quantitative text analysis or used Python. Python is an open-source programming language that is quite popular among programmers. Computational linguists and computer scientists have developed a large number of add-ons for Python that make it a popular choice for quantitative text analysis.

The tutorials are written in a cumulative fashion, following the flow of a workshop on collecting and analyzing data from the web that I lead at Carolina. Each tutorial assumes you have read through the prior ones. The first set is especially important, as it introduces the basic concepts of using Python. If you want to jump directly to something of particular interest, go right ahead, but if you get lost, you might want to refer back to the introductory posts.

Special thanks to Sarah Gaby for beta testing the posts.

You can subscribe to this page using your RSS reader if you want to find out when new things are posted.

Current Tutorials:
The basics of Python and working with text files to compute something interesting.
An introduction to text analysis with Python, Part 1
An introduction to text analysis with Python, Part 2
An introduction to text analysis with Python, Part 3

Collecting data from the web when they want to give it to you
Pizza, Twitter and APIs, Part 1
Pizza, Twitter and APIs, Part 2

Using Twitter to collect network data
Note: This doesn’t work on the current Twitter API.
Two degrees of Tina Fetner, Part 1
Two degrees of Tina Fetner, Part 2

Using the Google Maps API
Inequality from Space

How the Times writes about men and women
Web scraping 101

Other Python Posts:
Scraping New York Times Comments
Cleaning up LexisNexis Files
A Sociology Citation Network

Other Posts:
Who’s offering domestic partnership benefits?
Fun Demographic Data from Facebook
“Our findings show”: What words to use in an abstract
The 102 most cited articles in sociology
The most cited articles in sociology by journal

Gathering Data
Using the Twitter Streaming API
Using the Facebook API
Downloading Websites
Scraping Websites
Regular Expressions
Cleaning Factiva and LexisNexis Files

Supervised Learning
Basic Classification
Advanced Classification

Unsupervised Learning
K-Means and Other Clustering Algorithms
Topic Models

Fun with Python
Setting up Python
More Python Basics
Defining Functions

Case Studies
