Cleaning up LexisNexis Files

LexisNexis is a great resource for the full text of newspaper articles from the last twenty years. It can be a bit of a hassle for gathering large amounts of data: it isn’t easy to access in an automated fashion; your search needs to return fewer than 3,000 results; you can only download 500 articles at a time; and the resulting file is kind of a mess. While I can’t do anything about the first three issues, I think I have solved the last one.

I’ve written a small Python script that converts a LexisNexis plain text file into a comma-separated values (CSV) file. Each document from the original file becomes a row. Each row has columns for the search result number, publication title, article title, and full text, along with any other metadata that is provided, like the author, section, and length. The order of columns may vary across files depending on which metadata is present (the script only produces a column for a field if it appears in more than 20% of the documents), so make sure to inspect the files. It has occasional glitches when a line inside the article mimics the pattern that usually identifies metadata (all caps followed by a colon), so check that it produces what you want. Finally, really long articles (more than 5,000 words or so) won’t display correctly in Excel even when they are in a correctly formatted CSV file. I think this is because the character count in the text field exceeds Excel’s limit of 32,767 characters per cell.
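
To make the metadata detection and the 20% rule concrete, here is a minimal sketch in Python 3. It is not the script itself. The "N of M DOCUMENTS" separator, the regular expressions, the function names, and the SEARCH_ROW and TEXT column names are all assumptions made for illustration. It also skips the publication and article title columns and simply drops metadata fields that fall below the threshold.

```python
import csv
import re
from collections import Counter

# Assumed patterns for illustration; the real script may use different ones.
DOC_START = re.compile(r"^[ \t]*(\d+) of \d+ DOCUMENTS[ \t]*$", re.MULTILINE)
META_LINE = re.compile(r"^([A-Z][A-Z-]*(?: [A-Z-]+)*): (.+)$")


def split_documents(raw):
    """Return (search_result_number, body) pairs, one per document."""
    parts = DOC_START.split(raw)
    # parts looks like [preamble, number, body, number, body, ...]
    return [(int(parts[i]), parts[i + 1].strip())
            for i in range(1, len(parts), 2)]


def parse_document(body):
    """Separate 'ALL CAPS: value' lines from ordinary text lines."""
    meta, text = {}, []
    for line in body.splitlines():
        line = line.strip()
        match = META_LINE.match(line)
        if match:
            # A text line such as "WASHINGTON: ..." would also land here,
            # which is the kind of glitch described above.
            meta[match.group(1)] = match.group(2).strip()
        elif line:
            text.append(line)
    return meta, " ".join(text)


def to_rows(raw, threshold=0.2):
    """Build a header and rows, keeping fields found in more than
    `threshold` of the documents."""
    docs = [(num, *parse_document(body)) for num, body in split_documents(raw)]
    counts = Counter(key for _, meta, _ in docs for key in meta)
    keep = sorted(k for k, c in counts.items() if c > threshold * len(docs))
    header = ["SEARCH_ROW"] + keep + ["TEXT"]  # hypothetical column names
    rows = [[num] + [meta.get(k, "") for k in keep] + [text]
            for num, meta, text in docs]
    return header, rows


def write_csv(header, rows, path):
    """Write the table with the csv module so quoting is handled for us."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
```

Because the kept columns depend on what shows up in each file, two downloads from the same search can end up with different headers, which is why it is worth inspecting the output.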

The script requires Python to be installed, but it can be run directly from the command line if you want. So if you have a Mac, you don’t even need to know anything about Python; it will just work.

Sample usage from Terminal:

$ python ap_tp_201201.txt
Processing ap_tp_201201.txt
Wrote ap_tp_201201.csv
$ python T*.txt
Processing The_New_York_Times_TP_2012_1.txt
Wrote The_New_York_Times_TP_2012_1.csv
Processing The_New_York_Times_TP_2012_2.txt
Wrote The_New_York_Times_TP_2012_2.csv
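
The wildcard in the second command is expanded by the shell, so the script just receives a list of file names. A rough sketch of the loop that would produce the output above might look like this. The names `output_name` and `convert_all` are hypothetical, and `convert` stands in for whatever function does the actual conversion.

```python
import os


def output_name(path):
    """Swap the input file's extension for .csv."""
    root, _ = os.path.splitext(path)
    return root + ".csv"


def convert_all(paths, convert):
    """Run `convert(source, destination)` on each path, reporting progress."""
    for path in paths:
        print("Processing", path)
        destination = output_name(path)
        convert(path, destination)
        print("Wrote", destination)
```

In a command-line script, `paths` would typically be `sys.argv[1:]`.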

Please let me know if you have problems with it.

About Neal Caren
