One of the fun and simple things to do with large text databases that are already categorized into groups is to see which words are used more frequently in certain kinds of texts than in others. For example, Google Books’ Ngram Viewer lets you track how different words were used in different years across the twentieth century. Similarly, the Sunlight Foundation’s Capitol Words project lets you search through the Congressional Record for specific terms and graph word frequencies by date and party.
Sociologists are well aware that language use and status are highly correlated, so I thought it would be fun to investigate how the words used in high status journals differ from the words used in low status journals. The full text of articles isn’t readily available in an easy-to-read format, but it is simple and quick to acquire the text of abstracts from Web of Science. It allows search terms that return a huge number of results, and you can download them as text files in batches of 500. From Web of Science, I collected all the information available on articles appearing in thirty-seven general interest sociology journals over the last five years, including the abstract. In the sample were four journals that are generally considered to be high status publications: American Journal of Sociology, American Sociological Review, Social Forces and Social Problems. I also included abstracts from three lower status, general interest journals. It should be noted that the lower status journals are still quite selective, and that a better comparison pool would probably be the abstracts of articles submitted to the ASA conference, but that data is not readily available.
After cleaning the data (e.g. combining some common word phrases, removing punctuation, making everything lower case), I computed the proportion of abstracts that included each word for both the high and low status sets of abstracts. For example, “process” was included in 11% of the low-status abstracts and 7% of the high-status abstracts.
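A minimal Python sketch of that step, assuming the abstracts are already loaded as plain strings. The function names and the toy abstracts are made up for illustration, the regular expression is only a crude stand-in for the cleaning described above, and the phrase-combining step is left out.

```python
import re
from collections import Counter


def tokenize(abstract):
    """Lowercase the text, drop punctuation, and return the distinct words.

    Returning a set means a word counts once per abstract,
    no matter how often it is repeated.
    """
    return set(re.findall(r"[a-z]+", abstract.lower()))


def document_proportions(abstracts):
    """Map each word to the share of abstracts that contain it at least once."""
    counts = Counter()
    for abstract in abstracts:
        counts.update(tokenize(abstract))
    total = len(abstracts)
    return {word: n / total for word, n in counts.items()}


# Toy abstracts, invented for this example only.
toy_abstracts = [
    "The process works.",
    "We find effects.",
    "A process, again: process!",
    "Nothing relevant here.",
]
toy_proportions = document_proportions(toy_abstracts)
print(toy_proportions["process"])  # share of toy abstracts mentioning "process"
```

Running `document_proportions` once on the high status abstracts and once on the low status abstracts gives the two sets of proportions that everything else is built on.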
Table 1. Abstract words sorted by the likelihood of appearing in a high status journal abstract compared to a low status journal abstract. Words must have appeared in at least 10% of high status abstracts and have a ratio of 1.2 or higher.
Table 1 lists words that frequently appeared in high status abstracts (appearing in at least 10%) and that appeared in high status abstracts at least 20% more often than in low status abstracts. The list is a combination of words that are associated with the scientific process (e.g. “outcomes”), topic areas (e.g. “inequality”) and frequently used words (e.g. “at”).
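The filter behind both tables can be sketched in a few lines of Python. This is an illustration under assumptions, not the original script: `distinctive_words` is a hypothetical name, words that never appear in the comparison group are skipped because their ratio is undefined, and apart from the “process” figures quoted earlier the proportions below are invented.

```python
def distinctive_words(target, reference, min_share=0.10, min_ratio=1.2):
    """Return (word, ratio) pairs sorted from the highest ratio down.

    target and reference map each word to the share of abstracts
    containing it. A word is kept when it appears in at least
    min_share of the target abstracts and its share there is at
    least min_ratio times its share in the reference abstracts.
    """
    rows = []
    for word, share in target.items():
        ref_share = reference.get(word, 0.0)
        # Assumption: skip words absent from the reference group,
        # since dividing by zero gives no usable ratio.
        if share < min_share or ref_share == 0.0:
            continue
        ratio = share / ref_share
        if ratio >= min_ratio:
            rows.append((word, ratio))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# "process" uses the 7% and 11% figures from the text; the rest is made up.
high = {"process": 0.07, "we": 0.30, "at": 0.12}
low = {"process": 0.11, "we": 0.20, "at": 0.11}

high_status_table = distinctive_words(high, low)  # the logic behind Table 1
low_status_table = distinctive_words(low, high)   # same logic, groups swapped
print(high_status_table)
print(low_status_table)
```

Swapping the two arguments is all it takes to go from the high status list to the low status one.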
As a measure of what is valued in high status sociology, I think the list has a great deal of face validity. We value developing and testing theory through the scientific process, and words like outcomes, new, model, find, process, effect and findings reflect these values. High status work is also a collective process: “we” or the “authors” are doing the finding.
Equally important are the words that you might want to avoid. Table 2 lists the words that occurred relatively more often in low status abstracts.
Table 2. Abstract words sorted by likelihood of appearing in a low status journal abstract compared to a high status journal abstract. Words must have appeared in at least 10% of low status abstracts and have a ratio of 1.2 or higher.
|Word||Ratio||Type 1||Type 2|
Interestingly, the key word to avoid is “sociological”. Don’t tell people your work is sociological; show them. “Has” and “been” both show up, so the passive voice is also to be avoided. Race and gender make the list, but I suspect this is less a topical bias in the field than a reflection of how we talk about our findings. For instance, “men” showed up in the list of high status words, which might be because high status research talks about variation between men and women, whites and African Americans, while lower status work is more likely to discuss the effect of gender or race. Same research, different language. There are also fewer words on this list, which might be because lower status abstracts used about 10% fewer words than higher status abstracts, or because there is more variation in the words found in higher status abstracts.
More practically, feel free to use this list as a how-to when writing your own abstract. For example, take the paper you are working on now, delete the line “My research examines the relationship…” and replace it with “Our findings show…” Feel free to add me as a co-author, at least of the abstract.
I did a simple analysis to look for the highest status abstract that’s been published over the last five years. For each article with an abstract in the Web of Science from any sociology journal, I computed the predicted high statusness of the article. This was accomplished by assigning each word that appeared in more than 2% of high status abstracts a value equal to the proportion of times it appeared in high status abstracts divided by the proportion of times it appeared in low status abstracts. This is the same as the ratio column in the tables above. Then for each abstract, I summed the scores of each word in the abstract (duplicates count only once). Words that didn’t appear in 2% of high status abstracts were ignored. To control for varying abstract length, I divided the total sum by the number of words in the abstract that were in the high status dictionary. A more rigorous study would probably do something better, but I think the results would be highly correlated with this study.
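As a rough illustration of that scoring rule, here is a Python sketch. It reflects one reading of the description rather than the original script: the names `build_dictionary` and `status_score` are hypothetical, the divisor is taken to be the number of distinct abstract words found in the dictionary, words that never appear in low status abstracts are left out because their ratio is undefined, and the example ratios are invented.

```python
import re


def build_dictionary(high_prop, low_prop, min_share=0.02):
    """Map each sufficiently common high status word to its high/low ratio."""
    dictionary = {}
    for word, share in high_prop.items():
        low_share = low_prop.get(word, 0.0)
        # Assumption: a word with no low status occurrences gets no score.
        if share > min_share and low_share > 0.0:
            dictionary[word] = share / low_share
    return dictionary


def status_score(abstract, dictionary):
    """Average the ratios of the distinct dictionary words in one abstract.

    Returns None when the abstract contains no dictionary words,
    since there is nothing to average.
    """
    words = set(re.findall(r"[a-z]+", abstract.lower()))  # duplicates count once
    scores = [dictionary[word] for word in words if word in dictionary]
    if not scores:
        return None
    return sum(scores) / len(scores)


# Invented ratios, for illustration only.
toy_dictionary = {"findings": 1.5, "sociological": 0.5}
toy_score = status_score("Our findings, findings, are sociological.", toy_dictionary)
print(toy_score)
```

Under this reading the score is simply the mean ratio of the recognized words, so an abstract made up entirely of words that are equally common in both groups would score 1.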
The winner of the Most High Status Abstract, 2008-2012 is:
This article investigates the effect of family life course transitions on labor allocation strategies in rural Chinese households. We highlight three types of economic activity that involve reallocation of household labor oriented toward a more diversified, nonfarm rural economy: involvement in wage employment, household entrepreneurship, and/or multiple activities that span economic sectors. With the use of data from the China Health and Nutrition Survey (CHNS 1997, 2000, and 2004), our longitudinal analyses of rural household economic activity point to the significance of household demography, life course transitions, and local economic structures as factors facilitating household labor reallocation. First, as expected, a relatively youthful household structure is conducive to innovative economic behavior. Second, household entrances and exits are significant, but their impacts are not equal. Life events such as births, deaths, marriage, or leaving home for school or employment affect household economy in distinctive ways. Finally, the reallocations of household labor undertaken by households are shaped by local economic structures: in particular the extent of village-level entrepreneurial activity, off-farm employment, and out-migration.
This article by Maryland’s Feinian Chen and Utah’s Kim Korinek appeared in Demography in 2010. I was expecting an article from one of the journals used to train the algorithm to win the prize. That an article from a high status journal that wasn’t included in developing the model won means overfitting likely isn’t a problem. If the top 10 were simply the 10 most recent ASR articles, there would be some concern that the model might be measuring the quirks of a specific abstract copy editor.
I also computed the average high statusness for all the journals in the database.
Table 3. Sociology journals sorted by average abstract status score.
|American Sociological Review||1.18|
|Journal of Marriage and Family||1.17|
|American Journal of Sociology||1.17|
|Social Science Research||1.17|
|Journal of Health and Social Behavior||1.15|
|Sociological Methods & Research||1.14|
|Journal of Mathematical Sociology||1.14|
|Sociology of Education||1.13|
|City & Community||1.12|
|Sociological Methodology 2011||1.12|
|Work and Occupations||1.12|
|Work Employment and Society||1.11|
|American Journal of Economics and Sociology||1.10|
|Rationality and Society||1.10|
|Social Psychology Quarterly||1.10|
|Youth & Society||1.10|
|Annual Review of Sociology||1.09|
|Gender & Society||1.08|
|British Journal of Sociology||1.08|
|Theory and Society||1.07|
|Sociology of Health & Illness||1.06|
|Sociology-The Journal of The British Sociological Association||1.05|
|Journal of Contemporary Ethnography||1.05|
|Sociology of Sport Journal||1.05|
Finally, a little picture to hang up above your desk. Blue words appeared more often in high status articles, and orange words more often in low status articles. Word size is based on the likelihood of appearing in one status group compared to another, so the biggest orange words are the ones that most clearly signal low status work, and the biggest blue words are the ones that most clearly signal high status work. This picture was inspired by Abe Gong’s text analysis program.