What interests reddit? A network analysis of 84M comments by 200K users

Back in the fall of 2013, I scraped approximately 84 million comments from a set of just over 200,000 redditors using PRAW (the Python Reddit API Wrapper). At the time, I was interested in whether different subreddits had different norms for writing style, and whether I could model users as they learned this style (the answers turned out to be yes and no, respectively). To retrieve this data with minimal topical bias, I used PRAW's random subreddit function to select subreddits pseudorandomly. My script then stepped through each subreddit's most recent 1000 links, collecting the usernames of all the commenters. With a list of redditors in hand, I then scraped each user's entire comment history (up to the maximum of 1000 items).

All of this text, combined with metadata about the comments (e.g. number of upvotes), adds up to a hefty 23GB in CSV format. I preprocessed the text by removing a large number of content-free "stopwords" (e.g. grammatical words), as well as symbols and numeric characters. I then counted the frequencies of all remaining words within a very small subsample of my dataset (0.1%). I removed words which appeared fewer than 50 times or more than 1000 times in this set of ~84K comments. I then also (very simplistically) filtered out proper names, common words other than nouns and verbs, and past tense or plural versions of words already remaining in the list. Ultimately I ended up with a set of 1,862 "feature" words, the frequencies of which I then counted in the full set of 84M comments (aggregated by user).
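
The frequency filter described above can be sketched in a few lines of Python. This is only an illustration, not my original script: the stopword list is a tiny placeholder, the function names (tokenize, select_features, count_features_by_user) are made up for this example, and the part-of-speech and proper-name filtering is left out entirely.

```python
import re
from collections import Counter

# Placeholder stopword list for illustration; the real list was much larger.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "it", "that"}

def tokenize(comment):
    """Lowercase a comment, keep only alphabetic runs, and drop stopwords."""
    words = re.findall(r"[a-z]+", comment.lower())
    return [w for w in words if w not in STOPWORDS]

def select_features(comments, min_count=50, max_count=1000):
    """Keep words whose total count in the sample is within [min_count, max_count]."""
    counts = Counter()
    for comment in comments:
        counts.update(tokenize(comment))
    return {w for w, n in counts.items() if min_count <= n <= max_count}

def count_features_by_user(user_comments, features):
    """Aggregate feature-word counts per user from (username, comment) pairs."""
    per_user = {}
    for user, comment in user_comments:
        counter = per_user.setdefault(user, Counter())
        counter.update(w for w in tokenize(comment) if w in features)
    return per_user
```

In this sketch, select_features would be run on the 0.1% subsample, and count_features_by_user on the full dataset using the resulting word set.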

After the preprocessing described above, this process yielded counts of each feature word for each of 198,542 redditors. It is worth acknowledging that these users are representative of neither the general (US) population nor even the userbase of reddit, given that reddit is not representative of the US and that the selection process was inevitably biased towards users who comment more often. However, given the sheer size of the sample, even if the results do not generalize perfectly to wider populations, we can be confident that they represent the views of a large number of people.

As a final preprocessing step, I calculated the cosine similarity matrix between the feature word frequencies (across redditors). Words with very low variance in their similarity to others (i.e. words that were used very generally) were removed. At the same time, I removed words with very high or very low median similarities (undifferentiated/polysemous words and outliers). I then calculated an adjacency matrix between the remaining set of 1,444 feature words. To maximize the interpretability of the subsequent network visualizations, each node (feature word) was given at least two edges (connections) to other nodes in the network, based on the two largest elements of its row in the adjacency matrix.
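
For readers who want to see the mechanics, here is a rough pure-Python sketch of the two key steps: cosine similarity between word vectors (one count per redditor) and picking the two strongest connections for each word. It is a toy version under a few assumptions: the function names are invented for this example, the similarity matrix itself stands in for the adjacency matrix, and the variance and median filtering is omitted.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a word nobody used has no defined direction
    return dot / (norm_u * norm_v)

def similarity_matrix(word_vectors):
    """Return (sorted word list, square matrix of pairwise cosine similarities)."""
    words = sorted(word_vectors)
    matrix = [[cosine_similarity(word_vectors[a], word_vectors[b]) for b in words]
              for a in words]
    return words, matrix

def top_k_edges(words, matrix, k=2):
    """Give every word an edge to its k largest off-diagonal row entries."""
    edges = set()
    for i, row in enumerate(matrix):
        others = [j for j in range(len(words)) if j != i]
        ranked = sorted(others, key=lambda j: row[j], reverse=True)
        for j in ranked[:k]:
            # Store edges as sorted pairs so (a, b) and (b, a) collapse into one.
            edges.add(tuple(sorted((words[i], words[j]))))
    return edges
```

Because each word contributes its own top two picks, a popular word can end up with many more than two edges, which is why the guarantee is "at least two" rather than exactly two.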

The network was visualized using the igraph package for R. The width and color of the edges (lines) vary with the log (base 10) of the co-occurrence of the respective words (i.e. words with stronger lines between them are mentioned more frequently by the same redditors), and the size of the nodes varies with the log (base 10) of the absolute frequency of the word (so, for instance, a word that occurred 1,000 times would have 3x the radius of a word that occurred 10 times). The node coloring was determined by a 5-step walktrap community finding algorithm (effectively similar to a clustering algorithm, but for networks). Note that the colors were chosen arbitrarily, so similarity in the color of different communities should not be interpreted. The positioning of the nodes was set using the Fruchterman-Reingold algorithm, a type of force-directed layout. Note that while similar terms are often placed close together by this algorithm, that is not universally the case, so try to avoid over-interpreting the distances between points. The edges of the graph (i.e. lines between circles) are the more accurate guide to the relationships between terms.
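
To make the log scaling of node sizes concrete, here is a small Python sketch. It assumes the radius is directly proportional to log10 of the word's frequency; the function name and the scale parameter are hypothetical, and the plotting code used for the figure may have applied a different constant or offset.

```python
import math

def node_radius(frequency, scale=1.0):
    """Radius proportional to log10 of a word's absolute frequency (assumed mapping)."""
    if frequency <= 0:
        raise ValueError("frequency must be positive")
    return scale * math.log10(frequency)

# A word seen 1,000 times versus a word seen 10 times:
ratio = node_radius(1000) / node_radius(10)
# ratio is 3 (up to floating-point rounding), since log10(1000) = 3 and log10(10) = 1
```

Under this assumption, multiplying a word's frequency by 1000 always adds the same fixed amount (3 units times the scale) to its radius, which is what keeps the most common words from swamping the plot.
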
Branches of Government

Many redditors seem to have an interest in talking about the government (dark blue). However, their interest in the topic appears to stem from a diverse set of sources. The two largest groups of features are defined by their secondary alignments with discussion of "state" and "women". The former appears to be largely focused on economics and security issues, with "business", "market", "force", and "police" numbering among the most common topics of discussion. Meanwhile, the latter seems more devoted to social issues, with "health", "community", and "education" featuring prominently. Though I suspect these clusters may be correlated with the partisan liberal-conservative line, I'm not sure that that's their best characterization. Instead, it seems to me more of a confirmation of a "natural" division between social and economic interests.

A number of additional groups of feature words have strong connections to government. One in the "empiricist" community (light teal) seems interested in political philosophy. Another, linked to the technology community (blue-green), seems technocratically focused. Finally, another group of features (linked with "car") seems focused on issues of "low" (i.e. personal) finance rather than economics more broadly. I find this last group particularly interesting because its existence suggests a degree of disconnection between interest in "economic issues" and practical household economy.
Women. Women? Women!

Apparently women are a very common topic of discussion on reddit - perhaps not surprising for a forum (like many on the web) populated disproportionately by young men. However, while everyone seems to be talking about women, what's strikingly obvious is that not everyone is having the same conversation. On the right of the image above, we see two related clusters within the grey-green feature community - one centered on "men" and the other on "sex". Based on the words in these groups, they appear to reflect people discussing "relationships," broadly construed. Lower down, a large cluster in the darker green of the "story" feature community may be concerned with narrative by or about women. In the same grey-pink as the "women" node, we can also see two large clusters of features in the lower left. However, the nature of these clusters is less immediately obvious, at least to me.