Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach

Abstract

We analyzed 700 million words, phrases, and topic instances collected from the Facebook messages of 75,000 volunteers, who also took standard personality tests, and found striking variations in language with personality, gender, and age. In our open-vocabulary technique, the data itself drives a comprehensive exploration of language that distinguishes people, finding connections that are not captured with traditional closed-vocabulary word-category analyses. Our analyses shed new light on psychosocial processes, yielding results that are face valid (e.g., subjects living in high elevations talk about the mountains), tie in with other research (e.g., neurotic people disproportionately use the phrase 'sick of' and the word 'depressed'), suggest new hypotheses (e.g., an active life implies emotional stability), and give detailed insights (males use the possessive 'my' when mentioning their 'wife' or 'girlfriend' more often than females use 'my' with 'husband' or 'boyfriend'). To date, this represents the largest study, by an order of magnitude, of language and personality.

Citation: Schwartz HA, Eichstaedt JC, Kern ML, Dziurzynski L, Ramones SM, et al. (2013) Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach. PLoS ONE 8(9): e73791. doi:10.1371/journal.pone.0073791
Editor: Tobias Preis, University of Warwick, United Kingdom
Received January 23, 2013; Accepted July 29, 2013; Published September 25, 2013
Copyright: © 2013 Schwartz et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: Support for this research was provided by the Robert Wood Johnson Foundation's Pioneer Portfolio, through a grant to Martin Seligman, "Exploring the Concept of Positive Health". The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: hansens@sas.upenn.edu

Introduction

The social sciences have entered the age of data science, leveraging the unprecedented sources of written language that social media afford [1-3]. Through media such as Facebook and Twitter, used regularly by more than 1/7th of the world's population [4], variation in mood has been tracked diurnally and across seasons [5], used to predict the stock market [6], and leveraged to estimate happiness across time [7,8]. Search patterns on Google detect influenza epidemics weeks before CDC data confirm them [9], and the digitization of books makes possible the quantitative tracking of cultural trends over decades [10]. To make sense of the massive data available, multidisciplinary collaborations between fields such as computational linguistics and the social sciences are needed. Here, we demonstrate an instrument which uniquely describes similarities and differences among groups of people in terms of their differential language use.

Our technique leverages what people say in social media to find distinctive words, phrases, and topics as functions of known attributes of people such as gender, age, location, or psychological characteristics. The standard approach to correlating language use with individual attributes is to examine usage of a priori fixed sets of words [11], limiting findings to preconceived relationships with words or categories. In contrast, we extract a data-driven collection of words, phrases, and topics, in which the lexicon is based on the words of the text being analyzed. This yields a comprehensive description of the differences between groups of people for any given attribute, and allows one to find unexpected results. We call approaches like ours, which do not rely on a priori word or category judgments, open-vocabulary analyses.

We use differential language analysis (DLA), our particular method of open-vocabulary analysis, to find language features across millions of Facebook messages that distinguish demographic and psychological attributes. From a dataset of over 15.4 million Facebook messages collected from 75 thousand volunteers [12], we extract 700 million instances of words, phrases, and automatically generated topics and correlate them with gender, age, and personality.
We replicate traditional language analyses by applying Linguistic Inquiry and Word Count (LIWC) [11], a popular tool in psychology, to our data set. Then, we show that open-vocabulary analyses can yield additional insights (correlations between personality and behavior as manifest through language) and more information (as measured through predictive accuracy) than traditional a priori word-category approaches. We present a word cloud-based technique to visualize the results of DLA. Our large set of correlations is made available for others to use (available at: http://www.wwbp.org/).

Background

This section outlines recent work linking language with personality, gender, and age. In line with the focus of this paper, we predominantly discuss works which sought to gain psychological insights. However, we also touch on increasingly popular attempts at predicting personality from language in social media, which, for our study, offer an empirical means to compare a closed-vocabulary analysis (relying on a priori human judgments of word categories) and an open-vocabulary analysis (not relying on a priori word-category judgments).

Personality refers to the traits and characteristics that make an individual unique. Although there are multiple ways to classify traits [13], we draw on the popular Five Factor Model (or "Big 5"), which classifies personality traits into five dimensions: extraversion (e.g., outgoing, talkative, active), agreeableness (e.g., trusting, kind, generous), conscientiousness (e.g., self-controlled, responsible, thorough), neuroticism (e.g., anxious, depressive, touchy), and openness (e.g., intellectual, artistic, insightful) [14]. With work beginning over 50 years ago [15] and journals dedicated to it, the FFM is a well-accepted construct of personality [16].

Automatic Lexical Analysis of Personality, Gender, and Age

By examining what words people use, researchers have long sought a better understanding of human psychology [17-19]. As Tauszczik & Pennebaker put it:

Language is the most common and reliable way for people to translate their internal thoughts and emotions into a form that others can understand. Words and language, then, are the very stuff of psychology and communication [20].

The typical approach to analyzing language involves counting word usage over pre-chosen categories of language. For example, one might place words like 'nose', 'bones', 'hips', 'skin', 'hands', and 'gut' into a body lexicon, and count how often words in the lexicon are used by extraverts or introverts in order to determine who talks about the body more. Of such word-category lexica, the most widely used is Linguistic Inquiry and Word Count, or LIWC, developed over the last couple of decades by human judges designating categories for common words [11,19]. The 2007 version of LIWC includes 64 different categories of language, ranging from parts of speech (i.e. articles, prepositions, past-tense verbs, numbers, ...) to topical categories (i.e. family, cognitive mechanisms, occupation, body, ...), as well as a few other attributes such as total number of words used [11]. Names of all 64 categories can be seen in Figure 2.
generous), conscientiousness (e.g., self-controlled, responsible, thor- For example, they find only 13 significant word correlations for ough), neuroticism (e.g., anxious, depressive, touchy), and openness conscientiousness while we find thousands even after Bonferonni- (e.g., intellectual, artistic, insightful) [14]. With work beginning correcting significance levels. Additionally, they did not control for over 50 years ago [15] and journals dedicated to it, the FFM is a age or gender although they reported roughly 75% of their well-accepted construct of personality [16]. subjects were female. Still, as the most thorough point of comparison for LIWC results with personality, Figure 2 presents Automatic Lexical Analysis of Personality, Gender, the findings from Yarkoni’s study along with LIWC results over our data. and Age Analogous to a personality construct, work has been done in By examining what words people use, researchers have long psychology looking at the latent dimensions of self-expression. sought a better understanding of human psychology [17–19]. As Chung and Pennebaker factor analyzed 119 adjectives used in Tauszczik & Pennebaker put it: student essays of ‘‘who you think you are’’ and discovered 7 latent dimensions labeled such as ‘‘sociability’’ or ‘‘negativity’’ [28]. They Language is the most common and reliable way for people were able to relate these factors to the Big-5 and found only weak to translate their internal thoughts and emotions into a form relations, suggesting 7 dimensions as an alternative construction. that others can understand. Words and language, then, are Later, Kramer and Chung ran the same method over 1000 unique the very stuff of psychology and communication [20]. words across Facebook status updates, finding three components labeled, ‘‘positive events’’, ‘‘informal speech’’, and ‘‘school’’ [29]. The typical approach to analyzing language involves counting Although their vocabulary size was somewhat limited, we still see word usage over pre-chosen categories of language. For example, these as previous examples of open-vocabulary language analyses one might place words like ‘nose’, ‘bones’, ‘hips’, ‘skin’, ‘hands’, for psychology – no assumptions were made on the categories of and ‘gut’ into a body lexicon, and count how often words in the words beyond part-of-speech. lexicon are used by extraverts or introverts in order to determine who LIWC has also been used extensively for studying gender and talks about the body more. Of such word-category lexica, the most age [21]. Many studies have focused on function words (articles, widely used is Linguistic Inquiry and Word Count or LIWC, prepositions, conjunctions, and pronouns), finding females use developed over the last couple decades by human judges more first-person singular pronouns, males use more articles, and designating categories for common words [11,19]. The 2007 that older individuals use more plural pronouns and future tense version of LIWC includes 64 different categories of language verbs [30–32]. Other works have found males use more formal, ranging from part-of-speech (i.e. articles, prepositions, past-tense verbs, affirmation, and informational words, while females use more numbers,...) to topical categories (i.e. family, cognitive mechanisms, affect, social interaction, and deictic language [33–36]. For age, the most occupation, body,...), as well as a few other attributes such as total salient findings include older individuals using more positive number of words used [11]. 
LIWC has also been used extensively for studying gender and age [21]. Many studies have focused on function words (articles, prepositions, conjunctions, and pronouns), finding that females use more first-person singular pronouns, males use more articles, and that older individuals use more plural pronouns and future tense verbs [30-32]. Other works have found that males use more formal, affirmation, and informational words, while females use more social interaction and deictic language [33-36]. For age, the most salient findings include older individuals using more positive emotion and fewer negative emotion words [30], older individuals preferring fewer self-references (i.e. 'I', 'me') [30,31], and, stylistically, less use of negation [37]. Similar to our finding of 2000 topics (clusters of semantically related words), Argamon et al. used factor analysis and identified 20 coherent components of word use to link gender and age, showing that male components of language increase with age while female factors decrease [32].

Occasionally, studies find contradictory results. For example, multiple studies report that emoticons (i.e. ':)' ':-(') are used more often by females [34,36,38], but Huffaker & Calvert found males use them more in a sample of 100 teenage bloggers [39]. This particular discrepancy could be sample-related (differing demographics or a non-representative sample: Huffaker & Calvert looked at 100 bloggers, while later studies have looked at thousands of Twitter users), or it could be due to differences in the domain of the text (blogs versus Twitter). One should always be careful generalizing new results outside of the domain they were found in, as language is often dependent on context [40]. In our case we explore language in the broad context of Facebook, and do not claim our results would hold up under other smaller or larger contexts. As a starting point for reviewing more psychologically meaningful language findings, we refer the reader to Tauszczik & Pennebaker's 2010 survey of computerized text analysis [20].

Eisenstein et al. presented a sophisticated open-vocabulary language analysis of demographics [41]. Their method views language analysis as a multi-predictor to multi-output regression problem, and uses an L1 norm to select the most useful predictors (i.e. words). Part of their motivation was finding interpretable relationships between individual language features and outcomes (demographics), and unlike the many predictive works we discuss in the next section, they test for significance of relationships between individual language features and outcomes.
To contrast with our approach: we consider features and outcomes individually (i.e. an "L0 norm"), which we think is more ideal for our goals of explaining psychological variables (i.e. understanding openness by the words that correlate with it). For example, their method may throw out a word which is strongly predictive for only one outcome or which is collinear with other words, while we want to know all the words most predictive for a given outcome. We also explore other types of open-vocabulary language features, such as phrases and topics.

Similar language analyses have also occurred in many fields outside of psychology or demographics [42,43]. For example, Monroe et al. explored a variety of techniques that compare two frequencies of words, one number for each of two groups [44]. In particular, they explored frequencies across Democratic versus Republican speeches and settled on a Bayesian model with regularization and shrinkage based on priors of word use. Lastly, Gilbert finds words and phrases that distinguish communication up or down a power hierarchy across 2044 Enron emails [45]. They used penalized logistic regression to fit a single model using the coefficients of each feature as their "power"; this produces a good single predictive model but also means words which are highly collinear with others will be missed (we run a separate regression for each word to avoid this).

Perhaps one of the most comprehensive language analysis surveys outside of psychology is that of Grimmer & Stewart [43]. They summarize how automated methods can inexpensively allow systematic analysis and inference from large political text collections, classifying types of analyses into a hierarchy. Additionally, they provide cautionary advice; in relation to this work, they note that dictionary methods (such as the closed-vocabulary analyses discussed here) may signal something different when used in a new domain (for example, 'crude' may be a negative word in student essays, but be neutral in energy industry reports: 'crude oil'). For comprehensive surveys on text analyses across fields see Grimmer & Stewart [43], O'Connor, Bamman, & Smith [42], and Tausczik & Pennebaker [46].

Predictive Models based on Language

In contrast with the works seeking to gain insights about psychological variables, research focused on predicting outcomes has embraced data-driven approaches. Such work uses open-vocabulary linguistic features in addition to a priori lexicon-based features in predictive models for tasks such as stylistics/authorship attribution [47-49], emotion prediction [50,51], interaction or flirting detection [52,53], or sentiment analysis [54-57]. In other works, ideologies of political figures (i.e. conservative to liberal) have been predicted based on language using supervised techniques [58] or unsupervised inference of ideological space [59,60]. Sometimes these works note the highest weighted features, but with their goal being predictive accuracy, those features are not tested for significance and they usually are not the most individually distinguishing pieces of language. To elaborate, most approaches to prediction penalize the weights of words that are highly collinear with other words, as they fit a single model per outcome across all words. However, these highly collinear words, which are suppressed, could have revealed important insights with an outcome. In other words, these predictive models answer the question "what is the best combination of words and weights to predict personality?", whereas we believe answering the following question is best for revealing new insights: "what words, controlled for gender and age, are individually most correlated with personality?".

Recently, researchers have started looking at personality prediction. Early works in personality prediction used dictionary-based features such as LIWC. Argamon et al. (2005) noted that personality, as detected by categorical word use, was supportive for author attribution; they examined language use according to the traits of neuroticism and extraversion over approximately 2200 student essays, while related work focused on using function words for the prediction of gender [62]. Mairesse et al. used a variety of lexicon-based features to predict all Big-5 personality traits over approximately 2500 essays as well as 90 sets of individual spoken words [63,64]. As a first pass at predicting personality from language in Facebook, Golbeck used LIWC features over a sample of 167 Facebook volunteers, as well as profile information, and found limited success with a regression model [65]. Similarly, Kaggle held a competition on personality prediction over Twitter messages, providing participants with language cues based on LIWC [66]. Results of the competition suggested personality is difficult to predict based on language in social media, but it is not clear whether such a conclusion would have been drawn had open-vocabulary language cues been supplied for prediction.

In the largest previous study of language and personality, Iacobelli, Gill, Nowson, and Oberlander attempted prediction of personality for 3,000 bloggers [67].
Not limited to categorical language, they found open-vocabulary features, such as bigrams, to be better predictors than LIWC features. This motivates our exploration of open-vocabulary features for psychological insights, where we examine multi-word phrases (also called n-grams) as well as open-vocabulary categorical language in the form of automatically clustered groups of semantically related words (LDA topics; see "Linguistic Feature Extraction" in the "Materials and Methods" section). Since the application of Iacobelli et al.'s work was content customization, they focused on prediction rather than exploration of language for psychological insight. Our much larger sample size lends itself well to more comprehensive exploratory results.

Similar studies have also been undertaken for age and gender prediction in social media. Because gender and age information is more readily available, these studies tend to be larger. Argamon et al. predicted gender and age over 19,320 bloggers [32], while Burger et al. scaled up gender prediction over 184,000 Twitter authors by using automatically guessed gender based on gender-specific keywords in profiles. Most recently, Bamman et al. looked at gender as a function of language and social network statistics in Twitter. They particularly looked at the characteristics of those whose gender was incorrectly predicted and found greater gender homophily in the social networks of such individuals [68].

These past studies, mostly within the field of computer science or specifically computational linguistics, have focused on prediction for tasks such as content personalization or authorship attribution. In our work, predictive models of personality, gender, and age provide a quantitative means to compare various open-vocabulary sets of features with a closed-vocabulary set.
Our primary concern is to explore the benefits of an open-vocabulary approach for gaining insights, a goal that is at least as important as prediction for psychosocial fields. Most works for gaining language-based insights in psychology are closed-vocabulary (for examples, see the previous section), and while many works in computational linguistics are open-vocabulary, they rarely focus on insight. We introduce the term "open-vocabulary" to distinguish an approach like ours from previous approaches to gaining insight, and in order to encourage others seeking insights to consider similar approaches. "Differential language analysis" refers to the particular process, for which we are not aware of another name, that we use in our open-vocabulary approach, as depicted in Figure 1.

Contributions

The contributions of this paper are as follows:

First, we present the largest study of personality and language use to date. With just under 75,000 authors, our study covers an order of magnitude more people and instances of language features than the next largest study ([27]). The size of our data enables qualitatively different analyses, including open-vocabulary analysis, based on more comprehensive sets of language features such as phrases and automatically derived topics. Most prior studies used a priori language categories, presumably due in part to the sparse nature of words and their relatively small samples of people. With smaller data sets, it is difficult to find statistically significant differences in language use for anything but the most common words.

Our open-vocabulary analysis yields further insights into the behavioral residue of personality types beyond those from a priori word-category based approaches, giving unanticipated results (correlations between language and personality, gender, or age). For example, we make the novel discoveries that mentions of an assortment of social sports and life activities (such as basketball, snowboarding, church, meetings) correlate with emotional stability, and that introverts show an interest in Japanese media (such as anime, pokemon, manga, and Japanese emoticons: ^_^). Our inclusion of phrases in addition to words provides further insights (e.g. that males prefer to precede 'girlfriend' or 'wife' with the possessive 'my' significantly more than females do for 'boyfriend' or 'husband'). Such correlations provide quantitative evidence for strong links between behavior, as revealed in language use, and psychosocial variables. In turn, these results suggest undertaking studies, such as directly measuring participation in activities, in order to verify the link with emotional stability.
We demonstrate that open-vocabulary features contain more information than a priori word categories via their use in predictive models. We take model accuracy in out-of-sample prediction as a measure of the information in the features provided to the model. Models built from words and phrases, as well as those from automatically generated topics, achieve significantly higher out-of-sample prediction accuracies than a standard lexicon for each variable of interest (gender, age, and personality). Additionally, our prediction model for gender yielded state-of-the-art results for predictive models based entirely on language, yielding an out-of-sample accuracy of 91.9%.

We present a word cloud visualization which scales words by correlation (i.e., how well they predict the given psychological variable) rather than simply scaling by frequency. Since we find thousands of significantly correlated words, visualization is key, and our differential word clouds provide a comprehensive view of our results (e.g. see Figure 3).

Lastly, we offer our comprehensive word, phrase, and topic correlation data for future research experiments (see: wwbp.org).

Materials and Methods

Ethics Statement

All research procedures were approved by the University of Pennsylvania Institutional Review Board. Volunteers provided written informed consent.

In seeking insights from language use about personality, gender, and age, we explore two approaches. The first approach, serving as a replication of past analyses, counts word usage over manually created a priori word-category lexica. The second approach, termed DLA, serves as our main method and is open-vocabulary: the words and clusters of words analyzed are determined by the data itself.

Figure 1. The infrastructure of our differential language analysis. 1) Feature Extraction. Language use features include: (a) words and phrases: sequences of 1 to 3 words found using an emoticon-aware tokenizer and a collocation filter (24,530 features); (b) topics: automatically derived groups of words for a single topic found using the Latent Dirichlet Allocation technique [72,75] (500 features). 2) Correlational Analysis. We find the correlation (β of ordinary least squares linear regression) between each language feature and each demographic or psychometric outcome. All relationships presented in this work are at least significant at a Bonferroni-corrected p < 0.001 [76]. 3) Visualization. Graphical representation of correlational analysis output. doi:10.1371/journal.pone.0073791.g001

Closed Vocabulary: Word-Category Lexica

A common method for linking language with psychological variables involves counting words belonging to manually created categories of language. Sometimes referred to as the word-count approach, one counts how often words in a given category are used by an individual, as a percentage of the participant's total words:

\[ p(\mathrm{category} \mid \mathrm{subject}) = \frac{\sum_{\mathrm{word} \in \mathrm{category}} \mathrm{freq}(\mathrm{word}, \mathrm{subject})}{\sum_{\mathrm{word} \in \mathrm{vocab}(\mathrm{subject})} \mathrm{freq}(\mathrm{word}, \mathrm{subject})} \]

where freq(word, subject) is the number of times the participant mentions the word and vocab(subject) is the set of all words mentioned by the subject.

We use ordinary least squares regression to link word categories with author attributes, fitting a linear function between explanatory variables (LIWC categories) and dependent variables (such as a trait of personality, e.g. extraversion). The coefficient of the target explanatory variable (often referred to as β) is taken as the strength of the relationship. Including other variables allows us to adjust for covariates such as gender and age to provide the unique effect of a given language feature on each psychosocial variable.
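To make the closed-vocabulary procedure concrete, here is a minimal Python sketch of the two steps just described: computing a category's relative frequency for a subject, then regressing an outcome on that usage with gender and age as covariates. The toy body lexicon and the synthetic data are invented for illustration; this is not the authors' released code.

```python
import numpy as np
import statsmodels.api as sm
from collections import Counter

# Toy stand-in for a LIWC-style category lexicon.
BODY_WORDS = {"nose", "bones", "hips", "skin", "hands", "gut"}

def category_relative_freq(tokens, category):
    """p(category | subject): share of a subject's words in the category."""
    counts = Counter(tokens)
    total = sum(counts.values())
    in_category = sum(n for w, n in counts.items() if w in category)
    return in_category / total if total else 0.0

# Synthetic example: regress an outcome (e.g., extraversion) on category
# use, with gender and age as covariates. The coefficient on the first
# column (category use) is the reported strength of relationship (beta).
rng = np.random.default_rng(0)
n = 200
category_use = rng.random(n)            # p(category | subject) per subject
gender = rng.integers(0, 2, n)          # 0 = male, 1 = female
age = rng.integers(13, 65, n)
outcome = 0.5 * category_use + rng.normal(size=n)

X = sm.add_constant(np.column_stack([category_use, gender, age]))
fit = sm.OLS(outcome, X).fit()
print(f"beta = {fit.params[1]:.3f}, p = {fit.pvalues[1]:.2g}")
```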
Open Vocabulary: Differential Language Analysis

Our technique, differential language analysis (DLA), is based on three key characteristics. It is:

1. Open-vocabulary: it is not limited to predefined word lists. Rather, linguistic features including words, phrases, and topics (sets of semantically related words) are automatically determined from the texts (i.e., it is "data-driven"). This means DLA is classified as a type of open-vocabulary approach.
2. Discriminating: it finds key linguistic features that distinguish psychological and demographic attributes, using stringent significance tests.
3. Simple: it uses simple, fast, and readily accepted statistical techniques.

We depict the components of this approach in Figure 1, and describe the three steps (1) linguistic feature extraction, 2) correlational analysis, and 3) visualization) in the following sections.

1. Linguistic Feature Extraction. We examined two types of linguistic features: a) words and phrases, and b) topics. Words and phrases consisted of sequences of 1 to 3 words (often referred to as 'n-grams' of size 1 to 3). What constitutes a word is determined using a tokenizer, which splits sentences into tokens ("words"). We built an emoticon-aware tokenizer on top of Potts' "happyfuntokenizer", allowing us to capture emoticons like '<3' (a heart) or ':-)' (a smile), which most tokenizers incorrectly divide up into separate pieces of punctuation. When extracting phrases, we keep only those sequences of words with high informative value according to pointwise mutual information (PMI) [69,70], a ratio of the joint probability to the independent probability of observing the phrase:

\[ \mathrm{pmi}(\mathrm{phrase}) = \log \frac{p(\mathrm{phrase})}{\prod_{w \in \mathrm{phrase}} p(w)} \]

In practice, we kept phrases with PMI values greater than 2 × length, where length is the number of words contained in the phrase, ensuring that the phrases we do keep are informative parts of speech and not just accidental juxtapositions.
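The snippet below sketches this extraction step under stated simplifications: a crude regular-expression tokenizer stands in for the happyfuntokenizer-based one (which we have not reproduced), and PMI is estimated over a four-message toy corpus, so little or nothing survives the 2 × length threshold; the point is the mechanics, not the output.

```python
import re
from collections import Counter
from math import log

# Crude emoticon pattern; the paper's tokenizer is far more complete.
EMOTICON = r"<3|\^_\^|[:;=8][\-o\*']?[\)\(\]\[dDpP/\\]"
TOKEN_RE = re.compile(rf"(?:{EMOTICON})|\w+(?:'\w+)?")

def tokenize(text):
    """Emoticon-aware tokenization: keeps ':-)' or '<3' as single tokens."""
    return TOKEN_RE.findall(text.lower())

def pmi(phrase, word_p, phrase_p):
    """log of the phrase's joint probability over the product of the
    independent probabilities of its words."""
    independent = 1.0
    for w in phrase:
        independent *= word_p[w]
    return log(phrase_p[phrase] / independent)

msgs = [tokenize(m) for m in ["happy birthday :)", "happy birthday!",
                              "so happy today", "birthday party today :)"]]
unigrams = Counter(w for m in msgs for w in m)
bigrams = Counter(tuple(m[i:i + 2]) for m in msgs for i in range(len(m) - 1))
n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
word_p = {w: c / n_uni for w, c in unigrams.items()}
phrase_p = {ph: c / n_bi for ph, c in bigrams.items()}

# Keep phrases with PMI greater than 2 * length, per the paper's filter.
kept = [ph for ph in phrase_p if pmi(ph, word_p, phrase_p) > 2 * len(ph)]
print({" ".join(ph): round(pmi(ph, word_p, phrase_p), 2) for ph in phrase_p})
print("kept:", kept)
```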
All word and phrase counts are normalized by each subject's total word use (p(word | subject)), and we apply the Anscombe transformation [71] to the normalized values for variance stabilization (p_ans):

\[ p(\mathrm{phrase} \mid \mathrm{subject}) = \frac{\mathrm{freq}(\mathrm{phrase}, \mathrm{subject})}{\sum_{\mathrm{phrase}' \in \mathrm{vocab}(\mathrm{subject})} \mathrm{freq}(\mathrm{phrase}', \mathrm{subject})} \]

\[ p_{\mathrm{ans}}(\mathrm{phrase} \mid \mathrm{subject}) = 2\sqrt{p(\mathrm{phrase} \mid \mathrm{subject}) + 3/8} \]

where vocab(subject) returns the list of all words and phrases used by that subject. These Anscombe-transformed "relative frequencies" of words or phrases (p_ans) are then used as the independent variables in all our analyses. Lastly, we restrict our analysis to those words and phrases which are used by at least 1% of our subjects, keeping the focus on common language.

The second type of linguistic feature, topics, consists of word clusters created using Latent Dirichlet Allocation (LDA) [72,73]. The LDA generative model assumes that documents (i.e. Facebook messages) contain a combination of topics, and that topics are a distribution of words; since the words in a document are known, the latent variable of topics can be estimated through Gibbs sampling [74]. We use an implementation of the LDA algorithm provided by the Mallet package [75], adjusting one parameter (α = 0.30) to favor fewer topics per document, since individual Facebook status updates tend to contain fewer topics than the typical documents (newspaper or encyclopedia articles) to which LDA is applied. All other parameters were kept at their defaults. An example from such a model is the following set of words (tuesday, monday, wednesday, friday, thursday, week, sunday, saturday), which clusters together days of the week purely by exploiting their similar distributional properties across messages. We produced the 2000 topics shown in Table S1 as well as on our website.

To use topics as features, we find the probability of a subject's use of each topic:

\[ p(\mathrm{topic} \mid \mathrm{subject}) = \sum_{\mathrm{word} \in \mathrm{topic}} p(\mathrm{topic} \mid \mathrm{word}) \, p(\mathrm{word} \mid \mathrm{subject}) \]

where p(word | subject) is the normalized word use by that subject and p(topic | word) is the probability of the topic given the word (a value provided by the LDA procedure). The prevalence of a word in a topic is given by p(topic, word), and is used to order the words within a topic when displayed.
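A small sketch of these two computations, assuming the topic posteriors have already been estimated (the paper fit them with Mallet at α = 0.30; the p(topic | word) values below are invented placeholders):

```python
import numpy as np

def anscombe(p):
    """Variance-stabilizing Anscombe transform of a relative frequency."""
    return 2.0 * np.sqrt(p + 3.0 / 8.0)

def topic_use(p_word_given_subject, p_topic_given_word, topic):
    """p(topic | subject) = sum over words of
    p(topic | word) * p(word | subject)."""
    return sum(p_topic_given_word.get(w, {}).get(topic, 0.0) * p
               for w, p in p_word_given_subject.items())

# A subject's normalized word use and placeholder LDA posteriors.
p_word_given_subject = {"monday": 0.02, "week": 0.01, "party": 0.005}
p_topic_given_word = {"monday": {"days_of_week": 0.9},
                      "week": {"days_of_week": 0.7},
                      "party": {"social": 0.8}}

print(anscombe(0.02))  # transformed relative frequency of a word/phrase
print(topic_use(p_word_given_subject, p_topic_given_word, "days_of_week"))
```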
2. Correlational Analysis. As with word categories, distinguishing open-vocabulary words, phrases, and topics can be identified using ordinary least squares regression. We again take the coefficient of the target explanatory variable as its correlation strength, and we include other variables (e.g. age and gender) as covariates to get the unique effect of the target explanatory variable. Since we explore many features at once, we consider coefficients significant if they pass a Bonferroni-corrected [76] two-tailed p of 0.001 (i.e., when examining 20,000 features, a passing p-value is less than 0.001 divided by 20,000, which is 5 × 10^-8).

Our correlational analysis produces a comprehensive list of the most distinguishing language features for any given attribute: the words, phrases, or topics which maximally discriminate a given target variable. For example, when we correlate the target variable geographic elevation with language features (N = 18,383, p < 0.001, adjusted for gender and age), we find 'beach' to be the most distinguishing feature for low elevation localities, and 'the mountains' to be among the most distinguishing features for high elevation localities (i.e., people in low elevations talk about the beach more, whereas people at high elevations talk about the mountains more). Similarly, we find the most distinguishing topics to be (beach, sand, sun, water, waves, ocean, surf, sea, toes, sandy, surfing, beaches, sunset, Florida, Virginia) for low elevations and (Colorado, heading, headed, leaving, Denver, Kansas, City, Springs, Oklahoma, trip, moving, Iowa, KC, Utah, bound) for high elevations. Others have also looked at language and geographic location [77].
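In code, the per-feature regression loop can be sketched as below: each language feature gets its own model, with gender and age as covariates, and only coefficients passing the Bonferroni-corrected threshold are kept. The synthetic data are for illustration only.

```python
import numpy as np
import statsmodels.api as sm

def differential_language_analysis(features, outcome, covariates, alpha=0.001):
    """Regress the outcome on each language feature separately (with
    covariates), keeping coefficients that pass a Bonferroni-corrected
    two-tailed threshold. features: (n_subjects, n_features) array."""
    n_subjects, n_features = features.shape
    threshold = alpha / n_features          # e.g. 0.001 / 20,000 = 5e-8
    hits = []
    for j in range(n_features):
        X = sm.add_constant(np.column_stack([features[:, j], covariates]))
        fit = sm.OLS(outcome, X).fit()
        if fit.pvalues[1] < threshold:      # index 1 = the language feature
            hits.append((j, fit.params[1]))
    return sorted(hits, key=lambda h: -abs(h[1]))  # most distinguishing first

rng = np.random.default_rng(1)
n = 500
covariates = np.column_stack([rng.integers(0, 2, n),     # gender
                              rng.integers(13, 65, n)])  # age
features = rng.random((n, 50))
outcome = 3.0 * features[:, 7] + rng.normal(size=n)      # feature 7 matters
print(differential_language_analysis(features, outcome, covariates)[:3])
```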
3. Visualization. An analysis over tens of thousands of language features and multiple dimensions results in hundreds of thousands of statistically significant correlations. Visualization is thus critical for their interpretation. We use word clouds [78] to intuitively summarize our results. Unlike most word clouds, which scale word size by frequency, we scale word size according to the strength of the correlation of the word with the demographic or psychological measurement of interest, and we use color to represent frequency over all subjects; that is, larger words indicate stronger correlations, and darker colors indicate more frequently used words. This provides a clear picture of which words and phrases are most discriminating while not losing track of which ones are the most frequent. Word clouds scaled by frequency are often used to summarize news, a practice that has been critiqued for inaccurately representing articles [79]. Here, we believe the word cloud is an appropriate visualization because the individual words and phrases we depict in it are the actual results we wish to summarize. Further, scaling by correlation coefficient rather than frequency gives clouds that distinguish a given outcome.

Word clouds can also be used to represent distinguishing topics. In this case, the size of a word within the topic represents its prevalence among the cluster of words making up the topic. We use the 6 most distinguishing topics and place them on the perimeter of the word clouds for words and phrases. This way, a single figure gives a comprehensive view of the most distinguishing words, phrases, and topics for any given variable of interest. See Figure 3 for an example.

To reduce the redundancy of results, we automatically prune language features containing information already provided by a feature with higher correlation. First, we sort language features in order of their correlation with a target variable (such as a personality trait). Then, for phrases, we use frequency as a proxy for informative value [80], and only include additional phrases if they contain more informative words than previously included phrases with matching words. For example, consider the phrases 'day', 'beautiful day', and 'the day', listed in order of correlation from greatest to least; 'beautiful day' would be kept, because 'beautiful' is less frequent than 'day' (i.e., it adds informative value), while 'the day' would be dropped because 'the' is more frequent than 'day' (thus it contributes no more information than we get from 'day' alone). We do a similar pruning for topics: a lower-ranking topic is not displayed if more than 25% of its top 15 words are also contained in the top 15 words of a higher-ranking topic. These discarded relationships are still statistically significant, but removing them provides more room in the visualizations for other significant results, making the visualization as a whole more meaningful.
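As a sketch of the correlation-scaled rendering described above, the snippet below uses the third-party wordcloud package: word sizes are driven by (invented) correlation strengths rather than frequencies, and a custom color function darkens more frequent words.

```python
from wordcloud import WordCloud

# Invented correlation strengths (drive word size) and relative
# frequencies (drive darkness) for a handful of features.
corr = {"party": 0.12, "love_you": 0.10, "boys": 0.08, "ladies": 0.07}
freq = {"party": 0.9, "love_you": 0.7, "boys": 0.4, "ladies": 0.2}

def color_by_frequency(word, font_size, position, orientation,
                       random_state=None, **kwargs):
    """Darker grey for more frequently used words."""
    grey = int(200 * (1.0 - freq.get(word, 0.0)))
    return f"rgb({grey},{grey},{grey})"

cloud = WordCloud(width=600, height=400, background_color="white",
                  prefer_horizontal=1.0, color_func=color_by_frequency)
cloud.generate_from_frequencies(corr)  # size ~ correlation, not frequency
cloud.to_file("extraversion_cloud.png")
```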
Word clouds allow one to easily view the features most correlated with polar outcomes; we use other visualizations to display the variation in correlation of language features with continuous or ordinal dependent variables such as age. A standard time-series plot works well, where the horizontal axis is the dependent variable and the vertical axis represents the standard score of the values produced from feature extraction. When plotting language as a function of age, we fit first-order LOESS regression lines [81] with age as the x-axis data and standardized frequency as the y-axis data over all users. We are able to adjust for gender in the regression model by including it as a covariate when training the LOESS model and then using a neutral gender value when plotting.
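A hedged sketch of this plot follows. statsmodels' lowess does not accept covariates, so here the linear gender effect is regressed out before smoothing against age, which approximates (but is not identical to) training LOESS with gender as a covariate; the data are synthetic.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(2)
n = 2000
age = rng.integers(13, 65, n).astype(float)
gender = rng.integers(0, 2, n).astype(float)
# Synthetic standardized topic frequency: rises with age, shifted by gender.
z_freq = 0.03 * age + 0.4 * gender + rng.normal(size=n)

# Remove the linear gender effect, recentering at a neutral gender value,
# then smooth the adjusted frequencies against age.
fit = sm.OLS(z_freq, sm.add_constant(gender)).fit()
adjusted = z_freq - fit.params[1] * (gender - gender.mean())

smoothed = lowess(adjusted, age, frac=0.5)  # sorted (age, value) pairs
plt.plot(smoothed[:, 0], smoothed[:, 1])
plt.xlabel("age")
plt.ylabel("standardized topic frequency")
plt.savefig("topic_by_age.png")
```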
Data Set: Facebook Status Updates

Our complete dataset consists of approximately 19 million Facebook status updates written by 136,000 participants. Participants volunteered to share their status updates as part of the myPersonality application, where they also took a variety of questionnaires [12]. We restrict our analysis to those Facebook users meeting certain criteria: they must indicate English as a primary language, have written at least 1,000 words in their status updates, be less than 65 years old (to avoid the non-representative sample above 65), and indicate both gender and age (for use as controls). This resulted in N = 74,941 volunteers, writing a total of 309 million words (700 million feature instances of words, phrases, and topics) across 15.4 million status updates. In this sample, each person wrote an average of 4,129 words over 206 status updates, and thus 20 words per update. Depending on the target variable, this number varies slightly, as indicated in the caption of each result.

The personality scores are based on the International Personality Item Pool proxy for the NEO Personality Inventory Revised (NEO-PI-R) [14,82]. Participants could take 20 to 100 item versions of the questionnaire, with a retest reliability of α > 0.80 [12]. With the addition of the gender and age variables, this resulted in seven total dependent variables studied in this work, which are depicted in Table 1 along with summary statistics. Personality distributions are quite typical, with means near zero and standard deviations near 1. The statuses ranged over 34 months, from January 2009 through October 2011. Previously, profile information (i.e. network metrics, relationship status) from users in this dataset has been linked with personality [83], but this is the first use of its status updates.

Table 1. Summary statistics for gender, age, and the five factor model of personality.

                    N      mean   std. dev.  skewness
Gender              74859   0.62  0.49       -0.50
Age                 74859  23.43  8.96        1.77
Extraversion        72709  -0.07  1.01       -0.34
Agreeableness       72772   0.03  1.00       -0.40
Conscientiousness   72781  -0.04  1.01       -0.09
Neuroticism         71968   0.14  1.04       -0.21
Openness            72809   0.12  0.97       -0.48

These represent the seven dependent variables studied in this work. Gender ranged from 0 (male) to 1 (female). Age ranged from 13 to 65. Personality questionnaires produce values along a standardized continuum. doi:10.1371/journal.pone.0073791.t001
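Expressed as a pandas filter, the inclusion criteria above look roughly like this; the column names are hypothetical, since the myPersonality schema is not reproduced in the paper.

```python
import pandas as pd

# Hypothetical schema; the actual myPersonality tables differ.
users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "primary_language": ["en", "en", "de"],
    "total_words": [4129, 800, 5000],
    "age": [24.0, 30.0, 70.0],
    "gender": [1.0, 0.0, 1.0],
})

eligible = users[
    (users.primary_language == "en")  # English as a primary language
    & (users.total_words >= 1000)     # at least 1,000 words written
    & (users.age < 65)                # under 65 years old
    & users.age.notna()               # age reported (used as a control)
    & users.gender.notna()            # gender reported (used as a control)
]
print(eligible.user_id.tolist())
```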
Results

Results of our analyses over gender, age, and personality are presented below. As a baseline, we first replicate the commonly used LIWC analysis on our data set. We then present our main results, the output of our method, DLA. Lastly, we explore empirical evidence that open-vocabulary features provide more information than those from an a priori lexicon through their use in a predictive model.

Closed Vocabulary

Figure 2 shows the results of applying the LIWC lexicon to our dataset, side by side with the most comprehensive previous studies we could find for gender, age, and personality [27,30,34]. In our case, correlation results are β values from an ordinary least squares linear regression in which we adjust for gender and age to give the unique effect of the target variable. One should keep in mind that effect sizes tend to be smaller (and more stable) as sample sizes increase [84].

Even though the previous studies listed did not look at Facebook, a majority of the correlations we find agree in direction. Some of the largest correlations emerge for the LIWC articles category, which consists of determiners like 'the', 'a', 'an' and serves as a proxy for the use of more nouns. Articles are highly predictive of males, being older, and openness. As a content-related language variable, the anger category also proved highly predictive for males as well as younger individuals, those low in agreeableness and conscientiousness, and those high in neuroticism. Openness had the least agreement with the comparison study; roughly half of our results were in the opposite direction from the prior work. This is not too surprising, since openness exhibits the most variation across conditions of other studies (for examples, see [25,27,65]), and its component traits are most loosely related [85].

Figure 2. Correlation values of LIWC categories with gender, age, and the five factor model of personality. d: effect sizes as Cohen's d values from Newman et al.'s study of gender (positive is female; ns = not significant at p < .001) [34]. β: standardized linear regression coefficients adjusted for sex, writing/talking, and experimental condition from Pennebaker and Stone's study of age (ns = not significant at p < .05) [30]. r: Spearman correlation values from Yarkoni's study of personality (ns = not significant at p < .05) [27]. our β: standardized multivariate regression coefficients adjusted for gender and age for the current study over Facebook (ns = not significant at Bonferroni-corrected p < .001). doi:10.1371/journal.pone.0073791.g002

Open Vocabulary

Our DLA method identifies the most distinguishing language features (words; phrases: sequences of 1 to 3 words; or topics: clusters of semantically related words) for any given attribute. Results progress from a one-variable proof of concept (gender), to the multiple variables representing age groups, and finally to all 5 dimensions of personality.

Language of Gender. Gender provides a familiar and easy-to-understand proof of concept for open-vocabulary analysis. Figure 3 presents word clouds from age-adjusted gender correlations. We scale word size according to the strength of the relation, and we use color to represent overall frequency; that is, larger words indicate stronger correlations, and darker colors indicate frequently used words. For the topics (groups of semantically related words), the size indicates the relative prevalence of the word within the cluster, as defined in the methods section. All results are significant at Bonferroni-corrected [76] p < 0.001.

Figure 3. Words, phrases, and topics most highly distinguishing females and males. Female language features are shown on top while males are below. Size of the word indicates the strength of the correlation; color indicates relative frequency of usage. Underscores (_) connect words of multiword phrases. Words and phrases are in the center; topics, represented as the 15 most prevalent words, surround. (N = 74,859: 46,412 females and 28,247 males; correlations adjusted for age; Bonferroni-corrected p < 0.001.) doi:10.1371/journal.pone.0073791.g003

Many strong results emerging from our analysis align with our LIWC results and past studies of gender. For example, females used more emotion words [86,87] (e.g., 'excited') and first-person singulars [88], and they mention more psychological and social processes [34] (e.g., 'love you' and '<3', a heart). Males used more swear words and object references (e.g., 'xbox') [34,89].

One might also draw insights based on the gender results. For example, we noticed 'my wife' and 'my girlfriend' emerged as strongly correlated in the male results, while simply 'husband' and 'boyfriend' were most predictive for females. Investigating the frequency data revealed that males did in fact precede such references to their opposite-sex partner with 'my' significantly more often than females. On the other hand, females were more likely to precede 'husband' or 'boyfriend' with 'her' or 'amazing' and a greater variety of words, which is why 'my husband' was not more predictive than 'husband' alone. Furthermore, this suggests the male preference for the possessive 'my' is at least partially due to a lack of talking about others' partners.

Other results of ours contradicted past studies, which were based upon significantly smaller sample sizes than ours. For example, in 100 bloggers Huffaker et al. [39] found males use more emoticons than females. We calculated power analyses to determine the sample size needed to confidently find such significant results. Since the Bonferroni correction we use elsewhere in this work is overly stringent (i.e. it makes it harder than necessary to pass significance tests), for this result we applied the Benjamini-Hochberg false discovery rate procedure for multiple hypothesis testing [90]. Rerunning our language-of-gender analysis on reduced random samples of our subjects resulted in the following numbers of significant correlations (Benjamini-Hochberg tested p < 0.001): 50 subjects: 0 significant correlations; 500 subjects: 7 correlations; 5,000 subjects: 1,489 correlations; 50,000 subjects: 13,152 correlations (more detailed results of power analyses across gender, age, and personality can be found in Figure S1). Thus, traditional study sample sizes, which are closer to 50 or 500, are not powerful enough to do data-driven DLA over individual words.
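The resampling procedure can be sketched as follows, with plain Pearson correlations standing in for the covariate-adjusted regressions of the full analysis, and synthetic data in place of the Facebook features:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

def n_significant(features, outcome, alpha=0.001):
    """Count features surviving Benjamini-Hochberg FDR at the given level."""
    pvals = [stats.pearsonr(features[:, j], outcome)[1]
             for j in range(features.shape[1])]
    reject, _, _, _ = multipletests(pvals, alpha=alpha, method="fdr_bh")
    return int(reject.sum())

rng = np.random.default_rng(3)
N, k = 50_000, 200
features = rng.random((N, k))
outcome = features @ (rng.random(k) * 0.02) + rng.normal(size=N)  # weak signal
for n in (50, 500, 5_000, 50_000):
    idx = rng.choice(N, size=n, replace=False)
    print(n, "subjects:", n_significant(features[idx], outcome[idx]))
```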
Language of Age. Figure 4 shows the word clouds (center) and most discriminating topics (surrounding) for four age buckets chosen with regard to the distribution of ages in our sample (Facebook has many more young people). We see clear distinctions, such as the use of slang, emoticons, and Internet speak in the youngest group (e.g. ':)', 'idk', and a couple of Internet speak topics), or work appearing in the 23 to 29 age group (e.g. 'at work', 'new job', and a job position topic). We also find subtle changes in topics progressing from one age group to the next. For example, we see a school-related topic for 13 to 18 year olds (e.g. 'school', 'homework', 'ugh'), while we see a college-related topic for 19 to 22 year olds (e.g. 'semester', 'college', 'register'). Additionally, consider the drunk topic (e.g. 'drunk', 'hangover', 'wasted') that appears for 19 to 22 year olds and the more reserved beer topic (e.g. 'beer', 'drinking', 'ale') for 23 to 29 year olds.

Figure 4. Words, phrases, and topics most distinguishing subjects aged 13 to 18, 19 to 22, 23 to 29, and 30 to 65. Ordered from top to bottom: 13 to 18, 19 to 22, 23 to 29, and 30 to 65. Words and phrases are in the center; topics, represented as the 15 most prevalent words, surround. (N = 74,859; correlations adjusted for gender; Bonferroni-corrected p < 0.001.) doi:10.1371/journal.pone.0073791.g004

In general, we find a progression of school, college, work, and family when looking at the predominant topics across all age groups. DLA may be valuable for the generation of hypotheses about life span developmental age differences. Figure 5A shows the relative frequency of the most discriminating topic for each age group as a function of age. Typical concerns peak at different ages, with the topic concerning relationships (e.g. 'son', 'daughter', 'father', 'mother') continuously increasing across the life span. On a similar note, Figure 5C shows that 'we' increases approximately linearly after the age of 22, whereas 'I' monotonically decreases. We take this as a proxy for social integration [19], suggesting the increasing importance of friendships and relationships as people age. Figure 5B reinforces this hypothesis by presenting a similar pattern based on other social topics. One limitation of our dataset is the rarity of older individuals using social media; we look forward to a time in which we can track fine-grained language differences across the entire lifespan.

Figure 5. Standardized frequency of topics and words across age. A. Standardized frequency of the best topic for each of the 4 age groups. Grey vertical lines divide groups: 13 to 18 (black: n = 25,467 out of N = 74,859), 19 to 22 (green: n = 21,687), 23 to 29 (blue: n = 14,656), and 30+ (red: n = 13,049). Lines are fit from first-order LOESS regression [81] controlled for gender. B. Standardized frequency of social topic use across age. C. Standardized 'I' and 'we' frequencies across age. doi:10.1371/journal.pone.0073791.g005

Language of Personality. We created age- and gender-adjusted word clouds for each personality factor based on around 72 thousand participants with at least 1,000 words across their Facebook status updates who took a Big Five questionnaire [91]. Figure 6 shows word clouds for extraversion and neuroticism. (See Figure S2 for openness, conscientiousness, and agreeableness.) The dominant words in each cluster were consistent with prior lexical and questionnaire work [14]. For example, extraverts were more likely to mention social words such as 'party', 'love you', 'boys', and 'ladies', whereas introverts were more likely to mention words related to solitary activities such as 'computer', 'Internet', and 'reading'. In the openness cloud, words such as 'music', 'art', and 'writing' (i.e., creativity), and 'dream', 'universe', and 'soul' (i.e., imagination) were discriminating [85].

Topics were also found reflecting similar concepts as the words, some of which would not have been captured with LIWC. For example, although LIWC has socially related categories, it does not contain a party topic, which emerges as a key distinguishing feature for extraverts. Topics related to other types of social events are listed elsewhere, such as a sports topic for low neuroticism (emotional stability). Additionally, Figure 6 shows the advantage of having phrases in the analysis to get a clearer signal: e.g. people high in neuroticism mentioned 'sick of', and not just 'sick'.

Figure 6. Words, phrases, and topics most distinguishing extraversion from introversion and neuroticism from emotional stability. A. Language of extraversion (left, e.g., 'party') and introversion (right, e.g., 'computer'); N = 72,709. B. Language distinguishing neuroticism (left, e.g., 'hate') from emotional stability (right, e.g., 'blessed'); N = 71,968. (Adjusted for age and gender; Bonferroni-corrected p < 0.001.) Figure S2 contains results for openness, conscientiousness, and agreeableness. doi:10.1371/journal.pone.0073791.g006

While many of our results confirm previous research, demonstrating the instrument's face validity, our word clouds also suggest new hypotheses. For example, Figure 6 (bottom-right) shows language related to emotional stability (low neuroticism). Emotionally stable individuals wrote about enjoyable social activities that may foster greater emotional stability, such as 'sports', 'vacation', 'beach', 'church', 'team', and a family time topic. Additionally, results suggest that introverts are interested in Japanese media (e.g. 'anime', 'manga', 'japanese', Japanese-style emoticons: ^_^, and an anime topic) and that those low in openness drive the use of shorthands in social media (e.g. '2day', 'ur', 'every 1'). Although these are only language correlations, they show how open-vocabulary analyses can illuminate areas to explore further.

Predictive Evaluation

Here we present a quantitative evaluation of open-vocabulary and closed-vocabulary language features. Although we have thus far presented subjective evidence that open-vocabulary features contribute more information, here we test empirically whether the inclusion of open-vocabulary features leads to prediction accuracies above and beyond those of closed-vocabulary features. We randomly sampled 25% of our participants as test data, and used the remaining 75% as training data to build our predictive models.

We use a linear support vector machine (SVM) [92] for classifying the binary variable of gender, and ridge regression [93] for predicting age and each factor of personality. Features were first run through principal component analysis to reduce the feature dimension to half of the number of users. Both SVM classification and ridge regression utilize a regularization parameter, which we set by validation over the training set (we defined a small validation set of 10% of the training set, over which we tested various regularization parameters while fitting the model to the other 90% of the training set, in order to select the best parameter). Thus, the predictive model is created without any outcome information outside of the training data, making the test data an out-of-sample evaluation.

As open-vocabulary features, we use the same units of language as DLA: words and phrases (n-grams of size 1 to 3, passing a collocation filter) and topics. These features are outlined precisely under the "Linguistic Feature Extraction" section presented earlier. As explained in that section, we use Anscombe-transformed relative frequencies of words and phrases and the conditional probability of a topic given a subject. For closed-vocabulary features, we use the LIWC categories of language, calculated as the relative frequency of a user mentioning a word in the category given their total word usage. We do not provide our models with anything other than these language usage features (independent variables) for prediction, and we use all features (not just those passing significance tests from DLA).
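The following is a minimal scikit-learn sketch of this pipeline on synthetic data: PCA for dimensionality reduction, a linear SVM for gender, ridge regression for continuous outcomes, and a 10% validation slice of the training set to pick the regularization parameter. The hyperparameter grids and data are invented for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(4)
n, d = 1000, 300
X = rng.random((n, d))                                   # language features
gender = (X[:, 0] + rng.normal(0, 0.3, n) > 0.5).astype(int)
age = 40 * X[:, 1] + 13 + rng.normal(0, 3, n)

X_train, X_test, g_train, g_test, a_train, a_test = train_test_split(
    X, gender, age, test_size=0.25, random_state=0)      # 25% held-out test

# Reduce dimension (the paper used half the number of training users).
pca = PCA(n_components=min(len(X_train) // 2, d)).fit(X_train)
Z_train, Z_test = pca.transform(X_train), pca.transform(X_test)

# Pick regularization on a 10% validation slice of the training set.
Z_fit, Z_val, g_fit, g_val, a_fit, a_val = train_test_split(
    Z_train, g_train, a_train, test_size=0.10, random_state=0)
best_svm = max((LinearSVC(C=c, dual=False).fit(Z_fit, g_fit)
                for c in (0.01, 0.1, 1.0)),
               key=lambda m: m.score(Z_val, g_val))
best_ridge = max((Ridge(alpha=alpha).fit(Z_fit, a_fit)
                  for alpha in (0.1, 1.0, 10.0)),
                 key=lambda m: m.score(Z_val, a_val))

print("gender accuracy:", best_svm.score(Z_test, g_test))
# R = square root of the coefficient of determination, as in Table 2.
print("age R:", np.sqrt(max(best_ridge.score(Z_test, a_test), 0.0)))
```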
As shown in Table 2, models created with open-vocabulary features significantly (p < 0.01) outperformed those created based on LIWC features. The topics results are of particular interest, because these automatically clustered word-category lexica were not created with any human or psychological data, only knowledge of which words occurred in messages together. Furthermore, we see that a model which includes LIWC features on top of the open-vocabulary words, phrases, and topics does not result in any improvement, suggesting that the open-vocabulary features capture predictive information which fully supersedes LIWC.

Table 2. Comparison of LIWC and open-vocabulary features within predictive models of gender, age, and personality.

features                      Gender    Age  Extra-    Agreeable- Conscien-  Neuro-    Open-
                              accuracy  R    version R ness R     tious. R   ticism R  ness R
LIWC                          78.4%     .65  .27       .25        .29        .21       .29
Topics                        87.5%     .80  .32       .29        .33        .28       .38
WordPhrases                   91.4%     .83  .37       .29        .34        .29       .41
WordPhrases + Topics          91.9%     .84  .38       .31        .35        .31       .42
Topics + LIWC                 89.2%     .80  .33       .29        .33        .28       .38
WordPhrases + LIWC            91.6%     .83  .38       .30        .34        .30       .41
WordPhrases + Topics + LIWC   91.9%     .84  .38       .31        .35        .31       .42

accuracy: percent predicted correctly (for discrete binary outcomes). R: square root of the coefficient of determination (for sequential/continuous outcomes). LIWC: a priori word categories from Linguistic Inquiry and Word Count. Topics: automatically created LDA topic clusters. WordPhrases: words and phrases (n-grams of size 1 to 3 passing a collocation filter). Bold indicates significant (p < .01) improvement over the baseline set of features (use of LIWC alone). doi:10.1371/journal.pone.0073791.t002

For personality we saw the largest relative improvement between open-vocabulary approaches and LIWC. Our best personality R score of 0.42 fell just above the standard "correlational upper limit" for behavior to predict personality (a Pearson correlation of 0.3 to 0.4) [94,95]. Some researchers have discretized the personality scores for prediction, classifying people as being high or low in each trait (one standard deviation above or below the mean, or top and bottom quartiles, throwing out the middle) [61,64,67]. When we take such an approach, our scores are in similar ranges to that literature: 65% to 79% classification accuracy. Of course, such a high/low model cannot directly be used for classifying unlabeled people, as one would also need to know who fits in the middle. Regression is a more appropriate predictive task for continuous outcomes like age and personality, even though R scores are naturally smaller than binary classification accuracies.
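For comparison with that literature, the discretized evaluation can be sketched like this (the ±1 standard deviation variant, on synthetic scores; the median threshold on predictions is our own simplification):

```python
import numpy as np

def high_low_labels(scores):
    """Return a mask of subjects beyond +/-1 SD and their high/low labels,
    discarding the middle of the distribution."""
    z = (scores - scores.mean()) / scores.std()
    mask = np.abs(z) >= 1.0
    return mask, (z[mask] > 0).astype(int)

rng = np.random.default_rng(5)
trait = rng.normal(size=1000)                       # synthetic trait scores
predicted = trait + rng.normal(0, 1.2, size=1000)   # noisy model predictions

mask, truth = high_low_labels(trait)
# Simplification: split predictions at their median over the retained tails.
pred_labels = (predicted[mask] > np.median(predicted[mask])).astype(int)
print(f"high/low accuracy on the tails: {(pred_labels == truth).mean():.2f}")
```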
In addition to demonstrating the greater informative value of open-vocabulary features, we found our results to be state-of-the-art. The highest previous out-of-sample accuracy for gender prediction based entirely on language was 88.0%, over Twitter data [68], while our classifiers reach an accuracy of 91.9%. Our increased performance could be attributed to our set of language features, a strong predictive algorithm (the support vector machine), and the large sample of Facebook data.

Discussion

Online social media such as Facebook are a particularly promising resource for the study of people, as ''status'' updates are self-descriptive, personal, and have emotional content [7]. Language use is objective and quantifiable behavioral data [96], and unlike surveys and questionnaires, Facebook language allows researchers to observe individuals as they freely present themselves in their own words. Differential language analysis (DLA) in social media is an unobtrusive and non-reactive window into the social and psychological characteristics of people's everyday concerns.

Most studies linking language with psychological variables rely on a priori fixed sets of words, such as the LIWC categories carefully constructed over 20 years of human research [11]. Here, we show the benefits of an open-vocabulary approach in which the words analyzed are based on the data itself. We extracted words, phrases, and topics (automatically clustered sets of words) from millions of Facebook messages and found the language that correlates most with gender, age, and five factors of personality. We discovered insights not found previously and achieved higher accuracies than LIWC when using our open-vocabulary features in a predictive model, achieving state-of-the-art accuracy in the case of gender prediction.

Exploratory analyses like DLA change the process from testing theories with observations to data-driven identification of new connections [97,98]. Our intention here is not a complete replacement for closed-vocabulary analyses like LIWC. When one has a specific theory in mind or a small sample size, an a priori list of words can be ideal; in an open-vocabulary approach, the concept one cares about can be drowned out by more predictive concepts. Further, it may be easier to compare static a priori categories of words across studies. However, automatically clustering words into coherent topics allows one to discover categories that might not have been anticipated (e.g. sports teams, kinds of outdoor exercise, or Japanese cartoons). Open-vocabulary approaches also save labor in creating categories: they consider all words encountered, and thus adapt well to the evolving language of social media and other genres. They are also transparent, in that the exact words driving correlations are not hidden behind a level of abstraction. Given lots of text and dependent variables, an open-vocabulary approach like DLA can be immediately useful for many areas of study; for example, an economist contrasting sport-utility with hybrid vehicle drivers, a political scientist comparing Democrats and Republicans, or a cardiologist differentiating people with positive versus negative outcomes of heart disease.

Like most studies in the social sciences, this work is still subject to sampling and social desirability biases. Language connections with psychosocial variables are often dependent on context [40]. Here, we examined language in a large sample of the broad context of Facebook. Under different contexts, it is likely some results would differ.
Still, the sample sizes and availability of demographic information afforded by social media bring us closer to a more ideal representative sample [99]. Our current results have face validity (subjects in high elevations talk about 'the mountains'), tie in with other research (neurotic people disproportionately use the word 'depressed'), suggest new hypotheses (an active life implies emotional stability), and give detailed insights (males prefer to precede 'wife' with the possessive 'my' more so than females precede 'husband' with 'my').

Over the past one-hundred years, surveys and questionnaires have illuminated our understanding of people. We suggest that new multipurpose instruments such as DLA, emerging from the field of computational social science, shed new light on psychosocial phenomena.

Supporting Information

Figure S1 Power analyses for all outcomes examined in this work. Number of features passing a Benjamini-Hochberg false-discovery rate of p < 0.001 as a function of the number of users sampled, out of the maximum 24,530 words and phrases used by at least 1% of users. (TIF)

Figure S2 Words, phrases, and topics most distinguishing agreeableness, conscientiousness, and openness. A. Language of high agreeableness (left) and low agreeableness (right); N = 72,772. B. Language of high conscientiousness (left) and low conscientiousness (right); N = 72,781. C. Language of openness (left) and closed to experience (right); N = 72,809 (adjusted for gender and age, Bonferroni-corrected p < 0.001). (TIF)

Table S1 The 15 most prevalent words for the 2000 automatically generated topics used in our study. All topics available here: wwbp.org/public_data/2000topics.top20freqs.keys.csv. (XLS)

Table S2 Prediction results when selecting features via differential language analysis. accuracy: percent predicted correctly (for discrete binary outcomes). R: Square-root of the coefficient of determination (for sequential/continuous outcomes). LIWC: A priori word-categories from Linguistic Inquiry and Word Count. Topics: Automatically created LDA topic clusters. WordPhrases: words and phrases (n-grams of size 1 to 3 passing a collocation filter). Bold indicates significant (p < .01) improvement over the baseline set of features (use of LIWC alone). Differential language analysis was run over the training set, and only those features significant at Bonferroni-corrected p < 0.001 were included during training and testing. No controls were used so as to be consistent with the evaluation in the main paper, and so one could consider this a univariate feature selection. On average, results are just below those of not using differential language analysis to select features, but there is no significant difference. (PDF)

Acknowledgments

We would like to thank Greg Park, Angela Duckworth, Adam Croom, Molly Ireland, Paul Rozin, Eduardo Blanco, and our other colleagues in the Positive Psychology Center and Computer & Information Science department for their valuable feedback regarding this work.

Author Contributions

Conceived and designed the experiments: HAS JCE MLK LHU. Performed the experiments: HAS LD. Analyzed the data: HAS JCE LD SMR MA AS. Contributed reagents/materials/analysis tools: MK DS. Wrote the paper: HAS JCE MLK DS MEPS LHU.

References

1. Lazer D, Pentland A, Adamic L, Aral S, Barabasi AL, et al. (2009) Computational social science. Science 323: 721–723.
2. Weinberger S (2011) Web of war: Can computational social science help to prevent or win wars? The Pentagon is betting millions of dollars on the hope that it will. Nature 471: 566–568.
3. Miller G (2011) Social scientists wade into the tweet stream. Science 333: 1814–1815.
4. Facebook (2012) Facebook company info: Fact sheet website. Available: http://newsroom.fb.com. Accessed 2012 Dec.
5. Golder S, Macy M (2011) Diurnal and seasonal mood vary with work, sleep, and daylength across diverse cultures. Science 333: 1878–1881.
6. Bollen J, Mao H, Zeng X (2011) Twitter mood predicts the stock market. Journal of Computational Science 2: 1–8.
7. Kramer A (2010) An unobtrusive behavioral model of gross national happiness. In: Proc of the 28th int conf on Human factors in comp sys. ACM, pp. 287–290.
8. Dodds PS, Harris KD, Kloumann IM, Bliss CA, Danforth CM (2011) Temporal patterns of happiness and information in a global social network: Hedonometrics and twitter. PLoS ONE 6: 26.
9. Ginsberg J, Mohebbi M, Patel R, Brammer L, Smolinski M, et al. (2009) Detecting influenza epidemics using search engine query data. Nature 457: 1012–1014.
10. Michel JB, Shen YK, Aiden AP, Veres A, Gray MK, et al. (2011) Quantitative analysis of culture using millions of digitized books. Science 331: 176–182.
11. Pennebaker JW, Chung CK, Ireland M, Gonzales A, Booth RJ (2007) The development and psychometric properties of LIWC2007. The University of Texas at Austin. LIWC.NET 1: 1–22.
12. Kosinski M, Stillwell D, Graepel T (2013) Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences (PNAS).
13. Goldberg LR (1990) An alternative ''description of personality'': the big-five factor structure. J Pers and Soc Psychol 59: 1216–1229.
14. McCrae RR, John OP (1992) An introduction to the five-factor model and its applications. Journal of Personality 60: 175–215.
15. Norman W (1963) Toward an adequate taxonomy of personality attributes: Replicated factor structure in peer nomination personality ratings. The Journal of Abnormal and Social Psychology 66: 574.
16. Digman J (1990) Personality structure: Emergence of the five-factor model. Annual Review of Psychology 41: 417–440.
17. Stone P, Dunphy D, Smith M (1966) The General Inquirer: A Computer Approach to Content Analysis. MIT Press.
18. Coltheart M (1981) The MRC psycholinguistic database. The Quarterly Journal of Experimental Psychology 33: 497–505.
19. Pennebaker JW, Mehl MR, Niederhoffer KG (2003) Psychological aspects of natural language use: our words, our selves. Annual Review of Psychology 54: 547–577.
20. Tausczik Y, Pennebaker J (2010) The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology 29: 24–54.
21. Pennebaker J, King L (1999) Linguistic styles: language use as an individual difference. Journal of Personality and Social Psychology 77: 1296.
22. Mehl M, Gosling S, Pennebaker J (2006) Personality in its natural habitat: manifestations and implicit folk theories of personality in daily life. Journal of Personality and Social Psychology 90: 862.
23. Gosling S, Vazire S, Srivastava S, John O (2004) Should we trust web-based studies? A comparative analysis of six preconceptions about internet questionnaires. American Psychologist 59: 93.
24. Back M, Stopfer J, Vazire S, Gaddis S, Schmukle S, et al. (2010) Facebook profiles reflect actual personality, not self-idealization. Psychological Science 21: 372–374.
25. Sumner C, Byers A, Shearing M (2011) Determining personality traits & privacy concerns from facebook activity. In: Black Hat Briefings. pp. 1–29.
26. Holtgraves T (2011) Text messaging, personality, and the social context. Journal of Research in Personality 45: 92–99.
27. Yarkoni T (2010) Personality in 100,000 words: A large-scale analysis of personality and word use among bloggers. Journal of Research in Personality 44: 363–373.
28. Chung C, Pennebaker J (2008) Revealing dimensions of thinking in open-ended self-descriptions: An automated meaning extraction method for natural language. Journal of Research in Personality 42: 96–132.
29. Kramer A, Chung K (2011) Dimensions of self-expression in facebook status updates. In: Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media. pp. 169–176.
30. Pennebaker J, Stone L (2003) Words of wisdom: Language use over the life span. Journal of Personality and Social Psychology 85: 291.
31. Chung C, Pennebaker J (2007) The psychological function of function words. Social Communication: Frontiers of Social Psychology: 343–359.
32. Argamon S, Koppel M, Pennebaker J, Schler J (2007) Mining the blogosphere: age, gender, and the varieties of self-expression. First Monday 12.
33. Argamon S, Koppel M, Fine J, Shimoni A (2003) Gender, genre, and writing style in formal written texts. Text 23: 3.
34. Newman M, Groom C, Handelman L, Pennebaker J (2008) Gender differences in language use: An analysis of 14,000 text samples. Discourse Processes 45: 211–236.
35. Mukherjee A, Liu B (2010) Improving gender classification of blog authors. In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pp. 207–217.
36. Rao D, Yarowsky D, Shreevats A, Gupta M (2010) Classifying latent user attributes in twitter. In: Proceedings of the 2nd International Workshop on Search and Mining User-Generated Contents. ACM, pp. 37–44.
37. Schler J, Koppel M, Argamon S, Pennebaker J (2006) Effects of age and gender on blogging. In: Proceedings of the 2006 AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs. pp. 199–205.
38. Burger J, Henderson J, Kim G, Zarrella G (2011) Discriminating gender on twitter. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pp. 1301–
39. Huffaker DA, Calvert SL (2005) Gender, identity, and language use in teenage blogs. Journal of Computer-Mediated Communication 10: 1–10.
40. Eckert P (2008) Variation and the indexical field. Journal of Sociolinguistics 12: 453–476.
41. Eisenstein J, Smith NA, Xing EP (2011) Discovering sociolinguistic associations with structured sparsity. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1. Association for Computational Linguistics, pp. 1365–1374.
42. O'Connor B, Bamman D, Smith NA (2011) Computational text analysis for social science: Model assumptions and complexity. Public Health 41: 43.
43. Grimmer J, Stewart BM (2013) Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis.
44. Monroe BL, Colaresi MP, Quinn KM (2008) Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis 16: 372–403.
45. Gilbert E (2012) Phrases that signal workplace hierarchy. In: Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work. ACM, pp. 1037–1046.
46. Tausczik Y, Pennebaker J (2010) The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology 29: 24.
47. Holmes D (1994) Authorship attribution. Computers and the Humanities 28: 87–106.
48. Argamon S, Šarić M, Stein SS (2003) Style mining of electronic messages for multiple authorship discrimination: first results. In: KDD '03: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM, pp. 475–480.
49. Stamatatos E (2009) A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology 60: 538–556.
50. Alm C, Roth D, Sproat R (2005) Emotions from text: machine learning for text-based emotion prediction. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pp. 579–586.
51. Mihalcea R, Liu H (2006) A corpus-based approach to finding happiness. In: Proceedings of the AAAI Spring Symposium on Computational Approaches to Weblogs. p. 19.
52. Jurafsky D, Ranganath R, McFarland D (2009) Extracting social meaning: Identifying interactional style in spoken conversation. In: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, pp. 638–646.
53. Ranganath R, Jurafsky D, McFarland D (2009) It's not you, it's me: detecting flirting and its misperception in speed-dates. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1. Association for Computational Linguistics, pp. 334–342.
54. Pang B, Lee L, Vaithyanathan S (2002) Thumbs up? Sentiment classification using machine learning techniques. In: Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 79–86.
55. Kim SM, Hovy E (2004) Determining the sentiment of opinions. In: Proceedings of the 20th International Conference on Computational Linguistics. Stroudsburg, PA, USA: Association for Computational Linguistics, COLING '04.
56. Wilson T, Wiebe J, Hoffmann P (2009) Recognizing contextual polarity: An exploration of features for phrase-level sentiment analysis. Computational Linguistics 35: 399–433.
57. Baccianella S, Esuli A, Sebastiani F (2010) SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In: Calzolari N (Conference Chair), Choukri K, Maegaard B, Mariani J, Odijk J, et al., editors, Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10). Valletta, Malta: European Language Resources Association (ELRA).
58. Laver M, Benoit K, Garry J (2003) Extracting policy positions from political texts using words as data. American Political Science Review 97: 311–331.
59. Monroe BL, Maeda K (2004) Talk's cheap: Text-based estimation of rhetorical ideal-points. In: Annual Meeting of the Society for Political Methodology. pp. 29–31.
60. Slapin JB, Proksch SO (2008) A scaling model for estimating time-series party positions from texts. American Journal of Political Science 52: 705–722.
61. Argamon S, Dhawle S, Koppel M, Pennebaker JW (2005) Lexical predictors of personality type. In: Proceedings of the Joint Annual Meeting of the Interface and the Classification Society.
62. Argamon S, Koppel M, Pennebaker JW, Schler J (2009) Automatically profiling the author of an anonymous text. Commun ACM 52: 119–123.
63. Mairesse F, Walker M (2006) Automatic recognition of personality in conversation. In: Proceedings of the Human Language Technology Conference of the NAACL. pp. 85–88.
64. Mairesse F, Walker M, Mehl M, Moore R (2007) Using linguistic cues for the automatic recognition of personality in conversation and text. Journal of Artificial Intelligence Research 30: 457–500.
65. Golbeck J, Robles C, Edmondson M, Turner K (2011) Predicting personality from twitter. In: Proc of the 3rd IEEE Int Conf on Soc Comput. pp. 149–156.
66. Sumner C, Byers A, Boochever R, Park G (2012) Predicting dark triad personality traits from twitter usage and a linguistic analysis of tweets. www.onlineprivacyfoundation.org.
67. Iacobelli F, Gill AJ, Nowson S, Oberlander J (2011) Large scale personality classification of bloggers. In: Proc of the 4th Int Conf on Affect Comput and Intel Interaction. Springer-Verlag, pp. 568–577.
68. Bamman D, Eisenstein J, Schnoebelen T (2012) Gender in twitter: Styles, stances, and social networks. arXiv preprint arXiv:1210.4567.
69. Church KW, Hanks P (1990) Word association norms, mutual information, and lexicography. Computational Linguistics 16: 22–29.
70. Lin D (1998) Extracting collocations from text corpora. In: Knowledge Creation Diffusion Utilization. pp. 57–63.
71. Anscombe FJ (1948) The transformation of Poisson, binomial and negative-binomial data. Biometrika 35: 246–254.
72. Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3: 993–1022.
73. Steyvers M, Griffiths T (2007) Probabilistic topic models. Handbook of Latent Semantic Analysis 427: 424–440.
74. Gelfand A, Smith A (1990) Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association 85: 398–409.
75. McCallum AK (2002) MALLET: A machine learning for language toolkit. Available: http://mallet.cs.umass.edu.
76. Dunn OJ (1961) Multiple comparisons among means. Journal of the American Statistical Association 56: 52–64.
77. Eisenstein J, O'Connor B, Smith N, Xing E (2010) A latent variable model for geographic lexical variation. In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pp. 1277–1287.
78. Wordle (2012) Wordle advanced website. Available: http://www.wordle.net/advanced. Accessed 2012 Dec.
79. Harris J (2011) Word clouds considered harmful. Available: http://www.niemanlab.org/2011/10/word-clouds-considered-harmful/.
80. Resnik P (1999) Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research 11: 95–130.
81. Cleveland WS (1979) Robust locally weighted regression and smoothing scatterplots. Journal of the Am Stat Assoc 74: 829–836.
82. Costa Jr P, McCrae R (2008) The Revised NEO Personality Inventory (NEO-PI-R). The SAGE Handbook of Personality Theory and Assessment 2: 179–198.
83. Bachrach Y, Kosinski M, Graepel T, Kohli P, Stillwell D (2012) Personality and patterns of facebook usage. Web Science.
84. Sterne J, Gavaghan D, Egger M (2000) Publication and related bias in meta-analysis: power of statistical tests and prevalence in the literature. J Clin Epidemiol 53: 1119–1129.
85. McCrae RR, Sutin AR (2009) Openness to experience. In: Handbook of Indiv Diff in Soc Behav. New York: Guilford. pp. 257–273.
86. Mulac A, Studley LB, Blau S (1990) The gender-linked language effect in primary and secondary students' impromptu essays. Sex Roles 23: 439–470.
87. Thomson R, Murachver T (2001) Predicting gender from electronic discourse. Brit J of Soc Psychol 40: 193–208.
88. Mehl MR, Pennebaker JW (2003) The sounds of social life: a psychometric analysis of students' daily social environments and natural conversations. J of Pers and Soc Psychol 84: 857–870.
89. Mulac A, Bradac JJ (1986) Male/female language differences and attributional consequences in a public speaking situation: Toward an explanation of the gender-linked language effect. Communication Monographs 53: 115–129.
90. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B (Methodological): 289–300.
91. Goldberg L, Johnson J, Eber H, Hogan R, Ashton M, et al. (2006) The international personality item pool and the future of public-domain personality measures. J of Res in Personal 40: 84–96.
92. Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9: 1871–1874.
93. Hoerl A, Kennard R (1970) Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12: 55–67.
94. Meyer G, Finn S, Eyde L, Kay G, Moreland K, et al. (2001) Psychological testing and psychological assessment: A review of evidence and issues. American Psychologist 56: 128.
95. Roberts B, Kuncel N, Shiner R, Caspi A, Goldberg L (2007) The power of personality: The comparative validity of personality traits, socioeconomic status, and cognitive ability for predicting important life outcomes. Perspectives on Psychological Science 2: 313–345.
96. Ireland ME, Mehl MR (2012) Natural language use as a marker of personality. In press: Oxford Handbook of Language and Social Psychology.
97. Haig B (2005) An abductive theory of scientific method. Psychological Methods 10: 371.
98. Fast L, Funder D (2008) Personality as manifest in word use: Correlations with self-report, acquaintance report, and behavior. Journal of Personality and Social Psychology 94: 334.
99. Gosling SD, Vazire S, Srivastava S, John OP (2004) Should we trust web-based studies? A comparative analysis of six preconceptions about internet questionnaires. American Psychologist 59: 93–104.

Abstract

We analyzed 700 million words, phrases, and topic instances collected from the Facebook messages of 75,000 volunteers, who also took standard personality tests, and found striking variations in language with personality, gender, and age. In our open-vocabulary technique, the data itself drives a comprehensive exploration of language that distinguishes people, finding connections that are not captured with traditional closed-vocabulary word-category analyses. Our analyses shed new light on psychosocial processes yielding results that are face valid (e.g., subjects living in high elevations talk about the mountains), tie in with other research (e.g., neurotic people disproportionately use the phrase ‘sick of’ and the word ‘depressed’), suggest new hypotheses (e.g., an active life implies emotional stability), and give detailed insights (males use the possessive ‘my’ when mentioning their ‘wife’ or ‘girlfriend’ more often than females use ‘my’ with ‘husband’ or ’boyfriend’). To date, this represents the largest study, by an order of magnitude, of language and personality. Citation: Schwartz HA, Eichstaedt JC, Kern ML, Dziurzynski L, Ramones SM, et al. (2013) Personality, Gender, and Age in the Language of Social Media: The Open- Vocabulary Approach. PLoS ONE 8(9): e73791. doi:10.1371/journal.pone.0073791 Editor: Tobias Preis, University of Warwick, United Kingdom Received January 23, 2013; Accepted July 29, 2013; Published September 25, 2013 Copyright:  2013 Schwartz et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Funding: Support for this research was provided by the Robert Wood Johnson Foundation’s Pioneer Portfolio, through a grant to Martin Seligman, ‘‘Exploring Concept of Positive Health’’. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing Interests: The authors have declared that no competing interests exist. * E-mail: hansens@sas.upenn.edu results. We call approaches like ours, which do not rely on a priori Introduction word or category judgments, open-vocabulary analyses. The social sciences have entered the age of data science, We use differential language analysis (DLA), our particular method leveraging the unprecedented sources of written language that of open-vocabulary analysis, to find language features across social media afford [1–3]. Through media such as Facebook and millions of Facebook messages that distinguish demographic and th Twitter, used regularly by more than 1/7 of the world’s psychological attributes. From a dataset of over 15.4 million population [4], variation in mood has been tracked diurnally Facebook messages collected from 75 thousand volunteers [12], we and across seasons [5], used to predict the stock market [6], and extract 700 million instances of words, phrases, and automatically leveraged to estimate happiness across time [7,8]. Search patterns generated topics and correlate them with gender, age, and on Google detect influenza epidemics weeks before CDC data personality. We replicate traditional language analyses by applying confirm them [9], and the digitization of books makes possible the Linguistic Inquiry and Word Count (LIWC) [11], a popular tool in quantitative tracking of cultural trends over decades [10]. To psychology, to our data set. 
Then, we show that open-vocabulary make sense of the massive data available, multidisciplinary analyses can yield additional insights (correlations between person- collaborations between fields such as computational linguistics ality and behavior as manifest through language) and more and the social sciences are needed. Here, we demonstrate an information (as measured through predictive accuracy) than instrument which uniquely describes similarities and differences traditional a priori word-category approaches. We present a word among groups of people in terms of their differential language use. cloud-based technique to visualize results of DLA. Our large set of Our technique leverages what people say in social media to find correlations is made available for others to use (available at: distinctive words, phrases, and topics as functions of known attributes http:www.wwbp.org/). of people such as gender, age, location, or psychological characteristics. The standard approach to correlating language Background use with individual attributes is to examine usage of a priori fixed sets of words [11], limiting findings to preconceived relationships This section outlines recent work linking language with with words or categories. In contrast, we extract a data-driven personality, gender, and age. In line with the focus of this paper, collection of words, phrases, and topics, in which the lexicon is based we predominantly discuss works which sought to gain psycholog- on the words of the text being analyzed. This yields a ical insights. However, we also touch on increasingly popular comprehensive description of the differences between groups of attempts at predicting personality from language in social media, people for any given attribute, and allows one to find unexpected which, for our study, offer an empirical means to compare a closed PLOS ONE | www.plosone.org 1 September 2013 | Volume 8 | Issue 9 | e73791 Personality, Gender, Age in Social Media Language vocabulary analysis (relying on a priori word category human The larger sample-sizes from social media also enabled the first judgments) and an open vocabulary analysis (not relying on a priori study exploring personality as a function of single-word use. word category judgments). Yarkoni investigated LIWC categories along with single words in Personality refers to the traits and characteristics that make an connection with Big-5 scores of 406 bloggers [27]. He identified individual unique. Although there are multiple ways to classify single word results which would not have been caught with LIWC, traits [13], we draw on the popular Five Factor Model (or ‘‘Big 5’’), such as ‘hug’ correlating positively with agreeableness (there is no which classifies personality traits into five dimensions: extraversion physical affection category inLIWC), but, considering the sparse (e.g., outgoing, talkative, active), agreeableness (e.g., trusting, kind, nature of words, 406 blogs does not result in comprehensive view. generous), conscientiousness (e.g., self-controlled, responsible, thor- For example, they find only 13 significant word correlations for ough), neuroticism (e.g., anxious, depressive, touchy), and openness conscientiousness while we find thousands even after Bonferonni- (e.g., intellectual, artistic, insightful) [14]. With work beginning correcting significance levels. 
Additionally, they did not control for over 50 years ago [15] and journals dedicated to it, the FFM is a age or gender although they reported roughly 75% of their well-accepted construct of personality [16]. subjects were female. Still, as the most thorough point of comparison for LIWC results with personality, Figure 2 presents Automatic Lexical Analysis of Personality, Gender, the findings from Yarkoni’s study along with LIWC results over our data. and Age Analogous to a personality construct, work has been done in By examining what words people use, researchers have long psychology looking at the latent dimensions of self-expression. sought a better understanding of human psychology [17–19]. As Chung and Pennebaker factor analyzed 119 adjectives used in Tauszczik & Pennebaker put it: student essays of ‘‘who you think you are’’ and discovered 7 latent dimensions labeled such as ‘‘sociability’’ or ‘‘negativity’’ [28]. They Language is the most common and reliable way for people were able to relate these factors to the Big-5 and found only weak to translate their internal thoughts and emotions into a form relations, suggesting 7 dimensions as an alternative construction. that others can understand. Words and language, then, are Later, Kramer and Chung ran the same method over 1000 unique the very stuff of psychology and communication [20]. words across Facebook status updates, finding three components labeled, ‘‘positive events’’, ‘‘informal speech’’, and ‘‘school’’ [29]. The typical approach to analyzing language involves counting Although their vocabulary size was somewhat limited, we still see word usage over pre-chosen categories of language. For example, these as previous examples of open-vocabulary language analyses one might place words like ‘nose’, ‘bones’, ‘hips’, ‘skin’, ‘hands’, for psychology – no assumptions were made on the categories of and ‘gut’ into a body lexicon, and count how often words in the words beyond part-of-speech. lexicon are used by extraverts or introverts in order to determine who LIWC has also been used extensively for studying gender and talks about the body more. Of such word-category lexica, the most age [21]. Many studies have focused on function words (articles, widely used is Linguistic Inquiry and Word Count or LIWC, prepositions, conjunctions, and pronouns), finding females use developed over the last couple decades by human judges more first-person singular pronouns, males use more articles, and designating categories for common words [11,19]. The 2007 that older individuals use more plural pronouns and future tense version of LIWC includes 64 different categories of language verbs [30–32]. Other works have found males use more formal, ranging from part-of-speech (i.e. articles, prepositions, past-tense verbs, affirmation, and informational words, while females use more numbers,...) to topical categories (i.e. family, cognitive mechanisms, affect, social interaction, and deictic language [33–36]. For age, the most occupation, body,...), as well as a few other attributes such as total salient findings include older individuals using more positive number of words used [11]. Names of all 64 categories can be seen emotion and less negative emotion words [30], older individuals in Figure 2. preferring fewer self-references (i.e. ‘I’, ‘me’) [30,31], and Pennebaker & King conducted one of the first extensive stylistically there is less use of negation [37]. 
Similar to our applications of LIWC to personality by examining words in a finding of 2000 topics (clusters of semantically-related words), variety of domains including diaries, college writing assignments, Argamon et al. used factor analysis and identified 20 coherent and social psychology manuscript abstracts [21]. Their results components of word use to link gender and age, showing male were quite consistent across such domains, finding patterns such as components of language increase with age while female factors agreeable people using more articles, introverts and those low in decrease [32]. conscientiousness using more words signaling distinctions, and neurotic Occasionally, studies find contradictory results. For example, individuals using more negative emotion words. Mehl et al. tracks multiple studies report that emoticons (i.e. ‘:)’ ‘:-(‘) are used more the natural speech of 96 people over two days [22]. They found often by females [34,36,38], but Huffaker & Calvert found males similar results to Pennebaker & King and that neurotic and agreeable use them more in a sample of 100 teenage bloggers [39]. This people tend to use more first-person singulars, people low in particular discrepancy could be sample-related – differing openness talk more about social processes, extraverts use longer words. demographics or having a non-representative sample (Huffaker The recent growth of online social media has yielded great & Calvert looked at 100 bloggers, while later studies have looked sources of personal discourse. Besides advantages due to the size of at thousands of twitter users) or it could be due to differences in the the data, the content is often personal and describes everyday domain of the text (blogs versus twitter). One should always be concerns. Furthermore, previous research has suggested popula- careful generalizing new results outside of the domain they were tions for online studies and Facebook are quite representative found as language is often dependent on context [40]. In our case [23,24]. Sumner et al. examined the language of 537 Facebook we explore language in the broad context of Facebook, and do not users with LIWC [25] while Holtgraves studied the text messages claim our results would up under other smaller or larger contexts. of 46 students [26]. Findings from these studies largely confirmed As a starting point for reviewing more psychologically meaningful past links with LIWC but also introduced some new links such as language findings, we refer the reader to Tauszczik & Penneba- neurotics using more acronyms [26] or those high in openness using ker’s 2010 survey of computerized text analysis [20]. more quotations [25]. PLOS ONE | www.plosone.org 2 September 2013 | Volume 8 | Issue 9 | e73791 Personality, Gender, Age in Social Media Language Eisenstein et al. presented a sophisticated open-vocabulary lan- which are suppressed, could have revealed important insights with guage analysis of demographics [41]. Their method views an outcome. In other words, these predictive models answer the language analysis as a multi-predictor to multi-output regression question ‘‘what is the best combination of words and weights to problem, and uses an L1 norm to select the most useful predictors predict personality?’’ whereas we believe answering the following (i.e. words). 
Part of their motivation was finding interpretable question is best for revealing new insights: ‘‘what words, controlled relationships between individual language features and sets of for gender and age, are individually most correlated with outcomes (demographics), and unlike the many predictive works personality?’’. we discuss in the next section, they test for significance of Recently, researchers have started looking at personality relationships between individual language features and outcomes. prediction. Early works in personality prediction used dictionary- To contrast with our approach, we consider features and outcomes based features such as LIWC. Argamon et al. (2005) noted that individually (i.e. an ‘‘L0 norm’’), which we think is more ideal for personality, as detected by categorical word use, was supportive for our goals of explaining psychological variables (i.e. understanding author attribution. They examined language use according to the openness by the words that correlate with it). For example, their traits of neuroticism and extraversion over approximately 2200 student method may throwout a word which is strongly predictive for only essays, while focused on using function words for the prediction of one outcome or which is collinear with other words, while we want gender [62]. Mairesse et al. used a variety of lexicon-based to know all the words most-predictive for a given outcome. We features to predict all Big-5 personality traits over approximately also explore other types of open-vocabulary language features such as 2500 essays as well as 90 sets of individual spoken words [63,64]. phrases and topics. As a first pass at predicting personality from language in Facebook, Similar language analyses also occurred in many fields outside Golbeck used LIWC features over a sample of 167 Facebook of psychology or demographics [42,43]. For example, Monroe volunteers as well as profile information and found limited success et al. explored a variety of techniques that compare two of a regression model [65]. Similarly, Kaggle held a competition of frequencies of words – one number for each of two groups [44]. personality prediction over Twitter messages, providing partici- In particular, they explored frequencies across democratic versus pants with language cues based on LIWC [66]. Results of the republican speeches and settled on a Bayesian model with competition suggested personality is difficult to predict based on regularization and shrinkage based on priors of word use. Lastly, language in social media, but it is not clear whether such a Gilbert finds words and phrases that distinguish communication conclusion would have been drawn had open-vocabulary language up or down a power-hierarchy across 2044 Enron emails [45]. cues been supplied for prediction. They used penalized logistic regression to fit a single model using In the largest previous study of language and personality, coefficients of each feature as their ‘‘power’’; this produces a good Iacobelli, Gill, Nowson, and Oberlander attempted prediction of single predictive model but also means words which are highly personality for 3,000 bloggers [67]. Not limited to categorical collinear with others will be missed (we run a separate regression language they found open-vocabulary features, such as bigrams, to for each word to avoid this). be better predictors than LIWC features. 
This motivates our Perhaps one of the most comprehensive language analysis exploration of open-vocabulary features for psychological insights, surveys outside of psychology is that of Grimmer & Stewart [43]. where we examine multi-word phrases (also called n-grams) as well They summarize how automated methods can inexpensively allow as open-vocabulary category language in the form of automatically systematic analysis and inference from large political text clustered groups of semantically related word (LDA topics, see collections, classifying types of analyses into a of hierarchy. ‘‘Linguistic Feature Extraction’’ in the ‘‘Materials and Methods’’ Additionally, they provide cautionary advice; In relation to this section). Since the application of Iacobelli et al. ’s work was work, they note that dictionary methods (such as the closed- content customization, they focused on prediction rather than vocabulary analyses discussed here) may signal something different exploration of language for psychological insight. Our much larger when used in a new domain (for example ‘crude’ may be a sample size lends itself well to more comprehensive exploratory negative word in student essays, but be neutral in energy industry results. reports: ‘crude oil’). For comprehensive surveys on text analyses Similar studies have also been undertaken for age and gender across fields see Grimmer & Stewart [43], O’Connor, Bamman, & prediction in social media. Because gender and age information is Smith [42], and Tausczik & Pennebaker [46]. more readily available, these studies tend to be larger. Argamon et al. predicted gender and age over 19,320 bloggers [32], while Predictive Models based on Language Burger et al. scaled up the gender prediction over 184,000 Twitter authors by using automatically guessed gender based-on gender- In contrast with the works seeking to gain insights about specific keywords in profiles. Most recently, Bamman et al. looked psychological variables, research focused on predicting outcomes at gender as a function of language and social network statistics in have embraced data-driven approaches. Such work uses open- twitter. They particularly looked at the characteristics of those vocabulary linguistic features in addition to a priori lexicon based whose gender was incorrectly predicted and found greater gender features in predictive models for tasks such as stylistics/authorship attribution [47–49], emotion prediction [50,51], interaction or homophily in the social networks of such individuals [68]. flirting detection [52,53], or sentiment analysis [54–57]. In other These past studies, mostly within the field of computer science works, ideologies of political figures (i.e. conservative to liberal) or specifically computational linguistics, have focused on predic- have been predicted based on language using supervised tion for tasks such as content personalization or authorship techniques [58] or unsupervised inference of ideological space attribution. In our work, predictive models of personality, gender, [59,60]. Sometimes these works note the highest weighted features, and age provide a quantitative means to compare various open- but with their goal being predictive accuracy, those features are vocabulary sets of features with a closed-vocabulary set. Our primary not tested for significance and they usually are not the most concern is to explore the benefits of an open-vocabulary approach for individually distinguishing pieces of language. 
To elaborate, most gaining insights, a goal that is at least as import as prediction for approaches to prediction penalize the weights of words that are psychosocial fields. Most works for gaining language-based insights highly collinear with other words as they fit a single model per in psychology are closed-vocabulary (for examples, see previous outcomes across all words. However, these highly collinear words section), and while many works in computational linguistics are PLOS ONE | www.plosone.org 3 September 2013 | Volume 8 | Issue 9 | e73791 Personality, Gender, Age in Social Media Language open-vocabulary, they rarely focus on insight. We introduce the revealed in language use, and psychosocial variables. In turn, term ‘‘open-vocabulary’’ to distinguish an approach like ours from these results suggest undertaking studies, such as directly previous approaches to gaining insight, and in order to encourage measuring participation in activities in order to verify the link others seeking insights to consider similar approaches. ‘‘Differen- with emotional stability. tial language analysis’’ refers to the particular process, for which We demonstrate open-vocabulary features contain more we are not aware of another name, we use in our open-vocabulary information than a priori word-categories via their use in approach as depicted in Figure 1. predictive models. We take model accuracy in out-of-sample prediction as a measure of information of the features provided Contributions to the model. Models built from words and phrases as well as The contributions of this paper are as follows: those from automatically generated topics achieve significantly higher out-of-sample prediction accuracies than a standard First, we present the largest study of personality and language lexica for each variable of interest (gender, age, and personality). use to date. With just under 75,000 authors, our study covers Additionally, our prediction model for gender yielded state-of- an order-of-magnitude more people and instances of language the-art results for predictive models based entirely on features than the next largest study ([27]). The size of our data language, yielding an out-of-sample accuracy of 91.9%. enables qualitatively different analyses, including open vocab- We present a word cloud visualization which scales words by ulary analysis, based on more comprehensive sets of language correlation (i.e., how well they predict the given psychological features such as phrases and automatically derived topics. Most variable) rather than simply scaling by frequency. Since we prior studies used a priori language categories, presumably due find thousands of significantly correlated words, visualization is in part to the sparse nature of words and their relatively small key, and our differential word clouds provide a comprehensive samples of people. With smaller data sets, it is difficult to find view of our results (e.g. see Figure 3). statistically significant differences in language use for anything Lastly, we offer our comprehensive word, phrase, and topic but the most common words. correlation data for future research experiments (see: Our open-vocabulary analysis yields further insights into the wwbp.org). behavioral residue of personality types beyond those from a priori word-category based approaches, giving unanticipated results (correlations between language and personality, gender, Materials and Methods or age). 
For example, we make the novel discoveries that mentions of an assortment of social sports and life activities Ethics Statement (such as basketball, snowboarding, church, meetings) correlate with All research procedures were approved by the University of emotional stability, and that introverts show an interest in Japanese Pennsylvania Institutional Review Board. Volunteers agreed to media (such as anime, pokemon, manga and Japanese emoticons: written informed consent. _). Our inclusion of phrases in addition to words provided In seeking insights from language use about personality, gender, further insights (e.g. that males prefer to precede ‘girlfriend’ or and age, we explore two approaches. The first approach, serving ‘wife’ with the possessive ‘my’ significantly more than females as a replication of the past analyses, counts word usage over do for ‘boyfriend’ or ‘husband’. Such correlations provide manually created a priori word-category lexica. The second quantitative evidence for strong links between behavior, as approach, termed DLA, serves as out main method and is Figure 1. The infrastructure of our differential language analysis. 1) Feature Extraction. Language use features include: (a) words and phrases: a sequence of 1 to 3 words found using an emoticon-aware tokenizer and a collocation filter (24,530 features) (b) topics: automatically derived groups of words for a single topic found using the Latent Dirichlet Allocation technique [72,75] (500 features). 2) Correlational Analysis. We find the correlation (b of ordinary least square linear regression) between each language feature and each demographic or psychometric outcome. All relationships presented in this work are at least significant at a Bonferroni-corrected pv0:001 [76]. 3) Visualization. Graphical representation of correlational analysis output. doi:10.1371/journal.pone.0073791.g001 PLOS ONE | www.plosone.org 4 September 2013 | Volume 8 | Issue 9 | e73791 Personality, Gender, Age in Social Media Language open-vocabulary – the words and clusters of words analyzed are In practice, we kept phrases with pmi values greater than determined by the data itself. 2  length, where length is the number of words contained in the phrase, ensuring that phrases we do keep are informative parts of speech and not just accidental juxtapositions. All word and phrase Closed Vocabulary: Word-Category Lexica counts are normalized by each subject’s total word use A common method for linking language with psychological (p(word j subject)), and we apply the Anscombe transformation variables involves counting words belonging to manually-created [71] to the normalized values for variance stabilization (p ): ans categories of language. Sometimes referred to as the word-count approach, one counts how often words in a given category are used by an individual, the percentage of the participants’ words freq (phrase, subject) p(phrase j subject)~ which are from the given category: freq (phrase , subject) phrase [vocab(subject) freq (word, subject) word[category p (category j subject)~ P pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi freq (word, subject) p (phrase j subject)~2 p(phrase j subject)z3=8 ans word[vocab (subject) where vocab(subject) returns a list of all words and phrases used where freq (word,subject) is the number of the times the by that subject. 
These Anscombe transformed ‘‘relative frequen- participant mentions word and vocab (subject) is the set of all cies’’ of words or phrases (p ) are then used as the independent Ans words mentioned by the subject. variables in all our analyses. Lastly, we restrict our analysis to those We use ordinary least squares regression to link word categories words and phrases which are used by at least 1% of our subjects, with author attributes, fitting a linear function between explan- keeping the focus on common language. atory variables (LIWC categories) and dependent variables (such as The second type of linguistic feature, topics, consists of word a trait of personality, e.g. extraversion). The coefficient of the clusters created using Latent Dirichlet Allocation (LDA) [72,73]. target explanatory variable (often referred to as b) is taken as the The LDA generative model assumes that documents (i.e. Face- strength of relationship. Including other variables allows us to book messages) contain a combination of topics, and that topics adjust for covariates such as gender and age to provide the unique are a distribution of words; since the words in a document are effect of a given language feature on each psychosocial variable. known, the latent variable of topics can be estimated through Gibbs sampling [74]. We use an implementation of the LDA Open Vocabulary: Differential Language Analysis algorithm provided by the Mallet package [75], adjusting one Our technique, differential language analysis (DLA), is based on parameter (alpha~0:30) to favor fewer topics per document, since three key characteristics. It is individual Facebook status updates tend to contain fewer topics than the typical documents (newspaper or encyclopedia articles) to 1. Open-vocabulary – it is not limited to predefined word lists. which LDA is applied. All other parameters were kept at their Rather, linguistic features including words, phrases, and topics default. An example of such a model is the following sets of words (sets of semantically related words) are automatically deter- (tuesday, monday, wednesday, friday, thursday, week, sunday, saturday) mined from the texts. (I.e., it is ‘‘data-driven’’.) This means which clusters together days of the week purely by exploiting their DLA is classified as a type of open-vocabulary approach. similar distributional properties across messages. We produced the 2. Discriminating – it finds key linguistic features that distinguish 2000 topics shown in Table S1 as well as on our website. psychological and demographic attributes, using stringent To use topics as features, we find the probability of a subject’s significance tests. use of each topic: 3. Simple – it uses simple, fast, and readily accepted statistical techniques. p(topic j subject)~ p(topic j word)  p(word j subject) word[topic We depict the components of this approach in Figure 1, and describe the three steps: 1) linguistic feature extraction, 2) where p(word j subject) is the normalized word use by that subject correlational analysis, and 3) visualization in the following sections. and p(topic j word) is the probability of the topic given the word 1. Linguistic Feature Extraction. We examined two types (a value provided from the LDA procedure). The prevalence of a of linguistic features: a) words and phrases, and b) topics. Words and word in a topic is given by p(topic,word), and is used to order the phrases consisted of sequences of 1 to 3 words (often referred to as words within a topic when displayed. 
‘n-grams’ of size 1 to 3). What constitutes a word is determined 2. Correlational Analysis. Similar to word categories, using a tokenizer, which splits sentences into tokens (‘‘words’’). We distinguishing open-vocabulary words, phrases, and topics can built an emoticon-aware tokenizer on top of Pott’s ‘‘happyfunto- be identified using ordinary least squares regression. We again take kenizer’’ allowing us to capture emoticons like ‘v3’(a heart) or ‘:-)’ the coefficient of the target explanatory variable as its correlation (a smile), which most tokenizers incorrectly divide up as separate strength, and we include other variables (e.g. age and gender) as pieces of punctuation. When extracting phrases, we keep only covariates to get the unique effect of the target explanatory those sequences of words with high informative value according to variable. Since we explore many features at once, we consider pointwise mutual information (PMI ) [69,70], a ratio of the joint- coefficients significant if they are less than a Bonferroni-corrected probability to the independent probability of observing the phrase: [76] two-tailed p of 0.001. (I.e., when examining 20,000 features, a passing p-value is less than 0.001 divided by 20,000 which is {8 p(phrase) 5  10 ). pmi (phrase)~ log Our correlational analysis produces a comprehensive list of the P p(w) w[phrase most distinguishing language features for any given attribute, words, phrases, or topics which maximally discriminate a given target PLOS ONE | www.plosone.org 5 September 2013 | Volume 8 | Issue 9 | e73791 Personality, Gender, Age in Social Media Language variables. For example, when we correlate the target variables continuous or ordinal dependent variables such as age. A standard geographic elevation with language features (N~18,383, time-series plot works well, where the horizontal axis is the pv0:001, adjusted for gender and age), we find ‘beach’ the dependent variable and the vertical axis represents the standard most distinguishing feature for low elevation localities, and ‘the score of the values produced from feature extraction. When mountains’ to be among the most distinguishing features for plotting language as a function of age, we fit first-order LOESS high elevation localities, (i.e., people in low elevations talk regression lines [81] to the age as the x-axis data and standardized about the beach more, whereas people at high elevations talk frequency as the y-axis data over all users. We are able to adjust about the mountains more). Similarly, we find the most for gender in the regression model by including it as a covariate distinguishing topics to be (beach, sand, sun, water, waves, ocean, when training the LOESS model and then using a neutral gender value when plotting. surf, sea, toes, sandy, surfing, beaches, sunset, Florida, Virginia) for low elevations and (Colorado, heading, headed, leaving, Denver, Kansas, City, Springs, Oklahoma, trip, moving, Iowa, KC, Utah, bound) for Data Set: Facebook Status Updates high elevations. Others have looked at geographic location Our complete dataset consists of approximately 19 million [77]. Facebook status updates written by 136,000 participants. Partic- 3. Visualization. An analysis over tens of thousands of ipants volunteered to share their status updates as part of the My language features and multiple dimensions results in hundreds of Personality application, where they also took a variety of question- thousands of statistically significant correlations. Visualization is naires [12]. 
3. Visualization. An analysis over tens of thousands of language features and multiple dimensions results in hundreds of thousands of statistically significant correlations, so visualization is critical for their interpretation. We use word clouds [78] to intuitively summarize our results. Unlike most word clouds, which scale word size by frequency, we scale word size according to the strength of the correlation of the word with the demographic or psychological measurement of interest, and we use color to represent frequency over all subjects; that is, larger words indicate stronger correlations, and darker colors indicate more frequently used words. This provides a clear picture of which words and phrases are most discriminating while not losing track of which ones are most frequent. Word clouds scaled by frequency are often used to summarize news, a practice that has been critiqued for inaccurately representing articles [79]. Here, we believe the word cloud is an appropriate visualization, because the individual words and phrases we depict in it are the actual results we wish to summarize. Further, scaling by correlation coefficient rather than frequency gives clouds that distinguish a given outcome.

Word clouds can also be used to represent distinguishing topics. In this case, the size of a word within the topic represents its prevalence among the cluster of words making up the topic. We take the 6 most distinguishing topics and place them on the perimeter of the word cloud for words and phrases. This way, a single figure gives a comprehensive view of the most distinguishing words, phrases, and topics for any given variable of interest. See Figure 3 for an example.

To reduce the redundancy of results, we automatically prune language features containing information already provided by a feature with a higher correlation. First, we sort language features in order of their correlation with a target variable (such as a personality trait). Then, for phrases, we use frequency as a proxy for informative value [80] and only include additional phrases if they contain more informative words than previously included phrases with matching words. For example, consider the phrases 'day', 'beautiful day', and 'the day', listed in order of correlation from greatest to least: 'beautiful day' would be kept, because 'beautiful' is less frequent than 'day' (i.e., it adds informative value), while 'the day' would be dropped, because 'the' is more frequent than 'day' (and thus contributes no more information than we get from 'day' alone). We do a similar pruning for topics: a lower-ranking topic is not displayed if more than 25% of its top 15 words are also contained in the top 15 words of a higher-ranking topic. These discarded relationships are still statistically significant, but removing them provides more room in the visualizations for other significant results, making the visualization as a whole more meaningful.
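A sketch of both pruning rules, under the assumption that features arrive sorted by correlation strength (descending); the helper names and data structures are ours, not the paper's.

```python
def prune_phrases(ranked_phrases, word_freq):
    """Keep a lower-ranked phrase only if it adds a word that is less
    frequent (a proxy for more informative) than the words it shares
    with phrases already kept."""
    kept, kept_words = [], set()
    for phrase in ranked_phrases:
        words = phrase.split()
        new = [w for w in words if w not in kept_words]
        shared = [w for w in words if w in kept_words]
        # e.g. 'beautiful day' survives after 'day' because 'beautiful'
        # is rarer than 'day'; 'the day' is dropped because 'the' is not.
        if not shared or any(
            word_freq[w] < min(word_freq[s] for s in shared) for w in new
        ):
            kept.append(phrase)
            kept_words.update(words)
    return kept

def prune_topics(ranked_topics, overlap=0.25, top_n=15):
    """Drop a lower-ranked topic if more than 25% of its top-15 words
    appear in the top-15 words of a topic already kept."""
    kept = []
    for topic in ranked_topics:          # each topic: list of top words
        top = set(topic[:top_n])
        if all(len(top & set(k[:top_n])) <= overlap * top_n for k in kept):
            kept.append(topic)
    return kept
```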
Word clouds allow one to easily view the features most correlated with polar outcomes; we use other visualizations to display how the correlation of language features varies with continuous or ordinal dependent variables such as age. A standard time-series plot works well, where the horizontal axis is the dependent variable and the vertical axis represents the standard score of the values produced from feature extraction. When plotting language as a function of age, we fit first-order LOESS regression lines [81] with age as the x-axis data and standardized frequency as the y-axis data over all users. We are able to adjust for gender in the regression model by including it as a covariate when training the LOESS model and then using a neutral gender value when plotting.
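statsmodels' LOWESS smoother (a local first-order fit) takes no covariates, so the sketch below approximates the gender adjustment by residualizing on gender before smoothing; this is an approximation of, not a claim about, the paper's exact procedure, and the smoothing fraction is an illustrative choice.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.nonparametric.smoothers_lowess import lowess

def age_trend(age, feature_freq, gender, frac=0.3):
    """Standardize a feature's frequency, regress out the linear
    gender effect, then smooth the residual trend over age."""
    z = (feature_freq - feature_freq.mean()) / feature_freq.std()
    G = sm.add_constant(np.asarray(gender, dtype=float))
    resid = z - sm.OLS(z, G).fit().predict(G)  # gender-adjusted values
    return lowess(resid, age, frac=frac)       # sorted (age, smoothed) pairs
```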
Data Set: Facebook Status Updates

Our complete dataset consists of approximately 19 million Facebook status updates written by 136,000 participants. Participants volunteered to share their status updates as part of the My Personality application, where they also took a variety of questionnaires [12]. We restrict our analysis to those Facebook users meeting certain criteria: they must indicate English as a primary language, have written at least 1,000 words in their status updates, be less than 65 years old (to avoid the non-representative sample above 65), and indicate both gender and age (for use as controls). This resulted in N = 74,941 volunteers, writing a total of 309 million words (700 million feature instances of words, phrases, and topics) across 15.4 million status updates. Each person in this sample wrote an average of 4,129 words over 206 status updates, and thus 20 words per update. Depending on the target variable, this number varies slightly, as indicated in the caption of each result.

The personality scores are based on the International Personality Item Pool proxy for the NEO Personality Inventory Revised (NEO-PI-R) [14,82]. Participants could take 20- to 100-item versions of the questionnaire, with a retest reliability above 0.80 [12]. With the addition of the gender and age variables, this resulted in seven total dependent variables studied in this work, which are depicted in Table 1 along with summary statistics. The personality distributions are quite typical, with means near zero and standard deviations near 1. The statuses ranged over 34 months, from January 2009 through October 2011. Previously, profile information (i.e., network metrics, relationship status) from users in this dataset has been linked with personality [83], but this is the first use of its status updates.

Table 1. Summary statistics for gender, age, and the five factor model of personality.

                      N        mean    standard deviation   skewness
  Gender              74,859    0.62   0.49                 -0.50
  Age                 74,859   23.43   8.96                  1.77
  Extraversion        72,709   -0.07   1.01                 -0.34
  Agreeableness       72,772    0.03   1.00                 -0.40
  Conscientiousness   72,781   -0.04   1.01                 -0.09
  Neuroticism         71,968    0.14   1.04                 -0.21
  Openness            72,809    0.12   0.97                 -0.48

These represent the seven dependent variables studied in this work. Gender ranged from 0 (male) to 1 (female). Age ranged from 13 to 65. Personality questionnaires produce values along a standardized continuum. doi:10.1371/journal.pone.0073791.t001

Results

Results of our analyses over gender, age, and personality are presented below. As a baseline, we first replicate the commonly used LIWC analysis on our data set. We then present our main results, the output of our method, DLA. Lastly, we explore empirical evidence that open-vocabulary features provide more information than those from an a priori lexicon, through their use in a predictive model.

Closed Vocabulary

Figure 2 shows the results of applying the LIWC lexicon to our dataset, side by side with the most comprehensive previous studies we could find for gender, age, and personality [27,30,34]. In our case, correlation results are β values from an ordinary least squares linear regression in which we adjust for gender and age to give the unique effect of the target variable. One should keep in mind that effect sizes often become relatively smaller, and more stable, as sample sizes increase [84].

Even though the previous studies listed did not look at Facebook, a majority of the correlations we find agree in direction. Some of the largest correlations emerge for the LIWC articles category, which consists of determiners like 'the', 'a', and 'an' and serves as a proxy for the use of more nouns. Articles are highly predictive of being male, being older, and of openness. As a content-related language variable, the anger category also proved highly predictive for males as well as for younger individuals, those low in agreeableness and conscientiousness, and those high in neuroticism. Openness had the least agreement with the comparison study; roughly half of our results were in the opposite direction from the prior work. This is not too surprising, since openness exhibits the most variation across conditions of other studies (for examples, see [25,27,65]), and its component traits are most loosely related [85].

Figure 2. Correlation values of LIWC categories with gender, age, and the five factor model of personality. d: effect sizes as Cohen's d values from Newman et al.'s study of gender (positive is female; ns = not significant at p < 0.001) [34]. β: standardized linear regression coefficients adjusted for sex, writing/talking, and experimental condition from Pennebaker and Stone's study of age (ns = not significant at p < 0.05) [30]. r: Spearman correlation values from Yarkoni's study of personality (ns = not significant at p < 0.05) [27]. our β: standardized multivariate regression coefficients adjusted for gender and age for the current study over Facebook (ns = not significant at Bonferroni-corrected p < 0.001). doi:10.1371/journal.pone.0073791.g002
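As a sketch of the closed-vocabulary featurization used for these baselines: a LIWC category score for one subject is simply the share of the subject's total word use that falls in the category. Note that real LIWC entries include wildcard stems (e.g., 'happ*'), which this simplified exact-match lookup ignores.

```python
def category_frequency(word_counts, category_words):
    """p(category | subject): fraction of a subject's total word use
    that falls in the given category's word list."""
    total = sum(word_counts.values())
    hits = sum(count for word, count in word_counts.items()
               if word in category_words)
    return hits / total if total else 0.0
```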
Open Vocabulary

Our DLA method identifies the most distinguishing language features (words; phrases: sequences of 1 to 3 words; or topics: clusters of semantically related words) for any given attribute. Results progress from a one-variable proof of concept (gender), to the multiple variables representing age groups, and finally to all 5 dimensions of personality.

Language of Gender. Gender provides a familiar and easy-to-understand proof of concept for open-vocabulary analysis. Figure 3 presents word clouds from age-adjusted gender correlations. We scale word size according to the strength of the relation, and we use color to represent overall frequency; that is, larger words indicate stronger correlations, and darker colors indicate frequently used words. For the topics (groups of semantically related words), size indicates the relative prevalence of the word within the cluster, as defined in the methods section. All results are significant at Bonferroni-corrected [76] p < 0.001.

Figure 3. Words, phrases, and topics most highly distinguishing females and males. Female language features are shown on top, male below. Size of the word indicates the strength of the correlation; color indicates relative frequency of usage. Underscores (_) connect the words of multiword phrases. Words and phrases are in the center; topics, represented as the 15 most prevalent words, surround. (N = 74,859: 46,412 females and 28,247 males; correlations adjusted for age; Bonferroni-corrected p < 0.001.) doi:10.1371/journal.pone.0073791.g003

Many strong results emerging from our analysis align with our LIWC results and past studies of gender. For example, females used more emotion words [86,87] (e.g., 'excited') and first-person singulars [88], and they mention more psychological and social processes [34] (e.g., 'love you' and '<3', a heart). Males used more swear words and object references (e.g., 'xbox') [34,89].

One might also draw new insights from the gender results. For example, we noticed 'my wife' and 'my girlfriend' emerged as strongly correlated in the male results, while simply 'husband' and 'boyfriend' were most predictive for females. Investigating the frequency data revealed that males did in fact precede references to their opposite-sex partner with 'my' significantly more often than females did. On the other hand, females were more likely to precede 'husband' or 'boyfriend' with 'her' or 'amazing' and a greater variety of words, which is why 'my husband' was not more predictive than 'husband' alone. This suggests the male preference for the possessive 'my' is at least partially due to a lack of talking about others' partners.

Other results of ours contradicted past studies, which were based on significantly smaller sample sizes than ours. For example, in 100 bloggers Huffaker et al. [39] found males use more emoticons than females. We calculated power analyses to determine the sample size needed to confidently find such significant results. Since the Bonferroni correction we use elsewhere in this work is overly stringent (i.e., it makes it harder than necessary to pass significance tests), for this result we applied the Benjamini-Hochberg false discovery rate procedure for multiple hypothesis testing [90].
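For reference, a compact implementation of the Benjamini-Hochberg step-up procedure used in these power analyses; this is the standard textbook algorithm, written here as a sketch rather than the authors' code.

```python
import numpy as np

def benjamini_hochberg(pvalues, q=0.001):
    """Benjamini-Hochberg step-up FDR control: returns a boolean mask
    of hypotheses rejected at false-discovery rate q."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # largest k with p_(k) <= q * k / m; reject the k smallest p-values
    passed = p[order] <= q * (np.arange(1, m + 1) / m)
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject
```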
Rerunning our language-of-gender analysis on reduced random samples of our subjects resulted in the following numbers of significant correlations (Benjamini-Hochberg tested, p < 0.001): 50 subjects: 0 significant correlations; 500 subjects: 7 correlations; 5,000 subjects: 1,489 correlations; 50,000 subjects: 13,152 correlations (more detailed results of power analyses across gender, age, and personality can be found in Figure S1). Thus, traditional study sample sizes, which are closer to 50 or 500, are not powerful enough to do data-driven DLA over individual words.

Language of Age. Figure 4 shows the word clouds (center) and most discriminating topics (surrounding) for four age buckets chosen with regard to the distribution of ages in our sample (Facebook has many more young people). We see clear distinctions, such as the use of slang, emoticons, and Internet speak in the youngest group (e.g., ':)', 'idk', and a couple of Internet-speak topics), or work appearing in the 23-to-29 age group (e.g., 'at work', 'new job', and a job-position topic). We also find subtle changes in topics progressing from one age group to the next. For example, we see a school-related topic for 13-to-18 year olds (e.g., 'school', 'homework', 'ugh'), while we see a college-related topic for 19-to-22 year olds (e.g., 'semester', 'college', 'register'). Additionally, consider the drunk topic (e.g., 'drunk', 'hangover', 'wasted') that appears for 19-to-22 year olds and the more reserved beer topic (e.g., 'beer', 'drinking', 'ale') for 23-to-29 year olds.

Figure 4. Words, phrases, and topics most distinguishing subjects aged 13 to 18, 19 to 22, 23 to 29, and 30 to 65. Ordered from top to bottom: 13 to 18, 19 to 22, 23 to 29, and 30 to 65. Words and phrases are in the center; topics, represented as the 15 most prevalent words, surround. (N = 74,859; correlations adjusted for gender; Bonferroni-corrected p < 0.001.) doi:10.1371/journal.pone.0073791.g004

In general, we find a progression of school, college, work, and family when looking at the predominant topics across all age groups. DLA may thus be valuable for the generation of hypotheses about life-span developmental age differences. Figure 5A shows the relative frequency of the most discriminating topic for each age group as a function of age. Typical concerns peak at different ages, with the topic concerning relationships (e.g., 'son', 'daughter', 'father', 'mother') continuously increasing across the life span. On a similar note, Figure 5C shows that 'we' increases approximately linearly after the age of 22, whereas 'I' monotonically decreases. We take this as a proxy for social integration [19], suggesting the increasing importance of friendships and relationships as people age. Figure 5B reinforces this hypothesis by presenting a similar pattern based on other social topics. One limitation of our dataset is the rarity of older individuals using social media; we look forward to a time in which we can track fine-grained language differences across the entire lifespan.

Figure 5. Standardized frequency of topics and words across age. A. Standardized frequency of the most distinguishing topic for each of the 4 age groups. Grey vertical lines divide the groups: 13 to 18 (black: n = 25,467 out of N = 74,859), 19 to 22 (green: n = 21,687), 23 to 29 (blue: n = 14,656), and 30+ (red: n = 13,049). Lines are fit from first-order LOESS regression [81] controlled for gender. B. Standardized frequency of social topic use across age. C. Standardized 'I' and 'we' frequencies across age. doi:10.1371/journal.pone.0073791.g005

Language of Personality. We created age- and gender-adjusted word clouds for each personality factor, based on around 72 thousand participants with at least 1,000 words across their Facebook status updates who took a Big Five questionnaire [91]. Figure 6 shows the word clouds for extraversion and neuroticism. (See Figure S2 for openness, conscientiousness, and agreeableness.) The dominant words in each cluster were consistent with prior lexical and questionnaire work [14]. For example, extraverts were more likely to mention social words such as 'party', 'love you', 'boys', and 'ladies', whereas introverts were more likely to mention words related to solitary activities such as 'computer', 'Internet', and 'reading'. In the openness cloud, words such as 'music', 'art', and 'writing' (i.e., creativity), and 'dream', 'universe', and 'soul' (i.e., imagination) were discriminating [85].

Topics were also found reflecting concepts similar to those of the words, some of which would not have been captured with LIWC. For example, although LIWC has socially related categories, it does not contain a party topic, which emerges as a key distinguishing feature for extraverts. Topics related to other types of social events appear elsewhere, such as a sports topic for low neuroticism (emotional stability). Additionally, Figure 6 shows the advantage of having phrases in the analysis to get a clearer signal: e.g., people high in neuroticism mentioned 'sick of', and not just 'sick'.

While many of our results confirm previous research, demonstrating the instrument's face validity, our word clouds also suggest new hypotheses. For example, Figure 6 (bottom right) shows language related to emotional stability (low neuroticism). Emotionally stable individuals wrote about enjoyable social activities that may foster greater emotional stability, such as 'sports', 'vacation', 'beach', 'church', 'team', and a family-time topic. Additionally, results suggest that introverts are interested in Japanese media (e.g., 'anime', 'manga', 'japanese', Japanese-style emoticons such as ^_^, and an anime topic) and that those low in openness drive the use of shorthands in social media (e.g., '2day', 'ur', 'every 1'). Although these are only language correlations, they show how open-vocabulary analyses can illuminate areas to explore further.

Figure 6. Words, phrases, and topics most distinguishing extraversion from introversion and neuroticism from emotional stability. A. Language of extraversion (left, e.g., 'party') and introversion (right, e.g., 'computer'); N = 72,709. B. Language distinguishing neuroticism (left, e.g., 'hate') from emotional stability (right, e.g., 'blessed'); N = 71,968 (adjusted for age and gender, Bonferroni-corrected p < 0.001). Figure S2 contains results for openness, conscientiousness, and agreeableness. doi:10.1371/journal.pone.0073791.g006
Predictive Evaluation

Here we present a quantitative evaluation of open-vocabulary and closed-vocabulary language features. Although we have thus far presented subjective evidence that open-vocabulary features contribute more information, we hypothesize empirically that the inclusion of open-vocabulary features leads to prediction accuracies above and beyond those of closed-vocabulary features. We randomly sampled 25% of our participants as test data and used the remaining 75% as training data to build our predictive models.

We use a linear support vector machine (SVM) [92] for classifying the binary variable of gender, and ridge regression [93] for predicting age and each factor of personality. Features were first run through principal component analysis to reduce the feature dimension to half of the number of users. Both SVM classification and ridge regression utilize a regularization parameter, which we set by validation over the training set (we defined a small validation set of 10% of the training set, tested various regularization parameters while fitting the model to the other 90% of the training set, and selected the best parameter). Thus, the predictive model is created without any outcome information outside of the training data, making the test data an out-of-sample evaluation.

As open-vocabulary features, we use the same units of language as DLA: words and phrases (n-grams of size 1 to 3, passing a collocation filter) and topics. These features are outlined precisely under the "Linguistic Feature Extraction" section presented earlier. As explained in that section, we use Anscombe-transformed relative frequencies of words and phrases and the conditional probability of a topic given a subject. For closed-vocabulary features, we use the LIWC categories of language, calculated as the relative frequency of a user mentioning a word in the category given their total word usage. We do not provide our models with anything other than these language usage features (independent variables) for prediction, and we use all features (not just those passing significance tests from DLA).
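A sketch of this prediction pipeline with scikit-learn stand-ins (the paper cites LIBLINEAR [92] for the SVM; the candidate regularization grid, the scikit-learn estimators, and the final refit on the full training set are our illustrative choices):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def train_language_model(X_train, y_train, binary=False):
    """PCA down to half the number of training users, then a linear
    SVM (binary outcomes such as gender) or ridge regression (age,
    personality), choosing the regularization setting on a held-out
    10% validation split of the training data."""
    n_comp = min(X_train.shape[0] // 2, X_train.shape[1])
    pca = PCA(n_components=n_comp)
    Z = pca.fit_transform(X_train)
    Z_fit, Z_val, y_fit, y_val = train_test_split(Z, y_train, test_size=0.1)
    best_model, best_score = None, -np.inf
    for c in (0.01, 0.1, 1.0, 10.0):              # illustrative grid
        model = LinearSVC(C=c) if binary else Ridge(alpha=1.0 / c)
        model.fit(Z_fit, y_fit)
        score = model.score(Z_val, y_val)         # accuracy or R^2
        if score > best_score:
            best_model, best_score = model, score
    best_model.fit(Z, y_train)   # refit with chosen setting (one reasonable choice)
    return pca, best_model
```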
Table 2. Comparison of LIWC and open-vocabulary features within predictive models of gender, age, and personality.

  features                      Gender   Age   Extra.  Agree.  Consc.  Neuro.  Open.
                                (acc.)   (R)   (R)     (R)     (R)     (R)     (R)
  LIWC                          78.4%    .65   .27     .25     .29     .21     .29
  Topics                        87.5%    .80   .32     .29     .33     .28     .38
  WordPhrases                   91.4%    .83   .37     .29     .34     .29     .41
  WordPhrases + Topics          91.9%    .84   .38     .31     .35     .31     .42
  Topics + LIWC                 89.2%    .80   .33     .29     .33     .28     .38
  WordPhrases + LIWC            91.6%    .83   .38     .30     .34     .30     .41
  WordPhrases + Topics + LIWC   91.9%    .84   .38     .31     .35     .31     .42

accuracy: percent predicted correctly (for discrete binary outcomes). R: square root of the coefficient of determination (for sequential/continuous outcomes). LIWC: a priori word categories from Linguistic Inquiry and Word Count. Topics: automatically created LDA topic clusters. WordPhrases: words and phrases (n-grams of size 1 to 3 passing a collocation filter). Bold indicates significant (p < .01) improvement over the baseline set of features (use of LIWC alone). doi:10.1371/journal.pone.0073791.t002

As shown in Table 2, models created with open-vocabulary features significantly (p < 0.01) outperformed those created from LIWC features. The topics results are of particular interest, because these automatically clustered word-category lexica were not created with any human or psychological data, only knowledge of which words occurred in messages together. Furthermore, a model which includes LIWC features on top of the open-vocabulary words, phrases, and topics does not result in any improvement, suggesting that the open-vocabulary features capture predictive information which fully supersedes LIWC.

For personality, we saw the largest relative improvement between open-vocabulary approaches and LIWC. Our best personality R score of 0.42 fell just above the standard "correlational upper-limit" for behavior to predict personality (a Pearson correlation of 0.3 to 0.4) [94,95]. Some researchers have discretized personality scores for prediction, classifying people as being high or low in each trait (one standard deviation above or below the mean, or top and bottom quartiles, throwing out the middle) [61,64,67]. When we take such an approach, our scores fall in similar ranges to that literature: 65% to 79% classification accuracy. Of course, such a high/low model cannot directly be used for classifying unlabeled people, as one would also need to know who fits in the middle. Regression is a more appropriate predictive task for continuous outcomes like age and personality, even though R scores are naturally smaller than binary classification accuracies.

We ran additional tests to evaluate only those words and phrases, topics, or LIWC categories that are selected via differential language analysis, rather than all features. That is, we used only those language features that significantly correlated (Bonferroni-corrected p < 0.001) with the outcome being predicted. To keep
consistent with the main evaluation, we used no controls, so one could view this as a univariate feature selection over each type of feature independently. We again found significant improvement from using the open-vocabulary features over LIWC, and no significant changes in accuracy overall. These results are presented in Table S2.

In addition to demonstrating the greater informative value of open-vocabulary features, we found our results to be state-of-the-art. The highest previous out-of-sample accuracy for gender prediction based entirely on language was 88.0% over Twitter data [68], while our classifiers reach an accuracy of 91.9%. Our increased performance could be attributed to our set of language features, a strong predictive algorithm (the support vector machine), and the large sample of Facebook data.

Discussion

Online social media such as Facebook are a particularly promising resource for the study of people, as "status" updates are self-descriptive, personal, and have emotional content [7]. Language use is objective and quantifiable behavioral data [96], and unlike surveys and questionnaires, Facebook language allows researchers to observe individuals as they freely present themselves in their own words. Differential language analysis (DLA) in social media is an unobtrusive and non-reactive window into the social and psychological characteristics of people's everyday concerns.

Most studies linking language with psychological variables rely on a priori fixed sets of words, such as the LIWC categories carefully constructed over 20 years of human research [11]. Here, we show the benefits of an open-vocabulary approach in which the words analyzed are based on the data itself. We extracted words, phrases, and topics (automatically clustered sets of words) from millions of Facebook messages and found the language that correlates most with gender, age, and the five factors of personality. We discovered insights not found previously and achieved higher accuracies than LIWC when using our open-vocabulary features in a predictive model, achieving state-of-the-art accuracy in the case of gender prediction.

Exploratory analyses like DLA change the process from that of testing theories with observations to that of data-driven identification of new connections [97,98]. Our intention here is not a complete replacement for closed-vocabulary analyses like LIWC. When one has a specific theory in mind or a small sample size, an a priori list of words can be ideal; in an open-vocabulary approach, the concept one cares about can be drowned out by more predictive concepts. Further, it may be easier to compare static a priori categories of words across studies. However, automatically clustering words into coherent topics allows one to potentially discover categories that might not have been anticipated (e.g., sports teams, kinds of outdoor exercise, or Japanese cartoons). Open-vocabulary approaches also save labor in creating categories. They consider all words encountered and thus are able to adapt well to the evolving language in social media and other genres. They are also transparent, in that the exact words driving correlations are not hidden behind a level of abstraction. Given lots of text and dependent variables, an open-vocabulary approach like DLA can be immediately useful for many areas of study; for example, an economist contrasting sport utility with hybrid vehicle drivers, a political scientist comparing democrats and republicans, or a cardiologist differentiating people with positive versus negative outcomes of heart disease.

Like most studies in the social sciences, this work is still subject to sampling and social desirability biases. Language connections with psychosocial variables are often dependent on context [40]. Here, we examined language in a large sample of the broad context of Facebook; under different contexts, it is likely some results would differ. Still, the sample sizes and availability of demographic information afforded by social media bring us closer to a more ideal representative sample [99]. Our current results have face validity (subjects at high elevations talk about 'the mountains'), tie in with other research (neurotic people disproportionately use the phrase 'depressed'), suggest new hypotheses (an active life implies emotional stability), and give detailed insights (males prefer to precede 'wife' with the possessive 'my' more so than females precede 'husband' with 'my').

Over the past one-hundred years, surveys and questionnaires have illuminated our understanding of people. We suggest that new multipurpose instruments such as DLA, emerging from the field of computational social science, shed new light on psychosocial phenomena.

Supporting Information

Figure S1. Power analyses for all outcomes examined in this work. Number of features passing a Benjamini-Hochberg false-discovery rate of p < 0.001 as a function of the number of users sampled, out of the maximum 24,530 words and phrases used by at least 1% of users. (TIF)

Figure S2. Words, phrases, and topics most distinguishing agreeableness, conscientiousness, and openness. A. Language of high agreeableness (left) and low agreeableness (right); N = 72,772. B. Language of high conscientiousness (left) and low conscientiousness (right); N = 72,781. C. Language of openness (left) and closed to experience (right); N = 72,809 (adjusted for gender and age, Bonferroni-corrected p < 0.001). (TIF)

Table S1. The 15 most prevalent words for the 2,000 automatically generated topics used in our study. All topics available here: wwbp.org/public_data/2000topics.top20freqs.keys.csv. (XLS)
Table S2. Prediction results when selecting features via differential language analysis. accuracy: percent predicted correctly (for discrete binary outcomes). R: square root of the coefficient of determination (for sequential/continuous outcomes). LIWC: a priori word categories from Linguistic Inquiry and Word Count. Topics: automatically created LDA topic clusters. WordPhrases: words and phrases (n-grams of size 1 to 3 passing a collocation filter). Bold indicates significant (p < .01) improvement over the baseline set of features (use of LIWC alone). Differential language analysis was run over the training set, and only those features significant at Bonferroni-corrected p < 0.001 were included during training and testing. No controls were used, so as to be consistent with the evaluation in the main paper; one could consider this a univariate feature selection. On average, results are just below those of not using differential language analysis to select features, but there is no significant difference. (PDF)

Acknowledgments

We would like to thank Greg Park, Angela Duckworth, Adam Croom, Molly Ireland, Paul Rozin, Eduardo Blanco, and our other colleagues in the Positive Psychology Center and the Computer & Information Science department for their valuable feedback regarding this work.

Author Contributions

Conceived and designed the experiments: HAS JCE MLK LHU. Performed the experiments: HAS LD. Analyzed the data: HAS JCE LD SMR MA AS. Contributed reagents/materials/analysis tools: MK DS. Wrote the paper: HAS JCE MLK DS MEPS LHU.

References

1. Lazer D, Pentland A, Adamic L, Aral S, Barabási AL, et al. (2009) Computational social science. Science 323: 721–723.
2. Weinberger S (2011) Web of war: Can computational social science help to prevent or win wars? The Pentagon is betting millions of dollars on the hope that it will. Nature 471: 566–568.
3. Miller G (2011) Social scientists wade into the tweet stream. Science 333: 1814–1815.
4. Facebook (2012) Facebook company info: Fact sheet. Available: http://newsroom.fb.com. Accessed 2012 Dec.
5. Golder S, Macy M (2011) Diurnal and seasonal mood vary with work, sleep, and daylength across diverse cultures. Science 333: 1878–1881.
6. Bollen J, Mao H, Zeng X (2011) Twitter mood predicts the stock market. Journal of Computational Science 2: 1–8.
7. Kramer A (2010) An unobtrusive behavioral model of gross national happiness. In: Proceedings of the 28th International Conference on Human Factors in Computing Systems. ACM, pp. 287–290.
8. Dodds PS, Harris KD, Kloumann IM, Bliss CA, Danforth CM (2011) Temporal patterns of happiness and information in a global social network: Hedonometrics and Twitter. PLoS ONE 6: 26.
9. Ginsberg J, Mohebbi M, Patel R, Brammer L, Smolinski M, et al. (2009) Detecting influenza epidemics using search engine query data. Nature 457: 1012–1014.
10. Michel JB, Shen YK, Aiden AP, Veres A, Gray MK, et al. (2011) Quantitative analysis of culture using millions of digitized books. Science 331: 176–182.
11. Pennebaker JW, Chung CK, Ireland M, Gonzales A, Booth RJ (2007) The development and psychometric properties of LIWC2007, The University of Texas at Austin. LIWC.net 1: 1–22.
12. Kosinski M, Stillwell D, Graepel T (2013) Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences (PNAS).
13. Goldberg LR (1990) An alternative "description of personality": the big-five factor structure. J Pers and Soc Psychol 59: 1216–1229.
14. McCrae RR, John OP (1992) An introduction to the five-factor model and its applications. Journal of Personality 60: 175–215.
15. Norman W (1963) Toward an adequate taxonomy of personality attributes: Replicated factor structure in peer nomination personality ratings. The Journal of Abnormal and Social Psychology 66: 574.
16. Digman J (1990) Personality structure: Emergence of the five-factor model. Annual Review of Psychology 41: 417–440.
17. Stone P, Dunphy D, Smith M (1966) The General Inquirer: A Computer Approach to Content Analysis. MIT Press.
18. Coltheart M (1981) The MRC psycholinguistic database. The Quarterly Journal of Experimental Psychology 33: 497–505.
19. Pennebaker JW, Mehl MR, Niederhoffer KG (2003) Psychological aspects of natural language use: our words, our selves. Annual Review of Psychology 54: 547–577.
20. Tausczik Y, Pennebaker J (2010) The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology 29: 24–54.
21. Pennebaker J, King L (1999) Linguistic styles: language use as an individual difference. Journal of Personality and Social Psychology 77: 1296.
22. Mehl M, Gosling S, Pennebaker J (2006) Personality in its natural habitat: manifestations and implicit folk theories of personality in daily life. Journal of Personality and Social Psychology 90: 862.
23. Gosling S, Vazire S, Srivastava S, John O (2004) Should we trust web-based studies? A comparative analysis of six preconceptions about internet questionnaires. American Psychologist 59: 93.
24. Back M, Stopfer J, Vazire S, Gaddis S, Schmukle S, et al. (2010) Facebook profiles reflect actual personality, not self-idealization. Psychological Science 21: 372–374.
25. Sumner C, Byers A, Shearing M (2011) Determining personality traits & privacy concerns from Facebook activity. In: Black Hat Briefings. pp. 1–29.
26. Holtgraves T (2011) Text messaging, personality, and the social context. Journal of Research in Personality 45: 92–99.
27. Yarkoni T (2010) Personality in 100,000 words: A large-scale analysis of personality and word use among bloggers. Journal of Research in Personality 44: 363–373.
28. Chung C, Pennebaker J (2008) Revealing dimensions of thinking in open-ended self-descriptions: An automated meaning extraction method for natural language. Journal of Research in Personality 42: 96–132.
29. Kramer A, Chung K (2011) Dimensions of self-expression in Facebook status updates. In: Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media. pp. 169–176.
30. Pennebaker J, Stone L (2003) Words of wisdom: Language use over the life span. Journal of Personality and Social Psychology 85: 291.
31. Chung C, Pennebaker J (2007) The psychological function of function words. Social Communication: Frontiers of Social Psychology: 343–359.
32. Argamon S, Koppel M, Pennebaker J, Schler J (2007) Mining the blogosphere: age, gender, and the varieties of self-expression. First Monday 12.
33. Argamon S, Koppel M, Fine J, Shimoni A (2003) Gender, genre, and writing style in formal written texts. Text 23: 3.
34. Newman M, Groom C, Handelman L, Pennebaker J (2008) Gender differences in language use: An analysis of 14,000 text samples. Discourse Processes 45: 211–236.
35. Mukherjee A, Liu B (2010) Improving gender classification of blog authors. In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pp. 207–217.
36. Rao D, Yarowsky D, Shreevats A, Gupta M (2010) Classifying latent user attributes in Twitter. In: Proceedings of the 2nd International Workshop on Search and Mining User-Generated Contents. ACM, pp. 37–44.
37. Schler J, Koppel M, Argamon S, Pennebaker J (2006) Effects of age and gender on blogging. In: Proceedings of the 2006 AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs. pp. 199–205.
38. Burger J, Henderson J, Kim G, Zarrella G (2011) Discriminating gender on Twitter. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pp. 1301–1309.
39. Huffaker DA, Calvert SL (2005) Gender, identity, and language use in teenage blogs. Journal of Computer-Mediated Communication 10: 1–10.
40. Eckert P (2008) Variation and the indexical field. Journal of Sociolinguistics 12: 453–476.
41. Eisenstein J, Smith NA, Xing EP (2011) Discovering sociolinguistic associations with structured sparsity. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies – Volume 1. Association for Computational Linguistics, pp. 1365–1374.
42. O'Connor B, Bamman D, Smith NA (2011) Computational text analysis for social science: Model assumptions and complexity. Public Health 41: 43.
43. Grimmer J, Stewart BM (2013) Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis.
44. Monroe BL, Colaresi MP, Quinn KM (2008) Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis 16: 372–403.
45. Gilbert E (2012) Phrases that signal workplace hierarchy. In: Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work. ACM, pp. 1037–1046.
46. Tausczik Y, Pennebaker J (2010) The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology 29: 24.
47. Holmes D (1994) Authorship attribution. Computers and the Humanities 28: 87–106.
48. Argamon S, Šarić M, Stein SS (2003) Style mining of electronic messages for multiple authorship discrimination: first results. In: KDD '03: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM, pp. 475–480.
49. Stamatatos E (2009) A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology 60: 538–556.
50. Alm C, Roth D, Sproat R (2005) Emotions from text: machine learning for text-based emotion prediction. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pp. 579–586.
51. Mihalcea R, Liu H (2006) A corpus-based approach to finding happiness. In: Proceedings of the AAAI Spring Symposium on Computational Approaches to Weblogs. p. 19.
52. Jurafsky D, Ranganath R, McFarland D (2009) Extracting social meaning: Identifying interactional style in spoken conversation. In: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, pp. 638–646.
53. Ranganath R, Jurafsky D, McFarland D (2009) It's not you, it's me: detecting flirting and its misperception in speed-dates. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1. Association for Computational Linguistics, pp. 334–342.
54. Pang B, Lee L, Vaithyanathan S (2002) Thumbs up? Sentiment classification using machine learning techniques. In: Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 79–86.
55. Kim SM, Hovy E (2004) Determining the sentiment of opinions. In: Proceedings of the 20th International Conference on Computational Linguistics. Stroudsburg, PA, USA: Association for Computational Linguistics, COLING '04.
56. Wilson T, Wiebe J, Hoffmann P (2009) Recognizing contextual polarity: An exploration of features for phrase-level sentiment analysis. Computational Linguistics 35: 399–433.
57. Baccianella S, Esuli A, Sebastiani F (2010) SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In: Calzolari N, Choukri K, Maegaard B, Mariani J, Odijk J, et al., editors, Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10). Valletta, Malta: European Language Resources Association (ELRA).
58. Laver M, Benoit K, Garry J (2003) Extracting policy positions from political texts using words as data. American Political Science Review 97: 311–331.
59. Monroe BL, Maeda K (2004) Talk's cheap: Text-based estimation of rhetorical ideal-points. In: Annual Meeting of the Society for Political Methodology. pp. 29–31.
60. Slapin JB, Proksch SO (2008) A scaling model for estimating time-series party positions from texts. American Journal of Political Science 52: 705–722.
61. Argamon S, Dhawle S, Koppel M, Pennebaker JW (2005) Lexical predictors of personality type. In: Proceedings of the Joint Annual Meeting of the Interface and the Classification Society.
62. Argamon S, Koppel M, Pennebaker JW, Schler J (2009) Automatically profiling the author of an anonymous text. Communications of the ACM 52: 119–123.
63. Mairesse F, Walker M (2006) Automatic recognition of personality in conversation. In: Proceedings of the Human Language Technology Conference of the NAACL. pp. 85–88.
64. Mairesse F, Walker M, Mehl M, Moore R (2007) Using linguistic cues for the automatic recognition of personality in conversation and text. Journal of Artificial Intelligence Research 30: 457–500.
65. Golbeck J, Robles C, Edmondson M, Turner K (2011) Predicting personality from Twitter. In: Proceedings of the 3rd IEEE International Conference on Social Computing. pp. 149–156.
66. Sumner C, Byers A, Boochever R, Park G (2012) Predicting dark triad personality traits from Twitter usage and a linguistic analysis of tweets. www.onlineprivacyfoundation.org.
67. Iacobelli F, Gill AJ, Nowson S, Oberlander J (2011) Large scale personality classification of bloggers. In: Proceedings of the 4th International Conference on Affective Computing and Intelligent Interaction. Springer-Verlag, pp. 568–577.
68. Bamman D, Eisenstein J, Schnoebelen T (2012) Gender in Twitter: Styles, stances, and social networks. arXiv preprint arXiv:1210.4567.
69. Church KW, Hanks P (1990) Word association norms, mutual information, and lexicography. Computational Linguistics 16: 22–29.
70. Lin D (1998) Extracting collocations from text corpora. In: Knowledge Creation Diffusion Utilization. pp. 57–63.
71. Anscombe FJ (1948) The transformation of Poisson, binomial and negative-binomial data. Biometrika 35: 246–254.
72. Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. Journal of Machine Learning Research 3: 993–1022.
73. Steyvers M, Griffiths T (2007) Probabilistic topic models. Handbook of Latent Semantic Analysis 427: 424–440.
74. Gelfand A, Smith A (1990) Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association 85: 398–409.
75. McCallum AK (2002) MALLET: A machine learning for language toolkit. Available: http://mallet.cs.umass.edu.
76. Dunn OJ (1961) Multiple comparisons among means. Journal of the American Statistical Association 56: 52–64.
77. Eisenstein J, O'Connor B, Smith N, Xing E (2010) A latent variable model for geographic lexical variation. In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pp. 1277–1287.
78. Wordle (2012) Wordle advanced website. Available: http://www.wordle.net/advanced. Accessed 2012 Dec.
79. Harris J (2011) Word clouds considered harmful. Available: http://www.niemanlab.org/2011/10/word-clouds-considered-harmful/.
80. Resnik P (1999) Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research 11: 95–130.
81. Cleveland WS (1979) Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association 74: 829–836.
82. Costa Jr P, McCrae R (2008) The Revised NEO Personality Inventory (NEO-PI-R). The SAGE Handbook of Personality Theory and Assessment 2: 179–198.
83. Bachrach Y, Kosinski M, Graepel T, Kohli P, Stillwell D (2012) Personality and patterns of Facebook usage. Web Science.
84. Sterne J, Gavaghan D, Egger M (2000) Publication and related bias in meta-analysis: power of statistical tests and prevalence in the literature. Journal of Clinical Epidemiology 53: 1119–1129.
85. McCrae RR, Sutin AR (2009) Openness to experience. In: Handbook of Individual Differences in Social Behavior, New York: Guilford. pp. 257–273.
86. Mulac A, Studley LB, Blau S (1990) The gender-linked language effect in primary and secondary students' impromptu essays. Sex Roles 23: 439–470.
87. Thomson R, Murachver T (2001) Predicting gender from electronic discourse. British Journal of Social Psychology 40: 193–208.
88. Mehl MR, Pennebaker JW (2003) The sounds of social life: a psychometric analysis of students' daily social environments and natural conversations. Journal of Personality and Social Psychology 84: 857–870.
89. Mulac A, Bradac JJ (1986) Male/female language differences and attributional consequences in a public speaking situation: Toward an explanation of the gender-linked language effect. Communication Monographs 53: 115–129.
90. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B (Methodological): 289–300.
91. Goldberg L, Johnson J, Eber H, Hogan R, Ashton M, et al. (2006) The International Personality Item Pool and the future of public-domain personality measures. Journal of Research in Personality 40: 84–96.
92. Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9: 1871–1874.
93. Hoerl A, Kennard R (1970) Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12: 55–67.
94. Meyer G, Finn S, Eyde L, Kay G, Moreland K, et al. (2001) Psychological testing and psychological assessment: A review of evidence and issues. American Psychologist 56: 128.
95. Roberts B, Kuncel N, Shiner R, Caspi A, Goldberg L (2007) The power of personality: The comparative validity of personality traits, socioeconomic status, and cognitive ability for predicting important life outcomes. Perspectives on Psychological Science 2: 313–345.
96. Ireland ME, Mehl MR (2012) Natural language use as a marker of personality. In press, Oxford Handbook of Language and Social Psychology.
97. Haig B (2005) An abductive theory of scientific method. Psychological Methods 10: 371.
98. Fast L, Funder D (2008) Personality as manifest in word use: Correlations with self-report, acquaintance report, and behavior. Journal of Personality and Social Psychology 94: 334.
99. Gosling SD, Vazire S, Srivastava S, John OP (2000) Should we trust web-based studies? A comparative analysis of six preconceptions about internet questionnaires. American Psychologist 59: 93–104.
