NLP in the Social Sciences
Paula Carvalho
INESC-ID & Universidade Europeia | Laureate International Universities
Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
17/07/2013

NLP applications
• Tokenization and Sentence Segmentation
• Co-reference resolution
• Relation Extraction
• POS Tagging
• NER
• Word Sense Disambiguation
• Parsing
• Sentiment Analysis
• Summarization and Topic Identification

The power of words
• Word analyses help to understand different social processes.
• Basic NLP applications, such as POS Tagging and Parsing, can be used to address different social science research questions.
– Counting Little Words in Big Data: The Psychology of Communities, Culture, and History. Cindy K. Chung and James W. Pennebaker (2012)
– Syntactic Annotations for the Google Books Ngram Corpus. Yuri Lin, Jean-Baptiste Michel, Erez Lieberman Aiden, Jon Orwant, Will Brockman and Slav Petrov, Google Inc. (2012)

How informative are function words?
• Many NLP studies have focused on analyzing content words (i.e. nouns, adjectives, verbs and adverbs), ignoring function words.
• Chung & Pennebaker (2012) show that function words in big data are useful for understanding the psychology of communities, culture, and history.
• All words can be informative, depending on the research purpose.

What can words tell about people?
• Words can signal basic social and demographic categories:
– Sex.
In general, women tend to use more pronouns and references to other people; men are more likely to use articles, prepositions, and big words.
– Age. As people get older, they tend to refer to themselves less and to use more positive emotion words and fewer negative emotion words. Older people also use more future tense and fewer past tense verbs.
– Social class. The higher the social class, the less likely one is to use 1st person singular pronouns and emotion words.

What can function words tell about people?
• Words can also reveal basic social and personality processes, including:
– Lying vs. telling the truth. When people tell the truth, they tend to use more 1st person singular pronouns. They also use more exclusive words (e.g. except, but, without, excluding).
– Depression and suicide-proneness. Suicidal individuals show increasing social isolation and heightened self-focus, reflected in rising rates of “I” use and falling rates of “we” use over time.
– Social bonding after a trauma. In the days and weeks after a cultural upheaval, people become more selfless (less use of “I”) and more oriented towards others (increased use of “we”).
– …

Social bonding after a trauma - 9/11
• 75,000 blog entries from about 1,000 bloggers in the weeks surrounding 9/11
• Statistically significant jump in we-words and drop in I-words immediately after the terrorist attacks.
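
The I-word and we-word shifts described above come down to counting pronouns relative to the total number of words in a text. A minimal sketch of that measurement, assuming small illustrative pronoun sets and a simple regex tokenizer (both are assumptions made for this example, not the dictionaries or tools used in the cited studies):

```python
import re

# Illustrative pronoun sets (an assumption for this sketch, not the word
# lists used in the studies cited above).
I_WORDS = {"i", "me", "my", "mine", "myself"}
WE_WORDS = {"we", "us", "our", "ours", "ourselves"}


def pronoun_rates(text):
    """Return (I-word rate, we-word rate) as fractions of all word tokens."""
    # Lowercase, then keep alphabetic runs with an optional apostrophe part.
    # Contractions such as "I'm" stay as one token and are not counted here.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not tokens:
        return 0.0, 0.0
    i_count = sum(1 for t in tokens if t in I_WORDS)
    we_count = sum(1 for t in tokens if t in WE_WORDS)
    return i_count / len(tokens), we_count / len(tokens)


# Hypothetical entries standing in for posts written before and after an event.
before = "I went to work and I finished my report."
after = "We are all shocked and we hope our friends are safe."

print("before:", pronoun_rates(before))
print("after: ", pronoun_rates(after))
```

Normalizing by the token count keeps long and short entries comparable; tracking these two rates per day across many entries would give the kind of time series in which a jump in we-words and a drop in I-words can be tested for significance.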

Google N-grams
• The Google Books Ngram Corpus (Michel et al., 2011) has enabled the quantitative analysis of linguistic and cultural trends as reflected in millions of books written over the past five centuries.
• The corpus consists of words and phrases and their usage frequency over time.
• The new edition of the Google Books Ngram Corpus (Lin et al., 2012) introduces syntactic annotations:
– words are tagged with their POS
– head-modifier relationships are recorded

Term search
• There is a strong correlation between the frequencies of the terms “Bin Laden” and “Al-Qaeda”.

Word and POS
• The inclusion of POS information reduces potential noise caused by lexical ambiguity.

Head modifier relationship
• This linguistic information makes it possible to process syntactic units that are not necessarily contiguous in text.

Mood and influence within a community
• Many studies try to examine overall mood and influence within a community by exploring the different sentiment resources available (e.g. LIWC).
• Experiments have shown that a dictionary-based metric is a valid indicator of happiness as a function of the cultural context.
– Americans were more positive on national holidays (e.g., Christmas, Thanksgiving), and on Fridays.
– Americans were the least positive on days of national tragedy (e.g., the day Michael Jackson died), and on Mondays.

The happiest and the saddest tweeters (Mitchell et al., 2013)
• More than 10 million geo-tagged tweets from 2011 were used to map out the moods of Americans in urban areas.
• Locations were ranked by the frequency of positive and negative words, using the Mechanical Turk Language Assessment word list.
• The happiest 5 states, in order, are: Hawaii, Maine, Nevada, Utah and Vermont.
• The saddest 5 states, in order, are: Louisiana, Mississippi, Maryland, Delaware and Georgia.
• The data suggest that cities with high technology adoption rates are less happy than their less technological counterparts.
• Wealthy areas tended to have higher happiness levels, and areas with high rates of obesity had lower happiness levels.
– The authors used obesity rates and food words to build lists of low- and high-obesity words, and compared their results against census data.

Conclusions
• Big data coming from social networks or other available resources provide a means of answering different research questions, with particular impact in the social sciences.
• The more refined and extensive the linguistic resources used to process such data, the more interesting and accurate the resulting analyses.

More interesting references
• Computational Sociolinguistics
– David Bamman, Jacob Eisenstein, and Tyler Schnoebelen (2012) Gender in Twitter: Styles, Stances, and Social Networks. Under review
– Brendan O'Connor, Jacob Eisenstein, Eric P. Xing, and Noah A. Smith (2010) A Mixture Model of Demographic Language Variation. Proceedings of the NIPS Workshop on Machine Learning for Social Computing
• Data Driven Sociology and Political Science
– Michael D. Conover, Emilio Ferrara, Filippo Menczer, and Alessandro Flammini (2013) The Digital Evolution of Occupy Wall Street. PLoS ONE 8(5):e64679
– P. S. Dodds and C. M. Danforth (2009) Measuring the Happiness of Large-Scale Written Expression: Songs, Blogs, and Presidents. Journal of Happiness Studies, 11, 444-456
• Linking Text and Geography
– Travis Brown, Jason Baldridge, Maria Esteva, and Weijia Xu. The substantial words are in the ground and sea: computationally linking text and geography. Texas Studies in Literature and Language: Linguistics and Literary Studies: Computation and Convergence, 54(3)
– Jacob Eisenstein, Brendan O'Connor, Noah Smith, and Eric P. Xing (2010) A Latent Variable Model of Geographical Lexical Variation. Proceedings of the Conference on Empirical Methods in Natural Language Processing