technology
from seed"
NLP in the Social Sciences
Paula Carvalho
INESC-ID & Universidade Europeia | Laureate International Universities
Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
17/07/2013
NLP applications
•  Tokenization and Sentence Segmentation
•  Co-reference resolution
•  Relation Extraction
•  POS Tagging
•  NER
•  Word Sense Disambiguation
•  Parsing
•  Sentiment Analysis
•  Summarization and Topic Identification
The power of words
•  Word analyses help to understand different social
processes.
•  Basic NLP applications, such as POS Tagging and
Parsing, can be used to address different social science
research questions.
–  Counting Little Words in Big Data: The Psychology of
Communities, Culture, and History. Cindy K. Chung and James
W. Pennebaker (2012)
–  Syntactic Annotations for the Google Books Ngram Corpus.
Yuri Lin, Jean-Baptiste Michel, Erez Lieberman Aiden, Jon Orwant,
Will Brockman and Slav Petrov, Google Inc. (2012)
How informative are function words?
•  Many NLP studies focus on analyzing content words
(i.e. nouns, adjectives, verbs and adverbs) and ignore
function words.
•  Chung & Pennebaker (2012) show that function words in
big data are useful to understand the psychology of
communities, culture, and history.
•  All words can be informative, depending on the research
purposes.
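A minimal sketch of how function words can be counted, assuming a small hand-picked word set. The `FUNCTION_WORDS` set and the `function_word_rate` helper are illustrative names for this sketch, not the categories or tooling used by Chung & Pennebaker (2012).

```python
import re

# Illustrative, incomplete set of English function words (an assumption for this sketch).
FUNCTION_WORDS = {
    "i", "we", "you", "he", "she", "it", "they",
    "a", "an", "the", "of", "in", "on", "to", "with",
    "and", "but", "or", "not", "is", "are", "was",
}

def function_word_rate(text):
    """Return the fraction of tokens in `text` that are function words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in FUNCTION_WORDS)
    return hits / len(tokens)

rate = function_word_rate("The cat sat on the mat and I watched it")
```

A real study would rely on a validated lexicon rather than this toy list; the point is only that the measure is a simple relative frequency.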
What can words tell about people?
•  Words can signal basic social and demographic
categories:
–  Sex. In general, women tend to use more pronouns and
references to other people; men are more likely to use articles,
prepositions, and big words.
–  Age. As people get older, they tend to refer to themselves less,
use more positive emotion and fewer negative emotion words.
Older people also use more future tense and fewer past tense
verbs.
–  Social class. The higher the social class, the less likely one is
to use 1st person singular pronouns and emotion words.
What can function words tell about people?
•  Words can also reveal basic social and personality
processes, including:
–  Lying vs. telling the truth. When people tell the truth, they tend to
use more 1st person singular pronouns. They also use more exclusive
words (e.g. except, but, without, excluding).
–  Depression and suicide-proneness. Suicidal individuals show
increasing social isolation and heightened self-focus, reflected in
rising rates of “I” use and falling rates of “we” use over time.
–  Social bonding after a trauma. In the days and weeks after a
cultural upheaval, people become less self-focused (less use of “I”)
and more oriented towards others (increased use of “we”).
–  …
Social bonding after a trauma - 9/11
•  75,000 blog entries from about 1,000 bloggers in the
weeks surrounding 9/11
•  Statistically significant jump in we-words and drop in
I-words immediately after the terrorist attacks.
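The comparison described above can be sketched as follows, assuming dated blog entries are available as `(date, text)` pairs. The word sets, the `pronoun_rates` helper and the two toy entries are invented for illustration; they are not the data or the method of the original study, and no significance test is included.

```python
import re
from collections import defaultdict
from datetime import date

# Illustrative pronoun sets (assumptions for this sketch).
I_WORDS = {"i", "me", "my", "mine", "myself"}
WE_WORDS = {"we", "us", "our", "ours", "ourselves"}

def pronoun_rates(entries, event_day):
    """Return I-word and we-word rates (per 100 tokens) before and after `event_day`.

    `entries` is an iterable of (date, text) pairs.
    """
    counts = defaultdict(lambda: {"i": 0, "we": 0, "total": 0})
    for day, text in entries:
        period = "before" if day < event_day else "after"
        tokens = re.findall(r"[a-z']+", text.lower())
        counts[period]["i"] += sum(tok in I_WORDS for tok in tokens)
        counts[period]["we"] += sum(tok in WE_WORDS for tok in tokens)
        counts[period]["total"] += len(tokens)
    rates = {}
    for period, c in counts.items():
        total = c["total"] or 1  # avoid division by zero for empty periods
        rates[period] = {"i": 100 * c["i"] / total, "we": 100 * c["we"] / total}
    return rates

# Invented toy entries, for illustration only.
toy_entries = [
    (date(2001, 9, 8), "I went to my office and I worked"),
    (date(2001, 9, 12), "We are all together and our city needs us"),
]
rates = pronoun_rates(toy_entries, event_day=date(2001, 9, 11))
```

With real data, the rates would be computed per day or per week and the jump tested statistically.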
Google N-grams
•  The Google Books Ngram Corpus (Michel et al., 2011) has
enabled the quantitative analysis of linguistic and cultural
trends as reflected in millions of books written over the
past five centuries.
•  The corpus consists of words and phrases and their usage
frequency over time.
•  The new edition of the corpus (Lin et al., 2012) introduces
syntactic annotations:
–  words are tagged with their POS
–  head-modifier relationships are recorded
Term search
•  The usage frequencies of the terms "Bin Laden" and "Al-Qaeda" are strongly correlated over time.
Word and POS
•  The inclusion of POS information reduces potential noise
caused by lexical ambiguity.
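To illustrate how POS tags separate the senses of an ambiguous word, here is a small sketch. It assumes tab-separated records of the form ngram, year, match_count, volume_count, with the POS tag attached to the word by an underscore (e.g. `book_NOUN`). That layout, the `counts_by_pos` helper and the sample counts are assumptions for this sketch and should be checked against the corpus documentation.

```python
from collections import defaultdict

# Toy records in the assumed layout: ngram <TAB> year <TAB> match_count <TAB> volume_count.
# The counts are invented for illustration.
SAMPLE_LINES = [
    "book_NOUN\t2000\t120\t40",
    "book_VERB\t2000\t15\t9",
    "book_NOUN\t2001\t130\t42",
    "book_VERB\t2001\t18\t11",
]

def counts_by_pos(lines):
    """Sum match counts per (word, POS tag, year)."""
    totals = defaultdict(int)
    for line in lines:
        ngram, year, match_count, _volumes = line.rstrip("\n").split("\t")
        word, sep, tag = ngram.rpartition("_")
        if not sep:  # no underscore: treat the entry as untagged
            word, tag = ngram, None
        totals[(word, tag, int(year))] += int(match_count)
    return dict(totals)

totals = counts_by_pos(SAMPLE_LINES)
```

Keeping the tag in the key means that "book" as a noun and "book" as a verb are counted as separate series instead of being merged.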
Head-modifier relationship
•  This linguistic information makes it possible to process
syntactic units that are not necessarily contiguous in the text.
Mood and influence within a community
•  Many studies examine overall mood and influence
within a community by exploring the available sentiment
resources (e.g. LIWC).
•  Experiments have shown that a dictionary-based
metric is a valid indicator of happiness as a
function of the cultural context.
–  Americans were more positive on national holidays (e.g.,
Christmas, Thanksgiving), and on Fridays.
–  Americans were the least positive on days of national tragedy
(e.g., the day Michael Jackson died), and on Mondays.
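One simple form of dictionary-based metric can be sketched as follows, assuming posts labelled with a day and two small word lists. The `POSITIVE` and `NEGATIVE` sets, the `daily_positivity` helper and the toy posts are invented for illustration; this is not LIWC and not the metric from the cited experiments.

```python
import re
from collections import defaultdict

# Tiny illustrative lexicons (assumptions for this sketch, not LIWC categories).
POSITIVE = {"happy", "love", "great", "fun"}
NEGATIVE = {"sad", "hate", "awful", "tired"}

def daily_positivity(posts):
    """Return {day: (positive - negative) / total tokens} for (day, text) pairs."""
    pos = defaultdict(int)
    neg = defaultdict(int)
    total = defaultdict(int)
    for day, text in posts:
        tokens = re.findall(r"[a-z']+", text.lower())
        pos[day] += sum(tok in POSITIVE for tok in tokens)
        neg[day] += sum(tok in NEGATIVE for tok in tokens)
        total[day] += len(tokens)
    return {day: (pos[day] - neg[day]) / total[day] for day in total if total[day]}

# Invented toy posts, for illustration only.
toy_posts = [
    ("Fri", "great day love it"),
    ("Mon", "so tired and sad today"),
    ("Fri", "fun fun"),
]
scores = daily_positivity(toy_posts)
```

Comparing such scores across calendar days is what allows statements like "more positive on Fridays than on Mondays".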
The happiest and the saddest tweeters
Mitchell et al. (2013)
•  The authors used more than 10 million geo-tagged tweets from 2011 to map out
the moods of Americans in urban areas.
•  They ranked the locations based on the frequency of positive and negative words,
using the Language Assessment by Mechanical Turk (labMT) word list.
–  The happiest 5 states, in order, are: Hawaii, Maine, Nevada, Utah and Vermont.
–  The saddest 5 states, in order, are: Louisiana, Mississippi, Maryland, Delaware and Georgia.
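A ranking of this kind can be sketched as a frequency-weighted average of per-word happiness scores, computed separately for each location. The scores in `WORD_SCORES` are made up on a 1 to 9 scale and are not the real labMT values; the helper names and the toy tweets are likewise assumptions for this sketch rather than the authors' implementation.

```python
import re
from collections import Counter

# Made-up scores on a 1-9 scale for illustration; NOT the real labMT values.
WORD_SCORES = {"beach": 8.0, "love": 8.5, "traffic": 3.0, "hate": 2.0}

def average_happiness(texts):
    """Frequency-weighted mean score of the scored words found in `texts`."""
    freq = Counter()
    for text in texts:
        for tok in re.findall(r"[a-z']+", text.lower()):
            if tok in WORD_SCORES:
                freq[tok] += 1
    total = sum(freq.values())
    if total == 0:
        return None  # no scored words, so no estimate
    return sum(WORD_SCORES[word] * n for word, n in freq.items()) / total

def rank_locations(tweets_by_location):
    """Return location names sorted from happiest to saddest."""
    scores = {loc: average_happiness(texts) for loc, texts in tweets_by_location.items()}
    scored = {loc: s for loc, s in scores.items() if s is not None}
    return sorted(scored, key=scored.get, reverse=True)

# Invented toy tweets, for illustration only.
toy_tweets = {
    "A": ["love the beach", "beach day"],
    "B": ["hate this traffic", "traffic again"],
}
ranking = rank_locations(toy_tweets)
```

Locations with no scored words are left out of the ranking rather than being given an arbitrary score.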
The happiest and the saddest tweeters
Mitchell et al. (2013)
•  The data suggest that cities with high technology
adoption rates are less happy than their less technological
counterparts.
•  They found that wealthy areas tended to have higher
happiness levels and that areas with high rates of obesity
have lower happiness levels.
–  The authors used obesity rates and food words to create lists
of low- and high-obesity words, and compared their results
against census data.
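Comparing word-based scores with external statistics typically comes down to a correlation. The sketch below computes a Pearson coefficient by hand; the `pearson` helper and the four data points are invented for illustration and are not figures from Mitchell et al. (2013) or from census data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)

# Invented values for four hypothetical areas, for illustration only.
happiness = [6.2, 6.0, 5.9, 5.7]
obesity_rate = [20.0, 24.0, 27.0, 31.0]
r = pearson(happiness, obesity_rate)
```

A negative coefficient here would correspond to the pattern reported above, where areas with higher obesity rates have lower happiness levels.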
Conclusions
•  Big data from social networks and other available
resources provides a means of answering different research
questions, with particular impact on the social sciences.
•  The more refined and extensive the linguistic resources
used to process such data, the more interesting and
accurate the resulting analyses.
More interesting references
•  Computational Sociolinguistics
–  David Bamman, Jacob Eisenstein, and Tyler Schnoebelen (2012) Gender in Twitter: Styles, Stances, and Social
Networks. Under review.
–  Brendan O'Connor, Jacob Eisenstein, Eric P. Xing, and Noah A. Smith (2010) A Mixture Model of
Demographic Language Variation. Proceedings of the NIPS Workshop on Machine Learning for Social Computing.
•  Data Driven Sociology and Political Science
–  Michael D. Conover, Emilio Ferrara, Filippo Menczer, and Alessandro Flammini (2013) The Digital Evolution of Occupy
Wall Street. PLoS ONE 8(5):e64679.
–  P. S. Dodds and C. M. Danforth (2009) Measuring the Happiness of Large-Scale Written Expression: Songs, Blogs, and
Presidents. Journal of Happiness Studies, 11, 444-456.
•  Linking Text and Geography
–  Travis Brown, Jason Baldridge, Maria Esteva, and Weijia Xu. The substantial words are in the ground and sea:
computationally linking text and geography. Texas Studies in Literature and Language: Linguistics and Literary
Studies: Computation and Convergence. 54(3).
–  Jacob Eisenstein, Brendan O'Connor, Noah Smith, and Eric P. Xing (2010) A Latent Variable Model of Geographical
Lexical Variation. Proceedings of the Conference on Empirical Methods in Natural Language Processing.