Societal Data Resources and Data Processing
Infrastructure
Bruno Martins
INESC-ID & Instituto Superior Técnico
Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
DATASTORM Task on Societal Data
Project vision: Build infrastructure for large-scale social data analysis and
processing, of interest to areas such as the social sciences and economics,
to be hosted at the National Foundation for Scientific Computing (FCCN).
•  Study of large real social networks
–  Data acquisition, storage and processing
–  Knowledge extraction
•  Understand user activity and behavior
–  Analyze social behavior
–  Entity disambiguation
–  Analyze information diffusion and influence patterns
–  Mining of network communities
•  Deep articulation with horizontal tasks (e.g., H1, H2, H3 and others)
Challenges in Using Societal Data
•  Several heterogeneous sources of relevant information
•  Traditional sources (small scale datasets)
–  Large sets of statistical series on various areas
–  Scientific data repositories from the social sciences
–  Repositories with geographic and territorial information
•  Particular focus: social media and news ("big data")
–  Articles in traditional media (i.e., online newspapers) and comments on these articles
–  Data on Web archives
–  Data from the social Web (e.g., Twitter, FourSquare, Facebook, etc.)
Project vision: A deep (automated) study of societal issues involves
integrating these sources and analyzing the resulting datasets by
combining techniques for processing textual information and networks
Statistical Series and Scientific Data
Repositories
PORDATA features statistical series about Portugal,
Portuguese municipalities and Europe, organized into
themes (e.g., population, health, family income and
expenditure, education, employment, etc.)
dados.gov.pt is an open-data initiative/portal that
aggregates and publishes information produced by
the Portuguese public administration (e.g., public
expenditure, electoral results, etc.)
Arquivo Português de Informação Social (now hosted on
RCAAP/FCCN) aggregates information collected
about Portuguese society in the context of
academic studies (e.g., surveys and opinion polls)
News Articles, Comments on News
Articles, and Social Media Content
•  Society is reflected in the media that is presented to us
–  Text mining on news articles and on comments on news articles
•  Twitter and Facebook are becoming an increasingly
important collective, global media voice, and are thus
an important story in themselves, worthy of scientific analysis
Science 30 September 2011: Vol. 333 no. 6051
Horizontal Tasks from DATASTORM
•  Horizontal Task 1 : Data acquisition and extraction
–  Task Leader : Pavel Calado
–  Crawling data from social media platforms and from the Web
–  Mining textual contents
•  Horizontal Task 2 : Data representation and query validation
–  Task Leader : Alexandre Francisco
–  Building graph-based representations for the data
–  Data structures and algorithms for efficient storage and processing
•  Horizontal Task 3 : Knowledge Discovery
–  Task Leader : Ana Teresa Freitas
–  Algorithms for knowledge discovery from societal data
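As a rough illustration of the graph-based representations mentioned under Horizontal Task 2, the sketch below builds a directed "mention" graph from short messages and counts how many distinct authors mention each user. This is not DATASTORM code: the function names, the `@user` convention and the sample messages are assumptions made for the example.

```python
import re
from collections import defaultdict

# Assumed convention: users are referenced as @name inside the message text.
MENTION = re.compile(r"@(\w+)")


def build_mention_graph(messages):
    """Return a directed adjacency map: author -> {mentioned user: count}."""
    graph = defaultdict(lambda: defaultdict(int))
    for author, text in messages:
        for target in MENTION.findall(text):
            target = target.lower()
            if target != author:  # ignore self-mentions
                graph[author][target] += 1
    return graph


def in_degree(graph):
    """Count how many distinct authors mention each user."""
    counts = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            counts[target] += 1
    return dict(counts)


# Hypothetical sample data, for illustration only.
messages = [
    ("ana", "Great talk by @bruno and @carla"),
    ("bruno", "Thanks @ana!"),
    ("carla", "Agree with @bruno"),
]
graph = build_mention_graph(messages)
degrees = in_degree(graph)
```

In-degree is only one crude proxy for influence; storage and processing at scale would need more compact data structures than nested dictionaries.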
Some Previous Work in the Area
Using the Social Web for Analyzing
Social Change
Online social networks are now a fundamental organizing
mechanism in recent country-wide social movements
–  Characterizing the networks associated with these movements
–  Discovering communities and influential individuals
–  Examining the temporal and geospatial evolution of the communication activity
–  Comprehending modern societal dynamics
The Digital Evolution of Occupy Wall Street. PLoS ONE 8(5)
The Geospatial Characteristics of a Social Movement Communication Network. PLoS ONE 8(3)
Structural and Dynamical Patterns on Online Social Networks: The Spanish May 15th Movement. PLoS ONE 6(8)
Location, Social Ties, and Human
Mobility
Using location-based social networks (LBSNs) to investigate
human mobility and the relationship between people's social
ties and their co-occurrences in time and space
–  Predicting which individuals may become friends on social networks, based on visited places
–  Geolocating users based on their social links
–  Human mobility follows surprisingly regular patterns
Understanding individual human mobility patterns
Nature 453, 779-782 (2008)
Friendship and Mobility: User Movement In Location-Based Social Networks
Proceedings of the 2011 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
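To make the idea of predicting friendships from visited places concrete, here is a minimal sketch that ranks pairs of users by the Jaccard overlap of the places they checked in to. It is a simple heuristic chosen for illustration, not the model used in the cited papers; the check-in data and function names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidate_pairs(checkins):
    """Rank user pairs by the overlap of the places they visited."""
    places = {}
    for user, place in checkins:
        places.setdefault(user, set()).add(place)
    scored = [
        (jaccard(places[u], places[v]), u, v)
        for u, v in combinations(sorted(places), 2)
    ]
    # Highest overlap first; ties are broken alphabetically.
    return sorted(scored, key=lambda t: (-t[0], t[1], t[2]))


# Hypothetical check-ins as (user, place) pairs.
checkins = [
    ("ana", "cafe"), ("ana", "library"), ("ana", "gym"),
    ("bruno", "cafe"), ("bruno", "library"),
    ("carla", "stadium"), ("carla", "gym"),
]
ranking = rank_candidate_pairs(checkins)
```

Here the pair with the most shared places ranks first. A realistic predictor would also weigh when the visits happened and how popular each place is.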
Quantifying Happiness and Public
Well-Being
hedonometer.org is an instrument that measures the
happiness of large populations in real time
•  Uses Twitter’s gardenhose feed (about 10% of all messages, 100 GB of JSON per day).
•  Messages written in English are assigned a happiness score based on the average happiness
score of the words they contain (words manually assessed through crowdsourcing).
•  This measure of happiness correlates very well with traditional surveys of well-being.
Temporal Patterns of Happiness and Information in a Global-Scale Social Network: Hedonometrics and Twitter.
PLoS ONE, 6, e26752, 2011.
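The word-averaging step described above can be sketched as follows. The lexicon here is a tiny set of made-up scores, not the crowdsourced word list used by hedonometer.org, and the tokenization is deliberately naive.

```python
import re

# Hypothetical word scores (higher = happier); NOT the real crowdsourced lexicon.
WORD_HAPPINESS = {"love": 8.4, "happy": 8.3, "rain": 5.0, "sad": 2.4, "hate": 2.3}


def happiness_score(text, lexicon=WORD_HAPPINESS):
    """Average the scores of the words found in the lexicon.

    Returns None when no word of the message is in the lexicon.
    """
    words = re.findall(r"[a-z']+", text.lower())
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else None


score = happiness_score("I love the rain, but I hate being sad")
```

Words missing from the lexicon are simply skipped, so a message with no scored words gets no score rather than a neutral one.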
The Hedonometer and Quantifying
Happiness and Public Well-Being
Text Driven Forecasting and
Predicting the Real World from Text
•  Regression models from sparse/noisy word features
–  Often using methods that promote sparse models (e.g., Lasso regression)
•  Many possible applications
–  Box office results
–  Elections and opinion polls
–  Stock value
–  Investment risk
–  Restaurant menu prices
–  …
Whitepaper from Noah Smith in 2009 : Text-Driven Forecasting
Word Salad: Relating Food Prices and Descriptions
Proceedings of the 2012 Conference on Empirical Methods in Natural Language Processing and Natural Language Learning
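A minimal sketch of sparse regression on word features: bag-of-words counts feed a Lasso model fitted by cyclic coordinate descent, written in plain Python to stay self-contained. The menu texts and prices are invented, and the function names and the value of `alpha` are assumptions; none of this reproduces the cited work.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the Lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def bag_of_words(texts):
    """Turn texts into word-count rows over a sorted vocabulary."""
    tokenized = [t.lower().split() for t in texts]
    vocab = sorted({w for tokens in tokenized for w in tokens})
    rows = [[float(tokens.count(w)) for w in vocab] for tokens in tokenized]
    return vocab, rows


def predict(row, w, b):
    """Linear prediction: intercept plus weighted word counts."""
    return b + sum(x * wj for x, wj in zip(row, w))


def lasso_objective(X, y, w, b, alpha):
    """(1 / 2n) * squared error + alpha * L1 norm of the weights."""
    n = len(X)
    squared = sum((yi - predict(row, w, b)) ** 2 for row, yi in zip(X, y))
    return squared / (2 * n) + alpha * sum(abs(wj) for wj in w)


def lasso_fit(X, y, alpha=0.5, sweeps=200):
    """Fit weights and an unpenalized intercept by cyclic coordinate descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, sum(y) / n
    for _ in range(sweeps):
        for j in range(d):
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                X[i][j] * (y[i] - predict(X[i], w, b) + X[i][j] * w[j])
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, alpha) / z if z else 0.0
        # Re-center the intercept on the current residuals.
        b = sum(y[i] - predict(X[i], w, 0.0) for i in range(n)) / n
    return w, b


# Hypothetical menu descriptions and prices, for illustration only.
texts = ["fresh lobster truffle", "lobster salad", "fries salad",
         "fries soda", "truffle pasta", "pasta soda"]
prices = [40.0, 25.0, 8.0, 5.0, 30.0, 10.0]
vocab, X = bag_of_words(texts)
weights, intercept = lasso_fit(X, prices, alpha=0.5)
```

Raising `alpha` pushes more word weights to exactly zero, which is what makes the fitted model sparse; with a very large `alpha` only the intercept (the mean price) remains.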
Interfaces to other vertical tasks
- Culturomics and Epidemics -
www.culturomics.org and the cultural
observatory @ Harvard
–  Using n-grams extracted by Google from books
–  Type in a word or phrase in one of seven languages and see how its usage
has been changing throughout the last few centuries
Studying epidemics with Twitter
–  Use information embedded in the Twitter stream to:
•  Track rapidly-evolving public concern with respect to H1N1 or swine flu
•  Track and measure actual disease activity.
Quantitative analysis of culture using millions of digitized books. Science, 2010
The Use of Twitter to Track Levels of Disease Activity and Public Concern in the U.S. during
the Influenza A H1N1 Pandemic. PLoS ONE 6(5): e19467
The Expression of Emotions in 20th Century Books. PLoS ONE 8(3): e59030
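The kind of usage-over-time curve described above can be approximated with a toy n-gram counter: for each year, the share of n-grams equal to the query phrase. This is only a sketch over an invented three-document corpus; it does not reflect how the Google Books n-grams are actually extracted or normalized.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def usage_by_year(corpus, phrase):
    """Relative frequency of a phrase per year, among n-grams of the same length."""
    n = len(phrase.split())
    hits, totals = Counter(), Counter()
    for year, text in corpus:
        grams = ngrams(text.lower().split(), n)
        totals[year] += len(grams)
        hits[year] += grams.count(phrase.lower())
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}


# Hypothetical (year, text) pairs standing in for digitized books.
corpus = [
    (1900, "the telegraph office sent the telegraph"),
    (1950, "the telephone rang and the telegraph was silent"),
    (2000, "the internet replaced the telephone"),
]
series = usage_by_year(corpus, "telegraph")
```

The same counting, applied to dated tweets and a keyword such as "flu", gives a first approximation of the concern-tracking idea, although the cited study relies on more careful filtering.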
Some National Projects as Well…
The rest of this session at the
DATASTORM workshop
•  Silvio Moreira, INESC-ID
–  The REACTION and POPSTAR projects
•  João Vasconcelos, AMA
–  Introducing the dados.gov.pt portal
•  Paula Carvalho, INESC-ID
–  Natural language processing for the social sciences
•  Hopefully, some discussion afterwards…