Societal Data Resources and Data Processing Infrastructure

Bruno Martins
INESC-ID & Instituto Superior Técnico
[email protected]
Instituto de Engenharia de Sistemas e Computadores, Investigação e Desenvolvimento em Lisboa

DATASTORM Task on Societal Data

Project vision: Build infrastructure for large-scale social data analysis and processing, of interest to areas such as the social sciences and economics, to be hosted at the National Foundation for Scientific Computing (FCCN).
• Study of large real social networks
– Data acquisition, storage and processing
– Knowledge extraction
• Understand user activity and behavior
– Analyze social behavior
– Entity disambiguation
– Analyze information diffusion and influence patterns
– Mining of network communities
• Deep articulation with horizontal tasks (e.g., H1, H2, H3 and others)

Challenges in Using Societal Data

• Several heterogeneous sources of relevant information
• Traditional sources (small-scale datasets)
– Large sets of statistical series on various areas
– Scientific data repositories from the social sciences
– Repositories with geographic and territorial information
• Particular focus: social media and news ("big data")
– Articles in traditional media (i.e., online newspapers) and comments on these articles
– Data on Web archives
– Data from the social Web (e.g., Twitter, FourSquare, Facebook, etc.)
Project vision: Deep (automated) study of societal issues involves integrating these sources and analyzing the resulting datasets by combining techniques for processing textual information and networks.

Statistical Series and Scientific Data Repositories

• PORDATA features statistical series about Portugal, Portuguese municipalities and Europe, organized into themes (e.g., population, health, family income and expenditure, education, employment, etc.)
• dados.gov.pt is an open-data initiative/portal that aggregates and publishes information produced by the Portuguese public administration (e.g., public expenditure, electoral results, etc.)
• Arquivo Português de Informação Social (now hosted on RCAAP/FCCN) aggregates information collected about Portuguese society in the context of academic studies (e.g., surveys and opinion polls)

News Articles, Comments on News Articles, and Social Media Contents

• Society is reflected in the media that is presented to us
– Text mining on news articles and on comments on news articles
• Twitter and Facebook continue to grow in importance as a collective, global media voice, and are thus an important story in themselves, worthy of scientific analysis
Science, 30 September 2011: Vol. 333 no. 6051

Horizontal Tasks from DATASTORM

• Horizontal Task 1: Data acquisition and extraction
– Task Leader: Pavel Calado
– Crawling data from social media platforms and from the Web
– Mining textual contents
• Horizontal Task 2: Data representation and query validation
– Task Leader: Alexandre Francisco
– Building graph-based representations for the data
– Data structures and algorithms for efficient storage and processing
• Horizontal Task 3: Knowledge discovery
– Task Leader: Ana Teresa Freitas
– Algorithms for knowledge discovery from societal data

Some Previous Work in the Area

Using the Social Web for Analyzing Social Change

Online social networks are now a fundamental organizing mechanism in recent country-wide social movements
– Characterizing the networks associated with these movements
– Discovering communities and influential individuals
– Examining the temporal and geospatial evolution of the communication activity
– Comprehending modern societal dynamics
References:
The Digital Evolution of Occupy Wall Street. PLoS ONE 8(5)
The Geospatial Characteristics of a Social Movement Communication Network. PLoS ONE 8(3)
Structural and Dynamical Patterns on Online Social Networks: The Spanish May 15th Movement. PLoS ONE 6(8)

Location, Social Ties, and Human Mobility

Using location-based social networks (LBSNs) to investigate human mobility and the relationship between people's social ties and their co-occurrences in time and space
– Predicting which individuals may become friends on social networks, based on visited places
– Geolocating users based on their social links
– Human mobility follows surprisingly regular patterns
References:
Understanding individual human mobility patterns. Nature 453, 779-782 (2008)
Friendship and Mobility: User Movement In Location-Based Social Networks. Proceedings of the 2010 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Quantifying Happiness and Public Well-Being

hedonometer.org is an instrument that measures the happiness of large populations in real time
• Uses Twitter's gardenhose feed (about 10% of all messages, 100GB of JSON per day).
• Messages written in English are assigned a happiness score based on the average happiness score of the words they contain (words manually assessed through crowdsourcing).
• This measure of happiness correlates very well with traditional surveys of well-being.
Reference:
Temporal Patterns of Happiness and Information in a Global-Scale Social Network: Hedonometrics and Twitter. PLoS ONE, 6, e26752, 2011.

The Hedonometer and Quantifying Happiness and Public Well-Being
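The word-averaging idea behind the happiness score can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the hedonometer's actual implementation: the `WORD_HAPPINESS` table holds made-up placeholder scores (the real instrument relies on crowdsourced ratings), and `tokenize` and `message_happiness` are hypothetical helper names.

```python
import re
from typing import Dict, List, Optional

# Hypothetical word scores on a 1-9 scale; placeholder values for
# illustration only, not the crowdsourced ratings used by hedonometer.org.
WORD_HAPPINESS: Dict[str, float] = {
    "love": 8.4,
    "happy": 8.3,
    "rain": 5.0,
    "sad": 2.4,
    "hate": 2.3,
}


def tokenize(message: str) -> List[str]:
    """Lowercase the message and split it into simple word tokens."""
    return re.findall(r"[a-z']+", message.lower())


def message_happiness(
    message: str, scores: Dict[str, float] = WORD_HAPPINESS
) -> Optional[float]:
    """Average the scores of the rated words found in the message.

    Returns None when the message contains no rated word.
    """
    rated = [scores[word] for word in tokenize(message) if word in scores]
    if not rated:
        return None
    return sum(rated) / len(rated)
```

Calling `message_happiness("I love rain")` averages the two rated words ("love" and "rain") and ignores the unrated "I"; a message with no rated words yields `None` rather than a misleading score.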

Text-Driven Forecasting and Predicting the Real World from Text

• Regression models from sparse/noisy word features
– Often using methods that promote sparse models (e.g., Lasso regression)
• Many possible applications
– Box office results
– Elections and opinion polls
– Stock value
– Investment risk
– Restaurant menu prices
– …
References:
Text-Driven Forecasting. Whitepaper from Noah Smith, 2009
Word Salad: Relating Food Prices and Descriptions. Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

Interfaces to Other Vertical Tasks: Culturomics and Epidemics

• www.culturomics.org and the Cultural Observatory @ Harvard
– Using n-grams extracted by Google from books
– Type in a word or phrase in one of seven languages and see how its usage has changed over the last few centuries
• Studying epidemics with Twitter
– Use information embedded in the Twitter stream to track rapidly evolving public concern with respect to H1N1 (swine flu)
– Use the same information to track and measure actual disease activity
References:
Quantitative analysis of culture using millions of digitized books. Science, 2010
The Use of Twitter to Track Levels of Disease Activity and Public Concern in the U.S. during the Influenza A H1N1 Pandemic. PLoS ONE 6(5): e19467
The Expression of Emotions in 20th Century Books. PLoS ONE 8(3): e59030
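The n-gram usage curves mentioned for culturomics can be approximated on a toy scale as follows. This is a minimal sketch under assumptions: the corpus is a hypothetical dictionary mapping a year to a list of texts, tokenization is plain whitespace splitting, and `yearly_relative_frequency` is an illustrative name, not part of any Google or culturomics.org tooling.

```python
from collections import Counter
from typing import Dict, List


def yearly_relative_frequency(
    corpus: Dict[int, List[str]], phrase: str
) -> Dict[int, float]:
    """For each year, return the share of n-grams equal to `phrase`.

    `corpus` maps a year to the texts published that year (a hypothetical
    layout chosen for this sketch). The n-gram length is taken from the
    number of whitespace-separated words in `phrase`.
    """
    target = tuple(phrase.lower().split())
    n = len(target)
    result: Dict[int, float] = {}
    for year, texts in sorted(corpus.items()):
        counts: Counter = Counter()
        for text in texts:
            tokens = text.lower().split()
            # Sliding window of length n over the token sequence.
            counts.update(zip(*(tokens[i:] for i in range(n))))
        total = sum(counts.values())
        result[year] = counts[target] / total if total else 0.0
    return result
```

For example, with `corpus = {1900: ["the telegraph is new", "send a telegraph"], 2000: ["send an email"]}`, the call `yearly_relative_frequency(corpus, "telegraph")` gives the per-year share of unigrams equal to "telegraph", which is the kind of series a usage-over-time plot is drawn from.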

Some National Projects as Well…

The Rest of This Session at the DATASTORM Workshop

• Silvio Moreira, INESC-ID
– The REACTION and POPSTAR projects
• João Vasconcelos, AMA
– Introducing the dados.gov.pt portal
• Paula Carvalho, INESC-ID
– Natural language processing for the social sciences
• Hopefully, some discussion afterwards…