Geographical Information Retrieval
Instituto Superior Técnico - INESC-ID
Data Management and Information Retrieval Group (DMIR) - TagusPark
By Bruno Martins ([email protected])

Motivation for Geographic IR
• Geo-information associates things and events with places.
• Geo-information is abundant on the Web and in digital libraries.
  – Collections of geo-referenced photographs.
  – Newsfeeds.
  – General databases of geo-referenced information.
  – Around 80% of Web pages contain references to places.
• Many information needs are related to a given geographical context.
  – Find me the nearest restaurants.
  – Find me news about Lisboa.
  – Find me photographs taken in Sintra.
  – ...
  – Around 20% of Web searches are “local” in nature.
• Geographic information is part of our everyday lives!

Existing Geographical IR Systems
• Web search engines with “local search”
  – Yahoo! Local, Google Local, ...
  – Integration with navigation mechanisms.
  – Mostly explore “Yellow-pages” information.
• Web-based GIS platforms (virtual globes)
  – Google Earth, ...
  – Explore databases of georeferenced info.
  – OGC standards for Web-GIS
• Photo repositories with “local search”
  – Flickr geo-tagging interface, ...
  – Explore automatic “GPS” geo-referencing.
• Many more location-based services
  – Advertisement, discussion communities, ...
  – Location is everywhere in information systems.

Challenges for Geographical IR
• Very few systems explore information on the Web directly.
  – They instead use databases of georeferenced information.
• Geographic context is embedded in natural language descriptions.
  – This presents problems to automated processing.
  – Place names are ambiguous and get confused with names of organizations, people, buildings and streets.
• Web queries depend on exact matching of text terms; GIR also needs:
  – Handling structured queries (e.g. “concept, relation, location”).
  – Intelligent interpretation of spatial relationships (“near”, “west”, etc.).
  – Ranking results against some measure of geographic relevance.
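Illustration of the structured query form “concept, relation, location” mentioned above. This is only a minimal sketch: the relation list and the function name parse_geo_query are hypothetical, and a real system would rely on a gazetteer and query disambiguation rather than plain keyword matching.

```python
# Hypothetical, illustrative list of spatial relation terms (an assumption,
# not taken from any real system). Multi-word relations are listed first.
SPATIAL_RELATIONS = ["north of", "south of", "east of", "west of", "near", "in"]


def parse_geo_query(query):
    """Split a query into (concept, relation, location).

    Returns None when no known spatial relation is found between a
    non-empty concept and a non-empty location.
    """
    tokens = query.split()
    lowered = [t.lower() for t in tokens]
    for relation in SPATIAL_RELATIONS:
        rel_tokens = relation.split()
        n = len(rel_tokens)
        # Start at 1 so the concept is non-empty; the upper bound keeps
        # at least one token after the relation for the location.
        for i in range(1, len(tokens) - n):
            if lowered[i:i + n] == rel_tokens:
                concept = " ".join(tokens[:i])
                location = " ".join(tokens[i + n:])
                return concept, relation, location
    return None


if __name__ == "__main__":
    for q in ["restaurants near Lisboa", "hotels west of Sintra", "news about Lisboa"]:
        print(q, "->", parse_geo_query(q))
```

The location string returned here is still just text; turning it into a query footprint is the job of the gazetteer-based steps described in the following slides.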
Geographical Information Retrieval (GIR)
• Geographic information retrieval (GIR) is concerned with the retrieval of geographically referenced information objects.
  – Information objects can be maps, images, digital geographic data or even textual (Web) documents.
• New multidisciplinary field
  – Combines techniques from database systems, information retrieval, digital libraries, user interfaces, geographical information systems, ...
[Figure: diagram relating Geographic IR to Geographic Information Systems, Knowledge Management and Information Retrieval]

The difference between GIR and GIS
• GIS is concerned with exact spatial representations and complex analysis at the level of the individual spatial object or field.
  – Users are experts, information is structured and unambiguous!
• GIR is concerned with retrieving geo-referenced information resources that may be relevant to a geographic query region.
  – Unstructured and ambiguous information, everyday applications!
• Similar to the difference between search engines and relational database systems!

Geo-referencing and GIR
• Information objects can be geo-referenced by either place names or geographic coordinates (i.e. longitude and latitude).
  – Geographic coordinates represent an exact physical location.
  – Place names are ambiguous (the main problem of GIR).
• Spatial relations may be either:
  – Geometric: distance and direction measured on a continuous scale.
  – Topological: spatially related but not directly measurable.
[Figure: X and Y coordinate axes]

The typical steps involved in GIR

Anatomy of a Geographical IR System
[Figure: architecture diagram. Components: user interface (mapping, query disambiguation), broker, search engine with textual and spatial indexes, text indexing, spatial indexing, geotagging, ontology (a.k.a. gazetteer), information resources, relevance ranking. Data flows: query footprint, search request + query footprint, document footprints, unranked results, ranked results.]

Gazetteers / Geographic Ontology
• Database containing place names, the spatial relationships among them and the associated geographical footprints.
• Support for geo-referencing based on the place names found in text.
• Many problems in using traditional gazetteers for GIR.

Roles of the Gazetteer in GIR
[Figure: diagram of gazetteer roles. Labels: user interface, query disambiguation, query expansion (query footprint), search component, relevance ranking, spatial index, metadata extraction, geo-tagging, gazetteer, document collection, document footprints.]

Challenges to using Gazetteers in GIR
• To be useful in GIR the gazetteer should support
  – Different locations and boundary changes, integrating data from multiple sources.
  – Synonymous and variant names with differing locations for the same entity.
  – Different relationships among concepts.
  – Names in multiple languages.
  – “Fuzzy” regions and intra-urban place names.
• More than gazetteers, we need an ontology!

Existing Gazetteer Systems/Services
• Alexandria Digital Library (ADL) Gazetteer.
  – ~6 million entries
  – Has tried to standardize the format, description, and distribution of gazetteer data.
  – Has a published, detailed schema.
  – Basis for OGC standard.
• Geonames website.
  – Integrates information from multiple sources.
  – Publishes OWL ontology.
  – ~6 million entries
• EuroGeoNames project.

GeoTagging = GeoParsing + GeoCoding
• Geo-parsing
  – Recognizing geographic references, ignoring non-geographic uses of place terminology.
• Geo-coding
  – Attaching a unique quantitative location (footprint) to the extracted geographic references.

GeoParsing Textual Documents
• The presence of place names can be recognised with the help of gazetteers/geo-ontologies (i.e. lists of names).
• Some types of place references given in text:
  – the name of the place: Coimbra
  – an address: INESC-ID, Rua Alves Redol, 9 Lisboa
  – an address fragment: “Manuel lived near Largo do Rato in Lisboa”
  – a postcode / zip code: 2840-137
  – a phone number: most Lisbon phone numbers start with +351 21

Ambiguity in GeoParsing Documents
Examples of false place references:
• Personal names: Smedes York, Jack London
• Business names: Dorchester Hotel, York Properties, ...
• Street names: Oxford Street, London Road, ...
• Common words: bath, battle, derby, over, well, ...
Approach for handling ambiguity:
  – Look for patterns in the surrounding context!
  – One reference per discourse.

GeoCoding place references in text
• Many different places with the same name (referent ambiguity)
  – Newport, Cambridge, Springfield, Lisboa, ...
• Use context to decide: references to parent or nearby places.
• Choose the most important one: by population or place type.
• Optional step taken by some GIR approaches: finding a document’s encompassing geographic scope.
  – Combine all place references given in the document.
  – Use heuristics to guide the process.

Document Indexing for Geographic IR
• Different indexing strategies are possible:
  – Index documents based on gazetteer ids.
  – Use document scopes to create document footprints (point, bounding rectangle, ...) and use footprints to index documents.
• Strategy for handling queries:
  – Convert query to a query footprint/gazetteer id.
  – Match query footprint to document footprints/ids.
  – Rank documents according to “relevance”.

Handling queries in GIR systems

Data structures for indexing in GIR
[Figure: an inverted index mapping terms to document lists (Term1: D1, D2, D23, …; Term2: D9, D11, D100, …; Term3: D27, D85, ..) next to an R-tree with root R and nodes R1 to R4 over document footprints D1 to D16]
• Typical strategy is to have separate indexes.
  – Inverted index for text.
  – R-tree for footprints.
• Access spatial index with query footprint/gazetteer id.
• Access text index with query terms.
• Merge results and find the intersection.

Ranking search results in GIR
• Spatial similarity can indicate relevance
  – Documents whose spatial content is more similar to the spatial content of the query should appear first.
• But we need to consider both the:
  – Thematic relevance: BM25, TF-IDF, ...
  – Geographic relevance: proximity, containment, ...
    • Geometric (e.g. distance) and non-geometric (e.g. topology)
  – Other importance metrics: PageRank
• State of the art consists of doing a linear combination.

Existing GIR systems: MetaCarta
The MetaCarta system
  – Pioneer system addressing all aspects given in this talk.
  – Conducts geo-parsing and geo-coding of text documents, and sends back possible location references with relative strength scores.
  – Uses Natural Language Processing (NLP) to find possible location references.
  – Contains a gazetteer of ~14 million entries.

Other GIR Systems: Research projects
• Prototype system from the SPIRIT EU project
  – Spatially-aware information retrieval on the Internet.
  – Geo-tagging of Web documents based on a geo-ontology.
• Alexandria Digital Library
  – Digital library of geo-referenced materials.
  – Focus on development of a large gazetteer.
• GREASE, GIPSY, Web-a-Where, GeoXWalk, ...
  – Many more research projects addressing GIR aspects individually.
  – GeoCLEF evaluation contest similar to TREC.
• Project DIGMAP under development at IST
  – Digital library for old maps and historical cartography resources.
  – Indexing metadata records for geographic retrieval.

Current Challenges in Geographic IR
• Improve “conventional GIR” components and methods
  – Geo-tagging, spatio-textual indexing and geo-relevance ranking.
• Improved understanding of spatial natural language terminology.
• Principled approaches for integration and evaluation of GIR.
• Better user interfaces for exploration of GIR results.
• Integration of geographical with temporal aspects.
  – Everything we do happens in space and time!
• Creation of rich place ontologies with world-wide coverage.
  – Fuzzy regions and intra-urban place names present challenges.
• Open GeoInformation Web services and Geospatial Semantic Web.

Where To Find More Information
• Georeferencing: The Geographic Associations of Information
  – By Linda L. Hill (Author), MIT Press
• Proceedings of the Workshops on Geographical IR
  – Edited by Chris Jones and Ross Purves (4th edition in 2007, Lisbon)
• Talk to me using the email address [email protected]
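
Supplementary sketch for the slide “Ranking search results in GIR”: one possible way to linearly combine thematic and geographic relevance. Everything here is an assumption made for illustration: text scores are taken to be already normalized to [0, 1] (raw BM25 or TF-IDF scores are not), document footprints are reduced to single points, and the names geo_score, combined_score and rank, the weight alpha and the 50 km decay scale are hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def geo_score(distance_km, scale_km=50.0):
    """Map distance to a proximity score in (0, 1]; scale_km is an assumed decay scale."""
    return math.exp(-distance_km / scale_km)


def combined_score(text_score, distance_km, alpha=0.5):
    """Linear combination: alpha weights thematic, (1 - alpha) geographic relevance."""
    return alpha * text_score + (1 - alpha) * geo_score(distance_km)


def rank(docs, query_point, alpha=0.5):
    """Rank docs given as (doc_id, text_score, (lat, lon)); best document first."""
    q_lat, q_lon = query_point
    scored = []
    for doc_id, text_score, (lat, lon) in docs:
        distance = haversine_km(q_lat, q_lon, lat, lon)
        scored.append((combined_score(text_score, distance, alpha), doc_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored]


if __name__ == "__main__":
    # Made-up text scores; coordinates are approximate city centres.
    docs = [
        ("d1", 0.6, (38.7223, -9.1393)),  # Lisboa
        ("d2", 0.9, (41.1579, -8.6291)),  # Porto
        ("d3", 0.5, (38.8029, -9.3817)),  # Sintra
    ]
    print(rank(docs, (38.7223, -9.1393)))             # geography and text weighted equally
    print(rank(docs, (38.7223, -9.1393), alpha=1.0))  # text relevance only
```

Setting alpha to 1.0 reduces the ranking to plain text retrieval, which makes the effect of the geographic component easy to inspect.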