Geographical Information Retrieval
Instituto Superior Técnico - INESC-ID
Data Management and Information Retrieval Group (DMIR) - TagusPark
By
Bruno Martins ([email protected])
Motivation for Geographic IR
• Geo-information associates things and events with places.
• Geo-information is abundant on the Web and in digital libraries.
– Collections of geo-referenced photographs.
– Newsfeeds.
– General databases of geo-referenced information.
– Around 80% of Web pages contain references to places.
• Many information needs are related to a given geographical context.
– Find me the nearest restaurants.
– Find me news about Lisboa.
– Find me photographs taken in Sintra.
– ...
– Around 20% of Web searches are “local” in nature.
• Geographic information is part of our everyday lives!
Existing Geographical IR Systems
• Web search engines with “local search”
– Yahoo! Local, Google Local, ...
– Integration with navigation mechanisms.
– Mostly explore “Yellow-pages” information.
• Web-based GIS platforms (virtual globes)
– Google Earth, ...
– Explore databases of georeferenced info.
– OGC standards for Web-GIS.
• Photo repositories with “local search”
– Flickr geo-tagging interface, ...
– Explore automatic “GPS” geo-referencing.
• Many more location-based services
– Advertisement, discussion communities, ...
• Location is everywhere in information systems.
Challenges for Geographical IR
• Very few systems explore information on the Web directly.
– Most rely instead on databases of georeferenced information.
• Geographic context embedded in natural language descriptions.
– This presents problems to automated processing.
– Place names are ambiguous and get confused with names of
organizations, people, buildings and streets.
• Web queries depend on exact matching of text terms; GIR also requires:
– Handling structured queries (e.g. “concept, relation, location”).
– Intelligent interpretation of spatial relationships (“near”, “west” etc).
– Ranking results against some measure of geographic relevance.
Geographical Information Retrieval (GIR)
• Geographic information retrieval (GIR) is concerned with the retrieval of
geographically referenced information objects.
– Information objects can be maps, images, digital geographic data or
even textual (web) documents.
• New multidisciplinary field
– Combines techniques from database systems, information retrieval,
digital libraries, user interfaces, geographical information systems, ...
[Figure: Geographic IR at the intersection of Geographic Information Systems, Knowledge Management and Information Retrieval.]
The difference between GIR and GIS
• GIS is concerned with exact spatial representations
and complex analysis at the level of the individual
spatial object or field.
– Users are experts, information is structured and unambiguous!
• GIR is concerned with retrieving geo-referenced
information resources that may be relevant to a
geographic query region.
– Unstructured and ambiguous information, everyday applications!
• Similar to the difference between search engines
and relational database systems!
Geo-referencing and GIR
• Information objects can be geo-referenced by either place
names or by geographic coordinates (i.e. longitude & latitude)
– Geographic coordinates represent exact physical location
– Placenames are ambiguous (main problem of GIR)
• Spatial relations may be either:
– Geometric: distance and direction measured on a continuous scale.
– Topological: spatially related but not directly measurable.
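The two kinds of spatial relation can be illustrated with a minimal Python sketch. It assumes points given as latitude/longitude in degrees and footprints given as bounding boxes; the function names and the approximate coordinates for Lisboa and Sintra are illustrative assumptions, not part of the original talk.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (approximation)

def haversine_km(lat1, lon1, lat2, lon2):
    """Geometric relation: great-circle distance in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def initial_bearing_deg(lat1, lon1, lat2, lon2):
    """Geometric relation: initial compass bearing (0 = north, 90 = east)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlmb))
    return math.degrees(math.atan2(y, x)) % 360.0

def bbox_contains(outer, inner):
    """Topological relation: is box `inner` inside box `outer`?
    Boxes are (min_lat, min_lon, max_lat, max_lon)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

# Approximate city-centre coordinates, for illustration only.
LISBOA = (38.72, -9.14)
SINTRA = (38.80, -9.38)
distance = haversine_km(*LISBOA, *SINTRA)
bearing = initial_bearing_deg(*LISBOA, *SINTRA)
```

Distance and bearing yield values on a continuous scale, while the containment test only answers yes or no.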
The typical steps involved in GIR
Anatomy of a Geographical IR System
[Architecture diagram. User interface: mapping, query disambiguation, query footprint. Broker: receives the search request plus query footprint and returns ranked results. Textual information resources go through geotagging, supported by an ontology (a.k.a. gazetteer), which produces document footprints. Spatial and text indexing build the spatial and textual indexes of the search engine. Relevance ranking turns unranked results into ranked results.]
Gazetteers / Geographic Ontology
• Database containing placenames, the spatial relationships among
them and the associated geographical footprints.
• Supports geo-referencing based on the place names found in text.
• Many problems in using traditional gazetteers for GIR.
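As a rough illustration of what such a database holds, the sketch below models places with names, a "part-of" relationship and a bounding-box footprint. The class names, identifiers and bounding boxes are hypothetical and approximate; real gazetteers use far richer schemas.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (min_lat, min_lon, max_lat, max_lon)

@dataclass
class Place:
    place_id: str
    name: str
    place_type: str
    footprint: BBox
    parent_id: Optional[str] = None          # "part-of" relationship
    alt_names: List[str] = field(default_factory=list)

class Gazetteer:
    def __init__(self) -> None:
        self.places: Dict[str, Place] = {}
        self.by_name: Dict[str, List[str]] = {}  # lowercased name -> place ids

    def add(self, place: Place) -> None:
        self.places[place.place_id] = place
        for name in [place.name, *place.alt_names]:
            self.by_name.setdefault(name.lower(), []).append(place.place_id)

    def lookup(self, name: str) -> List[Place]:
        """All places known under this name (there may be several)."""
        return [self.places[i] for i in self.by_name.get(name.lower(), [])]

    def ancestors(self, place_id: str) -> List[Place]:
        """Follow the part-of chain upwards."""
        chain = []
        current = self.places[place_id].parent_id
        while current is not None:
            chain.append(self.places[current])
            current = self.places[current].parent_id
        return chain

# Toy entries with rough, illustrative bounding boxes.
gaz = Gazetteer()
gaz.add(Place("pt", "Portugal", "country", (36.9, -9.6, 42.2, -6.1)))
gaz.add(Place("pt-lisboa", "Lisboa", "city", (38.69, -9.23, 38.80, -9.09),
              parent_id="pt", alt_names=["Lisbon"]))
```

Looking up "Lisbon" and "Lisboa" returns the same entry, which is how variant names and multiple languages can be handled in this simplified model.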
Roles of the Gazetteer in GIR
[Diagram: the gazetteer sits at the centre and supports (1) query disambiguation in the user interface, (2) metadata extraction and geo-tagging of the document collection, producing document footprints for the spatial index, (3) query expansion in the search component (query footprint), and (4) relevance ranking.]
Challenges to using Gazetteers in GIR
• To be useful in GIR the gazetteer should support
– Different locations and boundary changes, integrating data
from multiple sources.
– Synonymous and variant names with differing locations for
the same entity.
– Different relationships among concepts.
– Names in multiple languages.
– “Fuzzy” regions and intra-urban place names.
• More than gazetteers, we need an ontology!
Existing Gazetteer Systems/Services
• Alexandria Digital Library (ADL) Gazetteer.
– ~6 million entries.
– Has tried to standardize the format, description, and distribution of gazetteer data.
– Has a published, detailed schema.
– Basis for OGC standard.
• Geonames website.
– Integrates information from multiple sources.
– Publishes OWL ontology.
– ~6 million entries.
• EuroGeoNames project.
GeoTagging = GeoParsing+GeoCoding
• Geo-parsing
– Recognizing geographic references, ignoring non-geographic uses of
place terminology.
• Geo-coding
– Attaching a unique quantitative location (footprint) to the extracted
geographic references.
GeoParsing Textual Documents
• The presence of placenames can be recognised with the help
of gazetteers/geo-ontologies (i.e. lists of names)
• Some types of place references found in text:
– the name of the place: Coimbra
– an address: INESC-ID, Rua Alves Redol, 9, Lisboa
– an address fragment: “Manuel lived near Largo do Rato in Lisboa”
– a postcode / zip code: 2840-137
– a phone number: most Lisbon phone numbers start with +351 21
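A minimal geo-parsing sketch along these lines combines a gazetteer name list with regular expressions for postcodes and phone prefixes. The tiny name list, the function name `geoparse` and the exact patterns are assumptions made for illustration; a real system would use a full gazetteer and more robust patterns.

```python
import re

# Toy name list standing in for a gazetteer (illustrative only).
PLACE_NAMES = {"Coimbra", "Lisboa", "Largo do Rato", "Sintra"}

# Portuguese postcodes follow the NNNN-NNN layout, as in 2840-137.
POSTCODE_RE = re.compile(r"\b\d{4}-\d{3}\b")
# Lisbon numbers: +351 21 followed by seven more digits (spaces allowed).
PHONE_RE = re.compile(r"\+351\s?21(?:\s?\d){7}")

def geoparse(text, place_names):
    """Return (kind, matched_text, offset) tuples in order of appearance."""
    found = []
    if place_names:
        # Longest names first, so multi-word names win over shorter overlaps.
        names = sorted(place_names, key=len, reverse=True)
        name_re = re.compile(
            r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b",
            re.IGNORECASE,
        )
        found += [("placename", m.group(0), m.start())
                  for m in name_re.finditer(text)]
    found += [("postcode", m.group(0), m.start())
              for m in POSTCODE_RE.finditer(text)]
    found += [("phone", m.group(0), m.start())
              for m in PHONE_RE.finditer(text)]
    return sorted(found, key=lambda item: item[2])

refs = geoparse("Manuel lived near Largo do Rato in Lisboa, postcode 2840-137.",
                PLACE_NAMES)
```

This step only spots candidate references. It does not yet decide whether a match is a genuine place reference or which place it denotes.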
Ambiguity in GeoParsing Documents
Examples of false place references:
• Personal names: Smedes York, Jack London
• Business names: Dorchester Hotel, York Properties, ...
• Street names: Oxford Street, London Road, ...
• Common words: bath, battle, derby, over, well, ...
Approach for handling ambiguity:
– Look for patterns in the surrounding context.
– Assume one referent per discourse: a name repeated within a document refers to the same thing.
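One simple way to look for patterns in the surrounding context is to check the tokens next to a candidate name. The sketch below is a hand-written rule set under strong assumptions: the cue lists are tiny and hypothetical, and real systems typically learn such patterns or use much larger lexicons.

```python
# Hypothetical cue lists, kept deliberately small for illustration.
PERSON_CUES = {"mr", "mrs", "ms", "dr", "prof", "jack", "smedes"}
NON_PLACE_SUFFIXES = {"street", "road", "avenue", "hotel", "properties"}
PLACE_CUES = {"in", "near", "at", "from", "to"}

def classify_candidate(tokens, i):
    """Classify tokens[i], a gazetteer match, using its neighbours.

    Returns 'place', 'not_place' or 'unknown'.
    """
    token = tokens[i]
    # Lowercase matches such as "bath" or "well" are treated as common words.
    if not token[:1].isupper():
        return "not_place"
    prev_tok = tokens[i - 1].lower().rstrip(".") if i > 0 else ""
    next_tok = tokens[i + 1].lower().rstrip(".,") if i + 1 < len(tokens) else ""
    # "Dorchester Hotel", "Oxford Street": the name is part of a longer name.
    if next_tok in NON_PLACE_SUFFIXES:
        return "not_place"
    # "Jack London", "Dr. York": the name is part of a personal name.
    if prev_tok in PERSON_CUES:
        return "not_place"
    # "in London", "near Coimbra": a spatial preposition supports a place reading.
    if prev_tok in PLACE_CUES:
        return "place"
    return "unknown"
```

For example, `classify_candidate("She lives in London".split(), 3)` yields `'place'`, while the same call on `"Jack London wrote novels".split()` with index 1 yields `'not_place'`.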
GeoCoding place references in text
• Many different places share the same name (referent ambiguity): Newport, Cambridge, Springfield, Lisboa, ...
– Use context to decide: references to parent or nearby places.
– Choose the most important one: by population or place type.
• Optional step taken by some GIR approaches: finding a document’s encompassing geographic scope.
– Combine all place references given in the document.
– Use heuristics to guide the process.
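The two disambiguation heuristics and the optional scope step could be sketched as follows. The `Candidate` class, the function names, the population figures and the coordinates are all illustrative assumptions; the scope is computed here as a plain bounding box, which is only one of many possible heuristics.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    place_id: str
    name: str
    parent: str       # name of the containing region
    population: int   # made-up figures, not real statistics
    lat: float        # approximate coordinates
    lon: float

def resolve(name, candidates, context_names):
    """Pick one referent for `name`.

    1. Prefer candidates whose parent region is also mentioned in the text.
    2. Otherwise (or to break ties) take the most populous candidate.
    """
    options = [c for c in candidates if c.name.lower() == name.lower()]
    if not options:
        return None
    context = {n.lower() for n in context_names}
    supported = [c for c in options if c.parent.lower() in context]
    return max(supported or options, key=lambda c: c.population)

def document_scope(resolved):
    """Bounding box (min_lat, min_lon, max_lat, max_lon) over all references."""
    lats = [c.lat for c in resolved]
    lons = [c.lon for c in resolved]
    return (min(lats), min(lons), max(lats), max(lons))

CANDIDATES = [
    Candidate("cambridge-uk", "Cambridge", "England", 120, 52.2, 0.12),
    Candidate("cambridge-us", "Cambridge", "Massachusetts", 110, 42.37, -71.11),
]
by_context = resolve("Cambridge", CANDIDATES, ["Massachusetts", "Boston"])
by_default = resolve("Cambridge", CANDIDATES, [])
```

With "Massachusetts" in the context the US candidate is chosen; without any context the choice falls back to the larger toy population.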
Document Indexing for Geographic IR
• Different indexing strategies are possible:
– Index documents by gazetteer ids.
– Use document scopes to create document
footprints (point, bounding rectangle, ...) and use
footprints to index documents.
• Strategy for handling queries:
– Convert query to a query footprint/gazetteer id.
– Match query footprint to document footprints/ids.
– Rank documents according to “relevance”.
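The first step, converting a query to a query footprint, can be sketched for queries of the form "concept, relation, location". The relation list, the toy footprint table with round-number boxes and the fixed "near" margin are assumptions for illustration; directional relations are parsed but not turned into footprints here.

```python
import re

RELATIONS = ["near", "in", "north of", "south of", "east of", "west of"]

# Toy footprints as (min_lat, min_lon, max_lat, max_lon); values are made up.
FOOTPRINTS = {"lisboa": (38.0, -10.0, 39.0, -9.0)}

_REL_ALTERNATION = "|".join(
    re.escape(r) for r in sorted(RELATIONS, key=len, reverse=True)
)
_QUERY_RE = re.compile(
    r"^(?P<concept>.+?)\s+(?P<relation>" + _REL_ALTERNATION
    + r")\s+(?P<location>.+)$",
    re.IGNORECASE,
)

def parse_query(query):
    """Split '<concept> <relation> <location>'; None if no relation is found."""
    m = _QUERY_RE.match(query.strip())
    if m is None:
        return None
    return m.group("concept"), m.group("relation").lower(), m.group("location")

def query_footprint(relation, location, footprints, near_margin_deg=0.5):
    """Map a parsed query to a bounding box, or None if it cannot be mapped."""
    box = footprints.get(location.lower())
    if box is None:
        return None
    if relation == "in":
        return box
    if relation == "near":
        m = near_margin_deg  # crude buffer around the place's own box
        return (box[0] - m, box[1] - m, box[2] + m, box[3] + m)
    return None  # directional relations are not handled in this sketch

parsed = parse_query("restaurants near Lisboa")
footprint = query_footprint(parsed[1], parsed[2], FOOTPRINTS)
```

The resulting box can then be matched against document footprints, while the concept part goes to the text index.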
Handling queries in GIR systems
Data structures for indexing in GIR
• Typical strategy is to have separate indexes.
– Inverted index for text.
– R-tree for footprints.
• Access spatial index with query footprint/gazetteer id.
• Access text index with query terms.
• Merge results and find the intersection.
[Figure: an inverted index with posting lists (Term1: D1, D2, D23, …; Term2: D9, D11, D100, …; Term3: D27, D85, …) next to an R-tree with root R and nodes R1 to R4 grouping document footprints D1 to D16.]
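The separate-index strategy can be sketched in a few lines. Note the substitution: instead of an R-tree, the spatial side below is a linear scan over bounding boxes, which gives the same answers on small collections but none of the R-tree's efficiency. Document ids, texts and footprints are toy data, and all function names are assumptions.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boxes_intersect(a, b):
    """Boxes are (min_lat, min_lon, max_lat, max_lon)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def text_lookup(index, terms):
    """Documents containing all query terms."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

def spatial_lookup(footprints, query_box):
    """Stand-in for an R-tree query: scan every document footprint."""
    return {d for d, box in footprints.items() if boxes_intersect(box, query_box)}

def search(index, footprints, terms, query_box):
    """Query both indexes separately, then intersect the result sets."""
    return text_lookup(index, terms) & spatial_lookup(footprints, query_box)

DOCS = {
    "D1": "seafood restaurants",
    "D2": "restaurants and bars",
    "D3": "museum guide",
}
DOC_FOOTPRINTS = {
    "D1": (0.0, 0.0, 1.0, 1.0),
    "D2": (5.0, 5.0, 6.0, 6.0),
    "D3": (0.0, 0.0, 1.0, 1.0),
}
INDEX = build_inverted_index(DOCS)
hits = search(INDEX, DOC_FOOTPRINTS, ["restaurants"], (0.5, 0.5, 2.0, 2.0))
```

Here D2 matches the text but lies outside the query box, and D3 lies inside the box but does not match the text, so only D1 survives the intersection.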
Ranking search results in GIR
• Spatial similarity can indicate relevance
– Documents whose spatial content is more similar to the
spatial content of query should appear first.
• But we need to consider several signals:
– Thematic relevance: BM25, TF-IDF, ...
– Geographic relevance: proximity, containment, ... using both geometric (e.g. distance) and non-geometric (e.g. topology) measures.
– Other importance metrics: PageRank.
• State of the art consists of a linear combination of these scores.
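A linear combination of thematic and geographic relevance might look like the sketch below. It assumes text scores already normalized to [0, 1], uses an exponential distance decay as one possible geographic score, and treats the weight `alpha` and the decay scale as free parameters; none of these choices come from the original talk.

```python
import math

def geo_score(distance_km, scale_km=50.0):
    """Map distance to (0, 1]: 1.0 at zero distance, decaying exponentially."""
    return math.exp(-distance_km / scale_km)

def combined_score(text_score, geo, alpha=0.5):
    """Linear combination; alpha weights thematic against geographic relevance."""
    return alpha * text_score + (1.0 - alpha) * geo

def rank(results, alpha=0.5):
    """results: iterable of (doc_id, text_score in [0, 1], distance_km).

    Returns (doc_id, score) pairs, best first.
    """
    scored = [
        (doc_id, combined_score(text, geo_score(dist), alpha))
        for doc_id, text, dist in results
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

RESULTS = [("D1", 0.9, 200.0), ("D2", 0.6, 0.0), ("D3", 0.2, 10.0)]
balanced = [doc_id for doc_id, _ in rank(RESULTS, alpha=0.5)]
text_only = [doc_id for doc_id, _ in rank(RESULTS, alpha=1.0)]
```

With equal weights the nearby documents D2 and D3 overtake the textually stronger but distant D1; with `alpha=1.0` the ranking reduces to text relevance alone.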
Existing GIR systems : MetaCarta
The MetaCarta system
– Pioneer system addressing all aspects given in this talk.
– Conducts geo-parsing and geo-coding of text documents, and sends back possible location references with relative strength scores.
– Uses Natural Language Processing (NLP) to find possible location references.
– Contains a gazetteer of ~14 million entries.
Other GIR Systems : Research projects
• Prototype system from the SPIRIT EU project
– Spatially-aware information retrieval on the Internet.
– Geo-tagging of Web documents based on a geo-ontology.
• Alexandria Digital Library
– Digital library of geo-referenced materials.
– Focus on development of a large gazetteer.
• GREASE, GIPSY, Web-a-Where, GeoXWalk, ...
– Many more research projects addressing GIR aspects individually.
– GeoCLEF evaluation contest similar to TREC.
• Project DIGMAP under development at IST
– Digital library for old maps and historical cartography resources
– Indexing metadata records for geographic retrieval.
Current Challenges in Geographic IR
• Improve “conventional GIR” components and methods
– Geo-tagging, spatio-textual indexing and geo-relevance ranking.
– Improved understanding of spatial natural language terminology.
• Principled approaches for integration and evaluation of GIR.
– Better user interfaces for exploration of GIR results.
• Integration of geographical with temporal aspects.
– Everything we do happens in space and time!
• Creation of rich place ontologies with world-wide coverage.
– Fuzzy regions and intra-urban placenames present challenges.
– Open GeoInformation Web services and Geospatial Semantic Web.
Where To Find More Information
• Georeferencing: The Geographic Associations of Information
– By Linda L. Hill (Author), MIT Press
• Proceedings of the Workshops on Geographical IR
– Edited by Chris Jones and Ross Purves (4th edition in 2007, Lisbon)
• Talk to me using the email address [email protected]