Sidra: a Flexible Distributed Indexing
and Ranking Architecture for Web
Search
Miguel Costa, Mário J. Silva
Universidade de Lisboa, Faculdade de Ciências,
Departamento de Informática
XLDB Research Group
[email protected]
Portuguese Web
●
There is an identifiable community Web, that
we call the Portuguese Web
–
●
The web of the people directly related to
Portugal
This is NOT a small community web
–
–
–
10M population PT
3+ M users
4+ M pages
Tumba!
(Temos um Motor de Busca Alternativo!)
●
●
Public service
–
Community Web Search Engine
–
Web Archive
–
Research infrastructure
See it in action at http://tumba.pt
Statistics
●
●
●
Up to 20,000 queries/day
3,5 million documents under .PT – the
deepest crawl!
95% responses under 0.5 sec
Web
Presentation Engine
Ranking Engine
Indexing Engine
Repository
Crawlers
Tumba!
crawling+archiving
“.PT”
DNS Authority
Seed
URLs
Versus
(Meta-data
Repository)
User Input
Web
ViúvaNegra
(Crawling
Engine)
WebStore
(Contents
Repository)
Query Processing Architecture
(indexing phase)
Versus
(Meta-data
Repository)
WebStore
(Contents
Repository)
Index
DataStructs
Generator
Page
Attributes
(Authority)
Word
Index
SIDRA - Word Index Data Structure
• 2 files
Term  {docID}
<Term,docID>  {hit}
• Hit = position + attrib
• DocID assigned in
Static Rank order
Terms
documents ids
blue
2
5
dog
1
3
...
...
hits
Terms + sids
blue +
2
blue +
5
4
hit
hit
...
hit
hit
...
dog + 1
hit
hit
...
dog + 3
hit
hit
...
dog + 4
hit
hit
...
SIDRA – Index Range Partitioning
Host
Terms
Host
documents ids
2
5
dog
1
3
...
...
sea
1
10
xldb
7
9
...
hits
Terms + sids
...
hits
Terms + sids
hit
hit
...
hit
hit
...
dog + 1
hit
hit
...
dog + 3
hit
hit
...
dog + 4
hit
hit
...
sea + 2
sea +
10
xldb +
7
xldb +
9
xldb +
25
25
hit
hit
...
hit
hit
...
hit
hit
...
hit
hit
...
hit
hit
...
hits index
blue +
2
blue +
5
4
documents ids
docIds index
blue
Terms
SIDRA - Ranking Engine
Word
Word
Word
Index
Index
Index
Query
Server
Query
Broker
Page
Attributes
Clients
Clients
Matching & Ranking Algorithm
Phase 1: Query Matching
• QueryServers fetch
matching docIDs
(pre-sorted in static
ranking order)
• QueryBrokers merge
results using distributed
merge-sort algorithm
(preserves ranking order)
Phase 2: Ranking
• Pick N (1000) first results
from phase 1
• Compute final rank using
hits data
– Are terms also in title?
– What is the distance among
query terms in the page?
– Terms in Bold, Italic?
Architecture
Index design
●
●
●
horizontal/global partition ~ each
QueryServer contains all documents of a
criteria. e.g of a keywork
allow searches on different criteria in parallel
(partition parallelism)
Brokers merge results received in parallel as
they are being produced (pipelline
parallelism)
Addressing Multi-dimensionality
●
●
●
Generalization: page-rank (page importance
measure) isn´t but one of possible ranking contexts.
Query Servers may index data according to other
dimensions
–
time
–
Location
–
...
Query Brokers perform the results “fusion”
Flexiblity / Scalability
Word
Word
Index
Word
Index
Index
Query
Server
• User requests may be
balanced among multiple
Presentation Engines
• Contents may be
replicated
• Requests may be balanced
among multiple Query
Brokers
• Page Attributes may be
replicated
Query
Broker
Presentation
Engine
Page
Attributes
WebStore
(Contents
Repository)
• Query Brokers may
balance requests to
multiple Query Servers
• Multiple Query servers for
a Word Index
• Word indexes may be
replicated
Non-functional properties
●
●
load-balancing ~ components distribute
requests to multiple replicas (round-robin or
less loaded)
fault-tolerance ~ components can detect high
response times and redirect requests.
Results
●
●
●
With 1 QueryServer and 1 Broker responds
to workloads of 50 requests per second with
an average time of 0.779 seconds
With 2 QueryServers and 1 Brokerresponds
to workloads of 110 requests per second with
an average time of 0.871 seconds
Extensive discussion in upcoming
dissertation
Tumba!
●
Modest effort:
–
●
●
1 Prof., 4-5 graduate students, 4-5 servers for 2 years
Still beta!
–
Fault-tolerance will require substantially more hardware
(replication)
–
Periodic update willl demand more storage
–
Full-time operators?
Encouraging feedback
http://tumba.pt