Sidra: a Flexible Distributed Indexing and Ranking Architecture for Web Search Miguel Costa, Mário J. Silva Universidade de Lisboa, Faculdade de Ciências, Departamento de Informática XLDB Research Group [email protected] Portuguese Web ● There is an identifiable community Web, that we call the Portuguese Web – ● The web of the people directly related to Portugal This is NOT a small community web – – – 10M population PT 3+ M users 4+ M pages Tumba! (Temos um Motor de Busca Alternativo!) ● ● Public service – Community Web Search Engine – Web Archive – Research infrastructure See it in action at http://tumba.pt Statistics ● ● ● Up to 20,000 queries/day 3,5 million documents under .PT – the deepest crawl! 95% responses under 0.5 sec Web Presentation Engine Ranking Engine Indexing Engine Repository Crawlers Tumba! crawling+archiving “.PT” DNS Authority Seed URLs Versus (Meta-data Repository) User Input Web ViúvaNegra (Crawling Engine) WebStore (Contents Repository) Query Processing Architecture (indexing phase) Versus (Meta-data Repository) WebStore (Contents Repository) Index DataStructs Generator Page Attributes (Authority) Word Index SIDRA - Word Index Data Structure • 2 files Term {docID} <Term,docID> {hit} • Hit = position + attrib • DocID assigned in Static Rank order Terms documents ids blue 2 5 dog 1 3 ... ... hits Terms + sids blue + 2 blue + 5 4 hit hit ... hit hit ... dog + 1 hit hit ... dog + 3 hit hit ... dog + 4 hit hit ... SIDRA – Index Range Partitioning Host Terms Host documents ids 2 5 dog 1 3 ... ... sea 1 10 xldb 7 9 ... hits Terms + sids ... hits Terms + sids hit hit ... hit hit ... dog + 1 hit hit ... dog + 3 hit hit ... dog + 4 hit hit ... sea + 2 sea + 10 xldb + 7 xldb + 9 xldb + 25 25 hit hit ... hit hit ... hit hit ... hit hit ... hit hit ... hits index blue + 2 blue + 5 4 documents ids docIds index blue Terms SIDRA - Ranking Engine Word Word Word Index Index Index Query Server Query Broker Page Attributes Clients Clients Matching & Ranking Algorithm Phase 1: Query Matching • QueryServers fetch matching docIDs (pre-sorted in static ranking order) • QueryBrokers merge results using distributed merge-sort algorithm (preserves ranking order) Phase 2: Ranking • Pick N (1000) first results from phase 1 • Compute final rank using hits data – Are terms also in title? – What is the distance among query terms in the page? – Terms in Bold, Italic? Architecture Index design ● ● ● horizontal/global partition ~ each QueryServer contains all documents of a criteria. e.g of a keywork allow searches on different criteria in parallel (partition parallelism) Brokers merge results received in parallel as they are being produced (pipelline parallelism) Addressing Multi-dimensionality ● ● ● Generalization: page-rank (page importance measure) isn´t but one of possible ranking contexts. Query Servers may index data according to other dimensions – time – Location – ... Query Brokers perform the results “fusion” Flexiblity / Scalability Word Word Index Word Index Index Query Server • User requests may be balanced among multiple Presentation Engines • Contents may be replicated • Requests may be balanced among multiple Query Brokers • Page Attributes may be replicated Query Broker Presentation Engine Page Attributes WebStore (Contents Repository) • Query Brokers may balance requests to multiple Query Servers • Multiple Query servers for a Word Index • Word indexes may be replicated Non-functional properties ● ● load-balancing ~ components distribute requests to multiple replicas (round-robin or less loaded) fault-tolerance ~ components can detect high response times and redirect requests. Results ● ● ● With 1 QueryServer and 1 Broker responds to workloads of 50 requests per second with an average time of 0.779 seconds With 2 QueryServers and 1 Brokerresponds to workloads of 110 requests per second with an average time of 0.871 seconds Extensive discussion in upcoming dissertation Tumba! ● Modest effort: – ● ● 1 Prof., 4-5 graduate students, 4-5 servers for 2 years Still beta! – Fault-tolerance will require substantially more hardware (replication) – Periodic update willl demand more storage – Full-time operators? Encouraging feedback http://tumba.pt