Information Retrieval: History: Elementary IR: Scalable Boolean Text Search
Products traditionally separate
Originally, document management systems for libraries, government, law, etc.
Gained prominence in recent years due to web search
Still used for non-web document management (enterprise search)
IR vs. DBMS
Seem like very different beasts.
IR: imprecise semantics; keyword search; unstructured data format; read-mostly (docs added occasionally); page through top k results.
DBMS: precise semantics; SQL; structured data; expect a reasonable number of updates; generate the full answer.
Under the hood, not as different as they might seem.
But in practice, you have to choose between the two today.
Text Indexes
When IR folks say "text index", they usually mean more than what DB people mean.
In our terms, both "tables" and indexes: really a logical schema (i.e. tables) with a physical schema (i.e. indexes).
Usually not stored in a DBMS; tables are implemented as files in a file system.
This is often called an "inverted file" or "inverted index": it maps from words -> docs, rather than docs -> words.
The list of docIDs for a term is typically called a "postings list" by IR people. Fancy list compression on the docIDs is important, too.
An Inverted File
Snippets from: the old class web page and the old microsoft.com home page
Search for: databases microsoft

Term          docID
data          http://www-inst.eecs.berkeley.edu/~cs186
database      http://www-inst.eecs.berkeley.edu/~cs186
date          http://www-inst.eecs.berkeley.edu/~cs186
day           http://www-inst.eecs.berkeley.edu/~cs186
dbms          http://www-inst.eecs.berkeley.edu/~cs186
decision      http://www-inst.eecs.berkeley.edu/~cs186
demonstrate   http://www-inst.eecs.berkeley.edu/~cs186
description   http://www-inst.eecs.berkeley.edu/~cs186
design        http://www-inst.eecs.berkeley.edu/~cs186
desire        http://www-inst.eecs.berkeley.edu/~cs186
developer     http://www.microsoft.com
differ        http://www-inst.eecs.berkeley.edu/~cs186
disability    http://www.microsoft.com
discussion    http://www-inst.eecs.berkeley.edu/~cs186
division      http://www-inst.eecs.berkeley.edu/~cs186
do            http://www-inst.eecs.berkeley.edu/~cs186
document      http://www-inst.eecs.berkeley.edu/~cs186
document      http://www.microsoft.com

Term          docID
microsoft     http://www.microsoft.com
microsoft     http://www-inst.eecs.berkeley.edu/~cs186
midnight      http://www-inst.eecs.berkeley.edu/~cs186
midterm       http://www-inst.eecs.berkeley.edu/~cs186
minibase      http://www-inst.eecs.berkeley.edu/~cs186
million       http://www.microsoft.com
monday        http://www.microsoft.com
more          http://www.microsoft.com
most          http://www-inst.eecs.berkeley.edu/~cs186
ms            http://www-inst.eecs.berkeley.edu/~cs186
msn           http://www.microsoft.com
must          http://www-inst.eecs.berkeley.edu/~cs186
necessary     http://www-inst.eecs.berkeley.edu/~cs186
need          http://www-inst.eecs.berkeley.edu/~cs186
network       http://www.microsoft.com
new           http://www-inst.eecs.berkeley.edu/~cs186
new           http://www.microsoft.com
news          http://www.microsoft.com
newsgroup     http://www-inst.eecs.berkeley.edu/~cs186
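To make the words -> docs mapping concrete, here is a minimal Python sketch of building an inverted file in memory. The tokenizer, the integer docIDs, and the three sample documents are illustrative assumptions, not part of the slides; stemming and stop-words are ignored.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on non-letters (no stemming, no stop-word removal)."""
    return re.findall(r"[a-z]+", text.lower())

def build_inverted_index(docs):
    """Map each term to its postings list: a sorted list of docIDs.

    `docs` maps docID -> text; both are hypothetical sample inputs.
    """
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    # Keep each postings list sorted so later merges can run in one pass.
    return {term: sorted(ids) for term, ids in postings.items()}

docs = {
    1: "Database design and midterm discussion",
    2: "Microsoft news for the developer",
    3: "New database document from Microsoft",
}
index = build_inverted_index(docs)
```

Looking up `index["database"]` then returns the postings list for that term, which is the only access path a Boolean keyword query needs.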
Usually not allowed!
Optimizations:
What order to handle terms if you have many ANDs?
Can you do better than merge?
How does this interact with postings list compression?
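One possible answer to the term-ordering question, sketched in Python: intersect sorted postings lists with a merge, starting from the shortest list so intermediate results stay small. The function names and the sample index are assumptions made for illustration.

```python
def intersect(a, b):
    """Merge-intersect two sorted docID lists in O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def boolean_and(index, terms):
    """AND several terms, handling the shortest postings lists first."""
    lists = sorted((index.get(t, []) for t in terms), key=len)
    if not lists:
        return []
    result = list(lists[0])
    for postings in lists[1:]:
        if not result:
            break  # an empty intermediate result cannot grow again
        result = intersect(result, postings)
    return result

# Hypothetical postings lists keyed by term.
sample_index = {
    "database": [1, 3, 5, 8, 13],
    "microsoft": [3, 8, 21],
    "news": [8],
}
hits = boolean_and(sample_index, ["database", "microsoft", "news"])
```

Ordering by list length is a simple stand-in for selectivity estimation: the rarest term bounds the size of every later intermediate result.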
Can do a similar thing for "term1" NEAR "term2": positions < k off.
Think about a refinement to merge-join.
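A rough Python sketch of the NEAR refinement, assuming each postings entry also records word positions. The docID -> sorted-positions layout is hypothetical, and "near" is taken here to mean a distance of at most k, which is one reading of the slide.

```python
def near(pos_a, pos_b, k):
    """Return docIDs where some occurrence of term A lies within k
    positions of some occurrence of term B.

    pos_a and pos_b map docID -> sorted list of word positions
    (an assumed layout for positional postings).
    """
    hits = []
    # Only documents containing both terms can qualify (the AND step).
    for doc_id in sorted(pos_a.keys() & pos_b.keys()):
        a, b = pos_a[doc_id], pos_b[doc_id]
        i = j = 0
        # Two-pointer scan over both sorted position lists.
        while i < len(a) and j < len(b):
            if abs(a[i] - b[j]) <= k:
                hits.append(doc_id)
                break
            if a[i] < b[j]:
                i += 1
            else:
                j += 1
    return hits

# Hypothetical positional postings for two terms.
pos_term1 = {1: [2, 40], 2: [5]}
pos_term2 = {1: [43], 2: [100], 3: [1]}
close_docs = near(pos_term1, pos_term2, 3)
```

The inner loop is the same merge step as for AND, applied to positions instead of docIDs, which is why it can be viewed as a refinement of merge-join.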
Files(docID int, docID string, snippet string, )
Btree on InvertedFile.term
Btree on Docs.docID
Requires a final join step between typical query result and Files.docID
Can do this lazily: cursor to generate a page full of results
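The join and the lazy cursor can be sketched with Python's built-in sqlite3 module. The table layouts below (InvertedFile(term, docID) and Docs(docID, url, snippet)), the example.org URLs, and the function name search_pages are assumptions for illustration; SQLite is used only because it ships with Python and stores its indexes as B-trees.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE InvertedFile (term TEXT, docID INTEGER);
    CREATE TABLE Docs (docID INTEGER PRIMARY KEY, url TEXT, snippet TEXT);
    CREATE INDEX inv_term ON InvertedFile(term, docID);
""")
conn.executemany("INSERT INTO Docs VALUES (?, ?, ?)", [
    (1, "http://example.org/a", "database course page"),
    (2, "http://example.org/b", "microsoft home page"),
    (3, "http://example.org/c", "microsoft database news"),
])
conn.executemany("INSERT INTO InvertedFile VALUES (?, ?)", [
    ("database", 1), ("database", 3),
    ("microsoft", 2), ("microsoft", 3), ("news", 3),
])

def search_pages(conn, term1, term2, page_size=10):
    """Yield pages of (docID, url, snippet) for term1 AND term2.

    The self-join on InvertedFile performs the Boolean AND; the join
    to Docs is the final step that fetches what the user actually sees.
    """
    cur = conn.execute("""
        SELECT d.docID, d.url, d.snippet
        FROM InvertedFile a
        JOIN InvertedFile b ON a.docID = b.docID
        JOIN Docs d ON d.docID = a.docID
        WHERE a.term = ? AND b.term = ?
        ORDER BY d.docID
    """, (term1, term2))
    while True:
        page = cur.fetchmany(page_size)  # pull only one page at a time
        if not page:
            return
        yield page

pages = search_pages(conn, "database", "microsoft", page_size=1)
first_page = next(pages)
```

Because the generator only calls `fetchmany` when the caller asks for another page, a user who stops after the first page never pays for the rest of the result.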
So no concurrency control problems
Can compress to search-friendly, update-unfriendly format
Can keep postings lists sorted
For these reasons, text search engines and DBMSs are usually separate products
Also, text-search engines tune that one SQL query to death! The benefits of a special-case workload.
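As one example of a search-friendly, update-unfriendly format, the Python sketch below stores a sorted postings list as gaps between consecutive docIDs, each packed into a variable number of bytes. Gap encoding with variable-byte codes is a common choice, used here as an assumption; the slides do not say which compression scheme they have in mind.

```python
def encode_postings(doc_ids):
    """Delta-encode sorted docIDs, then pack each gap into 7-bit groups.

    The high bit of a byte is set when more bytes of the same gap follow.
    """
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        gap = doc_id - prev
        prev = doc_id
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)

def decode_postings(data):
    """Invert encode_postings by scanning the bytes left to right."""
    doc_ids = []
    prev = 0
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7          # this gap continues in the next byte
        else:
            prev += gap         # gap complete: rebuild the docID
            doc_ids.append(prev)
            gap = 0
            shift = 0
    return doc_ids

ids = [3, 8, 300, 301, 100000]   # hypothetical sorted postings list
packed = encode_postings(ids)
```

Decoding is a sequential scan, which suits merge-style query processing, but inserting a docID in the middle means re-encoding everything after it. That is the update-unfriendly side of the trade.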
How to deal with synonyms, misspelling, abbreviations?
How to write a good web crawler?
We'll return to some of these later.
The book Managing Gigabytes covers some of the details.
[Figure: diagram with components labeled Search Engine, Simple DBMS, DBMS, and DB]
Data Modeling & Query Complexity
DBMS supports any schema & queries
But requires you to define schema
And SQL is hard to figure out for the average citizen
Summary
IR & relational systems share basic building blocks for scalability
  IR internal representation is relational!
  Equality indexes (B-trees)
  Iterators
  Join algorithms, esp. merge-join
  Join ordering and selectivity estimation
IR constrains queries, schema, promises on semantics
  Affects storage format, indexing and concurrency control
  Affects join algorithms & selectivity estimation
IR has different performance goals
  Ranking and best answers fast
Many challenges in IR related to "text engineering"
  But don't tend to change the scalability infrastructure
Exercise!
Implement Boolean search, as described, in Postgres, using the schemas and indexes here.
Write a simple script to load files. You can ignore stemming and stop-words.
Compare to the Postgres tsearch facility. Two index choices: GIN and GiST. GIN is an inverted index.
Use the cost models for IndexScan and MergeJoin to calculate the expected number of I/Os. Distinguish sequential and random I/Os.
Why is the naive solution slow? Storage overhead? Optimizer smarts?