Search

part of Software Engineering for Internet Applications by Eve Andersson, Philip Greenspun, and Andrew Grumet; revised February 2005
Recall from the "Planning" chapter our principles of sustainable online community:
  1. magnet content authored by experts
  2. means of collaboration
  3. powerful facilities for browsing and searching both magnet content and contributed content
  4. means of delegation of moderation
  5. means of identifying members who are imposing an undue burden on the community and ways of changing their behavior and/or excluding them from the community without them realizing it
  6. means of software extension by community members themselves
A sustainable online community is one that can accommodate new users. If Joe Novice, via browsing and searching, cannot find existing content relevant to his needs, he will ask questions that will annoy other community members: "Didn't you search the archives?" "Haven't you read the FAQ?" Long-term community members, instead of being stimulated by discussion of new and interesting topics, find their membership a tiresome burden of directing new users to pages that they "should" have been able to find on their own.

A community's first line of defense is high-quality information architecture and navigation, as discussed at the end of the "Content Management" chapter. Users are better at browsing than at formulating search queries. A community's second line of defense, however, is a superb full-text search facility. The search database must include both publisher-authored and user-contributed content, and the scope of users' queries will vary widely.

On a large site a user might wish to restrict the search in some way. If the search form is at the top of a document that is a chapter of an online book, it might make sense to offer "whole site" and "within the chapters of this book" options. If the publisher or the other users have gone to the trouble of rating content, the default search might limit results to those documents that have been rated of high quality. If there are multiple discussion forums on the site, each of which is essentially a self-contained subcommunity, the search boxes on those pages might offer a "restrict searching to postings in this forum" option. If a user hasn't visited the site for a month and wants to see if there is anything new and relevant, the site should perhaps offer a "restrict searching to content added within the last 30 days" option.

What's Wrong with SQL (Search Quality)

The relational database management system (RDBMS) sounds like the perfect tool for this job. We have a lot of data and we want to provide a lot of flexibility in querying. Suppose a person comes to a site for athletes and types "running" into the search form. The site sends the following SQL query to the database:
select * 
from content
where body like '%' || :user_query || '%'
which, by the time the bind variable :user_query is substituted, turns into
select * 
from content
where body like '%running%'
In Oracle this won't pick up a row whose body contains the same word with different capitalization. Instead we do
select * 
from content
where upper(body) like upper('%running%')
What if the user typed multiple words? The query
select * 
from content
where upper(body) like upper('%running shoes%')
would not pick up a message that contained the phrase "shoes for running". Instead we'll need to AND together multiple conditions in the WHERE clause:
select * 
from content
where upper(body) like upper('%running%')
and upper(body) like upper('%shoes%')
This AND clause isn't quite right. If there are lots of documents that contain both "running" and "shoes", these are the ones that we'd like to see. However, if there aren't any rows with all query terms, we should probably offer the user rows that contain some of the query terms. We might need to use OR, a scoring function, and an ORDER BY so that the rows containing both query terms are returned first. If we insist on the AND clause, we've created a situation in which the more the user tells us about her interests the fewer documents we'll return in response to a search, eventually returning "0 results found" if she keeps adding words. (Note that public search engines circa 2005, such as Google, Yahoo, A9, and MSN, do implicitly use AND and do return 0 results if a user keeps adding words to a query and there aren't any documents in the database that contain each and every one of those words.)
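Here is a sketch of what such a scoring query might look like, assuming the application has already split the user's string into individual terms; the CASE-based score is our own illustration, not a standard idiom:

select content_id, one_line_summary,
       (case when upper(body) like upper('%running%') then 1 else 0 end
        + case when upper(body) like upper('%shoes%') then 1 else 0 end) as score
from content
where upper(body) like upper('%running%')
   or upper(body) like upper('%shoes%')
order by score desc;

Rows containing both terms sort first, followed by rows containing only one, and the user never sees "0 results found" unless no term matches at all.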

There are some deeper problems with the Caveperson SQL Programmer approach to full-text search. Suppose that a message contains the phrase "My brother-in-law Billy Bob ran 20 miles yesterday" but not the word "running". Or a message contains the phrase "My cousin Gertrude runs 15 miles every day". These should be returned as relevant to the query "running", but the LIKE clause won't do the job. What is needed is a system for stemming both the query terms and the indexed terms: "running", "runs", and "ran" would all be bashed down to the stem word "run" for indexing and retrieval.

What about a message saying "I attended the 100th anniversary Boston Marathon"? The LIKE query won't pick that up. What is needed is a system for expanding queries through a thesaurus powerful enough to make the connection between "running" and "marathon".

What's Wrong with SQL (Performance)

Let's return to the simplest possible LIKE query:
select * 
from content
where body like '%running%'
The RDBMS must examine every row in the content table to answer this query, i.e., must perform a sequential table scan (O[N] time, where N is the number of rows in the table). Suppose that a standard RDBMS index is defined on the body column. The values of body will be used as keys for a B-tree and we could perform
select * 
from content
where body = 'running'
and maybe, depending on the implementation,
select * 
from content
where body like 'running%'
in O[log N] time. But the user's interest isn't restricted to documents whose only word is "running" or documents that begin with the word "running". The user wants documents in which the word "running" may be buried. A single B-tree index is not going to help.

Abandoning the RDBMS

We can solve both the performance and search quality problems by dumping all of our data into a full-text search system. As the name implies, these systems index every word in a document, not just the first words as with the standard RDBMS B-tree. A full-text index can answer the question "Find me the documents containing the word 'running'" in time that approaches O[1], i.e., an amount of time that does not vary with the size of the corpus indexed. If there are 10 million documents in the corpus, a search through those 10 million documents will not take much longer than a search through a corpus of 1000 documents. (Getting close to constant time in this situation would require that the 10-million-document collection did not use a larger vocabulary than the 1000-document collection and that it was not the case that, say, 90 percent of the documents contained the word "running".)

How does it work? Like every other indexing strategy: extra work at insertion time is traded for less work at query time. Consider constructing a big table of every word in the English language next to the database keys of those documents that contain the word:

Word            Document IDs
absquatulate    612
bedizen         36, 9211
cryptogenic     9
dactylioglyph   7214
exheredate      57, 812, 4010
feuilleton      87, 349, 1203
genetotrophic   5000
hartebeest      710
inspissate      549, 21, 3987
...
samoyed         17, 91, 1000, 3492
sesquipedalian  723
the             1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
uberous         6, 800
velutinous      45, 2307
widdershins     7300
xenial          3611
ypsiliform      5607
zibeline        4782
If we build this as a hash table, we have O[1] access to a row in the table. If we merely keep the rows in sorted order, we have O[log W] access to any row in the table, where W is the number of words in our vocabulary. Performance does not vary with the number of documents in the collection... or does it? Just about every English document will contain the word "the" and therefore simply returning the value of the document_ids column for the word "the" will take O[N] time, where N is the number of documents in the corpus. This row isn't useful anyway because it isn't selective, i.e., we could get the same information almost as fast with a sequential scan of the documents table, collecting all the document IDs. While indexing a document, a full-text search system will refer to a list of stopwords, words that are too common to be worth indexing. For standard English, the stopword list includes such words as "a", "and", "as", "at", "for", "or", "the", etc.

Inserting a new document into the collection will be slow. We'll have to go through the document, word by word, and update as many rows in the index as there are distinct words in the document. But that extra work at insertion time pays off in a reduction in query time from O[N] to O[1].

Given a data structure of the preceding form, we can quickly find all documents containing the word "running". We can also quickly find all documents containing the word "shoes". We can intersect these result sets quickly, giving us the documents that contain both "running" and "shoes". With some fancier indexing data structures we can restrict our search to documents that contain the contiguous phrase "running shoes" as opposed to documents where those words appear separately. But suppose that there are 1000 documents in the collection containing these two words. Which are the most relevant to the user's query of "running shoes"?
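To make the data structure concrete, here is a minimal sketch of the index as an SQL table; the word_index name is our own invention, and real full-text systems use far more compact structures, but the intersection logic is the same:

create table word_index (
	word		varchar(100),
	document_id	integer,
	primary key (word, document_id)
);

-- documents containing both "running" and "shoes"
select r.document_id
from word_index r, word_index s
where r.word = 'running'
  and s.word = 'shoes'
  and r.document_id = s.document_id;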

We need a new data structure: the word-frequency histogram. This will tell us which words occur in a document and how frequently they occur in a way that is easily adjusted for the total length of a document.

Here's a word-frequency histogram for the first sentence of Tolstoy's Anna Karenina:

Word       Count   Frequency
all        1       1/16
another    1       1/16
but        1       1/16
each       1       1/16
families   1       1/16
family     1       1/16
happy      1       1/16
in         1       1/16
is         1       1/16
its        1       1/16
one        1       1/16
own        1       1/16
resemble   1       1/16
unhappy    2       2/16
way        1       1/16

One might argue that this sentence makes better literature as "All happy families resemble one another, but each unhappy family is unhappy in its own way," but the full-text search software finds it more useful in this form.

After the crude histogram is made, it is typically adjusted for the prevalence of words in standard English. So, for example, the appearance of "resemble" is more interesting than "happy" because "resemble" occurs less frequently in standard English. Stopwords such as "is" are thrown away altogether. Stemming is another useful refinement. In the index and in queries we convert all words to their stems. The stem word for "families", for example, is "family". With stemming, a query for "families" would match a document containing "family" and vice versa.

Given a body of histograms it is possible to answer queries such as "Show me documents that are similar to this one" or "Show me documents whose histogram is closest to a user-entered string." The inter-document similarity query can be handled by comparing histograms already stored in the text database. The search string "platinum mines in New Zealand" might be processed first by throwing away the stopwords "in" and "new". By using histogram comparison, the software would deliver articles that have the most occurrences of "platinum", "mines", and "Zealand". Suppose that "Zealand" is a rarer word than "platinum". Then a document with one occurrence of "Zealand" is favored over one with one occurrence of "platinum". A document with one occurrence of each word is preferred to an article where only one of those words shows up. A document that contains only the words "platinum mines Zealand" is a better match than a document that contains 100,000 words, three of which happen to match the query terms.

The power of this kind of system is enticing and raises the question "Can we run our entire Web application from a specialized full-text search database system?" Indeed, why not chuck the RDBMS altogether?

We don't chuck the RDBMS, because we adopted it in the first place to handle the problem of concurrency: two users trying to update the same item simultaneously. A better query tool is nice, but we can't adopt it as our primary database management system unless it handles concurrency as well as the RDBMS does.

A pragmatic approach starts by keeping all the documents in the RDBMS: articles, user comments, discussion forum postings, etc. Either once per night or every time a new document is added, update the full-text search system's collection. Pages that are part of the standard user experience and workflow operate from the RDBMS. The search box at the upper right corner of every page, however, queries against the full-text search system. Let's call this a split-system design.
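If the content table carries a modified_date column, as in the schema shown later in this chapter, the maintenance script might find the rows to (re)index with a query along these lines (:last_sync_time is a hypothetical bind variable holding the time of the previous run):

select content_id, one_line_summary, body
from content
where modified_date > :last_sync_time;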


Figure 12.1: A split-system approach to providing full-text search. The application's content is stored in a relational database management system. Scripts periodically maintain a second copy in a specialized text database. The Web server program performs queries, inserts, and updates to the RDBMS. When a user requests a full-text search, however, the query is sent to the text database.

One argument against the split-system approach is that two copies of the document collection are being kept. In an age of $200 disk drives of absurdly high capacity, this isn't a powerful argument. It is nearly impossible to fill a modern disk drive with words typed by humans. One can fill up a disk drive with video or audio streams, but not text. And in any case some full-text search systems can build an index to a document collection without themselves keeping the original document around, i.e., you would in fact have only one copy of the document in the RDBMS.

A second argument against using RDBMS and full-text search systems simultaneously is that the collections will get out of sync. If the Web server crashes in the middle of an RDBMS transaction, all work is rolled back. If the Web server was simultaneously inserting a document into a full-text search system, it is possible that the full-text database will contain a document that is not in fact available on the main pages of the site—the site being generated from the RDBMS. Alternatively, the RDBMS insert might succeed while the full-text insert fails, leading to a document that is available on the site, but not searchable. This argument, too, ultimately lacks power. It is true that the RDBMS is a convenient and nearly foolproof means of managing transactions and concurrency. However, it is not the only way. If one were to hire sufficiently careful programmers and sufficiently dedicated system and database administrators, it would be possible to keep two databases in sync.

A third argument against the split system is the disparity of interfaces. Suppose that our RDBMS is Oracle. The Web developers know how to talk to Oracle through Active Server Pages. The desktop programmers know how to talk to Oracle through the C API. The marketing people know how to talk to Oracle through various reporting tools. Some individual users have figured out how to talk to Oracle from standard desktop programs such as Microsoft Excel and Microsoft Access. The cost of bringing in a new programmer grows if you have to teach that person not only about an RDBMS, but also about specialized tools, each with its own library of interfaces.

However, the best argument against using both an RDBMS and a "bag-on-the-side" full-text search system is that the split system does not naturally support queries that combine full-text relevance with the structured data held in the RDBMS, for example "show me opinions on running shoes contributed by users who registered in the last 30 days", the query developed in the next section.

Augmenting the RDBMS

Consider a full-text indexing system. It needs a way of writing stuff down (the index data structures) and typically chooses the operating system file system. It needs a way of performing computation in a procedural computer language, typically C circa 2004.

Consider a modern relational database management system. It offers a way of writing stuff down: CREATE TABLE and INSERT. It offers a way of executing software written in a procedural language: C, Java, or PL/SQL in the case of Oracle; any .NET-supported computer language in the case of Microsoft SQL Server.

Why couldn't one build a full-text search indexer inside the RDBMS? That's exactly what some of the commercial RDBMS vendors have done. Oracle was a pioneer in this area and the relevant Oracle product is called "Oracle Text".

create table content (
	content_id		integer primary key,
	refers_to		integer references content_raw,
	-- who contributed this and when
	creation_user		integer not null references users,
	creation_date		date not null,
	modified_date		date not null,
	mime_type		varchar(100) not null,
	one_line_summary	varchar(200) not null,
	body			clob,
	editorial_status	varchar(30) 
          check (editorial_status in ('submitted','rejected','approved','expired'))
);

-- create an Oracle Text index (the product used to be called
-- Oracle Context, hence the CTX prefixes on many procedures)

create index content_text 
on content(body) 
indextype is ctxsys.context;

-- let's look at opinions on running shoes from 
-- users who registered in the last 30 days, sorting
-- results in order of decreasing relevance

select 
  score(1), 
  content.content_id, 
  content.one_line_summary, 
  users.first_names,
  users.last_name
from content, users
where contains(body, 'running shoes', 1) > 0
and users.registration_date > current_timestamp - interval '30' day
and content.creation_user = users.user_id
order by score(1) desc;
In the preceding example, Oracle Text builds its own index on the body column of the content table. When a Text index is defined on a table it becomes possible to use the contains operator in a WHERE clause. The Oracle RDBMS SQL query processor is smart enough to know how to use the Text index to answer this query without doing a sequential table scan. It is possible to have more than one call to contains in the same query, so the last argument of contains is an integer labeling the call, in this case "1". A relevance score can be pulled out in the select list or in an ORDER BY clause with the function score, whose argument identifies the contains call from which the score should come.
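For instance, the following sketch scores two separately labeled contains calls on the same column and sorts by their sum (the equal weighting is our own arbitrary choice):

select 
  content_id, 
  one_line_summary,
  score(1) + score(2) as total_score
from content
where contains(body, 'running', 1) > 0
  and contains(body, 'shoes', 2) > 0
order by total_score desc;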

Oracle Text is one of the more difficult and complex Oracle RDBMS products to use. For example, if you want to be able to search for a phrase that occurs in either the one_line_summary or body and combine the relevance score, you need to build a multi-column index:

-- ctx_ddl is a PL/SQL package, so the calls go inside a PL/SQL block
begin
  ctx_ddl.create_preference('content_multi', 'MULTI_COLUMN_DATASTORE');
  ctx_ddl.set_attribute('content_multi', 'COLUMNS', 'one_line_summary, body');
end;
/

-- drop the earlier single-column index of the same name first,
-- since index names must be unique
create index content_text 
on content(modified_date) 
indextype is ctxsys.context
parameters('datastore content_multi');
Notice that the index is declared on the column modified_date, but that column's values are not what get indexed; the call to ctx_ddl.set_attribute setting the COLUMNS attribute is what determines which columns supply the text.
For an example of a system that tackles the challenge of indexing text from disparate Oracle tables, see http://philip.greenspun.com/seia/examples-search/site-wide-search

Oracle Text also has the property that its default search mode is exact phrase matching. A user who types "zippy pinhead" into a search engine will expect to find documents that contain the phrase "Zippy the Pinhead". This won't happen if your script passes the raw user query straight through to the contains operator. More problematic is what happens when a user types a query string containing characters that Oracle Text treats specially: the SQL query can raise an error, and the user sees "Server Error 500", unless you catch the error in your procedural script. It would be nice if Oracle Text had a built-in procedure called "ProcessRawQueryFromWebForm" or something, but it doesn't, or at least we couldn't find one in the documentation for Oracle version 10g. The next best thing is a procedure called pavtranslate, available from http://technet.oracle.com/sample_code/products/text/htdocs/query_syntax_translators/query_syntax_translators.html.
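A cruder defense, if exact-phrase semantics are acceptable, is to wrap the entire user string in braces, which Oracle Text treats as an escape that renders its special characters literal:

-- the braces keep characters such as & and | in :user_query
-- from being interpreted as Oracle Text operators
select content_id, one_line_summary
from content
where contains(body, '{' || :user_query || '}', 1) > 0
order by score(1) desc;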

Oracle Text, via the "INSO filters" option, has the capability to index a remarkable variety of documents in a BLOB column. For example, the software can recognize a Microsoft Excel spreadsheet, pull the text out and add it to the index. At the same time it is smart enough to know when to ignore a document entirely, e.g., if the BLOB column were filled with a JPEG photograph.

Exercise 1: Expected Queries

Ask your client what kinds of queries he or she expects to be most common in your community. For example, in a site for academics, it might be very important to type in a person's name and get all of the publications authored by that person. In a site for shoppers, it might be essential to query for a brand name and get back product reviews. Only your client can say authoritatively.

Exercise 2: Document Your Design

Place a document at /doc/search in which you describe your team's plan for providing full-text search over the content on your site. If your content management system has left you with a mixed bag of stuff in the file system and stuff in the RDBMS, explain how you're going to synchronize and unify these documents in one full-text index. If nightly maintenance scripts are required, document them here.

Include your client's answers to Exercise 1 in this document.

Exercise 3: Build the Basic Search Module

Build a basic search module that provides, at a minimum, a page where users can type a full-text query and a results page that links to matching content, with the most relevant results first.

Exercise 4: Big Brother

Generally users prefer to browse rather than search. If users are resorting to searches in order to get standard answers or perform common tasks, there may be something wrong with a site's navigation or information architecture. If users are performing searches and getting zero results back from your full-text search facility, either your index or the site's content needs augmentation.

Record user search strings in an RDBMS table and let admins see what the popular search terms are (by the day, week, or month). Make sure to highlight any searches that resulted in the user seeing a "No documents matched your query" page. Ask yourself whether it would be ethical to implement a facility whereby the site administrators could view a report of search strings and the users who typed them in.
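Here is a minimal sketch of such a logging table and report; the table and column names are our own invention:

create table query_log (
	query_string	varchar(4000) not null,
	query_date	date not null,
	result_count	integer not null
);

-- most popular search terms over the last 7 days;
-- add "and result_count = 0" to see only failed searches
select query_string, count(*) as n_queries
from query_log
where query_date > sysdate - 7
group by query_string
order by n_queries desc;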

Update your /doc/search file to reflect the addition of this facility.

Exercise 5: Linkage

Find logical places among your community's pages to link to the search facility. For example, on many sites it will make sense to have a quick search box in the upper-right corner of every page served. On most sites, it makes sense to link back to search from the search results page with a "search again" box filled in by default with the original query.

Make sure that your main documentation page links to the docs for this new module.

Working with the Public Search Engines

If your online community is on the public Internet you probably would like to see your content indexed by public search engines such as Google (www.google.com). First, Google has to know about your server. This happens either when someone already in the Google index links to your site or when you manually add your URL from a form off the google.com home page. Second, Google has to be able to read the text on your server. At least as of 2005 none of the public search engines implemented optical character recognition (OCR). This means that text embedded in a GIF, Flash animation, or a Java applet won't be indexed. It might be readable by a human user with perfect eyesight, but it won't be readable by the computer programs that crawl the Web to build databases for public search engines. Third, Google has to be able to get into all the pages on your server. If you've been requiring registration to view discussions, for example, those discussions won't be indexed by Google unless your software is smart enough to recognize that it is Google behind the request and make an exception. How to recognize Google? Here's a one-line snippet from the philip.greenspun.com access log (newlines inserted for readability):

66.249.71.53 - - [10/Feb/2005:02:13:15 -0500]
"GET /sql/triggers.html HTTP/1.0" 200 0 ""
"Googlebot/2.1 (+http://www.google.com/bot.html)"
Notice the user-agent header at the end: Googlebot/2.1, with its included suggestion that Web publishers check http://www.google.com/bot.html for more information. Because some search engines archive what they index, you would not want to provide registration-free access to content that is truly private to members. In theory a <META NAME="ROBOTS" CONTENT="NOARCHIVE"> placed in the HEAD of your HTML documents would prevent search engines from archiving the page, but robots are not guaranteed to follow such directives.

Some search engines allow you to provide indexing hints and hints for presentation once a user is looking at a search results page. For example, in the table of contents page for this book, we have the following META tags in the HEAD:

<meta name="keywords" content="web development 
online communities MIT 6.171 textbook">

<meta name="description" content="This is the textbook for the MIT
course Software Enginering for Internet Applications">
The "keywords" tag adds some words that are relevant to the document, but not present in the visible text. This would help someone who decided to search for "MIT 6.171 textbook", for example. The "description" tag can be used by a search engine when summarizing a page. If it isn't present, a search engine may show the first 20 words on the page or follow some heuristics to build a reasonable summary. These tags have been routinely abused. A publisher might add popular search terms such as "sex" to a site that is unrelated to those terms, in hopes of capturing more readers. A company might add the names of its competitors as keywords. Users wouldn't see these dirty tricks unless they went to the trouble of using the View Source command in their browser. Because of this history of abuse, many public search engines ignore these tags.
See http://searchenginewatch.com/resources/metasuits.html for accounts of various lawsuits that have been fought over the contents of meta tags.
A particularly destructive practice is "cloaking", in which a Web server is programmed to send entirely different pages to the search engines and to human users (identified by having "Mozilla" or "MSIE" in their user-agent headers). An unscrupulous publisher would find out what the current most popular search terms were (http://searchenginewatch.com/facts/searches.html offers a list of windows into various search services), string those terms together, and serve that mishmash to search engines. Meanwhile, when a regular user came to the site, the page presented would be a banal product pitch. Google threatens to ban from its index any site that engages in this practice.

The /robots.txt File

Suppose that you don't want the public search engines indexing anything underneath the /staging/ directory on your server. This content isn't exactly secret, but neither do you want it released before its time. Nor do you want two copies of the same content in the Google index, one copy in the staging area and one copy in its final position on the site.

You need to read the Standard for Web Exclusion, a protocol for communication between Web publishers and Web crawlers, available from http://www.robotstxt.org/wc/norobots.html. You, the publisher, put a file on your site, accessible at /robots.txt, with instructions for robots. Here's an example that excludes the staging directory:

User-agent: *
# let's keep the robots away from our half-baked stuff
Disallow: /staging
The User-agent line specifies for which robots the injunctions are intended. Each Disallow asks a robot not to look in a particular directory. Nothing requires a robot to observe these injunctions, but the standard seems to have been adopted by all the major indices nonetheless.
Visit http://www.ibm.com/robots.txt to get a bit of insight into how a site may evolve over time.

Exercise 6: robots.txt

Place a file on your server at /robots.txt that excludes robots from appropriate portions of your server. Put some comments at the top of the file explaining who created this, when it was created, and the rationale behind the exclusions.
If you're doing a 100 percent database-backed content management system, you are free to put the content of the robots.txt file in the RDBMS, just so long as it is served when the URI /robots.txt is requested.

Exercise 7: Client Signoff

Review the search facility, both user and admin pages, with your client. Write down your client's reaction to this new module, paying particular attention to any new ideas that the client might have for what will be typical queries on the site.

The Future

As an online community grows older and larger it becomes ever more likely that a user will be overwhelmed with "100,000 documents matched your query". When a community is new and small, it is possible to search for an answer merely by reading the titles of everything on the site, i.e., by browsing. As a community grows, therefore, information retrieval tools become ever more important. The exercises in this chapter focus on answering a user's query by presenting links to relevant documents. Suppose that we build a search facility that always returns the single most relevant document in the corpus. Is that an optimal solution? Only if you believe that users like to read.

Suppose that Joe User visits photo.net and types "At what shutter speeds is a tripod required?" into the search box. Is it reasonable to assume that Joe wants to read a 10,000-word document that contains the answer to this question? Or would Joe rather get ... the answer to his question. The answer "at shutter speeds slower than 1/lens-focal-length" is a lot smaller and quicker to read than a document containing this information.

To get a feel for how a question answering system can be built on top of a full-text indexer, read "Scaling Question Answering to the Web" (Cody Kwok, Oren Etzioni, Dan Weld; WWW10 conference, May 2001; http://www.www10.org/cdrom/papers/pdf/p120.pdf), which describes a system built at the University of Washington. This system includes all of the expected linguistic gymnastics plus code to sort out the Internet-specific problem of noise. Traditional information retrieval systems are designed to work with authoritative documents, e.g., the Encyclopedia Britannica, a binder of corporate policies, or the design notes for a jetliner. The documents in the corpus are presumed to be authoritative. There won't be four different answers, three of them flat wrong, to questions such as "In what year was Gioacchino Rossini born?", "How many signatures are required for a purchase of $57,300?", or "How wide is the wingspan of the airplane?" With user-authored content in an online community, however, it seems safe to assume that while the average answer is likely to be correct, for every 100 correct answers there will be at least three or four incorrect ones. Even when the data require no interpretation, there will be typos. For example, a Google search for "rossini 1792-1868" returned 50,900 documents in February 2005; a search for "rossini 1792-1869" returned 43 documents. A question-answering system built on top of lightly moderated user-authored content will have to exercise the same sort of judgment as do humans: How many documents contain Answer A versus Answer B? What is the relative authority of conflicting documents? Which of two conflicting documents is more recent?

Mobile Internet devices put an even greater stress on information retrieval. Connection speeds are slower. Screens are smaller. It isn't practical for a user to drill down into 20 documents returned by a search engine as possibly relevant to a query, especially if the user is driving a car and using a voice browser.

If you want to emerge as a hero from the dust of the next Internet collapse, work on information retrieval.


Time and Motion

The two client interviews, at the beginning of the exercises and again at the end, should each take under an hour.

The search design and documentation should be a team effort and take one to two hours.

The luckiest teams will be able to get their search systems up and running in an hour. Unlucky teams using difficult-to-install search systems may require the better part of a day. Teams with a single content table and no static HTML pages should be able to build the basic page scripts in one to two hours. Additional time will be required for designs that manage content across multiple tables and the filesystem.

The remaining exercises should be doable in 2 to 4 programmer-hours.



eve@eveandersson.com, philg@mit.edu, aegrumet@mit.edu