Enterprise search is like the weather: everyone complains about it, but no one does anything about it.
What follows is a high-level overview of the mechanics behind successful search.
If you’re a search manager, it will help you understand what goes on deep in the search index to help you create a search experience your users will love. Yes, love.
If you’re a frustrated search user, forward it to your search managers.
And if you’re a happy search user, you will know what your search team does to keep you happy.
On the Back End: Creating Your Index
Most search technology involves a two-stage process: creating an index of the content in each document, and then providing a way to retrieve relevant content based on terms found in the indexed documents.
Allowing some flexibility in how we define the key terms (index, document, relevant and terms), that description covers just about every commercial and open source search platform.
Of course, some vendors (and other zealots) may differ. Let me clarify some of these terms in the hope of covering all search technologies.
To create the index, the software extracts content from every document into what we can think of as two interrelated files.
Building the Foundation for Binary Search
The first file essentially lists every word, or "term," in every document, typically sorting them alphabetically. This allows for fast retrieval in what’s known as a binary search.
The word list also records which documents contain each term, and may record where the term appears within each document. That position data is how some platforms highlight search terms when showing results. It also drives proximity search, which some platforms use for relevancy calculations.
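To make the word list concrete, here is a minimal Python sketch. It assumes whitespace tokenization and tiny in-memory documents; the function names (build_word_list, lookup, within_distance) are invented for illustration and do not come from any particular search platform. The sorted term list stands in for the alphabetized file, and the positions stored with each term are what make a proximity check possible.

```python
from bisect import bisect_left
from collections import defaultdict

def build_word_list(documents):
    """Return a sorted term list and postings: term -> {doc_id: [positions]}."""
    postings = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            postings[word].setdefault(doc_id, []).append(position)
    terms = sorted(postings)  # alphabetical order is what allows binary search
    return terms, dict(postings)

def lookup(terms, postings, term):
    """Binary-search the sorted term list; return that term's postings or {}."""
    i = bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return postings[term]
    return {}

def within_distance(postings_a, postings_b, max_gap):
    """Return doc ids where the two terms occur within max_gap word positions."""
    hits = []
    for doc_id in postings_a.keys() & postings_b.keys():
        if any(abs(a - b) <= max_gap
               for a in postings_a[doc_id]
               for b in postings_b[doc_id]):
            hits.append(doc_id)
    return sorted(hits)

# Hypothetical two-document collection, for illustration only.
documents = {
    "doc1": "enterprise search needs a good index",
    "doc2": "a good search index makes search fast",
}
terms, postings = build_word_list(documents)
search_hits = lookup(terms, postings, "search")
near_hits = within_distance(lookup(terms, postings, "search"),
                            lookup(terms, postings, "index"), 1)
```

Here search_hits maps each matching document to the word positions of "search", and near_hits lists only the documents where "search" and "index" sit next to each other. A production index would keep this structure on disk and compress it, but the shape is the same.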
Some platforms store synonyms in the word list, but more often that's done at search time to minimize index size and maximize search performance.
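As a rough illustration of search-time synonyms, the Python sketch below expands a query term before it is looked up in the word list, so the index itself stores nothing extra. The SYNONYMS table and the expand_query function are hypothetical; a real platform would typically load its synonyms from configuration.

```python
# Hypothetical synonym table, for illustration only.
SYNONYMS = {
    "car": ["automobile", "auto"],
    "laptop": ["notebook"],
}

def expand_query(term, synonyms=SYNONYMS):
    """Return the original term followed by its synonyms, without duplicates."""
    term = term.lower()
    expanded = [term]
    for alternative in synonyms.get(term, []):
        if alternative not in expanded:
            expanded.append(alternative)
    return expanded

expanded_terms = expand_query("Car")
```

Each term in expanded_terms would then be looked up in the word list and the resulting document lists merged. The index stays small, at the cost of a little extra work per query.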
Gathering the Fielded Data
The second file created during the indexing process maintains the fielded data such as title, author and other document metadata.
Picture this file as a database table: one row per document, and multiple columns for each row.
The columns include metadata for each document such as a title, author, file date, access security, and any other content that may be extracted from the document during indexing. Only extract and store fields that you want to display in the search result list, and fields users need to filter results, such as “country:France.”
Some search platforms also generate a static "document summary" to display in the result list, and that will typically be stored in the second file as well.
In reality, search requires only one column: the "record key." This contains the file name, URL, database record or other key that allows the system to retrieve the document. The key is used both during indexing and when a user clicks on a result link after a search.
Note: you don’t have to extract and store every field a document contains, but include any that will support the results you want displayed.
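The "one row per document" picture can be sketched in Python as follows. This is an assumption-laden illustration, not how any specific platform stores its fields: the DocumentRecord class, its column names, the example URLs and the filter_by_field helper are all invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class DocumentRecord:
    record_key: str                                     # the one required column
    title: str = ""
    author: str = ""
    file_date: str = ""
    allowed_groups: set = field(default_factory=set)    # access security
    fields: dict = field(default_factory=dict)          # extra filterable metadata

def filter_by_field(table, name, value):
    """Return record keys of documents whose metadata field equals value."""
    return [record.record_key
            for record in table.values()
            if record.fields.get(name) == value]

# Hypothetical rows; keys and URLs are placeholders.
table = {
    "doc1": DocumentRecord("https://intranet.example/docs/1",
                           title="Paris office guide",
                           fields={"country": "France"}),
    "doc2": DocumentRecord("https://intranet.example/docs/2",
                           title="Berlin office guide",
                           fields={"country": "Germany"}),
}
france_keys = filter_by_field(table, "country", "France")
```

A filter such as "country:France" then becomes a simple match against one column, and the record key it returns is what the system uses to retrieve the document.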
On the Front End: Time to Query
With your index created, you’re ready to search.
When a user types in a query, the search application accesses the word list to fetch the list of all documents that contain the term. Where documents carry access restrictions, it may have to filter that list so users see only the results they are permitted to view. The search application then uses the various index fields to generate any facets and the result list, and displays the first screen of results.
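Those query-time steps can be sketched in a few lines of Python. The data below is hypothetical, and so are the names WORD_LIST, FIELDS and run_query; security is modeled as a simple overlap between a user's groups and each document's groups, which is only one of several ways platforms handle it.

```python
from collections import Counter

# Hypothetical word list: term -> ids of documents containing it.
WORD_LIST = {
    "search": ["doc1", "doc2", "doc3"],
    "index": ["doc1", "doc3"],
}

# Hypothetical fielded data: one row per document.
FIELDS = {
    "doc1": {"title": "Index basics", "country": "France", "groups": {"staff"}},
    "doc2": {"title": "Search tuning", "country": "Germany", "groups": {"admins"}},
    "doc3": {"title": "Search at scale", "country": "France",
             "groups": {"staff", "admins"}},
}

def run_query(term, user_groups, page_size=10):
    """Return (titles for the first screen, facet counts) for one query term."""
    # 1. Fetch every document that contains the term.
    doc_ids = WORD_LIST.get(term.lower(), [])
    # 2. Keep only documents the user is allowed to see.
    visible = [d for d in doc_ids if FIELDS[d]["groups"] & user_groups]
    # 3. Build facet counts from a stored field.
    facets = Counter(FIELDS[d]["country"] for d in visible)
    # 4. Assemble the first screen of results.
    first_page = [FIELDS[d]["title"] for d in visible[:page_size]]
    return first_page, dict(facets)

titles, facets = run_query("search", {"staff"})
```

Note that the facet counts are computed after the security filter, so a user never sees a count that hints at documents they cannot open.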
Why Should You Care?
Knowing what goes on under the covers helps you understand how search works. And understanding how search works enables you to recognize potential issues when your search behaves in unexpected ways.
The quality of any search is only as good as its index. So the next time your search platform behaves in unexpected ways, you may know where to look to solve the problem.