Enterprise search has undergone many changes over the years, but it still requires a level of human effort to get it right.
Remember Topic Trees?
Search on an intranet used to be a way to find administrative information: vacation schedules, the lunch menu, policies and procedures, and more. As intranets expanded in scope, they became repositories of useful (and sometimes legally required) information, such as Equal Employment Opportunity information and other regulatory content.
Contracts, agreements and even work for clients started appearing in corporate intranets as other parts of the organization began to recognize them as great places to store content. This introduced security issues which administrators handled by restricting access to financial and sales data on a "need to know" basis.
But with these increases in intranets' scope, search became more of a challenge. Overcoming the generic relevance results most platforms delivered often required specific and complex syntax. Some companies, like Verity, attempted to improve relevance by allowing subject matter experts to create complex "topic trees," essentially synonym trees which included the ability to apply weights to any portion of the tree.
Topic trees were far better than simple single-word synonym lists, but building a detailed and complete one took a great deal of time and a specific vocabulary.
Whoever built the topic tree also needed a detailed understanding of the content and of user needs, and, needless to say, very few companies made that effort.
End users were dissatisfied with the Verity approach: they wanted an easier and faster way to get great search, and simple synonym lists didn’t solve the problem.
Enter Internet Search Engines
A funny thing happened as enterprise search vendors and open source search platforms worked to improve their relevancy: the internet. People saw and loved how easy search was on sites like Alta Vista, Yahoo and Google.
What enterprise users did not recognize is that on the internet, with billions of users and millions of documents (thousands of them nearly identical), relevance gave way to popularity. And with millions of users searching and “voting” for the best answer by page views, it was relatively easy to recommend popular content “voted in” by “people like you.”
I can't tell you how many times enterprise customers have asked, “Why don’t we just use Google for our internal search?” Unfortunately, this is impossible for two reasons: 1) the nature of intranet content and 2) the lack of scale due to the lower number of users.
Add to this the fact that intranets don’t contain large numbers of nearly identical documents: a given client has only one associated proposal, and only one document describes the company health plan or vacation policy.
The Answer? Use Context
Context can also be called metadata. This includes the words that make up the document, manual or intranet web page.
But additional valuable metadata often goes unused. Every Office document, text file and web page has it — if you set your search up to index and use it.
In Office documents, think of the "Properties" fields: Author, Location, Directory, dates and more. Text files and database records have similar, if fewer, properties. And with modern search technologies, you can use entity extraction to pull out names and locations and associate that additional context with the document as it is processed into the search index.
A caveat: take the metadata in user-created content with a big grain of salt. I’d wager that searching for “Document 1” will return a large number of results on your desktop, if not on your intranet. We’ve also seen cases where the person who created the original template continues to be identified as “the author” long after he or she has left the company.
Only you can judge how good your intranet metadata is — select the fields to search wisely.
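One hedged way to act on that caveat is to screen metadata before it reaches the index. The sketch below is an illustration only: the function name `usable_metadata`, the placeholder-title pattern and the idea of a list of former template authors are all assumptions, not features of any particular search product.

```python
import re

# Hypothetical pattern for default titles such as "Document 1" or "Untitled".
# Extend it with whatever placeholders your own content actually contains.
PLACEHOLDER_TITLE = re.compile(
    r"^(document|untitled|presentation|book)\s*\d*$", re.IGNORECASE
)


def usable_metadata(props, former_template_authors=frozenset()):
    """Return only the metadata fields that look trustworthy enough to index.

    props: dict of raw document properties, e.g. {"title": ..., "author": ...}
    former_template_authors: lowercase names known to be stale template authors
    """
    cleaned = {}

    title = (props.get("title") or "").strip()
    if title and not PLACEHOLDER_TITLE.match(title):
        cleaned["title"] = title

    author = (props.get("author") or "").strip()
    if author and author.lower() not in former_template_authors:
        cleaned["author"] = author

    return cleaned
```

Fields that fail the check are simply left out, so the search platform falls back on the document text rather than ranking on a misleading title or author.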
Pipelines: The Way to Extract Context
For search to succeed, your indexing needs to be great.
The FAST ESP product was one of the first I remember offering an index pipeline and query pipeline, but an increasing number of commercial and open-source-based products are also supporting these capabilities.
Pipelines provide a built-in and supported capability to enrich and extract customized data prior to indexing and to pre-process the user query to enrich it. The concept of the indexing pipeline is that, once a document is identified for indexing and converted into a stream of text, the administrator can define tasks for the search platform to perform before sending the text to the search indexer.
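To make the idea concrete, here is a minimal sketch of an indexing pipeline as a chain of small functions applied to a document before it is handed to the indexer. It does not reflect the API of FAST ESP or any other product; the stage names and the dictionary-based document are assumptions chosen for illustration.

```python
def normalize_whitespace(doc):
    """Stage: collapse runs of whitespace in the extracted text."""
    doc = dict(doc)  # copy so earlier versions of the document are untouched
    doc["text"] = " ".join(str(doc["text"]).split())
    return doc


def add_word_count(doc):
    """Stage: record a simple derived field alongside the text."""
    doc = dict(doc)
    doc["word_count"] = len(str(doc["text"]).split())
    return doc


def run_index_pipeline(doc, stages):
    """Apply each stage in order; the result is what the indexer would receive."""
    for stage in stages:
        doc = stage(doc)
    return doc


if __name__ == "__main__":
    raw = {"text": "  Vacation   policy \n 2024 "}
    print(run_index_pipeline(raw, [normalize_whitespace, add_word_count]))
```

Because each stage takes a document and returns a document, an administrator can add, remove or reorder steps without touching the others.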
Most commercial platforms automatically extract Office fields, like Author and Title. But other technologies, whether included with your product or available from the open source community, take this a step further.
The technology I find most helpful here is "entity extraction." Depending on the source of your chosen tool, you can relatively easily recognize useful data. For example, you can recognize names of people mentioned in the document, companies, cities, products, competitors — just about anything you can put in a list or extract using regular expressions.
These all improve the search quality by letting you assign names to a names field, products to a products field, and much more.
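As a rough sketch of the list-and-regular-expression approach described above, the following stage fills a products field from a known list and a names field from a simple pattern. The product names, the honorific-based pattern and the function name `extract_entities` are hypothetical; real entity extraction tools are considerably more capable.

```python
import re

# Hypothetical product list; in practice this would come from your own catalog.
KNOWN_PRODUCTS = ["Acme Router", "Acme Switch"]

# Deliberately simple pattern: an honorific followed by one or two capitalized words.
PERSON_PATTERN = re.compile(
    r"\b(?:Mr|Ms|Dr)\.\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)?)"
)


def extract_entities(text):
    """Return entities found in text, keyed by the index field they belong to."""
    lowered = text.lower()
    # List lookup: a case-insensitive substring match against known products.
    products = [p for p in KNOWN_PRODUCTS if p.lower() in lowered]
    # Regex lookup: capture the name that follows an honorific.
    names = PERSON_PATTERN.findall(text)
    return {"products": products, "names": names}


if __name__ == "__main__":
    sample = "Dr. Jane Smith reviewed the Acme Router proposal with Mr. Lee."
    print(extract_entities(sample))
```

Storing the results in separate fields is what lets the search platform weight, filter or facet on them independently of the body text.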
Query pipelines provide a way to enhance and expand the user query prior to passing it to the search platform. Again, this is an opportunity to enrich the query or change its scope, for example by adding a query restriction, limiting the search to a particular index or providing additional features. This is a great place to add user context to the query. For example, if I recognize the user as a field employee, I can boost the content most useful to employees in the same region or job function.
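A query pipeline can be sketched the same way: each stage receives the query plus what is known about the user and returns an adjusted query. Everything here is assumed for illustration, including the query dictionary layout, the `finance_access` flag and the `exclude_index` filter name; a real platform would express restrictions and boosts in its own query syntax.

```python
from copy import deepcopy


def restrict_by_access(query, user):
    """Stage: add a query restriction for users without finance access."""
    q = deepcopy(query)
    if not user.get("finance_access", False):
        q["filters"]["exclude_index"] = "finance"
    return q


def boost_user_context(query, user):
    """Stage: boost content matching the user's region and job function."""
    q = deepcopy(query)
    for field in ("region", "job_function"):
        if user.get(field):
            q["boosts"][field] = user[field]
    return q


def run_query_pipeline(text, user, stages):
    """Build a query from the user's text and pass it through each stage."""
    query = {"text": text, "filters": {}, "boosts": {}}
    for stage in stages:
        query = stage(query, user)
    return query


if __name__ == "__main__":
    field_employee = {"region": "EMEA", "job_function": "field"}
    stages = [restrict_by_access, boost_user_context]
    print(run_query_pipeline("travel policy", field_employee, stages))
```

The user's original text is left untouched; the pipeline only attaches restrictions and boosts that the search platform can apply when it runs the query.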
Enterprise Search Is Looking Up
The quality of “out of the box” search has been improving for years. One day, we may all search by conversing with HAL 9000. But for now, enterprise search still requires ongoing attention to detail to make it feel as effortless as Google.