Sinequa's recent announcement of the availability of its enterprise search application on Azure focused heavily on the ability of cloud services to cope with the challenges of building and maintaining a search index. In this column I want to outline how a search index is created using a content processing pipeline, and touch on a few index management issues.

The Role of an Inverted File

The underlying technology of search is referred to as an inverted file. This is a list, not a database. The easiest way to think of it is as the index of a book. Every content item is given a number and then broken into pieces called tokens, which are not the same as words. The key to compiling an index is the content processing pipeline. Natural language processing (NLP) plays a key role here: extracting entities, disambiguating personal names and normalizing dates are just a few examples. PDF documents, PowerPoint presentations and tables/Excel spreadsheets all need careful curation, as do metadata tags.
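To make this concrete, here is a minimal sketch of such a pipeline in Python. It is illustrative only: the function names are my own invention, not any vendor's API, and a production pipeline would add format converters, entity extraction, language detection and much more.

```python
import re

def normalize_dates(text: str) -> str:
    # Illustrative normalization: rewrite US-style MM/DD/YYYY dates as ISO 8601.
    return re.sub(
        r"\b(\d{2})/(\d{2})/(\d{4})\b",
        lambda m: f"{m.group(3)}-{m.group(1)}-{m.group(2)}",
        text,
    )

def tokenize(text: str) -> list[str]:
    # Tokens are not the same as words: "O'Brien", "3.5" and "U.S."
    # each need their own rules. This pattern is deliberately naive.
    return re.findall(r"[A-Za-z0-9']+", text.lower())

def process(doc_id: int, text: str) -> tuple[int, list[str]]:
    # Every content item gets a number, then is reduced to tokens.
    return doc_id, tokenize(normalize_dates(text))

print(process(1, "Report filed 03/15/2024 by O'Brien"))
# (1, ['report', 'filed', '2024', '03', '15', 'by', "o'brien"])
```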

Time now for a big warning: The pipeline has no way of differentiating between high-quality and low-quality content.

Each list entry is referred to as a posting, and the data that points back to the document is called a pointer. Not only are tokens indexed but so, too, are their positions, so that people can search for phrases. The index also holds all the access permissions. The index is rich enough that result snippets can be generated directly from it, adapted to the specific query. Worth noting here is that deleting a content item only deletes the pointers: the complete content item can still be reconstructed from the index. It is also worth highlighting that semantic knowledge graphs are built on top of an inverted index.
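A toy example makes the posting, pointer and position vocabulary easier to picture. The sketch below, which assumes pre-tokenized input, builds a positional inverted index and uses the positions to answer a phrase query; a real index would also carry access permissions and compressed postings.

```python
from collections import defaultdict

# Maps token -> list of postings; each posting pairs a doc_id (the
# pointer back to the content item) with the token's positions in it.
index: dict[str, list[tuple[int, list[int]]]] = defaultdict(list)

def add_document(doc_id: int, tokens: list[str]) -> None:
    positions: dict[str, list[int]] = defaultdict(list)
    for pos, token in enumerate(tokens):
        positions[token].append(pos)
    for token, where in positions.items():
        index[token].append((doc_id, where))

def phrase_search(phrase: list[str]) -> set[int]:
    # A document matches when the tokens occur at consecutive positions.
    per_token = [dict(index.get(tok, [])) for tok in phrase]
    candidates = set(per_token[0]).intersection(*per_token[1:])
    hits = set()
    for doc_id in candidates:
        for start in per_token[0][doc_id]:
            if all(start + i in per_token[i][doc_id]
                   for i in range(1, len(phrase))):
                hits.add(doc_id)
                break
    return hits

add_document(1, ["enterprise", "search", "index"])
add_document(2, ["search", "the", "enterprise"])
print(phrase_search(["enterprise", "search"]))  # {1}
```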

None of this even touches the challenging implications of federated search across multiple applications, nor the added complexity of multilingual search.

This is not a new technology. Alfonso Cardenas optimized the concept while working at the San Jose research laboratories of IBM in the mid-1970s, so every search application you use is based on technology that is now almost 50 years old. If you want to understand more about inverted files, take a look at the research paper by Justin Zobel and Alistair Moffat, "Inverted Files for Text Search Engines."

Related Article: Searching for Information in the Tower of Babel

Challenges to Scalability

In the early days of search, the index might be 50% to 80% of the size of the content collection. Storage might be cheap, but that is a significant overhead!

As a result, a substantial amount of work has gone into how the index can be compressed. The flip side of compression is that the index has to be decompressed at query time and again when results are presented. The same issue arises with encryption. The arrival of neural retrieval, with its ability to resolve vocabulary mismatch ("gas" in the US vis-à-vis "petrol" in the UK), puts further pressure on index management, which has led to numerous discussions around sparse vs. dense index approaches. Hybrid models are also in play, where the occurrence of phrases is indexed separately rather than constructed on the fly. However, it seems unlikely that the core inverted file model will be discarded any time soon given the installed base: introducing a new index structure would require a complete reindex.
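To illustrate the trade-off, the sketch below compresses a postings list by storing the gaps between sorted document numbers and encoding each gap in a variable number of bytes. This is a generic textbook technique, not any particular vendor's scheme, and the decompress step makes the query-time cost visible.

```python
def vbyte_encode(numbers: list[int]) -> bytes:
    # Each number is split into 7-bit chunks; the high bit marks the final byte.
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk[0] |= 0x80
        out.extend(reversed(chunk))
    return bytes(out)

def vbyte_decode(data: bytes) -> list[int]:
    numbers, n = [], 0
    for byte in data:
        n = (n << 7) | (byte & 0x7F)
        if byte & 0x80:           # final byte of this number
            numbers.append(n)
            n = 0
    return numbers

def compress(doc_ids: list[int]) -> bytes:
    # Store gaps between sorted doc ids; small gaps fit in a single byte.
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vbyte_encode(gaps)

def decompress(data: bytes) -> list[int]:
    # The flip side: every query has to undo the encoding.
    doc_ids, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        doc_ids.append(total)
    return doc_ids

postings = [3, 7, 154, 160, 161]
blob = compress(postings)
assert decompress(blob) == postings
print(f"{len(blob)} bytes for {len(postings)} doc ids")  # 6 bytes for 5 doc ids
```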

Related Article: AI and Enterprise Search: Who's in Control?

Building the Search Index

Building the initial index is no time to be around a search team. Not so very long ago it might have taken a couple of weeks to index a reasonably sized document collection. Halfway through, a rogue document might bring the process to a halt, or a post-index check might show that some stop words had not been correctly recognized. Issues such as these might require starting all over again.

The duration and fragility of this process inhibit search vendors from making changes to the index structure that would require a full, or even a partial, reindex.

Another index creation challenge arises when a company has been acquired and an integrated index needs to be created. Not only does the incoming content need to be indexed according to the acquiring company’s schema (including all of the security permissions), but this index then needs to be integrated with the existing index. Working out how best to add new content to an index is hard work. It is not unusual for employees to be unable to find a document they wrote in the last few days because it sits in a collection that is only reindexed on a weekly basis.
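A hypothetical sketch of that freshness gap: if queries hit only a main index that is rebuilt weekly, a small delta index of recently added documents has to be searched alongside it, or this week's documents simply never appear.

```python
# Hypothetical doc ids and term; the two-tier layout is the point.
main_index = {"budget": {101, 102}}   # rebuilt on a weekly basis
delta_index = {"budget": {250}}       # documents added since the last rebuild

def search(term: str, include_delta: bool = True) -> set[int]:
    results = set(main_index.get(term, set()))
    if include_delta:
        # Without this union, document 250 is invisible until the next rebuild.
        results |= delta_index.get(term, set())
    return results

print(search("budget", include_delta=False))  # {101, 102}
print(search("budget"))                       # {101, 102, 250}
```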

The advent of cloud services makes these issues more tractable, without a doubt. But the cloud doesn't necessarily make them go away. The computing power may be available on a more flexible basis, but make sure you have a very good understanding of just what the multipliers and commitment levels are in your contract. Inverted file indexes may come with some rather specific pricing schedules.

Related Article: Not So Open Any More: Elasticsearch Relicensing and Implications for Open Source Search

Be Prepared!

If the NLP elements of the content processing pipeline do not create an index that will support the NLP requirements of query management, it does not matter how much AI the vendor is promising. Make sure you understand every element of the pipeline for your current and future search requirements and what the implications could be of any missing or poorly supported elements. Your search application has to be built on rock, not sand.
