When starting with a new client, I first request a spreadsheet of query terms ranked by frequency of use over a six-month period, with related terms then clustered together. For example, some employees may search for "expenses" to track down the application while others search for "concur," the name of a widely used expense management application. To make this task linguistically more challenging, Kosten, Auslagen, Spesen and Aufwendungen are all potential German-language terms for "expenses."
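As a rough illustration of the ranking and clustering step, here is a minimal Python sketch. The `CLUSTERS` mapping, the function name `rank_clusters` and the sample log are all hypothetical; in practice the mapping would have to be curated by hand for each client, and only a few of the terms mentioned above are included.

```python
from collections import Counter

# Hypothetical synonym map: each variant points to one cluster label.
# A real mapping would be built by hand from the client's own query log.
CLUSTERS = {
    "expenses": "expenses",
    "concur": "expenses",
    "kosten": "expenses",
    "auslagen": "expenses",
    "spesen": "expenses",
}


def rank_clusters(queries):
    """Return (cluster, count) pairs sorted by descending count."""
    counts = Counter()
    for query in queries:
        term = query.strip().lower()
        # Terms without a mapping form their own single-term cluster.
        counts[CLUSTERS.get(term, term)] += 1
    return counts.most_common()


# Made-up sample log, for illustration only.
sample_log = ["Concur", "expenses", "Spesen", "holiday policy", "concur", "org chart"]
ranked_clusters = rank_clusters(sample_log)
```

Clustering before ranking matters because, left separate, each variant would sit lower in the list than the combined need it represents.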

[Figure: Zipf curve of query frequency against query rank]

The plot always ends up as shown in the diagram above. The shape of the curve is not an artifact of technology, but of linguistics. It is an example of Zipf's Law, named after the linguist George Kingsley Zipf (1902-1950), who first proposed it. Zipf's Law states that, given a large sample of words used, the frequency of any word is inversely proportional to its rank in the frequency table. One of the core constructs of a text search application is the term frequency.inverse document frequency (tf.idf) measure. The shape of the curve can provide a great deal of useful information if you know where to look for it. This column is a gentle introduction.
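In its simplest form, the inverse proportionality can be written as f(r) ≈ C / r^s, where r is the rank, C is a constant and the exponent s is close to 1. One way to check how closely a ranked list follows this shape is to fit a straight line to the logarithms of rank and frequency. The sketch below does that with the standard library only; the function name `zipf_exponent` and the synthetic counts are assumptions for illustration, not part of any search product.

```python
import math


def zipf_exponent(frequencies):
    """Estimate s in f(r) ~ C / r**s by least squares on log-log values.

    `frequencies` must hold at least two positive counts; they are sorted
    into rank order here.
    """
    ordered = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(freq) for freq in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope_num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope_den = sum((x - mean_x) ** 2 for x in xs)
    # The log-log slope is -s, so flip the sign.
    return -slope_num / slope_den


# Synthetic counts that follow f(r) = 1200 / r, for illustration only.
ideal_counts = [1200 / rank for rank in range(1, 51)]
exponent = zipf_exponent(ideal_counts)
```

For the synthetic data the fitted exponent comes out at approximately 1. Real query logs will deviate, and how far they deviate at the top of the ranking is part of what the areas discussed in this column explain.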

Area A: Application Location

Many people are surprised to hear that the most frequent queries are to find applications and information on how to complete a task, such as filing expense claims. The top 100 searches in a multinational corporation are often related to application location. This is not a good use of search, as a query for "concur" may return a substantial number of results that do not provide a direct connection to the application. Search is being used because tracking down the applications on the corporate intranet has become a nightmare, especially in larger companies where the home page is just full of news.

When reporting on the use of search, clients often quote the total number of queries as a metric of search adoption and success. Subtracting the queries in Area A can give a very different picture.
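The subtraction can be sketched in a few lines of Python. The function name `adoption_summary`, the set of application-location terms and every figure below are made up for illustration; deciding which terms count as Area A is a manual judgment.

```python
def adoption_summary(ranked_queries, navigational_terms):
    """Split total query volume into application-location and other queries.

    `ranked_queries` is a list of (term, count) pairs; `navigational_terms`
    is a hand-curated set of terms judged to be application lookups.
    """
    total = sum(count for _, count in ranked_queries)
    navigational = sum(
        count for term, count in ranked_queries if term in navigational_terms
    )
    return {
        "total": total,
        "navigational": navigational,
        "remaining": total - navigational,
        "navigational_share": navigational / total if total else 0.0,
    }


# Made-up figures, for illustration only.
query_counts = [
    ("concur", 900),
    ("timesheet", 600),
    ("unpaid leave", 300),
    ("pump seals", 200),
]
summary = adoption_summary(query_counts, {"concur", "timesheet"})
```

With these invented numbers, 1,500 of the 2,000 queries are application lookups, leaving 500 that reflect search being used to find information.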

Area B: Corporate News and Policies

Search really starts to come into its own here, when it's helping employees track down news stories and corporate policies. These are text-rich, and search is often the only way to find the most recent version. One could argue these policies should be readily available on the intranet, but users are frequently looking for specific terms (e.g. "unpaid leave") that might not be reflected in the title or even the summary of a document.

The volume of searches is high enough here for AI/machine learning to assist in delivering relevant results, taking into account that the application holds information on each user's office location, language capabilities and preferences, and product area.

This area looks to be where there's a significant inflection in the curve, but this is not the case. Area A is not a function of tf.idf but of application invisibility. So the Zipf curve should really start from the highest-ranked Area B queries, making the inflection much less pronounced.

Area C: Learning

It is not until Area C that subject-related queries dominate. There is always a very long tail. What is interesting in this area is the relative ranking of terms. Taking a six-month view should eliminate cyclical variations (for example hunting down quarterly reports) but could reveal topics whose query frequency is steadily increasing. Such a trend may call for a review of "best bets" or an increased crawl frequency on selected servers.
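One crude way to spot steadily increasing topics is to look for terms whose monthly counts never fall and grow by some minimum factor over the period. The Python sketch below assumes monthly counts are available per term; the function name `steadily_rising`, the `min_growth` threshold and the sample figures are all hypothetical.

```python
def steadily_rising(monthly_counts, min_growth=1.5):
    """Return terms whose counts never fall month to month and grow overall.

    `monthly_counts` maps a term to its counts for consecutive months.
    `min_growth` is an arbitrary threshold on last month / first month.
    """
    rising = []
    for term, counts in monthly_counts.items():
        never_falls = all(a <= b for a, b in zip(counts, counts[1:]))
        if never_falls and counts[0] > 0 and counts[-1] / counts[0] >= min_growth:
            rising.append(term)
    return rising


# Made-up six-month counts, for illustration only.
history = {
    "quarterly report": [40, 5, 6, 42, 4, 7],    # cyclical, not a trend
    "hybrid working": [10, 12, 15, 19, 24, 30],  # steady growth
    "org chart": [20, 21, 19, 20, 22, 20],       # flat
}
trending = steadily_rising(history)
```

Requiring counts never to fall is deliberately strict. With the low volumes typical of this part of the curve, a looser rule, such as comparing the first and last three months, may be more forgiving of noise.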

Many of my clients are surprised that items the ranking algorithms score as highly relevant are used less often than expected. This is because employees build up their own collections of core documents. Many of these may have been pushed to them by other applications, such as an ERP, CRM or even the intranet. Improving relevance positioning (often referred to as "precision at n") is challenging, as some documents may be in core collections for certain groups of employees but not for others. Moreover, in Area C recall could be very important.
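To make the two measures concrete, here is a minimal Python sketch using their standard definitions: precision at n is the fraction of the top n results that are relevant, and recall is the fraction of all relevant documents that were returned. The result list and the two groups' sets of relevant documents are invented, and treating each group as having a fixed relevant set is a simplifying assumption.

```python
def precision_at_n(results, relevant, n):
    """Fraction of the top n results that appear in the relevant set."""
    if n <= 0:
        return 0.0
    hits = sum(1 for doc in results[:n] if doc in relevant)
    return hits / n


def recall(results, relevant):
    """Fraction of the relevant set that appears anywhere in the results."""
    if not relevant:
        return 0.0
    return len(set(results) & relevant) / len(relevant)


# Hypothetical ranked results for one query, and what two groups find useful.
results = ["doc1", "doc2", "doc3", "doc4", "doc5"]
group_a_relevant = {"doc1", "doc2", "doc3"}
group_b_relevant = {"doc4", "doc5", "doc9"}

group_a_p3 = precision_at_n(results, group_a_relevant, 3)
group_b_p3 = precision_at_n(results, group_b_relevant, 3)
group_b_recall = recall(results, group_b_relevant)
```

The same ranking scores perfectly at n = 3 for the first group and zero for the second, which is why tuning a single ranking for everyone is difficult.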

Another outcome of this curve is to show just how few queries are actually run for these topics. The amount of data collected here will in most cases be below the level at which AI and ML could significantly and visibly improve search relevance across all users.

Use Ranked Lists With Care

Businesses can glean a great deal of information from a ranked list of queries, but it is important not to rush to judgment. This is where having a wide network of users is essential to understanding not only what employees are looking for, but why they are looking and whether they were satisfied with what they found.
