Delving Into Enterprise Search Query Logs

When starting with a new client, I first request a spreadsheet of query terms ranked by frequency of use over a six-month period, which is then clustered by related terms. For example, some employees may search for "expenses" to track down the application while others search for "concur," a widely used expense management application. To make this task linguistically more challenging, Kosten, Auslagen, Spesen and Aufwendungen are all potential German-language terms for "expenses."
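
As a rough illustration of the clustering step, the sketch below rolls raw query terms up into clusters and ranks the clusters by frequency. The `CLUSTERS` map, the `rank_clusters` helper and the sample log are all hypothetical; in practice the synonym map would be built by hand with the client and would be far larger.

```python
from collections import Counter

# Hypothetical synonym map: each raw query term points to a cluster label.
CLUSTERS = {
    "expenses": "expenses",
    "concur": "expenses",
    "kosten": "expenses",
    "auslagen": "expenses",
    "spesen": "expenses",
    "aufwendungen": "expenses",
}


def rank_clusters(query_log):
    """Count queries per cluster and return (cluster, count) pairs, most frequent first."""
    counts = Counter()
    for query in query_log:
        term = query.strip().lower()
        # Terms missing from the map form their own single-term cluster.
        counts[CLUSTERS.get(term, term)] += 1
    return counts.most_common()


# Made-up sample log, for illustration only.
sample_log = ["Concur", "expenses", "Spesen", "holiday policy", "concur"]
ranked = rank_clusters(sample_log)
```

Here the four expense-related queries collapse into one cluster that outranks "holiday policy", which is the effect the clustering is meant to achieve.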

The plot always ends up as shown in the diagram above. The shape of the curve is not an artifact of technology, but of linguistics. It is an example of Zipf’s Law, named after the linguist George Kingsley Zipf (1902-1950), who popularized it. Zipf's Law states that given a large sample of words used, the frequency of any word is inversely proportional to its rank in the frequency table. One of the core constructs of a text search application is the term frequency.inverse document frequency (tf.idf) measure. The shape of the curve can provide a great deal of useful information if you know where to look for it. This column is a gentle introduction.
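
One way to check how closely a query log follows Zipf's Law is to fit a straight line to log(frequency) against log(rank): if frequency is inversely proportional to rank, the slope is close to -1. The sketch below is a minimal least-squares fit under that assumption. The `zipf_exponent` helper and the counts are invented for illustration, and real logs will only approximate the ideal.

```python
import math


def zipf_exponent(frequencies):
    """Return the negated least-squares slope of log(frequency) against log(rank).

    Expects at least two positive counts. A value near 1.0 means frequency is
    roughly inversely proportional to rank.
    """
    ordered = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(freq) for freq in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return -covariance / variance


# Made-up counts that follow frequency = 1200 / rank exactly.
ideal_counts = [1200 / rank for rank in range(1, 51)]
exponent = zipf_exponent(ideal_counts)
```

For these idealized counts the exponent comes out at 1.0, up to floating-point error.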

Area A: Application Location

Many people are surprised to hear that the most frequent queries are to find applications and information on how to complete a task, such as filing expense claims. The top 100 searches in a multinational corporation are often related to application location. This is not a good use of search, as a query for "concur" may return a substantial number of results that do not provide a direct connection to the application. Search is being used because tracking down the applications on the corporate intranet has become a nightmare, especially in larger companies where the home page is just full of news.

When reporting on the use of search, clients often quote the total number of queries as a metric of search adoption and success. Subtracting the queries in Area A can give a very different picture.
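
The arithmetic is simple, as the sketch below shows. The set of application clusters, the `adoption_summary` helper and every count are made up for illustration; deciding which clusters belong in Area A is a manual judgment.

```python
# Hypothetical set of cluster labels judged to be application look-ups (Area A).
APPLICATION_CLUSTERS = {"expenses", "timesheet", "payslip"}


def adoption_summary(ranked_clusters):
    """Split the total query volume into Area A and everything else."""
    total = sum(count for _, count in ranked_clusters)
    area_a = sum(
        count for label, count in ranked_clusters if label in APPLICATION_CLUSTERS
    )
    return {"total": total, "area_a": area_a, "excluding_area_a": total - area_a}


# Made-up six-month counts per cluster.
ranked_clusters = [
    ("expenses", 5200),
    ("timesheet", 3100),
    ("unpaid leave", 900),
    ("pump maintenance", 40),
]
summary = adoption_summary(ranked_clusters)
```

In this invented example, a headline figure of 9,240 queries shrinks to 940 once the application look-ups are removed.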

Area B: Corporate News and Policies

Search really starts to come into its own here, when it's helping employees track down news stories and corporate policies. These are text-rich and search is often the only way to find the most recent version. One could argue these policies should be readily available on the intranet, but users are frequently looking for some specific terms (e.g. looking for unpaid leave) that might not be reflected in the title or even the summary of a document.

The volume of searches is high enough here for AI/machine learning to assist in delivering relevant results, given that the application holds information on office location, language capabilities and preferences, and product area.

This area looks to be where there's a significant inflection in the curve, but this is not the case. Area A is not a function of tf.idf but of application invisibility. So the Zipf curve should really be starting from the most highly ranked of Area B queries, making the inflection much less pronounced.

Area C: Learning

It is not until Area C that subject-related queries dominate. There is always a very long tail. What is interesting in this area is the relative ranking of terms. Taking a six-month view should eliminate cyclic variations (for example, hunting down quarterly reports) but could reveal topics whose query frequency is steadily increasing. This may require a review of "best bets" or an increased crawl frequency on selected servers.
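
A simple way to spot steadily rising topics is to compare month-by-month counts for each term across the six months. The sketch below is one possible heuristic, not a standard method: the `rising_terms` helper, both thresholds and the sample counts are assumptions chosen for illustration.

```python
def rising_terms(monthly_counts, min_growth=2.0, min_rising_steps=4):
    """Flag terms whose counts rise in most months and at least double overall.

    Both thresholds are arbitrary assumptions and would need tuning per client.
    """
    flagged = []
    for term, counts in monthly_counts.items():
        # Number of month-to-month steps in which the count went up.
        rising_steps = sum(1 for a, b in zip(counts, counts[1:]) if b > a)
        grew_enough = counts[0] > 0 and counts[-1] / counts[0] >= min_growth
        if rising_steps >= min_rising_steps and grew_enough:
            flagged.append(term)
    return flagged


# Made-up monthly query counts over six months.
monthly_counts = {
    "quarterly report": [80, 10, 12, 85, 11, 13],  # cyclic spikes
    "hybrid working": [5, 8, 12, 15, 21, 30],      # steady rise
    "unpaid leave": [40, 38, 41, 39, 42, 40],      # flat
}
flagged = rising_terms(monthly_counts)
```

Only "hybrid working" is flagged here. The quarterly spikes and the flat series are ignored.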

Many of my clients are surprised that items the algorithms rank as highly relevant are used less often than expected. This is because employees build up their own collections of core documents. Many of these may have been pushed to them by other applications, such as an ERP, CRM or even the intranet. Attempts to improve relevance positioning (often referred to as "precision at n") are challenging, as some documents may be in core collections for certain groups of employees but not for others. Moreover, in Area C recall could be very important.
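
Precision at n is the fraction of the top n results that are relevant. The sketch below shows why a single figure can mislead: the same ranked list scores very differently depending on whose core collection defines "relevant." The document IDs and the two groups are hypothetical.

```python
def precision_at_n(ranked_results, relevant, n):
    """Fraction of the top n results that appear in the relevant set."""
    if n <= 0:
        raise ValueError("n must be positive")
    hits = sum(1 for doc in ranked_results[:n] if doc in relevant)
    return hits / n


# One ranked result list, judged against two hypothetical core collections.
results = ["doc-a", "doc-b", "doc-c", "doc-d", "doc-e"]
finance_core = {"doc-a", "doc-b", "doc-d"}
engineering_core = {"doc-e"}

finance_p3 = precision_at_n(results, finance_core, 3)
engineering_p3 = precision_at_n(results, engineering_core, 3)
```

For the finance group two of the top three results are relevant, while for engineering none are, even though the ranking itself has not changed.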

Another outcome of this curve is to show just how few queries are actually run for these topics. The amount of data collected here will in most cases be below the level at which AI and ML could significantly and visibly improve search relevance across all users.

Use Ranked Lists With Care

Businesses can glean a great deal of information from a ranked list of queries, but it is important not to rush to judgment. This is where having a wide network of users is essential to understanding not only what employees are looking for, but why they are looking and whether they were satisfied with what they found.
