You may have heard the new buzzword being applied to enterprise search: "insight engines." When I first saw the Gartner Enterprise Search Quadrant renamed as Insight Engines a few years ago, I was hard-pressed to understand why the change was made. But Gartner and others have now defined the term as search that includes cognitive processing and machine learning (ML), artificial intelligence (AI) or both. When it comes to enterprise search, that usually means Apache Spark.
We all know search on large internet platforms, such as Google and Amazon, is simply amazing. I'd say it's akin to rocket science — except I understand how rockets work. ML and AI are beyond rocket science and into black box territory for me — and for many others.
It reminds me of a former teacher whose humor I didn't appreciate until much later in life. When asked if a particular test was hard, he'd tell us that if we knew all the answers, the test would be very easy. As in geometry, the good news about machine learning (and, by extension, AI) is that once you know what the magician knows, it's easy.
The real challenge in applying ML and AI to most enterprise search instances is a matter of size. Even for very large organizations, intranet search generally doesn't have enough unique content or sufficient query volume to discover the kinds of relationships that big data and machine learning algorithms can provide. And to be really useful, we're not talking about lots of random data — you really need data that connects to and is related to other data in the environment.
When we read Wikipedia's definition of big data, "an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications," big data and enterprise search seem mutually incompatible.
Machine Learning Needs Big Data and Big Queries
If we need large volumes of related data to get value from big data tools, I'd propose we also need what I'd call "big queries." Maybe not really big queries, but certainly a lot of them, many of them similar or related. Consider vacation and holiday schedules; savings and checking accounts; clothing for women and for men.
Remember, big data means large and complex data, which we generally don't have in the enterprise. The good news: even without large and complex data, search has signals.
Big Data Relies on Signals
The magic behind the big data app curtain falls into the general category called "signals." These represent actions, events or information that provide clues regarding relevance of a given document for a query. Signals fall into two categories I'll call implicit signals and explicit signals.
Implicit and Explicit Signals
Implicit signals are the default ranking elements in virtually all search platforms. These include the terms contained in the document and the document metadata, such as the properties available in Microsoft Office documents: Title, Author, Company, Category and others. These are stored in the search index during the indexing phase.
Explicit signals are the metadata about search activity. They are calculated and applied at query time and draw on a great deal of information about the documents in the search index, past user search activity and the documents actually viewed for previous queries. Fewer search platforms use them extensively. They consider the user's recent query history and document views, security access levels, roles in the organization and people similar to the user, and sometimes even the user profile: department, age, role, etc.
Implicit signals are the ones used most frequently in search platforms. But the way many organizations create documents means the metadata is often incorrect. Years ago I wrote a blog post, "60 Guys Named Sarah," that explained how one company discovered their bad document metadata.
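To make the two kinds of signals concrete, here is a minimal Python sketch of how they might be combined at query time. It is an illustration under assumptions, not how any particular search platform works: the function names, the title weight of 2, the `click_weight` of 0.5 and the sample documents and click log are all hypothetical.

```python
from collections import Counter

def implicit_score(query, doc):
    """Index-time signal: count query terms found in the title and body."""
    terms = query.lower().split()
    body = Counter(doc["body"].lower().split())
    title = set(doc["title"].lower().split())
    # Assumption: a title match is worth two extra points per term.
    return sum(body[t] + 2 * (t in title) for t in terms)

def explicit_score(query, doc_id, click_log):
    """Query-time signal: how often past users opened this document for this query."""
    return click_log.get((query.lower(), doc_id), 0)

def rank(query, docs, click_log, click_weight=0.5):
    """Order document ids by implicit score plus weighted click history."""
    scored = [
        (implicit_score(query, d)
         + click_weight * explicit_score(query, d["id"], click_log), d["id"])
        for d in docs
    ]
    return [doc_id for _, doc_id in sorted(scored, key=lambda s: -s[0])]

# Hypothetical sample data.
docs = [
    {"id": "hr-001", "title": "Holiday schedule",
     "body": "company holiday schedule for the year"},
    {"id": "hr-002", "title": "Vacation policy",
     "body": "how to request vacation and holiday time"},
]
click_log = {("holiday schedule", "hr-002"): 12}

print(rank("holiday schedule", docs, {}))         # term matching only
print(rank("holiday schedule", docs, click_log))  # term matching plus click history
```

With an empty click log the document whose title matches the query ranks first. Once the log shows that people searching for "holiday schedule" usually open the vacation policy, that document moves to the top, which is the kind of correction explicit signals can make when the metadata is unreliable.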
'People Like You' in the Enterprise
Big data and big data tools are intended for use by organizations that possess very large sets of documents (or products) and that process a relatively large number of queries every day.
But an additional bit of metadata is available within the enterprise that may tip the scales in a business's favor. This metadata is unavailable to public search sites but easy to find in the organization, even though it is still rarely used. With the right setup, you can use ML tools to access this helpful metadata.
The magic trick? The context of the human doing the query. When Amazon tells you "people like you" liked something, it makes that assumption based on a great deal of query and purchasing history: previous purchases, similar products and other query history. In your enterprise, you know a great deal of context about the person making the query, including job title, age, security levels across multiple repositories, location and perhaps even native language. When you can bring this implicit context into your search, you can offset the disadvantages of relatively small query volume and relatively small and homogeneous data sets. This may even be a way to implement "people like you" without the volume large public-facing sites need for the same level of context.
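One way to picture "people like you" in the enterprise is as a boost based on profile overlap. The Python sketch below is only an illustration under assumptions: the profile fields (`department`, `job_title`, `location`), the sample users and the view log are hypothetical, and a real deployment would also have to respect security levels.

```python
def profile_similarity(a, b, fields=("department", "job_title", "location")):
    """Fraction of profile fields on which two users match."""
    matches = sum(
        a.get(f) is not None and a.get(f) == b.get(f)  # a missing field never matches
        for f in fields
    )
    return matches / len(fields)

def people_like_you_boost(user, doc_id, view_log, profiles):
    """Sum the similarity of every colleague who viewed the document."""
    return sum(
        profile_similarity(user, profiles[viewer])
        for viewer, viewed in view_log
        if viewed == doc_id
    )

# Hypothetical directory data and document views.
profiles = {
    "ana": {"department": "finance", "job_title": "analyst", "location": "Boston"},
    "raj": {"department": "engineering", "job_title": "developer", "location": "Austin"},
}
view_log = [("ana", "expense-policy"), ("raj", "build-guide")]
me = {"department": "finance", "job_title": "analyst", "location": "Denver"}

print(people_like_you_boost(me, "expense-policy", view_log, profiles))
print(people_like_you_boost(me, "build-guide", view_log, profiles))
```

A finance analyst gets a boost for the document another finance analyst viewed and none for the engineering guide. A single view from a similar colleague already carries information here, which is why this context can stand in for some of the query volume the big public sites rely on.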
Is Big Search for You?
Big data can be a big mistake in enterprise search, unless you can find and use context about your content as well as context about your users. Nonetheless, the insight engines you'll find on the commercial market — as well as in combinations of a few open source platforms — can deliver a reasonable semblance of the kinds of results you and your users see on the big internet sites.
But before you dive into a big ML project simply because it sounds so promising, and because everyone seems to be doing it, remember that simply managing your enterprise search well will likely deliver results that are just as good, without the new tools and learning.