Over the last few years, artificial intelligence and machine learning have increasingly come up in conversations about enterprise search. As artificial intelligence (AI) and its cousin, machine learning (ML), have improved in accuracy and ease of integration, it has become more common to see them integrated directly with search, or running alongside it, to improve results.
But chances are you remember when search relevancy was based on simple metrics like term frequency: the document with the most occurrences of the query terms was ranked highest, and documents with fewer occurrences ranked lower. You were able to provide stop words like "the" and "of," whose frequent use typically added no value in retrieving relevant documents. The only input really useful to the search engine was the terms in the user query.
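To make the older approach concrete, here is a minimal sketch of ranking by raw term frequency with stop words removed. The stop-word list, the document names and the function names are all made up for illustration; real engines of that era used more refined scoring than a plain count.

```python
from collections import Counter

# Illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


def rank_by_term_frequency(query, documents):
    """Score each document by how often the query terms occur in it."""
    terms = set(tokenize(query))
    scored = []
    for doc_id, text in documents.items():
        counts = Counter(tokenize(text))
        scored.append((doc_id, sum(counts[term] for term in terms)))
    # Highest count first; nothing about the user is taken into account.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical intranet documents.
docs = {
    "hr-policy": "the vacation policy covers vacation days and sick days",
    "it-guide": "the guide to the vpn and the laptop policy",
    "menu": "the menu of the cafeteria",
}
ranking = rank_by_term_frequency("the vacation policy", docs)
print(ranking)
```

Note that "the" in the query contributes nothing to any score, which is exactly the job stop words were given.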
How Do AI and Machine Learning Work in Search?
Rather than depending on questions of frequency, machine learning (in general) works in enterprise search by looking for signals to help understand user intent. Signals can be any behavior on the part of the user during any given session. A search itself can be a signal, but more advanced ML technology can utilize a wide range of events as signals.
The AI we typically experience today is in consumer products like Alexa and Siri, on sites including Amazon and Netflix, or in the memory of IBM Watson in action on Jeopardy. But signals for enterprise search are often higher quality on internal sites than on public-facing sites. Some argue that ML is not fit for use in enterprise search because it is most effective when it has a large set of signals. After all, few enterprises see the 60,000 queries a second that Google receives. On the other hand, enterprise content is much more homogeneous, and in my experience enterprise signals are more homogeneous than those on the internet.
So even though most internal sites receive lower query volume, the information regarding each user is far richer than what is found on most external sites. Unlike the average public site, an enterprise has much more accurate information about each user, including attributes such as job function, department, division, seniority, location and more, which allow the search engine to (in theory) deliver more accurate results and also apply document level security on restricted content.
Some of the signals typically used by ML in internal searches include user queries and subsequent activity, including clicks, views, returning to view other results, or performing a new search.
What Exactly Do We Mean By 'Signals'?
I’ve mentioned signals but haven’t really defined what they are. Simply said, signals are attributes of, or actions by, users while engaged with a given site or application. A signal can be any metric, data point or user attribute you choose.
On an ecommerce site, a user’s past browsing and purchasing history prime the signal processing. Purchasing (or even viewing) ski jackets could serve as a signal to display discounted lift tickets.
On a corporate intranet, a search for an employee by name might act as a signal to display that employee’s email and location.
Depending on the site, signals could be product views, purchases and returns. Within the corporate web, signals could include employee location, seniority, department, projects and more. Signals can be almost any data you can imagine.
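As a rough illustration of how signals like these might feed back into ranking, the sketch below records click events together with one user attribute (department) and uses the click counts to boost a base relevance score. The `Signal` record, the `boost` value and the sample data are hypothetical, and a real ML-driven engine would learn its weights rather than use a fixed additive boost.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """One user event. The field names are assumptions for this sketch."""
    department: str
    query: str
    action: str  # e.g. "click", "view", "requery"
    doc_id: str = ""


def click_counts(signals):
    """Count clicks per (department, query, document)."""
    counts = defaultdict(int)
    for s in signals:
        if s.action == "click":
            counts[(s.department, s.query, s.doc_id)] += 1
    return counts


def rerank(department, query, base_scores, counts, boost=0.5):
    """Add a fixed boost per past click by users in the same department."""
    boosted = {
        doc_id: score + boost * counts.get((department, query, doc_id), 0)
        for doc_id, score in base_scores.items()
    }
    return sorted(boosted, key=boosted.get, reverse=True)


# Hypothetical click log and base scores from the text-matching stage.
log = [
    Signal("finance", "benefits", "click", "benefits-2024"),
    Signal("finance", "benefits", "click", "benefits-2024"),
    Signal("finance", "benefits", "click", "benefits-2024"),
    Signal("finance", "benefits", "click", "benefits-archive"),
    Signal("finance", "benefits", "requery"),
]
base = {"benefits-archive": 2.0, "benefits-2024": 1.5}
order = rerank("finance", "benefits", base, click_counts(log))
print(order)
```

Here the archive page wins on text matching alone, but clicks from the user's own department move the current page to the top. A department with no click history simply gets the base ordering.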
How Do Signals Work?
Have you ever been on a site where you’ve looked at a product and decided against it, but as you're leaving a popup offers a discount price if you return to purchase that product? That's machine learning at work. As you visit most ecommerce sites, and an increasing number of corporate sites, the ML application tracks your clicks, page views, and even search queries to build a profile of you. This capability is based on signals.
Some network crawlers use ML to determine how often to revisit sub-sites. That is, ML can help set the revisit schedule for a corporate site based on how often that site is likely to update its content.
For me, the most frustrating element of ecommerce ML is when I purchase an item for a vacation — say a ski jacket — and on the next several visits to the site, I’m offered ski gear that I won’t need again any time soon.
Signals are not forever: they age over time. Generally speaking, newer signals are fresher and more relevant because sites and user behavior change over time. When you undertake an ML project, you'll want to revisit user activity frequently so you can identify and correct issues in a timely fashion.
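One simple way to model aging signals is exponential decay, where a signal loses half its weight every fixed number of days. This is only a sketch under that assumption: the 30-day half-life and the function names are illustrative, and the right decay rate depends on how quickly your content and users change.

```python
def decayed_weight(age_days, half_life_days=30.0):
    """Weight of a single signal: 1.0 when fresh, halved every half-life."""
    return 0.5 ** (age_days / half_life_days)


def decayed_score(click_ages_days, half_life_days=30.0):
    """Sum the decayed weights of all clicks recorded for one document."""
    return sum(decayed_weight(age, half_life_days) for age in click_ages_days)


# Hypothetical data: four clicks from 90 days ago versus one click today.
old_doc_score = decayed_score([90, 90, 90, 90])
new_doc_score = decayed_score([0])
print(old_doc_score, new_doc_score)
```

With these numbers the single fresh click outweighs the four stale ones, so last season's ski jacket purchase gradually stops driving recommendations.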
Beware Positive Feedback
When you first introduce machine learning to your enterprise search, you’ll inevitably make some mistakes. When users search a site, they often click on the top search result. This can create a "positive feedback loop," where your ML will opt to promote that first result — even if it’s not the best document for the query.
One solution is to prime your ML with relevant documents prior to roll-out. Pretty quickly it will start showing high quality results — which is, after all, why you brought in ML in the first place.
Track its behavior, and quickly correct any fluke results that may show up. Generally speaking, you’d like to see new content move up in the ML-driven rankings as old content loses its relevance.
Finally, watch your log files to see what users are looking for, and what the ML-driven search displays. After all, you are much smarter than any ML solution is or will be in the foreseeable future.
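A log review like this can start small. The sketch below assumes a hypothetical log reduced to (query, clicked rank) pairs, where the rank is the 1-based position of the clicked result or `None` when nothing was clicked. It reports, per query, how often users clicked the top result and how often they clicked nothing. The log format and field names are assumptions, not the output of any particular search product.

```python
from collections import Counter


def summarize_log(entries):
    """Summarize (query, clicked_rank) pairs into per-query click shares."""
    searches = Counter()
    top_clicks = Counter()
    no_clicks = Counter()
    for query, clicked_rank in entries:
        searches[query] += 1
        if clicked_rank is None:
            no_clicks[query] += 1
        elif clicked_rank == 1:
            top_clicks[query] += 1
    return {
        query: {
            "searches": total,
            "top_click_share": top_clicks[query] / total,
            "no_click_share": no_clicks[query] / total,
        }
        for query, total in searches.items()
    }


# Hypothetical log entries.
log_entries = [
    ("expense report", 1),
    ("expense report", 1),
    ("expense report", 1),
    ("expense report", 3),
    ("parking", None),
    ("parking", None),
]
summary = summarize_log(log_entries)
print(summary)
```

A very high top-click share is worth a manual check, since it may reflect the positive feedback loop described above rather than a genuinely best document. A high no-click share suggests the engine is not surfacing what users want for that query.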
Coach, Don't Teach Your Way to Better Results
Implementing machine learning in enterprise search can seem onerous when you first start. My geometry teacher in high school had us address him as “Coach.” He told us at the beginning of the school year that he wasn’t going to teach us the topic. Rather, he assured us he would coach us as we learned. At the time I was too young to appreciate the truth of his claim. But Coach was a wise man, and his claim applies to ML as well: find a coach and dive in!