Anyone involved in the ongoing management of internal search has heard the complaints: internal search stinks.
We've spent the last few months getting your enterprise search house in order: giving it some of the attention your SEO team gives your public content for Google and improving the relevance on your internal search platforms. This month let's take a break from the "how" and focus on something more basic — specifically, what’s in your search index.
Ensuring your intranet content is accurate, well described and legally compliant can help your employees find what they need while keeping your business on the right side of the law. I break the work into three categories: relevance, metadata and retention.
Virtually every search platform uses algorithms for calculating document relevance in an attempt to rank the search results. Public Google uses a number of metrics, including the "link" count: pages with lots of inbound links are likely to hold higher relevance for the link term.
Most enterprise search products, likely including the Google Search Appliance, use link terms as an indication of relevance (‘popularity’) for a particular term. Think about it: if outside websites link to a given site for a term, chances are good the content is relevant for that term.
Public content creators can try to game search engines, so Google likely downplays that metric. Your internal content creators, however, are probably not trying to drive employees to their content, so link counts remain a good signal internally, and most enterprise search platforms still use them.
Many modern search platforms use an approach called “TF/IDF.” The acronym stands for “Term Frequency/Inverse Document Frequency.” Term frequency, usually adjusted for document length, means that a term appearing 10 times in a short document counts for more than the same term appearing 10 times in a long one. Inverse document frequency gives extra weight to terms that appear in only a few documents across the index. It makes sense: a one-page document that mentions “vacation” 10 times is probably more relevant than a 200-page PDF with the same term mentioned 10 times.
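To make the “vacation” example concrete, here is a minimal sketch of one common TF/IDF variant. The function name, the smoothing in the IDF formula and the sample documents are my own assumptions for illustration; your search platform almost certainly uses a more elaborate formula.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """Score one term for one document against a corpus of token lists."""
    if not doc_tokens:
        return 0.0
    # Term frequency, normalized by document length.
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    # Inverse document frequency (smoothed): terms found in fewer
    # documents get a larger weight.
    docs_with_term = sum(1 for tokens in corpus if term in tokens)
    idf = math.log((1 + len(corpus)) / (1 + docs_with_term)) + 1
    return tf * idf

# Hypothetical documents: both mention "vacation" 10 times,
# but one is 100 words long and the other is 10,000 words long.
short_doc = ["vacation"] * 10 + ["policy"] * 90
long_doc = ["vacation"] * 10 + ["handbook"] * 9990
corpus = [short_doc, long_doc]

short_score = tf_idf("vacation", short_doc, corpus)
long_score = tf_idf("vacation", long_doc, corpus)
```

With these made-up documents the short one scores higher for “vacation,” purely because the same 10 mentions make up a larger share of its text.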
On top of that, most search engines weight a term in a metadata field, such as the title, more heavily than the same term in the body of the content.
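A rough sketch of field weighting, assuming documents are plain dictionaries and using weights I picked arbitrarily for illustration. Real platforms expose this as a configurable boost, and the numbers vary.

```python
# Assumed, illustrative weights: a hit in the title counts three times
# as much as a hit in the body.
FIELD_WEIGHTS = {"title": 3.0, "keywords": 2.0, "body": 1.0}

def field_score(term, doc):
    """Sum weighted occurrences of a term across a document's fields."""
    term = term.lower()
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = doc.get(field, "").lower().split()
        score += weight * words.count(term)
    return score

# Hypothetical documents: the first kept a default title,
# the second has a meaningful one.
docs = [
    {"title": "Document1", "body": "vacation policy for all staff"},
    {"title": "Vacation policy", "body": "how to request time off"},
]
ranked = sorted(docs, key=lambda d: field_score("vacation", d), reverse=True)
```

In this toy ranking the document with “Vacation” in its title comes first, while the one with a default title has to rely on its body text alone.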
Have you ever noticed that a web search for "Document1" will show a large number of Microsoft Word documents that were saved in PDF format? The same is true if you search for things like “Document2” or even “Document10.” And if you want to find interesting spreadsheets, search Google for “Worksheet1.” I just did and found a Word document saved as a PDF published by the Supreme Court of Nebraska. Remember, with Google your mileage may vary.
And let's not forget that document metadata is often wrong. I blogged several years ago about a company with an internal search platform that, in response to a query for “Sarah,” returned 60 resumes for the company’s consultants, all of whom were men. Can you guess the name of the person whose job it was to copy the CV data for new consultants from their resumes and paste it into a Word template before publishing on the website?
All of these examples serve as a reminder to make sure your users — especially those who create web content — understand the importance of good metadata. While at times you may want to override the search platform relevance as mentioned a few months ago, use that option sparingly.
You may also want to consider defining a fixed vocabulary, or using a "Best Bets" tool if your search platform offers one, to boost the right documents. Be certain your content creators know that good metadata will increase the likelihood of their content being found and read.
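If your platform has no such tool, the idea behind "Best Bets" can be sketched in a few lines. The query-to-URL table and the function name below are hypothetical; a real implementation would live in the platform's own configuration.

```python
# Hypothetical editor-maintained table: query -> URLs to show first.
BEST_BETS = {
    "vacation": ["/hr/vacation-policy"],
}

def apply_best_bets(query, results, best_bets=BEST_BETS):
    """Return results with any best-bet URLs for the query pinned on top."""
    pinned = best_bets.get(query.strip().lower(), [])
    rest = [url for url in results if url not in pinned]
    return pinned + rest

reordered = apply_best_bets(
    "Vacation", ["/news/summer-party", "/hr/vacation-policy", "/it/vpn"]
)
```

The engine's own ranking is left alone for everything except the pinned entries, which is one way to keep manual overrides to a minimum.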
Some excellent metadata extraction tools are available, both commercial and open source. What they support differs widely: some automatically assign metadata to fields, while others simply match content against a dictionary of terms you supply.
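The dictionary-driven approach is simple enough to sketch. This is not how any particular product works; the vocabulary, tag names and function below are assumptions made up for the example.

```python
import re

# Hypothetical controlled vocabulary: tag -> phrases that imply it.
VOCABULARY = {
    "hr": ["vacation", "benefits", "payroll"],
    "legal": ["retention policy", "discovery"],
}

def extract_tags(text, vocabulary=VOCABULARY):
    """Return the sorted tags whose phrases appear in the text."""
    lowered = text.lower()
    tags = []
    for tag, phrases in vocabulary.items():
        # Whole-word matching so "discovery" does not match "rediscovery".
        if any(re.search(r"\b" + re.escape(p) + r"\b", lowered) for p in phrases):
            tags.append(tag)
    return sorted(tags)

tags = extract_tags("Our Retention Policy covers payroll records.")
```

The tags it returns could then be written into a metadata field at index time, so the search engine can weight them accordingly.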
Intranet content owners face a unique problem related to document retention policy. Intranet content can come under review during legal discovery, and failure to follow published retention policies can put your company at risk. Check with your legal team, and if your company implements a retention period, make sure you comply with it.
Once again, good metadata is important in this process. Sadly, some of the large metadata projects I’ve seen were started after a judgment was made against a company for failing to follow its own policy. Use care!
Do You Have a Problem?
It’s hard to say whether your company has any of these issues. A quick way to test is to search your internal (and public) sites for “Document1” or “Worksheet1.” Another fun test is to search your sites for “Company Confidential” or even “Secret” — you may be surprised what you find.
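If you can export titles and text from your index, the same tests can be automated. The sketch below assumes each document is a dictionary with "url", "title" and "body" keys, which is my own convention and not any platform's export format.

```python
import re

# Default titles left behind by office software, per the examples above.
DEFAULT_TITLE = re.compile(r"^(document|worksheet)\d+$", re.IGNORECASE)
# Markings that probably should not turn up in search results.
SENSITIVE = re.compile(r"\b(company confidential|secret)\b", re.IGNORECASE)

def audit(docs):
    """Return (url, issue) pairs for documents worth a second look."""
    findings = []
    for doc in docs:
        if DEFAULT_TITLE.match(doc.get("title", "").strip()):
            findings.append((doc["url"], "default title"))
        if SENSITIVE.search(doc.get("body", "")):
            findings.append((doc["url"], "sensitive marking"))
    return findings

# Hypothetical export from an index.
sample = [
    {"url": "/a.pdf", "title": "Document1", "body": "Quarterly figures"},
    {"url": "/b.doc", "title": "Plan", "body": "COMPANY CONFIDENTIAL: draft"},
    {"url": "/c.doc", "title": "Secretary guide", "body": "Office hours"},
]
findings = audit(sample)
```

Treat the findings as a list of candidates for review, since a document that merely discusses secrecy will be flagged too.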
And if you have a document retention policy, test your indexed content every year. But remember, without good metadata quality, you may think you’re in good shape — right up until the discovery phase finds out-of-compliance content. The next project you’ll be working on after that will be to implement great search to identify non-compliant content!
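A yearly retention test might look something like the sketch below. It assumes each indexed document carries a "modified" date in its metadata and that your policy is a simple age limit; both are assumptions, and your legal team's actual rules are likely more nuanced.

```python
from datetime import date, timedelta

def past_retention(docs, retention_days, today):
    """Split documents into those past retention and those we cannot judge."""
    cutoff = today - timedelta(days=retention_days)
    flagged, unknown = [], []
    for doc in docs:
        modified = doc.get("modified")
        if modified is None:
            # Missing metadata: the document may be out of compliance,
            # but this check cannot tell.
            unknown.append(doc["url"])
        elif modified < cutoff:
            flagged.append(doc["url"])
    return flagged, unknown

# Hypothetical index export and a hypothetical seven-year policy.
sample = [
    {"url": "/old.pdf", "modified": date(2010, 5, 1)},
    {"url": "/new.pdf", "modified": date(2023, 6, 1)},
    {"url": "/mystery.doc", "modified": None},
]
flagged, unknown = past_retention(sample, retention_days=7 * 365, today=date(2024, 1, 1))
```

The second list is the one to worry about: documents with no usable date are exactly the ones that let you think you are in good shape when you are not.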