We've come a long way.
Content security within search is now widely recognized as a requirement. Nearly every client we visit and every request for proposal we see includes a section on search security.
Google gives nearly 30 million results for "security enterprise search," and over 325 million results for "document security search." Hopefully the momentum translates beyond search results to real actions on securing search.
The Early Days of Search Security
When enterprise search was young, security generally meant document-level security. Essentially, the system tagged each page, record or document with access credentials and stored permissions for each document.
At query time, the more innovative search platforms would return the results by relevancy, but only after they were filtered, or trimmed, against the user’s access permissions. If you didn’t have the correct security level to access the document in its native repository, you were not authorized to view the document as part of a search result.
Some early implementations were flawed. I once saw a search implementation that reported, “Showing zero of three results for ‘layoffs’” — which pretty much answered the question, credentials or not.
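To make the trimming idea concrete, here is a minimal sketch in Python. It is not taken from any particular search platform: the `Hit` record, the group-based permission model, and the function names are all assumptions made for illustration. The point it tries to show is that the result count should be computed after trimming, so the summary line never reveals hits the user cannot see.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical search hit: an id, a relevancy score, and an ACL."""
    doc_id: str
    score: float
    allowed_groups: frozenset


def trim_results(hits, user_groups):
    """Keep only hits the user may see, ordered by relevancy (highest first)."""
    groups = set(user_groups)
    visible = [h for h in hits if h.allowed_groups & groups]
    return sorted(visible, key=lambda h: h.score, reverse=True)


def result_summary(hits, user_groups):
    """Build the summary line from the trimmed list only."""
    visible = trim_results(hits, user_groups)
    # The pre-trim total is never reported, so hidden matches stay hidden.
    return f"Showing {len(visible)} of {len(visible)} results"


hits = [
    Hit("hr-plan", 0.9, frozenset({"hr"})),
    Hit("newsletter", 0.5, frozenset({"all"})),
    Hit("faq", 0.7, frozenset({"all", "hr"})),
]
print([h.doc_id for h in trim_results(hits, ["all"])])
print(result_summary(hits, ["engineering"]))
```

A user in no matching group sees a count of zero, with no hint of how many documents matched before trimming.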
Search Is Vulnerable to Hackers Too
Search vendors and open source search projects have become much more cognizant of security's importance since then, but few technologies are prepared for break-ins and theft of confidential corporate information, an increasingly popular target for hackers.
This worrisome trend applies to files and databases, of course. But enterprise search owners may not know that the same intrusion threat can expose the "secure" content of their enterprise search platform to those hackers.
Search has always been designed for access, and document-level security was added later, almost as an afterthought.
What most search owners and administrators do not know is that, in nearly all platforms, the search index itself is not encrypted at all. If someone has access to the files that make up the search index, they can recreate much of the original document. Niceties like grammar or spelling may not come through precisely, but enough survives to understand the intent. And as intrusions increase, your search indices become increasingly vulnerable.
Search Indices: The High-Level View
Let’s take a brief look at why you should address this vulnerability now.
The implementation details differ for each search platform, but virtually all search platform indices have two related objects: an inverted index listing all the words in each searchable document, and a database-like table that stores the values used for field searches (e.g. authors or titles).
The inverted index essentially stores every word in every document and often includes the location within each document where the word occurs. The field table contains the values extracted from each document field such as Title or Author.
If someone has access to these files, they can not only glean what a document is about but also recreate most of the document itself.
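The reconstruction risk is easy to demonstrate with a toy example. The sketch below is an assumption-laden illustration, not the on-disk format of any real search platform: it builds a simple in-memory positional inverted index (term to document to word positions) and then rebuilds a document's text from the index alone, without ever consulting the original.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions where the term occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token][doc_id].append(pos)
    return index


def reconstruct(index, doc_id):
    """Rebuild a document's word sequence using only the index."""
    slots = {}
    for term, postings in index.items():
        for pos in postings.get(doc_id, []):
            slots[pos] = term
    return " ".join(slots[p] for p in sorted(slots))


docs = {"memo-17": "Layoffs planned for the third quarter"}
index = build_positional_index(docs)
print(reconstruct(index, "memo-17"))
```

Capitalization is lost because the toy index lowercases its terms, but the meaning of the memo comes through intact, which is exactly the concern.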
Are you worried now?
What Are Your Options?
The first step is to confirm that your search servers enjoy the same physical security as other mission-critical servers.
Lock them down and limit physical access to authorized staff. At minimum, that will keep out less skilled intruders.
As for data security, we’ve seen companies locate their search servers behind an internal firewall that only allows traffic on certain ports.
Work with your security staff to determine whether your search platform can be maintained behind an additional firewall, with limited access.
Encrypt the Index
When your content is regulated or secure in nature, a next step might be to look at tools to encrypt the search index at rest.
With this type of solution, even full access to the search index files will not result in a security breach. Hitachi Data Systems markets a product called Credeon Secure Search for Solr, and a few other companies offer similar capabilities.
This approach may have some relatively minor downsides, depending on your environment and data. For example, terms to be used as facets cannot be encrypted. However, realize that fields you use as facets are not typically the types of data you need to encrypt anyway.
Solr committer Erick Erickson wrote a few years ago that when it comes to Lucene/Solr, the physical risk is perhaps not as bad as it seems. While a field configured to be ‘stored’ keeps its value in plain text in the index, the word list is written only after processing through the full analysis chain, which applies transformations like stemming and synonym expansion.
This means the word list may be a bit cryptic, and a hacker may not be able to recreate a full document from it. But it clearly still presents some risk.
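As a rough illustration of why analyzed terms are cryptic but still revealing, here is a toy analysis chain in Python. It is a deliberately crude stand-in, not Lucene's or Solr's actual analyzers: the stop-word list, the suffix-stripping rule, and the function names are all invented for this sketch.

```python
import re

# Hypothetical stop-word list and suffix rules, chosen only for this example.
STOP_WORDS = {"the", "a", "an", "of", "for", "to", "in", "is", "are"}
SUFFIXES = ("ing", "ed", "s")


def toy_stem(token):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, tokenize, drop stop words, then stem what remains."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


print(analyze("The layoffs are planned for the third quarter"))
# prints ['layoff', 'plann', 'third', 'quarter']
```

The output is no longer readable prose, yet anyone looking at those four terms can tell what the document is about.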
So, What Do You Do Now?
If you are responsible for content security including search indices, you might consider taking a few steps:
Start with a search audit. Review your content, document level security requirements, physical and network access, and a complete risk analysis. Your findings will help dictate next steps.
Talk with your search vendor about current and planned security implementations they support. Ask them for a best practices document for using their platform with sensitive and secure content.
Create your implementation plan and get started. The time to start this may have been years ago, but the next best time is right now.