Companies might not want to know what turns up when employees type the word "confidential" into their intranet or portal search boxes.
The proliferation and increasing power of enterprise search require companies to pay more attention than ever to the protection of confidential information. They can no longer assume that what they thought was private is, in fact, protected. Today, companies must carefully reconsider document and information security issues, whether to protect intellectual property rights, undisclosed strategies (everything from new products to new partners to layoffs), or employee privacy for personal information held in a fiduciary capacity, such as medical, retirement, and personnel data.
The need for increased security is driven by the astronomical amount of data that companies are saving in their data stores, most of which is "search enabled." Corporations are even including email in the data earmarked to be spidered and indexed by search tools. Some storage manufacturers are turning their products into "smart devices" that allow stored data to be searched directly, giving companies without fancy enterprise search software both a tremendous amount of power and exposure to new risks. The more data that is indexed, the greater the chance that sensitive data will be retrieved. If you don't believe it, just try searching for the word "confidential" in your company's search box.
Levels of Security
Companies have very particular rules about which documents, or which portions of which documents, various users can see. Granularity refers to the precision with which particular pieces of data can be secured. "User-level" controls information access based on the type of the user: administrator, manager, human resources, etc. This concept is fairly widely understood, and it is about the same for search as it is for other systems.
There are two ways to implement security that maintain both granularity and user-level controls. The simpler method is through "collection-level" security, where users are given access to a certain collection of documents. Alternatively, with "document-level" security, users are granted access to documents on an individual basis. Both methods use "access control lists" or "single-sign-on."
Most search engines support the concept of collections, also referred to as "repositories," "sources," or "document sets." One of the easiest and reasonably effective control techniques is to simply segregate data by security requirements: Public data is grouped into one section, restricted data into a second, highly confidential data into a third, and so on. Search engines can limit the search to one or more of these collections. Once the credentials and access level of an individual user are determined, the appropriate collections are enabled for their search.
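The collection-level approach described above can be sketched in a few lines. This is a minimal, hypothetical illustration: the role names, collection names, and the in-memory "index" are all invented for the example, not taken from any particular search engine.

```python
# Hypothetical sketch of collection-level security: each role maps to the
# set of collections that role is allowed to search. All names are
# illustrative, not from a real product.
COLLECTION_ACCESS = {
    "public":  {"public"},
    "manager": {"public", "restricted"},
    "finance": {"public", "restricted", "confidential"},
}

def allowed_collections(role):
    """Return the collections enabled for a role (empty set if unknown)."""
    return COLLECTION_ACCESS.get(role, set())

def search(query, role, index):
    """Search only within the collections enabled for this role."""
    enabled = allowed_collections(role)
    return [doc for doc in index
            if doc["collection"] in enabled and query in doc["text"]]
```

Once a user's credentials are resolved to a role, every query is silently restricted to the enabled collections; a user with the "public" role simply never sees confidential matches, no matter what they type.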
Where documents already have access-level security provided by a file system or a content management system (CMS), search security can be applied at the document level. This method of securing data will feel very familiar to those with a database background: certain groups or users can see certain documents. Databases and CMSs have had this technology for a very long time, and enterprise search engines are quickly catching up. Whether you talk about a "record," "document," or "web page," to a full-text search engine they are all retrievable units of data.
Complexity of Document-Level Security
One of the complexities with document-level security is the rendering of the results list. Typically a document or record will be well-secured, but the search engine has indexed all of the content and is displaying lists of titles and summaries in a results list. It’s not enough to secure the actual document; the results list should not display even the title or summary from a document that the user cannot see. Often even a title or summary can convey important (potentially confidential) information. For example, a title of "Indictment of John Smith Expected Tomorrow" would tip off John Smith, regardless of whether he can read the entire document or not.
A more subtle detail of the secured results list is the display of the number of matching documents and the links that allow users to page through a long results list. A simple engine might display the total matching count of documents, whereas a highly restricted user may only have access to 10% of those records, so the count is quite misleading. Beyond cosmetics, the engine needs to have an accurate idea about what documents that user can see when it is offering links to pages 2, 3, and 4 of the results list.
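The count-and-paging problem above comes down to filtering before paginating. A minimal sketch, assuming a caller-supplied access check (the predicate name is hypothetical):

```python
# Illustrative sketch: a secure results list must count and paginate only
# the documents the user can actually see, never the raw match count.
def secure_page(matches, can_view, page, page_size):
    """Filter matches through the user's access check, then paginate.

    `can_view` is a caller-supplied predicate (e.g. an ACL lookup).
    Returns (visible_total, one_page_of_results).
    """
    visible = [doc for doc in matches if can_view(doc)]
    start = (page - 1) * page_size
    return len(visible), visible[start:start + page_size]
```

Because the total and the page offsets are computed from the filtered list, a user who can see only 10% of the raw matches gets an honest count and working links to pages 2, 3, and 4.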
An even more subtle detail, but one which can still be a requirement in highly secure systems, is the confirmation of whether certain terms appear in the document index at all.
As seen in the John Smith example, searches for terms like "layoff," "indictments," or the names of specific people can partially confirm the presence of information, even if no document titles are shown. A highly secure search will not confirm or deny the presence of terms in its index outside the context of what the user is allowed to search. A more common example may be to not confirm the presence of obscenities or defamatory terms in nonaccessible content.
The theory of controlling access at the subdocument level is that different users can see different portions of the same document. Here are a few common examples of this method:
- Some sites that charge for content allow users to see the title of documents in their results list, and perhaps a summary, but the user must then pay to see the entire text of the article.
- All managers can see summaries of sales documents, but only VPs, finance, and sales can see the specific financial terms.
- Partners can see the text of bug reports but can’t see the company that logged the issue.
In these cases, document-level security must give way to subdocument level security. Needless to say, implementation details can get a bit sticky. Detecting and removing certain parts of unstructured documents can prove difficult.
Broadly, there are three levels of difficulty associated with the subdivision of documents. The easiest level includes database records (via a select statement or view) and XML (via XSLT); moderately difficult subdivision includes HTML (because it is not always well-formed) and PDF; and the most difficult level, which includes proprietary office documents, often requires a document filtering library and custom code, or document conversion.
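For the easiest case, structured XML, subdocument security amounts to stripping restricted elements before the content is indexed or displayed. A minimal sketch using Python's standard library; the role names and the `financial_terms` tag are hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical sketch of subdocument security for well-formed XML:
# elements a role is not cleared to see are removed before indexing or
# display. Role and tag names are invented for the example.
RESTRICTED_TAGS = {
    "manager": {"financial_terms"},  # managers see summaries only
    "vp": set(),                     # VPs see everything
}

def redact(xml_text, role):
    """Return the XML with elements restricted for this role removed."""
    root = ET.fromstring(xml_text)
    blocked = RESTRICTED_TAGS.get(role, {"financial_terms"})
    for parent in list(root.iter()):       # snapshot before mutating
        for child in list(parent):
            if child.tag in blocked:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")
```

The same filtering in production would more likely be an XSLT transform, but the principle is identical: the restricted portion never reaches the index that the restricted user searches.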
Emerging open document standards will make subdividing documents easier as they come into more widespread use.
Two Types of Implementation
Document-level security, where each group can have access to different documents on a group-by-group basis, is the fastest growing segment of high-end search engine installations. To have different permissions for each document, you need to have some type of existing Access Control List system and/or Single-Sign-On system in place and integration software from the search engine vendor to connect to it.
Although implementations are vendor-specific, there are two primary designs for providing document-level security: "early binding" and "late binding" document filtering.
Early binding document filtering is set up before the query is sent to the core search engine. Detailed information about the user's permissions is automatically added to the query, so the core engine will only bring back documents that the user can access. Early binding document security is often more complex to set up, but it is strongly preferred because it provides better performance and avoids some odd display issues.
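The early-binding rewrite can be pictured as appending an ACL clause to the user's query before it ever reaches the engine. This is a hypothetical sketch; the `acl:` field syntax is illustrative, not any particular vendor's query language:

```python
# Hypothetical early-binding sketch: the user's group memberships are
# resolved up front and folded into the query, so only permitted
# documents can ever match. The "acl:" syntax is invented for
# illustration.
def build_query(user_groups, user_query):
    """Rewrite the query so the engine matches only permitted documents."""
    acl_filter = " OR ".join(f'acl:"{g}"' for g in sorted(user_groups))
    return f"({user_query}) AND ({acl_filter})"
```

Because the restriction is part of the query itself, match counts, paging, and summaries all come out correct with no post-filtering.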
Late binding document filtering handles document security after the search has been submitted to the core engine, while the results list of matching documents is being displayed to the user. Each document’s access level is checked against the user’s security credentials. The results list formatter will check every document against an external server to see if the user has access. Late binding document filtering can potentially be very slow and can strain corporate security systems.
Late binding is much simpler to design and implement, and until very recently it was much more common. Early binding security requires significant up-front work. For each document, URL, database record, and so on, its entire access details must be downloaded and stored in the search index. Gathering Access Control List information from each of these unique sources and then mapping it to actual users and groups inside a company is a complex task.
With late binding security, a single question can be asked about any matching document and user. A simple "yes/no" request is made for the URL of each document, with the credentials of the user who issued the search forwarded to the remote system. The remote system will either return the document or not, depending on the user's rights. From the search engine's standpoint, it gets either a "yes" or a "no" answer and decides whether to display or discard that document from the results list accordingly.
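The late-binding loop is correspondingly simple. In this sketch, `is_authorized` stands in for the remote "yes/no" call to the external security system; the function and the hit structure are hypothetical:

```python
# Late-binding sketch: after the core engine returns raw matches, each
# hit is checked against an external authorization service, one
# "yes/no" call per document URL. `is_authorized` is a stand-in for
# that remote call.
def filter_results(hits, user, is_authorized):
    """Discard any hit the remote system says this user cannot retrieve."""
    return [hit for hit in hits if is_authorized(user, hit["url"])]
```

The simplicity is also the weakness: a results page of fifty hits means fifty round trips to the security system, which is exactly the performance strain the article warns about.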
Vendors have many different names for these two implementations, so sadly you may need to do a little digging.
Problems to Look Out For
In this article, we have addressed some of the key issues a company faces when all of its documents are indexed by a central search repository. Sometimes business processes or requirements dictate that some distributed sources of content cannot be included in a central search index. In this case, you may be able to solve the problem using a federated search. In a federated search model, each user query is sent to the native search engine for the distributed data source. The authentication credentials of a user are passed with each request. These results are merged with results from the central search repository and presented to the user.
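The federated model above can be sketched as fan-out and merge. This is a minimal illustration under stated assumptions: each "engine" is a stand-in callable for a real connector, credentials are passed through opaquely, and results are merged by raw score:

```python
# Federated-search sketch: send the query (with the user's credentials)
# to the central index and each remote engine, then merge the results.
# The engine callables and the (score, title) result shape are
# assumptions made for this example.
def federated_search(query, credentials, central, remotes):
    """Query the central index and each remote engine, then merge.

    Each engine is a callable taking (query, credentials) and returning
    a list of (score, title) tuples; merged results are ordered by score.
    """
    results = list(central(query, credentials))
    for remote in remotes:
        results.extend(remote(query, credentials))
    return sorted(results, key=lambda r: r[0], reverse=True)
```

Merging by raw score assumes the engines score on comparable scales, which real engines rarely do; that is precisely the relevance-ranking challenge discussed next.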
There are a number of challenges in federating results in an effective manner, including relevance ranking and result presentation. Still, federation may be the only way to get all the relevant content to the user.
It is possible that some remote search engines will not accept federated searches for either technical or policy reasons. If there are a number of nonfederated sites like this, the sites themselves can at least be listed as suggested sources of data if the descriptions of the sites contain matching terms.
There are many potential security holes that need to be verified as a company deploys an enterprise search engine. The exponential growth of data reserves will only continue, so these issues will eventually need to be dealt with by all but the smallest companies. By matching the appropriate search engine technical tools with a company’s business requirements, companies can alleviate many of these concerns and significantly reduce the risk of unwanted security breaches.
About the Author
MARK BENNETT is the chief technology officer of New Idea Engineering (www.ideaeng.com), which helps companies make search work right. NIE focuses on search best practices to help companies select, design, and deploy advanced enterprise search applications, including search 2.0 interactivity, periodic review of search activity, and ongoing search data quality monitoring to ensure great relevancy and user satisfaction.