Companies might not want to know what turns up when employees type the word "confidential" into their intranet or portal search boxes.
The proliferation and increasing power of enterprise search require companies to pay more attention than ever to the protection of confidential information. They can no longer assume that what they thought was private is, in fact, protected. Today, companies must carefully reconsider document and information security issues, whether to protect intellectual property rights, undisclosed strategies (everything from new products to new partners to layoffs), or employee privacy for personal information held in a fiduciary capacity, such as medical, retirement, and personnel data.
The need for increased security is driven by the astronomical amount of data that companies are saving in their data stores, most of which is "search enabled." Corporations are even including email in the data earmarked to be spidered and indexed by search tools. Some storage manufacturers are turning their products into "smart devices" that allow stored data to be searched directly, giving companies without fancy enterprise search software both a tremendous amount of power and exposure to new risks. The more data that is indexed, the greater the chance that sensitive data will be retrieved. If you don't believe it, just try searching for the word "confidential" in your company's search box.
Levels of Security
Companies have very particular rules about which documents, or which portions of which documents, various users can see. Granularity refers to the precision with which particular pieces of data can be secured. "User-level" controls information access based on the type of the user: administrator, manager, human resources, etc. This concept is fairly widely understood, and it is about the same for search as it is for other systems.
There are two ways to implement security that maintain both granularity and user-level controls. The simpler method is through "collection-level" security, where users are given access to a certain collection of documents. Alternatively, with "document-level" security, users are granted access to documents on an individual basis. Both methods use "access control lists" or "single-sign-on."
Most search engines support the concept of collections, also referred to as "repositories," "sources," or "document sets." One of the easiest and reasonably effective control techniques is to simply segregate data by security requirements: Public data is grouped into one section, restricted data into a second, highly confidential data into a third, and so on. Search engines can limit the search to one or more of these collections. Once the credentials and access level of an individual user are determined, the appropriate collections are enabled for their search.
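The collection-level approach described above can be sketched in a few lines. This is a minimal, hypothetical illustration: the role names, collection names, and the in-memory "index" are all invented for the example, not taken from any particular search engine.

```python
# Hypothetical sketch of collection-level security: each role maps to the
# set of collections that role is allowed to search. All names are
# illustrative, not from a real product.
COLLECTION_ACCESS = {
    "public":  {"public"},
    "manager": {"public", "restricted"},
    "finance": {"public", "restricted", "confidential"},
}

def allowed_collections(role):
    """Return the collections enabled for a role (empty set if unknown)."""
    return COLLECTION_ACCESS.get(role, set())

def search(query, role, index):
    """Search only within the collections enabled for this role."""
    enabled = allowed_collections(role)
    return [doc for doc in index
            if doc["collection"] in enabled and query in doc["text"]]
```

Once a user's credentials are resolved to a role, every query is silently restricted to the enabled collections; a user with the "public" role simply never sees confidential matches, no matter what they type.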
Where documents already have access-level security provided by a file system or a content management system (CMS), search security can be applied at the document level. This method of securing data will feel very familiar to those with a database background: certain groups or users can see certain documents. Databases and CMSs have had this technology for a very long time, and enterprise search engines are quickly catching up. Whether you talk about a "record," "document," or "web page," to a full-text search engine they are all retrievable units of data.
Complexity of Document-Level Security
One of the complexities with document-level security is the rendering of the results list. Typically a document or record will be well-secured, but the search engine has indexed all of the content and is displaying lists of titles and summaries in a results list. It’s not enough to secure the actual document; the results list should not display even the title or summary from a document that the user cannot see. Often even a title or summary can convey important (potentially confidential) information. For example, a title of "Indictment of John Smith Expected Tomorrow" would tip off John Smith, regardless of whether he can read the entire document or not.
A more subtle detail of the secured results list is the display of the number of matching documents and the links that allow users to page through a long results list. A simple engine might display the total matching count of documents, whereas a highly restricted user may only have access to 10% of those records, so the count is quite misleading. Beyond cosmetics, the engine needs to have an accurate idea about what documents that user can see when it is offering links to pages 2, 3, and 4 of the results list.
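The count-and-paging problem above comes down to filtering before paginating. A minimal sketch, assuming a caller-supplied access check (the predicate name is hypothetical):

```python
# Illustrative sketch: a secure results list must count and paginate only
# the documents the user can actually see, never the raw match count.
def secure_page(matches, can_view, page, page_size):
    """Filter matches through the user's access check, then paginate.

    `can_view` is a caller-supplied predicate (e.g. an ACL lookup).
    Returns (visible_total, one_page_of_results).
    """
    visible = [doc for doc in matches if can_view(doc)]
    start = (page - 1) * page_size
    return len(visible), visible[start:start + page_size]
```

Because the total and the page offsets are computed from the filtered list, a user who can see only 10% of the raw matches gets an honest count and working links to pages 2, 3, and 4.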
An even more subtle detail, but one which can still be a requirement in highly secure systems, is the confirmation of whether certain terms appear in the document index at all.
As seen in the John Smith example, searches for terms like "layoff," "indictments," or the names of specific people can partially confirm the presence of information, even if no document titles are shown. A highly secure search will not confirm or deny the presence of terms in its index outside the context of what the user is allowed to search. A more common example may be to not confirm the presence of obscenities or defamatory terms in nonaccessible content.
The theory of controlling access at the subdocument level is that different users can see different portions of the same document. Here are a few common examples of this method:
- Some sites that charge for content allow users to see the title of documents in their results list, and perhaps a summary, but the user must then pay to see the entire text of the article.
- All managers can see summaries of sales documents, but only VPs, finance, and sales can see the specific financial terms.
- Partners can see the text of bug reports but can’t see the company that logged the issue.
In these cases, document-level security must give way to subdocument level security. Needless to say, implementation details can get a bit sticky. Detecting and removing certain parts of unstructured documents can prove difficult.
Broadly, there are three levels of difficulty associated with the subdivision of documents. The easiest level includes database records (via a select statement or view) and XML (via XSLT); moderately difficult subdivision includes HTML (because it is not always well-formed) and PDF; and the most difficult level, which includes proprietary office documents, often requires a document filtering library and custom code, or document conversion.
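For the easiest case, structured XML, subdocument security amounts to stripping restricted elements before the content is indexed or displayed. A minimal sketch using Python's standard library; the role names and the `financial_terms` tag are hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical sketch of subdocument security for well-formed XML:
# elements a role is not cleared to see are removed before indexing or
# display. Role and tag names are invented for the example.
RESTRICTED_TAGS = {
    "manager": {"financial_terms"},  # managers see summaries only
    "vp": set(),                     # VPs see everything
}

def redact(xml_text, role):
    """Return the XML with elements restricted for this role removed."""
    root = ET.fromstring(xml_text)
    blocked = RESTRICTED_TAGS.get(role, {"financial_terms"})
    for parent in list(root.iter()):       # snapshot before mutating
        for child in list(parent):
            if child.tag in blocked:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")
```

The same filtering in production would more likely be an XSLT transform, but the principle is identical: the restricted portion never reaches the index that the restricted user searches.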
Emerging open document standards will make subdividing documents easier as they come into more widespread use.
Two Types of Implementation
Document-level security, where each group can have access to different documents on a group-by-group basis, is the fastest growing segment of high-end search engine installations. To have different permissions for each document, you need to have some type of existing Access Control List system and/or Single-Sign-On system in place and integration software from the search engine vendor to connect to it.
Although implementations are vendor-specific, there are two primary designs for providing document-level security: "early binding" and "late binding" document filtering.
Early binding document filtering is set up before the query is sent to the core search engine. Detailed information about the user's permissions is automatically added to the query, so the core engine will only bring back documents that the user can access. Early binding document security is often more complex to set up, but it is strongly preferred because it provides better performance and avoids some odd display issues.
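The early-binding rewrite can be pictured as appending an ACL clause to the user's query before it ever reaches the engine. This is a hypothetical sketch; the `acl:` field syntax is illustrative, not any particular vendor's query language:

```python
# Hypothetical early-binding sketch: the user's group memberships are
# resolved up front and folded into the query, so only permitted
# documents can ever match. The "acl:" syntax is invented for
# illustration.
def build_query(user_groups, user_query):
    """Rewrite the query so the engine matches only permitted documents."""
    acl_filter = " OR ".join(f'acl:"{g}"' for g in sorted(user_groups))
    return f"({user_query}) AND ({acl_filter})"
```

Because the restriction is part of the query itself, match counts, paging, and summaries all come out correct with no post-filtering.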
Late binding document filtering handles document security after the search has been submitted to the core engine, while the results list of matching documents is being displayed to the user. Each document’s access level is checked against the user’s security credentials. The results list formatter will check every document against an external server to see if the user has access. Late binding document filtering can potentially be very slow and can strain corporate security systems.
Late binding is much simpler to design and implement, and until very recently it was much more common. Early binding security requires significant up-front work. For each document, URL, database record, and so on, its entire access details must be downloaded and stored in the search index. Gathering Access Control List information from each of these unique sources and then mapping it to actual users and groups inside a company is a complex task.
With late binding security, a single question can be asked about any matching document and user. A simple "yes/no" request is made for the URL of each document, with the credentials of the user who issued the search forwarded to the remote system. The remote system will either return the document or not, depending on the user's rights. From the search engine's standpoint, it gets either a "yes" or a "no" answer and decides whether to display or discard that document from the results list accordingly.
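The late-binding loop is correspondingly simple. In this sketch, `is_authorized` stands in for the remote "yes/no" call to the external security system; the function and the hit structure are hypothetical:

```python
# Late-binding sketch: after the core engine returns raw matches, each
# hit is checked against an external authorization service, one
# "yes/no" call per document URL. `is_authorized` is a stand-in for
# that remote call.
def filter_results(hits, user, is_authorized):
    """Discard any hit the remote system says this user cannot retrieve."""
    return [hit for hit in hits if is_authorized(user, hit["url"])]
```

The simplicity is also the weakness: a results page of fifty hits means fifty round trips to the security system, which is exactly the performance strain the article warns about.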
Vendors have many different names for these two implementations, so sadly you may need to do a little digging.
Problems to Look Out For
In this article, we have addressed some of the key issues a company faces when all of its documents are indexed by a central search repository. Sometimes business processes or requirements dictate that some distributed sources of content cannot be included in a central search index. In this case, you may be able to solve the problem using a federated search. In a federated search model, each user query is sent to the native search engine for the distributed data source. The authentication credentials of a user are passed with each request. These results are merged with results from the central search repository and presented to the user.
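The federated model above can be sketched as fan-out and merge. This is a minimal illustration under stated assumptions: each "engine" is a stand-in callable for a real connector, credentials are passed through opaquely, and results are merged by raw score:

```python
# Federated-search sketch: send the query (with the user's credentials)
# to the central index and each remote engine, then merge the results.
# The engine callables and the (score, title) result shape are
# assumptions made for this example.
def federated_search(query, credentials, central, remotes):
    """Query the central index and each remote engine, then merge.

    Each engine is a callable taking (query, credentials) and returning
    a list of (score, title) tuples; merged results are ordered by score.
    """
    results = list(central(query, credentials))
    for remote in remotes:
        results.extend(remote(query, credentials))
    return sorted(results, key=lambda r: r[0], reverse=True)
```

Merging by raw score assumes the engines score on comparable scales, which real engines rarely do; that is precisely the relevance-ranking challenge discussed next.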
There are a number of challenges in federating results in an effective manner, including relevance ranking and result presentation. Still, federation may be the only way to get all the relevant content to the user.
It is possible that some remote search engines will not accept federated searches for either technical or policy reasons. If there are a number of nonfederated sites like this, the sites themselves can at least be listed as suggested sources of data if the descriptions of the sites contain matching terms.
There are many potential security holes that need to be verified as a company deploys an enterprise search engine. The exponential growth of data reserves will only continue, so these issues will eventually need to be dealt with by all but the smallest companies. By matching the appropriate search engine technical tools with a company’s business requirements, companies can alleviate many of these concerns and significantly reduce the risk of unwanted security breaches.
About the Author
MARK BENNETT is the chief technology officer of New Idea Engineering (www.ideaeng.com), which helps companies make search work right. NIE focuses on search best practices to help companies select, design, and deploy advanced enterprise search applications, including search 2.0 interactivity, periodic review of search activity, and ongoing search data quality monitoring to ensure great relevancy and user satisfaction.