Federated search systems are developed to search multiple databases, regardless of location and structure, with one query and to display the results packaged in the most relevant way. Development efforts are directed primarily at presentation of results. Clearly, some people would like to use a dream search system that finds all the information needed with one single search query, but magical federated search systems have yet to arrive. While users want a completely easy-to-use and comprehensive search system, experienced researchers and information professionals know that no such system exists.
Federated search developers assume that users want a Google-type search interface and that people who succeed using Google will succeed using their federated search engines. One difference between federated search systems and Google is that Google relies on a central index while federated search systems usually depend on the differing search systems of the databases being accessed. The sources being searched may use different information discovery schemes and different protocols. Different fields from different databases may have different rules for entering information, even if the federation system builds a centralized "über-index." Effective information finding, in reality, requires users to know the type and quality of required information, to have extensive knowledge of sources, to understand how to build effective search strategies, and to have the experience to evaluate results.
People query databases for many reasons, including finding an answer to a specific question, accessing a body of content associated with a particular subject or author, or browsing to learn about the structure of a field and sources. Searchers range from naïve to expert in their understanding of their own needs, knowledge of the subject of their queries, and knowledge of sources. A public library user searching for information on global warming is likely to turn to Google and be inundated with citations, while an experienced researcher most likely would search a specialized database for a particular aspect of the global warming crisis. Federated search systems are not effective for all types of questions and queries. A system could mislead if the searcher is not familiar with a subject, sources of coverage, or the limitations of the software.
Definition and Scope
Discussion of federated search systems raises many questions with no clear answers. Can users receive all the information they need with one search query and get results presented in a useful format? The success of the search likely will depend on the knowledge of the user. What are the challenges presented by scalability, name protocols, clustering, translation, OPACs, lack of training, and flexibility in the presentation of results?
Each developer I spoke with has a separate definition of federated search. Raul Valdes-Perez, CEO of Vivísimo [http://www.vivisimo.com], provided his definition: "Federated search combines search results retrieved from multiple search engines into a single interweaved set of results." Emphasis is on relevancy and combining results. Stephen DiStasio, product manager at Serials Solutions [http://www.serialssolutions.com], said, "Federated search is a 'starting point for research,' a shotgun approach. Each federated search engine is different, ranks results in different ways, and presents results in different ways."
Federated search developers make various claims for the number of databases their systems can search and still find useful results. Scalability is a major issue. As Valdes-Perez explained: "The disadvantage to federated search is that the solution completing the federation is dependent on the underlying search technology of the individual sources. This means that often federated search is only as fast as the worst source. If a source times out, those results will not be displayed." Stephen Abram, vice-president of innovation at SirsiDynix [http://www.sirsidynix.com], and current president of the Special Libraries Association, observed: "Complexity is exponential as you add databases." He added that the semantic web may help. Abram and Randy Marcinko, CEO of Groxis [http://www.groxis.com], emphasized the need for taxonomies to reveal the structure, breadth, and depth of query results. Queries are likely to produce more relevant results if the searcher has knowledge of the field and its terminology.
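The timeout behavior Valdes-Perez describes can be illustrated with a minimal sketch. It is not any vendor's implementation. The source names, the simulated delays, and the functions `query_source` and `federated_search` are hypothetical stand-ins for real database connectors.

```python
import asyncio

async def query_source(name, delay, hits):
    """Hypothetical stand-in for a remote database search; `delay` simulates latency."""
    await asyncio.sleep(delay)
    return [f"{name}: {hit}" for hit in hits]

async def federated_search(sources, timeout):
    """Query every source concurrently and drop any source slower than `timeout`."""
    async def guarded(name, delay, hits):
        try:
            return await asyncio.wait_for(query_source(name, delay, hits), timeout)
        except asyncio.TimeoutError:
            return None  # a timed-out source contributes no results

    answers = await asyncio.gather(*(guarded(*source) for source in sources))
    results, skipped = [], []
    for (name, _, _), answer in zip(sources, answers):
        if answer is None:
            skipped.append(name)
        else:
            results.extend(answer)
    return results, skipped

# Assumed example sources: (name, simulated latency in seconds, canned hits)
SOURCES = [
    ("fast_db", 0.01, ["article A"]),
    ("medium_db", 0.05, ["article B", "article C"]),
    ("slow_db", 1.0, ["article D"]),
]

def run_demo(timeout=0.3):
    return asyncio.run(federated_search(SOURCES, timeout))
```

Because the sources are queried concurrently, the wait is bounded by the slowest source or by the timeout, whichever comes first. The user never sees what the skipped source would have returned, which is one reason a system might report which sources failed to answer.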
Abe Lederman, founder and president of Deep Web Technologies [http://www.deepwebtech.com], said, "I believe we know from a technical standpoint how to scale federated search to hundreds or possibly thousands of sources. We know how to do that." World Wide Science [http://www.worldwidescience.org], an international sci-tech portal maintained by the U.S. Department of Energy’s Office of Scientific and Technical Information, uses Deep Web Technologies’ federated search to search databases around the world in different languages, depending on the servers involved to return results quickly. "As you have more and more sources, it does not make sense to search every source for every query. We want to use past history and profiles to say, for starters, these 20 sources are most likely to return good information." Lederman did not indicate how results could be sorted, deduped, and presented to the user in a useful way.
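Lederman did not describe how past history would be used to pick sources, so the following is only one plausible sketch. It assumes that each past query is tagged with a topic and that someone or something has judged whether a source returned useful results. The class `SourceSelector` and its methods are hypothetical names.

```python
from collections import defaultdict

class SourceSelector:
    """Rank sources for a topic by how often they returned useful results before."""

    def __init__(self):
        # stats[topic][source] = [queries_sent, queries_with_useful_results]
        self.stats = defaultdict(lambda: defaultdict(lambda: [0, 0]))

    def record(self, topic, source, useful):
        """Log one past query against `source` and whether it proved useful."""
        entry = self.stats[topic][source]
        entry[0] += 1
        if useful:
            entry[1] += 1

    def top_sources(self, topic, k):
        """Return up to `k` sources with the best historical hit rate for `topic`."""
        def hit_rate(item):
            sent, useful = item[1]
            return useful / sent

        # Highest hit rate first; ties are broken alphabetically by source name
        ranked = sorted(self.stats[topic].items(),
                        key=lambda item: (-hit_rate(item), item[0]))
        return [source for source, _ in ranked[:k]]
```

A selector like this narrows a search to, say, the 20 most promising sources. It also carries a risk: a source that has never been useful for a topic is never searched again, even if it later adds relevant content.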
Common names present special search problems. If a user looks for an author with a common last name and no first name or initials, the user may be frustrated. One of the major challenges associated with federated search is name protocols. How does the search handle databases with different protocols? Is the protocol last name, first name? Last name, first initial? Last name, first and middle initials? And do those initials come with periods and spaces, periods and no spaces, or no periods at all? DiStasio talked about how Serials Solutions handles, or does not handle, this problem: "We simply search for a text string in the metadata that is provided by the content providers; if the patron’s entry doesn’t match that of the content provider, they may not find that result." Deep Web provides the federated search for Scitopia [http://www.scitopia.org]. [For more information on Scitopia, see http://newsbreaks.infotoday.com/nbReader.asp?Articleid=39927.] Lederman answered in the context of Scitopia: "We spend a significant amount of effort to get it as close to being right as possible for Scitopia where we had much better access to the scientific societies that are content providers. It is not perfect and is still a challenge. The best we can do is transformation." Valdes-Perez indicated that Vivísimo does some transformation too.
Clearly, the problem has not been solved and can cause a deluge of results for users searching for a common last name only. When many databases use different name protocols, the probability of the user retrieving too much or too little increases significantly.
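The "transformation" that Lederman and Valdes-Perez mention is not described in detail. One simple form it could take is reducing every author name to a surname plus initials before comparing. The function `normalize_author` below is a hypothetical illustration that assumes names arrive as "Last, Given", and it is not either company's method.

```python
import re

def normalize_author(name):
    """Reduce 'Last, First Middle' style names to 'last fm' (surname plus initials)."""
    last, _, given = name.partition(",")
    last = last.strip().lower()
    # Split the given-name portion on periods and whitespace, dropping empty pieces
    parts = [p for p in re.split(r"[.\s]+", given.strip()) if p]
    initials = ""
    for part in parts:
        if part.isupper() and len(part) <= 3:
            initials += part.lower()      # "J" or "JA" is already a run of initials
        else:
            initials += part[0].lower()   # "John" contributes only its first letter
    return f"{last} {initials}".strip()

def same_author(a, b):
    """Treat two name strings as the same author if their normalized forms agree."""
    return normalize_author(a) == normalize_author(b)
```

With this rule, "Smith, John A.", "Smith, J. A.", "Smith, J.A.", and "Smith, JA" all reduce to the same key. The limits show quickly, though. A surname-only query still matches every Smith, a short given name typed in capitals is mistaken for initials, and two different authors who share initials are merged.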
Profiling and alerting services can help repeat users find needed information without having to sift through information they’ve already seen. Most of the federated search developers provide or are developing profiles and alerts. Valdes-Perez said, "End-users can create personal profiles in order to select which sources to include in their federated search result set." Users also can set up alerts by topic, author, or other metadata. Science.gov [http://www.science.gov] uses Deep Web Technologies search, which provides for alerts sent to users weekly.
Profiles may work more effectively in narrowly defined fields when authors publish in a limited set of sources. As profiles broaden, the probability of finding all needed materials lessens. Using a thesaurus for the subject of the profile may help in keeping the numbers of items retrieved to a reasonable level. Searching in a limited set of sources occasionally will result in missing an important item. For example, newspapers such as The Wall Street Journal and The New York Times often publish in-depth articles on science subjects. These articles will not appear in a search or profile confined to science sources.
Presentation
Presentation of results, whether from a search query or profile, is the heart of federated search technology. Vivísimo and Grokker use clustering to reveal underlying content in an organized and visual way. Marcinko prefers the term "lenses" rather than clusters. Lenses can be adjusted to reveal a panorama or zoom in for detail. Vivísimo returns a list of subtopics with no visual display [http://www.clusty.com or http://www.usa.gov] with the option of going deeper. Sources are displayed with the results.
Grokker [http://www.grokker.com] works in a similar way with the additional option of a visual presentation of circles within circles. Some circles lead to layers of narrowing subtopics. The circles and clusters reveal retrieved items in a hierarchical way. Christy Confetti Higgins commented that Sun Microsystems uses Grokker for internal content. She finds that some people prefer lists and some people prefer a visual approach. The preferences may be based on learning styles or the nature of the search. Preferences for different forms of results display confirm the idea that one size does not fit all in information retrieval systems.
Abram and Marcinko indicated that results could be enhanced by the use of taxonomies that present the depth and breadth of a subject in hierarchical order. The taxonomic approach has the advantage of providing a structure for categorizing results and informing the user of how the subject is organized.
Lederman indicated that his company is developing clustering for Scitopia and Science.gov. "We want to incorporate some of our relevance ranking into a cluster so that when you look at the results for a cluster you see the results in ranked order. I want to see if we can display clusters in terms of relevance to the user’s query." Science.gov has a demo that allows users to sort results by relevance, rank, source, date, author, and title. Since all documents in Science.gov are written in English, problems associated with translation do not arise.
Languages present many problems when searching multinational databases. Most users, especially American ones, are not fluent in more than one language. European and Asian users are more likely to be comfortable in several languages. Translation is a special challenge. WorldWide Science uses Deep Web Technologies software and includes databases from France, Germany, Netherlands, Denmark, Japan, India, Brazil, Chile, U.S., and others. To take advantage of these global resources and to disseminate research on a global basis requires translation software. Since each language uses subtle distinctions and is constantly changing, translation software is a major challenge. Lederman said, "Translation will be an interesting challenge. We don’t know how well technology translations are going to work." Clearly, the area will require significant investment, trial and error, expert input, and user patience.
Federated Versus Professional Searches
Federated search systems usually include online help that may or may not prove useful. It is difficult to replace one-on-one help and learning that users can receive from information professionals. Also, many users are not likely to have the knowledge and experience to find appropriate sources and to formulate queries that work. There is a clear need for online help and tutorials that can lead users to more effective skills.
I asked Valdes-Perez about end-user training. "For the end-user, little training is needed to understand how the system works." He added, "The strength of search technology is that users know how to search the web, so they need little training to get started once a single search box is presented to them."
Information professionals who deal with frustrated users retrieving too many or too few results or no answers at all might not agree that users are sufficiently proficient to not need training. The idea that users learn all they need to know about searching from Google, Yahoo!, or MSN is deceptive. Users themselves frequently believe that they can find needed information because they have been successful with Google. Abram opined that federated search should not try to compete with Google. Federated search should aim to go broader and deeper. A federated search engine may produce better results in an environment in which materials are related to a particular subject area rather than one in which databases cover hundreds of subjects in many disciplines.
Library OPACs integrated into federated search systems present special problems. Library catalogs are inventory systems that track a stock of materials, loans, and returns. While describing an object and containing some subject headings, OPACs are usually not designed for general information discovery. In order for the library catalog to become a useful information discovery tool, its metadata needs to be enhanced. Libraries may find it useful to retain the inventory in MARC records and create a derived and enhanced catalog database for searching with other databases.
Getting It Right
The Intel corporate library uses Deep Web Technologies’ search for its library systems. Barclay Hill, Intel librarian, stated, "Employees do not wish to learn a licensed product’s tool to retrieve information from its contents. They want a single search interface with familiar options to search and retrieve information with the least amount of effort possible."
The Intel project began in 2003 and completed implementation in 2006. "A design goal was set that no training or support would be necessary for employees to use the upgraded search solution." The team also assumed that employees would search all sources when they searched. After the rollout, the library did not receive any requests for user training or support.
Hill concluded, "We have a comprehensive and user friendly search solution that spans our external licensed information, internal managed information, and internet information sources." The Intel library is an example of an application in which federated search works well. Sources are limited to a small set of external databases and internal materials. Users are not forced into searching nonrelevant sources and have the advantage of searching internal and external sources with a single query.
Federated search systems could be improved with central indexing, taxonomies, visual as well as text representations of results, foreign language translation, and user training. The focus needs to be on users and how they work in their environment. Federated search systems will not satisfy every need, but tailoring the system to users can enhance satisfaction. Information professionals need to ensure that their client users are aware of sources and techniques that will make their jobs easier. Making a search system easy and effective for users requires a huge amount of behind-the-scenes work and cost. My conversations with developers and service providers convinced me that they are working hard to make their systems better.
Traditional Federation of Sorts
Dow Jones Factiva and EBSCOhost have database systems that also merge different sets of databases in a technical form competitive with federation.
The Factiva Publications Library database merges thousands of journals supplied by numerous publishers and database aggregators. The collection is centrally indexed with a high percentage of its content resident on central servers rather than following a federated approach that links data on diverse sites such as Scitopia.org’s network of scholarly publishers managed by Deep Web Technologies. Karin Borchert, vice president of global content and customer operations, Dow Jones Enterprise Media Group, said, "Seventy percent of Factiva content is indexed by auto coding and 30% is indexed manually. Nearly a half-million codes covering companies, industries, regions, subjects and other topics are applied to content."
The Factiva database presents search options tailored to the needs of business users who may or may not elect to choose items, such as company, type of publication, industry, region, subject, language, and date limits. These options are useful and produce more precise results. Factiva licenses a search engine from Fast Technologies, Inc. [http://www.fast.com.tw] that is modified for Factiva’s applications. In addition to searching content on central servers, the search engine crawls the web for relevant material, indexes the content, and returns it to the central database. Borchert commented that Factiva’s business customers need access to new sources, such as blogs and social media.
EBSCOhost includes many databases covering diverse fields. Customers include academe, schools, and public libraries as well as business and government. Customers may search multiple databases with one search query on EBSCO. Its search engine was developed in-house. Each product has its own index and metadata. Michael Gorrell, senior vice president and chief information officer, EBSCO Publishing, described the system: "Each product has its own set of files. We replicate those files multiple times across our server farms in multiple data centers." When multiple databases are searched, the front end federates the request and sends it to each search index. The results are merged, deduped, and sorted according to customer preference. EBSCO’s opening page offers the user basic or advanced search, publication type, and date limits. The advanced search options are library type choices, such as author, title, subject, abstract, publishers, etc.
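The merge, dedupe, and sort step can be sketched in a few lines. This is a generic illustration and not EBSCO's code. The record fields (`title`, `date`, `db`) and the rule that two records are duplicates when their title and date match are assumptions made for the example.

```python
def merge_results(result_lists, sort_key="date", descending=True):
    """Merge per-database result lists, drop duplicates, and sort by one field."""
    seen = set()
    merged = []
    for results in result_lists:
        for record in results:
            # Assumed duplicate rule: same title (ignoring case and spacing) and date
            key = (record["title"].strip().lower(), record["date"])
            if key in seen:
                continue  # keep only the first copy encountered
            seen.add(key)
            merged.append(record)
    return sorted(merged, key=lambda r: r[sort_key], reverse=descending)

# Hypothetical results returned by two separate search indexes
business = [
    {"title": "Search Trends", "date": "2008-03-01", "db": "business"},
    {"title": "Market Report", "date": "2008-01-15", "db": "business"},
]
academic = [
    {"title": "search trends ", "date": "2008-03-01", "db": "academic"},
    {"title": "Library Catalogs", "date": "2007-11-20", "db": "academic"},
]

newest_first = merge_results([business, academic])
by_title = merge_results([business, academic], sort_key="title", descending=False)
```

The duplicate rule carries most of the difficulty. Matching on title and date is easy when one organization controls every index, as in the centrally managed systems described here. It is much harder when each source formats its metadata differently.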
About the Author
Miriam Drake is Professor Emerita, Georgia Institute of Technology and a frequent writer for Information Today's publications.