Information retrieval in the WWW

Michel Beigbeder
Bich-Liên Doan
Jean-Jacques Girardot
Philippe Jaillon
Claudette Sayettat

1. Introduction

Information Retrieval (IR), see van Rijsbergen [1] and Salton [2], has been a research area since the seventies. The IR issue is to find all the relevant available information on a given topic, and nothing else. So, the aim is to have both high recall (all relevant documents are retrieved) and high precision (no irrelevant document is retrieved).

Nowadays, the quantity of information available on line is growing rapidly through the use of computer networks. Moreover, very different kinds of information exist, such as full-text documents, multimedia documents, and computer programs.

We first present the IR problem in a centralized context, then we describe the tools available in the WWW. Lastly, we suggest some research directions for collaborative work in a distributed environment using IR techniques.

2. IR model

The Information Retrieval (IR) research area is based on the model presented in the following figure. The elements of this model are present whether the IR problem concerns a traditional library, a CD-ROM catalogue of scientific articles, or the WWW.

[Figure: IR model]

The corpus is a set of documents. This corpus is indexed, and the results of the indexing are stored in a database (the indices). The user formulates a query. The correspondence function matches the indexed documents against the query. IR techniques use logical models to build the correspondence function. The first models used were the boolean model and the vector model, but these models were too rigid to be relevant enough. They have since been improved, and different techniques such as fuzzy modal logic or conceptual graphs try to decrease noise (irrelevant documents retrieved) and silence (relevant documents missed).
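The vector model mentioned above can be sketched in a few lines: each document and the query become term-frequency vectors, and the correspondence function ranks documents by the cosine of the angle between their vector and the query vector. This is a minimal illustration with a toy corpus, not a production indexer.

```python
import math
from collections import Counter

def vectorize(text):
    """Term-frequency vector of a whitespace-tokenized text."""
    return Counter(text.lower().split())

def cosine(q, d):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(q[t] * d[t] for t in q)
    nq = math.sqrt(sum(v * v for v in q.values()))
    nd = math.sqrt(sum(v * v for v in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

corpus = [
    "information retrieval in the web",
    "retrieval of multimedia documents",
    "computer programs and networks",
]
index = [vectorize(doc) for doc in corpus]   # the "indices" database
query = vectorize("information retrieval")

# The correspondence function: rank documents by decreasing similarity.
ranking = sorted(range(len(corpus)),
                 key=lambda i: cosine(query, index[i]), reverse=True)
```

Here the first document ranks highest because it shares both query terms, while the third shares none; the rigidity the text mentions shows up in the fact that only exact term matches contribute to the score.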

3. Existing tools

The World Wide Web (WWW) is a common way to browse information. But it is not very easy to find information on a given topic, and much more difficult to find all the available information on that topic. So the user needs tools to gather and filter the relevant information. A lot of tools are now available to search for information in the WWW. We can study these tools according to some criteria which seem important to us: querying, indexing, and cooperation. We have broadly established two categories, according to their centralized or distributed nature.

The first category concerns tools which are centralized, since they concentrate the information in one database. Some of them use robots to generate indices of (a part of) the WWW, for instance Lycos, which is one of the largest, but there are challengers, such as WebCrawler. Conversely, in Aliweb, the administrator of a server registers it in the database. Some servers are organized by subject: Yahoo or EInet Galaxy. These databases are then queried with keywords and boolean operators, or with other techniques using approximate matching, adjacency, or proximity operators.
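The keyword-and-boolean querying used by these centralized databases can be sketched with an inverted index: each term maps to the set of pages containing it, and boolean operators become set operations. The page contents below are illustrative, not real index data.

```python
from collections import defaultdict

# Toy centralized database: a few pages identified by an id.
pages = {
    1: "web crawler robot index",
    2: "yahoo subject directory",
    3: "robot index of the web",
}

# Inverted index: keyword -> set of page ids containing it.
inverted = defaultdict(set)
for pid, text in pages.items():
    for term in text.split():
        inverted[term].add(pid)

def query_and(*terms):
    """Pages containing every term (boolean AND)."""
    sets = [inverted[t] for t in terms]
    return set.intersection(*sets) if sets else set()

def query_or(*terms):
    """Pages containing at least one term (boolean OR)."""
    return set.union(*(inverted[t] for t in terms)) if terms else set()
```

Adjacency and proximity operators would additionally require storing term positions within each page, which is why they are costlier than plain boolean matching.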

This very simple architecture will not be usable in the near future because of the growth of the WWW, and specifically because of the scalability problem. For example, Alta Vista indexes 22 million Web pages today, and this index represents 33 GB, but how many tomorrow?

The second category of tools tries to address this issue; they are derived from experience with distributed directories such as WhoIs++, SOLO, or X.500. Harvest, developed at the University of Arizona, is one of them. Basically, it provides two tools: a gatherer to perform indexing, and a broker to query the database generated by gatherers or by other brokers. Harvest has other interesting capabilities, such as caching and replication.
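The gatherer/broker division of labor can be sketched as follows: each gatherer builds an index of its own set of documents, and a broker answers a query by consulting several sources, which may themselves be gatherers or other brokers. This is a hypothetical structural sketch, not Harvest's actual protocol or data formats.

```python
class Gatherer:
    """Indexes the documents of one site into keyword -> doc-id sets."""
    def __init__(self, docs):
        self.index = {}
        for doc_id, text in docs.items():
            for term in text.split():
                self.index.setdefault(term, set()).add(doc_id)

    def lookup(self, term):
        return self.index.get(term, set())

class Broker:
    """Answers queries by merging answers from gatherers and/or brokers."""
    def __init__(self, sources):
        self.sources = sources   # each source offers the same lookup() interface

    def lookup(self, term):
        hits = set()
        for src in self.sources:
            hits |= src.lookup(term)
        return hits

g1 = Gatherer({"a1": "harvest broker design", "a2": "cache replication"})
g2 = Gatherer({"b1": "broker hierarchy", "b2": "distributed directories"})
top = Broker([Broker([g1]), g2])   # brokers can be composed hierarchically
```

Because gatherers and brokers share the same query interface, brokers can be stacked into a hierarchy, which is what makes the scheme scale better than a single central index.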

[Figure: Harvest overview]

Besides these aspects, all these search tools have defined their own rules for indexing, for the storage of document descriptors, for the formulation of queries, and for the sorting of results.

4. Perspectives

Through this very quick survey of IR and of the existing tools, we can see that different problems must be addressed in order to build the next generation of search tools for the WWW. Some of these problems are typically addressed by the IR domain, and others are new because of the distribution of the information and the evolving nature of the WWW.

As a first step, we can identify the following problems:

Emphasizing the need to retrieve relevant information, IR techniques pay attention to the indexing language and to the user query interface. New research directions toward AI techniques are now being explored, such as fuzzy modal logic and language translation, and some user interface systems based on learning techniques, relevance feedback, or preference properties focus on the needs and the behavior of the user.
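One classic instance of the relevance feedback mentioned above is Rocchio's method: after the user marks some retrieved documents as relevant or not, the query vector is shifted toward the centroid of the relevant documents and away from the non-relevant ones. The sketch below assumes term-weight vectors represented as dicts; the mixing coefficients are the commonly quoted illustrative values, not prescribed by the text.

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio relevance feedback on dict-based term-weight vectors.

    new_q = alpha*query + beta*centroid(relevant) - gamma*centroid(nonrelevant)
    """
    terms = set(query)
    for d in relevant + nonrelevant:
        terms |= set(d)
    new_q = {}
    for t in terms:
        w = alpha * query.get(t, 0.0)
        if relevant:
            w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if nonrelevant:
            w -= gamma * sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant)
        new_q[t] = max(w, 0.0)   # negative weights are usually clipped to zero
    return new_q
```

For example, starting from the one-term query {"retrieval": 1.0} and a single relevant document {"retrieval": 1.0, "web": 1.0}, the new query acquires the term "web", so the next search round can retrieve documents the original query would have missed.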

To face the new problems raised by a distributed and unstructured environment like the WWW, a first attempt such as Harvest defines a distributed structure that supplies index servers and enables users to organize and customize these index servers hierarchically. However, Harvest lacks flexibility, in the sense that all gatherers and brokers are described in an administration registry. Harvest does not use its gatherers and brokers in a cooperative manner. What is missing is a conceptual framework that allows all these tools to cooperate.

We think that in the near future, search tools will have to be built over some distributed architecture of components. Within this architecture, we will have to be able to use the (best) components of the traditional IR domain for indexing, user interfaces, query languages, and so on. Some experiments using agents on the WWW exist, for instance [3]. Here we propose to define a multi-agent IR architecture, using agents as IR components which cooperate in an intelligent manner.

5. References

1. C. J. van Rijsbergen, Information Retrieval, second edition, Butterworths (1979).

2. G. Salton, Automatic Information Organization and Retrieval, McGraw-Hill, New York (1968).

3. J. Davies, R. Weeks, M. Revett, Jasper: Communicating Information Agents for WWW, in Fourth International WWW Conference, WWW4, Boston (1995).