mbeig@emse.fr
doan@emse.fr
girardot@emse.fr
jaillon@emse.fr
sayettat@emse.fr
The quantity of information available on-line is growing rapidly through the use of computer networks. Moreover, this information takes very different forms, such as full documents, multimedia documents, and computer programs.
We first present the information retrieval (IR) problem in a centralized context, then survey the tools available on the WWW. Lastly, we suggest some research directions for collaborative work in a distributed environment using IR techniques.
The corpus is a set of documents. It is indexed, and the results of the indexing are stored in a database (the indices). The user formulates a query. The correspondence function matches the indexed documents against the query. IR techniques use logic models to build the correspondence function. The first models used were the Boolean model and the vector model, but these models were too rigid to give sufficiently relevant results. They have since been improved, and different techniques such as fuzzy modal logic or conceptual graphs try to decrease noise (irrelevant documents retrieved) and silence (relevant documents missed).
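The pipeline described above (corpus, indexing, query, correspondence function) can be illustrated with a minimal sketch. The corpus, the function names, and the use of raw term frequencies are assumptions made for illustration; real systems add stemming, stop words, and term weighting.

```python
import math
from collections import Counter, defaultdict

# Toy corpus: hypothetical documents used only for illustration.
CORPUS = {
    "d1": "information retrieval on computer networks",
    "d2": "multimedia documents and computer programs",
    "d3": "distributed information retrieval with agents",
}

def tokenize(text):
    """Split text into lowercase terms (no stemming or stop words)."""
    return text.lower().split()

def build_index(corpus):
    """Indexing step: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in corpus.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index

def boolean_and(index, query):
    """Boolean model: the documents that contain every query term."""
    postings = [set(index.get(term, {})) for term in tokenize(query)]
    return set.intersection(*postings) if postings else set()

def cosine_rank(index, query):
    """Vector model: rank documents by cosine similarity on raw term frequencies."""
    # Squared length of each document vector, accumulated from the index.
    doc_norms = defaultdict(float)
    for postings in index.values():
        for doc_id, tf in postings.items():
            doc_norms[doc_id] += tf * tf
    q = Counter(tokenize(query))
    q_norm = math.sqrt(sum(v * v for v in q.values()))
    # Dot product between the query vector and each matching document.
    dots = defaultdict(float)
    for term, q_tf in q.items():
        for doc_id, tf in index.get(term, {}).items():
            dots[doc_id] += q_tf * tf
    scores = {d: dot / (q_norm * math.sqrt(doc_norms[d])) for d, dot in dots.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

INDEX = build_index(CORPUS)
print(boolean_and(INDEX, "information retrieval"))
print(cosine_rank(INDEX, "computer programs"))
```

The Boolean function returns an unordered set with no notion of degree of relevance, which is the rigidity mentioned above; the vector function returns a ranked list in which partial matches still appear.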
The first category comprises centralized tools, which concentrate the information in one database. Some of them use robots to generate indices of (a part of) the WWW, for instance Lycos, which is one of the largest, though there are challengers such as WebCrawler. Conversely, in Aliweb, the administrator of a server registers it in the database. Some servers are organized by subject, such as Yahoo or EInet Galaxy. These databases are then queried with keywords and Boolean operators, or with other techniques using approximate matching, adjacency, or proximity operators.
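As an illustration of the adjacency and proximity operators mentioned above, here is a minimal sketch based on a positional index. The page identifiers, page texts, and the `near` function are hypothetical; they do not reproduce the query syntax of any of the tools named.

```python
from collections import defaultdict

# Hypothetical pages, standing in for what a robot might have fetched.
PAGES = {
    "p1": "distributed information retrieval tools",
    "p2": "retrieval of multimedia information",
}

def build_positional_index(pages):
    """Map each term to {page_id: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for page_id, text in pages.items():
        for position, term in enumerate(text.lower().split()):
            index[term][page_id].append(position)
    return index

def near(index, term_a, term_b, max_distance=1):
    """Pages where both terms occur within max_distance words of each other.

    max_distance=1 behaves like an adjacency operator (in either order);
    larger values give a looser proximity operator.
    """
    pages_a = index.get(term_a, {})
    pages_b = index.get(term_b, {})
    hits = set()
    for page_id in pages_a.keys() & pages_b.keys():
        if any(abs(pa - pb) <= max_distance
               for pa in pages_a[page_id]
               for pb in pages_b[page_id]):
            hits.add(page_id)
    return hits

POSITIONAL_INDEX = build_positional_index(PAGES)
ADJACENT = near(POSITIONAL_INDEX, "information", "retrieval")
WITHIN_THREE = near(POSITIONAL_INDEX, "information", "retrieval", max_distance=3)
```

With these two pages, the adjacency query matches only `p1`, while the looser proximity query matches both.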
This very simple architecture will not remain usable in the near future because of the growth of the WWW, and specifically the scalability problem it raises. For example, Alta Vista indexes 22 million Web pages today, and its index occupies 33 GB, but how many will there be tomorrow?
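The figures above imply an average index cost per page, which the following sketch works out. It assumes decimal gigabytes (1 GB = 10^9 bytes) and, for the projection, that index size grows linearly with the number of pages; the projected page count is a hypothetical input, not a forecast.

```python
# Figures quoted in the text.
INDEXED_PAGES = 22 * 10**6
INDEX_SIZE_BYTES = 33 * 10**9  # assumption: 1 GB = 10**9 bytes

# Average index space consumed per indexed page.
BYTES_PER_PAGE = INDEX_SIZE_BYTES / INDEXED_PAGES

def projected_index_gb(pages, bytes_per_page=BYTES_PER_PAGE):
    """Index size in GB for a given page count, assuming linear growth."""
    return pages * bytes_per_page / 10**9

print(BYTES_PER_PAGE)
print(projected_index_gb(2 * INDEXED_PAGES))  # hypothetical doubling of the Web
```

Under these assumptions the index costs about 1.5 KB per page, so a single centralized database has to grow in step with the Web itself.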
The second category of tools tries to address this issue; these tools are derived from experience with distributed directories such as WhoIs++, SOLO, or X.500. Harvest, developed at the University of Arizona, is one of them. Basically, it provides two tools: a gatherer, which performs the indexing, and a broker, which queries the database generated by gatherers or by other brokers. Harvest has other interesting capabilities, such as caching and replication.
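The gatherer/broker division of labour can be sketched as follows. This is only an illustration of the idea described above: the class names, the `export` method, and the example URLs are assumptions and do not correspond to Harvest's actual interfaces or data formats.

```python
from collections import defaultdict

class Gatherer:
    """Indexes the documents of one site (hypothetical interface)."""

    def __init__(self, documents):
        self.documents = documents  # {url: text}

    def export(self):
        """Return the site's index as {term: set of URLs}."""
        index = defaultdict(set)
        for url, text in self.documents.items():
            for term in text.lower().split():
                index[term].add(url)
        return dict(index)

class Broker:
    """Merges the indices of gatherers or other brokers and answers queries."""

    def __init__(self, sources):
        self.sources = sources  # anything with an export() method
        self.index = defaultdict(set)

    def collect(self):
        """Rebuild the merged index from every source."""
        self.index.clear()
        for source in self.sources:
            for term, urls in source.export().items():
                self.index[term] |= urls

    def export(self):
        """Let a higher-level broker reuse this broker's merged index."""
        self.collect()
        return {term: set(urls) for term, urls in self.index.items()}

    def query(self, term):
        """Return the sorted URLs whose documents contain the term."""
        return sorted(self.index.get(term.lower(), set()))

# Two sites, one regional broker, and one top-level broker (all hypothetical).
site_a = Gatherer({"http://a.example/1": "information retrieval"})
site_b = Gatherer({"http://b.example/1": "distributed retrieval"})
regional = Broker([site_a])
top = Broker([regional, site_b])
top.collect()
RESULT = top.query("retrieval")
```

Because a broker exposes the same `export` method as a gatherer, brokers can be stacked, which is what lets the index be spread over several servers instead of one.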
Besides these aspects, each of these search tools has defined its own rules for indexing, storing document descriptors, formulating queries, sorting results, and so on.
As a first step, we can note the following problems:
Because they emphasize the need to retrieve relevant information, IR techniques pay particular attention to the indexing language and to the user query interface. New research directions drawing on AI techniques are now being explored, such as the use of fuzzy modal logic and language translators, and some user interface systems with learning techniques, relevance feedback, or preference properties focus on the needs and behavior of the user.
To face the new problems raised by a distributed and unstructured environment like the WWW, a first attempt, such as Harvest, is to define a distributed structure for supplying index servers and to let users organize these index servers hierarchically and customize them. However, Harvest lacks flexibility in the sense that all gatherers and brokers are described in an administration registry. Harvest does not use its gatherers and brokers in a cooperative manner. What is missing is a conceptual framework that allows all these tools to cooperate.
We think that in the near future, search tools will have to be built on a distributed architecture of components. Within this architecture, we must be able to use the (best) components of the traditional IR domain for indexing, user interfaces, query languages, and so on. Some experiments using agents on the WWW exist, for instance [3]. Here we propose to define a multi-agent IR architecture in which agents act as IR components and cooperate in an intelligent manner.
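To make the proposal more concrete, here is one possible sketch of agents acting as IR components. Everything in it is an assumption made for illustration (the `SearchAgent` and `Facilitator` classes, the capability names, the example URLs); the text above does not specify the architecture at this level of detail.

```python
class SearchAgent:
    """Hypothetical IR component: answers keyword queries over its own documents."""

    capability = "search"

    def __init__(self, name, documents):
        self.name = name
        self.documents = documents  # {url: text}

    def handle(self, query):
        """Return the URLs of documents that contain every query term."""
        terms = query.lower().split()
        return {url for url, text in self.documents.items()
                if all(term in text.lower().split() for term in terms)}

class Facilitator:
    """Directory in which agents advertise their capabilities at run time."""

    def __init__(self):
        self.directory = {}  # capability -> list of agents

    def register(self, agent):
        """Called by an agent to announce itself; no static registry file."""
        self.directory.setdefault(agent.capability, []).append(agent)

    def ask(self, capability, request):
        """Forward the request to every agent offering the capability and merge replies."""
        results = set()
        for agent in self.directory.get(capability, []):
            results |= agent.handle(request)
        return sorted(results)

facilitator = Facilitator()
facilitator.register(SearchAgent("a", {"http://a.example/ir": "information retrieval agents"}))
facilitator.register(SearchAgent("b", {"http://b.example/www": "agents on the www",
                                       "http://b.example/mm": "multimedia documents"}))
ANSWER = facilitator.ask("search", "agents")
```

The design choice illustrated here is that components join by registering a capability at run time, rather than being listed in advance in an administration registry.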
1. C. J. van Rijsbergen, Information Retrieval, second edition, Butterworths (1979).
2. G. Salton, Automatic Information Organization and Retrieval, McGraw-Hill, New York (1968).
3. J. Davies, R. Weeks, M. Revett, Jasper: Communicating Information Agents for WWW, in Fourth International WWW Conference, WWW4, Boston (1995).