Digital Libraries and Data Warehouses
What is a Digital Library
The term Digital Library (DL) is a metaphor for access to collections of electronic documents through a network. The classic research area dealing with the electronic search for documents is Information Retrieval. During the last 30 years the focus of this discipline has evolved from electronic catalogs to the management of full-text and multimedia documents. So, at first glance, Digital Libraries and Information Retrieval are not that different.
The Challenge of Digital Libraries
The challenge of Digital Library research is the framework in which it evolves: Information Retrieval has to leave the controlled and uniform conditions of professional information providers.
It is confronted with a vast variety of
- indexing strategies,
- and query mechanisms.
One of the main topics in DL research will be to cope with this heterogeneity. ‘Multimedia Information Retrieval Dialog Techniques’, the Information Retrieval department of the GMD Institute for Integrated Publication and Information Systems, is working on several aspects of this problem.
Document Retrieval Systems
- Most present-day Document Retrieval Systems employ specialized personnel to index documents. However, many providers of repositories for electronic documents cannot afford this successful but expensive method.
- An alternative is the use of automated indexing tools for full text documents. They are faster and less expensive and they can take advantage of structural information (like HTML or SGML tags) included in electronic documents. In addition they can incorporate specific views of specific users.
- The GMD Institute for Integrated Publication and Information Systems is developing an automated indexing system for multimedia documents based on a Bayesian inference network. Such a network uses probabilistic estimates along multiple paths of evidence.
In this way, partial evidence from various sources can be combined into an overall estimate of a document’s relevance to a specific query. In addition, our system includes a set of rules that can be activated to detect a specific kind of occurrence of an index term, or a specific feature in a digitized image, and consequently assign an appropriate indexing concept.
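The combination of partial evidence can be illustrated with a small sketch. It assumes a noisy-OR combination, which is one common way to merge independent evidence paths in an inference network; the text does not say which combination function the GMD system uses, and the evidence names and probabilities below are invented for illustration.

```python
from math import prod

def combine_evidence(beliefs):
    """Noisy-OR: probability that at least one evidence path supports relevance.

    `beliefs` holds one probability in [0, 1] per evidence path.
    This is an assumed combination rule, not the documented one.
    """
    beliefs = list(beliefs)
    if not all(0.0 <= b <= 1.0 for b in beliefs):
        raise ValueError("each belief must lie in [0, 1]")
    return 1.0 - prod(1.0 - b for b in beliefs)

# Hypothetical evidence for one document and one query:
# a match in the title, a match in the body text, and an image feature.
evidence = {"title_term": 0.6, "body_term": 0.3, "image_feature": 0.2}
score = combine_evidence(evidence.values())
```

Each additional path can only raise the combined estimate, so weak evidence from several sources can still add up to a substantial relevance score.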
Query Interpretation and Expansion
A good query has to be general enough to cover all relevant documents and specific enough to select only relevant ones. To achieve a high specificity we use the rules defined for the indexing tool and, in addition, a set of domain specific rules.
- These rules are managed by an abductive system, i.e., a system for hypothetical reasoning. It constructs the possible interpretations of query terms, which correspond to alternative paths in the inference network, and negotiates them with the user.
- In this way the user is able to select his/her intended interpretation of an unstructured query. Another way to enhance a query is to add related terms either as a substitution for or as an addition to existing query terms. Such related terms can be synonyms, associatively related terms, more general or more specific terms.
To find such related terms we use co-occurrence analysis based on large corpora of documents from the respective domain. This corpus-based method allows the fast creation of associative thesauri that are specific to a given domain and time.
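A minimal sketch of corpus-based co-occurrence analysis follows. It assumes the simplest possible measure, namely counting how often two terms appear in the same document; a real associative thesaurus would normally use a normalized association measure and a far larger corpus. The function names and the toy corpus are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents):
    """Count how often two distinct terms appear in the same document."""
    pair_counts = Counter()
    for text in documents:
        terms = sorted(set(text.lower().split()))
        pair_counts.update(combinations(terms, 2))
    return pair_counts

def related_terms(term, pair_counts, top_n=3):
    """Return the terms that co-occur most often with `term`."""
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a == term:
            scores[b] += n
        elif b == term:
            scores[a] += n
    # Sort by count (descending), then alphabetically so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:top_n]]

# Toy corpus standing in for a large domain-specific collection.
corpus = [
    "digital library retrieval",
    "digital library indexing",
    "library catalog",
    "warehouse staging",
]
counts = cooccurrence_counts(corpus)
expanded = related_terms("library", counts, top_n=2)
```

The returned terms could then be offered to the user as additions to, or substitutes for, the original query term.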
Most retrieval sessions consist of a series of searches each based on the results of previous attempts. During this interaction the user elaborates his/her query. For many inexperienced users the cognitive load of managing the search and scanning the documents found is very high.
- In a Digital Library environment the situation will be worse due to the heterogeneity of the various systems and servers.
- To help users in this situation we are developing a Dialog Management System that keeps track of the interaction; it is able to offer context-specific interpretations of user actions and to propose further steps in a context-sensitive way. The system is based on the linguistic dialog model COR (Conversational Roles) and on generic strategies for typical retrieval situations.
A special problem of Digital Libraries is the selection of appropriate servers for a given query. This is a kind of ‘Meta Search’. Within the ERCIM project on Digital Libraries we plan to develop a system in which the remote servers are described in an appropriate model of servers.
This model will include static descriptions like
- retrieval engines available,
- format of queries,
- domain of the server,
- average load,
- network bandwidths, etc.
In addition, we plan to use knowledge discovery methods to obtain information about a server: samples of its documents will be analyzed with the indexing tool to characterize the server. These samples can be drawn either at random or as the response to a broad query characterizing a domain.
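The static part of such a server model might be sketched as follows. The field names mirror the list above, but the class, the selection rule (domain match, load threshold, then lightest load first) and all example values are assumptions made for illustration; the planned ERCIM system is not specified in this detail.

```python
from dataclasses import dataclass

@dataclass
class ServerDescription:
    """Static description of one remote server (illustrative fields only)."""
    name: str
    retrieval_engines: list
    query_format: str
    domains: set
    average_load: float      # 0.0 (idle) to 1.0 (saturated)
    bandwidth_mbps: float

def select_servers(servers, query_domain, max_load=0.8):
    """Keep servers that cover the domain and are not overloaded.

    Results are ordered by load (ascending), then bandwidth (descending).
    """
    candidates = [
        s for s in servers
        if query_domain in s.domains and s.average_load <= max_load
    ]
    return sorted(candidates, key=lambda s: (s.average_load, -s.bandwidth_mbps))

# Hypothetical servers.
servers = [
    ServerDescription("gamma", ["probabilistic"], "keyword", {"medicine"}, 0.5, 2.0),
    ServerDescription("beta", ["boolean"], "boolean", {"medicine", "physics"}, 0.9, 10.0),
    ServerDescription("alpha", ["boolean"], "boolean", {"medicine"}, 0.2, 10.0),
    ServerDescription("delta", ["probabilistic"], "keyword", {"physics"}, 0.1, 5.0),
]
chosen = [s.name for s in select_servers(servers, "medicine")]
```

Knowledge discovered from document samples could later refine the `domains` field instead of relying on a hand-written description.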
In 1990, Inmon coined the term data warehouse. Corporations gave foremost importance to application-oriented database systems, including spatial, active and scientific databases, knowledge bases, and office information bases.
Heterogeneous database systems
- Heterogeneous database systems and Internet-based global information systems play a vital role and make a huge number of databases and information repositories available for transaction management, information retrieval and data analysis.
- This technology led corporations to recognize that identifying, storing, managing and retrieving terabytes of information requires a new model called data warehousing. A data warehouse is a centralized and integrated database for using huge data repositories effectively.
- As a result, applying statistical techniques to a data warehouse provides a multidimensional view for analyzing corporate decision making.
- Digital libraries virtually collect information from different sources, integrate it, and give access to knowledge without geographical boundaries.
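The multidimensional view mentioned above can be illustrated with a small roll-up over fact rows. This is a sketch under assumed data: the dimensions (branch, year, format), the measure (loans) and the function name are hypothetical and serve only to show how the same granular data can be summarized along different dimensions.

```python
from collections import defaultdict

def roll_up(facts, dimensions, measure):
    """Sum `measure` over all fact rows, grouped by the chosen dimensions."""
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)

# Hypothetical circulation facts held in the warehouse.
facts = [
    {"branch": "north", "year": 2001, "format": "book", "loans": 120},
    {"branch": "north", "year": 2001, "format": "video", "loans": 30},
    {"branch": "south", "year": 2001, "format": "book", "loans": 80},
    {"branch": "north", "year": 2002, "format": "book", "loans": 150},
]
by_branch = roll_up(facts, ["branch"], "loans")
by_branch_year = roll_up(facts, ["branch", "year"], "loans")
```

Choosing a different list of dimensions gives a different slice of the same data without changing the stored facts.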
The major components of the architecture are:
- Staging area
- Operational data store (ODS)
- Web environment
- Archived/back up data
- Cross media storage manager
- Alternative storage
- Data mart
- User information application
The capture manager is the system component that performs all the operations necessary to support the extracting, cleansing and loading process. Staging areas are needed only where there is a large amount of data to be processed. The staging area mainly prepares data for entry into the ETL (Extraction, Transformation and Loading) process.
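The extract, cleanse and load steps can be sketched as three small functions. The record layout ("title;year" lines), the cleansing rules and the in-memory list standing in for the warehouse are all assumptions chosen to keep the example self-contained; they do not describe any particular capture manager.

```python
def extract(raw_lines):
    """Extract: split each raw 'title;year' line into fields."""
    return [line.split(";") for line in raw_lines]

def cleanse(records):
    """Transform: trim whitespace, drop malformed rows, convert the year."""
    clean = []
    for fields in records:
        if len(fields) != 2:
            continue  # wrong number of fields
        title, year = fields[0].strip(), fields[1].strip()
        if title and year.isdigit():
            clean.append({"title": title, "year": int(year)})
    return clean

def load(records, warehouse):
    """Load: append cleansed records to the target store."""
    warehouse.extend(records)
    return len(records)

# Hypothetical staging area holding raw input, including two bad rows.
staging_area = [" Annual Report ;1998", "broken line", "Catalogue; 2001 ", ";1999"]
warehouse = []
loaded = load(cleanse(extract(staging_area)), warehouse)
```

Keeping the three steps separate makes it easy to inspect the staged data between steps before anything reaches the warehouse.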
Digitization is the process of converting any physical or analogue item into a digital representation or facsimile. Materials in many different formats are digitized, such as:
- Bound volumes, both print and manuscript
- Individual documents
- Photographs, both prints and transparencies
- Video and audio
- Maps, drawings and other large-format paper items
- Art works
- Physical three-dimensional (3-D) objects
The warehouse manager is the system component that performs all the operations necessary to support warehouse management. The data found in the data warehouse can be restructured in various ways, because different people need to view the granular data in different ways. Designed properly, the data warehouse can accommodate all the ways in which data can be accessed.
A data mart is the place where a department keeps its own data, mainly for decision support. Each data mart is unique to a department, such as:
- Acquisition: billing, ordering
- Technical processing: cataloguing, classification
- Circulation: remote login, status check, OPAC access, user requests
- Reference: e-mail reference, real-time reference, commercial web-based reference
- Current awareness service: recent additions
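As a rough illustration, a departmental data mart can be thought of as a filtered, narrowed copy of warehouse rows. The sketch below assumes each warehouse row carries a `department` field; that field, the column names and the sample rows are hypothetical.

```python
def build_data_mart(warehouse_rows, department, columns):
    """Copy one department's rows, keeping only the columns it needs."""
    return [
        {c: row[c] for c in columns}
        for row in warehouse_rows
        if row["department"] == department
    ]

# Hypothetical granular rows in the warehouse.
warehouse_rows = [
    {"department": "acquisition", "item": "A-17", "event": "ordering", "cost": 42.0},
    {"department": "circulation", "item": "A-17", "event": "status check", "cost": 0.0},
    {"department": "acquisition", "item": "B-03", "event": "billing", "cost": 15.5},
]
acquisition_mart = build_data_mart(warehouse_rows, "acquisition", ["item", "event", "cost"])
```

Each department receives only the rows and columns relevant to its own decisions, while the warehouse keeps the full granular data.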