Center for Research in Geomatics
D-Lib Magazine, March 1997
The heterogeneity of Georeferenced Digital Libraries (GDL) is a serious problem for researchers who need to query several GDLs to find the best spatial data available for their projects. Differences in content, standards, user interfaces, semantics, database structure, etc. are the rule on the Internet. In addition, users have no tool to help them select the best sources of spatial data once they have clearly defined their needs and have found several potential sources in the GDLs.
Data warehousing technology, coupled with a data transformation / integration tool plus a data selection layer running on the Internet, appears to be a promising solution for such problems of heterogeneity and best selection. Data warehousing is used to integrate and replicate data subsets coming from heterogeneous legacy systems and to create new sets of summarized data to support management decisions. When the data warehouse supports an application to find the best source of data for a given demand, then we have a new and promising solution. Such a concept, called "System for the Optimized Selection of Spatial Data" (SOS-SD), is being developed as an M. Sc. research project, at Laval University.
Georeferenced Digital Libraries, metadata, spatial data warehouse, spatial data integration.
People are increasingly using the Internet to acquire or distribute spatial data. However, distributing and browsing spatial data require more than just transferring files by FTP or browsing maps through the WWW. An infrastructure has to be established to allow users to easily find and analyze the data covering a given territory. These operations are done in a Georeferenced Digital Library (cf. Proulx et al. for more details on GDL [1]). Unfortunately, Internet GDLs differ significantly from each other in terms of content, standards, user interfaces, semantics, database structure, etc. In this context, it is not surprising that there is presently no tool on the Internet to help users select the best sources of spatial data once they have clearly specified their needs. This paper discusses these problems and suggests some methods to overcome them. It is a summary of the ongoing M. Sc. research project of the first author under the direction of the second author.
The development of independent GDLs by different organizations and departments has led to a set of heterogeneous GDLs on the Internet. A survey, conducted in October 1995 and updated in July 1996 [2], gave us a good indication of this heterogeneity. Among the 26 sites that were identified, 38% were just presenting a list of available documents, 36% used a minimal set of metadata, and 28% used a complete set of metadata. For the same 26 sites, 53% did not use metadata standards. Almost 60% of the 26 sites were not connected to a database and finally, 36% did not show any map location or map coverage to help users locate the data sets over the territory. If one visits these GDLs, the visitor will rapidly notice:
These facts and statistics, in addition to similar ones found in the survey, clearly show the heterogeneity of GDLs. This heterogeneity problem is illustrated in Figure 1. As a result, in order to retrieve the desired information, users have to understand the interface and search process of every GDL they consult. In a way, this problem is analogous to that of finding information on the Internet using different search engines, but it adds a strong geospatial consideration and new user interface considerations.
Of particular interest is the semantic disparity existing among these GDLs. Here, one must be aware that such a problem exists both in the definition of the structure of a GDL and in the content of the stored data. For example, the former occurs when one GDL calls a field "Object type" while another GDL calls the same field "Spatial Entity", when two GDLs use the same name (e.g. "standard") to represent different concepts (e.g. "data acquisition standard" vs "data structure standards" vs "graphic semiology standards"), or when the concepts used in two GDLs are similar in nature but different in content (e.g. using "spatial reference system" vs using only "map projection" and "datum"). An example of the second type of semantic problem arises when a user wants to find a map that features aqueduct networks. In one GDL, the aqueduct is named « aqueduct » and this value is stored as such in the field "Theme", but in another GDL, it is stored as « water conveyance network » in the "Theme" field. If users are not aware of this detail, they will not find the document in the second GDL. As one might expect, such semantic problems are more severe with GDLs that are not standard-compliant.
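The two kinds of semantic disparity described above can be illustrated with a minimal Python sketch. It assumes that each GDL's metadata record can be read as a simple dictionary; the lookup tables, the canonical names (`entity_type`, `theme`) and the function `harmonize` are hypothetical and only encode the examples quoted in the text.

```python
# Hypothetical harmonization tables. Only the quoted field names and theme
# values come from the text; the canonical names are illustrative.
FIELD_ALIASES = {
    "object type": "entity_type",
    "spatial entity": "entity_type",
    "theme": "theme",
}

THEME_SYNONYMS = {
    "aqueduct": "aqueduct",
    "water conveyance network": "aqueduct",
}


def harmonize(record):
    """Return a copy of a metadata record with canonical field names and themes."""
    result = {}
    for field, value in record.items():
        # Structural disparity: map each GDL's field name to a canonical one.
        canonical_field = FIELD_ALIASES.get(field.strip().lower(), field)
        # Content disparity: map each GDL's theme vocabulary to a canonical term.
        if canonical_field == "theme":
            value = THEME_SYNONYMS.get(value.strip().lower(), value)
        result[canonical_field] = value
    return result


record_a = harmonize({"Theme": "aqueduct", "Object type": "line"})
record_b = harmonize({"Theme": "water conveyance network", "Spatial Entity": "line"})
print(record_a == record_b)  # True: both records now describe the same thing
```

Unknown field names and theme values are passed through unchanged, so a record is never silently dropped; in a real system such unmapped values would presumably be flagged for review.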
Once users have found potential data sets with one or more GDLs, they may be facing very long lists of data sets. Such a list may even include redundancies. Analyzing such listings of data sets becomes a decision-support type of problem. Users may be overwhelmed by such a list and may end up not choosing the best source for their needs. In addition, because of a lack of technical expertise or a vague context, needs may be ill-defined or fuzzily described, worsening the situation. The problem is easily stated: among the data sets available, it is sometimes difficult to choose the ones that best suit the needs of the user. Once again, this problem is similar to that of using search engines on the World Wide Web (WWW). At the present time, we have not seen a GDL on the Internet that helps users to specify their needs explicitly and that uses this information to find the best data sets for a given project. This is why we decided to build SOS-SD (System for the Optimized Selection of Spatial Data).
We can think of some strategies to overcome the problems previously mentioned. The use of a common metadata standard and of a consistent graphical user interface could solve a lot of problems. In practice, however, this is not likely to happen because of the effort needed to conform to the standards, the lack of technical knowledge in geomatics (cartography, remote sensing, photogrammetry, surveying, geodesy, hydrography), the lack of commercial products having a strong presence in the market (a de facto standard), the lack of resources to properly accomplish GDL tasks, rapid technological advances, etc.
There is another possible way. Among new data management technologies, data warehousing coupled with a data transformation / integration tool and a data selection layer is very promising. Data warehouses, as defined by Inmon [3], « are a subject oriented, integrated, non-volatile, and time variant collection of data in support of management's decisions ». Usually, data warehouses are designed to deal with large volumes of data, and they are regularly coupled with a transformation / integration tool, allowing the data coming from heterogeneous systems to be transformed and integrated in the data warehouse. Data transformation / integration is one of the cornerstones of data warehouses. For example, if map precision is expressed in feet in GDL « A » but in meters in GDL « B », the transformation / integration tool could convert the measurement unit from feet to meters before integrating the value into the data warehouse. Figure 2 shows a generic data warehouse architecture, applied to spatial data.
Figure 2: A generic data warehouse architecture applied to spatial data
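The feet-to-meters example given above can be sketched in a few lines of Python. This is only an illustration of the transformation step, not the actual tool: the function name `precision_in_meters` and the unit codes `"ft"` and `"m"` are assumptions, and the conversion factor is that of the international foot (0.3048 m).

```python
# Conversion factors to meters. The unit codes are hypothetical; the factor
# for "ft" is the international foot, exactly 0.3048 m.
METERS_PER_UNIT = {"m": 1.0, "ft": 0.3048}


def precision_in_meters(value, unit):
    """Convert a map precision value to meters before loading it into the warehouse."""
    try:
        factor = METERS_PER_UNIT[unit.strip().lower()]
    except KeyError:
        # Refuse to guess: an unknown unit must be resolved by the integrator.
        raise ValueError(f"unsupported unit: {unit!r}") from None
    return value * factor


# GDL « A » reports precision in feet, GDL « B » in meters.
from_gdl_a = precision_in_meters(10, "ft")
from_gdl_b = precision_in_meters(5, "m")
```

Rejecting unknown units rather than passing them through keeps the warehouse column homogeneous, which is the point of the transformation step.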
Up until now, data warehouses have mainly been used in traditional business applications: insurance, banking, accounting, etc. Some projects are using this technology with spatial data; however, they seem to offer limited capabilities when compared to non-spatial systems. In fact, because of the characteristics of spatial data, it is not always possible to use the current technology efficiently and to integrate this type of information into a data warehouse. The existing technologies have to be adapted, which creates research opportunities.
From a GDL point of view, data warehouses offer several interesting perspectives. For the present project, we can imagine a data warehouse system getting some data from different GDLs and integrating them in a single database. Users, instead of consulting several different GDLs, could do a one-stop initial search in this data warehouse and find the preliminary information they need. From there, using a computerized procedure to define their needs (such as the one developed by Charron and the second author [4]), the system could perform a preliminary but very useful filtering of the data sets. If more precise information is needed about the retained data sets, users could go to the legacy GDL's data source to find what they are looking for. Using such an architecture greatly reduces the problems previously mentioned (and completely eliminates them if the metadata stored or derived in the warehouse are sufficient).
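The one-stop search and preliminary filtering described above could look like the following Python sketch. It is not the SOS-SD implementation nor the needs-definition procedure of [4]: the `DataSet` class, the `search` function and the sample records are all hypothetical, and the filter is reduced to a theme match plus a bounding-box intersection test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSet:
    """One integrated metadata record (hypothetical warehouse schema)."""
    source_gdl: str   # legacy GDL to return to for more precise information
    title: str
    theme: str
    bbox: tuple       # (west, south, east, north) in decimal degrees


def intersects(a, b):
    """True if two (west, south, east, north) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(warehouse, theme, area):
    """Preliminary filter: keep data sets of the given theme covering the area."""
    return [d for d in warehouse if d.theme == theme and intersects(d.bbox, area)]


# Made-up sample records standing in for metadata integrated from three GDLs.
warehouse = [
    DataSet("GDL A", "Hydrography 1:20 000", "hydrography", (-71.3, 47.2, -71.0, 47.4)),
    DataSet("GDL B", "Forest inventory", "forestry", (-71.3, 47.2, -71.0, 47.4)),
    DataSet("GDL C", "Regional hydrography", "hydrography", (-70.0, 46.0, -69.0, 46.5)),
]

hits = search(warehouse, "hydrography", (-71.2, 47.25, -71.1, 47.35))
for hit in hits:
    print(hit.source_gdl, hit.title)
```

Each hit keeps a reference to its source GDL, so the user can follow it back to the legacy system when the warehouse metadata are not detailed enough.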
One may also consider using the transformation / integration tool to produce value-added information. For example, if a GDL simply lists the available maps using the Canadian national map numbering standard (e.g. 21 L-10) but doesn't show their coverage on an index map, we could use this information to show the extent of each map over a background map, because 21 L-10 refers to a known area. Another example of value-added data is to produce statistical information about map sheets available in paper versus digital form and to indicate the planned dates of digitization (which are available only in paper documents).
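As an illustration of deriving a map extent from a sheet number such as 21 L-10, the Python sketch below decodes the number into a bounding box. It rests on several assumptions that should be verified against the official index before use: primary quadrangles are taken to be subdivided into 16 lettered areas of 2° by 1°, each split into 16 numbered sheets of 30' by 15', both numbered in a serpentine pattern starting at the south-east corner (the layout used south of 68°N). The south-west corner of quadrangle 21 is entered by hand in a lookup table, and the function `nts_extent` is hypothetical.

```python
import re

LETTERS = "ABCDEFGHIJKLMNOP"

# Assumed south-west corners (longitude, latitude) of primary quadrangles.
# Only quadrangle 21 is listed; other entries would come from the official index.
QUAD_SW_CORNERS = {"21": (-72.0, 44.0)}


def _grid_position(index):
    """Map a 0-based serpentine index to (column from west, row from south)."""
    row, pos = divmod(index, 4)
    # Even rows are numbered east to west, odd rows west to east.
    col = 3 - pos if row % 2 == 0 else pos
    return col, row


def nts_extent(sheet):
    """Return (west, south, east, north) in degrees for a sheet such as '21 L-10'."""
    match = re.fullmatch(
        r"\s*(\d{1,3})\s*-?\s*([A-Pa-p])\s*[-/]?\s*(\d{1,2})\s*", sheet
    )
    if match is None:
        raise ValueError(f"unrecognized sheet number: {sheet!r}")
    quad, letter, number = match.group(1), match.group(2).upper(), int(match.group(3))
    if not 1 <= number <= 16:
        raise ValueError(f"sheet number out of range: {number}")
    if quad not in QUAD_SW_CORNERS:
        raise ValueError(f"no corner known for quadrangle {quad}")

    west, south = QUAD_SW_CORNERS[quad]
    # Lettered area: 2 degrees of longitude by 1 degree of latitude.
    col, row = _grid_position(LETTERS.index(letter))
    west, south = west + col * 2.0, south + row * 1.0
    # Numbered sheet: 30 minutes of longitude by 15 minutes of latitude.
    col, row = _grid_position(number - 1)
    west, south = west + col * 0.5, south + row * 0.25
    return (west, south, west + 0.5, south + 0.25)


print(nts_extent("21 L-10"))  # (-71.0, 46.5, -70.5, 46.75)
```

The resulting box can be stored in the warehouse next to the original sheet number and drawn over a background map, which is the value-added information the source GDL did not provide.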
From a Decision-Support System (DSS) point of view, it was previously mentioned that SOS-SD will assist the users in the definition of their needs and help them to find the data sets that best suit their needs. Figure 3 illustrates this concept. This system is composed of tools to transform, integrate, synthesize and semantically analyze the data stored in a data warehouse and a GIS-like user interface.
One of the main problems for this system is to get the data. Sometimes, metadata in GDLs are just stored in HTML pages; sometimes they are stored in a complex relational database. Additionally, accessing these data usually requires a signed agreement between the GDL providers and SOS-SD, and a complete infrastructure has to be developed to allow the system to integrate the data. Technical problems of data access also have to be resolved. However, our present goal is not to address those problems. Our goal is to investigate the potential of this data warehouse-oriented approach to solve the initial problems. To do so, we created 5 different GDLs that we assume to be representative of the heterogeneity of existing GDLs. One of these GDLs is GEOREP, described in [1]; consequently, the four other GDLs will cover the area covered by GEOREP, that is, Forêt Montmorency, and each will store certain metadata describing carefully selected existing data sets (about 75 data sets created over a period of 30 years). These four GDLs (or DLs in some cases) are being implemented in Oracle, in a GIS and in web servers using simple HTML files, while GEOREP already uses the Microsoft Jet database engine (cf. MS-Access) and Java. If the results of our research indicate that the proposed solution is well suited to solve the problems previously mentioned, then additional technical and legal issues can be addressed later.
We have presented a solution to the problems of dealing with heterogeneous GDLs and selecting the best data sets to suit our needs. We have defined most of the current concepts and, during previous research projects, have experimented with several issues dealing with spatial metadata, semantic analysis, selecting the best source of data, data quality analysis, data model merging, web database, Java programming, and GIS. At the time of writing this paper, we expect to finish this research project before the end of summer '97. Eventually, the results will be available through the Internet. Please use the SOS-SD hyperlink in the GEOREP site to learn more about the current state of the project or go directly to this address (sosds.scg.ulaval.ca).
Copyright © 1997 François Létourneau, Yvan Bédard, Marie-Josée Proulx
hdl:cnri.dlib/march96-letourneau