LIBRES:
Library and Information Science Research
Electronic Journal ISSN 1058-6768
2000 Volume 10 Issue 2; September 30
Bi-annual LIBRES 10N2
Gilberto R,
Sotolongo-Aguilar *, Carlos A. Suárez-Balseiro **, Maria V. Guzmán-Sánchez *
* The Finlay
Institute; POBox 16017, Cod. 11600 La Habana, CUBA. E-mail: finlayci@infomed.sld.cu
** Faculty of Communication, University of Havana Calle G, No.506, Vedado, La Habana 10600, La Habana, CUBA. E-mail: csbgv@bib.uc3m.es
One of the
challenges today for information professionals is to guide the way through huge
volumes of information generated by different means. The birth and development
of new disciplines such as “data mining” and “knowledge discovery”, shows the
increasing importance of quantitative and qualitative analysis of huge corpora
data (Dhar and Stein, 1997; Swanson and Smalheiser, 1997).
Bearing this in
mind, bibliometric research becomes one of the fundamental tools used by
information professionals in their quest of indicators; allowing them “critical
appraisals” of scientific research, as well as interaction among researchers,
institutions and knowledge areas.
The above reasons
have conditioned increasing efforts for the systematization and standardization
of methods and tools used in bibliometric research. Classical studies have
supported the importance of clearly defining the problems in the field,
emphasising the application of statistics as a key factor in discovering new
knowledge (Egghe and Rousseau, 1990). For Glanzel (1996), bibliometrics is a
complex discipline which, although it
may be classified as a social science, is closely conditioned by pure
and technological sciences. Therefore any methodological characterization
requires well-documented data processing methods, a clear description of the
sources and exact definition of indicators and, on the other hand, there is a need
for an effective selection and integration of the applied technologies.
Ravichandra Rao (1996) asserts that no bibliometric technique alone can be
applied to all research, but instead different procedures should be used for
different problems. Grivel, Polanco and Kaplan (1997) emphasize what they
call “informatic infrastructure” where
bibliometrics could develop all its potential. According to these authors,
bibliometrics should have a methodology characterized by not only an adequate
mathematical representation but also an effective “informatic architecture”.
Therefore
bibliometric information systems are the workbench of bibliometric research.
Being an important part of this field
of endeavor, they require a flexible design in order to obtain accurate and
customized indicators and should integrate new features resulting from the
latest developments.
Many colleagues have found
their way into bibliometrics by building in-house applications. At the end of
the 80’s Terrence Brooks prepared a set of computer programs written in Turbo
Pascal called the Bibliometrics Toolbox in order to measure the bibliometric
aspects of a literature (Brooks, 1987; McLain, 1990). For Van Raan (1996) in
Leiden the chosen name was “The Machine”; in CRRM they use a software suite,
with DATAVIEW (Rostaing et al., 1996) as flagship, and the CUIB-METRIC system
is proposed by specialists at the UNAM in Mexico (Portal and Thompson, 1994).
There is the application of TOAK -Technology Opportunity Analysis Knowbot - at
the Technology Policy & Assessment Center, at Georgia Institute of
Technology, in Atlanta, USA (Porter and Detampel, 1995) and HENOCH (Grivel et
al., 1997), and NEURODOC (Polanco et al., 1998) which are used at INIST in
France. Bibexcel (Bibmap before Excel), developed by Professor Olle Persson,
from Inforsk, Umeå University in Sweden, and BibTechMon (Kopcsa and Schiebel,
1998) are other available approaches. The work of Katz and Hicks (1997), Small
(1998), and White and McCain (1998) takes the same direction. Finally we have
to mention the experiences of Chen (1995), Lin (1995, 1997), and Orwing, Chen
and Nunamaker (1997) in the application of artificial neural networks for
bibliometric purposes based on the Kohonen’s self-organizing map (SOM), which is an orderly mapping of a high-dimensional,
eventually structured distribution of data onto a regular low-dimensional grid.
The Kohonen´s SOM is probably the best know network model geared towards
unsupervised training and essentially consists of a regular grid of processing
units or “neurons” associated with a model of some multidimensional observation
to represent all the available observations with optimal accuracy, using a
restricted set of models ordered on the grid so that similar models are close
to each other and dissimilar models far from each other (Kohonen, et al., 1999; Kohonen, 1998).
However the problem arises
of when generalization should be done.
In-house applications are rarely well documented and their use by others
becomes difficult. The result is that only the members of the team are able to
replicate the use of such applications.
The standardization fails and it is not only a handicap for practical
research; it becomes a barrier for
teaching purposes because many educational institutions are not be able to
obtain and implement in-house applications to support bibliometric educational
programs.
This problem may be overcome
in part using a set of proprietary software, which is well-documented, widely
available and more accessible than in-house applications. On the other hand,
the validation of techniques is obvious and many teams of developers are
continuously improving the performance of such software.
This paper
describes a methodology based on the utilization of a set of proprietary
software working together in order to perform bibliometric analysis of a
literature. This methodology is explained as an open and flexible bibliometric
information system in compliance with a simple modular design and connectivity
for desktop work. It is useful for practical work as well as for education and
training purposes. We envisage our task as seeking the integration of different
available software with the objectives of consolidating an informatic
infrastructure for our bibliometric research and developing standard methods
that fit with this purpose.
Our approach
consists of five modules based on proprietary software integrating the system.
The modules perform the following functions:
(1)
Bibliographic
Searches
(2)
File Conversion
& Handling
(3)
Bibliographic
Reference Management
(4)
Indicators
including (experimental) Artificial Neural networking
(5)
Bibliometric
Analysis
Bibliographic
Searches are conducted online or on CD-ROM.
Resulting files are downloaded and converted by module (2) File
Conversion & Handling. Resulting files are the input to module (3)
Bibliographic Reference Management, where the standardization of the database
is performed. Different fields under study or a combination of them are
exported and saved as text files. Afterwards, those files are processed in
module (4) Indicators. In this module
several statistical analysis may be carried out to obtain the inputs for the bibliometric analysis in the module
5. Different scenarios could be
implemented, varying the elements
inside each module.
In our experience, for small and medium size corpora,
the following packages appear to work very
well:
1.
Bibliographic Searches.
Dependent on the topic of research e.g. in biomedicine SPIRS, WINSPIRS, both
from Silver Platter, PubMed and Internet GratefulMed or The Query E-mail Retrieval System from NLM.
2. File
Conversion & Handling.
Resulting files are downloaded and treated by BiblioLink™ converting them according to a selected configuration that depends
on host and fields to be studied. BiblioLinkä
convert the files to Prociteä
format.
3.
Bibliographic Reference Manager.
Procite™ (Research Information Systems Inc.), works very well for the managing
of bibliographic references allowing standardization of data. Furthermore
BiblioLinkä
and Prociteä
are totally integrated in their latest versions.
4.
Indicators
(obtained in this module). The functions of this module are performed by
different statistical packages, e.g. Excel™ (Microsoft Corp.) and its
complement xlStat™ (Stat@Com Inc.) and STATISTICA® (StatSoft Inc.). The former
gives the possibility of profiting from all the built-in features of this
program including graph and functions features. We have recently introduced an
experimental submodule for Artificial Neural Networking. We used Viscovery ®
SoMine from Eudaptics Software Gmbh for this purpose, allowing us to work
without models and statistical assumptions by using the powerful Self-Organizing
Maps (SOM) Technology. It leads to a very good representation of
high-dimensional data by maintaining similarities implicit in the data.
5.
Bibliometric analysis.
Finally in this module the analysis of indicators is performed according to the
aims of the particular task undertaken.
Results & Discussion
The above
mentioned scenario operates according to the following procedure: bibliographic
searches are conducted online or on CD-ROM. Resulting files are downloaded and
treated by BiblioLink™ converting them according to a selected configuration
that depends on host and fields to be studied. The resulting converted file is
already in Procite™ format having the possibility to switch directly to the
bibliographic reference management features of Procite™. Here standardization
of the database is conducted. Many different treatments could take place
including the building of authority lists with the contents of different fields
including an authority list of all words in any field or in the whole database.
The different fields under study or a combination of them are exported and
saved as text files. Afterwards, Excel™ imports those files. Frequency analysis
may then be performed aided by the Pivot Table feature of Excel™ complemented by
built-in Analysis Functions available in the Tools Menu. With Excel™, using
xlStat™ it is also possible to build
the matrices that produce the input for cluster analysis, factor analysis, PCA,
and multidimensional scaling and
undertake these analysis. Alternatively, those matrices may be exported as
Excel™ sheets and imported into STATISTICA® (StatSoft Inc.) and finally
processed. More recently we have been experimenting with Viscovery SoMine.
Beyond its visual exploration capabilities, Viscovery® also supports in-depth
statistical analysis of data. The combination of the non-linear data
representation of the SOM approach with classical statistical methods - such as
regressions or principal component analysis (PCA) - results in the improvement
of the final model in terms of precision
and efficiency.
This system
platform guarantees a comprehensive traceability of all data from the first
data downloaded, to the last chart obtained. At the same time, consistent
results are attained by means of the reproducibility at all the steps
performed. Bibliographic data in the database could also be used for
building-up formatted bibliographies.
This approach has
been applied in different studies (Macías-Chapula et al., 1999; Guzmán-Sánchez
et al., 1998; Sanz-Casado, et al., 1998) mainly focused on the biomedicine
field and more specifically in the area of vaccines research. However, other
domains such as library and information science or economics have been explored
(Sanz-Casado et al., 1999; Sotolongo-Aguilar, 1999)[1].
Bibliometric output data of the system could perform, among others, the
following activity and relations measurements:
1.
Counts of papers
by the following fields or a combination of them:
·
Authors
·
Sources
·
Keywords (e.g.
MESH)
·
Years
·
Substances
·
Documents types
·
Languages
·
Affiliation
·
Country of
publication
·
Authors/papers
·
Substances/papers
·
Keywords/papers
·
Document
types/papers
2.
Co-occurrence
matrices for multivariate analysis of the following fields or a combination of
them:
·
Authors
·
Keywords
·
Substances
·
Document types
·
Self-Organized-Maps
for spatial representation of linear or multidimensional data
The benefits
resulting from the developments outlined in this paper could be threefold.
Besides integrating public domain software in a flexible modular design,
comprehensive automated processing and data representation stages of research could
be achieved in contrast to the cumbersome tasks that are performed by other means. This platform is supported on
software which is widely used and
regularly updated and upgraded; in contrast with adhoc software that becomes
outdated very rapidly.
Last but not
least, teaching bibliometric research seems to benefit from this approach,
bearing in mind its use of proprietary software available world wide, regularly
updated and supported by well established developing teams. Improved
bibliometric research practices must be supported by theoretical and practical
education in which technologies have a key role. However, if the methods and
tools used are not widely accessible and it is necessary to depend on an
in-house application, the experience could be frustrating.
Finally we have to emphasise
the fact that MOBIS-ProSoft is not new software, it is not a new in-house
developed application; it is a methodological approach using a set of different
proprietary applications working in modules.
This methodology has been shown to be a working platform that could be
up-graded, is flexible and has utility performance. Moreover it is a practical
alternative for educators to improve their teaching programs. Improvements to
MOBIS-ProSoft are foreseen. Participation in the testing of this methodology is
welcome, as well as new ideas for incorporating modules or improving the
existing system.
This appendix
contains figures illustrating some features of
the MOBIS-ProSoft (modular
bibliometrics information system with proprietary software) approach.
Figure 1. Matrix-building module showing
(partially): raw data (upper-half); frequency of ocurrence (lower half – left);
coocurrence matrix (lower half – right)
One of the most
interesting features of using the MOBIS-ProSoft approach, is the
matrix-builder feature implemented by running biblio.xla module under xlStat™.
A selected multiword field of the database, for instance the Subject field or
the Author field, is exported as comma delimited text and after it is imported
into Excel™. Then the biblio.xla module can be run. The resulting Excel™ sheet of the current workbook looks like the
upper half of figure 1. Column A is manually added for assuring that all elements
will be included in the matrix-building process. The results of running
biblio.xla appears in the lower half of figure 1; at the left the frequency
table is built up; after skipping one column the symmetric coocurrence matrix
appears at the right. The square matrix can have a maximum size of 252 x 252. In the illustration is a data set
from ISA (1966-1998) descriptors related to Library & Information Science
Research Methods in Latin America & the Caribbean.
Figure
2. Dendogram resulting from clustering using the Ward method including clusters
observation / clusters size
This illustration shows the
results of clustering similarities e.g.
Pearson Product Moment rp,
using the Ward method. In this case country clustering according to the
Activity Index from ISA (1966-1998)
Library & Information Science in Latin America & the Caribbean.
The illustration includes the Clusters Observation / Clusters Size for the best
partition suggested by running the results i.e. three groups of size 5, 5 and
9, with the corresponding country names by group.
Figure 3.
Map based on principal component
analysis with Varimax transformation
This illustration
shows the map corresponding to the same data set used in figure 2. In this case
the display is a PCA map with Varimax transformation. The groups founded in the
clustering procedure shown in figure 2 are highlighted.
Figure
4. Map based on multidimentional
scaling
Figure 4
shows an MDS map which displays
another view of the data set used in figures 2 and 3. In this case
dissimilarities were calculated as 1- rp.
Figure 5.
Self Organized Map (SOM) and windows of map-layers & values based on
an Artificial Neural Network (ANN) trained with Kohonen unsupervised algorithm.
The most
intriguing feature recently incorporated on an experimental basis to the
MOBIS-ProSoft approach is the SOM-ANN. Again, as in figure 1, a data set from
ISA (1966-1998) descriptors related to Library & Information Science
Research Methods in Latin America & the Caribbean is used, in this case
time series of descriptor occurrence for the years 1983-1984,1986, 1990-1998.
Viscovery® SoMine was used for training the network based on 133 data records
from same number of descriptors characterized by 12 components (dimensions or features). Forty four cycles
in normal exact mode of training were needed
for generating a map size of 100:61 with 1905 nodes that represent the
trained network . Figure 5 shows a
screenshot with some results. In the upper-left part of the screenshot is
a map showing the clusters formed, at
the upper-right the values of some of the year component of the cluster
corresponding to bibliometrics that is located in the lower-left corner of the
cluster map. The rest are all the map-layers corresponding to all the
dimensions (one for each displayed from left-right top-bottom. A different tone of gray (different colors in the
original) shows different “landscapes” views. The first year of the time series
(left map in the first row of maps-layers) displays isolated spots of activity,
while as time goes by (second row of maps-layers) the activity increases and
non-linear correlation could be observed. Resulting data from the trained network
could later be evaluated.
References
Brooks, T (1987). The Bibliometrics Toolbox, version 2.8. North City Bibliometrics. Available at: ftp.u.washington.edu/public/tabrooks/toolbox/
Chen, H. (1995). Machine Learning for Information Retrieval: neural networks, symbolic learning, and genetic algorithms. Journal of the American Society of Information Science 46(3), 194-216
Dhar, V. and Stein, R. (1997). Seven Methods for Transforming Corporate Data into Business Intelligence. Prentice Hall.
Egghe, L. and Rousseau, R. (1990). Introduction to Informetrics. Quantitative Methods in Library Documentation and Information Science. Netherdlands: Elsevier Sciences Publisher.
Eudaptics Software Gmbh. (1999). Viscovery® for CRM-applications (Viscovery White Paper). Available from: http://www.eudatic.com/
Glanzel, W. (1996). The need for standards in bibliometric research and technology. Scientometrics 35(2), 167-176.
Grivel, L., Polanco, X. and Kaplan, A. (1997). A computer system for
big scientometrics at the age of the worldwide web. Scientometrics,
40(3), 493-506.
Guzmán-Sanchez, M.V., Sánz-Casado, E. and Sotolongo-Aguilar, G. (1998). Bibliometric study on vaccines (1990-1995) in Iberian-American countries. Scientometrics 43(2), 189-205 .
Katz, J.S. and Hicks, D. (1997). Desktop Scientometrics. Scientometrics 38(1),141-153.
Kohonen, T.,
Kaski, S., Lagus, K., Salojärvi, J., Honkela, J., Paatero, V., and Saarela, A.
(1999). Self-Organization of a massive text document collection. In: Oja, E.
and Kaski, S. editors. Kohonen Maps.
Amsterdam, Elsevier pp.171-182.
Kohonen, T.
(1998). Self-organization of very large document collection: State of the Art.
In: Niklasson, L., Boden, M. and Ziemke, T., editors. Proceedings of ICANN98, 8th International Conference on
Artificial Neural Networks, vol. 1, Springer, London. pp. 65-74.
Kopcsa, A. and Schiebel, E. (1998). Science and technology Mapping: A New Iteratio Model for Representing Multidimensional Relationships. Journal of the American Society of Information Science, 49(1), 7-17.
Lin, X. (1995). Searching and Browsing on Map Displays. Proceedings of ASIS 95, Chicago, 13-18.
Lin, X. (1997). Map Display for Information Retrieval. Journal of the American Society of Information Science 48(1), 40-54.
Macias-Chapula, C.A., Sotolongo-Aguilar, G.R., Madge, B. and Solorio-Lagunas, J. (1999). Subject content analysis of AIDS literature as produced in or about Latin America and the Caribbean. Scientometrics 46(3), 563-574
McLain, J. P. (1990). Bibliometrics Toolbox.. Journal of the American Society for Information Science 41(1), 70-71.
Orwing, R., Chen, H. and Nunamaker, J. (1997). A Graphical, Self-Organizing Approach to classifying electronic meeting output. Journal of the American Society of Information Science 48(2), 157-170.
Polanco, X., Francois, C. and Keim J. P. (1998). Artificial neural network technology for the classification and cartography of scientific and technical information. Scientometrics 41(1-2), 69-82.
Portal, S. G. and Thompson, A. C.
(1994). CUIB-METRIC: an integral system for metric analysis of bibliographic
information. Investigacion Bibliotecológica, 8 (16), 27-31.
Porter, A. L. and
Detampel, M. J. (1995). Technology
Opportunities Analysis. Technological
Forecasting and Social Change 49, 239-255.
Ravichandra Rao, I.K. (1996). Methodological and conceptual questions of bibliometric standards. Scientometrics, 35(2), 265-270.
Rostaing, H., Dou, H., Hassanaly, P. and Paoli, C. (1996). Dataview: bibliometric software for analysis of downloaded data. Available from: http://crrm.univ-mrs.fr
Sanz-Casado, E., García-Zorita, C., García-Romero, A., and Modrego-Rico, A. (1999). Research by spanish economists. Characteristics in terms of the scope of publications. Proceedings of the 7th International Conference on Scientometrics and Informetrics, University of Colima, Colima, Mexico, July 5-9, pp. 593-595.
Sanz-Casado, E., Suárez-Balseiro, C.A., and García-Zorita, C. (1998). Estudio de la producción científica española en biomedicina durante el período 1991-1996. Actas de las Jornadas de Documentación en Ciencias de la Salud, Zaragoza, marzo 1998. (Copies available from the authors).
Small, H. (1998). A general framework for creating large-scale maps of science in two or three dimensions: the SCIVIZ system. Scientometrics 41(1-2), 125-133.
Sotolongo-Aguilar, G. (1999). Library and Information Science Research Methods in Latin America and the Caribbean (1966-1998). Report 23, Scientific University Council (SUC), University of Havana, Cuba. (Available from the authors).
Swanson, D. R. and Smalheiser, N. R. (1997). An interactive system for finding complementary literatures: a stimulus to scientific discovery. Artificial Intelligence 91, 183-203. Available from: http://kiwi.uchicago.edu/webwork/AIabtext.html
Van Raan, A.F.I. (1996). Scientometrics: state-of-the-art. Scientometrics 38(1), 205-218.
White, H. D. and McCain, K. W. (1998). Visualizing a Discipline: An Author Co-Citation Analysis of Information Science. Journal of the American Society of Information Science 49(4), 327-355.
This document may be circulated freely
with the following statement included in its entirety:
Copyright 2000
This article was originally published in
LIBRES: Library and Information Science
Electronic Journal (ISSN 1058-6768) September 30, 2000
Volume 10 Issue 2.
For any commercial use, or publication
(including electronic journals), you must obtain
the permission of the authors.
Gilberto R, Sotolongo-Aguilar
The Finlay Institute; POBox 16017, Cod. 11600 La Habana, CUBA. E-mail: finlayci@infomed.sld.cu
Carlos A. Suárez-Balseiro
Faculty of Communication, University of Havana Calle G, No.506, Vedado, La
Habana 10600, La Habana, CUBA. E-mail: csbgv@bib.uc3m.es
Maria V. Guzmán-Sánchez
The Finlay Institute; POBox 16017, Cod. 11600 La Habana, CUBA. E-mail: finlayci@infomed.sld.cu
To subscribe to LIBRES send
e-mail message to
listproc@info.curtin.edu.au
with the text:
subscribe libres [your first name] [your last name]
________________________________________
Return to Libres
10n2 Contents
Return to Libres Home Page
[1] All the figures used in the appendix are related to the report prepared by Gilberto Sotolongo-Aguilar for the Scientific University Council of the University of Havana in 1999. This study focused on library and information science research methods in Latin America and the Caribbean from 1966 to 1998.