US PATENTS: Cardona, System and method for database retrieval, indexing and statistical analysis

Loet Leydesdorff loet at LEYDESDORFF.NET
Wed Dec 18 03:59:24 EST 2002

Dear colleagues,

I made available at the
program FullText.exe. FullText.exe is freely available for academic usage.

The program generates a word-occurrence matrix, a co-occurrence matrix, and
a normalized co-occurrence matrix from a set of text files and a word list.
The output files can be read into standard software (like SPSS,
Ucinet/Pajek, etc.) for statistical analysis and visualization.

input files

The program prompts for two pieces of information: (a) the name of the file
<words.txt> that contains the words (as variables) to be analyzed, in ASCII
format, and (b) the number of files that contain the text elements as cases.
The text elements are to be numbered sequentially: Text1.txt,
Text2.txt, etc. The number of texts is unlimited, but each text can be at
most 64 kByte in size. The number of words is limited to 1024, but keep
in mind that most programs will not allow you to handle more than 256
variables in the follow-up.
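
The limits above can be checked before starting a run. The following is
only a minimal sketch in Python and is not part of FullText.exe. It assumes
that words.txt holds one word per line and that 64 kByte means 65536 bytes;
neither is stated above. The function name check_inputs is hypothetical.

```python
import os

MAX_TEXT_BYTES = 64 * 1024  # assumed reading of the 64 kByte limit per text
MAX_WORDS = 1024            # limit on the number of words in the word list


def check_inputs(directory, n_texts, word_file="words.txt"):
    """Return a list of problems found; an empty list means nothing was found."""
    problems = []

    # The word list: assumed here to contain one word per line.
    word_path = os.path.join(directory, word_file)
    if not os.path.isfile(word_path):
        problems.append(f"missing {word_file}")
    else:
        with open(word_path, encoding="ascii", errors="replace") as fh:
            words = [line.strip() for line in fh if line.strip()]
        if len(words) > MAX_WORDS:
            problems.append(f"{len(words)} words exceeds {MAX_WORDS}")

    # The texts: Text1.txt, Text2.txt, ... numbered without gaps.
    for i in range(1, n_texts + 1):
        name = f"Text{i}.txt"
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            problems.append(f"missing {name}")
        elif os.path.getsize(path) > MAX_TEXT_BYTES:
            problems.append(f"{name} is larger than 64 kByte")
    return problems
```

Calling check_inputs(".", 25), for example, would report any gap in the
numbering Text1.txt to Text25.txt in the current directory.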

program file

The program is based on DOS-legacy software from the 1980s (Leydesdorff,
1995). It runs in an MS-DOS Command Box under Windows. The program and the
input files have to be contained in the same directory. The output files
are written into this directory as well. Please note that existing files
from a previous run are overwritten by the program. Save output elsewhere
if you wish to continue with the materials.
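
Since a new run overwrites the previous output, one may wish to copy the
output files to another directory first. A minimal sketch in Python,
assuming that the output can be recognized by the extensions .dbf and .dat;
the function name save_outputs and the directory names in the usage note
are hypothetical.

```python
import shutil
from pathlib import Path


def save_outputs(run_dir, dest_dir, patterns=("*.dbf", "*.dat")):
    """Copy the output files of a run to dest_dir; return the copied names."""
    run_dir, dest_dir = Path(run_dir), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for pattern in patterns:
        # sorted() only makes the order of the returned names predictable.
        for src in sorted(run_dir.glob(pattern)):
            shutil.copy2(src, dest_dir / src.name)
            copied.append(src.name)
    return copied
```

For example, save_outputs(".", "run1") copies the .dbf and .dat files from
the current directory into a subdirectory run1 before the next run.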

output files

The program produces three output files in dBase IV format. These files can
be read into Excel and/or SPSS for further processing. Two files with the
extension ".dat" are in DL-format (ASCII) and can be read into Pajek for
the visualization.

a. matrix.dbf contains an occurrence matrix of the words in the texts. This
matrix is asymmetrical: it contains the words as the variables and the
texts as the cases. In other words, each row represents a text in the
sequential order of the text numbering, and each column represents a word
in the sequential order of the word list. (It is advisable to sort the word
list alphabetically before the analysis.) The words are also the variable
names although truncated to ten positions. The words are counted as
frequencies. (The plural "s" is removed before processing.)

b. coocc.dbf contains a co-occurrence matrix of the words from this same
data. This matrix is symmetrical and contains the words both as
variables and as labels in the first field. The main diagonal is set to
zero. The number of co-occurrences of two words is the product of their
occurrences in a text, summed over the texts. (The procedure is similar to
using the file matrix.dbf as input to the routine "affiliations" in Ucinet,
but here the main diagonal is set to zero.)

c. cosine.dbf contains a normalized co-occurrence matrix of the words from
the same data. Normalization is based on the cosine between the variables
conceptualized as vectors (Salton & McGill, 1983). (The procedure is
similar to using the file matrix.dbf as input to the corresponding routine
in SPSS.)
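
The three matrices described under a, b, and c can be sketched in Python as
follows. This is an illustration under stated assumptions, not the code of
FullText.exe: the tokenization (runs of letters), the case folding, and the
exact rule for removing the plural "s" are guesses; co-occurrences are taken
as products of occurrences summed over the texts; and the treatment of the
main diagonal in cosine.dbf is not stated above, so the plain cosine is
computed. All function names are hypothetical.

```python
import math
import re


def normalize(token):
    """Lowercase a token and drop one trailing "s" (assumed plural rule)."""
    token = token.lower()
    if len(token) > 1 and token.endswith("s"):
        token = token[:-1]
    return token


def occurrence_matrix(texts, words):
    """Rows are texts (cases), columns are words (variables), cells are counts."""
    keys = [normalize(w) for w in words]
    rows = []
    for text in texts:
        # Assumed tokenization: every run of letters is one token.
        tokens = [normalize(t) for t in re.findall(r"[A-Za-z]+", text)]
        rows.append([tokens.count(k) for k in keys])
    return rows


def cooccurrence_matrix(occ):
    """Sum over texts of the products of occurrences; main diagonal is zero."""
    n = len(occ[0]) if occ else 0
    cooc = [[0] * n for _ in range(n)]
    for row in occ:
        for i in range(n):
            for j in range(n):
                if i != j:
                    cooc[i][j] += row[i] * row[j]
    return cooc


def cosine_matrix(occ):
    """Cosine between the word columns, each taken as a vector over the texts."""
    n = len(occ[0]) if occ else 0
    cols = [[row[j] for row in occ] for j in range(n)]
    norms = [math.sqrt(sum(v * v for v in col)) for col in cols]
    cos = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if norms[i] > 0 and norms[j] > 0:
                dot = sum(a * b for a, b in zip(cols[i], cols[j]))
                cos[i][j] = dot / (norms[i] * norms[j])
    return cos


# Toy data standing in for Text1.txt, Text2.txt and the word list.
texts = ["Patents cite patents and journals.",
         "A journal cites a patent; citations follow."]
words = ["citation", "journal", "patent"]  # sorted alphabetically

occ = occurrence_matrix(texts, words)   # [[0, 1, 2], [1, 1, 1]]
cooc = cooccurrence_matrix(occ)         # [[0, 1, 1], [1, 0, 3], [1, 3, 0]]
cos = cosine_matrix(occ)
```

In this toy example "journal" and "patent" co-occur 1*2 + 1*1 = 3 times, and
their cosine is 3 / (sqrt(2) * sqrt(5)), or approximately 0.95.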


Leydesdorff, L. (1995). The Challenge of Scientometrics: the development,
measurement, and self-organization of scientific communications. Leiden:
DSWO Press, Leiden University; at .

Salton, G., & McGill, M. J. (1983). Introduction to Modern Information
Retrieval. Auckland, etc.: McGraw-Hill.

At 09:09 PM 12/17/2002 -0500, you wrote:
>US Patent 6,385,611
>Dated May 7, 2002
>Inventor: Cardona; Carlos (P.O. Box 22892, Seattle, WA 98122-0892)
>(There are no spaces in the above URL, it's just really long.)
>The present invention provides a system and method with the capacity to
>compare and analyze keywords of a specific area of study.
>By the use of the methods of the present invention, some sets of keywords
>will be seen as "warming up" due to their upward trends,
>whereas other keywords might be seen as "cooling down" due to their
>downward trends. Given the accepted fact that growing areas of
>research are the ones that are more likely to produce scientific
>breakthroughs, the system identifies these emerging ("hot") areas of
>research would accelerate the scientific advances of their
>users. Similarly, users will be able to view and shift from non-productive
>("cool") areas of research to productive "hot" areas. The invention
>involves the utilization of a commercially available database
>program and provides specific keywords associated with the investigated
>topic. The present invention also provides a method for
>indexing the keywords using a keyword tree structure database so the data
>is in the correct format for analysis. The invention also
>provides a method for analyzing the number of occurrences of keywords
>along with the analysis of an impact factor associated with
>the keywords. The formatted data then allows the construction of several
>charts so a user can easily assess the state and forefront of a
>specified topic.
>Gretchen Whitney, PhD                                     tel 865.974.7919
>School of Information Sciences                            fax 865.974.4967
>University of Tennessee, Knoxville TN 37996 USA           gwhitney at

Loet Leydesdorff
Science & Technology Dynamics, University of Amsterdam
Amsterdam School of Communications Research (ASCoR)
Kloveniersburgwal 48, 1012 CX  Amsterdam
Tel.: +31-20-525 6598; fax: +31-20-525 3681 ; loet at
