new versions of ti.exe and fulltext.exe for co-word (network) analysis
Loet Leydesdorff
loet at LEYDESDORFF.NET
Tue Feb 1 02:30:44 EST 2011
Dear colleagues,
I have extended and updated both ti.exe
<http://www.leydesdorff.net/software/ti/index.htm> and fulltext.exe
<http://www.leydesdorff.net/software/fulltext/index.htm>. Ti.exe generates
a semantic co-word map from a set of lines of no more than 1000 characters
each; fulltext.exe does the same for a set of documents. The extensions are
as follows:
First, both programs now also write the word-document matrix as matrix.txt
(comma-separated values). This file can be read into SPSS when there are
more than 255 variables. The file labels.sps can be used for labeling the
variables, up to a maximum of 1023 variables. (SPSS does not read more than
255 variables from a .dbf or .xls file.) Both programs have a limit of 1023
words; the number of records (documents) is limited only by the available
disk space. (A manual for applications to content analysis can be found here:
<http://www.leydesdorff.net/indicators/pajekmanual.2010.pdf>.)
After generating the word-document matrix and the Pajek input files
(cosine.dat and coocc.dat; cf. Leydesdorff <http://arxiv.org/abs/1011.5209>
& Welbers, in press), the program prompts with the question of whether one
wishes additionally to run the same routines with observed/expected values.
This generates obsexp.dbf (analogous to matrix.dbf), obsexp.txt (analogous
to matrix.txt), and coocc_oe.dat and cos_oe.dat, analogous to the input
files for Pajek (see the websites), but now containing or operating on the
observed/expected values instead of the observed ones. Note that answering
"y" (yes) extends the processing time of the original routine; therefore,
the default is "n". The SPSS syntax file labels.sps is the same because the
variables are not changed.
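As an illustration of what the observed/expected option computes, the following Python sketch derives expected values from the row and column totals of a small word-document matrix and divides the observed values by them. It assumes the usual chi-square definition of the expected value (row total * column total / grand total), which this message does not spell out. The function names and the toy matrix are hypothetical, and this is not the code of ti.exe or fulltext.exe.

```python
def expected_values(observed):
    """Assumed chi-square expectation: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def observed_over_expected(observed):
    """Cell-wise observed/expected ratio; cells with expectation 0 are set to 0.0."""
    expected = expected_values(observed)
    return [
        [o / e if e else 0.0 for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(observed, expected)
    ]


# Toy matrix (hypothetical data): 3 documents (rows) x 3 words (columns).
observed = [
    [2, 0, 1],
    [0, 3, 1],
    [1, 1, 0],
]

expected = expected_values(observed)
ratios = observed_over_expected(observed)

for row in ratios:
    print(["%.3f" % value for value in row])
```

Values above 1 indicate that a word occurs in a document more often than expected from the margins, and values below 1 that it occurs less often.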
Similarly, one can use (with the same variable labels) the file TfIdf.dbf
which contains the "term frequency-inverse document frequency" values as
used in library and information science. The expected values are separately
stored in the file expected.dbf. The file obs_exp.dbf contains the signed
(!) differences between observed and expected values at the cell level of
the matrix. (These are the non-standardized residuals of the chi-square.)
The corresponding Pajek files can be generated by replacing the matrix
values in cos_oe.dat with, for example, the cosine matrix of the values in
TfIdf.dbf. (Cosine values can be generated in SPSS under Analyze > Correlate
> Distances.) One can also replace the matrix values with the non-normalized
values. This should work without problems. Note that the number of cases can
be different because rows with no values other than zero are in some cases
removed in order to prevent divisions by zero in the computation. However,
the number and order of the variables remain the same.
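For those who prefer to compute the cosine matrix outside SPSS, here is a minimal Python sketch of that step. It assumes the matrix has already been loaded as a list of rows (documents) with one column per word, drops rows that contain only zeros, and then applies the standard cosine between column vectors. The function names and the toy values are hypothetical and do not come from the programs themselves.

```python
import math


def drop_zero_rows(matrix):
    """Remove rows (documents) that contain no value other than zero."""
    return [row for row in matrix if any(value != 0 for value in row)]


def cosine_matrix(matrix):
    """Cosine between each pair of columns (word vectors) of the matrix."""
    columns = list(zip(*matrix))
    norms = [math.sqrt(sum(v * v for v in col)) for col in columns]
    size = len(columns)
    result = [[0.0] * size for _ in range(size)]
    for a in range(size):
        for b in range(size):
            # A column with zero norm has no defined cosine; leave it at 0.0.
            if norms[a] and norms[b]:
                dot = sum(x * y for x, y in zip(columns[a], columns[b]))
                result[a][b] = dot / (norms[a] * norms[b])
    return result


# Toy values (hypothetical), e.g. tf-idf weights: 3 documents x 3 words.
values = [
    [1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],  # all-zero row, removed before the computation
    [1.0, 2.0, 0.0],
]

cleaned = drop_zero_rows(values)
cosines = cosine_matrix(cleaned)

for row in cosines:
    print(["%.3f" % value for value in row])
```

The resulting square matrix has one row and one column per word, in the original order of the variables, so it can take the place of the matrix values in a Pajek input file.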
After processing, the file words.dbf contains for all words the following
summations over the (column) vectors for each word:
1. A variable named "Chi_Sq" which provides the chi-square contributions
for each of the variables; these are defined for word j as
χ^2_j = Σ_i (Observed_ij - Expected_ij)^2 / Expected_ij. In other words, this
is the sum over the rows i of the contributions in the column for the
variable (Mogoutov et al., 2008);
2. A variable named "Obs_Exp" which provides the sum of |Observed -
Expected| for the word as a variable summed over the column;
3. A variable named "ObsExp" which provides the quotient of Obs/Exp for
the word as a variable summed over the column;
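4. A variable named "TfIdf" (that is, Term Frequency * Inverse Document
Frequency) defined as follows: Tf-Idf_ik = FREQ_ik * log2(n / DOCFREQ_k),
where FREQ_ik is the frequency of term k in document i, n is the number of
documents, and DOCFREQ_k is the number of documents in which term k occurs.
This function assigns a high degree of importance to terms occurring in only
a few documents in the collection (Salton & McGill, 1983, p. 63);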
5. The word frequency within the set.
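The five summations listed above can be sketched in Python as follows. This is only an illustration under stated assumptions: expected values are taken as row total * column total / grand total, cells with an expected value of zero are skipped, and the dictionary keys simply mirror the variable names in words.dbf ("Freq" is a hypothetical name for the word frequency). The actual routines in ti.exe and fulltext.exe may differ in detail.

```python
import math


def word_summaries(observed):
    """Per-word (column) summaries of a document x word matrix of counts."""
    n_docs = len(observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    summaries = []
    for k, column in enumerate(zip(*observed)):
        chi_sq = abs_diff = ratio_sum = 0.0
        for i, obs in enumerate(column):
            # Assumed chi-square expectation for cell (i, k).
            exp = row_totals[i] * col_totals[k] / grand_total
            abs_diff += abs(obs - exp)
            if exp > 0:
                chi_sq += (obs - exp) ** 2 / exp
                ratio_sum += obs / exp
        doc_freq = sum(1 for value in column if value > 0)
        idf = math.log2(n_docs / doc_freq) if doc_freq else 0.0
        summaries.append({
            "Chi_Sq": chi_sq,
            "Obs_Exp": abs_diff,
            "ObsExp": ratio_sum,
            "TfIdf": sum(value * idf for value in column),
            "Freq": col_totals[k],
        })
    return summaries


# Toy matrix (hypothetical data): 3 documents (rows) x 3 words (columns).
observed = [
    [2, 0, 1],
    [0, 3, 1],
    [1, 1, 1],
]

summaries = word_summaries(observed)
for k, summary in enumerate(summaries):
    print(k, {name: round(value, 3) for name, value in summary.items()})
```

In the toy matrix the third word occurs in every document, so log2(n / DOCFREQ) is zero and its TfIdf sum vanishes, whereas the first two words, which each occur in two of the three documents, receive positive values.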
These programs were originally written for DOS as 16-bit applications.
64-bit machines are increasingly unable to run them (unless one downloads
Virtual PC and runs it in XP mode). I have therefore uploaded versions which
are recompiled for 32-bit Windows. These versions are much larger (> 7 MByte)
and more error-prone: I recompiled them (using a different compiler) without
systematic debugging, but I will respond to feedback about error messages.
One can disregard error messages which do not stop the program. The versions
for 64-bit machines can be found here: ti.exe
<http://www.leydesdorff.net/software.64bit/ti.exe> and fulltext.exe
<http://www.leydesdorff.net/software.64bit/fulltext.exe>.
Error messages are displayed when working from the command prompt. (These
programs still look like DOS programs, but they are fully 32-bit Windows
programs.) One can run the old versions on older machines, or using Virtual
PC in XP mode on 64-bit machines. Please consider these programs as legacy
software; you are most welcome to use them for scholarly purposes, but at
your own risk!
Best wishes,
Loet
References:
Loet Leydesdorff & Kasper Welbers (2011). The semantic mapping of words and
co-words in contexts. Journal of Informetrics (in press); preprint version
available at <http://arxiv.org/abs/1011.5209>.
Esther Vlieger & Loet Leydesdorff (2009). "How to analyze frames using
semantic maps of a collection of messages? Pajek Manual." Amsterdam:
University of Amsterdam.
<http://www.leydesdorff.net/indicators/pajekmanual.2010.pdf>
** apologies for cross-postings
_____
Loet Leydesdorff
Professor, University of Amsterdam
Amsterdam School of Communications Research (ASCoR)
Kloveniersburgwal 48, 1012 CX Amsterdam.
Tel. +31-20-525 6598; fax: +31-842239111
loet at leydesdorff.net; http://www.leydesdorff.net/
Visiting Professor, ISTIC, Beijing <http://www.istic.ac.cn/Eng/brief_en.html>;
Honorary Fellow, SPRU, University of Sussex <http://www.sussex.ac.uk/spru/>