The routines ti.exe (at
and fulltext.exe (at
now additionally produce an output file "words.dbf" (readable in Excel)
which contains the following statistics for each word: 

1.	A variable named "Chi_Sq" which provides the Chi-square contribution
of each variable (that is, each word); for word(i) it is defined as
χ2(i) = Σ(j) [Observed(ij) - Expected(ij)]^2 / Expected(ij), where j runs
over the rows. In other words, the contributions of the cells in the
word's column are summed over all rows (Mogoutov et al., 2008); 
2.	A variable named "ObsExp" which provides the sum of absolute values
|Observed - Expected| for the word as a variable summed over the column;
3.	A variable named "TfIdf" which uses Salton & McGill's (1983: 63)
TermFrequency-InverseDocumentFrequency measure (but without Salton's
additional + 1; Magerman et al., 2007) defined as follows: WEIGHT(ik) =
FREQ(ik) * [log2 (n) - log2 (DOCFREQ(k))]. This function assigns a high
degree of importance to terms occurring in only a few documents in the set;
4.	The word frequency within the set. 
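
The four statistics above can be sketched in Python as follows. This is only an illustration, not the code of ti.exe or fulltext.exe. It assumes a matrix with documents as rows and words as columns, the usual contingency-table expectation Expected(ij) = row total * column total / grand total, and a TfIdf value summed over all documents; none of these details is specified above. The function name word_statistics and the key "Freq" are hypothetical.

```python
import math


def word_statistics(matrix):
    """Return one dict of statistics per word (column).

    matrix[d][w] is the frequency of word w in document d.
    Assumption: expected values follow the standard chi-square
    convention, row total * column total / grand total.
    """
    n_docs = len(matrix)
    n_words = len(matrix[0])
    row_tot = [sum(row) for row in matrix]
    col_tot = [sum(row[w] for row in matrix) for w in range(n_words)]
    grand = sum(row_tot)

    stats = []
    for w in range(n_words):
        chi_sq = 0.0
        obs_exp = 0.0
        for d in range(n_docs):
            expected = row_tot[d] * col_tot[w] / grand
            if expected > 0:  # skip empty rows or columns
                diff = matrix[d][w] - expected
                chi_sq += diff * diff / expected
                obs_exp += abs(diff)

        # Number of documents in which the word occurs at least once.
        doc_freq = sum(1 for row in matrix if row[w] > 0)
        # log2(n) - log2(DOCFREQ), without the additional + 1.
        idf = math.log2(n_docs) - math.log2(doc_freq) if doc_freq else 0.0
        # Assumption: FREQ(ik) * idf summed over all documents i.
        tf_idf = col_tot[w] * idf

        stats.append({
            "Chi_Sq": chi_sq,
            "ObsExp": obs_exp,
            "TfIdf": tf_idf,
            "Freq": col_tot[w],
        })
    return stats


if __name__ == "__main__":
    # Toy input: two documents, two words, each word in one document only.
    for row in word_statistics([[2, 0], [0, 2]]):
        print(row)
```

In this sketch a word that occurs in every document receives a TfIdf of zero, because log2(n) - log2(DOCFREQ) vanishes when DOCFREQ equals n.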

These statistics allow the researcher to refine the list of words to be
included in the analysis. 
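
One possible way to use these statistics is to rank the words on one of the variables and keep only the highest-scoring ones. The sketch below assumes that the records of "words.dbf" have already been read into a list of dictionaries; the function refine_words, the key "Word", and the sample values are hypothetical and serve only as illustration. No particular selection rule is prescribed here.

```python
def refine_words(records, key, top_n):
    """Return the top_n words ranked by the statistic named in key.

    records is a list of dicts, one per word. Python's sort is
    stable, so tied words keep their original order.
    """
    ranked = sorted(records, key=lambda rec: rec[key], reverse=True)
    return [rec["Word"] for rec in ranked[:top_n]]


if __name__ == "__main__":
    # Made-up values, for illustration only.
    sample = [
        {"Word": "alpha", "TfIdf": 5.0},
        {"Word": "beta", "TfIdf": 0.0},
        {"Word": "gamma", "TfIdf": 2.5},
    ]
    print(refine_words(sample, "TfIdf", 2))
```

The same function can rank on "Chi_Sq" or "ObsExp" by changing the key argument.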


Magerman, T., Van Looy, B., & Song, X. (2007). Exploring the feasibility and
accuracy of Latent Semantic Analysis based text mining techniques to detect
similarity between patent documents and scientific publications. Paper
presented at the 6th Triple Helix Conference, 16-19 May 2007, Singapore.

Mogoutov, A., Cambrosio, A., Keating, P., & Mustar, P. (2008). Biomedical
innovation at the laboratory, clinical and commercial interface: A new
method for mapping research projects, publications and patents in the field
of microarrays. Journal of Informetrics (in press).


