Quentin L. Burrell
Wed Jan 31 15:40:02 EST 2007

```I apologise for the fact that my last response to Steven Morris (SM) was
sent out before proper completion - my intention was to include endorsement
also of the points raised by Stephen Bensman (SJB).

(I also apologise to any list members who find the following somewhat
patronising, but over the (many) years of teaching and reading regression
analyses I have found that a major problem is that of performing the
appropriate analysis without looking at the data!)

To illustrate at a very simple (and artificial) level, consider a set of 6
papers in e.g. scientometrics for which the (references, citations) counts
are (0,2), (1,1), (2,0), (18,20), (19,19) and (20,18) respectively. A
straightforward regression analysis yields:
citations = 0.16 + 0.98 * references
with R-squared = 0.97 (correlation coefficient = 0.984).

Impressive, but following SM, if we plot the scatter diagram we see two
"clusters", made up of the first 3 points and the last 3, which reflect very
different "patterns". If we now find from the context of the data that the
first set corresponds (say) to mathematical presentations and the latter to
non-mathematical ones, then maybe we should be analysing them separately. If
we do, then we find that the correlation coefficient for each category is -1,
perfect negative correlation in each case!
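
The figures above can be checked with a short Python sketch. It is only an illustration: the helper name fit_line and the split of the six points into a "low" and a "high" cluster are assumptions made here, chosen to match the two groups described in the text.

```python
from math import sqrt

def fit_line(points):
    """Least-squares fit y = a + b*x and Pearson r for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return intercept, slope, r

# (references, citations) for the six artificial papers in the example
papers = [(0, 2), (1, 1), (2, 0), (18, 20), (19, 19), (20, 18)]
low = papers[:3]    # assumed here to be the "mathematical" papers
high = papers[3:]   # assumed here to be the "non-mathematical" papers

pooled_fit = fit_line(papers)
low_fit = fit_line(low)
high_fit = fit_line(high)

a, b, r = pooled_fit
print(f"pooled: citations = {a:.2f} + {b:.2f} * references, "
      f"r = {r:.3f}, R-squared = {r * r:.2f}")
print(f"low cluster r = {low_fit[2]:.1f}, high cluster r = {high_fit[2]:.1f}")
```

The pooled fit is driven almost entirely by the gap between the two clusters, which is why it can be strongly positive while the relationship within each cluster is exactly the opposite.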

This simplistic presentation also hints at SJB's plea for a more subtle
analysis including an "exogenous subject variable".

Thanks to SM and SJB for highlighting some of the possible pitfalls in
regression analysis, but thanks also to Ali for his interesting analysis.

I think that my main point is that when performing any sort of
mathematical/statistical analysis one has to take full account of the data
context, not just the data.

Best wishes

Quentin

From: "Stephen J Bensman"
Sent: Tuesday, January 30, 2007 5:17 PM
Subject: Re: [SIGMETRICS] question

>
> OK, Ali, I have read your paper, and a nice piece of work it is.  However,
> I want to make one criticism.  You failed to classify your 467
> scientometric papers into subject subsets.  It may well be that certain
> scientometric topics may both have more references per paper and be more
> prone to be cited.  Therefore, your finding of the high positive
> relationship between the number of references and the number of citations
> may be an artifact of an exogenous subject variable.  I hope that you do
> not take this as a criticism but as an opportunity to squeeze another
> paper out of the same set of data.
>
> SB
>
ali uzun
> -----Dear Stephen,
> I am sending an electronic version of the paper. The statistical
> relationship between the two categories (citations received and
> references listed) is of a predictive type. There is no cause-and-effect
> relation.
>
> Prof. Dr. Ali Uzun
> Dept. Stat. Middle East Technical Univ. Ankara-Turkey.
>> ------Dear Ronald,
>> A sample of 467 articles (not including reviews) published from 1999 to
>> 2003 in the journal Scientometrics has shown that there is a linear
>> correlation (correlation coefficient of 0.799) between the number of
>> times an article is cited and the number of references it contains.
>> This was supported by a Chi-Square test of independence between the
>> two indicators at the 0.01 level of significance (Uzun, A. (2006).
>> Proceedings of the International Workshop on Webometrics, Informetrics
>> and Scientometrics, 87-91, 10-12 May 2006, Nancy-France).
