​About the size of Google Scholar: playing the numbers

Stephen J Bensman notsjb at LSU.EDU
Thu Sep 11 09:40:42 EDT 2014

I thought that you had raised an important issue, so I was eager to read your paper, but I came away disappointed.  I thought that you would do with the GS database what Barabasi and others did with the entire WWW.  Had you done this, it would really have helped us to understand much better how search engines operate and why they are successful.  Unfortunately you did not.

The question really is the difference between defined and undefined databases.  WoS is the ultimate in a defined or prescribed database with definite and measurable sizes.  However, its prescriptions have been denounced as biased, restrictive, and distortive of reality.  In medicine sometimes the doctor gives the wrong prescription, and the patient dies.  The WWW is by nature undefined and open.  Therefore it can only be dealt with by taking proper samples.  Being undefined and open allows new perspectives.  For example, we found that Krugman was right: working papers are more important in the development of economics than journal articles.  I don’t think that this finding would have been possible with WoS, particularly before the Book Citation Index.  Now we have not only working papers but also blogs.  I do not think that defined databases can capture their importance, but I may be wrong.

In any case we seem to be in agreement on the main points, and we are both in the working paper stage, vetting ideas.  Therefore, we both may have a lot to learn from each other and should keep in contact.  I would be interested in your critiques and suggestions.


Stephen J Bensman, Ph.D.
LSU Libraries
Louisiana State University
Baton Rouge, LA 70803

From: ASIS&T Special Interest Group on Metrics [mailto:SIGMETRICS at LISTSERV.UTK.EDU] On Behalf Of Enrique Orduña
Sent: Thursday, September 11, 2014 4:22 AM
Subject: Re: [SIGMETRICS] ​About the size of Google Scholar: playing the numbers

Administrative info for SIGMETRICS (for example unsubscribe): http://web.utk.edu/~gwhitney/sigmetrics.html

Dear Stephen,

First, thank you very much for your critical and fruitful feedback, and for the studies you recommend, which we find extremely interesting. In fact, the work expressly devoted to Google Scholar is cataloged in our Google Scholar Digest Bibliography.

It is true that the word "contains", applied to Google Scholar, is perhaps not entirely accurate. We know that Google does not "possess" the documents; it is clear that Google is “simply” a search engine that serves as a bridge between the documents (wherever they are deposited: universities, repositories, journals, etc.) and end users.

In any case, it is true that Google Scholar classifies each record using its own metadata scheme. Therefore, it is a database of bibliographic references, which provide access to the document (sometimes the full text, sometimes a short abstract, sometimes nothing really).

A different question is whether or not it makes sense to know the size of this set of references. On this point I disagree with you: it is not a meaningless issue.

Your vision is company-oriented. Obviously Google is a search engine, and it wants to provide the best possible result for a given query. And what the user also wants is to get the best result. We agree.

However, as researchers devoted to bibliometrics and webometrics, we are interested in a better understanding of the processes related to scientific communication, and knowing the size and evolution of Google Scholar is fundamental to that. Scopus and WoS represent the elite, but sometimes we may want to know about the processes that happen outside the elite, and today this world outside the cream is represented by Google Scholar, though of course it does not cover everything that exists. Moreover, can we be sure that such cream is not identified by Google and provided to users as well?

Thus, of course PageRank (by the way, increasingly obsolete) is important for Google and for its users (most of us, and increasing), but we raised other questions apart from the commercial use of the product, regardless of the opacity, which we denounce in our work and which Isidro Aguillo also comments on in his latest email.

I look forward to reading your next working paper on this topic, and thanks again for your good feedback.



On Thu, Sep 11, 2014 at 8:46 AM, Isidro F. Aguillo <isidro.aguillo at cchs.csic.es> wrote:
Are you talking about Google Scholar?

The useless bibliographic tool that does not allow large data sets to be extracted?

The system that blocks access for your whole organization if you try to do so?

Are you suffering CAPTCHAs?

Is somebody able to talk with them and convince them to change their approach to our community?

On 10/09/2014 20:17, Stephen J Bensman wrote:
Enrique and Emilio.
I read your working paper with great interest as it deals with the same topic on which we are doing research here at LSU.  To tell you the honest truth, I had trouble with its basic premise, i.e., that Google Scholar (GS) has a given size.  I do not think that it does, and, if it does, it is meaningless.  The real problem is what is the size of the documentary set that is relevant to the search query.

The WWW and PageRank (the Google search engine) operate within what can be called the power-law or Lotkaian domain.  Informetric laws also operate within this domain.  On top of that, PageRank operates on what is called the probability ranking principle, by which the probability of relevance exponentially decreases as the number of inlinks decreases, i.e. below a certain point you are dealing with gibberish manufactured by the search engine itself.  Therefore, there is a need for left truncation and determination of what can be termed the x-min.  Since we are dealing with the Lotkaian domain, the x-min marks the point where the asymptote or “tail” on the x-axis for the items begins.
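
As a rough illustration of the left truncation described above, the following Python sketch drops every observation below a chosen x-min and estimates the power-law exponent of what remains. It uses the standard continuous maximum-likelihood estimator, alpha = 1 + n / sum(ln(x_i / x_min)), which is swapped in here for illustration; the function name, the sample inlink counts, and the value of x_min are all hypothetical and are not taken from the working papers discussed in this thread.

```python
import math


def truncated_power_law_alpha(values, x_min):
    """Left-truncate at x_min and return (tail, alpha_hat).

    alpha_hat is the continuous maximum-likelihood estimate
    alpha = 1 + n / sum(ln(x / x_min)) over the retained tail.
    """
    if x_min <= 0:
        raise ValueError("x_min must be positive")
    # Left truncation: discard everything below the threshold.
    tail = [x for x in values if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum == 0:
        raise ValueError("no values strictly above x_min")
    return tail, 1.0 + len(tail) / log_sum


# Hypothetical inlink counts, for illustration only.
inlinks = [1, 1, 2, 2, 3, 5, 8, 13, 40, 120]
tail, alpha = truncated_power_law_alpha(inlinks, x_min=5)
print(tail, round(alpha, 3))
```

The estimate depends heavily on where x_min is placed, which is why determining the x-min is treated above as the central problem.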

We are dealing with Nobelists, and what we have found is that with PageRank the set of relevant documents is conterminous with the researcher’s h-index and the “tail” of his GS citations distribution.  In other words, whether by serendipity or not, the h-index is an excellent estimate of the x-min of a GS citations distribution.  Below that is what the Germans would call a “Trümmerzone” or rubbish zone largely manufactured by the search engine itself.  This conterminousness is a validation of both the h-index and Google Scholar.  The relevance of the set is also proven by the fact that the extreme outliers on the right messing up the tail are usually works on the topics for which the Nobelist won the prize.  Case closed.
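
To make the idea of the h-index as an x-min concrete, here is a minimal Python sketch. It computes the h-index from a list of citation counts and then splits the list at that value into a "tail" and the zone below it. The function names and the citation counts are hypothetical, and the split rule (keep records with at least h citations) is only one possible reading of the description above, not the procedure used in the working papers.

```python
def h_index(citations):
    """Largest h such that h records have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def split_at_h(citations):
    """Treat the h-index as x-min: return (h, tail, zone below x-min)."""
    h = h_index(citations)
    tail = sorted((c for c in citations if c >= h), reverse=True)
    below = sorted((c for c in citations if c < h), reverse=True)
    return h, tail, below


# Hypothetical citation counts for one author's GS records.
counts = [950, 310, 42, 17, 9, 6, 5, 3, 1, 1, 0, 0]
h, tail, below = split_at_h(counts)
print(h, tail, below)
```

With these made-up counts the h-index is 6, so six records fall in the tail and the remaining six fall below the x-min.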

Every field has its statistical problem.  With medical research it is right truncation, for every patient has to die before the results are really known.  With the WWW and scientometric research, it is left truncation.

If you are interested in how I view how Google Scholar works, you can read our working papers at the following URLs:



I hope to post another working paper there next week that will really clinch the point.  But who knows?  I may be wrong.


Stephen J Bensman, Ph.D.
LSU Libraries
Louisiana State University
Baton Rouge, LA 70803

From: ASIS&T Special Interest Group on Metrics [mailto:SIGMETRICS at LISTSERV.UTK.EDU] On Behalf Of Enrique Orduña
Sent: Wednesday, September 10, 2014 5:15 AM
Subject: [SIGMETRICS] ​About the size of Google Scholar: playing the numbers

Dear Colleagues,

The purpose of this mail is to present our latest working paper, deposited on July 24, 2014.

We take on the seemingly intractable task of determining the size of the huge black hole that Google Scholar (GS) appears to be. As the title of the document suggests (About the size of Google Scholar: playing the numbers), we have begun to do the sums, and using 4 different empirical methods we estimate that the number of unique documents (different versions of a document are excluded) should not be less than 160 million (as of May 2014).

Regardless of this particular outcome, which is itself significant (especially when compared with other scientific databases, and because it gives us key clues about the amount of scientific knowledge that can be searched, found and accessed on the web), even more exciting is the methodological challenge that this task poses. Not only has it forced us to devise various techniques for measuring the size of the dark object that GS is, but in applying them we have also shed light, again, on various inconsistencies, uncertainties and limitations of the search interface tools used by Google. In short, we have learned more about what Google Scholar does and does not do, and we want to share it with you all.

This research comes at a good time. We are not only about to celebrate the 10th anniversary of GS but also hearing some voices (from somewhere in Europe…) that are finally willing to rely on Google Scholar for scientific evaluation.

Now that empirical studies (http://googlescholardigest.blogspot.com.es/p/bibliography.html) demonstrate every day that Google Scholar and its derivatives

a) measure with a credibility similar to that of traditional bibliometric indicators,
b) are the products most used by scientists (http://www.nature.com/news/online-collaboration-scientists-and-the-social-network-1.15711), and
c) have unfortunately finished off the competition (Microsoft Academic Search is in an unexplained hibernation, http://googlescholardigest.blogspot.com.es/2014/04/empirical-evidences-microsoft-academic-search-dead.html),

a certain euphoria seems to have been unleashed. We are pleased; better late than never…

However, without wanting to lower the expectations that have been raised, we emphasize that the problems of Google Scholar for scientific evaluation are not technical or methodological (coverage, reliability and validity of the measures, records filtering performance…). The fundamental limitations are those related to:

a) the ease with which GS indicators can be manipulated,

b) the transience of the results and measures (in many cases difficult to replicate stably),

c) the technological dependence on companies that develop tools that come and go on the consumer product market (http://ec3noticias.blogspot.com.es/2014/04/la-new-new-horizontes.html-bibliometrics).

Google Scholar enthusiasts are now welcome; meanwhile we will vigorously continue with what we proposed several years ago: to reveal with “data”, and not mere opinions, the inner workings of Google Scholar, and at the same time its strengths and weaknesses. So, like the old serials, we can only promise... TO BE CONTINUED…

​ Best,​

Enrique Orduña-Malea
Polytechnic University of Valencia

Emilio Delgado López-Cózar
Universidad de Granada



Isidro F. Aguillo, HonDr.
The Cybermetrics Lab, IPP-CSIC
Grupo Scimago
Madrid. SPAIN
isidro.aguillo at csic.es
ORCID 0000-0001-8927-4873
ResearcherID: A-7280-2008
Scholar Citations SaCSbeoAAAAJ
Twitter @isidroaguillo
Rankings Web http://webometrics.info






Enrique Orduña-Malea
Research staff.
Grupo de Investigación EC3. Instituto de Diseño y Fabricación (IDF).
Universidad Politécnica de Valencia (UPV).
Camino de Vera s/n, 46022 Valencia. Edificio 1H.
Tel. 96 3879480 (Ext. 79480)
