About the size of Google Scholar: playing the numbers

Stephen J Bensman notsjb at LSU.EDU
Fri Sep 12 09:55:21 EDT 2014

Enrique and Emilio,
Thank you for the thoughtful e-mail.  I have downloaded the Khabsa and Giles article and will look at your presentation (I hope there is an English version).  I am in the final stages of finishing what I think may be a major article and will incorporate your suggestions as much as possible.  I will be posting this article or working paper on arXiv soon and would like you to look it over before submission to a major journal.  We are at the cutting edge, where vested interests are threatened and disagreements deep, and it is necessary to see the objections and know how to counter them.

I do not think that it is the physical size of the GS universe that is important but its structure.  PageRank is nothing but a further development of Garfield’s theory of citation indexing, and what is important and fascinating here is that relevant sets are semantically defined not by words but by linkages.  This leads to something of a tautology in reasoning, because what emerges as relevant from linkages has to be defined again in words to be communicated.  Moreover, WWW structure is in the power-law or Lotkaian domain, and therefore linkages to sites are highly skewed, with a few sites having most of the linkages.  This results in two things: 1) the so-called “diameter” of the universe is very small in that it takes very few jumps to cross it; and 2) your probability of landing on an important, relevant site is much higher than that of landing on an unimportant, irrelevant one.  Size can increase exponentially with little effect on this.  This comprises the telemetry of PageRank, guiding it to its targets.  What works against this telemetry is the chaos governing the WWW due to its lack of authority structure.  Therefore, you need what you seem to be calling an “API” to edit and control your data, the best of which now is Harzing’s.  I am an enhance cataloger certified by the Library of Congress, and our biggest concern is authority structure, so I know its importance.
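
The two structural claims above (skewed linkages and a small "diameter") can be illustrated with a toy model. What follows is only a minimal sketch in Python, assuming the Barabási-Albert preferential-attachment model as a stand-in for WWW link structure; it is not PageRank and not a model of Google Scholar itself. The function names, the graph size of 2,000 nodes, and the random seeds are all illustrative assumptions.

```python
import random
from collections import deque
from statistics import mean, median


def preferential_attachment_graph(n, m, seed=0):
    """Toy Barabási-Albert-style graph: each new node links to m existing
    nodes chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    endpoints = []  # every node appears here once per link it has

    # Seed: a small complete graph on m + 1 nodes, so the graph is connected.
    for i in range(m + 1):
        for j in range(i + 1, m + 1):
            adj[i].add(j)
            adj[j].add(i)
            endpoints += [i, j]

    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            # Picking uniformly from `endpoints` favours high-degree nodes.
            targets.add(rng.choice(endpoints))
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
            endpoints += [new, t]
    return adj


def bfs_distances(adj, source):
    """Number of jumps from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


N = 2000  # illustrative size only
adj = preferential_attachment_graph(N, m=2)

# Skew: share of all link endpoints held by the top 1% of nodes.
degrees = sorted((len(nbrs) for nbrs in adj.values()), reverse=True)
top_share = sum(degrees[: N // 100]) / sum(degrees)

# "Diameter": average number of jumps from a sample of 20 start nodes.
sources = random.Random(1).sample(range(N), 20)
avg_hops = mean(
    mean(d for node, d in bfs_distances(adj, s).items() if node != s)
    for s in sources
)

print(f"max degree {degrees[0]}, median degree {median(degrees)}")
print(f"top 1% of nodes hold {top_share:.1%} of link endpoints")
print(f"average jumps between nodes: {avg_hops:.2f}")
```

Rerunning the sketch with a larger N is one way to probe, within this toy model only, the remark that size can grow with little effect on the number of jumps.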

Boy!?  I have really gone out on a limb here, and somebody may saw it off.


Stephen J Bensman, Ph.D.
LSU Libraries
Louisiana State University
Baton Rouge, LA 70803

From: ASIS&T Special Interest Group on Metrics [mailto:SIGMETRICS at LISTSERV.UTK.EDU] On Behalf Of Enrique Orduña
Sent: Friday, September 12, 2014 7:08 AM
Subject: Re: [SIGMETRICS] About the size of Google Scholar: playing the numbers

Administrative info for SIGMETRICS (for example unsubscribe): http://web.utk.edu/~gwhitney/sigmetrics.html
Dear Stephen,

Thank you very much again for your interesting email; it helps us to debate and discuss in a very focused way.

As regards the studies by Barabási and Albert (and Broder, Baeza-Yates and many other colleagues), we have certainly learned a lot from them. However, I think they belong to a different analytical context, as they are intended to measure the network, whereas we measure just the database (which obviously harvests the net, but filters its contents).

Therefore, our aim was to count the number of bibliographic records that Google Scholar contains. That is, Google Scholar is an academic search engine, but also a bibliographic database. And that number (and its evolution) gives us much information. In addition, a record in the Google Scholar database does not always correspond to an online digital object (a node in the academic network), since many records are mere citations that do not lead to any website.

In any case, apart from this point, which simply reflects our different research interests, I think we in fact agree about what a defined or an undefined universe of information is. Let us recommend the following presentation:

In addition, Google Scholar is able to measure not only scientific impact but also the professional and social impact of people, something that neither WoS nor Scopus does, and that certainly explains the high values in the performance of economists and other researchers in the social sciences and humanities.

We therefore believe that it is interesting to know the size of Google Scholar, also measured in this way. And it seems we are not the only ones to be concerned about this. We recommend the following work performed by Khabsa & Giles (2014), which certainly enlightened us on our way.

Khabsa & Giles (2014). “The number of scholarly documents on the public web.” PLoS ONE, 9(5): e93949. doi:10.1371/journal.pone.0093949

As regards Google Scholar’s opacity, while we agree that Google is a company and can do whatever it wants with its products, we should remember that the bibliographic records that Google Scholar catalogs and provides access to are harvested mostly from public entities (such as the institutional repositories of public universities). A little consideration for those institutions that feed its database for free would be appreciated, for example an API to be used for academic purposes. But this is a personal opinion.

And of course we can learn a lot from each other. We will be happy to read your work and to share our ideas and suggestions with you, and vice versa. We will stay in touch.

Kind regards,

Enrique Orduña-Malea & Emilio Delgado López-Cózar
