About the size of Google Scholar: playing the numbers

Enrique and Emilio,
Thank you for the kind offer to look over our working paper when we have it posted on arXiv.  We should have it posted by the week after next.  I have finished the first draft and sent it out to my collaborators for comment.  To whet your appetite, here is the title and abstract.

“This paper comprises an analysis of whether Google Scholar (GS) can construct documentary sets relevant for the evaluation of the works of researchers.  The researchers analyzed were two samples of Nobelists in economics: an original sample of five laureates downloaded in September, 2011; and a validating sample of laureates downloaded in October, 2013.  Two methods were utilized to conduct this analysis.  The first is distributional.  Here it is shown that the distributions of the laureates’ works by total GS citations belong within the Lotkaian or power-law domain, whose major characteristic is an asymptote or “tail” to the right.  It is also shown that this asymptote is conterminous with the laureates’ h-indexes, which demarcate their core œuvre.  This overlap is proof of both the ability of GS to form relevant documentary sets and the validity of the h-index.  The second method is semantic.  This method shows that the extreme outliers at the right tip of the tail, a signature feature of the economists’ distributions, are not random events but are related by subject to the contributions to the discipline for which the laureates were awarded this prize.  Another interesting finding is the important role played by working papers in the dissemination of new economic knowledge.”
Since I am vetting these ideas, I decided to cc the entire list.  I hope that you do not mind.
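
To make the h-index and the “core œuvre” it demarcates concrete, here is a minimal sketch in Python. It is not the method used in the working paper; the function names are hypothetical and the citation counts are invented purely for illustration.

```python
def h_index(citations):
    """Return the largest h such that h works have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def core_works(citations):
    """Return the citation counts of the h most-cited works (the h-core)."""
    ranked = sorted(citations, reverse=True)
    return ranked[:h_index(citations)]


# Invented, highly skewed citation counts: a few works carry most citations.
example_citations = [1200, 450, 90, 30, 12, 7, 5, 3, 1, 0]
example_h = h_index(example_citations)
example_core = core_works(example_citations)
```

In a skewed distribution like this one, the h-core picks out the few heavily cited works in the right-hand tail and leaves the many lightly cited ones outside it.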

Dear Stephen,

Thanks again for your email. Of course we will be pleased to have a look at your preprint once you finish it.

As regards your research focus, I understand what you mean. The topology of the net is meaningful in itself, and sometimes the structure becomes the information. The eigenvector centrality of each node can tell us which pieces are most relevant for a specific query. And net metrics are relevant here.

But this is just one way to look at Google Scholar (clearly one of much interest). We just wanted to know a) the total number of nodes (treating each bibliographic record as a node), because this influences which documents are available to form the set of documents that fit a query, and b) whether Google Scholar lets us know this size.

Of course, the growth in total size is not going to affect either the topology (at least in the short term) or the influence of particular nodes. Our focus was not on the connections between nodes (citations, mentions, links...) but on the proportion of current academic literature covered. Of course this is a first step, and we want to know more now.

And do not worry...we usually go out on a limb too...

Enrique & Emilio

Enrique and Emilio,
Thank you for the thoughtful e-mail.  I have downloaded the Khabsa and Giles article and will look at your presentation (I hope there is an English version).  I am in the final stages of finishing what I think may be a major article and will incorporate your suggestions as much as possible.  I will be posting this article or working paper on arXiv soon and would like you to look it over before submission to a major journal.  We are at the cutting edge, where vested interests are threatened and disagreements deep, and it is necessary to see the objections and know how to counter them.

I do not think that it is the physical size of the GS universe that is important but its structure.  PageRank is nothing but a further development of Garfield’s theory of citation indexing, and what is important and fascinating here is that relevant sets are semantically defined not by words but by linkages.  This leads to something of a tautology in reasoning, because what emerges as relevant from linkages has to be defined again in words to be communicated.  Moreover, WWW structure is in the power-law or Lotkaian domain, and therefore linkages to sites are highly skewed, with few sites having most linkages.  This results in two things: 1) the so-called “diameter” of the universe is very small in that it takes very few jumps to cross it; and 2) your probability of landing on an important, relevant site is much higher than that of landing on an unimportant, irrelevant one.  Size can increase exponentially with little effect on this.  This comprises the telemetry of PageRank, guiding it to its targets.  What works against this telemetry is the chaos governing the WWW due to its lack of authority structure.  Therefore, you need what you seem to be calling an “API” to edit and control your data, the best of which now is Harzing’s.  I am an enhance cataloger certified by the Library of Congress, and our biggest concern is authority structure, so I know its importance.

Boy!?  I have really gone out on a limb here, and somebody may saw it off.
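
The point about skewed linkages can be illustrated with a minimal sketch of PageRank by power iteration, written in plain Python. This is a textbook-style simplification, not Google's implementation; the toy graph, the node names, and the damping value of 0.85 are assumptions chosen only for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Approximate PageRank scores for a dict mapping node -> list of linked nodes."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives a small baseline share (the random jump).
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # A node passes its rank evenly along its outgoing links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node with no outgoing links spreads its rank over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: every site links to "hub"; "hub" links only to "a".
toy_links = {"hub": ["a"], "a": ["hub"], "b": ["hub"], "c": ["hub"], "d": ["hub"]}
toy_ranks = pagerank(toy_links)
most_important = max(toy_ranks, key=toy_ranks.get)
```

In this toy graph the heavily linked site collects most of the rank, so a surfer following links at random spends far more time there than on the sparsely linked sites.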


Dear Stephen,

Thank you very much again for your interesting email; it helps us to debate and discuss in a very focused way.

As regards the studies by Barabási and Albert (and by Broder, Baeza-Yates and many other colleagues), we have certainly learned a lot from them. However, I think they belong to a different analytical context: they set out to measure the network, whereas we measure only the database (which obviously harvests the net, but filters its contents).

Therefore, our aim was to count the number of bibliographic records that Google Scholar contains. That is, Google Scholar is an academic search engine, but also a bibliographic database. And that number (and its evolution) gives us much information. In addition, a record in the Google Scholar database does not always correspond to an online digital object (a node in the academic network), since many records are mere citations that do not lead to any website.

In any case, apart from this point, which simply reflects our different research interests, I think we in fact agree about what constitutes a defined or undefined universe of information. We would recommend the following presentation:

In addition, Google Scholar is able to measure not only scientific impact but also the professional and social impact of people, something that neither WoS nor Scopus does, and that certainly explains the high values in the performance of economists and other researchers in the social sciences and humanities.

We therefore believe that it is interesting to know the size of Google Scholar, also measured in this way. And it seems we are not the only ones concerned about this. We recommend the following work by Khabsa & Giles (2014), which certainly enlightened us on our way.

Khabsa & Giles (2014). “The number of scholarly documents on the public web.” PLoS ONE, 9(5): e93949. doi:10.1371/journal.pone.0093949

As regards Google Scholar’s opacity, while we agree that Google is a company and can do whatever it wants with its products, we should remember that the bibliographic records that Google Scholar catalogs and provides access to are harvested mostly from public entities (such as the institutional repositories of public universities). A little consideration for those institutions that feed its database for free would be appreciated, for example an API to be used for academic purposes. But this is a personal opinion.

And of course we can learn a lot from each other. We will be happy to read your work and share our ideas and suggestions with you, and vice versa. We will be in touch.

Kind regards,

Enrique Orduña-Malea & Emilio Delgado López-Cózar



