About the size of Google Scholar: playing the numbers

Thu Sep 11 15:50:41 EDT 2014

William,
You may be right.  It may not have been a Mendeley Webinar, but it was an Elsevier Webinar.  If I remember correctly, it was an Elsevier Webinar on how librarians can help their faculty achieve greater recognition, and the librarian was Dutch.  There was another Webinar on the same topic, with a British librarian who made the same recommendation.  It has become a common recommendation, and elements of the LSU administration are also starting to make it.  The LSU Libraries administration has, and that is why I made my profile public, even though I swore not to.  It is an easy way for the administration to keep tabs on the faculty.

There is a lot of discontent with Google, and perhaps they will take it further.  But they are facing a major problem that is faced by all databases.  The Web has no authority structure, and it may be impossible to rate journals, etc., with hyperlinks until one is constructed, if one can be constructed.  However, I do find the Google attempts to do this quite interesting, because you begin to see the importance of repositories and other methods of dissemination.

Respectfully,

Steve B.

From: ASIS&T Special Interest Group on Metrics [mailto:SIGMETRICS at LISTSERV.UTK.EDU] On Behalf Of William Gunn
Sent: Thursday, September 11, 2014 1:21 PM
To: SIGMETRICS at LISTSERV.UTK.EDU
Subject: Re: [SIGMETRICS] About the size of Google Scholar: playing the numbers

Administrative info for SIGMETRICS (for example unsubscribe): http://web.utk.edu/~gwhitney/sigmetrics.html

Steven, I can assure you that if someone from Elsevier suggested that a researcher put a link to a GS profile in their CV during a Mendeley seminar, they were very much off message.

Google Scholar did a great thing by making a comprehensive index of literature and profiles available to anyone, but they seem to have stopped there. The index includes quite a bit of nonsense, which makes its size less impressive and its counts questionable. Furthermore, the lack of an API, in contrast to the good API support for their other products, is not accidental. It's a deliberate decision by the GS team, which should concern anyone considering institutionalizing their use of GS.

I have a lot of respect for the GS team, and they did a great thing going as far as they did to make an index available; I just wish they could have gone one step further.

Best,

William Gunn | Head of Academic Outreach, Mendeley | @mrgunn
http://www.mendeley.com/profiles/william-gunn | (650) 614-1749
On Sep 11, 2014 8:43 AM, "Stephen J Bensman" <notsjb at lsu.edu> wrote:
Isidro and Yves,
That is very interesting.  How did you find out what the highest h-index in GS Citations is?  Bourdieu died in 2002, before GS existed, so how did his h-index become public?

Like it or not, Google Scholar is the biggest game in town, and Elsevier seems to recognize this.  I have attended its Mendeley webinars, and it is here that the presenters advised that the URL to your GS citations be posted on your site and pasted into your CV.

Google has so many fish to fry, and I regret that it is not putting more effort into its scholarly and library operations.  Google used to have representatives at ALA conventions, and its exhibit was the most interesting.  They are not there anymore.  It has to be recognized that in digitizing books Google did in five years what libraries estimated would take a century.  They do amaze me, but something better will come along.

Stephen J Bensman
LSU Libraries
Louisiana State University
Baton Rouge, LA 70803
USA

From: ASIS&T Special Interest Group on Metrics [mailto:SIGMETRICS at LISTSERV.UTK.EDU] On Behalf Of Yves Gingras
Sent: Thursday, September 11, 2014 10:21 AM
To: SIGMETRICS at LISTSERV.UTK.EDU
Subject: Re: [SIGMETRICS] About the size of Google Scholar: playing the numbers

Hello

Having worked with Bourdieu before his untimely death in 2002, I cannot let this opportunity pass without suggesting that, in addition to looking at h-indexes for the fun of it (which, by the way, we do not need in order to know that Bourdieu is among the very few great sociologists of the second half of the 20th century), people read his book Science of Science and Reflexivity (University of Chicago Press, 2004). He talks briefly about scientometrics on p. 14, and putting more reflexive sociology into our thinking before counting would be welcome...

After “slow science”, why not a new motto for all of us: “slow bibliometrics: thinking before counting”?

Best regards

Yves Gingras

On 11/09/14 10:34, "Isidro F. Aguillo" <isidro.aguillo at CCHS.CSIC.ES> wrote:
Dear Stephen,

 Thanks for your comments. I understand the private nature of Google, but Mendeley (owned by Elsevier) and other similar biblio/altmetric sources are also commercially backed companies, and they offer good APIs for in-depth, large-scale data analysis.

 As a matter of curiosity, I checked the largest h-index in Google Scholar Citations, and it appears to be:

Pierre Bourdieu

Centre de Sociologie Européenne, Collège de France
 http://scholar.google.com/citations?user=d_lp40IAAAAJ&hl=en

 Citations    361973
 h-index             207

 Any better candidates?

 On 11/09/2014 16:13, Stephen J Bensman wrote:

Isidro,

Unfortunately, Google is a cautious private-enterprise company with commercial interests and secrets.  For example, it is very cautious when it comes to copyright.  I really hate it when I find a book chapter of interest to me but cannot download it or copy/paste it.  Moreover, with Google Scholar Citations it lets you choose whether you want your profile public or private.  That keeps the door open for Harzing’s Publish-or-Perish program.  Google does not want any lawsuits resulting from making your private data public without your permission.

Google allows you large enough samples for most purposes.  For example, when it comes to individuals, the main measure appears to be the h-index.  For analytical purposes, your h-index has to be above 50 to provide a proper sample.  Few people have h-indexes above 50, and I know of none with an h-index above 1000.

Google’s database is Google’s private property.  It can do with it what it wants.  I imagine that—like Thomson Reuters—you could purchase a lot of data from it.  However, you may be some sort of Bolshevik, who wants the right to expropriate it.   As for the uselessness of Google Scholar, I will quote your compatriots below:

“Now, when empirical studies (http://googlescholardigest.blogspot.com.es/p/bibliography.html) demonstrate every day that Google Scholar and its derivatives

a) measure with similar credit to traditional bibliometric indicators,

b) are the most used products by scientists (http://www.nature.com/news/online-collaboration-scientists-and-the-social-network-1.15711),”

Why don’t you take up your case with them as well?

Respectfully,

Stephen J Bensman

LSU Libraries

Louisiana State University

Baton Rouge, LA 70803

From: ASIS&T Special Interest Group on Metrics [mailto:SIGMETRICS at LISTSERV.UTK.EDU] On Behalf Of Isidro F. Aguillo
 Sent: Thursday, September 11, 2014 1:46 AM
 To: SIGMETRICS at LISTSERV.UTK.EDU
 Subject: Re: [SIGMETRICS] About the size of Google Scholar: playing the numbers


Are you talking about Google Scholar?

 The useless bibliographic tool that does not allow you to extract large data sets?

 The system that blocks access for your whole organization if you try to do so?

 Are you suffering from CAPTCHAs?

 Is somebody able to talk with them and convince them to change their approach to our community?

 On 10/09/2014 20:17, Stephen J Bensman wrote:

Enrique and Emilio.

I read your working paper with great interest as it deals with the same topic on which we are doing research here at LSU.  To tell you the honest truth, I had trouble with its basic premise, i.e., that Google Scholar (GS) has a given size.  I do not think that it does, and, if it does, it is meaningless.  The real problem is the size of the documentary set that is relevant to the search query.

The WWW and PageRank (the algorithm behind the Google search engine) operate within what can be called the power-law or Lotkaian domain.  Informetric laws also operate within this domain.  On top of that, PageRank operates on what is called the probability ranking principle, by which the probability of relevance exponentially decreases as the number of inlinks decreases, i.e., below a certain point you are dealing with gibberish manufactured by the search engine itself.  Therefore, there is a need for left truncation and determination of what can be termed the x-min.  Since we are dealing with the Lotkaian domain, the x-min marks the point where the asymptote or “tail” on the x-axis for the items begins.

We are dealing with Nobelists, and what we have found is that with PageRank the set of relevant documents is conterminous with the researcher’s h-index and the “tail” of his GS citations distribution.  In other words, whether by serendipity or not, the h-index is an excellent estimate of the x-min of a GS citations distribution.  Below that is what the Germans would call a “Trümmerzone” or rubbish zone largely manufactured by the search engine itself.  This conterminous-ness is a validation of both the h-index and Google Scholar.  The relevance of the set is also proven by the fact that the extreme outliers on the right messing up the tail are usually works on the topics for which the Nobelist won the prize.  Case closed.
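The idea of using the h-index as a left-truncation point can be illustrated with a minimal sketch. It is not the procedure used in the LSU working papers; it only assumes that "x-min" can be read as the rank cutoff given by the h-index, so that the papers ranked 1 to h are kept and everything below is discarded. The function names and the sample citation counts are hypothetical.

```python
from typing import List


def h_index(citations: List[int]) -> int:
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def left_truncate_at_h(citations: List[int]) -> List[int]:
    """Keep only the h most-cited papers (assumption: h is the cutoff rank)."""
    ranked = sorted(citations, reverse=True)
    return ranked[: h_index(citations)]


# Hypothetical citation counts for one author (made-up numbers).
sample = [120, 45, 30, 8, 5, 4, 2, 1, 0, 0]
print(h_index(sample))             # 5
print(left_truncate_at_h(sample))  # [120, 45, 30, 8, 5]
```

In this toy case the five retained papers stand in for the "tail" of the distribution, and the five discarded ones for the zone below the cutoff.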

Every field has its statistical problem.  With medical research it is right truncation, for every patient has to die before the results are really known.  With the WWW and scientometric research, it is left truncation.

If you are interested in how I view how Google Scholar works, you can read our working papers at the following URLs:

http://arxiv.org/abs/1312.3872

http://arxiv.org/abs/1404.4904

I hope to post another working paper there next week that will really clinch the point.  But who knows?  I may be wrong.

Respectfully,

Stephen J Bensman, Ph.D.

LSU Libraries

Louisiana State University

Baton Rouge, LA 70803

USA

From: ASIS&T Special Interest Group on Metrics [mailto:SIGMETRICS at LISTSERV.UTK.EDU] On Behalf Of Enrique Orduña
 Sent: Wednesday, September 10, 2014 5:15 AM
 To: SIGMETRICS at LISTSERV.UTK.EDU
 Subject: [SIGMETRICS] About the size of Google Scholar: playing the numbers


Dear Colleagues,

 The purpose of this mail is to present our latest working paper, deposited on July 24, 2014.
 http://googlescholardigest.blogspot.com.es/2014/09/about-size-of-google-scholar-playing.html

We take on the intractable task of determining the size of the huge black hole that Google Scholar (GS) resembles. As the title of the document (About the size of Google Scholar: playing the numbers) suggests, we have begun to do the sums, and using 4 different empirical methods we estimate that the number of unique documents (different versions of a document are excluded) should not be less than 160 million (as of May 2014).

Regardless of this particular outcome, which is itself significant (especially when compared with other scientific databases, and because it gives us key clues about the amount of scientific knowledge that can be searched, found and accessed on the web), even more exciting is the methodological challenge of this undertaking. It has not only forced us to devise various techniques for measuring the size of this dark object that GS is, but in applying them we have also shed light, again, on various inconsistencies, uncertainties and limitations of the search interface tools used by Google. In short, we have learned more about what Google Scholar does and does not do, and we want to share it with you all.

This research comes at a good time. We are not only about to celebrate the 10th anniversary of GS but are also hearing some voices (from somewhere in Europe…) finally relying on the use of Google Scholar for scientific evaluation.

Now, when empirical studies (http://googlescholardigest.blogspot.com.es/p/bibliography.html) demonstrate every day that Google Scholar and its derivatives

a) measure with similar credit to traditional bibliometric indicators,

b) are the most used products by scientists (http://www.nature.com/news/online-collaboration-scientists-and-the-social-network-1.15711), and

c) have unfortunately finished off the competition (Microsoft Academic Search is in an unexplained hibernation, http://googlescholardigest.blogspot.com.es/2014/04/empirical-evidences-microsoft-academic-search-dead.html),

it seems that a certain euphoria has been unleashed. We are pleased; better late than never…

However, without wanting to dampen the expectations that have been raised, we emphasize that the problems of Google Scholar for scientific evaluation are not technical or methodological (coverage, reliability and validity of the measures, record-filtering performance…). The fundamental limitations are those related to:

a) the ease with which GS indicators can be manipulated (http://ec3noticias.blogspot.com.es/2014/01/google-scholar-wins-ravesbut-can-it-be.htmt),

b) the transience of the results and measures (in many cases difficult to replicate stably),

c) the technological dependence on companies that develop tools that come and go on the consumer product market (http://ec3noticias.blogspot.com.es/2014/04/la-new-new-horizontes.html-bibliometrics).

Google Scholar enthusiasts are now welcome; meanwhile, we will vigorously continue with what we proposed several years ago: to reveal with “data” - and not mere opinions - the inner workings of Google Scholar, and at the same time its strengths and weaknesses. So, like the old published serials, we can only promise... TO BE CONTINUED…

 Best,

Enrique Orduña-Malea

 Polytechnic University of Valencia

 Emilio Delgado López-Cózar

Universidad de Granada

Yves Gingras

Professeur
Département d'histoire
Centre interuniversitaire de recherche
sur la science et la technologie (CIRST)
Chaire de recherche du Canada en histoire
et sociologie des sciences
Observatoire des sciences et des technologies (OST)
UQAM
C.P. 8888, Succ. Centre-Ville
Montréal, Québec
Canada, H3C 3P8

Tel: (514)-987-3000-7053
Fax: (514)-987-7726

http://www.chss.uqam.ca
http://www.cirst.uqam.ca
http://www.ost.uqam.ca