About the size of Google Scholar: playing the numbers

Isidro F. Aguillo isidro.aguillo at CCHS.CSIC.ES
Thu Sep 11 10:34:13 EDT 2014


Dear Stephen,

Thanks for your comments. I understand the private nature of Google, but 
Mendeley (owned by Elsevier) and other similar biblio/altmetric sources 
are also commercially backed companies, and they offer good APIs for 
in-depth, large-scale data analysis.

As a matter of curiosity, I checked the largest h-index in Google Scholar 
Citations, and it appears to be:

Pierre Bourdieu
Centre de Sociologie Européenne, Collège de France
http://scholar.google.com/citations?user=d_lp40IAAAAJ&hl=en

Citations    361973
h-index             207

Any better candidates?
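
For readers less familiar with the measure: the h-index is the largest h 
such that h of an author's papers have at least h citations each. A minimal 
Python sketch follows. The function name and the citation counts are 
made up for illustration; they are not taken from Google Scholar.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    # Rank papers from most cited to least cited.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank      # the paper at this rank still has enough citations
        else:
            break         # every later paper has fewer citations than its rank
    return h


# Made-up citation counts for one hypothetical author.
print(h_index([10, 8, 5, 4, 3]))  # prints 4
```

The function only needs the list of per-paper citation counts, which is 
what a public profile page displays.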


On 11/09/2014 16:13, Stephen J Bensman wrote:
>
> Isidro,
>
> Unfortunately Google is a cautious private company with commercial 
> interests and secrets.  For example, it is very cautious when it 
> comes to copyright.  I really hate it when I find a book chapter of 
> interest to me but cannot download it or copy/paste it.  Moreover, 
> with Google Scholar Citations it lets you choose whether you want 
> your profile public or private.  That keeps the door open for 
> Harzing’s Publish or Perish program.  Google does not want any 
> lawsuits resulting from making your private data public without 
> your permission.
>
> Google allows you large enough samples for most purposes.  For 
> example, when it comes to individuals, the main measure appears to be 
> the h-index.  For analytical purposes, your h-index has to be above 50 
> to provide a proper sample.  Few people have h-indexes above 50, and I 
> know of none with an h-index above 1000.
>
> Google’s database is Google’s private property.  It can do with it 
> what it wants.  I imagine that—like Thomson Reuters—you could purchase 
> a lot of data from it.  However, you may be some sort of Bolshevik, 
> who wants the right to expropriate it.   As for the uselessness of 
> Google Scholar, I will quote your compatriots below:
>
> “Now, when empirical studies 
> (http://googlescholardigest.blogspot.com.es/p/bibliography.html) 
> demonstrate every day that Google Scholar and its derivatives
>
> a) measure with similar credit to traditional bibliometric indicators,
>
> b) are the most used products by scientists 
> (http://www.nature.com/news/online-collaboration-scientists-and-the-social-network-1.15711),”
>
> Why don’t you take up your case with them as well?
>
> Respectfully,
>
> Stephen J Bensman
>
> LSU Libraries
>
> Louisiana State University
>
> Baton Rouge, LA 70803
>
> *From:*ASIS&T Special Interest Group on Metrics 
> [mailto:SIGMETRICS at LISTSERV.UTK.EDU] *On Behalf Of *Isidro F. Aguillo
> *Sent:* Thursday, September 11, 2014 1:46 AM
> *To:* SIGMETRICS at LISTSERV.UTK.EDU
> *Subject:* Re: [SIGMETRICS] ​About the size of Google Scholar: playing 
> the numbers
>
> Administrative info for SIGMETRICS (for example unsubscribe): 
> http://web.utk.edu/~gwhitney/sigmetrics.html
>
> Are you talking about Google Scholar?
>
> The useless bibliographic tool that does not allow you to extract 
> large data sets?
>
> The system that blocks access for your whole organization if you try 
> to do so?
>
> Are you suffering from CAPTCHAs?
>
> Is anybody able to talk with them and convince them to change their 
> approach to our community?
>
> On 10/09/2014 20:17, Stephen J Bensman wrote:
>
>     Enrique and Emilio.
>
>     I read your working paper with great interest, as it deals with the
>     same topic on which we are doing research here at LSU.  To tell
>     you the honest truth, I had trouble with its basic premise, i.e.,
>     that Google Scholar (GS) has a given size.  I do not think that it
>     does, and, if it does, it is meaningless.  The real problem is
>     the size of the documentary set that is relevant to the search
>     query.
>
>     The WWW and PageRank (the ranking algorithm behind the Google search
>     engine) operate within what can be called the power-law or Lotkaian
>     domain. Informetric laws also operate within this domain.  On top of
>     that, PageRank
>     operates on what is called the probability ranking principle, by
>     which the probability of relevance exponentially decreases as the
>     number of inlinks decreases, i.e. below a certain point you are
>     dealing with gibberish manufactured by the search engine itself.
>     Therefore, there is a need for left truncation and determination
>     of what can be termed the x-min.  Since we are dealing with the
>     Lotkaian domain, the x-min marks the point where the asymptote or
>     “tail” on the x-axis for the items begins.
>
>     We are dealing with Nobelists, and what we have found is that with
>     PageRank the set of relevant documents is conterminous with the
>     researcher’s h-index and the “tail” of his GS citations
>     distribution.  In other words—whether by serendipity or not—the
>     h-index is an excellent estimate of the x-min of a GS citations
>     distribution.  Below that is what the Germans would call a
>     “Trümmerzone” or rubbish zone largely manufactured by the search
>     engine itself. This conterminous-ness is a validation of both the
>     h-index and Google Scholar.  The relevance of the set is also
>     proven by the fact that the extreme outliers on the right messing
>     up the tail are usually works on the topics for which the Nobelist
>     won the prize.  Case closed.
>
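
The left truncation described in the two paragraphs above can be sketched 
in Python. This is only an illustration under stated assumptions: the 
function names and the citation counts are made up, and the exponent 
estimate uses the standard continuous-approximation maximum-likelihood 
formula for a power-law tail, alpha = 1 + n / sum(ln(x_i / x_min)), which 
is not stated anywhere in this thread. The cutoff x_min is passed in by the 
caller; following the argument above, one candidate value would be the 
researcher's h-index.

```python
import math


def left_truncate(counts, x_min):
    """Keep only the observations at or above the cutoff x_min."""
    return [c for c in counts if c >= x_min]


def tail_exponent(counts, x_min):
    """Estimate the power-law exponent of the tail above x_min.

    Uses the continuous-approximation MLE (an assumption of this sketch):
    alpha = 1 + n / sum(ln(x_i / x_min)) over the n values with x_i >= x_min.
    """
    if x_min <= 0:
        raise ValueError("x_min must be positive")
    tail = left_truncate(counts, x_min)
    log_sum = sum(math.log(c / x_min) for c in tail)
    if log_sum == 0:
        raise ValueError("the tail needs at least one value above x_min")
    return 1 + len(tail) / log_sum


# Made-up citation counts; everything below x_min = 10 is discarded.
citations = [1, 2, 3, 10, 20, 40]
print(left_truncate(citations, 10))            # prints [10, 20, 40]
print(round(tail_exponent(citations, 10), 3))  # prints 2.443
```

Citation counts are discrete, so the continuous formula is only an 
approximation, and it becomes rougher as x_min gets smaller.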
>     Every field has its statistical problem.  With medical research it
>     is right truncation, for every patient has to die before the
>     results are really known.  With the WWW and scientometric
>     research, it is left truncation.
>
>     If you are interested in how I view how Google Scholar works, you
>     can read our working papers at the following URLs:
>
>     http://arxiv.org/abs/1312.3872
>
>     http://arxiv.org/abs/1404.4904
>
>     I hope to post another working paper there next week that will
>     really clinch the point.  But who knows?  I may be wrong.
>
>     Respectfully,
>
>     Stephen J Bensman, Ph.D.
>
>     LSU Libraries
>
>     Louisiana State University
>
>     Baton Rouge, LA 70803
>
>     USA
>
>     *From:*ASIS&T Special Interest Group on Metrics
>     [mailto:SIGMETRICS at LISTSERV.UTK.EDU] *On Behalf Of *Enrique Orduña
>     *Sent:* Wednesday, September 10, 2014 5:15 AM
>     *To:* SIGMETRICS at LISTSERV.UTK.EDU <mailto:SIGMETRICS at LISTSERV.UTK.EDU>
>     *Subject:* [SIGMETRICS] ​About the size of Google Scholar: playing
>     the numbers
>
>     Dear Colleagues,
>
>     The purpose of this mail is to present our latest working paper,
>     deposited on July 24, 2014.
>     http://googlescholardigest.blogspot.com.es/2014/09/about-size-of-google-scholar-playing.html
>
>     We set ourselves the seemingly intractable task of determining the
>     size of the huge black hole that is Google Scholar (GS). As the
>     title of the document says (About the size of Google Scholar:
>     playing the numbers), we have started doing the sums: using 4
>     different empirical methods, we estimate that the number of unique
>     documents (different versions of a document are excluded) should
>     not be less than 160 million (as of May 2014).
>
>     Regardless of this particular outcome, which is itself significant
>     (especially when compared with other scientific databases, and
>     because it gives us key clues about the amount of scientific
>     knowledge that can be searched, found and accessed on the web),
>     even more exciting is the methodological challenge of this
>     undertaking. It has not only forced us to devise various techniques
>     for measuring the size of the dark object that GS is; in applying
>     them we have also shed light, again, on various inconsistencies,
>     uncertainties and limitations of the search interface tools used
>     by Google. In short, we have learned more about what Google Scholar
>     does and does not do, and we want to share it with you all.
>
>     This research comes at a good time. Not only is the 10th
>     anniversary of GS approaching, but we are also hearing some
>     voices (from somewhere in Europe…) that finally rely on the use of
>     Google Scholar for scientific evaluation.
>
>     Now, when empirical studies
>     (http://googlescholardigest.blogspot.com.es/p/bibliography.html)
>     demonstrate every day that Google Scholar and its derivatives
>
>     a) measure with similar credit to traditional bibliometric indicators,
>
>     b) are the most used products by scientists
>     (http://www.nature.com/news/online-collaboration-scientists-and-the-social-network-1.15711),
>     and
>
>     c) have unfortunately finished off the competition (Microsoft
>     Academic Search is in an unexplained hibernation,
>     http://googlescholardigest.blogspot.com.es/2014/04/empirical-evidences-microsoft-academic-search-dead.html),
>
>     it seems that a certain euphoria has been unleashed. We are
>     pleased: better late than never…
>
>     However, without wanting to dampen the expectations raised, we
>     emphasize that the problems of Google Scholar for scientific
>     evaluation are not technical or methodological (coverage,
>     reliability and validity of the measures, record filtering
>     performance…). The fundamental limitations are those related to:
>
>     a) the ease with which GS indicators can be manipulated
>     (http://ec3noticias.blogspot.com.es/2014/01/google-scholar-wins-ravesbut-can-it-be.htmt),
>
>     b) the transience of the results and measures (in many cases
>     difficult to replicate stably),
>
>     c) the technological dependence on companies that develop tools
>     that come and go on the consumer product market
>     (http://ec3noticias.blogspot.com.es/2014/04/la-new-new-horizontes.html-bibliometrics).
>
>     Google Scholar enthusiasts are now welcome; meanwhile, we will
>     continue vigorously with what we proposed several years ago: to
>     reveal with “data” (and not mere opinions) the inner workings of
>     Google Scholar, together with its strengths and weaknesses. So,
>     like the old serials, we can only promise... TO BE CONTINUED…
>
>     Best,
>
>     Enrique Orduña-Malea
>     Polytechnic University of Valencia
>
>     Emilio Delgado López-Cózar
>     Universidad de Granada
>
>
>
>
> -- 
>   
> ************************************
> Isidro F. Aguillo, HonDr.
> The Cybermetrics Lab, IPP-CSIC
> Grupo Scimago
> Madrid. SPAIN
>   
> isidro.aguillo at csic.es  <mailto:isidro.aguillo at csic.es>
> ORCID 0000-0001-8927-4873
> ResearcherID: A-7280-2008
> Scholar Citations SaCSbeoAAAAJ
> Twitter @isidroaguillo
> Rankings Web webometrics.info
> ************************************
>


-- 

************************************
Isidro F. Aguillo, HonDr.
The Cybermetrics Lab, IPP-CSIC
Grupo Scimago
Madrid. SPAIN

isidro.aguillo at csic.es
ORCID 0000-0001-8927-4873
ResearcherID: A-7280-2008
Scholar Citations SaCSbeoAAAAJ
Twitter @isidroaguillo
Rankings Web webometrics.info
************************************


