About the size of Google Scholar: playing the numbers
Isidro F. Aguillo
isidro.aguillo at CCHS.CSIC.ES
Thu Sep 11 10:34:13 EDT 2014
Dear Stephen,
Thanks for your comments. I understand the private nature of Google, but
Mendeley (owned by Elsevier) and other similar biblio/altmetric sources
are also commercially backed companies, and they offer good APIs for
in-depth, large-scale data analysis.
As a matter of curiosity, I checked the largest h-index in Google Scholar
Citations and it looks to be:
Pierre Bourdieu
Centre de Sociologie Européenne, Collège de France
http://scholar.google.com/citations?user=d_lp40IAAAAJ&hl=en
Citations 361973
h-index 207
Any better candidates?
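For anyone who wants to check such figures against a list of per-paper citation counts, here is a minimal sketch of the usual h-index calculation (the largest h such that h papers have at least h citations each). The function name and the sample counts are hypothetical illustrations and are not taken from Google Scholar Citations.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)  # most-cited paper first
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break  # counts only decrease from here, so h cannot grow further
    return h


# Hypothetical citation counts for six papers (illustrative only).
sample = [25, 8, 5, 3, 3, 0]
print(h_index(sample))  # prints 3
```

With these made-up counts, the three most-cited papers each have at least 3 citations, but the fourth has only 3 (fewer than 4), so the h-index is 3.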
On 11/09/2014 16:13, Stephen J Bensman wrote:
>
> Isidro,
>
> Unfortunately, Google is a cautious private company with
> commercial interests and secrets. For example, it is very cautious
> when it comes to copyright. I really hate it when I find a book
> chapter of interest to me but cannot download it or copy/paste it.
> Moreover, with Google Scholar citations it allows you to make the
> choice whether you want yours public or private. That keeps the door
> open for Harzing’s Publish-or-Perish program. Google does not want
> any lawsuits resulting from making your private data public without
> your permission.
>
> Google allows you large enough samples for most purposes. For
> example, when it comes to individuals, the main measure appears to be
> the h-index. For analytical purposes, your h-index has to be above 50
> to provide a proper sample. Few people have h-indexes above 50, and I
> know of none with an h-index above 1000.
>
> Google’s database is Google’s private property. It can do with it
> what it wants. I imagine that—like Thomson Reuters—you could purchase
> a lot of data from it. However, you may be some sort of Bolshevik,
> who wants the right to expropriate it. As for the uselessness of
> Google Scholar, I will quote your compatriots below:
>
> “Now, when empirical studies
> (http://googlescholardigest.blogspot.com.es/p/bibliography.html)
> demonstrate every day that Google Scholar and its derivatives
>
> a) measure with similar credit to traditional bibliometric indicators,
>
> b) are the most used products by scientists
> (http://www.nature.com/news/online-collaboration-scientists-and-the-social-network-1.15711),”
>
> Why don’t you take up your case with them as well?
>
> Respectfully,
>
> Stephen J Bensman
>
> LSU Libraries
>
> Louisiana State University
>
> Baton Rouge, LA 70803
>
> *From:*ASIS&T Special Interest Group on Metrics
> [mailto:SIGMETRICS at LISTSERV.UTK.EDU] *On Behalf Of *Isidro F. Aguillo
> *Sent:* Thursday, September 11, 2014 1:46 AM
> *To:* SIGMETRICS at LISTSERV.UTK.EDU
> *Subject:* Re: [SIGMETRICS] About the size of Google Scholar: playing
> the numbers
>
> Administrative info for SIGMETRICS (for example unsubscribe):
> http://web.utk.edu/~gwhitney/sigmetrics.html
>
> Are you talking about Google Scholar?
>
> The useless bibliographic tool that does not allow extracting large
> data sets?
>
> The system that blocks access for your whole organization if
> you try to do so?
>
> Are you suffering from CAPTCHAs?
>
> Is somebody able to talk with them and convince them to change their
> approach to our community?
>
> On 10/09/2014 20:17, Stephen J Bensman wrote:
>
> Enrique and Emilio.
>
> I read your working paper with great interest as it deals with the
> same topic on which we are doing research here at LSU. To tell
> you the honest truth, I had trouble with its basic premise, i.e.,
> that Google Scholar (GS) has a given size. I do not think that it
> does, and, if it does, it is meaningless. The real problem is
> the size of the document set that is relevant to the search
> query.
>
> The WWW and PageRank (the Google search engine) operate within
> what can be called the power-law or Lotkaian domain. Informetric
> laws also operate within this domain. On top of that, PageRank
> operates on what is called the probability ranking principle, by
> which the probability of relevance exponentially decreases as the
> number of inlinks decreases, i.e. below a certain point you are
> dealing with gibberish manufactured by the search engine itself.
> Therefore, there is a need for left truncation and determination
> of what can be termed the x-min. Since we are dealing with the
> Lotkaian domain, the x-min marks the point where the asymptote or
> “tail” on the x-axis for the items begins.
>
> We are dealing with Nobelists, and what we have found is that with
> PageRank the set of relevant documents is conterminous with the
> researcher’s h-index and the “tail” of his GS citations
> distribution. In other words—whether by serendipity or not—the
> h-index is an excellent estimate of the x-min of a GS citations
> distribution. Below that is what the Germans would call a
> “Trümmerzone” or rubbish zone largely manufactured by the search
> engine itself. This conterminous-ness is a validation of both the
> h-index and Google Scholar. The relevance of the set is also
> proven by the fact that the extreme outliers on the right messing
> up the tail are usually works on the topics for which the Nobelist
> won the prize. Case closed.
>
> Every field has its statistical problem. With medical research it
> is right truncation, for every patient has to die before the
> results are really known. With the WWW and scientometric
> research, it is left truncation.
>
> If you are interested in how I view how Google Scholar works, you
> can read our working papers at the following URLs:
>
> http://arxiv.org/abs/1312.3872
>
> http://arxiv.org/abs/1404.4904
>
> I hope to post another working paper there next week that will
> really clinch the point. But who knows? I may be wrong.
>
> Respectfully,
>
> Stephen J Bensman, Ph.D.
>
> LSU Libraries
>
> Louisiana State University
>
> Baton Rouge, LA 70803
>
> USA
>
> *From:*ASIS&T Special Interest Group on Metrics
> [mailto:SIGMETRICS at LISTSERV.UTK.EDU] *On Behalf Of *Enrique Orduña
> *Sent:* Wednesday, September 10, 2014 5:15 AM
> *To:* SIGMETRICS at LISTSERV.UTK.EDU
> *Subject:* [SIGMETRICS] About the size of Google Scholar: playing
> the numbers
>
> Dear Colleagues,
>
> The purpose of this mail is to present our latest working paper,
> deposited on July 24, 2014.
> http://googlescholardigest.blogspot.com.es/2014/09/about-size-of-google-scholar-playing.html
>
>
>
> We take on the intractable task of determining the size of the huge
> black hole that Google Scholar (GS) resembles. As the title of
> the document suggests (About the size of Google Scholar: playing the
> numbers), we have begun to run the numbers, and using four different
> empirical methods we estimate that the number of unique documents
> (different versions of a document are excluded) should not be less
> than 160 million (as of May 2014).
>
> Regardless of this particular outcome, which is itself significant
> (especially when compared with other scientific databases, and
> because it gives us key clues about the amount of scientific
> knowledge that can be searched, found and accessed on the web), even
> more exciting is the methodological challenge of this undertaking.
> It has not only forced us to devise various techniques for
> measuring the size of this dark object that GS is, but in
> applying them we have also shed light, again, on various
> inconsistencies, uncertainties and limitations of the search
> interface tools used by Google. In short, we have learned more
> about what Google Scholar does and does not do, and we want to share
> it with you all.
>
> This research comes at a good time. Not only is the 10th
> anniversary of GS approaching, but we are also hearing some
> voices (from somewhere in Europe…) finally relying on the use of
> Google Scholar for scientific evaluation.
>
> Now, when empirical studies
> (http://googlescholardigest.blogspot.com.es/p/bibliography.html)
> demonstrate every day that Google Scholar and its derivatives
>
> a) measure with credibility similar to that of traditional bibliometric indicators,
>
> b) are the most used products by scientists
> (http://www.nature.com/news/online-collaboration-scientists-and-the-social-network-1.15711),
>
> and
>
> c) have unfortunately finished off the competition (Microsoft
> Academic Search is in an unexplained hibernation,
> http://googlescholardigest.blogspot.com.es/2014/04/empirical-evidences-microsoft-academic-search-dead.html),
>
> it seems that a certain euphoria has been unleashed. We are pleased;
> better late than never…
>
> However, without wanting to dampen the expectations raised, we
> emphasize that the problems of Google Scholar for scientific
> evaluation are not technical or methodological (coverage,
> reliability and validity of the measures, record-filtering
> performance…). The fundamental limitations are those related to:
>
> a) the ease with which GS indicators can be manipulated
> (http://ec3noticias.blogspot.com.es/2014/01/google-scholar-wins-ravesbut-can-it-be.htmt),
>
> b) the transience of the results and measures (in many cases
> difficult to replicate stably),
>
> c) the technological dependence on companies that develop tools
> that come and go on the consumer product market
> (http://ec3noticias.blogspot.com.es/2014/04/la-new-new-horizontes.html-bibliometrics).
>
> Google Scholar enthusiasts are now welcome; meanwhile we will
> continue vigorously with what we already proposed several years
> ago: to reveal with “data” (and not mere opinions) the inner
> workings of Google Scholar, and at the same time its strengths
> and weaknesses. So, like the old published serials, we can only
> promise... TO BE CONTINUED…
>
> Best,
>
> Enrique Orduña-Malea
>
> Polytechnic University of Valencia
>
> Emilio Delgado López-Cózar
>
> Universidad de Granada
>
>
>
>
> --
>
> ************************************
> Isidro F. Aguillo, HonDr.
> The Cybermetrics Lab, IPP-CSIC
> Grupo Scimago
> Madrid. SPAIN
>
> isidro.aguillo at csic.es
> ORCID 0000-0001-8927-4873
> ResearcherID: A-7280-2008
> Scholar Citations SaCSbeoAAAAJ
> Twitter @isidroaguillo
> Rankings Web webometrics.info
> ************************************
>