OA Growth Monitoring Needs a Google Data-Mining Exemption

Bosman, J.M. j.bosman at UU.NL
Mon Aug 26 06:12:21 EDT 2013


Stevan, Sean,

Do you know:


1)   How many of the freely available full-text versions are "black OA", i.e. shared in violation of copyright? I know of many such examples on, for instance, ResearchGate, which is indexed by Google Scholar.

2)   To what extent can the growth in available OA versions be explained by a growing number of green OA versions whose embargo period has ended, and to what extent by more general acceptance of OA among scholars? The first effect seems likely to be more pronounced 6-24 months after a period of exceptional growth of self-archiving in repositories etc.


Jeroen Bosman
Utrecht University Library

From: ASIS&T Special Interest Group on Metrics [mailto:SIGMETRICS at LISTSERV.UTK.EDU] On Behalf Of Stevan Harnad
Sent: Friday 23 August 2013 17:13
To: SIGMETRICS at LISTSERV.UTK.EDU
Subject: Re: [SIGMETRICS] OA Growth Monitoring Needs a Google Data-Mining Exemption

On Fri, Aug 23, 2013 at 6:58 AM, Burns, Christopher S  wrote:

Although a harvester would be very nice, sampling theory and some manual
work do the trick too... [in my dissertation] I took the sample in May 2010 and collected
bibliometric and other relevant data from Google Scholar in July 2010, July 2011,
and July 2012.

Yes, hand-sampling can and does provide valuable information.

But, as I said, for systematic ongoing monitoring of the global time-course of OA growth across institutions, disciplines and nations, hand-sampling is excruciatingly difficult and time-consuming, holding research that could greatly benefit the worldwide research community (as well as Google and Google Scholar) to a scale and pace that is more suitable for a doctoral dissertation.

Historically speaking, if a few projects designed to monitor the ongoing global growth and distribution of OA were allowed to do machine data-mining in Google space, the growth rate of OA would be dramatically accelerated (and thereby also the size and functionality of Google Scholar space).

Otherwise, efforts to enrich Google Scholar space are relegated to the same fate as attempts to enrich vendors, spammers, napsters or phishermen.

Stevan Harnad



Key findings include:

Of the 995 bibliographic references in my sample, 691 referred to
journal articles.

In 2012, 662 of those references were valid references and Google
Scholar was able to locate them.

Of those 662 references, 381 (57.55%) were retrievable full-text from
Google Scholar in 2012, and this was up from 345 (out of 648 valid and
Google Scholar locatable) in 2010. Of course, these were retrievable without
the benefit of a university library's proxy.
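
As a quick check on the percentages above, here is a minimal Python sketch that recomputes the full-text retrieval rates from the reported counts. The helper name `share` is hypothetical and only for illustration; it is not taken from the dissertation.

```python
def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)

# Counts as reported in the message above.
rate_2010 = share(345, 648)   # full text retrievable / valid and locatable, 2010
rate_2012 = share(381, 662)   # full text retrievable / valid and locatable, 2012

# Difference between the two rates, in percentage points.
change_points = round(rate_2012 - rate_2010, 2)

print(f"2010: {rate_2010}%  2012: {rate_2012}%  change: {change_points} points")
```

On these counts the retrievable share rises from roughly 53% in 2010 to roughly 58% in 2012, a gain of a little over four percentage points.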

The sources providing access were varied, though not numerous by category;
universities (including institutional repositories) were the
most common source of full-text access via Google Scholar. In 2012, 145
(63.32%) universities provided access to 199 (52.09%) of the documents.

One more bit:

For the 2012 data, of those articles that were available full text via
Google Scholar, the median citation count was 49. For those articles
that were not available full text via Google Scholar, the median
citation count was 20.

I'm no longer collecting data in the same way I did for 2010, 2011, and
2012. Instead I'm getting set to sample from multiple sources, other
than and in addition to CiteULike, in order to acquire even more
credible results.

Sean Burns

--
C. Sean Burns | Assistant Professor
School of Library and Information Science
University of Kentucky
327 Little Library Building | Lexington, KY 40506-0224
Phone +1 859-218-2296 | Fax +1 859-257-4205
https://ci.uky.edu/lis/
https://sweb.uky.edu/~csbu225


> This is a response to a query regarding Eric Archambault's report on
> OA Growth by Adam G Dunn in Science Insider: "I find it difficult to
> believe that the authors of the study managed to create a harvester
> that could identify and verify the pdfs linked to by Google Scholar
> when Google Scholar actively blocks IP addresses when they identify
> crawling."
>
> Our own "harvester" attempts to gather the all-important data on OA
> growth were blocked by Google.
>
> It is completely understandable and justifiable that Google shields
> its increasingly vital global database and search mechanisms from the
> countless and incessant worldwide attempts at exploitation by
> commercial interests, spammers, and malware that could bring Google to
> its knees if not rigorously and relentlessly blocked.
>
> But in the very special (and tiny) case of scientific research
> articles it would not only be a great help to the worldwide research
> community but to Google (and Google Scholar) itself if Google granted
> special individual exemptions for important international studies like
> Eric Archambault's, which was commissioned by the European Union to
> monitor the global growth rate of open access to research.
>
> Google and Google Scholar would become all the richer as research
> databases if data like Eric's (and our own) were not made so
> excruciatingly difficult and time-consuming to gather by Google's
> blanket blockage of automated data-mining.
>
>
> (We do not trawl books, so Google's agreements with publishers are not
> violated or at issue in any way. We just want to trawl for articles
> whose metadata match the metadata from Web of Science or SCOPUS
> and have been made freely accessible on the web; nor do we want their
> full-texts: just to check whether they are there!)
>
> Stevan Harnad
>
>
