OA Growth Monitoring Needs a Google Data-Mining Exemption

Fri Aug 23 06:58:00 EDT 2013

Although a harvester would be very nice, sampling theory and some manual
work do the trick too.

In any case, my dissertation should be uploaded to the institutional
repository within the next week, and I can share the link then. That
said, I did a similar inquiry, but instead of collecting references from
WoS or Scopus, I took a systematic random sample of bibliographic
references from CiteULike.org, on the idea that the references
collected on CiteULike, by the various kinds of researchers who collect
there, have a special kind of relevance compared with the total set of
all things published (an idea comparable, I guess, to some of the
assumptions that altmetricians make).
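
For readers unfamiliar with the term, here is a minimal sketch of
systematic random sampling in Python. It is not the procedure used in
the study: the function name, the list of 10,000 placeholder references,
and the seed are all hypothetical, and only the sample size of 995 comes
from the figures reported in this message.

```python
import random

def systematic_sample(items, n, seed=None):
    """Take n items at a fixed interval, starting from a random offset.

    Hypothetical helper for illustration; not the study's actual code.
    """
    if not 0 < n <= len(items):
        raise ValueError("n must be between 1 and len(items)")
    rng = random.Random(seed)
    step = len(items) / n           # sampling interval (may be fractional)
    start = rng.random() * step     # random start inside the first interval
    last = len(items) - 1
    # Clamp the index to guard against floating-point rounding at the end.
    return [items[min(int(start + i * step), last)] for i in range(n)]

# Placeholder sampling frame; the real frame would be the CiteULike records.
references = [f"ref-{i}" for i in range(10_000)]
sample = systematic_sample(references, 995, seed=2010)
print(len(sample))
```

Because the interval is at least one item wide, no reference is drawn
twice, and the random start gives every reference the same chance of
selection.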

I took the sample in May 2010 and collected bibliometric and other
relevant data from Google Scholar in July 2010, July 2011, and July
2012.

Key findings include:

Of the 995 bibliographic references in my sample, 691 referred to
journal articles.

In 2012, 662 of those references were valid references and Google
Scholar was able to locate them.

Of those 662 references, 381 (57.55%) were retrievable in full text from
Google Scholar in 2012, up from 345 of the 648 (53.24%) that were valid
and locatable by Google Scholar in 2010. Of course, these were
retrievable without the benefit of a university library's proxy.
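
As a rough illustration of what sampling theory adds to these counts,
the sketch below recomputes the two proportions and attaches a 95%
Wilson score interval to each. The interval method and the function name
are assumptions chosen for illustration; the message does not say which
interval, if any, was used in the dissertation.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%).

    Illustrative helper; the choice of interval is an assumption.
    """
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# (year, full-text retrievable, valid and locatable), as reported above.
counts = [(2010, 345, 648), (2012, 381, 662)]

for year, full_text, located in counts:
    low, high = wilson_interval(full_text, located)
    share = full_text / located
    print(f"{year}: {share:.2%} (95% interval {low:.2%} to {high:.2%})")
```

Note that the 2010 and 2012 figures come from the same sampled
references, so the two intervals describe each year separately and are
not by themselves a test of the change between years.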

The sources providing access were varied, though they fell into only a
few categories. Universities (which include institutional repositories)
were the most common source of full-text access via Google Scholar. In
2012, 145 (63.32%) universities provided access to 199 (52.09%) of the
documents.

One more bit:

For the 2012 data, of those articles that were available full text via
Google Scholar, the median citation count was 49. For those articles
that were not available full text via Google Scholar, the median
citation count was 20.

I'm no longer collecting data in the same way I did for 2010, 2011, and
2012. Instead I'm getting set to sample from multiple sources, in
addition to CiteULike, in order to obtain even more credible results.

Sean Burns

-- 
C. Sean Burns | Assistant Professor
School of Library and Information Science
University of Kentucky
327 Little Library Building | Lexington, KY 40506-0224
Phone +1 859-218-2296 | Fax +1 859-257-4205
https://ci.uky.edu/lis/
https://sweb.uky.edu/~csbu225

> Administrative info for SIGMETRICS (for example unsubscribe):
> http://web.utk.edu/~gwhitney/sigmetrics.html 
> This is a response to a query regarding Eric Archambault's report on
> OA Growth by Adam G Dunn in Science Insider: "I find it difficult to
> believe that the authors of the study managed to create a harvester
> that could identify and verify the pdfs linked to by Google Scholar
> when Google Scholar actively blocks IP addresses when they identify
> crawling."
> 
> Our own "harvester" attempts to gather the all-important data on OA
> growth were blocked by Google. 
> 
> It is completely understandable and justifiable that Google shields
> its increasingly vital global database and search mechanisms from the
> countless and incessant worldwide attempts at exploitation by
> commercial interests, spammers, and malware that could bring Google to
> its knees if not rigorously and relentlessly blocked. 
> 
> But in the very special (and tiny) case of scientific research
> articles it would not only be a great help to the worldwide research
> community but to Google (and Google Scholar) itself if Google granted
> special individual exemptions for important international studies like
> Eric Archambault's, which was commissioned by the European Union to
> monitor the global growth rate of open access to research. 
> 
> Google and Google Scholar would become all the richer as research
> databases if data like Eric's (and our own) were not made so
> excruciatingly difficult and time-consuming to gather by Google's
> blanket blockage of automated data-mining.
> 
> 
> (We do not trawl books, so Google's agreements with publishers are not
> violated or at issue in any way. We just want to trawl for articles
> whose metadata match the metadata from Web of Science or SCOPUS
> and have been made freely accessible on the web; nor do we want their
> full-texts: just to check whether they are there!)
> 
> Stevan Harnad