OA Growth Monitoring Needs a Google Data-Mining Exemption

Fri Aug 23 11:13:20 EDT 2013

On Fri, Aug 23, 2013 at 6:58 AM, Burns, Christopher S wrote:

> Although a harvester would be very nice, sampling theory and some manual
> work does the trick too... [in my dissertation] I took the sample in May
> 2010 and collected bibliometric and other relevant data from Google
> Scholar in July 2010, July 2011, and July 2012.

Yes, hand-sampling can and does provide valuable information.

But, as I said, for systematic ongoing monitoring of the global time-course
of OA growth across institutions, disciplines and nations, hand-sampling
is excruciatingly difficult and time-consuming, holding research that could
greatly benefit the worldwide research community (as well as Google and
Google Scholar) to a scale and pace that is more suitable for a doctoral
dissertation.

Historically speaking, if a few projects designed to monitor the ongoing
global growth and distribution of OA were allowed to do machine data-mining
in Google space, the growth rate of OA would be dramatically accelerated
(and thereby also the size and functionality of Google Scholar space).

Otherwise, efforts to enrich Google Scholar space are relegated to the same
fate as attempts to enrich vendors, spammers, napsters or phishermen.

Stevan Harnad

> Key findings include:
>
> Of the 995 bibliographic references in my sample, 691 referred to
> journal articles.
>
> In 2012, 662 of those references were valid references and Google
> Scholar was able to locate them.
>
> Of those 662 references, 381 (57.55%) were retrievable full-text from
> Google Scholar in 2012, and this was up from 345 (out of 648 valid and
> Google Scholar locatable) in 2010. Of course, these were retrievable w/o
> the benefit of a university library's proxy.
>
> The sources providing access were varied, though not numerous by
> category; universities (which include institutional repositories) were
> the most common source of full-text access via Google Scholar. In 2012,
> 145 (63.32%) universities provided access to 199 (52.09%) of the
> documents.
>
> One more bit:
>
> For the 2012 data, of those articles that were available full text via
> Google Scholar, the median citation count was 49. For those articles
> that were not available full text via Google Scholar, the median
> citation count was 20.
>
> I'm no longer collecting data in the same way I did for 2010, 2011, and
> 2012. Instead I'm getting set to sample from multiple sources, other
> than and in addition to CiteULike, in order to acquire even more
> credible results.
>
> Sean Burns
>
> --
> C. Sean Burns | Assistant Professor
> School of Library and Information Science
> University of Kentucky
> 327 Little Library Building | Lexington, KY 40506-0224
> Phone +1 859-218-2296 | Fax +1 859-257-4205
> https://ci.uky.edu/lis/
> https://sweb.uky.edu/~csbu225
>
>
> > Administrative info for SIGMETRICS (for example unsubscribe):
> > http://web.utk.edu/~gwhitney/sigmetrics.html
> > This is a response to a query regarding Eric Archambault's report on
> > OA Growth by Adam G Dunn in Science Insider: "I find it difficult to
> > believe that the authors of the study managed to create a harvester
> > that could identify and verify the pdfs linked to by Google Scholar
> > when Google Scholar actively blocks IP addresses when they identify
> > crawling."
> >
> > Our own "harvester" attempts to gather the all-important data on OA
> > growth were blocked by Google.
> >
> > It is completely understandable and justifiable that Google shields
> > its increasingly vital global database and search mechanisms from the
> > countless and incessant worldwide attempts at exploitation by
> > commercial interests, spammers, and malware that could bring Google to
> > its knees if not rigorously and relentlessly blocked.
> >
> > But in the very special (and tiny) case of scientific research
> > articles it would not only be a great help to the worldwide research
> > community but to Google (and Google Scholar) itself if Google granted
> > special individual exemptions for important international studies like
> > Eric Archambault's, which was commissioned by the European Union to
> > monitor the global growth rate of open access to research.
> >
> > Google and Google Scholar would become all the richer as research
> > databases if data like Eric's (and our own) were not made so
> > excruciatingly difficult and time-consuming to gather by Google's
> > blanket blockage of automated data-mining.
> >
> >
> > (We do not trawl books, so Google's agreements with publishers are not
> > violated or at issue in any way. We just want to trawl for articles
> > whose metadata match the metadata from Web of Science or SCOPUS
> > and have been made freely accessible on the web; nor do we want their
> > full-texts: just to check whether they are there!)
> >
> > Stevan Harnad