Comments on BRAIN Project

Sat May 6 17:07:46 EDT 2006

Hi All,

Here are a few comments and suggestions on the BRAIN documents I've seen
so far at http://www3.isrl.uiuc.edu/~unsworth/BRAIN/

> About 40% of US universities have or are building institutional
> repositories, and another 40% are planning them.

For the c. 130 US IRs registered so far, sorted by size, see ROAR:
http://archives.eprints.org/?country=us&version=&type=&order=recordcount&submit=Filter

> About 80% of all
> journals now permit authors to self-archive (on a personal web site or
> in an institutional repository)

Of the 9000+ journals registered in ROMEO, 93% endorse some form of
self-archiving (69% for postprints + 24% for preprints)

> About 15% of faculty publishing new scholarly articles actually do this.

With a lot of variability by discipline. Here are the data:

http://citebase.eprints.org/isi_study/
http://www.crsc.uqam.ca/lab/chawki/graphes/EtudeImpact.htm

> BRAIN aims to raise the rate of voluntary participation in institutional
> repositories, as follows:
>
> When a scholar self-archives in an institutional repository that
> participates in BRAIN, he or she gets back a list of the papers from any
> open-access repository or journal that best match this one, based on
> coincidence of citations and full-text clustering, plus best matches
> from books and journals, as described below. Scholars would be able to
> read full-text of these if they were open access articles, but (see
> below, Publishers) might only get an abstract and citations within the
> article, if it were not open access.

I am not sure what algorithm BRAIN will use to generate these matches,
but if citations, co-citations and keyword similarity scores are to be
among the metrics, see citebase:

http://www.citebase.org/
http://www.citebase.org/help/order.php
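
As an illustration only (this is not Citebase's code, and BRAIN's
algorithm is not specified), here is a minimal sketch of two of the
citation-based metrics mentioned above: shared references
(bibliographic coupling) and co-citation counts. The paper ids, the toy
data and the simple additive score are all assumptions.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: paper id -> set of ids that paper cites.
references = {
    "A": {"X", "Y", "Z"},
    "B": {"X", "Y"},
    "C": {"Z"},
    "D": {"A", "B"},
    "E": {"A", "B", "C"},
}

def coupling(p, q, refs):
    """Bibliographic coupling: number of references p and q share."""
    return len(refs.get(p, set()) & refs.get(q, set()))

def cocitation_counts(refs):
    """Co-citation: how often each pair of papers is cited together."""
    counts = defaultdict(int)
    for cited in refs.values():
        for pair in combinations(sorted(cited), 2):
            counts[pair] += 1
    return dict(counts)

def best_matches(paper, refs, top=3):
    """Rank other papers by coupling + co-citation (an assumed, naive score)."""
    cc = cocitation_counts(refs)
    scores = {}
    for other in refs:
        if other == paper:
            continue
        pair = tuple(sorted((paper, other)))
        score = coupling(paper, other, refs) + cc.get(pair, 0)
        if score:
            scores[other] = score
    # Highest score first; ties broken alphabetically by id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top]

print(best_matches("A", references))
```

A real service would presumably weight these signals and add a
text-similarity score; the sketch only shows how the raw counts arise.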

> a partial example of the kind of results I'd hope we might eventually
> offer to participants can be found in the search results here:
>
> http://vivo.library.cornell.edu/

Perhaps a vivo-like tool will be useful for students and even teachers,
but I profoundly doubt it is the way active researchers will either
search or keep up with their research literature. For that, I would put
my money on targeted alerting algorithms, based on boolean word-profiles
plus citation and co-citation profiles and possibly also some latent
semantic or other text proximity metrics.
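
A minimal sketch of what such an alerting profile might look like,
assuming a profile is a boolean word filter (all-of / any-of / none-of)
combined with a list of watched works whose citation triggers an alert.
The field names and sample papers are hypothetical, and the latent
semantic component is left out.

```python
import re

def words(text):
    """Lower-cased word set of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matches_profile(paper, profile):
    """True if the paper passes the boolean word profile or cites a watched work."""
    w = words(paper["title"] + " " + paper["abstract"])
    all_ok = profile["all_of"] <= w
    any_ok = not profile["any_of"] or bool(profile["any_of"] & w)
    none_ok = not (profile["none_of"] & w)
    word_hit = all_ok and any_ok and none_ok
    citation_hit = bool(profile["watched_refs"] & set(paper["references"]))
    return word_hit or citation_hit

def alerts(new_papers, profile):
    """Ids of newly deposited papers the researcher should be told about."""
    return [p["id"] for p in new_papers if matches_profile(p, profile)]

# Hypothetical researcher profile and newly deposited papers.
profile = {
    "all_of": {"open", "access"},
    "any_of": {"citation", "impact"},
    "none_of": {"patent"},
    "watched_refs": {"ref-123"},
}
new_papers = [
    {"id": "p1", "title": "Open access and citation impact",
     "abstract": "", "references": []},
    {"id": "p2", "title": "Symbol systems revisited",
     "abstract": "A reply.", "references": ["ref-123"]},
    {"id": "p3", "title": "Open access patent databases",
     "abstract": "Impact on industry.", "references": []},
]

print(alerts(new_papers, profile))
```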

> ...except that they're only dealing with one institution's materials,
> and not tying the service to submissions.

Why would an author especially want a search at submission (rather
than, say, when researching and writing the paper and constructing the
bibliography)?

> Publishers--initially university presses--might be approached to
> participate, perhaps by permitting full-text of journals and books to be
> searched for corresponding citations using web services, or perhaps
> allowing full-text to be aggregated and text-mined, behind a firewall.

Good idea (but probably more useful if it is accessible for searching
via google-like full-text-inversion and boolean search).
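
For readers unfamiliar with the term, full-text inversion means building
an inverted index (term -> documents containing it), over which boolean
queries become set operations. A minimal sketch, with hypothetical
document ids and texts:

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-cased word tokens of a text."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, all_of=(), any_of=(), none_of=()):
    """Boolean search; query terms are expected in lower case."""
    hits = set().union(*index.values())  # start from every indexed document
    for term in all_of:                  # AND
        hits &= index.get(term, set())
    if any_of:                           # OR
        hits &= set().union(*(index.get(t, set()) for t in any_of))
    for term in none_of:                 # NOT
        hits -= index.get(term, set())
    return sorted(hits)

# Hypothetical documents.
docs = {
    "d1": "Self-archiving increases citation impact",
    "d2": "Publishers and self-archiving policies",
    "d3": "Citation parsing in unstructured text",
}
index = build_index(docs)
print(search(index, all_of=["archiving"], none_of=["publishers"]))
```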

> Publishers who participate would get back information about hot topics
> in areas of interest, based on submissions to the institutional
> repository.

Probably it's journal editorial offices, searching for referees -- and
referees themselves -- who will need and want a service like this,
rather than the publishers themselves.

> Somehow, it would be nice to make this feedback proportional
> to the contribution, so that publishers who contributed only
> bibliographic information got back only bibliographic information, and
> publishers who contributed full text got back full text.

It would be sweet revenge, but maybe there are better ways to encourage
publishers (or, better, journals and their editors) to provide the
requisite information -- e.g., its effects on their impact factors.

> On the other hand, this matchmaking service should be open from the very
> beginning to whoever wants to participate, either by aggregating their
> institutional repository materials with ours, or by contributing
> bibliographic information or full-text, even when that information
> cannot be freely distributed, but only mined.

If the IRs are set up properly, all of this information should be
OAI-harvestable *including the references*. Full-text inversion can be
done via Google, or by Google-like harvesters restricted to OA full
text (in the same way that CiteSeer harvests only its own target
content -- but does not invert it).
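
A minimal sketch of the harvesting side, assuming a repository that
exposes unqualified Dublin Core (oai_dc) through OAI-PMH. The base URL
and the sample response are hypothetical, and no network call is made.
How a given repository exposes reference lists depends on its metadata
format, so that part is not shown.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def list_records_url(base_url, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # A resumptionToken is sent on its own, without metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return base_url + "?" + urlencode(params)

def parse_records(xml_text):
    """Return (records, next_token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "oai_id": rec.findtext("oai:header/oai:identifier",
                                   default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
            # dc:identifier often carries the URL(s) of the full text.
            "urls": [e.text for e in rec.iterfind(".//dc:identifier", NS)
                     if e.text and e.text.startswith("http")],
        })
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return records, (token or None)

# Hypothetical response, trimmed to the elements used above.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A sample eprint</dc:title>
          <dc:identifier>http://example.org/1/paper.pdf</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, next_token = parse_records(SAMPLE)
print(list_records_url("http://example.org/oai", next_token))
```

A harvester would repeat the request with each returned token until none
is left, then fetch the listed URLs for full-text indexing.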

> Very likely, we would need to aggregate the content of participating
> repositories here at UIUC--unless participating institutional
> repositories share a very low-level common infrastructure, their
> materials will have to be aggregated in one place for data-mining, for
> the foreseeable future. On the other hand, we could probably automate
> this aggregation using OAI harvesting to get URLs for repository
> contents, then spidering those URLs and downloading the full content of
> the articles, perhaps using tools developed by OCLC in the NDIIPP
> project (see http://www.ndiipp.uiuc.edu/pdfs/IST2005paper_final.pdf for
> more information). For material (like university press books and
> journals) that's not freely available, we'd need some cooperation from
> content providers, but I am reasonably confident I can recruit
> participation from at least a few large presses and
> journal-repositories, for starters.

Good luck on this: important, but complicated.

> We'll be aggregating data from at least two sources--OAI
> repositories, and Google Scholar.

Why not also from all the IRs in ROAR? And all the OAJs in DOAJ?

> Identifying citations and matching them, in unstructured text

I suggest you collaborate with Les Carr, Mike Jewell, Tim Brody (OpCit,
Paracite, Citebase) as well as Chawki Hajjem (UQaM) on this, as they
have been doing all of this for some time: harvesting texts, parsing
references, linking references.
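
To show what "linking references" involves, here is a deliberately
naive sketch. It is not the OpCit, Paracite or Citebase method: it
assumes a reference string can be linked to a catalogue record when the
years agree and enough of the record's title words appear in the
string. The catalogue, the references and the 0.5 threshold are all
hypothetical.

```python
import re

def tokens(text):
    """Lower-cased alphabetic word set of a text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def year_of(ref):
    """First four-digit year (19xx or 20xx) in a reference string, if any."""
    m = re.search(r"\b(19|20)\d{2}\b", ref)
    return m.group(0) if m else None

def link_reference(ref, catalogue, threshold=0.5):
    """Return the id of the best-matching catalogue record, or None."""
    ref_tokens = tokens(ref)
    ref_year = year_of(ref)
    best_id, best_score = None, 0.0
    for rec_id, rec in catalogue.items():
        if ref_year and rec["year"] != ref_year:
            continue  # years disagree: not the same work
        title_tokens = tokens(rec["title"])
        if not title_tokens:
            continue
        # Fraction of the record's title words found in the reference.
        score = len(title_tokens & ref_tokens) / len(title_tokens)
        if score > best_score:
            best_id, best_score = rec_id, score
    return best_id if best_score >= threshold else None

# Hypothetical catalogue of already-harvested records.
catalogue = {
    "r1": {"title": "Parsing references in unstructured text", "year": "2005"},
    "r2": {"title": "Text clustering for repositories", "year": "2005"},
}
ref = ("Doe, J. (2005) Parsing references in unstructured text. "
       "Journal of Examples 1: 1-10.")
print(link_reference(ref, catalogue))
```

Real reference strings are far messier (abbreviated titles, missing
years, OCR errors), which is why existing parsing and linking tools are
worth building on.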

> Matching one document with others, based on its content.

See:

    Shadbolt, N., Brody, T., Carr, L. and Harnad, S. (2006) The Open
    Research Web: A Preview of the Optimal and the Inevitable, in Jacobs,
    N., Eds. Open Access: Key Strategic, Technical and Economic Aspects,
    chapter 21. Chandos.
    http://eprints.ecs.soton.ac.uk/12453/

Best wishes,

Stevan Harnad
American Scientist Open Access Forum
http://amsci-forum.amsci.org/archives/American-Scientist-Open-Access-Forum.html

Chaire de recherche du Canada           Professor of Cognitive Science
Ctr. de neuroscience de la cognition    Dpt. Electronics & Computer Science
Université du Québec à Montréal         University of Southampton
Montréal, Québec                        Highfield, Southampton
Canada  H3C 3P8                         SO17 1BJ United Kingdom
http://www.crsc.uqam.ca/                http://www.ecs.soton.ac.uk/~harnad/