[Sigmetrics] 1findr: research discovery & analytics platform

Éric Archambault eric.archambault at science-metrix.com
Mon Apr 30 18:53:11 EDT 2018


Thank you very much, Emilio. Please find our answers to your questions below:


  1.  What is the main source of the metadata to those 89,561,005 articles? Is it perhaps Crossref?
Yes, we do use Crossref, which is one of the best sources of data around. 1findr is built with an enrichment/clustering/mashup/filtering data processing pipeline. There are currently 75 million records from Crossref out of the total 450 million records entering the pipeline (17%). This doesn’t mean that 17% of 1findr’s records come straight from Crossref; it means that Crossref currently represents 17% of the ingredients used to produce 1findr. In the next few months, we’ll add content from BASE (https://www.base-search.net/) and CORE (https://core.ac.uk/), as these organizations have agreed to share the fruits of their labour. This will certainly help fill gaps in 1findr, further increase quality, help us produce complete records, and generally increase depth and breadth. Please encourage BASE and CORE; they are providing an extremely useful public service. We are examining the best way to add these sources to our datastore, which will then grow to close to 700 million bibliographic records. We think 1findr will then be able to add 5-10 million records we may not have yet, and using these and other sources we will likely surpass 100 million records this year, which will reassure users that they are searching close to the full population of articles published in peer-reviewed journals.
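For readers who like to see the mechanics, here is a minimal, purely illustrative sketch of how records from several sources can be clustered and merged into single entries. It is not 1science’s actual pipeline; the field names and helper functions are assumptions.

import re
from collections import defaultdict

def normalize_title(title):
    # Lowercase and strip punctuation so near-identical titles fall in the same cluster.
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def cluster_key(record):
    # Prefer the DOI when present; otherwise fall back on normalized title + year.
    if record.get("doi"):
        return ("doi", record["doi"].lower())
    return ("title", normalize_title(record["title"]), record.get("year"))

def merge(records):
    # Keep the most complete (longest) value seen for each field across the cluster.
    merged = {}
    for rec in records:
        for field, value in rec.items():
            if value and (field not in merged or len(str(value)) > len(str(merged[field]))):
                merged[field] = value
    return merged

def build_datastore(sources):
    clusters = defaultdict(list)
    for records in sources.values():
        for rec in records:
            clusters[cluster_key(rec)].append(rec)
    return [merge(group) for group in clusters.values()]

# Two sources describing the same article collapse into one enriched record.
sources = {
    "crossref": [{"doi": "10.1000/xyz", "title": "Fuel cells", "year": 2017}],
    "base": [{"doi": "10.1000/xyz", "title": "Fuel cells", "year": 2017,
              "oa_url": "https://repository.example.org/xyz.pdf"}],
}
print(build_datastore(sources))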


  2.  How have peer-reviewed journals been identified?
In a nutshell, through a long learning curve and an expensive, self-financed compilation process. We have been building a list of peer-reviewed journals since about 2002, with the first efforts initiated at Science-Metrix when we started the company. We pursued and intensified the effort at 1science as soon as we spun off the company in 2014; we now use dozens of methods to acquire candidate journals and are always adding new ones. We are constantly honing the list, adding journals and withdrawing those we find do not meet our criteria or for which we have evidence that quality review is being avoided. In short, the journals included in 1findr need to be scholarly/scientific/research journals and be peer-reviewed/refereed, which most of the time means having references; this excludes trade journals and popular science magazines. This working definition works very well in the health and natural sciences and in most of the arts, humanities and social sciences, and is somewhat harder to apply in architecture, the most atypical field in academia. Currently, the whitelist that contributes to building 1findr contains more than 85,000 journals, and 1findr already has content for close to 80,000 of them. This journal list is itself curated, clustered, enriched, and filtered from a larger dataset stored in a system containing more than 300,000 entries. We feel we are converging on an exhaustive inventory of contemporary active journals, but we still have work to do to identify the whole retrospective list of relevant journals as far back as 1665.


  3.  Are all document types in these journals covered? Editorial material, news, comments, book reviews...
The bibliographic database that powers 1findr is presently used mostly as a specialized discovery system for documents published in peer-reviewed journals. However, this datastore has been built from the ground up to evolve into a powerful bibliometric database. As such, we have concentrated our efforts on the document types considered to be “original contributions to knowledge”, that is, the types usually counted in bibliometric studies (e.g. articles, notes, and reviews). 1findr is positively biased towards these. That said, for most journals we have been collecting material cover to cover, but many items with no author currently stay in the datastore and have not made their way to 1findr yet. We will change our clustering/filtering rules in the next few months to include more material types, and 1findr will grow by several million records as a consequence of adding more news items, comments, and similar document types.
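As an illustration only (not our actual rules), a filter of this kind might look like the following sketch, where the document-type labels and the requirement for at least one author are assumptions.

# Document types treated as "original contributions to knowledge" in bibliometrics.
BIBLIOMETRIC_TYPES = {"article", "note", "review"}

def passes_filter(record, include_other_types=False):
    # Authorless items remain in the datastore but are not pushed to the discovery layer.
    if not record.get("authors"):
        return False
    if record.get("doc_type") in BIBLIOMETRIC_TYPES:
        return True
    # News, comments, book reviews, etc. would be let through once the rules are relaxed.
    return include_other_types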


  4.  How have OA versions of the documents been identified?
Using focused harvesters, 1findr scrutinizes the web in search of metadata sources that are likely to correspond to scholarly publications. To reduce the amount of upstream curation required, our system harvests only relevant metadata, which is used to build the datastore with its 450 million metadata records. When the system clusters documents and finds freely downloadable versions of the papers, it takes note of this. At 1science, we use the definition of “gratis” open access suggested by Peter Suber: articles are freely downloadable, readable and printable, but may or may not have rights attached. For example, disembargoed gold open access articles (gold OA meaning made available either directly by publishers or in a publisher-mediated manner) released through a moving paywall/free access model are frequently associated with residual rights, whereas green open access articles (green OA meaning archived by a party other than the publisher or the publisher’s mediator, SciELO and PubMed Central being examples of such mediators) more frequently are not. We code OA versions based on these definitions of green and gold. The OA colouring scheme has nothing to do with usage rights, or with whether a paper is a preprint, a postprint (the author’s final peer-reviewed manuscript) or the version of record. Who makes the paper available, what rights are attached, and what version of the manuscript is made available are three dimensions we are careful not to conflate. Most of the operational definitions we use in 1findr have their roots in the study Science-Metrix conducted for the European Commission on measuring the proportion of peer-reviewed journal articles available in open access:

http://science-metrix.com/sites/default/files/science-metrix/publications/d_1.8_sm_ec_dg-rtd_proportion_oa_1996-2013_v11p.pdf
You can also find other reports on OA for this and more recent projects on Science-Metrix’ selected reports list:
http://science-metrix.com/en/publications/reports
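To make the colour coding described above concrete, here is a minimal sketch under the simplifying assumption that the decisive signal is who hosts the free copy. The domain lists and function names are illustrative only, not the actual 1findr implementation.

from urllib.parse import urlparse

# Example domains of publisher mediators (SciELO, PubMed Central); illustrative only.
PUBLISHER_MEDIATORS = {"scielo.org", "ncbi.nlm.nih.gov"}

def oa_colour(url, publisher_domains):
    # Gold: the free copy is served by the publisher or a publisher's mediator.
    # Green: the free copy is archived by any other party (repository, personal site).
    # Note: this says nothing about usage rights or about which manuscript version it is.
    host = urlparse(url).netloc.lower()
    if any(host == d or host.endswith("." + d) for d in publisher_domains | PUBLISHER_MEDIATORS):
        return "gold"
    return "green"

print(oa_colour("https://www.harzing.com/papers/x.pdf", {"wiley.com"}))       # green
print(oa_colour("https://onlinelibrary.wiley.com/doi/pdf/x", {"wiley.com"}))  # gold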


  5.  How have articles been categorized in subjects?
To classify articles, we use the Science-Metrix classification (released under a CC BY licence), which Science-Metrix uses in its bibliometric studies:
http://www.science-metrix.com/en/classification
http://science-metrix.com/sites/default/files/science-metrix/sm_journal_classification_106_1.xls
This classification is available in more than 20 languages, and we are currently working on version 2.0. For the time being, 1findr uses the Science-Metrix classification to perform journal-level classification of articles, but stay tuned for article-level classification.
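As an illustration of journal-level classification, the following sketch assumes the Science-Metrix classification has been exported to a CSV keyed by ISSN (the real file linked above is an Excel spreadsheet, and the column names here are assumptions).

import csv

def load_classification(path):
    # Map a journal identifier (here, ISSN) to its (domain, field, subfield).
    mapping = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            mapping[row["issn"]] = (row["domain"], row["field"], row["subfield"])
    return mapping

def classify_article(article, journal_classes):
    # Journal-level classification: an article simply inherits its journal's subjects.
    return journal_classes.get(article["journal_issn"])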


  6.  To what extent has Google Scholar data been used to build 1findr?
We have used Google Scholar for tactical purposes, to do cross-checks and benchmarking. We do not scrape Google Scholar or use Google Scholar metadata. There are vestigial traces of Google Scholar in our system: between 1.8% and 4.4% of the hyperlinks to gratis OA papers used in 1findr could come from that source. These are progressively being replaced with refreshed links secured from other sources.

What really distinguishes 1findr from all the other data sources we know of is that we really care about global research. We haven’t seen anyone else doing as much work as we have to make accessible the extraordinarily interesting activity found in the long tail of science and academia. Just like most researchers, we want access to material from the top-tier publishers, and we’re very open to working with them to make their articles more discoverable and more useful for them and for the whole world. But we do not focus solely on the top tiers. The focus of 1science is on the big picture in the scholarly publishing world. Our research over the last 10 years has revealed that thousands of journals have emerged with the global transition to open access, and that thousands of journals in the East and in the Global South were traditionally ignored and unfairly shunned by the mainstream indexes. We are not creating a product that isolates Eastern or Southern content from a core package centered on the West. There is 1 science, it should be conveniently accessible in 1 place, and this is why we created 1findr.
Cordially

Éric

Eric Archambault, PhD
CEO  |  Chef de la direction
C. 1.514.518.0823
eric.archambault at science-metrix.com<mailto:eric.archambault at science-metrix.com>
science-metrix.com<http://www.science-metrix.com/>  &  1science.com<http://www.1science.com/>

From: SIGMETRICS <sigmetrics-bounces at asist.org> On Behalf Of Emilio Delgado López-Cózar
Sent: April-26-18 2:52 PM
To: sigmetrics at mail.asis.org
Subject: Re: [Sigmetrics] 1findr: research discovery & analytics platform


First of all, I would like to congratulate the team behind 1findr for releasing this new product. New scientific information systems with an open approach that make their resources available to the scientific community are always welcome. A few days ago another system was launched (Lens, https://www.lens.org), and not many weeks ago Dimensions was launched (https://app.dimensions.ai). The landscape of scientific information systems is becoming increasingly populated. Everyone is moving: new platforms with new features, new actors with new ideas, and old actors trying to adapt rather than die...

In order to build a solid idea of what 1findr is and how it has been constructed, I would like to ask some questions, since I haven't found their answers on the website:

What is the main source of the metadata to those 89,561,005 articles? Is it perhaps Crossref?

How have peer-reviewed journals been identified?

Are all document types in these journals covered? Editorial material, news, comments, book reviews...

How have OA versions of the documents been identified?

How have articles been categorised in subjects?

To what extent has Google Scholar data been used to build 1findr?

We think this information will help assess exactly what 1findr offers that is not offered by other platforms.

Kind regards,

---

Emilio Delgado López-Cózar

Facultad de Comunicación y Documentación

Universidad de Granada

http://scholar.google.com/citations?hl=es&user=kyTHOh0AAAAJ

https://www.researchgate.net/profile/Emilio_Delgado_Lopez-Cozar

http://googlescholardigest.blogspot.com.es



Dubitando ad veritatem pervenimus (Cicerón, De officiis. A. 451...)

Contra facta non argumenta

A fructibus eorum cognoscitis eos (San Mateo 7, 16)

On 2018-04-25 22:50, Kevin Boyack wrote:
Éric,
… and thanks to you for being so transparent about what you’re doing!
Kevin

From: SIGMETRICS On Behalf Of Éric Archambault
Sent: Wednesday, April 25, 2018 9:05 AM
To: Anne-Wil Harzing ; sigmetrics at mail.asis.org<mailto:sigmetrics at mail.asis.org>
Subject: Re: [Sigmetrics] 1findr: research discovery & analytics platform

Anne-Wil,
Thank you so much for this review. We need that kind of feedback to prioritize development.
Thanks a lot for the positive comments. We are happy that they reflect our design decisions. Now, onto the niggles (all fair points in current version).
An important distinction of our system – at this stage of development – is that our emphasis is on scholarly/scientific/research work published in peer-reviewed/quality-controlled journals (e.g. we don’t index trade journals and popular science magazines such as New Scientist – not a judgment on quality, many of them are stunningly good, they are just not the type we focus on for now). This stems from work conducted several years ago for the European Commission. We got a contract at Science-Metrix to measure the proportion of articles published in peer-reviewed journals that are available in open access. We discovered (discovery being a big word considering what follows) that 1) OA articles were hard to find and count (the numerator in the percentage), and 2) there wasn’t a database comprising all peer-reviewed journals (the denominator). Consequently, we had to work by sampling, but hard-core bibliometricians like us at Science-Metrix prefer working at the population level. At Science-Metrix, our bibliometric company, we have been using licensed bibliometric versions of the Web of Science and Scopus. Great tools, with very high quality data (obvious to anyone who has worked on big bibliographic metadata), extensive coverage and loads of high-quality, expensive-to-implement smart enrichment. However, when measuring, we noticed, as did many others, that these databases emphasize Western production to the detriment of the Global South, emerging countries, especially in Asia, and even the old Cold War foe in which the West lost interest after the fall of the Berlin Wall. 1findr addresses this: it aims to find as much OA as possible and to index everything peer-reviewed and academic-level published in journals. We aim to expand to other types of content with a rationally designed indexing strategy, but this is what we are obstinately focusing on for now.
-We are working on linking all the papers within 1findr with references/citations. This will create the first rationally designed citation network: from all peer-reviewed journals to all peer-reviewed journals, regardless of language, country or field of research (we won’t get there easily or soon). We feel this is a scientifically sound way to measure. Conferences and books are also important, but currently, when we take them into account in citations, we have extremely non-random lumps of indexed material, and no one can say what the effect on measured citations is. My educated guess is that this is extremely biased: book coverage is extremely linguistically biased, and conference proceedings indexing is extremely field biased (proportionately far more computer science and engineering than other fields). If we want to turn scientometrics into a proper science we need proper measurement tools. This is the long-term direction of 1findr. It won’t remain solely in the discovery field; it will become a scientifically designed tool to measure research, with clearly documented strengths and weaknesses.
-We still need to improve our coverage of OA. Though we find twice as many freely downloadable papers in journals as Dimensions does, ImpactStory finds about 8% OA for papers with a DOI for which we haven’t found a copy yet (one reason we have more OA as a percentage of journal articles is that 1findr finds a lot of OA for articles without DOIs). We are characterizing a sample of papers that are not OA on the 1findr side but which ImpactStory finds in OA. A glimpse at the data reveals that some of these are false positives, but some reflect approaches used by ImpactStory that we have not yet implemented (Heather and Jason are smart, and we can all learn from them - thanks to their generosity). There are also transient problems we experienced while building 1findr. For example, at the moment we have challenges with our existing Wiley dataset and need to update our harvester for Wiley’s site. It would be nice to have their collaboration, but they have been ignoring my emails for the last two months… A shame, as we’re only making their papers more discoverable and helping users worldwide find papers for which article processing charges were paid. We need the cooperation of publishers to do justice to the wealth of their content, especially hybrid OA papers.
-We know several papers display a “404”. We are improving the oaFindr link resolver built into 1findr to reduce this. We also need to scan more frequently for changes (we have to be careful there, as we don’t want to overwhelm servers; many of the servers we harvest from are truly slow and we want to be nice guys), and we need to continue implementing smarter mechanisms to avoid 404s. The transience of OA is a huge challenge. We have addressed several of the issues, but this takes time, and our team is of finite size while facing, as you note, several challenges and big ambitions at the same time.
-We are rewriting our “help” center. Please be aware that a query with no quotes is fully stemmed; single quotes apply stemming but require the words to appear in the same order in the results; double quotes should be used for non-stemmed, exact matches. This is a really powerful way of searching.
Fuel cell = finds articles with fuel and cell(s)
'fuel cell' = finds articles with both fuel cell and fuel cells
"fuel cell" = finds articles strictly with fuel cell (won’t return fuel cells only articles)
Once again, thanks for the review, and apologies for the lengthy reply.

Éric

Eric Archambault, PhD
CEO  |  Chef de la direction
C. 1.514.518.0823
eric.archambault at science-metrix.com<mailto:eric.archambault at science-metrix.com>
science-metrix.com<http://www.science-metrix.com/>  &  1science.com<http://www.1science.com/>

From: SIGMETRICS <sigmetrics-bounces at asist.org<mailto:sigmetrics-bounces at asist.org>> On Behalf Of Anne-Wil Harzing
Sent: April-24-18 5:11 PM
To: sigmetrics at mail.asis.org<mailto:sigmetrics at mail.asis.org>
Subject: Re: [Sigmetrics] 1findr: research discovery & analytics platform


Dear all,

I was asked (with a very short time-frame) to comment on 1Findr for an article in Nature (which I am not sure has actually appeared). I was given temporary login details for the Advanced interface.

As "per normal" with these kind of requests only one of my comments was actually used. So I am posting all of them here in case they are of use to anyone (and to Eric and his team in fine-tuning the system).

================
As I had a very limited amount of time to provide my comments, I tried out 1Findr by searching for my own name (I have about 150 publications including journal articles, books, book chapters, software, web publications and white papers) and some key terms in my own field (international management).
What I like
Simple and intuitive user interface with fast responses to search requests, much faster than some competitor products where the website can take ages to load. The flexibility of the available search options clearly reflects the fact that this is an offering built by people with a background in scientometrics.
A search for my own name showed that coverage at the author level is good: it finds more of my publications than both the Web of Science and Scopus, but fewer than Google Scholar and Microsoft Academic. It is approximately on par with CrossRef and Dimensions, though all three services (CrossRef, Dimensions and 1Findr) have unique publications that the others don't cover.
As far as I could assess, topic searches worked well with flexible options to search in title, keywords and abstracts. However, I have not tried these in detail.
Provides a very good set of subjects for filtering searches that – for the disciplines I can evaluate – shows much better knowledge of academic disciplines and disciplinary boundaries than is reflected in some competitor products. I particularly like the fact that there is more differentiation in the Applied Sciences, the Economic and Social Sciences and Arts & Humanities than in some other databases. This was sorely needed.
There is a quick summary of Altmetrics such as tweets, Facebook postings and Mendeley readers. Again I like the fact that a simple presentation is used, rather than the “bells & whistle” approach with the flashy graphics of some other providers. This keeps the website snappy and provides an instant overview.
There is good access to OA versions and a “1-click” download of all available OA versions [for a maximum of 40 publications at once as this is the upper limit of the number of records on a page]. I like the fact that it finds OA versions from my personal website (www.harzing.com<http://www.harzing.com>) as well as OA versions in university repositories and gold OA versions. However, it doesn’t find all OA versions of my papers (see dislike below).
What I dislike
Although I like the fact that 1Findr doesn’t try to be anything and everything, which would lead to a cluttered user interface, for me the fact that it doesn’t offer citation metrics limits its usefulness. Although I understand its focus is on finding literature (which is fair enough), many academics – rightly or wrongly – use citation scores to decide which articles to prioritize for downloading and reading.
The fact that it doesn’t yet find all open access versions that Google Scholar and Microsoft Academic do. All my publications are available in OA on my website, but 1Findr does not seem to find all of them. 1Findr also doesn’t seem to source OA versions from ResearchGate. Also, several OA versions resulted in a “404. The requested resource is not found.”
The fact that it only seems to cover journal articles. None of my books, book chapters, software, white papers or web publications were found. Although a focus on peer-reviewed work is understandable, I think coverage of books and book chapters is essential, and services like Google Scholar, Microsoft Academic and CrossRef do cover books.
Niggles
There are duplicate results for quite a few of my articles, usually “poorer” versions (i.e. without full text/abstract/altmetric scores). It would be good if the duplicates could be removed and only the “best” version kept.
Automatic stemming of searches is awkward if you try to search for author names in the “general” search (as many users will do). In my case (Harzing) it results in hundreds of articles on the Harz mountains, obscuring all of my output.
Preferred search syntax should be clearer, as many users will search for authors with initials only (as this is what works best in other databases). In 1Findr this provides very few results as there are “exact” matches only, whereas in other databases initial searches are interpreted as initial + wildcard.
More generally, it needs better author disambiguation. Some of my articles can only be found when searching for a-w harzing, a very specific rendition of my name.
When exporting citations, the order seems to revert to alphabetical order of the first author, not the order that was on the screen.

Best wishes,
Anne-Wil
Prof. Anne-Wil Harzing
Professor of International Management
Middlesex University London, Business School

Web: Harzing.com<https://harzing.com> - Twitter: @awharzing<https://twitter.com/awharzing> - Google Scholar: Citation Profile<https://scholar.google.co.uk/citations?user=v0sDYGsAAAAJ>
New: Latest blog post<https://harzing.com/blog/.latest?redirect> - Surprise: Random blog post<https://harzing.com/blog/.random> - Finally: Support Publish or Perish<https://harzing.com/resources/publish-or-perish/donations>
On 24/04/2018 21:51, Bosman, J.M. (Jeroen) wrote:
Of course there is much more to say about 1Findr. What I have seen so far is that the coverage back to 1944 is very much akin to Dimensions, probably because both are deriving the bulk of their records from Crossref.

Full text search is relatively rare among these systems. Google Scholar does it. Dimensions does it on a subset. And some publisher platforms support it, as do some OA aggregators.

Apart from these two aspects (coverage and full text search support), there are a lot of aspects and (forthcoming) 1Findr functionalities that deserve scrutiny, not least the exact method of OA detection (and version priority) of course.

Jeroen Bosman
Utrecht University Library
________________________________
From: SIGMETRICS [sigmetrics-bounces at asist.org<mailto:sigmetrics-bounces at asist.org>] on behalf of David Wojick [dwojick at craigellachie.us<mailto:dwojick at craigellachie.us>]
Sent: Tuesday, April 24, 2018 8:59 PM
To: Mark C. Wilson
Cc: sigmetrics at mail.asis.org<mailto:sigmetrics at mail.asis.org>
Subject: Re: [Sigmetrics] 1findr: research discovery & analytics platform
There is a joke that what is called "rapid prototyping" actually means fielding the beta version. In that case every user is a beta tester.

It is fast and the filter numbers are useful in themselves. Some of the hits are a bit mysterious. It may have unique metric capabilities. Too bad that advanced search is not available for free.

David

At 02:34 PM 4/24/2018, Mark C. Wilson wrote:
Searching for my own papers I obtained some wrong records and the link to arXiv was broken. It does return results very quickly and many are useful. I am not sure whether 1science intended to use everyone in the world as beta-testers.


On 25/04/2018, at 06:16, David Wojick <dwojick at craigellachie.us<mailto:dwojick at craigellachie.us> > wrote:

It appears not to be doing full text search, which is a significant limitation. I did a search on "chaotic" for 2018 and got 527 hits. Almost all had the term in the title and almost all of the remainder had it in the abstract. Normally with full text search, the documents with the term only in the body are many times more numerous than those with it in the title, often by orders of magnitude.

But the scope is impressive, as is the ability to filter for OA.

David

David Wojick, Ph.D.
Formerly Senior Consultant for Innovation
DOE OSTI https://www.osti.gov/


At 08:00 AM 4/24/2018, you wrote:
Greetings everyone,

Today, 1science announced the official launch of 1findr, its platform for research discovery and analytics. Indexing 90 million articles, of which 27 million are available in OA, it represents the largest curated collection worldwide of scholarly research. The platform aims to include all articles published in peer-reviewed journals, in all fields of research, in all languages and from every country.

Here are a few resources if you’re interested in learning more:

•  Access the 1findr platform: www.1findr.com<http://www.1findr.com/>
•  Visit the 1findr website: www.1science.com/1findr<http://www.1science.com/1findr>
•  Send in your questions: 1findr at 1science.com<mailto:1findr at 1science.com>
•  See the press release: www.1science.com/1findr-public-launch<http://www.1science.com/1findr-public-launch>

Sincerely,

Grégoire

Grégoire Côté
President | Président
Science-Metrix
1335, Mont-Royal E
Montréal, QC  H2J 1Y6
Canada


T. 1.514.495.6505 x115
T. 1.800.994.4761
F. 1.514.495.6523
gregoire.cote at science-metrix.com<mailto:gregoire.cote at science-metrix.com>
www.science-metrix.com<http://www.science-metrix.com/>










_______________________________________________

SIGMETRICS mailing list

SIGMETRICS at mail.asis.org<mailto:SIGMETRICS at mail.asis.org>

http://mail.asis.org/mailman/listinfo/sigmetrics
