Corporate addresses in Web of Science v5

Pikas, Christina K. Christina.Pikas at JHUAPL.EDU
Mon Jul 18 16:49:01 EDT 2011


Hi All-
I shared this with a colleague of mine who is working with me on a science mapping project and he made a very logical point: How odd it is that WoS data are getting more difficult to parse instead of easier! The prevailing movement is to make data more interoperable and more machine-friendly.

Why, oh why, isn't it easier to figure out which author goes with which address after all this time?

Thank you for your efforts, Loet.

Christina


----
Christina K Pikas
Librarian
The Johns Hopkins University Applied Physics Laboratory
Christina.Pikas at jhuapl.edu
(240) 228 4812 (DC area)
(443) 778 4812 (Baltimore area)




From: ASIS&T Special Interest Group on Metrics [mailto:SIGMETRICS at listserv.utk.edu] On Behalf Of Loet Leydesdorff
Sent: Monday, July 18, 2011 8:41 AM
To: SIGMETRICS at listserv.utk.edu
Subject: [SIGMETRICS] Corporate addresses in Web of Science v5

Dear colleagues,

Unlike the WoS4 interface, the new WoS5 interface (introduced yesterday) does not use a unique delimiter to separate the addresses in each record. (WoS4 used a period for this.) It can therefore be impossible to determine unambiguously whether a new line continues the previous address or starts a new one.

I used the following rules. If author names are coupled to the addresses, the names are placed between brackets, and these brackets can be used to delimit the addresses unambiguously. In this case, the relations between authors and corporate addresses are stored in the file csau.dbf.
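
As an illustration of the bracket rule, the following Python sketch splits an address field into (authors, address) pairs. It is not the code of ISI.exe. The function name, the regular expression, and the assumed layout "[Author; Author] Organization, City, Country" are assumptions made for this example.

```python
import re

# Assumed layout (hypothetical): "[Author1; Author2] Organization, City, Country"
# Each bracketed author list opens a new address.
BRACKETED = re.compile(r"\[([^\]]*)\]\s*([^\[]*)")

def split_bracketed_addresses(field):
    """Return a list of (authors, address) pairs from a bracketed address field."""
    # Collapse line breaks and runs of spaces so continuation lines are joined.
    text = " ".join(field.split())
    pairs = []
    for match in BRACKETED.finditer(text):
        authors = [a.strip() for a in match.group(1).split(";") if a.strip()]
        # Drop a trailing separator such as ";" or "." if one is present.
        address = match.group(2).strip().rstrip(";.").strip()
        pairs.append((authors, address))
    return pairs
```

Because every address starts at an opening bracket, line breaks inside an address do not matter in this case.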

If there are no author names, an address can occupy one or two lines. The following rules are used:

1. if the line ends with a comma, the next line is considered a continuation of the address;
2. if the next line contains no commas or only a single one, it is also considered a continuation of the previous line;
3. in all other cases, the two lines are concatenated and the commas in the result are counted. If this number is five or larger, the two lines are considered separate addresses. The threshold of five is chosen because a concatenation with four commas can in some cases still be a single address, but one with six commas almost never is. However, errors are possible in both directions because some addresses contain only two commas and some individual addresses more than four.
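
The three rules can be sketched in Python as follows. This is an illustration and not the code of ISI.exe. The function names are hypothetical, and the sketch assumes that rule 3 compares the address accumulated so far with the next line.

```python
COMMA_THRESHOLD = 5  # five or more commas: treat the two lines as separate addresses

def is_continuation(previous, following):
    """Decide whether `following` continues the address in `previous`."""
    if previous.rstrip().endswith(","):       # rule 1: line ends with a comma
        return True
    if following.count(",") <= 1:             # rule 2: next line has 0 or 1 commas
        return True
    combined = previous + " " + following     # rule 3: count commas after joining
    return combined.count(",") < COMMA_THRESHOLD

def group_address_lines(lines):
    """Merge the lines of an address field into a list of addresses."""
    addresses = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        if addresses and is_continuation(addresses[-1], line):
            addresses[-1] = addresses[-1] + " " + line
        else:
            addresses.append(line)
    return addresses
```

For example, the lines "Univ A, Dept Phys," / "City A, Country A" / "Univ B, City B, Country B" yield two addresses. Two short addresses with two commas each, however, add up to only four commas and are merged into one, which is the kind of error the threshold cannot avoid.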

In some older records, the field for the corresponding author (RP) contains additional information (Costas & Iribarren-Maestro, Scientometrics, 2007). This field is tested on whether its first subfield (the organization) and its country name are the same as in the addresses already harvested. A new address is added only if this test fails. The number of this address (CSNR in CS.DBF) is 999, to distinguish it clearly from the other addresses, which are harvested from the address fields (C1) and numbered consecutively.
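
A minimal Python sketch of this test is given below. It is not the code of ISI.exe. It assumes that the RP address has already been stripped of the author name, that the organization is the first comma-separated subfield and the country the last, and that the RP address is compared against every C1 address. The function names are hypothetical.

```python
RP_NUMBER = 999  # CSNR reserved for an address taken from the RP field

def subfields(address):
    """Split an address on commas and normalize each part for comparison."""
    return [p.strip().rstrip(".").lower() for p in address.split(",") if p.strip()]

def same_org_and_country(a, b):
    """True if organization (first subfield) and country (last subfield) agree."""
    fa, fb = subfields(a), subfields(b)
    return bool(fa) and bool(fb) and fa[0] == fb[0] and fa[-1] == fb[-1]

def number_addresses(c1_addresses, rp_address=None):
    """Number C1 addresses from 1 and append the RP address as 999 if it is new."""
    numbered = [(i, addr) for i, addr in enumerate(c1_addresses, start=1)]
    if rp_address and not any(
        same_org_and_country(rp_address, addr) for addr in c1_addresses
    ):
        numbered.append((RP_NUMBER, rp_address))
    return numbered
```

With this sketch, an RP address that differs from a C1 address only in the department or city is not added, while one naming a different organization or country is added with number 999.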

The above procedure with the five-comma test will unavoidably generate some errors. However, this is a consequence of the restructuring of the address field in the new WoS5 interface. In the files I tested, I also noticed that short addresses may be country-specific (e.g., Germany). Most addresses, however, contain three or four commas. Two consecutive addresses with two commas each (and no author information between brackets) can be erroneously concatenated. Addresses with five or more commas may erroneously be split into two addresses (if the line break is not at a comma, etc.).

I will respond to feedback suggesting improvements. The various programs that use the address field will be replaced as demand arises. All programs will now be Win32, although they keep the same user interface (using the command prompt). The upgrade to Win32 is needed because 64-bit computers can no longer run the 16-bit DOS programs.

This morning I replaced ISI.exe at http://www.leydesdorff.net/software/isi with a new version. Feedback is appreciated. I'll replace the other programs before December 31, when v4 becomes obsolete.

Best wishes (and apologies for crosspostings),
Loet

________________________________
Loet Leydesdorff
Professor, University of Amsterdam
Amsterdam School of Communications Research (ASCoR)
Kloveniersburgwal 48, 1012 CX Amsterdam.
Tel. +31-20-525 6598; fax: +31-842239111
loet at leydesdorff.net; http://www.leydesdorff.net/
Visiting Professor, ISTIC, Beijing <http://www.istic.ac.cn/Eng/brief_en.html>; Honorary Fellow, SPRU, University of Sussex <http://www.sussex.ac.uk/spru/>
