[Sigsti-l] cataloging for scientific data

Thu Oct 11 14:40:03 EDT 2007

As the list's been quiet lately, and it's only a week or so 'til the 
Annual Meeting, I thought I'd bring up something that's been bugging me 
for quite some time in the science informatics field, that I think that 
some of the folks on this list might be interested in.

(I've also been asked to give a presentation at the Library of Congress, 
so I'm testing the water to see how interesting the issue is to the 
library and information science community)

I'll be at the Annual Meeting, so if anyone wants to discuss this in 
person, just let me know.

...

I'm going to try to explain this in as generic terms as I can for 
two reasons -- I'm going to assume that most of the people on this mailing 
list typically work with bibliographical records, not data archives; also, 
the terminology used between scientific disciplines (eg, 'level 3 data', 
'data product') is not consistent.

Basically, the problems in library cataloging that resulted in FRBR are 
still present in scientific data.  The terms 'data set' and 'data product' 
are as ambiguous in the scientific community as the term 'book' is to the 
library community.  Although FRBR doesn't take care of all of the 
ambiguities in deciding whether an object is a new Work, 
Expression, Manifestation, or Item (eg, the 14th edition of a math 
textbook might be considered the same Work as the 13th edition and a new 
Expression, or it might qualify as a new Work), the FRBR model still helps 
reduce confusion in discussions of catalogs.

In my case, I'm dealing with federated searching across heterogeneous 
science archives.  The concept of what level the data should be cataloged 
at varies greatly:

 	1. The catalog is of 'observations'*, and there is a single
 	   archival object per observation.
 		1(a) the archival object is the raw sensor data (or lowest
 		level data available, as it may have been compressed
 		before transmission from a spacecraft).  In this case,
 		software and calibration data are necessary to make the
 		data useful to the scientific community.
 		1(b) the archival object is of calibrated data, and may be
 		subject to change if they determine something is wrong
 		with the calibration of the sensor.  The data may be in
 		physical units and directly usable, or the calibration may
 		simply remove known sensor anomalies and still require
 		further software to make the data comparable to data from
 		other sensors.

 	2. The catalog is of objects being served by the archive, and the
 	   archive may have data at multiple states of calibration (raw
 	   vs. calibrated vs. some 'higher level data product'), multiple
 	   editions (which calibration was applied to it; they may have a
 	   'quick' calibration applied that's easier to generate and a
 	   better calibration that requires additional future observations
 	   to generate), multiple resolutions (full resolution vs. reduced
 	   resolution 'browse products'), or multiple file formats (some
 	   of which may not be 'scientifically useful' on their own, as
 	   they don't have the necessary metadata)

 	* of course, I can't even get a consistent agreement on what is an
 	  'observation'.  There's a concept of 'data granule' that gets
 	  thrown around, but the term is applied differently across both
 	  science disciplines and different types of sensors.

Of course, as we start federating archives, we start having even more 
problems in duplication, as two archives may serve the 'same' data which 
has gone through different processing, or contains different metadata.
Frequently, the archives use different naming conventions, making it 
difficult to verify the two objects contain data of the same observation.

As I don't know what the intent of the user is, I can't know for sure how 
to de-duplicate the data.  Although some discipline-focused search systems 
can make assumptions based on their community, most of the available 
funding is geared towards making the data available to a broader 
community, and our assumptions may not hold up for people trying to do 
cross-discipline research.

Each of the individual objects has a purpose for being 
archived, and finding a way to present the differences between the 
different objects so that a researcher can select the best object for 
their needs -- without overwhelming them with too many choices -- is 
almost impossible to do without consistency across archives.
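
As a rough sketch of what I mean by presenting the differences rather than 
de-duplicating, here is a small Python example.  The record fields and 
values are made up for illustration (they are not any archive's actual 
schema), and it assumes that sensor plus start time is enough to identify an 
observation, which only works if the archives agree on what an observation 
is:

```python
from collections import defaultdict

# Hypothetical catalog records from two archives; the field names and
# values are invented for illustration only.
records = [
    {"archive": "A", "name": "obs_0001_raw", "sensor": "S1",
     "start": "2007-10-11T14:40:03", "processing": "raw"},
    {"archive": "B", "name": "S1-20071011-144003-cal", "sensor": "S1",
     "start": "2007-10-11T14:40:03", "processing": "calibrated"},
    {"archive": "B", "name": "S1-20071011-150000-cal", "sensor": "S1",
     "start": "2007-10-11T15:00:00", "processing": "calibrated"},
]


def group_by_observation(records):
    """Group records that appear to describe the same observation.

    Nothing is discarded: every archived object stays in its group so
    the researcher can see how the candidates differ.
    """
    groups = defaultdict(list)
    for rec in records:
        # Assumed observation key: (sensor, start time).
        groups[(rec["sensor"], rec["start"])].append(rec)
    return dict(groups)


for key, members in group_by_observation(records).items():
    print(key, [(m["archive"], m["processing"]) for m in members])
```

Note that the archive naming conventions ('name') play no part in the 
grouping; the match depends entirely on both archives exposing comparable 
metadata.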

Anyway, this spring, I was invited to a workshop on 'Science Archives in 
the 21st Century', and I presented a poster entitled 'FRBR in a Scientific 
Data Context', discussing the need for something like FRBR in the 
scientific community:

 	http://nssdc.gsfc.nasa.gov/nost/conf/archive21st/presentations/posters/p11-hourcle.pdf
 	(warning -- it's 6.5MB)

The lack of cataloging standards not only results in problems for 
federating archives, but it also makes it impossible to ensure that any 
discussion of data can be accurately tracked.  For instance, in a journal 
article, a scientist will mention what instrument's data they used, the 
time range, the observed location, and what sort of calibration was 
applied to the data ... but if we don't know what their source was for the 
data, and what ancillary data (eg, calibration) was used to generate the 
data they analyzed, we can't show that the event they observed wasn't 
introduced by mishandling of the data.

We need identifiers that can be used to track the provenance of the data, 
as well as to identify the data used in research.  Unfortunately, to 
get those identifiers, we need to determine what level of granularity they 
need to be applied at, or they won't be useful.  Most likely, we need 
identifiers at all levels of granularity, but we need to make sure that 
the levels are consistently applied across scientific disciplines.

In the poster I mentioned above, I argue that we can actually use FRBR 
group 1 entities with one exception -- the concept of 'work' doesn't quite 
fit.  However, if we instead add an entity for 'Raw Sensor Data', and a 
group 2 entity for 'Sensor', I think we can map most, if not all, 
scientific catalogs to the model.  'Work' would still be used for 
human-produced scientific data, such as scientific models or event 
catalogs.

Sensor data would then be cataloged at 4 levels:

 	Raw Sensor Data
 		an 'observation' for lack of a better term.  The exact
 		boundaries may still require the scientists to work them
 		out.  It would likely be a function of the sensor's
 		operating mode.  (eg, for particle counters, it may be all
 		counts over an hour; for telescopes, it would be a single
 		reading of the CCD.)
 	Expression
 		This would be what 'translation' has been applied to the
 		data.  Any of the following would be a new Expression:
 			a different calibration was applied
 			any sort of reduction or subsetting was applied
 			any form of lossy compression
 			any other non-reversible process
 	Manifestation
 		This would be how the data is packaged.  Any of the
 		following would be a new Manifestation:
 			changes in metadata
 			data aggregation
 			changes in file format
 			changes in the on-disk representation
 				(eg. non-lossy compression, byte order)
 	Item
 		An individual file.
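
To make the four levels a little more concrete, here is a minimal Python 
sketch.  The class names follow the levels above, but the fields, the change 
labels, and the helper functions are all hypothetical; this is one way the 
model could be written down, not an agreed schema:

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    RAW_SENSOR_DATA = 1
    EXPRESSION = 2
    MANIFESTATION = 3
    ITEM = 4


# Hypothetical labels for the kinds of change listed above.
NEW_EXPRESSION = {"calibration", "reduction", "subsetting",
                  "lossy_compression", "non_reversible_process"}
NEW_MANIFESTATION = {"metadata", "aggregation", "file_format",
                     "lossless_compression", "byte_order"}


def level_for_change(change: str) -> Level:
    """Return the level at which a change produces a new entity."""
    if change in NEW_EXPRESSION:
        return Level.EXPRESSION
    if change in NEW_MANIFESTATION:
        return Level.MANIFESTATION
    raise ValueError(f"unclassified change: {change!r}")


@dataclass(frozen=True)
class RawSensorData:
    sensor: str        # the group 2 'Sensor' entity, reduced to a name
    observation: str   # whatever the discipline agrees an observation is


@dataclass(frozen=True)
class Expression:
    raw: RawSensorData
    processing: str    # e.g. which calibration or reduction was applied


@dataclass(frozen=True)
class Manifestation:
    expression: Expression
    packaging: str     # e.g. file format, compression, metadata version


@dataclass(frozen=True)
class Item:
    manifestation: Manifestation
    location: str      # an individual file


def same_observation(a: Item, b: Item) -> bool:
    """True if two files trace back to the same Raw Sensor Data."""
    return a.manifestation.expression.raw == b.manifestation.expression.raw


raw = RawSensorData("sensor-1", "2007-10-11T14:40:03")
quick = Item(Manifestation(Expression(raw, "quick calibration"), "format-x"),
             "archive1/file_a")
final = Item(Manifestation(Expression(raw, "final calibration"), "format-y"),
             "archive2/file_b")
print(same_observation(quick, final))        # True
print(level_for_change("byte_order").name)   # MANIFESTATION
```

Two files from different archives can then be recognized as the same 
observation even though they differ at the Expression and Manifestation 
levels, provided the Raw Sensor Data entity is identified consistently.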

Unfortunately, there's still a lot of ambiguity in where the boundaries 
are for Raw Sensor Data, and it likely won't get cleared up, as the team 
operating the instrument may generate data objects that aren't useful** 
individually.

** of course, we have to ask 'useful to whom?'  It may not be useful to the 
instrument team's primary experiment, but may be useful for other studies. 
(or vice versa)

....

Of course, how we go about fixing this issue is another problem.  Quite a 
few people were interested in the concept once I explained it to them, as 
it would allow them to build more useful applications and better serve 
their community ... but most hadn't recognized this as a problem.  A few 
had seen some of the 'symptoms' of not having the necessary framework, but 
that was it.

It also doesn't help that I'm not involved in this sort of work directly 
-- my primary taskings are in software application development.  It's just 
one of those annoying things I have to deal with.

-----
Joe Hourcle