[Sigsti-l] cataloging for scientific data
Joe Hourcle
oneiros at grace.nascom.nasa.gov
Thu Oct 11 14:40:03 EDT 2007
As the list's been quiet lately, and it's only a week or so 'til the
Annual Meeting, I thought I'd bring up something that's been bugging me
for quite some time in the science informatics field, and that I think
some of the folks on this list might be interested in.
(I've also been asked to give a presentation at the Library of Congress,
so I'm testing the water to see how interesting the issue is to the
library and information science community)
I'll be at the Annual Meeting, so if anyone wants to discuss this in
person, just let me know.
...
I'm going to try to explain this in concepts as generic as I can, for
two reasons: I'm going to assume that most of the people on this mailing
list typically work with bibliographic records, not data archives; also,
the terminology used across scientific disciplines (e.g., 'level 3 data',
'data product') is not consistent.
Basically, the problems in library cataloging that resulted in FRBR are
still present in scientific data. The terms 'data set' and 'data product'
are as ambiguous in the scientific community as the term 'book' is to the
library community. Although FRBR doesn't take care of all of the
ambiguities in deciding whether an object is a new Work,
Expression, Manifestation, or Item (e.g., the 14th edition of a math
textbook might be considered the same Work as the 13th edition and a new
Expression, or it might qualify as a new Work), the FRBR model still helps
reduce confusion in discussions of catalogs.
In my case, I'm dealing with federated searching across heterogeneous
science archives. The concept of what level the data should be cataloged
at varies greatly:
    1. The catalog is of 'observations'*, and there is a single
       archival object per observation.
       1(a) The archival object is the raw sensor data (or lowest
            level data available, as it may have been compressed
            before transmission from a spacecraft). In this case,
            software and calibration data are necessary to make the
            data useful to the scientific community.
       1(b) The archival object is calibrated data, and may be
            subject to change if something is found to be wrong
            with the calibration of the sensor. The data may be in
            physical units and directly usable, or the calibration may
            simply remove known sensor anomalies and still require
            further software to make the data comparable to data from
            other sensors.
    2. The catalog is of objects being served by the archive, and the
       archive may have data at multiple states of calibration (raw
       vs. calibrated vs. some 'higher level data product'), multiple
       editions (which calibration was applied to it; there may be a
       'quick' calibration that's easier to generate and a
       better calibration that requires additional future observations
       to generate), multiple resolutions (full resolution vs. reduced
       resolution 'browse products'), or multiple file formats (some
       of which may not be 'scientifically useful' on their own, as they
       don't have the necessary metadata).
    * Of course, I can't even get consistent agreement on what an
      'observation' is. There's a concept of 'data granule' that gets
      thrown around, but the term is applied differently across both
      science disciplines and different types of sensors.
Of course, as we start federating archives, we start having even more
problems in duplication, as two archives may serve the 'same' data which
has gone through different processing, or contains different metadata.
Frequently, the archives use different naming conventions, making it
difficult to verify that the two objects contain data from the same observation.
As I don't know what the intent of the user is, I can't know for sure how
to de-duplicate the data. Although some discipline-focused search systems
can make assumptions based on their community, most of the available
funding is geared towards making the data available to a broader
community, and our assumptions may not hold up for people trying to do
cross-discipline research.
Each of the individual objects has a purpose for being
archived, and finding a way to present the differences between the
different objects so that a researcher can select the best object for
their needs -- without overwhelming them with too many choices -- is
almost impossible to do without consistency across archives.
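As a rough illustration of the de-duplication problem above, here is a minimal Python sketch that groups catalog records from different archives when they appear to describe the same observation. Everything in it is an assumption made for illustration: the Record fields, the archive and sensor names, and especially the idea that a normalized sensor name and start time are available to act as a shared key. In practice, getting that shared key is exactly the hard part.

```python
from collections import defaultdict
from typing import NamedTuple


class Record(NamedTuple):
    archive: str      # which archive serves the object
    sensor: str       # hypothetical normalized sensor name
    obs_start: str    # hypothetical observation start time (ISO 8601)
    processing: str   # e.g. 'raw' or 'calibrated'


def group_by_observation(records):
    """Group records that appear to describe the same observation."""
    groups = defaultdict(list)
    for rec in records:
        # Assumed key: same sensor and same start time means same observation.
        groups[(rec.sensor, rec.obs_start)].append(rec)
    return dict(groups)


# Made-up records, for illustration only.
records = [
    Record("archive_a", "sensor_x", "2007-10-11T14:40:03Z", "raw"),
    Record("archive_b", "sensor_x", "2007-10-11T14:40:03Z", "calibrated"),
    Record("archive_b", "sensor_x", "2007-10-11T15:40:03Z", "calibrated"),
]

for key, members in group_by_observation(records).items():
    print(key, [(m.archive, m.processing) for m in members])
```

The sketch deliberately keeps every member of a group rather than picking a winner, since which object is 'best' depends on what the researcher intends to do with it.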
Anyway, this spring, I was invited to a workshop on 'Science Archives in
the 21st Century', and I presented a poster entitled 'FRBR in a Scientific
Data Context', discussing the need for something like FRBR in the
scientific community:
http://nssdc.gsfc.nasa.gov/nost/conf/archive21st/presentations/posters/p11-hourcle.pdf
(warning -- it's 6.5MB)
The lack of cataloging standards not only results in problems for
federating archives, but it also makes it impossible to ensure that any
discussion of data can be accurately tracked. For instance, in a journal
article, a scientist will mention what instrument's data they used, the
time range, the observed location, and what sort of calibration was
applied to the data ... but if we don't know what their source was for the
data, and what ancillary data (e.g., calibration) was used to generate the
data they analyzed, we can't show that the event they observed wasn't
introduced by mishandling of the data.
We need identifiers that can be used to track the provenance of the data,
as well as to identify the data that is used in research. Unfortunately, to
get those identifiers, we need to determine what level of granularity they
need to be applied at, or they won't be useful. Most likely, we need
identifiers at all levels of granularity, but we need to make sure that
the levels are consistently applied across scientific disciplines.
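To make the provenance idea concrete, here is a small Python sketch of a data object that records its identifier, the object it was derived from, and the ancillary data used to produce it. The class name, field names and identifier strings are all hypothetical; no existing archive is known to use this scheme.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataObject:
    identifier: str                         # hypothetical identifier string
    source: Optional["DataObject"] = None   # object this one was derived from
    ancillary: tuple = ()                   # identifiers of ancillary data used


def lineage(obj):
    """Walk back from a data object to the data it ultimately came from."""
    chain = []
    while obj is not None:
        chain.append(obj.identifier)
        obj = obj.source
    return chain


# Made-up identifiers, for illustration only.
raw = DataObject("raw-001")
calibrated = DataObject("cal-001", source=raw, ancillary=("calibration-v2",))

print(lineage(calibrated))
print(calibrated.ancillary)
```

If a journal article cited an identifier like the one on the calibrated object, a reader could in principle follow the chain back to the raw data and see which calibration was applied along the way.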
In the poster I mentioned above, I argue that we can actually use the FRBR
group 1 entities with one exception -- the concept of 'Work' doesn't quite
fit. However, if we instead add an entity for 'Raw Sensor Data', and a
group 2 entity for 'Sensor', I think we can map most, if not all,
scientific catalogs to the model. 'Work' would still be used for
human-produced scientific data, such as scientific models or event
catalogs.
Sensor data would then be cataloged at 4 levels:
    Raw Sensor Data
        An 'observation', for lack of a better term. The exact
        boundaries may still require the scientists to work them
        out. It would likely be a function of the sensor's
        operating mode (e.g., for particle counters, it may be all
        counts over an hour; for telescopes, it would be a single
        reading of the CCD).
    Expression
        This would be what 'translation' has been applied to the
        data. Any of the following would be a new Expression:
            a different calibration was applied
            any sort of reduction or subsetting was applied
            any form of lossy compression
            any other non-reversible process
    Manifestation
        This would be how the data is packaged. Any of the
        following would be a new Manifestation:
            changes in metadata
            data aggregation
            changes in file format
            changes in the on-disk representation
                (e.g., non-lossy compression, byte order)
    Item
        An individual file.
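The rules above can be sketched as a small lookup in Python: given the kind of change made to a data object, return the level at which a new entity would be created. The change labels are hypothetical names chosen for this sketch (they paraphrase the lists above), and anything not in those lists is deliberately left unclassified rather than guessed at.

```python
from enum import Enum


class Level(Enum):
    RAW_SENSOR_DATA = 1
    EXPRESSION = 2
    MANIFESTATION = 3
    ITEM = 4


# Hypothetical labels for the changes listed under Expression.
NEW_EXPRESSION = {
    "calibration",
    "reduction",
    "subsetting",
    "lossy_compression",
    "irreversible_process",
}

# Hypothetical labels for the changes listed under Manifestation.
NEW_MANIFESTATION = {
    "metadata",
    "aggregation",
    "file_format",
    "on_disk_representation",  # e.g. non-lossy compression, byte order
}


def level_of_change(change):
    """Return the level at which this kind of change creates a new entity."""
    if change in NEW_EXPRESSION:
        return Level.EXPRESSION
    if change in NEW_MANIFESTATION:
        return Level.MANIFESTATION
    raise ValueError(f"unclassified change: {change!r}")


print(level_of_change("calibration").name)
print(level_of_change("file_format").name)
```

Raising an error for unlisted changes is a design choice for the sketch: the boundaries are still open to debate, so it seemed safer not to assign a default level.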
Unfortunately, there's still a lot of ambiguity in where the boundaries
are for Raw Sensor Data, and it likely won't get cleared up, as the team
operating the instrument may generate data objects that aren't useful**
individually.
** Of course, we have to ask 'useful to whom?' It may not be useful to the
instrument team's primary experiment, but may be useful for other studies
(or vice versa).
....
Of course, how we go about fixing this issue is another problem. Quite a
few people were interested in the concept once I explained it to them, as
it would allow them to build more useful applications and better serve
their community ... but most hadn't recognized this as a problem. A few
had seen some of the 'symptoms' of not having the necessary framework, but
that was it.
It also doesn't help that I'm not involved in this sort of work directly
-- my primary taskings are in software application development. It's just
one of those annoying things I have to deal with.
-----
Joe Hourcle