[Pasig-discuss] WORM (Write Once Read Many) AIPs

Jacob Farmer jfarmer at cambridgecomputer.com
Fri May 12 16:51:06 EDT 2017


Great point.  I think of the whole thing as a stack.  There is the metadata
and bits that define an object from the preservation point of view.  Then
there is a storage device that defines an object as a specific set of bits
to serve up.

In the case of my software, Starfish, we think of ourselves as a middleware
that can define the object in some intermediate form.

At the end of the day, though, an object is any piece of data that can be
addressed and manipulated.  That piece of data should have a permanent
address, unique identifiers, and some metadata that gives it meaning.





-----Original Message-----
From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of
Neil Jefferies
Sent: Friday, May 12, 2017 4:43 PM
To: Jacob Farmer <jfarmer at cambridgecomputer.com>
Subject: Re: [Pasig-discuss] WORM (Write Once Read Many) AIPs

Jacob,

This is the key point of my argument - the definition of object you have is
not the definition of an object that an archive wants to preserve.
I'm speaking for people like Tim and me - others are quite happy to build
what I term bit-museums.

Likewise, what you consider preservation (immutability of a bitstream) is
not quite the same as ours - retention of knowledge content - which requires
mutability but with immutable previous versions and provenance/audit
records.

As long as this disconnect between technology and requirements remains the
case, object stores are actually of limited use for us in preservation and
archiving without considerable additional work. The 'metadata' that most
object stores support (key-value pairs) is pretty useless as far as our
metadata requirements go - in the end we have to store XML or triples as
separate files/objects. This was an issue when I reviewed the StorageTek
5800 code builds way back and frankly object storage hasn't moved on much.

Fedora, for all its faults, does actually provide an object view that is
meaningful - something that can be a node in a linked-data graph. It can be
arbitrarily complex but equally, could comprise only metadata. It is almost
never a file.

Neil

On 2017-05-12 20:29, Jacob Farmer wrote:
> Hi, Neil.  Great points.  Indeed, hard links only work in a single
> file system, but they continue pointing to and fro when a file is
> otherwise moved or renamed.
>
> I personally think of POSIX file systems as object stores that have
> weak addressing, limited metadata, and that offer mutability as the
> default.
>
> My preferred definition of an object store is a device that stores
> objects.
> My preferred definition of an object is any piece of data that can be
> individually addressed and manipulated.
> So, by that definition, POSIX file systems are object stores, so are
> hard drives.  So is Microsoft Exchange, etc.
>
> If you name a file according to a hash or a UUID (the hash could be
> the UUID), then you have a form of persistent address.  As long as no
> one messes with your file system, the address scheme stays intact.
>
>
> -----Original Message-----
> From: Neil Jefferies [mailto:neil at jefferies.org]
> Sent: Friday, May 12, 2017 11:25 AM
> To: Jacob Farmer
> Subject: RE: [Pasig-discuss] WORM (Write Once Read Many) AIPs
>
> Good point on the housekeeping!
>
> Most (reasonable) filesystems allow you to specify the number of inodes
> at creation but yes, it is hard to change afterwards!
>
> But I would really, really avoid hard links - they only work within a
> single filesystem so they can't be used in tiered or virtual storage
> systems and even break quota controls on regular filesystems. Scale up
> thus becomes very difficult with hard links. Symlinks also make it
> explicit when you are dealing with a reference and can tell you which
> version of the object held the original - useful provenance that hard
> links don't capture.
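
A minimal Python sketch of the provenance point above, assuming a POSIX
filesystem. The function name describe_entry and its return shape are
illustrative only. It shows what the filesystem alone can report about a
directory entry: a symlink names its target, whereas a hard link only
reveals that other names exist somewhere.

```python
import os


def describe_entry(path):
    """Report what the filesystem itself can tell us about a directory entry."""
    if os.path.islink(path):
        # A symlink records where the original lives, so the reference is
        # explicit and the version folder holding the original is visible.
        return {"kind": "symlink", "target": os.readlink(path)}
    info = os.lstat(path)
    # A hard link is just another name for the same inode: the link count
    # says other names exist, but not where they are or which came first.
    return {"kind": "file", "inode": info.st_ino, "link_count": info.st_nlink}
```

Calling describe_entry on a symlinked file in a later version folder
returns the relative target path; calling it on a hard-linked file
returns only an inode number and a link count.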
>
> My personal feeling is no for hashes, yes for UUIDs (or another
> suitably unique object ID). This allows us to keep all versions of an
> object in the same root path even though its content (and hence its
> hash) varies. And don't store at a file level - this shotguns object
> fragments all over the store and makes rebuilds horrible.
> Many current object stores do this - and consequently don't version
> effectively - I wish people would understand objects are not files.
> UUIDs are also consistent in terms of computational time and hashes
> very much aren't.
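
A rough Python sketch of a UUID-based root path, in the spirit of the
pairtree layout mentioned elsewhere in this thread. The function names,
the three-level depth and the "v1"/"v2" version names are assumptions
made for illustration, not a description of any particular store.

```python
import uuid
from pathlib import Path


def object_root(store_root, object_id, pair_levels=3):
    """Map an object UUID to one stable folder for the whole object."""
    hex_id = uuid.UUID(str(object_id)).hex  # 32 lowercase hex characters
    # Split the leading characters into two-character directory levels
    # (pair_levels is an assumed depth, chosen only for this example).
    pairs = [hex_id[i:i + 2] for i in range(0, 2 * pair_levels, 2)]
    return Path(store_root).joinpath(*pairs, hex_id)


def version_dir(store_root, object_id, version):
    """Every version lives under the same object root, whatever its content."""
    return object_root(store_root, object_id) / version
```

Because the path depends only on the identifier, version_dir(root, oid,
"v1") and version_dir(root, oid, "v2") share a parent folder even when
the files inside differ, which a content hash cannot offer.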
>
> There's a big difference in robustness between needing just filesystem
> metadata to find an object in storage and requiring filesystem
> metadata (because underneath all object stores are filesystems - even
> Seagate's "object" hard drives), object store metadata to map paths to
> hashes, and object metadata to find all the bits that make up a
> composite object.
>
> ...and yes, I am saying that most object store vendors have got it
> wrong. At least as far as archiving is concerned. And they ought to
> consider why every object store ends up presenting itself as a POSIX
> filesystem.
>
> Neil
>
>
> On 2017-05-12 14:33, Jacob Farmer wrote:
>> Two warnings and two suggestions:
>>
>> Warnings:
>>
>> 1)  Symlinks and Housekeeping -- It is a common practice to use
>> symlinks to make versioned file collections.  If you do this, you
>> should have some kind of housekeeping processes that ensure that the
>> symlinks are all working correctly.  If files ever have to get
>> migrated, symlinks can break.
>>
>> 2)  Check with your file system vendor -- Most removable media file
>> systems have some built in limitations on the number of inodes
>> (files) that you can have in one file system.  If you generate a lot
>> of symlinks, you might overwhelm the file system.  Your vendor will know.
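
The housekeeping check in warning 1 could look something like the
following Python sketch. The function name find_broken_symlinks is
illustrative, and a real process would also need to decide what to do
with the links it finds.

```python
import os
from pathlib import Path


def find_broken_symlinks(root):
    """Return symlinks under root whose targets no longer exist."""
    broken = []
    for dirpath, dirnames, filenames in os.walk(root):
        # os.walk does not descend into symlinked directories by default,
        # but it still lists them in dirnames, so check both lists.
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            # exists() follows the link, so it is False for a dangling one.
            if path.is_symlink() and not path.exists():
                broken.append(path)
    return sorted(broken)
```

Running this after any migration gives a list of links to repair before
a versioned collection is trusted again.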
>>
>> Suggestions:
>>
>> 1)  Hashes for file names -- If your application software maintains a
>> hash for each file, you might consider naming the file according to
>> the hash.
>> Use the first two digits for the parent directory, the next two
>> digits for a sub-directory, and the next two digits for a further
>> sub-directory.  Then use the full hash for the file name.  This turns
>> your POSIX file system into an object store with uniquely named objects.
>>
>> 	As a safeguard, you might maintain a separate table or list that
>> associates path names with hashes.
>>
>> 2)  Consider using hard links instead of symlinks -- You might use
>> hard links instead of symlinks, presuming that the files are all in
>> the same file system.  You still have to watch for file count issues,
>> but you have less housekeeping to do.
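
Suggestion 1 can be sketched in Python as below. The choice of SHA-256,
the function names and the in-memory dict used as the "separate table"
are all assumptions for illustration; the layout itself (three
two-character directory levels, then the full hash as the file name)
follows the description above.

```python
import hashlib
from pathlib import Path


def hashed_path(store_root, data, algorithm="sha256"):
    """Build root/ab/cd/ef/<full hash> from the content's hex digest."""
    digest = hashlib.new(algorithm, data).hexdigest()
    return Path(store_root) / digest[0:2] / digest[2:4] / digest[4:6] / digest


def store(store_root, original_name, data, manifest):
    """Write data under its hash and record the original path name."""
    target = hashed_path(store_root, data)
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        # Identical content always maps to the same path, so write once.
        target.write_bytes(data)
    # The safeguard table: original path name -> hash (a dict here).
    manifest[original_name] = target.name
    return target
```

In practice the manifest would be persisted somewhere safer than a
Python dict, for example as a text file stored alongside the data.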
>>
>> I hope that helps.
>>
>>
>> Jacob Farmer  |  Chief Technology Officer  |  Cambridge Computer  |
>> "Artists In Data Storage"
>> Phone 781-250-3210  |  jfarmer at CambridgeComputer.com  |
>> www.CambridgeComputer.com
>>
>>
>>
>>
>> -----Original Message-----
>> From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf
>> Of Neil Jefferies
>> Sent: Friday, May 12, 2017 8:06 AM
>> To: Tim.Gollins at nrscotland.gov.uk
>> Cc: pasig-discuss at mail.asis.org
>> Subject: Re: [Pasig-discuss] WORM (Write Once Read Many) AIPs
>>
>> Tim,
>>
>> If we store AIPs unpackaged, as a collection of files in a folder,
>> then object updates could just be a new folder with symlinks to the
>> unchanged parts and the updated parts in place in the folder. The
>> object "location" would be a parent folder for all these version
>> folders - for example, a pairtree (or triple-tree for faster
>> scanning/rebuilds) based on object UUID.
>> Version folders would be named according to date or version number
>> (date might make Memento-compliant access simpler).
>> Creating a new version clones the current version (including links)
>> with a new name and then replaces the updated parts in situ. The final
>> act is to update a "current" symlink in the object. Any update
>> failure will mean "current" is not updated and the partial clone can
>> be discarded.
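
The update sequence described above might be sketched in Python as
follows, assuming a POSIX filesystem and flat version folders with no
sub-folders. The function name create_version, the "current.tmp" helper
link and the shape of the updates argument are hypothetical.

```python
import os
import shutil
from pathlib import Path


def create_version(object_dir, new_name, updates):
    """Clone the current version as symlinks, apply updates, switch 'current'.

    updates maps a file name to the new bytes for that file.
    """
    object_dir = Path(object_dir)
    current = object_dir / "current"
    old_name = os.readlink(current)  # name of the current version folder
    old_dir = object_dir / old_name
    new_dir = object_dir / new_name
    new_dir.mkdir()
    try:
        for entry in old_dir.iterdir():
            link = new_dir / entry.name
            if entry.is_symlink():
                # Copy the link itself so it still points at the version
                # folder that holds the original file.
                os.symlink(os.readlink(entry), link)
            else:
                os.symlink(os.path.join("..", old_name, entry.name), link)
        for name, data in updates.items():
            target = new_dir / name
            if target.is_symlink():
                # Remove the link first; writing through it would alter
                # the previous version.
                target.unlink()
            target.write_bytes(data)
        # Final act: replace "current" in one rename so readers never see
        # a half-updated object.
        tmp = object_dir / "current.tmp"
        os.symlink(new_name, tmp)
        os.replace(tmp, current)
    except BaseException:
        # "current" was not updated, so the partial clone is discarded.
        shutil.rmtree(new_dir, ignore_errors=True)
        raise
    return new_dir
```

Previous version folders are never written to, so each one remains an
immutable record of the object at that point.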
>>
>> This assumes most updates are metadata and that a diff won't save
>> much compared to a complete new XML file or whatever. I am also
>> assuming that metadata won't be wrappered either (so you can forget
>> METS) so that different types are stored in the most suitable format
>> and are accessed only when required. The problems with roundtripping
>> packaged AIPs for updates rather than diff-ing are repeated by METS
>> wrappering.
>>
>> This may be a virtual folder/filesystem presentation, and underneath
>> an HSM would retrieve files from wherever when they are actually
>> accessed. HSM policy in something like SAM-QFS/Versity/Cray TAS can
>> ensure folders are kept intact when moved to other storage (we could
>> even dereference symlinks when dealing with tape).
>>
>> This can be done with a POSIX filesystem and not much code - Ben
>> O'Steen started something along these lines here:
>> https://github.com/dataflow/RDFDatabank/wiki/What-is-DataBank-and-what-does-it-do%3F
>>
>> Fedora is also a versioning object store that could support this
>> kind of model but also adds a fair bit of complexity to be
>> Linked Data Platform compliant.
>>
>> In my parlance I would probably equate "Minimal Ingest" with "Sheer
>> Curation" and APT with Asynchronous Message Driven Workers.
>>
>> Neil
>>
>>
>> On 2017-05-12 12:33, Tim.Gollins at nrscotland.gov.uk wrote:
>>> Dear PASIG
>>>
>>> I have been thinking recently about the challenge of managing
>>> "physical"  AIPs on offline or near line storage and how to optimise
>>> or simplify the use of managed storage media in a tape based
>>> (robotic) Hierarchical Storage Management (HSM) system. By "physical"
>>> AIPs I mean that the actual structure of the AIP written to the
>>> storage system is sufficiently self-describing that even if the
>>> management or other elements of a DP system were to be lost to a
>>> disaster then the entire collection could be fully re-instated
>>> reliably from the stored AIPs alone.
>>>
>>> I have also been thinking about the huge benefits of adopting the
>>> concepts of "Minimal Ingest" (MI) and "Autonomous Preservation Tools"
>>> (APT) in a new Digital Archive solution.
>>>
>>> One of the potential effects of the MI and APT concepts is that over
>>> time it is clear that while (of course) the original bit streams
>>> will never need to be updated, the metadata packaged in the AIP will
>>> need to change relatively often (through the life of the AIP). This
>>> is of course in addition to any new renderings of the bit streams
>>> produced for preservation purposes (manifestations as termed in some
>>> systems).
>>>
>>> If updating the AIP involves the AIP being "loaded", "modified" and
>>> "stored" again as a whole, then this will result in
>>> significant "churn" of the offline or near line media (i.e. tapes)
>>> in a HSM - which I would like to avoid. I think it would be really
>>> great if the AIP representation could accommodate the concept of an
>>> "update IP" (perhaps UIP?) where the UIP contains a "delta" of the
>>> original AIP - the full AIP then being interpreted as the original
>>> as modified by a series of deltas. This would then effectively
>>> result in AIPs (and UIPs) becoming WORM objects, with clear benefits
>>> that I perceive in managing their reliable and safe storage.
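
As an illustration of the "original as modified by a series of deltas"
idea, here is a minimal Python sketch. The delta format (a dict per UIP,
with None marking a removed component) and the name resolve_aip are
purely hypothetical; no existing AIP specification is implied.

```python
def resolve_aip(original, deltas):
    """Rebuild the effective AIP view from a write-once original and its UIPs.

    original: dict mapping a component name to its content.
    deltas:   list of dicts, oldest first; a value of None marks a removal.
    """
    view = dict(original)  # copy, so the stored original is never mutated
    for delta in deltas:
        for name, content in delta.items():
            if content is None:
                view.pop(name, None)
            else:
                view[name] = content
    return view
```

Nothing already written is ever modified: the original AIP and each UIP
are stored once, and the current state is computed on read.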
>>>
>>> I am not sufficiently familiar with the detail of all the different
>>> AIP models or implementations, so I was wondering if anyone in the
>>> team would be able to comment on whether they know of any AIP models,
>>> specifications or implementations that would support such a use
>>> case.
>>>
>>> I have just posted a version of this question to the E-Ark LinkedIn
>>> Group so my apologies to those who see it twice.
>>>
>>> Many thanks
>>>
>>> Tim
>>> Tim Gollins | Head of Digital Archiving and Director of the NRS
>>> Digital Preservation Programme National Records of Scotland | West
>>> Register House | Edinburgh EH2 4DF
>>> + 44 (0)131 535 1431 / + 44 (0)7974 922614 |
>>> tim.gollins at nrscotland.gov.uk | www.nrscotland.gov.uk
>>>
>>> Preserving the past | Recording the present | Informing the future
>>> Follow us on Twitter: @NatRecordsScot |
>>> http://twitter.com/NatRecordsScot
>>>
>>>
>>>
>>>
>>>
>>> ----
>>> To subscribe, unsubscribe, or modify your subscription, please visit
>>> http://mail.asis.org/mailman/listinfo/pasig-discuss
>>> _______
>>> PASIG Webinars and conference material is at
>>> http://www.preservationandarchivingsig.org/index.html
>>> _______________________________________________
>>> Pasig-discuss mailing list
>>> Pasig-discuss at mail.asis.org
>>> http://mail.asis.org/mailman/listinfo/pasig-discuss
>>


