[Pasig-discuss] WORM (Write Once Read Many) AIPs

Neil Jefferies neil at jefferies.org
Fri May 12 16:42:55 EDT 2017


Jacob,

This is the key point of my argument - your definition of an object 
is not the definition of an object that an archive wants to preserve. 
I'm speaking for people like Tim and me - others are quite happy to build 
what I term bit-museums.

Likewise, what you consider preservation (immutability of a bitstream) 
is not quite the same as ours - retention of knowledge content - which 
requires mutability but with immutable previous versions and 
provenance/audit records.
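
A minimal sketch of that requirement, assuming a hypothetical in-memory model (ArchivalObject, Version and update are illustrative names, not the API of Fedora or of any object store): the object as a whole can change, but each earlier version stays read-only and every change leaves an audit record.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from types import MappingProxyType


@dataclass(frozen=True)
class Version:
    """One immutable snapshot of an object's content."""
    number: int
    content: MappingProxyType  # read-only view of this version's parts
    created: datetime


@dataclass
class ArchivalObject:
    """Hypothetical object: mutable as a whole, immutable per version."""
    object_id: str
    versions: list = field(default_factory=list)
    audit_log: list = field(default_factory=list)

    def update(self, content: dict, agent: str, reason: str) -> Version:
        # Append a new snapshot; earlier snapshots are never touched.
        now = datetime.now(timezone.utc)
        version = Version(len(self.versions) + 1, MappingProxyType(dict(content)), now)
        self.versions.append(version)
        # Provenance: who changed the object, when, and why.
        self.audit_log.append((version.number, agent, reason, now))
        return version

    @property
    def current(self) -> Version:
        return self.versions[-1]
```

A bitstream-only store covers the Version part of this; the version history and the audit log are the parts that have to be built on top.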

As long as this disconnect between technology and requirements remains 
the case, object stores are actually of limited use for us in 
preservation and archiving without considerable additional work. The 
'metadata' that most object stores support (key-value pairs) is pretty 
useless as far as our metadata requirements go - in the end we have to 
store XML or triples as separate files/objects. This was an issue when I 
reviewed the StorageTek 5800 code builds way back and frankly object 
storage hasn't moved on much.

Fedora, for all its faults, does actually provide an object view that is 
meaningful - something that can be a node in a linked-data graph. It can 
be arbitrarily complex but equally, could comprise only metadata. It is 
almost never a file.

Neil

On 2017-05-12 20:29, Jacob Farmer wrote:
> Hi, Neil.  Great points.  Indeed, hard links only work in a single file
> system, but they continue pointing to and fro when a file is otherwise
> moved or renamed.
> 
> I personally think of POSIX file systems as object stores that have weak
> addressing, limited metadata, and that offer mutability as the default.
> 
> My preferred definition of an object store is a device that stores objects.
> My preferred definition of an object is any piece of data that can be
> individually addressed and manipulated.
> So, by that definition, POSIX file systems are object stores, and so are
> hard drives.  So is Microsoft Exchange, etc.
> 
> If you name a file according to a hash or a UUID (the hash could be the
> UUID), then you have a form of persistent address.  As long as no one
> messes with your file system, the address scheme stays intact.
> 
> 
> -----Original Message-----
> From: Neil Jefferies [mailto:neil at jefferies.org]
> Sent: Friday, May 12, 2017 11:25 AM
> To: Jacob Farmer
> Subject: RE: [Pasig-discuss] WORM (Write Once Read Many) AIPs
> 
> Good point on the housekeeping!
> 
> Most (reasonable) filesystems allow you to specify the number of inodes at
> creation but yes, it is hard to change afterwards!
> 
> But I would really, really avoid hard links - they only work within a
> single filesystem, so they can't be used in tiered or virtual storage
> systems, and they even break quota controls on regular filesystems.
> Scaling up thus becomes very difficult with hard links. Symlinks also make
> it explicit when you are dealing with a reference and can tell you which
> version of the object held the original - useful provenance that hard
> links don't capture.
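
The provenance difference can be seen directly from the filesystem. A small sketch, assuming a POSIX system; describe_link and the v1/v2 folder names are hypothetical, chosen only for illustration:

```python
import os
import tempfile


def describe_link(path: str) -> str:
    """Report what the filesystem itself can tell us about a path."""
    if os.path.islink(path):
        # A symlink stores its target, so the reference is self-describing.
        return f"symlink -> {os.readlink(path)}"
    # A hard link is just another name for the inode; no origin is recorded.
    return f"regular file, link count {os.stat(path).st_nlink}"


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "v1"))
    os.makedirs(os.path.join(root, "v2"))
    original = os.path.join(root, "v1", "data.bin")
    with open(original, "wb") as handle:
        handle.write(b"payload")

    soft = os.path.join(root, "v2", "data.bin")
    os.symlink(os.path.join("..", "v1", "data.bin"), soft)  # relative target
    hard = os.path.join(root, "v2", "data-hard.bin")
    os.link(original, hard)  # only possible within one filesystem

    print(describe_link(soft))
    print(describe_link(hard))
```

The symlink reports the version folder that holds the original; the hard link only reports that the inode has more than one name.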
> 
> My personal feeling is no for hashes, yes for UUIDs (or another suitably
> unique object ID). This allows us to keep all versions of an object in the
> same root path even though its content varies. And don't store at a file
> level - this shotguns object fragments all over the store and makes
> rebuilds horrible. Many current object stores do this - and consequently
> don't version effectively - I wish people would understand that objects
> are not files. UUIDs are also consistent in terms of computational time,
> and hashes very much aren't.
> 
> There's a big difference in robustness between needing just filesystem
> metadata to find an object in storage and requiring filesystem metadata
> (because underneath all object stores are filesystems - even Seagate's
> "object" hard drives), object store metadata to map paths to hashes, and
> object metadata to find all the bits that make up a composite object.
> 
> ...and yes, I am saying that most object store vendors have got it wrong,
> at least as far as archiving is concerned. And they ought to consider why
> every object store ends up presenting itself as a POSIX filesystem.
> 
> Neil
> 
> 
> On 2017-05-12 14:33, Jacob Farmer wrote:
>> Two warnings and two suggestions:
>> 
>> Warnings:
>> 
>> 1)  Symlinks and Housekeeping -- It is a common practice to use
>> symlinks to make versioned file collections.  If you do this, you
>> should have some kind of housekeeping processes that ensure that the
>> symlinks are all working correctly.  If files ever have to get
>> migrated, symlinks can break.
>> 
>> 2)  Check with your file system vendor -- Most removable media file
>> systems have some built in limitations on the number of inodes (files)
>> that you can have in one file system.  If you generate a lot of
>> symlinks, you might overwhelm the file system.  Your vendor will know.
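
The housekeeping process in warning 1 could be as simple as a periodic scan for dangling links. A minimal sketch, assuming a POSIX tree rooted at a single directory; broken_symlinks is a hypothetical helper name:

```python
import os


def broken_symlinks(root: str) -> list[str]:
    """Return symlinks under root whose targets no longer exist."""
    broken = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # os.path.exists follows the link, so it is False for a dangling one.
            if os.path.islink(path) and not os.path.exists(path):
                broken.append(path)
    return sorted(broken)
```

Running this after any migration gives a list of links to repair before they are needed.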
>> 
>> Suggestions:
>> 
>> 1)  Hashes for file names -- If your application software maintains a
>> hash for each file, you might consider naming the file according to
>> the hash.  Use the first two digits for the parent directory, the next
>> two digits for a sub-directory, and the next two digits for a further
>> sub-directory.  Then use the full hash for the file name.  This turns
>> your POSIX file system into an object store with uniquely named objects.
>> 
>> 	As a safeguard, you might maintain a separate table or list that
>> associates path names with hashes.
>> 
>> 2)  Consider using hard links instead of symlinks -- You might use
>> hard links instead of symlinks, presuming that the files are all in
>> the same file system.  You still have to watch for file count issues,
>> but you have less housekeeping to do.
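
Suggestion 1 above can be sketched in a few lines. This assumes SHA-256 as the hash and "/store" as the root, neither of which is specified in the thread; hashed_path and record are hypothetical helper names:

```python
import hashlib
from pathlib import PurePosixPath


def hashed_path(store: str, data: bytes) -> PurePosixPath:
    """Build store/aa/bb/cc/<full hash> from the content's SHA-256 digest."""
    digest = hashlib.sha256(data).hexdigest()
    # Three levels of two-character directories, then the full hash as the name.
    return PurePosixPath(store, digest[0:2], digest[2:4], digest[4:6], digest)


def record(table: dict, original_path: str, data: bytes, store: str = "/store") -> None:
    """The safeguard from the text: remember which path name maps to which hash."""
    table[original_path] = hashed_path(store, data).name
```

Note that the whole file has to be read to compute its location, and that any change to the content changes the path.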
>> 
>> I hope that helps.
>> 
>> 
>> Jacob Farmer  |  Chief Technology Officer  |  Cambridge Computer  |
>> "Artists In Data Storage"
>> Phone 781-250-3210  |  jfarmer at CambridgeComputer.com  |
>> www.CambridgeComputer.com
>> 
>> 
>> 
>> 
>> -----Original Message-----
>> From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf
>> Of Neil Jefferies
>> Sent: Friday, May 12, 2017 8:06 AM
>> To: Tim.Gollins at nrscotland.gov.uk
>> Cc: pasig-discuss at mail.asis.org
>> Subject: Re: [Pasig-discuss] WORM (Write Once Read Many) AIPs
>> 
>> Tim,
>> 
>> If we store AIPs unpackaged, as a collection of files in a folder,
>> then object updates could just be a new folder with symlinks to the
>> unchanged parts and the updated parts in place in the folder. The
>> object "location" would be a parent folder for all these version
>> folders - for example, a pairtree (or triple-tree for faster
>> scanning/rebuilds) based on object UUID.
>> Version folders would be named according to date or version number
>> (date might make Memento-compliant access simpler).
>> Creating a new version clones the current version (including links) with
>> a new name and then replaces the updated parts in situ. The final act is
>> to update a "current" symlink in the object. Any update failure will
>> mean "current" is not updated and the partial clone can be discarded.
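
The update procedure described above might look roughly like this. It is a sketch under stated assumptions, not the DataBank or Fedora implementation: a POSIX filesystem, a flat version folder (no sub-folders), and a hypothetical function name new_version.

```python
import os
import shutil


def new_version(object_dir: str, name: str, updates: dict[str, bytes]) -> str:
    """Clone 'current' as symlinks, write updated parts, then flip 'current'."""
    object_dir = os.path.realpath(object_dir)
    current = os.path.join(object_dir, "current")
    target = os.path.join(object_dir, name)
    os.mkdir(target)
    try:
        if os.path.islink(current):
            previous = os.path.realpath(current)
            for part in os.listdir(previous):
                # Link to the version that actually holds the file, not to a link.
                origin = os.path.realpath(os.path.join(previous, part))
                os.symlink(os.path.relpath(origin, target), os.path.join(target, part))
        for part, data in updates.items():
            path = os.path.join(target, part)
            if os.path.islink(path):
                os.remove(path)  # drop the inherited link, then write in situ
            with open(path, "wb") as handle:
                handle.write(data)
        # Final act: repoint "current" by renaming a staged link over it.
        staged = os.path.join(object_dir, ".current.tmp")
        os.symlink(name, staged)
        os.replace(staged, current)
    except Exception:
        # "current" was not updated, so the partial clone can be discarded.
        shutil.rmtree(target, ignore_errors=True)
        raise
    return target
```

Because each link is resolved to the folder that holds the real file, a symlink in any version names the version where that part originated, and earlier version folders are never written to.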
>> 
>> This assumes most updates are metadata and that a diff won't save much
>> compared to a complete new XML file or whatever. I am also assuming
>> that metadata won't be wrapped either (so you can forget METS) so
>> that different types are stored in the most suitable format and are
>> accessed only when required. The problems with roundtripping packaged
>> AIPs for updates rather than diff-ing are repeated by METS
>> wrapping.
>> 
>> This may be a virtual folder/filesystem presentation, and underneath an
>> HSM would retrieve files from wherever when they are actually accessed.
>> HSM policy in something like SAM-QFS/Versity/Cray TAS can ensure
>> folders are kept intact when moved to other storage (we could even
>> dereference symlinks when dealing with tape).
>> 
>> This can be done with a POSIX filesystem and not much code - Ben
>> O'Steen started something along these lines here:
>> https://github.com/dataflow/RDFDatabank/wiki/What-is-DataBank-and-what-does-it-do%3F
>> 
>> Fedora is also a versioning object store that could support this
>> kind of model, but it also adds a fair bit of complexity to be
>> Linked Data Platform compliant.
>> 
>> In my parlance I would probably equate "Minimal Ingest" with "Sheer
>> Curation" and APT with Asynchronous Message Driven Workers.
>> 
>> Neil
>> 
>> 
>> On 2017-05-12 12:33, Tim.Gollins at nrscotland.gov.uk wrote:
>>> Dear PASIG
>>> 
>>> I have been thinking recently about the challenge of managing
>>> "physical"  AIPs on offline or near line storage and how to optimise
>>> or simplify the use of managed storage media in a tape based
>>> (robotic) Hierarchical Storage Management (HSM) system. By "physical"
>>> AIPs I mean that the actual structure of the AIP written to the
>>> storage system is sufficiently self-describing that even if the
>>> management or other elements of a DP system were to be lost to a
>>> disaster then the entire collection could be fully re-instated
>>> reliably from the stored AIPs alone.
>>> 
>>> I have also been thinking about the huge benefits of adopting the
>>> concepts of "Minimal Ingest" (MI) and "Autonomous Preservation Tools"
>>> (APT) in a new Digital Archive solution.
>>> 
>>> One of the potential effects of the MI and APT concepts is that over
>>> time it is clear that while (of course) the original bit streams will
>>> never need to be updated, the metadata packaged in the AIP will need
>>> to change relatively often (through the life of the AIP). This is of
>>> course in addition to any new renderings of the bit streams produced
>>> for preservation purposes (manifestations as termed in some systems).
>>> 
>>> If updating the AIP involves the AIP being "loaded", "modified" and
>>> "stored" again as a whole, then this will result in significant
>>> "churn" of the offline or near line media (i.e. tapes) in an HSM -
>>> which I would like to avoid. I think it would be really great if the
>>> AIP representation could accommodate the concept of an "update IP"
>>> (perhaps UIP?) where the UIP contains a "delta" of the original AIP -
>>> the full AIP then being interpreted as the original as modified by a
>>> series of deltas. This would then effectively result in AIPs (and
>>> UIPs) becoming WORM objects with clear benefits that I perceive in
>>> managing their reliable and safe storage.
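
The "original as modified by a series of deltas" reading can be illustrated with a small sketch. It assumes, purely for illustration, that AIP metadata is a flat key-value mapping and that each UIP is a mapping of changed keys; apply_delta, effective_aip and the REMOVED marker are hypothetical names, not part of any AIP specification.

```python
from functools import reduce

REMOVED = object()  # sentinel marking a metadata key deleted by a delta


def apply_delta(state: dict, delta: dict) -> dict:
    """Return a new state; neither the input state nor the delta is modified."""
    merged = {**state, **delta}
    return {key: value for key, value in merged.items() if value is not REMOVED}


def effective_aip(original: dict, deltas: list[dict]) -> dict:
    """Interpret the AIP as the original modified by each delta in order."""
    return reduce(apply_delta, deltas, original)
```

Since the original and every delta are only ever read, each can be written once to tape; the cost moves to read time, when the deltas have to be located and replayed in order.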
>>> 
>>> I am not sufficiently familiar with the detail of all the different
>>> AIP models or implementations, so I was wondering if anyone in the team
>>> would be able to comment on whether they know of any AIP models,
>>> specifications or implementations that would support such a use case.
>>> 
>>> I have just posted a version of this question to the E-ARK LinkedIn
>>> Group so my apologies to those who see it twice.
>>> 
>>> Many thanks
>>> 
>>> Tim
>>> Tim Gollins | Head of Digital Archiving and Director of the NRS
>>> Digital Preservation Programme National Records of Scotland | West
>>> Register House | Edinburgh EH2 4DF
>>> + 44 (0)131 535 1431 / + 44 (0)7974 922614 |
>>> tim.gollins at nrscotland.gov.uk | www.nrscotland.gov.uk
>>> 
>>> Preserving the past | Recording the present | Informing the future
>>> Follow us on Twitter: @NatRecordsScot |
>>> http://twitter.com/NatRecordsScot
>>> 
>>> 
>>> ----
>>> To subscribe, unsubscribe, or modify your subscription, please visit
>>> http://mail.asis.org/mailman/listinfo/pasig-discuss
>>> _______
>>> PASIG Webinars and conference material is at
>>> http://www.preservationandarchivingsig.org/index.html
>>> _______________________________________________
>>> Pasig-discuss mailing list
>>> Pasig-discuss at mail.asis.org
>>> http://mail.asis.org/mailman/listinfo/pasig-discuss
>> 


