[Pasig-discuss] WORM (Write Once Read Many) AIPs

Jonathan Tilbury jonathan.tilbury at preservica.com
Sun May 14 06:54:59 EDT 2017


Tim,

I have always thought of the "autonomous AIP" zipped up and held on a storage device as a residue of paper-thinking. When dealing with paper storage it is possible to bundle up the papers and some description and put them in a box on a shelf. If you need the artefact, you get the whole box. The paper is unlikely to be updated or changed during its lifetime. 

This really does not map well onto the digital world. There are lots of changes that result in the AIP being changed, for example changes in descriptive metadata, structure (parentage), security settings, technical metadata (during a re-characterisation) and audit trail. You may also add extra files to the AIP and, most importantly, generate new representations for access or digital masters following a migration. This makes the idea of a single immutable AIP redundant. 

Addressing this, we need to ask why we are worrying. I think you answered this well by saying that the content plus all of the metadata listed above must be accessible outside of whatever system you are using, so that the collection can be re-built should disaster happen or should you want to change system provider. To enable this you need all of the digital objects plus metadata (description, technical, security, structure, audit trail, fixity) to be held in a place and in a way that can be machine read. This does not imply physical zipped AIPs, just that the data is there and is understandable. 

Physical (zipped) AIPs are difficult to work with. Whenever you need to access a file you need to unpack the zip which is cumbersome and slow. This happens for download, rendering, and fixity checking. This overhead has no benefit and several risks. Also, it brings into question what fixity checking actually means when the storage container is being changed all the time. These problems become particularly acute when we have to address the large flat collections we are now seeing more of. 

I have always thought a better approach is to save the digital objects (files) in an object store (for example a file drive, tape store, cloud storage), and to make sure these never change using fixity validation. All of the metadata can be written to the object store as well, and either updated or new versions written as it is updated. These digital objects (files and metadata) can be stored in multiple locations in different technologies. 
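
As a rough illustration of the fixity validation described above, the sketch below records a digest for every file at ingest and later reports anything missing or altered. SHA-256 and the function names (sha256_of, build_manifest, verify_manifest) are assumptions made for the example; they are not taken from any particular product.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(store: Path) -> dict[str, str]:
    """Record a digest for every file currently in the store."""
    return {
        p.relative_to(store).as_posix(): sha256_of(p)
        for p in sorted(store.rglob("*"))
        if p.is_file()
    }


def verify_manifest(store: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or no longer match."""
    failures = []
    for rel_path, expected in manifest.items():
        target = store / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(rel_path)
    return failures
```

The manifest itself would need to be held somewhere other than the store it describes, otherwise whoever alters a file can alter the manifest too.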

In Preservica we support both approaches through the range of storage adapters we include. Each has its own way of renaming the digital objects, but the use of objects with a UUID naming convention is preferred. We strongly recommend against the use of physical AIPs. All of the objects, once stored, can then be checked for fixity on a rotating basis or when accessed. By storing to multiple storage adapters you can even self-heal if someone does mess with your file system.
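
To make the self-healing idea concrete, here is a minimal sketch that assumes each "storage adapter" is simply a directory and that objects are named by a random UUID. The functions store_object and heal are hypothetical and do not reflect Preservica's actual adapter interface.

```python
import hashlib
import shutil
import uuid
from pathlib import Path


def digest(path: Path) -> str:
    """Hex SHA-256 of a file (reads it whole, so only suitable for small files)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def store_object(data: bytes, replicas: list[Path]) -> tuple[str, str]:
    """Write data under a fresh UUID name in every replica; return (name, digest)."""
    name = str(uuid.uuid4())
    for root in replicas:
        root.mkdir(parents=True, exist_ok=True)
        (root / name).write_bytes(data)
    return name, hashlib.sha256(data).hexdigest()


def heal(name: str, expected: str, replicas: list[Path]) -> list[Path]:
    """Overwrite damaged or missing copies from a copy that still matches.

    Returns the replica roots that were repaired; raises if no copy is intact.
    """
    good = [r for r in replicas
            if (r / name).is_file() and digest(r / name) == expected]
    if not good:
        raise RuntimeError(f"no intact copy of {name} in any replica")
    repaired = []
    for root in replicas:
        if root not in good:
            root.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(good[0] / name, root / name)
            repaired.append(root)
    return repaired
```

The expected digest has to come from a trusted record made at ingest; comparing replicas only with each other would not tell you which copy is the damaged one.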

As for exiting the system, we allow cloud edition users to replicate all of the content plus metadata to a remote store using SFTP in such a way that the physical directory structure mimics the logical collection structure. If they want to leave they have all the content safe in a place of their choosing.

I would be very interested in people's comments on whether we should still support Physical (zipped) AIPs. 

Jon

=============
Jon Tilbury
CTO, Preservica
=============


-----Original Message-----
From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Neil Jefferies
Sent: Friday, May 12, 2017 4:43 PM
To: Jacob Farmer <jfarmer at cambridgecomputer.com>
Subject: Re: [Pasig-discuss] WORM (Write Once Read Many) AIPs

Jacob,

This is the key point of my argument - the definition of object you have is not the definition of an object that an archive wants to preserve.
I'm speaking for people like Tim and me - others are quite happy to build what I term bit-museums.

Likewise, what you consider preservation (immutability of a bitstream) is not quite the same as ours - retention of knowledge content - which requires mutability but with immutable previous versions and provenance/audit records.

As long as this disconnect between technology and requirements remains the case, object stores are actually of limited use for us in preservation and archiving without considerable additional work. The 'metadata' that most object stores support (key-value pairs) is pretty useless as far as our metadata requirements go - in the end we have to store XML or triples as separate files/objects. This was an issue when I reviewed the StorageTek 5800 code builds way back and frankly object storage hasn't moved on much.

Fedora, for all its faults, does actually provide an object view that is meaningful - something that can be a node in a linked-data graph. It can be arbitrarily complex but equally, could comprise only metadata. It is almost never a file.

Neil

On 2017-05-12 20:29, Jacob Farmer wrote:
> Hi, Neil.  Great points.  Indeed, hard links only work in a single 
> file system, but they continue pointing to and fro when a file is 
> otherwise moved or renamed.
>
> I personally think of POSIX file systems as object stores that have 
> weak addressing, limited metadata, and that offer mutability as the 
> default.
>
> My preferred definition of an object store is a device that stores 
> objects.
> My preferred definition of an object is any piece of data that can be 
> individually addressed and manipulated.
> So, by that definition, POSIX file systems are object stores, so are 
> hard drives.  So is Microsoft exchange, etc.
>
> If you name a file according to a hash or a UUID (the hash could be 
> the UUID), then you have a form of persistent address.  As long as no 
> one messes with your file system, the address scheme stays intact.
>
>
> -----Original Message-----
> From: Neil Jefferies [mailto:neil at jefferies.org]
> Sent: Friday, May 12, 2017 11:25 AM
> To: Jacob Farmer
> Subject: RE: [Pasig-discuss] WORM (Write Once Read Many) AIPs
>
> Good point on the housekeeping!
>
> Most (reasonable) filesystems allow you to specify the number of 
> inodes at creation but yes, it is hard to change afterwards!
>
> But I would really, really avoid hard links - they only work within a 
> single filesystem so they can't be used in tiered or virtual storage 
> systems and even break quota controls on regular filesystems. Scale up 
> thus becomes very difficult with hard links. Symlinks also make it 
> explicit when you are dealing with a reference and can tell you which 
> version of the object held the original - useful provenance that hard 
> links don't capture.
>
> My personal feeling is no for hashes, yes for UUIDs (or other 
> suitably unique object ID). This allows us to keep all versions of an 
> object in the same root path even though it varies. And don't store at 
> a file level - this shotguns object fragments all over the store and 
> makes rebuilds horrible.
> Many current object stores do this - and consequently don't version 
> effectively - I wish people would understand objects are not files.
> UUIDs are also consistent in terms of computational time and hashes 
> very much aren't.
>
> There's a big difference in robustness between needing just filesystem 
> metadata to find an object in storage and requiring filesystem 
> metadata (because underneath all object stores are filesystems - even 
> Seagate's "object" hard drives), object store metadata to map paths to 
> hashes, and object metadata to find all the bits that make up a 
> composite object.
>
> ...and yes, I am saying that most object store vendors have got it 
> wrong. At least as far as archiving is concerned. And they ought to 
> consider why every object store ends up presenting itself as a POSIX 
> filesystem.
>
> Neil
>
>
> On 2017-05-12 14:33, Jacob Farmer wrote:
>> Two warnings and two suggestions:
>>
>> Warnings:
>>
>> 1)  Symlinks and Housekeeping -- It is a common practice to use 
>> symlinks to make versioned file collections.  If you do this, you 
>> should have some kind of housekeeping processes that ensure that the 
>> symlinks are all working correctly.  If files ever have to get 
>> migrated, symlinks can break.
>>
>> 2)  Check with your file system vendor -- Most removable media file 
>> systems have some built in limitations on the number of inodes
>> (files) that you can have in one file system.  If you generate a lot 
>> of symlinks, you might overwhelm the file system.  Your vendor will know.
>>
>> Suggestions:
>>
>> 1)  Hashes for file names -- If your application software maintains a 
>> hash for each file, you might consider naming the file according to 
>> the hash.
>> Use the first two digits for the parent directory, the next two 
>> digits for sub-directory, the next two digits for sub-directory.  Then 
>> use the full hash for the file name.  This turns your POSIX file 
>> system into an object store with uniquely named objects.
>>
>> 	As a safeguard, you might maintain a separate table or list that 
>> associates path names with hashes.
>>
>> 2)  Consider using hard links instead of symlinks -- You might use 
>> hard links instead of symlinks, presuming that the files are all in 
>> the same file system.  You still have to watch for file count issues, 
>> but you have less housekeeping to do.
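
A small sketch of the hash-based naming in suggestion 1, including the safeguard table that maps original path names to hashes. The choice of SHA-256 and the names sharded_path, ingest and index are assumptions for illustration; the suggestion itself does not name an algorithm.

```python
import hashlib
import shutil
from pathlib import Path


def sharded_path(root: Path, hex_digest: str) -> Path:
    """Map a hex digest to root/ab/cd/ef/<full digest>."""
    return (root / hex_digest[0:2] / hex_digest[2:4]
            / hex_digest[4:6] / hex_digest)


def ingest(source: Path, root: Path, index: dict[str, str]) -> Path:
    """Copy source into the store under its hash and record path -> hash."""
    hex_digest = hashlib.sha256(source.read_bytes()).hexdigest()
    target = sharded_path(root, hex_digest)
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():  # identical content is stored only once
        shutil.copyfile(source, target)
    index[str(source)] = hex_digest  # the separate table of path names
    return target
```

Two hex digits per level gives at most 256 entries in each directory, which keeps directory listings short however many files are stored.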
>>
>> I hope that helps.
>>
>>
>> Jacob Farmer  |  Chief Technology Officer  |  Cambridge Computer  | 
>> "Artists In Data Storage"
>> Phone 781-250-3210  |  jfarmer at CambridgeComputer.com  | 
>> www.CambridgeComputer.com
>>
>>
>>
>>
>> -----Original Message-----
>> From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf 
>> Of Neil Jefferies
>> Sent: Friday, May 12, 2017 8:06 AM
>> To: Tim.Gollins at nrscotland.gov.uk
>> Cc: pasig-discuss at mail.asis.org
>> Subject: Re: [Pasig-discuss] WORM (Write Once Read Many) AIPs
>>
>> Tim,
>>
>> If we store AIPs unpackaged, as a collection of files in a folder, 
>> then object updates could just be a new folder with symlinks to the 
>> unchanged parts and the updated parts in place in the folder. The 
>> object "location" would be a parent folder for all these version 
>> folders - for example, a pairtree (or triple-tree for faster 
>> scanning/rebuilds) based on object UUID.
>> Version folders would be named according to date or version number 
>> (date might make Memento compliant access simpler).
>> Creating a new version clones the current version (including links) 
>> with a new name and then replaces the updated parts in situ. The final 
>> act is to update a "current" symlink in the object. Any update 
>> failure will mean "current" is not updated and the partial clone can 
>> be discarded.
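
The versioning scheme above could be sketched roughly as follows, assuming a POSIX filesystem, a flat version folder (no sub-folders) and updates supplied as bytes. The function new_version and the temporary link name "current.tmp" are hypothetical names chosen for the example.

```python
import os
import shutil
from pathlib import Path


def new_version(obj: Path, name: str, updates: dict[str, bytes]) -> Path:
    """Create version folder `name` holding `updates`, linking everything else."""
    obj = obj.resolve()  # work with real paths so relative links stay valid
    current = obj / "current"
    staging = obj / name
    tmp = obj / "current.tmp"
    staging.mkdir()
    try:
        if current.exists():
            for entry in current.resolve().iterdir():
                if entry.name in updates:
                    continue
                # Link to the version holding the real file, so links never
                # chain and each one records where the original lives.
                original = entry.resolve()
                os.symlink(os.path.relpath(original, staging),
                           staging / entry.name)
        for filename, data in updates.items():
            (staging / filename).write_bytes(data)
        # Final act: repoint "current" in one atomic rename.
        os.symlink(name, tmp)
        os.replace(tmp, current)
    except Exception:
        # "current" was not touched, so the partial clone is simply discarded.
        tmp.unlink(missing_ok=True)
        shutil.rmtree(staging, ignore_errors=True)
        raise
    return staging
```

Because the links are relative, the whole object folder can be moved or copied to other storage without breaking them, provided the version folders stay together.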
>>
>> This assumes most updates are metadata and that a diff won't save 
>> much compared to a complete new XML file or whatever. I am also 
>> assuming that metadata won't be wrappered either (so you can forget
>> METS) so that different types are stored in the most suitable format 
>> and are accessed only when required. The problems with roundtripping 
>> packaged AIP's for updates rather than diff-ing are repeated by METS 
>> wrappering.
>>
>> This may be a virtual folder/filesystem presentation, and underneath 
>> an HSM would retrieve files from wherever when they are actually 
>> accessed. HSM policy in something like SAM-QFS/Versity/Cray TAS can 
>> ensure folders are kept intact when moved to other storage (we could 
>> even dereference symlinks when dealing with tape).
>>
>> This can be done with a POSIX filesystem and not much code - Ben 
>> O'Steen started something along these lines here:
>> https://github.com/dataflow/RDFDatabank/wiki/What-is-DataBank-and-what-does-it-do%3F
>>
>> Fedora is also a versioning object store that could support this 
>> kind of model but also adds a fair bit of complexity to be 
>> Linked Data Platform compliant.
>>
>> In my parlance I would probably equate "Minimal Ingest" with "Sheer 
>> Curation" and APT with Asynchronous Message Driven Workers.
>>
>> Neil
>>
>>
>> On 2017-05-12 12:33, Tim.Gollins at nrscotland.gov.uk wrote:
>>> Dear PASIG
>>>
>>> I have been thinking recently about the challenge of managing 
>>> "physical"  AIPs on offline or near line storage and how to optimise 
>>> or simplify the use of managed storage media in a tape based
>>> (robotic) Hierarchical Storage Management (HSM) system. By "physical"
>>> AIPs I mean that the actual structure of the AIP written to the 
>>> storage system is sufficiently self-describing that even if the 
>>> management or other elements of a DP system were to be lost to a 
>>> disaster then the entire collection could be fully re-instated 
>>> reliably from the stored AIPs alone.
>>>
>>> I have also been thinking about the huge benefits of adopting the 
>>> concepts of "Minimal Ingest" (MI) and "Autonomous Preservation Tools"
>>> (APT) in a new Digital Archive solution.
>>>
>>> One of the potential effects of the MI and APT concepts is that over 
>>> time it is clear that while (of course) the original bit streams 
>>> will never need to be updated, the metadata packaged in the AIP will 
>>> need to change relatively often (through the life of the AIP) . This 
>>> is of course in addition to any new renderings of the bit streams 
>>> produced for preservation purposes (manifestations as termed in some 
>>> systems).
>>>
>>> If to update the AIP the process involves the AIP being "loaded" and 
>>> "Modified" and "Stored" again as a whole then this will result in 
>>> significant "churn" of the offline or near line media (i.e. tapes) 
>>> in a HSM - which I would like to avoid. I think it would be really 
>>> great if the AIP representation could accommodate the concept of an 
>>> "update IP" (perhaps UIP?) where the UIP contains a "delta" of the 
>>> original AIP - the full AIP then being interpreted as the original 
>>> as modified by a series of deltas. This would then effectively 
>>> result in AIPs (and
>>> UIPs) becoming WORM objects with clear benefits that I perceive in 
>>> managing their reliable and safe storage.
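
The "original as modified by a series of deltas" interpretation can be illustrated with a short sketch. It assumes an AIP can be modelled as a mapping from component name to a content reference; the UpdateIP class, its fields and resolve_aip are hypothetical and do not correspond to any existing AIP specification.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class UpdateIP:
    """A write-once delta: components to add or replace, and components to drop."""
    sequence: int
    put: dict[str, str] = field(default_factory=dict)
    remove: frozenset[str] = frozenset()


def resolve_aip(original: dict[str, str],
                updates: list[UpdateIP]) -> dict[str, str]:
    """Replay deltas in sequence order over the original without mutating it."""
    state = dict(original)
    for uip in sorted(updates, key=lambda u: u.sequence):
        for name in uip.remove:
            state.pop(name, None)
        state.update(uip.put)
    return state
```

Neither the original nor any delta is ever rewritten, so each can sit on tape as a WORM object; only the replay step needs to read them together.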
>>>
>>> I am not sufficiently familiar with the detail of all the different 
>>> AIP models or implementations, so I was wondering if anyone in the 
>>> team would be able to comment on whether they know of any AIP 
>>> models, specifications or implementations that would support such a 
>>> use case.
>>>
>>> I have just posted a version of this question to the E-ARK LinkedIn 
>>> group so my apologies to those who see it twice.
>>>
>>> Many thanks
>>>
>>> Tim
>>> Tim Gollins | Head of Digital Archiving and Director of the NRS 
>>> Digital Preservation Programme National Records of Scotland | West 
>>> Register House | Edinburgh EH2 4DF
>>> + 44 (0)131 535 1431 / + 44 (0)7974 922614 |
>>> tim.gollins at nrscotland.gov.uk | www.nrscotland.gov.uk
>>>
>>> Preserving the past | Recording the present | Informing the future 
>>> Follow us on Twitter: @NatRecordsScot | 
>>> http://twitter.com/NatRecordsScot
>>>
>>>
>>>
>>>
>>>
>>> ----
>>> To subscribe, unsubscribe, or modify your subscription, please visit 
>>> http://mail.asis.org/mailman/listinfo/pasig-discuss
>>> _______
>>> PASIG Webinars and conference material is at 
>>> http://www.preservationandarchivingsig.org/index.html
>>> _______________________________________________
>>> Pasig-discuss mailing list
>>> Pasig-discuss at mail.asis.org
>>> http://mail.asis.org/mailman/listinfo/pasig-discuss
>>
