[Pasig-discuss] WORM (Write Once Read Many) AIPs

Jacob Farmer jfarmer at cambridgecomputer.com
Fri May 12 09:33:11 EDT 2017


Two warnings and two suggestions:

Warnings:

1)  Symlinks and Housekeeping -- It is a common practice to use symlinks to
make versioned file collections.  If you do this, you should have some kind
of housekeeping processes that ensure that the symlinks are all working
correctly.  If files ever have to get migrated, symlinks can break.

2)  Check with your file system vendor -- Most removable media file systems
have some built in limitations on the number of inodes (files) that you can
have in one file system.  If you generate a lot of symlinks, you might
overwhelm the file system.  Your vendor will know.
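
A rough sketch of what such housekeeping might look like, in Python on a
POSIX system. The function names are my own for illustration. Note that some
file systems report zero for the inode counts, so treat the second function
as a hint and still check with your vendor.

```python
import os


def find_broken_symlinks(root):
    """Return the paths of symlinks under root whose targets do not exist."""
    broken = []
    # followlinks=False: we inspect links, we do not descend through them.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # os.path.exists() follows the link, so it is False when the
            # link is dangling.
            if os.path.islink(path) and not os.path.exists(path):
                broken.append(path)
    return sorted(broken)


def inode_headroom(path):
    """Return (total, free) inode counts for the file system holding path."""
    st = os.statvfs(path)
    return st.f_files, st.f_ffree
```

Running the first function after any migration, and the second before a
large batch of new links, would catch both problems early.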

Suggestions:

1)  Hashes for file names -- If your application software maintains a hash
for each file, you might consider naming the file according to the hash.
Use the first two digits for the parent directory, the next two digits for
the sub-directory, and the next two digits for the sub-sub-directory.  Then
use the full hash for the file name.  This turns your POSIX file system into
an object store with uniquely named objects.

	As a safeguard, you might maintain a separate table or list that associates
path names with hashes.

2)  Consider using hard links instead of symlinks -- You might use hard
links instead of symlinks, presuming that the files are all in the same file
system.  You still have to watch for file count issues, but you have less
housekeeping to do.
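
Both suggestions can be sketched together in Python. This is only an
illustration: SHA-256, the tab-separated manifest file and all function
names are assumptions of mine, not something your application necessarily
uses.

```python
import hashlib
import os
import shutil


def hash_path(store_root, digest):
    """Map a hex digest to store_root/ab/cd/ef/<full digest>."""
    return os.path.join(store_root, digest[0:2], digest[2:4], digest[4:6], digest)


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def store_file(store_root, src, manifest_path):
    """Copy src into the store under its hash and record hash -> path."""
    digest = sha256_of(src)
    dest = hash_path(store_root, digest)
    if not os.path.exists(dest):
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.copyfile(src, dest)
    # The safeguard table: one "hash<TAB>original path" line per stored file.
    with open(manifest_path, "a", encoding="utf-8") as manifest:
        manifest.write(f"{digest}\t{src}\n")
    return dest


def link_into_version(stored, version_dir, name):
    """Expose a stored object under a readable name via a hard link."""
    os.makedirs(version_dir, exist_ok=True)
    target = os.path.join(version_dir, name)
    # os.link raises OSError if the two paths are on different file systems.
    os.link(stored, target)
    return target
```

A hard link cannot dangle the way a symlink can, which is where the reduced
housekeeping comes from. Each link is only a directory entry and does not
consume an extra inode, although the directories themselves still do.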

I hope that helps.


Jacob Farmer  |  Chief Technology Officer  |  Cambridge Computer  |
"Artists In Data Storage"
Phone 781-250-3210  |  jfarmer at CambridgeComputer.com  |
www.CambridgeComputer.com




-----Original Message-----
From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of
Neil Jefferies
Sent: Friday, May 12, 2017 8:06 AM
To: Tim.Gollins at nrscotland.gov.uk
Cc: pasig-discuss at mail.asis.org
Subject: Re: [Pasig-discuss] WORM (Write Once Read Many) AIPs

Tim,

If we store AIPs unpackaged, as a collection of files in a folder, then
object updates could just be a new folder with symlinks to the unchanged
parts and the updated parts in place in the folder. The object "location"
would be a parent folder for all these version folders - for example, a
pairtree (or triple-tree for faster scanning/rebuilds) based on object UUID.
Version folders would be named according to date or version number (date
might make Memento-compliant access simpler).
Creating a new version clones the current version (including links) with a
new name and then replaces the updated parts in situ. The final act is to
update a "current" symlink in the object. Any update failure will mean
"current" is not updated and the partial clone can be discarded.
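
A minimal Python sketch of that update cycle, assuming a flat version folder
and a "current" symlink that already exists. The function name and the
"updates" mapping are hypothetical, and instead of cloning and then
overwriting, this variant skips the updated names while cloning and writes
them fresh afterwards.

```python
import os
import shutil


def create_version(object_dir, new_name, updates):
    """Clone the current version as symlinks, apply updates, repoint 'current'.

    updates maps a file name to its new content as bytes.
    """
    current_link = os.path.join(object_dir, "current")
    old_dir = os.path.realpath(current_link)
    new_dir = os.path.join(object_dir, new_name)
    os.mkdir(new_dir)  # raises if the version already exists; nothing to undo
    try:
        for name in os.listdir(old_dir):
            if name in updates:
                continue  # replaced below with a real file
            old_path = os.path.join(old_dir, name)
            new_path = os.path.join(new_dir, name)
            if os.path.islink(old_path):
                # Copy the link itself, so it keeps pointing at the version
                # that holds the real file (assumes relative link targets).
                os.symlink(os.readlink(old_path), new_path)
            else:
                os.symlink(os.path.join("..", os.path.basename(old_dir), name), new_path)
        for name, data in updates.items():
            with open(os.path.join(new_dir, name), "wb") as f:
                f.write(data)
    except Exception:
        # Update failed: discard the partial clone, "current" is untouched.
        shutil.rmtree(new_dir, ignore_errors=True)
        raise
    # Final act: swap "current" by renaming a temporary link over it, which
    # is atomic on POSIX file systems.
    tmp_link = current_link + ".tmp"
    os.symlink(new_name, tmp_link)
    os.replace(tmp_link, current_link)
    return new_dir
```

Because earlier versions are never written to, every folder except the one
being built is effectively write-once.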

This assumes most updates are metadata and that a diff won't save much
compared to a complete new XML file or whatever. I am also assuming that
metadata won't be wrapped either (so you can forget METS) so that
different types are stored in the most suitable format and are accessed only
when required. The problems with roundtripping packaged AIPs for updates
rather than diff-ing are repeated by METS wrapping.

These may be a virtual folder/filesystem presentation and underneath an HSM
would retrieve files from wherever when they are actually accessed. HSM policy
in something like SAM-QFS/Versity/Cray TAS can ensure folders are kept intact
when moved to other storage (we could even dereference symlinks when dealing
with tape).
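
As an illustration of the dereferencing idea only (a real HSM would do this
through its own policy engine, which is not shown here), a version folder
could be staged for tape with the Python standard library. The function name
and the staging folder are hypothetical.

```python
import os
import shutil


def export_version(version_dir, staging_dir):
    """Copy a version folder, replacing each symlink with the file it points to."""
    # symlinks=False makes copytree copy the content of linked files rather
    # than the links; it raises shutil.Error if a link is dangling.
    shutil.copytree(version_dir, staging_dir, symlinks=False)
    # Sanity check: the staged copy must be self-contained.
    leftover = [
        os.path.join(dirpath, name)
        for dirpath, dirnames, filenames in os.walk(staging_dir)
        for name in dirnames + filenames
        if os.path.islink(os.path.join(dirpath, name))
    ]
    if leftover:
        raise RuntimeError(f"symlinks survived export: {leftover}")
    return staging_dir
```

The staged folder then holds complete files and can be written to tape as
one intact unit.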

This can be done with a POSIX filesystem and not much code - Ben O'Steen
started something along these lines here:
https://github.com/dataflow/RDFDatabank/wiki/What-is-DataBank-and-what-does-it-do%3F

Fedora is also a versioning object store that could support this kind of
model but also adds a fair bit of complexity to be Linked Data Platform
compliant.

In my parlance I would probably equate "Minimal Ingest" with "Sheer
Curation" and APT with Asynchronous Message Driven Workers.

Neil


On 2017-05-12 12:33, Tim.Gollins at nrscotland.gov.uk wrote:
> Dear PASIG
>
> I have been thinking recently about the challenge of managing
> "physical"  AIPs on offline or near line storage and how to optimise
> or simplify the use of managed storage media in a tape based (robotic)
> Hierarchical Storage Management (HSM) system. By "physical" AIPs I
> mean that the actual structure of the AIP written to the storage
> system is sufficiently self-describing that even if the management or
> other elements of a DP system were to be lost to a disaster then the
> entire collection could be fully re-instated reliably from the stored
> AIPs alone.
>
> I have also been thinking about the huge benefits of adopting the
> concepts of "Minimal Ingest" (MI) and "Autonomous Preservation Tools"
> (APT) in a new Digital Archive solution.
>
> One of the potential effects of the MI and APT concepts is that over
> time it is clear that while (of course) the original bit streams will
> never need to be updated, the metadata packaged in the AIP will need
> to change relatively often (through the life of the AIP). This is of
> course in addition to any new renderings of the bit streams produced
> for preservation purposes (manifestations as termed in some systems).
>
> If to update the AIP the process involves the AIP being "loaded" and
> "Modified" and "Stored" again as a whole then this will result in
> significant "churn" of the offline or near line media (i.e. tapes) in
> a HSM - which I would like to avoid. I think it would be really great
> if the AIP representation could accommodate the concept of an "update
> IP" (perhaps UIP?) where the UIP contains a "delta" of the original
> AIP - the full AIP then being interpreted as the original as modified
> by a series of deltas. This would then effectively result in AIPs (and
> UIPs) becoming WORM objects with clear benefits that I perceive in
> managing their reliable and safe storage.
>
> I am not sufficiently familiar with the detail of all the different
> AIP models or implementations, so I was wondering if anyone in the team
> would be able to comment on whether they know of any AIP models,
> specifications or implementations that would support such a use case.
>
> I have just posted a version of this question to the E-ARK LinkedIn
> group so my apologies to those who see it twice.
>
> Many thanks
>
> Tim
> Tim Gollins | Head of Digital Archiving and Director of the NRS
> Digital Preservation Programme National Records of Scotland | West
> Register House | Edinburgh EH2 4DF
> + 44 (0)131 535 1431 / + 44 (0)7974 922614 |
> tim.gollins at nrscotland.gov.uk | www.nrscotland.gov.uk
>
> Preserving the past | Recording the present | Informing the future
> Follow us on Twitter: @NatRecordsScot |
> http://twitter.com/NatRecordsScot
>
>
>
> ----
> To subscribe, unsubscribe, or modify your subscription, please visit
> http://mail.asis.org/mailman/listinfo/pasig-discuss
> _______
> PASIG Webinars and conference material is at
> http://www.preservationandarchivingsig.org/index.html
> _______________________________________________
> Pasig-discuss mailing list
> Pasig-discuss at mail.asis.org
> http://mail.asis.org/mailman/listinfo/pasig-discuss
