[Pasig-discuss] Digital repository storage benchmarking

William Kilbride william.kilbride at dpconline.org
Mon May 15 04:02:27 EDT 2017


Hi All, Hi Tim

This is a super thread and I am learning a tonne.  On the subject of costs I can make a recommendation and request ...

The Curation Costs Exchange is a useful thing and well worth a look for anyone looking at comparative costs across the digital preservation lifecycle including storage.  It's not been mentioned yet in the discussions, I assume because everyone is already aware of it.  But have a look: http://www.curationexchange.org/ 

The conclusion we drew from the 4C project was that financial planning is a core skill in preservation planning. So to be a 'trusted' repository an institution should be able to demonstrate certain skills in financial planning and be transparent about it.  It's expressed more elegantly in the 4C project roadmap: 
http://www.4cproject.eu/roadmap/

Now the request: there's a network effect here.  The more agencies share data, the more useful the data becomes.  So can I encourage you all to share that information (anonymously or identifiably) via the costs exchange?

All best wishes,

William


-----Original Message-----
From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Jake Carroll
Sent: 15 May 2017 04:01
To: pasig-discuss at asis.org
Subject: Re: [Pasig-discuss] Digital repository storage benchmarking

Certainly interesting.

At the Queensland Brain Institute and the Australian Institute of Bioengineering and Nanotechnology at the University of Queensland, we have around 8.5PB of data under management across our HSM platforms. We currently use Oracle HSM for this task.

We have 256TB of online “cache” for the data landing location split across 6 different filesystems that are tuned differently for different types of workloads and different tasks. These workloads are generally categorised into a few functions:

• High IO, large serial writes from instruments
• Low IO, large serial writes from instruments
• High IO, granular “many files, many IOPS” from instruments and computational factors
• Low IO, granular “many files, low IOPS” from instruments and computational factors
• Generic group share
• Generic user dir

It is an interesting thing to manage and run statistical modelling on in terms of performance analysis and micro benchmarking of data movement patterns. All the filesystems above are provisioned on 16Gbit/sec FC connected Hitachi HUS-VM, 10K SAS.

The metadata for these filesystems is around 10 terabytes of Hitachi Accelerated Advanced Flash storage. We have around 3.8 billion files/unique objects under management.

We run a “disk based copy” (we call that copy1) which is our disk based VSN or vault. It is around 1PB of ZFS managed storage sitting inside the very large Hitachi HUS-VM platform.

Our Copy2 and Copy3 are 2 * T10000D Oracle tape media copies in SL3000 storage silos, geographically distributed.

We do some interesting things with our tape infrastructure, including DIV-always-on, proactive data protection sweeps inside the HSM and continuous validation checks against the media. We also run STA (tape analytics tools) extra-data-path so we can see *exactly* what each drive is doing at all times. Believe me, we see things that would baffle and boggle the mind (and probably create a healthy sense of paranoia!) if you knew exactly what was going on “inside there”.

We use finely tuned policies to automate data movement between tiers so as to minimally impact user experience. Our HSM supports offline file mapping to the Windows client, so people can tell when their files and objects are “offline”. It is a useful semantic and great for usability.

We ZFS scrub the disk copy for “always on disk consistency”, and we also use tpverify commands on the tape media to check the media itself consistently. We plan to start experimenting with fixity shortly too, as the filesystem supports it.
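
For readers less familiar with fixity checking, here is a minimal, generic sketch of the idea: record a checksum for each file, then recompute it later and compare. It does not show how Oracle HSM, ZFS scrub or tpverify work internally. The manifest layout (relative path mapped to a SHA-256 hex digest) and the function names `sha256_of` and `verify_fixity` are assumptions made purely for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Stream a file through SHA-256 so large objects need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(manifest: dict[str, str], root: Path) -> list[str]:
    """Return the relative paths that are missing or whose checksum differs.

    `manifest` is a hypothetical mapping of relative path -> expected digest.
    """
    failures = []
    for relative_path, expected in manifest.items():
        target = root / relative_path
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(relative_path)
    return failures
```

An empty list from `verify_fixity` would mean every file listed in the manifest is present and unchanged; anything else is a candidate for repair from another copy.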

As for going “all online”: at our scale, we just can’t afford to walk away from “cold tape” principles yet. We’re just too big. We’d love to rid ourselves of its complexities and consider a fully cloud-based consumption model, but having crunched the very hard numbers on things such as AWS Glacier and S3, they are a long (long) way more expensive than the relative TCO of running it “on premise” at this stage. My hope is that this will change soon and I can start experimenting with one of my copies being a “cloud library”.
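
As a rough illustration of the kind of comparison involved, the sketch below totals the cost of one storage option over a number of years while the stored capacity grows. It is not the model we actually used, and it contains no real prices: the `StorageOption` fields, the `total_cost` function and the simple compound-growth assumption are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StorageOption:
    """Hypothetical cost profile for one storage option (no real prices)."""
    name: str
    upfront_cost: float               # one-off purchase or migration cost
    cost_per_tb_year: float           # recurring cost per TB stored per year
    fixed_cost_per_year: float = 0.0  # e.g. staff, power, support, retrieval fees


def total_cost(option: StorageOption, start_tb: float,
               annual_growth: float, years: int) -> float:
    """Sum upfront and yearly costs, growing capacity once per year."""
    total = option.upfront_cost
    capacity = start_tb
    for _ in range(years):
        total += capacity * option.cost_per_tb_year + option.fixed_cost_per_year
        capacity *= 1 + annual_growth  # assumed compound growth in holdings
    return total
```

Running `total_cost` for an on-premise option and a cloud option with your own figures, over the same period and growth rate, gives two totals that can be compared directly. A fuller model would also need to cover retrieval and egress charges, media refresh cycles and staffing.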

Interesting thread, this…

-jc



On 15/5/17, 11:41 am, "Pasig-discuss on behalf of BUNTON, GLENN" <pasig-discuss-bounces at asis.org on behalf of BUNTONGA at mailbox.sc.edu> wrote:

    This discussion of the various digital repository storage approaches has been very enlightening and useful so far. I appreciate all the excellent details. There is one piece of information, however, that is missing: cost, both initial implementation outlay and ongoing costs. Any general sense of costs would be greatly appreciated. 
    
    
    -----Original Message-----
    From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Steve Knight
    Sent: Sunday, May 14, 2017 6:44 PM
    To: 'Sheila Morrissey' <Sheila.Morrissey at ithaka.org>; pasig-discuss at asis.org
    Subject: Re: [Pasig-discuss] Digital repository storage benchmarking
    
    Hi Tim
    
    At the National Library of New Zealand, we are storing about 210TB of digital objects in our permanent repository.
    
    We have a 25TB online cache, with an online copy of all the digital objects sitting on disk. 
    
    Three tape copies of the objects are made as soon as they enter the disk archive.  One copy remains within the tape library (nearline); the other two copies are sent offsite (offline). We use Oracle SAM-QFS to manage the storage policies and automatic tiering.
    
    We have a similar treatment for our 100TB of Test data, which has one fewer offsite tape copy.
    
    We are currently looking at replacing this storage architecture with a mix of Hitachi's HDI and HCP S30 object storage products and our cloud provider's object storage offering. The cloud provider storage includes replication across 3 geographic locations providing both higher availability and higher resilience than we currently have.
    
    By moving to an all online solution we hope to increase overall performance and make savings through utilising object storage and exiting some services related to current backup and restore processes.
    
    Regards
    Steve
    
    
    
    
    -----Original Message-----
    From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Sheila Morrissey
    Sent: Saturday, 13 May 2017 5:44 a.m.
    To: pasig-discuss at asis.org
    Subject: [Pasig-discuss] FW: Digital repository storage benchmarking
    
    
    Hello, Tim,
    
    At Portico (http://www.portico.org/digital-preservation/), we preserve e-journals, e-books, digitized historical collections, and other born-digital scholarly content.
    
    Currently, the Portico archive comprises roughly 77.7 million digital objects (we call them "Archival Units", or AUs), totalling over 400 TB and made up of 1.3 billion files.
    
    We maintain 3 copies of the archive:  2 on disk in geographically distributed data centers, and a 3rd copy in commercial cloud storage.  We create and maintain backups (including fixity checks) using our own custom-written software.
    
    I hope this is helpful.
    
    Best regards,
    Sheila
    
    
    Sheila M. Morrissey
    Senior Researcher
    ITHAKA
    100 Campus Drive
    Suite 100
    Princeton NJ 08540
    609-986-2221
    sheila.morrissey at ithaka.org
     
    ITHAKA (www.ithaka.org) is a not-for-profit organization that helps the academic community use digital technologies to preserve the scholarly record and to advance research and teaching in sustainable ways.  We provide innovative services that benefit higher education, including Ithaka S+R, JSTOR, and Portico.
    
    
    
    -----Original Message-----
    From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Tim Walsh
    Sent: Friday, May 12, 2017 10:16 AM
    To: pasig-discuss at asis.org
    Subject: [Pasig-discuss] Digital repository storage benchmarking
    
    Dear PASIG,
    
    I am currently in the process of benchmarking digital repository storage setups with our Director of IT, and am having trouble finding very much information about other institutions’ configurations online. It’s very possible that this question has been asked before on-list, but I wasn’t able to find anything in the list archives.
    
    For context, we are a research museum with significant born-digital archival holdings preparing to manage about 200 TB of digital objects over the next 3 years, replicated several times on various media. The question is what precisely those “various media” will be. Currently, our plan is to store one copy on disk on-site, one copy on disk in a managed off-site facility, and a third copy on LTO sent to a third facility. Before we commit, we’d like to benchmark our plans against other institutions.
    
    I have been able to find information about the storage configurations for MoMA and the Computer History Museum (who each wrote blog posts or presented on this topic), but not very many others. So my questions are:
    
    * Could you point me to published/available resources outlining other institutions’ digital repository storage configurations?
    * Or, if you work at an institution, would you be willing to share the details of your configuration on- or off-list? (any information sent off-list will be kept strictly confidential)
    
    Helpful details would include: amount of digital objects being stored; how many copies of data are being stored; which copies are online, nearline, or offline; which media are being used for which copies; and which services/software applications you are using to manage the creation and maintenance of backups.
    
    Thank you!
    Tim
    
    - - -
    
    Tim Walsh
    Archiviste, Archives numériques
    Archivist, Digital Archives
    
    Centre Canadien d’Architecture
    Canadian Centre for Architecture
    1920, rue Baile, Montréal, Québec  H3H 2S6
    T 514 939 7001 x 1532
    F 514 939 7020
    www.cca.qc.ca
    
    
    Please consider the environment before printing this email.
    This email may contain confidential information. If you are not the intended recipient, please advise us immediately and delete this email as well as any other copy. Thank you.
    
    ----
    To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss
    _______
    PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html
    _______________________________________________
    Pasig-discuss mailing list
    Pasig-discuss at mail.asis.org
    http://mail.asis.org/mailman/listinfo/pasig-discuss
    


