[Pasig-discuss] Digital repository storage benchmarking

Louis Suárez-Potts luispo at gmail.com
Tue May 16 13:53:39 EDT 2017


> Now the request: there's a network effect here.  The more agencies share data the more useful the data becomes.  So can I encourage you all to share that information (anonymously or identifiably) via the costs exchange?


Hi
I'm all for sharing this data, as well as other relevant information, including accounts of how we do things, even when they turn out to be mistakes. But an email list is not the best venue; something more pliable, like a wiki or its equivalent, would serve better, and I'm sure there are options. I'm equally sure that this particular issue has complications related to political jurisdiction that need to be made clear, as political mandates (data must stay within certain political boundaries, say) affect cost, inter alia.

Cheers,
Louis



> On 2017-05-16, at 10:05, Stern, Randy <randy_stern at harvard.edu> wrote:
> 
> Re costs:  For Harvard Library’s Digital Repository Service (2 disk copies plus 2 tape copies), as of July 1 the cost of storage for depositors to the DRS is $1.25/GB/year. This figure is moderately close to the storage hardware costs. It does not include staff costs, preservation activities, or server costs associated with the core DRS software services, tools, and databases.
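
To put the $1.25/GB/year figure in perspective, here is a minimal sketch of the arithmetic. The rate is the DRS figure quoted above; the 200 TB holding is the size Tim Walsh mentions in his original question. The decimal conversion (1 TB = 1000 GB) and the function name are assumptions made purely for illustration, and the result covers storage charges only, not the staff or server costs excluded above.

```python
# Illustrative arithmetic only. The rate is the DRS figure quoted in the thread;
# the decimal unit conversion (1 TB = 1000 GB) is an assumption.
RATE_USD_PER_GB_YEAR = 1.25


def annual_storage_cost(terabytes: float, rate: float = RATE_USD_PER_GB_YEAR) -> float:
    """Return the yearly storage charge in USD for a holding of the given size."""
    gigabytes = terabytes * 1000  # decimal terabytes assumed
    return gigabytes * rate
```

Under these assumptions, annual_storage_cost(200) comes to 250,000 USD per year for all four copies combined, since the quoted rate already covers 2 disk and 2 tape copies.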
> 
> Randy
> 
> 
> 
> On 5/15/17, 4:02 AM, "Pasig-discuss on behalf of William Kilbride" <pasig-discuss-bounces at asis.org on behalf of william.kilbride at dpconline.org> wrote:
> 
>    Hi All, Hi Tim
> 
>    This is a super thread and I am learning a tonne.  On the subject of costs I can make a recommendation and request ...
> 
>    The Curation Costs Exchange is a useful thing and well worth a look for anyone looking at comparative costs across the digital preservation lifecycle including storage.  It's not been mentioned yet in the discussions, I assume because everyone is already aware of it.  But have a look: http://www.curationexchange.org/ 
> 
>    The conclusion we drew from the 4C project was that financial planning was a core skill in preservation planning. So to be a 'trusted' repository an institution should be able to demonstrate certain skills in financial planning and be transparent about it.  It's expressed more elegantly in the 4C project roadmap:
>    http://www.4cproject.eu/roadmap/
> 
>    Now the request: there's a network effect here.  The more agencies share data the more useful the data becomes.  So can I encourage you all to share that information (anonymously or identifiably) via the costs exchange?
> 
>    All best wishes,
> 
>    William
> 
> 
>    -----Original Message-----
>    From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Jake Carroll
>    Sent: 15 May 2017 04:01
>    To: pasig-discuss at asis.org
>    Subject: Re: [Pasig-discuss] Digital repository storage benchmarking
> 
>    Certainly interesting.
> 
>    At the Queensland Brain Institute and the Australian Institute of Bioengineering and Nanotechnology at the University of Queensland, we have around 8.5PB of data under management across our HSM platforms. We currently use Oracle HSM for this task.
> 
>    We have 256TB of online “cache” for the data landing location split across 6 different filesystems that are tuned differently for different types of workloads and different tasks. These workloads are generally categorised into a few functions:
> 
>    • High IO, large serial writes from instruments
>    • Low IO, large serial writes from instruments
>    • High IO, granular “many files, many IOPS” from instruments and computational factors
>    • Low IO, granular “many files, low IOPS” from instruments and computational factors
>    • Generic group share
>    • Generic user dir
> 
>    It is an interesting thing to manage and run statistical modelling on in terms of performance analysis and micro benchmarking of data movement patterns. All the filesystems above are provisioned on 16Gbit/sec FC connected Hitachi HUS-VM, 10K SAS.
> 
>    The metadata for these filesystems is around 10 terabytes of Hitachi Accelerated Advanced Flash storage. We have around 3.8 billion files/unique objects under management.
> 
>    We run a “disk based copy” (we call that copy1) which is our disk based VSN or vault. It is around 1PB of ZFS managed storage sitting inside the very large Hitachi HUS-VM platform.
> 
>    Our Copy2 and Copy3 are 2 * T10000D Oracle tape media copies in SL3000 storage silos, geographically distributed.
> 
>    We do some interesting things with our tape infrastructure, including DIV-always-on, proactive data protection sweeps inside the HSM and continuous validation checks against the media. We also run STA (tape analytics tools) extra-data-path so we can see *exactly* what each drive is doing at all times. Believe me, we see things that would baffle and boggle the mind (and probably create a healthy sense of paranoia!) if you knew exactly what was going on “inside there”.
> 
>    We use finely tuned policy to automate data movement between tiers so as to minimally impact user experience. Our HSM supports offline file mapping to the Windows client, so people can tell when their files and objects are “offline”. It is a useful semantic and great for usability.
> 
>    We ZFS scrub the disk copy for “always on disk consistency”, and we also use tpverify commands on the tape media to check the media itself consistently. We’ll be experimenting with implementing fixity shortly too, as the filesystem supports it.
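
For readers less familiar with fixity checking, the idea can be sketched generically as follows. This is not how Oracle HSM, ZFS scrub, or tpverify work internally; it is a minimal, assumed illustration in which a manifest maps relative file paths to previously recorded SHA-256 digests. The function names and the manifest layout are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large objects need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(manifest: dict[str, str], root: Path) -> list[str]:
    """Return the relative paths that are missing or whose digest has changed.

    `manifest` maps a relative path to the digest recorded at ingest
    (a hypothetical layout chosen for this sketch).
    """
    failures = []
    for relative_path, expected in manifest.items():
        target = root / relative_path
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(relative_path)
    return failures
```

An empty list from verify_fixity means every object listed in the manifest is present and unchanged; anything else is a candidate for repair from another copy.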
> 
>    As for going “all online”: at our scale, we just can’t afford to walk away from “cold tape” principles yet. We’re just too big. We’d love to rid ourselves of the complexities of it and consider a full cloud based consumption model, but having crunched the very hard numbers on things such as AWS Glacier and S3, it is a long (long) way more expensive than the relative TCOs of running it “on premise” at this stage. My hope is that this will change soon and I can start experimenting with one of my copies being a “cloud library”.
> 
>    Interesting thread, this…
> 
>    -jc
> 
> 
> 
>    On 15/5/17, 11:41 am, "Pasig-discuss on behalf of BUNTON, GLENN" <pasig-discuss-bounces at asis.org on behalf of BUNTONGA at mailbox.sc.edu> wrote:
> 
>        This discussion of the various digital repository storage approaches has been very enlightening and useful so far. I appreciate all the excellent details. There is one piece of information, however, that is missing. Cost? Both initial implementation outlay and ongoing costs. Any general sense of costs would be greatly appreciated. 
> 
> 
>        -----Original Message-----
>        From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Steve Knight
>        Sent: Sunday, May 14, 2017 6:44 PM
>        To: 'Sheila Morrissey' <Sheila.Morrissey at ithaka.org>; pasig-discuss at asis.org
>        Subject: Re: [Pasig-discuss] Digital repository storage benchmarking
> 
>        Hi Tim
> 
>        At the National Library of New Zealand, we are storing about 210TB of digital objects in our permanent repository.
> 
>        We have a 25TB online cache, with an online copy of all the digital objects sitting on disk. 
> 
>        Three tape copies of the objects are made as soon as they enter the disk archive.  1 copy remains within the tape library (nearline); the other 2 copies are sent offsite (offline). We use Oracle SAM-QFS to manage the storage policies and automatic tiering.
> 
>        We have a similar treatment for our 100TB of Test data, which has 1 fewer offsite tape copy.
> 
>        We are currently looking at replacing this storage architecture with a mix of Hitachi's HDI and HCP S30 object storage products and our cloud provider's object storage offering. The cloud provider storage includes replication across 3 geographic locations providing both higher availability and higher resilience than we currently have.
> 
>        By moving to an all online solution we hope to increase overall performance and make savings through utilising object storage and exiting some services related to current backup and restore processes.
> 
>        Regards
>        Steve
> 
> 
> 
> 
>        -----Original Message-----
>        From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Sheila Morrissey
>        Sent: Saturday, 13 May 2017 5:44 a.m.
>        To: pasig-discuss at asis.org
>        Subject: [Pasig-discuss] FW: Digital repository storage benchmarking
> 
> 
>        Hello, Tim,
> 
>        At Portico (http://www.portico.org/digital-preservation/), we preserve e-journals, e-books, digitized historical collections, and other born-digital scholarly content.
> 
>        Currently, the Portico archive comprises roughly 77.7 million digital objects (we call them "Archival Units", or AUs), totalling over 400 TB and made up of 1.3 billion files.
> 
>        We maintain 3 copies of the archive:  2 on disk in geographically distributed data centers, and a 3rd copy in commercial cloud storage.  We create and maintain backups (including fixity checks) using our own custom-written software.
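
With three copies, one common way to use fixity results is to compare the digest of the same object across all copies and flag whichever copy disagrees with the majority. The sketch below illustrates that idea only; it is not Portico's custom software, and the copy names and function name are hypothetical.

```python
from collections import Counter


def find_divergent_copies(digests: dict[str, str]) -> list[str]:
    """Given copy name -> digest for one object, return the copies to distrust.

    If a strict majority of copies agree, the copies holding any other digest
    are returned. If no strict majority exists, every copy is returned,
    because none can be trusted on the strength of agreement alone.
    """
    counts = Counter(digests.values())
    majority_digest, majority_count = counts.most_common(1)[0]
    if majority_count <= len(digests) / 2:
        return sorted(digests)
    return sorted(name for name, value in digests.items() if value != majority_digest)
```

For example, if two disk copies report the same digest and the cloud copy reports a different one, only the cloud copy is flagged, and it can then be repaired from either disk copy.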
> 
>        I hope this is helpful.
> 
>        Best regards,
>        Sheila
> 
> 
>        Sheila M. Morrissey
>        Senior Researcher
>        ITHAKA
>        100 Campus Drive
>        Suite 100
>        Princeton NJ 08540
>        609-986-2221
>        sheila.morrissey at ithaka.org
> 
>        ITHAKA (www.ithaka.org) is a not-for-profit organization that helps the academic community use digital technologies to preserve the scholarly record and to advance research and teaching in sustainable ways.  We provide innovative services that benefit higher education, including Ithaka S+R, JSTOR, and Portico.
> 
> 
> 
>        -----Original Message-----
>        From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Tim Walsh
>        Sent: Friday, May 12, 2017 10:16 AM
>        To: pasig-discuss at asis.org
>        Subject: [Pasig-discuss] Digital repository storage benchmarking
> 
>        Dear PASIG,
> 
>        I am currently in the process of benchmarking digital repository storage setups with our Director of IT, and am having trouble finding very much information about other institutions’ configurations online. It’s very possible that this question has been asked before on-list, but I wasn’t able to find anything in the list archives.
> 
>        For context, we are a research museum with significant born-digital archival holdings preparing to manage about 200 TB of digital objects over the next 3 years, replicated several times on various media. The question is what precisely those “various media” will be. Currently, our plan is to store one copy on disk on-site, one copy on disk in a managed off-site facility, and a third copy on LTO sent to a third facility. Before we commit, we’d like to benchmark our plans against other institutions.
> 
>        I have been able to find information about the storage configurations for MoMA and the Computer History Museum (who each wrote blog posts or presented on this topic), but not very many others. So my questions are:
> 
>        * Could you point me to published/available resources outlining other institutions’ digital repository storage configurations?
>        * Or, if you work at an institution, would you be willing to share the details of your configuration on- or off-list? (any information sent off-list will be kept strictly confidential)
> 
>        Helpful details would include: the volume of digital objects being stored; how many copies of data are being stored; which copies are online, nearline, or offline; which media are being used for which copies; and which services/software applications you are using to manage the creation and maintenance of backups.
> 
>        Thank you!
>        Tim
> 
>        - - -
> 
>        Tim Walsh
>        Archiviste, Archives numériques
>        Archivist, Digital Archives
> 
>        Centre Canadien d’Architecture
>        Canadian Centre for Architecture
>        1920, rue Baile, Montréal, Québec  H3H 2S6
>        T 514 939 7001 x 1532
>        F 514 939 7020
>        www.cca.qc.ca
> 
> 
> 
>        ----
>        To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss
>        _______
>        PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html
>        _______________________________________________
>        Pasig-discuss mailing list
>        Pasig-discuss at mail.asis.org
>        http://mail.asis.org/mailman/listinfo/pasig-discuss
> 



