[Pasig-discuss] Digital repository storage benchmarking
Bjarne Andersen
bja at kb.dk
Wed May 17 02:48:00 EDT 2017
In Denmark, the Royal Danish Library has developed the open-source BitRepository software:
www.bitrepository.org
This software handles "nothing but" the preservation of bits.
Put very simply, it is a system for handling multiple copies of data on different "pillars" (different technologies, different locations, different organisations) to keep the copies as independent of one another as possible.
In our own collections we store and preserve more than 4 PB of unique content, meaning that we currently have over 15 PB of capacity.
The Royal Danish Library offers bit preservation using this software for other national cultural heritage institutions.
Our pricing model basically has two prices: one for ingest (first year) and one for following years (which includes a re-investment budget for periodic migration to new media/technology).
The prices are roughly (per TB/year):
Online (disk): ingest: 500 Euros, following years: 200 Euros
Nearline (tape inside robot): ingest: 156 Euros, following years: 68 Euros
Offline (tape moved to fire safe box): ingest: 132 Euros, following years: 50 Euros
These are meant for long-term preservation, so there are access prices as well. They are of course higher for the tape-based storage, and naturally highest for the offline model, where staff need to collect tapes from a box and mount them in the tape robot.
With these prices we can offer a 3-copy setup with e.g. 1 disk and 2 tapes for a total of 750 Euros/TB in the first year and 300 Euros/TB in the following years.
The prices include everything: hardware, staff, power, media migration, etc.
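As a rough illustration of how the two-price model adds up, here is a minimal Python sketch. The per-TB prices are the approximate figures listed above; the function name `cost_per_tb` and the choice of one online copy plus two offline tape copies are assumptions made for the example, since the message does not say which tape tiers make up the 3-copy setup. With that combination the listed prices sum to 764 Euros/TB in the first year and 300 Euros/TB in each following year, close to the rough totals quoted.

```python
# Approximate prices in Euros per TB per year, as quoted in the message.
PRICES = {
    "online": {"ingest": 500, "following": 200},
    "nearline": {"ingest": 156, "following": 68},
    "offline": {"ingest": 132, "following": 50},
}


def cost_per_tb(copies, years):
    """Total Euros per TB for keeping the given copies for `years` years.

    The first year is charged at the ingest price, every later year at the
    following-years price. `copies` is a list of tier names (hypothetical
    helper, not part of the BitRepository software).
    """
    if years < 1:
        raise ValueError("years must be at least 1")
    ingest = sum(PRICES[tier]["ingest"] for tier in copies)
    following = sum(PRICES[tier]["following"] for tier in copies)
    return ingest + following * (years - 1)


# Assumed setup: one disk copy and two offline tape copies.
setup = ["online", "offline", "offline"]
print(cost_per_tb(setup, 1))                          # 764
print(cost_per_tb(setup, 2) - cost_per_tb(setup, 1))  # 300
```

Access charges are not modelled here, because the message gives no figures for them.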
best
-
Bjarne Andersen
Vicedirektør
Deputy Director General
It-udvikling og Infrastruktur
IT Development & Infrastructure
+45 89 46 21 65 / + 45 25 66 23 53
bja at kb.dk
Det Kgl. Bibliotek
Royal Danish Library
Victor Albecks Vej 1
DK-8000 Aarhus C
+45 3347 4747
CVR 2898 8842
EAN 5798 000 792142
-----Original Message-----
From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Stern, Randy
Sent: Tuesday, May 16, 2017 4:05 PM
To: William Kilbride <william.kilbride at dpconline.org>; Jake Carroll <jake.carroll at uq.edu.au>; pasig-discuss at asis.org
Subject: Re: [Pasig-discuss] Digital repository storage benchmarking
Re costs: Harvard Library’s Digital Repository Service (DRS) keeps 2 disk copies plus 2 tape copies. As of July 1, storage costs depositors to the DRS $1.25/GB/year. This figure is moderately close to the storage hardware costs. It does not include staff costs, preservation activities, or server costs associated with the core DRS software services, tools, and databases.
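To put the per-GB rate on the same footing as per-TB figures, a small Python sketch follows. The rate is the $1.25/GB/year quoted above; the decimal conversion (1 TB = 1000 GB), the function name, and the 200 TB example size (borrowed from the original question in this thread) are assumptions for illustration only.

```python
GB_PER_TB = 1000  # assumption: decimal units, 1 TB = 1000 GB


def annual_storage_cost_usd(size_tb, rate_per_gb_year=1.25):
    """Yearly storage charge in USD for `size_tb` terabytes at a per-GB rate."""
    return size_tb * GB_PER_TB * rate_per_gb_year


print(annual_storage_cost_usd(1))    # 1250.0
print(annual_storage_cost_usd(200))  # 250000.0
```

As noted in the message, this rate excludes staff, preservation activities, and server costs, so it is not directly comparable with all-inclusive prices.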
Randy
On 5/15/17, 4:02 AM, "Pasig-discuss on behalf of William Kilbride" <pasig-discuss-bounces at asis.org on behalf of william.kilbride at dpconline.org> wrote:
Hi All, Hi Tim
This is a super thread and I am learning a tonne. On the subject of costs I can make a recommendation and request ...
The Curation Costs Exchange is a useful thing and well worth a look for anyone looking at comparative costs across the digital preservation lifecycle including storage. It's not been mentioned yet in the discussions, I assume because everyone is already aware of it. But have a look: http://www.curationexchange.org/
The conclusion we drew from the 4C project was that financial planning was a core skill in preservation planning. So to be a 'trusted' repository an institution should be able to demonstrate certain skills in financial planning and be transparent about it. It's expressed more elegantly in the 4C project roadmap:
http://www.4cproject.eu/roadmap/
Now the request: there's a network effect here. The more agencies share data the more useful the data becomes. So can I encourage you all to share that information (anonymously or identifiably) via the costs exchange?
All best wishes,
William
-----Original Message-----
From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Jake Carroll
Sent: 15 May 2017 04:01
To: pasig-discuss at asis.org
Subject: Re: [Pasig-discuss] Digital repository storage benchmarking
Certainly interesting.
At the Queensland Brain Institute and the Australian Institute of Bioengineering and Nanotechnology at the University of Queensland, we have around 8.5PB of data under management across our HSM platforms. We currently use Oracle HSM for this task.
We have 256TB of online “cache” for the data landing location split across 6 different filesystems that are tuned differently for different types of workloads and different tasks. These workloads are generally categorised into a few functions:
• High IO, large serial writes from instruments
• Low IO, large serial writes from instruments
• High IO, granular “many files, many IOPS” from instruments and computational factors
• Low IO, granular “many files, low IOPS” from instruments and computational factors
• Generic group share
• Generic user dir
It is an interesting thing to manage and run statistical modelling on in terms of performance analysis and micro benchmarking of data movement patterns. All the filesystems above are provisioned on 16Gbit/sec FC connected Hitachi HUS-VM, 10K SAS.
The metadata for these filesystems is around 10 terabytes of Hitachi Accelerated Advanced Flash storage. We have around 3.8 billion files/unique objects under management.
We run a “disk based copy” (we call that copy1) which is our disk based VSN or vault. It is around 1PB of ZFS managed storage sitting inside the very large Hitachi HUS-VM platform.
Our Copy2 and Copy3 are 2 * T10000D Oracle tape media copies in SL3000 storage silos, geographically distributed.
We do some interesting things with our tape infrastructure, including DIV-always-on, proactive data protection sweeps inside the HSM and continuous validation checks against the media. We also run STA (tape analytics tools) extra-data-path so we can see *exactly* what each drive is doing at all times. Believe me, we see things that would baffle and boggle the mind (and probably create a healthy sense of paranoia!) if you knew exactly what was going on “inside there”.
We use finely tuned policy to automate data movement between tiers so as to minimally impact user experience. Our HSM supports offline file mapping to the Windows client, so people can tell when their files and objects are “offline”. It is a useful semantic and great for usability.
We ZFS scrub the disk copy for “always on disk consistency”, and we also use tpverify commands on the tape media to continually check the media itself. We’ll be experimenting with implementing fixity shortly too, as the filesystem supports it.
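For readers unfamiliar with the term, a fixity check in general means recording a checksum for each file and later recomputing it to detect silent change or loss. The Python sketch below illustrates only that general idea; it does not model ZFS scrubbing, tpverify, or the HSM's own fixity support, and the function names and the choice of SHA-256 are assumptions for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to `root`) to its checksum."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root, manifest):
    """Return the relative paths that are missing or whose checksum changed."""
    root = Path(root)
    failures = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(rel)
    return failures
```

A manifest would be built once at ingest with `build_manifest(root)` and stored separately from the data; running `verify(root, manifest)` on a schedule then returns an empty list while every file is intact.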
As for going “all online”: at our scale we just can’t yet afford to walk away from “cold tape” principles. We’re just too big. We’d love to rid ourselves of the complexities of it and consider a full cloud-based consumption model, but having crunched the very hard numbers on things such as AWS Glacier and S3, it is a long (long) way more expensive than the relative TCOs of running it “on premise” at this stage. My hope is that this will change soon and I can start experimenting with one of my copies being a “cloud library”.
Interesting thread, this…
-jc
On 15/5/17, 11:41 am, "Pasig-discuss on behalf of BUNTON, GLENN" <pasig-discuss-bounces at asis.org on behalf of BUNTONGA at mailbox.sc.edu> wrote:
This discussion of the various digital repository storage approaches has been very enlightening and useful so far. I appreciate all the excellent details. There is one piece of information, however, that is missing. Cost? Both initial implementation outlay and ongoing costs. Any general sense of costs would be greatly appreciated.
-----Original Message-----
From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Steve Knight
Sent: Sunday, May 14, 2017 6:44 PM
To: 'Sheila Morrissey' <Sheila.Morrissey at ithaka.org>; pasig-discuss at asis.org
Subject: Re: [Pasig-discuss] Digital repository storage benchmarking
Hi Tim
At the National Library of New Zealand, we are storing about 210TB of digital objects in our permanent repository.
We have a 25TB online cache, with an online copy of all the digital objects sitting on disk.
Three tape copies of the objects are made as soon as they enter the disk archive. 1 copy remains within the tape library (nearline); the other 2 copies are sent offsite (offline). We use Oracle SAM-QFS to manage the storage policies and automatic tiering.
We have a similar treatment for our 100TB of Test data, which has one fewer offsite tape copy.
We are currently looking at replacing this storage architecture with a mix of Hitachi's HDI and HCP S30 object storage products and our cloud provider's object storage offering. The cloud provider storage includes replication across 3 geographic locations providing both higher availability and higher resilience than we currently have.
By moving to an all online solution we hope to increase overall performance and make savings through utilising object storage and exiting some services related to current backup and restore processes.
Regards
Steve
-----Original Message-----
From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Sheila Morrissey
Sent: Saturday, 13 May 2017 5:44 a.m.
To: pasig-discuss at asis.org
Subject: [Pasig-discuss] FW: Digital repository storage benchmarking
Hello, Tim,
At Portico (http://www.portico.org/digital-preservation/), we preserve e-journals, e-books, digitized historical collections, and other born-digital scholarly content.
Currently, the Portico archive comprises roughly 77.7 million digital objects (we call them "Archival Units", or AUs), totalling over 400 TB and made up of 1.3 billion files.
We maintain 3 copies of the archive: 2 on disk in geographically distributed data centers, and a 3rd copy in commercial cloud storage. We create and maintain backups (including fixity checks) using our own custom-written software.
I hope this is helpful.
Best regards,
Sheila
Sheila M. Morrissey
Senior Researcher
ITHAKA
100 Campus Drive
Suite 100
Princeton NJ 08540
609-986-2221
sheila.morrissey at ithaka.org
ITHAKA (www.ithaka.org) is a not-for-profit organization that helps the academic community use digital technologies to preserve the scholarly record and to advance research and teaching in sustainable ways. We provide innovative services that benefit higher education, including Ithaka S+R, JSTOR, and Portico.
-----Original Message-----
From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Tim Walsh
Sent: Friday, May 12, 2017 10:16 AM
To: pasig-discuss at asis.org
Subject: [Pasig-discuss] Digital repository storage benchmarking
Dear PASIG,
I am currently in the process of benchmarking digital repository storage setups with our Director of IT, and am having trouble finding very much information about other institutions’ configurations online. It’s very possible that this question has been asked before on-list, but I wasn’t able to find anything in the list archives.
For context, we are a research museum with significant born-digital archival holdings preparing to manage about 200 TB of digital objects over the next 3 years, replicated several times on various media. The question is what precisely those “various media” will be. Currently, our plan is to store one copy on disk on-site, one copy on disk in a managed off-site facility, and a third copy on LTO sent to a third facility. Before we commit, we’d like to benchmark our plans against other institutions.
I have been able to find information about the storage configurations for MoMA and the Computer History Museum (who each wrote blog posts or presented on this topic), but not very many others. So my questions are:
* Could you point me to published/available resources outlining other institutions’ digital repository storage configurations?
* Or, if you work at an institution, would you be willing to share the details of your configuration on- or off-list? (any information sent off-list will be kept strictly confidential)
Helpful details would include: the volume of digital objects being stored; how many copies of the data are being stored; which copies are online, nearline, or offline; which media are being used for which copies; and which services/software applications you are using to manage the creation and maintenance of backups.
Thank you!
Tim
- - -
Tim Walsh
Archiviste, Archives numériques
Archivist, Digital Archives
Centre Canadien d’Architecture
Canadian Centre for Architecture
1920, rue Baile, Montréal, Québec H3H 2S6
T 514 939 7001 x 1532
F 514 939 7020
www.cca.qc.ca
Please consider the environment before printing this email.
This email may contain confidential information. If you are not the intended recipient, please advise us immediately and delete this email as well as any other copy. Thank you.
----
To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss
_______
PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html
_______________________________________________
Pasig-discuss mailing list
Pasig-discuss at mail.asis.org
http://mail.asis.org/mailman/listinfo/pasig-discuss