[Pasig-discuss] Digital repository storage benchmarking
Jake Carroll
jake.carroll at uq.edu.au
Sun May 14 23:01:19 EDT 2017
Certainly interesting.
At the Queensland Brain Institute and the Australian Institute of Bioengineering and Nanotechnology at the University of Queensland, we have around 8.5PB of data under management across our HSM platforms. We currently use Oracle HSM for this task.
We have 256TB of online “cache” as the data landing location, split across 6 different filesystems that are tuned differently for different types of workloads and tasks. These workloads generally fall into a few categories:
• High IO, large serial writes from instruments
• Low IO, large serial writes from instruments
• High IO, granular “many files, many IOPS” from instruments and computational factors
• Low IO, granular “many files, low IOPS” from instruments and computational factors
• Generic group share
• Generic user dir
It is an interesting platform to manage and to model statistically, in terms of performance analysis and micro-benchmarking of data movement patterns. All the filesystems above are provisioned on 16Gbit/sec FC-connected Hitachi HUS-VM, 10K SAS.
The metadata for these filesystems is around 10 terabytes of Hitachi Accelerated Advanced Flash storage. We have around 3.8 billion files/unique objects under management.
We run a “disk based copy” (we call that copy1) which is our disk based VSN or vault. It is around 1PB of ZFS managed storage sitting inside the very large Hitachi HUS-VM platform.
Our copy2 and copy3 are two T10000D Oracle tape media copies in SL3000 storage silos, geographically distributed.
We do some interesting things with our tape infrastructure, including DIV-always-on, proactive data protection sweeps inside the HSM and continuous validation checks against the media. We also run STA (tape analytics tools) extra-data-path so we can see *exactly* what each drive is doing at all times. Believe me, we see things that would baffle and boggle the mind (and probably create a healthy sense of paranoia!) if you knew exactly what was going on “inside there”.
We use finely tuned policy to automate data movement between tiers so as to minimise impact on user experience. Our HSM supports offline file mapping to Windows clients, so people can tell when their files and objects are “offline”. It is a useful semantic and a real usability win.
We ZFS scrub the disk copy for “always-on” disk consistency, and we run tpverify against the tape media to continually check the media itself. We’re also about to experiment with implementing fixity checking, as the filesystem supports it.
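For anyone unfamiliar with the term, a fixity check records a cryptographic digest for each object at ingest, then periodically recomputes and compares it. A minimal sketch of that loop in Python (the manifest format and function names here are illustrative, not anyone's actual tooling):

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream the file in 1 MiB chunks so large objects don't exhaust memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def fixity_sweep(manifest: dict[str, str], root: Path) -> list[str]:
    """manifest maps relative path -> digest recorded at ingest.
    Return the paths that are missing or whose current digest differs."""
    failures = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.exists() or sha256_of(target) != expected:
            failures.append(rel_path)
    return failures
```

In production such a sweep would be throttled and scheduled so recalls from tape don't swamp the drives; the point here is only the record-recompute-compare loop.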
As for going “all online”: at our scale, we just can’t yet afford to walk away from “cold tape” principles. We’re just too big. We’d love to rid ourselves of the complexities and consider a full cloud-based consumption model, but having crunched the very hard numbers on services such as AWS Glacier and S3, they remain a long (long) way more expensive than the relative TCO of running it on premises at this stage. My hope is that this will change soon, so I can start experimenting with one of my copies being a “cloud library”.
Interesting thread, this…
-jc
On 15/5/17, 11:41 am, "Pasig-discuss on behalf of BUNTON, GLENN" <pasig-discuss-bounces at asis.org on behalf of BUNTONGA at mailbox.sc.edu> wrote:
This discussion of the various digital repository storage approaches has been very enlightening and useful so far. I appreciate all the excellent details. There is one piece of information, however, that is missing. Cost? Both initial implementation outlay and ongoing costs. Any general sense of costs would be greatly appreciated.
-----Original Message-----
From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Steve Knight
Sent: Sunday, May 14, 2017 6:44 PM
To: 'Sheila Morrissey' <Sheila.Morrissey at ithaka.org>; pasig-discuss at asis.org
Subject: Re: [Pasig-discuss] Digital repository storage benchmarking
Hi Tim
At the National Library of New Zealand, we are storing about 210TB of digital objects in our permanent repository.
We have a 25TB online cache, with an online copy of all the digital objects sitting on disk.
Three tape copies of the objects are made as soon as they enter the disk archive: one copy remains within the tape library (nearline), and the other two copies are sent offsite (offline). We use Oracle SAM-QFS to manage the storage policies and automatic tiering.
We have a similar treatment for our 100TB of Test data, which has 1 less offsite tape copy.
We are currently looking at replacing this storage architecture with a mix of Hitachi's HDI and HCP S30 object storage products and our cloud provider's object storage offering. The cloud provider storage includes replication across 3 geographic locations providing both higher availability and higher resilience than we currently have.
By moving to an all online solution we hope to increase overall performance and make savings through utilising object storage and exiting some services related to current backup and restore processes.
Regards
Steve
-----Original Message-----
From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Sheila Morrissey
Sent: Saturday, 13 May 2017 5:44 a.m.
To: pasig-discuss at asis.org
Subject: [Pasig-discuss] FW: Digital repository storage benchmarking
Hello, Tim,
At Portico (http://www.portico.org/digital-preservation/), we preserve e-journals, e-books, digitized historical collections, and other born-digital scholarly content.
Currently, the Portico archive comprises roughly 77.7 million digital objects (we call them "Archival Units", or AUs), totalling over 400 TB across 1.3 billion files.
We maintain 3 copies of the archive: 2 on disk in geographically distributed data centers, and a 3rd copy in commercial cloud storage. We create and maintain backups (including fixity checks) using our own custom-written software.
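As a rough illustration of the kind of cross-copy check such software can perform (the function and copy names below are hypothetical, not Portico's actual code): with three copies, per-object digests can be majority-voted to single out one corrupted replica.

```python
from collections import Counter

def suspect_copies(digests: dict[str, str]) -> list[str]:
    """Given {copy name: digest} for one archival unit, return the
    copies whose digest differs from the majority value. With three
    copies, a single corrupted replica is outvoted 2-to-1; if no
    digest reaches a majority, every copy is returned for manual
    re-verification."""
    counts = Counter(digests.values())
    majority_digest, votes = counts.most_common(1)[0]
    if votes <= len(digests) // 2:
        return sorted(digests)  # no majority: check everything
    return [name for name, d in digests.items() if d != majority_digest]
```

Two healthy copies outvote one bad one; if all three disagree, there is no majority and nothing can be repaired automatically.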
I hope this is helpful.
Best regards,
Sheila
Sheila M. Morrissey
Senior Researcher
ITHAKA
100 Campus Drive
Suite 100
Princeton NJ 08540
609-986-2221
sheila.morrissey at ithaka.org
ITHAKA (www.ithaka.org) is a not-for-profit organization that helps the academic community use digital technologies to preserve the scholarly record and to advance research and teaching in sustainable ways. We provide innovative services that benefit higher education, including Ithaka S+R, JSTOR, and Portico.
-----Original Message-----
From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Tim Walsh
Sent: Friday, May 12, 2017 10:16 AM
To: pasig-discuss at asis.org
Subject: [Pasig-discuss] Digital repository storage benchmarking
Dear PASIG,
I am currently in the process of benchmarking digital repository storage setups with our Director of IT, and am having trouble finding very much information about other institutions’ configurations online. It’s very possible that this question has been asked before on-list, but I wasn’t able to find anything in the list archives.
For context, we are a research museum with significant born-digital archival holdings preparing to manage about 200 TB of digital objects over the next 3 years, replicated several times on various media. The question is what precisely those “various media” will be. Currently, our plan is to store one copy on disk on-site, one copy on disk in a managed off-site facility, and a third copy on LTO sent to a third facility. Before we commit, we’d like to benchmark our plans against other institutions.
I have been able to find information about the storage configurations for MoMA and the Computer History Museum (who each wrote blog posts or presented on this topic), but not very many others. So my questions are:
* Could you point me to published/available resources outlining other institutions’ digital repository storage configurations?
* Or, if you work at an institution, would you be willing to share the details of your configuration on- or off-list? (any information sent off-list will be kept strictly confidential)
Helpful details would include: the volume of digital objects being stored; how many copies of the data are stored; which copies are online, nearline, or offline; which media are used for which copies; and which services/software applications you use to manage the creation and maintenance of backups.
Thank you!
Tim
- - -
Tim Walsh
Archiviste, Archives numériques
Archivist, Digital Archives
Centre Canadien d’Architecture
Canadian Centre for Architecture
1920, rue Baile, Montréal, Québec H3H 2S6
T 514 939 7001 x 1532
F 514 939 7020
www.cca.qc.ca
----
To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss
_______
PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html
_______________________________________________
Pasig-discuss mailing list
Pasig-discuss at mail.asis.org
http://mail.asis.org/mailman/listinfo/pasig-discuss