From dorothy.waugh at emory.edu Mon May 1 09:30:08 2017
From: dorothy.waugh at emory.edu (Waugh, Dorothy F.)
Date: Mon, 1 May 2017 13:30:08 +0000
Subject: [Pasig-discuss] The Archivist's Guide to KryoFlux
Message-ID: <512DD020-9827-421E-9D9E-FBEBC79A0E16@emory.edu>

(With apologies for cross-posting)

An initial draft of The Archivist's Guide to KryoFlux is now open for comment and review at goo.gl/ZZxxAJ.

The Archivist's Guide to KryoFlux aims to provide a helpful resource for practitioners working with floppy disks in an archival context. This DRAFT of the Guide will remain open for comments from the digital archives community from May 1 through November 1, 2017. Once revisions have been incorporated, a version of the document will be freely available on GitHub.

Whether you already use a KryoFlux at your institution or are considering purchasing one, please take a look at the guide, put it to the test, and give us your feedback! You can either add your comments to the guide itself or send an email to archivistsguidetokryoflux at gmail.com. Your feedback will be enormously helpful as we go through an additional round of revisions in late 2017, so please, please do get in touch if you have any comments or questions.

With thanks,
The Archivist's Guide to KryoFlux working group

Dorothy Waugh
Digital Archivist
Stuart A. Rose Manuscript, Archives, and Rare Book Library
Emory University
540 Asbury Circle
Atlanta, GA 30322-2870
Tel: (404) 727.2471
Email: dorothy.waugh at emory.edu

"The Stuart A. Rose Manuscript, Archives, & Rare Book Library collects and connects stories of human experience, promotes access and learning, and offers opportunities for dialogue for all wise hearts who seek knowledge."

Read the Rose Library blog: https://scholarblogs.emory.edu/marbl/
Like the Rose Library on Facebook: https://www.facebook.com/emorymarbl
Follow the Rose Library on Twitter: https://twitter.com/EmoryMARBL

________________________________
This e-mail message (including any attachments) is for the sole use of the intended recipient(s) and may contain confidential and privileged information. If the reader of this message is not the intended recipient, you are hereby notified that any dissemination, distribution or copying of this message (including any attachments) is strictly prohibited. If you have received this message in error, please contact the sender by reply e-mail message and destroy all copies of the original message (including attachments).
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: rose_signature[CVcI][5].png
Type: image/png
Size: 12836 bytes
Desc: rose_signature[CVcI][5].png
URL:

From Inge.Angevaare at KB.nl Mon May 1 10:22:47 2017
From: Inge.Angevaare at KB.nl (Inge Angevaare)
Date: Mon, 1 May 2017 14:22:47 +0000
Subject: [Pasig-discuss] unsubscribe
Message-ID: <887EC26076B8864CB585EC40E46385D854AA87A8@MBX-SRV-P100.wpakb.kb.nl>

Inge Angevaare
Managing editor, www.kb.nl and http://bibliotheekenbasisvaardigheden.nl
Marketing and Services
Koninklijke Bibliotheek
T 06 11776725
E inge.angevaare at kb.nl
From sean.killen at oracle.com Mon May 1 10:52:12 2017
From: sean.killen at oracle.com (Sean Killen)
Date: Mon, 1 May 2017 10:52:12 -0400
Subject: [Pasig-discuss] unsubscribe
In-Reply-To: <887EC26076B8864CB585EC40E46385D854AA87A8@MBX-SRV-P100.wpakb.kb.nl>
References: <887EC26076B8864CB585EC40E46385D854AA87A8@MBX-SRV-P100.wpakb.kb.nl>
Message-ID:

Unsubscribe

Please pardon the typos. I sent from an iPhone.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From sserbicki at ea.com Mon May 1 19:35:40 2017
From: sserbicki at ea.com (Serbicki, Stefan)
Date: Mon, 1 May 2017 23:35:40 +0000
Subject: [Pasig-discuss] The Archivist's Guide to KryoFlux
In-Reply-To: <512DD020-9827-421E-9D9E-FBEBC79A0E16@emory.edu>
References: <512DD020-9827-421E-9D9E-FBEBC79A0E16@emory.edu>
Message-ID:

At Electronic Arts, we used KryoFlux boards to recover data from approx. six thousand 3.5" and 5.25" floppies dating back to the 80s and early 90s. We had a ~95% success rate, which was quite astounding considering that a good portion of the media had exceeded its theoretical lifetime of 25-30 years.
Getting the data off the disks was only part of the overall project. Our final goal was to obtain "loose" files that could be read or executed. As several of the datasets consisted of backups in various formats, for various platforms, made with obsolete software, considerable work had to be done after achieving a successful KryoFlux extraction.

In fact, our work is ongoing. Currently we are focusing on restoring backups made with Fastback 2.0. We have managed to do this successfully for two titles: F-22 Interceptor and LHX Attack Chopper. The original backups were split into parts and stored on 5.25" floppies. We used virtual machines to recreate the original environment in which the backups were made. The final output yielded Betas and the game source code.

I'll be happy to write a paper describing the steps we took from beginning to end to recover that data if there is interest.

------------------
Stefan Serbicki
Technical Lead - IP Preservation
Electronic Arts
209 Redwood Shores Parkway
Redwood City, CA 94065
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.png
Type: image/png
Size: 12836 bytes
Desc: image001.png
URL:

From j.meyerson at austin.utexas.edu Mon May 1 21:01:34 2017
From: j.meyerson at austin.utexas.edu (Meyerson, Jessica W)
Date: Tue, 2 May 2017 01:01:34 +0000
Subject: [Pasig-discuss] From soup to nuts (or to a continuum of meaningful reuse): Kryoflux data triage to emulated access
Message-ID:

Huge thanks to Dorothy and all of the members of the Archivist's Guide to KryoFlux Working Group for this awesome contribution to the preservation community!

And Stefan - your offer to write up next steps towards a mountable, executable object would be useful to a broad audience, to be sure: things you tested that failed, how you discerned the appropriate disktype wrapper to make a mountable image, and the components of the emulated environment necessary to ultimately provide access.

Best,
Jessica

Jessica Meyerson, MSIS, CA
Digital Archivist
Briscoe Center for American History
The University of Texas at Austin
2300 Red River St. Stop D1100
Austin TX, 78712-1426
(512) 495-4405
j.meyerson at austin.utexas.edu
http://www.cah.utexas.edu/
http://www.softwarepreservationnetwork.org/
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From Walker.Sampson at Colorado.EDU Tue May 2 17:16:26 2017
From: Walker.Sampson at Colorado.EDU (Walker Sampson)
Date: Tue, 2 May 2017 21:16:26 +0000
Subject: [Pasig-discuss] The Archivist's Guide to KryoFlux
Message-ID:

Hi Stefan,

I'd certainly be interested in that paper. Particularly, what you all decided to do with that ~5% of floppies that were not successful. Just try again? Setting more retries in the KryoFlux software? Cleaning the platter, swapping drives, recalibration, cleaning the drive head, etc.? I would venture users are interested in the troubleshooting steps there. Also, are you keeping the raw track data KryoFlux makes?

Regardless, happy to hear it's been a successful project.

All best,

Walker Sampson
Digital Archivist, MSIS, CA
Special Collections and Archives
University of Colorado Boulder
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.png
Type: image/png
Size: 12837 bytes
Desc: image001.png
URL:

From dwilcox at duraspace.org Thu May 4 10:09:23 2017
From: dwilcox at duraspace.org (David Wilcox)
Date: Thu, 4 May 2017 09:09:23 -0500
Subject: [Pasig-discuss] JOIN US at Fedora Camp in Texas
Message-ID:

You are invited to join experienced trainers and Fedora gurus at Fedora Camp, to be held October 16-18 at the Perry-Castañeda Library at the University of Texas, Austin. Fedora is the robust, modular, open source repository platform for the management and dissemination of digital content.
Fedora 4, the latest production version of Fedora, features vast improvements in scalability, linked data capabilities, research data support, modularity, ease of use and more. Fedora Camp offers everyone a chance to dive in and learn all about Fedora. Training will begin with the basics and build toward more advanced concepts; no prior Fedora experience is required. Participants can expect to come away with a deep-dive Fedora learning experience coupled with multiple opportunities for applying hands-on techniques.

Previous Fedora Camps include the inaugural camp held at Duke University, the West Coast camp at Caltech, and the most recent, the NYC camp held at Columbia University. Betsy Coles, Caltech Library Services and Fedora Camp attendee, said, "The material covered was comprehensive, which I needed. It was also pitched at an appropriate level for me. I was able to keep up with the hands-on exercises without becoming completely befuddled. I thought the organization of the material and the hands-on exercises were very well done. I also appreciated the chance to interact with others with both similar and different interests."

The camp curriculum provides a comprehensive overview of Fedora 4 by exploring such topics as:
- Core & Integrated features
- Data modeling and linked data
- Hydra and Islandora
- Migrating to Fedora 4
- Deploying Fedora 4 in production
- Preservation Services

A knowledgeable team of instructors from the Fedora community will lead you through the curriculum: David Wilcox - Fedora Product Manager, Andrew Woods - Fedora Technical Lead, Bethany Seeger - Amherst College, Aaron Birkland - Johns Hopkins University, Mike Durbin - University of Virginia.

View the detailed agenda. Register here. Please note that the early bird discount will be offered until August 14, and that accommodations are available at a discounted rate.

--
David Wilcox
Fedora Product Manager
DuraSpace
dwilcox at duraspace.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From debra.weiss at colorado.edu Mon May 8 09:09:04 2017
From: debra.weiss at colorado.edu (Debra Weiss)
Date: Mon, 8 May 2017 13:09:04 +0000
Subject: [Pasig-discuss] Job Opening: Digital Library Software Architect at University of Colorado Boulder
Message-ID:

The University of Colorado Boulder is seeking applicants for the position of digital library software architect to support University Libraries software applications and digital initiatives. This is a permanent full-time position. For the complete posting with information on how to apply, please see: https://cu.taleo.net/careersection/jobdetail.ftl?job=09351&lang=en

Debra Weiss
Director of Libraries Information Technology
184 UCB
University of Colorado Boulder Libraries
Boulder, CO 80309
303-492-3965
http://www.colorado.edu/libraries/
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image002.jpg
Type: image/jpeg
Size: 2229 bytes
Desc: image002.jpg
URL:

From arthurpasquinelli at gmail.com Wed May 10 15:20:30 2017
From: arthurpasquinelli at gmail.com (Arthur Pasquinelli)
Date: Wed, 10 May 2017 12:20:30 -0700
Subject: [Pasig-discuss] 11th Annual Creative Storage Conference - Special Offer
Message-ID: <74080b9e-1050-7824-651b-a320b061d89c@gmail.com>

On May 24, 2017 at the DoubleTree Hotel in Culver City, CA, the 11th annual Creative Storage Conference will explore every aspect of digital storage and rich media (www.creativestorage.org). This includes discussions of digital archiving and preservation. If you are interested in attending, we would like to offer you a $150 discount off of early registration using this link: https://cs2017.eventbrite.com?discount=onefiftyoff37168524

If you like the Southern California area, I hope that you can join us.

Thomas Coughlin
Coughlin Associates
408-202-5098
tom at tomcoughlin.com
www.tomcoughlin.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From Tim.Gollins at nrscotland.gov.uk Fri May 12 07:33:34 2017
From: Tim.Gollins at nrscotland.gov.uk (Tim.Gollins at nrscotland.gov.uk)
Date: Fri, 12 May 2017 11:33:34 +0000
Subject: [Pasig-discuss] WORM (Write Once Read Many) AIPs
Message-ID:

Dear PASIG

I have been thinking recently about the challenge of managing "physical" AIPs on offline or nearline storage, and how to optimise or simplify the use of managed storage media in a tape-based (robotic) Hierarchical Storage Management (HSM) system. By "physical" AIPs I mean that the actual structure of the AIP written to the storage system is sufficiently self-describing that even if the management or other elements of a DP system were to be lost to a disaster, the entire collection could be fully reinstated reliably from the stored AIPs alone.

I have also been thinking about the huge benefits of adopting the concepts of "Minimal Ingest" (MI) and "Autonomous Preservation Tools" (APT) in a new Digital Archive solution.

One of the potential effects of the MI and APT concepts is that, over time, while (of course) the original bit streams will never need to be updated, the metadata packaged in the AIP will need to change relatively often (through the life of the AIP). This is of course in addition to any new renderings of the bit streams produced for preservation purposes (manifestations, as termed in some systems).

If updating the AIP involves the AIP being "loaded", "modified" and "stored" again as a whole, then this will result in significant "churn" of the offline or nearline media (i.e. tapes) in an HSM - which I would like to avoid. I think it would be really great if the AIP representation could accommodate the concept of an "update IP" (perhaps UIP?) where the UIP contains a "delta" of the original AIP - the full AIP then being interpreted as the original as modified by a series of deltas. This would effectively result in AIPs (and UIPs) becoming WORM objects, with clear benefits that I perceive in managing their reliable and safe storage.

I am not sufficiently familiar with the detail of all the different AIP models or implementations, so I was wondering if anyone in the team would be able to comment on whether they know of any AIP models, specifications or implementations that would support such a use case.

I have just posted a version of this question to the E-ARK LinkedIn group, so my apologies to those who see it twice.
Many thanks

Tim

Tim Gollins | Head of Digital Archiving and Director of the NRS Digital Preservation Programme
National Records of Scotland | West Register House | Edinburgh EH2 4DF
+ 44 (0)131 535 1431 / + 44 (0)7974 922614 | tim.gollins at nrscotland.gov.uk | www.nrscotland.gov.uk
Preserving the past | Recording the present | Informing the future
Follow us on Twitter: @NatRecordsScot | http://twitter.com/NatRecordsScot

**********************************************************************
This e-mail (and any files or other attachments transmitted with it) is intended solely for the attention of the addressee(s). Unauthorised use, disclosure, storage, copying or distribution of any part of this e-mail is not permitted. If you are not the intended recipient please destroy the email, remove any copies from your system and inform the sender immediately by return.

Communications with the Scottish Government may be monitored or recorded in order to secure the effective operation of the system and for other lawful purposes. The views or opinions contained within this e-mail may not necessarily reflect those of the Scottish Government.
**********************************************************************

From neil at jefferies.org Fri May 12 08:05:42 2017
From: neil at jefferies.org (Neil Jefferies)
Date: Fri, 12 May 2017 13:05:42 +0100
Subject: Re: [Pasig-discuss] WORM (Write Once Read Many) AIPs
In-Reply-To:
References:
Message-ID: <7290d680d2d83ae9c5d4a88371bb6147@imap.plus.net>

Tim,

If we store AIPs unpackaged, as a collection of files in a folder, then object updates could just be a new folder with symlinks to the unchanged parts and the updated parts in place in the folder.

The object "location" would be a parent folder for all these version folders - for example, a pairtree (or triple-tree for faster scanning/rebuilds) based on object UUID. Version folders would be named according to date or version number (date might make Memento-compliant access simpler). Creating a new version clones the current version (including links) with a new name and then replaces the updated parts in situ. The final act is to update a "current" symlink in the object. Any update failure will mean "current" is not updated and the partial clone can be discarded.

This assumes most updates are metadata and that a diff won't save much compared to a complete new XML file or whatever. I am also assuming that metadata won't be wrappered either (so you can forget METS), so that different types are stored in the most suitable format and are accessed only when required. The problems with round-tripping packaged AIPs for updates rather than diff-ing are repeated by METS wrappering.

These may be a virtual folder/filesystem presentation, and underneath an HSM would retrieve files from wherever when it is actually accessed.
HSM policy in something like SAM-QFS/Versity/Cray TAS can ensure folders are kept intact when moved to other storage (we could even dereference symlinks when dealing with tape). This can be done with a POSIX filesystem and not much code - Ben O'Steen started something along these lines here: https://github.com/dataflow/RDFDatabank/wiki/What-is-DataBank-and-what-does-it-do%3F

Fedora also has a versioning object store that could support this kind of model, but it also adds a fair bit of complexity to be Linked Data Platform compliant.

In my parlance I would probably equate "Minimal Ingest" with "Sheer Curation" and APT with Asynchronous Message Driven Workers.

Neil
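To make the folder-per-version approach above easier to picture, here is a minimal Python sketch of the clone-then-swap update it describes. It is an illustration only, not code from Fedora, DataBank or any other system mentioned in this thread: the object-root layout (for instance a pairtree path derived from the object UUID), the v0001-style version-folder names, the "current" symlink and the create_new_version function are all assumptions made for the example.

    # Illustrative sketch (assumptions as noted above): each new version is a
    # folder that symlinks unchanged files from the previous version, carries
    # the updated files itself, and is published by atomically repointing a
    # "current" symlink.
    import os
    import shutil
    from pathlib import Path

    def create_new_version(object_root, updated_files):
        """updated_files maps relative paths within the AIP to local files
        holding the new content. Returns the new version folder."""
        root = Path(object_root)
        current = root / "current"
        prev = current.resolve() if current.is_symlink() else None

        existing = [p for p in root.iterdir()
                    if p.is_dir() and p.name.startswith("v")]
        new = root / "v{:04d}".format(len(existing) + 1)
        new.mkdir(parents=True)

        # Clone the previous version: directories are recreated, unchanged
        # files become symlinks back to the previous version's copies.
        if prev is not None:
            for item in prev.rglob("*"):
                rel = item.relative_to(prev)
                if item.is_dir():
                    (new / rel).mkdir(parents=True, exist_ok=True)
                elif str(rel) not in updated_files:
                    (new / rel).parent.mkdir(parents=True, exist_ok=True)
                    os.symlink(item, new / rel)

        # Add the updated or new files in place.
        for rel, src in updated_files.items():
            dest = new / rel
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest)

        # Final act: swap the "current" symlink atomically. If anything above
        # failed, "current" still points at the old version and the partial
        # clone can simply be deleted.
        tmp = root / "current.tmp"
        if tmp.is_symlink() or tmp.exists():
            tmp.unlink()
        os.symlink(new.name, tmp)
        os.replace(tmp, current)
        return new

Because the publish step is a single rename of the "current" symlink, a reader never sees a half-built version, and completed version folders are never modified again once written - which is what lets them behave as WORM objects on tape-backed storage.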
> > Many thanks > > Tim > Tim Gollins | Head of Digital Archiving and Director of the NRS > Digital Preservation Programme > National Records of Scotland | West Register House | Edinburgh EH2 4DF > + 44 (0)131 535 1431 / + 44 (0)7974 922614 | > tim.gollins at nrscotland.gov.uk | www.nrscotland.gov.uk > > Preserving the past | Recording the present | Informing the future > Follow us on Twitter: @NatRecordsScot | > http://twitter.com/NatRecordsScot > > > ********************************************************************** > This e-mail (and any files or other attachments transmitted with it) > is intended solely for the attention of the addressee(s). Unauthorised > use, disclosure, storage, copying or distribution of any part of this > e-mail is not permitted. If you are not the intended recipient please > destroy the email, remove any copies from your system and inform the > sender immediately by return. > > Communications with the Scottish Government may be monitored or > recorded in order to secure the effective operation of the system and > for other lawful purposes. The views or opinions contained within this > e-mail may not necessarily reflect those of the Scottish Government. > > > Tha am post-d seo (agus faidhle neo ceanglan c?mhla ris) dhan neach > neo luchd-ainmichte a-mh?in. Chan eil e ceadaichte a chleachdadh ann > an d?igh sam bith, a? toirt a-steach c?raichean, foillseachadh neo > sgaoileadh, gun chead. Ma ?s e is gun d?fhuair sibh seo le gun > fhiosd?, bu choir cur ?s dhan phost-d agus lethbhreac sam bith air an > t-siostam agaibh, leig fios chun neach a sgaoil am post-d gun d?il. > > Dh?fhaodadh gum bi teachdaireachd sam bith bho Riaghaltas na h-Alba > air a chl?radh neo air a sgr?dadh airson dearbhadh gu bheil an siostam > ag obair gu h-?ifeachdach neo airson adhbhar laghail eile. Dh?fhaodadh > nach eil beachdan anns a? phost-d seo co-ionann ri beachdan > Riaghaltas na h-Alba. > ********************************************************************** > > > > ---- > To subscribe, unsubscribe, or modify your subscription, please visit > http://mail.asis.org/mailman/listinfo/pasig-discuss > _______ > PASIG Webinars and conference material is at > http://www.preservationandarchivingsig.org/index.html > _______________________________________________ > Pasig-discuss mailing list > Pasig-discuss at mail.asis.org > http://mail.asis.org/mailman/listinfo/pasig-discuss From Tim.Gollins at nrscotland.gov.uk Fri May 12 08:18:10 2017 From: Tim.Gollins at nrscotland.gov.uk (Tim.Gollins at nrscotland.gov.uk) Date: Fri, 12 May 2017 12:18:10 +0000 Subject: [Pasig-discuss] WORM (Write Once Read Many) AIPs In-Reply-To: <7290d680d2d83ae9c5d4a88371bb6147@imap.plus.net> References: <7290d680d2d83ae9c5d4a88371bb6147@imap.plus.net> Message-ID: Hi Neil Brilliant - Most helpful and thought provoking. The fact that Fedora has the idea of a versioning Object store is particularly interesting. I think there are a couple of distinctions between Minimal Ingest and Sheer Curation but (from a quick glance at Google articles) they are appear very closely related. I think APT uses something like Asynchronous Message Driven Workers. Very many thanks indeed, especially for such a swift an comprehensive response. 
Tim

Tim Gollins | Head of Digital Archiving and Director of the NRS Digital Preservation Programme
National Records of Scotland | West Register House | Edinburgh EH2 4DF
+ 44 (0)131 535 1431 / + 44 (0)7974 922614 | tim.gollins at nrscotland.gov.uk | www.nrscotland.gov.uk
Preserving the past | Recording the present | Informing the future
Follow us on Twitter: @NatRecordsScot | http://twitter.com/NatRecordsScot
From jfarmer at cambridgecomputer.com Fri May 12 09:33:11 2017
From: jfarmer at cambridgecomputer.com (Jacob Farmer)
Date: Fri, 12 May 2017 09:33:11 -0400
Subject: Re: [Pasig-discuss] WORM (Write Once Read Many) AIPs
In-Reply-To: <7290d680d2d83ae9c5d4a88371bb6147@imap.plus.net>
References: <7290d680d2d83ae9c5d4a88371bb6147@imap.plus.net>
Message-ID: <63d06e35b40be1c7d0ff6e5613950844@mail.gmail.com>

Two warnings and two suggestions:

Warnings:

1) Symlinks and Housekeeping -- It is a common practice to use symlinks to make versioned file collections. If you do this, you should have some kind of housekeeping process that ensures the symlinks are all working correctly. If files ever have to get migrated, symlinks can break.

2) Check with your file system vendor -- Most removable media file systems have some built-in limitations on the number of inodes (files) that you can have in one file system. If you generate a lot of symlinks, you might overwhelm the file system. Your vendor will know.

Suggestions:

1) Hashes for file names -- If your application software maintains a hash for each file, you might consider naming the file according to the hash. Use the first two digits for the parent directory, the next two digits for a sub-directory, and the next two digits for a further sub-directory. Then use the full hash for the file name. This turns your POSIX file system into an object store with uniquely named objects. As a safeguard, you might maintain a separate table or list that associates path names with hashes. (A short sketch of this layout follows below.)

2) Consider using hard links instead of symlinks -- You might use hard links instead of symlinks, presuming that the files are all in the same file system. You still have to watch for file count issues, but you have less housekeeping to do.

I hope that helps.

Jacob Farmer | Chief Technology Officer | Cambridge Computer | "Artists In Data Storage"
Phone 781-250-3210 | jfarmer at CambridgeComputer.com | www.CambridgeComputer.com
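As a concrete illustration of the hash-based naming in suggestion 1 above, here is a short Python sketch. It is an assumption-laden example rather than a feature of any particular product: the store root, the side-car index file and the choice of SHA-256 are all invented for the illustration.

    # Sketch of a hash-named layout: the first three pairs of hex digits become
    # nested directories and the full hash is the file name (all names below
    # are illustrative assumptions).
    import hashlib
    import os
    import shutil

    def file_hash(path):
        """Stream the file through SHA-256 (assumed here as the hash of choice)."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                h.update(chunk)
        return h.hexdigest()

    def store_file(src_path, store_root, index_path):
        """Copy src_path into <store_root>/ab/cd/ef/<full hash> and record the
        original path against the hash as a safeguard, as suggested above."""
        digest = file_hash(src_path)
        dest_dir = os.path.join(store_root, digest[0:2], digest[2:4], digest[4:6])
        os.makedirs(dest_dir, exist_ok=True)
        dest = os.path.join(dest_dir, digest)
        if not os.path.exists(dest):
            shutil.copy2(src_path, dest)   # identical content is stored only once
        with open(index_path, "a", encoding="utf-8") as index:
            index.write(digest + "  " + src_path + "\n")
        return dest

A side effect of naming files this way is that a later fixity check only has to recompute the hash and compare it with the file name, and identical content is naturally deduplicated.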
----
To subscribe, unsubscribe, or modify your subscription, please visit
http://mail.asis.org/mailman/listinfo/pasig-discuss
_______
PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html
_______________________________________________
Pasig-discuss mailing list
Pasig-discuss at mail.asis.org
http://mail.asis.org/mailman/listinfo/pasig-discuss

From twalsh at cca.qc.ca Fri May 12 10:15:39 2017
From: twalsh at cca.qc.ca (Tim Walsh)
Date: Fri, 12 May 2017 14:15:39 +0000
Subject: [Pasig-discuss] Digital repository storage benchmarking
Message-ID: <8B597316-5049-40E0-A7C4-4F7431E69E76@cca.qc.ca>

Dear PASIG,

I am currently in the process of benchmarking digital repository storage setups with our Director of IT, and am having trouble finding very much information about other institutions' configurations online. It's very possible that this question has been asked before on-list, but I wasn't able to find anything in the list archives.

For context, we are a research museum with significant born-digital archival holdings preparing to manage about 200 TB of digital objects over the next 3 years, replicated several times on various media. The question is what precisely those "various media" will be. Currently, our plan is to store one copy on disk on-site, one copy on disk in a managed off-site facility, and a third copy on LTO sent to a third facility. Before we commit, we'd like to benchmark our plans against other institutions.

I have been able to find information about the storage configurations for MoMA and the Computer History Museum (who each wrote blog posts or presented on this topic), but not very many others. So my questions are:

* Could you point me to published/available resources outlining other institutions' digital repository storage configurations?
* Or, if you work at an institution, would you be willing to share the details of your configuration on- or off-list? (Any information sent off-list will be kept strictly confidential.)

Helpful details would include: amount of digital objects being stored; how many copies of data are being stored; which copies are online, nearline, or offline; which media are being used for which copies; and what services/software applications you are using to manage the creation and maintenance of backups.

Thank you!
Tim

- - -

Tim Walsh
Archiviste, Archives numériques
Archivist, Digital Archives

Centre Canadien d'Architecture
Canadian Centre for Architecture
1920, rue Baile, Montréal, Québec H3H 2S6
T 514 939 7001 x 1532
F 514 939 7020
www.cca.qc.ca

Please consider the environment before printing this email
This email may contain confidential information.
If you are not the intended recipient, please advise us immediately and delete this email as well as any other copy. Thank you.

From preservation.guide at gmail.com Fri May 12 10:30:10 2017
From: preservation.guide at gmail.com (Richard Wright)
Date: Fri, 12 May 2017 14:30:10 +0000
Subject: Re: [Pasig-discuss] Digital repository storage benchmarking
In-Reply-To: <8B597316-5049-40E0-A7C4-4F7431E69E76@cca.qc.ca>
References: <8B597316-5049-40E0-A7C4-4F7431E69E76@cca.qc.ca>
Message-ID:

Tim and all -- quite a few case studies in the presentations at this conference from a few years ago: http://www.digitalpreservation.gov/meetings/storage14.html
> > ---- > To subscribe, unsubscribe, or modify your subscription, please visit > http://mail.asis.org/mailman/listinfo/pasig-discuss > _______ > PASIG Webinars and conference material is at > http://www.preservationandarchivingsig.org/index.html > _______________________________________________ > Pasig-discuss mailing list > Pasig-discuss at mail.asis.org > http://mail.asis.org/mailman/listinfo/pasig-discuss > -- Regards, Richard Richard Wright +44 7724 717 981 preservationguide.co.uk -------------- next part -------------- An HTML attachment was scrubbed... URL: From Tim.Gollins at nrscotland.gov.uk Fri May 12 10:54:17 2017 From: Tim.Gollins at nrscotland.gov.uk (Tim.Gollins at nrscotland.gov.uk) Date: Fri, 12 May 2017 14:54:17 +0000 Subject: [Pasig-discuss] Digital repository storage benchmarking In-Reply-To: References: <8B597316-5049-40E0-A7C4-4F7431E69E76@cca.qc.ca> Message-ID: Time, Richard Very many thanks from me too ? the answer to this question also helps me understand more in the context of my own recent question on WORM AIPs . All the best Tim Tim Gollins | Head of Digital Archiving and Director of the NRS Digital Preservation Programme National Records of Scotland | West Register House | Edinburgh EH2 4DF + 44 (0)131 535 1431 / + 44 (0)7974 922614 | tim.gollins at nrscotland.gov.uk | www.nrscotland.gov.uk Preserving the past | Recording the present | Informing the future Follow us on Twitter: @NatRecordsScot | http://twitter.com/NatRecordsScot From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Richard Wright Sent: 12 May 2017 15:30 To: Tim Walsh; pasig-discuss at asis.org Subject: Re: [Pasig-discuss] Digital repository storage benchmarking Tim and all -- quite a few case studies in the presentations at this conference from a few years ago: http://www.digitalpreservation.gov/meetings/storage14.html On Fri, 12 May 2017 at 15:18 Tim Walsh > wrote: Dear PASIG, I am currently in the process of benchmarking digital repository storage setups with our Director of IT, and am having trouble finding very much information about other institutions? configurations online. It?s very possible that this question has been asked before on-list, but I wasn?t able to find anything in the list archives. For context, we are a research museum with significant born-digital archival holdings preparing to manage about 200 TB of digital objects over the next 3 years, replicated several times on various media. The question is what precisely those ?various media? will be. Currently, our plan is to store one copy on disk on-site, one copy on disk in a managed off-site facility, and a third copy on LTO sent to a third facility. Before we commit, we?d like to benchmark our plans against other institutions. I have been able to find information about the storage configurations for MoMA and the Computer History Museum (who each wrote blog posts or presented on this topic), but not very many others. So my questions are: * Could you point me to published/available resources outlining other institutions? digital repository storage configurations? * Or, if you work at an institution, would you be willing to share the details of your configuration on- or off-list? 
(any information sent off-list will be kept strictly confidential) Helpful details would include: amount of digital objects being stored; how many copies of data are being stored; which copies are online, nearline, or offline; which media are being used for which copies; and what services/software applications are you using to manage the creation and maintainance of backups. Thank you! Tim - - - Tim Walsh Archiviste, Archives num?riques Archivist, Digital Archives Centre Canadien d?Architecture Canadian Centre for Architecture 1920, rue Baile, Montr?al, Qu?bec H3H 2S6 T 514 939 7001 x 1532 F 514 939 7020 www.cca.qc.ca Pensez ? l?environnement avant d?imprimer ce message Please consider the environment before printing this email Ce courriel peut contenir des renseignements confidentiels. Si vous n??tes pas le destinataire pr?vu, veuillez nous en aviser imm?diatement. Merci ?galement de supprimer le pr?sent courriel et d?en d?truire toute copie. This email may contain confidential information. If you are not the intended recipient, please advise us immediately and delete this email as well as any other copy. Thank you. ---- To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss _______ PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html _______________________________________________ Pasig-discuss mailing list Pasig-discuss at mail.asis.org http://mail.asis.org/mailman/listinfo/pasig-discuss -- Regards, Richard Richard Wright +44 7724 717 981 preservationguide.co.uk ______________________________________________________________________ This email has been scanned by the Symantec Email Security.cloud service. For more information please visit http://www.symanteccloud.com ______________________________________________________________________ *********************************** ******************************** This email has been received from an external party and has been swept for the presence of computer viruses. ******************************************************************** ********************************************************************** This e-mail (and any files or other attachments transmitted with it) is intended solely for the attention of the addressee(s). Unauthorised use, disclosure, storage, copying or distribution of any part of this e-mail is not permitted. If you are not the intended recipient please destroy the email, remove any copies from your system and inform the sender immediately by return. Communications with the Scottish Government may be monitored or recorded in order to secure the effective operation of the system and for other lawful purposes. The views or opinions contained within this e-mail may not necessarily reflect those of the Scottish Government. Tha am post-d seo (agus faidhle neo ceanglan c?mhla ris) dhan neach neo luchd-ainmichte a-mh?in. Chan eil e ceadaichte a chleachdadh ann an d?igh sam bith, a? toirt a-steach c?raichean, foillseachadh neo sgaoileadh, gun chead. Ma ?s e is gun d?fhuair sibh seo le gun fhiosd?, bu choir cur ?s dhan phost-d agus lethbhreac sam bith air an t-siostam agaibh, leig fios chun neach a sgaoil am post-d gun d?il. Dh?fhaodadh gum bi teachdaireachd sam bith bho Riaghaltas na h-Alba air a chl?radh neo air a sgr?dadh airson dearbhadh gu bheil an siostam ag obair gu h-?ifeachdach neo airson adhbhar laghail eile. Dh?fhaodadh nach eil beachdan anns a? phost-d seo co-ionann ri beachdan Riaghaltas na h-Alba. 
********************************************************************** -------------- next part -------------- An HTML attachment was scrubbed... URL: From Sheila.Morrissey at ithaka.org Fri May 12 13:43:56 2017 From: Sheila.Morrissey at ithaka.org (Sheila Morrissey) Date: Fri, 12 May 2017 17:43:56 +0000 Subject: [Pasig-discuss] FW: Digital repository storage benchmarking References: <8B597316-5049-40E0-A7C4-4F7431E69E76@cca.qc.ca> Message-ID: Hello, Tim, At Portico (http://www.portico.org/digital-preservation/), we preserve e-journals, e-books, digitized historical collections, and other born-digital scholarly content. Currently, the Portico archive comprises roughly 77.7 million digital objects (we call them "Archival Units", or AUs), amounting to over 400 TB and made up of 1.3 billion files. We maintain 3 copies of the archive: 2 on disk in geographically distributed data centers, and a 3rd copy in commercial cloud storage. We create and maintain backups (including fixity checks) using our own custom-written software. I hope this is helpful. Best regards, Sheila Sheila M. Morrissey Senior Researcher ITHAKA 100 Campus Drive Suite 100 Princeton NJ 08540 609-986-2221 sheila.morrissey at ithaka.org ITHAKA (www.ithaka.org) is a not-for-profit organization that helps the academic community use digital technologies to preserve the scholarly record and to advance research and teaching in sustainable ways. We provide innovative services that benefit higher education, including Ithaka S+R, JSTOR, and Portico. -----Original Message----- From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Tim Walsh Sent: Friday, May 12, 2017 10:16 AM To: pasig-discuss at asis.org Subject: [Pasig-discuss] Digital repository storage benchmarking Dear PASIG, I am currently in the process of benchmarking digital repository storage setups with our Director of IT, and am having trouble finding very much information about other institutions' configurations online. It's very possible that this question has been asked before on-list, but I wasn't able to find anything in the list archives. For context, we are a research museum with significant born-digital archival holdings preparing to manage about 200 TB of digital objects over the next 3 years, replicated several times on various media. The question is what precisely those "various media" will be. Currently, our plan is to store one copy on disk on-site, one copy on disk in a managed off-site facility, and a third copy on LTO sent to a third facility. Before we commit, we'd like to benchmark our plans against other institutions. I have been able to find information about the storage configurations for MoMA and the Computer History Museum (who each wrote blog posts or presented on this topic), but not very many others. So my questions are: * Could you point me to published/available resources outlining other institutions' digital repository storage configurations? * Or, if you work at an institution, would you be willing to share the details of your configuration on- or off-list? (any information sent off-list will be kept strictly confidential) Helpful details would include: amount of digital objects being stored; how many copies of data are being stored; which copies are online, nearline, or offline; which media are being used for which copies; and what services/software applications are you using to manage the creation and maintenance of backups. Thank you! 
Tim - - - Tim Walsh Archiviste, Archives num?riques Archivist, Digital Archives Centre Canadien d?Architecture Canadian Centre for Architecture 1920, rue Baile, Montr?al, Qu?bec H3H 2S6 T 514 939 7001 x 1532 F 514 939 7020 www.cca.qc.ca Pensez ? l?environnement avant d?imprimer ce message Please consider the environment before printing this email Ce courriel peut contenir des renseignements confidentiels. Si vous n??tes pas le destinataire pr?vu, veuillez nous en aviser imm?diatement. Merci ?galement de supprimer le pr?sent courriel et d?en d?truire toute copie. This email may contain confidential information. If you are not the intended recipient, please advise us immediately and delete this email as well as any other copy. Thank you. ---- To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss _______ PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html _______________________________________________ Pasig-discuss mailing list Pasig-discuss at mail.asis.org http://mail.asis.org/mailman/listinfo/pasig-discuss From randy_stern at harvard.edu Fri May 12 13:58:46 2017 From: randy_stern at harvard.edu (Stern, Randy) Date: Fri, 12 May 2017 17:58:46 +0000 Subject: [Pasig-discuss] FW: Digital repository storage benchmarking Message-ID: <85DD9199-4D53-4754-8802-6C3171BD3BC4@harvard.edu> Harvard is similar ? 2 disk copies in geographically distributed sites on, and one tape copy in a third location. We also have a 4th copy on tape in a tape library that is creating the tapes we remove off site to the third location. We run fixity checks on the disk copies, but not the tape copy. We currently have in excess of 200TB for each copy. We currently store preservation and real-time access copies of files in the same storage system with the same storage policies. We expect that to change in the future, with likely delivery copy storage in the cloud. Randy On 5/12/17, 1:43 PM, "Sheila Morrissey" wrote: Hello, Tim, At Portico (http://www.portico.org/digital-preservation/), we preserve e-journals, e-books, digitized historical collections, and other born-digital scholarly content. Currently, the Portico archive is comprised of roughly 77.7 million digital objects (we call them "Archival Units", or AUs); comprising over 400 TB; made up of 1.3 billion files. We maintain 3 copies of the archive: 2 on disk in geographically distributed data centers, and a 3rd copy in commercial cloud storage. We create and maintain backups (including fixity checks) using our own custom-written software. I hope this helpful. Best regards, Sheila Sheila M. Morrissey Senior Researcher ITHAKA 100 Campus Drive Suite 100 Princeton NJ 08540 609-986-2221 sheila.morrissey at ithaka.org ITHAKA (www.ithaka.org) is a not-for-profit organization that helps the academic community use digital technologies to preserve the scholarly record and to advance research and teaching in sustainable ways. We provide innovative services that benefit higher education, including Ithaka S+R, JSTOR, and Portico. -----Original Message----- From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Tim Walsh Sent: Friday, May 12, 2017 10:16 AM To: pasig-discuss at asis.org Subject: [Pasig-discuss] Digital repository storage benchmarking Dear PASIG, I am currently in the process of benchmarking digital repository storage setups with our Director of IT, and am having trouble finding very much information about other institutions? 
configurations online. It?s very possible that this question has been asked before on-list, but I wasn?t able to find anything in the list archives. For context, we are a research museum with significant born-digital archival holdings preparing to manage about 200 TB of digital objects over the next 3 years, replicated several times on various media. The question is what precisely those ?various media? will be. Currently, our plan is to store one copy on disk on-site, one copy on disk in a managed off-site facility, and a third copy on LTO sent to a third facility. Before we commit, we?d like to benchmark our plans against other institutions. I have been able to find information about the storage configurations for MoMA and the Computer History Museum (who each wrote blog posts or presented on this topic), but not very many others. So my questions are: * Could you point me to published/available resources outlining other institutions? digital repository storage configurations? * Or, if you work at an institution, would you be willing to share the details of your configuration on- or off-list? (any information sent off-list will be kept strictly confidential) Helpful details would include: amount of digital objects being stored; how many copies of data are being stored; which copies are online, nearline, or offline; which media are being used for which copies; and what services/software applications are you using to manage the creation and maintainance of backups. Thank you! Tim - - - Tim Walsh Archiviste, Archives num?riques Archivist, Digital Archives Centre Canadien d?Architecture Canadian Centre for Architecture 1920, rue Baile, Montr?al, Qu?bec H3H 2S6 T 514 939 7001 x 1532 F 514 939 7020 www.cca.qc.ca Pensez ? l?environnement avant d?imprimer ce message Please consider the environment before printing this email Ce courriel peut contenir des renseignements confidentiels. Si vous n??tes pas le destinataire pr?vu, veuillez nous en aviser imm?diatement. Merci ?galement de supprimer le pr?sent courriel et d?en d?truire toute copie. This email may contain confidential information. If you are not the intended recipient, please advise us immediately and delete this email as well as any other copy. Thank you. ---- To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss _______ PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html _______________________________________________ Pasig-discuss mailing list Pasig-discuss at mail.asis.org http://mail.asis.org/mailman/listinfo/pasig-discuss ---- To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss _______ PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html _______________________________________________ Pasig-discuss mailing list Pasig-discuss at mail.asis.org http://mail.asis.org/mailman/listinfo/pasig-discuss From charlottekostelic at gmail.com Fri May 12 14:23:36 2017 From: charlottekostelic at gmail.com (Charlotte Kostelic) Date: Fri, 12 May 2017 19:23:36 +0100 Subject: [Pasig-discuss] Call for Proposals - NDSA 2017 Pittsburgh, PA Message-ID: The National Digital Stewardship Alliance (NDSA ) invites proposals for Digital Preservation 2017: ?Preservation is Political,? to be held in Pittsburgh, Pennsylvania, October 25-26, 2017. 
Digital Preservation is the major meeting and conference of the NDSA?open to members and non-members alike?focusing on tools, techniques, theories and methodologies for digital stewardship and preservation, data curation, the content lifecycle, and related issues. Our 2017 meeting is held in partnership with our host organization, the Digital Library Federation (DLF ). Separate calls are being issued for the DLF Liberal Arts Colleges Pre-Conference (22 October) and 2017 DLF Forum (23-24 October)?all happening in the same location. Proposals are due by May 22th at 11:59pm Pacific Time. About the NDSA and Digital Preservation 2017: The National Digital Stewardship Alliance is a consortium of more than 160 organizations committed to the long-term preservation and stewardship of digital information and cultural heritage, for the benefit of present and future generations. Digital Preservation 2017 (#digipres17 ) will help to chart future directions for both the NDSA and digital stewardship, and is expected to be a crucial venue for intellectual exchange, community-building, development of best practices, and national-level agenda-setting in the field. The conference will be held at the Westin Convention Center ?where downtown buzz meets restorative sleep?, just blocks from historic Market Square , The Andy Warhol Museum , boutiques, restaurants, and nightlife. The NDSA strives to create a safe, accessible, welcoming, and inclusive event, and will operate under the DLF Forum?s Code of Conduct . Submissions: 250-word proposals describing the presentation/demo/poster are invited (500 words for full panel sessions). Please also include a 50-word short abstract for the program if your submission is selected. Submit proposals online: https://conftool.pro/dlf2017/. Deadline: May 9th, 2017 at 11:59pm PT. We especially encourage proposals that speak to our conference theme, ?Preservation is Political.? This core theme emerged from a discussion of strategic topics, our practice, our mission and the challenges. Submissions are invited in the following lengths and formats: Talks/Demos: Presentations and demonstrations are allocated 30 minutes each. Speakers should reserve time for interactive exchanges on next steps, possible NDSA community action, and discussion or debate. Panels: Panel discussions with 4 or more speakers will be given a dedicated session. Organizers are especially encouraged to include as diverse an array of perspectives and voices as possible, and to reserve time for audience Q&A. Minute Madness: Share your ideas in 60 seconds or less as part of the opening plenary of the conference. Presenters will have the option to display posters during the reception that follows. (Guidelines for poster sizes will be provided on acceptance.) Lunchtime Working Group Meetings: NDSA working and interest group chairs are invited to propose group meetings or targeted collaboration sessions. (Lunch provided.) All submissions will be peer-reviewed by NDSA?s volunteer Program Committee. Presenters will be notified in July and guaranteed a registration slot at the conference. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From tab.butler at mlb.com Fri May 12 14:43:32 2017 From: tab.butler at mlb.com (Butler, Tab) Date: Fri, 12 May 2017 18:43:32 +0000 Subject: [Pasig-discuss] FW: Digital repository storage benchmarking In-Reply-To: <85DD9199-4D53-4754-8802-6C3171BD3BC4@harvard.edu> References: <85DD9199-4D53-4754-8802-6C3171BD3BC4@harvard.edu> Message-ID: Tim, At Major League Baseball, we are focused mostly on archiving the broadcast game video feeds, along with pregame, postgame, and individual camera iso feeds for each game. The content includes both the home and away team broadcasts, with and without graphics. Essentially, we record 7 hours plus of content for every 1 hour of baseball played. We also record and archive all the MLB Network content that is produced, which is between 12 and 18 hours of live content per day. We will archive the entire broadcast show of record, and the individual elements that make up a show. All in, we are recording over 1,000 hours of content per day. This equates to 50+ TB of content being added to our archive per day. We have both an active on-line disk tier (2 SANs - each 2.88 PB) for recording, editing, and on-line storage, and a data tape archive that supports Partial File Restore (PFR) of video files. We load balance recording content across the two SANs... American League on one SAN, and National League on the other... and all edits (96 high performance / 54 desktop machines) access both SANs. Once content is written to a SAN, it is auto-archived to tape, as per our DIAMOND asset management system (home grown). We started archiving on LTO-4 in 2008, and are currently on Oracle T10000-D. We are migrating content from LTO-4 to T10K-D tape within a tape group... We have both an 'On-Site' tape sub-group and an 'Off-Site' tape sub-group for each of our Tape Groups. Tape Groups include "Games with Graphics" (Dirty) and "Games without Graphics" (Clean)... the Dirty off-site tapes go to a separate off-site location from the Clean off-site tapes. We break up all of our Off-Site Tape Groups between two geographically distributed locations, as well. We are using the Oracle DIVArchive middleware, which computes a checksum value that is compared to the stored database value each time a file is copied, moved, or restored. We are performing between 1,000 and 2,000 PFRs / restores per day. Currently we have over 45,000 LTO-4s and over 10,000 T10K-D tapes, growing at the rate of 125,000 hours of content per year. If you would like more details regarding archiving video content, feel free to reach out to me. Sincerely, Tab Tab Butler | Sr. Director - Media Management & Post Production | MLB Network | 40 Hartz Way, Suite 10 | Secaucus, NJ 07094 (201) 520-6252 Office | (646) 498-1662 Cell tab.butler at mlb.com -----Original Message----- From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Stern, Randy Sent: Friday, May 12, 2017 1:59 PM To: Sheila Morrissey ; pasig-discuss at asis.org Subject: Re: [Pasig-discuss] FW: Digital repository storage benchmarking Harvard is similar: 2 disk copies in geographically distributed sites, and one tape copy in a third location. We also have a 4th copy on tape in a tape library that is creating the tapes we remove off site to the third location. We run fixity checks on the disk copies, but not the tape copy. We currently have in excess of 200 TB for each copy. We currently store preservation and real-time access copies of files in the same storage system with the same storage policies. 
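
Several of the configurations described in this thread (Portico, Harvard, MLB Network) rely on fixity checking: recomputing a checksum for each stored file and comparing it against a value recorded at ingest, on every copy, move, or restore. A minimal sketch of that kind of audit is below, in Python; the manifest format, the choice of SHA-256, and all of the names are illustrative assumptions, not any institution's actual software.

    import hashlib
    from pathlib import Path

    def sha256_of(path, chunk_size=1024 * 1024):
        # Stream the file so large audiovisual files are not read into memory at once.
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def audit_replica(replica_root, manifest):
        # manifest: dict of relative file path -> expected hex digest recorded at ingest.
        # Returns a list of (relative_path, problem) pairs; an empty list means the replica is clean.
        problems = []
        root = Path(replica_root)
        for rel_path, expected in manifest.items():
            target = root / rel_path
            if not target.is_file():
                problems.append((rel_path, "missing"))
            elif sha256_of(target) != expected:
                problems.append((rel_path, "checksum mismatch"))
        return problems

    # Hypothetical use against two replica mount points:
    # manifest = {"AU-0001/data/report.pdf": "9f86d081884c7d65..."}
    # for root in ("/mnt/onsite", "/mnt/offsite"):
    #     print(root, audit_replica(root, manifest))

Pointed at each replica in turn (on-site disk, off-site disk, a copy restored from tape), a loop like this is roughly the pattern that the scheduled, custom-written fixity tools described above appear to automate.
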
We expect that to change in the future, with likely delivery copy storage in the cloud. Randy On 5/12/17, 1:43 PM, "Sheila Morrissey" wrote: Hello, Tim, At Portico (http://www.portico.org/digital-preservation/), we preserve e-journals, e-books, digitized historical collections, and other born-digital scholarly content. Currently, the Portico archive is comprised of roughly 77.7 million digital objects (we call them "Archival Units", or AUs); comprising over 400 TB; made up of 1.3 billion files. We maintain 3 copies of the archive: 2 on disk in geographically distributed data centers, and a 3rd copy in commercial cloud storage. We create and maintain backups (including fixity checks) using our own custom-written software. I hope this helpful. Best regards, Sheila Sheila M. Morrissey Senior Researcher ITHAKA 100 Campus Drive Suite 100 Princeton NJ 08540 609-986-2221 sheila.morrissey at ithaka.org ITHAKA (www.ithaka.org) is a not-for-profit organization that helps the academic community use digital technologies to preserve the scholarly record and to advance research and teaching in sustainable ways. We provide innovative services that benefit higher education, including Ithaka S+R, JSTOR, and Portico. -----Original Message----- From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Tim Walsh Sent: Friday, May 12, 2017 10:16 AM To: pasig-discuss at asis.org Subject: [Pasig-discuss] Digital repository storage benchmarking Dear PASIG, I am currently in the process of benchmarking digital repository storage setups with our Director of IT, and am having trouble finding very much information about other institutions? configurations online. It?s very possible that this question has been asked before on-list, but I wasn?t able to find anything in the list archives. For context, we are a research museum with significant born-digital archival holdings preparing to manage about 200 TB of digital objects over the next 3 years, replicated several times on various media. The question is what precisely those ?various media? will be. Currently, our plan is to store one copy on disk on-site, one copy on disk in a managed off-site facility, and a third copy on LTO sent to a third facility. Before we commit, we?d like to benchmark our plans against other institutions. I have been able to find information about the storage configurations for MoMA and the Computer History Museum (who each wrote blog posts or presented on this topic), but not very many others. So my questions are: * Could you point me to published/available resources outlining other institutions? digital repository storage configurations? * Or, if you work at an institution, would you be willing to share the details of your configuration on- or off-list? (any information sent off-list will be kept strictly confidential) Helpful details would include: amount of digital objects being stored; how many copies of data are being stored; which copies are online, nearline, or offline; which media are being used for which copies; and what services/software applications are you using to manage the creation and maintainance of backups. Thank you! Tim - - - Tim Walsh Archiviste, Archives num?riques Archivist, Digital Archives Centre Canadien d?Architecture Canadian Centre for Architecture 1920, rue Baile, Montr?al, Qu?bec H3H 2S6 T 514 939 7001 x 1532 F 514 939 7020 www.cca.qc.ca Pensez ? 
l?environnement avant d?imprimer ce message Please consider the environment before printing this email Ce courriel peut contenir des renseignements confidentiels. Si vous n??tes pas le destinataire pr?vu, veuillez nous en aviser imm?diatement. Merci ?galement de supprimer le pr?sent courriel et d?en d?truire toute copie. This email may contain confidential information. If you are not the intended recipient, please advise us immediately and delete this email as well as any other copy. Thank you. ---- To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss _______ PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html _______________________________________________ Pasig-discuss mailing list Pasig-discuss at mail.asis.org http://mail.asis.org/mailman/listinfo/pasig-discuss ---- To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss _______ PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html _______________________________________________ Pasig-discuss mailing list Pasig-discuss at mail.asis.org http://mail.asis.org/mailman/listinfo/pasig-discuss ---- To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss _______ PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html _______________________________________________ Pasig-discuss mailing list Pasig-discuss at mail.asis.org http://mail.asis.org/mailman/listinfo/pasig-discuss From neil at jefferies.org Fri May 12 16:42:55 2017 From: neil at jefferies.org (Neil Jefferies) Date: Fri, 12 May 2017 21:42:55 +0100 Subject: [Pasig-discuss] WORM (Write Once Read Many) AIPs In-Reply-To: References: <7290d680d2d83ae9c5d4a88371bb6147@imap.plus.net> <63d06e35b40be1c7d0ff6e5613950844@mail.gmail.com> Message-ID: <81d799913e0c44c2d1d46d9ddd9fbd23@imap.plus.net> Jacob, This is the key point of my argument - the definition of object you have is not the definition of an object that an archive wants to preserve. I'm speaking for people like Tim and I - others are quite happy to build what I term bit-museums. Likewise, what you consider preservation (immutability of a bitstream) is not quite the same as ours - retention of knowledge content - which requires mutability but with immutable previous versions and provenance/audit records. As long as this disconnect between technology and requirements remains the case, object stores are actually of limited use for us in preservation and archiving without considerable additional work. The 'metadata' that most object stores support (key-value pairs) is pretty useless as far as our metadata requirements go - in the end we have to store XML or triples as separate files/objects. This was an issue when I reviewed the StorageTek 5800 code builds way back and frankly object storage hasn't moved on much. Fedora, for all its faults, does actually provide an object view that is meaningful - something that can be a node in a linked-data graph. It can be arbitrarily complex but equally, could comprise only metadata. It is almost never a file. Neil On 2017-05-12 20:29, Jacob Farmer wrote: > Hi, Neil. Great points. Indeed, hard links only work in a single file > system, but they continue pointing to and fro when a file is otherwise > moved > or renamed. 
> > I personally think of POSIX file systems as object stores that have > weak > addressing, limited metadata, and that offer mutability as the default. > > My preferred definition of an object store is a device that stores > objects. > My preferred definition of an object is any piece of data that can be > individually addressed and manipulated. > So, by that definition, POSIX file systems are object stores, so are > hard > drives. So is Microsoft exchange, etc. > > If you name a file according to a hash or a UUID (the hash could be the > UUID), then you have a form of persistent address. As long as no one > messes > with your file system, the address scheme stays intact. > > > -----Original Message----- > From: Neil Jefferies [mailto:neil at jefferies.org] > Sent: Friday, May 12, 2017 11:25 AM > To: Jacob Farmer > Subject: RE: [Pasig-discuss] WORM (Write Once Read Many) AIPs > > Good point on the housekeeping! > > Most (reasonable) filesystems allow you specify the inode numbers at > creation but yes, it is hard to change afterwards! > > But I would really, really avoid hard links - they only work within a > single > filesystem so they can't be used in tiered or virtual storage systems > and > even break quota controls on regular filesystems. Scale up thus becomes > very > difficult with hard links. Symlinks also make it explicit when you are > dealing with a reference and can tell you which version of the object > held > the original - useful provenance that hard links don't capture. > > My personal feeling is no for hashes, yes for UUID's (or other suitably > unique object ID). This allows us to keep all versions of an object in > the > same root path even though it varies. And don't store at a file level - > this > shotguns object fragments all over the store and make rebuilds > horrible. > Many current object stores do this - and consequently don't version > effectively - I wish people would understand objects are not files. > UUID's > are also consistent in terms of computational time and hashes very much > aren't. > > There's a big difference in robustness between needing just filesystem > metadata to find an object in storage and requiring filesystem metadata > (because underneath all object stores are filesystems - even Seagates > "object" hard drives), object store metadata to map paths to hashes, > and > object metadata to find all the bits that make up a composite object. > > ...and yes, I am saying that most object store vendors have got it > wrong. At > least as far as archiving is concerned. And they ought to consider why > every > object store ends up presenting itself as a POSIX filesystem. > > Neil > > > On 2017-05-12 14:33, Jacob Farmer wrote: >> Two warnings and two suggestions: >> >> Warnings: >> >> 1) Symlinks and Housekeeping -- It is a common practice to use >> symlinks to make versioned file collections. If you do this, you >> should have some kind of housekeeping processes that ensure that the >> symlinks are all working correctly. If files ever have to get >> migrated, symlinks can break. >> >> 2) Check with your file system vendor -- Most removable media file >> systems have some built in limitations on the number of inodes (files) >> that you can have in one file system. If you generate a lot of >> symlinks, you might overwhelm the file system. Your vendor will know. >> >> Suggestions: >> >> 1) Hashes for file names -- If your application software maintains a >> hash for each file, you might consider naming the file according to >> the hash. 
>> Use the first two digits for the parent directory, the next two digits >> for sub-diretory, the next two digits for sub-directory. Then use the >> full hash for the file name. This turns your POSIX file system into >> an object store with uniquely named objects. >> >> As a safeguard, you might maintain a separate table or list that >> associates path names with hashes. >> >> 2) Consider using hard links instead of symlinks -- You might use >> hard links instead of symlinks, presuming that the files are all in >> the same file system. You still have to watch for file count issues, >> but you have less housekeeping to do. >> >> I hope that helps. >> >> >> Jacob Farmer | Chief Technology Officer | Cambridge Computer | >> "Artists In Data Storage" >> Phone 781-250-3210 | jfarmer at CambridgeComputer.com | >> www.CambridgeComputer.com >> >> >> >> >> -----Original Message----- >> From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf >> Of Neil Jefferies >> Sent: Friday, May 12, 2017 8:06 AM >> To: Tim.Gollins at nrscotland.gov.uk >> Cc: pasig-discuss at mail.asis.org >> Subject: Re: [Pasig-discuss] WORM (Write Once Read Many) AIPs >> >> Tim, >> >> If we store AIP's unpackaged, as a collection of files in a folder, >> then object updates could just be a new folder with symlinks to the >> unchanged parts and the updated parts in place in the folder. The >> object "location" >> would be a parent folder for all these version folders - for example, >> a pairtree (or triple-tree for faster scanning/rebuilds) based on >> object UUID. >> Version folders would be named accoprding to date or version number >> (date might make Memento compliant access simpler). >> Creating anew version clones the current verion (including links) with >> a new name and then replaces the updated parts in situ. Final act is >> to update a "current" symlink in the object. Any update failure will >> mean "current" >> is >> not updated an the partial clone can be discarded. >> >> This assumes most updates are metadata and that a diff won't save much >> compared to a complete new XML file or whatever. I am also assuming >> that metadata won't be wrappered either (so you can forget METS) so >> that different types are stored in the most stuiable format and are >> accessed only when required. The problems with roundtripping packaged >> AIP's for updates rather than diff-ing are repeated by METS >> wrappering. >> >> These may be a virtual folder/filesytem presentation and underneath an >> HSM would retrieve files from wherever when it is actually accessed. >> HSM policy in soemthing like SAM-QFS/Versity/Cray TAS can ensure >> folders are kep intact when moved to other storage (we could even >> dereference symlinks when dealing with tape). >> >> This can be done with a POSIX filesystem and not muich code - Ben >> O'Steen started something along these lines here: >> https://github.com/dataflow/RDFDatabank/wiki/What-is-DataBank-and-what >> -does-it-do%3F >> >> Fedora also also a versioning object store that could support this >> kind of model but also adds a fair bit of complexity to be >> Linked-Data_platform compliant. >> >> In my paralance I would probably equate "Minimal Ingest" with "Sheer >> Curation" and APT with Asynchronous Message Driven Workers. 
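
Jacob's first suggestion quoted above, naming each file after its hash and fanning the leading character pairs out into directory levels, is straightforward to sketch. The fragment below is only an illustration of that layout under stated assumptions (SHA-256 as the hash, three two-character directory levels, invented function names); it is not code from any existing object store.

    import hashlib
    import shutil
    from pathlib import Path

    def sharded_path(store_root, hex_digest):
        # root/ab/cd/ef/<full digest>: the first three character pairs become
        # directory levels and the full digest becomes the file name.
        return Path(store_root, hex_digest[0:2], hex_digest[2:4], hex_digest[4:6], hex_digest)

    def ingest_file(store_root, source_path):
        # Hash the file, copy it into the store under its digest, and return both.
        digest = hashlib.sha256(Path(source_path).read_bytes()).hexdigest()
        destination = sharded_path(store_root, digest)
        destination.parent.mkdir(parents=True, exist_ok=True)
        if not destination.exists():  # identical content ends up stored only once
            shutil.copy2(source_path, destination)
        return digest, destination

    # digest, stored_at = ingest_file("/archive/objects", "accession/1234/letter.tif")

As the suggestion also notes, a separate table associating original path names with digests is a sensible safeguard; an append-only text file is enough for that purpose.
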
>> >> Neil >> >> >> On 2017-05-12 12:33, Tim.Gollins at nrscotland.gov.uk wrote: >>> Dear PASIG >>> >>> I have been thinking recently about the challenge of managing >>> "physical" AIPs on offline or near line storage and how to optimise >>> or simplify the use of managed storage media in a tape based >>> (robotic) Hierarchical Storage Management (HSM) system. By "physical" >>> AIPs I mean that the actual structure of the AIP written to the >>> storage system is sufficiently self-describing that even if the >>> management or other elements of a DP system were to be lost to a >>> disaster then the entire collection could be fully re-instated >>> reliably from the stored AIPs alone. >>> >>> I have also been thinking about the huge benefits of adopting the >>> concepts of "Minimal Ingest" (MI) and "Autonomous Preservation Tools" >>> (APT) in a new Digital Archive solution. >>> >>> One of the potential effects of the MI and APT concepts is that over >>> time it is clear that while (of course) the original bit streams will >>> never need to be updated, the metadata packaged in the AIP will need >>> to change relatively often (through the life of the AIP) . This is of >>> course in addition to any new renderings of the bit streams produced >>> for preservation purposes (manifestations as termed in some systems). >>> >>> If to update the AIP the process involves the AIP being "loaded" and >>> "Modified" and "Stored" again as a whole then this will result in >>> significant "churn" of the offline or near line media (i.e. tapes) in >>> a HSM - which I would like to avoid. I think it would be really great >>> if the AIP representation could accommodate the concept of an "update >>> IP" (perhaps UIP?) where the UIP contains a "delta" of the original >>> AIP - the full AIP then being interpreted as the original as modified >>> by a series of deltas. This would then effectively result in AIPs >>> (and >>> UIPs) becoming WORM objects with clear benefits that I perceive in >>> managing their reliable and safe storage. >>> >>> I am not sufficiently familiar with the detail of all the different >>> AIP models or implementations, I was wondering if anyone in the team >>> would be able to comment on whether the they know of any AIP models, >>> specifications or implementations that would support such a use >>> case. >>> >>> I have just posted a version of this question to the E-Ark Linked in >>> Group so my apologies to those who see it twice. >>> >>> Many thanks >>> >>> Tim >>> Tim Gollins | Head of Digital Archiving and Director of the NRS >>> Digital Preservation Programme National Records of Scotland | West >>> Register House | Edinburgh EH2 4DF >>> + 44 (0)131 535 1431 / + 44 (0)7974 922614 | >>> tim.gollins at nrscotland.gov.uk | www.nrscotland.gov.uk >>> >>> Preserving the past | Recording the present | Informing the future >>> Follow us on Twitter: @NatRecordsScot | >>> http://twitter.com/NatRecordsScot >>> >>> >>> ********************************************************************* >>> * This e-mail (and any files or other attachments transmitted with >>> it) is intended solely for the attention of the addressee(s). >>> Unauthorised use, disclosure, storage, copying or distribution of any >>> part of this e-mail is not permitted. If you are not the intended >>> recipient please destroy the email, remove any copies from your >>> system and inform the sender immediately by return. 
>>> >>> Communications with the Scottish Government may be monitored or >>> recorded in order to secure the effective operation of the system and >>> for other lawful purposes. The views or opinions contained within >>> this e-mail may not necessarily reflect those of the Scottish >>> Government. >>> >>> >>> Tha am post-d seo (agus faidhle neo ceanglan c?mhla ris) dhan neach >>> neo luchd-ainmichte a-mh?in. Chan eil e ceadaichte a chleachdadh ann >>> an d?igh sam bith, a? toirt a-steach c?raichean, foillseachadh neo >>> sgaoileadh, gun chead. Ma ?s e is gun d?fhuair sibh seo le gun >>> fhiosd?, bu choir cur ?s dhan phost-d agus lethbhreac sam bith air an >>> t-siostam agaibh, leig fios chun neach a sgaoil am post-d gun d?il. >>> >>> Dh?fhaodadh gum bi teachdaireachd sam bith bho Riaghaltas na h-Alba >>> air a chl?radh neo air a sgr?dadh airson dearbhadh gu bheil an >>> siostam ag obair gu h-?ifeachdach neo airson adhbhar laghail eile. >>> Dh?fhaodadh nach eil beachdan anns a? phost-d seo co-ionann ri >>> beachdan Riaghaltas na h-Alba. >>> ********************************************************************* >>> * >>> >>> >>> >>> ---- >>> To subscribe, unsubscribe, or modify your subscription, please visit >>> http://mail.asis.org/mailman/listinfo/pasig-discuss >>> _______ >>> PASIG Webinars and conference material is at >>> http://www.preservationandarchivingsig.org/index.html >>> _______________________________________________ >>> Pasig-discuss mailing list >>> Pasig-discuss at mail.asis.org >>> http://mail.asis.org/mailman/listinfo/pasig-discuss >> >> ---- >> To subscribe, unsubscribe, or modify your subscription, please visit >> http://mail.asis.org/mailman/listinfo/pasig-discuss >> _______ >> PASIG Webinars and conference material is at >> http://www.preservationandarchivingsig.org/index.html >> _______________________________________________ >> Pasig-discuss mailing list >> Pasig-discuss at mail.asis.org >> http://mail.asis.org/mailman/listinfo/pasig-discuss From jfarmer at cambridgecomputer.com Fri May 12 16:51:06 2017 From: jfarmer at cambridgecomputer.com (Jacob Farmer) Date: Fri, 12 May 2017 16:51:06 -0400 Subject: [Pasig-discuss] WORM (Write Once Read Many) AIPs In-Reply-To: <81d799913e0c44c2d1d46d9ddd9fbd23@imap.plus.net> References: <7290d680d2d83ae9c5d4a88371bb6147@imap.plus.net> <63d06e35b40be1c7d0ff6e5613950844@mail.gmail.com> <81d799913e0c44c2d1d46d9ddd9fbd23@imap.plus.net> Message-ID: <945351555fe73d699a190d9d7d4fd135@mail.gmail.com> Great point. I think of the whole things as a stack. There is the metadata and bits that defines an object from the preservation point of view. Then there is a storage device that defines an object a specific set of bits to serve up. In the case of my software, Starfish, we think of ourselves as a middleware that can define the object in some intermediate form. At the end of the day, though, an object is any piece of data that can be addressed and manipulated. That piece of data should have a permanent address, unique identifiers, and some metadata that gives it meaning. -----Original Message----- From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Neil Jefferies Sent: Friday, May 12, 2017 4:43 PM To: Jacob Farmer Subject: Re: [Pasig-discuss] WORM (Write Once Read Many) AIPs Jacob, This is the key point of my argument - the definition of object you have is not the definition of an object that an archive wants to preserve. I'm speaking for people like Tim and I - others are quite happy to build what I term bit-museums. 
Likewise, what you consider preservation (immutability of a bitstream) is not quite the same as ours - retention of knowledge content - which requires mutability but with immutable previous versions and provenance/audit records. As long as this disconnect between technology and requirements remains the case, object stores are actually of limited use for us in preservation and archiving without considerable additional work. The 'metadata' that most object stores support (key-value pairs) is pretty useless as far as our metadata requirements go - in the end we have to store XML or triples as separate files/objects. This was an issue when I reviewed the StorageTek 5800 code builds way back and frankly object storage hasn't moved on much. Fedora, for all its faults, does actually provide an object view that is meaningful - something that can be a node in a linked-data graph. It can be arbitrarily complex but equally, could comprise only metadata. It is almost never a file. Neil On 2017-05-12 20:29, Jacob Farmer wrote: > Hi, Neil. Great points. Indeed, hard links only work in a single > file system, but they continue pointing to and fro when a file is > otherwise moved or renamed. > > I personally think of POSIX file systems as object stores that have > weak addressing, limited metadata, and that offer mutability as the > default. > > My preferred definition of an object store is a device that stores > objects. > My preferred definition of an object is any piece of data that can be > individually addressed and manipulated. > So, by that definition, POSIX file systems are object stores, so are > hard drives. So is Microsoft exchange, etc. > > If you name a file according to a hash or a UUID (the hash could be > the UUID), then you have a form of persistent address. As long as no > one messes with your file system, the address scheme stays intact. > > > -----Original Message----- > From: Neil Jefferies [mailto:neil at jefferies.org] > Sent: Friday, May 12, 2017 11:25 AM > To: Jacob Farmer > Subject: RE: [Pasig-discuss] WORM (Write Once Read Many) AIPs > > Good point on the housekeeping! > > Most (reasonable) filesystems allow you specify the inode numbers at > creation but yes, it is hard to change afterwards! > > But I would really, really avoid hard links - they only work within a > single filesystem so they can't be used in tiered or virtual storage > systems and even break quota controls on regular filesystems. Scale up > thus becomes very difficult with hard links. Symlinks also make it > explicit when you are dealing with a reference and can tell you which > version of the object held the original - useful provenance that hard > links don't capture. > > My personal feeling is no for hashes, yes for UUID's (or other > suitably unique object ID). This allows us to keep all versions of an > object in the same root path even though it varies. And don't store at > a file level - this shotguns object fragments all over the store and > make rebuilds horrible. > Many current object stores do this - and consequently don't version > effectively - I wish people would understand objects are not files. > UUID's > are also consistent in terms of computational time and hashes very > much aren't. 
> > There's a big difference in robustness between needing just filesystem > metadata to find an object in storage and requiring filesystem > metadata (because underneath all object stores are filesystems - even > Seagates "object" hard drives), object store metadata to map paths to > hashes, and object metadata to find all the bits that make up a > composite object. > > ...and yes, I am saying that most object store vendors have got it > wrong. At least as far as archiving is concerned. And they ought to > consider why every object store ends up presenting itself as a POSIX > filesystem. > > Neil > > > On 2017-05-12 14:33, Jacob Farmer wrote: >> Two warnings and two suggestions: >> >> Warnings: >> >> 1) Symlinks and Housekeeping -- It is a common practice to use >> symlinks to make versioned file collections. If you do this, you >> should have some kind of housekeeping processes that ensure that the >> symlinks are all working correctly. If files ever have to get >> migrated, symlinks can break. >> >> 2) Check with your file system vendor -- Most removable media file >> systems have some built in limitations on the number of inodes >> (files) that you can have in one file system. If you generate a lot >> of symlinks, you might overwhelm the file system. Your vendor will know. >> >> Suggestions: >> >> 1) Hashes for file names -- If your application software maintains a >> hash for each file, you might consider naming the file according to >> the hash. >> Use the first two digits for the parent directory, the next two >> digits for sub-diretory, the next two digits for sub-directory. Then >> use the full hash for the file name. This turns your POSIX file >> system into an object store with uniquely named objects. >> >> As a safeguard, you might maintain a separate table or list that >> associates path names with hashes. >> >> 2) Consider using hard links instead of symlinks -- You might use >> hard links instead of symlinks, presuming that the files are all in >> the same file system. You still have to watch for file count issues, >> but you have less housekeeping to do. >> >> I hope that helps. >> >> >> Jacob Farmer | Chief Technology Officer | Cambridge Computer | >> "Artists In Data Storage" >> Phone 781-250-3210 | jfarmer at CambridgeComputer.com | >> www.CambridgeComputer.com >> >> >> >> >> -----Original Message----- >> From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf >> Of Neil Jefferies >> Sent: Friday, May 12, 2017 8:06 AM >> To: Tim.Gollins at nrscotland.gov.uk >> Cc: pasig-discuss at mail.asis.org >> Subject: Re: [Pasig-discuss] WORM (Write Once Read Many) AIPs >> >> Tim, >> >> If we store AIP's unpackaged, as a collection of files in a folder, >> then object updates could just be a new folder with symlinks to the >> unchanged parts and the updated parts in place in the folder. The >> object "location" >> would be a parent folder for all these version folders - for example, >> a pairtree (or triple-tree for faster scanning/rebuilds) based on >> object UUID. >> Version folders would be named accoprding to date or version number >> (date might make Memento compliant access simpler). >> Creating anew version clones the current verion (including links) >> with a new name and then replaces the updated parts in situ. Final >> act is to update a "current" symlink in the object. Any update >> failure will mean "current" >> is >> not updated an the partial clone can be discarded. 
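
The versioning scheme described just above (clone the current version folder as symlinks, write only the changed files into the clone, and only then repoint a "current" symlink) can be sketched in a few lines. This is a rough illustration assuming that versions live in sibling folders under one object folder, that "current" is a relative symlink to the latest of them, and that at least one version already exists; it is not code from RDFDatabank, Fedora, or any other system.

    import os
    from pathlib import Path

    def create_new_version(object_root, version_name, updated_files):
        # updated_files: dict of file name -> path of the replacement content.
        object_root = Path(object_root)
        current = object_root / "current"              # relative symlink, e.g. -> "v0002"
        previous = object_root / os.readlink(current)  # the version folder it points at
        new_version = object_root / version_name
        new_version.mkdir()

        # Unchanged parts become symlinks into the previous version, which also
        # records which version held the original copy of each file.
        for entry in previous.iterdir():
            if entry.name not in updated_files:
                (new_version / entry.name).symlink_to(Path("..") / previous.name / entry.name)

        # Updated parts are written in place in the new folder.
        for name, source in updated_files.items():
            (new_version / name).write_bytes(Path(source).read_bytes())

        # Final act: repoint "current". If anything above failed we never reach
        # this line, so "current" still points at the old version and the
        # partial clone can simply be deleted.
        tmp = object_root / "current.tmp"
        tmp.symlink_to(version_name)
        tmp.replace(current)

Because the swap of "current" is a single rename, readers either see the old complete version or the new complete version, never a half-built one.
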
>> >> This assumes most updates are metadata and that a diff won't save >> much compared to a complete new XML file or whatever. I am also >> assuming that metadata won't be wrappered either (so you can forget >> METS) so that different types are stored in the most stuiable format >> and are accessed only when required. The problems with roundtripping >> packaged AIP's for updates rather than diff-ing are repeated by METS >> wrappering. >> >> These may be a virtual folder/filesytem presentation and underneath >> an HSM would retrieve files from wherever when it is actually accessed. >> HSM policy in soemthing like SAM-QFS/Versity/Cray TAS can ensure >> folders are kep intact when moved to other storage (we could even >> dereference symlinks when dealing with tape). >> >> This can be done with a POSIX filesystem and not muich code - Ben >> O'Steen started something along these lines here: >> https://github.com/dataflow/RDFDatabank/wiki/What-is-DataBank-and-wha >> t >> -does-it-do%3F >> >> Fedora also also a versioning object store that could support this >> kind of model but also adds a fair bit of complexity to be >> Linked-Data_platform compliant. >> >> In my paralance I would probably equate "Minimal Ingest" with "Sheer >> Curation" and APT with Asynchronous Message Driven Workers. >> >> Neil >> >> >> On 2017-05-12 12:33, Tim.Gollins at nrscotland.gov.uk wrote: >>> Dear PASIG >>> >>> I have been thinking recently about the challenge of managing >>> "physical" AIPs on offline or near line storage and how to optimise >>> or simplify the use of managed storage media in a tape based >>> (robotic) Hierarchical Storage Management (HSM) system. By "physical" >>> AIPs I mean that the actual structure of the AIP written to the >>> storage system is sufficiently self-describing that even if the >>> management or other elements of a DP system were to be lost to a >>> disaster then the entire collection could be fully re-instated >>> reliably from the stored AIPs alone. >>> >>> I have also been thinking about the huge benefits of adopting the >>> concepts of "Minimal Ingest" (MI) and "Autonomous Preservation Tools" >>> (APT) in a new Digital Archive solution. >>> >>> One of the potential effects of the MI and APT concepts is that over >>> time it is clear that while (of course) the original bit streams >>> will never need to be updated, the metadata packaged in the AIP will >>> need to change relatively often (through the life of the AIP) . This >>> is of course in addition to any new renderings of the bit streams >>> produced for preservation purposes (manifestations as termed in some >>> systems). >>> >>> If to update the AIP the process involves the AIP being "loaded" and >>> "Modified" and "Stored" again as a whole then this will result in >>> significant "churn" of the offline or near line media (i.e. tapes) >>> in a HSM - which I would like to avoid. I think it would be really >>> great if the AIP representation could accommodate the concept of an >>> "update IP" (perhaps UIP?) where the UIP contains a "delta" of the >>> original AIP - the full AIP then being interpreted as the original >>> as modified by a series of deltas. This would then effectively >>> result in AIPs (and >>> UIPs) becoming WORM objects with clear benefits that I perceive in >>> managing their reliable and safe storage. 
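
The "update IP" (UIP) idea quoted above reads naturally as a base AIP manifest plus an ordered series of write-once deltas, with the original packages never rewritten on tape. A rough sketch of how a reader might resolve the current state follows; the delta structure (an "upsert" map plus a "remove" list) is invented purely to illustrate the idea and is not a feature of any existing AIP specification.

    def resolve_aip_state(base_manifest, update_ips):
        # base_manifest: dict of logical name -> reference (bitstream checksum,
        # metadata file, rendering, ...), as recorded in the original AIP.
        # update_ips: the UIP deltas in the order they were written, each a dict
        # with an optional "upsert" map and an optional "remove" list.
        state = dict(base_manifest)
        for uip in update_ips:
            state.update(uip.get("upsert", {}))
            for name in uip.get("remove", []):
                state.pop(name, None)
        return state

    # Hypothetical example: metadata refreshed twice, original bitstream untouched.
    # current = resolve_aip_state(
    #     {"payload/report.doc": "sha256:9f86d0...", "metadata/ead.xml": "v1"},
    #     [{"upsert": {"metadata/ead.xml": "v2"}},
    #      {"upsert": {"metadata/premis.xml": "v1"}}],
    # )
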
>>> >>> I am not sufficiently familiar with the detail of all the different >>> AIP models or implementations, I was wondering if anyone in the team >>> would be able to comment on whether the they know of any AIP models, >>> specifications or implementations that would support such a use >>> case. >>> >>> I have just posted a version of this question to the E-Ark Linked in >>> Group so my apologies to those who see it twice. >>> >>> Many thanks >>> >>> Tim >>> Tim Gollins | Head of Digital Archiving and Director of the NRS >>> Digital Preservation Programme National Records of Scotland | West >>> Register House | Edinburgh EH2 4DF >>> + 44 (0)131 535 1431 / + 44 (0)7974 922614 | >>> tim.gollins at nrscotland.gov.uk | www.nrscotland.gov.uk >>> >>> Preserving the past | Recording the present | Informing the future >>> Follow us on Twitter: @NatRecordsScot | >>> http://twitter.com/NatRecordsScot >>> >>> >>> ******************************************************************** >>> * >>> * This e-mail (and any files or other attachments transmitted with >>> it) is intended solely for the attention of the addressee(s). >>> Unauthorised use, disclosure, storage, copying or distribution of >>> any part of this e-mail is not permitted. If you are not the >>> intended recipient please destroy the email, remove any copies from >>> your system and inform the sender immediately by return. >>> >>> Communications with the Scottish Government may be monitored or >>> recorded in order to secure the effective operation of the system >>> and for other lawful purposes. The views or opinions contained >>> within this e-mail may not necessarily reflect those of the Scottish >>> Government. >>> >>> >>> Tha am post-d seo (agus faidhle neo ceanglan c?mhla ris) dhan neach >>> neo luchd-ainmichte a-mh?in. Chan eil e ceadaichte a chleachdadh ann >>> an d?igh sam bith, a? toirt a-steach c?raichean, foillseachadh neo >>> sgaoileadh, gun chead. Ma ?s e is gun d?fhuair sibh seo le gun >>> fhiosd?, bu choir cur ?s dhan phost-d agus lethbhreac sam bith air >>> an t-siostam agaibh, leig fios chun neach a sgaoil am post-d gun d?il. >>> >>> Dh?fhaodadh gum bi teachdaireachd sam bith bho Riaghaltas na h-Alba >>> air a chl?radh neo air a sgr?dadh airson dearbhadh gu bheil an >>> siostam ag obair gu h-?ifeachdach neo airson adhbhar laghail eile. >>> Dh?fhaodadh nach eil beachdan anns a? phost-d seo co-ionann ri >>> beachdan Riaghaltas na h-Alba. 
>>> ******************************************************************** >>> * >>> * >>> >>> >>> >>> ---- >>> To subscribe, unsubscribe, or modify your subscription, please visit >>> http://mail.asis.org/mailman/listinfo/pasig-discuss >>> _______ >>> PASIG Webinars and conference material is at >>> http://www.preservationandarchivingsig.org/index.html >>> _______________________________________________ >>> Pasig-discuss mailing list >>> Pasig-discuss at mail.asis.org >>> http://mail.asis.org/mailman/listinfo/pasig-discuss >> >> ---- >> To subscribe, unsubscribe, or modify your subscription, please visit >> http://mail.asis.org/mailman/listinfo/pasig-discuss >> _______ >> PASIG Webinars and conference material is at >> http://www.preservationandarchivingsig.org/index.html >> _______________________________________________ >> Pasig-discuss mailing list >> Pasig-discuss at mail.asis.org >> http://mail.asis.org/mailman/listinfo/pasig-discuss ---- To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss _______ PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html _______________________________________________ Pasig-discuss mailing list Pasig-discuss at mail.asis.org http://mail.asis.org/mailman/listinfo/pasig-discuss From twalsh at cca.qc.ca Fri May 12 17:15:15 2017 From: twalsh at cca.qc.ca (Tim Walsh) Date: Fri, 12 May 2017 21:15:15 +0000 Subject: [Pasig-discuss] FW: Digital repository storage benchmarking In-Reply-To: References: <85DD9199-4D53-4754-8802-6C3171BD3BC4@harvard.edu> Message-ID: <25E51070-A934-448B-ADDD-B33AFE88F8A4@cca.qc.ca> Thank you to Tab, Randy, Sheila, Richard, et al. Very interesting and helpful responses! Best, Tim - - - Tim Walsh Archiviste, Archives num?riques Archivist, Digital Archives Centre Canadien d?Architecture Canadian Centre for Architecture T 514 939 7001 x 1532 www.cca.qc.ca On 2017-05-12, 2:43 PM, "Pasig-discuss on behalf of Butler, Tab" wrote: Tim, At Major League Baseball, we are focused mostly on archiving the broadcast game video feeds, along with pregame, postgame, and individual camera iso feeds for each game. The content includes both the home and away team broadcasts, with and without graphics. Essentially, we record 7 hours plus of content for every 1 hour of baseball played. We also record and archive all the MLB Network content that is produced, which is between 12 - 18 hours of live content per day. We will archive the entire broadcast show of record, and the individual elements that make up a show. All in, we are recording over 1,000 hours of content per day. This equates to 50+ TB of content being added to our archive per day. We have both an active on-line disk tier (2 SAN's - each 2.88 PB) for recording, editing, and on-line storage, and a data tape archive that supports Partial File Restore (PFR) of video files. We load balance recording content across the two SAN's... American League on one SAN, and National League on the other... and all edits (96 high performance / 54 desktop machines) access both SAN's. Once content is written to a SAN, it is auto archived to tape, as per our DIAMOND asset management system (home grown). We started archiving on LTO-4 in 2008, and are currently on Oracle T10000-D. We are migrating content from LTO-4 to T10K-D tape within a tape group... We have both an 'On-Site' tape sub-group, and an 'Off-Site' tape sub-group for each of our Tape Groups. 
Tape Groups include "Games with Graphics" (Dirty) and "Games without Graphics" (Clean)... the Dirty off-site tapes go to a separate off site location than the Clean off-site tapes. We break up all of our Off-Site Tape Groups between two geographically distributed locations, as well. We are using the Oracle DIVArichive middleware, which performs a checksum value that is compared to the stored database value, each time a file is copied, moved, or restored. We are performing between 1,000 to 2,000 PFR / Restores per day. Currently we have over 45,000 LTO-4's and over 10,000 T10K-D tapes, growing at the rate of 125,000 hours of content per year. If you would like more details regarding archiving video content, feel free to reach out to me. Sincerely, Tab Tab Butler | Sr. Director - Media Management & Post Production| MLB Network | 40 Hartz Way, Suite 10 | Secaucus, NJ 07094 (201) 520-6252 Office | (646) 498-1662 Cell tab.butler at mlb.com -----Original Message----- From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Stern, Randy Sent: Friday, May 12, 2017 1:59 PM To: Sheila Morrissey ; pasig-discuss at asis.org Subject: Re: [Pasig-discuss] FW: Digital repository storage benchmarking Harvard is similar ? 2 disk copies in geographically distributed sites on, and one tape copy in a third location. We also have a 4th copy on tape in a tape library that is creating the tapes we remove off site to the third location. We run fixity checks on the disk copies, but not the tape copy. We currently have in excess of 200TB for each copy. We currently store preservation and real-time access copies of files in the same storage system with the same storage policies. We expect that to change in the future, with likely delivery copy storage in the cloud. Randy On 5/12/17, 1:43 PM, "Sheila Morrissey" wrote: Hello, Tim, At Portico (http://www.portico.org/digital-preservation/), we preserve e-journals, e-books, digitized historical collections, and other born-digital scholarly content. Currently, the Portico archive is comprised of roughly 77.7 million digital objects (we call them "Archival Units", or AUs); comprising over 400 TB; made up of 1.3 billion files. We maintain 3 copies of the archive: 2 on disk in geographically distributed data centers, and a 3rd copy in commercial cloud storage. We create and maintain backups (including fixity checks) using our own custom-written software. I hope this helpful. Best regards, Sheila Sheila M. Morrissey Senior Researcher ITHAKA 100 Campus Drive Suite 100 Princeton NJ 08540 609-986-2221 sheila.morrissey at ithaka.org ITHAKA (www.ithaka.org) is a not-for-profit organization that helps the academic community use digital technologies to preserve the scholarly record and to advance research and teaching in sustainable ways. We provide innovative services that benefit higher education, including Ithaka S+R, JSTOR, and Portico. -----Original Message----- From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Tim Walsh Sent: Friday, May 12, 2017 10:16 AM To: pasig-discuss at asis.org Subject: [Pasig-discuss] Digital repository storage benchmarking Dear PASIG, I am currently in the process of benchmarking digital repository storage setups with our Director of IT, and am having trouble finding very much information about other institutions? configurations online. It?s very possible that this question has been asked before on-list, but I wasn?t able to find anything in the list archives. 
For context, we are a research museum with significant born-digital archival holdings preparing to manage about 200 TB of digital objects over the next 3 years, replicated several times on various media. The question is what precisely those "various media" will be. Currently, our plan is to store one copy on disk on-site, one copy on disk in a managed off-site facility, and a third copy on LTO sent to a third facility. Before we commit, we'd like to benchmark our plans against other institutions. I have been able to find information about the storage configurations for MoMA and the Computer History Museum (who each wrote blog posts or presented on this topic), but not very many others. So my questions are: * Could you point me to published/available resources outlining other institutions' digital repository storage configurations? * Or, if you work at an institution, would you be willing to share the details of your configuration on- or off-list? (any information sent off-list will be kept strictly confidential) Helpful details would include: amount of digital objects being stored; how many copies of data are being stored; which copies are online, nearline, or offline; which media are being used for which copies; and what services/software applications are you using to manage the creation and maintenance of backups. Thank you! Tim - - - Tim Walsh Archiviste, Archives numériques Archivist, Digital Archives Centre Canadien d'Architecture Canadian Centre for Architecture 1920, rue Baile, Montréal, Québec H3H 2S6 T 514 939 7001 x 1532 F 514 939 7020 www.cca.qc.ca Pensez à l'environnement avant d'imprimer ce message Please consider the environment before printing this email Ce courriel peut contenir des renseignements confidentiels. Si vous n'êtes pas le destinataire prévu, veuillez nous en aviser immédiatement. Merci également de supprimer le présent courriel et d'en détruire toute copie. This email may contain confidential information. If you are not the intended recipient, please advise us immediately and delete this email as well as any other copy. Thank you.
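Several of the configurations described in this thread rely on scheduled fixity checks against stored checksum values (for example, Portico's custom-written software, or the checksum comparison performed on every copy and restore in the MLB setup). A minimal sketch of such an audit, assuming a plain-text manifest of expected SHA-256 digests; the manifest format and paths here are illustrative, not any institution's actual tooling.

import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large objects do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def audit(manifest_path, storage_root):
    """Compare every object listed in the manifest ('<hex digest>  <relative path>' per line)
    against what is actually on disk; report missing and corrupted files."""
    failures = []
    for line in Path(manifest_path).read_text().splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(None, 1)
        target = Path(storage_root) / rel_path
        if not target.is_file():
            failures.append((rel_path, "missing"))
        elif sha256_of(target) != expected:
            failures.append((rel_path, "checksum mismatch"))
    return failures

if __name__ == "__main__":
    for rel_path, problem in audit("manifest-sha256.txt", "/archive/objects"):
        print(f"{problem}: {rel_path}")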
---- To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss _______ PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html _______________________________________________ Pasig-discuss mailing list Pasig-discuss at mail.asis.org http://mail.asis.org/mailman/listinfo/pasig-discuss ---- To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss _______ PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html _______________________________________________ Pasig-discuss mailing list Pasig-discuss at mail.asis.org http://mail.asis.org/mailman/listinfo/pasig-discuss ---- To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss _______ PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html _______________________________________________ Pasig-discuss mailing list Pasig-discuss at mail.asis.org http://mail.asis.org/mailman/listinfo/pasig-discuss ---- To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss _______ PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html _______________________________________________ Pasig-discuss mailing list Pasig-discuss at mail.asis.org http://mail.asis.org/mailman/listinfo/pasig-discuss From jmorley at stanford.edu Fri May 12 17:28:30 2017 From: jmorley at stanford.edu (Julian M. Morley) Date: Fri, 12 May 2017 21:28:30 +0000 Subject: [Pasig-discuss] FW: Digital repository storage benchmarking In-Reply-To: <25E51070-A934-448B-ADDD-B33AFE88F8A4@cca.qc.ca> References: <85DD9199-4D53-4754-8802-6C3171BD3BC4@harvard.edu> <25E51070-A934-448B-ADDD-B33AFE88F8A4@cca.qc.ca> Message-ID: Tim, Moab - used here at Stanford Libraries - is a POSIX-based paradigm that allows incremental updates without involving symlinks. We use it in conjunction with UUIDs (not hashes) and Fedora to define the AIPs used in the Stanford Digital Repository. There?s a white paper describing Moab here: http://journal.code4lib.org/articles/8482#2.5 -- Julian M. Morley Technology Infrastructure Manager Digital Library Systems & Services Stanford University Libraries On 5/12/17, 2:15 PM, "Pasig-discuss on behalf of Tim Walsh" wrote: >Thank you to Tab, Randy, Sheila, Richard, et al. Very interesting and helpful responses! > >Best, >Tim > >- - - > >Tim Walsh >Archiviste, Archives num?riques >Archivist, Digital Archives > >Centre Canadien d?Architecture >Canadian Centre for Architecture >T 514 939 7001 x 1532 >www.cca.qc.ca > >On 2017-05-12, 2:43 PM, "Pasig-discuss on behalf of Butler, Tab" wrote: > > Tim, > > At Major League Baseball, we are focused mostly on archiving the broadcast game video feeds, along with pregame, postgame, and individual camera iso feeds for each game. The content includes both the home and away team broadcasts, with and without graphics. Essentially, we record 7 hours plus of content for every 1 hour of baseball played. We also record and archive all the MLB Network content that is produced, which is between 12 - 18 hours of live content per day. We will archive the entire broadcast show of record, and the individual elements that make up a show. > > All in, we are recording over 1,000 hours of content per day. 
This equates to 50+ TB of content being added to our archive per day. > > We have both an active on-line disk tier (2 SAN's - each 2.88 PB) for recording, editing, and on-line storage, and a data tape archive that supports Partial File Restore (PFR) of video files. We load balance recording content across the two SAN's... American League on one SAN, and National League on the other... and all edits (96 high performance / 54 desktop machines) access both SAN's. > > Once content is written to a SAN, it is auto archived to tape, as per our DIAMOND asset management system (home grown). We started archiving on LTO-4 in 2008, and are currently on Oracle T10000-D. We are migrating content from LTO-4 to T10K-D tape within a tape group... > > We have both an 'On-Site' tape sub-group, and an 'Off-Site' tape sub-group for each of our Tape Groups. Tape Groups include "Games with Graphics" (Dirty) and "Games without Graphics" (Clean)... the Dirty off-site tapes go to a separate off site location than the Clean off-site tapes. We break up all of our Off-Site Tape Groups between two geographically distributed locations, as well. > > We are using the Oracle DIVArichive middleware, which performs a checksum value that is compared to the stored database value, each time a file is copied, moved, or restored. We are performing between 1,000 to 2,000 PFR / Restores per day. > > Currently we have over 45,000 LTO-4's and over 10,000 T10K-D tapes, growing at the rate of 125,000 hours of content per year. > > If you would like more details regarding archiving video content, feel free to reach out to me. > > Sincerely, > > Tab > > > > Tab Butler | Sr. Director - Media Management & Post Production| MLB Network | 40 Hartz Way, Suite 10 | Secaucus, NJ 07094 > (201) 520-6252 Office | (646) 498-1662 Cell > > tab.butler at mlb.com > > > -----Original Message----- > From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Stern, Randy > Sent: Friday, May 12, 2017 1:59 PM > To: Sheila Morrissey ; pasig-discuss at asis.org > Subject: Re: [Pasig-discuss] FW: Digital repository storage benchmarking > > Harvard is similar ? 2 disk copies in geographically distributed sites on, and one tape copy in a third location. We also have a 4th copy on tape in a tape library that is creating the tapes we remove off site to the third location. We run fixity checks on the disk copies, but not the tape copy. We currently have in excess of 200TB for each copy. > > We currently store preservation and real-time access copies of files in the same storage system with the same storage policies. We expect that to change in the future, with likely delivery copy storage in the cloud. > > Randy > > On 5/12/17, 1:43 PM, "Sheila Morrissey" wrote: > > > Hello, Tim, > > At Portico (http://www.portico.org/digital-preservation/), we preserve e-journals, e-books, digitized historical collections, and other born-digital scholarly content. > > Currently, the Portico archive is comprised of roughly 77.7 million digital objects (we call them "Archival Units", or AUs); comprising over 400 TB; made up of 1.3 billion files. > > We maintain 3 copies of the archive: 2 on disk in geographically distributed data centers, and a 3rd copy in commercial cloud storage. We create and maintain backups (including fixity checks) using our own custom-written software. > > I hope this helpful. > > Best regards, > Sheila > > > Sheila M. 
Morrissey > Senior Researcher > ITHAKA > 100 Campus Drive > Suite 100 > Princeton NJ 08540 > 609-986-2221 > sheila.morrissey at ithaka.org > > ITHAKA (www.ithaka.org) is a not-for-profit organization that helps the academic community use digital technologies to preserve the scholarly record and to advance research and teaching in sustainable ways. We provide innovative services that benefit higher education, including Ithaka S+R, JSTOR, and Portico. > > > > -----Original Message----- > From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Tim Walsh > Sent: Friday, May 12, 2017 10:16 AM > To: pasig-discuss at asis.org > Subject: [Pasig-discuss] Digital repository storage benchmarking > > Dear PASIG, > > I am currently in the process of benchmarking digital repository storage setups with our Director of IT, and am having trouble finding very much information about other institutions? configurations online. It?s very possible that this question has been asked before on-list, but I wasn?t able to find anything in the list archives. > > For context, we are a research museum with significant born-digital archival holdings preparing to manage about 200 TB of digital objects over the next 3 years, replicated several times on various media. The question is what precisely those ?various media? will be. Currently, our plan is to store one copy on disk on-site, one copy on disk in a managed off-site facility, and a third copy on LTO sent to a third facility. Before we commit, we?d like to benchmark our plans against other institutions. > > I have been able to find information about the storage configurations for MoMA and the Computer History Museum (who each wrote blog posts or presented on this topic), but not very many others. So my questions are: > > * Could you point me to published/available resources outlining other institutions? digital repository storage configurations? > * Or, if you work at an institution, would you be willing to share the details of your configuration on- or off-list? (any information sent off-list will be kept strictly confidential) > > Helpful details would include: amount of digital objects being stored; how many copies of data are being stored; which copies are online, nearline, or offline; which media are being used for which copies; and what services/software applications are you using to manage the creation and maintainance of backups. > > Thank you! > Tim > > - - - > > Tim Walsh > Archiviste, Archives num?riques > Archivist, Digital Archives > > Centre Canadien d?Architecture > Canadian Centre for Architecture > 1920, rue Baile, Montr?al, Qu?bec H3H 2S6 T 514 939 7001 x 1532 F 514 939 7020 www.cca.qc.ca > > > Pensez ? l?environnement avant d?imprimer ce message Please consider the environment before printing this email Ce courriel peut contenir des renseignements confidentiels. Si vous n??tes pas le destinataire pr?vu, veuillez nous en aviser imm?diatement. Merci ?galement de supprimer le pr?sent courriel et d?en d?truire toute copie. > This email may contain confidential information. If you are not the intended recipient, please advise us immediately and delete this email as well as any other copy. Thank you. 
> > ---- > To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss > _______ > PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html > _______________________________________________ > Pasig-discuss mailing list > Pasig-discuss at mail.asis.org > http://mail.asis.org/mailman/listinfo/pasig-discuss > > ---- > To subscribe, unsubscribe, or modify your subscription, please visit > http://mail.asis.org/mailman/listinfo/pasig-discuss > _______ > PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html > _______________________________________________ > Pasig-discuss mailing list > Pasig-discuss at mail.asis.org > http://mail.asis.org/mailman/listinfo/pasig-discuss > > > > ---- > To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss > _______ > PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html > _______________________________________________ > Pasig-discuss mailing list > Pasig-discuss at mail.asis.org > http://mail.asis.org/mailman/listinfo/pasig-discuss > > > ---- > To subscribe, unsubscribe, or modify your subscription, please visit > http://mail.asis.org/mailman/listinfo/pasig-discuss > _______ > PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html > _______________________________________________ > Pasig-discuss mailing list > Pasig-discuss at mail.asis.org > http://mail.asis.org/mailman/listinfo/pasig-discuss > > > >---- >To subscribe, unsubscribe, or modify your subscription, please visit >http://mail.asis.org/mailman/listinfo/pasig-discuss >_______ >PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html >_______________________________________________ >Pasig-discuss mailing list >Pasig-discuss at mail.asis.org >http://mail.asis.org/mailman/listinfo/pasig-discuss From jmorley at stanford.edu Fri May 12 17:36:55 2017 From: jmorley at stanford.edu (Julian M. Morley) Date: Fri, 12 May 2017 21:36:55 +0000 Subject: [Pasig-discuss] WORM (Write Once Read Many) AIPs Message-ID: <53A9BB8F-F59C-4370-A7CC-BC490969BAEA@stanford.edu> Apologies - I replied to the wrong Tim! -- Julian M. Morley Technology Infrastructure Manager Digital Library Systems & Services Stanford University Libraries On 5/12/17, 2:28 PM, "Julian M. Morley" wrote: > >Tim, > >Moab - used here at Stanford Libraries - is a POSIX-based paradigm that allows incremental updates without involving symlinks. We use it in conjunction with UUIDs (not hashes) and Fedora to define the AIPs used in the Stanford Digital Repository. > >There?s a white paper describing Moab here: >http://journal.code4lib.org/articles/8482#2.5 > > >-- >Julian M. 
Morley >Technology Infrastructure Manager >Digital Library Systems & Services >Stanford University Libraries > > From jonathan.tilbury at preservica.com Sun May 14 06:54:59 2017 From: jonathan.tilbury at preservica.com (Jonathan Tilbury) Date: Sun, 14 May 2017 10:54:59 +0000 Subject: [Pasig-discuss] WORM (Write Once Read Many) AIPs In-Reply-To: <945351555fe73d699a190d9d7d4fd135@mail.gmail.com> References: <7290d680d2d83ae9c5d4a88371bb6147@imap.plus.net> <63d06e35b40be1c7d0ff6e5613950844@mail.gmail.com> <81d799913e0c44c2d1d46d9ddd9fbd23@imap.plus.net> <945351555fe73d699a190d9d7d4fd135@mail.gmail.com> Message-ID: Tim, I have always thought of the "autonomous AIP" zipped up and held on a storage device as a residue of paper-thinking. When dealing with paper storage it is possible to bundle up the papers and some description and put it in a box onto a shelf. If you need the artefact, you get all of the box. The paper is unlikely to be updated or changed during its lifetime. This really does not map well onto the digital world. There are lots of changes that result in the "AIP" being changed, for example changes in descriptive metadata, structure (parentage), security settings, technical metadata (during a re-characterisation) and audit trail. You may also add extra files to the AIP and most importantly generate new representations for access or digital masters following a migration. This makes the idea of a single immutable AIP redundant. Addressing this, we need to ask why we are worrying. I think you answered this well by saying the content plus all of the metadata listed above must be accessible outside of whatever system you are using to re-build the collection should disaster happen or should you want to change system provider. To enable this you need all of the digital objects plus metadata (description, technical, security, structure, audit trail, fixity) to be held in a place and in a way that can be machine read. This does not imply physical zipped AIPs, just that the data is there and is understandable. Physical (zipped) AIPs are difficult to work with. Whenever you need to access a file you need to unpack the zip, which is cumbersome and slow. This happens for download, rendering, and fixity checking. This overhead has no benefit and several risks. Also, it brings into question what fixity checking actually means when the storage container is being changed all the time. These problems become particularly acute when we have to address the large flat collections we are now seeing more of. I have always thought a better approach is to save the digital objects (files) in an object store (for example a file drive, tape store, cloud storage), and to make sure these never change using fixity validation. All of the metadata can be written to the object store as well, and either updated or new versions written as it is updated. These digital objects (files and metadata) can be stored in multiple locations in different technologies. In Preservica we support both approaches through the range of storage adapters we include. Each has its own way of renaming the digital objects, but the use of objects with a UUID naming convention is preferred. We strongly recommend against the use of physical AIPs. All of the objects, once stored, can then be checked for fixity on a rotating basis or when accessed. By storing to multiple storage adapters you can even self-heal if someone does mess with your file system.
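A minimal sketch of the pattern described above: immutable content objects keyed by UUID, with metadata appended as new versions rather than rewritten in place, and fixity recorded at ingest so later validation has something to compare against. The key layout, metadata naming and class shape are illustrative assumptions, not Preservica's storage adapters or API.

import hashlib
import uuid
from pathlib import Path

class SimpleObjectStore:
    """Toy object store on a POSIX filesystem: content bitstreams are written once under a
    UUID key and never modified; metadata updates are appended as new versions."""

    def __init__(self, root):
        self.root = Path(root)

    def ingest(self, data: bytes) -> str:
        """Store an immutable bitstream; returns the object's UUID."""
        object_id = str(uuid.uuid4())
        obj_dir = self.root / object_id
        (obj_dir / "metadata").mkdir(parents=True)
        (obj_dir / "content").write_bytes(data)
        # Record fixity at ingest time.
        (obj_dir / "content.sha256").write_text(hashlib.sha256(data).hexdigest())
        return object_id

    def add_metadata_version(self, object_id: str, xml: str) -> Path:
        """Write descriptive/technical metadata as a new version; earlier versions are kept."""
        meta_dir = self.root / object_id / "metadata"
        next_version = len(list(meta_dir.glob("v*.xml"))) + 1
        path = meta_dir / f"v{next_version:04d}.xml"
        path.write_text(xml)
        return path

    def verify(self, object_id: str) -> bool:
        """Fixity check: recompute the content digest and compare to the stored value."""
        obj_dir = self.root / object_id
        expected = (obj_dir / "content.sha256").read_text().strip()
        actual = hashlib.sha256((obj_dir / "content").read_bytes()).hexdigest()
        return actual == expected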
As for exiting the system, we allow cloud edition users to replicate all of the content plus metadata to a remote store using SFTP in such a way that the physical directory structure mimics the logical collection structure. If they want to leave they have all the content safe in a place of their choosing. I would be very interested in people's comments on whether we should still support Physical (zipped) AIPs. Jon ============= Jon Tilbury CTO, Preservica ============= -----Original Message----- From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Neil Jefferies Sent: Friday, May 12, 2017 4:43 PM To: Jacob Farmer Subject: Re: [Pasig-discuss] WORM (Write Once Read Many) AIPs Jacob, This is the key point of my argument - the definition of object you have is not the definition of an object that an archive wants to preserve. I'm speaking for people like Tim and me - others are quite happy to build what I term bit-museums. Likewise, what you consider preservation (immutability of a bitstream) is not quite the same as ours - retention of knowledge content - which requires mutability but with immutable previous versions and provenance/audit records. As long as this disconnect between technology and requirements remains the case, object stores are actually of limited use for us in preservation and archiving without considerable additional work. The 'metadata' that most object stores support (key-value pairs) is pretty useless as far as our metadata requirements go - in the end we have to store XML or triples as separate files/objects. This was an issue when I reviewed the StorageTek 5800 code builds way back and frankly object storage hasn't moved on much. Fedora, for all its faults, does actually provide an object view that is meaningful - something that can be a node in a linked-data graph. It can be arbitrarily complex but equally, could comprise only metadata. It is almost never a file. Neil On 2017-05-12 20:29, Jacob Farmer wrote: > Hi, Neil. Great points. Indeed, hard links only work in a single > file system, but they continue pointing to and fro when a file is > otherwise moved or renamed. > > I personally think of POSIX file systems as object stores that have > weak addressing, limited metadata, and that offer mutability as the > default. > > My preferred definition of an object store is a device that stores > objects. > My preferred definition of an object is any piece of data that can be > individually addressed and manipulated. > So, by that definition, POSIX file systems are object stores, so are > hard drives. So is Microsoft Exchange, etc. > > If you name a file according to a hash or a UUID (the hash could be > the UUID), then you have a form of persistent address. As long as no > one messes with your file system, the address scheme stays intact. > > > -----Original Message----- > From: Neil Jefferies [mailto:neil at jefferies.org] > Sent: Friday, May 12, 2017 11:25 AM > To: Jacob Farmer > Subject: RE: [Pasig-discuss] WORM (Write Once Read Many) AIPs > > Good point on the housekeeping! > > Most (reasonable) filesystems allow you to specify the inode numbers at > creation but yes, it is hard to change afterwards! > > But I would really, really avoid hard links - they only work within a > single filesystem so they can't be used in tiered or virtual storage > systems and even break quota controls on regular filesystems. Scale up > thus becomes very difficult with hard links.
Symlinks also make it > explicit when you are dealing with a reference and can tell you which > version of the object held the original - useful provenance that hard > links don't capture. > > My personal feeling is no for hashes, yes for UUIDs (or other > suitably unique object ID). This allows us to keep all versions of an > object in the same root path even though it varies. And don't store at > a file level - this shotguns object fragments all over the store and > makes rebuilds horrible. > Many current object stores do this - and consequently don't version > effectively - I wish people would understand objects are not files. > UUIDs > are also consistent in terms of computational time and hashes very > much aren't. > > There's a big difference in robustness between needing just filesystem > metadata to find an object in storage and requiring filesystem > metadata (because underneath all object stores are filesystems - even > Seagate's "object" hard drives), object store metadata to map paths to > hashes, and object metadata to find all the bits that make up a > composite object. > > ...and yes, I am saying that most object store vendors have got it > wrong. At least as far as archiving is concerned. And they ought to > consider why every object store ends up presenting itself as a POSIX > filesystem. > > Neil > > > On 2017-05-12 14:33, Jacob Farmer wrote: >> Two warnings and two suggestions: >> >> Warnings: >> >> 1) Symlinks and Housekeeping -- It is a common practice to use >> symlinks to make versioned file collections. If you do this, you >> should have some kind of housekeeping processes that ensure that the >> symlinks are all working correctly. If files ever have to get >> migrated, symlinks can break. >> >> 2) Check with your file system vendor -- Most removable media file >> systems have some built-in limitations on the number of inodes >> (files) that you can have in one file system. If you generate a lot >> of symlinks, you might overwhelm the file system. Your vendor will know. >> >> Suggestions: >> >> 1) Hashes for file names -- If your application software maintains a >> hash for each file, you might consider naming the file according to >> the hash. >> Use the first two digits for the parent directory, the next two >> digits for sub-directory, the next two digits for sub-directory. Then >> use the full hash for the file name. This turns your POSIX file >> system into an object store with uniquely named objects. >> >> As a safeguard, you might maintain a separate table or list that >> associates path names with hashes. >> >> 2) Consider using hard links instead of symlinks -- You might use >> hard links instead of symlinks, presuming that the files are all in >> the same file system. You still have to watch for file count issues, >> but you have less housekeeping to do. >> >> I hope that helps.
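A minimal sketch of the "hashes for file names" suggestion above: derive a sharded storage path from a file's hash (two hex characters per directory level, full digest as the file name) and, as the safeguard, keep a separate record of which original path maps to which digest. The three-level depth, SQLite table and example paths are illustrative choices, not part of the original suggestion.

import hashlib
import shutil
import sqlite3
from pathlib import Path

def sharded_path(store_root, hex_digest):
    """aa/bb/cc/<full digest>: two hex characters per directory level, full hash as file name."""
    return Path(store_root) / hex_digest[0:2] / hex_digest[2:4] / hex_digest[4:6] / hex_digest

def store_file(source, store_root, db):
    """Copy a file into the hash-addressed store and record the original path -> digest
    mapping, so the store can be audited or rebuilt from either direction."""
    digest = hashlib.sha256(Path(source).read_bytes()).hexdigest()
    destination = sharded_path(store_root, digest)
    destination.parent.mkdir(parents=True, exist_ok=True)
    if not destination.exists():  # identical content is only stored once
        shutil.copy2(source, destination)
    db.execute("INSERT OR REPLACE INTO objects (original_path, sha256) VALUES (?, ?)",
               (str(source), digest))
    db.commit()
    return destination

if __name__ == "__main__":
    con = sqlite3.connect("object-index.db")
    con.execute("CREATE TABLE IF NOT EXISTS objects (original_path TEXT PRIMARY KEY, sha256 TEXT)")
    store_file("accession/report.pdf", "/archive/objects", con)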
>> >> >> Jacob Farmer | Chief Technology Officer | Cambridge Computer | >> "Artists In Data Storage" >> Phone 781-250-3210 | jfarmer at CambridgeComputer.com | >> www.CambridgeComputer.com >> >> >> >> >> -----Original Message----- >> From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf >> Of Neil Jefferies >> Sent: Friday, May 12, 2017 8:06 AM >> To: Tim.Gollins at nrscotland.gov.uk >> Cc: pasig-discuss at mail.asis.org >> Subject: Re: [Pasig-discuss] WORM (Write Once Read Many) AIPs >> >> Tim, >> >> If we store AIP's unpackaged, as a collection of files in a folder, >> then object updates could just be a new folder with symlinks to the >> unchanged parts and the updated parts in place in the folder. The >> object "location" >> would be a parent folder for all these version folders - for example, >> a pairtree (or triple-tree for faster scanning/rebuilds) based on >> object UUID. >> Version folders would be named accoprding to date or version number >> (date might make Memento compliant access simpler). >> Creating anew version clones the current verion (including links) >> with a new name and then replaces the updated parts in situ. Final >> act is to update a "current" symlink in the object. Any update >> failure will mean "current" >> is >> not updated an the partial clone can be discarded. >> >> This assumes most updates are metadata and that a diff won't save >> much compared to a complete new XML file or whatever. I am also >> assuming that metadata won't be wrappered either (so you can forget >> METS) so that different types are stored in the most stuiable format >> and are accessed only when required. The problems with roundtripping >> packaged AIP's for updates rather than diff-ing are repeated by METS >> wrappering. >> >> These may be a virtual folder/filesytem presentation and underneath >> an HSM would retrieve files from wherever when it is actually accessed. >> HSM policy in soemthing like SAM-QFS/Versity/Cray TAS can ensure >> folders are kep intact when moved to other storage (we could even >> dereference symlinks when dealing with tape). >> >> This can be done with a POSIX filesystem and not muich code - Ben >> O'Steen started something along these lines here: >> https://github.com/dataflow/RDFDatabank/wiki/What-is-DataBank-and-wha >> t >> -does-it-do%3F >> >> Fedora also also a versioning object store that could support this >> kind of model but also adds a fair bit of complexity to be >> Linked-Data_platform compliant. >> >> In my paralance I would probably equate "Minimal Ingest" with "Sheer >> Curation" and APT with Asynchronous Message Driven Workers. >> >> Neil >> >> >> On 2017-05-12 12:33, Tim.Gollins at nrscotland.gov.uk wrote: >>> Dear PASIG >>> >>> I have been thinking recently about the challenge of managing >>> "physical" AIPs on offline or near line storage and how to optimise >>> or simplify the use of managed storage media in a tape based >>> (robotic) Hierarchical Storage Management (HSM) system. By "physical" >>> AIPs I mean that the actual structure of the AIP written to the >>> storage system is sufficiently self-describing that even if the >>> management or other elements of a DP system were to be lost to a >>> disaster then the entire collection could be fully re-instated >>> reliably from the stored AIPs alone. >>> >>> I have also been thinking about the huge benefits of adopting the >>> concepts of "Minimal Ingest" (MI) and "Autonomous Preservation Tools" >>> (APT) in a new Digital Archive solution. 
>>> >>> One of the potential effects of the MI and APT concepts is that over >>> time it is clear that while (of course) the original bit streams >>> will never need to be updated, the metadata packaged in the AIP will >>> need to change relatively often (through the life of the AIP) . This >>> is of course in addition to any new renderings of the bit streams >>> produced for preservation purposes (manifestations as termed in some >>> systems). >>> >>> If to update the AIP the process involves the AIP being "loaded" and >>> "Modified" and "Stored" again as a whole then this will result in >>> significant "churn" of the offline or near line media (i.e. tapes) >>> in a HSM - which I would like to avoid. I think it would be really >>> great if the AIP representation could accommodate the concept of an >>> "update IP" (perhaps UIP?) where the UIP contains a "delta" of the >>> original AIP - the full AIP then being interpreted as the original >>> as modified by a series of deltas. This would then effectively >>> result in AIPs (and >>> UIPs) becoming WORM objects with clear benefits that I perceive in >>> managing their reliable and safe storage. >>> >>> I am not sufficiently familiar with the detail of all the different >>> AIP models or implementations, I was wondering if anyone in the team >>> would be able to comment on whether the they know of any AIP models, >>> specifications or implementations that would support such a use >>> case. >>> >>> I have just posted a version of this question to the E-Ark Linked in >>> Group so my apologies to those who see it twice. >>> >>> Many thanks >>> >>> Tim >>> Tim Gollins | Head of Digital Archiving and Director of the NRS >>> Digital Preservation Programme National Records of Scotland | West >>> Register House | Edinburgh EH2 4DF >>> + 44 (0)131 535 1431 / + 44 (0)7974 922614 | >>> tim.gollins at nrscotland.gov.uk | www.nrscotland.gov.uk >>> >>> Preserving the past | Recording the present | Informing the future >>> Follow us on Twitter: @NatRecordsScot | >>> http://twitter.com/NatRecordsScot >>> >>> >>> ******************************************************************** >>> * >>> * This e-mail (and any files or other attachments transmitted with >>> it) is intended solely for the attention of the addressee(s). >>> Unauthorised use, disclosure, storage, copying or distribution of >>> any part of this e-mail is not permitted. If you are not the >>> intended recipient please destroy the email, remove any copies from >>> your system and inform the sender immediately by return. >>> >>> Communications with the Scottish Government may be monitored or >>> recorded in order to secure the effective operation of the system >>> and for other lawful purposes. The views or opinions contained >>> within this e-mail may not necessarily reflect those of the Scottish >>> Government. >>> >>> >>> Tha am post-d seo (agus faidhle neo ceanglan c?mhla ris) dhan neach >>> neo luchd-ainmichte a-mh?in. Chan eil e ceadaichte a chleachdadh ann >>> an d?igh sam bith, a? toirt a-steach c?raichean, foillseachadh neo >>> sgaoileadh, gun chead. Ma ?s e is gun d?fhuair sibh seo le gun >>> fhiosd?, bu choir cur ?s dhan phost-d agus lethbhreac sam bith air >>> an t-siostam agaibh, leig fios chun neach a sgaoil am post-d gun d?il. >>> >>> Dh?fhaodadh gum bi teachdaireachd sam bith bho Riaghaltas na h-Alba >>> air a chl?radh neo air a sgr?dadh airson dearbhadh gu bheil an >>> siostam ag obair gu h-?ifeachdach neo airson adhbhar laghail eile. >>> Dh?fhaodadh nach eil beachdan anns a? 
phost-d seo co-ionann ri >>> beachdan Riaghaltas na h-Alba. >>> ******************************************************************** >>> * >>> * >>> >>> >>> >>> ---- >>> To subscribe, unsubscribe, or modify your subscription, please visit >>> http://mail.asis.org/mailman/listinfo/pasig-discuss >>> _______ >>> PASIG Webinars and conference material is at >>> http://www.preservationandarchivingsig.org/index.html >>> _______________________________________________ >>> Pasig-discuss mailing list >>> Pasig-discuss at mail.asis.org >>> http://mail.asis.org/mailman/listinfo/pasig-discuss >> >> ---- >> To subscribe, unsubscribe, or modify your subscription, please visit >> http://mail.asis.org/mailman/listinfo/pasig-discuss >> _______ >> PASIG Webinars and conference material is at >> http://www.preservationandarchivingsig.org/index.html >> _______________________________________________ >> Pasig-discuss mailing list >> Pasig-discuss at mail.asis.org >> http://mail.asis.org/mailman/listinfo/pasig-discuss ---- To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss _______ PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html _______________________________________________ Pasig-discuss mailing list Pasig-discuss at mail.asis.org http://mail.asis.org/mailman/listinfo/pasig-discuss ---- To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss _______ PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html _______________________________________________ Pasig-discuss mailing list Pasig-discuss at mail.asis.org http://mail.asis.org/mailman/listinfo/pasig-discuss From peter.burnhill at ed.ac.uk Fri May 12 08:25:21 2017 From: peter.burnhill at ed.ac.uk (BURNHILL Peter) Date: Fri, 12 May 2017 12:25:21 +0000 Subject: [Pasig-discuss] WORM (Write Once Read Many) AIPs In-Reply-To: References: <7290d680d2d83ae9c5d4a88371bb6147@imap.plus.net>, Message-ID: <1E70721C-7F4C-4163-A910-8857FCE8286B@ed.ac.uk> Yes, I appreciated that too. Peter Peter Burnhill University of Edinburgh Mobile: +44 (0) 774 0763 119 ps Am writing 'on the go' so pl excuse brevity On 12 May 2017, at 1:23 pm, "Tim.Gollins at nrscotland.gov.uk" > wrote: Hi Neil Brilliant - Most helpful and thought provoking. The fact that Fedora has the idea of a versioning Object store is particularly interesting. I think there are a couple of distinctions between Minimal Ingest and Sheer Curation but (from a quick glance at Google articles) they are appear very closely related. I think APT uses something like Asynchronous Message Driven Workers. Very many thanks indeed, especially for such a swift an comprehensive response. 
Tim Tim Gollins | Head of Digital Archiving and Director of the NRS Digital Preservation Programme National Records of Scotland | West Register House | Edinburgh EH2 4DF + 44 (0)131 535 1431 / + 44 (0)7974 922614 | tim.gollins at nrscotland.gov.uk | www.nrscotland.gov.uk Preserving the past | Recording the present | Informing the future Follow us on Twitter: @NatRecordsScot | http://twitter.com/NatRecordsScot -----Original Message----- From: Neil Jefferies [mailto:neil at jefferies.org] Sent: 12 May 2017 13:06 To: Gollins T (Tim) Cc: pasig-discuss at mail.asis.org Subject: Re: [Pasig-discuss] WORM (Write Once Read Many) AIPs Tim, If we store AIP's unpackaged, as a collection of files in a folder, then object updates could just be a new folder with symlinks to the unchanged parts and the updated parts in place in the folder. The object "location" would be a parent folder for all these version folders - for example, a pairtree (or triple-tree for faster scanning/rebuilds) based on object UUID. Version folders would be named accoprding to date or version number (date might make Memento compliant access simpler). Creating anew version clones the current verion (including links) with a new name and then replaces the updated parts in situ. Final act is to update a "current" symlink in the object. Any update failure will mean "current" is not updated an the partial clone can be discarded. This assumes most updates are metadata and that a diff won't save much compared to a complete new XML file or whatever. I am also assuming that metadata won't be wrappered either (so you can forget METS) so that different types are stored in the most stuiable format and are accessed only when required. The problems with roundtripping packaged AIP's for updates rather than diff-ing are repeated by METS wrappering. These may be a virtual folder/filesytem presentation and underneath an HSM would retrieve files from wherever when it is actually accessed. HSM policy in soemthing like SAM-QFS/Versity/Cray TAS can ensure folders are kep intact when moved to other storage (we could even dereference symlinks when dealing with tape). This can be done with a POSIX filesystem and not muich code - Ben O'Steen started something along these lines here: https://github.com/dataflow/RDFDatabank/wiki/What-is-DataBank-and-what-does-it-do%3F Fedora also also a versioning object store that could support this kind of model but also adds a fair bit of complexity to be Linked-Data_platform compliant. In my paralance I would probably equate "Minimal Ingest" with "Sheer Curation" and APT with Asynchronous Message Driven Workers. Neil On 2017-05-12 12:33, Tim.Gollins at nrscotland.gov.uk wrote: Dear PASIG I have been thinking recently about the challenge of managing "physical" AIPs on offline or near line storage and how to optimise or simplify the use of managed storage media in a tape based (robotic) Hierarchical Storage Management (HSM) system. By "physical" AIPs I mean that the actual structure of the AIP written to the storage system is sufficiently self-describing that even if the management or other elements of a DP system were to be lost to a disaster then the entire collection could be fully re-instated reliably from the stored AIPs alone. I have also been thinking about the huge benefits of adopting the concepts of "Minimal Ingest" (MI) and "Autonomous Preservation Tools" (APT) in a new Digital Archive solution. 
One of the potential effects of the MI and APT concepts is that over time it is clear that while (of course) the original bit streams will never need to be updated, the metadata packaged in the AIP will need to change relatively often (through the life of the AIP) . This is of course in addition to any new renderings of the bit streams produced for preservation purposes (manifestations as termed in some systems). If to update the AIP the process involves the AIP being "loaded" and "Modified" and "Stored" again as a whole then this will result in significant "churn" of the offline or near line media (i.e. tapes) in a HSM - which I would like to avoid. I think it would be really great if the AIP representation could accommodate the concept of an "update IP" (perhaps UIP?) where the UIP contains a "delta" of the original AIP - the full AIP then being interpreted as the original as modified by a series of deltas. This would then effectively result in AIPs (and UIPs) becoming WORM objects with clear benefits that I perceive in managing their reliable and safe storage. I am not sufficiently familiar with the detail of all the different AIP models or implementations, I was wondering if anyone in the team would be able to comment on whether the they know of any AIP models, specifications or implementations that would support such a use case. I have just posted a version of this question to the E-Ark Linked in Group so my apologies to those who see it twice. Many thanks Tim Tim Gollins | Head of Digital Archiving and Director of the NRS Digital Preservation Programme National Records of Scotland | West Register House | Edinburgh EH2 4DF + 44 (0)131 535 1431 / + 44 (0)7974 922614 | tim.gollins at nrscotland.gov.uk | www.nrscotland.gov.uk Preserving the past | Recording the present | Informing the future Follow us on Twitter: @NatRecordsScot | http://twitter.com/NatRecordsScot ********************************************************************** This e-mail (and any files or other attachments transmitted with it) is intended solely for the attention of the addressee(s). Unauthorised use, disclosure, storage, copying or distribution of any part of this e-mail is not permitted. If you are not the intended recipient please destroy the email, remove any copies from your system and inform the sender immediately by return. Communications with the Scottish Government may be monitored or recorded in order to secure the effective operation of the system and for other lawful purposes. The views or opinions contained within this e-mail may not necessarily reflect those of the Scottish Government. Tha am post-d seo (agus faidhle neo ceanglan c?mhla ris) dhan neach neo luchd-ainmichte a-mh?in. Chan eil e ceadaichte a chleachdadh ann an d?igh sam bith, a? toirt a-steach c?raichean, foillseachadh neo sgaoileadh, gun chead. Ma ?s e is gun d?fhuair sibh seo le gun fhiosd?, bu choir cur ?s dhan phost-d agus lethbhreac sam bith air an t-siostam agaibh, leig fios chun neach a sgaoil am post-d gun d?il. Dh?fhaodadh gum bi teachdaireachd sam bith bho Riaghaltas na h-Alba air a chl?radh neo air a sgr?dadh airson dearbhadh gu bheil an siostam ag obair gu h-?ifeachdach neo airson adhbhar laghail eile. Dh?fhaodadh nach eil beachdan anns a? phost-d seo co-ionann ri beachdan Riaghaltas na h-Alba. 
********************************************************************** ---- To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss _______ PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html _______________________________________________ Pasig-discuss mailing list Pasig-discuss at mail.asis.org http://mail.asis.org/mailman/listinfo/pasig-discuss ______________________________________________________________________ This email has been scanned by the Symantec Email Security.cloud service. For more information please visit http://www.symanteccloud.com ______________________________________________________________________ *********************************** ******************************** This email has been received from an external party and has been swept for the presence of computer viruses. ******************************************************************** ---- To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss _______ PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html _______________________________________________ Pasig-discuss mailing list Pasig-discuss at mail.asis.org http://mail.asis.org/mailman/listinfo/pasig-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: not available URL: From Steve.Knight at dia.govt.nz Sun May 14 19:43:36 2017 From: Steve.Knight at dia.govt.nz (Steve Knight) Date: Sun, 14 May 2017 23:43:36 +0000 Subject: [Pasig-discuss] Digital repository storage benchmarking In-Reply-To: References: <8B597316-5049-40E0-A7C4-4F7431E69E76@cca.qc.ca> Message-ID: Hi Tim At the National library of New Zealand, we are storing about 210TB of digital objects in our permanent repository. We have a 25TB online cache, with an online copy of all the digital objects sitting on disk. Three tape copies of the objects are made as soon as they enter into the disk archive. 1 copy remains within the tape library (nearline), the other 2 copies are sent offsite (offline). We use Oracle SAM-QFS to manage the storage policies and automatic tierage. We have a similar treatment for our 100TB of Test data, which has 1 less offsite tape copy. We are currently looking at replacing this storage architecture with a mix of Hitachi's HDI and HCP S30 object storage products and our cloud provider's object storage offering. The cloud provider storage includes replication across 3 geographic locations providing both higher availability and higher resilience than we currently have. By moving to an all online solution we hope to increase overall performance and make savings through utilising object storage and exiting some services related to current backup and restore processes. Regards Steve -----Original Message----- From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Sheila Morrissey Sent: Saturday, 13 May 2017 5:44 a.m. To: pasig-discuss at asis.org Subject: [Pasig-discuss] FW: Digital repository storage benchmarking Hello, Tim, At Portico (http://www.portico.org/digital-preservation/), we preserve e-journals, e-books, digitized historical collections, and other born-digital scholarly content. 
Currently, the Portico archive is comprised of roughly 77.7 million digital objects (we call them "Archival Units", or AUs); comprising over 400 TB; made up of 1.3 billion files. We maintain 3 copies of the archive: 2 on disk in geographically distributed data centers, and a 3rd copy in commercial cloud storage. We create and maintain backups (including fixity checks) using our own custom-written software. I hope this helpful. Best regards, Sheila Sheila M. Morrissey Senior Researcher ITHAKA 100 Campus Drive Suite 100 Princeton NJ 08540 609-986-2221 sheila.morrissey at ithaka.org ? ITHAKA (www.ithaka.org) is a not-for-profit organization that helps the academic community use digital technologies to preserve the scholarly record and to advance research and teaching in sustainable ways.? We provide innovative services that benefit higher education, including Ithaka S+R, JSTOR, and Portico. -----Original Message----- From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Tim Walsh Sent: Friday, May 12, 2017 10:16 AM To: pasig-discuss at asis.org Subject: [Pasig-discuss] Digital repository storage benchmarking Dear PASIG, I am currently in the process of benchmarking digital repository storage setups with our Director of IT, and am having trouble finding very much information about other institutions? configurations online. It?s very possible that this question has been asked before on-list, but I wasn?t able to find anything in the list archives. For context, we are a research museum with significant born-digital archival holdings preparing to manage about 200 TB of digital objects over the next 3 years, replicated several times on various media. The question is what precisely those ?various media? will be. Currently, our plan is to store one copy on disk on-site, one copy on disk in a managed off-site facility, and a third copy on LTO sent to a third facility. Before we commit, we?d like to benchmark our plans against other institutions. I have been able to find information about the storage configurations for MoMA and the Computer History Museum (who each wrote blog posts or presented on this topic), but not very many others. So my questions are: * Could you point me to published/available resources outlining other institutions? digital repository storage configurations? * Or, if you work at an institution, would you be willing to share the details of your configuration on- or off-list? (any information sent off-list will be kept strictly confidential) Helpful details would include: amount of digital objects being stored; how many copies of data are being stored; which copies are online, nearline, or offline; which media are being used for which copies; and what services/software applications are you using to manage the creation and maintainance of backups. Thank you! Tim - - - Tim Walsh Archiviste, Archives num?riques Archivist, Digital Archives Centre Canadien d?Architecture Canadian Centre for Architecture 1920, rue Baile, Montr?al, Qu?bec H3H 2S6 T 514 939 7001 x 1532 F 514 939 7020 www.cca.qc.ca Pensez ? l?environnement avant d?imprimer ce message Please consider the environment before printing this email Ce courriel peut contenir des renseignements confidentiels. Si vous n??tes pas le destinataire pr?vu, veuillez nous en aviser imm?diatement. Merci ?galement de supprimer le pr?sent courriel et d?en d?truire toute copie. This email may contain confidential information. 
If you are not the intended recipient, please advise us immediately and delete this email as well as any other copy. Thank you. ---- To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss _______ PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html _______________________________________________ Pasig-discuss mailing list Pasig-discuss at mail.asis.org http://mail.asis.org/mailman/listinfo/pasig-discuss ---- To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss _______ PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html _______________________________________________ Pasig-discuss mailing list Pasig-discuss at mail.asis.org http://mail.asis.org/mailman/listinfo/pasig-discuss From BUNTONGA at mailbox.sc.edu Sun May 14 21:41:22 2017 From: BUNTONGA at mailbox.sc.edu (BUNTON, GLENN) Date: Mon, 15 May 2017 01:41:22 +0000 Subject: [Pasig-discuss] Digital repository storage benchmarking In-Reply-To: References: <8B597316-5049-40E0-A7C4-4F7431E69E76@cca.qc.ca> Message-ID: This discussion of the various digital repository storage approaches has been very enlightening and useful so far. I appreciate all the excellent details. There is one piece of information, however, that is missing. Cost? Both initial implementation outlay and ongoing costs. Any general sense of costs would be greatly appreciated. -----Original Message----- From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Steve Knight Sent: Sunday, May 14, 2017 6:44 PM To: 'Sheila Morrissey' ; pasig-discuss at asis.org Subject: Re: [Pasig-discuss] Digital repository storage benchmarking Hi Tim At the National library of New Zealand, we are storing about 210TB of digital objects in our permanent repository. We have a 25TB online cache, with an online copy of all the digital objects sitting on disk. Three tape copies of the objects are made as soon as they enter into the disk archive. 1 copy remains within the tape library (nearline), the other 2 copies are sent offsite (offline). We use Oracle SAM-QFS to manage the storage policies and automatic tierage. We have a similar treatment for our 100TB of Test data, which has 1 less offsite tape copy. We are currently looking at replacing this storage architecture with a mix of Hitachi's HDI and HCP S30 object storage products and our cloud provider's object storage offering. The cloud provider storage includes replication across 3 geographic locations providing both higher availability and higher resilience than we currently have. By moving to an all online solution we hope to increase overall performance and make savings through utilising object storage and exiting some services related to current backup and restore processes. Regards Steve -----Original Message----- From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Sheila Morrissey Sent: Saturday, 13 May 2017 5:44 a.m. To: pasig-discuss at asis.org Subject: [Pasig-discuss] FW: Digital repository storage benchmarking Hello, Tim, At Portico (http://www.portico.org/digital-preservation/), we preserve e-journals, e-books, digitized historical collections, and other born-digital scholarly content. Currently, the Portico archive is comprised of roughly 77.7 million digital objects (we call them "Archival Units", or AUs); comprising over 400 TB; made up of 1.3 billion files. 
----
To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss
_______
PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html
_______________________________________________
Pasig-discuss mailing list
Pasig-discuss at mail.asis.org
http://mail.asis.org/mailman/listinfo/pasig-discuss

From jake.carroll at uq.edu.au Sun May 14 23:01:19 2017
From: jake.carroll at uq.edu.au (Jake Carroll)
Date: Mon, 15 May 2017 03:01:19 +0000
Subject: [Pasig-discuss] Digital repository storage benchmarking
Message-ID: <78ADB971-820E-4450-BDEB-1814B86B19F0@uq.edu.au>

Certainly interesting.

At the Queensland Brain Institute and the Australian Institute of Bioengineering and Nanotechnology at the University of Queensland, we have around 8.5PB of data under management across our HSM platforms. We currently use Oracle HSM for this task.

We have 256TB of online "cache" for the data landing location, split across 6 different filesystems that are tuned differently for different types of workloads and different tasks. These workloads are generally categorised into a few functions:

- High IO, large serial writes from instruments
- Low IO, large serial writes from instruments
- High IO, granular "many files, many IOPS" from instruments and computational factors
- Low IO, granular "many files, low IOPS" from instruments and computational factors
- Generic group share
- Generic user dir

It is an interesting thing to manage and run statistical modelling on in terms of performance analysis and micro-benchmarking of data movement patterns. All the filesystems above are provisioned on 16Gbit/sec FC-connected Hitachi HUS-VM, 10K SAS. The metadata for these filesystems is around 10 terabytes of Hitachi Accelerated Advanced Flash storage. We have around 3.8 billion files/unique objects under management.

We run a "disk based copy" (we call that copy1) which is our disk-based VSN or vault. It is around 1PB of ZFS-managed storage sitting inside the very large Hitachi HUS-VM platform. Our Copy2 and Copy3 are 2 * T10000D Oracle tape media copies in SL3000 storage silos, geographically distributed.

We do some interesting things with our tape infrastructure, including DIV-always-on, proactive data protection sweeps inside the HSM and continuous validation checks against the media. We also run STA (tape analytics tools) extra-data-path so we can see *exactly* what each drive is doing at all times. Believe me, we see things that would baffle and boggle the mind (and probably create a healthy sense of paranoia!) if you knew exactly what was going on "inside there".
We use finely tuned policy for data automation of movement between tiers so as to minimally impact user experience. Our HSM supports offline file mapping to the Windows client, so people can tell when their files and objects are "offline". It is a useful semantic and great for usability for people.

We ZFS scrub the disk copy for "always on disk consistency", and we use tpverify commands on the tape media to consistently check the media itself. We're experimenting with implementing fixity shortly too, as the filesystem supports it.

As for going "all online", at our scale we just can't afford it yet to walk away from "cold tape" principles. We're just too big. We'd love to rid ourselves of the complexities of it, and consider a full cloud-based consumption model, but having crunched the very hard numbers of things such as AWS Glacier and S3, it is a long (long) way more expensive than the relative TCOs of running it "on premise" at this stage. My hope is that this will change soon and I can start experimenting with one of my copies being a "cloud library".

Interesting thread, this...

-jc
----
To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss
_______
PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html
_______________________________________________
Pasig-discuss mailing list
Pasig-discuss at mail.asis.org
http://mail.asis.org/mailman/listinfo/pasig-discuss

From william.kilbride at dpconline.org Mon May 15 04:02:27 2017
From: william.kilbride at dpconline.org (William Kilbride)
Date: Mon, 15 May 2017 08:02:27 +0000
Subject: [Pasig-discuss] Digital repository storage benchmarking
In-Reply-To: <78ADB971-820E-4450-BDEB-1814B86B19F0@uq.edu.au>
References: <78ADB971-820E-4450-BDEB-1814B86B19F0@uq.edu.au>
Message-ID:

Hi All, Hi Tim

This is a super thread and I am learning a tonne. On the subject of costs I can make a recommendation and a request.

The Curation Costs Exchange is a useful thing and well worth a look for anyone looking at comparative costs across the digital preservation lifecycle, including storage. It's not been mentioned yet in the discussions, I assume because everyone is already aware of it. But have a look: http://www.curationexchange.org/

The conclusion we drew from the 4C project was that financial planning was a core skill in preservation planning. So to be a 'trusted' repository an institution should be able to demonstrate certain skills in financial planning and be transparent about it. It's expressed more elegantly in the 4C project roadmap: http://www.4cproject.eu/roadmap/

Now the request: there's a network effect here. The more agencies share data, the more useful the data becomes. So can I encourage you all to share that information (anonymously or identifiably) via the costs exchange?

All best wishes,

William
----
To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss
_______
PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html
_______________________________________________
Pasig-discuss mailing list
Pasig-discuss at mail.asis.org
http://mail.asis.org/mailman/listinfo/pasig-discuss

From neil.jefferies at bodleian.ox.ac.uk Mon May 15 04:26:45 2017
From: neil.jefferies at bodleian.ox.ac.uk (Neil Jefferies)
Date: Mon, 15 May 2017 08:26:45 +0000
Subject: [Pasig-discuss] WORM (Write Once Read Many) AIPs
In-Reply-To:
References: <7290d680d2d83ae9c5d4a88371bb6147@imap.plus.net> <63d06e35b40be1c7d0ff6e5613950844@mail.gmail.com> <81d799913e0c44c2d1d46d9ddd9fbd23@imap.plus.net> <945351555fe73d699a190d9d7d4fd135@mail.gmail.com>
Message-ID: <48E9420A4871584593FC3D435EF345AAEEED7C15@MBX10.ad.oak.ox.ac.uk>

I pretty much agree, although I do think there is a use case for (mostly) immutable AIPs, such as retention of material for legal reasons. However, I can't see any reason to package them other than leveraging whatever logical grouping the underlying storage facility provides - which may be a folder in a filesystem or a tar file on tape. Anything additional just adds overhead and increased scope for errors. If you are moving the object then a package should be undone anyway, since you *should* be adding provenance information to cover the move at the very least.

On a more pragmatic level, until there is a better appreciation of the fact that OAIS is a model rather than a design template, I think there will be people who demand physical AIPs - but that is why we are here!
Neil Jefferies MA MBA
Head of Innovation
Bodleian Digital Library Systems and Services
Osney One
Osney Mead
OX2 0EW
T: +44 1865 2-80588

-----Original Message-----
From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Jonathan Tilbury
Sent: 14 May 2017 11:55
To: pasig-discuss at asis.org
Subject: Re: [Pasig-discuss] WORM (Write Once Read Many) AIPs

Tim,

I have always thought of the "autonomous AIP" zipped up and held on a storage device as a residue of paper-thinking. When dealing with paper storage it is possible to bundle up the papers and some description and put it in a box onto a shelf. If you need the artefact, you get all of the box. The paper is unlikely to be updated or changed during its lifetime.

This really does not map well onto the digital world. There are lots of changes that result in the AIP being changed, for example changes in descriptive metadata, structure (parentage), security settings, technical metadata (during a re-characterisation) and audit trail. You may also add extra files to the AIP and, most importantly, generate new representations for access or digital masters following a migration. This makes the idea of a single immutable AIP redundant.

Addressing this, we need to ask why we are worrying. I think you answered this well by saying the content plus all of the metadata listed above must be accessible outside of whatever system you are using, to re-build the collection should disaster happen or should you want to change system provider. To enable this you need all of the digital objects plus metadata (description, technical, security, structure, audit trail, fixity) to be held in a place and in a way that can be machine read. This does not imply physical zipped AIPs, just that the data is there and is understandable.

Physical (zipped) AIPs are difficult to work with. Whenever you need to access a file you need to unpack the zip, which is cumbersome and slow. This happens for download, rendering, and fixity checking. This overhead has no benefit and several risks. Also, it brings into question what fixity checking actually means when the storage container is being changed all the time. These problems become particularly acute when we have to address the large flat collections we are now seeing more of.

I have always thought a better approach is to save the digital objects (files) in an object store (for example a file drive, tape store, or cloud storage), and to make sure these never change using fixity validation. All of the metadata can be written to the object store as well, and either updated or new versions written as it is updated. These digital objects (files and metadata) can be stored in multiple locations in different technologies. In Preservica we support both approaches through the range of storage adapters we include. Each has its own way of renaming the digital objects, but the use of objects with a UUID naming convention is preferred. We strongly recommend against the use of physical AIPs.

All of the objects, once stored, can then be checked for fixity on a rotating basis or when accessed. By storing to multiple storage adapters you can even self-heal if someone does mess with your file system.

As for exiting the system, we allow cloud edition users to replicate all of the content plus metadata to a remote store using SFTP in such a way that the physical directory structure mimics the logical collection structure. If they want to leave, they have all the content safe in a place of their choosing.

I would be very interested in people's comments on whether we should still support physical (zipped) AIPs.

Jon

=============
Jon Tilbury
CTO, Preservica
=============
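To make the storage layout Jon describes concrete, here is a minimal editorial sketch (not Preservica's storage adapter code or API) of UUID-named content objects that are never rewritten, with metadata changes stored as new, numbered versions beside the unchanged bitstream. The root path, file names and JSON record structure are all invented for illustration.

    import hashlib
    import json
    import uuid
    from datetime import datetime, timezone
    from pathlib import Path

    STORE = Path("/archive/objectstore")  # illustrative location only

    def ingest(content: bytes, descriptive_metadata: dict) -> str:
        """Write an immutable content object named by a fresh UUID plus a first metadata version."""
        object_id = str(uuid.uuid4())
        obj_dir = STORE / object_id
        obj_dir.mkdir(parents=True)
        (obj_dir / "content").write_bytes(content)  # never rewritten after this point
        record = {
            "sha256": hashlib.sha256(content).hexdigest(),  # fixity value for later rotating audits
            "created": datetime.now(timezone.utc).isoformat(),
            "descriptive": descriptive_metadata,
        }
        (obj_dir / "metadata.v0001.json").write_text(json.dumps(record, indent=2))
        return object_id

    def update_metadata(object_id: str, descriptive_metadata: dict) -> Path:
        """Metadata edits never touch the content object: write a new numbered metadata version."""
        obj_dir = STORE / object_id
        versions = sorted(obj_dir.glob("metadata.v*.json"))
        record = json.loads(versions[-1].read_text())
        record["descriptive"] = descriptive_metadata
        record["updated"] = datetime.now(timezone.utc).isoformat()
        new_path = obj_dir / f"metadata.v{len(versions) + 1:04d}.json"
        new_path.write_text(json.dumps(record, indent=2))
        return new_path

Replicating the same directories to a second store and comparing the recorded sha256 values is, in outline, what makes the rotating fixity checks and "self-heal" behaviour Jon mentions possible.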
-----Original Message-----
From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Neil Jefferies
Sent: Friday, May 12, 2017 4:43 PM
To: Jacob Farmer
Subject: Re: [Pasig-discuss] WORM (Write Once Read Many) AIPs

Jacob,

This is the key point of my argument - the definition of object you have is not the definition of an object that an archive wants to preserve. I'm speaking for people like Tim and me - others are quite happy to build what I term bit-museums. Likewise, what you consider preservation (immutability of a bitstream) is not quite the same as ours - retention of knowledge content - which requires mutability but with immutable previous versions and provenance/audit records. As long as this disconnect between technology and requirements remains the case, object stores are actually of limited use for us in preservation and archiving without considerable additional work.

The 'metadata' that most object stores support (key-value pairs) is pretty useless as far as our metadata requirements go - in the end we have to store XML or triples as separate files/objects. This was an issue when I reviewed the StorageTek 5800 code builds way back, and frankly object storage hasn't moved on much.

Fedora, for all its faults, does actually provide an object view that is meaningful - something that can be a node in a linked-data graph. It can be arbitrarily complex but, equally, could comprise only metadata. It is almost never a file.

Neil

On 2017-05-12 20:29, Jacob Farmer wrote:
> Hi, Neil. Great points. Indeed, hard links only work in a single file system, but they continue pointing to and fro when a file is otherwise moved or renamed.
>
> I personally think of POSIX file systems as object stores that have weak addressing, limited metadata, and that offer mutability as the default.
>
> My preferred definition of an object store is a device that stores objects. My preferred definition of an object is any piece of data that can be individually addressed and manipulated. So, by that definition, POSIX file systems are object stores; so are hard drives. So is Microsoft Exchange, etc.
>
> If you name a file according to a hash or a UUID (the hash could be the UUID), then you have a form of persistent address. As long as no one messes with your file system, the address scheme stays intact.
>
> -----Original Message-----
> From: Neil Jefferies [mailto:neil at jefferies.org]
> Sent: Friday, May 12, 2017 11:25 AM
> To: Jacob Farmer
> Subject: RE: [Pasig-discuss] WORM (Write Once Read Many) AIPs
>
> Good point on the housekeeping!
>
> Most (reasonable) filesystems allow you to specify the inode numbers at creation but yes, it is hard to change afterwards!
>
> But I would really, really avoid hard links - they only work within a single filesystem so they can't be used in tiered or virtual storage systems, and they even break quota controls on regular filesystems. Scale-up thus becomes very difficult with hard links. Symlinks also make it explicit when you are dealing with a reference and can tell you which version of the object held the original - useful provenance that hard links don't capture.
>
> My personal feeling is no for hashes, yes for UUIDs (or other suitably unique object IDs). This allows us to keep all versions of an object in the same root path even though it varies.
> And don't store at a file level - this shotguns object fragments all over the store and makes rebuilds horrible. Many current object stores do this - and consequently don't version effectively - I wish people would understand objects are not files. UUIDs are also consistent in terms of computational time and hashes very much aren't.
>
> There's a big difference in robustness between needing just filesystem metadata to find an object in storage, and requiring filesystem metadata (because underneath all object stores are filesystems - even Seagate's "object" hard drives), object store metadata to map paths to hashes, and object metadata to find all the bits that make up a composite object.
>
> ...and yes, I am saying that most object store vendors have got it wrong. At least as far as archiving is concerned. And they ought to consider why every object store ends up presenting itself as a POSIX filesystem.
>
> Neil
>
> On 2017-05-12 14:33, Jacob Farmer wrote:
>> Two warnings and two suggestions:
>>
>> Warnings:
>>
>> 1) Symlinks and Housekeeping -- It is a common practice to use symlinks to make versioned file collections. If you do this, you should have some kind of housekeeping process that ensures that the symlinks are all working correctly. If files ever have to get migrated, symlinks can break.
>>
>> 2) Check with your file system vendor -- Most removable media file systems have some built-in limitations on the number of inodes (files) that you can have in one file system. If you generate a lot of symlinks, you might overwhelm the file system. Your vendor will know.
>>
>> Suggestions:
>>
>> 1) Hashes for file names -- If your application software maintains a hash for each file, you might consider naming the file according to the hash. Use the first two digits for the parent directory, the next two digits for a sub-directory, and the next two digits for a further sub-directory. Then use the full hash for the file name. This turns your POSIX file system into an object store with uniquely named objects. As a safeguard, you might maintain a separate table or list that associates path names with hashes.
>>
>> 2) Consider using hard links instead of symlinks -- You might use hard links instead of symlinks, presuming that the files are all in the same file system. You still have to watch for file count issues, but you have less housekeeping to do.
>>
>> I hope that helps.
>>
>> Jacob Farmer | Chief Technology Officer | Cambridge Computer | "Artists In Data Storage"
>> Phone 781-250-3210 | jfarmer at CambridgeComputer.com | www.CambridgeComputer.com
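As an editorial illustration of the sharding scheme Jacob describes (first two hex digits of the hash as the parent directory, the next two and the next two as sub-directories, and the full hash as the file name), here is a minimal Python sketch; the storage root is invented, and the whole-file read is a simplification you would replace with streaming for large objects:

    import hashlib
    import shutil
    from pathlib import Path

    def sharded_path(store_root: Path, digest: str) -> Path:
        """aa/bb/cc/<full hash>: two-hex-digit directories, full hash as the file name."""
        return store_root / digest[0:2] / digest[2:4] / digest[4:6] / digest

    def store(store_root: Path, source: Path) -> Path:
        """Copy a file into the hash-named store and return where it landed."""
        digest = hashlib.sha256(source.read_bytes()).hexdigest()
        target = sharded_path(store_root, digest)
        target.parent.mkdir(parents=True, exist_ok=True)
        if not target.exists():  # identical content resolves to the same path, so it is stored once
            shutil.copy2(source, target)
        return target

    # e.g. store(Path("/archive/objects"), Path("report.pdf"))

The separate table of path names against hashes that Jacob suggests as a safeguard would be maintained outside this layout.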
>> -----Original Message-----
>> From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Neil Jefferies
>> Sent: Friday, May 12, 2017 8:06 AM
>> To: Tim.Gollins at nrscotland.gov.uk
>> Cc: pasig-discuss at mail.asis.org
>> Subject: Re: [Pasig-discuss] WORM (Write Once Read Many) AIPs
>>
>> Tim,
>>
>> If we store AIPs unpackaged, as a collection of files in a folder, then object updates could just be a new folder with symlinks to the unchanged parts and the updated parts in place in the folder. The object "location" would be a parent folder for all these version folders - for example, a pairtree (or triple-tree for faster scanning/rebuilds) based on object UUID. Version folders would be named according to date or version number (date might make Memento-compliant access simpler). Creating a new version clones the current version (including links) with a new name and then replaces the updated parts in situ. The final act is to update a "current" symlink in the object. Any update failure will mean "current" is not updated and the partial clone can be discarded.
>>
>> This assumes most updates are metadata and that a diff won't save much compared to a complete new XML file or whatever. I am also assuming that metadata won't be wrappered either (so you can forget METS), so that different types are stored in the most suitable format and are accessed only when required. The problems with round-tripping packaged AIPs for updates rather than diff-ing are repeated by METS wrappering.
>>
>> These may be a virtual folder/filesystem presentation, and underneath an HSM would retrieve files from wherever when they are actually accessed. HSM policy in something like SAM-QFS/Versity/Cray TAS can ensure folders are kept intact when moved to other storage (we could even dereference symlinks when dealing with tape).
>>
>> This can be done with a POSIX filesystem and not much code - Ben O'Steen started something along these lines here: https://github.com/dataflow/RDFDatabank/wiki/What-is-DataBank-and-what-does-it-do%3F
>>
>> Fedora also has a versioning object store that could support this kind of model, but it also adds a fair bit of complexity to be Linked Data Platform compliant.
>>
>> In my parlance I would probably equate "Minimal Ingest" with "Sheer Curation" and APT with Asynchronous Message Driven Workers.
>>
>> Neil
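A minimal sketch of the clone-with-symlinks update Neil outlines, assuming one directory per object (found via a pairtree or similar, outside the scope of the sketch) that already contains at least one numbered version folder and a "current" symlink. The two-step order, build the new version first and repoint "current" last, follows his description; the naming and the helper itself are invented for illustration.

    import os
    from pathlib import Path

    def new_version(object_root: Path, updated_files: dict) -> Path:
        """Create version N+1: symlinks to the unchanged files of the current version,
        updated files written in place, then 'current' repointed as the final act."""
        current = (object_root / "current").resolve()
        versions = [p for p in object_root.iterdir() if p.is_dir() and p.name.startswith("v")]
        new_dir = object_root / f"v{len(versions) + 1:04d}"
        new_dir.mkdir()

        for existing in current.iterdir():            # unchanged parts become symlinks
            if existing.name not in updated_files:
                (new_dir / existing.name).symlink_to(existing)
        for name, data in updated_files.items():      # changed or new parts are real files
            (new_dir / name).write_bytes(data)

        # Final act: atomically repoint 'current'. A failure before this line leaves
        # 'current' naming a complete older version, and the partial clone can be deleted.
        tmp_link = object_root / "current.tmp"
        if tmp_link.is_symlink():
            tmp_link.unlink()
        tmp_link.symlink_to(new_dir)
        os.replace(tmp_link, object_root / "current")
        return new_dir

Note the housekeeping caveat raised elsewhere in the thread: the symlinks point into earlier version folders, so those folders and links need to be checked and kept intact when content is moved between storage tiers.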
This would then effectively >>> result in AIPs (and >>> UIPs) becoming WORM objects with clear benefits that I perceive in >>> managing their reliable and safe storage. >>> >>> I am not sufficiently familiar with the detail of all the different >>> AIP models or implementations, I was wondering if anyone in the team >>> would be able to comment on whether the they know of any AIP models, >>> specifications or implementations that would support such a use >>> case. >>> >>> I have just posted a version of this question to the E-Ark Linked in >>> Group so my apologies to those who see it twice. >>> >>> Many thanks >>> >>> Tim >>> Tim Gollins | Head of Digital Archiving and Director of the NRS >>> Digital Preservation Programme National Records of Scotland | West >>> Register House | Edinburgh EH2 4DF >>> + 44 (0)131 535 1431 / + 44 (0)7974 922614 | >>> tim.gollins at nrscotland.gov.uk | www.nrscotland.gov.uk >>> >>> Preserving the past | Recording the present | Informing the future >>> Follow us on Twitter: @NatRecordsScot | >>> http://twitter.com/NatRecordsScot >>> >>> >>> ******************************************************************** >>> * >>> * This e-mail (and any files or other attachments transmitted with >>> it) is intended solely for the attention of the addressee(s). >>> Unauthorised use, disclosure, storage, copying or distribution of >>> any part of this e-mail is not permitted. If you are not the >>> intended recipient please destroy the email, remove any copies from >>> your system and inform the sender immediately by return. >>> >>> Communications with the Scottish Government may be monitored or >>> recorded in order to secure the effective operation of the system >>> and for other lawful purposes. The views or opinions contained >>> within this e-mail may not necessarily reflect those of the Scottish >>> Government. >>> >>> >>> Tha am post-d seo (agus faidhle neo ceanglan c?mhla ris) dhan neach >>> neo luchd-ainmichte a-mh?in. Chan eil e ceadaichte a chleachdadh ann >>> an d?igh sam bith, a? toirt a-steach c?raichean, foillseachadh neo >>> sgaoileadh, gun chead. Ma ?s e is gun d?fhuair sibh seo le gun >>> fhiosd?, bu choir cur ?s dhan phost-d agus lethbhreac sam bith air >>> an t-siostam agaibh, leig fios chun neach a sgaoil am post-d gun d?il. >>> >>> Dh?fhaodadh gum bi teachdaireachd sam bith bho Riaghaltas na h-Alba >>> air a chl?radh neo air a sgr?dadh airson dearbhadh gu bheil an >>> siostam ag obair gu h-?ifeachdach neo airson adhbhar laghail eile. >>> Dh?fhaodadh nach eil beachdan anns a? phost-d seo co-ionann ri >>> beachdan Riaghaltas na h-Alba. 
----
To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss
_______
PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html
_______________________________________________
Pasig-discuss mailing list
Pasig-discuss at mail.asis.org
http://mail.asis.org/mailman/listinfo/pasig-discuss

From gail at trumantechnologies.com Mon May 15 12:48:42 2017
From: gail at trumantechnologies.com (gail at trumantechnologies.com)
Date: Mon, 15 May 2017 09:48:42 -0700
Subject: [Pasig-discuss] Proposed Digital Preservation Storage Criteria ver. 2 for community discussion
Message-ID: <20170515094842.b554e26909f2beaf9f8ddbf6be9a6600.5a9e0b4cf7.wbe@email09.godaddy.com>

An HTML attachment was scrubbed...
URL:

From randy_stern at harvard.edu Tue May 16 10:05:12 2017
From: randy_stern at harvard.edu (Stern, Randy)
Date: Tue, 16 May 2017 14:05:12 +0000
Subject: [Pasig-discuss] Digital repository storage benchmarking
In-Reply-To:
References: <78ADB971-820E-4450-BDEB-1814B86B19F0@uq.edu.au>
Message-ID: <2EBED878-C03B-41EB-BAA8-E36F949EF821@harvard.edu>

Re costs: For Harvard Library's Digital Repository Service - 2 disk copies plus 2 tape copies - as of July 1, the cost of storage for depositors to the DRS is $1.25/GB/year. This figure is moderately close to the storage hardware costs. The storage cost does not include staff costs, preservation activities, or server costs associated with the core DRS software services, tools, and databases.

Randy
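For a rough sense of scale, applying Harvard's published DRS rate to the roughly 200 TB the CCA expects to manage gives the back-of-the-envelope figure below. This is an editorial calculation only; the two services differ in copy counts and in what the price includes, so it is a ballpark rather than a quote.

    # Harvard DRS rate applied to CCA's projected holdings (illustrative only).
    rate_per_gb_year = 1.25            # USD per GB per year, as quoted above for the DRS
    holdings_gb = 200 * 1000           # about 200 TB, using decimal units (1 TB = 1000 GB)
    print(f"${rate_per_gb_year * holdings_gb:,.0f} per year")  # -> $250,000 per year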
---- To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss _______ PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html _______________________________________________ Pasig-discuss mailing list Pasig-discuss at mail.asis.org http://mail.asis.org/mailman/listinfo/pasig-discuss

From luispo at gmail.com Tue May 16 13:53:39 2017 From: luispo at gmail.com (Louis Suárez-Potts) Date: Tue, 16 May 2017 13:53:39 -0400 Subject: [Pasig-discuss] Digital repository storage benchmarking In-Reply-To: <2EBED878-C03B-41EB-BAA8-E36F949EF821@harvard.edu> References: <78ADB971-820E-4450-BDEB-1814B86B19F0@uq.edu.au> <2EBED878-C03B-41EB-BAA8-E36F949EF821@harvard.edu> Message-ID:

> Now the request: there's a network effect here. The more agencies share data the more useful the data becomes. So can I encourage you all to share that information (anonymously or identifiably) via the costs exchange?

Hi

I'm all for sharing this data, as well as other relevant information, including accounts of how we do things, even when they are mistakes. But an email list is not the best venue; something more pliable, like a wiki or its equivalent? I'm sure there are options. And equally sure that this particular issue has complications related to political location that do need to be made clear, as political mandates (must be within certain political boundaries, say) affect cost, inter alia.
Cheers,
Louis

> On 2017-05-16, at 10:05, Stern, Randy wrote:
>
> Re costs: For Harvard Library's Digital Repository Service - 2 disk copies plus 2 tape copies - as of July 1, the cost of storage for depositors to the DRS is $1.25/GB/year for storage. This figure is moderately close to the storage hardware costs. The storage cost does not include staff costs, preservation activities, or server costs associated with the core DRS software services, tools, and databases.
>
> Randy
>
> On 5/15/17, 4:02 AM, "Pasig-discuss on behalf of William Kilbride" wrote:
>
> Hi All, Hi Tim
>
> This is a super thread and I am learning a tonne. On the subject of costs I can make a recommendation and request ...
>
> The Curation Costs Exchange is a useful thing and well worth a look for anyone looking at comparative costs across the digital preservation lifecycle including storage. It's not been mentioned yet in the discussions, I assume because everyone is already aware of it. But have a look: http://www.curationexchange.org/
>
> The conclusion we drew from the 4C project was that financial planning was a core skill in preservation planning. So to be a 'trusted' repository an institution should be able to demonstrate certain skills in financial planning and be transparent about it. It's expressed more elegantly in the 4C project roadmap: http://www.4cproject.eu/roadmap/
>
> Now the request: there's a network effect here. The more agencies share data the more useful the data becomes. So can I encourage you all to share that information (anonymously or identifiably) via the costs exchange?
>
> All best wishes,
>
> William
>
> -----Original Message----- From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Jake Carroll Sent: 15 May 2017 04:01 To: pasig-discuss at asis.org Subject: Re: [Pasig-discuss] Digital repository storage benchmarking
>
> Certainly interesting.
>
> At the Queensland Brain Institute and the Australian Institute of Bioengineering and Nanotechnology at the University of Queensland, we have around 8.5PB of data under management across our HSM platforms. We currently use Oracle HSM for this task.
>
> We have 256TB of online 'cache' for the data landing location, split across 6 different filesystems that are tuned differently for different types of workloads and different tasks. These workloads are generally categorised into a few functions:
>
> • High IO, large serial writes from instruments
> • Low IO, large serial writes from instruments
> • High IO, granular 'many files, many IOPS' from instruments and computational factors
> • Low IO, granular 'many files, low IOPS' from instruments and computational factors
> • Generic group share
> • Generic user dir
>
> It is an interesting thing to manage and run statistical modelling on in terms of performance analysis and micro-benchmarking of data movement patterns. All the filesystems above are provisioned on 16Gbit/sec FC-connected Hitachi HUS-VM, 10K SAS.
>
> The metadata for these filesystems is around 10 terabytes of Hitachi Accelerated Advanced Flash storage. We have around 3.8 billion files/unique objects under management.
>
> We run a 'disk based copy' (we call that copy1), which is our disk-based VSN or vault. It is around 1PB of ZFS-managed storage sitting inside the very large Hitachi HUS-VM platform.
>
> Our Copy2 and Copy3 are 2 * T10000D Oracle tape media copies in SL3000 storage silos, geographically distributed.
> We do some interesting things with our tape infrastructure, including DIV-always-on, proactive data protection sweeps inside the HSM, and continuous validation checks against the media. We also run STA (tape analytics tools) extra-data-path so we can see *exactly* what each drive is doing at all times. Believe me, we see things that would baffle and boggle the mind (and probably create a healthy sense of paranoia!) if you knew exactly what was going on 'inside there'.
>
> We use finely tuned policy for data automation of movement between tiers so as to minimally impact user experience. Our HSM supports offline file mapping to the Windows client, so people can tell when their files and objects are 'offline'. It is a useful semantic and great for usability.
>
> Interesting thread, this...
>
> -jc
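The 'always on disk consistency' scrubbing Jake describes for the ZFS disk copy is easy to automate alongside checks like the manifest routine above. The sketch below shows one way to drive it from cron with the standard zpool commands; the pool name is a hypothetical placeholder, and the tape-side checks he mentions (tpverify, DIV, STA) are not shown because their invocation is specific to the Oracle HSM products.

import subprocess
import sys

POOL = "copy1pool"  # hypothetical pool name for the on-disk preservation copy

def start_scrub(pool):
    # "zpool scrub" returns immediately; the scrub itself runs in the background,
    # typically for hours on a pool of any real size.
    subprocess.run(["zpool", "scrub", pool], check=True)

def pools_are_healthy():
    # "zpool status -x" prints "all pools are healthy" when no pool reports errors.
    result = subprocess.run(["zpool", "status", "-x"],
                            check=True, capture_output=True, text=True)
    return "all pools are healthy" in result.stdout

if __name__ == "__main__":
    # Typical cron usage: run "scrub" monthly and "check" daily, since a scrub
    # started now will not have finished by the time this script exits.
    if sys.argv[1] == "scrub":
        start_scrub(POOL)
    elif not pools_are_healthy():
        sys.exit("zpool reports errors - do not trust this copy until investigated")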
From william.kilbride at dpconline.org Wed May 17 04:18:18 2017 From: william.kilbride at dpconline.org (William Kilbride) Date: Wed, 17 May 2017 08:18:18 +0000 Subject: [Pasig-discuss] Digital repository storage benchmarking In-Reply-To: References: <78ADB971-820E-4450-BDEB-1814B86B19F0@uq.edu.au> <2EBED878-C03B-41EB-BAA8-E36F949EF821@harvard.edu> Message-ID:

Hi Louis,

Yes you're quite right: a list is great but it's not ideal for sharing this kind of information.

I really do encourage you therefore to look seriously at the Curation Costs Exchange. The 4C project took quite a lot of time to manage not only the legal constraints and anonymization issues but also the different accountancy approaches that can make it hard to compare data meaningfully. Please do take a look (all!): http://www.curationexchange.org/ The more we put in, the more we will get out...

W :-)
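For anyone trying to turn the per-unit figures in this thread into a budget line, the arithmetic is simple. The snippet below applies the $1.25/GB/year Harvard DRS depositor rate quoted by Randy above to a collection of roughly the 200 TB Tim describes; note that, per Randy's caveat, that rate covers the storage itself (two disk plus two tape copies) and excludes staff, preservation activities, and server costs. The decimal TB-to-GB conversion is an assumption made here for the example.

# Illustrative only: the rate and collection size come from the messages above;
# the decimal conversion (1 TB = 1,000 GB) is an assumption of this example.
RATE_USD_PER_GB_YEAR = 1.25   # Harvard DRS depositor rate quoted by Randy
COLLECTION_TB = 200           # approximate collection size Tim mentions
GB_PER_TB = 1000              # use 1024 instead for binary TiB

annual_cost = RATE_USD_PER_GB_YEAR * COLLECTION_TB * GB_PER_TB
print("about $%s per year" % format(annual_cost, ",.0f"))
# prints: about $250,000 per year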
---- To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss _______ PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html _______________________________________________ Pasig-discuss mailing list Pasig-discuss at mail.asis.org http://mail.asis.org/mailman/listinfo/pasig-discuss

From allasia at eurixgroup.com Wed May 17 06:37:50 2017 From: allasia at eurixgroup.com (Walter Allasia) Date: Wed, 17 May 2017 12:37:50 +0200 Subject: [Pasig-discuss] Digital repository storage benchmarking In-Reply-To:
<DB6PR0202MB26163591FDE1B954E03FBB1895E70@DB6PR0202MB2616.eurprd02.prod.outlook.com> References: <78ADB971-820E-4450-BDEB-1814B86B19F0@uq.edu.au> <DB6PR0202MB2616142EFA1F4E14C65E229B95E10@DB6PR0202MB2616.eurprd02.prod.outlook.com> <2EBED878-C03B-41EB-BAA8-E36F949EF821@harvard.edu> <F4B89086-407A-4EEA-8040-B5D04B798487@gmail.com> <DB6PR0202MB26163591FDE1B954E03FBB1895E70@DB6PR0202MB2616.eurprd02.prod.outlook.com> Message-ID:

Hi William, All,

The 4C website is surely worth taking a look at. I got good advice and warnings on digital preservation costs there.

The greatest issue I'm still dealing with is how to get the budget for preservation from the actual stakeholders, who usually are not aware of what is needed and of what is running behind the scenes to keep stuff alive, safe and sound in the long term. Especially in the context of public administration, where I am operating right now, it seems that nobody cares about the cost of storage (or of digital preservation hardware) until managers are asked to plan their annual budget. Suddenly storage costs and hardware obsolescence become something that everybody wants to leave to someone else, because it dramatically cuts their overall budget and because it is not clear how to sell it.

That is exactly the point. My experience shows that people usually perceive digital archives as digital wells with contents buried inside. Managers at public administration offices are no different - even worse, since they have public money to manage. It's not possible to create business models from 'digital wells'. Offered services are the key factor. IT services lay the foundations for revenues. Every so often services are already making use of archives (many times unawares): costs MUST be shared. I believe that the planning of storage and preservation infrastructure MUST also be shared and agreed.

That's the reason why the National Broadcaster has LTO tapes running LTFS and at the same time disk arrays of nearly the same size: it's a use case of preservation infrastructure driven by offered services - that's the point.

Well, that's just my two cents; apologies for the long mail.

Walter Allasia

Walter Allasia, PhD
Project Manager at EURIXGroup {allasia at eurixgroup.com}
Adjunct professor at Physics, University of Torino {walter.allasia at unito.it}
Project Manager Consultant at CSI Piemonte {walter.allasia at consulenti.csi.it}
---- To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss _______ PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html _______________________________________________ Pasig-discuss mailing list Pasig-discuss at mail.asis.org http://mail.asis.org/mailman/listinfo/pasig-discuss

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From bja at kb.dk Wed May 17 02:48:00 2017 From: bja at kb.dk (Bjarne Andersen) Date: Wed, 17 May 2017 06:48:00 +0000 Subject: [Pasig-discuss] Digital repository storage benchmarking In-Reply-To: <2EBED878-C03B-41EB-BAA8-E36F949EF821@harvard.edu> References: <78ADB971-820E-4450-BDEB-1814B86B19F0@uq.edu.au> <2EBED878-C03B-41EB-BAA8-E36F949EF821@harvard.edu> Message-ID:

In Denmark, the Royal Danish Library has developed the open source BitRepository software (www.bitrepository.org). This software handles "nothing but" the preservation of bits. Very basically explained, it is a system for handling multiple copies of data on different "pillars" (different technologies, different locations, different organisations) to ensure copies of data that are as independent as possible. In our own collections we store and preserve more than 4 PB of unique content, meaning that we have over 15 PB of current capacity.

The Royal Danish Library offers bit preservation using this software to other national cultural heritage institutions. Our pricing model basically has two prices - one for ingest (first year) and one for following years (which includes a re-investment budget for periodic migration to new media/technology).

The prices are roughly (per TB/year):

Online (disk): ingest: 500 Euros, following years: 200 Euros
Nearline (tape inside robot): ingest: 156 Euros, following years: 68 Euros
Offline (tape moved to fire-safe box): ingest: 132 Euros, following years: 50 Euros
These are meant for long-term preservation, so there are access prices as well - of course higher for the tape-based storage, and especially for the offline model, where staff need to collect tapes from a box and mount them into the tape robot.

With these prices we can offer a 3-copy setup with e.g. 1 disk and 2 tapes for a total of 750 Euros/TB the first year and 300 Euros/TB in the following years. The prices include everything: hardware, staff, power, media migration, etc.

best
- Bjarne Andersen
Vicedirektør / Deputy Director General
It-udvikling og Infrastruktur / IT Development & Infrastructure
+45 89 46 21 65 / +45 25 66 23 53
bja at kb.dk
Det Kgl. Bibliotek / Royal Danish Library
Victor Albecks Vej 1, DK-8000 Aarhus C
+45 3347 4747
CVR 2898 8842 EAN 5798 000 792142
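Bjarne's per-pillar rates also make it easy to price other copy mixes. The sketch below simply sums the quoted first-year (ingest) and following-year rates for whichever combination of pillars is chosen; the example mix of one disk copy plus two offline tape copies is only one possible reading of '1 disk and 2 tapes', and it comes out close to the roughly 750/300 Euros per TB totals quoted above, which are themselves rounded figures.

# Per-TB/year rates quoted by Bjarne (EUR): (first year incl. ingest, following years).
RATES = {
    "online (disk)": (500, 200),
    "nearline (tape in robot)": (156, 68),
    "offline (tape in fire-safe box)": (132, 50),
}

def price(copy_mix):
    # Sum the published rates for a chosen mix of pillars, per TB stored.
    first_year = sum(RATES[tier][0] for tier in copy_mix)
    following = sum(RATES[tier][1] for tier in copy_mix)
    return first_year, following

if __name__ == "__main__":
    mix = ["online (disk)", "offline (tape in fire-safe box)", "offline (tape in fire-safe box)"]
    first_year, following = price(mix)
    print("first year: %d EUR/TB, following years: %d EUR/TB" % (first_year, following))
    # prints: first year: 764 EUR/TB, following years: 300 EUR/TB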
* Low IO, granular "many files, low IOPS" from instruments and computational factors * Generic group share * Generic user dir. It is an interesting thing to manage and run statistical modelling on in terms of performance analysis and micro-benchmarking of data movement patterns. All the filesystems above are provisioned on 16Gbit/sec FC-connected Hitachi HUS-VM, 10K SAS. The metadata for these filesystems is around 10 terabytes of Hitachi Accelerated Advanced Flash storage. We have around 3.8 billion files/unique objects under management. We run a "disk based copy" (we call that copy1), which is our disk-based VSN or vault. It is around 1PB of ZFS-managed storage sitting inside the very large Hitachi HUS-VM platform. Our Copy2 and Copy3 are 2 * T10000D Oracle tape media copies in SL3000 storage silos, geographically distributed. We do some interesting things with our tape infrastructure, including DIV-always-on, proactive data protection sweeps inside the HSM and continuous validation checks against the media. We also run STA (tape analytics tools) out of the data path, so we can see *exactly* what each drive is doing at all times. Believe me, we see things that would baffle and boggle the mind (and probably create a healthy sense of paranoia!) if you knew exactly what was going on "inside there". We use finely tuned policies to automate data movement between tiers so as to minimally impact the user experience. Our HSM supports offline file mapping to the Windows client, so people can tell when their files and objects are "offline". It is a useful signal and great for usability. We ZFS-scrub the disk copy for "always on" disk consistency, and we also use tpverify commands on the tape media to regularly check the media itself. We're experimenting with implementing fixity shortly too, as the filesystem supports it. As for going "all online": at our scale, we just can't afford yet to walk away from "cold tape" principles. We're just too big. We'd love to rid ourselves of the complexities of it and consider a full cloud-based consumption model, but having crunched the very hard numbers on things such as AWS Glacier and S3, it is a long (long) way more expensive than the relative TCO of running it "on premise" at this stage. My hope is that this will change soon and I can start experimenting with one of my copies being a "cloud library". Interesting thread, this... -jc On 15/5/17, 11:41 am, "Pasig-discuss on behalf of BUNTON, GLENN" wrote: This discussion of the various digital repository storage approaches has been very enlightening and useful so far. I appreciate all the excellent details. There is one piece of information, however, that is missing. Cost? Both initial implementation outlay and ongoing costs. Any general sense of costs would be greatly appreciated. -----Original Message----- From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Steve Knight Sent: Sunday, May 14, 2017 6:44 PM To: 'Sheila Morrissey' ; pasig-discuss at asis.org Subject: Re: [Pasig-discuss] Digital repository storage benchmarking Hi Tim, At the National Library of New Zealand we are storing about 210TB of digital objects in our permanent repository. We have a 25TB online cache, with an online copy of all the digital objects sitting on disk. Three tape copies of the objects are made as soon as they enter the disk archive. One copy remains within the tape library (nearline); the other two copies are sent offsite (offline).
We use Oracle SAM-QFS to manage the storage policies and automatic tiering. We have a similar treatment for our 100TB of test data, which has one less offsite tape copy. We are currently looking at replacing this storage architecture with a mix of Hitachi's HDI and HCP S30 object storage products and our cloud provider's object storage offering. The cloud provider storage includes replication across 3 geographic locations, providing both higher availability and higher resilience than we currently have. By moving to an all-online solution we hope to increase overall performance and make savings through utilising object storage and exiting some services related to current backup and restore processes. Regards Steve -----Original Message----- From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Sheila Morrissey Sent: Saturday, 13 May 2017 5:44 a.m. To: pasig-discuss at asis.org Subject: [Pasig-discuss] FW: Digital repository storage benchmarking Hello, Tim, At Portico (http://www.portico.org/digital-preservation/), we preserve e-journals, e-books, digitized historical collections, and other born-digital scholarly content. Currently, the Portico archive comprises roughly 77.7 million digital objects (we call them "Archival Units", or AUs), totalling over 400 TB and made up of 1.3 billion files. We maintain 3 copies of the archive: 2 on disk in geographically distributed data centers, and a 3rd copy in commercial cloud storage. We create and maintain backups (including fixity checks) using our own custom-written software. I hope this is helpful. Best regards, Sheila Sheila M. Morrissey Senior Researcher ITHAKA 100 Campus Drive Suite 100 Princeton NJ 08540 609-986-2221 sheila.morrissey at ithaka.org ITHAKA (www.ithaka.org) is a not-for-profit organization that helps the academic community use digital technologies to preserve the scholarly record and to advance research and teaching in sustainable ways. We provide innovative services that benefit higher education, including Ithaka S+R, JSTOR, and Portico. -----Original Message----- From: Pasig-discuss [mailto:pasig-discuss-bounces at asis.org] On Behalf Of Tim Walsh Sent: Friday, May 12, 2017 10:16 AM To: pasig-discuss at asis.org Subject: [Pasig-discuss] Digital repository storage benchmarking Dear PASIG, I am currently in the process of benchmarking digital repository storage setups with our Director of IT, and am having trouble finding very much information about other institutions' configurations online. It's very possible that this question has been asked before on-list, but I wasn't able to find anything in the list archives. For context, we are a research museum with significant born-digital archival holdings preparing to manage about 200 TB of digital objects over the next 3 years, replicated several times on various media. The question is what precisely those "various media" will be. Currently, our plan is to store one copy on disk on-site, one copy on disk in a managed off-site facility, and a third copy on LTO sent to a third facility. Before we commit, we'd like to benchmark our plans against other institutions. I have been able to find information about the storage configurations for MoMA and the Computer History Museum (who each wrote blog posts or presented on this topic), but not very many others. So my questions are: * Could you point me to published/available resources outlining other institutions' digital repository storage configurations?
* Or, if you work at an institution, would you be willing to share the details of your configuration on- or off-list? (Any information sent off-list will be kept strictly confidential.) Helpful details would include: the volume of digital objects being stored; how many copies of the data are being stored; which copies are online, nearline, or offline; which media are being used for which copies; and what services/software applications you are using to manage the creation and maintenance of backups. Thank you! Tim - - - Tim Walsh Archiviste, Archives numériques Archivist, Digital Archives Centre Canadien d'Architecture Canadian Centre for Architecture 1920, rue Baile, Montréal, Québec H3H 2S6 T 514 939 7001 x 1532 F 514 939 7020 www.cca.qc.ca Pensez à l'environnement avant d'imprimer ce message Please consider the environment before printing this email Ce courriel peut contenir des renseignements confidentiels. Si vous n'êtes pas le destinataire prévu, veuillez nous en aviser immédiatement. Merci également de supprimer le présent courriel et d'en détruire toute copie. This email may contain confidential information. If you are not the intended recipient, please advise us immediately and delete this email as well as any other copy. Thank you. ---- To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss _______ PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html _______________________________________________ Pasig-discuss mailing list Pasig-discuss at mail.asis.org http://mail.asis.org/mailman/listinfo/pasig-discuss
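Several replies in this thread describe fixity checking as part of managing the backup copies (ZFS scrubs and tpverify at UQ, Portico's custom-written software, and similar audits elsewhere). For readers benchmarking their own setups, here is a minimal sketch of the generic checksum-manifest approach in Python; the function names, manifest layout, and choice of SHA-256 are illustrative assumptions, not any institution's actual tooling.

    import hashlib
    import json
    import pathlib

    def sha256(path, bufsize=1 << 20):
        """Stream a file through SHA-256 so large objects are never read whole into memory."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(bufsize), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def build_manifest(root):
        """Map every file under `root` (relative path -> checksum)."""
        root = pathlib.Path(root)
        return {str(p.relative_to(root)): sha256(p) for p in root.rglob("*") if p.is_file()}

    def audit(root, manifest_file="manifest.json"):
        """Compare the current tree against a stored manifest; return missing and altered files."""
        stored = json.loads(pathlib.Path(manifest_file).read_text())
        current = build_manifest(root)
        missing = sorted(set(stored) - set(current))
        altered = sorted(name for name in stored.keys() & current.keys() if stored[name] != current[name])
        return missing, altered

On ingest, the manifest produced by build_manifest() would be written out (ideally stored apart from the content); a scheduled audit() run then flags missing or altered files so they can be repaired from one of the other copies.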
From luispo at gmail.com Wed May 17 13:24:50 2017 From: luispo at gmail.com (Louis Suárez-Potts) Date: Wed, 17 May 2017 13:24:50 -0400 Subject: [Pasig-discuss] Digital repository storage benchmarking In-Reply-To: References: <78ADB971-820E-4450-BDEB-1814B86B19F0@uq.edu.au> <2EBED878-C03B-41EB-BAA8-E36F949EF821@harvard.edu> Message-ID: > On 2017-05-17, at 04:18, William Kilbride wrote: > > Hi Louis, > > Yes you're quite right: a list is great but it's not ideal for sharing this kind of information. > > I really do encourage you therefore to look seriously at the Curation Costs Exchange. The 4C project took quite a lot of time to manage not only the legal constraints and anonymization issues but also the different accountancy approaches that can make it hard to compare data meaningfully. Please do take a look (all!): http://www.curationexchange.org/ The more we put in the more we will get out... > > W :-) > Thanks, William. I'll look, also with an eye to how such a site would be of use to other open source projects. There are a lot out there, and though there are meta-organisations like Software Conservancy and others, the focus is seldom on finance, let alone the difference trans-nationalisation makes. louis Louis Suárez-Potts, PhD Strategist & Co-Founder Age of Peers www.ageofpeers.com/ Skype: louisiam Twitter: @luispo Tel: +1.416.625.3843 From dave at dpn.org Wed May 17 17:11:44 2017 From: dave at dpn.org (David Pcolar) Date: Wed, 17 May 2017 17:11:44 -0400 Subject: [Pasig-discuss] Digital repository storage benchmarking In-Reply-To: References: <78ADB971-820E-4450-BDEB-1814B86B19F0@uq.edu.au> <2EBED878-C03B-41EB-BAA8-E36F949EF821@harvard.edu> Message-ID: <6B34DD3F-B3A8-4241-9971-C21BAC89F5BF@dpn.org> The Digital Preservation Network (www.dpn.org) is a membership organization dedicated to the long-term preservation of scholarly output. We have a cooperative distributed model that significantly reduces the risks surrounding content preservation. DPN ensures the secure preservation of stored content by leveraging a heterogeneous network that spans diverse geographic, technical, and institutional environments. DPN's preservation process can be expressed in five steps: (1) Content is deposited into the system via an Ingest Node; (2) Content is replicated to at least two other Replicating Nodes and stored in varied repository infrastructures; (3) Content is checked via bit auditing and repair services to ensure the content remains the same over time; (4) Destroyed or corrupted content is restored by DPN; (5) As service provider Nodes enter and leave DPN, content is redistributed to maintain the continuity of preservation services into the far future. We have providers with Disk, Tape, and Cloud infrastructures, replicating copies across the continental US. Our base service model is 3 copies with a preservation assurance of 20 years with a single payment. Current membership pricing is $20,000/year, which includes 5TB of deposits per year.
Deposits above 5TB are a single payment of $2750/TB for a 20-year term ($137.50/TB/Yr) Please contact us for more information: Mary Molinaro Executive Director, Digital Preservation Network mary at dpn.org Dave Pcolar Technical Officer, Digital Preservation Network dave at dpn.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From dwilcox at duraspace.org Thu May 18 12:01:56 2017 From: dwilcox at duraspace.org (David Wilcox) Date: Thu, 18 May 2017 12:01:56 -0400 Subject: [Pasig-discuss] INVITATION: Fedora and Hydra Camp at Oxford Message-ID: DuraSpace and Data Curation Experts are pleased to invite you to attend the Fedora and Hydra Camp at Oxford University, Sept 4 - 8, 2017. The camp will be hosted by Oxford University , Oxford, UK and is supported by Jisc . Training begins with the basics and build toward more advanced concepts?no prior Fedora or Hydra experience is required. Participants can expect to come away with a deep dive Fedora and Hydra learning experience coupled with multiple opportunities for applying hands-on techniques working with experienced trainers from both communities. Registration is limited to the first 40 applicants so register here soon ! An early bird discount is available until July 10. Background Fedora is the robust, modular, open source repository platform for the management and dissemination of digital content. Fedora 4, the latest production version of Fedora, features vast improvements in scalability, linked data capabilities, research data support, modularity, ease of use and more. Hydra is a repository solution that is being used by institutions worldwide to provide access to their digital content (see map ). Hydra provides a versatile and feature rich environment for end-users and repository administrators alike. About Fedora Camp Previous Fedora Camps include the inaugural camp held at Duke University, the West Coast camp at CalTech, and the most recent, NYC camp held at Columbia University. Hydra Camps have been held throughout the US and in the UK and the Republic of Ireland. Most recently, DCE hosted the inaugural Advanced Hydra Camp focusing on advanced Hydra developer skills. The upcoming combined camp curriculum will provide a comprehensive overview of Fedora and Hydra by exploring such topics as: Core & Integrated features Data modeling and linked data Content and Metadata management Migrating to Fedora 4 Deploying Fedora and Hydra in production Ruby, Rails, and collaborative development using Github Introductory Blacklight including search and faceting Preservation Services The curriculum will be delivered by a knowledgeable team of instructors from the Fedora and Hydra communities: David Wilcox (DuraSpace), Andrew Woods (DuraSpace), Mark Bussey (Data Curation Experts), Bess Sadler (Data Curation Experts), Julie Allinson (University of London). -- David Wilcox Fedora Product Manager DuraSpace dwilcox at duraspace.org -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From katherine at educopia.org Thu May 18 13:52:38 2017 From: katherine at educopia.org (Katherine Skinner) Date: Thu, 18 May 2017 13:52:38 -0400 Subject: [Pasig-discuss] Digital repository storage benchmarking In-Reply-To: <6B34DD3F-B3A8-4241-9971-C21BAC89F5BF@dpn.org> References: <78ADB971-820E-4450-BDEB-1814B86B19F0@uq.edu.au> <2EBED878-C03B-41EB-BAA8-E36F949EF821@harvard.edu> <6B34DD3F-B3A8-4241-9971-C21BAC89F5BF@dpn.org> Message-ID: <6A4876CD-D687-4D07-87CD-E40986BFB1AD@educopia.org> I love this thread--thank you for starting it, Tim! The MetaArchive Cooperative started preserving content with six institutions in 2004; it has grown to encompass more than 60 institutions, including through consortial memberships with several regional consortia (in Barcelona and Ohio) and a library alliance (HBCU). Our mission is to provide a strong preservation community as well as an affordable preservation solution for distributed digital preservation for a wide variety of memory-oriented organizations. Our members constantly learn from each other as they compare workflows, tools, approaches, and policies. More details, specific to your questions, Tim: we are actively preserving 1,200+ collections totaling 85TB of content (and that is slated to almost double in the next year) content is ingested via bags (BagIt) and can be submitted in a variety of ways every file is replicated 7 times and stored in 7 secure, geographically distributed locations on infrastructure that includes both physical servers (at some member institutions) and "cloud-based" and VM infrastructures content is regularly audited using LOCKSS voting and polling mechanisms when needed, content is repaired and metadata describing that event is created Other details that may be of interest: pricing is $500/TB for storage fees, plus an annual membership fee of between $3,000-$5,500 depending on the selected category some members host network infrastructure; others pay a small annual fee ($1000) to waive that responsibility MetaArchive is entirely run, owned, and controlled by its members--including pricing decisions Carly Dearborn (Purdue University) is the current Chair of the Steering Committee. If you are interested in learning more, please reach out to me (Katherine at Educopia.org) or Carly (cdearbor at purdue.edu) while the network's facilitator, Sam Meister, is out on paternity leave until early July. Katherine Skinner, PhD Executive Director, Educopia Institute http://educopia.org Working from Greensboro, NC katherine at educopia.org | 404 783 2534 -------------- next part -------------- An HTML attachment was scrubbed... URL: From jkramersmyth at gmail.com Thu May 18 14:18:53 2017 From: jkramersmyth at gmail.com (Jeanne Kramer-Smyth) Date: Thu, 18 May 2017 14:18:53 -0400 Subject: [Pasig-discuss] Digital repository storage benchmarking In-Reply-To: <6A4876CD-D687-4D07-87CD-E40986BFB1AD@educopia.org> References: <78ADB971-820E-4450-BDEB-1814B86B19F0@uq.edu.au> <2EBED878-C03B-41EB-BAA8-E36F949EF821@harvard.edu> <6B34DD3F-B3A8-4241-9971-C21BAC89F5BF@dpn.org> <6A4876CD-D687-4D07-87CD-E40986BFB1AD@educopia.org> Message-ID: What would folks think of all of this amazing information being collected in a shared document somewhere? Jeanne On Thu, May 18, 2017 at 1:52 PM, Katherine Skinner wrote: > I love this thread--thank you for starting it, Tim! 
> > The MetaArchive Cooperative started preserving content with six > institutions in 2004; it has grown to encompass more than 60 institutions, > including through consortial memberships with several regional consortia > (in Barcelona and Ohio) and a library alliance (HBCU). > > Our mission is to provide a strong preservation community as well as an > affordable preservation solution for distributed digital preservation for a > wide variety of memory-oriented organizations. Our members constantly learn > from each other as they compare workflows, tools, approaches, and policies. > > More details, specific to your questions, Tim: > > - we are actively preserving 1,200+ collections totaling 85TB of > content (and that is slated to almost double in the next year) > - content is ingested via bags (BagIt) and can be submitted in a > variety of ways > - every file is replicated 7 times and stored in 7 secure, > geographically distributed locations on infrastructure that includes both > physical servers (at some member institutions) and "cloud-based" and VM > infrastructures > - content is regularly audited using LOCKSS voting and polling > mechanisms > - when needed, content is repaired and metadata describing that event > is created > > Other details that may be of interest: > > - pricing is $500/TB for storage fees, plus an annual membership fee > of between $3,000-$5,500 depending on the selected category > - some members host network infrastructure; others pay a small annual > fee ($1000) to waive that responsibility > - MetaArchive is entirely run, owned, and controlled by its > members--including pricing decisions > > Carly Dearborn (Purdue University) is the current Chair of the Steering > Committee. If you are interested in learning more, please reach out to me ( > Katherine at Educopia.org) or Carly (cdearbor at purdue.edu) while the > network's facilitator, Sam Meister, is out on paternity leave until early > July. > > > > > *Katherine Skinner, PhD* > Executive Director, Educopia Institute > http://educopia.org > > Working from Greensboro, NC > katherine at educopia.org | 404 783 2534 <(404)%20783-2534> > > > > > ---- > To subscribe, unsubscribe, or modify your subscription, please visit > http://mail.asis.org/mailman/listinfo/pasig-discuss > _______ > PASIG Webinars and conference material is at http://www. > preservationandarchivingsig.org/index.html > _______________________________________________ > Pasig-discuss mailing list > Pasig-discuss at mail.asis.org > http://mail.asis.org/mailman/listinfo/pasig-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From arthurpasquinelli at gmail.com Thu May 18 14:40:11 2017 From: arthurpasquinelli at gmail.com (Arthur Pasquinelli) Date: Thu, 18 May 2017 11:40:11 -0700 Subject: [Pasig-discuss] Digital repository storage benchmarking In-Reply-To: References: <78ADB971-820E-4450-BDEB-1814B86B19F0@uq.edu.au> <2EBED878-C03B-41EB-BAA8-E36F949EF821@harvard.edu> <6B34DD3F-B3A8-4241-9971-C21BAC89F5BF@dpn.org> <6A4876CD-D687-4D07-87CD-E40986BFB1AD@educopia.org> Message-ID: <47b8bfa3-b73d-3842-b7e8-e29fa8914628@gmail.com> I was just thinking the same thing since we have had some good discussions now and in the past. Since I have kept a copy of all past PASIG emails, I'll work on it with the other PASIG steering committee members. We are in the middle of some administrative work for PASIG right now, so I'll add this to the things being worked on. 
On 5/18/17 11:18 AM, Jeanne Kramer-Smyth wrote: > What would folks think of all of this amazing information being > collected in a shared document somewhere? > > Jeanne > > On Thu, May 18, 2017 at 1:52 PM, Katherine Skinner > > wrote: > > I love this thread--thank you for starting it, Tim! > > The MetaArchive Cooperative started preserving content with six > institutions in 2004; it has grown to encompass more than 60 > institutions, including through consortial memberships with > several regional consortia (in Barcelona and Ohio) and a library > alliance (HBCU). > > Our mission is to provide a strong preservation community as well > as an affordable preservation solution for distributed digital > preservation for a wide variety of memory-oriented organizations. > Our members constantly learn from each other as they compare > workflows, tools, approaches, and policies. > > More details, specific to your questions, Tim: > > * we are actively preserving 1,200+ collections totaling 85TB of > content (and that is slated to almost double in the next year) > * content is ingested via bags (BagIt) and can be submitted in a > variety of ways > * every file is replicated 7 times and stored in 7 secure, > geographically distributed locations on infrastructure that > includes both physical servers (at some member institutions) > and "cloud-based" and VM infrastructures > * content is regularly audited using LOCKSS voting and polling > mechanisms > * when needed, content is repaired and metadata describing that > event is created > > Other details that may be of interest: > > * pricing is $500/TB for storage fees, plus an annual membership > fee of between $3,000-$5,500 depending on the selected category > * some members host network infrastructure; others pay a small > annual fee ($1000) to waive that responsibility > * MetaArchive is entirely run, owned, and controlled by its > members--including pricing decisions > > Carly Dearborn (Purdue University) is the current Chair of the > Steering Committee. If you are interested in learning more, please > reach out to me (Katherine at Educopia.org > ) or Carly (cdearbor at purdue.edu > ) while the network's facilitator, Sam > Meister, is out on paternity leave until early July. > > > > > *Katherine Skinner, PhD* > Executive Director, Educopia Institute > http://educopia.org > > Working from Greensboro, NC > katherine at educopia.org | 404 783 > 2534 > > > > > ---- > To subscribe, unsubscribe, or modify your subscription, please visit > http://mail.asis.org/mailman/listinfo/pasig-discuss > > _______ > PASIG Webinars and conference material is at > http://www.preservationandarchivingsig.org/index.html > > _______________________________________________ > Pasig-discuss mailing list > Pasig-discuss at mail.asis.org > http://mail.asis.org/mailman/listinfo/pasig-discuss > > > > > > ---- > To subscribe, unsubscribe, or modify your subscription, please visit > http://mail.asis.org/mailman/listinfo/pasig-discuss > _______ > PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html > _______________________________________________ > Pasig-discuss mailing list > Pasig-discuss at mail.asis.org > http://mail.asis.org/mailman/listinfo/pasig-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From sschaefer at ucsd.edu Thu May 18 15:24:03 2017 From: sschaefer at ucsd.edu (Schaefer, Sibyl) Date: Thu, 18 May 2017 19:24:03 +0000 Subject: [Pasig-discuss] Digital repository storage benchmarking In-Reply-To: <47b8bfa3-b73d-3842-b7e8-e29fa8914628@gmail.com> References: <78ADB971-820E-4450-BDEB-1814B86B19F0@uq.edu.au> <2EBED878-C03B-41EB-BAA8-E36F949EF821@harvard.edu> <6B34DD3F-B3A8-4241-9971-C21BAC89F5BF@dpn.org> <6A4876CD-D687-4D07-87CD-E40986BFB1AD@educopia.org> <47b8bfa3-b73d-3842-b7e8-e29fa8914628@gmail.com> Message-ID: I?d like to chime in with some information about the Chronopolis Digital Preservation Network. We were originally funded by the Library of Congress NDIPP program and ingested our first production content in 2008. Chronopolis was designed to preserve hundreds of terabytes of digital data with minimal requirements on the data provider. The single, overriding commitment of the Chronopolis system is to preserve objects in such a way that they can be transmitted back to the original data providers in the exact form in which they were submitted. Chronopolis leverages high-speed networks, mass-scale storage capabilities, and the expertise of the partners in order to provide a geographically distributed, heterogeneous, and highly redundant archive system. Our partners include the University of California San Diego Library, the National Center for Atmospheric Research, The University of Maryland Institute for Advanced Computer Studies, and our newest partner, the Texas Digital Library. Features of the project include: ? Three geographically distributed copies of the data ? Curatorial audit reporting ? Development of best practices for data packaging and sharing We also serve as a founding node in the Digital Preservation Network and partner with DuraSpace to provide our services. We currently preserve over 50 TBs (150 replicated) of data. Our prices vary depending on the ingest mechanism, but the base rate for storage is $286/TB/year for three geographically-distributed copies. Best, Sibyl Sibyl Schaefer Chronopolis Program Manager // Digital Preservation Analyst University of California, San Diego From: Pasig-discuss on behalf of Arthur Pasquinelli Date: Thursday, May 18, 2017 at 11:40 AM To: "pasig-discuss at mail.asis.org" Subject: Re: [Pasig-discuss] Digital repository storage benchmarking I was just thinking the same thing since we have had some good discussions now and in the past. Since I have kept a copy of all past PASIG emails, I'll work on it with the other PASIG steering committee members. We are in the middle of some administrative work for PASIG right now, so I'll add this to the things being worked on. On 5/18/17 11:18 AM, Jeanne Kramer-Smyth wrote: What would folks think of all of this amazing information being collected in a shared document somewhere? Jeanne On Thu, May 18, 2017 at 1:52 PM, Katherine Skinner > wrote: I love this thread--thank you for starting it, Tim! The MetaArchive Cooperative started preserving content with six institutions in 2004; it has grown to encompass more than 60 institutions, including through consortial memberships with several regional consortia (in Barcelona and Ohio) and a library alliance (HBCU). Our mission is to provide a strong preservation community as well as an affordable preservation solution for distributed digital preservation for a wide variety of memory-oriented organizations. Our members constantly learn from each other as they compare workflows, tools, approaches, and policies. 
More details, specific to your questions, Tim: * we are actively preserving 1,200+ collections totaling 85TB of content (and that is slated to almost double in the next year) * content is ingested via bags (BagIt) and can be submitted in a variety of ways * every file is replicated 7 times and stored in 7 secure, geographically distributed locations on infrastructure that includes both physical servers (at some member institutions) and "cloud-based" and VM infrastructures * content is regularly audited using LOCKSS voting and polling mechanisms * when needed, content is repaired and metadata describing that event is created Other details that may be of interest: * pricing is $500/TB for storage fees, plus an annual membership fee of between $3,000-$5,500 depending on the selected category * some members host network infrastructure; others pay a small annual fee ($1000) to waive that responsibility * MetaArchive is entirely run, owned, and controlled by its members--including pricing decisions Carly Dearborn (Purdue University) is the current Chair of the Steering Committee. If you are interested in learning more, please reach out to me (Katherine at Educopia.org) or Carly (cdearbor at purdue.edu) while the network's facilitator, Sam Meister, is out on paternity leave until early July. Katherine Skinner, PhD Executive Director, Educopia Institute http://educopia.org Working from Greensboro, NC katherine at educopia.org | 404 783 2534 ---- To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss _______ PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html _______________________________________________ Pasig-discuss mailing list Pasig-discuss at mail.asis.org http://mail.asis.org/mailman/listinfo/pasig-discuss ---- To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss _______ PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html _______________________________________________ Pasig-discuss mailing list Pasig-discuss at mail.asis.org http://mail.asis.org/mailman/listinfo/pasig-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From randy_stern at harvard.edu Thu May 18 15:34:27 2017 From: randy_stern at harvard.edu (Stern, Randy) Date: Thu, 18 May 2017 19:34:27 +0000 Subject: [Pasig-discuss] Digital repository storage benchmarking In-Reply-To: References: <78ADB971-820E-4450-BDEB-1814B86B19F0@uq.edu.au> <2EBED878-C03B-41EB-BAA8-E36F949EF821@harvard.edu> <6B34DD3F-B3A8-4241-9971-C21BAC89F5BF@dpn.org> <6A4876CD-D687-4D07-87CD-E40986BFB1AD@educopia.org> <47b8bfa3-b73d-3842-b7e8-e29fa8914628@gmail.com> Message-ID: <06489D92-8E52-454C-8EC6-0A9EF3B04AC0@harvard.edu> Thanks for sharing this, $286/TB/year for 3 copies ? are the copies tape only? Does this include real time access to disk copies, or is it a dark archive? It would be great to have all these factors broken out in the shared repository of informaiton that Art Pasquinelli wrotw about! Randy From: Pasig-discuss on behalf of "Schaefer, Sibyl" Date: Thursday, May 18, 2017 at 3:24 PM To: Arthur Pasquinelli , "pasig-discuss at mail.asis.org" Subject: Re: [Pasig-discuss] Digital repository storage benchmarking I?d like to chime in with some information about the Chronopolis Digital Preservation Network. 
We were originally funded by the Library of Congress NDIPP program and ingested our first production content in 2008. Chronopolis was designed to preserve hundreds of terabytes of digital data with minimal requirements on the data provider. The single, overriding commitment of the Chronopolis system is to preserve objects in such a way that they can be transmitted back to the original data providers in the exact form in which they were submitted. Chronopolis leverages high-speed networks, mass-scale storage capabilities, and the expertise of the partners in order to provide a geographically distributed, heterogeneous, and highly redundant archive system. Our partners include the University of California San Diego Library, the National Center for Atmospheric Research, The University of Maryland Institute for Advanced Computer Studies, and our newest partner, the Texas Digital Library. Features of the project include: • Three geographically distributed copies of the data • Curatorial audit reporting • Development of best practices for data packaging and sharing We also serve as a founding node in the Digital Preservation Network and partner with DuraSpace to provide our services. We currently preserve over 50 TBs (150 replicated) of data. Our prices vary depending on the ingest mechanism, but the base rate for storage is $286/TB/year for three geographically-distributed copies. Best, Sibyl Sibyl Schaefer Chronopolis Program Manager // Digital Preservation Analyst University of California, San Diego From: Pasig-discuss on behalf of Arthur Pasquinelli Date: Thursday, May 18, 2017 at 11:40 AM To: "pasig-discuss at mail.asis.org" Subject: Re: [Pasig-discuss] Digital repository storage benchmarking I was just thinking the same thing since we have had some good discussions now and in the past. Since I have kept a copy of all past PASIG emails, I'll work on it with the other PASIG steering committee members. We are in the middle of some administrative work for PASIG right now, so I'll add this to the things being worked on. On 5/18/17 11:18 AM, Jeanne Kramer-Smyth wrote: What would folks think of all of this amazing information being collected in a shared document somewhere? Jeanne On Thu, May 18, 2017 at 1:52 PM, Katherine Skinner > wrote: I love this thread--thank you for starting it, Tim! The MetaArchive Cooperative started preserving content with six institutions in 2004; it has grown to encompass more than 60 institutions, including through consortial memberships with several regional consortia (in Barcelona and Ohio) and a library alliance (HBCU). Our mission is to provide a strong preservation community as well as an affordable preservation solution for distributed digital preservation for a wide variety of memory-oriented organizations. Our members constantly learn from each other as they compare workflows, tools, approaches, and policies. 
More details, specific to your questions, Tim: * we are actively preserving 1,200+ collections totaling 85TB of content (and that is slated to almost double in the next year) * content is ingested via bags (BagIt) and can be submitted in a variety of ways * every file is replicated 7 times and stored in 7 secure, geographically distributed locations on infrastructure that includes both physical servers (at some member institutions) and "cloud-based" and VM infrastructures * content is regularly audited using LOCKSS voting and polling mechanisms * when needed, content is repaired and metadata describing that event is created Other details that may be of interest: * pricing is $500/TB for storage fees, plus an annual membership fee of between $3,000-$5,500 depending on the selected category * some members host network infrastructure; others pay a small annual fee ($1000) to waive that responsibility * MetaArchive is entirely run, owned, and controlled by its members--including pricing decisions Carly Dearborn (Purdue University) is the current Chair of the Steering Committee. If you are interested in learning more, please reach out to me (Katherine at Educopia.org) or Carly (cdearbor at purdue.edu) while the network's facilitator, Sam Meister, is out on paternity leave until early July. Katherine Skinner, PhD Executive Director, Educopia Institute http://educopia.org Working from Greensboro, NC katherine at educopia.org | 404 783 2534 ---- To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss _______ PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html _______________________________________________ Pasig-discuss mailing list Pasig-discuss at mail.asis.org http://mail.asis.org/mailman/listinfo/pasig-discuss ---- To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss _______ PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html _______________________________________________ Pasig-discuss mailing list Pasig-discuss at mail.asis.org http://mail.asis.org/mailman/listinfo/pasig-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From sschaefer at ucsd.edu Thu May 18 15:41:44 2017 From: sschaefer at ucsd.edu (Schaefer, Sibyl) Date: Thu, 18 May 2017 19:41:44 +0000 Subject: [Pasig-discuss] Digital repository storage benchmarking In-Reply-To: <06489D92-8E52-454C-8EC6-0A9EF3B04AC0@harvard.edu> References: <78ADB971-820E-4450-BDEB-1814B86B19F0@uq.edu.au> <2EBED878-C03B-41EB-BAA8-E36F949EF821@harvard.edu> <6B34DD3F-B3A8-4241-9971-C21BAC89F5BF@dpn.org> <6A4876CD-D687-4D07-87CD-E40986BFB1AD@educopia.org> <47b8bfa3-b73d-3842-b7e8-e29fa8914628@gmail.com> <06489D92-8E52-454C-8EC6-0A9EF3B04AC0@harvard.edu> Message-ID: Hi Randy- The copies are all on hard disks, allowing us to run vigorous fixity checking routines. It is a dark archive, so there is no real time access to copies. Let me know if you have more questions! Best, Sibyl Sibyl Schaefer Chronopolis Program Manager // Digital Preservation Analyst University of California, San Diego From: "Stern, Randy" Date: Thursday, May 18, 2017 at 12:34 PM To: "Schaefer, Sibyl" , Arthur Pasquinelli , "pasig-discuss at mail.asis.org" Subject: Re: [Pasig-discuss] Digital repository storage benchmarking Thanks for sharing this, $286/TB/year for 3 copies ? are the copies tape only? 
Does this include real time access to disk copies, or is it a dark archive? It would be great to have all these factors broken out in the shared repository of informaiton that Art Pasquinelli wrotw about! Randy From: Pasig-discuss on behalf of "Schaefer, Sibyl" Date: Thursday, May 18, 2017 at 3:24 PM To: Arthur Pasquinelli , "pasig-discuss at mail.asis.org" Subject: Re: [Pasig-discuss] Digital repository storage benchmarking I?d like to chime in with some information about the Chronopolis Digital Preservation Network. We were originally funded by the Library of Congress NDIPP program and ingested our first production content in 2008. Chronopolis was designed to preserve hundreds of terabytes of digital data with minimal requirements on the data provider. The single, overriding commitment of the Chronopolis system is to preserve objects in such a way that they can be transmitted back to the original data providers in the exact form in which they were submitted. Chronopolis leverages high-speed networks, mass-scale storage capabilities, and the expertise of the partners in order to provide a geographically distributed, heterogeneous, and highly redundant archive system. Our partners include the University of California San Diego Library, the National Center for Atmospheric Research, The University of Maryland Institute for Advanced Computer Studies, and our newest partner, the Texas Digital Library. Features of the project include: • Three geographically distributed copies of the data • Curatorial audit reporting • Development of best practices for data packaging and sharing We also serve as a founding node in the Digital Preservation Network and partner with DuraSpace to provide our services. We currently preserve over 50 TBs (150 replicated) of data. Our prices vary depending on the ingest mechanism, but the base rate for storage is $286/TB/year for three geographically-distributed copies. Best, Sibyl Sibyl Schaefer Chronopolis Program Manager // Digital Preservation Analyst University of California, San Diego From: Pasig-discuss on behalf of Arthur Pasquinelli Date: Thursday, May 18, 2017 at 11:40 AM To: "pasig-discuss at mail.asis.org" Subject: Re: [Pasig-discuss] Digital repository storage benchmarking I was just thinking the same thing since we have had some good discussions now and in the past. Since I have kept a copy of all past PASIG emails, I'll work on it with the other PASIG steering committee members. We are in the middle of some administrative work for PASIG right now, so I'll add this to the things being worked on. On 5/18/17 11:18 AM, Jeanne Kramer-Smyth wrote: What would folks think of all of this amazing information being collected in a shared document somewhere? Jeanne On Thu, May 18, 2017 at 1:52 PM, Katherine Skinner > wrote: I love this thread--thank you for starting it, Tim! The MetaArchive Cooperative started preserving content with six institutions in 2004; it has grown to encompass more than 60 institutions, including through consortial memberships with several regional consortia (in Barcelona and Ohio) and a library alliance (HBCU). Our mission is to provide a strong preservation community as well as an affordable preservation solution for distributed digital preservation for a wide variety of memory-oriented organizations. Our members constantly learn from each other as they compare workflows, tools, approaches, and policies. 
More details, specific to your questions, Tim: * we are actively preserving 1,200+ collections totaling 85TB of content (and that is slated to almost double in the next year) * content is ingested via bags (BagIt) and can be submitted in a variety of ways * every file is replicated 7 times and stored in 7 secure, geographically distributed locations on infrastructure that includes both physical servers (at some member institutions) and "cloud-based" and VM infrastructures * content is regularly audited using LOCKSS voting and polling mechanisms * when needed, content is repaired and metadata describing that event is created Other details that may be of interest: * pricing is $500/TB for storage fees, plus an annual membership fee of between $3,000-$5,500 depending on the selected category * some members host network infrastructure; others pay a small annual fee ($1000) to waive that responsibility * MetaArchive is entirely run, owned, and controlled by its members--including pricing decisions Carly Dearborn (Purdue University) is the current Chair of the Steering Committee. If you are interested in learning more, please reach out to me (Katherine at Educopia.org) or Carly (cdearbor at purdue.edu) while the network's facilitator, Sam Meister, is out on paternity leave until early July. Katherine Skinner, PhD Executive Director, Educopia Institute http://educopia.org Working from Greensboro, NC katherine at educopia.org | 404 783 2534 ---- To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss _______ PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html _______________________________________________ Pasig-discuss mailing list Pasig-discuss at mail.asis.org http://mail.asis.org/mailman/listinfo/pasig-discuss ---- To subscribe, unsubscribe, or modify your subscription, please visit http://mail.asis.org/mailman/listinfo/pasig-discuss _______ PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html _______________________________________________ Pasig-discuss mailing list Pasig-discuss at mail.asis.org http://mail.asis.org/mailman/listinfo/pasig-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From jkramersmyth at gmail.com Thu May 18 15:43:23 2017 From: jkramersmyth at gmail.com (Jeanne Kramer-Smyth) Date: Thu, 18 May 2017 15:43:23 -0400 Subject: [Pasig-discuss] Digital repository storage benchmarking In-Reply-To: <47b8bfa3-b73d-3842-b7e8-e29fa8914628@gmail.com> References: <78ADB971-820E-4450-BDEB-1814B86B19F0@uq.edu.au> <2EBED878-C03B-41EB-BAA8-E36F949EF821@harvard.edu> <6B34DD3F-B3A8-4241-9971-C21BAC89F5BF@dpn.org> <6A4876CD-D687-4D07-87CD-E40986BFB1AD@educopia.org> <47b8bfa3-b73d-3842-b7e8-e29fa8914628@gmail.com> Message-ID: Dear Arthur, It could even be something as simple as a shared Google Spreadsheet that people can add their institution's information to over time. It would be great if it could be a living document. Thanks! Jeanne On Thu, May 18, 2017 at 2:40 PM, Arthur Pasquinelli < arthurpasquinelli at gmail.com> wrote: > I was just thinking the same thing since we have had some good discussions > now and in the past. Since I have kept a copy of all past PASIG emails, > I'll work on it with the other PASIG steering committee members. We are in > the middle of some administrative work for PASIG right now, so I'll add > this to the things being worked on. 
> > > On 5/18/17 11:18 AM, Jeanne Kramer-Smyth wrote: > > What would folks think of all of this amazing information being collected > in a shared document somewhere? > > Jeanne > > On Thu, May 18, 2017 at 1:52 PM, Katherine Skinner > wrote: > >> I love this thread--thank you for starting it, Tim! >> >> The MetaArchive Cooperative started preserving content with six >> institutions in 2004; it has grown to encompass more than 60 institutions, >> including through consortial memberships with several regional consortia >> (in Barcelona and Ohio) and a library alliance (HBCU). >> >> Our mission is to provide a strong preservation community as well as an >> affordable preservation solution for distributed digital preservation for a >> wide variety of memory-oriented organizations. Our members constantly learn >> from each other as they compare workflows, tools, approaches, and policies. >> >> More details, specific to your questions, Tim: >> >> - we are actively preserving 1,200+ collections totaling 85TB of >> content (and that is slated to almost double in the next year) >> - content is ingested via bags (BagIt) and can be submitted in a >> variety of ways >> - every file is replicated 7 times and stored in 7 secure, >> geographically distributed locations on infrastructure that includes both >> physical servers (at some member institutions) and "cloud-based" and VM >> infrastructures >> - content is regularly audited using LOCKSS voting and polling >> mechanisms >> - when needed, content is repaired and metadata describing that event >> is created >> >> Other details that may be of interest: >> >> - pricing is $500/TB for storage fees, plus an annual membership fee >> of between $3,000-$5,500 depending on the selected category >> - some members host network infrastructure; others pay a small annual >> fee ($1000) to waive that responsibility >> - MetaArchive is entirely run, owned, and controlled by its >> members--including pricing decisions >> >> Carly Dearborn (Purdue University) is the current Chair of the Steering >> Committee. If you are interested in learning more, please reach out to me ( >> Katherine at Educopia.org) or Carly (cdearbor at purdue.edu) while the >> network's facilitator, Sam Meister, is out on paternity leave until early >> July. >> >> >> >> >> *Katherine Skinner, PhD* >> Executive Director, Educopia Institute >> http://educopia.org >> >> Working from Greensboro, NC >> katherine at educopia.org | 404 783 2534 <%28404%29%20783-2534> >> >> >> >> >> ---- >> To subscribe, unsubscribe, or modify your subscription, please visit >> http://mail.asis.org/mailman/listinfo/pasig-discuss >> _______ >> PASIG Webinars and conference material is at >> http://www.preservationandarchivingsig.org/index.html >> _______________________________________________ >> Pasig-discuss mailing list >> Pasig-discuss at mail.asis.org >> http://mail.asis.org/mailman/listinfo/pasig-discuss >> >> > > > ---- > To subscribe, unsubscribe, or modify your subscription, please visithttp://mail.asis.org/mailman/listinfo/pasig-discuss > _______ > PASIG Webinars and conference material is at http://www.preservationandarchivingsig.org/index.html > _______________________________________________ > Pasig-discuss mailing listPasig-discuss at mail.asis.orghttp://mail.asis.org/mailman/listinfo/pasig-discuss > > > > ---- > To subscribe, unsubscribe, or modify your subscription, please visit > http://mail.asis.org/mailman/listinfo/pasig-discuss > _______ > PASIG Webinars and conference material is at http://www. 
> preservationandarchivingsig.org/index.html > _______________________________________________ > Pasig-discuss mailing list > Pasig-discuss at mail.asis.org > http://mail.asis.org/mailman/listinfo/pasig-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From corey at coppul.ca Thu May 18 18:33:34 2017 From: corey at coppul.ca (Corey Davis) Date: Thu, 18 May 2017 15:33:34 -0700 Subject: [Pasig-discuss] Digital preservation advocacy document examples Message-ID: Hi folks, In COPPUL we've recently established a major digital preservation initiative (http://www.coppul.ca/blog/2017/04/coppul-builds-future-establishes-coppul-digital-preservation-network), and one of the things we're hoping to develop in the near future is something we're tentatively calling a "digital preservation advocacy toolkit." This would primarily consist of a graphics-heavy document or template intended to brief senior academic administrators. We want them to better understand the issues so our member libraries are in a better position to advocate for resources. There's some great stuff in the CESSDA cost-benefit advocacy toolkit (and thanks to the CESSDA folks for making this CC-BY), but I'm wondering if others out there have developed briefing documents or other resources for senior academic administrators in relation to digital preservation that they might be willing to share. Many thanks, Corey -- Corey Davis Digital Preservation Network Manager Council of Prairie and Pacific University Libraries (COPPUL) corey at coppul.ca (250) 472-5024 office (778) 677-5746 cell From Stephen.Abrams at ucop.edu Fri May 19 13:14:37 2017 From: Stephen.Abrams at ucop.edu (Stephen Abrams) Date: Fri, 19 May 2017 17:14:37 +0000 Subject: [Pasig-discuss] Digital repository storage benchmarking In-Reply-To: <06489D92-8E52-454C-8EC6-0A9EF3B04AC0@harvard.edu> References: <78ADB971-820E-4450-BDEB-1814B86B19F0@uq.edu.au> <2EBED878-C03B-41EB-BAA8-E36F949EF821@harvard.edu> <6B34DD3F-B3A8-4241-9971-C21BAC89F5BF@dpn.org> <6A4876CD-D687-4D07-87CD-E40986BFB1AD@educopia.org> <47b8bfa3-b73d-3842-b7e8-e29fa8914628@gmail.com> <06489D92-8E52-454C-8EC6-0A9EF3B04AC0@harvard.edu> Message-ID: CDL's Merritt repository supports long-term preservation and current (and long-term) access. All content is actively replicated and audited, with either 2 or at least 6 copies, depending upon how you count things. We rely on two cloud service providers: one, a private cloud at UCSD/SDSC, which itself manages 3 copies on independent arrays; and the other, AWS S3 (for bright content) and Glacier (for dark content), which manage at least 3 copies spread across availability zones. Both clouds perform local fixity audit, which Merritt overlays with its own audit (except for Glacier content, for which this would be cost prohibitive under Glacier's transactional fee structure). Merritt's nominal price point is $650/TB/year, but the cost accounting is done by adding up the used byte-days (billed at $0.00000000000178/B/day, i.e., $650 / 365 days / 10^12 bytes) over the service year. This lets our customers avoid having to worry about the timing of their contributions. So 1 TB deposited on the first day of the year will accrue the full $650 cost; that same 1 TB deposited on the last day of the year will accrue only $1.78 in cost. --sla Stephen Abrams Associate Director, UC Curation Center California Digital Library University of California, Office of the President Stephen.Abrams at ucop.edu +1 510-987-0370 -------------- next part -------------- An HTML attachment was scrubbed... URL:
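As a quick check of Merritt's byte-day arithmetic (a small sketch only; the rate comes straight from the figures above, 1 TB is taken as 10^12 bytes, and the helper name is ours):

    TB = 10**12                           # Merritt bills per byte; 1 TB = 10^12 bytes
    RATE_PER_BYTE_DAY = 650 / 365 / TB    # ~ $0.00000000000178 per byte per day

    def storage_cost(bytes_stored, days_stored):
        """Cost accrued for holding `bytes_stored` bytes for `days_stored` days."""
        return bytes_stored * days_stored * RATE_PER_BYTE_DAY

    print(round(storage_cost(1 * TB, 365), 2))  # held the full year -> 650.0
    print(round(storage_cost(1 * TB, 1), 2))    # held one day       -> 1.78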