[Pasig-discuss] Risks of encryption & compression built into storage options?

Matthew Addis matthew at addisdigital.co.uk
Sun Mar 19 16:45:24 EDT 2017


I feel a bit late to the party given so much interesting discussion so 
far, but could I humbly offer the following suggestions.  These are 
based on a long history of working on archival storage solutions and 
with many large archives.

Do a bit of risk assessment and think broadly about the threats to 
long-term data safety.  Along with David, I'd suggest that economic 
factors are right up there near the top.  Not having the budget, even 
temporarily, to sustain a programme of storage maintenance, upgrades, 
migrations, hosting fees, DR, auditing, testing exit plans etc. is 
often a bigger risk than any specific storage technology or worries 
about things like bit-rot.  Next come people problems, e.g. outsider 
attacks such as hacking, staff not following procedures, or skilled 
members of a team leaving without handing over knowledge of how a 
system works or how to use it.  You can have the most reliable 
storage technology on the planet, but it's not much good if content 
can easily be deleted by unauthorised staff or there are security 
vulnerabilities that open you up to ransomware or worse.  Things that 
can help with thinking in the right direction include DRAMBORA, TRAC, 
TDR, DSA and other risk and audit criteria.

If you are going to consider storage in detail, then treat storage as a 
system and recognise that modes of data loss include things like bugs in 
the software and firmware that writes data to storage and reads it back 
again.  Unreliable media or the risks that come from specific techniques 
such as compression or encryption have a part to play, but system-level 
errors and lack of error detection and handling are often a bigger 
issue, not least because as David points out this can lead to correlated 
errors, systematic failures and much higher data loss rates than you 
might expect from looking at individual components of the system in 
isolation.   I'd suggest that diversity and independent copies of the 
data are your friends.  Multiple copies in different geographic 
locations using different storage technologies that are kept as isolated 
as far as possible helps reduce correlated errors and stop data 
corruption/loss from spreading.  If you want more information about 
storage error and failure modes in the real world then the proceedings 
of FAST are a great place to look.

One copy of the data stored offline with a third party can be a 
lifesaver.  This provides a great 'firebreak' against the various 
threats to online systems or services - it's very hard for software 
bugs, hackers, disgruntled staff, replication of data corruption and 
other nasties to propagate to data that's stored in a box at the 
bottom of a salt mine with armed guards at the door.  This is of 
course in addition to the one or more copies of the data that are 
kept online so the data is also easy to access and use, which is 
often crucial to showing its value and economic sustainability.

There's no magic number of copies when storing data, e.g. three.  Treat 
mathematical models with caution.  I know because I've built several 
in the past to simulate the risk of data loss, so I know what's 
involved.  Models rarely cover all the threats or match the real 
world.  Storage
models and simulations are based on a lot of assumptions and are 
notoriously hard to calibrate.  Instead of trying to justify a specific 
number of copies, I reckon you are better off following a maturity 
model, e.g. NDSA preservation levels or DPCMM, i.e. start with a 
recognised strategy and work from there.  If you need to build a 
business case that justifies the cost of storing data properly and 
safely then I'd base it on NDSA preservation levels, the DPC handbook, 
what a big archive does or some other evidence of what's been found to 
work in practice rather than some mathematical model.
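
As a toy illustration of why I'd be cautious, here's a minimal Monte 
Carlo sketch in Python.  Everything in it is a made-up assumption - 
the per-copy loss rate, the chance of a shared-cause event (software 
bug, ransomware etc.) and the lack of any repair - so it only shows 
how much the answer depends on the independence assumption, not what 
the real number is.

    # Toy Monte Carlo model of losing every copy of an object.
    # All rates below are illustrative assumptions, not real data,
    # and there is no repair/replacement of lost copies.
    import random

    YEARS = 10
    ANNUAL_COPY_LOSS = 0.01   # assumed chance a single copy is lost per year
    CORRELATED_EVENT = 0.001  # assumed chance per year of an event hitting all copies
    TRIALS = 50_000

    def prob_all_copies_lost(n_copies, correlated):
        losses = 0
        for _ in range(TRIALS):
            copies = [True] * n_copies           # True = copy still intact
            for _ in range(YEARS):
                if correlated and random.random() < CORRELATED_EVENT:
                    copies = [False] * n_copies  # shared-cause failure hits every copy
                copies = [ok and random.random() >= ANNUAL_COPY_LOSS for ok in copies]
            if not any(copies):
                losses += 1
        return losses / TRIALS

    for n in (2, 3, 4):
        print(n, "copies",
              "independent:", prob_all_copies_lost(n, correlated=False),
              "with correlated events:", prob_all_copies_lost(n, correlated=True))

With purely independent failures the loss probability drops off 
quickly as you add copies; with even a small correlated term it 
barely moves, which is exactly why the number of copies on its own 
tells you less than you'd hope.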

There's nothing wrong with proprietary storage solutions per se, 
including erasure coding, and at some level or other we all use them 
- there's no such thing as an open source hard drive after all.  As 
Neil says, the key is to treat all storage as a black box and have 
independent checks and balances, e.g. using your own checksums and 
fixity checks to make sure data has been received by the storage 
system correctly and that integrity hasn't been lost over time. 
Multiple independent copies of the data that are regularly checked 
give assurance that the data is OK, and if one copy has issues then 
you'll pick it up quickly and can fix it using one of the other 
copies.
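
To make that concrete, here's a minimal sketch of an independent 
fixity check, assuming the copies are just files on locally mounted 
paths.  The paths, the manifest and the choice of SHA-256 are purely 
illustrative, not a description of any particular product or service.

    # Minimal independent fixity check across multiple copies.
    # Checksums are assumed to have been recorded at ingest.
    import hashlib
    from pathlib import Path

    def sha256(path, chunk_size=1024 * 1024):
        """Stream the file so large objects don't need to fit in memory."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def check_copies(manifest, copy_roots):
        """Compare every copy of every object against its ingest checksum."""
        for name, expected in manifest.items():
            for root in copy_roots:
                actual = sha256(Path(root) / name)
                status = "OK" if actual == expected else "MISMATCH - repair from another copy"
                print(f"{root}/{name}: {status}")

    # Hypothetical usage with two independently managed copies:
    # check_copies({"report.pdf": "9f86d0..."}, ["/mnt/site-a", "/mnt/site-b"])

Run regularly, something this simple already gives you the 
independent check and the early warning that lets you repair a bad 
copy from a good one.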

Compression and encryption aren't necessarily risk multipliers if you 
choose carefully, and there can sometimes be benefits.  For example, 
intra-frame compression in video or counter-mode encryption means 
that if one part of a file is corrupted then you don't necessarily 
lose everything.  Another upside is that compression can, in some 
circumstances, allow you to make more copies of the data for the same 
budget.  For example, I'd take two lossless compressed video files 
stored in two geographically separate locations over a single copy 
stored in an uncompressed format that takes up twice the storage 
space.  I'd also 
take erasure-coded storage plus an offline escrowed copy of the data 
over a single cloud provider irrespective of how many copies they make 
internally.
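
Here's a small sketch of the counter-mode point, using AES-CTR from 
the third-party 'cryptography' package (an illustrative choice on my 
part, nothing mandated by the discussion above): flip one byte of the 
stored ciphertext and only the corresponding byte of the decrypted 
output is damaged, rather than the whole file.

    # Demonstrate that corruption in CTR-mode ciphertext stays local.
    import os
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    key, nonce = os.urandom(32), os.urandom(16)
    plaintext = b"A" * 64                  # stand-in for archived content

    enc = Cipher(algorithms.AES(key), modes.CTR(nonce)).encryptor()
    ciphertext = bytearray(enc.update(plaintext) + enc.finalize())

    ciphertext[10] ^= 0xFF                 # simulate one corrupted byte in storage

    dec = Cipher(algorithms.AES(key), modes.CTR(nonce)).decryptor()
    recovered = dec.update(bytes(ciphertext)) + dec.finalize()

    damaged = [i for i, (a, b) in enumerate(zip(plaintext, recovered)) if a != b]
    print("corrupted byte positions after decryption:", damaged)  # -> [10]

The contrast is with something like inter-frame video compression or 
a whole-file archive format, where one bad byte can take out far more 
of the content.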

IMHO, it's all about trading off costs, risks and the value of 
content.  There isn't a one-size-fits-all solution.

Links to pretty much everything above are in some slides and speaker 
notes for a talk I do for the DPC:
https://doi.org/10.6084/m9.figshare.4584859

There's also the DPC handbook, which has sections on storage and fixity 
that build upon the NDSA levels and take a risk assessment approach:
http://dpconline.org/handbook/technical-solutions-and-tools/fixity-and-checksums
http://dpconline.org/handbook/organisational-activities/storage

If anyone wants to know how we apply the above at Arkivum then I'd be 
happy to do a follow-up post and talk about our approach to keeping data 
safe to the level that allows us to offer a data integrity guarantee.

Cheers,

Matthew



Matthew Addis

Chief Technology Officer

Arkivum

tel: +44 1249 405060

mob: +44 7703 393374

email: matthew.addis at arkivum.com <mailto:matthew.addis at arkivum.com>

web: www.arkivum.com <http://www.arkivum.com/>

twitter: @arkivum