[Asis-l] a look back at 20 years of the internet archive and its approach to web archiving

kalev leetaru kalev.leetaru5 at gmail.com
Mon Jan 18 12:52:13 EST 2016


Apologies for cross-posting. I thought many of you would be interested in
my latest piece, "The Internet Archive Turns 20: A Behind The Scenes Look At
Archiving The Web," which explores the Internet Archive's evolution from
custodian to curator to collector over the last 20 years and its changing
approach to web archiving. Especially relevant is how the Archive is
organized like a physical library archive brought into the digital era,
rather than a traditional search engine with a preservation component, and
its collage approach to weaving together millions of files in thousands of
collections from hundreds of partners.

For those contemplating using the Archive's holdings for research on the
evolution of the web, the piece goes into detail about how the Archive is
put together, its use of a collage approach to archiving rather than a
single centralized and standardized continuous crawl, and the tremendous
variability in the priorities and composition of its crawls. Of particular
note, the Wayback Machine and its data stores access just a small portion
of the Archive's web holdings.

Those concerned about the impact of robots.txt on archiving the web will
be interested in the discussion of the Archive's evolving stance on both
robots.txt and administrative exclusions, as well as the approaches taken
by several national libraries.



http://www.forbes.com/sites/kalevleetaru/2016/01/18/the-internet-archive-turns-20-a-behind-the-scenes-look-at-archiving-the-web/


~Kalev
http://kalevleetaru.com/
http://blog.gdeltproject.org/