FW: Electronic Archives--The Nightmare continues

Mon Nov 24 12:25:43 EST 2003

Thought this would interest our members. Best wishes for Thanksgiving. For
all our friends outside the US, best greetings for the approaching New Year.
Gene Garfield

When responding, please attach my original message
__________________________________________________
Eugene Garfield, PhD. email:  garfield at codex.cis.upenn.edu
home page: http://www.eugenegarfield.org/
Tel: 215-243-2205 Fax 215-387-1266
President, The Scientist LLC. http://www.the-scientist.com/
3535 Market St., Phila. PA 19104-3389
Chairman Emeritus, ISI http://www.isinet.com/
3501 Market Street, Philadelphia, PA 19104-3302
Past President, American Society for Information Science and Technology
(ASIS&T) http://www.asis.org/

-----Original Message-----
From: Lucy Rowland [mailto:lrowland at uga.edu]
Sent: Monday, November 24, 2003 11:39 AM
To: grapevine at listserv.uga.edu; BSDNET-L; North Carolina Chapter
Subject: Electronic Archives--The Nightmare continues

http://www.washingtonpost.com/wp-dyn/articles/A8730-2003Nov23.html

On the Web, Research Work Proves Ephemeral
Electronic Archivists Are Playing Catch-Up in Trying to Keep Documents
From Landing in History's Dustbin
By Rick Weiss
Washington Post Staff Writer
Monday, November 24, 2003; Page A08

It was in the mundane course of getting a scientific paper published
that physician Robert Dellavalle came to the unsettling realization
that the world was dissolving before his eyes.

The world, that is, of footnotes, references and Web pages.

Dellavalle, a dermatologist with the Veterans Affairs Medical Center in
Denver, had co-written a research report featuring dozens of footnotes --
many of which referred not to books or journal articles but, as is
increasingly the case these days, to Web sites that he and his
colleagues had used to substantiate their findings.

Problem was, it took about two years for the article to wind its way to
publication. And by that time, many of the sites they had cited had
moved to other locations on the Internet or disappeared altogether,
rendering useless all those Web addresses -- also known as uniform
resource locators (URLs) -- they had provided in their footnotes.

"Every time we checked, some were gone and others had moved," said
Dellavalle, who is on the faculty at the University of Colorado Health
Sciences Center. "We thought, 'This is an interesting phenomenon
itself. We should look at this.' "

He and his co-workers have done just that, and what they have found is
not reassuring to those who value having a permanent record of
scientific progress. In research described in the journal Science last
month, the team looked at footnotes from scientific articles in three
major journals -- the New England Journal of Medicine, Science and
Nature -- at three months, 15 months and 27 months after publication.
The prevalence of inactive Internet references grew during those
intervals from 3.8 percent to 10 percent to 13 percent.

"I think of it like the library burning in Alexandria," Dellavalle
said, referring to the 48 B.C. sacking of the ancient world's greatest
repository of knowledge. "We've had all these hundreds of years of
stuff available by interlibrary loan, but now things just a few years
old are disappearing right under our noses really quickly."

Dellavalle's concerns reflect those of a growing number of scientists
and scholars who are nervous about their increasing reliance on a
medium that is proving far more ephemeral than archival. In one recent
study, one-fifth of the Internet addresses used in a Web-based high
school science curriculum disappeared over 12 months.

Another study, published in January, found that 40 percent to 50
percent of the URLs referenced in articles in two computing journals
were inaccessible within four years.

"It's a huge problem," said Brewster Kahle, digital librarian at the
Internet Archive in San Francisco. "The average lifespan of a Web page
today is 100 days. This is no way to run a culture."

Of course, even conventional footnotes often lead to dead ends. Some
experts have estimated that as many as 20 percent to 25 percent of all
published footnotes have typographical errors, which can lead people to
the wrong volume or issue of a sought-after reference, said Sheldon
Kotzin, chief of bibliographic services at the National Library of
Medicine in Bethesda.

But the Web's relentless morphing affects a lot more than footnotes.
People are increasingly dependent on the Web to get information from
companies, organizations and governments. Yet, of the 2,483 British
government Web sites, for example, 25 percent change their URL each
year, said David Worlock of Electronic Publishing Services Ltd. in
London.

That matters in part because some documents exist only as Web pages --
for example, the British government's dossier on Iraqi weapons. "It
only appeared on the Web," Worlock said. "There is no definitive
reference where future historians might find it."

Web sites become inaccessible for many reasons. In some cases
individuals or groups that launched them have moved on and have removed
the material from the global network of computer systems that makes up
the Web. In other cases the sites' handlers have moved the material to
a different virtual address (the URL that users type in at the top of
the browser page) without providing a direct link from the old address
to the new one.

When computer users try to access a URL that has died or moved to a new
location, they typically get what is called a "404 Not Found" message,
which reads in part: "The page cannot be displayed. The page you are
looking for is currently unavailable."
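A check of the kind Dellavalle's team describes could be sketched as follows. This is only an illustration, not the method used in the Science study: the function names (fetch_status, classify, inactive_share) are made up for this note, and treating every non-2xx/3xx response as "inactive" is an assumption.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    request = Request(url, method="HEAD")
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # The server answered, but with an error code such as 404.
        return err.code
    except (URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, or malformed URL.
        return None


def classify(status):
    """Map a status code (or None) to a coarse label."""
    if status is None:
        return "unreachable"
    if status == 404:
        return "not found"
    if 200 <= status < 400:
        return "active"
    return "error"


def inactive_share(urls, fetch=fetch_status):
    """Fraction of urls whose label is anything other than 'active'."""
    if not urls:
        return 0.0
    inactive = sum(1 for url in urls if classify(fetch(url)) != "active")
    return inactive / len(urls)
```

Passing a different fetch function makes it possible to try the counting logic without touching the network. Note that a page that has moved but redirects properly counts as active here, which matches the article's point that the trouble arises when no link is left from the old address to the new one.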

So common are such occurrences today, and so iconic has that message
become in the Internet era, that at least one eclectic band has named
itself "404 Not Found," and humorists have launched countless knockoffs
of the page -- including www.mamselle.ca/error.html, which looks like a
standard error page but scolds people for spending too much time on
their computers ("This page cannot be displayed because you need some
fresh air . . .") and www.coxar.pwp.blueyonder.co.uk, which offers
political commentary about the U.S. war in Iraq ("The weapons you are
looking for are currently unavailable.").

Not all apparently inaccessible Web sites are really beyond reach.
Several organizations, including the popular search engine Google and
Kahle's Internet Archive (www.archive.org), are taking snapshots of Web
pages and archiving them as fast as they can so they can be viewed even
after they are pulled down from their sites. The Internet Archive
already contains more than 200 terabytes of information (a terabyte is
a million million bytes) -- equivalent to about 200 million books.
Every month it is adding 20 more terabytes, equivalent to the number of
words in the entire Library of Congress.
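As a rough check on those figures, and as a sketch of how an archived copy can be requested, consider the snippet below. The helper names are invented for this note. The address pattern web.archive.org/web/<timestamp>/<url> is the one the Internet Archive's Wayback Machine uses today; the default timestamp of "2003" is an arbitrary choice, and the service redirects to whichever capture it holds nearest that date, if any.

```python
TERABYTE = 10**12  # "a million million bytes", as the article defines it


def bytes_per_book(archive_terabytes=200, books=200_000_000):
    """Bytes per book implied by the article's 200 TB ~ 200 million books."""
    return archive_terabytes * TERABYTE // books


def wayback_url(original_url, timestamp="2003"):
    """Build a Wayback Machine address for original_url.

    timestamp is a YYYYMMDDhhmmss string or any leading part of one.
    """
    return f"https://web.archive.org/web/{timestamp}/{original_url}"


print(bytes_per_book())  # 1000000, i.e. about one megabyte per book
print(wayback_url("http://www.archive.org/", "20031124"))
```

The first figure shows that the article's equivalence assumes roughly one megabyte of text per book.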

"We're trying to make sure there's a good historical record of at least
some subsets of the Web, and at least some record of other parts,"
Kahle said. "We're injecting the past into the present."

But with an estimated 7 million new pages added to the Web every day,
archivists can do little more than play catch-up. So others are
creating new indexing and retrieval systems that can find Web pages
that have wandered to new addresses.

One such system, known as DOI (for digital object identifier), assigns
a virtual but permanent bar code of sorts to participating Web pages.
Even if the page moves to a new URL address, it can always be found via
its unique DOI.

Standard browsers cannot by themselves find documents by their DOIs.
For now, at least, users must use go-between "registration agencies" --
such as one called CrossRef -- and "handle servers," which together
work like digital switchboards to lead subscribers to the DOI-labeled
pages they seek. A hodgepodge of other retrieval systems is cropping
up, as well -- all part of the increasingly desperate effort to keep
the ballooning Web's thoughts accessible.
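The lookup step can be illustrated with a short sketch. It assumes the public resolver at https://doi.org/, which redirects a DOI to the page's current address; the function name doi_to_url and the light validation are made up for this note, and 10.1000/182 is used only as a sample identifier.

```python
from urllib.parse import quote

RESOLVER = "https://doi.org/"  # assumed public resolver address


def doi_to_url(doi):
    """Turn a DOI such as '10.1000/182' into a resolver address."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    prefix, slash, suffix = doi.partition("/")
    # Every DOI has a prefix starting with '10.' and a non-empty suffix.
    if not (prefix.startswith("10.") and slash and suffix):
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path.
    return RESOLVER + quote(doi, safe="/")


print(doi_to_url("doi:10.1000/182"))  # https://doi.org/10.1000/182
```

Because the citation stores only the identifier, the publisher can move the page and update the registry entry without the cited address changing.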

If it all sounds complicated, it is. But consider the stakes: The Web
contains unfathomably more information than did the Alexandria library.
If our culture ends up unable to retrieve and use that information,
then all that knowledge will, in effect, have gone up in smoke.

Research editor Margot Williams contributed to this report.

Lucy M. Rowland, MS, MLS, CNU
Head, Science Collections & Research Facilities
University of Georgia Libraries
Athens, GA 30602-7412
lrowland at uga.edu
+1-706-542-6643
FAX: +1-706-542-7907
www.libs.uga.edu/science/science.html

"Human subtlety will never devise an invention more beautiful, more simple,
or more direct than does Nature." --Leonardo da Vinci

"Always do right. It will gratify some people and astonish the rest." --Mark
Twain

________________________________________________________________________
This email has been scanned for all viruses by the MessageLabs Email
Security System.
