[Sigia-l] Searchable File Names
Andrew McNaughton
andrew at scoop.co.nz
Sun Sep 22 14:17:45 EDT 2002
On Sun, 22 Sep 2002, David R. Austen wrote:
> Friday, September 13, 2002, 9:25:12 PM, you wrote:
>
> AM> On Fri, 13 Sep 2002, Charles Hanson wrote:
>
> >> I wonder whether anyone on the list has had success with developing a
> >> strategy to create unique filenames that are "meaningful" enough to be
> >> searched for by administrative users (editors) within a content
> >> production environment. For the most part, the files in question are
> >> articles or the constituent elements of them (images, flash files,
> >> textual elements and so on).
> >>
> >> That is, the filename itself could contain some indication of:
> >>
> >> Source/Creator
> >> Title
> >> Date Published
> >>
> >> Users would search for some string in the filename.
> >>
> >> Advice or suggestions? Many thanks.
>
> AM> There are a *lot* of competing aspects of the file's content, role, etc
> AM> that might be considered here. Probably your actual file names will be
> AM> based on some of these. My gut reaction though is that over-specifying
> AM> how file names are to be constructed is going to be a bad idea in most
> AM> cases. If you need detailed searchability, you need metadata, and the
> AM> filename just isn't really big enough to store that metadata.
>
> AM> Andrew
On Sun, 22 Sep 2002, David R. Austen wrote:
> Hello, Andrew:
>
> You make a good point here: a complicated system won't be used by many
> people. In my own realm, I'd use a faceted system more often to create
> meaningful file names--if I had more time for that!
>
> First, I like a system that automatically ensures a truly unique file
> name--machine-generated numbers and (especially) letters.
Sometimes automatic and arbitrary naming is useful, but there are usually
other factors involved. On scoop.co.nz I archive old news stories using
a URL based on publication date/time and a checksum of some of the
story's metadata. This guarantees unique filenames, but just as
importantly it lets me select or organise stories by date (almost always
required when dealing with news stories) without reference to the file
contents. That matters for producing result sets, and even the search
index files themselves, in a timely manner.
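To make that concrete, here is a minimal sketch of that kind of naming
scheme. The metadata fields and the choice of MD5 are my assumptions for
illustration; the actual scoop.co.nz scheme just combines the publication
date/time with a checksum of some story metadata.

    # Sketch only: a filename built from publication date/time plus a
    # short checksum of story metadata.  The fields and the use of MD5
    # are assumptions, not the actual scoop.co.nz code.
    import hashlib
    from datetime import datetime

    def archive_filename(published, source, title):
        stamp = published.strftime("%Y%m%d%H%M")      # sortable by date
        digest = hashlib.md5(
            ("%s|%s" % (source, title)).encode("utf-8")
        ).hexdigest()[:8]                             # guarantees uniqueness
        return "%s_%s.html" % (stamp, digest)

    # e.g. archive_filename(datetime(2002, 9, 22, 14, 17),
    #                       "Scoop", "Searchable File Names")
    # -> "200209221417_xxxxxxxx.html"

Because the date comes first, a plain sort of the filenames gives a
chronological listing without opening a single file.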
And of course there are the assorted .xml and .html suffixes on the file
names on scoop, which matter for the correct operation of the web server.
Again, for performance it's important to know something about a file
without reference to its contents.
> Second, the organization would specify and supply a structured
> folder-naming system that would essentially describe whatever it
> contained. Use of this could even be a requirement of employment by
> that organization. It would allow moving files to other folders
> without difficulty.
>
> Or, employees could be allowed to use their own, more personalized
> file-naming system, but still strictly within that supplied folder
> structure.
>
> Either approach could be used for searching. Files and the folder
> location system "belong" to the employer and will ensure long-term
> usability of information long after the employee has departed, regardless
> of file-naming.
All of what you suggest *might* be quite reasonable, depending on the
requirements of the particular collection. I certainly understand the
appeal of this system, but I think it simplifies things too far. Any
system which does not allow local file-naming conventions to be chosen to
suit future requirements had better be very homogeneous and static. There
are just too many different reasons to select from a wide variety of
file-naming conventions.
In most cases I'd have concerns about a system which expects filenames to
stand alone independently of the folder. This would break relative file
references. When you take a copy of a collection of files for your
website in order to work on the next version, would you then start by
renaming all of the files and adjusting all the URLs? What about
people's bookmarks? Surely it's simpler to treat the fact that the
whole set is in a different folder as adequate to identify these as
separate documents.
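As a hypothetical illustration (the paths and names below are invented),
a relative reference resolves against whichever folder the page happens
to live in, so a copied set keeps working without any renaming:

    # Hypothetical paths, for illustration only.
    from pathlib import PurePosixPath

    def resolve(page, relative_link):
        # A relative reference is resolved against the page's own folder.
        return PurePosixPath(page).parent / relative_link

    print(resolve("/www/live/index.html", "images/logo.gif"))
    # -> /www/live/images/logo.gif
    print(resolve("/www/next/index.html", "images/logo.gif"))
    # -> /www/next/images/logo.gif

If every filename had to encode its version instead (index-v2.html and so
on), each of those internal links, and every external bookmark, would
have to be rewritten for each new copy.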
You might choose to follow the W3C's solution to this problem when
handling different versions of its standards documents, e.g.:
http://www.w3.org/TR/html
http://www.w3.org/TR/html4
http://www.w3.org/TR/html401
http://www.w3.org/TR/1999/REC-html401-19991224
http://www.w3.org/TR/1999/PR-html40-19990824
Some of these currently refer to the same document, but that probably
won't always be so. That's the whole point of the system: the meaning of
each URL is well defined in a way that will be durable.
Obviously this is a carefully thought out system, well tuned to the
requirements of finding the right version of the formal specification in
question. What would happen though in the face of a policy which dictated
that each of these should also include the author's name? You could make
it work, but as the authors names change over time, you would need a
similar set of aliases that omit the authors names in order for URLs and
other references to have any longevity. For every extra facet you squeeze
in you make it that much more complex.
Also consider that the coherence of this system depends in part on the
header of each of these documents containing metadata which describes
the document's relationship to other standards documents. This must be
kept up to date, and I'm sure the W3C's site includes a fairly involved
system for maintaining these headers.
Andrew