[sigtag-l] ICWSM 2009 Data Challenge
Emma L. Tonkin
e.tonkin at ukoln.ac.uk
Sun Oct 26 18:28:40 EDT 2008
Hi all (hello again, to those of you who I've already encountered at
this year's ASIS&T AM),
I saw this on the IR mailing list and thought that it might be
interesting for the more mathematically inclined SIG-TAGgers.
-------- Original Message --------
Subject: [SIG-IRList] Data Challenge/ICWSM 2009: International
Conference on Weblogging and Social Media, San Jose, CA, USA; May 17-20,
2009
ICWSM 2009
International Conference on Weblogging and Social Media
San Jose, CA, USA; May 17-20, 2009
Data Challenge
Call for Participation
http://www.icwsm.org/2009/data
Continuing the ICWSM tradition, ICWSM 2009 is making a dataset
available to researchers in the blog and social media fields. We
invite you to download the dataset, explore it, learn something
interesting about it, and submit a paper about it to ICWSM 2009.
The dataset, provided by Spinn3r.com, is a set of 44 million blog
posts made between August 1st and October 1st, 2008. Each post
includes the text as syndicated, as well as metadata such as the
blog's homepage and timestamps. The data is formatted in XML and is
further arranged into tiers that roughly approximate search engine
ranking. The total size of the dataset is 142 GB uncompressed (27 GB
compressed).
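At this size the XML is unlikely to fit in memory, so a streaming
parser is probably the practical way in. What follows is only a
minimal sketch in Python, not the dataset's documented format: the
element names (item, title, link, description) and the iter_posts
helper are assumptions made for illustration, and the real Spinn3r
schema may well differ.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical element names; check the actual dataset schema before use.
ITEM_TAG = "item"
FIELDS = ("title", "link", "description")


def iter_posts(source):
    """Yield one dict per <item> element without loading the whole file."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == ITEM_TAG:
            # Copy the fields out as plain strings before freeing the element.
            yield {name: elem.findtext(name, default="") for name in FIELDS}
            root.clear()  # discard already-processed elements held by the root


# A tiny in-memory sample standing in for one of the dataset files.
SAMPLE = b"""<dataset>
  <item><title>First</title><link>http://example.org/1</link><description>Hello</description></item>
  <item><title>Second</title><link>http://example.org/2</link><description>World</description></item>
</dataset>"""

posts = list(iter_posts(io.BytesIO(SAMPLE)))
for post in posts:
    print(post["title"], post["link"])
```

The same generator should accept an open file or a gzip stream in
place of the in-memory sample, so the compressed files could be read
without unpacking them first.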
(We may also release additional datasets. Stay tuned!)
For details on how to get the dataset, including a usage agreement,
please see the data page on the conference website,
http://www.icwsm.org/2009/data/. There is also a mailing list and
Google Code site for sharing ideas and resources.
This dataset spans a number of big news events (the Olympics; both US
presidential nominating conventions; the beginnings of the financial
crisis; ...) as well as everything else you might expect to find
posted to blogs. ICWSM invites research studies of this data,
including but not limited to
- link analysis
- social network extraction
- tracing the evolution of news
- blog search and filtering
- psychological, sociological, ethnographic, or personality-based
studies
- analysis of influence among bloggers
- blog summarization and discourse analysis
Instructions for submitting papers to ICWSM may be found at
http://icwsm.org/2009/cfp.shtml. When submitting your paper, indicate
that it makes use of the dataset. Dataset papers will be reviewed for
the main conference, and additionally for presentation at the data
challenge workshop to take place on May 20th, 2009 (the last day of
the conference). While we anticipate that several dataset papers may
appear in the main conference, the data challenge workshop will
provide an opportunity for in-depth discussion of the dataset in a
more focused forum.
We will be making a collaborative website available for sharing tools,
indexes, or other extracts of the dataset. Please see the ICWSM
website for links.
Ian Soboroff, NIST
Akshay Java, UMBC
ICWSM 2009 Data Chairs