By Yuen-Hsien Tseng, Nov. 8, 2002
Reuters-21578 is a test collection for evaluating automatic text categorization techniques. Although it is widely used in many research studies, few have reported the details of how it is used.
For example: How are the documents cleaned? Which documents are used in the training set and which in the test set? How exactly is performance measured?
Slightly different experimental setups (even on the same dataset) may lead to somewhat unfair comparisons among the results of different studies.
This page provides some simple tools to prepare the Reuters-21578 dataset for text categorization experiments. Through the use of these tools, it is hoped that future studies will share exactly the same experimental setup, so that comparisons among different studies can be fairer.
Three tools written in Perl are as follows (download here, 8K bytes):
Note: The Reuters-21578 dataset is originally saved in 22 files, each containing up to 1,000 documents. The following tools assume that each document is saved separately in an HTML file in the same directory.
GetTitleBody.pl

Synopsis: Given that the Reuters-21578 data are in a directory with each document saved in a '.htm' file, extract the title and body of each document successively and save them into a file (with the same name but a '.txt' extension) in another directory. See the attached texts for an example.

Examples:
  perl -s GetTitleBody.pl /test-html /test-txt > test-log.txt
    => Extract the testing documents, whose NEWID ranges from 14826 to 21576.
    => Number of testing documents: 3019.
  perl -s GetTitleBody.pl /train-html /train-txt > train-log.txt
    => Extract the training documents, whose NEWID ranges from 1 to 14818.
    => Number of training documents: 7770.
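The title/body extraction step can be sketched as follows. This is a hedged illustration, not the actual GetTitleBody.pl logic: it assumes each per-document file contains the Reuters-21578 SGML-style <TITLE> and <BODY> tags, and the helper name extract_title_body is made up for this example.

```python
import re

def extract_title_body(sgml_text):
    """Pull the <TITLE> and <BODY> contents out of one Reuters-21578
    document.  The tag names follow the Reuters-21578 markup; the exact
    cleaning performed by GetTitleBody.pl may differ."""
    def grab(tag):
        m = re.search(r'<%s>(.*?)</%s>' % (tag, tag), sgml_text,
                      re.DOTALL | re.IGNORECASE)
        return m.group(1).strip() if m else ''
    return grab('TITLE'), grab('BODY')

# A miniature document in the Reuters-21578 style (content abbreviated):
doc = """<REUTERS NEWID="1">
<TITLE>BAHIA COCOA REVIEW</TITLE>
<BODY>Showers continued throughout the week ...</BODY>
</REUTERS>"""
title, body = extract_title_body(doc)
```

In a real run, one would loop over the '.htm' files in the input directory and write title plus body to the matching '.txt' file in the output directory.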
GetCat.pl

Synopsis:
  1. Get the category information from the Reuters HTML files.
  2. Update the category information such that only categories appearing in both the training and the testing documents survive.
  3. Dump the categories used in the training or the test set (for validation).

Examples:
  perl -s GetCat.pl /test-html > test-cat.txt
    => Get category information from the test set
  perl -s GetCat.pl /train-html > train-cat.txt
    => Get category information from the training set
  perl -s GetCat.pl -Oupdate train-cat.txt test-cat.txt
    => Update category information
  perl -s GetCat.pl -Odump train-cat.txt
    => Dump used categories
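The "update" step (step 2) amounts to intersecting the category sets of the two splits. A minimal sketch of that idea, assuming category information is held as a mapping from document ID to a list of category labels (the function name and the labels below are invented for illustration):

```python
def update_categories(train_cats, test_cats):
    # Keep only categories that appear in BOTH the training and the test
    # set, mirroring step 2 of GetCat.pl.
    used = ({c for cats in train_cats.values() for c in cats} &
            {c for cats in test_cats.values() for c in cats})
    def prune(cat_map):
        return {doc: [c for c in cats if c in used]
                for doc, cats in cat_map.items()}
    return prune(train_cats), prune(test_cats)

train = {'1': ['earn', 'acq'], '2': ['rare-topic']}
test  = {'14826': ['earn'], '14827': ['acq', 'grain']}
new_train, new_test = update_categories(train, test)
# 'rare-topic' and 'grain' occur on only one side, so they are dropped.
```

Documents whose categories are all pruned away end up with an empty label list, which a study would then have to decide how to handle.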
prf.pl

Synopsis: Given a classification result file, calculate the precision, recall, and F1 values as document-wise and category-wise averages.

Each line in the result file should have 3 columns separated by ' : ':
  "document_ID : System_generated_output : Provided_Answer"
The categories in the system output are separated by tabs ("\t"), and each category may be followed by a colon and then a score.

Example:
  Doc_1 : cat:8 dog:4 : cat duck
  Doc_2 : goat dog : cat

Syntax:
  perl -s prf.pl [-Odetail] [-OCats=n] [-OSort=1,0] result_file.txt
where
  -Odetail   : report averages for each category
  -OCats=n   : calculate averages only for n categories
  -OSort=1,0 : 1 = sort descending, 0 = sort ascending

Note: These averages (except the document-wise averages, which are defined here by Yuen-Hsien Tseng) are calculated following the method described by Yiming Yang on the ddlbeta discussion list. (See the attached text.)
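The per-document part of this computation can be sketched as below. This is a hedged illustration, not prf.pl itself: the helper names parse_line and prf are invented, and it assumes the answer column uses the same tab-separated layout as the system output. The mean of the per-document values would give the document-wise averages.

```python
def parse_line(line):
    # One result line: "document_ID : system_output : answer".
    # System categories are tab-separated; "cat:8" means category "cat"
    # with score 8 (the score is ignored here).
    doc_id, output, answer = [f.strip() for f in line.split(' : ')]
    predicted = {c.split(':')[0] for c in output.split('\t') if c}
    gold = {c for c in answer.split('\t') if c}
    return doc_id, predicted, gold

def prf(predicted, gold):
    # Precision, recall, and F1 for one document (sets of categories).
    tp = len(predicted & gold)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

doc_id, pred, gold = parse_line('Doc_1 : cat:8\tdog:4 : cat\tduck')
p, r, f1 = prf(pred, gold)
# The system got "cat" right but missed "duck" and added "dog",
# so precision, recall, and F1 are each 0.5 for this document.
```

Category-wise (macro/micro) averages would instead aggregate the counts per category over all documents, following Yang's definitions.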