By Yuen-Hsien Tseng, Nov. 8, 2002
Reuters-21578 is a test collection for evaluating automatic text categorization techniques. Although it is widely used in many research studies, few have reported the details of how it is used.
For example: How are the documents cleaned? Which documents are used in the training set and which in the test set? How exactly is performance measured?
Slightly different experimental setups (even on the same dataset) may lead to somewhat unfair comparisons among the results of different studies.
This page provides some simple tools to prepare the Reuters-21578 dataset for text categorization experiments. Through the use of these tools, it is hoped that future studies will share exactly the same experimental setup, so that comparisons among different studies can be fairer.
Three tools written in Perl are as follows (download here, 8K bytes):
Note: The Reuters-21578 dataset is originally saved in 22 files, each containing up to 1,000 documents. The following tools assume that each document is saved separately in an HTML file in the same directory.
GetTitleBody.pl

Synopsis: Given that the Reuters-21578 data are in a directory with each document saved in a '.htm' file, extract the title and body of each document successively and save them into a file (with the same name but a '.txt' extension) in another directory. See the attached texts for an example.

Examples:
  perl -s GetTitleBody.pl /test-html /test-txt > test-log.txt
    => Extract the testing documents, whose NEWID ranges from 14826 to 21576.
    => Number of testing documents: 3019.
  perl -s GetTitleBody.pl /train-html /train-txt > train-log.txt
    => Extract the training documents, whose NEWID ranges from 1 to 14818.
    => Number of training documents: 7770.
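The title/body extraction step can be sketched as follows. This is a hedged illustration, not the actual GetTitleBody.pl logic: it assumes each per-document file contains the Reuters-21578 SGML-style <TITLE> and <BODY> tags, and the helper name extract_title_body is made up for this example.

```python
import re

def extract_title_body(sgml_text):
    """Pull the <TITLE> and <BODY> contents out of one Reuters-21578
    document.  The tag names follow the Reuters-21578 markup; the exact
    cleaning performed by GetTitleBody.pl may differ."""
    def grab(tag):
        m = re.search(r'<%s>(.*?)</%s>' % (tag, tag), sgml_text,
                      re.DOTALL | re.IGNORECASE)
        return m.group(1).strip() if m else ''
    return grab('TITLE'), grab('BODY')

# A miniature document in the Reuters-21578 style (content abbreviated):
doc = """<REUTERS NEWID="1">
<TITLE>BAHIA COCOA REVIEW</TITLE>
<BODY>Showers continued throughout the week ...</BODY>
</REUTERS>"""
title, body = extract_title_body(doc)
```

In a real run, one would loop over the '.htm' files in the input directory and write title plus body to the matching '.txt' file in the output directory.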
GetCat.pl

Synopsis:
  1. Get the category information from the Reuters HTML files.
  2. Update the category information such that only categories appearing in both the training and the testing documents survive.
  3. Dump the categories used in the training or the test set (for validation).

Examples:
  perl -s GetCat.pl /test-html > test-cat.txt
    => Get category information from the test set
  perl -s GetCat.pl /train-html > train-cat.txt
    => Get category information from the training set
  perl -s GetCat.pl -Oupdate train-cat.txt test-cat.txt
    => Update category information
  perl -s GetCat.pl -Odump train-cat.txt
    => Dump used categories
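The "update" step (step 2) amounts to intersecting the category sets of the two splits. A minimal sketch of that idea, assuming category information is held as a mapping from document ID to a list of category labels (the function name and the labels below are invented for illustration):

```python
def update_categories(train_cats, test_cats):
    # Keep only categories that appear in BOTH the training and the test
    # set, mirroring step 2 of GetCat.pl.
    used = ({c for cats in train_cats.values() for c in cats} &
            {c for cats in test_cats.values() for c in cats})
    def prune(cat_map):
        return {doc: [c for c in cats if c in used]
                for doc, cats in cat_map.items()}
    return prune(train_cats), prune(test_cats)

train = {'1': ['earn', 'acq'], '2': ['rare-topic']}
test  = {'14826': ['earn'], '14827': ['acq', 'grain']}
new_train, new_test = update_categories(train, test)
# 'rare-topic' and 'grain' occur on only one side, so they are dropped.
```

Documents whose categories are all pruned away end up with an empty label list, which a study would then have to decide how to handle.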
prf.pl

Synopsis: Given a classification result file, calculate the precision, recall, and F1 values as document-wise and category-wise averages.

Each line in the result file should have 3 columns separated by ' : ':
  "document_ID : System_generated_output : Provided_Answer"
The categories in the system output are separated by tabs ("\t"), and each category may be followed by a colon and then a score.

Example:
  Doc_1 : cat:8 dog:4 : cat duck
  Doc_2 : goat dog : cat

Syntax:
  perl -s prf.pl [-Odetail] [-OCats=n] [-OSort=1,0] result_file.txt
where
  -Odetail   : report averages for each category
  -OCats=n   : calculate averages only for n categories
  -OSort=1,0 : 1 = sort descending, 0 = sort ascending

Note: These averages (except the document-wise averages, which are defined here by Yuen-Hsien Tseng) are calculated following the method described by Yiming Yang on the ddlbeta discussion list. (See the attached text.)
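The per-document part of this computation can be sketched as below. This is a hedged illustration, not prf.pl itself: the helper names parse_line and prf are invented, and it assumes the answer column uses the same tab-separated layout as the system output. The mean of the per-document values would give the document-wise averages.

```python
def parse_line(line):
    # One result line: "document_ID : system_output : answer".
    # System categories are tab-separated; "cat:8" means category "cat"
    # with score 8 (the score is ignored here).
    doc_id, output, answer = [f.strip() for f in line.split(' : ')]
    predicted = {c.split(':')[0] for c in output.split('\t') if c}
    gold = {c for c in answer.split('\t') if c}
    return doc_id, predicted, gold

def prf(predicted, gold):
    # Precision, recall, and F1 for one document (sets of categories).
    tp = len(predicted & gold)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

doc_id, pred, gold = parse_line('Doc_1 : cat:8\tdog:4 : cat\tduck')
p, r, f1 = prf(pred, gold)
# The system got "cat" right but missed "duck" and added "dog",
# so precision, recall, and F1 are each 0.5 for this document.
```

Category-wise (macro/micro) averages would instead aggregate the counts per category over all documents, following Yang's definitions.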