Tools for Reuters-21578 Text Categorization Dataset

By Yuen-Hsien Tseng, Nov. 8, 2002

Introduction

Reuters-21578 is a test collection for evaluating automatic text categorization techniques. Although it is widely used in research studies, few have reported the details of how it is used.

For example, how are the documents cleaned?
Which documents are used in the training set and which ones are in the test set?
How exactly is the performance measured?

Slightly different experimental setups, even when they use the same dataset, may lead to somewhat unfair comparisons among the results of different studies.

This page provides some simple tools to prepare the Reuters-21578 dataset for text categorization experiments. It is hoped that, by using these tools, further studies will have exactly the same experimental setups, so that comparisons among different studies can be fairer.

Tools

The following three tools are written in Perl (download here, 8K bytes):