FJU Test Collection for Evaluation of
Chinese Text Categorization
By Yuen-Hsien Tseng, Feb. 1, 2004
http://www.lins.fju.edu.tw/~tseng/Collections/Chinese_TC.html
Introduction
Text categorization (or document classification) is a process of assigning
labels to documents according to the contents or topics of the documents.
The labels (classes or categories) usually come from one or more predefined
sets that reflect the knowledge structures intended to organize the documents.
Traditionally, text categorization has been carried out by human experts,
as it requires a certain level of vocabulary recognition and knowledge processing.
As the amount of full-text documents rapidly increases in this digital age,
automatic ways of assigning labels to documents to assist human
experts become, in some cases, inevitable. Examples are
news dispatching and spam email filtering: both require labeling
bulk messages in a short period of time according to the message contents,
a task that is not easily done by human effort alone.
Automated text categorization also helps human categorizers in traditional
tasks.
By suggesting possible classes for each unlabelled document,
a machine classifier relieves the burdens of reading full-text documents
and memorizing every class definition in a knowledge structure,
both of which are required of a categorizer for the classification task.
For novices in such tasks, having a machine classifier is like
having an experienced colleague as a guide. Thus the training cost can
be reduced and the period of getting acquainted with the task can be
shortened, allowing more people to be involved in this knowledge organization
and value-adding task.
But how well can a machine classifier perform?
Evaluating the effectiveness of automated text classification, to justify
its advantages, is both a research interest and a practical need.
Several test collections already exist for evaluating automatic text
categorization technologies, for example Reuters-21578,
OHSUMED, and 20 Newsgroups (20NG). All of these test collections are in English.
Chinese test collections are needed for direct evaluation
of Chinese processing techniques in automatic text categorization.
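As background for such evaluations, here is a minimal sketch (in Python, not part of the collection itself) of the micro-averaged precision, recall, and F1 measures commonly used to score multi-label text categorization. The category names in the example are taken from this collection; the gold and predicted labels are illustrative only.

```python
def micro_f1(gold, pred):
    """Micro-averaged precision, recall, and F1 over all (document, label) pairs.

    gold, pred: lists of label sets, one set per document.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels correctly assigned
        fp += len(p - g)   # labels assigned but not in the gold standard
        fn += len(g - p)   # gold labels the classifier missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: two documents, multi-label as in this collection
gold = [{"P52"}, {"P10", "E3"}]
pred = [{"P52"}, {"P10"}]
p, r, f = micro_f1(gold, pred)
```

Micro-averaging weights every label assignment equally, so frequent categories such as P52 dominate the score; macro-averaging (averaging per-category F1) is a common alternative when rare categories matter.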
Sources of the Test Collection
The test collection provided here originates from a multi-year
digitization project of SCRC
(Socio-Cultural Research Center) at
Fu Jen Catholic University.
The digitization project was mostly sponsored by the National Science Council,
Republic of China.
The source of the test collection comes from the news broadcasts of
Mainland China's radio stations between 1966 and 1982.
These broadcasts were transcribed by hand onto manuscript papers and
labeled, either on-site or by first recording the broadcasts and
transcribing them afterwards.
This material was used by SCRC to reveal what happened in Mainland China
during the Cultural Revolution; SCRC was formerly situated in
Hong Kong and published a well-known periodical, the CHINA NEWS ANALYSIS.
In 2000-2001, under the digitization project, SCRC had 42371
manuscripts keyed in manually for the preservation and better use
of this material. Among them, 30710 manuscripts have category labels
and dates.
Only a part of the manuscripts is included in this test collection
according to the following guidelines:
- Each category must have documents in the training set and in the test
set so that an effective training and testing of a machine classifier
is possible.
- The training documents predate all the test documents to reflect
the ordinary use of an operational machine classifier.
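The two guidelines above can be sketched as follows. The record layout (doc_id, date, label set) and the sample documents are assumptions for illustration, not the collection's actual file format; only the cutoff date and the "categories must occur in both sets" rule come from this document.

```python
import datetime

# Hypothetical records: (doc_id, date, label set) -- illustrative only
docs = [
    ("d1", datetime.date(1970, 5, 1), {"P52"}),
    ("d2", datetime.date(1975, 3, 9), {"P10", "E3"}),
    ("d3", datetime.date(1978, 1, 2), {"P52"}),
    ("d4", datetime.date(1981, 7, 7), {"Cu4"}),
]

# Guideline 2: training documents predate all test documents
cutoff = datetime.date(1977, 1, 1)
train = [d for d in docs if d[1] < cutoff]
test = [d for d in docs if d[1] >= cutoff]

# Guideline 1: keep only categories that occur in both sets
train_cats = set().union(*(d[2] for d in train))
test_cats = set().union(*(d[2] for d in test))
shared = train_cats & test_cats
train = [(i, dt, ls & shared) for i, dt, ls in train if ls & shared]
test = [(i, dt, ls & shared) for i, dt, ls in test if ls & shared]
```

A chronological split like this avoids the optimistic bias of a random split, since an operational classifier is always trained on past documents and applied to future ones.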
Contents of the Test Collection
- Document Set
- There are 28011 documents in total, divided into a training set
(19901 documents) and a test set (8110 documents).
- The documents date from 1966/01/01 to 1982/12/26.
- Training documents are from 1966/01/01 to 1976/12/31.
- Test documents are from 1977/01/01 to 1982/12/26.
- Since the documents come from manual transcription of on-site news
broadcasts, missing words or even missing snippets are not uncommon
in the documents.
- Categories
- There are 82 categories in total in the above documents.
Category definitions may change slightly over time or be interpreted
differently by different categorizers. Only those categories that
occur in both the training set and the test set are selected.
- A list of the category descriptions in Chinese keywords, provided
by SCRC, is included in this collection distribution (file: CatDesc.doc).
- Original category names consist mostly of English letters and digits;
only 3 Chinese characters are used in this collection's categories.
The 3 characters are 文 (Culture), 傳 (Biography), and 史 (History).
They are encoded as Cu, Bi, and Hi, respectively, for the convenience
of non-Chinese users.
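Under that encoding, a category name can be mapped to its ASCII form with a simple character table. This sketch assumes only the three characters listed above; the function name is hypothetical.

```python
# Encoding of the three Chinese characters used in category names,
# as described in this collection
char_code = {"文": "Cu", "傳": "Bi", "史": "Hi"}

def ascii_category(name):
    """Rewrite a category name using the ASCII codes of this collection."""
    return "".join(char_code.get(ch, ch) for ch in name)
```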
Note:
- 'Hi' occurs only in the category name 'P5Hi'. But 'P5Hi' cannot
be found in CatDesc.doc. After browsing some of P5Hi's documents,
I suspect 'P5Hi' should be 'P5-10' (History of Chinese Communism).
- All the categories that are not in the description list
are: 'E1a', 'E36a', 'HK', 'P10a', 'P53', 'P5Hi'.
- Some category distributions are listed below:
No. | Cat  | Train | Test
  1 | P52  |  3985 | 1073
  2 | P10  |  1903 |  788
  3 | Cu4  |  1115 |  236
  4 | E34  |  1085 |   88
  5 | E3   |   950 |  248
  6 | P5   |   854 |  536
  7 | E6   |   797 |  234
  8 | P4   |   736 |  501
  9 | E5   |   569 |  149
 10 | E36  |   548 |  310
 11 | Cu31 |   529 |  144
 12 | E53  |   430 |  198
 13 | P73  |   417 |  208
 14 | E10  |   390 |  114
 15 | Cu71 |   376 |   49
 16 | P72  |   361 |  155
 17 | P1   |   330 |  163
 18 | P6   |   317 |  116
 19 | E1   |   316 |  182
 20 | E4   |   305 |  283
...
No. | Cat  | Train | Test
 63 | E38  |    18 |    5
 64 | P10a |    17 |    2
 65 | P11  |    16 |    8
 66 | P5Hi |    14 |    8
 67 | P33  |    13 |   23
 68 | Bi   |    13 |   13
 69 | Cu51 |    12 |   16
 70 | Cu91 |    11 |    5
 71 | P51  |    11 |    2
 72 | E9   |    10 |    9
 73 | E39  |     9 |    2
 74 | E83  |     7 |   18
 75 | E1a  |     7 |    7
 76 | Cu73 |     5 |    2
 77 | HK   |     4 |   10
 78 | E1-3 |     4 |    2
 79 | E31  |     4 |    1
 80 | E1-4 |     3 |   74
 81 | P9   |     3 |    1
 82 | P53  |     2 |    9
- Label Assignment
- Each document was labeled with 1 to 4 categories, but mostly only 1 label.
The average number of labels per document is 1.035 categories
for the training set and 1.007 for the test set.
- Since the documents span 17 years, consistency of the
label assignment may be somewhat of a problem. But since almost
all human-labeled collections have this consistency problem
(the inconsistency rate can range from 10% to 40%),
users of any test collection should bear this in mind
when interpreting their evaluation results.
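The average-labels-per-document figure quoted above can be computed as follows; the toy label sets are illustrative only, not the collection's actual labels.

```python
def avg_labels(label_sets):
    """Average number of category labels per document."""
    return sum(len(s) for s in label_sets) / len(label_sets)

# Toy example with the same shape as the collection's label lists:
# three single-label documents and one double-label document
labels = [{"P52"}, {"P10", "E3"}, {"Cu4"}, {"E34"}]
```

Applied to the real training set, this statistic comes out at 1.035, confirming that the collection is only mildly multi-label.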
Download
To get this collection,
click here and you will be prompted with a form that
asks for the following information:
- First Name:
- Last Name:
- Title:
- Affiliation:
- Email:
- Home page (if any):
- To be listed in the Request List: Yes/No
- Other comments:
Upon receipt of a request, a username and password for downloading
the test collection will be included in a reply message.
Request List
Those who have requested this test collection will be listed here for 2 reasons:
- To allow direct or indirect contact and sharing of experiences of using
this collection among those who have requested it.
- To provide a chance to evaluate the usefulness of this collection, based on
the copies distributed among intended users, as it was created for research
purposes supported by the NSC, Taiwan, R.O.C.
Here are the requests:
- Yuen-Hsien Tseng.
Yuen-Hsien Tseng and William John Teahan,
"Verifying a Chinese Collection for Text Categorization,"
to appear in the Proceedings of the 27th International ACM SIGIR Conference
on Research and Development in Information Retrieval -
SIGIR '04, July 25 - 29 Sheffield, U.K., 2004.
- More ...
Acknowledgement
- SCRC provided all their digital collections.
Special thanks go to 關秉寅, former Chair of SCRC, 狄神父, Chair,
and 康芳菁, 李青玲, 鄭明賢 for their help during these years.
- The National Science Council sponsored most of the expenditure to
create this test collection. Related grants are:
NSC 88-2418-H-001-011-B8908, NSC 88-2418-H-001-011-B9003,
NSC 88-2413-H-030-017-, and NSC 92-2213-E-030-017-.