FJU Test Collection
for Evaluation of Chinese OCR Text Retrieval

By Yuen-Hsien Tseng, Sep. 20, 2002

Introduction

The advent of the World Wide Web has revolutionized the way we publish, disseminate, and access digital information. The ease of accessing such resources has inspired more and more libraries and information providers to digitize their data for networked information services.

Information produced in the future is likely to exist in fully digital form and will therefore be readily accessible through the Internet. In contrast, retrospective paper materials require digitization and indexing before they can be easily accessed.

To digitize paper materials, one first scans them into digital images. OCR (Optical Character Recognition) techniques are then applied to convert these images into digital text. This approach has been shown to be the cheapest and fastest way to make retrospective data easily accessible. However, the conversion is not always perfect: the OCR process is error-prone, especially for low-quality prints. Evaluating how effective this approach is for retrieval is therefore a worthwhile research task, since the results determine whether its cost advantage is justified.

The test collection provided here originates from a multi-year digitization project of the Socio-Cultural Research Center (SCRC) at Fu Jen Catholic University (FJU). The digital images come from SCRC's newspaper clipping collection. The digitization project was mostly sponsored by the National Science Council, Taiwan, Republic of China.

Contents of the Test Collection

This collection consists mainly of text files (no programs are included):

Document Set
- 8438 OCR text files, each available in both BIG5 code (mainly used in Taiwan and Hong Kong) and GB code (used in Mainland China and Singapore). The recognition rate of these OCR texts is estimated to be 69%.
- 899 manually keyed-in text files (these are the documents judged relevant to at least one of the 30 query topics). They are also available in both BIG5 and GB codes.
- 8438 image files. These are the digital images scanned from newspaper clippings collected from Mainland China, Hong Kong, and Taiwan. Most documents are in simplified Chinese, but some are in traditional Chinese.
  Note: Because the image files are large, they are distributed separately from the rest of the test collection, which is packaged in a single compressed file.
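
  Because 899 documents exist both as OCR text and as manually keyed-in text, a character-level agreement score can be computed for those pairs. The following is only a minimal sketch in Python: the function name char_accuracy is hypothetical, the alignment uses the standard difflib module, and this is not necessarily how the 69% estimate above was obtained.

```python
from difflib import SequenceMatcher


def char_accuracy(reference: str, ocr: str) -> float:
    """Fraction of reference characters that align with the OCR text.

    Assumption: `reference` is the manually keyed-in text and `ocr` is the
    OCR output for the same document, both already decoded to str.
    """
    if not reference:
        # Nothing to recognize: perfect only if the OCR text is empty too.
        return 1.0 if not ocr else 0.0
    # autojunk=False disables difflib's popularity heuristic, which would
    # otherwise ignore very frequent characters in long texts.
    matcher = SequenceMatcher(None, reference, ocr, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(reference)


if __name__ == "__main__":
    # Toy strings for illustration only, not taken from the collection.
    print(char_accuracy("文化研究中心", "文化研咒中心"))
```

  The score depends on the alignment that difflib chooses, so it should be read as an approximation of recognition quality, not as an exact error count.
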
Query Topics
- A text file containing thirty Chinese query topics. Available in both BIG5 code and GB code.
- An ASCII text file containing thirty English query topics, translated from the corresponding Chinese topics by researchers at SCRC.
Relevance Judgment
- An ASCII text file listing the relevant documents for each query topic.
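
Since the documents and the Chinese topics come in both BIG5 and GB codes, a program has to decode each file with the matching codec before indexing. The sketch below assumes Python and its built-in "big5" and "gb2312" codecs; the helper names decode_text and read_document are hypothetical, and the caller is expected to know which version of a file it is reading.

```python
from pathlib import Path


def decode_text(raw: bytes, encoding: str) -> str:
    """Decode raw bytes with the given codec ("big5" or "gb2312").

    errors="replace" turns undecodable bytes into U+FFFD instead of raising,
    so one damaged byte does not abort the whole file.
    """
    return raw.decode(encoding, errors="replace")


def read_document(path: str, encoding: str) -> str:
    """Read one file from disk and decode it (path layout is an assumption)."""
    return decode_text(Path(path).read_bytes(), encoding)


if __name__ == "__main__":
    # Toy strings for illustration only, not taken from the collection.
    traditional = "中文檢索".encode("big5")
    simplified = "中文检索".encode("gb2312")
    print(decode_text(traditional, "big5"))
    print(decode_text(simplified, "gb2312"))
```

Note that decoding only changes the character encoding. It does not map traditional characters to simplified ones or the reverse, so matching a traditional-Chinese query against simplified-Chinese documents needs a separate conversion step.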

What can you do with this collection?

Evaluate how OCR errors affect retrieval performance.
Evaluate your indexing and retrieval techniques in dealing with noisy texts, including OCR texts, speech-converted texts, or dialog texts.
Because this collection is small, you can use it to verify or debug your indexing and retrieval programs quickly.
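
One way to quantify the effect of OCR errors is to run the same queries against the OCR texts and against the manually keyed-in texts, then compare a standard measure such as mean average precision. The following Python sketch assumes that the relevance judgments have already been parsed into a dictionary from topic ID to a set of relevant document IDs; the file format is not described here, so parsing is left out, and all names and toy data in the sketch are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision for one topic.

    Assumption: `ranked_ids` holds unique document IDs, best match first.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """Mean of average precision over all topics that have relevant documents.

    runs:  {topic_id: ranked list of document IDs}
    qrels: {topic_id: set of relevant document IDs}
    """
    topics = [t for t in qrels if qrels[t]]
    if not topics:
        return 0.0
    total = sum(average_precision(runs.get(t, []), qrels[t]) for t in topics)
    return total / len(topics)


if __name__ == "__main__":
    # Toy data for illustration only, not results from the collection.
    qrels = {"t1": {"d1", "d3"}, "t2": {"d2"}}
    clean_run = {"t1": ["d1", "d3", "d5"], "t2": ["d2", "d4"]}
    ocr_run = {"t1": ["d1", "d5", "d3"], "t2": ["d4", "d2"]}
    print("clean:", mean_average_precision(clean_run, qrels))
    print("ocr:  ", mean_average_precision(ocr_run, qrels))
```

The difference between the two scores gives a rough indication of how much retrieval effectiveness is lost to recognition errors under a given indexing method.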

Download

To get this collection, click here and you will be presented with a form asking for the following information:

First Name:
Last Name:
Title:
Affiliation:
Email:
Home page (if any):
To be listed in the Request List: Yes/No
Other comments:

After your request is received, a username and password for downloading the test collection will be sent to you in a reply message.

Request List

Those who have requested this test collection will be listed here for two reasons:

To allow direct or indirect contact and sharing of the experiences of using this collection among those who have requested it.
To make it possible to evaluate the usefulness of this collection, based on the number of copies distributed to intended users, since it was created for research purposes with support from the NSC, Taiwan, R.O.C.

Here are the requests:

Yuen-Hsien Tseng and Douglas W. Oard.
Their experiments using this collection are reported in "Document Image Retrieval Techniques for Chinese", Proceedings of the Fourth Symposium on Document Image Understanding Technology, Columbia, Maryland, April 23-25, 2001, pp. 151-158.
Art Pollard, President, Lextek International, pollarda@lextek.com, 2003/05/16.
Jiewei Chen, Beijing University of Posts and Telecommunications, Jiewei.chen@gmail.com, 2004/11/17.
Ryan Klempner, Education, taiwantroll@hotmail.com, 2005/02/01.
More ...

Acknowledgement

Meng-Chu Tsai (蔡孟竹) created the 30 search topics, coordinated the three judges for relevance assessment (he was also one of the three judges), and converted most of the digital images into digital text using a commercial OCR software package.
Cheng-I Chang (張政義) converted about 1000 images into digital text and recorded the recognition rates reported by the OCR software package.
SCRC provided the scanned digital images of its newspaper clippings. Special thanks go to former director 關秉寅, director Father 狄, and 康芳菁, 李青玲, 鄭明賢, and others for their many years of assistance.
The National Science Council funded most of the cost of creating this test collection. Related grants are: NSC 88-2418-H-001-011-B8908, NSC 88-2418-H-001-011-B9003, NSC 88-2413-H-030-017-, and NSC 89-2413-H-030-006-.
