FJU Test Collection
for Evaluation of Chinese OCR Text Retrieval
By Yuen-Hsien Tseng, Sep. 20, 2002
Introduction
The advent of the World Wide Web has revolutionized the ways we
publish, disseminate, and access digital information.
The ease of accessing such resources has inspired more and more
libraries and information providers to digitize their
data for networked information services.
Future information is likely to be produced in fully digital form
and will thus be readily accessible through the Internet.
In contrast, retrospective paper materials require digitization
and indexing before they can be easily accessed.
To digitize paper materials, one first scans them into digital images.
OCR (Optical Character Recognition) techniques are then applied to
convert these images into digital texts.
This approach has been shown to be the cheapest and fastest way
to make retrospective data accessible with ease.
However, the conversion is not always perfect.
In fact, the OCR process is error-prone, especially for low-quality prints.
It is thus a research task to evaluate the effectiveness of this
approach in order to justify its cost advantage.
The test collection provided here originates from a multi-year
digitization project of the
Socio-Cultural Research Center (SCRC)
at Fu Jen Catholic University (FJU).
The source of the digital images is the newspaper clipping collection of SCRC.
The digitization project is mostly sponsored by the National Science Council,
Taiwan, Republic of China.
Contents of the Test Collection
This collection mainly contains a set of text files (no programs are included),
including:
- Document Set
- 8438 OCR text files, available in both BIG5 code (mainly used in Taiwan and
Hong Kong) and GB code (used in Mainland China and Singapore). The recognition
rate of these OCR texts is estimated to be 69%.
- 899 manually keyed-in text files (these are the documents judged
relevant to any of the 30 query topics). They are also available in both BIG5
and GB codes.
- 8438 image files. These are the digital images scanned from the newspaper
clippings collected from Mainland China, Hong Kong and Taiwan. Most documents
are in simplified Chinese, but some are in traditional Chinese.
Note: Due to the large size of the image files, they are distributed separately
from the rest of the test collection, which is packaged in a compressed file.
- Query Topics
- A text file containing thirty Chinese query topics. Available in both
BIG5 code and GB code.
- An ASCII text file containing thirty English query topics. They are
translated from their corresponding Chinese topics by the researchers in
SCRC.
- Relevance Judgment
- An ASCII text file listing the relevant documents for each query topic.
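Because the text files come in two legacy encodings, a first practical step is usually to decode them into Unicode. The following is a minimal Python sketch of that step, not part of the collection itself. The function names, the "big5"/"gb" labels, and the choice of Python's `gb2312` codec for "GB code" are assumptions made for illustration; the actual files may need a different codec such as `gbk` or `big5hkscs`.

```python
# Hypothetical helper names; the codec choices below are assumptions.
CODECS = {
    "big5": "big5",   # traditional Chinese encoding used in Taiwan and Hong Kong
    "gb": "gb2312",   # assumed codec for the "GB code" files
}


def decode_collection_bytes(raw: bytes, code: str) -> str:
    """Decode raw bytes from a BIG5 or GB file into a Unicode string."""
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of
    # raising, so a single bad byte does not abort the whole document.
    return raw.decode(CODECS[code], errors="replace")


def load_collection_file(path: str, code: str) -> str:
    """Read one text file from disk and decode it."""
    with open(path, "rb") as handle:
        return decode_collection_bytes(handle.read(), code)
```

Decoding both versions to Unicode lets the same indexing code handle either encoding.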
What can you do with this collection?
- Evaluate how OCR errors affect retrieval performance.
- Evaluate your indexing and retrieval techniques in dealing with
noisy texts, including OCR texts, speech-converted texts, or
dialog texts.
- Since this collection is small, you can use it to verify or
debug your indexing and retrieval programs in a shorter time.
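One common way to quantify retrieval performance against the relevance judgments is mean average precision (MAP). The sketch below is illustrative only: the function names are hypothetical, and it assumes the relevance judgments have already been parsed into a mapping from topic ID to a set of relevant document IDs, since the layout of the judgment file is not described here.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked result list for one topic."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Mean of average precision over all judged topics."""
    if not qrels:
        return 0.0
    scores = [average_precision(runs.get(topic, []), rel)
              for topic, rel in qrels.items()]
    return sum(scores) / len(scores)
```

Computing this score for the same retrieval system on noisy and on cleaner text gives a simple measure of how much recognition errors cost.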
Download
To get this collection,
click here; you will be presented with a form that
asks for the following information:
- First Name:
- Last Name:
- Title:
- Affiliation:
- Email:
- Home page (if any):
- To be listed in the Request List: Yes/No
- Other comments:
Once your request is received, a username and password for downloading
the test collection will be sent to you in a reply message.
Request List
Those who have requested this test collection are listed here
for two reasons:
- To allow those who have requested this collection to contact one another,
directly or indirectly, and share their experiences of using it.
- To make it possible to evaluate the usefulness of this collection, based on
the copies distributed among intended users, as it was created for research
purposes supported by the NSC, Taiwan, R.O.C.
Here are the requests:
- Yuen-Hsien Tseng and Douglas W. Oard.
Their experiments using this collection
are reported in "Document Image Retrieval Techniques for Chinese",
Proceedings of the Fourth Symposium on Document Image
Understanding Technology, Columbia, Maryland, April 23-25, 2001, pp. 151-158.
- Art Pollard, President, Lextek International, pollarda@lextek.com,
2003/05/16.
- Jiewei Chen, Beijing University of Posts and Telecommunications,
Jiewei.chen@gmail.com, 2004/11/17.
- Ryan Klempner, Education, taiwantroll@hotmail.com,
2005/02/01.
- More ...
Acknowledgement
- Meng-Chu Tsai (蔡孟竹) created the 30 search topics, coordinated the 3 judges for
relevance assessment (he was himself one of the 3 judges), and converted most
of the digital images into digital texts using a commercial OCR software
package.
- Cheng-I Chang (張政義) converted about 1000 images into digital texts and recorded
the recognition rates reported by the OCR software package.
- SCRC provided the scanned digital images of their newspaper clippings.
Special thanks go to former director 關秉寅, director Father 狄 (狄神父), and
康芳菁, 李青玲, 鄭明賢, and others for their many years of assistance.
- The National Science Council sponsored most of the cost of creating this
test collection. Related grants are:
NSC 88-2418-H-001-011-B8908, NSC 88-2418-H-001-011-B9003,
NSC 88-2413-H-030-017-, and NSC 89-2413-H-030-006-.