FJU Test Collection 
for Evaluation of Chinese OCR Text Retrieval
 By Yuen-Hsien Tseng, Sep. 20, 2002
Introduction
The advent of the World Wide Web has revolutionized our ways of 
publishing, disseminating, and accessing digital information. 
The ease of accessing such resources has inspired more and more 
libraries and information providers to digitize their 
data for networked information services. 
Future information is likely to be present in full digital form 
and is thus readily accessible through the Internet with little difficulty. 
In contrast, retrospective paper materials require digitization 
and indexing before they can be easily accessed.
To digitize paper materials, they are first scanned into digital images.
OCR (Optical Character Recognition) techniques are then applied to 
convert these images into digital texts. 
This approach has been shown to be the cheapest and fastest way 
to make retrospective data accessible with ease. 
However, the conversion is not always perfect. 
In fact, the OCR process is error-prone, especially for low-quality prints. 
It is thus a worthwhile research task to evaluate the effectiveness of this
approach in order to justify its cost advantage.
The test collection provided here originates from a multi-year 
digitization project of 
 the Socio-Cultural Research Center (SCRC) 
at Fu Jen Catholic University (FJU). 
The source of the digital images is the newspaper clipping collection of SCRC. 
The digitization project is mostly sponsored by the National Science Council, 
Taiwan, Republic of China.
Contents of the Test Collection
This collection mainly contains a set of text files (no programs are included), 
including:
- Document Set
  
  - 8438 OCR text files, both in BIG5 code (mainly used in Taiwan and Hong Kong) 
  and GB code (used in Mainland China and Singapore). The recognition rate
  of these OCR texts is estimated to be 69%.
  
 - 899 manually keyed-in text files (these are the documents judged 
  relevant to any of the 30 query topics). They are also available in both BIG5 
  and GB codes.
  
 - 8438 image files. These are the digital images scanned from the newspaper 
  clippings collected from Mainland China, Hong Kong and Taiwan. Most documents 
  are in simplified Chinese, but some are traditional Chinese.
  
Note: Due to the large size of the image files, they are distributed separately
  from the rest of the test collection, which is packaged in a compressed file.
   
 - Query Topics
  
  - A text file containing thirty Chinese query topics. Available in both 
  BIG5 code and GB code.
  
 - An ASCII text file containing thirty English query topics. They are 
  translated from their corresponding Chinese topics by the researchers in 
  SCRC.
  
 
 - Relevance Judgment
  
  - An ASCII text file listing the relevant documents for each query topic.
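
Because the text files come in both BIG5 and GB codes, a first step in most experiments is decoding them into Unicode. The following is a minimal Python sketch of that step, not part of the collection itself. The file names are hypothetical, and the choice of the `gb2312` codec is an assumption: the page only says "GB code", so a superset codec such as `gbk` or `gb18030` may be needed for some files.

```python
import tempfile
from pathlib import Path


def read_text(path, encoding):
    """Decode a file from a legacy Chinese encoding into a Unicode string.

    Bytes that cannot be decoded are replaced with U+FFFD instead of
    raising an error, so one damaged byte does not abort a whole run.
    """
    return Path(path).read_bytes().decode(encoding, errors="replace")


# Stand-in files so the sketch is self-contained; the real collection
# files would be read in the same way (their names are not known here).
sample = "中文"
with tempfile.TemporaryDirectory() as tmp:
    big5_path = Path(tmp) / "doc.big5.txt"  # hypothetical file name
    gb_path = Path(tmp) / "doc.gb.txt"      # hypothetical file name
    big5_path.write_bytes(sample.encode("big5"))
    gb_path.write_bytes(sample.encode("gb2312"))

    big5_text = read_text(big5_path, "big5")
    gb_text = read_text(gb_path, "gb2312")  # assumption: GB means GB2312
```

Once decoded, both versions are ordinary Python strings, so the same indexing code can be applied regardless of the original encoding.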
  
 
 
What can you do with this collection?
- Evaluate how OCR errors affect retrieval performance.
 - Evaluate your indexing and retrieval techniques in dealing with
       noisy texts, including OCR texts, speech-converted texts, or 
       dialog texts.
 - Since this collection is small, you can use it to verify or
       debug your indexing and retrieval programs in less time.
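
As an illustration of the first use listed above, one common way to compare retrieval runs is average precision computed against the relevance judgments. The sketch below is a hedged example only: the document identifiers and the two rankings are made up, and the format of the relevance judgment file is not assumed, so the relevant documents are given directly as a Python set.

```python
def average_precision(ranked_ids, relevant_ids):
    """Uninterpolated average precision of one ranked list for one topic."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank where a relevant document is found.
            precision_sum += hits / rank
    # Relevant documents that are never retrieved contribute zero.
    return precision_sum / len(relevant)


# Hypothetical judgments and rankings for a single query topic.
relevant = {"d1", "d2"}
ranking_clean = ["d1", "d2", "d3", "d7"]  # made-up run on error-free text
ranking_noisy = ["d3", "d1", "d7", "d2"]  # made-up run on OCR text

ap_clean = average_precision(ranking_clean, relevant)
ap_noisy = average_precision(ranking_noisy, relevant)
```

Averaging this value over all thirty topics gives mean average precision, which can then be compared between runs to see how much a given condition degrades retrieval.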
 
Download
To get this collection, 
click here; you will be presented with a form that 
asks for the following information:
- First Name:
 - Last Name:
 - Title:
 - Affiliation:
 - Email:
 - Home page (if any):
 - To be listed in the Request List: Yes/No
 - Other comments:
 
After your request is received, a username and password for downloading 
the test collection will be sent to you in a reply message.
Request List
Those who have requested this test collection will be listed here 
for 2 reasons:
- To allow direct or indirect contact and sharing of the 
experiences of using this collection among those who have requested it.
 - To have a chance to evaluate the usefulness of this collection, based on 
the copies distributed among intended users, as it is created for research 
purposes supported by the NSC, Taiwan, R.O.C.
 
Here are the requests:
- Yuen-Hsien Tseng and Douglas W. Oard. 
Their experiments using this collection 
are reported in "Document Image Retrieval Techniques for Chinese", 
Proceedings of the Fourth Symposium on Document Image 
Understanding Technology, Columbia, Maryland, April 23-25, 2001, pp. 151-158.
 - Art Pollard, President, Lextek International, pollarda@lextek.com, 
	2003/05/16.
 - Jiewei Chen, Beijing University of Posts and Telecommunications,
Jiewei.chen@gmail.com, 2004/11/17.
 - Ryan Klempner, from Education, at taiwantroll@hotmail.com, 
on 2005/02/01
 - More ...
 
Acknowledgement
 
-  Meng-Chu Tsai (蔡孟竹) creates the 30 search topics, coordinates the 3 judges for 
relevance assessment (he himself is also one of the 3 judges), and converts most 
of the digital images into digital texts using a commercial OCR software 
package.
 -  Cheng-I Chang (張政義) converts about 1000 images into digital texts and records
the recognition rates reported by the OCR software package.
 -  SCRC provides the scanned digital images of their newspaper clippings.
Special thanks go to former director 關秉寅, director Father 狄, as well as 康芳菁, 李青玲, 鄭明賢, and others, for their many years of assistance.
 -  The National Science Council funds most of the cost of creating this 
test collection. Related grants are: 
NSC 88-2418-H-001-011-B8908, NSC 88-2418-H-001-011-B9003, 
NSC 88-2413-H-030-017-, and NSC 89-2413-H-030-006- .