Chinese MARC (Taiwan) and its bibliographic database

中國機讀編目格式及其書目資料庫

Ching-Chen Anthony Mao (Fu Jen Catholic University, Taipei, Taiwan)
Ching-fen Frances Hsu (National Central Library, Taipei, Taiwan)

World Library and Information Congress: 72nd IFLA General Conference and Council
20-24 August 2006, Seoul, Korea
Monday 21 August 2006, 77 SI - UNIMARC 08.30 - 10.30
http://www.lins.fju.edu.tw/mao/works/CMARC+UNICODE.html


ABSTRACT

The Chinese MARC Format (CMARC) was first published in 1982 and is still the most widely used machine-readable format among libraries in Taiwan. Besides the background and current application of CMARC, this paper addresses two subjects: how CMARC adopts and differs from the UNIMARC format, and the internal encoding systems for Chinese characters.

1982年出版中國機讀編目格式, 即中國機讀編目格式的前身; 20多年來, 始終是臺灣地區使用最廣泛的機讀編目格式, 敘述中國機讀編目格式的背景及現況後, 本文探討兩個主題: 中國機讀編目格式採用國際機讀編目格式(UNIMARC)的內容, 中文內碼系統。

Steps have been taken to bridge CMARC and other data formats, for instance the recent development of the CMARC3 XML Schema and of a MARC 21 to CMARC3 mapping table. The next major task will be to prepare for the challenge of international bibliographic data exchange.

已採取步驟縮短中國機讀編目格式與其他格式之間的差距, 如: 發展機讀編目延伸標示語言, MARC 21至CMARC3的對照表, 接著要面對的是國際間書目交換的挑戰。

Two traditional Chinese encoding systems are introduced: Big5 and UTF-8. Simplified Chinese encoding systems are left to my colleague from China.

討論大五碼及UTF-8兩種繁體中文的編碼系統, 簡體中文的編碼系統有待中國大陸的學者補充。

1. The origin of CMARC

1. 源起

The introduction of the MARC format by the Library of Congress in the 1960s pushed libraries towards an era of computerization. Computerized bibliographic records not only enhanced data retrieval and storage but also served as the basis for data exchange. In the mid-1960s machine-readable formats were developed almost concurrently but separately by the Library of Congress and by the Council of the British National Bibliography. Although the two cooperated in developing MARC II in 1968, various MARC formats, such as USMARC, UKMARC and INTERMARC, emerged in the 1970s to meet the needs of different cataloguing practices and requirements.[1] The creation of UNIMARC as an international machine-readable format offered libraries a solution to the problem of data exchange between different MARC formats. Influenced by these developments, and with both local application and international data exchange in mind, the Chinese MARC Format was created.

美國國會圖書館於1960年代推出機讀編目格式, 將圖書館的服務帶入電腦化的時代, 書目記錄電腦化之後, 不僅強化資料檢索及儲存的能力 , 也是資料交換的基礎。幾乎同時, 英國國家書目也推出機讀編目格式。雖然, 美國國會圖書館於1968年推出的第二代機讀編目格式, 已參酌英國機讀編目格式的內容, 但到了1970年代之後, 仍有多個機讀編目格式問世, USMARC、UKMARC、INTERMARC等。UNIMARC適時推出, 企圖成為各機讀編目格式之間的中人, 擔任資料交換的角色。考量臺灣地區的需求及國際書目交換的可能, 中國機讀編目格式在多種機讀編目格式的環境下出版。

The Library Association of China and the National Central Library (NCL) jointly established the Library Automation Planning Committee (LAPC) in 1980 to improve library and information management and services. One objective of the LAPC's automation planning was to develop a machine-readable format as the standard for cataloging Chinese publications. The Chinese MARC Working Group (CMWG) was formed under the LAPC to design a machine-readable format that would not only process Chinese materials but also conform to the standards for international data exchange. The decision was made to take UNIMARC as the model for designing CMARC, with USMARC as a major reference.[2]

1980年, 中國圖書館學會與國立中央圖書館共同成立圖書館自動化規畫委員會, 以改進圖書館及資訊管理的服務。其中一個目標是發展機讀編目格式, 做為編目中文出版品的標準。中國機讀編目格式工作小組於焉成立, 設立適合處理中文資料的機讀編目格式, 同時兼顧國際書目交換。最後決定, 採用UNIMARC的架構, 將USMARC當成主要的參考資料。

In 1981, Chinese MARC for Books was published, mainly for processing monographic materials. The CMWG continued revising CMARC with reference to the new versions of the USMARC Formats for Bibliographic Data and UNIMARC so as to broaden the applicability of the CMARC format. In 1982, the First Edition of the Chinese MARC Format (CMARC) was published with elements to process serials, maps, music and audio-visual materials in addition to monographs. With the help of the Institute of History and Philology of the Academia Sinica and the Division of Special Collections of the NCL, CMARC fields dedicated to Chinese rare books and rubbings were added to the Second Edition of CMARC, published in 1984.

中國機讀編目格式工作小組參照稍晚出版的USMARC Formats for Bibliographic Data 及 UNIMARC, 於1981年出版Chinese MARC for Books, 以處理單本圖書為對象。1982年出版First Edition of Chinese MARC Format (CMARC), 處理對象不以單本圖書為限, 包括連續性出版品、地圖、樂譜、視聽資料等; 在中央研究院歷史語言研究所及國家圖書館特藏組的參與下, 1984年的中國機讀編目格式第二版再納入中文的善本書及拓片。

When the Third Edition of CMARC was published in 1989, libraries in Taiwan were in the midst of automation. Promoted as a standard for library automation, the Third Edition of CMARC was the first MARC format used by many libraries because it could cover almost all the materials held by a library at that time. Even after the Fourth Edition was published in 1997 and Version 2001 in 2002, the Third Edition of CMARC is still the most widely used MARC format among libraries in Taiwan[3].

中國機讀編目格式第三版於1989年出版時, 臺灣的圖書館正處於自動化的轉型期, 很多圖書館將中國機讀編目格式第三版視為圖書館自動化的標準之一, 涵蓋當時圖書館的各種館藏; 即使1997年的第四版、2001年版問世後, 第三版仍是臺灣圖書館界最流行的機讀編目格式。

2. State of the Art

現況

2.1 Features of CMARC

中國機讀編目格式的特色

The basic structure of CMARC comprises record structure, content designation and data content, just as in UNIMARC and other MARC formats. The features of CMARC are:

中國機讀編目格式的基本架構與UNIMARC或其他機讀編目格式一樣, 包括record structure, content designation and data content:

In order to describe all types of Chinese materials comprehensively, data fields and codes dedicated to specific types of Chinese materials, which are not defined in UNIMARC, are added to CMARC. For example:

為了描述中文資料的所有類型, 新增若干UNIMARC沒有的資料欄位及代碼, 如:

2.2 Development of CMARC

中國機讀編目格式的發展

The CMARC format was developed on the model of UNIMARC. It is therefore important to keep it in harmony with international standards. However, since the CMARC format is meant for use in libraries in Taiwan, factors such as librarians' adaptability and implementation in library systems also need to be taken into consideration.

中國機讀編目格式以UNIMARC為藍本, 必須時時保持兩者的和諧; 不過, 中國機讀編目格式以臺灣地區的圖書館為對象, 必須將圖書館員的適應性及圖書館系統的應用納入考量。

During the process of modifying CMARC in the late 1990s, opinions were invited from experienced librarians, library system vendors and library scholars. The discussions resulted in replacing the Linking Entry Block (4XX) with the equivalent Related Title Block (5XX). Both librarians and library system vendors need to make substantial adjustments if they want to comply fully with the modifications. At present, libraries adopt the new version of CMARC in different ways: some select new fields for particular purposes, such as cataloging new types of media, while others leave their practice unchanged.

圖書館員、圖書館系統廠商及學者專家對於中國機讀編目格式的修訂, 提出很多寶貴的意見,最重大的建議是將連接款目段(4__)刪除, 將各欄的內容併入相關題名段(5__); 圖書館員及圖書館系統廠商必須做出重大的調整, 才能適應此改變, 圖書館採取若干變通的手段, 包括不更動舊載體的編目習慣, 祗在編目新載體時, 才使用新的格式。

The task of modifying CMARC will continue under the principles of maintaining structural integrity and incorporating elements from current developments in both UNIMARC and MARC 21. It is foreseeable that future modifications will still prompt debates over the MARC structure and library practice. Hopefully the next version of CMARC will emphasize long-term strategies that extend its applicability to practical library requirements while maintaining the stability of the CMARC structure.

在持續修訂的過程裡, 中國機讀編目格式力圖保持與UNIMARC和MARC21的結構一致與欄位相容, 可以想見在未來的日子裡, 中國機讀編目格式的結構與圖書館實務之間的爭議, 不會停歇; 希望在下一版的中國機讀編目格式, 能夠強調長期策略, 配合圖書館的實務, 維持中國機讀編目格式結構的穩定性。

2.3 CMARC3 XML schema/DTD

中國機讀編目格式第三版 XML schema/DTD

The MARC format conforming to ISO 2709/CNS-13148 is considered the standard among most libraries, whereas outside the library field XML has become the prevailing trend for data processing and transport. To increase the possibilities for data sharing, in 2004 Dr. Shien-Chiang Yu of Shih Hsin University launched a research project, funded by the NCL, to construct CMARC XML[5].

與ISO 2709/CNS-13148相符的機讀編目格式被認為是圖書館的標準之一, XML 已成為資料處理及傳輸的趨勢, 圖書館界有必要正視此現象; 2004年, 在國家圖書館的補助下, 世新大學余顯強博士完成CMARC XML計畫。

Because it carries a document type definition and enforces a standard format at data input, XML is an ideal tool for data exchange or transformation across systems. By comparison, a record in a MARC format conforming to ISO 2709/CNS-13148 neither identifies which MARC format it uses nor can be presented directly on the web. These drawbacks limit the application of the MARC format to library automation systems.

由於具備文件格式定義及資料匯入標準格式的特性, XML成為跨系統資料交換或轉換的理想工具; 與XML相比, 與ISO 2709/CNS-13148相符的機讀編目格式, 既不能辨識機讀編目格式的類型, 也無法直接從網頁讀取內容, 限制其在自動化系統的應用。

The project analyzed both foreign and domestic approaches to schema design that adopt XML as the data format for bibliographic data exchange, with reference to the related definitions and contents. As part of the project, a program was developed to convert ISO 2709/CNS-13148 files to and from XML documents based on the XML Schema. The CMARC3 XML schema/DTD documents and the conversion software can be downloaded for trial use after registration.

該計畫分析國內外書目資料交換的格式, 擷取與定義和內容相關的XML綱要格式, 發展出ISO2709/CNS-13148檔案格式與XML綱要文件的互換程式, 登入後, 可以試用此轉換程式。
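The direction of such a conversion can be sketched in a few lines of Python. This is a minimal illustration, not the project's software: it assumes the usual ISO 2709 layout (24-character leader, 12-character directory entries, two indicators), and the element names `record`, `leader`, `controlfield`, `datafield` and `subfield` are placeholders chosen for readability that may differ from the actual CMARC3 XML schema. The `build_record` helper exists only to produce a sample record.

```python
import xml.etree.ElementTree as ET

FT, RT, SF = b"\x1e", b"\x1d", b"\x1f"  # field terminator, record terminator, subfield delimiter


def build_record(fields):
    """Assemble a minimal ISO 2709 record from (tag, body_bytes) pairs (sample-data helper)."""
    directory, data = b"", b""
    for tag, body in fields:
        body += FT
        # Directory entry: tag (3), field length (4), starting offset (5).
        directory += f"{tag}{len(body):04d}{len(data):05d}".encode("ascii")
        data += body
    base = 24 + len(directory) + 1          # base address of data
    total = base + len(data) + 1            # total record length
    leader = f"{total:05d}nam  22{base:05d}   4500".encode("ascii")
    return leader + directory + FT + data + RT


def iso2709_to_xml(raw, encoding="utf-8"):
    """Parse one ISO 2709 record and return it as an ElementTree element."""
    leader = raw[:24].decode("ascii")
    base = int(leader[12:17])
    directory = raw[24:base - 1]            # exclude the directory's field terminator
    record = ET.Element("record")
    ET.SubElement(record, "leader").text = leader
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12].decode("ascii")
        tag, length, start = entry[:3], int(entry[3:7]), int(entry[7:12])
        body = raw[base + start:base + start + length - 1]  # drop the field terminator
        if tag.startswith("00"):            # control fields carry no indicators or subfields
            ET.SubElement(record, "controlfield", {"tag": tag}).text = body.decode(encoding)
            continue
        indicators, *subfields = body.split(SF)
        ind = indicators.decode("ascii")
        field = ET.SubElement(record, "datafield", {"tag": tag, "ind1": ind[0], "ind2": ind[1]})
        for sub in subfields:
            text = sub.decode(encoding)
            ET.SubElement(field, "subfield", {"code": text[0]}).text = text[1:]
    return record


raw = build_record([
    ("001", b"000123"),
    ("200", b"1 " + SF + "a紅樓夢".encode("utf-8") + SF + "f曹雪芹".encode("utf-8")),
])
print(ET.tostring(iso2709_to_xml(raw), encoding="unicode"))
```

Unlike the ISO 2709 byte stream, the resulting XML can be validated against a schema and rendered by generic web tools, which is the advantage described above.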

2.4 CMARC3 to MARC21

中國機讀編目格式第三版轉換至MARC21

Among libraries in Taiwan with comparatively large holdings, CMARC is still the most widely used MARC format for cataloging Chinese materials. Over the past two decades, however, these libraries have also made extensive use of bibliographic resources in USMARC for materials in western languages. Since the majority of western-language collections are in English, catalogers rely heavily on deriving records in USMARC/MARC 21 from OCLC and ITS MARC.

臺灣的圖書館擁有相當館藏後, 採用機讀編目格式做為編目中文資料的依據; 過去20多年,臺灣的圖書館從國外的書目資源裡, 納入甚多USMARC的西文編目資料, 尤其是來自OCLC與ITS MARC的英文書目資源。

To avoid data loss during MARC conversion, many libraries use the CMARC format for Chinese materials and USMARC/MARC 21 for materials in western languages. Libraries that use only CMARC but need to derive records in USMARC/MARC 21, or vice versa, have to develop in-house programs to convert data into the required MARC format. Most MARC conversion programs are developed and built within the library system, so it is important to ensure that they are all designed to the same standard.

為了避免轉換機讀編目格式造成資料的遺漏, 很多圖書館以中國機讀編目格式處理中文資料,以USMARC/MARC 21處理西文資料。採用單一機讀編目格式的圖書館, 必須有互相轉換機讀編目格式的程式, 供內部使用。這些轉換程式多綁在圖書館系統之內, 圖書館必須確定它們的轉換標準是一樣的。

In 1992 the Ministry of Education funded a project to develop specifications for converting bibliographic records in CMARC format to and from USMARC. The project members were experts in MARC formats and librarians experienced in using CMARC or USMARC. The project resulted in MARC field mappings in tabular form, published in 1993 as a two-volume set: one volume for converting bibliographic records from CMARC to USMARC and the other from USMARC to CMARC. The project also included a suggested prototype for designing conversion programs and related technical documents.

1992年, 教育部補助一項計畫, 由機讀編目格式的專家學者, 製作中國機讀編目格式與USMARC之間互相轉換的規格; 1993年, 出版該計畫的成果, 分別完成中國機讀編目格式轉換至USMARC的欄位對照表, 以及USMARC轉換至中國機讀編目格式的欄位對照表, 並且建議設計轉換程式的原型及相關技術文件。

To reflect the current usage of MARC 21, the NCL completed conversion specifications from CMARC to MARC 21 in April 2006. The specifications are based on the UNIMARC to MARC 21 conversion specifications (Version 3.0) and have been reviewed by library scholars. They are expected to enhance the sharing of bibliographic records within Taiwan and to support international bibliographic exchange, such as uploading data to OCLC.

2006年4月, 國家圖書館完成中國機讀編目格式轉換至MARC21的轉換規格, 該對照表以美國國會圖書館編寫的UNIMARC至MARC21轉換規格為藍本, 對於臺灣地區的書目記錄共享及上傳至OCLC等國際書目交換, 有相當的助益。
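A table-driven conversion of this kind can be sketched as follows. The mapping entries are illustrative only: they follow commonly cited UNIMARC to MARC 21 equivalences for a handful of fields and are not copied from the NCL specifications, and the `999` field in the sample record is a hypothetical local field used to show what happens when no mapping exists.

```python
# Illustrative mapping only: (CMARC tag, subfield) -> (MARC 21 tag, subfield).
FIELD_MAP = {
    ("010", "a"): ("020", "a"),  # ISBN
    ("200", "a"): ("245", "a"),  # title proper
    ("200", "f"): ("245", "c"),  # first statement of responsibility
    ("210", "a"): ("260", "a"),  # place of publication
    ("210", "c"): ("260", "b"),  # name of publisher
    ("210", "d"): ("260", "c"),  # date of publication
}


def convert(record, field_map=FIELD_MAP):
    """Convert a record given as (tag, subfield_code, value) triples.

    Returns (converted, unmapped); anything in `unmapped` would be lost
    unless the mapping table is extended.
    """
    converted, unmapped = [], []
    for tag, code, value in record:
        target = field_map.get((tag, code))
        if target is None:
            unmapped.append((tag, code, value))
        else:
            converted.append((target[0], target[1], value))
    return converted, unmapped


cmarc_record = [
    ("010", "a", "9570000000"),      # made-up ISBN for illustration
    ("200", "a", "紅樓夢"),
    ("200", "f", "曹雪芹"),
    ("210", "a", "臺北市"),
    ("999", "a", "local note"),      # hypothetical field with no mapping
]
marc21_record, lost = convert(cmarc_record)
print(marc21_record)
print("unmapped:", lost)
```

Keeping the mapping in a single shared table, rather than inside each vendor's program, is one way to ensure that conversion programs in different library systems follow the same standard.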

3. NBINet union catalog

全國圖書書目資訊網

One of the goals of developing the CMARC format is to foster an online union catalog. The NCL launched the National Bibliographic Information Network (NBINet) in 1991. The current system started operation in 1998 to cope with bibliographic records contributed by member libraries in various MARC formats and Chinese internal codes. In the 1990s, besides the CMARC format, USMARC became popular, especially for cataloging materials in western languages. As for Chinese internal codes, the Chinese Character Code for Information Interchange (CCCII) and Big5 are the most widely used among libraries in Taiwan. Because the library systems used by cooperating libraries have developed along divergent lines, the MARC format and the Chinese internal code have always been the major concerns in establishing a union catalog in Taiwan.

線上聯合目錄是建立中國機讀編目格式的目標之一, 1991年, 國家圖書館成立全國圖書書目資訊網, 1998年改版後, 足以因應多種機讀編目格式及中文的編碼方式。中國機讀編目格式及USMARC是使用最廣泛的機讀編目格式, CCCII及Big5是使用最廣泛 的中文內碼, 機讀編目格式和中文內碼是全國圖書書目資訊網最關心的兩件事。

The NBINet system is able to store bibliographic records in multiple MARC formats conforming to the ISO 2709/CNS-13148 standard, but the internal Chinese code currently has to be CCCII; it will be converted to Unicode in the near future. Bibliographic files provided by member libraries may be in any MARC format with CCCII, Big5 or Unicode encoding. All these files are converted into CCCII before being loaded into the database. To satisfy the need for different data formats, the system can output bibliographic records in the MARC format and internal code selected by the member library.

全國圖書書目資訊網已經接受符合ISO 2709/CNS-13148標準的多種機讀編目格式, 但祗接受CCCII的編碼, 近期內將轉換至Unicode, 屆時, 會員可以上傳CCCII、Big5或Unicode編碼的任何機讀編目格式書目資料, 轉換為CCCII後, 再儲存在系統的資料庫裡; 匯出的時候, 可以應使用者要求, 以指定的機讀編目格式及編碼方式匯出。

3.1 Issues of Multiple MARC formats

多種機讀編目格式的議題

NBINet currently has 77 member libraries. Of these, 67 use CMARC to catalog materials in Chinese, Japanese and Korean (CJK), and 32 of those 67 use only CMARC. Ten of the 77 libraries use only USMARC/MARC 21. The remaining 35 libraries use CMARC for CJK materials and USMARC/MARC 21 for materials in other languages. The majority of the collection in almost all libraries is likely to be in Chinese. Since the NCL holds the most Chinese materials published in Taiwan, most libraries follow the MARC format used by the NCL for cataloging Chinese materials. On the other hand, the bibliographic resources for materials in western languages, especially those in English, are almost all in USMARC/MARC 21. Libraries therefore use USMARC/MARC 21 alongside CMARC to avoid the data loss caused by MARC conversion.

全國圖書書目資訊網有77個會員圖書館, 其中67個會員圖書館採用中國機讀編目格式處理中日韓文的資料, 10個會員圖書館祗採用USMARC/MARC21處理所有語文的資料;  67個採用中國機讀編目格式處理中日韓文資料的會員圖書館裡, 有32個會員圖書館以中國機讀編目格式處理所有語文的資料, 另外35個會員圖書館以中國機讀編目格式處理中日韓文的資料, 以USMARC/MARC21處理其他語文的資料。換句話說, 幾乎所有西文資料的書目都以USMARC/MARC21處理。混用USMARC/MARC21及中國機讀編目格式祗有一個目的, 避免轉換格式時, 造成資料流失。

The advantages of using multiple MARC formats in a union catalog are: (1) the coverage of bibliographic resources is extended without being limited to a single MARC format; (2) no effort is spent on MARC conversion to preprocess the input files; (3) there is no data loss if a record is imported and exported in the same MARC format. Nevertheless, there are also disadvantages: (1) there are duplicate records for the same work in different MARC formats; (2) libraries have to check the MARC format before deriving records; (3) data loss caused by MARC conversion is inevitable if a bibliographic record is exported in a MARC format different from its original one.

聯合目錄採用多種機讀編目格式, 有優點也有不便之處。優點有三: (1)擴大書目資源的範圍, 不受限於特定的機讀編目格式, (2)匯入資料時, 省去轉換的工夫, (3)在同一機讀編目格式內匯入及匯出時, 沒有任何資料遺漏。它的缺點也需注意: (1)同樣的內容可能以多個機讀編目格式重複儲存, (2)將資料匯入之前, 圖書館必須先檢視其機讀編目格式, (3)以其他機讀編目格式匯出時, 轉換的過程, 不免遺漏資料。

3.2 Issues of Multiple internal codes

多種內碼的議題

The diversity of Chinese internal codes has long been a problem for library systems used in Taiwan. The internal code sets commonly used among libraries are CCCII (around 54,000 codes) and Big5 (around 13,000 codes). The type of internal code implemented in a library system affects the quality of processing of bibliographic records and patron records. Among the 77 NBINet member libraries, 38 use CCCII, 32 use Big5 and currently only 7 use Unicode. Libraries using CCCII have a wider choice of characters than those using Big5. However, CCCII is supported only by particular library software for maintaining bibliographic or patron records, and many of its codes still cannot be displayed in a Web OPAC that uses Big5 or Unicode. A Big5 system, on the other hand, can display exactly what is input in the bibliographic record, but librarians often encounter the problem of insufficient characters[6].

中文有多種編碼方式, 臺灣圖書館系統的煩惱已久。5萬4千多字的CCCII及1萬3千多字的Big5, 是較常用的兩種中文內碼, 對於書目記錄及讀者記錄的品質, 有決定性的影響。全國圖書書目資訊網的77個會員圖書館裡, 有38間圖書館採用CCCII, 32間圖書館採用Big5, 另有7間圖書館採用Unicode。雖然CCCII可編定較多的中文字, 但祗限於特定的圖書館系統, 在Big5或Unicode的公用目錄裡, 仍無法呈現這些中文字; Big5可精準呈現鍵入的字, 圖書館嫌它的字數太少, 無法因應實際的需要。

In fact, for either code set, librarians have always demanded new codes. Unfortunately, no organization is responsible for regular maintenance, and libraries cannot wait for the long process of assigning new codes. To solve the problem of insufficient characters, different vendors use the user-defined area in different ways, which makes data exchange difficult. With more than 70,000 CJK codes, Unicode is no doubt a solution to this chaotic situation.

圖書館員隨時歡迎新的編碼方式, 可惜沒有任何機構定期維護字碼, 圖書館不可能無限期等待新的字碼。圖書館資訊系統廠商在使用者自訂區新增若干字, 應付資料交換的需要; 毫無疑問地, 已編碼7萬多中日韓字的Unicode, 是解決混亂情況的最佳選擇。

Whether to convert to Unicode depends partly on the system vendors and partly on the standardization of conversion. If a vendor decides not to invest the considerable effort needed to convert the codes on the current system, the library will need to evaluate whether to keep the system or to find an alternative that supports Unicode. Such alternatives normally raise budget issues, which require long-term planning. Standardizing the conversion would help to avoid data loss and incorrect conversion. Converting data from Big5 to Unicode is expected to pose no problem, since Unicode is likely to include all the characters in Big5. Converting from CCCII to Unicode requires more preparation because the CCCII code set, as part of its structural arrangement, maps multiple codes to an identical character.

系統廠商及轉換標準是採用Unicode的考量因素, 如果廠商認定轉換沒有商業價值, 圖書館就需考慮以自己的資源投入。標準化的轉換程式, 有助於減少資料的流失及錯誤, Unicode的字數較多, 因此, 從Big5轉換成Unicode沒有問題; 由CCCII轉換成Unicode, 需要相當的準備工夫, CCCII將多個內碼對應到相同的字。
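The Big5 side of this argument can be illustrated with Python's built-in `big5` codec. This is only a sketch: the codec implements one particular Big5 variant, and vendor-specific user-defined characters, which the paper identifies as the real obstacle to data exchange, are outside its scope. The function names are hypothetical.

```python
def big5_to_utf8(raw: bytes) -> bytes:
    """Re-encode Big5 bytes as UTF-8 by going through Unicode."""
    return raw.decode("big5").encode("utf-8")


def unencodable_in_big5(text: str) -> list:
    """Return the characters of `text` that the Big5 codec cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode("big5")
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


title = "中國機讀編目格式"
big5_bytes = title.encode("big5")

# Big5 -> Unicode loses nothing for characters inside the Big5 repertoire.
print(big5_to_utf8(big5_bytes).decode("utf-8") == title)

# The reverse direction is where "insufficient characters" shows up:
# the simplified form 读 has no Big5 code, while the traditional 讀 does.
print(unencodable_in_big5("讀读"))
```

Running a check like `unencodable_in_big5` over a file before export is one simple way to find records that cannot be delivered to a Big5-based system without substitution.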

The unofficial Unicode Workgroup, formed in 2004 and hosted by the NCL to serve library needs, has the following purposes[7]:

國家圖書館於2004年成立非官方的Unicode工作小組, 其任務如下:

The Workgroup has just finalized two-way mapping tables comprising more than 50,000 mapping sets from CCCII to Unicode and more than 46,000 sets from Unicode to CCCII. The current mapping tables should cover almost all frequently used characters. Additional mapping sets for rarely used characters will be added in the next version. These tables are used not only to prepare for the Unicode environment but also to provide a data exchange standard for the interim period while CCCII is still used among libraries.

該工作小組已完成可對應Unicode之50,764個 CCCII碼, 可對應CCCII之46,057個Unicode碼, 足以涵蓋常用的字, 罕用字的對照表有待後續修訂, 該等對照表不僅有利於Unicode環境, 對於仍使用CCCII的圖書館也是一個交換標準。
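Why the two tables differ in size (more than 50,000 entries in one direction, more than 46,000 in the other) follows from the many-to-one nature of CCCII noted earlier. A minimal sketch, assuming that the first CCCII code listed for a character is treated as its canonical code; the six-character codes below are placeholders and not real CCCII assignments, and the Workgroup's actual rule for choosing the reverse mapping is not described in this paper.

```python
def build_tables(pairs):
    """Build forward and reverse tables from (cccii_code, unicode_char) pairs.

    The first code seen for a character is kept as its canonical reverse mapping.
    """
    to_unicode, to_cccii = {}, {}
    for code, char in pairs:
        to_unicode[code] = char
        to_cccii.setdefault(char, code)
    return to_unicode, to_cccii


# Placeholder codes: two different CCCII codes map to the same Unicode character.
pairs = [
    ("X00001", "為"),
    ("X00002", "為"),
    ("X00003", "圖"),
]
to_unicode, to_cccii = build_tables(pairs)

print(len(to_unicode), len(to_cccii))      # the forward table is the larger one
round_trip = to_cccii[to_unicode["X00002"]]
print(round_trip)                          # comes back as the canonical code
```

A record converted from CCCII to Unicode and back therefore keeps its characters but may not keep its original byte values, which matters when systems compare or deduplicate records at the byte level.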

4. Conclusion

結論

One of the missions of a library is to preserve the cultural legacy reflected in various forms of publications. Different MARC formats and language encoding systems have been developed as standardized tools to record and store properly the publications held by libraries in different countries. Diverse standards representing different cultures deserve mutual respect. Although UNIMARC and Unicode both aim to bridge different standards, they are still unable to encompass fully all the elements in CMARC or all Chinese characters. The best way to manage Chinese materials is to improve the current standards while maintaining compatibility with other languages.

典藏各種型式的出版品, 是圖書館達成保存文化的手段之一; 各國發展的機讀編目格式及文字編碼系統, 儼然成為圖書館記錄及保存出版的標準工具之一。對於代表各種文化的多種標準, 應該給予尊重。雖然, UNIMARC及Unicode都以橋樑自居, 但仍無法達到溝通中國機讀編目格式內各細目或所有中文字的目標; 我們認為現階段最好的策略是改進現行標準及維持與其他標準的相容性。

References

1. The UKMARC Manual: Preface, http://www.bl.uk/services/bibliographic/marc/marcintro.html, accessed 27th June 2006

2. Chinese MARC Working Group, Library Automation Planning Committee, "Preface," Chinese MARC Format for Books (Taipei, Taiwan: Library Association of China & National Central Library, 1981), pp. iii-iv.

3. Mao, Ching-Chen, The Compatibility of CMARC [in Chinese], Journal of Educational Media and Library Sciences, 35(4): 310 - 337, 1998.

Huang, Mei-Lien and Huang, Wen-Yu, A Comparative Study of the CMARC3 and CMARC4[in Chinese], Bulletin of Library and Information Science, 39 (Nov. 2001): 94-108.

4. Chiang, Hsiu-ying, "Introduction to MARC format" [in Chinese], Library Association of China Workshop on Management of Library Resources, 26-31 July, 1999 (Taipei: National Central Library, 1999), pp. 20-21.

5. Yu, Shien-Chiang, MARC XML Schema/DTD report [in Chinese], Taipei, National Central Library, 2004.

5. 機讀編目延伸標示語言文件型別研究: 研究報告 / 余顯強, 2004年9月, [PDF], 國家圖書館, http://digbig.com/4pgwk

6. Chinese Code Introduction, at CNS 11643 Full Character Repository, http://61.60.106.73/eng/word.jsp, accessed 27th June 2006.

7. Unicode Workgroup [in Chinese], http://unicode.ncl.edu.tw/, accessed 27th June 2006.