
UDT Occasional Paper #1c.

Problems of Optical Character Recognition Technologies in Russian Libraries and Information Centres

M. Goncharov, Department Head
D. Nikolaev, Section Head
The Russian National Public Library for Science and Technology
Moscow, Russia

March, 1995

The development and application of new technologies for processing large volumes of information on various media is a focus of attention in Russia. This task is even more important for Russian libraries and information centres than for their Western partners, owing to the late introduction of modern automated library and information systems and of bibliographic data entry in machine-readable form.

Large libraries that have already introduced such systems into their local workflows now face the problem of retrospective conversion of catalogs containing many millions of cards, and of building full-text and hypertext databases of unpublished documents and "grey" literature.

The problem is twofold: (a) software and technological, and (b) economic.

In this paper we briefly describe the first part of the problem: the application of optical character recognition systems in Russian libraries and information centres.

1. In the late 1980s the country's leading libraries and information centres, namely the Russian National Public Library for Science and Technology, the Moscow University Scientific Library, and the All-Russian Centre for Sci-Tech Information, began research aimed at creating automated systems for entering information from hard copy. However, this research yielded almost no significant results because there were no reliable software packages for recognizing the Cyrillic alphabet. Note that the algorithms for Cyrillic recognition, and the way they are used, differ from those for the Latin alphabet. Characters composed of disconnected elements sharply limit the usefulness of even the most advanced and expensive Western programs for Cyrillic texts. The letters "Ы" and "Й" cannot be processed properly, and the characters "Ф" and "Ж" are almost unrecognizable as well. A further problem was that spell-checking could not be used for Cyrillic texts, although spell-checking is an integral part of all character recognition technologies.

Several Russian companies have tried to create new software products, but, due to the complexity of the problem, positive results have been achieved only recently.

Once comparatively reliable recognition software had been created, many Russian libraries and information centres began introducing it intensively into their local workflows for processing hard copy. In doing so, however, they encountered yet another obstacle to applying the software: the low quality of the original texts.

While recognition accuracy for articles from Western journals with high printing quality is about 98 percent, it is much lower for publications printed domestically, where some characters may run together within a word or fail to print. Accuracy is lowered further by the quality of cards typed on domestically made typewriters: such cards make up most of the library card catalogs and share the same defects.
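
To put such accuracy figures in perspective, the rough calculation below estimates how many characters would need manual correction. It is only a sketch under stated assumptions: the accuracy is treated as a per-character rate, errors are treated as independent, and the 300-character card length and the 90 percent figure for poor originals are hypothetical values chosen for illustration, not measurements from this paper.

```python
def expected_errors(char_count: int, accuracy: float) -> float:
    """Expected number of misrecognized characters.

    Assumes 'accuracy' is a per-character recognition rate in [0, 1].
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return char_count * (1.0 - accuracy)


def error_free_probability(char_count: int, accuracy: float) -> float:
    """Probability that a text contains no recognition errors at all.

    Assumes errors on individual characters are independent, which is
    a simplification: real defects tend to cluster.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return accuracy ** char_count


CARD_LENGTH = 300  # hypothetical number of characters on one catalog card

# 0.98 is the figure quoted for high-quality originals;
# 0.90 is a hypothetical stand-in for "much lower".
for accuracy in (0.98, 0.90):
    errors = expected_errors(CARD_LENGTH, accuracy)
    clean = error_free_probability(CARD_LENGTH, accuracy)
    print(f"accuracy {accuracy:.0%}: about {errors:.0f} errors per card, "
          f"chance of an error-free card {clean:.2%}")
```

Even at 98 percent, a card of this length would be expected to contain about six errors, which is why proofreading remains part of any conversion workflow.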

Scanning is most advanced for journals and newspapers, since in most cases their printing quality meets Western standards.

As for the most significant tasks, such as retrospective conversion and creation of electronic depositories of domestic publications, we try to achieve the best combination of optical recognition (when the quality of the original is adequate), acquisition of electronic publications, and manual text entry.
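
The combination of entry methods mentioned above can be pictured as a simple routing rule. The sketch below is illustrative only: the `Document` fields, the `choose_entry_method` function, the quality score, and the threshold value are all hypothetical and do not describe any system actually deployed in the libraries discussed here.

```python
from dataclasses import dataclass
from enum import Enum


class EntryMethod(Enum):
    ELECTRONIC_FILE = "acquire the electronic publication"
    OCR = "optical recognition"
    MANUAL = "manual text entry"


@dataclass
class Document:
    title: str
    has_electronic_file: bool  # an electronic counterpart is obtainable
    print_quality: float       # hypothetical score: 0.0 (unreadable) to 1.0


QUALITY_THRESHOLD = 0.8  # hypothetical cut-off for "adequate" originals


def choose_entry_method(doc: Document,
                        threshold: float = QUALITY_THRESHOLD) -> EntryMethod:
    """Pick an entry method for one document.

    Prefers an existing electronic file, then OCR when the original is
    of adequate quality, and falls back to manual entry otherwise.
    """
    if doc.has_electronic_file:
        return EntryMethod.ELECTRONIC_FILE
    if doc.print_quality >= threshold:
        return EntryMethod.OCR
    return EntryMethod.MANUAL


if __name__ == "__main__":
    samples = [
        Document("Recent journal issue", True, 0.9),
        Document("Well-printed newspaper", False, 0.85),
        Document("Faded typewritten card", False, 0.4),
    ]
    for doc in samples:
        print(f"{doc.title}: {choose_entry_method(doc).value}")
```

In practice the quality judgment would be made by an operator or estimated from a trial scan; the numeric score here merely stands in for that decision.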

The use of ready-made computer files is the principal task in creating electronic depositories. The problems encountered here are mainly organizational: most publishers use electronic publishing systems, and virtually all publications have electronic counterparts, which very often remain unclaimed.

Another important aspect of scanning in Russian libraries is the combination of character recognition with storage of document page images. In practice this is done as follows: first, journal contents pages are scanned and used in cataloging; second, journals in high demand are scanned, compressed, and stored as electronic copies in image form.

2. As for software products for Cyrillic text recognition (which does not preclude their handling Latin texts as well), there are numerous products of good quality and moderate price. The list of Russian products runs to more than twenty titles, but the most popular is FineReader 1.3, with a recognition accuracy of 99.5 percent for medium-quality originals (newspapers, typewritten texts, and fax messages).

FineReader runs under Windows and allows for:

- automatic text reading from scanner,
- recognition in the "intelligent" mode of virtually any font without preliminary training,
- automatic separation of text and graphics,
- automatic setting of scanning brightness,
- automatic correction and spell-check,
- recognition of Cyrillic and English texts and texts with a combination of both alphabets.
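
Recognizing texts that combine both alphabets has to contend with letters that look identical in Cyrillic and Latin (for example, Latin "A" and Cyrillic "А"). As an illustration only, and not a description of how FineReader works, the sketch below flags words in recognized text that mix the two scripts, a typical symptom of such confusions. It relies on `unicodedata.name` from the Python standard library; the function names `script_of` and `mixed_script_words` are hypothetical.

```python
import unicodedata
from typing import List


def script_of(ch: str) -> str:
    """Classify one character as CYRILLIC, LATIN, or OTHER.

    Uses the official Unicode character name, which begins with the
    script name for letters of these two alphabets.
    """
    if not ch.isalpha():
        return "OTHER"
    name = unicodedata.name(ch, "")
    if name.startswith("CYRILLIC"):
        return "CYRILLIC"
    if name.startswith("LATIN"):
        return "LATIN"
    return "OTHER"


def mixed_script_words(text: str) -> List[str]:
    """Return whitespace-separated words containing both Cyrillic and Latin letters."""
    flagged = []
    for word in text.split():
        scripts = {script_of(ch) for ch in word} - {"OTHER"}
        if len(scripts) > 1:
            flagged.append(word)
    return flagged


if __name__ == "__main__":
    # The third word starts with a Latin "M" followed by Cyrillic letters.
    sample = "\u041c\u043e\u0441\u043a\u0432\u0430 Moscow M\u043e\u0441\u043a\u0432\u0430"
    print(mixed_script_words(sample))
```

Words flagged this way are candidates for review or for dictionary-based correction; legitimately mixed tokens, such as product names, would need to be whitelisted.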

The Figure shows a part of the FineReader screen.

Tiger and Author are packages running under Windows and MS-DOS. They are also well known and popular.

CuneiForm is becoming popular because it is a multi-language package that works with Russian, English, German, French, and some other European languages.



Latest Revision: March 29, 1995 Copyright © 1995-2000
International Federation of Library Associations and Institutions
www.ifla.org