Unicode / UTF-8
Unicode, developed jointly by the Unicode Consortium and the ISO working group responsible for ISO/IEC 10646 (JTC 1/SC 2/WG 2), is a universal standard for encoding multilingual text. ISO/IEC 10646 was first published in 1993, and a consolidated edition followed in December 2003.
The 2014 version describes more than 110,000 characters from 100 scripts in addition to various symbols.
Characters can be encoded in either a precomposed or a decomposed form. In the decomposed form, a special character is represented as a base letter followed by the relevant combining diacritical marks. For example, an umlaut is encoded as the base vowel followed by a "combining diaeresis" (U+0308).
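The difference between the two forms can be illustrated with a minimal Python sketch. It assumes the standard-library unicodedata module and uses the normalization forms NFC (composed) and NFD (decomposed) defined by the Unicode standard; the variable names are illustrative only.

```python
import unicodedata

# "ü" as a single precomposed code point (U+00FC).
composed = "\u00fc"

# NFD splits it into the base letter "u" plus COMBINING DIAERESIS (U+0308).
decomposed = unicodedata.normalize("NFD", composed)

print([f"U+{ord(c):04X}" for c in composed])    # ['U+00FC']
print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+0075', 'U+0308']

# The two strings look the same on screen but do not compare equal
# until both are brought to the same normalization form.
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

When records from different sources are compared or indexed, normalizing both sides to one form first is one way to avoid missed matches.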
Unicode (or ISO/IEC 10646) has the advantage of encoding almost every script and avoiding the ambiguity created by the complex use of escape sequences. It can be serialized using 16-bit code units (UTF-16) or 8-bit code units (UTF-8). The latter is increasingly used on the web for data exchange and is therefore recommended for bibliographic records and library data processing. The latest version of the Unicode standard is available on the web, together with character code charts organized by script.
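As a small illustration of the two encoding forms, the following Python sketch serializes the same string as UTF-8 and as UTF-16. The sample name is arbitrary, and big-endian byte order without a byte order mark is chosen here only to keep the output easy to read.

```python
# Sample text containing one non-ASCII character (precomposed ü, U+00FC).
text = "M\u00fcller"

utf8_bytes = text.encode("utf-8")       # 8-bit code units
utf16_bytes = text.encode("utf-16-be")  # 16-bit code units, big-endian, no BOM

print(utf8_bytes.hex(" "))   # 4d c3 bc 6c 6c 65 72
print(utf16_bytes.hex(" "))  # 00 4d 00 fc 00 6c 00 6c 00 65 00 72

# ASCII letters take one byte in UTF-8 and "ü" takes two;
# in UTF-16 every character in this string takes two bytes.
print(len(utf8_bytes), len(utf16_bytes))  # 7 12

# Decoding with the matching encoding restores the original text.
print(utf8_bytes.decode("utf-8") == text)  # True
```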
Further useful resources are:
- libUniCode-plus (Creation and manipulation of Unicode tables)
- Unicode Character Search (search for characters by their Unicode names)