UtfString
Unicode Overview

Originally, computers could display only a basic set of characters, essentially the unaccented Latin letters, digits, and punctuation. The most popular character set was ASCII, in which each character has a corresponding number. Basic ASCII defines 128 characters, not all of which are printable: some are control characters, such as the line feed, the null terminator, or the bell character, which is intended to produce a sound. In addition, there are several different extended ASCII character sets, which add other Western European characters, symbols, and line/box-drawing characters, for a total of 256 characters. Basic ASCII needs only 7 bits per character and the extended sets need 8 bits, which suited the computers of the time, with their limited memory and storage capacity.

Eventually, however, the need arose to display characters found in other languages as well. This led to the development of multi-byte character encodings and code pages. Characters would be represented by one or two bytes, and the meaning of those bytes depended on the current code page. The byte 0xA2 might represent one character on an English code page, but a completely different character on a Russian code page. This arrangement works if the user only uses one language and only exchanges documents with people who use the same language. Combining multiple languages in one document, or sending a document to someone who uses a different language and code page, results in unreadable text.
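
The code page ambiguity described above can be illustrated with a minimal Python sketch. The text does not name specific code pages, so this example assumes Windows-1252 as the "English" code page and Windows-1251 as the "Russian" one; both codec names are standard in Python.

```python
# Decode the same single byte under two different code pages.
raw = bytes([0xA2])

# Assumption: Windows-1252 stands in for an "English" code page.
western = raw.decode("cp1252")

# Assumption: Windows-1251 stands in for a "Russian" code page.
cyrillic = raw.decode("cp1251")

# Same byte, two different characters.
print(f"cp1252: {western!r} (U+{ord(western):04X})")
print(f"cp1251: {cyrillic!r} (U+{ord(cyrillic):04X})")
```

Under these assumptions the byte decodes to the cent sign in one case and to a Cyrillic letter in the other, which is why a document written under one code page becomes garbled when read under another.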

To solve this problem, the Unicode Consortium was founded. It worked for many years to produce a standard character set intended to cover every character needed anywhere in the world. The result is the Unicode standard, whose character set contains essentially all characters in common use. The first 128 Unicode characters are exactly the same as ASCII. For example, the letter 'g' is 0x67 and the digit '8' is 0x38. From there, we find characters from other languages and scripts, going all the way up to 0x10FFFF. The Cyrillic alphabet, used in Russian and a number of other Eastern European languages, occupies the block from 0x400 to 0x4FF. Most commonly used characters from East Asian languages sit higher in the range, for example in the CJK Unified Ideographs block starting at 0x4E00. This standardized encoding system allows documents to be created using a tremendous variety of characters and read by any application that supports Unicode, no matter which language the operating system is set to.
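
The relationship between characters and their numbers can be checked with a short Python sketch. It uses only Python's built-in ord, chr, and str.encode, not the UtfString API; the helper name code_points is hypothetical and exists only for this illustration.

```python
import sys

# The first 128 Unicode code points match ASCII exactly.
assert ord("g") == 0x67
assert ord("8") == 0x38
for ch in ("g", "8"):
    # The ASCII byte value and the Unicode code point are the same number.
    assert ch.encode("ascii")[0] == ord(ch)

# The highest code point Unicode defines is 0x10FFFF.
assert sys.maxunicode == 0x10FFFF


def code_points(text):
    """Hypothetical helper: list each character's code point in U+XXXX form."""
    return [f"U+{ord(c):04X}" for c in text]


# "\u0416" is the Cyrillic capital letter Zhe, inside the Cyrillic block.
sample = code_points("g8\u0416")
print(sample)  # ['U+0067', 'U+0038', 'U+0416']
```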

At this point, we should introduce a number of concepts that are important to understanding Unicode and the UtfString library.

Since UtfString only deals with the character set defined by Unicode, the UtfString documentation uses the terms "character" and "code point" interchangeably. From UtfString's perspective, they are the same thing.

Further information about Unicode can be found at the official Unicode website.