Unicode is a text encoding standard maintained by the Unicode Consortium, designed to support the use of text in all of the world's writing systems that can be digitized.
Version 16.0 of the standard (the latest at the time of writing) defines 154,998 characters and 168 scripts.
A script in Unicode is a collection of characters from one or more writing systems or languages.
Here is a list of supported scripts.
The Unicode standard defines three encodings: UTF-8, UTF-16, and UTF-32, though other encodings exist as well.
"UTF" stands for "Unicode Transformation Format".
A code point is a numerical value that maps to a specific character. For example, in ASCII the code point 65 (in decimal) represents uppercase "A".
In Unicode, a code point can be referred to as "U+" followed by its value in hexadecimal.
Examples:
Character | Code point | Glyph |
---|---|---|
Latin A | U+0041 | A |
Latin sharp S | U+00DF | ß |
Han for "East" | U+6771 | 東 |
Ampersand | U+0026 | & |
Inverted exclamation mark | U+00A1 | ¡ |
Section sign | U+00A7 | § |
Examples of Unicode code points in strings:
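A minimal Python sketch, using characters from the table above:

```python
# Code points can be written with \u (4 hex digits) or \U (8 hex digits) escapes.
s = "\u0041\u00DF\u6771"
print(s)              # Aß東

# ord() gives a character's code point; chr() does the reverse.
print(hex(ord("東")))  # 0x6771
print(chr(0x00A7))     # §
```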
A code unit is the minimum bit combination that can represent a character in a character encoding. For example, common code units include 7-bit, 8-bit, 16-bit, and 32-bit units.
In some encodings, such as UTF-8, code points whose values do not fit in a single code unit are encoded using multiple code units; such an encoding is referred to as a variable-length encoding.
A code space is the range of numerical values available for encoding characters. The Unicode code space ranges from U+0000 to U+10FFFF (0 to 1,114,111).
The Unicode codespace is divided into 17 planes, numbered 0 to 16.
A Unicode plane is a contiguous group of 65,536 (2¹⁶) code points.
Characters in the range U+0000 to U+FFFF are in plane 0, called the Basic Multilingual Plane (BMP). This plane contains the most commonly-used characters.
Characters in the range U+10000 to U+10FFFF (in the other planes) are called supplementary characters.
All code points in the BMP require only a single code unit in the UTF-16 encoding and can be encoded in one, two, or three bytes in UTF-8.
Code points in planes 1 through 16 (the supplementary planes) are encoded as pairs of code units in UTF-16 and encoded in four bytes in UTF-8.
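These sizes can be checked with Python's built-in codecs (the sample characters are arbitrary):

```python
bmp_char = "東"    # U+6771, in the Basic Multilingual Plane
supp_char = "𝄞"   # U+1D11E (musical symbol G clef), in a supplementary plane

# UTF-16 code units are 2 bytes each.
print(len(bmp_char.encode("utf-16-le")) // 2)   # 1 code unit
print(len(supp_char.encode("utf-16-le")) // 2)  # 2 code units (a surrogate pair)

# UTF-8 byte counts.
print(len(bmp_char.encode("utf-8")))   # 3 bytes
print(len(supp_char.encode("utf-8")))  # 4 bytes
```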
Within each plane, characters are allocated within named blocks of related characters.
A Unicode block is one of several contiguous ranges of code points of the Unicode character set.
Each block is generally (but not always) meant to supply glyphs used by one or more specific languages, or in some general application area such as mathematics.
Here is a list of Unicode blocks.
Transcoding is the process of converting data from one encoding to another, such as converting UTF-8 to UTF-16.
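In Python, for example, transcoding can be sketched as a decode followed by a re-encode:

```python
utf8_data = "Aß東".encode("utf-8")

# Transcode UTF-8 -> UTF-16 by decoding to an abstract string, then re-encoding.
utf16_data = utf8_data.decode("utf-8").encode("utf-16-le")
print(utf16_data.hex(" "))
```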
A grapheme is the smallest functional unit of a writing system.
A digraph is a pair of letters that together represent a single new sound, such as 'ch', 'sh', or 'th'.
In writing and typography, a ligature occurs where two or more graphemes or letters are joined to form a single glyph.
A Private Use Area (PUA) in Unicode is a range of code points that will not be assigned to characters by the standard.
These code points are left undefined so third parties can use them to encode their own characters.
There are three PUA blocks:
Specials is a short Unicode block that exists at the end of the BMP, ranging from U+FFF0 to U+FFFF.
It includes the characters:
U+FFFE and U+FFFF are noncharacters, meaning they are reserved but do not cause ill-formed Unicode text.
Variation Selectors is a Unicode block containing 16 variation selectors used to specify a glyph variant for a preceding character.
They are currently used to specify standardized variation sequences for mathematical symbols, emoji symbols, and other characters.
At present only standardized variation sequences with VS1-VS4, VS7, VS15 and VS16 have been defined;
VS15 and VS16 are reserved to request that a character should be displayed as text or as an emoji respectively.
These combining characters are named variation selector-1 (for U+FE00) through to variation selector-16 (U+FE0F),
and are abbreviated VS1-VS16.
As of Unicode 13.0:
A table of variation selectors:
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
U+FE0x | VS 1 | VS 2 | VS 3 | VS 4 | VS 5 | VS 6 | VS 7 | VS 8 | VS 9 | VS 10 | VS 11 | VS 12 | VS 13 | VS 14 | VS 15 | VS 16 |
Variation Selectors Supplement is a Unicode block containing additional variation selectors beyond those in the Variation Selectors block.
These combining characters are named variation selector-17 (for U+E0100) through to variation selector-256 (U+E01EF), abbreviated VS17-VS256.
As of 12 December 2017, VS17 (U+E0100) to VS48 (U+E011F) are used in ideographic variation sequences
in the Unicode Ideographic Variation Database (IVD).
These selectors are known as Ideographic Variation Selectors (IVS).
They are not listed in the list of standardized variation sequences; instead, they are listed in the Ideographic Variation Database.
UTF-32 stands for "Unicode Transformation Format - 32-bit".
It is a fixed-length encoding: a single UTF-32 code unit represents exactly one Unicode code point.
This means that UTF-32 encodes every code point in one 32-bit code unit.
The most common use of UTF-32 is in internal APIs
where the data is a single code point that is directly mapped to a certain glyph, and not a string of characters.
Since the highest code point is U+10FFFF (a 21-bit value), a UTF-32 code unit contains 11 bits that are always zero. Non-Unicode information is often stored in these "unused" bits.
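A small sketch of the fixed width in Python (the utf-32-le codec is used because it does not prepend a BOM):

```python
# Every code point occupies exactly 4 bytes, whatever its value.
for ch in ("A", "東", "😐"):
    raw = ch.encode("utf-32-le")
    print(ch, len(raw), raw.hex())
```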
UTF-16 stands for "Unicode Transformation Format - 16-bit".
It is capable of encoding all valid Unicode code points using a variable-length encoding of one or two 16-bit code units.
All code points in the BMP (with values less than 2¹⁶) can be encoded using one code unit that is equal to the numerical value of the code point.
Code points from other planes are encoded as two 16-bit code units called a surrogate pair.
The first code unit is a high surrogate and the second is a low surrogate
(Also known as "leading" and "trailing" surrogates)
Code point - 0x10000 | yyyyyyyyyyxxxxxxxxxx |
---|---|
High surrogate | 110110yyyyyyyyyy |
Low surrogate | 110111xxxxxxxxxx |
The way to convert code points to UTF-16 will be explained later.
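As a preview, the bit layout in the table above can be sketched as a toy Python function (not a full UTF-16 encoder):

```python
def to_surrogate_pair(cp: int):
    # Assumes a supplementary code point (U+10000..U+10FFFF).
    v = cp - 0x10000              # 20-bit value yyyyyyyyyyxxxxxxxxxx
    high = 0xD800 | (v >> 10)     # 110110yyyyyyyyyy
    low = 0xDC00 | (v & 0x3FF)    # 110111xxxxxxxxxx
    return high, low

print([hex(u) for u in to_surrogate_pair(0x1F610)])  # ['0xd83d', '0xde10']
```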
To make the detection of surrogate pairs easy, the Unicode standard has reserved the range U+D800 to U+DFFF for the use of UTF-16.
Code points with values in this range are called surrogate code points.
The official Unicode standard says that no UTF form, including UTF-16, can encode the surrogate code points.
Since they will never be assigned to a character, there should be no reason to encode them.
However, Windows allows unpaired surrogates (a high surrogate code point not followed by a low one, or a low one not preceded by a high one)
in file paths and other places, which generally means that software has to support them despite their being excluded from the Unicode standard.
Since most communication and storage protocols are defined for bytes, and each UTF-16 code unit takes two bytes,
the order of bytes may depend on the endianness (byte order) of the computer architecture.
UTF-16 allows a byte order mark (BOM), a code point with the value U+FEFF, to precede the first actual coded value.
(U+FEFF is the invisible zero-width non-breaking space ZWNBSP character).
If the endianness of the decoder matches that of the encoder, the decoder detects the 0xFEFF value (bytes [FE, FF]),
but an opposite-endian decoder interprets the BOM as the noncharacter value 0xFFFE (bytes [FF, FE]), which is reserved for this purpose.
This incorrect result provides a hint to perform byte-swapping for the remaining values.
If the BOM is missing, RFC 2781 recommends that big-endian (BE) encoding be assumed.
In practice, due to Windows using little-endian (LE) order by default, many applications assume little-endian encoding.
The standard also allows the byte order to be stated explicitly by specifying UTF-16BE or UTF-16LE as the encoding type.
When the byte order is specified explicitly this way, a BOM is specifically not supposed to be prepended to the text,
and a U+FEFF at the beginning should be handled as a ZWNBSP character.
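A sketch of BOM-based detection (detect_bom is a hypothetical helper, not a library function):

```python
def detect_bom(data: bytes) -> str:
    # Look at the first two bytes; default to big-endian per RFC 2781.
    if data[:2] == b"\xfe\xff":
        return "utf-16-be"
    if data[:2] == b"\xff\xfe":
        return "utf-16-le"
    return "utf-16-be"

payload = "hi".encode("utf-16-le")
print(detect_bom(b"\xff\xfe" + payload))  # utf-16-le
```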
UTF-8 stands for "Unicode Transformation Format - 8-bit".
It is capable of encoding all valid Unicode code points using a variable-length encoding of one to four 8-bit code units.
Code points with lower numerical values (which tend to occur more frequently) are encoded using fewer bytes.
UTF-8 is backward compatible with ASCII: the first 128 characters of Unicode are encoded with the same binary values as ASCII.
However, it is not backward compatible with extended ASCII (values ranging from 128 to 255), and applications must have a dedicated conversion algorithm to convert extended ASCII to UTF-8 and vice versa.
UTF-8 encodes code points in one to four bytes depending on the value of the code point.
If you visualize a code point as (U+uvwxyz), UTF-8 bytes are arranged as follows:
Code point range | First byte | Second byte | Third byte | Fourth byte |
---|---|---|---|---|
U+0000 to U+007F | 0yyyzzzz₂ | |||
U+0080 to U+07FF | 110xxxyy₂ | 10yyzzzz₂ | ||
U+0800 to U+FFFF | 1110wwww₂ | 10xxxxyy₂ | 10yyzzzz₂ | |
U+010000 to U+10FFFF | 11110uvv₂ | 10vvwwww₂ | 10xxxxyy₂ | 10yyzzzz₂ |
As you can see, bytes beginning with the bits 10 are continuation bytes: units in the middle or at the end of a multi-byte sequence.
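The table can be sketched as a toy encoder in Python (real code should simply use str.encode("utf-8")):

```python
def utf8_encode(cp: int) -> bytes:
    # Follows the byte layouts in the table above.
    if cp <= 0x7F:
        return bytes([cp])
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

print(utf8_encode(0x6771) == "東".encode("utf-8"))  # True
```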
And:
Overlong encodings:
An overlong encoding is an encoding which uses more bytes than necessary.
They are a security risk as they allow the same code point to be encoded in multiple ways.
Overlong encodings should therefore be considered an error and never decoded.
Error handling:
Not all sequences of bytes are valid UTF-8. A UTF-8 decoder should be prepared for:
The standard recommends replacing each ill-formed code unit sequence with the replacement character "�" (U+FFFD) and continuing to decode.
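Python's decoder, for instance, implements this recommendation when errors="replace" is requested:

```python
bad = b"A\xc0\xafB"   # 'A', an invalid (overlong) sequence, 'B'
# Invalid bytes are replaced with U+FFFD; decoding continues.
print(bad.decode("utf-8", errors="replace"))
```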
Since RFC 3629 (November 2003), the high and low surrogates used by UTF-16 (values ranging from U+D800 to U+DFFF) are not legal Unicode values,
and their UTF-8 encodings must be treated as an invalid byte sequence.
These encodings (in UTF-8) all start with 0xED followed by 0xA0 or higher.
This rule is often ignored as surrogates are allowed in Windows filenames and this means there must be a way to store them in a string.
UTF-8 that allows these surrogate halves has been informally called WTF-8.
If the Unicode byte-order mark U+FEFF is at the start of a UTF-8 file, the first three bytes will be 0xEF, 0xBB, 0xBF.
The Unicode Standard neither requires nor recommends the use of the BOM for UTF-8,
but warns that it may be encountered at the start of a file transcoded from another encoding.
While UTF-8 is backward compatible with ASCII, this is not true when the Unicode Standard's recommendations are ignored and a BOM is added.
A BOM can confuse software that isn't prepared for it but can otherwise accept UTF-8,
e.g. programming languages that permit non-ASCII bytes in string literals but not at the start of the file.
Nevertheless, there is software that always inserts a BOM when writing UTF-8,
and refuses to correctly interpret UTF-8 unless the first character is a BOM (or the file only contains ASCII).
| UTF-8 | UTF-16 | UTF-32 |
---|---|---|---|
Common use | General text encoding | General text encoding | Mapping characters to glyphs |
Size of code unit | 1 byte | 2 bytes | 4 bytes |
Bytes used per character | 1-4 bytes | 2 or 4 bytes | 4 bytes |
ASCII file compatibility | Compatible | Not compatible | Not compatible |
The byte order mark (BOM) is simply the zero-width no-break space (ZWNBSP) (U+FEFF) when it appears at the beginning of a UTF-16 encoded text file.
The BOM is inserted at the beginning of a UTF-16 encoded text file to signal the endianness used in the encoding.
If the UTF-16 decoder finds that the text file begins with the bytes [FE, FF], this suggests that the file is encoded in the same endianness as the decoder.
If the bytes are in the opposite order, [FF, FE], this suggests that the file is encoded in the opposite endianness.
The name ZWNBSP should be used if U+FEFF appears in the middle of a data stream:
Unicode says it should then be interpreted as a normal code point (a zero-width no-break space), not as a BOM.
Since Unicode 3.2, this usage has been deprecated in favor of the word joiner character (U+2060).
UTF-32 also uses the BOM character, padded with 16 zero bits.
Programmers using the BOM to identify the encoding will have to decide whether it is UTF-32 or UTF-16 by checking whether null bytes follow the BOM.
The word joiner (WJ) (U+2060) is a Unicode format character which is used to indicate that line breaking should not occur at its position.
It is not a visible character and it does not have a width.
The word joiner replaces the zero-width no-break space (ZWNBSP, U+FEFF) in its role as a no-break space of zero width.
The ZWNBSP is used as the byte order mark (BOM) at the start of a file.
However, if encountered elsewhere, it should, according to Unicode, be treated as a word joiner (a no-break space of zero width).
The use of U+FEFF for this purpose is deprecated as of Unicode 3.2, with the word joiner strongly preferred.
A non-breaking space (also called NBSP, required space, hard space, or fixed space) (U+00A0),
is a space character that prevents an automatic line break at its position.
It also prevents consecutive whitespace characters from collapsing into a single space.
Non-breaking space characters with other widths also exist,
such as the narrow no-break space (NNBSP) (U+202F),
figure space (also called numeric space) (U+2007),
and the word joiner (WJ) (U+2060).
The zero-width space (ZWSP) (U+200B) is a non-printing character used to indicate where the word boundaries are,
without actually displaying a visible space in text.
This enables text-processing systems for scripts without visible spacing
to recognize where word boundaries are for the purpose of handling line breaks appropriately.
The ZWSP is located in the General Punctuation block (ranging from U+2000 to U+206F).
ICANN rules prohibit domain names from containing non-displayed characters, including the zero-width space,
and most browsers prohibit their use within domain names because they can be used to create a homograph attack,
where a malicious URL is visually indistinguishable from a legitimate one.
The zero-width joiner (ZWJ) (U+200D) is a non-printing character used in text-processing systems
in which the shape or positioning of a grapheme depends on its relation to other graphemes (complex scripts),
such as the Arabic script or any Indic script. Sometimes the Roman script is to be counted as complex, e.g. when using a Fraktur typeface. When placed between two characters that would otherwise not be connected, a ZWJ causes them to be printed in their connected forms.
Sometimes, when a ZWJ is placed between two emoji characters, it can result in a single emoji being shown,
such as the family emoji, made up of two adult emoji and one or two child emoji.
Examples:
Character sequence | Appearance | Description |
---|---|---|
[ra র] [virāma ্ ] [ya য] | র্য | |
[ra র] [ZWJ] [virāma ্ ] [ya য] | র্য | |
[Man] [ZWJ] [Woman] [ZWJ] [Boy] | 👨👩👦 | Family: Man, Woman, Boy |
[Black flag] [ZWJ] [Skull and crossbones] | 🏴☠️ | Pirate Flag |
[Runner] [Emoji modifier fitzpatrick type-1-2] [ZWJ] [Female sign] | 🏃🏻♀️ | Woman Running: Light Skin Tone |
[Runner] [Emoji modifier fitzpatrick type-6] [ZWJ] [Female sign] | 🏃🏿♀️ | Woman Running: Dark Skin Tone |
[Man] [ZWJ] [Red hair] | 👨🦰 | Man: Red Hair |
[Person] [ZWJ] [Sheaf of rice] | 👨🌾 | Farmer |
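The family row above can be reproduced in Python; whether a single glyph is shown depends on the font and renderer:

```python
ZWJ = "\u200D"
family = "\U0001F468" + ZWJ + "\U0001F469" + ZWJ + "\U0001F466"  # man, woman, boy
print(family)       # 👨‍👩‍👦 on supporting systems
print(len(family))  # 5: still five code points under the hood
```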
The zero-width non-joiner (ZWNJ) (U+200C) is a non-printing character used in text-processing systems that make use of ligatures.
When placed between two characters that would otherwise be connected into a ligature,
a ZWNJ causes them to be printed in their final and initial forms, respectively.
This is also an effect of a space character, but a ZWNJ is used when it is desirable to keep the characters closer together.
In certain languages, the ZWNJ is necessary for specifying the correct typographic form of a character sequence. Examples:
Display (with ZWNJ) | Code (with ZWNJ) | Display (incorrect) | Code (incorrect) | Meaning |
---|---|---|---|---|
میخواهم | می[ZWNJ]خواهم | میخواهم | میخواهم | Persian: "I want to" |
ساءينس | ساءين[ZWNJ]س | ساءينس | ساءينس | Malay: "Science" |
Use of ZWNJ to display alternative forms:
In Indic scripts, insertion of a ZWNJ after a consonant either with a halant or before a dependent vowel
prevents the characters from being joined properly.
Examples:
Script | First character | Second character | combined (no ZWNJ) | combined (ZWNJ between characters) |
---|---|---|---|---|
Devanagari | क् | ष | क्ष | क्ष |
Kannada | ನ್ | ನ | ನ್ನ | ನ್ನ |
Combining characters are characters that are intended to modify other characters.
The most common combining characters in the Latin script are the combining diacritical marks (including combining accents).
For example: Cyrillic "U" combined with a breve gives "ў".
Unicode also contains many precomposed characters.
This leads to a requirement to perform Unicode normalization before comparing two Unicode strings
and to carefully design encoding converters
to correctly map all of the valid ways to represent a character in Unicode to a legacy encoding to avoid data loss.
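A sketch of that comparison problem using Python's unicodedata module:

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e + combining acute accent

print(precomposed == decomposed)   # False: different code point sequences
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
```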
The main block of combining diacritics for European languages and the International Phonetic Alphabet is U+0300 to U+036F.
Combining diacritical marks are also present in many other blocks of Unicode characters.
Unicode diacritics are always added after the main character.
The following blocks are dedicated specifically to combining characters:
Combining characters are not limited to these blocks; for instance, the combining dakuten (U+3099) and combining handakuten (U+309A) are in the Hiragana block.
Zalgo text:
Combining characters have been used to create Zalgo text,
which is text that appears "corrupted" or "creepy" due to an overuse of combining characters.
This causes the text to extend vertically, overlapping other text.
This is mostly used in horror contexts on the internet.
It is typically very challenging for most software to render, so the combining marks are often reduced or completely removed.
Grapheme clusters:
A grapheme cluster is a sequence of one or more Unicode code points that must be treated as a single, unbreakable unit.
Combining character sequences:
A combining character sequence is a character sequence consisting of a base character followed by one or more combining characters.
Character sequences:
The Unicode standard specifies notational conventions for referring to sequences of characters (or code points) treated as a single unit.
An example of a combining character sequence:
[U+0061, U+0302, U+0301] |
[Latin small letter A, Combining circumflex accent, Combining acute accent] |
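That sequence can be built directly in Python; the three code points render as one grapheme:

```python
seq = "\u0061\u0302\u0301"   # a + combining circumflex + combining acute
print(seq)        # renders as a single accented "a"
print(len(seq))   # 3 code points
```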
Here is a list of named character sequences.
The combining grapheme joiner (CGJ) (U+034F) is a character that has no visible glyph.
Its name is a misnomer and does not describe its function: the character does not join graphemes.
Its purpose is to separate characters that should not be considered digraphs
as well as to block canonical reordering of combining marks during normalization.
For example, in a Hungarian language context,
adjoining letters c and s would normally be considered equivalent to the cs digraph.
If they are separated by the CGJ, they will be considered as two separate graphemes.
However, in contrast to the zero-width joiner and similar characters,
the CGJ does not affect whether the two letters are rendered separately or as a ligature or cursively joined,
the default behavior for this is determined by the font.
In Unicode, the dotted circle (◌) (U+25CC) is a non-significant typographic character used to illustrate the effect of a combining mark,
such as a diacritic mark.
An illustration:
Unicode 15.1 specifies a total of 3,782 emoji using 1,424 characters spread across 24 blocks.
26 of these emoji are Regional indicator symbols that combine in pairs to form flag emoji,
and 12 (#, * and 0 - 9) are base characters for keycap emoji sequences.
Code points that are considered emoji are:
Some vendors add emoji presentation to some other existing Unicode characters or make their own ZWJ sequences.
Microsoft displayed all Mahjong tiles (U+1F000–U+1F02B, not just U+1F004 🀄 MAHJONG TILE RED DRAGON)
and alternative card suits (U+2661 ♡ WHITE HEART SUIT, U+2662 ♢ WHITE DIAMOND SUIT, U+2664 ♤ WHITE SPADE SUIT, U+2667 ♧ WHITE CLUB SUIT)
as emoji. They also supported additional pencils (U+270E ✎ LOWER RIGHT PENCIL, U+2710 ✐ UPPER RIGHT PENCIL)
and a heart-shaped bullet (U+2765 ❥ ROTATED HEAVY BLACK HEART BULLET).
While only U+261D (☝) WHITE UP POINTING INDEX is officially an emoji,
Microsoft and Samsung added the other three directions as well (U+261C ☜ WHITE LEFT POINTING INDEX,
U+261E ☞ WHITE RIGHT POINTING INDEX, U+261F ☟ WHITE DOWN POINTING INDEX). Microsoft no longer supports these emoji.
Both vendors pair the standard checked ballot box emoji U+2611 ☑ BALLOT BOX WITH CHECK
with its crossed variant U+2612 ☒ BALLOT BOX WITH X,
but only Samsung also has the empty ballot box U+2610 ☐ BALLOT BOX.
The regional indicator symbols are a set of 26 alphabetic Unicode characters (A-Z)
intended to be used to encode two-letter country codes in a way that allows optional special treatment.
These were defined by October 2010 as part of the Unicode 6.0 support for emoji,
as an alternative to encoding separate characters for each country flag.
Although they can be displayed as Roman letters,
it is intended that implementations may choose to display them in other ways, such as by using national flags.
The Unicode FAQ indicates that this mechanism should be used and that symbols for national flags will not be directly encoded.
They are encoded in the range (U+1F1E6) (🇦) REGIONAL INDICATOR SYMBOL LETTER A to (U+1F1FF) (🇿) REGIONAL INDICATOR SYMBOL LETTER Z
within the Enclosed Alphanumeric Supplement block in the Supplementary Multilingual Plane.
A pair of regional indicator symbols is referred to as an emoji flag sequence
(although it represents a specific region, not a specific flag for that region).
Out of the 676 possible pairs of regional indicator symbols (26x26), only 270 are considered valid Unicode region codes.
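A minimal sketch of mapping a two-letter region code onto this range (flag is a hypothetical helper; it does not check the 270 valid codes):

```python
def flag(region: str) -> str:
    # Shift 'A'..'Z' onto U+1F1E6..U+1F1FF.
    return "".join(chr(0x1F1E6 + ord(c) - ord("A")) for c in region.upper())

print(flag("IQ"))   # 🇮🇶
```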
There are emoji sequences that are made of multiple emoji;
here is a list of these emoji sequences.
There are also emoji ZWJ sequences, made of multiple emoji joined with the ZWJ character;
they are listed here.
Here is a list of all emoji.
Emoticons:
Emoticons is a Unicode block containing emoticons and other emoji.
Most of them are intended as representations of faces,
although some of them include hand gestures or non-human characters (a horned imp, monkeys, and cartoonish cats).
Each emoticon has two variants:
If no variation selector is appended, the default is the emoji style.
Example:
Code points | Result |
---|---|
U+1F610 (Neutral face) | 😐 |
U+1F610 (Neutral face), U+FE0E (Variation selector-15) | 😐︎ |
U+1F610 (Neutral face), U+FE0F (Variation selector-16) | 😐️ |
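These rows can be reproduced in Python by appending the selectors (the rendered result depends on the platform):

```python
VS15, VS16 = "\uFE0E", "\uFE0F"
face = "\U0001F610"   # neutral face

print(face)           # default presentation
print(face + VS15)    # text presentation requested
print(face + VS16)    # emoji presentation requested
```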
Emoji modifiers:
Five symbol modifier characters were added with Unicode 8.0 to provide a range of skin tones for human emoji.
These modifiers are called EMOJI MODIFIER FITZPATRICK TYPE-1-2, -3, -4, -5, and -6 (U+1F3FB - U+1F3FF): 🏻 🏼 🏽 🏾 🏿.
They are based on the Fitzpatrick scale for classifying human skin color.
Here is an example of emoji with different FITZ emoji modifiers:
| 1F645 | 1F646 | 1F647 | 1F64B | 1F64C | 1F64D | 1F64E | 1F64F |
---|---|---|---|---|---|---|---|---|
No modifier | 🙅 | 🙆 | 🙇 | 🙋 | 🙌 | 🙍 | 🙎 | 🙏 |
FITZ-1-2 | 🙅🏻 | 🙆🏻 | 🙇🏻 | 🙋🏻 | 🙌🏻 | 🙍🏻 | 🙎🏻 | 🙏🏻 |
FITZ-3 | 🙅🏼 | 🙆🏼 | 🙇🏼 | 🙋🏼 | 🙌🏼 | 🙍🏼 | 🙎🏼 | 🙏🏼 |
FITZ-4 | 🙅🏽 | 🙆🏽 | 🙇🏽 | 🙋🏽 | 🙌🏽 | 🙍🏽 | 🙎🏽 | 🙏🏽 |
FITZ-5 | 🙅🏾 | 🙆🏾 | 🙇🏾 | 🙋🏾 | 🙌🏾 | 🙍🏾 | 🙎🏾 | 🙏🏾 |
FITZ-6 | 🙅🏿 | 🙆🏿 | 🙇🏿 | 🙋🏿 | 🙌🏿 | 🙍🏿 | 🙎🏿 | 🙏🏿 |
Some examples of emoji sequences:
Code points | Character names | Characters | Result | Emoji name |
---|---|---|---|---|
[U+0039, U+FE0F, U+20E3] | [Nine, Variation selector-16, Combining enclosing keycap] | [9, U+FE0F, ⃣] | 9️⃣ | Keycap digit nine emoji |
[U+2764, U+FE0F, U+200D, U+1FA79] | [Red heart, Variation selector-16, ZWJ, Adhesive bandage] | [❤️, U+FE0F, U+200D, 🩹] | ❤️🩹 | Mending heart emoji |
[U+1F3F4, U+200D, U+2620, U+FE0F] | [Black flag, ZWJ, Skull and crossbones, Variation selector-16] | [🏴, U+200D, ☠️, U+FE0F] | 🏴☠️ | Pirate flag |
[U+1F1EE, U+1F1F6] | [Regional indicator symbol letter I, Regional indicator symbol letter Q] | [🇮, 🇶] | 🇮🇶 | Flag: Iraq |
[U+1FAF1, U+1F3FB, U+200D, U+1FAF2, U+1F3FF] | [Rightwards hand, Emoji modifier fitzpatrick type-1-2, ZWJ, Leftwards hand, Emoji modifier fitzpatrick type-6] | [🫱, U+1F3FB, U+200D, 🫲, U+1F3FF] | 🫱🏻🫲🏿 | Handshake: light skin tone, dark skin tone |
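Two of these rows, rebuilt from their raw code points (again, rendering depends on the platform):

```python
keycap_nine = "\u0039\uFE0F\u20E3"             # 9️⃣: digit + VS16 + keycap
pirate_flag = "\U0001F3F4\u200D\u2620\uFE0F"   # 🏴‍☠️: flag + ZWJ + skull + VS16
print(keycap_nine)
print(pirate_flag)
```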
The unhyphenated term "noncharacter" refers to 66 code points (labeled <not a character>)
permanently reserved for internal use, and therefore guaranteed to never be assigned to a character.
Each of the 17 planes has its last two code points set as noncharacters. So, the noncharacters are:
U+FFFE and U+FFFF on the BMP, U+1FFFE and U+1FFFF on Plane 1, and so on, up to U+10FFFE and U+10FFFF on Plane 16,
for a total of 34 code points. In addition, there is a contiguous range of another 32 noncharacter code points in the BMP:
U+FDD0 - U+FDEF. Software implementations are free to use these code points for internal use.
One particularly useful example of a noncharacter is the code point U+FFFE.
This code point has the reverse UTF-16 byte sequence of the byte order mark (U+FEFF).
If a stream of text contains this noncharacter at the start,
this is an indication the text has been interpreted with the incorrect endianness.
Noncharacters are not illegal in interchange nor do they cause ill-formed Unicode text.
A whitespace character is a character that represents empty space when text is rendered.
A printable character results in output when rendered, but a whitespace character does not.
Instead, whitespace characters define the layout of text to a limited degree,
interrupting the normal sequence of rendering characters next to each other.
The output of subsequent characters is typically shifted to the right (or to the left for right-to-left text)
or to the start of the next line.
The term whitespace is rooted in the common practice of printing text on white paper.
Normally, a whitespace character is not rendered as white. It affects rendering, but it is not itself rendered.
The table below lists the twenty-five characters defined as whitespace ("WSpace=Y", "WS") characters in the Unicode Character Database.
Seventeen use a definition of whitespace consistent with the algorithm for bidirectional writing ("Bidirectional Character Type=WS")
and are known as "Bidi-WS" characters.
Bidirectional text is explained in the next section of this summary.
Character name | Code point | May break | In IDN | Script | Block | General category |
---|---|---|---|---|---|---|
Character tabulation | U+0009 | Yes | No | Common | Basic Latin | Other, control |
Line feed | U+000A | Is a line-break | Is a line-break | Common | Basic Latin | Other, control |
Line tabulation | U+000B | Is a line-break | Is a line-break | Common | Basic Latin | Other, control |
Form feed | U+000C | Is a line-break | Is a line-break | Common | Basic Latin | Other, control |
Carriage return | U+000D | Is a line-break | Is a line-break | Common | Basic Latin | Other, control |
Space | U+0020 | Yes | No | Common | Basic Latin | Separator, space |
Next line | U+0085 | Is a line-break | Is a line-break | Common | Latin-1 Supplement | Other, control |
No-break space | U+00A0 | No | No | Common | Latin-1 Supplement | Separator, space |
Ogham space mark | U+1680 | Yes | No | Ogham | Ogham | Separator, space |
En quad | U+2000 | Yes | No | Common | General Punctuation | Separator, space |
Em quad | U+2001 | Yes | No | Common | General Punctuation | Separator, space |
En space | U+2002 | Yes | No | Common | General Punctuation | Separator, space |
Em space | U+2003 | Yes | No | Common | General Punctuation | Separator, space |
Three-per-em space | U+2004 | Yes | No | Common | General Punctuation | Separator, space |
Four-per-em space | U+2005 | Yes | No | Common | General Punctuation | Separator, space |
Six-per-em space | U+2006 | Yes | No | Common | General Punctuation | Separator, space |
Figure space | U+2007 | No | No | Common | General Punctuation | Separator, space |
Punctuation space | U+2008 | Yes | No | Common | General Punctuation | Separator, space |
Thin space | U+2009 | Yes | No | Common | General Punctuation | Separator, space |
Hair space | U+200A | Yes | No | Common | General Punctuation | Separator, space |
Line separator | U+2028 | Is a line-break | Is a line-break | Common | General Punctuation | Separator, paragraph |
Paragraph separator | U+2029 | Is a line-break | Is a line-break | Common | General Punctuation | Separator, paragraph |
Narrow no-break space | U+202F | No | No | Common | General Punctuation | Separator, space |
Medium mathematical space | U+205F | Yes | No | Common | General Punctuation | Separator, space |
Ideographic space | U+3000 | Yes | No | Common | CJK Symbols and Punctuation | Separator, space |
Related characters with property White_Space=no:
Character name | Code point | May break | In IDN | Script | Block | General category |
---|---|---|---|---|---|---|
Mongolian vowel separator | U+180E | Yes | No | Mongolian | Mongolian | Other, Format |
Zero width space | U+200B | Yes | No | ? | General Punctuation | Other, Format |
Zero-width non-joiner | U+200C | Yes | Context-dependent | ? | General Punctuation | Other, Format |
Zero-width joiner | U+200D | Yes | Context-dependent | ? | General Punctuation | Other, Format |
Word joiner | U+2060 | No | No | ? | General Punctuation | Other, Format |
Zero width non-breaking space | U+FEFF | No | No | ? | Arabic Presentation Forms-B | Other, Format |
Substitute images:
Unicode also provides some visible characters that can be used to represent various whitespace characters,
in contexts where a visible symbol should be displayed:
Character name | Code point | Displayed character | Block | Description |
---|---|---|---|---|
Middle dot | U+00B7 | · | Latin-1 Supplement | Interpunct. |
Downwards two headed arrow | U+21A1 | ↡ | Arrows | ECMA-17 / ISO 2047 symbol for form feed (page break). |
Identical to | U+2261 | ≡ | Mathematical Operators | Amongst other uses, is the ECMA-17 / ISO 2047 symbol for line feed. |
Shouldered open box | U+237D | ⍽ | Miscellaneous Technical | Used to indicate a NBSP. |
Return symbol | U+23CE | ⏎ | Miscellaneous Technical | The symbol for a return key, which enters a line break. |
Symbol for horizontal tabulation | U+2409 | ␉ | Control Pictures | Substitutes for a tab character. |
Symbol for line feed | U+240A | ␊ | Control Pictures | Substitutes for a line feed. |
Symbol for vertical tabulation | U+240B | ␋ | Control Pictures | Substitutes for a vertical tab (line tab). |
Symbol for form feed | U+240C | ␌ | Control Pictures | Substitutes for a form feed (page break). |
Symbol for carriage return | U+240D | ␍ | Control Pictures | Substitutes for a carriage return. |
Symbol for space | U+2420 | ␠ | Control Pictures | Substitutes for an ASCII space. |
Blank symbol | U+2422 | ␢ | Control Pictures | also known as "substitute blank", used in BCDIC, EBCDIC, ASCII-1963 etc. as a symbol for the word separator. |
Open box | U+2423 | ␣ | Control Pictures | Used in block letter handwriting at least since the 1980s when it is necessary to explicitly indicate the number of space characters (e.g. when programming with pen and paper). Used in a textbook published by Springer-Verlag on Modula-2, a programming language where space codes require explicit indication. Also used in the keypad of the Texas Instruments' TI-8x series of graphing calculators. |
Symbol for newline | U+2424 | ␤ | Control Pictures | Substitutes for a line break. |
White up-pointing triangle | U+25B3 | △ | Geometric Shapes | Amongst other uses, it's the ECMA-17 / ISO 2047 symbol for the ASCII space. |
Logical Or with middle stem | U+2A5B | ⩛ | Supplemental Mathematical Operators | Amongst other uses, it's the ECMA-17 / ISO 2047 symbol for vertical tab (line tab). |
Smaller than | U+2AAA | ⪪ | Supplemental Mathematical Operators | Amongst other uses, it's the ECMA-17 / ISO 2047 symbol for carriage return. |
Larger than | U+2AAB | ⪫ | Supplemental Mathematical Operators | Amongst other uses, it's the ECMA-17 / ISO 2047 symbol for the tab character. |
Ideographic Telegraph Line Feed Separator Symbol | U+3037 | 〷 | CJK Symbols and Punctuation | Graphic used for code 9999 in Chinese telegraph code, representing a line feed. |
The Unicode Standard assigns various properties to each Unicode code point.
These properties can be used to handle code points in processes, like in line-breaking, script direction or applying controls.
Some character properties are also defined for code points that have no character assigned,
and code points that are labeled "<not a character>".
Character properties are a broad subject that will not be covered in this summary.
If you would like to read about character properties in Unicode,
the Wikipedia article "Unicode character property" covers them in detail,
including information about whitespace characters, blocks, and other elements.
A bidirectional text contains two text directionalities, right-to-left (RTL) and left-to-right (LTR).
In some right-to-left scripts such as the Persian script and Arabic,
mathematical expressions, numeric dates, numbers, and left-to-right text are embedded from left to right.
A text is also bidirectional if a right-to-left script is embedded in a left-to-right text.
The Unicode standard calls for characters to be ordered logically, i.e. in the sequence they are intended to be interpreted.
In order to offer bidi support,
Unicode prescribes an algorithm for how to convert the logical sequence of characters into the correct visual presentation.
For this purpose, the Unicode encoding standard divides all of its characters into one of four types:
"strong", "weak", "neutral", and "explicit formatting".
Strong characters are those with a definite direction.
Examples of this type of character include most alphabetic characters, syllabic characters, Han ideographs,
non-European or non-Arabic digits, and punctuation characters that are specific to only those scripts.
Weak characters are those with vague direction.
Examples of this type of character include European digits, Eastern Arabic-Indic digits, arithmetic symbols, and currency symbols.
Neutral characters have direction indeterminable without context. Examples include paragraph separators,
tabs, and most other whitespace characters.
Punctuation symbols that are common to many scripts,
such as the colon, comma, full-stop, and the no-break-space also fall within this category.
Explicit formatting characters, also referred to as "directional formatting characters",
are special Unicode sequences that direct the algorithm to modify its default behavior.
These characters are subdivided into "marks", "embeddings", "isolates", and "overrides".
Their effects continue until the occurrence of either a paragraph separator, or a "pop" character.
Subdivisions of the "Explicit formatting" character type:
If a "weak" character is followed by another "weak" character, the algorithm will look at the first neighbouring "strong" character.
Sometimes this leads to unintentional display errors. These errors are corrected or prevented with "pseudo-strong" characters.
Such Unicode control characters are called marks. The mark (U+200E) LEFT-TO-RIGHT MARK (LRM) or (U+200F) RIGHT-TO-LEFT MARK (RLM)
is to be inserted into a location to make an enclosed weak character inherit its writing direction.
For example, to correctly display the TRADE MARK SIGN character (™) (U+2122) for an English name brand (LTR)
in an Arabic (RTL) passage, an LRM mark is inserted after the trademark symbol if the symbol is not followed by LTR text.
If the LRM mark is not added, the weak character ™ will be neighbored by a strong LTR character and a strong RTL character.
Hence, in an RTL context, it will be considered to be RTL, and displayed in an incorrect order.
The "embedding" directional formatting characters are the classical Unicode method of explicit formatting,
and as of Unicode 6.3, are being discouraged in favor of "isolates".
An "embedding" signals that a piece of text is to be treated as directionally distinct.
The text within the scope of the embedding formatting characters is not independent of the surrounding text.
Also, characters within an embedding can affect the ordering of characters outside.
Unicode 6.3 recognized that directional embeddings usually have too strong an effect on their surroundings
and are thus unnecessarily difficult to use.
The "isolate" directional formatting characters signal that a piece of text is to be treated
as directionally isolated from its surroundings.
As of Unicode 6.3, these are the formatting characters that are being encouraged in new documents.
These formatting characters were introduced after it became apparent
that "embeddings" usually have too strong an effect on their surroundings
and are thus unnecessarily difficult to use.
Unlike the legacy 'embedding' directional formatting characters,
'isolate' characters have no effect on the ordering of the text outside their scope.
Isolates can be nested, and may be placed within embeddings and overrides.
The "override" directional formatting characters allow for special cases, such as for part numbers
(e.g. to force a part number made of mixed English digits and Hebrew letters to be written from right to left),
and are recommended to be avoided wherever possible.
Just like the other directional formatting characters, "overrides" can be nested one inside another, and in embeddings and isolates.
Pops:
The "pop" directional formatting characters terminate the scope of the most recent "embedding", "override", or "isolate".
In the algorithm, each sequence of concatenated strong characters is called a "run".
A "weak" character that is located between two "strong" characters with the same orientation will inherit their orientation.
A "weak" character that is located between two "strong" characters with a different writing direction
will inherit the main context's writing direction (in an LTR document the character will become LTR, in an RTL document, it will become RTL).
A table of Bidi character types:
Type | Type description | Strength | Directionality | General scope | Bidi control character |
---|---|---|---|---|---|
L | Left-to-Right | Strong | L-to-R | Most alphabetic and syllabic characters, Chinese characters, non-European or non-Arabic digits, LRM character, ... | U+200E LEFT-TO-RIGHT MARK (LRM) |
R | Right-to-Left | Strong | R-to-L | Adlam, Garay, Hebrew, Mandaic, Mende Kikakui, N'Ko, Samaritan, ancient scripts like Kharoshthi and Nabataean, RLM character, ... | U+200F RIGHT-TO-LEFT MARK (RLM) |
AL | Arabic Letter | Strong | R-to-L | Arabic, Hanifi Rohingya, Sogdian, Syriac, and Thaana alphabets, and most punctuation specific to those scripts, ALM character, ... | U+061C ARABIC LETTER MARK (ALM) |
EN | European Number | Weak | | European digits, Eastern Arabic-Indic digits, Coptic epact numbers, ... | |
ES | European Separator | Weak | | Plus sign, minus sign, ... | |
ET | European Number Terminator | Weak | | Degree sign, currency symbols, ... | |
AN | Arabic Number | Weak | | Arabic-Indic digits, Arabic decimal and thousands separators, Rumi digits, Hanifi Rohingya digits, ... | |
CS | Common Number Separator | Weak | | Colon, comma, full stop, no-break space, ... | |
NSM | Nonspacing Mark | Weak | | Characters in the general categories: Mark (M), Nonspacing mark (Mn), Enclosing mark (Me) | |
BN | Boundary Neutral | Weak | | Default ignorables, noncharacters, control characters other than those explicitly given other types | |
B | Paragraph Separator | Neutral | | Paragraph separator, appropriate Newline Functions, higher-level protocol paragraph determination | |
S | Segment Separator | Neutral | | Tabs | |
WS | Whitespace | Neutral | | Space, figure space, line separator, form feed, General Punctuation block spaces (smaller set than the whitespace list) | |
ON | Other Neutrals | Neutral | | All other characters, including object replacement character | |
LRE | Left-to-Right Embedding | Explicit | L-to-R | LRE character only | U+202A LEFT-TO-RIGHT EMBEDDING (LRE) |
LRO | Left-to-Right Override | Explicit | L-to-R | LRO character only | U+202D LEFT-TO-RIGHT OVERRIDE (LRO) |
RLE | Right-to-Left Embedding | Explicit | R-to-L | RLE character only | U+202B RIGHT-TO-LEFT EMBEDDING (RLE) |
RLO | Right-to-Left Override | Explicit | R-to-L | RLO character only | U+202E RIGHT-TO-LEFT OVERRIDE (RLO) |
PDF | Pop Directional Format | Explicit | | PDF character only | U+202C POP DIRECTIONAL FORMATTING (PDF) |
LRI | Left-to-Right Isolate | Explicit | L-to-R | LRI character only | U+2066 LEFT-TO-RIGHT ISOLATE (LRI) |
RLI | Right-to-Left Isolate | Explicit | R-to-L | RLI character only | U+2067 RIGHT-TO-LEFT ISOLATE (RLI) |
FSI | First Strong Isolate | Explicit | | FSI character only | U+2068 FIRST STRONG ISOLATE (FSI) |
PDI | Pop Directional Isolate | Explicit | | PDI character only | U+2069 POP DIRECTIONAL ISOLATE (PDI) |
The Unicode collation algorithm (UCA) is an algorithm defined in Unicode Technical Report #10,
which is a customizable method to produce binary strings from strings representing
text in any writing system and language that can be represented with Unicode.
These binary strings can then be efficiently compared byte by byte in order to collate or sort them according to the rules of the language,
with options for ignoring case, accents, etc.
Unicode Technical Report #10 also specifies the Default Unicode Collation Element Table (DUCET).
This data file specifies a default collation ordering. The DUCET is customizable for different languages,
and some such customizations can be found in the Unicode Common Locale Data Repository (CLDR).
An open source implementation of UCA is included with the International Components for Unicode, ICU.
ICU supports tailoring, and the collation tailorings from CLDR are included in ICU.
Unicode equivalence is the specification by the Unicode standard that some sequences of code points represent essentially the same character.
This feature was introduced in the standard to allow compatibility with pre-existing standard character sets,
which often included similar or identical characters.
Unicode provides two such notions, canonical equivalence and compatibility.
Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed.
For example, the code point U+006E (n) LATIN SMALL LETTER N followed by U+0303 (̃ ) COMBINING TILDE
is defined by Unicode to be canonically equivalent to the single code point U+00F1 (ñ) LATIN SMALL LETTER N WITH TILDE of the Spanish alphabet.
Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications
such as alphabetizing names or searching, and may be substituted for each other.
Similarly, each Hangul syllable block that is encoded as a single character may be equivalently encoded as a
combination of a leading conjoining jamo, a vowel conjoining jamo, and, if appropriate, a trailing conjoining jamo.
Sequences that are defined as compatible are assumed to have possibly distinct appearances, but the same meaning in some contexts.
Thus, for example, the code point U+FB00 (the typographic ligature "ff")
is defined to be compatible but not canonically equivalent to the sequence U+0066 U+0066 (two Latin "f" letters).
Compatible sequences may be treated the same way in some applications (such as sorting and indexing), but not in others;
and may be substituted for each other in some situations, but not in others.
Sequences that are canonically equivalent are also compatible, but the opposite is not necessarily true.
The standard also defines a text normalization procedure, called Unicode normalization, that replaces equivalent sequences of characters
so that any two texts that are equivalent will be reduced to the same sequence of code points,
called the normalization form or normal form of the original text.
For each of the two equivalence notions, Unicode defines two normal forms,
one fully composed (where multiple code points are replaced by single points whenever possible),
and one fully decomposed (where single points are split into multiple ones).
Character duplication:
For compatibility or other reasons, Unicode sometimes assigns two different code points to entities
that are essentially the same character.
For example, the letter "A with a ring diacritic above" is encoded as U+00C5 (Å) LATIN CAPITAL LETTER A WITH RING ABOVE
(a letter of the alphabet in Swedish and several other languages) or as U+212B Å ANGSTROM SIGN.
Yet the symbol for angstrom is defined to be that Swedish letter, and most other symbols that are letters
(such as "V" for volt) do not have a separate code point for each usage.
In general, the code points of truly identical characters are defined to be canonically equivalent.
Combining and precomposed characters:
For consistency with some older standards, Unicode provides single code points for many characters that could be viewed
as modified forms of other characters (such as U+00F1 for "ñ" or U+00C5 for "Å")
or as combinations of two or more characters (such as U+FB00 for the ligature "ff" or U+0132 for the Dutch letter "IJ").
For consistency with other standards, and for greater flexibility, Unicode also provides codes for many elements
that are not used on their own, but are meant instead to modify or combine with a preceding base character.
Examples of these combining characters are the combining tilde and the Japanese diacritic dakuten ("◌゛", U+3099).
In the context of Unicode, character composition is the process of replacing the code points of a base letter
followed by one or more combining characters into a single precomposed character; and character decomposition is the opposite process.
In general, precomposed characters are defined to be canonically equivalent to the sequence of
their base letter and subsequent combining diacritic marks, in whatever order these may occur.
An example: "Amélie" with its two canonically equivalent Unicode forms (NFC and NFD):
NFC characters | A | m | é | l | i | e |
---|---|---|---|---|---|---|
NFC code points | U+0041 | U+006D | U+00E9 | U+006C | U+0069 | U+0065 |
NFD code points | U+0041 | U+006D | [U+0065, U+0301] | U+006C | U+0069 | U+0065 |
NFD characters | A | m | [e, ◌́ ] | l | i | e |
Typographical non-interaction:
Some scripts regularly use multiple combining marks that do not, in general, interact typographically,
and do not have precomposed characters for the combinations.
Pairs of such non-interacting marks can be stored in either order. These alternative sequences are, in general, canonically equivalent.
The rules that define their sequencing in the canonical form also define whether they are considered to interact.
Typographic conventions:
Unicode provides code points for some characters or groups of characters which are modified only for aesthetic reasons
(such as ligatures, the half-width katakana characters, or the full-width Latin letters for use in Japanese texts),
or to add new semantics without losing the original one (such as digits in subscript or superscript positions,
or the circled digits (such as "①") inherited from some Japanese fonts).
Such a sequence is considered compatible with the sequence of original (individual and unmodified) characters,
for the benefit of applications where the appearance and added semantics are not relevant.
However, the two sequences are not declared canonically equivalent,
since the distinction has some semantic value and affects the rendering of the text.
Text processing software implementing Unicode string search and comparison functionality must take into account
the presence of equivalent code points. In the absence of this feature, users searching for a particular code point sequence
would be unable to find other visually indistinguishable glyphs that have a different, but canonically equivalent, code point representation.
Algorithms:
Unicode provides standard normalization algorithms that produce a unique (normal) code point sequence
for all sequences that are equivalent; the equivalence criteria can be either canonical (NF) or compatibility (NFK).
Since one can arbitrarily choose the representative element of an equivalence class,
multiple canonical forms are possible for each equivalence criterion.
Unicode provides two normal forms that are semantically meaningful for each of the two equivalence criteria:
the composed forms NFC and NFKC, and the decomposed forms NFD and NFKD.
Both the composed and decomposed forms impose a canonical ordering on the code point sequence,
which is necessary for the normal forms to be unique.
In order to compare or search Unicode strings, software can use either composed or decomposed forms;
this choice does not matter as long as it is the same for all strings involved in a search, comparison, etc.
On the other hand, the choice of equivalence criteria can affect search results.
For instance, some typographic ligatures like U+FB03 (ffi), Roman numerals like U+2168 (Ⅸ) and even subscripts and superscripts,
e.g. U+2075 (⁵) have their own Unicode code points. Canonical normalization (NF) does not affect any of these,
but compatibility normalization (NFK) will decompose the ffi ligature into the constituent letters, so a search for U+0066 (f)
as substring would succeed in an NFKC normalization of U+FB03 but not in NFC normalization of U+FB03.
Likewise when searching for the Latin letter I (U+0049) in the precomposed Roman numeral Ⅸ (U+2168).
Similarly, the superscript ⁵ (U+2075) is transformed to 5 (U+0035) by compatibility mapping.
Transforming superscripts into baseline equivalents may not be appropriate,
however, for rich text software, because the superscript information is lost in the process. To allow for this distinction,
the Unicode character database contains compatibility formatting tags that provide additional details on the compatibility transformation.
In the case of typographic ligatures, this tag is simply <compat>, while for the superscript it is <super>.
Rich text standards like HTML take into account the compatibility tags.
For instance, HTML uses its own markup to position a "5" (U+0035) in a superscript position.
Normal forms:
The four Unicode normalization forms and the algorithms (transformations) for obtaining them are listed in the table:
NFD (Normalization Form Canonical Decomposition) | Characters are decomposed by canonical equivalence, and multiple combining characters are arranged in a specific order. |
---|---|
NFC (Normalization Form Canonical Composition) | Characters are decomposed and then recomposed by canonical equivalence. |
NFKD (Normalization Form Compatibility Decomposition) | Characters are decomposed by compatibility, and multiple combining characters are arranged in a specific order. |
NFKC (Normalization Form Compatibility Composition) | Characters are decomposed by compatibility, then recomposed by canonical equivalence. |
All these algorithms are idempotent transformations, meaning that a string that is already in one of these normalized forms
will not be modified if processed again by the same algorithm.
The normal forms are not closed under string concatenation: for defective Unicode strings starting with a Hangul vowel
or trailing conjoining jamo, concatenation can break composition.
The normalization transformations are also not injective (they map different original glyphs and sequences to the same normalized sequence)
and thus not bijective (the original cannot be restored).
For example, the distinct Unicode strings "U+212B" (the angstrom sign "Å") and "U+00C5" (the Swedish letter "Å")
are both expanded by NFD (or NFKD) into the sequence [U+0041 U+030A] (Latin letter "A" and combining ring above "°")
which is then reduced by NFC (or NFKC) to "U+00C5" (the Swedish letter "Å").
A single character (other than a Hangul syllable block) that will get replaced by another under normalization
can be identified in the Unicode tables for having a non-empty compatibility field but lacking a compatibility tag.
Canonical ordering:
The canonical ordering is mainly concerned with the ordering of a sequence of combining characters.
For the examples in this section we assume these characters to be diacritics,
even though in general some diacritics are not combining characters, and some combining characters are not diacritics.
Unicode assigns each character a combining class, which is identified by a numerical value.
Non-combining characters have class number 0, while combining characters have a positive combining class value.
To obtain the canonical ordering, every substring of characters having non-zero combining class value
must be sorted by the combining class value using a stable sorting algorithm.
Stable sorting is required because combining characters with the same class value are assumed to interact typographically,
thus the two possible orders are not considered equivalent.
For example, the character U+1EBF (ế), used in Vietnamese, has both an acute and a circumflex accent.
Its canonical decomposition is the three-character sequence U+0065 (e) U+0302 (circumflex accent) U+0301 (acute accent).
The combining classes for the two accents are both 230, thus U+1EBF is not equivalent to U+0065 U+0301 U+0302.
Since not all combining sequences have a precomposed equivalent
(the last one in the previous example can only be reduced to U+00E9 U+0302),
even the normal form NFC is affected by combining characters' behavior.
We will use the C programming language for example code, as it's an easy language to understand.
On Windows, you can use wide characters and the wide print functions to print Unicode text. You will also need the "_setmode" function to switch the output mode to Unicode; otherwise only ASCII characters will be printed:
#include <stdio.h> // The "_fileno" function is here
#include <io.h> // The "_setmode" function is here
#include <fcntl.h> // The "_O_U16TEXT" macro is here
int main()
{
_setmode(_fileno(stdout), _O_U16TEXT);
wprintf(L"Hello مرحباً\n"); // A string prefixed with the letter "L" is a wide string
return 0;
}
(The Arabic text might be printed in a left-to-right direction and without forming ligatures.)
Revisiting the UTF-8 byte table: if you visualize a code point's hex digits as (U+uvwxyz), the UTF-8 bytes are arranged as follows:
Code point range | First byte | Second byte | Third byte | Fourth byte |
---|---|---|---|---|
U+0000 to U+007F | 0yyyzzzz₂ | |||
U+0080 to U+07FF | 110xxxyy₂ | 10yyzzzz₂ | ||
U+0800 to U+FFFF | 1110wwww₂ | 10xxxxyy₂ | 10yyzzzz₂ | |
U+010000 to U+10FFFF | 11110uvv₂ | 10vvwwww₂ | 10xxxxyy₂ | 10yyzzzz₂ |
In the code below, the if statements check how many code units are required to fit the code point:
#include <stdio.h>
#include <stdint.h>
uint8_t utf32ToUtf8(uint32_t codePoint, uint8_t output[4])
{
if (codePoint <= 0x007F) // Fits into one UTF8 code unit (ASCII)
{
output[0] = (uint8_t) codePoint;
return 1;
}
else if (codePoint <= 0x07FF) // Fits into two UTF8 code units
{
output[0] = (uint8_t) (((codePoint >> 6) & 0b00011111) | 0b11000000);
output[1] = (uint8_t) ((codePoint & 0b00111111) | 0b10000000);
return 2;
}
else if (codePoint <= 0xFFFF) // Fits into three UTF8 code units
{
output[0] = (uint8_t) (((codePoint >> 12) & 0b00001111) | 0b11100000);
output[1] = (uint8_t) (((codePoint >> 6) & 0b00111111) | 0b10000000);
output[2] = (uint8_t) ((codePoint & 0b00111111) | 0b10000000);
return 3;
}
else if (codePoint <= 0x10FFFF) // Fits into four UTF8 code units
{
output[0] = (uint8_t) (((codePoint >> 18) & 0b00000111) | 0b11110000);
output[1] = (uint8_t) (((codePoint >> 12) & 0b00111111) | 0b10000000);
output[2] = (uint8_t) (((codePoint >> 6) & 0b00111111) | 0b10000000);
output[3] = (uint8_t) ((codePoint & 0b00111111) | 0b10000000);
return 4;
}
else return UINT8_MAX; // We will consider the return value "UINT8_MAX" (255) to mean "invalid"
}
int main()
{
uint8_t output[4] = {0};
uint8_t codeUnitCount = utf32ToUtf8(0x1D70F, output); // The character U+1D70F is "Mathematical Italic Small Tau" "𝜏"
printf("Amount of code units: %d\n", codeUnitCount);
printf("Code units:\n");
for (uint8_t i = 0; i < codeUnitCount; i++)
printf("%X\n", output[i]);
return 0;
}
The output is [F0, 9D, 9C, 8F]; in binary, [11110000, 10011101, 10011100, 10001111].
If you're confused by the bitwise operations, here's an explanation: shifting right (">> 6", ">> 12", ">> 18") moves a group of the code point's bits down to the lowest positions, the bitwise AND with a mask such as 0b00111111 keeps only the bits selected by the mask, and the bitwise OR with a prefix such as 0b10000000 stamps the UTF-8 marker bits onto the byte.
Please note that the code blocks contain basic examples; they aren't perfect and they don't check for common errors.
In the code below, we do a bitwise AND followed by a comparison to check whether the first UTF-8 code unit starts with 0₂, 110₂, 1110₂, or 11110₂, which determines the number of code units.
#include <stdio.h>
#include <stdint.h>
uint32_t utf8ToUtf32(const uint8_t utf8Char[4])
{
if ((utf8Char[0] & 0b10000000) == 0b0) // One code unit
{
return (uint32_t) utf8Char[0];
}
else if ((utf8Char[0] & 0b11100000) == 0b11000000) // Two code units
{
if ((utf8Char[1] & 0b11000000) != 0b10000000) // Check if the next unit is valid
return UINT32_MAX; // We will consider the return value "UINT32_MAX" (0xFFFFFFFF) to mean "invalid"
return ((uint32_t)(utf8Char[0] & 0b00011111) << 6) +
(uint32_t)(utf8Char[1] & 0b00111111);
}
else if ((utf8Char[0] & 0b11110000) == 0b11100000) // Three code units
{
if ((utf8Char[1] & 0b11000000) != 0b10000000
|| (utf8Char[2] & 0b11000000) != 0b10000000) // Check if the next units are valid
return UINT32_MAX;
return ((uint32_t)(utf8Char[0] & 0b00001111) << 12) +
((uint32_t)(utf8Char[1] & 0b00111111) << 6) +
(uint32_t)(utf8Char[2] & 0b00111111);
}
else if ((utf8Char[0] & 0b11111000) == 0b11110000) // Four code units
{
if ((utf8Char[1] & 0b11000000) != 0b10000000 // Check if the next units are valid
|| (utf8Char[2] & 0b11000000) != 0b10000000
|| (utf8Char[3] & 0b11000000) != 0b10000000)
return UINT32_MAX;
return ((uint32_t)(utf8Char[0] & 0b00000111) << 18) +
((uint32_t)(utf8Char[1] & 0b00111111) << 12) +
((uint32_t)(utf8Char[2] & 0b00111111) << 6) +
(uint32_t)(utf8Char[3] & 0b00111111);
}
else return UINT32_MAX; // The first byte has an invalid prefix
}
int main()
{
uint8_t codeUnits[4] = {0xD0, 0x87}; // The character (U+0407) is "Cyrillic Capital Letter Yi" "ї"
printf("%X\n", utf8ToUtf32(codeUnits));
return 0;
}
The output is 0x407.
If the code point's value is in the BMP (less than 65,536 (0x10000)),
it simply gets converted to an unsigned 16-bit integer with the same value as that code point.
However, if it isn't in the BMP (greater than or equal to 65,536 (0x10000)):
0x10000 is subtracted from the code point, leaving a 20-bit value. The high ten bits are added to 0xD800 to form the first code unit (the "high surrogate"), and the low ten bits are added to 0xDC00 to form the second code unit (the "low surrogate").
#include <stdio.h>
#include <stdint.h>
uint8_t utf32ToUtf16(uint32_t codePoint, uint16_t output[2])
{
if (codePoint < 0x10000)
{
output[0] = (uint16_t) codePoint;
return 1;
}
else if (codePoint <= 0x10FFFF)
{
codePoint -= 0x10000;
output[0] = 0xD800 + (codePoint >> 10); // High surrogate
output[1] = 0xDC00 + (codePoint & 0b1111111111); // Low surrogate
return 2;
}
else return UINT8_MAX; // We will consider the return value "UINT8_MAX" (255) to mean "invalid"
}
int main()
{
uint16_t output[2] = {0};
uint8_t codeUnitCount = utf32ToUtf16(0x10001, output); // The character U+10001 is "Linear B Syllable B038 E" "𐀁"
printf("Amount of code units: %d\n", codeUnitCount);
printf("Code units:\n");
for (uint8_t i = 0; i < codeUnitCount; i++)
printf("%X\n", output[i]);
return 0;
}
The output is [D800, DC01]. In binary [1101100000000000, 1101110000000001].
In the code below, if there is one unit (a BMP code point), we convert it to an unsigned 32-bit integer.
If there is a pair of surrogate units, we mask off the surrogate marker bits of both units, cast them to unsigned 32-bit integers,
shift the high surrogate's bits left by ten, add the two results together,
and finally add back the 0x10000 that was subtracted when the code point was encoded in UTF-16.
#include <stdio.h>
#include <stdint.h>
uint32_t utf16ToUtf32(const uint16_t utf16Char[2])
{
// High surrogate
if (utf16Char[0] >= 0xD800 && utf16Char[0] <= 0xDBFF)
{
if (utf16Char[1] < 0xDC00 || utf16Char[1] > 0xDFFF) // Make sure the next unit is a low surrogate
return UINT32_MAX; // We will consider the return value "UINT32_MAX" (0xFFFFFFFF) to mean "invalid"
return ( ((uint32_t)(utf16Char[0] & 0b1111111111)) << 10 ) + (uint32_t)(utf16Char[1] & 0b1111111111) + 0x10000U;
}
// The unit isn't a surrogate here, so it's a BMP code point (a uint16_t is always < 0x10000)
else if (utf16Char[0] < 0xDC00 || utf16Char[0] > 0xDFFF)
{
return (uint32_t) utf16Char[0];
}
else return UINT32_MAX;
}
int main()
{
uint16_t codeUnits[2] = {0xD802, 0xDD07}; // The character (U+10907) is "Phoenician Letter Het" "𐤇"
printf("%X\n", utf16ToUtf32(codeUnits));
return 0;
}
The output is 0x10907.
Code points in the BMP are encoded in one, two, or three code units in UTF-8, and in one code unit in UTF-16.
In the code below, if the code point is in the BMP, we simply combine the UTF-8 code units into a single UTF-16 unit.
If the code point is outside the BMP, we assemble a UTF-32 value from the UTF-8 units and subtract 0x10000 from it.
After that we convert the result to a UTF-16 surrogate pair.
#include <stdio.h>
#include <stdint.h>
uint8_t utf8ToUtf16(const uint8_t utf8Char[4], uint16_t output[2])
{
if ((utf8Char[0] & 0b10000000) == 0b0) // One code unit
{
output[0] = (uint16_t) utf8Char[0];
return 1;
}
else if ((utf8Char[0] & 0b11100000) == 0b11000000) // Two code units
{
if ((utf8Char[1] & 0b11000000) != 0b10000000) // Check if the next unit is valid
return UINT8_MAX; // We will consider the return value "UINT8_MAX" (255) to mean "invalid"
output[0] = ((uint16_t)(utf8Char[0] & 0b00011111) << 6) +
(uint16_t)(utf8Char[1] & 0b00111111);
return 1;
}
else if ((utf8Char[0] & 0b11110000) == 0b11100000) // Three code units
{
if ((utf8Char[1] & 0b11000000) != 0b10000000
|| (utf8Char[2] & 0b11000000) != 0b10000000) // Check if the next units are valid
return UINT8_MAX;
output[0] = ((uint16_t)(utf8Char[0] & 0b00001111) << 12) +
((uint16_t)(utf8Char[1] & 0b00111111) << 6) +
(uint16_t)(utf8Char[2] & 0b00111111);
return 1;
}
else if ((utf8Char[0] & 0b11111000) == 0b11110000) // Four code units
{
if ((utf8Char[1] & 0b11000000) != 0b10000000 // Check if the next units are valid
|| (utf8Char[2] & 0b11000000) != 0b10000000
|| (utf8Char[3] & 0b11000000) != 0b10000000)
return UINT8_MAX;
uint32_t result = ((uint32_t)(utf8Char[0] & 0b00000111) << 18) +
((uint32_t)(utf8Char[1] & 0b00111111) << 12) +
((uint32_t)(utf8Char[2] & 0b00111111) << 6) +
(uint32_t)(utf8Char[3] & 0b00111111)
- 0x10000;
output[0] = 0xD800 + (result >> 10); // High surrogate
output[1] = 0xDC00 + (result & 0b1111111111); // Low surrogate
return 2;
}
else return UINT8_MAX; // The first byte has an invalid prefix
}
int main()
{
uint8_t codeUnits[4] = {0xF0, 0x90, 0xA4, 0x87}; // The character (U+10907) is "Phoenician Letter Het" "𐤇"
uint16_t output[2] = {0};
uint8_t codeUnitCount = utf8ToUtf16(codeUnits, output);
printf("Amount of code units: %d\n", codeUnitCount);
printf("Code units:\n");
for (uint8_t i = 0; i < codeUnitCount; i++)
printf("%X\n", output[i]);
return 0;
}
The output is [D802, DD07].
Converting UTF-16 to UTF-8 is almost the same as converting UTF-32 to UTF-8:
#include <stdio.h>
#include <stdint.h>
uint8_t utf16ToUtf8(uint16_t utf16Char[2], uint8_t output[4])
{
// High surrogate
if (utf16Char[0] >= 0xD800 && utf16Char[0] <= 0xDBFF)
{
if (utf16Char[1] < 0xDC00 || utf16Char[1] > 0xDFFF) // Make sure the next unit is a low surrogate
return UINT8_MAX; // We will consider the return value "UINT8_MAX" (255) to mean "invalid"
uint32_t result = ( ((uint32_t)(utf16Char[0] & 0b1111111111)) << 10 ) + (uint32_t)(utf16Char[1] & 0b1111111111) + 0x10000U;
output[0] = (uint8_t) (((result >> 18) & 0b00000111) | 0b11110000);
output[1] = (uint8_t) (((result >> 12) & 0b00111111) | 0b10000000);
output[2] = (uint8_t) (((result >> 6) & 0b00111111) | 0b10000000);
output[3] = (uint8_t) ((result & 0b00111111) | 0b10000000);
return 4;
}
// The unit isn't a surrogate here, so it's a BMP code point (a uint16_t is always < 0x10000)
else if (utf16Char[0] < 0xDC00 || utf16Char[0] > 0xDFFF)
{
if (utf16Char[0] <= 0x007F) // Fits into one UTF8 code unit (ASCII)
{
output[0] = (uint8_t) utf16Char[0];
return 1;
}
else if (utf16Char[0] <= 0x07FF) // Fits into two UTF8 code units
{
output[0] = (uint8_t) (((utf16Char[0] >> 6) & 0b00011111) | 0b11000000);
output[1] = (uint8_t) ((utf16Char[0] & 0b00111111) | 0b10000000);
return 2;
}
else if (utf16Char[0] <= 0xFFFF) // Fits into three UTF8 code units
{
output[0] = (uint8_t) (((utf16Char[0] >> 12) & 0b00001111) | 0b11100000);
output[1] = (uint8_t) (((utf16Char[0] >> 6) & 0b00111111) | 0b10000000);
output[2] = (uint8_t) ((utf16Char[0] & 0b00111111) | 0b10000000);
return 3;
}
}
else return UINT8_MAX;
}
int main()
{
uint16_t codeUnits[2] = {0xD801, 0xDE43}; // The character (U+10643) is "Linear A Sign Ab082" "𐙃"
uint8_t output[4] = {0};
uint8_t codeUnitCount = utf16ToUtf8(codeUnits, output);
printf("Amount of code units: %d\n", codeUnitCount);
printf("Code units:\n");
for (uint8_t i = 0; i < codeUnitCount; i++)
printf("%X\n", output[i]);
return 0;
}
The output is [F0, 90, 99, 83].