Unicode is a text encoding standard maintained by the Unicode Consortium, designed to support the use of text in all of the world's writing systems that can be digitized.
Version 16.0 of the standard (the latest at the time of writing) defines 154,998 characters and 168 scripts.
A script in Unicode is a collection of characters from one or more writing systems or languages.
Here is a list of supported scripts.
The Unicode standard defines three encodings: UTF-8, UTF-16, and UTF-32, though other encodings exist as well.
"UTF" stands for "Unicode Transformation Format".
A code point is a numerical value that maps to a specific character. For example, in ASCII the code point 65 (in decimal) represents uppercase "A".
In Unicode, a code point can be referred to as "U+" followed by its value in hexadecimal.
Examples:
Character | Code point | Glyph |
---|---|---|
Latin A | U+0041 | A |
Latin sharp S | U+00DF | ß |
Han for "East" | U+6771 | 東 |
Ampersand | U+0026 | & |
Inverted exclamation mark | U+00A1 | ¡ |
Section sign | U+00A7 | § |
Examples of Unicode code points in strings:
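A minimal Python sketch, using characters from the table above:

```python
# Code points can be written with \u (4 hex digits) or \U (8 hex digits) escapes.
s = "\u0041\u00DF\u6771"
print(s)              # Aß東

# ord() gives a character's code point; chr() does the reverse.
print(hex(ord("東")))  # 0x6771
print(chr(0x00A7))     # §
```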
A code unit is the minimum bit combination that can represent a character in a character encoding. For example, common code units include 7-bit, 8-bit, 16-bit, and 32-bit units.
In some encodings, such as UTF-8, code points whose values do not fit in a single code unit are encoded using multiple code units; such an encoding is referred to as a variable-length encoding.
A code space is the range of numerical values available for encoding characters. The Unicode code space ranges from U+0000 to U+10FFFF (0 to 1,114,111).
The Unicode codespace is divided into 17 planes, numbered 0 to 16.
A Unicode plane is a contiguous group of 65,536 (2¹⁶) code points.
Characters in the range U+0000 to U+FFFF are in plane 0, called the Basic Multilingual Plane (BMP). This plane contains the most commonly-used characters.
Characters in the range U+10000 to U+10FFFF (in the other planes) are called supplementary characters.
All code points in the BMP require only a single code unit in the UTF-16 encoding and can be encoded in one, two, or three bytes in UTF-8.
Code points in planes 1 through 16 (the supplementary planes) are encoded as pairs of code units in UTF-16 and encoded in four bytes in UTF-8.
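These sizes can be checked with Python's built-in codecs (the sample characters are arbitrary):

```python
bmp_char = "東"    # U+6771, in the Basic Multilingual Plane
supp_char = "𝄞"   # U+1D11E (musical symbol G clef), in a supplementary plane

# UTF-16 code units are 2 bytes each.
print(len(bmp_char.encode("utf-16-le")) // 2)   # 1 code unit
print(len(supp_char.encode("utf-16-le")) // 2)  # 2 code units (a surrogate pair)

# UTF-8 byte counts.
print(len(bmp_char.encode("utf-8")))   # 3 bytes
print(len(supp_char.encode("utf-8")))  # 4 bytes
```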
Within each plane, characters are allocated within named blocks of related characters.
A Unicode block is one of several contiguous ranges of code points of the Unicode character set.
Each block is generally (but not always) meant to supply glyphs used by one or more specific languages, or in some general application area such as mathematics.
Here is a list of Unicode blocks.
Transcoding is the process of converting data from one encoding to another, such as converting UTF-8 to UTF-16.
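In Python, for example, transcoding can be sketched as a decode followed by a re-encode:

```python
utf8_data = "Aß東".encode("utf-8")

# Transcode UTF-8 -> UTF-16 by decoding to an abstract string, then re-encoding.
utf16_data = utf8_data.decode("utf-8").encode("utf-16-le")
print(utf16_data.hex(" "))
```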
A grapheme is the smallest functional unit of a writing system.
A digraph is a pair of letters that together represent a single new sound, such as 'ch', 'sh', or 'th'.
In writing and typography, a ligature occurs where two or more graphemes or letters are joined to form a single glyph.
A Private Use Area (PUA) in Unicode is a range of code points that will not be assigned to characters by the standard.
These code points are left undefined so third parties can use them to encode their own characters.
There are three PUA blocks:
Specials is a short Unicode block that exists at the end of the BMP, ranging from U+FFF0 to U+FFFF.
It includes the characters:
U+FFFE and U+FFFF are noncharacters, meaning they are reserved but do not cause ill-formed Unicode text.
Variation Selectors is a Unicode block containing 16 variation selectors used to specify a glyph variant for a preceding character.
They are currently used to specify standardized variation sequences for mathematical symbols, emoji symbols, and other characters.
At present only standardized variation sequences with VS1-VS4, VS7, VS15 and VS16 have been defined;
VS15 and VS16 are reserved to request that a character should be displayed as text or as an emoji respectively.
These combining characters are named variation selector-1 (for U+FE00) through to variation selector-16 (U+FE0F),
and are abbreviated VS1-VS16.
As of Unicode 13.0:
A table of variation selectors:
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
U+FE0x | VS 1 | VS 2 | VS 3 | VS 4 | VS 5 | VS 6 | VS 7 | VS 8 | VS 9 | VS 10 | VS 11 | VS 12 | VS 13 | VS 14 | VS 15 | VS 16 |
Variation Selectors Supplement is a Unicode block containing additional variation selectors beyond those in the Variation Selectors block.
These combining characters are named variation selector-17 (for U+E0100) through to variation selector-256 (U+E01EF), abbreviated VS17-VS256.
As of 12 December 2017, VS17 (U+E0100) to VS48 (U+E011F) are used in ideographic variation sequences
in the Unicode Ideographic Variation Database (IVD).
These selectors are known as Ideographic Variation Selectors (IVS).
They are not listed in the list of standardized variation sequences; instead, they are listed in the Ideographic Variation Database.
UTF-32 stands for "Unicode Transformation Format - 32-bit".
It is a fixed-length encoding: a single UTF-32 code unit represents exactly one Unicode code point.
This means that UTF-32 encodes every code point in one 32-bit code unit.
The most common use of UTF-32 is in internal APIs
where the data is a single code point that is directly mapped to a certain glyph, and not a string of characters.
Since the highest code point is U+10FFFF (a 21-bit value), a UTF-32 code unit contains 11 bits that are always zero. Non-Unicode information is often stored in these "unused" bits.
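A small sketch of the fixed width in Python (the utf-32-le codec is used because it does not prepend a BOM):

```python
# Every code point occupies exactly 4 bytes, whatever its value.
for ch in ("A", "東", "😐"):
    raw = ch.encode("utf-32-le")
    print(ch, len(raw), raw.hex())
```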
UTF-16 stands for "Unicode Transformation Format - 16-bit".
It is capable of encoding all valid Unicode code points using a variable-length encoding of one or two 16-bit code units.
All code points in the BMP (with values less than 2¹⁶) can be encoded using one code unit that is equal to the numerical value of the code point.
Code points from other planes are encoded as two 16-bit code units called a surrogate pair.
The first code unit is a high surrogate and the second is a low surrogate
(Also known as "leading" and "trailing" surrogates)
Code point - 0x10000 | yyyyyyyyyyxxxxxxxxxx |
---|---|
High surrogate | 110110yyyyyyyyyy |
Low surrogate | 110111xxxxxxxxxx |
The way to convert code points to UTF-16 will be explained later.
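As a preview, the bit layout in the table above can be sketched as a toy Python function (not a full UTF-16 encoder):

```python
def to_surrogate_pair(cp: int):
    # Assumes a supplementary code point (U+10000..U+10FFFF).
    v = cp - 0x10000              # 20-bit value yyyyyyyyyyxxxxxxxxxx
    high = 0xD800 | (v >> 10)     # 110110yyyyyyyyyy
    low = 0xDC00 | (v & 0x3FF)    # 110111xxxxxxxxxx
    return high, low

print([hex(u) for u in to_surrogate_pair(0x1F610)])  # ['0xd83d', '0xde10']
```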
To make the detection of surrogate pairs easy, the Unicode standard has reserved the range U+D800 to U+DFFF for the use of UTF-16.
Code points with values in this range are called surrogate code points.
The official Unicode standard says that no UTF form, including UTF-16, can encode the surrogate code points.
Since they will never be assigned to a character, there should be no reason to encode them.
However, Windows allows unpaired surrogates (a high surrogate code point not followed by a low one, or a low one not preceded by a high one)
in file paths and other places, which generally means that software has to support them despite their being excluded from the Unicode standard.
Since most communication and storage protocols are defined for bytes, and each UTF-16 code unit takes two bytes,
the order of bytes may depend on the endianness (byte order) of the computer architecture.
UTF-16 allows a byte order mark (BOM), a code point with the value U+FEFF, to precede the first actual coded value.
(U+FEFF is the invisible zero-width non-breaking space ZWNBSP character).
If the endianness of the decoder matches that of the encoder, the decoder detects the 0xFEFF value (bytes [FE, FF]),
but an opposite-endian decoder interprets the BOM as the noncharacter value 0xFFFE (bytes [FF, FE]), which is reserved for this purpose.
This incorrect result provides a hint to perform byte-swapping for the remaining values.
If the BOM is missing, RFC 2781 recommends that big-endian (BE) encoding be assumed.
In practice, due to Windows using little-endian (LE) order by default, many applications assume little-endian encoding.
The standard also allows the byte order to be stated explicitly by specifying UTF-16BE or UTF-16LE as the encoding type.
When the byte order is specified explicitly this way, a BOM is specifically not supposed to be prepended to the text,
and a U+FEFF at the beginning should be handled as a ZWNBSP character.
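A sketch of BOM-based detection (detect_bom is a hypothetical helper, not a library function):

```python
def detect_bom(data: bytes) -> str:
    # Look at the first two bytes; default to big-endian per RFC 2781.
    if data[:2] == b"\xfe\xff":
        return "utf-16-be"
    if data[:2] == b"\xff\xfe":
        return "utf-16-le"
    return "utf-16-be"

payload = "hi".encode("utf-16-le")
print(detect_bom(b"\xff\xfe" + payload))  # utf-16-le
```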
UTF-8 stands for "Unicode Transformation Format - 8-bit".
It is capable of encoding all valid Unicode code points using a variable-length encoding of one to four 8-bit code units.
Code points with lower numerical values (which tend to occur more frequently) are encoded using fewer bytes.
UTF-8 is backward compatible with ASCII: the first 128 characters of Unicode are encoded with the same binary values as ASCII.
However, it is not backward compatible with extended ASCII (values ranging from 128 to 255), and applications must have a dedicated conversion algorithm to convert extended ASCII to UTF-8 and vice versa.
UTF-8 encodes code points in one to four bytes depending on the value of the code point.
If you visualize a code point as (U+uvwxyz), UTF-8 bytes are arranged as follows:
Code point range | First byte | Second byte | Third byte | Fourth byte |
---|---|---|---|---|
U+0000 to U+007F | 0yyyzzzz₂ | |||
U+0080 to U+07FF | 110xxxyy₂ | 10yyzzzz₂ | ||
U+0800 to U+FFFF | 1110wwww₂ | 10xxxxyy₂ | 10yyzzzz₂ | |
U+010000 to U+10FFFF | 11110uvv₂ | 10vvwwww₂ | 10xxxxyy₂ | 10yyzzzz₂ |
As you can see, bytes beginning with the bits 10 are continuation bytes: units in the middle or at the end of a multi-byte sequence.
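The table can be sketched as a toy encoder in Python (real code should simply use str.encode("utf-8")):

```python
def utf8_encode(cp: int) -> bytes:
    # Follows the byte layouts in the table above.
    if cp <= 0x7F:
        return bytes([cp])
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

print(utf8_encode(0x6771) == "東".encode("utf-8"))  # True
```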
And:
Overlong encodings:
An overlong encoding is an encoding which uses more bytes than necessary.
They are a security risk as they allow the same code point to be encoded in multiple ways.
Overlong encodings should therefore be considered an error and never decoded.
Error handling:
Not all sequences of bytes are valid UTF-8. A UTF-8 decoder should be prepared for:
The standard recommends replacing each ill-formed code unit sequence with the replacement character "�" (U+FFFD) and continuing to decode.
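Python's decoder, for instance, implements this recommendation when errors="replace" is requested:

```python
bad = b"A\xc0\xafB"   # 'A', an invalid (overlong) sequence, 'B'
# Invalid bytes are replaced with U+FFFD; decoding continues.
print(bad.decode("utf-8", errors="replace"))
```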
Since RFC 3629 (November 2003), the high and low surrogates used by UTF-16 (values ranging from U+D800 to U+DFFF) are not legal Unicode values,
and their UTF-8 encodings must be treated as an invalid byte sequence.
These encodings (in UTF-8) all start with 0xED followed by 0xA0 or higher.
This rule is often ignored as surrogates are allowed in Windows filenames and this means there must be a way to store them in a string.
UTF-8 that allows these surrogate halves has been informally called WTF-8.
If the Unicode byte-order mark U+FEFF is at the start of a UTF-8 file, the first three bytes will be 0xEF, 0xBB, 0xBF.
The Unicode Standard neither requires nor recommends the use of the BOM for UTF-8,
but warns that it may be encountered at the start of a file transcoded from another encoding.
While UTF-8 is backward compatible with ASCII, this is not true when the Unicode Standard's recommendations are ignored and a BOM is added.
A BOM can confuse software that isn't prepared for it but can otherwise accept UTF-8,
e.g. programming languages that permit non-ASCII bytes in string literals but not at the start of the file.
Nevertheless, there is software that always inserts a BOM when writing UTF-8,
and refuses to correctly interpret UTF-8 unless the first character is a BOM (or the file only contains ASCII).
| UTF-8 | UTF-16 | UTF-32 |
---|---|---|---|
Common use | General text encoding | General text encoding | Mapping characters to glyphs |
Size of code unit | 1 byte | 2 bytes | 4 bytes |
Bytes used per character | 1-4 bytes | 2 or 4 bytes | 4 bytes |
ASCII file compatibility | Compatible | Not compatible | Not compatible |
The byte order mark (BOM) is simply the zero-width no-break space (ZWNBSP) (U+FEFF) when it appears at the beginning of a UTF-16 encoded text file.
The BOM is inserted at the beginning of a UTF-16 encoded text file to signal the endianness used in the encoding.
If the UTF-16 decoder finds that the text file begins with the bytes [FE, FF], this suggests that the file is encoded in the same endianness as the decoder.
If the bytes are in the opposite order, [FF, FE], this suggests that the file is encoded in the opposite endianness.
The name ZWNBSP should be used if U+FEFF appears in the middle of a data stream:
Unicode says it should then be interpreted as a normal code point (a zero-width no-break space), not as a BOM.
Since Unicode 3.2, this usage has been deprecated in favor of the word joiner character (U+2060).
UTF-32 also uses the BOM character, padded with 16 zero bits.
Programmers using the BOM to identify the encoding will have to decide whether it is UTF-32 or UTF-16 by checking whether null bytes follow the BOM.
The word joiner (WJ) (U+2060) is a Unicode format character which is used to indicate that line breaking should not occur at its position.
It is not a visible character and it does not have a width.
The word joiner replaces the zero-width no-break space (ZWNBSP, U+FEFF) in its role as a no-break space of zero width.
The ZWNBSP is used as the byte order mark (BOM) at the start of a file.
However, if encountered elsewhere, it should, according to Unicode, be treated as a word joiner (a no-break space of zero width).
The use of U+FEFF for this purpose is deprecated as of Unicode 3.2, with the word joiner strongly preferred.
A non-breaking space (also called NBSP, required space, hard space, or fixed space) (U+00A0),
is a space character that prevents an automatic line break at its position.
It also prevents consecutive whitespace characters from collapsing into a single space.
Non-breaking space characters with other widths also exist,
such as the narrow no-break space (NNBSP) (U+202F),
figure space (also called numeric space) (U+2007),
and the word joiner (WJ) (U+2060).
The zero-width space (ZWSP) (U+200B) is a non-printing character used to indicate where the word boundaries are,
without actually displaying a visible space in text.
This enables text-processing systems for scripts without visible spacing
to recognize where word boundaries are for the purpose of handling line breaks appropriately.
The ZWSP is located in the General Punctuation block (ranging from U+2000 to U+206F).
ICANN rules prohibit domain names from containing non-displayed characters, including the zero-width space,
and most browsers prohibit their use within domain names because they can be used to create a homograph attack,
where a malicious URL is visually indistinguishable from a legitimate one.
The zero-width joiner (ZWJ) (U+200D) is a non-printing character used in text-processing systems
in which the shape or positioning of a grapheme depends on its relation to other graphemes (complex scripts),
such as the Arabic script or any Indic script. Sometimes the Roman script is to be counted as complex, e.g. when using a Fraktur typeface. When placed between two characters that would otherwise not be connected, a ZWJ causes them to be printed in their connected forms.
Sometimes, when a ZWJ is placed between two emoji characters, it can result in a single emoji being shown,
such as the family emoji, made up of two adult emoji and one or two child emoji.
Examples:
Character sequence | Appearance | Description |
---|---|---|
[ra র] [virāma ্ ] [ya য] | র্য | |
[ra র] [ZWJ] [virāma ্ ] [ya য] | র্য | |
[Man] [ZWJ] [Woman] [ZWJ] [Boy] | 👨👩👦 | Family: Man, Woman, Boy |
[Black flag] [ZWJ] [Skull and crossbones] | 🏴☠️ | Pirate Flag |
[Runner] [Emoji modifier fitzpatrick type-1-2] [ZWJ] [Female sign] | 🏃🏻♀️ | Woman Running: Light Skin Tone |
[Runner] [Emoji modifier fitzpatrick type-6] [ZWJ] [Female sign] | 🏃🏿♀️ | Woman Running: Dark Skin Tone |
[Man] [ZWJ] [Red hair] | 👨🦰 | Man: Red Hair |
[Person] [ZWJ] [Sheaf of rice] | 👨🌾 | Farmer |
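The family row above can be reproduced in Python; whether a single glyph is shown depends on the font and renderer:

```python
ZWJ = "\u200D"
family = "\U0001F468" + ZWJ + "\U0001F469" + ZWJ + "\U0001F466"  # man, woman, boy
print(family)       # 👨‍👩‍👦 on supporting systems
print(len(family))  # 5: still five code points under the hood
```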
The zero-width non-joiner (ZWNJ) (U+200C) is a non-printing character used in text-processing systems that make use of ligatures.
When placed between two characters that would otherwise be connected into a ligature,
a ZWNJ causes them to be printed in their final and initial forms, respectively.
This is also an effect of a space character, but a ZWNJ is used when it is desirable to keep the characters closer together.
In certain languages, the ZWNJ is necessary for specifying the correct typographic form of a character sequence. Examples:
Display (with ZWNJ) | Code (with ZWNJ) | Display (incorrect) | Code (incorrect) | Meaning |
---|---|---|---|---|
میخواهم | می[ZWNJ]خواهم | میخواهم | میخواهم | Persian: "I want to" |
ساءينس | ساءين[ZWNJ]س | ساءينس | ساءينس | Malay: "Science" |
Use of ZWNJ to display alternative forms:
In Indic scripts, insertion of a ZWNJ after a consonant either with a halant or before a dependent vowel
prevents the characters from being joined properly.
Examples:
Script | First character | Second character | combined (no ZWNJ) | combined (ZWNJ between characters) |
---|---|---|---|---|
Devanagari | क् | ष | क्ष | क्ष |
Kannada | ನ್ | ನ | ನ್ನ | ನ್ನ |
Combining characters are characters that are intended to modify other characters.
The most common combining characters in the Latin script are the combining diacritical marks (including combining accents).
For example: Cyrillic "U" combined with a breve gives "ў".
Unicode also contains many precomposed characters.
This leads to a requirement to perform Unicode normalization before comparing two Unicode strings
and to carefully design encoding converters
to correctly map all of the valid ways to represent a character in Unicode to a legacy encoding to avoid data loss.
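A sketch of that comparison problem using Python's unicodedata module:

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e + combining acute accent

print(precomposed == decomposed)   # False: different code point sequences
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
```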
The main block of combining diacritics for European languages and the International Phonetic Alphabet is U+0300 to U+036F.
Combining diacritical marks are also present in many other blocks of Unicode characters.
Unicode diacritics are always added after the main character.
The following blocks are dedicated specifically to combining characters:
Combining characters are not limited to these blocks; for instance, the combining dakuten (U+3099) and combining handakuten (U+309A) are in the Hiragana block.
Zalgo text:
Combining characters have been used to create Zalgo text,
which is text that appears "corrupted" or "creepy" due to an overuse of combining characters.
This causes the text to extend vertically, overlapping other text.
This is mostly used in horror contexts on the internet.
It is typically very challenging for most software to render, so the combining marks are often reduced or completely removed.
Grapheme clusters:
A grapheme cluster is a sequence of one or more Unicode code points that must be treated as a single, unbreakable unit.
Combining character sequences:
A combining character sequence is a character sequence consisting of a base character followed by one or more combining characters.
Character sequences:
The Unicode standard specifies notational conventions for referring to sequences of characters (or code points) treated as a single unit.
An example of a combining character sequence:
[U+0061, U+0302, U+0301] |
[Latin small letter A, Combining circumflex accent, Combining acute accent] |
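That sequence can be built directly in Python; the three code points render as one grapheme:

```python
seq = "\u0061\u0302\u0301"   # a + combining circumflex + combining acute
print(seq)        # renders as a single accented "a"
print(len(seq))   # 3 code points
```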
Here is a list of named character sequences.
The combining grapheme joiner (CGJ) (U+034F) is a character that has no visible glyph.
Its name is a misnomer and does not describe its function: the character does not join graphemes.
Its purpose is to separate characters that should not be considered digraphs
as well as to block canonical reordering of combining marks during normalization.
For example, in a Hungarian language context,
adjoining letters c and s would normally be considered equivalent to the cs digraph.
If they are separated by the CGJ, they will be considered as two separate graphemes.
However, in contrast to the zero-width joiner and similar characters,
the CGJ does not affect whether the two letters are rendered separately or as a ligature or cursively joined,
the default behavior for this is determined by the font.
In Unicode, the dotted circle (◌) (U+25CC) is a non-significant typographic character used to illustrate the effect of a combining mark,
such as a diacritic mark.
An illustration:
Unicode 15.1 specifies a total of 3,782 emoji using 1,424 characters spread across 24 blocks.
26 of these emoji are Regional indicator symbols that combine in pairs to form flag emoji,
and 12 (#, * and 0 - 9) are base characters for keycap emoji sequences.
Code points that are considered emoji are:
Some vendors add emoji presentation to some other existing Unicode characters or make their own ZWJ sequences.
Microsoft displayed all Mahjong tiles (U+1F000–U+1F02B, not just U+1F004 🀄 MAHJONG TILE RED DRAGON)
and alternative card suits (U+2661 ♡ WHITE HEART SUIT, U+2662 ♢ WHITE DIAMOND SUIT, U+2664 ♤ WHITE SPADE SUIT, U+2667 ♧ WHITE CLUB SUIT)
as emoji. They also supported additional pencils (U+270E ✎ LOWER RIGHT PENCIL, U+2710 ✐ UPPER RIGHT PENCIL)
and a heart-shaped bullet (U+2765 ❥ ROTATED HEAVY BLACK HEART BULLET).
While only U+261D (☝) WHITE UP POINTING INDEX is officially an emoji,
Microsoft and Samsung added the other three directions as well (U+261C ☜ WHITE LEFT POINTING INDEX,
U+261E ☞ WHITE RIGHT POINTING INDEX, U+261F ☟ WHITE DOWN POINTING INDEX). Microsoft no longer supports these emoji.
Both vendors pair the standard checked ballot box emoji U+2611 ☑ BALLOT BOX WITH CHECK
with its crossed variant U+2612 ☒ BALLOT BOX WITH X,
but only Samsung also has the empty ballot box U+2610 ☐ BALLOT BOX.
The regional indicator symbols are a set of 26 alphabetic Unicode characters (A-Z)
intended to be used to encode two-letter country codes in a way that allows optional special treatment.
These were defined by October 2010 as part of the Unicode 6.0 support for emoji,
as an alternative to encoding separate characters for each country flag.
Although they can be displayed as Roman letters,
it is intended that implementations may choose to display them in other ways, such as by using national flags.
The Unicode FAQ indicates that this mechanism should be used and that symbols for national flags will not be directly encoded.
They are encoded in the range (U+1F1E6) (🇦) REGIONAL INDICATOR SYMBOL LETTER A to (U+1F1FF) (🇿) REGIONAL INDICATOR SYMBOL LETTER Z
within the Enclosed Alphanumeric Supplement block in the Supplementary Multilingual Plane.
A pair of regional indicator symbols is referred to as an emoji flag sequence
(although it represents a specific region, not a specific flag for that region).
Out of the 676 possible pairs of regional indicator symbols (26x26), only 270 are considered valid Unicode region codes.
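A minimal sketch of mapping a two-letter region code onto this range (flag is a hypothetical helper; it does not check the 270 valid codes):

```python
def flag(region: str) -> str:
    # Shift 'A'..'Z' onto U+1F1E6..U+1F1FF.
    return "".join(chr(0x1F1E6 + ord(c) - ord("A")) for c in region.upper())

print(flag("IQ"))   # 🇮🇶
```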
There are emoji sequences that are made of multiple emoji;
here is a list of these emoji sequences.
There are also emoji ZWJ sequences, made of multiple emoji joined with the ZWJ character;
they are listed here.
Here is a list of all emoji.
Emoticons:
Emoticons is a Unicode block containing emoticons and other emoji.
Most of them are intended as representations of faces,
although some of them include hand gestures or non-human characters (a horned imp, monkeys, and cartoonish cats).
Each emoticon has two variants:
If no variation selector is appended, the default is the emoji style.
Example:
Code points | Result |
---|---|
U+1F610 (Neutral face) | 😐 |
U+1F610 (Neutral face), U+FE0E (Variation selector-15) | 😐︎ |
U+1F610 (Neutral face), U+FE0F (Variation selector-16) | 😐️ |
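These rows can be reproduced in Python by appending the selectors (the rendered result depends on the platform):

```python
VS15, VS16 = "\uFE0E", "\uFE0F"
face = "\U0001F610"   # neutral face

print(face)           # default presentation
print(face + VS15)    # text presentation requested
print(face + VS16)    # emoji presentation requested
```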
Emoji modifiers:
Five symbol modifier characters were added with Unicode 8.0 to provide a range of skin tones for human emoji.
These modifiers are called EMOJI MODIFIER FITZPATRICK TYPE-1-2, -3, -4, -5, and -6 (U+1F3FB - U+1F3FF): 🏻 🏼 🏽 🏾 🏿.
They are based on the Fitzpatrick scale for classifying human skin color.
Here is an example of emoji with different FITZ emoji modifiers:
| 1F645 | 1F646 | 1F647 | 1F64B | 1F64C | 1F64D | 1F64E | 1F64F |
---|---|---|---|---|---|---|---|---|
No modifier | 🙅 | 🙆 | 🙇 | 🙋 | 🙌 | 🙍 | 🙎 | 🙏 |
FITZ-1-2 | 🙅🏻 | 🙆🏻 | 🙇🏻 | 🙋🏻 | 🙌🏻 | 🙍🏻 | 🙎🏻 | 🙏🏻 |
FITZ-3 | 🙅🏼 | 🙆🏼 | 🙇🏼 | 🙋🏼 | 🙌🏼 | 🙍🏼 | 🙎🏼 | 🙏🏼 |
FITZ-4 | 🙅🏽 | 🙆🏽 | 🙇🏽 | 🙋🏽 | 🙌🏽 | 🙍🏽 | 🙎🏽 | 🙏🏽 |
FITZ-5 | 🙅🏾 | 🙆🏾 | 🙇🏾 | 🙋🏾 | 🙌🏾 | 🙍🏾 | 🙎🏾 | 🙏🏾 |
FITZ-6 | 🙅🏿 | 🙆🏿 | 🙇🏿 | 🙋🏿 | 🙌🏿 | 🙍🏿 | 🙎🏿 | 🙏🏿 |
Some examples of emoji sequences:
Code points | Character names | Characters | Result | Emoji name |
---|---|---|---|---|
[U+0039, U+FE0F, U+20E3] | [Nine, Variation selector-16, Combining enclosing keycap] | [9, U+FE0F, ⃣] | 9️⃣ | Keycap digit nine emoji |
[U+2764, U+FE0F, U+200D, U+1FA79] | [Red heart, Variation selector-16, ZWJ, Adhesive bandage] | [❤️, U+FE0F, U+200D, 🩹] | ❤️🩹 | Mending heart emoji |
[U+1F3F4, U+200D, U+2620, U+FE0F] | [Black flag, ZWJ, Skull and crossbones, Variation selector-16] | [🏴, U+200D, ☠️, U+FE0F] | 🏴☠️ | Pirate flag |
[U+1F1EE, U+1F1F6] | [Regional indicator symbol letter I, Regional indicator symbol letter Q] | [🇮, 🇶] | 🇮🇶 | Flag: Iraq |
[U+1FAF1, U+1F3FB, U+200D, U+1FAF2, U+1F3FF] | [Rightwards hand, Emoji modifier fitzpatrick type-1-2, ZWJ, Leftwards hand, Emoji modifier fitzpatrick type-6] | [🫱, U+1F3FB, U+200D, 🫲, U+1F3FF] | 🫱🏻🫲🏿 | Handshake: light skin tone, dark skin tone |
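Two of these rows, rebuilt from their raw code points (again, rendering depends on the platform):

```python
keycap_nine = "\u0039\uFE0F\u20E3"             # 9️⃣: digit + VS16 + keycap
pirate_flag = "\U0001F3F4\u200D\u2620\uFE0F"   # 🏴‍☠️: flag + ZWJ + skull + VS16
print(keycap_nine)
print(pirate_flag)
```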
The unhyphenated term "noncharacter" refers to 66 code points (labeled <not a character>)
permanently reserved for internal use, and therefore guaranteed to never be assigned to a character.
Each of the 17 planes has its last two code points set as noncharacters. So, the noncharacters are:
U+FFFE and U+FFFF on the BMP, U+1FFFE and U+1FFFF on Plane 1, and so on, up to U+10FFFE and U+10FFFF on Plane 16,
for a total of 34 code points. In addition, there is a contiguous range of another 32 noncharacter code points in the BMP:
U+FDD0 - U+FDEF. Software implementations are free to use these code points for internal use.
One particularly useful example of a noncharacter is the code point U+FFFE.
This code point has the reverse UTF-16 byte sequence of the byte order mark (U+FEFF).
If a stream of text contains this noncharacter at the start,
this is an indication the text has been interpreted with the incorrect endianness.
Noncharacters are not illegal in interchange nor do they cause ill-formed Unicode text.
A whitespace character is a character that represents empty space when text is rendered.
A printable character results in output when rendered, but a whitespace character does not.
Instead, whitespace characters define the layout of text to a limited degree,
interrupting the normal sequence of rendering characters next to each other.
The output of subsequent characters is typically shifted to the right (or to the left for right-to-left text)
or to the start of the next line.
The term whitespace is rooted in the common practice of printing text on white paper.
Normally, a whitespace character is not rendered as white. It affects rendering, but it is not itself rendered.
The table below lists the twenty-five characters defined as whitespace ("WSpace=Y", "WS") characters in the Unicode Character Database.
Seventeen use a definition of whitespace consistent with the algorithm for bidirectional writing ("Bidirectional Character Type=WS")
and are known as "Bidi-WS" characters.
Bidirectional text is explained in the next section of this summary.
Character name | Code point | May break | In IDN | Script | Block | General category |
---|---|---|---|---|---|---|
Character tabulation | U+0009 | Yes | No | Common | Basic Latin | Other, control |
Line feed | U+000A | Is a line-break | Is a line-break | Common | Basic Latin | Other, control |
Line tabulation | U+000B | Is a line-break | Is a line-break | Common | Basic Latin | Other, control |
Form feed | U+000C | Is a line-break | Is a line-break | Common | Basic Latin | Other, control |
Carriage return | U+000D | Is a line-break | Is a line-break | Common | Basic Latin | Other, control |
Space | U+0020 | Yes | No | Common | Basic Latin | Separator, space |
Next line | U+0085 | Is a line-break | Is a line-break | Common | Latin-1 Supplement | Other, control |
No-break space | U+00A0 | No | No | Common | Latin-1 Supplement | Separator, space |
Ogham space mark | U+1680 | Yes | No | Ogham | Ogham | Separator, space |
En quad | U+2000 | Yes | No | Common | General Punctuation | Separator, space |
Em quad | U+2001 | Yes | No | Common | General Punctuation | Separator, space |
En space | U+2002 | Yes | No | Common | General Punctuation | Separator, space |
Em space | U+2003 | Yes | No | Common | General Punctuation | Separator, space |
Three-per-em space | U+2004 | Yes | No | Common | General Punctuation | Separator, space |
Four-per-em space | U+2005 | Yes | No | Common | General Punctuation | Separator, space |
Six-per-em space | U+2006 | Yes | No | Common | General Punctuation | Separator, space |
Figure space | U+2007 | No | No | Common | General Punctuation | Separator, space |
Punctuation space | U+2008 | Yes | No | Common | General Punctuation | Separator, space |
Thin space | U+2009 | Yes | No | Common | General Punctuation | Separator, space |
Hair space | U+200A | Yes | No | Common | General Punctuation | Separator, space |
Line separator | U+2028 | Is a line-break | Is a line-break | Common | General Punctuation | Separator, paragraph |
Paragraph separator | U+2029 | Is a line-break | Is a line-break | Common | General Punctuation | Separator, paragraph |
Narrow no-break space | U+202F | No | No | Common | General Punctuation | Separator, space |
Medium mathematical space | U+205F | Yes | No | Common | General Punctuation | Separator, space |
Ideographic space | U+3000 | Yes | No | Common | CJK Symbols and Punctuation | Separator, space |
Related characters with property White_Space=no:
Character name | Code point | May break | In IDN | Script | Block | General category |
---|---|---|---|---|---|---|
Mongolian vowel separator | U+180E | Yes | No | Mongolian | Mongolian | Other, Format |
Zero width space | U+200B | Yes | No | ? | General Punctuation | Other, Format |
Zero-width non-joiner | U+200C | Yes | Context-dependent | ? | General Punctuation | Other, Format |
Zero-width joiner | U+200D | Yes | Context-dependent | ? | General Punctuation | Other, Format |
Word joiner | U+2060 | No | No | ? | General Punctuation | Other, Format |
Zero width non-breaking space | U+FEFF | No | No | ? | Arabic Presentation Forms-B | Other, Format |
Substitute images:
Unicode also provides some visible characters that can be used to represent various whitespace characters,
in contexts where a visible symbol should be displayed:
Character name | Code point | Displayed character | Block | Description |
---|---|---|---|---|
Middle dot | U+00B7 | · | Latin-1 Supplement | Interpunct. |
Downwards two headed arrow | U+21A1 | ↡ | Arrows | ECMA-17 / ISO 2047 symbol for form feed (page break). |
Identical to | U+2261 | ≡ | Mathematical Operators | Amongst other uses, is the ECMA-17 / ISO 2047 symbol for line feed. |
Shouldered open box | U+237D | ⍽ | Miscellaneous Technical | Used to indicate a NBSP. |
Return symbol | U+23CE | ⏎ | Miscellaneous Technical | The symbol for a return key, which enters a line break. |
Symbol for horizontal tabulation | U+2409 | ␉ | Control Pictures | Substitutes for a tab character. |
Symbol for line feed | U+240A | ␊ | Control Pictures | Substitutes for a line feed. |
Symbol for vertical tabulation | U+240B | ␋ | Control Pictures | Substitutes for a vertical tab (line tab). |
Symbol for form feed | U+240C | ␌ | Control Pictures | Substitutes for a form feed (page break). |
Symbol for carriage return | U+240D | ␍ | Control Pictures | Substitutes for a carriage return. |
Symbol for space | U+2420 | ␠ | Control Pictures | Substitutes for an ASCII space. |
Blank symbol | U+2422 | ␢ | Control Pictures | also known as "substitute blank", used in BCDIC, EBCDIC, ASCII-1963 etc. as a symbol for the word separator. |
Open box | U+2423 | ␣ | Control Pictures | Used in block letter handwriting at least since the 1980s when it is necessary to explicitly indicate the number of space characters (e.g. when programming with pen and paper). Used in a textbook published by Springer-Verlag on Modula-2, a programming language where space codes require explicit indication. Also used in the keypad of the Texas Instruments' TI-8x series of graphing calculators. |
Symbol for newline | U+2424 | ␤ | Control Pictures | Substitutes for a line break. |
White up-pointing triangle | U+25B3 | △ | Geometric Shapes | Amongst other uses, it's the ECMA-17 / ISO 2047 symbol for the ASCII space. |
Logical Or with middle stem | U+2A5B | ⩛ | Supplemental Mathematical Operators | Amongst other uses, it's the ECMA-17 / ISO 2047 symbol for vertical tab (line tab). |
Smaller than | U+2AAA | ⪪ | Supplemental Mathematical Operators | Amongst other uses, it's the ECMA-17 / ISO 2047 symbol for carriage return. |
Larger than | U+2AAB | ⪫ | Supplemental Mathematical Operators | Amongst other uses, it's the ECMA-17 / ISO 2047 symbol for the tab character. |
Ideographic Telegraph Line Feed Separator Symbol | U+3037 | 〷 | CJK Symbols and Punctuation | Graphic used for code 9999 in Chinese telegraph code, representing a line feed. |
The Unicode Standard assigns various properties to each Unicode code point.
These properties can be used to handle code points in processes, like in line-breaking, script direction or applying controls.
Some character properties are also defined for code points that have no character assigned,
and code points that are labeled "<not a character>".
Character properties are a broad subject that will not be covered in this summary.
If you would like to read about character properties in Unicode,
the Wikipedia article "Unicode character property" covers them in detail,
including information about whitespace characters, blocks, and other elements.
A bidirectional text contains two text directionalities, right-to-left (RTL) and left-to-right (LTR).
In some right-to-left scripts such as the Persian script and Arabic,
mathematical expressions, numeric dates, numbers, and left-to-right text are embedded from left to right.
A text is also bidirectional if a right-to-left script is embedded in a left-to-right text.
The Unicode standard calls for characters to be ordered logically, i.e. in the sequence they are intended to be interpreted.
In order to offer bidi support,
Unicode prescribes an algorithm for how to convert the logical sequence of characters into the correct visual presentation.
For this purpose, the Unicode encoding standard divides all of its characters into one of four types:
"strong", "weak", "neutral", and "explicit formatting".
Strong characters are those with a definite direction.
Examples of this type of character include most alphabetic characters, syllabic characters, Han ideographs,
non-European or non-Arabic digits, and punctuation characters that are specific to only those scripts.
Weak characters are those with vague direction.
Examples of this type of character include European digits, Eastern Arabic-Indic digits, arithmetic symbols, and currency symbols.
Neutral characters have direction indeterminable without context. Examples include paragraph separators,
tabs, and most other whitespace characters.
Punctuation symbols that are common to many scripts,
such as the colon, comma, full-stop, and the no-break-space also fall within this category.
Explicit formatting characters, also referred to as "directional formatting characters",
are special Unicode sequences that direct the algorithm to modify its default behavior.
These characters are subdivided into "marks", "embeddings", "isolates", and "overrides".
Their effects continue until the occurrence of either a paragraph separator, or a "pop" character.
Subdivisions of the "Explicit formatting" character type:
If a "weak" character is followed by another "weak" character, the algorithm will look at the first neighbouring "strong" character.
Sometimes this leads to unintentional display errors. These errors are corrected or prevented with "pseudo-strong" characters.
Such Unicode control characters are called marks. The mark (U+200E) LEFT-TO-RIGHT MARK (LRM) or (U+200F) RIGHT-TO-LEFT MARK (RLM)
is to be inserted into a location to make an enclosed weak character inherit its writing direction.
For example, to correctly display the TRADE MARK SIGN character (™) (U+2122) for an English name brand (LTR)
in an Arabic (RTL) passage, an LRM mark is inserted after the trademark symbol if the symbol is not followed by LTR text.
If the LRM mark is not added, the weak character ™ will be neighbored by a strong LTR character and a strong RTL character.
Hence, in an RTL context, it will be considered to be RTL, and displayed in an incorrect order.
The "embedding" directional formatting characters are the classical Unicode method of explicit formatting,
and as of Unicode 6.3, are being discouraged in favor of "isolates".
An "embedding" signals that a piece of text is to be treated as directionally distinct.
The text within the scope of the embedding formatting characters is not independent of the surrounding text.
Also, characters within an embedding can affect the ordering of characters outside.
Unicode 6.3 recognized that directional embeddings usually have too strong an effect on their surroundings
and are thus unnecessarily difficult to use.
The "isolate" directional formatting characters signal that a piece of text is to be treated
as directionally isolated from its surroundings.
As of Unicode 6.3, these are the formatting characters that are being encouraged in new documents.
These formatting characters were introduced after it became apparent
that "embeddings" usually have too strong an effect on their surroundings
and are thus unnecessarily difficult to use.
Unlike the legacy 'embedding' directional formatting characters,
'isolate' characters have no effect on the ordering of the text outside their scope.
Isolates can be nested, and may be placed within embeddings and overrides.
The "override" directional formatting characters allow for special cases, such as for part numbers
(e.g. to force a part number made of mixed English digits and Hebrew letters to be written from right to left),
and are recommended to be avoided wherever possible.
Just like the other directional formatting characters, "overrides" can be nested one inside another, and in embeddings and isolates.
Pops:
The "pop" directional formatting characters terminate the scope of the most recent "embedding", "override", or "isolate".
In the algorithm, each sequence of concatenated strong characters is called a "run".
A "weak" character that is located between two "strong" characters with the same orientation will inherit their orientation.
A "weak" character that is located between two "strong" characters with a different writing direction
will inherit the main context's writing direction (in an LTR document the character will become LTR, in an RTL document, it will become RTL).
A table of Bidi character types:
Type | Type description | Strength | Directionality | General scope | Bidi control character |
---|---|---|---|---|---|
L | Left-to-Right | Strong | L-to-R | Most alphabetic and syllabic characters, Chinese characters, non-European or non-Arabic digits, LRM character, ... | U+200E LEFT-TO-RIGHT MARK (LRM) |
R | Right-to-Left | Strong | R-to-L | Adlam, Garay, Hebrew, Mandaic, Mende Kikakui, N'Ko, Samaritan, ancient scripts like Kharoshthi and Nabataean, RLM character, ... | U+200F RIGHT-TO-LEFT MARK (RLM) |
AL | Arabic Letter | Strong | R-to-L | Arabic, Hanifi Rohingya, Sogdian, Syriac, and Thaana alphabets, and most punctuation specific to those scripts, ALM character, ... | U+061C ARABIC LETTER MARK (ALM) |
EN | European Number | Weak | | European digits, Eastern Arabic-Indic digits, Coptic epact numbers, ... | |
ES | European Separator | Weak | | Plus sign, minus sign, ... | |
ET | European Number Terminator | Weak | | Degree sign, currency symbols, ... | |
AN | Arabic Number | Weak | | Arabic-Indic digits, Arabic decimal and thousands separators, Rumi digits, Hanifi Rohingya digits, ... | |
CS | Common Number Separator | Weak | | Colon, comma, full stop, no-break space, ... | |
NSM | Nonspacing Mark | Weak | | Characters in the general categories: Mark (M), Nonspacing mark (Mn), Enclosing mark (Me) | |
BN | Boundary Neutral | Weak | | Default ignorables, noncharacters, control characters other than those explicitly given other types | |
B | Paragraph Separator | Neutral | | Paragraph separator, appropriate Newline Functions, higher-level protocol paragraph determination | |
S | Segment Separator | Neutral | | Tabs | |
WS | Whitespace | Neutral | | Space, figure space, line separator, form feed, General Punctuation block spaces (smaller set than the whitespace list) | |
ON | Other Neutrals | Neutral | | All other characters, including object replacement character | |
LRE | Left-to-Right Embedding | Explicit | L-to-R | LRE character only | U+202A LEFT-TO-RIGHT EMBEDDING (LRE) |
LRO | Left-to-Right Override | Explicit | L-to-R | LRO character only | U+202D LEFT-TO-RIGHT OVERRIDE (LRO) |
RLE | Right-to-Left Embedding | Explicit | R-to-L | RLE character only | U+202B RIGHT-TO-LEFT EMBEDDING (RLE) |
RLO | Right-to-Left Override | Explicit | R-to-L | RLO character only | U+202E RIGHT-TO-LEFT OVERRIDE (RLO) |
PDF | Pop Directional Format | Explicit | | PDF character only | U+202C POP DIRECTIONAL FORMATTING (PDF) |
LRI | Left-to-Right Isolate | Explicit | L-to-R | LRI character only | U+2066 LEFT-TO-RIGHT ISOLATE (LRI) |
RLI | Right-to-Left Isolate | Explicit | R-to-L | RLI character only | U+2067 RIGHT-TO-LEFT ISOLATE (RLI) |
FSI | First Strong Isolate | Explicit | | FSI character only | U+2068 FIRST STRONG ISOLATE (FSI) |
PDI | Pop Directional Isolate | Explicit | | PDI character only | U+2069 POP DIRECTIONAL ISOLATE (PDI) |
The Unicode collation algorithm (UCA) is an algorithm defined in Unicode Technical Report #10,
which is a customizable method to produce binary strings from strings representing
text in any writing system and language that can be represented with Unicode.
These binary strings can then be efficiently compared byte by byte in order to collate or sort them according to the rules of the language,
with options for ignoring case, accents, etc.
Unicode Technical Report #10 also specifies the Default Unicode Collation Element Table (DUCET).
This data file specifies a default collation ordering. The DUCET is customizable for different languages,
and some such customizations can be found in the Unicode Common Locale Data Repository (CLDR).
An open source implementation of UCA is included with the International Components for Unicode, ICU.
ICU supports tailoring, and the collation tailorings from CLDR are included in ICU.
Unicode equivalence is the specification by the Unicode standard that some sequences of code points represent essentially the same character.
This feature was introduced in the standard to allow compatibility with pre-existing standard character sets,
which often included similar or identical characters.
Unicode provides two such notions, canonical equivalence and compatibility.
Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed.
For example, the code point U+006E (n) LATIN SMALL LETTER N followed by U+0303 (̃ ) COMBINING TILDE
is defined by Unicode to be canonically equivalent to the single code point U+00F1 (ñ) LATIN SMALL LETTER N WITH TILDE of the Spanish alphabet.
Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications
such as alphabetizing names or searching, and may be substituted for each other.
Similarly, each Hangul syllable block that is encoded as a single character may be equivalently encoded as a
combination of a leading conjoining jamo, a vowel conjoining jamo, and, if appropriate, a trailing conjoining jamo.
Sequences that are defined as compatible are assumed to have possibly distinct appearances, but the same meaning in some contexts.
Thus, for example, the code point U+FB00 (the typographic ligature "ff")
is defined to be compatible but not canonically equivalent to the sequence U+0066 U+0066 (two Latin "f" letters).
Compatible sequences may be treated the same way in some applications (such as sorting and indexing), but not in others;
and may be substituted for each other in some situations, but not in others.
Sequences that are canonically equivalent are also compatible, but the opposite is not necessarily true.
The standard also defines a text normalization procedure, called Unicode normalization, that replaces equivalent sequences of characters
so that any two texts that are equivalent will be reduced to the same sequence of code points,
called the normalization form or normal form of the original text.
For each of the two equivalence notions, Unicode defines two normal forms,
one fully composed (where multiple code points are replaced by single points whenever possible),
and one fully decomposed (where single points are split into multiple ones).
Character duplication:
For compatibility or other reasons, Unicode sometimes assigns two different code points to entities
that are essentially the same character.
For example, the letter "A with a ring diacritic above" is encoded as U+00C5 (Å) LATIN CAPITAL LETTER A WITH RING ABOVE
(a letter of the alphabet in Swedish and several other languages) or as U+212B Å ANGSTROM SIGN.
Yet the symbol for angstrom is defined to be that Swedish letter, and most other symbols that are letters
(such as "V" for volt) do not have a separate code point for each usage.
In general, the code points of truly identical characters are defined to be canonically equivalent.
Combining and precomposed characters:
For consistency with some older standards, Unicode provides single code points for many characters that could be viewed
as modified forms of other characters (such as U+00F1 for "ñ" or U+00C5 for "Å")
or as combinations of two or more characters (such as U+FB00 for the ligature "ff" or U+0132 for the Dutch letter "IJ").
For consistency with other standards, and for greater flexibility, Unicode also provides codes for many elements
that are not used on their own, but are meant instead to modify or combine with a preceding base character.
Examples of these combining characters are the combining tilde and the Japanese diacritic dakuten ("◌゛", U+3099).
In the context of Unicode, character composition is the process of replacing the code points of a base letter
followed by one or more combining characters into a single precomposed character; and character decomposition is the opposite process.
In general, precomposed characters are defined to be canonically equivalent to the sequence of
their base letter and subsequent combining diacritic marks, in whatever order these may occur.
An example: "Amélie" with its two canonically equivalent Unicode forms (NFC and NFD):
NFC characters | A | m | é | l | i | e |
---|---|---|---|---|---|---|
NFC code points | U+0041 | U+006D | U+00E9 | U+006C | U+0069 | U+0065 |
NFD code points | U+0041 | U+006D | [U+0065, U+0301] | U+006C | U+0069 | U+0065 |
NFD characters | A | m | [e, ◌́ ] | l | i | e |
Typographical non-interaction:
Some scripts regularly use multiple combining marks that do not, in general, interact typographically,
and do not have precomposed characters for the combinations.
Pairs of such non-interacting marks can be stored in either order. These alternative sequences are, in general, canonically equivalent.
The rules that define their sequencing in the canonical form also define whether they are considered to interact.
Typographic conventions:
Unicode provides code points for some characters or groups of characters which are modified only for aesthetic reasons
(such as ligatures, the half-width katakana characters, or the full-width Latin letters for use in Japanese texts),
or to add new semantics without losing the original one (such as digits in subscript or superscript positions,
or the circled digits (such as "①") inherited from some Japanese fonts).
Such a sequence is considered compatible with the sequence of original (individual and unmodified) characters,
for the benefit of applications where the appearance and added semantics are not relevant.
However, the two sequences are not declared canonically equivalent,
since the distinction has some semantic value and affects the rendering of the text.
Text processing software implementing Unicode string search and comparison functionality must take into account
the presence of equivalent code points. In the absence of this feature, users searching for a particular code point sequence
would be unable to find other visually indistinguishable glyphs that have a different, but canonically equivalent, code point representation.
Algorithms:
Unicode provides standard normalization algorithms that produce a unique (normal) code point sequence
for all sequences that are equivalent; the equivalence criteria can be either canonical (NF) or compatibility (NFK).
Since one can arbitrarily choose the representative element of an equivalence class,
multiple canonical forms are possible for each equivalence criterion.
Unicode provides two normal forms that are semantically meaningful for each of the two equivalence criteria:
the composed forms NFC and NFKC, and the decomposed forms NFD and NFKD.
Both the composed and decomposed forms impose a canonical ordering on the code point sequence,
which is necessary for the normal forms to be unique.
In order to compare or search Unicode strings, software can use either composed or decomposed forms;
this choice does not matter as long as it is the same for all strings involved in a search, comparison, etc.
On the other hand, the choice of equivalence criteria can affect search results.
For instance, some typographic ligatures like U+FB03 (ffi), Roman numerals like U+2168 (Ⅸ) and even subscripts and superscripts,
e.g. U+2075 (⁵) have their own Unicode code points. Canonical normalization (NF) does not affect any of these,
but compatibility normalization (NFK) will decompose the ffi ligature into the constituent letters, so a search for U+0066 (f)
as substring would succeed in an NFKC normalization of U+FB03 but not in NFC normalization of U+FB03.
Likewise when searching for the Latin letter I (U+0049) in the precomposed Roman numeral Ⅸ (U+2168).
Similarly, the superscript ⁵ (U+2075) is transformed to 5 (U+0035) by compatibility mapping.
Transforming superscripts into baseline equivalents may not be appropriate,
however, for rich text software, because the superscript information is lost in the process. To allow for this distinction,
the Unicode character database contains compatibility formatting tags that provide additional details on the compatibility transformation.
In the case of typographic ligatures, this tag is simply <compat>, while for the superscript it is <super>.
Rich text standards like HTML take into account the compatibility tags.
For instance, HTML uses its own markup to position a "5" (U+0035) in a superscript position.
Normal forms:
The four Unicode normalization forms and the algorithms (transformations) for obtaining them are listed in the table:
NFD (Normalization Form Canonical Decomposition) | Characters are decomposed by canonical equivalence, and multiple combining characters are arranged in a specific order. |
---|---|
NFC (Normalization Form Canonical Composition) | Characters are decomposed and then recomposed by canonical equivalence. |
NFKD (Normalization Form Compatibility Decomposition) | Characters are decomposed by compatibility, and multiple combining characters are arranged in a specific order. |
NFKC (Normalization Form Compatibility Composition) | Characters are decomposed by compatibility, then recomposed by canonical equivalence. |
All these algorithms are idempotent transformations, meaning that a string that is already in one of these normalized forms
will not be modified if processed again by the same algorithm.
The normal forms are not closed under string concatenation: for defective Unicode strings starting with a Hangul vowel
or trailing conjoining jamo, concatenation can break composition.
The normalization transformations are also not injective (they map different original glyphs and sequences to the same normalized sequence)
and thus not bijective (the original cannot be restored).
For example, the distinct Unicode strings "U+212B" (the angstrom sign "Å") and "U+00C5" (the Swedish letter "Å")
are both expanded by NFD (or NFKD) into the sequence [U+0041 U+030A] (Latin letter "A" and combining ring above "°")
which is then reduced by NFC (or NFKC) to "U+00C5" (the Swedish letter "Å").
A single character (other than a Hangul syllable block) that will get replaced by another under normalization
can be identified in the Unicode tables for having a non-empty compatibility field but lacking a compatibility tag.
Canonical ordering:
The canonical ordering is mainly concerned with the ordering of a sequence of combining characters.
For the examples in this section we assume these characters to be diacritics,
even though in general some diacritics are not combining characters, and some combining characters are not diacritics.
Unicode assigns each character a combining class, which is identified by a numerical value.
Non-combining characters have class number 0, while combining characters have a positive combining class value.
To obtain the canonical ordering, every substring of characters having non-zero combining class value
must be sorted by the combining class value using a stable sorting algorithm.
Stable sorting is required because combining characters with the same class value are assumed to interact typographically,
thus the two possible orders are not considered equivalent.
For example, the character U+1EBF (ế), used in Vietnamese, has both an acute and a circumflex accent.
Its canonical decomposition is the three-character sequence U+0065 (e) U+0302 (circumflex accent) U+0301 (acute accent).
The combining classes for the two accents are both 230, thus U+1EBF is not equivalent to U+0065 U+0301 U+0302.
Since not all combining sequences have a precomposed equivalent
(the last one in the previous example can only be reduced to U+00E9 U+0302),
even the normal form NFC is affected by combining characters' behavior.
We will use the C programming language for example code, as it's an easy language to understand.
On Windows, you can use wide characters and the wide print functions to print Unicode text. You will also need the "_setmode" function to switch the output mode to Unicode; otherwise only ASCII characters will be printed:
#include <stdio.h> // The "_fileno" function is here
#include <io.h> // The "_setmode" function is here
#include <fcntl.h> // The "_O_U16TEXT" macro is here
int main()
{
_setmode(_fileno(stdout), _O_U16TEXT);
wprintf(L"Hello مرحباً\n"); // A string prefixed with the letter "L" is a wide string
return 0;
}
(The Arabic text might be printed in a left-to-right direction and without forming ligatures.)
Revisiting the UTF-8 byte table: if you visualize a code point's hex digits as (U+uvwxyz), the UTF-8 bytes are arranged as follows:
Code point range | First byte | Second byte | Third byte | Fourth byte |
---|---|---|---|---|
U+0000 to U+007F | 0yyyzzzz₂ | |||
U+0080 to U+07FF | 110xxxyy₂ | 10yyzzzz₂ | ||
U+0800 to U+FFFF | 1110wwww₂ | 10xxxxyy₂ | 10yyzzzz₂ | |
U+010000 to U+10FFFF | 11110uvv₂ | 10vvwwww₂ | 10xxxxyy₂ | 10yyzzzz₂ |
In the code below, the if statements check how many code units are required to fit the code point:
#include <stdio.h>
#include <stdint.h>
uint8_t utf32ToUtf8(uint32_t codePoint, uint8_t output[4])
{
if (codePoint <= 0x007F) // Fits into one UTF8 code unit (ASCII)
{
output[0] = (uint8_t) codePoint;
return 1;
}
else if (codePoint <= 0x07FF) // Fits into two UTF8 code units
{
output[0] = (uint8_t) (((codePoint >> 6) & 0b00011111) | 0b11000000);
output[1] = (uint8_t) ((codePoint & 0b00111111) | 0b10000000);
return 2;
}
else if (codePoint <= 0xFFFF) // Fits into three UTF8 code units
{
output[0] = (uint8_t) (((codePoint >> 12) & 0b00001111) | 0b11100000);
output[1] = (uint8_t) (((codePoint >> 6) & 0b00111111) | 0b10000000);
output[2] = (uint8_t) ((codePoint & 0b00111111) | 0b10000000);
return 3;
}
else if (codePoint <= 0x10FFFF) // Fits into four UTF8 code units
{
output[0] = (uint8_t) (((codePoint >> 18) & 0b00000111) | 0b11110000);
output[1] = (uint8_t) (((codePoint >> 12) & 0b00111111) | 0b10000000);
output[2] = (uint8_t) (((codePoint >> 6) & 0b00111111) | 0b10000000);
output[3] = (uint8_t) ((codePoint & 0b00111111) | 0b10000000);
return 4;
}
else return UINT8_MAX; // We will consider the return value "UINT8_MAX" (255) to mean "invalid"
}
int main()
{
uint8_t output[4] = {0};
uint8_t codeUnitCount = utf32ToUtf8(0x1D70F, output); // The character U+1D70F is "Mathematical Italic Small Tau" "𝜏"
printf("Amount of code units: %d\n", codeUnitCount);
printf("Code units:\n");
for (uint8_t i = 0; i < codeUnitCount; i++)
printf("%X\n", output[i]);
return 0;
}
The output is [F0, 9D, 9C, 8F]; in binary, [11110000, 10011101, 10011100, 10001111].
If you're confused by the bitwise operations, here's an explanation: shifting right (">> 6", ">> 12", ">> 18") moves a group of the code point's bits down to the lowest positions, the bitwise AND with a mask such as 0b00111111 keeps only the bits selected by the mask, and the bitwise OR with a prefix such as 0b10000000 stamps the UTF-8 marker bits onto the byte.
Please note that the code blocks contain basic examples; they aren't perfect and they don't check for common errors.
In the code below, we do a bitwise AND followed by a comparison to check whether the first UTF-8 code unit starts with 0₂, 110₂, 1110₂, or 11110₂, which determines the number of code units.
#include <stdio.h>
#include <stdint.h>
uint32_t utf8ToUtf32(const uint8_t utf8Char[4])
{
if ((utf8Char[0] & 0b10000000) == 0b0) // One code unit
{
return (uint32_t) utf8Char[0];
}
else if ((utf8Char[0] & 0b11100000) == 0b11000000) // Two code units
{
if ((utf8Char[1] & 0b11000000) != 0b10000000) // Check if the next unit is valid
return UINT32_MAX; // We will consider the return value "UINT32_MAX" (0xFFFFFFFF) to mean "invalid"
return ((uint32_t)(utf8Char[0] & 0b00011111) << 6) +
(uint32_t)(utf8Char[1] & 0b00111111);
}
else if ((utf8Char[0] & 0b11110000) == 0b11100000) // Three code units
{
if ((utf8Char[1] & 0b11000000) != 0b10000000
|| (utf8Char[2] & 0b11000000) != 0b10000000) // Check if the next units are valid
return UINT32_MAX;
return ((uint32_t)(utf8Char[0] & 0b00001111) << 12) +
((uint32_t)(utf8Char[1] & 0b00111111) << 6) +
(uint32_t)(utf8Char[2] & 0b00111111);
}
else if ((utf8Char[0] & 0b11111000) == 0b11110000) // Four code units
{
if ((utf8Char[1] & 0b11000000) != 0b10000000 // Check if the next units are valid
|| (utf8Char[2] & 0b11000000) != 0b10000000
|| (utf8Char[3] & 0b11000000) != 0b10000000)
return UINT32_MAX;
return ((uint32_t)(utf8Char[0] & 0b00000111) << 18) +
((uint32_t)(utf8Char[1] & 0b00111111) << 12) +
((uint32_t)(utf8Char[2] & 0b00111111) << 6) +
(uint32_t)(utf8Char[3] & 0b00111111);
}
else return UINT32_MAX; // The first byte has an invalid prefix
}
int main()
{
uint8_t codeUnits[4] = {0xD0, 0x87}; // The character (U+0407) is "Cyrillic Capital Letter Yi" "ї"
printf("%X\n", utf8ToUtf32(codeUnits));
return 0;
}
The output is 0x407.
If the code point's value is in the BMP (less than 65,536 (0x10000)),
it simply gets converted to an unsigned 16-bit integer with the same value as that code point.
However, if it isn't in the BMP (greater than or equal to 65,536 (0x10000)):
0x10000 is subtracted from the code point, leaving a 20-bit value. The high ten bits are added to 0xD800 to form the first code unit (the "high surrogate"), and the low ten bits are added to 0xDC00 to form the second code unit (the "low surrogate").
#include <stdio.h>
#include <stdint.h>
uint8_t utf32ToUtf16(uint32_t codePoint, uint16_t output[2])
{
if (codePoint < 0x10000)
{
output[0] = (uint16_t) codePoint;
return 1;
}
else if (codePoint <= 0x10FFFF)
{
codePoint -= 0x10000;
output[0] = 0xD800 + (codePoint >> 10); // High surrogate
output[1] = 0xDC00 + (codePoint & 0b1111111111); // Low surrogate
return 2;
}
else return UINT8_MAX; // We will consider the return value "UINT8_MAX" (255) to mean "invalid"
}
int main()
{
uint16_t output[2] = {0};
uint8_t codeUnitCount = utf32ToUtf16(0x10001, output); // The character U+10001 is "Linear B Syllable B038 E" "𐀁"
printf("Amount of code units: %d\n", codeUnitCount);
printf("Code units:\n");
for (uint8_t i = 0; i < codeUnitCount; i++)
printf("%X\n", output[i]);
return 0;
}
The output is [D800, DC01]. In binary [1101100000000000, 1101110000000001].
In the code below, if there is one unit (a BMP code point), we convert it to an unsigned 32-bit integer.
If there is a pair of surrogate units, we mask off the surrogate marker bits of both units, cast them to unsigned 32-bit integers,
shift the high surrogate's bits left by ten, add the two results together,
and finally add back the 0x10000 that was subtracted when the code point was encoded in UTF-16.
#include <stdio.h>
#include <stdint.h>
uint32_t utf16ToUtf32(const uint16_t utf16Char[2])
{
// High surrogate
if (utf16Char[0] >= 0xD800 && utf16Char[0] <= 0xDBFF)
{
if (utf16Char[1] < 0xDC00 || utf16Char[1] > 0xDFFF) // Make sure the next unit is a low surrogate
return UINT32_MAX; // We will consider the return value "UINT32_MAX" (0xFFFFFFFF) to mean "invalid"
return ( ((uint32_t)(utf16Char[0] & 0b1111111111)) << 10 ) + (uint32_t)(utf16Char[1] & 0b1111111111) + 0x10000U;
}
// The unit isn't a surrogate here, so it's a BMP code point (a uint16_t is always < 0x10000)
else if (utf16Char[0] < 0xDC00 || utf16Char[0] > 0xDFFF)
{
return (uint32_t) utf16Char[0];
}
else return UINT32_MAX;
}
int main()
{
uint16_t codeUnits[2] = {0xD802, 0xDD07}; // The character (U+10907) is "Phoenician Letter Het" "𐤇"
printf("%X\n", utf16ToUtf32(codeUnits));
return 0;
}
The output is 0x10907.
Code points in the BMP are encoded in one, two, or three code units in UTF-8, and in one code unit in UTF-16.
In the code below, if the code point is in the BMP, we simply combine the UTF-8 code units into a single UTF-16 unit.
If the code point is outside the BMP, we assemble a UTF-32 value from the UTF-8 units and subtract 0x10000 from it.
After that we convert the result to a UTF-16 surrogate pair.
#include <stdio.h>
#include <stdint.h>
uint8_t utf8ToUtf16(const uint8_t utf8Char[4], uint16_t output[2])
{
if ((utf8Char[0] & 0b10000000) == 0b0) // One code unit
{
output[0] = (uint16_t) utf8Char[0];
return 1;
}
else if ((utf8Char[0] & 0b11100000) == 0b11000000) // Two code units
{
if ((utf8Char[1] & 0b11000000) != 0b10000000) // Check if the next unit is valid
return UINT8_MAX; // We will consider the return value "UINT8_MAX" (255) to mean "invalid"
output[0] = ((uint16_t)(utf8Char[0] & 0b00011111) << 6) +
(uint16_t)(utf8Char[1] & 0b00111111);
return 1;
}
else if ((utf8Char[0] & 0b11110000) == 0b11100000) // Three code units
{
if ((utf8Char[1] & 0b11000000) != 0b10000000
|| (utf8Char[2] & 0b11000000) != 0b10000000) // Check if the next units are valid
return UINT8_MAX;
output[0] = ((uint16_t)(utf8Char[0] & 0b00001111) << 12) +
((uint16_t)(utf8Char[1] & 0b00111111) << 6) +
(uint16_t)(utf8Char[2] & 0b00111111);
return 1;
}
else if ((utf8Char[0] & 0b11111000) == 0b11110000) // Four code units
{
if ((utf8Char[1] & 0b11000000) != 0b10000000 // Check if the next units are valid
|| (utf8Char[2] & 0b11000000) != 0b10000000
|| (utf8Char[3] & 0b11000000) != 0b10000000)
return UINT8_MAX;
uint32_t result = ((uint32_t)(utf8Char[0] & 0b00000111) << 18) +
((uint32_t)(utf8Char[1] & 0b00111111) << 12) +
((uint32_t)(utf8Char[2] & 0b00111111) << 6) +
(uint32_t)(utf8Char[3] & 0b00111111)
- 0x10000;
output[0] = 0xD800 + (result >> 10); // High surrogate
output[1] = 0xDC00 + (result & 0b1111111111); // Low surrogate
return 2;
}
else return UINT8_MAX; // The first byte has an invalid prefix
}
int main()
{
uint8_t codeUnits[4] = {0xF0, 0x90, 0xA4, 0x87}; // The character (U+10907) is "Phoenician Letter Het" "𐤇"
uint16_t output[2] = {0};
uint8_t codeUnitCount = utf8ToUtf16(codeUnits, output);
printf("Amount of code units: %d\n", codeUnitCount);
printf("Code units:\n");
for (uint8_t i = 0; i < codeUnitCount; i++)
printf("%X\n", output[i]);
return 0;
}
The output is [D802, DD07].
Converting UTF-16 to UTF-8 is almost the same as converting UTF-32 to UTF-8:
#include <stdio.h>
#include <stdint.h>
uint8_t utf16ToUtf8(uint16_t utf16Char[2], uint8_t output[4])
{
// High surrogate
if (utf16Char[0] >= 0xD800 && utf16Char[0] <= 0xDBFF)
{
if (utf16Char[1] < 0xDC00 || utf16Char[1] > 0xDFFF) // Make sure the next unit is a low surrogate
return UINT8_MAX; // We will consider the return value "UINT8_MAX" (255) to mean "invalid"
uint32_t result = ( ((uint32_t)(utf16Char[0] & 0b1111111111)) << 10 ) + (uint32_t)(utf16Char[1] & 0b1111111111) + 0x10000U;
output[0] = (uint8_t) (((result >> 18) & 0b00000111) | 0b11110000);
output[1] = (uint8_t) (((result >> 12) & 0b00111111) | 0b10000000);
output[2] = (uint8_t) (((result >> 6) & 0b00111111) | 0b10000000);
output[3] = (uint8_t) ((result & 0b00111111) | 0b10000000);
return 4;
}
// The unit isn't a surrogate here, so it's a BMP code point (a uint16_t is always < 0x10000)
else if (utf16Char[0] < 0xDC00 || utf16Char[0] > 0xDFFF)
{
if (utf16Char[0] <= 0x007F) // Fits into one UTF8 code unit (ASCII)
{
output[0] = (uint8_t) utf16Char[0];
return 1;
}
else if (utf16Char[0] <= 0x07FF) // Fits into two UTF8 code units
{
output[0] = (uint8_t) (((utf16Char[0] >> 6) & 0b00011111) | 0b11000000);
output[1] = (uint8_t) ((utf16Char[0] & 0b00111111) | 0b10000000);
return 2;
}
else if (utf16Char[0] <= 0xFFFF) // Fits into three UTF8 code units
{
output[0] = (uint8_t) (((utf16Char[0] >> 12) & 0b00001111) | 0b11100000);
output[1] = (uint8_t) (((utf16Char[0] >> 6) & 0b00111111) | 0b10000000);
output[2] = (uint8_t) ((utf16Char[0] & 0b00111111) | 0b10000000);
return 3;
}
}
else return UINT8_MAX;
}
int main()
{
uint16_t codeUnits[2] = {0xD801, 0xDE43}; // The character (U+10643) is "Linear A Sign Ab082" "𐙃"
uint8_t output[4] = {0};
uint8_t codeUnitCount = utf16ToUtf8(codeUnits, output);
printf("Amount of code units: %d\n", codeUnitCount);
printf("Code units:\n");
for (uint8_t i = 0; i < codeUnitCount; i++)
printf("%X\n", output[i]);
return 0;
}
The output is [F0, 90, 99, 83].