UTF-8 is a brilliant design

UTF-8 is a well-thought-out encoding designed to represent over a million Unicode code points, covering characters from many languages and scripts, while maintaining backward compatibility with ASCII.

Key Features

- UTF-8 uses up to 32 bits (4 bytes) per character, whereas ASCII uses 7 bits.
- Every ASCII file is a valid UTF-8 file.
- Every UTF-8 file containing only ASCII characters is also a valid ASCII file.

This design allows UTF-8 to scale to the full Unicode range without breaking compatibility with legacy ASCII systems.

How Does UTF-8 Work?

UTF-8 is a variable-width character encoding that represents each Unicode character using 1 to 4 bytes:

- Characters from U+0000 to U+007F (the first 128 characters) use 1 byte and are identical to their ASCII bytes.
- Characters beyond this range use 2, 3, or 4 bytes.

Byte Patterns and Meaning

| 1st Byte Pattern | # of Bytes | Full Byte Sequence Pattern          |
|------------------|------------|-------------------------------------|
| 0xxxxxxx         | 1          | 0xxxxxxx (ASCII character)          |
| 110xxxxx         | 2          | 110xxxxx 10xxxxxx                   |
| 1110xxxx         | 3          | 1110xxxx 10xxxxxx 10xxxxxx          |
| 11110xxx         | 4          | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |

- Continuation bytes always start with 10.
- The leading bits of the first byte define the number of bytes in the sequence.
- The remaining bits (the x positions) from all bytes, combined, define the Unicode code point (a unique character identifier, such as U+0041 for A).

Decoding Process

1. Read a byte. If it starts with 0, it is an ASCII character.
2. Otherwise, determine the total number of bytes in the sequence from the leading bits (110, 1110, or 11110).
3. Read the remaining bytes of the sequence; each is a continuation byte starting with 10.
4. Combine the bits, excluding the leading marker bits, to get the Unicode code point.
5. Look up the code point to display the character.
6. Repeat from the next byte.

Example: Hindi Letter "अ" (U+0905)

- UTF-8 bytes: 11100000 10100100 10000101
- The first byte (1110xxxx) indicates a three-byte sequence.
- The payload bits 0000, 100100, and 000101 combine to 0000100100000101, which is the hexadecimal code point 0x0905.
- This code point represents "Devanagari Letter A" in Unicode.
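The decoding process above can be sketched in a few lines of Python. This is a minimal illustration, not a production decoder: the function name `decode_utf8` is hypothetical, and the sketch only checks the bit patterns from the table. It does not reject overlong encodings, surrogate code points, or values above U+10FFFF, which a conforming decoder must do.

```python
def decode_utf8(data: bytes) -> list[int]:
    """Decode UTF-8 bytes into a list of Unicode code points (illustrative sketch)."""
    code_points = []
    i = 0
    while i < len(data):
        first = data[i]
        # The leading bits of the first byte give the sequence length;
        # the mask keeps only that byte's payload bits.
        if first >> 7 == 0b0:          # 0xxxxxxx: 1 byte (ASCII)
            length, value = 1, first
        elif first >> 5 == 0b110:      # 110xxxxx: 2 bytes
            length, value = 2, first & 0b00011111
        elif first >> 4 == 0b1110:     # 1110xxxx: 3 bytes
            length, value = 3, first & 0b00001111
        elif first >> 3 == 0b11110:    # 11110xxx: 4 bytes
            length, value = 4, first & 0b00000111
        else:
            raise ValueError(f"invalid leading byte at offset {i}: {first:#04x}")

        if i + length > len(data):
            raise ValueError(f"truncated sequence at offset {i}")

        # Each continuation byte must look like 10xxxxxx and adds 6 payload bits.
        for cont in data[i + 1 : i + length]:
            if cont >> 6 != 0b10:
                raise ValueError(f"invalid continuation byte: {cont:#04x}")
            value = (value << 6) | (cont & 0b00111111)

        code_points.append(value)
        i += length
    return code_points


# The three bytes of "अ" from the example above.
print([hex(cp) for cp in decode_utf8(bytes([0b11100000, 0b10100100, 0b10000101]))])
# prints ['0x905']
```

In everyday code, Python's built-in `bytes.decode("utf-8")` does this work and also performs the stricter validation this sketch leaves out.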
Example Text Files

Text file with Hey👋 Buddy

- Contains ASCII letters and an emoji.
- 13 bytes in total, including the 4-byte emoji 👋 (U+1F44B).
- The emoji is encoded as: 11110000 10011111 10010001 10001011
- Decoding the bytes by the UTF-8 rules produces the correct characters, including the emoji.

Text file with Hey Buddy (ASCII only)

- Contains 9 bytes.
- All bytes start with 0, indicating single-byte ASCII characters.
- It is both a valid UTF-8 file and a valid ASCII file, which demonstrates backward compatibility.

Other Encodings

- Some encodings are backward compatible with ASCII but are less popular (e.g., GB 18030 for Chinese).
- ISO/IEC 8859 encodings extend ASCII but are limited to 256 characters.
- UTF-16 and UTF-32 are not backward compatible with ASCII:
  - A in UTF-16: 00 41 (2 bytes)
  - A in UTF-32: 00 00 00 41 (4 bytes)

Bonus: UTF-8 Playground

The author created an interactive UTF-8 Playground to visualize UTF-8 encoding and decoding, enabling users to explore UTF-8 byte sequences and code points dynamically.

---

Tags: #tech, #history, #programming