public:: true
tags:: Development, programming, unicode
- {{video https://www.youtube.com/watch?v=ut74oHojxqo&t=90s}}
- How ASCII works
- used for Western (English) letters, digits, punctuation and control codes
- 128 characters (codes 0 to 127, so 7 bits each)
- it’s a table: each character has a number, and this number is stored in binary
- but 128 characters are not enough for other writing systems such as Chinese, Arabic or Cyrillic, or for emoji
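- A minimal sketch of the ASCII table lookup described above, using Python's built-in `ord` and `format`. The helper name `ascii_bits` is made up for this note.

```python
def ascii_bits(ch: str) -> str:
    """Return the 7-bit binary form of a single ASCII character."""
    code = ord(ch)  # the character's number in the table
    if code > 127:
        raise ValueError(f"{ch!r} is not an ASCII character")
    return format(code, "07b")  # zero-padded to 7 binary digits


for ch in "Hi!":
    print(ch, ord(ch), ascii_bits(ch))
```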
- Unicode
- instead of talking about characters, we talk about graphemes
- grapheme: a single unit of human writing → think of what would be written on a single Scrabble tile
- code point
- a code point is the number Unicode assigns to a character or combining mark; each grapheme is made of one or more code points
- some graphemes are built by combining several code points (for example a raised-hands emoji followed by a skin-tone modifier gives the different skin-tone variants)
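- A small sketch of the grapheme vs code point distinction, assuming Python 3, where a `str` is a sequence of code points. The helper `code_points` is a hypothetical name for this note. The raised-hands emoji is U+1F64C and the medium skin-tone modifier is U+1F3FD.

```python
def code_points(text: str) -> list:
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


# One grapheme on screen, but two code points underneath.
raised_hands = "\U0001F64C\U0001F3FD"
print(len(raised_hands), code_points(raised_hands))

# Same idea with an accent: "e" followed by a combining acute accent.
e_acute = "e\u0301"
print(len(e_acute), code_points(e_acute))
```

- Note that `len` counts code points, not graphemes, so both strings above report a length of 2.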
- encodings
- UTF-32: each code point is encoded as 4 bytes (32 bits)
- issue → we waste a lot of space
- lower code points are used far more often, and they would fit in fewer bytes
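- To make the wasted space concrete, a quick comparison of encoded sizes using Python's built-in codecs. `"utf-32-le"` is used so that no byte order mark is added to the count; the helper `encoded_sizes` is a hypothetical name.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes the text takes in UTF-32 and in UTF-8."""
    return {
        "utf-32": len(text.encode("utf-32-le")),  # always 4 bytes per code point
        "utf-8": len(text.encode("utf-8")),       # 1 to 4 bytes per code point
    }


for sample in ["hello", "\u20ac", "\U0001F64C"]:
    print(repr(sample), encoded_sizes(sample))
```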
- UTF-8: each code point is encoded in 1, 2, 3 or 4 bytes, depending on its value
- the 128 ASCII characters keep the same single-byte values, for backward compatibility
- if the first byte starts with 0 → it’s a 1-byte code point (plain ASCII)
- if the first byte starts with 110 → it’s a 2-byte code point, read 1 more byte
- if the first byte starts with 1110 → it’s a 3-byte code point, read 2 more bytes
- if the first byte starts with 11110 → it’s a 4-byte code point, so you need to read this byte and 3 more
- every following (continuation) byte starts with 10
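- The UTF-8 leading-byte rules can be sketched as a tiny decoder. This is an illustration only, not a replacement for `bytes.decode("utf-8")`: it skips checks a real decoder performs (overlong encodings, surrogates, values above U+10FFFF). The names `sequence_length` and `decode_code_points` are made up for this note.

```python
from typing import List


def sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence has, based on its first byte."""
    if first_byte >> 7 == 0b0:      # 0xxxxxxx
        return 1
    if first_byte >> 5 == 0b110:    # 110xxxxx
        return 2
    if first_byte >> 4 == 0b1110:   # 1110xxxx
        return 3
    if first_byte >> 3 == 0b11110:  # 11110xxx
        return 4
    raise ValueError(f"invalid leading byte: {first_byte:#04x}")


def decode_code_points(data: bytes) -> List[int]:
    """Decode UTF-8 bytes into a list of code point numbers."""
    result = []
    i = 0
    while i < len(data):
        first = data[i]
        length = sequence_length(first)
        if i + length > len(data):
            raise ValueError("truncated sequence")
        # Keep only the payload bits of the first byte.
        value = first if length == 1 else first & (0xFF >> (length + 1))
        for byte in data[i + 1 : i + length]:
            if byte >> 6 != 0b10:  # continuation bytes look like 10xxxxxx
                raise ValueError(f"invalid continuation byte: {byte:#04x}")
            value = (value << 6) | (byte & 0b00111111)
        result.append(value)
        i += length
    return result


text = "a\u00e9\u20ac\U0001F64C"  # 1-, 2-, 3- and 4-byte code points
print([hex(cp) for cp in decode_code_points(text.encode("utf-8"))])
```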