Unicode

{{video https://www.youtube.com/watch?v=ut74oHojxqo&t=90s}}
How ASCII works
- used for Western symbols (English letters, digits, punctuation)
- 128 characters (numbered 0 to 127, so each fits in 7 bits)
- it’s a table: each character has a number, and this number is encoded in binary
But 128 characters are not enough for all the other scripts and symbols: Chinese, Arabic, Cyrillic, emoji, etc.
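A minimal sketch of the ASCII table idea in Python. It only uses the built-ins `ord`, `format` and `str.encode`; the helper names `ascii_row` and `fits_in_ascii` are made up for this note.

```python
def ascii_row(ch: str) -> tuple[int, str]:
    """Return the table number of a character and its 7-bit binary form."""
    number = ord(ch)
    return number, format(number, "07b")


def fits_in_ascii(text: str) -> bool:
    """True if every character of text has an entry in the ASCII table."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


for ch in "Hi!":
    print(ch, *ascii_row(ch))
# H 72 1001000
# i 105 1101001
# ! 33 0100001

print(fits_in_ascii("hello"))    # True
print(fits_in_ascii("\u4e2d"))   # False: a Chinese character has no ASCII number
```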
Unicode
- instead of talking about characters, we talk about graphemes
- grapheme: a single unit of human writing → think of what would be written on a single Scrabble tile
- code point
  - a code point is the number Unicode assigns to a character; a simple grapheme maps to a single code point
  - some graphemes are made by combining several code points (for example, a raised-hands emoji + a skin-tone modifier gives the different skin-tone variants)
- encodings
  - UTF-32: each code point is encoded as 4 bytes (32 bits)
    - issue → we waste a lot of space
    - lower code points are used far more often, and in UTF-32 their upper bytes are all zeros
  - UTF-8: depending on the code point, it is encoded in 1, 2, 3 or 4 bytes
    - if it’s one of the 128 ASCII chars, it’s mapped to the same single byte for backward compatibility
    - if the next byte starts with 0 → it’s a 1-byte code point (ASCII)
    - if the next byte starts with 110 → it’s a 2-byte code point → read 1 more byte
    - if the next byte starts with 1110 → it’s a 3-byte code point → read 2 more bytes
    - if the next byte starts with 11110 → it’s a 4-byte code point → read this byte and 3 more
    - the extra bytes (continuation bytes) always start with 10
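The byte-prefix rules above can be sketched in Python. This is an illustration, not a full decoder: it assumes well-formed UTF-8 input and does not validate the continuation bytes. The names `utf8_sequence_length` and `split_code_points` are hypothetical helpers, not part of any library.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return how many bytes the code point starting at lead_byte occupies."""
    if lead_byte >> 7 == 0b0:        # 0xxxxxxx → 1 byte (ASCII)
        return 1
    if lead_byte >> 5 == 0b110:      # 110xxxxx → 2 bytes
        return 2
    if lead_byte >> 4 == 0b1110:     # 1110xxxx → 3 bytes
        return 3
    if lead_byte >> 3 == 0b11110:    # 11110xxx → 4 bytes
        return 4
    raise ValueError("not a lead byte (continuation bytes look like 10xxxxxx)")


def split_code_points(data: bytes) -> list[bytes]:
    """Cut a UTF-8 byte string into one chunk per code point."""
    chunks = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


# 'a', e-acute (U+00E9), euro sign (U+20AC), thumbs-up emoji (U+1F44D)
text = "a\u00e9\u20ac\U0001F44D"
chunks = split_code_points(text.encode("utf-8"))
print([len(c) for c in chunks])        # [1, 2, 3, 4]
print(len(text.encode("utf-32-le")))   # 16: UTF-32 always spends 4 bytes per code point

# One grapheme built from two code points: raised hands + skin-tone modifier
hands = "\U0001F64C\U0001F3FD"
print([hex(ord(cp)) for cp in hands])  # ['0x1f64c', '0x1f3fd']
```

The same 4 code points take 10 bytes in UTF-8 and 16 in UTF-32, which is the space issue noted above.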

Kzk's garden