public:: true
tags:: Development, programming, unicode

  • {{video https://www.youtube.com/watch?v=ut74oHojxqo&t=90s}}
  • how ASCII works
    • used for Western symbols: English letters, digits, punctuation and control codes
    • 128 characters (numbered 0 to 127, which fits in 7 bits)
    • it’s a table: each character has a number, and this number is encoded in binary
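    • a minimal Python sketch of the table lookup (not from the video; `ascii_bits` is a hypothetical helper name): `ord()` gives the number assigned to a character, and `format(..., "07b")` shows it as 7 binary digits

```python
def ascii_bits(ch: str) -> str:
    """Return the 7-bit binary form of an ASCII character's number."""
    code = ord(ch)  # the number assigned to the character in the table
    if code > 127:
        raise ValueError(f"{ch!r} is not an ASCII character")
    return format(code, "07b")  # 7 bits are enough for 0..127


# Print each character with its table number and its binary encoding.
for ch in "Hi!":
    print(ch, ord(ch), ascii_bits(ch))
```

    • for example, `ascii_bits("A")` returns `"1000001"`, because `A` is number 65 in the table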
  • but ASCII’s 7 bits don’t leave room for other scripts and symbols such as Chinese, Arabic, Cyrillic, emojis, etc.
  • Unicode
    • now, instead of talking about characters, we talk about graphemes
    • grapheme: a single unit of human writing; think of what would be written on a single Scrabble tile
    • code point
      • a code point is a number in the Unicode table; each grapheme is represented by one or more code points
      • some graphemes are built by combining several code points (for example, a raised-hands emoji plus a skin-tone modifier produces the different skin-tone variants)
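      • a small Python sketch of the emoji example (the helper name `code_points` is an assumption for illustration): a Python `str` is a sequence of code points, so the combined emoji has length 2 even though it is displayed as one grapheme

```python
def code_points(text: str) -> list[str]:
    """List the code points of a string in the usual U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in text]


hands = "\U0001F64C"  # raised-hands emoji
tone = "\U0001F3FD"   # medium skin-tone modifier
combined = hands + tone  # rendered as a single emoji where fonts support it

print(code_points(combined))  # ['U+1F64C', 'U+1F3FD']
print(len(combined))          # 2 (code points, not graphemes)
```

      • counting graphemes rather than code points needs Unicode segmentation rules, which Python’s built-in `len()` does not apply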
    • encodings
      • UTF-32: each code point is encoded as 4 bytes (32 bits)
        • issue: we waste a lot of space
        • the lower code points are used far more often, so the high bytes are usually all zeros
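        • a quick Python check of the wasted space (a sketch; the sample text is arbitrary, and `utf-32-be` is used so that no byte-order mark is added to the output)

```python
text = "hello"

utf32 = text.encode("utf-32-be")  # fixed width: 4 bytes per code point
utf8 = text.encode("utf-8")       # variable width: 1 byte per ASCII character

print(len(utf32), len(utf8))  # 20 5
print(utf32[:4].hex())        # 00000068 -> three zero bytes just to store "h"
```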
      • UTF-8: depending on its value, a code point is encoded in 1, 2, 3 or 4 bytes
        • the 128 ASCII characters are encoded as the same single byte, for backward compatibility
        • if the first byte starts with 0, it’s a 1-byte code point
        • if the first byte starts with 110, it’s a 2-byte code point: read 1 more byte
        • if the first byte starts with 1110, it’s a 3-byte code point: read 2 more bytes
        • if the first byte starts with 11110, it’s a 4-byte code point: read this byte and 3 more
        • every following (continuation) byte starts with 10
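        • the prefix rules above can be sketched in Python as follows (an illustrative decoder, not a replacement for `bytes.decode`; the names `sequence_length` and `decode_code_points` are made up for this note, and the input is assumed to be valid, complete UTF-8, so continuation bytes and overlong forms are not checked)

```python
def sequence_length(first_byte: int) -> int:
    """Return how many bytes a code point uses, judging by its first byte."""
    if first_byte >> 7 == 0b0:
        return 1  # 0xxxxxxx
    if first_byte >> 5 == 0b110:
        return 2  # 110xxxxx
    if first_byte >> 4 == 0b1110:
        return 3  # 1110xxxx
    if first_byte >> 3 == 0b11110:
        return 4  # 11110xxx
    raise ValueError("continuation byte or invalid leading byte")


def decode_code_points(data: bytes) -> list[int]:
    """Decode UTF-8 bytes into code point numbers (assumes valid input)."""
    points = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        if n == 1:
            cp = data[i]
        else:
            # Keep only the payload bits that follow the length prefix.
            cp = data[i] & (0xFF >> (n + 1))
            for b in data[i + 1 : i + n]:
                # Each continuation byte (10xxxxxx) contributes 6 bits.
                cp = (cp << 6) | (b & 0b00111111)
        points.append(cp)
        i += n
    return points


sample = "a\u00e9\u20ac\U0001F64C"  # 1-, 2-, 3- and 4-byte code points
print([hex(cp) for cp in decode_code_points(sample.encode("utf-8"))])
```

        • the result should match `[ord(c) for c in sample]`, which is a handy cross-check against Python’s own decoder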