• ggml is a library that focuses on two basic things
    • building computational graphs that run well on the CPU (a minimal sketch follows this list)
    • quantization of big models from 32-bit floats down to 4/8 bits (see the quantization sketch below)
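To make the first point concrete, here is a minimal sketch of building and evaluating a computational graph on the CPU with ggml. The function names follow ggml's classic C API (ggml_init, ggml_add, ggml_graph_compute_with_ctx, ...), which has shifted between versions, so treat this as illustrative rather than exact:

```c
#include <stdio.h>
#include "ggml.h"

int main(void) {
    // small scratch arena that holds the tensors and graph metadata
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // two 1-D input tensors of 4 floats each
    struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    struct ggml_tensor * y = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);

    // define z = x*x + y; nothing is computed yet, only the graph is built
    struct ggml_tensor * z = ggml_add(ctx, ggml_mul(ctx, x, x), y);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, z);

    // fill the inputs, then evaluate the whole graph on CPU threads
    ggml_set_f32(x, 2.0f);
    ggml_set_f32(y, 1.0f);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/4);

    printf("z[0] = %.1f\n", ggml_get_f32_1d(z, 0)); // 2*2 + 1 = 5.0
    ggml_free(ctx);
    return 0;
}
```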
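And for the second point, a simplified sketch of 4-bit block quantization in the spirit of ggml's Q4_0 type: each block of 32 floats is replaced by one float scale plus 32 packed 4-bit values, roughly 4.5 bits per weight. The struct layout and rounding below are an approximation for illustration, not ggml's exact code:

```c
#include <math.h>
#include <stdint.h>

#define QK 32  // elements per block, as in ggml's Q4_0

typedef struct {
    float   d;          // per-block scale
    uint8_t qs[QK / 2]; // 32 4-bit values packed two per byte
} block_q4;

static void quantize_block(const float *x, block_q4 *out) {
    // find the value with the largest magnitude in the block
    float amax = 0.0f, max = 0.0f;
    for (int i = 0; i < QK; i++) {
        if (fabsf(x[i]) > amax) { amax = fabsf(x[i]); max = x[i]; }
    }
    // map the block's range onto the signed 4-bit range [-8, 7]
    const float d  = max / -8.0f;
    const float id = d != 0.0f ? 1.0f / d : 0.0f;
    out->d = d;
    for (int i = 0; i < QK / 2; i++) {
        int q0 = (int)(x[2*i + 0] * id + 8.5f); // offset by 8 -> [0, 15]
        int q1 = (int)(x[2*i + 1] * id + 8.5f);
        if (q0 > 15) q0 = 15;
        if (q1 > 15) q1 = 15;
        out->qs[i] = (uint8_t)q0 | ((uint8_t)q1 << 4); // two nibbles per byte
    }
}

static float dequantize(const block_q4 *b, int i) {
    const uint8_t nib = (i % 2 == 0) ? (b->qs[i/2] & 0x0F) : (b->qs[i/2] >> 4);
    return ((int)nib - 8) * b->d; // undo the offset, rescale to float
}
```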
• The author of ggml built LlamaCpp, and that was a big deal because
    • even Llama-7b needs around 7B params × 4 bytes = 28 GB of VRAM during inference at full 32-bit precision
    • that meant you needed a last-gen, high-end GPU just to run the most basic model
    • BUT with LlamaCpp the model is quantized to 4 bits (7B × 0.5 bytes ≈ 3.5 GB, a bit more with the per-block scales), and it can do inference on the CPU, no GPU needed (see the calculator sketch after this list)
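The back-of-the-envelope math behind those numbers, as a tiny calculator (the 7e9 parameter count for LLaMA-7B is an approximation, and this ignores activation/KV-cache memory):

```c
#include <stdio.h>

int main(void) {
    const double n_params = 7e9;          // ~LLaMA-7B
    const int    bits[]   = {32, 16, 8, 4};
    for (int i = 0; i < 4; i++) {
        double gb = n_params * bits[i] / 8.0 / 1e9; // bits -> bytes -> GB
        printf("%2d-bit weights: %5.1f GB\n", bits[i], gb);
    }
    return 0; // prints 28.0, 14.0, 7.0, 3.5
}
```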
• In the end, we want models that everybody can run, which brings an explosion of new ideas without having to depend on third-party services like ChatGPT