- ggml is a library that focuses on two basic things:
    - creating computational graphs that run efficiently on CPU (see the sketch after this list)
    - quantizing big models from 32-bit floats down to 4/8 bits
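A minimal sketch of what building and running a ggml compute graph looks like, following the pattern of the examples in the ggml repo. The API has shifted across ggml versions, so treat the exact signatures (e.g. `ggml_graph_compute_with_ctx`) as illustrative rather than exact:

```c
// Build a tiny graph f = a*b + b and evaluate it on the CPU.
#include <stdio.h>
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        .mem_size   = 16 * 1024 * 1024,  /* scratch memory for tensors + graph */
        .mem_buffer = NULL,
        .no_alloc   = false,
    };
    struct ggml_context *ctx = ggml_init(params);

    /* declare the graph symbolically: f = a * b + b */
    struct ggml_tensor *a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
    struct ggml_tensor *b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
    struct ggml_tensor *f = ggml_add(ctx, ggml_mul(ctx, a, b), b);

    /* materialize the graph, set inputs, and compute on the CPU */
    struct ggml_cgraph *gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, f);
    ggml_set_f32(a, 3.0f);
    ggml_set_f32(b, 2.0f);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/1);

    printf("f = %.1f\n", ggml_get_f32_1d(f, 0));  /* 3*2 + 2 = 8.0 */
    ggml_free(ctx);
    return 0;
}
```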
- The author of ggml built llama.cpp, and that was a big deal because:
    - Even LLaMA-7B needs around 7B params × 4 bytes = 28 GB of VRAM on a GPU for inference at full 32-bit precision
    - that meant you needed a last-gen GPU just to run even the most basic model
    - BUT with llama.cpp the model is quantized to 4 bits ⇒ 7B × 0.5 bytes = 3.5 GB of weights (closer to 4-5 GB in practice, since quantized formats also store scaling factors and inference needs a KV cache), and it runs inference on the CPU, no GPU needed; see the memory math after this list
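A quick sanity check of the memory math above. The only assumption is the usual back-of-the-envelope rule that the weight footprint is parameter count × bytes per parameter (ignoring scales, activations, and the KV cache):

```c
// Weight memory for a 7B-parameter model at different precisions.
#include <stdio.h>

int main(void) {
    const double params = 7e9;                 /* LLaMA-7B: ~7 billion parameters */
    const double bits[] = {32, 16, 8, 4};      /* full precision down to 4-bit */
    for (int i = 0; i < 4; i++) {
        double gb = params * (bits[i] / 8.0) / 1e9;  /* bytes -> GB */
        printf("%2.0f-bit: %5.1f GB\n", bits[i], gb);
    }
    /* prints: 32-bit: 28.0 GB, 16-bit: 14.0 GB, 8-bit: 7.0 GB, 4-bit: 3.5 GB */
    return 0;
}
```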
- In the end, we want models that everybody can run ⇒ that brings an explosion of new ideas, without having to depend on third-party services like ChatGPT