
How much VRAM do you need to run this model? Is 48 GB of unified memory enough?


About 39 GB if you use an fp8-quantized model.[1] Remember that your OS might be using some of that memory itself.

As far as I recall, Ollama/llama.cpp recently added a feature to page in parameters, so you'll be able to go arbitrarily large soon enough (at a performance cost). Obviously more in RAM = more speed.

[1]: https://token-calculator.net/llm-memory-calculator
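
As a rough sanity check on numbers like these: weight memory is roughly the parameter count times bytes per weight (1 byte at fp8, ~0.8 at Q6, ~0.6 at Q4), plus KV cache and runtime buffers on top. A minimal sketch, assuming a hypothetical 32B-parameter model; none of the figures below come from the actual model card:

    # Rough VRAM estimate for model weights at different quantization levels.
    # The 32B parameter count is an assumption for illustration, not the model's real size.
    PARAMS = 32e9
    BITS_PER_WEIGHT = {
        "fp16": 16,
        "fp8": 8,
        "Q6_K_L": 6.6,   # K-quants carry a bit of per-block metadata
        "Q4_K_M": 4.8,
    }

    def weights_gb(params: float, bits: float) -> float:
        """GB for the weights alone; KV cache and runtime buffers come on top."""
        return params * bits / 8 / 1e9

    for name, bits in BITS_PER_WEIGHT.items():
        print(f"{name:>7}: ~{weights_gb(PARAMS, bits):.0f} GB")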


I am using the Q6_K_L quant and it's running at about 40 GB of VRAM with the KV cache.

Device 1 [NVIDIA GeForce RTX 4090] MEM[||||||||||||||||||20.170Gi/23.988Gi]

Device 2 [NVIDIA GeForce RTX 4090] MEM[||||||||||||||||||19.945Gi/23.988Gi]


What's the context length?


The model has a context of 131,072 tokens, but I only have 48 GB of VRAM, so I run it with a context of 32,768.
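
Rough intuition for why the context has to shrink: the KV cache grows linearly with context length, at about 2 × layers × KV heads × head dim × bytes per element per token. A minimal sketch with made-up architecture numbers (not read from this model's config):

    # KV cache size as a function of context length.
    # Layer/head numbers are hypothetical placeholders, not this model's real config.
    N_LAYERS = 64
    N_KV_HEADS = 8      # grouped-query attention keeps this far below the query-head count
    HEAD_DIM = 128
    BYTES_PER_ELEM = 2  # fp16 cache; some runtimes can quantize this further

    def kv_cache_gb(context_tokens: int) -> float:
        """Keys + values for every layer, KV head, and token in the window."""
        return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM * context_tokens / 1e9

    for ctx in (32_768, 131_072):
        print(f"context {ctx:>7}: ~{kv_cache_gb(ctx):.1f} GB of KV cache")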


It's enough for a 6-bit quant with a somewhat restricted context length.

Though based on the responses here, it needs a sizable context to work well, so we may be limited to 4-bit (I'm on an M3 Max w/ 48 GB as well).


The quantized model fits in about 20 GB, so 32 GB would probably be sufficient unless you want to use the full context length (long inputs and/or lots of reasoning). 48 GB should be plenty.


I've tried the very early Q4 MLX release on an M1 Max 32 GB (LM Studio @ default settings) and have run into severe issues. For the coding tasks I gave it, it froze before it was done reasoning. I guess I should limit the context size. I do love what I'm seeing, though: the output reads very similar to R1, and I mostly agree with its conclusions. The Q8 version has to be even better.


Does the Q8 fit within your 32 GB? (Also using an M1 with 32 GB.)


No, Q4 just barely fits, and with a longer context sometimes things freeze. I definitely have to close Xcode.
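
On the "limit context size" point: if you're willing to step outside LM Studio's MLX build, llama-cpp-python exposes the context window directly, which caps the KV cache. A minimal sketch, assuming you already have a Q4 GGUF on disk; the file name is a placeholder:

    # Minimal sketch: cap the context window so the KV cache stays small on a 32 GB machine.
    # The model path is a placeholder; point it at whatever Q4 GGUF you actually downloaded.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./model-Q4_K_M.gguf",  # hypothetical file name
        n_ctx=8192,        # smaller window = smaller KV cache = less memory pressure
        n_gpu_layers=-1,   # offload all layers to Metal on Apple Silicon builds
    )

    out = llm("Explain what a KV cache is in two sentences.", max_tokens=128)
    print(out["choices"][0]["text"])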



