
How much VRAM do you need to run this model? Is 48 GB of unified memory enough?


About 39 GB if you use an fp8-quantized model.[1] Remember that your OS might be using some of that memory itself.

As far as I recall, Ollama/llama.cpp recently added a feature to page in parameters, so you'll be able to go arbitrarily large soon enough (at a performance cost). Obviously more in RAM = more speed.

[1]: https://token-calculator.net/llm-memory-calculator
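
As a rough sanity check on numbers like these: weight memory is roughly the parameter count times bytes per weight (1 byte at fp8, ~0.8 at Q6, ~0.6 at Q4), plus KV cache and runtime buffers on top. A minimal sketch, assuming a hypothetical 32B-parameter model; none of the figures below come from the actual model card:

    # Rough VRAM estimate for model weights at different quantization levels.
    # The 32B parameter count is an assumption for illustration, not the model's real size.
    PARAMS = 32e9
    BITS_PER_WEIGHT = {
        "fp16": 16,
        "fp8": 8,
        "Q6_K_L": 6.6,   # K-quants carry a bit of per-block metadata
        "Q4_K_M": 4.8,
    }

    def weights_gb(params: float, bits: float) -> float:
        """GB for the weights alone; KV cache and runtime buffers come on top."""
        return params * bits / 8 / 1e9

    for name, bits in BITS_PER_WEIGHT.items():
        print(f"{name:>7}: ~{weights_gb(PARAMS, bits):.0f} GB")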


I am using the Q6_K_L quant and it's running at about 40 GB of VRAM with the KV cache.

Device 1 [NVIDIA GeForce RTX 4090] MEM[||||||||||||||||||20.170Gi/23.988Gi]

Device 2 [NVIDIA GeForce RTX 4090] MEM[||||||||||||||||||19.945Gi/23.988Gi]


What's the context length?


The model has a context of 131,072 tokens, but I only have 48 GB of VRAM, so I run it with a context of 32,768.
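
Rough intuition for why the context has to shrink: the KV cache grows linearly with context length, at about 2 × layers × KV heads × head dim × bytes per element per token. A minimal sketch with made-up architecture numbers (not read from this model's config):

    # KV cache size as a function of context length.
    # Layer/head numbers are hypothetical placeholders, not this model's real config.
    N_LAYERS = 64
    N_KV_HEADS = 8      # grouped-query attention keeps this far below the query-head count
    HEAD_DIM = 128
    BYTES_PER_ELEM = 2  # fp16 cache; some runtimes can quantize this further

    def kv_cache_gb(context_tokens: int) -> float:
        """Keys + values for every layer, KV head, and token in the window."""
        return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM * context_tokens / 1e9

    for ctx in (32_768, 131_072):
        print(f"context {ctx:>7}: ~{kv_cache_gb(ctx):.1f} GB of KV cache")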


It's enough for a 6-bit quant with a somewhat restricted context length.

Though based on the responses here, it needs a sizable context to work well, so we may be limited to 4-bit (I'm on an M3 Max w/ 48 GB as well).


The quantized model fits in about 20 GB, so 32 GB would probably be sufficient unless you want to use the full context length (long inputs and/or lots of reasoning). 48 GB should be plenty.


I've tried the very early Q4 MLX release on an M1 Max 32 GB (LM Studio @ default settings) and have run into severe issues. For the coding tasks I gave it, it froze before it was done reasoning. I guess I should limit the context size. I do love what I'm seeing, though: the output reads very similar to R1, and I mostly agree with its conclusions. The Q8 version has to be even better.


Does the Q8 fit within your 32 GB? (Also using an M1 with 32 GB.)


No, Q4 just barely fits, and with a longer context sometimes things freeze. I definitely have to close Xcode.
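
On the "limit context size" point: if you're willing to step outside LM Studio's MLX build, llama-cpp-python exposes the context window directly, which caps the KV cache. A minimal sketch, assuming you already have a Q4 GGUF on disk; the file name is a placeholder:

    # Minimal sketch: cap the context window so the KV cache stays small on a 32 GB machine.
    # The model path is a placeholder; point it at whatever Q4 GGUF you actually downloaded.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./model-Q4_K_M.gguf",  # hypothetical file name
        n_ctx=8192,        # smaller window = smaller KV cache = less memory pressure
        n_gpu_layers=-1,   # offload all layers to Metal on Apple Silicon builds
    )

    out = llm("Explain what a KV cache is in two sentences.", max_tokens=128)
    print(out["choices"][0]["text"])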



