Hacker News

The choice of 32 parallel sequences is also arbitrary and significantly changes your conclusions. For example, running with 256 parallel sequences instead would make both prefill and decode 8x cheaper in your calculations.
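A rough sketch of why the batch size dominates that arithmetic (the fixed per-forward-pass cost here is a hypothetical placeholder, not a number from the article): weight traffic is amortized across all sequences in the batch, so per-sequence cost just divides by the batch size.

```python
# Sketch: per-sequence cost amortization across a batch.
# total_cost_per_pass is a hypothetical placeholder value.

def per_sequence_cost(total_cost_per_pass: float, parallel_sequences: int) -> float:
    """Fixed per-pass cost (e.g. weight reads) shared across the batch."""
    return total_cost_per_pass / parallel_sequences

cost_32 = per_sequence_cost(1.0, 32)
cost_256 = per_sequence_cost(1.0, 256)
print(cost_32 / cost_256)  # 256/32 = 8x cheaper per sequence
```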

The claim that attention only becomes compute-bound at long context lengths is also quite misleading.
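One way to see why the context-length framing is slippery: during single-token decode, attention reads the whole KV cache to emit one token, so FLOPs and bytes both grow linearly with context length and the arithmetic intensity stays roughly constant (and low) no matter how long the context gets. A minimal sketch, assuming an fp16 KV cache and counting only the QK^T and PV matmuls per head:

```python
# Sketch: arithmetic intensity of single-token decode attention.
# Assumes fp16 (2 bytes/element) KV cache; counts only QK^T and PV matmuls.

def decode_attention_intensity(context_len: int, head_dim: int = 128) -> float:
    # FLOPs per head: q @ K^T costs 2*L*d, probs @ V costs another 2*L*d
    flops = 4 * context_len * head_dim
    # Bytes per head: read K and V, each L x d fp16 values (2 bytes each)
    bytes_moved = 2 * 2 * context_len * head_dim
    return flops / bytes_moved

for length in (1_000, 100_000):
    print(length, decode_attention_intensity(length))  # stays at 1.0
```

The intensity is 1 FLOP/byte at both 1K and 100K context, far below the FLOP/byte ratio of modern accelerators, which is why decode attention stays memory-bound regardless of context length (prefill, which batches many query tokens, is a different story).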



Anyone up for publishing their own guess range?




