
One thing I’d like to see in these releases is stronger emphasis on regression behavior, not just headline capability.

In production, the costly failures are usually "almost right" edits that quietly shift semantics across large diffs.

We now gate model upgrades behind a fixed eval set of our own repos + prompts and compare pass rates by task category (refactor, test repair, API migration). Raw benchmark gains matter less to us than variance and rollback safety. If 3.1 improves consistency on long multi-file edits, that’s a bigger win than a small jump on one-shot tasks.
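A minimal sketch of that kind of gate, assuming a simple per-category pass-rate comparison (the categories, numbers, and threshold here are made up for illustration, not our real harness):

```python
# Hypothetical upgrade gate: block a model upgrade if any task category's
# pass rate regresses by more than a fixed tolerance vs. the current model.

BASELINE = {"refactor": 0.82, "test_repair": 0.75, "api_migration": 0.68}
CANDIDATE = {"refactor": 0.85, "test_repair": 0.74, "api_migration": 0.71}

MAX_REGRESSION = 0.02  # tolerate at most a 2-point drop in any category


def gate(baseline, candidate, max_regression=MAX_REGRESSION):
    """Return (passes, regressions) where regressions maps each task
    category that dropped by more than max_regression to its delta."""
    regressions = {
        cat: round(baseline[cat] - candidate[cat], 4)
        for cat in baseline
        if baseline[cat] - candidate[cat] > max_regression
    }
    return (not regressions, regressions)


ok, regressed = gate(BASELINE, CANDIDATE)
print("upgrade allowed:", ok, "| regressed categories:", regressed)
```

The point of gating on per-category regressions rather than the aggregate score is exactly the "almost right" failure mode above: a candidate can raise the overall pass rate while quietly getting worse at, say, long multi-file refactors.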


