Using common tools like `ls` and file patching is already baked into the model's weights. It can do that with minimal effort, leaving more room for actually thinking about the app's code.
If you force it to wrap these actions in non-standard tools, you're basically distracting the model: it has to think about app code and tool code in the same context.
In some cases it does make sense to encourage the model to create utilities for itself, but you can do that without enforcing code-only.
It doesn’t matter if it’s less efficient; what matters is that it has more chances to verify and get it right. It’s hard to roll back a series of tool calls. It’s easier to revert state and rerun a complete piece of code until you get the desired result.
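
To make the revert-and-rerun idea concrete, here is a minimal sketch in Python. It is not taken from any particular agent framework: `run_with_rollback`, `candidates`, and `verify` are hypothetical names, and it assumes the whole state lives in a single directory that is cheap to copy.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def run_with_rollback(
    workdir: Path,
    candidates: Iterable[Callable[[Path], None]],
    verify: Callable[[Path], bool],
) -> int:
    """Run candidates one at a time against workdir.

    Returns the index of the first candidate whose result passes verify,
    or -1 if none does. After every failed attempt, workdir is restored
    from a snapshot taken before the first attempt.
    (Hypothetical helper: assumes all state is inside workdir.)
    """
    with tempfile.TemporaryDirectory() as tmp:
        snapshot = Path(tmp) / "snapshot"
        shutil.copytree(workdir, snapshot)  # save the known-good state once

        for index, candidate in enumerate(candidates):
            try:
                candidate(workdir)  # a complete piece of code, run in one go
                if verify(workdir):
                    return index  # keep the changes made by this attempt
            except Exception:
                pass  # a crash counts as a failed attempt

            # Revert: discard everything the attempt did, then restore.
            shutil.rmtree(workdir)
            shutil.copytree(snapshot, workdir)

    return -1
```

Each candidate would be one full script the model wrote, and `verify` could run the tests or check the output. The point of the sketch is that a failed attempt costs one directory restore, while undoing a sequence of individual tool calls would mean tracking and reversing each call separately.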