You can't turn off delayed ACKs and make them stay off, which is a related problem.
The Linux API for that is very strange. It only applies for a short period.
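(For reference, the option being alluded to here is presumably TCP_QUICKACK. A minimal sketch of the awkwardness, assuming a Linux socket and omitting error handling -- the flag isn't sticky, so applications end up re-applying it around every read:)

    /* Minimal sketch: on Linux, TCP_QUICKACK is not sticky -- the kernel can
     * quietly fall back to delayed ACKs after sending a few immediate ones,
     * so callers end up re-applying it around every receive. Error handling
     * omitted. */
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>
    #include <unistd.h>

    ssize_t recv_with_quickack(int fd, void *buf, size_t len)
    {
        int one = 1;
        /* Ask for immediate ACKs for data received around this read. */
        setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
        /* The flag can silently revert, so it is set again on the next call. */
        return read(fd, buf, len);
    }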
Delayed ACKs and the Nagle algorithm should never be on at the same time. The trouble is, they're controlled at opposite ends of the connection. You can turn off the Nagle algorithm at your end, but you want to turn off delayed ACKs at the other end. That's the practical problem.
Still, what's the use case for having multiple tiny messages in flight during one RTT? Games usually send their interactive traffic over UDP. If you have delayed ACKs off, you should never have a propagation delay of more than one RTT. It's that fixed timer in delayed ACKs, set to a value that made sense for keyboard echo, that can cause delays of more than one RTT.
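(For reference, turning off the Nagle algorithm at your own end is a single setsockopt call -- a minimal sketch, omitting error handling:)

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Disable the Nagle algorithm on a connected TCP socket. This only
     * affects the local send path; the peer's delayed-ACK behavior is
     * untouched, which is the asymmetry described above. */
    int disable_nagle(int fd)
    {
        int one = 1;
        return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
    }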
I find it hilarious that people want to turn off your algorithm because most of them don’t know that delayed ACKs are the “real problem”.
I’d much rather see delayed ACKs disabled as the new default vs. your algorithm being disabled by default.
I’ve seen too many applications not filling packets and sending tons of tiny packets. They’d benefit from your algorithm, even with delayed ACKs on… but people also confuse latency with throughput all the time (sometimes you want/need one more than the other, and languages like Go not giving programmers the ability to decide is frustrating, mainly because for a while those programs were the biggest offenders)…
I digress… I do find this whole thing rather amusing.
Delayed ACKs are not the "real problem". For the last 10 years I haven't encountered a single environment where Nagle's algorithm would be a net win. No heuristics, no fancy auto-sensing. It should be off by default.
If you're working in distributed systems, fintech, mobile network optimization, broadcasting, the first thing you should do is switch off Nagle's algorithm. Animats should focus on Second Life and stop holding on to what's now a very bad default.
It is not, in general, a net win. It is a preventative measure against things getting really bad.
My other work back then, on fair queuing, followed the same line - keep things from getting really bad just because something was slightly overloaded.
When I was working on this, funding was from DARPA, Defense Communications Agency, and such. They wanted networks to keep working under bad conditions. Maximum price/performance was far less important. The price of getting the last 10% in performance is usually complexity and often fragility.
See, this is why we can't have real conversations about this problem, because most people don't even understand what they are saying and just spout off dogma.
1. Just because you haven't "encountered a single environment" doesn't mean you haven't been in one or that you'd even know what to look for to know if you were in one where it would be a net win.
2. "If you're working in distributed systems," you likely have very fat, fast, and reliable pipes between your services. Nagle's algorithm is probably a Bad Thing[tm] in those situations. If you have a lot of Wi-Fi interference or dropped packets, Nagle's algorithm can be the difference between 50bps and 1mbps (assuming your application isn't filling packets), except that Delayed Acks prevents you from realizing all that.
There's no "one size fits all" solution, but you have control over Nagle's algorithm, you do not have control over Delayed Acks.
For anyone who doesn't know, HN user Animats is John Nagle, eponym of Nagle's algorithm.
My main mental association with that algorithm is always being asked about it in "make menuconfig" when I used to compile my own Linux kernels. One of relatively few networking concepts I can think of that's named after a person (along with Van Jacobson header compression).
> The trouble is, they're controlled at opposite ends of the connection. You can turn off the Nagle algorithm at your end, but you want to turn off delayed ACKs at the other end. That's the practical problem.
For an extant case of this practical problem chosen at random, I'd be curious to know-- what's the likelihood that it's just openssh on both ends of the connection?
I'd like to be able to confidently turn off Nagle's algorithm system-wide, but I'm always going to be concerned that I'll some day run a high-traffic application that depends on it without explicitly enabling it because it's been the default for so long.
I think this is one of the rare cases where I'd prefer the kernel to have a "magic" setting that went something along the lines of: "if a connection isn't explicitly setting its use of Nagle's algorithm, default it to off, but occasionally look at the connections in this category that are generating the most packets on the system, and turn Nagle's on for those connections if their packets are mostly small (or other heuristics)."
Potential waste from overly fragmented packets is going to be negligible on connections that don't represent a large portion of the system's traffic, so there's no reason to bother observing and tuning them (which matters if there are many of them and the cost of doing so becomes noticeable). It's really only the highest-traffic connections where Nagle's might help, and selectively turning it on for connections that seem to benefit from it would maintain backwards compatibility while still reaping the nodelay benefits for the majority of software that doesn't care.
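Purely as an illustration of the kind of heuristic being described (nothing like this exists in any kernel; the counters and thresholds are invented), the per-connection decision could look roughly like this:

    /* Hypothetical sketch of the heuristic described above -- not a real
     * kernel policy; the counters and thresholds are invented. */
    struct conn_stats {
        unsigned long long packets_sent;
        unsigned long long bytes_sent;
    };

    /* Return 1 if Nagle should be (re)enabled for this connection. */
    int should_enable_nagle(const struct conn_stats *c,
                            unsigned long long system_packets,
                            unsigned int mss)
    {
        if (c->packets_sent == 0 || system_packets == 0)
            return 0;
        /* Ignore connections that are a small share of system traffic;
         * observing them isn't worth the cost (5% is arbitrary). */
        if (c->packets_sent * 100 < system_packets * 5)
            return 0;
        /* Mostly-small packets: this connection would benefit from
         * coalescing, so turn Nagle back on. */
        return (c->bytes_sent / c->packets_sent) < mss / 4;
    }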
> I think this is one of the rare cases where I'd prefer if the kernel had a "magic" setting...
All this stuff should be automatic. The problem is that it can take a few round trips to discover what the application is doing. This comes up with "slow start", where you need some time to discover what's going on. In a world of short-lived HTTP connections, self-adjusting algorithms don't have time to self-adjust before exit.
A delayed ACK is a bet. You're betting that the other end is going to respond with useful data before the delayed ACK timer runs out. If it does, you won the bet. If it doesn't, you lost. Nothing checks whether you're on a losing streak. But it takes a few round trips to make that decision.
When I was working on this, the object was to get from appallingly bad to acceptable performance without too much complexity. Today, people crank up things like HTTP/3 to get a few percent more performance at the cost of greatly increased complexity.
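As a toy illustration of that bet-keeping idea (not how any real TCP stack is implemented): count how often the delayed-ACK timer fires with nothing to piggyback on, and stop delaying once the losses pile up.

    /* Toy bookkeeping for the delayed-ACK "bet" -- purely illustrative,
     * not taken from any real TCP stack. */
    struct ack_bet {
        int losses;      /* timer fired with no response data to ride on */
        int delay_acks;  /* 1 = keep delaying ACKs, 0 = ACK immediately */
    };

    /* Response data arrived in time to carry the ACK: the bet was won. */
    void record_win(struct ack_bet *b)
    {
        b->losses = 0;
    }

    /* The delayed-ACK timer expired with nothing to piggyback on. */
    void record_loss(struct ack_bet *b)
    {
        /* A few lost bets in a row: stop delaying on this connection. */
        if (++b->losses >= 3)
            b->delay_acks = 0;
    }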
> In a world of short-lived HTTP connections, self-adjusting algorithms don't have time to self-adjust before exit
We could potentially have the kernel scope these heuristics to a process or process group, since there's no real reason for them to live in the network stack with their context restricted to a single connection. Like, if most packets on a system are being generated by a few nginx processes, and most of them seem to be tiny (or are on a losing streak for any other kind of bet), enable Nagle's (or any other relevant optimization) for any connections those nginx processes create?
Reset-packet injection is popular in the wild, so I'm optimistic about end-to-end authenticated UDP transports, and about smarter congestion control that doesn't take the network down just to see if it will break. Complexity can be moved to a proxy, and simple applications don't care about tuning anyway.
tcp_autocorking - BOOLEAN
Enable TCP auto corking :
When applications do consecutive small write()/sendmsg() system calls,
we try to coalesce these small writes as much as possible, to lower
total amount of sent packets. This is done if at least one prior
packet for the flow is waiting in Qdisc queues or device transmit
queue. Applications can still use TCP_CORK for optimal behavior
when they know how/when to uncork their sockets.
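For anyone who hasn't run into corking before, explicit use of TCP_CORK from an application looks roughly like this (a Linux-specific sketch, error handling omitted; the kernel also force-flushes a corked socket after about 200 ms):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Hold small writes back with TCP_CORK, then uncork to flush them
     * out as full packets. */
    void send_corked(int fd, const char *hdr, const char *body)
    {
        int on = 1, off = 0;
        setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
        send(fd, hdr, strlen(hdr), 0);      /* queued, not yet on the wire */
        send(fd, body, strlen(body), 0);
        setsockopt(fd, IPPROTO_TCP, TCP_CORK, &off, sizeof(off)); /* flush */
    }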
I have no idea! This is the first I'm hearing of corking, so I don't really know how it behaves in reality. It certainly seems pretty close to what I was talking about at first glance.
The proposal talks about a few applications which are better off with Nagle
off by default. Most of those applications have already turned off
Nagle, after deciding that the cognitive load of driving their small
write system calls through a single internal buffering layer was too
complicated (that's ssh, that's most HTTP services, etc.). In that
software, Nagle was manipulated by a developer after systematically
studying and modifying the application as a whole.
But applying it to all applications, just because a few applications
prove Nagle bad for them? That is backwards. The proposal needs to prove
that the MAJORITY of the application ecosystem is improved by disabling
Nagle.
I strongly doubt it is improved. I suspect the majority of software is
different from the few well-known applications that disable Nagle -- and
I'm sure a few intentionally leave Nagle enabled -- and furthermore I
suspect the majority of software gains full-system benefits from this
'teeny buffer bloat' layer.
It mostly has to do with what the internal IO subsystem of a program
looks like. Does it use stdio, does it use raw writes, does it use BIO,
etc.? (That's where short writes come from: intersecting layers of APIs.)
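As a concrete example of the pattern being described: a program that pushes a small header and body through separate unbuffered write() calls hands the kernel two tiny segments unless something (Nagle, autocorking, or a userspace buffer) coalesces them. A minimal sketch:

    #include <string.h>
    #include <unistd.h>

    /* Two back-to-back small write()s: without Nagle, autocorking, or a
     * userspace buffer in between, these can go out as two tiny packets. */
    void naive_send(int fd)
    {
        const char *hdr  = "LEN 5\r\n";
        const char *body = "hello";
        write(fd, hdr, strlen(hdr));
        write(fd, body, strlen(body));
    }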
So I suspect "Nagle always bad" would need to be disproven before we
give people a dangerous knob -- which a segment of the user community
would toggle, and thus increase our cognitive load when trying to
diagnose their vague bug reports in the future...
A lot of stuff will have this already, since it's an option that can be specified when opening the listening socket; for example Apache httpd, nginx, and OpenLiteSpeed all do this.
Interestingly enough, I've seen other projects switch from GitHub to email-based patch submission (still using git behind the scenes, though). If I understood correctly, it's to receive only higher-effort contributions.
It's built into the socket library, so most high performance web apps already manually enable TCP_NODELAY. This just allows you to force it OS-wide.
Linux used to have something similar, called TCP low-latency mode, but that flag is no longer functional. Now there are various distribution-specific options, or you can rebuild your kernel with several related build options to achieve the same effect.