You can't turn off delayed ACKs and make them stay off, which is a related problem.
The Linux API for that is very strange. It only applies for a short period.
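(For reference, the option being alluded to here is presumably TCP_QUICKACK. A minimal sketch of the awkwardness, assuming a Linux socket and omitting error handling -- the flag isn't sticky, so applications end up re-applying it around every read:)

    /* Minimal sketch: on Linux, TCP_QUICKACK is not sticky -- the kernel can
     * quietly fall back to delayed ACKs after sending a few immediate ones,
     * so callers end up re-applying it around every receive. Error handling
     * omitted. */
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>
    #include <unistd.h>

    ssize_t recv_with_quickack(int fd, void *buf, size_t len)
    {
        int one = 1;
        /* Ask for immediate ACKs for data received around this read. */
        setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
        /* The flag can silently revert, so it is set again on the next call. */
        return read(fd, buf, len);
    }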
Delayed ACKs and the Nagle algorithm should never be on at the same time. The trouble is, they're controlled at opposite ends of the connection. You can turn off the Nagle algorithm at your end, but you want to turn off delayed ACKs at the other end. That's the practical problem.
Still, what's the use case for having multiple tiny messages in flight during one RTT? Games usually send their interactive traffic over UDP. If you have delayed ACKs off, you should never have a propagation delay of more than one RTT. It's that fixed timer in delayed ACKs, set to a value that made sense for keyboard echo, that can cause delays of more than one RTT.
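(For reference, turning off the Nagle algorithm at your own end is a single setsockopt call -- a minimal sketch, omitting error handling:)

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Disable the Nagle algorithm on a connected TCP socket. This only
     * affects the local send path; the peer's delayed-ACK behavior is
     * untouched, which is the asymmetry described above. */
    int disable_nagle(int fd)
    {
        int one = 1;
        return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
    }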
I find it hilarious that people want to turn off your algorithm because most of them don’t know that delayed ACKs are the “real problem”.
I’d much rather see delayed ACKs disabled as the new default vs. your algorithm being disabled by default.
I’ve seen too many applications not filling packets and sending tons of tiny packets. They’d benefit from your algorithm, even with delayed ACKs on… but people also confuse latency with throughput all the time (sometimes you want/need one more than the other, and languages like Go not giving programmers the ability to decide is frustrating, mainly because for a while those programs were the biggest offenders)…
I digress… I do find this whole thing rather amusing.
Delayed ACKs are not the "real problem". For the last 10 years I haven't encountered a single environment where Nagle's algorithm would be a net win. No heuristics, no fancy auto-sensing. It should be off by default.
If you're working in distributed systems, fintech, mobile network optimization, broadcasting, the first thing you should do is switch off Nagle's algorithm. Animats should focus on Second Life and stop holding on to what's now a very bad default.
It is not, in general, a net win. It is a preventative measure against things getting really bad.
My other work back then, on fair queuing, followed the same line - keep things from getting really bad just because something was slightly overloaded.
When I was working on this, funding was from DARPA, Defense Communications Agency, and such. They wanted networks to keep working under bad conditions. Maximum price/performance was far less important. The price of getting the last 10% in performance is usually complexity and often fragility.
See, this is why we can't have real conversations about this problem, because most people don't even understand what they are saying and just spout off dogma.
1. Just because you haven't "encountered a single environment" doesn't mean you haven't been in one or that you'd even know what to look for to know if you were in one where it would be a net win.
2. "If you're working in distributed systems," you likely have very fat, fast, and reliable pipes between your services. Nagle's algorithm is probably a Bad Thing[tm] in those situations. If you have a lot of Wi-Fi interference or dropped packets, Nagle's algorithm can be the difference between 50bps and 1mbps (assuming your application isn't filling packets), except that Delayed Acks prevents you from realizing all that.
There's no "one size fits all" solution, but you have control over Nagle's algorithm, you do not have control over Delayed Acks.
For anyone who doesn't know, HN user Animats is John Nagle, eponym of Nagle's algorithm.
My main mental association with that algorithm is always being asked about it in "make menuconfig" when I used to compile my own Linux kernels. One of relatively few networking concepts I can think of that's named after a person (along with Van Jacobson header compression).
> The trouble is, they're controlled at opposite ends of the connection. You can turn off the Nagle algorithm at your end, but you want to turn off delayed ACKs at the other end. That's the practical problem.
For an extant case of this practical problem chosen at random, I'd be curious to know-- what's the likelihood that it's just openssh on both ends of the connection?
I'd like to be able to confidently turn off Nagle's algorithm system-wide, but I'm always going to be concerned that I'll some day run a high-traffic application that depends on it without explicitly enabling it because it's been the default for so long.
I think this is one of the rare cases where I'd prefer the kernel to have a "magic" setting that went something along the lines of: "if a connection isn't explicitly setting its use of Nagle's algorithm, default it to off, but occasionally look at the connections in this category that are generating the most packets on the system, and turn Nagle's on for those connections if their packets are mostly small (or other heuristics)."
Potential waste from overly fragmented packets is going to be negligible on connections that don't represent a large portion of the system's traffic, so there's no reason to bother observing and tuning them (which matters if there are many of them and the cost of doing so becomes noticeable). It's really only the highest-traffic connections where Nagle's might help, and selectively turning it on for connections that seem to benefit from it would maintain backwards compatibility while still reaping the nodelay benefits for the majority of software that doesn't care.
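Purely as an illustration of the kind of heuristic being described (nothing like this exists in any kernel; the counters and thresholds are invented), the per-connection decision could look roughly like this:

    /* Hypothetical sketch of the heuristic described above -- not a real
     * kernel policy; the counters and thresholds are invented. */
    struct conn_stats {
        unsigned long long packets_sent;
        unsigned long long bytes_sent;
    };

    /* Return 1 if Nagle should be (re)enabled for this connection. */
    int should_enable_nagle(const struct conn_stats *c,
                            unsigned long long system_packets,
                            unsigned int mss)
    {
        if (c->packets_sent == 0 || system_packets == 0)
            return 0;
        /* Ignore connections that are a small share of system traffic;
         * observing them isn't worth the cost (5% is arbitrary). */
        if (c->packets_sent * 100 < system_packets * 5)
            return 0;
        /* Mostly-small packets: this connection would benefit from
         * coalescing, so turn Nagle back on. */
        return (c->bytes_sent / c->packets_sent) < mss / 4;
    }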
> I think this is one of the rare cases where I'd prefer if the kernel had a "magic" setting...
All this stuff should be automatic. The problem is that it can take a few round trips to discover what the application is doing. This comes up with "slow start", where you need some time to discover what's going on. In a world of short-lived HTTP connections, self-adjusting algorithms don't have time to self-adjust before exit.
A delayed ACK is a bet. You're betting that the other end is going to respond with useful data before the delayed ACK timer runs out. If it does, you won the bet. If it doesn't, you lost. Nothing checks whether you're on a losing streak. But it takes a few round trips to make that decision.
When I was working on this, the object was to get from appallingly bad to acceptable performance without too much complexity. Today, people crank up things like HTTP/3 to get a few percent more performance at the cost of greatly increased complexity.
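As a toy illustration of that bet-keeping idea (not how any real TCP stack is implemented): count how often the delayed-ACK timer fires with nothing to piggyback on, and stop delaying once the losses pile up.

    /* Toy bookkeeping for the delayed-ACK "bet" -- purely illustrative,
     * not taken from any real TCP stack. */
    struct ack_bet {
        int losses;      /* timer fired with no response data to ride on */
        int delay_acks;  /* 1 = keep delaying ACKs, 0 = ACK immediately */
    };

    /* Response data arrived in time to carry the ACK: the bet was won. */
    void record_win(struct ack_bet *b)
    {
        b->losses = 0;
    }

    /* The delayed-ACK timer expired with nothing to piggyback on. */
    void record_loss(struct ack_bet *b)
    {
        /* A few lost bets in a row: stop delaying on this connection. */
        if (++b->losses >= 3)
            b->delay_acks = 0;
    }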
> In a world of short-lived HTTP connections, self-adjusting algorithms don't have time to self-adjust before exit
We could potentially have the kernel scope these heuristics to a process or process group, since there's no real reason for them to live in the network stack with their context restricted to a single connection. Like, if most packets on a system are being generated by a few nginx processes, and most of them seem to be tiny (or are on a losing streak for any other kind of bet), enable Nagle's (or any other relevant optimization) for any connections those nginx processes create?
Reset-packet injection is popular in the wild, so I'm optimistic about end-to-end authenticated UDP transports, and about smarter congestion control that doesn't take the network down just to see if it will break. Complexity can be moved to a proxy, and simple applications don't care about tuning anyway.
tcp_autocorking - BOOLEAN
Enable TCP auto corking :
When applications do consecutive small write()/sendmsg() system calls,
we try to coalesce these small writes as much as possible, to lower
total amount of sent packets. This is done if at least one prior
packet for the flow is waiting in Qdisc queues or device transmit
queue. Applications can still use TCP_CORK for optimal behavior
when they know how/when to uncork their sockets.
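For anyone who hasn't run into corking before, explicit use of TCP_CORK from an application looks roughly like this (a Linux-specific sketch, error handling omitted; the kernel also force-flushes a corked socket after about 200 ms):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Hold small writes back with TCP_CORK, then uncork to flush them
     * out as full packets. */
    void send_corked(int fd, const char *hdr, const char *body)
    {
        int on = 1, off = 0;
        setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
        send(fd, hdr, strlen(hdr), 0);      /* queued, not yet on the wire */
        send(fd, body, strlen(body), 0);
        setsockopt(fd, IPPROTO_TCP, TCP_CORK, &off, sizeof(off)); /* flush */
    }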
I have no idea! This is the first I'm hearing of corking, so I don't really know how it behaves in reality. It certainly seems pretty close to what I was talking about at first glance.
The proposal talks about a few applications which are better off with Nagle
off by default. Most of those applications have already turned off
Nagle, after deciding that the cognitive load of driving their small
write system calls through a single internal buffering layer was too
complicated (that's ssh, that's most HTTP services, etc.). In that
software, Nagle was manipulated by a developer after systematically
studying and modifying the application as a whole.
But applying it to all applications, just because a few applications
prove Nagle bad for them? That is backwards. The proposal needs to prove
that the MAJORITY of the application ecosystem is improved by disabling
Nagle.
I strongly doubt it is improved. I suspect the majority of software is
different from the few well-known applications that disable Nagle -- and
I'm sure a few intentionally leave Nagle enabled -- and furthermore I
suspect the majority of software gains full-system benefits from this
'teeny buffer bloat' layer.
It mostly has to do with what the internal IO subsystem of a program
looks like. Does it use stdio, does it use raw writes, does it use BIO,
etc.? (That's where short writes come from: intersecting layers of APIs.)
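As a concrete example of the pattern being described: a program that pushes a small header and body through separate unbuffered write() calls hands the kernel two tiny segments unless something (Nagle, autocorking, or a userspace buffer) coalesces them. A minimal sketch:

    #include <string.h>
    #include <unistd.h>

    /* Two back-to-back small write()s: without Nagle, autocorking, or a
     * userspace buffer in between, these can go out as two tiny packets. */
    void naive_send(int fd)
    {
        const char *hdr  = "LEN 5\r\n";
        const char *body = "hello";
        write(fd, hdr, strlen(hdr));
        write(fd, body, strlen(body));
    }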
So I suspect "Nagle always bad" would need to be disproven before we
give people a dangerous knob -- which a segment of the user community
would toggle, and thus increase our cognitive load when trying to
diagnose their vague bug reports in the future...
A lot of stuff will have this already, since it's an option that can be specified when opening the listening socket; for example Apache httpd, nginx, and OpenLiteSpeed all do this.
Interestingly enough, I've seen other projects switch from GitHub to email-based patch submission (still using git behind the scenes, though). If I understood correctly, it's to receive only higher-effort contributions.
It's built into the socket library, so most high performance web apps already manually enable TCP_NODELAY. This just allows you to force it OS-wide.
Linux used to have something similar, called TCP low-latency mode, but that flag is no longer functional. Now there are various distribution-specific options, or you can rebuild your kernel with several related build options to achieve the same effect.