A Wait-Free Queue as Fast as Fetch-And-Add [pdf] (chaoran.me)
169 points by EvgeniyZh on July 17, 2016 | 31 comments


I played around with wait-free algorithms some time ago but never got anywhere meaningful.

A wait-free queue for multiple consumers and multiple producers would be kind of the holy grail of wait-free data structures. Is it MPMC, though? I skimmed through the PDF and could not find a definitive claim…

Also, big thumbs up to Chaoran for putting the implementation on GitHub: https://github.com/chaoran/fast-wait-free-queue The code looks like it is easy to follow and it's MIT licensed.


Yes, it's MPMC. Progress guarantees are only really meaningful for multiple readers/writers. (If you have only one reader/writer, and it dies, who cares if the writers/readers on the other side keep working? Your system's hosed anyway.)


Ah! I thought I recognised that name/domain.

He previously saved my bacon with `node-finish` - a Node.js library that lets you fire off a bunch of tasks and wait for _all_ to finish before executing a callback.

What a hero.


Promise.all?


But for callbacks, which is much more of a pain.

There was a time before Promises, you know.


OP said "recently".

Even before Promises, wrapping up some callback functions and tracking async task state isn't rocket surgery.


    > OP said "recently".
I did? I think you misread my "previously".

Regardless, I was new to JS and not up to the task of implementing it myself. @chaoran's library did exactly what I needed.

An appreciative note didn't need to turn into criticism of the project's worth.


So I just recently learned how a rather simple ring buffer implementation is lock-free and doesn't even need primitives like CAS (although it does need memory barriers).

In spite of this, all the literature about lock-free queues that I see (including this one) is for some kind of linked structure. Any reason for that?

I guess one big limitation of the ring buffer is that it only works for a single reader and a single writer. Although maybe you can work around that using CAS primitives...
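
For concreteness, here is roughly the kind of thing I mean - a minimal C11 sketch (all names, the capacity, and the element type are made up for illustration):

    /* Single-producer/single-consumer ring buffer: no CAS, just one
       atomic load and one atomic store per operation, with
       acquire/release ordering acting as the memory barriers. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define CAP 1024  /* power of two, arbitrary for this sketch */

    struct spsc {
        _Atomic size_t head;  /* written only by the consumer */
        _Atomic size_t tail;  /* written only by the producer */
        int buf[CAP];
    };

    bool spsc_push(struct spsc *q, int v) {
        size_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
        size_t h = atomic_load_explicit(&q->head, memory_order_acquire);
        if (t - h == CAP) return false;  /* full */
        q->buf[t % CAP] = v;
        /* release: the element must be visible before the new tail */
        atomic_store_explicit(&q->tail, t + 1, memory_order_release);
        return true;
    }

    bool spsc_pop(struct spsc *q, int *v) {
        size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
        size_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
        if (t == h) return false;  /* empty */
        *v = q->buf[h % CAP];
        atomic_store_explicit(&q->head, h + 1, memory_order_release);
        return true;
    }

Note that both operations finish in a bounded number of steps - there is no retry loop anywhere - which is part of why I'm asking where this sits on the lock-free/wait-free spectrum.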


I'm not sure you're aware of this, but there's a distinction between wait-free and lock-free.

http://rethinkdb.com/blog/lock-free-vs-wait-free-concurrency...

This might be relevant to answering your question.


I became aware of the distinction when I skimmed the paper. But I couldn't figure out what it meant in this case.

Specifically, does the "every thread" requirement rule out algorithms where only one reader and one writer are allowed? If not, then this ring buffer is wait-free, since there is no CAS-and-retry step. (Of course, an operation can fail because the queue is empty/full, but that still counts as completion.)


There are several aspects to consider when talking about a concurrent queue:

* multiple/single producer, multiple/single consumer

* progress guarantees (wait-free, lock-free)

* bounded vs. unbounded

* whether it's double-ended or single-ended

* memory reclamation being dependent on a GC or not

* which atomic operations they need (e.g. some require double-CAS)

What you mention is an SPSC bounded queue. This is about an MPMC unbounded queue. And from skimming the paper, it seems it does not require a GC either.

So it's far more powerful.

Of course, SPSC queues still have their uses even if we have a good MPMC queue; they generally incur less overhead.


Wouldn't the ring buffer queue also be wait-free in this sense (except that the number of unread items in the queue is restricted)?


The single-writer, single-reader limitation is a big deal. You can work around it, but it's no longer the same thing: what makes such queues so fast is that they don't do CAS or other atomic read-modify-write operations at all. Such queues are extremely useful when you can ensure a single reader and a single writer, but that doesn't remove the need for more general-purpose solutions. More general solutions will require more synchronization, and if we don't want them to be fixed-size structures, they will usually require some kind of linked structure.

For practical examples of all of the above, see Boost's lockfree structures: http://www.boost.org/doc/libs/1_61_0/doc/html/lockfree.html


A ring buffer can work for a single writer and multiple readers, in the model that each element is received by all the readers. Each reader just needs their own read index and the writer probably needs to check all the read indices before writing a new element, to ensure it doesn't overwrite data still being read.

And indeed, if you want all readers to get all elements, I think a ring buffer is easier than a linked structure.
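
A rough C11 sketch of that broadcast variant (names, sizes, and the element type are invented for illustration):

    /* One writer, NREADERS readers; every reader sees every element.
       Each reader owns its read index; the writer must check all of
       them so it never overwrites a slot someone is still reading. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define CAP 256
    #define NREADERS 4

    struct bcast {
        _Atomic size_t wr;            /* written only by the writer */
        _Atomic size_t rd[NREADERS];  /* rd[i] written only by reader i */
        int buf[CAP];
    };

    bool bcast_push(struct bcast *q, int v) {
        size_t w = atomic_load_explicit(&q->wr, memory_order_relaxed);
        for (int i = 0; i < NREADERS; i++) {
            size_t r = atomic_load_explicit(&q->rd[i], memory_order_acquire);
            if (w - r == CAP) return false;  /* slowest reader is full */
        }
        q->buf[w % CAP] = v;
        atomic_store_explicit(&q->wr, w + 1, memory_order_release);
        return true;
    }

    bool bcast_pop(struct bcast *q, int reader, int *v) {
        size_t r = atomic_load_explicit(&q->rd[reader], memory_order_relaxed);
        size_t w = atomic_load_explicit(&q->wr, memory_order_acquire);
        if (r == w) return false;  /* nothing new for this reader */
        *v = q->buf[r % CAP];
        atomic_store_explicit(&q->rd[reader], r + 1, memory_order_release);
        return true;
    }

The writer's full check is O(readers), but the readers stay completely independent of each other.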


A ring buffer has a limit to the number of unread items before you either block or start discarding. Different use cases I suppose.


I think the fixed-size limitation is the least important thing in practice.

If your queue is filling up arbitrarily far, then you are almost certainly in trouble anyway, and want some limitation strategy. The ring buffer gives you this quite naturally.

I suppose there could be cases where you did not want to pre-commit to a buffer size. E.g. you have lots of queues which you expect to be short, but which might grow large in rare cases. But I have never actually seen such a case.


I would expect it to be a common use case in actor-based systems like Erlang, Akka, etc. It's perfectly supported to have 100k+ processes running in an Erlang VM, but the vast majority of them will have an empty message queue at any given time.


Indeed, large numbers of small queues are an important case. But large numbers of small queues, every one of which might unexpectedly but legitimately need to grow large, is not.

I don't know Erlang or Akka first-hand, but I know that Go channels are inspired by Erlang. Channels can be very numerous and are usually empty, but they are fixed-size. In the rare occasion that you think a channel needs more buffering, you must say so explicitly.


Fixed-size vs. variable-size, maybe?


For context, traditional wait-free queues are only wait-free on one side, writing or reading. The other side has to spin in an atomic fetch loop, waiting for the wait-free multi-step operation to complete.

The doubly wait-free queue from the paper uses a technicality to achieve its status. The spinning operation is replaced with a loop over all the threads doing the opposite (multi-step) operation, finishing it for threads that may be blocked in the middle of the steps. It's not that it doesn't spin, but the spinning is bounded by the number of threads. Edit: Actually, spinning is bounded by O(#threads^4) according to the paper.


Funny, I had a coworker several years ago who built this exact algorithm. The performance of our system increased by literally orders of magnitude.

I've been slipping it into systems for a couple years now :D


Did his algorithm include the helper concept that makes this wait-free, or are you just talking about the fetch-and-add queue part?


Is it better than: http://www.1024cores.net/home/lock-free-algorithms/queues/no...

Which is what Rust mpsc uses. It has a big advantage in being short and easy to understand.


Dmitry Vyukov's queue is awesome, but as the author himself notes, a slow writer can prevent the reader from making progress, and as such it is not even non-blocking.
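
For anyone curious where that happens, here is my own simplified C11 rendering of the queue (not the author's code; see the link above for the original):

    /* Vyukov's non-intrusive MPSC queue, simplified. The single
       consumer can be blocked by a producer stalled between the two
       steps of push. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdlib.h>

    struct node {
        _Atomic(struct node *) next;
        int value;
    };

    struct mpsc {
        _Atomic(struct node *) head;  /* producers swap themselves in */
        struct node *tail;            /* owned by the single consumer */
    };

    void mpsc_init(struct mpsc *q) {
        struct node *stub = calloc(1, sizeof *stub);  /* next == NULL */
        atomic_store(&q->head, stub);
        q->tail = stub;
    }

    void mpsc_push(struct mpsc *q, int v) {
        struct node *n = calloc(1, sizeof *n);
        n->value = v;
        /* step 1: atomically take our place at the head of the list */
        struct node *prev =
            atomic_exchange_explicit(&q->head, n, memory_order_acq_rel);
        /* if this producer stalls HERE, the node is unreachable from
           the tail: the reader (and every item pushed after this one)
           is stuck behind the gap. This is the window the author
           documents, and why the queue is not non-blocking. */
        /* step 2: make the node reachable from the previous one */
        atomic_store_explicit(&prev->next, n, memory_order_release);
    }

    bool mpsc_pop(struct mpsc *q, int *out) {
        struct node *tail = q->tail;
        struct node *next =
            atomic_load_explicit(&tail->next, memory_order_acquire);
        if (next == NULL)
            return false;  /* empty, or a producer is in the window */
        *out = next->value;
        q->tail = next;
        free(tail);  /* the old node now plays the role of the stub */
        return true;
    }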


It's MPMC.


Hmm… why bother with a complicated approach now that transactional memory (Intel TSX) is available in hardware? What are the advantages and disadvantages of this approach vs. using TSX?


Hardware transactions as implemented by TSX are not wait-free. There is no guarantee that any given TSX transaction will ever commit, so you need a non-transactional fallback path to guarantee forward progress.
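
The usual shape of that fallback, as a hedged sketch (real RTM intrinsics from immintrin.h, but a toy critical section; needs a TSX-enabled CPU and -mrtm):

    #include <immintrin.h>
    #include <pthread.h>
    #include <stdatomic.h>

    static pthread_mutex_t fallback = PTHREAD_MUTEX_INITIALIZER;
    static _Atomic int fallback_held;  /* read inside the transaction */
    static long counter;

    void increment(void) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            /* subscribe to the lock: if anyone holds (or later takes)
               the fallback path, this transaction aborts */
            if (atomic_load_explicit(&fallback_held, memory_order_relaxed))
                _xabort(0xff);
            counter++;  /* the "critical section" */
            _xend();
            return;
        }
        /* the transaction aborted (and nothing says a retry would ever
           succeed), so take the non-transactional path */
        pthread_mutex_lock(&fallback);
        atomic_store_explicit(&fallback_held, 1, memory_order_relaxed);
        counter++;
        atomic_store_explicit(&fallback_held, 0, memory_order_relaxed);
        pthread_mutex_unlock(&fallback);
    }

Note the fallback path is a plain lock, so the composite is blocking: the hardware gives you opportunistic speed, not a progress guarantee.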


Do you know the relative performance of POWER7 and Haswell? The implementation is not wait-free on POWER7 because of a HW limitation, and it has significantly lower throughput than on Haswell in the benchmarks.


Because the TSX feature isn't stable and Intel is still disabling it in some recent processors.


Not "recent". It's disabled in Haswell where it debuted, but it's OK from Broadwell on. From wikipedia, with citation: The bug was fixed in F-0 steppings of the vPro-enabled Core M-5Y70 Broadwell CPU in November 2014.[22] We're talking about end of 2014.


So how different is this from something like https://github.com/chanks/que



