I'm not sure I see what's novel about this - maybe it's a verbiage thing around "wait free," but if they're atomically updating the top pointer and linked list, there will be lock contention on writes, and similarly when marking an item popped, on reads. I suppose the contention is bounded by the number of readers or writers, but I wouldn't consider that wait free (again, that could just be a verbiage thing). But, more to the point, this is just a slight twist on how Kafka works (stack vs log / queue, but same with the pointer holding place and a cleanup operation), I don't really see it as particularly novel...perhaps I'm missing something?
A "wait free" data structure has a specific technical definition so yes, in some ways the novelty in this is that it meets that verbiage.
Lock free stacks have been around since at least the 80s but I haven't seen a general wait free stack before (though I'm no expert).
In neither this paper nor the classic lock-free stacks do you have lock contention on writes, as you are using CAS operations.
I'm having a little trouble parsing the algo in this paper, but what they seem to have added that is novel is that their cleanup function will always complete in a finite number of steps, thus making the whole thing both wait free and bounded, which is pretty neat.
In any case, with wait-free structures implementation details generally matter a lot, and stacks are notoriously hard to handle because of the contention around the top node, as opposed to something like the append-only log Kafka uses, so that particular comparison is not fair.
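For context, the classic lock-free (but not wait-free) stack usually referenced here is the Treiber stack. A minimal Java sketch, assuming java.util.concurrent's AtomicReference (the class and method names are just for illustration): no lock is ever held, but a thread can in principle retry its CAS forever under contention, which is exactly why it's lock-free rather than wait-free.

```java
import java.util.concurrent.atomic.AtomicReference;

// Minimal Treiber-style lock-free stack: push/pop retry a CAS on the
// top pointer. No thread ever holds a lock, but a thread may retry
// indefinitely under contention, so it is lock-free, not wait-free.
class TreiberStack<T> {
    private static final class Node<T> {
        final T value;
        Node<T> next;
        Node(T value) { this.value = value; }
    }

    private final AtomicReference<Node<T>> top = new AtomicReference<>();

    public void push(T value) {
        Node<T> node = new Node<>(value);
        Node<T> oldTop;
        do {
            oldTop = top.get();
            node.next = oldTop;
        } while (!top.compareAndSet(oldTop, node)); // failed CAS => someone else succeeded
    }

    public T pop() {
        Node<T> oldTop;
        do {
            oldTop = top.get();
            if (oldTop == null) return null; // empty stack
        } while (!top.compareAndSet(oldTop, oldTop.next));
        return oldTop.value;
    }
}
```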
> In neither this paper nor the classic lock-free stacks do you have lock contention on writes, as you are using CAS operations.
Given that real-world locks are most often implemented using CAS, is this distinction a valuable one? In either case you have contention on a memory location between multiple threads trying to modify that memory location.
Yes, it is a real distinction. In lock-free algorithms, progress is guaranteed. If they are not wait-free, then they tend to update some values, then try to commit those values with CAS. If the CAS fails, they try again. Wait-free algorithms do the first part, but on the failure, they don't try again, they go off and do something different. All threads can make progress, even if some thread is suspended or dies; no thread has exclusive access to modify the state.
In lock-based algorithms, that is not the case; the thread with the lock can be suspended, killed, or go off on a wild goose chase, and the progress of the algorithm cannot continue. Only that thread may modify the state, and the rest must wait.
You did not imply otherwise and your description is great, but I wanted to clarify that an algorithm that retries a CAS can still be lock-free (just not wait free) if the failed CAS implies some other thread made progress.
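A minimal sketch of that distinction, assuming Java's AtomicLong (the CasCounter name is just for illustration): the loop below is lock-free because a failed CAS can only mean some other thread's CAS succeeded, so the system as a whole always makes progress; it is not wait-free because this particular thread has no bound on its retries.

```java
import java.util.concurrent.atomic.AtomicLong;

// Lock-free but not wait-free: a failed CAS implies another thread
// advanced the counter, so system-wide progress is guaranteed even
// though an individual thread may retry indefinitely.
class CasCounter {
    private final AtomicLong value = new AtomicLong();

    long increment() {
        long current;
        do {
            current = value.get();
        } while (!value.compareAndSet(current, current + 1));
        return current + 1;
    }

    long get() { return value.get(); }
}
```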
As far as real-world performance goes, it depends on the implementation of the lock and the architecture, but the CAS intrinsics on the architecture are almost certainly going to be the most performant way to do memory synchronization, so either the lock elides to those or it doesn't; if you are doing the CAS yourself, you know exactly what you are using.
For the purposes of these algorithm discussions, though, it can matter a lot because of the semantics of the lock. To prove something is lock-free you have to prove that some thread will always complete in a finite number of steps, and lots of lock semantics don't allow for that.
For wait-free, all of the threads must complete in a finite number of steps, and I don't know of any locking semantics that can allow for this property (but I'm definitely not an expert and usually muddle through lock/algo papers).
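One operation that does satisfy the wait-free property on common hardware is atomic fetch-and-add. A sketch assuming Java's AtomicLong (PhaseTicket is a made-up name): getAndIncrement typically compiles to a single hardware fetch-and-add, so every calling thread completes in a bounded number of steps, no retry loop involved.

```java
import java.util.concurrent.atomic.AtomicLong;

// Each caller gets a unique, monotonically increasing number.
// On most hardware getAndIncrement is a single fetch-and-add
// instruction (e.g. LOCK XADD on x86), so every thread finishes
// in a bounded number of steps -- wait-free, unlike a CAS loop.
class PhaseTicket {
    private final AtomicLong globalPhase = new AtomicLong();

    long nextPhase() {
        return globalPhase.getAndIncrement();
    }
}
```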
I think it's important to recognize that lock-free/wait-free is more important for real-time systems than it is for "performance" (in terms of latency of a request). Real-time systems are often practically slower than a non-real-time version of the same thing, but they have a more consistent worst case. This is where lock-free/wait-free are especially useful.
How useful is a lock-free (not wait-free) approach in a real-time context? The point is that any thread contending on lock-free data is liable to starve, so the worst-case behavior is still quite bad, no? The only benefit I see is freedom from priority inversion, and that's a fixable problem anyway. Are there other properties of the system that you somehow use to get real-time guarantees?
On the other hand, if you combine interrupt-disabling with a fair lock, then your lock wait time can be bounded in every thread. This is probably only an option inside the kernel, though.
In userspace, and with more threads than CPUs, the lock-free approach probably will save you a lot of context-switching time. It'll probably also be at least as fine-grained as the most fine-grained locking scheme you can come up with, probably with less overhead.
So I'd say that the lock-free approach is an optimistic strategy, and I don't see the benefit except for performance. I say this as someone who really likes lock-free programming.
> The only benefit I see is freedom from priority inversion, and that's a fixable problem anyway. Are there other properties of the system that you somehow use to get real-time guarantees?
Priority inversion was what I was thinking of. When you say it's a fixable problem, the common way to fix it is to have a real-time scheduler deal with it. Lots of systems don't have real-time schedulers available to them, so lock-free algos make sense there and you deal with the starvation issue instead of the inversion issue.
That said, I'm no expert on real time systems and my use of lock free algos has fallen into the exact category you are talking about, which is wanting to avoid context switches in user land, but that is more a property of the OSes I use than the algorithms themselves.
Both lock-free and wait-free algorithms/implementations offer the same advantage, as a corollary of their definition: latency guarantees. These are central to real-time systems.
PS. The difference is that lock-free algorithms improve throughput _at the expense of single-thread latency_, and thus are more useful for satisfying system-wide throughput requirements, e.g. for a DBMS. Wait-freedom additionally guarantees freedom from starvation.
res = last.nextDone.compareAndSet((null, false), (myNode, false))
I fail to see how under contention the statement above is "wait-free".
Having globalPhase is another contention point (although atomic adds are wait free on most hardware, nowadays).
Not certain if 'tid' is threadId, but having an AtomicReferenceArray (all notation appears to stem from java.util.concurrent) with an arbitrary length is certainly not a wait-free operation, hence the announce would fail to be wait-free. Using CPU-id instead is a hard one as well unless all threads are core-bound. Moreover, if the array (announce) is not fixed-length, the entire algorithm cannot finish in a finite number of steps.
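For what it's worth, here is a hypothetical sketch of how a fixed-length announce array indexed by a pre-assigned tid could keep the helping scan bounded (the names and structure here are my assumptions for illustration, not the paper's actual algorithm): with a fixed bound on the number of threads, a helper can scan every slot in a constant number of steps, which is the kind of bound a wait-freedom argument needs; an arbitrarily growing array would break it.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Hypothetical fixed-size announce array, indexed by a pre-assigned
// thread id. The write is a single volatile store (wait-free) and the
// helping scan visits a fixed number of slots (bounded).
class Announce {
    private final AtomicReferenceArray<Runnable> slots;

    Announce(int maxThreads) {
        slots = new AtomicReferenceArray<>(maxThreads);
    }

    void announce(int tid, Runnable op) {
        slots.set(tid, op); // one volatile write: completes in bounded steps
    }

    void helpAll() {
        for (int i = 0; i < slots.length(); i++) { // bounded scan
            Runnable op = slots.getAndSet(i, null);
            if (op != null) op.run();
        }
    }
}
```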