What about using Apache Beam[0] as an abstraction over these stream processing frameworks? In a recent Software Engineering Daily podcast[1], Frances Perry recommends using Flink as the runner for the Dataflow model.
The problem with Beam is that it is an abstraction layer.
This sounds like a stupid comment, but when the underlying services are all evolving as quickly as they do in this space, programming against an abstraction layer means you need to wait for (often sorely needed) new features to be exposed.
(Frances Perry is an engineer on Beam at Google, so it would be surprising if she recommended against it)
Also, these distributed processing platforms are extraordinarily leaky abstractions: for anything non-trivial, you really need a good low-level understanding of your system in order to tune, debug, optimize, and implement. There are definitely other benefits to a unified API, but making it seamless to switch underlying implementations is probably not one of them...
Do you have examples of such features? Most production deployments of these kinds of things don't upgrade THAT often.
You have a point that abstraction layers can limit you, since they add another compatibility layer, but looking at what's done in actual practice, I don't see it being that bad.
If anything, it should be evaluated on a case by case basis.
Flink is actually better tech overall for my use case, but a lot of customers want spark streaming since it's already installed. Having beam where we can do both is kinda nice.
We had to abandon the Cloudera (and Hortonworks) Hadoop distributions because they didn't ship R support for Spark quickly enough (in Spark 1.3 or 1.2, or whatever version that was).
We jumped from 1.5 to 1.6 because of algorithms in MLlib (although that turned out to be a bit of a disappointment).
Most organizations I see tend to just wait... this sounds more like a one-off to me. You're at a unique employer if the data scientists have that much power.
Great article, although it would be nice to also include the cost (price) of running these systems. For instance, with our own custom solution, we were able to save 100M+ records (about 100 GB) a day on one small machine for $10 total (processing, disk, and backup costs), which is why I'm curious how the others compare. See https://www.youtube.com/watch?v=sG5qtN8E-6Q and https://www.youtube.com/watch?v=x_WqBuEA7s8 .
I work on IBM Streams, which was discussed and dismissed. The author is correct that we need to have more public benchmarks. I think our system should be in the upper right of the high-level view, but of course I would think that, as someone who works on it.
This benchmark was made at the request of a potential customer. Our implementation scaled to 200 "lanes." The other systems tested did not scale past 50. Unfortunately, that's as much as I feel I can say until I speak with some of the people involved in the comparison.
I've done a huge amount of work on Godot2 [1] (a Node.js port of Riemann) over the past couple of weeks, updating it to the modern Streams2 syntax in Node. It's really lightweight and deployable on all sorts of platforms, and it includes many windowing features.
I also updated the dashboard to match [2], including Riemann's query language [3].
I haven't had a chance to update the docs yet, but the tests have a ton of examples [4]. Reactors are just Transform streams now so it's possible to use any duplex/transform stream in the pipeline. Docs coming ASAP.
Depending on your requirements, a single-node solution like PipelineDB may suffice; it's a PostgreSQL extension that lets you write streaming SQL queries.
Jeff from PipelineDB here. We are also rolling out a SaaS product based on PipelineDB called Stride (stride.io) that enables this type of computation via an API with little to no overhead.
Is there a reference regarding the number of nodes that can be deployed in the cluster, and are feasible to maintain? From the docs, I gather that "[a]ll DDL/non-stream DML statements are executed in a distributed transaction on all nodes in the cluster and committed via two-phase commit." [1]
Doesn't that get in the way of low latency and availability?
If you can tolerate some latency on writes, AWS Kinesis is perfectly serviceable. You can write your own stream processing logic in a number of languages, and the KCL/SDK provided by Amazon does most of the plumbing. Eventually, you will probably want to move to something more sophisticated, though.
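To give a feel for how little of that plumbing you write yourself: with the KCL, you essentially supply a per-batch callback and the library handles shard leases, rebalancing, and checkpoint storage. The sketch below is plain Python showing only the shape of that callback; the names and record layout are simplified stand-ins, not the actual amazon_kclpy signatures.

```python
import base64
import json

# Collected results, standing in for your real business logic's side effects.
processed = []

def handle(event):
    """Business logic for one decoded event (placeholder)."""
    processed.append(event)

def process_records(records, checkpoint):
    """Roughly what a KCL record processor's per-batch callback does.

    `records` is a list of dicts with base64-encoded payloads, similar to
    how Kinesis delivers data; `checkpoint` marks progress so a restarted
    worker resumes after the last processed sequence number.
    """
    for record in records:
        payload = json.loads(base64.b64decode(record["data"]))
        handle(payload)
    if records:
        # Checkpoint only after the whole batch succeeded.
        checkpoint(records[-1]["sequenceNumber"])
```

The real KCL also calls you on startup and shutdown, but the batch callback above is where nearly all application code lives.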
You can try Striim (shameless plug: I work there). You build all your stream processing pipelines in a simple, SQL-like language in a nice web-based UI. It also has built-in adapters to convert log data, JSON, XML, database change capture, and other complex data formats into in-memory streams.
The product also comes with dashboards so you can do ad-hoc analytics and data visualization on your streams and windows.
It has support for stream replay, built on top of Kafka-backed persistent streams. It has a commercial license, but we're pretty flexible with startups.
It still requires Java and Kafka, but Kafka Streams is a library that lets you build stream processing jobs without a centralized cluster, as long as you have Kafka running. Just embed the jar and run your class on as many machines as you like; the instances will self-organize and balance. Worth checking out.
Yeah... the Kafka Streams library does look nice, especially since they don't care how you run the thing.
It's a bit harder to dive into Java, though.
The thing I have a use case for in stream processing is basically a glorified word count, just over a realtime seven-day window. The data processing code is the easy part; the tooling/runtime is the hard part.
It's actually pretty easy to set up a multi-node cluster if you don't worry about YARN/Mesos and just use Spark's standalone cluster manager. I think it's about one config entry per node.
It doesn't solve the data storage problem, so if you need that, you need a way to map the same files to the same path on every node (e.g., a network drive, rsync).
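For reference, the "one config entry per node" bit really is just a hostname list in standalone mode (layout as in the Spark 1.x/2.x docs; the hostnames below are made up):

```
# conf/slaves on the master node: one worker hostname per line
worker-1.example.internal
worker-2.example.internal

# then launch the master and all workers (requires passwordless SSH
# from the master to each worker):
#   sbin/start-all.sh
```

Workers register with the master automatically; adding a node is adding a line and rerunning the start script.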
If you are just doing streaming it might not matter though.
My use case is 100% streaming. I actually implemented what I'm trying to do with some basic Python code that runs against archived logs, but I reprocess the last week's worth of data from scratch at each iteration. It only takes 30 seconds to run, but this method won't work for the 100x larger stream I want to do the same thing for.
The ability to run the data pipeline in realtime, but also replay older data against new code using something like Kafka, would be a huge plus.
Really, the main thing my code is missing is the sliding window implementation, so I can either port to Spark Streaming, which has all that built in, or just implement my own window code.
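For what it's worth, the roll-your-own route can stay pretty small if the state fits in memory: bucket the stream by time and sum the buckets inside the window. A minimal sketch (bucket size and window length are arbitrary parameters, not anything from your setup):

```python
from collections import Counter, defaultdict

class SlidingCounter:
    """Word counts over a sliding window made of fixed-size time buckets.

    E.g. a 7-day window with 1-hour buckets: bucket_seconds=3600,
    window_buckets=7*24. Buckets that fall out of the window are
    dropped lazily the next time totals are requested.
    """

    def __init__(self, bucket_seconds, window_buckets):
        self.bucket_seconds = bucket_seconds
        self.window_buckets = window_buckets
        self.buckets = defaultdict(Counter)  # bucket index -> word counts

    def add(self, word, ts):
        """Record one occurrence of `word` at unix time `ts`."""
        self.buckets[ts // self.bucket_seconds][word] += 1

    def totals(self, now):
        """Counts over the window ending at `now`, evicting stale buckets."""
        newest = now // self.bucket_seconds
        oldest = newest - self.window_buckets + 1
        for b in [b for b in self.buckets if b < oldest]:
            del self.buckets[b]
        total = Counter()
        for counts in self.buckets.values():
            total.update(counts)
        return total
```

The trade-off versus recomputing from scratch is that memory is bounded by (window length / bucket size) x distinct words, and each new event is O(1), which is what makes the 100x larger stream feasible.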
You can replay streams against Spark, too. streamingContext.textFileStream will stream data from files dumped in a directory - to replay them, just dump them there again.
Figuring out how to do stateful processing is a little tricky, but updateStateByKey seems to do what I need: dedup the output per key for a time period t. Though some recommendations are just to use something like Redis or memcached, which would work.
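The Redis approach boils down to membership checks on (key, time bucket) pairs with a TTL. A minimal in-memory stand-in for that idea (the class and method names are made up; in production the `seen` set would live in Redis, e.g. `SET key NX EX ttl`, so state survives restarts):

```python
class WindowedDeduper:
    """Emits each key at most once per period of `period_seconds`.

    State is an in-memory set of (key, period index) pairs; swap it
    for Redis/memcached to share state across workers and restarts.
    """

    def __init__(self, period_seconds):
        self.period_seconds = period_seconds
        self.seen = set()  # {(key, period index)}

    def should_emit(self, key, ts):
        """True the first time `key` appears in the period containing `ts`."""
        bucket = (key, ts // self.period_seconds)
        if bucket in self.seen:
            return False
        self.seen.add(bucket)
        return True
```

Unlike this sketch, the external-store version gets eviction for free via key expiry, which is most of the argument for Redis/memcached here.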
You could pretty easily batch-stream it to S3 for ridiculously cheap. I posted a video of our setup elsewhere in the thread. So overall I'd say this is a pretty good idea, especially since you can do some cool stuff with RxJS.
0: http://beam.incubator.apache.org/
1: http://softwareengineeringdaily.com/2016/08/19/apache-beam-w...