What about using Apache Beam[0] as an abstraction over these stream processing frameworks? In a recent Software Engineering Daily podcast[1], Frances Perry recommends using Flink as the runner for the Dataflow model.
The problem with Beam is that it is an abstraction layer.
This sounds like a stupid comment, but when the underlying services are all evolving as quickly as they do in this space, programming against an abstraction layer means you need to wait for (often sorely needed) new features to be exposed.
(Frances Perry is an engineer on Beam at Google, so it would be surprising if she recommended against it)
Also, these distributed processing platforms are extraordinarily leaky abstractions: for anything non-trivial, you really need a good low-level understanding of your system in order to tune, debug, optimize, and implement. There are definitely other benefits to a unified API, but making it seamless to switch underlying implementations is probably not one of them...
Do you have examples of such features? Most production deployments of these kinds of things don't upgrade THAT often.
You have a point that abstraction layers can limit you, since they add another compatibility layer, but looking at what's done in actual practice, I don't see it being that bad.
If anything, it should be evaluated on a case by case basis.
Flink is actually better tech overall for my use case, but a lot of customers want spark streaming since it's already installed. Having beam where we can do both is kinda nice.
We had to abandon the Cloudera (and Hortonworks) Hadoop distributions because they didn't ship R support for Spark quickly enough (in Spark 1.3 or 1.2, or whatever version that was).
We jumped from 1.5 to 1.6 because of algorithms in MLlib (although that turned out to be a bit of a disappointment).
Most organizations I see tend to just wait... this sounds more like a one-off to me. You're at a unique employer if the data scientists have that much power.
Great article, although it would be nice to also include the cost (price) of running these systems. For instance, with our own custom solution, we were able to save 100M+ records (about 100 GB) a day on one small machine for $10 total (processing, disk, and backup costs), which is why I'm curious how the others compare. See https://www.youtube.com/watch?v=sG5qtN8E-6Q and https://www.youtube.com/watch?v=x_WqBuEA7s8 .
I work on IBM Streams, which was discussed and dismissed. The author is correct that we need to have more public benchmarks. I think our system should be in the upper right of the high-level view, but of course I would think that, as someone who works on it.
This benchmark was made at the request of a potential customer. Our implementation scaled to 200 "lanes." The other systems tested did not scale past 50. Unfortunately, that's as much as I feel I can say until I speak with some of the people involved in the comparison.
I've done a huge amount of work on Godot2 [1] (a Node.js port of Riemann) over the past couple of weeks, updating it to the modern Streams2 syntax in Node. It's really lightweight and deployable on all sorts of platforms, and it includes many windowing features.
I also updated the dashboard to match [2], including Riemann's query language [3].
I haven't had a chance to update the docs yet, but the tests have a ton of examples [4]. Reactors are just Transform streams now so it's possible to use any duplex/transform stream in the pipeline. Docs coming ASAP.
Depending on your requirements, a single-node solution like PipelineDB may suffice; it's a PostgreSQL extension that lets you write streaming SQL queries.
Jeff from PipelineDB here. We are also rolling out a SaaS product based on PipelineDB called Stride (stride.io) that enables this type of computation via an API with little to no overhead.
Is there a reference regarding the number of nodes that can be deployed in the cluster, and are feasible to maintain? From the docs, I gather that "[a]ll DDL/non-stream DML statements are executed in a distributed transaction on all nodes in the cluster and committed via two-phase commit." [1]
Doesn't that get in the way of low latency and availability?
If you can tolerate some latency on writes, AWS Kinesis is perfectly serviceable. You can write your own stream processing logic in a number of languages, and the KCL/SDK provided by Amazon does most of the plumbing. Eventually, you will probably want to move to something more sophisticated, though.
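To give a feel for how little of that plumbing you write yourself: with the KCL, you essentially supply a per-batch callback and the library handles shard leases, rebalancing, and checkpoint storage. The sketch below is plain Python showing only the shape of that callback; the names and record layout are simplified stand-ins, not the actual amazon_kclpy signatures.

```python
import base64
import json

# Collected results, standing in for your real business logic's side effects.
processed = []

def handle(event):
    """Business logic for one decoded event (placeholder)."""
    processed.append(event)

def process_records(records, checkpoint):
    """Roughly what a KCL record processor's per-batch callback does.

    `records` is a list of dicts with base64-encoded payloads, similar to
    how Kinesis delivers data; `checkpoint` marks progress so a restarted
    worker resumes after the last processed sequence number.
    """
    for record in records:
        payload = json.loads(base64.b64decode(record["data"]))
        handle(payload)
    if records:
        # Checkpoint only after the whole batch succeeded.
        checkpoint(records[-1]["sequenceNumber"])
```

The real KCL also calls you on startup and shutdown, but the batch callback above is where nearly all application code lives.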
You can try Striim (shameless plug: I work there). You build all your stream processing pipelines in a simple, SQL-like language in a nice web-based UI. It also has built-in adapters to convert log data, JSON, XML, database change capture, and other complex data formats into in-memory streams.
The product also comes with dashboards so you can do ad-hoc analytics and data visualization on your streams and windows.
It has support for stream replay, built on top of Kafka-backed persistent streams. It has a commercial license, but we're pretty flexible with startups.
It still requires Java and Kafka, but Kafka Streams is a library that lets you build stream processing jobs without a centralized cluster, as long as you have Kafka running. Just embed the jar and run your class on as many machines as you like; the instances will self-organize and balance. Worth checking out.
Yeah... the Kafka Streams library does look nice, especially since they don't care how you run the thing.
It's a bit harder to dive into Java, though.
The thing I have a use case for in stream processing is basically a glorified word count, just over a realtime seven-day window. The data processing code is the easy part; the tooling/runtime is the hard part.
It's actually pretty easy to set up a multi-node cluster if you don't worry about YARN/Mesos and just use Spark's standalone cluster manager. I think it's about one config entry per node.
It doesn't solve the data storage problem, so if you need that, you need a way to map the same files to the same path on every node (e.g., a network drive, rsync).
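For reference, the "one config entry per node" bit really is just a hostname list in standalone mode (layout as in the Spark 1.x/2.x docs; the hostnames below are made up):

```
# conf/slaves on the master node: one worker hostname per line
worker-1.example.internal
worker-2.example.internal

# then launch the master and all workers (requires passwordless SSH
# from the master to each worker):
#   sbin/start-all.sh
```

Workers register with the master automatically; adding a node is adding a line and rerunning the start script.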
If you are just doing streaming it might not matter though.
My use case is 100% streaming. I actually implemented what I'm trying to do with some basic Python code that runs against archived logs, but I reprocess the last week's worth of data from scratch at each iteration. It only takes 30 seconds to run, but this method won't work for the 100x larger stream I want to do the same thing for.
The ability to run the data pipeline in realtime, but also replay older data against new code using something like Kafka, would be a huge plus.
Really, the main thing my code is missing is the sliding window implementation, so I can either port to Spark Streaming, which has all that built in, or just implement my own window code.
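For what it's worth, the roll-your-own route can stay pretty small if the state fits in memory: bucket the stream by time and sum the buckets inside the window. A minimal sketch (bucket size and window length are arbitrary parameters, not anything from your setup):

```python
from collections import Counter, defaultdict

class SlidingCounter:
    """Word counts over a sliding window made of fixed-size time buckets.

    E.g. a 7-day window with 1-hour buckets: bucket_seconds=3600,
    window_buckets=7*24. Buckets that fall out of the window are
    dropped lazily the next time totals are requested.
    """

    def __init__(self, bucket_seconds, window_buckets):
        self.bucket_seconds = bucket_seconds
        self.window_buckets = window_buckets
        self.buckets = defaultdict(Counter)  # bucket index -> word counts

    def add(self, word, ts):
        """Record one occurrence of `word` at unix time `ts`."""
        self.buckets[ts // self.bucket_seconds][word] += 1

    def totals(self, now):
        """Counts over the window ending at `now`, evicting stale buckets."""
        newest = now // self.bucket_seconds
        oldest = newest - self.window_buckets + 1
        for b in [b for b in self.buckets if b < oldest]:
            del self.buckets[b]
        total = Counter()
        for counts in self.buckets.values():
            total.update(counts)
        return total
```

The trade-off versus recomputing from scratch is that memory is bounded by (window length / bucket size) x distinct words, and each new event is O(1), which is what makes the 100x larger stream feasible.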
You can replay streams against Spark, too. streamingContext.textFileStream will stream data from files dumped in a directory - to replay them, just dump them there again.
Figuring out how to do stateful processing is a little tricky, but updateStateByKey seems to do what I need: dedup the output per key for a time period t. Though some recommendations are just to use something like Redis or memcached, which would work.
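The Redis approach boils down to membership checks on (key, time bucket) pairs with a TTL. A minimal in-memory stand-in for that idea (the class and method names are made up; in production the `seen` set would live in Redis, e.g. `SET key NX EX ttl`, so state survives restarts):

```python
class WindowedDeduper:
    """Emits each key at most once per period of `period_seconds`.

    State is an in-memory set of (key, period index) pairs; swap it
    for Redis/memcached to share state across workers and restarts.
    """

    def __init__(self, period_seconds):
        self.period_seconds = period_seconds
        self.seen = set()  # {(key, period index)}

    def should_emit(self, key, ts):
        """True the first time `key` appears in the period containing `ts`."""
        bucket = (key, ts // self.period_seconds)
        if bucket in self.seen:
            return False
        self.seen.add(bucket)
        return True
```

Unlike this sketch, the external-store version gets eviction for free via key expiry, which is most of the argument for Redis/memcached here.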
You could pretty easily batch-stream it to S3 for ridiculously cheap. I posted a video of our setup elsewhere in the thread. So overall I'd say this is a pretty good idea, especially since you can do some cool stuff with RxJS.
0: http://beam.incubator.apache.org/
1: http://softwareengineeringdaily.com/2016/08/19/apache-beam-w...