Scalable Stream Processing: A Survey of Storm, Samza, Spark and Flink (medium.com/baqend-blog)
135 points by DivineTraube on Sept 18, 2016 | 41 comments


What about using Apache Beam[0] as an abstraction over these stream processing frameworks? In a recent Software Engineering Daily podcast[1], Frances Perry recommends using Flink for the Dataflow model.

0: http://beam.incubator.apache.org/

1: http://softwareengineeringdaily.com/2016/08/19/apache-beam-w...


The problem with Beam is that it is an abstraction layer.

This sounds like a stupid comment, but when the underlying services are all evolving as quickly as they do in this space, programming against an abstraction layer means you need to wait for (often sorely needed) new features.

(Frances Perry is an engineer on Beam at Google, so it would be surprising if she recommended against it)


Also, these distributed processing platforms are extraordinarily leaky (for anything non-trivial, you really need to have a good low-level understanding of your system in order to tune/debug/optimize/implement). There are definitely other benefits of a unified API, but making it seamless to switch underlying implementations is probably not one...


Do you have examples of such features? Most production deployments of these kinds of things don't upgrade THAT often.

You have a point that abstraction layers can limit you since it's another compatibility layer, but looking at what's done in actual practice I don't see it being that bad.

If anything, it should be evaluated on a case by case basis.

Flink is actually better tech overall for my use case, but a lot of customers want spark streaming since it's already installed. Having beam where we can do both is kinda nice.


We had to abandon the Cloudera (and Hortonworks) Hadoop distributions because they didn't ship R support for Spark quickly enough (in 1.3 or 1.2 or whatever version that was).

We jumped from 1.5 to 1.6 because of algorithms in MLlib (although that turned out to be a bit of a disappointment).


Most organizations I see tend to just wait... this sounds more like a one-off to me. You're at a unique employer if the data scientists have that much power.


Great article, although it would be nice to also add the cost (price) of running these systems. For instance, with our own custom solution we were able, on one small machine, to save 100M+ records (about 100GB) a day for $10 total (processing cost, disk cost, and backup cost). Which is why I'm curious how it compares. See https://www.youtube.com/watch?v=sG5qtN8E-6Q and https://www.youtube.com/watch?v=x_WqBuEA7s8 .


I work on IBM Streams, which was discussed and dismissed. The author is correct that we need to have more public benchmarks. I think our system should be in the upper right of the high-level view, but of course I would think that, as someone who works on it.

Something I can point to is a modified Linear Road benchmark: https://github.com/IBMStreams/benchmarks/tree/master/Streams...

This benchmark was made at the request of a potential customer. Our implementation scaled to 200 "lanes." The other systems tested did not scale past 50. Unfortunately, that's as much as I feel I can say until I speak with some of the people involved in the comparison.


Interested in learning about the architecture and internals and can't dig anything up. Any doc links you can share?


Official documentation: http://www.ibm.com/support/knowledgecenter/SSCRJU_4.1.1/com....

Development community: https://developer.ibm.com/streamsdev/

Some pointers to posts I've made in the development community focusing on the language and performance: http://www.scott-a-s.com/streams-posts/

Academic paper on the language; this is an IBM technical report, a version of this will be published in TOPLAS: http://hirzels.com/martin/papers/tr14-rc25486-spl.pdf

Brief academic paper on the systems aspects of the language: http://hirzels.com/martin/papers/debull15-spl.pdf


Thanks.


Turns out we do have public information on the Linear Road performance: http://www.slideshare.net/RedisLabs/walmart-ibm-revisit-the-...


Is there a SIMPLE stream processing library/framework?

Something like Spark or Kafka Streams, but that doesn't depend on hundreds of megabytes of Java stuff?

I just want to do basic windowing/counting against data streams. I don't want to be part of a gigantic Java ecosystem :-(
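For what it's worth, the basic windowing/counting case really can fit in a few lines of stdlib Python with a deque and a Counter. A minimal sketch (the `WindowedCounter` class and the event timestamps are made up for illustration, not from any framework):

```python
import time
from collections import Counter, deque


class WindowedCounter:
    """Counts keys seen in the last `window` seconds.

    A stdlib sketch of sliding-window counting, not a production
    stream processor: single-threaded, in-memory, no persistence.
    """

    def __init__(self, window):
        self.window = window
        self.events = deque()   # (timestamp, key) pairs, oldest first
        self.counts = Counter()

    def add(self, key, ts=None):
        ts = time.time() if ts is None else ts
        self.events.append((ts, key))
        self.counts[key] += 1
        self._expire(ts)

    def _expire(self, now):
        # Drop events that have slid out of the window.
        while self.events and self.events[0][0] <= now - self.window:
            _, key = self.events.popleft()
            self.counts[key] -= 1
            if self.counts[key] == 0:
                del self.counts[key]

    def top(self, n=10):
        return self.counts.most_common(n)


# Count keys over a 60-second sliding window.
wc = WindowedCounter(window=60)
wc.add("error", ts=0)
wc.add("error", ts=30)
wc.add("ok", ts=45)
wc.add("error", ts=90)   # by now the ts=0 and ts=30 events have expired
print(wc.top())          # [('error', 1), ('ok', 1)]
```

This obviously doesn't give you distribution, replay, or fault tolerance, but for "basic windowing/counting" on a single box it sidesteps the JVM entirely.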


I've done a huge amount of work on Godot2 [1] (nodejs Riemann port) this past couple weeks, updating it to modern Streams2 syntax in Node. Really lightweight and deployable on all sorts of platforms. Includes many windowing features.

I also updated the dashboard to match [2], including Riemann's query language [3].

I haven't had a chance to update the docs yet, but the tests have a ton of examples [4]. Reactors are just Transform streams now so it's possible to use any duplex/transform stream in the pipeline. Docs coming ASAP.

[1]: https://github.com/nextorigin/godot2

[2]: https://github.com/nextorigin/godot2-dash

[3]: https://github.com/nextorigin/riemann-query-parser

[4]: https://github.com/nextorigin/godot2/tree/nextorigin/test/re...


I wrote riko [1] with exactly this use case in mind. It's a simple Python stream processing engine.

[1] https://github.com/nerevu/riko

Edit: HN submission

https://news.ycombinator.com/item?id=12136011


Depending on your requirements, a single-node solution like PipelineDB may suffice; it's a PostgreSQL extension that lets you write streaming SQL queries.


Jeff from PipelineDB here. We are also rolling out a SaaS product based on PipelineDB called Stride (stride.io) that enables this type of computation via an API with little to no overhead.


That is definitely along the lines of what I'm looking for..

The Esper stuff was on my radar, but something that works with existing PostgreSQL tools is a huge plus.


PipelineDB can scale out to multiple nodes (for $).


Is there a reference regarding the number of nodes that can be deployed in the cluster, and are feasible to maintain? From the docs, I gather that "[a]ll DDL/non-stream DML statements are executed in a distributed transaction on all nodes in the cluster and committed via two-phase commit." [1]

Doesn't that get in the way of low latency and availability?

[1] http://enterprise.pipelinedb.com/docs/two-phase.html#two-pha...


If you can tolerate some latency on writes, AWS Kinesis is perfectly serviceable. You can write your own "stream" processing logic in a number of languages, and the provided KCL/SDK from Amazon does most of the plumbing. Eventually, you will probably want to move to something more sophisticated, though.


Exactly this - here's some simple windowing/counting in JavaScript with Kinesis, Lambda and DynamoDB, no JVM in sight :-) https://github.com/snowplow/aws-lambda-nodejs-example-projec...


A hosted service is a non-starter for my use cases :-/


Kapacitor, it has a focus on time series data, and simplicity.

https://www.influxdata.com/time-series-platform/kapacitor/

Disclaimer: I am the author.


You can try Striim (shameless plug, I work there). You build all your stream processing pipelines in a simple, SQL-like language in a nice web-based UI. It also has built in adapters to convert log data, JSON, XML, database change capture, and other complex data formats into in-memory streams.

The product also comes with dashboards so you can do ad-hoc analytics and data visualization on your streams and windows.

It has support for stream replay built on top of Kafka-backed persistent streams. It has a commercial license, but we're pretty flexible with startups.

http://www.striim.com/download-striim/


It still requires Java and Kafka, but Kafka Streams is a library that lets you build stream processing jobs without a centralized cluster if you have Kafka running. Just embed the jar and run your class on as many machines as you like and they will self-organize and balance. Worth checking out.


Yeah, the Kafka Streams library does look nice, especially how they don't care how you run the thing.

It's a bit harder to dive into java though.

The thing I have a use case for is basically glorified wordcount - just over a realtime 7-day window. The data processing code is the easy part; the tooling/runtime is the hard part.


Regarding java tooling: check out the blog posts "not your father's java" and especially the part about capsule packaging.


Check out motorway: https://github.com/plecto/motorway/ - it's a few hundred lines of Python that does nearly the same as e.g. Storm.


Elixir is getting a really nice Flow lib.

You can have a look here : https://hexdocs.pm/gen_stage/Experimental.Flow.html


I've been keeping an eye on this one for the next time I needed a system of this sort. Looks promising...

https://github.com/chrislusf/glow


glow looks like a nifty MapReduce implementation, but it doesn't appear to do stream processing.


Spark with Python is pretty good. You just install it, then use a Python shell or scripts. You don't really need to get into the Java side at all.

I'm not sure the size of the install should be the primary thing you are concerned about.


Indeed. I was just playing with the Python bindings. I was able to build a prototype of what I am trying to do in a few minutes.

I'd prefer the kafka streams approach of deployment, but I suppose for just having it run on a single box this would work.


It's actually pretty easy to set up a multi-node cluster if you don't worry about YARN/Mesos and just use Spark's plain standalone mode. I think it's like one config entry per node.

It doesn't solve the data storage problem, so if you need that you'll need a way to map the same files to the same path on every node (e.g. a network drive or rsync).

If you are just doing streaming it might not matter though.


My use case is 100% streaming. I actually implemented what I'm trying to do using some basic Python code that runs against archived logs, but I just reprocess the last week's worth of data from scratch at each iteration. It only takes 30 seconds to run, but this method doesn't work for a 100x larger stream that I want to do the same thing for.

The ability to run the data pipeline in realtime but also replay older data against new code using something like kafka would be a huge plus.

Really the main thing my code is missing is the sliding window implementation, so I can either port to spark streaming which has all that stuff built in, or just implement my own window code.


Spark has sliding window operations:

    windowedWordCounts = pairs.reduceByKeyAndWindow(lambda x, y: x + y, lambda x, y: x - y, 30, 10)
Search for "window operations" on http://spark.apache.org/docs/latest/streaming-programming-gu.... Unless you meant something different?

You can replay streams against Spark, too. streamingContext.textFileStream will stream data from files dumped in a directory - to replay them, just dump them there again.


Yeah, that's exactly what I used in the Spark version I threw together yesterday.

reduceByKeyAndWindow is what my simple non-Spark proof of concept is missing; otherwise the code is basically the same.

My non-Spark code is essentially:

  while True:
    data = {}
    for line in input:
        rec = parse(line)
        data = aggregate(data, rec)
    data = filter(is_bad, data)
    pprint(data)
The Spark version is 99% the same code:

  lines.map(parse).reduceByKeyAndWindow(add, sub, 3600, 60).
    filter(is_bad).pprint()
Figuring out how to do stateful processing is a little tricky, but updateStateByKey seems to do what I need: I need to dedup the output per key for time period t. Though, some recommendations are just to use something like Redis or memcached, which would work.
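For reference, the add/subtract trick behind reduceByKeyAndWindow(add, sub, window, slide) can be reproduced without Spark by keeping one Counter per slide interval and adding/subtracting whole buckets as the window moves. A stdlib sketch (the `windowed_counts` function and the example buckets are made up for illustration; this is not Spark's implementation):

```python
from collections import Counter, deque


def windowed_counts(buckets, window_buckets):
    """Yield running counts over a sliding window, reduceByKeyAndWindow-style:
    add the newest bucket and subtract the bucket that falls out of the
    window, instead of recounting the whole window each slide.

    buckets: iterable of Counters, one per slide interval.
    window_buckets: window length in slide intervals
                    (e.g. a 7-day window with a 60 s slide = 10080 buckets).
    """
    total = Counter()
    held = deque()
    for bucket in buckets:
        total += bucket              # "add" step for the newest slide
        held.append(bucket)
        if len(held) > window_buckets:
            total -= held.popleft()  # "sub" step for the expired slide
        yield Counter(total)         # copy so callers can't mutate state


# Three slide intervals, window spanning two of them.
slides = [Counter(a=2), Counter(a=1, b=1), Counter(b=3)]
for counts in windowed_counts(slides, window_buckets=2):
    print(dict(counts))
```

The win is the same as Spark's inverse-function variant: each slide costs O(bucket size) rather than O(window size), which is what makes a 7-day window with a short slide interval tractable.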


Why not just use an Rx library? http://reactivex.io/


And how do you store the data?


You could pretty easily batch stream it to S3 for ridiculously cheap. I posted a video of our setup elsewhere in the thread. So overall I'd say this is a pretty good idea, especially when you can do some cool stuff with RxJS.



