Unified frameworks for Batch and Realtime processing

Data such as stock quotes or weather readings often has to be processed both in real time and in batches. The same data stream can be stored and processed in a batch-oriented fashion and also in real time, as shown in the data flow below. This is what the Lambda Architecture (1, 2) is all about.

[Figure: real-and-batch-processing]

The good thing about this architecture is that the input data is retained, so if the logic for processing the data changes over time (bug fixes, algorithm improvements etc.), the processing can be rerun on the same historical data. The problem with it is that the processing logic has to be written twice, once for the batch model and once for the real-time model, and there are also operational challenges because two separate systems have to be maintained.
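To make the duplication concrete, here is a minimal sketch in plain Python, not tied to any real framework. It assumes a hypothetical retained log of (symbol, price) events and computes the average price per symbol twice: once as a batch recomputation over the whole log, and once as incremental updates in a real-time path. All names (`EVENT_LOG`, `batch_average`, `SpeedLayer`) are made up for illustration.

```python
from collections import defaultdict

# Hypothetical retained input log of (symbol, price) events.
EVENT_LOG = [("ACME", 10.0), ("ACME", 14.0), ("INIT", 5.0), ("ACME", 12.0)]


def batch_average(events):
    """Batch path: recompute every average from the full retained log."""
    totals = defaultdict(lambda: [0.0, 0])  # symbol -> [sum, count]
    for symbol, price in events:
        totals[symbol][0] += price
        totals[symbol][1] += 1
    return {s: total / count for s, (total, count) in totals.items()}


class SpeedLayer:
    """Real-time path: the same logic, written a second time as
    incremental per-event updates."""

    def __init__(self):
        self.totals = defaultdict(lambda: [0.0, 0])

    def on_event(self, symbol, price):
        entry = self.totals[symbol]
        entry[0] += price
        entry[1] += 1
        return entry[0] / entry[1]  # current average for this symbol


# Feed the same events through both paths.
speed = SpeedLayer()
for symbol, price in EVENT_LOG:
    speed.on_event(symbol, price)

realtime_view = {s: t / c for s, (t, c) in speed.totals.items()}
batch_view = batch_average(EVENT_LOG)
```

Both views agree here, but only because the averaging logic was implemented correctly in two places. A bug fix or algorithm change has to be applied to both, and the batch path can then be rerun over the retained log.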

Google Dataflow minimizes this problem by allowing the same program semantics, with minimal changes, for both batch and real-time processing of the data. But Google Dataflow is a service, and it is not available for subscription as of this writing. I was under the impression that this was the first attempt to combine the batch and real-time processing worlds, but it looks like LinkedIn is already using Samza for both. Instead of repeating the same, here is an article from O’Reilly. It covers the pros and cons of the Lambda Architecture and how LinkedIn is using Samza for batch-oriented processing of the data.
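The idea of writing the logic once can be sketched in plain Python as well. This is only an illustration of the concept, not the Google Dataflow or Samza API: a single hypothetical function, `running_average`, accepts any iterable of events, so the same code serves a bounded historical list and a generator that stands in for a live feed.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Event = Tuple[str, float]  # (symbol, price)


def running_average(events: Iterable[Event]) -> Iterator[Tuple[str, float]]:
    """Single processing logic: emit the updated average after each event."""
    totals: Dict[str, List[float]] = {}  # symbol -> [sum, count]
    for symbol, price in events:
        entry = totals.setdefault(symbol, [0.0, 0])
        entry[0] += price
        entry[1] += 1
        yield symbol, entry[0] / entry[1]


def run_batch(history: Iterable[Event]) -> Dict[str, float]:
    """Bounded input: keep only the last emitted average per symbol."""
    return dict(running_average(history))


def live_feed() -> Iterator[Event]:
    """Stand-in for an unbounded source; a real feed would not terminate."""
    yield ("ACME", 10.0)
    yield ("ACME", 14.0)
    yield ("INIT", 5.0)


history = [("ACME", 10.0), ("ACME", 14.0), ("INIT", 5.0)]

batch_result = run_batch(history)                    # batch mode
stream_updates = list(running_average(live_feed()))  # streaming mode
```

In batch mode only the final value per symbol is kept, while in streaming mode every intermediate update is visible as it is produced. The processing logic itself lives in one place, which is the property the unified frameworks aim for.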