pipeline processing: beam
What’s beam
Beam is an open-source, unified model for defining both batch and streaming data-parallel processing pipelines.
- open source (Apache v2 license)
- defines data-parallel processing pipelines
- a unified model for defining pipelines; the actual processing is executed by the underlying runner (e.g. Spark, Apache Apex, etc.) listed under all available runners
- processes both batch (bounded) and streaming (unbounded) datasets
Use it
See the wordcount examples and the wordcount src.
Now we define a simple pipeline and run it.
Transforms such as Count are built-in operations used to compose the pipeline.
```java
package org.apache.beam.examples;
```
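The rest of the file is only linked above, so here is a minimal sketch of such a pipeline, closely modeled on Beam's MinimalWordCount example (the input path is Beam's public sample file; the output prefix "wordcounts" is a placeholder), continuing from the package declaration above:

```java
import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class MinimalWordCount {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    p.apply("ReadLines", TextIO.read().from("gs://apache-beam-samples/shakespeare/kinglear.txt"))
        // Split each line into words.
        .apply("ExtractWords",
            FlatMapElements.into(TypeDescriptors.strings())
                .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))
        // Count occurrences of each word (built-in Count transform).
        .apply(Count.perElement())
        // Format each (word, count) pair as a line of text.
        .apply("FormatResults",
            MapElements.into(TypeDescriptors.strings())
                .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
        // Write the results; the output location must be accessible to the runner.
        .apply("WriteCounts", TextIO.write().to("wordcounts"));

    p.run().waitUntilFinish();
  }
}
```

Run locally with the DirectRunner, or pass a --runner option to target a cluster (more on runners below).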
Some concepts
I/O (data source/target)
Beam can process both batch (bounded) and streaming (unbounded) datasets; see the built-in I/O transforms.
Take reading as an example: you specify the file location (which must be accessible to the runner), and the reader pulls from the data source. You can also define a trigger that collects an input window; when the trigger fires, the window's elements are emitted.
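For instance, here is a sketch of reading from a bounded file source versus an unbounded streaming source (the file glob, project, and topic names are made-up placeholders; the Pub/Sub connector comes from the separate GCP I/O module):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class ReadExamples {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Bounded source: the glob must be readable from the runner's workers.
    PCollection<String> batchLines =
        p.apply("ReadFiles", TextIO.read().from("/data/logs/*.txt"));

    // Unbounded source: elements keep arriving, so any grouping downstream
    // needs windowing and triggers (see below).
    PCollection<String> streamLines =
        p.apply("ReadEvents",
            PubsubIO.readStrings().fromTopic("projects/my-project/topics/events"));

    p.run().waitUntilFinish();
  }
}
```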
Unbounded datasets are split into windows, and each window is itself a bounded dataset containing some elements. You define how elements are grouped into windows and when a window's elements are emitted for processing; see the window concept in the Beam docs.
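As a sketch, this is how a one-minute fixed window with a trigger could be applied to an unbounded PCollection of strings (the window size, early-firing delay, and the helper class are arbitrary choices for illustration):

```java
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WindowingExample {
  // `lines` is assumed to be an unbounded PCollection<String>,
  // e.g. read from Pub/Sub or Kafka.
  static PCollection<String> windowByMinute(PCollection<String> lines) {
    return lines.apply(
        Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
            // Emit when the watermark passes the end of the window,
            // with early firings 30 s (processing time) after the first element.
            .triggering(
                AfterWatermark.pastEndOfWindow()
                    .withEarlyFirings(
                        AfterProcessingTime.pastFirstElementInPane()
                            .plusDelayOf(Duration.standardSeconds(30))))
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes());
  }
}
```

Downstream grouping transforms (GroupByKey, Count, etc.) then operate per window rather than over the whole unbounded stream.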
Runner
Beam is a unified model: it abstracts the concepts needed to define and run a pipeline, while the actual execution is carried out by the underlying runner.
For unbounded datasets, the underlying runner must support stream processing.
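The runner is usually chosen through pipeline options rather than in the pipeline code itself. A minimal sketch, assuming the runner name is passed on the command line and the matching runner artifact is on the classpath:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunnerSelection {
  public static void main(String[] args) {
    // e.g. --runner=DirectRunner (local testing), --runner=SparkRunner,
    // --runner=FlinkRunner, --runner=DataflowRunner ...
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);
    // ... apply transforms here ...
    p.run().waitUntilFinish();
  }
}
```

The same pipeline definition runs unchanged on any of these runners; for streaming input, pick one that supports unbounded PCollections.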