Big Data

Apache Apex for Real Time Stream Processing

August 4, 2016

Apache Apex, which graduated from the Apache Incubator in April 2016, is gaining popularity among real-time stream processing engines. It stands out for its ability to process large volumes of data substantially faster than its counterparts.
Synerzip regularly takes on work with emerging technologies and has gained hands-on experience with Apache Apex since its early stages.
Use Case –
A recurring need in IT security is to monitor and analyze firewall data in real time. This includes running large, complex queries against the entire data set to explore the data and support business decisions.
Apex Solution –
Apache Apex suits this problem domain because of its low latency and the many data connectors available through Apache Malhar. Data visualization can be done with Kibana. One possible solution is outlined below.
NOTE: A Flume operator is not yet part of Apache Malhar.
The firewall is configured to send Syslog data to Flume. This firewall data (the log of incoming and outgoing traffic) is the input to the application. Apache Flume works efficiently for collecting, aggregating, and moving large amounts of log data. Here it also acts as the connector that delivers the firewall data to Kafka, which serves as the message queue.
Data coming in from Kafka needs to be transformed before it goes into Elasticsearch. Elasticsearch is a distributed, open source search and analytics engine designed for horizontal scalability, reliability, and easy management. The transformation is done by the Apache Apex application: it reads and processes the data coming from Kafka, and the processed data is stored in Elasticsearch, where it can be searched. Finally, the search results are visualized in Kibana.
The Synerzip team wrote the Apex application that processes and converts the data on its way from Kafka to Elasticsearch.

Data coming in from Syslog to Kafka –
[Figure 1: Data coming from Syslog to Kafka]
Apache Apex Application –
The application is a directed acyclic graph (DAG) with one input operator (Kafka), one output operator (Elasticsearch), and multiple custom operators in between. The custom operators receive the data from Kafka, process it, and forward it to Elasticsearch to be persisted.
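The operator graph described above can be pictured with a small plain-Java sketch. It is only an illustration and not the team's application: it avoids Apex dependencies entirely, models the graph as a simple linear chain, and every name in it (`MiniDag`, `dropBlank`, `extractAction`, the `action=` log field) is hypothetical. In Apex itself, the equivalent wiring would normally be declared in a `StreamingApplication.populateDAG` implementation using `dag.addOperator` and `dag.addStream`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/**
 * Plain-Java stand-in for a linear operator graph. This is NOT the Apex API;
 * all class, method, and field names here are hypothetical.
 */
final class MiniDag {
    // Each operator turns one tuple into zero or more tuples for the next stage.
    private final List<Function<String, List<String>>> operators = new ArrayList<>();
    // Stand-in for the Elasticsearch output operator: it only remembers what it received.
    private final List<String> persisted = new ArrayList<>();

    MiniDag addOperator(Function<String, List<String>> operator) {
        operators.add(operator);
        return this;
    }

    /** Stand-in for the Kafka input operator emitting one tuple into the graph. */
    void emit(String tuple) {
        List<String> current = List.of(tuple);
        for (Function<String, List<String>> operator : operators) {
            List<String> next = new ArrayList<>();
            for (String t : current) {
                next.addAll(operator.apply(t));
            }
            current = next;
        }
        persisted.addAll(current);
    }

    List<String> persisted() {
        return List.copyOf(persisted);
    }

    /** Hypothetical custom operator: drops blank lines and trims the rest. */
    static List<String> dropBlank(String line) {
        return line.isBlank() ? List.of() : List.of(line.trim());
    }

    /** Hypothetical custom operator: keeps only the value of an assumed "action=" field. */
    static List<String> extractAction(String line) {
        for (String field : line.split("\\s+")) {
            if (field.startsWith("action=")) {
                return List.of(field.substring("action=".length()));
            }
        }
        return List.of(); // no action field: nothing is forwarded
    }

    public static void main(String[] args) {
        MiniDag dag = new MiniDag()
                .addOperator(MiniDag::dropBlank)
                .addOperator(MiniDag::extractAction);
        dag.emit("src=10.0.0.1 action=DENY");
        dag.emit("   ");
        dag.emit("src=10.0.0.2 action=ALLOW");
        System.out.println(dag.persisted()); // prints [DENY, ALLOW]
    }
}
```

A real Apex DAG can fan out and merge streams, run operators in separate containers, and checkpoint operator state; none of that is modeled here.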
The custom operators process the stream of incoming data and aggregate it according to the business logic. It is this aggregated data that is sent to Elasticsearch. Aggregation is performed over a configurable period of time, which can differ for each custom operator. The team has written multiple custom operators to process different types of firewall messages.
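As a rough illustration of aggregating over a configurable period, the sketch below counts events per key in fixed, non-overlapping time windows. It is a minimal plain-Java example under stated assumptions, not the team's operator code: the class name `TumblingCounter`, the use of event timestamps in milliseconds, the in-order arrival of events, and the `DENY`/`ALLOW` keys are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical per-key counter over fixed time windows; not part of Apex or Malhar. */
final class TumblingCounter {
    private final long windowMillis;   // configurable aggregation period
    private long windowStart;          // start of the window currently being filled
    private Map<String, Long> counts = new LinkedHashMap<>();
    // Stand-in for the output port: one map of counts per closed window.
    private final List<Map<String, Long>> emitted = new ArrayList<>();

    TumblingCounter(long windowMillis) {
        if (windowMillis <= 0) {
            throw new IllegalArgumentException("windowMillis must be positive");
        }
        this.windowMillis = windowMillis;
    }

    /** Adds one event; assumes events arrive in timestamp order. */
    void process(long timestampMillis, String key) {
        long start = timestampMillis - Math.floorMod(timestampMillis, windowMillis);
        if (!counts.isEmpty() && start != windowStart) {
            flush(); // the event belongs to a new window, so close the old one
        }
        windowStart = start;
        counts.merge(key, 1L, Long::sum);
    }

    /** Emits the counts of the current window, if any, and starts a fresh one. */
    void flush() {
        if (!counts.isEmpty()) {
            emitted.add(counts);
            counts = new LinkedHashMap<>();
        }
    }

    List<Map<String, Long>> emitted() {
        return emitted;
    }

    public static void main(String[] args) {
        TumblingCounter counter = new TumblingCounter(1_000); // one-second windows
        counter.process(0, "DENY");
        counter.process(500, "DENY");
        counter.process(900, "ALLOW");
        counter.process(1_000, "DENY"); // first event of the next window
        counter.flush();
        System.out.println(counter.emitted()); // prints [{DENY=2, ALLOW=1}, {DENY=1}]
    }
}
```

Giving each operator instance its own `windowMillis` mirrors the point above that the aggregation period can differ per custom operator.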
[Figure 2: Apache Apex Application]
[Figure 3: Output in Kibana]
To learn more about Apache Apex, visit the Apache Apex website.
Demo
Check out these demo applications:
Demo 1 – This Apache Apex application reads firewall logs from Kafka, processes them, and pushes them to Elasticsearch using the Apex Elasticsearch output operator.
Demo 2 – This Apache Apex application aggregates a live stream of Syslog data coming from the firewall and stores the aggregated data in Elasticsearch.
Why Apache Apex?
Apache Apex was chosen over the alternatives because it is highly scalable, performs well, is fault tolerant and stateful, and, most importantly, is easy to operate.
With Apache Apex, Synerzip observed that data could be monitored and analyzed easily enough to keep up with clients’ growing business needs, with performance improving by up to 50x in some cases.
Apache Apex is steadily gaining steam, both in the headlines and in real-world adoption by data science companies that use it to manage and analyze their big data.