Posts

Unify Batch and Stream Processing with Apache Beam on AWS

One of the big visions of Apache Beam is to provide a single programming model for both batch and streaming that runs on multiple execution engines. In this session, we explore an end-to-end example that shows how you can combine batch and streaming aspects in one uniform Beam pipeline. We start by ingesting taxi trip events into an Amazon Kinesis data stream and use a Beam pipeline to analyze the streaming data in near real time. We then show how to archive the trip data to Amazon S3 and how to extend and update the Beam pipeline to generate additional metrics from the streaming data going forward. Next, we explain how to backfill the added metrics by running the same Beam pipeline in batch mode against the archived data in S3. Along the way, we discuss how to use different execution engines, such as Amazon Kinesis Data Analytics for Java and Amazon Elastic MapReduce, to run Beam pipelines in a fully managed environment. ...

Amazon Kinesis Replay

A simple Java application that replays JSON events stored in objects in Amazon S3 into an Amazon Kinesis stream. The application reads the timestamp attribute of the stored events and replays them as if they occurred in real time. https://github.com/aws-samples/amazon-kinesis-replay
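
To illustrate the idea of timestamp-driven replay, here is a minimal Python sketch. It is not the code of the Java application linked above. The attribute name `timestamp`, the ISO 8601 timestamp format, the `speedup` parameter, and the `send` callback (which stands in for a call that puts a record into a Kinesis stream) are all assumptions made for this example.

```python
import json
import time
from datetime import datetime


def replay_schedule(lines, timestamp_field="timestamp", speedup=1.0):
    """Yield (offset_seconds, event) pairs for newline-delimited JSON events.

    The offset is the time between an event and the first event, divided by
    `speedup`. The field name and ISO 8601 format are assumptions.
    """
    first_ts = None
    for line in lines:
        event = json.loads(line)
        ts = datetime.fromisoformat(event[timestamp_field])
        if first_ts is None:
            first_ts = ts
        yield (ts - first_ts).total_seconds() / speedup, event


def replay(lines, send, sleep=time.sleep, clock=time.monotonic, **kwargs):
    """Call `send(event)` for each event once its offset has elapsed.

    `send` is a hypothetical callback; in a real setup it would write the
    event to a stream. `sleep` and `clock` are injectable for testing.
    """
    start = clock()
    for offset, event in replay_schedule(lines, **kwargs):
        wait = offset - (clock() - start)
        if wait > 0:
            sleep(wait)
        send(event)


if __name__ == "__main__":
    sample = [
        '{"trip_id": 1, "timestamp": "2019-01-01T00:00:00"}',
        '{"trip_id": 2, "timestamp": "2019-01-01T00:00:02"}',
    ]
    # Replays the two sample events two seconds apart, printing each one.
    replay(sample, print)
```

Measuring the wait against a start time, rather than sleeping for the gap between consecutive events, keeps the replay from drifting when sending an event takes time.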

Build and run streaming applications with Apache Flink and Amazon Kinesis Data Analytics

Stream processing facilitates the collection, processing, and analysis of real-time data and enables the continuous generation of insights and quick reactions to emerging situations. This capability is useful when the value of derived insights diminishes over time. Hence, the faster you can react to a detected situation, the more valuable the reaction is going to be. Consider, for instance, a streaming application that analyzes and blocks fraudulent credit card transactions while they occur. Compare that application to a traditional batch-oriented approach that identifies fraudulent transactions at the end of every business day and generates a nice report for you to read the next morning. ...

Amazon Kinesis Analytics Taxi Consumer

A sample Apache Flink application that can be deployed to Kinesis Data Analytics for Java. It reads taxi events from a Kinesis data stream, processes and aggregates them, and writes the results to an Amazon Elasticsearch Service cluster for visualization with Kibana. https://github.com/aws-samples/amazon-kinesis-analytics-taxi-consumer

Build Your First Big Data Application on AWS

AWS makes it easy to build and operate highly scalable and flexible data platforms that collect, process, and analyze data so you can get timely insights and react quickly to new information. In this session, we will demonstrate how you can quickly build a fully managed data platform that transforms, cleans, and analyzes incoming data in real time and persists the cleaned data for subsequent visualization and exploration by means of SQL. To this end, we will build an end-to-end streaming data solution using Kinesis Data Streams for data ingestion and Kinesis Data Analytics for real-time outlier and hotspot detection. We will also show how the incoming data can be persisted by means of Kinesis Data Firehose to make it available to Amazon Athena and Amazon QuickSight for data exploration and visualization. ...

Build a Real-time Stream Processing Pipeline with Apache Flink on AWS (FF)

The increasing number of available data sources in today’s application stacks has created a demand to continuously capture and process data from various sources and to quickly turn high-volume streams of raw data into actionable insights. Apache Flink addresses many of the challenges in this domain because it is specifically tailored to distributed computations over streams. While Flink provides all the necessary capabilities to process streaming data, provisioning and maintaining a Flink cluster still requires considerable effort and expertise. We will discuss how cloud services can remove most of the burden of running the clusters underlying your Flink jobs and explain how to build a real-time processing pipeline on top of AWS by integrating Flink with Amazon Kinesis and Amazon EMR. We will also illustrate how to leverage the reliable, scalable, and elastic nature of the AWS cloud to create and operate your real-time processing pipeline with little operational overhead. ...

Build a Real-time Stream Processing Pipeline with Apache Flink on AWS

In today’s business environments, data is generated in a continuous fashion by a steadily increasing number of diverse data sources. Therefore, the ability to continuously capture, store, and process this data to quickly turn high-volume streams of raw data into actionable insights has become a substantial competitive advantage for organizations. Apache Flink is an open source project that is well-suited to form the basis of such a stream processing pipeline. It offers unique capabilities that are tailored to the continuous analysis of streaming data. However, building and maintaining a pipeline based on Flink often requires considerable expertise, in addition to physical resources and operational efforts. ...