Steffen Hausmann

Performance Testing Framework for Apache Kafka

The tool is designed to evaluate the maximum throughput of a cluster and to compare the put latency of different broker, producer, and consumer configurations. To run a test, you specify the parameters that should be tested, and the tool iterates through all combinations of these parameters and produces a graph that compares the results. https://github.com/aws-samples/performance-testing-framework-for-apache-kafka/
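The idea of iterating through all parameter combinations can be illustrated with a short Python sketch. It is not taken from the framework itself: the parameter names and values in `test_spec` and the helper `parameter_combinations` are hypothetical and only show how a test specification could expand into individual test runs.

```python
from itertools import product


def parameter_combinations(spec):
    """Yield one dict per combination of the listed parameter values."""
    names = list(spec)
    for values in product(*(spec[name] for name in names)):
        yield dict(zip(names, values))


# Hypothetical test specification; the real tool may use different names.
test_spec = {
    "throughput_mb_per_sec": [16, 32, 64],
    "num_producers": [6],
    "consumer_groups": [0, 1, 2],
    "client_acks": ["all", "1"],
}

combinations = list(parameter_combinations(test_spec))

for run_id, params in enumerate(combinations):
    # Each combination would correspond to one individual test run.
    print(run_id, params)
```

With this specification, the sketch expands to 3 × 1 × 3 × 2 = 18 test runs, which is why adding one more value to any parameter list multiplies the total test duration.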

Flink Improvement Proposal 171: Async Sink

Apache Flink has a rich connector ecosystem that can persist data in various destinations. Flink natively supports Apache Kafka, Amazon Kinesis Data Streams, Elasticsearch, HBase, and many more destinations. Additional connectors are maintained in Apache Bahir or directly on GitHub. The basic functionality of these sinks is quite similar. They batch events according to user-defined buffering hints, sign requests and send them to the respective endpoint, retry unsuccessful or throttled requests, and participate in checkpointing. They differ primarily in the way they interface with the destination. Yet, all the above-mentioned sinks are developed and maintained independently. ...
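The shared behavior described above (buffer events, send them in batches, retry the entries that failed) can be sketched in a few lines of Python. This is a simplified, synchronous illustration and not the API proposed in FLIP-171: the class `BufferingSink`, its parameters, and the `send_batch` callback that returns the failed entries are all assumptions made for this example. Request signing and checkpointing are left out.

```python
class BufferingSink:
    """Toy sink that batches events and retries entries that failed to send."""

    def __init__(self, send_batch, max_batch_size=3, max_retries=2):
        # send_batch is a hypothetical destination-specific callback:
        # it takes a list of events and returns the ones that failed.
        self.send_batch = send_batch
        self.max_batch_size = max_batch_size  # buffering hint
        self.max_retries = max_retries
        self.buffer = []

    def write(self, event):
        """Buffer an event and flush once the batch size is reached."""
        self.buffer.append(event)
        if len(self.buffer) >= self.max_batch_size:
            self.flush()

    def flush(self):
        """Send all buffered events, resending only the failed entries."""
        batch, self.buffer = self.buffer, []
        attempts = 0
        while batch:
            failed = self.send_batch(batch)
            if not failed:
                return
            attempts += 1
            if attempts > self.max_retries:
                raise RuntimeError(f"giving up on {len(failed)} events")
            batch = failed


# Usage with a destination that rejects one entry on the first request.
sent = []
calls = {"count": 0}


def flaky_send(batch):
    calls["count"] += 1
    if calls["count"] == 1:
        sent.extend(batch[:-1])  # last entry is throttled
        return batch[-1:]
    sent.extend(batch)
    return []


sink = BufferingSink(flaky_send, max_batch_size=3)
for event in ["a", "b", "c", "d"]:
    sink.write(event)
sink.flush()  # flush the remaining partial batch
print(sent)
```

In this sketch, only `flaky_send` is specific to the destination; the batching and retry logic is generic, which mirrors the motivation for sharing that logic across sinks.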

Building real-time applications using Apache Flink

Build real-time applications using Apache Flink with Apache Kafka and Amazon Kinesis Data Streams. Apache Flink is a framework and engine for building streaming applications for use cases such as real-time analytics and complex event processing. This session covers best practices for building low-latency applications with Apache Flink when reading data from either Amazon MSK or Amazon Kinesis Data Streams. It also covers best practices for running low-latency Apache Flink applications using Amazon Kinesis Data Analytics and discusses AWS’s open-source contributions to this use case. ...

Build a Unified Batch and Stream Processing Pipeline with Apache Beam on AWS

In this workshop, we explore an end-to-end example that combines batch and streaming aspects in one uniform Beam pipeline. We start by analyzing incoming taxi trip events in near real time with an Apache Beam pipeline. We then show how to archive the trip data to Amazon S3 for long-term storage. We subsequently explain how to read the historic data from S3 and backfill new metrics by executing the same Beam pipeline in a batch fashion. Along the way, you also learn how you can deploy and execute the Beam pipeline with Amazon Kinesis Data Analytics in a fully managed environment. ...

Streaming ETL with Apache Flink and Amazon Kinesis Data Analytics

This post looks at how to use Apache Flink as a basis for sophisticated streaming extract-transform-load (ETL) pipelines. Apache Flink is a framework and distributed processing engine for processing data streams. AWS provides a fully managed service for Apache Flink through Amazon Kinesis Data Analytics, which enables you to build and run sophisticated streaming applications quickly, easily, and with low operational overhead. https://aws.amazon.com/blogs/big-data/streaming-etl-with-apache-flink-and-amazon-kinesis-data-analytics/

Choosing the right service for your data streaming needs (ANT316)

In this chalk talk, we discuss the benefits of different AWS streaming services and walk through some use cases for each. We share best practices based on real customer examples and discuss a framework that you can use to determine which set of services best suits your specific use case. Finally, we show some interactive examples, so come ready with your real-life scenarios that we can discuss live. https://d1.awsstatic.com/events/reinvent/2019/Choosing_the_right_service_for_your_data_streaming_needs_ANT316.pdf

Build real-time analytics for a ride-sharing app (ANT401)

In this session, we walk through how to perform real-time analytics on ride-sharing and taxi data, and we explore how to build a reliable, scalable, and highly available streaming architecture based on managed services. You learn how to deploy, operate, and scale an Apache Flink application with Amazon Kinesis Data Analytics for Java applications. Leave this workshop knowing how to build an end-to-end streaming analytics pipeline, starting with ingesting data into a Kinesis data stream, writing and deploying a Flink application to perform basic stream transformations and aggregations, and persisting the results to Amazon Elasticsearch Service to be visualized from Kibana. ...

Build and run streaming applications with Apache Flink and Amazon Kinesis Data Analytics (FF)

Stream processing facilitates the collection, processing, and analysis of real-time data and enables the continuous generation of insights and quick reactions to emerging situations. Yet, despite these advantages compared to traditional batch-oriented analytics applications, streaming applications are much more challenging to operate. Some of these challenges include the ability to provide and maintain low end-to-end latency, to seamlessly recover from failure, and to deal with a varying amount of throughput. ...

Streaming Analytics Workshop

In this workshop, you will build an end-to-end streaming architecture to ingest, analyze, and visualize streaming data in near real time. You set out to improve the operations of a taxi company in New York City. You’ll analyze the telemetry data of its taxi fleet in near real time to optimize fleet operations. You will not only learn how to deploy, operate, and scale an Apache Flink application with Kinesis Data Analytics for Java Applications, but also explore the basic concepts of Apache Flink and running Flink applications in a fully managed environment on AWS. ...

Unify Batch and Stream Processing with Apache Beam on AWS

One of the big visions of Apache Beam is to provide a single programming model for both batch and streaming that runs on multiple execution engines. In this session, we explore an end-to-end example that shows how you can combine batch and streaming aspects in one uniform Beam pipeline. We start by ingesting taxi trip events into an Amazon Kinesis data stream and use a Beam pipeline to analyze the streaming data in near real time. We then show how to archive the trip data to Amazon S3 and how to extend and update the Beam pipeline to generate additional metrics from the streaming data going forward. We subsequently explain how to backfill the added metrics by executing the same Beam pipeline in a batch fashion against the archived data in S3. Along the way, we also discuss how to leverage different execution engines, such as Amazon Kinesis Data Analytics for Java and Amazon Elastic MapReduce, to run Beam pipelines in a fully managed environment. ...