How to Simplify Microservices with a Shared Database and Materialized Views

This blog post was both enjoyable and quick to write, at least by my standards. It explores a slightly provocative idea: challenging the fundamental assumption that microservices must not share a database to expose data to external services, and examining what breaks when this convention is ignored. As it turns out, quite a lot breaks, but there are also significant benefits, especially when business logic requires consistent data from multiple services. ...
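To make the pattern concrete, here is a minimal sketch, assuming a Materialize-style Postgres-compatible endpoint and hypothetical customers and orders tables owned by two different services; none of the names below are from the post itself:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SharedViewSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical shared, Postgres-compatible endpoint (e.g., Materialize).
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:6875/materialize", "materialize", "");
             Statement stmt = conn.createStatement()) {

            // The customer and order services each own a table; a materialized
            // view joins them so consumers read one consistent combination.
            stmt.execute(
                "CREATE MATERIALIZED VIEW customer_orders AS " +
                "SELECT c.customer_id, c.name, count(o.order_id) AS open_orders " +
                "FROM customers c LEFT JOIN orders o ON o.customer_id = c.customer_id " +
                "GROUP BY c.customer_id, c.name");

            // Any service can now read the joined, continuously updated result.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT name, open_orders FROM customer_orders")) {
                while (rs.next()) {
                    System.out.printf("%s: %d open orders%n",
                            rs.getString("name"), rs.getLong("open_orders"));
                }
            }
        }
    }
}
```

Because the database maintains the view, every consumer sees the same consistent join instead of stitching together potentially stale API responses.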

January 14, 2025 · Blog Post · Highlights

Everything you need to know to be a Materialize power-user

This post is also available on the Materialize blog. Materialize is a distributed SQL database built on streaming internals. With it, you can use the SQL you already know to build powerful stream processing capabilities. But as with any abstraction, the underlying implementation details sometimes leak through. Queries that look simple and innocent when you formulate them in SQL can require more resources than expected when they are evaluated incrementally against a continuous stream of arriving updates. ...
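As an illustration of how two innocent-looking queries can have very different incremental costs, consider this sketch against a hypothetical events source, again over Materialize's Postgres-compatible endpoint (names and connection details are assumptions, not from the post):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class IncrementalCostSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:6875/materialize", "materialize", "");
             Statement stmt = conn.createStatement()) {

            // Cheap to maintain incrementally: a count only needs the running
            // total, so each arriving update touches O(1) state per group.
            stmt.execute("CREATE MATERIALIZED VIEW clicks_per_user AS " +
                    "SELECT user_id, count(*) AS clicks FROM events GROUP BY user_id");

            // Looks just as innocent, but MAX must survive retractions: if the
            // current maximum is deleted, the next-largest value is needed, so
            // the dataflow has to keep additional per-group state around.
            stmt.execute("CREATE MATERIALIZED VIEW last_click AS " +
                    "SELECT user_id, max(ts) AS last_seen FROM events GROUP BY user_id");
        }
    }
}
```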

April 20, 2023 · Blog Post · Highlights

Best practices for right-sizing your Apache Kafka clusters to optimize performance and cost

Apache Kafka is well known for its performance and for its tunability to different use cases. But it can be challenging to find the infrastructure configuration that meets your specific performance requirements while minimizing infrastructure cost. This post explains how the underlying infrastructure affects Apache Kafka performance. We discuss strategies for sizing your clusters to meet your throughput, availability, and latency requirements. Along the way, we answer questions like “when does it make sense to scale up vs. scale out?” We end with guidance on how to continuously verify the size of your production clusters. ...
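The post's reasoning lends itself to back-of-envelope arithmetic. The sketch below is not the sizing model from the post, just illustrative math with made-up numbers, showing why replication and consumer fan-out, rather than raw ingress, tend to drive broker counts:

```java
public class KafkaSizingSketch {
    public static void main(String[] args) {
        // Hypothetical workload and broker figures; substitute measured values.
        double ingressMBs = 250;        // producer traffic into the cluster (MB/s)
        int replicationFactor = 3;      // each byte is stored replicationFactor times
        int consumerGroups = 2;         // each group reads the full ingress once
        double brokerNetworkMBs = 120;  // usable network throughput per broker (MB/s)
        double brokerStorageMBs = 150;  // sustained write throughput per broker (MB/s)

        // Cluster-wide write volume: leader writes plus follower replication.
        double clusterWritesMBs = ingressMBs * replicationFactor;

        // Outbound traffic: consumers plus (replicationFactor - 1) follower fetches.
        double clusterReadsMBs = ingressMBs * (consumerGroups + replicationFactor - 1);

        int brokersForStorage = (int) Math.ceil(clusterWritesMBs / brokerStorageMBs);
        int brokersForNetwork = (int) Math.ceil(
                (clusterWritesMBs + clusterReadsMBs) / brokerNetworkMBs);

        System.out.printf("Minimum brokers (storage-bound): %d%n", brokersForStorage);
        System.out.printf("Minimum brokers (network-bound): %d%n", brokersForNetwork);
        System.out.printf("Provision for the larger bound: %d brokers%n",
                Math.max(brokersForStorage, brokersForNetwork));
    }
}
```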

March 14, 2022 · Blog Post · Highlights

Flink Improvement Proposal 171: Async Sink

Apache Flink has a rich connector ecosystem that can persist data in various destinations. Flink natively supports Apache Kafka, Amazon Kinesis Data Streams, Elasticsearch, HBase, and many more destinations. Additional connectors are maintained in Apache Bahir or directly on GitHub. The basic functionality of these sinks is quite similar: they batch events according to user-defined buffering hints, sign requests and send them to the respective endpoint, retry unsuccessful or throttled requests, and participate in checkpointing. They differ primarily in how they interface with the destination. Yet all of the above-mentioned sinks are developed and maintained independently. ...
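For context, here is roughly what configuring a connector built on the resulting AsyncSinkBase looks like, using the Kinesis Data Streams sink that later shipped in flink-connector-aws; stream name, region, and tuning values are made up:

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kinesis.sink.KinesisStreamsSink;

public class AsyncSinkSketch {
    public static KinesisStreamsSink<String> buildSink() {
        Properties clientProps = new Properties();
        clientProps.setProperty("aws.region", "eu-west-1"); // hypothetical region

        // The buffering and in-flight-request knobs come from the shared
        // AsyncSinkBase; connectors built on it only implement the
        // destination-specific request building and submission.
        return KinesisStreamsSink.<String>builder()
                .setKinesisClientProperties(clientProps)
                .setSerializationSchema(new SimpleStringSchema())
                .setStreamName("example-stream")            // hypothetical stream
                .setPartitionKeyGenerator(el -> String.valueOf(el.hashCode()))
                .setMaxBatchSize(500)           // buffering hint: records per request
                .setMaxInFlightRequests(16)     // cap on concurrent requests
                .setMaxBufferedRequests(10_000) // backpressure once the buffer fills
                .build();
    }
}
```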

June 9, 2021 · Code · Highlights

Streaming ETL with Apache Flink and Amazon Kinesis Data Analytics

This post looks at how to use Apache Flink as a basis for sophisticated streaming extract-transform-load (ETL) pipelines. Apache Flink is a framework and distributed processing engine for processing data streams. AWS provides a fully managed service for Apache Flink through Amazon Kinesis Data Analytics, which enables you to build and run sophisticated streaming applications quickly, easily, and with low operational overhead. ...

https://aws.amazon.com/blogs/big-data/streaming-etl-with-apache-flink-and-amazon-kinesis-data-analytics/
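A minimal skeleton of such a pipeline might look as follows; this is an illustrative sketch with a hypothetical stream name and trivial transformations, not the code from the post:

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;

public class StreamingEtlSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        Properties consumerConfig = new Properties();
        consumerConfig.setProperty(ConsumerConfigConstants.AWS_REGION, "eu-west-1");

        // Extract: read raw events from a (hypothetical) Kinesis data stream.
        DataStream<String> rawEvents = env.addSource(new FlinkKinesisConsumer<>(
                "input-stream", new SimpleStringSchema(), consumerConfig));

        // Transform: drop malformed records and normalize the rest.
        DataStream<String> cleaned = rawEvents
                .filter(line -> !line.isEmpty() && line.contains(","))
                .map(String::trim);

        // Load: stand-in sink; a real pipeline would write to Kinesis, S3, etc.
        cleaned.print();

        env.execute("streaming-etl");
    }
}
```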

February 21, 2020 · Blog Post · Highlights

Build and run streaming applications with Apache Flink and Amazon Kinesis Data Analytics (FF)

Stream processing facilitates the collection, processing, and analysis of real-time data and enables the continuous generation of insights and quick reactions to emerging situations. Yet, despite these advantages over traditional batch-oriented analytics applications, streaming applications are much more challenging to operate. Some of these challenges include maintaining low end-to-end latency, seamlessly recovering from failure, and dealing with varying throughput. ...

October 8, 2019 · Presentation · Highlights

Build Your First Big Data Application on AWS

AWS makes it easy to build and operate highly scalable and flexible data platforms to collect, process, and analyze data so you can get timely insights and react quickly to new information. In this session, we will demonstrate how you can quickly build a fully managed data platform that transforms, cleans, and analyzes incoming data in real time and persists the cleaned data for subsequent visualization and exploration by means of SQL. To this end, we will build an end-to-end streaming data solution that uses Kinesis Data Streams for data ingestion and Kinesis Data Analytics for real-time outlier and hotspot detection, and we will show how the incoming data can be persisted by means of Kinesis Data Firehose to make it available to Amazon Athena and Amazon QuickSight for data exploration and visualization. ...
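As a taste of the ingestion step, this sketch writes a single event to a Kinesis data stream with the AWS SDK for Java v2; the event payload and stream name are hypothetical stand-ins for the session's demo data:

```java
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.PutRecordRequest;

public class IngestSketch {
    public static void main(String[] args) {
        // Uses the default region and credentials provider chains.
        try (KinesisClient kinesis = KinesisClient.create()) {
            String event = "{\"pickup\":\"2019-02-26T10:15:00\",\"fare\":12.5}";

            kinesis.putRecord(PutRecordRequest.builder()
                    .streamName("ingest-stream")
                    .partitionKey("trip-42") // determines the target shard
                    .data(SdkBytes.fromUtf8String(event))
                    .build());
        }
    }
}
```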

February 26, 2019 · Presentation · Highlights