Blog Post

How to Simplify Microservices with a Shared Database and Materialized Views

This blog post was both enjoyable and quick to write, at least by my standards. It explores a slightly provocative idea: challenging the fundamental assumption that microservices must not share a database to expose data to external services, and examining what breaks when this convention is ignored. As it turns out, quite a lot breaks, but there are also significant benefits, especially when business logic requires consistent data from multiple services. ...

How Materialize Unlocks Private Kafka Connectivity via PrivateLink and SSH

At Materialize, we’ve built a data warehouse that runs on real-time data. Our customers use this real-time data to power critical business use cases, from fraud detection, to dynamic pricing, to loan underwriting. To provide our customers with streaming data, we have first-class support for loading and unloading data via Apache Kafka, the de facto standard for transit for real-time data. Because of the sensitivity of their data, our customers require strong encryption and authentication schemes at a minimum. Many of our customers go one step further and require that no data is loaded or unloaded over the public internet. ...

Everything you need to know to be a Materialize power-user

This post is also available on the Materialize blog. Materialize is a distributed SQL database built on streaming internals. With it, you can use the SQL you are already familiar with to build powerful stream processing capabilities. But as with any abstraction, the underlying implementation details sometimes leak through. Queries that look simple and innocent when you are formulating them in SQL can sometimes require more resources than expected when evaluated incrementally against a continuous stream of arriving updates. ...

Leaving Amazon

After more than 7.5 years, my time at AWS came to a close at the end of 2022. It’s been an incredible journey of learning and professional growth. I’m still surprised by how much trust and support I received over the years to focus on things I found important and impactful. Just last year, the work I started to improve the Apache Flink connector system was contributed back to the open source project, not only resulting in several blog posts and a session at Flink Forward, but also getting early adoption that led to support for new destinations that now integrate with Apache Flink. I also spent a ridiculous amount of energy and time on understanding Apache Kafka performance in cloud environments, which not only uncovered several opportunities for internal improvements, but also led to one of the most popular blog posts on the AWS big data blog in 2022. Throughout 2022, I also started building my own team within the messaging and streaming organization with the goal of enabling and supporting customers adopting streaming technologies on AWS. ...

Making it Easier to Build Connectors with Apache Flink: Introducing the Async Sink

Apache Flink is a popular open source framework for stateful computations over data streams. It allows you to formulate queries that are continuously evaluated in near real time against an incoming stream of events. To persist derived insights from these queries in downstream systems, Apache Flink comes with a rich connector ecosystem that supports a wide range of sources and destinations. However, the existing connectors may not always be enough to support all conceivable use cases. Our customers and the community kept asking for more connectors and better integrations with various open source tools and services. ...

Best practices for right-sizing your Apache Kafka clusters to optimize performance and cost

Apache Kafka is well known for its performance and tunability to optimize for various use cases. But sometimes it can be challenging to find the right infrastructure configuration that meets your specific performance requirements while minimizing the infrastructure cost. This post explains how the underlying infrastructure affects Apache Kafka performance. We discuss strategies on how to size your clusters to meet your throughput, availability, and latency requirements. Along the way, we answer questions like “when does it make sense to scale up vs. scale out?” We end with guidance on how to continuously verify the size of your production clusters. ...

Streaming ETL with Apache Flink and Amazon Kinesis Data Analytics

This post looks at how to use Apache Flink as a basis for sophisticated streaming extract-transform-load (ETL) pipelines. Apache Flink is a framework and distributed processing engine for processing data streams. AWS provides a fully managed service for Apache Flink through Amazon Kinesis Data Analytics, which enables you to build and run sophisticated streaming applications quickly, easily, and with low operational overhead. https://aws.amazon.com/blogs/big-data/streaming-etl-with-apache-flink-and-amazon-kinesis-data-analytics/

Build and run streaming applications with Apache Flink and Amazon Kinesis Data Analytics

Stream processing facilitates the collection, processing, and analysis of real-time data and enables the continuous generation of insights and quick reactions to emerging situations. This capability is useful when the value of derived insights diminishes over time. Hence, the faster you can react to a detected situation, the more valuable the reaction is going to be. Consider, for instance, a streaming application that analyzes and blocks fraudulent credit card transactions while they occur. Compare that application to a traditional batch-oriented approach that identifies fraudulent transactions at the end of every business day and generates a nice report for you to read the next morning. ...

Build a Real-time Stream Processing Pipeline with Apache Flink on AWS

In today’s business environments, data is generated continuously by a steadily increasing number of diverse data sources. Therefore, the ability to continuously capture, store, and process this data to quickly turn high-volume streams of raw data into actionable insights has become a substantial competitive advantage for organizations. Apache Flink is an open source project that is well-suited to form the basis of such a stream processing pipeline. It offers unique capabilities that are tailored to the continuous analysis of streaming data. However, building and maintaining a pipeline based on Flink often requires considerable expertise, in addition to physical resources and operational effort. ...