Data Streams

Live data processing gives you a major competitive advantage. Do not just dump your data into a lake and analyze it in batches. Instead, let us enable you to run online analytics and act instantaneously. We create live decision support and reporting systems. You will run powerful AI on your live data, observing effectiveness and efficiency at any point in time.

Based upon many years of experience with stream data systems and all kinds of inputs and outputs, as well as continuous training and specialized in-house software, we engineer your bespoke stream data solution.

Quick for Data Streams

Quick receives and orchestrates your data streams. It runs and maintains your analytics and AI algorithms on your data streams. Quick exposes production API backends and connects your data streams to your applications and devices. Quick facilitates stream data quality and global consistency. Quick is easy to integrate, yet scalable and reliable.

Based upon our experience from many bespoke stream data projects, Quick is their quintessence, engineered with passion by bakdata. It is built upon Apache Kafka and runs on Kubernetes, in your private, hybrid, or public cloud.

Open Source

We believe that a vendor-neutral offering gives our customers the ability to apply their own expertise and develop software at their own discretion. Together with you, we open source a number of our solutions:

streams-explorer
Explore Data Pipelines in Apache Kafka.

streams-bootstrap
A micro-framework that lets you focus on writing Kafka Streams apps (a minimal example of such an app is sketched after this list).

kafka-large-message-serde
A Kafka Serde that transparently reads and writes records from and to blob storage, such as Amazon S3 or Azure Blob Storage.

kafka-profile-store
A Kafka Streams application that creates a queryable object store.

fluent-kafka-streams-tests
Write clean and concise tests for your Kafka Streams application.
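
To give a concrete impression of the kind of application these libraries support, here is a minimal Kafka Streams topology using only the standard Apache Kafka APIs. The topic names and the uppercase transformation are made up for illustration; streams-bootstrap wraps applications like this with configuration and deployment boilerplate, and fluent-kafka-streams-tests covers their topologies with tests.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseApp {

    public static void main(final String[] args) {
        final StreamsBuilder builder = new StreamsBuilder();

        // Read raw text events, transform them, and write them to an output topic.
        // Topic names are placeholders for illustration only.
        final KStream<String, String> input = builder.stream("input-topic");
        input.mapValues(value -> value.toUpperCase())
             .to("output-topic");

        final Properties config = new Properties();
        config.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app");
        config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        config.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        config.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        final KafkaStreams streams = new KafkaStreams(builder.build(), config);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```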

Talks

2021

Stream Data Deduplication Powered by Kafka Streams

Speaker: Philipp Schirmer

Representations of data describing, e.g., news, persons, or places differ between sources. Therefore, we need to identify duplicates, for example, if we want to stream deduplicated news from different sources into a sentiment classifier.
We built a system that collects data from different sources in a streaming fashion, aligns it to a global schema, and then detects duplicates within the data stream without time window constraints. The challenge is not only to process newly published data without significant delay, but also to reprocess hundreds of millions of existing messages, for example, after improving the similarity measure.
In this talk, we present our implementation for deduplication of data streams built on top of Kafka Streams. For this, we leverage Kafka APIs, namely state stores, and also use Kubernetes to auto-scale our application from 0 to a defined maximum. This allows us to process live data immediately and also reprocess all data from scratch within a reasonable amount of time.

https://www.confluent.de/resources/presentation/stream-data-deduplication-powered-by-kafka-streams/
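
The core idea of the talk can be sketched with standard Kafka Streams APIs: a persistent state store remembers which records have been seen, so duplicates are dropped without a time window. The store and topic names and the use of the record key as the deduplication criterion are assumptions for illustration; the production system described above is considerably more involved.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.ValueTransformerWithKey;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

public class DeduplicationTopology {

    private static final String STORE_NAME = "seen-keys"; // assumed store name

    public static StreamsBuilder buildTopology() {
        final StreamsBuilder builder = new StreamsBuilder();

        // Persistent store that remembers when a key was first seen; no time window, so
        // duplicates are detected across the entire history of the stream.
        final StoreBuilder<KeyValueStore<String, Long>> store = Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore(STORE_NAME), Serdes.String(), Serdes.Long());
        builder.addStateStore(store);

        final KStream<String, String> input = builder.stream("news-input"); // assumed topic name
        input.transformValues(DeduplicationTransformer::new, STORE_NAME)
             .filter((key, value) -> value != null) // duplicates were mapped to null
             .to("news-deduplicated");              // assumed topic name

        return builder;
    }

    // Forwards a value only the first time its key is seen; later occurrences become null.
    static class DeduplicationTransformer implements ValueTransformerWithKey<String, String, String> {
        private KeyValueStore<String, Long> store;

        @Override
        public void init(final ProcessorContext context) {
            this.store = context.getStateStore(STORE_NAME);
        }

        @Override
        public String transform(final String key, final String value) {
            if (this.store.get(key) != null) {
                return null; // duplicate: this key has already been processed
            }
            this.store.put(key, System.currentTimeMillis());
            return value;
        }

        @Override
        public void close() {
            // nothing to clean up
        }
    }
}
```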

 

Bayer Document Stream Processing

Speakers: Dr. Astrid Rheinländer, Dr.-Ing. Christoph Böhm

Bayer selected Apache Kafka as the primary layer for a variety of document streams flowing through several text processing and enrichment steps. Every day, Bayer analyzes numerous documents, including clinical trials, patents, reports, news, and literature. We will give an idea of the strategic importance, peek into future challenges, and provide an end-to-end technical overview.
Throughout the talk, we look at challenges the platform handles and discuss the respective solutions. Among others, we discuss our approach to continuously pulling in data from a variety of external sources and how we harmonize different formats and schemas. We also discuss large document processing and error handling, which allows efficient debugging without blocking the pipeline.
Then, we take on the user’s perspective and demo the platform. You will learn how users create new document processing pipelines and how Bayer keeps track of the many running Kafka pipelines.

https://www.confluent.io/events/kafka-summit-europe-2021/bayer-document-stream-processing/
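
The non-blocking error handling mentioned above can be illustrated with a common Kafka Streams pattern. This is a generic sketch with invented topic names and a trivial enrichment step, not Bayer's actual implementation: failed records are routed to a dead-letter topic for inspection and replay while successful records keep flowing.

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

public class DeadLetterTopology {

    // Minimal result wrapper: either an enriched value or an error message.
    record Result(String value, String error) {
        static Result ok(final String value) { return new Result(value, null); }
        static Result failed(final String error) { return new Result(null, error); }
        boolean isOk() { return this.error == null; }
    }

    public static StreamsBuilder buildTopology() {
        final StreamsBuilder builder = new StreamsBuilder();
        final KStream<String, String> documents = builder.stream("documents-raw"); // assumed topic

        // Wrap the (possibly failing) enrichment step so exceptions do not stop the pipeline.
        final KStream<String, Result> processed = documents.mapValues(value -> {
            try {
                return Result.ok(enrich(value));
            } catch (final RuntimeException e) {
                return Result.failed(e.getMessage());
            }
        });

        // Successful records continue downstream ...
        processed.filter((key, result) -> result.isOk())
                 .mapValues(Result::value)
                 .to("documents-enriched");        // assumed topic

        // ... while failures go to a dead-letter topic for debugging.
        processed.filterNot((key, result) -> result.isOk())
                 .mapValues(Result::error)
                 .to("documents-dead-letter");     // assumed topic

        return builder;
    }

    private static String enrich(final String document) {
        // Placeholder for a real text-processing step.
        return document.trim();
    }
}
```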

 

Cost-effective GraphQL Queries against Kafka Topics at scale

Speaker: Torben Meyer

In our projects, we often have to query the content of Kafka topics. To that end, we expose REST APIs based on Kafka Streams’ interactive queries. However, this approach has some shortcomings. For example, users must stitch various APIs and results together. Furthermore, it can become costly, as each topic’s API requires one or more JVMs.
In this talk, we show how GraphQL can serve single queries that involve multiple Kafka topics and return only the data the user requested. Our approach eliminates the unnecessary overhead and lack of flexibility of traditional API approaches on Kafka topics. We also highlight different ways to reduce costs for computational resources such as CPU and RAM. First, we introduce more efficient queries through smart sub-query routing. Second, we build an ahead-of-time compiled, self-contained executable with GraalVM’s Native Image and compare it to the traditionally packaged JAR regarding memory usage and performance for different query workloads.

https://www.confluent.de/resources/presentation/cost-effective-graphql-queries-against-kafka-topics-at-scale/
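
For context, this is roughly what a Kafka Streams interactive query looks like with the standard APIs; the topic and store names are placeholders. A GraphQL (or REST) layer sits on top of such lookups and routes each sub-query to the instance that hosts the requested key.

```java
import java.util.Properties;

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class InteractiveQueryExample {

    public static void buildAndQuery(final Properties config) {
        final StreamsBuilder builder = new StreamsBuilder();

        // Materialize a topic as a queryable key-value store ("products" is an assumed name).
        builder.table("products", Materialized.as("product-store"));

        final KafkaStreams streams = new KafkaStreams(builder.build(), config);
        streams.start();
        // In practice, wait until the application reaches the RUNNING state before querying.

        // Query the local state store; an API layer would perform this lookup per request
        // and route to the instance that hosts the key.
        final ReadOnlyKeyValueStore<String, String> store = streams.store(
                StoreQueryParameters.fromNameAndType("product-store", QueryableStoreTypes.keyValueStore()));
        System.out.println(store.get("some-id"));
    }
}
```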

 

2020

End-to-end large messages processing with Kafka Streams & Kafka Connect

Speaker: Philipp Schirmer

There are several data streaming scenarios in which messages are either too large when first published to Kafka or become too large during processing, when they are assembled and pushed to the next Kafka topic or to another data system. For performance reasons, the default maximum message size in Kafka is 1 MB. Although this limit can be increased, there will always be messages that exceed the configured limit and are therefore too large for Kafka.

Therefore, we implemented a lightweight and transparent approach to publish and process large messages with Kafka Streams and Kafka Connect. Messages exceeding a configurable maximum message size are stored on an external file system, such as Amazon S3. By using the available Kafka APIs, i.e., SerDes and Kafka Connect Converters, this process works transparently without changing any existing code. Our implementation works as a wrapper for actual serialization and deserialization and is thus suitable for any data format used with Kafka.

https://videos.confluent.io/watch/4v4MNjB7TjXjqqWQApbRmD
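
To make the wrapper idea tangible, here is a heavily simplified serializer sketch: payloads above a threshold are written to an external store and replaced by a small reference. The threshold, marker byte, and in-memory "blob store" are invented for illustration; the actual kafka-large-message-serde implementation and configuration differ, so consult the project for real usage.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.kafka.common.serialization.Serializer;
import org.apache.kafka.common.serialization.StringSerializer;

// Simplified illustration of the wrapper approach; NOT the kafka-large-message-serde code.
public class LargeMessageSerializerSketch implements Serializer<String> {

    private static final int MAX_INLINE_BYTES = 1_000_000; // illustrative threshold
    private static final byte INLINE = 0;                  // marker: payload follows inline
    private static final byte REFERENCE = 1;               // marker: payload stored externally

    private final Serializer<String> inner = new StringSerializer();
    // Stand-in for a blob store client such as Amazon S3 or Azure Blob Storage.
    private final Map<String, byte[]> blobStore = new ConcurrentHashMap<>();

    @Override
    public byte[] serialize(final String topic, final String data) {
        final byte[] payload = this.inner.serialize(topic, data);
        if (payload == null || payload.length <= MAX_INLINE_BYTES) {
            return prefix(INLINE, payload == null ? new byte[0] : payload);
        }
        // Offload the payload and keep only a reference (URI) in the Kafka record.
        final String uri = "blob://" + topic + "/" + UUID.randomUUID();
        this.blobStore.put(uri, payload);
        return prefix(REFERENCE, uri.getBytes(StandardCharsets.UTF_8));
    }

    private static byte[] prefix(final byte marker, final byte[] bytes) {
        final byte[] result = new byte[bytes.length + 1];
        result[0] = marker;
        System.arraycopy(bytes, 0, result, 1, bytes.length);
        return result;
    }
}
```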

 

2019

Mining Stream Data with Apache Kafka

Speaker: Dr.-Ing. Alexander Albrecht

Many Big Data projects deal with never-ending data streams. In such streaming scenarios, it does not suffice to run a pre-trained model to make highly accurate predictions.

Also, most machine learning techniques require the complete input during the training phase and thus cannot be trained at high speed in real time.

In this talk, we demonstrate how to implement incremental Decision Tree learning in Kafka Streams. The presented techniques can be transferred to other stream learning approaches.
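
As a rough sketch of the pattern, not the decision-tree implementation from the talk, a Kafka Streams aggregation can fold every incoming training example into a continuously updated model held in a state store. The model interface and topic names below are placeholders.

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class IncrementalLearningTopology {

    // Placeholder for an incrementally updatable model, e.g., an incremental decision tree.
    interface IncrementalModel {
        IncrementalModel learn(String example);
    }

    public static StreamsBuilder buildTopology(final IncrementalModel initialModel) {
        final StreamsBuilder builder = new StreamsBuilder();

        // Labeled training examples arrive as a stream; the topic name is an assumption.
        final KStream<String, String> examples = builder.stream("training-examples");

        // Fold each example into the model; the state store keeps the latest model per key.
        // A real application would supply a custom serde for the model via Materialized.
        final KTable<String, IncrementalModel> model = examples
                .groupByKey()
                .aggregate(() -> initialModel,
                           (key, example, currentModel) -> currentModel.learn(example));

        // Downstream consumers can query the latest model or apply it to live data.
        model.toStream().to("model-updates"); // assumed topic; also needs a custom serde
        return builder;
    }
}
```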