Operational AI WIKI


Apache Kafka

Apache Kafka is a log-based event streaming platform that serves as a pipeline for real-time data feeds. It is used in high-performance settings such as streaming analytics, data integration, and other applications that require high throughput and low latency.


Kafka can be thought of as a digital central nervous system. It lets applications publish (write) and subscribe to (read) streams of events, and continuously import and export data across multiple systems simultaneously. Kafka organizes events into “topics”, each split into partitions; “producers” append key-value messages to partitions, and “consumers” read messages back from them. Kafka is distributed, highly scalable, elastic, fault-tolerant, and secure.
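The topic/partition/producer/consumer model above can be sketched with a toy in-memory class. This is illustrative only, not Kafka’s implementation: the class and method names are invented, and the md5 hash stands in for Kafka’s actual default partitioner (which uses murmur2), but the key property is the same — messages with the same key always land in the same partition, preserving per-key ordering.

```python
import hashlib


class Topic:
    """Toy in-memory model of a Kafka topic: one ordered, append-only
    list of key-value messages per partition."""

    def __init__(self, name, num_partitions=3):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Hash the key so all messages with the same key go to the same
        # partition (real Kafka uses murmur2; md5 here for simplicity).
        digest = hashlib.md5(key.encode()).digest()
        return int.from_bytes(digest[:4], "big") % len(self.partitions)

    def produce(self, key, value):
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def consume(self, partition, offset):
        # Consumers read sequentially from an offset they track themselves;
        # reading does not remove data.
        return self.partitions[partition][offset:]


clicks = Topic("page-clicks")
part, off = clicks.produce("user-42", {"page": "/home"})
clicks.produce("user-42", {"page": "/pricing"})
# Same key -> same partition, so this user's events stay in order.
```

Because reads are non-destructive and offset-based, many independent consumers can process the same topic at their own pace — the property that lets Kafka feed multiple downstream systems at once.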


Kafka was originally developed at LinkedIn and has become widely used by companies such as Spotify, Netflix, Uber, Goldman Sachs, PayPal, and Cloudflare. Kafka was open sourced in 2011 and is licensed under Apache License 2.0.


A Kafka deployment consists of servers and clients that communicate via a high-performance binary protocol over TCP. The protocol is optimized for efficiency and relies on a “message set” abstraction that naturally groups messages together to reduce the overhead of network roundtrips. Kafka can be deployed on bare-metal hardware, virtual machines, and containers, in on-premises as well as cloud environments.
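The effect of the message-set abstraction can be sketched with a toy batching producer. This is not Kafka client code — the class, its methods, and the batch size are invented for illustration — but it shows the same trade: buffering records and shipping them as one set amortizes the per-roundtrip cost across many messages (Kafka producers tune this with settings like batch size and linger time).

```python
class BatchingProducer:
    """Toy sketch of message-set batching: buffer records and send them
    as one batch instead of one network roundtrip per record."""

    def __init__(self, transport, batch_size=100):
        self.transport = transport      # callable that "sends" a list of records
        self.batch_size = batch_size
        self.buffer = []
        self.roundtrips = 0

    def send(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.transport(self.buffer)  # one roundtrip for the whole set
            self.roundtrips += 1
            self.buffer = []


sent = []
producer = BatchingProducer(sent.append, batch_size=50)
for i in range(200):
    producer.send(i)
producer.flush()
# 200 records cross the wire in 4 roundtrips instead of 200.
```

The same grouping also improves compression and sequential disk I/O on the broker side, since a whole set can be written and replicated as one contiguous chunk.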



Dozens of data systems and repositories connected with custom piping between each pair of systems can become impossible to manage. https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying



Kafka was created to allow for a single data repository to integrate all consumers and sources. https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying



Kafka provides a central pipeline, called a log, with a well-defined API for adding data. https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
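The log abstraction behind that architecture can be reduced to a very small API sketch — again purely illustrative, with invented names: appends return a monotonically increasing offset, and any number of readers pull from whatever offset they have reached, without removing data.

```python
class Log:
    """Minimal append-only log: the abstraction at Kafka's core."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset: the record's position in the log

    def read(self, offset, max_records=10):
        # Non-destructive read from a given offset.
        return self.records[offset:offset + max_records]


log = Log()
log.append("evt-a")
log.append("evt-b")
# Two independent readers, each tracking its own offset:
reader1 = log.read(0)   # ["evt-a", "evt-b"]
reader2 = log.read(1)   # ["evt-b"]
```

Because each downstream system only needs to remember its own offset, adding a new consumer to the pipeline requires no change to producers or to other consumers.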


At Molecula

Molecula’s FeatureBase is a database platform that specializes in high-throughput, low-latency, and highly concurrent real-time data performance. A typical Molecula customer needs to run complex queries across both historical and streaming data. Data pipelines such as Kafka make it easier to manage large volumes of continuously arriving data, and Kafka topics are Molecula’s preferred ingest method for consuming streaming data. Even at massive scale, once events are ingested into FeatureBase, queries against both current and historical data return with ultra-low latency, without having to batch or pre-aggregate the data.


Learn More About Apache Kafka

Apache Kafka website

Wikipedia entry: Apache Kafka

Bernard Marr & Co: What is Kafka? A super-simple explanation of this important data analytics tool