Slide 5-6 Kafka

The document provides an overview of Apache Kafka and event streaming. It discusses what event streaming is and common use cases. It then describes what a stream is in the context of event streaming. The document also discusses how to store data in Kafka and provides tutorials for installing and running Kafka on Windows and Google Colab. It includes tutorials for using the Kafka Python client and integrating Kafka with Spark structured streaming. The last part provides additional tutorials for streaming data from CSV files, video streaming using Kafka, and real-time anomaly detection with Kafka.

Distributed and Parallel Computing

Trong-Hop Do
Kafka – A distributed event streaming platform
What is event streaming?
What can I use event streaming for?
• To process payments and financial transactions in real time, such as in stock exchanges, banks, and insurance companies.

• To track and monitor cars, trucks, fleets, and shipments in real time, such as in logistics and the automotive industry.

• To continuously capture and analyze sensor data from IoT devices or other equipment, such as in factories and wind parks.

• To collect and immediately react to customer interactions and orders, such as in retail, the hotel and travel industry, and mobile applications.

• To monitor patients in hospital care and predict changes in condition to ensure timely treatment in emergencies.

• To connect, store, and make available data produced by different divisions of a company.

• To serve as the foundation for data platforms, event-driven architectures, and microservices.
What is a stream?
How is data stored in Kafka?
Tutorial 1: Kafka installation on Windows

• Download Kafka from https://kafka.apache.org/

• Unzip the downloaded file

• Rename the unzipped folder to “kafka” and move it to the C:\ drive


Kafka installation on Windows

• Open C:\kafka\config\server.properties

• Change the path of log.dirs


Kafka installation on Windows

• Open C:\kafka\config\zookeeper.properties

• Change the path of dataDir

• By default, Apache Kafka runs on port 9092 and Apache ZooKeeper runs on port 2181.
Tutorial 2: Run Apache Kafka on Windows

• Start the Kafka cluster


• Run the following command to start ZooKeeper:
cd C:\kafka\
.\bin\windows\zookeeper-server-start.bat .\config\zookeeper.properties
Run Apache Kafka on Windows

• Start the Kafka cluster


• Run the following command to start the Kafka broker:
cd C:\kafka\
.\bin\windows\kafka-server-start.bat .\config\server.properties
Run Apache Kafka on Windows

• Produce and consume some messages


• Run the kafka-topics command to create a Kafka topic named TestTopic

bin\windows\kafka-topics.bat --create --topic TestTopic --bootstrap-server localhost:9092

• Let’s create another topic named NewTopic


bin\windows\kafka-topics.bat --create --topic NewTopic --bootstrap-server localhost:9092

• Let’s list the created topics


bin\windows\kafka-topics.bat --list --bootstrap-server localhost:9092
Run Apache Kafka on Windows

• Produce and consume some messages

• Run the producer and the consumer in separate Command Prompt windows:

bin\windows\kafka-console-producer.bat --topic TestTopic --bootstrap-server localhost:9092


bin\windows\kafka-console-consumer.bat --topic TestTopic --from-beginning --bootstrap-server localhost:9092
Tutorial 3: Kafka Python client
• https://kafka-python.readthedocs.io/en/master/index.html

• Install Kafka-Python

• pip install kafka-python

• Start Zookeeper server and Kafka broker

• By default, ZooKeeper runs on localhost:2181 and Kafka on localhost:9092


Kafka-Python
• Run consumer code
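The consumer code appears on the slide as a screenshot; a minimal kafka-python consumer along these lines (assuming the TestTopic topic created earlier) would be:

from kafka import KafkaConsumer

# Subscribe to the topic; auto_offset_reset='earliest' replays messages
# that were produced before the consumer started
consumer = KafkaConsumer(
    'TestTopic',
    bootstrap_servers='localhost:9092',
    auto_offset_reset='earliest',
)

for message in consumer:
    # message.value is raw bytes; decode it for display
    print(f"{message.topic}:{message.partition}:{message.offset} value={message.value.decode('utf-8')}")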
Kafka-Python
• Run producer code
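The producer counterpart, again as a minimal sketch (topic name assumed to match the consumer above):

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:9092')

# send() is asynchronous; flush() blocks until all messages are delivered
for i in range(10):
    producer.send('TestTopic', value=f'message {i}'.encode('utf-8'))
producer.flush()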
Kafka-Python
• Check the result
Tutorial 4: Run Kafka on Colab
• Download Kafka and unzip
!curl -sSOL https://downloads.apache.org/kafka/3.3.1/kafka_2.13-3.3.1.tgz
!tar -xzf kafka_2.13-3.3.1.tgz

• Start the ZooKeeper server and the Kafka server


!./kafka_2.13-3.3.1/bin/zookeeper-server-start.sh -daemon ./kafka_2.13-3.3.1/config/zookeeper.properties
!./kafka_2.13-3.3.1/bin/kafka-server-start.sh -daemon ./kafka_2.13-3.3.1/config/server.properties

• Create a topic
!./kafka_2.13-3.3.1/bin/kafka-topics.sh --create --bootstrap-server 127.0.0.1:9092 --replication-factor 1 --partitions 1 --topic TestTopic
Run Kafka on Colab

• Describe the created topic


!./kafka_2.13-3.3.1/bin/kafka-topics.sh --describe --bootstrap-server 127.0.0.1:9092 --topic TestTopic

• Write some events into the topic


!./kafka_2.13-3.3.1/bin/kafka-console-producer.sh --topic TestTopic --bootstrap-server 127.0.0.1:9092

• Read the events


!./kafka_2.13-3.3.1/bin/kafka-console-consumer.sh --topic TestTopic --from-beginning --bootstrap-server 127.0.0.1:9092
Run Kafka on Colab
• You can run the cells sequentially and get the result (not really streaming)
Run Kafka on Colab
• Or you can run the producer and consumer in parallel in different terminals

• Open a terminal using Xterm and run the consumer (it will be empty at first)

• Open another terminal using Xterm and run the producer; write some lines and they will appear in the consumer’s terminal
Run Kafka on Colab
• Use kafka-python on Colab
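The Colab cells are shown as screenshots; a minimal sketch of the same idea (reusing the TestTopic topic from above) is:

from kafka import KafkaProducer, KafkaConsumer

# Produce a few messages to the topic created above
producer = KafkaProducer(bootstrap_servers='127.0.0.1:9092')
for i in range(5):
    producer.send('TestTopic', f'colab message {i}'.encode('utf-8'))
producer.flush()

# Consume them in the same cell; consumer_timeout_ms stops iteration
# when no new message arrives, so the cell terminates
consumer = KafkaConsumer(
    'TestTopic',
    bootstrap_servers='127.0.0.1:9092',
    auto_offset_reset='earliest',
    consumer_timeout_ms=5000,
)
for msg in consumer:
    print(msg.value.decode('utf-8'))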
Tutorial 5: Test Kafka and Spark Structured Streaming on Colab
• Start Kafka

• Install PySpark

#currently, 3.3.0 is the latest version; you still need to pin the version explicitly
!pip install pyspark==3.3.0

from pyspark.sql import SparkSession

# pip-installed PySpark 3.3.0 is built against Scala 2.12, so the Kafka
# package must use the matching Scala version
scala_version = '2.12'
spark_version = '3.3.0'
packages = [
    f'org.apache.spark:spark-sql-kafka-0-10_{scala_version}:{spark_version}',
    'org.apache.kafka:kafka-clients:3.3.1',
]
spark = (SparkSession.builder
         .master("local")
         .appName("kafka-example")
         .config("spark.jars.packages", ",".join(packages))
         .getOrCreate())
spark
• Install kafka-python
!pip install kafka-python

from kafka import KafkaProducer
from json import dumps

topic_name = 'Number'
kafka_server = 'localhost:9092'

producer = KafkaProducer(
    bootstrap_servers=kafka_server,
    value_serializer=lambda x: dumps(x).encode('utf-8'),
)

for e in range(1000):
    data = {'number': e}
    producer.send(topic_name, value=data)

producer.flush()

• You can test whether the topic was sent successfully

• Create a dataframe from the Kafka topic
kafkaDf = spark.read.format("kafka")\
.option("kafka.bootstrap.servers", "localhost:9092")\
.option("subscribe", topic_name)\
.option("startingOffsets", "earliest")\
.load()
kafkaDf.show()

• Show the dataframe in a formatted way


from pyspark.sql.functions import col, concat, lit

kafkaDf.select(
    concat(col("topic"), lit(':'), col("partition").cast("string")).alias("topic_partition"),
    col("offset"),
    col("value").cast("string"),
).show()
Tutorial 6: Test Kafka and Spark Structured Streaming on Local

• Step 1: Start the Kafka cluster using the Terminal

• Step 2: Run KafkaProducer in Jupyter Notebook

from kafka import KafkaProducer
from json import dumps
from time import sleep

topic_name = 'RandomNumber'
kafka_server = 'localhost:9092'

producer = KafkaProducer(
    bootstrap_servers=kafka_server,
    value_serializer=lambda x: dumps(x).encode('utf-8'),
)

for e in range(1000):
    data = {'number': e}
    producer.send(topic_name, value=data)
    print(str(data) + " sent")
    sleep(5)

producer.flush()
• Open another Jupyter Notebook

• You will read data from Kafka in two ways:


• Batch query
• Streaming query
• See more at https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
Creating a Kafka Source for Batch Queries
• Create dataframe from Kafka data
topic_name = 'RandomNumber'
kafka_server = 'localhost:9092'

kafkaDf = (spark.read.format("kafka")
           .option("kafka.bootstrap.servers", kafka_server)
           .option("subscribe", topic_name)
           .option("startingOffsets", "earliest")
           .load())

• Show the data (converting the dataframe to pandas for a cleaner view)


• Show the streaming data using a for loop

from pyspark.sql.functions import col
from time import sleep
from IPython.display import display, clear_output

batchDF = kafkaDf.select(
    col('topic'),
    col('offset'),
    col('value').cast('string').substr(12, 1).alias('rand_number'),
)

for x in range(0, 2000):
    try:
        print("Showing live view refreshed every 5 seconds")
        print(f"Seconds passed: {x*5}")
        display(batchDF.toPandas())
        sleep(5)
        clear_output(wait=True)
    except KeyboardInterrupt:
        print("break")
        break
print("Live view ended...")
• Perform some data aggregation and show live results
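The aggregation itself is shown as a screenshot; one hedged example, reusing batchDF from above, is to count how often each digit appears and display the counts inside the same refresh loop:

# group by the extracted digit and count occurrences;
# place the display() call inside the refresh loop above for a live view
aggDF = batchDF.groupBy('rand_number').count().orderBy('rand_number')
display(aggDF.toPandas())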
Creating a Kafka Source for Streaming Queries
• Create Streaming dataframe from Kafka

streamRawDf = (spark.readStream.format("kafka")
               .option("kafka.bootstrap.servers", kafka_server)
               .option("subscribe", topic_name)
               .load())

streamDF = streamRawDf.select(
    col('topic'),
    col('offset'),
    col('value').cast('string').substr(12, 1).alias('rand_number'),
)
checkEvenDF = streamDF.withColumn('Is_Even', col('rand_number').cast('int') % 2 == 0)

• Write stream

from random import randint

randNum = str(randint(0, 10000))
q1name = "queryNumber" + randNum
q2name = "queryCheckEven" + randNum

stream_writer1 = (streamDF.writeStream
                  .queryName(q1name)
                  .trigger(processingTime="5 seconds")
                  .outputMode("append")
                  .format("memory"))

stream_writer2 = (checkEvenDF.writeStream
                  .queryName(q2name)
                  .trigger(processingTime="5 seconds")
                  .outputMode("append")
                  .format("memory"))

query1 = stream_writer1.start()
query2 = stream_writer2.start()
• View the streaming results
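The result view is a screenshot on the slide; since both queries write to the memory sink under their query names, a sketch of the live view is to poll those in-memory tables with spark.sql():

from time import sleep
from IPython.display import display, clear_output

# each memory-sink query is registered as an in-memory table named after
# its queryName, so it can be polled with ordinary SQL
for x in range(0, 200):
    try:
        print("Showing live view refreshed every 5 seconds")
        display(spark.sql(f"SELECT * FROM {q1name}").toPandas())
        display(spark.sql(f"SELECT * FROM {q2name}").toPandas())
        sleep(5)
        clear_output(wait=True)
    except KeyboardInterrupt:
        break

query1.stop()
query2.stop()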
Tutorial 7: Kafka and MongoDB on Windows
Tutorial 8
https://towardsdatascience.com/make-a-mock-real-time-stream-of-data-with-python-and-kafka-7e5e23123582
Tutorial 8: streaming from CSV

• sendStream.py
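The file is shown as a screenshot; a minimal sketch of what sendStream.py does (the file name data.csv and topic csv-stream are assumptions, not the article's exact code) is:

# sendStream.py (sketch): replay rows of a CSV file to Kafka,
# sleeping between rows to mimic a real-time stream
import csv
import json
from time import sleep
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)

with open('data.csv') as f:  # assumed file name
    for row in csv.DictReader(f):
        producer.send('csv-stream', value=row)  # assumed topic name
        sleep(1)  # fixed interval; the article derives it from the row timestamps
producer.flush()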
Tutorial 8: streaming from CSV

• processStream.py
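Again a sketch rather than the article's exact code; the group_id is what lets a restarted consumer resume from its committed offsets, which a later slide demonstrates:

# processStream.py (sketch): consume the CSV rows and print them
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'csv-stream',  # assumed topic name, matching the producer sketch
    bootstrap_servers='localhost:9092',
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
    group_id='csv-consumers',       # committed offsets enable resuming
    auto_offset_reset='earliest',
)

for message in consumer:
    print(message.value)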
Tutorial 8: streaming from CSV

Start the consumer


Tutorial 8: streaming from CSV
Start the producer
Tutorial 8: streaming from CSV
If you terminate the consumer and then restart it, streaming resumes from where it stopped
Tutorial 9

https://medium.com/@kevin.michael.horan/distributed-video-streaming-with-python-and-kafka-551de69fe1dd
Tutorial 9: Video streaming using Kafka
• Producer.py
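The slide shows Producer.py as a screenshot; a minimal sketch in the spirit of the linked article (the topic name video-stream is an assumption) reads frames with OpenCV, JPEG-encodes them, and publishes the bytes:

# Producer.py (sketch)
import sys
import cv2
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:9092')

def publish_video(source, topic='video-stream'):
    # source=0 streams from the webcam; a file path streams a video file
    video = cv2.VideoCapture(source)
    while video.isOpened():
        ok, frame = video.read()
        if not ok:
            break
        ok, buffer = cv2.imencode('.jpg', frame)  # one JPEG per Kafka message
        if ok:
            producer.send(topic, buffer.tobytes())
    video.release()

if __name__ == '__main__':
    publish_video(sys.argv[1] if len(sys.argv) > 1 else 0)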
Tutorial 9: Video streaming using Kafka
• consumer.py
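A consumer sketch in the same spirit: the article serves the frames through Flask as an MJPEG stream viewable in a browser (the route and port here are assumptions):

# consumer.py (sketch): view at http://localhost:5000/video
from flask import Flask, Response
from kafka import KafkaConsumer

consumer = KafkaConsumer('video-stream', bootstrap_servers='localhost:9092')
app = Flask(__name__)

def frame_generator():
    # each Kafka message body is one JPEG frame; wrap it in multipart chunks
    for message in consumer:
        yield (b'--frame\r\n'
               b'Content-Type: image/jpeg\r\n\r\n' + message.value + b'\r\n')

@app.route('/video')
def video():
    return Response(frame_generator(),
                    mimetype='multipart/x-mixed-replace; boundary=frame')

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)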
Tutorial 9: Video streaming using Kafka

• Run consumer.py
Tutorial 9: Video streaming using Kafka
• Stream video from webcam
Tutorial 9: Video streaming using Kafka
• Stream a video file named Countdow1.mp4
Tutorial 10

https://towardsdatascience.com/real-time-anomaly-detection-with-apache-kafka-and-python-3a40281c01c9
Tutorial 10: real-time anomaly detection

• Producer.py
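The slide shows a screenshot; a minimal sketch of the producer side (the topic name transactions and the 5% outlier rate are assumptions) emits 2-D points with occasional injected anomalies:

# Producer.py (sketch)
import json
from time import sleep
import numpy as np
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)

while True:
    x = np.random.randn(2)              # normal point
    if np.random.rand() < 0.05:         # occasionally inject an outlier
        x = x + np.random.choice([-1, 1], 2) * 8
    producer.send('transactions', {'data': x.tolist()})
    sleep(0.5)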
Tutorial 10: real-time anomaly detection

• train.py
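A sketch of the training step; the linked article uses scikit-learn's IsolationForest, and the training data and file name here are placeholders:

# train.py (sketch)
import joblib
import numpy as np
from sklearn.ensemble import IsolationForest

X_train = np.random.randn(1000, 2)     # stand-in for collected normal data
model = IsolationForest(contamination=0.05)
model.fit(X_train)
joblib.dump(model, 'isolation_forest.joblib')  # assumed file name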
Tutorial 10: real-time anomaly detection

• detector.py
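A sketch of the detector: consume each point and score it with the trained model (IsolationForest.predict() returns -1 for anomalies, 1 for inliers):

# detector.py (sketch)
import json
import joblib
from kafka import KafkaConsumer

model = joblib.load('isolation_forest.joblib')
consumer = KafkaConsumer(
    'transactions',
    bootstrap_servers='localhost:9092',
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)

for message in consumer:
    point = message.value['data']
    if model.predict([point])[0] == -1:
        print(f"ANOMALY detected: {point}")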
Tutorial 10: real-time anomaly detection
Tutorial 11: Tensorflow-IO and Kafka

https://www.tensorflow.org/io/tutorials/kafka
• Just follow https://www.tensorflow.org/io/tutorials/kafka
Tutorial 12: Spotify Recommendation System

https://www.analyticsvidhya.com/blog/2021/06/spotify-recommendation-system-using-pyspark-and-kafka-streaming/
Tutorial 13: Order book simulation
https://github.com/rongpenl/order-book-simulation
Tutorial 14: Create your own data stream
https://aiven.io/blog/create-your-own-data-stream-for-kafka-with-python-and-faker
Tutorial 15: Bigmart sale prediction
• Dataset: https://www.kaggle.com/datasets/brijbhushannanda1979/bigmart-sales-data
• Use the train set to train a simple prediction model with Spark MLlib
• Stream data from the test set to the Kafka server (remember to set a time interval between messages)
• Create a Spark streaming dataframe from Kafka and apply the trained model to get real-time predictions, as sketched below
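A minimal sketch of the last two steps, assuming a Spark MLlib PipelineModel has already been trained and saved, and that the test rows are streamed to an assumed topic bigmart as JSON with the training column names (only two illustrative fields shown):

from pyspark.ml import PipelineModel
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

model = PipelineModel.load('bigmart_model')  # assumed model path

schema = StructType([
    StructField('Item_Identifier', StringType()),
    StructField('Item_MRP', DoubleType()),
])  # extend to the full Bigmart schema

# parse the Kafka value bytes as JSON and flatten into columns
streamDF = (spark.readStream.format('kafka')
            .option('kafka.bootstrap.servers', 'localhost:9092')
            .option('subscribe', 'bigmart')
            .load()
            .select(from_json(col('value').cast('string'), schema).alias('row'))
            .select('row.*'))

# apply the trained pipeline to the stream and expose live predictions
predictions = model.transform(streamDF)
query = (predictions.writeStream.queryName('bigmart_pred')
         .outputMode('append').format('memory').start())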
