Big Data Unit II
Introduction
Traditional DBMSs store data that is finite and persistent: the data is available
whenever we want it. In big data analytics and data mining, data is often assumed to
arrive in streams; if it is not processed immediately, it is lost. Data streams are
continuous flows of data. Sensor data, network traffic, call center records, satellite
images, and data from electric power grids are some popular examples of data streams.
Data streams possess several unique properties:
● Infinite
● Massive
● Fast changing
● New classes of data may evolve that are difficult to fit into the existing
classes (concept evolution).
● The relation between the input data and the output may change over time
(concept drift).
Apart from these unique characteristics, there are some potential challenges in data
stream mining:
● It is not feasible to manually label all the data points in the stream.
● It is not feasible to store or archive these data streams in a conventional database.
● Concept drift
● Concept evolution
● The speed and huge volume make the data difficult to mine; only single-scan
algorithms are feasible.
● It is difficult to query with SQL-based tools due to the lack of schema and structure.
[Figure: A data-stream-management system. Multiple streams enter a stream query processor with limited working storage; users/applications pose continuous queries and receive results; selected stream data goes to archival storage.]
Sliding Window
The sliding window approach can be used to answer ad-hoc queries. Each sliding
window stores the most recent n elements of the stream, for some n, or all the
elements that arrived within the last t time units, e.g., one day. The length of the
sliding window is specified by its range; the stride specifies the portion of the
window that is omitted when the window moves forward.
There are 2 types of sliding windows (see the sketch after this list):
■ time-based
⚫ Range and stride are specified by time intervals.
⚫ For example, a sliding window with range = 10 mins and
stride = 2 mins produces windows that cover the data in the
last 10 mins; a new window is created every 2 mins.
■ count-based
⚫ Range and stride are specified in terms of the number of
elements.
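
A minimal Python sketch of a count-based sliding window, assuming the stream is a finite iterable; the range and stride values here are chosen just for the demo:

from collections import deque

def count_based_windows(stream, rng=4, stride=2):
    """Yield count-based sliding windows: `rng` is the window range
    (elements per window), `stride` is how many new elements arrive
    before the next window is produced."""
    window = deque(maxlen=rng)   # old elements fall off automatically
    for i, element in enumerate(stream, start=1):
        window.append(element)
        # Emit a window every `stride` elements once the window is full.
        if i >= rng and (i - rng) % stride == 0:
            yield list(window)

for w in count_based_windows([1, 2, 3, 4, 5, 6, 7, 8], rng=4, stride=2):
    print(w)   # [1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7, 8]
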
Sampling Data in a Stream
Suppose a search engine receives a stream of queries as tuples (user, search, time),
and we want to store 1/10th of the stream as a sample. The obvious approach would be
to generate a random number, say an integer from 0 to 9, in response to each search
query, and store the tuple if and only if the random number is 0. Each user then has,
on average, 1/10th of their queries stored, but this per-query sampling distorts
per-user statistics such as the fraction of a user's queries that are repeated.
Solution
A solution for the above scenario is to sample users rather than individual queries.
• Pick 1/10th of the users and take all of their searches into the sample.
• Use a hash function that hashes the user name or user id uniformly into 10
buckets.
• Each time a search query arrives in the stream, we look up the user to see
whether or not they are in the sample. If so, we add the search query to the
sample; if not, we discard it.
• By using a hash function, we can avoid keeping an explicit list of users.
• Hash each user name to one of ten buckets, 0 through 9. If the user hashes to
bucket 0, then accept the search query for the sample; otherwise discard it.
General Solution
• The stream consists of tuples with keys.
• The key is some subset of each tuple's components; e.g., if the tuple is
(user, search, time), the key is the user. The choice of key depends on the application.
• To get a sample of size a/b: hash each tuple's key uniformly into b buckets and
pick the tuple if its hash value is less than a.
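
A minimal Python sketch of this general solution, using md5 purely as a convenient stand-in for a uniform hash of the key:

import hashlib

def in_sample(key, a, b):
    """Keep a tuple iff its key hashes into the first `a` of `b` buckets,
    giving an a/b sample that is consistent for all tuples with that key."""
    bucket = int(hashlib.md5(key.encode()).hexdigest(), 16) % b
    return bucket < a

# Keep the searches of roughly 1/10th of the users (key = user id).
stream = [("alice", "laptops"), ("bob", "flights"), ("alice", "phones")]
sample = [t for t in stream if in_sample(t[0], a=1, b=10)]
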
Filtering Streams
Filtering involves selecting the stream elements that satisfy a particular criterion.
There are different methods for selecting elements. The process is hard when it
requires testing membership in a set.
• Each element of the data stream is a tuple
• Given a list of keys S
• Determine which tuples of the stream are in S
Applications of filtering
• Email spam filtering
◦ We know 1 billion “good” email addresses
◦ If an email comes from one of these, it is NOT spam
• Publish-subscribe systems
◦ You are collecting lots of messages (news articles)
◦ People express interest in certain sets of keywords
◦ Determine whether each message matches a user's interests
Example
Suppose we want to create a Gmail account. Gmail maintains a list of the usernames
of those who already have an account. When we enter our preferred username, we may
get the message "username already exists". Gmail checks the availability of a
username by searching the millions of usernames registered with it. There are
several methods to do the search.
• Linear search: obviously a bad idea, because there may be billions of
accounts.
• Binary search: the usernames must be stored in sorted order, and even then it
may not be practical to search billions of entries on every request.
The solution is the Bloom filter technique.
Bloom Filter Technique
• A Bloom filter is a space-efficient probabilistic data structure that is used to test
whether an element is a member of a set.
• The Bloom filter method uses hashing.
• A hash function takes an input and outputs a fixed-length value; here that value
is used as an index into a bit array, and different inputs may collide on the
same value.
Working of Bloom Filter
• An empty Bloom filter is a bit array of m bits, all set to zero.
• We need k hash functions to calculate the hashes for a given input.
• When we want to add an item to the filter, the bits at the k indices h1(x), h2(x), ..., hk(x)
are set to 1, where the indices are calculated using the hash functions.
• Example: suppose we want to enter "good" into the filter, using 3 hash functions
and a bit array of length 10, all set to 0 initially. First we calculate the hashes,
for instance:
h1("good") % 10 = 1
h2("good") % 10 = 4
h3("good") % 10 = 7
• Again we want to enter "bad"; similarly, we calculate its hashes:
h1("bad") % 10 = 3
h2("bad") % 10 = 5
h3("bad") % 10 = 4
After both insertions, the bits at indices 1, 3, 4, 5 and 7 are set to 1.
Now, to check whether a username is present in the list, we do the reverse process.
• We calculate the respective hashes using h1, h2 and h3 and check whether all of
these indices are set to 1 in the bit array.
• If all the bits are set, then we can say that the username is probably present.
• If any bit at these indices is 0, then the username is definitely not present.
Suppose we now check for "cat", whose hashes happen to be indices 1, 3 and 7.
If we check the bit array, the bits at these indices are set to 1, but we know that
"cat" was never added to the filter: bits 1 and 7 were set when we added "good",
and bit 3 was set when we added "bad".
Because the bits at the calculated indices were already set by other items, the
Bloom filter erroneously claims that "cat" is present, generating a false positive.
• By controlling the size of the Bloom filter (and the number of hash functions),
we can control the probability of false positives.
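
The behaviour described above can be captured in a short Python sketch; deriving the k hash functions by salting SHA-256, and the sizes m = 10 and k = 3, are illustrative assumptions:

import hashlib

class BloomFilter:
    """A minimal Bloom filter: m bits, k salted hash functions."""
    def __init__(self, m=10, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _indices(self, item):
        # Derive k indices by hashing the item with k different salts.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = 1

    def __contains__(self, item):
        # All k bits set -> "probably present"; any bit 0 -> definitely absent.
        return all(self.bits[idx] for idx in self._indices(item))

bf = BloomFilter()
bf.add("good")
bf.add("bad")
print("good" in bf)   # True
print("zzz" in bf)    # False, unless a false positive occurs
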
Generalization
The Bloom Filter
A Bloom filter consists of:
1. An array of n bits, initially all 0’s.
2. A collection of hash functions h1, h2, ..., hk. Each hash function maps
"key" values to n buckets, corresponding to the n bits of the bit array.
3. A set S of m key values.
The purpose of the Bloom filter is to allow through all stream elements whose
keys are in S, while rejecting most of the stream elements whose keys are not
in S.
• Take each key value in S and hash it using each of the k hash functions.
• Set to 1 each bit that is hi(K) for some hash function hi and some key value K in S.
• To test a key K that arrives in the stream, check whether all of h1(K), h2(K), ..., hk(K)
are 1's in the bit array.
• If all are 1's, then let the stream element through. If one or more of these bits are 0,
then K cannot be in S, so reject the stream element.
The hash functions used in Bloom filters should be independent and uniformly
distributed, and as fast as possible.
Applications:
• Medium uses Bloom filters to recommend posts to users by filtering out posts a
user has already seen.
• Quora implemented a shared Bloom filter in the feed backend to filter out stories
that people have seen before.
• The Google Chrome web browser used to use a Bloom filter to identify malicious URLs.
• Google BigTable, Apache HBase, Apache Cassandra and PostgreSQL use Bloom filters
to reduce disk lookups for non-existent rows or columns.
Counting Distinct Elements
Suppose stream elements are chosen from some universal set, and we would like to know
how many different elements have appeared in the stream. The elements might represent
IP addresses of packets passing through a router, unique visitors to a web site,
elements in a large database, motifs in a DNA sequence, or elements of sensor/RFID
networks.
Definition
Given a stream of elements {x1, x2, ..., xs} with repetitions, and an integer m,
let n be the number of distinct elements, namely n = |{x1, x2, ..., xs}|, and let
these elements be {e1, e2, ..., en}.
The obvious solution is to keep a hash table of all the distinct elements seen so far.
Counting distinct elements is very important in many practical applications. For example:
• How many different words are found among the Web pages being crawled at a site?
◦ Unusually low or high numbers could indicate artificial pages (spam?)
• How many different Web pages does each customer request in a week?
• How many distinct products have we sold in the last week?
Example
Suppose Google wants to gather statistics on the unique users it sees each month.
Google does not require a unique login to issue a search query. The only way to
recognize users is to identify the IP addresses from which the queries are issued. In this
case the 4 billion IP addresses serve as the universal set.
Solution
• Keep a list of all elements seen so far in a hash table or a search tree in
main memory.
• When a new query arrives, check whether the IP address from which the query
was issued is in the list.
• If it is not there, add the new IP address; otherwise discard it.
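
A sketch of this exact approach in Python, using a set (a hash table of keys):

def exact_distinct(stream):
    """Exact distinct count: keep every element seen so far in a set.
    Works only while the set fits in main memory."""
    seen = set()
    for ip in stream:
        seen.add(ip)        # duplicates are discarded automatically
    return len(seen)

print(exact_distinct(["1.2.3.4", "5.6.7.8", "1.2.3.4"]))  # 2
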
The above solution works well as long as the number of distinct elements is not too
large. The problem arises when the number of distinct elements is very great or many
streams must be processed at once: the data may not fit in main memory.
The Flajolet-Martin algorithm is an efficient technique to estimate the number of
distinct elements using much less memory.
Flajolet-Martin Algorithm
This algorithm approximates the number of distinct elements in a stream or a database
in one pass.
Suppose the stream consists of n elements, with m of them unique.
• Then the time complexity of the algorithm is O(n).
• The algorithm requires O(log m) memory.
Algorithm
For a given input stream and a hash function h:
• Step 1: Apply the hash function h(x) to each element x in the stream.
• Step 2: Write the binary equivalent of each hash value obtained.
• Step 3: Count the number of trailing zeros (zeros at the end) of each binary
value.
• Step 4: Take the maximum number of trailing zeros observed; let this number be
r.
• Step 5: Estimate the number of distinct elements as R = 2^r.
Example
Consider a data stream of integers, 3, 1, 4, 1, 5, 9, 2, 6, 5. Determine the tail length for
each stream element and the resulting estimate of the number of distinct elements if the
hash function is:
(a) h(x) = 2x + 1 mod 32.
(b) h(x) = 3x + 7 mod 32.
(c) h(x) = 4x mod 32.
(Treat the result of each hash function as a 5-bit binary integer.)
Solution
Since the data stream is small, we can readily count the number of distinct
elements: there are 7 distinct elements (3, 1, 4, 5, 9, 2, 6). Duplicate elements
hash to the same value, so each distinct element is listed once below.
a) Using the hash function h(x) = (2x + 1) mod 32:
h(3) = 7 = 00111 (tail length 0)
h(1) = 3 = 00011 (tail length 0)
h(4) = 9 = 01001 (tail length 0)
h(5) = 11 = 01011 (tail length 0)
h(9) = 19 = 10011 (tail length 0)
h(2) = 5 = 00101 (tail length 0)
h(6) = 13 = 01101 (tail length 0)
Since 2x + 1 is always odd, every binary value ends in 1, so the maximum number of
trailing zeros is r = 0.
Estimated count of distinct elements = 2^r = 2^0 = 1.
b) Using the hash function h(x) = (3x + 7) mod 32:
h(3) = 16 = 10000 (tail length 4)
h(1) = 10 = 01010 (tail length 1)
h(4) = 19 = 10011 (tail length 0)
h(5) = 22 = 10110 (tail length 1)
h(9) = 34 mod 32 = 2 = 00010 (tail length 1)
h(2) = 13 = 01101 (tail length 0)
h(6) = 25 = 11001 (tail length 0)
The maximum number of trailing zeros is r = 4.
Estimated count of distinct elements = 2^r = 2^4 = 16.
c) Using the hash function h(x) = 4x mod 32:
h(3) = 12 = 01100 (tail length 2)
h(1) = 4 = 00100 (tail length 2)
h(4) = 16 = 10000 (tail length 4)
h(5) = 20 = 10100 (tail length 2)
h(9) = 36 mod 32 = 4 = 00100 (tail length 2)
h(2) = 8 = 01000 (tail length 3)
h(6) = 24 = 11000 (tail length 3)
The maximum number of trailing zeros is r = 4.
Estimated count of distinct elements = 2^r = 2^4 = 16.
The estimates 1, 16 and 16 bracket the true count of 7; Flajolet-Martin gives only
an approximation, and in practice many hash functions are used and their estimates
are combined, for example by taking the median of the averages of groups of estimates.
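
These estimates can be checked with a short Python sketch of the algorithm, treating each hash value as a 5-bit integer as the exercise instructs:

def trailing_zeros(value, bits=5):
    """Tail length of `value` treated as a `bits`-bit binary integer."""
    if value == 0:
        return bits          # e.g. 00000 has `bits` trailing zeros
    r = 0
    while value % 2 == 0:
        value //= 2
        r += 1
    return r

def fm_estimate(stream, h):
    """Flajolet-Martin estimate 2^r, where r is the maximum tail length
    of h(x) over all stream elements x."""
    r = max(trailing_zeros(h(x)) for x in stream)
    return 2 ** r

stream = [3, 1, 4, 1, 5, 9, 2, 6, 5]
print(fm_estimate(stream, lambda x: (2 * x + 1) % 32))  # 2^0 = 1
print(fm_estimate(stream, lambda x: (3 * x + 7) % 32))  # 2^4 = 16
print(fm_estimate(stream, lambda x: (4 * x) % 32))      # 2^4 = 16
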
Estimating Moments
Estimating moments involves computing the distribution of frequencies of the
different elements in the stream.
Definition of Moments
Consider a data stream of elements drawn from a universal set. Let mi be the number of
occurrences of the ith element, for any i. Then the kth moment of the stream is the sum
over all i of (mi)^k.
• The 0th moment is the count of distinct elements in the stream.
• The 1st moment is the sum of the mi's, i.e., the length of the stream.
• The 2nd moment is the sum of the squares of the mi's. It is also called the
"surprise number", since it measures how uneven the distribution is.
Example
Suppose we have a stream of length 100, in which eleven different elements
appear. The most even distribution of these eleven elements would have one appearing
10 times and the other ten appearing 9 times each. In this case, the surprise number is
10^2 + 10 × 9^2 = 910. At the other extreme, one of the eleven elements could appear 90
times and the other ten appear 1 time each. Then, the surprise number would be
90^2 + 10 × 1^2 = 8110.
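
The arithmetic above can be verified with a small Python sketch; the stream below is constructed to mirror the example's frequency distribution:

from collections import Counter

def kth_moment(stream, k):
    """kth moment: sum over distinct elements of (frequency)^k."""
    counts = Counter(stream)
    return sum(m ** k for m in counts.values())

# One element appearing 10 times, ten others appearing 9 times each.
stream = ["a"] * 10 + [c for c in "bcdefghijk" for _ in range(9)]
print(kth_moment(stream, 0))   # 11, the distinct-element count
print(kth_moment(stream, 1))   # 100, the stream length
print(kth_moment(stream, 2))   # 910, the surprise number
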
As the above example illustrates, moments of any order can be computed exactly as
long as the counts fit in main memory. When the stream does not fit in memory, we
can estimate the kth moment by keeping a limited number of values and computing an
estimate from them. The following algorithm estimates the second moment.
The Alon-Matias-Szegedy Algorithm for Second Moments
Even if there is not enough storage space to count every element, the second moment
can still be estimated using the AMS algorithm.
Algorithm
Consider a stream of length n. Instead of counting all the elements, compute some
sample variables.
For each variable X:
• Choose a position in the stream between 1 and n, uniformly at random.
• Assign X.element to be the element found at that position.
• Assign X.value = 1 for that element.
• Continue scanning the stream and add 1 to X.value each time another occurrence of
X.element is encountered.
• Derive an estimate of the second moment from X as n(2·X.value − 1).
• Calculate the average of the estimates from all the variables.
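
A minimal Python sketch of the AMS estimator, assuming for simplicity that the whole stream is available as a list so the random positions can be drawn directly (a true streaming implementation would pick positions on the fly):

import random

def ams_second_moment(stream, num_vars=20):
    """Estimate the 2nd moment as the average of n * (2 * X.value - 1)
    over num_vars randomly positioned variables."""
    n = len(stream)
    estimates = []
    for _ in range(num_vars):
        pos = random.randrange(n)      # random position in the stream
        element = stream[pos]          # X.element
        # X.value = occurrences of X.element from this position onward.
        value = sum(1 for x in stream[pos:] if x == element)
        estimates.append(n * (2 * value - 1))
    return sum(estimates) / len(estimates)

stream = list("abcabdacd")
print(ams_second_moment(stream))   # true 2nd moment: 3^2+2^2+2^2+2^2 = 21
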
In general, the kth moment for any k ≥ 2 can be estimated using
n(v^k − (v − 1)^k), where v is the X.value for some variable X in the stream.
Counting Ones in a Window
Given a stream of 0's and 1's, the problem is to estimate the number of 1's among
the last N bits. The DGIM (Datar-Gionis-Indyk-Motwani) algorithm does this
approximately using O(log² N) bits of space, by summarizing the window as buckets
of exponentially growing sizes.
Example
Consider the bit stream,
..1 0 1 1 0 1 1 0 0 0 1 0 1 1 1 0 1 1 0 0 1 0 1 1 0
The aim is to divide the stream into buckets that satisfy the DGIM rules (a sketch
of the update step follows this list):
1. When a new bit arrives, if it is 0, no bucket changes are needed; if it is
1, create a new bucket of size 1 containing it, with the current timestamp.
2. If there are now more than 2 buckets of size 1, combine the earliest two
buckets of size 1.
3. To combine any two adjacent buckets of the same size, replace them by
one bucket of twice the size.
4. The timestamp of the new bucket is the timestamp of the rightmost of
the two buckets.
▪ The update time complexity is O(log N) per arriving bit.
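
A simplified Python sketch of the update step implied by these rules; buckets are (size, timestamp) pairs kept most recent first, and the expiry of buckets that fall outside the window of the last N bits is omitted for brevity:

def dgim_update(buckets, bit, timestamp):
    """Process one arriving bit. `buckets` is a list of (size, timestamp)
    pairs ordered from most recent to oldest; at most two buckets of any
    size are allowed."""
    if bit == 1:
        # Rule 1: a new bucket of size 1 with the current timestamp.
        buckets.insert(0, (1, timestamp))
        # Rules 2-4: while three buckets share a size, merge the two
        # oldest of them into one of twice the size; the merged bucket
        # keeps the more recent (rightmost) of the two timestamps.
        i = 0
        while i + 2 < len(buckets) and buckets[i][0] == buckets[i + 2][0]:
            size, ts = buckets[i + 1]   # the more recent of the two oldest
            buckets[i + 1] = (2 * size, ts)
            del buckets[i + 2]
            i += 1
    return buckets

buckets = []
for t, b in enumerate([1, 0, 1, 1, 0, 1, 1, 0, 1]):
    dgim_update(buckets, b, t)
print(buckets)   # [(1, 8), (1, 6), (2, 5), (2, 2)]
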
Decaying Windows
This approach is used for finding the most popular elements in a stream; it can be
considered an extension of the DGIM algorithm. The aim is to weight the recent
elements more heavily. Typical applications:
• recording the popularity of items sold at Amazon,
• the rate at which different Twitter users tweet.
Let a stream currently consist of the elements a1, a2, ..., at, where a1 is the first
element to arrive and at is the current element. Let c be a small constant, such as
10^−6 or 10^−9.
• Define the exponentially decaying window for this stream to be the sum
Σ_{i=0}^{t−1} a_{t−i} (1 − c)^i
• When a new element a_{t+1} arrives at the stream input, all we need to do is:
1. Multiply the current sum by 1 − c.
2. Add a_{t+1}.
This gives exactly the decayed sum for the stream ending at a_{t+1}, namely
Σ_{i=0}^{t} a_{t+1−i} (1 − c)^i, since multiplying by 1 − c ages every existing
term by one step.
For example, suppose we keep one decaying-window score per tag appearing in a stream
of tags such as "fifa" and "ipl": on each arrival every score is multiplied by 1 − c,
and the arriving tag's score is increased by 1. At the end of the sequence, the score
of "fifa" is 2.135 while that of "ipl" is 3.7264, so "ipl" is trending more than
"fifa". Even though both occurred almost the same number of times in the input, their
scores differ because the more recent occurrences carry greater weight.
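
A small Python sketch of this per-tag scoring; the tag stream and the unrealistically large decay constant c are invented here so the effect is visible on a short input:

def decaying_scores(stream, c=0.1):
    """Maintain one exponentially decaying score per distinct element.
    In practice c is tiny, e.g. 1e-6; a large value is used here so the
    decay is visible on a short stream."""
    scores = {}
    for element in stream:
        # Every existing score decays by (1 - c) on each arrival ...
        for key in scores:
            scores[key] *= (1 - c)
        # ... and the arriving element's score gains 1.
        scores[element] = scores.get(element, 0.0) + 1.0
    return scores

stream = ["fifa", "ipl", "fifa", "ipl", "ipl", "ipl", "fifa"]
print(decaying_scores(stream))  # recent/frequent tags score higher
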
Advantages of the Decaying Window Algorithm:
1. Sudden spikes or spam data are handled gracefully.
2. Recent elements are given more weight, which yields the correct trending
output.