Big Data Unit II
Introduction
Traditional DBMSs store data that is finite and persistent: the data is available
whenever we want it. In big data analytics and data mining, data is often assumed to
arrive in streams; if it is not processed immediately, it is lost. Data streams are
continuous flows of data. Sensor data, network traffic, call center records, satellite
images, and data from electric power grids are some popular examples of data streams.
Data streams possess several unique properties:
● Infinite
● Massive
● Fast changing
● New classes of data may evolve that are difficult to fit into the existing
classes (concept evolution).
● The relation between the input data and the output may change over time
(concept drift).
Apart from these unique characteristics, there are some potential challenges in data
stream mining:
● It is not feasible to manually label all the data points in the stream.
● It is not feasible to store or archive these data streams in a conventional database.
● Concept drift
● Concept evolution
● The speed and huge volume make the data difficult to mine; only single-scan
algorithms are feasible.
● It is difficult to query with SQL-based tools due to the lack of schema and structure.
[Figure: A data-stream-management system. Multiple streams enter a stream query processor with limited working storage; users/applications pose continuous queries and receive results; selected stream data goes to archival storage.]
Sliding Window
The sliding window approach can be used to answer ad-hoc queries. Each sliding
window stores the most recent n elements of the stream, for some n, or all the
elements that arrived within the last t time units, e.g., one day. The length of the
sliding window is specified by its range; the stride specifies the portion of the
window that is omitted when the window moves forward.
There are 2 types of sliding windows (see the sketch after this list):
■ time-based
⚫ Range and stride are specified by time intervals.
⚫ For example, a sliding window with range = 10 mins and
stride = 2 mins produces windows that cover the data in the
last 10 mins; a new window is created every 2 mins.
■ count-based
⚫ Range and stride are specified in terms of the number of
elements.
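
A minimal Python sketch of a count-based sliding window, assuming the stream is a finite iterable; the range and stride values here are chosen just for the demo:

from collections import deque

def count_based_windows(stream, rng=4, stride=2):
    """Yield count-based sliding windows: `rng` is the window range
    (elements per window), `stride` is how many new elements arrive
    before the next window is produced."""
    window = deque(maxlen=rng)   # old elements fall off automatically
    for i, element in enumerate(stream, start=1):
        window.append(element)
        # Emit a window every `stride` elements once the window is full.
        if i >= rng and (i - rng) % stride == 0:
            yield list(window)

for w in count_based_windows([1, 2, 3, 4, 5, 6, 7, 8], rng=4, stride=2):
    print(w)   # [1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7, 8]
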
Sampling Data in a Stream
Suppose a search engine receives a stream of queries as tuples (user, search, time),
and we want to store 1/10th of the stream as a sample. The obvious approach would be
to generate a random number, say an integer from 0 to 9, in response to each search
query, and store the tuple if and only if the random number is 0. Each user then has,
on average, 1/10th of their queries stored, but this per-query sampling distorts
per-user statistics such as the fraction of a user's queries that are repeated.
Solution
A solution for the above scenario is to sample users rather than individual queries.
• Pick 1/10th of the users and take all of their searches into the sample.
• Use a hash function that hashes the user name or user id uniformly into 10
buckets.
• Each time a search query arrives in the stream, we look up the user to see
whether or not they are in the sample. If so, we add the search query to the
sample; if not, we discard it.
• By using a hash function, we can avoid keeping an explicit list of users.
• Hash each user name to one of ten buckets, 0 through 9. If the user hashes to
bucket 0, then accept the search query for the sample; otherwise discard it.
General Solution
• The stream consists of tuples with keys.
• The key is some subset of each tuple's components; e.g., if the tuple is
(user, search, time), the key is the user. The choice of key depends on the application.
• To get a sample of size a/b: hash each tuple's key uniformly into b buckets and
pick the tuple if its hash value is less than a.
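
A minimal Python sketch of this general solution, using md5 purely as a convenient stand-in for a uniform hash of the key:

import hashlib

def in_sample(key, a, b):
    """Keep a tuple iff its key hashes into the first `a` of `b` buckets,
    giving an a/b sample that is consistent for all tuples with that key."""
    bucket = int(hashlib.md5(key.encode()).hexdigest(), 16) % b
    return bucket < a

# Keep the searches of roughly 1/10th of the users (key = user id).
stream = [("alice", "laptops"), ("bob", "flights"), ("alice", "phones")]
sample = [t for t in stream if in_sample(t[0], a=1, b=10)]
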
Filtering Streams
Filtering involves selecting the stream elements that satisfy a particular criterion.
There are different methods for selecting elements. The process is hard when it
requires testing membership in a set.
• Each element of the data stream is a tuple
• Given a list of keys S
• Determine which tuples of the stream are in S
Applications of filtering
• Email spam filtering
◦ We know 1 billion “good” email addresses
◦ If an email comes from one of these, it is NOT spam
• Publish-subscribe systems
◦ You are collecting lots of messages (news articles)
◦ People express interest in certain sets of keywords
◦ Determine whether each message matches a user's interests
Example
Suppose we want to create a Gmail account. Gmail maintains a list of the usernames
of those who already have an account. When we enter our preferred username, we may
get the message "username already exists". Gmail checks the availability of a
username by searching the millions of usernames registered with it. There are
several methods to do the search.
• Linear search: obviously a bad idea, because there may be billions of
accounts.
• Binary search: the usernames must be stored in sorted order, and even then it
may not be practical to search billions of entries on every request.
The solution is the Bloom filter technique.
Bloom Filter Technique
• A Bloom filter is a space-efficient probabilistic data structure that is used to test
whether an element is a member of a set.
• The Bloom filter method uses hashing.
• A hash function takes an input and outputs a fixed-length value; here that value
is used as an index into a bit array, and different inputs may collide on the
same value.
Working of Bloom Filter
• An empty Bloom filter is a bit array of m bits, all set to zero.
• We need k hash functions to calculate the hashes for a given input.
• When we want to add an item to the filter, the bits at the k indices h1(x), h2(x), ..., hk(x)
are set to 1, where the indices are calculated using the hash functions.
• Example: suppose we want to enter "good" into the filter, using 3 hash functions
and a bit array of length 10, all set to 0 initially. First we calculate the hashes,
for instance:
h1("good") % 10 = 1
h2("good") % 10 = 4
h3("good") % 10 = 7
• Again we want to enter "bad"; similarly, we calculate its hashes:
h1("bad") % 10 = 3
h2("bad") % 10 = 5
h3("bad") % 10 = 4
After both insertions, the bits at indices 1, 3, 4, 5 and 7 are set to 1.
Now, to check whether a username is present in the list, we do the reverse process.
• We calculate the respective hashes using h1, h2 and h3 and check whether all of
these indices are set to 1 in the bit array.
• If all the bits are set, then we can say that the username is probably present.
• If any bit at these indices is 0, then the username is definitely not present.
Suppose we now check for "cat", whose hashes happen to be indices 1, 3 and 7.
If we check the bit array, the bits at these indices are set to 1, but we know that
"cat" was never added to the filter: bits 1 and 7 were set when we added "good",
and bit 3 was set when we added "bad".
Because the bits at the calculated indices were already set by other items, the
Bloom filter erroneously claims that "cat" is present, generating a false positive.
• By controlling the size of the Bloom filter (and the number of hash functions),
we can control the probability of false positives.
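
The behaviour described above can be captured in a short Python sketch; deriving the k hash functions by salting SHA-256, and the sizes m = 10 and k = 3, are illustrative assumptions:

import hashlib

class BloomFilter:
    """A minimal Bloom filter: m bits, k salted hash functions."""
    def __init__(self, m=10, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _indices(self, item):
        # Derive k indices by hashing the item with k different salts.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = 1

    def __contains__(self, item):
        # All k bits set -> "probably present"; any bit 0 -> definitely absent.
        return all(self.bits[idx] for idx in self._indices(item))

bf = BloomFilter()
bf.add("good")
bf.add("bad")
print("good" in bf)   # True
print("zzz" in bf)    # False, unless a false positive occurs
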
Generalization
The Bloom Filter
A Bloom filter consists of:
1. An array of n bits, initially all 0’s.
2. A collection of hash functions h1, h2, ..., hk. Each hash function maps
"key" values to n buckets, corresponding to the n bits of the bit array.
3. A set S of m key values.
The purpose of the Bloom filter is to allow through all stream elements whose
keys are in S, while rejecting most of the stream elements whose keys are not
in S.
• Take each key value in S and hash it using each of the k hash functions.
• Set to 1 each bit that is hi(K) for some hash function hi and some key value K in S.
• To test a key K that arrives in the stream, check whether all of h1(K), h2(K), ..., hk(K)
are 1's in the bit array.
• If all are 1's, then let the stream element through. If one or more of these bits are 0,
then K cannot be in S, so reject the stream element.
The hash functions used in Bloom filters should be independent and uniformly
distributed, and as fast as possible.
Applications:
• Medium uses Bloom filters to recommend posts to users by filtering out posts a
user has already seen.
• Quora implemented a shared Bloom filter in the feed backend to filter out stories
that people have seen before.
• The Google Chrome web browser used to use a Bloom filter to identify malicious URLs.
• Google BigTable, Apache HBase, Apache Cassandra and PostgreSQL use Bloom filters
to reduce disk lookups for non-existent rows or columns.
Counting Distinct Elements
Suppose stream elements are chosen from some universal set, and we would like to know
how many different elements have appeared in the stream. The elements might represent
IP addresses of packets passing through a router, unique visitors to a web site,
elements in a large database, motifs in a DNA sequence, or elements of sensor/RFID
networks.
Definition
Given a stream of elements {x1, x2, ..., xs} with repetitions, and an integer m,
let n be the number of distinct elements, namely n = |{x1, x2, ..., xs}|, and let
these elements be {e1, e2, ..., en}.
The obvious solution is to keep a hash table of all the distinct elements seen so far.
Counting distinct elements is very important in many practical applications. For example:
• How many different words are found among the Web pages being crawled at a site?
◦ Unusually low or high numbers could indicate artificial pages (spam?)
• How many different Web pages does each customer request in a week?
• How many distinct products have we sold in the last week?
Example
Suppose Google wants to gather statistics on the unique users it sees each month.
Google does not require a unique login to issue a search query. The only way to
recognize users is to identify the IP addresses from which the queries are issued. In this
case the 4 billion IP addresses serve as the universal set.
Solution
• Keep a list of all elements seen so far in a hash table or a search tree in
main memory.
• When a new query arrives, check whether the IP address from which the query
was issued is in the list.
• If it is not there, add the new IP address; otherwise discard it.
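
A sketch of this exact approach in Python, using a set (a hash table of keys):

def exact_distinct(stream):
    """Exact distinct count: keep every element seen so far in a set.
    Works only while the set fits in main memory."""
    seen = set()
    for ip in stream:
        seen.add(ip)        # duplicates are discarded automatically
    return len(seen)

print(exact_distinct(["1.2.3.4", "5.6.7.8", "1.2.3.4"]))  # 2
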
The above solution works well as long as the number of distinct elements is not too
large. The problem arises when the number of distinct elements is very great or many
streams must be processed at once: the data may not fit in main memory.
The Flajolet-Martin algorithm is an efficient technique to estimate the number of
distinct elements using much less memory.
Flajolet-Martin Algorithm
This algorithm approximates the number of distinct elements in a stream or a database
in one pass.
Suppose the stream consists of n elements, with m of them unique.
• Then the time complexity of the algorithm is O(n).
• The algorithm requires O(log m) memory.
Algorithm
For a given input stream and a hash function h:
• Step 1: Apply the hash function h(x) to each element x in the stream.
• Step 2: Write the binary equivalent of each hash value obtained.
• Step 3: Count the number of trailing zeros (zeros at the end) of each binary
value.
• Step 4: Take the maximum number of trailing zeros observed; let this number be
r.
• Step 5: Estimate the number of distinct elements as R = 2^r.
Example
Consider a data stream of integers, 3, 1, 4, 1, 5, 9, 2, 6, 5. Determine the tail length for
each stream element and the resulting estimate of the number of distinct elements if the
hash function is:
(a) h(x) = 2x + 1 mod 32.
(b) h(x) = 3x + 7 mod 32.
(c) h(x) = 4x mod 32.
(Treat the result of each hash function as a 5-bit binary integer.)
Solution
Since the data stream is small, we can readily count the number of distinct
elements: there are 7 distinct elements (3, 1, 4, 5, 9, 2, 6). Duplicate elements
hash to the same value, so each distinct element is listed once below.
a) Using the hash function h(x) = (2x + 1) mod 32:
h(3) = 7 = 00111 (tail length 0)
h(1) = 3 = 00011 (tail length 0)
h(4) = 9 = 01001 (tail length 0)
h(5) = 11 = 01011 (tail length 0)
h(9) = 19 = 10011 (tail length 0)
h(2) = 5 = 00101 (tail length 0)
h(6) = 13 = 01101 (tail length 0)
Since 2x + 1 is always odd, every binary value ends in 1, so the maximum number of
trailing zeros is r = 0.
Estimated count of distinct elements = 2^r = 2^0 = 1.
b) Using the hash function h(x) = (3x + 7) mod 32:
h(3) = 16 = 10000 (tail length 4)
h(1) = 10 = 01010 (tail length 1)
h(4) = 19 = 10011 (tail length 0)
h(5) = 22 = 10110 (tail length 1)
h(9) = 34 mod 32 = 2 = 00010 (tail length 1)
h(2) = 13 = 01101 (tail length 0)
h(6) = 25 = 11001 (tail length 0)
The maximum number of trailing zeros is r = 4.
Estimated count of distinct elements = 2^r = 2^4 = 16.
c) Using the hash function h(x) = 4x mod 32:
h(3) = 12 = 01100 (tail length 2)
h(1) = 4 = 00100 (tail length 2)
h(4) = 16 = 10000 (tail length 4)
h(5) = 20 = 10100 (tail length 2)
h(9) = 36 mod 32 = 4 = 00100 (tail length 2)
h(2) = 8 = 01000 (tail length 3)
h(6) = 24 = 11000 (tail length 3)
The maximum number of trailing zeros is r = 4.
Estimated count of distinct elements = 2^r = 2^4 = 16.
The estimates 1, 16 and 16 bracket the true count of 7; Flajolet-Martin gives only
an approximation, and in practice many hash functions are used and their estimates
are combined, for example by taking the median of the averages of groups of estimates.
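
These estimates can be checked with a short Python sketch of the algorithm, treating each hash value as a 5-bit integer as the exercise instructs:

def trailing_zeros(value, bits=5):
    """Tail length of `value` treated as a `bits`-bit binary integer."""
    if value == 0:
        return bits          # e.g. 00000 has `bits` trailing zeros
    r = 0
    while value % 2 == 0:
        value //= 2
        r += 1
    return r

def fm_estimate(stream, h):
    """Flajolet-Martin estimate 2^r, where r is the maximum tail length
    of h(x) over all stream elements x."""
    r = max(trailing_zeros(h(x)) for x in stream)
    return 2 ** r

stream = [3, 1, 4, 1, 5, 9, 2, 6, 5]
print(fm_estimate(stream, lambda x: (2 * x + 1) % 32))  # 2^0 = 1
print(fm_estimate(stream, lambda x: (3 * x + 7) % 32))  # 2^4 = 16
print(fm_estimate(stream, lambda x: (4 * x) % 32))      # 2^4 = 16
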
Estimating Moments
Estimating moments involves computing the distribution of frequencies of the
different elements in the stream.
Definition of Moments
Consider a data stream of elements drawn from a universal set. Let mi be the number of
occurrences of the ith element, for any i. Then the kth moment of the stream is the sum
over all i of (mi)^k.
• The 0th moment is the count of distinct elements in the stream.
• The 1st moment is the sum of the mi's, i.e., the length of the stream.
• The 2nd moment is the sum of the squares of the mi's. It is also called the
"surprise number", since it measures how uneven the distribution is.
Example
Suppose we have a stream of length 100, in which eleven different elements
appear. The most even distribution of these eleven elements would have one appearing
10 times and the other ten appearing 9 times each. In this case, the surprise number is
10^2 + 10 × 9^2 = 910. At the other extreme, one of the eleven elements could appear 90
times and the other ten appear 1 time each. Then, the surprise number would be
90^2 + 10 × 1^2 = 8110.
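
The arithmetic above can be verified with a small Python sketch; the stream below is constructed to mirror the example's frequency distribution:

from collections import Counter

def kth_moment(stream, k):
    """kth moment: sum over distinct elements of (frequency)^k."""
    counts = Counter(stream)
    return sum(m ** k for m in counts.values())

# One element appearing 10 times, ten others appearing 9 times each.
stream = ["a"] * 10 + [c for c in "bcdefghijk" for _ in range(9)]
print(kth_moment(stream, 0))   # 11, the distinct-element count
print(kth_moment(stream, 1))   # 100, the stream length
print(kth_moment(stream, 2))   # 910, the surprise number
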
As the above example illustrates, moments of any order can be computed exactly as
long as the counts fit in main memory. When the stream does not fit in memory, we
can estimate the kth moment by keeping a limited number of values and computing an
estimate from them. The following algorithm estimates the second moment.
The Alon-Matias-Szegedy Algorithm for Second Moments
Even if there is not enough storage space to count every element, the second moment
can still be estimated using the AMS algorithm.
Algorithm
Consider a stream of length n. Instead of counting all the elements, compute some
sample variables.
For each variable X:
• Choose a position in the stream between 1 and n, uniformly at random.
• Assign X.element to be the element found at that position.
• Assign X.value = 1 for that element.
• Continue scanning the stream and add 1 to X.value each time another occurrence of
X.element is encountered.
• Derive an estimate of the second moment from X as n(2·X.value − 1).
• Calculate the average of the estimates from all the variables.
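
A minimal Python sketch of the AMS estimator, assuming for simplicity that the whole stream is available as a list so the random positions can be drawn directly (a true streaming implementation would pick positions on the fly):

import random

def ams_second_moment(stream, num_vars=20):
    """Estimate the 2nd moment as the average of n * (2 * X.value - 1)
    over num_vars randomly positioned variables."""
    n = len(stream)
    estimates = []
    for _ in range(num_vars):
        pos = random.randrange(n)      # random position in the stream
        element = stream[pos]          # X.element
        # X.value = occurrences of X.element from this position onward.
        value = sum(1 for x in stream[pos:] if x == element)
        estimates.append(n * (2 * value - 1))
    return sum(estimates) / len(estimates)

stream = list("abcabdacd")
print(ams_second_moment(stream))   # true 2nd moment: 3^2+2^2+2^2+2^2 = 21
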
In general, the kth moment for any k ≥ 2 can be estimated using
n(v^k − (v − 1)^k), where v is the X.value for some variable X in the stream.
Counting Ones in a Window
Given a stream of 0's and 1's, the problem is to estimate the number of 1's among
the last N bits. The DGIM (Datar-Gionis-Indyk-Motwani) algorithm does this
approximately using O(log² N) bits of space, by summarizing the window as buckets
of exponentially growing sizes.
Example
Consider the bit stream,
..1 0 1 1 0 1 1 0 0 0 1 0 1 1 1 0 1 1 0 0 1 0 1 1 0
The aim is to divide the stream into buckets that satisfy the DGIM rules (a sketch
of the update step follows this list):
1. When a new bit arrives, if it is 0, no bucket changes are needed; if it is
1, create a new bucket of size 1 containing it, with the current timestamp.
2. If there are now more than 2 buckets of size 1, combine the earliest two
buckets of size 1.
3. To combine any two adjacent buckets of the same size, replace them by
one bucket of twice the size.
4. The timestamp of the new bucket is the timestamp of the rightmost of
the two buckets.
▪ The update time complexity is O(log N) per arriving bit.
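
A simplified Python sketch of the update step implied by these rules; buckets are (size, timestamp) pairs kept most recent first, and the expiry of buckets that fall outside the window of the last N bits is omitted for brevity:

def dgim_update(buckets, bit, timestamp):
    """Process one arriving bit. `buckets` is a list of (size, timestamp)
    pairs ordered from most recent to oldest; at most two buckets of any
    size are allowed."""
    if bit == 1:
        # Rule 1: a new bucket of size 1 with the current timestamp.
        buckets.insert(0, (1, timestamp))
        # Rules 2-4: while three buckets share a size, merge the two
        # oldest of them into one of twice the size; the merged bucket
        # keeps the more recent (rightmost) of the two timestamps.
        i = 0
        while i + 2 < len(buckets) and buckets[i][0] == buckets[i + 2][0]:
            size, ts = buckets[i + 1]   # the more recent of the two oldest
            buckets[i + 1] = (2 * size, ts)
            del buckets[i + 2]
            i += 1
    return buckets

buckets = []
for t, b in enumerate([1, 0, 1, 1, 0, 1, 1, 0, 1]):
    dgim_update(buckets, b, t)
print(buckets)   # [(1, 8), (1, 6), (2, 5), (2, 2)]
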
Decaying Windows
This approach is used for finding the most popular elements in a stream; it can be
considered an extension of the DGIM algorithm. The aim is to weight the recent
elements more heavily. Typical applications:
• recording the popularity of items sold at Amazon,
• the rate at which different Twitter users tweet.
Let a stream currently consist of the elements a1, a2, ..., at, where a1 is the first
element to arrive and at is the current element. Let c be a small constant, such as
10^−6 or 10^−9.
• Define the exponentially decaying window for this stream to be the sum
Σ_{i=0}^{t−1} a_{t−i} (1 − c)^i
• When a new element a_{t+1} arrives at the stream input, all we need to do is:
1. Multiply the current sum by 1 − c.
2. Add a_{t+1}.
This gives exactly the decayed sum for the stream ending at a_{t+1}, namely
Σ_{i=0}^{t} a_{t+1−i} (1 − c)^i, since multiplying by 1 − c ages every existing
term by one step.
For example, suppose we keep one decaying-window score per tag appearing in a stream
of tags such as "fifa" and "ipl": on each arrival every score is multiplied by 1 − c,
and the arriving tag's score is increased by 1. At the end of the sequence, the score
of "fifa" is 2.135 while that of "ipl" is 3.7264, so "ipl" is trending more than
"fifa". Even though both occurred almost the same number of times in the input, their
scores differ because the more recent occurrences carry greater weight.
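
A small Python sketch of this per-tag scoring; the tag stream and the unrealistically large decay constant c are invented here so the effect is visible on a short input:

def decaying_scores(stream, c=0.1):
    """Maintain one exponentially decaying score per distinct element.
    In practice c is tiny, e.g. 1e-6; a large value is used here so the
    decay is visible on a short stream."""
    scores = {}
    for element in stream:
        # Every existing score decays by (1 - c) on each arrival ...
        for key in scores:
            scores[key] *= (1 - c)
        # ... and the arriving element's score gains 1.
        scores[element] = scores.get(element, 0.0) + 1.0
    return scores

stream = ["fifa", "ipl", "fifa", "ipl", "ipl", "ipl", "fifa"]
print(decaying_scores(stream))  # recent/frequent tags score higher
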
Advantages of the Decaying Window Algorithm:
1. Sudden spikes or spam data are handled gracefully.
2. Recent elements are given more weight, which yields the correct trending
output.