EDA Unit 3 Notes
Uploaded by sivashankarsridevi
Syllabus: Introduction to Single Variable: Distributions and Variables - Numerical Summaries of Level and Spread - Scaling and Standardizing - Inequality - Smoothing Time Series.

Unit 3: Univariate Analysis

Contents
3.1 Introduction to Single Variable
3.2 Storing and Importing Data using Python
3.3 Numerical Summaries of Level and Spread
3.4 Scaling and Standardizing
3.5 Time Series and Smoothing Time Series
Two Marks Questions with Answers

3.1 Introduction to Single Variable

- Exploratory data analysis is cross-classified in two ways: each method is either graphical or non-graphical, and each method is either univariate, bivariate or multivariate.
- Univariate analysis is the simplest analysis of statistical data. The term univariate analysis refers to the analysis of one variable; the prefix "uni" means "one". The purpose of univariate analysis is to understand the distribution of values for a single variable. Univariate analysis explores each variable in a data set separately.
- In other words, in univariate analysis the data has only one variable. It does not deal with causes or relationships; its major purpose is to describe the data: it takes data, summarizes that data and finds patterns in the data.
- Some ways one can describe patterns found in univariate data include central tendency (mean, mode and median) and dispersion: range, variance, maximum, minimum, quartiles (including the interquartile range) and standard deviation.
- Univariate analysis works by examining the effects of a single variable on a set of data. For example, a frequency distribution table is a form of univariate analysis, as frequency is the only variable being measured. Alternative variables may be age, height, weight, etc. However, it is important to note that as soon as a secondary variable is introduced it becomes bivariate analysis.
With three or more variables, it becomes multivariate analysis.

3.1.1 Univariate Statistics

- Univariate analysis can be performed in a statistical setting. Two types of statistics can be used for analysis, namely descriptive and inferential.

Descriptive statistics
- As the name suggests, descriptive statistics are used to describe data. The statistics used here are commonly referred to as summary statistics.
- Descriptive statistics can be used for calculating things like missing-value proportions, upper and lower limits for outliers, the level of variance through the coefficient of variance, etc.

Inferential statistics
- Often, the data one is dealing with is a subset (sample) of the complete data (population). Thus, the common question here is: can the findings of the sample be extrapolated to the population? That is, is the sample representative of the population, or has the population changed? Such questions are answered using specific hypothesis tests designed to deal with such univariate data-based problems.

TECHNICAL PUBLICATIONS® - an up-thrust for knowledge

- Hypothesis tests help to answer crucial questions about the data and their relation with the population from which they were drawn. Several univariate testing mechanisms come in handy here, such as:
1. Z-Test - Used for numerical (quantitative) data where the sample size is large (at least 30) and the population's standard deviation is known.
2. One-Sample t-Test - Used for numerical (quantitative) data where the sample size is less than 30 or the population's standard deviation is unknown.
3. Chi-Square Test - Used with ordinal categorical data.
4. Kolmogorov-Smirnov Test - Used with nominal categorical data.

There are four common methods for performing univariate analysis:
1. Summary statistics
2. Frequency distributions
3. Charts
4. Univariate tables
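As an illustration of the first test above, the one-sample Z statistic can be computed directly from its definition. This is a minimal sketch; the function name and the sample numbers are made up for the example.

```python
import math

def z_statistic(sample_mean, mu0, sigma, n):
    """One-sample Z-test statistic: how many standard errors the sample
    mean lies from the hypothesized population mean mu0, where sigma is
    the known population standard deviation and n the sample size."""
    return (sample_mean - mu0) / (sigma / math.sqrt(n))

# Hypothetical example: a sample of n = 36 with mean 105, tested against
# mu0 = 100 with known population standard deviation 15.
z = z_statistic(105, 100, 15, 36)
print(z)  # 2.0 - the sample mean is 2 standard errors above mu0
```

Comparing |z| with a critical value (e.g. 1.96 at the 5 % significance level) then decides whether the null hypothesis is rejected.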
1. Summary statistics
The most common way to perform univariate analysis is to use summary statistics to describe a variable. There are three kinds of summary statistics:
1. Measures of central tendency - These values describe where the dataset's center or middle value is located. The mean, mode and median are examples.
2. Dispersion measures - These numbers describe how evenly distributed the values are in the dataset. The range, standard deviation and variance are some examples.
3. Measures of shape - The shape of the data distribution can explain a great deal about the data, as the shape can help in identifying the type of distribution followed by the data. Each of these distributions has specific properties that can be used to one's advantage. By analyzing the shape, one will know if the data is symmetrical or non-symmetrical, left- or right-skewed, or suffering from positive or negative kurtosis, among other things.

2. Frequency distributions
A frequency distribution describes how frequently different values occur in a dataset. This acts as another way to perform univariate analysis.

3. Charts
Another method for performing univariate analysis is to create charts that show the distribution of values for a specific variable. Various types of graphs can be used to understand data. The standard types of graphs include:
1. Histogram: A histogram displays the frequency of each value or group of values (bins) in numerical data. This helps in understanding how the values are distributed.
2. Boxplot: A boxplot provides several important pieces of information, such as the minimum, maximum, median, and 1st and 3rd quartiles. It is beneficial in identifying outliers in the data.
3. Density curve: The density curve helps in understanding the shape of the data's distribution.
It helps answer questions such as whether the data is bimodal, normally distributed, skewed, etc.
4. Bar chart: Bar charts, mainly frequency bar charts, are univariate charts used to find the frequency of the different categories of categorical data.
5. Pie chart: Frequency pie charts convey similar information to bar charts. The difference is that they have a circular formation, with each slice indicating the share of each category in the data.

4. Univariate tables
Tables help in univariate analysis and are typically used with categorical data or numerical data with limited cardinality. Different types of tables include:
1. Frequency tables: Each unique value and its respective frequency in the data is shown through a table. Thus, it summarizes frequencies the way a histogram, frequency bar chart or pie chart does, but in a tabular manner.
2. Grouped tables: Rather than finding the count of each unique value, the values are binned or grouped and the frequency of each group is reflected in the table. This is typically used for numerical data with high cardinality.
3. Percentage (proportion) tables: Rather than showing the frequency of the unique values (or groups), such a table shows their proportion in the data (in percentage).
4. Cumulative proportion tables: Similar to the proportion table, with the difference that the proportion is shown cumulatively. This is typically used with binned data having a distinct order (or with categorical ordinal data).

3.1.2 Variable and Distribution in Univariate Analysis

- A variable in univariate analysis is a condition or subset that data falls into. A variable can be thought of as a "category". For example, the analysis might work on a variable "height" or it might work on "weight".
- Univariate analysis can be carried out on any of the individual variables in the dataset to gain a better understanding of its distribution of values.

Univariate data examples
- The salaries of employees in a specific industry; the variable in this example is the employees' salaries.
- The heights of ten students in a class are measured; the variable here is the students' heights.
- A veterinarian wants to weigh 20 cats; the variable, in this case, is the weight of the cats.
- Finding the average height of a country's men from a sample.
- Calculating how reliable a batsman is by calculating the variance of their runs.
- Finding which country most frequently wins the Olympic gold medal by creating a frequency bar chart or frequency table.
- Understanding the income distribution of a country by analyzing the distribution's shape. A right-skewed distribution can indicate an unequal society.
- Checking if the price of sugar has risen in a statistically significant way from the generally accepted price by using sample survey data. Hypothesis tests such as the Z- or t-test solve such questions.
- Assessing the predictive capability of a variable by calculating the coefficient of variance.

Distribution and variables
Types of variables: Variables can be one of two types: categorical or numerical.

Categorical data
- Categorical data classify items into groups. This type of data can be further broken down into nominal, ordinal and binary values.
- Ordinal values have a set order. An example here could be a ranking of low to high.
- Nominal values have no set order. Examples include a superhero's gender and alignment.
- Binary data has only two values. This could be represented as true/false or 1/0.
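A minimal sketch of summarizing one such categorical variable with pandas; the alignment values below are invented for the example.

```python
import pandas as pd

# Hypothetical nominal variable: alignment of a few superheroes
alignment = pd.Series(['good', 'bad', 'good', 'neutral', 'good', 'bad'])

freq = alignment.value_counts()                # frequency table
prop = alignment.value_counts(normalize=True)  # proportion table
print(freq['good'])  # 3 - 'good' occurs three times
print(prop['bad'])   # 2/6, the share of 'bad' in the data
```

`value_counts` with `normalize=True` gives the proportion table described below; summing proportions cumulatively yields the cumulative proportion table.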
- A common way to summarize categorical variables is with a frequency table.
- Columns holding categorical data: Gender, Married, BankCustomer, Industry, Ethnicity, PriorDefault, Employed, DrivingLicense, Citizen, Approved.

Numerical data
- Numerical data are values that one can perform mathematical operations on. They are further broken down into continuous and discrete data types.
- Discrete variables have to be integers. An example is the number of superheroes.
- Continuous variables can take any value. Examples here include height and weight.
- Numerical data can be visualized with a histogram. Histograms are a great first analysis of continuous data. Four main aspects to consider here are shape, center, spread and outliers:
- Shape is the overall appearance of the histogram. It can be symmetric, skewed, uniform or have multiple peaks.
- Center refers to the mean or median.
- Spread refers to the range, or how far the data reaches.
- Outliers are data points that fall far from the bulk of the data.
- Columns holding numerical and continuous data: Age, Debt, YearsEmployed, CreditScore, Income.

3.2 Storing and Importing Data using Python

- There are various methods to import data in Python. One way to import data is using the Pandas library.
- In its simplest form, data can be stored in a CSV file. CSV stands for "Comma Separated Values". It is the simplest form of storing data in tabular form as plain text.
- The CSV file structure is very simple: the first line of a CSV file is the header and contains the names of the fields/features separated by commas. After the header, each line of the file is a data set value/observation/record. The values of a record are separated by commas.

Steps to import using Pandas
1. Get the correct and full file path
- Firstly, capture the full path where the CSV file is stored.
- For example, suppose the CSV file myteam.csv is stored under a folder on the D: drive (the exact path in the source is illegible).
- File name - It should be made sure that the file name specified in the code matches the actual file name.
- File extension - The file extension should always be '.csv' when importing CSV files.

Example program - 1

# Python code to import the table
import pandas as pd
# Read the CSV file (put 'r' before the path string to address any special
# characters in the path, such as '\')
df = pd.read_csv(r'D:\data\myteam.csv')
print(df)

Example program - 1 Output (some cells are illegible in the source)

Name     City        Plays
Lucky    Pune        Cricket
Aniket   Mumbai      Cyclist
Lahu     Kolhapur    Tennis
Shital   Amarawati   Badminton
...      Nashik      Cricket
Monit    Ratnagiri   Tennis
Jaya     ...         Badminton
Dev      ...         Cyclist
Anita    ...         Badminton
Rashmi   Dhule       ...

3.3 Numerical Summaries of Level and Spread

- A numerical summary is a number used to describe a specific characteristic about a set of data. For a quantitative variable, the characteristics of the distribution that are of primary interest are the level of the distribution, the amount of dispersion in the distribution and the shape of the distribution.

Below are some of the useful numerical summaries:
- Center: Mean, median, mode, five-number summary
- Quantiles: Percentiles
- Spread: Standard deviation, variance, interquartile range
- Outliers
- Shape: Skewness, kurtosis
- Concordance: Correlation, quantile-quantile plots

Mean
- This is the point of balance, describing the most typical value for normally distributed data. Because every value enters the calculation, the mean is highly influenced by outliers.
- The mean adds up all the data values and divides by the total number of values, as follows:

x̄ = (1/n) Σ_{i=1..n} x_i

- The 'x-bar' is used to represent the sample mean (the mean of a sample of data). Σ (sigma) implies the addition of all values from i = 1 until i = n ('n' is the number of data values).
The result is then divided by 'n'.

Median
- This is the "middle data point", where half of the data is below the median and half is above the median. It is the 50th percentile of the data. It is also mostly used with skewed data, because outliers won't have a big effect on the median.
- There are two formulas to compute the median. The choice of which formula to use depends on whether n (the number of data points in the sample, or sample size) is even or odd.
- When n is even, there is no "middle" data point, so the middle two values are averaged:

Median = (x_(n/2) + x_(n/2 + 1)) / 2

- When n is odd, the middle data point is the median:

Median = x_((n+1)/2)

Mode
- The mode returns the most commonly occurring data value.

Percentile
- A percentile gives the percent of data that is equal to or less than a given data point. It is useful for describing where a data point stands within the data set. If the percentile is close to zero, then the observation is one of the smallest in the data set. If the percentile is close to 100, then the data point is one of the largest in the data set.

Quartiles (five-number summary)
- Quartiles measure the center and are also great for describing the spread of the data; they are highly useful for skewed data. There are four quartiles and they compose the five-number summary (combined with the minimum). The five-number summary is composed of:
1. Minimum
2. 25th percentile (lower quartile)
3. 50th percentile (median)
4. 75th percentile (upper quartile)
5. 100th percentile (maximum)

Standard deviation
- Standard deviation is extensively used in statistics and data science. It measures the amount of variation or dispersion of a data set, calculating how spread out the data are from the mean. Small values mean the data is consistent and close to the mean. Larger values indicate the data is highly variable.
- Deviation: The idea is to use the mean as a reference point from which everything varies. A deviation is defined as the distance an observation lies from the reference point.
This distance is obtained by subtracting the mean (x̄) from the data point (x_i).
- Calculating the standard deviation: The average of all the deviations will always turn out to be zero, so one squares each deviation and sums up the results, then divides by 'n − 1' (called the degrees of freedom). Further, the square root of the final result is taken to undo the squaring of the deviations:

s = sqrt( Σ_{i=1..n} (x_i − x̄)² / (n − 1) )

- The standard deviation is a representation of all deviations in the data. It is never negative, and it is zero only if all the values are the same.

Variance
- Variance is almost the same calculation as the standard deviation, but it stays in squared units. So, taking the square root of the variance gives the standard deviation. Note that the variance is represented by 's²', while the standard deviation is represented by 's':

s² = Σ_{i=1..n} (x_i − x̄)² / (n − 1)

Range
- The difference between the maximum and minimum values. Useful for some basic exploratory analysis, but not as powerful as the standard deviation:

Range = x_max − x_min

Proportion
- Often referred to as a "percentage", it defines the percent of observations in the data set that satisfy some requirement:

p = (number of observations satisfying the requirement) / n

Correlation
- Defines the strength and direction of the association between two quantitative variables. It ranges between −1 and 1. Positive correlations mean that one variable increases as the other variable increases. Negative correlations mean that one variable decreases as the other increases. When the correlation is zero, there is no correlation at all. The closer the result is to one of the extremes, the stronger the association between the two variables.
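The correlation just described can be computed with NumPy's corrcoef; the two small arrays below are invented for the sketch.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x    # increases with x -> perfect positive association
z = -2 * x   # decreases as x increases -> perfect negative association

r_xy = np.corrcoef(x, y)[0, 1]
r_xz = np.corrcoef(x, z)[0, 1]
print(r_xy, r_xz)  # approximately 1.0 and -1.0
```

For real data the values fall strictly between the extremes; a value near 0 indicates a weak linear association.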
Example program - 2

import pandas as pd
import numpy as np

df = pd.DataFrame([('Indian Cinema', 'Restaurant', 289.0),
                   ('RamKrishna', 'Restaurant', 224.0),
                   ('Zingo', 'Juice bar', 80.5),
                   ('The Place', 'Play Club', np.nan)],
                  columns=('name', 'type', 'AvgBill'))
print(df)
print('AvgBill Mean = ', df['AvgBill'].mean())
print('AvgBill Median = ', df['AvgBill'].median())

Example program - 2 Output

            name        type  AvgBill
0  Indian Cinema  Restaurant    289.0
1     RamKrishna  Restaurant    224.0
2          Zingo   Juice bar     80.5
3      The Place   Play Club      NaN
AvgBill Mean =  197.83333333333334
AvgBill Median =  224.0

Example program - 3

import pandas as pd

# Create pandas DataFrame
data = pd.DataFrame({'col1': [5, 2, 7, 3, 4, 4, 2, 3, 2, 1, 2, 5],
                     'col2': ['y', 'x', 'x', 'z', 'x', 'y', 'y', 'z', 'x', 'z', 'x', 'x'],
                     'group': ['A', 'C', 'B', 'B', 'A', 'C', 'A', 'A', 'C', 'B', 'B', 'A']})
print(data)
print('Col1 Mode =', data['col1'].mode())

Example program - 3 Output

    col1 col2 group
0      5    y     A
1      2    x     C
2      7    x     B
3      3    z     B
4      4    x     A
5      4    y     C
6      2    y     A
7      3    z     A
8      2    x     C
9      1    z     B
10     2    x     B
11     5    x     A
Col1 Mode = 0    2
dtype: int64

Example program - 4

# Calculate a 5-number summary
from numpy import percentile
from numpy.random import rand

# generate data sample
data = rand(1000)
# calculate quartiles
quartiles = percentile(data, [25, 50, 75])
# calculate min/max
data_min, data_max = data.min(), data.max()
# print 5-number summary
print('Min: %.3f' % data_min)
print('Q1: %.3f' % quartiles[0])
print('Median: %.3f' % quartiles[1])
print('Q3: %.3f' % quartiles[2])
print('Max: %.3f' % data_max)

Example program - 4 Output

Min: 0.001
Q1: 0.269
Median: 0.509
Q3: 0.762
Max: 0.999

3.4 Scaling and Standardizing

- Feature scaling (also known as data normalization) is the method used to standardize the range of features of data.
Since the range of values of data may vary widely, scaling becomes a necessary step in data preprocessing while exploring and visualizing data.
- Scaling of data may be useful and/or necessary under certain circumstances (e.g. when variables span different ranges). There are several different versions of scaling, the most important of which are listed below. Scaling procedures may be applied to the full data matrix or to parts of the matrix only (e.g. column-wise).

Range scaling
- Range scaling maps the values to another range, which usually includes both a shift and a change of scale (magnification or reduction).
- In range scaling (also called min-max scaling), one transforms the data such that the features are within a specific range, e.g. [0, 1].
- Scaling is important in algorithms such as Support Vector Machines (SVM) and k-nearest neighbors (KNN), where the distance between the data points is important. For example, in a dataset containing prices of products, without scaling SVM might treat 1 INR as equivalent to 1 Euro, though 1 Euro = 90 INR.
- The data samples are transformed according to the following equation:
y = R_min + (x − D_min) · (R_max − R_min) / (D_max − D_min)

where [D_min, D_max] is the range of the original data and [R_min, R_max] is the target range (Fig. 3.4.1 Range scaling).

Example program - 5

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import minmax_scale

# set seed for reproducibility
np.random.seed(0)
# generate random data points from an exponential distribution
x = np.random.exponential(size=1000)
# min-max scaling
scaled_data = minmax_scale(x)
# equivalent manual computation:
# scaled_data = (x - x.min()) / (x.max() - x.min())
# plot both together to compare
f, ax = plt.subplots(1, 2)
sns.distplot(x, ax=ax[0], color='y')
ax[0].set_title("Original data")
sns.distplot(scaled_data, ax=ax[1], color='r')
ax[1].set_title("Scaled data")
plt.show()

Example program - 5 Output: Fig. 3.4.2 Original and scaled data (histograms of the original and the min-max scaled distribution; the shape is unchanged, only the value range becomes [0, 1]).

Mean centering
- Subtracting the mean of the data is often called "mean centering". It results in a shift of the data towards the mean. The mean of the transformed data thereafter equals zero:

y = x − μ

Standardization and normalization
- Standardization (sometimes also called autoscaling, z-transformation or z-score normalization) is the scaling procedure which results in a zero mean and unit variance of any descriptor variable. For every data value, the mean μ has to be subtracted and the result has to be divided by the standard deviation σ (note that the order of these two operations must not be reversed):

y = (x − μ) / σ

Estimating the trend, T_t
- For a monthly series (seasonal period 12), the trend at time t can be estimated with a centered moving average:

T̂_t = ( ½·Y_{t−6} + Y_{t−5} + … + Y_{t+5} + ½·Y_{t+6} ) / 12

- By using the seasonal frequency for the coefficients in the moving average, the procedure generalizes for any seasonal frequency (i.e. quarterly, weekly, etc.
series), provided the condition that the coefficients sum up to unity is still met.

Estimating the seasonal component, S_t
- An estimate of S_t at time t can be obtained by subtracting the trend estimate:

Ŝ_t = Y_t − T̂_t

- By averaging these estimates of the monthly effects for each month (January, February, etc.), one obtains a single estimate of the effect for each month. That is, if the seasonality period is d, the seasonal factor for a given period is the average of the estimates Ŝ_t, Ŝ_{t+d}, Ŝ_{t+2d}, … for that period across the years.
- Seasonal factors can be thought of as expected variations from trend throughout a seasonal period, so one would expect them to cancel each other out over that period - i.e. they should add up to zero:

Σ_{t=1..d} S_t = 0

- It should be noted that this applies to the additive decomposition.

Adjusting the seasonal component
- If the estimated (average) seasonal factors S_t do not add up to zero, one can correct them by dividing the sum of the seasonal estimates by the seasonality period and adjusting each seasonal factor. For example, if the seasonal period is d, then:
1. Calculate the total sum: Σ_{t=1..d} S_t
2. Calculate the value w = ( Σ_{t=1..d} S_t ) / d
3. Adjust each period: S_t ← S_t − w
- Now the seasonal components add up to zero: Σ_{t=1..d} S_t = 0

Seasonally adjusted series
- It is common to present economic indicators, such as unemployment percentages, as seasonally adjusted series. This highlights any trend that might otherwise be masked by seasonal variation (for example, the rise in unemployment towards the end of the academic year, when school and university graduates are seeking work). If the seasonal effect is additive, a seasonally adjusted series is given by Y_t − S_t.
- The described moving-average procedure usually describes the time series in question quite successfully; however, it does not allow one to forecast it.
- To decide upon the mathematical form of a trend, one must first draw the plot of the time series.
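The trend and seasonal-factor steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a full decomposition routine: the function names are invented, the series is synthetic, and the seasonal period d = 4 (quarterly) is chosen just for the example.

```python
import numpy as np

def centered_ma(y, d):
    """Centered moving average of even period d: the two end weights are
    halved, so d + 1 points are used (as in the 2x12 average for monthly
    data) and the weights sum to unity."""
    w = np.concatenate(([0.5], np.ones(d - 1), [0.5])) / d
    h = d // 2
    t = np.full(len(y), np.nan)
    for i in range(h, len(y) - h):
        t[i] = np.dot(w, y[i - h:i + h + 1])
    return t

def seasonal_factors(y, trend, d):
    """Average the detrended values per season, then subtract w so the
    factors sum to zero (steps 1-3 above)."""
    resid = y - trend
    s = np.array([np.nanmean(resid[k::d]) for k in range(d)])
    return s - s.mean()

# Synthetic quarterly series: linear trend plus a known seasonal pattern
d = 4
pattern = np.array([1.0, -1.0, 2.0, -2.0])
y = np.arange(24, dtype=float) + np.tile(pattern, 6)
trend = centered_ma(y, d)
sf = seasonal_factors(y, trend, d)
print(sf)  # recovers a pattern close to [1, -1, 2, -2], summing to zero
```

Because the weights are symmetric and sum to one, the moving average reproduces a linear trend exactly and averages the seasonal pattern out, so the recovered factors match the pattern built into the series.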
- If the behavior of the series is rather 'regular', one can choose a parametric trend - usually a low-order polynomial in t, exponential, inverse or similar functions.
- In any case, the smoothing method is acceptable if the residuals ε̂_t = Y_t − T̂_t − Ŝ_t constitute a stationary process.
- If there are a few competing trend specifications, the best one can be chosen by AIC, BIC or similar criteria.
- An alternative approach is to create models for all but some T0 end points and then choose the model whose forecast fits the original data best. To select the model, one can use such characteristics as the Root Mean Square Error,

RMSE = sqrt( (1/T0) · Σ_{t=T−T0+1..T} (Y_t − Ŷ_t)² )

the Mean Absolute Percentage Error,

MAPE = (100/T0) · Σ_{t=T−T0+1..T} | (Y_t − Ŷ_t) / Y_t |

and similar statistics.

3.5.5 Transforms used for Stationarizing Data

- Detrending: One can remove the underlying trend in the series. This can be done in several ways, depending on the nature of the data.
- Indexed data: Data measured in currencies are linked to a price index or related to inflation. Dividing the series by this index element-wise (that is, deflating) is therefore the solution to de-trend the data.
- Non-indexed data: It is necessary to estimate whether the trend is constant, linear or exponential. The first two cases are easy; for the last one it is necessary to estimate a growth rate (inflation or deflation) and apply the same method as for indexed data.
- Differencing: Seasonal or cyclical patterns can be removed by subtracting periodical values. If the data is 12-month seasonal, subtracting the series with a 12-lag difference will give a "flatter" series.
- Logging: In the case where the compound rate in the trend is not due to a price index (i.e. the series is not measured in a currency), logging can help linearize a series with an exponential trend (recall that log(exp(x)) = x).
It does not remove an eventual trend whatsoever, unlike deflation.

3.5.6 Checking Stationarity

Plotting rolling statistics
- Plotting rolling means and variances is a good first way to visually inspect the series. If the rolling statistics exhibit a clear trend (upwards or downwards) and show varying variance (increasing or decreasing amplitude), then one might conclude that the series is very likely not stationary.

Augmented Dickey-Fuller test
- This test is used to assess whether or not a time series is stationary. Without getting into too much detail about hypothesis testing, one should know that this test gives a result called a "test statistic", based on which one can say, with different levels (or percentages) of confidence, whether the time series is stationary or not.

KPSS
- The KPSS (Kwiatkowski-Phillips-Schmidt-Shin) test has as its null hypothesis that the series is trend-stationary. In other words, if the p-value of the test statistic is below the X % confidence threshold, this means one can reject this hypothesis and that the series is not trend-stationary with X % confidence. A p-value higher than the threshold leads to accepting this hypothesis and concluding that the series is trend-stationary.

Autocorrelation plots (ACF and PACF)
- An autocorrelation (ACF) plot represents the autocorrelation of the series with lags of itself. A partial autocorrelation (PACF) plot represents the amount of correlation between a series and a lag of itself that is not explained by correlations at all lower-order lags. Ideally, one would want no correlation between the series and lags of itself. Graphically speaking, one would like all the spikes to fall in the blue confidence region.
Choosing a model
- Exponential smoothing methods are appropriate for non-stationary data (i.e. data with a trend and seasonal data). ARIMA (Autoregressive Integrated Moving Average) models should be used on stationary data only. One should therefore remove the trend of the data (e.g. by deflating or logging) and then look at the differenced series.

3.5.7 Smoothing Methods

- The smoothing technique is a family of time-series forecasting algorithms which utilize weighted averages of previous observations to predict or forecast a new value. This technique is most efficient when time-series data is moving slowly over time. It harmonizes errors, trends and seasonal components into computing smoothing parameters.
- Smoothing methods work as weighted averages: forecasts are weighted averages of past observations. The weights can be uniform (this is a moving average) or follow an exponential decay - this means giving more weight to recent observations and less weight to old observations. More advanced methods include other parts in the forecast, like seasonal components and trend components.

1. Simple exponential smoothing
- Simple Exponential Smoothing (SES) is one of the minimal models of the exponential smoothing algorithms. SES is a method of time-series forecasting used with univariate data with no trend and no seasonal pattern. It needs a single parameter called alpha (α), also known as the smoothing factor. Alpha controls the rate at which the influence of past observations decreases exponentially. The parameter is often set to a value between 0 and 1. This method can be used to predict series that do not have trends or seasonality.
- The simple exponential smoothing formula is given by:

s_t = α·x_t + (1 − α)·s_{t−1} = s_{t−1} + α·(x_t − s_{t−1})

where
s_t = smoothed statistic (simple weighted average of the current observation x_t)
s_{t−1} = previous smoothed statistic
α = smoothing factor of the data; 0 < α < 1
2. Double exponential smoothing
- Double exponential smoothing (Holt's method) extends SES with a trend component, smoothed by a second factor beta (β):

s_t = α·x_t + (1 − α)·(s_{t−1} + b_{t−1})
b_t = β·(s_t − s_{t−1}) + (1 − β)·b_{t−1}

where
b_t = best estimate of the trend at time t
β = trend smoothing factor; 0 < β < 1
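The two update rules above can be implemented directly. This is a bare sketch with invented function names: there is no parameter fitting, and the initial level and trend are simply taken from the first observations.

```python
def ses_forecast(x, alpha):
    """Simple exponential smoothing: returns the one-step-ahead forecast."""
    s = x[0]                   # initialize the level with the first value
    for t in range(1, len(x)):
        s = alpha * x[t] + (1 - alpha) * s
    return s

def holt_forecast(x, alpha, beta):
    """Double exponential smoothing (Holt): level plus trend, one step ahead."""
    s, b = x[0], x[1] - x[0]   # initial level and trend estimates
    for t in range(1, len(x)):
        s_prev = s
        s = alpha * x[t] + (1 - alpha) * (s + b)
        b = beta * (s - s_prev) + (1 - beta) * b
    return s + b

print(ses_forecast([5.0] * 10, 0.3))                # a constant series stays at 5.0
print(holt_forecast(list(range(1, 11)), 0.5, 0.5))  # a perfect linear trend 1..10 forecasts 11.0
```

On a perfectly linear series the level tracks the data exactly and the trend estimate stays at 1, which is why Holt's method extrapolates the line while SES, having no trend term, would lag behind it.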
PDF
75% (8)
cs3362 Foundations of Data Science Lab Manual
53 pages
Unit5 BD
PDF
100% (2)
Unit5 BD
91 pages
DVT - Question Bank
PDF
100% (1)
DVT - Question Bank
3 pages
Ccs341 DW Lab Manual Chumma Chumma Practical Notes
PDF
No ratings yet
Ccs341 DW Lab Manual Chumma Chumma Practical Notes
89 pages
Eda Question Paper
PDF
No ratings yet
Eda Question Paper
4 pages
ML Lab Manual - Ex No. 1 To 9
PDF
No ratings yet
ML Lab Manual - Ex No. 1 To 9
26 pages
ccs341 Data Warehousing Lab Manual2021
PDF
No ratings yet
ccs341 Data Warehousing Lab Manual2021
41 pages
CS3461 Oslab
PDF
No ratings yet
CS3461 Oslab
2 pages
CS3361 Data Science Lab Manual (II CYS)
PDF
100% (1)
CS3361 Data Science Lab Manual (II CYS)
40 pages
AD3251 Data Structures Design Question Bank 1
PDF
No ratings yet
AD3251 Data Structures Design Question Bank 1
1 page
Machine Learning - AL3451 - Notes - Unit 1 - Introduction To Machine Learning
PDF
No ratings yet
Machine Learning - AL3451 - Notes - Unit 1 - Introduction To Machine Learning
29 pages
Lab Manual Daa Ad3351 Aids III Sem Regulation 2021
PDF
No ratings yet
Lab Manual Daa Ad3351 Aids III Sem Regulation 2021
48 pages
Cd3291 Dsa Notes
PDF
100% (1)
Cd3291 Dsa Notes
168 pages
MAD - 4 unit
PDF
No ratings yet
MAD - 4 unit
8 pages
AD3391 Database Design and Management Nov Dec 2023 Question Paper Download
PDF
No ratings yet
AD3391 Database Design and Management Nov Dec 2023 Question Paper Download
3 pages
Unit I Content Beyond Syllabus - I Introduction To Data Mining and Data Warehousing What Are Data Mining and Knowledge Discovery?
PDF
No ratings yet
Unit I Content Beyond Syllabus - I Introduction To Data Mining and Data Warehousing What Are Data Mining and Knowledge Discovery?
12 pages
Experiment 5
PDF
100% (1)
Experiment 5
6 pages
CS3591 Computer Networks Unit-01 Notes
PDF
No ratings yet
CS3591 Computer Networks Unit-01 Notes
87 pages
Cs3451 Ios Unit 5 Notes
PDF
No ratings yet
Cs3451 Ios Unit 5 Notes
21 pages
Ad3491 Fdsa Unit 3 Notes Eduengg
PDF
No ratings yet
Ad3491 Fdsa Unit 3 Notes Eduengg
37 pages
Dpco Unit-3 Notes
PDF
No ratings yet
Dpco Unit-3 Notes
31 pages
CCS341-Data Warehousing Lab Manual (2021)
PDF
100% (1)
CCS341-Data Warehousing Lab Manual (2021)
50 pages
CS8392 - Oop - Unit 1 - PPT - 1.1
PDF
67% (3)
CS8392 - Oop - Unit 1 - PPT - 1.1
28 pages
AL3391 Notes Unit I
PDF
100% (1)
AL3391 Notes Unit I
52 pages
Ccs341 Dw Notes All 5 Units
PDF
100% (1)
Ccs341 Dw Notes All 5 Units
159 pages
ccs346 Eda
PDF
No ratings yet
ccs346 Eda
2 pages
AL3391 AI UNIT 5 NOTES EduEngg
PDF
100% (1)
AL3391 AI UNIT 5 NOTES EduEngg
26 pages
Perform Data Preprocessing Tasks Using Labor Data Set in WEKA
PDF
No ratings yet
Perform Data Preprocessing Tasks Using Labor Data Set in WEKA
6 pages
CH - 2 - Application To Univariate and Bivariate Analysis in Stata
PDF
No ratings yet
CH - 2 - Application To Univariate and Bivariate Analysis in Stata
32 pages
Data Analysis-Univariate & Bivariate
PDF
100% (1)
Data Analysis-Univariate & Bivariate
9 pages
EDA - Day 3
PDF
No ratings yet
EDA - Day 3
18 pages
U1_Exploring_One-Variable_Data
PDF
No ratings yet
U1_Exploring_One-Variable_Data
22 pages
Unit 4
PDF
No ratings yet
Unit 4
21 pages
Machine Learning With Python.
PDF
100% (1)
Machine Learning With Python.
147 pages
Python Cheat Sheets Compilation
PDF
100% (4)
Python Cheat Sheets Compilation
14 pages