C207 Study Guide
Big Data
Refers to both structured and unstructured data in such large volumes that it's difficult to process using
traditional database and software techniques.
Data Mining
Process of discovering patterns in large data sets. Data mining is performed on big data to decipher
patterns from these large databases.
• Solving the Problem
o Modeling Step
o Data Collection Step
o Data Analysis Step
• Communicating Results
Data Management
Refers to cleaning and organizing a data set that has been collected
• Available
• Accurate
• Complete
• Relevant
• Timely
4 Levels of Measurement
• Continuous Data – a data point can lie at any point within a range of data (age)
o Interval Data – data points are an equal interval apart, but there is no natural zero (time)
o Ratio Data – has a unique zero point (age, Kelvin scale, income, stock price, inventory)
• Discrete Data – can only take on whole values and has clear boundaries (number of cars)
o Nominal Data – called categorical data, used to label subjects in a study (males/females)
o Ordinal Data – places data objects into an order according to some quality (degrees)
Skewness (Bias) – is a measure of the degree to which data leans toward one side.
Research Design
• Observational Studies – when it’s impractical or impossible to control the conditions of the study
o Cohort Study
o Case Control Study
• Experimental Studies – variable measurements and subjects are under the researcher’s control
o Experimental units – subjects or objects under observation
o Treatments – the procedures applied to each subject
o Responses – the effects of the experimental treatments
Experimental Design
• Qualitative Research – exploratory research, data not characterized by numbers
• Quantitative Research – uses numerical data and measurements
Probability
The chance of an event occurring at some time in the future
Independent Events – first result does not have any impact on the second one
Complementary Events – the only possible outcomes of that event (flipping a coin – heads or tails)
Conditional Probability – probability of an event occurring, given that another event has already occurred
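A quick worked sketch in Python (the card-deck and coin numbers are illustrative) showing how conditional probability follows from P(A|B) = P(A and B) / P(B):

# Hypothetical example: drawing one card from a standard 52-card deck
p_heart = 13 / 52            # P(heart)
p_heart_and_face = 3 / 52    # P(heart AND face card): J, Q, K of hearts
p_face_given_heart = p_heart_and_face / p_heart   # conditional probability
print(p_face_given_heart)    # 3/13 ≈ 0.231

# Independent events: P(A and B) = P(A) * P(B), e.g., two coin flips
p_two_heads = 0.5 * 0.5      # 0.25
# Complementary events: P(A) + P(not A) = 1
p_tails = 1 - 0.5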
Permutations – when the order does matter
Permutation (where repetition is allowed): n^r = n x n x … x n (r times)
Permutation (with no repetition): nPr = n!/(n – r)! = n x (n – 1) x (n – 2) x … x (n – r + 1)
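A small check of both permutation formulas using Python's standard library (math.perm requires Python 3.8 or later):

import math

n, r = 5, 3
with_repetition = n ** r                      # n^r = 125
without_repetition = math.perm(n, r)          # n!/(n - r)! = 5 * 4 * 3 = 60
assert without_repetition == math.factorial(n) // math.factorial(n - r)
print(with_repetition, without_repetition)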
Bayes’ Theorem
Describes the probability of an event, based on prior knowledge of conditions that might be related to the event.
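Bayes' Theorem states P(A|B) = P(B|A) x P(A) / P(B). A minimal sketch with assumed numbers (a disease-test example, values chosen only for illustration):

# Assumed illustrative values
p_disease = 0.01                      # prior: P(disease)
p_pos_given_disease = 0.95            # sensitivity: P(positive | disease)
p_pos_given_healthy = 0.05            # false-positive rate

# Total probability of a positive result
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' Theorem: P(disease | positive)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # ≈ 0.161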
Empirical Rule – applies to a normal, bell-shaped curve which is symmetrical about the mean: about 68% of values fall within 1 standard deviation of the mean, 95% within 2, and 99.7% within 3.
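The 68/95/99.7 percentages can be verified against the normal distribution (a sketch assuming scipy is available):

from scipy.stats import norm

for k in (1, 2, 3):
    # proportion of a normal distribution within k standard deviations of the mean
    print(k, round(norm.cdf(k) - norm.cdf(-k), 4))
# 1 0.6827   2 0.9545   3 0.9973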
Graphic Displays
Range – represents the span of possible values, from minimum to maximum.
Percentiles – unit of measurement that gives a value below which a given percentage of the population falls.
Inter-quartile Range – measures the difference between the third quartile and the first quartile.
Boxplot – is standardized way of displaying the distribution of data based on a 5-number summary
(“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”).
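A short sketch computing the range, the five-number summary, and the IQR with NumPy (the sample data is made up):

import numpy as np

data = np.array([2, 4, 4, 5, 7, 9, 10, 12, 15, 21])   # illustrative values
q1, median, q3 = np.percentile(data, [25, 50, 75])
print("range:", data.max() - data.min())
print("five-number summary:", data.min(), q1, median, q3, data.max())
print("IQR:", q3 - q1)
# matplotlib's plt.boxplot(data) would draw the corresponding boxplot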
Histogram – graph that displays continuous, non-discrete data (compare frequency of numerical data)
Bar Chart – is a graph that displays discrete data (compare different categories of data)
Scatter Diagram (bivariate chart) – shows relationships between two variables for determining how
closely they are related.
Line Graph (bivariate chart) – shows relationship between two or more variables by using connected
data points.
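A minimal matplotlib sketch of the four chart types above (all of the data is made up for illustration):

import matplotlib.pyplot as plt
import numpy as np

x = np.arange(10)
y = x * 2 + np.random.default_rng(0).normal(0, 2, 10)   # illustrative data

fig, axes = plt.subplots(2, 2)
axes[0, 0].hist(y)                          # histogram: continuous data
axes[0, 1].bar(["A", "B", "C"], [5, 3, 8])  # bar chart: discrete categories
axes[1, 0].scatter(x, y)                    # scatter diagram: two variables
axes[1, 1].plot(x, y)                       # line graph: connected data points
plt.show()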
Linear Programming
Mathematical technique used to find a maximum or minimum of linear equations containing several
variables. A technique for minimizing total cost or maximizing profit subject to constraints.
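A minimal linear-programming sketch using scipy.optimize.linprog; the profits and resource constraints are assumed for illustration, and because linprog minimizes, the profit coefficients are negated to maximize:

from scipy.optimize import linprog

# Maximize profit 3x + 5y subject to resource constraints (illustrative numbers)
c = [-3, -5]                      # negate because linprog minimizes
A_ub = [[1, 2],                   # labor hours used per unit of each product
        [3, 1]]                   # machine hours used per unit of each product
b_ub = [40, 60]                   # available labor and machine hours
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x, -res.fun)            # optimal production mix and maximum profit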
Crossover Analysis
When there are two or more plans or options to consider, crossover analysis allows a decision maker to
identify the crossover point, which represents the point at which they are indifferent between the options.
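A quick crossover calculation with assumed numbers: two plans with different fixed and variable costs are equal where fixed1 + v1·x = fixed2 + v2·x:

# Plan A: higher fixed cost, lower variable cost; Plan B: the reverse (illustrative)
fixed_a, var_a = 10000, 2.0
fixed_b, var_b = 4000, 5.0

crossover_units = (fixed_a - fixed_b) / (var_b - var_a)
print(crossover_units)   # 2000 units: below this Plan B is cheaper, above it Plan A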
Break-even Analysis
Tells how many units of a product must be sold to cover the fixed and variable costs of production.
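The break-even quantity is fixed cost divided by the contribution margin (price minus variable cost per unit). A tiny sketch with assumed figures:

fixed_cost = 50000        # illustrative values
price_per_unit = 25.0
variable_cost_per_unit = 15.0

break_even_units = fixed_cost / (price_per_unit - variable_cost_per_unit)
print(break_even_units)   # 5000 units must be sold to cover all costs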
Hypothesis Tests
t-Test
Tests a null hypothesis about one or two means; most often, it tests the hypothesis that two means are
equal, or that the difference between them is zero.
Chi-squared Test
Performs hypothesis testing on two categorical variables from a single population.
ANOVA Test
Used to compare multiple (three or more) samples with a single test.
P-value
If the p-value is less than the significance level (commonly 0.05), we reject the null hypothesis.
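A short sketch of the p-value decision rule using scipy.stats; the two samples are made up, and ttest_ind tests whether two means differ:

from scipy import stats

group_a = [23, 25, 28, 30, 27, 26, 24]   # illustrative sample data
group_b = [31, 33, 29, 35, 32, 30, 34]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
if p_value < 0.05:
    print("Reject the null hypothesis: the means differ", p_value)
else:
    print("Fail to reject the null hypothesis", p_value)
# stats.chi2_contingency and stats.f_oneway cover the chi-squared and ANOVA cases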
R-square
o Measures the goodness of fit in a regression analysis; it ranges in value from 0 to 1.
o Value close to 1 indicates that estimation error is small, and data closely aligns to regression line
o Value close to 0 indicates that data does not align as closely to the estimated regression line
Correlation Coefficient
Measures the strength of a linear relationship.
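For simple linear regression, R-square is just the square of the correlation coefficient. A quick NumPy check (data assumed):

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])   # illustrative, roughly linear data

r = np.corrcoef(x, y)[0, 1]   # correlation coefficient, between -1 and 1
print("r =", round(r, 4), " R-squared =", round(r ** 2, 4))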
Forecasting, Regression Analysis, and Quantitative Techniques
Forecasting Techniques
Regression Analysis
Linear Regression
A technique using a single independent variable to predict a single dependent variable.
Dependent variable is the variable whose value depends on the other variables in the equation.
Independent variables are variables presumed to influence the dependent variable.
Correlation
o The strength of a linear relationship can be measured with the correlation coefficient.
o Correlation coefficient, a number between -1 and 1, is only useful for measuring linear relationships.
o Correlation coefficient that is close to 0 indicates a weak linear relationship
o Correlation coefficient closer to -1 or 1 represents a strong linear relationship.
o Correlation coefficient equal to exactly -1 or 1 would be considered perfectly linear.
Multiple Regression
o A technique using more than one independent variable to predict a single dependent variable.
o Multicollinearity describes a linear relationship among the independent variables.
o Autocorrelation describes the correlation of a variable with itself given a time lag; its presence is a concern in regression analysis.
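A minimal multiple-regression sketch using ordinary least squares via numpy.linalg.lstsq (the two predictors and the response are made-up data):

import numpy as np

# Two independent variables (predictors) and one dependent variable (illustrative)
x1 = np.array([1, 2, 3, 4, 5, 6], dtype=float)
x2 = np.array([2, 1, 4, 3, 6, 5], dtype=float)
y = np.array([5.1, 5.9, 9.2, 10.1, 13.8, 14.2])

X = np.column_stack([np.ones_like(x1), x1, x2])      # add an intercept column
coef, *_ = np.linalg.lstsq(X, y, rcond=None)         # [intercept, b1, b2]
print(coef)

# Rough multicollinearity check: correlation between the predictors
print("corr(x1, x2) =", np.corrcoef(x1, x2)[0, 1])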
Cluster Analysis
Also known as segmentation, is the process of arranging terms or values based on different variables
into "natural" groups. Most often with cluster analysis, these terms or values are survey responses from
people.
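A minimal k-means clustering sketch with scikit-learn (an assumed dependency; the survey-style data points are made up):

import numpy as np
from sklearn.cluster import KMeans

# Illustrative two-variable responses (e.g., satisfaction score, monthly spend)
X = np.array([[1, 2], [1, 4], [2, 3],
              [8, 8], [9, 10], [10, 9]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # "natural" group assigned to each respondent
print(kmeans.cluster_centers_)  # center of each group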
Decision Analysis
Process of weighing all outcomes of a decision to determine the best course of action.
Simulations
Simulation is an attempt to emulate a real process or system through an imitative model. This allows
considering problems that may not lend themselves to direct experimentation and helps managers make
decisions. Common simulation tools include what-if analysis, and Monte Carlo simulation.
What-if analysis
A form of simulation analysis that involves selecting different values for the probabilistic inputs in a
model and then computing the possible outputs.
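A small Monte Carlo / what-if sketch in plain NumPy; the demand and cost distributions, price, and fixed cost are all assumed for illustration:

import numpy as np

rng = np.random.default_rng(0)
n_trials = 10_000

# Probabilistic inputs: uncertain demand and unit cost (assumed distributions)
demand = rng.triangular(800, 1000, 1400, n_trials)
unit_cost = rng.normal(12.0, 1.5, n_trials)

profit = demand * (20.0 - unit_cost) - 5000      # assumed price 20, fixed cost 5000
print("mean profit:", profit.mean().round(0))
print("P(loss):", (profit < 0).mean())           # chance the plan loses money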
SIPOC (Supplier-Input-Process-Output-Customer)
SIPOC Benefits
o Helps define the boundaries of your operations by providing a high-level view of the complete process.
o Helps understand how process elements fit together.
o It ensures that you take a broad view of work instead of focusing only on the internal work.
o Takes into account the quality of the work and materials that suppliers provide to the process.
o Checks how the outputs of the process are perceived and used by customers.
o Stops you from optimizing work to satisfy only the internal process stakeholders.
Sampling
Involves choosing one or several outputs generated from a process as representatives of the entire group.
Attribute Data
Collected to show whether the result meets a requirement or not; answers a yes/no question or pass/fail test.
Variable Data
Tests how well a result meets a requirement; results can be rated on a scale between 0 and infinity.
Control Limits
The upper control limit and the lower control limit are each 3 standard deviations from the mean.
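A short sketch of computing control limits from sample data (the measurements are made up); points outside mean ± 3σ would signal special-cause variation on a control chart:

import numpy as np

measurements = np.array([10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.4, 9.7])  # illustrative

mean = measurements.mean()
sigma = measurements.std(ddof=1)        # sample standard deviation
ucl, lcl = mean + 3 * sigma, mean - 3 * sigma
print("UCL:", round(ucl, 3), "LCL:", round(lcl, 3))
print("out of control:", measurements[(measurements > ucl) | (measurements < lcl)])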
1. Run Chart – simple way to illustrate performance measurements over a period of time.
2. Control Chart
o Modified run chart—it shows the performance of a process over time, but it also includes limits or
constraints that a process should not exceed.
o Especially helpful in distinguishing special cause from common cause variation.
3. Cause-and-Effect Diagram
o Often called a fishbone diagram
o Helps project participants systematically uncover sources of problems
o Creates a hierarchy of the primary and underlying factors that cause an event or problem
4. Flowchart
o Graphic representation of the steps that make up a process.
o Documents a process as it currently exists so it can be compared to one that shows an ideal condition.
5. Check Sheet
o Structured form or table used to count how many times an event or problem happened.
o Ensures that everyone collecting data is compiling and recording it in a similar way.
6. Scatter Diagram
o Data are displayed as a collection of points, each having the value of one variable determining the
position on the horizontal axis and the value of the other variable determining the position on the
vertical axis.
7. Histogram
8. Pareto Chart
o Bar chart that sorts data into categories, then prioritizes them from the most significant to the least significant factor.
o Based on the 80/20 rule – 80% of problems are the result of a small number (about 20%) of causes.
Six Sigma
o Statistical concept that places 6 standard deviations between the mean and the allowed limits.
o Processes working at six-sigma level are 99.9997% defect-free (only 3.4 defects per million outputs).
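The often-quoted 3.4 defects per million corresponds to the normal tail area beyond 4.5 standard deviations (6 sigma minus the conventional 1.5-sigma long-term shift). A quick check, assuming scipy is available:

from scipy.stats import norm

defects_per_million = norm.sf(4.5) * 1_000_000   # one-sided tail beyond 4.5 sigma
print(round(defects_per_million, 1))             # ≈ 3.4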
DMAIC Framework
o Six Sigma employs a five-step framework to analyze an existing process and to incorporate changes.
o Define – Measure – Analyze – Improve – Control
ISO Certification
International Organization for Standardization (ISO) established a certification program that guarantees
that an organization is dedicated to quality concepts and is continually working to ensure that it is
producing the highest level of quality possible.
2. Leadership - Leaders at all levels establish unity of purpose and direction and create conditions in
which people are engaged in achieving the organization’s quality objectives.
3. Engagement of people - Competent, empowered and engaged people at all levels throughout the
organization are essential to enhance its capability to create and deliver value.
4. Process approach - Consistent and predictable results are achieved more effectively/efficiently when
activities are understood and managed as interrelated processes that function as a coherent system.
7. Factual approach to decision making - Effective decisions are based on the analysis of data and
information.
8. Mutually Beneficial Supplier Relationship - An organization and its suppliers are interdependent and
a mutually beneficial relationship enhances the ability of both to create value.
1. Affinity Diagram
o Groups items based on relationships, which are then analyzed.
o Used when confronted with many facts or ideas in apparent chaos.
2. Interrelationship Digraph
o Displays all the interrelated cause-and-effect relationships and factors involved in a complex
problem and describes desired outcomes.
3. Tree Diagram
o Hierarchical tool that breaks a topic down into its components.
o Breaks down broad categories into finer and finer levels of detail.
4. Prioritization Matrix
o Prioritizes multiple options, based on how well these options satisfy preselected criteria.
o Prioritizes items in terms of weighted criteria.
o Popular applications: Return on Investment (ROI) or Cost/Benefit Analysis
5. Matrix Diagram
o Table or chart that shows the strength of the relationships between items or sets of items
6. Network Diagram
o A scheduling diagram that shows the relationships between project activities
o Helps in determining the critical path (longest sequence of tasks).
Business Improvement Analytics
Index Numbers – are a common analytic for business improvement.
Index = (Price / Base Period Price) x 100
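A tiny worked example of the index formula above (the prices are assumed):

base_period_price = 4.00        # illustrative prices
current_price = 5.00

index = (current_price / base_period_price) * 100
print(index)                    # 125.0 → prices are 25% above the base period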
Healthcare Analytics
Epidemiology
o Studies incidence, distribution, and possible control of diseases and other factors relating to health.
o Rate is the measure of an event occurring over a period of time.
o Proportion – ratio of a group to the whole
o Prevalence counts all of the existing cases of a disease
o Incidence only counts new cases.
o Cumulative Incidence – measures the number of new cases that arise in a period of time.
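A small sketch of the prevalence / incidence distinction with assumed counts:

population = 50_000             # illustrative numbers
existing_cases = 500            # all current cases
new_cases_this_year = 100       # cases that arose during the year

prevalence = existing_cases / population                 # 0.01 (1%)
cumulative_incidence = new_cases_this_year / population  # 0.002 per year
print(prevalence, cumulative_incidence)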
Education Analytics
Helps education leaders gain a better understanding of student progress, the effectiveness of different
questions, and the construction of tests.
Test Construction
o Norm-referenced tests – compare an individual to others, e.g., standard score (Z-score)
o Criterion-referenced tests – compare an individual to defined standards, e.g., exam cut-score
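A quick norm-referenced example: a standard score (Z-score) is the number of standard deviations a student's raw score sits above or below the group mean (the numbers are assumed):

raw_score = 82          # illustrative student score
group_mean = 75
group_sd = 5

z_score = (raw_score - group_mean) / group_sd
print(z_score)          # 1.4 standard deviations above the mean
# A criterion-referenced decision instead compares raw_score to a fixed cut-score
print("pass" if raw_score >= 70 else "fail")   # assumed cut-score of 70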
Module 6: Improving Organizational Performance
Balanced Scorecard
Measures an organization's performance on a balanced mix of financial and non-financial measures.
Net Promoter Score
Quantifies how strong an organization's customer relations are.
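NPS is commonly computed as the percentage of promoters (ratings of 9-10) minus the percentage of detractors (ratings of 0-6). A short sketch with made-up survey responses:

responses = [10, 9, 8, 7, 9, 4, 10, 6, 9, 3]   # illustrative 0-10 ratings

promoters = sum(1 for r in responses if r >= 9)
detractors = sum(1 for r in responses if r <= 6)
nps = (promoters - detractors) / len(responses) * 100
print(nps)   # 20.0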
Performance Assessment and Strategy
Performance assessment can and should be linked to a company's strategy.