BA Module 02 - 2.1 + 2.2
1 Sampling at Amazon
As we've seen,
descriptive statistics
and graphical
representations of data
often provide a
great deal of insight
into patterns or
relationships in a data set.
Often we wish to analyze a
large set of people or objects.
We'll call this set our
population of interest,
and the people or
objects in the set
the members of the population.
Due to time or
resource constraints,
it is often not
practicable to analyze all
of the members of a population.
Fortunately, analyzing
a sample that
is a representative subset
of the population often
helps us draw useful conclusions
about the full population.
Before we discuss
how to do this,
let's learn about
how Amazon uses
sampling to answer an
important managerial question.
We decided long ago that
our company's mission
is to be Earth's most
customer centric company.
We are obsessed with
the customer experience.
So when we have an opportunity
to improve the customer
experience through
analytics, we'll
usually focus on the thing that
is likely to have the highest
impact on customer
experience, positive impact,
and the broadest
possible impact,
because we're a global company.
So we look for things like
low prices, huge selection,
improvements in the
delivery experience
and convenience that
are likely to apply
for a long period of time
everywhere in the world.
When we ship items to customers,
they come from a warehouse
where we store inventory.
The way that we
process inventory
is that we receive a truck that
has books, consumer electronics
items, toys, kitchen,
sports, clothes, shoes.
The truck comes in.
We receive the items,
which basically
means that we open up a
carton and take out the items
and make sure that
they're in good shape.
And then we stow
them into a shelf,
waiting on the customer
orders that will eventually
come to ship to customers.
The places for errors
in this process
include misidentifying
the item at receive.
So we think we have black shoes,
and someone's made a mistake
and identified
them as blue shoes.
We could place the item
into the wrong bin.
We could pull the wrong
item from the shelf,
and then there are a couple
of other smaller ways
that we might make mistakes.
We're trying to minimize
the defects to customers,
meaning minimize the chance
that a customer would receive
the wrong item or receive a
delay because the last item
that we have is in
the wrong place.
And we're trying
to reduce our costs
to deal with those kinds of
defects at the same time,
so improve quality
and lower costs.
The best way to do that is to have
as few defects as possible
in our inventory.
Years ago, way
before Amazon.com,
retail learned that
inventory accuracy matters
in stores and in
warehouses, and retailers
got accustomed to annual counts,
annual inventory counts.
Often, stores would close
for a day or two days,
or sometimes a week.
Warehouses would do
the same, and humans
would go out into the
warehouse and count every item,
make sure that they
knew what was where
in the warehouse, and
then you would reopen
and start selling again.
That's a very expensive
process, because you actually
have to close your
operation during the time
that the warehouse is closed.
And you also don't
have the benefit
of knowing whether you're
perfect in your inventory
throughout the rest of the year.
You basically have one sample.
It's a complete sample, but
it's one sample, once a year,
and then you hope that your
processes are good enough
the rest of the year.
What we've learned to do
is to sample our inventory
continuously, sample the
accuracy of our inventory
continuously, to make sure
that we have as accurate
an inventory as we
can afford to have.
The idea behind sampling
is it might not be possible
for you to learn the true value
of a statistic of interest
in the population.
We have many warehouses
that house that inventory.
Going through all that would
be very, very time consuming.
And the idea behind sampling
in that situation would be you
would at random pick a subset
of the items in inventory,
and ask whether they
had those defects.
So it's a lower
cost way to learn
the rate at which the
statistic of interest
occurs in the population.
Below is a summary of the steps in the sampling process. Remember, we only start this
process after we have clearly established the problem we wish to solve and the question
we will ask of the members of the sample.
In the previous module, we learned about descriptive statistics. The numerical properties
of a population are called parameters and those of a sample are called statistics. A
statistic is an estimate of a true value of a parameter. If a sample is sufficiently
large and is representative of the population, the sample statistics should be
reasonably good estimates of the population parameters.
To differentiate between population and sample measures, we use the Greek alphabet
for population parameters, and the Latin alphabet for sample statistics. The symbols for
the mean and standard deviation are summarized in the table below.

Measure               Population parameter   Sample statistic
Mean                  μ                      x̄
Standard deviation    σ                      s
Click on the button below to generate a random sample of 30 points from the population.
In this case we are given the population mean and standard deviation, but generally we
will not have that information. Indeed, we take a sample precisely because we do NOT
have complete information about a population. Take as many random samples as you
would like, making sure to notice if the sample statistics vary and how accurately they
represent the population parameters.
What happens to the sample mean and standard deviation as you take new samples of
equal size?
The sample mean and standard deviation remain exactly the same
Since each sample is randomly selected, the mean and standard deviation vary from
one sample to the next.
The sample mean and standard deviation vary but remain fairly close to the
population mean and standard deviation (CORRECT)
Since each sample is randomly selected, the mean and standard deviation vary from
one sample to the next. However, since the sample size is fairly large, each sample’s
mean and standard deviation are fairly close to the population mean and standard
deviation. We’ll learn more about how to select a good sample later.
The sample mean and standard deviation vary substantially from one sample to the
next
Since each sample is randomly selected, the mean and standard deviation may vary
from one sample to the next. However, since the sample size is fairly large, each
sample’s mean and standard deviation are fairly close to the population mean and
standard deviation.
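The behavior described in this exercise can be reproduced with a short simulation. The sketch below is only an illustration: the population (10,000 values drawn with a mean near 50 and a standard deviation near 10) and the helper name `sample_stats` are assumptions, not part of the course material.

```python
import random
import statistics

random.seed(42)  # fixed seed so the illustration is repeatable

# Hypothetical population: 10,000 values with mean near 50 and std dev near 10.
population = [random.gauss(50, 10) for _ in range(10_000)]
mu = statistics.mean(population)       # population mean
sigma = statistics.pstdev(population)  # population standard deviation

def sample_stats(pop, n=30):
    """Draw a simple random sample of size n; return (sample mean, sample std dev)."""
    sample = random.sample(pop, n)
    return statistics.mean(sample), statistics.stdev(sample)

# Take five independent samples of 30 points each.
results = [sample_stats(population) for _ in range(5)]

print(f"population: mean={mu:.2f}, sd={sigma:.2f}")
for x_bar, s in results:
    print(f"sample:     mean={x_bar:.2f}, sd={s:.2f}")
```

Each run of `sample_stats` gives slightly different values, but with 30 points per sample they tend to stay close to the population mean and standard deviation.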
In some cases, selecting a random sample is quite straightforward. If we have a list of all
members of a population in a database, we can use a computer to assign a random
number to each member and draw a sample from the list. This process gives each member
(that is, each element of the population) an equal likelihood of being selected, which
makes the sample likely to be representative of the population, provided it is large enough.
Suppose we have the phone numbers of 20,000 people, and we want to survey a
random sample of 100 of them. We will do this using Excel’s RAND function. RAND
assigns a random identification (ID) number between 0 and 1 to each data point—in this
case, to each phone number. We use these random ID numbers to sort the data,
creating a list of the phone numbers in “random” order. We then call the first 100
numbers on the list.
The RAND function takes no arguments, so we simply type =RAND() with empty
parentheses.
We can use the RAND function to generate random numbers between any two
specified values. For example, if we wanted to generate random numbers
between 0 and 10 we would multiply the function by 10 and enter =RAND()*10. If
we wanted numbers between 5 and 15, we would enter =5+RAND()*10.
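The same scaling idea can be sketched outside Excel. In the Python snippet below, `random.random()` plays the role of RAND, and the helper name `rand_between` is a hypothetical name chosen for illustration.

```python
import random

def rand_between(low, high):
    """Analogue of =low+RAND()*(high-low): a uniform random number between low and high."""
    return low + random.random() * (high - low)

# Equivalent of =RAND()*10
values_0_10 = [rand_between(0, 10) for _ in range(1000)]

# Equivalent of =5+RAND()*10
values_5_15 = [rand_between(5, 15) for _ in range(1000)]

print(min(values_0_10), max(values_0_10))
print(min(values_5_15), max(values_5_15))
```

The multiplier sets the width of the range (here 10) and the added constant sets where the range starts.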
Step 1
Step 2
In cell A2, enter the function =RAND() to generate a random ID number between 0 and
1.
Step 3
Copy and paste the function from cell A2 into cells A3:A26 so that all 25 phone numbers
are assigned a random ID number. You can use auto-fill instead of copying and pasting.
Step 4
Now we need to sort the phone numbers. Highlight the data in column A and column
B, excluding the labels, and select Sort Ascending from the Data menu.
Note that the RAND function generates a random number for each phone
number every time the spreadsheet is calculated. Therefore, even though the
phone numbers actually were sorted, the (new) random numbers will not appear
in order. The sorting was based on the previously assigned random numbers.
After sorting, the 25 phone numbers on the list are in random order. If we wanted
to draw a random sample of 10 phone numbers, we would start at the top of the
list and choose the first 10 people.
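For readers who prefer code to a spreadsheet, Steps 2 to 4 can be mirrored in a few lines of Python. This is a minimal sketch under stated assumptions: the 25 phone numbers are made-up placeholders, and `random.random()` stands in for Excel's RAND.

```python
import random

# Hypothetical population: 25 made-up phone numbers (placeholders, not real data).
phone_numbers = [f"555-01{i:02d}" for i in range(25)]

# Steps 2-3: assign each phone number a random ID between 0 and 1.
# The IDs are generated once and stored, so they do not change after sorting
# the way recalculated RAND() cells do.
rows = [(random.random(), phone) for phone in phone_numbers]

# Step 4: sort ascending by random ID, which puts the phone numbers in random order.
rows.sort(key=lambda row: row[0])

# Draw the sample: the first 10 phone numbers on the sorted list.
sample = [phone for _, phone in rows[:10]]
print(sample)
```

Python's standard library also offers `random.sample(phone_numbers, 10)`, which draws 10 distinct members directly; the sort-based version is shown because it follows the spreadsheet steps one for one.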
If our population of interest is not listed in an easily accessible database, the task of
selecting a sample at random becomes more difficult. In such cases, we have to be
extremely careful not to introduce bias into our selection process.
Amazon’s inventory sampling process is more complex than selecting from a list of
phone numbers. Let’s see how Amazon’s managers ensure that their samples are
randomly selected.
We have a team of
auditors that are
dedicated to sampling
our inventory
and they sample continuously.
It's a relatively small team
in each of the warehouses.
We randomize the
places that they
go to inspect the inventory.
So they don't decide where
to go to check a shelf,
they have a software tool that
directs them to the shelf,
and then they go
and check to see
if what the computer
believes is on the shelf
actually exists on
the shelf, and we're
doing this all the time.
The logic behind
the sampling scheme
is to randomize across all of
the different types of storage
that we have in the warehouse.
That ensures that we cover
all of the warehouse,
or all the types of
storage that we have
several times during the year.
And types of
storage might be, we
might have small items
stored in shelves that
make for easy picking manually.
We might have clothes
stored in shelves
that allow for easy stacking
of shirts or folded jeans.
For larger items
like TVs, we might
have them stored on pallets,
just wooden boxes on the floor,
and use mechanical
equipment because it's
too heavy for a
single person to lift.
We check all of these
types of locations
during our cycle counts.
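The idea of randomizing across every type of storage can be sketched as a stratified random draw. The code below is not Amazon's tool; the storage types, location IDs, counts, and the function name `pick_audit_locations` are all hypothetical and chosen only to illustrate the logic.

```python
import random

# Hypothetical inventory locations grouped by storage type (all IDs are made up).
locations = {
    "small-item shelf": [f"S-{i:03d}" for i in range(200)],
    "apparel shelf": [f"A-{i:03d}" for i in range(120)],
    "pallet": [f"P-{i:03d}" for i in range(40)],
}

def pick_audit_locations(locations_by_type, per_type):
    """Randomly choose up to `per_type` locations from every storage type."""
    plan = {}
    for storage_type, ids in locations_by_type.items():
        k = min(per_type, len(ids))  # never ask for more locations than exist
        plan[storage_type] = random.sample(ids, k)
    return plan

# The software, not the auditor, decides which locations get checked.
audit_plan = pick_audit_locations(locations, per_type=5)
for storage_type, chosen in audit_plan.items():
    print(storage_type, chosen)
```

Drawing separately within each storage type guarantees that every type appears in each round of checks, which a single draw over all locations would not.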
When we're deciding on
the right sample size,
we use statistics to figure
out the smallest sample
at the right frequency to
ensure statistical significance
of the results.
It can vary from one
location to the next,
depending on the velocity
of the items coming
in and out of the warehouse.
So the more opportunities
we have to create defects,
the more likely we
are to need to sample.
If you have a
warehouse where there
is no movement in
and no movement out,
and nobody ever goes
into the shelves,
you just stored them
once and leave them,
we probably don't need
to sample that warehouse,
because the probability that
there is a defect, after you
know that's correct at the
beginning, is about zero.
On the other hand, if you
are removing the items
and replacing them with new
ones every day, 365 days a year,
you're probably creating
defects and you need to sample
more frequently.
Suppose a college has asked you to conduct a survey to determine the percentage of
8:00 AM classrooms that were full on a given morning. The college has three classroom
buildings, each containing two lecture halls. Each lecture hall has a capacity of 100
students. You randomly choose one of the three buildings, and stand outside the entrance
when classes let out. You ask the first 60 students leaving the building how full their
class was. However, you soon realize that this sample is not random because you
went to only one of the buildings and the classes at that building may not be
representative of all 8:00 AM classes. Moreover, since the students you surveyed were
the first to exit the building, it’s also quite possible that they all came from the same
class!
Realizing that your survey approach would not produce a random and representative
sample, you gather some friends to help sample. You place one surveyor outside each
building. You each randomly select 20 students leaving the buildings that morning and
tally the results: 5 people decline to participate, 35 tell you that their class was full, and
20 tell you that their class was not full. Is your sample now representative of all classes
that morning?
Yes
See correct answer for explanation.
No
This question is a bit tricky. This sample still may not be representative of all classes
because there is a bias in the approach. When you sample students leaving each of the
buildings, you will, on average, select more people from full classes, simply because
there were more people in those classes. Imagine that of the 6 classes that took place
that morning, 4 were full (each having 100 students) and 2 had only 40 students each. In
this case, most of the students, 400 of the total 480, were in full classes. Your sample
would include more students from the full classes and therefore is not representative of
all classes that took place that morning.
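The arithmetic behind this bias can be checked directly. The sketch below uses the class sizes from the example (4 classes of 100 and 2 classes of 40); the simulated survey of 60 students and the variable names are assumptions added for illustration.

```python
import random

# Class sizes from the example: 4 full classes of 100 students and 2 classes of 40.
CAPACITY = 100
class_sizes = [100, 100, 100, 100, 40, 40]

# True share of classes that were full: 4 of 6.
share_full_classes = sum(size == CAPACITY for size in class_sizes) / len(class_sizes)

# One entry per student: True if that student's class was full.
students = [size == CAPACITY for size in class_sizes for _ in range(size)]

# Share of students who sat in a full class: 400 of 480.
share_students_in_full = sum(students) / len(students)

# Simulate surveying 60 students chosen at random from everyone leaving.
random.seed(0)
surveyed = random.sample(students, 60)
share_surveyed_full = sum(surveyed) / len(surveyed)

print(f"classes that were full:        {share_full_classes:.1%}")
print(f"students in a full class:      {share_students_in_full:.1%}")
print(f"surveyed students saying full: {share_surveyed_full:.1%}")
```

About 66.7% of the classes were full, but about 83.3% of the students sat in a full class, so a survey that samples students tends to overstate how many classes were full.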
Based on what we have learned, how can we ensure that we choose a sample of
students that is representative of all 8:00 AM classes that take place on a given
morning?
I would ask 10 students at random from each class by observing their exit time and
selecting them at different points in time while they exit (not just the first 10 to exit, so that
we don’t get them all from the same class). I would also place 5 other friends at
the 2 other school buildings to do the same survey. I would also ask
them which class they were from.
I would enlist help from my friends. I and 5 other friends (6 surveyors in total) would go
to the exits of each of the classrooms. Each of us would survey 10 students, asking them
which class they were from and whether their class was full. I would also make
sure that we selected the students at random from throughout the duration of their exit
and not just the first 10 to exit.