Calculating Standard Deviation and Z Score

hi everyone so the final two measures
that we are going to calculate today are
the standard deviation and the normal
score the standard deviation is a
measure for which a single one a single
standard deviation exists for a given
data set the normal so-called numerous
score standard scores these scores are
all names for the same thing is
something that we have on a per value
basis so basically every entry in our
data set is going to have a unique
standard score and the reason for that
is that the standard score basically is
just explains how far from the mean how
far from the meaning given value is in
terms of standard deviations so it uses
a standard deviation as a unit to
describe how far a given value is from
the mean value but in order to get that
we first have to create the standard we
first have to calculate the standard
deviation now we are going to do this
from scratch so be patient but before we
can calculate it we first have to
understand what it is so this will see
in your readings this is part of the
Wikipedia page of standard deviation
which I like because it gives us an
opportunity to kind of just work
backwards through a description of what
the standard deviation is which seems
very complex at first but if we just
work through it step by step it's
actually fairly straightforward so for
this example we're going to be using
this definition in fact I might even go
ahead and zoom in on it just so it's
really big and easy to see so the the
first thing that we want to look at is
and we do start at the end of the
sentence here so the first part of
calculating is standard deviation is to
take the deviation of each value from
the average value okay deviation is just
another word for difference and of
course in math when we say difference we
mean subtraction so the first thing that
we're going to do in calculating the
standard deviation is we are going to
subtract each value from the we're going
to subtract in the average value which
of course is just another two
for the mean from from each value in our
data set okay so how we're gonna do
about go about this practically I'm
gonna go ahead and come here and
selecting in the letter part right
clicking and I'm gonna say insert and
it's gonna automatically insert a column
I'm gonna label this one difference from
mean okay and I'm just gonna do my
calculations so here I'm gonna say
equals right because again I'm doing if
I'm doing a formula and asking Excel to
do calculation and it's going to be my
particular value - now here's where I
want to make sure that I actually
reference this cell minus the mean okay
so pretty simple pretty handy now
fortunately Excel is reasonably clever
and I can copy this formula down and
you'll notice that it has shifted and
now it's pointing to a three right so I
don't have to manually type in this
formula for every single cell however if
you look closely you'll notice that
something else has happened so in this
cell I was working with a 2 and s 2
because s 2 is where my mean is but when
I move the formula down one it not only
moved the a column reference down so
that's a 3 it also is doing now - S 3
and that's a problem because s 3 is not
my median value it's my mean value so
this is a great point - this is a great
moment to see how we can better more
closely control our formulas in Excel so
when we're using references as they're
written here they will increment if I
were to paste it over a column it would
move all the references over a column so
it tries to be a little intelligent
about how it handles the formulas you
give it now in this case we want part of
the formula to change but we want part
of it to stay static in other words we
want to make sure that we're always
subtracting the value in s - no matter
what row we're on in our data and the
way that we do that is using a dollar
sign so the dollar sign okay before the
column and the row is going to fix that
part of the formula so I don't actually
need both dollar signs in this case
because I'm only changing rows and
we'll be sufficient for me to put a
dollar sign before the two but I'm gonna
go ahead and do that just to be super
careful and so once I've added this
dollar sign what we'll see is that if I
copy and paste this formula down and I'm
just doing a like a controller Apple see
Apple V there that although it has now
switched to a3 it is still pointing to
s2 it's still pointing to my meat okay
this is a really handy thing in Excel
you're gonna use it a lot so now I copy
it again how you decide to copy it down
is up to you I ever heard of left all
the cell's and hit paste and we'll see
that unsurprisingly obviously as we get
over the halfway mark because my mean
and my median are close together and
this data set this is wouldn't
necessarily be a halfway mark in other
data sets but I go from negative to
positive values so I've done the first
bit of my calculations here so I'm going
to go back to my definition and I'm
gonna work through the next part so it
says okay the deviations of the value
from their average value we just did
that the difference of each value from
the mean and now it says we have to
square them so that's pretty
straightforward
in order to square them I'm just gonna
create once again another column and I'm
gonna call this square of difference
okay and how do I square something again
I'm just gonna use the equal sign and
then I reference the cell and of course
whoops and of course to square something
I just multiply it times itself in this
case I don't need to worry about using
that dollar sign because I'm only
changing rows right I'm gonna paste this
down I'm not referencing something in
another column and row that I need to
hold static right
I'm always multiplying things against
themselves okay not so bad
hopefully right just taking it step by
step and then I'm looking for the
average of those squared deviations well
we certainly know how to do that I can
just do that by let's see I'm gonna come
over here and I'm gonna say average
squared deviations
and it's going to equal what well I'm
just doing the average so it's the mean
right so the way that I do the mean in
this case is I do the sum all right so
I'm gonna do the sum of all my squares
right and my parentheses and of course I
think the song I'm doing I'm just doing
the oh no I'm doing the average so I
take the sum I divide by 20 because I
have 20 records there is the average
squared deviation and if I flip back to
my formula it says the square root I'm
taking the square root of the average of
the square deviation to the okay so all
I have to do is take the square root of
this and that square root is actually
the standard deviation and the formula
in Excel is sq RT I believe okay and I'm
just gonna point it at the cell that has
my squared and there we go the standard
deviation of my dataset is five forty
seven point five two one six right so
what does this mean um in order to
understand what this means it's helpful
to look back at our at our image here
get rid of that so the reason why this
is valuable to us why why the standard
deviation is meaningful is because the
standard deviation has a very special
relationship to the mean and to the
volume of data that we find within
certain distances from the mean within a
certain number of standard deviations of
the mean
when we're dealing with a standard
distribution of data that was a lot of
words the thing to look at here is to
understand that zero standard deviations
from the mean is of course the mean
right and what we're seeing here is that
in a case where we have a standard
distribution we can expect or we will
find that sixty eight point two seven
percent of all of the data values fall
within one standard deviation of the
mean okay so that means if I go if I
take the value of a mean
and then I add whatever the amount of
one standard deviation is and I subtract
whatever the amount of one standard
deviation is all of those data points
sixty-eight point two seven percent of
my data points will fall within those
number between those numbers if I go out
another couple of standard deviations I
get another thirty one and a half
percent or so right so between one and
three standard deviations of the mean
right and we can go both negative and
positive of course because it's
symmetrical between one and three
standard deviations I'll get another
fifteen point seven three percent and
then past three standard deviations from
the mean we're talking about a very very
small fraction of the data is gonna fall
in that range and what that means is
that if we have a normally distributed
data set or a standard distribution in
our data that we if we find something
that falls more than three standard
deviations from the mean or less than
negative three standard deviations would
mean there is something going on there
right there is something very unusual
about that situation again sometimes if
we find this it's just an error right
there was just something wrong with the
data but sometimes it means there's
something more interesting going on so
this is the reason because this distance
from the mean is it has interpretive
value right we want to calculate it and
so this process of saying how many
standard deviations away from the mean
is a particular value is the normal
score or the standard score and so we
calculate this by just using the the
quantity of the standard deviation for a
particular data set as the denominator
and we were saying how many standard
deviations away from the mean is is
record X right so in this case the way
that I would do this is I would insert
once again one more column oops I don't
want to put there kind of wanna put it
over here okay and this is going to be
the I'm gonna call it a z-score z-score
standard score all I have to do is say
look how far is this value from the mean
I've already calculated that and then
how many standard deviations is that
distance and so I do that
when I by dividing it by the standard
deviation and I get this value 1.4 bla
bla bla bla right now in this case again
we're gonna want to use our cell
freezing mechanism of the dollar sign
because again I have a distinct value
for Ichiro I mean have a distinct
standards for for each row but I have to
always be dividing I have to always be
dividing the value by the standard
deviation so I need to lock that
reference okay
and then once again I can just copy and
paste this all the way down okay and
what this reveals to me as I look at
these values is nothing in here seems
quite out of the ordinary if this is a
normally distributed data set right
resonates very close it's like 2.9 for
really really almost three standard
deviations from the mean
so probably and I will note that this
value as well if I compare it to say my
upper bound the 25 77 right it's
actually a little outside of our upper
bound as well so there's a reason for us
to think that perhaps or something
unusual about this trip obviously we
have no idea we can't say at all right
now what that might be what might be
unusual about it but we can say that it
is it is not typical of this data set
right now huge caveats here so many
things about this data set are arbitrary
I just decided to take the first 10,000
rows and then look at the station 72
values within that etc etc etc but but
the process that we've gone through for
this holds in any case right the
mechanics of it will stay the same it is
up to you as the data journalist as the
person conducting the analysis to make
sure that when you select a data set the
frame of that data set has meaning right
so we wouldn't just look at the first
10,000 rows right we might look at all
of the rides on a particular day or we
might look at a particular set of days
or types of days right we might look at
weekends versus weekdays
that is all up to the care and judgement
of you the human and has nothing to do
with the math right so this is a process
that we can apply to any set of data but
it is always on us to make sure that
that the frame that we have chosen for
the data essentially the data set that
we have selected has meaning the last
thing I just want to point out here is
that there are ways to check all of
these things unsurprisingly but again
it's up to us to do the sanity check
right and make sure that things make
sense so for example the z score is
actually a great place to do this
because the if you look at this curve
again what you'll see again is that it's
extremely unlikely to ever get something
that is four standard deviations away
from the mean three is already pretty
unlikely um if you find yourself with
standard scores that are much greater
than four or five or much less than
negative four negative five something is
wrong right so we can never let
ourselves be complacent about applying
this math and say well that's what the
math says right because again it's up to
us to frame the data set in a meaningful
way and it's up to us to check and make
sure that the results of our formulas
are actually sensible this is really the
primary way that you're ever going to
catch errors so with that I look forward
to seeing you all next week and going
over some of this material in class
Expertise:

A tutorial on calculating the standard deviation and z score on a data set

Contributor: Susan McGregor

Video 3 of 3