Calculating Median, Quartiles and Bounds

okay everyone so where we left off last
time we had just finished calculating
our mean which is what I like to
describe is he add them all up and
divide by the number there are now the
median and some of the related values to
the median are a little bit trickier um
not really tricky just deceptively
tricky they seem like it should be very
straightforward and yet somehow it's
always a little bit more complicated
than it seems so the median as you may
recall is just the middle values okay
and in our case we have 20 values so the
question is right you can see right
there 20 rows one column right and we
can also be able to get this because we
have one row that is header and then so
if we go to 21 it's gonna be 21 minus 1
it's in a total number of data rows okay
so how of course do we deal with finding
the middle of an even number of rows
well the way that we do this is we are
gonna take the total number of rows
which is 20 divided by 2 which is 10
okay and then we're going to in this
case we're gonna add one because we have
because we have one header row right so
the 10th row is actually row 11 the 10th
row of data is actually row number 11
and the trick is that the median is
actually between rows 11 and 12 and so
when the way that we're going to
calculate it in this class I'm actually
going to take the average of the of the
two middle rows and treat that as my
median okay now there are multiple ways
to do this but we're going to work with
the average so in my case and the way
that I like to test this is very simple
I just count okay so I've got nine rows
above I've got nine rows below which
means that these are the middle two rows
so in calculating my median here all I'm
gonna do is put a median and now the
important thing to remember here is that
although the median rose for my
calculation purposes our rows are the
10th and 11th rows which for me are rows
11 and 12 right again because I have
that offset from the header row
the actual median value is the average
of the values that appear in those rows
right so my median isn't eleven or
twelve more than ten and eleven right it
is going to be the average of the values
in this case trip duration which is the
value the column that we're looking at
the average of the values that appear in
those rows so in order to do this I get
to use my sum formula again so anytime I
want to do a calculation or a formula I
start with equals I'm gonna do sum over
my left hand parenthesis I'm going to
select my two columns close my right
hand parenthesis right because again
those parenthesis are what I like to
think of is like the wraparound right
they indicate the things that are going
to be ingredients into that function so
my ingredients for the sum function are
just a 11 and a 12 and then I'm just
going to divide it by two okay so pretty
straightforward to get to my mean and my
median values now that's mean and median
those are our measures of central
tendency and you'll notice that they're
actually not wildly far apart here um
there's we'll talk a little bit more
about what that implies in class or
doesn't apply but now we want to move on
to our measures of spread okay now the
measures of spread start to get a little
bit more complicated and actually
because finding my standard deviation
which is the measure of spread that goes
along with the mean requires adding some
columns for me I am going to go ahead
and actually move and you may want to
just start with your calculations off to
the side here I'm going to move my
values off to the side okay um let's
just make sure it's clear yes it is okay
so happily sorted data and now let's
talk about measures of spread so each of
our measures of central tendency has a
corresponding measure of spread and I'm
actually going to start with measures of
spread for median for the median and the
way that I like to talk about this is is
with respect to our standard
distribution okay
standard distribution is kind of this
magical thing in statistics because
there are some things in nature of the
follow standard just
Bhushan this as I'd like to point out is
the infamous curve if you've ever had a
teacher that that has graded on a curve
the idea of the curve is that is that
most people are going to be in the
middle and that you're gonna have a very
few at the very top end and a very few
at the very bottom end and what this
image is showing us actually is the
relationship between the how the
measures of central tendency and spread
relate to one another with respect to a
standard curve now we use the median not
when we have a standard curve but when
we have a skewed distribution but what a
hell does understand is sort of how much
of the data is encompassed within these
various boundaries so the measures that
we're going to be looking for when we're
dealing with the median are these four
sort of kind of five because we have q1
q3 IQR which is so which is actually so
it's quartile one quartile three the
median is actually q2 and this can be
helpful to remember when we're trying to
locate this something called the IQR
which is the interquartile range which
is just the difference between q3 and q1
and then what I like to call the lower
bound and the upper bound okay and
basically the lower bound and the upper
bound for us are going to serve as
measures of extremity things that are
smaller than the lower bound or greater
than the upper bound are things that are
going to be that that are basically
unusual data points okay with respect to
the particular data set that we're
looking at so to calculate these it's
actually pretty straightforward because
just as the median is the middle value
q1 is the middle value in the smaller
half of the data set and q3 is the
middle value of the larger half of the
data set so in this case my smaller 1/2
has nine rows okay so the easiest way to
find this again is I'm going to do 9
divided by 2 which is 4 0.5 I'm gonna
round up okay because it's an odd number
I'm just gonna round up to the fifth row
and then I have to find the fifth row of
my data and so again because I have an
extra row for the header
I am going to it's actually going to be
row number six okay
and so I'm going to highlight this and
again I'm just going to double check one
two three four one two three four now I
have done this I have I've chosen to
exclude the median rows you can also
include them but you need to be explicit
about the fact that you're including
them and you need to be consistent so
quite simply my q1 value in this case
again is going to be the value that
appears in that q1 row in that middle of
the first half row and so for this it's
579 okay and then I'm going to repeat
the same process on the lower half okay
so I have nine rows again it's going to
be now I'm going to start at thirteen or
I'm gonna start after twelve and add
five so I guess that's gonna be so yeah
wait seventeen yes four point five nine
okay yeah divided me to four point five
so five thirteen plus five 12 plus 5
should be row 17 did I divide that wrong
oh yeah because it's after a row so I
actually started I count from 12 and add
five anyway I always get tied up with
this and this is why I say it's actually
deceptively difficult right it's
actually just counting um but for some
reason it's always really tricky and and
so I have provided you all with a
written up explanation of how to
actually do the counting which gets a
little tricky when you are dealing with
larger numbers of rows yeah I guess the
point is is that it is five rows above
row 12 so it's 12 plus 5 17 and I can
confirm that because there's four rows
so my q3 value here is gonna be one
three one four okay so four okay now my
IQR interquartile range
it's just a complex sounding name for
the difference between q3 and q1 so I'm
just going to subtract and again when I
want Excel to do math I start with the
equals sign I could actually put a
parenthesis here I don't have to because
I'm just doing the simple subtraction
but I can say equals and again I can
just place my cursor there my mouse
there are five minus R for now an
important point that I want to make here
is that you'll notice that I where
possible I'm doing all of these things
with cell references right so I'm not
just writing in the mean as ninety six
point nine sixty four point eight five
or ninety seven six nine seventy six
point five etc etc right I'm actually
referencing particular cells and the
reason why I'm doing that is because
inevitably at some point and certainly
in the course of your work in data
journalism possibly in the course of
doing these assignments themselves um
you're gonna forget something right
something's gonna change that these new
wrong something's need accidentally
overwritten and what you want is to make
sure that when you correct it everything
updates automatically okay and so it
also I would also point out that these
in fact for q3 I shouldn't even do this
I shouldn't typed it in because it's
also everything we do we want to kind of
try to minimize typos and things like
that so I'm actually now gonna use the
cell reference and what's useful about
this is not only that it minimizes the
risk that I'm gonna introduce an error
of an error for a typo it also it also
means that the formula itself provides
evidence of what was done right it's
clear not just that I picked five
seventy nine out of thin air but that
I'm actually referring to row six right
because when I put my cursor over this I
can see in the function bar that it's
referencing row six that this one is
referencing row 17 that this is
referencing the difference so this is
actually a really important part not
just in making sure that your work is
resilient to errors but also it provides
another piece of another type of
documentation that shows where these
numbers are coming from right so you're
all gonna have beautiful and detailed
and readable documentation files anyway
but this will provide an extra layer of
detail that will be useful okay so
finally I want to
calculate my lower bound okay and my
upper bound and this is again done quite
simply if we go back to our image here
you can see that it is q3 plus 1.5 times
IQR is the upper bound the lower bound
is q1 minus 1.5 times the IQR okay so I
can do this all directly in a single
cell so the lower bound is going to be
what it is going to first of all start
with q1 okay and because the lower bound
is going to be less than q1 I am going
to be subtracting okay and then what's
the value that I'm subtracting it's
going to be one point five times the IQR
okay and I get nine negative five twenty
three point five and for the upper bound
I'm going to start with q3 and I'm going
to add because again the upper bound is
larger than q3 in this case one point
five times the IQR now this is the case
we're using the parentheses in my math
is important if any of you remember the
order of operations you may have studied
in arithmetic or algebra we want to make
sure that we first multiply the IQR
times 1.5 before we added or subtracted
from anything and that's what those
parentheses help us do okay so now we
have our upper bound and our lower bound
now one thing you may be asking yourself
is well how can we have a lower bound
that is negative the answer is
statistics don't care whether their
results are sensible and what that means
is that which is just say that the math
doesn't know that we're talking about
time right if we were talking about
temperature a negative value will be
perfectly reasonable and acceptable
right so all this is doing is is kind of
giving you a range of values that from
the statistical perspective might be
considered unusual or the limits of what
is usual or expected that doesn't mean
that they're gonna make any sense in the
real world right so our lower bound of
course we're not going to end up with a
ride that's negative 500 seconds upper
bound M
220 400 will be kind of interesting to
see if we have any values that are
actually larger than that and in fact we
do we have one
we have one ride that that is much
longer than the other rides in this data
set that's sort of unexpectedly long and
that is essentially that is essentially
the biggest part of what we're looking
for right what we want to do with these
measures of central tendency and spread
is understand when something is sort of
kind worth investigating right we're
finding a little bit more about there
are a lot of reasons why we might get
values that are outside what we expect
errors and the data being one key reason
but it does give us as I said a place to
start to say huh
that seems that that that is
meaningfully different from the other
values let's see what's going on there
ok so when we come back we are going to
look at calculating the standard
deviation which is the measure of
quote-unquote spread for that goes with
the mean so we've looked at all of the
measures of spread that go along with
the median we'll look at the standard
deviation which goes along with the mean
and we'll also look at something called
the Z index or the normal score which is
a way of kind of quantifying how far
away from the mean something is kind of
ranking if you will each of our values
by how far away from the mean it is so
we will be back in just a minute
Expertise:

A tutorial on calculating the median, quartiles, upper and lower bounds on a data set in Excel

Contributor: Susan McGregor

Video 2 of 3