Locating an Arithmetic Mean
Locating Data and Finding Mean
Transcript
hi everyone so this week we are going to
start talking about some of the basic
statistical tools that we're going to
use to give context to our data so as
we've already touched on briefly in
class the really important thing that we
want to be able to do with data is to
give it context because context is how
we derive and communicate meeting an
amount of data and in order to do that
some of the basic statistical tools that
we need are what we call measures of
spread and central tendency and we
already touched on these briefly
measures of central tendency are mean
and median our measures of spread are
going to be are going to be the standard
deviation as well as something called
the upper and lower bounds and the IQR
those last couple actually apply in
cases where the median is the more
appropriate measure of central tendency
in any case we're going to actually go
through the mechanics of how we identify
or rather calculate these various
measures and then use them in a
journalistic context to do that of
course we need to have some data and
work with and so for the purposes of
this exercise we are going to use
citywide data citibike data is readily
available as you can see it's provided
in a couple of nice formats the one
we're going to be working with for the
next couple of weeks is a CSV based file
so this is sort of historical data
looking at as we can infer from the file
names in this case monthly data so the
data is released monthly with probably
some bit of a delay so my guess is that
when we go down to look for most recent
data did do 2017 there we go
aha yeah so we're gonna see September is
the most recently available data which
is fine I'm just gonna go ahead and
download this and throw it on the
desktop now one thing that I want to
point out here is there's actually a
second kind of set of sitting bike trip
data if we were to do some sleuthing we
would discover that that is actually
city bike data about Jersey City city
bike is now available in Jersey City New
Jersey but when I first encountered this
a few weeks ago I actually didn't know I
couldn't tell it wasn't obvious to me
right just from the file name what the
difference was between these two data
sets now one clue actually lies in
he lies in the file size right so if I
hadn't guessed that JC was Jersey City
they'd have the one thing that I do know
about it right off the bat is that it's
much much much much much much much much
smaller file than the traditional city
bike data and in fact what I ended up
doing was looking at some of the
locations the latitudes and longitudes
of the stations in determining they were
in fact in Jersey City which again
correlated with my hypothesis that this
was not the regular city bike data the
point of that is mostly just to say that
if you look at the very sparse amount of
data that we're given here just a file
name that in almost every case where
we're handling structured data we have
to do additional reporting in order to
understand actually what those files
contain in the case of city bike data
they're really helpful they actually
give us a meaningful description of not
just the the columns that we're going to
find in this structured data but they
also give us things like units which is
really helpful and so in distinguishing
characteristics right customers a
24-hour or three-day pass user
subscribers' an annual member gender
zero known one male to female year or
things like that so very often however
when we get this date when we get
structured data that information is not
gonna be ready readily available and
we're gonna have to do reporting in
order to find out so the first thing
that I'm gonna do I am using that I am
using using a PC so your process is
gonna look a little bit different I'm
gonna have to manually extract this and
once I have that then I think I'm gonna
do actually as I am going to open this
with notepad now if you're on a Mac
which most of you will be you're gonna
open this in TextEdit instead of notepad
but they do equivalent things on the
different operating systems and the
reason why I'm doing this in notepad is
because I know that working with notepad
isn't going to crash my computer okay so
if we look at the file size the file
size is pretty substantial it's over 300
megabytes right here this is the kind of
thing that if I just double clicked it
so that it opened in Excel might very
well sell out my computer there are a
lot of rows of data here and one of the
big things that we're gonna be focusing
on
um during this course is how do we deal
with progressively larger and Messier
sets of data because very often that's a
big part of dealing with data is kind of
cleaning it up okay so what I can see
now handily is the fact that these the
columns right I see this is a CSV you
separate it by commas that these are
separated by commas and that they seem
to line up with the promised columns
from the metadata provided on the
citibike web page so I see trip duration
start time stop time start ID cetera et
cetera et cetera latitude longitude and
bike ID just kind of interesting you can
see where a particular bike went these
are type birth year gender all the
things that we were promised okay so
this is great but in order to deal with
it in the background here what I'm first
gonna do is a little bit of magnet magic
so that I'm dealing with a slightly
smaller data set and then I'm gonna
demonstrate the kinds of analyses that
we would do and how we would do them in
Microsoft Excel because particularly for
our measures of spread and central
tendency that's gonna probably be at the
outset the the tool that we want to use
first so here we go okay so in the
background I have done a little bit of
manipulation this isn't really magic as
I said because you're gonna because it's
not and because you're going to learn
how to do it yourselves next week but
for the meantime I have arbitrarily
decided to take the first 10,000 rows of
data and now I am going to open that in
Microsoft Excel
so my double clicking is gonna open this
in Excel and you can see that my
computer is able to handle this pretty
easily depending on the machine you're
on it might be more or less quick but
this is a good place to introduce some
of the basics of Excel so for those of
you who are familiar just hang on tight
for those of you aren't here's a little
bit of an intro so one thing that we see
right away is that there is a column
that is full of hash tags hash signs or
pound signs
that's just Microsoft Excel is way of
telling you that there's too much data
for it to display and the width of the
column so we can actually solve that
quite quickly just by resizing the
column so you can get a slightly better
idea of what is what the content is
there and you'll
notice as I scroll across here that I am
going that it's showing me the value
that's contained in that data in this
function bar now the function bar is
going to be a place where we do a fair
amount of work and it's going to help us
see the difference between a calculation
that's being executed and the resulting
value so that'll make a little bit more
sense when we start doing functions in
just a second but basically you know
again I can kind of the sort of typical
way that we interact with Excel is
through kind of scrolling down I can
scroll all the way down to my very last
row here which I will point out is
labeled 10,001 and that is of course
because I have a header row in my data
and that header row means that
understanding the count understanding
the count in my dataset is always going
to be offset by one as long as I have
that header row so keep that in your
back your head because it's going to be
meaningful in just a moment now in order
to make any kinds of so what we're going
to be looking at here is we're gonna be
interpreting we're gonna be looking for
measure the spread in central tendency
along the trip duration parameter right
and why is this this is how long
people's rides are and so maybe there's
a pattern to that maybe there's not
we'll talk and we'll talk a little bit
more detail in class about how we derive
shapes the shapes of the data from this
but for now we're just gonna kind of do
the math and then we'll talk about the
implications so the first thing that I
want to do however is it I want to have
some uniformity I do want to organize my
data in some way and in this case it
makes the most sense for me to organize
it by the start station ID right I want
to compare the duration of different
trips taken from the same place right
leaving from the same location because
you know otherwise you know I might be
comparing things that are very remote to
things that are very close to others and
it would be hard to say anything
substantive about that remember that the
whole objective of this is to be able to
contextualize and say something
meaningful about this data so the first
thing that I'm going to do is I'm
actually going to highlight all
my rows here and just to clarify the way
that I did this there's a few different
ways to do this this is my preferred way
which is that I like to highlight the
actual letter now again it's gonna look
a little bit different for you all
because you're on Macs probably but
basically I highlight this and then
under the date of you
I can choose sort and when I choose sort
it's gonna say sort by and I'm gonna
choose chips or a trip duration so
because I have that header row it's
gonna actually give me the name of the
header right instead of column a column
B column C it'll say trip duration
sorting on values smallest to largest so
all this is gonna do oh sorry I actually
don't want to sort of trip duration that
was wrong I'm sorry start station ID and
I could say smallest largest because who
cares and you can see right away that
the lowest start station ID in this is
start station 72 and I'm just gonna
scroll down to see how many I have and I
have about 20 entries which is fine
again this is for the purposes of the
exercise all I'm doing is is I'm just
getting a sample that I can work with
now again for the purposes of this
exercise one of the first things that
I'm gonna do is I am going to select all
those rows and so again the way that I
did that this is another way that I like
to do selections so if I'm selecting if
I put my cursor in the where the number
is the labeling of the number of the row
I can actually also just scroll down to
the last row that contains the value I'm
interested in and if I press the shift
key and then click it'll automatically
select everything in between you can
also hold down shift while you use the
arrow keys there's a bunch of different
ways to do it each of you will have your
preference that's something to kind of
practice play around with a little bit
so now I'm going to go ahead and copy
this I'm just going to do an Apple C and
if you see down here I'm gonna say plus
and I'm gonna do an Apple V to paste it
okay now this is important because as
much as possible I want to I always want
to retain a reference to my original
data set it's we're always going to end
up making edits to our data as we do
analyses or most of the time we're gonna
end up making edits to our data but it's
good to keep a frame of reference
just for where it came from right this
is the history and one piece that I
haven't been doing
here that of course you all are going to
need to do is the documentation file so
for me that would just be a notepad or
WordPad document where I would just
literally start by writing down where my
data came from so I would probably start
it with you know download data from and
I would actually paste in the URL and I
would say something like browse and
notepad open in Microsoft Excel alright
and so on and so forth but this is
something that I want to be doing in
real time so I want to be doing right
alongside the steps that I'm executing
so it's important now I'm going to real
able this sheet one and I'm just gonna
call this I'm gonna call this station 72
because that's the station ID that I
have copied here and I'm now gonna do a
file and I'm gonna do a save as ok and
the reason why I'm doing a save as here
is because a CSV again is just a text
file and it cannot save multiple sheets
and it can say formatting and it cannot
save formulas and we are gonna want to
save all of those things so all I have
to do is go up here to Microsoft - Excel
workbook and that will take care of that
so now I have my original data in one
tab I have my edited data in a second
tab and I can continue to do whatever
work on this that I need to the next
step that I'm going to do is I am going
to start looking at my measures of
central tendency
right and happily these are pretty
straightforward to do so the first one
the first thing that I'm going to do
just to kind of get ahead of it here is
I'm going to sort once again so I'm
gonna I'm already on the data tab and
then a quick sort and I'm gonna say sort
by trip duration okay so in this case in
order to so in order to find the median
hopefully you'll remember from your
reading you have to first sort the
values that you're finding the median of
so I'm gonna say sort smallest to
largest and I can see here that my
smallest is 177 my largest is 25 77 okay
and from here I can find both the mean
the median now the mean is pretty easy
I'm gonna label this here okay and in
this cell is where I'm gonna actually
calculate the mean now when I want to do
some math in Excel okay anytime I want
it to use a formula I always signal that
by beginning with an equals sign okay
and in this case all I have to do is say
some okay which is the name of the
formula and then I give it the
ingredients that I want it to do that
formula on in this case the ingredients
as I like to call them also called
arguments are just the trip duration
cells and I can select them just as you
saw actually with my mouse okay and then
I have to close the round parentheses so
notice that I wrote equals sum and then
I opened around for open two left round
parentheses and now I've selected all my
my cells and I do have to come up here
and close it and when I hit enter it's
going to give me the sum okay now notice
that I see the formula I see the formula
in the function bar but I see the result
in in the south now of course the mean
is add them all up and divide by the
number and since I have 20 rows I'm
gonna divide by 20 and there I have my
mean the next thing we're gonna work on
is the median the median is pretty
simple you basically just have to find
the midpoint of the data set however we
have an even number of rows here
so when we come back we're going to talk
about how to deal with finding the
median when you have an even number of
rows see you in a moment
start talking about some of the basic
statistical tools that we're going to
use to give context to our data so as
we've already touched on briefly in
class the really important thing that we
want to be able to do with data is to
give it context because context is how
we derive and communicate meeting an
amount of data and in order to do that
some of the basic statistical tools that
we need are what we call measures of
spread and central tendency and we
already touched on these briefly
measures of central tendency are mean
and median our measures of spread are
going to be are going to be the standard
deviation as well as something called
the upper and lower bounds and the IQR
those last couple actually apply in
cases where the median is the more
appropriate measure of central tendency
in any case we're going to actually go
through the mechanics of how we identify
or rather calculate these various
measures and then use them in a
journalistic context to do that of
course we need to have some data and
work with and so for the purposes of
this exercise we are going to use
citywide data citibike data is readily
available as you can see it's provided
in a couple of nice formats the one
we're going to be working with for the
next couple of weeks is a CSV based file
so this is sort of historical data
looking at as we can infer from the file
names in this case monthly data so the
data is released monthly with probably
some bit of a delay so my guess is that
when we go down to look for most recent
data did do 2017 there we go
aha yeah so we're gonna see September is
the most recently available data which
is fine I'm just gonna go ahead and
download this and throw it on the
desktop now one thing that I want to
point out here is there's actually a
second kind of set of sitting bike trip
data if we were to do some sleuthing we
would discover that that is actually
city bike data about Jersey City city
bike is now available in Jersey City New
Jersey but when I first encountered this
a few weeks ago I actually didn't know I
couldn't tell it wasn't obvious to me
right just from the file name what the
difference was between these two data
sets now one clue actually lies in
he lies in the file size right so if I
hadn't guessed that JC was Jersey City
they'd have the one thing that I do know
about it right off the bat is that it's
much much much much much much much much
smaller file than the traditional city
bike data and in fact what I ended up
doing was looking at some of the
locations the latitudes and longitudes
of the stations in determining they were
in fact in Jersey City which again
correlated with my hypothesis that this
was not the regular city bike data the
point of that is mostly just to say that
if you look at the very sparse amount of
data that we're given here just a file
name that in almost every case where
we're handling structured data we have
to do additional reporting in order to
understand actually what those files
contain in the case of city bike data
they're really helpful they actually
give us a meaningful description of not
just the the columns that we're going to
find in this structured data but they
also give us things like units which is
really helpful and so in distinguishing
characteristics right customers a
24-hour or three-day pass user
subscribers' an annual member gender
zero known one male to female year or
things like that so very often however
when we get this date when we get
structured data that information is not
gonna be ready readily available and
we're gonna have to do reporting in
order to find out so the first thing
that I'm gonna do I am using that I am
using using a PC so your process is
gonna look a little bit different I'm
gonna have to manually extract this and
once I have that then I think I'm gonna
do actually as I am going to open this
with notepad now if you're on a Mac
which most of you will be you're gonna
open this in TextEdit instead of notepad
but they do equivalent things on the
different operating systems and the
reason why I'm doing this in notepad is
because I know that working with notepad
isn't going to crash my computer okay so
if we look at the file size the file
size is pretty substantial it's over 300
megabytes right here this is the kind of
thing that if I just double clicked it
so that it opened in Excel might very
well sell out my computer there are a
lot of rows of data here and one of the
big things that we're gonna be focusing
on
um during this course is how do we deal
with progressively larger and Messier
sets of data because very often that's a
big part of dealing with data is kind of
cleaning it up okay so what I can see
now handily is the fact that these the
columns right I see this is a CSV you
separate it by commas that these are
separated by commas and that they seem
to line up with the promised columns
from the metadata provided on the
citibike web page so I see trip duration
start time stop time start ID cetera et
cetera et cetera latitude longitude and
bike ID just kind of interesting you can
see where a particular bike went these
are type birth year gender all the
things that we were promised okay so
this is great but in order to deal with
it in the background here what I'm first
gonna do is a little bit of magnet magic
so that I'm dealing with a slightly
smaller data set and then I'm gonna
demonstrate the kinds of analyses that
we would do and how we would do them in
Microsoft Excel because particularly for
our measures of spread and central
tendency that's gonna probably be at the
outset the the tool that we want to use
first so here we go okay so in the
background I have done a little bit of
manipulation this isn't really magic as
I said because you're gonna because it's
not and because you're going to learn
how to do it yourselves next week but
for the meantime I have arbitrarily
decided to take the first 10,000 rows of
data and now I am going to open that in
Microsoft Excel
so my double clicking is gonna open this
in Excel and you can see that my
computer is able to handle this pretty
easily depending on the machine you're
on it might be more or less quick but
this is a good place to introduce some
of the basics of Excel so for those of
you who are familiar just hang on tight
for those of you aren't here's a little
bit of an intro so one thing that we see
right away is that there is a column
that is full of hash tags hash signs or
pound signs
that's just Microsoft Excel is way of
telling you that there's too much data
for it to display and the width of the
column so we can actually solve that
quite quickly just by resizing the
column so you can get a slightly better
idea of what is what the content is
there and you'll
notice as I scroll across here that I am
going that it's showing me the value
that's contained in that data in this
function bar now the function bar is
going to be a place where we do a fair
amount of work and it's going to help us
see the difference between a calculation
that's being executed and the resulting
value so that'll make a little bit more
sense when we start doing functions in
just a second but basically you know
again I can kind of the sort of typical
way that we interact with Excel is
through kind of scrolling down I can
scroll all the way down to my very last
row here which I will point out is
labeled 10,001 and that is of course
because I have a header row in my data
and that header row means that
understanding the count understanding
the count in my dataset is always going
to be offset by one as long as I have
that header row so keep that in your
back your head because it's going to be
meaningful in just a moment now in order
to make any kinds of so what we're going
to be looking at here is we're gonna be
interpreting we're gonna be looking for
measure the spread in central tendency
along the trip duration parameter right
and why is this this is how long
people's rides are and so maybe there's
a pattern to that maybe there's not
we'll talk and we'll talk a little bit
more detail in class about how we derive
shapes the shapes of the data from this
but for now we're just gonna kind of do
the math and then we'll talk about the
implications so the first thing that I
want to do however is it I want to have
some uniformity I do want to organize my
data in some way and in this case it
makes the most sense for me to organize
it by the start station ID right I want
to compare the duration of different
trips taken from the same place right
leaving from the same location because
you know otherwise you know I might be
comparing things that are very remote to
things that are very close to others and
it would be hard to say anything
substantive about that remember that the
whole objective of this is to be able to
contextualize and say something
meaningful about this data so the first
thing that I'm going to do is I'm
actually going to highlight all
my rows here and just to clarify the way
that I did this there's a few different
ways to do this this is my preferred way
which is that I like to highlight the
actual letter now again it's gonna look
a little bit different for you all
because you're on Macs probably but
basically I highlight this and then
under the date of you
I can choose sort and when I choose sort
it's gonna say sort by and I'm gonna
choose chips or a trip duration so
because I have that header row it's
gonna actually give me the name of the
header right instead of column a column
B column C it'll say trip duration
sorting on values smallest to largest so
all this is gonna do oh sorry I actually
don't want to sort of trip duration that
was wrong I'm sorry start station ID and
I could say smallest largest because who
cares and you can see right away that
the lowest start station ID in this is
start station 72 and I'm just gonna
scroll down to see how many I have and I
have about 20 entries which is fine
again this is for the purposes of the
exercise all I'm doing is is I'm just
getting a sample that I can work with
now again for the purposes of this
exercise one of the first things that
I'm gonna do is I am going to select all
those rows and so again the way that I
did that this is another way that I like
to do selections so if I'm selecting if
I put my cursor in the where the number
is the labeling of the number of the row
I can actually also just scroll down to
the last row that contains the value I'm
interested in and if I press the shift
key and then click it'll automatically
select everything in between you can
also hold down shift while you use the
arrow keys there's a bunch of different
ways to do it each of you will have your
preference that's something to kind of
practice play around with a little bit
so now I'm going to go ahead and copy
this I'm just going to do an Apple C and
if you see down here I'm gonna say plus
and I'm gonna do an Apple V to paste it
okay now this is important because as
much as possible I want to I always want
to retain a reference to my original
data set it's we're always going to end
up making edits to our data as we do
analyses or most of the time we're gonna
end up making edits to our data but it's
good to keep a frame of reference
just for where it came from right this
is the history and one piece that I
haven't been doing
here that of course you all are going to
need to do is the documentation file so
for me that would just be a notepad or
WordPad document where I would just
literally start by writing down where my
data came from so I would probably start
it with you know download data from and
I would actually paste in the URL and I
would say something like browse and
notepad open in Microsoft Excel alright
and so on and so forth but this is
something that I want to be doing in
real time so I want to be doing right
alongside the steps that I'm executing
so it's important now I'm going to real
able this sheet one and I'm just gonna
call this I'm gonna call this station 72
because that's the station ID that I
have copied here and I'm now gonna do a
file and I'm gonna do a save as ok and
the reason why I'm doing a save as here
is because a CSV again is just a text
file and it cannot save multiple sheets
and it can say formatting and it cannot
save formulas and we are gonna want to
save all of those things so all I have
to do is go up here to Microsoft - Excel
workbook and that will take care of that
so now I have my original data in one
tab I have my edited data in a second
tab and I can continue to do whatever
work on this that I need to the next
step that I'm going to do is I am going
to start looking at my measures of
central tendency
right and happily these are pretty
straightforward to do so the first one
the first thing that I'm going to do
just to kind of get ahead of it here is
I'm going to sort once again so I'm
gonna I'm already on the data tab and
then a quick sort and I'm gonna say sort
by trip duration okay so in this case in
order to so in order to find the median
hopefully you'll remember from your
reading you have to first sort the
values that you're finding the median of
so I'm gonna say sort smallest to
largest and I can see here that my
smallest is 177 my largest is 25 77 okay
and from here I can find both the mean
the median now the mean is pretty easy
I'm gonna label this here okay and in
this cell is where I'm gonna actually
calculate the mean now when I want to do
some math in Excel okay anytime I want
it to use a formula I always signal that
by beginning with an equals sign okay
and in this case all I have to do is say
some okay which is the name of the
formula and then I give it the
ingredients that I want it to do that
formula on in this case the ingredients
as I like to call them also called
arguments are just the trip duration
cells and I can select them just as you
saw actually with my mouse okay and then
I have to close the round parentheses so
notice that I wrote equals sum and then
I opened around for open two left round
parentheses and now I've selected all my
my cells and I do have to come up here
and close it and when I hit enter it's
going to give me the sum okay now notice
that I see the formula I see the formula
in the function bar but I see the result
in in the south now of course the mean
is add them all up and divide by the
number and since I have 20 rows I'm
gonna divide by 20 and there I have my
mean the next thing we're gonna work on
is the median the median is pretty
simple you basically just have to find
the midpoint of the data set however we
have an even number of rows here
so when we come back we're going to talk
about how to deal with finding the
median when you have an even number of
rows see you in a moment