Randomness and Significance
Frontiers of Computational Journalism week 7 - Randomness and Statistical Significance
Transcript
okay so special Halloween edition of the
course today we are going to talk about
statistical significance and related
issues today I think you've all heard of
a p-value yeah okay how many of you
could explain what a p-value is okay
we've got one okay we're gonna beat it
to death and the reason we're gonna beat
it to death is not so much because I
think you're gonna do a lot of
statistical significance calculations
although you're gonna do one in the
homework but because you're gonna read a
lot of research that uses statistical
significance calculations and you need
to know what that means we used to have
an assignment in this course that was:
read this paper and evaluate it. that was
a super-realistic assignment because
you're gonna do a lot of that.
so this is actually like
a sort of stats 101 lecture in many ways
I'm gonna try to introduce you to a
different way of doing statistics which
is via randomization which I think is
not only easier to do if you come from a
computer science background because it's
based on programming and simulation but
also I think it hangs together better
conceptually and yeah so here we go
the the first thing we need to talk
about is randomness then I'm going to
show you a bunch of examples of
significance testing so it actually
happens in journalism and you can use it
to detect unusual things that could be
indicators of fraud and there's at least
three good examples of that we're going
to talk about p-values we're going to
talk about Bayesian statistics and what
the difference is between frequentist
and Bayesian approaches and then we're going to go
into this this idea of p-value hacking
and the replicability crisis and what's
been called the garden of forking paths
which is a phrase that means that if you
get to choose your data analysis method
you can get
different outcomes and we're building up
to a sort of general theory of data
interpretation and I'm going to show you
I'm gonna view it through a particular
lens which was actually developed in
intelligence analysis called the
analysis of competing hypotheses which
is sort of a modern incarnation of an
extremely old argument about how to
decide what is true so I really like
this class I think this is like
epistemological fundamentals I think
this is basic stuff and I don't know it
feels hard-fought right this is the
lecture I wish I had when I started
studying this stuff twenty years ago
okay so we need to talk about randomness
randomness is a bit of a slippery
concept this is one of my favorite
pieces around this idea and what this is
is an analysis, in the New York Times, of the error in
the monthly job figures.
so every month the number of
jobs created for that month gets
released it's like you know 180,000 or
something however the margin of error is
more than a hundred thousand so let me
show you what this looks like this was a
one of my favorite pieces which is
basically making the argument that we
almost always misinterpret this so the
way they did this is very clever
they basically assumes that the job
growth was flat for the whole year and
then added in statistical noise to show
you all of the different types of
patterns that you might see even though
the job growth rate was actually steady
so where's the pause button?
oh interesting, it looks like this is no
longer interactive and they've just
replaced it with animated GIFs.
interesting, there used to be a button
where you could stop it
but if you look at this for a second you
can imagine if you wait long enough you
will see almost any trend right so
imagine yourself writing a story about
interpreting this trend oh it fell later
in the year
oh it peaked here early in the spring oh
it increased at the end of the year oh
we got a peak and then it fell off in
December right you can imagine almost
any story available just due to noise or
in this example here it was actually
increasing but if you wait for a moment
there that's more of a flat trend okay
so this is one of the basic statistical
problems we need to solve in data
interpretation which is if there is
randomness or noise in our data then we
need to know how much and we need to
know whether we're likely to be misled
by it so almost every article that is
written about the jobs rate you know
every article that said you
know it fell by 50,000 and analysts sold
off stocks as a result almost all of
them are probably wrong because you're
well inside the noise and by the way the
noise here is just
sampling error. so what you're actually
doing when you compute the change in the
job growth numbers is you take the
result of the monthly Employment Survey
which is well there's actually two
there's a household survey and there's a
company survey I forget which one they
use for this I think it's the household
survey so that's a big monthly random
sample and so then you get some number
you know you say ok the labor force is
120 million or something and then next
month it's 120.1 million and you
subtract the two and that's the job
growth rate
the reason the noise is so large is
because you take this huge random sample
and you get you know maybe a 1 percent
margin of error well 1 percent on a 100
million is a million so actually you can
see that they do a lot better than 1
percent. so what you're seeing is the
noise that results from subtracting
two very large numbers.
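Here's a toy simulation of that point, a sketch with made-up numbers rather than the real survey design, just to show how subtracting two noisy estimates inflates the noise:

```python
import random

# Toy illustration (numbers are made up): suppose true employment is flat at
# 120 million and each month's survey estimate has sampling noise with a
# standard deviation of 50,000 jobs. The reported monthly "job growth" is the
# difference of two noisy estimates, so its noise is even larger.
true_level = 120_000_000
survey_sd = 50_000

changes = []
for _ in range(10_000):
    this_month = random.gauss(true_level, survey_sd)
    next_month = random.gauss(true_level, survey_sd)
    changes.append(next_month - this_month)

# Even though true job growth is zero here, the reported change is routinely
# tens of thousands of jobs in either direction, purely from sampling noise.
median_abs_change = sorted(abs(c) for c in changes)[len(changes) // 2]
print(median_abs_change)
```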
which reminds me of one of my very favorite pieces of
financial journalism, the Bank of America one
by Matt Levine. see if I can find it. yeah
Bank of America made 168 million last
quarter more or less so this is Matt
Levine at Bloomberg, who is a fantastic
financial journalist, and the point
he's making here is that
168 million is a rounding error on their
income depending on the accounting
standards they use depending on you know
whether they're counting certain things
this year or next year I mean you can't
even tell if they were profitable or not
right and this is not even statistical
error this is just the rules about how
to count this are very complicated
and the reason this is
complicated is because you're looking at
very small percentages of very large
numbers, so it's the same situation as
the jobs report
yeah, thank you. because earnings are a
rounding error so when you're thinking
about two large numbers and comparing
two large numbers the difference is
going to be incredibly noisy because
that's what you're doing here you're
comparing this year's earnings to last
year's earnings and because there's so
much variance just in getting this
number the difference is not that
interesting
okay so that's the idea of variation
we're going to talk a little bit more
about randomness so if I show you these
two pictures which one do you think has
a random placement of the stars how many
for the left okay how many for the right
now all of you on the left good I feel
like my students are getting smarter
over the years okay
in fact the one on the right is less
random in some sense because it's broken
down into a grid. so we might think of
randomness as sort of unpatterned
but randomness actually has a lot of
pattern in it right think of think of
all of the different ways that you can
have 12 months of job growth most of
those ways show some sort of pattern so
here's another quiz. here is some
data, and eight of them are randomly
generated values and one of them shows
an actual statistical trend. which one do
you think is the real one? any guesses?
top center? why? okay
any other guesses bottom-right it's a
nice decreasing trend yeah actually
they're all random data right so the
point I'm trying to make here is that if
you're looking for a pattern in noise
you will find it so whether you see a
pattern or not is not the interesting
thing right that's not that's not the
criterion we're going by if we're going
to do inference from data another
principle I'm gonna try to get across is
that when you have less data the problem
is more severe so here's the same thing
again I've just generated a bunch of
random points and drawn regression lines
through them notice that compared to
when we have less data the lines here
are a bit flatter and this is a basic
statistical principle which is that if
you're if what you're looking at is
noise the more of it you have the easier
it is to tell that it's noise right when
you get small samples it's much easier
to see a pattern we get very small
samples all the time right so let's say
we're talking about the number of
homicides in Chicago and whether it's
going up or down so you do some story
and you're like, well, look at the last
five years of data, it's a lot higher
now than it was five
years ago. but it's only five data points, and if
there's any sort of random variation in
that it's very likely that it's going to
completely swamp the actual data this is
part of the problem with trying to do
these sorts of comparisons is you have
very little data so here's so this again
this is like stats 101 stuff and I
expect you've all seen some version of
this right but here are sort of the
basic issues that we're going to have to
try to
deal with and this is why statistical
significance was invented: so that we try
not to fool ourselves by seeing patterns
in randomness and in particular this is
a bigger deal when you have less data
the easiest way to start talking about
statistical significance is going back
to this classic type of statistical
question and this has been a you know
people have been talking about this for
hundreds of years here we have a
histogram of die rolls which I just
built in R using the code at the
bottom, and we roll the die 60 times so
we should get 10 for each, but we get, it
looks like, 13 or 14 twos. is this die weighted towards a two? what
do you think? no? okay, is there a
principled way to answer the question?
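One principled way is simulation: pretend the die is fair, roll it 60 times over and over, and count how often you see a result as lopsided as the one on the screen. A minimal sketch (not the exact code from the slide):

```python
import random

def count_face(n_rolls=60, face=2):
    """Roll a fair die n_rolls times; count how many times `face` comes up."""
    return sum(1 for _ in range(n_rolls) if random.randint(1, 6) == face)

# How often does a *fair* die give 13 or more twos in 60 rolls?
trials = 10_000
extreme = sum(1 for _ in range(trials) if count_face() >= 13)
print(extreme / trials)  # comes out around 0.2, so 13 twos is not unusual at all
```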
so there's a lot of different versions
of this question one of the most famous
ones which was proposed by a
statistician named Fisher it's called
the the lady tasting tea problem and the
problem is this so if any brits among us
know okay so apparently Brits feel very
strongly about the order in which you
put the tea in the milk and I can't
remember if it's milk before tea or tea
before milk at this point but let's say
you prepare a bunch of cups of tea half
of which have tea before milk and half
of which have milk before tea and then
you bring in you know a an aristocratic
woman and you say ah please taste this
tea and I'm gonna give you either all
milk before tea or all tea before milk
and I want you to tell me which order
you think each each cup is and the
question is how many times does she have
to, sort of, sorry, you're not gonna give her
all of one kind, you're gonna give her each cup,
each cup is going to be one or the other, right,
you're gonna give her randomly each cup,
either milk before tea or tea before milk, and
she has to guess which one it is how
many times does she have to be right
before you start to think oh yeah she
really can tell the difference this
really does make a difference because
you know she can get three right guesses
one out of every eight times just by chance right what
if she has three right guesses and two
wrong guesses how often does that happen
so it's the same type of problem
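Again you can answer that kind of "how often by chance" question with a couple of lines of simulation. A sketch, with the cup counts here just for illustration:

```python
import random

def chance_of_guessing(n_cups, n_right_needed, trials=100_000):
    """If she's purely guessing on each cup, how often does she get at least
    n_right_needed of n_cups right?"""
    hits = 0
    for _ in range(trials):
        right = sum(random.random() < 0.5 for _ in range(n_cups))
        if right >= n_right_needed:
            hits += 1
    return hits / trials

print(chance_of_guessing(3, 3))  # about 1 in 8
print(chance_of_guessing(5, 3))  # 3 or more right out of 5: about half the time
```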
so in order to answer these types of
questions oh yeah so here's another one
how about this one do you think this is
a loaded die or do you think this is a
fair die
okay you still say it's fair okay so you
know I know the answer because I wrote
the code so I either did 60 die rolls
and just kept doing it until I saw this
or I made it overweight the ones and did
it once, right. so which one did I do? yeah,
I mean, I made this. so would it take
hours to get this? how many
rolls would it take to get that,
or how many sets of rolls, because each
of these is actually sixty rolls. okay
so clearly the answer depends on not
only the the difference that I observe
but the how often I would get that
difference by chance all right so we're
building up to sort of the basic
statistical framework behind
significance testing and in fact all
kinds of statistical work here now I
roll the dice 60,000 times okay and I
get a very small difference I'm looking
at what it's still about a 10%
difference so the difference is about
the same as this one right it's not much
you know it's a little bit smaller but
what do you think is this a loaded die
okay so why are you so certain for this
one which means what why is the larger
sample size relevant
well look at this stuff this is
relatively flat right why is it
relatively flat yeah when you have a
larger sample size your averages have
less noise okay and this is these are
really all averages right I'm just
counting how many times I get each one
so this is the you know the average
number of rolls that got a three so you
can start to see I hope how these
factors play into each other and we're
gonna make this more precise in a second
with one die we have a flat distribution
with two dice we have a non-uniform
distribution. we can derive the
distribution analytically by just
counting how many ways we have to
roll each number, right. usually
for simple problems you can just compute
the results combinatorially. it's
basically just a symmetry principle, you
just say, well, the number of times we
expect to get a two on the white
die is the same as the number of times we
expect to get a four, and so we just run
through all of the possible combinations
that way.
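For two dice the counting is simple enough to do in a couple of lines:

```python
from collections import Counter

# Enumerate all 36 equally likely outcomes of two fair dice
# to get the exact distribution of their sum.
counts = Counter(a + b for a in range(1, 7) for b in range(1, 7))
for total in range(2, 13):
    print(total, counts[total], "ways out of 36")
```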
when we actually simulate it,
the histograms don't look anywhere near that
clean. I think here I've got another 60
rolls or something, and you can
see they're a bit all over the place
with samples this small and if you only
do 60 rolls, and this one is rigged to
give more fives, for example, let's say
that's why we get all these 10s, it may be
very hard to tell that something is
wrong with the die until you roll it
hundreds of times there are cases so so
far we've talked about cases where
there's an analytical solution you can
sit there with some algebra and figure
out what they what the fair distribution
looks like when you're doing your actual
data analysis or if you're doing some
sort of data visualization you may not
have an analytical answer to what noise
is going to look like right that's kind
of the question we're answering by doing
these
simulations is what does noise look like
so this is from a beautiful paper which
is in the readings which sort of pulls
out a lot of these statistical ideas
into into a visualization framework and
here we're showing cancer rate per
County and what we've done in this image
is one of these is the real data the
rest are synthetic data generated by
taking the original cancer rates and
scrambling the counties okay so this is
our first instance of generating a null
hypothesis by randomization so I'm
introducing a couple ideas here right
first is null hypothesis and what that
means is you can think of this
colloquially as the data is just noise
okay
our theory is that what we're looking at
there's actually no pattern in the sense
that we care about the die is fair
there's no relation between location and
cancer rate there's no difference
between the two classes the jobs rate
was flat you know whatever it is we're
saying that the actual pattern we see is
just noise in some way and then the
other idea we're introducing is
randomization and we're using
randomization to generate examples of
what the data looks like under the null
hypothesis so we've already been doing
this when we do this type of thing what
we're using is we're using a model of
the null hypothesis all right we're just
rolling two fair dice and adding them. when
we do this the way we make the model is
okay so we have some counties this is
what our data looks like
and each of these counties has a certain
cancer rate and then what we do is you
know the null hypothesis is that there's
actually no association between where
the county is, or which county it is, and
what the cancer rate is. so we permute this
distribution right we say you know let's
just let's just scramble them a little
bit okay we just all we do is we
randomly reorder these things or in the
case of our County data here well we
just swap all the counties around and
what we get is something that has the
same distribution of underlying rates so
if you draw a histogram of the County
data you see exactly the same thing
we've only rearranged the data. but if you
map this stuff, what you see is this,
alright, what you see is some of the
data that we could see if there was no
relation between where the county is and
the cancer rate. so then the question is
can we tell which is the real data and
which is the synthetic data because if
we can't tell the two apart then the
pattern that we're seeing in this in the
real data is a pattern that happens just
by chance and it's much less likely to
be meaningful so anyone want to take a
guess what the real data is yeah
everyone said three three is correct so
the fact that you could correctly guess
which was the real data means that there
is actually a geographic pattern here that
is distinguishable, that looks
fairly accurately distinguishable from
random data okay so you have all just
provided statistical evidence that
something is happening that is not
random
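As a sketch of that county-scrambling step (the input format here is hypothetical):

```python
import random

def scrambled_copies(county_rates, n_copies=8):
    """Generate synthetic 'null hypothesis' maps by shuffling which county gets
    which cancer rate. county_rates is a dict of county -> rate (a hypothetical
    input format); the histogram of rates is unchanged, only the geography is."""
    counties = list(county_rates.keys())
    rates = list(county_rates.values())
    copies = []
    for _ in range(n_copies):
        shuffled = rates[:]
        random.shuffle(shuffled)  # break the county/rate association
        copies.append(dict(zip(counties, shuffled)))
    return copies
```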
I think we looked at this earlier when
we were discussing text analysis this is
this is an analysis of the anachronisms
in Downton Abbey. did we talk about
this earlier yeah okay so on the
horizontal axis is so it's taking every
two-word phrase in the scripts of
Downton Abbey and first of all plotting
the overall frequency it's on a log
scale, so 10 to the minus 1 means every
tenth bigram is this. I think that can't
be quite right, because 10 to the 0 would
be every bigram is this. mm, anyway,
something like that though right so the
more the the more common words go to the
right and the vertical axis is how much
more commonly we see this in the Downton
Abbey scripts than in Google books
results from the same time period right
so 0 here so it's a logged it's the log
of the ratio so 0 means we see it at the
same amount so of course you see
everything clustered around 0 which
means the language is broadly the same
and then this stuff up here write all of
the stuff that starts to get high on
this chart is over-represented right so
two here means we see this a hundred
times more often, that is, we see it about a
hundred times more often in the Downton
Abbey scripts than in the books corpus
now the question I want to put to you is
what would be the shape of this
scatterplot if actually the language in
Downton Abbey is drawn from the same
distribution as the language in Google
Books so that's our null hypothesis here
right so null hypothesis and I'm being a
little more technical here
Downton Abbey's scripts drawn from the same word
distribution as Google Books of the same
era, right. I don't know, it's set around
World War One, I'm not sure exactly which
years he used. so think about it this
way, rather than trying it analytically. there
is an analytical analysis you can do,
right, you can algebraically work this
out, but I don't want to. so let's imagine
simulating this how can we make a
simulated version of this chart
so think of this as making fake Downton
Abbey scripts
any guesses
let's generate some Downton Abbey
scripts that are drawn from the same
distribution as the words that are
stored in Google Books how do we do that
yeah, they don't need to be readable scripts, they
don't need to make any sense, they're just,
yeah, so if there's a hundred thousand
words in the Downton Abbey scripts then
let's sample a hundred thousand bigrams
from the Google Books corpus which we
can do by the way you can download the
frequency tables for Google books for
each year for up to 5 grams all right so
we can just download the data and do
this ok so we can generate fake data and
then we can produce this this plot and
we just do that by then basically just
take just comparing the sample that we
have to the Google Books baseline we can
generate this stuff, right. so we take
each bigram and we say, ok, how
many times did we choose it in our
scripts which gives us the x-axis and
how many times do we see it as
compared to in Google Books, which gives
us the y-axis, and we go. and so when we
do that we will get a certain shape and
I want you to think about what that
shape might be what is that what shape
would we see for this chart any guesses
right okay yes okay so good it's a good
guess that's a good start right so the
zero line is means the ratio is the same
because their log one is zero right so
because we pull them from the same
distribution we should get stuff around
this zero line but we're not quite going
to get stuff around the zero line what's
that like how we generated the well
because we're not going to because the
google books is this huge big
distribution if we sample a certain
number of words some words are gonna
appear more often by chance some words
were never we're never going to sample
yeah yeah yeah so right yeah the
randomness is how is how often did we
pick this word right so if we by chance
picked a word a lot it'll be up here if
we didn't pick a word very much it'll be
down here I think it's in Google Books
yeah I'm not sure that okay
so here's the thing though we're going
to get this sort of funnel shape and the
reason we're gonna get this type of
funnel shape it's gonna sort of walk up
the chart here, it's gonna be denser in
here but it's going to sort of go up
here, and the reason is because these are
very rare words, right. so think about it:
if a word appears once, and I'm just
gonna say word, it's actually bigrams, if
a word appears once in every hundred
thousand words okay and we have one in
our script of a hundred thousand words
then we're gonna be on the zero line but
say by chance we pick it twice well now
it's twice as often so it's gonna go up
here so the rarer the word is the more
difference choosing that word a single
time makes on the vertical axis of the
chart another way to think of this is as
we go to the left of the chart we get a
smaller sample size because remember
we're comparing the ratio of two samples
this is the sample in Google Books this
is the sample on the script alright if
the samples are very small we're gonna
get a lot more noise so if you actually
do this experiment you will get a shape
that looks like this and so if we
interpret this chart we have to sort of
imagine that there's a curve which runs
kind of like this and we're really only
interested in the stuff above the curve
in fact what we could do is we could
numerically calculate what is the curve
that represents we only get a value this
high one out of every hundred samples so
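Here's a sketch of how you could generate that simulated null scatter, assuming you've downloaded a bigram frequency table for the era; the variable names are hypothetical:

```python
import math
import random
from collections import Counter

def simulate_null_scatter(google_counts, script_len):
    """One simulated 'fake Downton Abbey script': draw script_len bigrams from
    the Google Books distribution, then compute (x, y) points like the chart,
    x = log10 of overall frequency, y = log10 of the script/books ratio.
    google_counts (bigram -> count) and script_len are hypothetical inputs."""
    bigrams = list(google_counts.keys())
    weights = list(google_counts.values())
    total = sum(weights)

    fake_counts = Counter(random.choices(bigrams, weights=weights, k=script_len))

    points = []
    for bigram, k in fake_counts.items():
        books_freq = google_counts[bigram] / total
        script_freq = k / script_len
        points.append((math.log10(books_freq), math.log10(script_freq / books_freq)))
    return points
```

Plotting many of these simulated points shows the funnel: the rarer the bigram, the wilder the log-ratio swings.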
you see this characteristic type of
funnel shape in a lot of cases so for
example if you plot if you take school
test data where you have the test
average test score for the class and you
have the class size and you plot class
size on the horizontal axis and test
score on the vertical axis the smaller
classes will have the most extreme
values if you have if you plot crime
rate per County well crime rate is is a
ratio right it's how many crimes versus
how many people if you have a smaller
population in the county one murder can
make the crime rate incredibly high
right that one crime has a much larger
effect on the rate so anytime you're
taking averages or rates when you have a
smaller group of people you get a lot
more noise which means and this is a
classic trap you know the classic story
is you download the crime rate data and
you say oh which county has the highest
rate of car theft you know what it's
gonna be one of the smallest counties
always,
most likely just because
it has the most noise. so when
you're analyzing this any type of data
which has some sort of noise or random
process in it that first question you
have to ask is what does this look like
if there is no pattern which means you
have to figure out what no pattern means
and for the example we just did no
pattern means we I just defined it to
mean the same distribution as Google
books of the era here's another example
this is from The Signal and the Noise,
which is Nate Silver's book on
prediction and what he's talking about
is the claim that hey look the
temperature rise stopped I think now we
have data almost ten years later and
it's going up again but in 2010 people
were saying like oh yeah global warming
has stopped look right it's been flat
over the last 10 years and but you know
you can see there's a variation here the
biggest variation by the way is the
11-year sunspot cycle, so there are actual
astronomical sources of variation here
as well on multiple time scales there's
some like hundred thousand year cycles
as well but so he posed the question how
often are we going to see a flat decade
even if the temperature is actually
going up so how would you answer that
from this data any guesses
what's that well some decades are going
up right most I mean there's a general
upward trend right so that's coming up
that's coming up but this is a flat
decade that's a flat decade that's a
flat decade that's a flat decade that's
a flat decade so what he did is he said
ok well how often do we see a flat or
decreasing decade just from that data so
in this case he's just using the data
itself to get an estimate of how how
likely it is because we look at this and
we know there's a general upward trend
right so there's definitely a long-term
upward trend and what he's saying here
is even with this general upward
trend you're very often going to see
decade-long
decreases in temperature so seeing a
decade-long decrease in temperature
doesn't mean that the upward trend has
stopped. right? what's that? no, you can, I mean,
you can say that's totally factual, you
can say that this is the greatest
temperature in the last decade right but
to say that you know global the
temperature trend in the last decade has
been flat therefore global warming has
stopped that's nonsensical because we
can see just from the historical record
that it appears that there are
decreasing trends in on the timescale of
a decade fairly often and yet we know
because we have the data subsequent that
this decreasing trend was not the end of
global warming
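His estimate comes straight from the historical series: count how often a ten-year window is flat or decreasing. A sketch of that calculation, assuming you have a list of annual temperature values:

```python
def slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

def flat_decade_fraction(annual_temps):
    """Fraction of 10-year windows in an annual series whose trend is flat or
    decreasing, even if the long-run trend is up."""
    windows = [annual_temps[i:i + 10] for i in range(len(annual_temps) - 9)]
    flat = sum(1 for w in windows if slope(list(range(10)), w) <= 0)
    return flat / len(windows)
```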
okay so it's I I hope you're seeing the
sort of conceptual connection between
all these examples okay they all rely on
the idea of how often does the thing I'm
looking for happen by chance, given
the inherent structure of the underlying
data so here's another example this is
what I call the lottery fallacy this is
also due to Nate Silver. in, where was
this, 1976, it was thought that there
was going to be a big flu epidemic and
so there was a huge vaccination program
and some of the people who had the
vaccine died but three people who were
vaccinated in one clinic in Pittsburgh
within the same hour all died that day
so this is the same structure as: is the
die loaded, does it make a difference if
you put the milk or the tea in first, do
we see a pattern in the cancer rates here,
right look at the structure of this
argument it is extremely improbable that
such a group of deaths to take place in
such a peculiar cluster by pure
coincidence that's the argument right
it's the same structure of the argument
this is not chance the null hypothesis
is wrong or another way to say this is
the null hypothesis in this case would
be something like the vaccine is safe
they died from something else. this
editorial is explicitly saying the null
hypothesis does not generate the
observed data with high enough probability
so Nate silver being nate silver said
okay let's calculate how high the
probability is so here you go here's
what he did basically he just he takes
the death rates for elderly Americans,
which he defined as 65 or older, he asks
how many elderly people
visited a vaccine clinic, he makes an
estimate of the number of clinics,
converts that to the
chance of any one person dying, and then
to the chance of three people dying, and
there you go
so this is probably what the person who
made the argument the editorial argument
was thinking about about four hundred
eighty thousand to one all right so it's
very unlikely that three people who went
to the same clinic in the same day will
die from some other cause. however there
are five thousand clinics doing this and
eleven days over which this happened so
when you put that all together the odds
are about one in nine right so greater
than ten percent so here he's making the
opposite argument he's saying no this is
not extremely improbable this is about
one in ten which now we have the problem
of well do we think the vaccine is safe
or not what do you think our is a one in
ten chance of this happening by chance
does that suggest that there's a problem
with vaccine or does it suggest that the
vaccine is probably quite safe
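To make the arithmetic concrete, a quick sketch using the numbers from the piece (roughly 1 in 480,000 for any one clinic on any one day, about 5,000 clinics, 11 days):

```python
# Chance of a three-death cluster at one particular clinic on one particular day
p_cluster = 1 / 480_000

# Number of clinic-days over which such a cluster could have occurred
clinic_days = 5_000 * 11

# Chance that it happens at least once somewhere
p_somewhere = 1 - (1 - p_cluster) ** clinic_days
print(p_somewhere)  # about 0.11, i.e. roughly one in nine
```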
so what you're doing there is you're
invoking a different hypothesis and
you're saying it's not a problem with
the vaccine in general it's a problem
with the vaccine at that clinic or
something about what that clinic is
doing yeah that's a reasonable
alternative in fact let's start writing
these down, right. so the null hypothesis is,
right, the null hypothesis: vaccine is
safe. let's call this h1: clinic is bad. h2:
vaccine is bad. h3: food poisoning. right,
and we can go on now we're starting to
get into what we're gonna look at the
end of the class which is the analysis
of competing hypotheses and the question
we can ask is how often will this
hypothesis
generate our real data okay so we have
some real data in this case three deaths
in one day or we have some real data in
this case you know this word cloud or
these images there's always some data
that we observe and the question we're
asking is for all of the hypotheses that
might be causing this how often will it
generate the data now this is a purely
statistical question the answer will be
phrased in terms of probabilities, or
ratios of probabilities, and so forth.
that is a different question than should
we investigate the safety of the vaccine
this is the difference between
probability and decision theory
probability gives you a chance of
something happening decision Theory
tells you well what choice should I make
and decision Theory introduces costs
okay
it introduces the idea of an expected
value expected benefit expected loss
expected utility and although the
probability that the vaccine is bad
might be very low, the cost of having a
bad vaccine is enormous,
so the expected value calculation
becomes very large this is similar to
our issue with setting the threshold
for false negatives and false positives for
releasing people on bail pretrial. where
you want to set that threshold
depends on what you consider the cost of
a false negative versus the cost of a
false positive so how much worse is it
to keep someone in jail who wouldn't
have committed a crime, versus release
someone from jail who then goes on to
assault someone right how do you weigh
those so
probabilities themselves cannot tell you
what to do you have to combine them with
costs so this is sort of the first step
in a decision analysis because the thing
is this right even if we have a very
small chance that the vaccine is bad
because the potential outcomes of having
a bad vaccine and not handling that
situation are so costly even if this
probability is very small you still want
to investigate it
maybe stop vaccinating people so I'm not
even so going back to this editorial
right I'm not even saying that this
person is substantively wrong right
there they're arguing here that we
should probably be careful with this
vaccine. I'm saying that their
argument that it's unlikely is not
really true, because there's a lot of
chances for it to happen okay
thoughts on all this stuff questions
yeah no I mean we should I would say
this is definitely reason to investigate
that possibility however this so what
I'm saying is that if you're going to
argue that if you're going to argue that
something is improbable by chance put a
number on it all right
if I actually think about it, is it that
unlikely? potentially not. but you are
right so that's something you need to
think about
what's that yeah I mean you may or may
not want to publish the calculation it's
probably better if you can rely on an
expert in that field to do these types
of calculations but when you start
looking for this you see it a lot right
what you see is the argument that
something couldn't possibly happen by
chance and almost always when someone
says that they haven't actually
estimated the probabilities
okay so we're zeroing in on the idea of
statistical significance which is this
idea of arguing that something must be
true because the alternative is very
unlikely so this is the earliest example
I can find. this was the Howland will
trial. Sylvia Ann Howland was a rich
American aristocrat who died in the 19th
century and there was an addendum to her
will that said that all of her money
went to her niece and this is an
original signature that they know
was real, or at least it was
undisputed in the case these are the two
signatures on the addendum and the
argument was that her niece forged this
stuff and the to prove this the
prosecutor hired charles sanders peirce
who is best known as the inventor of the
randomized controlled experiment and one
of the founding philosophers of
pragmatism. he was a great 19th century
scientific epistemologist, and asked him
to compute how likely it would be that
these were forged and the way he did it
is he took, I think, 41 pairs of
signatures. do I have this? no, I don't
have that text here, but it's late 19th
century, and I have this story
in The Curious Journalist's Guide to the Data, arguing from
the odds. yes, oh, he took 42 signatures
that the court believed to be genuine
and actually I think it was Peirce
senior who actually did this part: he
printed them out on transparent
photographic plates and he overlaid
every possible pair, which is 861
possible pairs, and asked how often, well, he
broke the signature into 30
individual strokes, downward strokes, so
that's a stroke that's a stroke that's a
stroke that's a stroke that's a stroke
right and he asked how often were the
strokes in the same had the same length
and horizontal position and he found
that the same stroke matched only a
fifth of the time. so he said that, well, for
30 strokes to match which the disputed
signature did match exactly happens by
chance only once in 5 to the 30 okay so
now this is not there there are problems
with this statistical argument but this
gives us sort of a ballpark so his
argument was that it is very unlikely
that if Sylvia Howland signed her name
that she would sign it exactly the same
on two different documents and we know
that from looking at the variations on
her real signatures. so again, what he's
saying is,
in this case the null hypothesis,
which is normally called h0, is it's an
original signature, and h1 is it's forged by
copying right so by tracing it and what
he's trying to do is calculate how often
the null hypothesis replicates the data
and he's saying well it's super rare so
this same basic form appears over and
over and it is the fundamental concept
in hypothesis testing and it's a little
bit of a weird thing right because what
you actually care about is proving this
guy but to try to prove this what we do
is we say well the alternative is that
and the alternative is super unlikely
and people have done this in journalism
so The Wall Street Journal has won a
Pulitzer Prize and actually done three
different versions of this story over
ten years looking for various types of
insider trading
so insider trading is when an executive
in a company has information that is not
public and trades stock based on that
information there are laws about this
for example there are blackout periods
before and after earnings announcements
where executives and often other
employees can't trade during those
periods
to answer the question of whether people
were or to try to find insider trading
what they did is they took the
executives of a bunch of different
companies and they looked for cases
where they sold a bunch of
stock and made a bunch of money right
before the stock changed value, so they
bought stock before it went up, sold stock
before it went down. and in this
particular context, what they
used are backdated stock options, and
this is an instrument where they grant
them some stock options but then later
they decide oh they were active as of
this date and so the question is did
they really pick the date just
arbitrarily or did they pick the date
sometime later and set it at a date
where they would make a lot of money and
answer that question
what they did is they took all of
the trades and did thousands of
simulations to see how much they would
normally make if the trades were on
random dates. so here, breaking this
down a little bit so they looked at
people who are using this type of stock
option and they're finding that they
they make a lot of money but the people
who don't use that type of stock
option don't make as much money and let
me show you how this this works
oh no Google tends to ask for my
password it awkward really
why does it randomly pick my it's not
new at all okay
yeah, so this is a nice discussion, which
I will post to the course slack, of how
this works, and it explains how this
analysis worked
so first of all they looked for news
that changed the stock price and then
they looked at who sold stock the week
before and they found that 10% of them
made a bunch of money but the problem is
sometimes you're just gonna make money
anyway because the stock is going up
right or down so instead what they did
is simulation so what they did is they
broke the link between the trades and
the timing of the trades so they took
the same trade you know I sold this many
shares and they said okay rather than it
happening on this date before the news
let's pick a random day in the year and
let's see how often you make this amount
of money if there's no connection to the
timing of the news and what they were
able to show is that if this was just
luck they were incredibly lucky they did
these simulations and they found minute
chances of making this amount of money
from the stock trade. it's the same thing
we've been talking about, this
randomization approach. so now, instead of
randomizing the link between cancer
rates and counties, we're randomizing the
date of the stock sale and asking how
often do I make this much money
on a random day. now, is this
evidence of insider trading? how do we
handle this result as journalists? this is
the complicated part
so you're saying that if if the news is
particularly unusual then that means
they're even luckier okay
so really what I'm asking here is so say
we do this we identify you know this
person okay and you know they made an
extremely improbable amount of money if
they were just trading randomly okay
right so we think okay this is not a
random pattern
can we write a story saying that they
traded on non-public information no okay
why not? you've still got to be a journalist, is
what they mean, yeah. right, so remember, insider
trading is a criminal offense, you can't
accuse someone of a criminal offense
without proof, right. and usually you
don't you don't accuse someone of a
criminal offense unless they've been
convicted convicted by a court if
they've been accused but not convicted
you say, you know, allegedly, or you hedge
it in various other ways, or, you know, you
hedge it in various ways. but that's not
even that right the SEC has not opened
an investigation against them right
there's nothing going on here except the
statistical result so you can't call
them a criminal right you can't say that
but what do you say instead
right and in fact I loved how they
handled this. look at this headline:
executives' good luck in trading own stock.
all right
they didn't accuse them of any wrongdoing, the
only thing they said is they had very
good luck and the thing about this is
that it may be true you know some some
of these people probably just traded
randomly and got lucky but not all of
them it would it's just too unlikely
that they all got that lucky the problem
is we don't know which ones were really
lucky probably you know a small number
of them will have just been lucky
probably for most of them it is some sort of
inside information. whether that's
illegal or even immoral it depends on
the specifics right that's a question of
law not statistics
all right there's all kinds of
complicated law about whether it's
insider trading or not but they actually
did three different versions of this
they did it with trading before press
releases, they did it with backdated
stock options, and they did it with one
other type of investment vehicle so
they've actually done this story
multiple times over ten years with about
the same statistical technique another
nice example of this is the tennis
racket which was a story that BuzzFeed
did recently and this was the part of a
much larger investigation into
match-fixing in tennis and what they did
is they took public betting odds right
they scraped it from a bunch of betting
sites and they looked at cases where the
odds shifted dramatically between the
start of the match and later on in the
match right and the idea there was that
if someone was heavily favored to win
and they lost, maybe
they threw the match, or as they put it,
if bets come in against a favored player
maybe that's evidence of match-fixing
so BuzzFeed actually published
the code for this, it's a pretty
straightforward simulation, and what they
did is they generated a list of I think
15 players who frequently played matches
where the odds shifted a large amount,
and they didn't actually publish the
names of these players they just said we
found a bunch they did publish the
algorithm to do it and the data is
publicly available so some readers
promptly ran the code on the data and
got the names so interestingly again
they did not accuse they didn't even
publish the names they didn't want to
accuse people of match-fixing based on
this type of evidence alone why what is
it what is the weakness of this type of
analysis? they could have had a bad day for
all kinds of reasons and there's a
really nice article this is also in your
readings this is a very where are we
okay so here's the original how they use
data to investigate match fixing in
tennis and here's a lovely article why
betting data alone can't identify match
fixers in tennis and this is a beautiful
critique of the method and one of the
things that they talk about is what are
all of the other ways that a player can
tank a match. so it's the same sort of
concept, right: null hypothesis, threw
the match, and then the alternate
hypotheses,
deliberately losing, yeah. the
alternate hypotheses are things like h1,
bettors have inside information, h2,
the betting markets are wrong, and so the
example they give there is,
maybe there's a recent injury that the
bookie doesn't take into account for
example that's what he's saying there
and then this is very interesting h3 is
let's call it methodological problems
and we're going to talk about this a lot
when we talk about the Garden of forking
paths so check this out
what 538 did for this article is they
hired a statistician to replicate the
results and they did it a little bit
differently first of all he used a
little a few more matches okay but also
the BuzzFeed analysis looked at the
bookmaker right so that there's there's
seven different companies producing
betting odds and they're gonna have
different odds for the match and what
the BuzzFeed analysis did is took the
bookmaker that showed the maximum
movement in the odds
whereas Sackmann took the median
of the odds given by the bookmakers and
looked at how much the median moved. so
that's a different methodological choice
why would that be justifiable by the way
why would you think, well, maybe I should
use the median of the odds rather than
just the maximum of all of them why
would you want to do that
yeah alright so the median removes
extreme values right so so if you take
the median of all of the bookmakers
you're you know basically you're just
throwing out the data for you know the
bookmakers who just got it stupid wrong
you know, made a stupid mistake, were
having a bad day, didn't take into
account the data in the right way.
there can be data errors as well right
there could be typos you're just
throwing out all of that stuff right and
when you use the median you get a much
smaller number of players who have
suspicious odds movements so you made a
bunch of changes no it's gonna be the
whole market because the idea is that
the player deliberately chooses to lose
the game and it's being paid off by
bettors who are betting against a player
who is favored right so everyone thinks
he's gonna win you say I'm betting that
he's gonna lose you get very good odds
on that right it pays back let's say ten
to one then the player loses and you
make a whole bunch of money and the
player is in on it right they get part
of the winnings right that's why it's
illegal okay
oh I see you picked you picked the
bookie who's the most pessimistic cuz
you get the best odds yeah probably
it's a good question, but, so that's
where the money would go. but for this
analysis, basically the whole concept
of the analysis is you look at cases
where the odds shifted very rapidly once
they started playing, indicating that
they seemed to be very unlikely to
lose, and yet they were losing
anyway. and when you make this
methodological change you get a
different set of names and a smaller set
of names as as you said according to the
statistical significance threshold that
the original analysis used, only one
player, not four, was above this threshold,
so when we change the method we get a
different result and that's an
indication that we have methodological
issues. what you hope for is
a result where, when you change the
method you get the same answer that's
known as a robust result and this is
true not just in statistics by the way
this is true in journalism in general
so then BuzzFeed's analysis
there you go so this is how BuzzFeed
handled this they said first of all we're
not claiming that by itself this
analysis proves match-fixing
but they said ok but we also had this
whole other investigation right the
statistical analysis was only one part
of the investigation and they also noted
that six of the I think 14 players they
found were also under investigation so
there's some reason to believe that the
analysis was effective in finding the
people although again you have the same
problem it's hard to make the argument
that any particular player is cheating
based on this so I'm going to show you
okay
thoughts on this again it's exactly the
same structure of the argument. we're
saying that, oh, actually I got this wrong,
the null hypothesis is not that they
threw the match, that's, I don't
know, let's call that Hf, for fixing.
the null hypothesis is, let's say, lost by
chance, and how they generated data to
see if they lost by chance is they
simulated all of these games: they took
the initial odds and they said let's
simulate a million games and see how
often this player would lose as badly as
they did, and they said, well, that doesn't
happen very often, so we think it's
evidence of match-fixing.
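A simplified sketch of that simulation step (this is not BuzzFeed's published code; the win probabilities and loss counts are hypothetical inputs):

```python
import random

def chance_of_losses(win_probs, observed_losses, n_sims=1_000_000):
    """Given the opening-odds win probability for each of a player's flagged
    matches, how often do they lose at least `observed_losses` of them just
    by chance?"""
    count = 0
    for _ in range(n_sims):
        losses = sum(random.random() > p for p in win_probs)
        if losses >= observed_losses:
            count += 1
    return count / n_sims
```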
and then 538 is coming back and saying,
ok, but why is it evidence of that and
not evidence of this, or this, or this? so this is the
basic weakness of the statistical
significance argument the statistical
significance argument has a very
funny structure: it always says the
alternative is unlikely, therefore my
proposal is correct.
ok and in a certain sense that has to be
true because if we give a probability to
every hypothesis right so let's say p0
is the probability of match-fixing p1
equals probability of injury etc right
and then
let me relabel this: p1 is match-fixing,
p2 is injury, p0 is we saw the
odds shift that we saw by
chance.
we know that p0 plus p1 plus p2 plus all
of the other possibilities equals 1,
that's an axiom of
probability if we could list out every
possible option they have to sum to 1 so
if this is very small then these have to
be big okay so the the structure of the
argument does make sense formally if
it's very unlikely that the null
hypothesis produced the data that we saw
then that is evidence that something
else produced it the question is how
many different things could have
produced it and the difficulty with the
structure of the BuzzFeed argument is
they're saying, well, it must have been
this, and 538 is saying, well,
hang on, there's all of these other
possibilities. if there's only one other
possibility, so think of the Howland
will trial, either it's fake or it's
real, and if it's very unlikely that we
would see a signature like that if it
is a real signature, then it has to be a
fake signature. okay, but in the tennis
fixing case and in the insider trading
case we have other possibilities so all
we know is that we can exclude chance
but now we don't know which of the other
ones it actually is
one of the things you can do to
understand whether statistical
significance testing is going to work, or
what kind of evidence it provides is to
talk to other people who have done it so
this is again from that 538 story right
and they they talk to this guy who you
know gave them this quote about what it
meant and also I find this really
interesting
doing the same thing has proven
problematic in other sports so that
feels like one of those sentences where
the reporter knows a lot they probably
have like pages of interview notes and
like interesting links but they haven't
given it to us yeah I'm sure there's a
long history of trying to detect
cheating through statistical methods and
you see this in elections as well we're
not really going to talk about that
today but you can see the election fraud
in the Russian elections by looking at
the results in a bunch of different ways
one of the ways you can see it is if you
do a histogram of the precinct counts
way too many of them end in zero or five
you know so 65% 70% 75% for Putin's
party people are very bad at making up
random numbers pick a random number
between 1 and 10 all of you how many of
you chose 7
[Laughter]
that's right it's because I decided you
would okay rounding out our our
discussion of what I'm going to call
pure significance testing you'll see why
in a second this is a fascinating
article which was originally on medium
and this was about someone discovered
that there were a series of payments
just before October 28 2016 which was
the date of the contract between Stormy
Daniels and Trump, there were a
bunch of campaign finance payments that
totaled almost exactly 130,000 and so
then this person did an analysis and
what they did was the same sort of
randomization technique right they said
the null hypothesis is these are just
random payments and how they generated
instances of the null hypothesis was
they just randomly sampled from all of
the payments in the previous month or so
and they asked how often if I randomly
sample 10 payments, do I get a bunch of
them that add up to near 130,000. and
here's what they did, they found this
thing, and they said, whoa, only about
1% of the time do we get to
within 2 dollars and 75 cents. and that
was their argument that these campaign
finance payments were actually the
payments to Stormy Daniels.
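A sketch of that randomization check, simplified; the real analysis looked at combinations of payments, and the tolerance here is a made-up placeholder:

```python
import random

def chance_of_near_match(payments, k=10, target=130_000, tolerance=100, trials=100_000):
    """How often does a random draw of k payments from the pool sum to within
    `tolerance` of the target?"""
    hits = 0
    for _ in range(trials):
        sample = random.sample(payments, k)
        if abs(sum(sample) - target) <= tolerance:
            hits += 1
    return hits / trials
```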
now there are some problems with this argument. this is
a great example of, you know, we sort of
have to think through the argument, right,
like what are the possibilities other
than it was a payoff well you know
chance is one of them, you know, were
they recurring payments that totaled
$130,000,
are these numbers accurate or were they
changed later, and then you get into this
whole issue of, if you're trying to hide
payments by the campaign, is this how you
do it, is this how campaigns have
done it in the past. there's all these
other issues aside from the statistical
issues but it's the same idea and so
finally we get to p-values which are one
of the most widely used and most
misunderstood concepts
in statistics. and, mmm, that's the formal
definition, let's try this
informally. anybody want to hazard an
explanation of what a p-value is our
resident statistician
yeah
okay so it's a complicated idea it's
actually a little weird so here's the
here's the sort of standard definition
the idea is that we have something
called a test statistic which measures
how out there is our answer right so you
know if we're looking at the change in
jobs and we're like oh look at that
that changed by 100k the test statistic
is maybe you know how many standard
deviations away from last month are we
or this test the test statistic is often
turned into a z-score which is a
difference expressed in standard
deviations, but it can be really anything.
the fundamental concept of a
p-value is: the p-value is the probability that we
would get data, I'm being a
little informal here at least as Extreme
as our
real data right so I'm I'm
differentiating here between the data we
observe in the world and the data we're
simulating at least as Extreme as our
real data, if the null hypothesis is true.
so generally, for the sort of standard
statistics, the general threshold
is p smaller than 0.05, so a 5% chance.
this definition is complicated if we
have a p-value of 0.05 for you know is
there a difference between these two
drugs is there a difference in the
salaries of men and women at this
University
you know what do we see or you know is
have crime rates really increased that
type of thing this is where we normally
use p-values that doesn't mean that
there's only a 5% chance that the value
we see is due to noise it only means
exactly this which is really hard to get
your head around it means that if there
was no difference we would see the
difference we actually saw about 5% of
the time, right. so it actually doesn't
say anything about the hypothesis we
care about, it only says something about
the hypothesis we don't care about, and
the implication goes in the wrong
direction what we really want is the
probability that our hypothesis is true
given the data instead what we get is
the probability that we see our data
given a hypothesis that we're not
testing. so it's a weird calculation. now,
there is still logic in it, because,
as I showed you before, if the null
hypothesis is very unlikely to have
generated the
data, then something else did, okay. and if
the only other thing that could have
done it is the thing we care about then
yeah when the p-value gets very small
that is strong evidence for our other
hypothesis. so let's talk about this. let
me introduce the idea of a test
statistic: this is a measure of how
extreme our result is so this is a
standard formula for the difference in
means of two groups with different
variances I'm not going to go into how
we derive this and in fact I'm never I'm
gonna recommend that you never use this
formula I'm gonna recommend that you do
it through randomization in fact what
I'm going to show you is a way to
compute a standard statistical p-value
without these formulas. so the test that
I'm going to show you, and
you're gonna do homework on
exactly this problem, is: let's imagine
that there's two different classrooms
that differ in some variable we care
about maybe one has a better paid
teacher maybe one uses different
textbooks and we have the the
standardized test scores for every
student in both classrooms and we're
going to ask does it matter which
classroom they're in. now obviously
the students are not the same, so there's
gonna be some difference in means
between the two classrooms alright
there's gonna be some difference in the
actual data, but the question we're
asking is how likely is that
difference to be something other than
just chance so to answer that question
we have to ask about why we would have
those differences and there's basically
only two reasons there are differences
because of differences in the two
classrooms and there are differences
because of things that do not depend on
the classrooms like the variation in the
students who go into the class and we
have to hope by the way that
students are not assigned to a class
based on how good a student they are
right, that's why we do
randomization in experiments. we have to
hope that there's no other correlation
between classroom and students but if we
think that's true then what we can do is
we can basically break the association
between which classroom they're in and
their test score through permutations so
this is how we do it
this is what the data looks like this is
the test score this is which class
they're in and what we do is we randomly
reassign the students to classes we
permute the class assignments or permute
the data same thing right we have the
same number of A's and B's but we
reorder them, and we do this over and over. in principle we'd actually run through every possible permutation, you know, it'd be in the billions; in practice what we do is we sample a
subset of them randomly and for each
permutation we compute the difference in
the means between the two classes so
here's the idea here's the original
scores for Class A and B with the means
marked what we do is we randomly
reassign each of these dots to the left
or the right ensuring that we keep the
total number of students the same in
each class and then we recompute the
means and you can see sometimes Class B
comes out higher and sometimes Class A
comes out higher we do this thousands of
times and then we look at the difference
between the class scores for each one is
that clear so far and then we make a
histogram of the thing and this is what
we get we get this histogram of possible
differences in the class scores after
resampling and somewhere in this
histogram is our observed data. so in other words, this is the real data here, and all of these are realizations of the null hypothesis, and
then we ask what is the probability that
we get a test statistic in this case the
difference between the means of the test
scores of the classes that is at least
as big as the observed data, at least as extreme as the real data. so looking at this histogram, can anybody tell me how that probability appears on this chart? there is a relationship
between this visualization and the
p-value anybody know what it is again
the question is how often if the null
hypothesis is true do we see a
difference in means that is at least as
great as the one that we saw so what is
that on this chart
right here? no, those are the lesser differences, so this is it. and if we do that calculation, and we're not using any formulas here, we're simply counting the percentage of trials where we get an observed difference at least as big as the real difference, we get 14%. so that is what a p-value is, and I've shown you one way to compute one.
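in code, that whole procedure is only a few lines. here's a minimal sketch with made-up scores (the real class data isn't in the transcript), counting how often a random relabeling produces a difference in means at least as big as the one we observed:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical test scores for two classrooms, not the data from the slide
class_a = np.array([72, 85, 90, 66, 78, 88, 95, 70, 81, 76])
class_b = np.array([68, 74, 83, 61, 79, 72, 90, 65, 77, 70])

observed_diff = class_a.mean() - class_b.mean()

pooled = np.concatenate([class_a, class_b])
n_a = len(class_a)

trials = 10_000
count = 0
for _ in range(trials):
    rng.shuffle(pooled)                        # break the association between score and class
    diff = pooled[:n_a].mean() - pooled[n_a:].mean()
    if diff >= observed_diff:                  # one-tailed; use abs() on both sides for two-tailed
        count += 1

p_value = count / trials
print(f"observed difference: {observed_diff:.2f}, permutation p-value: {p_value:.3f}")
```

(the abs() variant in the comment is the two-tailed version that comes up in the question a moment later.)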
I've tried to demonstrate two things in
this example: one, the concept of a p-value, how often do we get a difference that is at least as great as the one we saw if the null hypothesis is true; and two, the concept of a permutation test, which is a randomization test, basically a way to use simulation to compute a p-value
where you don't have to work through the
analytical formulas of classical
statistics and with a little creativity
you can come up with simulation methods
to calculate all of these statistical
properties you very rarely have to
actually use the analytical formulas for
this type of stuff so would this be
considered a statistically significant
difference
yeah, not by the standard definition of 0.05. right, so this is gonna be your homework: I'm gonna give you two sets of class scores, or something like class scores, and ask you to compute the p-value through a permutation test
oh I see so yeah so this is the
difference between a one-tailed and a
two-tailed test
so yeah, depending on how you think about this, you might want to include all of this stuff too, on the left, because you're asking two different questions: what is the probability that Class A is better than Class B, as opposed to what is the probability that the difference is this great by chance
unfortunately. so I'm showing you this because I'm trying to give you a little intuition about what a p-value is measuring and a way to calculate it for a very standard case; this is mostly what people use p-values for. unfortunately, p-values answer the wrong question. let's see, we're going to skip bootstrapping, I'm not gonna do bootstrapping today. so here we are, we're looking at it: this means 0.05, this means 0.01, this means 0.10, and they're like, well,
it's not statistically significant but
you know there's always this problem of
threshold and this is one of the
difficulties of p-values is people want
to use them to decide whether something is true or false, but you can't
actually decide whether something is
true or false all you can do is weigh
the evidence in different ways okay
that's what uncertainty is uncertainty
is irreducible lack of knowledge
emphasis on irreducible there is no
statistical process that you can do that can tell you whether the difference in female assistant professor salaries in the life sciences is due to discrimination. right, you can't
get that from the data because there is
irreducible uncertainty in the data all
you can get is different ways of talking
about the strength of the evidence and
the p-value is one way to talk about the
strength of the evidence if the p-value
is small that means the difference is
unlikely to have been generated under
the null hypothesis which means that
something other than the null hypothesis
must be true; it doesn't tell you what. that was the problem we were running
into earlier with the statistical test
for tennis fixing
and this is a nice paper it's linked in
the syllabus it's about a bunch of ways
that the p-value is misinterpreted this
is the number one misinterpretation: if P equals 0.05, then the null hypothesis has only a 5% chance of being true. that's not at all what it says, and
one way to see this is to think about
writing the probabilities as more formal
statements. so we're going to use this notation: E is evidence, that is what we observe, you can think of that as the data; H is hypothesis, that's what we think is true. this is not quite the p-value, because the p-value is at least as extreme, not equal to, but this is kind of what we're doing with the p-value: we're saying, what is the probability of the evidence given that hypothesis, given the null hypothesis.
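in symbols, and hedging that the p-value really uses "at least as extreme as E" rather than "exactly E":

$$ \text{what a p-value reports: } P(\text{data at least as extreme as } E \mid H_0) \qquad \text{what we want: } P(H \mid E) $$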
what we really want is this: what is the probability of the hypothesis we actually care about, there's a difference in the classrooms, there's a difference in the pay, the payoff was covered up, that match was fixed, there was insider trading; right, that's the hypothesis we care about, given the evidence that we have. that's what we want, but that's not what a p-value is, and I hope that I have drilled
into your head at this point that
reversing a conditional probability is
completely different, right; these are very different concepts. and not only are they different concepts, I'm actually talking about different hypotheses. and again, it's not
that the reasoning is bad. right, so let's go back to the Howland will trial: if it is unlikely that the signatures match by chance, then it has to be true that they match for some other reason.
okay the logic is good because we know
that the sum of the probabilities of all
the hypotheses is 1 so if we can
eliminate one hypothesis it has to be
something else the question is what and
in this case, the only things we're considering are: it's forged, or it's real. so if it's not real, it has to be forged. alright, if you
have only two possibilities P values can
really work but in practice you often
have more
also the p-value doesn't tell you much
about the effect size which is the thing
you actually care about right how much
better were the scores right statistical
significance only compares against the
null hypothesis which is normally the
hypothesis of no effect
so if it's significant statistically it
may still be a super small effect so
that's this case here right different
effects, same p-value. so the percent benefit, let's say this is a medical trial: these both have the same p-value of 0.05, but this is a two percent
difference and this is a twenty percent
difference all right so the p-value does
not measure the effect size. conversely, you can get this other problem, where the p-value is just sort of measuring how accurate your result is, basically how many people you had in the trial. here we have the same effect size, we just have a lot more uncertainty when we have a higher p-value. so the
p-value is this weird construct that
kind of blends effect size and
confidence interval and there's a big
movement to eradicate p-values and just
have people report confidence intervals
instead because really that's the
information you need here another way of
thinking about statistical significance
is does the confidence interval cross
zero
so yeah, unfortunately they mean less than we might hope they would, but I'm going to show you an alternative. I'm not going to go into Bayesian statistics in any great detail because that's a whole other course, but I'm going to show you the foundational ideas in Bayesian inference, and it's basically based on this. let's back up
and represent everything as conditional
probabilities so you're familiar with
conditional probability I hope this is a
fundamental concept you need to get this
in your head or you're gonna make bad
mistakes but it's a very simple idea
really it's just a definition it's the
probability of a and B happening divided
by the probability of A. alright, so we're just taking a denominator: we're talking about two different events, and we're saying, of the things where A happened,
how many also had B. and I kind of think of this bar as a division sign, so the A is like, okay, that's the denominator, that's the set we're working within. so that's how I want you to think of it: of the A's, how many had B, which is not the same as, of the B's, how many had A. you'll see these diagrams in a second.
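that definition, written out:

$$ P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)} \qquad \text{which is not the same as} \qquad P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)} $$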
here's the classic example let's say
there are yellow and blue taxis in a
city and some of them have accidents and
we're trying to ask which cab company is
safer. now it should be obvious to you that if you just count
the number of accidents you're gonna say
well the yellow company has more
accidents but of course there's more
yellow cabs so what we actually want is
the rates and that's kind of what
conditional probability is it's you know
given that we're yellow how many
accidents did we have or in this case
blue right so this is what conditional
probability is the the thing after the
pipe is
the denominator and then we count within
that how many had an accident. right, so we only end up counting this accident, not that accident, because only this one is inside the denominator. we talked about
relative risk a lot a few classes ago
relative risk is this I mean these are
the formulas from the contingency tables
but the easiest way to think about it is
as a ratio of conditional probabilities
in other words what is the ratio of
getting the disease given that I smoke
over getting the disease given that I
don't smoke and I feel like this is a
much easier thing to remember and once
you have that then you can convert this
back into the formula of the contingency
tables by using these ideas right so you
should be familiar with converting
between this notation and contingency
table notation hopefully your last
assignment forced you to think about
that stuff a bit the base rate and this
is the reason I show you this is there's
a standard error called the base rate
fallacy that's where you look at this
data. so when you look at this, it's pretty obvious that the yellow taxis are safer, but when you get this data you're like, wow, there are many more accidents involving yellow cabs. you have to sort of wiggle this stuff around a little bit to finish the derivation, so
I'm turning the numbers that we have
into conditional probabilities so we
know the overall probability of an
accident we know the conditional
probabilities of having an accident
depending on which cab you are in, but we don't know how many yellow cabs there are. oh, actually I've done this wrong: 75 accidents involving a yellow cab, this is wrong, this is just accident given yellow; these should be N's, not P's.
what we actually want are these numbers, and to get those numbers we need to know the total number of yellow and blue taxis. so another way to think about this: you need four numbers to make this work. if you're gonna ask which taxi company is safer, you need every entry of this contingency table, because ultimately you're computing a rate for each company, and that uses all four entries.
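here's that four-number calculation as a tiny sketch. the counts are made up for illustration (the lecture only mentions the 75 yellow-cab accidents, not the fleet sizes):

```python
# hypothetical contingency-table counts; only the 75 comes from the lecture
accidents_yellow, accidents_blue = 75, 25
n_yellow, n_blue = 1500, 300            # total cabs of each color (assumed)

p_accident_given_yellow = accidents_yellow / n_yellow   # P(accident | yellow) = 0.05
p_accident_given_blue = accidents_blue / n_blue         # P(accident | blue)  ~ 0.083

relative_risk = p_accident_given_yellow / p_accident_given_blue   # ~ 0.6
print(p_accident_given_yellow, p_accident_given_blue, relative_risk)
# yellow has three times the raw accidents but the lower rate per cab
```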
now I'm gonna expand on this a little bit into this evidence-and-hypothesis framework, which is an extremely clear framework for thinking about many statistical problems.
so let's say we observe this person
coughing and we want to know and this is
why I introduced the base rate we want
to know whether she has a cold. let's say we know that most people with colds are coughing, so we know that P of coughing given cold is 0.9. what is the probability that she has a cold, given that we see her coughing? we know there's this link between having a cold and coughing, so what is the probability she has a cold? yeah, right, we are missing the base rates, okay,
so here's the basic issue again: what we have, what we just learned, is the probability of the evidence given the hypothesis; what we want is the probability of the hypothesis given the evidence. so we have to go the other way.
the way we get that is Bayes' theorem. now, Bayes' theorem is not magic, it follows immediately from the definition of conditional probability; it just tells you the relationship between A and B. so it tells us that the probability of the hypothesis given the evidence is gonna be the probability of the evidence given the hypothesis, times the probability of the hypothesis, divided by the probability of the evidence. there we go, okay.
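written out:

$$ P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)} $$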
so in this case, if we translate this out, this is probability of coughing given cold, times probability of cold, divided by probability of coughing. all right, so it tells us how to reverse this. so let's do this: what are reasonable values for each of these things? we were given coughing given cold at 0.9. what is P of cold, how many people have a cold at any
given moment what fraction of the
population
15%? how many people in this room have a cold? depends on the season, or in the world? I don't know, you're usually atypical.
ok, what do you want to call it, 5%? ok, overruled, 5. yeah, you need to get out more so you get more colds. what is the probability that someone is coughing? a 10 percent? ok, 0.1. so what do we get: 0.9 times 0.05 is 0.045, and divided by 0.1 that's 0.45. okay, did I do this right? there we go, so okay, 45 percent. let's do it this way. hey, look at that,
we ended up with exactly the same numbers, huh. I'm always curious what people will guess. yeah, exactly, there you go, so there's the calculation worked out in detail, and it's not a complicated calculation; the only thing you need is the conceptual machinery to see how all these numbers are related. okay,
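here's that calculation written out with the numbers the class settled on:

$$ P(\text{cold} \mid \text{coughing}) = \frac{P(\text{coughing} \mid \text{cold})\, P(\text{cold})}{P(\text{coughing})} = \frac{0.9 \times 0.05}{0.1} = 0.45 $$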
these things have names so probability
of the evidence given the hypothesis or
the data given the hypothesis is called
the likelihood the probability that you
get the evidence at all is called the base rate, so, how often do I see that data; and the probability that the hypothesis is true
without any other information is called
the prior Bayes theorem is also
extremely important in interpreting
diagnostic tests such as for mammograms
in this example or let's say you've got
your terrorism detector and someone
tells you that if the person is a
terrorist and they walk through the
scanner the alarm will go off with 99
percent probability but that's not what
you actually want again it's the same
problem you don't want the probability
that the alarm goes off if they're
terrorists
you want the probability that they're a
terrorist if the alarm goes off that's
the that's a significant number right so
this is the same thing right if you get
a positive mammogram under this scenario
how likely is it that you actually have
breast cancer
and here's the situation, and again I've borrowed from Nate Silver. the idea is that the dark squares are positive mammograms and the pluses are cancer, but because cancer is rare, actually most of the dark squares do not have cancer. this is why you do follow-up tests. you get the same thing with any rare disease; you get a very similar situation with HIV tests, or your terrorism detector. and here's sort of the logic, I'm gonna walk through this pretty fast:
again we have this sort of this is a
contingency table in graphical form and
we have these basic quantities right so
we have the probability that the test is
positive if you have cancer and so this
is the easy one to get right we can do
this this just depends on the test
we have the probability it's positive if you don't have cancer, and again these all have names: false positive rate and positive predictive value is one set of words; precision and recall is another set of words, which we think about in information retrieval; in medicine you use
sensitivity and specificity which I
forget how they relate to PPV and so
forth but they're all names for these
particular conditional probabilities and
so once you understand the conditional
probabilities you can work with all of
this stuff here's just the base rates of
cancer and here it's the same form of
problem: we know the probability of positive given cancer and some other stuff, and we want to know cancer given positive, and you can work with it. so here you go, we want this piece of the chart, whereas what we're given is this and some other information. you can work this through; you end up having to compute the overall rate of positives.
here's Bayes' theorem at the top; you grind it through and you end up with this 9.6 percent chance. all right, so most positives are actually false positives, and for your terrorism detector this is going to be like 99 percent false positives.
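to make the grind-through concrete, here's the calculation as code. the slide's exact inputs aren't in the transcript, so the numbers below are assumed (roughly the ones Silver uses: about 1.4% prevalence, 75% sensitivity, 10% false positive rate); they land near the 9.6% quoted above:

```python
# assumed inputs, roughly the numbers from Silver's mammogram example
p_cancer = 0.014             # base rate: P(cancer)
p_pos_given_cancer = 0.75    # sensitivity: P(positive | cancer)
p_pos_given_healthy = 0.10   # false positive rate: P(positive | no cancer)

# overall rate of positives, summing over both ways a positive can happen
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_healthy * (1 - p_cancer)

# Bayes' theorem: P(cancer | positive)
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(f"{p_cancer_given_pos:.3f}")   # ~0.096, so most positives are false positives
```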
there is sort of a philosophy that goes with Bayesianism that is interesting, and so now we've gone from mathematical formalism to epistemological virtue, and
the idea here is that we should only
believe things for which we have
evidence in other words evidence is the
thing that justifies a belief which
means E is evidence for X if the
conditional probability of X given E is
greater than just the conditional
probability of X alone; in other words, if we revise our probability estimate of the hypothesis upwards when we see that evidence. so that's the formal definition of evidence for something, and the normative part of that is to say that we should only believe things for which we have evidence, in this sense.
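in symbols:

$$ E \text{ is evidence for } X \iff P(X \mid E) > P(X) $$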
unfortunately, it's very hard to formalize the world this way; it only works in a few situations. and again,
here's the basic form right we start
with a model so this is the model which
gives us the likelihood, how often do we cough if we have a cold, and then we start
with the base rate how often do we see
that evidence and we start with the
prior which is how likely is that
hypothesis to be true in the beginning
and we end up with this thing, which is normally called the posterior, which is how likely the hypothesis is given the evidence, and that's what we just did with our example here. now this gets arbitrarily complicated when this starts to get into things like, you know,
what is the probability of pulling
someone over given their race then we
start to have to do much more
complicated calculations when we have
large amounts of data, and Bayesian inference and statistics is this whole subject, but this is the fundamental formula: all we're doing is flipping the conditional probabilities, because we normally have this and we normally want that. this is what a model tells us, this is a simulation, and in fact I'll show you how this works using a simulation. so: p-values really only talk about the null hypothesis; Bayesian statistics usually talks about comparing multiple
hypotheses so here's a very simple
example again this is also in my book
you put a traffic light at an
intersection and you ask did it reduce
accidents so say you're looking at this
chart and you're trying to write a story
saying huh did the traffic light work so
just looking at this data what do you
think did the traffic light work did it
reduce the accidents
okay maybe a good answer okay so to
answer this question what we do is we do
two sets of simulations one is we
simulate how many traffic accidents we
would get without the stoplight so
everything to the right of the red bar
here is fake data this is simulated data
which means we need a model a model that
tells us how often we get each number of
accidents. in other words we have a model of the intersection without a stoplight, which is this term that we're computing: probability of the evidence given the hypothesis, where the hypothesis is no stoplight and the evidence is these numbers here. and we immediately run into this other problem: we have to make up some definition of reduced traffic accidents,
so let's say we'll call the traffic
accidents reduced if all of the values
after the red line are lower than the
minimum of these two values right so
that means that this one and this one
are lower. right, we have to make up some definition, and this is the problem of translating words into numbers: what does it mean to say that the traffic light reduced accidents? well, we're making up a definition. how do we simulate these things? maybe we draw randomly from the history of the traffic data before the stoplight was installed; maybe we compute the average rate and use a Poisson distribution and draw from that Poisson distribution. we make some model, and then we make fake data, and we count how often the data produces the thing we're looking for,
and then we make different fake data: let's say we imagine that the stoplight reduces traffic accidents by 50 percent, so basically we take our model and we simulate everything with the number of accidents cut in half, and we ask how often do we see a reduction by our definition; in this case it's one, two, three, four, five, six, seven. and then we're going to ask what is the ratio of where we saw fewer accidents to where we didn't. do I have that? yeah, okay.
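here's a sketch of that pair of simulations. the accident counts and the Poisson model are assumptions for illustration, not the numbers on the slide:

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical data: accident counts per period before and after the stoplight went in
before = np.array([9, 11])
after = np.array([5, 7, 6, 8])
baseline_rate = before.mean()
trials = 100_000

def p_reduced(rate):
    # how often a Poisson model at this rate produces a "reduction" by our made-up
    # definition: every simulated post-stoplight value is below the minimum "before" value
    sims = rng.poisson(rate, size=(trials, len(after)))
    return (sims.max(axis=1) < before.min()).mean()

likelihood_no_effect = p_reduced(baseline_rate)      # H1: the stoplight did nothing
likelihood_halved = p_reduced(baseline_rate / 2)     # H2: the stoplight cut accidents in half

print(likelihood_no_effect, likelihood_halved)
print("likelihood ratio (halved vs. no effect):", likelihood_halved / likelihood_no_effect)
```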
so what we're actually computing is the ratio of the probability of hypothesis 1, the stoplight did nothing, to hypothesis 2, the stoplight cut
accidents in half and if you grind
through this what you actually find out
is that that ratio is equal to the ratio of the prior probabilities, how much more likely we thought the stoplight would work to begin with, times the ratio of the likelihoods. so
doing that simulation what we get is
this term okay
we're simulating, so we're assuming the hypothesis, that's on the right, and we're getting the evidence on the left; we're taking how often we see the data (evidence, data, same thing in this case) given our hypothesis. and a little algebra tells us that the ratio of the probabilities of the two hypotheses is equal to the ratio of the likelihoods times the ratio of the priors.
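that algebra, written out:

$$ \frac{P(H_1 \mid E)}{P(H_2 \mid E)} = \frac{P(E \mid H_1)}{P(E \mid H_2)} \times \frac{P(H_1)}{P(H_2)} $$

the first ratio on the right is the Bayes factor (the ratio of the likelihoods, which the simulations estimate); the second is the ratio of the priors.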
so the Bayes factor, the thing we can compute by comparing these two different simulations, you can think of it as telling us what is the strength of the evidence. it doesn't tell us what the overall probability is, because maybe we know that stoplights reduce traffic accidents already, maybe we know from every other city that when you put in a stoplight it cuts accidents in half, in which case this prior would be very high; or maybe we have no reason to believe that stoplights work, in which case we start with 50/50.
okay, so this is one of those sort of mistakes of the p-value approach: it tries to come to a definitive conclusion, is this true, is this not true, based on only the data we're analyzing and not all the other data in the world.
the prior here you can think of as incorporating all of the other information we could possibly have, and this is in a sense more honest, because the thing that we can calculate is only what this data that we have right in front of us is telling us. so
this is called the Bayes factor you can
think of it as what is the relative
strength of the evidence for different
hypotheses. and there's no sort of agreed-upon threshold; people don't really say a Bayes factor of greater than two means therefore we accept this as true. the whole concept of
accepting or rejecting a result based on
a threshold is flawed again we can never
reduce uncertainty we can only measure
the strength of the evidence so here we
go here's ways of thinking about the
strength of the evidence by looking at
Bayes factors. and this is a kind of a weird table, right, because what does strong evidence mean; this is one attempt to associate numbers with words. but you know, by the time this ratio gets up to about 20, which is about where the standard p of 0.05 is kind of set, we say it's very strong. so the way to interpret a Bayes factor of 20 is that a model which assumes the hypothesis is true is 20 times more likely to generate the data that we actually saw. that's why I'm tying this
back to simulation ok this is
a ton of theory I know I've gone just
really rapidly through this so any
questions about these concepts before we
move on
hmm I know I don't quite believe it but
I will say this you will in your career
have to interpret p-values so you have
to understand what they are but honestly
if you're trying to solve a statistical
problem phrase it in the language of
conditional probability you're gonna do
a lot better it's gonna be a lot clearer
and if you have to do something that is like a significance test, which by the way I think you shouldn't, I think you should just look at confidence intervals and effect sizes,
but if you have to do something like
that then think about the relative
strength of the evidence between
different hypotheses that's one of the
problems with the null hypothesis idea
because it only measures the strength of
the evidence for one hypothesis which is
the one you don't care about I want to
show you a lovely little interactive
here how many of you have heard of the
replication crisis in science
yeah it turns out that a lot of
published papers especially in
psychology don't replicate, and you know, p-values were supposed to prevent this, so why isn't it working?
so here god I wish I could just get rid
of this banner so here's a lovely little
interactive from a long story called
science isn't broken which demonstrates
the issues with using statistical
significance as a test. the idea here is we're going to test: does the economy depend on whether Republicans or Democrats are in office? and what we can
do is use different definitions of
economics and different definitions of
politician and you know maybe a couple
other factors right so if we measure
only GDP and we measure only presidents
then no but if we measure GDP and
governors, then yes, look at that, all right, we got a p-value smaller than 0.05. if we measure stock prices and governors, no. if we measure everybody... anyway, you can sort of see how this works, whether we include or exclude recessions, and you can always find a combination that gives you a significant p-value. again this is this problem of translating words into numbers,
so this is known as the garden of forking paths, meaning that if you
change your data analysis you can get a
different significance level this is a
beautiful example of that and the
article's worth reading. so what happened here is a journalist collaborated with a German researcher, a real researcher who actually worked at a German medical school, and they did a real
scientific study with all the standard
protocols of feeding people chocolate
and then measuring a bunch of variables
and then they found that the people who
ate the chocolate had lower weight, significant at p 0.05. they wrote it up, they published it in a scientific journal (previous research by this researcher had shown that a lot of scientific journals don't actually read the papers they accept), so they got it published in a
peer-reviewed journal then they sent a
press release with a link to the paper
to a bunch of media organizations and
some of them wrote stories about how
chocolate causes weight loss so this is
our standard scientific process; they did everything right: a real researcher, following standard study protocol, publishing in a peer-reviewed journal, with a statistically significant finding. what was the problem? how did they get this nonsense finding?
no, totally standard people, they randomized them all, there was nothing wrong with the study design, this was a completely standard study design. so what's the issue? nope, no, it was properly randomized. that's one issue, it could be small sample size, but p-values are designed to correct for sample size.
all right because think about doing that
permutation test if you have small
samples you're gonna get more variance, you're gonna have more variance because you're taking averages, and that's what
you're doing here you're comparing the
average of the chocolate group to the
non chocolate group so p-values should
correct for sample size so what is the
issue? anyone know? I did actually say it, but most people aren't trained to look for it: the problem is they measured 20 different variables, so chocolate is going to show an improvement on something, it aids sleep, it lowers cholesterol, but of course weight loss is the best headline.
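to see why 20 outcome variables is enough, a quick back-of-the-envelope, assuming for simplicity that the outcomes are independent and there's no real effect at all:

$$ P(\text{at least one } p < 0.05) = 1 - 0.95^{20} \approx 0.64 $$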
this has a name in science; this is from a paper by Andrew Gelman, who's one of the Bayesian heavies, and it's called the garden of forking paths, which is from the title of a Borges short story, which I recommend.
you don't have to be fraudulent to get a significant but nonsense value; all you have to do is pick a method of data analysis which gives you that value. so this is the key question: would the same data analysis decisions have been made with a different data set? how do we solve this
problem how do we avoid getting nonsense
results just by picking different
definitions of what we measure or how we
do the analysis? doesn't this destroy the whole idea of data analysis? anybody got an idea of an answer to that one? yeah, warmer, check for quarterly... I mean that's a good idea anyway. yeah, how do we get out of this trap?
so I think there's basically two answers
one is statistical significance the
whole idea of statistical significance
has serious conceptual problems that
we've touched a little bit on today
right because you're trying to turn
uncertainty into certainty so use effect
sizes instead: rather than saying this policy definitely reduced crime rates, say the crime rates fell by five percent, but there's statistical error so it could actually be anything from an increase of 1% to a decrease of 10%. all right, find a way to show the uncertainty; we're still learning how to do this in journalism, how to talk about the uncertainty. the other thing is, oh, we saw that, robustness, which we're just gonna have to do this later. robustness
is an idea that appears in a lot of forms; in the social sciences they call it triangulation. this is a quote by one
of my heroes charles sanders peirce who
appeared earlier in the handwriting
example and what he's saying is that
rather than using one type of argument
to know if something is true find lots
of arguments so going back to our
hacking-our-way-to-scientific-glory example: if we got a large effect size no matter what definition we chose, that's a sign that the result is strong; we say that is a robust result. okay, what we're looking for are results that don't depend on the definitions we choose, the timescales we look at; and we also want it to be true if we
look at different data sets we want it
to be true if we interview people as
well we want the qualitative and the
quantitative data to match up. right, the p-value is asking the wrong question; that's this chart here. the p-value is using a statistical model,
and there's a scientific, let's say substantive, model: here's a model of the
real world we think we have students and
teachers and classrooms and we think
that if there's fewer students in a
class they're gonna get more attention
that is this substantive model the
statistical model is just this
permutation test the statistical model
is not the substantive model right and
if you only draw conclusions based on
the statistical model while ignoring
what happens in the real world you're
gonna have problems. all right, so the p-value will not tell you whether your hypothesis is true or not; it's just one tool, one way to evaluate the evidence, and a strong conclusion will have multiple lines of evidence. all right, that's all we're gonna do for today, thanks everyone.
interesting there used to be a bottom
where you could stop it
but if you look at this for a second you
can imagine if you wait long enough you
will see almost any trend right so
imagine yourself writing a story about
interpreting this trend oh it fell later
in the year
oh it peaked here early in the spring oh
it increased at the end of the year oh
we got a peak and then it fell off in
December right you can imagine almost
any story available just due to noise or
in this example here it was actually
increasing but if you wait for a moment
there that's more of a flat trend okay
so this is one of the basic statistical
problems we need to solve and data
interpretation which is if there is
randomness or noise in our data then we
need to know how much and we need to
know whether we're likely to be misled
by it so almost every article that is
written about the jobs rate you know
almost any every article that said you
know it fell by 50,000 and analysts sold
off stocks as a result almost all of
them are probably wrong because you're
well inside the noise and by the way the
noise here comes from it's just a
sampling error so what you're actually
doing when you compute the change the
job growth numbers is you take the
result of the monthly Employment Survey
which is well there's actually two
there's a household survey and there's a
company survey I forget which one they
use for this I think it's the household
survey so that's a big monthly random
sample and so then you get some number
you know you say ok the labor force is
120 million or something and then next
month it's 120 million point 1 and you
subtract the two and that's the job
growth rate
the reason the noise is so large is
because you take this huge random sample
and you get you know maybe a 1 percent
margin of error well 1 percent on a 100
million is a million so actually you can
see that they do a lot better than 1
percent so what you're seeing is the
noise that's results from subtracting
two very large numbers which reminds me
of one of my very favorite pieces of
financial journalism Bank of America not
living see if I can find it yeah
Bank of America made 168 million last
quarter more or less so this is Matt
Levine at Bloomberg who has a fantastic
financial journalist and what the point
he's making here is that a hundred and
168 million is a rounding error on their
income depending on the accounting
standards they use depending on you know
whether they're counting certain things
this year or next year I mean you can't
even tell if they were profitable or not
right and this is not even statistical
error this is just the rules about how
to count this are very complicated
because and the reason this is
complicated because you're looking at
very small percentages of very large
numbers so it's the same situation of
the job report
yeah Thank You Vera cos earnings are a
rounding error so when you're thinking
about two large numbers and comparing
two large numbers the difference is
going to be incredibly noisy because
that's what you're doing here you're
comparing this year's earnings to last
year's earnings and because there's so
much variance just in getting this
number the difference is not that
interesting
okay so that's the idea of variation
we're going to talk a little bit more
about randomness so if I show you these
two pictures which one do you think has
a random placement of the stars how many
for the left okay how many for the right
now all of you on the left good I feel
like my students are getting smarter
over the years okay
in fact the one on the right is less
random in some sense because it it broke
it down into grids so we might think of
randomness is sort of unpadded
but randomness actually has a lot of
pattern in it right think of think of
all of the different ways that you can
have 12 months of job growth most of
those ways show some sort of pattern so
here's another quiz here is some some
data and eight of them are randomly
generated values and one of them shows
an actual statistical trend which one do
you think is the real one any guesses
top sender why okay
any other guesses bottom-right it's a
nice decreasing trend yeah actually
they're all random data right so the
point I'm trying to make here is that if
you're looking for a pattern and noise
you will find it so whether you see a
pattern or not is not the interesting
thing right that's not that's not the
criterion we're going by if we're going
to do inference from data another
principle I'm gonna try to get across is
that when you have less data the problem
is more severe so here's the same thing
again I've just generated a bunch of
random points and drawn regression lines
through them notice that compared to
when we have less data the lines here
are a bit flatter and this is a basic
statistical principle which is that if
you're if what you're looking at is
noise the more of it you have the easier
it is to tell that it's noise right when
you get small samples it's much easier
to see a pattern we get very small
samples all the time right so let's say
we're talking about the number of
homicides in Chicago and whether it's
going up or down so you do some story
and you're like well look at the last
five years of data it's a lot higher
that was now last year than it was five
years ago it's only five data points if
there's any sort of random variation in
that it's very likely that it's going to
completely swamp the actual data this is
part of the problem with trying to do
these sorts of comparisons is you have
very little data so here's so this again
this is like stats 101 stuff and I
expect you've all seen some version of
this right but here are sort of the
basic issues that we're going to have to
try to
deal with and this is why statistical
significance was invented is so we try
not to fool ourselves by seeing patterns
in randomness and in particular this is
a bigger deal when you have less data
the easiest way to start talking about
statistical significance is going back
to this classic type of statistical
question and this has been a you know
people have been talking about this for
hundreds of years here we have a
histogram of die rolls which I just
built in our using the code at the
bottom and we roll the die 60 times so
we should get 10 for each but we get it
looks like 13 or 14 for two is this way
is this die weighted towards a two what
do you think no okay is there a
principled way to answer the question
so there's a lot of different versions
of this question one of the most famous
ones which was proposed by a
statistician named Fisher it's called
the the lady tasting tea problem and the
problem is this so if any brits among us
know okay so apparently Brits feel very
strongly about the order in which you
put the tea in the milk and I can't
remember if it's milk before tea or tea
before milk at this point but let's say
you prepare a bunch of cups of tea half
of which have tea before milk and half
of which have milk before tea and then
you bring in you know a an aristocratic
woman and you say ah please taste this
tea and I'm gonna give you either all
milk before tea or all tea before milk
and I want you to tell me which order
you think each each cup is and the
question is how many times does she have
have to sort of sorry I'm not gonna give
all you're gonna give her each cup is
going to be one or the other right
you're gonna give her randomly each cup
either MLP for tea or TV for milk and
she has to guess which one it is how
many times does she have to be right
before you start to think oh yeah she
really can tell the difference this
really does make a difference because
you know she can get three right guesses
one out of every eight shots right what
if she has three right guesses and two
wrong guesses how often does that happen
so it's the same type of problem
so in order to answer these types of
questions oh yeah so here's another one
how about this one do you think this is
a loaded die or do you think this is a
fair die
okay you still say it's fair okay so you
know I know the answer because I wrote
the code so I either did 60 die rolls
and just kept doing it until I saw this
or I made it over wait the ones and did
it once right so which one did I do yeah
I mean I made this so would it take
hours to get this how many time how many
rolls would it take to get to get that
or how many sets of rolls because each
of these is actually sixty titles okay
so clearly the answer depends on not
only the the difference that I observe
but the how often I would get that
difference by chance all right so we're
building up to sort of the basic
statistical framework behind
significance testing and in fact all
kinds of statistical work here now I
roll the dice 60,000 times okay and I
get a very small difference I'm looking
at what it's still about a 10%
difference so the difference is about
the same as this one right it's not much
you know it's a little bit smaller but
what do you think is this a loaded dice
okay so why are you so certain for this
one which means what why is the larger
sample size relevant
well look at this stuff this is
relatively flat right why is it
relatively flat yeah when you have a
larger sample size your averages have
less noise okay and this is these are
really all averages right I'm just
counting how many times I get each one
so this is the you know the average
number of rolls that got a three so you
can start to see I hope how these
factors play into each other and we're
gonna make this more precise in a second
with one die we have a flat distribution
with two die we have a non-uniform
distribution we can derive the
distribution analytically by just
counting how many ways we we have to
roll each number right you can usually
for simple problems you can just compute
the results combinatoric ly it's
basically just a symmetry principle you
just say well the number of times we
expect to get a two on the on the white
dies the same as the number times we
expect to get a four and so we just run
through all of the possible combinations
that way when we actually simulate it
the histograms don't look anything it's
clean I think here I've got another 60
rolls or something and you so you can
see they're a bit all over the place
with samples this small and if you only
do 60 rolls and this one is rigged to
give two fives for example let's say
that's why we get all this 10 it may be
very hard to tell that something is
wrong with the die until you roll at
hundreds of times there are cases so so
far we've talked about cases where
there's an analytical solution you can
sit there with some algebra and figure
out what they what the fair distribution
looks like when you're doing your actual
data analysis or if you're doing some
sort of data visualization you may not
have an analytical answer to what noise
is going to look like right that's kind
of the question we're answering by doing
these
simulations is what does noise look like
so this is from a beautiful paper which
is in the readings which sort of pulls
out a lot of these statistical ideas
into into a visualization framework and
here we're showing cancer rate per
County and what we've done in this image
is one of these is the real data the
rest are synthetic data generated by
taking the original cancer rates and
scrambling the counties okay so this is
our first instance of generating a null
hypothesis by randomization so I'm
introducing a couple ideas here right
first is null hypothesis and what that
means is you can think of this
colloquially as the data is just noise
okay
our theory is that what we're looking at
there's actually no pattern in the sense
that we care about the die is fair
there's no relation between location and
cancer rate there's no difference
between the two classes the jobs rate
was flat you know whatever it is we're
saying that the actual pattern we see is
just noise in some way and then the
other idea we're introducing is
randomization and we're using
randomization to generate examples of
what the data looks like under the null
hypothesis so we've already been doing
this when we do this type of thing what
we're using is we're using a model of
the null hypothesis all right we're just
rolling to fair die and adding them when
we do this the way we make the model is
okay so we have some counties this is
what our data looks like
and each of these counties has a certain
cancer rate and then what we do is you
know the null hypothesis is that there's
actually no association between where
the county is or which County it is and
what cancer it is so we permute to this
distribution right we say you know let's
just let's just scramble them a little
bit okay we just all we do is we
randomly reorder these things or in the
case of our County data here well we
just swap all the counties around and
what we get is something that has the
same distribution of underlying rates so
if you draw a histogram of the County
data you see exactly the same thing
we're gonna change the data but if you
map this stuff what you see is this
alright what you see is the some of the
data that we could see if there was no
relation between where the county is and
the cancer rate so then the question is
can we tell which is the real data and
which is the synthetic data because if
we can't tell the two apart then the
pattern that we're seeing in this in the
real data is a pattern that happens just
by chance and it's much less likely to
be meaningful so anyone want to take a
guess what the real data is yeah
everyone said three three is correct so
the fact that you could correctly guess
which was the real data means that there
is actually Geographic pattern here that
is distinguishable that looks looks like
fairly accurately distinguishable from
random data okay so you have all just
provided statistical evidence that
something is happening that is not
random
I think we looked at this earlier when
we were discussing text analysis this is
this is an analysis of the anachronisms
and downton abbey' did we talk about
this earlier yeah okay so on the
horizontal axis is so it's taking every
two-word phrase in the scripts of
Downton Abbey and first of all plotting
the overall frequency it's on a log
scale so 10 to the minus 1 means every
tenth diagram is this I think that can't
be quite right because 10 to the 0 would
be every Biograph is this mm anyway
something like that though right so the
more the the more common words go to the
right and the vertical axis is how much
more commonly we see this in the Downton
Abbey scripts than in Google books
results from the same time period right
so 0 here so it's a logged it's the log
of the ratio so 0 means we see it at the
same amount so of course you see
everything clustered around 0 which
means the language is broadly the same
and then this stuff up here write all of
the stuff that starts to get high on
this chart is over-represented right so
to here that means we see this a hundred
times more often just need we see about
hundred times more often in the Downton
Abbey scripts and in the books corpus
now the question I want to put to you is
what would be the shape of this
scatterplot if actually the language in
Downton Abbey is drawn from the same
distribution as the language in Google
Books so that's our null hypothesis here
right so null hypothesis and I'm being a
little more technical here
abby's scripts from same ORD
distribution as google books of the same
era right I don't know it sits around
World War one I'm not sure exactly which
years he used so how think about it this
way rather than trying it so there-there
is you can do an analytical analysis
right so you can algebraically work this
out but I don't want to so let's imagine
simulating this how can we make a
simulated version of this chart
so think of this as making fake Downton
Abbey scripts
any guesses
let's generate some Downton Abbey
scripts that are drawn from the same
distribution as the words that are
stored in Google Books how do we do that
yeah we don't need to read the scripts I
don't need to make any sense there just
yeah so if there's a hundred thousand
words in the Downton Abbey scripts then
let's sample a hundred thousand diagrams
from the Google Books corpus which we
can do by the way you can download the
frequency tables for Google books for
each year for up to 5 grams all right so
we can just download the data and do
this ok so we can generate fake data and
then we can produce this this plot and
we just do that by then basically just
take just comparing the sample that we
have to the Google Books baseline we can
generate this stuff right so we take
each diagram might want we say ok how
many times did we choose it in our
scripts which gives us the x-axis and
how many times do we have it in as
compared to in Google books which gives
us the y-axis and we go and so when we
do that we will get a certain shape and
I want you to think about what that
shape might be what is that what shape
would we see for this chart any guesses
right okay yes okay so good it's a good
guess that's a good start right so the
zero line is means the ratio is the same
because their log one is zero right so
because we pull them from the same
distribution we should get stuff around
this zero line but we're not quite going
to get stuff around the zero line what's
that like how we generated the well
because we're not going to because the
google books is this huge big
distribution if we sample a certain
number of words some words are gonna
appear more often by chance some words
were never we're never going to sample
yeah yeah yeah so right yeah the
randomness is how is how often did we
pick this word right so if we by chance
picked a word a lot it'll be up here if
we didn't pick a word very much it'll be
down here I think it's in Google Books
yeah I'm not sure that okay
so here's the thing though we're going
to get this sort of funnel shape and the
reason we're gonna get this type of
funnel shape it's gonna sort of walk up
the chart here it's gonna be denser in
the air but it's going to sort of go up
here and the reason is because these are
very rare words right so think about it
if a if a word appears once and I'm just
gonna say word it's actually by grams if
a word appears once in every hundred
thousand words okay and we have one in
our script of a hundred thousand words
then we're gonna be on the zero line but
say by chance we pick it twice well now
it's twice as often so it's gonna go up
here so the rarer the word is the more
difference choosing that word a single
time makes on the vertical axis of the
chart another way to think of this is as
we go to the left of the chart we get a
smaller sample size because remember
we're comparing the ratio of two samples
this is the sample in Google Books this
is the sample on the script alright if
the samples are very small we're gonna
get a lot more noise so if you actually
do this experiment you will get a shape
that looks like this and so if we
interpret this chart we have to sort of
imagine that there's a curve which runs
kind of like this and we're really only
interested in the stuff above the curve
in fact what we could do is we could
numerically calculate what is the curve
that represents we only get a value this
high one out of every hundred samples so
you see this characteristic type of
funnel shape in a lot of cases so for
example if you plot if you take school
test data where you have the test
average test score for the class and you
have the class size and you plot class
size on the horizontal axis and test
score on the vertical axis the smaller
classes will have the most extreme
values if you have if you plot crime
rate per County well crime rate is is a
ratio right it's how many crimes versus
how many people if you have a smaller
population in the county one murder can
make the crime rate incredibly high
right that one crime has a much larger
effect on the rate so anytime you're
taking averages or rates when you have a
smaller group of people you get a lot
more noise which means and this is a
classic trap you know the classic story
is you download the crime rate data and
you say oh which county has the highest
rate of car theft you know what it's
gonna be one of the smallest counties
always or at least most likely because it just has the most noise so when
you're analyzing this any type of data
which has some sort of noise or random
process in it that first question you
have to ask is what does this look like
if there is no pattern which means you
have to figure out what no pattern means
and for the example we just did no pattern means I just defined it to mean the same distribution as Google Books of the era here's another example this is from The Signal and the Noise which is Nate Silver's book on
prediction and what he's talking about
is the claim that hey look the
temperature rise stopped I think now we
have data almost ten years later and
it's going up again but in 2010 people
were saying like oh yeah global warming
has stopped look right it's been flat
over the last 10 years and but you know
you can see there's a variation here the
biggest variation by the way is the
11-year sunspot cycle so there is actual
astronomical sources of variation here
as well on multiple time scales there's
some like hundred thousand year cycles
as well but so he posed the question how
often are we going to see a flat decade
even if the temperature is actually
going up so how would you answer that
from this data any guesses
what's that well some decades are going
up right most I mean there's a general upward trend right so that's coming up
that's coming up but this is a flat
decade that's a flat decade that's a
flat decade that's a flat decade that's
a flat decade so what he did is he said
ok well how often do we see a flat or
decreasing decade just from that data so
in this case he's just using the data
itself to get an estimate of how how
likely it is because we look at this and
we know there's a general upward trend
right so there's definitely a long-term
upward trend and what he's saying here
is even with this general upward trend you're very often going to see
decade-long
decreases in temperature so seeing a
decade-long decrease in temperature
doesn't mean that the upward trend has stopped right what's that no you can I mean you can say something that's totally factual you can say what the temperature did over the last decade right but to say that the temperature trend in the last decade has been flat therefore global warming has stopped that's nonsensical because we
can see just from the historical record
that it appears that there are decreasing trends on the timescale of a decade fairly often and yet we know because we have the subsequent data that this decreasing trend was not the end of global warming
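here's an illustrative sketch of the same question on a synthetic temperature series, the trend and noise levels are assumptions and not the real record

```python
# synthetic temperature series with a steady upward trend plus noise:
# how often does a ten-year window look flat or decreasing anyway?
import numpy as np

rng = np.random.default_rng(1)

years = np.arange(1900, 2011)
trend = 0.01 * (years - years[0])             # assumed +0.01 degrees per year
noise = rng.normal(0, 0.1, size=years.size)   # assumed year-to-year noise
temps = trend + noise

flat_or_down, windows = 0, 0
for start in range(len(years) - 10):
    slope = np.polyfit(years[start:start + 10], temps[start:start + 10], 1)[0]
    windows += 1
    if slope <= 0:
        flat_or_down += 1

print(f"{flat_or_down} of {windows} ten-year windows were flat or decreasing")
```

with these assumed numbers a noticeable fraction of windows comes out flat or negative even though the underlying trend never stops rising which is the point being made here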
okay so I hope you're seeing the
sort of conceptual connection between
all these examples okay they all rely on
the idea of how often does the thing I'm
looking for happened by chance given the
the inherent structure of the underlying
data so here's another example this is
what I call the lottery fallacy this is
also due to Nate Silver where was this 1976 it was thought that there was going to be a big flu epidemic and so there was a huge vaccination program and some of the people who had the vaccine died three people who were vaccinated in one clinic in Pittsburgh within the same hour all died that day so this is the same structure as does it make a difference if you put the milk or the tea in first or do we see a pattern in the cancer rates here
right look at the structure of this
argument it is extremely improbable that
such a group of deaths to take place in
such a peculiar cluster by pure
coincidence that's the argument right
it's the same structure of the argument
this is not chance the null hypothesis
is wrong or another way to say this is
the null hypothesis in this case would
be something like the vaccine is safe
they died from something else and this editorial is explicitly saying the null hypothesis does not generate the observed data with high enough probability
so Nate silver being nate silver said
okay let's calculate how high the
probability is so here you go here's
what he did basically he just he takes
the death rates for elderly Americans
which he defined 65 or older he asks how
many people how many elderly people
visited a vaccine clinic he makes an
estimate of the number of clinics
converts that to the chance of any one person dying and then to the chance of three people dying and
there you go
so this is probably what the person who
made the argument the editorial argument
was thinking about four hundred
eighty thousand to one all right so it's
very unlikely that three people who went
to the same clinic in the same day will
die from some other cause however there
are five thousand clinics doing this and eleven days over which this happened so
when you put that all together the odds
are about one in nine right so greater
than ten percent so here he's making the
opposite argument he's saying no this is
not extremely improbable this is about
one in ten which now we have the problem
of well do we think the vaccine is safe
or not what do you think is a one in ten chance of this happening by chance
does that suggest that there's a problem
with vaccine or does it suggest that the
vaccine is probably quite safe
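here's roughly that arithmetic as code, the inputs are placeholder assumptions chosen only to land near the numbers quoted above, they are not Silver's actual figures

```python
# back-of-the-envelope version of the lottery-fallacy arithmetic; all inputs
# are placeholder assumptions, not Nate Silver's actual figures
from math import comb

daily_death_rate = 1 / 7_000      # assumed chance an elderly person dies on any given day
patients_per_clinic_day = 160     # assumed elderly patients vaccinated per clinic per day

# chance that three or more of one clinic's patients die that same day
p, n = daily_death_rate, patients_per_clinic_day
p_three_or_more = 1 - sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(3))
print(f"one clinic, one day: about 1 in {1 / p_three_or_more:,.0f}")

# but there were thousands of clinics and many days for this to happen somewhere
clinics, days = 5_000, 11
p_somewhere = 1 - (1 - p_three_or_more) ** (clinics * days)
print(f"somewhere in the program: about 1 in {1 / p_somewhere:.0f}")
```

with these made-up inputs you get roughly the one in several hundred thousand figure for a single clinic and day and roughly one in ten once you account for all the clinics and days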
so what you're doing there is you're
invoking a different hypothesis and
you're saying it's not a problem with
the vaccine in general it's the problem
with the vaccine at that clinic or
something about what that clinic is
doing yeah that's a reasonable
alternative in fact let's start writing
these down right so the null hypothesis is the vaccine is safe then let's call this h1 clinic is bad h2 vaccine is bad h3 food poisoning right
and we can go on now we're starting to
get into what we're gonna look at the
end of the class which is the analysis
of competing hypotheses and the question
we can ask is how often will this
hypothesis
generate our real data okay so we have
some real data in this case three deaths
in one day or we have some real data in
this case you know this word cloud or
these images there's always some data
that we observe and the question we're
asking is for all of the hypotheses that
might be causing this how often will it
generate the data now this is a purely
statistical question the answer will be
phrased in terms of probabilities or
ratios of probabilities and so forth
that is a different question than should
we investigate the safety of the vaccine
this is the difference between
probability and decision theory
probability gives you a chance of
something happening decision Theory
tells you well what choice should I make
and decision Theory introduces costs
okay
it introduces the idea of an expected
value expected benefit expected loss
expected utility and although the
probability that the vaccine is bad
might be very low the cost of having a bad vaccine is enormous
so the expected value calculation
becomes very large this is similar to
our issue with setting the threshold for false negatives and false positives for pretrial release where you want to set that threshold
depends on what you consider the cost of
a false negative versus the cost of a
false positive so how much worse is it
to keep someone in jail who wouldn't
have committed a crime versus release someone who then goes on to
assault someone right how do you weigh
those so
probabilities themselves cannot tell you
what to do you have to combine them with
costs so this is sort of the first step
in a decision analysis because the thing
is this right even if we have a very
small chance that the vaccine is bad
because the potential outcomes of having
a bad vaccine and not handling that
situation are so costly even if this
probability is very small you still want
to investigate it
maybe stop vaccinating people so going back to this editorial right I'm not even saying that this person is substantively wrong right they're arguing here that we should probably be careful with this vaccine I'm saying that their argument that it's unlikely is not really true because there's a lot of chances for it to happen okay
thoughts on all this stuff questions
yeah no I mean we should I would say
this is definitely reason to investigate
that possibility however this so what
I'm saying is that if you're going to
argue that if you're going to argue that
something is improbable by chance put a
number on it all right
the way I actually think about it is is it that unlikely potentially not but you're right that's something you need to think about
what's that yeah I mean you may or may
not want to publish the calculation it's
probably better if you can rely on an
expert in that field to do these types
of calculations but when you start
looking for this you see it a lot right
what you see is the argument that
something couldn't possibly happen by
chance and almost always when someone
says that they haven't actually
estimated the probabilities
okay so we're zeroing in on the idea of
statistical significance which is this
idea of arguing that something must be
true because the alternative is very
unlikely so this is the earliest example
I can find this was the Howland will trial Sylvia Ann Howland was a rich American aristocrat who died in the 19th century and there was an addendum to her will that said that all of her money went to her niece and this is an original signature that they knew was real or at least it was
undisputed in the case these are the two
signatures on the addendum and the
argument was that her niece forged this
stuff and to prove this the prosecutor hired Charles Sanders Peirce who is best known as the inventor of the randomized controlled experiment and one of the founding philosophers of pragmatism he was a great 19th century scientific epistemologist and asked him
to compute how likely it would be that
these were forged and the way he did it is he took I think 41 pairs of signatures do I have this no I don't have that text here but it's late 19th century and I have this story in the curious journalist so arguing from the odds yes he took 42 signatures that the court believed to be genuine I think it was actually Peirce senior who did this part and he printed them out on transparent photographic plates and he overlaid every possible pair which is 861 possible pairs he broke the signature into 30 individual downward strokes so that's a stroke that's a stroke that's a stroke that's a stroke that's a stroke right and he asked how often the strokes had the same length and horizontal position and he found that the same stroke matched only a
fifth of the time so he said that for all 30 strokes to match which the disputed signature did happens by chance only once in 5 to the 30 okay now there are problems with this statistical argument but this gives us sort of a ballpark so his argument was that it is very unlikely that if Sylvia Ann Howland signed her name
that she would sign it exactly the same
on two different documents and we know
that from looking at the variations on
her real signatures so again what he's saying is in this case the null hypothesis which is normally called h0 is it's an original signature and h1 is forged by
copying right so by tracing it and what
he's trying to do is calculate how often
the null hypothesis replicates the data
and he's saying well it's super rare so
this same basic form appears over and
over and it is the fundamental concept
in hypothesis testing and it's a little
bit of a weird thing right because what
you actually care about is proving this one but to try to prove it what we do is we say well the alternative is super unlikely
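in the Howland case putting a number on super unlikely is a one-line calculation

```python
# Peirce's back-of-the-envelope argument: if any given downstroke matches
# between two genuine signatures about a fifth of the time, how often do
# all 30 strokes match by chance?
p_match_per_stroke = 1 / 5
n_strokes = 30
print(f"about 1 in {p_match_per_stroke ** -n_strokes:.2e}")   # roughly 1 in 9e20
```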
and people have done this in journalism
so The Wall Street Journal has won a
Pulitzer Prize and actually done three
different versions of this story over
ten years looking for various types of
insider trading
so insider trading is when an executive
in a company has information that is not
public and trade stock based on that
information there are laws about this
for example there are blackout periods
before and after earnings announcements
where executives and often other
employees can't trade during those
periods
to answer the question of whether people
were or to try to find insider trading
what they did is they took the
executives of a bunch of different
companies and they looked for cases
where they sold a bunch of stock and made a bunch of money right before the stock changed value so bought stock before it went up sold stock before it went down and in this particular context what they used were backdated stock options and this is an instrument where they grant them some stock options but then later they decide oh they were effective as of
this date and so the question is did
they really pick the date just
arbitrarily or did they pick the date
sometime later and set it at a date
where they would make a lot of money and
answer that question
what they did is they took all of the trades and did thousands of simulations to see how much they would normally make if the trades were on random dates so here breaking this
down a little bit so they looked at
people who are using this type of stock
option and they're finding that they make a lot of money but the people who don't use that type of stock option don't make as much money and let me show you how this works
oh no Google wants to ask for my password again how awkward why does it randomly pick that okay
yeah so this is a nice discussion which
I will post to the course slack and it explains how this analysis worked
so first of all they looked for news
that changed the stock price and then
they looked at who sold stock the week
before and they found that 10% of them
made a bunch of money but the problem is
sometimes you're just gonna make money
anyway because the stock is going up
right or down so instead what they did
is simulation so what they did is they
broke the link between the trades and
the timing of the trades so they took
the same trade you know I sold this many
shares and they said okay rather than it
happening on this date before the news
let's pick a random day in the year and
let's see how often you make this amount
of money if there's no connection to the
timing of the news and what they were
able to show is that if this was just
luck they were incredibly lucky they did
these simulations and they found minute chances of making this amount of money
from the stock trade it's the same randomization approach we've been talking about so now instead of randomizing the link between cancer rates and counties we're randomizing the date of the stock sale and asking how often do I make this much money on a random day now is this evidence of insider trading how do we handle this result as journalists this is the complicated part
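to picture that simulation here's a rough sketch with placeholder data, this is not the Journal's actual methodology or code, just the date-shuffling idea

```python
# a minimal sketch of the date-shuffling idea with made-up inputs: keep the
# trade the same size but move it to random dates, and ask how often random
# timing does at least as well as the real timing
import numpy as np

rng = np.random.default_rng(2)

def gain(prices, sell_day, shares):
    # how much better off you are selling on sell_day instead of at year end
    return shares * (prices[sell_day] - prices[-1])

prices = 100 + np.cumsum(rng.normal(0, 1, 252))   # placeholder daily price series
real_day, shares = 40, 10_000                     # placeholder real trade
real_gain = gain(prices, real_day, shares)

simulated = np.array([
    gain(prices, rng.integers(0, len(prices)), shares) for _ in range(10_000)
])

p = (simulated >= real_gain).mean()
print(f"random-date trades do at least this well {p:.1%} of the time")
```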
so you're saying that if the news is
particularly unusual then that means
they're even luckier okay
so really what I'm asking here is so say
we do this we identify you know this
person okay and you know they made an
extremely improbable amount of money if
they were just trading randomly okay
right so we think okay this is not a
random pattern
can we write a story saying that they
traded on non-public information no okay
why not you still got to be a journalist what do they mean yeah right so remember insider trading is a criminal offense you can't accuse someone of a criminal offense without proof right and usually you don't accuse someone of a criminal offense unless they've been convicted by a court if they've been accused but not convicted you say you know allegedly or you hedge it in various ways but that's not
even that right the SEC has not opened
an investigation against them right
there's nothing going on here except the
statistical result so you can't call
them a criminal right you can't say that
but what do you say instead
right and in fact I loved how they
handled this look at this headline executives' good luck in trading own stock all right they didn't accuse them of any wrongdoing the
only thing they said is they had very
good luck and the thing about this is
that it may be true you know some some
of these people probably just traded
randomly and got lucky but not all of
them it's just too unlikely that they all got that lucky the problem is we don't know which ones were really lucky probably a small number of them will have just been lucky probably for most of them it is some sort of inside information whether that's
illegal or even immoral it depends on
the specifics right that's a question of
law not statistics
all right there's all kinds of
complicated law about whether it's
insider trading or not but they actually
did three different versions of this
they did it with trading before press
releases they did it with backdated stock options and they did it with one
other type of investment vehicle so
they've actually done this story multiple times over ten years with about
the same statistical technique another
nice example of this is the tennis
racket which was a story that BuzzFeed
did recently and this was the part of a
much larger investigation into
match-fixing in tennis and what they did
is they took public betting odds right
they scraped it from a bunch of betting
sites and they looked at cases where the
odds shifted dramatically between the
start of the match and later on in the
match right and the idea there was that
if someone was heavily favored to win
and they lost May
they threw the match or as they put it
if bets come in against a favored player
maybe that's evidence of match-fixing
so BuzzFeed actually published
the code for this it's a pretty
straightforward simulation and what they
did is they generated a list of I think
15 players who frequently played matches
where the odds shifted a large amount
and they didn't actually publish the
names of these players they just said we
found a bunch they did publish the
algorithm to do it and the data is
publicly available so some readers
promptly ran the code on the data and
got the names so interestingly again
they did not accuse they didn't even
publish the names they didn't want to
accuse people of match-fixing based on
this type of evidence alone why what is
it what is the weakness of this type of
analysis it could have had a bad day for
all kinds of reasons and there's a
really nice article this is also in your
readings where are we okay so here's the original how they used
data to investigate match fixing in
tennis and here's a lovely article why
betting data alone can't identify match
fixers in tennis and this is a beautiful
critique of the method and one of the
things that they talk about is what are
all of the other ways that a player can
tank a match so it's the same sort of
concept right null hypothesis through
the match and then the alternate
hypothesis
deliberately losing losing yeah the
alternate hypotheses are things like h1
betters have inside information age2
the betting markets are wrong and so the
example that gives there is
maybe there's a recent injury that the
bookie doesn't take into account for
example that's what he's saying there
and then this is very interesting h3 is
let's call it methodological problems
and we're going to talk about this a lot
when we talk about the Garden of forking
paths so check this out
what 538 did for this article is they
hired a statistician to replicate the
results and they did it a little bit
differently first of all he used a
little a few more matches okay but also
the BuzzFeed analysis looked at the
bookmaker right so there's
seven different companies producing
betting odds and they're gonna have
different odds for the match and what
the BuzzFeed analysis did is took the
bookmaker that had showed the maximum
movement in the odds
whereas Sackman took the median
of the odds given by the bookmakers and
looked at how much the median moved so
that's a different methodological choice
why would that be justifiable by the way
why would you think well maybe I should
use the median of the odds rather than
just the maximum of all of them why
would you want to do that
yeah alright so the median removes
extreme values right so so if you take
the median of all of the bookmakers
you're basically just throwing out the data for the bookmakers who just got it stupid wrong you know made a stupid mistake were having a bad day didn't take into account the data in the right way
there can be data errors as well right
there could be typos you're just
throwing out all of that stuff right and
when you use the median you get a much
smaller number of players who have
suspicious odds movements so you made a
bunch of changes no it's gonna be the
whole market because the idea is that
the player deliberately chooses to lose
the game and is being paid off by bettors who are betting against a player
who is favored right so everyone thinks
he's gonna win you say I'm betting that
he's gonna lose you get very good odds
on that right it pays back let's say ten
to one then the player loses and you
make a whole bunch of money and the
player is in on it right they get part
of the winnings right that's why it's
illegal okay
oh I see you picked you picked the
bookie who's the most pessimistic cuz
you get the best odds yeah probably
it's a good question so that's where the money would go but for this analysis basically the whole concept of the analysis is you look at cases where the odds shifted very rapidly once they started playing indicating that someone who seemed very unlikely to lose was losing
anyway when you make this methodological change you get a different set of names and a smaller set of names as you said according to the statistical significance threshold that the original analysis used only one player not four was above this threshold
so when we change the method we get a
different result and that's an
indication that we have methodological
issues what you hope for is a result where when you change the method you get the same answer that's known as a robust result and this is
true not just in statistics by the way
this is true in journalism in general
so then BuzzFeed's analysis
there you go so this is how BuzzFeed handled this they said first of all we're
not claiming that by itself this
analysis proves match-fixing
but they said ok but we also had this
whole other investigation right the
statistical analysis was only one part
of the investigation and they also noted
that six of the I think 14 players they
found were also under investigation so
there's some reason to believe that the
analysis was effective in finding the
people although again you have the same
problem it's hard to make the argument
that any particular player is cheating
based on this so I'm going to show you
okay
thoughts on this again it's exactly the
same structure of the argument we're
saying that Oh actually I got this wrong
the null hypothesis is not that they
threw the match that's H I don't know let's call that H F for fixing the null hypothesis is let's say lost by chance and how they generated data to see if they lost by chance is they
simulated all of these games they took
the initial odds and they said let's
simulate a million games and see how
often this player would lose as badly as
they did and they said well that doesn't
happen very often so we think it's
evidence of this and then 538 is coming
back and saying ok but why is it
evidence of that and not evidence of
this or this or this so this is the
basic weakness of the statistical
significance argument the statistical
significance argument has got a very funny structure it always says the alternative is unlikely therefore my proposal is correct
ok and in a certain sense that has to be
true because if we give a probability to
every hypothesis right so let's say p0
is the probability of match-fixing p1
equals probability of injury etc right
and then
let me relabel this p1 is match-fixing p2 is injury p0 equals we saw the odds shift that we saw by chance we know that p0 plus p1 plus p2 plus all of the other possibilities equals 1 that's an axiom of probability if we could list out every possible option they have to sum to 1
if this is very small then these have to
be big okay so the the structure of the
argument does make sense formally if
it's very unlikely that the null
hypothesis produced the data that we saw
then that is evidence that something
else produced it the question is how
many different things could have
produced it and the difficulty with the
structure of the buzzfeed argument is
they're saying well it must have been
this and BuzzFeed is saying well what
hang on there's all of these other
possibilities if there's only one other
possibility so think of the hell and
we'll trial either it's fake or or it's
real and if it's very unlikely that we
see a signature by that like that that
is a real signature then it has to be a
fake signature okay but in the tennis
fixing case and in the insider trading
case we have other possibilities so all
we know is that we can exclude chance
but now we don't know which of the other
ones it actually is
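as a concrete picture of the BuzzFeed-style simulation described a moment ago here's a rough sketch with placeholder numbers, this is not their published code

```python
# a rough sketch of the simulate-the-season idea with made-up numbers: take a
# player's pre-match win probabilities implied by the opening odds, simulate
# many seasons, and count how often chance alone loses at least as many matches
import numpy as np

rng = np.random.default_rng(3)

win_probs = np.array([0.85, 0.90, 0.80, 0.88, 0.92, 0.87])   # placeholder opening odds
actual_losses = 4                                             # placeholder observed losses

n_sims = 1_000_000
losses = (rng.random((n_sims, win_probs.size)) > win_probs).sum(axis=1)
p = (losses >= actual_losses).mean()
print(f"chance alone does this badly in {p:.4%} of simulated seasons")
```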
one of the things you can do to
understand whether statistical
significance testing is going to work or what kind of evidence it provides is to
talk to other people who have done it so
this is again from that 538 story right
and they they talk to this guy who you
know gave them this quote about what it
meant and also I find this really
interesting
doing the same thing has proven
problematic in other sports so that
feels like one of those sentences where
the reporter knows a lot they probably
have like pages of interview notes and
like interesting links but they haven't
given it to us yeah I'm sure there's a
long history of trying to detect
cheating through statistical methods and
you see this in elections as well we're
not really going to talk about that
today but you can see the election fraud in the Russian elections by looking at
the results in a bunch of different ways
one of the ways you can see it is if you
do a histogram of the precinct counts
way too many of them end in zero or five
you know so 65% 70% 75% for Putin's
party people are very bad at making up
random numbers pick a random number
between 1 and 10 all of you how many of
you chose 7
[Laughter]
that's right it's because I decided you
would okay rounding out our discussion of what I'm going to call
pure significance testing you'll see why
in a second this is a fascinating
article which was originally on medium
and this was about how someone discovered that there was a series of payments just before October 28 2016 which was the date of the contract between Stormy Daniels and Trump there were a
bunch of campaign finance payments that
totaled almost exactly 130,000 and so
then this person did an analysis and
what they did was the same sort of
randomization technique right they said
the null hypothesis is these are just
random payments and how they generated
instances of the null hypothesis was
they just randomly sampled from all of
the payments in the previous month or so
and they asked how often if I randomly sample 10 payments do I get a bunch of them that sum to near 130,000 and here's what they did they found that about 1% of the time we get to within 2 dollars and 75 cents and that was their argument that these campaign finance payments were actually the payments to Stormy Daniels now there are
some problems with this argument this is
a great example of you know we sort of
have to think through the argument right
like what are the possibilities other
than it was a payoff well you know
chance is one of them you know were there recurring payments that totaled
$130,000
are these numbers accurate or were they changed later and then you get into this whole issue of if you're trying to hide payments by the campaign is this how you do it is this how campaigns have done it in the past there's all these other issues aside from the statistical issues but it's the same idea
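here's a rough sketch of that payment-shuffling test with placeholder payment data, the real FEC records aren't reproduced here

```python
# placeholder version of the payment-shuffling test: sample ten payments at
# random from a pool and see how often their total lands near $130,000
import numpy as np

rng = np.random.default_rng(4)

payments = rng.uniform(500, 25_000, size=400)   # placeholder pool of payments
target, tolerance, k = 130_000, 2.75, 10        # "near" = within $2.75 here

n_sims = 100_000
hits = 0
for _ in range(n_sims):
    sample = rng.choice(payments, size=k, replace=False)
    if abs(sample.sum() - target) <= tolerance:
        hits += 1

print(f"{hits / n_sims:.3%} of random samples land within ${tolerance} of ${target:,}")
```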
so finally we get to p-values which are one of the most widely used and most misunderstood concepts in statistics and that's the formal definition let's try this informally anybody want to hazard an explanation of what a p-value is our
resident statistician
yeah
okay so it's a complicated idea it's
actually a little weird so here's the
here's the sort of standard definition
the idea is that we have something
called a test statistic which measures
how out there is our answer right so you
know if we're looking at the change in
jobs and we're like oh look at that
that changed by 100k the test statistic
is maybe you know how many standard
deviations away from last month are we
the test statistic is often
turned into a z-score which is a
difference expressed in standard
deviations but it can be really anything the fundamental concept of a p-value is the probability that we would get data at least I'm being a little informal here at least as extreme as our real data right so I'm differentiating here between the data we observe in the world and the data we're simulating at least as extreme as our real data if the null hypothesis is true so generally for this sort of standard statistic the general threshold is P smaller than 0.05 so 5% chance but
this definition is complicated if we
have a p-value of 0.05 for you know is
there a difference between these two
drugs is there a difference in the
salaries of men and women at this
University
you know what do we see or you know
have crime rates really increased that
type of thing this is where we normally
use p-values that doesn't mean that
there's only a 5% chance that the value
we see is due to noise it only means
exactly this which is really hard to get
your head around it means that if there
was no difference we would see the
difference we actually saw about 5% of
the time right so it actually doesn't
say anything about the hypothesis we care about it only says something about
the hypothesis we don't care about and
the implication goes in the wrong
direction what we really want is the
probability that our hypothesis is true
given the data instead what we get is
the probability that we see our data
given a hypothesis that we're not
testing so it's a weird calculation now
that there is still logic in it because
as I showed you before if the null
hypothesis is very unlikely to have
general
data then something else did okay and if
the only other thing that could have
done it is the thing we care about then
yeah when the p-value gets very small
that is strong evidence for our other
hypothesis so let's talk about this I introduced the idea of a test statistic this is a measure of like how extreme our result is so this is a standard formula for the difference in means of two groups with different variances I'm not going to go into how we derive this and in fact I'm gonna recommend that you never use this formula I'm gonna recommend that you do it through randomization in fact what I'm going to show you is a way to compute a standard statistical p-value without these formulas so the test that I'm going to show you and you're gonna do homework on exactly this problem is let's imagine
that there's two different classrooms
that differ and some variable we care
about maybe one has a better paid
teacher maybe one uses different
textbooks and we have the the
standardized test scores for every
student in both classrooms and we're
going to ask does it matter which
classroom they're in now obviously the students are not the same so there's gonna be some difference in means between the two classrooms alright there's gonna be some difference in the actual data but the question we're asking is how likely is that difference to be something other than
just chance so to answer that question
we have to ask about why we would have
those differences and there's basically
only two reasons there are differences
because of differences in the two
classrooms and there are differences
because of things that do not depend on
the classrooms like the variation in the
students who go into the class and we
have to hope by the way that
students are not assigned to a class
based on how good a student they are
right that's why we do randomization in experiments we have to
hope that there's no other correlation
between classroom and students but if we
think that's true then what we can do is
we can basically break the association
between which classroom they're in and
their test score through permutations so
this is how we do it
this is what the data looks like this is
the test score this is which class
they're in and what we do is we randomly
reassign the students to classes we
permute the class assignments or permute
the data same thing right we have the
same number of A's and B's but we
reorder them and we do this over and
over in principle we actually run
through every possible permutation it
you know it'll be in the billions in
practice what we do is we sample a
subset of them randomly and for each
permutation we compute the difference in
the means between the two classes so
here's the idea here's the original
scores for Class A and B with the means
marked what we do is we randomly
reassign each of these dots to the left
or the right ensuring that we keep the
total number of students the same in
each class and then we recompute the
means and you can see sometimes Class B
comes out higher and sometimes Class A
comes out higher we do this thousands of
times and then we look at the difference
between the class scores for each one is
that clear so far and then we make a
histogram of the thing and this is what
we get we get this histogram of possible
differences in the class scores after
resampling and somewhere in this
histogram is our observed data so in
other words this is the real data here and all of this are realizations of the null hypothesis and
then we ask what is the probability that
we get a test statistic in this case the
difference between the means of the test
scores of the classes that is at least
as big as the observed data at least as
Extreme as the real data so looking at
this histogram can anybody tell me how
how that probability appears on this
chart there is there is a relationship
between this visualization and the
p-value anybody know what it is again
the question is how often if the null
hypothesis is true do we see a
difference in means that is at least as
great as the one that we saw so what is
that on this chart
right here no that's the lesser differences so this is it and if we do that calculation and we're not using any formulas here we're simply counting the percentage of trials where we get an observed difference at least as big as the real difference we get 14% so that is what a p-value is and I've shown you one way to compute one so
I've tried to demonstrate two things in
this example one the concept of a
p-value how often do we get a difference
that is at least as great as the one we
saw if the null hypothesis is true and two the concept of a permutation test which is a randomization test basically a way to use simulation to compute a p-value
where you don't have to work through the
analytical formulas of classical
statistics and with a little creativity
you can come up with simulation methods
to calculate all of these statistical
properties you very rarely have to
actually use the analytical formulas for
this type of stuff so would this be
considered a statistically significant
difference
yeah not by the standard definition of
point zero five right so this is gonna
be your homework I'm gonna give you two
sets of class scores or something like class scores and ask you to compute
the p-value through a permutation test
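here's a minimal sketch of such a permutation test with made-up scores, not the homework data

```python
# a minimal permutation test with made-up scores: shuffle the class labels and
# count how often the shuffled difference in means is at least as large as the
# observed one
import numpy as np

rng = np.random.default_rng(5)

class_a = np.array([72, 85, 90, 68, 77, 81, 95, 70])   # placeholder scores
class_b = np.array([65, 74, 80, 69, 71, 78, 83, 66])

observed = class_a.mean() - class_b.mean()
pooled = np.concatenate([class_a, class_b])

n_sims = 100_000
count = 0
for _ in range(n_sims):
    rng.shuffle(pooled)
    diff = pooled[:len(class_a)].mean() - pooled[len(class_a):].mean()
    if diff >= observed:        # one-tailed; compare absolute values for two-tailed
        count += 1

print(f"p-value ≈ {count / n_sims:.3f}")
```

the comparison line is where the one-tailed versus two-tailed choice shows up which is exactly the distinction that comes up next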
oh I see so yeah so this is the
difference between a one-tailed and a
two-tailed test
so yeah depending on how you think about
this you might want to include all of this stuff too on the left because you're asking two different questions what is the probability that Class A is better than Class B as opposed to what is the probability that the difference is this great by chance
unfortunately so I'm showing you this
because I want you to I'm trying to give
you a little intuition about what a
p-value is measuring and a way to
calculate it for a very standard case
this is mostly what people use p-values
for unfortunately p-values answer the
wrong question let's see we're going to
bootstrap here I'm not gonna do
bootstrapping today so here we are we're looking at it this means 0.05 this means 0.01 this means 0.10 and they're like well
it's not statistically significant but
you know there's always this problem of
threshold and this is one of the
difficulties of p-values is people want
to use them to decide whether something
is true or false but you can't
actually decide whether something is
true or false all you can do is weigh
the evidence in different ways okay
that's what uncertainty is uncertainty
is irreducible lack of knowledge
emphasis on irreducible there is no
statistical process that you can do
that can tell you whether the difference
in female assistant professor salaries
in the life sciences is due to discrimination right you can't
get that from the data because there is
irreducible uncertainty in the data all
you can get is different ways of talking
about the strength of the evidence and
the p-value is one way to talk about the
strength of the evidence if the p-value
is small that means the difference is
unlikely to have been generated under
the null hypothesis which means that
something other than the null hypothesis
must be true it doesn't tell you what and that was the problem we were running
into earlier with the statistical test
for tennis fixing
and this is a nice paper it's linked in
the syllabus it's about a bunch of ways
that the p-value is misinterpreted this
is the number one misinterpretation if P
equals 0.05 then the null hypothesis has only a 5% chance of being true that's not at all what it says and
one way to see this is to think about
writing the probabilities as more formal
statements and so we're going to use
this notation e is evidence that is what
we observe you can think of that as the
data H is hypothesis that's what we think might be true this is not quite the p value because the p value is at least as extreme not equal to but this is kind of what we're doing with the p value we're saying what is the probability of the evidence given that hypothesis given the null hypothesis what we really want is this what is the probability of the hypothesis we care about there's a difference in the classrooms there's a difference in the pay the payoff was a cover-up that match was fixed there was insider trading right that's the hypothesis we care about given the evidence that we have that's
what we want but that's not what a p
value is and I hope that I have drilled
into your head at this point that
reversing a conditional probability is
completely different right these are very different concepts not only are they different concepts but they're actually talking about
different hypotheses and again it's not
that the reasoning is bad right so let's
go back to the Howland will trial all right so if it is unlikely that the signatures match by chance then it has to be true
that they match for some other reason
okay the logic is good because we know
that the sum of the probabilities of all
the hypotheses is 1 so if we can
eliminate one hypothesis it has to be
something else the question is what and
in this case the only things we're considering are it's forged or it's real so if it's not real it has to be forged alright if you
have only two possibilities P values can
really work but in practice you often
have more
also the p-value doesn't tell you much
about the effect size which is the thing
you actually care about right how much
better were the scores right statistical
significance only compares against the
null hypothesis which is normally the
hypothesis of no effect
so if it's significant statistically it
may still be a super small effect so
that's this case here right different
effects same p-value right so the
percent benefit let's say in a medical trial right these both have the same p-value of 0.05 but this is a two percent
difference and this is a twenty percent
difference all right so the p-value does
not measure the effect size conversely
you can get this other problem
where the p-value is just sort of
measuring how accurate your result is
basically how many people you had in the
trial right we have the same effect size
we just have a lot more uncertainty in
when we have a higher p-value so the
p-value is this weird construct that
kind of blends effect size and
confidence interval and there's a big
movement to eradicate p-values and just
have people report confidence intervals
instead because really that's the
information you need here another way of
thinking about statistical significance
is does the confidence interval cross
zero
so yeah unfortunately they mean less than we might hope they would but I'm
going to show you an alternative and I'm
not going to go into Bayesian statistics
in any great detail because that's a
whole other course but I'm going to show
you foundational ideas in Bayesian
inference and it's basically based on this right let's back up
and represent everything as conditional
probabilities so you're familiar with
conditional probability I hope this is a
fundamental concept you need to get this
in your head or you're gonna make bad
mistakes but it's a very simple idea
really it's just a definition it's the
probability of a and B happening divided
by the probability of a alright so we're
just we're just taking a denominator
we're talking about two different events
and we're taking and saying of the
things where a happened
how many also had B and I kind of think
of it as this bar as a Division sign and
so the a is like okay that's the
denominator that's the set work and
through thinking of so that's why I want
you to think of it of the A's how many
had B which is not the same as of the
B's how many have a or I think of the
you'll see these diagrams in a second so
here's the classic example let's say
there are yellow and blue taxis in a
city and some of them have accidents and
we're trying to ask which cab company is
safer now it should be
obvious to you that if you just count
the number of accidents you're gonna say
well the yellow company has more
accidents but of course there's more
yellow cabs so what we actually want is
the rates and that's kind of what
conditional probability is it's you know
given that we're yellow how many
accidents did we have or in this case
blue right so this is what conditional
probability is the the thing after the
pipe is
the denominator and then we count within
that how many had an accident right so
we only end up counting this accident not that accident because only this one is inside the denominator we talked about
relative risk a lot a few classes ago
relative risk is this I mean these are
the formulas from the contingency tables
but the easiest way to think about it is
as a ratio of conditional probabilities
in other words what is the ratio of
getting the disease given that I smoke
over getting the disease given that I
don't smoke and I feel like this is a
much easier thing to remember and once
you have that then you can convert this
back into the formula of the contingency
tables by using these ideas right so you
should be familiar with converting
between this notation and contingency
table notation hopefully your last
assignment forced you to think about
that stuff a bit the reason I show you the base rate is that there's
a standard error called the base rate
fallacy that's where you look at this
data so when you look at this it's
pretty obvious that the yellow taxis are
safer when you get this data and you're
like wow there are many more accidents
involving yellow cabs you have to
sort of wiggle this stuff around a
little bit to finish the derivation so
I'm turning the numbers that we have
into conditional probabilities so we
know the overall probability of an
accident we know the conditional
probabilities of having an accident
depending on which cab you are in but we don't know how many yellow cabs there are oh actually I've done this wrong 75 accidents involving a yellow cab this is wrong this is just an accident given yellow these should be Ns not Ps what we actually want are these numbers
and to get those numbers we need to know
the total number of yellow and blue
taxis so another way to think about this
is this is one of these you need four
numbers to make this work right if
you're gonna ask which taxi company is
safer you need every entry of this
contingency table because ultimately
you're computing this and this uses all
four entries I'm now gonna do sort of
expand on this a little bit into this
evidence and hypothesis framework which
is an extremely clear framework for
thinking about many statistical problems
so let's say we observe this person
coughing and we want to know and this is
why I introduced the base rate we want
to know whether she has a cold let's say
we know that most people with colds
are coughing right so we know that P
coughing given cold is 0.9 what is the
probability that she has a cold given we see her coughing we know there's this
link between having a cold and coughing
what is the probability she has a cold
Yeah right we are missing the base rates
okay
so here's the basic issue again we have the probability of the evidence given the hypothesis and we want the probability of the hypothesis given the evidence so we have to go the other way
the way we get that is Bayes' theorem now Bayes' theorem is not magic it follows immediately from the definition of conditional probability it just tells you the relationship between A and B right so it tells us that probability of hypothesis given evidence is gonna be probability of evidence given hypothesis no hang on times the probability of the hypothesis divided by the probability of the evidence there we go okay so
in this case if we translate this out
this is probability coughing given cold
times probability cold divided by
probability coughing all right so it
tells us how to reverse this so let's do
this what are reasonable values for each
of these things so we were given
coughing giving cold at 0.9 what is P
cold how many people have a cold at any
given moment what fraction of the
population
15% how many people in this room have a cold it depends on the season or where in the world you are ok what do you want to call it 5% ok overruled 5 yeah you need to get out more so you get more colds what is the probability that someone is coughing say 10 percent ok 0.1 so what do we get 0.9 times 0.05 is 0.045 and divided by 0.1 that's 0.45 okay did I do this right yes so okay 45 percent let's do it this way hey look at
that
we ended up with exactly the same
numbers have ya I'm always curious what
people will guess yeah exactly
there you go so there's the calculation
worked out in detail right and it's not
a complicated calculation the only thing
you need is the conceptual machinery to
see how all these numbers are related okay
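here's that calculation as a couple of lines using the numbers the class settled on

```python
# Bayes' theorem for the cold-and-coughing example, with the numbers from class
p_cough_given_cold = 0.9
p_cold = 0.05                     # prior: fraction of people with a cold
p_cough = 0.1                     # base rate: fraction of people coughing

p_cold_given_cough = p_cough_given_cold * p_cold / p_cough
print(p_cold_given_cough)         # 0.45, i.e. 45 percent
```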
these things have names so probability
of the evidence given the hypothesis or
the data given the hypothesis is called
the likelihood the probability that you get the evidence at all is called the base rate how often do I see that data at all and the
probability that the hypothesis is true
without any other information is called
the prior Bayes theorem is also
extremely important in interpreting
diagnostic tests such as for mammograms
in this example or let's say you've got
your terrorism detector and someone
tells you that if the person is a
terrorist and they walk through the
scanner the alarm will go off with 99
percent probability but that's not what
you actually want again it's the same
problem you don't want the probability
that the alarm goes off if they're
terrorists
you want the probability that they're a
terrorist if the alarm goes off that's
the that's a significant number right so
this is the same thing right if you get
a positive mammogram under this scenario
how likely is it that you actually have
breast cancer
and here's the situation and again I've
borrowed from from Nate silver and
here's the idea is that the dark squares
are positive mammograms and the pluses
are cancer but because cancer is rare
actually most of the dark squares do not
have cancer this is why you do follow-up
tests where you get the same thing with
any rare disease or you get a very
similar situation with HIV tests or your terrorism detector and here's sort of the logic I'm gonna walk through
this pretty fast
again we have this sort of this is a
contingency table in graphical form and
we have these basic quantities right so
we have the probability that the test is
positive if you have cancer and so this
is the easy one to get right we can do
this this just depends on the test
we have the probability it's positive if you don't have cancer and again these all have names so false positive rate and positive predictive value is one set of words precision and recall is another set of words which we think about in information retrieval in medicine you use
sensitivity and specificity which I
forget how they relate to PPV and so
forth but they're all names for these
particular conditional probabilities and
so once you understand the conditional
probabilities you can work with all of
this stuff here's just the base rates of
cancer and here it's the same form of
problem right we know the probability of
positive given cancer and some other
stuff we want to know cancer given
positive and you can work with it and so
here you go we want this piece of the
chart right where as what we're given is
this and some other information so you
can work this through you end up having
to compute the overall rate of positives
here's Bayes theorem at the top you
grind it through you end up with this
9.6 percent chance all right so most positives are actually false
positives and for your terrorism
detector this is going to be like 99 percent false positives
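working those numbers through Bayes' theorem looks like this, the inputs below are assumptions chosen to roughly reproduce the 9.6 percent figure, they're not taken from the slide

```python
# mammogram example via Bayes' theorem; the inputs are assumptions chosen to
# roughly reproduce the 9.6 percent figure, not numbers taken from the slide
prevalence = 0.014        # assumed P(cancer)
sensitivity = 0.75        # assumed P(positive | cancer)
false_positive = 0.10     # assumed P(positive | no cancer)

p_positive = sensitivity * prevalence + false_positive * (1 - prevalence)
p_cancer_given_positive = sensitivity * prevalence / p_positive
print(f"P(cancer | positive) ≈ {p_cancer_given_positive:.1%}")   # about 9.6%
```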
there is sort of a philosophy that goes with Bayesianism that is interesting the idea is that and so now we've gone from mathematical
formalism to epistemological virtue and
the idea here is that we should only
believe things for which we have
evidence in other words evidence is the
thing that justifies a belief which
means E is evidence for X if the
conditional probability of X given E is
greater than just the conditional
probability of X alone in other words if
we revised our probability estimate of
the hypothesis upwards when we see that
evidence so that's the the formal
definition of evidence for something and
the the normative part of that is to say
that we should only believe things for
which we have evidence
in this sense unfortunately it's very
hard to formalize the world it only
works in a few situations and again
here's the basic form right we start
with a model so this is the model which
gives us the likelihood how often do we
call if we have a cold and then we start
with the base rate how often do we see
that evidence and we start with the
prior which is how likely is that
hypothesis to be true in the beginning
and we end up with this thing which is
normally called the posterior which is
how probable is that hypothesis given the evidence and that's what
we just did with our example here now
this gets arbitrarily complicated when
this starts to get things like you know
what is the probability of pulling
someone over given their race then we
start to have to do much more
complicated calculations when we have
large amounts of data and Bayesian
inference that statistics is this whole
subject but this is the fundamental
formula this is all we're doing is
flipping the conditional probabilities
because we normally have this and we
normally want that this is what a model
tells us this is a simulation and in
fact I'll show you how this works
in a simulation context so p-values really only talk about the
null hypothesis bayesian statistics
usually talks about comparing multiple
hypotheses so here's a very simple
example again this is also in my book
you put a traffic light at an
intersection and you ask did it reduce
accidents so say you're looking at this
chart and you're trying to write a story
saying huh did the traffic light work so
just looking at this data what do you
think did the traffic light work did it
reduce the accidents
okay maybe a good answer okay so to
answer this question what we do is we do
two sets of simulations one is we
simulate how many traffic accidents we
would get without the stoplight so
everything to the right of the red bar
here is fake data this is simulated data
which means we need a model a model that
tells us how often we get each number of
accidents in other words we have a model
of the intersection without a stoplight which is this term that we're computing probability of evidence given hypothesis the hypothesis is no stoplight the evidence is
these numbers here and here we immediately run into this other problem we have to make up some
definition of reduced traffic accidents
so let's say we'll call the traffic
accidents reduced if all of the values
after the red line are lower than the
minimum of these two values right so
that means that this one and this one
are lower right we have to make up some
definition of what that means so this is
the problem translating words into
numbers what does it mean to say that
the traffic light reduced accidents well
we're making up a definition how we
simulate these things maybe we draw
randomly from the history of the traffic
data before the stoplight was installed
maybe we compute the average rate and we
use a Poisson distribution and draw from a Poisson distribution we make some
model and then we make fake data and we
count how often the data produces the
thing we're looking for
and then we make different fake data
let's say we imagine that the stoplight
reduces traffic accidents by 50 percent
and we so basically we take our model
and we simulate everything with the
number of accidents cut in half and we
ask how often do we see a reduction by
our definition and so in this case it's
one two three
four five six seven and then we're going
to ask what is the ratio of where we saw
fewer accidents to where we didn't do I
have that yeah okay so what we're actually computing is the ratio of the probability of hypothesis 1 the stoplight did nothing to hypothesis 2 the stoplight cut accidents in half and if you grind through this what you actually find out is that ratio is equal to the ratio of the prior probabilities like how much more likely we thought the stoplight would work to begin with times the ratio of the likelihoods so
doing that simulation what we get is
this term okay
we're simulating so we're assuming the hypothesis that's on the right we're getting the evidence on the left we're taking how often do we see the data so evidence and data are the same thing in this case given our hypothesis and a little algebra tells us that the ratio of the probabilities of the two hypotheses is equal to the ratio of the likelihoods times the ratio of the priors
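here's a minimal sketch of that two-simulation comparison assuming a Poisson model and made-up accident counts, the chart's real numbers aren't reproduced here

```python
# two simulations, one per hypothesis, with made-up accident counts and an
# assumed Poisson model; the ratio of how often each reproduces "reduced"
# data is the likelihood ratio, i.e. the Bayes factor
import numpy as np

rng = np.random.default_rng(6)

before = np.array([9, 11])        # placeholder: accidents per year before the light
n_after = 3                       # placeholder: number of years observed after
threshold = before.min()          # "reduced" = every later year below the earlier minimum

def p_reduced(rate, n_sims=100_000):
    # how often a Poisson model at this rate produces "reduced" data
    sims = rng.poisson(rate, size=(n_sims, n_after))
    return (sims < threshold).all(axis=1).mean()

rate_h1 = before.mean()           # H1: the light did nothing
rate_h2 = before.mean() / 2       # H2: the light cut accidents in half

bayes_factor = p_reduced(rate_h2) / p_reduced(rate_h1)
print(f"Bayes factor for H2 over H1 ≈ {bayes_factor:.1f}")
```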
so the Bayes factor the thing we can
compute by comparing these two different
simulations you can think of it as
telling us what is the
strength of the evidence it doesn't tell
us what the overall probability is
because maybe we know that stoplights reduce traffic accidents already maybe
we know from every other city that when
you put in a stoplight it cuts it in
half in which case this prior would be
very high or maybe we have no reason to
believe that stoplights work in which
case we start with 50/50
okay so this is one of those
sort of mistakes of the p-value approach
is that it tries to come to a definitive
conclusion is this true is this not true
based on only the data we're analyzing
and not all the other data in the world
so the prior here you can think
of this as incorporating all of the
other information we could possibly have
and this is in a sense more honest
because the thing that we can calculate
is only what this data that we have
right in front of us is telling us so
this is called the Bayes factor you can
think of it as what is the relative
strength of the evidence for different
hypotheses and there's no sort of
agreed-upon thresholds people don't really say the Bayes factor is greater than two so therefore we accept this as true the whole concept of accepting or rejecting a result based on a threshold is flawed again we can never eliminate uncertainty we can only measure
the strength of the evidence so here we
go here's ways of thinking about the
strength of the evidence by looking at
Bayes factors and this is kind of a weird table right because what does strong evidence mean this is one attempt to associate numbers with words but you know by the time this ratio gets up to about 20 which is about where the standard p-value of 0.05 is kind of set we say it's very strong so the way to interpret a Bayes factor of 20 is that a model which assumes the hypothesis is true is 20 times more
likely to generate the data that we
actually saw that's why I'm tying this
back to simulation ok this is
a ton of theory I know I've gone just
really rapidly through this so any
questions about these concepts before we
move on
hmm I know I don't quite believe it but
I will say this you will in your career
have to interpret p-values so you have
to understand what they are but honestly
if you're trying to solve a statistical
problem phrase it in the language of
conditional probability you're gonna do
a lot better it's gonna be a lot clearer
and if you have to do something that is
like a significance test which by the way I think you shouldn't I think you should just look at confidence intervals and effect sizes
but if you have to do something like
that then think about the relative
strength of the evidence between
different hypotheses that's one of the
problems with the null hypothesis idea
because it only measures the strength of
the evidence for one hypothesis which is
the one you don't care about I want to
show you a lovely little interactive
here how many of you have heard of the
replication crisis in science
yeah it turns out that a lot of
published papers especially in
psychology don't replicate and you know
p-values were supposed to prevent this so
why isn't it working
so here god I wish I could just get rid
of this banner so here's a lovely little
interactive from a long story called
Science Isn't Broken which demonstrates
the issues with using statistical
significance as a test and the idea here
is we're going to test does the
economy depend on whether Republicans or
Democrats are in office and what we can
do is use different definitions of
economics and different definitions of
politician and you know maybe a couple
other factors right so if we measure
only GDP and we measure only presidents
then no but if we measure GDP and
governors then yes look at that all
right we got a p-value smaller than 0.05
if we measure stock prices and governors no if we measure everybody anyway you can sort of see how this works right whether we include or exclude recessions and you can always find a combination that gives you a significant p-value again this is this problem of
translating words into numbers
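Here's a rough sketch of why trying enough definitions eventually produces a "significant" result. Everything below is invented: the data is pure noise with no real relationship, the lists of economic metrics and officeholders are placeholders, and a plain t-test stands in for whatever the interactive actually computes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_years = 60

# invented "definitions of economics": each is just random noise
metrics = {name: rng.normal(size=n_years)
           for name in ["gdp", "employment", "inflation", "stock prices"]}

# invented "definitions of politician": which party holds each office, per year
officials = {name: rng.integers(0, 2, size=n_years)
             for name in ["presidents", "governors", "senators"]}

# try every combination of definitions and keep the smallest p-value
best = None
for metric_name, metric in metrics.items():
    for office_name, in_office in officials.items():
        p = stats.ttest_ind(metric[in_office == 0], metric[in_office == 1]).pvalue
        if best is None or p < best[0]:
            best = (p, metric_name, office_name)

print(f"best combination: {best[1]} x {best[2]}, p = {best[0]:.3f}")
# with enough combinations, some p-value drifts toward 0.05 by chance alone
```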
so this is known as the Garden of
forking paths meaning that if you
change your data analysis you can get a
different significance level this is a
beautiful example of that and the
article's worth reading so what happened
here is the journalist collaborated with
a German researcher a real researcher who actually worked at a German
medical school and they did a real
scientific study with all the standard
protocols of feeding people chocolate
and then measuring a bunch of variables
and then they found that the people who
ate the chocolate had lower weight significant at p < 0.05 they
wrote it up they published it in a
scientific journal which previous
research by this researcher had
shown that a lot of scientific journals
don't actually read the papers they
accept so they got it published in a
peer-reviewed journal then they sent a
press release with a link to the paper
to a bunch of media organizations and
some of them wrote stories about how
chocolate causes weight loss so this is
our standard scientific process they did
everything right it was a real
researcher following standard study
protocol publishing in a peer-reviewed
journal and with a statistically significant finding what
was the problem
how did they get this nonsense
finding
no totally standard people they randomized them all there was nothing wrong with the study design this was a completely standard study design so what's the issue nope no it was properly randomized that's one guess it could be small sample size but p-values are designed to correct for sample size
all right because think about doing that
permutation test if you have small
samples you're gonna get more variance
you're gonna have more variance cuz
you're taking averages and that's what
you're doing here you're comparing the
average of the chocolate group to the
non chocolate group so p-values should
correct for sample size so what is the
issue anyone know I did actually say it
but most people aren't trained to
look for it the problem is they measured
20 different variables so chocolate is
going to show an improvement on
something it aids sleep it lowers
cholesterol but of course weight loss is
the best headline so this is from a paper by Andrew Gelman who's one of the Bayesian heavies this has a name in science which is the garden of forking paths which is from the title of a Borges short story which I recommend
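To put a rough number on it: if the 20 outcome variables were independent (an assumption, they surely weren't) and chocolate truly did nothing, the chance that at least one of them clears the 0.05 bar is

\[
1 - (1 - 0.05)^{20} = 1 - 0.95^{20} \approx 0.64
\]

so even with no real effect you'd expect a publishable-looking finding on something roughly two times out of three.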
you don't have to be fraudulent to get a significant but nonsense value all you have to do is pick a method of data analysis which gives you that value right so this is the key question would the same data analysis decisions have been made with a different data set how do we solve this
problem how do we avoid getting nonsense
results just by picking different
definitions of what we measure or how we
do the analysis doesn't this destroy the whole idea of data analysis anybody got an idea of an
answer to that one yeah warmer check for
quarterly I mean that's a good idea
anyway yeah how do we get out of this
trap
so I think there's basically two answers
one is statistical significance the
whole idea of statistical significance
has serious conceptual problems that
we've touched a little bit on today
right because you're trying to turn
uncertainty into certainty so use effect
sizes instead rather than saying you know this policy definitely reduced crime rates say the crime rates fell by five percent but there's statistical error so it could be actually anything from an increase of 1%
to a decrease of 10%
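Here's one sketch of what reporting an effect size with its uncertainty could look like. The crime counts are invented, and a simple bootstrap percentile interval is just one way to get an uncertainty range, not necessarily the right model for real crime data.

```python
import numpy as np

rng = np.random.default_rng(2)

# hypothetical monthly crime counts before and after the policy (made-up numbers)
before = np.array([102, 98, 110, 95, 105, 99, 108, 101, 97, 104, 100, 106])
after  = np.array([ 97, 101,  94, 103,  90, 99,  96, 100,  92,  98,  95, 102])

def pct_change(b, a):
    """Percent change in the mean monthly count, after vs before."""
    return (a.mean() - b.mean()) / b.mean() * 100

# bootstrap: resample each period with replacement and recompute the effect size
boot = np.array([
    pct_change(rng.choice(before, size=len(before)),
               rng.choice(after, size=len(after)))
    for _ in range(10_000)
])

low, high = np.percentile(boot, [2.5, 97.5])
print(f"estimated change: {pct_change(before, after):.1f}%")
print(f"95% interval: {low:.1f}% to {high:.1f}%")
```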
all right fine find a way to show the
uncertainty and we're still learning how
to do this in journalism to talk about
the uncertainty the other thing is robustness which we saw a bit earlier robustness is an idea that appears in a lot of forms in the social sciences they call it triangulation this is a quote by one of my heroes Charles Sanders Peirce who
appeared earlier in the handwriting
example and what he's saying is that
rather than using one type of argument
to know if something is true find lots
of arguments so going back to our
Hack Your Way to Scientific Glory example if we got a large effect size no matter what definition we chose that's a sign that the result is strong we say that is a robust result okay
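One way to operationalize that is a little specification check: rerun the same comparison under several reasonable definitions and see whether the effect size stays roughly the same. The data and the list of definitions below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# made-up before/after series, as if we had a few years of monthly data
before = rng.poisson(10, size=36)
after  = rng.poisson(8,  size=36)

# several defensible analysis choices (timescale, summary statistic, outlier handling)
specs = {
    "mean, all months":       lambda b, a: a.mean() - b.mean(),
    "mean, last 12 months":   lambda b, a: a[-12:].mean() - b[-12:].mean(),
    "median, all months":     lambda b, a: np.median(a) - np.median(b),
    "mean, outliers dropped": lambda b, a: (np.sort(a)[2:-2].mean()
                                            - np.sort(b)[2:-2].mean()),
}

# a robust result shows roughly the same effect under every reasonable definition
for name, estimate in specs.items():
    print(f"{name:24s} change = {estimate(before, after):+.2f}")
```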
what we're looking for are results that don't depend on the definitions we choose the timescales we look at and also we want it to be true if we
look at different data sets we want it
to be true if we interview people as
well we want the qualitative and the
quantitative data to match up right the p-value is asking the wrong question that's this chart here the p-value is using a statistical model and then there's a scientific or let's say substantive model here's a model of the real world we think we have students and
teachers and classrooms and we think
that if there's fewer students in a
class they're gonna get more attention
that is this substantive model the
statistical model is just this
permutation test the statistical model
is not the substantive model right and
if you only draw conclusions based on
the statistical model while ignoring
what happens in the real world you're
gonna have problems all right so the p-value will not tell you whether your hypothesis is true or not it's just one tool you can use to evaluate the evidence and a strong conclusion will have multiple lines of evidence
all right that's that's all we're gonna
do for today thanks everyone