Quantification and Statistical Inference

Frontiers of Computational Journalism week 4 - Quantification and Statistical Inference
All right, so this is the first of two weeks on statistics generally. I know we have varying levels of statistical training in the class, and obviously I can't teach you all of statistics, but I've found through the years that there are certain core concepts that are both applicable to journalism and that don't come out clearly in a standard statistical education. So we're going to go through those, starting with the most neglected subject in statistics, which is measuring things. That's what I mean by quantification. So at the beginning we're going to talk about things that maybe statisticians wouldn't talk about: quantification and data quality. Then we're going to go into risk ratios, which you've probably seen before. It's not a very complicated concept, but we're going to apply it to a story that the AP wrote alleging corruption, and really try to think very carefully about what's going on in that story. You'll see that every time I have this discussion it comes out complicated, so that's good. And we're going to talk a little bit about regression and causal analysis, and I'm going to try to show you how all of these things appear in journalism. In the next class we're going to do statistical significance testing and a bunch of other fun stuff: randomization, bootstraps. How many of you have had a 101 stats course? Yeah, so there will probably be very little overlap with what you were taught, because mostly I don't like the way statistics is taught.

Quantification.
So you saw this picture in the first class. The idea here is that we turn some complicated, difficult-to-describe, difficult-to-categorize reality into a series of vectors. Quantification is this arrow here, and a bunch of things happen at that arrow: a bunch of choices about what to record, and a bunch of lost information. The lost information is normal; we call that abstraction. That is both what is powerful about quantification and what is weak about it: we throw things out. So I want to talk a little bit about what we lose by quantification. These are old critiques; I'm not going to say anything that people haven't said over the last hundred years. I just want to talk about what this means in journalism.
First of all, quantification can mean a bunch of stuff. It's counting, but there are different types of scales. You're all familiar with the difference between continuous and discrete scales at this point. And there is structure; there are many different types of structure. For example, a continuous scale may or may not be uniform. Consider a Likert scale. You all know what a Likert scale is? A Likert scale is an agree/disagree scale: 'cigarettes are a healthy habit,' agree or disagree, one to five. Or maybe it's a one-to-five star scale: how much did you like this movie? The thing about such scales is that the difference between a three and a four may not be the same as the difference between a four and a five; in fact, it's usually not. Think about rating your Uber driver: the scale runs one to five, but in practice there's a big difference between a three and a four, so three to four is a big jump while four to five is a smaller jump. These scales are monotonic, meaning that a higher number on the scale definitely means a higher number in reality, in some sense, but they are not uniform or proportional. So you can't average them, for example; or rather, it's not clear what averaging them means.
Then there are scales that are categorical and unordered. For example, if I ask someone to select their favorite color, it's hard to say whether blue is more or less than green; what would that even mean? And then there are scales that are categorical and ordered. Can anyone come up with an example of an ordered categorical scale? Sure, yeah: or 'big, bigger, biggest.' They're actually pretty common. You get things like ranks in the military or in a company, or the classification system we saw for classifying tweets, from chit-chat to news events. Discrete ordered scales show up a fair amount.
More fundamental than that, I think, and certainly logically prior, are questions about what we're counting. Colloquially we talk about the unemployment rate; of course, there are actually many different unemployment rates. Here's a different measure of unemployment, 'part-time for economic reasons,' meaning that you wanted a full-time job but couldn't get one because you couldn't find one. When we look at this unemployment rate, the standard unemployment rate, you can see that it falls slowly after the recession. If you look at this other unemployment rate, the 'I wanted to work full-time but couldn't' rate, it doesn't fall nearly as fast. So depending on what you're counting, you literally get a different story, and this is part of what happens when people have very different frames on the same events; we'll talk about that more. In fact, there are six standardly reported unemployment rates, called U-1 through U-6, and if you read them you can see that they all have different numbers and they all count different things. The one that is commonly reported is U-3; it even says 'the official unemployment rate.' So if you have to talk about the unemployment rate, that's normally what people are talking about. But there are lots of different ways to look at these numbers and tell different stories, because there are lots of different ways of counting. Does anybody know where these numbers actually come from, like how the Bureau of Labor Statistics produces them?
A survey, yeah. Do you know any more about that? Okay, has anybody looked into this? It does come out every month, yeah. So there are actually two surveys done monthly, and of course they're samples, because you can't talk to everyone monthly. One of them is the household survey and one is the business survey. The household survey takes a sample of households and asks: are you currently working, have you been looking for a job, what industry do you work in, these types of questions. The business survey takes a sample of businesses and asks: how many people do you employ, do you plan to lay people off, these types of things. The two actually serve as an independent check on each other, because they don't collect all of the same data, but they collect enough crossover that you can check the accuracy of the surveys. So that happens every month, and I think there are about a hundred thousand people in the survey, and the margin of error is large. When you're talking about estimating a fraction of 300 million people, a one or two percent margin of error starts to be millions of people. So, for example, the jobs figures: you may have heard the jobs reports, 'the economy gained two hundred thousand jobs.' The margin of error on that is plus or minus over a hundred thousand, so there's actually way less information in there than is usually reported. It's a crime that it's reported so authoritatively; it's just something I've been banging my head against for years, but nobody wants to hear it. We'll talk about noise a lot next class, when we talk about randomness.
The survey itself does change over time, slowly. Also, the unemployment numbers are revised after the fact: six months later the government will say, oh, we're revising this up 0.2 percent. In fact, sophisticated economic models use two sets of numbers when you back-test them: the official corrected numbers, and the numbers that were available at the time, because you want the model to work with the numbers you would have had when you made the prediction. How do they do the corrections? I'm not sure why they correct it or what their rationale is; we'd have to look into it in particular cases.
Here's another way to look at the unemployment rate: we can slice it by, here we go, the number of states above a certain level, and what we get is this sort of contour map. There are a tremendous number of ways to tell the story of what happened to unemployment during the Great Recession. Yeah, think of it as a yes: if you're at ten, you're also at nine and eight.

Talking about GDP: you've all heard of GDP, but do you know how it's counted? Think about it. Where do these numbers come from? How do you know what investment is? How do you know what an export is? Yeah, so there are official documents of various sorts, and a lot of this comes from surveys again as well, so it's got some error in it. To get consumption, or equivalently production, that's done with a survey of businesses too, and that's an intentionally non-random sample: usually it's stratified, which means it's designed to include some big businesses, some small businesses, businesses in different industries. Stratification is a sampling technique which can give you a more accurate result for the same amount of work.
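As an aside, here's a minimal sketch of why stratification helps, with entirely invented numbers: sample a fixed number of firms within each size stratum, then weight each stratum's mean by its known share of the population.

```python
import random

# Toy stratified sample of businesses: a few huge firms, many small ones.
# All numbers are invented for illustration.
strata = {
    "large": {"share": 0.05, "frame": [random.gauss(500, 50) for _ in range(1000)]},
    "small": {"share": 0.95, "frame": [random.gauss(5, 2) for _ in range(19000)]},
}

# Sample 100 firms per stratum and weight each stratum mean by its known
# population share; rare-but-important large firms are guaranteed to be
# represented instead of left to chance.
estimate = 0.0
for stratum in strata.values():
    sample = random.sample(stratum["frame"], 100)
    estimate += stratum["share"] * (sum(sample) / len(sample))
print(estimate)  # estimated mean employment per firm
```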
We talk about GDP as if it's this, as they say, naturalized concept: growth, everybody knows what economic growth is. But it's actually this very complicated thing, and here's another picture of it: it's defined to measure a particular set of flows, and I still meet economic reporters who don't really understand this. So I guess my lesson here is: don't use numbers when you don't know where they came from, because otherwise you won't be able to see the stories. If you say that GDP fell, and you're just looking at this equation, is that because there was less consumption, people buying less? Or because the government cut back spending on social programs? Or because there were more imports? Those could all be different stories. So if you just write a headline story that says GDP fell by 0.3%, it's hard to say what that actually means.
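For reference, the equation on the slide is presumably the standard expenditure identity, which decomposes GDP into exactly the flows just mentioned:

    GDP = C + I + G + (X - M)

where C is household consumption, I is investment, G is government spending, and X - M is exports minus imports. A 0.3% fall in the total could come from movement in any of these terms, which is why the headline number alone doesn't tell you which story you're in.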
Government purchases are part of this because the government buys things: if the government buys paper, then a factory produced that paper, and we're trying to measure total production; but because production is always consumed by someone, it's counted as consumption. You can actually calculate GDP a number of different ways which are, in theory, equivalent: in theory, everything that is produced is either consumed domestically or exported, so it's thinking of GDP as consumption instead of production, but they should be the same thing. How do they calculate it? There are international accounting standards, and there are a bunch of problems related to the fact that different countries use different standards. For example, China switched its GDP accounting about ten years ago to get closer to international standards, which means you can't really compare GDP across that switch very well. And I don't know if you've ever looked into accounting standards, but this is an entire field; there are big thick books of what they call GAAP, generally accepted accounting principles. There's a huge body of law on counting: how do you count things? So my point is that you make choices here. That's a really good question; sounds like a story.
There have definitely been scandals involving suspicious GDP numbers; again, China stands out as an example. For many years its growth was basically exactly 7 percent, or 6.8, or 7.1, or something, and I have a bunch of papers about how Chinese GDP figures were thought to be inaccurate in a variety of ways, basically political pressure on local governments to report particular numbers. One of the indicators is that Chinese GDP volatility is lower than any other country's, which is interesting: it could mean that there is a problem with the reported numbers, but it could also just mean that China is really big, because if you have lots and lots of people, that's going to tend to average out the noise. The question of Chinese economic reporting accuracy is a whole subject of its own. What do you mean? Well, ultimately they're national governments; they're sovereign over reporting the figures, exactly.
Okay, fun fact: the global accounts of the world do not add up. They're off by, depending on how you count, five to ten percent. That is to say, every country reports imports and exports, and if you add up all of the exports from Venezuela to France as reported by Venezuela, they are not equal to the imports from Venezuela as reported by France. Yeah, exactly: according to global economic trade data, there are huge amounts of money that just disappear. And by the way, these amounts of money dwarf things like foreign aid; this is far larger than the amount of money spent on development. A lot of it is financial crime; in fact, most of it is thought to be financial crime. Yeah.
What I'm really saying is that if you're going to use a number, you have to know where it comes from. We'll develop a list of questions to try to understand this. A lot of reporters don't understand what their numbers mean. Okay.
Some things are just very difficult to quantify. One of the things I've dug into is the history of the quantification of race on the US census. This is what census enumerators were instructed to do in the 1940 US census, and it's a very complicated process: you're supposed to look at this person and apply all of these rules to decide what race to write down. There is literally the one-drop rule: a person should 'be returned as a Negro, no matter how small the percentage of Negro blood.' And in fact the form had 'quadroon' and 'octoroon' on it, meaning a white person with one black grandparent or great-grandparent, so you were supposed to be able to figure out someone's ancestry three generations back by looking at them. Needless to say, this did not give very accurate counts, and in fact after the 1940 census there was research showing that many minorities were being undercounted, by millions, because they were being incorrectly identified as non-white. So they changed the way the counting was done, and this is what it looked like on the last census.
How it works now is that you ask the person what race they are, so now it's self-reported race. So what do you think: is this an improvement? No? Yeah. You fill out this form on the census because, well, that's just how they'd always done the census. The way the census was done at that time (there are still enumerators, but not as many) is that someone would go visit each person's house, ask questions, and write down the answers, filling out this form. And there were these complicated rules about how you were supposed to report race. They switched, in the 1950 or the 1960 census, to having people self-report race, because people were being miscoded as white. It was particularly bad for the Native American population: if the enumerators ran into someone in a city, it wouldn't even occur to them that maybe this person had Native American heritage. So the goal was accuracy.
And the US in particular uses a very odd categorization of race versus ethnicity: Hispanic is considered an ethnicity, not a race. Does anybody know why, by the way? There is actually a reason. Afro-Cubans: black Hispanics. That's why. When the system was designed in the 1970s they needed some way to record that, so they said, okay, we're just going to make race and ethnicity separate things, and this leads to all kinds of problems. I actually wrote quite a lot about some of these issues in The Curious Journalist's Guide to Data, so I'm going to give you a link to an entire chapter I wrote on race, and a real-life story of a reporter getting into trouble trying to join two databases, where one database recorded Hispanic as a race and the other recorded Hispanic as an ethnicity. When he did a database join, he missed all of the Hispanic names until he figured out what was going on.
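Here's a minimal sketch of that failure mode in pandas; the tables and codings are invented to illustrate the mismatch, not the reporter's actual data.

```python
import pandas as pd

# Database A records Hispanic as a race; database B records race and
# Hispanic ethnicity as separate fields, so "race" means different things.
a = pd.DataFrame({"name": ["Garcia", "Smith"],
                  "race": ["Hispanic", "White"]})
b = pd.DataFrame({"name": ["Garcia", "Smith"],
                  "race": ["White", "White"],
                  "hispanic_ethnicity": [True, False]})

# A naive join on (name, race) silently drops every Hispanic person,
# because "Garcia" is coded White in one table and Hispanic in the other.
joined = a.merge(b, on=["name", "race"])
print(joined)  # only Smith survives the join
```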
So just because you think you know what a variable means doesn't mean the data you have recorded it the way you interpret it. You'll get a database with a column that says 'race' at the top, and you have no idea what that is until you start digging into the production process for the data: you don't know how it's recorded, and you don't know how they got that information. Say you're trying to study some phenotypic aspect of race, that is, what somebody looks like, as opposed to how they behave, where they were raised, or where they live. Race isn't appearance, but certainly people's appearance changes how they're treated. One of these two ways of recording race takes appearance into account more than the other, so it might give you the type of information you want if you're looking at correlations between appearance and other variables. It really depends what you're after with the data. It's often hard to say there is a uniformly better way of quantifying any particular thing; there are only ways that are better for particular purposes. For the purpose of knowing how many Native Americans there are in the US, asking people whether they have Native American ancestry is way better. But for other purposes it might not be: for the purpose of knowing how many Native Americans there are whom no one recognizes as Native American, it's not going to be the data that you want.
So we record a lot of things that are actually very difficult to quantify: things like intelligence, for example. Oh man, the fights that people have had historically over the quantification of intelligence: what does IQ mean, and are these tests biased in one way or another? Every one of the things on this slide has a whole history of arguments about how to record it, and not just how to record it but how to use it. Even something like income is way harder to quantify than you would expect. Does anybody know what some of the problems are in trying to quantify income? Right: taxes, cost of living, and do you include government services in income? If you're going to analyze income inequality, do you analyze the cash people were paid, or the cash they were paid plus all of the free services they used? Do you include the health care they received that their insurer paid for? Do you include the subsidies of the schools they go to and the roads they drive on? There are all of these questions about what counts as income. Nonetheless, we quantify these things, and we can get useful information out of them, if we work with them carefully and understand their limitations.
So all of that was just on the question of how we count something. Now I want to talk about whether the data is any good: not what are we counting, but is it counted the way it's supposed to be? Why do we have data quality problems? What problems do we end up with when we try to quantify things? Yes: different definitions. And if the data you have was combined from multiple sources that are incompatible, you may not even know it. Participation, completeness, yeah: completeness is a big problem. This is a problem the polling industry has had to solve in a big way, because apparently 90% of people will hang up when you call them to do a survey. So what do you do? Does anybody actually know the answer to that, how modern polling works? Yeah, pretty much: you combine different types of polling data. You do things like, first of all, weighting the sample you do have by demographics. So if you call people and discover that only 10% of your sample is 18 to 35, but you know that group is actually 30% of the population you're trying to survey, then you can weight up that part of the sample. And you can combine online polls and phone polls.
more later what other data quality
problems do we have yeah what about
selection bias right right exactly
so the questions on the survey okay
yeah so let's let's call that survey
design
[Music]
yeah okay so let's let's just call this
lies that's a category of data quality
What about more technical problems? There's this whole category of, yeah, and in particular there's a sort of accessibility bias: you're much more likely to use machine-readable records. You saw the handwritten notes in the documents I used for the Iraq story; I mostly didn't use that material because it was too annoying. Who knows what I missed.
Then you've got the whole category of data conversion issues. It's a long list: you start to get things like missing files and bad formatting. And you've got this very large category we call inconsistent values. You know: is the U.S. always spelled 'U.S.', or do we get 'U. S.', or 'United States', or 'United States of America', or 'USA'? There are lots and lots of different ways of recording that.
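A cleanup pass for that particular problem might look like this sketch; the variant list is illustrative, not exhaustive.

```python
# Map the many spellings of the United States to one canonical value.
VARIANTS = {"u.s.", "u. s.", "us", "usa", "united states",
            "united states of america"}

def normalize_country(value: str) -> str:
    return "United States" if value.strip().lower() in VARIANTS else value.strip()

assert normalize_country(" U.S. ") == "United States"
assert normalize_country("USA") == "United States"
assert normalize_country("France") == "France"
```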
There are all kinds of issues here. There is, I think it's safe to say, no consensus on how people of Hispanic or Latin descent identify themselves now, and you see that in census data. Let's look at this form again: some people will write in 'Hispanic' in that field, or 'Latina.' Some people will identify by their country; they'll say I'm Mexican, or I'm Brazilian, or Guatemalan, or whatever it is. And if different databases or different people record this in different ways and you try to aggregate them, do you even have all the categories? I don't know what to call that; consistent identification, maybe. You get these cases where there's actually no agreement on what people want to call themselves. Then you get a bunch of just plain old errors, like truncated data and missing rows. And we haven't even gone into the economic or political reasons why someone would count one thing and not another.
What do I mean? I'll show you. You can have data quality problems that are not even intentional, that are just the result of technical errors; that's the first case here. When the Los Angeles Times built a crime map using a live data feed from the police department, they discovered that 40 percent of the crimes on the books were not coming through that data feed. This apparently just turned out to be a technical error: nobody had really looked closely at that API, and it wasn't working. So they wrote a story about the problem. And by the way, that's always an approach to handling data quality problems: if you think you should be able to get data, because it is public-interest data, and you can't, that's a story. You can always write the story about how we don't have access to this critical data. Maybe it's not as good a story as 'here's our analysis of this data,' but there are lots of examples where writing the story about being unable to get the information was an important story, and resulted in the information becoming available.
Here's another issue, the second one. This is a New York Times story from some time ago, about different police officers having different conceptions of whether to record something as sexual harassment. So who you run into when you're reporting a crime affects whether it's recorded as a crime. Think of any sort of data entry process: for example, with the serious assault cases that the LAPD underreported, somebody is going through, reading the description (maybe it's the officer, maybe it's a data entry person), and checking a box: is this a serious assault or not? Different people have different standards.
Changing definitions: oh my god, the fights that people get into over definitions. This is a long-standing one, but we're seeing it right now: what counts as a rape? How do you know how many there are? I taught an entire workshop around the use of data for gender advocacy, and one of the things we go into is the history of rape statistics: how do you record them, what definitions do you use, and what methods do you use? You get into things like the privacy of reporting. There's a standard crime victimization survey where somebody visits a person in their home and asks, have you been the victim of a crime? If you're asking someone about domestic violence and the partner is standing right there, you're not going to get a very good answer. And then there's what wording you use to ask. There's a lot of variation in the process that may have produced your data. So let's take a sneak peek here.
Oh yes, this is another example. This is from some work I did with some Indian data journalism organizations, and one of them produced this beautiful example of being able to tell that the data was wrong just by looking at it. We've been talking a lot about what I call external checks, where you go and talk to a person or try to figure out how the data was recorded; this is an internal check, where you can just look at the data and see that there's a problem. Their claim is that traffic accidents are being underreported in these two cities; I've forgotten which state... Lucknow, thank you. So they're finding problems with the accident statistics coming out of this state, and their argument is that the traffic accident deaths are about in line with the population, but the accident reports are way low. What they think is happening is that deaths are reported because they're kind of hard to hide, and this is a pattern we see in crime statistics generally: the more serious the crime, the more complete the statistics are going to be, because it's hard to hide serious crimes and people are more likely to report them. But for the less serious crimes, I don't know what's going on: somebody's not giving out tickets, or has a definition of accident which is very narrow, or who knows. Something's wrong with these numbers.
As I was mentioning earlier, I have this split, in terms of methods to evaluate data, between internal and external validity. Internal validity covers checks you can do on the data itself. For example, histograms can reveal a lot of important information about the data: you can see whether the data was quantized, and you can often find format problems because you see a bunch of zeroes. And often you know certain things: you should have exactly 50 rows for 50 states in this database, and if you don't, that's a problem. Certain numbers should add up, and you might get derived figures, like percentages, that you can check. You can check all this stuff.
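Here's a sketch of what those internal checks can look like in pandas, assuming a hypothetical per-state file with columns state, unemployed, labor_force, and unemployment_pct.

```python
import pandas as pd

df = pd.read_csv("states.csv")  # hypothetical file and columns

# Row-level checks: one row per state, no duplicates.
assert len(df) == 50, f"expected 50 states, got {len(df)}"
assert not df["state"].duplicated().any()

# Derived figures should be consistent: a percentage column should match
# the ratio it was supposedly computed from, within rounding error.
recomputed = 100 * df["unemployed"] / df["labor_force"]
assert (recomputed - df["unemployment_pct"]).abs().max() < 0.1

# A histogram reveals quantization and format problems, like piles of zeros.
df["unemployment_pct"].hist(bins=50)
```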
But probably you're going to spend a lot more time on external validity, which is comparing to other data sources, and comparing with people and expert knowledge, which is the conversation we were just having: is this reasonable, does this make sense? You see a lot of amateur data work where the authors have no knowledge of the actual subject, and so they end up saying something completely wrong.
The famous example of that is Null Island. Do any of you know what Null Island is? There's also a spot in Kansas. This is a problem that happens with geocoding: Null Island is at latitude zero, longitude zero, which is often what geocoding returns when it fails. The other thing is, if the geocoder can't find an address and just defaults to 'the US,' what you get is, I'm trying to find it, there you go: this spot in Kansas, the geographic center of the United States. I've seen data analyses that have an oddly high number of whatever it is they're counting in Kansas, because that's where all the geocoding errors ended up. So there's all of this crazy domain-specific stuff that you have to watch for.
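Here's a sketch of a defensive check for those two failure modes; the center coordinates and tolerance are judgment calls, not standard values.

```python
# Flag coordinates that look like geocoder defaults rather than real places.
US_CENTER = (39.83, -98.58)  # rough geographic center of the contiguous US

def is_suspect(lat: float, lon: float, tol: float = 0.01) -> bool:
    null_island = abs(lat) < tol and abs(lon) < tol
    us_default = (abs(lat - US_CENTER[0]) < tol
                  and abs(lon - US_CENTER[1]) < tol)
    return null_island or us_default

points = [(0.0, 0.0), (40.7, -74.0), (39.83, -98.58)]
print([is_suspect(lat, lon) for lat, lon in points])  # [True, False, True]
```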
So in data journalism we have this saying, 'interview the data,' which means treat it as a source, and treat it with the thoroughness and skepticism that you would expect of a source interview. I actually have another page of these questions, but let's do this exercise: what sorts of questions, other than the ones on the board, should we be asking of our data? Yeah, context. Context is a big word; it can mean a lot of things. What else?
I want to come up with a list of things to ask about the data before we use it. What values can it take? Yeah, sure; I mean, that's ultimately what we're trying to find out, but let's get a little more specific. Does it make sense at first glance? Yeah, that's a good check. What about the politics of data production; how can we ask about people's motives in putting this together? Yes, that is an excellent question. Let's call it 'who benefits from this data.' There are a bunch of obvious things here: if you are reporting crimes, you want to see the crime rate go down; if you're reporting growth, you want growth to go up. And this is why we have accounting standards: because when people get to make up their own standards, they make up the standards that make their company look healthier than it is. A few of you haven't said much this class; let's hear from a few others. Yeah, let's compare it: yeah, absolutely, we should compare it to similar data, either historical data or other types of data that measure similar things.
Yeah, so let's call this the sample frame. What about definitions? Right: we've run into all these cases where defining something is really hard. What do you count? Even things that seem concrete have definitional problems. Say you're trying to count how many people have asthma. Well, what does it mean to have asthma; how do you count that? You can count the number of people who've been prescribed asthma medicine. You can count the number of people who, from medical records, have been diagnosed with asthma, but that doesn't count people who moved in from somewhere else after being diagnosed. You can count emergency room visits. You can do statistical estimations based on these things. You can do population samples and ask people, but then how do their answers differ from a diagnosis? Even counting the number of people with a concrete disease is an incredibly difficult thing to do; you have to make complicated choices about what you count. So I'm going to call that basic definitional problems. And then we have the problem of whether the definition has changed, so let's add the history of the data collection. If the definition changed in 2007, say we went from counting prescriptions for asthma medicine to counting people who said they had asthma on a survey, we're not going to be able to compare before and after that date, at least not simply. As usual, nothing is ever as simple as it seems. Here's some more from my list.
We got a bunch of these in our discussion; the 'who benefits from this data' is a really good question. Okay, any last questions on this before the next section of the course? Oh sure, right: some data is collected by surveys, and some data is collected by, you know, logs of what people click on. All kinds of issues. Let's talk about that: what types of issues would you get with clickstream data? Lots, okay. Bots, bots. The data size: you're probably going to have to make some approximations, because even counting the number of items in the database becomes a serious estimation problem when you start to get into the trillions. All right, and then I'm going to put down time: do we have complete logs, did we lose the Kansas City data center for three hours yesterday, that sort of thing. There are all kinds of problems even when data collection is automatic, because it's not really automatic: it's a system that humans are creating and maintaining. As I like to say, software runs on people, and if you don't believe me, then why do you have to keep upgrading your WordPress install? And falsifying records, right.
All right, we've got a lot more to cover today: risk ratios. This is a very simple statistical measure, and we're going into it in part because it's an interesting thing that's used in a bunch of stories, and in part because it's going to prepare us for a deep dive into some machine bias material two classes from now; it's going to be our introduction to confusion matrices. This was a story I was involved in when I was at ProPublica. It was one of the first times we, or anyone, got access to a data set of police killings, and we did what may have been the simplest possible thing, which is compare the rates of teens shot by police, black versus white, in this case teenage men. I made this little graphic to try to explain what was happening, and I wanted to communicate a bunch of things: first, that the number was much higher for black teens than for white teens, but also that these numbers were small in absolute terms. This is the diagram I came up with. Now, for our purposes, the key thing I want you to notice is that this is not an analysis you can do with fewer than four numbers. You always have four numbers here: a numerator and a denominator for black versus white. But ultimately we summarize them with one number, which we call the risk ratio, and that's what we're going to be talking about today. In this case it's 31 over 1.5, so what is that, 20 or so, and that was the headline number in the story.
the story now there's sort of a bunch of
other difficulties in this story
including you know the data set we got
was the set of police records is very
incomplete many departments don't report
but this was before people started
building detailed databases and so this
was one of the first numerical estimates
of the difference another example of
this is a little more subtle and in fact
this is a case where I think they should
have used a risk ratio and didn't so
this was a story that came out in the
summer of 2016 before the election and
one of the sort of narratives happening
at the time was you know let's look at
Clinton's connections to the Clinton
Foundation and whether there was
influence there from her donors and this
was the top of the story that ap wrote
so take a second to read this and then
we're gonna talk
Okay, so what's the narrative here? Why do they say this? I want to work through the logic of it. First of all, do you find the story convincing? Does it seem to you like this is evidence that Clinton is doing something wrong? No? It's everyone outside of government who was on her calendar, yeah. So that's where this comes from: a Freedom of Information request for her calendar, since many public officials' calendars are public records. So right away we've got data quality problems: what's measured and what's not measured. But I want to set those aside and focus in a little closer on this.
This is the second simplest statistical model; the first simplest is to just count, and a count or a proportion gives you a single probability. In this case, let's say there's a blue cab company and a yellow cab company, and I'm looking at the accident rate. The yellow cabs have more accidents, but there are also more yellow cabs, so if we're asking which sort of cab is more dangerous to get into, we have to take into account the relative, not absolute, proportions: the accident rate for yellow is this out of this, compared to this out of this. And given that there are more than three times as many dots here as here, whereas there are exactly three times as many accidents, this actually means that the yellow cabs are safer to get into, even though the number of accidents was higher. This sort of two-by-two table is something we'll be looking at a great deal through the next two lectures, because it's a surprisingly subtle model, and the point I'm making is that to interpret it we have to look at some sort of ratio. So let's see, here we go.
I don't need to draw this; it's all written on the slide. Here's another common context: instead of accidents and cabs, we now have smoker versus nonsmoker, compared with whether someone got, let's say, lung cancer. The question we're ultimately trying to ask is: does smoking lead to cancer? To do that, I claim we have to use all four numbers, and we have to use them in this way. Does that make sense to everybody? Just working through the logic: we're looking at the fraction of smokers who got the disease, out of the number of smokers overall, which is a plus b, as compared with the fraction of non-smokers who got the disease, out of the number of non-smokers overall. I'm going to claim that this is the simplest possible way you can compare these, and that if you remove any one of these values, you can't answer the question you want to answer, which is whether smoking is correlated. I mean, what you really want to know is whether smoking is causal, but before you even get there you have to ask about correlation, and I'm saying you can't even see whether there's a correlation without using these four numbers in this way. Okay.
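In notation, the table and ratio on the slide presumably look like this:

                    disease   no disease
    smoker             a          b
    non-smoker         c          d

    risk ratio = ( a / (a + b) ) / ( c / (c + d) )

Drop any one of a, b, c, or d and the comparison collapses.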
Now let's apply this to the Clinton case, and let's think of a two-by-two matrix here. Let's see: in place of smoker versus nonsmoker, I'm going to say donated versus did not donate. And then what are the outcomes on this side; what do I put up here? If we look at this story, what are they saying? The money gets you a meeting. Okay, so: meeting versus no meeting. Now let's look at these numbers and try to fill this out: the 85 and the 154, where do they go? And the 154... okay, so what is that number? Anybody? 154 minus 85 is 69, okay, thank you. All right, so we have half of this table; here it is. By the way, the way I've written it here is an odds ratio, not a risk ratio: a risk ratio is a ratio of probabilities, and an odds ratio is a ratio of odds. Does everybody know the difference between odds and probability? Yes? No? Did we do this? No? Okay, so I will refer you again to my book. Where is 'accounting for chance'... here we go.
Okay, so here's probability: it's the number of, we usually say, favorable outcomes, but really just the outcomes we want to count, out of the total number. So we ask, what's the probability of purple? Well, how many purples out of how many of everything. Whereas odds are just favorable divided by unfavorable. So they're numerically related; you can calculate one from the other. If you want the probability of having donated given that you got a meeting, it's got-a-meeting-and-donated over the total number of people who got a meeting, which would be 85 out of 85 plus 69. An odds ratio functions very similarly to a risk ratio. I said earlier that the risk ratio was the simplest possible way we could look at this; that's not quite true, because we could just take a over b and c over d, and then we would have an odds ratio rather than a risk ratio. But it functions in very much the same way, and I've done it this way to simplify things here. So if we call these cells a, b, c, and d, the number that tells us whether paying is associated with an advantage in getting a meeting is this odds ratio:
    odds ratio = ( a / b ) / ( c / d )

where a is donated and got a meeting, b is donated and did not get a meeting, c is did not donate and got a meeting, and d is did not donate and did not get a meeting. So what values of this would suggest that paying gets you a meeting? Right: if this ratio is greater than one, that supports the corruption story. But this is not what the story has. What the story has is just one fraction, because we don't have all of these numbers. In fact, if you're looking at a story which claims that paying money, or giving some sort of favor, gets you something back, then you should know immediately that you need four numbers, and if you only have two, you don't have enough to conclude this. So this is a case where I'm still thinking about it two years later, still trying to wrap my head around the logic, but I think this is a case where there's less than it seems. We know that 154 people met with her: 85 were donors and 69 were not. But we don't know how many people donated and did not get a meeting. In particular, say b is very large: she has lots and lots of donors who don't get meetings. That means that paying for a meeting is not a quid pro quo. If a over b and c over d are about the same, then paying doesn't give you any particular advantage, because the ratio will be close to one.
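To see how the conclusion hinges on the missing cells, here's a sketch using the story's a = 85 and c = 69, with invented values for b and d:

```python
def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    return (a / b) / (c / d)

a, c = 85, 69  # met with Clinton: donors, non-donors (from the story)

# If donors who didn't get meetings are rare, the ratio is huge:
print(odds_ratio(a, b=10, c=c, d=1000))    # ~123: looks like access for sale
# If thousands of donors and non-donors alike never got meetings,
# in similar proportion, the ratio is near 1: no particular advantage.
print(odds_ratio(a, b=5000, c=c, d=4000))  # ~0.99
```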
Yes? Well, okay, but these are people who wanted a meeting, presumably, and that's why I phrased it that way. So the question is: say you tried to get on her calendar, you weren't a Clinton Foundation donor, and you didn't get on her calendar; then you'd be in this box d. If b and d are very high and about the same, then this ratio goes to one, so that's one way these numbers might not indicate any advantage. On the other hand, say b is smallish and d is large, so that basically nobody who donated failed to get a meeting; then the ratio is very high, and that supports the argument. But we don't know which case we're in, and you can make reasonable arguments that any of these things are true. Yes: some total number of meetings per year is very reasonable for a Secretary of State.
putting an extra constraint on this
that's saying that a plus a plus C
equals some number of meetings per year
that gets us closer that helps with our
inference that sort of reduces one
degree of freedom that's a good
observation in fact one of the one of
Clinton's responses through a you know
her office was that uh you know this is
a crazy estimate because it doesn't
count all of the government people I met
with for example and I'm sure it doesn't
count you know doesn't count all the
people she talked to who she didn't meet
in the office right so there's lots and
So there are lots and lots of problems with this story, but the reason I wanted to bring it up is that I wanted to work through the translation from the words to the numbers. If the right phrasing for the verbal claim we are making is 'paying money to her foundation got you a meeting with Clinton,' then I think this ratio has to be greater than one; I think there's no other reasonable translation. Which means that if you believe that, and I'm very interested to hear what you all think, because I've been thinking about this for two years and I'm still not certain it's correct, then this story is a bad story: the reporters just didn't have anything. And I think if that is true, then they should have known better. I don't know them personally; I used to work at the AP, though not when they wrote this story, and there are also several thousand reporters there. Yeah.
Well, I think it does show something: it shows that she has deep connections between her State Department work and her Clinton Foundation work, and when you start looking into the details, here's where we start looking past the data into the rest of the story. She's doing things like, I mean, the Clinton Foundation does international development, and the State Department is also involved in international development, so we can say there's definitely crossover between her government work and her foundation work; that much is clear. So I think you can make an argument that she's entangled, and maybe that's a problem, and maybe the story is defensible on that ground. But what actually happened is that they walked the story back. Actually, let me get the record of this.
Well, the AP doesn't have its own... all right, so here: I wrote a long analysis of this exact problem, because this is something we really want to do. We would love to find corruption in data, but it's often very hard, because just looking at who donated to what and who got what is not going to answer the question of whether donating gives you an advantage; for that we need four numbers, not two. They walked this back. And what I also want to talk about a little bit later is the causal structure here. Okay, here we go.
They changed the headline and deleted the tweet. There are layers and layers to this question. One of the problems is that they kept the story but changed the headline and the tweet, so that's an issue: the headlines didn't match the content of the story. I think it is very interesting that she's so entangled between the foundation and her political life, but it's not immediately obvious what it means. And then, ultimately, what you would really like to do is show cause, and showing cause is very hard for a variety of reasons: you have to show that this ratio is greater than one, so that there's an advantage, and you have to show that the reason the ratio is greater than one is that paying money gets you the meeting. That gets into the issue of correlation, and we're going to talk about that shortly.
How many of you have done linear regression before? Yeah, it should be all of you. Is that a no or a yes? You've never seen it? Okay. I think all of you have at least seen linear regression. I want to talk about it briefly because I want to talk about how it's used in journalism, and it's also an entry into the problem of causality, which we touched on a moment ago: very often, first we do regression, and then we try to make a causal claim. Yeah.
Okay, so here's one of my favorite multivariable regressions in journalism. This is old data, actually data from 2001, and the reason it's from 2001 is that for a few glorious months we had data on both tickets and warnings for all traffic stops in Massachusetts. At the time, the difference between a ticket and a warning was whether someone checked 'ticket' on the form: the police officer would stop you, do the 'do you know how fast you were going' thing or whatever conversation they had, and then decide to give you a ticket or a warning. So it was kind of a natural experiment, because we could look at the relative fraction of tickets versus warnings and see who got punished more. A ticket is a fine; it goes on your driving record and can change your insurance rates. A warning is basically nothing. And just by doing simple counts you can get charts like this: men get more tickets than women, minority drivers get more tickets than white drivers, state police give more tickets than local police, and so on. The analytical issue, of course, is that you don't know whether the drivers are the same in all other respects. The data is actually very rich; I'll show it to you very briefly. I have it loaded into a Workbench tutorial here, where is this, we were doing this in the tutorial section: tickets versus warnings.
Here we go. By the way, some of you have seen Workbench and some of you have not: Workbench is a code-free data journalism system designed for transparent and reproducible work, and you can publish this document, this workflow, along with the story. Here's what the data looks like. It has all of this interesting stuff: the type, ticket versus warning; the day of the week; which police agency gave the ticket; the badge number of the officer; the race, coded here, and you can see it's unknown, black, white, plus Asian and a few other categories broken out; the sex of the driver; whether the vehicle was searched; where they were; what time it was; and what type of citation it is. The first thing the Boston Globe analysis does is filter down to speeding: if I look at this filter here, that's what I'm doing, keeping only speeding stops where I know how fast they were going and how fast they were supposed to be going. So we have a lot of information here, and with a basic analysis, this is the chart you just saw: the ticket-versus-warning rate for whites versus minorities. We can make that a little simpler by looking at those as rates of tickets, and here we see the disparity. In the next cell, what I actually do is use a formula to calculate the difference, and it's about 19 percent higher.
Yeah: do we know anything about the officers? So you have to ask about the accuracy of that coding, and as a matter of fact we can look at the officers, and they did that. Here, I'll show you: this is the actual story, an archive of it, and somewhere in here... there we go: 'minority officers are the toughest.' There's the graphic they did about this. Let me give you the URL to this story; you can download the original data as well, and it's a great way to play with multivariable regression.
Anyway, these are just raw counts, but now you have to ask questions like: were drivers of different age, race, and location going the same speed? Maybe more minority drivers are from out of town, and out-of-town people get ticketed more. You have to account for all these other factors, and this is why we do multivariable regression: we want to account for all of these factors simultaneously. You start to get words you've probably heard: controlling for the speed the drivers were going, controlling for driver age, you still get a higher rate of tickets. The idea here is that we take this statistical model and try to reverse-engineer it to compare people who are otherwise different. So if we find out that going 10 miles an hour over the speed limit rather than 20 reduces your chances of getting a ticket by 5%, then we take all the people who were going ten over and increase their rate by five percent, and see if that compares evenly. This is classic multivariable regression, and it introduces the idea of controlling for things.
for things this was one of the stories
that they give I actually found this one
was kind of fascinating because this was
one of the only stories I've seen which
looks for a disparity doesn't find one
and then reports that so we all know
that's a problem in science with
negative results being unreported it's a
big problem in journalism as well and
they did all this work yeah you know I'm
of the opinion that you should just
write a story that's insane that's okay
because that becomes archival our
reference it becomes a reference point
in these discussions but you don't
really see that and I I feel this is one
of the places where a statistician or a
scientist and in journalists are
learning very different ideas about what
should be written another thing I want
Another thing I want to point out here is that they actually didn't find that everything was identical: they found that there was a 2% difference for the men, I guess, and they have this curious phrase: 'these small differences are not statistically significant.' We will talk about statistical significance in some detail in the next class, but when you compare big aggregate numbers like this, they're never going to be exactly the same. So then the question is: how much of a difference is important? Is a 19 percent difference important? Is a 2 percent difference important? If you're going to compare graduation rates, or rates of men and women in management positions, or any of these sorts of diversity issues, I don't know of an obvious way to say this is the threshold where the difference matters, and below that we don't care, and above that we do. Even the statistical significance argument here, where they're probably using a p less than 0.05 type of test, relies on an arbitrary threshold as well. It's a complicated problem to turn a number into a binary yes or no.
Oh yeah, that's what we were just looking at. This is another story which involves regression, but it expands the frame a little to look at the overall question of controlling for multiple factors. Here's the original story; this is a recent New York Times story, and I find it fascinating for a number of reasons, one of which is that it's a story about a company making a claim that seems to be true: they're saying you can wear these shoes and run faster, by about 4%. Okay. So the naive way to do this: they had all of this information from a running site, there you go, Strava, so they were able to get thousands and thousands of races, and the simplest way to do the comparison is to just look at all the races where people were wearing the shoes versus not. That is a bad idea. Why is it a bad idea? Different people run different races wearing different shoes. Well, what they're actually doing is more sophisticated; what I'm saying is, what if you did the analysis by just downloading all of this data and comparing the people wearing these shoes to the people not wearing them?
Right: or they're sponsored, or they run at night instead of the morning, and apparently people run slightly faster at the end of the day. Or they could be running on different tracks, or they could just be better athletes who like these shoes, because Nike is marketing to particularly serious athletes. There are so many things it could be, so you have to control for all of this other stuff. The Times story is different because they controlled in three different ways. Before the New York Times writes a story saying this shoe will make you run faster, which is a big endorsement, they want to really nail it down.
this down so here's so this top one this
statistical model that is a multi linear
regression okay so you can think of this
as finding the the slope of the best-fit
line that accounts for all these factors
and then trying to equalize everyone
else by subtracting off those factors so
it says okay we know people run faster
at night so let's take everybody who ran
at night subtract off the 3% advantage
that gives okay we know people who are
younger run faster so let's subtract off
the advantage that gives and trying to
balance everything out with the scenes
there's still a difference with these
shoes but that's not and that's the
classic method but that's not the only
way of controlling for factors so they
have these two other ways right when we
compare changes and race times among
groups with runners who sat around the
same pair of racers so let's figure out
what that actually means here
Runners who ran the same pair of races, yeah. Somewhere in this article they have a little more description of exactly what they did, but the idea is: you take a person who ran the same race with and without the shoes, then you take a whole bunch of these people, and you look at the difference. The idea is that everything else is the same: it's the same person, it's the same pair of races. You hope, I guess, that they're not always running the second one at a different time of day or something. But this is a different way to control for an effect: they're choosing the data to hold everything else still. Rather than trying to model all of the data, they're just picking people who are directly comparable: the same person running two different races, hopefully on the same course.
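A sketch of that matched-pairs construction, with invented column names:

```python
import pandas as pd

races = pd.read_csv("strava_races.csv")  # runner_id, race_id, shoe, pace

wearing = races[races["shoe"] == "vaporfly"]
not_wearing = races[races["shoe"] != "vaporfly"]

# Pair rows for the same runner on the same race (e.g. different years):
# runner and course are held fixed by construction, not modeled.
pairs = wearing.merge(not_wearing, on=["runner_id", "race_id"],
                      suffixes=("_vf", "_other"))
pace_change = pairs["pace_vf"] - pairs["pace_other"]
print(pace_change.mean())  # negative means faster with the shoes
```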
And then this is a third approach, which is to look at before and after switching shoes. Again, you kind of have to hope there's no correlation with something else: that they didn't do their total makeover program, 'I'm going to start training more and switch shoes.' So there are ways you can still make mistakes here. But what they found is that no matter how they ran the numbers, these Vaporflys are still an outlier. This is an example of a robust result: it means that even when you change the model a bunch of different ways, you get the same answer. This is what we're looking for in data analysis; we're going to talk about this idea of robustness a lot more next class. And of course you want non-data sources to also match.
This is a big regression that I was involved in: this is Surgeon Scorecard. It was based on five years of Medicare data, and it was the first analysis of surgical outcomes to name individual doctors. To do that, we had to account for a number of factors, including the characteristics of the patients, because certain patients will be more complicated or sicker than others, and the hospital they're operating in, because there's a long line of research suggesting that there is what we would statistically call a hospital effect: certain hospitals just have a higher quality of care. We also had to make the patient population incredibly uniform, so we looked at elective surgery only, knee replacements, spinal fusions, things like that, procedures that were non-emergency, because they were much simpler and more uniform, and we screened out patients with a variety of pre-existing conditions and really tried to narrow it down. There is a long methodology paper that explains all of this.
This was an extremely controversial piece, but one of the ways we tried to make it less controversial was that rather than just reporting the raw rate, we reported the result of a model, and this is what that model looks like. Let me try to explain it. The solid line in the middle is every doctor in our sample, we had, I forget how many, ten thousand or something, ordered by the model rate, and the model rate is what we reported. You can see it goes from about 1% to, well, it actually goes up to about 20%, but really most of the range is one to five percent. Then these dots are the raw rates for each doctor, meaning the actual number of complications out of the number of procedures they performed. We had to define "complication" very carefully; in this case we defined it in a way that is pretty common in the medical world, which is readmission within 30 days for a related problem. We looked at the diagnosis codes for the readmissions, since what we had were admission records. It's not a perfect way to find complications, but it's not an awful way either. The reason you get these bands is small-number effects: a rate of six percent could be, you know, three out of fifty, so what you're seeing are ratios of small whole numbers. What you'll notice immediately is that the raw rates have a much wider range than the adjusted rates. We also adjusted for patient characteristics, like whether they have a history of heart disease and so forth, but the unadjusted model, which is the open circles, you can see the open circles and the asterisks are almost on top of each other: adjusting for patient characteristics changed very little. So what's going on here?
There's something called shrinkage; have any of you heard this statistical term? The idea is this: we're trying to report a complication rate, but there's an element of chance in which patients a doctor gets and what happens on the day of surgery, whether the patient gets exposed to an infection. So we don't expect the raw rates to be accurate; there's chance in there. In particular, we expect that most of the doctors who appear to have a particularly high rate just got unlucky, and most of the doctors who appear to have a particularly low rate got lucky. Imagine there's some underlying complication rate that is due to the doctor's skill or practice, and what we observe is that rate plus or minus luck. If you work through the math and try to estimate that underlying rate as well as possible, what you'll find is that your estimates should be narrower than the observed rates, which is what we see here. What this model is telling us is: we think most doctors are actually somewhere around the average rate of 2.2%, and when we see a number far from that, we tend to shrink it towards the average, making it smaller if it's very high and bigger if it's very low. This is a statistical effect you're going to see in lots of different analyses. So in this case the modeling, rather than controlling for all of these other patient factors, was mostly getting us a more accurate estimate by pushing everything a little closer to the average.
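Here's a minimal sketch of the shrinkage idea, using a simple empirical-Bayes estimator on made-up complication counts. This is not the Scorecard's actual hierarchical model (see the methodology paper for that); it just shows how rates get pulled toward the average:

```python
import numpy as np

# Made-up data: complications and procedure counts for six doctors.
complications = np.array([0, 1, 3, 2, 8, 1])
procedures    = np.array([40, 55, 50, 120, 60, 30])
raw_rate = complications / procedures

# Fit a Beta(a, b) prior to the overall rates by the method of moments,
# then shrink each doctor's rate toward the prior mean. Doctors with
# fewer procedures get pulled in the most.
m, v = raw_rate.mean(), raw_rate.var()
strength = m * (1 - m) / v - 1          # prior "pseudo-procedure" count
a, b = m * strength, (1 - m) * strength

shrunk = (complications + a) / (procedures + a + b)
print(np.c_[raw_rate, shrunk])          # shrunk rates span a narrower range
```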
Have you ever seen shrinkage before? So, this is probably the most sophisticated statistical analysis I've seen in journalism. I had the good fortune to be involved in programming some of it; here, let me get the link for you. They wrote this long methodology paper which, if you want a crash course in statistics-based reporting, is really something. Ultimately it's a gigantic logistic regression.
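Just to show the shape of that kind of model, here's what the skeleton of a logistic regression on outcomes data might look like. This is a hypothetical simplification with invented column names, not the Scorecard model itself, which also accounts for hospital effects:

```python
import pandas as pd
import statsmodels.formula.api as smf

cases = pd.read_csv("cases.csv")  # hypothetical: one row per procedure

# The outcome is binary: was there a complication (a qualifying
# readmission within 30 days) or not?
model = smf.logit(
    "complication ~ C(procedure) + age + C(sex) + heart_disease",
    data=cases,
).fit()
print(model.summary())
```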
Now, this story had a lot of impact. It really split the medical community. Half of people were like: who is this news organization, using this non-peer-reviewed methodology to critique doctors, how dare they. And half of the medical establishment was like: we've been trying for 20 years to get greater transparency around outcomes, because preventable medical errors are one of the leading causes of death in the U.S., somewhere between two hundred and four hundred thousand deaths a year, and patient safety advocates have been saying for decades that one of the things that might help most is transparency around complication rates. Because when you're referred to a surgeon, how do you know if they're any good? Now, it's not quite that simple; there are a lot of problems with transparency around complication rates. For example, they have to be adjusted for patient mix, which of course we tried to do to the best of our ability. And the loop has to close: just reporting that a doctor has a higher complication rate doesn't mean that complication rate is going to go down; that depends on why, and on how patients, doctors, staff, and funders perceive these rates. So it's not obvious to me that publishing this story will ultimately result in better medicine. What is clear is that it advanced a conversation that had been stalled for decades, and now part of the critique process is other people coming up with what they think are better measures, so it's really forced the field to move forward. Yeah?
The data is from Medicare, which is government-subsidized health care for older people, about 40% of the medical care delivered in the U.S. How this happened is that the Wall Street Journal wanted to investigate Medicare fraud, so they actually fought a court case to get access to the records and did some stories about Medicare fraud, and then ProPublica said, well, now that news organizations can get this data, what can we do with it, and settled on this. There was actually a complicated data use agreement between the government and ProPublica that had to be negotiated and signed before they got the data, and it included terms on privacy protection. For example, see where it says "redacted" here for all complication counts from one to ten: if there were fewer than ten complications, we couldn't report the exact number, because if you knew the exact number, say you knew there were only three complications at a particular hospital, you might be able to figure out who the patients are. Yeah, probably, but even if there were, say, eight, you'd still have general privacy concerns; you want to name the doctors, not the patients. So: a complicated data negotiation, and a complicated statistical analysis, which was basically a big regression. We did control for all of these patient factors, just like the shoe story controlled for who's racing and when and for all of those runner characteristics, but here mostly what the modeling is doing is shrinkage: it's pulling everybody towards the average to try to balance out the effects of chance.
Okay, so those are a few examples of multiple linear regression in practice. Normally, why do you do a regression? You look at the coefficients, so you can compute these risk ratios, so you can try to assign causes. So let's talk about causal models; this is the last major topic today. Here is a graph of Nobel Prizes per capita versus chocolate consumption per capita. So clearly, eating chocolate makes you win Nobel Prizes, right? Exactly.
Here is some data from an old study on mortality versus smoking: they knew the smoking rate and the mortality rate in different professions, and when you graph them you get this. So clearly smoking causes people to die, right? Okay, so you all believe one and not the other. Why? There is something about cause that is not in the data; that's what I'm trying to demonstrate here. This has exactly the same logical structure as that, and yet you believe this one and not the other. So what's the difference? Yeah, so there could be confounders, for sure; that's where this is going. Yeah, exactly: the difference is due to a story we have about how the cause works. In particular, for the smoking case we have in our heads a very good idea of a causal mechanism. So cause is not something that can really be inferred from the data. You can get evidence towards causation in the data, but by itself it can very rarely tell you there's a cause, at least absent an experiment, and we'll get to what an experiment is in a minute.
Here's another one: unemployment rate versus investment. Is this saying that if you invest more you'll have lower unemployment, or that when there's lower unemployment you're going to see higher investment? Neither? Both? What's going on here? Indeed. Notice that the x-axis is conventionally the independent variable, and for that reason it's easiest to read the chart in that direction. There's actually a paper that just came out suggesting that really we should turn all the charts sideways to keep people from reading them causally; I don't know if that would actually work, but it's an interesting observation. Not only do we always put the independent variable on the x-axis, we also very naturally talk about cause, and it's hard to even describe this graph without implying it. If I say "the unemployment rate goes down when the investment-to-GDP ratio goes up," that sounds like I'm talking about a cause. I have to really work around it linguistically; I have to say something like "when we see a higher investment-to-GDP ratio, we also see a lower unemployment rate." We're so wired to look for causes, and this can mislead us and our readers. So here's another one.
This was also a story I worked on. A few years ago there was a professor who was looking at these figures and saying, well, the way to reduce domestic violence is to make sure women marry, and if you look at this graph, it is consistent with that story, because women who aren't married, which is the black and the blue here, have higher rates. Now, one of the things you find when you dig into this data is that this is the crime victimization survey, where you go to someone's house and ask the woman, in front of her husband, are you the victim of domestic violence? Not a very good survey for that; it might work okay for "have you ever been robbed at gunpoint," but not for this. Another thing that's going on is that there are all kinds of potential confounders. For example, women who are more economically secure are more likely to get married, and if they're more economically secure they're also probably more physically safe. So I wrote a story sort of beating up on that.
Here's some more. This is from a site called Spurious Correlations, where you can clearly see that eating cheese makes you more likely to die by becoming tangled in your bedsheets. And here are some more, as they say, spurious correlations: if we stopped funding the National Science Foundation, the temperature would stop going up, right? This seems very silly, but it is not a different case from the earlier ones; we're also just looking at a correlation there. It's just a question of whether we have a story that comes to mind, which is why a story is not good enough, and why we actually need to break down cause in a more sophisticated way. So here's the beginning of doing that.
I'm going to introduce you to a modeling formalism called causal graphs. There is actually a mathematical interpretation of this stuff: the nodes are variables, and an arrow from X to Y means, roughly, that the distribution of Y depends on the distribution of X. It says they're not independent, and it says something more than that: it says that if you change X, Y will change. This is the idea of cause. You can't capture cause with correlations, you've heard that many times, and there's a nice article on this topic titled "If correlation doesn't mean causation, then what does?" We'll get to that in a second.
But here's the idea: these are all the ways two variables can become correlated. Now, you can have intermediate variables, Z could influence X through some other variable, for example, but basically this is it; this is every possible way. So when you think you have a cause, one of the first diagnostic questions to ask yourself is: which case am I in? For example, we see that countries which have more guns have more deaths by firearms. I think many people come to this interpretation first: having guns causes deaths. But this is also a consistent interpretation, and so is this one. If we want to claim that guns cause firearm deaths, we have to rule out these other cases as well, and that's where things start to get difficult, because these cases are often very hard to rule out. Now, this case, where the correlation is due to chance, we're going to study extensively in the next class, which will be all about statistical significance and randomization. So in this class we're just going to talk about the other cases, in particular the confounding variable, which is that triangle one.
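To see how a confounder manufactures correlation, here's a tiny simulation. Z causally drives both X and Y; X has no effect on Y at all, yet X and Y come out strongly correlated, and the correlation vanishes once you hold Z fixed:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

z = rng.normal(size=n)           # the confounder
x = 2 * z + rng.normal(size=n)   # X is caused by Z
y = 3 * z + rng.normal(size=n)   # Y is caused by Z, not by X

print(np.corrcoef(x, y)[0, 1])   # strongly correlated, about 0.85

# Condition on Z: within a thin slice of Z, X and Y are uncorrelated.
band = np.abs(z) < 0.05
print(np.corrcoef(x[band], y[band])[0, 1])   # about 0
```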
This is one of my favorite charts, from the sadly departed OkCupid blog. It was called OkTrends, and it did data analysis on OkCupid dating. They shut it down when OkCupid got acquired; I think the last post was in 2014 or something, I don't know why. So, this was an analysis of response rates, women responding to a first message from a man, and the average response rate was thirty-something percent, it was a different time, and they saw that messages containing certain words had particularly higher-than-normal or particularly lower-than-normal response rates. So what's going on here? Why do we see this pattern in the data? Yeah: don't be creepy, right?
Right. However, this data was introduced to me by a data scientist who was giving a talk on causation, and she said, yeah, okay, but there's an obvious confounder here. By that I mean: what is some unknown factor that causes both the change in response rate and this particular type of language to be used? No, because creepiness is not observable here except through the language; all you have is the language. What I'm saying is, maybe it's not true that the language causes the response rate; maybe something causes both the language and the response rate. Interesting, okay, so it could be some other aspect of the language, but I'm actually looking for something a little different, which is this: if someone has a hot photo up, they're more likely to get comments on their appearance, and they're also likely to respond at a lower rate, because they get many messages.
That was the argument. Now, I don't know the ratio between these two effects in reality, and I've actually had people in class come up with several other confounders. So I'm not willing to look at this data and tell you for sure that there's no causal effect between commenting on someone's appearance and getting a low response rate; but we can say that it's some combination of these types of factors. The reason I show you this particular example is that the cultural narrative around what's happening here is so strong and so obvious that it makes it hard to even think of confounders. That's the challenge: when you have a strong correlation, can you step back and ask, what causes this that I'm not seeing?
This language of causal models is actually a whole system for talking about relationships between variables. The convention here is that pink means an observed variable and gray means an unobserved variable. When you're really getting into the social science questions that we get into in data journalism, you're going to want to draw out these kinds of maps just to orient yourself: how do these variables relate? In fact, there is a formal calculus of this stuff. You can talk about all of it in terms of distributions and variables, you can do rigorous statistics on it, and there are inference algorithms that try to take a bunch of observed data and draw the graph for you. So it's actually not quite true that you can't figure out cause from the data; you can try to infer it statistically. But for our purposes in journalism it's mostly a thinking tool: I want you to think through the causal structure of whatever you're reporting on.
And finally, I said I would define an experiment formally, and this is tied up in the formal definition of a cause. If you only talk about relationships between distributions, there's no cause there; cause comes from the idea of an intervention. The idea is that x3 causes x5 if, when I do something to x3, it forces the value of x5 to change. In the formalism this is called the do-operator, and the do-operator is what defines cause. For centuries people tried to talk about causes in terms of correlations and relations between distributions, and it's not sufficient: you actually have to talk about how changing one thing changes something else, and that's what an experiment is.
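Here's a toy illustration of the difference between observing and intervening, in a small simulated structural model (the coefficients and setup are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Structural model: Z -> X and Z -> Y, plus a real effect X -> Y of 0.5.
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = 0.5 * x + 2 * z + rng.normal(size=n)

# The observational slope of Y on X mixes the real effect with confounding.
print(np.polyfit(x, y, 1)[0])        # about 1.5, not 0.5

# Intervention do(X): set X by fiat, which breaks the Z -> X arrow.
x_do = rng.normal(size=n)            # randomized, independent of Z
y_do = 0.5 * x_do + 2 * z + rng.normal(size=n)
print(np.polyfit(x_do, y_do, 1)[0])  # about 0.5, the true causal effect
```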
An experiment is changing a variable and seeing whether another variable changes; that's also how you tell whether you have an experiment or observational data. There are several reasons to use this intervention definition, but one of its nice properties is that, when we talk about causes, the reason we usually care is that we want to intervene. If we think that poor childhood nutrition causes poor academic performance in high school years later, we want to find out if that's true because we want to know whether we should work on childhood nutrition. That's the point of doing the analysis, that's the point of writing the story, the accountability: we can show that when these kids aren't provided with lunches, they do worse for the next ten years. So the idea of an intervention not only resolves the statistical problem of the difference between correlation and cause, it answers exactly the question we want answered: if I change this, will it change something else? Because normally we want to know about cause because we want to control something.
Just as an example of this, here's a very nice little experiment that Facebook ran a few years ago. They were trying to figure out how much people reshare something after they see it; in other words, if I put something in front of you on the news feed, are you going to share it? And there's this whole complicated system of causes that is not observable. For example, if someone emails me a link, what the marketers like to call "dark social," I may post it on Facebook. The marketers basically just want everything to be Twitter, where the sharing is all visible, which is why they talk about it so much. So there are lots and lots of reasons why I might visit a page or post a link, and we don't get to observe them. But what we can do, if we're Facebook, is randomly not show someone a certain link. That's what they did: for each link they said, normally our algorithm would have, say, a hundred thousand people see this link, but we're going to remove it from half of them and see whether those people share that link less. It's a way to decouple the effect of this from the effect of all this other stuff. And this picture, although it's not drawn in the usual way, is a causal network, a causal diagram; you can see it's split into observable and unobservable variables, and again, drawing this out helps you think about what's actually going on in the relationships between variables.
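Here's a sketch of how you'd analyze that kind of randomized holdout; the data and column names are hypothetical, not Facebook's actual pipeline:

```python
import pandas as pd

# Hypothetical log: one row per (user, link), with a random 50/50
# assignment of whether the feed actually showed the link.
log = pd.read_csv("feed_exposure_log.csv")

shown_rate    = log.loc[log["shown"] == 1, "reshared"].mean()
held_out_rate = log.loc[log["shown"] == 0, "reshared"].mean()

# Because `shown` was randomized, every other cause of resharing
# (email, dark social, interests) is balanced across the two groups,
# so this difference estimates the causal effect of feed exposure.
print(shown_rate - held_out_rate)
```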
All right, thoughts on that? Oh, because they were experimenting with people? But they do that every day when they tweak their algorithm. I mean, not to dismiss the ethics of what they're doing, but the thing that makes the ethics potentially different here is that they're publishing research. Every social media company is changing what you see every day, that's their job, and a human editor is doing the same thing. Again, it's not that there aren't ethical issues here; it's just that this is a normal part of the job.
Yeah, the emotional contagion study. That was several years after this, and that was the one that became the big scandal and ignited this whole discussion about whether it's ethical for tech companies to experiment on us, and, jeez, I don't know, that is a complicated discussion. Well, so, before you can know that something is a confounding variable, you have to think of it. So the first step is, from your reporting, from your imagination, from your domain knowledge, you have to come up with a list of everything that could be a confounder. Then, if you can measure a variable, you can check for confounding directly, because you can look at the three-way correlations, and there are other methods too. To think about cause, you can always interview experts: anything interesting, somebody's going to have spent their whole life on it, so they'll know. But you have to imagine the confounder first, and that's often the hard part. It's very tempting to just say, oh, I've found this effect.
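If you can measure a suspected confounder, one simple check is to add it to a regression and watch what happens to the coefficient you care about. A sketch with invented column names, loosely echoing the domestic violence example above:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("survey.csv")  # hypothetical survey data

naive = smf.ols("violence_rate ~ married", data=df).fit()
adjusted = smf.ols(
    "violence_rate ~ married + economic_security", data=df
).fit()

# If the coefficient on `married` shrinks a lot once economic security
# is controlled for, much of the naive association was confounding.
print(naive.params["married"], adjusted.params["married"])
```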

An overview class on what to count, how to “interview the data,” statistical models, the uses of multi-variable regression in journalism, and correlation vs. causation

Contributor: Jonathan Stray