Algorithmic Accountability and AI Fairness

Frontiers of Computational Journalism week 6 - Quantitative Fairness
all right everyone so we're gonna be
continuing with our algorithmic
accountability work today
the main thing we're building up to
today is quantitative definitions of
fairness which are central to almost all
of the algorithmic accountability work
that's been done so far certainly
everything we saw last week of course
race and gender bias are extremely
popular algorithmic accountability
stories and we'll be looking a lot at
that but you can also imagine different
groups like rich and poor people we'll
also talk about more general
considerations of data quality feedback
loops from using algorithms how people
interpret the results of algorithms
we've been spending a lot of time on the core algorithms themselves; that's by design, because there aren't many places where you can get that type of discussion. We're going to sort of bring all of the other issues into
scope and you will find many other
discussions of some of these broader
issues so we're gonna start talking
about trying to measure fairness in
various ways there's some really
challenging problems when you work with
observational data going back to some
issues around causality we're gonna go
back to Machine Bias and talk about the fairness criteria that ProPublica used. The core of this class, I think,
is the impossibility theorems which put
limits on what types of fairness you can
have and we're gonna see actually over
20 definitions of fairness and and show
how they relate, and then we're going to talk about all of the stuff around the algorithm, including some very interesting case studies such as the child abuse screening hotline that we looked at earlier.
so let's start with analyzing
experimental data
you've probably seen a lot of this sort
of thing
This is an analysis of hiring decisions in the sciences; this is a huge paper on this whole question of what the career path looks like for women in science and what the disparities are. So what do you see here? What are we looking at?
yeah it's this sort of multistage
process which is very common almost
every case we will look at is this
multistage process including criminal
justice especially. What are the numbers suggesting? And equivalently, if you look at different data sets you'll see the reverse: you'll see the, how to put it, traditionally disadvantaged group looking worse at each stage instead of better, for example. So it could be one way or the other. What can we conclude from this type of data? What can we say about these processes? Can we say anything about bias for or against the group in question just from these types of statistics? Incomplete, how?
so we're getting immediately into a very
fundamental problem and all of this work
which is and I'm gonna draw a very
abstract model that applies in a lot of
different fields so we the basic idea is
that we have some process which is
supposed to output a yes or no decision
or let's say let some number of people
through and we have some pool of
applicants which in the simplest
possible case we're going to say break
down into well let's call it qualified
and unqualified in this case this could
also be you know are going to end up
committing a crime while on bail or not
it could be you know are going to pay
back the loan or not it's often some
sort of predictive variable it's often
some sort of ranking or measure of Merit
in some way but really what it is is
this sort of question of should should
they be let through this filter right so
there's a certain number of people it's
not always people it could be you know
applications for a particular project
right you know grant proposals or
something right and as much as possible
we would like this to only let the good
ones through whatever good ones means
and then we've got different groups
right so the situation is that's Group B
that's Group A and we've got some
qualified and unqualified people again
I'm using the words qualified and
unqualified
in very generic ways right in group a
and B and the idea is at the output we
have a certain number of people from
Group A and a certain number of people
from group B and we would like to let
mostly the qualified ones through right
So this box makes some sort of decision, and this setup shows you immediately a few different things. One is that just looking at the ratio of people from Group A to Group B doesn't really tell you what's going on in this box. We normally are interested in bias inside this box, but we can't get at it from just these outputs and these inputs. So the inputs
that we have on this this slide are we
have you know the number of people in
Group A we have the number of people in
Group B and then we have these numbers
here as well all right we just have the
counts and from those four numbers we
can't really determine what's happening
inside this box and whether it is biased
and here we're going to define bias as, you know, preferring qualified people of one group over the other. The reason we can't is because we don't know what the input looks like: we don't know what the fraction of qualified people in the different groups is, so we can't. And if you stare at this for a second, this should be clear: we can't disentangle the fraction of male and female applicants from the quality of the applicants. Maybe all of the women come from better schools, or had more NSF grant funding in their prior careers, and that's why more of them get through, or maybe not. So that's the basic setup for almost everything that we're going to do in this class.
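To make that concrete, here is a minimal sketch with hypothetical numbers (not from the paper): two worlds with identical application and hiring counts imply very different treatment of qualified applicants once you know the qualification rates, which is exactly what the four counts don't tell you.

```python
# A minimal sketch (hypothetical numbers): two "worlds" with identical
# application and hiring counts per group, but different qualification rates.
# For simplicity we assume everyone hired came from the qualified pool.

worlds = {
    "world 1": {"A": {"applied": 100, "qualified": 80, "hired": 20},
                "B": {"applied": 100, "qualified": 40, "hired": 10}},
    "world 2": {"A": {"applied": 100, "qualified": 40, "hired": 20},
                "B": {"applied": 100, "qualified": 80, "hired": 10}},
}

for world, groups in worlds.items():
    for g, d in groups.items():
        print(f"{world}, group {g}: hired {d['hired']}/{d['applied']} applicants "
              f"= {d['hired'] / d['qualified']:.0%} of the qualified ones")

# Both worlds show the same observable counts (applied, hired), but in world 1
# qualified applicants are selected at the same rate in both groups, while in
# world 2 qualified applicants from group A are selected four times as often.
```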
This is part of why answering these questions is very hard: we need to know, or control for, the characteristics of the applicants. Now, that in itself is a
particular idea of fairness that is
equality of opportunity you can also
defend a different principle of fairness
that says you know what we got to
normalize these counts, I don't care what the applicant pools look like, I want equality of outcome, not equality of opportunity. And that is not identical with, but sort of heads towards, the affirmative action model. So real considerations of fairness are much more complicated, but this is the idea and the
setup for this. There are a bunch of complications in just measuring stuff here. This is a very famous paper, and actually a lawsuit: what happened here was that the University of California, Berkeley was sued in the early 1970s on the basis that their admission rate for female students was a lot lower than for male students, and shortly thereafter this paper was published. This is an instance of something known as Simpson's paradox. What
had happened was that if you analyze the
overall numbers for the school the
admission rate for women was lower if
you analyzed the admission rate by
Department almost every department had a
higher admission rate for women so how
can that be how can you have the overall
admission rate for the entire school be
lower than for men, whereas for each department it's higher for women? Yeah, you're getting closer; it requires one other factor. So take a look at this diagram for a second. Each of these is a department: this is how many women apply to each department, this is how many are admitted, the size of the box is the size of the department, and they're showing a trend line here. The trend line is saying that where there are more women applicants, a smaller fraction of them are admitted. So basically the women are applying to the departments where there's a lower admission rate, generally. Now, there may still be a problem here
because what this turns out to be is, you know, the women apply to the humanities, where they have a huge number of applications and so they can only take 10%, and more men apply to the sciences, where basically they take everyone who applies, because only people who are qualified apply. So there are still complexities at play here that we might want to take care of, but what I am suggesting is that how you group people changes the numbers, and you can actually have overall different balances in the aggregate versus the individual units.
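Here is a minimal sketch of that aggregation effect with made-up numbers (not the actual Berkeley data): one group has the higher admission rate in every department but the lower rate overall, because it applies more to the harder department.

```python
# A minimal sketch with made-up numbers (not the actual Berkeley data):
# women have the higher admission rate in every department but the lower
# rate overall, because more of them apply to the harder department.

admissions = {
    # department: {group: (admitted, applied)}
    "dept A (easy)": {"men": (80, 100), "women": (18, 20)},
    "dept B (hard)": {"men": (5, 20),   "women": (30, 100)},
}

totals = {"men": [0, 0], "women": [0, 0]}
for dept, groups in admissions.items():
    for g, (adm, app) in groups.items():
        print(f"{dept:15s} {g:6s} admission rate = {adm / app:.0%}")
        totals[g][0] += adm
        totals[g][1] += app

for g, (adm, app) in totals.items():
    print(f"overall {g:6s} admission rate = {adm / app:.0%}")
```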
so I'm gradually building up some of the
complexities of asking this question
that's my goal here. Simpson's paradox appears in a bunch of other places too. Another form that Simpson's paradox takes is when you do a regression model
with only one variable you see a
positive trend when you add another
variable you control for another
variable you see a negative trend so an
example of that would be let's take our
Nike shoes example that we studied when
we looked at regression let's say you
look overall and you say: oh, the people who wear these shoes, if I just look at this one variable, did you wear the shoe, I see a positive trend for improved performance in the race. But when I add in this other variable of whether you were running in the morning or at night, because generally people have slightly better performance at night, you actually see a negative trend for those shoes, potentially. So as you add variables to your model, in this case adding the variable of what department they applied to, you can actually reverse the regression coefficients. And there's a famous example in healthcare where it looks like a drug was better for everyone together but was worse when analyzed at two different hospitals, and the answer to that puzzle is, well, the patients were different at the two different hospitals. Okay.
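A minimal sketch of that sign flip with synthetic data (not the real shoe study): the coefficient on the shoe variable is positive on its own and turns negative once time of day is controlled for, because shoe wearers mostly run at night and night runs are faster anyway.

```python
# Synthetic illustration: the "shoe" coefficient flips sign when we control
# for time of day, because shoes are mostly worn at night and night runs are
# faster, even though the shoes themselves slightly hurt performance here.

import numpy as np

rng = np.random.default_rng(0)
n = 1000
night = rng.integers(0, 2, n)                                # 0 = morning, 1 = night
shoe = (rng.random(n) < 0.2 + 0.6 * night).astype(float)     # shoes mostly worn at night
perf = 10 * night - 1.0 * shoe + rng.normal(0, 1, n)         # shoes actually hurt a bit

def ols_coefs(columns, y):
    X = np.column_stack([np.ones(len(y))] + list(columns))   # add an intercept column
    return np.linalg.lstsq(X, y, rcond=None)[0]

print("shoe only:          ", ols_coefs([shoe], perf)[1])        # positive coefficient
print("shoe + time of day: ", ols_coefs([shoe, night], perf)[1])  # negative coefficient
```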
This leads us right back to the question of controlling for other variables. I don't remember if we've looked at this one; this was a nice analysis by the Sarasota Herald-Tribune looking at sentencing lengths, again for
black and white defendants and you can't
just ask do white defendants get longer
or shorter sentences than black
defendants because again it's the the
input population problem right are are
they committing the same types of crimes
And so to do that they used this point scale, which was developed about ten years prior to try to even out sentencing. You get assigned points for previous criminal history, basically, and the reporters put people into buckets by how many points they had and then compared people only within the same bucket. So that's an attempt to
control for the differing
characteristics so right away if you're
gonna do these types of analyses and
look for bias inside this decision box
you really need to control for the
characteristics of each group so we
talked about qualified or goodness or
something right it's the characteristic
on which the decision should be based so
if you believe that the sentence length
should be based in part on your previous
criminal history then that needs to be
compared when you're comparing sentence
length
really what you'd like to do is an
experiment what's the difference between
experimental and observational data? Yeah, you control a variable when you're doing an experiment. We talked a bit in our discussion classes on auditing and modeling about the do operator: an experiment is where you get to force the value of a certain variable. So this is
an example where they took pitch decks
and they changed the names and
photographs of the people who were
supposedly the team behind that pitch
and they sent them in a pitch
competition to a bunch of VCs. And what do we get here? You want to be an attractive man; that gives you the highest likelihood of winning the pitch competition. What does this mean? What does "not significant" mean? Okay, so what they're saying is it's within statistical noise, according to the p = 0.05 definition, which we're going to take apart next class. Good question: I don't actually know, because it's been a while since I read this paper, but I can think of a few ways it could be done. So how would you judge attractiveness for this experiment? Okay, yeah, a much simpler approach is just to ask people to rate them, which of course will depend on who your raters are, but anyway, you can make this work. So
this is as clear as evidence of bias
ever gets when you have an experimental
manipulation of the variable that you
think people are biased on you can
pretty much just nail it down and prove
it right like this is yeah there's
gender bias in picture competitions
right as opposed to this case where it's
hard to say what's going on maybe the
women are just better scientists right
or vice versa depending on the data
you're looking at sometimes you get
what's called a natural experiment which
is where there's randomization or other
control introduced can happen in a lot
of different ways laws changing in some
states but not others is a classic
natural experiment a natural disaster
can be a natural experiment because it
forces everybody to rebuild at the same
time there's various ways this can
happen this is a nice piece by a Swiss
paper looking at deportation hearing
cases assigned to judges the judges are
appointed by different parties some of
which are more you know favorable to
immigration than others, and what they found is that, yes, judges appointed by certain parties have significantly higher rates of, I forget which way this is coded, denying or allowing the appeal; so this is denying the appeal. Yeah, that's right. So this is an experiment because of the randomization of cases across judges, and
randomization breaks the correlation
between the judge and the case so summed
over many cases you can get these types
of results and so this is another
experimental result which is which is
pretty good, right? Not quite as good as doing your own manipulation, but pretty good. Slides out of order; this is a description of Bias on the Bench, so this is what's going on in their methodology post, which of course
you all should be reading the
methodology posts on data stories and
writing them and what I want to draw
your attention to is this grouping
defendants committed the same crimes
according to the points they scored at
sentencing
So they used histogram buckets one point wide, basically, and then they evaluated the differences within each bucket. We've talked about adding a variable to a regression model to control for it; you don't even need to get that complicated, just binning people and comparing them by bin does basically the same thing. So very simple models can be used to control for these factors. And then they computed a weighted average of the difference in sentencing time across the buckets,
weighted by what by the way when they
say weighted average what do they mean
Yeah, exactly. Right, so if you have a thousand people in one group and two people in another group, you don't want to weight them evenly.
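Here is a rough sketch of that bucket-and-weight approach with hypothetical numbers (not the Herald-Tribune's actual data): group defendants by sentencing points, take the difference in mean sentence within each bucket, then average the differences weighted by bucket size.

```python
# Hypothetical data, purely illustrative: bucket by sentencing points, compare
# mean sentence length by race within each bucket, then take an average of the
# differences weighted by how many people fall in each bucket.

import pandas as pd

df = pd.DataFrame({
    "points":   [3, 3, 3, 3, 7, 7, 7, 7, 7, 7],
    "race":     ["black", "white"] * 5,
    "sentence": [12, 10, 14, 11, 30, 24, 28, 22, 32, 26],   # months
})

df["bucket"] = df["points"]          # already integer points; could also use pd.cut

per_bucket = df.pivot_table(index="bucket", columns="race",
                            values="sentence", aggfunc="mean")
per_bucket["diff"] = per_bucket["black"] - per_bucket["white"]
per_bucket["n"] = df.groupby("bucket").size()

weighted_diff = (per_bucket["diff"] * per_bucket["n"]).sum() / per_bucket["n"].sum()
print(per_bucket)
print("weighted average difference (months):", weighted_diff)
```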
This is a New York Times story about racial bias in New York state prisons, which is a common and complex kind of story. Here they're looking at disciplinary rates, how often someone gets disciplined in various ways, and what they're arguing is that the disciplinary rates are higher than the population rates for black inmates and of course accordingly lower for white inmates. Thinking about this model again, what's the issue
with interpreting this data so does this
show that the prison staff are harder on
the black population right so you've got
the same problem right you don't know
about the population right we can
certainly say that it is this data is
consistent with bias on the part of the
prison staff but we're gonna need more
to nail this down so they do have more
we don't have a controlled experiment
here we don't really have data on what
the prisoners are doing
So, actually, the disciplinary rates for the different inmate populations... interesting. So when you're looking at whether the response is proportionate, it's hard to make that argument. Interesting, that's a kind of indirect evidence. Yeah, I like it.
This is what the Times did. So here they are talking about this problem of differing populations: a greater share of black inmates are in prison for violent offenses, and minority inmates are disproportionately younger. All right, so they're addressing this issue of population difference; they know about it, and we're going to see this in the ProPublica data as well, it's the same issue, more criminal history, younger. But this sentence I'm disappointed with: "even after accounting for these elements." They don't say how they accounted for them. So I don't know if it's a statistical method or if they just don't believe it; I find that very unsatisfying and I hope you will all do better. But this is interesting: the disparities were greatest for infractions that gave discretion to officers, like disobeying an order.
so where there is more choice in the
disciplinary action then we see greater
disparity so here's what that looks like
right more more disparity for more
discretionary offenses so this is how
they attempted to compensate for this
problem of different populations and
it's a challenge right because you know
the rest of the article is a list of
horrifying incidents of racism inside
the prison system so call that the
qualitative data and you don't want to
discount the qualitative data the
qualitative data gives you a detailed
picture of what is happening at ground
level right that's the it's the only way
you can get that information it's you
you you you need to understand these
individual stories otherwise you don't
understand the mechanics of what's
happening also it's important for
storytelling right you gotta give
examples if you just give statistic
it's it's not going to have the same
effect on the audience they're not going
to remember it they're not going to
understand it but this is how they're
trying to make this case in a
statistical sense so this is the more
common case the natural experiment case
the natural control variable case are
rare so more often you're going to be
dealing with tricky methods to try to
understand what's happening and of
course you always have to consider the
possibility that there isn't bias
because otherwise why are you even
looking at the numbers if you already
know the answer. This is a similar one, again a New York Times piece, this is use of lethal force. Again we've got a control here; look at this phrasing, "in similar situations." This is from the Upshot, it's an Amanda Cox piece, and this person, who I don't know, went through and coded for what the situation was. So here's what they did: when they say "similar situation" they have some sort of coding system for reading these reports and then sorting them into categories, and so the criteria are things like, did they fire a weapon, were they handcuffed, and so on. We could take this apart and look at how they tried to compensate, but you're always going to need
something right because normally when
you look at these statistics all you
have are the raw rates and you don't
have the population characteristics so
that's the major challenge
So, switching tracks: I'm trying to give you a bunch of pieces and then we're going to assemble them. Why is there interest in using algorithmic decisions at all, for criminal justice, hiring, child abuse hotline screening, lending, all of the stuff we've seen? Why are we doing this? To increase efficiency, okay: if you've got to screen a lot of resumes, maybe you can make a computer do it; that's what Amazon was trying to do, for example.
Right, and it does standardize the process, make sure everyone is given the same sort of treatment, at least insofar as the same data is held on them, which is not assured. Why else?
yeah so this is an interesting argument
I've found that some people feel that
algorithmic decisions are a black box
they're complicated some machine
learning models not explainable I mean
how do you explain why a convolutional
neural net made a choice right I mean
right some people feel that human
decisions are a black box although you
can ask a human why they made a choice, although the answer they give you may not be the reason they actually made the choice. So it's not a simple
comparison and I think people tend to
have a human or algorithmic bias and
tend to imagine that one of these is
more explainable than another why else
are we using algorithms
so that's an efficiency argument again
what I'm driving towards is prediction
right so in general across a wide range
of fields simple statistical methods and
I'm not talking neural nets I'm talking
regression with two or three variables
tend to be better predictors so let's
see what that looks like this is a paper
about using algorithms for risk scoring
so this is the machine bias case and
there's been a bunch of work like this
but this is a significant one, in that they use actual data on judge decisions, from New York State I think; I don't think it says here, but I think they do New York and a national data set. And they show that basically judges are not super great at predicting who's going to be re-arrested after being released, and that, never mind the 137-question questionnaire that COMPAS has, just based on data available at the time, case history and demographic stuff, you can improve prediction so much that you can put a lot fewer people in jail, or reduce crimes by keeping the right people in jail, and do that while also reducing racial disparities. Now, what exactly they mean by reducing racial disparities, we'd have to look at the paper and take that apart. But basically human judges
are not great at deciding who is a
safety risk and you know I would argue
that this is not surprising and the
reason I think it's not surprising that humans aren't very good at this is because they
don't generally track their decisions
and the outcomes right it's hard to get
good at prediction if you're not
tracking your predictions and they and
the true outcomes and almost nobody does
this I'm sure there are a few judges
that do it but it's not it's not a
standard thing right so that that so
possibly if all judges tracked their
predictions over time then they could
close this predictive gap but at the
moment we know that basically linear
regression is going to do so much better
that there there is a huge potential
public benefit and this is why people
are pushing for this in general I said
simple statistical models are better at
prediction this doesn't mean that
statistical models are good often it
just means that the problem is very hard
right so your model may only get 30
percent accuracy which is not great
accuracy but if the humans are only
getting 25 it looks better
I'm gonna put a lot of caveats on this
accuracy is not the only measurement
that you could be interested in this
doesn't account for fairness metrics
even if the humans and the statistical
method have the same accuracy score the
errors could be distributed differently
there's all kinds of potential problems
with this but here here's the beginning
of a table from this paper and what I
want you to see here is that this is a
lot of different fields right this holds
among a lot of different things so this
is education this is medical this is
criminal justice there's some more
medical there and so across many many
different domains generally simple
statistical models what used to be
called actuarial models perform better
and so there's
I would argue that your your prior your
sort of baseline suspicion is that the
statistical model is going to give
better predictions okay
so better predictions have consequences
if you can if you know that someone's
not going to commit a crime you can
release them from jail if you know that
someone is going to succeed in this job
you can hire them. So accuracy is important, but it's not the only criterion. And I think this is really the point: we're critiquing this Machine Bias piece very heavily in this class, but it was still an extremely important piece of journalism, because it opened up this whole conversation and spurred the research that we are talking about today.
And just to revisit this, this is what we saw. This is the actual raw data; my friend Stephanie, who has been working on this stuff, made this chart when she was trying to understand it. You can see that there are eight numbers here, and you actually can't do this comparison without all eight numbers: what you really have is two confusion matrices, and you need a confusion matrix to generate the false positive rate, and you're comparing false positive rates for two different groups, so you actually need all eight numbers to get this. This
is kind of the equivalent of dimensional
analysis does everybody run into
dimensional analysis this is where you
try this is used mostly in physics where
you you sort of work out what the
formula has to be based on the units so
if you know you're gonna end up with a
speed and you have a distance and a time
well you have to divide the distance by
the time because you know miles per hour
right. So this is sort of the equivalent here: I showed you the risk ratio, where you need four numbers to compare differential risk; you're going to need eight numbers to compare differential accuracy metrics. So this is what we had. Note that the base rate of re-arrest for black defendants is much higher, and we saw that in our data as well: basically the left side of the chart tends to be higher than the right side of the chart. The behavior really could be different, or it could also be bias in the arrest data, which we're going to talk about, but differing base rates turn out to be a major
problem. So this is sort of how there's this fight, and you can read several rounds of back and forth between ProPublica and Northpointe, who created the COMPAS scoring system. There's like a hundred pages of this stuff, but if you really boil it down, this is what the argument is. ProPublica is saying: hey, your false positive rate is much higher for black defendants than white defendants, which I'm reproducing here by labeling these numbers; they're just calculations off the confusion matrix. And Northpointe is saying: yeah, but the positive predictive value is almost the same, in fact it even slightly favors black defendants, because they are slightly more likely to be re-arrested at the same level of risk score. One way to think about this is that you're reversing the conditional probability. The false positive rate is: of the people who were not really re-arrested, how many were labeled high-risk. The positive predictive value is: of the people who were labeled high-risk, how many were really re-arrested. You should all know from basic
probability theory that reversing the
conditional probability changes the
number and one way to see that or to
talk about that is to use that diagram
we had
a couple of classes ago the quadrant
diagram remember this this quadrant
diagram. So here it is. We're going to say re-arrested, not re-arrested, and then the way this is set up is high risk, low risk, and then we've got some dots; for simplicity we're just going to look at one type of dot. We have some people in each of these cells, and in practice most people are not re-arrested. What we would like, basically this is a confusion matrix, is for the diagonals to have most of the dots, because we either want people to be labeled low risk and not re-arrested, or high risk and re-arrested; the more the dots sit on the diagonals, the better the classifier is. Now let's do the false positive rate. Let's label these cells A, B, C, D; I don't know if this is the same notation I used in the slide, so I apologize for that. The false positive rate is: high risk, given that they were not re-arrested. So if you're not re-arrested, that's here, that's the denominator, and the numerator is there: so B over (B plus D). The positive predictive value, and here's where I really wish I had one more color, I guess I'll use the green, is: re-arrested, given that you were high risk. Given that you're high risk means this is the denominator, and that's the numerator: so A over (A plus B).
okay so the denominator is different
you're actually measuring as a fraction
of a different population of people the
fraction you're measuring is on the
right side of that bar in the
conditional probability conveniently
it's sort of looks like a denominator
And so the Northpointe argument is that you should be using all the people who are labeled high-risk as the denominator. And this actually corresponds to some pretty standard definitions from the psychological testing literature. The psychological testing literature, of course, is concerned about fairness across race as well, for IQ and anxiety and school scores, and they've been thinking about this for the last hundred years, and they normally use a definition that is more or less equivalent to Northpointe's. Okay.
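A minimal sketch of those two quantities, using the cell labels from the board (A = high risk and re-arrested, B = high risk and not re-arrested, C = low risk and re-arrested, D = low risk and not re-arrested). The counts are made up, not the actual ProPublica numbers; they're chosen so that PPV is equal across groups while FPR is not.

```python
# Cell labels from the board: A = high risk & re-arrested, B = high risk &
# not re-arrested, C = low risk & re-arrested, D = low risk & not re-arrested.
# Hypothetical counts, chosen so the groups share a PPV but not an FPR.

def rates(A, B, C, D):
    fpr = B / (B + D)                  # of those NOT re-arrested, share labeled high risk
    ppv = A / (A + B)                  # of those labeled high risk, share re-arrested
    base_rate = (A + C) / (A + B + C + D)
    return fpr, ppv, base_rate

groups = {"group 1": (300, 200, 150, 350),
          "group 2": (150, 100, 150, 600)}

for group, cells in groups.items():
    fpr, ppv, p = rates(*cells)
    print(f"{group}: FPR={fpr:.2f}  PPV={ppv:.2f}  base rate={p:.2f}")
```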
I've been really tearing through this
I'm just trying to get us sort of up to speed. Questions or comments on this so far?
All right, moving on then. The positive predictive value, you can basically think of it as: what's the probability you're going to be re-arrested for any given score? So that was the binary version of the problem, where you're just labeled high or low risk above some threshold; this is the continuous version of the problem, where we have a continuous score and it corresponds to some probability of recidivism, which here means being re-arrested. And as we saw earlier, this is more or less linear, which means that this score can be transformed into a probability; it's basically a probability. That is a property known as calibration of a predictor. It's a very important property, and it's more or less the same between black and white defendants, and actually slightly favors black defendants at the high end here. So this is the continuous version of saying that you have equal positive predictive values.
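A minimal sketch (synthetic data, not the COMPAS scores) of how you would check calibration in practice: bin the score and compare the observed re-arrest rate in each bin for each group.

```python
# Synthetic data, purely illustrative: bin the risk score and compare the
# observed re-arrest rate per bin for each group.

import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 5000
df = pd.DataFrame({
    "group": rng.choice(["black", "white"], n),
    "score": rng.uniform(0, 1, n),
})
# calibrated by construction: P(re-arrest | score) = score for both groups
df["rearrested"] = (rng.random(n) < df["score"]).astype(int)

df["bin"] = pd.cut(df["score"], bins=np.linspace(0, 1, 11))
print(df.pivot_table(index="bin", columns="group",
                     values="rearrested", aggfunc="mean"))
# a calibrated predictor shows roughly the same observed rate in each bin for
# both groups, and that rate rises with the score
```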
And here is more of ProPublica's analysis. These are the curves for low, medium, and high risk: how many people were re-arrested. The curve goes down when someone is re-arrested; this is a survival curve, and you can see there's clear separation between the categories, so the people labeled as higher risk really are higher risk, these red ones. And they're about the same between black and white, and if anything the black survival curves are lower, which means that the predictor is a little more optimistic, so it's going to let more people go free. So this chart and that chart are basically the same information; this one just introduces the time element. Now here's the challenge. If
time element now here's the challenge if
you take your basic confusion matrix
formulas and do a little rearranging you
can get this formula which expresses the
false positive rate
in terms of the positive predictive
value and P which is the base rate so P
in our case is how many people are
actually reading and here's the issue if
you set the false positive rates to be
equal and you and the and P isn't equal
right so the base rate of Rio a stiffer
--nt then the positive predictive value
cannot be equal or vice-versa so this is
our first impossibility theorem again
you can get this just by starting with
the formulas on Wikipedia and trying to
derive FP R in terms of these things so
there's no no magic here and it says
that you can't have both types of
fairness at the same type kind of time
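The formula being referred to is presumably the standard identity relating these confusion-matrix quantities (this is a reconstruction, following Chouldechova's 2017 write-up of the COMPAS debate, not copied from the slide):

$$\mathrm{FPR} \;=\; \frac{p}{1-p}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\cdot \mathrm{TPR}$$

where $p$ is the group's base rate of re-arrest and TPR is the true positive rate. If two groups have different base rates $p$, they cannot simultaneously have equal FPR, equal TPR, and equal PPV: hold any two of those equal across the groups and the identity forces the third to differ.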
You can't have both "the score means the same thing for black and white defendants," which is calibration, or equal PPV, and "we have the same false positive rate for black and white defendants," at the same time. So you have to pick one, or find some balance between them, and your homework for today is going to be playing around with a classifier to try to balance them. It turns out you can get almost as good as the COMPAS predictor by just using age and criminal history; you can get nearly the same performance, which is interesting. It in itself says that this 137-question thing is actually unnecessary, and that most of the information is contained in a small number of variables, which has explanatory consequences as well. All
right so then the question becomes do we
want to keep people in jail just based
on their age because that's kind of what
we're doing if we use these predictors
so that's our first impossibility result
and this is how they put it in that
paper
I like this phrase, "difficult stakeholder choices," and this: "the goal of complete race or gender neutrality is unachievable," at least if you define it in terms of those two definitions of fairness. As it turns out, there are many, many definitions of fairness, and one of the things that has become a little clearer in the last two years of research is that they basically boil down to a small number, all of which are mutually exclusive. So let's set this up. Here's
the notation we're going to use for the rest of this. The idea is we're assigning a score to each person, who has some attributes X and is in some class A; you can think of A as race or gender or whatever it is. We give everybody a score between 0 and 1, which, if it is a calibrated predictor, you can think of as a probability, more or less. Then we make this decision C, this yes or no, by thresholding the score. What we're trying to match is Y, so if we had a perfect predictor then we'd get C = Y. Okay, so that's the setup; sometimes the decision is also called D. Now there is a wonderful list
that has been compiled these are all of
the definitions quantitative definitions
of fairness that have been proposed in
the literature so there's a couple dozen
of these and they operate in different
ways and if you go through this and you
stare at it long enough you can actually
translate each of these into words so
for example, let's take the hiring example. Statistical parity: the probability that the decision equals 1 given one group is the same as the probability given that A equals something else. So that says that we hire the same number of men and women; that's a definition of fairness that's based on equality of outcomes.
Whereas, let's see, equal positive predictive values says that, given that someone is in a particular class and we decided yes, the probability that they actually were yes, because we're talking about predictors here, is the same between classes. So this is: if we said that you were high-risk, the probability that you were actually re-arrested is equal between classes; this is the calibration definition. And so on. Here's the one that ProPublica was using, based on true positive rates. You can
also say for example let's make the area
under curve the same so AUC is this this
metric for classifiers that we looked at
last time you can say let's make that
the same that's sometimes called
accuracy equity calibration is the
continuous version of equal PPV and then
you get into causality definitions and
causality as we've discussed is not
something that is available purely from
the statistics you have to use a do
operator you have to imagine that you
can actually change a variable and so
counterfactual fairness says that the
decision I would have been hired or not
hired regardless of whether I was black
or white that's what this says the
decision for one group is equal the
decision for a different group for that
person. So if you stare at this long enough, you can actually translate each of these into words, and these are all definitions of fairness that people have proposed at various points. So now our task is to untangle these a little bit. And by the way, I highly recommend this reference; it's linked from the syllabus. It has this stuff and a lot of discussion and interesting references, including a recap of the ProPublica versus Northpointe argument. But we would like to simplify this a little bit, because this is really complicated. Okay, questions before we go on? Let's pause here; which one don't you understand? Uh-huh, so let me load up that
paper so what it's saying here is that
so look at the structure of this it's
saying that the probability of y equals
1 if we know they're in the group and
they have a particular score is equal to
the probability that y equals 1 if we
only know the score in other words what
it's saying is that the probability that
y equals 1 for that score where in this
case y equals 1 means that child will be
placed in a foster home is the same
whether or not you take the group into
account yeah so what what they're saying
is so take a look at this again what
that's saying is if we take the overall
so for each score so it's conditioned on
s so we're conditioning on this number
which means we're setting that number as
the denominator right so we pick one of
these and we say the overall level at
which y equals 1 which is the vertical
axis here is the same as the individual
levels which means that these two bars
have to be the same height so that's
what they're saying
Okay, yeah, so there are some that are widely used. Calibration is historically what the psychological testing world, which might be closest to what we're doing, has used as a fairness criterion, but you can make arguments for other ones, and we'll get there in a minute.
So, these are all the definitions. Moritz Hardt has proposed that we talk about mostly three conditions, and if you stare at that list of definitions long enough you can probably convince yourself that almost all of them boil down to these three. There are some that aren't quite there, because the causality definitions involve counterfactuals, which aren't fully captured in statements about statistical independence, but most of them boil down to these three things. The slides will be posted, by
the way. So let me try to translate these into English. C is independent of A: the decision doesn't depend on what group you're in, so we hire the same number of men and women. C is independent of A conditional on Y: if we know what the true outcome is, say we know that you're going to be re-arrested, then the classification doesn't depend on what race you are; this is separation, which is the ProPublica version. And the last one: whether or not you're actually going to be re-arrested doesn't depend on race, if we know whether you've been classified as risky or not; that's sufficiency. So pretty much all of these definitions boil down to one of these three.
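Here is a minimal sketch (toy data, not the ProPublica file) of how you could check each of the three conditions empirically, as conditional frequencies of the decision C, the outcome Y, and the group A.

```python
# Toy data, purely illustrative: empirical checks of the three conditions for
# a binary decision C, outcome Y, and group A.

import pandas as pd

df = pd.DataFrame({
    "A": ["a", "a", "a", "a", "b", "b", "b", "b"] * 25,
    "C": [1, 1, 0, 0, 1, 0, 0, 0] * 25,
    "Y": [1, 0, 0, 0, 1, 1, 0, 0] * 25,
})

# Independence: P(C=1 | A) should not depend on A
print(df.groupby("A")["C"].mean())

# Sufficiency (calibration / equal PPV): P(Y=1 | C=1, A) should not depend on A
print(df[df.C == 1].groupby("A")["Y"].mean())

# Separation (equal error rates): P(C=1 | Y=0, A) and P(C=1 | Y=1, A)
# should not depend on A
print(df[df.Y == 0].groupby("A")["C"].mean())   # false positive rate per group
print(df[df.Y == 1].groupby("A")["C"].mean())   # true positive rate per group
```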
We're going to look at these in a little more detail, and what I've done is I've tried to tie them into legal and moral principles, because I think actually all of these definitions are defensible in different ways, and in many ways they recapitulate
the arguments that we are having about
affirmative action right now this is
part of why this is a hard conversation
and oh yeah you can prove that you can
only have one at a time although you can
have you can approximate multiple ones
at a time and that may be a good answer
so this demographic parity idea this is
equality of outcome there's different
ways of phrasing this so the most
concise way is to say that the
classification is independent of a and
so for example you could make a
classifier that chooses the ten best
scoring men and women in each group each
of these definitions has drawbacks
there's usually a way to cheat and the
way to cheat is extremely important if
you're talking about training machine
learning algorithms because they will
cheat if they can if you make a neural
network to evaluate resumes and you
insist that the same number of men and women are hired, but you don't have the same number of qualified men and women, then for the gender that has fewer qualified applicants this just allows the classifier to pick randomly; it puts no constraint on that. I've said "choose the ten best," but if all you require is equal numbers, then the neural network might just get lazy and pick random ones; all it knows is that you want the same number. So thinking about what a criterion does not constrain is as important as thinking about what it does constrain.
Also, the perfect predictor, the one which always guesses C = Y, is not allowed under this definition if the base rates differ. So if you had perfect prediction of who's going to be re-arrested, but you knew that black defendants are going to be re-arrested twice as often, then the perfect predictor would violate demographic parity, because you don't label the same number of people as high-risk. Nevertheless, there are legal and moral principles that embody this idea. Calibration: yeah, this is a classic
definition this is the one which was I
think everybody building these things
thought that this was the one you wanted
until this exploded into the into public
view with the ProPublica story it's
still a very important definition for
variety reasons the reason it's called
sufficiency is what it's saying is that
if you know the score right how they're
ranked right the the risk score that
they got the outcome is independent of
the group so in other words the risk
score is sufficient to guess at the
outcome so this symbol of statistical
independence so what this means in terms
of actual numbers is this right this is
the positive predictive value the
probability that the true outcome is 1
given that we said that it was going to
be 1 is the same between each group and
most machine learning algorithms are
designed to do this the accuracy most
machine learning algorithms are scored
on like f score or accuracy or something
and most of those metrics that we use in
training algorithms are optimizing the
algorithms will produce this condition
so it's it's natural in a certain sense
However, you have at least two problems. One is that you may not want to treat everybody identically: remember, US law around protected classes doesn't say treat everybody the same. In particular, think about the Americans with Disabilities Act: you can't discriminate against someone with a disability, and you have to build that ramp; that's not treating everybody the same, because we are concerned with equality of outcome in such cases. So it's entirely defensible to say that this is not the definition that you want in certain cases. And then you can also get different accuracy between groups. Now, here's why this has been called sufficiency; here's the idea in terms of a causal diagram. You're
saying the outcome is correlated with the score, but there's no direct correlation with which group you're in except through the score. In other words, another way to think about this diagram is that all of the information about what group they're in is contained in the score: if we know the score R, we know everything that matters about A. That's why this is called sufficiency. Sorry, not separation, sufficiency. The last one is separation, or equal error rates, and here the idea is we want to equalize the accuracy between groups.
I use "accuracy" very loosely; accuracy is a technical term for confusion matrices, but here it could mean false positive rates, could mean AUC, could mean a lot of things. But if we have this condition then we get all of those: it implies, for example, equal false positive rate, true positive rate, accuracy, and AUC; all of those things are implied by this. This is why these definitions are useful, because they fold in many, many possibilities. And again, this is the flipping of the conditional probability; it's what we had here, just reversed, it's
different denominators the simplest way
to build a classifier that does this is
to use a different threshold for each
group and that's what you're actually
going to do on the homework and there's
actually some theoretical work showing
that taking a classifier which is
calibrated and setting different thresholds on each group produces a classifier which obviously satisfies this condition but is still mostly calibrated: if the original predictor was calibrated, then the modified predictor will be close to calibrated. Of course you can make that more formal. And this is what you're going to do in your homework: you're going to take a calibrated predictor for the ProPublica data, and you're going to modify it by using different thresholds for black and white defendants, and you're going to tell me what happens to the calibration and the accuracy.
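A minimal sketch of that per-group thresholding exercise (simulated scores, not the actual homework data or solution): the scores are calibrated by construction, the two groups have different base rates, and moving the thresholds apart pulls the false positive rates together while the PPVs drift apart.

```python
# Simulated, calibrated scores with different base rates per group; compare a
# single threshold against per-group thresholds.

import numpy as np

rng = np.random.default_rng(1)

def make_group(n, a, b):
    score = rng.beta(a, b, n)                  # calibrated by construction:
    y = (rng.random(n) < score).astype(int)    # P(Y=1 | score) = score
    return score, y

def metrics(score, y, threshold):
    c = (score >= threshold).astype(int)
    fpr = c[y == 0].mean()                     # P(C=1 | Y=0)
    ppv = y[c == 1].mean()                     # P(Y=1 | C=1)
    return fpr, ppv

groups = {"group 1": make_group(20000, 3, 2),   # higher base rate
          "group 2": make_group(20000, 2, 3)}   # lower base rate

for name, thresholds in [("same threshold", {"group 1": 0.5, "group 2": 0.5}),
                         ("per-group thresholds", {"group 1": 0.6, "group 2": 0.45})]:
    print(name)
    for g, (score, y) in groups.items():
        fpr, ppv = metrics(score, y, thresholds[g])
        print(f"  {g}: FPR={fpr:.2f}  PPV={ppv:.2f}")
```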
A potential legal roadblock to using this is that if you use different thresholds, and in fact most of the methods which achieve this have to use group membership explicitly, you end up explicitly building your algorithm to depend on, say, race, and that may or may not be legal in various cases. We're seeing court cases right now, the Yale and Harvard cases, where the plaintiffs are Asian students who feel that they're being discriminated against by the policies that encourage black and Latino students, but in other cases there's no case law saying whether it's legal.
However, again, you can argue that it's an equality of opportunity thing: everybody should have the same opportunity, or, you know, you shouldn't have a classifier that performs much worse on one group than another in terms
of error rate. This is called separation because, again, you've got this causal structure, but here A and Y are flipped: it says that all of the information about A is contained in the outcome variable, or conversely, that the output depends both on what group they're in and what score they have, which implies equal error rates; let's see, why does that imply equal error rates, I think I'd have to think about that for a minute.
this is where I ended up the last time
this these impossibility theorems that
show that you can only have some
definitions of fairness at the same time
and not others part of the reason I
assigned you the homework that I did is
we know that numerically it's impossible
to satisfy all these criteria at the
same time but the question remains how
big is the trade-off and you know in
some cases it might be a very large
trade-off that you have to make in some
cases it may be only a few percent so
that's why I'm asking you to look at the ProPublica data, to try to understand what the size of the trade-off is in a real-world case. And there are some
scholars who argue that actually
these trade-offs are not as big as they
might appear and so maybe all this
argument over definitions of fairness is
overblown having said that we do have
reason to believe that minority groups
will in general have lower accuracy if
you build a classifier. So what is the general argument for that? Yeah, bingo. There are sort of two problems: one is just less training data; another is that the minority population may actually require a different classifier than the majority population, they might actually be different in some way. So say you're trying to predict whether students will graduate from a college program; whatever the minority group is in a department, they may have different features that predict that. But as this picture is
trying to show you the classifier is
going to fit the majority group because
that's where most of the data is so
there's a bunch of techniques that
computer scientists are putting together
to try to deal with these issues the
simplest thing you can do is just throw
more training data in for the minority
group you can also add group membership
as a feature right if you know whether
you're red or green in this example you
can change that line of discrimination
between those two positions, and a good classifier algorithm should actually learn that automatically. Again, that can be controversial as well; it depends how you feel about the idea of using group membership, so let's say race or gender, explicitly as a feature in your classifier, but it's one of the ways to solve this problem.
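A minimal sketch of the "check it" advice (synthetic data and scikit-learn, purely illustrative): a single classifier fit to both groups looks fine on overall accuracy while doing much worse on the smaller group, which you only notice if you evaluate per group.

```python
# Synthetic data: the minority group follows a different rule, so the fitted
# classifier tracks the majority and the gap only shows up per group.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n_major, n_minor = 5000, 300
X_major = rng.normal(0, 1, (n_major, 1)); y_major = (X_major[:, 0] > 0).astype(int)
X_minor = rng.normal(0, 1, (n_minor, 1)); y_minor = (X_minor[:, 0] < 0.5).astype(int)

X = np.vstack([X_major, X_minor])
y = np.concatenate([y_major, y_minor])
group = np.array(["major"] * n_major + ["minor"] * n_minor)

clf = LogisticRegression().fit(X, y)
pred = clf.predict(X)

print("overall accuracy:", (pred == y).mean())
for g in ["major", "minor"]:
    print(g, "accuracy:", (pred[group == g] == y[group == g]).mean())
```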
and I suppose the only really general
thing you can say is that you should
probably be checking if this is the sort
of thing you care about right so you get
the classic examples of you know facial
recognition doesn't work as well on
black women well maybe you should be
evaluating the accuracy on black women
as opposed to just that single accuracy
for the entire training set okay so that
sort of concludes the theoretical work
we're going to do on this problem of
classification and bias the final part
of this is in some ways the most
important part we're going to zoom out
and look at the entire system around the
algorithm right so we've been very
focused on the algorithmics of this
stuff because I feel like it's really
important for you to get a solid
introduction to what's now called fat ml
fairness accountability and transparency
and machine learning but if you're
actually going to use this stuff it's
always embedded in a system so let's
talk about that so I'm gonna draw a
little box and we're going to talk about
the sort of risk score case again right
so we're gonna call this we'll just call
it the classifier and
it takes in inputs about the individual
case and it takes in inputs about you
know the training data right so let's
say this is training data and this is
individual data and it produces a risk
score great so this is sort of where
we've been looking so far now we're
going to talk about everything else so
when you actually use this it's embedded
in as much larger context so let's draw
it the whole picture here so if you're
using algorithmic risk scores to try to
decide who gets bail if in a pre-trial
sense of course they're also used in
post trial you know who gets parole
right what else do you have in this
system yeah
yeah okay so the data comes from
somewhere so how do we represent that in
this picture? Where does it come from? What did COMPAS train the classifier on? Yeah, in this case, arrest data. So the
arrest data comes from the police
department. So let's see if I can draw a police badge here... oh wait, I'm drawing too high, yeah, thank you. Okay, how about over here? Let me draw the badge. Okay, so
here's the Police Department right so
they produce the arrest data in fact I'm
not going to even say training data I'm
going to say arrest data. And how do they produce it? Where does it come from? Did they make it up? How does this work? It sounds like a trick question but it's not; I'm asking you to think about the entire process that generates this data. They make an arrest, okay.
so somewhere in here we get the actual
arrests so an arrest is an interaction
between a police officer and a defendant
who, let's draw them in handcuffs here, there you go, sometimes I can spell. So, you know, that depends on the behavior of both of these people; it
depends on where the police officers are
assigned to patrol it depends on the
priorities of the police department in
terms of what sorts of crimes they are
interested in and you know there's no
guarantee that that's unbiased and we're
gonna we're about to look at biased in
arrests data for example and we can also
have a feedback loop here where if there
is some algorithm that is deciding where
to send the police you can send the
police where people have previously been
arrested now you only get an arrest if
you have the police there so if you
continually send the police where the
previous arrests are you may not
actually be capturing crime rates so much as police patrolling patterns in your arrest data. So we can get this feedback loop where this information feeds back into the data-generating system; we'll talk about these types of feedback loops, or the potential for them, in certain cases.
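A toy sketch of that feedback loop (purely hypothetical numbers, not a model of any real department): two neighborhoods with identical true crime rates, but patrols are allocated in proportion to past arrests, so an initial difference in arrest counts sustains itself.

```python
# Two neighborhoods with identical true crime rates; patrols follow past
# arrests, so the arrest data ends up reflecting patrol allocation.

import numpy as np

rng = np.random.default_rng(0)
true_crime_rate = {"north": 0.1, "south": 0.1}   # identical by construction
arrests = {"north": 60, "south": 40}             # only the starting counts differ

for week in range(20):
    total_arrests = sum(arrests.values())
    for hood, rate in true_crime_rate.items():
        patrol_hours = 1000 * arrests[hood] / total_arrests      # patrols follow past arrests
        arrests[hood] += rng.binomial(int(patrol_hours), rate)   # arrests require police presence

print(arrests)
# The arrest counts stay skewed toward "north" (roughly 60/40) even though the
# underlying crime rates are identical: the data reflects patrolling, not crime.
```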
Okay, so then on the other side, what happens to the risk score? Yeah, okay, so the judge sees it in the pre-trial case. So this goes to the
judge, who, let's see, can I make a robed and wigged figure here, a necklace, what's a judge necklace called, okay, all right, so here we go, and they've got robes as well. Okay, so there's the judge, a judge of indeterminate gender. All right, so the judge looks at this and
makes some sort of determination or you
know in the more general cases right so
in some of the cases we're gonna look at
it could be someone looks at it could be
someone who decides whether something
gets a loan or whether they're hired or
whether to do an investigation into
potential child abuse which we'll look
at in a little bit you know one one
critique of the idea of using these
algorithms in the system is that maybe
whoever uses the output will blindly
follow the algorithm right maybe they
will give up any sort of case-by-case
decision making which is kind of what we
hope it's kind of how we hope the
justice system works right if it's your
case you want your case to be considered
individually not mechanically or you
know as part of a group but of course if
you're building any sort of predictor
prediction is impossible without
considering other people like you
right that's what prediction is is
looking at history and saying you're
similar to history in this way that's so
that's a more generalized argument
against
prediction and maybe your case has some
special circumstances and maybe we give
up our human decision-making ability by
looking at the algorithm or maybe the
output from the classifier is perfectly
fair by whatever metric or metrics we
use and the judge ignores it selectively
there is actually evidence that this is
happening we'll look at a few cases and
I actually suspect this is the common case: people ignore the algorithmic output, more or less, and make biased decisions anyway. And then ultimately you get some final decision,
right
we're probably missing many many parts
of this diagram we haven't drawn the
legal system the the entire
institutional framework around the judge
right so we could draw this other box
here and we'll call this the courts
right. The judge has certain incentives: maybe there's political pressure to lock more people up, maybe the incentive is for them to either follow or ignore the algorithmic risk score. The overall point that I'm trying to make is that the
classifier is one part of a larger
system, and the ten-dollar word that people use to describe this is a "sociotechnical system"; you've probably heard this. And fairness,
whatever fairness might mean is going to
be a property of the system not the
algorithm
All right, before we move on, can anyone think of other parts of the system that are missing? I can think of at least a few more. Any thoughts, what else is missing in this picture? Yeah, sure,
so eventually you expand outwards and
you include the whole government sure
yeah
yeah why not another important thing
we're missing is the creator of the
software right so that is and this is an
area of concern that is a private
company which normally is keeping trade
secrets right that algorithm is not
public; the COMPAS algorithm is still not public. We know a lot about it because of the extensive reverse engineering, but they've never actually said how it's built. Now, as you'll see in your homework, it can't actually be that fancy; it's probably not much more than a logistic regression, it's nothing groundbreaking, because you can get the same predictive accuracy with just a few features in a logistic regression. So it's not really that exciting,
but nonetheless it is secret. And it's secret in the case of lending decisions: FICO scores are secret, for example, the FICO score algorithm is secret. It's secret in the case of hiring,
because you don't know what the company
is using DNA testing or DNA matching you
know does the DNA evidence show they
were at the scene of the crime there's a
project right now to try to obtain that
code and audit it in New York City
there's a talk about that next week
actually so we can include let's see
here
I'm going to try to draw a, you know, a company building; there's the front door,
and they created this algorithm and they
have certain incentives as well right
they ultimately they have to make an
algorithm that they can sell so that
depends on you know somewhere in here
there's a person who chose that
algorithm decided that they were going
to buy it you know how do you convince
them to buy an algorithm without
revealing what the algorithm is this
just sort of goes on and on and again the property of fairness
is emergent from the entire system okay
so while it's very important to understand how this box works and we spent what two classes on this that is not the whole story that is just the beginning the complete analysis has to ask how all of
this fits together in this much larger
system okay so what we are going to do
next is expand on that by looking at a
bunch of important factors here so the
first one is data quality this is one of
the main critiques of use of algorithmic
techniques in criminal justice cases you know well the data is biased so every
one of these things we're going to talk
about is the subject of intense research
and debate I'm just going to try to
point you in the direction of what's
going on for certain types of crimes we
have very good evidence that arrest
rates are biased tremendously so for
example arrests for marijuana possession
you know the the rates of usage among
black and white people are
you know have been nearly equal for a long time but the arrest rates are nowhere near equal so if you build your
predictor based on whether someone is
going to be arrested for marijuana it's
going to be heavily skewed to think that
black defendants are more likely to be
arrested which is true they are more
likely to be arrested but to the extent
that we want our predictor to predict
crimes and not arrests we have a serious
problem with using this type of data so
this is another one of these you know annoying real-world problems which is that we often can't get the data we want the data we want is crimes the data we have is arrests so we often have
to use a proxy which may or may not be
accurate so this type of data is
sometimes used by critics of algorithmic
methods to just dismiss them entirely
it's you know the data is biased we
can't use this it makes no sense the
best evidence I can find suggests oh
hang on before we go into that so why is
the arrest rate higher in neighborhoods
with more minority population this was
the New York Times trying to tackle a similar issue which is why you have more calls to 311 and 911 and the police explanation the
NYPD's explanation for this is that
there's just more people who report and
there's various theories about why this
might be true so for example maybe you
don't trust your landlord to do anything
in these neighborhoods because you have
crappy landlords and so maybe you call the city instead and this story was
the Times trying to test the police department's explanation and it ends up being this exercise in GIS right you have to align the police precincts with the census geography because you have the demographic data at the census block level and the police data at the precinct level so you have to mess around with this in a GIS tool and it's a little bit of a pain I actually have a blog post about how to do this
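As a sketch of what that GIS step can look like, here is roughly how you might apportion census block demographics onto police precincts with geopandas. The file names, column names, and the even-population assumption are all mine, not the Times' actual method.

```python
# hypothetical sketch: area-weighted interpolation of census block demographics
# onto police precincts (file and column names are assumptions)
import geopandas as gpd

blocks = gpd.read_file("census_blocks.shp")        # columns: pop_total, pop_black, geometry
precincts = gpd.read_file("police_precincts.shp")  # columns: precinct_id, geometry

# put both layers in a projected coordinate system so areas are meaningful
blocks = blocks.to_crs(epsg=2263)      # NY State Plane, since this example is NYC
precincts = precincts.to_crs(epsg=2263)
blocks["block_area"] = blocks.geometry.area

# intersect the layers and split each block's population by area of overlap,
# which assumes people are spread evenly within a block
pieces = gpd.overlay(blocks, precincts[["precinct_id", "geometry"]], how="intersection")
share = pieces.geometry.area / pieces["block_area"]
pieces["pop_total_w"] = pieces["pop_total"] * share
pieces["pop_black_w"] = pieces["pop_black"] * share

# aggregate up to precinct level, ready to join with call and arrest counts
demo = pieces.groupby("precinct_id")[["pop_total_w", "pop_black_w"]].sum()
demo["pct_black"] = demo["pop_black_w"] / demo["pop_total_w"]
print(demo.head())
```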
and sure enough the minority neighborhoods have more calls about marijuana but what they found is that when you control for the rate of calls you have higher arrest rates in majority black neighborhoods all right so this gets into this issue of controlling for variables right this is always
an issue when you're trying to do these
types of bias stories is that you're
trying to make comparisons between
different populations or different
neighborhoods and all kinds of things
are changing at the same time you really
you would really like to make statements
if you're going to talk about fairness
about you know everything is the same
except the race of the person and that in fact comes out in our
definitions of fairness right so all of
this stuff that we were looking at
earlier if you look at for example the
counterfactual fairness definition right
you're saying well what would be the
decision the decision would be the same
if the race was different right well
implicitly what that's saying is
everything else is the same you very
rarely have everything else the same and
you can kind of pick apart these definitions and there's always an implicit everything-else-being-equal assumption what economists would call ceteris paribus
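As an aside, the counterfactual fairness condition from Kusner et al. is usually written roughly as below; the notation here is mine, not the lecture's. A predictor is counterfactually fair if flipping the protected attribute in the causal model leaves the distribution of decisions unchanged.

```latex
% counterfactual fairness, roughly as in Kusner et al. (2017):
% for an individual with attributes X = x and protected attribute A = a,
% the prediction \hat{Y} has the same distribution whether A is left as
% observed (a) or counterfactually set to a', holding the background
% variables U fixed -- the formal version of "everything else being equal"
P\left(\hat{Y}_{A \leftarrow a}(U) = y \mid X = x, A = a\right)
  = P\left(\hat{Y}_{A \leftarrow a'}(U) = y \mid X = x, A = a\right)
```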
but we do not have that in reality because you know race determines a lot of things in American life and so you have to do some
sort of control maneuver and so that's
what they're doing here right they're
saying okay well if we control if we
look at the neighborhoods where there's
the same number of calls what we find is
that we still have a greater arrest rate
so there's some translation of calls to
the police into arrests and that process
that turns one into the other
varies based on the demographics of the
neighborhood and the cops explanation is
that there's more crime in these
neighborhoods and so there's more police
and unfortunately The Times didn't
publish their analysis or I don't have
their notebooks so I don't know how they
controlled and I I feel like this story
is incomplete right I feel like if
you're really gonna go into this and use
the data and try to answer these
questions then you gotta publish the
data because otherwise it's it's just
too hard to know what they controlled
for and what they didn't and and so
forth but you can see by sort of looking
at this sketching out this analysis that
this is this is gonna be a hard thing to
do right this is not as simple as just
being like there's more there's more
arrests in black neighborhoods it's like
well okay but what else is going on that
differs there's lots of things varying
at the same time
yeah I mean yeah I think what happened
is they sort of got about as far as they
could get with what what they had and
that's okay right yeah I agree it's it's
it's bothersome it is a fact of
journalism that sometimes you just have
to publish right you're not going to
have all the answers to all the
questions before you can run the story
especially when the issues are very
complex as they are here so you know but
they could have run a follow up
they could have published the data in
code yep yeah I don't think there's any
discussion of you know community
advocacy organizations or yeah right
yeah and then you would get it right but
I don't know I mean that's useful but
then you get into sort of a he-said-she-said thing right you can't really unravel this
story without going down the rabbit hole
of you know the whole system right like
that's that's always the problem here
we're trying to understand systems yeah
yep you can talk to experts you know
they're I guarantee you there's people
at this university who have spent the
last 15 years on this question right so
we find them and talk to them final
project time oh and the last thing I
want to point out in this is they dropped the police data into four buckets based on the percentage of precinct residents who are black or Hispanic all right that's this controlling-for idea
all right it's very similar to adding the variable in a multivariable regression it's the same sort of concept but you know there's questions of robustness are you gonna get the same answer if you use different buckets what about if you use buckets versus a multivariable linear regression
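To make the bucketing versus regression comparison concrete, here is a hedged sketch in Python. The data file and column names are invented; the point is just that both approaches should tell a consistent story if the finding is robust.

```python
# hypothetical sketch: two ways to "control for" neighborhood demographics
# (the file and column names are invented for illustration)
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("precincts.csv")   # columns: arrests, calls, pct_black
df["arrests_per_call"] = df["arrests"] / df["calls"]

# approach 1: bucket precincts by minority share, compare arrest rates per call
df["bucket"] = pd.qcut(df["pct_black"], 4, labels=["Q1", "Q2", "Q3", "Q4"])
print(df.groupby("bucket", observed=True)["arrests_per_call"].mean())

# approach 2: put the same variable into a regression as a covariate
model = smf.ols("arrests_per_call ~ pct_black", data=df).fit()
print(model.params["pct_black"])    # positive coefficient = more arrests per call as pct_black rises
```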
you remember the Nike shoes story how they just sort of beat that claim to death by using different methods to control it would have been nice to see that here and you know it's not that I don't believe that there is a racial bias problem in policing that happens often enough and in enough places that I think it's very likely that there is a bias problem in policing but I
don't think they've really nailed it
down here okay so that is the case for
nonviolent offenses where we know
there's severe disparities in how
offenses translate into arrests for
violent crimes the situation is probably
better from the point of view of
fairness so if you read this carefully
basically what we have are three data
sets on the racial composition of
violent offenders that show more or less the same thing and there's other I suppose theoretical reasons to suspect that how do I phrase this the translation from offenses so actual crimes committed into arrests is going to be more
consistent
for example violent crimes tend to be
more often reported it's easier to hide
smoking a joint than a body right so we
can imagine that this translates better like most issues
in this course I'm not trying to
make a strong argument that you should
believe this because all of these are
very complicated this you know this is
the opinion of these scholars and
obviously there are other people who
have looked into this issue all I'm
really trying to illustrate is that this
question of whether the raw data on
crime statistics is biased and in what
ways it's going to depend on the details
right and it's probably even going to
depend on the jurisdiction just because
violent crimes are accurately reported
in one city or in general these are national surveys right just because on average it works
nationally doesn't mean that in any
particular city that's gonna be the case
that's another issue in this the systems
view just because an algorithm worked in
Louisville doesn't mean it's going to
work in Fresno because the judges are
different the laws are different data is
different it's very hard to say anything
in general we're not talking about laws
of nature here we're talking about as
the sociologists and historians say
we're talking about contingent
situations right every place has its own
history it developed in different ways
and I mean we start going to different
countries and it's gonna be different
again so I guess the only thing that I
can say in general is that there's no
easy answer to this stuff you're gonna
have to do it on a case-by-case basis
then we get to the issue sort of outside of that diagram which is what happens when the result is produced so here's a very simple example we've seen this before I like this example
because the algorithm is trivial right
the algorithm is you add up
a number of points for each previous
interaction with the justice system so
you know each previous misdemeanor
conviction each previous felony
conviction and so forth
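Just to make the triviality concrete, here is a toy version of a point-based history score; the point values are invented, not the actual schedule.

```python
# toy version of a point-based criminal history score (point values are invented)
POINTS = {
    "prior_felony": 10,
    "prior_misdemeanor": 3,
    "prior_juvenile": 2,
}

def history_score(record):
    """record maps an event type to the count of prior events of that type."""
    return sum(POINTS.get(event, 0) * count for event, count in record.items())

print(history_score({"prior_felony": 1, "prior_misdemeanor": 2}))  # 16
```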
and this point system was again designed
to guide judges in sentencing and the idea was that it would be colorblind right you're just looking at
criminal history here and so you know
black white doesn't matter if they had a
felony conviction for armed robbery you
get ten points or whatever it is and
again you have to control for the number of points when looking at sentence length so they did that by bucketing people and again you've got this huge disparity in sentencing which persists across that control variable and persists across different crimes as
well very simple example of judges
ignoring a let's just go ahead and call
it an objective measure and doing their
own thing risk scores are even more
complicated they're a little more
abstract they're a little more opaque
right at least we know the the algorithm
that generated this when you have
something like you know a random forest
where you can't really explain why any
particular person got a particular score
it's probably even easier to dismiss
here's another example of scores being
used in practice so you've probably
heard about Chicago's predictive
policing push and there's different
pieces of that this is a piece which is
called the SSL which I think stands for strategic subjects list we'll see more of it in a second the
idea here is that you can predict who's
going to be involved in a crime and then
intervene
beforehand to reduce crime rates right
that's the theory
here's the model this is a really nice
report by the way this RAND report the big drawback is it only really studied sort of phase one of this project and not everything they've done since but it's a really good case study in how you think about these things
interestingly they predict not only
whether somebody will be involved in a
crime but whether they will be the victim of a crime and actually the output variable is both combined which I find kind of disturbing right the prediction is will you shoot someone or will you get shot right it's the same
variable which seems suspicious to me
but I suppose if you wanted to argue for
that you'd say well it tends to be the
same people who are involved in these
types of things I don't know man anyway
here's the idea is that you intervene
before these bad things can happen and
so here's the evaluation of whether it
worked and so they looked at you know
whether this reduced the homicide rate
basically with regression they use something called a vector autoregression which is used for time series you try to predict the next value in a sequence from the previous values in several sequences so for example you try to predict the next month's homicides from all of the history of homicides up to this point plus the history of other types of crime on the assumption that there might be some correlation between different types of crime that's one of the methods they used anyway so you can do these types of regressions and look at the coefficients and that helps you detrend this stuff because obviously there's a seasonal spike here there's always more crime in summer
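If you want to see what that looks like in code, here is a minimal sketch using the VAR class in statsmodels; the data file and columns are hypothetical, and this is not the RAND team's actual analysis.

```python
# hypothetical sketch of a vector autoregression on monthly crime counts
# (the CSV and its columns are invented; this is not the RAND analysis)
import pandas as pd
from statsmodels.tsa.api import VAR

crime = pd.read_csv("monthly_crime.csv", index_col="month", parse_dates=True)
series = crime[["homicides", "shootings", "robberies"]]

model = VAR(series)
results = model.fit(maxlags=12)   # each month modeled from the previous 12 months
print(results.summary())          # cross-series coefficients capture how crime types co-move

# one-step-ahead forecast from the most recent 12 months of data
print(results.forecast(series.values[-12:], steps=1))
```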
and here's what you get being placed on this high-risk list does not reduce the likelihood of being a victim or being arrested for murder but it does make you much more likely to be arrested for a shooting so what's
going on there
what can we guess about how this
algorithmically generated list of high
risk people is being used in practice
it's sort of the other way around what
what they think is happening and so the
researchers did all this data analysis
but also a lot of like sitting in on
Police Department meetings and
interviews it's a really nice example of
combined quantitative qualitative
research it's one of the reasons I
recommend you read this paper and what
they think is going on is basically the
cops are ignoring the output of this
predictor except when someone is shot
they use this list as kind of the usual
suspects list
yeah so it's not really being used the
idea was that you intervene before
there's a crime right but it's not
really being used that way it's being
used to arrest people after there's been
a shooting sorry not a murder so I think we're talking about arrests for non-fatal shootings and in terms of algorithmic accountability journalism this was Chicago magazine who FOIA'd the list and a bunch of information basically a bunch of features for each person on the list and trying to figure out what gets you on the list
now this is not really what gets you on
the list
you kind of want this to be regression coefficients but they're not this is the features of people on the list so it's what are the features given that you're on the list a regression coefficient is the other conditional probability right what makes you get on the list given these features you know what's the probability that you're on the list given these features whereas what we're looking at here is what is the probability that you have these features given that you're on the list so they're really not the same
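A toy calculation with invented numbers shows how different the two conditionals can be.

```python
# toy numbers (invented) to contrast P(feature | on list) with P(on list | feature)
n_city = 1_000_000          # adults in the city
n_list = 400_000            # people on the list
p_gang_given_list = 0.30    # 30% of people on the list are labeled gang members
p_gang_given_not = 0.02     # 2% of everyone else

p_list = n_list / n_city
p_gang = p_gang_given_list * p_list + p_gang_given_not * (1 - p_list)

# Bayes' rule gives the other conditional, the one a regression coefficient is closer to
p_list_given_gang = p_gang_given_list * p_list / p_gang
print(round(p_list_given_gang, 2))   # ~0.91, nothing like the 0.30 you would read off the chart
```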
but maybe we can get a little bit of
insight and say that well you know it
picks gang members and people with drug
arrests I do think it's interesting and
I think it is important to do this type of journalism where you know for example this list was not publicly available there was a little bit of a battle with the police department to get
this list of names and it was important
that somebody try to figure that out
nonetheless they are pressing on with this the strategic subjects list is only one
portion of their predictive policing
system so they've got this whole system
of sort of decision-making centers and
they do a lot of like predictive
policing stuff right
so the standard predictive policing
thing is you use the history of crimes
to try to predict where new ones are
going to be and you send officers there
and this is the most direct illustration
of a place where that can be a feedback
loop because again you have arrest
history not crime history and so if you
send the officers to the places where
they've previously made arrests well by
gosh you're gonna have more arrests in
those places where there's officers
right you don't you don't get any
arrests where there are no police so you
got to be careful about that stuff and
that's one of the major criticisms but
then again you know for violent crimes in particular if you believe that offenses translate into arrests in a much more balanced way than for misdemeanors then maybe it
still makes sense in any case they're
investing in this further now I will
give you my opinion a little bit on sort of the whole algorithmic risk score criminal justice thing one of
the criticisms that has been expressed
in the sort of research literature and
by the organizations which are concerned
about the use of risk scores and other
types of technology in criminal justice
is that the risk scores will be blindly
followed by the judges and it will
eliminate human decision-making right
where we're just going to give up our
autonomy or individual consideration of
individual cases in the Florida case
which you've seen
in I believe it was Kentucky where risk
scores came in and in terms of another
study which we'll see in a second of
decisions by New York state judges as
compared to a very simple classifier by
and large I mean it's still early but
every case I know of where we looked at
what human judges did with risk scores
they kind of ignored them by and large
it is not the case that humans are
slavishly following the algorithmic
results it instead seems to be more common
that they ignore them and do their
standard thing anyway in other words if
you are concerned about the fairness of
the overall system my general sense or
if you like my prior would be that what
we need is people following the
algorithmic results more closely not
less closely this is a hard thing to
nail down but that's where I see the
weight of the evidence there's no doubt
some incredible reporting to be done on
this topic okay so we've been talking
about criminal justice a lot criminal
justice is you know very it's been very
well covered it's very consequential
it's very controversial it's a rabbit
hole you can really really go down this
this hole and what I will recommend to
you if you have an interest in this
stuff is a recent piece by Stephanie
Wykstra at Vox we've studied the algorithmic portion of this extensively and this will be linked from the syllabus but this whole issue of risk scores in pretrial release has really come to the forefront as one
component of bail reform and if you want
to understand not just the algorithm but
that whole thing this is as good a
primer as you are going to get this is
a massive amount of context on how these risk scores are used and she mentions risk scores in here as risk assessment so you know she's got one section on risk assessment which is really good and has lots of links so Vox is in DC yeah but you know she's a freelancer you can work for them from anywhere in particular it has some
interesting links to evidence for this
idea that I just mentioned that
basically judges decisions are a lot
more biased than the algorithmic
decisions but to really get into this
it's a very complicated issue and I
haven't really even touched on the legal
points about this but if you want to
there's so much good reporting to be
done on this and this would be a great
place to start I want to talk about two
other domains where algorithms with social impact are widely used you may have heard of a new type of lending which rather than basing decisions on traditional FICO scores and so forth you know there's this whole traditional process for how banks decide there
are these startups which say you know we
don't care about FICO scores we're just
going to collect all of this other types
of data and we think we can make much
more accurate predictions about who's
going to default and part of the way
this is pitched
is that this is going to extend access
to credit so as it says here one of the
problems that people have getting access
to credit is they don't have a credit
history and there are groups which might
be perfectly credit or has worthy but
don't have any way to prove that such as
recent graduates and immigrants right so
you come from another country you have
no American credit history you can't get
a mortgage to buy a house our car or
student loans or whatever it is you want
right so they make this social argument
about why they should be allowed to use
all of this personal information to
decide who gets alone of course it's our
job to evaluate these types of arguments
and see if there's anything behind them
right does this make sense this paper which the article references is a very interesting paper it looks at the
effect of introducing better prediction
for defaults on the American mortgage
market so housing loans which is of
course heavily regulated and there's a
long history of unfair discrimination
against people of color you may have
heard the term redlining this is where
banks used to literally draw a red line
on the map and say okay we don't give mortgages in these neighborhoods right and it's a beautiful analysis
in part because it's done by two people at one of the Federal Reserve banks and so
they have access to this huge data set
of loans so they have the underlying
data about who gets loans and who
doesn't and what their demographics are
and so they can analyze what would be
the effect of better prediction
retrospectively
because they know historically who paid
off their mortgages and so forth and the
point that they make here is I think a
really good framing for this type of
reporting which is that if you have
better prediction of outcomes there's
always going to be some winners and
losers the winners are going to be those
people who should have gotten a loan but
didn't the losers are going to be those
people who were going to default but
there was no way to know that and so
they got a loan anyway now it's not entirely clear whether giving somebody a loan if you know that
they're going to be unable to pay it
back is a good thing that is another
very complicated question and there's lots of good reporting ProPublica has a whole series on debt which I recommend you
read but we can say that this general
frame will hold for any predictive
technology as you get better predictions
some people are gonna win and some
people are gonna lose and that's a way
into a story right you can always use
that frame to start your reporting I
mean this is a classic journalism frame
right the new law has passed who are the
winners and losers right this is basic
important stuff and in terms of fairness
or distributional characteristics what
they say is that they do think that
better prediction through machine
learning in the mortgage market is going
to give more people access to credit so
that's good right and as they say
marginally reduces disparity in acceptance rates across race and ethnic
groups however they find that so this is
a very sophisticated analysis it looks
at not only who gets a loan but at what
interest rate and under their model what
happens is that black and Hispanic
borrowers have a much wider spread of
rates so the disparities come about not
so much because some people get
mortgages and some don't but more that
some people are gonna see their rates go
way up because the machine learning
model predicts that they're much riskier
and some people see their rates go down
so the the disparity actually ends up
being within-group in their models here
which is a type of disparity we don't
normally talk about
I've also got a notebook that's linked
from my Lede course and I can add that to our links which analyzes data from Lending Club which is one of these new style algorithmic lending organizations they actually publish their complete loan portfolio as this huge data set with ten million rows including the history of each loan and so I used that to do a very simple analysis that asks the question with improved prediction what happens so what I do in that analysis is I compare a very simple machine learning
model with a slightly more accurate
machine learning model because I give it
more features and ask the question what
is the income distribution of the people
who get loans under the inaccurate model
and the more accurate model now because
of the way I set this up I don't claim
that this particular test says anything
interesting but perhaps it illustrates
the methodology right if you can get the
data you can do this type of analysis
and say okay does this extend credit to
poorer people for example it's hard to
get the data but it is out there yeah
you'd have to talk to them right see if
they're willing to share the data or
whether they've got someone who's
studying the ethical implications of
this technology already right and
remember ethics or you know these distributional issues are not the only lens right because the
people doing this are doing it because
they genuinely believe that there's a
net social benefit so you can ask
questions like so is there really a
benefit right as compared to you know
what are we doing now and what are the
ways this is better and what are the
ways this is worse right so with any change you can always do this winners and losers frame and the winners and losers may be distributed across race or gender they may be distributed across income they may be
distributed across countries you know
this really penalizes foreigners right
there there could be all sorts of you
know this is really hard on the coal
industry like whatever right there's
there's all kinds of ways that the the
benefits can be unequally distributed
we sold it uh-huh yeah well you know the
way that society regulates this stuff
it's a really interesting question so I
mean law enforcement is fascinated by
money laundering right that's their job
as you've seen journalism also does a
lot of money laundering investigation
these days the accounts of the world
don't add up huge amounts of public
money are being diverted away from
people who need it you know the amount
of stolen money flowing out of Africa is
greater than the amount of foreign aid flowing into it right so if
we could solve the problem of diverted
funds that would more than double
you know the aid budgets so it's a big
issue and journalism and law enforcement
are in some sense working on the same
problem and they have different
advantages and different and
disadvantages so law enforcement can
subpoena information they can use legal powers to compel the production of
data which we can't on the other hand we
get to work in public right and as the
saying goes the crime is that it's legal
so law enforcement can only get involved
when there is a potential crime
journalism can get involved when
something bad is happening
that's not even illegal so there's
various advantages and disadvantages and one of the really
interesting questions is how these
various mechanisms interlock and relate
to each other also journalism is not
capable of punishing anybody which means
people can often talk more freely to
journalists right if you can guarantee
anonymity for somebody then
maybe they will tell you about all these
crimes that they may be implicated in
right whereas they would not talk to law
enforcement because their lawyer would
advise them not to incriminate
themselves so very complicated
relationship between the two
this is a really interesting case this
is another one which has been heavily
covered and the syllabus includes two
links one is to a paper by the people
who built this system and one is
by Virginia Eubanks who wrote Automating Inequality and covers this case in great detail the conversation that they're having is fascinating to me they're sort of talking past each other so I urge you to read both the paper and one of Eubanks's articles
carefully and try to read them against
each other so this is from this paper
that just came out a month ago on how
the system works and what they're doing
with it and we talked about feedback
loops very briefly I find this fascinating okay so here's what the tool was meant to do someone calls this child abuse reporting hotline and after this call the person who took the call has to make a choice do you screen this call out or screen it in meaning start an investigation and follow up so let me show you what this looks like oh everything's so small oh this is a paper we'll talk about later today
so this is a flow chart of the process
and this is really a wonderful paper
this is really state of the art in the
algorithmic fairness discussion but
anyway here's here's a little bit of
this this this context
oh interesting apparently I can't make this any bigger so here's how this goes you get this call and really I can't zoom this that's nice okay you get this call and we've always had this step where the screener the human assigns risk and safety ratings which are sort of this risk score and then now the algorithm
weighs in and then eventually they have
to say okay we're not going to follow up
or we are going to follow up and after
the investigation either there's some
sort of intervention or there's no intervention so the purpose
of the algorithm is to help the
screeners decide which branch to take
here right do we do an investigation or
not and the thing it's designed to do is
help the screeners integrate all of the
previous case history for this child
right so were there previous calls were there previous calls for the siblings
right because that's a that's a risk
indicator as well if you've had a report
for someone else in that family or
someone else in that household notably
the algorithm does not take into account
the content of the current call because
the algorithm runs before the contents
of the current call are entered so it's
just trying to summarize the history and
what they're worried about is that if
the people working in the child
protection services agency use this risk score so the model is trained on
who gets placed in foster care within
two years right that's the endpoint that
they use but if the risk score
influences the decision to investigate
and ultimately has influence on the
decision about whether someone is placed
in foster care then you get the same feedback loop you get from sending the cops out to the places where you've made arrests before right so
there there's the possibility of this
feedback loop and to try to ameliorate
that part of the design of the system is
that they train it on an outcome that
the people who use the algorithm don't
have much influence over so in
particular it's where did I go okay the
decision to place a child in foster care
is made by a different group of people
than the people screening calls and I
mean that doesn't necessarily break the
feedback loop but it certainly is an
attempt to weaken it a little bit so the
issue here is as they're saying workers
affect the outcome predicted by the
model e.g. substantiate cases that the model tells them they likely should then we get to this type of ROC analysis this
is like the classic fairness metric
stuff that we were looking at and so we
have race specific ROC curves and the
way they assign the risk scores is by
percentiles which means that you
actually get a different threshold for
different races because it just looks at you know are they within the top 5% most risky the top 10% the top 15% so you actually get very different risk score thresholds for different races this is sort of their old
algorithm this is their new algorithm
the old algorithm was a logistic
regression which they originally chose
because they wanted it to be explainable
but they found that nobody really cared about the explanations anyway of why the algorithm produced a score and that a random forest had much better predictions so they've switched to a random forest and so you can see
the false positive rate actually differs in the old case substantially between races and a little less now and it's just the case that they have better prediction for certain groups so for example I have no idea why but for the unknown race where they don't know the race the AUC is much higher I'm really not sure what's going on there and that seems like it's probably an interesting artifact
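Computing those race-specific curves is straightforward if you can get scores and outcomes; here is a hedged sketch with scikit-learn, where the data file and column names are hypothetical, not the Allegheny County data.

```python
# hypothetical sketch: race-specific ROC curves and AUC for a risk score
# (the CSV and column names are invented for illustration)
import pandas as pd
from sklearn.metrics import roc_curve, roc_auc_score

df = pd.read_csv("screening_scores.csv")   # columns: score, placed_in_care, race

for race, group in df.groupby("race"):
    fpr, tpr, _ = roc_curve(group["placed_in_care"], group["score"])
    auc = roc_auc_score(group["placed_in_care"], group["score"])
    print(f"{race}: AUC = {auc:.3f}")
```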
part of the reason they want to do this
is because there's already bias in the
system so we know for a long time black
children are disproportionately more
likely to be placed in foster care and
there's a bunch of different reasons all
right there's a bunch of different
stages in this pipeline where this could
be happening they know that introducing
this tool remember it's this whole
contextual problem only changes some of
these issues but if they can build a classifier which has this sort of fairness property of let's say calibration so that the same score means the same thing regardless of race and those scores are used in the same way
to decide whether to screen in a call
that is to do an investigation that has
the potential to reduce bias at certain
stages of this pipeline although not
others of course unfortunately and this is another data point for the people-ignore-the-risk-scores pattern they analyzed this a bunch of different ways this is
probably the most informative graph so
the x-axis here is the risk score and because this is a calibrated algorithm you can imagine the actual risk of being placed in foster care is increasing monotonically with risk score but the decision to screen in which is this purple line so the decision to follow up with an investigation does not increase very quickly with this
risk score if the decision to screen in
was based only on the probability that
someone is going to end up in foster
care this would be a much steeper curve
in particular and this I find fascinating the highest three scores are a mandatory screen-in right so we should actually not have any green for these last three bars if people were following the procedures which theoretically they're required to follow which means a manager is overriding the algorithm or overriding the policy I should say in about 25 percent of cases
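The kind of check behind that graph is easy to reproduce if you have the case-level data; again, the file and column names here are assumptions.

```python
# hypothetical sketch: screen-in rate and actual placement rate by risk score,
# plus the override rate at mandatory-screen scores (columns are invented)
import pandas as pd

calls = pd.read_csv("hotline_calls.csv")   # columns: risk_score (1-20), screened_in, placed_in_care

by_score = calls.groupby("risk_score")[["screened_in", "placed_in_care"]].mean()
print(by_score)   # placement rate should rise steeply with score; screen-in rate much less so

# at the mandatory-screen scores (18-20 here, an assumption), every call should be screened in
mandatory = calls[calls["risk_score"] >= 18]
print("override rate:", 1 - mandatory["screened_in"].mean())
```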
so basically they built this tool and calibrated it for all of these fairness properties and then they deployed it into production and the workers kind of ignore it so we're not really going to get fair outcomes as a result of this deployment
now the broader criticism and this is
what I read as Eubanks's criticism is that
it doesn't matter if you make a better
screening tool because ultimately the
problem is that we are not funding child
protective services right these people
are very poorly paid we don't have the
volume of foster homes we need and
changing the accuracy with which we can
deploy resources is not going to solve
the problem of having far too few
resources so in that sense talking about
better screening algorithms is talking
about the wrong thing and is a
distraction from the broader problem of
why are we not adequately providing the
resources for these children to thrive
so that's a different type of criticism
which is outside the algorithmic frame
and outside even the the systemic frame
right she's just saying none of this
matters if it's not funded okay so
thoughts on all of this
sure right you know if we funded them to the levels that they need then would this help I think you would have to
look very carefully at how people end up
using these scores right if these things
are really more accurate and more
predictive than the human decisions
and by the way another one of the charts
in this paper shows that there's very
low correlation between the human
assigned risk scores and the
algorithmically assigned risk scores and
because we know that the algorithmic risk scores are extracting about as much information as possible from the data that's available that means the human scores are actually not very good right they're not good
predictors so the scope for improvement
is probably really big
if these scores could be used and
integrated into the system in the right
way sure and that's gonna be their next paper they say at the end that they're investigating why people are overriding and maybe there's a good reason yeah people might not trust the
scores they might not they might
consider that it's taking away their
autonomy there's all sorts of things
that could be going on here yep
yeah I'm still waiting for it I think there's a great story to be done here for someone who wants to get deep into this you know we've seen the stories which are
like these risk scores are gonna be
biased they're just not going to solve
the problem I think in this cultural moment because of the ProPublica discussion and the FAT ML community and sort of where we've been the general
perception is that these algorithmic
methods are biased I think the reality
is a lot more complicated and we haven't seen an in-depth smart and balanced article because of course the people doing this ask why are we bothering at all with all of this technology and the reason is there is
reason to believe that this stuff could
help right so what are the reasons to
believe that it could help what are the
reasons to believe that it's going to
make things worse what's actually
happening on the ground where do we go
from here that article hasn't been done
yet
for policing this is growing by a huge amount whereas the welfare state is being constrained and so yep
I mean
could this replace humans I don't think so right somebody still has
to answer the phone right I don't think
anybody is talking about replacing
humans this is meant to be a
decision support system so to help
humans make better decisions you know
when they're very pressed for time and
they may not be able to read the
complete case history so that's one
issue is they're just not going to be
able to assimilate all of the data
that's available the other is that
humans are just not that great at making
predictions this we've got you know 60
or 70 years of evidence that people are
inconsistent in general now of course
they may be all right in a particular domain but my bias would be that lacking any other information we should assume they're not
that great and that's what they found
here as well oh that's the end of that deck but this gets to the question of you know why do people want to implement these algorithms right and you can be really
cynical and say oh because there's money
to be made it's like okay well there's not that much money to be made doing this stuff right that
doesn't explain why all of these
researchers are doubling down on this
stuff this is produced by a bunch of
academics and you know even the algorithm's critics admit that you know they did everything right in terms of
consulting with the community and you
know holding meetings with various
stakeholders before they started and
hiring people to do an independent
Ethics review of the system so this
paper is very interesting because it's a
case study of sort of everything going
right and yet the outcomes still being
not that great
okay your homework which we'll assign you shortly is you're gonna take the ProPublica data so this should be familiar you're going to build your own classifier which is a logistic regression now we don't have the COMPAS data like the 137 questions we're basically just going to use age criminal history a few other things this is actually in the notebook we used last class so basically for this you can just cut and paste that code and you will find that it gets about sixty five percent accuracy which is almost as good as the actual Northpointe algorithm
so you're gonna build that and then now
that we have our own classifier we can
mess around with it so you're going to
write this function which applies a
different threshold to each group and
you're going to tell me when you're
going to set the threshold so that the
false positive rate is equal between
black and white defendants and then
you're going to tell me what that does
to the other confusion matrix metrics so
we're going to look at a real world
example of trying to modify a classifier
to make it fair by one of these
definitions and optional if you like I
just put this in here because it's
interesting you can do a little more
analysis here and figure out how much
information about race is actually
leaking through into the classifier
remember we talked last time that it's almost impossible to hide race and gender in the data well now
you're going to find out how well you
can guess race just from the features
that the classifier is using
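Here is a hedged sketch of the per-group threshold step of the homework; it assumes you already have predicted probabilities and labels in a DataFrame, and the file name, column names, and cutoff values are placeholders since tuning them is the actual exercise.

```python
# hypothetical sketch of the per-group threshold exercise (column names and
# threshold values are placeholders -- tuning them is the actual homework)
import numpy as np
import pandas as pd

test = pd.read_csv("compas_test_predictions.csv")  # columns: p_recid, two_year_recid, race

def rates(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return {"FPR": fp / (fp + tn), "FNR": fn / (fn + tp), "PPV": tp / (tp + fp)}

def per_group(df, thresholds):
    """thresholds: dict mapping race -> probability cutoff for predicting recidivism."""
    out = {}
    for race, t in thresholds.items():
        g = df[df["race"] == race]
        out[race] = rates(g["two_year_recid"].to_numpy(),
                          (g["p_recid"] >= t).to_numpy().astype(int))
    return pd.DataFrame(out)

# adjust the cutoffs until the FPR row matches across groups, then compare FNR and PPV
print(per_group(test, {"African-American": 0.58, "Caucasian": 0.50}))
```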

A class on algorithmic fairness, definitions of fairness and "bias"

Contributor: Jonathan Stray