Algorithmic Accountability and Discrimination
Frontiers of Computational Journalism week 5 - Algorithmic Accountability and Discrimination
Transcript
today we're going to start algorithmic
accountability that is to say an
analysis of how algorithms have an
effect in society I originally had a
statistics lecture scheduled I decided
to postpone that a little bit because
next week your final project proposals
are due and so I wanted to talk about
some of the things that you may be
interested in including in your final
projects this is part 1 of 2 of
algorithmic accountability we're going
to spend two weeks on this the
centerpiece of today's class is going to
be a deep unpacking of a famous piece of
algorithmic accountability journalism
possibly the most famous piece which is
ProPublica's Machine Bias where they
analyzed the COMPAS recidivism
prediction algorithm so this is an
algorithm used to decide who gets bail
it's a pre-trial risk scoring algorithm
right so this is literally an algorithm
where individuals freedom hangs in the
balance of course it's natural that it
should come under a lot of scrutiny and
we're going to talk about the analysis
they did and what they concluded and the
honestly the can of worms that that
opened up because it was the
beginning of some very complicated
questions first I want to talk about
some previous work in algorithmic
accountability some other things that
have been done and even before that the
first thing we have to do is I guess
scope this and the way I'd like to do
that is to list out all of the places
where algorithms have an important
effect in society so
let's do that so where are algorithms
used in society in consequential ways
obviously algorithms are used in society
for all kinds of stuff right you know
music recommendations and checkout
machines and so forth but they're also
used to make decisions that can be quite
consequential so yeah real estate tell
me about that
okay yep so classic stuff tied into do any
of you know the phrase redlining
yeah okay so that was the policy of
steering minority homebuyers away from
certain neighborhoods which is of course
illegal became illegal in the 1960s but
still kind of happens in certain ways
and we'll look at analyses of that okay
what else
credit scores yep it's a big one so
that's also related to lending we will
actually look in much more detail at
the effects of algorithmic default prediction
on lending in the next class it's gonna
be another exercise we do hiring yep
yeah yeah well we're gonna yeah so we're
gonna we're gonna look at this question
of fairness in great detail that's
that's kind of what we're doing here
okay what else
welfare distribution
what kind of algorithms would we have in
welfare distribution okay okay
do you know of algorithms or
models being used for that oh yeah right
yep yeah so that book's gonna come up a
bunch of this work is gonna come up okay
what else
yep there's all kinds of places in
health care so diagnostic models that's
a big one right or prognosis you know
who gets this very expensive procedure
it's nice to imagine that anybody who could
possibly benefit from the medical
procedure gets that procedure but that's
not how health care works
right you have a even if it's a large
number you have a finite amount of
resources and so you have to make
allocation decisions which means that
some people aren't going to get a very
expensive test which has a very low
probability of finding a problem so you
have to make these decisions yeah
medical ethics it's a fascinating field
what else
insurance yeah insurance of all types
interestingly insurance is probably the
first method where algorithmic
techniques were used and we're going to
talk about what I mean by algorithmic
techniques right that's a we can also
just say statistical models or actuarial
models which is how insurance rates are
calculated I mean large parts of
probability and statistics were invented
to make insurance work out hmm
policing yep so I'm going to start over
here
how can algorithmic or statistical
methods be used in policing yeah okay
so yeah so face recognition
you said predictive policing so where
are crimes going to be committed or
who is going to commit them yeah that's
normally part of predictive policing and
in general we can add this category of
Criminal Justice
so risk scores who's likely to be
re-arrested or commit a crime and that
can be pretrial or post trial sentencing
there's a bunch of different places
where predictive models are used so you have you
know should we release this person on
bail that's pretrial what sentence
should they get
should they be released on parole stuff like this
what else there's a at least one
remaining huge category yeah that's a
good one
all right so student testing teacher
evaluation hiring and firing within
education all kinds of places
yep advertising and marketing so
optimization of ad purchases micro
targeting and as long as we're going to
talk about micro targeting let's talk
about politics so trying to persuade
voters sending different messages to
different people that's micro targeting
and personalized advertising predictive
models like election prediction now for
you and I election prediction is is sort
of a journalism function right like
we're going to inform people who's
leading in the polls ultimately but if
you look at it from the point of view
of a campaign election prediction
changes the strategy that candidates use
they change where they campaign where
money is spent what issues and
positions they take because of course
they will try to take the issues that
will get them the support they need to
win so those predictive models can
indirectly have influence on the policy
positions of our leaders kind of crazy
huh there's a really still a big area
that we don't have on this list
let's say social networks yeah that's a
good one isn't it
so we've been talking about this in your
filter design assignment that's a big
field we'll let's sort of leave it
at that for the moment
I want to add at least one more finance
algorithmic trading models for
investment this is a heavily algorithmic
field possibly more so than any of these
others right I mean when we say a
quant right you all know that phrase
like we're talking about that that word
came from Finance the allocation of
money in society is heavily dependent on
statistical models and algorithmic
methods yeah
what are the categories where the
general public or even people who are
more specialized like us are not paying
attention like everyone talks about
algorithmic well mm-hmm yeah so hey
let's circle the popular ones how about
that and then we can see ok so
everyone's talking about policing right
now we're gonna talk about it too and
criminal justice social networks
politics yeah there's a fair amount of
talk about that so hey that leaves
everything else credit scores people
talk about that so okay
so I would say finance that's my answer
I think there is a catastrophic lack of
accountability for the use of
statistical methods in investment
decisions and it's not you know in some
sense it's not really a problem with the
use of statistical methods it's it's the
problem of deciding what to invest in
using only the metric of how much money
will it make me right this is not this
is not a new problem we just have a new
form of this problem yeah yep
people that are building models like to
reverse-engineer the models that are
being used
mmm
sort of yeah we'll talk about Finance a
bunch more in part because I think it's
really fascinating and important oh
another thing under policing is DNA
testing I was talking to someone last
week who is on this she got a Magic
Grant to compare the outputs of various
DNA matching algorithms from different
vendors and she's having the same
problem that you just mentioned which is
that these algorithms are trade secret
right the companies that build them
don't publish them and yet they send
people to jail here so I'm gonna circle
finance as something we can talk about
more and we'll talk about it a little
later today when we discuss the final
projects I went through that deck
really briefly last class but I'd
like to actually have a discussion about
it today okay so lots and lots of places
where algorithms have influential
effects this was a list that I made up
earlier let's see how it compares oh price
discrimination and terrorism yeah
so that and search I'll yeah so this is
a big list and you can add more things
to it
price discrimination is the practice of
charging different people different
amounts of money we'll see examples
where that is happening that can very
often turn into class discrimination
generally people who have less money
have poorer negotiating power which
means ironically you can charge them
more terrorism prediction this
sort of exploded into public
consciousness after September 11 there
was briefly something that was supposed
to be called the Total Information
Awareness program which was very sort of
big brother-ish most of the things that
they wanted to do have since been taken
up by the DHS and the intelligence
community we're gonna look at
machine learning
algorithms that try to guess who is a terrorist
that type of thing
scanners at airports you know who gets
stopped and searched and this is a
uniquely challenging domain both because
it intersects basic civil liberties and
freedoms and because the number of
people who are actually a threat is so
small that anything you do is going to
be swamped by false positives so this is
a basic challenge in actually several of
these fields okay so algorithms are
everywhere again just sort of as a
framing exercise there's there's a big
discussion happening in Silicon Valley
right now which was barely happening two
years ago it's really the election
kick-started all of this and a lot of
people are interested in sort of the
technology ethics this is one framing
for the ethics of technology the framing
is the unintended harms of
technology and they have it
split into a bunch of zones and so
we're gonna talk mostly about this
machine ethics and algorithmic biases
this is huge right now disinformation
we're gonna have a class on
disinformation
I think it's called truth and trust the
class we're going to do it's one of the
last few classes
I am also extremely interested in this
economic inequalities I suspect that
that is going to have far more impact on
more people's lives than just about
anything else that we have on the board
including things like criminal justice I
feel like it's very understudied and I
think part of the reason it's understudied is
it's very hard to get a handle on
because it's so systemic you can think
of it as algorithmic critique of
capitalism which is just getting off the
ground but anyway mostly we're gonna be
here today and in the next class oh this
is the idea that that we're designing
Facebook to make you yeah yeah we're
building products that are designed to
be addictive right which is not healthy
this is a site which is specifically
about algorithmic accountability
algorithmtips.org and what this is is
it's this list they generated of
algorithms used in government and so for
example okay so here's a few right let's
see what should we search for what type
of algorithm should we find
okay so here you go method for
calculating the number of children
directly certified for free school meals
and other social benefits yeah I mean I
don't I don't want to speculate but we
could find out I don't want to do it
right now let's just try a welfare as
opposed to a child
nope that's the only one what if we just
do child
huh
interesting
it looks like it's a evidence collection
system okay yeah how come the door and
talk to them yeah this is a scoring
system for evaluating schools anyway so
this just goes on and on and many of
these things are not algorithms per se
in fact many of them probably don't even
have code all right
they're just methods but that's that's
although we're now talking about
algorithmic accountability it's not like
statistical models are new in government
they've been used for at least a hundred
years any sort of scoring system any
sort of predictive model they've been
used for a long time in a lot of fields
it's just they're becoming more
sophisticated and more consequential I
don't know what's the line between
statistics and machine learning or
machine learning in AI it's not really
important the important
characteristic here is that there's
some let's say mechanical method for
coming to a decision and we want to
evaluate that method
so that's algorithm tips how they got it
is pretty fascinating
they wrote a paper about this which is
linked from the syllabus and they
basically these are all the search terms
they used and you can see some of them
are bold I think those
are the ones that they came up with
themselves and then they used an
analysis of related phrases so what else
appears in these documents to find
the non-bold ones so there's a lot of
different names for these types of
techniques
algorithm tips is cool there is a major
limitation with algorithm tips which is
that it's only government and you can
see which government departments they
pull it from interestingly Health and
Human Services has the highest number I
find that fascinating because it's I
suppose the most human right the
Department of Energy must have all kinds
of formulas for figuring out how much
electrical wiring is needed for a state
but that's not really that doesn't
appear in the search results I guess
this is about the extent of public
knowledge of the FICO score the FICO
score is the standard credit scoring
algorithm that is to say we
don't know very much they don't release
publicly where that score comes from and
most of the things on this list that we
just came up with there's very little
public information about how this stuff
works
so we had the question earlier how do we
investigate these algorithms and this is
something we'll come back to a number of
times but any initial ideas strategies
for investigating non-public algorithms
encourage leakers sure so inside sources
yeah you can't for a company oh yeah
that's the issue
yep so you can you can try to replicate
the results assuming you can get the
results so that's a big deal right
can you get the output of the algorithm
sometimes you can that's kind of what
ProPublica did with the COMPAS algorithm
it was a model to predict whether people
would be re-arrested within two years so they
just waited until everybody
was released waited until two years went by
and saw how many of those people were
arrested right so you get the ground
truth because arrest records are public
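
The check described here, comparing a risk score against what actually happened two years later, can be sketched roughly as follows. This is only an illustration and not ProPublica's actual analysis: the records, the tuple layout, and the high-risk cutoff are all invented for the example.

```python
# Hypothetical records: (risk_score from 1 to 10, re-arrested within two years).
# The data is invented purely for illustration.
records = [
    (9, True), (8, False), (7, True), (3, False),
    (2, False), (6, True), (1, False), (4, True),
]

HIGH_RISK_CUTOFF = 5  # assumed threshold: scores above this count as "high risk"


def confusion_counts(records, cutoff):
    """Count how predictions (score > cutoff) line up with observed outcomes."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for score, rearrested in records:
        predicted = score > cutoff
        if predicted and rearrested:
            counts["tp"] += 1
        elif predicted and not rearrested:
            counts["fp"] += 1
        elif not predicted and rearrested:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


counts = confusion_counts(records, HIGH_RISK_CUTOFF)
# False positive rate: of the people who were NOT re-arrested, the share flagged high risk.
false_positive_rate = counts["fp"] / (counts["fp"] + counts["tn"])
print(counts, false_positive_rate)
```

Running the same counts separately for different groups of people is one way to ask whether the errors fall evenly, which is where the fairness questions start.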
yeah I mean definitely you should talk
to the people who are affected by the
algorithm
the challenge there is that even a very
good algorithm is going to have some
very bad results because there will be
errors so how do you do balanced
reporting because you can always find
the case where something bad happened so
okay if that's true then how do you
decide whether a particular algorithm is
good or bad what does that even mean
you're looking at like trying to reverse
engineer an algorithm that a team of the
best engineers have been working on yeah
yeah although as we'll see I know I
don't think it was that resource
intensive actually it mostly involved
yeah smart talented well-trained people
like all of you go forth yeah so there's
there's issues other strategies for
investigating private algorithms include
so if you can run the algorithm then you
can test it right so you want
to get into a situation where you can feed
inputs into it and see what happens
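
A minimal sketch of that idea, feeding a black box inputs that differ in exactly one field and comparing what comes back. Everything here is hypothetical: `probe` is a made-up helper, and `toy_price_quote` is a stand-in for whatever real website or API would be under test.

```python
def probe(black_box, base_input, field, values):
    """Run the black box on copies of base_input that differ only in `field`."""
    results = {}
    for value in values:
        trial = dict(base_input)  # copy so the base profile is never mutated
        trial[field] = value
        results[value] = black_box(trial)
    return results


# Stand-in for the system under test. In a real investigation this would be
# a request to a website or API; this toy function is purely hypothetical.
def toy_price_quote(profile):
    return 100 + (15 if profile["zip_type"] == "rural" else 0)


base_profile = {"zip_type": "urban", "basket": "paper+stapler"}
quotes = probe(toy_price_quote, base_profile, "zip_type", ["urban", "rural"])
print(quotes)
```

Holding everything else fixed is the point of the design: any difference in the output can then be attributed to the one field that changed.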
and for example you can do this with
Facebook you can make fake profiles you
can try posting different things we're
gonna look at a little bit later today
an experimental method for analyzing how
the news feed algorithm works which is
very clever but it's very simple
and very easy to do and
organizations like The Wall Street
Journal have done things like their red
feed blue feed piece which just created
two different profiles and set up
different friend groups just to
understand how you could end up in
different information worlds there's a
lot you can do just by interacting with
the algorithm so while these algorithms
are often trade secret there are
actually methods for investigating them
and if it is an algorithm of significant
public interest you can also make the
argument that it should be public right
you can do a story so this is like kind
of like doing a story that says why
isn't this data publicly available
sometimes when you can't get the
data you can still get a story
which is I can't get the data so you can
do a story which is like you know we've
talked to ten people whose lives were
badly affected by this algorithm why
can't we see it you can do that story as
well you can apply pressure for
transparency so for the next part of our
discussion I want to go through just a
bunch of previous algorithmic
accountability work and sort of just
sort of a brief survey of what's been
done this wasn't a piece of journalism
this was actually a blog post but this
was an analysis showing that New York
City's teacher evaluation scores weren't
doing what they were supposed to do and
let's get the original thing so we can
look at it and I will try to unpack this
a little bit I know that chart doesn't
make a lot of sense on its own
huh so this is going back a few years
now and what we are looking at is a thing
called a value-added score so the idea
is you basically you look at the test
scores of the students but you don't
want to look at just the raw test scores
because the student population is going
to be different at different schools the
students come in with different levels
of skills already so you instead you
look at this measurement which is how
much have they improved and this is the
idea behind the value-added test scores
and the issue that he's pointing out
here is that we really have no way to
figure out what it means but this
analysis suggests that it doesn't mean
all that much so here's what's going on
here the horizontal axis is the teacher's
value-added score for and these are
released publicly for a particular
subject for one year let's say 4th grade
math and the y-axis is their score for
the same subject for the next year so
let's say 5th grade math or it could be
6th and 7th grade English or something
like that now you would expect that if
this is actually measuring teacher
quality that the teacher's quality of
teaching is gonna be similar between 4th
and 5th grade math right the idea here
is that you're measuring some underlying
latent teacher quality variable the
issue is that we get only a very weak
correlation right so we would expect
that this was more or less a linear
cluster right it should be a lot flatter
the fact that it's all over the map
means there's almost no correlation there which
means that you either have to assume
that in general there's no relationship
between how well you teach one grade and
the next grade or the metric isn't
working so that was the argument in this
analysis yeah I don't know we'd have to
dig into exactly how the scoring system
works so I like this example for a variety
of reasons one of which is it's a very
clever way to test the output of this
model without actually having access to
the model right we don't have to know
how this is being calculated to know
that it has not captured the thing which
it was designed to capture all right so
we look at only the output not even the
input and we can tell it's wrong here's
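
The output-only test in this example amounts to correlating each teacher's score in one year with the same teacher's score in the next. A small sketch, with invented scores for six teachers (the real analysis used the publicly released New York City data, which is not reproduced here):

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical value-added percentiles for the same six teachers in
# consecutive years. Invented numbers, for illustration only.
year_one = [10, 30, 50, 60, 80, 95]
year_two = [55, 20, 90, 15, 40, 70]

r = pearson(year_one, year_two)
print(round(r, 2))
```

If the score measured a stable teacher quality, r would be close to 1. A value near 0 says that either teaching quality swings wildly from year to year or the metric is mostly noise.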
one of my favorites this is an old piece
by the Journal and what you're looking
at is the relative price of a
standardized basket of goods right so
how much for a ream of paper an SD card
and a stapler when buying online right so
this was Staples which is an office
supply retailer and what they did is
they effectively accessed their website
from computers in every zip code so they
used a commercial proxy service to route
the requests through each of these
places and looked at how often they got
higher prices than the baseline so what
pattern do you see here
cities have lower prices right why do
cities have lower prices right so now
we're gonna ask you know what is the
company doing
warehouses their supply
how much it cost yeah so that's a
reasonable explanation
I think shifting the burden is the right
language not everybody does that
in fact in public systems we always end
up spending more on rural areas so for
example in and this is a famous case of
price discrimination used by economists
in England for a long time there were
laws saying that you can't charge people
more to ride the train to faraway
stations right it's a it's got to be the
same cost per mile no matter how far out
you're going but of course in the rural
areas in the smaller lines there's fewer
riders so it's actually per capita more
expensive to support them and the New
York subway system is similar right it's
the same cost whether you're riding from
Far Rockaway or three stops away
basically all transport systems
subsidize the lower population areas but
private companies do not so that's one
theory of what they're doing is
basically they're just passing the cost
of shipping although it's not clear to
me if you know DHL or whoever
they're using is actually going to cost
more to go to rural areas normally if
you look at the shipping rates it's sort
of like they divide the country into
three or four regions and everything
within that region is the same cost so
it's not clear to me that this is
reflecting their shipping cost what else
could it be
competition say more say more about that
more competition and more competitors
in large cities yeah so clearly on you
know on some level they're charging this
price because they can if nobody was
buying at that price the prices would
shift but there's an interesting
question here which is how are these
prices actually being set is there some
operations executive who's sitting there
with a spreadsheet every week and
setting the prices is that what's
happening yeah I suspect that these
prices are set algorithmically and I
think actually and this would make a
great final project that you could
actually get this result with an
extremely simple little algorithm which
is basically just put an optimization
algorithm on total sales right so you're
all familiar with optimization let's say
a one variable optimization problem
where you have some some function that
you're trying to find the maximum of
right so in this case its sales and this
is price so you try to find the price at
which the sales are maximized and
there's many algorithms for doing this
so one of the simplest is hill climbing
you change the price let's say you add
1% to it and you see if the total sales
go up if they go up you keep adding 1%
is that clear and if it goes down then
you subtract one percent of the price
right so all you're doing is you're just
chasing the local gradient at any given
point you compute the gradient
I guess in this case just the derivative
right because it's a one variable
problem numerically right so you just
evaluate the function at two different
points corresponding to different prices
that tells you which way is up you chase
it until you get to the top so you
know which you can do in two or three
lines of Python so I suspect that this
pattern that we see is generated by
extremely simple pricing scripts and
what we get is that rural areas are
charged higher prices and because there
is a correlation between urbanization
and wealth what you get is poorer people
have to pay more so probably the
engineer who set up this price
optimization algorithm had no idea that
that would be the effect and so this is
potentially one of the issues with the
automation of these types of things is
that you can get bad effects and not
even be aware of it right it's not that
charging poor people more money for the
same thing is new that is a standard
effect of capitalism the problem now is
that you can do it without thinking or
knowing about it
potentially
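
Here is roughly what that one-variable hill climbing could look like. This is a sketch of the idea as described in the lecture, not anything known about how a real retailer sets prices: the function names and the linear demand curve are assumptions made up for the example.

```python
def hill_climb_price(total_sales_at, price, step=0.01, max_rounds=1000):
    """Nudge the price up or down by `step` (1%) while total sales keep improving."""
    for _ in range(max_rounds):
        current = total_sales_at(price)
        higher, lower = price * (1 + step), price * (1 - step)
        if total_sales_at(higher) > current:
            price = higher      # raising the price helped, keep going up
        elif total_sales_at(lower) > current:
            price = lower       # raising hurt but lowering helped, go down
        else:
            break               # neither direction helps: a local maximum
    return price


# Hypothetical demand curve (an assumption for this sketch): the number of
# units sold falls linearly as the price rises, and never goes below zero.
def toy_total_sales(price):
    units_sold = max(0.0, 100 - 5 * price)
    return price * units_sold


best_price = hill_climb_price(toy_total_sales, price=5.0)
print(round(best_price, 2))
```

With this toy curve the true optimum is a price of 10, and the loop stops within one step of it. Nothing in the loop knows who the customers are: run it separately per region and it settles on whatever each local market will bear, which is how a disparity can appear without anyone choosing it.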
this is one we looked at before this is
reverse engineering of political
micro-targeting through email this was a
ProPublica piece from the previous
election and oh there you go look I've
just got the headline Amazon scrapped an
AI system that discriminated against
women there you go in the news today
we've broken this one apart it's it's
tf-idf clustering to combine the
variants of the email interestingly the
way they did this was they crowd-sourced
the data collection so that is another
strategy for trying to do algorithmic
accountability here's another big piece
they did insurance premiums were
historically higher in minority
neighborhoods we've had for decades now
laws saying you can't do that but it
still seems to be the case that they are
higher in minority neighborhoods even
accounting for risk right so you have to
account for risk because not all
neighborhoods and not all drivers have
the same risk the risk by zip code is
available so that's public record so
that's what this x-axis here is so
average payout per car per year so in
this case that would mean basically each
driver claims an average of 200 dollars
a year what actually happens of course
is that most drivers claim nothing and
then some get into accidents so what
they're doing here is they're they're
fitting
these these smooth curves to the
patterns in the different neighborhoods
and you I mean I don't know there's a
there's a lot of variance here right you
can see that these curves don't actually
track the data that closely but you can
also see that you know generally it's
higher in minority
neighborhoods for the same level of risk
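
One simple way to make "for the same level of risk" concrete is to group zip codes into bands of similar payout and compare premiums only within a band. Note that this banding is a simpler stand-in for the smooth curves fitted in the story, and the rows below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical zip-code rows:
# (average payout per car per year, annual premium, is_minority_zip).
zip_rows = [
    (150, 600, False), (160, 720, True),
    (250, 800, False), (240, 950, True),
    (350, 1000, False), (360, 1150, True),
]


def mean_premium_by_risk_band(rows, band_width=100):
    """Average premium per (risk band, neighborhood type), so zip codes are
    only compared with others at a similar level of risk."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of premiums, count]
    for payout, premium, is_minority in rows:
        band = int(payout // band_width) * band_width
        bucket = totals[(band, is_minority)]
        bucket[0] += premium
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


means = mean_premium_by_risk_band(zip_rows)
for band in sorted({band for band, _ in means}):
    # Assumes every band contains both kinds of zip code, as in the toy data.
    gap = means[(band, True)] - means[(band, False)]
    print(f"risk band {band}-{band + 99}: minority zips pay {gap:+.0f}")
```

A positive gap inside every band is the pattern the chart is showing: higher prices that the difference in risk does not explain.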
okay the vertical axis is the case okay
each dot is a zip code yeah so each dot
is a zip code yeah exactly so the so
that the y-axis is the cost of insurance
in that zip code for a particular
insurer and the x-axis is the payout in
that zip code for averaged across all
insurers well
because within a County Tennessee yeah
like those were not no I didn't say no
yeah exactly
no it's all over Illinois and they
actually do it for several states this
is an analysis of this chart is a little
hard to read as well so what this is
it's the correlation between the change
in price so this is an analysis of surge
pricing you're all familiar with surge
pricing you know what I mean by
that anyone not familiar with surge
pricing
okay so analysis of surge pricing the
change in the surge price versus the
change in the number of cars nearby so
what they're looking at is do the surge
price changes actually lead to more cars
and a positive correlation coefficient
so everything above here is yes so you
can see that after five minutes or so
there tends to be more cars but the
effect is different in different
neighborhoods and again the sort of
variable of interest that they were
looking at here is the racial
composition of those neighborhoods
neighborhoods in DC yeah
right
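
The question of whether a surge price change is followed by more cars a few minutes later can be framed as a lagged correlation. A rough sketch with invented series (one value per sampling interval); the interval length, the data, and the function names are assumptions, not the method from the article.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def lagged_correlation(price_changes, car_changes, lag):
    """Correlate the price change at step t with the car-count change at t + lag."""
    if lag > 0:
        return pearson(price_changes[:-lag], car_changes[lag:])
    return pearson(price_changes, car_changes)


# Hypothetical per-interval changes for one neighborhood (invented data):
# the car count moves one interval after the surge multiplier does.
surge_change = [0.0, 0.5, 0.0, -0.5, 0.0, 0.5, 0.0, -0.5]
cars_change = [0, 0, 2, 0, -2, 0, 2, 0]

same_time = lagged_correlation(surge_change, cars_change, lag=0)
one_step_later = lagged_correlation(surge_change, cars_change, lag=1)
print(same_time, one_step_later)
```

Computing the lagged value separately per neighborhood, then comparing neighborhoods, is the shape of the comparison in the chart.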
yeah or slower right so you don't
actually get better service by using the
uber API so the API will tell you both
the current price or what the surge
pricing factor is right because it goes
to like 1.5 or 2 or something and also
the number of cars in a neighborhood
yeah I don't know if you still have
access to it but but so here's a
technique as well use the API
analyze what you can get from the API
now you know they said in their article
you know we we don't think that uber is
trying to do worse in neighborhoods with
more minority passengers right like and
this is a key concept in reporting on
discrimination is don't don't get
fixated on intent right you you don't
have to find somebody who really wants
to be an asshole a really you know some
really racist marketing exec to have a
story unintentional disparities also
matter
in relation to cars yeah it's supposed
to well I think I thought surge pricing
was determined based on not only
the number of cars available the number
of people no no you're right you know
so you're right that surge pricing
is based on demand as well
I don't think demand is available
through the API and I don't think it's
in the article but any one of these so
some of these are linked from the
syllabus but basically how these slides
are set up is that if you take the thing
at the bottom and you type it into a
search engine you'll get the story
because it's always the title at the
bottom so you can look at these charts
content and in fact in this particular
case the code is available so we could
reverse engineer this whole story if you
want to want to look at how it's done
yeah so right well are they trying to
take uber down
yeah I see exactly so I see this story
as more of a comment on the issues with
private provision of what has often been
considered a public service which is
transportation right so as opposed to
the subway where there's a great deal of
effort to try to reduce disparities in
service quality and cost among different
populations right so let's say close to
the center and far out which is also
what we're seeing here you know a
private transportation company may not
be trying to do any of that and so it's
an unintended and perhaps unexpected
consequence of the shift from public to
private service provision
ya know just right the truth I mean yeah
so let's let's talk about this briefly
because this is actually important so
first of all the standard saying or the
standard approach in investigative
journalism is go in through the front
door first
alright so always ask be upfront and
honest about what you're trying to find
out why it's interesting what you want
to know okay so there's that scraping
there was a recent Circuit
Court case that decided that scraping as
a violation of the terms of service
of a website is not a criminal offense
okay
so nobody's gonna go to jail for
scraping having said that it can still
be a civil offense right you can still
get sued so when in doubt which is going
to be most of the time for this type of
stuff talk to your in-house counsel for
your news organization I mean if
you're doing this freelance then you
need to talk to your assigning editor
about this generally what your lawyer
will tell you is that if there's a clear
public interest then go for it because
public interest is a defense in libel
cases in this country and actually many
countries
but yeah there's there are media law
issues here for sure right well I mean
so you would have to be able to make the
argument that that algorithm is
important and knowing how the algorithm
operates can only be done by
testing so there is actually a history
of governments testing systems so for
example sending a similarly qualified
black and white applicant into a bank to
try to get a loan right and there have
been there's a long history of testing
or enforcing anti-discrimination laws in
this way and so there is a strand of
legal argument which I don't know if
it's ever been tested in court but the
argument is that you know we have a long
history of sending people undercover to
test for discrimination so doing that by
making API calls has got to be legal so
I would expect that you're in the clear
but this is why we have lawyers it's to
give us good advice on these questions
ok let's go through this section and
then the after the after the break we'll
do our analysis of Machine Bias race
and gender bias are the most common
topics for algorithmic accountability
before we get deep into this I just want
to go back to this slide and say most of
these things are not race and gender
bias okay so the algorithmic
accountability community and research
and stories so far has been focused on
an important but actually quite narrow
set of potential problems with
algorithms
nonetheless that's what we're going to
do today part of the reason we're so
interested in race and gender bias is
that there's a very clear legal
framework around it so this is one of
the pieces of law around bias in this
case it's looking at employment but this
was one of the first laws where
what is now known as a list of
protected classes appears right race
color religion sex or national origin so
we have these categories they're called
protected classes and it's illegal to
discriminate on the basis of them in a
variety of things we're gonna we're
gonna look at that in much more detail
tomorrow then we're gonna look at it in
much more detail right now okay so these
are some of the places in which it is
illegal to discriminate on the basis of
protected classes so look at look at
what we have here lending education
employment housing and this public
accommodation that's things like the
Americans with Disabilities Act right it
says that you have to have you know
wheelchair accessible entrances and so
forth interestingly the and this is
going to become extremely important when
we talk about definitions of fairness
the public accommodation portion of this
is the clearest example of the legal
principle that different people are
going to require different amounts of
effort to be fair to right so the basis
of the law isn't that everybody gets the
same amount of effort or money or time
or cost the law is more complicated that
it says some people are going to get
more to make up for some disadvantage
so that is already built into the legal
definitions of fairness and we'll this
becomes a very big discussion as and is
indeed the center of some very big
arguments this is a pretty broad list I
you know there aren't many things that
fall outside of this this is most of the
services that governments provide
that's true that's true
well credit
yeah credit and housing that's that's
market regulation yeah so these are
these are some of the domains and then
the I love this list this is this is the
protected classes and you can see how
they've expanded over time so you know
they're you know race color sex religion
national origin those that's the Civil
Rights Act interestingly citizenship
this is relevant today with all of our
debates over immigration in most cases
you can't discriminate based on whether
someone is a citizen or not obviously
there are some cases where you can like
voting age pregnancy you know things you
might expect but then veterans status
and more recently genetic information so
I can't refuse to offer you insurance
because of a genetic test for example so
this is gradually expanding and I guess
we can expect that it will gradually get
bigger
so you may think that if you're building
an algorithm to decide on let's say
credit scoring or hiring or who gets
admitted to a school for example that
it's very simple to not discriminate
against race you just don't give it the
race of the person as input this is
where things get complicated and this is
what I want you to remember is that race
and gender correlate with just about
everything so let's look at that in a
little more detail I'm going to show you
some some data from this paper this is a
paper from I think 2015 where what they did
is they built a model to try to predict
protected classes age gender political
and religious views well things that are
protected classes relationship status
as well from a person's Facebook likes so
which pages had they liked and let's
just pause for a second and talk about
how they did this because it actually
has a lot in common with all of the
recommendation systems that we've been
studying so they start with the user's
Facebook likes so the user-like matrix
right this is very similar this is
basically a user item ranking matrix and
then singular value decomposition so
that's a matrix factorization technique
so they take each user and
rather than having 55,000 likes for each
user they reduce it to just a hundred
components you can think of these as the
topics right so here we have component one
component two and so forth
this turns this is kind of like our user
topic matrix or document topic matrix
doing this you'll also get another
matrix which tells you which
topics are related to which pages but in
that case we don't use that we just what
we're basically trying to do is
dimensionality reduction on the users
here and then we do a regression model
to try to make a either a binary or
continuous prediction for each of these
variables alright and so here we just
fit the data so we we start with all of
this user like data and we also have the
values of these these things we're
trying to predict and we fit a model and
we are taking a linear combination of
the component scores any question on the
method okay everybody get what we're
doing here this is almost the simplest
model you can build it could be slightly
simpler if you omitted
this singular value decomposition step
but then it would be kind of painful to
do because you would need fifty five
thousand coefficients in your regression
equation okay and the the factorization
also reduces noise interestingly just
like our topic modeling allowed the
same word to mean different things in
different contexts the matrix
factorization will do that for us here
right so maybe liking BMW means one
thing if you are a young
man and another thing if you're an older
woman right the the matrix factorization
will actually capture that okay
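The reduce-then-regress pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's code: the like matrix and the trait are randomly generated, the sizes (200 users, 50 pages, 5 components) are stand-ins for the 55,000 likes and 100 components mentioned in the lecture, and ordinary least squares is used for brevity where a logistic regression would be the usual choice for a binary trait.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stand-in for the user-like matrix: 200 users x 50 pages.
n_users, n_pages, n_components = 200, 50, 5
likes = (rng.random((n_users, n_pages)) < 0.1).astype(float)

# Made-up binary trait to predict, loosely tied to the first 10 pages.
noise = rng.normal(0, 0.5, n_users)
trait = (likes[:, :10].sum(axis=1) + noise > 1).astype(float)

# Singular value decomposition: likes = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(likes, full_matrices=False)

# Keep the top components: one short row of component scores per user.
user_components = U[:, :n_components] * s[:n_components]
# Vt[:n_components] relates components to pages; it is not used here.

# Linear model: the prediction is a linear combination of component scores
# (plus an intercept), fitted by least squares.
X = np.column_stack([np.ones(n_users), user_components])
coef, *_ = np.linalg.lstsq(X, trait, rcond=None)
predicted = X @ coef

print(user_components.shape)  # users x components
print(coef.shape)             # intercept plus one coefficient per component
```

The point of the reduction is visible in the shapes: the regression needs one coefficient per component rather than one per page.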
so when you do that this is what you get
so these are the things you're trying to
predict and then this axis is how well
you can predict it
this is an area under curve score how
many of you have seen AUC scores before
knows what that is okay
so many of you know alright so this is
useful stuff this is the basic mechanics
of classification which is what we're
going to do with the machine bias work
so this is this is important stuff the
idea is this let's say you're trying to
predict whether they drink alcohol
alright that's one of the things they
tried to predict here so you build some
classifier in this case it's going to be
a logistic regression classifier but it
doesn't matter it could be a decision
tree could be anything and basically all
of these classifiers at some point
they're going to give you a numerical
output right so we have some rule and we
say you know prediction equals some
function of the input variables which
I'm going to call X right X bar to
indicate that it's a vector of all of
these different input variables and
often it's something like a probability
right so let's say in [0, 1] so if you do
logistic regression you'll get
something in [0, 1] which if you set it up
right is a probability so now you have
to turn this into a yes/no guess right
because we're supposed to give a
yes/no answer do they drink or maybe
it's a multi-class problem male or
female you know gay or not whatever over
50 under 50 binary prediction problems
are very common so we need a threshold
right so we're gonna say we're gonna
turn this into a binary prediction just
by thresholding so we're gonna say f
of x smaller than
threshold now the threshold will control
the trade-off between false negatives
and false positives so if I make that
threshold really high then I'm almost
always going to say yes which means I'll
have many more false positives if I
make it zero then I'm always going to
say no so I'll have many more false
negatives if you only want to get rid of
false positives always answer no if you
only want to get rid of false negatives
always answer yes ok so that the trick
is not to reduce one of those to zero
because you can always do that the trick
is to balance them in an interesting way
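A small sketch of that trade-off, using made-up scores and made-up true answers. It assumes the common convention of answering yes when the score is at or above the threshold, so here a threshold of zero means always yes; with the opposite comparison the directions flip but the trade-off is the same.

```python
# Made-up scores in [0, 1] and made-up true answers (1 = yes, 0 = no).
scores = [0.05, 0.20, 0.35, 0.40, 0.55, 0.60, 0.75, 0.90]
truth = [0, 0, 1, 0, 1, 0, 1, 1]


def count_errors(threshold):
    """Return (false_positives, false_negatives) at this threshold.

    Assumed convention: answer yes when score >= threshold.
    """
    fp = fn = 0
    for score, actual in zip(scores, truth):
        guess = 1 if score >= threshold else 0
        if guess == 1 and actual == 0:
            fp += 1
        elif guess == 0 and actual == 1:
            fn += 1
    return fp, fn


# Sweep from "always yes" to "always no" and watch the errors trade off.
for t in [0.0, 0.3, 0.5, 0.7, 1.01]:
    print(t, count_errors(t))
```

At the two extremes one kind of error disappears entirely and the other is as large as it can be; the interesting thresholds are in between.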
so what we do is we write this graph and
we're gonna say this is true positive
versus true negatives for historical
reasons because this first came up in
radar radio systems this is called an
ROC curve for receiver operating
characteristic which is the worst name
but anyway that's what it's called and
well there's various ways to draw this
but I'm gonna go with the Wikipedia way
this is false positive rate this is
false negative rate and as you so we
know that we can get
oh sorry yeah thank you false positive
true positive rate so we know that we
can get zero false positives and zero
true positives by always guessing false
we know that we can get a hundred
percent true positives and a hundred percent false
positives by always guessing true so
those two points are always available to
us but then as we sweep the threshold as
we increase it from zero we pick
more and more positives and we start
sweeping this curve up and as it gets
nearer and nearer to one eventually we have a
hundred percent positives so we get some
curve that looks like this as we vary
the threshold and so these are the
points these are the trade-offs that are
available to us and normally what we do
is we compare that to the straight line
between these two points and the reason
we do that is that that is the coin flip
classifier okay if we flip a standard
coin that's weighted heads or tails
fifty-fifty we get this point in the
middle right because it's gonna get half
true positives and half false positives
if we start weighting that coin and we
say well let's weight it 70/30 toward
negative we get a point here or we could
get a point here by weighting it the
other way by just randomly
guessing a certain percentage of trues
and falses we can get all the points on
the green line so it's only this area of the
graph that tells us how good our
classifier is if we could guess
perfectly we would get
just this one point here right we would
have something that looked kind of like
this and if we ever built a classifier
that had this type of performance then
we just guessed the opposite and we'd
have a better classifier so the area
under the curve is finally this whole
area right we basically just take the
integral as we sweep out our possible
combinations of false positive rate and
true positive rate and it has to be over
0.5 because 0.5 is
what the baseline is here of guessing
randomly if we ever get something that's
less than point five we just guess the
other way so going back to our Facebook
analysis result here so guessing whether
someone's parents were together at
age 21 we get an area under curve of 0.6
that's just a little bit better than
random guessing so we get some
information but not a lot whereas
guessing gender or well you look at
this right hey protected classes race
sexual orientation gender we can guess
with quite high accuracy 0.93
so 0.93 means we have a curve which is
you know up like this right we get most
of this graph under the curve so this is
the the AUC is a useful way of measuring
the performance of a classifier because
it it sort of integrates out the effect
of setting a different threshold it says
you know for whatever application you
want you can pick a false positive rate
and a false negative rate that make you
happy we don't care what you're
going to set that threshold at
this measures it independently
but anyway 0.93 and 0.95 are very good
predictions so you know there must be a
bunch of pages that men like and women
don't and vice versa because we can
definitely figure it out just from your
likes and similarly with race alright
questions on that yeah
for profiles that didn't fit the binary
classes what do you mean okay right what
did they do with the Hindus and Jews yeah I
don't know we'd have to read the paper
to see how they coded it yeah I mean you
know or a race other than white or black
almost all of the racial stuff that that
you'll see is done with either white
versus black or white versus non white
there's very few that break out actually
the different races and then you have
the the problem of well what are the
different races and that's a mess anyway
so you can do this even with much less
information that analysis was done with
fifty five thousand likes per person
that's a lot of information although
probably you only liked a few dozen this
is as simple as an example done using
predicting Jennifer and political
orientation from Twitter and this is
just using different techniques but you
can see here the top row they use only
the data from what they had tweeted and
their profile and you can also use data
in some cases to get better accuracy if
you use data about who they follow but
you but you look at this and you see
that you can get about 80% accuracy on
gender and about 90% accuracy on
politics just from Twitter data right so
much simpler data set and I would be
very surprised if you can't get some
really high accuracy on race so it's
very hard to blind an algorithm to
protected classes because and of course
this is the reason they're protected
classes is because they run so deep in
society
they're correlated with just about
everything else so where you live the
amount of money you make what your job
is your health history all of this
information will give you lots and lots
of predictive power and productive
classes which means it's it's you can't
really blind an algorithm to race it can
pretty much figure it out anyway so that
becomes a challenge to the entire idea
of not using that type of information to
make decisions oh yeah here you go
here's race from Twitter tada
so this uses different measures of
accuracy precision and recall
yeah I don't know how they
got the data whether they followed them
maybe I don't remember now but they did
that because they wanted to look at also
sort of commercial applications right
so what can a company figure out about
you they use F-measure instead of
AUC F-measure is sort of a weighted
average specifically the harmonic mean of precision and recall we're
going to get much deeper into all these
things precision recall F-measures
ultimately all of these things derive
from the confusion matrix and yeah we're
gonna do that after the break we're
going to talk about Machine Bias which
concerns pretrial risk scores and
there's this lovely FiveThirtyEight piece that has
an interactive simulation one other yeah
I think so
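For reference, the precision, recall and F-measure mentioned a moment ago all come from three confusion-matrix counts. A minimal sketch with made-up counts, using the balanced F1 form of the F-measure (the harmonic mean of precision and recall):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)  # of the yes guesses, how many were right
    recall = tp / (tp + fn)     # of the actual yeses, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts, for illustration only.
p, r, f = precision_recall_f1(tp=30, fp=10, fn=20)
print(p, r, f)
```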
but here we go here's here's what I yeah
with the Marshall Project sure but
they've got this lovely little
interactive here so this is the idea
this is the sort of setting that we are
in we have a bunch of people there's a
classifier which classifies them here
into three risk buckets and then I think
how this is set up is the high-risk
people are always denied parole the
low-risk people are always awarded it and
some number of the people who are
granted parole will reoffend and some
number of people who are denied
parole wouldn't reoffend so this is let's
call positive you are classified as
eligible for parole so that would make
this false positives and that would make
this false negatives okay so so sorry
this is post trial this is parole this
is not pretrial but a very similar
system is used to decide who is who gets
to be released on bail or to advise
judges on who gets to be released on
bail that's actually an important point
is that nobody uses these algorithms as
the final decision it's advice
that is given to the decision-makers who
may or may not follow the advice so
that's the setting and one of the
parameters you have in controlling an
algorithm like this is this you know
threshold we're talking about so if I
crank this threshold way up then a lot
fewer people are denied parole which is
going to make the false positive rate in
hmm how does this work here we go
that false negative rate is much higher
right so I'm I'm releasing fewer people
whereas if I crank this way down then
I'm saying more people are high-risk
which should reduce this rate and why is
this running backwards from what I think
it should be you know high-risk people
there we go okay so
here we go so we've got a lower false
positive rate and a higher false
negative rate right so there you can see
directly we trade these things off I
actually think this is confusing and
there should be just one slider which is
the threshold for paroled versus not
paroled because I think what happens
here is some people in the
medium-risk category randomly get awarded or denied
so it's a little bit confusing anyway
well that that's the exercise of picking
a threshold you have to decide you you
can only pick a point on that curve you
have to decide which point you want and
that has to be based on how bad are false
negatives versus how bad are false
positives
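One way to make "how bad are false negatives versus false positives" concrete is to put a cost on each kind of error and pick the threshold with the lowest total cost. This is a sketch under assumptions: the scores, the outcomes and the cost values are all made up, and yes means score at or above the threshold.

```python
# Made-up scores and outcomes (1 = yes, 0 = no).
scores = [0.05, 0.20, 0.35, 0.40, 0.55, 0.60, 0.75, 0.90]
truth = [0, 0, 1, 0, 1, 0, 1, 1]


def total_cost(threshold, cost_fp, cost_fn):
    """Sum the cost of every wrong guess at this threshold."""
    cost = 0.0
    for score, actual in zip(scores, truth):
        guess = 1 if score >= threshold else 0
        if guess == 1 and actual == 0:
            cost += cost_fp
        elif guess == 0 and actual == 1:
            cost += cost_fn
    return cost


def best_threshold(cost_fp, cost_fn):
    """Try every distinct score, plus an 'always no' threshold."""
    candidates = sorted(set(scores)) + [1.01]
    return min(candidates, key=lambda t: total_cost(t, cost_fp, cost_fn))


# If false negatives are five times worse, the threshold moves down;
# if false positives are five times worse, it moves up.
print(best_threshold(cost_fp=1, cost_fn=5))
print(best_threshold(cost_fp=5, cost_fn=1))
```

The classifier is the same in both calls; only the judgment about which error is worse changes, and that judgment is what moves the threshold.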
so in this case how bad is it to release
someone who then reoffends versus
keeping someone in jail who wouldn't
have of course you don't get to observe
this if you're actually using the system
you know the people you keep in
jail don't get to reoffend this is
how these systems are built is you
calibrate it by making predictions on a
set of people all of whom were released and
then rather than you know denied parole
but wouldn't have reoffended what
you're looking at is we would have
denied them parole if we were using the
system so in development you actually
get to see these four numbers and in in
production you don't get to see that
number so that's the setting we're in
and let's see here
the Machine Bias piece is a journalistic
investigation into the COMPAS risk
assessment tool COMPAS makes risk
assessments both pretrial and post-trial
we're gonna talk mostly about pretrial
that means do you get to be free on bail
while waiting for your trial or not
they were able to obtain a copy of the
actual questionnaire that produces the
data this thing has 137 questions I've
just shown you one here it has things
like back here let's let's actually open
this up and look at it there's there's
lots of fun stuff here
so this is up on on document cloud so
you can see the things here that they've
helpfully annotated so they're asking
about things like you know have you
moved a lot or have you been suspended
from school how often you feel bored as
well as you know as they call it
criminal thinking yeah as well as things
like the current charge and so forth
notice that race is not on here now of
course we know the race of the person we
arrested but the risk score does not use
race or gender or age right no protected
classes here however as we saw you can
probably make pretty good guesses as to
the protected variables from this much
data so this is the input there's one
hundred and thirty seven data points and
the core of ProPublica's of course there's a
lot more that went into this study but
the core of the story is this table here
which I've reproduced from the
methodology and what we are looking at
is the confusion matrix for everybody
for black defendants and for white
defendants so does everyone know what I
mean by confusion matrix okay so what
the COMPAS algorithm actually produces
is two things one is a numerical score
from one to ten which you can think of
as a probability of re-arrest another is
just based on thresholding it into low
medium or high risk and I think how this
works is in ProPublica's analysis the
medium and high risks are
merged into one category high-risk and
so these are the raw numbers so that
this is the confusion matrix then and
all this stuff false positive rate false
negative rate positive predictive value
which is also known as precision
negative predictive value which is kind
of like you know positive predictive
value is of the people who we said
would be re-arrested how many were
negative predictive value is of the
people we said weren't how many weren't
and so on and then when you break it out
for black and white what you find is the
false positive rate for black defendants
is almost twice as high as the false
positive rate for white defendants
meaning that of the people who were
not re-arrested twenty three
percent of those who were white
had been predicted to be re-arrested whereas forty four
percent of those who were black had been
right so the error rate is
much higher and this is the basis for
the article this is this is the basis
for the claim that the algorithm is
biased so we're gonna do this we're
gonna step through this one one thing at
a time
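As a reference for the terms in that table, here is how each rate comes out of the four confusion-matrix counts. The counts below are made up and the group names are placeholders; they are not the numbers from the ProPublica methodology.

```python
def rates(tp, fp, tn, fn):
    """Standard rates computed from one confusion matrix."""
    return {
        # Of the people who were NOT re-arrested, how many were flagged.
        "false_positive_rate": fp / (fp + tn),
        # Of the people who WERE re-arrested, how many were not flagged.
        "false_negative_rate": fn / (fn + tp),
        # Of the people who were flagged, how many were re-arrested.
        "positive_predictive_value": tp / (tp + fp),
        # Of the people who were not flagged, how many were not re-arrested.
        "negative_predictive_value": tn / (tn + fn),
    }


# Made-up counts for two hypothetical groups.
groups = {
    "group_a": dict(tp=50, fp=40, tn=60, fn=30),
    "group_b": dict(tp=30, fp=20, tn=80, fn=40),
}

for name, counts in groups.items():
    print(name, rates(**counts))
```

Note the denominators: the false positive rate divides by everyone who was not re-arrested, while positive predictive value divides by everyone who was flagged, so the two can disagree about which group is treated worse.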
yeah so first of all this is pretrial so
there's no parole here this is whether
they're released on bail or not right so
that's important from a legal and
rights perspective because these are
people who have been charged but not
convicted okay so that that is a legal
difference between you know the the
their case hasn't been heard yet so we
don't actually know if they're guilty of
that crime yet this is just people
who've been charged with a crime
so you have two notebooks
one is a sort of an empty copy so
basically what we're gonna do is we're
gonna take the empty notebook and we're
going to fill it out and I'm going to
more or less cut and paste what's going
on here oh we need the data too so here
we go so let's let's do this so I'm
gonna run this one cell at a time by the
way
the COMPAS algorithm produces
both a violent and a non-violent crime
predictor and then the data also exists
for looking at who gets arrested for a
violent arrest versus a non-violent
arrest this turns out to be important
because there's important differences in
the accuracy of data on arrests for
violent crimes versus non violent crimes
we'll get into that next class but for
the moment we're just gonna use the non
violent data because that's mostly what
ProPublica focused on so if you run
this you should see the data and so this
is what it looks like we have so this is
public record right this is this is
mostly public record it's just arrest
records this is in Broward County
Florida over two years so how
this story happened is they FOIA'd the
risk scores so they got the risk scores
that were assigned to all of these people
and then the risk score is supposed to
give a probability of re-arrest within
two years so then they just waited two
years and collected the arrest records
and you can see that here so you can see
a bunch of personal information and then
v_decile_score is for the violent
predictor the screening blah blah blah
and then two_year_recid this is whether
they were re-arrested in two years so basically
what we're gonna look at is decile_score
versus two_year_recid so decile_score
is the predicted risk and two_year_recid
is the ground truth so you can see that
here right so and there's a bunch of
other things right like what they were
charged with and so forth and then we
throw out a bunch of data I just follow
ProPublica's analysis here let's not
get too much into it but this
is our output right we have 6,000 rows
here so 6,000 people who were given a
risk score and then either did or did
not reoffend so then the first analysis
we're gonna look at is just single
columns so here's where we start typing
right cv I don't know why we called it
cv COMPAS I can't remember
COMPAS scores hmm anyway COMPAS values
anyway cv so we just look at the way
that age is categorized you can see the
majority are in the young category and
then we can get a similar thing I think
this variable is just called race yeah
there you go there's the population it's
majority black and then we can look at
well what scores were they assigned so
this is decile score value counts there
you go and then they also get this like
texts assignment low medium high so that
is score text value counts and what this
is is there's actually only one
classifier which produces the decile
score and then the low medium high are
generated by thresholding at some
value so I'm not sure exactly where they
put the thresholds but that's the idea
so I think most of the ProPublica analysis
is based on the low medium high scores
but it's actually the same information
just in a slightly lower resolution form
all right has everybody gotten here okay
I know I get to cut and paste so
that's a lot easier for me so just tell me
if I'm going too fast
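The single-column counts above look roughly like this in pandas. This is a sketch on a handful of made-up rows: the data frame name `cv` follows the lecture, the column names are assumed from the names spoken in class, and the age category labels are hypothetical.

```python
import pandas as pd

# Made-up rows standing in for the real data set.
cv = pd.DataFrame({
    "age_cat": ["young", "young", "middle", "young", "older"],
    "race": ["African-American", "Caucasian", "African-American",
             "African-American", "Caucasian"],
    "decile_score": [7, 2, 5, 9, 1],
    "score_text": ["Medium", "Low", "Medium", "High", "Low"],
})

# How many rows fall into each category of a single column.
print(cv["age_cat"].value_counts())
print(cv["race"].value_counts())
print(cv["decile_score"].value_counts())
print(cv["score_text"].value_counts())
```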
so now what we can do is we can look at
the scores
oh my this is fun as a histogram so
we're gonna say for race equals
Caucasian the decile column the
reason I use a variable here is because
that allows me to switch between the
violent and nonviolent versions right
so I set these columns it just changes
the column name you can just make that
dot decile col or dot decile score if
you want and then I plot as a histogram
and I get this okay I can do the exact
same thing with I think the text is
African-American yeah with that I can
do the exact same thing here and now I
have the decile scores for
African-American defendants okay so what
do you notice
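A sketch of the per-race score distribution, with the column name held in a variable so the violent and nonviolent versions can be swapped. The rows are made up, and the column names and the `violent` flag are assumptions based on what is said in the lecture.

```python
import pandas as pd

# Made-up rows for illustration.
cv = pd.DataFrame({
    "race": ["Caucasian", "Caucasian", "Caucasian",
             "African-American", "African-American", "African-American"],
    "decile_score": [1, 2, 1, 4, 8, 8],
    "v_decile_score": [1, 1, 2, 3, 6, 7],
})

# Switch between the violent and nonviolent predictor by changing one flag.
violent = False
decile_col = "v_decile_score" if violent else "decile_score"


def score_histogram(race):
    """Count how many defendants of one race received each decile score."""
    return cv[cv["race"] == race][decile_col].value_counts().sort_index()


print(score_histogram("Caucasian"))
print(score_histogram("African-American"))
```

In a notebook the same selection can be drawn as a chart by calling `.plot(kind="hist")` on the selected column, which needs matplotlib installed.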
let's start analyzing this yeah so all
right so African-American defendants
are more often assigned higher scores
but it's hard to say what that means
just yet it's you know it's certainly a
pattern that would be consistent with
with racial bias but we we're looking at
the output of the process we have to
look at think of it the inputs basically
or more precisely we're looking at the
predictions we have to look at the the
facts as well because in this case we
get to know what actually happens
because they waited two years so let's
look at the outputs now or the or the
ground truth so for this let's say so
how did this work what is that column
called two_year_recid yeah okay
here we go
cv two_year_recid value counts so there
we go
so you know a little less than half were
re-arrested and then let's do a crosstab by
race and I've just I've done a little
magic here to just get rates as well so
if I just do a crosstab on these two
variables I get this I get oh okay but I
want it as rates as well because I have
different numbers of defendants so I add
this other line which is I add a column
called rate where I just divide the one
column by the sum across that axis and I
get this table so when you look at this
table what do you see
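The crosstab with an added rate column can be sketched like this. The rows are made up and the column names are assumed from the lecture; the real table has more races and far more rows.

```python
import pandas as pd

# Made-up rows for illustration.
cv = pd.DataFrame({
    "race": ["African-American"] * 4 + ["Caucasian"] * 4,
    "two_year_recid": [1, 1, 0, 1, 0, 1, 0, 0],
})

# Counts of not re-arrested (0) and re-arrested (1) for each race.
table = pd.crosstab(cv["race"], cv["two_year_recid"])

# Divide the re-arrested column by the row total to get a rate,
# so groups of different sizes can be compared.
table["rate"] = table[1] / table.sum(axis=1)

print(table)
```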
true so some races
have a much smaller number of people so
we can expect the rates to be very noisy
for those races what else do you see
African-Americans reoffended more than
any other race
yeah it's okay so let's let's get clear
about what this value is
this is re-arrest okay so re-arrest means a
lot of things it doesn't mean a
conviction for example because we are
pretrial here but they're definitely
re-arrested so we have to ask what
re-arrest means and we will do that
extensively in the next class we're
gonna we're gonna get deep into the data
here and I think I've set violent equals
false yeah violent equals false so we're
looking at re-arrest for nonviolent crimes
which tend to be let's say more prone to
bias than violent crime but the key
thing here is that these rates which I'm
going to call the base rate differ
between race and this is going to be the
fundamental challenge to definitions of
fairness okay if if the base rate was
the same between every race here then it
would be very easy to build a classifier
which was unbiased by whatever metric
you want because the base rates are
different we're going to be forced into
making some difficult choices these are
all arrests of people who they had risk
scores for all right so this is how the
reporting process worked for this is
they filed a bunch of FOIA requests
Broward County came through and said all
right you can get the risk scores for
these people then they waited two years
and then they compared the risk scores
that they'd obtained through FOIA to the
public records so you're following up a
particular population of people who were
originally arrested in a particular
interval for which they could get risk
scores this was one County in Florida
for I think a few months of arrest data
and then they and then two years after
that to see if they were re-arrested okay so
then we can do
this this same this same type of
analysis for I guess I could I thought I
could switch tabs by going control left
and right oh well the same type of
analysis for sex so exactly the same
idea one of the reasons I'm including
sex in here well to start with it's a
protected class but here you have kind
of the opposite pattern which is that
the traditionally marginalized group has
a lower recidivism rate so in fact we're
going to find that if we pick a
particular definition of bias
say
it's biased for the oppressed racial
group it's usually going to be biased against
the oppressed gender group and this is
interesting because when you start to
ask intersectional questions you will
find that there is there are trade-offs
between the different categories in very
complicated ways if there is a theme to
all of this it's just that you you can't
have everything you have to have
trade-offs and those trade-offs are
uncomfortable
alright so that's what it looks like you
will be shocked to hear that men are
re-arrested more often so then we're
gonna do the same thing actually we're
gonna we're gonna skip that let's so now
that's those are the those are the
predicted risk scores and those are the
actual re-arrest rates now we're going to
compare them on the same chart so how we
do that is
like this so we group people by decile
and then we take the average of the
recidivism so that gives us a recidivism
rate and when we run that we get this
thing so this is the rate of
re-arrest for people who were assigned each risk
score so what do you see consistency
what does that mean right
so the horizontal axis is the prediction
the vertical axis is the reality and you
can see that people who were predicted
to be more likely actually were alright
there's this general upward trend now
this is called calibration and it's
called calibration because if this plot
of prediction versus reality is
monotonic that is to say non decreasing
there is some transformation to turn
this into a probability so the risk
scores are not actually probabilities
but they can be turned into
probabilities okay so as long as you
have a monotonic plot we say that the
prediction is calibrated meaning that we
can turn it into a probability so this
is calibration is a very important
property of prediction systems and it is
a property that this predictor has so so
far so good for the predictor but what
we want to know is how does this break
down between black and white again and
so I have to go through a little bit of
work to get there so what we're doing is
we're generating these two the values
for these two plots for B and W and then
I don't know I had to glue them together
into a data frame called
a to get it to plot the way I want it
which is like this so talk to me about
this plot what does this plot tell us
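The two calibration plots come from the same grouping step: average the outcome within each decile, first for everybody and then split by race. A sketch on made-up rows, with column names assumed from the lecture and only three decile values for brevity:

```python
import pandas as pd

# Made-up rows for illustration.
cv = pd.DataFrame({
    "race": ["African-American", "African-American", "Caucasian", "Caucasian"] * 3,
    "decile_score": [2, 2, 2, 2, 5, 5, 5, 5, 8, 8, 8, 8],
    "two_year_recid": [0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0],
})

# Re-arrest rate for everyone assigned each score.
overall = cv.groupby("decile_score")["two_year_recid"].mean()

# The same rate broken out by race, one column per group.
by_race = (
    cv.groupby(["decile_score", "race"])["two_year_recid"]
    .mean()
    .unstack("race")
)

print(overall)
print(by_race)

# The check used in the lecture: does the rate never decrease as the score rises?
print(overall.is_monotonic_increasing)
```

With so few rows per cell the per-race rates here are very noisy, which is the same caution raised about the small groups in the real data.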
we hated for whites and I was blessed I
was less of the trend yeah okay yeah so
right so there's noise here so so
generally it so I think there's sort of
a few big takeaways from this plot right
so generally they both increase
monotonically alright so there there's
calibration not only for everybody
together but for the individual groups
there's some noise here if we really
wanted to we could do estimates of you
know is this statistical noise or is it
actually a real problem with the
predictor but generally it's alright the
other thing is the height of the bars is
more or less the same at every decile
score in other words a score of six
means the same thing regardless of
whether you're black or white this is a
concept called positive predictive value
another name for this is precision it's
the same calculation it's just that one we
think of in information retrieval and one
we think of in prediction what it is is if
I give you a score of six how often do
you actually reoffend and so that
positive predictive value is relatively
balanced between races all right so if I
say you have a certain
probability of reoffense that means the
same thing regardless of your race
except for this one right so if I say
you have a very high probability of
reoffense you're actually slightly less
likely to reoffend if you're white so by
the way whose advantage would that
be so in this pattern
where it's actually lower for
white defendants whose advantage
is that other way around yeah because if
the actual re-arrest rate is lower
than I say it is then that is a false
positive and that's bad for that group
right so the fact that
we predicted it to be high and it actually was
that's good for the group that has that
and for someone to be re-arrested less
often than they're predicted that's bad
for them because it means those people
are going to end up getting granted bail
less often right right so it's about
calibrated if anything and again
by this measure of fairness this plot
would would favor African Americans okay
slightly I think I think honestly this
is just noise and then if we tried it
with a different sample we would we
wouldn't see this that would be my guess
so that's as far okay
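A minimal sketch of the per-group calibration check described above, assuming a pandas data frame with one row per defendant. The column names (`race`, `decile_score`, `rearrested`) and the toy values are hypothetical and are not the class notebook's data; the idea is only that the observed re-arrest rate at each score should rise with the score and look similar across groups.

```python
import pandas as pd

# Hypothetical toy data, one row per defendant (not the real COMPAS data).
df = pd.DataFrame({
    "race":         ["black", "black", "black", "black",
                     "white", "white", "white", "white"],
    "decile_score": [2, 2, 8, 8, 2, 2, 8, 8],
    "rearrested":   [0, 1, 1, 1, 0, 0, 1, 0],
})

def calibration_table(data, score_col, outcome_col, group_col):
    """Observed re-arrest rate for each score value, one column per group."""
    rates = data.groupby([score_col, group_col])[outcome_col].mean()
    return rates.unstack(group_col)

table = calibration_table(df, "decile_score", "rearrested", "race")
print(table)
```

Plotting `table` as grouped bars would give a chart of the same shape as the one discussed here, with one bar per group at each score.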
the next thing we're gonna do is produce
the actual confusion matrix so we're
going here from decile scores or
probabilities to just yes/no which means
we have to apply a threshold somewhere
and what we do is we use the low medium
high risk variable and then we combine
the medium and high
so basically we're splitting at the
threshold between low and medium and I
don't actually know what that is in
terms of decile score but here it is how
do we do this okay we generate a
confusion matrix like this so first
we're saying okay these are all the people
where we guessed that they would be
re-arrested here are all the people who were
re-arrested
when we do a crosstab we get this
confusion matrix so here it is so
confusion matrices the entries on the
diagonal of the confusion matrix are
correct guesses the entries off the
diagonal are incorrect guesses so you can
see right away that this has a fair
proportion of incorrect guesses there
are many many statistics that we can
compute from a confusion matrix the
simplest one is called accuracy and oh
it says right here the fraction of the
guesses that were correct so what is it
what is the accuracy here how do we
calculate the accuracy from this
confusion matrix
yeah right so it's this Plus this
divided by the total number of cases
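The crosstab and accuracy steps can be sketched roughly as follows. This assumes a hypothetical `risk_level` column holding Low / Medium / High and a 0/1 `rearrested` column; the names and the toy rows are illustrative only and are not taken from the class notebook.

```python
import pandas as pd

# Hypothetical toy data; column names and values are assumptions.
df = pd.DataFrame({
    "risk_level": ["Low", "Medium", "High", "Low", "High", "Low", "Medium", "Low"],
    "rearrested": [0, 1, 1, 0, 0, 1, 0, 0],
})

# Split between low and medium: medium and high both count as a guess of re-arrest.
df["guessed"] = df["risk_level"].isin(["Medium", "High"]).astype(int)

# Rows are what actually happened, columns are what we guessed.
cm = pd.crosstab(df["rearrested"], df["guessed"],
                 rownames=["actual"], colnames=["guessed"])

# Correct guesses sit on the diagonal, so accuracy is the trace over the total.
counts = cm.to_numpy()
accuracy = counts.trace() / counts.sum()

print(cm)
print("accuracy:", accuracy)
```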
positive predictive value is what we were
just looking at so of the people we
guessed would recidivate which is this
column these are the people we guessed
would recidivate how many did so it
would be this divided by the sum of this
column right and then we have false
negative and false positive rate which
is so false positive rate is of the
people who didn't recidivate so this row
how many did we guess would so
it's this divided by the sum of this row of
the people who didn't recidivate how
many did we guess would
right so okay so actual recid false so
that's this row and how many people did
we guess okay so this okay so it's
this divided by the sum of this row so
note that positive predictive value uses
only this column false positive rate
uses only this row they're actually
looking at different things the
situation here is very analogous to the
calculation of conditional probability
so probability of A given B is actually
a completely different number than
probability of B given A we'll talk
about that in a later class in more
detail but the point is that this
confusion matrix is actually where all
the information is all of these one
number summaries use different pieces of
that and then this is all relative to a
threshold right if we change the
threshold for whether we guess true or
false then these numbers are all going
to change and changing that threshold we
can adjust the rate of false positives
to true positives but only to a certain
point right we can never get a hundred
percent accuracy because the classifier
that's just not available to us and this
link has this is a great page which
summarizes quantitative definitions of
fairness
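The column-versus-row distinction above can be written as two conditional probabilities. This is the standard notation, not something shown on the slide: TP, FP and TN are the true positive, false positive and true negative cells of the confusion matrix.

```latex
\[
\mathrm{PPV} = P(\text{re-arrested} \mid \text{guessed re-arrest}) = \frac{TP}{TP + FP},
\qquad
\mathrm{FPR} = P(\text{guessed re-arrest} \mid \text{not re-arrested}) = \frac{FP}{FP + TN}
\]
```

The two share only the FP cell. PPV conditions on the guess (a column), while the false positive rate conditions on the outcome (a row).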
oh cool
yeah that would be cool okay so here's a
nice little diagram and it shows you how
all of these things relate so these four
entries are the four entries of the
confusion matrix and then all of these
things you can calculate so for example
positive predictive value is something
it doesn't show you what it uses to
calculate it I think that Wikipedia one
does though no it just has this one
anyway you'll get really familiar with
these definitions if you work with this
for a while these are the core of the
definitions of fairness and these are
all calculations you do from a confusion
matrix and there's this great here we go
here's the diagram I wanted so if you
click the L it just goes to that link
damn it somewhere there's an interactive
one where if you hover it shows you
which it uses to calculate but anyway
there's lots and lots of these different
things and they're all based on values
from the confusion matrix you can study
that on your own time what we're gonna
do here is just calculate them and then
we're going to compute them so let's see
the positive predictive value according
to Wikipedia is true positive / true
positive plus false positive so let's do
it
true positive / true positive plus false
positive what do we get
63% and then the false positive rate is
false positives / negatives so that's
false positive over false positive +
true negative so the denominator here is
just the number of people who were not
re-arrested
FP / FP + TN FP / FP + TN yeah there you
go so we're just doing these these are
all calculations and these make it a
little little easier right so we can
just see FP / N I guess I gotta run this
one okay now if you want a zero false
positive rate you can always get zero
false positives by just guessing that
nobody will reoffend trading off against the
false negative rate which is FN / P I
think yeah FN / P so for this particular
threshold setting so that remember
there's an implicit threshold here that
sorted people into low versus medium
versus high for that threshold setting
this is the false negative rate and we
can always have more false negatives or
false positives by moving that around
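A small sketch of how these rates move with the threshold, in plain Python. The scores and outcomes are made up for illustration, and `rates` is a hypothetical helper, not a function from the class notebook.

```python
def rates(scores, outcomes, threshold):
    """PPV, FPR and FNR when a score >= threshold counts as a guess of re-arrest."""
    tp = fp = tn = fn = 0
    for score, rearrested in zip(scores, outcomes):
        guessed = score >= threshold
        if guessed and rearrested:
            tp += 1
        elif guessed and not rearrested:
            fp += 1
        elif not guessed and rearrested:
            fn += 1
        else:
            tn += 1
    nan = float("nan")
    return {
        "ppv": tp / (tp + fp) if tp + fp else nan,  # of those guessed, how many were re-arrested
        "fpr": fp / (fp + tn) if fp + tn else nan,  # FP / N
        "fnr": fn / (fn + tp) if fn + tp else nan,  # FN / P
    }

# Hypothetical decile scores and outcomes (1 = re-arrested).
scores   = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
outcomes = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]

for t in (1, 5, 11):
    print(t, rates(scores, outcomes, t))
```

A threshold of 11 guesses that nobody will be re-arrested, which drives the false positive rate to zero and the false negative rate to one. A threshold of 1 does the opposite.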
okay so this you have let's just run it
and then what I do is we finally get to
the analysis here which is where am I
going there we go we are now going to
compare the confusion matrices and these
various metrics for black versus white
so here we go I just grabbed the set of
them that are black versus white and I
print the metrics on what I guessed for
them and what actually happened and this
is what I get so we have now
replicated that ProPublica methodology
table so here we go
here is a replication of the ProPublica
story so what do you see let's talk
about this
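The black versus white comparison can be sketched as below. The data is a hypothetical toy set, deliberately built so that positive predictive value matches across groups while the false positive rate does not. It is not the COMPAS data and does not reproduce the numbers on screen, and the column names are assumptions.

```python
import pandas as pd

# Hypothetical toy data: 1 means guessed / actually re-arrested.
df = pd.DataFrame({
    "race":       ["black"] * 6 + ["white"] * 6,
    "guessed":    [1, 1, 1, 1, 0, 0,   1, 1, 0, 0, 0, 0],
    "rearrested": [1, 1, 0, 0, 1, 0,   1, 0, 1, 0, 0, 0],
})

def group_metrics(g):
    """Confusion-matrix metrics for one group of defendants."""
    tp = ((g["guessed"] == 1) & (g["rearrested"] == 1)).sum()
    fp = ((g["guessed"] == 1) & (g["rearrested"] == 0)).sum()
    fn = ((g["guessed"] == 0) & (g["rearrested"] == 1)).sum()
    tn = ((g["guessed"] == 0) & (g["rearrested"] == 0)).sum()
    return pd.Series({
        "ppv": tp / (tp + fp),
        "fpr": fp / (fp + tn),
        "fnr": fn / (fn + tp),
    })

# One row per group, one column per metric.
by_race = pd.DataFrame(
    {race: group_metrics(g) for race, g in df.groupby("race")}
).T
print(by_race)
```

In this toy table both groups have the same PPV, yet the false positive rate is much higher for one group, which is the shape of the disagreement discussed next.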
everybody got there it's actually not
that hard to replicate the core of the
story it's not a very complicated
calculation you're just looking at
confusion matrices so first of all what
which of these values are the the
central claim of bias in ProPublica's
story okay so here we go that versus
that right okay
what else can we see yeah so yes what
about the positive predictive value five
nine versus six five
yeah so let's let's draw this out so
this this is an extremely important
distinction which is confusing and this
is part of this sort of challenge of
defining bias so positive
predictive value is percent let's put it
this way
we guessed re-arrest and okay and they
were actually re-arrested okay so we want this
to be as close to one as we can get it
right whereas false positive rate starts
the other way around
it says they were not re-arrested but we
guessed they would be okay so the
denominator here is different the
denominator here is every one we guessed
positive whereas the denominator here is
everyone who was actually not re-arrested so
remember one is a row in the
confusion matrix one is a column in the
confusion matrix so they are related
because they both have one element in
common which is we guessed they would be
re-arrested and they were not
but there you can think of it as as
guessing in a different direction right
and it takes a little while to wrap your
head around it's it's sort of like this
is also sometimes called conditional
use accuracy because this is the value
this is something from the point of view
of trying to decide whether to release
someone on bail right so we don't know
if they're actually going to be
re-arrested when we have to release them
on bail
we only know whether we're guessing them
so it's kind of like from the point of
view of the information that we have at
the time we have to make the decision
how good can we do whereas this is sort
of the other way around which is from
the point of view of what actually
happens eventually how good was the
guess so they actually measure different
things and part of the disconnect here
is that the positive predictive
value is relatively balanced if anything
it looks better for black defendants
whereas the false positive rate is not
balanced okay and in fact this is what
happened when the story was published
Northpointe who makes COMPAS came back
and said what are you talking about
this is a calibrated algorithm
right they came back basically with this
chart and said hey it's fine
so part of what was going on was was
that they were using a different
definition of fairness when they
developed it and you can say many many
other things about this this issue most
of which will have to say next time but
this is the the core of the analysis so
I think we're gonna have to leave this
there for now and talk about your final
projects so let's do that
did you find our project suggestions so
remember these are just ideas you can do
you can do anything else you want I as
I've said a number of times I'm
fascinated by automated trading I think
it is drastically under explored in the
algorithmic accountability literature
in part because it's harder to do and
in part because it's it's deeper it's
more tied into the structure of society
it's it's the roots of capitalism so
here's one thing you could look at
here's when the AP's
Twitter account was hacked and
there was this fake tweet that got sent
out it briefly crashed the stock market
there is a paper which uses the extent
and the timing of this crash to estimate
the speed and scope of automated trading
in the markets now this is a few years
ago this was 2013 I can't remember now
13 14 yeah 13 so as of five years ago we
have an estimate but I bet there are
ways to estimate the current scope of
systems certainly we have numbers like
you know what fraction of trades are
made by computer now you have to be
careful interpreting these numbers
because there's lots of different
markets and made by a computer can mean
different things and so forth but um it
you know they would make a very
interesting story to try to dig into
that
related what are the values encoded in
automated trading algorithms and one way
you could proceed on this story is just
take some of the standard automated
algorithms and there are libraries of
these things so Quantopian is an
automated trading development platform
you could go to Quantopian's library
look at each of the algorithms that they
have just just basically use it as a
list of algorithms and just say some
things about what would an economy where
many people are trading with this
algorithm look like what types of
businesses what types of economic
activities are encouraged by this
algorithm and what types are discouraged
so here's mean reversion and what mean
reversion would do for example is it
would buy stock of something that
suddenly fell and sell stock of
something that suddenly grew so that has
weird effects right it would
punish fast growers and try to reward
slow growers which maybe has actually a
kind of inequality reducing effect or
maybe it punishes people who do well and
then you've got the more basic question
of what algorithms are people actually
using it's very hard to find that this
is part of why everybody's looking at
government right now is because it's
easier because you can FOIA it all right
you can't FOIA a private company so you
have to look for regulatory filings you
have to look for conference
presentations you have to talk to people
who used to work there but there's
actually court cases there's actually
quite a lot of disclosure you just have
to look in in weird places for it tools
you can do a tool as your final project
I'm involved in two major journalism
platforms one of them is Workbench
which Aaron is involved in so you can
ask him about that but right
complicated but there's also this
document mining tool Overview which
we've we've talked a lot about the
construction of it could use some love
so the disadvantage of this is that I'm
basically asking you to write write
software for other people the advantage
of that is those people are your fellow
journalists I can't I don't have any
developers working on Overview right now
so I can't do requests like better
entity recognition so if you wanted to
figure out how to do it and then
implement it journalists throughout the
industry would be able to use it the
most ambitious thing you might do is is
build machine learning into Overview
right now you have to export out to a
notebook to do the machine learning and
export back in but we've got like all of
the pieces we've got all of the tagging
and UI we just don't have the actual
core machine learning yeah yeah so in
your assignment through CourseWorks
there's two links to previous projects
yeah of course works for sports too we
were just talking about bail reform why
do people bother with algorithmic risk
assessment you're gonna see this slide
next class as well it's because there is
good evidence to suggest that machines
can guess a lot better than human
judges can who is going to be re-arrested and
if you can do that then you can keep
more people out of jail while
simultaneously reducing crime and
reducing racial bias there are only a
few I think there are three cases where
you can directly compare human and
machine guesses I haven't seen a good
review comparing them most of the
discussion around risk prediction has
been sort of like omg biased algorithms
I haven't seen good articles that are
like ok but less biased than humans
which I'm might yeah and then we talked
about oh so this we're gonna talk about
this a little more next class but you
can experiment on platform algorithms
right so this was an experiment where he
published poems haiku with and without
the colored background and estimated the
difference in engagement and basically
they got twice as much engagement and so
you can just do this with a spreadsheet
right this is um this is a great little
article and if you start thinking about
it in this way like if I changed my
interaction so what you have to do is
you have to change your interaction with
the platform and then figure out
something to measure and you could use
your account or you could use new
accounts there are lots and lots of
different experiments you could run on
these algorithms
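A rough sketch of the with-and-without comparison, in plain Python rather than a spreadsheet. The engagement counts are invented for illustration and are not the numbers from the haiku experiment. The permutation test is an added suggestion for checking whether a difference could be chance, not something the original article is known to have done.

```python
import random

def average(values):
    return sum(values) / len(values)

def permutation_p_value(treated, control, trials=10_000, seed=0):
    """Share of random relabelings with a mean difference at least as large as observed."""
    rng = random.Random(seed)
    observed = average(treated) - average(control)
    pooled = list(treated) + list(control)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        diff = average(pooled[:len(treated)]) - average(pooled[len(treated):])
        if diff >= observed:
            hits += 1
    return hits / trials

# Hypothetical engagement per post (for example likes); made-up numbers.
with_background    = [12, 9, 15, 11, 13]
without_background = [6, 5, 7, 4, 8]

ratio = average(with_background) / average(without_background)
p_value = permutation_p_value(with_background, without_background)
print(f"engagement ratio: {ratio:.1f}x")
print(f"one-sided permutation p-value: {p_value}")
```

With only a handful of posts per condition the estimate is noisy, so more posts and a consistent posting schedule would make the comparison more convincing.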
accountability that is to say an
analysis of how algorithms have an
effect in society I originally had a
statistics lecture scheduled I decided
to postpone that a little bit because
next week your final project proposals
are due and so I wanted to talk about
some of the things that you may be
interested in including in your final
projects this is part 1 of 2 of
algorithmic accountability we're going
to spend two weeks on this the
centerpiece of today's class is going to
be a deep unpacking of a famous piece of
algorithmic accountability journalism
possibly the most famous piece which is
proposed machine bias where they
analyzed the compass recidivism
prediction algorithms so this is an
algorithm used to decide who gets bail
is a pre-trial risk scoring algorithm
right so this is literally an algorithm
where individuals freedom hangs in the
balance of course it's natural that it
should come under a lot of scrutiny and
we're going to talk about the analysis
they did and what they concluded and the
honestly the can of worms that that
opened up because they it was the
beginning of some very complicated
questions first I want to talk about
some previous work in algorithmic
accountability some other things that
have been done and even before that the
first thing we have to do is I guess
scope this and the way I'd like to do
that is to list out all of the places
where algorithms have an important
effect in society so
let's do that so where our algorithms
used in society inconsequential ways
obviously algorithms are used in society
for all kinds of stuff right you know
music recommendations and checkout
machines and so forth but they're also
used to make decisions that can be quite
consequential so yeah real estate tell
me about that
okay yep so classic stuff tied into any
of you know the phrase of redlining
yeah okay so that was the policy of
steering minority homebuyers away from
certain neighborhoods which is of course
illegal became illegal in the 1960s but
still kind of happens in certain ways
and we'll look at analyses of that okay
what else
credit scores yep it's a big one so
that's also related to lending I will
actually do look in much more detail at
the effects of Grimek default prediction
on lending in the next class it's gonna
be another exercise we do hiring yep
yeah yeah well we're gonna yeah so we're
gonna we're gonna look at this question
of fairness in great detail that's
that's kind of what we're doing here
okay what else
welfare distribution
what kind of algorithms would we have in
welfare distribution okay okay
do you know of algorithms being our
models being used for that oh yeah right
yep yeah so that books gonna come up a
bunch of this work is gonna come up okay
what else
yep there's all kinds of places in
health care so diagnostic models that's
a big one right or prognosis you know
who gets this very expensive procedure
it's nice to imagine that anybody could
possibly benefit from the medical
procedure gets that procedure but that's
not how health care works
right you have a even if it's a large
number you have a finite amount of
resources and so you have to make
allocation decisions which means that
some people aren't going to get a very
expensive test which has a very low
probability of finding a problem so you
have to make these decisions yeah
medical ethics it's a fascinating field
what else
insurance yeah insurance of all types
interestingly insurance is probably the
first method where algorithmic
techniques were used and we're going to
talk about what I mean by algorithmic
techniques right that's a we can also
just say statistical models or actuarial
models which is how insurances rates are
calculated I mean large parts of
probability and statistics were invented
to make insurance work out hmm
policing yep so I'm going to start over
here
how can algorithmic or statistical
method to be used in policing yeah okay
so yeah so face recognition
you said predictive policing so where
our crime is going to be committed or
who is going to commit them yeah that's
normally part of predictive policing and
in general we can add this category of
Criminal Justice
so risk scores who's likely to be
re-arrested or commit a crime and that
can be pretrial or post trial sentencing
there's a bunch of different places
where predictive models so you have you
know should we release this person when
bail that's pretrial what sentence
should they get
they released on parole stuff like this
what else there's a at least one
remaining huge category yeah that's a
good one
all right so student testing teacher
evaluation hiring and firing within
education all kinds of places
Aereo yep advertising and marketing so
optimization of ad purchases micro
targeting and as long as we're going to
talk about micro targeting let's talk
about politics so trying to persuade
voters sending different messages to
different people that's micro targeting
and personalized advertising predictive
models like election prediction now for
you and I election prediction is is sort
of a journalism function right like
we're going to inform people who's
leading in the polls ultimately but if
you're look at it from the point of view
of a campaign election prediction
changes the strategy that candidates use
they changed where they campaign where
money is spent what issues they and
positions they take because of course
they will try to take the issues that
will get them the support they need to
win so those predictive models can
indirectly have influence on the policy
positions of our leaders kind of crazy
huh there's a really still a big area
that we don't have on this list
let's say social networks yeah that's a
good one isn't it
so we've been talking about this in your
filter design assignment that's a big
field will coups let's sort of leave it
at that for the moment
I want to add at least one more finance
algorithmic trading models for
investment this is a heavily algorithmic
field possibly more so than any of these
others right they mean when we say a
quant right you all know that phrase
like we're talking about that that word
came from Finance the allocation of
money in society is heavily dependent on
statistical models and algorithmic
methods yeah
what are the categories where where the
general public or even when people were
more specialized like us or not paying
attention like everyone talks about
algorithmic well mm-hmm yeah so hey
let's circle the popular ones how about
that and then we can see ok so
everyone's talking about policing right
now we're gonna talk about it too and
criminal justice social networks
politics yeah it's just a fair amount to
talk about that so hey that leaves
everything else credit score's people
talk about that that so I am so okay
so I would say finance that's my answer
I think there is a catastrophic lack of
accountability for the use of
statistical methods in investment
decisions and it's not you know in some
sense it's not really a problem with the
use of statistical methods it's it's the
problem of deciding what to invest in
using only the metric of how much money
will it make me right this is not this
is not a new problem we just have a new
form of this problem yeah yep
people that are building models like to
reverse-engineer the models that are
being used
mmm
sort of yeah we'll talk about Finance a
bunch more in part because I think it's
really fascinating and important oh
another thing under releasing is DNA
testing I was talking to someone last
week who is on this she got a magic
grant to compare the outputs of various
DNA matching algorithms from different
vendors and she's having the same
problem that you just mentioned which is
that these algorithms are trade secret
right the companies that build them
don't publish them and yet they send
people to jail here so I'm a circle
finance is something we can talk about
more and we'll talk about it a little
later today when we discussed the final
projects the we I went through that deck
and really briefly last class but I'd
like to actually have a discussion about
it today okay so lots and lots of places
where algorithms have influential
effects this was a list that I made up
earlier let's see how it pairs oh price
discrimination and terrorism yeah
so that and search I'll yeah so this is
a big list and you can add more things
to it
price discrimination is the practice of
charging different people different
amounts of money we'll see examples
where that is happening that can very
often turn into class discrimination
generally people who have less money
have poorer negotiating power which
means ironically you can charge them
more terrorism prediction this was a
sort of exploded into public
consciousness after September 11 there
was briefly something that was supposed
to be called the the Total Information
Awareness program which was very sort of
big brother-ish most of the things that
they wanted to do have since been taken
up by the DHS and the intelligence
community we're gonna look at algorithms
for trying to like machine learning
algorithms to guess who is a terrorist
that type of thing
scanners at airports you know who gets
stopped and searched and this is a
uniquely challenging domain both because
it intersects basic civil liberties and
freedoms and because the number of
people who are actually a threat is so
small that anything you do is going to
be swamped by false positives so this is
a basic challenge in actually several of
these fields okay so how gorillas are
everywhere again just sort of as a
framing exercise there's there's a big
discussion happening in Silicon Valley
right now which was barely happening two
years ago it's really the the election
kick-started all of this and a lot of
people are interested in sort of the
technology ethics this is one framing
for the ethics of Technology the framing
is the the unintended harms and
technology and they've they have I
split into a bunch of zones and and so
we're gonna talk mostly about this
machine ethics and algorithmic biases
this is huge right now disinformation
we're gonna have a class on
disinformation
I think it's called truth and trust the
class we're going to do it's one of the
last few classes
I am also extremely interested in this
economic inequalities I suspect that
that is going to have far more impact on
more people's lives than just about
anything else that we have on the board
including things like criminal justice I
feel like it's very understudied and I
think what part of reasons understudies
it's very hard to get a handle on
because it's so systemic you can think
of it as algorithmic critique of
capitalism which is just getting off the
ground but anyway mostly we're gonna be
here today and in the next class oh this
is the idea that that we're designing
Facebook to make you yeah yeah we're
building products that are designed to
be addictive right which is not healthy
this is a site which is specifically
about algorithmic accountability
algorithm tips org and what this is is
[Music]
it's this list they generated of
algorithms used in government and so for
example okay so here's a few right let's
see what should we search for what type
of algorithm should we find
okay so here you go method for
calculating the number of children
directly certified for free school meals
and other social benefits yeah I mean I
don't I don't want to speculate but we
could find out I don't want to do it
right now let's just try a welfare as
opposed to a child
nope that's the only one what if we just
do child
huh
interesting
it looks like it's a evidence collection
system okay yeah how come the door and
talk to them yeah this is a scoring
system for evaluating schools anyway so
this just goes on and on and many of
these things are not algorithms per se
in fact many of them probably don't even
have code all right
they're just methods but that's that's
although we're now talking about
algorithmic accountability it's not like
statistical models are new in government
they've been used for at least a hundred
years any sort of scoring system any
sort of predictive model they've been
used for a long time in a lot of fields
it's just they're becoming more
sophisticated and more consequential I
don't know what's the line between
statistics and machine learning or
machine learning in AI it's not really
important the the the the important
characteristics here is that there's
some let's say mechanical method for
coming to a decision and we want to
evaluate that method
so that's algorithm tips how they got it
is pretty fascinating
they wrote a paper about this which is
linked from the syllabus and they
basically these are all the search terms
they used and you can see some of them
are our bold those are the I think those
are the ones that they came up with
themselves and then they used real
analysis of related phrases so what else
appears in these documents to find is
the non bold ones so there's a lot of
different names for these types of
techniques
algorithm tips is cool there is a major
limitation with algorithm tips which is
that it's only government and you can
see which government departments they
pull it from interestingly Health and
Human Services has the highest number I
find that fascinating because it's the I
suppose the most human right it's the
Department of Energy must have all kinds
of formulas for figuring out how much
electrical wiring is needed for a state
but that's not really that doesn't
appear in the search results I guess
this is about the extent of public
knowledge of the FICO score the FICO
score is the standard credit scoring
algorithm that is to say it's not we
don't know very much they don't release
publicly where that score comes from and
most of the things on this list that we
just came up with there's very little
public information about how this stuff
works
so we had the question earlier how do we
investigate these algorithms and this is
something we'll come back to a number of
times but any initial ideas strategies
for investigating non-public algorithms
encourage leakers sure so inside sources
yeah you can't for a company oh yeah
that's the issue
yep so you can you can try to replicate
the results assuming you can get the
results so that's a big deal right
can you get the output of the algorithm
sometimes you can that's kind of what
república did with the compass algorithm
it was a model to predict whether people
would be read with in two years so they
just waited and waited until everybody
was really a waited until two years went
and saw how many of those people were
arrested right so you get the ground
truth because arrest records are public
yeah I mean definitely you should talk
to the people who are affected by the
algorithm
the challenge there is that even a very
good algorithm is going to have some
very bad results because there will be
errors so how do you do balanced
reporting because you can always find
the case where something bad happened so
okay if that's true then how do you
decide whether a particular algorithm is
good or bad what does that even mean
you're looking at like trying to reverse
engineer and algorithm the team of the
best engineers have been working on yeah
[Music]
yeah although as we'll see I know I
don't think it was that resource
intensive actually it mostly involved
yeah smart talented well-trained people
like all of you go forth yeah so there's
there's issues other strategies for
investigating private algorithms include
so if you can run the algorithm then you
can test it right so you can if you want
to get into situation where you can feed
inputs into it and see what happens
and for example you can do this with
Facebook you can make fake profiles you
can try posting different things we're
gonna look at a little bit later today a
an experimental method for analyzing how
the news feed algorithm works which is
very clever but it's very very simple
and very easy to do in very clever and
organizations like The Wall Street
Journal have done things like their red
feed blue feed piece which just created
two different profiles and set up
different friend groups just to
understand how you could end up in
different information worlds there's a
lot you can do just by interacting with
the algorithm so while these algorithms
are often trade secret there are
actually methods for investigating them
and if it is an algorithm of significant
public interest you can also make the
argument that it should be public right
you can do a story so this is like kind
of like doing a story that says why
isn't this data publicly available you
can some time when you can't get the
data you sometimes still get a story
which is I can't get the data so you can
do a story which is like you know we've
talked to ten people whose lives were
badly affected by this algorithm why
can't we see it you can do that story as
well you can apply pressure for
transparency so for the next part of our
discussion I want to go through just a
bunch of previous algorithmic
accountability work and sort of just
sort of a brief survey of what's been
done this wasn't a piece of journalism
this was actually a blog post but this
was an analysis showing that New York
City's teacher evaluation scores weren't
doing what they were supposed to do and
let's get the original thing so we can
look at it and I will try to unpack this
a little bit I know that chart doesn't
make a lot of sense on its own
huh so this is going back a few years
now and what we are looking at is a thing
called a value-added score so the idea
is you basically you look at the test
scores of the students but you don't
want to look at just the raw test scores
because the student population is going
to be different at different schools the
students come in with different levels
of skills already so you instead you
look at this measurement which is how
much have they improved and this is the
idea behind the value-added test scores
and the issue that he's pointing out
here is that we really have no way to
figure out what it means but this
analysis suggests that it doesn't mean
all that much so here's what's going on
here the horizontal axis is the teachers
value-added score for and these are
released publicly for a particular
subject for one year let's say 4th grade
math and the y-axis is their score for
the same subject for the next year so
let's say 5th grade math or it could be
6th and 7th grade English or something
like that now you would expect that if
this is actually measuring teacher
quality that the teachers quality of
teaching is gonna be similar between 4th
and 5th grade math right the idea here
is that you're measuring some underlying
latent teacher quality variable the
issue is that we get only a very weak
correlation right so we would expect
that this was more or less a linear
cluster right it should be a lot flatter
the fact that it's all over the map
means there's no correlation there which
means that the you either have to assume
that in general there's no relationship
between how well you teach one grade and
the next grade or the metric isn't
working so that was the argument in this
analysis yeah I don't know we'd have to
dig into exactly how the scoring system
works so I like this example for a variety
of reasons one of which is it's a very
clever way to test the output of this
model without actually having access to
the model right we don't have to know
how this is being calculated to know
that it has not captured the thing which
it was designed to capture all right so
we look at only the output not even the
input and we can tell it's wrong here's
one of my favorites this is an old piece
by the Journal and what you're looking
at is the relative price of a
standardized basket of goods right so
how much for a ream of paper an SD card
and a stapler when buying online right so
this was Staples which is an office
supply retailer and what they did is
they effectively accessed their website
from computers in every zip code so they
used a commercial proxy service to route
the requests through each of these
places and looked at how often they got
higher prices than the baseline so what
pattern do you see here
cities have lower prices right why do
cities have lower prices right so now
we're gonna ask you know what is the
company doing
warehouses their supply
how much it cost yeah so that's a
reasonable explanation
I think shifting the burden is the right
language not everybody does that
in fact in public systems we always end
up spending more on rural areas so for
example in and this is a famous case of
price discrimination used by economists
in England for a long time there were
laws saying that you can't charge people
more to ride the train to faraway
stations right it's a it's got to be the
same cost per mile no matter how far out
you're going but of course in the rural
areas in the smaller lines there's fewer
riders so it's actually per capita more
expensive to support them and the New
York subway system is similar right it's
the same cost whether you're riding from
Far Rockaway or three stops away so
basically all transport systems
subsidize the lower population areas but
private companies do not so that's one
theory of what they're doing is
basically they're just passing the cost
of shipping although it's not clear to
me if you know DHL or whoever
they're using is actually going to cost
more to go to rural areas normally if
you look at the shipping rates it's sort
of like they divide the country into
three or four regions and everything
within that region is the same cost so
it's not clear to me that this is
reflecting their shipping cost what else
could it be
competition say more say more about that
more competition and more competitors
in large cities yeah so clearly on you
know on some level they're charging this
price because they can if nobody was
buying at that price the prices would
shift but there's an interesting
question here which is how are these
prices actually being set is there some
operations executive who's sitting there
with a spreadsheet every week and
setting the prices is that what's
happening yeah I suspect that these
prices are set algorithmically and I
think actually and this would make a
great final project that you could
actually get this result with an
extremely simple little algorithm which
is basically just put an optimization
algorithm on total sales right so you're
all familiar with optimization let's say
a one variable optimization problem
where you have some some function that
you're trying to find the maximum of
right so in this case it's sales and this
is price so you try to find the price at
which the sales are maximized and
there's many algorithms for doing this
so one of the simplest is hill climbing
you change the price let's say you add
1% to it and you see if the total sales
go up if they go up you keep adding 1%
and if it goes down then
you subtract one percent of the price
right so all you're doing is you're just
chasing the local gradient at any given
point you compute the gradient
I guess in this case just the derivative
right because it's a one variable
problem numerically right so you just
evaluate the function at two different
points corresponding to different prices
that tells you which way is up you chase
it until you get to the top so you
know which you can do in two or three
lines of Python so I suspect that this
pattern that we see is generated by
extremely simple pricing scripts and
what we get is that rural areas are
charged higher prices and because there
is a correlation between urbanization
and wealth what you get is poorer people
have to pay more so probably the
engineer who set up this price
optimization algorithm had no idea that
that would be the effect and so this is
potentially one of the issues with the
automation of these types of things is
that you can get bad effects and not
even be aware of it right it's not that
charging poor people more money for the
same thing is new that is a standard
effect of capitalism the problem now is
that you can do it without thinking or
knowing about it
potentially
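The hill-climbing idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two demand curves (`make_market` with a made-up `max_price`) are hypothetical, the 1% step is taken from the description in the lecture, and nothing here is known to be what any retailer actually runs.

```python
def hill_climb_price(sales, price, step=0.01, max_iters=10_000):
    """Move the price up or down by `step` (1%) while total sales keep rising."""
    best = sales(price)
    for _ in range(max_iters):
        for candidate in (price * (1 + step), price * (1 - step)):
            value = sales(candidate)
            if value > best:
                price, best = candidate, value
                break
        else:
            # Neither direction improved total sales: we are at a local maximum.
            break
    return price


def make_market(max_price, market_size=1000):
    """Hypothetical linear demand: units sold fall to zero at `max_price`."""
    def total_sales(price):
        units = max(0.0, market_size * (1 - price / max_price))
        return price * units  # total sales in dollars
    return total_sales


# Assumption for illustration: rural buyers have fewer alternatives,
# so they tolerate a higher price before walking away.
urban = make_market(max_price=16.0)
rural = make_market(max_price=24.0)

urban_price = hill_climb_price(urban, price=10.0)
rural_price = hill_climb_price(rural, price=10.0)
print(f"urban: {urban_price:.2f}  rural: {rural_price:.2f}")
```

The same script, started from the same price, settles on a higher price in the market with less price sensitivity, without anyone deciding that rural customers should pay more.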
this is one we looked at before this is
reverse engineering of political
micro-targeting through email this was a
ProPublica piece from the previous
election and oh there you go look I've
just got the headline Amazon scrapped an
AI system that discriminated against
women there you go in the news today
we've broken this one apart it's it's
tf-idf clustering to combine the
variants of the email interestingly the
way they did this was they crowd-sourced
the data collection so that is another
strategy for trying to do algorithmic
accountability here's another big piece
they did insurance premiums were
historically higher in minority
neighborhoods we've had for decades now
laws saying you can't do that but it
still seems to be the case that they are
higher in minority neighborhoods even
accounting for risk right so you have to
account for risk because not all
neighborhoods and not all drivers have
the same risk the risk by zip code is
available so that's public record so
that's what this x-axis here is so
average payout per car per year so in
this case that would mean basically each
driver claims an average of 200 dollars
a year what actually happens of course
is that most drivers claim nothing and
then some get into accidents so what
they're doing here is they're they're
fitting
these these smooth curves to the
patterns in the different neighborhoods
and you I mean I don't know there's a
there's a lot of variance here right you
can see that these curves don't actually
track the data that closely but you can
also see that you know generally it's
higher in minority
neighborhoods for the same level of risk
okay the vertical axis is the case okay
each dot is a zip code yeah so each dot
is a zip code yeah exactly so the so
that the y-axis is the cost of insurance
in that zip code for a particular
insurer and the x-axis is the payout in
that zip code for averaged across all
insurers well
because within a County Tennessee yeah
like those were not no I didn't say no
yeah exactly
no it's it's all over the NOI and they
actually do it for several states this
is an analysis of this chart is a little
hard to read as well so what this is
it's the correlation between the change
in price so this is an analysis of surge
pricing you're all familiar with surge
pricing you know what I mean by
that anyone not familiar with surge
pricing
okay so analysis of surge pricing the
change in the surge price versus the
change in the number of cars nearby so
what they're looking at is do the surge
price changes actually lead to more cars
and a positive correlation coefficient
so everything above here is yes so you
can see that after five minutes or so
there tends to be more cars but the
effect is different in different
neighborhoods and again the sort of
variable of interest that they were
looking at here is the racial
composition of those neighborhoods
neighborhoods in DC yeah
right
yeah or slower right so you don't
actually get better service by using the
uber API so the API will tell you both
the current price or what the surge
pricing factor is right because it goes
to like 1.5 or 2 or something and also
the number of cars in a neighborhood
yeah I don't know if you still have
access to it but but so here's a
technique as well is use the API
analyze what you can get from the API
now you know they said in their article
you know we we don't think that uber is
trying to do worse in neighborhoods with
more minority passengers right like and
this is a key concept in reporting on
discrimination is don't don't get
fixated on intent right you you don't
have to find somebody who really wants
to be an asshole a really you know some
really racist marketing exec to have a
story unintentional disparities also
matter
in relation to cars yeah it's supposed
to well I think I thought surge pricing
was determined based on not only
the number of cars available the number
of people no no you're right you know
I'm so you're right that surge pricing
is based on demand as well
I don't think demand is available
through the API and I don't think it's
in the article but any one of these so
some of these are linked from the
syllabus but basically how these slides
are set up is that if you take the thing
at the bottom and you type it into a
search engine you'll get the story
because it's always the title at the
bottom so you can look at these charts
content and in fact in this particular
case the code is available so we could
reverse engineer this whole story if you
want to look at how it's done
yeah so right well are they trying to
take uber down
yeah I see exactly so I see this story
as more of a comment on the issues with
private provision of what has often been
considered a public service which is
transportation right so as opposed to
the subway where there's a great deal of
effort to try to reduce disparities in
service quality and cost among different
populations right so let's say close to
the center and far out which is also
what we're seeing here you know a
private transportation company may not
be trying to do any of that and so it's
an unintended and perhaps unexpected
consequence of the shift from public to
private service provision
ya know just right the truth I mean yeah
so let's let's talk about this briefly
because this is actually important so
first of all the standard saying or the
standard approach in investigative
journalism is go in through the front
door first
alright so always ask be upfront and
honest about what you're trying to find
out why it's interesting what you want
to know okay so there's that scraping
there was a recent circuit
court case that decided that scraping in
violation of the terms of service
of a website is not a criminal offense
okay
so nobody's gonna go to jail for
scraping having said that it can still
be a civil offense right you can still
get sued so when in doubt which is going
to be most of the time for this type of
stuff talk to your in-house counsel for
your news organization I mean if
you're doing this freelance then you
need to talk to your assigning editor
about this generally what your lawyer
will tell you is that if there's a clear
public interest then go for it because
public interest is a defense in libel
cases in this country and actually many
countries
but yeah there's there are media law
issues here for sure right well I mean
so you would have to be able to make the
argument that that algorithm is
important and knowing how the algorithm
operates can only be done by
testing so there is actually a history
of governments testing systems so for
example sending a similarly qualified
black and white applicant into a bank to
try to get a loan right and there have
been there's a long history of testing
or enforcing anti-discrimination laws in
this way and so there is a strand of
legal argument which I don't know if
it's ever been tested in court but the
argument is that you know we have a long
history of sending people undercover to
test for discrimination so doing that by
making API calls has got to be legal so
I would expect that you're in the clear
but this is why we have lawyers it's to
give us good advice on these questions
ok let's go through this section and
then after the break we'll
do our analysis of machine bias race
and gender bias are the most common
topics for algorithmic accountability
before we get deep into this I just want
to go back to this slide and say most of
these things are not race and gender
bias okay so the algorithmic
accountability community and research
and stories so far has been focused on
an important but actually quite narrow
set of potential problems with
algorithms
nonetheless that's what we're going to
do today part of the reason we're so
interested in race and gender bias is
that there's a very clear legal
framework around it so this is one of
the pieces of law around bias in this
case it's looking at employment but this
was one of the first laws where this
list what is now known as a list of
protected classes appears right race
color religion sex or national origin so
we have these categories they're called
protected classes and it's illegal to
discriminate on the basis of them in a
variety of things we're gonna we're
gonna look at that in much more detail
tomorrow then we're gonna look at it in
much more detail right now okay so these
are some of the places in which it is
illegal to discriminate on the basis of
protected classes so look at look at
what we have here lending education
employment housing and this public
accommodation that's things like the
Americans with Disabilities Act right it
says that you have to have you know
wheelchair accessible entrances and so
forth interestingly the and this is
going to become extremely important when
we talk about definitions of fairness
the public accommodation portion of this
is the clearest example of the legal
principle that different people are
going to require different amounts of
effort to be fair to right so the basis
of the law isn't that everybody gets the
same amount of effort or money or time
or cost the law is more complicated than that
it says some people are going to get
more to make up for some disadvantage
so that is already built into the legal
definitions of fairness and this
becomes a very big discussion and is
indeed the center of some very big
arguments this is a pretty broad list I
you know there aren't many things that
fall outside of this this is most of the
services that governments provide
that's true that's true
well credit
yeah credit and housing that's that's
market regulation yeah so these are
these are some of the domains and then
the I love this list this is this is the
protected classes and you can see how
they've expanded over time so you know
they're you know race color sex religion
national origin those that's the Civil
Rights Act interestingly citizenship
this is relevant today with all of our
debates over immigration in most cases
you can't discriminate based on whether
someone is a citizen or not obviously
there are some cases where you can like
voting age pregnancy you know things you
might expect but then veterans status
and more recently genetic information so
I can't refuse to offer you insurance
because of a genetic test for example so
this is gradually expanding and I guess
we can expect that it will gradually get
bigger
so you may think that if you're building
an algorithm to decide on let's say
credit scoring or hiring or who gets
admitted to a school for example that
it's very simple to not discriminate
against race you just don't give it the
race of the person as input this is
where things get complicated and this is
what I want you to remember is that race
and gender correlate with just about
everything so let's look at that in a
little more detail I'm going to show you
some some data from this paper this is a
paper from I think 2015 where what they did
is they built a model to try to predict
protected classes age gender political
and religious views well things that are
protected classes relationship status
as well from a person's Facebook likes so
which pages had they liked and let's
just pause for a second and talk about
how they did this because it actually
has a lot in common with all of the
recommendation systems that we've been
studying so they start with the users
Facebook likes so the user like matrix
right this is very similar this is
basically a user item ranking matrix and
then singular value decomposition so
that's a matrix factorization technique
so they take each user and
rather than having 55,000 likes for each
user they reduce it to just a hundred
components you can think of these as the
topics right so here we have component one
component two and so forth
this turns this is kind of like our user
topic matrix or document topic matrix
doing this you'll also get another
matrix which tells you which
topics are related to which pages but in
that case we don't use that we just what
we're basically trying to do is
dimensionality reduction on the users
here and then we do a regression model
to try to make a either a binary or
continuous prediction for each of these
variables alright and so here we just
fit the data so we we start with all of
this user like data and we also have the
values of these these things we're
trying to predict and we fit a model and
we are taking a linear combination of
the component scores any question on the
method okay everybody get what we're
doing here this is almost the simplest
model you can build it could be slightly
simpler if you omitted
this singular value decomposition step
but then it would be kind of painful to
do because you would need fifty five
thousand coefficients in your regression
equation okay and the the factorization
also reduces noise interestingly just
like our topic modeling allowed the
same word to mean different things in
different contexts the matrix
factorization will do that for us here
right so maybe liking BMW means one
thing if you are a young
man and another thing if you're an older
woman right the the matrix factorization
will actually capture that okay
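The two-step method just described (reduce the user-like matrix with a singular value decomposition, then fit a regression on the component scores) can be sketched with NumPy. This is a toy under loud assumptions: the like data is synthetic, the sizes (400 users, 60 pages, 5 components) are made up and far smaller than the 55,000 likes and hundred components mentioned in the lecture, and the logistic regression is a plain gradient-descent fit rather than whatever the paper's authors used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the user-like matrix (rows: users, columns: pages).
# All sizes and probabilities here are made up for illustration.
n_users, n_pages, n_components = 400, 60, 5
trait = rng.integers(0, 2, size=n_users)       # hypothetical binary attribute
base = rng.random(n_pages) * 0.2               # baseline chance of liking each page
shift = np.zeros(n_pages)
shift[:10] = 0.4                               # ten pages are liked more when trait == 1
like_prob = base + np.outer(trait, shift)
likes = (rng.random((n_users, n_pages)) < like_prob).astype(float)

# Step 1: dimensionality reduction with a truncated SVD of the centered matrix.
U, S, Vt = np.linalg.svd(likes - likes.mean(axis=0), full_matrices=False)
components = U[:, :n_components] * S[:n_components]   # one row of scores per user


def add_intercept(X):
    return np.hstack([np.ones((X.shape[0], 1)), X])


def fit_logistic(X, y, lr=0.5, steps=2000):
    """Plain gradient descent on the logistic loss."""
    X1 = add_intercept(X)
    w = np.zeros(X1.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X1 @ w))
        w -= lr * X1.T @ (p - y) / len(y)
    return w


def predict_proba(X, w):
    return 1.0 / (1.0 + np.exp(-add_intercept(X) @ w))


# Step 2: regression on the component scores, fit on 300 users, checked on the rest.
train_rows, test_rows = slice(0, 300), slice(300, None)
w = fit_logistic(components[train_rows], trait[train_rows])
accuracy = np.mean((predict_proba(components[test_rows], w) > 0.5) == trait[test_rows])
print(f"held-out accuracy: {accuracy:.2f}")
```

The attribute is never given to the model as an input column; it is recovered from the like patterns alone, which is the point of the example.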
so when you do that this is what you get
so these are the things you're trying to
predict and then this axis is how well
you can predict it
this is an area under curve score how
many of you have seen AUC scores before
knows what that is okay
so many of you know alright so this is
useful stuff this is the basic mechanics
of classification which is what we're
going to do with the machine bias work
so this is this is important stuff the
idea is this let's say you're trying to
predict whether they drink alcohol
alright that's one of the things they
tried to predict here so you build some
classifier in this case it's going to be
a logistic regression classifier but it
doesn't matter it could be a decision
tree could be anything and basically all
of these classifiers at some point
they're going to give you a numerical
output right so we have some rule and we
say you know prediction equals some
function of the input variables which
I'm going to call X right X bar to
indicate that it's a vector of all of
these different input variables and
often it's something like a probability
right so let's say in 0 1 so if you do
logistic regression you'll get
something in 0 1 which if you set it up
right is a probability so now you have
to turn this into a yes/no guess right
because the we're supposed to give it
yes/no answer do they drink or maybe
it's a multi-class problem male or
female you know gay or not whatever over
50 under 50 binary prediction problems
are very common so we need a threshold
right so we're gonna say we're gonna
turn this into a binary prediction just
by thresholding so we're gonna say f
of X smaller than
threshold now the threshold will control
the trade-off between false negatives
and false positives so if I make that
threshold really high then I'm almost
always going to say yes which means I'll
have many more false positives if I
make it zero then I'm always going to
say no so I'll have many more false
negatives if you only want to get rid of
false positives always answer no if you
only want to get rid of false negatives
always answer yes ok so that the trick
is not to reduce one of those to zero
because you can always do that the trick
is to balance them in an interesting way
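A minimal sketch of that trade-off, using made-up scores and labels. One assumption to flag: this sketch answers yes when the score is at or above the threshold, which is the more common convention, so here a low threshold means almost always yes and a high one means almost always no. The direction is only a labelling choice; the trade-off between the two kinds of error is the same.

```python
def confusion_counts(scores, labels, threshold):
    """Count outcomes when we answer yes for every score >= threshold."""
    tp = fp = tn = fn = 0
    for score, truth in zip(scores, labels):
        guess = score >= threshold
        if guess and truth:
            tp += 1      # said yes, was yes
        elif guess and not truth:
            fp += 1      # said yes, was no
        elif not guess and truth:
            fn += 1      # said no, was yes
        else:
            tn += 1      # said no, was no
    return tp, fp, tn, fn


# Hypothetical classifier outputs in [0, 1] and the true yes/no answers.
scores = [0.05, 0.2, 0.35, 0.4, 0.55, 0.6, 0.8, 0.9]
labels = [0, 0, 1, 0, 0, 1, 1, 1]

for threshold in (0.0, 0.5, 1.01):
    tp, fp, tn, fn = confusion_counts(scores, labels, threshold)
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

At one extreme there are no false negatives and every negative case is a false positive; at the other extreme it is the reverse. Anything useful sits in between.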
so what we do is we write this graph and
we're gonna say this is true positive
versus true negatives for historical
reasons because this first came up in
radar radio systems this is called an
ROC curve for receiver operating
characteristic which is the worst name
but anyway that's what it's called and
well there's various ways to draw this
but I'm gonna go with the Wikipedia way
this is false positive rate this is
false negative rate and as you so we
know that we can get
oh sorry yeah thank you false positive
true positive rate so we know that we
can get zero false positives and zero
true positives by always guessing false
we know that we can get a hundred
percent true positives and a hundred
percent false positives by always guessing true so
those two points are always available to
us but then as we sweep the threshold as
we sweep these as we make the threshold
yes we increase it from zero we pick
more and more positives and we start
sweeping this curve up and as it gets
nearer and nearer to one eventually we have a
hundred percent positives so we get some
curve that looks like this as we vary
the threshold and so these are the
points these are the trade-offs that are
available to us and normally what we do
is we compare that to the straight line
between these two points and the reason
we do that is that that is the coin flip
classifier okay if we flip a standard
coin that's weighted heads or tails
fifty-fifty we get this point in the
middle right because it's gonna get half
true positives and half false positives
if we start weighting that coin and we
say well let's weight it 70/30 toward
negative we get a point here or we could
get a point here by weighting it the
other way by setting by just randomly
guessing a certain percentage of trues
and falses we can get all the points on
the green curve so it's only this area of the
graph that tells us how good our
classifier is if we could guess
perfectly we would get
just this one point here right we would
have something that looked kind of like
this and if we ever built a classifier
that had this type of performance then
we just guessed the opposite and we'd
have a better classifier so the area
under the curve is finally this whole
area right we basically just take the
integral as we sweep out our possible
combinations of false positive rate and
true positive rate and it has to be over
0.5 because point 5 is
what the baseline is here of guessing
randomly if we ever get something that's
less than point five we just guess the
other way so going back to our Facebook
analysis result here so guessing whether
someone's parents were together at
age 21 we get an area under curve 0.6
that's just a little bit better than
random guessing so we get some
information but not a lot whereas
guessing gender or well look at
this right hey protected classes race
sexual orientation gender we can guess
with quite high accuracy 0.93
so 0.93 means we have a curve which is
you know up like this right we get most
of this graph under the curve so this is
the the AUC is a useful way of measuring
the performance of a classifier because
it it sort of integrates out the effect
of setting a different threshold it says
you know for whatever application you
want you can pick a false positive rate
and a false negative rate that make you
happy we don't care what you're
going to set that threshold
at this measures it independently
but anyway 0.93 and 0.95 are very good
predictions so you know there must be a
bunch of pages that men like and women
don't and vice versa because we can
definitely figure it out just from your
likes and similarly with race alright
questions on that yeah
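The ROC curve and the area under it can be computed by hand in a few lines. This is a sketch on made-up scores and labels, again assuming the convention that a score at or above the threshold means yes; `roc_points` and `auc` are illustrative helper names, not functions from any library.

```python
def roc_points(scores, labels):
    """Sweep the threshold over every observed score and collect
    (false positive rate, true positive rate) pairs."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]                      # threshold above every score: always no
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points                              # ends at (1.0, 1.0): always yes


def auc(points):
    """Area under the curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical classifier outputs and true labels.
scores = [0.05, 0.2, 0.35, 0.4, 0.55, 0.6, 0.8, 0.9]
labels = [0, 0, 1, 0, 0, 1, 1, 1]

area = auc(roc_points(scores, labels))
flipped_area = auc(roc_points([1 - s for s in scores], labels))
print(area, flipped_area)
```

Flipping every score gives the mirror-image curve, which is why a classifier that scores below 0.5 can always be turned into one above 0.5 by guessing the opposite.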
for profiles that didn't fit the binary
classes what do you mean okay right what
did they do with the Hindus and Jews yeah I
don't know we'd have to read the paper
to see how they coded it yeah I mean you
know or a race other than white or black
almost all of the racial stuff that that
you'll see is done with either white
versus black or white versus non white
there's very few that break out actually
the different races and then you have
the the problem of well what are the
different races and that's a mess anyway
so you can do this even with much less
information that analysis was done with
fifty five thousand likes per person
that's a lot of information although
probably you only liked a few dozen this
is a similar example done
predicting gender and political
orientation from Twitter and this is
just using different techniques but you
can see here the top row they use only
the data from what they had tweeted and
their profile and you can also use data
in some cases to get better accuracy if
you use data about who they follow but
you but you look at this and you see
that you can get about 80% accuracy on
gender and about 90% accuracy on
politics just from Twitter data right so
much simpler data set and I would be
very surprised if you can't get some
really high accuracy on race so it's
very hard to blind an algorithm to
protected classes because and of course
this is the reason they're protected
classes is because they run so deep in
society
they're correlated with just about
everything else so where you live the
amount of money you make what your job
is your health history all of this
information will give you lots and lots
of predictive power on protected
classes which means you can't
really blind an algorithm to race it can
pretty much figure it out anyway so that
becomes a challenge to the entire idea
of not using that type of information to
make decisions oh yeah here you go
here's race from Twitter tada
so this uses different measures of
accuracy precision and recall whether
they yeah I carry out but how did they
get the data whether they followed it
maybe I don't remember now but they did
that because they wanted to look at also
um sort of commercial applications right
so what can a company figure out about
you they use F-measure instead of AUC
F-measure is sort of a weighted
average of precision and recall we're
going to get much deeper into all these
things precision recall F-measures
ultimately all of these things derive
from the confusion matrix and yeah we're
gonna do that after the break we're
going to talk about some machine bias
concerns pretrial risk scores and
there's this lovely 538 piece that has
an interactive simulation one other yeah
I think so
but here we go here's here's what I yeah
yeah yeah the Marshall Project sure but
they've got this lovely little
interactive here so this is the idea
this is the sort of setting that we are
in we have a bunch of people there's a
classifier which classifies them here
into three risk buckets and then I think
how this is set up is the high risk
people are always denied parole the
low risk people are always awarded and
some number of the people who are
granted parole will reoffend and some
number of people who are denied
parole wouldn't reoffend so this is let's
call positive you are classified as
eligible for parole so that would make
this false positives and that would make
this false negatives okay so so sorry
this is post trial this is parole this
is not pretrial but a very similar
system is used to decide who gets
to be released on bail or to advise
judges on who gets to be released on
bail that's actually an important point
is that nobody uses these algorithms as
the final decision it's advice
that is given to the decision-makers who
may or may not follow the advice so
that's the setting and one of the
parameters you have in controlling an
algorithm like this is this you know
threshold we're talking about so if I
crank this threshold way up then a lot
fewer people are denied parole which is
going to make the false positive rate in
hmm how does this work here we go
that false negative rate is much higher
right so I'm I'm releasing fewer people
whereas if I crank this way down then
I'm saying more people are high-risk
which should reduce this rate and why is
this running away backwards than I think
it should be you know high-risk people
there we go okay here we go yeah okay so
here we go so we've got a lower false
positive rate and a higher false
negative rate right so there you can see
directly we trade these things off I
actually think this is confusing and
there should be just one slider which is
the threshold for paroled versus not
paroled because I think what happens
here is some people in the
medium risk category randomly get awarded or denied
so it's a little bit confusing anyway
well that that's the exercise of picking
a threshold you have to decide you you
can only pick a point on that curve you
have to decide which point you want and
that has to be based on how bad are false
negatives versus how bad are false
positives
so in this case how bad is it to release
someone who then reoffends versus
keeping someone in jail who wouldn't
have of course you don't get to observe
this if you're actually using the system
that you know the people you keep in
jail don't get to reoffend so this is
how these systems are built is you
calibrate it by making predictions on a
set of data all of whom are released and
then rather than you know denied parole
but wouldn't have reoffended what
you're looking at is we would have
denied them parole if we were using the
system so in development you actually
get to see these four numbers and in in
production you don't get to see that
number so that's the setting we're in
and let's see here
the machine bias piece is a journalistic
investigation into the COMPAS risk
assessment tool COMPAS makes risk
assessments both pretrial and post-trial
we're gonna talk mostly about pretrial
that means do you get to be free on bail
while waiting for your trial or not
they were able to obtain a copy of the
actual questionnaire that produces the
data this thing has 137 questions I've
just shown you one here it has things
like back here let's let's actually open
this up and look at it there's there's
lots of fun stuff here
so this is up on on document cloud so
you can see the things here that they've
helpfully annotated so they're asking
about things like you know have you
moved a lot or have you been suspended
from school how often you feel bored as
well as you know as they call it
criminal thinking yeah as well as things
like the current charge and so forth
notice that race is not on here now of
course we know the race of the person we
arrested but the risk score does not use
race or gender or age right no protected
classes here however as we saw you can
probably make pretty good guesses as to
the protected variables from this much
data so this is the input there's
one hundred and thirty seven data points and
the core of ProPublica's story of course there's a
lot more that went into this study but
the core of the story is this table here
which I've reproduced from the
methodology and what we are looking at
is the confusion matrix for everybody
for black defendants and for white
defendants so does everyone know what I
mean by confusion matrix okay so what
the COMPAS algorithm actually produces
is two things one is a numerical score
from one to ten which you can think of
as a probability of rearrest another is
just based on thresholding into low
medium or high risk and I think how this
works is in ProPublica's analysis the
medium and high risks are
merged into one category high-risk and
so these are the raw numbers so that
this is the confusion matrix then and
all this stuff false positive rate false
negative rate positive predictive value
which is also known as precision
negative predictive value which is kind
of like you know positive predictive
value is of the people who we said
would be rearrested how many were
negative predictive value is of the
people we said weren't how many weren't
and so on and then when you break it out
for black and white what you find is the
false positive rate for black defendants
is almost twice as high as the false
positive rate for white defendants
meaning that of the people who were
not rearrested twenty three
percent of those who were white were
predicted to be high risk whereas forty four
percent of those who were black were
right so the error rate is
much higher and this is the basis for
the article this is this is the basis
for the claim that the algorithm is
biased so we're gonna do this we're
gonna step through this one one thing at
a time
yeah so first of all this is pretrial so
there's no parole here this is whether
they're released on bail or not right so
that that's important from a legal and a
rights perspective because these are
people who have been charged but not
convicted okay so that that is a legal
difference between you know the the
their case hasn't been heard yet so we
don't actually know if they're guilty of
that crime yet this is just people
who've been charged with a crime
so you have two notebooks
one is a sort of an empty copy so
basically what we're gonna do is we're
gonna take the empty notebook and we're
going to fill it out and I'm going to
more or less cut and paste what's going
on here oh we need the data too so here
we go so let's let's do this so I'm
gonna run this one cell at a time by the
way
the COMPAS algorithm produces
both a violent and a non-violent crime
predictor and then the data also exists
for looking at who gets arrested for a
violent arrest versus a non-violent
arrest this turns out to be important
because there's important differences in
the accuracy of data on arrests for
violent crimes versus non violent crimes
we'll get into that next class but for
the moment we're just gonna use the non
violent data because that's mostly what
ProPublica focused on so if you run
this you should see the data and so this
is what it looks like we have so this is
public record right this is this is
mostly public record it's just arrest
records this is in Broward County
Florida over two years so how
this story happened is they FOIA'd the
risk scores so they got the risk scores
that were assigned to all of these people
and then the risk score is supposed to
give a probability of rearrest within
two years so then they just waited two
years and collected the arrest records
and you can see that here so you can see
a bunch of personal information and then
v_decile_score is for the violent
predictor the screening blah blah blah
and then two_year_recid this is whether
they were rearrested in two years so basically
what we're gonna look at is decile_score
versus two_year_recid so decile_score
is the predicted risk and two_year_recid
is the ground truth so you can see that
here right so and there's a bunch of
other things right like what they were
charged with and so forth and then we
throw out a bunch of data I just follow
ProPublica's analysis here let's not
get too much into it but this is this
is our output right we have 6,000 rows
here so 6,000 people who were given a
risk score and then either did or did
not reoffend so then the first analysis
we're gonna look at is just single
columns so here's where we start typing
right cv I don't know why we called it
cv COMPAS I can't remember
COMPAS scores hmm anyway COMPAS values
anyway cv so we just look at the way
that age is categorized you can see the
majority are in the young category and
then we can get a similar thing I think
this variable is just called race yeah
there you go there's the population it's
majority black and then we can look at
well what scores were they assigned so
this is decile_score value counts there
you go and then they also get this like
text assignment low medium high so that
is score_text value counts and what this
is is there's actually only one
classifier which produces the decile
score and then the low medium high are
generated by thresholding at some
value so I'm not sure exactly where they
put the thresholds but that's the idea
so I think most of the ProPublica analysis
is based on the low medium high scores
but it's actually the same information
just in a slightly lower resolution form
all right has everybody gotten here okay
I know I get to cut and paste so
that's a lot easier for me so just tell me
if I'm going too fast
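As a rough sketch of what those two value counts are doing, here is a plain-Python version. The scores are invented, `score_text` here is a hypothetical stand-in for the real column, and the cut points (medium from 5, high from 8) are assumptions for illustration, since the exact thresholds were not stated in class.

```python
from collections import Counter

# Hypothetical decile scores; the real ones live in the decile_score column.
decile_scores = [1, 1, 2, 3, 4, 4, 5, 6, 7, 8, 9, 10]

def score_text(decile, medium_from=5, high_from=8):
    """Bucket a 1-10 decile score. The cut points are assumed, not confirmed."""
    if decile >= high_from:
        return "High"
    if decile >= medium_from:
        return "Medium"
    return "Low"

# Equivalent in spirit to value_counts() on each column.
print(Counter(decile_scores))
print(Counter(score_text(d) for d in decile_scores))
```

The second count carries the same information as the first, just at lower resolution.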
so now what we can do is we can look at
the scores
oh my this is fun as a histogram so
we're gonna say where race equals
caucasian the decile column the
reason I use a variable here is because
that allows me to switch between the
violent and nonviolent versions right
so I set these columns it just changes
the column name you can just make that
dot decile_col or dot decile_score if
you want and then I plot as a histogram
and I get this okay I can do the exact
same thing with I think the text is
african-american yeah with that hyphen I can
do the exact same thing here and now I
have the decile scores for
african-american defendants okay so what
do you notice
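Here is a small text-only sketch of that per-group histogram. The rows are invented and only mimic the shape of the real data; `decile_col` plays the same role as the variable described above, so switching it to "v_decile_score" would look at the violent predictor instead (assuming the rows carried that field).

```python
from collections import Counter, defaultdict

# Invented rows standing in for the notebook's data frame.
rows = [
    {"race": race, "decile_score": decile}
    for race, decile in [
        ("Caucasian", 1), ("Caucasian", 1), ("Caucasian", 2),
        ("Caucasian", 4), ("Caucasian", 7),
        ("African-American", 1), ("African-American", 3),
        ("African-American", 5), ("African-American", 7),
        ("African-American", 9),
    ]
]

decile_col = "decile_score"  # one place to switch which score column is used

# Count how many people in each group got each score.
counts = defaultdict(Counter)
for row in rows:
    counts[row["race"]][row[decile_col]] += 1

# Print one bar per decile for each group.
for race, hist in counts.items():
    print(race)
    for decile in range(1, 11):
        print(f"  {decile:2d} {'#' * hist[decile]}")
```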
let's start analyzing this yeah so all
right so african-american defendants
are more often assigned higher scores
but it's hard to say what that means
just yet it's you know it's certainly a
pattern that would be consistent with
with racial bias but we we're looking at
the output of the process we have to
look at think of it the inputs basically
or more precisely we're looking at the
predictions we have to look at the the
facts as well because in this case we
get to know what actually happens
because they waited two years so let's
look at the outputs now or the or the
ground truth so for this let's say so
how did this work what is that column
called two_year_recid yeah okay
here we go
cv two_year_recid value counts so there
we go
so you know a little less than half were
rearrested and then let's do a crosstab by
race and I've just I've done a little
magic here to just get rates as well so
if I just do a crosstab on these two
variables I get this I get oh okay but I
want it as rates as well because I have
different numbers of defendants so I add
this other line which is I add a column
called rate where I just divide the one
column by the sum across that axis and I
get this table so when you look at this
table what do you see
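For anyone who wants to see the crosstab-plus-rate step without pandas, here is a minimal sketch. The counts are invented for illustration and are not the Broward County numbers; the group names just mirror the values in the race column.

```python
from collections import defaultdict

# Invented (race, two_year_recid) rows; 1 means rearrested within two years.
rows = (
    [("African-American", 1)] * 5 + [("African-American", 0)] * 5
    + [("Caucasian", 1)] * 4 + [("Caucasian", 0)] * 6
    + [("Asian", 1)] * 1 + [("Asian", 0)] * 2
)

# The crosstab: group -> [count not rearrested, count rearrested].
table = defaultdict(lambda: [0, 0])
for race, recid in rows:
    table[race][recid] += 1

# The extra "rate" column: rearrested divided by the row total.
base_rate = {race: yes / (no + yes) for race, (no, yes) in table.items()}

for race, (no, yes) in table.items():
    print(f"{race:18s} 0: {no:3d}  1: {yes:3d}  rate: {base_rate[race]:.2f}")
```

Note how the smallest group has only three people, so its rate would swing a lot if a single outcome changed.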
true so some races
have a much smaller number of people so
we can expect the rates to be very noisy
for those races what else do you see
african-americans reoffended more than
any other race
yeah it's okay so let's let's get clear
about what this value is
this is rearrest okay so rearrest means a
lot of things it doesn't mean a
conviction for example because we are
pretrial here but they're definitely
rearrested so we have to ask what
re-arrest means and we will do that
extensively in the next class we're
gonna we're gonna get deep into the data
here and I think I've set violent equals
false yeah violent equals false so we're
looking at rearrests for nonviolent crimes
which tend to be let's say more prone to
bias than violent crime but the key
thing here is that these rates which I'm
going to call the base rates differ
between races and this is going to be the
fundamental challenge to definitions of
fairness okay if the base rate was
the same between every race here then it
would be very easy to build a classifier
which was unbiased by whatever metric
you want because the base rates are
different we're going to be forced into
making some difficult choices these are
all arrests of people who they had risk
scores for all right so this is how the
reporting process worked for this story
they filed a bunch of FOIA requests
Broward County came through and said all
right you can get the risk scores for
these people then they waited two years
and then they compared the risk scores
that they'd obtained through FOIA to the
public records so you're following up a
particular population of people who were
originally arrested in a particular
interval for which they could get risk
scores this was one County in Florida
for I think a few months of arrest data
and then they looked two years after
that to see if they were rearrested okay so
then we can do
this this same this same type of
analysis for I guess I could I thought I
could switch tabs by going control left
and right oh well the same type of
analysis for sex so exactly the same
idea one of the reasons I'm including
sex in here well to start with it's a
protected class but here you have kind
of the opposite pattern which is that
the traditionally marginalized group has
a lower recidivism rate so in fact we're
going to find that if we pick a
particular definition of bias it's
usually going to be say if
it's biased for the oppressed racial
group it's going to be biased against
the oppressed gender group and this is
interesting because when you start to
ask intersectional questions you will
find that there is there are trade-offs
between the different categories in very
complicated ways if there is a theme to
all of this it's just that you you can't
have everything you have to have
trade-offs and those trade-offs are
uncomfortable
alright so that's what it looks like you
will be shocked to hear that men are
re-arrested more often so then we're
gonna do the same thing actually we're
gonna we're gonna skip that let's so now
that's those are the those are the
predicted risk scores and those are the
actual rearrest rates now we're going to
compare them on the same chart so how we
do that is
like this so we group people by decile
and then we take the average of the
recidivism so that gives us a recidivism
rate and when we run that we get this
thing so this is the rate of
rearrest for people who were assigned each risk
score so what do you see consistency
what does that what does that mean right
so the horizontal axis is the prediction
the vertical axis is the reality and you
can see that people who were predicted
to be more likely actually were all right
there's this general upward trend now
this is called calibration and it's
called calibration because if this plot
of prediction versus reality is
monotonic that is to say non decreasing
there is some transformation to turn
this into a probability so the risk
scores are not actually probabilities
but they can be turned into
probabilities okay so as long as you
have a monotonic plot we say that the
prediction is calibrated meaning that we
can turn it into a probability so this
is calibration is a very important
property of prediction systems and it is
a property that this predictor has so so
far so good for the predictor but what
we want to know is how does this break
down between black and white again and
so I have to go through a little bit of
work to get there so what we're doing is
we're generating the values
for these two plots for b and w and then
I don't know I had to glue them together
into a data frame called
a to get it to plot the way I want it
which is like this so talk to me about
this plot what does this plot tell us
we hated for whites and I was blessed I
was less of the trend yeah okay yeah so
right so there's noise here so so
generally it so I think there's sort of
a few big takeaways from this plot right
so generally they both increase
monotonically alright so there there's
calibration not only for everybody
together but for the individual groups
there's some noise here if we really
wanted to we could do estimates of you
know is this statistical noise or is it
actually a real problem with the
predictor but generally it's alright the
other thing is the height of the bars is
more or less the same at every decile
score in other words a score of six
means the same thing regardless of
whether you're black or white this is a
concept called positive predictive value
another name for this is precision it's
the same calculation it's just one we
think of like information retrieval one
we think of prediction what it is is if
I give you a score of six how often do
you actually reoffend and so that
positive predictive value is relatively
balanced between races all right so if I
say you have a certain
probability of reoffense that means the
same thing regardless of your race
except for this one right so if I say
you have a very high probability of
reoffense you're actually slightly less
likely to reoffend if you're white so by
the way whose advantage would that
be for it to be so in this pattern
where it's actually lower for
white defendants whose advantage
is that other way around yeah because
if the actual rearrest rate is lower
than I say it is then that is a false
positive and that's bad for that group
right so the fact that we
predicted it to be high and it actually was
that's good for the group that has that
and for someone to be rearrested less
often than they're predicted that's bad
for them because it means those people
are going to end up getting granted bail
less often right right so it's about
calibrated if anything and again
by this measure of fairness this plot
would would favor African Americans okay
slightly I think I think honestly this
is just noise and then if we tried it
with a different sample we would we
wouldn't see this that would be my guess
so that's as far okay
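The calibration check described above (group by decile, take the mean of the outcome, then repeat per group) can be sketched like this. The rows are invented and use only three score levels to keep it short; `rate_by_decile` and `is_monotonic` are hypothetical helper names, and "b" and "w" are just labels for the two groups.

```python
from collections import defaultdict

def make(race, decile, rearrested, total):
    """Build invented (race, decile_score, two_year_recid) rows."""
    return ([(race, decile, 1)] * rearrested
            + [(race, decile, 0)] * (total - rearrested))

rows = (make("b", 2, 1, 4) + make("b", 5, 2, 4) + make("b", 8, 3, 4)
        + make("w", 2, 1, 4) + make("w", 5, 2, 4) + make("w", 8, 2, 4))

def rate_by_decile(rows):
    """Observed rearrest rate for each score, in increasing score order."""
    totals = defaultdict(lambda: [0, 0])  # decile -> [rearrested, people]
    for _, decile, recid in rows:
        totals[decile][0] += recid
        totals[decile][1] += 1
    return {d: hits / n for d, (hits, n) in sorted(totals.items())}

def is_monotonic(rates):
    """True if the observed rate never decreases as the score goes up."""
    values = list(rates.values())
    return all(a <= b for a, b in zip(values, values[1:]))

overall = rate_by_decile(rows)
by_group = {g: rate_by_decile([r for r in rows if r[0] == g]) for g in ("b", "w")}

print("overall", overall, is_monotonic(overall))
for group, rates in by_group.items():
    print(group, rates, is_monotonic(rates))
```

Comparing the per-group dictionaries score by score is the numeric version of comparing the heights of the paired bars.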
the next thing we're gonna do is produce
the actual confusion matrix so we're
going here from decile scores or
probabilities to just yes/no which means
we have to apply a threshold somewhere
and what we do is we use the low medium
high risk variable and then we combine
the medium and high
so basically we're splitting at the
threshold between low and medium and I
don't actually know what that is in
terms of decile score but here it is how
do we do this okay we generate a
confusion matrix like this so first
we're saying okay these are all the people
where we guessed that they would be
rearrested here are all the people who were
rearrested
when we do a crosstab we get this
confusion matrix so here it is so
confusion matrices the entries on the
diagonal of the confusion matrix are
correct guesses the entries off the
diagonal are incorrect guesses so you can
see right away that this has a fair
proportion of incorrect guesses there
are many many statistics that we can
compute from a confusion matrix the
simplest one is called accuracy and oh
it says right here the fraction of the
guesses that were correct so what is it
what is the accuracy here how do we
calculate the accuracy from this
confusion matrix
yeah right so it's this Plus this
divided by the total number of cases
positive predictive value is what we were
just looking at so of the people we
guessed would recidivate which is this
column these are the people we guessed
would recidivate how many did so it
would be this divided by the sum of this
column right and then we have false
negative and false positive rate which
is so false positive rate is of the
people who didn't recidivate so this row
how many did we guess would so this
so it's this divided by this and this of
the people who didn't recidivate how
many did we guess would
right so okay so actual recid false so
that's this row and how many people did
we guess okay so this okay so it's
this divided by the sum of this row so
note that positive predictive value uses
only this column false positive rate
uses only this row they're actually
looking at different things the
situation here is very analogous to the
calculation of conditional probability
so probability of a given B is actually
a completely different number than
probability of B given a we'll talk
about that in a later class in more
detail but the point is that this
confusion matrix is actually where all
the information is all of these one
number summaries use different pieces of
that and then this is all relative to a
threshold right if we change the
threshold for whether we guess true or
false then these numbers are all going
to change and changing that threshold we
can adjust the rate of false positives
to true positives but only to a certain
point right we can never get a hundred
percent accuracy because the classifier
that's just not available to us and this
link has this is a great page which
summarizes quantitative definitions of
fairness
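Here is a minimal sketch of building the confusion matrix described above, with medium and high merged into one "guessed rearrest" category. The rows are invented for illustration; the layout follows the lecture, with actual outcomes on the rows and guesses on the columns.

```python
from collections import Counter

# Invented (score_text, two_year_recid) rows; 1 means rearrested in two years.
rows = ([("High", 1)] * 2 + [("Medium", 1)] * 1 + [("Medium", 0)] * 2
        + [("Low", 1)] * 1 + [("Low", 0)] * 4)

# Medium and High both count as a guess of rearrest.
matrix = Counter((text != "Low", bool(recid)) for text, recid in rows)

tp = matrix[(True, True)]    # guessed rearrest, was rearrested
fp = matrix[(True, False)]   # guessed rearrest, was not
fn = matrix[(False, True)]   # guessed no rearrest, was rearrested
tn = matrix[(False, False)]  # guessed no rearrest, was not

print("            guessed no  guessed yes")
print(f"actual no   {tn:10d} {fp:12d}")
print(f"actual yes  {fn:10d} {tp:12d}")

# Accuracy: the diagonal entries divided by the total number of cases.
accuracy = (tp + tn) / len(rows)
print(f"accuracy {accuracy:.2f}")
```

Every other summary number in this part of the class is some ratio of these four cells.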
oh cool
yeah that would be cool okay so here's a
nice little diagram and it shows you how
all of these things relate so these four
entries are the four entries of the
confusion matrix and then all of these
things you can calculate so for example
positive predictive value is something
it doesn't show you what it uses to
calculate it I think that Wikipedia one
does though no it just has this one
anyway you'll get really familiar with
these definitions if you work with this
for a while these are the core of the
definitions of fairness and these are
all calculations you do from a confusion
matrix and there's this great here we go
here's the diagram I wanted so if you
click the L it just goes to that link
damn it somewhere there's an interactive
one where if you hover it shows you
which entries it uses to calculate but anyway
there's lots and lots of these different
things and they're all based on values
from the confusion matrix you can study
that on your own time what we're gonna
do here is just calculate them and then
we're going to compute them so let's see
the positive predictive value according
to Wikipedia is true positive / (true
positive plus false positive) so let's do
it
true positive / (true positive plus false
positive) what do we get
63% and then the false positive rate is
false positives / negatives so that's
false positive over (false positive +
true negative) so the denominator here is
just the number of people who were not
rearrested
FP / (FP + TN) FP / (FP + TN) yeah there you
go so we're just doing these these are
all calculations and these make it a
little little easier right so we can
just see FP / n I guess I gotta run this
one okay now if you want a zero false
positive rate you can always get zero
false positives by just guessing that
nobody will reoffend but that trades off against the
false negative rate which is FN / P I
think yeah FN / P so for this particular
threshold setting so that remember
there's an implicit threshold here that
sorted people into low versus medium
versus high for that threshold setting
this is the false negative rate and we
can always have more false negatives or
false positives by moving that around
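The three formulas just computed can be collected into one small function. This is a sketch with invented counts, and `metrics` is a hypothetical helper name; the second call shows the degenerate case mentioned above, where guessing that nobody reoffends drives the false positive rate to zero and the false negative rate to one.

```python
def metrics(tp, fp, tn, fn):
    """PPV, FPR and FNR from the four confusion matrix cells."""
    p = tp + fn  # everyone who was actually rearrested
    n = fp + tn  # everyone who was actually not rearrested
    guessed_positive = tp + fp
    return {
        # PPV is undefined when nobody is guessed positive.
        "PPV": tp / guessed_positive if guessed_positive else float("nan"),
        "FPR": fp / n,
        "FNR": fn / p,
    }

# An invented confusion matrix at some threshold.
print(metrics(tp=3, fp=2, tn=4, fn=1))

# Guessing that nobody will reoffend: no positives are guessed at all.
print(metrics(tp=0, fp=0, tn=6, fn=4))
```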
okay so this you have let's just run it
and then what I do is we finally get to
the analysis here which is where am I
going there we go we are now going to
compare the confusion matrices and these
various metrics for black versus white
so here we go I just grabbed the set of
them that are black versus white and I
print the metrics on what I guessed for
them and what actually happened and this
is what I get so we now have
replicated that ProPublica methodology
table so here we go
here is a replication of the ProPublica
story so what do you see let's talk
about this
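The per-group comparison can be sketched as follows. The counts are invented and chosen only to show the shape of the result (similar positive predictive value, different false positive rate); they are not the ProPublica numbers, and `expand` and `group_metrics` are hypothetical helpers.

```python
from collections import Counter

def expand(group, tp, fp, fn, tn):
    """Build invented (group, guessed_rearrest, was_rearrested) rows."""
    return ([(group, True, True)] * tp + [(group, True, False)] * fp
            + [(group, False, True)] * fn + [(group, False, False)] * tn)

rows = expand("black", 6, 4, 2, 4) + expand("white", 3, 2, 3, 8)

def group_metrics(rows, group):
    """Confusion matrix metrics restricted to one group."""
    cells = Counter((guess, actual) for g, guess, actual in rows if g == group)
    tp, fp = cells[(True, True)], cells[(True, False)]
    fn, tn = cells[(False, True)], cells[(False, False)]
    return {
        "PPV": tp / (tp + fp),
        "FPR": fp / (fp + tn),
        "FNR": fn / (fn + tp),
    }

for group in ("black", "white"):
    print(group, group_metrics(rows, group))
```

In this toy data both groups get the same PPV, yet the false positive rates are far apart, which is the pattern the discussion below turns on.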
everybody got there it's actually not
that hard to replicate the core of the
story it's not a very complicated
calculation you're just looking at
confusion matrices so first of all
which of these values are the
central claim of bias in ProPublica's
story okay so here we go that versus
that right okay
what else can we see yeah so yes what
about the positive predictive value five
nine versus six five
yeah so let's let's draw this out so
this this is an extremely important
distinction which is confusing and this
is part of this sort of challenge of
defining bias so positive
predictive value is let's put it
this way
we guessed rearrest okay and they
were actually rearrested okay so we want this
to be as close to one as we can get it
right whereas false positive rate starts
the other way around
it says they were not re-arrested but we
guessed they would be okay so the
denominator here is different the
denominator here is everyone we guessed
positive whereas the denominator here is
everyone who was actually not rearrested so
remember one is a row in the
confusion matrix one is a column in the
confusion matrix so they are related
because they both have one element in
common which is we guessed they would be
rearrested and they were not rearrested
but there you can think of it as as
guessing in a different direction right
and it takes a little while to wrap your
head around it's it's sort of like this
is also sometimes called conditional
use accuracy because this is the value
this is something from the point of view
of trying to decide whether to release
someone on bail right so we don't know
if they're actually going to be
rearrested when we have to release them
on bail
we only know whether we're guessing them
so it's kind of like from the point of
view of the information that we have at
the time we have to make the decision
how good can we do whereas this is sort
of the other way around which is from
the point of view of what actually
happens eventually how good was the
guess so they actually measure different
things and part of the disconnect here
is that the positive predictive
value is relatively balanced if anything
it looks better to black defendants
whereas the false positive rate is not
balanced okay and in fact this is what
happened when the story was published
Northpointe who makes COMPAS came back
and said what are you talking about it's
this this is a calibrated algorithm
right they came back basically with this
chart and said hey it's fine
so part of what was going on was was
that they were using a different
definition of fairness when they
developed it and you can say many many
other things about this this issue most
of which will have to say next time but
this is the the core of the analysis so
I think we're gonna have to leave this
there for now and talk about your final
projects so let's do that
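One way to see why the two definitions of fairness pull apart when base rates differ is a small identity that follows directly from the confusion matrix definitions. Since FP = TP (1 - PPV) / PPV and TP = (1 - FNR) P, dividing by N gives FPR = (p / (1 - p)) ((1 - PPV) / PPV) (1 - FNR), where p is the base rate. The sketch below checks that against an invented confusion matrix and then plugs in two hypothetical groups; `implied_fpr` is an illustrative name, not something from the notebook.

```python
def implied_fpr(base_rate, ppv, fnr):
    """False positive rate forced by the base rate, PPV and FNR."""
    odds = base_rate / (1 - base_rate)
    return odds * (1 - ppv) / ppv * (1 - fnr)

# Sanity check against a small invented confusion matrix.
tp, fp, fn, tn = 3, 2, 1, 4
total = tp + fp + fn + tn
direct = fp / (fp + tn)
via_identity = implied_fpr((tp + fn) / total, tp / (tp + fp), fn / (tp + fn))
print(direct, via_identity)

# Two hypothetical groups with the same PPV and FNR but different base rates.
for base_rate in (0.4, 0.5):
    print(base_rate, round(implied_fpr(base_rate, ppv=0.6, fnr=0.3), 3))
```

With PPV and FNR held equal, the group with the higher base rate necessarily ends up with the higher false positive rate, so something has to give.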
so the final project suggestions so
remember these are just ideas you can do
you can do anything else you want I as
I've said a number of times I'm
fascinated by automated trading I think
it is drastically under explored in the
algorithmic accountability literature
in part because it's harder to do and
in part because it's it's deeper it's
more tied into the structure of society
it's it's the roots of capitalism so
here's one thing you could look at
here's when the AP's
Twitter account was hacked and
there was this fake tweet that got sent
out it briefly crashed the stock market
there is a paper which uses the extent
and the timing of this crash to estimate
the speed and scope of automated trading
in the markets now this is a few years
ago this was 2013 I can't remember now
13 14 yeah 13 so as of five years ago we
have an estimate but I bet there are
ways to estimate the current scope of
systems certainly we have numbers like
you know what fraction of trades are
made by computer now you have to be
careful interpreting these numbers
because there's lots of different
markets and made by a computer can mean
different things and so forth but
you know it would make a very
interesting story to try to dig into
that
related what are the values encoded in
automated trading algorithms and one way
you could proceed on this story is just
take some of the standard automated
algorithms and there are libraries of
these things so Quantopian is an
automated trading development platform
you could go to Quantopian's library
look at each of the algorithms that they
have just just basically use it as a
list of algorithms and just say some
things about what would an economy where
many people are trading with this
algorithm look like what types of
businesses what types of economic
activities are encouraged by this
algorithm and what types are discouraged
so here's mean reversion and what mean
reversion would do for example is it
would buy stock of something that
suddenly fell and sell stock of
something that suddenly grew so that has
weird effects right it would
punish fast growers and try to reward
slow growers which maybe has actually a
kind of inequality reducing effect or
maybe it punishes people who do well and
then you've got the more basic question
of what algorithms are people actually
using it's very hard to find that out this
is part of why everybody's looking at
government right now it's because it's
easier because you can FOIA it all right
you can't FOIA a private company so you
have to look for regulatory filings you
have to look for conference
presentations you have to talk to people
who used to work there but there's
actually court cases there's actually
quite a lot of disclosure you just have
to look in in weird places for it tools
you can do a tool as your final project
I'm involved in two major journalism
platforms one of them is Workbench
which Aaron is involved in so you can
ask him about that but right
complicated but there's also this
document mining tool overview which
we've we've talked a lot about the
construction of it could use some love
so the disadvantage of this is that I'm
basically asking you to write write
software for other people the advantage
of that is those people are your fellow
journalists I can't I don't have any
developers working on overview right now
so I can't do requests like better
entity recognition so if you wanted to
figure out how to do it and then
implement it journalists throughout the
industry would be able to use it the
most ambitious thing you might do is is
build machine learning into overview
right now you have to export out to a
notebook to do the machine learning and
export back in but we've got like all of
the pieces we've got all of the tagging
and UI we just don't have the actual
core machine learning yeah yeah so I on
your assignment through course works
there's two links to previous projects
yeah of course works for sports too we
were just talking about bail reform why
do people bother with algorithmic risk
assessment you're gonna see this slide
next class as well it's because there is
good evidence to suggest that machines
can guess a lot better than human
judges can who is going to be rearrested and
if you can do that then you can keep
more people out of jail while
simultaneously reducing crime and
reducing racial bias there are only a
few I think there are three cases where
you can directly compare human and
machine guesses I haven't seen a good
review comparing them most of the
discussion around risk prediction has
been sort of like omg biased algorithms
I haven't seen good articles that are
like ok but less biased than humans
which they might be yeah and then we talked
about oh so this we're gonna talk about
this a little more next class but you
can experiment on platform algorithms
right so this was an experiment where he
published poems haiku with and without
the colored background and estimated the
difference in engagement and basically
they got twice as much engagement and so
you can just do this with a spreadsheet
right this is um this is a great little
article and if you start thinking about
it in this way like if I changed my
interaction so what you have to do is
you have to change your interaction with
the platform and then figure out
something to measure and you could use
your account or you could use new
accounts there are lots and lots of
different experiments you could run on
these algorithms
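If you would rather analyze such an experiment in code than in a spreadsheet, a minimal sketch might look like this. The engagement counts are invented, and the permutation test is an addition of this sketch rather than something the original experiment is known to have used; it simply asks how often a random relabeling of the posts produces a gap at least as large as the observed one.

```python
import random

# Invented engagement counts per post, with and without the colored background.
with_background = [12, 8, 10, 14, 6]
without_background = [5, 4, 6, 7, 3]

def mean(values):
    return sum(values) / len(values)

ratio = mean(with_background) / mean(without_background)
print(f"engagement ratio: {ratio:.1f}x")

def permutation_p_value(a, b, trials=10_000, seed=0):
    """Share of random relabelings with a mean gap at least as big as observed."""
    rng = random.Random(seed)
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        if mean(pooled[:len(a)]) - mean(pooled[len(a):]) >= observed:
            hits += 1
    return hits / trials

p_value = permutation_p_value(with_background, without_background)
print(f"one-sided permutation p-value: {p_value:.4f}")
```

With only a handful of posts per condition the estimate is noisy, so more posts or more accounts would make any conclusion firmer.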
