Algorithms: Filter Design

Frontiers of Computational Journalism week 3 - Information Filter Design

Transcript

so today's topic is filter design and by
filter I mean any system which picks
content so or you can think of it as
reducing the volume of information so
this is recommendation systems this is
news aggregators this is the Facebook
newsfeed
this is Twitter's algorithmic
suggestions anything which has to rank
results search engines as well can be
considered a type of filter and these
are the recommendations for what you
should watch on Netflix and what you
should buy on Amazon this is fundamental
technology for handling information and
[Music]
basically all of our news consumption
goes through filtering algorithms at
this point so we're going to talk about
why we do this then we're going to look
at quite a number of algorithms for
doing it then we're gonna add humans
back into the system because real
filtering algorithms have things like
teams of thousands of comet moderators
for example then we're gonna talk about
some classic problems or concerns with
filters how many of you have heard of
the idea of the filter bubble yeah so
we're gonna we're gonna talk about
polarization and filter bubbles try to
get some clarity on on that debate and
then we're gonna talk about filter
design what is its design a filter what
do we want a filter to do and then
you'll have an assignment which is to
design a filter filtering
[Music]
I'm showing you this picture of a
newspaper to make a few points about
what a filter does so if you look at a
standard newspaper you know you've got a
big headline here and then a subhead
here and then smaller stories here and
then of course stories on the following
pages and one of the reasons I'm using
this particular headline is in part
because I used to teach this course in
Hong Kong but in part because you may
not have seen this headline if you were
living in Europe or America at the time
this you know you would have seen this
news but it might have been page 2 if
you are in Los Angeles whereas its
front-page if you're in Manila so the
point I'm making is that there has
always been filtering or prioritization
in news that's how we decide what should
get the most attention humans have
always done this and actually there's
they're starting to be research
comparing what humans do and what
filtering algorithms do this is also a
filter this is Google News from I guess
a few days ago and you know it's all
Cavanagh all the time right now so think
about that for a second of all of the
things that happen in the world
yesterday when I go to Google News why
do I get to that one right now of course
I think it's undeniable that's it's an
important story but there's lots of
important stories and outside the US
this is gonna be much less important
story so that's a filter as well and we
know a little bit about how this is done
we talked about the clustering
algorithms that produce the
the groups of stories and categorize
them by topic what we didn't talk about
is the ranking algorithm that picks
which story goes at the top so think of
it that way that's where we're gonna
talk about today how do you know which
story goes at the top here's the
Facebook newsfeed this was actually from
a business insider editor it was
actually quite hard to find a picture of
someone's newsfeed like who's willing to
post their newsfeed right but of course
the newsfeed is a filtering algorithm
possibly the most significant filtering
algorithm at the present time in terms
of its effects on society oh oh it was
an article about the Facebook newsfeed
and so they needed an illustration and
so they they had a brave volunteer I
guess mine were too stupid to post
here's an old graphic it only goes to
2015 or so but at the time it was 300
hours of video per minute uploaded I'm
sure it's much higher now so you can you
can do the math here and think well how
many television stations were there in
the 20th century right television
commercial television transmission
started late 30s 40s
earliest you know we might have had a
few dozen channels at peak so you start
working this out let's say we have you
know 50 channels times 100 years and you
find that the amount of video on YouTube
exceeds the amount of broadcast
television of the 20th century by
multiple orders of magnitude and you
know we're probably closing in on
a million times more video so this is a
huge amount of information and of course
it's not just video and news public
information is produced at an enormous
rate so every time a public company
takes certain types of actions you know
the executive buy stock they make an
acquisition that sort of thing
you have to file a report with the SEC
so there are ten thousand such reports
filed every day
imagine covering this stuff right so
imagine you work for writers or
Bloomberg and you have to watch this
stuff even if you have a team of a
hundred quarters doing it that's still a
hundred to one ratio they can't cover
every so they have to pick some set of
this to look at so implicitly or
explicitly explicitly there's a filter
happening here and this is the ApS
output the AP only has a few thousand
reporters so actually how they get ten
thousand stories a day is right throughs
and translations and syndication they'd
probably produce a couple thousand
original text stories a day if you don't
count this sort of stuff but I got these
statistics a number of years ago by
talking to one of the engineers on the
platform which routes all this
information I want to try a little
experiment with you I want to estimate
the filter aperture and by that I mean
how many orders of magnitude do your
personal information filters throw out
so to do this we need two things one is
how many items of content per day do you
consume so how many links do you click
on how many articles do you read or
videos do you watch I mean what's the
what's the estimate here anybody want to
hazard a guess four
let's pick it let's say an article you
read let's just say the article you open
it actually doesn't much matter what you
pick because if you change the
definition you'll change it by maybe one
order of magnitude so let's say an
article that you open so what are we
talking about how many a day 10 any
other guesses was that 15 10 okay
yeah so it's on the order of like 10 20
30 it's not a hundred it's not - okay so
let's we can say conservatively ten
items a day read I suspect that it's a
little bit higher I suspect that by the
time we count the cat videos it's gonna
be more like 20 or 30 but we're just
doing an order of magnitude estimate so
it actually doesn't matter if it's 10 or
50 yeah maybe you see a hundred
headlines yeah let's call it ten okay so
now we have a somewhat harder estimate
so so this is the this is what makes it
through your personal system of filters
which is a combination of what humans
have picked and put on these sites and
what Facebook has chosen what Google
News has chosen what the recommendation
system of the a.m. New York Times has
chosen so most news sites now have
personalized recommendations right so
the combination of all of these
filtering systems gives you that so now
how let's say you're reading only new
items
all right let's simplify this problem
and say you're only gonna read things
that were produced that day you're not
looking at the back catalogue how many
articles are produced at day
so that's a hard thing to estimate we're
gonna we're gonna break that down a
little bit again we're just looking at
orders of magnitudes and we'll get we'll
put some ranges in this how many
publishers are there that you might read
so I imagine you're mostly reading in
English
so we'll just restrict it to
english-language publishers are there a
thousand are there a million
how many publishers are there how many
people putting content on the way what's
that a few thousand yeah anyone other
guesses well publishing any of the
content that you might click on yeah
well it's got to be the same count it's
got to be the same universe as that
right so publishing stuff that you read
well or that you could read right but if
you're not counting videos here then
we're not counting good night yeah how
many items you could read if you read
them all so don't don't sweat this too
much we're going for orders of magnitude
here right so I'm gonna say a thousand -
I think we have ten thousand publishers
do we have a hundred thousand
yeah all right so let's let's just let's
just put the uncertainty in here all
right so we're just gonna this is very
like the what's it called the Drake's
equation right you're trying to estimate
the number of civilizations in the
universe we get these really wide ranges
but that's okay we're looking for order
of magnitudes okay so then how many
items a day on average does a publisher
publish now the AP is big it's one of
the largest news organizations in the
world okay yeah I I think it's probably
smaller I'm gonna say ten to a thousand
because there are big ones as well right
so if we multiply these two together
then we get so that's ten to the three
that's ten to the one on the low end so
we can we get 10 to the 4 - what is that
10 to the 5 times 10 to the 3 10 to the
8 items published
day ten to the four seems really small
there's definitely more than 10,000 a
day so I'm gonna say that that we do not
have that case and call it ten to the
five to ten to the eight because what we
know one publisher alone does 10 to the
four so so then divided by ten so then
divided by ten items day equals 1 over
10 to the 4 to 1 over 10 to the 7 so
that is the filter aperture right that
is how much narrower the funnel are the
the sort of business end of the filter
is the end that you read compared to the
stuff going in it
right that's that's measuring this width
so to me that says you have no hope of
having a representative sample it
doesn't matter what you do the output of
the filter is not going to give you a
very good picture of the input to that
filter so there's this idea I here I
heard journalists and other server
information professionals say there's a
lot right everything you need to know
about what's happening in the world
right you're gonna you're gonna read
this news source you're gonna be an
informed person no way absolutely not
there is it is not possible to keep up
even with what makes it as a news wire
headline right if you wanted to keep up
with what an international wire service
is producing so if you want to read the
AP or the BBC or the Reuters or the
Bloomberg feed you're gonna be doing
nothing but reading those stories all
day and that's just what the major wire
services are producing we're not talking
to local news or anything like that okay
so nobody is an informed citizen in the
classic sense of like knowing what's
going on in the world instead we have
these systems that give people a little
piece of what's happening and so we have
the very complicated problem of what
piece do we give them and that is one
aspect of the filter design problem okay
but I want I want you to think of this
this tiny little funnel so on average
you get to pick one in a million
so which one out of the million view
pick
right and even a random sample is
pointless because if you pick 10 random
items out of 10 million items there's
more than 10 topics in the world so it
doesn't matter what you do you're just
gonna miss everything
on the bright side that's normal
this is how humanity has always operated
and it is possible to be more informed
than any other time in history but
you're never going to you know know
everything
to be like duplicates of like the scenes
I don't know I mean you could knock off
an order-of-magnitude if you want right
the only thing I'm really trying to say
is that it's not like you can read one
percent or even point one percent or
point zero one percent you have to start
getting into the one-in-a-million range
to get estimates and actually I think
this is a little low I think when you
start adding and like social media posts
and you start adding in foreign language
media I think we get a few more orders
of magnitude so given that we have this
tiny filter aperture I want to start
talking about algorithms to select some
tiny subset of things to show you and
we're gonna build this up from some
pretty simple algorithms so some pretty
complicated algorithms culminating in
the underlying architecture of the New
York Times recommendation system but
we're going to start with comment
ranking
so any discussion system let's say
reddit in this case because that's
actually the algorithm we're going to
look at needs a way to rank comments or
rank stories whatever the unit of
analysis is and so they have voting
usually and the idea is that things that
are voted higher go to the top of the
list now this is actually not a very
good common ranking algorithm just
straight put the items with vote at the
top fails in actually a pretty bad way
anybody guess the way in which this
fails yeah so it's those things which
have a lot of votes appear at the top so
they get voted on more the problem is
that you don't actually even get a
chance to vote on the stuff that's even
below the fold right so for example we
know there's an enormous drop-off if you
have to go to the second page of Google
search results right so that means you
know at most people are going to look
through ten
there's actually a bunch of
documentation which is discussed in an
article that's linked from the syllabus
on how reddit's comment and story
ranking works and originally it used
this algorithm which is now known as the
hot algorithm and it's called the hot
algorithm because it's got this
time-based component so think of it here
right a is when it was posted and B is
the current time we take this this time
variable here this TS thing and then we
do a calculation of the number of
uploads minus down votes Y is basically
just a sine function but basically we
have the time divided by some constant
so as the article gets older it falls
down in the rankings and Z is a function
of basically it's a clipped up votes -
down votes it doesn't let the if it's
something there's a lot of down votes it
doesn't let it go below the 1 it you
can't make things massively negative so
basically it's this thing where it's you
know log of the number of up votes plus
a decay factor over time so this is
about as simple a filtering algorithm as
you will find in practical use right so
it's based on votes and time it makes
things decay a little bit
so far so good no this is the old
algorithm I think this one is still used
or Kenz
you can't you can set it to be used for
stories but they've decided that stories
and comments need different rankings but
the nice thing about reddit and the
reason I'm using it as a teaching
example is it's open source so you can
read the source code and people have
commented on it including Randall Munroe
of xkcd wrote a long article about
actually pushed for the ranking change
that we're going to talk about and wrote
a long article about why the new reddit
comment ranking algorithm is based on
the idea that you want to approximate
the percentage of up votes so here I've
drawn them as green dots that you would
get if every user voted so let's say
there are n equals 16 users you want to
know how many of them voted positively
that's that V number so as if you had
every single person way and of course we
can't have every single person way and
we only observe some fraction and the
problem that we face is that when you
observe only a fraction of the votes the
the proportion estimate can vary
dramatically right so if you only have
three people voting and they just happen
to really dislike it whereas actually
most people like it you'll actually get
the wrong signal so what we do is we
account for the uncertainty in the
observed to vote proportion and ready
for this oh wait that's the next slide
that has the big equation oh yeah so
here's what I was just saying right
depending on which users we see we can
get dramatically different vote up
proportions so we're talking 70% versus
20% here
what we do instead is we compute the
confidence interval of the observed
proportions so you've all seen basic
sampling Theory right at this point
margin of error on a pole anyone
interesting okay none of you have had a
like basic stats class you have okay a
long time ago okay but you're familiar
with the idea that you can compute the
interval in which the true proportion is
likely to lie right
there's formulas for doing this or you
can do it with resampling techniques and
so forth well okay this is a
scary-looking formula it was actually
derived in like 1915 by this guy Wilson
and what it is is it's just it's just an
approximation to the confidence interval
and it's got it's a big formula but
notice there's only a couple variables
in it we have P hat which is the
observed proportion of upvotes
we have n which is how many people voted
and then we have this thing Z alpha
which is it's the confidence interval
it's do we want a 5 percent or a 10
percent margin of the sides and what it
actually is is the integral of the this
area the main thing to notice here is
that we have the observed proportion
plus a correction factor which is likely
to be very small because as n increases
this gets smaller and then plus or minus
this thing right so it's it's the
observed proportion plus or minus some
range and apparently this is what reddit
actually uses and what they do is they
rank the comments by the lower bound so
if we've had three people vote on it we
have we don't know what would happen if
we had everybody vote on it but we know
that it's gonna be let's let's say two
out of the three voted up so let me say
well 66% is our bad
estimate but because only a few people
voted on it it's like 66% plus or minus
20 so we're gonna say minus 20 in other
words it's at least 46 percent up vote
and we're gonna use that to rank it and
as more people vote on it that
confidence interval narrows so as more
people vote on it and increases and you
can see all this stuff has n in the
denominator so n + and so the confidence
interval shrinks and the lower bound
estimate can converge us to the observed
estimate as and more people vote and we
gradually rank it closer and closer to
the actual portion so in other words we
penalize comments where we don't have a
lot of votes to prevent them from always
dominating the top like the top of the
listings we don't let something that has
three votes and a hundred we don't
classify that ranked it as a 100%
positive votes if we only have three
votes all right
so we throw away that we yeah so right
so this so for stories they still have
time because they want stories to
disappear but for comments they use this
non time-based algorithm because they
want them to sort of stabilize it like
bubble the best comments at the top so
this scary formula notwithstanding the
scary formulas not important right you
you can use you can get nearly the same
effect with basically any formula that
penalizes items that don't have a lot of
boats yet but by looking at this common
ranking system we can see all of the
basic pieces that we can start to
assemble to build a filtering algorithm
a more sophisticated filtering algorithm
this is called collaborative filtering
how many of you have heard of this okay
so collaborative filtering is how Amazon
generates the product recommendations
it's basically people who bought this
also but Dad it was originally applied
for email filtering and the idea here is
that we look at the pattern of things
that you clicked on or that you liked or
that you gave a high rating to we have
some positive signal from each user and
each user only looks at a very small
number of items and then we try to
recommend whether to figure out whether
you would like a new item by saying well
who else rated things in a similar way
to you and what did they like so you can
think of it as a two-step process we
take your likes and cluster you with
other users who like similar things and
then ask what else did those users like
so here's the setup we have a data
structure called the user item matrix
and this is a fundamental data structure
in filtering algorithms right so we
spent the last two classes talking about
the term document matrix we're gonna
spend most of this class talking about
the user item matrix and in this case
user is our rating movies so I guess we
have a one to seven star rating in this
example this and subsequent diagrams are
from the reference in the syllabus and
[Music]
what we're trying to do is we're trying
to do an out-of-sample prediction so we
want to know how user five feels about a
movie that they haven't rated and then
in principle to build a recommendation
system what we do is to generate the
recommendations for user five we guess
their vote on every single movie and we
show them the top ten or that's one
simple filtering algorithm right so what
I'm when I'm trying to convince you of
is if we can solve this problem of
guessing how one user feels about one
item we can use that primitive to build
a filter so we're gonna look at a
variety of ways to do this day but this
is what Netflix does this is what
Facebook does this is what Amazon does
this is what everybody does and it it it
gets out you know like all of these
algorithms you get this basic algorithm
and it gets to arbitrary more
complicated but this data structure is
fundamental
one of the things I want you to notice
here is that unlike all of the work we
just did on text analysis we have
absolutely no content in the algorithm
right we are not analyzing what is in
each movie we are not analyzing what the
product is and this was very important
for Amazon because I mean books sure you
can do text analysis and cluster them
movies okay we're starting to get to the
point where we can do video content
analysis but hairbrushes batteries
tripods right how do we do content
analysis on these on these things we
can't really do it and so this is an
enormous advantage of this technique
because it doesn't require any knowledge
of what the item is another thing to
notice is that the user item matrix is
mostly empty and so the algorithm has to
account for that and you want to be a
little bit careful to make sure that to
distinguish between the user never
expressed an opinion and the user didn't
like it right so I saw that Facebook
post and I did nothing versus I never
saw that Facebook post are two different
signals and sometimes it's a little
tricky to wrote to build the algorithm
such that they are differentiated
so again here's the setup for this we
have this user item matrix we want to
predict the vote of one user on one item
and then if we can do that then we just
take the top items and we have our
filtering algorithm
so the core logic of this algorithm is
that we want to suggest similar items so
if we had a similarity metric between
items then what we can do is we can take
the items that the user has previously
said hey this was great I loved the 5
port pistachio ice cream product that
you sold me and we should be able to say
hey wait what about 5 quarts of
strawberry ice cream that's pretty
similar that's much more similar than a
miniature monster truck set so let's
recommend you the ice cream and not the
miniature trucks so again we're talking
about a distance metric but the only
information that we have about these
items again we have no notion of the
content is how people rated them so what
we do is we construct a similarity
metric based on how other how different
users have rated the same item and and
rating I'm using very generically right
it could be it's any sort of feedback
signal it could be purchase it could be
click it could be like it could be
commented on it could be shared and in
practice real acronyms use a variety of
these signals so here's what we do we
define a cosine distance like metric
between the ratings on two items here
they've notated it s IJ and it has to be
sparse because most users are won't have
rated most items so you can see those
dashes there
and we're looking across all users so
the idea here is that that the items are
sort of implicitly sorted into types you
can think of it like you know ice cream
is a type and sunglasses is a type and
action videos is a type and we have no
idea what these types actually are
although of course when you have this
data you can start to visualize them
through clustering algorithms but we can
figure out if two things are the same
type by looking if they got the same
ratings on the assumption that users
have types of things that they like so
here's the actual formula this should be
super familiar it's again from that
paper cosine similarity right two
vectors in this case they are vectors I
and J are vectors of the rating for two
different items across all users exactly
the same idea
notice that users who have not rated an
item just drop out of this formula we
only compute this formula over pairs
where both users have rated something
this by the way is one of the keys to
efficient implementation this is this is
we may have a billion users and Facebook
has a billion users but this is very
sparse right we only have to do this
over the posts that both users have
engaged with
and in fact there's a bunch of distance
metrics we use this is a version of
cosine similarity where we just subtract
off the average rating and what we're
doing there is we're adjusting for think
of it as adjusting for the grumpiness of
each user like you know a user who
really likes something may give it five
stars or they may give it three star and
if they dislike it something you know
you may give it four stars like when you
kind of you know if you really don't
like your uber driver you give them
three stars well maybe some people who
really don't like a movie give it one
star or maybe they give it three stars
so we're just normalizing and it sort of
goes from there you can you can sort of
build up more complicated similarity
metrics but they're all the same idea of
looking at all the ratings on one item
and then when we actually want to
generate a recommendation right so we're
trying to fill that one item so Pui the
prediction of user use rating on item I
what we do is we take the weighted
average of the rating that the user has
given for all other items so that's the
ru n right so the user's rating on item
and so we're summing over N where n is
every item that they've raided and we're
saying I'm sorry n is every item and
we're saying what is the similarity of
the item that we want to know about
which is I to the item that they've
previously rated which is N and then the
bottom is just a normalization factor
all right so we are waiting the rating
they gave each item previously by how
similar it is to the item that we want
to guess so when we run this algorithm
will find out that they gave five stars
to the table
- Oh ice cream and you know they gave
some reading to sunglasses that they
bought but we don't really care about
the waiting that they gave to the
sunglasses because the sunglasses are
not very similar to the tablet
strawberry ice cream whereas the
strawberry ice cream the pistachio ice
cream are very similar because people
who liked one tend to like the other and
so we mostly interpolate from the
waiting on the ice cream yeah well I
know you don't set it up that comes out
of the algorithm right so the similarity
is computed by asking do these two items
have similar ratings across users so you
don't have to tell it this is the big
advantage right you never have to tell
it that there's ice cream in sunglasses
and toys it figures that out
automatically by saying oh well people
who like this like that and if people
who like one type of ice cream tend to
like the other types of ice creams then
it will figure out that ice cream is a
category so this is this is an example
of latent structure and will actually
make this a little more a little more
rigorous a little bit when we talk about
matrix factorization
Oh
okay
assuming that everybody know everybody
relates related items in the same matter
that's the key right so let's say we
have different types of ice cream and we
never tell it that these are ice cream
it figure it learns that and then these
are toys let's say cars and crayons and
then these are users
okay so we're just gonna look at the to
user case with a couple of concepts of
items so
let's say user a likes ice cream and
user B likes toys what you're gonna have
here is five stars for all of the ice
cream and zero for the toys and vice
versa and then you have let's say you
add a new ice cream alright you start
selling a new flavor mmm
lime sherbert okay and you want to know
how similar is this lime flavor to all
of these other things which is this s IJ
we're trying to calculate well before we
can know that we need to get some
information about how they're you're the
other users have raided it and what we
discover is that user a gave this a
three and user B gave it a one and then
we asked well this vector3 one right so
oh dear I've transposed this from the
diagram there that's using columns for
user ratings how similar is it to each
of these items and does it look like
this vector alright does it does it look
like this or does it look like this and
the way we compute that is cosine
similarity so we take the dot products
with each vector so here let's actually
do this so this is 3 times 5 plus 0
times 1 this is 3 times 0 plus
five times one so this is 15 this is
five so what we find out is that this
new flavor is much more similar to the
ice creams than the twice and the reason
we know that is because the users tend
to like one or the other does that make
sense yeah sure
when they start musics
people who bought diapers new kids also
going through so the items are similar
in any way it was just that users were
reading them or buying them so the
system right right so I've been saying
type of item but it's exactly it's not
really type of item
what it is is items that gets a certain
type of ratings right it is you are
clustering and rating space so if two
items tend to be rated the same way by
different users then they'll be
clustered together even if they have
nothing to do each other but that's fine
because all you're trying to do is is
predict a new a new rating
be like whichever like so then you would
say something like crayons
right so it could it could turn out that
it's not that things aren't groups by
ice cream it actually turns out that
people who like green crayons like lime
ice cream right
maybe maybe people have their
preferences based on color and not type
of objects like who knows and will find
interesting surprises like the diaper
and beer issue diaper and beer yeah yeah
you know I've never chased down the
original source for that it's one of
these like AI urban legends except we
didn't used to call this AI the word is
getting bigger okay so that's the idea
we're trying to find items that have
similar ratings we're gonna go sort of
take a next step here so I've showed you
a user item recommendation first because
I think it's a little less abstract what
we're gonna do now is talk about doing
the same thing as matrix factorization
so you will recall that
we looked at factoring the term document
matrix and the we called the hidden
factors topics in this case the hidden
factors are you can think of them as
topics or type of object but the idea is
this okay so this is the user item
matrix and items go this way users go
that way and now I have to remember how
to do matrix multiplication again
it's alright
yeah okay good so then what we have here
is items and users and now what we're
doing is sort of a hypothetical model
right so we are saying that there are
some number of let's just call it topics
right or types of objects because we can
use this for anything we can use this
for ratings or articles or whatever
let's call it topics and this should
looks really similar to when we did
matrix factorization for topic modeling
and the idea is that each user is
interested we'll just use zeros and ones
for the moment in some set of topics so
this user is interested in politics and
beads and then each item also has a
certain set of topics so
this is a story on politics actually
that's this this row is the politics
topic right so let's say this is an
election and this one which has the same
pattern this has both politics and B's
so this is about environmental
regulation on honey in the honey
industry okay so it's about bees and
politics so it has a 1 in both both
places and then what we say is to
generate the observed user item rankings
and here we're just saying 0 1 think of
as click didn't click buy I didn't buy
like didn't like this can all be
extended to more general cases we're
saying well how the way that we get this
item so user eye item J
is we multiplied through all of those
users topics for the corresponding
position in the matrix right so it's
this item and this user so what is this
well this is an item that it's about B's
and so when we multiply through we get
a1 so that's the idea the idea is we're
saying you know what I don't believe
that the elements of this user Adam
matrix are truly random I don't believe
that what's here and what's here and
what's here are actually completely
independent we can't believe that
because otherwise we would never be able
to make prediction we're saying I think
there's structure and I think the way
that their structure is I think that
items have topics and I think that users
like topics and so I want to find the
smallest set of topics that captures
this structure well it's the same idea
with words and documents only now we've
got users and items so going back to
this slide ok so here we go we're going
to break this down matrix factorization
that's what we're doing finding factors
of matrix we represent users no names in
a shared latent low dimensional space of
dimension K so that K is here K topics
okay
late this word late means we don't
observe it but it's it's behind what's
happening it's you can think of this as
like this is what's happening this is
the underlying reality that is
generating what we
observe in dimension K user is
represented by a latent vector UI out of
RK so here's user I here's UI there's
that vector out of RK it's a K
dimensional vector and item J by a
latent vector v j out of RK so here's
the latent vector for the item that's
the V and then we form a prediction of
whether user I will like item J with the
inner product between their latent
reference representations Raj which is
this which we get by taking the inner
product which is a dot product of these
two latent vectors so far so good
hopefully I've translated the the gobbly
gook here yeah exactly right so that so
the beer diaper topic is yeah baby
stress male baby stress maybe right
maybe there's a chocolate diaper topic I
don't know
okay so that's where we're going with
this and you can build a recommendation
system out of doing this right so it's
it's it's basically the same thing as we
were doing before right where we had
this user item matrix it's just a little
more principled in particular there is
this weird thing here where what we're
doing is we are predicting the ranking
of a user on an item by asking what are
the items that are similar the ones that
have already ranked you could predict
the ranking on a user item pair by
asking well who are the users similar to
this user you can actually run this
algorithm either horizontally or
vertically as it were and matrix
factorization just avoids this problem
because it's a little more or
mathematically grounded and if you take
the hit of dealing with matrix
factorization which is you know you have
to run an algorithm to compute this
factorization if you're ok with doing
that then you get a much more
mathematically straightforward
prediction that sort of solves this
problem of well do I predict the rows
from the columns or the columns from the
rows to do this we have to compute the
factors and there's an iterative
algorithm that sort of bounces back and
forth we'll see in a second
it starts random for these factor
matrices and then computes the user
topics given the item topics and then
give it's the item topics given the user
topics and the idea is to minimize the
difference between the predictions and
they observe predictions so this term
here
roj is what we've actually observed the
ratings of I on J and then u transpose V
J is how we form the prediction that's
where we take that inner product and so
if we sum over all item pairs and take
the square what we have is the error
between the prediction
and the actual and we want to minimize
that and then this stuff is
regularization which I think we've seen
before and that just forces the solution
to have as many zeros as possible
because we would like these matrices to
be sparse UI is the topics for the user
Vijay is the topics for the item yeah
it's the actual data that you have on
who liked what or bought what and this
is sparse right so which is important
because if we have a million users and a
million items then we have ten to the
twelve rrj pairs and we we don't want to
compute matrices with ten to the twelve
elements and because we can't fit that
in memory so there is another way of
looking at this which is looking at it
probabilistically we've talked about
this in terms of we haven't really well
we really done is sort of played around
with matrices a little bit we haven't
really talked about a probabilistic
model at all and there's some advantages
to talking about a probabilistic model
because if we have a probabilistic model
then we can do probabilistic inference
which is a hugely powerful and well
developed a set of machinery to start
talking about this we're gonna I'm gonna
introduce this plate diagram and because
this this plate diagram is going to be
part of a plate diagram of the New York
Times recommendation system which is
it's gonna it's gonna be a little freaky
but we'll build up to it the idea is
this so again remember that a filled
circle means we observe something so
that our user rating of items let's see
that's the data that's the input to the
algorithm it's the only thing we start
with and then that box around the R
where it says I users means we have
ratings for our users and J items and
that overlap between the boxes that
means
we have a total of I times J items and
we're saying that there exist but we
don't know the topics for the each user
and the topics for each item which is
the U and the V and there is if you like
a the regularization
okay so regularization in an
optimization algorithm becomes a prior
when you look at this in a bayesian
context does anybody run into this idea
before ok so a regularization parameter
of this form when you turn this into a
probabilistic inference problem it's
actually a prior which says that the
distribution of values in these matrix
this UI and VJ entries are going to be
around zero and the width of this thing
is gonna be like 1 over lambda right
there's gonna be some factors in there
and I might be missing a squared or
something in fact I'm pretty sure I'm
missing a squared but this is the
general idea that when you make those
regulation regularization parameters
large what you're doing is you're
penalizing having large values which
means that you're saying that the prior
gets narrower and puts more of the mass
close to 0 so when we turn a an
optimization problem into a
probabilistic inference problem
regularization becomes a prior that's a
basic a basic connection between
[Music]
probabilistic inference and other types
of optimization problems and so our
regularization parameters in his plate
diagram become priors on the
distributions of these U and V matrices
so far so good
so then so we're doing the same thing
we're just thinking about it a little
differently and we're using a different
algorithm to compute it the reason I
show you this is this is the basis it's
a basic component of the collaborative
topic modeling system that the New York
Times uses and it's a bunch of other
people use presumably all right so this
highlighted stuff is what I want to show
you all right so there's the
minimization we just saw in the
optimization formulation in the
probabilistic formalization basically
we're pretending that the way that these
observed ratings get generated it's from
a probabilistic model and by the way
pretending that something is generated
by a probabilistic model is how we
normally do probabilistic modeling when
we are trying to figure out how many
cars are passing through an intersection
we say okay well let's pretend that we
decide when cars come by drawing from a
Poisson distribution right so this is
normal this is how we do probabilistic
modeling and we say okay let's pretend
that the ratings come from this process
it's a little more complicated process
but it's the same ideas pretending that
cars come from a Poisson distribution or
test scores come from a normal
distribution or whatever model you have
we say for each user we pick some some
topics normally distributed with
variance proportion to 1 over lambda
which is what I just showed you right so
there's a wider variation and the topic
values when lambda gets smaller and then
for each item we also pick some
EPIK values for that item based on its
parameter and then what we do is we
multiply the topics of the user times
the topics times the item and we pick
from a normal distribution on that and
we this C thing is a noise term
basically it's a some variation and
that's it right so that's the ideas that
we we pick some topics for the users we
pick some topics for a items and then
and when we picked them we picked them
on the distribution of this width and
then we multiply them together and add
some noise and that is the rating that
we observe so it's a really really
similar idea to our normal matrix
factorization we just we've put this in
probabilistic terms and the advantage of
doing this is now we can use all of the
machinery of Bayesian inference alright
so we have standard ways of doing of
estimating the probability distribution
of these latent factors that is to say
going from going the inference is going
from the R to the B and the U right
that's what we want to do to factor this
matrix and then when we have the V in
the u we can generate predictions just
by taking a dot product we can do that
with the standard like Monte Carlo
Markov chain methods that we use for
Bayesian influence how many of you have
done that by the way looked at MCMC
inference for Bayesian models okay
couple of you
so before the break just to prove to you
that I'm not making this up I want to
show you an article on the Facebook blog
talking about how they do it so here we
go I'll just scan through this really
fast collaborative filtering so that's
the name of the algorithm that we've
been talking about so the items the item
set is pretty complicated pages groups
events games and more best
recommendations come from people who
have similar tastes in other words a
taste is a topic in the language that
we've been using for a dict how someone
would rate an item right it's exactly
the same same thing their problem is
that they have a hundred billion ratings
and a billion users
so the standard algorithms don't work
very well and so they need to do a
distributed algorithm and so they talk
about this like matrix factorization
there we go oh hey that looks familiar
right they have one regularization
parameter instead of two but there it is
they solve it using a gradient descent
algorithm which basically you perturb
each value slightly and and and try to
take the perturbations that roll you
downhill reduce the error as fast as
possible they talked about a bunch of
different algorithms and then they
talked about how to do this in a
distributed way and what they end up
with is they break the users and put
them on different machines so there's a
different group of user on each worker
machine and then they put they split up
the items as well and they run this
estimation with this set of users in
this set of items and then for the next
time set they cycle the items so in
other words they keep the users fixed so
in other words they they keep the user
tastes or user topics on the same
machines continuously while cycling the
items between machines and in each point
they estimate a chunk of this matrix so
I have if I have these users and these
items then those users and those items
can be used to estimate some piece of
the matrix right so I multiply these
things together that gives me
predictions I compare them to the
ratings that I actually saw that gives
me the error term I take the derivative
of the values that I have with respect
to the error term that gives me a
gradient I take a step in that gradient
to try to reduce the error for these
section of the matrix and then I move
then I try it again with on this machine
with the same users and a different set
of items which means I'm now estimating
this block while one of the other
machines which has a different set of
users was estimating this block and as
nasta making that block so there it is
in reality being done at scale and a
distributed algorithm bringing this back
to journalism why am I telling you all
of this and in the journalism school
because this is it folks when people
bitch about the news feed and say that
it's tearing us apart it's only showing
us popular items it's making filter
bubbles it's reports clickbait sits and
we'll we'll get to that today we'll talk
about all of the complaints that you can
level against the recommendation system
this is where it comes from okay and if
we want to fix these complaints we have
to design a better algorithm so it's
important to start with what we are
actually doing
yeah it could work being more or less on
time one minute late okay so now we're
going to talk about the New York Times
recommendation algorithm our discussion
is going to be based on this post from
2015 also alex has actually come and
talked to our class in previous years he
M I wasn't available this year no
different Alex Alex pang at the time
that this was built this powered the
little recommended for you box they're
recommended for you box is no longer on
the homepage in fact I think it's just
gone but now this powers personalization
and all the section pages so if you are
logged in then you go to the home page
or you go to one of the section pages
those recommendations are those stories
that you see will be in part based on
the recommendation algorithm
yeah
right so when you when you look at it so
what's actually happening is rather than
when when you're logged in when you look
at the New York Times section you will
see not just some combination of what
the editors think are the hot stories
and what the algorithm thinks that you
are likely to read we're also going to
be looking at a snapshot of the
algorithm circa 2015 2016 this stuff is
constantly evolving
so I'm not gonna say that this is what
the New York Times is doing today
because I don't have details of what
they're doing today but this is what
they did recently and as you'll see it's
a pretty general method and it's if you
start reading this post which is again
linked from the syllabus you'll see they
talk about two problems so initially
they had a content-based recommendation
which was just based on keyword tags
[Music]
then they talk about some problems here
you can imagine a lot of these
criticisms would apply to tf-idf as well
right so these these are algorithms
which use the text of the article to try
to figure out what category it's in look
at what categories you've clicked on
maybe what you do is you keep a vector
of which is the centroid of all of the
articles that you've read and you look
at the distance between that vector and
the vector for a particular article and
you rank it by closeness so if you read
a lot about bees they will recommend
your bees articles this has some
problems one of the big problems is that
it doesn't give you articles about
topics that you're probably going to be
interested in but don't know you're
interested in collaborative filtering is
what we just looked at right it's the
user item matrix it's matrix
factorization and it can give you an
article about polish cinema even if you
normally only read articles about bees
if it discovers that all of the people
who like the same things as you also
like polar cinema so maybe people who
like bees also like polar cinema it will
figure that out by looking at what other
people like you are reading okay but
collaborative filtering has a big
problem which is this is often called
the cold start problem which is that if
nobody has yet read an article we have
no information about it so it says right
there this approach fails at
recommending newly published and
explored articles articles that were
relevant to groups of readers but hadn't
yet been read by any reader in that
group so
in 2011 there was this paper published
called collaborative topic modeling
which combines the two technique and
that's what the New York Times
recommender is built on and that is what
we're going to be studying so here's the
idea there's different again there's
different ways to do filtering so we've
talked about content recommendation that
was the first class with tf-idf and
stuff there are no users in content
filtering we've talked about
collaborative filtering which I think of
as part of the broader class of social
filtering which is you use user signals
to determine what you see and so
actually I would say that Twitter is in
this class as well you see stuff from
people you've followed and actually
algorithmic recommendations from people
you haven't followed it unless you've
turned it off like I have because I just
want my reverse chrome dammit or even
the reddit ranking algorithm that we
looked at it is also possible to build a
recommend a recommender that uses both
the content of the item and how users
are reacting to it and that's what we're
that's what we're doing so here's the
idea
they're distinguishing between what they
call in matrix and output matrix
prediction so in matrix prediction is
the collaborative filtering problem that
we just talked about and where we're
asking how a user feels about an article
out of matrix prediction is when we're
asking about how a user feels about a
new article where we have no information
at all so collaborative filtering can't
really do out of matrix prediction
because we we have no information about
those rows of the matrix and therefore
this
you know what topic that item is and so
forth yeah so we basically have no no
information on these these rows in the
in the item matrix so we can't run our
classic algorithm it's it's a
underdetermined on the other hand we
know the content of the articles you
know the topics in the articles so we
could predict that way so the goal is to
combine these two so we looked at this
last class this is LD a topic modeling
this this again is a very you know it's
all the sort of the same idea of a
latent variable right so what this is
saying is that each document is a
mixture of topics which is this each
word is particular topic which is this
what we actually observe is the words
and then how we pick the words once we
know that topic is oh I see when I slide
the mouse that's why it keeps advancing
when I don't mean to how we pick the
words is each topic as a set of words
and it's roughly akin to factoring the
term document matrix is a little it's
sort of a probabilistic form of that and
then we've got these what they call them
hyper parameters which you will play
with a little bit in your assignment so
this is just modeling the content of the
documents this is just modeling user
behavior we put them together and we get
this horrifying diagram okay
so who is horrified by this diagram yeah
okay I was when I first looked at this I
was like what so the red the red
annotation is all me this is me
scribbling notes to help me understand
what the hell is going on here and so
here's the idea the only things we
observe are the contents of the
documents and the ratings that the users
give and we're gonna say well where that
actually comes from is the ratings that
the users give is based on the topics
that the user likes plus the topics in
the document and the topics in the
document is in part determined by the
ratings right this this produces the
rating so when we run the inference
algorithm we go backwards from the
ratings to the topics in the document
but so this is the topics in the
document as far as user behavior is
concerned this is the topics in the
document as far as text analysis is
concerned and this arrow says that the
way the document acts as far as user
behavior is going to be based in part on
the content of the document right so
this arrow here is the connection
between the this area the content of the
document which is this and the way users
act around it that's the link between
these two things and intuitively you're
gonna think they're gonna be kind of the
same right so if you have articles that
are talking about bees then you'll have
you know a high rank on the bee topic
for that article but if it turns out
that it's actually talking about the
portrayal of bees
a movie then the way the users who like
movies will also be clicking on that
article even if this doesn't use the
words like film and movies so this
allows the meaning or the topics in an
article to shift based on user behavior
and you'll see a diagram of that in a
second and then we go through all of the
the usual rigmarole when we actually run
this algorithm what we do is we start
with the content of the articles and all
the previous ratings and we when we
first add an article we infer the topics
for the document and also the topics for
the document
or that in terms of the collaborative
filtering and as we start to get ratings
on that article we update the topics
that this this.v variable here so
basically v starts out the same as theta
when we don't have any clicks on
anything and the model allows V to drift
away from theta in other words the
topics in the document to drift from the
topics we obtain from text analysis as
we get more ratings
no I think it can only be people who are
logged in because it needs to track what
articles you've read right it needs to
figure out which user you are
if it gets move please there's probably
some portion of the system that takes
into account overall popularity but this
model does not so here's kind of what it
looks like and this is the diagram from
the blog post I just showed you so these
are two topic dimensions they're labeled
by words so this is exactly what we saw
when we were doing PCA type stuff or you
can think of this as maybe this is a
word embedding dimension maybe it's a
latent topic from Lda
but this is sort of a politics topic
this is sort of a car I don't a road
trip topic I guess and every document is
some point in this space in fact it's a
much higher dimensional space than this
and if we just use analyze it based on
the copy we get a point here you can
think of this as the theta but if we
look at how people are reading the
article right so every every user has
topics as well right so this U is the
topics for the user so we can plot the
users in the same space and we learn
that the people who are reading this
article tend to be here in the space so
we push our topics from for this article
in the direction of the users who have
read this article in other words we
start when we have no ratings for this
with this theta and then as ratings come
in and people read it we know where
those users are in topic space and we
push theta in the direction of the users
who are reading it which gives us V
which is the ultimately what we take is
the topic for that article and as usual
when we want to do a prediction we
multiply we take the
our product the dot product of you and B
so as we get more click data the topics
for that article drift towards the
topics of the users who are reading it
all right
yeah social is the click data yeah
doesn't mean like
you know it's not social media it's it's
social interaction it does not we are
not talking about social media here we
are talking about user click-through
behavior yeah I guess users let's look
at the model so if a user reads many
articles on a certain topic then that
has to influence the topics for the user
yeah so the users will yes the users
will also get pulled in the direction of
the target of the article good good
catch
right so in principle you rerun the
inference on the model of it after every
click but you can't actually do that so
there's a question of batching this and
I think in yeah here they talk about the
small right thinking this article this
is a great article usually this is worth
working through in some detail if you
understand this article you understand
quite a lot about the state of the art
of recommendation systems I think in
this article they talk about yeah here
you go here we're modeling users yeah
there's lots of good detail
[Music]
yeah I have to read this much more
closely
they must batch it in some way they're
not rerunning on it every click but
there's a nice property of the
optimization algorithm which is it's an
iterative algorithm right every step you
run it gets it reduces the error so they
can take the previous computed topics
for users and articles and use that to
do an update step to incorporate the
newly published articles and the newly
arrived click data so they can just sort
of keep this model running iteratively
alright so that's it i recommend you you
read this article it's really it's it's
dense there's a lot there you can also
read the original David Bligh blind wing
topic modeling article which is this one
they originally did it for scientific
articles because everybody doing NLP
starts with scientific articles which
are the worst training data because it's
so clean but if you can make it through
this article in that post you will know
quite a lot about how recommendation
systems work
the next filtering algorithm I'm going
to show you is a little bit different
here instead of picking articles to
recommend to readers
we're gonna pick tweets to show to
journalists this was a system that
Reuters built and has published a bunch
about called Reuters tracer and the idea
is you ingest the Twitter firehose I
think they are not quite the Twitter
firehose but lots of tweets anyway and
you try to find stuff that is newsworthy
all news organizations at this point
have some process for sourcing stuff
from social media this is a you know a
slide I can tell from image quality I
was in the back and cleaned it out
but I've never seen a chart like this I
really love it this these are all the
steps that the Associated Press goes
through to verify content and in fact
there are now entire books on this there
is a book called the verification
handbook which was written in
collaboration with the BBC and this is
an entire book on best practice for
verifying stuff coming out of social
media and it has a lot of steps so for
example you have to contact the person
who posted it both as a verification
action you know where did you get this
from that sort of thing but also to
clear it for copyright and then the the
AP has a bunch of more steps which I
think are fascinating so you look at the
sources history find the earliest
example look at the sources history
actually talked to them compare with AP
reporting so you've gone through all of
this effort to have reporters on the
ground covering stuff that does this
match what they've seen
check with language experts you know the
the for global news service it's not
going to be in one language and so you
have to make sure it's the language it's
supposed to be and dialect is correct
and you might pick up discrepancies and
in you know
slang or word usage and then there's
this verify context step which could be
sort of arbitrarily huge so there's this
whole step but the first thing you have
to do is is find it and then find the
earliest version and so that's what
Reuters tracer tries to automate this is
what the system looks like there's a
bunch of columns on they call them
channels on different topics there's one
that's just called news and it just
tries to find news in general each of
these is a story cluster the
where does it show it somewhere in this
interface it tells you how many tweets
are in that cluster this little star is
a newsworthiness star it so it tries to
actually figure out okay not only is
this news but this is the sort of thing
that writers normally covers these dots
are a veracity score it tries to figure
out how likely it is for something to be
true oh yeah these little numbers that's
how many tweets in that group so I think
three is the minimum here so it tries to
find the cluster of tweets on the same
topic and rank it for veracity and
newsworthiness in real time on across
all of Twitter
and it does sort of tagging stuff as
well tries to assign categories to it
here's the basic architecture first we
throw out the things that aren't
newsworthiness so we don't care about
people just chatting about stuff we want
events happening we group all of the
Suites into events we assign these
various scores and then we map against
various queries and alerts that
reporters have have set up previously
here's a more detailed system diagram
this stuff that is in the a box is at
the tweet level so for example we throw
at fam spams this is they call this the
chit chat filter we rank it on you know
how likely is it to be an actual event
or rather than just people talking and
then these are clusters we group the
tweets into clusters where you do more
spam filtering and then we do for
example event verification how likely do
we think this is to be true newsworthy
this ranking topics each of these is a
machine learning model that's a machine
learning model that's a machine learning
model that's a machine learning model
there's like a dozen models in here to
make this all work it's a quite a
sophisticated system one of the first
models is try to break things into
events or not events and so they have
this scale which I think another
researcher developed and so they have a
classifier which is these are all you
know I don't I don't know exactly what
they're using but there there's score
classifiers right so you can you know
puts a score like logistic regression
like where do I think it falls on here
it's a difficult classifier to build
because the input data is extremely
unbalanced almost everything is not
interesting so you know it's sort of a
thousand to one from the left to the
right side of the diagram so it's
actually quite difficult to build a
classifier because a small false
positive rate well just swamp the system
all right if you have a thousand to one
ratio
then a false positive rate of 1% means
there's going to be 10 times as many
false positives as true positives so
it's a it's a difficult classifier to
build and then they do all of this data
to time it against what reporters are
already doing so they compare both
versus Reuters and versus other things
let's see how does this work
I think negative ones are where the
algorithm is faster and they're
comparing it on three different types of
stories these are ones that we know are
coming and
[Music]
interesting they're comparing it there
are also global misery or organization
so they're comparing it to non-english
speaking locations because one of their
hypotheses is that this is really going
to help source news from other countries
where they don't have a lot of reporters
who speak that language and events that
we're not extracted from Google I have
to read the paper again to know exactly
what that is
but anyways this shows you that a lot of
the time it's significantly faster right
so like up to an hour faster
oh that's all of it that I have for this
this paper so this thing has
dramatically sped up their reporting on
a fairly large set of events which for
Reuters is extremely important because
they are a business news source so
getting a couple minutes ahead can be
convene a lot of money their customers
especially if they're doing algorithmic
trading which we're going to talk about
repeatedly in this class including today
alright questions on Reuters tracer
there's a couple really good articles
about this actually
yeah there's an internal research team
and software team I mean Reuters is
almost unique among news organizations
in that they have a lot of money and a
lot of technical talent so they can
afford to build these big systems I was
at the you know when I was at the AP
actually my boss at one point asked me
to look into building a system like this
and I was like well I'm one guy with
responsibility for getting the news out
each day so I don't think I can get very
far with this but yeah they've built
this over several years yeah there is a
similar system that is public called
Data Miner which you can use to
basically very advanced twitter search
including over archives and event
detection all of this
it doesn't do veracity ranking or our
newsworthiness detection it's not tuned
for news and actually how they tuned
this system is they took a whole bunch
of events that occurred on twitter a
subset of which were events that were
covered by a reuters and they trained a
model to predict which events would
actually be covered by the
journalist so I find this fascinating
because it's one answer to the question
of how do you decide what makes it
through the filter and what they've
decided is what should make it through
the filter is what currently makes it to
the filter by a human journalists now
it's not a bad or a wrong answer but it
it is going to replicate whatever biases
the humans have
and when I say biases I don't even mean
that negatively right there are reasons
that journalists cover certain topics
for example their customers are
interested in certain topics but there
are also going to be bad reasons for
covering and not covering things so
whenever we are building filters we have
this question of how how do we decide
what news is which is a fundamental
problem and there's basically only two
answers you can decide based on what
journalists actually do or you can try
to build it from first principles in
some way right and you end up writing
these rules like well if more than three
people are killed then we should
probably cover this story which seems I
don't know that's that's a hard set of
rules to write all right what are the
ethics there how do you decide what gets
covered but if did you end up running
into these problems it's exactly the
problems journalists don't normally like
to think about but implicitly do anyway
there's lots of research showing that so
by the way not all murders make the news
right yeah it's depending on the city
between 30 and 70 percent it is skewed
towards in exactly the way you would
think so a murderer is more likely to
make the news if the victim is a young
white female famous there's a
particularly violent method it's a sex
crime or there are multiple deaths
exactly what you would think
you know I doesn't seem quite right I
think some of those are justifiable in
some way but if you're not gonna cover
everything you have to articulate some
principles and normally we don't
articulate the principles and we just
have this sort of magic newsworthy in a
sense and coding designing filters
forces us to be explicit about these
choices which i think is wonderful I
think it's it's time that we pay more
attention to these choices okay we've
talked so far about algorithmic filters
implicitly all of them are involve
humans because the humans read the
articles the humans click on the
articles when I say human-machine
filters I mean something a little bit
different I mean there are professional
humans in the filtering process and this
is actually extraordinarily common in
fact pretty much every filtering
algorithm or filtering system that you
will find is a hybrid system that has
humans working in it as well
this is the earliest reference to humans
working in the filtering system that I
can find this is a Technium and media
gazer which are based off the same
engine and initially it was just an
algorithm scanning all of the blogs and
news sources to try to find newsworthy
stuff and then in 2008 they changed it
because they found that it was making
mistakes and what it actually how the
system works now is that the human
review runs five minutes ahead of the
homepage so the algorithm runs and then
with a five-minute delay that is copied
to the homepage and in that five minutes
there's a human editor looking at it and
checking stuff out reordering things
adding deleting stuff right so there's
now a human editor watching the page
all the time does anybody remember the
Facebook trending module yeah this is
quite a story if you dig into this
there's a long history they finally just
got rid of it but here's what it looks
like at the time so this is circa 2016
and the ideas they wanted to like pick
out news originally they had human
editors for a variety of reasons
including allegations of liberal bias
they removed the editors I think
actually it was just convenient time for
do that then I think actually the main
reason they removed them is because they
want a completely automated system so
that they can run it in the a you know
100 languages that Facebook runs in but
as soon as they did that so you lost a
few things first notice that there's
there's a little summary here that was a
human written summary because machine
summarization is not very good yet we
lost the summaries we just have the
entities now and it started almost
instantly reporting false news as well I
didn't say fake news because well two
reasons one as I hate that word the
other is that Facebook's documents about
this call it false news so that's what
I'm gonna call it here they eventually
got this under control I don't know what
extent they have humans in the system
now but we do have a nice set of
documents which were leaked at circa
2016 these are the most detailed
documentation on exactly how this hybrid
filtering system worked
it's a remarkable document because now
that you know how the things like the
topic finding algorithm might work we
also know how the humans were used
instructed to work with the topics that
were found by that algorithm we
basically have the description of a
complete large-scale production news
finding system and this is the only
public description of this sort
that I'm aware of so if you study this
you'll really learn a lot about how
these systems work and this is what it
looks like so on the left here these are
the topics that it found and I wonder if
I can zoom in here oh I can just I can
just do this
there we go
so these are the the topics that it the
the trending algorithm found it just
looks at stuff that people are sharing
and sometimes it'll find a a weird
headline so for example here it said
Denver Broncos hashtag Broncos and it
said actually let's call this the
Indianapolis tall Colts and you can see
this whole system they had for editing
it so the importance you can write a
headline it shows you how to appear you
get to pick the URL that that link goes
to so it's kind of like the algorithm
finds a cluster of trending things and
then the human reviews it and sets it
out and there were instructions like you
know only include topics that are
actually about a real-world event as
opposed to just a bunch of people
talking about something and it's
actually quite a long long document and
it's pretty I don't know I thought it
was pretty fascinating to read so here's
the document nicely leaked and they
built it into three different teams so
they had this whole pipeline built out
of humans so throughout topics that
aren't real-world remove duplicates and
and it tells you how to use all these
tools their process today is completely
different I don't I don't actually know
what they're doing but I still think
this is pretty pretty fascinating and
look at the problems they're solving
right so that the thing has to have a
recognizable name that's a very
difficult thing to do algorithmically
you're supposed to fix these problems
alright so it might might give you a
headline like this and you turn it into
that this is one type of human
participation in topic filtering oh
there we go
there is another really important type
which is common every major platform now
has a large group of human comment
moderators and of course the comment
moderation policy is along with the
algorithmic design controls what is
visible and so if you are concerned
about ideas like public discourse then
these rules are important and this is
actually a nice little little quiz that
The Times published a few years ago they
turned it into a little game where they
showed you a bunch of stories and you
had to approve or reject and in this
case no we don't allow name-calling and
these are designed to showcase a lot of
the difficult calls
I think that's okay yep
you can criticize the times apparently
have to explain your reason and so on
and I actually I I think it's great that
the Times published this that they
published their guidelines and they
really got ahead of this discussion of
what are you allowed to say Facebook did
not and so they took a lot of criticism
for you know they're deleting this
they're censoring that and it was very
hard to tell if those were policy or
isolated incidents eventually the
Guardian published leaked documents and
here's this is from the Guardian story
it talks a little bit about what the
guidelines say reading these guidelines
is super fascinating because you have to
deal with all of this really gnarly
stuff that you wouldn't even think about
the first thing to say about comment
moderation policies is that it's really
complicated the you're balancing a
number of factors including freedom of
speech and versus you know hate speech
or harassment but you're also you know
violence porn law enforcement and you
have to make it economically feasible to
economically and technically feasible so
actually how all of these systems work
is you have a machine learning model
which kicks stuff which is questionable
you know could be porn could be violence
and then that goes to humans to make a
final call but you know these are barely
nuanced calls right some photos of
non-sexual physical abuse and bullying
and children do not have to be deleted
unless there is a statistic or
celebratory element that's quite a
detailed rule and
you may or may not agree with any of
these rules in particular but I hope
this gives you a sense of how
complicated these rules have to be
including look at this anyone with more
than a hundred thousand followers is a
public figure which denies them full
protections given to private individuals
that is precedent that comes from
journalism so laws around privacy are
different if someone is a public figure
privacy and so for example a criticism
of a politician
that's okay right but criticism of a
private individual is it's treated a
little more skeptically under the law
yeah guidelines for content moderation
followed very closely to legal
categories which leads to some
surprising things so threats against
children not a particular child but
against children in general there was
one version of the Facebook content
moderation guidelines I said that's fine
all right well doesn't have to be
removed right that's the thing I that's
the problem with these these sorts of
guidelines is that they immediately
spark the idea that Facebook thinks it's
okay to have boolean content it's like
well no they don't necessarily think
it's okay they think that it's that they
shouldn't it shouldn't be their job to
decide to remove it from the platform so
this was leaked eventually they posted
this document a couple years later so
I'm of the opinion that if you are
running a large-scale filtering
algorithm you need to very quickly
publish your moderation guidelines this
stuff is complicated it's difficult a
lot of problems can be avoided if you're
transparent about what decisions you're
making of course you're sort of damned
if you do damned if you don't because
nobody is going to agree yeah you're
gonna take flak whatever choices you
make but you have to make choices so
you've got a you got a document them and
I think it's interesting that the New
York Times did the comment moderation
quids two years ahead of when Facebook
published to this community standards
document so I think they had a better
sense of their responsibility to its
community in this way although Facebook
has a much much more complicated problem
the New York Times because they operate
with a much wider variety of users and a
much wider variety of political contexts
and that just an absurd scale where were
a billion users right so we're talking
like
10 to the 5 difference in scale Facebook
has tens of thousands of moderators
worldwide
there has been a lot of criticism of
filtering algorithms in a variety of
dimensions we've already talked a little
bit about the what should you see
dimension which is a really important
dimension I want to talk briefly about
filter bubbles who's heard the term so
people started talking about the idea of
filter bubbles back in the 90s Nicholas
Negroponte who used to leave the MIT
Media Lab had this concept called the
daily mean he foresaw personalized news
and foresaw that it could lead to a
narrowing of vision daily me you read
the daily me instead of the daily news
now polarization is real I'm going to
show you a few slides to suggest this
this is each dot is a political book
this was during the 2008 election each
arrow is people who bought this also
about that so you can you can make this
diagram just by tracing focus on Amazon
and you see that they break pretty much
into red and blue books and this is not
content analysis this is purchasing
decision analysis so people buy one or
the other this is a similar diagram from
Twitter I think this is retweets as
opposed to following and the colors are
assigned by a classification algorithm
and you can see that the network
structure is such that there's a lot
more within group than cross group ties
this is a more recent analysis in a
different context this was a couple
years ago when there was a fighting in
in Gaza and Gilad Lawton who's now chief
data scientist at BuzzFeed built this
and this is the co-occurrence of Twitter
Instagram hashtags so each circle is a
tag and two tags get links between them
and therefore our place too near to each
other if they co-occur on
the same images and you can see there's
basically three different clusters here
there's the pro-israel pro-peace Dillian
palestinian and then connected and
related to the palestinian but distinct
here's a religious cluster other way
around actually this is the religious
cluster this is the palestinian cluster
so polarization is real you can see this
network structure in society in almost
any place you look anytime you have a
political topic you will find these
types of cluster structures people do
sort into tribes absolutely I always
have there's there's lots and lots of
research about about that but this
doesn't mean that there's a filter
bubble necessarily or more precisely it
doesn't mean that our filtering
algorithms are causing the bubble in
fact the evidence that our filtering
algorithms are causing people to only be
exposed exposed to agreeable viewpoints
is quite weak so here's the idea of the
bubble this is an Eli Pariser quote and
he wrote a book called the filter bubble
and what he's saying is that what people
care about is shaped by what they see
and I think the key point here is that
you get a feedback loop in principle if
you only read articles about bees than
the recommendation algorithm will only
show you articles about bees and you are
never going to care about automobile
safety
yeah or cultural critics actually is
mostly where it's kind of a mem it's
it's actually I think the reverse I
think the mathematicians are the one
debunking it more political scientists
there's a bunch of like skepticism on
filter bubbles these days but that took
a few years to start building and you
will still find a lot of breathless
articles that say that Facebook is
splitting us apart without citing any
evidence to that effect in part because
it's hard to get the evidence which is a
problem so I'm going to show you a few
different types of evidence to argue
against the idea that filtering
algorithms are driving tribalism which
is not to say that they couldn't be used
to diffuse it right to say that that
algorithms are not driving people apart
doesn't mean that we shouldn't try to
tune the algorithms to bring people
together it just means that they're not
the problem they might still be the
solution or part of the solution so
here's one type of evidence this is
polarization and voting behavior so this
is polarization at the elite level as
opposed to the citizen level but you can
draw similar graphs for population
attitudes and the thing I'd like to
point out is that the recent rise in
polarization started well before the
internet okay so whatever was causing
polarization to increase in the 80s and
90s was not Facebook here's another very
interesting study the increase in
polarization is larger among people who
are not online so there's a variety of
ways to do it you could for example use
I think this is what they did for this
data from the General Social Survey and
look at clustering of attitudes it's a
great question polarization can be
measured a bunch of ways one way you can
do it is take a bunch of questions on do
you think they ever should spend more in
the environment how do you feel about
abortion how do you feel like gun
control you just take a bunch of
political issues and if you do a PCA
plot answers on that
you'll find clusters now this is very
interesting because why should how you
feel about climate change tell me how
you feel about abortion there's no
logical connection between those topics
but there is a cultural connection and
the higher the correlation between
separate topics the higher the
polarization so you're looking at how
well all of these different attitudes
collapse into a single linear scale in
other words how much of the variation is
captured by the first principal
component and if that principal
component captures most of the variation
and the second component is much smaller
that tells you that basically people are
answering these questions on one axis
and so that gives you a measure of the
polarization you can look at the
distance and you can look at the ratio
of the size of the first axis to the
size of the second axis or you could
look at cluster structure and group ties
there's all kinds of metrics you could
use to look at this and I would have to
read this paper again to see what this
one is but by their measure of
polarization which is this index here
which combines a bunch of these scores
people who use the internet more which
is the this blue line or in this case
younger people again blue had a lower
increase in polarization so that's kind
of a problem for the theory that social
media use is driving an increase in
polarization
another difficult factor here is that
porting information in front of someone
doesn't necessarily make them read it so
this is from an internal study at
Facebook which was on diversity and
polarization which is a fascinating
study now it comes with a big plus and a
big minus the big pluses they used real
Facebook data so they don't have to
guess or use it proxies or use Twitter
data or something the big - is it comes
from Facebook so it's hard to say you
know what's de-emphasized or emitted but
sure enough so the left graph graph is
sort of ideologically what they call
cross-cutting so showing left content to
write people in vice versa and when you
do that people so potential means what
articles are available exposed means
they actually saw it in their newsfeed
and so you can see that the newsfeed
shows slightly less cross-cutting than
ideologically aligned stuff but not
hugely less and mostly that's because of
the large number of weak ties it's
because your Facebook friends are not as
homogeneous as you might think they are
and the people who are not close to you
on Facebook you're not close to the -
all of your 500 friends which means most
of the people who are sharing content
that might appear in your feed are
actually fairly different from you are
clearly diverse and so even though you
see stuff from your close friends more
because your weak ties far outnumber
your close friends you actually see
quite a lot of stuff from them this is
also why you get jobs through weak ties
it's because there's a lot of them and
so what actually happens is you get fair
exposure to diverse opinions but people
do click on it lads that effect is real
so I don't think that social media is
driving polarization I do think it still
plays a role in potentially resolving it
which leads us to really the entire
point of this whole class which is I
want people to design better filters in
fact one of the former graduates of this
program is now working on the Facebook
knows news feed which makes me very
happy because I hope they've taken some
of the values that we discussed first I
want to talk about what is the
information available to build a filter
so let's make a list I've sort of sorted
it to three categories in this slide but
in each of those categories there's a
huge amount of data so if we're we are
building a recommendation system for
let's say articles with text content
what information do we have to recommend
something let's make a list
yell it out
think that we have like yeah what
information do we have to build a
recommendation algorithm what what are
we gonna feed into the algorithm let's
view it is sure so in viewing history
there's two types yours and others okay
what else yeah
so the topics in the text yeah
topics from let's say talk bricks from
clicks right so this is classic
collaborative filtering that's classic
topic modeling okay what else think of
everything we've talked about in this
class right all the algorithms we've
looked at there's dozens of factors we
talked about yeah so that's that's may
be subsumed in here but let's just write
it shares and retweets okay what else
what else is news or what else does
Facebook personalize by for example yeah
let's call it location and more
demographics right your age sex and so
oh yeah expressed interests right so
many filtering algorithms you can say
I'm interested in this so Google News
will do that for example and other user
controls right there are you can turn on
and off algorithmic recommendations and
Twitter yeah exactly
so let's shorthand that by saying social
graph oh no so stop following is
different that's an expressed interest
but the structure of the social network
is used as well
yeah
okay so is that in usage history or is
that I think so for example what device
you're on can be you right device time
of day right so if I'm looking if I'm
reading on my iPad at night versus
reading on my phone on the subway in the
morning might want different things yeah
exactly
if I if you have my location you get
even more you know I'm at work
it's good theory we'll see what happens
when they introduce Wi-Fi on trains one
that no one has mentioned but this is
the first one we looked at is recency
how long ago was this article posted
news context
oh that's good yeah so what's trending
that's that's great slash context source
yeah
that's an obvious one that's an
important one yeah alright the more we
dislike it's hard to actually exhaust
these there's there's several dozen
factors what else
Yeah right okay so SEO or let's let's
say page structure so for example google
penalizes slow loading pages in there
right in their rankings
we have viewing history
so you're saying if you can predict how
long the user will be on using your your
app content yeah Wow okay
predicted visit time I'm just gonna say
content long short video text etc
okay we'll stop there the point is
there's a lot of signal in here a lot of
stuff and I mean it basically breaks
down into this this right so stuff about
the item stuff about me stuff about
other people and so i i've stated the
filter design problem in this way we
want a a rating or a relevance of a
particular story to a particular user
given these sets of information and also
the previous results right that's
actually when we didn't put on this list
but have i shown you this before have i
shown you something like this before
so facebook definitely tracks that it
tries not to show you similar things too
much another way to state the filter
design problem is in terms of the goals
so it's not just a mathematical function
its an algorithmic object and in fact
algorithmic plus human object because
you have to write moderation rules and
so forth designed to produce particular
outcomes it has to be both economically
viable and technically feasible and we
hope also has some sort of normative
goals both for the person this is a good
product for me to use and the society
this is improving public discussion or
whatever however you want to talk about
that
there are different ways of sorry i know
we're running over time here
we'll wrap up in just one minute there
are different ways of talking about what
it means to have a good filtering
algorithm this is one that twitter has
been talking about this idea of
conversational health they have ideas of
diversity here they have ideas of shared
reality you know are we all talking
about the same things these are sort of
filter bubble ideas there are multiple
different definitions of diversity when
you start digging into this idea
you can connect different ideas about
diversity to different theories of
political participation so you might say
that diversity means you know I should
have variety of opinions you might say
diversity means I I should be served
exactly those things I've expressed an
interesting or I might want to be
interested in or you could say it should
challenge people right
I shouldn't necessarily have balanced
content right this is if this is a more
radical theory of what a filtering
algorithm should do and each of these
could be turned into an algorithm or or
a modification on an algorithm

Algorithms: Filter Design

Phone

Contact Us

Algorithms: Filter Design

Related Content

Phone

Contact Us