Algorithms: Text Analysis
Frontiers of Computational Journalism week 1 - Introduction and High Dimensional Data
Transcript
so welcome everybody. The course is called Frontiers of Computational Journalism because it's a research-level introduction to what people are doing with computers in journalism, and the first thing we're gonna do is define a little more carefully what that means and show you some examples. Then today's lecture is called High Dimensional Data. I imagine a lot of this will be familiar to a lot of you, but probably not all of it will be familiar to all of you. As we proceed through the course I'll get a better sense of who's done an IR course or an ML course or, you know, worked with high-dimensional data already, so this first class is really just about getting up to speed on a bunch of stuff. Then we're gonna get into the TF-IDF model — who's done that before in some form? Okay, yeah, so we're gonna find that a lot: about half of you have done some of these things. In particular we're going to develop it, and I'm gonna show you how I used it in a piece of software called Overview, which is a document mining system used in investigative journalism. All right, so that's the plan. Questions so far?
great so the first thing I want to talk
about is what's the what's the
definition what are we doing and so I
pulled a few different definitions I
think this one is my favorite this is
this is the idea of sort of data mining
all right so we are using advanced
computational techniques to find stories
they're make hidden connections there's
all these sort of relatively vague ways
of describing this the purpose of this
course is to make those concrete instead
of big and this is mostly what I work on
as well but there are a bunch of other
places where we use computer science a
journalism that talk about so just
an example of what that first category looks like: here are some documents that were obtained from the US State Department through a Freedom of Information request, about private security contractors in the Iraq war. When we talk about data mining, this is what some of this stuff looks like. Do you remember this issue? It was a big issue about ten years ago: private military contractors attacking Iraqi civilians during the war. In particular there was one incident in a place in Baghdad called Nisour Square where they ended up killing 17 civilians, and part of the issue is that the legal regime the contractors were operating under was very unclear. They weren't US military, so they weren't subject to court-martial; they weren't subject to domestic Iraqi law; and they weren't subject to American law either, so basically there was no legal basis for punishment for crimes — which eventually got fixed, but anyway, it was a big deal in the press. And so we got these 4,500 pages — not me, a guy named John Cook, who spent some time at Gawker, filed this FOIA request. It took three years, this much paper arrived, on paper; he scanned it, put it online, did a story on it, and then I did a story using some document mining techniques, and we'll talk a little bit about that later. Or, this is another example of the sort of thing that we're talking about, just a sort of data mining example:
this is a piece called Surgeon Scorecard. This is the first-ever publication of surgical complication rates that names doctors. So this surgeon has done a knee replacement eighty-three times, and his complication rate is about two percent, with fairly wide error bars. This is a piece that was produced by ProPublica from five years of Medicare data. Medicare doctors provide about 40 percent of the medical procedures in the United States, so it's as big a sample of medical data as you're going to get, and there were seventy million rows in the original data set. What we did is we screened for all of the elective procedures and then adjusted for patient characteristics to try to level the playing field in a number of ways — we'll get into this when we do a class on regression. So this is the classic "data mining produces journalism" type of thing, and it was very influential. By the way, the reason ProPublica did this is that best estimates are that there are about 200 to 400 thousand deaths per year due to preventable medical error; it's one of the leading causes of death in this country. Patient safety advocates have been saying for 20 years that transparency of surgical outcomes is one of the big things that might help. ProPublica did this, it was very controversial, but it definitely advanced the conversation about transparency of outcomes. I worked on this, did the statistical programming — we'll talk about this later. From the same paper as the first definition, there's this other
definition, which is even broader: it talks about not just the production of stories but discovery, presentation, aggregation, monetization. So to me, when I look at this, I think: okay, now we're talking about Google, now we're talking about the Facebook news feed, and we will definitely talk about those things in this course. We will study the design of information filtering algorithms — that's basically the topic of the first third of the course; you will design your own algorithms. We're going to talk about filter bubbles, we're going to talk about the algorithms behind these things, we're going to talk about designing them so they serve social aims, because that is very much in the realm of what we talk about at a journalism school now, because it's such an important part of the public sphere. And a few past students have gone on to do things like work on the Facebook news feed, so I'm training you not just to be computational journalists working in a newsroom, I'm training you to be someone who builds the infrastructure of the public square. That's sort of the topic of this course. So what does that
look like? Well, this is an example within a newsroom. This is from a talk that someone from The Washington Post gave at last year's Computation + Journalism symposium, and what this slide describes is their virality predictor. In other words: a story is published, you collect signals about the performance of that story for, let's say, 30 minutes, and you want to know if that story is going to go viral or not. Why? Because you want to know whether to promote it more. If you know that a story is doing well there are lots of things you can do: you can push it out on more channels to try to get even more traffic, you can think about your ad sales on that page and maybe even run a real-time auction for those clicks — I don't know that they do that, but it's certainly a possibility — you can write more stories on that topic, all kinds of reasons you want to know this. Some of the most advanced machine learning models in journalism tend to be on the business side. We're going to talk a bit less about that, in part because I feel it's pretty well covered in other places — it doesn't look that different from general business analytics and data science — but there's a lot of it. This is
another example of using computer science techniques to do something in the public sphere that isn't producing stories — we're gonna have a whole class on tracking effects. This was a map of the spread of the Kony 2012 video. Do you remember this video? Yeah, okay. At the time it was the fastest viral spread of any video ever, and perhaps of anything ever, and Gilad Lotan mapped the Twitter network of its distribution to see where the major nodes were. The point I'm trying to make by including this is that the systems we're going to look at span this entire chain, from story production all the way through distribution, and then after distribution you can look at the effects of the story, and then it cycles. It's actually not just a chain, it's a loop, because the story has some effects, that changes the world, and we report on the world — so we're in this constant feedback system, and there is code everywhere in that feedback system. Now, the other major theme of this
course is algorithmic accountability, which is what we call it in journalism, but the idea appears in a lot of fields. It's the idea that algorithms have power: they're being used for making decisions — hiring decisions, decisions about who goes to jail (we're going to talk about that at great length), all kinds of stuff. And so part of keeping tabs on power in society is keeping tabs on what algorithms are being used and how they work, and this is something which only people with computer science backgrounds can really do, at least at the level of asking how the algorithms work. The stuff around the algorithms turns out to be very important as well, but that's no excuse for not knowing how the algorithms work. So we're gonna start looking at reverse engineering this stuff. By the way, you'll all get these slides, don't worry about that. So as an example of what that looks like, here's one of my favorite
examples of algorithmic accountability
This is looking at how prices vary by zip code for Staples, which is an online office supply retailer, and it found that basically things were more expensive away from urban centers. We're gonna talk about this particular example and a bunch of others. One of the questions you're gonna ask here is: why? The short answer is, well, it makes them more money, but there's a more complicated answer, which is: how did they get to this? Was there a human who was sitting there optimizing things? Were they running an algorithm to optimize sales, which is what I believe happened? If so, what did the algorithm look like? We can hypothesize about what the structure of that algorithm was. And then there are the effects, which is that generally higher income correlates with population density, meaning that people are richer in cities, meaning that they are charging poorer people more for the same product — which has obvious social implications. You know, start expanding that out times millions, for every company, and these pricing algorithms start to have serious implications for the structure of capitalism. And algorithms appear in
finance we're gonna be talking about
that quite a lot as well here's another
sort of algorithmic accountability story
This is ProPublica's Message Machine, which was an attempt to reverse-engineer political micro-targeting in the 2012 election. I don't think they did this in 2016 — that took a sort of different set of approaches — but what they did is they collected a bunch of emails, clustered all of the similar ones using TF-IDF (by the end of this class you'll know how they did that), and collected demographic information and used that to try to figure out who got what email: how did the Obama and Romney campaigns target people? And of course, in a post-Cambridge Analytica world, political micro-targeting is now very much in the spotlight. So these are the kinds of things we do, and you can think of it as basically two directions: one is you use computation to discover things about society, and the other is you do journalism about algorithms in society. So you can think of it as applying computing to journalism, or applying journalism to computing. And by the latter I don't mean the tech press — we're not going to talk about covering new Apple hardware, we're going to talk about the effect of changes in the Facebook news feed algorithm on democracy, which is kind of a different thing.
Sorry — yeah. Where stories come from is kind of an interesting question, but I am presupposing that there is some data somewhere that we have to look at; I don't think you can do the first direction of computational journalism without data. Data can mean a lot of things, of course. I don't just mean something where you can download a CSV — you can make a data set, and people have done stuff like this. We'll look at a piece where the Guardian went through the laws in every state about same-sex couples — this is pre gay marriage, ahead of it by a couple of years — visitation rights, child care, marriage or other civil union, this whole list of rights that same-sex couples did or didn't have. So they assembled this data set by reading the laws. Is that helpful? Yeah — how do you find a story, is it always a data source? No, I mean, in the Guardian case the data source came after the concept for the story, so I guess the short answer is that it can happen all different ways.
Oh, okay. So here's kind of the structure of some of the topics. Down the middle are the major topics of this course — that's more or less a synopsis of some of the things we're going to talk about, and each of those is a pretty big topic — and then around the edges are some of the fields that we're gonna have to talk about. So this is an extremely interdisciplinary course. That's one of the things I like about computational journalism: it sits right in the middle between fields, and it's also very practice-oriented — you have to actually produce a story and publish it. So although we will talk about some very abstract mathematical things, this is not an abstract game. It's really a fun place to work, and you can use it as a jumping-off point to all kinds of incredibly interesting areas. We're gonna be talking about all of these things. We're going to start this course with some information retrieval and natural language processing; by the end of this class you will know how to build Google version 1.0 — well, we're gonna talk about basic search engine operation, basic information retrieval stuff. Just in terms of the structure of the class, let's get a sense of who's here. Okay, so
how many of you are not dual degree
students? Okay, yeah. So for you this is a three-point class, and for the dual degrees it's a six-point class. How we manage that is: if you are not a dual-degree student, you don't have to do the final project — there was some confusion over that last year. For the rest of you, there will be a final project. We're gonna start talking about it early — we'll talk about it a bunch next class. It will be either a reported project or basically a piece of computational journalism research; it's a pretty substantial piece of work. There will be a sort of midterm presentation where you talk about the progress that you've made, and at the last class you'll present your projects as well. Every year there's way more stuff that I'm interested in than I get to report on personally, so part of what I do is put a list of things on the board that I wish I had time to study, and you're free to pick any of those topics or anything else, and hopefully we're going to produce some novel investigative journalism or research. And then there are assignments and participation. The class participation mark — what I expect is pretty minimal. I expect two things: show up, and say something in each class. Anything — I don't care if you say, like, two words, I just want to know that you said something in each class. That's it. Yeah,
twenty percent just for showing up and opening your mouth. So: about half of the assignments will require programming in Python. However, even for the assignments which require programming in Python, I'm not really gonna be looking at your code — I mean, I might check that it's reasonable, but this is not a programming course. The assignments are gonna have the form: write some code, and explain to me what's going on. If you can't explain to me what's going on, you're gonna have difficulty with the assignment. This is a journalism course, so explaining what's going on is the key skill. It's not just about impressing me, it's about showing me that you can write about complicated topics. The class is pass/fail for all of you journalism students; you'll get a letter grade if you need one. The assignments are
pass/fail every year a few people fail
an assignment if you're one of those
people it's okay it's you know you
you're not gonna fail the course if you
fail home or assignment it's it's my way
of telling you that what I expect is
something very different it means come
to my office hour let's have a
conversation you can see it's normal oh
well good question well get today
probably Thursday afternoon but we'll
have to talk about when it okay so this
is what we're actually gonna try to cover in this class today. We're gonna start with this very abstract concept of high-dimensional data, which is foundational to basically all of data science; then we're gonna build up the document vector space model, put together TF-IDF vectors, and use that to talk about a piece of document mining software that is used in journalism — we're going to talk in some detail about the evolution of that platform and how it works. So, high-dimensional data. There's this quote that says something like: measurement is the most neglected area of statistics. The process of how data is produced is fundamental to all of this work, and if you open a statistics textbook it says almost nothing about it. Does anybody know where this is from? Occasionally I get someone in the classroom who recognizes it — it's actually from a children's story called The Lost Thing, and it's a very sweet little story about a boy who finds this
creature on the beach and sort of adopts
it, and it's not really clear what it is — it's not really a machine and it's not really an animal, he doesn't know what it eats, it doesn't really fit in anywhere. It's a parable for not fitting in. But what I'm saying here is that the entire world doesn't fit into data. There's always that arrow — that arrow is the quantification operation — there's always some crazy abstraction that happens in the quantification process. How many of you have done any sort of experimental field work, like asking somebody survey questions? A little bit. Anybody done any ethnography? Okay. Anybody been involved in data entry, census collection? Okay. We'll talk about this a lot more in classes two and three, I think, but the question of data sources and data quality is a really big one that is really important in journalism, and perhaps talked about more there than almost anywhere else; as I said, statistics textbooks neglect it almost entirely. But anyway, if we can manage to convince ourselves that there's a quantification procedure, what we get is: we turn real-world objects into a vector of numeric entries — in fact always numeric; we end up recoding categorical variables.
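For instance — a minimal, hypothetical sketch of what that recoding step can look like in pandas (the column names and vote labels here are invented for illustration, not taken from the actual data set):

```python
import pandas as pd

# Hypothetical raw records: one row per (MP, bill) with a categorical vote value.
raw = pd.DataFrame({
    "mp":   ["A", "A", "B", "B"],
    "bill": ["b1", "b2", "b1", "b2"],
    "vote": ["yes", "no", "abstain", "yes"],
})

# Recode the categorical votes as numbers: +1 = yes, -1 = no, 0 = abstain/absent.
raw["vote_num"] = raw["vote"].map({"yes": 1, "no": -1, "abstain": 0})

# Pivot into a numeric matrix: one row per MP, one column per bill.
votes = raw.pivot(index="mp", columns="bill", values="vote_num").fillna(0)
print(votes)
```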
So we're gonna work through a little example here, which is data from the UK Parliament in 2012 — and there's actually a reason why I picked 2012, it's not just that that's when I built this example. This is what it looks like: basically we have a list of bills — you can see the titles there, and there are repeated titles because there are often multiple votes on the same bill — and a list of MPs and their votes. A negative value means they didn't attend, and I think 4 is no and 2 is yes — oh yeah, it says at the bottom, yeah. And what we have in this set of data is over a thousand MPs and 1,600-some votes. So the analytical question is: is there structure in this data? Does anybody have a suggestion on how to proceed, to try to make sense of what's going on here? Yeah — each of those columns, those MP IDs, is a particular person, and there's actually another table that gives their name and their party. Party may be a good idea, huh? Yeah. So imagine your editor comes to you and says: tell me about voting patterns in the House of Lords. We're looking for sort of initial exploratory steps here — I mean, you've done exploratory data analysis in another course, yeah? Okay. So let's talk about the dimensionality of this data. If this had only a
few dimensions, you could just start making charts. You can build a table that says, for any given bill, how many Labour, how many Conservative, and so forth voted on that bill — so you can imagine taking sections of this. But, you know, I'm feeling ambitious and I want all of it, so we're going to talk about reducing the dimension of this space. If we consider each MP to be an object, then we have a thousand MPs — so a thousand rows of data — and 1,630 votes, so now we have a thousand data points, each of which is 1,630-dimensional, and I want to visualize all of them. So the basic problem we have is we have to go from 1,630 dimensions to two, or perhaps three. This is a very common problem in data science; it's called dimensionality reduction. How many of you have seen some version of this in your other classes? Okay, yeah — what dimensionality reduction techniques did you see? Yeah, so that's where we're going with this: we're gonna build up to PCA today. How many of you know what principal components analysis is? Okay, yeah. So this is basically how this entire class is gonna go: I'm gonna talk about something, and some of you will know what it is and some of you won't, so whether you know it or not, hopefully I'll fill in the gaps. The basic technique that we're going to talk about is projection. What projection is, is a classic,
in fact ancient technique for dropping
dimensions
although this form of it was actually worked out during the Renaissance. Some of you may be familiar with techniques where painters would look through a sheet of glass, or rig strings between the points of an object and a central eye point and plot their intersections — the mathematics of this was worked out four or five hundred years ago. You can think of it as finding a direction to look from and then dropping the depth dimension. This finding a direction to look from is pretty important, so I'm going to go to my favorite rotating cube. So this is a data set — it's like the iris data
set where there's a bunch of points and
in this case the points are in three
dimensions and we're looking at a
two-dimensional picture of course and
we're trying different angles on this
and what I want you to notice is that in
some angles the clusters really overlap
so you know the red and the black will
be on top of each other in a minute and
if you're trying to see the clusters clearly, that's not the way to go — if we hadn't already colored them red and black, you wouldn't really be able to see what's going on there. And so,
for example this is not a particularly
good direction to look from because
everything's overlapping but there are
other directions where you get a lot of
separation between the points and those
are directions you want to look at and
the theory of PCA kind of goes like this
So let's do sort of the simplest possible example, where we're projecting from two dimensions down to one. Is there an eraser over there? If you hand me the other one, that would be great. Okay, so we've got these data points
and let's say there are two big groups
and we've got these two axes
which we're going to call X 1 and X 2
because that's sort of the convention in data science; X is the coordinate, or the input,
and we're gonna project them down to one
dimension now we can project from any
direction but let's just limit our
choices here and say we want to drop the
X1 or the X2 dimension. Well, if we drop
the X 1 dimension then what we're gonna
get is the positions of these things
projecting onto this axis and I'm not
gonna draw this little projection for
everything here, but you can see that they're really gonna pile up on top of each other. If we drop the X2 axis instead, we're gonna get some pile-up — that's what happens when you drop dimensions — but we're also still going to be able to make out that there are two clusters, so we've preserved a little
two clusters so we've preserved a little
more structure now the question is how
do we automatically pick which axis is
better to project right which access is
better to keep excellent any ideas
I think you are building up the
intuition behind the NBS algorithm which
which you also get to but yes good a
simpler is let's pile that yeah how do
we know about these things are these
what's that it's tense distance yeah
something-something distance so the a
simple heuristic to try to pick a good
projection Direction is just look at the
overall range of the data and pick
whichever dimension has preserved
whichever dimension has more range all
right they're gonna be less pile up may
not be ideal but it's gonna get a lot of
it in fact instead of instead of a
minimum maximum when we normally
computers variants which is the standard
deviation squared as you'll recall and
so the entire notion of PCA is that we
pick the first dimension PC is the one
which has the highest variance the
second one is the direction which has
the second highest variance and all of
them are perpendicular so what we end up
with is a set of perpendicular
directions which have the 1st 2nd and
3rd highest variance and so forth so
when you put that into math speak you
end up with eigenvectors and covariance
matrices. How many of you have done the whole eigen thing — maybe you have that background? Yeah, some of you will have been forced to do this. Typically you'll do this in second- or third-year linear algebra, and it's one of these things where you just sort of go: why am I learning this? It turns out the answer is because it's actually an incredibly foundational concept — but not really for us, because we don't have to compute it; we're gonna do this in one line of Python code. But this is what's happening: we build this covariance matrix, and the diagonals of it are the variances along each coordinate, and importantly the off-diagonals are the covariances between pairs of coordinates, and that matrix actually encodes all of the information about which way this cloud of points is oriented in space. Then we just pick however many directions we want — so to project into two or three dimensions we pick the top two or three eigenvectors, which end up being the directions of maximum variance.
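To make that concrete, here is a rough numpy sketch of the textbook construction — center the data, build the covariance matrix, take its top eigenvectors, and project onto them. This is just to show what's happening; it's not the notebook code we'll actually run:

```python
import numpy as np

def pca_project(X, k=2):
    """Project the rows of X onto the k directions of maximum variance."""
    Xc = X - X.mean(axis=0)                  # center so covariance is taken about the mean
    C = np.cov(Xc, rowvar=False)             # variances on the diagonal, covariances off it
    eigvals, eigvecs = np.linalg.eigh(C)     # eigh: eigendecomposition for symmetric matrices
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # columns = top-k eigenvectors
    return Xc @ top                          # each point's coordinates along those directions

# e.g. coords = pca_project(vote_matrix, k=2)   # vote_matrix: one row per MP (assumed name)
```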
We're gonna run this on the House of Lords data in a second; before that I want to talk about why. Basically, we want to
find structures in the data and the
simplest structure that we can find is a
cluster and the reason that clusters are
interesting is they produce a sort of
natural classification of the data right
so think about going out into an
unexplored country, perhaps as Darwin did — or at least unexplored by Western
naturalists and cataloging all of the
species that you find. So you walk around and you have no idea what's there. Okay, here's a small creature, and it has four legs and it's green and it's an amphibian. And there's another creature, and it's in a tree and it has feathers and it's brown. And here's another one which is green and amphibian, and here's another one which is also green and amphibian but smaller. You end up sort of plotting these characteristics, and eventually you're like: okay, well, I've got some brown birds, some frogs, and some green grasshoppers. You've deduced the categories from the structure of the data that you observed. That's what's happening when we're
doing clustering and this is super
important categories are foundational to
language and thought and basically all
analysis and so this is why clustering
and categorization is such a fundamental
thing to do with data, and we're going to see this in a bunch of
different contexts
Clusters are perhaps the simplest important structure that you can get out of data,
and this is a whole book on
classification systems which is awesome
and there's another book called sorting
things out which is about how our
classification systems affect not just our thought and our language but
actually our institutions and our
processes like think about whether
something is classified as a mental
illness or not alright that can have
huge effects on an individual's life course,
and it's just a classification change so
we're gonna do this — I'm gonna show you what this notebook looks like when we actually do it, and you can follow along. Did I post the online version of this? Let's see... oh no, I made a repo just for this, a new repo — I'll post it to Slack. You know what, here's the old repo, here we go.
So here's the idea. We read these votes, and I reduced it to just the last hundred votes so it stays a little smaller — it's a hundred votes by 613 MPs. This is a bit different from the one we saw in the slide; that's just to make the exposition a bit more straightforward. Here's the ID, which you can key back into the original data if you want; here's the party — Conservative, XB means crossbencher, and we're not getting into the details of British politics — Labour, Liberal Democrat — and I've recoded the votes, so 1 is yes, 0 is abstain, and -1 is no. So there it is. Then I'm gonna write this little function which color-codes them, so this is kind of our legend, and we get black if we don't know the party. And then we're starting to actually view the data, and if we don't have dimensionality reduction we kind of have a problem, because we can only view three dimensions at a time. Each of these dimensions is a vote, so this axis would be the vote on, I guess, vote 22, and we've got a bunch of problems here. So first of all: why do they have this shape in the scatterplot?
yeah yeah basically we're getting all
combinations of minus 1 0 and plus 1 on
these three axes, and so these
things are also piling up right there's
actually like a hundred dots piled up
here because we've got 613 total you can
see some combinations don't exist but
basically we have you know 3 times 3
times 3 27 dots so it's not really
telling us anything about the voting
patterns and so what's actually going on
here is — this is the intuition I want you to develop (could I get the eraser? oh, yes, okay) — this is the intuition I want you to develop: we are looking from the wrong direction. You can talk about PCA in terms of linear algebra, you can do this all symbolically, but it's really all geometry, and if you can get a little high-dimensional geometry then you can talk about all of this stuff in geometric analogies. And this is what's going on:
So if I project onto this axis or this axis, what I get are three blobs. If I project onto a diagonal axis I get something more interesting: I get blobs where the ones in the center are a little heavier, because there are more dots stacked up. And if in fact most of the MPs voted, say, here — if these votes are actually linked to each other — then I'm going to start to see that when I look from this direction. As you add dimensions, this diagonal view — which you can think of as just taking a sum of a bunch of different votes simultaneously, a bunch of different dimensions simultaneously — starts to be more interesting, because here you're looking at the sum of two votes; if you go to three or four or ten dimensions you're starting to look at the sum of ten votes, and then you're starting to see more values than just the three: minus one, zero, plus one. So what we're gonna do is we're going to let the PCA algorithm pick the direction — and I told you it was easy; here it is.
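The notebook uses scikit-learn for this; the call looks roughly like the sketch below (the variable names are my stand-ins, not copied from the notebook):

```python
from sklearn.decomposition import PCA

# vote_matrix: one row per MP, one column per vote, entries in {-1, 0, 1}
model = PCA(n_components=2)
coords = model.fit_transform(vote_matrix)   # shape (n_mps, 2)

# coords[:, 0] and coords[:, 1] are the two projected coordinates we scatter-plot,
# coloring each dot by the MP's party.
```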
So this is what we get when we reduce it to two dimensions. I've done one other thing here, which is I've set the colors to the party. First of all, you can see a lot more structure if you look at the colors: Liberal Democrat is yellow, Labour is blue, Conservative is red. As you would expect, you've got sort of left and right here, which tells us something about the political axis, and then we've got this other axis — remember, the axis with the second most variation — which is where the yellow dots, the Liberal Democrats, are. Now, this plot actually corresponds to the political structure of the government at that time. Does anybody know anything about the UK government at that time?
Yeah — when I taught this course in previous years, people were like, yeah, this is happening right now. No one's ever gonna get this question again, so I'll just show you:
they had a coalition government which is
something that only happens in
parliamentary systems where they
eventually had a coalition between the
Conservatives and Liberal Democrats so
what you are looking at in this example
is the red and the yellow are in a
coalition government together, but they're not really voting together. The Liberal Democrats — who you might think would vote more often with Labour, the blue — end up voting distinctly differently from the blue party, but not quite the same as the red party either. So that's the question: what are the axes? What are the axes — yeah, kind of. So here's the thing: the actual answer to "what are the axes" is that the first axis, the horizontal axis in this picture, is the dimension along which there is the greatest variance, and the y axis is the perpendicular dimension along which there is the second greatest variance. So from a technical perspective that
is the answer but I think really what
the question that you come up with when
you look at this picture is okay but
what do the axes mean? And to figure out what they mean, the only thing we can do is go back to the data. Just looking at this picture — which is only interpretable because we put colors on the dots; if this were a black-and-white picture you'd be like, okay, it's a T shape — but knowing that, conveniently, the Conservatives are on the left and Labour is on the right, you can see that the first
dimension is something like the
left/right political axis and actually
in every voting pattern analysis and
social attitude analysis that I've ever
done or ever found in the literature the
left-right axis just pops out. For a while I had this question: is the left-right political divide real? Is this thing that people talk about — do people actually split that way, or is it just a narrative that we impose on the world? It turns out it's real. For some
reason the first axis of variation in
political attitudes seems to be this
left-right axis in a pretty traditional
sense. Now, exactly what left and right mean will vary in each country, but you always find it. What does the second axis mean? This is clearly the direction in which the Liberal Democrats vote, which is some other policy dimension. To figure out what that dimension is, we would have to look at which bills contribute to that dimension. What do I mean by that? I'm talking about what are called the loadings of the components, and what you actually get is — I don't remember how to do this now — scikit-learn PCA loadings, also known as factor loadings...
Ah — PCA components. Okay, so here we go... where did it go, where did it go... okay, and we called it "model", all right. Oh my god, numbers, numbers, numbers. Okay, well, let's see what shape this is: (2, 100). Notice I had 100 votes here — 102 columns, because I also had these two other columns, but I just used the voting data — so this has as many columns as votes, and then it has two rows because I asked it to build me a two-dimensional projection. So what I end up with is two 100-dimensional vectors, and what that is, is a coefficient for each bill. And think
of it this way: PCA picked some direction to project onto. If it were just the horizontal dimension, then my vector would have zero everywhere and one in one place, or the same with the vertical dimension. If it were exactly this diagonal direction, exactly 45 degrees, then it would have the same value in every component. But instead I've got some direction which is, I don't know, like this — so it's not exactly that, it's not exactly horizontal, it's some direction. It's a vector, and the components tell me the direction
the components telling me the direction
of that vector so now jump up a
dimension and imagine this in three
dimension it's a plane and in higher
dimensions is actually a hyper plane and
as you all know a hyper plane is
represented by a vector which is
actually the normal vector of the plane
right so these these numbers the
coefficients here
which let's look at the first one for
example right this is actually the it
each of those numbers is tells you how
much weight you put on each bill so if
you go and you look at this and you take
the directions which have the largest
value right so a lot of these are really
small like point zero three or something
but a few of them are bigger like point
one nine that tells you that those bills
are counted most heavily in that
direction so that's like the you know
the leftists or rightists most bill and
oh but I looked at component one which
is the vertical one right so that bill
is really characteristic of this
direction and so all of these yellow
dots tended to vote YES on that bill so
we could go through these coefficients
and go back and read the titles and text
of the bill and figure out what it means
to be more vertical and what we're gonna
find is we're gonna find all the bills
where the local Democrats thirty guess
which we can already tell that because
that's what that shape means okay
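As a hedged sketch of how you would actually pull that out of the fitted scikit-learn model — I'm assuming a `bill_titles` list lined up with the vote columns, which isn't shown in the notebook excerpt:

```python
import numpy as np

loadings = model.components_[1]          # component 1 = the vertical axis in the plot

# Bills with the largest absolute coefficient define what this axis "means".
top = np.argsort(np.abs(loadings))[::-1][:10]
for i in top:
    print(f"{loadings[i]:+.3f}  {bill_titles[i]}")   # bill_titles: assumed lookup list
```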
Questions? — When you say "the direction with the second-most variance"...? Yeah, kind of. Right, so I gave
you this nonsense earlier: the first K eigenvectors of the covariance matrix, the ones with the largest eigenvalues. One algorithm for doing that is: you find the direction that has the largest variance — never mind how, you just find it; that's the largest eigenvector — and then you're like, okay, now let's find the direction which is perpendicular to that which has the largest variance, and you just keep adding perpendicular vectors, capturing whichever direction has the most variance each time. That's how it's actually computed, and once you compute it, it's done, because we have intentionally tried to find the direction in which things are as different as possible from everything else — that's what it means to be the direction of the second-most variance; it's gonna find the direction along which they differ. A little bit later we're gonna look at a multi-dimensional scaling plot of this, where we plot things based on distance, and you'll see that they actually overlap, so it's a different view of the data. Yeah — how would you go about explaining this visualization to a non-technical person? Yeah, a great question.
so I would say that you know each MP is
given two scores each of which is a
weighted average of their votes on a certain set of bills. One set of weights includes bills that vary along the traditional left-right axis, and the other set is just those bills where the Liberal Democrats tend to vote differently. So that's one way to think about it. I think that's probably still too much, because most people aren't comfortable with weighted averages. I think the first-order version of this gets to be something like: they're plotted on two axes; the first axis, the horizontal axis, represents the major left-right divide, and the second axis is the ways in which the Liberal Democrats differ from the other parties.
it gets really metaphorical so this is
this is a key theme in this course: the mapping between the words and the mathematics. And this is really
one of the reasons this is so much fun
and so hard because we're really
straddling both worlds here, right —
we're trying to find words that not only
are an accurate description of what's
going on in the data but that people
will understand right so it's we have to
map not just the meanings of the words
but the meanings that will be understood
when we publish this publicly to people
with no technical knowledge and I'm not
necessarily recommending that you publish a PCA plot — I'm not saying you shouldn't, but it is a little hard to describe — but I want all of you to know how to interpret these, because it is perhaps the simplest way to see the basic structure of a high-dimensional data set. All right,
thoughts on that okay so that's PCA oh
do I have this in two places oh yeah
okay slide issues there is another
approach another fundamental approach to
understanding the structure of high
dimensional data which uses distance
metrics, and a lot of things use
distance metrics and for example all of
the filtering algorithm designs that
we're gonna look at in this class are
built out of distance metrics so let's
talk about them formally a distance
metric is a function with these
properties you know so they have names
like reflexivity and so forth these are
meant to model our intuitions about
geometric distance right so if I'm
looking at the distance between two
points, it's non-negative; the distance of anything to itself is zero; the distance from
green to blue is the same as the
distance from blue to green all of those
are obvious the triangle inequality is I
think a little less obvious you're all
familiar with the triangle inequality in
geometry it's another way of saying that
the shortest path between two points is
a straight line: it says that if you go from X to Z and then from Z to Y, that can't be shorter than going from X to Y directly. Okay, so these actually impose a fairly strict
set of constraints on the topology of
the underlying space, but there's a bunch of stuff that satisfies this. The normal Euclidean distance will satisfy it, but so will things like distance on a graph: if I have a set of nodes and I have some edges, and the edges are weighted with some distance on them, then if I define the distance metric between two points as the sum of the weights on the edges of the shortest path, that will satisfy these properties.
so it actually applies to a lot of
different situations more than you might
think
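As a small illustration of that graph case — shortest-path length on a weighted graph is a perfectly good distance metric, and a library like networkx will compute it (the toy graph below is made up):

```python
import networkx as nx

# A toy weighted graph: four nodes, edge weights acting as lengths.
G = nx.Graph()
G.add_weighted_edges_from([
    ("a", "b", 1.0), ("b", "c", 2.0), ("a", "c", 5.0), ("c", "d", 1.0),
])

def graph_distance(u, v):
    # Sum of edge weights along the shortest path; on a connected graph
    # with positive weights this satisfies all of the metric axioms.
    return nx.shortest_path_length(G, u, v, weight="weight")

print(graph_distance("a", "d"))   # 4.0 — the path a-b-c-d beats a-c-d
```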
and if we have a distance metric then
the space that the objects are in so
these X and Y — they live in some
abstract geometrical space is called a
metric space if we have a distance
metric and we have N points, we can make
a distance matrix now this is
distinguished from the data matrix so we
saw the data matrix for the House of
Lords so this is the data matrix right
we have one row for each object so that
can be in whatever size right M objects
of n dimensions the distance matrix is
always square and it's always symmetric
Why is it symmetric? Yeah — because we said the distance function is symmetric. So it is an M by M symmetric matrix, and we can always
compute this thing if we have a distance
metric it's gonna be big right this
starts getting scary if we have a
thousand points now we've got a million
entries in the matrix, and if we have a million points, well... So sometimes you don't
want to compute this explicitly right
sometimes you want to do it on demand if
you want to make it sparse or something
but in principle we can just store this
thing in a variable and we can use the
distance matrix to create visualizations
of the space of objects which maybe you
were getting at earlier with the sort of
distance-preserving embeddings. So let's talk about that algorithm — this algorithm is called — oh, we're talking about clustering first, yes. Before we talk about creating visualizations, we're going to talk about clustering algorithms. There's
tons and tons of them this is sort of
like pretty standard data science stuff, but if you think about them you'll notice that they all require a distance metric. The fundamental thing you need to know to do clustering is how similar
field I'm exploring my new country and I
notice one four-legged green amphibian and another four-legged green amphibian — are they similar enough that I'm going to say they're the same species, or is one a frog and the other a salamander? Presumably my distance function will tell me that any two frogs are a lower distance from each other than any frog is from any salamander. That's the idea here, and if that is the case then we can find mathematical structure here. How many of you have seen k-means already? Really? Yeah? Right, okay, so we'll do that pretty fast; this is a really classic algorithm. So let's start.
Alright. So how are we going to pick the initial centroids? In other words, we're trying to find the centers of clusters; to do that we need an initial guess for where the cluster centers are. Randomly — okay. And then, what kind of data would you like, smiley or pimpled smiley? Pimpled smiley? Smiley is really the best one. Okay, so here we go. The way the k-means algorithm works is — well, the big
limitation of k-means is you have to
pick k before you start you have to pick
you have to know the number of clusters
before you begin and there are
clustering algorithms which don't
require you to know that but k-means
does. So, what should we set K to — how many clusters do you think we have here? Four? Okay, so
there's four random centroids and what
it's done is it has colored the regions
of space which are closest to each centroid. So for example, the border
between the red and blue is gonna be on
a line which is equidistant from each
right that's that's what defines these
lines so now the k-means algorithm has
two steps the first step is we figure
out for each point what's the closest
centroid so there you go we've colored
them and the next step is we move the
centroid to the centroid of all of the
points that have that color
You'll notice "centroid" is a fancy word for average — it's a geometric concept of an average. So off we go, and then we repeat: we've changed
what points are closest to what you can
see that for example we've got some red
points in the blue region right so we
reassign all the points and then we move
the centroids again and we keep doing
this and eventually what happens is we
reach a step where we don't reassign any points. I think we're gonna reassign
just a few here and you can see that
sometimes it takes a while to converge
but you can actually prove that this
will converge and actually it stops
there you go
we've reached convergence. Okay, so, there's a proof that the k-means algorithm will converge — it's not necessarily going to converge fast, but in practice it does pretty well; you can get away with ten iterations or something.
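Here's a minimal numpy sketch of those two steps — assign every point to its nearest centroid, then move each centroid to the mean of its assigned points — which is exactly the loop the demo is animating:

```python
import numpy as np

def kmeans(X, k, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Step 1: assign each point to the closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each centroid to the mean of the points assigned to it.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```

Nothing in this loop cares how many columns X has, which is the point: the same sketch clusters two-dimensional demo points or 1,630-dimensional vote vectors.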
What do you think — are these reasonable clusters? Have the centroids captured the structure of the data? Yeah, it's kind of a hard thing to eyeball. This makes more sense when you do something like the pimpled smiley — can it find these clusters? So let's try it. Okay, so obviously clustering only really makes sense if the data really is clustered; real data is kind of somewhere in the middle. We'll look at some real data in a bit, clusters of documents, and you'll see that there's cluster structure but it's pretty fuzzy. So this is the k-means algorithm, and when you look at this I want you to imagine doing it in some high number of dimensions. We're doing it in two here, but it works in three, in fact it works in any number of dimensions, and all of the proofs go
through and in fact if you build a
clustering algorithm — so think about what you need to implement the k-means algorithm. The first step is you assign each point to the closest centroid. Well, the centroid is a point, and "closest" means least distance, so if you have a distance function then you can do that step, and that distance function could be in any number of dimensions. Then the only other thing you need is to be able to compute a centroid, so you need to be able to take a bunch of points and average them somehow. So if you have an
averaging step and you have a distance
function you can implement k-means and
there are other clustering algorithms
which only use a distance function so if
you know how to take a distance function
between two points in your data you know
and there are some pretty obvious
distance functions that work in, you know,
any number of dimensions like euclidean
distance right then you can cluster in
any number of dimensions you don't have
to change your clustering algorithm so
dimensions. So — I'm not gonna run this live, but you can run k-means or whatever clustering algorithm you'd like. I think I used a clustering algorithm here and said: tell me which cluster each MP is in, and this is what you get. Now you get a number for each MP. The
reason that I've showed you this is
because I want to point out and I will
point this out over and over again is
that data by itself never has meaning — the meaning comes from connecting the data to the world in some way. And if you don't believe me, I'm going to take some of my favorite spreadsheets, remove all the column headers, and send them to you. This is again one of those neglected topics in statistics: the data only has meaning when it's connected to something else. So we're gonna connect it to something — in this case we're gonna connect it to party. So now I've added, at the top there, the party of each MP, so you
see the party, and then immediately below it the cluster. If you stare at this for a bit — obviously there are better ways of visualizing this — you might be able to convince yourself that it has picked up party information. Labour is usually 2, Liberal Democrats are normally 1, crossbenchers are normally something else — but take a look at the Conservatives: they're usually 1 as well. So it's ended
putting the Conservatives and the
Liberal Democrats who are in a coalition
in the same cluster and and that is
because what I did when I did this is I
built a distance function that computes
distance based on the number of votes in
common so I think I'll actually show you
this. So if you look at this original post — do I have the code? Yeah, here we go: this is the distance function I used, and
if you stare at this for a few minutes
maybe you can convince yourself that
this follows the rules of the distance
function so for example it's symmetric
in v1 and v2 — you can switch them and
you get exactly the same thing right
because everywhere they appear you can
switch them if they're the same thing
right if I pass the same MP for both of
them this is gonna give me zero triangle
inequality is a little harder to prove
but you should be able to convince
yourself that it works out and all it
does is basically count the fraction of bills where they both voted and voted the same way. So I figure out how many bills they both voted on, because not everybody votes on everything; then I figure out how many they match on, within the ones where they were both there; and then I take that fraction — the number where they matched divided by the number where they could have matched — and take one minus it, because the fraction will be one if they voted identically, but the distance needs to be zero if they voted identically. Okay, so that's the distance function I designed to do this.
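The actual function is the one on screen in the repo; as a rough sketch of the same idea (assuming, as above, votes coded 1/0/-1 with 0 meaning abstained or absent):

```python
import numpy as np

def vote_distance(v1, v2):
    """Sketch of a distance between two MPs' vote vectors: one minus the
    fraction of agreements among the bills where both actually voted."""
    both_voted = (v1 != 0) & (v2 != 0)   # bills where neither entry is abstain/absent
    n_both = both_voted.sum()
    if n_both == 0:
        return 1.0                        # no overlap at all: treat as maximally distant
    agree = (v1[both_voted] == v2[both_voted]).sum()
    return 1.0 - agree / n_both           # 0 if they always agree, 1 if they never do
```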
We're going to turn this into a visualization in a minute. Oh yeah, and there you go — it says at the top: this may seem a little
arbitrary to you and it is so here's the
thing about distance functions and
clusterings they're not objective and
"subjective" isn't a bad word here — one way to think of subjective is "set by contextual considerations". So,
for example I am using domain knowledge
of how Parliament works and what voting
is to decide that this is a useful
distance function now I'm still making a
choice I'm saying well let's look at how
well they agreed where they both voted
rather than something else — you could also
say well let's just look at the number
of votes that match and if one person
was there and the other person wasn't
then let's just say it's a mismatch
right. So we still have a bunch
of choice here and there isn't going to
be a right clustering or a right set of
categories there are only going to be a
useful set of categories for your
problem this is my favorite example at
the top we have — this is due to Clay Shirky — a piece of the classification system for the Soviet National Library, and at the bottom we have a piece of the Dewey Decimal System, and you can pretty clearly see the philosophical influences on each of these categorization systems. They're both categorizing the same thing — they're both categorizing books — but they have really different ideas of what a category is. Look at that: the 290s are every non-Christian
religion alright so this was invented by
Dewey in the 19th century and so like
you know, in American
libraries at the time this was a
reasonable way to do it in fact there is
a justification for this. Does anybody know the difference between a typology and a taxonomy? These words were used in the title of this book — what's the difference between a typology and a taxonomy, anybody know? A typology is just a set of types: what types of things can we have? So your
typology would probably have books of
all kinds of different religions a
taxonomy is a typology built up from the
objects you actually have and so this is
why the Dewey Decimal System is so
Christian it's because at the time it
was made that was the set of books in
the library — it's a completely
reasonable thing to do to build up the
classification system from the examples
that you actually have so going back to
exploring the animals of the new country
you're building a taxonomy. So the
Tree of Life is a taxonomy it is a
classification system designed to be
effective in categorizing the actual
species that were found in the field by
naturalists, as opposed to, I guess, hypothetically or based on theoretical considerations. Anyway, you're gonna run into these issues when you're designing distance functions and doing clustering. Now we're going to move from clustering to visualization. They're often very closely related, but remember that,
technically speaking a clustering
algorithm just produces cluster
assignments right so the output of a
clustering algorithm is this in fact
it's not even that it's that which is
why we still want visualization because
you know we would probably take the
output of the clustering algorithm and
then visualize it for example by
assigning colors on the plot so just a
reminder of that point so we're going to
talk about a different way of
visualizing this high-dimensional space. So
the PCA projection is a linear
projection that means a couple different
things geometrically that means what we
do is we drop straight lines to a
straight surface right so I'm gonna
repurpose these points as points in a
Euclidean space again and I'm gonna say
well we're going to take a flat
projection plane right so this is some
hyperplane and then we're going to take
the perpendicular distance which is of
course also the shortest distance of
each point to that plane and then we're
going to take the points on the plane
that are the closest distance and that's
our projection all right so that's what
we mean by a linear projection in
geometric terms in linear algebra terms
what we mean is we do a matrix
multiplication and are you all familiar
with the rank of a matrix? Okay. So you
can think of the rank of a matrix as how
many dimensions you have left after you
multiply by that matrix and the fact
that this is a projection means that
that matrix has to be less than full
rank. In fact, if you're projecting to the screen, that is, projecting to two dimensions, it has to be a rank-two matrix: it has to drop every other dimension. Now, in some basis that matrix will have zero for every row except the first two, but in general, because it isn't diagonal in whatever coordinate system you happen to be using, the matrix is still full of numbers — you're just going to find that it projects everything down onto a plane. So this is what we mean by a linear projection.
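In code that's nothing more than a matrix multiply. A minimal sketch with made-up data — in PCA the columns of W would be the top two principal directions; here they just pick out two of the original coordinates:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 50))   # 100 points in 50-dimensional space

    W = np.zeros((50, 2))            # a rank-two projection matrix...
    W[0, 0] = 1.0                    # ...that keeps dimension 0
    W[3, 1] = 1.0                    # ...and dimension 3, and drops everything else

    X_2d = X @ W                     # every point lands on a plane
    print(X_2d.shape)                # (100, 2)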
The projection we are going
to use is a nonlinear projection it is
still a dimension reducing function so
we still end up on a plane but we're not
doing it by dropping perpendiculars
anymore so in particular a linear
projection preserves certain types of
distances and angles. Obviously a linear projection doesn't preserve all distances, but it preserves some of them — it preserves this distance here, right — and it preserves certain types of angular relationships as well. A nonlinear projection doesn't have to preserve anything; think of a fisheye lens, which is a simple nonlinear projection. So
here's what we're gonna do we're going
to build a non linear projection and
instead of starting with the positions
of all of the points and multiplying by
a matrix to flatten them out what we're
going to do is we're going to start with
the distance matrix right so the
distance between every pair of points
and if you think about this — if the input and the output dimensions are the same, then actually I can get all of the coordinates back if I just know the distances. Right, so say this is a map of
the say these are the positions of a
bunch of cities and these edges are
labeled with the distance between them
you know so as the crow flies and I
don't give you the positions of the
points, but I give you all of the lengths here — I just give you a list of the distances from every city to every other city — then if you think about it for a second, and you can prove this formally if you wish, you can recover the positions of the points just from the distances, up to a rotation and translation. So I may not know which way this is rotated, I may not know where it is on the plane, but I know it's going to have this shape. So a
distance preserving transformation preserves in particular the geometric properties we're interested in: it preserves all of the relative lengths and angles if you go from two-space to two-space or three-space to three-space. But there's nothing that stops us from going from, like, 200-space to two-space — we just may not be able to get all of the distances exactly right; you can't satisfy all of the original distances in the low dimensional space. So this is a
classic algorithm that dates from 1952, called multidimensional scaling. It's, you know, some linear algebra; really the only thing I want you to think about here is that we start with the distance matrix on the left and we end up with coordinates. So it maps distances to coordinates, and it says, well, we're not going to get it exactly right. So X are the coordinates and B is a transformed version of the distance matrix — basically B subtracts out the means — and that line six, the one that minimizes the norm of B minus X times X transpose: what that's saying is that it minimizes the disparity between the distances computed from the coordinates — in other words the distances in the low dimensional space — and the distances in the high dimensional space. So that's what
we're doing and there's a bunch of
flavors of this. And so this minimization — see, in line six there you have that double bar; that is a matrix norm, a standard squared norm — if you sort of grind through the math and work this out, this is what you get:
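In standard metric-MDS notation (the slide's exact variant may differ slightly), with the x's the low dimensional coordinates and d_ij the high dimensional distances:

    \mathrm{stress}(x_1, \dots, x_n) \;=\; \sum_{i<j} \big( \lVert x_i - x_j \rVert - d_{ij} \big)^2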
That norm ends up being the sum of the squared differences between the high dimensional and low dimensional distances. So the x's here are the computed coordinates — we're trying to figure out where to place these points, that's the output of the visualization — the d's are the distance matrix elements, and what we're doing is trying to minimize that squared difference. Now, how many of you have had
some basic physics? Anybody know the physics of springs? Yeah, right. So the force increases linearly, but the energy it takes to stretch a spring is proportional to the square of the displacement. And springs have a rest length — you can squish them and you can pull them — so that's where the square comes from, and there's a spring constant, but the spring constants drop out here. So that minimization is actually identical to connecting all these points with springs that have rest length equal to the distance in high dimensional space and then waiting for them to find the lowest energy configuration, which is what they'll do if you just let them settle. Right, if you
connect a bunch of things with Springs
and you let it wiggle around until it
stops, you're gonna find the lowest
energy configuration so there's actually
a very intuitive physical interpretation
which is that you connect everything
with Springs and wait for it to settle
down. Which means you can actually calculate this — and I'll show you a visualization of this in a second, when we talk about the Overview system — just by pretending there are forces on these dots and simulating them at each time step, with the dots moving according to the spring force. And so another way I think
about this is that you have a bunch of points with this sort of connectivity between them, like a stretchy structure, and I take this high dimensional structure and flatten it between two planes of glass, and it squishes down, the springs pull it around, it wobbles excitedly and eventually stops — and that's multidimensional scaling. So if I do this to the House of Lords data, this is what I get — again, this is from a blog post that goes into more detail, and you can do this in Python.
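As a rough sketch of that spring idea — this is not the code from the blog post, and the distance matrix D is assumed to have been computed already; in practice you might just call sklearn.manifold.MDS(dissimilarity='precomputed') instead:

    import numpy as np

    def spring_layout(D, n_iter=2000, step=0.01, seed=0):
        # Connect every pair of points with a spring whose rest length is the
        # high dimensional distance D[i, j], then let the system settle in 2D.
        # This is just gradient descent on the stress written out above.
        rng = np.random.default_rng(seed)
        n = D.shape[0]
        X = rng.normal(size=(n, 2))                   # random starting positions
        for _ in range(n_iter):
            diff = X[:, None, :] - X[None, :, :]      # pairwise displacement vectors
            L = np.linalg.norm(diff, axis=-1)         # current 2D distances
            np.fill_diagonal(L, 1.0)                  # avoid dividing by zero
            force = ((D - L) / L)[:, :, None] * diff  # stretched springs pull, squished ones push
            X += step * force.sum(axis=1)             # nudge every point, then repeat
        return X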
The result looks a bit different from the PCA plot, and I think there are a bunch of reasons for that. First of all, what do you see —
what is the relationship between the
Conservatives and the Liberal Democrats
in this plot? The colors are the different parties. Yeah — they overlap a lot more. The Liberal Democrats in purple don't quite overlap the Conservatives in red; they do tend to stretch a little bit towards the left, towards Labour in this case, but they're definitely in the same quadrant. And then the other fun thing here is the green dots: those are the crossbenchers, the members who sit with neither party — that's why they're called crossbenchers — and sure enough they vote all over the place. So this gives a very
different picture based on the distances between the voting patterns. And what might be going on here? First of all, I think the major difference is just the fact that rather than trying to find the direction where the Liberal Democrats are most different — the second-most-variance direction — we're just trying to lay everything down and be distance preserving. So you can see there clearly is an axis where the Liberal Democrats differ — they sort of go out this way — but that axis has been flattened down by this projection. So that's one
thing that's going on the other thing
that's going on is it could be that the
Liberal Democrats just vote on a
completely different set of bills than
everyone else and we don't see that in
the PCA plot because we take every bill
whereas our distance function looks at
only bills where both people voted right
So we've actually got a difference in what we're doing. And in fact, in the blog post — so here's the plot we were just looking at — if we change the distance function... right, so I modified it: rather than just the number of votes where they overlap, I'm modifying it by this factor which asks, okay, but how big is the overlap? So I'm taking a log here just to try to squish the scale of it, and
you can start to see sort of how — I suppose it's technically correct to say subjective, though I think I'd prefer the word contextual here. All right, there may be strongly justifiable choices for why you do these things — in this case, because I don't get an interpretable plot if I don't do this log thing — but I'm adding this other factor which says, okay, well, let's push two members apart if they didn't vote on the same bills. So if the overlap is zero, well, I guess I don't even run this line at all, and if the overlap is small then this will be very close to one, so we'll get a much bigger distance. When I do
that this is the pattern that I get I
get a different shape and you can see
that the Liberal Democrats now start to
push off up here alright so I think this
axis this vertical axis is kind of like
the people who don't share a lot of
votes with anyone else they sort of end
up up there so I think that's part of
what's going on although you can see
that conservatives end up up there too
so anyway, different
projection methods will emphasize or
allow you to see different aspects of
the structure of this data. And again, asking which one is the right method is kind of like asking what is the right method to analyze data — it's like, well, it depends on the question you're asking; there isn't going to be a single answer to this question. Okay,
yeah and here I'm sort of saying the
same thing right so there's all of these
free variables there's all of these
places where our domain knowledge our
knowledge of the actual real world
problem comes in. So maybe we shouldn't even be looking at votes if our question is about how people are working together; maybe we should be looking at committees, maybe we should get the transcripts and do text analysis. I don't know, maybe we should look at voter behavior rather than legislator behavior if we want to talk about polarization. And then also we
have to ask okay but you know what it
what is the rhetorical structure of the
argument that we are going to be making
right what are we saying and different
data analysis methods will support us
saying different types of things so this
is our first mention of robustness in this course — it's a word you're gonna hear me say a lot. If different analysis methods give you different stories, you have a problem. Well, it's not necessarily a fatal problem; it's a problem we know how to deal with in journalism: if
different sources give you different
stories you have to reconcile that in
some way right you you have to you know
either somebody's giving you inaccurate
or incomplete information or opinions
really do vary and you have to represent
that in your story so it's the same
thing with data analysis. What you're hoping for is that a variety of different methods — including non-data methods, including qualitative methods — will give you qualitatively the same answer, and then you have a very strong story; it's like every source telling you the same thing. Okay, so our next topic today is
text analysis and I'm gonna start by
just showing you some examples of what
people have done with it and then we're
going to develop the tf-idf model and
then we're going to show the development
of this text mining system — which I think, given the time, we'll probably have to push to next class. Text analysis could mean all types of things. Here's a story I worked on at ProPublica about a child foster care facility in California called Level 14, which — anyway, it's a long story — had all sorts of problems. What I had to do was extract incident type from a
bunch of documents that looked like this
right, so these were scans, redacted scans that we got, OCR'd with a lot of errors, and I had to basically get date
and type so it's basically just a big
data extraction problem and this is very
common analyzing a document set could
mean a lot of things this was from a
story that came out of Reuters
on, basically, these Yahoo news groups where people were — how to put it — let's call it unauthorized adoption: they were trading children. There was a public news group and private news groups as well, but to avoid various types of legal complications they went through the public news group — went through over 5,000 messages — and they had to determine by hand which messages were talking about particular children, and then they broke down the demographics in a bunch of ways. So this is basically a counting analysis, but it's a very complex counting analysis.
Analyzing Twitter data is still obviously popular — you know, Twitter is the E. coli of social network analysis, it's the model organism: it's all public, so it's really easy to get the data, and everybody knows how to use it, so everything is done on Twitter, even when it probably shouldn't be. This is just
this piece that looked at the
demographic breakdown of who talked
about particular hash tags and then
sentiment analysis it doesn't really
matter what I say all of my students
want to do sentiment analysis on Twitter
I recommend against it it's a very
appealing notion right the idea that you
can track how the general public feels
about a topic. You can't really do it, though, for a couple of reasons. One is that sentiment analysis is more complicated than it seems: even humans only get about eighty or eighty-five percent accuracy on sentiment analysis, because language is pretty ambiguous — sometimes you just can't tell how somebody feels from a short tweet. The other problem is that Twitter is not even close to a random sample. I mean, leaving aside the question of the demographics of the people who are on Twitter, tweeting about a topic at all is intense self-selection, so without really sophisticated modeling of who talks about a topic, it's very hard to understand what you're seeing. Nonetheless there was a
sort of peak — I think it passed about four years ago — of people doing Twitter sentiment analysis. To my mind, this is a much better use of sentiment analysis: rather than trying to ask how people feel about an issue, what they did is they looked at how draft versions of inspector general reports differed from the published versions, and the way they did that is they used sentiment analysis to find cases where the language was
softened. And there's a nice recording from the NICAR conference a few years ago talking about it; I also talk about this case in a paper which we'll look at a little bit later. One of the interesting things about it is they had to retrain the sentiment model to work on these reports. This is very common with sentiment analysis — it's also one of its problems — generally sentiment analysis is very domain dependent. So in this domain, for example, the word recommendation, which is normally a positive word — these are audit reports, so recommendation is a negative word: if you have a recommendation in an audit report it means they found something bad. So it's kind of interesting — I think they found 400 negative references were removed. They didn't want to count all of those manually, so they built a sentiment analyzer that would work for this domain.
Classification is one of the most common types of text analysis. We saw a very complex manual version of that in the child exchange story; this is a much simpler classification problem. All you want to know is, starting with the description that the police recorded about a particular incident, was it an aggravated assault or a regular assault. There's a set of criteria for whether something counts as one or the other, and you can read the incident report and decide. What they found by piloting this — by going through a year of reports, about 20,000 of them — is that many assaults were misclassified as less serious than they really were. So they went through a year of this stuff, and that generated training data, and they used the training data to train a classifier, and all the rest of it,
and we'll look at this in more detail, but document classification is a pretty standard technique used in a bunch of journalism. Oh, and here's what it looks like — I love the language that they use here, and some of the arguments. Anyway, you know I keep talking about domain knowledge; this is what I mean. Standard algorithms will not work for journalism. The algorithms you learn in textbooks, or that you load from scikit-learn or NLTK, or that you're gonna learn in class, generally won't work. They can often be made to work, but out of the box they're not gonna work, for a variety of reasons. We're also gonna look
at topic models — how many of you have encountered these LDA-style topic models? Okay, we're actually gonna get into those next class and I have an assignment on them. This is a popular method to — well, you want to say analyze topics, but really what it does is pick up patterns in word distributions. It is a latent variable model; we'll get into what those are. There's only one use of it in journalism that I know of, and that was another Reuters story that looked at the topics of a bunch of court cases that were before the Supreme Court, to try to understand which topics were actually accepted by the court. So that's a very
fast run through of some of the things
you can do with text analysis in a
journalistic context there's I mean
there's lots more right and people are
inventing new ones all the time but
really, I show you this — we'll talk about it all more tomorrow — but I show you this
because I'm trying to motivate you to
want to learn this thing that I'm going
to try to teach you which is the
document vector space model so now we're
going to jump in to a bunch of
abstraction and what I'm going to teach
you is a thing called tf-idf which was
used to do the
clustering of Message Machine. What it's used for is to generate document vectors, which can then be compared with a distance function to decide whether two documents are — in this case — nearly identical; you can also ask questions like, do two documents talk about the same topics. This is also
foundational search engine technology so
every full-text search engine uses
something like this
Maybe a modern search engine will use lots of other parts too, but you can't really build a search engine without this stuff. There's a bunch of things that you can use tf-idf for: clustering, as we've seen; you can use it to try to extract keywords; you can use it to try to figure out the topics of things; it's a key part of every recommendation system that works with text — Google News and so on; and it's a standard representation for doing document classification, because if you're running a classifier, you're running the classifier on vectors, so you have to turn the text into vectors — and search engines are made from this stuff. So here's the idea: we're trying to
design a representation of a document as
a vector that is as a series of numbers
in particular as a series of numbers of
a certain length — we more or less need all of our document vectors to be the same length to use the same representation. So the simplest way we can start turning documents into vectors is by counting words. This is an ancient technique; people started doing this with the Bible in the 12th century — it's called a concordance when you do it to the Bible — and you can think of the counts as
the numbers and the words as the
dimensions right so
this document has an 18 on the 'animal' dimension, and the complete set of dimensions is the vocabulary of all of the documents. So this is just word counts. You can also use word frequency, which is just the count divided by the total number of words. Word frequency vectors, or word count vectors, are a fundamental representation which you can use in lots of places.
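For a single document this is just a few lines of Python — a sketch with a made-up scrap of text:

    from collections import Counter

    text = "the cat sat on the mat and the dog sat too"      # hypothetical document
    words = text.lower().split()

    counts = Counter(words)                                  # word count vector, as a dict
    total = sum(counts.values())
    freqs = {w: c / total for w, c in counts.items()}        # word frequency vector

    print(counts["the"], freqs["the"])                       # 3 and 3/11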
They are of course equivalent to a word cloud; here is a
word cloud of the same document — this is actually a press release from US Senator Menendez of New Jersey — and you know, the reason people use word clouds is that it kind of tells you what the document is about. So what's this document about, just from the word cloud? Yeah: enforcement of laws around reporting of animal cruelty, and there's this particular guy Michael somebody who was involved in dog fighting, and they made an example of him, I suppose. So the fact that word clouds can tell you interesting things about a document should convince you that word frequency can tell you interesting things. Notice that 'the' doesn't appear here — they've removed it — so that tells you that already we want to do something to the word counts: we don't want the raw counts. So here it is just a
little more formally right where we're
turning a document into a vector where
each dimension is a word and you know
here's a couple examples this is what it
looks like with two sentences and you'll
get this matrix. This matrix is called the term-document matrix. An individual entry, which is indexed by term and document, is called the term frequency — although in this case it's not actually term frequency, it's term counts; divide everything in the first row by 3 and the second row by 4 and you get term frequencies. So this is the TF part of tf-idf.
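Here's a sketch of building that matrix with scikit-learn — the two sentences are made up, not the ones on the slide, and note that scikit-learn puts documents in rows and terms in columns:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat on the mat", "the dog ate my homework"]   # hypothetical corpus

    vec = CountVectorizer()
    counts = vec.fit_transform(docs).toarray()          # term counts, one row per document
    tf = counts / counts.sum(axis=1, keepdims=True)     # divide each row by its total: term frequency

    print(vec.get_feature_names_out())                  # the terms, i.e. the dimensions
    print(counts)
    print(tf)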
One thing about this representation is that
it is completely insensitive to word
ordering so if you have semantics that
depend on the order of words which is of
course very common in language then
you're not going to be able to pick that up — it disappears in this model. This is why this is called the bag of words model: you take all the words in the document, you throw them into a bag, and you lose the ordering. You know, this is less terrible than it seems. Before you can get there, though, you have to break the text into words, and that is more complicated than it might seem. So you're going to do some
tokenization in the assignment for the next class, and you know, the algorithm I've written there is really straightforward tokenization — it's about the dumbest thing you can do — and it'll work okay.
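Something along these lines — a sketch of that dumbest-thing approach, not necessarily character for character what's in the assignment:

    import re

    def tokenize(text):
        # Lowercase, then split on anything that isn't a letter or a digit.
        # Punctuation, abbreviations, and languages without spaces all break this.
        return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]

    print(tokenize("Don't ask the U.S. cat why."))
    # ['don', 't', 'ask', 'the', 'u', 's', 'cat', 'why']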
There are lots of cases where this throws out information that you
want. So for example punctuation: if you break words on punctuation, does that break "don't" into "don" and "t"? What about the periods between the letters of an abbreviation — are those going to get broken up? You have to break on periods, because sentences end with periods. Some more sophisticated tokenizers deal with punctuation, and in — what is it — Chinese, Korean, Japanese, you don't have spaces between the word equivalents, so tokenization is a completely different process; it actually becomes like a probabilistic word segmentation algorithm. Language is wack, is what I'm saying. You will find over and over that language is wack, I promise.
Okay, so now we've got vectors; now we want a distance metric. This is useful for a lot of things — clustering, as we saw, or matching a search query. How a search engine actually works, at the base level, is it calculates the distance between every document and the search query, and then it sorts by distance and shows you the closest results. That's it; that's pretty much what a search engine does. Now there's a
bunch of other signals — apparently Google now uses thousands of different signals or something, so the links thing, which is PageRank, and so forth — yeah, sure, all of that. But the most basic signal is: does the document contain the query words? And the document doesn't have to contain the exact word, but is there a similarity? So this is basic stuff, and the concept of the distance function we'll build is just overlapping terms. So we're
gonna build up something called cosine similarity, which is basically overlapping terms in document vectors, and we'll see why it's called cosine in a minute. There's different ways you could do this. You could use the Euclidean distance to compare your document vectors. What I mean by that is: I have a document vector, you know, the dimensions are words — it's always cats sitting on mats with me — so this is document 1, this is document 2, and I have certain counts here; there's a lot of cats and mats here. And the Euclidean distance between these two documents is: I take the sum of the squares of the differences — so it's (2 minus 1) squared, plus (1 minus 1) squared, plus 3 squared, and so on — and then the square root of that. I can do this and it'll work; it
has some drawbacks. Basically, the difficulty is that we need to normalize: longer documents will end up being farther away from everything because they have more words, and we'll get to normalization in a second. But you can do this; it will work sometimes. There's a standard thing that we use instead, though, and what we do instead is take the dot product. That is to say, instead of taking the squared differences, we do an element-wise multiplication and take the sum of those element-wise multiplications. One thing to notice about this immediately is that anywhere there's a 0, that term drops out, meaning that this takes into account how many words overlap between the two documents — the dot product term for any word disappears when that word only appears in one document and not the other. And it's also, as I noted on the
slide it's not a distance function
because this actually gets bigger when
the documents are closer so sometimes
this is called similarity and you can
think of similarity as 1 minus distance
one of the difficulties here is that the
longer the documents the more words that
are in it
these numbers get higher or the more
different words that are in it the
higher this gets for the same document
right so if you think of comparing a
document to itself the longer the
document is the higher this dot product
is going to get. And so this presents a problem: it screws up the similarity calculation, it just gives you clearly the wrong answer in a lot of cases. So which document do you think the query should match here? A, right? But it doesn't — it matches B. So what we do is we normalize
the documents and the way we do that is
we make the document vectors have unit
lengths according to the standard l2
norm
Does everybody know what I mean by the L2 norm? I'm trying to calibrate where the math level is. So there's different ways of measuring the length of a vector. One of the things you can do is just sum the absolute values of all the components — that's called the L1 norm; you can think of it as taking each component to the first power. Another thing you can do is sum the squares of the components and then take the square root — that's called the L2 norm. You could also take each component to the p-th power and then take the p-th root, and occasionally you'll see, you know, the L3 norm, or the L-infinity norm, which if you think about it for a second will just give you the maximum component.
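In symbols — these are the standard definitions, not something from the slide:

    \lVert x \rVert_1 = \sum_i \lvert x_i \rvert, \qquad
    \lVert x \rVert_2 = \Big( \sum_i x_i^2 \Big)^{1/2}, \qquad
    \lVert x \rVert_p = \Big( \sum_i \lvert x_i \rvert^p \Big)^{1/p}, \qquad
    \lVert x \rVert_\infty = \max_i \lvert x_i \rvert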
anyway this is a standard Euclidean norm
that we're doing here right so we're
dividing everything by the square root
of the sum of the squares and we're
defining our similarity function again
not a distance function a similarity
function by the dot product divided by
the lengths of these things — which, if you remember your vector geometry, is just the cosine of the angle between them. Conveniently, this also gives us a result between zero and one.
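A minimal sketch in numpy, with toy count vectors rather than the slide's numbers:

    import numpy as np

    def cosine_similarity(a, b):
        # dot product divided by the product of the L2 lengths = cos(angle between a and b)
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    doc1 = np.array([2, 1, 0, 3])          # counts on the cat / sat / dog / mat dimensions
    doc2 = np.array([1, 1, 4, 0])
    print(cosine_similarity(doc1, doc2))   # always between 0 and 1 for count vectors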
So what's a zero? If I run this formula and get a zero, what does that tell you about the documents — okay, but linguistically? Yeah, no overlapping words. What about a one? Yeah, they're identical. So the only way you can get a one is if a dot b is equal to the product of the lengths of a and b, and the only way that can happen — you can either look at the formula for the dot product or you can think about the geometry of this — is if the vectors are parallel. So now we've got a function that says zero if they're completely different and one if they're completely identical, and if we try this again with all of this normalization going on, now we match the correct query — we've matched A instead of B. So that's basically
why we normalize: it fixes this problem with longer documents. So here's a picture of what's going on. You can think of each of these document vectors as vectors in some space, and because all of the counts are positive they are actually in the — it would be a quadrant in two dimensions, or an octant in three dimensions — the one-over-two-to-the-n piece of the high dimensional space where everything is positive. I'm sure there's a word for that — hyper-octant, there you go. They're in the positive hyper-octant, and
to match a document against another document, we look at the angle. And they're all normalized, which means you can actually imagine them as lying on a quarter circle in this picture; after we normalize them they're all going to be on a circle in two dimensions, or a sphere in three dimensions, or a hypersphere in the real
world and everything's on the surface of
the hyper sphere and now we're taking
the angle between two vectors on the
surface of a piece of a sphere the
positive section of the sphere and I
tell you this because although you can
always grind through the math the
easiest way to think about this stuff is
to think about it in a geometric way — all right,
your geometrical intuition will almost
always give you the right answer so then
to turn similarity into a distance function, all we do is say 1 minus the similarity. So we went from something that is 0 when the documents are distinct and 1 when they're equal, to something that is 0 when they are equal and 1 when they are distinct. And again, you can grind through this and show that it is a distance metric: it's symmetric in a and b; when a and b are identical it gives 0; and the triangle inequality is a little harder to show, but it basically just follows geometrically — if I have three points on a sphere, then the angle from the first to the third can't be any larger than the angle from the first to the second plus the angle from the second to the third. And again, you can prove all of this if you're concerned about it. So that is cosine similarity, or
cosine distance. There are some problems if we do this on raw counts, and one of the big problems is that all of the common words are going to dominate the result. Actually, sometimes we want this — there is a case where we want this — stylometry. So, you know, you're all familiar with the Trump insider op-ed, which was — was that just last week? A lot of people have wondered who wrote it. There is a classic statistical technique to figure out who wrote something, and the basis of that technique is to look at the frequencies of common words. The reason you use the common words — and you compare to a corpus,
right so you need a bunch of different
documents by a bunch of different
authors and then you compare the
frequency of common words with the
unknown document to the statistics of
the authors the reason you use the
common words is because the common words are independent of topic: the authors might be writing about different things, each author might have different topics, so if you start using words like, you know, 'Iraq' or something, then sometimes they'll match and sometimes they won't. But if you ask questions like how often do you use the word 'which', then you start to get results that carry over between different documents.
So sometimes this is what you want, but usually not. This is not what you want for a search engine or for document clustering; what you would like is some notion of words that are distinctive, words that appear in some documents but not others, and so the intuition here is that we're going to boost words that don't appear in very many documents. That's right — I have my own theories about that. I think that document was a deliberate pastiche; I think either it had multiple authors or it was built in part by cutting and pasting sentences from other places, because they knew that everybody would do stylometry on it. Even so, I'm excited to see — maybe one of you can do it, it's got to be fast because it's gonna be this week or next week — the first person to do a serious stylometry analysis on it. I wonder what it'll say. And yeah, 'lodestar' is interesting,
but what about the statistics of all of the good old common words — what happens if we apply the standard stylometry techniques? Although I will, you know, say that the statistical techniques are very useful and interesting and probably the most justifiable way to come to such conclusions, the Unabomber was caught by doing manual stylometry, like the lodestar-style stuff: there was a particular phrase he used that appeared in his master's thesis plus his manifesto. Anyway, there's lots of history in stylometry — people have done this for Shakespeare; for the Federalist Papers there's a very famous statistical analysis that said that Madison wrote
the disputed papers. Fun stuff. We want to down-weight words, and not just the common words: obviously you want to down-weight the stop words, but we may also want to down-weight 'car' if we have a data set full of car reviews — though not if we have a data set that doesn't mostly talk about cars. So this tells you that the weighting function has to be a function of the entire document set and not just the individual document, because we want to vary the weights based on the other documents. So about the simplest thing we
can do is count document frequency. So instead of counting how many times a word appears in the document, we count how many documents a word appears in, divided by the total number of documents — so we've got a frequency, not a count. And I don't know, I guess I wanted to give you a formal definition, but it's not a very complicated idea — I've done this set-notation thing here, not that it's particularly clarifying: it's the fraction of documents that have that word. And then we say, okay, well, we want to down-weight words that appear in many documents,
so we're gonna flip it: we're gonna put big D, the total number of documents, over the number of documents that contain the word, and we're gonna throw a logarithm in there, because — what the heck. I mean, the actual reason we need the logarithm there is that we want a function that compresses the range. Say we have a word that appears in five documents and a word that appears in ten documents: five to ten is a pretty big jump; 90 to 95 out of 100 documents is not a big jump. So that tells us that we want more resolution in the lower end of the scale, at smaller numbers of documents — we want one additional document to mean more when there aren't many documents. So we're interested in relative or proportional changes, and applying the logarithm changes the scale so that an equal increment of IDF means an additional multiplicative factor in the number of documents a word appears in. But
guess what — we're totally making this up, and almost any function that compresses the upper end of the range will work, and by work I mean give you search results that a human would agree with; that's ultimately how this stuff was originally evaluated. So, putting it together, we now have a function that tells us what value to assign to each dimension of a document vector — again, the dimensions are words, or terms as we like to say — as a function of the term, the document, and also the entire document set.
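One common variant of that function — there are many flavors, as discussed below — with |D| the total number of documents (the "big D" above) and n_t the number of documents containing term t:

    \mathrm{tfidf}(t, d) \;=\; \mathrm{tf}(t, d) \times \log \frac{|D|}{n_t}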
All right, so notice that if I add one document to the set, it
will change the tf-idf vector of every
document which contains any word which
is in that new document so it's a
contextual calculation. Yes — which is what I'm saying, and which is what we wanted: we wanted the importance of the word car to depend on whether this is a corpus of car reviews, in which case we don't care, or, you know, the AP news corpus, in which case, yeah, it should definitely stand out that this document talks about cars. This has
some interesting implications by the way
this means that, in principle, if we care about this type of idea — never mind the algorithm, if we care about this idea that the importance of a word depends on the corpus — then every time somebody adds a new document to the web, the search results for every other search should change on Google. Which is actually true, right? I don't know if it's true on sort of a local scale, but it is definitely true that as the web evolves, Google's search algorithms will give different results for other searches. So now we are looking at tf-idf
on the same document and notice that all
the stop words have dropped out here is
the word cloud again right so this is
just straight word frequency and here's
tf-idf, and notice that it's promoted some things: Michael and animals are much bigger. Michael is relatively small here and relatively bigger here, and that's because the word Michael does not commonly appear in his press releases — and if you read the press release, it turns out Michael is pretty important to this press release. Press releases from Senator Menendez? Oh yeah — I mean, up to when I scraped them, which was like 2012, but it's like five years of press releases.
All right, so Michael appears twice in this excerpt, which is not a very large number — like, you know, all the stop words appear more than that — and then animal or dog fighting also appears twice. Dog fighting doesn't rank; that's interesting. I think actually that's because dog fighting only appears in this document and no other, and the standard tf-idf algorithm I wrote throws it out. Yeah, this is kind of interesting — this is the practicalities of this stuff: you throw out words that don't appear in more than one document, and the reason you do that — yeah, that's a different problem, right; any word which has multiple meanings is gonna have a hard time here, and actually when we get to LDA we'll look at one potential solution to that
problem but take a look at this formula
the IDF term — n_t, the number of documents that the term appears in: if that's one, then that term is gonna get a huge boost, we're gonna get the log of the total number of documents, so that term, you know, would match perfectly. The problem is that if we're trying to cluster documents, which is what I actually built that algorithm for when it generated that tf-idf data, that term can't match any other document: if dog fighting doesn't appear in any other document, then it can never contribute to the cosine distance to another document. So for a search engine you might not want to do that; for clustering it actually cuts down the data quite a bit, because there's lots of words that only appear in one document.
Anyway, that's tf-idf. This stuff is from the 70s — a guy named Salton did a lot of work on this; this is foundational search engine stuff — and this is how he drew it. He said that, you
know we would like the document vector
space to have these little clusters and
we want the clusters to be documents
that are about the same thing in fact
there's a formalization of this called
the cluster hypothesis. Here are some actual multidimensional scaling plots of two sets of document vectors: the TF vectors and the tf-idf vectors. The colors are manually assigned topics — that's me going through and tagging this, and we'll see the tagging interface a little bit later, I guess next class at this point.
If you look at the TF side, what you see are these sort of long streamers — I see one that's going sort of to the upper left and I see one that's going almost straight up — and then a big central blob in the middle. What's happening there is that those sort of linear dimensions are the projection of the stop words: one of those things is basically the 'the' dimension, and our space is really warped just by the shape of how common those words are in English and how common they are in the press releases. Some of these dimensions are going to be, like, the Menendez dimension, because they all have his name in them; there's going to be, like, the 'appeared' dimension, or I don't know, whatever it is that politicians say most commonly; and then that central blob in the middle is kind of everything else, and so we don't get very clear separation of the clusters. When we weight the term frequencies by IDF we cancel out not only the stop words but also the Menendez-style stuff, and we
get much more distinct clusters as you
can tell from the fact that many of
those topic clusters are now much more
compact and distinguishable all right so
that's why we do this. There is an interesting assumption here — an assumption which is empirical in nature. The cluster hypothesis has appeared in various forms since people started doing this type of search engine algorithm design; it dates from the earliest days of tf-idf and even earlier, actually — there are sort of proto versions of this from the fifties — and it says that this algorithm of turning things into a vector space, this word counting thing, tends to put semantically similar documents together, or, as they say in this case, documents that behave similarly with respect to information needs, if you're building a search engine.
but we do all kinds of things we're
gonna we're gonna look at document
clustering we're gonna look at like
keyword analysis and all kinds of stuff
and the underlying assumption of all of
this stuff is that the clusters in this
space mean something. Now, you can't
prove this from the mathematics and in
fact the tf-idf formula is not really
that important there's dozens of
variants, and one of the references in the syllabus goes through what some of these variants are. Different search engines and different packages will give you different answers; it doesn't really matter, they all kind of do the same thing, they all work on the same intuition. And I want to point this out because in almost all of the work that we do there is some sort of often unspoken hypothesis of this type which connects the semantic and the mathematical worlds. I talked about the connection between the words and the data: the only way that happens is by introducing hypotheses like this, and to the extent that the hypothesis is correct, the statements that you write about the data have empirical meaning. All
right so there's something going on in
the background, and also tf-idf just works: you can build useful search engines with this even though there's poor theoretical justification for it; empirically it works pretty well. This idea — not just tf-idf, but the idea of encoding documents as word vectors in some fashion, most commonly in a tf-idf-ish type of way — is called the vector space document model, and it is foundational to all types of text analysis, including the text analysis that is built into Overview.
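To make the whole pipeline concrete, here is a minimal vector-space-model sketch in Python, with hypothetical documents and a hypothetical query; scikit-learn's TfidfVectorizer stands in for the tf-idf weighting described above and cosine similarity does the matching:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "senator urges enforcement of animal cruelty reporting laws",
        "surgeon complication rates published for knee replacements",
        "private security contractors and the state department",
    ]                                                        # a made-up document set

    vec = TfidfVectorizer()
    doc_vectors = vec.fit_transform(docs)                    # tf-idf document vectors

    query_vector = vec.transform(["animal cruelty laws"])    # the query, in the same space
    scores = cosine_similarity(query_vector, doc_vectors)[0]

    # A toy search engine: sort the documents by similarity to the query.
    for score, doc in sorted(zip(scores, docs), reverse=True):
        print(round(score, 3), doc)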
complication rates about two percent
with fairly wide error bars and this is
a piece that was produced
ProPublica from five years of medicare
data so Medicare doctors provide about
40 percent of the medical procedures in
the United States so it's it's a as big
of a sample of medical data as you're
going to get and there were seventy
million rows in the original data set
and so what we did is we screened for
all of the elective procedures and then
we adjusted for patient characteristics
to try to level the playing field in a
number of ways and we'll get into this
when we talk about when we do a
classroom regression so this is the like
classic data mining produces journalism
type of thing and this was very
influential by the way the reason for
public that did this was because best
estimates are that there are about 200
to 400 thousand deaths per year due to
preventable medical error it's one of
the leading causes of death in our
country patient safety advocates have
been saying that transparency of
surgical outcomes is one of the big
things that might help they've been
saying it for 20 years
Publica did this it was very
controversial but definitely advance the
conversation about transparency that
comes I worked on this did the
statistical program or we'll talk about
this later from the same papers the
first definition there's this other
definition which is even broader right
so it talks about not just the
reproduction
discovery presentation aggregation
monetization right so to me when I look
at this I think okay so now we're
talking about Google now we're talking
about the Facebook newsfeed and we will
definitely talk about those things in
this course we will study the deep
design of information filtering
algorithms that's actually basically the
topic of the first third of the course
you will design your own algorithms
we're going to talk about filter bubbles
we're going to talk about all the
algorithms behind these things we're
going to talk about designing them so
they serve social items because that
that much is that is very much in the
realm of what we talked about in general
is the school now because it's such an
important part of the public sphere and
a few past students have now gone to do
things like work on the Facebook
newsfeed so I'm training you not just to
be computational journalist working at a
newsroom I'm training you to be someone
who builds the infrastructure of the
public square that's that's sort of the
topic of this course so what does that
look like well this is an example within
a newsroom this is the talk by a show
this what kind of shape The Washington
Post said last year's computation plus
journalism symposium and what this slide
is is it describes their virality
predictor in other words story is
published you collect signals about
performance of that story for let's say
30 minutes and you want to know if that
story's going to be viral or not
alright so it's because you want to know
to promote it more right if you know
that a story is doing well there are
lots of things you can do you can push
it out on more channels to try to get
even more traffic you can think about
your ad sales on that page and maybe
even perform a real-time auction for
those clerks you I don't know that
that's certainly a possibility you can
write more stories on that topic all
kinds of things reasons you want to know
this and so this is actually some of the
most advanced machine learning models in
journalism tend to be on the business
side so we're going to talk a little bit
about that a bit less about that in part
because I feel that's pretty well
covered in other places this doesn't
look that different from sort of
business analytics data science stuff in
general but there's a lot of it this is
another example of using computer
science techniques to do something in
the public sphere that isn't producing
stories so we're gonna have a whole
class on tracking effects this was a map
of the spread of the Kony 2012 video
which if you remember this video yeah
okay so at the time it was the fastest
viral spread of any video ever and
perhaps anything ever and so gilad note
on maps the the network Twitter network
the distribution to service any other
major nodes were the point I'm trying to
make why including this is that the
system we're going to look at spans this
entire train from story production all
the way through distribution and then
after distribution to you can look at
the effects of the story and then it
cycles right it's actually not just over
here train it's a loop because the story
has some effects that changes the world
weary report on the world so we're in
this constant feedback system and there
is code everywhere and that feedback
system now the other major theme of this
course is algorithm accountability which
is what we call in journalism but the
idea appears in a lot of fields it's the
idea that algorithms have power they're
being used for making
decisions hiring decisions decisions
about who goes to jail we're going to
talk about that at great length all
kinds of stuff and so part of keeping
tabs on the power in society is keeping
tabs on what algorithms are being used
how they work and this is something
which only people with computer science
backgrounds can really do at least at
the level of asking how the algorithms
work there's the stuff around the
algorithms turns out to be very
important as well but that's no excuse
for not knowing how the algorithms work
so we're gonna start looking at reverse
engineering this stuff by the way you
all get these slides don't worry about
that so as an example of what that looks
like here's one of my my favorite
examples of algorithmic accountability
this is looking at how prices vary by
zip code for staples which is an online
office supply retailer and it found that
basically things were more expensive
away from urban centers and we're gonna
talk about this particular example and a
bunch of other examples right so one of
the questions you're gonna ask here is
why right and the shorter answer is well
it makes them more money but there's a
more complicated answer which is sort of
well how did they get to this was there
a human who was sitting there optimizing
things were they running an algorithm
dropped my sales which is what I believe
happened if so what did the algorithm
look like well I'm off the sighs about
what the structure that I dog where that
was and and then there's the effects
which is that generally higher income
correlates to population density meaning
that people are richer than cities
meaning that they are charging or people
more for the same product right so that
has obvious social implication
you know start expanding that out times
millions for every company and these
pricing algorithms start to have serious
implications for the structure of
capitalism and algorithms appear in
finance we're gonna be talking about
that quite a lot as well here's another
sort of algorithmic accountability story
this is for public ayahs message machine
which is an attempt to reverse-engineer
political micro-targeting in the 2012
election I don't think they did this in
2016 that's sort of kind of different
set of approaches but what they did is
they collected a bunch of emails
clustered all of the similar ones using
tf-idf so by the end of this class don't
know how they did that and use that and
collected demographic information use
that to try to figure out who got what
email so how did the Obama and Romney
campaign's target people and of course
in a post Cambridge analytical world
political micro-targeting is now very
much in the spotlight so these are the
kinds of things we do right and so you
can think of it as basically two
directions one is you use competition to
say or discover things about society and
the other is you use you do journalism
about algorithms in society so you can
think of it as applying computers to
journalism or applying journalism to
computers and by the latter I don't mean
the tech press right we're not going to
talk about like covering new Apple
hardware we're going to talk about the
effect of changes in the facebook news
feed algorithm on democracy which is
kind of a different
sorry yeah yeah I mean where stories
come from it's kind of an interesting
question but I am presupposing that
there is some data somewhere that we
have to look at I don't think you can do
the sort of first direction or
computational journalism about data so
much data can mean a lot of things of
course I can I don't mean you can
download the CSV or you didn't make a
data set you can and people have done
stuff like this well look at a piece
there Guardian went through the laws in
every state about same-sex couples right
so this is pre gay marriage ahead of it
by a couple years and you know
visitation rights child care marriage or
other civil union there's there's this
whole list of Rights that same-sex
couples did or didn't have anything okay
so they assemble this data set by
reading the laws it discuss is that
helpful yeah
yeah well that's me
like how do you find a story right yeah
yeah well I mean is it always a Deena
source or is that something that's no I
mean Marta in the guardian case the the
data source came after the concept for
the story right so I mean I guess the
short answer is that could happen it's
all different ways
oh okay so here's kind of the structure
of some of the topics down in the middle
are the major topics of this course
that's more or less a synopsis of some
of us the things we're gonna we're going
to talk about each of those is a pretty
big topic and then around the edges are
some of the fields that we're gonna have
to talk about so this is an extremely
interdisciplinary course that's one of
the things that I like about
computational journalism is it's really
right in the middle between and it's
also very practice oriented right you
have to actually produce a story and
publish it so where this is this isn't
although we will talk about some very
abstract mathematical things this is not
an abstract game so it's it's a really a
fun place to work and you can you can
use it as sort of a jumping-off point to
all kinds of incredibly interesting
areas and so we're gonna we're gonna be
talking about all of these things we're
going to start on this course with some
information retrieval and natural
language processing by the end of this
class you will know how to build Google
version 1.0 right well we're gonna talk
about basic search engine operands basic
information retrieval stuff just in
terms of the structure of the class
chances to get a sense of this okay so
how many of you are not dual degree
students okay yeah so so for you this is
a three-point class and for the dual
degrees it's a six point class how we
manage that is if you if you are not a
dual degree student you don't have to do
the final project — there was some confusion over that last year — so for the rest of you, there will be a final project.
we're gonna start talking about it
earlier we're gonna talk about it a
bunch next class this will be either a
reported project or basically a piece of
computational journalism research it's a
pretty you know substantial piece of
work there will be a sort of midterm
presentation where you talk about the
progress that you've made and they'll be
at the last class you'll present your
projects as well. So, you know, every year there's way more stuff that I'm interested in than I get to report on personally, and so part of what I do is put a list of things on the board that I wish I had time to study, and you're free to pick any of those topics or
anything else and hopefully we're going
to produce some some novel either
investigative journalism or research and
then there's assignments and
participation so the class participation
mark it's pretty minimal of what I
expect
I expect two things show up and say
something in each class like anything I
don't care if you say like two words I
just want to know that you said
something in each class — that's it. Yeah, twenty percent just for showing up, but open your mouth. Alright, about half of the assignments will require programming in Python; however, even for the assignments which require programming in Python, I'm
not really gonna be looking at your code
I mean, you know, I might check that it's reasonable, but this is not a programming course. The assignments are gonna have the form: write some code and explain to me what's going on, and if you can't explain
to me what's going on then you're gonna
have difficulty with the assignment this
is a journalism course so the explaining
what's going on is the key skill right
it's not just about impressing me it's
about showing me that you can write
about complicated topics. The class is pass/fail for all of you journalism students — you'll get a letter grade if you need one — and the assignments are
pass/fail every year a few people fail
an assignment if you're one of those
people, it's okay — you're not gonna fail the course if you fail one assignment. It's my way of telling you that what I expect is something very different; it means come to my office hours, let's have a conversation — it's normal. Oh — office hours? Good question: probably Thursday afternoon, but we'll have to talk about when. Okay, so this
is what we're actually gonna try to
cover in this class. Today we're gonna start with this sort of very abstract concept of high dimensional data, which is, you know, foundational to all of data science basically; then we're gonna build up the document vector space model, put together tf-idf vectors, and use it to talk about a piece of document mining software that is used in
journalism so we're going to talk in
some detail about the evolution of that
platform and how it works so high
dimensional data. There's this quote that is something like: measurement is the most neglected area of statistics — the process of how data is produced is fundamental to all data analysis work, yet if you open a statistics textbook it says almost nothing about it. Does anybody know where this image is from? Occasionally I get somebody in the classroom who recognizes it — it's actually from a children's story called The Lost Thing, and it's a very sweet little story
about a boy who finds a finds this
creature on the beach and sort of adopts
it and it's not really clear what it is
it's not really machine and it's not
really an animal — you know, he doesn't know what it eats, it doesn't really fit in anywhere, it lives in his garage — you know, it's a parable for not fitting in.
But what I'm saying here is that the entire world doesn't fit into data, alright? There's always that arrow — that arrow is the quantification operation — there's always some crazy abstraction that happens in the quantification process. How many of you have done any sort of experimental field work, like asking somebody survey questions? A little bit. Anybody done any ethnography? Okay. Anybody been involved in data entry,
census collection okay we'll talk about
this a lot more in classes two and three
I think. But the question of data sources and data quality is a really big one that is really important in journalism, and perhaps talked about more in journalism than almost anywhere else — as I said, statistics textbooks neglect this almost entirely. But anyway, if we can manage to convince ourselves that there's a quantification procedure, what we get is: we turn real world objects into a vector of, normally, numeric entries — alright, in fact always numeric; we end up recoding categorical variables.
so we're gonna work through a little
example here which is data from the UK
Parliament in 2012, and there's actually a reason why I picked 2012 — it's not just that that's when I built this example — and
this is what it looks like basically we
have a list of bills you can see the
titles there and there's repeated titles
because there's often multiple votes in
the same bill, and lists of MPs and their votes — minus X means didn't attend, and I think 4 is no... yes, oh yeah, it's at the bottom, yeah. And what
we have in this set of data is over a
thousand MPs and 1600 votes
So the analytical question is: is there structure in this data? Does anybody have a suggestion on how to proceed, to try to make sense of what's going on here? Yeah — so each of those columns, the MP IDs, is a particular person; there's actually another table that gives their name and their party. Party — maybe a good idea, huh? Yeah, so imagine your editor comes to you and says, tell me about voting
patterns in the House of Lords
right we're looking for sort of initial
exploratory steps here I mean you've
done exploratory data analysis in a course, yeah? Okay, so let's talk about high dimensional data. If this had only a few dimensions you could just start making charts, right — you can make, you
know you can build a table that
says you know for any given bill you
know how many labor how many
conservative and so forth voted on that
bill right so you can sort of imagine
taking sections of this but you know I'm
feeling ambitious and I want all of us
so we're going to talk about reducing
the dimension of this space so this
space is — if we consider each MP to be an object, so we have a thousand MPs, so a thousand rows of data and sixteen hundred and thirty votes — so now we have a thousand data points, each of which is sixteen hundred and thirty dimensional, and I want to visualize all of them. So the basic problem we have is we have to go from 1,630 dimensions to two or perhaps
three. This is a very common problem, and in data science this is called dimensionality reduction. How many of you have seen some version of this in your other classes? Okay, yeah — what dimensionality reduction techniques did you see? Yeah, so
that's where we're going with this we're
gonna we're gonna build up to PCA today
how many of you know what principal
components analysis is okay yeah so this
this is basically how this entire class is gonna go: I'm gonna talk about something, and some of you will know what it is and some of you won't — so whether you know it or don't know it, that's fine. Then the basic technique that we're going to talk about is projections,
right so what projection is is it is a
classic
in fact ancient technique for dropping
dimensions
although this this form of it was
actually worked out during the
Renaissance, and some of you may be familiar with techniques where painters would, you know, look through a sheet of glass, or rig strings between the points of an object and a central eye point and plot their intersections. The mathematics of this was worked out four or five hundred years ago, and you can think of
it as finding a direction to look from
and then dropping the depth dimension so
this finding a direction to look from
it's pretty important and I'm going to
go to my favorite rotated cube so this
is a data set it's like a the iris data
set where there's a bunch of points and
in this case the points are in three
dimensions and we're looking at a
two-dimensional picture of course and
we're trying different angles on this
and what I want you to notice is that in
some angles the clusters really overlap
so you know the red and the black will
be on top of each other in a minute and
if you're trying to see the clusters clearly, that's not the way to go — if they weren't already colored red and black, you wouldn't really be able to see what's going on there. And so,
for example this is not a particularly
good direction to look from because
everything's overlapping but there are
other directions where you get a lot of
separation between the points and those
are directions you want to look at and
the theory of PCA kind of goes like this
so let's do sort of the simplest
possible example, where we're projecting from two dimensions down to one — and, is there an eraser over there? if you hand me the other one, that would be great — okay, so we've got these data points
and let's say there are two big groups
and we've got these two axes
which we're going to call X 1 and X 2
because that sort of convention in data
science X is the coordinate or the input
and we're gonna project them down to one
dimension now we can project from any
direction, but let's just limit our choices here and say we want to drop either the X1 or the X2 dimension. Well, if we drop the X1 dimension then what we're gonna get is the positions of these things projected onto this axis, and I'm not
gonna draw this little projection for
every everything here but you can see
that they're they're really gonna pile
up on top of each other here if we drop
the X 2 axis instead we're gonna get
some pile up that's what happens when
you drop dimensions but we're also going
to still be able to make out but there's
two clusters so we've preserved a little
more structure now the question is how
do we automatically pick which axis is better to project, right — which axis is better to keep? Excellent, any ideas? I think you are building up the intuition behind the MDS algorithm, which we'll also get to, but yes, good — a simpler one is: how do we know how piled up these things are? What's that — distance? Yeah, something-something distance. So a
simple heuristic to try to pick a good
projection Direction is just look at the
overall range of the data and preserve whichever dimension has more range, alright — there's gonna be less pile-up; it may not be ideal, but it's gonna get a lot of it. In fact, instead of a minimum and maximum we normally compute the variance, which is the standard deviation squared, as you'll recall, and
so the entire notion of PCA is that we pick the first dimension — the first PC — as the one which has the highest variance; the
second one is the direction which has
the second highest variance and all of
them are perpendicular so what we end up
with is a set of perpendicular
directions which have the 1st 2nd and
3rd highest variance and so forth so
when you put that into math speak you
end up with eigenvectors and covariance
matrices how many of you have done the
whole eigen thing maybe you have that
that background yeah some of you will
have been have been forced to do this so
typically you'll do this in like second
or third year linear algebra right
and it's one of these things where you just sort of go, why am I learning this? Yeah — it turns out the answer is
because it's actually an incredibly
foundational concept but not really for
us because we don't have to compute it
we're gonna do this in one line of Python code, but this is what's happening: we build this covariance matrix, and the diagonals of the covariance matrix are the variance along each direction, and importantly the off-diagonals are the covariance between two different coordinates, and that matrix actually encodes all of the information about whatever way this thing is rotated in space. And then we just sort of pick however many directions we want — so to project into two or three dimensions we pick the top two or three eigenvectors, which end up being the directions of maximum variance.
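(Just so the eigenvector talk isn't totally abstract, here's a minimal numpy sketch of the same idea on a made-up data matrix — the variable names are mine, not from the course notebook:)

```python
import numpy as np

# Toy data matrix: one row per object, one column per dimension.
X = np.random.randn(200, 5)

# Covariance matrix: diagonals are variances, off-diagonals are covariances.
C = np.cov(X, rowvar=False)

# Eigendecomposition of the covariance matrix.
eigenvalues, eigenvectors = np.linalg.eigh(C)

# The two eigenvectors with the largest eigenvalues are the two
# directions of greatest variance.
top2 = eigenvectors[:, np.argsort(eigenvalues)[::-1][:2]]

# Project the (centered) data onto those directions to get 2-D coordinates.
coords = (X - X.mean(axis=0)) @ top2
```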
So we're gonna run this on the House of Lords data in a second; before that I want to
talk about why and basically we want to
find structures in the data and the
simplest structure that we can find is a
cluster and the reason that clusters are
interesting is they produce a sort of
natural classification of the data right
so think about going out into an
unexplored country perhaps as Darwin did
and or at least unexplored by Western
naturalists and cataloging all of the
species that you find. And so you walk around and you have no idea — (was there music? okay) — so here's a small creature and it has four legs and it's green and it's amphibian, and there's another creature and it's in a tree and it has feathers and it's brown, and here's another one which is green and amphibian, and here's another one which is also green and amphibian but it's smaller, and you end up, you know, sort of plotting these characteristics, and
eventually you're like okay well I've
got some brown birds some frogs and some
green grasshoppers right you've you've
deduced the categories from the
structure of the data that you observed
so that's what's happening when we're
doing clustering and this is super
important categories are foundational to
language and thought and basically all
analysis and so this is why clustering
and categorization is such a fundamental
thing to do with data and we're gonna
we're going to see this in a bunch of
different contexts
Clusters are perhaps the simplest important structure that you can get in data. And this is a whole book on classification systems, which is awesome — and there's another book called Sorting Things Out, which is about how our classification systems affect not just our thought and our language but actually our institutions and our
processes like think about whether
something is classified as a mental
illness or not, alright — that can have huge effects on an individual's life course, and it's just a classification change. So
we're gonna do this so I'm gonna show
you what this notebook looks like when
we actually do it and you can follow
along. Did I post the online version of this? Let's see... oh no, I made a repo just for this, a new repo — I'll post it to Slack — you know, it's the old repo, here we go.
So here's the idea: we read these votes, and I reduced it to just the last hundred votes so it stays a little smaller — it's a hundred votes by 613 MPs,
and this is a bit different from the one
we saw in the slide this is just to make
this exposition a bit more
straightforward so here's the ID which
you can key back into the original data
if you want here's the party
conservative, XB means crossbencher — we're not going to get into the details of British politics — Labour, Liberal Democrat, and so on, and I've coded the votes so one is yes and zero is abstain and -1 is no. So there it is. Then I'm gonna write this little function which color-codes them by party, so this is kind of our legend, and we get black if we don't know the party.
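(A sketch of the kind of color-coding function he's describing — the actual party codes and colors in the notebook may differ:)

```python
# Hypothetical party -> color legend; codes and colors are assumptions.
PARTY_COLORS = {
    'Con': 'red',       # Conservative
    'Lab': 'blue',      # Labour
    'LDem': 'yellow',   # Liberal Democrat
    'XB': 'green',      # crossbencher
}

def party_color(party):
    # black if we don't know the party
    return PARTY_COLORS.get(party, 'black')
```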
And then we're starting to actually view the data. If we don't have dimensionality reduction we kind of have a problem, because we can only view three dimensions at a time — so each of these dimensions is a vote, right, so this axis would be the vote on vote 22, I guess — and we've got a bunch of problems here. So first of all, why — what is this pattern we see, why do they have this shape in the scatterplot?
yeah yeah basically we're getting all
combinations of minus 1 0 and plus 1 on
these three axes so that and so these
things are also piling up right there's
actually like a hundred dots piled up
here because we've got 613 total you can
see some combinations don't exist but
basically we have you know 3 times 3
times 3 27 dots so it's not really
telling us anything about the voting
patterns. And so what's actually going on here — this is the intuition I want you to develop — (the eraser? oh yes) — okay, this is the intuition I want you to develop: we are looking from the wrong direction. Okay, so you can talk about PCA in terms of linear algebra, you can do this all symbolically, but it's really all geometry, and if you can get a little bit of high dimensional geometric intuition then you can talk about all of this stuff in geometric analogies. And this is what's going on:
so if I project to this axis or this axis, what I get are three blobs, right; if I project to a diagonal axis I get something more interesting — I get more blobs, and the ones in the center are a little heavier if there are more dots stacked up, okay. And so if, in fact, what happens is that most of the MPs voted here, for example — so these dots are actually stacked on top of each other — then I'm going to start to see that when I look from this direction. As
you add dimensions, this diagonal view — which you can think of as just taking a sum of a bunch of different votes simultaneously, a bunch of different dimensions simultaneously — starts to be more interesting, because here you're looking at the sum of two votes; if you go into three or four or ten dimensions you're starting to look at the sum of ten votes, and then you're starting to see more values than just the three, you know, minus one, zero, plus one. So what we're
gonna do is we're going to let the PCA
algorithm pick the dimension and I told
you it was easy here it is
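(Here's roughly what that one line looks like with scikit-learn — a sketch assuming the vote matrix lives in a variable called `votes` with one row per MP, which may not match the notebook's exact names:)

```python
from sklearn.decomposition import PCA

# votes: array-like of shape (n_mps, n_votes) with entries -1, 0, 1
model = PCA(n_components=2)
coords = model.fit_transform(votes)   # shape (n_mps, 2)

# coords[:, 0] is the direction of greatest variance,
# coords[:, 1] the perpendicular direction of second-greatest variance.
```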
so this is what we get when we reduce it
to two dimensions I've done one other
thing here which is I've set the colors
to the party so first of all you can see
a lot more structure if you look at the
colors so Liberal Democrat is yellow
labour is blue conservative is red so as
you would expect conservative is red
labour is blue so you've got sort of
left and right here which tells us
something a little bit about this
political axis. And then we've got this other axis — which is, remember, the axis with the second most variation — which is where the yellow, the Liberal Democrats, are. Now this plot actually corresponds to the
political structure of the government at
that time does anybody know anything
about the UK government at that time
yeah when I when I taught this course in
previous years yeah people were like
yeah this is happening right now no
one's ever gonna get this question again
so I'll just show you
they had a coalition government which is
something that only happens in
parliamentary systems where they
eventually had a coalition between the
Conservatives and Liberal Democrats so
what you are looking at in this example
is the red and the yellow are in a
coalition government together but
they're not really voting together but
the Liberal Democrats who
maybe you think would vote more often
with labor the blue end up sort of
voting distinctly differently from the
blue party and but not quite the same as
the red party as well, right. So that's the question, right — what are the axes? What are the axes... yeah, kind of. I mean, here's the thing: the actual answer to "what are the axes" is that the first axis, the horizontal axis in this picture, is the dimension along which there is the greatest variance, and the y axis is the perpendicular dimension along which there is the second greatest variance. So from a technical perspective that is the answer, but I think really the question that you come up with when you look at this picture is: okay, but what do the axes mean? And to figure out what they mean, the only thing we can do is go back to the data. And so just looking at this picture — which is only interpretable because we put colors on the dots, right; if this was a black and white picture you'd be like, okay, it's a T shape — but knowing that, conveniently, Conservatives are on the left and Labour is on the right, you can see that the first
dimension is something like the
left/right political axis and actually
in every voting pattern analysis and
social attitude analysis that I've ever
done or ever found in the literature the
left-right axis just popped out. For a while I had this question, you know — is the left-right political divide real? Is this thing that people talk about — does the world actually split that way, or is this just sort of a narrative that we impose on the world? And it turns out it's real, it's really real; it tends to be, for some
reason the first axis of variation in
political attitudes seems to be this
left-right axis in a pretty traditional
sense now exactly what left-to-right
means will vary in each country but you
always find it um what is the second
axis mean so this is clearly the
direction in which the Liberal Democrats
vote which is some other policy
dimension to figure out what that
dimension is what we would have to do is
we would have to look at what bills are
assigned to that dimension. So what do I mean by that? I'm talking about what are called the loadings of the components, and what you actually get is — I don't remember how to do this now... scikit PCA loadings, also known as factor loadings... ah, pca components, okay, so here we go. Where did it go, where did it go... okay, and we called it model, alright.
Oh my god, numbers, numbers, numbers. Okay, well, let's see what shape this is: 2 by 100. So notice I had 100 votes here — 102 columns, because I also had these two extra columns, but I just used the voting data — so this has as many columns as votes, and then it has two rows because I asked it to build me a two-dimensional projection. So what I end up with is two 100-vectors, and what that is is a coefficient for each bill.
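(In scikit-learn those loadings live in `components_`; a sketch of poking at them — `bill_titles` is a hypothetical list of vote titles, not something from the notebook:)

```python
import numpy as np

print(model.components_.shape)   # (2, 100): one row per principal component

# Bills with the largest absolute weight on the second component
# (the vertical axis in the plot):
loading = model.components_[1]
top_bills = np.argsort(np.abs(loading))[::-1][:10]
for i in top_bills:
    print(round(loading[i], 3), bill_titles[i])   # bill_titles: hypothetical
```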
And think of it this way: the PCA picked some dimension to project, right. So if it was just the horizontal dimension, then my vector would have 0 everywhere and 1 in one place — or the same with the vertical dimension; if it was exactly this diagonal direction, like exactly 45 degrees, then it would have the same value in every component. But instead I'm gonna have some direction which is, I don't know, like this, right — so it's not exactly vertical and it's not exactly horizontal, it's, you know, a direction, it's a vector, and the components tell me the direction of that vector. So now jump up a dimension and imagine this in three dimensions — it's a plane — and in higher dimensions it's actually a hyperplane, and
as you all know a hyper plane is
represented by a vector which is
actually the normal vector of the plane
right. So these numbers, the coefficients here — let's look at the first one for example — each of those numbers tells you how much weight you put on each bill. So if you go and look at this and you take the entries which have the largest value — a lot of these are really small, like point zero three or something, but a few of them are bigger, like point one nine — that tells you that those bills are counted most heavily in that direction, so that's like, you know, the most-left or most-right bill... oh, but I looked at component one, which is the vertical one, right, so that bill
is really characteristic of this
direction and so all of these yellow
dots tended to vote YES on that bill so
we could go through these coefficients
and go back and read the titles and text
of the bill and figure out what it means
to be more vertical, and what we're gonna find is all the bills where the Liberal Democrats voted yes — which we can already tell, because that's what that shape means. Okay,
questions
When you say, like, the direction with the second-most variance... I mean, yeah, kinda, right. So — I gave you this nonsense earlier, right: the first K eigenvectors of the covariance matrix, the ones with the largest eigenvalues. So one algorithm for doing that is: you find the direction that has the largest variance — never mind how, you just find it, that's the largest eigenvector — and then you're like, okay, now let's find the direction which is perpendicular to that which has the largest variance, and you just keep adding perpendicular vectors, taking whichever direction has the most variance each time. So that sounds complicated, but once you compute it, it's done, right, because the data is... yeah, because we have intentionally tried to find the direction which is as different as possible from everything else — that's
what it means to have the direction of
the second-most variance — so it's gonna find the direction that they go. A little bit later we're gonna look at a multi-dimensional scaling plot of this, where we plot things based on distance, and you'll see that actually they overlap, so it's a different view of the data. Yeah — how would you go about explaining it, if you had done this visualization or something, to non-technical people? Yeah, a great question.
So I would say that, you know, each MP is given two scores, each of which is a weighted average of their votes on a certain set of bills: one set of weights includes bills that vary along the traditional left-right axis, and the other set is just those bills where Liberal Democrats tend to vote differently. So that's sort of one way to think about it. I think that's probably still too much, because most people are not going to follow the weighted averages; I think the first-order version of this gets to be something like: they're plotted on two axes — the first axis, the horizontal axis, represents the major left-right divide, and the second axis is the ways in which Liberal Democrats differ from other parties.
It gets really metaphorical. So this is a key theme in this course — remember this — which is the mapping between the words and the mathematics, and this is really one of the reasons this is so much fun and so hard, because we're really straddling both of those worlds, right?
we're trying to find words that not only
are an accurate description of what's
going on in the data but that people
will understand right so it's we have to
map not just the meanings of the words
but the meanings that will be understood
when we publish this publicly to people
with no technical knowledge and I'm not
necessarily recommending that you
publish a PCA plot but I want I'm not
saying you shouldn't but it is a little
hard to describe but I want all of you
to know how to interpret these because
it is perhaps the simplest way to visualize the basic structure of a high-dimensional data set, alright? Thoughts on that? Okay, so that's PCA — oh,
do I have this in two places oh yeah
okay slide issues there is another
approach another fundamental approach to
understanding the structure of high
dimensional data which uses distance
metrics, and a lot of things use distance metrics — for example, all of the filtering algorithm designs that
we're gonna look at in this class are
built out of distance metrics so let's
talk about them formally: a distance metric is a function with these properties — you know, they have names like reflexivity and so forth — and they're meant to model our intuitions about geometric distance. Right, so if I'm looking at the distance between two points: the distance of anything to itself is zero; the distance from green to blue is the same as the distance from blue to green; all of those are obvious. The triangle inequality is, I think, a little less obvious — you're all familiar with the triangle inequality in geometry; it's another way of saying that the shortest path between two points is a straight line. It says if you go from X to Y by way of some third point Z, that can't be shorter than going from X to Y directly. Okay.
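(The properties on the slide aren't reproduced in the transcript, but they're presumably the standard metric axioms — something like:)

$$
d(x, y) \ge 0, \qquad d(x, x) = 0, \qquad d(x, y) = d(y, x), \qquad d(x, z) \le d(x, y) + d(y, z)
$$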
So these actually impose a fairly strict set of constraints on the topology of the underlying space, but there's a bunch of stuff that satisfies them — normal Euclidean distance will satisfy this, but so will things like distance on a graph. Alright, so if I have a set of nodes and I have some edges, and
the edges are weighted you know with
some distance on them then if I define
the distance metric between two points
as the sum of the weights on the edges of the shortest path, then this will satisfy these rules. So it actually applies to a lot of different situations, more than you might think.
and if we have a distance metric then
the space that the objects are in — these X and Y, they live in some abstract geometrical space — is called a metric space. If we have a distance metric and we have N points, we can make a distance matrix. Now this is distinguished from the data matrix — we saw the data matrix for the House of Lords; this is the data matrix, right, we have one row for each object, so that can be whatever size: M objects by N dimensions. The distance matrix is always square and it's always symmetric — why is it symmetric? Yeah, because we said that the distance function is symmetric, right.
So it is a symmetric matrix of M times M entries, and we can always compute this thing if we have a distance metric. It's gonna be big, right — this starts getting scary: if we have a thousand points, now we've got a million entries in the matrix, and if we have a million points... so sometimes you don't want to compute this explicitly; sometimes you want to do it on demand, or make it sparse or something, but in principle we can just store this thing in a variable.
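(For a concrete sense of what that looks like in code, a sketch using scipy — `X` is just a placeholder array of points:)

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.randn(1000, 50)           # 1,000 points in 50 dimensions

# pdist computes only the upper triangle; squareform expands it to the
# full symmetric 1000 x 1000 matrix -- a million entries, as he says.
D = squareform(pdist(X, metric='euclidean'))
print(D.shape)                           # (1000, 1000)
```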
And we can use the distance matrix to create visualizations of the space of objects, which maybe you were getting at earlier with the sort of distance-preserving embeddings. So let's talk about that algorithm — this algorithm is called — oh, we're talking about clustering first, yes: before we talk about
creating visualizations we're going to
talk about clustering algorithms there's
tons and tons of them this is sort of
like pretty pretty standard data science
stuff but if you think about them you'll
notice that they all require a distance
metric. So the fundamental thing you need to do clustering is to answer: how similar are two things, right? So I'm out in the field, I'm exploring my new country, and I notice one four-legged green amphibian and another four-legged green amphibian — are they similar enough that I'm going to say they're the same species, or is one a frog and the other a salamander? So presumably my distance function will tell me that any two frogs are a lower distance to each other than any frog and any salamander. That's the idea here, and if that is the case then we can find the mathematical structure here. How many of you have done k-means? Oh, really? Yeah, right, okay, so we'll do that pretty fast — this is a really classic algorithm, so let's start. Alright — how are we going to pick the initial centroids? So in other
words we're trying to find the center of
clusters to do that we need an initial
guess for where the cluster centers are
Randomly, okay. And then what kind of data would you like — smiley or pimpled smiley? Pimpled smiley? Smiley is really the best one, okay. So here we go — the way the k-means algorithm works is... so the big limitation of k-means is you have to pick K before you start; you have to know the number of clusters before you begin, and there are clustering algorithms which don't require you to know that, but k-means does. So there's a centroid — how many, what should we set K to, how many clusters do you think we have here? Four? Okay, so there's four random centroids, and what
it's done is it has colored the regions
of space where which are closest to each
centroid so for example the the border
between the red and blue is gonna be on
a line which is equidistant from each
right that's that's what defines these
lines so now the k-means algorithm has
two steps the first step is we figure
out for each point what's the closest
centroid so there you go we've colored
them and the next step is we move the
centroid to the centroid of all of the
points that have that color
So you'll notice — centroid is a fancy word for average, it's a geometric concept of an average. So there we go, and then we repeat: we've changed
what points are closest to what you can
see that for example we've got some red
points in the blue region right so we
reassign all the points and then we move
the centroids again and we keep doing
this, and eventually what happens is we reach a step where we don't reassign any points. So I think we're gonna reassign
just a few here and you can see that
sometimes it takes a while to converge
but you can actually prove that this
will converge and actually it stops
there you go
we've reached convergence, okay. So, um, there's a proof that the k-means algorithm will converge — it's not necessarily going to converge fast, but in practice it does pretty well; you can get away with ten iterations or something.
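(Since those two steps are all there is to it, here's a bare-bones numpy sketch of k-means — purely illustrative, not the demo's code, and it doesn't handle edge cases like empty clusters:)

```python
import numpy as np

def kmeans(points, k, iterations=10):
    # Step 0: pick k random points as the initial centroids.
    rng = np.random.default_rng(0)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iterations):
        # Step 1: assign each point to its closest centroid (needs a distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each centroid to the average of its assigned points.
        centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids
```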
What do you think — are these reasonable clusters, have the centroids captured the structure of the data? Yeah, it's kind of a hard thing to say. Clustering makes more sense when you do something like, okay, it had better find these — so let's try it, right,
okay so obviously clustering only really
makes sense if the data really is
clustered real data is kind of somewhere
in the middle we'll look at some real
data in a bit clusters of documents and
you'll see that, you know, there's cluster structure but it's pretty even, you'll see. So this is the k-means algorithm, right, and when you look at this I want you to imagine doing this in some high number of dimensions — we're doing it in two here, but it certainly works in three; in fact it works in any number of dimensions, and all of the proofs go through. And in fact, if you build a clustering algorithm — so think about what you need to implement the k-means algorithm, right. The first step is you would assign each
point to the closest centroid
Well, the centroid is a point, and what does closest mean? It means least distance. So if you have a distance function then you can do that step, and that distance function could be in any number of dimensions. And then the only other thing you need is to be able to compute a centroid — so you need to be able to take a bunch of points and average them somehow. So if you have an averaging step and you have a distance function, you can implement k-means, and
there are other clustering algorithms
which only use a distance function so if
you know how to take a distance function
between two points in your data you know
and there are some pretty obvious distance functions that work in, you know, any number of dimensions, like Euclidean
distance right then you can cluster in
any number of dimensions you don't have
to change your clustering algorithm so
all this stuff works in in high
dimensions. So I'm not gonna run this in a notebook, but you can run k-means or whatever clustering algorithm you'd like — I think I used a clustering algorithm here — and I said: tell me which cluster each MP is in, and this is what you get. Now you get a number for each MP. The reason that I've showed you this is
because I want to point out — and I will point this out over and over again — that data by itself never has meaning; the meaning comes from connecting the data to the world in some way, right. And if you don't believe me, I'm going to take some of my favorite spreadsheets, remove all the column headers, and send them to you. So this is again one of those neglected topics in statistics — the data only has meaning when it's connected to something else. So we're gonna connect it to something:
in this case we're gonna connect it to
party. So now I've added, at the top there, the party of each MP — so you see the party and then immediately below it the cluster — and if you stare at this for a bit (obviously there's better ways of visualizing this), you might be able to convince yourself that it's picked up party information: Labour is usually 2, Liberal Democrats are normally 1, crossbenchers are normally something else, but take a look at the Conservatives — they're usually 1 as well. So it's ended up
putting the Conservatives and the
Liberal Democrats who are in a coalition
in the same cluster and and that is
because what I did when I did this is I
built a distance function that computes
distance based on the number of votes in
common. So I think I'll actually show you this — if you look at this original post... I don't think I have the code... yeah, here we go: this is the distance function I used.
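(The code on screen isn't reproduced in the transcript, but based on how he describes it, the distance function is something like this sketch — the function and variable names are mine:)

```python
import numpy as np

def mp_distance(v1, v2):
    # v1, v2: numpy vote vectors for two MPs; 1 = yes, -1 = no,
    # 0 = abstained / didn't attend (how abstentions were actually
    # handled in the original code is an assumption here).
    both_voted = (v1 != 0) & (v2 != 0)       # bills where both were present
    n_both = both_voted.sum()
    if n_both == 0:
        return 1.0                            # no shared votes: far apart
    n_agree = (v1[both_voted] == v2[both_voted]).sum()
    # fraction of shared votes where they agreed; 1 minus that, so
    # identical voting records get distance zero
    return 1.0 - n_agree / n_both
```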
if you stare at this for a few minutes
maybe you can convince yourself that
this follows the rules of the distance
function. So for example, it's symmetric in v1 and v2 — you can switch them and you get exactly the same thing, because everywhere they appear you can switch them. If they're the same thing — if I pass the same MP for both of them — this is gonna give me zero. The triangle inequality is a little harder to prove, but you should be able to convince
yourself that it works out. And all it does is it basically counts the fraction of votes where they both voted and they both voted the same, right? So I figure out how many bills they both voted on — because not everybody votes on everything — and then I figure out how many they match on within the ones where they were both there, and then I figure out this fraction: the number where they matched divided by the number where they could have matched. And I take one minus that, because this fraction will be one if they voted identically, but the distance needs to be zero if they voted identically. Okay, so that's the distance function I designed to do this, and we're
going to turn this into a visualization
in a minute oh yeah so there you go it
says at the top this may seem a little
arbitrary to you and it is so here's the
thing about distance functions and clusterings: they're not objective — and subjective isn't a bad word; one of the ways to think of subjective is "set by contextual considerations". So
for example I am using domain knowledge
of how Parliament works and what voting
is to decide that this is a useful
distance function now I'm still making a
choice I'm saying well let's look at how
well they agreed where they both voted
rather than looking at you could also
say well let's just look at the number
of votes that match and if one person
was there and the other person wasn't
then let's just say it's a mismatch
right so we there we still have a bunch
of choice here and there isn't going to
be a right clustering or a right set of
categories there are only going to be a
useful set of categories for your
problem. This is my favorite example — this is due to Clay Shirky — at the top we have a piece of the classification system for the Soviet National Library, and at the bottom we
have a piece of the Dewey Decimal System
and you can pretty clearly see the
philosophical influences of each of
these categorization systems right there
both categorizing the same thing — they're both categorizing books — but they have really different ideas of what a category is. Look at that: the 290s are every non-Christian religion, alright. So this was invented by
Dewey in the 19th century and so like
you know in the libraries in American
libraries at the time this was a
reasonable way to do it in fact there is
a justification for this which is that
anybody know the difference between a
typology and the taxonomy so these words
were used in the title of this book
what's the difference between a typology
and a taxonomy anybody know which file
technology is just a series of pipes
what types of things can we have so your
typology would probably have books of
all kinds of different religions a
taxonomy is a typology built up from the
objects you actually have and so this is
why the Dewey Decimal System is so
Christian it's because at the time it
was made that was the set of books in
the library right you it's a completely
reasonable thing to do to build up the
classification system from the examples
that you actually have so going back to
exploring the animals of the new country
you're building a taxonomy so the the
Tree of Life is a taxonomy it is a
classification system designed to be
effective in categorizing the actual
species that were found in the field by naturalists, as opposed to, I guess, hypothetically or based on theoretical considerations. Anyway, you're gonna run into these issues when you're designing distance functions and doing clustering, right — no? No, okay.
We're going to move from clustering to visualization — they're often very closely related, but remember that, technically speaking, a clustering algorithm just produces cluster assignments, right. So the output of a clustering algorithm is this — in fact it's not even that, it's that — which is why we still want visualization, because we would probably take the output of the clustering algorithm and then visualize it, for example by assigning colors on the plot. So just a reminder of that point. So we're going to talk about a different way of visualizing this high dimensional space. So
the PCA projection is a linear
projection that means a couple different
things geometrically that means what we
do is we drop straight lines to a
straight surface right so I'm gonna
repurpose these points as points in a
Euclidean space again and I'm gonna say
well we're going to take a flat
projection plane right so this is some
hyperplane and then we're going to take
the perpendicular distance which is of
course also the shortest distance of
each point to that plane and then we're
going to take the points on the plane
that are the closest distance and that's
our projection all right so that's what
we mean by a linear projection in
geometric terms in linear algebra terms
what we mean is we do a matrix
multiplication and are you all familiar
with the rank of the matrix okay so you
can think of the rank of a matrix as how
many dimensions you have left after you
multiply by that matrix and the fact
that this is a projection means that
that matrix has to be less than full
rank in fact if you're projecting to the
screen that is projecting to two
dimensions that has to be a ring to
matrix it has to draw every other
dimension it now in some basis that
matrix will have zero for every row
except the first two but in general
because this is on a diagonal in
whatever coordinate system the matrix is
still full of numbers but you're going
to find that it it projects everything
just so this is what we mean by a linear
projection the projection we are going
to use is a nonlinear projection it is
still a dimension reducing function so
we still end up on a plane but we're not
doing it by dropping perpendiculars
anymore so in particular a linear
projection preserves certain types of
distances and angles — obviously a linear projection doesn't preserve all distances, but it preserves this distance, right, and it preserves certain types of angular relationships as well. A nonlinear projection doesn't have to preserve anything — think of a fisheye lens, which is a simple nonlinear projection. So
here's what we're gonna do we're going
to build a non linear projection and
instead of starting with the positions
of all of the points and multiplying by
a matrix to flatten them out what we're
going to do is we're going to start with
the distance matrix right so the
distance between every pair of points
and if you think about this if the input
and the output dimensions are the same
and actually I can get all of the
coordinates back if I just know the
distances right so say this is a map of
the say these are the positions of a
bunch of cities and these edges are
labeled with the distance between them
you know so as the crow flies and I
don't give you the positions of the
points but I give you all of the lengths
here — I just give you a list of the distances from every city to every other city — if you think about it for a second (and you can prove this formally if you wish), you can recover the positions of the points just from the distances, up to a scaling, rotation, and translation. Right, so I may not know which way this is rotated, I may not know where it is on the plane, but I know it's gonna have this shape. So a distance-preserving transformation preserves, in particular, the geometric properties we're interested in — it preserves all of the relative lengths and angles if you go from 2-space to 2-space, or 3-space to 3-space. But there's nothing that stops us from going from, like, 200-space to 2-space; we just may not be able to get all of the distances exactly right — you can't
satisfy all of the original distances in
the low dimensional space so this is a
classic algorithm that started in 1952 called multi-dimensional scaling. There's some linear algebra here, but really the only thing I want you to think about is: we start with the distance matrix on the left and we end up with coordinates — so it maps distances to coordinates, and it says, well, we're not gonna get it exactly right. So X are the coordinates and B is this transformed version of the distance matrix (basically B sort of subtracts out the means), and that line six minimizes ‖B − XXᵀ‖. What that's saying is it minimizes the disparity between the distances between the coordinates it's computed — in other words, the distances in the low dimensional space — and the distances in the high dimensional space. So that's what
we're doing. And there's a bunch of flavors of this; so this minimization — see, in line six there you have that double bar, that is a matrix norm, a standard squared norm — and if you sort of grind through the math and work this out, this is what you get: that norm ends up being the squared sum of the differences between the high and low dimensional distances. So the x's here are the computed coordinates — we're trying to figure out where to place these points, that's the output of the visualization — the d's are the distance matrix elements, and what we're doing is we're trying to minimize that squared
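(Written out, the stress function he's describing is presumably something like

$$
\text{stress}(x_1, \ldots, x_n) \;=\; \sum_{i < j} \bigl( d_{ij} - \lVert x_i - x_j \rVert \bigr)^2
$$

where the $d_{ij}$ are the entries of the high-dimensional distance matrix and the $x_i$ are the low-dimensional coordinates being solved for.)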
difference. And how many of you have had some basic physics — anybody know the physics of springs? Yeah, right — so actually the force increases linearly, but the energy it takes to stretch a spring is proportional to the square of the displacement; and springs have a rest length, right, you can squish them and you can pull them, and so that's where the square comes from. And there's a spring constant, but the spring constant drops out here. So that minimization
is actually identical to connecting all
these points with Springs that have rest
length equal to the distance of high
dimensional space and then waiting for
them to find the lowest energy
configuration, which is what they'll do if you just let them settle, right — if you connect a bunch of things with springs and you let it wiggle around until it stops, you're gonna find the lowest energy configuration. So there's actually
a very intuitive physical interpretation
which is that you connect everything
with Springs and wait for it to settle
down. Which means, actually, you can calculate this — and you'll see this in a second, I'll show you a visualization when we talk about the Overview system — you can calculate this just by pretending there's forces on these dots and simulating them at each time step, with the dots moving according to the spring force. And so another way I think about this is that you have a bunch of points with this sort of connectivity between them, this stretchy structure, and I take this high dimensional structure and I flatten it between two planes of glass, and it squishes down and the springs pull it around and it wobbles, very excitedly, and eventually stops — and that's
multi-dimensional scaling. So if I do this to the House of Lords data, this is what I get — again, this is from a blog post that goes into more detail, and you can do this in Python.
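(A sketch of doing that with scikit-learn, assuming `D` is the MP-by-MP distance matrix built with a distance function like the one above:)

```python
from sklearn.manifold import MDS

# dissimilarity='precomputed' tells MDS we're handing it a distance matrix,
# not raw coordinates.
mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
coords = mds.fit_transform(D)    # one 2-D point per MP, roughly distance-preserving
```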
I'll just note that it looks a bit different from the PCA plot, and I think there are a bunch of reasons for that. First of all, what do you see —
what is the relationship between the
Conservatives and the Liberal Democrats
in this plot? The colors are different, of course. Yeah — so they overlap a lot more; the Liberal Democrats in purple don't quite overlap the Conservatives in red, right, they tend to stretch a little bit towards the left, towards Labour in this case, but they're definitely in the same quadrant. And then the other fun thing here is the green dots marked XB — those are the crossbenchers, those are MPs who sit with neither party, that's why they're called crossbenchers — and sure enough they vote all over the place. So this gives a very
different picture based on the distance
in between the voting patterns and what
might be going on here so first of all I
think the main the major differences are
just the fact that rather than trying to
find the direction where the Liberal
Democrats are most different right the
second most variance direction we're
just trying to put everything down and
be distance preserving so you can see
there is clearly an axis where the
Liberal Democrats differ right they sort
of go out this
but that that access has been flattened
down by this projection so that's one
thing that's going on the other thing
that's going on is it could be that the
Liberal Democrats just vote on a
completely different set of bills than
everyone else and we don't see that in
the PCA plot because we take every bill
whereas our distance function looks at
only bills where both people voted right
so we we've actually got a difference in
what we're doing. And in fact, in the blog post — so here's the one we were just looking at — if we change the distance function... right, so I modified it: rather than just the fraction of votes where they overlap, I'm modifying it by this factor, which is, okay, how big is the overlap, and I'm taking a log here just to try to squish the scale of it. You can start to see sort of how — I suppose it's technically correct to say subjective, I think I'd prefer the word contextual here — alright, there may be strongly justifiable choices for why you do these things; in this case, because I don't get an interpretable plot if I don't do this log thing, right. But I'm adding this other factor which is, okay, well, let's push two MPs apart if they didn't vote on the same bills — so if the overlap is zero, well, I guess I don't run this line at all, and if the overlap is small then this will be very close to one, so we'll get a much bigger distance.
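(The exact formula isn't given in the transcript, so treat this as a guess at the shape of the modification he's describing — the agreement fraction scaled down by a log factor so that MPs with little overlap end up close to distance one:)

```python
import numpy as np

def mp_distance_weighted(v1, v2, max_overlap):
    # Hypothetical variant of the distance above; max_overlap would be the
    # largest possible number of shared votes (e.g. the total number of bills).
    both_voted = (v1 != 0) & (v2 != 0)
    n_both = both_voted.sum()
    if n_both == 0:
        return 1.0                       # "I guess I don't run this line at all"
    agreement = (v1[both_voted] == v2[both_voted]).sum() / n_both
    # Log squishes the scale; when the overlap is small this weight is small,
    # so the distance comes out very close to one.
    weight = np.log(1 + n_both) / np.log(1 + max_overlap)
    return 1.0 - agreement * weight
```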
When I do that, this is the pattern that I get: I get a different shape, and you can see that the Liberal Democrats now start to push off up here. Alright, so I think this
axis this vertical axis is kind of like
the people who don't share a lot of
votes with anyone else they sort of end
up up there so I think that's part of
what's going on although you can see
that conservatives end up up there too
so anyway different different
projection methods will emphasize or
allow you to see different aspects of
the structure of this data. And again — asking which one is the right method is kind of like asking what is the right method to analyze data; it's
like well depends on the question you're
asking right there there isn't going to
be a single answer to this question okay
yeah and here I'm sort of saying the
same thing right so there's all of these
free variables there's all of these
places where our domain knowledge our
knowledge of the actual real world
problem comes in. So maybe we shouldn't even be looking at votes — if our question is about how people are working together, maybe we should be looking at committees, maybe we should get the transcripts and do text analysis, I don't know; maybe we should look at voter behavior rather than legislature behavior if we want to
talk about polarization and then also we
have to ask okay but you know what it
what is the rhetorical structure of the
argument that we are going to be making
right what are we saying and different
data analysis methods will support us
saying different types of things so this
is our first mention of robustness in
this course is the word you're gonna
hear me say a lot: if different analysis methods give you different stories, you have a problem — well, it's not necessarily a problem, it's a problem we know how to deal with in journalism. If
different sources give you different
stories you have to reconcile that in
some way right you you have to you know
either somebody's giving you inaccurate
or incomplete information or opinions
really do vary and you have to represent
that in your story so it's the same
thing with data analysis what I guess
what you're hoping for is that different
methods a variety of different methods
including non data methods right
including qualitative methods will give
you
qualitatively the same answer and then
you have a very strong story it's like
if every source is telling that same
thing okay so our next topic today is
text analysis and I'm gonna start by
just showing you some examples of what
people have done with it and then we're
going to develop the tf-idf model and
then we're going to show the development
of this text mining system — which I think, given the time, we'll probably have to do next class. Text analysis could mean all types of things. Here's a story I worked on at ProPublica about a child foster care facility in California called Level 14, which — anyway, long story — had all sorts of problems, but what I had to do was extract incident type from a bunch of documents that looked like this: these were scans, redacted scans that we got, OCR'd with a lot of errors, and I had to basically get date and type. So it's basically just a big
data extraction problem and this is very
common analyzing a document set could
mean a lot of things this was from a
story that came out of Reuters
on basically these Yahoo news groups where people were — how to put it — let's call it unauthorized adoption, right, they're trading children. And they went through — it was a public news group; there are private news groups as well, but to avoid various types of legal complications they went through a public news group — they went through over 5,000 messages, and they had to determine by hand which messages were talking about particular children, and then they broke down the demographics in a bunch of ways. So this is basically a counting analysis, but it's a very complex counting analysis.
Analyzing Twitter data is still obviously popular — you know, Twitter is the E. coli of social network analysis, right, it's the model organism: it's all public so it's really easy to get the data, and everybody knows how to use it, so everything's done on Twitter even when it probably shouldn't be. This is just this piece that looked at the
demographic breakdown of who talked
about particular hash tags and then
sentiment analysis it doesn't really
matter what I say all of my students
want to do sentiment analysis on Twitter
I recommend against it it's a very
appealing notion right the idea that you
can track how the general public feels
about a topic. You can't really do it, though, for a couple of reasons. One is that sentiment analysis is more complicated than it seems — even humans only get about eighty or eighty-five percent accuracy on sentiment analysis, because language is pretty ambiguous; sometimes you can't tell how somebody feels from a short tweet. The other problem is that Twitter is just not even a random sample — I mean, leaving aside the question of the demographics of the people who are on it, tweeting about a topic is intense self-selection, so without really sophisticated modeling of who talks about a topic at all, it's very hard to understand anything. Nonetheless there was a sort of peak — I think it sort of passed about four years ago — of people doing Twitter sentiment analysis. In my mind this is a
much better use of sentiment analysis so
rather than trying to ask how people
feel about an issue what they did is
they looked at how draft versions of inspector general reports differed from the published versions, and the way they did that is they used sentiment analysis to find cases where the language was softened. There's a nice recording from the NICAR conference a few years ago talking about it; I also talked about this case in a paper, which we'll look at a little bit later. And one of the interesting things about that is they had to retrain the sentiment model to work on these reports — this is very common with sentiment analysis; it's also
one of the problems — but generally sentiment analysis is very domain dependent. So in this domain, for example, the word "recommendation", which is normally a positive word — these are audit reports, so "recommendation" is a negative word; if you have a recommendation in an audit report it means they found something bad. So it's kind of interesting — I think they found 400 negative references were removed — and they didn't want to count all those manually, so they built a sentiment analyzer that would work for this domain.
Classification is one of the most common types of text analysis. We saw a very complex manual version of that in the child exchange story; this is a much simpler classification problem. All you want to know is whether, starting with the description that the police recorded about a particular incident, it was an aggravated assault or a regular assault. There's a set of criteria for whether something counts as one or the other, and you can read the incident report and decide. What they found by going through a year of this, about 20,000 reports, is that many assaults were misclassified as less serious than they were. And starting from there, that year of manually read reports generated training data, and they used the training data to train a classifier and ran it over all the rest. We'll look at this in more detail, but document classification is a pretty standard technique in journalism.
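Just to make that workflow concrete, here is a minimal sketch of training a text classifier from manually labeled incident reports. This is illustrative only, not the pipeline the reporters actually built; the example texts, labels, and the choice of model are all my own placeholders.

# A minimal sketch (not the reporters' actual pipeline) of training a text
# classifier from manually labeled incident reports.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: a year's worth of labeled reports would go here.
texts = [
    "suspect struck the victim with a bat causing serious injury",
    "suspect shoved the victim during a verbal argument",
    "victim cut with a knife, transported to hospital",
    "minor scuffle outside a bar, no weapon, no injury",
]
labels = ["aggravated", "simple", "aggravated", "simple"]

model = make_pipeline(
    TfidfVectorizer(),                  # text -> weighted word-count vectors (we get to tf-idf below)
    LogisticRegression(max_iter=1000),  # any standard classifier would do here
)
model.fit(texts, labels)

print(model.predict(["victim beaten with a pipe and hospitalized"]))

The point is only that the manual year of reading is what produces the training set; the classifier itself is off-the-shelf.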
Oh, and here's what it looks like; I love the language that they use here and some of the arguments. Anyway, you know I keep talking about domain knowledge, and this is what I mean: standard algorithms will not work for journalism. The algorithms you learn in textbooks, or that you load from scikit-learn or NLTK, or that you're gonna learn in class, generally won't work. They can often be made to work, but out of the box they're not gonna work, for a
variety of reasons we're also gonna look
at topic models. How many of you have encountered LDA-style topic models? Okay, we're actually gonna get into those next class and I have an assignment on them. This is a popular method to, well, you want to say analyze topics, but really what it does is pick up patterns in word distributions; it is a latent variable model, and we'll get into what those are. There's only one use of it in journalism that I know of, and that was another Reuters story that looked at the topics of a bunch of court cases that were before the Supreme Court, to try to understand which topics were actually accepted by the court. So that's a very
fast run through of some of the things
you can do with text analysis in a
journalistic context there's I mean
there's lots more right and people are
inventing new ones all the time but
really, we'll talk about it all tomorrow; I show you this because I'm trying to motivate you to want to learn the thing that I'm going to try to teach you, which is the document vector space model. So now we're going to jump into a bunch of abstraction, and what I'm going to teach you is a thing called tf-idf, which was used to do the clustering for Message Machine. What it's used for is to generate document vectors, which can then be compared with a distance function to decide whether two documents are, in this case, near each other; you can also ask questions like, do two documents talk about the same topics. This is also foundational search engine technology: every full-text search engine uses something like this.
Maybe a modern search engine will use lots of other parts too, but you can't
really build a search engine without
this stuff. There's a bunch of things that you can use tf-idf for: clustering, as we've seen; you can use it to try to extract keywords; you can use it to try to figure out the topics of things; it's a key part of every recommendation system that works with text, so Google News or whatever; and it's a standard representation for doing document classification, because if you're running a classifier you're running the classifier on vectors, so you have to turn the text into vectors, and search engines are built on it too. So here's the idea: we're trying to
design a representation of a document as
a vector that is as a series of numbers
in particular as a series of numbers of
a certain length; we more or less need all of our document vectors to be the same length, to share the same representation. So the simplest way we can start turning documents into vectors is by counting words. This is an ancient technique; people started doing this with the Bible in the 12th century, and it's called a concordance when you do it to the Bible. You can think of the counts as the numbers and the words as the dimensions, so this document has an 18 on the animal dimension, and the complete set of dimensions is the vocabulary of all of the documents. So this is just
word counts you can also use word
frequency which is just divided by the
total number of words and word frequency
vectors or word count vectors are a
fundamental representation which you can
use in lots of places. They are of course equivalent to a word cloud; here is a word cloud of the same document. This is actually a press release from US Senator Menendez of New Jersey, and you know
the reason people use word clouds is
that it kind of tells you what the
document was about. So what's this document about, just from the word cloud? Yeah, about enforcement of laws around reporting of animal cruelty, and there's this particular guy, Michael somebody, who was involved in dog fighting, and
that's how they made an example of him I
suppose so the fact that word clouds can
tell you interesting things about a
document should convince you that word
frequency can tell you interesting things. Notice that the word "the" doesn't appear here; they've removed it, and that already tells you that we want to do something to the word counts: we don't want the raw counts. So here it is just a
little more formally: we're turning a document into a vector where each dimension is a word. Here are a couple of examples of what this looks like with two sentences, and you get this matrix, which is called the term-document matrix. An individual entry, indexed by term and document, is called the term frequency; in this case it's not actually term frequency, it's term counts. Divide each document's counts by the number of words in that document and you get term frequencies. So this is the TF part of tf-idf.
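Here is a minimal sketch of that counting step, just to make the term-document matrix and the term-frequency normalization concrete; the two example sentences are my own, and I'm using scikit-learn's CountVectorizer, though you could do the counting by hand just as easily.

# A minimal sketch: build a term-document matrix from two sentences,
# then normalize each row by document length to get term frequencies.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog ate my homework"]  # hypothetical examples

vec = CountVectorizer()              # default tokenization: lowercase, split on non-word characters
counts = vec.fit_transform(docs)     # rows = documents, columns = vocabulary terms
arr = counts.toarray()

print(vec.get_feature_names_out())   # the dimensions (words)
print(arr)                           # raw term counts

# term frequency = counts divided by the number of words in each document
tf = arr / arr.sum(axis=1, keepdims=True)
print(tf)

Note that CountVectorizer has already made a tokenization decision for us, which is exactly the issue that comes up a bit further on.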
One thing about this representation is that it is completely insensitive to word ordering. So if you have semantics that depend on the order of words, which is of course very common in language, then you're not going to be able to pick that up; it disappears in this model. This is why this is called the bag-of-words model: you take all the words in the document, you throw them into a bag, and you lose the ordering. This is less terrible than it seems. But before you can get there you have to break the text into words, and that is more complicated than it might
seem. So there's a tokenization assignment up for the next class, and the algorithm I've written there is really straightforward tokenization; it's about the dumbest thing you can do, and it'll work okay. There are lots of cases where
this throws out information that you
want, for example punctuation. If you break words on punctuation, does that break "don't" into "don" and "t"? What about periods between the letters of an abbreviation, is that going to get broken up? And you do have to break on periods, because sentences end with periods. Some more sophisticated tokenizers deal with punctuation, and in Chinese, Korean, and Japanese you don't have spaces between the word equivalents, so tokenization is a completely different process; it actually becomes a probabilistic word-splitting algorithm. Language is wack, is what I'm saying. You will find over and over that language is wack, I promise.
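For reference, here is roughly the level of "dumbest thing that works" tokenizer I mean; the regex and the lowercasing are my own choices, not the assignment's exact specification, and the comment shows the kinds of cases it mangles.

# A deliberately simple tokenizer: lowercase, then split on anything that
# isn't a letter or digit. It mangles "don't", "U.S.", etc. -- exactly the
# kinds of cases discussed above.
import re

def tokenize(text):
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]

print(tokenize("Don't break U.S. law -- report animal cruelty!"))
# ['don', 't', 'break', 'u', 's', 'law', 'report', 'animal', 'cruelty']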
okay so now we've got vectors now we
want to get a distance metric this is
useful for a lot of things: clustering, as we saw, and matching a search query. How a search engine actually works, at the base level, is it calculates the distance between every document and the search query, then it sorts by distance and shows you the closest results. That's it; that's pretty much what a search engine does. Now there are a bunch of other signals; apparently Google now uses thousands of different signals or something, so the links thing, which is PageRank, and so forth, sure, all of that. But the most basic signal is: do the documents contain the word? And the documents don't have to contain the exact word; the question is how much overlap, how much similarity, there is. So this is basic stuff, and the concept of the distance function we'll build is just overlapping terms. So we're
gonna build up something called cosine
similarity which is basically
overlapping terms and document vectors
and we'll see why it's called cosine in
a minute. There are different ways you could do this. You could use the Euclidean distance to compare your document vectors. What I mean by that is: I have a document vector, the dimensions are words, say cat, sat, on, mat, and so on; this is document 1, this is document 2, and I have certain counts here, there's a lot of cats and mats here. The Euclidean distance between these two documents is the square root of the sum of squared differences, so it's (2 - 1) squared plus (1 - 1) squared plus 3 squared and so on, then take the square root. You can do this and it will work, but it has some drawbacks. Basically the difficulty is that we need to normalize: longer documents will end up being farther away from everything because they have more words, and we'll get to normalization. You can do this, and it will work sometimes, but
there's a standard thing that we use
instead. What we do instead of taking the squared differences is we take the dot product; that is to say, we do an element-wise multiplication and take the sum of those element-wise multiplications. One thing to notice about this immediately is that anywhere there's a 0, that term drops out, meaning this only takes into account the words that overlap: the dot product contribution for any word disappears when that word appears in only one of the two documents. And it's also, as I noted on the
slide it's not a distance function
because this actually gets bigger when
the documents are closer so sometimes
this is called similarity and you can
think of similarity as 1 minus distance
One of the difficulties here is that the longer the document, the more words are in it, and these numbers get higher; or the more distinct words are in it, the higher this gets for the same document. So if you think of comparing a document to itself, the longer the document is, the higher this dot product is going to get, and this presents a problem: it screws up the similarity calculation and just gives you clearly the wrong answer in a lot of cases.
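To make the long-document problem concrete, here is a tiny numpy sketch with made-up count vectors: the query contains exactly the same words as document A, but the raw dot product still prefers the much longer document B.

# Sketch of the problem with raw dot products (made-up count vectors).
import numpy as np

query = np.array([1, 1, 0, 0])        # counts over some small vocabulary
doc_a = np.array([1, 1, 0, 0])        # same words as the query
doc_b = np.array([5, 5, 9, 7])        # a much longer document

print(np.linalg.norm(query - doc_a))  # Euclidean distance to A: 0.0
print(np.linalg.norm(query - doc_b))  # B is far away, partly just because it's long

print(query @ doc_a)                  # dot product with A: 2
print(query @ doc_b)                  # dot product with B: 10 -- B "wins" only by being long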
So which document do you think the query should match here? A, right, but it doesn't; it matches B. So what we do is we normalize
the documents and the way we do that is
we make the document vectors have unit
lengths according to the standard l2
norm
Does everybody know what I mean by the L2 norm? I'm trying to calibrate where the math is. There are different ways of measuring the length of a vector. One of the things you can do is just sum the absolute values of all the components; that's called the L1 norm, and you can think of it as taking each component to the first power. Another thing you can do is sum the squares of the components and then take the square root; that's called the L2 norm. You could also take each component to the p-th power and then take the p-th root, and occasionally you'll see the L3 norm or the L-infinity norm, which, if you think about it for a second, just gives you the maximum value.
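Written out, just restating those definitions in LaTeX:

\|x\|_p = \Big(\sum_i |x_i|^p\Big)^{1/p},
\qquad
\|x\|_1 = \sum_i |x_i|,
\qquad
\|x\|_2 = \sqrt{\sum_i x_i^2},
\qquad
\|x\|_\infty = \max_i |x_i|.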
Anyway, it's the standard Euclidean norm that we're using here: we're dividing everything by the square root of the sum of the squares, and we're defining our similarity function, again not a distance function, a similarity function, as the dot product divided by the lengths of the two vectors, which, if you remember your vector geometry, is just the cosine of the angle between them. Conveniently this also
gives us a result between zero and one
So what's a zero? If I run this formula and get a zero, what does that tell you about the documents? Okay, but linguistically? Yeah, no overlapping words. What is a one? Yeah, they're identical. The only way you can get a one is if a dot b is equal to the product of the lengths of a and b, and the only way that can happen, whether you look at the formula for the dot product or think about the geometry of it, is if the vectors are parallel. So
now we've got a function that says zero
if they're completely different and one
if they're completely identical and so
if we try this again with all of this normalization going on, now we match the correct query: we've matched A instead of B. That's basically why we normalize; it just fixes this problem with longer documents.
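Continuing the toy vectors from above, here is a sketch of cosine similarity used as a tiny search ranking. The vectors are made up, and I'm writing the normalization out by hand rather than using a library, to mirror the formula we just walked through.

# Cosine similarity = dot product of L2-normalized vectors.
# Ranking documents by similarity to a query is, at its core, a search engine.
import numpy as np

def cosine_similarity(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

query = np.array([1, 1, 0, 0])
docs = {
    "A": np.array([1, 1, 0, 0]),      # short document, same words as the query
    "B": np.array([5, 5, 9, 7]),      # long document that only partially overlaps
}

for name, vec in docs.items():
    print(name, round(cosine_similarity(query, vec), 3))
# A scores 1.0, B scores about 0.53, so the query now matches A.
# Cosine *distance*, when you need a distance, is just 1 - similarity.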
so here's a picture of what's going on
you can think of each of these document
vectors as vectors in some space. Because all of the counts are positive, they are actually in the positive quadrant in two dimensions, or octant in three dimensions, the one-over-two-to-the-n portion of the high-dimensional space where everything is positive; I'm sure there's a word for that, hyperoctant, there you go, they're in the positive hyperoctant. Then, to match a document against another document, we look at the angle, and they're all normalized, which means you can imagine them as lying on a quarter circle in this picture. So
after we normalize them they're all
going to be on a circle in two dimensions, or a sphere in three dimensions, or a hypersphere in the real world; everything's on the surface of the hypersphere, and now we're taking the angle between two vectors on the surface of a piece of a sphere, the positive section of the sphere. I tell you this because, although you can always grind through the math, the easiest way to think about this stuff is to think about it geometrically.
your geometrical intuition will almost
always give you the right answer so then
to turn similarity into a distance
function all we do is we say 1 minus the
similarity. So we went from something that is 0 when the documents are distinct and 1 when they're equal, to something that is 0 when they are equal and 1 when they are distinct. And again you can grind through this and show that it behaves like a distance: it's symmetric in a and b, and when a and b are identical it gives 0. The triangle inequality is a little harder to show, but it basically follows geometrically, at least for the angle itself: if I have three points on a sphere, then the angle from the first to the third can't be any bigger than the angle from the first to the second plus the angle from the second to the third. And again you can prove all of this if you're concerned about it. So that is cosine similarity, or
cosine distance there are some problems
if we do this on raw counts, and one of
the big problems is that all of the
common words are going to dominate
the result. Actually, sometimes we want this; there is a case where we want this: stylometry. You're all familiar with the Trump insider op-ed, which was, was that just last week? A lot of people have wondered who wrote it. There is a classic statistical technique to figure out who wrote something, and the basis of that technique is to look at the frequencies of common words compared to a corpus,
right so you need a bunch of different
documents by a bunch of different
authors and then you compare the
frequency of common words with the
unknown document to the statistics of
the authors. The reason you use the common words is because the common words are independent of topic; the authors might be writing about different things, each author might have different topics, so if you started using words like, you know, Iraq or something, then sometimes they'll match and sometimes they won't. But if you ask questions like how often do you use the word "which", then you start to get results that carry over between different documents.
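Here is a toy sketch of that idea, assuming you already have a body of known text per candidate author; the function-word list and the plain Euclidean comparison are simplifications of what a serious stylometric analysis would use.

# Toy stylometry sketch: compare an unknown document to each author's
# profile using frequencies of common function words only.
import re
from collections import Counter

FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "which", "but", "with"]

def profile(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

# `author_texts` would map author name -> concatenated known writing (placeholders here).
author_texts = {"author_a": "the cat sat ...", "author_b": "of mice and men ..."}
unknown = "the text of the anonymous op-ed ..."

scores = {name: distance(profile(text), profile(unknown)) for name, text in author_texts.items()}
print(min(scores, key=scores.get))  # the closest author profile, naively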
sometimes this is what you want but
usually not; this is not what you want for a search engine or for document clustering. What you would like is some notion of words that are distinctive, words that appear in some documents but not others,
and so the intuition here is that we're
going to boost words that don't appear
in very many documents. As for that op-ed, I have my own theories: I think that document was a deliberate pastiche. I think either it had multiple authors or it was built in part by cutting and pasting sentences from other places, because they knew that everybody would do stylometry on it. Even so, I'm excited to see the first serious stylometric analysis of it, and maybe any one of you could do it, though it would have to be fast because it's gonna happen this week or next week; I wonder what it'll say. And yeah, lodestar
is interesting
but what about the statistics of all the good old common words, so, what happens if we apply the standard stylometric techniques? I will say that the statistical techniques are very useful and interesting and probably the most justifiable way to come to such conclusions. The Unabomber was caught by doing manual stylometry, lodestar-style stuff: there was a particular phrase he used that appeared in his master's thesis plus his manifesto. Anyway, there's lots of history in stylometry: people have done this for Shakespeare, and for the Federalist Papers there's a very famous statistical analysis that said that Madison wrote the disputed papers. Fun stuff. For our purposes, though, we want to downweight words beyond just the common words: obviously you want to downweight "the", but we may also want to downweight "car" if we have a data set full of car reviews, and not if we have a data set that doesn't mostly talk about cars. So this
tells you that the weighting function has to be a function of the entire document set and not just the individual document, because we want to vary the weights based on other
documents so about the simplest thing we
can do is count document frequency and
so instead of counting how many times a
word appears in the document we count
how many documents a word appears in, and divide by the total number of documents, so we've got this in terms of a frequency, not a count. And I don't know, I guess I wanted to give you a formal definition, but it's not a very complicated idea; I've written it with this set notation here, not that it's particularly clarifying: it's just the fraction of documents that contain that word. And then we say, okay, we want to downweight words that appear in many documents,
so we're gonna flip it: we're gonna put big D, the total number of documents, over the number of documents that contain the term, and we're gonna throw a logarithm on there, because, what the heck. I mean, the actual reason we need the
logarithm there is that we want a function that compresses the range. Say we have a word that appears in five documents and a word that appears in ten documents: five to ten is a pretty big jump, but 90 to 95 out of 100 documents is not a big jump. That tells us that we want more resolution at the lower end of the scale, at smaller numbers of documents; we want one additional document to mean more when there aren't many documents. So we're interested in relative or proportional changes, and applying the logarithm changes the scale so that an equal step in IDF means a multiplicative change in the number of documents the word appears in. But
guess what we're totally making this up
and almost any function that compresses the upper end of the range will work, and by work I mean it gives you the search results that a human would expect; that's ultimately how this stuff was originally evaluated. So
putting it together we now have a
function that tells us what value to
assign to each dimension of a document
vector, where again the dimensions are words, or terms as we like to say, as a function of the term, the document, and also the entire document set. So notice that if I add one document to the set, it will change the tf-idf vector of every document which contains any word that is in the new document; it's a contextual calculation, yes, which is what I'm saying.
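Putting the whole formula into code, here is a minimal sketch under the definitions we just walked through (tf = term count divided by document length, idf = log of the total number of documents over the number of documents containing the term); real packages use one of many variants of this weighting, as noted further on, and the example documents are placeholders.

# Minimal tf-idf sketch: tf = term count / doc length,
# idf = log(N / number of documents containing the term).
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the senator issued a statement on the dog law".split(),
]

N = len(docs)
df = Counter()              # document frequency per term
for doc in docs:
    df.update(set(doc))     # count each term once per document

def tfidf(doc):
    counts = Counter(doc)
    return {term: (count / len(doc)) * math.log(N / df[term])
            for term, count in counts.items()}

print(tfidf(docs[0]))
# "the" appears in every document, so its idf is log(1) = 0 and it contributes
# nothing; and adding a new document to `docs` changes every vector, which is
# the contextual behavior described above.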
Which is what we wanted: we wanted the importance of the word car to depend on whether this is a corpus of car reviews, in which case we don't care, or, you know, the AP news corpus, in which case it should definitely stand out as a story about cars. This has some interesting implications, by the way:
in principle, if we care about this idea, never mind the exact algorithm, the idea that the importance of a word depends on the corpus, then every time somebody adds a new document to the web, the search results for every other search should change on Google, which is
actually true right like I don't know if
it's true on sort of like a local scale
but it is definitely true that as the
web evolves Google's search algorithms
would give different results for other
searches so now we are looking at tf-idf
on the same document and notice that all
the stop words have dropped out here is
the word cloud again right so this is
just straight word frequency and here's
tf-idf, and notice that it has promoted some things: Michael and animal are much bigger. Michael is relatively small here, and it's
relatively bigger here and that's
because the word michael does not
commonly appear in his press releases
alright and if you read the press
release
turns out michael's pretty important to
this press release. These are press releases from Senator Menendez, up to when I scraped them, which was like 2012, so it's like five years of press releases.
all right so Michael appears twice in
this excerpt, which is not a very large number; all the stop words appear more than that. And then animal, or dog fighting, also appears twice. Dog fighting doesn't rank, and that's interesting; I think that's actually because dog fighting only appears in this document and no other, and the standard tf-idf algorithm I wrote throws it out. This is kind of interesting; these are the practicalities of this stuff: you throw out words that don't appear in more than one document. And the reason you do that... yeah, that's a different problem: any word which has multiple meanings is gonna have a hard time here, and actually when we get to LDA we'll
look at one potential solution to that
problem but take a look at this formula
the IDF term: n_t, the number of documents that the term appears in. If that's one, then that term is gonna get a huge boost; we're gonna get the log of the total number of documents, so that term should match perfectly. The problem is that if we're trying to cluster documents, which is what I actually built that algorithm for when it generated that tf-idf data, that term can't match any other document: if dogfighting doesn't appear in any other document, then it can never contribute to the cosine distance to another document. So for a search engine you might not want to throw those words out; for clustering it actually cuts down the data quite a bit, because there are lots of words that only appear in one document.
Anyway, that's tf-idf. This stuff is from the 70s; a guy named Salton did a lot of work on it. This is foundational search engine stuff, and this is how he drew it: he said that, you
know we would like the document vector
space to have these little clusters and
we want the clusters to be documents
that are about the same thing in fact
there's a formalization of this called
the cluster hypothesis. Here are some actual multidimensional scaling plots of two sets of document vectors, the TF vectors and the tf-idf vectors. The colors are manually assigned topics, that's me going through and tagging them, and we'll see the tagging interface a little bit later, I guess next class. If you look at the TF side, what you see are these long streamers; I see one going sort of to the upper left and one going almost straight up, and then a big central blob in the middle. What's happening there is that those linear directions are the projection of the stop words, so one of those directions is essentially a stop-word dimension, and our space is really warped
by just the shape of how common those
words are in English and how common they
are in the press releases right some of
these dimensions are going to be the
like there's going to be the menendez
dimension right because they all have
its name in it there's going to be the
like appeared dimension or I don't know
whatever it is that politicians say most
commonly and then that central blob in
the middle is kind of everything else
and so we don't get very clear separation of the clusters. When we weight the term frequencies by IDF, we cancel out not only the stop words but also the Menendez-style boilerplate, and we get much more distinct clusters, as you can tell from the fact that many of those topic clusters are now much more compact and distinguishable. So that's why we do this.
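For what it's worth, the clustering side of this is easy to sketch once you have tf-idf vectors. This is a generic k-means sketch, not the specific clustering that Overview uses, and the press-release snippets and the number of clusters are placeholders.

# Generic sketch: tf-idf vectors plus k-means clustering.
# (Not the specific algorithm Overview uses; just the general idea.)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "statement on animal cruelty enforcement",
    "senator urges stronger animal cruelty laws",
    "new jersey transit funding announcement",
    "federal funding for new jersey rail projects",
]  # placeholder press-release snippets

X = TfidfVectorizer().fit_transform(docs)          # documents -> tf-idf vectors
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # documents about the same thing should land in the same cluster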
There is an interesting assumption here, and it's an assumption which is empirical in nature. The cluster hypothesis has appeared in various forms since people started doing this type of search engine algorithm design; it dates from the earliest days of tf-idf and even earlier, actually, there are proto versions of this from the fifties. And it
says that this algorithm of turning things into a vector space, this word-counting thing, tends to put semantically similar documents together, or, in this case, they say documents that behave similarly with respect to information needs,
right if you're building a search engine
but we do all kinds of things we're
gonna we're gonna look at document
clustering we're gonna look at like
keyword analysis and all kinds of stuff
and the underlying assumption of all of this stuff is that the clusters in this space mean something. Now, you can't prove this from the mathematics, and in fact the tf-idf formula is not really that important; there are dozens of variants, and one of the references in the syllabus goes through what some of these variants are. Different search engines and different packages will give you different answers; it doesn't really matter, they all kind of do the same thing, they all work on the same intuition. And I want to point this out because in almost all of the work
that we do there is some sort of often
unspoken hypothesis of this type which
connects the semantics and the
mathematical worlds. I talked about the connection between the words and the data; the only way that happens is by introducing hypotheses like this, and to the extent that the hypothesis is correct, the statements that you write about the data have empirical meaning. So there's something going on in the background, and also tf-idf just
works: you can build useful search engines with this, even though there's poor theoretical justification for it; empirically it works pretty well. This idea, not just tf-idf but the idea of encoding documents as word vectors in some fashion, most commonly in a tf-idf-ish type of way, is called the vector space document model, and it is foundational to all types of text analysis, including the text analysis that is built into Overview.