Visualization and Network Analysis
Transcript
Today we're going to look at a combination of two different topics, neither of which is a full class: visualization and network analysis. I'm sure all of you have done some of both, and I'm going to try to show you a journalistic approach to them. In particular, the frame we're going to use is visualization as perception. I like to think of visualization as a sort of hack that interfaces with the human visual system, so I'm going to start from that perspective and build upward from it. Then we're going to do social networks, and I'm going to throw in a certain amount of social network theory, meaning we'll talk about the sociological ideas that are ultimately why we do social network analysis. Finally we'll look at some examples of network analysis actually being done in journalism, including some research I did surveying how it's used.
This is a list of the relationships between various topics in a book called Gödel, Escher, Bach, an early book on cognitive science, artificial intelligence, and art that won a Pulitzer Prize in 1980. If I asked you whether infinity is connected to the halting problem, you would have to sit here and do quite a lot of back-and-forth searching, looking at all the things infinity is connected to and then scanning through the list; it would take you quite a long time to trace those connections. If I showed you the same information like this, then it's trivial to answer that question, or to answer more transitive questions: is infinity connected to Turing? This, I think, is a more interesting fact than it seems. Why should the exact same information, displayed this way, be so much more useful for answering certain types of questions? Anybody have a theory of what's going on here?
All right, so there's the idea of spatial relationships. Anyone else have a thought? Yes, there's something about putting this in visual form, as opposed to just words, that makes a big difference. My friend Tamara Munzner, who wrote a wonderful book on visualization, puts it this way: carefully designed images act as a form of external memory. So here's a very interesting fact. Let's throw some words on the board in a few colors. I want everybody to focus here, literally look at my finger, this is actually important for this to work, and then tell me what this word is. Can you see it while you're staring here? It's too bad you already read it. Let's try it on the slide: look here, what's this word?
All right, so I'm trying to demonstrate two things. One is that the field of view where you have high-resolution perception, the fovea, is actually very narrow, only about two degrees. Your cone of vision is much smaller than it feels, because it feels like you can see everything. Two is that we don't actually have much visual memory. It feels like if I close my eyes I still know where everything in the room is, but in practice we don't keep a memory of the visual sphere around us, because we don't need it: if I want to access the image over there, I can just look. Our eyes swivel, and they do it unconsciously. The brain is not very fast; it's highly parallel, but the clock rate is slow, and recall times tend to be hundreds of milliseconds. If I've got a few hundred milliseconds anyway, I might as well just look. So we don't use memory, we use perception, and therefore if we take information and render it visually, the normal perceptual mechanisms apply to the information we've just rendered. We offload cognition onto the perceptual system: rather than thinking about all of this stuff, I can just look. That's basically why visualization works.
Now we can say a few more things about it. I can demonstrate that your visual system has a bunch of processing occurring in parallel at the preconscious level. Let's try this: which one is different? Okay, no problem there. Which one is different? Okay, easy, and notice you didn't have to scan to do it. Which one is different? Yeah, a little bit harder; you had to search. So that's interesting, isn't it: for some problems you just know, and for some problems you have to do a visual scan. When we get above a certain threshold of complexity in the visual problem, we have to invoke cognitive methods. The point of data visualization is to present information in a way that doesn't require cognition; you want to swap cognition for perception. And we can do experiments like this to find out what perception gets us, and the answer is quite a lot. This effect, where you don't have to search for something, is called pop-out, and a great many things can trigger it. For example, we don't have to think to know which line is longer or which shape is bigger, and you saw orientation-based pop-out in the previous example. It turns out there are quite a large number of visual channels: depending on how you count, a dozen or so different channels will trigger pop-out, and not just pop-out but comparisons.
This has been extensively studied, and here's a list of the visual channels, in fact three different lists. It's the same list, but it turns out that for communicating different types of information, different channels are more effective. For example, color, say hue, is not a particularly good channel for quantitative information. If you're trying to compare which value is bigger, say in a plot of GDP by year, hue is not a great way to make fine comparisons, whereas position, or even area or volume, is pretty good for comparing quantities. On the other hand, for categories, hue is a great way to express categorical information, and so is texture. There's a lot of research on how this works, and I'll try to give you a sense of it very briefly by grabbing one of the links in the syllabus. Here you go:
"39 studies about human perception in 30 minutes," by my former AP colleague Kennedy Elliott, who's now at the Washington Post. She's gone through a bunch of perceptual experiments, and you can see some pictures of how this works. The way you do these experiments is to encode the same data in different channels. In this case you're looking at a ratio of two things, and you say: let's try it as a direction, let's try it as an angle, let's try it as a length. You can see, for example, that it's much easier to read as a length than as an angle; it's much harder to compare those two. How you actually run this stuff is you typically do either reaction-time or forced-choice tests. You give people a button to press, A if the first one is bigger, B if the second one is bigger, and you see how long it takes them, or you require them to respond within a really fast interval, a couple hundred milliseconds or a second, and you look at the error rate. By these types of measures you can find out what's the fastest and most accurate way for people to read a comparison. Cleveland and McGill is the classic paper, the one we were just looking at, and it has been redone with crowdsourced experiments, and it just goes on and on.
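As a concrete illustration of how those measurements get summarized, here's a minimal sketch with made-up trial data (the encodings, response times, and column names are all hypothetical) of computing error rate and mean response time per encoding condition:

```python
import pandas as pd

# Made-up trials from a forced-choice experiment: each row is one response,
# recording which encoding was shown, whether the answer was correct,
# and how long the response took.
trials = pd.DataFrame({
    "encoding": ["position", "position", "angle", "angle", "hue", "hue"],
    "correct":  [True,        True,       True,    False,   False, True],
    "rt_ms":    [420,         390,        610,     700,     820,   760],
})

summary = trials.groupby("encoding").agg(
    error_rate=("correct", lambda c: 1 - c.mean()),
    mean_rt_ms=("rt_ms", "mean"),
)
print(summary)  # faster, more accurate channels (like position) stand out
```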
Here are comparisons of pie charts versus split line charts, here are different types of pie charts, and here's a study comparing trends across multiple groups of bars: if they didn't have these trend lines, how easy would it be to see that category C is decreasing and category D is increasing? It just goes on and on. We know a huge amount about which encodings people generally perceive most accurately, and if you're going to build visualizations, it's really good to familiarize yourself with these types of results.
The other thing you can do is use all of these channels together, so I'm going to play you this video, since it's a really nice example of data visualization. This is the late Hans Rosling:

"Visualization is right at the heart of my own work too. I teach global health, and I know having the data is not enough; I have to show it in ways people both enjoy and understand. Now I'm going to try something I've never done before: animating the data in real space, with a bit of technical assistance from the crew. So here we go. First, an axis for health, life expectancy from 25 years to 75 years, and down here an axis for wealth, income per person: $400, $4,000 and $40,000. So down here is poor and sick, and up here is rich and healthy. Now I'm going to show you the world 200 years ago, in 1810. Here come all the countries: Europe brown, Asia red, the Middle East green, Africa south of the Sahara blue, and the Americas yellow, and the size of the country bubbles shows the size of the population. In 1810 it was pretty crowded down there, wasn't it? All countries were sick and poor; life expectancy was below 40 in all countries, and only the UK and the Netherlands were slightly better off, but not much. And now I start the world. The Industrial Revolution makes countries in Europe and elsewhere move away from the rest, but the colonized countries in Asia and Africa are stuck down there. Eventually the Western countries get healthier and healthier, and now we slow down to show the impact of the First World War and the Spanish flu epidemic. What a catastrophe. Now I speed up through the 1920s and 1930s, and in spite of the Great Depression, Western countries forge on towards greater wealth and health. Japan and some others try to follow, but most countries stay down here. Now, after the tragedies of the Second World War, we stop a bit to look at the world in 1948. 1948 was a great year: the war was over, Sweden topped the medal table at the Winter Olympics, and I was born. But the differences between the countries of the world were wider than ever. The United States was in front, Japan was catching up, Brazil was way behind, Iran was getting a little richer from oil but still had short lives, and the Asian giants, China, India, Pakistan, Bangladesh and Indonesia, were still poor and sick, down here. But look what is about to happen. Here we go again. In my lifetime, former colonies gained independence, and then finally they started to get healthier and healthier and healthier, and in the 1970s countries in Asia and Latin America started to catch up with the Western countries. They became the emerging economies. Some in Africa followed; some Africans were stuck in civil war, and others were hit by HIV. And now we can see the world today, in the most up-to-date statistics. Most people today live in the middle, but there are huge differences between the best and the worst countries, and there are also huge inequalities within countries. These bubbles show country averages, but I can split them. Take big China: I can split it into provinces. There goes Shanghai; it has the same wealth and health as Italy today. And there is the poor inland province of Guizhou, which is like Pakistan, and if I split it further, the rural parts are like Ghana in Africa. And yet, despite the enormous disparities today, we have seen 200 years of remarkable progress. That huge historical gap between the West and the rest is now closing; we have become an entirely new, converging world, and I see a clear trend into the future. With aid, trade, green technology and peace, it's fully possible that everyone can make it to the healthy, wealthy corner. What you have seen in the last few minutes is a story of 200 countries shown over 200 years and beyond, and it involved plotting 120,000 numbers. Pretty neat."
We lost Hans Rosling a couple of years ago, which is too bad; he did a lot of really interesting work trying to get people to understand the trajectory of international development. Anyway, there's an interactive visualization of this that you can play with; this is just a screenshot. What I want to talk about is the visual encoding design of this data. How many visual channels are there here, and what are they? Okay, I think you mean size by that, same thing. And position is actually two positions; we have the x and y axes. So what is each of them encoding? And the color? All right, let's go back to this slide and think about it for a second.
We've got two positions, which are being used for, sure enough, two quantitative variables: life expectancy and GDP per capita. We've got size, which is being used for a quantitative variable as well; you can think of that as area, really. Area isn't quite as good for encoding quantities as position, but they've also chosen a variable that isn't quite as important: population. It's not as important to make fine population comparisons on this chart as it is to make fine life expectancy comparisons. And then they're using hue for a category. So in fact the visualization design corresponds very closely with the experimental results on how people perceive things. I would say there's actually one more channel here, which is time, or motion. The early experiments don't study motion, but there's a lot of research on, for example, how well we can compare different rates of motion, or whether, if I show you an animation that lasts five seconds, you can remember what happened during it. Generally what those studies show is that it's hard to communicate complex information in animation; you only remember very simple things. But in this case all you need to remember is that things went from the bottom left to the upper right. If you need something more complicated, sometimes you'll see trails: if a bubble moved from here to here, it leaves a trail behind it, and that's one way to encode more information.
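To make those encoding choices concrete, here's a minimal sketch in Python with placeholder values (the country names and numbers are invented, not the Gapminder data): position carries the two quantitative variables, area carries population, and hue carries the categorical region.

```python
import matplotlib.pyplot as plt

# Hypothetical values standing in for one year of Gapminder-style data:
# name, income ($/person), life expectancy, population (for bubble area), region
countries = [
    ("Country A", 40_000, 78,   60, "Europe"),
    ("Country B",  4_000, 65, 1200, "Asia"),
    ("Country C",    800, 52,   30, "Africa"),
    ("Country D", 12_000, 72,  200, "Americas"),
]
region_colors = {"Europe": "tab:brown", "Asia": "tab:red",
                 "Africa": "tab:blue", "Americas": "tab:olive"}

fig, ax = plt.subplots()
for name, income, life_exp, pop, region in countries:
    ax.scatter(income, life_exp,              # position: two quantitative variables
               s=pop,                          # area: population (coarser comparisons)
               color=region_colors[region],    # hue: categorical (region)
               alpha=0.6)
    ax.annotate(name, (income, life_exp))

ax.set_xscale("log")                           # Gapminder uses a log income axis
ax.set_xlabel("Income per person ($, log scale)")
ax.set_ylabel("Life expectancy (years)")
plt.show()
```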
So that's an introduction to the perceptual point of view on visualization design. Another way to think about visualization is like this: make the salient features of the data visible without thinking. To do that, you have to understand, as we've seen in these experiments, what it is you can see without thinking about it, and how to map that onto basic structures in the data. This is an incomplete list I put together of patterns in data that can be turned into very simple visual representations. We've seen lots and lots of examples of clusters, and we'll see more in our social network analysis. Clusters are possibly the most basic pattern in data, especially in multivariable or high-dimensional data, and you definitely want a visualization algorithm that preserves clusters. But you can also look at things like the extent of the data, its range, or find outliers, or look at more sophisticated patterns. This is an incomplete list, of course. What other types of patterns in data can be turned into visual structures? Yes, that's a good example; that's what a Venn diagram is. Right, and you get to use motion, so you can turn temporal relationships into spatial relationships, or you can keep them as temporal relationships. This one is kind of interesting too. We're not really exploiting it here, but we've got these little lines: connectivity and paths. The earlier example was an instance of that; graph-theoretic attributes can be turned into visualizations. When you start thinking about it, there's quite a lot of stuff that, with a little imagination, you can turn into visual encodings.
One of the main things we're interested in when we're talking about analysis tasks is interactive visualization. Most of the visualizations I've shown you are static, but if you load up that Gapminder interactive you can play with it a little: you can select countries, change the time, set the color, and so forth. Interactive techniques are an enormous design space. This is a chart from a paper I quite like about visualization design, which tries to map out the space of where you should use interactive techniques versus automation. I also use this diagram when I need to talk to people who are used to automating everything: say, a Google engineer who says everything has to run at scale, so we have to have an algorithm that finds fake news on its own. Well, that only works if the task is very crisp, if you know exactly what it is you're trying to accomplish, and if all of the information is in the computer, as opposed to in the reporter's head from making phone calls, or in free-text interview notes that haven't been entered into the system. Yes, originally this was in the context of visualization, but it's more broadly about when interactive techniques will work versus automated techniques. Computer scientists tend to like automated techniques because they don't have to think about those pesky humans. So: let's take ten thousand PDFs of public records and find the story in them. There have been techniques that try to take plain text and find stories, but it's much more effective to put the human into that process, because the human knows lots of things about what's interesting about those documents. One way to think about that: there's lots of information that is only in the human's head, and the task, what they're trying to do, is not all that clear, because there could be many different types of stories. So for that problem you're somewhere around here in this space, which means that automated techniques are not a good fit. It's just a way to think about how much automation to use versus how much assistance to human effort.
Interactive visualizations are squarely in this middle box. One way to solve the problem of investigating a huge pile of documents is to visualize their contents, as opposed to trying to do NLP and spit out the answer. Most visualizations are interactive; there's no point in making a picture a human isn't going to look at. I now want to talk very briefly about visualization design. You have a visualization class in the CS department or here? Do they talk about this same perceptual stuff? Okay, interesting, so it falls to me to talk about it, I guess. In computer science we tend to talk a lot about these inner two boxes: visualization researchers like publishing papers on encoding techniques or algorithm design, you know, "I'm going to show the probability from my Bayesian inference algorithm using color," whatever it is. But another way to think about design is that you have to start from the outside in, beginning with the domain problem, and that is often a very complicated problem. For example, the problem of investigating a huge pile of documents is not a particularly well-defined domain problem. I've done a bunch of research on what I call the ethnography of data work, that is, what journalists actually do with documents, and you have to answer those questions before you can start talking about how your topic model is going to work, because the topic model is all the way down at this level. The idea is that you work through from the outside in: what are you trying to solve? An example from the investigative journalism context, a not-quite-hypothetical one, is: I'm trying to find all the places where it looks like a politician was taking money for a policy change. That's the domain problem characterization, or one way of stating it. The data and operation abstraction design is: okay, maybe the abstractions I want are people and payments. Once you settle on those abstractions, you can go to the next level, the encoding or interaction techniques, and say something like: I'm going to show the people and payments as a graph, a network; we'll see that later this class. And then the algorithm design is: here's how I'm going to efficiently extract all of the people and payments, build the graph, and lay it out.
But data visualization is more than just plotting data. Let's look at this New York Times piece on home runs. What else is here? Pictures, yes, and some people. What else? Right, a lot of annotation. What else do you notice, and where does your eye go first? The big number. And after the big number, where does it go? This red line, and its comparison to these other lines. So there's a bunch of stuff going on here: there are annotations, and there's visual hierarchy. Why does your eye go to the big number first? Partly its size, yes. What else? Position, size, other elements leading the eye towards it, and the weight of the font: it's a very thick black font, as opposed to this text, which is lighter and smaller and therefore lower in the visual hierarchy. So there's all of this stuff that is not data visualization per se but surrounds it and makes a visualization work. And there are titles, too. There's a very interesting piece of recent research showing that by changing the title on the same chart, you can change what people remember, by changing the framing. You can have a chart that shows a very moderate increase in crime rate over the last few years, and you can title it "crime rate unchanged amid police crackdown," or "crime has increased despite police crackdown," or "crime on a ten-year downward trend." Just by directing attention to a different range on the graph, you can really change what people remember by changing the words you put next to the pictures. So data does not speak for itself. That, I think, is one of the lessons of the data journalism program: it is a narrative medium, and doing a data visualization doesn't free you from getting the narrative right. This is a piece I quite enjoyed when it came out.
Oh dear, was this Flash? I wonder if this will even load in any browser. This is the problem with reusing slides from last year; there's always the possibility that something will break. There we go. It was based on this wind map, which came out a little bit earlier, and a lot of people enjoyed these flowing lines and this sort of animation. What the New York Times did was make this thing, which I now only have a screenshot of, which showed the "wind" blowing to the right on the red dots and to the left on the blue dots, where the length of the vector is the size of the shift. This was the 2012 election, which seems like a very long time ago now. In fact there's a general pattern in midterms: the president's party usually loses seats; that's the most common thing that happens. You're right, though, this isn't the midterms; that was 2014. What I said about midterms is still true, but this was Obama's re-election, where there was a rightward shift but Obama still won. What I want to show you is the process of designing this visualization. There used to be a blog called chartsnthings, Kevin Quealy's blog about creating these visualizations, which unfortunately is now only available through archive.org, another solution to the "my class materials keep disappearing" problem. So it says it's based on the wind map, and you can see the whiteboarding process they went through. It's not just that one map: they then break it down by the shift per state, and by different demographic groups as well, so here are Hispanic voters and young voters and women and so forth. That was the original sketch, and it ended up here.
And this is actually a nice little Tumblr; they had a lot of stuff on there. Let's see if we can get it through the Internet Archive. You can learn how quite a few of these things were made, and it's not so much that the individual articles are super interesting; it's just neat to see the process. I guess it's more work than it's worth right now. Here's another election piece from that blog, where you can show how each state shifted to the left or the right over time. You can see how it started out as just a quick visualization, then some experiments with Sankey diagrams (this is called a Sankey diagram), and then here's the final piece. So when you make one of these fancy visualizations, it's not like you just start typing and it pops out immediately. There's a lot of sketching involved, sketching not just on whiteboards, but going through various iterations of attempts to do it in code.
This is a really nice visualization as well, and it's a good example of narrative data visualization. You basically just keep scrolling down: here's the observed warming pattern, here's the contribution from changes in the Earth's orbit (this is basically a data visualization of a NASA climate model), here are solar variations, here are volcanoes, which actually tend to be cooling because they block light, here's all of that together, here's deforestation, which is also slightly cooling, here's ozone, here are aerosols, which have quite a lot of cooling effect, here are greenhouse gases, and then here's the model versus the observed warming. That is very interesting, but I'm going to make another comment that you as data analysts should be worried about: of course the model matches the observed data. You don't publish a climate model that doesn't match the observed data; in fact, part of how we know the model is working is that it reproduces reality. So all of the problems you have in machine learning, where you don't want to peek at your training data, apply to fitting physical models as well. What's the justification that the model is not overfitting? What types of justification would NASA have for this?
So, it matches historical data, but of course you don't choose models that don't match historical data. What you hope is to test your model against the future, and you can approximate that: you can do cross-validation-style checking, where you fit the model on part of the time sequence and see how well it matches the part it didn't see. The other big answer is robustness checks. If your model has various parameters, you look at how it behaves as you vary those parameters and your assumptions, and that is part of what generates these error bars. This blue shaded region is the 95% confidence interval (yes, it says 95% right there) for this model. Part of where that interval comes from is measurement error on the physical parameters that go into the model, and part of it comes from robustness checks on assumptions: if we don't know exactly how much clouds contribute to global warming, or in this case global cooling, then we make a range of reasonable guesses, and that distribution of uncertainty gets incorporated into the model as model uncertainty.
uncertainty anyway I show this to you as
an example of narrative visualization I
think there's one more and here's
everything together and tada
it's a beautiful fit oh and I think it
just goes back to the top
so this is a lot more than a chart right
this is a pretty sophisticated
interactive presentation one of my
favorite quotes about designing
interactive visualizations is if you're
asking the user to click the payoff
better be huge is the payoff for
clicking big enough here a company out
mostly yeses around the room yep the
direction of the Internet I think it
does work here I'm gonna post it to
slack so you can interesting it I think
it just goes to static pictures on
mobile probably it just tries to avoid
JavaScript yeah and then it talks about
where all of this came from oh and this
is what I was just talking about these
rebus this checks so here we go
it's robustness in several ways right
there are 28 research groups around the
world and they've written 61 climate
models each one is slightly different
but this is just one model but this is
how this science is done actually they
very intentionally have a lot of
different people using a lot of
different models and then they look at
the aggregates of the models which is
not something we normally think about in
terms of models we normally think about
like you know measuring some value 10
times and averaging it to reduce the
noise but it's the same measurement
process here it's actually different
groups with different methods and
there's there's
experimental literature on you know does
this actually work to have different
people doing different estimates and
take the median like why should this
work and the answer is basically that it
works because you know this throws out
the extreme values and you get
straddling where some values are above
the true value and some are below and
they cancel each other out at least
somewhat and you get closer to the true
value anyway
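As a toy illustration of why the median of independent estimates helps, here's a small simulation (all numbers invented): each "group" measures the same true value with its own bias and noise, and as long as the biases straddle the truth, the median across groups tends to land closer than a typical single estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
true_value = 1.0
n_trials, n_groups = 2000, 25

# Each group has its own systematic bias plus random noise (made-up scales).
biases = rng.normal(0.0, 0.3, size=(n_trials, n_groups))
noise = rng.normal(0.0, 0.2, size=(n_trials, n_groups))
estimates = true_value + biases + noise

single_err = np.abs(estimates[:, 0] - true_value).mean()
median_err = np.abs(np.median(estimates, axis=1) - true_value).mean()
print(f"typical single-group error:     {single_err:.3f}")
print(f"typical median-of-groups error: {median_err:.3f}")  # noticeably smaller
```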
Wow, there's a lot of stuff there. And here's the actual interpretation of the confidence interval: as I said, they run a huge range of simulations and then pick the interval that 95% of them lie inside. So aside from the narrative structure of the visualization, which I think is very interesting, this really illustrates how the science translates into the data. You don't just run a model; you run a suite of models, and you have all the same problems you have in machine learning with overfitting and validation and so forth. I guess one of the reasons I show you all of this is that I want to dissuade you of the idea that data visualization is objective. First of all, data isn't objective, and a simple way to see that is that I could mail you all of my favorite spreadsheets without the column headers. The names of the columns provide the crucial link between the world and the spreadsheet, and those aren't objective; those are facts that have to be reported out. Here's a chart that doesn't start at zero. It happens to be a chart from a technical paper comparing different ways of computing document similarity, showing the error relative to what humans think. I found this paper because I was curious: if you do cosine similarity on a set of documents, how closely does it match human ratings of similarity? And the answer is pretty closely, something like 80%; it's a reasonable approximation to how humans think about document similarity.
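For reference, here's a minimal sketch of cosine similarity over bag-of-words vectors; the three example documents are made up, and this is just the mechanics, not the paper's experimental setup.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical documents, just to show the mechanics.
docs = [
    "the council approved the new budget for city parks",
    "city parks budget approved by the council",
    "the team won the championship game last night",
]

vectors = TfidfVectorizer().fit_transform(docs)   # TF-IDF bag-of-words vectors
sims = cosine_similarity(vectors)

print(np.round(sims, 2))
# The first two documents (same topic, different wording) score much higher
# with each other than either does with the third.
```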
Here's another chart that has become a little bit famous. It was presented to Congress, I think three years ago, in a hearing about Planned Parenthood. So what is the narrative in this chart? Right, generally they're saying Planned Parenthood does mostly abortions and not cancer screening. There's a bunch of weird stuff going on here. I would say this isn't actually a data visualization, because first of all it throws out all the intervening years, and it also plots two things on different scales. Do I have the link for this? Wayback Machine to the rescue again. Yes, here we go, this is what I want; I'll update my notes. So here it is being used in Congress, and here's the actual data. If you choose the same scale on the left and the right and you start at zero, it looks quite a bit different, and here's a more complete data set that shows all of the intervening years. So Planned Parenthood mostly doesn't do abortions; mostly they do STI screening and contraception. What is that saying? If you torture the data, it will say whatever you want. If you go through this stuff, you will find arguments about what fair data visualization is, for example heated arguments about whether you should always start the axis at zero or not. Where do you all fall on that: always start the y-axis at zero?
Yeah, I don't know. There are a bunch of things we could talk about, and I don't know that there are any totally general rules for this stuff. But I think you can say that there is as much editorial choice that goes into graphing the data as there is in choosing the data itself. So again, the data do not speak for themselves; you have to make these choices.
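Just to see how much that one axis choice changes the picture, here's a tiny sketch with invented numbers: the same modest increase plotted once with the y-axis starting at zero and once with a cropped axis.

```python
import matplotlib.pyplot as plt

years = [2015, 2016, 2017, 2018, 2019]
rate = [102, 103, 104, 104, 106]            # made-up values with a modest rise

fig, (ax_zero, ax_crop) = plt.subplots(1, 2, figsize=(8, 3))
for ax in (ax_zero, ax_crop):
    ax.plot(years, rate, marker="o")

ax_zero.set_ylim(0, 120)                     # baseline at zero: change looks modest
ax_zero.set_title("y-axis starts at 0")
ax_crop.set_ylim(101, 107)                   # cropped axis: same data looks dramatic
ax_crop.set_title("cropped y-axis")
plt.tight_layout()
plt.show()
```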
Here's another example of that. This was from around the Obamacare debate; it was supposed to be a chart of how all of the pieces of the law work together. And this is the same chart: you can lay out the same information in different ways, and here's a close-up on the version that is a lot clearer. By the way, I'm certainly not saying you shouldn't have a narrative in your data. If you don't have a narrative, why are you showing the data? The data has to have a meaning, and you have to choose the meaning you want to display; that's part of being a journalist. It's just that certain choices can be intentionally misleading, and all of the issues of objectivity and balance come into play: just as it would be dishonest to leave out facts that are relevant to a story, it would be dishonest to leave out data that is relevant to the story.
All right, for the second part of this class we're going to talk about social networks and journalism, and I'm going to try to ground it in at least a little bit of sociological perspective. First, a definition. When I say social network I don't mean Facebook; I mean nodes and edges, a graph of people and the connections between them. I haven't said what type of connections; when we use social media data we're talking about following or friend relationships. These terms aren't completely standard, but often "network analysis" is used to mean there's only one type of relationship, and "link analysis" is used to describe having many different types of relationships, which is very common in big investigative projects. The difference becomes important when you talk about things like centrality algorithms, most of which are derived assuming there's only one type of connection. Link analysis ultimately grew up in law enforcement and intelligence, and journalism is adapting many of those techniques.
The entire reason this is interesting is that people act in groups, or, to put it another way, if I know something about person A, I can probably infer something about person B, who is connected to them through a link. You can imagine all of the different ways that properties are transmitted through links: family ties are very strong, and following relationships on social media are channels through which information flows, or can flow, so you would expect that people who follow the same person are exposed to the same type of information. This applies in basically every sphere, and there's a name for it: homophily, or, as is sometimes said, birds of a feather flock together. People with whom you have a lot of ties are going to be similar to you in some way, and that's basically why we do these analyses: it's an inferential method for transferring knowledge about one person into knowledge about the people related to them. There are basically two ways to do this analysis. By the way, I'm talking about analysis of the structure of the network. When I say social network analysis, I don't mean collecting all of the tweets about the election and doing text mining; I don't consider that social network analysis, because it makes no use of the structure of the network. I'm talking specifically about techniques where the structure of the network is used as data; it's part of the inferential method. The two ways people do this are: visualize it and use human interpretation, or apply an algorithm to compute something. In both cases, as we shall see, the results are highly contextual; in fact this is one of the most contextual types of data analysis. You really can't just read off the answer; you have to understand what it is you're looking at and who these people are.
This is a pretty old idea. These are the earliest social network diagrams I could find; Jacob Moreno called them sociograms. Moreno was a psychologist at Columbia who studied things like dorms, homes where foster-care children lived, fraternities and sororities, and school classes, and he started drawing these pictures. This is a diagram of, I think, a dorm, and he's already distinguishing between one-directional and two-directional links: one-directional is just an arrow, two-directional is a line with a bar across it. In other diagrams (it's actually a beautiful book, full of these hand-drawn pictures) he uses different colored links to indicate repulsion or dislike. So already in the 1930s there was fairly sophisticated, very recognizably modern social network analysis being done by hand; he got the data by going into these places and doing surveys. Just looking at this picture, who do you think is the class president? Anyone want to say the number? This one, right. So that's interesting: we can learn something about the role of these people just by looking at the structure of the network, and that's the type of inference we're talking about. Rather than going straight into the visualization algorithms, I want to take a slight detour to a paper that I quite like, about analyzing social networks from Facebook data.
There is some publicly available, very old Facebook data, anonymized of course; I think it's from 2006 or 2007, and Facebook doesn't release data like this anymore, but there are various ways to get similar data. The purpose of this paper is to show how to go from the image on the left, the hairball, to this image, which shows the actual underlying social communities, and it's based on a bunch of sociological assumptions. What they've done is cut out most of the links to reveal this structure, and the way they do that is by looking at triangles. This is where the word "Simmelian" comes in: Georg Simmel (the paper doesn't give his first name, but I believe it's Georg) was a late 19th, early 20th century sociologist. This was before any data was available, before people were really drawing social network diagrams, and his theory of sociology was built on triangles. He made the following observation: to study sociology you have to study at least three people. One person? Obviously you can't study the social. Two people? When you study just dyads, those people aren't observed; two people can have their own little world, and there's no enforcement of the norms of society. So he thought sociological theory needed to be built around how two people interact when they're being watched, hence triangles. And triangles turn out to be extremely important structures in social network analysis. Graph theory, generally, is concerned with edges, which are a two-person relation; sociology is concerned with triangles. Let me give you an example of that.
I'm going to draw a pretty standard friend graph; let's say these are symmetric relationships, of the "A is friends with B" type. There's a simple predictive rule that has been observed in sociology called triangle closure: if you have an open triangle, like this one, and you watch these graphs over time, it is quite likely that the triangle will close, that is, this edge will get added. Think about the sociological mechanism there: it says that if I have two friends, it's quite likely that my two friends will meet. It's a very simple sociological observation, and it's highly predictive. In fact, open triangles are the basis of most social network person-recommendation algorithms. Facebook says "people you may know"; how does it do that? It looks for the people who would close the maximum number of unclosed triangles. Say this is me, and Facebook wants to recommend people to me: it looks for people whom many of my friends know. Here is someone three of my friends know, so if I knew them it would close a bunch of triangles: this one, this one, and this one, and that's it, because that last pair isn't directly connected. One way to think about this is: to whom are there many second-degree paths? Another way is: whom do many of my friends know? And another way is: how do I close the most triangles? So triangles, a set of three as opposed to an edge, a set of two, are a fundamental way of thinking about social relationships, one that seems to have some sociological reality, both theoretically, as proposed by Simmel a hundred years ago, and in practice.
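Here's a minimal sketch of that friend-of-friend idea in Python with networkx (the graph and names are invented, and this is just the counting logic, not Facebook's actual system): rank non-friends by how many open triangles a new edge would close, i.e. by mutual friends.

```python
import networkx as nx

# A small made-up friendship graph.
G = nx.Graph([("me", "alice"), ("me", "bob"), ("me", "carol"),
              ("alice", "dave"), ("bob", "dave"), ("carol", "dave"),
              ("bob", "erin")])

def recommend(G, person, n=3):
    """Rank non-friends by the number of triangles a new edge would close,
    i.e. by the number of mutual friends."""
    candidates = set(G) - set(G[person]) - {person}
    scored = [(other, len(list(nx.common_neighbors(G, person, other))))
              for other in candidates]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:n]

print(recommend(G, "me"))   # "dave" scores 3: adding that edge closes 3 triangles
```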
This paper talks about different types of triangles. If everybody is friends with everyone (in this case the edges are directed), you get a Simmelian triangle; otherwise you get other types: symmetric, where just two people know each other and the third is excluded, or cases where only one person knows the other. And this pruning algorithm, as shown in this extremely slowly rendering diagram, can take these big hairballs and turn them into something much easier to read. It finds the true strong ties, so it goes from the thing on the left to the thing on the right, and notice that without being told what dormitory people live in, it successfully groups people into dormitories. It figures out who's actually friends with whom and trims out the weaker edges. Here's how the algorithm works.
way that the algorithm works is
it say it's trying to figure out if it
should have an edge between a and B well
what it does is it looks who looks at
all of a top friends and the way it
looks at the top friends it has asks how
many how many triangles does it have in
common so someone is a top friend if you
don't only have a link to it but there's
a mutual friend as well right so it
looks for a triangle
let's call this person and see which
says that not only do I know it but we
have mutual friends and it ranks these
people who are part of triangles it says
every person that I know how many
triangles am i involved in other words
how many mutual friends do I have with
that person and so let's say my close
friends you know all rather A's close
friends and then this is the number of
triangles so see I have say have friends
in common number of friends in common
you know C D you a four you have three F
two and I'm actually going to order
these because that's how the algorithm
works
and then it takes some threshold say it
says it says you know the top 5 so this
is my top 5 best friends as ranked by
number of friends in common on the
assumption that the people for whom I
have most friends in common are the
closest and then it does the same for B
so C 3 H 3 J 2 e 2 G 1 ok and then what
it asks is how many of our closest
friends are in common so notice I start
with binary data your friends or you're
not
this algorithm really works what it
really wants is the strength of your tie
to the friend but it measures the
strength of your tie if you only have
binary data by calculating triangles
right friends in common and then once
you have these types of lists then it
asks how many people appear on both
lists so C appears on both lists G
appears on both lists he appears on both
lists that's it so then because either
there are three people who appear on
both lists it says that the strength of
the relationship between these two is
three all right so it's this sort of
second-order thing first I figure out
who are my close friends and then I ask
how many in my top five list or top
analysts are in common and that gives me
a weight of the strength of the
connection between a and B it says we
have the same close friends if we have
the same close friends greater than some
weight let's say we have some threshold
of three then we keep the edge otherwise
we throw it out so using this triangle
based analysis we can trim this hairball
and get back a much smaller graph which
is reflective of the sociological
reality in this pace everybody's living
in a dorm together and so you can see
the dorm colors there
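Here's a minimal sketch of that pruning idea with networkx (the cutoff values k and min_shared are placeholders, and the paper's actual procedure has more detail, so treat this as a rough illustration): rank each node's neighbors by mutual friends, keep the top k, and keep an edge only when the two endpoints' top lists overlap enough.

```python
import networkx as nx

def top_friends(G, node, k=5):
    """Rank a node's neighbors by mutual friends (shared triangles) and keep
    the top k. The value of k is a hypothetical threshold."""
    scores = {
        nbr: len(set(G[node]) & set(G[nbr]))   # mutual friends = closed triangles
        for nbr in G[node]
    }
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def prune_graph(G, k=5, min_shared=3):
    """Keep an edge only if its endpoints share enough of each other's
    strongest (most-embedded) friends."""
    H = nx.Graph()
    H.add_nodes_from(G)
    for a, b in G.edges():
        shared = top_friends(G, a, k) & top_friends(G, b, k)
        if len(shared) >= min_shared:
            H.add_edge(a, b, weight=len(shared))
    return H

# Usage: H = prune_graph(G) on a dense friendship graph G leaves only the
# strongly embedded ties, which is what reveals the dorm-like clusters.
```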
This is a more complicated example, at a different university, and what the algorithm has found is actually two different types of social groups. On the left (remember, the algorithm doesn't have this information) the nodes are colored by the dorm people live in, and you can see it has found some dorm structure: these people live in one dorm, these in another, and those in another. But then you have these big multicolored clusters. On the right, the same graph is colored by year of graduation, and you can see that some of these clusters are all the freshmen. So it has found sociologically real communities of multiple different types: dorm-based communities and year-based communities, and then there's some structure that's truly mixed. When we get into here and here, these are communities that are neither year-based nor dorm-based but seem to have some sociological reality. I bet those are people who like to go to the same clubs, or are in the same classes, or something; I bet there's some variable they actually have in common that we could find with a little more study. So this triangle idea is very powerful; it seems to capture, both theoretically and practically, something about human social relations that appears in the data. Anyway, it's fun stuff to play around with, and each social network is going to be different: following somebody on Twitter probably means something different from friending them on Facebook, and Facebook has both follows and friends now, so I'm sure the follow networks and the friend networks will have different structures.
Okay. I've shown you a lot of pictures of social networks today, and aside from the one that was drawn by hand, pretty much all of them were drawn by an algorithm called force-directed layout. How many of you have seen this algorithm? It's a very simple idea: we throw the nodes down randomly, and then we say that every edge is a spring that wants to hold its two nodes at a certain distance; if they're closer than that it pushes them apart, and if they're farther it pulls them together. Normally there's also a universal force, often called gravity, although it's actually the opposite: a global repulsive force that pushes everything apart. So you end up going from this to this: if you have this tetrahedral shape, it gets laid out like this. The picture you get does depend on where you start, because the nodes are thrown down randomly, so you may get different shapes depending on the initialization.
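A minimal sketch with networkx, whose spring_layout is a force-directed layout (the graph here is invented): note that changing the random seed changes the picture even though the graph is identical.

```python
import networkx as nx
import matplotlib.pyplot as plt

# A made-up graph with two obvious clusters plus a bridge node.
G = nx.Graph()
G.add_edges_from([(1, 2), (1, 3), (2, 3),          # cluster one
                  (4, 5), (4, 6), (5, 6),          # cluster two
                  (3, 7), (7, 4)])                  # bridge through node 7

# spring_layout: edges act like springs and nodes repel each other.
# The seed fixes the random initial positions; change it and the drawing
# changes, even though the underlying graph does not.
pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, with_labels=True, node_color="lightsteelblue")
plt.show()
```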
You can also use all kinds of different algorithms for laying things out. Here are a bunch of different layouts; for example, this is a common visualization where you put everything in a circle and draw the edges between the points. These are actually all the same graph, and this is from a cute paper that investigates how the layout (technically it's exactly the same data, just drawn differently) influences our perception. They ask people questions like: how many subgroups are there, who are the most important people, and then they ask about bridging roles, people who connect different parts of the graph, which, as we'll see, is important. Let's take a look at these for a second. How many subgroups are there in this diagram? Yeah, it could be three, could be four; same here. Who's the most important person in this diagram, or who are the important people? C? Why C, because of where it sits? Well, look at where C is over here. It turns out that the way you lay this stuff out has a big effect on inference, so that's something to be warned about: as we'll see shortly, the primary analysis method in journalism is just looking at visualizations, but there is this question of whether the visualizations are really showing you reliable results. Nonetheless, force-directed layout is what basically everybody does; it produces nice-looking pictures and solves certain problems pretty well.
Then there's the question of what we can learn from a graph, and there are many questions we can ask of a graph. One of the most common in classical social network analysis is the idea of centrality, which is also about influence or power: we want to know who is the most important person here, who's the boss, who's the person I have to talk to, who really made all of this happen. You can just look at a visualization, as we were doing, or you can compute metrics. Has anyone seen centrality metrics? This is what your homework is going to be about. So how could we compute who the most important people are in a graph; what types of metrics could we use? Degree? Yep, that's called degree centrality. Any other ideas? Something about how a node connects different groups, yes. There are actually a bunch of different ways, and I'm going to show you a few quickly, because you'll run into them, and they each capture a different idea. Degree centrality is just the number of edges; I think of it as modeling a celebrity or a news hub, who has the most followers. Another kind is closeness centrality, which requires computing the average path length to every other node: I compute the shortest path to every node (you've probably all seen shortest-path algorithms, a computer science favorite) and then take the average distance to every other node along those shortest paths. In this case, unsurprisingly, you end up with the node in the middle, just because its average distance is lowest. A real-world application of this idea: if you're getting on the subway and you don't know which end of the platform the exit is at, you should take the middle car, because the average distance is lowest from the middle car. This is a useful model if you're thinking about information flow, or the flow of packages: where should you put your warehouse to serve the whole country? If you're only building one, it should be more or less in the middle of the country; it's the same type of logic. You can imagine these nodes are cities and the edges are driving distances, and you put your warehouse at the node with the shortest average distance. That's what's called closeness.
There's another one, betweenness centrality, where you again think about shortest paths, the set of shortest paths from every node to every other node, which you can compute with Dijkstra's algorithm or an all-pairs shortest-paths algorithm, which is n-cubed. That's one of my favorite algorithms in computer science; I won a programming competition with it one time, so of course I like it. Betweenness models introductions, or transmission of some sort. This is a map of the intermarriage relationships between the ruling families of Florence in Renaissance Italy, and you want to be the Medicis, because if anyone wants to arrange a marriage, the introductions have to go through you; you have the most control over how allegiance by marriage happens. Another example: if you're thinking about networks of imports and exports, and you're levying tariffs on goods that travel through your country, you want to be the country everyone travels through. This has strategic and military applications as well.
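Here's a minimal sketch of computing these metrics with networkx on an invented graph (the networkx functions are the standard ones; the graph itself is just a toy):

```python
import networkx as nx

# A small made-up graph: a hub attached to a chain through a bridge.
G = nx.Graph([("a", "b"), ("a", "c"), ("a", "d"),      # "a" is a local hub
              ("d", "e"), ("e", "f"), ("f", "g")])      # "d" and "e" bridge a chain

print("degree:     ", nx.degree_centrality(G))
print("closeness:  ", nx.closeness_centrality(G))
print("betweenness:", nx.betweenness_centrality(G))
# Each metric ranks the nodes differently, because each captures a different
# notion of "important": most connections, shortest average distance to
# everyone, or lying on the most shortest paths.
```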
This next one is maybe a little less obvious. It's one answer to a problem that all of the metrics I've just shown share. Say we're talking about organized crime: you have the mob boss here, and all of the other members of the mob in various places, but they don't get to talk to the capo; they talk to the secretary, and presumably to each other, and the secretary talks to the capo. All of the centrality algorithms we've looked at so far will say the secretary is the most important person, and who's this guy? That person scores no higher than any of these other people, maybe even lower. Eigenvector centrality says that your importance is not just who you can talk to, but whether you can talk to somebody important. Because the mob boss is close to the secretary, and the secretary has very high centrality, the mob boss will get a very high score. In this small example all these other people will too, but if we back this structure off a few degrees, then this person is very important, these ones are not, and this person is still important because they can talk directly to the secretary. Eigenvector centrality is basically the PageRank idea applied to network analysis. There are various ways of describing it, but one way to compute it is: how likely are you to end up at a given node on a random walk? In fact this is how PageRank was originally described: if I start at a web page and keep clicking through, following random links, a lot of those paths end up at Wikipedia, so Wikipedia is really important. It's the same idea here: if I keep following edges, a lot of walks end up at the mob boss, because a lot of them end up at the secretary, and once you're at the secretary it's easy to get to the mob boss. Does anybody know why this is called eigenvector centrality? What's the relationship to eigenvectors here?
so a little little linear algebra
tutorial you should know this anyway
because this is how PageRank works so
the idea is we are looking for and I'll
use the technical language here the
stationary distribution of a random walk
between pages people which means so I
pick a random node I start following
random edges the question is how often
do I end up at a particular node and the
stationary distribution is say I throw a
hundred people into this graph and they
all start walking around from moment to
moment 70% of them will be at the most
central node and thirty percent or
twenty percent will be here and like
there's some distribution where after I
wait a sufficiently long time and
they're all well mixed after everyone
takes a step that distribution will be
the same does that make sense it's it's
the equilibrium distribution of random
Walker's so the equilibrium distribution
of random clicking around the web will
have some high fraction of people on
Wikipedia just because although many
people live Wikipedia in the same step
the same number of people arrive in the
same step and they'll have a much
smaller fraction of people on the
website for this course because although
some people link into it basically all
of the links go out and more the links
go out than in so this was the idea now
to sort of complete this example let's
say we have a structure that looks like
this okay and we need the idea of a
graph structure represented as a matrix
have you all seen this an adjacency
matrix okay so here's the basic idea we
have some matrix and you can think of
this as the rows are from and the
columns are to a b c d a b c d and you
can always get from a point to itself in
one step so the diagonals are 1 and then
you have a 1 if there's a connection
between two nodes and 0 if there isn't
so in this case i can go from a to b c d
1 1 1 and i can go from b c d a so all
those are ones and the rest are 0 and in
particular if the edges are
bi-directional then the graph is
symmetric if I have directionality which
I do for links right I can go from A to
B but not B to a then this becomes a 0
if I say that you have to move to a
different spot you can't stay on the
same page then the diagonals become 0 ok
so this graph tells me how I can move
around in one step
now fun fact if I multiply this by a
vector that tells me where I am now so
let's say I start on B
and what this does is this actually
picks a column you can see as you
multiply through it picks out a column
of this matrix right so yeah and it says
from B I can't get anywhere because I
made that arrow one-directional but if I start instead at A then it tells
me where I can get to from a so that
would look like this so this is standard
matrix adjacency math I have the
adjacency matrix when I multiply the
adjacency matrix by a vector
representing my current location it tells me where I can get to in one step
I can also think of this as a
distribution like a probability
distribution if I say that you know there's a 50% chance I'm on A and a 50%
chance that I'm on B or D and I take one
step well then what I'm going to get is
half of this and half this that's gonna
give me the probabilities that I end up
in any other point on this matrix so if
I multiply this through I get another
vector so let's call this matrix big A let's call this x so Ax equals where I can get in one step if I take another step well this just gives me a vector of the distribution of where I am A times Ax or A squared x two steps and so forth okay
and in fact you can do a you can use
this fact to do a shortest path
algorithm right you basically just keep
on multiplying a by itself until it
converges now you have to normalize at
each step but eventually what you're gonna find is if everything's reachable then every term will get filled in and you'll end up looking at all of the paths after two steps three steps four steps five steps six steps
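just to make that arithmetic concrete, here is a minimal sketch in Python, the four-node graph and its edge directions are made up for illustration, with rows as "from" and columns as "to"

```python
# a minimal sketch of the adjacency matrix arithmetic, on a made-up 4-node graph
import numpy as np

# rows are "from" and columns are "to"; A[i, j] = 1 means there is an edge from i to j
# (diagonal is 0 here, i.e. you have to move to a different node on every step)
A = np.array([
    [0, 1, 1, 1],   # A -> B, C, D
    [0, 0, 1, 0],   # B -> C only (the one-directional edge)
    [1, 1, 0, 1],   # C -> A, B, D
    [1, 0, 1, 0],   # D -> A, C
], dtype=float)

x = np.array([1.0, 0.0, 0.0, 0.0])   # I start on node A
print(x @ A)                         # where I can get in one step
print(x @ A @ A)                     # A squared x: everywhere reachable in two steps

# normalize each row so a step spreads probability evenly over the outgoing edges,
# then keep multiplying; the distribution settles down to the equilibrium of the walk
P = A / A.sum(axis=1, keepdims=True)
for _ in range(100):
    x = x @ P
print(x)   # the long-run fraction of random walkers sitting on each node
```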
so if I'm talking about an equilibrium distribution remember the
definition I gave earlier if I start
with a with you know people in some
distribution all these nodes and I take
one step where everyone takes a random
edge out I end up with the same
distribution well what does that mean so
well let's say there's some vector v such that A v equals v okay so I take one step starting from some distribution v and I end up at v so this is the equilibrium distribution and that's an eigenvector problem okay you may recognize that as the definition of an eigenvector in fact the general definition is a little different you get a lambda in front of it A v equals lambda v where lambda is a scalar so the matrix multiplies the vector by a scalar but I can always set up the matrix just by scaling it such that lambda equals 1 and that's just normalization of the matrix so you can always get that and now
what I have is I'm looking for a
distribution such that it doesn't change
when everybody takes a step and that is
that's an eigenvector right that is
exactly an eigenvector so that's why
this is called eigenvector centrality so
what I am doing is I am finding an importance or a centrality metric that I can assign to every node one that doesn't change after I take this one step where everybody gives a little bit of importance to everyone
else so in other words instead of
thinking about following a link think
about this in the social network
sense of every step I give a little bit
of my importance to everybody I'm
connected to and what is the equilibrium
distribution for that right what is the
what is the distribution of importance
where I get exactly the same importance
out as I give out at every step so I I
divide my importance and I send it all
equally on each of my edges but if I'm a
very important person well that's
because I'm connected to people who are
sending a lot of importance to me and
that solves the president's adviser
problem right the president doesn't talk
to everybody
directly but the adviser does so when
the adviser sends out their importance
the president gets a huge fraction of it
and the president becomes an important
person as well so that is eigenvector centrality and it's kind of a model for who you know right so maybe I'm not a celebrity so I don't have a lot of connections but I'm the producer for Lady Gaga so I'm an important person too
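to see how the different metrics treat that toy example you can sketch it with networkx, the node names and the exact wiring here are invented, this is not the real graph from the slide

```python
# a rough sketch of the mob example with networkx; the wiring is invented for illustration
import networkx as nx

G = nx.Graph()
G.add_edge("boss", "secretary")            # the boss only ever talks to the secretary
for i in range(1, 5):                      # four lieutenants who all talk to the secretary
    lt = f"lt{i}"
    G.add_edge("secretary", lt)
    crew = [f"s{i}{j}" for j in range(1, 4)]
    for s in crew:                         # each lieutenant runs a small crew
        G.add_edge(lt, s)
    G.add_edges_from((a, b) for a in crew for b in crew if a < b)  # crew members know each other
G.add_edge("s11", "associate")             # a hanger-on with the same degree as the boss

degree = nx.degree_centrality(G)
between = nx.betweenness_centrality(G)
eig = nx.eigenvector_centrality(G, max_iter=1000)

for name, scores in [("degree", degree), ("betweenness", between), ("eigenvector", eig)]:
    top = sorted(scores, key=scores.get, reverse=True)[:3]
    print(name, [(n, round(scores[n], 2)) for n in top])

# degree and betweenness both crown the secretary and treat the boss like any other
# degree-one node; eigenvector centrality should score the boss noticeably above the
# "associate" even though both have exactly one tie, because the boss's one tie is
# to the most central node in the graph
print("boss:", round(eig["boss"], 3), "associate:", round(eig["associate"], 3))
```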
so our next problem is what does this all mean which centrality metrics should we use what do we do with this in journalism
and before we get into that I want to
talk about just ask the question let's
say you walk into a small town and you
are the beat reporter there right the first
thing you have to do is find out who is
influential in that town how do you
actually do it because you're probably
not going to start by analyzing social
network data although maybe but let's
say you don't have social network data
for this town what do you actually do
like let's let's bring it back to the
real world here how do you determine
yeah
yeah that is exactly right so first of
all talking to the local reporter is
always a good idea but also just trying
to figure out who's influential in
society and I want to show you an
interesting document which I stumbled
across a few years ago this is from the
1970s it was a handbook for development workers again it's linked from the syllabus we don't really send development
workers into rural communities in the
same way that we used to but it was
called mapping the community power
structure and look at this research had
shown that successful community social
action efforts depend on the appropriate
involvement of key leaders in the community
so you have to find them and here's how
it suggests you go about it so this is a
wonderful little guide book because it's
a very practical reference to how power
works in the community and it suggests
you find you do this thing called the
reputational technique which is if we go
to page 8
okay it says you find knowledgeable people you find a number of knowledgeables to be interviewed the knowledgeables will be asked who they think the community's power actors are so you go
and ask people who's influential and it
tells you how many people to ask and it
says what you look for is people who
appear on a lot of lists so that's it
right I I think there's a nice resonance
here with the the the sommelier and
backbone method that we looked at
earlier right you you build these lists
and you say who appears on lots of these
lists yeah yeah so why would that be
yeah right you're talking about
sociological and political reality right
I think this method would have worked
anytime in the last hundred thousand
years and will continue to work no
matter what the internet does to us I
think I think you know there are two
sets of unsolvable problems love and
politics
yeah I mean it's no different right so
one of the one of the questions you
should be asking in any interview when
you don't know the space well is who
else should I talk to and eventually you
find everybody gives you the same name
and you talk to that person right so you
can think of this as a kind of degree
centrality right you ask a bunch of
people you can think of sort of randomly
sampling the network and you get lists
of names and then you go talk to the
people who a lot of people pointed to so
it's kind of a centrality algorithm but
I want to I want to emphasize that you
you know we've had a very mathematical
presentation of centrality it's a very
sociological idea it's very concrete and
and really this this concrete idea is
much more important I mean you know
there's more to it but here you go
here's the question you're supposed to
ask I love the like older slang who are
the kingpins who can get things done who
swing a big stick
yeah none of this has changed you might
use different slang but that's about it
okay so that's linked from the syllabus
so it's it's really fun to read and it's
just basic basic field work very similar
to what reporters do which brings us to
the very important issue of how does all of this centrality relate to the real world
you know I wish so this is a graphic
from a Wall Street Journal story on
insider trading yeah unfortunately it's offline now the Journal moved CMSs and lost all their interactives that's twice today we've run into the problem of dying interactives but it's about a
very complicated insider trading case
and it was kind of fun you could step
through it and would highlight different
nodes and tell you what their
involvement was but you can tell just by
looking at this that this is the central
actor in the story they have high degree
centrality they have high betweenness
centrality they're also placed in the
center of the picture now of course they're gonna be central to the story that the journalist is telling the journalist is telling a particular story so they get to choose the people to include and the links that are relevant so remember this is if you just had the Facebook graph that included all these people you may not get this person as central because who knows if they're central in society they're central once you pick a particular subset of people
which is a really important analytical
point in graph analysis that the answers
you get are gonna depend on what you
sample in any case there's this idea
that I've called journalism centrality
we want to know who's important to this
story and centrality metrics may or may
not answer this question in fact
centrality metrics are not used very
often in practice they have been used so
I wrote a paper called network analysis in journalism where I surveyed a bunch
of famous stories and tried to break
down what they did
it's a good question and and this is
another problem with using centrality
algorithms it is you usually already
know who is central to the story it's
it's pretty rare that you don't have
some idea of who the important person is
in the story because if you've been
reporting then it's your beat and so you
know already from doing all of this
previous work but we're gonna put that
centrality aside for just a second your homework assignment is actually
going to be about centrality but I'm
going to talk about something else that
you can do with with networks so so far
we've talked about identifying and
ranking individual people now we're
going to talk about finding communities
and basically we're going to be talking
about finding clusters in the graph now
one of the ways to do this is just
visualize it and just use your eyes but
just like you can look at a
visualization and find centrality or you
can run an algorithm and find centrality
there are community finding algorithms
we're not going to spend too much time
on this but I'm just gonna go quickly
through a classic one so that you have
some idea what these are about
so this is it right this is actually an
old picture of my facebook graph there
used to be a Facebook app that would
draw this for you and it's highlighted
the different clusters for me and
they're actually like you know my San
Francisco friends and my Hongkong
friends and some circus people that I
just hang out with and so forth
you know pretty good job of finding this
stuff yes
yeah put it on slack meanwhile I'm gonna
talk about this one so this is there's
different ways of defining clusters
because there's different ways of
defining links so this one is done with
Amazon from the people who read this
also read that that's what each of these
arrows means and this was for the 2008
election you can see that starting with
a political book on the conservative and
the liberal side you get this these real
clusters right this is sort of the filter bubble echo chamber effect that you can see but
there's many different ways of defining
community so if you're for example if
you're looking at Twitter you can look
at who follows who that's the most
obvious thing but you can also draw an
edge between two people if they both
share a particular link so that's what I sometimes call co-consumption what was your example Oh
connected China yeah I know this yeah
this is fascinating so let's see
in Zurich we're toss how do they
interact with each other
yeah Reuters did this a few years ago
based on a huge amount of research it's
yeah it's called connected China yeah
this one's kind of a fun one as well it
shows for various people the different
stages of their career so the previous
and current premiers for example yeah
it's not selectable and every one of
these is based on a news article I don't
know if they'll actually give us the
links here but how they did this is they
had this whole database of relationships
between people Wow
it's quite a visualization well okay
let's not get into China's government
structure I want to go back to the the
social power one yeah this is the one
that's most clearly a network analysis
and also most clearly not loading let's
try that again
so I wonder if I can get the original
links if I click on this person each of
these links is an article I don't mmm
connections I don't think it'll show me
the references here but maybe here is
where they are you know this is just
writers reporting so they're not making
the database exposed and this is was the
the brainchild of red schwa who is now a
data journalism exec yet writers and
he's been here's a blog he's been
writing about stress
for journalism for a very long time and
this is a sort of Exhibit A he had a
team of reporters go through and just
trace all this stuff out from public
news reports so they built this database
of connections there are different
databases like this we'll look at a few more so for example LittleSis so you've heard of Big Brother so I can look up a particular person so let's search for a person not Trump Andrew Cuomo so
there's the positions that he's had and
he's worked for these people and he has
this linked to Martin Connor oh I just
get him but for each of these things
yeah right
it's true what I want I'm right now
those I want I think this might be that
yeah here we go each one of these is
referenced so here is the article link
from which this one link comes through
so you can build up these networks just
by analyzing public sources and so yeah
these maps that people make right and
each of these it says lobbying and if I
click on the link it gives me the source
for that one piece of data yeah so
LittleSis is pretty cool unfortunately
I don't know how much journalism
actually comes out of this I think it's
more of a research tool you use it all
the time what do you use it for
[Music]
yeah yeah it's nice because it has all
the sources and links to everything yeah
I'm not sure used to be anybody but I
think they have more editorial control
so I want to move on because there's
there's actually a video I want to show
you so let's do community detection and
then we'll go into how this stuff is
actually used for reporting another way
to define communities is who talks to
who this is the Enron email
network map do you guys all know what
this data set is so
yeah half a million emails court-ordered
release after the collapse of Enron
energy in 2002 over accounting scandals
you can get clusters through web link
structure this is one of the early
examples of trying to map the blogosphere
and around it's very cool or you can get
clusters by where people go right you
can see everybody who goes to a certain
type of party because their location
trails all end up at a particular Club
on Friday nights or a particular rave or
something we've talked a lot about
clustering you can use any of these
cluster algorithms by defining
similarity metrics on any of the
attributes we just talked about there is
another one which is pretty standard
it's called modularity you will see this
in network analysis packages and
basically it works like this it says can
we divide a set of nodes into two groups
such that there's way fewer edges that cross that divide than if the edges were random so it takes all the edges imagines they're distributed randomly and says can I find a split such that I have many fewer edges across groups than random and many more edges than random within each group and so I'll just go
through this pretty quickly this is what
this looks like so this is just notation
I take the total number of edges m which just sums the degrees and divides by two and then in a random graph the expectation of edges between two nodes i and j is the product of their degrees divided by twice the number of edges 2m hopefully you can convince yourself that that's right the greater their degrees the greater number of chances they have and then the one over 2m out front just normalizes so now I know how many edges
I would expect randomly now what I do is
I take the adjacency matrix which we
just looked at which tells me how many
edges I actually have
I subtract off how many edges I would
have randomly I multiply by whether
they're in the same group so in other
words I count only the edges between members of the same group and I define this thing Q by summing over all pairs and if Q is greater than zero that means there are more edges within members of the groups than I would expect if the edges were randomly distributed so then the modularity algorithm tries to find g it tries to find an assignment of nodes to groups such that Q is maximized so that's it
that's the entire logic of the algorithm
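to make the Q computation concrete, here is a minimal sketch for a tiny made-up graph, two triangles joined by a single bridging edge, split into the obvious two groups

```python
# a sketch of the modularity Q computation for a small made-up graph
import numpy as np

# adjacency matrix for 6 nodes: {0,1,2} form one triangle, {3,4,5} another,
# with a single edge (2,3) bridging them
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1

k = A.sum(axis=1)          # degrees
m = A.sum() / 2            # number of edges (summing degrees double-counts, so divide by 2)
groups = np.array([0, 0, 0, 1, 1, 1])

# Q = (1/2m) * sum over all pairs of (A_ij - k_i*k_j/2m) counted only when i and j share a group
same = groups[:, None] == groups[None, :]
Q = ((A - np.outer(k, k) / (2 * m)) * same).sum() / (2 * m)
print(Q)   # positive: more within-group edges than a random shuffle of these edges would give
```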
it turns out that you can do eigenvector trickery here as well I'm not going to get into it but
basically you can solve this for a
vector of group assignments it is
possible that there is no split that
gives Q greater than 0 right so there's
no split you can do such that there are
more edges within the groups than across
the groups in which case you say there's only one community
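and as an operation in code it looks something like this, using the karate club example graph that ships with networkx since we don't have the books data here

```python
# a sketch of the modularity operation using networkx's built-in karate club graph
import networkx as nx
from networkx.algorithms import community

G = nx.karate_club_graph()
groups = community.greedy_modularity_communities(G)   # search for a split that maximizes Q
Q = community.modularity(G, groups)                   # Q > 0: denser inside groups than random

print(len(groups), "communities, Q =", round(Q, 2))
for i, g in enumerate(groups):
    print(i, sorted(g))
```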
here's what this looks like so we saw those books earlier
here's what this looks like applied to
that split so the interesting thing here
is that we only use the edge structure
to do this analysis we do not use we
know the political category of the book
that's the shape here right circles are liberal squares are conservative triangles are centrist and it more or
less reproduces the left-right split
just on the graph structure which tells
you again that graph structure
correlates to other things of interest
that's why we study graph structure so
there's a lot of ways of defining
clusters but this is a popular one and
the graph analysis tool you're going to
use for your homework which is Gephi
has this
as an operation and most graph analysis
packages have it as well okay now we get
to how this is actually used in
journalism I wrote a paper on this a
couple years ago what I did is I went
through the IRE tipsheet archives and
found everything that talked about
network analysis and categorized it and
here are some of the things that I found
right so this is a visualization of the
links between people in the seattle art
world and so it's a nice interactive you
can click on each one and see who this person is and the way they got this
graph is they just went and interviewed
people they say who else do you know
and then the reporter went through their interview transcripts and pulled out the edges so the nodes are the people they interviewed and the edges are who they know and then they colored it by are they a gallery owner are they an artist are they a purchaser or a museum curator or whatever and so on so this is
just sort of a basic there's no network
analysis in the algorithmic sense here
it's just a way to draw a picture of the
scene this is another very interesting
case this image never really appeared it
never appeared in the story this was a
story about juvenile car theft and what
they did is they got all the arrest
records for juvenile car theft and the
edges are co-arrests so this edge here
means that person in that person appear
on an arrest record together they were
arrested at the same time and they found
these clusters and then they had about 400 people in this central component so they got this central component and they
have the problem of who to go interview
or investigate further and this is one
of the best uses of centrality measures
that I know they ranked people by
centrality and then went and pulled
additional data for those people so they
bought public records they couldn't do everybody because it's like 20 bucks to pull this file for one person so they wanted to focus on just the most interesting people so they got extra data and interviewed more people who were central so I went
through like this and coded a lot of
stories and I for each story I looked at
first of all how did they use
visualization you know was it just for reporting or did it appear in the story where did they get the data did they have to extract it from documents or did they just download the data did they run an algorithm and did they use a graph database so
here's what we got
visualizations appeared in most stories but in about half of them they were also used for reporting so you've got an unpublished visualization like the one I just showed you actually running an algorithm was pretty rare mostly people are analyzing this through visualization and you know about 2/3 of
the time or so they they can't just
download the graph data they have to
scrape it from documents like we saw in
the interview transcripts or public
records or something
so then the question is why are journalists not using graph analysis
algorithms like centrality and
modularity what's the drawback here
yeah it's difficult to interpret the
results but why is that yeah broadly the
answer is context what the measures mean depends a lot on the details of
the story and by the time you do all of
the reporting to get the details you
kind of know the answer anyway that's
pretty much what's happening the other
important reason is that you get heterogeneous networks right you do this
kind of link analysis you have these
links like you know worked for this
person it went to school with is family
of donated money you can't really run
centrality on that because these these
algorithms all assume that every type of
link is the same so there's there's
difficulty in interpreting this stuff
also you often don't have the whole
network like if you're doing interviews
and building out a network by talking to
people
you can't run an algorithm that tells you somebody is central until you've already figured out that you have to talk to them so that's
why we don't see a lot of algorithmic
graph analysis but we do see a lot of
visualizations one of the uses of
putting things in a graph and using
visual tools is that you can put
different types of information together
so this is a pretty classic one we're
combining three databases who Rick Perry
made calls to so he was running for the
Texas governor who donated to him and
who got political appointments and when
you start to try to merge several
different types of data graph analysis
becomes very interesting and useful and
what I'm going to show you now is a
video it's actually kind of an old video
but it's nice because it's very
self-contained and not too long and it's
about a story on the tissue trade ICIJ did this years ago this is pre Panama
papers but it's a really good example of
how you do graph analysis this is a story
about companies that harvest human skin
and bones and tendons and process them
into medical implants it's a booming
market giving human tissue a second life
you might say what's wrong with that in
fact you might know someone like we do
who's walking around right now with a
corpse tendon that repaired a busted
knee this is a legitimate industry for
the most part these products can heal
even save lives but I'm going to tell
you two stories about how illegal tissue
can feed that legal billion-dollar
market here's the list of characters
you'll be hearing about a Florida
company called RTI and its German subsidiary Tutogen a Ukrainian morgue in Nikolaev a Russian coroner named
Igor Aleshenko and a former dental surgeon from Brooklyn now a convicted
felon named Michael Mastromarino who
stole body parts including from the
corpse of the famous British commentator
Alistair Cooke one last point before we
get into the guts of the story
our team looked at more than 200
companies we talked to everyone from
industry insiders and government
officials to surgeons and convicted
felons we read through thousands of
court documents regulatory reports
corporate records and internal company
memos using a powerful analytical
software called Palantir we analyzed
data on imports infections
and accident reports filed with the Food
& Drug Administration the US agency that
oversees the human tissue trade and we
tried repeatedly to talk to the company
that ended up being the focus of our
investigation but executives wouldn't
talk to us or respond to written
questions so here's our story there's
this company called
RTI biologics which started out as a non
profit division of the University of
Florida since breaking away it has
become one of the world's largest
publicly traded for-profit tissue
processors as we dug into this global
trade we started to see a pattern RTI
and its subsidiary Tutogen have repeatedly
obtained tissue from suppliers that were
later investigated for allegedly
stealing human parts first let's go to
Ukraine this February a few months into
our research authorities raided a morgue
in the southern city of Nikolaev we
talked to investigators there they
suspect body parts were stolen from
cadavers brought in for autopsy they
told us signatures were forged to make
it seem like families had given their
consent the Ukrainian investigators said
tissue might have been taken illegally
from as many as half the bodies that
passed through the morgue during the
raid police seized autopsy reports
written in English lab results
apparently destined for Tutogen and bottles of human tissue labeled Tutogen made in Germany but one thing
really caught their eye an envelope
stuffed with cash labeled n I K one
that's the US government's abbreviation
for the Nikolaev morgue see the morgue
was registered as a tissue bank with the
FDA what's interesting is the phone
number listed for that morgue was
identical to 24 other Ukrainian morgues
also registered in the US and what's
more when we called that number we
reached an automated system for Tutogen which makes medical implants out of human tissue in early 2008 Tutogen merged with RTI remember that's the big American for-profit tissue processor back in the 90s Tutogen got its tissue from suppliers in places like Estonia
Latvia Czech Republic and Hungary where
tissue can be taken without consent
unless a donor opts out while he's still
alive
families complained in some places
police launched investigations but they
didn't go anywhere we got a hold of
internal records that show Tutogen has
been getting tissue from Ukraine for
more than a decade and supplying RTI the
company appears to work through a
middleman there a Russian coroner named
Igor Aleshenko until the raid in February Aleshenko was director of Bioimplant a company
owned by the Ukrainian Ministry of
Health which collected the tissue from
regional morgues to send to Tutogen internal records show Tutogen had real problems doing business with Aleshenko in a 2002 memo marked strictly confidential with four exclamation points Tutogen executives urged an
exit strategy from Ukraine they wrote
that a middleman believed to be Aleshenko kept demanding more money and they didn't know what happened to the money they sent him to pay the morgues despite these misgivings Tutogen didn't
pull out of Ukraine instead the company
expanded its regional network over the
years Ukrainian authorities have
investigated two other morgues that supply Tutogen for allegedly stealing tissue bullying families and forging
consents no one has ever been convicted
as for Aleshenko local news reports say
he left Ukraine following the February
raid neither police nor health officials
will tell us where he went so where are
we today
RTI doesn't import its Ukrainian tissue
through Germany anymore the Ukrainian
tissue bank Bioimplant is exporting
directly to the United States but
foreign tissue is still a small part of
RTI supply chain like other big players
in the industry
RTI actually gets most of its tissue in
the US but US law hasn't kept up with
this rapidly evolving industry and that
brings us to our second story here at
home RTI operates a nonprofit called
RTI donor services which directly
recovers tissue from American cadavers
RTI has also contracted with tissue
banks in 23 states one of those
suppliers was New Jersey based
biomedical tissue services it was run by
a former dental surgeon from Brooklyn
Michael Mastromarino RTI started working
with Mastromarino in 2002 but got
nervous when staff complained that he
was verbally abusive and had ties to
organized crime
so RTI hired a law firm to run a
background check and here's what the
firm advised the good doctor has been on
Santa's naughty list for quite some time
I would strongly encourage you not to do
business with someone
that has this kind of resume instead RTI
drew up a new contract with biomedical
tissue services in place of
mastromarino's name was that of his
newly licensed off-site medical director
RTI continued to work with the company
and Mastromarino as their main contact
until 2005 that's when the company found
out what he had really been up to for
the last three years from Funeral Homes
in the Northeast Mastromarino stole body
parts some infected with cancer
hepatitis or HIV from more than 1,000
corpses one source of tissue was the
body of Alistair Cooke that famous British broadcaster and host of
masterpiece theater
he died at 95 of lung cancer but his
death certificate was altered and then
his cancer ridden tissue was sold for
$11,000 there were massive recalls five
companies pulled back a total of 25,000
products made from the human tissue
Mastromarino supplied and police got on
his case he's now serving time at a
maximum-security prison after pleading
guilty to conspiracy and body theft and
the families of his victims filed a
lawsuit against RTI alleging negligence
that case goes to trial this October in
New York when we started looking through
court files and started talking to
people we realized that neither
companies nor the FDA had ever tried to
verify whether consent was actually
given no one cross-checked the files
until after the industry was alerted to
a pending criminal investigation so
what's the bottom line here why should
we care well you should care because
regulations are ineffective given the
enormous profits that can be made in the
industry it's shockingly easy to get
into just fill out a one-page form on
the FDA website and you're in business
according to those who have done it you
can harvest tissue yourself you can
distribute it on the open market you
don't even have to be inspected in fact
according to our analysis of FDA data
only about 40% of tissue banks in
operation today
show any record of being inspected by
the agency so why do we need this fancy
software to figure out the story well
you can see how complicated all the
networks can be these companies and
their operations stitch an intricate
network across the globe but those
connections are buried in paperwork and
datasets ultimately we uploaded more
than 1 million companies people documents
and events into the system so to sum up
the human tissue trade is a perfect
example of demand outstripping
regulation it's not surprising that some
unscrupulous characters sensing massive
profits may have latched on to an
otherwise legitimate industry and that's
a fact the FDA and Congress have
overlooked
so there's an older example a more
recent example is the Panama papers
there's lots of material online that talks about how graph analysis was
used they put everything into a neo4j
database and then visualized it with Linkurious which is this tool one of the
things that is interesting about this is
that there are a lot of specific
challenges right so here's the idea you throw everything into a graph database and then you can because a
graph is a very flexible representation
you can represent almost anything you
want so anything in a relational
database you could put into a graph and
then you can merge them and find paths
and do these visualizations the there's
a number of issues one of which is that
the structured data in the Panama papers
was only a small part of it well a big
part but the minority part so documents
emails and PDFs or a much bigger part so
you have to figure out how to turn that
stuff into graphs perhaps the most
obvious way to turn it into graphs is to
do entity recognition on the documents
however entity recognition
commercial entity recognition has a
recall of only about 70% because it's tuned to favor precision over recall right it doesn't want false positives
which means that commercial tools are
not going to pull out all of the
entities and so journalists have had to
do a lot of work to tune entity
recognition for the investigative
workflow because you would rather have a
bunch of leads that don't pan out than to miss something
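as a sketch of that first documents-to-graph step, with invented sentences, this assumes spaCy and its small English model are installed, and an off-the-shelf model like this will miss a real share of the names, which is exactly the recall problem

```python
# a sketch of turning documents into a graph: run entity recognition over each document
# and connect entities that are mentioned together (assumes spaCy + en_core_web_sm)
from itertools import combinations
import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")
docs = [
    "Acme Holdings wired funds to Jane Doe in Geneva.",
    "Jane Doe is listed as a director of Acme Holdings.",
]

G = nx.Graph()
for text in docs:
    ents = {ent.text for ent in nlp(text).ents if ent.label_ in {"PERSON", "ORG", "GPE"}}
    for a, b in combinations(sorted(ents), 2):
        G.add_edge(a, b)          # a co-mention edge is a lead, not a confirmed relationship

print(list(G.edges()))
```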
the other big problem is record linkage so if you have n
datasets you might have n copies of the
same entity in this case a company and
here you can tell they're the same right
like look at the address it's the same thing they're just written differently
you can often have the computer find
them when you do have a computer find
records to link you don't want them to
actually collapse it down to one node
because it could be wrong so instead the
preferred technique is a soft record
linkage which is you add another edge so
in this case the the relation is called
has similar name and address as so you
throw all these data sets into the graph
you apply some similarity algorithm and
here machine learning actually works
quite well to try to figure out which
are the same entities but you don't
merge edges or merge nodes you add a
notation that says we think they're the
same because then later a human is going
to have to check it the reason for that
is you can't ever accuse somebody of
wrongdoing based on the output of a
model anytime you're going to make a
public claim of wrongdoing
you have to check every step of that by
hand otherwise you are at risk of libel
and in particular negligence for libel
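here is a minimal sketch of that soft record-linkage move, the company records and the 0.85 similarity cutoff are invented, a real pipeline would use trained similarity models, but the key idea is the same, annotate rather than merge

```python
# a sketch of soft record linkage: flag likely duplicates with an edge for human review
from difflib import SequenceMatcher
from itertools import combinations
import networkx as nx

records = {
    "r1": "acme offshore holdings ltd, 54 harbour road, tortola",
    "r2": "ACME OFFSHORE HOLDINGS LIMITED, 54 Harbour Road, Tortola",
    "r3": "blue lagoon trading s.a., 12 calle sur, panama city",
}

G = nx.Graph()
G.add_nodes_from(records)

for a, b in combinations(records, 2):
    score = SequenceMatcher(None, records[a].lower(), records[b].lower()).ratio()
    if score > 0.85:
        # don't collapse the two nodes into one; add a relation a reporter can verify
        G.add_edge(a, b, relation="has similar name and address as", score=round(score, 2))

print(list(G.edges(data=True)))
```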
the system that I would like to see for
doing this is up in sort of prototype form at a number of organizations including ICIJ and OCCRP I'm gonna
show you in a minute a slide of what I
wish we had that tries to solve some of
these problems right it's not as simple as just throwing everything into a database and then running a graph algorithm for a variety of reasons record
linkage is a major problem another one
is that the graphs that reporters use
are not data visualizations so here's
part of an investigation into Trump's
business that we did here at this school
a couple years ago this is a hand built
map it's done in a program called C map
journalists have been doing this for
decades it's not a data visualization
these things are in a database but we're
not just plotting the result of a query
we're choosing which things are
important and we're setting a layout
based on well obviously we're focused on
Bayrock which a lot of journalists have
been looking into
here's that that Wall Street Journal
story again and you you can actually
step through this it this is this is all
hand built what I'd like to see is a
system that is based on the graph
database so you know if you run some
query you get all of this stuff but lets
you pick which ones you want right I want it to allow you to build these and they're not data visualizations which is why I'm calling them maps to build maps by searching in the database and adding one node at a time which is kind of what Linkurious does this is the Linkurious interface on the right and
so when you see these pictures of with
the Panama papers that's how they were
built they were hand built based on
expanding out nodes and people of
interest the system that I ultimately
like to see is this you combine both
structured and unstructured data which
goes through entity recognition to build
the huge graph database including
provenance information and multiple
copies when you have a thing in multiple
sources then you do record linkage to
generate these individual Maps which are
specific to the story you're
investigating right so the key thing
here is this these maps are not data
visualizations they are hand built with
computer assistance for record linkage
and in particular if you have if you
learn something interesting you should
be able to just use this interface and
click on it and say oh I want to add a
new node that then goes back into the
central data store so people are
starting to build systems like this I
know of at least three efforts to build
something which looks like this
so this is really the future of high-end
cross-border investigative journalism is
these types of pipelines and graph data
stores okay your homework will be up
shortly you are going to run multiple
centrality metrics on the
Les Misérables data set so how many of you know the story of Les Misérables or have seen the musical or the movie yeah if you haven't I'm afraid you'll just have to Wikipedia it for the plot so it's a graph which
has all the characters and how many
chapters they co-occur in you're gonna
load that up into Gephi run different centrality algorithms what I'm looking for is an analysis of how well the different centrality algorithms capture the plot of the story
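if you want to poke at the same data in code before you open Gephi, something like the following works, recent networkx versions ship a copy of the co-occurrence graph, and if yours doesn't you can load the file from the assignment instead

```python
# a quick preview of the homework data in networkx (assumes a recent networkx with the
# built-in Les Misérables co-occurrence graph; otherwise read the assignment's graph file)
import networkx as nx

G = nx.les_miserables_graph()   # characters, weighted by chapter co-occurrence

for name, scores in [
    ("degree", nx.degree_centrality(G)),
    ("betweenness", nx.betweenness_centrality(G)),
    ("eigenvector", nx.eigenvector_centrality(G, max_iter=1000)),
]:
    top = sorted(scores, key=scores.get, reverse=True)[:5]
    print(name, top)   # compare these rankings against who actually drives the plot
```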
right so again the fundamental questions of this course
are always about the relationship
between the mathematics and the world so
this is exactly that problem tell me
what the centrality metrics tell you and
whether it matches the story all right I
will see you all next week
very fast it's highly parallel but the
the clock rate is not very fast recall
times tends to be in hundreds of
milliseconds well if I've got a few
hundred milliseconds I might as well
just look okay so we don't use memory we
use perception and therefore if we take
information and render it visually then
the normal perceptual mechanisms apply
to the information we've just rendered
so offload cognitions of perceptual
system so rather than thinking
about all of this stuff I can just look
and this is basically why visualization
works now we can say a few more things
about it we can say well I can
demonstrate to you that your visual
system has a bunch of processing
occurring in parallel at the
preconscious level so let's try this
which one is different okay no problem
there which one is different okay easy
you'd notice you didn't have to scan to
do it which one is different okay yeah a
little bit harder you had to search so
that's interesting isn't it for some
problems you just know and for some
problems you have to do a visual scan so
when we get above a certain threshold of
complexity of the visual problem we have
to invoke cognitive methods the point of
data visualization is to present
information in a way that doesn't
require cognition okay you want to swap
cognition for perception and so we can
do experiments like this to find out
what perception gets us and the answer
is quite a lot so this idea where you
don't have to search for something
that's called pop out and a great many
things can trigger pop out so for
example we don't have to think to know
which line is longer or which is bigger
or you saw orientation based pop out and
the previous example and it turns out
there are quite a large number of visual
channels so depending how you count
there's you know a dozen plus or minus
different channels which will trigger
pop out and not just pop out but
comparisons
so this has been extensively studied and
here's a list of the visual channels in
fact three different lists it's the same
list but it turns out that for
communicating different types of
information different types of channels
are more effective so for example color
so let's say hue here is not a
particularly good channel for quantitive
information if you're trying to compare
which value is bigger like so you're
trying to do a plot of you know GDP by
ear color is not a great way to make
fine comparisons where as position or
even area or our volume is pretty good
for comparing quantities another hand
for categories hue is a great way to
express categorical information or
texture so there's there's a lot of
research on how this works and I'll just
I'll try to sort of give you a sense of
this very briefly by grabbing one of the
links in the syllabus here you go
39 studies about perception in 30
minutes by my former AP colleague
Kennedy Elliott who's now at the
Washington Post and what she's gone
through here she's gone through a bunch
of studies about just perceptual
experiments so here's the idea you can
see some pictures of how this works the
way you do these experiments is you try
to encode the same data in different
channels so in this case you're looking
at a ratio of two things and you're
saying let's try it as a direction let's
try it as an angle let's try it as an
arrow
and so for example you can see that it's
much easier to read as a length than an
angle right it's much harder to compare
those two and how you actually do this
stuff is you typically do either
reaction time or forced choice tests so
you you know give them a button press a
if the first ones bigger be if the
second ones bigger you see how long it
takes them or you require them to press
it in a really fast interval a couple
hundred milliseconds one second or
something and you look at the error rate
so by these types of measures you can
measure what's the fastest way for
people to do this so Cleveland and
McGill that's just the this paper we
were just looking at it's the classic
one this has been redone with
crowd-sourced stuff and it just goes on
and on and on
here's comparisons of pie charts versus
split line charts here's different types
of pie charts what else do we have here
here's looking at comparing trends
across multiple groups bars so if you
they didn't have these trend lines how
easy is it to see that category C is
decreasing and category D is increasing
it just goes on and on and on so we know
here's experiments with three
visualizations we know a huge amount
about how people generally perceive this
stuff most accurately and if you're
going to build visualizations it's
really good to familiarize yourself with
these types of results
the other thing that you can do is use
all of these channels together so I'm
going to play you this video since it's
actually a really nice example of data
visualization so let's see the the late
hands rustling visualization is right at
the heart of my own work - I teach
global health and I know having the data
is not enough I have to show it in ways
people both enjoy and understand now I'm
going to try something I've never done
before
animating the data in real space with a
bit of technical assistance from the
crew so here we go first an axis for
health life expectancy from 25 years to
75 years and down here an axis for
wealth income per person four hundred
four thousand and $40,000 so down here
is full and sick and up here is rich and
healthy now I'm going to show you the
world 200 years ago in 1810 here come
all the countries Europe brown Asia red
Middle East Queens Africa south of
Sahara blue and the Americas yellow and
the size of the country bubbles show the
size of the population and in 1810 it
was pretty crowded down there wasn't it
all countries were sick and poor life
expectancy were below 40 in own
countries and only the UK and the
Netherlands were slightly better off but
not much and now I start the world
the Industrial Revolution makes
countries in Europe and elsewhere move
away from the rest
but the colonized countries in Asia and
Africa they are stuck down there and
eventually the Western countries get
healthier and healthier and now we slow
down to show the impact of the First
World War and the Spanish flu epidemic
what a catastrophe and now I speed up
through the 1920s and the 1930s and in
spite of the Great Depression Western
countries forge on towards greater
wealth and health Japan and some others
try to follow but most countries stay
down here now after the tragedies of the
Second World War we stopped a bit to
look at the world in 1948 1948 was a
great year the war was over Sweden
topped the medal table at the Winter
Olympics and I was born but the
differences between the countries of the
world was wider than ever the United
States was in the front Japan was
catching up Brazil was way behind Iran
was getting a little richer from oil but
still had short lives and the Asian
giants China India Pakistan Bangladesh
and Indonesia
they were still poor and sick down here
but look what is about to happen here we
go again in my lifetime former colonies
gained independence and then finally
they started to get healthier and
healthier and healthier and in the 1970s
then countries in Asia and Latin America
started to catch up with the Western
countries they became the emerging
economies some in Africa follows some
Africans were stuck in civil war and
others hit by HIV and now we can see the
world today in the most up-to-date
statistics most people today live in the
middle but there are huge difference at
the same time between the best of
countries and the worst of countries and
there are also huge inequalities within
countries these bubbles show country
averages but I can split them
big China I can split it into provinces
there goes Shanghai it has the same
wealth and health as Italy today and
there is the poor in line problems why
Shou it is like Pakistan and if I split
it further the rural parts are like
Ghana in Africa and yet despite the
enormous disparities today we have seen
200 years of remarkable progress that
huge historical gap between the west and
the rest is now closing we have become
an entirely new converging world and I
see a clear trend into the future with
aid to trade green technology and peace
it's fully possible that everyone can
make it to the healthy wealthy corner
[Music]
well what you have seen in the last few
minutes is a story of two hundred
countries shown over two hundred years
and Beyond it involves plotting a
120,000 numbers pretty neat
ah hands after the elections yeah we
lost house rustling a couple years ago
that's too bad he did a lot of really
interesting work trying to get people to
understand the trajectory of
international development anyway so
there's an there's an interactive
visualization of this that you can play
with this is just a screenshot what I
want to talk about is the visual
encoding design of this data so how many
visual channels are there here and what
are they yeah okay so what's it I think
you mean that by size same thing yeah
okay so what what are is so position is
actually two positions we have X and
y-axis so what is each of them encoding
and the color is all right so let's go
back to this slide and think about this
for a second
we've got positions which are being used
for sure enough two quantitative
variables it's a life expectancy and GDP
per capita
we've got size which is being used for a
quantitative variable as well so you can
think about it that's area really
so area isn't quite as good for encoding
quantities as position but they've also
chosen a variable that isn't quite as
important they've chosen population it's
not as important to make fine population
comparisons as it is to make fine life
expectancy comparisons on this chart and
then they're using Q for a category so
in fact the visualization design
corresponds very closely with the
experimental results on how people
perceive things and I would say there's
actually one more channel here which is
which is time or motion the this chart
the sort of the early experiments don't
study motion but you can there's a lot
of research on you know how well can we
compare different rates of motion if I
show you an animation that lasts 5
seconds can you remember what happens
during that animation and generally what
they show is that it's it's hard to
complete complex information and
animations you only remember very simple
things but in this case all that you
need to remember is that things went
from the bottom left upper right if you
need more complicated things then
sometimes what you'll see is you'll see
trails right so if this moved from here
to here then you would have a trail like
this and that's one way to encode more
more information that way
so that's an introduction to the
perceptual point of view of
visualization design another way to
think about visualization is is is like
this right
make the salient features of the data
visible without thinking and to do that
you have to understand as we've seen in
these experiments what it is you can see
without thinking about it and how to map
those own two basic structures in the
data so this is an incomplete list that
I put together of patterns in the data
that it's possible to turn into very
simple visual representations so we've
seen lots and lots of examples of
clusters and we'll see some more on our
social network analysis clusters are a
basic possibly the most basic pattern in
data especially in multi variable data
or high dimensional data and you
definitely want a visualization
algorithm that preserves clusters but
you can also look at things like you
know the extent of the data like the
range of it or find outliers or look at
more sophisticated patterns so this is
an incomplete list of course what other
types of patterns and data can be turned
into visual structures
yeah that's a good example like that's
what a Venn diagram is yeah yeah all
right so you get to use motions so you
can turn temporal relationships into
spatial relationships so you can keep
them as temporal relationships what
about this one's kind of interesting
we're not really exploiting this but
we've got these little lines right so
connectivity and paths so that this was
an example of that earlier sort of graph
theoretic attributes can be turned into
visualizations when you start thinking
about that there's there's quite a lot
of stuff that with a little imagination
you can turn into visual encodings
one of the main things we are interested
in when we're talking about analysis
tasks is interactive visualizations so
most of the visualizations I've shown
you are static if you load up that
Gapminder interactive so this one you
can play with it a little bit right so
you can select countries and you can
change the time so you can set the color
and so forth interactive techniques are
an enormous design space this is a chart
from a paper I quite like about
visualization design which tries to map
out the space of where you should use
interactive techniques versus automation
I also used this diagram when I need to
talk to people who are used to
automating everything so you know I'm
talking to a Google engineer who's like
well everything has to be run at scale
and so we have to have an algorithm that
finds fake news on its own well that
only works if the task is very crisp if
you know exactly what it is you're
trying to accomplish and all of the
information is in the computer as
opposed to the reporters head from
making phone calls or their interview
notes in free text that hasn't been you
know entered into the system yeah
it's it's talking about when you should
use I mean originally yes it was in the
context of visualizations but it's more
broadly talking about when interactive
techniques will work versus automated
techniques computer scientists tend to
like automated techniques because they
don't have to think about those pesky
humans yeah so let's take ten thousand
PDFs of public records and find the
story in them so there have been
techniques that try to take plain text
and find stories but it's much more
effective to put the human into that
process because the human knows lots of
things about what is interesting about
those documents one way to think about
that is there's lots of information that
is in the human's head and the tasks
what they're trying to do is not all
that clear because there could be many
different types of stories right so
you're for that problem you're kind of
like around here in that space which
means that automated techniques are not
a good fit so it's just a way to think
about how much automation versus how
much is assistance to human effort
well interactive visualizations are
squarely in this middle box right so one
way to solve this problem of how do we
investigate this huge pile of documents
is to visualize the contents of them as
opposed to trying to do NLP and spit out
the answer most visualizations are
interactive there's no point in making a
picture that a human isn't going to look
at I now want to talk very briefly about
visualization design and so you're doing
database and other classes I imagine yes
no okay you have a visualization class
or in the CS department or here
yeah okay do they talk about this same
perceptual stuff okay visualization
shows okay interesting okay so it falls
to me to talk about this stuff I guess
in computer science we tend to talk a
lot about these inner two boxes you
know visualization designers like
publishing papers on visualization or
algorithm design or some encoding
technique you know I'm going to show the
probability of this thing on my Bayesian
inference algorithm by color you know
whatever it is but this is another way
to think about design is that you sort
of have to start from the outside in
starting with the domain problem and
this is often a very complicated problem
so for example the problem of
investigating a huge pile of documents
is not a particularly well-defined domain problem. All right, and I've
done a bunch of research on what I call
the ethnography of data work you know
what is it that journalists actually do
with documents and you have to answer
those questions before you can start
talking about, you know, how my topic model is going to work, because the topic model is all the way down at this level. So the idea is you work from the outside in. Yeah, yeah, so
what are you trying to solve so an
example in the investigative journalism
context — and this is a not-quite-hypothetical example — is: okay, I'm trying to find all the places where it looks like a politician was taking money for a policy change. So that's the domain problem characterization, or one way of saying it. The data/operation abstraction design is: okay, well, maybe the abstractions I want are people and payments, and then once you settle on those
abstractions then you can go to the next
level which is the encoding or
interactive techniques and say something
like ah, I'm going to show the people and payments as a graph, a network — we'll see that later this class — and then the algorithm design is: okay, here's how I'm going to efficiently extract all of the people and payments and render a graph and lay it out
but visualization data visualization is
more than just plotting data so let's
look at this New York Times piece on home runs. What else is here? — pictures, yeah, there's photos of some people — what else? — yeah, there's a lot of annotation
what else do you notice here where does
your eye go first all right the big
number and then where it's after the big
number where does it go
all right the this red line and it's
comparison to these other lines so
there's a bunch of stuff going on here
there's annotations and there's visual
hierarchy why does your eye go to the
big number first
okay, well, it's the size — yeah, what else
yeah so position size other elements
leading the eye towards it the weight of
the font it's a very thick black font as
opposed to you know
so here's text which is lighter and
smaller so that's lower in the visual
hierarchy so there's all of this stuff
that is not data visualization per se
but is all of this stuff around it that
makes a visualization work and there's
titles too so there's a very interesting
piece of research recent research that
shows that by changing the title on the
same chart you can change what people
remember by changing the framing right
you can have a chart that shows you know
a very moderate increase in crime rate
over the last few years and you can have
a title that says crime rate unchanged as police crack down, you can have a title that says, you know, crime has increased as police crack down, or you
can have a title that says you know
crime on a 10-year downward trend just
directing attention to a different range
on the graph you can really change what
people remember by changing the words
you put next to the pictures so data
does not speak for itself that's I think
one of the lessons of the data
journalism program is you know it is a
narrative medium and so doing a data
visualization doesn't free you from
getting the narrative right this is a
piece that I quite enjoyed when it came
out
oh dear was this flash oh no can I even
I wonder if this will load in any
browser this is the problem with reusing
slides from last year right it's always
always the possibility that it'll break
there we go
yeah it was based on this
which came out a little bit earlier and
a lot of people enjoy just these these
lines the sort of animation like this
and what the New York Times did was they
made this thing which now I have a
screenshot of, which showed the wind blowing to the right for the red dots and to the left for the blue dots, and of course the length of the vector is the size of the shift. So this was the 2012
midterms which seems like a very long
time ago now we're and in fact this is a
general pattern in midterms generally
the president's party loses seats that
is the most common thing that happens
with midterms you're right this is not
the midterms that was 2014
okay what I said about midterms is still
true but this was Obama's re-election
where you know there was a rightward
shift but Obama's still won and what I
want to show you is the process of
designing this visualization so there's
a — there used to be a blog called chartsnthings, which was Kevin Quealy's blog about creating these visualizations. Unfortunately it's now only available through archive.org, which is another solution to the my-class-materials-keep-disappearing problem. Oh yeah, so it says it's based on
the wind map and you can see the sort of
whiteboarding process that they went
through here it's not just that map but
it's they've then break it down by the
shift per state and they break it down
by different demographic groups as well
so here's Hispanic voters and young voters and women and so forth. So that was
the original sketch and it ended up here
and this is actually a nice little Tumblr
they they had a lot of stuff on here on
let's see if we can get it through
Internet Archive you can learn how quite
a few of these things were made and it's
not so much that the individual articles
are super interesting it's more it's
just neat to see the process yeah I
guess it's more work than it's worth
right now here's another one that was
from that
here's another election thing or you can
show how I think it's how each state
shifted to the left or the right over
time and you can see how it started out
as just a quick visualization, and then some experiments with
Sankey diagrams this is called a Sankey
diagram and then here's the final piece
so when you make one of these fancy visualizations it's not like it just pops out — you don't just start typing it immediately. All
right there's a lot of sketching
involved and sketching not just like on
whiteboards but going through various
iterations of attempts to do it with
code this is a really nice visualization
as well and it's a good example of
narrative you know data visualization
so basically you just keep scrolling
down here right so here's the observed
pattern here's the change here from the
Earth orbit so this is all this is
basically a data visualization of a NASA
climate model here's solar variations
here's volcanoes which actually tend
to be cooling because they block light
here's all of it together
here's deforestation which also is
slightly cooling
here's ozone here's aerosols which have
actually quite a lot of cooling effect
and here's greenhouse gases and then
here's the model versus the observed data. Okay, so that is very interesting
but I'm gonna make another comment here
that you as data analysts should be
worried about which is of course the
model matches the observed data you
don't publish a model on climate change
that doesn't match the observed data
right that in fact part of how we know
that the model is working as it
reproduces the reality so all of the
problems you have with machine learning
where you don't want to peek at your
training data applied to fitting
physical models as well what's the
justification that the model is not
overfitting what types of justifications
would NASA have for this
So it matches historical data, but of course you don't choose models that don't match historical data. What you hope is to test your model by matching it against the future, and you can sort of cheat that: you can do cross-validation-style checking on your model, where you fit it on part of the time sequence and see how well it matches the part it didn't see.
The other big answer is robustness checks: if your model has various parameters, you look at how your model does if you vary those parameters and your assumptions, which is part of what generates these error bars
right so this this blue shaded region is
the 95% CI probably 95% yeah there you
go 95% CI it says there for this model
and part of where that CI comes from is
measurement error on the physical
parameters that go into the model and
part of where it comes from is
robustness checks on assumptions so if
we don't know exactly how much clouds
contribute to global warming or in this
case global cooling then we make some
range of reasonable guesses and that
distribution of uncertainty gets
incorporated into the model as model
uncertainty anyway I show this to you as
an example of narrative visualization I
think there's one more and here's
everything together and tada
it's a beautiful fit oh and I think it
just goes back to the top
so this is a lot more than a chart right
this is a pretty sophisticated
interactive presentation one of my
favorite quotes about designing
interactive visualizations is if you're
asking the user to click the payoff
better be huge. Is the payoff for clicking big enough here? I count mostly yeses around the room, yep. I think it does work here — I'm gonna post it to Slack so you can play with it. Interesting — I think it just goes to static pictures on mobile; probably it just tries to avoid JavaScript. Yeah, and then it talks about
where all of this came from oh and this
is what I was just talking about, these robustness checks, so here we go
it's robustness in several ways right
there are 28 research groups around the
world and they've written 61 climate
models each one is slightly different
but this is just one model but this is
how this science is done actually they
very intentionally have a lot of
different people using a lot of
different models and then they look at
the aggregates of the models which is
not something we normally think about in
terms of models we normally think about
like you know measuring some value 10
times and averaging it to reduce the
noise but it's the same measurement
process here it's actually different
groups with different methods and
there's there's
experimental literature on you know does
this actually work to have different
people doing different estimates and
take the median like why should this
work and the answer is basically that it
works because you know this throws out
the extreme values and you get
straddling where some values are above
the true value and some are below and
they cancel each other out at least
somewhat and you get closer to the true
value anyway
Wow there's a lot of stuff there
yeah and here's actually the
interpretation of the confidence
interval it's as I said they do have a
huge range of simulations and then they
pick the interval that 95% of them lie
inside so aside from the narrative
structure of the visualization which i
think is very interesting this is really
illustrates how the science translates
into the data right it's it's it's not
just you run a model as you actually run
a suite of models and you have all the
same problems that you have in machine
learning with overfitting and validation
and so forth I guess one of the reasons
I show you all of this is I want to try
to disabuse you of the idea that data
visualization is objective so first of
all data isn't objective and the simple
way to see that is I can just mail you
all of my favorite spreadsheets without
column headers all right the the names
of the columns provide the crucial link
between the world and the spreadsheet
and those that that's not objective
that's those are facts that have to be
reported out right here's a chart that
doesn't start at zero it happens to be a
chart from a technical paper comparing
different ways of computing document
similarity and showing that you know
this one — it compares the error relative to what humans think. I found this
paper because I was very curious if you
do cosine similarity on a set of
documents how close does it match human
ratings of similarity and the answer is
pretty closely right like 80% it gets
it's a reasonable approximation to how
humans think about document similarity
here's another chart that it's become a
little bit famous this was presented to
Congress I think three years ago
talking about Planned Parenthood and and
okay so what is what is the narrative in
this chart
I don't think that that's what there's a
yeah generally they're saying Planned
Parenthood does mostly abortions and not cancer screenings. There's a bunch of weird stuff
going on here so I would say that this
isn't actually a data visualization
because it throws out first of all it
throws out all the intervening years but
also it plots two things on a different
scale so oh do I have the link for this
yeah here let me show you the the actual
data that this is drawn from I don't
know
way back machine to the rescue again
let's see if we've got it yes
okay that's fine do we do we end up with
yeah okay here we go this is what I want
so I will update my notes here all right
so here was it being used in Congress
and here's the actual data all right so
if you choose the same scale on the left and the right and you start at zero, it looks quite a bit different. And here's a more complete data set which actually shows all the intervening years of data
alright so Planned Parenthood mostly
doesn't do abortions they mostly do you
know STI screening and contraception so
what is that what is that saying if you
torture the data you can make them say
whatever you want so you will find if
you go through this stuff arguments
about what fair data visualisation is so
for example you will find heated
arguments about whether you should
always start at 0 or not where do you
all fall on that always start the y-axis
at 0
yeah I don't know I mean there's a bunch
of things we can talk about but I don't
know that there's any totally general
rules for this stuff but I think you can
say that there is as much editorial choice that goes into graphing the data as there is in choosing the data
itself so again the data do not speak
for themselves you have to make these
choices and here's another example of
that this was around the Obamacare
debate this was supposed to be a chart
of how all of this stuff works together
this is the same chart right so you can
lay out again the same information in
different ways and here's a close-up on
that which is a lot clearer so by the
way I'm certainly not saying that you
shouldn't have a narrative in your data
if you don't have a narrative why are
you showing the data right the data has
to have a meaning and so you have to
choose the meaning that you want to
display that's part of being a
journalist. It's just that certain things can be intentionally misleading. All of
the issues of objectivity and balance
and so forth come into play just in the
same way that it would be dishonest to
leave out facts that are relevant to a
story it would be dishonest to leave out
data that is relevant to the story all
right so for the second part of this
class we're going to talk about social
networks and journalism and I'm going to
try to ground it in at least a little
bit of a sociological perspective so
just first the definition we're talking
about social networks when I say social
network I don't mean Facebook
I mean nodes and edges right so it's a
graph of people and connections between them. I haven't said what type of
connections so when we use social media
data we're talking about you know
following or friend relationships these
terms aren't completely standard but
often network analysis is used to mean
there's only one type of relationship
and link analysis is used to discuss
having many different types of
relationships and this is very common in
big investigative projects this
difference becomes important when you
talk about things like centrality algorithms, all of which are derived assuming there's only one type of
connection and link analysis ultimately
grew up in law enforcement and
intelligence
so journalism is sort of adapting many
of these techniques the entire reason
this is interesting is because people
act in groups or to put it another way
if I know something about person a I can
probably know something about person B
as well who's connected through a link
and you can you can sort of imagine all
of the different ways that properties
are transmitted through links so for
example family ties are very strong and
following relationships on social media
those are the channels through which
information flows or can flow so you
would expect that people who are for
example follow the same person are
exposed to the same type of information
and this applies basically in every
sphere there's a name for this which is
homophily, or as is sometimes said, birds of a feather flock together: the people with whom you have a lot of ties, you're going to be similar to in some way, and that's basically why we do these analyses — it's an inferential method
to transfer knowledge about one person
into knowledge about people related to
them there's kind of two ways to do this
analysis so by the way I'm talking about
analysis of the structure of the network
when I say social network analysis I'm
not talking about collecting all of the
tweets about the election and doing text
mining I don't consider that social
network analysis because there if you do
that there is no use of the structure of
the network so I'm talking specifically
about techniques where the structure of
the network is used as data all right
it's part of the inferential method and
basically the two ways people do this
are visualize it and then use human
interpretation or apply an algorithm to
compute something in both cases as we
shall see the results are highly
contextual in fact this is one of the
most contextual types of data analysis
you really can't just sort of read off
the answer you have to understand what
it is you're looking at and who these
people are
this is a pretty old idea these are the
earliest social network diagrams that I
could find
he — Moreno — called them sociograms.
Moreno was a psychologist at Columbia
who studied things like dorms where
foster care children lived fraternities
and sororities classes and he started
drawing these pictures and his notation
was — this is a diagram of, I think, a dorm — and he is already distinguishing between one-directional and two-directional links: one-directional is just an arrow, two-directional is a line with a bar across it. In other diagrams — it's
actually a beautiful book it's full of
these hand-drawn pictures — in other diagrams
he has different colored links to
indicate repulsion or dislike so already
in the 1930s there was fairly sophisticated, very recognizably modern social network analysis done by hand. He got
the data by going into these places and
doing surveys so just looking at this
picture who do you think is the class
president anyone want to say the number
this this one right yeah right
so that's interesting right we can learn
something about the role of these people
just by looking at the structure of the
network so this is the type of inference
we're talking about rather than going
into the visualization algorithms I want
to take a slight detour to a paper that
I quite like this is a paper about
analyzing social networks from Facebook
data — there is publicly available some very old Facebook data, it's anonymized
of course I think it's really early it's
like 2006 or 7 or something Facebook
doesn't do this anymore
but there's various ways to get data
like this and the purpose of this paper
is to show how to go from the image on
the left the hairball to this image
which shows the actual underlying social
communities and it's based on a bunch of
sociological assumptions so what we've
done here is we've cut out most of the
links to reveal this diagram and the way
we do that is we look at triangles and
this is where this word Simmelian comes in — Georg Simmel. Yeah, it doesn't say his first name; I think his name is Georg Simmel. He was a late 19th century, early 20th century
sociologist so this is before any data
was available before people were drawing
social network diagrams really and his
theory of sociology was based on
triangles he made the following
observation he said to study sociology
you have to study at least three people
because — well, with one person, obviously you can't study the social; and with two people, when you study just dyads, those people aren't observed — two people can have their own little world, there's no enforcement of the norms of society. So he thought that
sociological theory needs to be built
around how do two people interact when
they're being watched
hence triangles and triangles turn out
to be extremely important structures in
social network analysis so graphs
generally like graph theory is all
concerned with edges which is a
two-degree relation sociology is
concerned with triangles so let me give
you an example of that
so I'm just going to draw a sort of
pretty standard like friend graph
so let's say these are symmetric
relationships of the you know a is
friends with B type now there has been
observed in sociology a simple
predictive rule called triangle closure
and what triangle closure says is: if you have a triangle that is open, such as
this triangle if you watch these graphs
over time it is quite likely that the
triangle will close that is to say you
will add this edge okay
so think about the sociological
mechanism there what that says is that
if I have two friends it's quite likely
that my two friends will meet it's a
very simple sociological observation and
it's highly predictive and in fact these
open triangles they are the basis of
most social network person recommendation algorithms. So you know
facebook says you might also know well
how does it do that the way it it does
that is it looks for the maximum number
of unclosed triangles so let's say this
is me and facebook wants to recommend
people to me what it does is it looks
for people who many of my friends know
right — so there's someone that three of my friends know, and if I knew them, then it would close a bunch of triangles: it
would close this triangle it would close
this triangle and it would close I guess
that's it because it's not a direct
connection. All right, so one way to think about this is: to whom are there many second-degree paths? Another way to think about it is: who do many of my friends know? But yet another way to think about it is: how do I close the most triangles? So triangles, as opposed to edges — a set of three as opposed to a set of two — are a fundamental way of thinking about social relationships that seems to have some sort of sociological reality, both theoretically, as proposed by Simmel 100 years ago, and in practice.
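As a rough sketch of that recommend-by-closing-triangles idea, here are a few lines of Python that suggest the person who would close the most open triangles with "me" — the friend graph and names are made up for illustration, and a real recommender of course uses far more than this.

from collections import Counter

# toy undirected friend graph (made-up names)
friends = {
    "me":  {"ann", "bob", "cat"},
    "ann": {"me", "bob", "dan"},
    "bob": {"me", "ann", "dan", "eve"},
    "cat": {"me", "dan"},
    "dan": {"ann", "bob", "cat"},
    "eve": {"bob"},
}

def recommend(person):
    # for each non-friend, count how many of my friends know them,
    # i.e. how many open triangles adding that edge would close
    counts = Counter()
    for friend in friends[person]:
        for fof in friends[friend]:
            if fof != person and fof not in friends[person]:
                counts[fof] += 1
    return counts.most_common()

print(recommend("me"))  # dan closes the most triangles, so recommend dan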
this paper talks about different types
of triangles right so if everybody is
friends with everyone in this case it's
directed — you get a Simmelian triangle; otherwise you get different types of triangles, right, so sole-symmetric — you know, just these two people know each
other and then this person was excluded
or just one person knows the other one
and this pruning algorithm here's
another example of what you can do with
this algorithm and this extremely slowly
rendering diagram can take these big
hair balls and turn them into something
much more easy to read right it it finds
the true strong ties so it goes from the
thing on the left of the thing on the
right and notice that without being told
what dormitory people are in it
successfully groups people into
dormitories all right so it figures out
who's actually friends with who and
trims out all of these weaker edges the
way that the algorithm works is
— say it's trying to figure out if it should have an edge between A and B. Well, what it does is it looks at all of A's top friends, and the way it finds the top friends is it asks how many triangles they have in common. So someone is a top friend if you don't only have a link to them but there's a mutual friend as well, right — so it looks for a triangle.
Let's call this person C, which says that not only do I know them but we have mutual friends. And it ranks these people who are part of triangles: for every person that I know, how many triangles am I involved in — in other words, how many mutual friends do I have with that person. And so let's say my close friends — or rather A's close friends — and then, for each one, the number of triangles, the number of friends in common, and I'm actually going to order these because that's how the algorithm works
and then it takes some threshold say it
says it says you know the top 5 so this
is my top 5 best friends as ranked by
number of friends in common, on the assumption that the people with whom I have the most friends in common are the closest. And then it does the same for B, so C 3, H 3, J 2, E 2, G 1, okay, and then what it asks is how many of our closest friends are in common. So notice I start with binary data — you're friends or you're not —
how this algorithm really works is: what it really wants is the strength of your tie to the friend, but it measures the strength of your tie — if you only have binary data — by calculating triangles, right, friends in common. And then once you have these types of lists, it asks how many people appear on both lists. So C appears on both lists, G appears on both lists, E appears on both lists — that's it. So then, because there are three people who appear on both lists, it says that the strength of the relationship between these two is three. All right, so it's this sort of second-order thing: first I figure out who my close friends are, and then I ask how many in my top-five lists, or top-n lists, are in common, and that gives me
a weight of the strength of the
connection between A and B — it says how many of the same close friends we have. If that weight is greater than some threshold, let's say three, then we keep the edge, otherwise
we throw it out so using this triangle
based analysis we can trim this hairball
and get back a much smaller graph which
is reflective of the sociological
reality. In this case everybody's living in a dorm together, and so you can see the dorm colors there.
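Here is a small sketch of that pruning logic in Python — rank each node's neighbors by mutual-friend count, keep only the top k, and keep an edge only if the two endpoints share enough top friends. The graph, the top-k size, and the threshold are all made up for illustration; this is my paraphrase of the idea, not the paper's code.

# toy undirected graph as adjacency sets (made-up data)
G = {
    "a": {"b", "c", "d", "e"},
    "b": {"a", "c", "d", "f"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c"},
    "e": {"a"},
    "f": {"b"},
}

TOP_K = 3      # how many "best friends" to keep per node (illustrative choice)
THRESHOLD = 2  # how many shared best friends an edge needs to survive

def top_friends(node):
    # rank neighbors by number of mutual friends (triangles through that edge)
    ranked = sorted(G[node], key=lambda n: len(G[node] & G[n]), reverse=True)
    return set(ranked[:TOP_K])

backbone = set()
for u in G:
    for v in G[u]:
        if u < v and len(top_friends(u) & top_friends(v)) >= THRESHOLD:
            backbone.add((u, v))

print(backbone)  # only the strong, triangle-supported edges remain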
so this is a more complicated example at
a different university, and what it has found is actually two different types of social groups. So on the left — and remember, the algorithm doesn't have this information — on the left they're colored by the dorm that they live in, and you can see that
it's found some dorm structure here
so these people live in a dorm and these
people live in a dorm and those people
live in a dorm but then you have this
right, these big multicolored nodes. And on the right it's colored by year of graduation, so you can see that some of these — these are all the freshmen, right. So it has found
sociologically real communities of
multiple different types so it's found
dorm based communities and it's found
year based communities and then it looks
like there's some stuff that's that's
truly mixed. So like when we start to get into here and here, these are communities that are neither year-based nor dorm-based but seem to have some sociological reality. I bet that's like people who like to go out to the same clubs or something, or, you know, are in the same class or something — I bet there's some
variable which they actually have in
common that we could find with a little
more study
and so this triangle idea is very
powerful and it it seems to both
theoretically and practically capture
something about human social relations
that appears in the data
anyway fun stuff to play around with and
you know each social network is going to
be different if you do this on Twitter
following somebody on Twitter probably
means something different than friending
them on Facebook — and Facebook has both kinds of ties now — so I'm sure the follow networks and the friend networks will reflect different realities. Okay,
I've shown you a lot of pictures of
social networks today aside from the one
that was drawn by hand pretty much all
of them are drawn by this algorithm
called a force directed layout how many
of you have seen this algorithm yeah so
it's a very simple idea the idea is we
throw nodes down randomly and then we we
say that every edge is a spring that
wants to push the nodes apart to a
certain distance so if they're closer
than that it pushes them apart they're
farther than that it pulls them together
and normally there's also a universal
force which is often called gravity that
just sort of pulls everything together
oh I'm sorry no it's the opposite it's a
global repulsive force that pushes
everything apart right so so you end up
going from this to this right if you
have this this tetrahedral shape it will
get laid out like this
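Here's a tiny cartoon of that idea in Python — springs along edges plus a global repulsion, iterated a few hundred times. The constants are arbitrary and real layouts (Fruchterman-Reingold and friends) are more careful, so treat this as a sketch of the mechanism, not a reference implementation.

import math
import random

nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("a", "d"),
         ("b", "c"), ("b", "d"), ("c", "d")]   # the tetrahedron example

# throw the nodes down randomly
pos = {n: [random.random(), random.random()] for n in nodes}

REST, SPRING, REPEL, STEPS = 1.0, 0.05, 0.02, 500
for _ in range(STEPS):
    force = {n: [0.0, 0.0] for n in nodes}
    # every edge is a spring that pulls or pushes toward its rest length
    for u, v in edges:
        dx, dy = pos[v][0] - pos[u][0], pos[v][1] - pos[u][1]
        dist = math.hypot(dx, dy) or 1e-9
        f = SPRING * (dist - REST)
        for n, s in ((u, +1), (v, -1)):
            force[n][0] += s * f * dx / dist
            force[n][1] += s * f * dy / dist
    # a global repulsive force pushes every pair of nodes apart
    for u in nodes:
        for v in nodes:
            if u == v:
                continue
            dx, dy = pos[v][0] - pos[u][0], pos[v][1] - pos[u][1]
            dist = math.hypot(dx, dy) or 1e-9
            force[u][0] -= REPEL * dx / dist ** 2
            force[u][1] -= REPEL * dy / dist ** 2
    for n in nodes:
        pos[n][0] += force[n][0]
        pos[n][1] += force[n][1]

print(pos)  # roughly evenly spread positions for the tetrahedron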
so the picture you get does depend on
where you start because it's you throw
down randomly and so you may get
different shapes depending on your
initialization also you can use all
kinds of different algorithms for laying
things out here are a bunch of different
layouts like for example this is a
common visualization where you put
everything in a circle and draw edges
between them these are actually all the
same graph, and this is from a cute paper which investigates how the layout, or the
the visualization of the graph right so
it's technically it's exactly the same
data it's just drawn differently how
that influences our perception so they
ask people questions like how many
subgroups are there, who are the most important people, and then they ask about
sort of bridging roles people who
connect different parts of the graph
which as we'll see is important so let's
take a look at these for a second how
many subgroups are there in this diagram
yeah it could be three could be four you
know same here who's the most important
person in this diagram or who are the
important people
C? why C — because of where they are? mm-hmm, well, look at where C is here. So it turns
out that the way you lay this stuff out
has a big effect on inference so just
something to be warned about. As we'll see shortly, the primary analysis
method in journalism is just looking at
visualizations but there is this
question of whether the visualizations
are really showing you reliable results
nonetheless, force-directed layout is basically what everybody does — it produces nice-looking pictures and, you know, solves certain problems pretty well
and then there's the question of what we
can learn from a graph and there are
many questions we can ask of a graph. One of the most common types of questions in sort of classical social network
analysis is this idea of centrality
which is also influence or power right
so we want to know who is the most
important person here who's the boss
who's the person I have to talk to who
really made all of this happen and you
can just look at a visualization as we
were just doing or you can also compute
metrics has anyone seen centrality
metrics yeah so this is what your
homework is going to be about so how
could we compute who the most important
people are in a graph what type what
type of metrics could we use any ideas
degree no degree yep that's that's
called degree centrality any other ideas
uh-huh yeah so whether so how something
about how it connects different groups
yeah there's actually a bunch of
different ways and I'm going to
show you a few really quickly here
because you should know this because
you'll run into it so degree and all of
them sort of capture different ideas so
degree centrality is just number of
edges so I think of this as modeling a
celebrity or a news hub right so who has
the most followers there's another kind
of called closeness centrality and
closeness centrality requires computing
the average path length to every other
node so I compute the shortest path to
every node you've probably all seen
shortest path algorithms yeah okay it's
a you know computer science favorite so
what I do is I compute the average
distance to every other node along the
shortest path so in this case
unsurprisingly you end up with the one
in the middle just because the average
distance is lower right so a real world
application of this idea is if you're
getting on the subway and you don't know
which end of the platform the exit is at
you should take the middle car because
the average distance is going to be
lowest from the middle car and this is a
useful model if you're thinking about
information flow or like flow of
packages like where should you put your
your warehouse to serve the whole
country? Well, you know, if you're only going to have one, it should be more or less in the middle of the country. It's
the same type of logic you can imagine
that these are cities and these are Road
distances driving distances and so you
put your warehouse which has that the
shortest average distance so this is
called closeness there's another one
which is betweenness centrality which is
again you think about all of the
shortest paths so the set of shortest
paths from every node to every other
node
so maybe you can compute it with Dijkstra's algorithm — that's one of my favorite algorithms in computer science — or an all-pairs shortest paths n-cubed algorithm; I won a programming competition with that one time, so of course I like it. But
this kind of models introductions or
transmission of some sort right so this
is a map of the relationships the inter
marriages between the ruling families of
Florence in Renaissance Italy and you
know you want to be the Medicis because
if you're gonna make introductions to
marriage they have to go through you
right so you have the most control over
how allegiance by marriage happens
another example of this one is if you
are thinking about networks of imports
and exports and you're levying tariffs on goods that travel through your country, you want to be the country that
everyone travels through. All right, this has strategic or military applications as well.
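If you want to compute these metrics, libraries already have them. Here's a short example using the networkx package on a made-up graph — degree, closeness, betweenness, and eigenvector centrality (which we get to next) are all one-liners. This is just a generic library demo, not the Gephi tool you'll use for the homework.

import networkx as nx

# a small made-up graph: a hub ("aide"), a node that only talks to the hub ("boss"),
# and a few peripheral nodes
G = nx.Graph([("boss", "aide"), ("aide", "a"), ("aide", "b"),
              ("a", "b"), ("b", "c"), ("c", "d")])

print(nx.degree_centrality(G))       # number of edges, normalized
print(nx.closeness_centrality(G))    # inverse of average shortest-path distance
print(nx.betweenness_centrality(G))  # fraction of shortest paths through each node
print(nx.eigenvector_centrality(G, max_iter=1000))  # importance weighted by neighbors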
This next one is maybe a little less obvious; it's one answer to a problem that all of the metrics I've just shown have.
let's say we're talking about organized
crime and you have the mob boss here but
and then you have you know all of the
other members of the mob in various
places but they don't get to talk to the
capo right they get to talk to the
secretary okay
and presumably to each other and so
forth and then the secretary talks to
the capo so all of the network analysis
algorithms that we looked at all the
centrality algorithms we've looked at so
far will say that the secretary is the
most important person and you know who's
this guy right that person is not any
higher than any of these other people
and maybe even lower. Eigenvector centrality says that your importance is
not just who you can talk to but whether
you can talk to somebody important right
so because the mob boss is close to the
secretary, and the secretary has very high betweenness centrality, the mob
boss will get a very high score in this
case I suppose all these other people
will as well but if we sort of back this
off a few few degrees right then then
this person is very important and these
ones or not but this person is still
important because they can talk directly
to the secretary. And so eigenvector centrality is kind of the PageRank idea applied to network
analysis there's various ways of
describing this but one way to compute
this is how likely you are to end up
somewhere on a random walk in fact this
was how PageRank was originally
described: if I start at a web page and I start clicking through, following random links, a lot of those paths will end up at Wikipedia, so Wikipedia is really important. So this
is the same sort of idea if I start
clicking through just following edges
and you know a lot of them will end up
at the mob boss because they go through
because a lot of them end up but the
secretary and once you're at the
secretary it's easy to get to the mob
boss has anybody know why this is called
eigenvector centrality what's the
relationship to eigenvectors here okay
so a little little linear algebra
tutorial you should know this anyway
because this is how PageRank works so
the idea is we are looking for and I'll
use the technical language here the
stationary distribution of a random walk
between pages — or people — which means I
pick a random node I start following
random edges the question is how often
do I end up at a particular node and the
stationary distribution is say I throw a
hundred people into this graph and they
all start walking around from moment to
moment 70% of them will be at the most
central node and thirty percent or
twenty percent will be here and like
there's some distribution where after I
wait a sufficiently long time and
they're all well mixed after everyone
takes a step that distribution will be
the same does that make sense it's it's
the equilibrium distribution of random walkers. So the equilibrium distribution
of random clicking around the web will
have some high fraction of people on
Wikipedia, just because although many people leave Wikipedia in each step, the same number of people arrive in that step. And there will be a much smaller fraction of people on the website for this course, because although some people link into it, basically all of the links go out — more links go out than come in. So that was the idea. Now,
to sort of complete this example let's
say we have a structure that looks like
this okay and we need the idea of a
graph structure represented as a matrix
have you all seen this, an adjacency matrix? Okay, so here's the basic idea: we have some matrix, and you can think of the rows as "from" and the columns as "to" — A B C D by A B C D — and you can always get from a point to itself in one step, so the diagonals are 1, and then you have a 1 if there's a connection between two nodes and 0 if there isn't.
so in this case i can go from a to b c d
1 1 1 and i can go from b c d a so all
those are ones and the rest are 0 and in
particular if the edges are
bi-directional then the graph is
symmetric if I have directionality which
I do for links right I can go from A to
B but not B to a then this becomes a 0
if I say that you have to move to a
different spot you can't stay on the
same page then the diagonals become 0 ok
so this graph tells me how I can move
around in one step
now fun fact if I multiply this by a
vector that tells me where I am now so
let's say I start on B
and what this does is this actually
picks a column you can see as you
multiply through it picks out a column
of this matrix right so yeah and it says
from B I can't get anywhere because I
made that arrow on one directional but
if I start instead at at a then it tells
me where I can get to from a so that
would look like this so this is standard
matrix adjacency math I have the
adjacency matrix when I multiply the
adjacency matrix by a vector
representing my current location tells
me how fast I can get to various places
I can also think of this as a
distribution like a probability
distribution: if I say that, you know, there's a 50% chance I'm on A and a 50% chance that I'm on B or D, and I take one step, well, then what I'm going to get is
half of this and half this that's gonna
give me the probabilities that I end up
in any other point on this matrix so if
I multiply this through I get another
vector. So let's call this matrix big A and call this vector x, so Ax is where I can get in one step. If I take another step, well, that just gives me a vector of the distribution of where I am: A(Ax), or A squared x, is two steps, and so forth, okay.
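Here's that same arithmetic in a few lines of numpy on a made-up 4-node adjacency matrix, just to show the mechanics. With rows as "from" and columns as "to" (and ones on the diagonal as in the example), a row vector times A moves one step forward; the transposed convention uses A times a column vector instead.

import numpy as np

# adjacency matrix for nodes A, B, C, D: rows are "from", columns are "to"
A = np.array([[1, 1, 1, 1],
              [1, 1, 0, 0],
              [1, 0, 1, 0],
              [1, 0, 0, 1]])

x = np.array([1, 0, 0, 0])   # currently at node A
print(x @ A)                 # where I can get in one step
print(x @ A @ A)             # counts of two-step paths to each node

probs = np.array([0.5, 0.5, 0.0, 0.0])  # 50% chance at A, 50% at B
step = probs @ A
print(step / step.sum())                # normalized distribution after one step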
And in fact you can use this fact to do a shortest path algorithm, right — you basically just keep on multiplying A by itself until it converges. Now you have to normalize at each step, but eventually what you're gonna find is that, if everything's reachable, every term will get filled in, and you'll end up looking at all of the paths after two steps, three steps, four steps, five, six steps. So if I'm talking about an
equilibrium distribution remember the
definition I gave earlier: if I start with, you know, people in some distribution over all these nodes, and I take one step where everyone takes a random
edge out I end up with the same
distribution well what does that mean so
well, let's say there's some vector v such that Av = v, okay — so I take one step starting from some distribution v and I end up at v. So this is the equilibrium distribution, and that's an eigenvector problem. Okay, you may recognize that as the definition of an eigenvector; in fact the usual definition is a little different, Av = λv — there's a lambda in front of it, which is a scalar, so I multiply the vector by a scalar — but I can always set up the matrix, just by scaling it, such that this lambda equals 1, and that's just normalization of this matrix, so you can always get that. And now
what I have is I'm looking for a
distribution such that it doesn't change
when everybody takes a step and that is
that's an eigenvector right that is
exactly an eigenvector so that's why
this is called eigenvector centrality so
what I am doing is I am finding an importance or centrality metric that I can assign to every node, such that after I take this one step, where everybody gives a little bit of importance to everyone
else so in other words instead of
thinking about following a link think
about this in the social network sense: at every step I give a little bit
of my importance to everybody I'm
connected to and what is the equilibrium
distribution for that, right? What is the distribution of importance where I get exactly the same importance in as I give out at every step? So I
divide my importance and I send it all
equally on each of my edges but if I'm a
very important person well that's
because I'm connected to people who are
sending a lot of importance to me and
that solves the president's adviser
problem right the president doesn't talk
to everybody
directly but the adviser does so when
the adviser sends out their importance
the president gets a huge fraction of it
and the president becomes an important
person as well. So that is eigenvector centrality, and it's kind of a model for who you know: maybe I'm not a celebrity, so I don't have a lot of connections, but I'm the producer for Lady Gaga, so I'm an important person too.
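Here's a minimal power-iteration sketch of that in numpy: repeatedly multiply an importance vector by the adjacency matrix and renormalize until it stops changing. The small matrix is made up, and real implementations (PageRank, networkx's eigenvector_centrality) add damping and other refinements.

import numpy as np

# made-up symmetric adjacency matrix: node 0 is the "secretary" hub,
# node 1 is the "boss" who only talks to the secretary
A = np.array([[0, 1, 1, 1, 1],
              [1, 0, 0, 0, 0],
              [1, 0, 0, 1, 0],
              [1, 0, 1, 0, 1],
              [1, 0, 0, 1, 0]], dtype=float)

v = np.ones(A.shape[0]) / A.shape[0]   # start with uniform importance
for _ in range(100):
    v = A @ v                          # everyone passes importance along their edges
    v = v / v.sum()                    # renormalize so it stays a distribution

print(np.round(v, 3))                  # the stationary importance of each node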
So our next problem is: what does this all mean, which centrality metric should we use, and what do we do with this in journalism?
and before we get into that I want to
talk about just ask the question let's
say you walk into a small town and you
have been your reporter right the first
thing you have to do is find out who is
influential in that town how do you
actually do it because you're probably
not going to start by analyzing social
network data although maybe but let's
say you don't have social network data
for this town what do you actually do
like let's let's bring it back to the
real world here how do you determine
yeah
yeah that is exactly right so first of
all talking to the local reporter is
always a good idea but also just trying
to figure out who's influential in
society and I want to show you an
interesting document which I stumbled
across a few years ago this is from the
1970s, it was a handbook — again, it's linked from the syllabus — for development workers. We don't really send development
workers into rural communities in the
same way that we used to but it was
called mapping the community power
structure and look at this research had
shown that successful community social
action efforts depend on the appropriate
involvement of key leaders in the community
so you have to find them and here's how
it suggests you go about it so this is a
wonderful little guide book because it's
a very practical reference to how power
works in the community and it suggests
you do this thing called the reputational technique, which is — if we go to page 8 — okay, it says you find knowledgeable people, you find a number of "knowledgeables" to be interviewed, and the knowledgeables will be asked who they think the community's power actors are. So you go and ask people who's influential, and it
tells you how many people to ask and it
says what you look for is people who
appear on a lot of lists so that's it
right — I think there's a nice resonance here with the Simmelian backbone method that we looked at earlier, right: you build these lists
and you say who appears on lots of these
lists yeah yeah so why would that be
yeah right you're talking about
sociological and political reality right
I think this method would have worked
anytime in the last hundred thousand
years and will continue to work no
matter what the internet does to us I
think, I think, you know, there are two sets of unsolvable problems: love and politics
yeah I mean it's no different right so
one of the one of the questions you
should be asking in any interview when
you don't know the space well is who
else should I talk to and eventually you
find everybody gives you the same name
and you talk to that person right so you
can think of this as a kind of degree
centrality right you ask a bunch of
people you can think of sort of randomly
sampling the network and you get lists
of names and then you go talk to the
people who a lot of people pointed to so
it's kind of a centrality algorithm but
I want to I want to emphasize that you
you know we've had a very mathematical
presentation of centrality it's a very
sociological idea it's very concrete and
and really this this concrete idea is
much more important I mean you know
there's more to it but here you go
here's the question you're supposed to
ask — I love the, like, older slang: who are the kingpins, who can get things done, who swings a big stick
yeah none of this has changed you might
use different slang but that's about it
okay so that's linked from the syllabus
so it's it's really fun to read and it's
just basic basic field work very similar
to what reporters do which brings us to
the very important issue of how does
where am i how does all of the
centrality relate to the real world
you know I wish so this is a graphic
from a Wall Street Journal story on
insider trading yeah unfortunately it's
offline now — the Journal moved CMSes and lost all their interactives; that's twice today we've run into the problem of dying interactives. But it's about a
very complicated insider trading case
and it was kind of fun you could step
through it and would highlight different
nodes and tell you what their
involvement was but you can tell just by
looking at this that this is the central
actor in the story they have high degree
centrality they have high betweenness
centrality they're also placed in the
center of the picture. Now of course they're going to be central to the story — the journalist is telling a particular story, so they get to choose the people to include and the links that are relevant. So remember, if you just had the Facebook graph that included all these people, you may not get this person as central, because who knows if they're central in society; they're central once you pick a particular subset of people,
which is a really important analytical
point in graph analysis that the answers
you get are gonna depend on what you
sample in any case there's this idea
that I've called journalism centrality
we want to know who's important to this
story and centrality metrics may or may
not answer this question in fact
centrality metrics are not used very
often in practice they have been used so
I wrote a paper called networking
in journalism where I surveyed a bunch
of famous stories and tried to break
down what they did
it's a good question and and this is
another problem with using centrality
algorithms it is you usually already
know who is central to the story it's
it's pretty rare that you don't have
some idea of who the important person is
in the story because if you've been
reporting then it's your beat and so you
know already from doing all of this
previous work but we're gonna put that
centrality aside for just a second — your homework assignment is actually going to be about centrality — but I'm
going to talk about something else that
you can do with with networks so so far
we've talked about identifying and
ranking individual people now we're
going to talk about finding communities
and basically we're going to be talking about finding clusters in the graph. Now
one of the ways to do this is just
visualize it and just use your eyes but
just like you can look at a
visualization and find centrality or you
can run an algorithm and find centrality
there are community finding algorithms
we're not going to spend too much time
on this but I'm just gonna go quickly
through a classic one so that you have
some idea what these are about
so this is it right this is actually an
old picture of my facebook graph there
used to be a Facebook app that would
draw this for you and it's highlighted
the different clusters for me and
they're actually like you know my San
Francisco friends and my Hong Kong friends and some circus people that I
just hang out with and so forth
you know pretty good job of finding this
stuff yes
yeah put it on slack meanwhile I'm gonna
talk about this one so this is there's
different ways of defining clusters
because there's different ways of
defining links so this one is done with
Amazon from the people who read this
also read that that's what each of these
arrows means and this was for the 2008
election you can see that starting with
a political book on the conservative and
the liberal side you get this these real
clusters, right — this is sort of the filter bubble, echo chamber effect that you can see. But
there's many different ways of defining
community so if you're for example if
you're looking at Twitter you can look
at who follows who that's the most
obvious thing but you can also draw an
edge between two people if they both shared a particular link — that's what I sometimes call co-consumption. What was your example? Oh, Connected China, yeah, I know this, yeah, this is fascinating. So let's see — how do they interact with each other — yeah, Reuters did this a few years ago based on a huge amount of research; it's called Connected China, yeah,
this one's kind of a fun one as well it
shows for various people the different
stages of their career, so the previous and current premiers for example, yeah,
it's not selectable and every one of
these is based on a news article I don't
know if they'll actually give us the
links here but how they did this is they
had this whole database of relationships
between people Wow
it's quite a visualization well okay
let's not get into China's government
structure I want to go back to the the
social power one yeah this is the one
that's most clearly a network analysis
and also most clearly not loading let's
try that again
so I wonder if I can get the original
links if I click on this person each of
these links is an article I don't mmm
connections I don't think it'll show me
the references here but maybe here is
where they are you know this is just
writers reporting so they're not making
the database exposed and this is was the
the brainchild of red schwa who is now a
data journalism exec yet writers and
he's been here's a blog he's been
writing about stress
for journalism for a very long time and
this is a sort of Exhibit A he had a
team of reporters go through and just
trace all this stuff out from public
news reports so they built this database
of connections there are different
databases like this — we'll look at a few more. So for example LittleSis — you've heard of Big Brother — so I can look at a particular person. Let's search for a person — not Trump — Andrew Cuomo. So there's the positions that he's had, and
he's worked for these people and he has
this link to Martin Connor — oh, I just get him. But for each of these things — yeah, right, that's true — what I want, I think, might be this — yeah, here we go: each one of these is
referenced so here is the article link
from which this one link comes through
so you can build up these networks just
by analyzing public sources and so yeah
these maps that people make right and
each of these it says lobbying and if I
click on the link it gives me the source
for that one piece of data yeah so
LittleSis is pretty cool. Unfortunately
I don't know how much journalism
actually comes out of this I think it's
more of a research tool you use it all
the time what do you use it for
yeah yeah it's nice because it has all
the sources and links to everything yeah
I'm not sure — it used to be that anybody could add to it, but I think they have more editorial control now
so I want to move on, because there's actually a video I want to show
you so let's do community detection and
then we'll go into how this stuff is
actually used for reporting another way
to define communities is who talks to
who this is the Enron email
network map do you guys all know what
this data set is so
yeah half a million emails court-ordered
release after the collapse of Enron
energy in 2002 over accounting scandals
you can get clusters through web link
structure — this is one of the early examples of trying to map the blogosphere in Iran, it's very cool. Or you can get
clusters by where people go right you
can see everybody who goes to a certain
type of party because their location
trails all end up at a particular Club
on Friday nights or a particular rave or
something we've talked a lot about
clustering you can use any of these
cluster algorithms by defining
similarity metrics on any of the
attributes we just talked about there is
another one which is pretty standard
it's called modularity you will see this
in network analysis packages and
basically it works like this it says can
we divide a set of nodes into two groups such that there are way fewer edges that cross that divide than if the edges were random? So it takes all the edges, imagines they're distributed randomly, and asks: can I find a split such that I have many fewer edges across the divide than random, and many more edges than random within each group? And so I'll just go
through this pretty quickly this is what
this looks like so this is just notation
I take the total number of edges m, which just sums the degrees and divides by two. And then, given the total number of edges, in a random graph the expectation of edges between two nodes i and j is the product of their degrees divided by twice the number of edges, k_i k_j over 2m — hopefully you can convince yourself that that's right: the greater their degrees, the greater the number of chances they have, and the 2m just normalizes. So now I know how many edges
I would expect randomly now what I do is
I take the adjacency matrix which we
just looked at which tells me how many
edges I actually have
I subtract off how many edges I would
have randomly, I multiply by whether they're in the same group — so in other words I count only the edges within the same group — and I define this thing Q by summing over all pairs. If Q is greater than zero, that means there are more edges between members of the same group than I would expect if the edges were randomly distributed. So then the modularity algorithm tries to find an assignment of nodes to groups such that Q is maximized. So that's it,
that's the entire logic of the algorithm
It turns out that you can do eigenvector trickery here as well — I'm not going to get into it, but basically you can solve this for a vector of group assignments. It is possible that there is no split that gives Q greater than 0, right, no split you can do such that there are more edges within the groups than across them, in which case you say there's only one community.
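Here's a tiny Python sketch of computing Q for a given split on a made-up graph, following that formula — A_ij minus k_i k_j over 2m, summed over pairs in the same group, divided by 2m. Real packages like Gephi then search over assignments to maximize Q, which this sketch doesn't do.

import numpy as np

# made-up adjacency matrix: two dense blobs {0,1,2} and {3,4,5} joined by one bridge
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)

def modularity(A, groups):
    k = A.sum(axis=1)      # degrees
    m = A.sum() / 2        # total number of edges
    Q = 0.0
    for i in range(len(groups)):
        for j in range(len(groups)):
            if groups[i] == groups[j]:
                Q += A[i, j] - k[i] * k[j] / (2 * m)
    return Q / (2 * m)

print(modularity(A, [0, 0, 0, 1, 1, 1]))   # the "right" split: Q > 0
print(modularity(A, [0, 1, 0, 1, 0, 1]))   # a bad split: Q below 0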
So we saw those political books earlier; here's what this looks like applied to that split. The interesting thing here
is that we only use the edge structure
to do this analysis we do not use we
know the political category of the book
that's the shape here right circles the
literal squares or conservative
triangles or centrist and it more or
less reproduces the left-right split
just on the graph structure which tells
you again that graph structure
correlates to other things of interest
that's why we study graph structure so
there's a lot of ways of defining
clusters but this is a popular one and
the graph analysis tool you're going to
use for your homework which is Gaffey
has this
as an operation and most graph analysis
packages have it as well okay now we get
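If you want to try the same thing in code rather than in Gephi, here's a minimal sketch using networkx's greedy modularity optimizer; the file name is just a placeholder for whatever edge list or GML export of the books network you have.

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Load the political books network (the path is hypothetical)
G = nx.read_gml("polbooks.gml")

# Greedily maximize modularity Q; returns one set of nodes per community
communities = greedy_modularity_communities(G)

for i, nodes in enumerate(communities):
    print(f"community {i}: {len(nodes)} books")

# The modularity score Q achieved by this split
print("Q =", nx.algorithms.community.modularity(G, communities))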
Okay, now we get to how this is actually used in journalism. I wrote a paper on this a couple of years ago: I went through the IRE tip sheet archives, found everything that talked about network analysis, and categorized it. Here are some of the things that I found.
So this is a visualization of the links between people in the Seattle art world. It's a nice interactive: you can click on each node and see who that person is. The way they got this graph is they just went and interviewed people and asked, who else do you know? Then the reporter went through the interview transcripts and pulled out the edges, so the nodes are the people they interviewed and the edges are who knows whom, and then they colored it by role: are they a gallery owner, an artist, a purchaser, a museum curator, and so on. So this is just sort of basic; there's no network analysis in the algorithmic sense here, it's just a way to draw a picture of the scene.
This is another very interesting case: this image never appeared in the story. This was a story about juvenile car theft. What they did is they got all the arrest records for juvenile car theft, and the edges are co-arrests, so this edge here means that this person and that person appear on an arrest record together; they were arrested at the same time. They found these clusters, with about 400 people in the central connected component, and then they had the problem of who to go interview or investigate further. This is one of the best uses of centrality measures that I know of: they ranked people by centrality and then went and pulled additional data for those people. They bought public records, and they couldn't do that for everybody because it's something like 20 bucks to pull the file for one person, so they wanted to focus on just the most interesting people; they got extra data and interviewed more of the people who were central.
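As a sketch of that centrality-ranking step (the co-arrest edge list here is a made-up placeholder, not the real data), the idea in networkx looks roughly like this:

import networkx as nx

# Hypothetical co-arrest edges: each pair appeared on the same arrest record
co_arrests = [("A", "B"), ("B", "C"), ("C", "D"), ("B", "D"), ("D", "E")]
G = nx.Graph(co_arrests)

# Work on the largest connected component, like the ~400-person core in the story
core = G.subgraph(max(nx.connected_components(G), key=len))

# Rank people by betweenness centrality; follow up on the top names first
ranking = sorted(nx.betweenness_centrality(core).items(),
                 key=lambda kv: kv[1], reverse=True)
for person, score in ranking[:10]:
    print(person, round(score, 3))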
I went through the stories like this and coded a lot of them, and for each story I looked at: how did they use visualization, was it just for reporting or did it appear in the published story; where did they get the data, did they have to extract it from documents or could they just download it; did they run an algorithm; and did they use a graph database. Here's what we got. Visualizations appeared in most stories, but in about half of them they were also used for reporting, so you get an unpublished visualization like the one I just showed you. Actually running an algorithm was pretty rare; mostly people are analyzing these networks through visualization. And about two-thirds of the time they can't just download the graph data; they have to extract it from documents, like we saw with the interview transcripts, or from public records or something.
So then the question is: why are journalists not using graph analysis algorithms like centrality and modularity? What's the drawback here? Yeah, it's difficult to interpret the results, but why is that? Broadly, the answer is context: what the measures mean depends a lot on the details of the story, and by the time you've done all the reporting to get those details, you kind of know the answer anyway. That's pretty much what's happening. The other important reason is that you get heterogeneous networks. When you do this kind of link analysis you have links like worked for this person, went to school with, is family of, donated money to, and you can't really run centrality on that, because these algorithms all assume that every type of link is the same. So there's difficulty in interpreting this stuff. Also, you often don't have the whole network: if you're doing interviews and building out a network by talking to people, you can't run an algorithm that tells you who is central until you've already figured out that you need to talk to them. So that's why we don't see a lot of algorithmic graph analysis, but we do see a lot of visualizations.
One of the uses of putting things in a graph and using visual tools is that you can put different types of information together. This is a pretty classic one: we're combining three databases, who Rick Perry made calls to while he was running for Texas governor, who donated to him, and who got political appointments. When you start to merge several different types of data, graph analysis becomes very interesting and useful.
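As a tiny sketch of what that kind of merge can look like, here's one way to load several record types into a single graph with typed edges; the records below are hypothetical placeholders, not the actual data from the story.

import networkx as nx

# Three hypothetical record sets: phone calls, donations, appointments
calls        = [("Rick Perry", "Lobbyist L")]
donations    = [("Donor D", "Rick Perry")]
appointments = [("Rick Perry", "Appointee A")]

G = nx.MultiDiGraph()   # a multigraph, so one pair can be linked by several edge types

for src, dst in calls:
    G.add_edge(src, dst, relation="called")
for src, dst in donations:
    G.add_edge(src, dst, relation="donated_to")
for src, dst in appointments:
    G.add_edge(src, dst, relation="appointed")

# All three data sets now live in one graph, with the edge type preserved
print(list(G.edges(data="relation")))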
What I'm going to show you now is a video. It's actually kind of an old video, but it's nice because it's very self-contained and not too long, and it's about a story on the tissue trade that ICIJ did years ago. This is pre Panama Papers, but it's a really good example of how you do graph analysis. This is a story about companies that harvest human skin and bones and tendons and process them into medical implants. It's a booming market, giving human tissue a second life. You might say, what's wrong with that? In fact you might know someone, like we do, who's walking around right now with a corpse tendon that repaired a busted knee. This is a legitimate industry; for the most part these products can heal, even save lives. But I'm going to tell you two stories about how illegal tissue can feed that legal billion-dollar market. Here's the list of characters you'll be hearing about: a Florida company called RTI and its German subsidiary Tutogen, a Ukrainian morgue in Mykolaiv, a Russian coroner named Igor Aleshenko, and a former dental surgeon from Brooklyn, now a convicted felon, named Michael Mastromarino, who stole body parts, including from the corpse of the famous British commentator Alistair Cooke. One last point before we get into the guts of the story.
Our team looked at more than 200 companies. We talked to everyone from industry insiders and government officials to surgeons and convicted felons. We read through thousands of court documents, regulatory reports, corporate records, and internal company memos. Using powerful analytical software called Palantir, we analyzed data on imports, infections, and accident reports filed with the Food and Drug Administration, the US agency that oversees the human tissue trade. And we tried repeatedly to talk to the company that ended up being the focus of our investigation, but executives wouldn't talk to us or respond to written questions. So here's our story. There's this company called RTI Biologics, which started out as a nonprofit division of the University of Florida. Since breaking away it has become one of the world's largest publicly traded for-profit tissue processors. As we dug into this global trade we started to see a pattern: RTI and its German subsidiary Tutogen have repeatedly obtained tissue from suppliers that were later investigated for allegedly stealing human parts.
First, let's go to Ukraine. This February, a few months into our research, authorities raided a morgue in the southern city of Mykolaiv. We talked to investigators there; they suspect body parts were stolen from cadavers brought in for autopsy, and they told us signatures were forged to make it seem like families had given their consent. The Ukrainian investigators said tissue might have been taken illegally from as many as half the bodies that passed through the morgue. During the raid, police seized autopsy reports written in English, lab results apparently destined for Tutogen, and bottles of human tissue labeled Tutogen, made in Germany. But one thing really caught their eye: an envelope stuffed with cash labeled NIK-1. That's the US government's abbreviation for the Mykolaiv morgue; see, the morgue was registered as a tissue bank with the FDA. What's interesting is that the phone number listed for that morgue was identical to 24 other Ukrainian morgues also registered in the US, and what's more, when we called that number we reached an automated system for Tutogen, which makes medical implants out of human tissue. In early 2008 Tutogen merged with RTI; remember, that's the big American for-profit tissue processor. Back in the 90s, Tutogen got its tissue from suppliers in places like Estonia, Latvia, the Czech Republic, and Hungary, where tissue can be taken without consent unless a donor opts out while he's still alive.
Families complained, and in some places police launched investigations, but they didn't go anywhere. We got hold of internal records that show Tutogen has been getting tissue from Ukraine for more than a decade and supplying RTI. The company appears to work through a middleman there, a Russian coroner named Igor Aleshenko. Until the raid in February, Aleshenko was director of Bioimplant, a company owned by the Ukrainian Ministry of Health, which collected the tissue from regional morgues to send to Tutogen. Internal records show Tutogen had real problems doing business with Aleshenko. In a 2002 memo marked strictly confidential, with four exclamation points, Tutogen executives urged an exit strategy from Ukraine. They wrote that a middleman, believed to be Aleshenko, kept demanding more money, and they didn't know what happened to the money they sent him to pay the morgues. Despite these misgivings, Tutogen didn't pull out of Ukraine; instead the company expanded its regional network. Over the years, Ukrainian authorities have investigated two other morgues that supply Tutogen for allegedly stealing tissue, bullying families, and forging consents. No one has ever been convicted. As for Aleshenko, local news reports say he left Ukraine following the February raid; neither police nor health officials will tell us where he went. So where are we today?
RTI doesn't import its Ukrainian tissue through Germany anymore; the Ukrainian tissue bank Bioimplant is exporting directly to the United States. But foreign tissue is still a small part of RTI's supply chain. Like other big players in the industry, RTI actually gets most of its tissue in the US, but US law hasn't kept up with this rapidly evolving industry, and that brings us to our second story, here at home. RTI operates a nonprofit called RTI Donor Services, which directly recovers tissue from American cadavers. RTI has also contracted with tissue banks in 23 states. One of those suppliers was New Jersey based Biomedical Tissue Services. It was run by a former dental surgeon from Brooklyn, Michael Mastromarino. RTI started working with Mastromarino in 2002, but got nervous when staff complained that he was verbally abusive and had ties to organized crime. So RTI hired a law firm to run a background check, and here's what the firm advised: the good doctor has been on Santa's naughty list for quite some time; I would strongly encourage you not to do business with someone that has this kind of resume.
Instead, RTI drew up a new contract with Biomedical Tissue Services; in place of Mastromarino's name was that of his newly licensed off-site medical director. RTI continued to work with the company, and with Mastromarino as their main contact, until 2005. That's when the company found out what he had really been up to for the last three years. From funeral homes in the Northeast, Mastromarino stole body parts, some infected with cancer, hepatitis, or HIV, from more than 1,000 corpses. One source of tissue was the body of Alistair Cooke, the famous British broadcaster and host of Masterpiece Theatre. He died at 95 of lung cancer, but his death certificate was altered, and then his cancer-ridden tissue was sold for $11,000. There were massive recalls: five companies pulled back a total of 25,000 products made from the human tissue Mastromarino supplied, and police got on his case. He's now serving time at a maximum-security prison after pleading guilty to conspiracy and body theft, and the families of his victims filed a lawsuit against RTI alleging negligence; that case goes to trial this October in New York.
When we started looking through court files and started talking to people, we realized that neither the companies nor the FDA had ever tried to verify whether consent was actually given; no one cross-checked the files until after the industry was alerted to a pending criminal investigation. So what's the bottom line here, why should we care? Well, you should care because regulations are ineffective given the enormous profits that can be made. The industry is shockingly easy to get into: just fill out a one-page form on the FDA website and you're in business, according to those who have done it. You can harvest tissue yourself, you can distribute it on the open market, and you don't even have to be inspected; in fact, according to our analysis of FDA data, only about 40% of tissue banks in operation today show any record of being inspected by the agency. So why do we need this fancy software to figure out the story? Well, you can see how complicated all the networks can be. These companies and their operations stitch an intricate network across the globe, but those connections are buried in paperwork and datasets; ultimately we uploaded more than a million companies, people, documents, and events into the system. So to sum up: the human tissue trade is a perfect example of demand outstripping regulation. It's not surprising that some unscrupulous characters, sensing massive profits, may have latched on to an otherwise legitimate industry, and that's a fact the FDA and Congress have overlooked.
So that's an older example. A more recent example is the Panama Papers, for which there's lots of material online that talks about how graph analysis was used: they put everything into a Neo4j database and then visualized it with Linkurious, which is this tool. One of the things that's interesting about this is that there are a lot of specific challenges. So here's the idea: you throw everything into a graph database, and because a graph is a very flexible representation you can represent almost anything you want, so anything in a relational database you could put into a graph, and then you can merge the data sets, find paths, and do these visualizations.
There are a number of issues, though. One is that the structured data in the Panama Papers was only part of it, a big part but the minority; documents, emails, and PDFs were a much bigger part, so you have to figure out how to turn that stuff into graphs. Perhaps the most obvious way is to do entity recognition on the documents. However, commercial entity recognition has a recall of only about 70%, because it's tuned to favor precision; it doesn't want false positives. That means commercial tools are not going to pull out all of the entities, and so journalists have had to do a lot of work to tune entity recognition for the investigative workflow, because you would rather have a bunch of leads that don't pan out than miss something.
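Here's a minimal sketch of that entity-extraction step using spaCy and networkx; the document text, file name, and label choices are placeholders, and a real pipeline would add custom models or gazetteers to push recall up.

import spacy
import networkx as nx

# Hypothetical: one leaked document, already converted to plain text
text = "Payments were routed from Example Holdings Ltd to John Doe via an intermediary."

nlp = spacy.load("en_core_web_sm")   # off-the-shelf model; recall will be imperfect
doc = nlp(text)

G = nx.Graph()
doc_node = "doc:example_memo.txt"    # placeholder document id, kept for provenance
G.add_node(doc_node, kind="document")

# Add every person or organization mention as a node linked back to its source document
for ent in doc.ents:
    if ent.label_ in ("PERSON", "ORG"):
        G.add_node(ent.text, kind=ent.label_)
        G.add_edge(doc_node, ent.text, relation="mentioned_in")

print(G.edges(data=True))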
The other big problem is record linkage. If you have n data sets you might have n copies of the same entity, in this case a company, and here you can tell they're the same: look at the address, it's the same thing, and the names are just written differently. You can often have the computer find these, but when you do have the computer find records to link, you don't want it to actually collapse them down to one node, because it could be wrong.
So instead, the preferred technique is soft record linkage, which is: you add another edge. In this case the relation is called "has similar name and address as". You throw all these data sets into the graph and apply some similarity algorithm, and here machine learning actually works quite well for figuring out which records are the same entity, but you don't merge the nodes; you add an annotation that says we think they're the same, because later a human is going to have to check it. The reason for that is that you can't ever accuse somebody of wrongdoing based on the output of a model. Any time you're going to make a public claim of wrongdoing you have to check every step of it by hand; otherwise you are at risk of libel, and in particular of negligence.
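Here's a minimal sketch of that soft-linkage idea, using two hypothetical company records and a plain string-similarity score; the point is that the matched pair gets a "has similar name and address as" edge for a human to review rather than being merged automatically.

import difflib
import networkx as nx

# Hypothetical records that may describe the same company in two data sets
rec1 = ("ds1:Acme Trading S.A.", "Acme Trading S.A., 54 Calle Aquilino, Panama City")
rec2 = ("ds2:ACME TRADING SA", "ACME TRADING SA, 54 Calle Aquilino, Panama")

G = nx.Graph()
for node_id, raw in (rec1, rec2):
    G.add_node(node_id, raw=raw)

def similarity(a, b):
    # Simple character-level similarity; a real pipeline would use a trained matcher
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

score = similarity(rec1[1], rec2[1])
if score > 0.8:                      # threshold is a placeholder
    G.add_edge(rec1[0], rec2[0],
               relation="has_similar_name_and_address_as",
               score=round(score, 2))   # flagged for human review, never auto-merged

print(G.edges(data=True))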
The system that I would like to see for doing this exists in sort of prototype form at a number of organizations, including ICIJ and OCCRP; I'm going to show you in a minute a slide of what I wish we had, which tries to solve some of these problems. It's not as simple as just throwing everything into a database and then running a graph algorithm, for a variety of reasons. Record linkage is a major problem. Another is that the graphs reporters use are not data visualizations. Here's part of an investigation into Trump's business that we did here at this school a couple of years ago. This is a hand-built map, done in a program called Cmap; journalists have been doing this for decades. It's not a data visualization: these things are in a database, but we're not just plotting the result of a query, we're choosing which things are important and we're setting a layout. Obviously we're focused on Bayrock, which a lot of journalists have been looking into. Here's that Wall Street Journal story again, and you can actually step through this; this is all hand built.
What I'd like to see is a system that is based on the graph database, so if you run some query you get all of this stuff, but it lets you pick which pieces you want. I want it to allow you to build these maps (they're not data visualizations, which is why I'm calling them maps) by searching the database and adding one node at a time, which is kind of what Linkurious does; that's the Linkurious interface on the right. When you see these pictures from the Panama Papers, that's how they were built: they were hand built by expanding out nodes and people of interest. The system I'd ultimately like to see is this: you combine both structured and unstructured data, which goes through entity recognition to build a huge graph database, including provenance information and multiple copies when you have a thing in multiple sources; then you do record linkage; and from that you generate these individual maps, which are specific to the story you're investigating. The key thing here is that these maps are not data visualizations; they are hand built, with computer assistance for record linkage. And in particular, if you learn something interesting, you should be able to just use this interface, click on it, and say, I want to add a new node, and that then goes back into the central data store.
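As a minimal sketch of that build-a-map-one-node-at-a-time interaction, here's the idea with networkx standing in for the central graph store; all the entity names are hypothetical.

import networkx as nx

# Stand-in for the central graph database built from all the source data
store = nx.Graph()
store.add_edges_from([
    ("Company A", "Person X", {"relation": "director_of"}),
    ("Company A", "Company B", {"relation": "shareholder_of"}),
    ("Person X", "Person Y", {"relation": "family_of"}),
])

# The reporter's hand-built map for one specific story
story_map = nx.Graph()

def expand(node):
    """Copy one node and its immediate neighbors from the store into the story map."""
    story_map.add_node(node, **store.nodes[node])
    for neighbor in store.neighbors(node):
        story_map.add_edge(node, neighbor, **store.edges[node, neighbor])

# The reporter clicks on "Company A" and expands outward from it
expand("Company A")
print(story_map.edges(data=True))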
People are starting to build systems like this; I know of at least three efforts to build something that looks like this. This really is the future of high-end cross-border investigative journalism: these kinds of pipelines and graph data stores.
Okay, your homework will be up shortly. You are going to run multiple centrality metrics on the Les Misérables data set. How many of you know the story of Les Misérables, or have seen the musical or the movie? Yeah; if you haven't, I'm afraid you'll just have to Wikipedia the plot. It's a graph of all the characters and how many chapters they co-occur in. You're going to load that up into Gephi and run different centrality algorithms, and what I'm looking for is an analysis of how well the different centrality algorithms capture the plot of the story. Again, the fundamental questions of this course are always about the relationship between the mathematics and the world, and this is exactly that problem: tell me what the centrality metrics tell you and whether that matches the story.
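If you want to sanity-check your Gephi results in code, here's a minimal sketch using networkx, which ships a copy of the Les Misérables co-occurrence graph; the particular metrics chosen here are just examples.

import networkx as nx

# Character co-occurrence network from Les Misérables (bundled with networkx)
G = nx.les_miserables_graph()

metrics = {
    "degree": nx.degree_centrality(G),
    "betweenness": nx.betweenness_centrality(G),
    "eigenvector": nx.eigenvector_centrality(G, max_iter=1000),
}

# Print the top five characters under each metric, to compare against the plot
for name, scores in metrics.items():
    top = sorted(scores, key=scores.get, reverse=True)[:5]
    print(name, "->", top)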
All right, I will see you all next week.