Introduction to OpenRefine

hi everyone so this week we're going to
shift over from Excel to a tool called
open refine and open refine is a really
valuable tool for dealing with large
data sets in particular and from
generating really quick for both dealing
with them cleaning them up and
generating kind of quick top-level
insights
so while Excel is good when we kind of
want to browse data and especially do
some calculations on it open we're fine
is gonna give us a lot more flexibility
when it comes to dealing with large data
sets and also messy data sets in fact
their tagline is a power tool for messy
data and so for to look at this right
what we're gonna do is see how so last
week I sort of did a little bit of magic
and rather than trying to open our
entire September city by data file in
Excel I did this instead with I sort of
did some magic in the background and
created a 10,000 bro file this week
we're gonna I'm here to show you exactly
what that quote-unquote magic is and
that magic is openrefine
which gives us a lot of control over how
much of a data set we open which is a
key issue when we are dealing with large
data sets and just also provides us a
lot of other really valuable tools so
I'm gonna go ahead and open this so you
can see this is just the entire data set
and so openrefine runs actually this is
the logo here this little blue diamond
this can be installed on any operating
system it's free what you'll notice I
think is that it actually works in a
browser window okay so it's not
happening on the internet it is
happening locally on your computer but
the interface is a browser window so
what this means is that we're not going
to really have the keyboard shortcuts
that we have in something like Excel
we're not gonna be able to sort of
copy/paste it's a different interaction
pattern but I think hopefully you'll see
quickly how powerful it can be so this
is the it defaults to kind of open
project I don't how many existing
projects here so I go to create project
and this is gonna let me choose data
that I want to work with so I'm gonna go
ahead and choose this file that I just
downloaded and you notice that I can
choose the entire file even though that
file is quite large in fact if I look at
it let's see look at this I see that
this is yeah 330 megabytes roughly okay
so it's a pretty big file something
again that would probably crash Excel so
but I can choose my file and even though
it says again it says you know this as
uploading data is not actually uploading
it to a remote location it's just that's
the language that it uses it's it's
created what we call a local web server
here so it's yeah just sort of strange
terminology but in any case one of the
important things as you can see that it
makes a guess about what kind of data
this is it is indeed CSV data and we can
see that it's giving us this nice
preview of what the data looks like
given the existing settings now I'm
gonna call this city bike first hundred
okay so first I want to make sure that I
give it a unique file name okay because
I don't necessarily want this will allow
me to create multiple projects that have
the same name which can get very very
confusing so I want to make sure that I
have created a unique file name but the
real key here is that rather than rather
than just opening the entire data set I
can actually check this little box and
say low and load at most in this case I
can be pretty I can be pretty
substantial with this I'm gonna load a
hundred thousand rows of data you notice
also too that by default it assumes that
there is a column header in Lion which
of course there is this is also nice
because it means that that that row
count on the Left isn't offset by one
which was our experience in Excel where
it really didn't have a way to to deal
with the idea of a header row that
wasn't part of the data set so 100
thousand rows is should be is gonna be
fine for for this volume of data might
take a second on your depending on the
machine that you're using and I say
create project and it's gonna get me
this message but then it's gonna open it
up anyway
and so you'll notice I think right away
a few great a few things about this so
first of all what I like about this is
that it automatically shows me that I
have 100
Rose now considering that I told it that
I wanted to open a hundred thousand Rose
that might not seem very interesting as
we start manipulating and looking at
this data in different ways it is going
to become more useful but you know even
if you think about a lot of times when
you download datasets the only way you
have note have to tell how many rows it
has I mean you can look at the file size
to see about how big the file is but the
only way to know how many actual rows it
is in Excel is to like open it up and
scroll to the bottom which is pretty
frustrating right I mean it's a sort of
a stupid task to have to do so the nice
thing is that open we're fine it always
gives us this count right at the top um
you'll notice however that it's it'll
only show me up to 50 rows at a time at
the maximum right and as default is
actually much fewer than that I can
browse through it using these first next
and last but as you might imagine
especially if I have a hundred thousand
rows even 50 rows at a time I'm not
going to be able to browse this data in
that way and that's just fine because
the mechanisms that open refine lets us
use for examining data are much more
robust than just us having to scan
through in a scroll through and look at
things so the first and probably the key
the sort of key function for open refine
is something called a facet and you can
think of this a little bit like a filter
for unique values okay so I'm gonna do
this example on a start station ID and
so in order to interact with a column of
data I have to select this little I
click on this little arrow that's
pointing down you can see the first
option is facet now here I'm gonna
choose text facet and that's because
right now all of the data in my columns
is is text openrefine
doesn't do a lot of automatic
transformation it doesn't guess what
what data type is in a particular field
that's actually an advantage in my
opinion sometimes when I open things in
Excel it makes changes to like the date
format for example that I find really
frustrating so I can just go ahead and
write away on this I can do a text facet
and you'll see that this window comes up
on the left this offers us a lot of
really valuable information with very
little effort as I'm sure you can see
now right now this is sorted by name I
don't
want that I'm gonna sort it by count
instead but this we already have quite a
lot of metadata about this right in this
first hundred thousand rows we now know
that there are six hundred and eight
different unique start station IDs are
present we don't know if that's all of
the start stations in the city bike
system but we know that's the number in
this data set when I sort when I do that
sort by count I get a descending list of
how frequently every one of these values
appears so I know exactly how many rows
have start station ID number 5:19 in
them if I wanted to examine this more
closely I can actually go ahead and
select that choices line link and it's
going to actually give me this as a
tab-delimited file with those values so
if I want to do some kind of analysis
just on the distribution of or the
number of times that start stations that
different start stations appear I could
do that right off the bat um the other
thing that this lets us do is very
easily start to kind of dig into our
data by selecting by creating facets and
then selecting subsets of those facets
so for example if I just wanted to
examine all of the values all of the
rows where the start station ID is 519 I
can just select that we now see as we
expect that there's 700 750 - maths
750 - matching rows if I wanted to look
at the top three right I could go here
and I can actually just choose include
on this side right and you can see by
the orange highlight which ones I have
included again it's giving me a count of
the matching rows which is really handy
and so this gives me a really nice way
to just very quickly get tough a
top-line understanding of what's in my
data set now in there are more types of
text facet than just there's more types
of facets than just text facets in order
to make use of them we need to actually
tell openrefine what type of data how it
should be treating the data in our
columns meaning to be a little bit more
specific so in this case trip duration
we know is in seconds right I want that
I wanted to treat that as a number
because it is start station ID obviously
is also an
technically it's a set of digits but
it's really more of an identifier it's
really more like a name so in order to
convert this to numbers I'm gonna again
select the downward facing triangle on
that column and I'm gonna say I actually
say edit cell so even though I'm
performing this transform on an entire
column it's considered a cell level
transform and if I just come to common
transforms here I can say to number and
it's gonna go ahead and do all that now
one thing you want to be careful of here
is that openrefine is what in
programming we might we might call I
think of a stateful okay which is to say
that I've just performed this transform
and you can tell that something has
happened because now this trip duration
these trip durations are showing up in
green however openrefine is very
respectful of the subsets that we have
selected which means that for example if
I were to export this data right now as
a CSV right it would actually only be
opening it would only be saving out the
2,120 matching rows that are currently
selected this is actually really handy
because I find that often getting
subsets of an Excel spreadsheet out of
Excel in a useful way is pretty
difficult or just tedious in this case
it's very straightforward right so if I
just want to be working with a subset of
the data I can quickly select that
subset export it as a CSV file open it
as a another open refined project and
work with my subset quite directly what
that also means however is that because
I had in this case I had these specific
start stations selected as you will see
as I start to remove this oops okay what
you see is that now that now that it's
just sort of a regular sorted order
again and I don't have those sub
selections the green highlight has
disappeared from trip duration now this
doesn't mean that for the ones that I
chose it wasn't converted to a number it
certainly was but because I applied that
transform when only certain cells were
selected it was only applied to those
selected
so you know the short advice here is
that anything that you want to before
you apply any transform that you want to
be sure applies to an entire column of
your entire data set make sure that you
don't have anything sub selected in the
left-hand pane so no harm is done by
this it's just maybe not what you would
expect given how other initially given
how other programs operate so it's very
easy for me to come back to edit cells
and I have nothing selected convert them
to number and now we see this no matter
what I have selected is going to be a
number I'm also gonna do another
transform where I convert these to where
I convert the start start time and stop
time to dates whoops
or I try to do that yep and the reason
why I'm doing this is because it's going
to allow me to do different kinds of
facets specifically and perhaps
unsurprisingly it's going to allow me to
do a numeric facet on the number the
number entries and it's gonna allow me
to do a so called time line facet on
these on the date data types now some
interesting things here to notice first
of all what we can see here we talked a
little bit about standard distributions
and skewed distributions this is clearly
a very skewed distribution the vast
majority of our trip duration values are
between one and a hundred thousand
seconds we have a few that are crazy
outliers out here right you can barely
see the highlight but I can go ahead and
I can narrow in on those by just
adjusting these window ranges and if you
notice down here it's pretty hard to see
because a very tiny little blue bar so
the thing to know is that on timeline
facets it always shows you a picture of
the entire data set and then it
highlights your sub selection the sub
selection in your data set with a
slightly darker blue
this means that often it's very
difficult with a time line facet where
you have where you're trying to look at
the pattern in a relatively small number
of rows you probably want to actually
export that data and then
a timeline facet on just the subset that
you're interested in if you want to look
at that pattern it's gonna be very hard
to perceive against the larger patterns
backdrop here but so everything is being
put here I found one matching row this
is a ride that apparently lasted from
September second until September 28th
somebody failed to return their Citi
bike that entire entire time
might be something else going on here it
actually says the end stationing is the
New York city-based
New York City Pike Depot I think in
Gowanus so I'd be interesting to know
what that's about but again it's just to
say that this gives me a lot of ways to
examine my data and sort of see what is
you know get get a feel for what's
happening so I could start to make some
I could start to make some some claims
and understandings about what patterns
exist we can see that there's a pattern
in in the way that bikes are or taken
out also I can do things like facet on
say gender right which I can treat as a
text facet if I want understand within a
subset how many people are men women or
didn't identify how many people have
subscribers and customers so hopefully
you can see how this helps lets us start
generating some insights about our data
when we come back we're going to look at
some other features of open refine that
will let us even do further analysis

An introduction to OpenRefine, an open source tool for manipulating large, messy data sets. Facets and data transformations

Contributor: Susan McGregor

Video 1 of 3