More OpenRefine: Extract, Apply, Cluster

Hi everyone. I just want to take these last few minutes to go over a few more features of OpenRefine that can be really useful when we're dealing with data. We've gone over faceting by text, numbers, and dates, and we've looked at a simple form of regular expressions, which we'll elaborate on next week, and how they can help us parse data in ways that would be very difficult and time-consuming to do by hand. The final thing I want to demonstrate is some of the reproducibility features of OpenRefine. This is actually where OpenRefine has some of its greatest value, although, again, I think it's a very good program for many reasons.
want to point out here is that you may
have noticed that as we are going along
this there's this account in the
background called undo redo that has
been going up at certain points now
first of all we should emphasize when
that count increases and when it doesn't
so you may notice that as I remove
facets or for example create facets
right that actually that undo redo count
doesn't doesn't change it all right even
if I select certain things include them
right it's not changing and the reason
why it's not changing is because when we
do faceting all we're doing is we're
asking for a new view of the data we're
not actually transforming the underlying
data in any meaningful way and so as a
result we don't see any increase in the
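Conceptually, a text facet is just a frequency count over a column: a view computed from the data, not a change to it. A minimal sketch in Python, with made-up station values for illustration:

```python
from collections import Counter

# A text facet is essentially a read-only frequency view of a column:
# it counts values without modifying the underlying rows.
station_column = [
    "Central Park West",
    "Central Park W",      # inconsistent spelling of the same station
    "Broadway & W 58 St",
    "Central Park West",
]

facet = Counter(station_column)
print(facet.most_common())   # the column itself is untouched
```

Removing the facet just discards the summary; the rows never changed, which is why no history entry is recorded.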
What the Undo / Redo pane does is automatically generate a log of all of the meaningful transformations we have applied to our data set. For example, we see that item 0 is creating the project; then the text transforms, first when I converted a subset of the trip duration rows into numbers, and then when I completed it on the rest of the data; my date transforms; the mass edit, which was when I updated "Central Park W" to "Central Park West"; and now we see that the creation of our "Has CPW" column is another entry.
Why is this important? Well, first of all, one of the great things is that instead of having to rely on an undo command like Command-Z when we feel we've quote-unquote made a mistake, we can actually step back through our actions here and return the data set to a previous state. If we're not happy with what we did, we can move backward through the history, or forward again, as we like. One thing to keep in mind is that if I were to revert the data to an earlier state and then come back and start doing new work on it, I would lose those other five steps. So it really does let us move back in time, which is pretty great when we want to explore without worrying that the changes we made to our data set were quote-unquote the wrong thing.
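That branching behavior can be modeled with a tiny undo stack. This is an illustration of the concept, not OpenRefine's implementation:

```python
class History:
    """Minimal undo/redo model: stepping back and then doing new
    work discards the steps ahead of the current position."""
    def __init__(self):
        self.steps = []
        self.pos = 0                 # number of currently applied steps

    def record(self, description):
        self.steps = self.steps[: self.pos]   # branching drops the redo steps
        self.steps.append(description)
        self.pos += 1

    def revert_to(self, n):
        self.pos = n                 # view the data as of step n

h = History()
for s in ["create project", "to number", "to date", "mass edit", "add column"]:
    h.record(s)
h.revert_to(2)
h.record("different edit")           # the three later steps are now gone
print(h.steps)
```

After reverting to step 2 and recording new work, the history keeps only the first two steps plus the new one, which is exactly the "I would lose those other steps" behavior described above.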
Even more useful, you may notice that there's something here called Extract and Apply. If I click Extract, you will see that each of the steps has been automatically selected, and hopefully you will recognize the data format on the right, which of course is JSON. This is a JSON expression of all of the steps that I have performed on my data set so far. Why is this important? Because I can copy it out of here and reuse it.
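For a sense of what the extracted history looks like: it's a JSON array of operation objects. The sketch below parses one illustrative entry; the column name, expression, and description here are assumptions for the example, not the exact output of this project:

```python
import json

# Illustrative sketch of one entry in an OpenRefine extracted operation
# history; field values like "tripduration" are assumed for this example.
history = json.loads("""
[
  {
    "op": "core/text-transform",
    "engineConfig": {"facets": [], "mode": "row-based"},
    "columnName": "tripduration",
    "expression": "value.toNumber()",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column tripduration"
  }
]
""")

# Each saved step records its operation type and the column it touched.
print(history[0]["op"], "->", history[0]["columnName"])
```

Because every step names its operation, column, and expression, the whole list can be replayed against another project that has the same columns.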
Now let's say that I have done all this work on the first hundred thousand rows, and I've created my "Has CPW" column, but that's still only a subset of the data. I want to be able to do the same thing on the next hundred thousand rows. The way I go about that is quite simple. I come back here (you can see this is the original project that I created), go to Create Project, choose files, select my main data source again, and click Next. This just generates the preview, remember, and what I'm going to do is change my selections on the preview page so that I get not the first hundred thousand rows but the second hundred thousand. I'll once again edit the project name. Now, how do I make sure I get the second hundred thousand rows? You might have noticed this the first time: we have not just the "load at most" option but also "discard initial" rows. I've already loaded the first hundred thousand rows, and happily, because of the way OpenRefine handles headers, I don't need to worry about subtracting one or adding one. So I can say: discard the initial one hundred thousand rows, and load at most one hundred thousand rows. Logically, then, I am capturing the second hundred thousand rows of my data set. I can update the preview if I want, and create the project.
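The "discard initial" / "load at most" arithmetic is the same as slicing the file after its header row. A rough equivalent in plain Python (the file path and batch sizes are placeholders):

```python
import csv
from itertools import islice

def load_slice(path, discard=100_000, at_most=100_000):
    """Read the header row, skip `discard` data rows, then return up to
    `at_most` rows -- mirroring OpenRefine's import options."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)   # header handled separately, so no off-by-one
        rows = list(islice(reader, discard, discard + at_most))
    return header, rows
```

Calling `load_slice("trips.csv")` with the defaults yields rows 100,000 through 199,999 of the data, i.e. the second hundred thousand rows.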
And now here's the really powerful thing. You can see that, just as it was initially with the first part of the data set, the trip duration is text, and the start time and stop time are all text too. Sure, it wouldn't be too complicated for me to come here, do an Edit column, and transform those cells, but how much better that I can come to Undo / Redo, choose Apply, paste in the JSON from my first project, and say Perform Operations. What you'll see is that in about three seconds, everything I did in my original file that was meaningful, that had an impact on the data, has been performed. Not only do I have the format transformations, but it has also automatically applied my regular expression and created a new column with true and false values in it.
If I wanted to see how many entries are true in this case, I can go ahead and look, though of course that's only meaningful if I also do a text facet on the start station name. I see that, maybe not surprisingly, there were some matching entries here too. Hopefully what you can see from this is that it gives us enormous power when we need to work with large data sets, or repeat analyses over time. For example, it's not uncommon that there'll be a set of operations you apply to a data set, maybe a government data set, something that's released on a regular basis. By doing this work in OpenRefine and having that automatically generated history, you can easily apply it to the next version of that data without having to do it all manually again. This is something you can do in Excel, but it's a bit more effortful, I would say substantially more effortful, actually, so it's really great that it's just been built in here.
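The same idea in plain code: once the cleaning steps are captured in one place, each new release of the data goes through an identical pipeline. A minimal sketch, where the column names ("tripduration", "start_station") and the steps themselves are invented for illustration:

```python
# A saved cleaning history expressed as a reusable list of steps.
def duration_to_number(row):
    row["tripduration"] = int(row["tripduration"])
    return row

def flag_cpw(row):
    # Hypothetical step: flag rows whose station is Central Park West.
    row["has_cpw"] = row["start_station"] in ("Central Park West",
                                              "Central Park W")
    return row

PIPELINE = [duration_to_number, flag_cpw]

def clean(rows, pipeline=PIPELINE):
    """Apply every saved step, in order, to a fresh batch of data."""
    out = []
    for row in rows:
        row = dict(row)              # don't mutate the caller's data
        for step in pipeline:
            row = step(row)
        out.append(row)
    return out

# Next month's release runs through the identical steps:
batch = [{"tripduration": "372", "start_station": "Central Park W"}]
print(clean(batch))
```

Extract/Apply gives you this for free: the JSON history is the pipeline, and Apply is the `clean()` call.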
Finally, one other thing I want to talk about is an option under Edit cells called Cluster and edit. (I have a sub-selection here, so let me get rid of that first.) Cluster and edit is not so applicable to this particular data set, because I would describe this data set as actually being very clean; basically, there aren't a lot of misspellings or anything like that. Without getting too far into exactly how it works, it uses algorithms that try to find things that sound the same, things that have similar phonetic representations. We can go ahead and work through these; Metaphone is probably going to do better for us here. It is using word sounds to cluster these values. Now, again, we believe these are all distinct; other than something like the "Central Park West" / "Central Park W" pair that we've highlighted, we wouldn't likely want to edit any of these to be the same. However, working through this is very valuable when you are dealing with data where the entries have been typed in by hand. My favorite example of this is always the stop-and-frisk data sets, where there is an unbelievable number of ways that marijuana is spelled. Having these algorithms find things that sound the same but aren't spelled precisely the same way is really valuable. You can choose to merge and give a particular cell value, so that you make sure everything that should be classified the same way is classified the same way.
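To make the clustering idea concrete: OpenRefine's simplest method, "fingerprint" key collision, normalizes case, punctuation, and word order so that near-duplicate spellings collide on the same key. (Metaphone, mentioned above, compares sounds instead; this sketch shows the simpler keying idea, not OpenRefine's actual code.)

```python
import re
from collections import defaultdict

def fingerprint(value):
    """Normalize a string the way key-collision clustering does:
    trim, lowercase, strip punctuation, then sort the unique tokens."""
    value = value.strip().lower()
    value = re.sub(r"[^\w\s]", "", value)
    return " ".join(sorted(set(value.split())))

def cluster(values):
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    # Only keys with more than one distinct spelling need review.
    return {k: vs for k, vs in groups.items() if len(set(vs)) > 1}

entries = ["MARIJUANA", "Marijuana,", "marijuana", "Central Park West"]
print(cluster(entries))
```

The three spellings of "marijuana" collide on one fingerprint and are offered as a cluster to merge, while the unique station name is left alone.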
So that's the quick overview of OpenRefine. Again, we can export to a bunch of different formats here, and for your assignment we'll be looking at doing some of these analyses on your own subset of the data. Other than that, I hope you all have a great weekend, and I look forward to seeing you next week.

An overview of the Extract and Apply commands for stepping backward and forward through our work, as well as applying it to new or revised data sets, and a brief look at the potential of clustering and mass editing for cleaning up data.

Contributor: Susan McGregor

Video 3 of 3