Regex on OpenRefine

hi everyone so hopefully you got a sense
from the last video that the fascinating
that we're able to do in open our find
gives us a really useful way to start
generating some quick insights about
even a large data set so in this case
for example I might say well across this
hundred thousand data across these
hundred thousand rows of data I want to
see for example how prevalent I'm going
to go ahead and come back to selecting
all of my data I want to say something
I'm gonna see something about I don't
know how how come in subscribers versus
customers are so happily there's
actually some metadata on the landing
page about um if you come back here
about what these things mean so a
customer is somebody who has a twenty
four three-day pass subscriber has an
annual has an annual membership so one
thing we can see again right off the bat
here is that subscribers compose eighty
percent over 80 percent of the rides
taken in this time period right in this
first hundred thousand rows and I
couldn't facet them before and I could
say okay you know among subscribers what
does say the gender breakdown right I
can still do a text facet on this and
say okay interestingly you know of
subscribers a roughly sixty percent of
them are men twenty percent of them are
women there's a small percentage that
haven't identified so you can see how we
can start building narratives and
understandings if I wanted to say look
at them from this particular start
station ID right maybe I want to look at
this by a neighborhood or something I
could go ahead and make a selection
there and see how all of these things
are impacted right so as I choose
subsets of as I choose subsets of
information what I what I get is
actually the narrowed count the
combination of all of these things and I
can start to make you know generate some
questions and hypotheses about this data
set so that's one way that we can
manipulate this data there's another
really important thing that
of data manipulation that openrefine
supports and that's called those are
called regular expressions regular
expressions are a kind of pattern
matching tool that allows us to look for
string pattern so patterns of characters
whether they're letters or numbers
inside another a sort of a cell a given
cell of data now to make us a little bit
more concrete we're gonna do some we're
gonna actually use a regular expression
on start station name and this is kind
of a hypothetical like the things we
might want to find out about these start
station names so um I like to use as an
example what if we had the question of
how many stations are there how many
city bike stations are there on Central
Park West well at first this might seem
like an obvious question to answer
because I can just scroll down here and
because this is alphabetized I can say
well you know I can see all the Central
Park quests here but what this doesn't
tell me is whether Central Park West
appears in in the second half of the
descriptive location of these start
stations right so I see Central Park
West this one is actually a little bit
different one of the nice things that
open or fine lets me do is it helps me
it allows me gives me several tools for
helping clean up my data and make it
more uniform I can actually just choose
edit here and I can make just Central
Park West and it will automatically
update all whatever 200 odd rows that
was so that now I have a uniform pattern
for Central Park West however if I
happen to know because I've inspected
this data set before there are other
instances where Central Park West
appears and I think it is there we go
West 106 street in Central Park West now
the question here is how would I locate
all rows that contain Central Park West
without having to just look through all
the values that is the kind of task that
a regular expression is very very good
for and in fact there isn't there isn't
another type of sort of technology or
program that
does what regular expressions do regular
expressions actually exist across pretty
much all programming languages and there
and while they're not supported in
Microsoft Excel they are supported and
open refine this is a really powerful
tool there are a little tricky to kind
of get used to and so we're gonna show
some examples here of how you can do a
basic one like finding all of the cells
that have Central Park West in them now
of course you could do this as a you
know using like a find right I mean I
could just do a find here and do that
however that only works if I have
relatively few instances so let's say
that I had something maybe Central Park
West appeared 50 times
I wouldn't very well want to go you know
sort of doing a find through this list
and manually including each one right so
the point here is that we're looking for
the other important point about regular
expressions is a scale very well we can
set up a regular expression that will
operate and it will do this search and
identifying of these fields for us in a
matter of seconds okay so how do we
actually go about doing this um the way
that I'm going to test my regular
expression is what I want to do the key
thing that I wanted to know about my
regular expression as I'm constructing
it is whether it is successfully
discriminating between content that I
want and content that I don't which is
to say I want to check that it's not
just capturing things that have Central
Park West in them but ensure that it is
also excluding things that don't so the
way that I'm gonna do this is I'm gonna
actually select a couple of entries here
that I'm gonna use as sort of my test
bed so that I can see as I'm
constructing I can do kind of a
real-time test of my regular expression
regular expressions are as I said
relative they're they're kind of tricky
and so you don't generally want to be
writing them blind like you want to see
what the results are and then you can
tweak them as you go so in order to do
this I'm going to now go to edit column
and OnStar station name because that's
what I'm operating on and I'm gonna say
add column based on this column and what
we can see here which is handy is it in
little preview window Oh am I getting no
I'm only again oh yeah no there's one
entry okay so I'm getting two different
and types of entry like values right
that have Central Park West and West
something right so I have 72nd Street 76
Street and 100 and 2nd Street so I can
check as I'm going and make sure that
I'm capturing both of those right not
just one or the other
but then I have something here that has
Central Park north and I want to make
sure that I'm excluding that right so I
can be checking in real time to make
sure that this is operating as I as I
like so I'm gonna call this has CP CP W
tests very important here that I create
a meaningful kind of descriptive column
header and the first thing that I'm
going to do if you notice is that value
right is a placeholder here that means
the contents of the cell right so by
default I'm saying I you know this is
the current column this is the column
that I'm this is the original column
this is the column that I'm currently
constructing so right now if it just
says value is just going to duplicate
the values so I want to ask it to do
some work on this so I'm going to use a
function called value dot match and this
is going to be a way to say look for the
pattern that I'm about to describe in
the original cell and basically tell me
if you find it now we can do this in a
very literal sense which is to say that
I can actually take this and I can wrap
my content in quotes just the same way I
would for example if you wanted to
search for a specific phrase and Google
you would wrap it in quotation marks to
keep it all together so what you can see
is that if I put in you know Central
Park West and West 76th Street as a
whole then it does the way that it shows
that it has found it is it gives these
little square brackets okay like a list
like an empty list if it doesn't find it
then it gives us a null value and null
means nothing was found and we can again
see that it doesn't find it on West
women second so this is working fine
except that if all that was all
wanted I would just use a text facet now
at first I might say okay well I could
just look for Central Park West but the
problem is that isn't a precise match
for what I'm looking for and this is I
I'm looking for a piece that is Central
Park West but also might have other
things and this is exactly where my
regular expression comes in and so the
first thing I need to do is I'm going to
signal to openrefine that I want it to
interpret the things that I want it to
be matching on the basis of a regular
expression rather than a regular string
and the way that I send that signal is I
wrap my I wrap my expression in forward
slashes that is arbitrary it's just how
openrefine decides to do it but I still
want to treat Central Park West as a
group because it's important to me that
I find all of those words and that I
find them in that order and so to do
that I'm going to go back to sort of my
master editions which is to put round
parentheses around them however I'm not
getting a match yet and the reason for
that is that regular expressions need to
account for every single item that
appears in this value every single
character that appears in a given in a
given value so the problem here is that
while this value for example begins with
Central Park West as the regular
expression gets to the end I always
think of this as a turtle on a number
line but is the regular fishing net city
and it's going to say okay I found
Central Park West great and then it's
gonna look at value and if you say well
wait there's a space here and my regular
expression doesn't leave any room for
spaces right it hasn't given it hasn't
given that little turtle right a way to
move forward on a space it's just kind
of stopped so in order to do that in
this case I'm gonna take a very very
general approach and I'm gonna do two
things so first I'm going to say look I
don't care what comes after Central Park
West it can be any character and the way
that I represent any character in in
this regular expression is with a period
a period means anything you want space
number letter whatever but this still
isn't working because I have to say how
many of them there are allowed to be so
I could kind of just like keep adding
characters maybe until I've added enough
that I've accounted for the West 76th
Street so that works except well it
doesn't capture 102nd Street because
that has one character more so that's
not a very efficient way of doing this
what I can do when I really don't care
how many characters what characters
appear or how many of them there are is
I can use an asterisk and what that
means is you can give me and anything
can come after this or and this is
important nothing can come after this
the asterisk means repeat whatever came
right before it zero or more times so it
doesn't need to appear but it can appear
as many times as you like
so that's great I now see that I am I am
I seem to be doing most of what I hope
to which is I'm capturing both of my
Central Park West type values I am
ignoring my Central Park North value
however if we think back to the other
example that I found
we knew that Central Park West was the
latter part of the station location
description in precisely the case that
we wanted to find right and so in this
case what I'm gonna do is I'm gonna put
a dot and an asterisk at the beginning
as well so that it will hopefully
capture those things that start with
something else but have Central Park
West at the other end now this works
fairly well except that it's not very
useful for me to have this Central Park
West value as the content of the new of
the new column what I really want is
something that says that answers the
question of does it have CPW right does
it have this value and so I want I'm
going to use a built-in function that
will convert my results to essentially a
true/false and happily this is a very
good function because it's functioning
is very descriptive it is is not null
okay and I have to wrap so you can see
that I've got a matching set of
parentheses for is not null I've got a
matching set of parentheses for value
dot match and I've got a matching set of
parentheses whoops for Central Park West
right so all of my parentheses are
balanced well you can see that this does
is adding that is not null does exactly
that the values that were null up here
is false and the values that were
anything else right up here is true this
is also important because we can have
regular expressions that match on more
than one thing that find for example a
letter or word combination or another
letter or word combination or two in
conjunction and so having the true/false
is a very simplified way of presenting
that so I could do this what I'm gonna
do here actually is instead of running
this right I've now tested out my
regular expression instead of running
this I'm actually going to close this
out I've copied this to the clipboard
and I'm gonna come back here and I'm
going to deselect my test values right
because I want to apply this regular
expression to the entire dataset and so
I'm going to now come back here and say
edit column add column based on this
column and paste in my regular
expression I'll now call it has cbw I
click OK and now I can do a text facet
and if I come down here I can select the
ones that match on true and I can see
that I have in fact captured not just
one but two other start stations that
have Central Park West in them so again
this is a relatively simplistic example
we'll look at some more complex ones
next week but this is a really really
powerful technology and great thing to
get comfortable with going forward so
we'll come back and look at another
couple of features of open refine

An introduction to basic regular expressions in OpenRefine

Contributor: Susan McGregor

Video 2 of 3