More OpenRefine and Regex
Transcript
hi everyone so now that you've had some
chance to some time to get more familiar
with open refine I want it to just add a
little bit at a few more examples of
things that can be done with this
program there's a useful for you get
analysis especially with large data sets
and also to look at some of the more
enhanced elaborated ways that we can use
regular expressions so I'm actually
gonna work with one of our existing
projects this is the first hundred
thousand rows that we looked at in class
the other day and I want to highlight a
couple of things here so one of them is
I want to come back to the issue of
facets so hopefully you all are pretty
familiar now with already with this idea
of the text bassett right very simple
way to select all of the occurrences of
a particular value within within the
data set and I just want to point out
one aspect of this we know for example
that we can sort by name right which is
not to meaningful for us and we're
supporting when we're sorting by numbers
although it was pointed out in one of
the classes that these aren't in what we
would consider numerical order and the
reason why that is is because as
indicated by all of the black text here
and the fact that we didn't ask it to
parse by data type on the way in
actually on the original screen when we
imported this data right now for
everything except for our false values
here we can see as we scan across it's
actually interpreting all of these
everything that this are constructed has
CPW columnist it is interpreting as
strings okay what that means is that
it's seeing numbers not as numerical
values but actually just as characters
right and so that's why we see something
that looks strange at first right when
we expect this to be sorted by by
quote-unquote name right where we see
like okay 2000 but then we see that for
example 212 is coming after 2023 full of
course that's because in string
comparison 20 comes before 21 right so
if we wanted to change that we would
actually need to change the data type in
this in this column which we can do by
going here to edit cells and under
common transformation we're going to see
um an option for two number right and
that's not going to make much difference
to us here other than to sort when we
sort by quote-unquote name to perhaps I
think hopefully sort them in the way
that we expect but let's find out so if
i do a numeric facet here so this is dr.
sort of a preview it actually if we have
a numerical data and it is set up as
numbers what we get is actually a free
histogram from from open refine and we
can actually scale is now there's
nothing we can't export this it's good
as a quick overview especially if we
have you know a sort of isolated data it
can give us a sense of the shape of the
data so if you have already collected
and cleaned a bunch of data and you kind
of want to see what the shape of things
matching that is it's useful to do you
know you can see it's sort of
interpreting these things weirdly it's
not 2900 I can see you know most of my
values are between 0 and six hundred and
you see there's actually updating the
matching row count here so I can kind of
get a sense of that as I go but the
thing that I actually wanted to let's
see if I do a text facet here I think
I'm still gonna get let's see what kind
of sorting I get so I still kind of get
the unexpected sorting here because I
told it to facet treating this as text
so that's another thing to be aware of
is it we can probably can adapt the type
of facet that we do to to reflect the
way we want the data to be treated in
that moment even if the column data
format has been turned to something else
right so in this case I converted these
two numbers but when I do a text gasset
it will treat them as text regardless
the other thing that I want to point out
here is this 573 choices so we referred
to this as a way of indicating you know
how many different start station ids
there are for example in this data set
if I click on this what I actually get
is a tab separated list of values and
the counts so this is something where I
could actually copy this out throw it
into a text edit file save it as a txt
and then use this for example to
generate a history
and if I wanted something a little bit
more if I wanted to be able to do that
and something like XL right so actually
used to support a way to do this sort of
mini histogram on this site of thing
right in in open or fine I don't it
doesn't seem like they do that anymore
and as an older version but just so you
know that you can actually extract this
information very very quickly I could
obviously sort it either by number of I
count and so if I wanted to create a
histogram of this data the only thing I
need to do is save this out open it up
in XL and and generate the chart as
those of YouTube Instagram I've already
had experience doing so again handy
trick for manipulating the data for kind
of getting a sense of it again what
we're looking for here is a lot of
different ways to evaluate it the other
thing that we noted is that you know if
i look at my under redo history here it
will for example as I said the under
redo this tree will capture all
meaningful transformations of the data
so in this case converting it to a
number of counts as a meaningful
transformation but what if i want to
share this data with somebody and
highlight something so i'm going to go
back to my text facet here you know one
of the things because we don't have a
sort of show/hide feature that we have
in something like microsoft excel if we
want to kind of present a particular
view of the data and share that with a
colleague or keep a reference of it it's
a little bit trickier or it seems that
way at first in fact there is a great
feature here that lets us do just that
so for example if i'm just going to go
ahead and select a handful of stations
for the heck of it right it doesn't
matter which ones oops wait so again if
i want to include more than one at a
time and need to specifically chooses
include or okay so I've chosen three
station ids doesn't really matter but we
can see again in the under video history
these facets don't appear in my in my
under redo history so how do I preserve
a view of them right this is isolation
the fact that 624 matching rows etc etc
what if I wanted to get more granular on
that and you know I don't edit it down
by n station ID any of that well
this is another case where we have is we
where we can do this is we can ask it to
generate what we call permalink okay so
with this permalink is and I hope you
notice that you saw how this URL
transformed up here this permalink
essentially incorporates we can almost
read it in fact right it says facets
type you know name start station ID
right so it's actually kind of detailing
in quasi readable way the
transformations that exist here this
permalink I can then open at a later
time and it's going to look exactly the
way I was last viewing it okay so this
is going to be really important for you
as you're creating your documents for
your final projects you want to make
sure that you register this chain you
know you that every time you generate of
you that you're using to uh you know to
gain an insight right especially it
becomes useful in your store or
something like that that you're actually
including that that sort of URL now
obviously this isn't you can't just
paste this into an email and send it to
somebody because unless they have your
original data set and open refine and
opened with I'm running it's not going
to do anything right but as long as I do
have that if I have open refine running
and I have your data set then I can or
an export of this project right I can
import that product to my own version of
open refine use this link to actually
see what you are seeing or working at
that point in time so super useful thing
there um the other thing oh so let's
look at trip duration really quick here
so I'm going to go ahead and just see if
I can do a common transformative use to
number right because this is number of
seconds and see if I can also quickly do
a numeric asset so what you'll see is
that this is not too useful because
there's actually such a spread in the
duration apparently somebody had a bike
out for four hundred and thirty thousand
seconds which seems like an awful lot in
fact i'm very curious what that was so
when I again when I restrict this I get
this example so
somebody took somebody kept a bike out
for five days apparently wonder what was
going on there they took it out at State
Street and Smith Street and they finally
returned it on Nassau Street and Navy
Street I wonder what that cost so you
know this is again if you this can be a
great way to kind of get a broad
overview of your data of course you know
if you have outliers on a very large
data set it's going to be very hard to
see any kind of pattern and we can't
obviously expand this pattern I just to
show that that basic histogram can be
done here too so the next thing that I
want to look at is we can also do this
with for example with time so I'm going
to do a transformation to date this is a
date time and I can actually create a
timeline facet on this that will show me
the distribution of timestamps so for
this one if I do a timeline facet now I
can get a really interesting view on
what the you know what the what the
times are what the sort of peaks and
valleys are in when people leave city
bike stations with bikes finally I want
to start to expand on the so you know
you may find that useful analysis the
last thing I want to do here is expand
on some of the work that we've started
doing with regular expressions this is
going to be augmented for your next
assignment which was for what you're not
going to use this data you're going to
use data related to your final project
and so for this what I'm going to do is
I'm going to go ahead and do add column
based on this cat problem so we're going
back to our regular expressions here so
my basic outline is this value match
right so here we have it's going to
treat this column as text I haven't
actually converted it to a date time but
let's say I want to do something that
you know initially might seem kind of
difficult like I decide I want to
capture all of the values between 0 and
3 a.m. between midnight and 3am right
well there's a few things that I want to
highlight here so the first is I know or
on sep tember first let's say right so I
know that I can start with 91 2016 now
first of all as we can probably tell the
forward slashes are reserved characters
right they have a special meaning in the
context of this function we add them to
indicate that we're about to do a
regular expression so we can't just use
them straight up in a regular expression
we have to do what we call escaping and
the way that we escape in this context
is with a backslash so it looks a little
bit funny but this to character pair is
being read by open refine as a genuine
/not the special character for us
forward slash so i'm going to put in
2016 here now what comes next in my date
time the next thing that comes is a
space okay and that back / s is a
special character code for my regular
expressions that i'm going to share a
reference to these for you so you can
get a feel for what's available but that
means any white space character so it
could be a space it could be a tab
that's just sort of how we put it there
now if what I want to do is find
everything between midnight and 3am I
know that the first character is still
going to be a zero right because it's
either 00 01 02 03 so now what i can do
is i can say look I want this to be
anything from 0 to 3 so I have to have
10 and then I can have the next
character be a 0 a 1 a 2 or 3 and that 0
dash 3 means from 0 to 3 inclusive okay
and then I can say and the rest can be
anything you want because I don't care
about the minutes and seconds right and
so I'm going to use my is not null here
because if I don't put the is not null
then actually nothing shows up in the
resulting column but so far it looks
pretty good because all of these
timestamps match what I'm looking for
but they they all match what I'm looking
for I'd say they're coming up true so
I'm going to say from midnight it's not
very good spelling two three and
and here we go see so I'm getting some
truths that's good I knew that but let's
see if I do a text faster than I getting
any false so so far so good it looks
like out of these first ones this first
hundred thousand i have 766 only 766
that match from zero from midnight to
3am on on November Ariane September
first and I can just take a peek here
that's interesting oh no way sorry I did
it on stop time that start time I can
take a peek here and see that yes all of
these ones that are coming back with
forces are definitely outside the range
that I'm looking for right so this is
really just a spot check i would
obviously want to confirm this an easier
way to confirm this probably will be to
click on the true ones and check those
out but it looks like things are pretty
good so something to think about going
forward we'll talk in more detail about
regular expressions in class and
practice creating them and describing
them in English which is going to be
your assignment but until then I hope
you all have a good few days
chance to some time to get more familiar
with open refine I want it to just add a
little bit at a few more examples of
things that can be done with this
program there's a useful for you get
analysis especially with large data sets
and also to look at some of the more
enhanced elaborated ways that we can use
regular expressions so I'm actually
gonna work with one of our existing
projects this is the first hundred
thousand rows that we looked at in class
the other day and I want to highlight a
couple of things here so one of them is
I want to come back to the issue of
facets so hopefully you all are pretty
familiar now with already with this idea
of the text bassett right very simple
way to select all of the occurrences of
a particular value within within the
data set and I just want to point out
one aspect of this we know for example
that we can sort by name right which is
not to meaningful for us and we're
supporting when we're sorting by numbers
although it was pointed out in one of
the classes that these aren't in what we
would consider numerical order and the
reason why that is is because as
indicated by all of the black text here
and the fact that we didn't ask it to
parse by data type on the way in
actually on the original screen when we
imported this data right now for
everything except for our false values
here we can see as we scan across it's
actually interpreting all of these
everything that this are constructed has
CPW columnist it is interpreting as
strings okay what that means is that
it's seeing numbers not as numerical
values but actually just as characters
right and so that's why we see something
that looks strange at first right when
we expect this to be sorted by by
quote-unquote name right where we see
like okay 2000 but then we see that for
example 212 is coming after 2023 full of
course that's because in string
comparison 20 comes before 21 right so
if we wanted to change that we would
actually need to change the data type in
this in this column which we can do by
going here to edit cells and under
common transformation we're going to see
um an option for two number right and
that's not going to make much difference
to us here other than to sort when we
sort by quote-unquote name to perhaps I
think hopefully sort them in the way
that we expect but let's find out so if
i do a numeric facet here so this is dr.
sort of a preview it actually if we have
a numerical data and it is set up as
numbers what we get is actually a free
histogram from from open refine and we
can actually scale is now there's
nothing we can't export this it's good
as a quick overview especially if we
have you know a sort of isolated data it
can give us a sense of the shape of the
data so if you have already collected
and cleaned a bunch of data and you kind
of want to see what the shape of things
matching that is it's useful to do you
know you can see it's sort of
interpreting these things weirdly it's
not 2900 I can see you know most of my
values are between 0 and six hundred and
you see there's actually updating the
matching row count here so I can kind of
get a sense of that as I go but the
thing that I actually wanted to let's
see if I do a text facet here I think
I'm still gonna get let's see what kind
of sorting I get so I still kind of get
the unexpected sorting here because I
told it to facet treating this as text
so that's another thing to be aware of
is it we can probably can adapt the type
of facet that we do to to reflect the
way we want the data to be treated in
that moment even if the column data
format has been turned to something else
right so in this case I converted these
two numbers but when I do a text gasset
it will treat them as text regardless
the other thing that I want to point out
here is this 573 choices so we referred
to this as a way of indicating you know
how many different start station ids
there are for example in this data set
if I click on this what I actually get
is a tab separated list of values and
the counts so this is something where I
could actually copy this out throw it
into a text edit file save it as a txt
and then use this for example to
generate a history
and if I wanted something a little bit
more if I wanted to be able to do that
and something like XL right so actually
used to support a way to do this sort of
mini histogram on this site of thing
right in in open or fine I don't it
doesn't seem like they do that anymore
and as an older version but just so you
know that you can actually extract this
information very very quickly I could
obviously sort it either by number of I
count and so if I wanted to create a
histogram of this data the only thing I
need to do is save this out open it up
in XL and and generate the chart as
those of YouTube Instagram I've already
had experience doing so again handy
trick for manipulating the data for kind
of getting a sense of it again what
we're looking for here is a lot of
different ways to evaluate it the other
thing that we noted is that you know if
i look at my under redo history here it
will for example as I said the under
redo this tree will capture all
meaningful transformations of the data
so in this case converting it to a
number of counts as a meaningful
transformation but what if i want to
share this data with somebody and
highlight something so i'm going to go
back to my text facet here you know one
of the things because we don't have a
sort of show/hide feature that we have
in something like microsoft excel if we
want to kind of present a particular
view of the data and share that with a
colleague or keep a reference of it it's
a little bit trickier or it seems that
way at first in fact there is a great
feature here that lets us do just that
so for example if i'm just going to go
ahead and select a handful of stations
for the heck of it right it doesn't
matter which ones oops wait so again if
i want to include more than one at a
time and need to specifically chooses
include or okay so I've chosen three
station ids doesn't really matter but we
can see again in the under video history
these facets don't appear in my in my
under redo history so how do I preserve
a view of them right this is isolation
the fact that 624 matching rows etc etc
what if I wanted to get more granular on
that and you know I don't edit it down
by n station ID any of that well
this is another case where we have is we
where we can do this is we can ask it to
generate what we call permalink okay so
with this permalink is and I hope you
notice that you saw how this URL
transformed up here this permalink
essentially incorporates we can almost
read it in fact right it says facets
type you know name start station ID
right so it's actually kind of detailing
in quasi readable way the
transformations that exist here this
permalink I can then open at a later
time and it's going to look exactly the
way I was last viewing it okay so this
is going to be really important for you
as you're creating your documents for
your final projects you want to make
sure that you register this chain you
know you that every time you generate of
you that you're using to uh you know to
gain an insight right especially it
becomes useful in your store or
something like that that you're actually
including that that sort of URL now
obviously this isn't you can't just
paste this into an email and send it to
somebody because unless they have your
original data set and open refine and
opened with I'm running it's not going
to do anything right but as long as I do
have that if I have open refine running
and I have your data set then I can or
an export of this project right I can
import that product to my own version of
open refine use this link to actually
see what you are seeing or working at
that point in time so super useful thing
there um the other thing oh so let's
look at trip duration really quick here
so I'm going to go ahead and just see if
I can do a common transformative use to
number right because this is number of
seconds and see if I can also quickly do
a numeric asset so what you'll see is
that this is not too useful because
there's actually such a spread in the
duration apparently somebody had a bike
out for four hundred and thirty thousand
seconds which seems like an awful lot in
fact i'm very curious what that was so
when I again when I restrict this I get
this example so
somebody took somebody kept a bike out
for five days apparently wonder what was
going on there they took it out at State
Street and Smith Street and they finally
returned it on Nassau Street and Navy
Street I wonder what that cost so you
know this is again if you this can be a
great way to kind of get a broad
overview of your data of course you know
if you have outliers on a very large
data set it's going to be very hard to
see any kind of pattern and we can't
obviously expand this pattern I just to
show that that basic histogram can be
done here too so the next thing that I
want to look at is we can also do this
with for example with time so I'm going
to do a transformation to date this is a
date time and I can actually create a
timeline facet on this that will show me
the distribution of timestamps so for
this one if I do a timeline facet now I
can get a really interesting view on
what the you know what the what the
times are what the sort of peaks and
valleys are in when people leave city
bike stations with bikes finally I want
to start to expand on the so you know
you may find that useful analysis the
last thing I want to do here is expand
on some of the work that we've started
doing with regular expressions this is
going to be augmented for your next
assignment which was for what you're not
going to use this data you're going to
use data related to your final project
and so for this what I'm going to do is
I'm going to go ahead and do add column
based on this cat problem so we're going
back to our regular expressions here so
my basic outline is this value match
right so here we have it's going to
treat this column as text I haven't
actually converted it to a date time but
let's say I want to do something that
you know initially might seem kind of
difficult like I decide I want to
capture all of the values between 0 and
3 a.m. between midnight and 3am right
well there's a few things that I want to
highlight here so the first is I know or
on sep tember first let's say right so I
know that I can start with 91 2016 now
first of all as we can probably tell the
forward slashes are reserved characters
right they have a special meaning in the
context of this function we add them to
indicate that we're about to do a
regular expression so we can't just use
them straight up in a regular expression
we have to do what we call escaping and
the way that we escape in this context
is with a backslash so it looks a little
bit funny but this to character pair is
being read by open refine as a genuine
/not the special character for us
forward slash so i'm going to put in
2016 here now what comes next in my date
time the next thing that comes is a
space okay and that back / s is a
special character code for my regular
expressions that i'm going to share a
reference to these for you so you can
get a feel for what's available but that
means any white space character so it
could be a space it could be a tab
that's just sort of how we put it there
now if what I want to do is find
everything between midnight and 3am I
know that the first character is still
going to be a zero right because it's
either 00 01 02 03 so now what i can do
is i can say look I want this to be
anything from 0 to 3 so I have to have
10 and then I can have the next
character be a 0 a 1 a 2 or 3 and that 0
dash 3 means from 0 to 3 inclusive okay
and then I can say and the rest can be
anything you want because I don't care
about the minutes and seconds right and
so I'm going to use my is not null here
because if I don't put the is not null
then actually nothing shows up in the
resulting column but so far it looks
pretty good because all of these
timestamps match what I'm looking for
but they they all match what I'm looking
for I'd say they're coming up true so
I'm going to say from midnight it's not
very good spelling two three and
and here we go see so I'm getting some
truths that's good I knew that but let's
see if I do a text faster than I getting
any false so so far so good it looks
like out of these first ones this first
hundred thousand i have 766 only 766
that match from zero from midnight to
3am on on November Ariane September
first and I can just take a peek here
that's interesting oh no way sorry I did
it on stop time that start time I can
take a peek here and see that yes all of
these ones that are coming back with
forces are definitely outside the range
that I'm looking for right so this is
really just a spot check i would
obviously want to confirm this an easier
way to confirm this probably will be to
click on the true ones and check those
out but it looks like things are pretty
good so something to think about going
forward we'll talk in more detail about
regular expressions in class and
practice creating them and describing
them in English which is going to be
your assignment but until then I hope
you all have a good few days