Parsing and Saving to CSV

Okay folks, so when we last left off, we had gotten pretty far in terms of isolating the contents of the links that we're looking for, but we still have this problem that they're both PDF and Excel links, and we're not interested in PDFs. So what I need to do is use, of course, a regular expression. Now, regular expressions in Python actually come from a library that needs to be imported at the top of your Python file; its short name is re, for "regular expression." Regular expressions also work a little bit differently in Python: we create the regular expression using a recipe called re.compile, put the result into a variable, and then use that variable to refer back to the regular expression. In a way that's kind of handy, because when we have a really messy regular expression, we can give it a nice short variable name and then not worry about fitting it in line with other code that might be complicated.
complicated so I'm gonna call this is XO
right because that's what I'm interested
in and again I'm going to the re library
and I'm saying compile and what I want
to compile is the contents of my regular
expression now in this case all I care
about is that it ends in dot xlsx XLS
actually I was have trouble saying that
right so actually the syntax here is
almost identical to what we used in
openrefine except I'm gonna wrap it in
quotation marks instead of forward
slashes so I'm gonna use dot asterisk so
just mean anything you want as much as
you want
and then I'm gonna use a backslash dot
to escape my the dot which is the file
before the file extension and then just
put XLS X right so the only thing I'm
worried about basically is that it's
something that ends with xlsx there are
other ways that can be more explicit
about the fact that Excel accesses at
the end but I'm pretty confident that
there are no legitimate file names there
gonna be something dot xlsx more stuff x
LX s X all right and in any case I'm
only interested in identifying that they
have that in there so if I come back
So if I come back down to my for loop here, I'm going to add a few comments: this is where I go through all of the blocks I found, and then here I am grabbing all the a tags inside each li tag. And finally, link["href"] is the contents of my link, the contents of my URL, and what I want to know is whether that href string matches my regular expression; in other words, does it have .xlsx at the end?
To do this I'm obviously going to use an if statement. Here's where the syntax gets a little bit weird, because if you remember, in OpenRefine we did value.match; in this case we do re.match, because again we're going to the regular expression library and saying, please use the match function. This actually takes two ingredients: the variable that holds the regular expression we compiled up at the top, and then whatever string we want to search. Now, you might ask why I put my regular expression up at the top of the file. I like doing them up there because, quite frankly, it makes them easier to find, and because it makes sense to create the expression in only one place. If I were to put it inside the for loop, for example, it would get recreated every time the loop ran, and I don't need that to happen. Not a huge deal, just something I like to do.
So what I'm saying is: if it matches, then print. I'm going to get rid of the earlier print statement, because that was just a test to help me see the difference between list items. The thing I'm printing is still link["href"], the contents of the link; I could give it a variable name, which might make the code a little more readable. What I'm hoping for here is that this will only print out things that have an .xlsx file extension, and so far, so good.
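As a sketch of that filtering loop, using a plain list of hypothetical href strings standing in for the link["href"] values pulled from the page:

```python
import re

xl = re.compile(r".*\.xlsx")

# Hypothetical hrefs standing in for the link["href"] values from the page
hrefs = [
    "reports/crime_2019.pdf",
    "reports/crime_2019.xlsx",
    "reports/crime_2020.xlsx",
]

# re.match takes two ingredients: the compiled pattern and the string to test
matched = [href for href in hrefs if re.match(xl, href)]

for href in matched:
    print(href)  # only the .xlsx links survive
```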
So now all I really need to do is worry about getting this into a file. I could go ahead and import csv, create a CSV writer, and use writerow; in this case I'm just going to be a little more old-school about it and write the links to an output file with a carriage return at the end of each one. Obviously, to do that I need to first create an output file, and I don't want to be redoing that every time I go through an item, so I'm going to create the output file up here, before the loop. We know how to do this: I'm going to call it stat_links.csv, since I'm going to make rows, and I need to open it as writable. Then, down in the loop, instead of printing the link I'm going to say output_file.write, of course, because it is a file, and instead of writing just the link I'm also going to write "\n", the character that means carriage return. You'll notice that I have a plus sign between them: this is a little trick we call string concatenation, or string addition. If you put a plus sign between two strings, it will literally just stick them together. So if that works as planned, I should end up with a file that just has a list of all of these URLs in it. And of course, as always, because I've opened that output file, I want to close it, so I need to put that down at the bottom. Okay, let's give it a shot and see what happens.
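Put together, the file-writing steps might be sketched like this; the hrefs list is a hypothetical stand-in for the links scraped from the page, while the filename stat_links.csv is the one used here:

```python
import re

xl = re.compile(r".*\.xlsx")

# Hypothetical hrefs standing in for the link["href"] values from the page
hrefs = [
    "reports/crime_2019.pdf",
    "reports/crime_2019.xlsx",
    "reports/crime_2020.xlsx",
]

# Create the output file once, before the loop, so it isn't recreated
# on every pass; "w" opens it as writable
output_file = open("stat_links.csv", "w")

for href in hrefs:
    if re.match(xl, href):
        # "+" concatenates the two strings; "\n" adds the carriage return
        output_file.write(href + "\n")

# Because we opened the file, we close it at the bottom
output_file.close()
```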
So again, no news is good news, and if things have gone according to plan here... yeah, that looks pretty good. So what might I do with this file? That's a very good question. Probably what I would do next is write another Python script that takes this file as input, a little script that just downloads all of these files. This is a quick way for me to get, how many files do we have here, 170, we've got 170 files. I need to get all that data, and downloading it manually would be pretty time-consuming; this lets me do it automatically. Now that I have this new file, I could create a new Python file where, instead of using urlopen on the crime stats page, I read in the CSV file line by line. Each line would be the ingredients for a urlopen call: I would download and save each file, and that way I could download all 170 files and take it from there.
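A sketch of what that follow-up download script might look like, using urlopen from the standard library; the filename_for helper and download_all function are hypothetical names, not part of the original walkthrough:

```python
from urllib.request import urlopen

def filename_for(url):
    # Save each file under the last piece of its URL (a simple assumption)
    return url.rstrip().split("/")[-1]

def download_all(csv_path):
    # Each line of the CSV is the ingredient for one urlopen call
    for line in open(csv_path):
        url = line.strip()
        if not url:
            continue
        with urlopen(url) as response, open(filename_for(url), "wb") as out:
            out.write(response.read())

# Usage: download_all("stat_links.csv")
```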
So hopefully this has illustrated to you the basics of web scraping, but also how the same principles we've learned, opening files, writing to files, using for loops and if statements, are really pretty much what you need to know. Obviously I've introduced some libraries that I'm familiar with, having worked in this for a bit, and I think they're pretty useful ones, but you're always welcome to do additional reading. And really, as I mentioned in class, this is the kind of thing where googling for something, looking through how someone has solved a problem on Stack Overflow, and reading documentation all help. So if I look up the Beautiful Soup Python documentation, I can actually go and see all of the things this library can do, and different ways that I might want to do them: there's what they call searching the tree, and here's the find_all function that I used, which describes various ways to get at things. So anyhow, I invite you all to explore this as much as you would like, and again, I hope you find it useful in the future. And with that, I will see you all next week.
Expertise:

Using a regular expression in Python to isolate just the links with the appropriate file extension, and then writing them to a CSV file.

Contributor: Susan McGregor

Video 3 of 3