Filtering for Tags
Transcript
Okay folks, so when we last left off, we had successfully downloaded our NYPD crime stats page, and we were looking at this pretty dense and sort of unintelligible HTML file. This is the point where, having determined that this is the page we want to scrape, we actually want to go back to our web browser, because our web browser has much better tools for examining what the unique features are that we can use to pull the data we're looking for out of this file. You may remember we've looked at this briefly before in class.
What we're going to use is the inspect feature: I'll do a right-click here and choose "Inspect." What this does is take me straight to whatever content I highlighted when I selected it, and you can see how, in the right-hand pane, I can start going through and seeing what the structure is. Okay, so the Manhattan label is here, and obviously the link is down here, but I can see that this is wrapped in a larger structure, which is this list item. As I scroll over each individual list item, you can see how it's highlighted on the left-hand side, and if I go up one level, I can see that this "span6" element is actually all of Manhattan South, and if I go over the next "span6", it's all of Manhattan North. In each case, as I come down here (it's a little bit distorted by the resizing), I can start to see: okay, there's all of Brooklyn, and oh, there's another "span6".
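To make that structure concrete, the markup we're looking at is shaped roughly like this. This is an illustrative reconstruction, not the page's exact HTML, and the link paths are made up:

    <div class="span6">
      <ul>
        <li>Manhattan South:
          <a href="/crime-stats/manhattan-south.pdf">PDF</a>
          <a href="/crime-stats/manhattan-south.xlsx">Excel</a>
        </li>
        <!-- ...one list item per precinct group... -->
      </ul>
    </div>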
So in this case, the first thing I'm going to look for is that class. What BeautifulSoup does, again, is similar to how our DictReader takes our sort of raw data and transforms it so that it's easier to deal with: we're going to use BeautifulSoup to do a similar type of transformation on our HTML file, so that we can use the special recipes it has for grabbing bits of the page based on identifiers like this one. In our case, we're going to use the class "span6" as our starting point. Now, might there be things on this page that are not my data but still have the class "span6"? Possibly, but it seems a little unlikely, and if worse comes to worst, we'll deal with that when we encounter it. Just like everything else we've seen, this kind of work is very trial and error: we look at things, we try something, we see how it works, and if it doesn't work, we try something else. There's no one-size-fits-all here, so it's always a bit of a trial-and-error process.
So what I'm going to do is come back to my Python scraper, back to my New York stats scraper, and I'm going to start by converting the file I'm working with: I'm going to use the BeautifulSoup recipe on it, and that conversion is going to give me some of the nice recipes for pulling data out of it. To do this, as always, I'm going to create a new variable, and I'm going to call it crime_stats_soup; I like to maintain the metaphor here. The command here is actually called BeautifulSoup, and this is the only time I'm going to use it directly; then I say open, because I want to open this file. This is actually the same open command we've used previously, but here we're using it as an ingredient to the BeautifulSoup recipe. Okay, so let's see how this goes, because we want to keep testing.
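In code, this first pass looks something like the sketch below; the filename is a stand-in for whatever you saved the downloaded page as:

    from bs4 import BeautifulSoup

    # Open the saved HTML file and hand it to BeautifulSoup as an
    # "ingredient". No parser is named yet, which is exactly what
    # triggers the warning discussed next.
    crime_stats_soup = BeautifulSoup(open("nyc_crime_stats.html"))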
Okay, so this is a warning, not actually a failure, and it says something like: no parser was explicitly specified, so I'm using the best available parser for this system, lxml; this usually isn't a problem, but if you run this code in a different virtual environment, it may use a different parser and behave differently. So this is actually telling us: hey, there's something you could be doing better, and here's how to fix it. What they're recommending is that I put in a second argument, which is this "lxml" parameter. Basically, what that's saying is, it's almost like giving it a file extension: in the same way that a file extension is the clue to our computer about how it should interpret a given file, that second ingredient is saying, look, I'm giving you a file, and by the way, this is the interpreter you should use. So it's almost like an extension. Let's try this again, just to make sure we've solved that.
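The fix, continuing from the sketch above, is a one-word addition:

    # Same recipe, but now we name the parser explicitly, so the behavior
    # doesn't depend on which parser happens to be installed.
    crime_stats_soup = BeautifulSoup(open("nyc_crime_stats.html"), "lxml")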
And again, no news is good news. The only real way for me to see if this is doing what I want it to be doing is to start looking for things. Now, BeautifulSoup has a lot of documentation you can look at, covering the different kinds of methods it has for finding pieces of data within a web page. Obviously, I've done a little pre-reporting on this, so I'm going to go sort of straight to the chase. The method I'm going to use is called find_all. I'll create a variable, let's call it soup_data, set it equal to, and then use find_all, which is again a BeautifulSoup recipe, and what I'm going to look for is class equals "span6". Now, I want to point something out here. And I have a feeling I need to fix something first; oh yes, crime_stats_soup. Let's see if I did this right: crime_stats_soup.find_all. Okay. So what I'm saying is: into the variable soup_data, I want you to put the results of running find_all on crime_stats_soup. Now, the class equals "span6" part I got from here, right, from the contents of my HTML, so there's no magic there; that's just what I'm looking for.
Now, my first instinct would be to use class, but notice that class is a reserved word in Python, right, it's turning blue in my editor, so I can't just use that. As I found when I looked this up in the documentation, it turns out they understand this limitation, and they've created a tweaked version: if I write class_, with an underscore, it accomplishes the same thing, but it gets around the problem of class itself being a reserved word in Python.
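Put together, the search looks roughly like this:

    # find_all() gathers every element whose class attribute is "span6";
    # the trailing underscore in class_ is BeautifulSoup's workaround for
    # "class" being a reserved word in Python.
    soup_data = crime_stats_soup.find_all(class_="span6")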
The only way I'm going to be able to see whether this is working is to loop over the results. I know that because I'm using find_all, it's going to give me back a list, right? So again, I can say for item in soup_data, because my for...in only works if the second thing is a list, and then whatever the first variable name is, each element in turn just gets that name, in this case item. For now, I'm just going to try printing each item and see what happens. Okay, so I got some stuff; that's sort of what I was going for, but it's a little hard for me to tell. I just wanted to spit this out onto the terminal to see if it was working.
What I sometimes do here is also print a second message, so I can tell where the separation between items is when they all look alike. So I'm going to run this one more time. Okay, and now I can see it: here's a given chunk, right there between the break messages (and obviously I could have put anything in that message), and here's my div with class "span6". So now the question is: where is the data that I want? It's in these href attributes.
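As a sketch, the test loop at this stage looks like this; the break message text is arbitrary:

    # Print each "span6" chunk with a separator line, so we can tell
    # where one chunk ends and the next begins.
    for item in soup_data:
        print(item)
        print("--- break ---")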
Now, I'm going to prioritize and say that I actually only want the things that are Excel files, big surprise. We're going to do that part of the task with a regular expression, but before we get there, what I want to do is get at the contents of these, right? Each one of these is a list item, an li tag, which has some anchor tags inside it. So what I'm going to do is say: okay, for every one of these items, I want to find every anchor tag that's part of a list item tag. There's another recipe in BeautifulSoup that lets us do this, and the recipe that lets us do this is called select.
So to use select, I'm going to get rid of this print statement and start over here. First of all, I need a variable, because I want something to put the result of this into; I'm going to call it link_contents. I'm not sure that's precisely what it's going to hold, but that's what I'm going for overall. And it's going to work on each one of these items, because this was item: I want to select "li a". This particular feature, this select recipe, is part of the BeautifulSoup library, so it works however the writer of that library decided it would, and what it lets us do is string together multiple tags. The way I would interpret this is: for every list item, give me the anchor tags within it. So hopefully what this should give me is that link_contents should itself again be a list, because I'm going to be grabbing all of them, and I'm hoping that when I go through that list, I'm going to get basically just the anchor tags themselves, right, just the a href= pieces and so on. And in order to go through that list, I'm going to have to do another for loop; hopefully you can see that it's pretty much for loops all the way down. So: for link in link_contents, I'm now going to say, well, let me just see what the subset is there, so I'm going to print link and see what I get.
Oops, that's weird, never seen that before: there was an extra equals sign in there and it didn't throw an error; let me fix that. Okay, so this is getting much closer. What I can see now is that each link is its own item, its own piece of the list, because it's printing the break message after every single URL. So this is close to what I want, except I don't actually want the whole tag; I really just want this href piece. The way I can get that is through square brackets, and again, this is all stuff that I'm getting out of the documentation for BeautifulSoup.
Okay, so again, I'm sort of handing this to you all, but basically I went through and read the recipe book, like I would read a cookbook, to figure out which recipes had the pieces I was looking for. So hopefully that href lookup will get me what I'm after, and there we go: I'm really close now, right? I've got just the raw links. You can see they're not complete links, but I'm not too worried about that, because the base URL is probably going to be the same as this, nyc.gov or something like that; I'd obviously test one out before I actually used it.
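Here's a sketch of where the loop stands now, continuing the same assumptions as before:

    for item in soup_data:
        # select() takes a CSS-style selector: "li a" means every anchor
        # tag that sits inside a list item within this chunk.
        link_contents = item.select("li a")
        for link in link_contents:
            # Square brackets pull a single attribute off the tag, in
            # this case the link target itself.
            print(link["href"])
            print("--- break ---")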
Now, I want to do a few things. First of all, I want to isolate just the links that have .xlsx in them, and then ultimately I want to write them to a file, right? I don't want to just keep spitting them out to the terminal; that's not going to do me much good. Writing to a file is something we're pretty comfortable with at this point, so I'm going to focus instead on: how do I distinguish between the Excel file and the PDF file? Well, as I foreshadowed earlier, this is going to involve a regular expression, because the links share some similarities, but ultimately they could have basically anything in front; all I care about is whether it ends in .xlsx. So when we come back, I'm going to show you the steps we need to take to use a regular expression within Python.
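As a preview of where that's headed, here's a minimal sketch of the kind of check a regular expression makes possible; the exact pattern here is my assumption, not necessarily the one we'll build next time:

    import re

    xlsx_links = []
    for item in soup_data:
        for link in item.select("li a"):
            # Keep only the links whose target ends in ".xlsx".
            if re.search(r"\.xlsx$", link["href"]):
                xlsx_links.append(link["href"])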
From there, we're going to isolate those .xlsx links and write them to a file. So, I'll see you in a minute.