Intro to Scraping

Hi everyone. As discussed, this week we're going to be looking at how to scrape data from web pages using Python. This is a task that often comes up in journalism, because there's a lot of data on the internet that isn't really available as, you know, a CSV or an Excel file, and we need to get that information. So we're going to pivot away from our Citi Bike data and look at crime statistics instead, just to give you all a sense of this, and hopefully it will help you think about how to do some of your own web scraping tasks that you might encounter.

I'm just about to open up Aptana here. I'm going to start on the desktop and create a new folder to put this in. Hopefully this will be valuable in part not just for the practical application, but because it will illustrate to you just how universal some of the systems that we've already seen actually are. So I'm going to create a new folder on the desktop.
Starting with a new folder, I'm going to call this "crime stats". You know, I sort of chose this arbitrarily; I just wanted to look at something that might be of interest, since we've been working with the Citi Bike data quite a lot. Then I'm going to create a new file, and I'll call this nypd_stats_scraper.py. So we start in the same place as we always start with this, which is... I guess I need to fix a configuration on my system here, I don't know why that happened, but here we go.
What I'm going to do is start by importing the libraries that I need. I don't necessarily know right at the start what all of them are going to be, but I do know a couple of them. One that I'm going to use is called Beautiful Soup. Beautiful Soup, again, just like all of our other libraries, is a bunch of code that someone has written that is specifically designed for going through web pages and pulling data out of them, and it's part of an overall package called bs4, which stands for Beautiful Soup 4. So instead of starting with import, I'm actually going to say from bs4 import BeautifulSoup.
You don't need to concern yourself with this form too much; a lot of times, if you're looking for advice on these things, you'll see it on Stack Overflow and Stack Exchange and places like that. But basically, whereas with import csv we imported the entire csv library, in this case we're importing just a piece of a library, because the overall library is bigger than we need. The other thing we're going to import here is something called urllib. Unlike BeautifulSoup 4, which is a thing we actually have to download separately, urllib is like csv in that it is a very commonly used library that's already on our computer; we just need to say, listen, make that library available for this particular program.
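So the top of the script, as typed so far, looks roughly like this (we'll end up refining the urllib line in a minute):

    # Import just the BeautifulSoup class from the bs4 package
    from bs4 import BeautifulSoup

    # Import the built-in urllib library
    import urllib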
The first thing I need to do, of course, is find something to scrape, and in this case I'm going to look at the NYPD crime statistics. I thought this was a nice example because it's information that seems sort of organized, but in point of fact it doesn't really have the data in the form I might be looking for: it has multiple file types here, and if I wanted to, say, download the data from every single one of these, I would need to get the URLs individually, and these are mostly going to lead straight to a download. There isn't really a good way to automate this by hand, and even though the precincts do kind of go in order, you can see that not all of the numbers are there, so I can't fully automate it either. This makes it a good candidate for scraping, and a kind of common example we might run into.
The first thing I want to do is just download the data. I certainly could download and work on the webpage directly, but I prefer to download and save a copy of the webpage rather than working with the whole thing in memory. That's basically because if something happens, if the program doesn't work right away, I would rather not be opening the URL over and over again; once I've saved a copy, I can just work with that copy. And of course I also have it saved on my computer, so it's clear to me exactly what data I'm working with, and I think it's sometimes easier to browse when it's on the desktop. So I'm just going to make a note of my target URL here, which is this crime stats page, and then I'm going to start to work on actually downloading it to my computer.
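In the script, that note is just a variable holding the page's address; the exact URL isn't legible here, so the one below is only a placeholder:

    # The page we want to scrape; this address is a stand-in for
    # the actual NYPD crime statistics page URL
    target_url = "https://www.example.com/nypd_crime_stats"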
The first thing I need to do is actually open the URL, and this is going to be pretty similar to what we've seen, except that instead of just opening a file, we're opening a URL; the name of the verb there is urlopen. Then I just paste in the target URL as a string, and then I want to read the data, so I'm going to call this web_data and read it from my webpage. Pretty straightforward: again, very similar to what we've seen so far, except that it happens we're dealing with web page data instead of just regular CSV or text data. And I see I have a typo there already, not taking my own advice about copying and pasting. Now I'm just going to give this a little comment; I'm going to say "open the target URL and read the data".
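Cleaned up, those two lines are meant to look something like this:

    # Open the target URL and read the data
    webpage = urlopen(target_url)
    web_data = webpage.read()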
Now, like I said, I'm going to save a local copy, and I could even break these two things up: I could have something that just downloads the data, and then once I'd downloaded it once, I could just comment that part out. If I were doing this with real-time data, like a Twitter feed or something like that, I might want to constantly be looking at whatever was most current; in this case the data changes once a week or so, so it's not as necessary. But the reason I bring it up is that if you start running your script too frequently, you can run into usage limits for services. The one thing you don't want to do is start running something so many times that you end up getting yourself, or, as has sometimes happened, an entire newsroom, blocked from opening a given site because there are too many requests coming from your location.
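As a sketch, once the local copy exists, the download lines can simply be commented out so re-runs don't hit the site again:

    # After the first successful run, comment these out and work
    # from the saved copy instead of re-requesting the page
    # webpage = urlopen(target_url)
    # web_data = webpage.read()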
So I'm going to save a local copy of the web page. I'm going to call it local_file, and first I'm literally going to name it: it's just a local copy of the page, so I'm going to call it nypd_crime_stats.html. Now we have our open for writing, but this time there's actually a second letter in that mode. Before, if you remember, we had our "u" for the read, so that was "read universal"; in this case we're going to have "wb", and what "wb" stands for is "write bytes". Data, when we download it from the web, often comes as bytes; that's the unit of data in computation. Many of you are familiar with it: if you see something marked KB, that's kilobytes; MB is megabytes; GB is gigabytes. So what we're saying is: write, but be prepared for byte-type data, rather than, for example, strings, which is what we've done in the past with our CSVs. I'm just going to go ahead and write this: I'm going to say local_file.write, and I want to write the web_data, the result of that read. Then I'm going to go ahead and close everything: again, anything that I've used any kind of open command with, I'm going to close, so I'm going to close the webpage and close the local file.
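Put together, the saving and cleanup steps look roughly like this:

    # Save a local copy of the web page; "wb" means "write bytes",
    # since data downloaded from the web arrives as bytes
    local_file = open("nypd_crime_stats.html", "wb")
    local_file.write(web_data)

    # Close anything we opened
    webpage.close()
    local_file.close()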
And again, the reason, particularly with the local file, is that very shortly I'm actually going to open it up again and use BeautifulSoup to transform the contents. Similar to the DictWriter that we used from the csv library, the BeautifulSoup library is going to take as ingredients the contents of that local HTML file and give us a bunch of new verbs for looking through it and finding what we want to find. So, just to make sure there's no conflict, because we can't read and write at the same time, I'll make sure we close that local file before we go to open it again.
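As a preview, the re-opening step we'll write shortly looks something like this sketch; the "html.parser" argument is my addition, not something shown in the video yet:

    # Re-open the saved copy and hand it to BeautifulSoup, which
    # parses the HTML and gives us new verbs for searching it
    local_file = open("nypd_crime_stats.html", "rb")
    soup = BeautifulSoup(local_file, "html.parser")
    local_file.close()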
Now, in fact, is probably a good time for me to actually test my script and make sure everything's working, because if it's not, I'd rather troubleshoot it now. So I'm going to go ahead and get to my desktop here, and go into my crime stats folder... and okay, so it looks like I have a little bit of an error here, because I've imported urllib, but I actually need to be a little bit more specific about what pieces of it I'm importing, the same way that I have from bs4 import BeautifulSoup.
I think I need to be a little bit more specific about what part of the urllib library I am bringing in, and really the only way to figure this out is to look up the documentation, which I did: I actually need to say something much more specific, which is from urllib.request import urlopen.
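So the import line at the top of the script changes to:

    # urlopen lives in urllib's request module, so import it directly
    from urllib.request import urlopen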
I think the key thing here is that when I get "urlopen is not defined", I've used this function thinking Python is going to know what it is, because I believed I'd pulled in the library that I need. Anytime I get a "not defined" error, it means I'm trying to reference something that Python hasn't heard about before. And in that sense, this troubleshooting process, whether it's because I've mistyped the name of a variable or because I'm trying to use a recipe that hasn't actually been imported from a library yet, is going to look pretty much the same either way, and the remedy is pretty much the same as well. So let's try this again.
Oh, interesting: "web_data is not defined". Oops, here we go: web_data equals webpage.read. That's interesting; when I just tried to open it that way, it actually opened it in a browser. I've never seen that before. All right, let's see if we do better here. Okay, so now it's upset; it says "an integer is required". Fascinating. All right, let me see what I can do with line eight... oh, ha, I wrote "write" where I meant "read". See, now I'm just getting sloppy here. Let's try this. All right, I'm going to have to go back to my dictionary here. One more time: it's going to be webpage.read, and it knows how to read. Okay, one more time.
Okay: "'tuple' object has no attribute 'write'". And of course, I forgot to put open on this, see. This is one of those things where we've just got to keep stepping through it. On line twelve it was saying, "I don't know what you're talking about," and of course a tuple basically means a pair. What it was seeing, before I had this open on here, was that I was creating a variable called local_file and saying, hey, this is a pair: the first part of the pair is a string called nypd_crime_stats.html, and the second part is this string called "wb". And Python was like, all right, fine, I'll do that, and then later, what do you mean, write? Because it's not a file.
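A minimal illustration of that bug, using the filename from the video:

    # The bug: without open(), this just builds a tuple (a pair of
    # strings), which has no .write() method
    local_file = ("nypd_crime_stats.html", "wb")

    # The fix: call open() so local_file is an actual file object
    local_file = open("nypd_crime_stats.html", "wb")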
So, one more time. All right: no news is good news. Let's see how we did. If this worked as I hoped, then I should be able to see the file; actually, let me open this in Aptana, that would be the smarter thing to do, to look at it here. Here we go, crime stats; let's open it up. What we're going to see here is actually the code rather than the webpage itself, rather than the rendered webpage.
Now, the trick is that anytime I'm scraping a webpage, I have to do a lot of what we call inspection. I've just got to look at it, because what I need to do is find out what is unique about the pieces of HTML that contain the data I'm looking for, something I can use to ask BeautifulSoup to search for those particular pieces. When we come back, I'm going to look at this in a web browser, which will help us do that process a little bit faster. But basically there's no other solution except to go through and say: where's the stuff I'm looking for, and what is something unique, or unique-ish, on the page that I can use to get to it with my scraper?
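Just to make that concrete, here's a hedged sketch of the kind of search we'll write next time; the tag name is my assumption, since we haven't inspected the page yet:

    # Once the page is parsed, find_all() returns every matching tag;
    # for example, every link on the page
    for link in soup.find_all("a"):
        print(link.get("href"))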
So I will see you all in a few minutes; I'll be right back.
An introduction to scraping the contents of webpages using Python 3 and BeautifulSoup 4

Contributor: Susan McGregor

Video 1 of 3