Looping, Testing and Copying

okay everyone so now we've finished our
first few steps we've gone through an
open-source data file and used the csv
library with its reader and writer
recipes to make it easier to handle we
now need to go through that data one row
at a time and the way we do this is with
something called a for loop now a for
loop is a standard mechanism for going
through lists of things in programming
you'll find for loops in every
programming language you're likely to
encounter, and the way that Python
does this is a little bit different from
some other languages but I think if you
think about reading it as like an
English sentence it will be a little bit
more obvious what's going on so the
construction for a for loop is for, now
this is a special instance of creating a
variable, we'll never create a
variable like this except
when we're doing a for loop, and
basically what we're doing is we're
saying for each_row, whoops, my thing here
is not showing up quite right
for each_row, I need to have an
underscore there, sorry, it doesn't seem
to be, okay, I don't know why that isn't
working, so I'm just going to say each_row in
what? what is the
element that I want to go through
right now well in my case it's actually
the citibike reader because remember the
citibike reader is the container, the
citibike reader holds the transformed
version of my source file that has that
header row removed so I'm
just looking at the data
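
As a rough sketch of the loop being built here: the file contents and column headers below are made-up stand-ins (the real Citi Bike file may use different headers), and an in-memory string takes the place of the opened source file. The names `citibike_reader` and `each_row` follow the ones spoken in the video.

```python
import csv
import io

# Hypothetical stand-in for the Citi Bike source file; the real column
# headers and values may differ.
source_file = io.StringIO(
    "tripduration,starttime,start station id\n"
    "300,2018-01-01 00:01:00,72\n"
    "450,2018-01-01 00:05:00,79\n"
)

# DictReader uses the header row for its keys and then hands back
# one data row at a time.
citibike_reader = csv.DictReader(source_file)

rows_seen = []
for each_row in citibike_reader:
    # Everything indented under the for line runs once per data row.
    rows_seen.append(each_row["start station id"])

print(rows_seen)
```

Here `rows_seen` exists only to make the result visible; the loop itself is just the `for ... in ...:` line plus whatever is indented beneath it.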
so for each row in citibike reader
okay and then there's a colon, okay, so
what this code is doing is precisely what it
sounds like, which is I'm saying for each
row in this particular data source do some stuff
right so I can literally just say do
some stuff now again you'll notice here
that the for and the in are both
highlighted that's because again these
are keywords, and the way that we help
the computer
understand what it should do to each row
okay is we indent the code that appears
beneath that for loop so anything that
has one tab under that for loop is going
to be something that happens to every
single row to each row as it comes up in
that dataset, so first it'll look at
the first row, then it's going to look at the
second row and it'll do whatever tasks
we ask it to do here and so the first
thing that I want to do is just say if
what was I doing again if the start
station ID is 72 so what I want to say
is look, if I look at each row, and
each row is actually a
variable name, it's the label for the
current row that we're looking at, so the
first time through it's row one, the second time
through it's row two, etc., so I take each row and
then I need to look at the value, which
particular value am I looking at? well, I
look at the value in that row that
corresponds to the header of start
station ID, okay, and now I need to
compare it to, well, that's actually
the string 72, so remember, when Python reads
a CSV it's a lot like OpenRefine,
by default it treats everything like a
string so when I compare it I'm gonna
compare it to the string 72 not the
number 72 but how do I do this
comparison well the trick here is that
we've already used the equals sign for
something the equals sign is how we say
put the thing on the right inside the
box with the label that's on the left
and so to actually compare two things to
see if they are in fact equal we have to
use two equals signs in a row, so we just
call this equals equals, and it's saying
compare what's on the left to what's on
the right, and if it is the same, okay, then I
have another indent here okay which
which tells the computer what to do only
in the case where this is true right so
only if the start station ID is equal to
72 will it do whatever is then indented
under this if statement so we have
actually the outline here of the main
tasks that we want to do,
we said look, go through the
file one row at a time, check to see if the
start station ID is 72, that was our if
statement, if it is, make a copy of the
row and write it to my output file, okay
so now I need to make a
copy of my row and write it to the output
file so the first thing I need to do is
make a copy of my row and the way that I
can do that here is I'm gonna say ok
look first I want you to create I'm
gonna say a new row and it's gonna be
empty okay
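
A small sketch of the test and the empty new row, assuming a made-up row shaped like what the csv reader hands back (the keys and values here are illustrative, not taken from the real file). It also shows why the comparison is against the string "72" rather than the number 72.

```python
# Hypothetical row; every value read from a CSV arrives as a string.
each_row = {
    "tripduration": "300",
    "starttime": "2018-01-01 00:01:00",
    "start station id": "72",
}

# A single = stores a value under a label; a double == compares two values.
matches_string = each_row["start station id"] == "72"
matches_number = each_row["start station id"] == 72

new_row = None
if each_row["start station id"] == "72":
    # This indented line runs only when the comparison is true.
    new_row = []

print(matches_string, matches_number, new_row)
```

The string "72" and the number 72 are not equal in Python, so comparing against the number would silently match nothing.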
remember a CSV is comma separated
just like a list is comma separated a
list or an array is comma separated so
I'm gonna say okay create a new row okay
leave it empty because I'm going to use
it to make a copy of this current row so
I'm gonna say new row is empty and now I
need to do what? well, I'm in one row,
I'm just looking at a single row right
now, and it's the first row that I've
encountered that has a start station ID
of 72, I want to take that new row and I
want to go through my
original row and I need to copy each
value out of it, okay, and I need to do
that by saying okay, first give me the
value that's at trip duration and then
put that onto my new row, then give me
the value that's at
start time and put it onto my new row
okay, so we're going to
say if it equals 72, create a new row
then I'm going to say for a header,
or I'm going to say a column, like for a column
in, now, so I need to loop through, I'm
gonna go through each of the headers one
by one and of course where do we find
those headers we have a list of those
headers in the fieldnames property from
our DictReader, which we called citibike
reader, okay, so for every column in
citibike reader fieldnames, right, which
is just the
list of column headers, we know that
because we have them actually
in front of us right now, for every one I
want you to go through and say new row
append, okay, append here means exactly
what it means in English add something
to the end append the value that appears
in the current row which again our label
for that right now is each row at a
column right or maybe I can say yeah I
think a column is fine right so in other
words, the first time that
it goes through this list right because
citibike reader field names is my list
it's the same as I'm looking at here in
the terminal so the first time through a
column is going to be trip duration so
it's gonna say okay look at the current
row and give me the value at trip
duration and append that to my
originally empty row the second time
through, right,
after trip duration it's going to go to
start time, and then at start time it's
gonna say okay so go take the current
row look at the value of the start time
and append that to new row and hopefully
you can see that that step by step right
column by column it's going to make a
copy of this row and put it in new row
okay
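
The column-by-column copy described above might look something like this. The sample data and headers are invented for illustration, `a_column` is a guess at the loop label used on screen, and `copied_rows` is added here only so the result can be inspected.

```python
import csv
import io

# Hypothetical stand-in for the source file; real headers may differ.
source_file = io.StringIO(
    "tripduration,starttime,start station id\n"
    "300,2018-01-01 00:01:00,72\n"
    "450,2018-01-01 00:05:00,79\n"
)
citibike_reader = csv.DictReader(source_file)

copied_rows = []  # illustration only, not part of the script in the video
for each_row in citibike_reader:
    if each_row["start station id"] == "72":
        new_row = []
        # fieldnames is the list of column headers, in file order.
        for a_column in citibike_reader.fieldnames:
            # Add the current row's value for this header to the end of new_row.
            new_row.append(each_row[a_column])
        copied_rows.append(new_row)

print(copied_rows)
```

Because the inner loop follows the order of `fieldnames`, the copied values come out in the same column order as the original file.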
and so now once I've done that the only
thing that I have to do by the time I
get to the end of this and again the way
that I get out of this for loop right so
all I'm doing in the for loop is making
the copy of that row when I get finished
with that all I have to do is outdent
again so that I have the same left
margin as that for loop and that means
once, I'm just going to say, once the loop is
done, write my new row to my
output file, okay, so at this point I can
say, well, all right, for our output file we're
going to use the method citibike writer
writerow, citibike writer is the
sort of facilitated version of our
output file, or access to our output
file, so I'm going to say writerow, okay,
and what am I writing? well, I'm just
going to write the
contents of new row, right, and that's really
about it, so I think
the syntax for
the for...in can be a little bit hard to
get one's head around at the beginning, I
know I actually struggled with this a
lot when I started with Python because
it was like, how does it know that
each row stands for a given row? and it's
because the for...in only works on lists
of things so citibike reader when we use
that when we create that dictionary
reader okay what it does is it actually
treats our rows as a list of rows and a
for...in knows to treat the second
ingredient right the thing that appears
after the in always has to be some kind
of list it can be a list of letters it
can be a list of numbers it can be a
list in this case of rows of data and
whatever appears here is just the label
that it gives each row as it's going
through so it says look for each row in
this list of rows you're gonna do some
stuff the first thing you're gonna do is
check and see if the start station ID in
that given Row is 72 if it is make an
empty row go through each of the headers
in our list of headers and add to the
end of
that new row the value that appears
in the current row
at that column header, and when
you're done with that, right, and again
that being finished is signaled by the
outdent to be level with the for, it
says look, when you're done with
that, take that new row that
we've just used to create a
copy and append it, well,
write it to our output file, we are
actually appending it to the file, so it
works in each case because in this case
we're talking about a list of rows in
this case we're talking about a list of
column headers but it doesn't matter to
the for loop what type of thing it is a
list of, it just knows that
this is
a list and it can deal with each item in
the list in succession and that's just
how it operates always and by the way
that's how it operates in all
programming languages, sometimes the way
that we get to that
point is a little bit different
but basically this is what you're gonna
see in all programming languages now the
final thing that I want to do is I'm
going to close all my open files so the
only things that I need to worry about
closing are things that I have used an
open command with, so in this case that's
source file and output file, and so
I'm just going to say source file close, and again
I'm doing this you notice that I'm all
the way back left left justified to the
margin and the reason why is because I
don't want to close things I don't want
to close anything in the middle of my
for loop right I want to make sure that
I've gone through the entire file before
I close out of anything, and so again the
way that I signal that
this is only to happen once this for
loop is complete is by bringing it back
to that left margin so I'm lining it up
with that original for loop so if all
has gone well we'll find out I should be
able to run this and I guess we'll see
what happens
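
Putting the pieces together, a minimal sketch of the whole loop, write and close sequence could look like the following. It is not the exact script from the video: in-memory strings stand in for the files that the real script gets from open(), the data and headers are invented, and no header row is written to the output here.

```python
import csv
import io

# Hypothetical stand-ins for the files the real script opens with open().
source_file = io.StringIO(
    "tripduration,starttime,start station id\n"
    "300,2018-01-01 00:01:00,72\n"
    "450,2018-01-01 00:05:00,79\n"
    "600,2018-01-01 00:09:00,72\n"
)
output_file = io.StringIO()

citibike_reader = csv.DictReader(source_file)
citibike_writer = csv.writer(output_file)

for each_row in citibike_reader:
    if each_row["start station id"] == "72":
        new_row = []
        for a_column in citibike_reader.fieldnames:
            new_row.append(each_row[a_column])
        # Outdented to line up with the inner for: runs once the copy is complete.
        citibike_writer.writerow(new_row)

# Only needed for the in-memory stand-in, so the result can be read
# before closing.
written_text = output_file.getvalue()

# Back at the left margin: these run only after the outer loop has finished.
source_file.close()
output_file.close()

print(written_text)
```

The indentation carries the logic: `writerow` sits inside the `if` but outside the inner loop, and the two `close()` calls sit outside everything.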
oh so there we go it looks like I have
an error it says invalid syntax what did
I do
oh, I had an extra
period there right I don't need that
period so that's a great example if you
get an invalid syntax or sometimes you
might get something is undefined the
first thing you do is go to the line
number so it was a little hard to see
here but it said line 37 it also showed
me the code in question and of course
when I pass ingredients I don't put a
dot before it I just pass the
ingredients so let's try that again okay
now it might look like things have gone
wrong here because my cursor is just
hanging and hanging and hanging this is
actually, I like to describe Python as a
no news is good news kind of
programming language, you see now that
it's returned to what we call the
command prompt, right, so that's
the sort of location and the
dollar sign, and I can type
again when it's in that blinking cursor
mode what it means is it's working so if
you don't get an error it means
everything is going fine it might take a
little while in this case it took you
know 10 seconds or so but that means
it's doing its work and so now we can
actually check and see if our output
looks the way we think it should so I'm
going to go ahead and open this in LibreOffice,
which is just like Microsoft Office
basically and in fact it looks like we
have had some good success here so it
has pulled out every single start
station that has a value of 72 we have
run this on the entire data set
okay so rather than again dealing with a
hundred thousand rows at a time we've
actually managed to do this on a few
million rows at least in about 10
seconds, so hopefully this illustrates
to you all the power
of using Python for this, I'll point out
that Python is not something that we use
generally to do exploratory work we
might if we want to see how frequently
something appears in an entire data set
and we don't have a tool that can do
that easily, but almost always
I find that I go to a tool like
OpenRefine first to kind of get a
feel for the data, be able to look at it,
mess around with it a little bit, and
then once I kind of know what I want to
accomplish for example in this case
filtering the data then I come to Python
and write a script that will allow me to
do that on a much greater quantity of
data in a much shorter period of time so
that's all I look forward to seeing you
all this week and working on this
together I hope everyone had a great
holiday and I will see you soon

Looping through our data one row at a time, testing it for a particular value and, if it is found, copying that row to our new file

Contributor: Susan McGregor

Video 3 of 3