Importing Data and Writing Headers

Okay folks, so now the task ahead of us is really just to translate
our pseudocode, the commented pseudocode at the top of our file that
describes what we're trying to do, into actual Python code. Obviously
some of this is going to be new to you all, but hopefully you'll find
it reasonably straightforward, and of course we'll be going over all
of this in class.

The first thing I need to do actually relates a little bit to step
two. I want to open my source data file, but before I do that I want
to use a library. A library is basically like a recipe book, or a set
of recipe books. One of the great things about using a programming
language like Python is that lots and lots of people use it for lots
of different things.
And given how common comma-separated value files are, lots of people
have also used it to deal with CSV files. Over time, people have
compiled libraries, or recipe books, of recipes for dealing with CSV
files. That makes our lives a little bit easier: we don't have to
write every single line of code to deal with these files, because
other people have written that code ahead of time and shared it
publicly.

But because there are so many different libraries out there, we have
to be specific with our program and tell it which ones it should make
sure to have available; otherwise it would take ages and ages for the
program to load. Specifically, we are going to tell it to import the
csv library. This is pretty straightforward: Python is designed to be
a little bit like English, so hopefully it makes sense that when you
want to bring in a library, you import it. This is going to give us
access to some very useful recipes, which are called methods or
functions.

Now that I've done that, the first thing I want to do is open my
source data file. Now,
every time I take an action in my code, I want to put the results in
a kind of labeled space. We talked about this with the creation of
variables. A variable is really just a label, and it's a label for
what I think of as a box in memory. What actually happens when you
create a variable is that the computer sets aside some memory, labels
it with the variable name you have chosen, and puts into that box
whatever you tell it to put there.

The way we do this is to create the label first. I'm going to call
this one source_file. You'll notice that because the variable name is
a label, I want it to be descriptive. I'm calling it source_file
because it holds the result of opening my source data file, as in
step one here. The equals sign is
the action that tells the computer to put whatever comes on the
right-hand side into the box with the label that appears on the left.
In other words, it applies the label source_file to whatever I put
inside that box, which is going to be whatever appears on the right
here.

In this case I'm going to say open. Again, open is just a function
that's already available in Python, a recipe that all Python programs
automatically have available. I'm going to say open and then give it
my file name, which handily I have copied up here. There are actually
two ingredients that I need to pass to it: the name of the file that
I want to open, and a second ingredient that lets it know what I want
to do with the file. What I want to do is read the file, so that
second ingredient is going to be an r. In this case I'm also using a
U, which means universal. It basically has to do with differences
between Macs and PCs and the way they save carriage returns. It's not
interesting; rU will take care of it.
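
As a rough sketch, the import and open steps described above might
look like the following. The file name sample_citibike.csv and its
handful of columns are made-up stand-ins so that the example runs on
its own; they are not the real course data. One caveat: the U flag was
deprecated in Python 3 and removed in Python 3.11, so this sketch
assumes plain "r" together with newline="", which is what the csv
module documentation recommends for files handed to its readers and
writers.

```python
# Tell Python to make the csv "recipe book" available.
import csv

# Made-up stand-in for the real source data, so this sketch runs on its own.
with open("sample_citibike.csv", "w", newline="") as sample:
    sample.write("tripduration,starttime,stoptime,start station id,gender\n")
    sample.write("634,2015-09-01 00:00:23,2015-09-01 00:10:57,72,1\n")

# Create a variable to hold the result of opening the data file.
# First ingredient: the file name. Second ingredient: "r" for read.
source_file = open("sample_citibike.csv", "r", newline="")

# Peek at the first line to confirm the file really opened.
first_line = source_file.readline()
print(first_line.strip())

source_file.close()
```

The label on the left of the equals sign (source_file) now refers to
the open file object produced by the recipe on the right.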
So I'm going to say: create a variable to hold the results of opening
my data file. Okay, so now I've created my source_file. The next
thing I want to do, as I put here, is use the csv library. Again,
I've already imported the csv library, but I want to make use of one
of its recipes in order to make my source data easier to handle.
Actually, it's not just my source data; it's going to make my output
data easier to work with as well.
So what I'm going to do is create another variable, and I'm going to
call this one citibike_reader. Again, I've created a label for a box
in memory, and into it I'm going to put the results of running a
recipe from the csv library. (Sorry, I'm hitting escape to get rid of
a message; it's an error and it's annoying.) So the first thing I'm
going to type here is csv, and I'm going to ask it to create a reader
that will deal with the contents of source_file. The name of the
recipe I'm using from the csv library is DictReader, which is short
for dictionary reader. A dictionary is kind of another word for a
list. It's not precisely a list, since its values are looked up by
name rather than by position, but that's essentially how it functions
in this case. So I'm saying: create a dictionary reader from
source_file.
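
A minimal sketch of this step, under the same assumptions as before
(the sample file and its columns are invented for illustration), shows
what "dictionary" means in practice: each data row can be looked up by
its column header.

```python
import csv

# Made-up stand-in for the real source data, so this sketch runs on its own.
with open("sample_citibike.csv", "w", newline="") as sample:
    sample.write("tripduration,starttime,stoptime,start station id,gender\n")
    sample.write("634,2015-09-01 00:00:23,2015-09-01 00:10:57,72,1\n")
    sample.write("1547,2015-09-01 00:01:10,2015-09-01 00:26:57,480,2\n")

source_file = open("sample_citibike.csv", "r", newline="")

# Create a reader that deals with the contents of source_file.
citibike_reader = csv.DictReader(source_file)

# Each data row comes back keyed by the column headers; the header row
# itself is not included among the data rows.
rows = list(citibike_reader)
print(len(rows))
print(rows[0]["start station id"])

source_file.close()
```

Note that every value arrives as text, so a station ID read this way
is the string "72" rather than the number 72.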
Now again, why am I doing this? It's actually not dissimilar to the
difference we saw between working in Excel and working in OpenRefine.
If you remember, in Excel, when we were doing our math, one of the
things we constantly had to take into account was the presence of the
header row. When we wanted to know how many rows of data we had, we
always had to subtract at least one because of that header row,
whereas in OpenRefine the header row was present but was not muddled
up with our actual data, the numerical data we were interested in
looking at. One of the big things this DictReader recipe does for us
is that kind of separation: it separates the header row from the rest
of the data. We can still access that header row, and we are going to
need it, because of course we're going to want to put it in our
output file. We're also going to use it to access the data itself.
For example, when we get to checking the start station ID, we're
going to do that with reference to start station ID as the header for
that column. But we want it out of the way, and the DictReader does
this for us automatically. So from
here on out we're going to work with the citibike_reader data,
because it has transformed my original source data, which I opened
straight from my CSV, in much the same way that OpenRefine transforms
the data: taking that header row and keeping it available but out of
the way of the actual data.

I can test this, because when it takes that header row out of the way
it gives it a label, and it labels it fieldnames. So I can check that
this has worked correctly by running a print here and making sure I'm
actually getting the header row I expect. I know that my header row
is trip duration, start time, stop time, start station ID, et cetera.

So I go to my terminal here. Another trick for dealing with the
terminal is that I can have it repeat the last command I typed by
just hitting the up arrow. Instead of having to retype python3 and
then my file name every time, I can just hit up, and when I hit
return, because I have this print statement in here, it will output
to the terminal. I can see that it does indeed seem to be getting
what I expect: trip duration, start time, stop time, et cetera, all
the way through to gender.
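
The check described above can be sketched as follows. As before, the
sample file is a made-up stand-in with only a few of the columns, so
the printed list is shorter than the real header row would be.

```python
import csv

# Made-up stand-in for the real source data, so this sketch runs on its own.
with open("sample_citibike.csv", "w", newline="") as sample:
    sample.write("tripduration,starttime,stoptime,start station id,gender\n")
    sample.write("634,2015-09-01 00:00:23,2015-09-01 00:10:57,72,1\n")

source_file = open("sample_citibike.csv", "r", newline="")
citibike_reader = csv.DictReader(source_file)

# The reader keeps the header row available, but out of the way,
# under the label fieldnames.
header_row = citibike_reader.fieldnames
print(header_row)

source_file.close()
```

If the printed list matches the column names you expect, the reader
has been set up correctly.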
So it looks like setting up my dictionary reader worked correctly,
which is great, and now I can start thinking about constructing my
output file and what that might look like. Of course I am going to
want that header row in my output file. So I'm going to go ahead and
create an output file, and the interesting thing here is that I
actually use the open command again. When I say open, I give it the
name of the file that I want it to open, which I can just make up.
It's very similar to using Save As in Microsoft Word or something
like that, where I just give it a file name. I want to give it a .csv
at the end because it's going to be a comma-separated value file. So
I'm going to say open and give it a descriptive file name, something
along the lines of citibike_station_72.csv. But now the second
ingredient, instead of being an r for read, is going to be a w for
write, because this file doesn't exist yet. In other words I'm
saying: create a file. It's going to be empty to begin with. When you
create a new file in something like Microsoft Word it's originally
untitled and you have to give it a title later; here we have to
provide the title right at the time that we create it. And we need to
tell it that we want this to be something we can actually output data
to, so we're going to have that w.

So we created our reader, that's the DictReader, and we also want to
create a writer
for the output file. Again, this is because, rather than writing
everything manually, it's going to make it a little bit easier for us
to write, for example, an entire row of data all at once to our
output file, rather than having to code value, comma, value, comma,
value, comma, and so on. So I'm going to say citibike_writer. You can
see that I have a kind of convention going here: I'm doing reader and
writer and making those parallel, so I have source_file and
output_file, citibike_reader and citibike_writer. This is just a
convention that I like, but it is handy to start developing
conventions like this for yourself, because it will make things
easier when it comes time to name your variables if you get in the
habit of using those kinds of parallel constructions.

In this case I'm again going to the csv recipe book, but this time
the recipe is just writer. And what am I creating the writer with?
I'm creating it with that new file that I made and put into the box
labeled output_file. So I've created the space for that file, and now
I'm going to convert it to a writer. I can test whether this works by
trying to write a row to it. The recipe for that, I think, is
writerow, let's see, and the row I want to write is the header row. I
know I erased that print statement, but it's going to be those
fieldnames. I want to write the header row to my output file because
that's obviously the first row in my output file as well; I don't
want to lose that data just because I'm creating a subset of my file.
Yeah, let's just see how this goes.
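
Putting the output side together, a hedged sketch of these steps might
look like this. Both file names are hypothetical (the sample source
file is invented, and sample_station_72.csv merely stands in for
whatever descriptive name you choose), and the files are closed
explicitly so that the written row is actually saved to disk.

```python
import csv

# Made-up stand-in for the real source data, so this sketch runs on its own.
with open("sample_citibike.csv", "w", newline="") as sample:
    sample.write("tripduration,starttime,stoptime,start station id,gender\n")
    sample.write("634,2015-09-01 00:00:23,2015-09-01 00:10:57,72,1\n")

source_file = open("sample_citibike.csv", "r", newline="")
citibike_reader = csv.DictReader(source_file)

# Create the (initially empty) output file; "w" means write.
output_file = open("sample_station_72.csv", "w", newline="")

# Create a writer that deals with output_file.
citibike_writer = csv.writer(output_file)

# Write the header row to the output file as its first row.
citibike_writer.writerow(citibike_reader.fieldnames)

# Closing the files makes sure everything written reaches the disk.
source_file.close()
output_file.close()

# Read the first line back to confirm the headers were written.
with open("sample_station_72.csv", "r", newline="") as check_file:
    header_line = check_file.readline().strip()
print(header_line)
```

The parallel names (source_file and output_file, citibike_reader and
citibike_writer) follow the naming convention described above.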
So again I'm going to come back to the terminal and run this. Whoops.
Did that work? Let's see: it looks like it did the same thing. So
let's double-check now whether this worked correctly. I should have
an output file, and I'm just going to go ahead and open it. I'll say
okay, and voila, I actually have a CSV that has my headers in it.
This is exactly what I was going for. This is a great start, and I'm
very pleased with this.
From here on out I'm going to go on with the rest of my script. Now
that I've gone through steps one through three and created a reader
and a writer, I'm going to go through the source file one row at a
time. I could actually be even more detailed with this outline. For
example, I skipped the part where I write the headers. You'll want to
go back and add those things in as you go. The outline is a good
place to start, but you should always be updating it if your plans
change, and it should have as much detail as possible, because it's
all just going to help you understand what's happening when you
return to this. Likewise, I would want to be writing inline comments
for every single line of code. In this case I would say: create a
reader to deal with my source file, and so on. That's something you
all will look at as well. For the time being we're going to pause
here. When we come back, we're going to start going through our file
and filtering for our start station ID.

Using the csv library to manipulate our new and existing csv data files

Contributor: Susan McGregor

Video 2 of 3