Adding Headers and Using Shell Scripts

Okay folks, before we go on to our shell scripts, there is one more thing that we need to do, which is to write some headers to our data file. We could do this manually after creating everything, but that's still kind of a pain, because you have to poke around for things quite a lot. It's always nice to do these things in a holistic way.

The issue that we face here is that, unlike our first example, where we were working with an existing CSV and could use some of the CSV recipes to pull the header row out of it, we don't have an existing header row in this case. So basically we need to grab the header row from the first file that we load, but we need to grab it only once. We could grab it from each file or each row, but that would be pretty messy.

So instead, because we're processing a lot of files and we only want one header row for all of them, we're going to do something up here that's a little bit different. This is new: it's the first time that we create a boolean of our own, and basically we're just asking, does this file still need a header row? In other words, has the header row been written yet?

Let's recall what we're doing here (and of course you will all be commenting heavily, so I'm sure you will recall). We're saying: go through every file, and for each file, open it and load the data. Now, I only need to pull the headers once, even though I might have two or three or ten files. But where do I get those headers? Well, the headers are going to be the names of the attributes in this case. I'm looking at the attributes down here, but if I just create a header row in here, I'm going to end up writing a header row every single time, and I don't want to do that. So instead I have to say: only if I still need a header, that is, only if needs header equals True, do the following.
So: if needs header equals equals True (remember, we use the double equals sign to compare), meaning needs header has a value of True, then I'm going to create a new header row, which again is just going to be an empty list.

I'm going to append my first element to header row. If you remember, the first thing I'm doing for each row here is appending the execution time, so my first column is actually the execution time of the file. I'm going to make up a name for it: file date, the date the file was loaded. The rest is going to be exactly the same as what I do here, except that rather than looking up the value, I'm just going to append the attribute itself. So I say: for attribute in station, add the name of the attribute to my header row list. When I'm done with that, I write that header row to my output file.

And now, very important: once this has been done, I don't want to do it again. Otherwise every other row in my data is going to be just the names of the column headers. So what I do now is say, by the way, I no longer need a header, and I set needs header to False. Because needs header was established before this for loop, it doesn't get reset to True on each pass.
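Written out in Python, the header logic described here might look like the minimal sketch below. The sample feeds, the key names executionTime and stationBeanList, the column name file_date, and the in-memory output are all assumptions made for illustration; this is not the script from the video.

```python
import csv
import io

# Made-up stand-ins for two downloaded feed files, already parsed from JSON.
# The key names "executionTime" and "stationBeanList" are assumptions.
feeds = [
    {
        "executionTime": "2014-01-01 12:00:00 PM",
        "stationBeanList": [
            {"id": 72, "stationName": "Example St", "availableBikes": 5},
        ],
    },
    {
        "executionTime": "2014-01-01 12:10:00 PM",
        "stationBeanList": [
            {"id": 72, "stationName": "Example St", "availableBikes": 3},
        ],
    },
]

output = io.StringIO()  # stands in for the real output file
writer = csv.writer(output)

needs_header = True  # set once, before the loop over files

for feed in feeds:
    for station in feed["stationBeanList"]:
        if needs_header:  # same test as: needs_header == True
            header_row = ["file_date"]  # made-up name for the first column
            for attribute in station:
                header_row.append(attribute)  # the attribute name, not its value
            writer.writerow(header_row)
            needs_header = False  # never write the header again

        row = [feed["executionTime"]]
        for attribute in station:
            row.append(station[attribute])  # the value this time
        writer.writerow(row)

print(output.getvalue())
```

Because the flag lives outside both loops, the header block runs for the first station of the first file and is skipped for everything after it.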
The only place that it gets changed is right here. So what happens is this: I've created this variable and I've said the value is True; I put the value True in the box labeled needs header. I come through the loop the very first time, I get my first file name, I load the data from it, I go through each station, and I hit this condition. Does this equal True? Well, obviously the first time it does, so I do all of this, and then I say, by the way, make this thing False: replace what's in the box labeled needs header with the value False. Then I go about my business processing my first row of data from my first file, and all is well in the world.

What's important to remember is that once I've finished that first file and come back up here, I get the next file name, but every time execution comes down here, the value of needs header will now be False. So this block is only going to execute one time.

I can test this again. I run my script, which doesn't take very long because I'm not parsing that much data, and I open the result in Microsoft Excel. You can see that I have successfully added the attribute headers.

What's interesting is that this also shows us all of the fields that are possible in our Citi Bike station feed but don't actually have values in them: city, altitude, street address, postal code (which is kind of a bummer, since that would probably be useful to us), and whether it is a test station.
Well, here there is sort of a street address, but it's actually just the same as the station name. The feed also has a landmark field, which would be kind of interesting to have. Then there are the available bikes, the ID, and the location, and location also does not have a value.

So that's how we can go about adding header rows when we don't have an original CSV to work with. You could also do this manually: I could just say headers equals and then type in a list by hand.
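The manual alternative might look like this sketch. The column names are placeholders chosen for illustration, and the in-memory output stands in for a real file; a hand-typed list has to match the order in which the values are written to each row.

```python
import csv
import io

# Hypothetical column names, typed in by hand.
header_row = ["file_date", "id", "stationName", "availableBikes"]

output = io.StringIO()  # stands in for the real output file
writer = csv.writer(output)
writer.writerow(header_row)  # written once, before any data rows

print(output.getvalue())
```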
As we demonstrated in class, that's always an option, but this is a way to get the headers out of the data that I'm already processing.

Now that we've solved that, the question is how to streamline the process of going to the website, copying the data off of this URL, saving it in a TextEdit file, and all that stuff. Not only is it tedious, it's hard to do with good precision, because often we want to do something exactly every ten minutes, or exactly every hour, or every day at exactly the same time. It would be kind of a damper on things if you had to make sure that at exactly noon every day you were there to download this file.

So what we're going to look at is a basic example of a shell script. I'm just going to sort of give this to you, but I think you'll find it useful, and it's certainly something that you can explore on your own beyond this class if you want to. If you're doing some basic data collection, you can always run this kind of thing in the background, which is nice. So I'm going to say new file.
Oh no, I'm not going to say new file, because that's going to be really annoying. I'm going to say new from template, and I'm going to lie and pretend that I'm making a Python file, but I'm actually not going to make it a Python file. I can't just save it, so I'm going to save as. Remember that the file extension is just something that the computer uses to guess which program to run a file with. I'm going to call this JSON download dot sh; dot sh is the file extension for shell scripts.

Now, we're going to use a different kind of for loop in this script: we're going to do something a certain number of times. There are other ways to measure this, but say I want to do something five times. That's how I specify that I want to download the file five times over the course of five minutes, or maybe over a minute, because we're not going to wait five minutes in a video.

So here we go. I start by saying for. You see it's a for syntax again; the for loop concept is really universal, and shell scripts have been around for decades. In this case I have to set up something we call an iterator. It's basically the same thing as our for-in loop, but we're much more specific. I say i equals 1, so start at the number 1. Then I say keep going while i is less than 6, so I'm going to get 1, 2, 3, 4, 5. And then I specify very particularly that I want to add 1 to the value each time. So i is a box; again, it's just a variable. Create a box, label it i, put the value 1 in it, and then increment the value. We're moving through the numbers in exactly the same way that we move through a list in Python; it's just written a little bit differently, a little bit more explicitly. Then I say do; again, this is just how it works with shell scripts.
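As a minimal sketch, a counting loop of this kind can be written in bash as follows. The echo line is only there to make each pass visible. The double-parenthesis form is a bash feature, so on systems where sh is a different shell, such as dash, the script would need to be run with bash instead.

```shell
#!/bin/bash
# Start i at 1, keep going while i is less than 6, add 1 after each pass.
# The semicolon separates the loop header from "do" when both share a line.
for (( i = 1; i < 6; i++ )); do
    echo "pass number $i"
done
```

The loop body runs five times, for i equal to 1 through 5, and stops as soon as i reaches 6.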
The command that I'm going to use is called curl. We write it in all lowercase, and the name is short for client URL. That's what's actually going to grab all of the stuff from the URL that we pass to this script.

Now I need to tell it where we want it to save our files. The first thing I put in is the name of the folder, which I think is JSON downloads. Yep, JSON downloads. Then I give it a file name; I want each of these file names to start with citibike data. And now I have a problem: we need somehow to make these file names unique, because if I just call the file citibike data dot json, every time the loop executes it's going to overwrite my existing data file.

What I can do is use the fact that i is a variable here. I put a dollar sign in front of it inside the for loop (again, this is particular to shell scripts), and basically what I'm saying is: take the value that is in the variable i, which the first time will be 1, the second time 2, then 3, 4, 5, and make that part of the file name that you write.
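Here is a small sketch of that substitution on its own. The folder and file name spellings (json_downloads, citibike_data_) are assumptions, since the exact names are only spoken aloud.

```shell
i=3  # pretend we are on the third pass through the loop
file_name="json_downloads/citibike_data_${i}.json"
echo "$file_name"
```

The braces around i are optional in this position, but they make it unambiguous where the variable name ends.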
So I've now told it where the file should go. Next I tell it where to get the file from, which is just this URL right here. And importantly, I should have put curl dash o. The dash o stands for output, so the way this reads is: curl, to the output file called JSON downloads, citibike data, blah blah blah, dot json, from this URL.

Then I say sleep. The command, the recipe, is called sleep, and in this case I'm going to say sleep 10. The default unit is seconds, so it's going to sleep for 10 seconds and then do it again. So every 10 seconds, 5 times, this script is going to download a copy of the data, give it a new file name, and save it in my folder. Obviously the data is not going to change much in 10 seconds, and you would normally want to put in a longer wait, but for us, just to see that this is working, 10 is enough.

And then finally we do something that seems very sensible to me: we type done. That's how we tell it that the loop is finished. And actually I left something out here, which is that I need to have a semicolon.
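Put together, the script might look something like the sketch below. FEED_URL is a placeholder rather than the real feed address, and the folder and file names are assumed spellings. Wrapping the loop in a function named download_feed, and using mkdir -p to create the folder if it is missing, are additions for illustration so that the folder and the pause are easy to change. The last line is left commented out so that nothing is downloaded until you choose to run it.

```shell
#!/bin/bash
# Sketch of the download loop. FEED_URL is a placeholder: replace it with
# the real address of the station feed.
FEED_URL="https://example.com/stations.json"

# download_feed FOLDER SECONDS
# Saves five numbered copies of the feed into FOLDER, pausing SECONDS
# after each download.
download_feed() {
    local folder="$1" pause="$2" i
    mkdir -p "$folder"  # addition: create the folder if it does not exist
    for (( i = 1; i < 6; i++ )); do
        # -o names the output file; $i keeps each file name unique
        curl -o "${folder}/citibike_data_${i}.json" "$FEED_URL"
        sleep "$pause"  # sleep counts in seconds by default
    done
}

# Five downloads, ten seconds apart:
# download_feed json_downloads 10
```

With a pause of 10, the whole run takes a little over 50 seconds; for real data collection the pause would be much longer, for example 600 for every ten minutes.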
Again, I know this is different from what we've been looking at. Shell scripting is unusual, but because it's useful I want you all to have this introduction.

So how do we run this? I can actually run it right from my same terminal. If I do an ls, I see that I now have my JSON download dot sh file. To run it I type sh. Again, sh is the name of the interpreter to use: when we use Python we type python, and with shell scripts we type sh. Then I give it the file name, JSON download dot sh.

If things are working well, we can see exactly what it's doing. It prints percent total, percent received, transferred, average download speed, average upload speed, et cetera. So it's basically giving us information about each file that it creates, every single time through the loop. We can see the file actually isn't the same size every time, which is sort of interesting. If we wait our requisite 50 seconds we should get all of them, but I can check in the meantime, and you see these new files appearing in my folder. And there's the fifth one.

So hopefully this gives you a sense of some of what can be done with a combination of shell scripts and Python for parsing both large data sets and multiple data files into one useful spot. We'll obviously be going over this in class, and I look forward to seeing you all then.

A tutorial on adding a header row to a CSV built from a JSON source file, plus a quick demo of using shell scripts to automate the data downloading process.

Contributor: Susan McGregor

Video 1 of 3