Advanced Command Line Exercises#

Setup#

When we finished our last set of Command Line Exercises, we had moved all the CSVs that had data from Thursdays and Fridays into their own folder. In this exercise, we’ll continue to work with that data, and we will assume that the folder currently has the following structure:

NYC_311calls_2018:

  • 311_SR_Data_Dictionary_2018.xlsx

  • CE-20170824.pdf

  • NYC311_column_names.txt

  • README.md

  • raw data [Folder containing CSVs from all days except Thursdays and Fridays]

  • thursdays_and_fridays [Folder containing all CSVs from Thursdays and Fridays]

If this is NOT the directory structure that you currently have, then the first exercise is to move around your files and folders until they have this structure!

Exercise 1: Checking Files#

When doing data work, it is important to always check your assumptions. For example, we think that we have data from every Thursday and Friday in the thursdays_and_fridays directory. But are we sure?

To check, let’s quickly google “how many thursdays were there in 2018” and “how many fridays were there in 2018” (there are 52 weeks in a year so it’s probably 52 of each, but important to be sure as 365 is not evenly divided by 52).

Then let’s count the number of files we have in the thursdays_and_fridays directory.

Hint: to do this, we need one command to list out the names of all the files in thursdays_and_fridays (ls), a second command to count the number of files reported (wc), and some way to pass data from one to the other.

Second hint: you probably want to use the 1 flag for ls. The 1 flag tells ls to make sure it only prints one file name per line.

If you have trouble interpreting the output you get, you might have to check the wc documentation.

Exercise 2: Combining files#

Having all these individual files is not very efficient. Instead, we’d like to merge all these files into a single large CSV file we can import into Python and analyze.

As you may recall, the cat command is often used to print the contents of a file out to the screen. But cat actually stands for concatenate: if you give the command multiple files, it will join them together one after the other.

Use cat to merge all the files in thursdays_and_fridays into a single CSV called thursdays_and_fridays.csv saved to disk.

Exercise 3: Viewing your Results#

Great! Now we have a single CSV file with all our data for Thursdays and Fridays. But there’s a problem: none of the columns are labele!.

Take a look at the first few lines of our merged CSV file using either less or head. Head is a new tool! It’s like cat in that it just prints the contents of a file to the screen, but unlike cat it only prints a few lines. There’s also a command called tail that shows you the last few lines of a file.

Try also confirming this by openning thursdays_and_fridays.csv in Excel. As you will discover, it’s a very slow process. This is another advantage of the command line: when working with big files, being able to quickly look at a few lines of a file in the command line is very efficient.

Exercise 4: Editing#

In NYC311_calls_2018 is a file called NYC311_column_names.txt. It has all the column names for our CSVs.

We can add these column names by concatenating the column name file to the top of our CSV, but only if it is properly formatted, which it is not – when the column names were exported, they included square brackets at the start and end of the column names:

cd ~/desktop/NYC_311calls_2018
head NYC311_column_names.txt
["Unique Key", "Created Date", "Closed Date", "Agency", "Agency Name", "Complaint Type", "Descriptor", "Location Type", "Incident Zip", "Incident Address", "Street Name", "Cross Street 1", "Cross Street 2", "Intersection Street 1", "Intersection Street 2", "Address Type", "City", "Landmark", "Facility Type", "Status", "Due Date", "Resolution Action Updated Date", "Community Board", "Borough", "X Coordinate (State Plane)", "Y Coordinate (State Plane)", "Park Facility Name", "Park Borough", "Vehicle Type", "Taxi Company Borough", "Taxi Pick Up Location", "Bridge Highway Name", "Bridge Highway Direction", "Road Ramp", "Bridge Highway Segment", "Latitude", "Longitude", "Location"]

Since this is a very small editing task, let’s fix it using nano. Use nano to open the column name file, remove the square brackets at the beginning and end of the column names, and save the result.

Then concatenate your column names with your CSV to create a new file called thursdays_and_fridays_w_columnnames.csv.

Exercise 5: Cleaning Up#

Once you’re done creating this file, let’s clean up by deleting the raw data folder and all its contents.