Welcome to the Command Line Basics Exercises!

In this exercise we’re going to get some practice navigating and exploring files and folders from the command line by looking at some data from New York City’s 311 system. 311 is a citizen hotline set up by the city of New York for reporting non-emergency issues to the city. 311 takes calls about all sorts of issues, from noise complaints to issues with street lights to complaints about restaurant hygiene violations and rodent sightings.

You can find the 311 data we’ll be working with in a zipped file called NYC_311calls_2018.zip here. Please download the file and place it somewhere easy to remember (desktop, downloads, etc.).

Exploring Files

Once you’ve unzipped NYC_311calls_2018, use cd to navigate into the folder so it is now your working directory. Then use ls to look at what’s in the folder. What you see should look something like this:

$ ls
311_SR_Data_Dictionary_2018.xlsx README.md                        raw data
CE-20170824.pdf                  just_the_letter_a.docx
NYC311_column_names.txt          just_the_letter_a.txt
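
As a minimal sketch of the cd and ls steps, the commands below build a throwaway practice folder instead of using the real download, so the temporary location and the empty README.md are stand-ins. The listing above shows one thing to watch for: the raw data folder has a space in its name, so it has to be quoted when you cd into it.

```shell
# Build a throwaway practice folder (a hypothetical stand-in for the unzipped data).
practice=$(mktemp -d)
mkdir -p "$practice/NYC_311calls_2018/raw data"
touch "$practice/NYC_311calls_2018/README.md"

# Make the folder the working directory, then list what is inside it.
cd "$practice/NYC_311calls_2018"
ls

# "raw data" contains a space, so quote it (writing raw\ data also works).
cd "raw data"
pwd   # print the current working directory to confirm where we are
```

Without the quotes, the shell would read raw and data as two separate arguments to cd.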

Up until now, we’ve just been moving around at the level of the filesystem, seeing file names but not their contents. But if a file is a plain text file, we can also look at its contents. There are actually a few ways to do this, but the two most used options are cat (which will print the contents of the file to your screen), or less (which will open a small program to allow you to read through the document in a controlled manner). cat is quicker, but if you use cat with a big file, the whole file will just print out to your screen and you’ll end up overwhelmed (though you’ll be fine for a small file here).

If you use less, get to the end of a file, and can’t get out, press q.
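
One way to decide between cat and less is to check how long a file is before printing it. The sketch below is only an illustration: notes.txt is a made-up file created in a temporary folder, and wc -l (which counts lines) is an extra tool not covered above.

```shell
# Create a small sample text file in a temporary folder (hypothetical file).
workdir=$(mktemp -d)
printf 'line one\nline two\nline three\n' > "$workdir/notes.txt"

# Count the lines first; tr strips the padding some systems add to wc output.
line_count=$(wc -l < "$workdir/notes.txt" | tr -d ' ')
echo "notes.txt has $line_count lines"

# The file is short, so cat is fine here.
cat "$workdir/notes.txt"

# For a long file you would page through it instead (press q to quit):
# less "$workdir/notes.txt"
```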

Exercise 3

Do as the README.md suggests and read it first with the command cat README.md, then with the command less README.md (press q for quit to get out when you’re done).

If you type cat without a file name, you’ll end up in a weird state where anything you type gets echoed back, and it isn’t obvious how to get out. Command-. on a Mac or Ctrl-C on Windows (Ctrl-C also works in a Mac terminal) is the terminal equivalent of Command-Q: it quits the running process.

Exercise 4

Now let’s do the same with CE-20170824.pdf: run less CE-20170824.pdf. If less asks you a question, just type y.

What happened?! Unfortunately, CE-20170824.pdf is not a plain text file; it is what is referred to as a binary file. This distinction between plain text files and binary files will come up a lot, so let’s discuss it briefly.

The terms “plain text” and “binary” are a little misleading since everything on your computer is stored as 1s and 0s (i.e. binary). What differentiates plain text and binary files is what those 1s and 0s are meant to represent.

In a plain text file, the 1s and 0s of the file encode numbers and letters based on simple, commonly used codes (like ASCII or Unicode). These files also do not contain anything complicated (pictures, media, etc.), and in fact don’t even include information like fonts or formatting. This simplicity makes plaintext files universally compatible and easy to work with, so they are a favorite of programmers. Any code you’ve ever written has probably been saved as a plaintext file.

In a binary file, by contrast, the 1s and 0s encode much more complicated information. In this case, CE-20170824.pdf is a PDF file that includes images, different fonts, careful formatting, etc. As a result, it can only be opened by a PDF reader (like Preview or Adobe Reader) that knows how to interpret the file’s complicated content. If you open it with less, less tries to treat the 0s and 1s as if they were just encoding simple letters and numbers, but since they aren’t, the result is just gobbledygook.

Exercise 5

Let’s actually see the difference between plaintext and binary files. In your folder are two files called just_the_letter_a, one with a .txt suffix, and one with a .docx suffix. Using your normal operating system interface, open both files (assuming you have Microsoft Word installed). You should see that both files include nothing except a lower-case letter “a”.

Exercise 6

You can see the actual 1s and 0s that underlie a file from the command line using the command xxd -b [filename]. First, use this to see what’s in the just_the_letter_a.txt file. (The [] are just there to indicate a place where you put something; don’t leave the [] in.)

What you will see is a counter on the left, a colon, then the actual contents of the file grouped into sets of 8 bits (what’s called a byte). The first is the code for a lower case “a” (01100001). The second is the code that says “this is the end of the current line”. And that’s it! Congratulations, you can now read binary!

(Don’t believe me? You can find the code for the end of a line here, and for “a” here. Go check for yourself: there’s no magic here.)
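
To try this without the downloaded data, here is a minimal sketch that writes its own two-byte file and dumps it. The name sample_a.txt and the temporary folder are made up for illustration, and the fallback to od (a standard tool that prints the bytes in hexadecimal rather than binary) is an assumption for systems where xxd is not installed.

```shell
# Write a file containing only the letter "a" followed by a newline.
tmpdir=$(mktemp -d)
printf 'a\n' > "$tmpdir/sample_a.txt"

# The file is exactly two bytes long: the letter and the end-of-line code.
byte_count=$(wc -c < "$tmpdir/sample_a.txt" | tr -d ' ')
echo "sample_a.txt is $byte_count bytes"

# Show the bits with xxd if it is available; otherwise show hex with od.
if command -v xxd >/dev/null 2>&1; then
    xxd -b "$tmpdir/sample_a.txt"
else
    od -An -tx1 "$tmpdir/sample_a.txt"
fi
```

In the xxd output, the first byte (01100001) is the “a” and the second (00001010) is the line feed character that marks the end of the line.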

Exercise 7

Now use xxd -b [filename] to do the same for the Microsoft Word doc that also encodes just a single letter “a”. Does it look similar?

Exercise 8

And that is why plaintext is so useful: its simplicity makes it nearly universal across both platforms and time.

Be aware that lots of file endings can be used for plaintext files. For example, .csv files are also plaintext. Indeed, it is because they have such a simple format that .csvs are the most used format for sharing tabular data. .md, .txt, .tsv, and other file suffixes are also usually plaintext.
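
Because a .csv is plain text, the same command-line tools used on README.md work on it. The sketch below makes up a tiny CSV (the column names and rows are invented for illustration, not taken from the 311 data) and uses head, which prints only the first few lines and so is a safer first look than cat for a large file.

```shell
# Create a tiny, hypothetical CSV file in a temporary folder.
csvdir=$(mktemp -d)
printf 'id,complaint\n1,noise\n2,street light\n' > "$csvdir/sample.csv"

# Print just the first two lines: the header row and the first record.
head -n 2 "$csvdir/sample.csv"

# Grab only the header row to see the column names.
first_line=$(head -n 1 "$csvdir/sample.csv")
echo "Columns: $first_line"
```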

But just because a file is not plaintext doesn’t mean we don’t want to know what it is!

Use the open command (on a Mac) or the start command (if you’re using a bash shell on Windows). open FILENAME / start FILENAME just asks your computer to do whatever it would do if you double-clicked on FILENAME. So if you type open CE-20170824.pdf / start CE-20170824.pdf, your computer will open the PDF in your default PDF reader.

Similarly, if you type open . or start ., your file explorer will open with the current directory open!

Exercise 9

OK, so CE-20170824.pdf is just a paper someone wrote using this data. Since the name CE-20170824.pdf doesn’t tell us anything about this paper, let’s rename it using the mv command. Recall from DataCamp that mv stands for move, but that while it is moving files it can also rename them. If you “move” something from its current location back to its current location but with a different name, you’ve effectively renamed it! So try renaming CE-20170824.pdf to something more descriptive.
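
Here is a minimal sketch of renaming with mv, run on a throwaway empty file so nothing in your data folder is touched. The names old_name.pdf and descriptive_name.pdf are placeholders, not the names you should use for the paper.

```shell
# Work in a temporary folder with a hypothetical, empty file.
renamedir=$(mktemp -d)
cd "$renamedir"
touch old_name.pdf

# "Move" the file to the same folder under a new name, which renames it.
mv old_name.pdf descriptive_name.pdf

# Only the new name should be listed now.
ls
```

Note that mv will silently overwrite an existing file that already has the new name, so check with ls first if you are unsure.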

Organizing Files

Up till now, we haven’t done anything that wouldn’t have been easier to do using a mouse and a regular graphical user interface. But now let’s suppose we want to analyze the data from 311 calls placed on Thursdays and Fridays to see if city workers are less likely to address problems that are reported on Fridays.

In your normal operating system GUI, open up the raw data folder inside NYC_311calls_2018. As you will see, the folder is full of CSVs (comma-separated-values, a plain-text format for storing spreadsheets), with one file for each day.

Exercise 10

Without using the command line (or another programming language), how would you pull out all the files for Thursdays and Fridays and move them to a new folder? Would your strategy work if you had 10 years of data instead of 1 year of data?




Exercise 11

One of the advantages of the command line is that you can use wildcards (the * symbol) to identify any files with a given pattern. For example, if I wanted to list all the CSV files in raw data from February, I would type ls 311calls_2018_2_*.csv, since all the files from February (month 2) would have the same prefix (311calls_2018_2_) and suffix (.csv). Now, using the mv command and the * symbol, move all the Thursday and Friday files to a new folder. (Hint: you’ll probably need to make a new folder to put the files into first.)
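
To see how a wildcard combines with mv, here is a sketch that uses the February pattern from the example above on a handful of empty stand-in files. The exact file names (in particular, the day number after the month) and the february folder name are assumptions for illustration; check the real names in raw data with ls before adapting it to Thursdays and Fridays.

```shell
# Create a few empty, hypothetical files that follow the prefix/suffix pattern.
sandbox=$(mktemp -d)
cd "$sandbox"
touch 311calls_2018_1_30.csv 311calls_2018_1_31.csv
touch 311calls_2018_2_1.csv 311calls_2018_2_2.csv

# List only the February files: * matches whatever comes after "2_".
ls 311calls_2018_2_*.csv

# Make a destination folder, then move every matching file into it.
mkdir february
mv 311calls_2018_2_*.csv february/

# The February files are now in the new folder; January's stay where they were.
ls february
ls
```

The shell expands the wildcard into the list of matching file names before mv runs, which is why one command can move any number of files.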