First Steps with Bash#

Prerequisites

  • Bash terminal installed on your machine

Objectives

  • Become familiar with terminal execution.

  • Learn regular expression syntax.

Terminal#

Using a terminal doesn’t need to be a daunting experience. Even with minimal exposure and a basic comprehension of core concepts, we can understand simple one-line commands enough to save us hours of tedious file management.

Our goal in this lesson is not to be fluent enough to have a conversation with our computer, but to learn the basic vocabulary enough to ask for directions.

The most common file-system management tasks we will need in the terminal are bulk transport of files and pattern matching.

The basic syntax for commands is cmd -<options> <arguments>. Options are also sometimes called flags and alter the default behavior of a command. You can set default options for basic commands on your machine by creating an alias of the command and storing that alias in a runtime configuration for bash known as .bashrc.

#.bashrc
alias ls='ls -FGh'  # set default flags for ls command

File management#

ls and cd#

View the contents of your current working directory with ls.

~ $ ls

To view hidden files, pass the -a flag to view all files. Hidden files normally begin with a . and are not usually displayed in GUI system file managers. This is a developer’s way of protecting end-users from accidentally deleting important stuff. We will encounter some hidden files with git when we create local repositories.
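For example, listing your home directory with the all flag will reveal hidden files such as the .bashrc mentioned above:

~ $ ls -a  # list everything, including hidden files like .bashrc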

mv and cp#

Moving and copying files follow a simple format.

~ $ mv path/to/original/file.txt path/to/new/file.txt
~ $ cp path/to/original/file.txt path/to/copied/file.txt

For a move command, the original file is deleted from its original location, so be mindful that if you mistype the output directory, you may have trouble finding your file. Neither command gives feedback when it runs successfully, but both throw an error if you attempt to move or copy into a location that doesn't already exist.
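One way to avoid that error, sketched here with placeholder names, is to create the destination directory first and then move the file (mkdir is covered in the next section):

~ $ mkdir -p new/directory && mv -i original.txt new/directory/original.txt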

Tip

Most terminals have a history feature where you can view the last commands entered on the terminal without having to scroll all the way up in your terminal screen. These are commonly accessed by pressing the up and down arrows on your keyboard.
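You can also print recent commands directly with the history built-in; for example:

~ $ history | tail -n 5  # show the last five commands entered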

You can enter interactive mode in either of these commands by passing the -i flag. This will prompt you to confirm before any potentially dangerous overwrites.

~ $ mv -i original.txt new/directory/original.txt

mkdir#

If a directory doesn't yet exist, create it with mkdir.

~ $ mkdir -p a/series/of/new/nested/directories/

Here, the -p flag generates all of the parent directories in the specified path if they do not already exist. We can also generate multiple directories at once with brace expansion shortcuts.

~ $ mkdir -p directory{A,B,10}

The above command generates directoryA, directoryB, and directory10 by supplying each of the values in {} to the rest of the string. We can specify ranges with {1..10} and even compound expansions such as {A..Z}{1..3}. The latter expression expands to A1, A2, A3, B1, B2, …, Z1, Z2, Z3.
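If you are unsure what a brace expression will produce, you can preview it with echo before handing it to mkdir (the patterns below are only illustrations):

~ $ echo directory{A,B,10}  # directoryA directoryB directory10
~ $ echo {A..C}{1..3}       # A1 A2 A3 B1 B2 B3 C1 C2 C3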

rm#

Remove files with rm, enter interactive mode with -i, and delete recursively all files in a directory with -r.

~ $ rm -ri directory* 

Here, the * is a wildcard that matches any characters after the pattern directory, so the command applies to every directory whose name starts with directory. The -r flag deletes a directory along with everything inside it; to remove an empty directory only, pass the -d flag instead.

~ $ rm -d directory*

Find and Replace#

The tool we will demonstrate in this lesson is sed (short for stream editor), but other useful tools are awk and perl.

The basic syntax for sed is as follows:

sed -options 's/find/replace/' input_file

Here, find and replace are placeholders for regular expressions used to match your desired text; the contents of the first find are replaced with replace. sed also accepts piped input, so we can test simple strings with sed to learn its behavior.

What do you think the result of the following command will be?

~ $ echo "the cat in the hat" | sed 's/a/u/'

If the result surprises you, remember that sed replaces only the first match on each line and then moves on. To apply the substitution to every occurrence, add the global flag: sed 's/find/replace/g'.
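Compare the result with the global version of the same command:

~ $ echo "the cat in the hat" | sed 's/a/u/g'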

By default, the result is directed to STDOUT, but we can reroute this to a file with > (overwrite) or >> (append).

~ $ cat input.txt | sed 's/find/replace/' >> output.txt

Regular Expressions#

We can match more complicated patterns. Some examples:

Example     Meaning
^c          only occurrences of c that begin a line
txt$        the exact sequence txt at the end of a line
[aeiou]     any single character in the list a, e, i, o, u
[^aeiou]    any single character except a, e, i, o, u
a?          zero or one of a
a+          one or more of a
a{3}        exactly 3 of a
a{3,}       3 or more of a
.           any single character

Under extended regex syntax (the -E flag), you may also use capturing groups by enclosing part of your matching expression in (). This allows you to rearrange the captured text in your replace expression by referencing each group as a backreference (\1, \2, …) numbered in the order the groups appear.

See also

Regex101: a helpful interactive tool for constructing and testing regular expressions

Example#

Look closely at the following example:

~ $ echo "Smith, John" | sed -E 's/(^.*), (.*$)/\2 \1/' 

We are exploiting the Last, First format by writing a pattern that looks for some characters, then a comma, then a space, then more characters. The first matching group is defined as ^.* (all characters from the start of the string), but only as far as the next part of the pattern, which is the comma. We then explicitly include the space after the comma in our search and define another matching group .*$ (all characters until the end of the string).

Our replace expression is \2 \1, meaning print the second matching group from the search expression, then a space, then the first matching group. The result of this command is John Smith.

If we want to write changes directly to our input file, we can edit it in place with the -i flag and specify an extension for a backup file.

# Search and replace input.txt, create input.txt.bk, overwrite original
~ $ sed -i '.bk' 's/find/replace/' input.txt
# Search and replace input.txt, no backup, overwrite original
~ $ sed -i '' 's/find/replace/' input.txt
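Note that the spacing above follows the BSD sed that ships with macOS. GNU sed, common on Linux, attaches the backup suffix directly to the flag and needs no empty argument when no backup is wanted:

# GNU sed equivalents of the two commands above
~ $ sed -i.bk 's/find/replace/' input.txt
~ $ sed -i 's/find/replace/' input.txt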

Instructions#

Navigate to ~/Computing-Essentials in a terminal and execute the following commands:

~/Computing-Essentials $ curl https://raw.githubusercontent.com/davidcurie/Computing-Essentials/main/lessons/first-steps/make-first-steps.sh > make-first-steps.sh
~/Computing-Essentials $ chmod u+x make-first-steps.sh && ./make-first-steps.sh

The above commands download and execute a script that will create and populate a folder for this lesson.

Caution

Be mindful about blindly executing scripts you find online. Inspect any file you download from a source you don't trust with a text editor before executing it.

The new folder has many data files simulating measurements taken over many months. Each measurement is a simple spectrum represented by a series of (wavelength    intensity) pairs.

Move to the newly created folder and view the contents with ls:

~/Computing-Essentials $ cd first-steps
~/Computing-Essentials/first-steps $ ls

Rename files#

These files were named by your colleague without careful consideration and subsequently lumped into one directory. One immediately obvious problem we face is that the files do not sort naturally by date. Renaming these dates by hand would be tedious, but we can solve this problem with some regular expression matching.

Task

Change all dates in the filenames from MMDDYY to YYYYMMDD.
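One possible approach, shown only as a sketch: assuming the filenames contain a six-digit MMDDYY date and that all years fall in the 2000s (adjust the pattern to the filenames you actually see), a loop can build each new name with sed and then rename the file:

# Sketch: rewrite MMDDYY as YYYYMMDD in every .txt filename
for f in *.txt; do
    new=$(echo "$f" | sed -E 's/([0-9]{2})([0-9]{2})([0-9]{2})/20\3\1\2/')
    [ "$new" != "$f" ] && mv -i "$f" "$new"  # skip files whose name didn't change
done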

Organize files#

We may also notice that some of the data files were acquired under special parameters that make them distinct from the other files. It is good practice to include identifying remarks in the filename, ordered from most to least general as you read from left to right. We may also anticipate an analysis that makes use of data acquired with similar parameters.

To make it easier for us to find this data when we need it, let’s restructure the directory to move similarly acquired data into an identifiable folder for easier browsing.

We’ll also keep the verbose filenames as a measure of redundancy even though the information will now be duplicated in the folder structure. Let’s also agree on a naming slug so that all files have a similarly-patterned name for any further pattern matching we may do in the future.

Task

  • Move data into low-temperature and high-temperature folders.

  • Include low-temp descriptor in regular filenames for extra clarity.
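A sketch of the first step, assuming a hypothetical low-temp marker appears somewhere in the relevant filenames (substitute whatever descriptor your files actually use):

~/Computing-Essentials/first-steps $ mkdir -p low-temperature high-temperature
~/Computing-Essentials/first-steps $ mv *low-temp*.txt low-temperature/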

Identify special files#

Maintenance makes routine cleaning stops around the lab to pick up waste. When someone comes in to collect waste, the stray light from the hallway causes the counts in the detector to increase slightly. Normally, these operations happen at night when nobody is collecting data, but you remember that some of the acquisitions were running overnight. Unfortunately, the timestamps of these acquisitions were lost when you pulled this data from a backup drive, as the file timestamps now reflect when you last imported them from the archive.

This is not a problem. You know you can look for patterns in your files to detect values greater than some threshold. You know for certain that during the month of March you did not have any overnight runs, so you can look for typical spectrometer values for all runs in this date range.

For now, let’s assume you already figured out that the maximum signal you were getting did not go above 356 counts in this time. Use this as a discriminator to find the data, and thus the night(s), that you were collecting data when someone entered the room.

You may use the grep command to find the file. The general syntax is

~ $ grep -l 'pattern' file1 file2 ...

Task

Find files that have above-average spikes in their spectra.
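As a sketch, one way to express "an intensity greater than 356" as a pattern is to enumerate the digit ranges that qualify. This assumes the intensity is the last whole number on each line; the *.txt glob is only a placeholder for the files you want to search:

# List files containing an intensity of 357 or more at the end of a line
~/Computing-Essentials/first-steps $ grep -lE '(35[7-9]|3[6-9][0-9]|[4-9][0-9]{2}|[0-9]{4,})(\.[0-9]+)?$' *.txt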

Write a script#

You decide that you don't want to relearn these commands every six months, whenever the dataset grows large enough to need another round of cleaning. You can assemble the commands into a plain text file and make them executable from the shell by invoking the filename. Such filenames usually end in .sh for bash execution, but the extension may be left off entirely if you follow the steps below.

Create and edit a file called clean-data. At the top of the file, insert a shebang that tells the terminal which program should handle the execution of this file. For now, this will be bash. We do this with the following:

#! /bin/bash
# Author: Your name
# Description: Simple sentence of what this file will do

# Insert terminal commands here

The Author and Description notes are not strictly necessary, but if you plan to share this with your colleagues, you may want credit for your insight (or blame for your lack of insight). It is good practice to briefly mention the objectives of the script for the benefit of those who may expand on your script next, including you six months from now.

After you copy your commands and save your file, you’ll need to tell your machine that this file is executable. Do so with the following command.

~/Computing-Essentials/first-steps $ chmod u+x clean-data

The chmod command changes a file's permissions (its mode) so that the user u who owns the file gains execute permission +x on the specified file. You can tell if this was done properly by inspecting the file with ls -l from the current directory.

You should see ten flags to the left of the filename, along with other file data. The left-most flag is either - or d to indicate a file or directory. The next nine flags are read, write, and execute permissions for user, group, and other, respectively. Note that user refers to the owner of the file, which in this case is you.

If the above command was executed properly, you should see -rwx as the first four flags of clean-data. From here, you can invoke the file from the terminal in the same working directory where you completed these exercises.

~/Computing-Essentials/first-steps $ ./clean-data

Task

Write a simple script to complete the above tasks in one step.

Challenge

Allow for the script to accept an argument from the command line to represent the directory for which the organization should take place. This would allow you to exclude files that you already organized.

Example usage: ./clean-data unsorted/*.txt

Intended result: only rename and move .txt files that exist in an unsorted directory
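A minimal sketch of the argument handling, assuming the files to operate on are passed on the command line ("$@" expands to every argument given to the script):

#! /bin/bash
# Sketch: operate only on the files named as arguments
for f in "$@"; do
    echo "processing $f"  # replace with the rename/move commands from above
done

Invoking ./clean-data unsorted/*.txt would then loop over only the .txt files inside the unsorted directory.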