Refactoring Basics#

Prerequisites

Objectives

  • Make use of tests to better prevent destructive edits in future revisions

We will reorganize our directory to better handle more scripts and data. We’ll further refine our find_max.py script to work with our new changes.

Summary of Revisions#

If you haven’t made your own edits, an updated version of find_max.py as expanded upon from the previous lesson can be found here.

The example script provided above is outlined by the following core logic:

  • Specify files to search

  • Get all filenames into a list

  • Group that list by some identifier (i.e. by month)

  • Operate on that grouped list

A stripped-down version of the script looks like the following:

# LOGIC FOR GATHERING FILES

def get_all_filenames(pattern):

def sort_filenames(list_of_filenames, key=project_YYYYMM):

# LOGIC FOR MANAGING DATA

def import_files_into_dataframe(parsed_list):

def get_max_values(list_of_dataframes):

# INTENDED SEQUENCE OF OPERATIONS

files_to_search = '*/*'
files_as_list = get_all_filenames(files_to_search)
grouped_files = sort_filenames(files_as_list)
df_list = import_files_into_dataframe(grouped_files)
df_max_vals = get_max_values(df_list)
with open('summary.csv', 'w') as outfile:
    df_max_vals.to_csv(outfile, header=None)

These logic of these core functions is made extensible by other helper functions.

def _get_date_from(filename):
    """Return (YYYY, MM, DD) from filename string"""
    # Assumes filename: YYYY-MM-DD_other_text
    YYYYMMDD = os.path.basename(filename).split('_')[0]
    YYYY, MM, DD = YYYYMMDD.split('-')
    return YYYY, MM, DD

def _project_YYYY(filename):
    """Return YYYY from filename string"""
    return int(_get_date_from(filename)[0])

def _project_YYYYMM(filename):
    """Return YYYY-MM string from filename string"""
    return '-'.join(_get_date_from(filename)[0:2])

def _project_YYYYMMDD(filename):
    """ Return YYYY-MM-DD string from filename string"""
    return '-'.join(_get_date_from(filename))

def _import_data(file):
    return pd.read_csv(file, delimiter='\t', header=None)

Notice that these helper functions are named with a leading underscore. This is a convention to tell other programmers that these are meant to be called by other functions in our script but not to be called directly by the end-user.

The purpose of the helper functions in this example is to make the core functions easier to read.

In the updated script, the sort_filenames() function makes use of a group_by() function in the itertools library, which is part of the standard Python library. This saves us from having to define our own grouping mechanism, but we need to start with a sorted list and specify how to group that list. Rather than hard-coding the sorting criteria logic directly inside the sort_filenames() function, we can define a sorting key (sometimes also called a list projection) that handles the logic.

The list projections used here are functions that return some sortable value from a filename. A sortable value is either a number or a string, where strings are sorted in alphanumeric order (0–9A–Za–z). In our case, we assume that the string YYYY-MM-DD appears in the beginning of the filename. We then find a way to return some subset of that string. If we want to sort or group by YYYY, we find a way to return just these characters. Since our initial task is to group by month, we want to YYYY-MM string from the filename. The year is included to avoid grouping months from different years together.

Updating the sorting algorithm is then as simple as updating the sorting key in the group_by() function. We don’t want to have to edit the guts of the sort_filenames() function each time we want to update the logic used by group_by() function inside, so we pass in a parameter to sort_filenames() and let group_by() access the contents of that parameter.

def sort_filenames(filenames, key=MyValue):
    # group_by logic that makes use of the value of `key`

This indicates that the default sorting key is MyValue unless otherwise specified at runtime. Running sort_filenames(filenames) would automatically make use of MyValue, but if you wanted to change this at runtime, you could invoke sort_filenames(filenames, key=MyOtherValue).

Having adjustable parameters grouped into the function call makes it much easier to make future edits without having to investigate the sometimes-complicated logic buried in functions.

Tip

Aim to generalize functions by allowing variables that you foresee will change in specific cases to be encapsulated as an optional function parameter. Give default parameter values to control the assumed behavior of the function.

More Refactoring#

We’ve made our script easier to inspect in the future by outsourcing the key logic into functions, but we haven’t made our project easier to inspect as a whole.

Cleaning the Directory#

If you’ve been following along with the lessons so far, your first-steps/ directory might look something like the following:

first-steps/
    ├──high-temp/
    ├──room-temp/
    └──LICENSE
    └──README
    └──clean-data
    └──environment.yml
    └──find_max.py
    └──summary.csv
    └──to-clean.txt

Suppose we want to divide our project into a data/ results/ and scripts/ folder. These changes might break the way some of our scripts access files, so we’ll commit our directory structure to log our housecleaning in Git.

(first-steps) ~/Computing-Essentials/first-steps $ git add . && git commit -m "Start of housecleaning"

Create these folders and move files until you get a directory structure like the following:

first-steps/
    ├──data/
        └──high-temp/
        └──room-temp/
        └──to-clean.txt
    ├──results/
        └──summary.csv
    ├──scripts/
        └──clean-data
        └──find_max.py
    └──LICENSE
    └──README
    └──environment.yml

The clean-data and find_max.py files need to be edited in the way they search for data. When looking for all data files, we previously used a wildcard search */* meaning all directories/all files relative to the working directory of where this script is executed. Relative to the new location of the script, that path looks something like ../data/*/*.

Task

Update the search criteria for clean-data and find_max.py so that they may be run from the scripts/ directory. Check that these produce the expected results, say, for find_max.py and that the results are written to their new location.

Task

Log the changes corresponding to the newly organized directory and the modified script files in Git as a commit with a message such as “Reorganize directory into folders”

Using Tests#

The file given at the beginning of the lesson is a functioning copy of a script that meets the current objectives. Sometimes the objectives of our script evolve in ways that aren’t obvious to predict when first defining functions.

Unit tests allow for us to make iterative changes to our code while we restructure its logic and check to see that it still meets the core requirements that we set out with.

See also

For a more lucid discussion on why unit tests are important and how to implement them, see https://goodresearch.dev/testing.html.

It’s good practice to have a set of tests for each script. We’ll make a separate tests/ folder to house each test, and name our tests after the script they are meant to inspect.

Task

A set of tests for the file linked at the beginning of the lesson can be found in this directory. Make your own tests/ folder and copy the contents of the example directory into your tests/ folder.

The tests may be run from the main project directory with the following line:

(first-steps) ~/Computing-Essentials/first-steps $ python -m pytest tests/find_max_test.py

Task

Check to make sure that all of the tests pass. Then commit the contents of the tests/ folder to your repository.

Notice that in the find_max_test.py file, each function starts with the word test and describes the performed test. This makes it easier to inspect which tests failed.

The tests are only as valuable as the insight you have when designing the tests. Often when refactoring, you’ll make errors that cause tests to fail. Make note of those errors and design a test to catch that error in the future.

Here are two test suggestions that might motivate further changes to our find_max.py script.

Task

Write a unit test that checks whether sort_filenames() continutes to sort by month when there are spaces in the filenames. Then implement changes to sort_filenames() so that filenames with spaces don’t break the workflow.

Task

Write a unit test that checks whether sort_filenames() continutes to sort by month when the YYYY-MM-DD substring is not the first part of the filename. Then implement changes to sort_filenames() so that the above condition is met.

Challenge

Rewrite the modified find_max.py script given in this lesson to use Classes.