Next Steps with Python#
Prerequisites
First Steps with Python
Optional Prerequisites
Exercism Python track: Tree building
Objectives
Develop scalable approaches to analyzing data
We will take an example script that solves the First Steps with Python exercise and improve upon it by moving the repeated steps into functions.
A sample solution#
As a reminder of our task, we were asked to search through our sorted data and find the maximum value for our spectrometer counts per month. The following script does that. You may find a copy of this initial script here.
import pandas as pd
from glob import glob
MONTHS = {1: "January",
          2: "February",
          3: "March",
          4: "April",
          5: "May",
          6: "June",
          7: "July",
          8: "August",
          9: "September",
          10: "October",
          11: "November",
          12: "December",
          }

# Search through months by int, name pair in 2021.
for month, name in MONTHS.items():
    MM = f'{month:02}'  # pad with leading zero
    filenames = glob(f'*/2021-{MM}-*')  # get all files in this month
    data_as_list = []
    for file in filenames:
        month_entry = pd.read_csv(file, delimiter='\t', header=None)
        data_as_list.append(month_entry[1])
    try:
        month_df = pd.concat(data_as_list, axis=0, ignore_index=True)
    except ValueError:
        month_df = pd.DataFrame()
    with open('summary.csv', 'a') as outfile:
        outfile.write(f'2021 {name}, {month_df.values.max()}\n')

# Search through months by int, name pair in 2022.
for month, name in MONTHS.items():
    MM = f'{month:02}'  # pad with leading zero
    filenames = glob(f'*/2022-{MM}-*')  # get all files in this month
    data_as_list = []
    for file in filenames:
        month_entry = pd.read_csv(file, delimiter='\t', header=None)
        data_as_list.append(month_entry[1])
    try:
        month_df = pd.concat(data_as_list, axis=0, ignore_index=True)
    except ValueError:
        month_df = pd.DataFrame()
    with open('summary.csv', 'a') as outfile:
        outfile.write(f'2022 {name}, {month_df.values.max()}\n')
The basic premise of the script is as follows:

1. Create a dictionary so we can refer to each month by number and by name simultaneously.

MONTHS = {1: "January",
          2: "February",
          ⋮
          12: "December",
          }

2. For each month, find files with identifiable matching criteria in the year 2021.

for month, name in MONTHS.items():
    MM = f'{month:02}'  # pad with leading zero
    filenames = glob(f'*/2021-{MM}-*')  # get all files in this month, 2021

The MM = f'{month:02}' syntax takes our dictionary index, say 1, and converts it to the string 01. This allows us to match substrings that have a date in YYYY-MM-DD format, as we constructed in our earlier First Steps with Bash lesson.

3. Read the matching filenames and append the counts column to a list.

for file in filenames:
    month_entry = pd.read_csv(file, delimiter='\t', header=None)
    data_as_list.append(month_entry[1])

4. Concatenate the existing list into a dataframe.

try:
    month_df = pd.concat(data_as_list, axis=0, ignore_index=True)
except ValueError:
    month_df = pd.DataFrame()

In the case where the try statement fails due to an empty input list (say, because there are no filenames matching that month), a ValueError is raised. We handle that error case explicitly by creating an empty dataframe instead. If the try statement fails for any other reason not explicitly mentioned by an except statement, the error is relayed as normal. We fall back to an empty dataframe object so that we can use other dataframe functions later without raising more errors.

5. Write the maximum value from the dataframe to a file.

with open('summary.csv', 'a') as outfile:
    outfile.write(f'2021 {name}, {month_df.values.max()}\n')

The with open() syntax takes care of properly opening and closing the file after the commands in the indented block have run. Here, we open the file in append mode 'a' and explicitly include a newline character \n in each line we write. We can’t open the file in write mode 'w' because we would overwrite the monthly maximum on each successive loop. The contents of the line are formatted as an f-string, a Python 3.6+ feature: expressions inside {} are expanded to their values. We use the max() function on our month_df dataframe of counts accumulated for that month. We also convert the dataframe to its numeric value representation with .values before computing the maximum so that empty dataframes display as nan (Not a Number) rather than as an empty object.

6. Repeat for the months in 2022.
Identify repeated sections#
Note that in the first draft of this script above, the code that handles the month-by-month logic is largely identical for each year, except for the explicit inclusion of the year in the following two lines:
filenames = glob(f'*/2021-{MM}-*') # get all files in this month
outfile.write(f'2021 {name}, {month_df.values.max()}\n')
We could group these two blocks into a parent loop and iterate over each year, as sketched below.
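A minimal sketch of that structure, reusing the month-by-month logic from the sample script above and interpolating the year into both the glob pattern and the output line, might look like this:

for year in (2021, 2022):
    for month, name in MONTHS.items():
        MM = f'{month:02}'  # pad with leading zero
        filenames = glob(f'*/{year}-{MM}-*')  # get all files in this month and year
        data_as_list = []
        for file in filenames:
            month_entry = pd.read_csv(file, delimiter='\t', header=None)
            data_as_list.append(month_entry[1])
        try:
            month_df = pd.concat(data_as_list, axis=0, ignore_index=True)
        except ValueError:
            month_df = pd.DataFrame()
        with open('summary.csv', 'a') as outfile:
            outfile.write(f'{year} {name}, {month_df.values.max()}\n')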
This is better because if we make changes to the logic of one month-by-month iteration, we don’t need to worry about duplicating that change in another block of code.
The downside of this approach is that we haven’t reduced operational complexity. We now have at least 3 nested for loops, while good practice aims for no more than 2 levels of nesting. This rule of thumb stems from the observation that if you are nesting loops to handle a flow chart of cases, you are better off redesigning your logic.
Consider the following alternative logic:
1. Get all filenames (all months, all years)
2. Sort filenames by YYYY-MM and group them into sub-lists
3. Gather data for all files in a sub-list; concatenate into one dataframe
4. Pick out the max value for each criterion in the larger dataframe
5. Write these values to summary.csv in one step
Notice how we can loosely group this into sections that deal with gathering files (steps 1–2), managing data (steps 3–4), and writing results (step 5).
More importantly, with enough insight we can structure these calls to be relatively independent from one another.
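With enough foresight, the top-level flow of the refactored script might then reduce to a handful of calls. The sketch below uses the function names we will develop in the remaining sections, plus a hypothetical write_results() helper and an assumed '*/*' glob pattern:

all_filenames = get_all_filenames('*/*')                        # step 1: gather files
parsed_list = sort_filenames(all_filenames)                     # step 2: group by YYYY-MM
list_of_dataframes = import_files_into_dataframe(parsed_list)   # step 3: read and concatenate
list_of_max_values = get_max_values(list_of_dataframes)         # step 4: max per group
write_results(list_of_max_values)                               # step 5: write summary.csv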
Gathering Files#
Rather than explicitly searching through all files within each month loop to
find only those files with a matching month integer MM
, we can search through
all files once at the beginning of our script and store the results in a list
that we can later parse. The advantage of this is that we only run glob(), a search through directories on disk, once. We then access
specific filenames by searching through an in-memory Python list, which may give
us a slight performance advantage.
Let’s make these calls dedicated functions.
def get_all_filenames(pattern):
    all_filenames_as_list = glob(pattern)
    return all_filenames_as_list

def sort_filenames(list_of_filenames):
    pass
Our first function is just a one-to-one wrapper of the glob()
function, but
having it in a dedicated function of our choosing makes it clear to us when we
call it that we are accessing all files that we wish to parse. It also allows
us to modify the logic in case we want to add flags in the future without
having to search for all cases in our script where we used the native glob()
functionality by itself.
The pass statement in the second function is a temporary placeholder that says "do nothing"; it lets us stub out functions that we intend to come back to.
Our sort_filenames() function should take our list of filenames and parse it into sub-lists of our choosing. This can be achieved in a couple of ways.
1. Store this parsed list as a single object that we can pass to other functions, and have those functions pick out the relevant parts that they need.

# Approach 1
def sort_filenames(list_of_filenames):
    # Sort logic here based on some predefined pattern
    return [[sublist_1], [sublist_2], ..., [sublist_N]]

We can then store this list of lists as an object, say parsed_list, and access each month through list indexing, as in parsed_list[0] to get the first matching month.

2. Parse the entire list of filenames on demand, only returning the specified match based on user input.

# Approach 2
def sort_filenames(list_of_filenames, sublist_pattern):
    # Sort logic here that depends on sublist_pattern
    return sublist_match

This approach allows us to generalize our search criteria by specifying sublist_pattern as an extra function parameter. We might also rename the function to be more explicit about what it does.

# Approach 2
def get_sublist_from_pattern(list_of_filenames, sublist_pattern):
    # Sort logic here that depends on sublist_pattern
    return sublist_match
If we intend to access the entire list of filenames infrequently, approach 2 might be the way to go from a resource perspective. It is also more general because it allows us to tweak the search criteria through a second input parameter.
Recall that the purpose of our parsing is to be able to loop through different months and read a select number of files. Approach 2 forces us to use this function inside of a loop, where we may have previously defined the search criteria. It might look like this:
# Looping with approach 2
for month in list_of_months:
    sublist_pattern = some_criteria
    files_by_month = get_sublist_from_pattern(list_of_filenames, sublist_pattern)
    # Read data from matched files and continue for loop logic
Here, some_criteria would need to be defined on each loop iteration (preferably by another function). Additionally, we have to set up some iterable to loop over before we can filter the filenames by month. So although approach 2 is extensible, it requires us to create a list_of_months to iterate over and a some_criteria definition to tailor the generalized function to our desired loop case. If we later decide to group our searches by academic semester instead of by month, we’ll need to remember to update both how we generate our list_of_months iterable and our some_criteria definition.
Approach 1 returns an iterable object that is intimately tied to our search criteria, meaning that we don’t need to set up any further lists to loop through, so long as we took the time to carefully construct the subsets that we want.
# Looping with approach 1
parsed_list = sort_filenames(list_of_filenames)
for files_by_month in parsed_list:
    # Read data from each list of files_by_month
    # Continue for loop logic
From an extensibility standpoint, approach 1 is slightly better because it takes care of all of the pattern matching during the sorting process, whereas approach 2 requires us to generate a sorting criterion on each loop and then apply it to the parent list of filenames.
|           | Approach 1                       | Approach 2                                               |
|-----------|----------------------------------|----------------------------------------------------------|
| Objective | Sort once, access later          | Sort on demand                                           |
| Pros      | Ties search criteria to iterable | Extensible for matching other search patterns            |
| Cons      | Not extensible                   | Adjustments to search pattern require changes in 2 spots |
We’ll bring the generalizability of approach 2 into approach 1 by expressing our matching criterion as a callable function.

# Improved approach 1
def search_criteria(pattern):
    pass

def sort_filenames(list_of_filenames, sublist_pattern):
    # Use search_criteria(sublist_pattern) logic to parse the list
    return [[sublist_1], [sublist_2], ..., [sublist_N]]
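As a concrete illustration, here is one possible way to fill in these stubs. It is only a sketch: it assumes filenames of the form room-temp/2021-03-15.<ext>, as produced in the earlier lessons, and it folds the matching criterion into search_criteria() itself rather than passing a separate sublist_pattern argument.

import re
from collections import defaultdict

def search_criteria(filename):
    # Pull the YYYY-MM portion out of a filename such as
    # 'room-temp/2021-03-15.csv' (hypothetical name); return None if no date is found.
    match = re.search(r'(\d{4}-\d{2})-\d{2}', filename)
    return match.group(1) if match else None

def sort_filenames(list_of_filenames):
    # Group filenames into sub-lists keyed by their YYYY-MM string.
    groups = defaultdict(list)
    for filename in list_of_filenames:
        key = search_criteria(filename)
        if key is not None:
            groups[key].append(filename)
    # Return the sub-lists in chronological (sorted-key) order.
    return [groups[key] for key in sorted(groups)]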
Managing Data#
Once we have a set of sub-lists grouped by our desired criteria, we can write a general function that operates on all of the files in each sub-list.
We’ll make use of the built-in map(function, list), which applies function to each item in list and gives back the results. It is, roughly, a shortened way of writing the following:

# Rough equivalent of list(map(function, my_list))
results = []
for item in my_list:
    results.append(function(item))
For our list of sub-lists, we can read each file in our sub-list into a
dataframe and concatenate all dataframes from the sub-lists into one dataframe
with the pd.concat()
function.
month_df = pd.concat(map(pd.read_csv, sublist))
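One caveat: our data files are tab-delimited and have no header row, so the bare pd.read_csv call above may not parse them the way the sample script did. A small sketch using functools.partial lets us reuse the same read options inside map():

from functools import partial

# Reuse the read options from the sample script for every file in the sub-list.
read_counts = partial(pd.read_csv, delimiter='\t', header=None)
month_df = pd.concat(map(read_counts, sublist), axis=0, ignore_index=True)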
As we’ll need to repeat this dataframe creation for each sub-list in our parsed list, we’ll put it in a for loop and append each dataframe to a list.
def import_files_into_dataframe(parsed_list):
    # Build one concatenated dataframe per sub-list of filenames.
    list_of_dataframes = []
    for sublist in parsed_list:
        month_df = pd.concat(map(pd.read_csv, sublist))
        list_of_dataframes.append(month_df)
    return list_of_dataframes
We can now pick out the max value from each dataframe in a simple way.
def get_max_values(list_of_dataframes):
    list_of_max_values = []
    for dataframe in list_of_dataframes:
        month_max = dataframe.max()
        list_of_max_values.append(month_max)
    return list_of_max_values
Writing Results#
Our results lie in the list_of_max_values list returned by the get_max_values() function. We can write all of these values to a single file in one go, meaning we only need to open and close the file once instead of opening and closing it on every loop iteration.

with open('summary.csv', 'a') as outfile:
    for max_value in list_of_max_values:
        outfile.write(f'{max_value}\n')
Make it idempotent#
A script should yield the same results if it is run again under the same conditions. In our case, the write statement is appending data to our summary.csv file.
We initially opened the file in append mode because opening it in write mode would have overwritten the results of the previous month with those from the current month on each new loop iteration.
There are two potential solutions to this problem:
1. Write all of the results in one step, overwriting the existing contents from previous runs.
2. Append only new data to the results.
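As a sketch of the first option, assuming the list_of_max_values result from above, opening summary.csv in write mode replaces any earlier contents so that repeated runs produce the same file:

# Option 1 (sketch): overwrite the summary on every run so the script is idempotent.
with open('summary.csv', 'w') as outfile:
    for max_value in list_of_max_values:
        outfile.write(f'{max_value}\n')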
Instructions#
Task
Make improvements to your find_max.py
file to make use of functions.
You may use the sample solution at the beginning of this lesson as a starting point. Consider incorporating the suggested improvements outlined above by filling in the logic for the functions declared in the outline.
The find_max.py
script should do the following by default:
Gather all files in the room-temp/ and high-temp/ folders
Sort files based on month
Find the maximum spectroscopic count for each month
Write the results to summary.csv in the same order the files were grouped
Once all of your functions are defined, see if you can execute the main logic of the above pipeline in 10 lines of code or less.
Challenge
Extend the find_max.py
script to append uniquely new results to summary.csv
instead of overwriting on each execution.
More Resources#
How to Program a Game! (in Python) (Video: 1:10:06)
Professional Code Refactor! (Cleaning Python Code & Rewriting it to use Classes) (Video: 1:02:29)