Building your Python Toolkit

Upcoming workshop(s) available!

The next workshop is on Tue Dec 09 at 09:30 AM.

Alternatively, check our calendar for future events.

In this standalone workshop, we take a look at the Python building blocks that you’ll want to have in your toolkit. This includes:

Loops
Functions
Modules
I/O and filesystem (using os and sys)

We’ll work by incrementally building up a bigger and bigger program, containing all the features we cover. That way you’ve got them all in context.

Along the way, you’ll encounter callouts like these:

New feature

Description of new feature

These crop up every time a new feature or concept is introduced.

Preparation

Before we begin, please complete the following steps:

Open your preferred IDE (e.g. Spyder).
Download the files for today’s session.
Extract them to an accessible location (you’ll work from here).

Part 0: Refresher

Before we get into the content for this workshop, let’s just have a brief refresher on what Python is and how it works.

Python is a program on your computer, just like any other. When you want to run Python code, or a .py file, the Python program (e.g. python.exe) runs. It takes your code as an input and evaluates it.

Python is also the name of the programming language which the Python program interprets. So, we refer to the language as Python, and to the executable program which interprets it as the interpreter.

The language is similar to other high-level, object-oriented languages like R, MATLAB, Julia, JavaScript etc. It is versatile and reads easily.

There are various ways to run Python. This workshop assumes you’re using an IDE like Spyder. Here, we’ll write just one script, containing our program.

In Python we work with two types of objects: variables and functions.

Variables

Variables allow you to store information in a named object. We use the = operator to create them, with the syntax

<name> = <value>

For example, the script below creates two objects, example_number and example_string, by assigning values to both.

example_number = 100.5
example_string = "Hello!"

Functions

Functions enable you to run predefined Python code. Functions can be built-in, from a module, or written yourself.

To call (use) a function, the syntax is

<function>(<input_1>, <input_2>, ...)

The parentheses () are essential, even if you don’t have any inputs.

For example, let’s use the round() function to round our example_number:

round(example_number)
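As a small sketch of the calling syntax, the snippet below stores the output of round() in a variable and then passes an optional second input (the number of decimal places). The variable names rounded and rounded_one_place are only illustrative. Note that Python rounds exact halves to the nearest even number, so 100.5 becomes 100 rather than 101.

```python
example_number = 100.5

# One input: round to the nearest whole number.
# Exact halves go to the nearest even number, so this gives 100.
rounded = round(example_number)
print(rounded)

# Two inputs: the second is the number of decimal places to keep.
rounded_one_place = round(3.14159, 1)
print(rounded_one_place)
```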

Keywords and Modules

Finally, you will encounter a third type of command in Python: keywords. These are special commands which alter your code in a predetermined way, and you can’t use them for variable names. You also can’t write your own - they’re baked into Python itself.

Keywords must always be followed by a space (or a colon in special cases). One example is the import keyword, which loads a module. Modules are collections of Python code that other people have written, containing lots of functions, variables and submodules.

For example, the following code imports the os module.

import os
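If you are unsure whether a name is a keyword, one way to check is with the built-in keyword module. This is a minimal sketch and not part of the workshop program; the variable names are illustrative.

```python
import keyword

# keyword.kwlist is the list of every reserved word in this Python version.
print(keyword.kwlist)

# iskeyword() reports whether a given name is reserved.
import_is_keyword = keyword.iskeyword("import")
print_is_keyword = keyword.iskeyword("print")  # print is a function, not a keyword
print(import_is_keyword, print_is_keyword)
```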

Part 1: Verifying your setup

In the first part of this workshop, we’ll learn to use Python’s building blocks by verifying our setup is working. We’ll need to use

The os module and functions therein
The print() function
Conditionals if and else
Error handling with raise

To begin, create a new script called toolkit.py. This should be in the same location as the /texts/ folder you downloaded.

Using `os` to check the working directory

We’ll begin by diving into an unusual starting point: the os module. Python comes with a built-in module called os (short for operating system) which allows you to interact with your computer. Let’s start by importing it:

import os

The os module

The os module is a built-in Python module that enables interfacing with the operating system. Some popular uses, which aren’t looked at in this workshop, are

os.chdir(...): Change the current working directory to ...
os.system(...): Send the command ... to a terminal
os.walk(...): Recursively walk through the files in ... (requires looping, see later)

Importing links our Python environment to the os module, so we can now access code from it. We’re starting here because the os module lets us interact with our computer.

Next, we’ll use the print() function to send a message to the user:

import os

print("Running the Python Toolkit Program")

Running the Python Toolkit Program

The print() function

The function print(...) is a built-in function which sends the message ... to the console. Note that the output of the function is technically None (this is different to the console message).
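To see the difference between the console message and the function's output, you could store the output in a variable. A minimal sketch, with result as an illustrative name:

```python
# print() displays the message, but the value it hands back is None.
result = print("Running the Python Toolkit Program")
print(result)
```

The first line of output is the message itself; the second is None, the value that print() returned.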

As a first application of os, we’ll use it to print the current working directory. This is a location on your computer, considered by the program to be home - all files are relative to the current working directory. We’ll need this later when we deal with file paths.

You can access the current working directory with the getcwd() function within the os module. Because the function lives inside the module, we need to use the . operator.

import os

print("Running the Python Toolkit Program")
print("The current working directory is")
print(os.getcwd())

Running the Python Toolkit Program
The current working directory is
/home/runner/work/technology-training/technology-training/Python/5-python_toolkit

The . operator

The . operator allows you to access objects that exist within other objects. Here, we access the getcwd() function which lives inside the os module.

All objects in Python have methods (functions) and attributes (variables) attached to them, which you use the . operator to access.
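As a brief sketch of both cases, the snippet below reads a variable stored inside the os module and then calls a method attached to a string. The names greeting and shouted are illustrative.

```python
import os

# os.sep is a variable inside the os module: the path separator for this system.
print(os.sep)

# Strings are objects too, so they carry their own methods.
greeting = "Hello!"
shouted = greeting.upper()
print(shouted)
```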

We can simplify this process by using f-strings. Including the letter f directly before a string’s first quotation mark tells Python to execute any code within curly brackets:

import os

print("Running the Python Toolkit Program")
print(f"The current working directory is {os.getcwd()}")

Running the Python Toolkit Program
The current working directory is /home/runner/work/technology-training/technology-training/Python/5-python_toolkit

Setting the correct working directory

When you work with other files, you’ll want to set the working directory. Here, we’ll be loading data that we store in the same folder as your Python script, so let’s move the working directory there.

First, copy the path of the folder that contains your script. On Windows, find the file in File Explorer and copy the path from the address bar at the top.
Next, use the os.chdir() function to change the current working directory as follows:

os.chdir(R"path/to/script/")

Make sure you include the R before the path. This tells Python that it’s a raw string, so it won’t misinterpret the backslashes in Windows paths.
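To illustrate why the R matters, here is a small sketch using a made-up Windows-style path (the folder name is hypothetical). In a normal string the characters \n are read as a single newline character, whereas the raw string keeps the backslash.

```python
# A hypothetical Windows-style path, used only for illustration.
normal_path = "C:\new_folder"   # \n is read as a newline character
raw_path = R"C:\new_folder"     # the backslash is kept as-is

print(len(normal_path))  # one character shorter, because \n became one character
print(len(raw_path))
print(raw_path)
```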

f-strings

You can include executed code within strings by prepending them with the letter f. The code needs to be placed within curly brackets inside the string, and the output of the code will be directly inserted there.

f"This is an f-string, showing that 1 + 1 = {1 + 1}"

'This is an f-string, showing that 1 + 1 = 2'
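f-strings can also control how a value is displayed. The sketch below, with illustrative variable names, inserts a variable and then uses a format specifier after a colon to keep two decimal places.

```python
fraction = 2 / 3

# Variables can be inserted directly.
message = f"Two thirds is {fraction}"

# A format specifier after a colon controls the display; .2f keeps two decimal places.
rounded_message = f"Two thirds is roughly {fraction:.2f}"

print(message)
print(rounded_message)
```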

Managing folders and paths with conditionals

Now that we’ve printed a welcome message, let’s use another function to ensure that we’ve got the folder “texts” in our working directory. The function lives in a submodule of os, called os.path, and the function is exists().

import os

print("Running the Python Toolkit Program")
print(f"The current working directory is {os.getcwd()}")

# Check that the folder exists in our working directory
print(os.path.exists("texts"))

Running the Python Toolkit Program
The current working directory is /home/runner/work/technology-training/technology-training/Python/5-python_toolkit
True

Submodules

Most big modules, including os, contain submodules. These are also accessed via the . notation:

module.submodule.function()

Submodules could also have submodules:

module.submodule.submodule_of_submodule.function()

This helps keep the scope clear. For example, there might be lots of submodules with the function exists(), but because we call os.path.exists(), we know that this one relates to filepaths.
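The os.path submodule has several related functions besides exists(). As a rough sketch, the snippet below tries a few of them on the current working directory, which is guaranteed to be a folder. The file name example.txt is hypothetical and is never opened.

```python
import os

current_folder = os.getcwd()

# exists() is True for anything that is there, whether a file or a folder.
print(os.path.exists(current_folder))

# isdir() and isfile() are stricter: they also check what kind of thing it is.
is_folder = os.path.isdir(current_folder)
is_file = os.path.isfile(current_folder)
print(is_folder, is_file)

# join() builds a path using the correct separator for your operating system.
example_path = os.path.join("texts", "example.txt")
print(example_path)
```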

What should we do if the folder doesn’t exist? We should probably stop the program, fix our setup, and then try again.

Let’s tell Python to print a message if the folder exists. For this, we need to use conditionals.

The keyword if indicates that a block of indented code should only be run if a condition is True. Because the function os.path.exists() returns either True or False, we can use it as the condition.

Let’s set up our conditional and test it by printing a message. We have to be careful with the syntax:

if <condition>:
    code_to_run_if_True

code_outside_will_always_run

Including it in our code, we want to substitute:

<condition> \(\rightarrow\) os.path.exists("texts")
code_to_run_if_True \(\rightarrow\) print("The folder /texts/ exists")

import os

print("Running the Python Toolkit Program")
print(f"The current working directory is {os.getcwd()}")

# Check that the folder exists in our working directory
if os.path.exists("texts"):
    print("The folder /texts/ exists")

Running the Python Toolkit Program
The current working directory is /home/runner/work/technology-training/technology-training/Python/5-python_toolkit
The folder /texts/ exists

What if it doesn’t exist? We can use the keyword else to catch anything that fails the condition.

import os

print("Running the Python Toolkit Program")
print(f"The current working directory is {os.getcwd()}")

# Check that the folder exists in our working directory
if os.path.exists("texts"):
    print("The folder /texts/ exists.")
else:
    print("The folder /texts/ does not exist.")

Running the Python Toolkit Program
The current working directory is /home/runner/work/technology-training/technology-training/Python/5-python_toolkit
The folder /texts/ exists.

Conditionals: if, elif and else

The keywords if, elif, and else together constitute the conditionals that Python supports. These allow you to only run code based on certain conditions.

The first, if, is always required. Its syntax consists of:

The if keyword
A condition, which must evaluate to either True or False
A colon, :
The code to run if the condition is True. It must be indented if on a new line.

if <condition>:
    code_to_run_if_True

code_outside_will_always_run

The keyword elif allows you to check an additional condition, only if the previous condition(s) failed.

if <condition>:
    code_to_run_if_True
elif <condition_2>:
    code_to_run_if_condition2_True
elif <condition_3>:
    code_to_run_if_condition3_True

code_outside_will_always_run

Finally, the keyword else catches anything that failed all conditions:

if <condition>:
    code_to_run_if_True
elif <condition_2>:
    code_to_run_if_condition2_True
elif <condition_3>:
    code_to_run_if_condition3_True
else:
    code_to_run_if_all_failed

code_outside_will_always_run
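As a concrete sketch of the full pattern, the snippet below picks a message based on a made-up file count. The variable file_count and its value are hypothetical; try changing it to see the other branches run.

```python
file_count = 3  # hypothetical value, change it to see the other branches

if file_count == 5:
    message = "All five files are present."
elif file_count < 5:
    message = "Some files are missing."
else:
    message = "There are extra files."

# This line is outside the conditional, so it always runs.
print(message)
```

Only one branch ever runs: Python checks the conditions from top to bottom and stops at the first one that is True.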

Realistically, we don’t want our program to run unless the setup is correct. This means we should stop the program if it can’t find the folder.

You’ve probably already encountered errors in your code. Now it’s your chance to code them in manually. To raise an error,

Use the raise keyword
Follow with a valid error, e.g. ValueError(), KeyError() etc.
Place a useful message inside the brackets.

In this case, we should use FileNotFoundError(). Let’s replace the else section:

import os

print("Running the Python Toolkit Program")
print(f"The current working directory is {os.getcwd()}")

# Check that the folder exists in our working directory
if os.path.exists("texts"):
    print("The folder /texts/ exists.")
else:
    raise FileNotFoundError("Cannot find the folder /texts/.")

Running the Python Toolkit Program
The current working directory is /home/runner/work/technology-training/technology-training/Python/5-python_toolkit
The folder /texts/ exists.

Raising exceptions with raise

The raise keyword allows you to raise exceptions (errors) in Python. These stop the execution and print an error message.

The syntax is

raise SomeError("Appropriate error message")

and there are lots of built-in exceptions.
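It can help to see that an exception is an ordinary object until raise is used on it. In the sketch below, folder_found is a hypothetical flag standing in for os.path.exists("texts"); because it is True, the error is built but never raised.

```python
# Building the exception does nothing on its own; only raise stops the program.
error = FileNotFoundError("Cannot find the folder /texts/.")
print(type(error).__name__)
print(error)

folder_found = True  # hypothetical flag standing in for os.path.exists("texts")

if not folder_found:
    raise error

print("Setup looks fine, continuing.")
```

Setting folder_found to False would stop the program at the raise line, and the final message would never print.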

<Error retrieving source code with stack_data see ipython/ipython#13598>

Your traceback might have the error

<Error retrieving source code with stack_data see ipython/ipython#13598>

While this isn’t a major issue, it’s a bug that will prevent you from seeing tracebacks in your error messages. It’s because Anaconda has shipped an old version of a behind-the-scenes module that it uses; we can fix this.

If you’re using Anaconda,

Open an Anaconda Prompt
Run

conda update executing

If you’re not, then you might have a more serious bug going on. You can try running the following command in a terminal

pip install -U executing

but it may not solve the problem.

Activity 1

To make sure that you’ve set things up correctly, you should raise an error if the number of files in the texts folder is not five.

To set things up, let’s use the os.listdir() function to get a list of the files, storing them in a variable.

import os

print("Running the Python Toolkit Program")
print(f"The current working directory is {os.getcwd()}")

# Check that the folder exists in our working directory
if os.path.exists("texts"):
    print("The folder /texts/ exists.")
else:
    raise FileNotFoundError("Cannot find the folder /texts/.")

# Check that there are five files in the folder
files_in_texts = os.listdir("texts")

Running the Python Toolkit Program
The current working directory is /home/runner/work/technology-training/technology-training/Python/5-python_toolkit
The folder /texts/ exists.

For this activity, you can use the len() built-in function to determine the size of files_in_texts. Then, use a conditional to raise an error if it’s not five.

In sum,

Use len() to determine the number of objects in files_in_texts
Use an if statement to check if this is not equal to five. You’ll need the != (not equal to) operator.
Use the raise keyword to raise an error.

Logical operators

To check (in)equalities, you can use logical operators. For example,

1 == 1  # Equal to

True

1 != 2  # Not equal to

True

2 > 1   # Greater than

True

1 <= 1  # Less than or equal to

True

Solution

The new code is

files_in_texts = os.listdir("texts")

# Check that there are five files within texts
if len(files_in_texts) != 5:
    raise FileNotFoundError("Incorrect number of files in /texts/.")

Note that this doesn’t produce an output message. Generally, if everything is fine, we don’t need a message.

The whole program has now become

import os

print("Running the Python Toolkit Program")
print(f"The current working directory is {os.getcwd()}")

# Check that the folder exists in our working directory
if os.path.exists("texts"):
    print("The folder /texts/ exists.")
else:
    raise FileNotFoundError("Cannot find the folder /texts/.")

files_in_texts = os.listdir("texts")

# Check that there are five files within texts
if len(files_in_texts) != 5:
    raise FileNotFoundError("Incorrect number of files in /texts/.")

Running the Python Toolkit Program
The current working directory is /home/runner/work/technology-training/technology-training/Python/5-python_toolkit
The folder /texts/ exists.

Part 2: Analysing the data

In the second part of this session, we’ll learn to use Python’s input/output and looping features to analyse the files within texts. We’ll use

The open() function and with ... as ... keywords for reading
String methods and the set variable type to analyse the texts
for loops to automate the process

Code from Part 1

Before we begin, ensure that your code looks like this. Continue from the bottom throughout part 2.

import os

print("Running the Python Toolkit Program")
print(f"The current working directory is {os.getcwd()}")

# Check that the folder exists in our working directory
if os.path.exists("texts"):
    print("The folder /texts/ exists.")
else:
    raise FileNotFoundError("Cannot find the folder /texts/.")

files_in_texts = os.listdir("texts")

# Check that there are five files within texts
if len(files_in_texts) != 5:
    raise FileNotFoundError("Incorrect number of files in /texts/.")

#### Part 2 ####

Running the Python Toolkit Program
The current working directory is /home/runner/work/technology-training/technology-training/Python/5-python_toolkit
The folder /texts/ exists.

Reading and manipulating text files

We’ll start by reading the first file, Macbeth.txt, into Python. Make sure to include the snippets in part 2 below the code from part 1.

To read a file in Python,

Create a with ... as ... block to make sure the file connection closes properly
At the first ..., use the open(<filepath>, encoding = "utf-8") function to open the file. Use encoding = "utf-8" because your operating system might not have this as the default.
At the second ..., use a placeholder variable to store the file connection. Something like file.
Inside the block (like an if statement), use file.read() to access its contents and store that in a variable.

We’ll use that final variable to perform our analysis. It is completely disconnected from the actual file.

#### Part 2 ####

with open("texts/Macbeth.txt", encoding = "utf-8") as file:
    contents = file.read()

File input/output

Reading and writing files in Python takes a few steps. Essentially, Python forms a connection to a file with the open() function. If we do this inside a with ... as ... block, the connection closes automatically when the block ends.

The syntax is

with open("path_to_file", encoding = "...") as <placeholder>:
    code_with_file_connection_open

code_once_connection_has_closed

Note that whatever you put at <placeholder> will store the file connection. File connections opened for reading have the method read(), which returns the contents as a string. A method is a function that you access with the . operator.

Typically, you want to store the contents in a variable, like in our contents example, which is disconnected from the actual file.

The encoding refers to how the file stores text. Generally, you should use encoding = "utf-8", because the default is based on your operating system and varies from machine to machine.
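The following self-contained sketch shows the connection closing automatically. It first writes a small throwaway file inside a temporary folder (using the "w" mode for writing), then reads it back. The file name example.txt and the variable connection_closed are illustrative.

```python
import os
import tempfile

# Work inside a temporary folder so the example does not touch your own files.
with tempfile.TemporaryDirectory() as temp_folder:
    example_path = os.path.join(temp_folder, "example.txt")

    # Create a small file to read.
    with open(example_path, "w", encoding="utf-8") as file:
        file.write("Double, double toil and trouble")

    # Read it back: the connection is only open inside the indented block.
    with open(example_path, encoding="utf-8") as file:
        contents = file.read()

    connection_closed = file.closed

print(connection_closed)  # the connection was closed automatically
print(contents)           # the string is still available
```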

Next, let’s perform some analysis of the text. Our goal is to compare the total number of words with the total unique number of words.

First, we need to apply a string method to separate the words in the text. Methods are functions that all variables of a particular type have access to, and we use them with the dot operator (.). In this case, the .split() method will create a list by dividing the string wherever there is whitespace.

#### Part 2 ####

with open("texts/Macbeth.txt", encoding = "utf-8") as file:
    contents = file.read()

words = contents.split()
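To get a feel for what .split() does, here is a small sketch on a made-up string rather than the full text. The sample strings and the name comma_parts are illustrative.

```python
sample = "Double,  double\ttoil\nand trouble"

# With no input, split() divides on any run of whitespace (spaces, tabs, newlines).
words = sample.split()
print(words)

# With an input, it divides only on that exact string.
comma_parts = "a,b,,c".split(",")
print(comma_parts)
```

Notice that punctuation stays attached to the words, so "Double," keeps its comma.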

Methods

Every variable has methods (functions) and attributes (variables) associated with it, accessible via the . operator. For example, all strings have the .lower() method which makes them lowercase:

example_string = "THIS WAS IN CAPS"
example_string.lower()

'this was in caps'

Other variables have their own methods. Floats have the .as_integer_ratio() method, which turns the number into a fraction

example_float = 5.5
example_float.as_integer_ratio()

(11, 2)

and lists have the .append() method, which adds another element to the list

example_list = ["a", "b"]
example_list.append("c")
print(example_list)

['a', 'b', 'c']

We can then use the len() function again to determine the total number of words and print a message. We can also use print() by itself to make an empty line.

#### Part 2 ####

with open("texts/Macbeth.txt", encoding = "utf-8") as file:
    contents = file.read()

words = contents.split()
word_count = len(words)

print()
print(f"There are {word_count} words in Macbeth.")


There are 21428 words in Macbeth.

To work out the unique number of words, we can convert our list to a different variable type: the set. Sets are like lists, but they only contain unique values.

We can convert a variable to another type by using its type as a function, e.g. int(), str(), list(). Here, we’ll need set(). Then we can use len() again to determine its size.

Create a set of unique words with set(words)
Find the count of unique words with len()
Print an additional message

#### Part 2 ####

with open("texts/Macbeth.txt", encoding = "utf-8") as file:
    contents = file.read()

words = contents.split()
unique_words = set(words)

word_count = len(words)
unique_word_count = len(unique_words)

print()
print(f"There are {word_count} words in Macbeth.")
print(f"There are {unique_word_count} different words in Macbeth.")


There are 21428 words in Macbeth.
There are 6207 different words in Macbeth.
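Bear in mind that this count treats "The" and "the" as different words. One possible refinement, sketched below on a made-up sentence, is to lowercase the text before splitting. This is an optional variation, not what the workshop program does, and the variable names are illustrative.

```python
sample = "The cat saw the other cat. The end of the story."

# Count unique words exactly as they appear.
unique_count = len(set(sample.split()))

# Lowercase first, so "The" and "the" are counted once.
unique_lowered_count = len(set(sample.lower().split()))

print(unique_count)
print(unique_lowered_count)
```

Even after lowercasing, "cat" and "cat." still count separately because the full stop is part of the word.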

Containers: list, tuple, dict and set

Python has four built-in variable types which are ‘containers’: they store multiple values.

Lists

Lists are simply a collection of Python objects. They are ordered, so you can access them by their index, and they are mutable, so you can change individual elements.

Create a list with square brackets:

example_list = [1, "a", 5.5]

# Mutable - can change specific elements
# Ordered - access elements by position
example_list[0] = "first"
print(example_list)

['first', 'a', 5.5]

Tuples

Tuples are like lists, but you can’t modify their elements. That makes them ordered and immutable.

Create a tuple with parentheses:

example_tuple = (1, "a", 5.5)

# Immutable - attempting to change specific element gives error
example_tuple[0] = "first"

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[29], line 4
      1 example_tuple = (1, "a", 5.5)
      3 # Immutable - attempting to change specific element gives error
----> 4 example_tuple[0] = "first"

TypeError: 'tuple' object does not support item assignment

Dictionaries

Dictionaries store key: value pairs. Instead of using position to identify elements, you use keys. Since Python 3.7 they keep their insertion order, but you still can’t access elements by position.

Create a dictionary with curly brackets and key: value pairs:

example_dict = {"a": 1, "b": 2, "c": 3}

# Can create new elements in dictionary by 'accessing' them
example_dict["d"] = 4
print(example_dict)

{'a': 1, 'b': 2, 'c': 3, 'd': 4}

Sets

Sets are like lists but the elements are unique. Duplicates will always be removed. They are also unordered, so you can’t access individual elements unless you loop through the set.

example_set = {"a", "a", 2, 2, "c"}

print(example_set)

{2, 'a', 'c'}
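You can move between containers by using each type as a function. The sketch below, with illustrative names, removes duplicates with set(), puts the result back in order with sorted(), and makes an immutable copy with tuple().

```python
letters = ["b", "a", "b", "c", "a"]

unique_letters = set(letters)            # duplicates removed, order not guaranteed
sorted_letters = sorted(unique_letters)  # sorted() always returns a list
as_tuple = tuple(sorted_letters)         # an immutable copy

print(len(letters), len(unique_letters))
print(sorted_letters)
print(as_tuple)

# The in keyword checks membership for any container.
print("a" in unique_letters)
```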

Finally, let’s determine the ratio of unique words to total words:

\[\text{ratio} = \frac{\text{unique words}}{\text{total words}}\]

#### Part 2 ####

with open("texts/Macbeth.txt", encoding = "utf-8") as file:
    contents = file.read()

words = contents.split()
unique_words = set(words)

word_count = len(words)
unique_word_count = len(unique_words)
ratio = unique_word_count / word_count

print()
print(f"There are {word_count} words in Macbeth.")
print(f"There are {unique_word_count} different words in Macbeth.")
print(f"The unique word ratio is {ratio}")


There are 21428 words in Macbeth.
There are 6207 different words in Macbeth.
The unique word ratio is 0.2896677244726526

Using loops to automate the process

Now that we’ve analysed one of the texts, let’s do the same for all five.

The brute force approach is to copy the code five times and adjust it.

However, we can do one better with a for loop. This enables us to repeat a section of code for each element in an object.

for <placeholder> in <object>:
    code_to_repeat

code_after_loop

We’ll start just by printing out the names of each file. We need the list of file names, which we get from os.listdir("texts").

#### Part 2 ####

with open("texts/Macbeth.txt", encoding = "utf-8") as file:
    contents = file.read()

words = contents.split()
unique_words = set(words)

word_count = len(words)
unique_word_count = len(unique_words)
ratio = unique_word_count / word_count

print()
print(f"There are {word_count} words in Macbeth.")
print(f"There are {unique_word_count} different words in Macbeth.")
print(f"The unique word ratio is {ratio}")

files_in_texts = os.listdir("texts")

for text_path in files_in_texts:
    print(text_path)


There are 21428 words in Macbeth.
There are 6207 different words in Macbeth.
The unique word ratio is 0.2896677244726526
The_Great_Gatsby.txt
Pride_and_Prejudice.txt
The_Adventures_of_Huckleberry_Finn.txt
The_Count_of_Monte_Cristo.txt
Macbeth.txt

for loops

To iterate through an object, running the same code on each element, Python offers the for loop.

for <placeholder> in <object>:
    code_to_repeat

code_after_loop

Whatever you name in <placeholder> will store an element of the <object> for each iteration of the loop.

For example, the following loop prints each element of example_list. Each time the loop runs, letter stores one of the list’s elements, in order.

example_list = ["a", "b", "c"]

for letter in example_list:
    print(letter)

a
b
c

There are a few important keywords you can use to help with loops.

break

The keyword break tells Python to finish the loop immediately. This is often used with conditionals. For example,

example_list = ["a", "b", "c"]

for letter in example_list:
    if letter == "b":
        break
    print(letter)

a

continue

The keyword continue tells Python to skip the rest of the current iteration and start the next.

example_list = ["a", "b", "c"]

for letter in example_list:
    if letter == "b":
        continue
    print(letter)

a
c

We can use our for loop to run the whole analysis on each file.

First, let’s just place the analysis inside the loop. This will run once for each file, but because we haven’t changed the path from "texts/Macbeth.txt", it will still read Macbeth each time.

#### Part 2 ####
files_in_texts = os.listdir("texts")

for text_path in files_in_texts:
    with open("texts/Macbeth.txt", encoding = "utf-8") as file:
        contents = file.read()

    words = contents.split()
    unique_words = set(words)

    word_count = len(words)
    unique_word_count = len(unique_words)
    ratio = unique_word_count / word_count

    print()
    print(f"There are {word_count} words in Macbeth.")
    print(f"There are {unique_word_count} different words in Macbeth.")
    print(f"The unique word ratio is {ratio}")


There are 21428 words in Macbeth.
There are 6207 different words in Macbeth.
The unique word ratio is 0.2896677244726526

There are 21428 words in Macbeth.
There are 6207 different words in Macbeth.
The unique word ratio is 0.2896677244726526

There are 21428 words in Macbeth.
There are 6207 different words in Macbeth.
The unique word ratio is 0.2896677244726526

There are 21428 words in Macbeth.
There are 6207 different words in Macbeth.
The unique word ratio is 0.2896677244726526

There are 21428 words in Macbeth.
There are 6207 different words in Macbeth.
The unique word ratio is 0.2896677244726526

Notice that the final files_in_texts = os.listdir("texts") has been removed because it’s superfluous.

Now we can use the text_path variable, which changes on each iteration of the loop, in place of the file path. Specifically, we’ll make the change

"texts/Macbeth.txt" \(\rightarrow\) f"texts/{text_path}"

#### Part 2 ####
files_in_texts = os.listdir("texts")

for text_path in files_in_texts:
    with open(f"texts/{text_path}", encoding = "utf-8") as file:
        contents = file.read()

    words = contents.split()
    unique_words = set(words)

    word_count = len(words)
    unique_word_count = len(unique_words)
    ratio = unique_word_count / word_count

    print()
    print(f"There are {word_count} words in Macbeth.")
    print(f"There are {unique_word_count} different words in Macbeth.")
    print(f"The unique word ratio is {ratio}")


There are 51257 words in Macbeth.
There are 10206 different words in Macbeth.
The unique word ratio is 0.19911426731958562

There are 130410 words in Macbeth.
There are 14702 different words in Macbeth.
The unique word ratio is 0.11273675331646346

There are 114125 words in Macbeth.
There are 14307 different words in Macbeth.
The unique word ratio is 0.12536254107338446

There are 464023 words in Macbeth.
There are 40030 different words in Macbeth.
The unique word ratio is 0.0862672755445312

There are 21428 words in Macbeth.
There are 6207 different words in Macbeth.
The unique word ratio is 0.2896677244726526

If you look closely, it has worked - the numbers are changing each time. We need to update our messages though, to make them dynamic.

#### Part 2 ####
files_in_texts = os.listdir("texts")

for text_path in files_in_texts:
    with open(f"texts/{text_path}", encoding = "utf-8") as file:
        contents = file.read()

    words = contents.split()
    unique_words = set(words)

    word_count = len(words)
    unique_word_count = len(unique_words)
    ratio = unique_word_count / word_count

    print()
    print(f"There are {word_count} words in {text_path}.")
    print(f"There are {unique_word_count} different words in {text_path}.")
    print(f"The unique word ratio is {ratio}")


There are 51257 words in The_Great_Gatsby.txt.
There are 10206 different words in The_Great_Gatsby.txt.
The unique word ratio is 0.19911426731958562

There are 130410 words in Pride_and_Prejudice.txt.
There are 14702 different words in Pride_and_Prejudice.txt.
The unique word ratio is 0.11273675331646346

There are 114125 words in The_Adventures_of_Huckleberry_Finn.txt.
There are 14307 different words in The_Adventures_of_Huckleberry_Finn.txt.
The unique word ratio is 0.12536254107338446

There are 464023 words in The_Count_of_Monte_Cristo.txt.
There are 40030 different words in The_Count_of_Monte_Cristo.txt.
The unique word ratio is 0.0862672755445312

There are 21428 words in Macbeth.txt.
There are 6207 different words in Macbeth.txt.
The unique word ratio is 0.2896677244726526

Finally, let’s remove the trailing .txt on the messages by extracting the text’s title from its path. To do this, slice the string with square brackets: title = text_path[:-4]. In this case, we slice from the start of the path up to, but not including, the fourth-last character, which drops the four characters of .txt.

We can also replace the underscores with spaces using the .replace() string method.

Indexing and Slicing

Indexing

To extract a substring from a string (or a subset of a list) use square brackets and specify the position of the elements you want. For example, to pick out the first letter in the following string,

example_string = "apple"

you specify the position of the first element, which is 0 (in Python, count from 0):

example_string[0] # First element

'a'

If you want the second element, use 1:

example_string[1] # Second element

'p'

If you want to count from the end, use negatives:

example_string[-1] # Last element

'e'

Slicing

What if you want multiple elements? You slice by specifying the start and end indices between a colon:

example_string[1:3] # Elements 1 and 2

'pp'

Notice that it includes the first index but excludes the second.

To start at the beginning, just leave the first index out:

example_string[:3] # Elements 0, 1 and 2

'app'

To go to the end, leave the second index out:

example_string[2:] # Elements 2, 3, ..., -1

'ple'

Finally, you can combine negative indexing with slicing. For example, to go up to, but not including, the last element:

example_string[:-1]

'appl'

In sum:

Code	0	1	2	3	4
`example_string`	`"a"`	`"p"`	`"p"`	`"l"`	`"e"`
`example_string[0]`	`"a"`
`example_string[2]`			`"p"`
`example_string[-1]`					`"e"`
`example_string[1:3]`		`"p"`	`"p"`
`example_string[:3]`	`"a"`	`"p"`	`"p"`
`example_string[2:]`			`"p"`	`"l"`	`"e"`
`example_string[:-1]`	`"a"`	`"p"`	`"p"`	`"l"`
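Slicing off a fixed number of characters assumes the extension is always four characters long. As an alternative sketch, the function os.path.splitext() splits a path at its final dot, whatever the extension's length. The names parts, stem and extension are illustrative.

```python
import os

text_path = "Pride_and_Prejudice.txt"

# Slicing assumes the extension is exactly four characters long.
title_from_slice = text_path[:-4]

# splitext() returns two pieces: everything before the final dot, and the extension.
parts = os.path.splitext(text_path)
stem = parts[0]
extension = parts[1]

title = stem.replace("_", " ")

print(title_from_slice)
print(stem, extension)
print(title)
```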

#### Part 2 ####
files_in_texts = os.listdir("texts")

for text_path in files_in_texts:
    title = text_path[:-4].replace("_", " ") # <-- extract the title

    with open(f"texts/{text_path}", encoding = "utf-8") as file:
        contents = file.read()

    words = contents.split()
    unique_words = set(words)

    word_count = len(words)
    unique_word_count = len(unique_words)
    ratio = unique_word_count / word_count

    print()
    print(f"There are {word_count} words in {title}.") # <-- include in message
    print(f"There are {unique_word_count} different words in {title}.") # <-- include in message
    print(f"The unique word ratio is {unique_word_count / word_count}")


There are 51257 words in The Great Gatsby.
There are 10206 different words in The Great Gatsby.
The unique word ratio is 0.19911426731958562

There are 130410 words in Pride and Prejudice.
There are 14702 different words in Pride and Prejudice.
The unique word ratio is 0.11273675331646346

There are 114125 words in The Adventures of Huckleberry Finn.
There are 14307 different words in The Adventures of Huckleberry Finn.
The unique word ratio is 0.12536254107338446

There are 464023 words in The Count of Monte Cristo.
There are 40030 different words in The Count of Monte Cristo.
The unique word ratio is 0.0862672755445312

There are 21428 words in Macbeth.
There are 6207 different words in Macbeth.
The unique word ratio is 0.2896677244726526

Activity 2

If you open one of the files, you’ll notice that there is front and end matter which isn’t from the original texts. Let’s remove them and save the cleaned texts. To do so, we’ll need to use two skills:

Slicing
Writing to files

Part 1: Clean the texts

To remove the front/end matter, notice that the original texts all begin after the string

*** START OF THE PROJECT GUTENBERG EBOOK

and end before

*** END OF THE PROJECT GUTENBERG EBOOK.

To clean the text,

Use the string method contents.find(...) to find the index at which each of the two markers above begins,

start_index = contents.find(...)
end_index = contents.find(...)

Slice the text between those two indices and save it in a variable.
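
To see how .find() and slicing fit together before tackling the full texts, here is a small sketch on a made-up string. The markers <<START>> and <<END>> and the variable sample are assumptions for illustration; they are not the Project Gutenberg markers.

```python
# A made-up string with hypothetical markers around the part we want
sample = "header <<START>>the real text<<END>> footer"

start_marker = "<<START>>"
end_marker = "<<END>>"

# .find() returns the index where the marker begins
start_index = sample.find(start_marker)  # 7
end_index = sample.find(end_marker)      # 29

# Adding the marker's length moves the start to just after the marker
body = sample[start_index + len(start_marker):end_index]
print(body)  # the real text

# .find() returns -1 when the substring is not present
print(sample.find("missing"))  # -1
```

Because .find() returns -1 rather than raising an error, it is worth checking the result if a marker might be absent.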

Part 2: Write the strings to files

Writing strings to a file is similar to reading. We’ll start by creating a with ... as ... block pointing to the new file path:

with open(f"{title}_clean.txt", "w", encoding = "utf-8") as file:
    ...

Then use file.write(...) to write the cleaned string to the new file.
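
A minimal sketch of the write-then-read round trip is shown below. It writes to a temporary folder created with the standard tempfile module so that it does not touch your workshop files; the file name example_clean.txt and the text are made up for illustration.

```python
import os
import tempfile

clean_text = "Some cleaned text.\n"

# Work in a throwaway folder so no real files are overwritten
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example_clean.txt")

    # "w" opens the file for writing, creating it if necessary
    with open(path, "w", encoding = "utf-8") as file:
        file.write(clean_text)

    # Leaving the mode out opens the file for reading
    with open(path, encoding = "utf-8") as file:
        read_back = file.read()

print(read_back == clean_text)  # True
```

Keep in mind that opening an existing file with "w" replaces its contents.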

Solution

The following code is a possible solution to the problem.

#### Part 2 ####
files_in_texts = os.listdir("texts")

for text_path in files_in_texts:
    
    # ...
    # ...
    # ...

    # Remove front/end matter and save clean files
    start_message = "*** START OF THE PROJECT GUTENBERG EBOOK"
    end_message = "*** END OF THE PROJECT GUTENBERG EBOOK"

    start = contents.find(start_message) + len(start_message) # Start just after the marker
    end = contents.find(end_message)

    clean_text = contents[start:end]

    with open(f"{title}_clean.txt", "w", encoding = "utf-8") as file:
        file.write(clean_text)

Part 3 (extension): Making it modular

In this final (optional) part we take a look at making our code modular. In Python, you can do this in two ways:

Within the script, with functions
Outside the script, with modules.

Code from Parts 1 and 2

Before beginning, just check that your code is up to date:

import os

print("Running the Python Toolkit Program")
print(f"The current working directory is {os.getcwd()}")

# Check that the folder exists in our working directory
if os.path.exists("texts"):
    print("The folder /texts/ exists.")
else:
    raise FileNotFoundError("Cannot find the folder /texts/.")

files_in_texts = os.listdir("texts")

# Check that there are five files within texts
if len(files_in_texts) != 5:
    raise FileNotFoundError("Incorrect number of files in /texts/.")

#### Part 2 ####
files_in_texts = os.listdir("texts")

for text_path in files_in_texts:
    
    title = text_path[:-4]

    with open(f"texts/{text_path}", encoding = "utf-8") as file:
        contents = file.read()

    words = contents.split()
    unique_words = set(words)

    word_count = len(words)
    unique_word_count = len(unique_words)
    ratio = unique_word_count / word_count

    print()
    print(f"There are {word_count} words in {title}.")
    print(f"There are {unique_word_count} different words in {title}.")
    print(f"The unique word ratio is {unique_word_count / word_count}")

    # Remove front/end matter and save clean files
    start_message = "*** START OF THE PROJECT GUTENBERG EBOOK"
    end_message = "*** END OF THE PROJECT GUTENBERG EBOOK"

    start = contents.find(start_message) + len(start_message)
    end = contents.find(end_message)

    clean_text = contents[start:end]

    with open(f"{title}_clean.txt", "w", encoding = "utf-8") as file:
        file.write(clean_text)

Running the Python Toolkit Program
The current working directory is /home/runner/work/technology-training/technology-training/Python/5-python_toolkit
The folder /texts/ exists.

There are 51257 words in The_Great_Gatsby.
There are 10206 different words in The_Great_Gatsby.
The unique word ratio is 0.19911426731958562

There are 130410 words in Pride_and_Prejudice.
There are 14702 different words in Pride_and_Prejudice.
The unique word ratio is 0.11273675331646346

There are 114125 words in The_Adventures_of_Huckleberry_Finn.
There are 14307 different words in The_Adventures_of_Huckleberry_Finn.
The unique word ratio is 0.12536254107338446

There are 464023 words in The_Count_of_Monte_Cristo.
There are 40030 different words in The_Count_of_Monte_Cristo.
The unique word ratio is 0.0862672755445312

There are 21428 words in Macbeth.
There are 6207 different words in Macbeth.
The unique word ratio is 0.2896677244726526

Modularity within the script: functions

We’ll start with functions. Functions are like a script within a script - a section of code that runs when you call its name. They come in two parts:

The function call, which runs the code (e.g. print(), len(), etc.)
The function definition, which defines that code

Every time we’ve used a function, like print(...), len(...), etc., we have performed function calls. However, we need to write new definitions to make our own functions.

Let’s make a new function now, read_book(...), which reads a text file and returns the contents as a string, as we do in the loop.

We’ll start by defining our function. Do this at the top of your script, just after the import statements. Function definitions have the following syntax:

def <function_name>(<input1>, <input2>, ...):
    code
    code
    code
    return <output>

Let’s set it up, without including any code yet, with a single input variable path:

import os

def read_book(path):
    return

Function signature and inputs

The function signature, def <signature>: forms a key part of the definition. Inside the brackets there are different ways to specify the inputs.

No inputs

Your function doesn’t have to take any inputs. For example,

# Definition:
def no_inputs():
    ...
    return <output>

# Call:
no_inputs()

Compulsory inputs

If you just give the inputs names, they are compulsory: all calls must include them.

# Definition:
def compulsory_inputs(input1, input2):
    ...
    return <output>

# Call:
compulsory_inputs(a, b)

Default / optional inputs

You can specify default values for function inputs, which makes them optional.

# Definition:
def optional_inputs(input1 = "apple", input2 = "banana"):
    ...
    return <output>

# Call:
optional_inputs("cherry") # Will interpret as input1 = "cherry", input2 = "banana"
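
As a runnable sketch of default values, the hypothetical function describe_fruit below (not part of the workshop script) can be called with zero, one or two inputs:

```python
# Hypothetical function: both inputs have default values
def describe_fruit(fruit = "apple", colour = "red"):
    return f"The {fruit} is {colour}."

print(describe_fruit())                    # The apple is red.
print(describe_fruit("cherry"))            # The cherry is red.
print(describe_fruit("banana", "yellow"))  # The banana is yellow.
```

Any input left out of the call falls back to the value given in the definition.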

Positional vs Keyword arguments

Finally, when you call a function, you can either name each input explicitly (keyword arguments) or let Python match the values to the inputs by their position (positional arguments).

def example(input1, input2, input3):
    ...
    return <output>

example("apple", "banana", "cherry") 
example("apple", "banana", input3 = "cherry")
example(input1 = "apple", input2 = "banana", input3 = "cherry")
example(input3 = "cherry", input2 = "banana", input1 = "apple")

These are all valid calls, with various differences:

All positional
input1 and input2 are positional, while input3 is keyword
All keyword
All keyword - the order doesn’t matter for keyword arguments!

Positional arguments before keyword arguments

Because keyword arguments are unordered, positional arguments must precede them:

# Valid
example("apple", input2 = "banana", input3 = "cherry")

# Invalid - positional argument after keyword argument!
example(input1 = "apple", "banana", "cherry")
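
To check that positional and keyword calls really are equivalent, here is a small sketch using a hypothetical function, join_fruit, which is made up for illustration:

```python
# Hypothetical function used only to illustrate argument passing
def join_fruit(input1, input2, input3):
    return f"{input1}, {input2}, {input3}"

# All positional
a = join_fruit("apple", "banana", "cherry")

# Two positional, one keyword
b = join_fruit("apple", "banana", input3 = "cherry")

# All keyword, in a different order
c = join_fruit(input3 = "cherry", input2 = "banana", input1 = "apple")

print(a)            # apple, banana, cherry
print(a == b == c)  # True
```

The invalid form, with a positional argument after a keyword argument, is a SyntaxError, so Python rejects the script before running any of it.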

Now, let’s include the code that we previously used to read the file. Note that the variable containing the full file path is now path, so we should change the call to open(...) accordingly.

import os

def read_book(path):  
    with open(path, encoding = "utf-8") as file:
        contents = file.read()
    
    return contents

Function scope

Variables created within a function exist only while that function runs, so your main code can’t access them afterwards! This is called scope. To get a value out of a function, return it.
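
The sketch below illustrates scope with a hypothetical function, count_words. It uses try ... except, which is not covered in this workshop, only so that the script keeps running after the error.

```python
# Hypothetical function, made up to illustrate scope
def count_words(text):
    words = text.split()  # 'words' exists only inside the function
    return len(words)

number = count_words("to be or not to be")
print(number)  # 6

# The returned value is available, but the local variable is not
try:
    print(words)
except NameError as error:
    print(f"NameError: {error}")
```

Only the returned value survives the call, which is why read_book(...) ends with return contents.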

Finally, let’s replace code within the loop with a simple function call to our new function. All together,

import os

def read_book(path):  
    with open(path, encoding = "utf-8") as file:
        contents = file.read()
    
    return contents

print("Running the Python Toolkit Program")
print(f"The current working directory is {os.getcwd()}")

# Check that the folder exists in our working directory
if os.path.exists("texts"):
    print("The folder /texts/ exists.")
else:
    raise FileNotFoundError("Cannot find the folder /texts/.")

files_in_texts = os.listdir("texts")

# Check that there are five files within texts
if len(files_in_texts) != 5:
    raise FileNotFoundError("Incorrect number of files in /texts/.")

#### Part 2 ####
for text_path in files_in_texts:
    title = text_path[:-4]

    contents = read_book(f"texts/{text_path}")   # <-- Custom function call
    words = contents.split()

    unique_words = set(words)

    word_count = len(words)
    unique_word_count = len(unique_words)
    ratio = unique_word_count / word_count

    print()
    print(f"There are {word_count} words in {title}.")
    print(f"There are {unique_word_count} different words in {title}.")
    print(f"The unique word ratio is {unique_word_count / word_count}")

    # Remove front/end matter and save clean files
    start_message = "*** START OF THE PROJECT GUTENBERG EBOOK"
    end_message = "*** END OF THE PROJECT GUTENBERG EBOOK"

    start = contents.find(start_message) + len(start_message)
    end = contents.find(end_message)

    clean_text = contents[start:end]

    with open(f"{title}_clean.txt", "w", encoding = "utf-8") as file:
        file.write(clean_text)

Running the Python Toolkit Program
The current working directory is /home/runner/work/technology-training/technology-training/Python/5-python_toolkit
The folder /texts/ exists.

There are 51257 words in The_Great_Gatsby.
There are 10206 different words in The_Great_Gatsby.
The unique word ratio is 0.19911426731958562

There are 130410 words in Pride_and_Prejudice.
There are 14702 different words in Pride_and_Prejudice.
The unique word ratio is 0.11273675331646346

There are 114125 words in The_Adventures_of_Huckleberry_Finn.
There are 14307 different words in The_Adventures_of_Huckleberry_Finn.
The unique word ratio is 0.12536254107338446

There are 464023 words in The_Count_of_Monte_Cristo.
There are 40030 different words in The_Count_of_Monte_Cristo.
The unique word ratio is 0.0862672755445312

There are 21428 words in Macbeth.
There are 6207 different words in Macbeth.
The unique word ratio is 0.2896677244726526

Modularity beyond the script: modules

What actually happens when you run import ...? Python adds the contents of another Python file to the existing ‘namespace’. Basically, you import a bunch of functions (and classes, and other objects…)!
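
As a rough illustration, the snippet below inspects the os module after importing it. It only uses the standard library, and is a sketch of the idea rather than a full account of how importing works.

```python
import os

# The imported module is itself an object ...
print(type(os).__name__)  # module

# ... and its functions are reached through the module's name
print(callable(os.getcwd))  # True

# "from ... import ..." brings a single name in directly
from os import getcwd
print(getcwd is os.getcwd)  # True
```

Both forms refer to the same function; the difference is only in which names end up in your script's namespace.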

We can make our own modules that Python recognises with the import command. The simplest way is just another Python script. Let’s make one to store our new function, so it’s out of the way.

Create a new script in this folder called reader.py
Move the function into that file.

The script should look like this:

def read_book(path):  
    with open(path, encoding = "utf-8") as file:
        contents = file.read()
    
    return contents

Finally, we should reflect the changes in our original script.

Replace the old function definition with the command import reader.
Replace the old function call read_book(...) with reader.read_book(...).

The main script should look like this

import os
import reader

print("Running the Python Toolkit Program")
print(f"The current working directory is {os.getcwd()}")

# Check that the folder exists in our working directory
if os.path.exists("texts"):
    print("The folder /texts/ exists.")
else:
    raise FileNotFoundError("Cannot find the folder /texts/.")

files_in_texts = os.listdir("texts")

# Check that there are five files within texts
if len(files_in_texts) != 5:
    raise FileNotFoundError("Incorrect number of files in /texts/.")

#### Part 2 ####
files_in_texts = os.listdir("texts")

for text_path in files_in_texts:
    title = text_path[:-4]

    contents = reader.read_book(f"texts/{text_path}")
    words = contents.split()

    unique_words = set(words)

    word_count = len(words)
    unique_word_count = len(unique_words)
    ratio = unique_word_count / word_count

    print()
    print(f"There are {word_count} words in {title}.")
    print(f"There are {unique_word_count} different words in {title}.")
    print(f"The unique word ratio is {unique_word_count / word_count}")

    # Remove front/end matter and save clean files
    start_message = "*** START OF THE PROJECT GUTENBERG EBOOK"
    end_message = "*** END OF THE PROJECT GUTENBERG EBOOK"

    start = contents.find(start_message) + len(start_message)
    end = contents.find(end_message)

    clean_text = contents[start:end]

    with open(f"{title}_clean.txt", "w", encoding = "utf-8") as file:
        file.write(clean_text)

Running the Python Toolkit Program
The current working directory is /home/runner/work/technology-training/technology-training/Python/5-python_toolkit
The folder /texts/ exists.

There are 51257 words in The_Great_Gatsby.
There are 10206 different words in The_Great_Gatsby.
The unique word ratio is 0.19911426731958562

There are 130410 words in Pride_and_Prejudice.
There are 14702 different words in Pride_and_Prejudice.
The unique word ratio is 0.11273675331646346

There are 114125 words in The_Adventures_of_Huckleberry_Finn.
There are 14307 different words in The_Adventures_of_Huckleberry_Finn.
The unique word ratio is 0.12536254107338446

There are 464023 words in The_Count_of_Monte_Cristo.
There are 40030 different words in The_Count_of_Monte_Cristo.
The unique word ratio is 0.0862672755445312

There are 21428 words in Macbeth.
There are 6207 different words in Macbeth.
The unique word ratio is 0.2896677244726526

Conclusion and Summary

This is a big workshop, and we’ve covered a lot of content! See the summary table below for details on the topics covered. Each is linked to the notes in the workshop.

If you have any further questions, don’t hesitate to contact us at training@library.uq.edu.au.

Topic	Code	Description
The `os` module	`import os` `os.getcwd()` `os.chdir()` `os.listdir()`	A built-in module which enables interacting with your operating system.
f-strings	`f"1+1 = {1+1}"`	Formatted strings, which behave like normal strings except that code within curly brackets `{...}` is executed.
Conditionals	`if <condition1>: ... elif <condition2>: ... else: ...`	Sections of code which only run if a condition is true. Always start with `if`. Use `elif` to check additional conditions (only if the earlier ones fail). Use `else` to catch everything that fails all conditions.
Raising exceptions	`raise ...Error("error_message")`	A way to manually trigger error messages and stop the program. Replace `...` with an error type, e.g. `KeyError`, `ValueError`.
File input/output	`with open(...) as <placeholder>: ...` `open("path", encoding = "utf-8")` `open("path", "w", encoding = "utf-8")`	Read from and write to files with the `open()` function and `with ... as ...` blocks. Most text files use the `utf-8` encoding, which is not the default on every system, so specify it explicitly. Pass the `"w"` mode to write to a file, and leave it out to read.
Loops	`for <placeholder> in <iterable>: ...` `while <condition>: ...`	Run sections of code multiple times with a loop. `for` loops run once for each element in an iterable object (e.g. a list). Each iteration stores the current element in the name you give for `<placeholder>`. `while` loops run until `<condition>` is `False`. If it never becomes `False`, the loop runs indefinitely until you interrupt the program.
Indexing and slicing	`example_string[1:4]`	Access individual elements of a string or list by indexing and slicing with square brackets.
Custom functions and modules	`def function_name(input1, input2): ... return ...` `function_name(a, b)` `import module`	Store sections of code away in functions to run them at a later point. Write a function definition with `def ...`, which contains the code. Call the function to use it with specific inputs. You can store the functions in a separate script and import that script as a module.
