example_number = 100.5
example_string = "Hello!"
Python Toolkit
In this standalone workshop, we take a look at the Python building blocks that you’ll want to have in your toolkit. This includes:
- Loops
- Functions
- Modules
- I/O and filesystem (using os and sys)
We’ll work by incrementally building up a bigger and bigger program containing all the features we cover. That way you’ll see them all in context.
Along the way, you’ll encounter callouts like these:
Description of new feature
These crop up every time a new feature or concept is introduced.
Preparation
Before we begin, please complete the following steps:
- Open your preferred IDE (e.g. Spyder).
- Download the files for today’s session and place them in an accessible location.
Part 0: Refresher
Before we get into the content for this workshop, let’s just have a brief refresher on what Python is and how it works.
Python is a program on your computer, just like any other. When you want to run Python code, or a .py file, the Python program (e.g. python.exe) runs. It takes your code as an input and evaluates it.
Python is also the name of the programming language which the Python program interprets. So, we refer to the language as Python, and to the executable program which interprets it as the interpreter.
The language is similar to other high-level, object-oriented languages like R, MATLAB, Julia, JavaScript etc. It is versatile and reads easily.
There are various ways to run Python; this workshop assumes you’re using an IDE like Spyder. Here, we’ll write just one script, containing our program.
In Python we work with two types of objects: variables and functions.
Variables
Variables allow you to store information in a named object. We use the = operator to create them, with the syntax
<name> = <value>
For example, the script below creates two objects, example_number and example_string, by assigning values to both.
Functions
Functions enable you to run predefined Python code. Functions can be built-in, from a module, or written yourself.
To call (use) a function, the syntax is
<function>(<input_1>, <input_2>, ...)
The parentheses () are essential, even if you don’t have any inputs.
For example, let’s use the round() function to round our example_number:
round(example_number)
100
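You might have expected 100.5 to round up to 101. As a small aside, the sketch below shows how round() treats exact halves (it goes to the nearest even number) and how an optional second input sets the number of decimal places. The extra values are only illustrations.

```python
# round() with one input returns a whole number.
# Exact halves are rounded to the nearest even number.
example_number = 100.5

rounded_down = round(example_number)  # 100 is even, so 100.5 becomes 100
rounded_up = round(101.5)             # 102 is even, so 101.5 becomes 102

# An optional second input gives the number of decimal places to keep.
one_decimal = round(3.14159, 1)

print(rounded_down, rounded_up, one_decimal)
```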
Keywords and Modules
Finally, you will encounter a third type of command in Python: keywords. These are special commands which alter your code in a predetermined way, and you can’t use them for variable names. You also can’t write your own - they’re baked into Python itself.
Keywords must always be followed by a space (or a colon in special cases). One example is the import command, which loads a module. Modules are collections of Python code that other people have written, containing lots of functions, variables and submodules.
For example, the following code imports the os module.
import os
Part 1: Verifying your setup
In the first part of this workshop, we’ll learn to use Python’s building blocks by verifying our setup is working. We’ll need to use
- The os module and functions therein
- The print() function
- Conditionals if and else
- Error handling with raise
To begin, create a new script called toolkit.py. This should be in the same folder as the /texts/ folder you downloaded.
Using os to check the working directory
We’ll begin by diving into an unusual starting point: the os module. Python comes with a built-in module called os (short for operating system) which allows you to interact with your computer. Let’s start by importing it:
import os
os module
The os module is a built-in Python module that enables interfacing with the operating system. Some popular functions, most of which we don’t cover in this workshop, are
- os.chdir(...): Change the current working directory to ...
- os.system(...): Send the command ... to a terminal
- os.walk(...): Recursively walk through the files in ... (requires looping, see later)
Importing links our Python environment to the os module, so we can now access code from it. We’re starting here because the os module lets us interact with our computer.
Next, we’ll use the print() function to send a message to the user:
import os
print("Running the Python Toolkit Program")
Running the Python Toolkit Program
print() function
The function print(...) is a built-in function which sends the message ... to the console. Note that the output of the function is technically None (this is different to the console message).
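To see the difference between the console message and the function’s output, here is a minimal sketch that stores whatever print() returns in a variable (the name result is just for illustration).

```python
message = "Running the Python Toolkit Program"

# The text appears in the console...
result = print(message)

# ...but the value handed back by print() is None.
print(result)
```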
As a first application of os, we’ll use it to print the current working directory. This is a location on your computer that the program treats as home: all relative file paths are resolved from the current working directory. We’ll need this later when we deal with file paths.
You can access the current working directory with the getcwd() function within the os module. Because the function lives inside the module, we need to use the . operator.
import os
print("Running the Python Toolkit Program")
print("The current working directory is")
print(os.getcwd())
Running the Python Toolkit Program
The current working directory is
C:\Users\uqcwest5\OneDrive - The University of Queensland\Tech Training\technology-training\Python\5-python_toolkit
. operator
The . operator allows you to access objects that exist within other objects. Here, we access the getcwd() function which lives inside the os module.
All objects in Python have methods (functions) and attributes (variables) attached to them, which you use the . operator to access.
We can simplify this process by using f-strings. Including the letter f directly before a string’s first quotation mark tells Python to execute any code within curly brackets:
import os
print("Running the Python Toolkit Program")
print(f"The current working directory is {os.getcwd()}")
Running the Python Toolkit Program
The current working directory is C:\Users\uqcwest5\OneDrive - The University of Queensland\Tech Training\technology-training\Python\5-python_toolkit
When you work with other files, you’ll want to set the working directory. Here, we’ll be loading data that we store in the same folder as your Python script, so let’s move the working directory there.
- First, copy the path of the folder containing your script. On Windows, right click the file, press “copy as path”, and drop the file name from the end once pasted.
- Next, use the os.chdir() function to change the current working directory as follows
os.chdir(r"path/to/folder")
Make sure you include the r before the path. This tells Python it’s a raw string, so it won’t misinterpret the backslashes.
You can include executed code within strings by prepending them with the letter f. The code needs to be placed within curly brackets inside the string, and the output of the code will be directly inserted there.
f"This is an f-string, showing that 1 + 1 = {1 + 1}"
'This is an f-string, showing that 1 + 1 = 2'
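The curly brackets can hold variables as well as calculations. The sketch below reuses values that appear later in this workshop and also shows format specifiers (the part after the colon). These aren’t covered in the workshop itself, so treat them as an optional extra.

```python
title = "Macbeth"
word_count = 21428
ratio = 0.2896677244726526

# Variables are inserted where the curly brackets are
plain = f"{title} has {word_count} words."

# Optional: ':.3f' keeps three decimal places, ':,' adds thousands separators
rounded = f"The unique word ratio is {ratio:.3f}"
with_commas = f"{word_count:,} words"

print(plain)
print(rounded)
print(with_commas)
```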
Managing folders and paths with conditionals
Now that we’ve printed a welcome message, let’s use another function to ensure that we’ve got the folder “texts” in our working directory. The function lives in a submodule of os called os.path, and the function is exists().
import os
print("Running the Python Toolkit Program")
print(f"The current working directory is {os.getcwd()}")
# Check that the folder exists in our working directory
folder_check = os.path.exists("texts")
print(folder_check)
Running the Python Toolkit Program
The current working directory is C:\Users\uqcwest5\OneDrive - The University of Queensland\Tech Training\technology-training\Python\5-python_toolkit
True
Most big modules, including os, contain submodules. These are also accessed via the . notation:
module.submodule.function()
Submodules could also have submodules:
module.submodule.submodule_of_submodule.function()
This helps keep the scope clear. For example, there might be lots of submodules with the function exists(), but because we call os.path.exists(), we know that this one relates to filepaths.
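As a rough illustration of a few other functions in os.path, the sketch below builds a path with os.path.join() and checks it with os.path.exists() and os.path.isdir(). It uses a throwaway folder from the built-in tempfile module (an assumption made so the example runs anywhere), rather than your workshop folder.

```python
import os
import tempfile

# Work inside a temporary folder that is deleted afterwards
with tempfile.TemporaryDirectory() as base:
    # join() glues path pieces together with the right separator
    texts_folder = os.path.join(base, "texts")

    before = os.path.exists(texts_folder)     # nothing there yet
    os.mkdir(texts_folder)                    # create the folder
    after = os.path.exists(texts_folder)      # now it exists
    is_folder = os.path.isdir(texts_folder)   # and it is a folder, not a file

print(before, after, is_folder)
```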
What should we do if the folder doesn’t exist? We should probably stop the program, fix our setup, and then try again.
Let’s tell Python to print a message if the folder exists. For this, we need to use conditionals.
The keyword if indicates that a block of indented code should only be run if a condition is True. Because the function os.path.exists() returns True or False, we can use it as the condition.
Let’s set up our conditional and test it by printing a message. We have to be careful with the syntax:
if <condition>:
    code_to_run_if_True
code_outside_will_always_run
Including it in our code, we want to substitute:
<condition> \(\rightarrow\) os.path.exists("texts")
code_to_run_if_True \(\rightarrow\) print("The folder /texts/ exists")
import os
print("Running the Python Toolkit Program")
print(f"The current working directory is {os.getcwd()}")
# Check that the folder exists in our working directory
if os.path.exists("texts"):
    print("The folder /texts/ exists")
Running the Python Toolkit Program
The current working directory is C:\Users\uqcwest5\OneDrive - The University of Queensland\Tech Training\technology-training\Python\5-python_toolkit
The folder /texts/ exists
What if it doesn’t exist? We can use the keyword else to catch anything that fails the condition.
import os
print("Running the Python Toolkit Program")
print(f"The current working directory is {os.getcwd()}")
# Check that the folder exists in our working directory
if os.path.exists("texts"):
    print("The folder /texts/ exists.")
else:
    print("The folder /texts/ does not exist.")
Running the Python Toolkit Program
The current working directory is C:\Users\uqcwest5\OneDrive - The University of Queensland\Tech Training\technology-training\Python\5-python_toolkit
The folder /texts/ exists.
if, elif and else
The keywords if, elif, and else together constitute the conditionals that Python supports. These allow you to only run code based on certain conditions.
The first, if, is always required. The syntax is,
- The if keyword
- A condition, which must become either True or False
- A colon, :
- The code to run if the condition is True. It must be indented if on a new line.
if <condition>:
    code_to_run_if_True
code_outside_will_always_run
The keyword elif allows you to check an additional condition, only if the previous condition(s) failed.
if <condition>:
    code_to_run_if_True
elif <condition_2>:
    code_to_run_if_condition2_True
elif <condition_3>:
    code_to_run_if_condition3_True
code_outside_will_always_run
Finally, the keyword else catches anything that failed all conditions
if <condition>:
    code_to_run_if_True
elif <condition_2>:
    code_to_run_if_condition2_True
elif <condition_3>:
    code_to_run_if_condition3_True
else:
    code_to_run_if_all_failed
code_outside_will_always_run
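To make the pattern concrete, here is a small sketch with real conditions. The value of file_count is made up for the example; in our program it would come from counting the files in the texts folder.

```python
file_count = 3  # hypothetical value, chosen for illustration

# Only the first branch whose condition is True will run
if file_count == 5:
    status = "All five texts found."
elif file_count == 0:
    status = "The folder is empty."
elif file_count < 5:
    status = "Some texts are missing."
else:
    status = "There are extra files."

# This line is outside the conditional, so it always runs
print(status)
```

Try changing file_count to 0, 5 or 7 to see each branch in action.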
Realistically, we don’t want our program to run unless the setup is correct. This means we should stop the program if it can’t find the folder.
You’ve probably already encountered errors in your code. Now it’s your chance to code them in manually. To raise an error,
- Use the raise keyword
- Follow with a valid error, e.g. ValueError(), KeyError() etc.
- Place a useful message inside the brackets.
In this case, we should use FileNotFoundError(). Let’s replace the else section:
import os
print("Running the Python Toolkit Program")
print(f"The current working directory is {os.getcwd()}")
# Check that the folder exists in our working directory
if os.path.exists("texts"):
    print("The folder /texts/ exists.")
else:
    raise FileNotFoundError("Cannot find the folder /texts/.")
Running the Python Toolkit Program
The current working directory is C:\Users\uqcwest5\OneDrive - The University of Queensland\Tech Training\technology-training\Python\5-python_toolkit
The folder /texts/ exists.
raise
The raise keyword allows you to raise exceptions (errors) in Python. These stop the execution and print an error message.
The syntax is
raise SomeError("Appropriate error message")
and there are lots of built-in exceptions.
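Because our folder exists, we never see the error in action. The sketch below raises it on purpose for a folder name that is assumed not to exist (the name is made up). It wraps the check in try ... except, which isn’t covered in this workshop and is used here only so the example can show the message without stopping the program.

```python
import os

# Hypothetical folder name that should not exist on your computer
folder = "a_folder_that_should_not_exist_12345"

try:
    if not os.path.exists(folder):
        # An f-string lets the error message name the missing folder
        raise FileNotFoundError(f"Cannot find the folder /{folder}/.")
except FileNotFoundError as error:
    # Without try/except, the program would stop at the raise line
    caught_message = str(error)
    print(f"Caught: {caught_message}")
```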
Activity 1
To make sure that you’ve set things up correctly, you should raise an error if the number of files in the texts folder is not five.
To set things up, let’s use the os.listdir() function to get a list of the files, storing them in a variable.
import os
print("Running the Python Toolkit Program")
print(f"The current working directory is {os.getcwd()}")
# Check that the folder exists in our working directory
if os.path.exists("texts"):
    print("The folder /texts/ exists.")
else:
    raise FileNotFoundError("Cannot find the folder /texts/.")
# Check that there are five files in the folder
files_in_texts = os.listdir("texts")
Running the Python Toolkit Program
The current working directory is C:\Users\uqcwest5\OneDrive - The University of Queensland\Tech Training\technology-training\Python\5-python_toolkit
The folder /texts/ exists.
For this activity, you can use the len() built-in function to determine the size of files_in_texts. Then, use a conditional to raise an error if it’s not five.
In sum,
- Use len() to determine the number of objects in files_in_texts
- Use an if statement to check if this is not equal to five. You’ll need the != (not equal to) operator.
- Use the raise keyword to raise an error.
To check (in)equalities, you can use comparison operators. For example,
1 == 1 # Equal to
True
1 != 2 # Not equal to
True
2 > 1 # Greater than
True
1 <= 1 # Less than or equal to
True
The new code is
files_in_texts = os.listdir("texts")
# Check that there are five files within texts
if len(files_in_texts) != 5:
    raise FileNotFoundError("Incorrect number of files in /texts/.")
Note that this doesn’t produce an output message. Generally, if everything is fine, we don’t need a message.
The whole program has now become
import os
print("Running the Python Toolkit Program")
print(f"The current working directory is {os.getcwd()}")
# Check that the folder exists in our working directory
if os.path.exists("texts"):
    print("The folder /texts/ exists.")
else:
    raise FileNotFoundError("Cannot find the folder /texts/.")
files_in_texts = os.listdir("texts")
# Check that there are five files within texts
if len(files_in_texts) != 5:
    raise FileNotFoundError("Incorrect number of files in /texts/.")
Running the Python Toolkit Program
The current working directory is C:\Users\uqcwest5\OneDrive - The University of Queensland\Tech Training\technology-training\Python\5-python_toolkit
The folder /texts/ exists.
Part 2: Analysing the data
In the second part of this session, we’ll learn to use Python’s input/output and looping features to analyse the files within texts. We’ll use
- The open() function and with ... as ... keywords for reading
- String methods and the set variable type to analyse the texts
- for loops to automate the process
Before we begin, ensure that your code looks like this. Continue from the bottom throughout part 2.
import os
print("Running the Python Toolkit Program")
print(f"The current working directory is {os.getcwd()}")
# Check that the folder exists in our working directory
if os.path.exists("texts"):
    print("The folder /texts/ exists.")
else:
    raise FileNotFoundError("Cannot find the folder /texts/.")
files_in_texts = os.listdir("texts")
# Check that there are five files within texts
if len(files_in_texts) != 5:
    raise FileNotFoundError("Incorrect number of files in /texts/.")
#### Part 2 ####
Running the Python Toolkit Program
The current working directory is C:\Users\uqcwest5\OneDrive - The University of Queensland\Tech Training\technology-training\Python\5-python_toolkit
The folder /texts/ exists.
Reading and manipulating text files
We’ll start by reading the first file, Macbeth.txt, into Python. Make sure to include the snippets in part 2 below the code from part 1.
To read a file in Python,
- Create a with ... as ... block to make sure the file connection closes properly
- At the first ..., use the open(<filepath>, encoding = "utf-8") function to open the file. Use encoding = "utf-8" because your operating system might not have this as the default.
- At the second ..., use a placeholder variable to store the file connection. Something like file.
- Inside the block (indented, like an if statement), use file.read() to access its contents and store that in a variable.
We’ll use that final variable to perform our analysis. It is completely disconnected from the actual file.
#### Part 2 ####
with open("texts/Macbeth.txt", encoding = "utf-8") as file:
    contents = file.read()
Reading and writing files in Python takes a few steps. Essentially, Python forms a connection to a file with the open() function, and that connection closes automatically if we open it inside a with ... as ... block.
The syntax is
with open("path_to_file", encoding = "...") as <placeholder>:
    code_with_file_connection_open
code_once_connection_has_closed
Note that whatever you put at <placeholder> will store the file connection. Files opened for reading have the method read(), which returns the contents as a string. A method is a function that you access with the . operator.
Typically, you want to store the contents in a variable, like in our contents example, which is disconnected from the actual file.
The encoding refers to how the file stores text. Generally, you should use encoding = "utf-8", because the default is based on your operating system and varies from machine to machine.
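If you’d like to check that the connection really does close, the sketch below is one way to see it. It assumes a small throwaway file (created with the built-in tempfile module, so it doesn’t touch your texts folder) and inspects the closed attribute of the file connection inside and outside the block.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.txt")  # hypothetical file name

    # Create a small file so there is something to read
    with open(path, "w", encoding="utf-8") as file:
        file.write("Double, double toil and trouble")

    # Read it back, as we do with Macbeth.txt
    with open(path, encoding="utf-8") as file:
        contents = file.read()
        open_inside = file.closed   # False: still connected inside the block

    closed_after = file.closed      # True: closed once the block ends

# The contents variable survives even though the file is closed
print(contents)
print(open_inside, closed_after)
```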
Next, let’s perform some analysis of the text. Our goal is to compare the total number of words with the total unique number of words.
First, we need to apply a string method to separate the words in the text. Methods are functions that all variables of a particular type have access to, and we use them with the dot operator (.). In this case, the .split() method will create a list by dividing the string wherever there is whitespace.
#### Part 2 ####
with open("texts/Macbeth.txt", encoding = "utf-8") as file:
    contents = file.read()
words = contents.split()
Every variable has methods (functions) and attributes (variables) associated with it, accessible via the . operator. For example, all strings have the .lower() method which makes them lowercase:
example_string = "THIS WAS IN CAPS"
example_string.lower()
'this was in caps'
Other variables have their own methods. Numbers have the .as_integer_ratio() method, which turns the number into a fraction
example_int = 5.5
example_int.as_integer_ratio()
(11, 2)
and lists have the .append() method, which adds another element to the list
example_list = ["a", "b"]
example_list.append("c")
print(example_list)
['a', 'b', 'c']
We can then use the len() function again to determine the total number of words and print a message. We can also use print() by itself to make an empty line.
#### Part 2 ####
with open("texts/Macbeth.txt", encoding = "utf-8") as file:
    contents = file.read()
words = contents.split()
word_count = len(words)
print()
print(f"There are {word_count} words in Macbeth.")
There are 21428 words in Macbeth.
To work out the unique number of words, we can convert our list to a different variable type: the set. Sets are like lists, but they only contain unique values.
We can convert a variable to another type by using its type as a function, e.g. int(), str(), list(). Here, we’ll need set(). Then we can use len() again to determine its size.
- Create a set of unique words with set(words)
- Find the count of unique words with len()
- Print an additional message
#### Part 2 ####
with open("texts/Macbeth.txt", encoding = "utf-8") as file:
    contents = file.read()
words = contents.split()
unique_words = set(words)
word_count = len(words)
unique_word_count = len(unique_words)
print()
print(f"There are {word_count} words in Macbeth.")
print(f"There are {unique_word_count} different words in Macbeth.")
There are 21428 words in Macbeth.
There are 6207 different words in Macbeth.
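One thing to keep in mind is that a set treats "The", "the" and "THE" as three different words. As an optional refinement (not part of the workshop program), the sketch below uses a made-up sentence to show how applying .lower() before splitting changes the unique count.

```python
contents = "The cat saw the Cat and THE dog"  # made-up example text

# Same steps as in our program
words = contents.split()
unique_words = set(words)

# Optional: make everything lowercase first, so capitalisation is ignored
lowercase_words = contents.lower().split()
unique_lowercase_words = set(lowercase_words)

print(len(words), len(unique_words), len(unique_lowercase_words))
```

Note that punctuation still sticks to words after splitting (e.g. "trouble;"), so the counts in this workshop are best read as approximate.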
list, tuple, dict and set
Python has four built-in variable types which are ‘containers’: they store multiple values.
Lists
Lists are simply collections of Python objects. They are ordered, so you can access elements by their index, and they are mutable, so you can change individual elements.
Create a list with square brackets:
example_list = [1, "a", 5.5]
# Mutable - can change specific elements
# Ordered - access elements by position
example_list[0] = "first"
print(example_list)
['first', 'a', 5.5]
Tuples
Tuples are like lists, but you can’t modify their elements. That makes them ordered and immutable.
Create a tuple with parentheses:
example_tuple = (1, "a", 5.5)
# Immutable - attempting to change specific element gives error
example_tuple[0] = "first"
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[29], line 4
      1 example_tuple = (1, "a", 5.5)
      3 # Immutable - attempting to change specific element gives error
----> 4 example_tuple[0] = "first"
TypeError: 'tuple' object does not support item assignment
Dictionaries
Dictionaries are like lists, but instead of using position to identify elements, you use keys.
Create a dictionary with curly brackets and key: value pairs:
example_dict = {"a": 1, "b": 2, "c": 3}
# Can create new elements in dictionary by 'accessing' them
example_dict["d"] = 4
print(example_dict)
{'a': 1, 'b': 2, 'c': 3, 'd': 4}
Sets
Sets are like lists but the elements are unique. Duplicates will always be removed. They are also unordered, so you can’t access individual elements unless you loop through the set.
example_set = {"a", "a", 2, 2, "c"}
print(example_set)
{'a', 2, 'c'}
Finally, let’s determine the ratio of unique words to total words
\[\text{ratio} = \frac{\text{unique words}}{\text{total words}}\]
#### Part 2 ####
with open("texts/Macbeth.txt", encoding = "utf-8") as file:
    contents = file.read()
words = contents.split()
unique_words = set(words)
word_count = len(words)
unique_word_count = len(unique_words)
ratio = unique_word_count / word_count
print()
print(f"There are {word_count} words in Macbeth.")
print(f"There are {unique_word_count} different words in Macbeth.")
print(f"The unique word ratio is {unique_word_count / word_count}")
There are 21428 words in Macbeth.
There are 6207 different words in Macbeth.
The unique word ratio is 0.2896677244726526
Using loops to automate the process
Now that we’ve analysed one of the texts, let’s do the same for all five.
The brute force approach is to copy the code five times and adjust it.
However, we can do one better with a for loop. This enables us to repeat a section of code for each element in an object.
for <placeholder> in <object>:
    code_to_repeat
code_after_loop
We’ll start just by printing out the names of each file. We need the list of file names, which we get from os.listdir("texts").
#### Part 2 ####
with open("texts/Macbeth.txt", encoding = "utf-8") as file:
    contents = file.read()
words = contents.split()
unique_words = set(words)
word_count = len(words)
unique_word_count = len(unique_words)
ratio = unique_word_count / word_count
print()
print(f"There are {word_count} words in Macbeth.")
print(f"There are {unique_word_count} different words in Macbeth.")
print(f"The unique word ratio is {unique_word_count / word_count}")
files_in_texts = os.listdir("texts")
for text_path in files_in_texts:
    print(text_path)
There are 21428 words in Macbeth.
There are 6207 different words in Macbeth.
The unique word ratio is 0.2896677244726526
Macbeth.txt
Pride_and_Prejudice.txt
The_Adventures_of_Huckleberry_Finn.txt
The_Count_of_Monte_Cristo.txt
The_Great_Gatsby.txt
for loops
To iterate through an object, running the same code on each element, Python offers the for loop.
for <placeholder> in <object>:
    code_to_repeat
code_after_loop
Whatever you name in <placeholder> will store an element of the <object> for each iteration of the loop.
For example, the following loop prints each element of example_list. Each time the loop runs, letter stores one of the list’s elements, in order.
example_list = ["a", "b", "c"]
for letter in example_list:
    print(letter)
a
b
c
There are a few important keywords you can use to help with loops.
break
The keyword break tells Python to finish the loop immediately. This is often used with conditionals. For example,
example_list = ["a", "b", "c"]
for letter in example_list:
    if letter == "b":
        break
    print(letter)
a
continue
The keyword continue tells Python to skip the rest of the current iteration and start the next.
example_list = ["a", "b", "c"]
for letter in example_list:
    if letter == "b":
        continue
    print(letter)
a
c
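As a slightly more practical sketch, continue can be used to skip files we don’t want to analyse. The file names below are made up, and .endswith() is a string method that checks how a string ends.

```python
# Hypothetical folder contents, including one file that isn't a text
file_names = ["Macbeth.txt", "notes.md", "The_Great_Gatsby.txt"]

text_files = []
for name in file_names:
    # Skip anything that doesn't end in .txt
    if not name.endswith(".txt"):
        continue
    text_files.append(name)

print(text_files)
```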
We can use our for loop to run the whole analysis on each file.
First, let’s just place the analysis inside the loop. This will run once for each file, but because we haven’t changed the path from "texts/Macbeth.txt", it will still read Macbeth each time.
#### Part 2 ####
files_in_texts = os.listdir("texts")
for text_path in files_in_texts:
    with open("texts/Macbeth.txt", encoding = "utf-8") as file:
        contents = file.read()
    words = contents.split()
    unique_words = set(words)
    word_count = len(words)
    unique_word_count = len(unique_words)
    ratio = unique_word_count / word_count
    print()
    print(f"There are {word_count} words in Macbeth.")
    print(f"There are {unique_word_count} different words in Macbeth.")
    print(f"The unique word ratio is {unique_word_count / word_count}")
There are 21428 words in Macbeth.
There are 6207 different words in Macbeth.
The unique word ratio is 0.2896677244726526
There are 21428 words in Macbeth.
There are 6207 different words in Macbeth.
The unique word ratio is 0.2896677244726526
There are 21428 words in Macbeth.
There are 6207 different words in Macbeth.
The unique word ratio is 0.2896677244726526
There are 21428 words in Macbeth.
There are 6207 different words in Macbeth.
The unique word ratio is 0.2896677244726526
There are 21428 words in Macbeth.
There are 6207 different words in Macbeth.
The unique word ratio is 0.2896677244726526
Notice that the final files_in_texts = os.listdir("texts") has been removed because it’s superfluous.
Now we can use the text_path variable, which changes on each iteration of the loop, in place of the file path. Specifically, we’ll make the change
"texts/Macbeth.txt" \(\rightarrow\) f"texts/{text_path}"
#### Part 2 ####
files_in_texts = os.listdir("texts")
for text_path in files_in_texts:
    with open(f"texts/{text_path}", encoding = "utf-8") as file:
        contents = file.read()
    words = contents.split()
    unique_words = set(words)
    word_count = len(words)
    unique_word_count = len(unique_words)
    ratio = unique_word_count / word_count
    print()
    print(f"There are {word_count} words in Macbeth.")
    print(f"There are {unique_word_count} different words in Macbeth.")
    print(f"The unique word ratio is {unique_word_count / word_count}")
There are 21428 words in Macbeth.
There are 6207 different words in Macbeth.
The unique word ratio is 0.2896677244726526
There are 130410 words in Macbeth.
There are 14702 different words in Macbeth.
The unique word ratio is 0.11273675331646346
There are 114125 words in Macbeth.
There are 14307 different words in Macbeth.
The unique word ratio is 0.12536254107338446
There are 464023 words in Macbeth.
There are 40030 different words in Macbeth.
The unique word ratio is 0.0862672755445312
There are 51257 words in Macbeth.
There are 10206 different words in Macbeth.
The unique word ratio is 0.19911426731958562
If you look closely, it has worked - the numbers are changing each time. We need to update our messages though, to make them dynamic.
#### Part 2 ####
files_in_texts = os.listdir("texts")
for text_path in files_in_texts:
    with open(f"texts/{text_path}", encoding = "utf-8") as file:
        contents = file.read()
    words = contents.split()
    unique_words = set(words)
    word_count = len(words)
    unique_word_count = len(unique_words)
    ratio = unique_word_count / word_count
    print()
    print(f"There are {word_count} words in {text_path}.")
    print(f"There are {unique_word_count} different words in {text_path}.")
    print(f"The unique word ratio is {unique_word_count / word_count}")
There are 21428 words in Macbeth.txt.
There are 6207 different words in Macbeth.txt.
The unique word ratio is 0.2896677244726526
There are 130410 words in Pride_and_Prejudice.txt.
There are 14702 different words in Pride_and_Prejudice.txt.
The unique word ratio is 0.11273675331646346
There are 114125 words in The_Adventures_of_Huckleberry_Finn.txt.
There are 14307 different words in The_Adventures_of_Huckleberry_Finn.txt.
The unique word ratio is 0.12536254107338446
There are 464023 words in The_Count_of_Monte_Cristo.txt.
There are 40030 different words in The_Count_of_Monte_Cristo.txt.
The unique word ratio is 0.0862672755445312
There are 51257 words in The_Great_Gatsby.txt.
There are 10206 different words in The_Great_Gatsby.txt.
The unique word ratio is 0.19911426731958562
Finally, let’s remove the trailing .txt from the messages by extracting the text’s title from its path. To do this, slice the string with square brackets: title = text_path[:-4]. In this case, we slice from the start of the path up to, but not including, the fourth-last character.
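Slicing off four characters works because every file here ends in .txt. As an alternative sketch (not the approach used in this workshop), os.path.splitext() splits a path into the part before the extension and the extension itself, whatever its length.

```python
import os

text_path = "Pride_and_Prejudice.txt"

# The workshop's approach: drop the last four characters
title_by_slice = text_path[:-4]

# Alternative: let os.path work out where the extension starts
parts = os.path.splitext(text_path)
title = parts[0]       # everything before the extension
extension = parts[1]   # the extension, including the dot

print(title_by_slice, title, extension)
```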
Indexing
To extract a substring from a string (or a subset of a list) use square brackets and specify the position of the elements you want. For example, to pick out the first letter in the following string,
example_string = "apple"
you specify the position of the first element, which is 0 (in Python, count from 0):
example_string[0] # First element
'a'
If you want the second element, use 1:
example_string[1] # Second element
'p'
If you want to count from the end, use negatives:
example_string[-1] # Last element
'e'
Slicing
What if you want multiple elements? You slice by specifying the start and end indices on either side of a colon:
example_string[1:3] # Elements 1 and 2
'pp'
Notice that it includes the first index but excludes the second.
To start at the beginning, just leave the first index out:
example_string[:3] # Elements 0, 1 and 2
'app'
To go to the end, leave the second index out:
example_string[2:] # Elements 2, 3, ..., -1
'ple'
Finally, you can combine negative indexing with slicing. For example, to go up to, but not including, the last element:
example_string[:-1]
'appl'
'appl'
In sum:
Code | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
example_string | "a" | "p" | "p" | "l" | "e" |
example_string[0] | "a" | | | | |
example_string[2] | | | "p" | | |
example_string[-1] | | | | | "e" |
example_string[1:3] | | "p" | "p" | | |
example_string[:3] | "a" | "p" | "p" | | |
example_string[2:] | | | "p" | "l" | "e" |
example_string[:-1] | "a" | "p" | "p" | "l" | |
#### Part 2 ####
files_in_texts = os.listdir("texts")
for text_path in files_in_texts:
    title = text_path[:-4] # <-- extract the title
    with open(f"texts/{text_path}", encoding = "utf-8") as file:
        contents = file.read()
    words = contents.split()
    unique_words = set(words)
    word_count = len(words)
    unique_word_count = len(unique_words)
    ratio = unique_word_count / word_count
    print()
    print(f"There are {word_count} words in {title}.") # <-- include in message
    print(f"There are {unique_word_count} different words in {title}.") # <-- include in message
    print(f"The unique word ratio is {unique_word_count / word_count}")
There are 21428 words in Macbeth.
There are 6207 different words in Macbeth.
The unique word ratio is 0.2896677244726526
There are 130410 words in Pride_and_Prejudice.
There are 14702 different words in Pride_and_Prejudice.
The unique word ratio is 0.11273675331646346
There are 114125 words in The_Adventures_of_Huckleberry_Finn.
There are 14307 different words in The_Adventures_of_Huckleberry_Finn.
The unique word ratio is 0.12536254107338446
There are 464023 words in The_Count_of_Monte_Cristo.
There are 40030 different words in The_Count_of_Monte_Cristo.
The unique word ratio is 0.0862672755445312
There are 51257 words in The_Great_Gatsby.
There are 10206 different words in The_Great_Gatsby.
The unique word ratio is 0.19911426731958562
Activity 2
If you open one of the files, you’ll notice that there is front and end matter which isn’t from the original texts. Let’s remove them and save the cleaned texts. To do so, we’ll need to use two skills:
- Slicing
- Writing to files
Part 1: Clean the texts
To remove the front/end matter, notice that the original texts all begin after the string *** START OF THE PROJECT GUTENBERG EBOOK and end before *** END OF THE PROJECT GUTENBERG EBOOK.
To clean the text,
- Use the function contents.find(...) to find the index corresponding to each of the keys,
start_index = contents.find(...)
end_index = contents.find(...)
- Slice the text between those two indices and save it in a variable.
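As a rough sketch of these two steps, the example below uses a made-up one-line "book" with shortened, hypothetical markers rather than the real Project Gutenberg ones.

```python
# A made-up miniature "book" with hypothetical, shortened markers
contents = "header *** START *** The actual story. *** END *** footer"

start_message = "*** START ***"
end_message = "*** END ***"

# find() returns the index where the match begins, or -1 if there is no match
start_index = contents.find(start_message)
end_index = contents.find(end_message)

# Slicing from start_index would keep the marker itself,
# so skip past it by adding its length
clean_text = contents[start_index + len(start_message):end_index]

print(repr(clean_text))
```

Because find() returns -1 when the key is missing, it is worth checking the indices before slicing if you are not sure every file contains both markers.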
Part 2: Write the strings to files
Writing strings to a file is similar to reading. We’ll start by creating a with ... as ... block pointing to the new file path:
with open(f"{title}_clean.txt", "w", encoding = "utf-8") as file:
    ...
Then use file.write(...) to write the cleaned string to the new file.
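A minimal sketch of writing and then reading back a file is given below. The title and text are placeholders, and the file is written to a temporary folder (an assumption made here so that the example leaves nothing behind).

```python
import os
import tempfile

title = "Example_Book"             # hypothetical title
clean_text = "The actual story."   # hypothetical cleaned text

# Work in a temporary folder that is removed automatically afterwards
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, f"{title}_clean.txt")

    # "w" opens the file for writing: it is created, or overwritten if it exists
    with open(path, "w", encoding = "utf-8") as file:
        file.write(clean_text)

    # Read the file back to confirm what was written
    with open(path, encoding = "utf-8") as file:
        read_back = file.read()

print(read_back)
```

Leaving out "w" opens the file for reading, which is what the earlier file.read() examples rely on.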
The following code is a possible solution to the problem.
#### Part 2 ####
files_in_texts = os.listdir("texts")

for text_path in files_in_texts:
    # ...
    # ...
    # ...

    # Remove front/end matter and save clean files
    start_message = "*** START OF THE PROJECT GUTENBERG EBOOK"
    end_message = "*** END OF THE PROJECT GUTENBERG EBOOK"

    # Add the marker's length so the clean text begins after the marker
    start = contents.find(start_message) + len(start_message)
    end = contents.find(end_message)

    clean_text = contents[start:end]

    with open(f"{title}_clean.txt", "w", encoding = "utf-8") as file:
        file.write(clean_text)
Part 3 (extension): Making it modular
In this final (optional) part we take a look at making our code modular. In Python, you can do this in two ways:
- Within the script, with functions
- Outside the script, with modules.
Before beginning, just check that your code is up to date:
import os

print("Running the Python Toolkit Program")
print(f"The current working directory is {os.getcwd()}")

# Check that the folder exists in our working directory
if os.path.exists("texts"):
    print("The folder /texts/ exists.")
else:
    raise FileNotFoundError("Cannot find the folder /texts/.")

files_in_texts = os.listdir("texts")

# Check that there are five files within texts
if len(files_in_texts) != 5:
    raise FileNotFoundError("Incorrect number of files in /texts/.")

#### Part 2 ####
files_in_texts = os.listdir("texts")

for text_path in files_in_texts:
    title = text_path[:-4]

    with open(f"texts/{text_path}", encoding = "utf-8") as file:
        contents = file.read()

    words = contents.split()
    unique_words = set(words)

    word_count = len(words)
    unique_word_count = len(unique_words)
    ratio = unique_word_count / word_count

    print()
    print(f"There are {word_count} words in {title}.")
    print(f"There are {unique_word_count} different words in {title}.")
    print(f"The unique word ratio is {unique_word_count / word_count}")

    # Remove front/end matter and save clean files
    start_message = "*** START OF THE PROJECT GUTENBERG EBOOK"
    end_message = "*** END OF THE PROJECT GUTENBERG EBOOK"

    start = contents.find(start_message) + len(start_message)
    end = contents.find(end_message)

    clean_text = contents[start:end]

    with open(f"{title}_clean.txt", "w", encoding = "utf-8") as file:
        file.write(clean_text)
Running the Python Toolkit Program
The current working directory is C:\Users\uqcwest5\OneDrive - The University of Queensland\Tech Training\technology-training\Python\5-python_toolkit
The folder /texts/ exists.
There are 21428 words in Macbeth.
There are 6207 different words in Macbeth.
The unique word ratio is 0.2896677244726526
There are 130410 words in Pride_and_Prejudice.
There are 14702 different words in Pride_and_Prejudice.
The unique word ratio is 0.11273675331646346
There are 114125 words in The_Adventures_of_Huckleberry_Finn.
There are 14307 different words in The_Adventures_of_Huckleberry_Finn.
The unique word ratio is 0.12536254107338446
There are 464023 words in The_Count_of_Monte_Cristo.
There are 40030 different words in The_Count_of_Monte_Cristo.
The unique word ratio is 0.0862672755445312
There are 51257 words in The_Great_Gatsby.
There are 10206 different words in The_Great_Gatsby.
The unique word ratio is 0.19911426731958562
Modularity within the script: functions
We’ll start with functions. Functions are like a script within a script - a section of code that runs when you call its name. They come in two parts:
- The function call, which runs the code (e.g. print(), len(), etc.)
- The function definition, which defines that code
Every time we’ve used a function, like print(...), len(...), etc., we have performed function calls. However, we need to write new definitions to make our own functions.
Let’s make a new function now, read_book(...), which reads a text file and returns the contents as a string, like we do in the loop.
We’ll start by defining our function. Do this at the top of your script, just after the import statements. Function definitions have the following syntax:
def <function_name>(<input1>, <input2>, ...):
    code
    code
    code
    return <output>
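As a concrete instance of this syntax, here is a small sketch with a hypothetical function, count_words, which is not part of the workshop program. It shows the definition and the call side by side.

```python
# Definition: nothing runs yet, the code is only stored under the name count_words
def count_words(text):
    words = text.split()
    return len(words)

# Call: runs the stored code with text = "to be or not to be"
number_of_words = count_words("to be or not to be")

print(number_of_words)
```

The value after return is what the call hands back, which is why it can be assigned to a variable.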
Let’s set up read_book, without including any code yet, with a single input variable path:
import os

def read_book(path):
    return
The function signature, def <signature>:, forms a key part of the definition. Inside the brackets there are different ways to specify the inputs.
No inputs
Your function doesn’t have to take any inputs. For example,
# Definition:
def no_inputs():
    ...
    return <output>

# Call:
no_inputs()
Compulsory inputs
If you just give the inputs names, they are compulsory: all calls must include them.
# Definition:
def compulsory_inputs(input1, input2):
    ...
    return <output>

# Call:
compulsory_inputs(a, b)
Default / optional inputs
You can specify default values for function inputs, which makes them optional.
# Definition:
def optional_inputs(input1 = "apple", input2 = "banana"):
    ...
    return <output>

# Call:
optional_inputs("cherry") # Will interpret as input1 = "cherry", input2 = "banana"
Positional vs Keyword arguments
Finally, when you call a function, you can either specify the inputs by name (keyword arguments) or let Python match them by position (positional arguments).
def example(input1, input2, input3):
    ...
    return <output>

example("apple", "banana", "cherry")
example("apple", "banana", input3 = "cherry")
example(input1 = "apple", input2 = "banana", input3 = "cherry")
example(input3 = "cherry", input2 = "banana", input1 = "apple")
These are all valid calls, with various differences:
- All positional
- input1 and input2 are positional, while input3 is keyword
- All keyword
- All keyword - the order doesn’t matter for keyword arguments!
Because keyword arguments are unordered, positional arguments must precede them:
# Valid
example("apple", input2 = "banana", input3 = "cherry")

# Invalid - positional argument after keyword argument!
example(input1 = "apple", "banana", "cherry")
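A runnable version of these calls is sketched below. The body of example is a placeholder that simply reports which value landed in which input, and the invalid call is compiled from a string so that the error can be caught rather than stopping the script.

```python
# Placeholder body: report which value ended up in which input
def example(input1, input2, input3):
    return f"{input1}-{input2}-{input3}"

all_positional = example("apple", "banana", "cherry")
mixed = example("apple", "banana", input3 = "cherry")
all_keyword = example(input1 = "apple", input2 = "banana", input3 = "cherry")
reordered = example(input3 = "cherry", input2 = "banana", input1 = "apple")

# Every call fills the inputs identically
print(all_positional == mixed == all_keyword == reordered)

# A positional argument after a keyword argument is a syntax error,
# so Python rejects it before any code runs
try:
    compile('example(input1 = "apple", "banana", "cherry")', "<sketch>", "exec")
    invalid_call_rejected = False
except SyntaxError:
    invalid_call_rejected = True

print(invalid_call_rejected)
```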
Now, let’s include the code that we previously used to read the file. Note that the variable containing the full file path is now path, so we should change the code accordingly.
import os

def read_book(path):
    with open(path, encoding = "utf-8") as file:
        contents = file.read()

    return contents
Variables created within a function are local to it: they are discarded once the function finishes running, so they can’t be accessed by your main code! This is called scope. Only the value you return makes it back to the main code.
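The sketch below illustrates scope with a hypothetical function, make_greeting. The variable greeting exists only while the function runs, so using it afterwards raises a NameError.

```python
def make_greeting(name):
    greeting = f"Hello, {name}!"   # greeting only exists inside the function
    return greeting

# The returned value is the only thing that leaves the function
message = make_greeting("Macbeth")
print(message)

# The local variable greeting is not visible out here
try:
    print(greeting)
    local_is_visible = True
except NameError:
    local_is_visible = False

print(local_is_visible)
```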
Finally, let’s replace code within the loop with a simple function call to our new function. All together,
import os

def read_book(path):
    with open(path, encoding = "utf-8") as file:
        contents = file.read()

    return contents

print("Running the Python Toolkit Program")
print(f"The current working directory is {os.getcwd()}")

# Check that the folder exists in our working directory
if os.path.exists("texts"):
    print("The folder /texts/ exists.")
else:
    raise FileNotFoundError("Cannot find the folder /texts/.")

files_in_texts = os.listdir("texts")

# Check that there are five files within texts
if len(files_in_texts) != 5:
    raise FileNotFoundError("Incorrect number of files in /texts/.")

#### Part 2 ####
for text_path in files_in_texts:
    title = text_path[:-4]

    contents = read_book(f"texts/{text_path}") # <-- Custom function call
    words = contents.split()

    unique_words = set(words)

    word_count = len(words)
    unique_word_count = len(unique_words)
    ratio = unique_word_count / word_count

    print()
    print(f"There are {word_count} words in {title}.")
    print(f"There are {unique_word_count} different words in {title}.")
    print(f"The unique word ratio is {unique_word_count / word_count}")

    # Remove front/end matter and save clean files
    start_message = "*** START OF THE PROJECT GUTENBERG EBOOK"
    end_message = "*** END OF THE PROJECT GUTENBERG EBOOK"

    start = contents.find(start_message) + len(start_message)
    end = contents.find(end_message)

    clean_text = contents[start:end]

    with open(f"{title}_clean.txt", "w", encoding = "utf-8") as file:
        file.write(clean_text)
Running the Python Toolkit Program
The current working directory is C:\Users\uqcwest5\OneDrive - The University of Queensland\Tech Training\technology-training\Python\5-python_toolkit
The folder /texts/ exists.
There are 21428 words in Macbeth.
There are 6207 different words in Macbeth.
The unique word ratio is 0.2896677244726526
There are 130410 words in Pride_and_Prejudice.
There are 14702 different words in Pride_and_Prejudice.
The unique word ratio is 0.11273675331646346
There are 114125 words in The_Adventures_of_Huckleberry_Finn.
There are 14307 different words in The_Adventures_of_Huckleberry_Finn.
The unique word ratio is 0.12536254107338446
There are 464023 words in The_Count_of_Monte_Cristo.
There are 40030 different words in The_Count_of_Monte_Cristo.
The unique word ratio is 0.0862672755445312
There are 51257 words in The_Great_Gatsby.
There are 10206 different words in The_Great_Gatsby.
The unique word ratio is 0.19911426731958562
Modularity beyond the script: modules
What actually happens when you run import ...? Python loads another Python file and makes its contents available in your script’s ‘namespace’ through the module’s name. Basically, you import a bunch of functions (and classes, and other objects…)!
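A small sketch using the built-in os module shows what ends up in the namespace. The from ... import ... form is not used elsewhere in this workshop and is included here only for comparison.

```python
import os                # the whole module, reached through the name os
from os import getcwd    # a single function, added directly to our namespace

# Both names point at the very same function object
same_function = os.getcwd is getcwd
print(same_function)

# The names a module provides can be listed with dir()
has_listdir = "listdir" in dir(os)
print(has_listdir)
```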
We can make our own modules that Python recognises with the import command. The simplest way is just another Python script. Let’s make one to store our new function, so it’s out of the way.
- Create a new script in this folder called reader.py
- Move the function into that file.
The script should look like this:
reader.py
def read_book(path):
    with open(path, encoding = "utf-8") as file:
        contents = file.read()

    return contents
Finally, we should reflect the changes in our original script.
- Replace the old function definition with the command import reader.
- Replace the old function call read_book(...) with the command reader.read_book(...).
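If you would like to try the idea without touching your workshop folder, the sketch below builds a stand-in reader.py (a copy of the read_book function from earlier) and a tiny text file in a temporary folder, then imports the module from there. The temporary folder, the sys.path handling and the file tiny_book.txt are assumptions made only so the example is self-contained; in the workshop, reader.py simply sits next to your main script.

```python
import importlib
import os
import sys
import tempfile

# Source for a stand-in reader.py, written out here so the sketch is self-contained
module_source = '''
def read_book(path):
    with open(path, encoding = "utf-8") as file:
        contents = file.read()
    return contents
'''

with tempfile.TemporaryDirectory() as folder:
    # Create the module file and a small text file for it to read
    with open(os.path.join(folder, "reader.py"), "w", encoding = "utf-8") as file:
        file.write(module_source)

    book_path = os.path.join(folder, "tiny_book.txt")
    with open(book_path, "w", encoding = "utf-8") as file:
        file.write("Call me Ishmael.")

    # Python searches the folders listed in sys.path when importing
    sys.path.insert(0, folder)
    reader = importlib.import_module("reader")   # same effect as: import reader
    sys.path.remove(folder)

    # Functions in the module are reached through the module's name
    contents = reader.read_book(book_path)

print(contents)
```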
The main script should look like this:
toolkit.py
import os
import reader

print("Running the Python Toolkit Program")
print(f"The current working directory is {os.getcwd()}")

# Check that the folder exists in our working directory
if os.path.exists("texts"):
    print("The folder /texts/ exists.")
else:
    raise FileNotFoundError("Cannot find the folder /texts/.")

files_in_texts = os.listdir("texts")

# Check that there are five files within texts
if len(files_in_texts) != 5:
    raise FileNotFoundError("Incorrect number of files in /texts/.")

#### Part 2 ####
files_in_texts = os.listdir("texts")

for text_path in files_in_texts:
    title = text_path[:-4]

    contents = reader.read_book(f"texts/{text_path}")
    words = contents.split()

    unique_words = set(words)

    word_count = len(words)
    unique_word_count = len(unique_words)
    ratio = unique_word_count / word_count

    print()
    print(f"There are {word_count} words in {title}.")
    print(f"There are {unique_word_count} different words in {title}.")
    print(f"The unique word ratio is {unique_word_count / word_count}")

    # Remove front/end matter and save clean files
    start_message = "*** START OF THE PROJECT GUTENBERG EBOOK"
    end_message = "*** END OF THE PROJECT GUTENBERG EBOOK"

    start = contents.find(start_message) + len(start_message)
    end = contents.find(end_message)

    clean_text = contents[start:end]

    with open(f"{title}_clean.txt", "w", encoding = "utf-8") as file:
        file.write(clean_text)
Running the Python Toolkit Program
The current working directory is C:\Users\uqcwest5\OneDrive - The University of Queensland\Tech Training\technology-training\Python\5-python_toolkit
The folder /texts/ exists.
There are 21428 words in Macbeth.
There are 6207 different words in Macbeth.
The unique word ratio is 0.2896677244726526
There are 130410 words in Pride_and_Prejudice.
There are 14702 different words in Pride_and_Prejudice.
The unique word ratio is 0.11273675331646346
There are 114125 words in The_Adventures_of_Huckleberry_Finn.
There are 14307 different words in The_Adventures_of_Huckleberry_Finn.
The unique word ratio is 0.12536254107338446
There are 464023 words in The_Count_of_Monte_Cristo.
There are 40030 different words in The_Count_of_Monte_Cristo.
The unique word ratio is 0.0862672755445312
There are 51257 words in The_Great_Gatsby.
There are 10206 different words in The_Great_Gatsby.
The unique word ratio is 0.19911426731958562
Conclusion and Summary
This is a big workshop, and we’ve covered a lot of content! See the summary table below for details on the topics covered. Each is linked to the notes in the workshop.
If you have any further questions, don’t hesitate to contact us at training@library.uq.edu.au.
Topic | Code | Description |
---|---|---|
The os module | | A built-in module which enables interacting with your operating system. |
f-strings | | Formatted strings, which behave like normal strings except that code within curly brackets {...} is executed. |
Conditionals | | Sections of code which only run if a condition is true. Always start with if. |
Raising exceptions | | A way to manually trigger error messages and stop the program. Replace ... with an error type, e.g. KeyError, ValueError. |
File input/output | | Read and write to files with the with open(...) as file: block. Most files use the "utf-8" encoding. Send the "w" argument to write to a file. |
Loops | | Run sections of code multiple times with a loop. |
Indexing and slicing | | Access individual elements of a string or list by indexing and slicing with square brackets. |
Custom functions and custom modules | | Store sections of code away in functions to run them at a later point. You can store the functions in a separate script and import that script as a module. |