Analysing your own data
In this final session, we look at taking the skills we’ve developed during the intensive and applying them to your own data. We’ll also discuss a few common pitfalls and where to get help in the wild.
Setting up
Spend five minutes locating your data and preparing Spyder
- Locate and structure your own project folder
- You might like to create a data/ folder
- You might like to create a scripts/ folder
- (Optional) Turn it into a Spyder project
- Click Projects \(\rightarrow\) New Project…
- Click 🔘 Existing directory
- Choose top-level folder of your project
- Create a new script for processing, called processing.py (or whatever you’d like)
We’ll spend most of this session as project time, troubleshooting and setting up your own data for analysis. Before we do that, let’s discuss a few common tips and pitfalls, namely:
- Managing paths and the working directory
- Importing different data structures and files
- Dealing with different types of data
Environments and importing data
Perhaps the most frustrating issues are those which prevent you from importing the data at all!
Getting your setup right
Nobody likes to encounter this:

```
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[1], line 1
----> 1 raise FileNotFoundError("PANIC!")

FileNotFoundError: PANIC!
```

To solve it, we need to talk about filesystems.
When running a Python script, there are always three relevant locations:
- Your Python executable
- Your script
- Your working directory
The Python executable runs your script from the working directory.
Why does this matter? Because when you import data, Python resolves paths relative to your working directory, not relative to the script or the executable. This means that there are important questions you need to ask:
- Where is your working directory?
- Where is your data?
Answering the first question is easy: simply run

```python
import os
os.getcwd()
```

```
'/home/uqcwest5/tech_training/training-intensives/python'
```
- When you run Python from the command line, it’s the current location as specified in the terminal
- When you run Python in Spyder, it’s the folder displayed in “Files” and given by the address in the top right.
This address is your working directory. All paths in your scripts are evaluated relative to this location.
Data paths can be absolute or relative, and they can even point to files online.
Absolute paths begin with a root prefix and have folders separated with slashes. They contain all the folders from the root of the filesystem down to the object of interest.
On Windows, absolute paths are prefixed with the drive, e.g. C:\ and folders separated with backslashes \
C:\Users\...\...\data\example.csv
On Unix-like systems, absolute paths are prefixed with a forward slash /, which also separates folders.
/home/user/.../.../data/example.csv
Alternatively, you can start from your ‘user’ directory by prefixing with a tilde ~:
~/.../.../data/example.csv
Websites and web-hosted files can typically be accessed with URLs. Full or ‘absolute’ URLs are prefixed with a protocol (e.g. https://) and a hostname (e.g. www.website.com), with folders then separated by forward slashes
https://www.website.com/.../.../.../data/example.csv
Relative filepaths have no prefix.
On Windows, relative paths are still separated with backslashes
data\example.csv
On Unix-like systems, relative paths are still separated with forward slashes
data/example.csv
It’s possible to have a relative path to a web file; however, as with any relative filepath, you must be running Python from the server itself. The syntax is the same as on Unix-like systems, i.e. folders separated with forward slashes
data/example.csv
| | Pros | Cons |
|---|---|---|
| Absolute | Unambiguous: they point to the same file no matter where Python is run from | Break when the project moves to a different computer or folder |
| Relative | Portable: move the whole project folder and they still work | Depend on the working directory being set correctly |
Once you have your working directory and your filepath, you can now check that any data paths have been specified correctly.
If the path is absolute
You just need to ensure that your working directory is on the same device as the file.
If the path is relative
Go to the working directory and trace the path to ensure it’s correct. The path begins inside the working directory. A few oddities:
- `..` indicates the parent folder (i.e. go up a folder)
- `.` indicates the current folder
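A quick way to trace a relative path from within Python itself (a minimal sketch using the standard library; the path is just an example):

```python
import os

print(os.getcwd())                         # where am I?
print(os.path.exists("data/example.csv"))  # can Python see the file from here?
```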
If you’re using a project, you shouldn’t need to change working directories. Instead, try modifying the path first.
If you need to change working directories, you can do this with Python code,

```python
os.chdir("path/to/new/dir")
```

or in Spyder by changing the address in the top-right. You should check that it’s worked by running os.getcwd().
Watch out Windows users
The backslashes in Windows paths conflict with Python’s escape character in strings (also a backslash). To fix this, you can
- Replace backslashes with forward slashes in Python
- Prefix the string with R (making it a “raw” string) so the backslashes are not treated as escapes
- Escape the backslashes with an extra backslash
For example, the following Windows path
C:\Users\me\data\secrets.csv
could be imported as any of the following
```python
# Prefix with R"..."
pd.read_csv(R"C:\Users\me\data\secrets.csv")

# Replace with forward slashes
pd.read_csv("C:/Users/me/data/secrets.csv")

# Escape the backslashes
pd.read_csv("C:\\Users\\me\\data\\secrets.csv")
```

We recommend using the R"..." option where possible, as it’s the least work.
If you’ve fixed the filepath and you’re on top of the Windows peculiarities, then check the following errors for more troubleshooting.
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes ...
You’ve probably used a path with backslashes and not adjusted it for Python. See “Watch out Windows users” above.
Python throws this error when a string contains "\u..." or "\U..." that isn’t a valid Unicode escape (\Users is not).
FileNotFoundError: [Errno 2] No such file or directory ... but everything is correct???
Have you used a path with backslashes on a non-Windows machine? If so, replace them with forward slashes.
Final thoughts
- Avoid spaces in file and folder names
- Use relative filepaths where possible
- Get familiar with your working directory
Importing your data correctly
Once you’ve got the path working, the next challenge is importing the data correctly. Unlike our data, yours might have multiple header rows, missing data, a different organisation, or even a different file type.
We’ll look at importing .csv files here, but the same applies to other file types.
The documentation for pd.read_csv explains (as of v2.3) 49 different parameters that the function supports. These include

| Parameter | Description |
|---|---|
| `filepath_or_buffer` | The path (doesn’t have to actually be .csv!) |
| `sep` | The separator (use `"\t"` for tab-separated values) |
| `header` | The row number(s) containing headers (pass `header=None` for no header) |
| `skiprows` | Skip certain rows (by number, from 0) |
| `na_values` | A list of values to interpret as empty cells |
| `parse_dates` | A list of column names to interpret as datetimes (other options as well, see docs) |
There are plenty more! Check these out if your data isn’t importing correctly.
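For example, a hypothetical messy file might need several of these at once. A sketch (the file name, skipped rows and missing-value markers below are made up for illustration):

```python
import pandas as pd

df = pd.read_csv(
    "data/example.tsv",        # hypothetical tab-separated file
    sep="\t",                  # tab separator
    skiprows=[0, 1],           # skip two junk rows at the top
    header=0,                  # headers in the first remaining row
    na_values=["-999", "NA"],  # treat these values as empty cells
    parse_dates=["date"],      # parse a (hypothetical) "date" column
)
```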
Additionally, you can import other types of files. See the IO tools user guide for other file types, like .xlsx (Excel workbooks), JSON, etc.
Excel files (.xlsx)
Reading .xlsx files can be complicated, but for simple reading, just use

```python
pd.read_excel(...)
```

with sheet_name="..." if you’d like to specify the sheet.
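For example (a sketch; the file and sheet names are made up, and reading .xlsx files requires an Excel engine like openpyxl to be installed):

```python
import pandas as pd

# Read one sheet from a hypothetical workbook
df = pd.read_excel("data/example.xlsx", sheet_name="Sheet1")
```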
Dealing with different and dodgy data
Our data has been set up to be a bit of a challenge, and a bit of a help. Your data might be organised differently, and might need more work! You might also need to perform different tasks.
We’ll look at a few common tips to get you going, but before you start, the best advice is to get out a pencil and paper and draw. Mock up your data, figure out what you want and write down the steps that you would have to do by hand. Then you’ll have a good grasp of what you want, and whether the code is working.
A few resources that you should consult:
- The official pandas User Guide is comprehensive and will likely provide tutorial support for your use cases. Start here.
- The pandas cheatsheet is a fantastic resource which outlines the common pandas tasks along with diagrams of the tabular operations. Consult this often!
- The pandas API reference is a succinct reference of each pandas function. Scroll through the
Seriesoptions for column-based functions to see if any sound appropriate, or learn how to use one you’ve discovered.
See below for a summary of data cleaning tips you might need to apply to your data, assuming your data is stored in df and "col_name" is a column name
Reshaping your data
For simple (or complex) reshaping tasks, like filtering, subsetting and adding new columns, refresh yourself with our second session of this intensive: 2 - Data Processing. Alternatively, consult the relevant pandas user guides.
Cleaning up inconsistencies
| Function | Description |
|---|---|
| `df.replace("old", "new")` | Replace all values `"old"` in a dataframe with `"new"` |
| `df["col_name"].replace("old", "new")` | Replace all values `"old"` in a column with `"new"` |
| `df.rename(columns={"old_col": "new_col"})` | Rename column `"old_col"` to `"new_col"` |
| `df.fillna("new_na")` | Replace all NA/empty values in a dataframe with `"new_na"` |
| `df["col_name"].fillna("new_na")` | Replace all NA/empty values in a column with `"new_na"` |
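A couple of these in action on a made-up dataframe (a sketch, not your data):

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "Red", None], "score": [1, 2, None]})

df["colour"] = df["colour"].replace("Red", "red")  # harmonise inconsistent values
df["colour"] = df["colour"].fillna("unknown")      # fill empty cells in one column
df = df.rename(columns={"colour": "color"})        # rename a column
```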
Dealing with different types of data
Each series (column) in pandas is stored as a particular dtype (data type). Common types include
- `object` for generic non-numeric data. Each cell can contain any Python object, almost always strings but occasionally lists. Mixed types will default to `object`.
- `int64` for integers
- `float64` for decimals (including scientific notation)
- `bool` for booleans (`True` or `False`)
- `datetime64[ns]` for timestamps
- `category` for categorical data
For applying methods specific to textual, temporal or categorical data, you must first ensure the columns match the dtype you expect. Common mishaps include
- Timestamps or categorical data stored as `object` dtypes
- Numeric categorical data stored as `int64` or `float64` dtypes
By default, all timestamps and categorical data are read in as objects. You can fix this in pd.read_csv() at the import stage (e.g. with the parse_dates or dtype parameters), otherwise use .astype() to change a column’s type (or pd.to_datetime in the case of timestamps):

```python
df["A"] = pd.to_datetime(df["A"])     # time series
df["A"] = df["A"].astype("category")  # categorical
df["A"] = df["A"].astype("object")    # object (e.g. string data)
```

We’ll look at these more in the respective sections below.
Pandas v3.0 changes the (default) behaviour for string columns. They will no longer be object dtypes, but str dtypes (by default). Most features will work the same, but it may cause some breaking changes: see the migration user guide for details.
As of December 2025, pandas 3.0 is still in the final stages of development.
In general, pandas allows you to apply type-specific methods with specific accessors:
df["col_name"].strcontains string methodsdf["col_name"].dtcontains datetime methodsdf["col_name"].catcontains categorical methods
We’ll go through a few useful ones here. Note that
- `.str` methods only work on `object` dtypes
- `.dt` methods only work on `datetime64[ns]` dtypes
- `.cat` methods only work on `category` dtypes
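If you’re not sure which dtypes you have, check before reaching for an accessor (a quick sketch, assuming your dataframe is df):

```python
df.dtypes             # dtype of every column
df["col_name"].dtype  # dtype of a single column
```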
Textual (.str methods)
When performing text analysis, or for simple string methods, use the methods in df["col_name"].str.
You should consult the official Working with text data user guide for details about string operations. We’ve included a brief summary of some useful functions here.
For each example below, append .method to df["col_name"]. For example, the method .str.replace should be used as df["col_name"].str.replace(). Each function is applied to every row in “col_name”
| Method | Description |
|---|---|
| `.str.cat(...)` | Concatenate one or more strings to the end |
| `.str.split(...)` | Split each string into a list of strings based on a common delimiter. Inverse of `.str.join()` |
| `.str.join(...)` | Join a list of strings into a single string with a delimiter. Inverse of `.str.split()` |
| `.str.contains(...)` | Return `True`/`False` if the string contains `...`. Useful for filtering: `df[df["col_name"].str.contains(...)]` will subset for those rows in “col_name” which contain “…” |
| `.str.slice(...)` | Return a slice of each string |
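A short sketch of a few of these on a made-up column:

```python
import pandas as pd

names = pd.Series(["Ada Lovelace", "Grace Hopper"])

names.str.contains("Ada")           # True, False
names.str.split(" ")                # ["Ada", "Lovelace"], ["Grace", "Hopper"]
names.str.split(" ").str.join("-")  # "Ada-Lovelace", "Grace-Hopper"
names.str.slice(0, 3)               # "Ada", "Gra"
```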
Time series (.dt methods)
When performing time series analysis, or for simple temporal methods, use the methods in df["col_name"].dt.
Note that the column must be a datetime64[ns] dtype (you can change the [ns] to a different precision if you’d prefer, like millisecond [ms] or second [s]). To convert column df["col_name"], use pd.to_datetime:
df["col_name"] = pd.to_datetime(df["col_name"])`Include format="..." to specify the format of your timestamps, according to the standard Python datetime syntax. For example, format="%d/%m/%y" matches two-digit DD/MM/YY, e.g. 03/02/26.
If the timestamp is spread across multiple columns, you can use pd.to_datetime on those columns (or the whole dataframe if there’s nothing else) to convert them into a single time series. Take for example this dataframe, stored as a variable called temporal:
```
   day  month  year
0   21      3  1987
1    7      8  2000
2   15     12  2026
```
This command would convert the three series to a single datetime series:
```python
pd.to_datetime(temporal[["day", "month", "year"]])
```

```
0   1987-03-21
1   2000-08-07
2   2026-12-15
dtype: datetime64[ns]
```
You should consult the official Time series / date functionality user guide for details about temporal operations. We’ve included a brief summary of some useful functions here.
For each example below, append .method to df["col_name"]. For example, the method .dt.date should be used as df["col_name"].dt.date. Each function is applied to every row in “col_name”
| Method | Description |
|---|---|
| `.dt.date` | Return the date part of the timestamp (this is technically an attribute, meaning it’s not a function and should not have brackets `()`) |
| `.dt.weekday` | Return the day of the week corresponding to the date (also an attribute, so no brackets; Monday = 0) |
| `.dt.normalize()` | Convert all times to midnight |
| `.dt.strftime(...)` | Convert a timestamp to a string. Send the format into `...` according to the standard Python datetime syntax |
| `.resample(...)` | For time-based grouping, e.g. average/max/median \(x\) per day/hour/minute. Note that this is not a `.dt` method: it applies directly to the column, `df["col_name"].resample(...)` |
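A short sketch of a few of these on a made-up column:

```python
import pandas as pd

times = pd.Series(pd.to_datetime(["2026-02-03 09:30", "2026-02-04 14:00"]))

times.dt.date                  # 2026-02-03, 2026-02-04
times.dt.weekday               # 1, 2 (Monday = 0)
times.dt.normalize()           # both timestamps reset to midnight
times.dt.strftime("%d/%m/%y")  # "03/02/26", "04/02/26"
```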
Filtering
You can filter on particular parts of the timestamp as normal, because they’re just numbers:

```python
df[df["col_name"].dt.year == 2026]
```

However, you might want to filter from or before a particular timestamp. To do this, use a pd.Timestamp in the condition, e.g. for timestamps after 3rd Feb 2026,
df[df["col_name"] > pd.Timestamp(2026, 2, 3)]Categorical (.cat methods)
When analysing categorical data, you can use the methods in df["col_name"].cat. These are particularly useful for data with ordered levels.
To make a column the "category" type, use .astype("category"):

```python
df["col_name"] = df["col_name"].astype("category")
```

You should consult the official Categorical data user guide for details about categorical operations. We’ve included a brief summary of some useful functions here.
For most examples below, append .method to df["col_name"]; for example, the method .cat.categories should be used as df["col_name"].cat.categories. Each function is applied to every row in “col_name”. The exception is pd.cut, which is called as a function rather than a method.
| Method | Description |
|---|---|
| `pd.cut(...)` | Group numeric data into discrete bins. Called as a function on a numeric column, e.g. `pd.cut(df["col_name"], bins)`; returns a category column. |
| `.cat.categories` | Return the categories in the column. |
| `.cat.reorder_categories(..., ordered=True)` | Change the category order. Exclude `ordered=True` if the data is not ordered. |
| `.sort_values(...)` | Sort the column (but not the rest of the dataframe) based on the category order. Will fail if `.cat.ordered == False`, and will work for non-category columns (sorting alphabetically for strings and numerically for numbers). |
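A short sketch with made-up categories and bins:

```python
import pandas as pd

sizes = pd.Series(["small", "large", "medium"]).astype("category")
sizes = sizes.cat.reorder_categories(["small", "medium", "large"], ordered=True)
sizes.sort_values()  # small, medium, large

# pd.cut bins numeric data into categories (the bin edges are made up)
ages = pd.Series([5, 23, 67])
pd.cut(ages, bins=[0, 18, 65, 100], labels=["child", "adult", "senior"])
```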
Geospatial
To manage geospatial data, you should use the geopandas package. With geopandas, you create GeoDataFrames, which behave like pandas DataFrames with additional behaviour for geospatial columns. Let us know if you’d like help with this!
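As a taste, a minimal sketch (assuming geopandas is installed; the file path is made up for illustration):

```python
import geopandas as gpd

gdf = gpd.read_file("data/example.shp")  # behaves like a pandas DataFrame
gdf.plot()                               # quick map of the geometry column
```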
What to do when you don’t know what to do
- Consult the documentation. If your function isn’t behaving, go to the specific page for that function. Otherwise, consult a relevant user guide.
- Search the web for your issue to find practical and canonical approaches to typical data cleaning tasks. StackOverflow is your friend.
- Ask us while we’re here! And once we’re gone, shoot us an email at training@library.uq.edu.au
- Lots of people also ask AI. There are pros and cons, but beware: you’ll get a solution that probably works, but if you don’t understand why, double-check that the data matches what you want (and try to understand what the code has done!).
Now give it a go!
For the rest of this session, we’d love to help you get your data set up and working on your end. For that reason, we’ve dedicated this time to troubleshooting any issues that arise together. Good luck!