Analysing your own data

In this final session, we look at taking the skills we’ve developed during the intensive and applying them to your own data. We’ll also discuss a few common pitfalls and where to get help in the wild.

Setting up

Spend five minutes locating your data and preparing Spyder

  1. Locate and structure your own project folder
    • You might like to create a data/ folder
    • You might like to create a scripts/ folder
  2. (Optional) Turn it into a Spyder project
    1. Click Projects → New Project…
    2. Select Existing directory
    3. Choose the top-level folder of your project
  3. Create a new script for processing, called processing.py (or whatever you’d like).

We’ll spend most of this session as project time, troubleshooting and setting up your own data for analysis. Before we do that, let’s discuss a few common tips and pitfalls. Namely,

  • Managing paths and the working directory
  • Importing different data structures and files
  • Dealing with different types of data

Environments and importing data

Perhaps the most frustrating issues are those which prevent you from importing the data at all!

Getting your setup right

Nobody likes to encounter this:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[1], line 1
----> 1 raise FileNotFoundError("PANIC!")

FileNotFoundError: PANIC!

To solve it, we need to talk about filesystems.

When running a Python script, there are always three relevant locations:

  • Your Python executable
  • Your script
  • Your working directory

The Python executable runs your script from the working directory

Why does this matter? Because when you import data, Python resolves paths relative to your working directory, not relative to your script or the executable. This means that there are two important questions you need to ask:

  1. Where is your working directory?
  2. Where is your data?

Answering the first question is easy: simply run

import os
os.getcwd()
'/home/uqcwest5/tech_training/training-intensives/python'
Note: Alternatively,
  • When you run Python from the command line, it’s the current location as specified in the terminal
  • When you run Python in Spyder, it’s the folder displayed in “Files” and given by the address in the top right.

This address is your working directory. All paths in your scripts are evaluated relative to this location.

Your data path can be absolute or relative, and it can even be online.

Absolute paths begin with a root prefix and have folders separated with slashes. They contain all the folders from the root of the filesystem down to the object of interest.

On Windows, absolute paths are prefixed with the drive, e.g. C:\, with folders separated by backslashes \

C:\Users\...\...\data\example.csv

On Unix-like systems, absolute paths are prefixed with a forward slash /, which also separates folders.

/home/user/.../.../data/example.csv

Alternatively, you can start from your home directory by prefixing with a tilde ~:

~/.../.../data/example.csv

Websites and web-hosted files can typically be accessed with URLs. Full or ‘absolute’ URLs are prefixed with a protocol (e.g. https://) and a hostname (e.g. www.website.com), with folders then separated by forward slashes

https://www.website.com/.../.../.../data/example.csv

Relative filepaths have no prefix.

On Windows, relative paths are still separated with backslashes

data\example.csv

On Unix-like systems, relative paths are still separated with forward slashes

data/example.csv

It’s possible to use a relative path for a web file; however, since the path is relative, you must be running Python on the server that hosts the file.

The syntax is the same as on Unix-like systems, i.e. folders separated with forward slashes

data/example.csv
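Whichever style you use, Python’s built-in pathlib module can assemble paths without you worrying about the slash direction. A minimal sketch (the data/example.csv path is just illustrative):

from pathlib import Path

# pathlib inserts the correct separator for your operating system
data_file = Path("data") / "example.csv"
print(data_file)            # data/example.csv (data\example.csv on Windows)
print(data_file.resolve())  # the corresponding absolute path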
Note: Should I use absolute or relative paths?

Absolute paths
  Pros:
    • Work for any working directory (on the same device)
  Cons:
    • Only valid on one device
    • Can get long
    • Can contain irrelevant information

Relative paths
  Pros:
    • Work on any device with the same project structure
    • Only contain project-specific information
    • Can be shorter
  Cons:
    • Working directory must be set up correctly
    • Can become confusing with many parent folders (e.g. ../../../)

Once you have your working directory and your filepath, you can now check that any data paths have been specified correctly.

If the path is absolute

You just need to ensure that your working directory is on the same device as the file.

If the path is relative

Go to the working directory and trace the path to ensure it’s correct. The path begins inside the working directory. A few oddities:

  • .. indicates go up a folder
  • . indicates the current folder
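A quick way to check is to ask Python directly. Here’s a small sketch (the path is a stand-in for your own) that shows where a relative path will actually resolve to:

import os

path = "data/example.csv"      # hypothetical relative path
print(os.getcwd())             # the working directory it's evaluated from
print(os.path.abspath(path))   # where Python will actually look
print(os.path.exists(path))    # True only if the file is really there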
Note: Changing the working directory

If you’re using a project, you shouldn’t need to change working directories. Instead, try modifying the path first.

If you need to change working directories, you can do this with Python code,

os.chdir("path/to/new/dir")

Or in Spyder by changing the address in the top-right.

You should check that it’s worked by running os.getcwd().

Warning: Watch out, Windows users

The backslashes in Windows paths conflict with Python’s escape character in strings (also a backslash). To fix this, you can

  • Replace the backslashes with forward slashes
  • Prefix the string with R (a ‘raw’ string), so the backslashes are taken literally
  • Escape each backslash with an extra backslash

For example, the following Windows path

C:\Users\me\data\secrets.csv

could be imported as any of the following

# Prefix with R (a raw string)
pd.read_csv(R"C:\Users\me\data\secrets.csv")

# Replace with forwardslashes
pd.read_csv("C:/Users/me/data/secrets.csv")

# Escape the backslashes
pd.read_csv("C:\\Users\\me\\data\\secrets.csv")

We recommend using the R"..." option where possible, as it’s the least work.

If you’ve fixed the filepath and you’re on top of the Windows peculiarities, then check the following common errors for more troubleshooting.

A SyntaxError mentioning ‘unicodeescape’: you’ve probably used a path with backslashes and not adjusted it for Python. See “Watch out, Windows users” above. Python throws this error when it runs "\u..." or "\U..." (unless "..." is a valid unicode code; \Users is not).

A FileNotFoundError despite a correct-looking path: have you used a path with backslashes on a non-Windows machine? If so, replace them with forward slashes.

Final thoughts

  • Avoid spaces in folder and file names
  • Use relative filepaths where possible
  • Get familiar with your working directory

Importing your data correctly

Once you’ve got the path working, the next challenge will be importing the data correctly. Unlike our data, yours might have multiple header rows or missing data, be organised differently, or even be a different file type.

We’ll look at importing .csv files here, but the same applies to other file types.

The documentation for pd.read_csv explains (as of v2.3) 49 different parameters that the function supports. These include

Parameter            Description
filepath_or_buffer   The path (doesn’t actually have to be a .csv!)
sep                  The separator (use "\t" for tab-separated values)
header               The row number(s) containing headers (pass header=None for no header)
skiprows             Skip certain rows (numbered from 0)
na_values            A list of values to interpret as empty cells
parse_dates          A list of column names to interpret as datetimes (other options as well; see docs)

There are plenty more! Check these out if your data isn’t importing correctly.
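To give a feel for how these combine, here’s a sketch importing a hypothetical tab-separated file with headers on the second row, sentinel missing values and a date column (the filename and column name are placeholders for your own):

import pandas as pd

df = pd.read_csv(
    "data/messy.tsv",          # hypothetical path
    sep="\t",                  # tab-separated values
    header=1,                  # headers on row 1 (counting from 0)
    na_values=["NA", "-999"],  # treat these values as empty cells
    parse_dates=["date"],      # interpret this column as datetimes
)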

Additionally, you can import other types of files. See the IO tools user guide for other file types, like .xlsx (Excel workbooks), JSON, etc.

Tip: MS Excel files (.xlsx)

Reading .xlsx files can be complicated, but for simple reading, just use

pd.read_excel(...)

with sheet_name="..." if you’d like to specify the sheet.
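For example, assuming a hypothetical workbook with a sheet called “Results”:

import pandas as pd

# Reading .xlsx files requires the openpyxl package to be installed
df = pd.read_excel("data/experiment.xlsx", sheet_name="Results")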

Dealing with different and dodgy data

Our data has been set up to be a bit of a challenge, and a bit of a help. Your data might be organised differently, and might need more work! You might also need to perform different tasks.

We’ll look at a few common tips to get you going, but before you start, the best advice is to get out a pencil and paper and draw. Mock up your data, figure out what you want and write down the steps that you would have to do by hand. Then you’ll have a good grasp of what you want, and whether the code is working.

A few resources that you should consult:

  • The official pandas User Guide is comprehensive and will likely provide tutorial support for your use cases. Start here.
  • The pandas cheatsheet is a fantastic resource which outlines the common pandas tasks along with diagrams of the tabular operations. Consult this often!
  • The pandas API reference is a succinct reference of each pandas function. Scroll through the Series options for column-based functions to see if any sound appropriate, or learn how to use one you’ve discovered.

See below for a summary of data-cleaning tips you might need to apply to your data, assuming your data is stored in df and "col_name" is a column name.

Reshaping your data

For simple (or complex) reshaping tasks, like filtering, subsetting and adding new columns, refresh yourself with our second session of this intensive: 2 - Data Processing. Alternatively, consult the relevant pandas user guides, such as Reshaping and pivot tables or Merge, join, concatenate and compare.

Cleaning up inconsistencies

Function                                    Description
df.replace("old", "new")                    Replace all values "old" in a dataframe with "new"
df["col_name"].replace("old", "new")        Replace all values "old" in a column with "new"
df.rename(columns={"old_col": "new_col"})   Rename column "old_col" to "new_col"
df.fillna("new_na")                         Replace all NA/empty values in a dataframe with "new_na"
df["col_name"].fillna("new_na")             Replace all NA/empty values in a column with "new_na"

Dealing with different types of data

Each series (column) in pandas is stored as a particular dtype (data type). Common types include

  • object for generic non-numeric data. Each cell can contain any Python object, almost always strings but occasionally lists. Mixed types will default to object.
  • int64 for integers.
  • float64 for decimals (including scientific notation).
  • bool for booleans (True or False)
  • datetime64[ns] for timestamps
  • category for categorical data

For applying methods specific to textual, temporal or categorical data, you must first ensure the columns match the dtype you expect. Common mishaps include

  • Timestamps or categorical data stored as object dtypes
  • Numeric categorical data stored as int64 or float64 dtypes
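Before converting anything, it’s worth checking what pandas actually inferred. With a small made-up dataframe:

import pandas as pd

df = pd.DataFrame({"date": ["2026-02-03"], "group": ["A"], "value": [1.5]})
print(df.dtypes)  # "date" and "group" arrive as object, "value" as float64
df.info()         # dtypes plus non-null counts for every column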

By default, all timestamps and categorical data are read in as object columns. You can fix this at the import stage (e.g. via pd.read_csv()’s dtype and parse_dates parameters), or afterwards with .astype() to change a column’s type (or pd.to_datetime() in the case of timestamps):

df["A"] = pd.to_datetime(df["A"])       # <-- Time series
df["A"] = df["A"].astype("category")    # <-- Categorical
df["A"] = df["A"].astype("object")      # <-- Object (e.g. string data)

We’ll look at these more in the respective sections below.

Note: String changes in pandas v3.0

Pandas v3.0 changes the (default) behaviour for string columns. They will no longer be object dtypes, but str dtypes (by default). Most features will work the same, but it may cause some breaking changes: see the migration user guide for details.

As of December 2025, pandas 3.0 is still in the final stages of development.

In general, pandas allows you to apply type-specific methods with specific accessors:

  • df["col_name"].str contains string methods
  • df["col_name"].dt contains datetime methods
  • df["col_name"].cat contains categorical methods

We’ll go through a few useful ones here. Note that

  • .str methods only work on object dtypes
  • .dt methods only work on datetime64[ns] dtypes
  • .cat methods only work on category types.

Textual (.str methods)

When performing text analysis, or for simple string methods, use the methods in df["col_name"].str.

Tip: Consult the user guide

You should consult the official Working with text data user guide for details about string operations. We’ve included a brief summary of some useful functions here.

For each example below, append the method to df["col_name"]. For example, the method .str.replace should be used as df["col_name"].str.replace(). Each method is applied to every row in "col_name".

Method              Description
.str.cat(...)       Concatenate one or more strings to the end
.str.split(...)     Split each string into a list of strings on a common delimiter. Inverse of .str.join()
.str.join(...)      Join a list of strings into a single string (inserting an optional delimiter). Inverse of .str.split()
.str.contains(...)  Return True/False if the string contains "...". Useful for filtering: df[df["col_name"].str.contains(...)] subsets the rows whose "col_name" contains "...".
.str.slice(...)     Return a slice of each string
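As a small sketch of two common tasks (with a made-up column):

import pandas as pd

df = pd.DataFrame({"sample": ["site-A 2024", "site-B 2025", "site-A 2025"]})

site_a = df[df["sample"].str.contains("site-A")]  # filter rows mentioning site-A
parts = df["sample"].str.split(" ")               # split each string into a list
print(site_a)
print(parts)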

Time series (.dt methods)

When performing time series analysis, or for simple temporal methods, use the methods in df["col_name"].dt.

Note that the column must be a datetime64[ns] dtype (you could change the [ns] to a different precision if you’d prefer, like millisecond [ms] or second [s]). To convert column df["col_name"], use pd.to_datetime:

df["col_name"] = pd.to_datetime(df["col_name"])`

Include format="..." to specify the format of your timestamps, according to the standard Python datetime syntax. For example, format="%d/%m/%y" matches two-digit DD/MM/YY, e.g. 03/02/26.
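For instance, a small sketch parsing made-up DD/MM/YY strings (without format=, pandas has to guess, and day-first dates are a classic way for it to guess wrong):

import pandas as pd

dates = pd.Series(["03/02/26", "15/07/26"])
parsed = pd.to_datetime(dates, format="%d/%m/%y")
print(parsed)  # 2026-02-03 and 2026-07-15, dtype datetime64[ns]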

If the timestamp is spread across multiple columns, you can use pd.to_datetime on those columns (or the whole dataframe if there’s nothing else) to convert them into a single time series. Take for example this dataframe, stored as a variable called temporal:

   day  month  year
0   21      3  1987
1    7      8  2000
2   15     12  2026

This command would convert the three series to a single datetime series:

pd.to_datetime(temporal[["day", "month", "year"]])
0   1987-03-21
1   2000-08-07
2   2026-12-15
dtype: datetime64[ns]
Tip: Consult the user guide

You should consult the official Time series / date functionality user guide for details about temporal operations. We’ve included a brief summary of some useful functions here.

For each example below, append the method to df["col_name"]. For example, .dt.date should be used as df["col_name"].dt.date. Each method is applied to every row in "col_name".

Method             Description
.dt.date           Return the date part of the timestamp (technically an attribute, not a method, so no brackets ())
.dt.weekday        Return the day of the week as a number (Monday = 0; like .dt.date, this is an attribute, so no brackets)
.dt.normalize()    Convert all times to midnight
.dt.strftime(...)  Convert a timestamp to a string. Pass the format as ... according to the standard Python datetime syntax
.resample(...)     For time-based grouping, e.g. average/max/median x per day/hour/minute. Note that this is not a .dt method: it applies directly to the column (which needs a datetime index), df["col_name"].resample(...)
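A brief sketch of these on a made-up time series (note that .resample() needs a datetime index, hence the set_index):

import pandas as pd

df = pd.DataFrame({
    "time": pd.to_datetime(["2026-02-03 09:30", "2026-02-03 17:00",
                            "2026-02-04 10:15"]),
    "value": [1.0, 3.0, 5.0],
})

print(df["time"].dt.normalize())           # times snapped back to midnight
print(df["time"].dt.strftime("%d %b %Y"))  # timestamps formatted as strings

daily = df.set_index("time")["value"].resample("D").mean()  # daily averages
print(daily)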
Filtering

You can filter on particular parts of the timestamp as normal, because they’re just numbers (note the .dt accessor):

df[df["col_name"].dt.year == 2026]

However, you might want to filter from or before a particular timestamp. To do this, use a pd.Timestamp in the condition, e.g. for timestamps after 3rd Feb 2026,

df[df["col_name"] > pd.Timestamp(2026, 2, 3)]

Categorical (.cat methods)

When analysing categorical data, you can use the methods in df["col_name"].cat. These are particularly useful for data with ordered levels.

To make a column the "category" type, use .astype("category"):

df["col_name"] = df["col_name"].astype("category")
Tip: Consult the user guide

You should consult the official Categorical data user guide for details about categorical operations. We’ve included a brief summary of some useful functions here.

For most examples below, append the method to df["col_name"]; for example, .cat.categories should be used as df["col_name"].cat.categories. Each method is applied to every row in "col_name" (pd.cut is the exception; see the table).

Method                                      Description
pd.cut(...)                                 Group numeric data into discrete bins. Not a .cat method: apply it to a numeric column as pd.cut(df["col_name"], bins); it returns a category column.
.cat.categories                             Return the categories in the column.
.cat.reorder_categories(..., ordered=True)  Change the category order. Exclude ordered=True if the data is not ordered.
.sort_values(...)                           Sort the column (but not the rest of the dataframe) based on the category order. Will fail if .cat.ordered == False, and will work for non-category columns (sorting alphabetically for strings and numerically for numbers).
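A short sketch tying these together on made-up numeric data (pd.cut produces ordered categories by default, so the sort respects the bin order rather than alphabetical order):

import pandas as pd

df = pd.DataFrame({"temperature": [12.0, 24.5, 31.0, 18.2]})

# Bin the numeric column into three ordered categories
df["band"] = pd.cut(df["temperature"],
                    bins=[0, 15, 25, 40],
                    labels=["cold", "mild", "hot"])

print(df["band"].cat.categories)  # cold, mild, hot
print(df.sort_values("band"))     # sorted cold -> mild -> hot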

Geospatial

To manage geospatial data, you should use the geopandas package. With geopandas, you create GeoDataFrames, which behave like pandas DataFrames with additional behaviour for geospatial columns. Let us know if you’d like help with this!

What to do when you don’t know what to do

  1. Consult the documentation. If your function isn’t behaving, go to the specific page for that function. Otherwise, consult a relevant user guide.
  2. Search the web for your issue to find practical and canonical approaches to typical data cleaning tasks. StackOverflow is your friend.
  3. Ask us while we’re here! And once we’re gone, shoot us an email at training@library.uq.edu.au
  4. Lots of people also ask AI. There are pros and cons, but beware: you’ll get a solution that probably works, but if you don’t know why, you should double check the data matches what you want (and try to understand what it’s done!).

Now give it a go!

For the rest of this session, we’d love to help you get your data set up and working on your end. For that reason, we’ve dedicated this time to troubleshooting any issues that arise together. Good luck!