Analysing your own data

In this final session, we look at taking the skills we’ve developed during the intensive and applying them to your own data. We’ll also discuss a few common pitfalls and where to get help in the wild.

Setting up

Spend five minutes locating your data and preparing RStudio

  1. Locate and structure your own project folder
    • You might like to create a data/ folder
    • You might like to create a scripts/ folder
  2. (Optional) Turn it into an RStudio project
    1. Click File → New Project…
    2. Click Existing Directory
    3. Click Browse… and choose the top-level folder of your project
  3. Create a new script for processing, called processing.R (or whatever name you’d like).

We’ll spend most of this session as project time, troubleshooting and setting up your own data for analysis. Before we do that, let’s discuss a few common tips and pitfalls. Namely,

  • Managing paths and the working directory
  • Importing different data structures and files
  • Dealing with different types of data

Environments and importing data

Perhaps the most frustrating issues are those which prevent you from importing the data at all!

Getting your setup right

Nobody likes to encounter this:

Warning in file(file, "rt"): cannot open file 'PANIC!': No such file or
directory
Error in file(file, "rt"): cannot open the connection

To solve it, we need to talk about filesystems.

When running an R script, there are always three relevant locations:

  • Your R executable
  • Your script
  • Your working directory

The R executable runs your script from the working directory

Why does this matter? Because when you import data, R resolves paths relative to your working directory, not your script’s location or R’s. This means that there are two important questions you need to ask:

  1. Where is your working directory?
  2. Where is your data?

Answering the first question is easy: simply run

getwd()
[1] "/home/uqcwest5/tech_training/training-intensives/r"
Note: Alternatively,
  • When you run R from the command line, it’s the current location as specified in the terminal
  • When you run R in RStudio, it’s the address displayed in the bar attached to the Console

This address is your working directory. All paths in your scripts are evaluated relative to this location.
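As a sketch (the data/example.csv path here is hypothetical), you can ask R to show exactly where a relative path will resolve to:

```r
# Relative paths are resolved against the working directory
getwd()

# mustWork = FALSE lets us expand the path even if the file doesn't exist yet
normalizePath("data/example.csv", mustWork = FALSE)
```

If the expanded path isn’t where your file actually lives, that’s the problem.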

This includes data paths, which can be absolute or relative, and can be online.

Absolute paths begin with a root prefix and have folders separated with slashes. They contain all the folders from the root of the filesystem down to the object of interest.

On Windows, absolute paths are prefixed with the drive, e.g. C:\ and folders separated with backslashes \

C:\Users\...\...\data\example.csv

On Unix-like systems, absolute paths are prefixed with a forwardslash /, which also separates folders.

/home/user/.../.../data/example.csv

Alternatively, you can start from your ‘user’ directory by prefixing with a tilde ~:

~/.../.../data/example.csv

Websites and web-hosted files can typically be accessed with URLs. Full or ‘absolute’ URLs are prefixed with a protocol (e.g. https://) and a hostname (e.g. www.website.com), with folders then separated by forwardslashes

https://www.website.com/.../.../.../data/example.csv

Relative filepaths have no prefix.

On Windows, relative paths are still separated with backslashes

data\example.csv

On Unix-like systems, relative paths are still separated with forwardslashes

data/example.csv

It’s possible to have a relative path to a web file; however, as with any relative filepath, you must be running R from the server that hosts it.

The syntax is the same as unix-like systems, i.e., folders separated with forwardslashes

data/example.csv
Note: Should I use absolute or relative paths?

Absolute paths
  • Pros: work for any working directory (on the same device)
  • Cons: only valid on one device; can get long; can contain irrelevant information

Relative paths
  • Pros: work on any device with the same project structure; only contain project-specific information; can be shorter
  • Cons: the working directory must be set up correctly; can become confusing with many parent folders (e.g. ../../../)

Once you have your working directory and your filepath, you can now check that any data paths have been specified correctly.

If the path is absolute

You just need to ensure that your working directory is on the same device as the file.

If the path is relative

Go to the working directory and trace the path to ensure it’s correct. The path begins inside the working directory. A few oddities:

  • .. indicates go up a folder
  • . indicates the current folder
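A quick way to test a candidate path (using a hypothetical data/example.csv for illustration) is file.exists(), which resolves relative paths against the working directory:

```r
file.exists("data/example.csv")     # looks inside the working directory
file.exists("../data/example.csv")  # looks one folder up
file.exists("./data/example.csv")   # "." is the current folder, same as the first
```

If it returns FALSE, trace the path by hand from getwd() to find the broken link.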
Note: Changing the working directory

If you’re using a project, you shouldn’t need to change working directories. Instead, try modifying the path first.

If you really need to change working directories, you can do this with R code,

setwd("path/to/dir")

Or in the Files tab by navigating to the folder and clicking More → Set As Working Directory.

You should check that it’s worked by running getwd().

Warning: Watch out Windows users

The backslashes in Windows paths conflict with R’s escape character in strings (also a backslash). To fix this, you can

  • Replace backslashes with forwardslashes in R
  • Use raw strings: r"(...)"
  • Escape the backslashes with an extra backslash

For example, the following Windows path

C:\Users\me\data\secrets.csv

could be imported as any of the following

# Raw strings
read.csv(r"(C:\Users\me\data\secrets.csv)")

# Replace with forwardslashes
read.csv("C:/Users/me/data/secrets.csv")

# Escape the backslashes
read.csv("C:\\Users\\me\\data\\secrets.csv")

We recommend using the r"(...)" option where possible, as it’s the least work.

If you’ve fixed the filepath and you’re on top of the Windows peculiarities, then check the following errors for more troubleshooting.

  • “unrecognized escape in character string”: you’ve probably used a path with backslashes and not adjusted it for R. See “Watch out Windows users” above.
  • “'\u' used without hex digits”: the same culprit. R throws this error when it runs "\u..." or "\U..." (unless "..." is a valid unicode code; \Users is not).
  • “No such file or directory”: have you used a path with backslashes on a non-Windows machine? If you have, replace them with forwardslashes.

Final thoughts

  • Avoid spaces
  • Use relative filepaths where possible
  • Get familiar with your working directory

Importing your data correctly

Once you’ve got the path working, the next challenge is importing the data correctly. Unlike our data, yours might have multiple header rows, missing data, be organised differently, or even be a different file type.

We’ll look at importing .csv files here, but the same applies to other file types.

The documentation for read.csv describes (as of v4.4.0) 25 different parameters that the function supports. These include

Parameter Description
file The path (doesn’t have to actually be .csv!)
header TRUE/FALSE specifying if the first row is a header
sep The separator (use "\t" for tab separated values)
na.strings A list of values to interpret as empty cells
skip Skip the first n rows

There are plenty more! Run

help(read.csv)

to see the rest. For more advanced reading, you can also try the readr package.
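To see how these parameters fit together, here is a sketch that builds a small tab-separated file (standing in for your real data) and imports it:

```r
# Build a tiny tab-separated file to import (stands in for your real data)
path <- tempfile(fileext = ".tsv")
writeLines(c("# exported 2026-02-03",  # metadata row 1
             "# source: example",      # metadata row 2
             "id\tscore",              # header row
             "1\t-",                   # "-" marks a missing value
             "2\t7"),
           path)

df <- read.csv(
  path,
  sep = "\t",                    # tab-separated values
  skip = 2,                      # ignore the two metadata rows
  header = TRUE,                 # the next row holds the column names
  na.strings = c("NA", "-", "")  # treat these as empty cells
)

df  # the "-" cell has become NA
```

With your own file, you’d pass its path directly instead of building one with tempfile().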

Tip: MS Excel files (.xlsx)

Reading .xlsx files can be complicated. You should use the readxl package

library(readxl)
read_excel(...)

You can use the argument sheet="..." if you’d like to specify the sheet.

Check out the documentation for details.

Dealing with different and dodgy data

Our data has been set up to be a bit of a challenge, and a bit of a help. Your data might be organised differently, and might need more work! You might also need to perform different tasks.

We’ll look at a few common tips to get you going, but before you start, the best advice is to get out a pencil and paper and draw. Mock up your data, figure out what you want and write down the steps that you would have to do by hand. Then you’ll have a good grasp of what you want, and whether the code is working.

See below for a summary of data cleaning tips you might need to apply to your data, assuming your data is stored in df and "col_name" is a column name.

Reshaping your data

For simple (or complex) reshaping tasks, like filtering, subsetting and adding new columns, refresh yourself with our second session of this intensive: 2 - Data Processing.

The Tidyverse usually wants you to organise your data in the “tidy data” way: single variable per column, single case per row, one value per cell. To get to that point, we often have to reshape our dataframes, usually from “wide format” to “long format”. The package tidyr is perfect for that, using the pivot_longer() function (and its pivot_wider() opposite).

This tidyr example shows the process in a nutshell:

library(tidyr)
# This relig_income dataset has data split into income bracket columns:
relig_income
# A tibble: 18 × 11
   religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
   <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
 1 Agnostic      27        34        60        81        76       137        122
 2 Atheist       12        27        37        52        35        70         73
 3 Buddhist      27        21        30        34        33        58         62
 4 Catholic     418       617       732       670       638      1116        949
 5 Don’t k…      15        14        15        11        10        35         21
 6 Evangel…     575       869      1064       982       881      1486        949
 7 Hindu          1         9         7         9        11        34         47
 8 Histori…     228       244       236       238       197       223        131
 9 Jehovah…      20        27        24        24        21        30         15
10 Jewish        19        19        25        25        30        95         69
11 Mainlin…     289       495       619       655       651      1107        939
12 Mormon        29        40        48        51        56       112         85
13 Muslim         6         7         9        10         9        23         16
14 Orthodox      13        17        23        32        32        47         38
15 Other C…       9         7        11        13        13        14         18
16 Other F…      20        33        40        46        49        63         46
17 Other W…       5         2         3         4         2         7          3
18 Unaffil…     217       299       374       365       341       528        407
# ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
#   `Don't know/refused` <dbl>
# To make it more "tidy":
# Store income brackets in a column, so functions can access the data.
reshaped <- relig_income |>
  pivot_longer(!religion, # all columns except "religion"
               # name the columns where data is stored:
               names_to = "income", # what comes from the column headings
               values_to = "count") # what comes from the cells

The tidyr cheatsheet is great to visualise what exactly the pivot_*() functions do, with helpful pictograms.

Cleaning up inconsistencies

The mutate(...) function (from dplyr) is the simplest tool for cleaning up values in particular columns. Use it as

df <- df %>%
  mutate(new_col = ...)

This creates a new column based on the computation of ..., which can include other columns. A useful helper function is if_else(), which changes the computation based on the condition. For example, to replace values "A" with "B" in col_name,

df <- df %>%
  mutate(col_name = if_else(col_name == "A", "B", col_name))

For reshaping and cleaning, consult the dplyr cheatsheet.

Dealing with different types of data

Each variable (column) in an R dataframe is stored as a particular data type. Common types include

  • character for literal strings
  • numeric for numbers
  • integer for whole numbers (no decimals)
  • factor for categorical data
  • logical for booleans (TRUE or FALSE)
  • POSIXct (among others) for timestamps

For applying methods specific to textual, temporal or categorical data, you must first ensure the columns have the type you expect. Common mishaps include

  • Timestamps or categorical data stored as character types
  • Numeric categorical data stored as the numeric type

If you need to change the way a variable is stored, you can either do that at the import stage (e.g. via read.csv’s colClasses argument) or with a relevant conversion function, described in the respective sections below.
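One import-stage option is read.csv’s colClasses argument; the column names below are hypothetical, and a tiny file stands in for your real data:

```r
# A tiny CSV standing in for your real data
path <- tempfile(fileext = ".csv")
writeLines(c("id,group,label",
             "1,x,foo",
             "2,y,bar"),
           path)

# Declare each column's type as it is read in
df <- read.csv(path, colClasses = c(id = "integer",
                                    group = "factor",
                                    label = "character"))

str(df)  # id is integer, group is a factor, label stays character
```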

Textual

The simplest way to deal with textual data in R is by using the stringr package.

Tip: Consult the user guide

You should consult the official Introduction to stringr user guide for details about the package, and the official cheatsheet for a detailed summary. We’ve included a brief summary of some useful functions here.

Load in the package with library(stringr). All stringr functions follow the same format:

str_<function_name>(input_string, parameters...)

The value for input_string can be a character vector, such as a data column. If you’re using pipes, use pull() (from dplyr) to extract the column as a vector. For example

df %>%
  pull(a_string_col) %>% 
  str_length()   # Or any other stringr function

Some useful functions are included here for reference.

Function Description
str_detect(...) Return TRUE/FALSE if pattern is in the string.
str_sub(...) Extract a substring based on character indices (a slice).
str_trim(...) Trim whitespace from start and/or end of the string.
str_replace(...) Replace the first matched pattern in the string (use str_replace_all(...) to replace all).
str_c(...) Join multiple strings into one.
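A few of these in action on toy strings, as a minimal sketch:

```r
library(stringr)

str_detect(c("apple", "banana"), "an")  # FALSE TRUE
str_sub("example.csv", 1, 7)            # "example"
str_trim("  hello  ")                   # "hello"
str_replace_all("a-b-c", "-", "_")      # "a_b_c"
str_c("data", "/", "example.csv")       # "data/example.csv"
```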

Time series

The simplest way to deal with temporal data in R is by using the lubridate package.

Tip: Consult the user guide

You should consult the official Do more with dates and times in R user guide for details about the package, and the official cheatsheet for a detailed summary. We’ve included a brief summary of some useful functions here.

Load in the package with library(lubridate). To convert the column dt_col into an appropriate type, use

df$dt_col <- as_datetime(df$dt_col)

There are other parsing functions, should your needs be more specific. Take a look at the cheatsheet to learn more.

Most lubridate functions will take in a date-time object (or a vector of them). If you’re using pipes, use pull() (from dplyr) to extract the column, for example

df %>%
  pull(a_datetime_col) %>%
  year()

Some useful functions are included here for reference.

Function Description
hour() Returns the hour part of the datetime. Similarly for other units (e.g. year(), second()).
hours(...) Represents a number of hours (similarly for other units). Useful for arithmetic, e.g. df$dt_col + hours(2). Can result in impossible dates, e.g. 31st Feb.
dhours(...) Represents a number of hours in terms of seconds (similarly for other units). Useful for arithmetic, e.g. df$dt_col + dhours(2). Inconsistent units (e.g. months, leap-years) are approximated.
round_date(...) Rounds to the nearest specified unit.
with_tz(...) Convert a date-time to a new time zone
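The period vs duration distinction (hours() vs dhours(), months() vs dmonths()) matters most around irregular units; a minimal sketch, assuming lubridate is installed:

```r
library(lubridate)

jan31 <- as_datetime("2024-01-31")

# Period arithmetic respects the calendar, so impossible dates become NA
jan31 + months(1)   # NA, since the 31st of February doesn't exist

# Duration arithmetic adds an exact number of seconds (an average month)
jan31 + dmonths(1)
```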
Filtering

You can filter particular parts of the timestamp as normal, because they’re just numbers:

df %>%
  filter(year(dt_col) == 2026)

However, you might want to filter from or before a particular date-time. Use the parsing functions to do this:

df %>%
  filter(dt_col > as_datetime("2026/02/03"))

Categorical

R has a special, built-in type for categorical data: factor. It stores the different possible values as “levels”, which it maps to your data, making it more efficient than normal strings.

Factors can also order the levels, making it a good choice for ordered data (e.g. “Small” < “Medium” < “Large”).

To make the variable cat_col a factor, use the factor() function:

df$cat_col <- factor(df$cat_col)

You can use levels(df$cat_col) to see the unique levels.
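For ordered data, you can pass the levels explicitly with ordered = TRUE; a toy sketch:

```r
sizes <- factor(c("Small", "Large", "Medium", "Small"),
                levels = c("Small", "Medium", "Large"),
                ordered = TRUE)

levels(sizes)    # "Small" "Medium" "Large"
sizes < "Large"  # TRUE FALSE TRUE TRUE
```

Because the factor is ordered, comparisons like sizes < "Large" respect the level order rather than alphabetical order.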

For more advanced functions, you can use the forcats package.

Tip: Consult the user guide

You should consult the official Introduction to forcats user guide for details about the package, and the official cheatsheet for a detailed summary. We’ve included a brief summary of some useful functions here.

Load in the package with library(forcats).

All forcats functions follow the same format:

fct_<function_name>(input_factor, parameters...)

If you’re using pipes, use pull() (from dplyr) to extract the column as a factor. For example

df %>%
  pull(a_cat_col) %>% 
  fct_count()   # Or any other forcats function

Some useful functions are included here for reference.

Function Description
fct_count() Count the number of values per level
fct_c() Combine factors with different levels
fct_relevel(...) Manually reorder the factor levels
fct_recode(...) Manually change the levels
fct_collapse(...) Collapse multiple levels into manually defined groups
fct_other(...) Collapse selected levels into a single “other” level
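A couple of these on a toy factor, as a minimal sketch (assuming forcats is installed):

```r
library(forcats)

f <- factor(c("a", "a", "b", "c"))

fct_count(f)              # a tibble of counts per level
fct_other(f, keep = "a")  # levels "b" and "c" collapse into "Other"
```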

Geospatial

Managing geospatial data is a complex task beyond the scope of this introductory intensive. If you’re planning to analyse geospatial data in R, you should check out

  • sf (simple features) for vector data
  • terra for raster data and more complex vector data
  • stars for spatiotemporal data, irregular grids or other advanced features

Let us know if you’d like a hand or want to discuss further!

What to do when you don’t know what to do

  1. Consult the documentation. If your function isn’t behaving, go to the specific page for that function. Otherwise, consult a relevant user guide or cheat sheet.
  2. Search the web for your issue to find practical and canonical approaches to typical data cleaning tasks. StackOverflow is your friend.
  3. Ask us while we’re here! And once we’re gone, shoot us an email at training@library.uq.edu.au
  4. Lots of people also ask AI. There are pros and cons, but beware: you’ll get a solution that probably works, but if you don’t know why, you should double check the data matches what you want (and try to understand what it’s done!).

Now give it a go!

For the rest of this session, we’d love to help you get your data set up and working on your end. For that reason, we’ve dedicated this time to troubleshooting any issues that arise together. Good luck!