In this final session, we look at taking the skills we’ve developed during the intensive and applying them to your own data. We’ll also discuss a few common pitfalls and where to get help in the wild.
Spend five minutes locating your data and preparing RStudio. Create a script called processing.R (or whatever you’d like to call it).

We’ll spend most of this session like project time, troubleshooting and setting up your own data for analysis. Before we do that, let’s discuss a few common tips and pitfalls.
Perhaps the most frustrating issues are those which prevent you from importing the data at all!
Nobody likes to encounter this:
Warning in file(file, "rt"): cannot open file 'PANIC!': No such file or
directory
Error in file(file, "rt"): cannot open the connection
To solve it, we need to talk about filesystems.
When running an R script, there are always three relevant locations:
The R executable runs your script from the working directory
Why does this matter? Because when you import data, R uses your working directory as the reference, not the others. This means that there are important questions you need to ask:
Answering the first question is easy: simply run getwd() and read the address it prints.
This address is your working directory. All paths in your scripts are evaluated relative to this location.
This includes data paths, which can be absolute or relative, and can be online.
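As a minimal sketch of checking where R is looking, the snippet below prints the working directory and tests whether a file can be found from there. The path data/example.csv is a hypothetical placeholder; swap in the path to your own data.

```r
# Where is R looking? This is the working directory.
wd <- getwd()
print(wd)

# A hypothetical relative path; replace with the path to your own data.
data_path <- file.path("data", "example.csv")

# normalizePath() shows the absolute location R would look in.
# mustWork = FALSE avoids an error if the file is not there.
full_path <- normalizePath(file.path(wd, data_path), mustWork = FALSE)
print(full_path)

# file.exists() returns TRUE only if R can find the file from here.
found <- file.exists(data_path)
print(found)
```

If found is FALSE, compare full_path with where the file actually lives on your machine.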
Absolute paths begin with a root prefix and have folders separated with slashes. They contain all the folders from the root of the filesystem down to the object of interest.
On Windows, absolute paths are prefixed with the drive, e.g. C:\ and folders separated with backslashes \
C:\Users\...\...\data\example.csv
On Unix-like systems, absolute paths are prefixed with a forward slash /, which also separates folders.
/home/user/.../.../data/example.csv
Alternatively, you can start from your ‘user’ directory by prefixing with a tilde ~:
~/.../.../data/example.csv
Websites and web-hosted files can typically be accessed with URLs. Full or ‘absolute’ URLs are prefixed with a protocol (e.g. https://) and a hostname (e.g. www.website.com), with folders then separated by forward slashes.
https://www.website.com/.../.../.../data/example.csv
Relative filepaths have no prefix.
On Windows, relative paths are still separated with backslashes
data\example.csv
On Unix-like systems, relative paths are still separated with forward slashes
data/example.csv
It’s possible to have a relative path to a web file. However, because the path is relative, you must be running R from the server.
The syntax is the same as on Unix-like systems, i.e., folders are separated with forward slashes
data/example.csv
| | Pros | Cons |
|---|---|---|
| Absolute | | |
| Relative | | |
Once you have your working directory and your filepath, you can now check that any data paths have been specified correctly.
If the path is absolute
You just need to ensure that your working directory is on the same device as the file.
If the path is relative
Go to the working directory and trace the path to ensure it’s correct. The path begins inside the working directory. A few oddities:
.. indicates go up a folder
. indicates the current folder

If you’re using a project, you shouldn’t need to change working directories. Instead, try modifying the path first.
If you really need to change working directories, you can do this in R code with setwd(),
or in the Files tab by navigating to the folder and clicking More → Set As Working Directory.
You should check that it’s worked by running getwd().
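A minimal sketch of changing the working directory in code and confirming the change. Here tempdir() is only a stand-in for your own project folder, and the script returns to the original directory at the end.

```r
# Remember where we started so we can return afterwards.
old_wd <- getwd()

# tempdir() stands in for your project folder in this sketch.
new_wd <- tempdir()
setwd(new_wd)

# Confirm the change worked.
print(getwd())

# ".." refers to the folder above the working directory.
print(normalizePath(".."))

# Return to the original working directory.
setwd(old_wd)
```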
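The backslashes in Windows paths conflict with R’s escape character in strings (also a backslash). To fix this, you can
replace the backslashes with forward slashes /
escape each backslash with a second backslash \\
wrap the path in a raw string, r"(...)"

For example, the following Windows path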
could be imported as any of the following
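A minimal sketch of the options, assuming a hypothetical path (the user and folder names are made up for illustration). Raw strings need R 4.0.0 or later.

```r
# Hypothetical Windows path: C:\Users\me\data\example.csv

# Option 1: forward slashes (R accepts these on Windows too)
path_forward <- "C:/Users/me/data/example.csv"

# Option 2: escape each backslash with a second backslash
path_escaped <- "C:\\Users\\me\\data\\example.csv"

# Option 3: raw string, so the path can be pasted unchanged
path_raw <- r"(C:\Users\me\data\example.csv)"

# Options 2 and 3 produce exactly the same string
print(identical(path_escaped, path_raw))

# Any of these could then be passed to an import function,
# e.g. read.csv(path_raw)
```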
We recommend using the r"(...)" option where possible, as it’s the least work.
If you’ve fixed the filepath and you’re on top of the Windows peculiarities, then check the following errors for more troubleshooting.
Error: '\U' used without hex digits in character string...
You’ve probably used a path with backslashes and not adjusted it for R. See “Watch out Windows users” above.
R throws this error when it parses a string containing "\u..." or "\U..." (unless "..." is a valid Unicode code in hex digits; \Users is not).
cannot open file... but everything is correct???
Have you used a path with backslashes on a non-Windows machine? If you have, replace them with forward slashes.
Final thoughts
Once you’ve got the path working, the next challenge will be importing the data correctly. Unlike our data, yours might have multiple header rows, missing data, simply be organised differently or even be a different file type.
We’ll look at importing .csv files here, but the same applies to other file types.
The documentation for read.csv describes (as of v4.4.0) 25 different parameters that the function supports. These include
| Parameter | Description |
|---|---|
| file | The path (doesn’t have to actually be .csv!) |
| header | TRUE/FALSE specifying if the first row is a header |
| sep | The separator (use "\t" for tab separated values) |
| na.strings | A list of values to interpret as empty cells |
| skip | Skip the first skip=n rows |
There are plenty more! Run ?read.csv to see the rest. For more advanced reading, you can also try the readr package.
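To see how these parameters fit together, here is a minimal sketch that writes a small made-up file to a temporary location and reads it back. The file contents and the "missing" marker are assumptions for illustration only.

```r
# Write a small example file to a temporary location (made-up contents).
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "Exported by a hypothetical instrument",
  "id;height;group",
  "1;1.5;a",
  "2;missing;b",
  "3;2.1;a"
), tmp)

df <- read.csv(
  file = tmp,
  skip = 1,                      # ignore the first line of notes
  header = TRUE,                 # the next line holds the column names
  sep = ";",                     # values are separated by semicolons
  na.strings = c("", "missing")  # treat these values as NA
)

# Check the column types and values that R ended up with.
str(df)
```

Because "missing" is declared in na.strings, the height column is read as numeric rather than text.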
Excel files (.xlsx)
Reading .xlsx files can be complicated. You should use the readxl package
You can use the argument sheet="..." if you’d like to specify the sheet.
Check out the documentation for details.
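A minimal sketch of reading a sheet with readxl. It uses datasets.xlsx, an example workbook bundled with the package, purely as a stand-in for your own file.

```r
library(readxl)

# readxl ships with example workbooks; this one is used for illustration.
path <- readxl_example("datasets.xlsx")

# List the sheets available in the workbook.
sheets <- excel_sheets(path)
print(sheets)

# Read one sheet by name (sheet also accepts a position, e.g. sheet = 2).
df <- read_excel(path, sheet = "mtcars")
head(df)
```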
Our data has been set up to be a bit of a challenge, and a bit of a help. Your data might be organised differently, and might need more work! You might also need to perform different tasks.
We’ll look at a few common tips to get you going, but before you start, the best advice is to get out a pencil and paper and draw. Mock up your data, figure out what you want and write down the steps that you would have to do by hand. Then you’ll have a good grasp of what you want, and whether the code is working.
A few resources that you should consult:
See below for a summary of data cleaning tips you might need to apply to your data, assuming your data is stored in df and "col_name" is a column name
For simple (or complex) reshaping tasks, like filtering, subsetting and adding new columns, refresh yourself with our second session of this intensive: 2 - Data Processing.
The Tidyverse usually wants you to organise your data in the “tidy data” way: single variable per column, single case per row, one value per cell. To get to that point, we often have to reshape our dataframes, usually from “wide format” to “long format”. The package tidyr is perfect for that, using the pivot_longer() function (and its pivot_wider() opposite).
This tidyr example shows the process in a nutshell:
# A tibble: 18 × 11
religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Agnostic 27 34 60 81 76 137 122
2 Atheist 12 27 37 52 35 70 73
3 Buddhist 27 21 30 34 33 58 62
4 Catholic 418 617 732 670 638 1116 949
5 Don’t k… 15 14 15 11 10 35 21
6 Evangel… 575 869 1064 982 881 1486 949
7 Hindu 1 9 7 9 11 34 47
8 Histori… 228 244 236 238 197 223 131
9 Jehovah… 20 27 24 24 21 30 15
10 Jewish 19 19 25 25 30 95 69
11 Mainlin… 289 495 619 655 651 1107 939
12 Mormon 29 40 48 51 56 112 85
13 Muslim 6 7 9 10 9 23 16
14 Orthodox 13 17 23 32 32 47 38
15 Other C… 9 7 11 13 13 14 18
16 Other F… 20 33 40 46 49 63 46
17 Other W… 5 2 3 4 2 7 3
18 Unaffil… 217 299 374 365 341 528 407
# ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
# `Don't know/refused` <dbl>
library(tidyr) # provides pivot_longer() and the relig_income dataset

# To make it more "tidy":
# Store income brackets in a column, so functions can access the data.
reshaped <- relig_income |>
  pivot_longer(!religion, # all columns except "religion"
               # name the columns where data is stored:
               names_to = "income", # what comes from the column headings
               values_to = "count") # what comes from the cells

The tidyr cheatsheet is great for visualising exactly what the pivot_*() functions do, with helpful pictograms.
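For a smaller round trip, the sketch below pivots a made-up two-row table to long format and back again with pivot_wider(). The data frame and its column names are invented for illustration.

```r
library(tidyr)

# A small made-up wide table: one column per year.
wide <- data.frame(
  site  = c("A", "B"),
  y2022 = c(10, 20),
  y2023 = c(12, 18)
)

# Wide to long: the year columns become rows.
long <- pivot_longer(wide, !site, names_to = "year", values_to = "count")
print(long)

# Long to wide: back to one column per year.
wide_again <- pivot_wider(long, names_from = year, values_from = count)
print(wide_again)
```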
The mutate(...) function is the simplest for cleaning up values in certain columns. Use as
This creates a new column based on the computation of ..., which can include other columns. A useful helper function is if_else(), which changes the computation based on the condition. For example, to replace values "A" with "B" in col_name,
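A minimal sketch of both uses, with a made-up data frame. The column col_name matches the placeholder in the text; value and doubled are hypothetical names.

```r
library(dplyr)

# Made-up data; "col_name" matches the placeholder used in the text.
df <- data.frame(col_name = c("A", "B", "A", "C"), value = c(1, 2, 3, 4))

df <- df |>
  mutate(
    # a new column computed from an existing one
    doubled = value * 2,
    # where col_name is "A", use "B"; otherwise keep the existing value
    col_name = if_else(col_name == "A", "B", col_name)
  )

print(df)
```

Assigning to an existing column name inside mutate() overwrites that column; a new name adds a column.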
For reshaping and cleaning, consult the dplyr cheatsheet.
Each variable (column) in an R dataframe is stored as a particular data type. Common types include
character for literal strings
numeric for numbers
integer for integers (no decimals)
factor for categorical data
logical for booleans (TRUE or FALSE)
POSIXct (among others) for timestamps

For applying methods specific to textual, temporal or categorical data, you must first ensure the columns match the data type you expect. Common mishaps include
character types
numeric the type.

If you need to change the way a variable is stored, you can either do that during the import stage (e.g. by adjusting the arguments passed to read.csv) or with a relevant function, described in the respective sections below.
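A minimal sketch of inspecting and converting column types after import, using a made-up data frame in which numbers arrived as text.

```r
# Made-up data where everything was imported as text.
df <- data.frame(
  id     = c("1", "2", "3"),
  height = c("1.5", "2.0", "n/a"),
  stringsAsFactors = FALSE
)

# Inspect how each column is stored.
str(df)
sapply(df, class)

# Convert after import. Values that cannot be parsed become NA
# (R would normally warn about this; the warning is silenced here).
df$id <- as.integer(df$id)
df$height <- suppressWarnings(as.numeric(df$height))

sapply(df, class)
```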
The simplest way to deal with textual data in R is by using the stringr package.
You should consult the official Introduction to stringr user guide for details about the package, and the official cheatsheet for a detailed summary. We’ve included a brief summary of some useful functions here.
Load in the package with library(stringr). All stringr functions follow the same format:
The value for input_string can be a variable(s). If you’re using pipes, you could use select() to choose the columns to include. For example
Some useful functions are included here for reference.
| Function | Description |
|---|---|
| str_detect(...) | Return TRUE/FALSE if pattern is in the string. |
| str_sub(...) | Extract a substring based on character indices (a slice). |
| str_trim(...) | Trim whitespace from start and/or end of the string. |
| str_replace(...) | Replace the first matched pattern in the string (use str_replace_all(...) to replace all). |
| str_c(...) | Join multiple strings into one. |
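A minimal sketch of the functions in the table, applied to a made-up character vector. In each call the string (or vector of strings) comes first, followed by the pattern or other arguments.

```r
library(stringr)

# Made-up strings for illustration.
x <- c("  apple pie ", "banana split", "cherry tart")

trimmed <- str_trim(x)                     # remove leading/trailing whitespace
has_an  <- str_detect(trimmed, "an")       # does each string contain "an"?
first3  <- str_sub(trimmed, 1, 3)          # characters 1 to 3
swapped <- str_replace(trimmed, "a", "A")  # replaces the first "a" only
joined  <- str_c(first3, "!")              # join strings element-wise

print(joined)
```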
The simplest way to deal with temporal data in R is by using the lubridate package.
You should consult the official Do more with dates and times in R user guide for details about the package, and the official cheatsheet for a detailed summary. We’ve included a brief summary of some useful functions here.
Load in the package with library(lubridate). To convert the column dt_col into an appropriate type, use
There are other parsing functions, should your needs be more specific. Take a look at the cheatsheet to learn more.
Most lubridate functions will take in a date-time object (or a variable containing them). You can use the pipe, for example
Some useful functions are included here for reference.
| Function | Description |
|---|---|
| hour() | Returns the hour part of the datetime. Similarly for other units (e.g. year(), second()). |
| hours(...) | Represents a number of hours (similarly for other units). Useful for arithmetic, e.g. df$dt_col + hours(2). Can result in impossible dates, e.g. 31st Feb. |
| dhours(...) | Represents a number of hours in terms of seconds (similarly for other units). Useful for arithmetic, e.g. df$dt_col + dhours(2). Inconsistent units (e.g. months, leap-years) are approximated. |
| round_date(...) | Rounds to the nearest unit specified. |
| with_tz(...) | Convert a date-time to a new time zone |
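A minimal sketch of parsing a text column and applying some of the functions above. The timestamps are made up, ymd_hms() is just one of the available parsers, and the time zone is an arbitrary choice for illustration.

```r
library(lubridate)

# Made-up timestamps stored as text, in year-month-day hour:minute:second order.
df <- data.frame(dt_col = c("2024-01-31 09:15:00", "2024-03-10 22:40:30"))

# ymd_hms() is one of several parsers; pick the one matching your format.
df$dt_col <- ymd_hms(df$dt_col)  # parsed as UTC by default

hrs     <- hour(df$dt_col)                 # extract the hour
later   <- df$dt_col + hours(2)            # add a two-hour period
rounded <- round_date(df$dt_col, "hour")   # round to the nearest hour
local   <- with_tz(df$dt_col, "Australia/Brisbane")  # same instants, new zone

print(later)
```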
You can filter particular parts of the timestamp as normal, because they’re just numbers:
However, you might want to filter from or before a particular date-time. Use the parsing functions to do this:
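A minimal sketch of both kinds of filter, using a made-up data frame and base R subsetting (dplyr's filter() would work equally well). The cutoff is parsed with ymd_hms() so that it is a date-time like the column it is compared against.

```r
library(lubridate)

# Made-up data for illustration.
df <- data.frame(
  dt_col = ymd_hms(c("2023-12-31 23:00:00",
                     "2024-01-15 08:30:00",
                     "2024-02-01 14:00:00")),
  value  = c(1, 2, 3)
)

# Filter on a part of the timestamp: rows recorded before midday.
morning <- df[hour(df$dt_col) < 12, ]

# Filter relative to a particular date-time: rows from 1 January 2024 onwards.
cutoff <- ymd_hms("2024-01-01 00:00:00")
recent <- df[df$dt_col >= cutoff, ]

print(morning)
print(recent)
```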
R has a special, built-in type for categorical data: factor. It stores the different possible values as “levels”, which it maps to your data, making it more efficient than normal strings.
Factors can also order the levels, making them a good choice for ordered data (e.g. “Small” < “Medium” < “Large”).
To make the variable cat_col a factor, use the factor() function:
You can use levels(df$cat_col) to see the unique levels.
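A minimal sketch with a made-up column, showing both an unordered factor and an ordered one with explicitly specified levels.

```r
# Made-up data; "cat_col" matches the placeholder used in the text.
df <- data.frame(cat_col = c("Small", "Large", "Medium", "Small"))

# Unordered factor: levels default to sorted order.
df$cat_col <- factor(df$cat_col)
levels(df$cat_col)

# Ordered factor: specify the levels from lowest to highest.
df$cat_col <- factor(df$cat_col,
                     levels = c("Small", "Medium", "Large"),
                     ordered = TRUE)
levels(df$cat_col)

# Comparisons now respect the level order.
is_smaller <- df$cat_col < "Large"
print(is_smaller)
```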
For more advanced functions, you can use the forcats package.
You should consult the official Introduction to forcats user guide for details about the package, and the official cheatsheet for a detailed summary. We’ve included a brief summary of some useful functions here.
Load in the package with library(forcats).
All forcats functions follow the same format:
If you’re using pipes, you could use select() to choose the columns to include. For example
Some useful functions are included here for reference.
| Function | Description |
|---|---|
| fct_count() | Count the number of values per level |
| fct_c() | Combine factors with different levels |
| fct_relevel(...) | Manually reorder the factor levels |
| fct_recode(...) | Manually change the levels |
| fct_collapse(...) | Collapse multiple levels into manually defined groups |
| fct_other(...) | Collapse selected levels into a single “other” level |
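A minimal sketch of several functions from the table, applied to a made-up factor. The level names are invented for illustration.

```r
library(forcats)

# Made-up categorical data for illustration.
f <- factor(c("cat", "dog", "cat", "fish", "bird", "dog", "cat"))

counts <- fct_count(f)                        # number of values per level
f2 <- fct_relevel(f, "dog")                   # move "dog" to the front
f3 <- fct_recode(f, feline = "cat")           # rename level "cat" to "feline"
f4 <- fct_collapse(f, pet = c("cat", "dog"))  # merge two levels into "pet"
f5 <- fct_other(f, keep = c("cat", "dog"))    # all other levels become "Other"

print(counts)
levels(f5)
```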
Managing geospatial data is a complex task beyond the scope of this introductory intensive. If you’re planning to analyse geospatial data in R, you should check out
Let us know if you’d like a hand or want to discuss further!
For the rest of this session, we’d love to help you get your data set up and working on your end. For that reason, we’ve dedicated this time to troubleshooting any issues that arise together. Good luck!