Tips

Here are a few general tips. In addition, we strongly recommend using the pandas cheatsheets, which give a quick and easy reference for common packages and functions, and from Data to Viz, which guides you through choosing a visualisation.

Hotkeys

Code	Hotkey	Description
	`Ctrl`+`Enter`	Run current line (when in Script)
`<-`	`Alt`+`-`	Assignment
`%>%`	`Ctrl`+`Shift`+`M`	Pipe
	`Esc`	Cancel current operation (when in Console)
	`F1`	Help documentation for selected function

Code	Hotkey	Description
	`F9` (or `Fn` + `F9`)	Run current line
`# %%`	`Ctrl` + `2`	New cell (only in Spyder)
	`Ctrl`+`Enter`	Run current cell (when in Script)
	`Ctrl`+`C`	Cancel current operation (when in Console)

Interface Customisation

You can make RStudio a little nicer to work in by changing its appearance and colours, and by using Snippets.

Appearance

Tools > Global Options > Appearance

Under Editor theme, you can choose a colour palette that suits your eyes.

Rainbow Parentheses

Tools > Global Options > Code > Display

Down the bottom, under Syntax, check Use rainbow parentheses to make it easier to distinguish your function brackets.

Snippets

Snippets allow you to create your own code-driven shortcuts. If you type lib into a .R file in RStudio and then press Tab, it will give you the option to auto-fill library(package).

You can find more Snippets, and add your own custom ones here: Tools > Edit Code Snippets...

For example, when you copy a Windows file path, the slashes are normally the wrong way around for most coding languages. If you add this snippet, you can copy a Windows path, type pp and press Tab, and it will paste the path with the slashes changed.

snippet pp
	"`r gsub('"', "", gsub("\\\\", "/", readClipboard()))`"

Data manipulation

Import packages to make data manipulation easier

library(dplyr)

import pandas as pd

Importing and exporting data

Read your data with an I/O (input/output) function.

dataset <- read.csv("data/dataset.csv")

df = pd.read_csv("data/dataset.csv")

You can also export your data to a csv file.

write.csv(dataset, "data/output_name.csv")

df.to_csv("data/output_name.csv")
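One detail worth knowing when writing a file and reading it back: by default, to_csv() also writes the row index as an extra unnamed column. The sketch below is a minimal illustration of avoiding that with index=False. The DataFrame contents and the temporary folder are made up for the example and are not part of any workshop dataset.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({"country": ["A", "B", "C"], "population": [10, 20, 30]})

# Write to a temporary folder so nothing in your project is overwritten
path = os.path.join(tempfile.mkdtemp(), "output_name.csv")

# index=False stops pandas writing the row index as an unnamed first column
df.to_csv(path, index=False)

# Reading the file back gives the same columns we started with
df_again = pd.read_csv(path)
print(df_again.columns.tolist())
```

Without index=False, the file read back would contain an additional column holding the old row numbers.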

Initial exploration

You’ll want to explore the data first. Below are a few functions to get started.

Function	Example	Description
`names()`	`names(dataset)`	Returns the variable names
`str()`	`str(dataset)`	Returns the structure of the dataset (variable names, types and first entries)
`$`	`dataset$variable`	Returns a specific variable
`unique()`	`unique(dataset$variable)`	Returns the unique values of a variable
`summary()`	`summary(dataset$variable)`	Returns a statistical summary of a variable

Function	Example	Description
`df.columns`		Returns the variable names
`df.info()`		Returns the structure of the dataset (variable names, counts and types)
`df["variable"]`		Returns a specific column
`df["variable"].unique()`		Returns the unique values of a variable
`df.describe()` or `df["variable"].describe()`		Returns a statistical summary of the dataset or a variable
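To see what the pandas entries in the table return, here is a small sketch on a made-up DataFrame. The column names species and weight are hypothetical and only there for illustration.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    "species": ["cat", "dog", "cat", "bird"],
    "weight": [4.0, 12.5, 5.0, 0.3],
})

column_names = df.columns.tolist()        # variable names as a plain list
species_seen = df["species"].unique()     # unique values, in order of appearance
weight_summary = df["weight"].describe()  # count, mean, std, min, quartiles, max

df.info()  # prints the structure: columns, non-null counts and types
print(column_names)
print(species_seen)
print(weight_summary)
```

The result of describe() can be indexed by statistic name, for example weight_summary["mean"].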

Removing `nan`s

We can remove missing values (NA in R, NaN in pandas) by filtering.

dataset <- dataset %>%
  filter(!is.na(variable_to_check_for_NAs))

Using ! for negation

We use the exclamation mark ! to negate the result, because is.na() returns TRUE for every value that is NA, and we want to keep the rows where it is not.

df = df[df["variable"].notna()]
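As a quick check of what this filter does, the sketch below applies it to a made-up DataFrame and compares it with dropna(subset=...), which is another way to write the same step in pandas. The data and the column named other are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing value in each column
df = pd.DataFrame({
    "variable": [1.0, np.nan, 3.0],
    "other": ["a", "b", None],
})

# Keep only the rows where "variable" is not missing
filtered = df[df["variable"].notna()]

# Equivalent: drop rows that are missing in the listed columns only
dropped = df.dropna(subset=["variable"])

print(filtered)
```

Only the chosen column is checked, so the row with a missing value in other is kept.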

Time series data

If you’ve picked a dataset with time-series data (e.g. a “date” variable), you should transform that variable so that it visualises better:

dataset$variable <- as.Date(dataset$variable)

df["variable"] = pd.to_datetime(df["variable"])
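Once the column has been converted, pandas treats it as dates rather than text, which also gives access to the .dt accessor for pulling out parts such as the year or month. A minimal sketch, assuming a made-up column of ISO-formatted date strings:

```python
import pandas as pd

# Hypothetical data: dates stored as text
df = pd.DataFrame({"variable": ["2024-01-15", "2024-02-20", "2024-03-05"]})

# Convert the text to proper datetime values
df["variable"] = pd.to_datetime(df["variable"])

# The .dt accessor extracts components of each date
df["year"] = df["variable"].dt.year
df["month"] = df["variable"].dt.month

print(df.dtypes)
```

Dates in other formats may need the format= argument of pd.to_datetime() to be parsed correctly.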

Categorical and ordered data

If you’re dealing with categorical data, you can specify this explicitly to keep track of the levels.

dataset$variable <- factor(dataset$variable)

df["variable"] = df["variable"].astype("category")

You can also manually specify the order of the categories.

Specify the order by passing a vector of the levels, in the order you want, created with c():

dataset$variable <- factor(dataset$variable, levels = c("first_val", "second_val", ... ))

Alternatively, if you only need to specify the first (reference) level, use

dataset$variable <- relevel(factor(dataset$variable), ref = "reference_level")

Use the df["variable"].cat.reorder_categories() method with the ordered = True parameter:

df["variable"] = df["variable"].cat.reorder_categories(["cat1", "cat2", ...], ordered = True)

If you’re dealing with categorical data, look at the pandas guide for inspiration and help.
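To illustrate why ordering matters in pandas, the sketch below uses made-up categories (low, medium, high) and shows that sorting and comparisons follow the specified order rather than alphabetical order. The values are hypothetical.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({"variable": ["medium", "low", "high", "low"]})

# Convert to a categorical; by default the categories are in sorted order
df["variable"] = df["variable"].astype("category")

# Give the categories a meaningful order
df["variable"] = df["variable"].cat.reorder_categories(
    ["low", "medium", "high"], ordered=True
)

# Sorting now follows the category order, not the alphabet
sorted_values = df.sort_values("variable")["variable"].tolist()

# Ordered categories also support comparisons
at_least_medium = df[df["variable"] >= "medium"]

print(sorted_values)
```

reorder_categories() expects exactly the existing categories, so every category must appear once in the list.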

Coffee survey

This is particularly useful for the Coffee survey dataset.

Renaming variables

Some datasets have cumbersome names for their variables which we can rename.

dataset <- dataset %>%
  rename(new_name = old_name)

Use df.rename(), sending a dictionary to the columns = parameter:

df = df.rename(columns = {"old_name": "new_name"})

Dictionaries

A dictionary is a Python data structure made up of key-value pairs. The structure is key: value, so above we have a dictionary with one key, "old_name", and its corresponding value, "new_name". They are created as follows:

example_dictionary = {"key1": "value1",
                      "key2": "value2",
                      "key3": "value3",
                      ...}

Note that multiple lines are used purely for readability; you could just as well write this on one line.
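Because the columns = parameter takes a dictionary, several columns can be renamed in one call by adding more key-value pairs. A minimal sketch, with made-up column names:

```python
import pandas as pd

# Hypothetical data with cumbersome column names
df = pd.DataFrame({
    "Country Name (official)": ["A", "B"],
    "Pop. (millions)": [1.5, 2.5],
})

# Each key is an old name, each value is the new name
new_names = {
    "Country Name (official)": "country",
    "Pop. (millions)": "population_millions",
}

df = df.rename(columns=new_names)

print(df.columns.tolist())
```

Columns that are not listed in the dictionary keep their original names.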

World population

This is particularly useful for the World population dataset.

Visualisation

We can make simple visualisations of our data.

Use ggplot2’s ggplot() function, with

data = the dataset
mapping = the variables, provided as an aes(...) object
geom_... the geometries, e.g. geom_line(), geom_point() etc.

library(ggplot2)

ggplot(data = dataset,
       mapping = aes(x = ..., y = ..., colour = ..., ...)) +
  geom_first_layer() + 
  geom_second_layer() + 
  ...

Take a look at the ggplot2 documentation for more information.

Plotly workaround

If you’re having issues using ggplotly (it’s producing a blank plot), you can use this workaround to view it in your browser.

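library(plotly)

# Convert the saved ggplot object to an interactive plot and write it to an HTML file
plot <- ggplotly(saved_ggplot_image)
htmlwidgets::saveWidget(as_widget(plot), "plots/name_of_plot.html")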

Opening that file will show you the image.

Use seaborn’s relplot(), catplot() and displot() functions. For example,

import seaborn as sns

sns.relplot(data = df, x = "variable_x", y = "variable_y", hue = "variable_colour", ...)

We can add additional customisations to our plots, such as axis labels.

Generally, ggplot2 lets you do this with additional elements added to the plot. For example, to add axis labels,

ggplot(data = dataset,
       mapping = aes(x = ..., y = ..., colour = ..., ...)) +
  geom_first_layer() + 
  geom_second_layer() + 
  labs(title = ..., x = ..., y = ...)

The simplest way to do this in Python is to use the matplotlib.pyplot module’s functions. Generally, this has the format plt.<some_customisation>. For example, to add axis labels,

import seaborn as sns
import matplotlib.pyplot as plt

sns.relplot(data = df, x = "variable_x", y = "variable_y", hue = "variable_colour", ...)
plt.xlabel("x axis label")
plt.ylabel("y axis label")
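Putting these pieces together, the following is a minimal runnable sketch on made-up data. The DataFrame values and the title text are hypothetical, and the Agg backend is only selected so that the script also runs without a display; in an interactive session you would usually leave that line out.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so no display is needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    "variable_x": [1, 2, 3, 4],
    "variable_y": [2, 4, 3, 6],
    "variable_colour": ["a", "a", "b", "b"],
})

# relplot draws a scatter plot by default and returns a FacetGrid
g = sns.relplot(data=df, x="variable_x", y="variable_y", hue="variable_colour")

# pyplot functions act on the current axes, which here is the plot just drawn
plt.xlabel("x axis label")
plt.ylabel("y axis label")
plt.title("Plot title")
```

For plots with several facets, plt.xlabel() only changes one panel; the FacetGrid returned by relplot() also has a set_axis_labels() method that labels the whole grid.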
