Tips

Here are a few general tips. In addition, we strongly recommend the pandas cheatsheets, which give a quick reference for common packages and functions, and From Data to Viz, which guides you through choosing a visualisation.

Hotkeys

R:

| Code | Hotkey | Description |
| --- | --- | --- |
| | Ctrl+Enter | Run current line (when in Script) |
| <- | Alt+- | Assignment |
| %>% | Ctrl+Shift+M | Pipe |
| | Esc | Cancel current operation (when in Console) |
| | F1 | Help documentation for selected function |

Python:

| Code | Hotkey | Description |
| --- | --- | --- |
| | F9 (or Fn+F9) | Run current line |
| # %% | Ctrl+2 | New cell (only in Spyder) |
| | Ctrl+Enter | Run current cell (when in Script) |
| | Ctrl+C | Cancel current operation (when in Console) |

Data manipulation

Import packages to make data manipulation easier

library(dplyr)
import pandas as pd

Importing and exporting data

Read your data with an I/O (input/output) function.

dataset <- read.csv("data/dataset.csv")
df = pd.read_csv("data/dataset.csv")

You can also export your data to a csv file.

write.csv(dataset, "data/output_name.csv")
df.to_csv("data/output_name.csv")
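
As a rough sketch of the full round trip in pandas, the example below writes a small made-up table to a temporary file and reads it back. The data and the temporary path are assumptions for illustration only. By default to_csv() also writes the row index as an extra column, so index=False is used here to leave it out.

```python
import os
import tempfile

import pandas as pd

# Made-up data, used only for illustration
df = pd.DataFrame({"country": ["A", "B"], "population": [100, 250]})

# Write to a temporary folder so the sketch does not touch your real data
path = os.path.join(tempfile.mkdtemp(), "output_name.csv")

# index=False stops pandas writing the row index as an extra column
df.to_csv(path, index=False)

# Reading the file back gives the same columns and values
df_again = pd.read_csv(path)
print(df_again)
```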

Initial exploration

Start by exploring the data. Below are a few functions to get you started.

R:

| Function | Example | Description |
| --- | --- | --- |
| names() | names(dataset) | Returns the variable names |
| str() | str(dataset) | Returns the structure of the dataset (variable names, types and first entries) |
| $ | dataset$variable | Returns a specific variable |
| unique() | unique(dataset$variable) | Returns the unique values of a variable |
| summary() | summary(dataset$variable) | Returns a statistical summary of a variable |

Python:

| Example | Description |
| --- | --- |
| df.columns | Returns the variable names |
| df.info() | Returns the structure of the dataset (variable names, counts and types) |
| df["variable"] | Returns a specific column |
| df["variable"].unique() | Returns the unique values of a variable |
| df.describe() or df["variable"].describe() | Returns a statistical summary of the dataset or a variable |
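
To see what the pandas calls above return, here is a small sketch on a made-up DataFrame. The column names species and weight_kg are hypothetical and stand in for your own variables.

```python
import pandas as pd

# Made-up data, used only for illustration
df = pd.DataFrame({
    "species": ["cat", "dog", "cat", "bird"],
    "weight_kg": [4.0, 20.0, 5.0, 0.5],
})

# Variable names as a plain list
print(df.columns.tolist())

# Structure of the dataset: column names, non-null counts and types
df.info()

# Unique values of one variable, in order of first appearance
species_values = df["species"].unique()
print(species_values)

# Statistical summary of one numeric variable (count, mean, std, quartiles)
summary = df["weight_kg"].describe()
print(summary)
```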

Removing NaNs

We can remove missing values (NA in R, NaN in pandas) by filtering:

dataset <- dataset %>%
  filter(!is.na(variable_to_check_for_NAs))
Note: Using ! for negation

is.na() returns TRUE for every value that is NA, so we negate it with the exclamation mark ! to keep only the rows that are not NA.

df = df[df["variable"].notna()]
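
For comparison, pandas also provides dropna(), which can do the same filtering in one call. The sketch below uses made-up data with a hypothetical second column named other, and shows that both approaches keep the same rows when only one column is checked.

```python
import numpy as np
import pandas as pd

# Made-up data with a missing value in each column
df = pd.DataFrame({
    "variable": [1.0, np.nan, 3.0],
    "other": ["a", "b", None],
})

# Boolean filtering: keep rows where "variable" is not missing
filtered = df[df["variable"].notna()]

# Equivalent: drop rows with a missing value, checking only "variable"
dropped = df.dropna(subset=["variable"])

# The last row is kept by both, because only "variable" is checked
print(filtered)
print(dropped)
```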

Time series data

If you’ve picked a dataset with time-series data (e.g. a “date” variable), you should convert that variable to a date type so that it sorts and plots correctly:

dataset$variable <- as.Date(dataset$variable)
df["variable"] = pd.to_datetime(df["variable"])
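
A short sketch of the pandas conversion on made-up strings follows. The date format and the deliberately invalid entry are assumptions for illustration. Passing errors="coerce" turns unparseable entries into missing values (NaT) instead of raising an error, and the .dt accessor then gives access to parts of each date.

```python
import pandas as pd

# Made-up data: two valid dates and one entry that cannot be parsed
df = pd.DataFrame({"variable": ["2024-01-15", "2024-02-01", "not a date"]})

# Convert the strings to datetimes; unparseable entries become NaT
df["variable"] = pd.to_datetime(df["variable"], format="%Y-%m-%d", errors="coerce")

# Once converted, date parts are available through the .dt accessor
print(df["variable"].dt.year)
print(df["variable"].dt.month_name())
```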

Categorical and ordered data

If you’re dealing with categorical data, you can specify this explicitly to keep track of the levels.

dataset$variable <- factor(dataset$variable)
df["variable"] = df["variable"].astype("category")

You can also specify the order of the categories manually.

Specify the order by passing an ordered vector of the levels, created with c():

dataset$variable <- factor(dataset$variable, levels = c("first_val", "second_val", ... ))

Alternatively, if you only need to specify the first (reference) level, use

dataset$variable <- relevel(factor(dataset$variable), ref = "reference_level")

Use the df["variable"].cat.reorder_categories() method with the ordered = True parameter. The column must already be categorical, and the list must contain every existing category:

df["variable"] = df["variable"].cat.reorder_categories(["cat1", "cat2", ...], ordered = True)

If you’re dealing with categorical data, look at the pandas guide for inspiration and help.
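
To illustrate why the order matters, the sketch below uses made-up levels ("low", "medium", "high"). Once the categories are ordered, sorting and comparisons follow that order rather than the alphabetical one.

```python
import pandas as pd

# Made-up data, used only for illustration
df = pd.DataFrame({"variable": ["medium", "low", "high", "low"]})

# Convert to categorical first; the default category order is alphabetical
df["variable"] = df["variable"].astype("category")

# Reorder the categories and mark them as ordered
df["variable"] = df["variable"].cat.reorder_categories(
    ["low", "medium", "high"], ordered=True
)

# Sorting now follows the category order instead of the alphabet
sorted_values = df.sort_values("variable")["variable"].tolist()
print(sorted_values)  # ['low', 'low', 'medium', 'high']

# Ordered categories also support comparisons against a level
is_above_low = df["variable"] > "low"
print(is_above_low.tolist())  # [True, False, True, False]
```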

Tip: Coffee survey

This is particularly useful for the Coffee survey dataset.

Renaming variables

Some datasets have cumbersome names for their variables which we can rename.

dataset <- dataset %>%
  rename(new_name = old_name)

Use df.rename(), sending a dictionary to the columns = parameter:

df = df.rename(columns = {"old_name": "new_name"})
Note: Dictionaries

A dictionary is a Python data structure that stores key-value pairs. The structure is key: value, so above we have a dictionary with one key, "old_name", and its corresponding value, "new_name". They are created as follows:

example_dictionary = {"key1": "value1",
                      "key2": "value2",
                      "key3": "value3",
                      ...}

Note that multiple lines are used purely for readability; you could just as well write this on one line.
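
Because the dictionary can hold several pairs, one call to df.rename() can rename several columns at once. The sketch below uses made-up column names. Columns that are not listed in the dictionary are left unchanged.

```python
import pandas as pd

# Made-up data with cumbersome column names
df = pd.DataFrame({
    "Country Name": ["A"],
    "Population (thousands)": [100],
    "year": [2020],
})

# One key-value pair per column to rename: old name -> new name
new_names = {
    "Country Name": "country",
    "Population (thousands)": "population",
}

df = df.rename(columns=new_names)
print(df.columns.tolist())  # ['country', 'population', 'year']
```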

Tip: World population

This is particularly useful for the World population dataset.

Visualisation

We can make simple visualisations of our data.

Use ggplot2’s ggplot() function, with

  • data = the dataset
  • mapping = the variables, provided as an aes(...) object
  • geom_...() = the geometries, e.g. geom_line(), geom_point() etc.
library(ggplot2)

ggplot(data = dataset,
       mapping = aes(x = ..., y = ..., colour = ..., ...)) +
  geom_first_layer() + 
  geom_second_layer() + 
  ...

Take a look at the ggplot2 documentation for more information.

Plotly workaround

If you’re having issues using ggplotly (for example, it produces a blank plot), you can use this workaround: save the plot as an HTML file and view it in your browser.

library(plotly)
plot <- ggplotly(saved_ggplot_image)
htmlwidgets::saveWidget(as_widget(plot), "plots/name_of_plot.html")

Opening that file will show you the interactive plot.

Use seaborn’s relplot(), catplot() and displot() functions. For example,

import seaborn as sns

sns.relplot(data = df, x = "variable_x", y = "variable_y", hue = "variable_colour", ...)

We can add additional customisations to our plots, such as axis labels.

Generally, ggplot2 lets you do this with additional elements added to the plot. For example, to add axis labels,

ggplot(data = dataset,
       mapping = aes(x = ..., y = ..., colour = ..., ...)) +
  geom_first_layer() + 
  geom_second_layer() + 
  labs(title = ..., x = ..., y = ...)

The simplest way to do this in Python is to use the matplotlib.pyplot module’s functions. Generally, this has the format plt.<some_customisation>. For example, to add axis labels,

import seaborn as sns
import matplotlib.pyplot as plt

sns.relplot(data = df, x = "variable_x", y = "variable_y", hue = "variable_colour", ...)
plt.xlabel("x axis label")
plt.ylabel("y axis label")
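
As an alternative sketch, the labels can also be set on the object that seaborn returns. relplot() returns a FacetGrid, which has its own set_axis_labels() method, and its ax attribute gives the underlying matplotlib axes when the plot has a single panel. The data and label text below are made up for illustration.

```python
import pandas as pd
import seaborn as sns

# Made-up data, used only for illustration
df = pd.DataFrame({
    "variable_x": [1, 2, 3, 4],
    "variable_y": [2, 4, 3, 6],
    "variable_colour": ["a", "a", "b", "b"],
})

# relplot returns a FacetGrid object, stored here as g
g = sns.relplot(data=df, x="variable_x", y="variable_y", hue="variable_colour")

# Set both axis labels on the grid
g.set_axis_labels("x axis label", "y axis label")

# For a single-panel plot, g.ax is the matplotlib axes
g.ax.set_title("Plot title")
```

Keeping a reference to g can be handy when you make several plots in one script, since plt functions act on whichever plot is currently active.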