Tips
Here are a few general tips. In addition, we strongly recommend the pandas cheatsheets, which give a quick and easy reference for common packages and functions, and From Data to Viz, which guides you through choosing a visualisation.
Hotkeys
R:
| Code | Hotkey | Description |
|---|---|---|
| | Ctrl+Enter | Run current line (when in Script) |
| <- | Alt+- | Assignment |
| %>% | Ctrl+Shift+M | Pipe |
| | Esc | Cancel current operation (when in Console) |
| | F1 | Help documentation for selected function |
Python:
| Code | Hotkey | Description |
|---|---|---|
| | F9 (or Fn + F9) | Run current line |
| # %% | Ctrl + 2 | New cell (only in Spyder) |
| | Ctrl+Enter | Run current cell (when in Script) |
| | Ctrl+C | Cancel current operation (when in Console) |
Data manipulation
Import packages to make data manipulation easier
R:
library(dplyr)
Python:
import pandas as pd
Importing and exporting data
Read your data with an I/O (input/output) function.
R:
dataset <- read.csv("data/dataset.csv")
Python:
df = pd.read_csv("data/dataset.csv")
You can also export your data to a CSV file.
R:
write.csv(dataset, "data/output_name.csv")
Python:
df.to_csv("data/output_name.csv")
Initial exploration
Start by exploring the data. Below are a few functions to get you started.
R:
| Function | Example | Description |
|---|---|---|
| names() | names(dataset) | Returns the variable names |
| str() | str(dataset) | Returns the structure of the dataset (variable names, types and first entries) |
| $ | dataset$variable | Returns a specific variable |
| unique() | unique(dataset$variable) | Returns the unique values of a variable |
| summary() | summary(dataset$variable) | Returns a statistical summary of a variable |
Python:
| Function | Description |
|---|---|
| df.columns | Returns the variable names |
| df.info() | Returns the structure of the dataset (variable names, counts and types) |
| df["variable"] | Returns a specific column |
| df["variable"].unique() | Returns the unique values of a variable |
| df.describe() or df["variable"].describe() | Returns a statistical summary of the dataset or a variable |
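To see how the pandas functions in the table behave, here is a minimal sketch on a tiny made-up DataFrame. The column names species and weight_kg and all values are hypothetical, chosen only for illustration; with real data you would load df with pd.read_csv() instead.

```python
import pandas as pd

# Hypothetical data, made up purely for illustration
df = pd.DataFrame({
    "species": ["cat", "dog", "cat", "bird"],
    "weight_kg": [4.2, 12.5, 3.9, 0.3],
})

column_names = list(df.columns)          # the variable names
df.info()                                # prints names, non-null counts and types
unique_species = df["species"].unique()  # unique values, in order of first appearance
summary = df["weight_kg"].describe()     # count, mean, std, min, quartiles, max

print(column_names)
print(unique_species)
print(summary)
```

Note that df.info() prints its report directly, whereas the other calls return objects that you can store or print.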
Removing NaNs
We can remove missing values (NA in R, NaN in pandas) by filtering.
R:
dataset <- dataset %>%
  filter(!is.na(variable_to_check_for_NAs))
! for negation
We use the exclamation mark ! to negate the result: is.na() returns TRUE for every value that is NA, so !is.na() keeps only the rows that are not NA.
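The same filtering idea can be sketched in pandas. The example below uses a small made-up DataFrame (the column names variable and other are placeholders) and compares a boolean mask built with notna() against dropna(), which is one alternative way to get the same rows.

```python
import pandas as pd

# Hypothetical data with one missing value in each column
df = pd.DataFrame({
    "variable": [1.0, float("nan"), 3.0],
    "other": ["a", "b", None],
})

# Keep only the rows where "variable" is not missing (boolean mask)
filtered = df[df["variable"].notna()]

# Equivalent result using dropna() restricted to one column
dropped = df.dropna(subset=["variable"])

# Without subset=, dropna() removes rows with a missing value in ANY column
all_dropped = df.dropna()

print(len(df), len(filtered), len(all_dropped))
```

Restricting the check to the column you actually need avoids throwing away rows that are only missing values elsewhere.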
Python:
df = df[df["variable"].notna()]
Time series data
If you’ve picked a dataset with time-series data (e.g. a “date” variable), you should convert that variable to a date type so that it sorts and plots correctly:
R:
dataset$variable <- as.Date(dataset$variable)
Python:
df["variable"] = pd.to_datetime(df["variable"])
Categorical and ordered data
If you’re dealing with categorical data, you can specify this explicitly to keep track of the levels.
R:
dataset$variable <- factor(dataset$variable)
Python:
df["variable"] = df["variable"].astype("category")
To specify the order of categories manually:
R: pass the levels in the desired order as a vector created with c():
dataset$variable <- factor(dataset$variable, levels = c("first_val", "second_val", ... ))
Alternatively, if you only need to specify the first (reference) level, use relevel():
dataset$variable <- relevel(factor(dataset$variable), ref = "reference_level")
Python: use the df["variable"].cat.reorder_categories() method with the ordered = True parameter (the column must already have the category type):
df["variable"] = df["variable"].cat.reorder_categories(["cat1", "cat2", ...], ordered = True)
If you’re dealing with categorical data, look at the pandas guide for inspiration and help.
This is particularly useful for the Coffee survey dataset.
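As a rough sketch of why ordering matters, the example below converts a made-up column to a category and then reorders it. The values "low", "medium" and "high" are hypothetical and are not taken from any of the datasets.

```python
import pandas as pd

# Hypothetical survey-style answers
df = pd.DataFrame({"variable": ["medium", "low", "high", "low"]})

# Convert to a categorical column; by default the categories are sorted
# alphabetically (high, low, medium) and are unordered
df["variable"] = df["variable"].astype("category")

# Give the categories a meaningful order
df["variable"] = df["variable"].cat.reorder_categories(
    ["low", "medium", "high"], ordered=True
)

# Sorting and comparisons now follow the category order, not the alphabet
sorted_values = df["variable"].sort_values().tolist()
above_low = df[df["variable"] > "low"]

print(sorted_values)
print(above_low)
```

Plotting functions generally follow the same category order, so axes and legends appear in a sensible sequence.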
Renaming variables
Some datasets have cumbersome variable names, which we can rename.
R:
dataset <- dataset %>%
  rename(new_name = old_name)
Python: use df.rename(), passing a dictionary to the columns = parameter:
df = df.rename(columns = {"old_name": "new_name"})
A dictionary is a Python data structure that stores key-value pairs. The structure is key: value, so above we have a dictionary with one key, "old_name", and its corresponding value, "new_name". Dictionaries are created as follows:
example_dictionary = {"key1": "value1",
                      "key2": "value2",
                      "key3": "value3",
                      ...}
Note that multiple lines are used purely for readability; you could just as well write this on one line.
This is particularly useful for the World population dataset.
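Because the dictionary can hold several key-value pairs, a single call to df.rename() can rename several columns at once. Here is a minimal sketch; the old and new column names are hypothetical and are not taken from the actual dataset.

```python
import pandas as pd

# Hypothetical data with awkward column names
df = pd.DataFrame({
    "Country/Territory": ["A", "B"],
    "2022 Population": [100, 200],
})

# One key-value pair per column: old name -> new name
new_names = {
    "Country/Territory": "country",
    "2022 Population": "population_2022",
}

# rename() returns a new DataFrame, so assign the result back to df
df = df.rename(columns=new_names)

print(df.columns.tolist())
```

Columns that are not listed in the dictionary keep their original names.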
Visualisation
We can make simple visualisations of our data.
R: use ggplot2’s ggplot() function, with
- data = the dataset
- mapping = the variables, provided as an aes(...) object
- geom_... the geometries, e.g. geom_line(), geom_point() etc.
library(ggplot2)
ggplot(data = dataset,
       mapping = aes(x = ..., y = ..., colour = ..., ...)) +
  geom_first_layer() +
  geom_second_layer() +
  ...
Take a look at the ggplot2 documentation for more information.
Plotly workaround
If ggplotly() produces a blank plot, you can use this workaround to view the plot in your browser.
library(plotly)
plot <- ggplotly(saved_ggplot_image)
htmlwidgets::saveWidget(as_widget(plot), "plots/name_of_plot.html")
Opening that file in your browser will show you the interactive plot.
Python: use seaborn’s relplot(), catplot() and displot() functions. For example,
import seaborn as sns
sns.relplot(data = df, x = "variable_x", y = "variable_y", hue = "variable_colour", ...)
We can add additional customisations to our plots, such as axis labels.
R: ggplot2 generally lets you do this by adding further elements to the plot. For example, to add axis labels,
ggplot(data = dataset,
       mapping = aes(x = ..., y = ..., colour = ..., ...)) +
  geom_first_layer() +
  geom_second_layer() +
  labs(title = ..., x = ..., y = ...)
Python: the simplest way to do this is to use the matplotlib.pyplot module’s functions. Generally, these have the format plt.<some_customisation>(). For example, to add axis labels,
import seaborn as sns
import matplotlib.pyplot as plt
sns.relplot(data = df, x = "variable_x", y = "variable_y", hue = "variable_colour", ...)
plt.xlabel("x axis label")
plt.ylabel("y axis label")