R
26Summer
data: Players2024.csv
Author

Hehe

Published

April 1, 2028


Tip

Plots will be images or interactive plots.
Note: images can be clicked on to zoom in on the plot itself
Have fun! :)

Data Exploration on Soccer Players

Load the relevant libraries

library(dplyr)
library(ggplot2)
library(plotly)
library(tidyr)
library(knitr)

Let’s look at the dataset first

players_2024 <- read.csv("../../../../data/Players2024.csv") 

……..Loading data……..


What type of data exists in our dataset?

Headers
name
birth_date
height_cm
positions
nationality
age
club


\(\boldsymbol{Peak\text{ } at \text{ } data}\)

name birth_date height_cm positions nationality age club
James Milner 1986-01-04 175 Midfield England 38 Brighton and Hove Albion Football Club
Anastasios Tsokanis 1991-05-02 176 Midfield Greece 33 Volou Neos Podosferikos Syllogos
Jonas Hofmann 1992-07-14 176 Midfield Germany 32 Bayer 04 Leverkusen Fußball
Pepe Reina 1982-08-31 188 Goalkeeper Spain 42 Calcio Como
Lionel Carole 1991-04-12 180 Defender France 33 Kayserispor Kulübü
Ludovic Butelle 1983-04-03 188 Goalkeeper France 41 Stade de Reims


\(\boldsymbol{Summary\text{ } of \text{ } data}\)

name birth_date height_cm positions nationality age club
Length:5935 Length:5935 Min. : 17 Length:5935 Length:5935 Min. :15.0 Length:5935
Class :character Class :character 1st Qu.:178 Class :character Class :character 1st Qu.:22.0 Class :character
Mode :character Mode :character Median :183 Mode :character Mode :character Median :25.0 Mode :character
Mean :183 Mean :25.5
3rd Qu.:188 3rd Qu.:29.0
Max. :206 Max. :42.0

Issues with dataset

Caution

Visualising height_cm against positions shows issues with dataset

Show code
# Standard box plots (error with missing)
ggplot(players_2024,
       aes(x = positions,
           y = height_cm)) +
  geom_boxplot()


Fix the dataset

Here we will filter out “bad” values

players <-  players_2024 %>% 
  filter(positions != "Missing", height_cm > 100)
Note

NOTE: this is a specific filter command for our case,
it is common to check for na, blank etc etc.
Always check your dataset or adopt good practices when making dataset(csv).


Distribution of height (and age) among different positions

Basic Box plot

We first look at distribution of the height of players across different positions.

Show code
# Boxplot with fixed dataset
p1 <- ggplot(players,
         aes(x = positions,
             y = height_cm)) +
    geom_boxplot()
ggplotly(p1)
Tip

First Interactive Plot!!!
Interact with the plot to get IQR information about plots


Violin Plot

We fancify the basic box plot by surrounding the box plot with a violin plot

Show code
# plot simple violin plot
ggplot(players, 
         aes(x = positions,
             y = height_cm)) +
    geom_violin(aes(fill=positions))+
    geom_boxplot(width=0.1)

Adding age as another variable

What if we just added the age in as another box plot (layered box plot). While this will tell us the distribution as well, it is rather unappealing and the box plots are squeezed as a result of the unnecessarily long scales.

Show code
# Include all 
ggplot(players,
       aes(x = positions,
           y = height_cm)) +
  geom_boxplot() +
  geom_boxplot(
    mapping = aes(y = age), data=players
  ) +
  scale_y_continuous(
    "height in centimetres", 
    sec.axis = sec_axis(~ . * 1.20, name = "age")
  )

Overlapping layers

What if we added age points surrounding the box-plot to get the distribution of ages. It is rather messy because there is far too much (age) data for the same heights

Show code
#Can i inform the audience about age? 
ggplot(players, 
       aes(x = positions,
           y = height_cm)) +
  geom_jitter(aes(colour=age))+
  geom_boxplot(width=0.1)

NOTE: I did attempt to add a distribution around the violin plot
with color gradient but it seems to more difficult that i thought
because the violin plot does not show data points but a distribution.
There are some stack overflow discussion on gradient colors in violin plots
but they are more complicated than a few hours debugging.


(Somewhat) Informative layer plot

Instead of plotting age as it is, let’s get the average age per height.
Lets check if it is sensible to do so.

#Collect for each position, the heights and find the average age per height 
#for each positions
height_age <- players %>% 
  group_by(height_cm,positions) %>%  
  arrange(desc(age)) %>% 
  summarise(mean_age=mean(age),
            n = n())

……..Collecting data……..


\(\boldsymbol{Peak\text{ } at \text{ } data}\)

height_cm positions mean_age n
160 Midfield 25.00000 1
161 Attack 20.00000 1
163 Attack 30.00000 1
163 Defender 23.50000 2
164 Attack 21.33333 3
164 Midfield 28.50000 2


\(\boldsymbol{Summary \text{ } of \text{ } filtered \text{ } data}\)

height_cm positions mean_age n
Min. :160.0 Length:141 Min. :20.00 Min. : 1.00
1st Qu.:175.0 Class :character 1st Qu.:24.66 1st Qu.: 6.00
Median :185.0 Mode :character Median :25.36 Median : 34.00
Mean :184.1 Mean :25.61 Mean : 42.07
3rd Qu.:193.0 3rd Qu.:26.22 3rd Qu.: 65.00
Max. :206.0 Max. :34.00 Max. :166.00


\(\boldsymbol{Interactive \text{ } Plot \text{ } of \text{ } mean\_age \text{ } against \text{ } height \text{ } with \text{ } count \text{ } of \text{ } players \text{ } given}\)

  • Red: more than 100 players
  • Orange: more than 10 but less than 99
  • Grey: less than 10
  • Number labels: Number of players of the same height
Show code
#feel of data
p2 <- ggplot(height_age,
          aes(x = height_cm,
              y = mean_age)) +
  geom_point()+
  geom_point(data = subset(height_age, n >= 100), color = "red")+
  geom_point(data = subset(height_age,  n>= 10 & n <= 99), color = "orange")+
  geom_point(data = subset(height_age, n>=0 & n <=9), color = "grey")+
  geom_text(aes(label = n), vjust = 1.2, size = 3)
ggplotly(p2)


We bring the violin plot (with the box plot within) back and
now add the mean age as coloured points

Show code
# Violin plot with box plot 
ggplot(players, 
       aes(x = positions,
           y = height_cm)) +
  geom_violin()+
  geom_boxplot(width=0.2)+
  geom_point(
    data= height_age,
    aes(x = positions, y = height_cm, colour =mean_age),
    size = 4
  ) + 
  scale_colour_viridis_c(option = "plasma")

Area for improvement: Ideally, we can achieve a violin plot with colour gradient
of the mean age instead of coloured points.🙃

Heat map

Let’s drop the whole box or violin plots. What about a heat map?

Simple heat maps

We could just plot a simple heatmap of height against age where positions
are shown in colours. However, positions are overlapping themselves.

Show code
# Instead of a violin plot and keeping positions as "1 axis", 
# what about a heatmap
#Ugly heatmap
ggplot(players,
       aes(x = age,
           y = height_cm)) + 
  geom_tile(aes(fill=positions),alpha=0.5)

Instead, we separate out the heatmaps into the individual positions. While this is more informative, the colours struggle to exactly show the distribution.

Show code
#Separated but ugly heatmap (no need show)
ggplot(players, 
          aes(x = age, 
              y = height_cm)) + 
  geom_tile(aes(fill=positions),alpha=0.7) + 
  facet_wrap(vars(positions))

More informative heatmap

Since positions is already relayed by the separation of the heatmaps, we can
instead keep the same colour scheme for the distribution,
We then get this more informative map of distribution of height against age

Show code
# Decent heat map that shows concentration of points
ggplot(players, aes(x = age, y = height_cm)) +
  geom_bin2d(aes(fill = after_stat(count)), binwidth = c(1, 2)) +  # 1-year x 2-cm bins
  facet_wrap(vars(positions))+
  scale_fill_viridis_c(trans = "sqrt", name = "Count")

Visually appealing heatmap

Hate the “0” data around the original data?
Fill up the 0s around the original data and get yourselves a more appealing
heatmap 🤗

Distribution of names with special characters across countries

We played around with positions, age, height.
What about other data in the dataset?
Could we do something with name and nationality?

World map

Some names of our players have special characters outside of the 26 characters (A-Z)
What if we looked at the distribution of players that have special characters
in their names?

players_non_english <- players %>%
    filter(grepl("[^A-Za-z '\\-]", name))

……..Extracting data……..


\(\boldsymbol{Peak\text{ } at \text{ } data}\)

name birth_date height_cm positions nationality age club
Raúl Albiol 1985-09-04 190 Defender Spain 39 Villarreal Club de Fútbol S.A.D.
Jesús Navas 1985-11-21 170 Defender Spain 38 Sevilla Fútbol Club S.A.D.
Rúben Fernandes 1986-05-06 187 Defender Portugal 38 Gil Vicente Futebol Clube
Iván Cuéllar 1984-05-27 187 Goalkeeper Spain 40 Real Club Deportivo Mallorca S.A.D.
João Moutinho 1986-09-08 170 Midfield Portugal 38 Sporting Clube de Braga


……..Collecting data as needed……..

#Taken from exemplar 
players_cty <- players_non_english %>% 
  mutate(nationality = case_match(nationality,
                                  "England" ~ "United Kingdom",
                                  "Türkiye" ~ "Turkey",
                                  "Cote d'Ivoire" ~ "Ivory Coast",
                                  "Northern Ireland" ~ "United Kingdom",
                                  "Wales" ~ "United Kingdom",
                                  .default = nationality))

# Make the country count
countries <- players_cty %>%
  group_by(nationality) %>%
  summarise(n = n()) %>% 
  arrange(desc(n))

This part of the work (plotting world map) was adapted from exemplar for our data


\(\boldsymbol{Peak\text{ } at \text{ } data}\)

nationality n
Spain 219
Turkey 212
Portugal 90
France 89
Argentina 56
Germany 46
Denmark 29
Ivory Coast 20
Colombia 19
Brazil 18
Serbia 17
Belgium 16
Uruguay 16
Iceland 14
Mali 13
Norway 13
Hungary 12
Netherlands 12
Italy 11
Venezuela 11


\(\boldsymbol{World\text{ } map \text{ } distribution}\)

Show code
# World map plots
countries %>% 
  plot_ly(type = "choropleth", 
          locations = countries$nationality, 
          locationmode = "country names", 
          z = countries$n) %>%
  colorbar(title = "# of Players")