Book Report

R
25Winter
data: books.csv
Author

Liz

Published

July 1, 2025

Data set

This report explores our books data set, which contains 11,125 books and 13 variables, including each book's authors, publisher, page count, and ratings.

Exploring the data set

First, we told R to load dplyr:

library(dplyr)

Then we told R to read our data set:

read.csv("../../../../data/books.csv")

Then we asked R to store the data in an object called Books:

Books <- read.csv("../../../../data/books.csv")

Next we asked a series of questions about the data:

  1. What is the range of page numbers?
range(Books$num_pages)
[1]    0 6576
  2. What is the range of rating counts?
range(Books$ratings_count)
[1]       0 4597666
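
Both ranges start at 0, which may point to missing or placeholder values rather than real page or rating counts. A minimal sketch of how one might check this, using a small made-up data frame `toy` in place of Books (the values are illustrative only, not taken from books.csv):

```r
# Toy stand-in for Books: made-up values, NOT the real data
toy <- data.frame(
  num_pages     = c(0, 152, 352, 652, 0),
  ratings_count = c(19, 0, 6333, 2095690, 41428)
)

# Count the rows that hold a zero in each column
zero_pages   <- sum(toy$num_pages == 0)
zero_ratings <- sum(toy$ratings_count == 0)

# Range of page counts once the zero-page rows are set aside
range_nonzero <- range(toy$num_pages[toy$num_pages > 0])

zero_pages
zero_ratings
range_nonzero
```

Replacing `toy` with Books would show how many books drive the lower end of each range.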
  3. Which publishers appear in the data set?
library(ggplot2)
# Inspect the structure of the data before plotting
str(Books)
'data.frame':   11125 obs. of  13 variables:
 $ X                 : int  1 2 3 4 5 6 7 8 9 10 ...
 $ bookID            : int  1 2 4 5 8 9 10 12 13 14 ...
 $ title             : chr  "Harry Potter and the Half-Blood Prince (Harry Potter  #6)" "Harry Potter and the Order of the Phoenix (Harry Potter  #5)" "Harry Potter and the Chamber of Secrets (Harry Potter  #2)" "Harry Potter and the Prisoner of Azkaban (Harry Potter  #3)" ...
 $ authors           : chr  "J.K. Rowling/Mary GrandPré" "J.K. Rowling/Mary GrandPré" "J.K. Rowling" "J.K. Rowling/Mary GrandPré" ...
 $ average_rating    : num  4.57 4.49 4.42 4.56 4.78 3.74 4.73 4.38 4.38 4.22 ...
 $ isbn              : chr  "0439785960" "0439358078" "0439554896" "043965548X" ...
 $ isbn13            : num  9.78e+12 9.78e+12 9.78e+12 9.78e+12 9.78e+12 ...
 $ language_code     : chr  "eng" "eng" "eng" "eng" ...
 $ num_pages         : int  652 870 352 435 2690 152 3342 815 815 215 ...
 $ ratings_count     : int  2095690 2153167 6333 2339585 41428 19 28242 3628 249558 4930 ...
 $ text_reviews_count: int  27591 29221 244 36325 164 1 808 254 4080 460 ...
 $ publication_date  : chr  "2006-09-16" "2004-09-01" "2003-11-01" "2004-05-01" ...
 $ publisher         : chr  "Scholastic Inc." "Scholastic Inc." "Scholastic" "Scholastic Inc." ...
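
The publisher question can also be answered with a plain tabulation before any plotting. The sketch below uses base R on a made-up data frame `toy` (an assumption for illustration; only "Scholastic Inc." and "Scholastic" appear in the output above, "Penguin Books" is a placeholder):

```r
# Toy stand-in for Books: illustrative publisher names only
toy <- data.frame(
  publisher = c("Scholastic Inc.", "Scholastic Inc.", "Scholastic",
                "Penguin Books", "Scholastic Inc."),
  stringsAsFactors = FALSE
)

# Number of distinct publisher names
n_publishers <- length(unique(toy$publisher))

# Books per publisher, most frequent first
publisher_counts <- sort(table(toy$publisher), decreasing = TRUE)

# Keep only the most frequent publishers for a readable summary
top_publishers <- head(publisher_counts, 2)

n_publishers
top_publishers

# With dplyr loaded, count(Books, publisher, sort = TRUE) gives a
# comparable table as a data frame.
```

Note that "Scholastic Inc." and "Scholastic" are counted as different publishers here, because the comparison is on the exact name.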

First, using a scatterplot:

ggplot(data = Books,
       mapping = aes(x = X,
                     y = publisher))+
  geom_point()

Second, using a bar graph:

ggplot(data=Books,
       mapping=aes(x=publisher))+
  geom_bar()

Third, we were only interested in publishers of books with fewer than 5 pages:

Books %>%
  filter(num_pages < 5) %>%
  ggplot(mapping = aes(y = publisher)) +
  geom_bar()

Fourth, we were interested in books from any publisher whose name contains "Scholastic":

Books %>%
  filter(grepl("Scholastic", publisher)) %>%
  ggplot(mapping = aes(y = publisher)) +
  geom_bar()
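
Because grepl() matches any name that contains the pattern, it helps to see how partial, case-insensitive, and exact matching differ. A small sketch on a made-up vector of names ("Scholastic Paperbacks", "scholastic press", and "Penguin Books" are hypothetical examples, not confirmed values from the data):

```r
# Illustrative publisher names; only the first two appear in the str() output
publishers <- c("Scholastic Inc.", "Scholastic", "Scholastic Paperbacks",
                "scholastic press", "Penguin Books")

# Partial, case-sensitive match (what the filter above does)
partial <- grepl("Scholastic", publishers)

# Partial match that ignores upper/lower case
any_case <- grepl("scholastic", publishers, ignore.case = TRUE)

# Exact match on the full name
exact <- publishers == "Scholastic"

# Which distinct names does the partial match pick up?
unique(publishers[partial])
```

Listing the matched names this way is a quick check that the filter keeps the imprints one intends and nothing else.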

  4. Are rating counts associated with number of pages?
cor.test(Books$num_pages,Books$ratings_count)

    Pearson's product-moment correlation

data:  Books$num_pages and Books$ratings_count
t = 3.6288, df = 11123, p-value = 0.000286
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.01581461 0.05293589
sample estimates:
       cor 
0.03438711 
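
The p-value is small, but with more than 11,000 books even a very weak correlation can be statistically significant. As a rough worked example, the values printed above can be turned into a share of variance explained, and the t statistic can be reproduced from r and the degrees of freedom:

```r
# Values copied from the cor.test() output above
r  <- 0.03438711
df <- 11123

# Share of variance in one variable "explained" by the other
r_squared <- r^2

# t statistic for a Pearson correlation: r * sqrt(df / (1 - r^2))
t_stat <- r * sqrt(df / (1 - r^2))

round(r_squared, 5)  # about 0.00118, i.e. roughly 0.1% of the variance
round(t_stat, 4)     # about 3.6288, matching the output above

# Rating counts are heavily skewed (0 to 4,597,666), so a rank-based check
# such as cor.test(x, y, method = "spearman") may be worth running as well.
```

So the association is detectable but very weak: page count accounts for only about 0.1% of the variation in rating counts.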
  5. How do rating counts compare for books in English versus books not in English?

First, we must create a variable that separates English from non-English books:

Books <- Books |>
  mutate(english_books=language_code=="eng")
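
The comparison `language_code == "eng"` only marks books whose code is exactly "eng". If the data also uses regional codes such as "en-US" or "en-GB" (an assumption; the str() output above shows only "eng"), those books would fall into the non-English group. A sketch of how one might check, on a made-up vector of codes:

```r
# Toy stand-in for Books$language_code: the codes here are assumed examples
toy <- data.frame(
  language_code = c("eng", "en-US", "en-GB", "fre", "spa", "eng"),
  stringsAsFactors = FALSE
)

# See which codes occur and how often
code_counts <- table(toy$language_code)

# Strict definition, as used above
strict <- toy$language_code == "eng"

# Broader definition: any code that starts with "en"
broad <- startsWith(toy$language_code, "en")

code_counts
sum(strict)
sum(broad)
```

Running `table(Books$language_code)` on the real data would show whether the broader definition changes the grouping at all.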

Now we can run a t-test to compare rating counts:

t.test(ratings_count~english_books,data=Books)

    Welch Two Sample t-test

data:  ratings_count by english_books
t = -13.323, df = 9894.7, p-value < 2.2e-16
alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
95 percent confidence interval:
 -20888.91 -15530.46
sample estimates:
mean in group FALSE  mean in group TRUE 
           3354.563           21564.247 

Now let's visualise it:

library(ggplot2)
ggplot(Books,
       aes(x=english_books,y=ratings_count,
           fill=english_books))+
  geom_col()
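
One caveat about this plot: geom_col() stacks every row that shares an x value, so each bar shows the total rating count per group, not the mean that the t-test compares. A possible alternative is to compute the group means first and plot those. The sketch below uses a made-up data frame `toy` (illustrative values only) and draws the plot only if ggplot2 is installed:

```r
# Toy stand-in for Books: made-up values, NOT the real data
toy <- data.frame(
  english_books = c(TRUE, TRUE, TRUE, FALSE),
  ratings_count = c(100, 200, 300, 50)
)

# What stacked geom_col() bars show: the sum per group
group_sums <- tapply(toy$ratings_count, toy$english_books, sum)

# What the t-test compares: the mean per group
group_means <- tapply(toy$ratings_count, toy$english_books, mean)

# One row per group, ready for plotting
means_df <- data.frame(
  english_books = names(group_means),
  mean_ratings  = as.numeric(group_means)
)

# Bar heights now equal the group means
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(means_df,
              aes(x = english_books, y = mean_ratings,
                  fill = english_books)) +
    geom_col()
  # print(p) draws the plot
}
```

In the toy data the English total is 12 times the non-English total, while the English mean is only 4 times larger, which is why the two plots can tell quite different stories.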