Good Reads: Recommender System

Overview

We cannot escape! If you use online services or buy anything from an e-commerce company, recommendations are part of your online routine. Online services suggest products and services that you might like. It is everywhere. Netflix recommends shows and movies based on what you have watched and on recent demand. Amazon displays products that you might be interested in based on previous purchases, clicks, and other user behavior. YouTube recommends videos and channels based on what you and others have watched. We could keep going and the idea would be the same: suggest something that you will probably enjoy.

Books are no different! The website goodreads.com could be described as a social network for people interested in books. The idea is that you can interact with other readers, authors, and of course books. Among its many features, this readers’ social network provides a database of member reviews to give you more information about the books you are looking for. It works exactly like any other review system: you rate a book you have read (between 1 and 5 stars) and write down your opinion about it, explaining what you liked or didn’t. I know, nothing new here. Goodreads also has its own recommender system for books. It looks at the books you have read and the ratings you gave in order to suggest new books.

In this notebook, I apply a very well-known technique called Item-Based Collaborative Filtering (IBCF) to estimate the rating of a book based on other, similar books. The idea is simple: if I like a book (i.e., I rated it 5 stars), it is likely that I will also enjoy a very similar book.
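To make the idea concrete, here is a minimal sketch of IBCF on a made-up rating matrix. The data, the choice of cosine similarity, and the helper names `cosine_sim` and `predict_rating` are assumptions for illustration only; they are not necessarily what is used later in this notebook.

```r
# Toy user x book rating matrix (rows = users, columns = books); NA = not rated.
# All values are made up for illustration.
r <- matrix(
  c(5,  4, NA,
    4,  5,  1,
    1,  2,  5,
    5, NA,  2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("user", 1:4), c("bookA", "bookB", "bookC"))
)

# Cosine similarity between two books, using only users who rated both.
cosine_sim <- function(x, y) {
  both <- !is.na(x) & !is.na(y)
  sum(x[both] * y[both]) / (sqrt(sum(x[both]^2)) * sqrt(sum(y[both]^2)))
}

# Predict a user's rating for a target book as the similarity-weighted
# average of the ratings that user gave to the other books.
predict_rating <- function(r, user, target) {
  rated <- setdiff(colnames(r)[!is.na(r[user, ])], target)
  sims  <- sapply(rated, function(b) cosine_sim(r[, target], r[, b]))
  sum(sims * r[user, rated]) / sum(abs(sims))
}

# user1 has not rated bookC; estimate it from bookA and bookB.
pred <- predict_rating(r, "user1", "bookC")
print(pred)
```

Cosine similarity on raw ratings is always positive, so this sketch does not capture "opposite taste" very well; centering the ratings first is a common variation.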

This technique is not new, and there are other methodologies to estimate ratings. Here we are going to focus on IBCF and see how it works when predicting book ratings!

Loading and Cleaning data

The data is available on Kaggle. There we find a dataset with close to one million ratings for 10k books. On the same page we can also find a very good kernel about User-Based Collaborative Filtering (UBCF). The objective is the same, to improve the recommendation system, but the UBCF approach looks for similar users instead of similar books.

library(dplyr)  # provides %>%, group_by(), mutate(), summarise(), filter() used below

ratings <- read.csv("~/project/Data-Science-Projects/goodreads/data/ratings.csv")
books <- read.csv("~/project/Data-Science-Projects/goodreads/data/books.csv")

For this application we are going to use 2 datasets:

  • ratings: Contains information about ratings. It is a simple dataset with only 3 variables: book_id, user_id, and rating. Each row represents the rating a user gave to a book.
  • books: This dataset has more information. It contains different features of the books, such as title, author name, year of publication, number of reviews, etc.

knitr::kable(
  head(books)
)
| id | book_id | best_book_id | work_id | books_count | isbn | isbn13 | authors | original_publication_year | original_title | title | language_code | average_rating | ratings_count | work_ratings_count | work_text_reviews_count | ratings_1 | ratings_2 | ratings_3 | ratings_4 | ratings_5 | image_url | small_image_url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2767052 | 2767052 | 2792775 | 272 | 439023483 | 9.780439e+12 | Suzanne Collins | 2008 | The Hunger Games | The Hunger Games (The Hunger Games, #1) | eng | 4.34 | 4780653 | 4942365 | 155254 | 66715 | 127936 | 560092 | 1481305 | 2706317 | https://images.gr-assets.com/books/1447303603m/2767052.jpg | https://images.gr-assets.com/books/1447303603s/2767052.jpg |
| 2 | 3 | 3 | 4640799 | 491 | 439554934 | 9.780440e+12 | J.K. Rowling, Mary GrandPré | 1997 | Harry Potter and the Philosopher’s Stone | Harry Potter and the Sorcerer’s Stone (Harry Potter, #1) | eng | 4.44 | 4602479 | 4800065 | 75867 | 75504 | 101676 | 455024 | 1156318 | 3011543 | https://images.gr-assets.com/books/1474154022m/3.jpg | https://images.gr-assets.com/books/1474154022s/3.jpg |
| 3 | 41865 | 41865 | 3212258 | 226 | 316015849 | 9.780316e+12 | Stephenie Meyer | 2005 | Twilight | Twilight (Twilight, #1) | en-US | 3.57 | 3866839 | 3916824 | 95009 | 456191 | 436802 | 793319 | 875073 | 1355439 | https://images.gr-assets.com/books/1361039443m/41865.jpg | https://images.gr-assets.com/books/1361039443s/41865.jpg |
| 4 | 2657 | 2657 | 3275794 | 487 | 61120081 | 9.780061e+12 | Harper Lee | 1960 | To Kill a Mockingbird | To Kill a Mockingbird | eng | 4.25 | 3198671 | 3340896 | 72586 | 60427 | 117415 | 446835 | 1001952 | 1714267 | https://images.gr-assets.com/books/1361975680m/2657.jpg | https://images.gr-assets.com/books/1361975680s/2657.jpg |
| 5 | 4671 | 4671 | 245494 | 1356 | 743273567 | 9.780743e+12 | F. Scott Fitzgerald | 1925 | The Great Gatsby | The Great Gatsby | eng | 3.89 | 2683664 | 2773745 | 51992 | 86236 | 197621 | 606158 | 936012 | 947718 | https://images.gr-assets.com/books/1490528560m/4671.jpg | https://images.gr-assets.com/books/1490528560s/4671.jpg |
| 6 | 11870085 | 11870085 | 16827462 | 226 | 525478817 | 9.780525e+12 | John Green | 2012 | The Fault in Our Stars | The Fault in Our Stars | eng | 4.26 | 2346404 | 2478609 | 140739 | 47994 | 92723 | 327550 | 698471 | 1311871 | https://images.gr-assets.com/books/1360206420m/11870085.jpg | https://images.gr-assets.com/books/1360206420s/11870085.jpg |
knitr::kable(
  head(ratings)
)
| book_id | user_id | rating |
|---|---|---|
| 1 | 314 | 5 |
| 1 | 439 | 3 |
| 1 | 588 | 5 |
| 1 | 1169 | 4 |
| 1 | 1185 | 4 |
| 1 | 2077 | 4 |

Let’s start cleaning our ratings dataset. It is possible for a user to have more than one rating for the same book. We could assume that the user changed their mind and re-evaluated the book. That can definitely happen, but for simplicity we remove these cases entirely, so that every remaining combination of book and user has exactly one rating.

ratings<-ratings %>% group_by(user_id, book_id) %>% mutate(total=n())
cat('Number of duplicate ratings: ', format(nrow(ratings[ratings$total > 1,]),big.mark=",",scientific=FALSE))
## Number of duplicate ratings:  4,487

Example of duplicated rating: As you can see below, user 3204 rated book 8946 five times.

duplicated_ratings<-ratings%>%
    group_by(book_id,user_id)%>%
    summarise(total=n())%>%
    filter(total>1)%>%
    arrange(desc(total))

head(duplicated_ratings)
## # A tibble: 6 x 3
## # Groups:   book_id [5]
##   book_id user_id total
##     <int>   <int> <int>
## 1    8946    3204     5
## 2    2515    4359     4
## 3    3996   38259     4
## 4    6472     691     4
## 5    7420   34548     4
## 6    8946      42     4

The duplicated cases were removed.

ratings <- ratings[ratings$total == 1,]
cat(' Number of ratings: ',format(nrow(ratings),big.mark=",",scientific=FALSE),'\n',
    'Number of Books: ',format(length(unique(ratings$book_id)),big.mark=",",scientific=FALSE),'\n',
    'Number of Users: ',format(length(unique(ratings$user_id)),big.mark=",",scientific=FALSE))
##  Number of ratings:  977,269
##  Number of Books:  10,000
##  Number of Users:  53,380
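
Note that this rule drops every rating of a duplicated (user, book) pair instead of keeping one of them. A minimal sketch on made-up data shows the effect; it uses base R’s `ave()` instead of the dplyr pipeline above so that it runs on its own, and the `toy` data frame is purely hypothetical.

```r
# Made-up ratings: user 10 rated book 1 twice (5 and 3 stars).
toy <- data.frame(
  book_id = c(1, 1, 1, 2, 2),
  user_id = c(10, 10, 11, 10, 12),
  rating  = c(5, 3, 4, 2, 5)
)

# Count how many times each (user, book) pair occurs.
toy$total <- ave(toy$rating, toy$user_id, toy$book_id, FUN = length)

# Keep only pairs that occur exactly once; both rows of the
# duplicated pair (user 10, book 1) are discarded.
toy_clean <- toy[toy$total == 1, ]
print(toy_clean)
```

Keeping only the most recent rating would be an alternative, but the ratings file has no timestamp column to decide which one that is.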

For this problem, I decided to work with books that have at least 100 ratings and users who gave more than 20 ratings. Doing that significantly reduces the number of books and users, which helps cut processing time.

ratings<-ratings%>%group_by(user_id)%>%mutate(total=n())%>%filter(total>20)
# Keep books with at least 100 ratings, as stated in the text above
ratings<-ratings%>%group_by(book_id)%>%mutate(total=n())%>%filter(total>=100)
cat(' Number of ratings: ',format(nrow(ratings),big.mark=",",scientific=FALSE),'\n',
    'Number of Books: ',format(length(unique(ratings$book_id)),big.mark=",",scientific=FALSE),'\n',
    'Number of Users: ',format(length(unique(ratings$user_id)),big.mark=",",scientific=FALSE))
##  Number of ratings:  145,600
##  Number of Books:  1,456
##  Number of Users:  6,735

Now we have 9,806,160 possible (book, user) combinations (1,456 books × 6,735 users) and only 145,600 ratings. This means that only about 1.5% of the possible combinations have been rated.
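
The arithmetic behind these numbers can be checked with a few lines of R. The counts are copied from the output above; the variable names are only illustrative.

```r
# Counts taken from the filtered dataset summary above.
n_books   <- 1456
n_users   <- 6735
n_ratings <- 145600

n_pairs <- n_books * n_users    # all possible (book, user) combinations
density <- n_ratings / n_pairs  # share of combinations that have a rating

cat("Possible combinations:", format(n_pairs, big.mark = ",", scientific = FALSE), "\n")
cat("Rated share:", sprintf("%.2f%%", 100 * density), "\n")
```

In other words, the user-book rating matrix is very sparse, which is typical for recommender-system data.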

EDA: Understanding more about our data

What’s the most reviewed book?