NBA Calendar

13 minute read

Overview

As you might know (or not), NBA is the men’s professional basketball league in US. It contains 30 teams located around US and Canada (actually there is only one team in Canada) disputing the national title every year. If you have no idea about what I am saying maybe names such as Michael Jordan, Kobe Bryant, and Stephen Curry could help you. If still it does not sounds familiar this post probably will not be as delight as it is for me, but no problem we can still learn something from here.

The tournament is divided in two phases. First, these 30 teams play against each other during what is called the regular season. In the end of the regular season the best 8 teams from each conference (East and West) advance to the Playoffs where they dispute to be the Champion.

For my application, I will only focus on the regular season, when every team are playing against each other for a position in the Playoffs. During this period each team plays 82 games usually between October and April. Half of these 82 games are palyed at home and the other half is played away. It is very common for teams, during the regular season, have a sequence with more than one game away before playing at home, meaning they have to travel and stay away for more than one game.

For every regular season NBA defines a different calendar with time and location of the games. But this calendar cannot be randomly generate, otherwise we could end up with a very inefficient logistic calendar that would force teams to spend a lot of time and money on unnecessary trips. Moreover, this inefficient calendar could impact on the performance of the players.

Said that, it is plausible to claim that NBA already have an optmized regular season calendar. Where teams will not be playing away for a long period and also the amount of distance is minimized. They have an algorithm that create calendar subject to some constraints.

My point here is to explore the concept of Genetic Algorithm to create an efficient NBA calendar. Let’s see!!!

Load Necessary Functions

collectNBACalendar: Function to download NBA calendar for a specific year.
earth.dist: Given a pair of latitude and longitude calculate the distance between them.
nbaFlightsByTeam: Returns a data frame with flights and the distance for a specific team during the season.

These functions were created from scratch to ease the porcess to manipulate data. More information about these functions can be found at my github account.

source('~/project/Data-Science-Projects/NBA calendar/NBA season calendar Functions.R')

Data

Scrapping data from internet

The data used are not in a well structured data frame. It was necessary to extract the calendar from internet. The function calendar is used to extract the nba calendar for an specific season.

Collecting the calendar from .

calendar<-collectNBACalendar(2016)

knitr::kable(head(calendar))

date	time	visitor	visitor_pts	home	home_pts	season
Fri, Jan 1, 2016	8:00 pm	New York Knicks	81	Chicago Bulls	108	2016
Fri, Jan 1, 2016	10:30 pm	Philadelphia 76ers	84	Los Angeles Lakers	93	2016
Fri, Jan 1, 2016	7:30 pm	Dallas Mavericks	82	Miami Heat	106	2016
Fri, Jan 1, 2016	7:30 pm	Charlotte Hornets	94	Toronto Raptors	104	2016
Fri, Jan 1, 2016	7:00 pm	Orlando Magic	91	Washington Wizards	103	2016
Sat, Jan 2, 2016	3:00 pm	Brooklyn Nets	100	Boston Celtics	97	2016

Create date format variable

Fri, Jan 1, 2016 -> 01-01-2016

calendar$date2<-unlist(lapply(strsplit(gsub(",","",calendar$date)," "),function(x) paste(x[2:4],collapse = "-")))
calendar$date2<-as.Date(calendar$date2,"%b-%d-%Y")
calendar<-calendar%>%
  arrange(date2)


### Minimum and Maximum for date2
calendar%>%
    filter(complete.cases(.))%>%
    group_by()%>%
    summarise(min=min(date2),max=max(date2))

## # A tibble: 1 x 2
##   min        max       
##   <date>     <date>    
## 1 2015-10-27 2016-06-19

Filtering Regular Season Games

This calendar contains playoffs games and as I said before we are only interested on regular season games. Therefore, we have to filter the season games.

The 2015-16 season ranged from 10-27-2015 to 04-13-2016

calendar<-calendar%>%
            filter(date2<='2016-04-13')

Checking

Checking if every time has 82 games.

# Quick check: Every team must have 82 games
print(sapply(unique(calendar$home),
       function(x) calendar%>%
  filter((home==x | visitor==x))%>%
  nrow()))

##          Atlanta Hawks          Chicago Bulls  Golden State Warriors
##                     82                     82                     82
##         Boston Celtics          Brooklyn Nets        Detroit Pistons
##                     82                     82                     82
##        Houston Rockets     Los Angeles Lakers      Memphis Grizzlies
##                     82                     82                     82
##             Miami Heat        Milwaukee Bucks  Oklahoma City Thunder
##                     82                     82                     82
##          Orlando Magic           Phoenix Suns Portland Trail Blazers
##                     82                     82                     82
##       Sacramento Kings        Toronto Raptors         Indiana Pacers
##                     82                     82                     82
##   Los Angeles Clippers        New York Knicks    Cleveland Cavaliers
##                     82                     82                     82
##         Denver Nuggets     Philadelphia 76ers      San Antonio Spurs
##                     82                     82                     82
##   New Orleans Pelicans     Washington Wizards      Charlotte Hornets
##                     82                     82                     82
## Minnesota Timberwolves       Dallas Mavericks              Utah Jazz
##                     82                     82                     82

Define Game Location

Based on the name of the home team we can identify the game location. For example, when the home team is ‘Chicago Bulls’ we know the game was hosted in Chicago.

In a simple example, for a match between ‘Chicago Bulls’ and ‘Memphis Grizzles’ where the home team is ‘Chicago Bulls’ we assume that there was travel from Memphis to Chicago.

# Home Location
calendar$home_location<-unlist(
  lapply(strsplit(calendar$home," "),
         function(x) paste(x[1:(length(x)-1)],collapse=" ")
         )
  )

# Visitor Location
calendar$visitor_location<-unlist(
  lapply(strsplit(calendar$visitor," "),
         function(x) paste(x[1:(length(x)-1)],collapse=" ")
         )
  )

Even though the code above was able to identify the games location we still had to do some manual adjustments.

For example:

Golden State -> San Franciso (The team name does not contain the city’ name)
Minnesota -> Minneapolis (The team name contains the state not the city name)

calendar$home_location[calendar$home_location=="Portland Trail"]<-"Portland"
calendar$home_location[calendar$home_location=="Utah"]<-"Salt Lake City"
calendar$home_location[calendar$home_location=="Indiana"]<-"Indianapolis"
calendar$home_location[calendar$home_location=="Minnesota"]<-"Minneapolis"
calendar$home_location[calendar$home_location=="Golden State"]<-"Oakland"
calendar$home_location[calendar$home_location=="Washington"]<-"Washington D.C."

Latitude and Longitude for the Cities where the Teams are located

Using the geocode function from ggmap package it is possible to download the latitude and longitude based on the city name.

#Example for Denver
geocode('Denver')

##         lon      lat
## 1 -104.9903 39.73924

Download for every team the latitude and longitude.

cities<-unique(calendar$home_location)
pos<-geocode(cities)
citiesLocation<-data.frame(cities,pos)

while(length(which(is.na(citiesLocation$lon)))>0){
 citiesLocation[which(is.na(citiesLocation$lon)),c("lon","lat")]<-geocode(cities[which(is.na(citiesLocation$lon))])
}

knitr::kable(head(citiesLocation))

cities	lon	lat
Atlanta	-84.38798	33.74900
Chicago	-87.62980	41.87811
Oakland	-122.27111	37.80436
Boston	-71.05888	42.36008
Brooklyn	-73.94416	40.67818
Detroit	-83.04575	42.33143

Calculate the Distance between Teams

The distance between any two cities is calculated by using its locations (latitude and longitude).

# Every combination between two teams
distance<-expand.grid(unique(calendar$home_location),unique(calendar$home_location))
names(distance)<-c("team1","team2")

# Join the location (latitude and longitude) for each team.
distance<-merge(x=distance,
                y=citiesLocation,
                by.x="team1",
                by.y="cities",
                all.x=TRUE)
names(distance)[3:4]<-c("lon1","lat1")

distance<-merge(x=distance,
                y=citiesLocation,
                by.x="team2",
                by.y="cities",
                all.x=TRUE)
names(distance)[5:6]<-c("lon2","lat2")

knitr::kable(head(distance))

team2	team1	lon1	lat1	lon2	lat2
Atlanta	Atlanta	-84.38798	33.74900	-84.38798	33.749
Atlanta	New York	-74.00597	40.71278	-84.38798	33.749
Atlanta	Denver	-104.99025	39.73924	-84.38798	33.749
Atlanta	Sacramento	-121.49440	38.58157	-84.38798	33.749
Atlanta	Phoenix	-112.07404	33.44838	-84.38798	33.749
Atlanta	Milwaukee	-87.90647	43.03890	-84.38798	33.749

Calculate the distance (km) between two cities using the function earth.dist.

distance$distance<-apply(
  distance[,names(distance)%in%c('lon1','lat1','lon2','lat2')],
  1,
  function(x) earth.dist(x[1],x[2],x[3],x[4],R=6378.145)
  )

knitr::kable(head(distance))

team2	team1	lon1	lat1	lon2	lat2	distance
Atlanta	Atlanta	-84.38798	33.74900	-84.38798	33.749	0.000
Atlanta	New York	-74.00597	40.71278	-84.38798	33.749	1201.662
Atlanta	Denver	-104.99025	39.73924	-84.38798	33.749	1949.539
Atlanta	Sacramento	-121.49440	38.58157	-84.38798	33.749	3354.758
Atlanta	Phoenix	-112.07404	33.44838	-84.38798	33.749	2559.549
Atlanta	Milwaukee	-87.90647	43.03890	-84.38798	33.749	1078.468

Function to calculate Distance traveled by a Team during the season

The nbaFlightsByTeam function returns a data frame with the flights and the distance for a specific team during the season.

For example for the Philadelphia 76ers

PHI_flights<-nbaFlightsByTeam(calendar,"Philadelphia 76ers",date=TRUE)
head(PHI_flights)

##    flight_from    flight_to  distance
## 1 Philadelphia       Boston  436.1181
## 2       Boston Philadelphia  436.1181
## 3 Philadelphia Philadelphia    0.0000
## 4 Philadelphia    Milwaukee 1115.2003
## 5    Milwaukee    Cleveland  539.5069
## 6    Cleveland Philadelphia  576.9254

The first 6 games of the 2016 season for the Phildelphia 76ers are:

Away(A)-Home(H)-H-A-A-H

The Phildelphia 76ers plays its first game at Boston and goes back home to play the next two games. Then they fly to Milwaukee and after that they go to Cleveland before coming back home for another game.

Philadelphia 76ers Travel Map

options(repr.plot.width=20, repr.plot.height=16)
nbaRouteMap(calendar,"Philadelphia 76ers")

The blue numbers on the map represent the order of the games. The concentration of number on the bottom right are the home games.

To get the total kilometers traveled by the Philadelphia 76ers during the 2015-16 regular season we just have to sum the variable distance.

cat(format(sum(PHI_flights$distance),big.mark=",",scientific=FALSE),"km")

## 62,315.35 km

Total distance traveled by team during 2015-16 season

teams<-unique(calendar$home)
total_distance_by_team<-sapply(teams,function(x) sum(nbaFlightsByTeam(calendar,x)$distance))

total_distance_by_team<-as.data.frame(total_distance_by_team[order(total_distance_by_team,decreasing=TRUE)])
total_distance_by_team$team<-rownames(total_distance_by_team)
rownames(total_distance_by_team)<-NULL
colnames(total_distance_by_team)<-c('distance','team')
total_distance_by_team<-total_distance_by_team[c(2,1)]

options(repr.plot.width=8, repr.plot.height=4)
ggplot(total_distance_by_team,aes(x=reorder(team,-distance),distance))+
geom_bar(stat = "identity")+
theme_minimal()+
labs(x="Teams")+
labs(y="Distance (km)")+
 theme(axis.text.x = element_text(face="bold", color="#993333",
                           size=8, angle=45, hjust=1))

Interesting to see that both team that made NBA final are on the extrems. Coincidence?!? Yes, there is no correlation :P

Total distance traveled

If we sum the distance traveled by every team we have the total distance traveled during the season.

cat(format(sum(total_distance_by_team$distance),big.mark=",",scientific=FALSE),"km")

## 2,148,009 km

Propose a new calendar

If we shuffle the orders of the lines from the original calendar we can create a new calendar.

randomCalendar <- calendar[sample(nrow(calendar),replace=FALSE),]

Now, the first 10 games for Philadelphia 76ers using this new calendar would be:

randomCalendar%>%
select(home,visitor)%>%
filter(home=="Philadelphia 76ers" | visitor=="Philadelphia 76ers")%>%
head()

##                   home            visitor
## 1 New Orleans Pelicans Philadelphia 76ers
## 2   Philadelphia 76ers    Detroit Pistons
## 3      Milwaukee Bucks Philadelphia 76ers
## 4        Chicago Bulls Philadelphia 76ers
## 5   Philadelphia 76ers Washington Wizards
## 6   Philadelphia 76ers     Indiana Pacers

As you can see the total distance traveled by Philadelphia 76ers with this calendar has increased.

cat("Original Calendar:\t",
    format(sum(nbaFlightsByTeam(calendar,"Philadelphia 76ers",date=FALSE)$distance),big.mark=",",scientific=FALSE),
    "\nRandom Calendar:\t",format(sum(nbaFlightsByTeam(randomCalendar,"Philadelphia 76ers",date=FALSE)$distance),big.mark=",",scientific=FALSE))

## Original Calendar:    62,315.35
## Random Calendar:  101,729.5

Total distance traveled also has increased.

cat("Original Calendar:\t",
    format(sum(sapply(teams,function(x) sum(nbaFlightsByTeam(calendar,x,date=FALSE)$distance))),big.mark=",",scientific=FALSE),
        "\nCandidate Calendar:\t",
        format(sum(sapply(teams,function(x) sum(nbaFlightsByTeam(randomCalendar,x,date=FALSE)$distance))),big.mark=",",scientific=FALSE)
            )

## Original Calendar:    2,148,009
## Candidate Calendar:   3,178,847

How easy is to create a new calendar a with low distance traveled?

As I wrote in the begining of the post I believe it is hard to improve the original NBA calendar, but still we can try to create a new calendar with a resonable solution.

Let’s create 1,000 random calendars and check the distribution of the total distance traveled.

result<-rep(NA,1000)
for(i in 1:length(result)){
  c_<-calendar[sample(nrow(calendar),replace=FALSE),]
  result[i]<-sum(sapply(teams,function(x) sum(nbaFlightsByTeam(c_,x,date=FALSE)$distance)))
}

ggplot(data.frame(result),aes(x=result))+
  geom_histogram(fill="navy")+
  theme_minimal()+xlab("Distance (Km)")+ylab("")+
  geom_vline(xintercept = sum(sapply(teams,function(x) sum(nbaFlightsByTeam(calendar,x,date=TRUE)$distance))),colour="red",lty=2)

As we know the NBA Calendar has a total travel of 2,148,009 km (red dashed-line) and none of the 1,000 random calendars generate were closed to it. The minumum distance traveled generated from the 1,000 random calendars is greater than 3,000,000 Km. The original NBA calendar is really well optmized.

For this problem is possible to create 1230! different sequence of games. It is unfeasible to mannualy find a reasonable calendar by just generating a random sequence of games. The solution is to work with Genetic Algorithm to converge to an acceptable calendar.

Genetic Algorithm

There is no right answer for this problem. We are not looking for the best calendar, but one that might be a adequate. In this case we are looking for a calendar with a low distance traveled during the season.

This is an heuristic problem and can be solved by using the method so-called genetic algorithm (GA). The idea of this algorithm is pretty inspired by biological evolution (mutation, crossover and selection), that’s why the name genetic algorithm.

In a simple explanation, biological evolution are organisms reproducing and generating changes with each generation. In our case, organisms are the calendars and the evolution are a new calendars with changes generated from its ‘parents’ (previous calendars).

How it works?

If we have as initial calendar the random one we generated with a total distance traveled of 3,157,882 km. As we know that’s not a good calendar. The idea is to generated a new calendar from this one that could reduce this distance. One method to create this new calendar is randomly change some positions and check if it dereceased the total distance traveled. That’s the ideo of creating a new generation based on the previous one.

Let’s create 4 generations from our random calendar as first generation. In this process we proposed 10 random positions changes from this calendar and see the result.

random_distance<-sum(sapply(teams,function(x) sum(nbaFlightsByTeam(randomCalendar,x,date=FALSE)$distance)))

aux1<-randomCalendar

new_distance<-numeric(4)
new_distance[1]<-random_distance

#Create from 2nd to 4th generation
for(i in 2:4){
  new_distance[i]<-new_distance[(i-1)]+1
  while(new_distance[i]>new_distance[(i-1)]){
    for(j in 1:10){
      change<-sample(1:nrow(aux1),2,replace = TRUE)
      aux2<-aux1[change[1],]
      aux1[change[1],]<-aux1[change[2],]
      aux1[change[2],]<-aux2
      }
    new_distance[i]<-sum(sapply(teams,function(x) sum(nbaFlightsByTeam(aux1,x,date=FALSE)$distance)))
  }
}



cat("Distance First Generation Calendar:\t",
    format(random_distance,big.mark=",",scientific=FALSE),
    "\nDistance Second Generation Calendar:\t",
    format(new_distance[2],big.mark=",",scientific=FALSE),
    "\nDistance Third Generation Calendar:\t",
    format(new_distance[3],big.mark=",",scientific=FALSE),
    "\nDistance Fourth Generation Calendar:\t",
    format(new_distance[4],big.mark=",",scientific=FALSE)
    )

## Distance First Generation Calendar:   3,152,361
## Distance Second Generation Calendar:  3,129,660
## Distance Third Generation Calendar:   3,124,903
## Distance Fourth Generation Calendar:  3,113,575

In this simple example, each generation produced a lower total traveled distance calendar than its previous generation. If we keep running this procedure for further generations we could create a calendar with a similar distance from the original NBA calendar.

How to do it:

Create 100 intial random calendars. It was created 100 so we can have more options.
For each of the 100 calendars we create another calendar with a better result (lower total traveled distance) by changing positions. In this case we proposed initially 10 changes.
From these 100 new calendars we select the top 60 with lower distance and randomly select 40 others (It will probably result in repeated calendars). This will select the best calendar and also randomly give a chance for others to present a better result in future generations.
With these set of 100 calendars we repeat the whole process until we get a resonable result. In our case let’s use the total distance traveled in 2016 as reference (2,148,009 km).

Running the Genetic Algorithm from scratch

n_calendars<-100
#list_calendars<-list()
initial_n_changes<-2
n_interactions <- 230
modify_value <- 0
modify_check <- 0

n_changes<-10

# Initial 100 random calendars
for(i in 1:n_calendars){
  print(i)
  list_calendars[[i]]<-calendar[sample(1:nrow(calendar),nrow(calendar),replace = FALSE),]
  list_distance[[i]]<-sum(sapply(teams,function(x) sum(nbaFlightsByTeam(list_calendars[[i]],x,date=FALSE)$distance)))
}

# list of lists
list_calendars<-list(list_calendars)
list_distance<-list(list_distance)

# Create list for each interaction
aux_calendars<-list()
aux_distance<-list()


for(k in 1:n_interactions){
  print(paste0("************* Interaction = ",k," ***************"))

  # Select top 60 calendars + 40 random from every calendar
  candidates<-c(order(unlist(list_distance[[k]]))[1:60],sample(order(unlist(list_distance[[k]])),40,replace = TRUE))


  #### Update number of changes for each interaction ###

  if(modify_check>=50){
    modify_value <- modify_value+floor(modify_check/50)
    n_changes <- ifelse(modify_value>=initial_n_changes,2,initial_n_changes - modify_value)
  }
  else{
    n_changes <- initial_n_changes - modify_value
  }

  modify_check <- 0

  ####

  for(i in 1:length(candidates)){

    # Calculate the distance of the calendar that will be shuffled
    d<-sum(sapply(teams,function(x)
      sum(nbaFlightsByTeam(list_calendars[[k]][[candidates[i]]],x,date=FALSE)$distance)))

    # Variable for the new calendar distance. Intialize it longer than the current calendar
    new_d<-d+1

    # Counter
    count <- 0

    while(d<new_d && count<100 && n_changes!=0){
      aux1<-list_calendars[[k]][[candidates[i]]]
      print(paste0(i,"-",count," - changes: ",n_changes))
      for(j in 1:n_changes){
        change<-sample(1:nrow(aux1),2,replace = TRUE)
        aux2<-aux1[change[1],]
        aux1[change[1],]<-aux1[change[2],]
        aux1[change[2],]<-aux2
      }
      new_d<-sum(sapply(teams,function(x) sum(nbaFlightsByTeam(aux1,x,date=FALSE)$distance)))
      count<-count+1
      # If it takes more than 10 loops to find a better result
      if(count>=10 && count%%10==0){modify_check=modify_check+1}
    }

    aux_calendars[[i]]<-aux1
    aux_distance[[i]]<-new_d
  }

  # list of lists
  list_calendars[[k+1]]<-aux_calendars
  list_distance[[k+1]]<-aux_distance

  aux_calendars<-list()
  aux_distance<-list()

}

Taking the results

df<-as_tibble(
  do.call(cbind, lapply(list_distance,function(x) unlist(x)))
)

df$calendar<-1:nrow(df)

df<-melt(df,id=c("calendar"))%>%
  rename(interaction=variable)

data.frame(distance=map_dbl(lapply(
  list_distance,function(x) unlist(x)
),~quantile(.x,0.5)),
p5=map_dbl(lapply(
  list_distance,function(x) unlist(x)
),~quantile(.x,0.05)),
p95=map_dbl(lapply(
  list_distance,function(x) unlist(x)
),~quantile(.x,0.95)))%>%
  mutate(interaction=as.numeric(rownames(.)))%>%
  ggplot(aes(x=interaction,y=distance))+
  geom_line(col='navy')+
  geom_line(aes(x=interaction,y=p95),colour="orange")+
  geom_line(aes(x=interaction,y=p5),colour="orange")+
  theme_minimal()+
  geom_hline(yintercept = sum(sapply(teams,function(x) sum(nbaFlightsByTeam(calendar,x,date=FALSE)$distance))),colour="red",lty=2)

As we can see, after 200+ interactions we were able to generate a calendar with a lower distance than the 2016 calendar. The blue line represents the median of the 100 calendars and each yellow line represent 5% and 95% quantile of these 100 calendars.

The minimum distance traveled would be:

cat(format(sum(sapply(teams,function(x) sum(nbaFlightsByTeam(min_calendar,x,date = FALSE)$distance))),big.mark=",",scientific=FALSE),"km")

## 2,112,450 km

The total distance is obviously lower than the 2016 calendar. For this same calendar, the Philadelphia 76ers would travel:

cat("Original Calendar:\t",
    format(sum(nbaFlightsByTeam(calendar,"Philadelphia 76ers",date=FALSE)$distance),big.mark=",",scientific=FALSE),
    "\nNew Calendar:\t",format(sum(nbaFlightsByTeam(min_calendar,"Philadelphia 76ers",date=FALSE)$distance),big.mark=",",scientific=FALSE))

## Original Calendar:    62,315.35
## New Calendar:     67,528.54

It is actually more than the original calendar but it is not that different. It resulted in this map:

nbaRouteMap(min_calendar,"Philadelphia 76ers")

total_distance_by_team<-sapply(teams,function(x) sum(nbaFlightsByTeam(min_calendar,x,date = FALSE)$distance))

total_distance_by_team<-as.data.frame(total_distance_by_team[order(total_distance_by_team,decreasing=TRUE)])
total_distance_by_team$team<-rownames(total_distance_by_team)
rownames(total_distance_by_team)<-NULL
colnames(total_distance_by_team)<-c('distance','team')
total_distance_by_team<-total_distance_by_team[c(2,1)]

options(repr.plot.width=8, repr.plot.height=4)
ggplot(total_distance_by_team,aes(x=reorder(team,-distance),distance))+
geom_bar(stat = "identity")+
theme_minimal()+
labs(x="Teams")+
labs(y="Distance (km)")+
 theme(axis.text.x = element_text(face="bold", color="#993333",
                           size=8, angle=45, hjust=1))

In this scenario we would have Portland Trail Blazer with the most distance traveled. It seems to be reasonable given the location of Portland.

Discussion and Conclusion

Trying to randomly find a suitable calendar for NBA regular season seems to be almost an impossible mission. By using the idea of genetic algorithm we were able to create possible calendars that have a total distance traveled lower than real NBA calendars. One important limitation in this procedure is that we are not using any constraint. These calendars could easily be subject to some constraints such as maximum number of games away in a row or some special date games such as christmas and thanksgiving games. Applying these constraints would reduce the number of possible calendars and apparently would also make the process of finding a solution slower.

Share on

Twitter Facebook Google+ LinkedIn

Daniel D Barreto