Lacrosse Predictors

Every year in early May, the regular season winds down for college lacrosse and the playoffs begin. Unlike NCAA basketball and March Madness, where 68 teams are invited, only 18 teams are invited to the national lacrosse tournament. For the 70 Division I teams competing for those spots, there are two ways to be selected. The first way is simple: secure an automatic qualification by winning your conference's playoff tournament. There are ten conferences, leaving eight spots for other teams. These spots, known as "at-large" selections, are handed out by the National Selection Committee. The committee uses the following criteria to select and seed teams:

  • Strength of schedule index.
  • Results of the RPI.
    • Record against ranked teams 1-5; 6-10; 11-15; 16-20; 21+
    • Average RPI win (average RPI of all wins)
    • Average RPI loss (average RPI of all losses)
  • Head-to-head competition:
    • Results versus common opponents.
    • Significant wins and losses (wins against teams ranked higher in the RPI and losses against teams ranked lower in the RPI).
    • Locations of contests.
  • Input from the regional advisory committee (comprised of lacrosse coaches from all AQ conferences).

These criteria seem to place serious significance on this "RPI" ranking, but what even is it?

RPI, or Rating Percentage Index, is a ranking of sports teams that aims to combine wins, losses, and strength of schedule. For any particular lacrosse team, it is calculated as:

RPI = 0.25 × (winning percentage) + 0.50 × (opponents' winning percentage) + 0.25 × (opponents' opponents' winning percentage)

The selection committee has traditionally used the RPI as a factor in its decisions for a variety of reasons. For one thing, it is relatively easy to calculate: you do not need any advanced data or subjective metrics, just teams' wins and losses. Moreover, for a sport like college lacrosse, where each team plays only a small subset of all the other teams, RPI does a good job of accounting for harder and easier schedules. On top of that, RPI typically aligns well with the "eye-test," or how good you think a team is just from watching it. For those and other reasons, RPI has become an integral part of the selection process for the national college lacrosse tournament.

There are, however, several aspects of RPI to complain about. First off, it does not account for factors like margin of victory and home-field advantage, both of which are known to matter when predicting future outcomes. Additionally, not all conferences are equally competitive, so RPI can artificially inflate the RPIs of teams in "good" conferences and deflate those of teams in "bad" conferences.

With that in mind, I will be looking into whether RPI is the best statistic to use for predicting the outcomes of college lacrosse games, or whether there is a better predictor out there that the selection committee should perhaps be using.

I will be pursuing two main avenues to try to find a better predictor. First, I will look at alternative ranking systems such as ELO. ELO is a rating system originally designed for ranking chess players, which you can read more about here. There are a number of nuances to applying the ELO system to lacrosse, but those have mostly been taken care of by the lacrosse analytics site Lacrosse Reference. A detailed explanation of their lacrosse ELO system can be found here, and I will simply use their model for my comparison. The other two ranking systems I will look at are the coaches and media polls, from USILA and Inside Lacrosse respectively. Unlike RPI and ELO, these rankings are more subjective. They are not based completely on stats, but factor in the "eye-test" as well. The "eye-test" is just judging teams based on observations you make watching them play, or generally any factor not easily represented by a stat. More specific information about USILA can be found here.

Second, I will try to improve RPI itself, using gradient descent to find better weights for the three statistics it is based on. The original weights of 25%, 50%, and 25% were arbitrarily chosen as "nice" values that seemed right. It stands to reason, then, that by tweaking those values with gradient descent we could create an RPI-like stat that is a better predictor of the outcome of lacrosse games.

To begin the analysis, there are several datasets I need to create. I first need a dataset of all the college lacrosse games that have been played this season. I can only look at the current season because Lacrosse Reference only has the current season's ELO rankings publicly available. Therefore, I will not be able to look at predictiveness for post-season games, as the post-season has not completed as I write this. That is not a problem, though, since the regular season is complete and provides a larger dataset to analyze anyway.

To get the game data, I will be using the daily scoreboard on the NCAA Statistics website. The scoreboard has information on all the games played on a given day, including who played and who won. I was not able to find a public API, so to get the data I scraped the scoreboard day by day for every day of the lacrosse season, which is possible by just changing the date in the URL. As I collect the data I put it into two dataframes. One dataframe serves as a schedule, saving the winner of each game as well as which team was home and away. The other dataframe stores each game twice, once for each team. This will make it easier later to select, for instance, all the games a team played or all the games played against them.
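As a sketch of that day-by-day loop, the snippet below builds one scoreboard URL per day of the season. The URL template and the season's start and end dates are placeholders of my own, not the real NCAA Statistics paths:

```python
from datetime import date, timedelta

# Hypothetical URL template; the real NCAA scoreboard path differs.
# Only the idea of substituting the date into the URL matters here.
URL_TEMPLATE = "https://stats.ncaa.org/scoreboard?date={d}"

def daily_urls(start, end):
    """One scoreboard URL per day of the season, inclusive."""
    urls = []
    day = start
    while day <= end:
        urls.append(URL_TEMPLATE.format(d=day.strftime("%m/%d/%Y")))
        day += timedelta(days=1)
    return urls

# Assumed season window for illustration.
season = daily_urls(date(2021, 2, 1), date(2021, 5, 9))
```

Each URL would then be fetched and parsed, appending that day's games to the two dataframes.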

Now that we have all of the games that were played, we have enough information to be able to calculate winning percentages, and thus RPI, for an arbitrary date.
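As a minimal sketch of that calculation, assuming a long-format games dataframe with one row per team per game (and simplifying real RPI by not excluding a team's own games from its opponents' winning percentages):

```python
import pandas as pd

# Hypothetical long-format games dataframe: one row per team per game.
games = pd.DataFrame({
    "team":     ["A", "A", "B", "B", "C", "C"],
    "opponent": ["B", "C", "A", "C", "A", "B"],
    "won":      [1,   1,   0,   1,   0,   0],
})

# Winning percentage for every team.
wp = games.groupby("team")["won"].mean()

# Opponents' winning percentage: average WP over each team's opponents.
owp = games.groupby("team")["opponent"].apply(lambda opps: wp.loc[opps].mean())

# Opponents' opponents' winning percentage, built the same way from OWP.
oowp = games.groupby("team")["opponent"].apply(lambda opps: owp.loc[opps].mean())

# Standard RPI weights: 25% WP, 50% OWP, 25% OOWP.
rpi = 0.25 * wp + 0.50 * owp + 0.25 * oowp
```

Restricting `games` to rows before a given date yields the RPI as of that date.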

Now we need to get the ELO rankings for teams throughout the season. Lacrosse Reference does have a page with every team's current ELO, but it only has the up-to-date ranking. However, in order to use ELO to predict games, I will need each team's ELO before every game. To get this I will need to look at each team's individual page on Lacrosse Reference. The individual pages have a table showing the change to ELO after each game.

Elo Table

I will just need to turn these game-to-game differences into a cumulative rating after each game.

The individual team pages have a standard URL format, where the team is identified by a two-digit number at the end of the URL. By incrementing through those numbers I can scrape every team's page.
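A sketch of both steps follows; the URL pattern and the 1500 starting rating are assumptions of mine for illustration, not Lacrosse Reference's actual scheme:

```python
import pandas as pd

# Hypothetical team-page URL pattern; the real Lacrosse Reference path differs.
team_urls = [f"https://example.com/teams/{i:02d}" for i in range(1, 100)]

# Per-game ELO changes for one team, as scraped from its page (made-up values).
elo_changes = pd.DataFrame({
    "date":  pd.to_datetime(["2021-02-20", "2021-02-27", "2021-03-06"]),
    "delta": [12, -8, 20],
})

# An assumed starting rating of 1500; a cumulative sum turns the game-to-game
# differences into the team's rating after each game.
START_ELO = 1500
elo_changes["elo"] = START_ELO + elo_changes["delta"].cumsum()
```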

Manually checking the pages that did not have a team, they seem to belong to teams that were Division I at some point but are no longer playing in Division I.

Now I have the ELO ranking for each team after every date in the season where it changed. There is a slight issue, though. As you can see, the most recent ELO ranking of LIU is listed as 1484, while at the time of writing, the ELO listed by Lacrosse Reference on the main ELO page is 1485. Looking through all the teams manually, I noticed that many of them are off from Lacrosse Reference's official rankings by a point or two. I believe this is because, behind the scenes, Lacrosse Reference calculates ELO and ELO changes as decimal numbers, but rounds them to the nearest whole number when displaying them on the website. Over the course of a season the small rounding differences add up and cause some of my calculated ELOs to be slightly off. Because Lacrosse Reference does not have a public API, there is not really anything I can do to get around this issue. I am not worried about this affecting my analysis, though. When you look at the ELO chart on Lacrosse Reference, you can see that it ranges from about 900 to nearly 2100, with most teams separated by far more than a point or two. Hence, I am not worried about the small differences between my dataset and the authentic Lacrosse Reference ELO data causing any significant problems.

For the USILA coaches ranking there is again no public API, so I will need to scrape the rankings from the individual pages they were posted on. The polls are posted every Monday on a page like this. The URLs are the same with just the date changed, so I can scrape through all of them relatively easily.

Now we have the complete coaches poll rankings from USILA. (I cannot believe somebody ranked Duke above Maryland going into the season)

I can get the media polls from Inside Lacrosse in a similar way, but as far as I can tell the URLs for the Inside Lacrosse pages are not predictable. Fortunately, the main Inside Lacrosse polls page has a navigation dropdown with links to all of the polls. If I scrape that page first for the list of poll URLs, I can then go through that list to get the poll for each week.
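A stdlib-only sketch of pulling the poll links out of that dropdown; the markup snippet and the `/poll/` URL pattern are hypothetical, and in practice a library like BeautifulSoup does the same job:

```python
from html.parser import HTMLParser

# Collect hrefs that look like weekly poll links from the dropdown markup.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and "/poll/" in href:  # hypothetical poll-URL pattern
                self.links.append(href)

# Hypothetical snippet of the navigation dropdown on the main polls page.
html = """
<div class="poll-nav">
  <a href="/poll/2021-02-22">Week 1</a>
  <a href="/poll/2021-03-01">Week 2</a>
  <a href="/about">About</a>
</div>
"""
collector = LinkCollector()
collector.feed(html)
poll_urls = collector.links  # one URL per weekly poll
```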

That concludes the collection of all the data necessary to analyze the predictive power of the different ranking systems. Because I collected all the data myself and validated it as I went, there is not much cleaning that needs to be done. The only issue is that some sources use slightly different versions of school names to identify them, which will cause problems later when trying to match them up. To fix this I decided to use the version of each name used on the NCAA scoreboard as the standard, and manually looked through the other sources for differences. I compiled those differences into a lookup table so I can go through each dataframe and replace the nonstandard versions of names with the standardized ones.
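The replacement step might look like the following, with a made-up lookup table standing in for the real list of differences:

```python
import pandas as pd

# Hypothetical lookup table: each source's spelling mapped to the
# NCAA-scoreboard spelling chosen as the standard.
name_fixes = {
    "Penn St.": "Penn State",
    "UAlbany": "Albany",
}

polls = pd.DataFrame({"team": ["Penn St.", "Maryland", "UAlbany"]})

# Replace any nonstandard name; names not in the table pass through unchanged.
polls["team"] = polls["team"].replace(name_fixes)
```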

To start exploring the data I need to do some quick calculations first. In order to calculate RPI, you need to know each team's winning percentage at the date of each game. To make this easier later, I will add each team's winning percentage after each game they play to the games dataframe.

To be able to calculate the prediction a ranking makes, I need to be able to easily get the ranking of a team on a particular date. Most of the rankings can be found with a lookup from the respective dataframes, but there are some specific cases for each ranking system.

With these helper functions in place, I can go through the schedule dataframe and for each game add whether each stat predicted it correctly or not. If the prediction is correct, I store that as a 1, and if the prediction is incorrect I store it as a 0.
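As a sketch of that scoring step (the column names here are placeholders, not my actual schema):

```python
import pandas as pd

# Toy schedule with each team's rating at game time (hypothetical columns).
schedule = pd.DataFrame({
    "home": ["A", "B"], "away": ["C", "A"],
    "home_rpi": [0.55, 0.48], "away_rpi": [0.50, 0.60],
    "winner": ["A", "B"],
})

# A ranking "predicts" the team with the higher rating; store 1 when that
# matched the actual winner and 0 otherwise.
predicted = schedule["home"].where(
    schedule["home_rpi"] > schedule["away_rpi"], schedule["away"])
schedule["rpi_correct"] = (predicted == schedule["winner"]).astype(int)
```

The same pattern repeats for each ranking system, giving one correctness column per system.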

We now have all the games for the season and whether or not they were predicted correctly by ELO, RPI, USILA, or Inside Lacrosse Rankings. For fun, I also added a column for predicting the home team wins every single time.

To explore what this looks like, I am going to break the data up by weeks and plot the proportion of games each ranking system predicted correctly over the course of the season. Weeks should be a reasonable size to chunk the data by, since teams typically play one game a week.
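The grouping might look like this (with made-up correctness flags; the real dataframe has one column per ranking system):

```python
import pandas as pd

# Hypothetical per-game correctness flags with game dates.
games = pd.DataFrame({
    "date": pd.to_datetime(["2021-02-20", "2021-02-21",
                            "2021-02-27", "2021-02-28"]),
    "rpi_correct": [0, 1, 1, 1],
    "elo_correct": [1, 1, 0, 1],
})

# Chunk by week; the mean of the 0/1 flags is the proportion of games each
# system predicted correctly that week.
weekly = games.groupby(pd.Grouper(key="date", freq="W"))[
    ["rpi_correct", "elo_correct"]].mean()
# weekly.plot() would then draw one accuracy line per ranking system.
```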

Looking at this chart, a few things jump out. First of all, RPI starts out as a horrible predictor in the first couple weeks of the season. This actually makes sense, though, as it relies on opponents' winning percentages and opponents' opponents' winning percentages. In the first few weeks of the season, teams have very few opponents, and their opponents may have even fewer. That leaves the ranking determined by very few data points and likely swayed heavily by outliers. The NCAA recognizes this, as they did not publish their official RPI rankings until March 27th this year, about halfway through the season.

Moreover, there seems to be unusual behavior at both ends of the graph, but that is just because the first couple and last couple of weeks pictured have significantly fewer games in them. This can be seen by looking at the counts of the groups.

Another observation from the chart is that simply predicting the home team to win every time has the greatest variation in success throughout the season. While this is not a prediction system I am analyzing, I thought it would be interesting to see how it compares to stats meant to predict. Interestingly, it does appear to predict more than half the games correctly in the majority of weeks, which could indicate that home-field advantage is quite significant. Analyzing that further would be interesting, but is outside the scope of this project.

A final observation from the chart is that the subjective, poll-based rankings perform much worse than the objective rankings. That makes sense, as those polls only rank the top 20, so they offer no predictive value for games between two teams outside the top 20. When comparing those rankings, it might therefore make sense to disregard games between two non-top-20 teams. It does not seem completely fair to throw out those games; after all, a good ranking system should be able to predict all of them. But for the sake of a more interesting comparison, I will adjust the USILA and Inside Lacrosse rankings to only make a prediction on a game when at least one of the teams is ranked.

As you can see, in the adjusted columns for USILA and Inside Lacrosse, if neither team was ranked the prediction is given as NaN. The Pandas function mean() ignores NaNs by default when taking the average, so I can regroup and replot the data to see how these adjusted rankings compare to before.

The chart shows that for games with at least one top 20 team, the subjective rankings have a roughly similar predictive ability to the objective rankings.

Looking at this chart, RPI does appear to be the worst predictor of matchups besides always choosing the home team, but the week-to-week variation makes it unclear whether the other predictors are actually significantly better. To check for statistical significance, I will use a z-test for each of the other rankings. The null hypothesis will be that the ranking predicts an equivalent proportion of games correctly, and the alternative hypothesis will be that the ranking system outperforms RPI. I will also set the significance level for these tests at 0.05.

When I perform these tests, I am only going to look at games after March 27th. I decided to do this because RPI's value as a predictor depends on enough other games having been played to provide information about the teams. The NCAA recognizes this weakness at the beginning of the season and does not publish RPI rankings until March 27th; I will follow the NCAA's lead and only look at RPI after that date.

First, I will compare ELO to RPI.
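A stdlib-only sketch of that one-sided two-proportion z-test follows; the counts are invented for illustration and are not my actual results:

```python
from math import erf, sqrt

def ztest_proportions(correct_a, n_a, correct_b, n_b):
    """One-sided two-proportion z-test: H1 says system A's accuracy
    exceeds system B's. Returns the z statistic and upper-tail p-value."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 1 - 0.5 * (1 + erf(z / sqrt(2)))  # 1 - Phi(z), upper tail
    return z, p_value

# Made-up counts: suppose ELO got 150 of 200 games right, and RPI 140 of 200.
z, p = ztest_proportions(150, 200, 140, 200)
```

In practice, `statsmodels.stats.proportion.proportions_ztest` performs the same computation.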

The p-value is certainly greater than 0.05, so I cannot say that the increased accuracy of ELO over RPI for predicting games this season was statistically significant.

Looking at the USILA coaches poll:

When comparing the USILA coaches poll to RPI, I removed all the games where neither team was ranked. This makes the test less biased, as we are comparing RPI's predictive ability only on games that the USILA ranking can make a prediction for. With that in mind, the p-value is greater than 0.05, so I cannot say there is a statistically significant difference in the performance of these two rankings as predictors of games.

Looking at the Inside Lacrosse Poll:

Similarly, I only looked at the subset of games with at least one team ranked by Inside Lacrosse. And again, the p-value was not below 0.05. While the Inside Lacrosse rankings did have the lowest p-value, we cannot reject the null hypothesis that Inside Lacrosse and RPI have the same predictive power.

Overall, no other predictor of Division I college lacrosse games I looked at is a statistically better predictor of game outcomes than RPI. This surprised me, as the chart makes it look like the other ranking systems consistently outperform RPI.

With that in mind, I will now look to improve the RPI statistic through gradient descent, finding better weights for the three factors that compose RPI.

Because predicting outcomes of individual games is a discrete matter, I will have to make some modifications to typical gradient descent to use it effectively. For one thing, instead of a loss function, I will just use the accuracy of RPI with the given weights. And instead of minimizing it, I want to maximize it, so this will be more like gradient ascent. Also, since game predictions are discrete, there is no way to take a derivative or calculate a gradient analytically. To avoid this issue I will use the gradient descent method from the class workbook, which uses an algebraic estimate of the gradient. The original code for this technique is in this notebook. Below I have made the necessary modifications to find better RPI weights.
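The core of the modified loop looks something like this. The quadratic stand-in for accuracy is my own smooth toy (the real objective counts correctly predicted games), but the finite-difference gradient estimate and the ascent step illustrate the same idea:

```python
import numpy as np

# Smooth stand-in for "accuracy of RPI with the given weights"; it peaks
# at (0.25, 0.50, 0.25) so the ascent has a known target.
def accuracy(w):
    return -(w[0] - 0.25) ** 2 - (w[1] - 0.50) ** 2 - (w[2] - 0.25) ** 2

def estimate_gradient(f, w, h=1e-3):
    """Algebraic (central finite-difference) estimate of the gradient of f."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = h
        grad[i] = (f(w + step) - f(w - step)) / (2 * h)
    return grad

# Gradient *ascent*: step in the direction that increases accuracy.
w = np.array([1 / 3, 1 / 3, 1 / 3])
for _ in range(500):
    w = w + 0.1 * estimate_gradient(accuracy, w)
```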

As you can see, gradient descent found that weights of 0.666 for winning percentage, 0.9343 for opponents' winning percentage, and 0.1129 for opponents' opponents' winning percentage resulted in correctly predicting more games! One slight issue with these weights is that they no longer total 1. This can easily be fixed by scaling them all down by their sum.
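The rescaling is a one-liner; because dividing every team's RPI by the same positive constant does not change which team is ranked higher, the predictions are unaffected:

```python
# Weights reported by the ascent above, rescaled to total 1.
weights = [0.666, 0.9343, 0.1129]
total = sum(weights)
normalized = [w / total for w in weights]
```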

Now we have the best weights found for calculating an RPI that predicts the outcome of games. We saw earlier that it is better than standard RPI, but is that difference significant?

It is time for another significance test to see whether the two predictors have significantly different accuracies. The null hypothesis is that both accuracies are the same, and the alternative hypothesis is that the new weights are better at predicting games. The significance level for this test will be 0.05.

The p-value is above the significance level of 0.05, so we cannot reject the null hypothesis. While the new weights for RPI performed better, I cannot say that the difference is significant.

Another aspect of the new weights to consider is that they are likely overfitted to this season specifically. All of the training data came from this season, so it seems probable that if the two sets of weights were tested on a different season, the non-standard ones would no longer be better.

In general, all of the analysis conducted in this tutorial would have benefited from looking at multiple seasons. However, because of the difficulties of data collection, that was not in the scope of this tutorial.

Conclusion

While there are many reasons to dislike RPI, statistically speaking, it holds up when looking at its value as a predictor of the outcomes of games. Perhaps it should not be such a prominent stat in the selection process for the tournament, but it also probably should not be ignored.