Thursday 06/06/2024 by phishnet

WHAT SONGS MAKE THE PERFECT SETLIST?

[This post is courtesy of Ryan Smith, dot net user @ryansmith534, a data scientist formerly at Spotify. Thank you, Ryan! -Ed.]

Every Phish fan undoubtedly has their own answer to this question – but is there a universal truth across all fans? Using setlist data and user ratings from Phish.net, we can attempt to answer this question empirically.

To do this, we can borrow methodology from basketball and hockey analytics, specifically the concept of RAPM (regularized adjusted plus-minus). This metric attempts to quantify an answer to the question: how much does the presence of a given player on the court contribute to a team’s point differential? In our case, the question becomes: how much does the presence of a given song in a setlist contribute to a show’s rating on Phish.net?
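In regression terms (a sketch of the setup, with the notation added here rather than taken from the post), the question becomes a linear model: each show’s rating is a baseline plus the summed contributions of the songs it contains, and the per-song coefficients are the “plus-minus” values we’re after.

```latex
\text{rating}_i = \beta_0 + \sum_{j=1}^{S} \beta_j \, x_{ij} + \varepsilon_i,
\qquad
x_{ij} =
\begin{cases}
1 & \text{if song } j \text{ was played at show } i \\
0 & \text{otherwise}
\end{cases}
```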

We first need to gather the necessary data, a process made significantly easier because of the convenience of the Phish.net API. After doing a bunch of cleaning and manipulation, we get a dataset that looks like this:

We have one row for every show, a column with the show’s rating, and a column for every song in Phish’s repertoire – with a 0 or 1 value representing whether the song was played at a given show.
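As a rough illustration of what that preparation might look like (the endpoint path, query parameters, and field names below are assumptions for the sake of the sketch, not the author’s actual code – consult the Phish.net API docs for the real ones), one could pull setlists and pivot them into a binary show-by-song matrix in R:

```r
library(httr)
library(jsonlite)
library(dplyr)
library(tidyr)

# Hypothetical pull from the Phish.net API; endpoint and fields are assumptions.
resp <- GET("https://api.phish.net/v5/setlists.json",
            query = list(apikey = Sys.getenv("PHISHNET_API_KEY")))
setlists <- fromJSON(content(resp, as = "text"))$data

# Reshape to one row per show and one 0/1 column per song
show_song <- setlists %>%
  distinct(showid, song) %>%
  mutate(played = 1) %>%
  pivot_wider(names_from = song, values_from = played, values_fill = 0)
```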

We can then fit a model to this data to understand how the presence of a given song impacts user ratings. We’ll use a ridge regression, a model well-suited to the problem of multicollinearity – which in our case means certain songs being played together frequently (think Mike’s Song and Weekapaug Groove), making it difficult for more basic models to tease out the impact of each song individually.
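A minimal sketch of that fit, assuming the show-by-song matrix from above plus a matching vector of show ratings (this is not the author’s code, which is linked in the comments below):

```r
library(glmnet)

# X: the 0/1 show-by-song matrix; ratings: an assumed vector of show ratings
# aligned to the rows of X.
X <- as.matrix(show_song[, setdiff(names(show_song), "showid")])

# alpha = 0 selects the ridge (L2) penalty; cv.glmnet picks the penalty
# strength lambda by cross-validation, which is what keeps the coefficients
# of frequently co-played songs from blowing up.
fit <- cv.glmnet(X, ratings, alpha = 0)
song_coefs <- as.matrix(coef(fit, s = "lambda.min"))
```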

The output of the model looks something like this, with each song having a value (‘coefficient’) that represents the expected contribution of that song to a show’s rating. For example, we see Ghost is the top-rated song, with an expected rating contribution of ~0.18 (the presence of a jam is also worth 0.18). Another way to interpret this – if you took a “neutral” setlist (with a baseline rating of roughly 3.29) and added Ghost to it, we would expect the show’s rating to increase to 3.47.
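Continuing the same sketch, the per-song plus-minus values are just the fitted coefficients, and the interpretation is additive:

```r
# Turn the fitted coefficients into a sorted table of per-song effects
effects <- data.frame(song = rownames(song_coefs),
                      plus_minus = song_coefs[, 1]) %>%
  filter(song != "(Intercept)") %>%
  arrange(desc(plus_minus))
head(effects, 10)

# Interpretation is additive: a "neutral" show plus one song's coefficient,
# e.g. 3.29 + 0.18 = 3.47 for adding Ghost (the numbers quoted above).
```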

Looking at the songs with the strongest negative contribution to show ratings, we see Time Turns Elastic ranking last, with a contribution of -0.13.

But perhaps this isn’t the full story! For example, what if a song just happened to be played a lot during periods when the band was playing their best (and thus when shows were highly rated)? Put another way, in these cases it’s not the presence of the song in a setlist that contributed to a high rating, but rather that the band just sounded great at the time. We can help account for this by adding information about the year or tour to the model.
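One way to fold that era information into the same ridge fit (again a sketch, assuming a factor of show years that is not part of the original post’s code) is to append one-hot year columns to the song matrix:

```r
# show_years: an assumed factor giving each show's year, aligned with X
year_dummies <- model.matrix(~ 0 + show_years)

# Refit the same ridge model with year indicators alongside the song columns,
# so a song played mostly in strong years no longer absorbs that era's credit.
fit_year <- cv.glmnet(cbind(X, year_dummies), ratings, alpha = 0)
```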

Re-running the model, now controlling for year, we can again look at the top songs. The results are largely the same, but we do see some differences. For example, My Soul falls from 0.07 to 0.03 in this model, largely because it was played most frequently in 1997 and 1998, when the band was at their best.

Looking at the bottom-ranked songs, we see that AC/DC Bag is no longer at the very bottom (going from -0.06 to -0.03), in large part because it was frequently played from 1988-1990, when shows tended to be lower rated – something this model now takes into account.

By adding ‘year’ as a feature to the model, we can also look at the impact that each year has on user ratings (independent of the impact from the quality of the setlists). Here we see 1997, 1998, and 2017 as the three “best” years, all with a contribution of 0.17.

We can perform the same exercise with specific tours, rather than years – here we see the 1997 Fall Tour as far and away the most impactful single tour, with a contribution of 0.22.

And here’s a look at the bottom-ranked tours:

Here’s a look at the plus-minus rating for the 40 most-played songs since 1984, using this model:

Is all this modeling worth the effort? Why not just look at the average rating for shows where a given song is played? We do indeed see that these two measures are correlated (r-squared = ~0.4), but there are plenty of edge cases.
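That comparison can be reproduced roughly from the earlier snippets (a sketch, reusing the assumed objects above):

```r
# Average rating of shows containing each song, compared to its plus-minus
avg_by_song <- sapply(colnames(X), function(s) mean(ratings[X[, s] == 1]))
cor(avg_by_song[effects$song], effects$plus_minus)^2  # squared correlation
```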

For example, shows where Fluffhead is played hold an average rating of 3.85, which is in the bottom 30% of all shows – largely because the song was played so frequently during the late 1980s, when average ratings are low. But the plus-minus impact Fluffhead has is roughly 0.10, which is in the top 10% of all songs. Going back to our basketball analogy, Fluffhead has a low average rating because it “tended to play on bad teams”, but we can consider Fluffhead an “elite player” because of its estimated impact – in other words, these shows would have rated even worse without Fluffhead. Indeed, on a year-by-year basis, shows with Fluffhead tend to rate higher than shows without it (“on the court vs. off the court”).
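That year-by-year “on the court vs. off the court” check can be done straight from the show table. A sketch, assuming a data frame with a year, a rating, and a 0/1 Fluffhead column (the names here are invented):

```r
# Compare mean ratings of shows with and without Fluffhead, within each year
shows %>%
  group_by(year, fluffhead_played = Fluffhead == 1) %>%
  summarise(mean_rating = mean(rating), n_shows = n(), .groups = "drop")
```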

Finally, how do we know if this model is any good? When we say Ghost is the most impactful song, how do we know whether that’s accurate? We can test this by training a model on a subset of our data, and trying to predict the rating of some shows we set aside.
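A simple version of that check (a sketch, not necessarily the author’s exact procedure): hold out a random set of shows, fit on the rest, and score the held-out predictions.

```r
set.seed(42)
test_idx <- sample(nrow(X), size = round(0.2 * nrow(X)))  # hold out ~20% of shows

# Fit on the training shows only, then predict the held-out ratings
fit_train <- cv.glmnet(X[-test_idx, ], ratings[-test_idx], alpha = 0)
pred <- predict(fit_train, newx = X[test_idx, ], s = "lambda.min")

# Out-of-sample r-squared on the held-out shows
r2 <- 1 - sum((ratings[test_idx] - pred)^2) /
  sum((ratings[test_idx] - mean(ratings[test_idx]))^2)
r2
```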

As visualized below, the vanilla model (trained without year or tour information) is moderately predictive of user ratings, with an r-squared of 0.28 – meaning about 28% of the variance in user ratings from show to show can be explained solely by the songs that appear on the setlist.

When we conduct the same exercise with tour information encoded in the model, we see a modest bump in predictiveness, with r-squared going up to 0.36 – meaning about 36% of the variance in user ratings from show to show can be explained by the tour a show was a part of, as well as the songs that appear on the setlist.

That’s still a lot of unexplained variance! Which, to me, speaks to the beauty of this band. No matter how a tour has been going, and no matter what songs are set to be played on a given night, there’s always a chance you can have your mind blown.

To see ratings for all songs, please check out this spreadsheet.

If you liked this blog post, one way you could "like" it is to make a donation to The Mockingbird Foundation, the sponsor of Phish.net. Support music education for children, and you just might change the world.


Comments

comment by c_wallob
c_wallob This is epic. I've never felt so seen...
comment by davethewall
davethewall well done!

I'm surprised that Hydrogen doesn't fare better, but I guess H contributes little above and beyond Mike's and Weekapaug.

or maybe we have evidence that (Mike's groove with H) < (Mike's groove with something other than H)
comment by Wookatmee
Wookatmee Very cool stuff.

I'm curious what a 4.0-only analysis looks like -- it would remove songs from years ago that are rarities or outright shelved, bring in newer material, and better reflect their current songbook (acknowledging that it's always a moving target, with new material in 2023 and TAB tour hinting more to come this summer).

From a forward-looking perspective, it would be interesting to get a plus/minus for that period and see how predictive it is for the rest of 2024.
comment by Ignace
Ignace PhD worthy, for sure! Yale, Berkeley & Harvard must have headhunted you by now. Your model enables people to see where they diverge from the average Phish Phan. Speaking for myself, songs like Fluffhead & Hood make a show less attractive, but I see my enthusiasm for Lizards, Weekapaug, YEM etc. is confirmed.
So very well done!
comment by deceasedlavy
deceasedlavy This is incredibly cool but what are these "show ratings" that seem to be a key element of this study? Where does one find these?
comment by LoveMusicOTW
LoveMusicOTW No wonder I have lost my shit when I hear first chords of Ghost…especially as set closer over last couple of years.
comment by Outlive
Outlive I can't express adequately in words how much I love this.
comment by ryansmith534
ryansmith534 @deceasedlavy said:
This is incredibly cool but what are these "show ratings" that seem to be a key element of this study? Where does one find these?
These are the user ratings submitted here on Phish.net!
comment by ryansmith534
ryansmith534 @Wookatmee said:
Very cool stuff.

I'm curious what a 4.0-only analysis looks like -- it would remove songs from years ago that are rarities or outright shelved, bring in newer material, and better reflect their current songbook (acknowledging that it's always a moving target, with new material in 2023 and TAB tour hinting more to come this summer).

From a forward-looking perspective, it would be interesting to get a plus/minus for that period and see how predictive it is for the rest of 2024.
I added a tab to the spreadsheet linked above that has the ratings from just 4.0 shows. Obviously gets a little noisier with smaller sample size -- let me know what you think!
comment by youenjoymyghost
youenjoymyghost so ghost is the best song...lets all take note that I have the most correct favorite song
comment by jwlabrec
jwlabrec As a scientist who works in the pharmaceutical industry, a devout phish fan going back to ‘94, and essentially a nerd, I love this. It really shows how data can be interpreted to tell different stories; it’s not manipulated (though the show ratings are very subjective).

I will say that TTE gets a bum rap. It is written like a mathematical formula, and there was a time when I would crank the studio versions after years of being starved. It’s like Petrichor - it shouldn’t be jammed, just played. Sparingly.

I’ll tell you what’s bad - seeing 50th shows and estimating a combination of Mike’s Groove/Possum at half. Love em all, but don’t need to hear another live version of any of them.
comment by rmc2123
rmc2123 Great job and thanks so much for sharing. This is inspiring, and I did not know there was an API. Great to see a fellow R user on here also. Would love to see the code!
comment by Kesslerc14
Kesslerc14 This is incredible. Nerd phish at its best
comment by ryansmith534
ryansmith534 @rmc2123 said:
Great job and thanks so much for sharing. This is inspiring, and I did not know there was an API. Great to see a fellow R user on here also. Would love to see the code!
Here's the code! Not super pretty as I just brute forced my way through: https://pastebin.com/20pgwtnd
comment by Mazegue
Mazegue Data gold. This is amazing.
comment by Lydos
Lydos This is the greatest thing I have seen in a long time. Pure research, put out for peer review. I love it, I wish all research were like this. Best fans in the world.
comment by RollinInMyGrego
RollinInMyGrego The answer is 12/11/99.

This is amazing. Similar idea to Song Batting Average Thread ... but much more advanced.
comment by johnmd750
johnmd750 This is amazing. Great analysis and great read. Cool to see statistical analyses to explain stuff. Keep them coming!
Definitely in the running for plenary session at next year's Phish Studies Conference in Oregon!
comment by jasn1001
jasn1001 Great work!
comment by tweezeher
tweezeher @deceasedlavy said:
This is incredibly cool but what are these "show ratings" that seem to be a key element of this study? Where does one find these?
Phish.net!
comment by paulj
paulj Cool! I tried something similar for all modern shows (2009 and later) and had trouble coming up with the set of "core" songs that influence ratings. Good job! We missed you at the Phish Conference in Oregon; this would have fit right in!