Final Project: Python Data Analysis

Caden Truelick

February 28, 2023

DAT 301, Professor Mahzabeen

Imports and Viewing the Data

  • In this report, I analyze a PGA Tour dataset (2010-2018) to find correlations between variables such as strokes gained, average driving distance, fairway accuracy, average score, points, wins, money, and more!

  • Describing Variables:

    • Player Name: Name of the player
    • Rounds: Total rounds played in the year
    • Fairway Percentage: % of tee shots that land in the fairway
    • Year: The year in which the statistics were collected
    • Avg Distance: Average distance (yards) of all tee shots on par 4s and par 5s
    • gir: % of greens hit in regulation (reaching the green in at least two strokes fewer than par)
    • Average Putts: Average number of putts per round
    • Average Scrambling: % of the time that a player makes par or better after missing GIR
    • Average Score: Average score for each round
    • Points: Total FedExCup points earned in the season
    • Wins: Total number of wins in the season
    • Top 10: Total number of top-10 finishes in the season
    • Average SG Putts: Average number of strokes gained or lost on the greens for each round
    • Average SG Total: Average number of strokes gained or lost total for each round
    • SG:OTT: Average number of strokes gained or lost off the tee for each round
    • SG:APR: Average number of strokes gained or lost on approach shots for each round
    • SG:ARG: Average number of strokes gained or lost around the green for each round
    • Money: Total money (dollars) won in the season
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

pgadata = pd.read_csv('pgaTourData.csv')
In [2]:
pgadata.head()
Out[2]:
Player Name Rounds Fairway Percentage Year Avg Distance gir Average Putts Average Scrambling Average Score Points Wins Top 10 Average SG Putts Average SG Total SG:OTT SG:APR SG:ARG Money
0 Henrik Stenson 60.0 75.19 2018 291.5 73.51 29.93 60.67 69.617 868 NaN 5.0 -0.207 1.153 0.427 0.960 -0.027 $2,680,487
1 Ryan Armour 109.0 73.58 2018 283.5 68.22 29.31 60.13 70.758 1,006 1.0 3.0 -0.058 0.337 -0.012 0.213 0.194 $2,485,203
2 Chez Reavie 93.0 72.24 2018 286.5 68.67 29.12 62.27 70.432 1,020 NaN 3.0 0.192 0.674 0.183 0.437 -0.137 $2,700,018
3 Ryan Moore 78.0 71.94 2018 289.2 68.80 29.17 64.16 70.015 795 NaN 5.0 -0.271 0.941 0.406 0.532 0.273 $1,986,608
4 Brian Stuard 103.0 71.44 2018 278.9 67.12 29.11 59.23 71.038 421 NaN 3.0 0.164 0.062 -0.227 0.099 0.026 $1,089,763

Cleaning the Data

  • Replacing NaNs with 0s for Top 10 and Wins
  • Removing commas from Points and Money
  • Removing dollar signs from Money
  • Changing variable type to integer for Top 10, Wins, Rounds, and Points
  • Changing variable type to float for Money
In [3]:
# Replace NaNs in Top 10 and Wins with 0 and make them int.
# Assign the result back: calling fillna(inplace=True) on a selected column
# does not reliably update the frame in newer pandas versions.
pgadata['Top 10'] = pgadata['Top 10'].fillna(0).astype(int)

pgadata['Wins'] = pgadata['Wins'].fillna(0).astype(int)

#Drop rest of NaNs
pgadata.dropna(axis = 0, inplace = True)

#Change Rounds to int
pgadata['Rounds'] = pgadata['Rounds'].astype(int)

# Remove the commas from Points and make it an int
pgadata['Points'] = pgadata['Points'].str.replace(',', '', regex=False).astype(int)

# Remove $ and commas from Money and make it a float
pgadata['Money'] = (pgadata['Money']
                    .str.replace('$', '', regex=False)
                    .str.replace(',', '', regex=False)
                    .astype(float))

pgadata.head()
Out[3]:
Player Name Rounds Fairway Percentage Year Avg Distance gir Average Putts Average Scrambling Average Score Points Wins Top 10 Average SG Putts Average SG Total SG:OTT SG:APR SG:ARG Money
0 Henrik Stenson 60 75.19 2018 291.5 73.51 29.93 60.67 69.617 868 0 5 -0.207 1.153 0.427 0.960 -0.027 2680487.0
1 Ryan Armour 109 73.58 2018 283.5 68.22 29.31 60.13 70.758 1006 1 3 -0.058 0.337 -0.012 0.213 0.194 2485203.0
2 Chez Reavie 93 72.24 2018 286.5 68.67 29.12 62.27 70.432 1020 0 3 0.192 0.674 0.183 0.437 -0.137 2700018.0
3 Ryan Moore 78 71.94 2018 289.2 68.80 29.17 64.16 70.015 795 0 5 -0.271 0.941 0.406 0.532 0.273 1986608.0
4 Brian Stuard 103 71.44 2018 278.9 67.12 29.11 59.23 71.038 421 0 3 0.164 0.062 -0.227 0.099 0.026 1089763.0
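
The cleaning steps above can also be collected into a single function. The following is a minimal sketch, not the notebook's own code: `clean_pga` is a hypothetical helper, and the three-row frame is made-up data that only mimics the raw column formats.

```python
import pandas as pd

def clean_pga(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of a frame shaped like pgaTourData.csv (hypothetical helper)."""
    df = raw.copy()
    # A missing Wins / Top 10 value means "none that season", so fill with 0.
    for col in ['Wins', 'Top 10']:
        df[col] = df[col].fillna(0).astype(int)
    # Drop rows that are still missing any other statistic.
    df = df.dropna(axis=0).copy()
    df['Rounds'] = df['Rounds'].astype(int)
    # Strip thousands separators and the dollar sign before converting.
    df['Points'] = df['Points'].str.replace(',', '', regex=False).astype(int)
    df['Money'] = (df['Money']
                   .str.replace('$', '', regex=False)
                   .str.replace(',', '', regex=False)
                   .astype(float))
    return df

# Made-up rows for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    'Player Name': ['Player A', 'Player B', 'Player C'],
    'Rounds': [60.0, 109.0, None],
    'Points': ['1,250', '430', None],
    'Wins': [None, 1.0, None],
    'Top 10': [5.0, 3.0, None],
    'Money': ['$3,100,000', '$950,500', None],
})

cleaned = clean_pga(toy)
print(cleaned.dtypes)
```

Working on a copy keeps the raw frame available for comparison, which makes it easy to check how many rows the `dropna` step removed.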

Drive for Dough. Putt for Show.

  • Strokes Gained measures how well a player performs on a given type of shot relative to the rest of the field.

    • SG:OTT measures how many strokes a player gains on the field from tee shots
    • SG:ARG measures how many strokes a player gains on the field from shots around the green
    • Average SG Putts measures how many strokes a player gains on the field from putts
    • Average SG Total measures how many strokes a player gains on the field across ALL shots
  • In a correlation matrix, a value close to 1 or -1 indicates a strong correlation, and a value close to 0 indicates a weak one.

  • The final column of the matrix (Average Score) holds the most interesting results.

  • The correlation between Strokes Gained: Total and Average Score is -0.96, which is very close to -1. This makes sense because Strokes Gained: Total is essentially a player's score measured against the field.
  • The most interesting comparison is between SG:OTT, SG:ARG, and SG:Putts. Strokes Gained: Off the Tee has the strongest correlation with Average Score (-0.53, versus -0.39 and -0.36), which suggests that driving has the most impact on shooting low scores. This puts the saying "Drive for show. Putt for dough." to rest!
In [4]:
corrdata = pgadata[['Average SG Total', 'SG:OTT','SG:ARG','Average SG Putts', 'Average Score']]
corr = corrdata.corr()
corr.style.background_gradient(cmap='coolwarm')
Out[4]:
                  Average SG Total     SG:OTT     SG:ARG  Average SG Putts  Average Score
Average SG Total          1.000000   0.540811   0.408012          0.388954      -0.962385
SG:OTT                    0.540811   1.000000  -0.198754         -0.219593      -0.530372
SG:ARG                    0.408012  -0.198754   1.000000          0.256419      -0.391596
Average SG Putts          0.388954  -0.219593   0.256419          1.000000      -0.364698
Average Score            -0.962385  -0.530372  -0.391596         -0.364698       1.000000
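
`DataFrame.corr()` computes the Pearson correlation coefficient by default. As a minimal sketch of what each entry in the matrix means, the snippet below computes the coefficient by hand on made-up numbers (not taken from the dataset) and compares it with the pandas result. The helper `pearson_r` and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Made-up values: higher strokes gained paired with lower average scores.
toy = pd.DataFrame({
    'SG': [-0.8, -0.3, 0.0, 0.4, 0.9],
    'Score': [72.1, 71.6, 71.2, 70.9, 70.1],
})

r_manual = pearson_r(toy['SG'], toy['Score'])
r_pandas = toy['SG'].corr(toy['Score'])
print(round(r_manual, 4), round(r_pandas, 4))
```

The sign is negative because a lower score is better in golf, so gaining strokes goes together with a lower Average Score.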

Drive for Dough. Putt for Show.

  • These scatter plots show Average Score against Strokes Gained: Off the Tee (left) and Average Score against Strokes Gained: Putting (right).
  • Previously, we found that the correlation coefficient between SG:OTT and Average Score was -0.53, a moderate negative correlation, while the coefficient between SG:Putts and Average Score was -0.36, a weaker negative correlation.
  • In many cases, the better a player is off the tee, the fewer shots they take.
  • In putting, the relationship is looser and there are some outliers.
  • Graph Info: In HTML format, you can hover over a point to see which player it belongs to and what their Strokes Gained and Average Score were.
In [5]:
fig = make_subplots(rows=1,cols=2, 
                    subplot_titles=('Average Score vs. SG: Off the Tee', 'Average Score vs. SG: Putting'), #add subplot titles
                    y_title='Average Score') #add master y title

# Avg Score vs. SG:OTT scatter plot
fig.add_trace(
    go.Scatter(x=pgadata['SG:OTT'], 
               y=pgadata['Average Score'], 
               mode='markers', 
               hovertext=pgadata['Player Name'],
               showlegend=False,
               name='Score v. SG:OTT'),
    row=1, col=1)

# Avg Score vs. SG:Putt scatter plot
fig.add_trace(
    go.Scatter(x=pgadata['Average SG Putts'], 
               y=pgadata['Average Score'], 
               mode='markers', 
               hovertext=pgadata['Player Name'],
               showlegend=False,
               name='Score v. SG:Putt'),
    row=1, col=2)

#add x labels
fig['layout']['xaxis']['title']='Strokes Gained: Off the Tee'
fig['layout']['xaxis2']['title']='Strokes Gained: Putting'

fig.update_xaxes(
    range=[-2,2])

fig.update_yaxes(
    range=[68,75])


fig.show()

Understanding the Graph

  • Green Zone: Players who gained shots off the tee and averaged under par (below 72)
  • Blue Zone: Players who lost shots off the tee but averaged under par
  • Yellow Zone: Players who gained shots off the tee but averaged over par
  • Red Zone: Players who lost shots off the tee and averaged over par

  • When looking at this graph, it is interesting to see how few players reside in the Yellow Zone. That goes to show that players who gain shots off the tee also typically have lower scores.

  • The player with the highest average score who still gained shots off the tee was Derek Ernst

    • Average Score: 72.593
    • Strokes Gained (Off the tee): 0.312
  • The player with the highest average score who still gained shots putting was Steven Bowditch
    • Average Score: 74.262
    • Strokes Gained (Putting): 0.131
In [6]:
fig.add_shape(type='rect',
             xref='x1',
             yref='y1',
             x0=0, x1=2,
             y0=68, y1=72,
             fillcolor='green',
             line_color='green',
             opacity=0.1)

fig.add_shape(type='rect',
             xref='x1',
             yref='y1',
             x0=-2, x1=0,
             y0=68, y1=72,
             fillcolor='blue',
             line_color='blue',
             opacity=0.1)

fig.add_shape(type='rect',
             xref='x1',
             yref='y1',
             x0=0, x1=2,
             y0=72, y1=75,
             fillcolor='yellow',
             line_color='yellow',
             opacity=0.1)

fig.add_shape(type='rect',
             xref='x1',
             yref='y1',
             x0=-2, x1=0,
             y0=72, y1=75,
             fillcolor='red',
             line_color='red',
             opacity=0.1)

fig.show()
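
The zones and the player lookups described above amount to simple filters on `SG:OTT` and `Average Score`. The following is a minimal sketch that follows the rectangle coordinates in the cell above, with 72 as the assumed par cutoff. `zone` is a hypothetical helper and the rows are made-up, not taken from the dataset.

```python
import pandas as pd

PAR = 72  # assumed cutoff, matching the boundary between the lower and upper rectangles

def zone(sg_ott: float, avg_score: float) -> str:
    """Label a player-season with the colour of the rectangle it falls in (hypothetical helper)."""
    if avg_score < PAR:
        return 'green' if sg_ott > 0 else 'blue'
    return 'yellow' if sg_ott > 0 else 'red'

# Made-up rows for illustration only.
toy = pd.DataFrame({
    'Player Name': ['Player A', 'Player B', 'Player C', 'Player D', 'Player E'],
    'SG:OTT': [0.45, -0.30, 0.10, -0.60, 0.25],
    'Average Score': [69.8, 71.2, 72.6, 73.4, 70.5],
})

toy['Zone'] = [zone(s, a) for s, a in zip(toy['SG:OTT'], toy['Average Score'])]
print(toy['Zone'].value_counts())

# Highest average score among players who still gained strokes off the tee.
gainers = toy[toy['SG:OTT'] > 0]
worst_gainer = gainers.loc[gainers['Average Score'].idxmax()]
print(worst_gainer['Player Name'], worst_gainer['Average Score'])
```

On the real data, the same two steps would be applied to `pgadata` instead of `toy`. Swapping `SG:OTT` for `Average SG Putts` gives the putting version of the lookup.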

Strokes Gained Off the Tee over time (2010-2018)

In [7]:
sorted_df = pgadata.sort_values('Year',ascending=True)
fig = px.scatter(sorted_df, x='SG:OTT', y='Average Score', animation_frame='Year', hover_name='Player Name')

fig.layout.updatemenus[0].buttons[0].args[1]['frame']['duration'] = 1000
fig.layout.updatemenus[0].buttons[0].args[1]['transition']['duration'] = 1

fig.show()

Strokes Gained: Off the Tee, Fairway %, and Distance

  • Total strokes gained off the tee is the best way to really know how good a player is with their driver.
  • This section compares SG: Off the Tee with Fairway % and Average Distance.
  • The correlation coefficient between Strokes Gained: Off the Tee and Fairway % is only 0.18, which is a very weak correlation.
  • The correlation coefficient between Strokes Gained: Off the Tee and Average Distance is 0.60, more than three times as large as the coefficient for Fairway %.
  • So it appears that accuracy does not matter as much as distance on the PGA Tour.
In [8]:
corrdata = pgadata[['SG:OTT','Fairway Percentage', 'Avg Distance']]
corr = corrdata.corr()
corr.style.background_gradient(cmap='coolwarm')
Out[8]:
                      SG:OTT  Fairway Percentage  Avg Distance
SG:OTT              1.000000            0.179917      0.603395
Fairway Percentage  0.179917            1.000000     -0.534017
Avg Distance        0.603395           -0.534017      1.000000
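
To put the two SG:OTT coefficients side by side, the arithmetic can be sketched as below. The values are copied from the matrix above. Squaring a Pearson coefficient gives the share of variance explained by a one-variable linear fit; that interpretation is added here for illustration and was not part of the original analysis.

```python
# Coefficients copied from the correlation matrix above.
r_fairway = 0.179917   # SG:OTT vs. Fairway Percentage
r_distance = 0.603395  # SG:OTT vs. Avg Distance

# How many times larger the distance coefficient is.
ratio = r_distance / r_fairway

# Share of SG:OTT variance explained by each variable on its own.
r2_fairway = r_fairway ** 2
r2_distance = r_distance ** 2

print(f"ratio of coefficients: {ratio:.2f}")
print(f"r^2 fairway: {r2_fairway:.3f}, r^2 distance: {r2_distance:.3f}")
```

Because Fairway Percentage and Avg Distance are themselves negatively correlated (-0.53), the two shares overlap and should not simply be added together.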

Grip It and Rip It

  • Looking at the scatter plots below, we can see that Strokes Gained: Off the Tee has a much more defined correlation with distance than with accuracy.
  • The farther you hit it, the more strokes you tend to gain on the PGA Tour.
  • Hitting it straighter than most doesn't have much of a benefit.
In [9]:
fig = make_subplots(rows=1,cols=2, 
                    subplot_titles=('Fairway % vs. Strokes Gained: Off the Tee', 'Average Distance vs. Strokes Gained: Off the Tee'), #add subplot titles
                    x_title='Strokes Gained: Off the Tee') #add master x title

# Fairway % vs. SG:OTT scatter plot
fig.add_trace(
    go.Scatter(x=pgadata['SG:OTT'], 
               y=pgadata['Fairway Percentage'], 
               mode='markers', 
               hovertext=pgadata['Player Name'],
               showlegend=False,
               name='Fairway % v. SG:OTT'),
    row=1, col=1)

# Avg Dist vs. SG:OTT scatter plot
fig.add_trace(
    go.Scatter(x=pgadata['SG:OTT'], 
               y=pgadata['Avg Distance'], 
               mode='markers', 
               hovertext=pgadata['Player Name'],
               showlegend=False,
               name='SG:OTT vs. Avg Dist'),
    row=1, col=2)

#add y labels
fig['layout']['yaxis']['title']='Fairway Percentage'
fig['layout']['yaxis2']['title']='Average Distance'

fig.update_xaxes(
    range=[-2,2])

fig.show()