In [1]:
import numpy as np
import matplotlib.pyplot as plot
import seaborn as sb
import h5py as h5

%matplotlib inline

Load the three data files, which I downloaded from https://www.indatabet.com/free.html and then modified.

In [2]:
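# Open the files read-only; the mode is given explicitly so the behaviour
# does not depend on the h5py version's default.
football = h5.File('football.h5', 'r')
basketball = h5.File('basketball.h5', 'r')
hockey = h5.File('hockey.h5', 'r')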
We'll start by plotting the number of scoring events per match for each of the three sports. For simplicity, we will assume for now that every scoring event in basketball is worth 3 points. Since many baskets are worth fewer points, this underestimates the number of scoring events, if anything.
I have truncated the hockey histogram at 14 goals, since there were a couple of extremely high-scoring games (e.g. https://en.wikipedia.org/wiki/2016_Kontinental_Hockey_League_All-Star_Game), and I have also removed any games that were settled via penalties or overtime (since the total score is then always odd). I don't make these cuts when calculating the rate of upsets.
In [14]:
totalScoresFootball = (football['Hgoals'])[...]+(football['Agoals'])[...]
footballMean = np.mean(totalScoresFootball)

totalScoresHockey = (hockey['Hgoals'])[...]+(hockey['Agoals'])[...]
noOvertimeMask = ~(hockey['overtimeOrPenalties'])[...]
hockeyMean = np.mean(totalScoresHockey[noOvertimeMask])

totalScoresBasketball = ((basketball['Hscore'])[...]+(basketball['Ascore'])[...])/3.
basketballMean = np.mean(totalScoresBasketball)
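To make the two adjustments above concrete, here is a minimal sketch on made-up scores. The arrays and their names are hypothetical, not values from the data files. It shows how a boolean mask drops the overtime games and how dividing by 3 gives a lower bound on the number of basketball scoring events. It also assumes the overtime flag is stored as booleans; if it were stored as integers, `~` would act bitwise, so the sketch casts to `bool` first.

```python
import numpy as np

# Hypothetical hockey results (illustration only).
toy_home_goals = np.array([3, 2, 4, 5])
toy_away_goals = np.array([1, 1, 3, 2])
toy_overtime = np.array([0, 1, 1, 0])  # 1 = settled in overtime or on penalties

toy_totals = toy_home_goals + toy_away_goals

# Cast to bool before negating so that ~ is a logical NOT, not a bitwise one.
toy_no_overtime = ~toy_overtime.astype(bool)

# Mean number of goals over the games decided in regulation time only.
toy_hockey_mean = np.mean(toy_totals[toy_no_overtime])

# Hypothetical basketball result: 90-84 gives 174 points in total.
# If every basket were worth 3 points, that is at least 58 scoring events.
toy_basketball_events = (90 + 84) / 3.
```

Both overtime games in this toy sample have an odd total (3 and 7), which is the reason given above for leaving them out of the histogram.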
In [26]:
fig,axs = plot.subplots(1,3)

axs = axs.reshape((-1))

# density=True normalises each histogram to unit area
# (the older normed=True keyword has been removed from matplotlib).
axs[0].hist(totalScoresFootball,bins=15,density=True,color='b',alpha=0.6,histtype='stepfilled',label='Football')
axs[1].hist(totalScoresBasketball,bins=15,density=True,color='g',alpha=0.6,histtype='stepfilled',label='Basketball')
axs[2].hist(totalScoresHockey[noOvertimeMask],bins=15,density=True,color='r',alpha=0.6,histtype='stepfilled',range=(0,14),label='Ice Hockey')

# Mark the mean of each distribution, rounded to two decimals in the legend.
axs[0].axvline(footballMean,color='k',label='Mean = {:.2f}'.format(footballMean))
axs[1].axvline(basketballMean,color='k',label='Mean = {:.2f}'.format(basketballMean))
axs[2].axvline(hockeyMean,color='k',label='Mean = {:.2f}'.format(hockeyMean))

for ax in axs:
    ax.legend()
    ax.set_xlabel('Scoring Events',size=14)

fig.set_size_inches(18,5)
Now we can answer our question. We'll trust that the bookies know what they're doing, and we'll say that a team is expected to lose if a bet on that team stands to return more than three times as much as a bet on the other team. Odds are stated in decimal form (e.g. odds of 5.09 would return £50.90 for a £10 bet).
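As a quick illustration of how decimal odds work, the sketch below reproduces the £10 example and applies the factor-of-three criterion to one made-up match. The helper names `payout` and `implied_probability` and the example odds are my own assumptions, and reading 1/odds as a probability ignores the bookmaker's margin, so it is only approximate.

```python
def payout(stake, decimal_odds):
    """Total amount returned (stake included) by a winning bet at decimal odds."""
    return stake * decimal_odds

def implied_probability(decimal_odds):
    """Rough win probability implied by decimal odds, ignoring the bookmaker's margin."""
    return 1.0 / decimal_odds

# The example from the text: a £10 bet at odds of 5.09.
example_return = payout(10.0, 5.09)

# Hypothetical match: the home side is the clear favourite because a bet on
# the away side would return more than three times as much.
home_odds, away_odds = 1.40, 4.50
home_is_clear_favourite = 3.0 * home_odds < away_odds
away_is_clear_favourite = home_odds > 3.0 * away_odds
```

With these odds the away side is expected to lose, so an away win would count as an unexpected result.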
In [67]:
def countUnexpectedResults(homeOdds,awayOdds,homeScore,awayScore):
    
    expectedHomeWin = 3.*homeOdds < awayOdds
    expectedAwayWin = homeOdds > 3.*awayOdds
    
    homeWin = homeScore > awayScore
    awayWin = homeScore < awayScore
    
    unexpectedResult = (expectedHomeWin & awayWin) | (expectedAwayWin & homeWin)
    
    return np.count_nonzero(unexpectedResult)
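To check that the function behaves as intended, it can be run on a handful of made-up games. The odds and scores below are hypothetical and chosen so that each case is easy to verify by hand; the function body is copied from the cell above so that the sketch runs on its own.

```python
import numpy as np

def countUnexpectedResults(homeOdds, awayOdds, homeScore, awayScore):
    # Same logic as the cell above, repeated here to keep the sketch self-contained.
    expectedHomeWin = 3.*homeOdds < awayOdds
    expectedAwayWin = homeOdds > 3.*awayOdds
    homeWin = homeScore > awayScore
    awayWin = homeScore < awayScore
    unexpectedResult = (expectedHomeWin & awayWin) | (expectedAwayWin & homeWin)
    return np.count_nonzero(unexpectedResult)

# Four hypothetical games:
#   game 0: home is the clear favourite, away wins     -> upset
#   game 1: away is the clear favourite, home wins     -> upset
#   game 2: no clear favourite, home wins              -> not counted
#   game 3: home is the clear favourite, home wins     -> expected result
toyHomeOdds = np.array([1.2, 5.0, 2.0, 1.3])
toyAwayOdds = np.array([6.0, 1.4, 2.1, 5.5])
toyHomeScore = np.array([0, 2, 1, 3])
toyAwayScore = np.array([1, 0, 0, 1])

nToyUpsets = countUnexpectedResults(toyHomeOdds, toyAwayOdds,
                                    toyHomeScore, toyAwayScore)
```

Games without a clear favourite, like the third one here, can never count as upsets, and neither can draws, since both `homeWin` and `awayWin` are false for them.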
In [68]:
nUnexpectedFootball = countUnexpectedResults((football['Hodds'])[...],
                       (football['Aodds'])[...],
                       (football['Hgoals'])[...],
                       (football['Agoals'])[...])

nUnexpectedBasketball = countUnexpectedResults((basketball['Hodds'])[...],
                       (basketball['Aodds'])[...],
                       (basketball['Hscore'])[...],
                       (basketball['Ascore'])[...])

nUnexpectedHockey = countUnexpectedResults((hockey['Hodds'])[...],
                       (hockey['Aodds'])[...],
                       (hockey['Hgoals'])[...],
                       (hockey['Agoals'])[...])
Ironically, the number of unexpected results in the sample will itself be approximately Poisson distributed, so a count of $N$ carries an error of roughly $\pm \sqrt{N}$.
We can now calculate the rate of unexpected results.
In [69]:
nTotalFootball = len(totalScoresFootball)
nTotalBasketball = len(totalScoresBasketball)
nTotalHockey = len(totalScoresHockey)

rateUnexpectedFootball = nUnexpectedFootball/float(nTotalFootball)
rateUnexpectedBasketball = nUnexpectedBasketball/float(nTotalBasketball)
rateUnexpectedHockey = nUnexpectedHockey/float(nTotalHockey)

ruFootballError = np.sqrt(nUnexpectedFootball)/float(nTotalFootball)
ruBasketballError = np.sqrt(nUnexpectedBasketball)/float(nTotalBasketball)
ruHockeyError = np.sqrt(nUnexpectedHockey)/float(nTotalHockey)
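As a sanity check on the $\sqrt{N}$ error, the sketch below compares it with the binomial standard error $\sqrt{p(1-p)/N_\mathrm{total}}$ for a made-up sample. The counts are hypothetical and chosen only for illustration, not taken from the data files. The two estimates differ by a factor of $\sqrt{1-p}$, so they agree closely whenever upsets are rare.

```python
import numpy as np

# Hypothetical counts (illustration only).
n_unexpected = 400
n_total = 10000

rate = n_unexpected / float(n_total)

# Poisson approximation, as used above: the count has error sqrt(N),
# so the rate has error sqrt(N) / N_total.
poisson_error = np.sqrt(n_unexpected) / float(n_total)

# Binomial standard error on the same rate.
binomial_error = np.sqrt(rate * (1.0 - rate) / n_total)

# Ratio of the two estimates; equals sqrt(1 - rate).
ratio = binomial_error / poisson_error
```

For a rate of a few percent the two error bars are nearly indistinguishable, so the simpler Poisson form should be adequate under that assumption.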
In [75]:
fig,ax = plot.subplots(1,1)

ax.errorbar([0],[rateUnexpectedFootball],[ruFootballError],color='b',fmt='|',linewidth=8)
ax.errorbar([1],[rateUnexpectedBasketball],[ruBasketballError],color='g',fmt='|',linewidth=8)
ax.errorbar([2],[rateUnexpectedHockey],[ruHockeyError],color='r',fmt='|',linewidth=8)

plot.ylabel('Probability of Unexpected Result',size=15)
plot.xticks(range(3),['Football','Basketball','Hockey'],size=15)

fig.set_size_inches(18,5)
So, the rate of upsets seems to have no correlation with the number of scoring events per game (or the bookies are just better at predicting the outcome of football matches).