Predicting shot played from text commentary

Train machine learning models to predict the shot played by batters from text commentary.
Data science
Cricket

Introduction

A lot of ball-by-ball information, such as batter, bowler, runs scored, wickets fallen and text commentary, is available for all forms of major cricket matches since the mid-2000s, but more micro-level information is lacking. For example, an important feature of each ball is the shot played by the batter. Knowing this would allow analysis of performance, risk (runs vs wickets) and other measures broken down by shot type. We’d also understand trends in shots played over the years, across different players, in different phases of a match and in different match types.

While text commentary is widely available, the shot played is currently available only for a subset of matches: international matches played in India in recent years, and the recent IPL seasons. In this project, I attempted to train models that learn to predict the shot played from the text commentary associated with each ball.

A few examples

Here are a few examples from the 2022 IPL final between Gujarat Titans and Rajasthan Royals, with commentary from ESPNCricinfo and the shot played from IPLT20.com. As you can see, some balls have the shot played explicitly mentioned in the commentary, while some don’t. Some balls have a not-entirely-accurate shot tagged. For example, “shorter and into the ribs, Buttler drags a hook behind square on the leg side for one” has the shot tagged as Pull Shot, while Hook Shot would have been more appropriate. So, we go into this project with the understanding that such inaccuracies exist, and that any predictions coming out of a model trained on this dataset would carry these inaccuracies and caveats with them.

143.1ks, much fuller and snaking in to hit the pad from round the stumps, the sharp angle and movement was dragging it down the leg side. The result is a leg-bye for RR
Shot: On Drive
---
140.8ks, nice and full, seaming in to attack the stumps, whisked away to the right of mid-on
Shot: On Drive
---
good length and slanting in from round the stumps, but does not threaten off stump. So, Jaiswal ignores it in the channel
Shot: Left Alone
---
chipped away in the air...but lands in front of square leg or rather short midwicket where Rashid collects it on the bounce
Shot: Flick
---
dug in short an slanting back in to cramp the batter for room, Jaiswal somehow keeps it out
Shot: Cut Shot
---
full and pushed across off, left alone by Buttler
Shot: Left Alone
---
back of a length and skids back into middle and leg, jammed to the leg side off the inner half
Shot: Defence
---
short and wide of off, Buttler latches onto the width and scythes a cut between point and cover for four. Fetch that!
Shot: Cut Shot
---
better from Dayal. Good length and much closer to off stump, draws Buttler into a front-foot defence to short cover
Shot: Defence
---
shorter and into the ribs, Buttler drags a hook behind square on the leg side for one
Shot: Pull Shot
---
short and veering down the leg side, Jaiswal misses out on the pull and wears it on his thigh pad
Shot: Pull Shot

Here’s the distribution of shot types across all the balls available for training. Defence, On Drive, Coverdrive etc. are the most frequent while Inside Out, Paddle Sweep, Hook Shot, Reverse Sweep and Chip Shot are the rarest.

Most correlated words and phrases for each shot type

To understand the dataset a little better, I first looked at the top 10 words and phrases (unigrams, bigrams and trigrams) for each shot type, ranked by the chi-squared statistic between the TF-IDF-vectorized text commentary and the shot played.

This looks fairly promising. For each shot type, the top most correlated words and phrases are indeed very relevant to the shot type.


1. Defence
----------

Unigrams: around, long, deep, foot, blocked, blocks, forward, defend, defends, defended
Bigrams: stumps defended, and blocks, forward and, front foot, he defends, defended to, defended back, and defends, to defend, off defended
Trigrams: forward and blocks, defended on the, on the front, back to him, forward and defends, outside off defended, the stumps defended, defended back to, around off defended, the front foot

2. On Drive
-----------

Unigrams: point, flicks, cover, clipped, whipped, whips, flicked, on, midwicket, long
Bigrams: flicked to, full on, through midwicket, midwicket for, wide long, deep midwicket, mid on, to long, on for, long on
Trigrams: of long on, to mid on, to deep midwicket, over long on, towards long on, down to long, wide long on, on for one, long on for, to long on

3. Coverdrive
-------------

Unigrams: punches, driven, leg, slapped, drives, covers, drive, sweeper, extra, cover
Bigrams: cover sweeper, through cover, to sweeper, the cover, through covers, sweeper cover, to extra, cover for, deep cover, extra cover
Trigrams: straight to extra, the cover sweeper, over extra cover, deep extra cover, cover for one, to sweeper cover, deep cover for, extra cover for, to extra cover, to deep cover

4. Pull Shot
------------

Unigrams: deep, hook, bouncer, square, midwicket, swivels, short, pulls, pulled, pull
Bigrams: pull to, short on, pull and, pull but, pulled to, and pulls, short ball, pulls it, to pull, the pull
Trigrams: to pull and, short ball on, on the pull, deep square leg, short on the, the pull but, back and pulls, for the pull, pulled to deep, and pulls it

5. Off Drive
------------

Unigrams: midwicket, deep, leg, punched, drive, drives, off, driven, long, mid
Bigrams: towards mid, of mid, towards long, off driven, driven to, to mid, to long, off for, mid off, long off
Trigrams: towards mid off, wide long off, of mid off, driven to long, mid off for, towards long off, to mid off, off for one, long off for, to long off

6. Cut Shot
-----------

Unigrams: sweeper, slashes, slash, width, backward, chopped, chops, point, cuts, cut
Bigrams: short and, deep point, cut and, and cuts, cut away, cut but, backward point, cuts it, to cut, the cut
Trigrams: outside off cut, back and cuts, on the cut, goes to cut, the cut but, to cut and, to deep point, for the cut, and cuts it, short and wide

7. Flick
--------

Unigrams: fine, whipped, flick, clips, flicks, clipped, leg, flicked, pads, square
Bigrams: pads clipped, flicked to, clipped away, pads flicked, backward square, through square, leg for, deep square, the pads, square leg
Trigrams: the pads flicked, leg for one, deep backward square, backward square leg, flicked to deep, through square leg, deep square leg, to deep square, square leg for, on the pads

8. Left Alone
-------------

Unigrams: sways, channel, shoulders, called, lets, bouncer, ducks, leaves, left, alone
Bigrams: stump left, the channel, it alone, channel left, leaves it, alone outside, lets it, it go, off left, left alone
Trigrams: in the channel, length ball in, leaves it alone, left alone outside, the channel left, channel left alone, alone outside off, lets it go, outside off left, off left alone

9. Leg Glance
-------------

Unigrams: glanced, strays, glances, tucked, down, side, pads, flick, fine, leg
Bigrams: down the, the pads, to fine, short fine, the flick, wide down, fine leg, leg side, the leg, down leg
Trigrams: on the pads, off the hip, but down the, short fine leg, misses the flick, to fine leg, to short fine, wide down the, the leg side, down the leg

10. Straight Drive
------------------

Unigrams: smacks, arrow, striker, slot, non, ground, straight, sightscreen, head, bowler
Bigrams: driven back, back past, his head, over his, the sightscreen, straight back, head for, back over, the bowler, bowler head
Trigrams: into the sightscreen, head for six, past the bowler, back over the, bowler head for, down the ground, over his head, back over his, the bowler head, over the bowler

11. Glide
---------

Unigrams: steer, steers, guide, guides, guided, face, opens, steered, man, third
Bigrams: off steered, the face, runs this, third for, runs it, opens the, man for, to third, deep third, third man
Trigrams: third for one, outside off steered, runs it down, deep third for, opens the face, down to third, man for one, third man for, to third man, to deep third

12. Sweep Shot
--------------

Unigrams: leg, sweeping, slog, paddle, square, knee, fine, swept, sweeps, sweep
Bigrams: short fine, swept away, sweep but, sweep and, swept to, sweeps it, and sweeps, sweeps this, to sweep, the sweep
Trigrams: on one knee, to sweep but, looks to sweep, goes to sweep, the sweep but, and sweeps it, on the sweep, to sweep and, swept to deep, for the sweep

13. Square Drive
----------------

Unigrams: punched, deep, punches, width, face, opens, backward, sweeper, steers, point
Bigrams: the point, driven square, point to, point and, behind point, through point, square drive, point for, backward point, deep point
Trigrams: outside off square, off sliced away, point sweeper for, square drive to, off he opens, deep point for, point for one, deep backward point, through point for, to deep point

14. Slog Sweep
--------------

Unigrams: fetches, perishes, slogs, midwicket, pastes, knee, swept, sweeps, sweep, slog
Bigrams: and slog, knee and, with slog, one knee, sweep over, slog swept, another slog, slog sweeps, the slog, slog sweep
Trigrams: goes to slog, to slog sweep, sweep over midwicket, slog sweep but, slog sweep over, another slog sweep, with slog sweep, slog sweep and, slog sweeps it, the slog sweep

15. Reverse Sweep
-----------------

Unigrams: switched, swept, sweeps, stance, switches, premeditates, reverses, switch, sweep, reverse
Bigrams: reverse and, off reverse, for reverse, switch hit, to reverse, another reverse, reverse swept, reverse sweeps, the reverse, reverse sweep
Trigrams: tries the reverse, reverse sweep this, reverse sweep to, to reverse sweep, another reverse sweep, out the reverse, reverse sweep but, for the reverse, reverse sweep and, the reverse sweep

16. Slog Shot
-------------

Unigrams: toss, backpedaling, context, juice, swings, cannons, hoick, slog, win, heave
Bigrams: slog this, tewatia goes, is dragged, row over, high full, while running, go leg, swings at, another wild, has swing
Trigrams: heaved to deep, gets that off, to that good, dragging it to, high full toss, to slog this, has swing but, knee to slog, to go leg, half to deep

17. Scoop
---------

Unigrams: paddle, fine, across, cheeky, shuffles, laps, scooped, ramp, scoops, scoop
Bigrams: laps it, scoop over, scoop but, the ramp, over fine, the scoop, scoops it, scoop this, and scoops, to scoop
Trigrams: scoops it over, trying to scoop, across to scoop, to scoop but, across and scoops, over short fine, and scoops it, scoop this over, over fine leg, to scoop this

18. Late Cut
------------

Unigrams: dabbing, man, chop, face, opens, deft, dabs, cut, late, third
Bigrams: quicker outside, flat quick, the late, dabs it, deft touch, the face, late cuts, to late, late cut, short third
Trigrams: for the late, fine of short, of short third, looks to late, short third man, towards short third, the late cut, short third for, late cuts it, to short third

19. Upper Cut
-------------

Unigrams: instinctively, man, arches, ramped, bouncer, uppercuts, ramps, upper, uppercut, ramp
Bigrams: ramp but, and upper, to upper, ramps it, the upper, to uppercut, uppercut this, upper cut, to ramp, ramp it
Trigrams: to upper cut, for the upper, ramp it away, tries to upper, looks to ramp, uppercut this over, to uppercut this, ramp it over, the upper cut, to ramp it

20. Sliced Over Point
---------------------

Unigrams: sliced, effect, yorker, throws, slicing, uncontrolled, slices, backspins, scythes, slice
Bigrams: and backspins, running from, slices drive, low wide, wide yorker, slice it, from vyshak, over backward, over point, to slice
Trigrams: sees the batter, slice it over, ball and carves, low wide full, scythes it over, wide yorker but, full and wide, to slice it, over point for, over backward point

21. Inside Out
--------------

Unigrams: fashioned, inside, ample, building, extra, makes, backs, belt, room, manufactures
Bigrams: makes room, room early, made ample, chip over, swinging room, goes inside, over extra, manufactures swinging, inside out, out over
Trigrams: it over extra, out over the, out over cover, but clears the, goes inside out, room and looks, over extra cover, out over extra, manufactures swinging room, inside out over

22. Paddle Sweep
----------------

Unigrams: laps, type, conventional, sweeps, paddled, sweep, delicate, fine, paddle, paddles
Bigrams: bumrah at, to paddle, sweeps again, sweeps from, and paddle, paddle sweeps, the paddle, and paddles, paddle sweep, paddles it
Trigrams: for the paddle, leg side buttler, across and paddles, paddle it fine, paddle sweep but, this to fine, and paddle sweeps, paddle sweeps this, and paddles it, the paddle sweep

23. Hook Shot
-------------

Unigrams: hooks, wayyyy, hovering, smoothly, turtles, hooked, adventurous, aplomb, help, srk
Bigrams: third tier, with aplomb, head it, to help, help it, and swivels, to hook, around head, and help, some shot
Trigrams: fine leg it, the third tier, at the head, short ball another, the attempted pull, shortish ball around, dribbles away to, boundary for six, bouncer on the, it along with

24. Reverse Scoop
-----------------

Unigrams: grille, yeah, ramped, rvdd, dink, switches, funky, scoop, reverse, ramp
Bigrams: to reverse, the reverse, scoop it, ramp this, beaten he, reverse scoop, ramp but, reverse pull, and reverse, reverse ramp
Trigrams: for the reverse, short third but, over third man, scoop it over, man but misses, is beaten he, to reverse scoop, tries the reverse, goes to reverse, the reverse ramp

25. Chip Shot
-------------

Unigrams: chipping, innocuous, 100kph, 99kph, uppishly, chips, pocket, entirely, hobbles, chipped
Bigrams: walking single, half step, hit uppishly, chipped uppishly, takes half, off flicks, timing that, lofted to, not entirely, chipped over
Trigrams: him over mid, air but he, leading edge on, for walking single, legcutter on length, and dug out, flicks it uppishly, go through with, away to make, lofted to long

Evaluation of multiple models

To make some initial inroads into the problem, I evaluated three models on TF-IDF-vectorized text commentary with unigrams, bigrams and trigrams: Random Forest classifier, Linear SVC and Multinomial Naive Bayes, using 5-fold cross-validation. I chose these three because they run fairly quickly and give a good first impression. Of the three, Linear SVC does best, with around 70% accuracy.
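
A comparison like this can be sketched with scikit-learn pipelines and cross_val_score. The toy data, the candidates dictionary, the compare_models helper and the Random Forest settings below are assumptions for illustration, not the project's actual configuration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data invented for illustration; the real input is per-ball commentary.
fielders = ["cover", "point", "mid on", "mid off",
            "midwicket", "square leg", "long on", "long off"]
texts = ([f"forward and defends to {f}" for f in fielders]
         + [f"short ball pulled to {f}" for f in fielders])
labels = ["Defence"] * len(fielders) + ["Pull Shot"] * len(fielders)

# Hypothetical settings; the write-up does not list the ones actually used.
candidates = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Linear SVC": LinearSVC(),
    "Multinomial NB": MultinomialNB(),
}


def compare_models(texts, labels, folds=5):
    """Return the mean cross-validated accuracy of each candidate model."""
    results = {}
    for name, model in candidates.items():
        # Vectorizing inside the pipeline keeps each fold's vocabulary
        # restricted to that fold's training split.
        pipeline = make_pipeline(TfidfVectorizer(ngram_range=(1, 3)), model)
        scores = cross_val_score(pipeline, texts, labels, cv=folds, scoring="accuracy")
        results[name] = float(scores.mean())
    return results


results = compare_models(texts, labels)
for name, accuracy in results.items():
    print(f"{name}: {accuracy:.3f}")
```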

I also trained an XGBoost model with a validation set and early stopping. This resulted in an accuracy of around 69%, which is very similar to Linear SVC. However, XGBoost takes longer to train, so I chose to go ahead with Linear SVC and perform hyperparameter tuning.

Hyperparameter tuning

I tuned the TF-IDF vectorizer settings of the Linear SVC pipeline using GridSearchCV to find the best parameters.

Best Hyperparameters: {'vectorizer__max_features': None, 'vectorizer__min_df': 1, 'vectorizer__ngram_range': (1, 3)}
Best CV Score: 0.7095704763787206
Test Accuracy: 0.7173983389062668

The best parameter set keeps every term regardless of how few documents it appears in (min_df of 1), places no cap on the vocabulary size, and uses unigrams, bigrams and trigrams together as features. However, the gain from leaving the vocabulary size uncapped is very small, so to keep the computation from taking up too much RAM, I chose max_features as 10000.
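
A grid search over the vectorizer settings might look like the sketch below. The step name vectorizer follows the parameter keys reported above; the step name classifier, the candidate values in param_grid, the split sizes and the toy data are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy data invented for illustration only.
fielders = ["cover", "point", "mid on", "mid off",
            "midwicket", "square leg", "long on", "long off"]
texts = ([f"forward and defends to {f}" for f in fielders]
         + [f"short ball pulled to {f}" for f in fielders])
labels = ["Defence"] * len(fielders) + ["Pull Shot"] * len(fielders)

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LinearSVC()),
])

# Hypothetical candidate values; only the winning combination is reported above.
param_grid = {
    "vectorizer__max_features": [None, 10000],
    "vectorizer__min_df": [1, 2],
    "vectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)
test_accuracy = search.score(X_test, y_test)

print("Best Hyperparameters:", search.best_params_)
print("Best CV Score:", search.best_score_)
print("Test Accuracy:", test_accuracy)
```

Capping the vocabulary afterwards is then a one-argument change to the vectorizer, for example TfidfVectorizer(ngram_range=(1, 3), max_features=10000).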

Linear SVC

The precision and recall values for each shot type show that the model performs pretty well on shots that have a lot of support, like Defence, On Drive and Pull Shot. However, some shots have zero precision and recall (Chip Shot and Hook Shot), and some have very low values (Late Cut, Slog Shot and Sliced Over Point).

Accuracy: 0.7026210764750297
Precision: 0.5591104200516839
Recall: 0.44190912927180775

Shot               precision    recall  f1-score   support

Chip Shot               0.00      0.00      0.00         6
Coverdrive              0.66      0.70      0.68       951
Cut Shot                0.59      0.63      0.61       542
Defence                 0.80      0.89      0.84      2109
Flick                   0.51      0.50      0.50       462
Glide                   0.57      0.56      0.57       239
Hook Shot               0.00      0.00      0.00        10
Inside Out              1.00      0.05      0.09        21
Late Cut                0.17      0.05      0.08        58
Left Alone              0.83      0.86      0.84       440
Leg Glance              0.48      0.50      0.49       315
Off Drive               0.70      0.66      0.68       817
On Drive                0.71      0.75      0.73      1258
Paddle Sweep            0.44      0.19      0.27        21
Pull Shot               0.79      0.82      0.80       933
Reverse Scoop           1.00      0.11      0.20         9
Reverse Sweep           0.89      0.94      0.91        90
Scoop                   0.76      0.68      0.72        65
Sliced Over Point       0.18      0.07      0.10        44
Slog Shot               0.12      0.02      0.04        89
Slog Sweep              0.48      0.35      0.40       104
Square Drive            0.44      0.19      0.26       166
Straight Drive          0.56      0.43      0.49       279
Sweep Shot              0.66      0.67      0.67       195
Upper Cut               0.64      0.44      0.52        48

accuracy                                    0.70      9271
macro avg               0.56      0.44      0.46      9271
weighted avg            0.69      0.70      0.69      9271
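
The headline Precision and Recall figures match the macro avg row, which gives every shot type equal weight regardless of how often it occurs. That is why they sit well below the accuracy. The short sketch below recomputes both averages for recall from the per-shot values in the report; the function names are illustrative only.

```python
# (recall, support) for each shot type, copied from the report above.
report = {
    "Chip Shot": (0.00, 6), "Coverdrive": (0.70, 951), "Cut Shot": (0.63, 542),
    "Defence": (0.89, 2109), "Flick": (0.50, 462), "Glide": (0.56, 239),
    "Hook Shot": (0.00, 10), "Inside Out": (0.05, 21), "Late Cut": (0.05, 58),
    "Left Alone": (0.86, 440), "Leg Glance": (0.50, 315), "Off Drive": (0.66, 817),
    "On Drive": (0.75, 1258), "Paddle Sweep": (0.19, 21), "Pull Shot": (0.82, 933),
    "Reverse Scoop": (0.11, 9), "Reverse Sweep": (0.94, 90), "Scoop": (0.68, 65),
    "Sliced Over Point": (0.07, 44), "Slog Shot": (0.02, 89),
    "Slog Sweep": (0.35, 104), "Square Drive": (0.19, 166),
    "Straight Drive": (0.43, 279), "Sweep Shot": (0.67, 195), "Upper Cut": (0.44, 48),
}


def macro_average(rows):
    """Plain mean over classes: rare and common shots count equally."""
    return sum(value for value, _ in rows.values()) / len(rows)


def weighted_average(rows):
    """Mean over classes weighted by support: common shots dominate."""
    total = sum(support for _, support in rows.values())
    return sum(value * support for value, support in rows.values()) / total


print(f"macro recall:    {macro_average(report):.2f}")     # 0.44
print(f"weighted recall: {weighted_average(report):.2f}")  # 0.70
```

The weighted recall equals the overall accuracy, since each class's recall times its support is the number of correctly classified balls of that class.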

Confusion matrix

The confusion matrix reveals which shots get categorized as which shots when they are incorrectly identified. We see that Chip Shot gets classified as Off Drive, which isn’t a bad prediction per se. In any case, its support is so low that failing to identify it correctly is insignificant. Coverdrive is identified correctly 70% of the time, but otherwise it’s identified as Cut Shot (not bad), Off Drive (not bad), or Defence (not so good). A couple of other examples are Leg Glance getting identified as Flick (totally fine) 25% of the time, and Square Drive getting identified as Coverdrive (not bad) or Cut Shot (not bad) 56% of the time.

This suggests that even when a shot isn’t identified exactly, the incorrect guess is often reasonable. A lenient accuracy score that accepts such near misses would be higher than 70%.
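
One way to make the idea of a lenient score concrete is sketched below. The ACCEPTABLE mapping is an assumption built from the confusions described above as "not bad" or "totally fine"; it is not a definitive list, and the function name and toy labels are illustrative.

```python
# Hypothetical map: true shot -> predictions treated as near enough.
ACCEPTABLE = {
    "Chip Shot": {"Off Drive"},
    "Coverdrive": {"Cut Shot", "Off Drive"},
    "Leg Glance": {"Flick"},
    "Square Drive": {"Coverdrive", "Cut Shot"},
}


def lenient_accuracy(y_true, y_pred, acceptable=ACCEPTABLE):
    """Share of predictions that are exact or an accepted near miss."""
    hits = sum(
        pred == true or pred in acceptable.get(true, set())
        for true, pred in zip(y_true, y_pred)
    )
    return hits / len(y_true)


# Toy labels for illustration only.
y_true = ["Leg Glance", "Coverdrive", "Coverdrive", "Defence"]
y_pred = ["Flick", "Coverdrive", "Defence", "Defence"]

strict = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
lenient = lenient_accuracy(y_true, y_pred)
print(strict, lenient)  # 0.5 0.75
```

The mapping is deliberately one-directional: accepting Flick for a Leg Glance does not automatically accept Leg Glance for a Flick unless that pair is listed too.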

This table lists all the shots in order of their frequency in the dataset along with their accuracy values and the second best guesses. In many cases, the second best guess is quite acceptable.

Shot               Test  Train  Correct %  Best Guess      Second Best Guess
Defence            2109   8436       89.0  Defence         On Drive
On Drive           1258   5029       75.4  On Drive        Defence
Coverdrive          951   3806       70.2  Coverdrive      Cut Shot
Pull Shot           933   3732       81.8  Pull Shot       On Drive
Off Drive           817   3268       65.6  Off Drive       Coverdrive
Cut Shot            542   2168       63.1  Cut Shot        Coverdrive
Flick               462   1847       49.8  Flick           Leg Glance
Left Alone          440   1761       85.7  Left Alone      Leg Glance
Leg Glance          315   1258       49.5  Leg Glance      Flick
Straight Drive      279   1116       42.7  Straight Drive  Off Drive
Glide               239    955       56.5  Glide           Defence
Sweep Shot          195    781       67.2  Sweep Shot      Slog Sweep
Square Drive        166    662       18.7  Coverdrive      Cut Shot
Slog Sweep          104    416       34.6  Slog Sweep      On Drive
Reverse Sweep        90    362       94.4  Reverse Sweep   Defence
Slog Shot            89    358        2.2  On Drive        Pull Shot
Scoop                65    262       67.7  Scoop           Pull Shot
Late Cut             58    232        5.2  Cut Shot        Defence
Upper Cut            48    190       43.8  Upper Cut       Cut Shot
Sliced Over Point    44    178        6.8  Coverdrive      Cut Shot
Inside Out           21     86        4.8  Coverdrive      Inside Out
Paddle Sweep         21     84       19.0  Sweep Shot      Paddle Sweep
Hook Shot            10     40        0.0  Pull Shot       Flick
Reverse Scoop         9     34       11.1  Reverse Sweep   Reverse Scoop
Chip Shot             6     22        0.0  Off Drive       Flick
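
A table like this can be derived from the true and predicted labels on the test set. The sketch below assumes that the best and second best guesses are the two most frequently predicted labels for each true shot type, which is a guess at how the table was built. The function name and toy labels are illustrative.

```python
from collections import Counter, defaultdict


def guess_table(y_true, y_pred):
    """For each true shot type, count predictions and rank the top two."""
    predicted = defaultdict(Counter)
    for true, pred in zip(y_true, y_pred):
        predicted[true][pred] += 1

    rows = {}
    for shot, counts in predicted.items():
        total = sum(counts.values())
        ranked = [label for label, _ in counts.most_common(2)]
        rows[shot] = {
            "test": total,
            "correct_pct": round(100 * counts[shot] / total, 1),
            "best": ranked[0],
            "second": ranked[1] if len(ranked) > 1 else None,
        }
    return rows


# Toy labels for illustration only.
y_true = ["Square Drive"] * 6 + ["Defence"] * 4
y_pred = (["Coverdrive"] * 3 + ["Cut Shot"] * 2 + ["Square Drive"]
          + ["Defence"] * 3 + ["On Drive"])

for shot, row in guess_table(y_true, y_pred).items():
    print(shot, row)
```

A shot type whose best guess differs from its own name, such as Square Drive here, is one the model misclassifies more often than it gets right.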

Next steps

Armed with this model, we can now take text commentary for all matches from ESPNCricinfo and predict the shot types. Of course, we have to make sure that the commentary for each ball is not too short, in which case a random shot might be predicted because of incomplete information. Perhaps an idea would be to augment the training set with such examples and have a dummy shot type, so that such data in unseen datasets can be identified correctly.
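
Until such a dummy class is trained, a simple guard at prediction time could flag commentary that is too short to classify. This is a minimal sketch: the UNKNOWN label, the min_tokens threshold of 4 and the predict callable are all hypothetical, and the stub stands in for the trained pipeline.

```python
UNKNOWN = "Unknown"  # hypothetical dummy shot type, not part of the dataset


def predict_with_fallback(commentaries, predict, min_tokens=4):
    """Return UNKNOWN for very short commentary, otherwise predict(text)."""
    results = []
    for text in commentaries:
        if len(text.split()) < min_tokens:
            results.append(UNKNOWN)
        else:
            results.append(predict(text))
    return results


def stub_predict(text):
    """Stand-in for a trained model, for illustration only."""
    return "Pull Shot" if "pull" in text else "Defence"


balls = [
    "no run",
    "short and veering down the leg side, misses out on the pull",
    "forward and defends to cover",
]
print(predict_with_fallback(balls, stub_predict))
# ['Unknown', 'Pull Shot', 'Defence']
```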