Heuristics for Winner Prediction in International Cricket Matches

Cricket is popularly known as the gentleman's game. It was introduced to the world by England and has since become the second most popular game. A few data mining and analytical techniques have been proposed for cricket. In this work, two scenarios are considered for predicting the winning team based on several parameters. The scenarios correspond to two standard formats of the game, namely one-day international (ODI) cricket and twenty-twenty (T-20) cricket. The two prediction approaches differ in the types of parameters considered and in the corresponding functional strategies: one predicts the winner of an ODI match and the other predicts the winner of a T-20 match. Separate approaches are proposed because the strategies adopted by teams and individual players vary between the two formats. Each proposed strategy has been evaluated individually against existing benchmark works, and in each case it outperforms the rest in terms of prediction accuracy. The proposed heuristics are both efficient and accurate for prediction on cricket data.


Introduction
Cricket, the gentleman's game, was introduced to the world by the British. It has gone through several modifications since its introduction and is now played in three formats: Test cricket (a five-day format), ODI (one-day international), and T-20 (an aggressive format comprising a total of 40 overs). Among these, the ODI and T-20 formats have gained great popularity and have attracted the attention of billions of viewers and numerous business firms. Many densely populated countries are involved in the game, among them India, Pakistan, England, Australia, and Sri Lanka. Predicting the winning team of a forthcoming match is crucial in three important respects:
• It is essential for a country seeking to strengthen its team.
• It is essential for a business firm deciding where to invest.
• It is essential for the team itself in adopting new strategies.
Data mining and machine learning have been adopted as essential tools for tasks such as classification, recognition, data analysis, and prediction in the field of computing. Numerous techniques have been proposed in this context ([1], [2], [3], [4], [5], [6]) across research areas (medical, weather, big data, education, business, banking, etc.). Recent studies present highly efficient heuristics in almost every research domain. However, there remain domains with considerable scope for implementing these data mining approaches and analyzing the outcomes, and the game of cricket is one of them. Several data mining techniques have been proposed for different aspects of a cricket match, yet proposing an efficient prediction heuristic remains a challenging task. In [7], a statistics-based model is presented for selecting suitable players for a particular team. The past performance of the players is taken as the basis: the batting, bowling, and overall statistics of individual players are considered, restricted to the last five matches. On a Hadoop setup, an accuracy of 91% is reported, but only for Indian players. Another method of player selection is proposed in [8], using a neural network approach. The authors consider historical match statistics from 1985 to 2006, with progressive training and testing on four different sets of data. In this work too, only recent player performance during the World Cup is taken into account. However, the characteristics of debutant players cannot be analyzed, as the method ignores them. In citea3, the powerplay characteristics of a cricket match are analyzed irrespective of the match format.
The analysis takes into account the difference between the score when there is a powerplay during the match and the score when there is none. The themes of this work include the various powerplay formats, the benefits of the powerplay for the batting team, the benefits of the powerplay for the bowling team, and the nature of a match without a powerplay. However, powerplay strategies vary between the ODI and T-20 formats. A match with no powerplay is also a hypothetical situation that may not suit every model of analysis. A cricket outcome predictor is presented in [10], which predicts ODI outcomes. The features considered include the nature of the match (day or day/night), the index of the innings (1st/2nd), and the fitness of the teams. The classifiers used are Naive Bayes, support vector machine (SVM), and random forest (RF). By combining the outcomes of these three classifiers, a tool named COP (cricket outcome predictor) has been developed. However, quantifying some of the features considered is a tedious task, and the work does not predict the outcome of a T-20 match. In [11], a forecasting model is proposed for runtime prediction of the outcome of an ODI match, with logistic regression as the basic tool. The prediction uses a minimal number of features, because features of low importance were eliminated using a cross-validation technique. In [12], the importance and usefulness of commercial betting on cricket matches is studied. The authors suggest that a profit of 20% is achievable if betting follows the fall of wickets during the match. Monte Carlo estimation is used for the purpose.
In [13], a predictive tool is presented for the Test match format. A Test match is played between two teams over five days, with around 90 overs played each day. The authors use a probabilistic approach to predict the final outcome, considering twelve different precondition parameters, and logistic regression is again the basic tool. In [14], online social information is used to predict the top-ranking players and teams in cricket; the future trend of a match is generated from the data trending on social media. In [15], two different themes are merged into a model for predicting the outcome of an ODI match. Based on various parameters pertaining to the first and second innings of a match (50 overs each), the outcomes are generated at runtime using linear regression and Naive Bayes. An initial accuracy of 68% is reported, which gradually increases to 91%.

Related Works
In [23], a scheme is proposed for mining association rules using principal component analysis (PCA), exclusively for cricket matches. The authors propose a framework for establishing correlations between pieces of cricket statistics and frequent patterns, intended to help in devising and improving coaching strategies. In [24], the same association rule mining is applied to the strategic planning of teams during ICC-2015. Several decisive parameters such as match venue, toss outcome, rank order of a batsman, strike rate, and score economy are analyzed. In [25], performance data mining is presented for the New Zealand cricket team, taking into consideration all historical data on New Zealand versus other teams from 1975 onwards. In [26], a combination of several modern classification techniques is analyzed for predicting ODI outcomes, using Naive Bayes, support vector machines (SVM), and random forest (RF).

Motivation and Objectives
So far, numerous techniques have been proposed for several aspects of cricket. However, limitations remain that need to be addressed with efficient heuristics. The literature indicates that a single prediction strategy may not yield a useful prediction for both formats of the game, because teams play the two formats with strategically distinct approaches. Thus, two distinct schemes are needed for the two formats (ODI and T-20) of the same game. In this work, an attempt is made to estimate the winning probability of a team for each of the two formats. The objective is to devise suitable prediction strategies for the ODI and T-20 versions of the game separately, giving utmost priority to optimality in their design. The paper is organized as follows. The subsequent sections present the background, the proposed heuristics, and the experimental analysis in sequence. The final section concludes the paper.

Background
The background characteristics are as follows.
• The game of cricket is played between two teams, each fielding eleven players.
• A toss is held at the beginning, and the team winning it chooses whether to bat or to field first.
• For a team, the result of a game can be a win, a loss, or a draw (abandoned matches are not considered).
• For the bowling team, an over refers to delivering the ball six times in accordance with the rules on bowling action. An invalid delivery (a wide or a no-ball) results in extra runs awarded to the opposing (batting) team.
• The score made by the team batting first has to be chased by the opposing team. If the chase is completed successfully, the chasing team is declared the winner; otherwise it is the losing team.
• A one-day match is played in a single day. It comprises a total of one hundred overs, with each team batting and bowling for fifty overs each.
• A T-20 match, as the name suggests, comprises a total of forty overs, with each team batting and bowling for twenty overs each.

Proposed Scheme
The proposed scheme comprises two different approaches for the two formats of the game. These are discussed in turn.

Approach-1(One-day-cricket)
This is a statistical approach that considers the team's performance and the individual players' performance in the recent past. The direct way of making the prediction is to decide in favour of the team with the better overall good performance (GP). The GP can be defined as the sum of individual computations as given below:-
• tot_won is the total number of matches won by the team till date,
• tot_played is the total number of matches played by the team till date,
• tot_won_vs is the total number of matches won by the team versus the current opponent,
• tot_played_vs is the total number of matches played by the team versus the current opponent,
• IP_i is the individual performance of the i-th player of the team.
IP_i can be computed based on the formula as given in the equation below:-
where BP_i and WP_i are the batting performance and bowling performance of a player respectively. These parameters are given by the formula as described below:-
where w_1, w_2, and w_3 are the weight values, which can be computed using the function as given below,
where penalty is the penalty due to extra runs conceded through no-balls and wides. This can be computed as

$$\text{penalty} = \frac{\text{extra}}{\text{delivered} \times 6.0} \qquad (8)$$

where
• NFL is the number of fifties scored in the last ten matches,
• NFS is the number of fifties scored so far,
• NHL is the number of hundreds scored in the last ten matches,
• NHS is the number of hundreds scored so far,
• SR is the strike rate for the last ten matches,
• TSR is the overall career strike rate,
• NWT is the number of 3-wicket hauls taken in the last ten matches,
• NWF is the number of 5-wicket hauls taken in the last ten matches,
• extra is the total number of extra balls delivered to the current opponent,
• delivered is the total number of extra balls delivered so far in the player's career,
• the constant 6.0 appears in the denominator because there are six valid balls in an over.
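As a minimal sketch of the one quantity whose formula is stated explicitly, the penalty of equation (8) can be computed as below, reading the constant 6.0 as part of the denominator as the text describes. The function names, the guard for a player with no career extras, and the `win_ratio` helper are assumptions made for illustration only; in particular, treating the won and played counts as a ratio is a guess and not the paper's stated formula for GP.

```python
def extras_penalty(extra: float, delivered: float) -> float:
    """Penalty of equation (8): extra / (delivered * 6.0).

    extra     -- extra balls delivered to the current opponent
    delivered -- extra balls delivered so far in the career
    """
    if delivered <= 0:
        # Assumption: a bowler with no career extras incurs no penalty.
        return 0.0
    return extra / (delivered * 6.0)


def win_ratio(won: int, played: int) -> float:
    """Hypothetical helper: fraction of matches won, 0.0 if none played."""
    return won / played if played else 0.0


if __name__ == "__main__":
    print(extras_penalty(12, 4))   # 12 / (4 * 6.0)
    print(win_ratio(3, 4))         # overall or head-to-head counts
```

The same `win_ratio` helper could be applied to either the overall counts (tot_won, tot_played) or the head-to-head counts (tot_won_vs, tot_played_vs), if a ratio is indeed what the missing equations use.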

Approach-2(T-20)
This approach takes an aggression factor into consideration. A T-20 match involves a total of 40 overs, with each of the two teams batting and bowling for 20 overs, which induces a sense of aggression in the teams. As the number of overs is small, the players' individual performance has to account for an aggression factor (α for a batsman and β for a bowler). The equations for computing the overall performance of a team therefore also need to be altered. The corresponding set of equations is given below.
where α and β are the aggression factors. These can be assumed to be probabilistic values, as they depend to some extent on the outcome of the toss. The toss is especially important for predicting the winner of a T-20 match. Another influential factor is the condition of the pitch on which the match is played. Hence, these two parameters are taken as per the formula given below:

Prediction Probability
For each of the two approaches discussed in the previous sections, the prediction probability can now be found using the formula given in the equation below: The parameters α and β are included in the final probability calculation because the overall outcome of a match is considered to depend on them to some extent, irrespective of the format.

Case Study
The proposed heuristics are applied in case studies on the performance data of six distinct countries, covering both the ODI and T-20 formats. Random samples of matches won by each team are selected along with the attributes mentioned above, and the proposed schemes are applied individually. The prediction outcomes are obtained as probabilistic values. These values are scaled to 100 and plotted against the real outcomes of the corresponding matches. A threshold of 60% is set on the resulting probabilities, so that a prediction is considered valid if and only if it is above that threshold. The threshold is set to 60% rather than 50% to obtain a fair estimate of the efficiency of the schemes. The dataset available at www.kaggle.com is used for the purpose. The said website draws the dataset from ESPN Private Limited, a leading sports media company. Kaggle is a popular machine learning competition platform that offers a wide variety of challenges through which researchers can develop their skills. An overview of the dataset used for this case study is given in Table 1.
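The thresholding step described above can be sketched as follows. This is an illustrative reading only: it assumes the scheme returns a probability in [0, 1] for each sampled match that the team actually won, scales it to 100, and counts a prediction as valid when it lies strictly above the 60% threshold. The function name and the sample values are hypothetical and are not taken from the dataset.

```python
def threshold_accuracy(probabilities, threshold=60.0):
    """Percentage of sampled (won) matches whose scaled prediction
    lies strictly above the threshold."""
    if not probabilities:
        raise ValueError("at least one prediction is required")
    # Scale each probability in [0, 1] to a value out of 100.
    scaled = [p * 100.0 for p in probabilities]
    # A prediction is valid if and only if it is above the threshold.
    valid = sum(1 for s in scaled if s > threshold)
    return 100.0 * valid / len(scaled)


if __name__ == "__main__":
    sample = [0.9, 0.55, 0.61, 0.7]      # hypothetical predicted win probabilities
    print(threshold_accuracy(sample))    # 3 of 4 above 60 -> 75.0
```

Using a stricter threshold than 50% makes the reported accuracy more conservative, since predictions only slightly in favour of the eventual winner are not counted as valid.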

In Figure 1(a), it can be observed that the ODI and T-20 predictions most often fall above the threshold percentage. This shows that the scheme works satisfactorily, with accuracies of 85% and 90% for the two match formats. Similarly, the predictions for both formats are plotted for five other countries, namely Australia, India, Pakistan, West Indies, and Zimbabwe, in Figures 1(b)-1(f) respectively. All these plots show similarly satisfactory accuracies, which suggests that the proposed schemes are robust and efficient. The overall accuracy for each of the six teams is shown in Figure 1(g), for both the ODI and T-20 formats. The rates are in the range of 80-95%, which is satisfactory. The proposed schemes have also been compared with state-of-the-art machine learning (ML) schemes. These schemes were simulated on the same set of samples, and the results are compared with those of the proposed scheme in Table 2. The proposed scheme outperforms the rest by a good margin.
Table 2 (recoverable row): Proposed heuristics, Historical, 89.1, 88

Conclusion
Heuristics for efficient prediction of the winner of a cricket match in two different formats have been proposed. The proposed approaches consider the important aspects that directly or indirectly affect the outcome of a match. Statistical data are used to arrive at a single parametric value, which is then used in a suitably defined probabilistic function to predict the winning probability of a team. These approaches are unique in that they do not incorporate any predefined classifier. Test cases from a benchmark dataset have been used for evaluation. In comparison with other schemes that use benchmark classifiers, the proposed methods give better overall accuracy (89.1% for ODI and 88.33% for T-20). Future work may focus on devising a dynamic approach for live match prediction, taking the outcomes of these methods as a prior.