A game of cricket is played between 2 teams of 11 players each. The cricket ground consists of 2 concentric circles. Typically, the bowling team is allowed to field 4 fielders in the outer circle. During a powerplay, however, fielding restrictions allow only 2 fielders outside the inner circle. With this fielding arrangement, the batsmen are encouraged to score at a brisk pace. However, it is often found that in the process of accelerating their run-rate (runs scored per 6 balls bowled), the batting side loses a lot of its batsmen.


The aim of this analysis is to identify the factors that contribute to success during the powerplay overs.

Tools / Data Sources

Data Gathering

I acquired the data from the ESPNCricinfo website, manually pulling it into a CSV file.

Data Cleaning

The process involved de-duplication of games, teams, and scores.

Data Analysis

I used Tableau for the data analysis and visualization.



The graphs below represent the powerplay performance of the teams in the 2011 Cricket World Cup, held in India. The host nation, India, emerged victorious in the tournament.

The graph above shows that South Africa scored the most runs during the powerplay overs. However, they lost a lot of wickets too.


To build a Powerplay Performance Index, I created a weighted score based on the number of runs scored per wicket lost; the results are shown below:
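The index described above boils down to runs scored for every wicket lost. The actual analysis was done in Tableau; the following is a minimal Python sketch of the calculation, with illustrative team totals rather than the tournament figures:

```python
# Powerplay Performance Index: runs scored per wicket lost during the powerplay.
# The team totals below are illustrative placeholders, not tournament data.
def performance_index(runs, wickets):
    """Runs scored for every wicket lost; if no wickets fell, use the raw runs."""
    return runs / wickets if wickets else runs

teams = {"Team A": (312, 6), "Team B": (275, 11)}
index = {team: performance_index(r, w) for team, (r, w) in teams.items()}
print(index)  # {'Team A': 52.0, 'Team B': 25.0}
```

A higher index rewards sides that score quickly without losing batsmen, which is exactly the trade-off the powerplay creates.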



India has the most runs scored for every wicket lost. Zimbabwe is second with 52 runs for every wicket lost.

ICC Cricket World Cup Powerplay Analysis

Australia won the recently concluded World Cup held in Australia/New Zealand. To further analyze the difference between batting and bowling performance, it is important to understand three concepts:

  1. Batting/Bowling average: average number of runs scored while batting / runs conceded while bowling
  2. Run-rate: average number of runs scored per 6 balls (1 over)
  3. Economy rate: average number of runs conceded per 6 balls (1 over)
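The two rate metrics above are straightforward to compute. A minimal sketch (the figures here are illustrative, not from the tournament data; the actual analysis was done in Tableau):

```python
# Run-rate and economy rate as defined above, with overs as whole overs
# for simplicity. All figures are illustrative examples.
def run_rate(runs_scored, overs_faced):
    """Average runs scored per over (6 balls)."""
    return runs_scored / overs_faced

def economy_rate(runs_conceded, overs_bowled):
    """Average runs conceded per over (6 balls)."""
    return runs_conceded / overs_bowled

batting_rr = run_rate(93, 15)      # 6.2 runs scored per powerplay over
bowling_er = economy_rate(72, 15)  # 4.8 runs conceded per powerplay over
print(batting_rr - bowling_er)     # a positive gap favors the batting side
```

The graphs that follow compare exactly this kind of gap (batting minus bowling figures) across teams.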


As can be seen in the graph above, the difference between the batting and bowling averages of Australia is the largest. This indicates a better overall powerplay performance.



As can be seen in the graph above, the difference between the run-rate and economy rate of Australia is the largest. This further indicates a better overall powerplay performance.


The graph above shows the period during the 50 overs when teams opted for the batting/bowling powerplay. In the majority of games, teams opted for the powerplay at or after the 40th over. In the closing overs of a 50-over game, the batting team usually accelerates the scoring, and the powerplay fielding restrictions help that cause.


The graph above shows the fastest players during a powerplay. This score is calculated based on the strike rate: the average number of runs scored per 100 balls. Kevin O'Brien from Ireland emerges at the top.
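Strike rate has a simple formula: runs scaled to a 100-ball basis. A quick sketch (the innings figures are illustrative, not O'Brien's actual numbers):

```python
# Strike rate: average runs scored per 100 balls faced.
# The example innings below is illustrative, not a real player's figures.
def strike_rate(runs, balls_faced):
    return 100 * runs / balls_faced

print(round(strike_rate(113, 63), 1))  # 179.4 -- a rapid powerplay innings
```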


Finally, I analyzed AB de Villiers' performance against the West Indies when he scored the fastest century (100 runs) in cricket. Surprisingly, a part-time bowler, Gayle, was the most economical bowler.



  1. Batting is not the only determining factor for powerplay performance
  2. Australia has the largest difference between runs scored and runs conceded during the powerplay
  3. Australia scores at a faster rate and concedes runs at the best economy
  4. Using the main bowlers doesn't guarantee the best powerplay bowling


Evolution of players in NBA draft from 1987 to 2014


We took data to assess the physical and performance measurements of players drafted (first and second rounds of the draft) into the NBA from 1987 to 2014. Overall, the most interesting insight was that the average height of players is decreasing, as Centers are becoming smaller while Point Guards are becoming bigger (perhaps in line with the conclusions of the article "Is Basketball Becoming Positionless?"[1]). However, speed and vertical jumping ability have been improving over the years. All this leads us to believe that there is a trend toward drafting more nimble and versatile players.

[PDF version available at this link]

Average Height of players drafted:

On average, the height of NBA players has been decreasing over time. As the data shows, the average height of an NBA player has moved from 79 inches (6'7") to 77.5 inches (6'5.5").


Next, we assess trends in the height of players drafted at each position. A regression of height on draft year for each position gives the following coefficients:


Position         Coefficient
Center           -0.07
Power Forward    -0.02
Small Forward    -0.02
Shooting Guard   -0.03
Point Guard      +0.03



Each coefficient indicates how much, on average (in inches), the height of players at that position changes each year. For instance, Centers have been decreasing by 0.07 inches per year, whereas Point Guards' height is increasing by 0.03 inches per year. Overall, this shows that Centers are shrinking the fastest, Point Guards are the only position getting taller, and Small Forwards, Power Forwards, and Shooting Guards are getting slightly smaller.
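The per-position coefficients above are simply the slopes of a least-squares regression of height on draft year. A minimal sketch of that fit (using synthetic data, not the actual combine dataset):

```python
# Ordinary least-squares slope of height (inches) on draft year: the
# per-position "coefficient" discussed above. Data below is synthetic.
def slope(years, heights):
    n = len(years)
    mean_y = sum(years) / n
    mean_h = sum(heights) / n
    num = sum((y - mean_y) * (h - mean_h) for y, h in zip(years, heights))
    den = sum((y - mean_y) ** 2 for y in years)
    return num / den

# A Center cohort shrinking ~0.07 in/year would look roughly like this:
years = list(range(1987, 2015))
heights = [84.0 - 0.07 * (y - 1987) for y in years]
print(round(slope(years, heights), 2))  # -0.07
```

Run on the real data, the sign and magnitude of each slope tell you whether (and how fast) each position is trending taller or shorter.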



By position:




Power Forward:



Small Forward:



Shooting Guard:



Point Guard:



Average Wingspan of players drafted:


Given that an individual's wingspan is correlated with their height, it is not surprising that the wingspan of players is also decreasing along with their average height.




Average Weight of players drafted:


On average, players are 15 pounds lighter over the 26 years of data we've assessed.




Average Body Fat of players drafted:


Body fat, like weight, is also decreasing.



Max Vertical of players drafted:


Although players have, on average, been getting smaller, their jumping ability has been improving: over the past 15 years, the average max vertical has increased by about 5 inches.


Note: A straight-line regression was not used because only one data point existed for each year prior to 1998.




Average Sprint of players drafted:

A basketball court is 94 feet long, so a true 3/4 court would be 70.5 feet, but the test is standardized at 75 feet (22.86 meters). The 3/4-court sprint covers the distance from the end line to the opponent's free throw line in college basketball.


94-foot court, minus 15 feet from the free throw line to the backboard, minus 4 feet of overhang = 75 feet.


The 3/4 basketball court sprint is one of many tests at the NBA combine. As at the NFL combine, the vertical jump for explosive power is one of the tests. Other tests include the Kneeling Power Ball Throw, Lane Agility Drill, Multi-Stage Hurdle, and Max Touch test.


As can be seen in the charts, the average sprint time of players has increased very slightly.




Average Bench of players drafted:


The bench test results for players have remained (on average) constant over the years.



Average Agility of players drafted:


The Lane Agility Drill run at the Pre-Draft Camp is pretty basic.

The player starts on the baseline, sprints to the free throw line, slides (like defensive slides) to the opposite elbow, backpedals to the baseline, and slides back to the starting spot. The goal is to complete it as fast as possible; the league measures the results in seconds. As can be seen in the chart, players are getting faster over time.








Big Data and Boxing


For this semester, I chose to explore the possibilities of incorporating big data into the sport of boxing. As an avid boxing fan and novice fighter myself, the idea of marrying big data with boxing was extremely exciting and thought-provoking. Many critics have declared boxing a dying sport in recent years. There is a lack of box office and pay-per-view attractions (namely athletes), with the exception of a select few. Similar combat sports, such as mixed martial arts (MMA), have taken center stage. With the sport on the decline, I asked: what could big data do to help revitalize the sport? Are there big data methods in other sports, such as basketball and football, that we can draw inspiration from?

Part 1: Big Data and Boxing Introduction


I broke my project down into three mini PowerPoint presentations. For my first presentation, I gave a brief overview and background of the current state of boxing. I discussed why boxing is (or rather, appears to be) on the decline, its public relations, how the sport generates its revenues, and a compare/contrast analysis with other sports. I then discussed three areas in which big data can be integrated into boxing, what the goals for each of these areas should be, and how they would benefit the sport.

Part 2: Existing Wearable Technologies


For my second presentation, I researched existing wearable technologies that can track a fighter's movements, punches, and health in real time. Some examples of these technologies include StrikeTec sensors that can be worn in boxing gloves to track punch data, head-injury detecting technologies such as the FITguard mouthpiece and Checklight skull cap, and smart clothing with built-in EEG, EMG, and EKG sensors to monitor the overall health of the athlete. I chose to do a case study focusing on the StrikeTec glove sensors. On April 4th, 2015, a championship bout in Big Knockout Boxing (BKB) – a combat sport similar to boxing that does not include a ring or ropes – inserted StrikeTec sensors into the two competing fighters' gloves during the match. Viewers at the stadium and at home downloaded a StrikeTec application that allowed them to view all sorts of punch data, including the speed, count, force, and type of punch (amongst other things), all in real time. This was a breakthrough moment for combat sports and big data, as it was the first time that punch data was tracked via sensor technology and made visible to the audience.

Part 3: Looking Ahead


For my final presentation, I started to look into how certain existing wearable technologies can be used to help improve boxing. As I did my research, I realized that there were more factors to consider than I had originally thought. Specifically, I asked the question: which wearable sensor technologies can be used to fortify the sport, without taking away its raw and aggressive nature, while giving the fans a more satisfying experience, and at the same time, prolonging the longevity and health of the boxers? I outlined three main points to consider when marrying big data with boxing, and discussed how some of the existing wearable technologies can be used to improve the sport.

I concluded my presentation by looking at how some of these technologies could be used to determine the outcome of the Pacquiao-Mayweather fight, as well as some of the issues and concerns to consider if these big data methods were to be integrated.

Predicting Cycling Speed with Strava Data

Strava is a mobile app for tracking runs and bike rides using a GPS device, such as a smartphone or a dedicated unit. After recording an activity, a map of the activity can be reviewed, along with statistics for the activity (such as distance, moving time, and average speed).
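The average speed statistic is just distance over moving time. A minimal sketch of the conversion (assuming the Strava API's units of meters and seconds, which is worth double-checking against the API docs):

```python
# Average speed as Strava reports it: distance over moving time.
# Assumes distance in meters and moving_time in seconds (the units the
# Strava API is documented to use); converts to mph for the charts.
METERS_PER_MILE = 1609.34

def avg_speed_mph(distance_m, moving_time_s):
    hours = moving_time_s / 3600
    return (distance_m / METERS_PER_MILE) / hours

print(round(avg_speed_mph(32186.8, 7200), 1))  # a 20-mile ride in 2h -> 10.0
```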



Additionally, graphs can be viewed for various metrics throughout the ride, such as speed, power, and elevation.


For this post I am turning my attention to a feature of Strava called Routes. You can plan out a cycling or running route to follow at a later time. While planning the route, Strava gives information such as the distance, total elevation gain (along with the elevation profile map), and an estimated moving time.



When introducing the feature, Strava noted that the estimated moving time was generated "based on your average speed over the past four weeks." This got me thinking, and I spent some time exploring how to improve on this estimate.

I used the Strava API to grab my activities, and started exploring the data in R.


As a first observation, the average speeds for my activities are all over the place, ranging from 8 mph to 22 mph. There doesn't seem to be much correlation between the length of an activity and its average speed.


Any cyclist in the Bay Area has experienced the slowing effects of climbing (I'm looking at you, El Toyonal, my current favorite climb), so I wanted to explore the relationship between elevation gain and average speed. There is a clear downward trend as elevation gain increases, but there is still a lot of variation.

One reason could be that elevation gain alone doesn't tell us anything about the distance of the ride. Plotting the distance of each ride against its climbing per mile (feet of elevation gain per mile), it is clear that rides come in all shapes and sizes.
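The climbing-per-mile metric used in these plots is simple to compute. A quick sketch (figures illustrative):

```python
# Climbing per mile: feet of elevation gain per mile ridden, the metric
# plotted against speed below. Example figures are illustrative.
def climb_per_mile(elev_gain_ft, distance_mi):
    return elev_gain_ft / distance_mi

print(climb_per_mile(2500, 40))  # 62.5 ft/mi -> a moderately hilly ride
```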


I then plotted the elevation gain per mile against speed, and the results were a lot more promising:


The clustering around the linear model is pretty good for relatively hilly rides, but there is a lot of variation between 0 and 50 ft/mi. Looking at my rides, this happens because flat rides tend to be either chill rides around town or on the Bay Trail path, or fast races like triathlons. These rides may not be great for estimating an average speed for a route, so I plotted the same data, but only for rides with at least 50 ft/mi of climbing:



This is starting to look more promising! But is there a difference between rides with the same climbing per mile? For example, what if all the climbing is in the first half of the ride, and it is flat thereafter? Would that be different from a ride with consistent, gradual climbing and descending throughout?

To quantify this, I look within each ride, take each piece of the ride (as Strava's API returns it), and compute the climbing grade percent (elevation gain divided by horizontal distance). I then calculate a statistic for the distribution of grades throughout the ride, modeled after the standard deviation. More 'extreme' rides will have higher climbing deviations, whereas more consistent rides will have lower ones.
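The post doesn't give the exact formula for this statistic, so the following is one plausible reading: the population standard deviation of the grade percent across the ride's segments.

```python
# One plausible sketch of the "climbing deviation" statistic: the
# population standard deviation of grade percent over the ride's pieces.
# The exact formula in the original analysis may differ.
def climbing_deviation(segments):
    """segments: list of (elevation_gain_ft, horizontal_dist_ft) pieces."""
    grades = [100 * gain / dist for gain, dist in segments]
    mean = sum(grades) / len(grades)
    return (sum((g - mean) ** 2 for g in grades) / len(grades)) ** 0.5

steady = [(50, 1000)] * 8               # constant 5% grade throughout
extreme = [(100, 1000), (0, 1000)] * 4  # alternating 10% grade and flat
print(climbing_deviation(steady))   # 0.0
print(climbing_deviation(extreme))  # 5.0
```

As intended, a ride with one concentrated climb scores higher than a ride with the same total climbing spread evenly.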



There does seem to be some correlation between deviation and speed, but it's not great. The statistic is not very precise, since the reported grades for each piece of the ride may not be very accurate, so this has to be taken with a grain of salt.

With that in mind, I created a linear model for estimating my speed, taking the feet/mile of climbing and the climbing deviation statistic into account. The resulting function is:

Avg speed = 16.012mph – 0.0206 * (climbing in feet/mile) – 0.01556 * (climbing deviation factor)
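The fitted model above, expressed as a function (coefficients taken directly from the formula; note the intercept and coefficients are specific to my riding data):

```python
# The fitted linear model from the post. Units: mph output, ft/mi of
# climbing, and the climbing-deviation statistic described earlier.
def estimated_speed_mph(climb_ft_per_mile, climbing_deviation):
    return 16.012 - 0.0206 * climb_ft_per_mile - 0.01556 * climbing_deviation

print(round(estimated_speed_mph(50, 3.0), 2))  # 14.94 mph for a mild ride
```

Multiplying a route's estimated time back out is then just distance divided by this speed.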

This function should improve on a route's estimated speed by taking the elevation gain and elevation profile into account.


For some unrelated fun, I got some of my teammates on the Cal Triathlon team to send me their Strava data (using some Python scripts I put on a web server) and plotted their data as well. It is interesting to see how rides differ from person to person. For example, here is distance vs. speed for one of my teammates.


Note that she rides faster the longer the ride is! Looking at how distance correlates with climbing, one reason might be that her longer rides involve less climbing per mile: