Data Science Writers Home
  1. Predicting Location via Indoor Positioning Systems (13)
  2. Modeling Runners' Times in the Cherry Blossom Race (21)
  3. Using Statistics to Identify Spam (9)
  4. Processing Robot and Sensor Log Files: Seeking a Circular Target (16)
  5. Strategies for Analyzing a 12-Gigabyte Data Set: Airline Flight Delays (1)
  6. Pairs Trading (9)
  7. Simulation Study of a Branching Process (9)
  8. A Self-Organizing Dynamic System with a Phase Transition (36)
  9. Simulating Blackjack (4)
  10. Baseball: Exploring Data in a Relational Database (2)
  11. CIA Factbook Mashup (10)
  12. Exploring Data Science Jobs with Web Scraping and Text Mining (11)

1 Predicting Location via Indoor Positioning Systems

Floor Plan of the Test Environment

In this floor plan, the 6 fixed access points are denoted by black square markers, the offline/training data were collected at the locations marked by grey dots, and the online measurements were recorded at randomly selected points indicated with black dots. The grey dots are spaced one meter apart.

Empirical CDF of Orientation for the Hand-Held Device

This empirical distribution function of orientation shows that there are 8 basic orientations that are 45 degrees apart. We see from the steps in the function that these orientations are not exactly 45, 90, 135, etc. Also, the 0 orientation is split into two groups, one near 0 and the other near 360.

Boxplots of Orientation for the Hand-Held Device

Boxplots of the original orientation against the rounded value confirm that the values have mapped correctly to 0, 45, 90, 135, etc. The outliers at the top left corner of the plot are the values near 360 that have been mapped to 0.
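
The rounding that these boxplots check can be sketched as follows. This is a minimal Python illustration, not the case study's own R code; the function name `round_orientation` is hypothetical, and the sketch assumes orientations are non-negative degrees.

```python
import math


def round_orientation(angle_degrees):
    """Map a measured orientation to the nearest multiple of 45 degrees.

    Assumes a non-negative angle in degrees. Values near 360 wrap
    around to 0, so the result is one of 0, 45, 90, ..., 315.
    """
    nearest_step = math.floor(angle_degrees / 45.0 + 0.5)
    return (nearest_step * 45) % 360


# Hypothetical readings: two near 0/360, two on either side of a boundary.
examples = [1.5, 358.2, 112.4, 112.6]
rounded = [round_orientation(a) for a in examples]
```

The modulo step is what sends readings just below 360 to 0, which is why those values appear as outliers in the 0 group.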

Screenshot of the MAC Address Lookup Form

The Web site offers lookup services to find the MAC address for a vendor and vice versa.

Counts of signals detected at each position

Plotted at each location in the building is the total number of signals detected from all access points for the offline data. Ideally, for each location, 110 signals were measured at 8 angles for each of 6 access points, for a total of 110 × 8 × 6 = 5280 recordings. Because these data include a seventh MAC address and not all signals were detected, there are about 5500 recordings at each location.

Signal Strength by Angle for Each Access Point

The boxplots in this figure represent signals for one location, which is in the upper left corner of the floor plan. These boxes are organized by access point and the angle of the hand-held device. The dependence of signal strength on angle is evident at several of the access points, e.g., 00:14:bf:b1:97:90 in the top right panel of the figure.

Distribution of Signal by Angle for Each Access Point

The density curves shown here are for the signal strengths measured at a single position. These 48 density plots represent each of the combinations of access point and angle. There are roughly 110 observations in each panel. Some look roughly normal while many others look skewed left.

SD of Signal Strength by Mean Signal Strength

The average and SD for the signals detected at each location-angle-access point combination are plotted against each other. The weak signals have low variability and the stronger signals have greater variability.

Comparison of Mean and Median Signal Strength

This smoothed scatter plot shows the difference between the mean and median signal strength for each combination of location, access point, and angle against the number of observations. These differences are close to 0 with a typical deviation of 1 to 2 dBm.

Median Signal at Two Access Points and Two Angles

These four heat maps provide a representation of signal strength. The top two maps are for the access point 00:14:bf:b1:97:90 and the angles 0 (left) and 135 (right). The two bottom heat maps represent the signal strength for the access point 00:0f:a3:39:e1:c0 and the same two angles.

Signal Strength vs. Distance to Access Point

These 48 scatter plots show the relationship between the signal strength and the distance to the access point for each of the 6 access points and 8 orientations of the device. The shape is consistent across panels, showing curvature in the relationship.

Floor Plan with Predicted and Actual Locations

The red line segments shown in the floor plan connect the test locations (black dots) to their predicted locations (asterisks). The top and bottom plots show the predictions for two different numbers of nearest neighbors. In this model, we use as training data the average signal strengths from each of the 166 offline locations (grey dots) to the 6 access points (black squares) for the 3 closest angles to the angle at which the test data were measured.
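
A nearest-neighbor location prediction of this kind can be sketched as below. This is a Python illustration under stated assumptions, not the case study's R code: the function name `predict_location`, the toy training data with two access points, the Euclidean distance between signal vectors, and the unweighted average of neighbor positions are all choices made for this sketch.

```python
import math


def predict_location(test_signals, training, k=3):
    """Predict (x, y) as the mean position of the k training points whose
    signal-strength vectors are closest to test_signals.

    training is a list of ((x, y), [signal strength per access point]).
    """
    by_distance = sorted(
        training, key=lambda item: math.dist(item[1], test_signals)
    )
    nearest = by_distance[:k]
    x = sum(position[0] for position, _ in nearest) / len(nearest)
    y = sum(position[1] for position, _ in nearest) / len(nearest)
    return (x, y)


# Hypothetical training summary: position and average signal (dBm)
# at two access points. Real data would have 6 access points.
training = [
    ((0.0, 0.0), [-40.0, -70.0]),
    ((1.0, 0.0), [-45.0, -65.0]),
    ((2.0, 0.0), [-50.0, -60.0]),
    ((10.0, 5.0), [-80.0, -40.0]),
]

one_neighbor = predict_location([-44.0, -66.0], training, k=1)
three_neighbors = predict_location([-44.0, -66.0], training, k=3)
```

With k = 1 the prediction is the single closest training position; with larger k the positions of the k closest training points are averaged.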

Cross-Validated Selection of k

This line plot shows the sum of squared errors as a function of k, the number of neighbors used in predicting the location of a new observation. The sums of squared errors are obtained via cross-validation of the offline data.

2 Modeling Runners' Times in the Cherry Blossom Race

Screen Shot of Cherry Blossom Run Web site

This page contains links to each year's race results. The year 1999 is the earliest for which they provide data. Men's and women's results are listed separately.

Screen Shot of the 2012 Male Results

This screenshot shows the results, in race order, for men competing in the 2012 Cherry Blossom 10 Mile Run. Notice that both 5-mile times and net times are provided. We know that the Time column is net time because it is so indicated in the header of the table.

Screen Shot of Men's 2011 Race Results

This screenshot shows the results, in race order, for men competing in the 2011 Cherry Blossom road race. Notice that in 2011, 3 times are recorded – the time to complete the first 5 miles and the gun and net times for the full run. In contrast, the results from 2012 do not provide gun time.

Box Plot of Age by Year

These side-by-side boxplots of age for each race year show a few problems with the data for 2003 and 2006. The runners in these years are unusually young.

Box Plot of Age by Year

These side-by-side boxplots of age for each race year show a reasonable age distribution. For example, the lower quartiles for all years range between 29 and 32. The problems identified earlier for 2003 and 2006 have been addressed.

Default Scatter Plot for Run Time vs. Age for Male Runners

This plot demonstrates that a simple scatter plot of run time by age for the 70,000 male runners leads to such severe overplotting that the shape of the data is not discernible.

Revised Scatter Plot of Male Runners

This plot revises the simple scatter plot of Figure 2.6, “Default Scatter Plot for Run Time vs. Age for Male Runners” by changing the plotting symbol from a circle to a disk, reducing the size of the plotting symbol, using a transparent color for the disk, and adding a small amount of random noise to age. Now we see the shape of the high density region containing most of the runners and the slight upward trend of time with increasing age.

Smoothed Scatter Plot of Male Runners Race Times vs. Age

This plot offers an alternative to the scatter plot of Figure 2.7, “Revised Scatter Plot of Male Runners” that uses jittering and transparent color to ameliorate the overplotting. Here there is no need to jitter age because the smoothing action essentially does that for us by spreading an individual runner's (age, run time) pair over a small region. The high density region has a shape very similar to that in the earlier plot.

Side-by-Side Boxplots of Male Runners' Run Time vs. Age

This sequence of boxplots shows the quartiles of time for men grouped into 10-year age intervals. As age increases, all the quartiles increase. However, the box becomes asymmetrical with age, which indicates that the upper quartile increases faster than the median and lower quartile.

Residual Plot from Fitting a Simple Linear Model of Performance to Age

Shown here is a smoothed scatter plot of the residuals from the fit of the simple linear model of run time to age for male runners who are 15 to 80 years old. Overlaid on the scatter plot are two curves. The curve in purple is a solid horizontal line at \(y = 0\). The green dashed curve is a local smooth of the residuals.

Piecewise Linear and Loess Curves Fitted to Run Time vs. Age

Here we have plotted the fitted curves from loess() and a piecewise linear model with hinges at 30, 40, 50, and 60. These curves follow each other quite closely. However, there appears to be more curvature in the over 50 loess fit that is not captured in the piecewise linear model.
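
One common way to build a piecewise linear model with hinges is to add a term max(age − knot, 0) for each hinge, so the slope can change at each knot. The Python sketch below illustrates that construction; the function names and the coefficient values are hypothetical and are not the fitted values from the case study.

```python
def hinge_features(age, knots=(30, 40, 50, 60)):
    """Return [age, max(age - 30, 0), ..., max(age - 60, 0)]."""
    return [age] + [max(age - knot, 0) for knot in knots]


def piecewise_linear(age, intercept, coefficients, knots=(30, 40, 50, 60)):
    """Evaluate intercept + sum(coefficient * feature) at the given age."""
    features = hinge_features(age, knots)
    return intercept + sum(c * f for c, f in zip(coefficients, features))


# Hypothetical coefficients, chosen only to show the arithmetic.
example_features = hinge_features(45)
example_time = piecewise_linear(45, 80.0, [0.1, 0.2, 0.3, 0.0, 0.0])
```

Each hinge coefficient is the change in slope at its knot, so the slope for ages between 40 and 50 is the sum of the first three coefficients.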



Line Plot of the Number of Male Runners by Year

This plot shows that the number of male runners in the Cherry Blossom 10-mile race has more than doubled from 1999 to 2012.

Density Curves for the Age of Male Runners in 1999 and 2012

These two density curves have quite different shapes. The 1999 male runners have a broad, nearly flat mode where they are roughly evenly distributed in age from 28 to 45. In contrast, the 2012 runners are younger, with a sharper peak just under 30 years and a right-skewed distribution.

Loess Curves Fit to Performance for 1999 and 2012 Male Runners

The loess fit of run time to age for 2012 male runners sits above the fit for 1999 male runners. The gap between these curves is about 5 minutes at most ages. The exception is the late 40s to early 60s, where the curves are within 2–3 minutes of each other. Both curves have a similar shape.

Difference between Loess Curves

This line plot shows the difference between the predicted run time for 2012 and 1999 male runners.

Screen Shot of One Runner's Web Page of Race Results

This Web page contains the race results of one runner who participated in the Cherry Blossom run in 12 of the 14 years for which we have data. Notice that his fastest time was in his most recent run, in 2012, when he completed the race in under 85 minutes. He was 45 at that time. Also, his slowest time was 123 minutes in 2002 at the age of 35.

Run Times for Multiple Races

These line plots show the times for male runners who completed at least 8 Cherry Blossom races. Each set of connected segments corresponds to the run times for one athlete. Looking at all line plots, we see a similar shape to the scatter plot in Figure 2.7, “Revised Scatter Plot of Male Runners”, i.e., an upward curve with age. However, we can also see how an individual's performance changes. For example, many middle-aged runners show a sharp increase in run time with age but that is not the case for all. Some of them improve and others change more slowly.

Linear Fits of Run Time to Age for Individual Runners

Here we have augmented the bottom-right line plot from Figure 2.17, “Run Times for Multiple Races” with the linear fit of run time for each of the athletes. These fits are the 30 or so black dashed line segments plotted on each individual runner's time series.

Coefficients from Longitudinal Analysis of Athletes

This scatter plot displays the slope of the line fitted to each of the 300+ runners who competed in at least 8 Cherry Blossom road races. A negative coefficient indicates the runner is getting faster as he ages. The plot includes a fitted line and a smooth curve. Notice that nearly all of the coefficients for those over 50 are positive. The typical size of this coefficient for a 50-year-old is about one minute per year.

Screen Shot of the Source for Men's 2012 Cherry Blossom Results

This screen shot is of the HTML source for the male results for the 2012 Cherry Blossom road race. While the format is not quite the same as the female results for 2011 (see Figure 2.21, “Screen Shot of the Source for Women's 2011 Cherry Blossom Results”), both are plain text tables within <pre> nodes in an HTML document.

Screen Shot of the Source for Women's 2011 Cherry Blossom Results

This screen shot is of the HTML source for the female results for the 2011 Cherry Blossom road race. Notice that times given are for the midpoint of the race (5 Mile) and for two finish times (Time and Net Tim). Also notice the leftmost column labeled S. While the format is different from the male results for 2012, both are plain text tables within <pre> nodes in an HTML document.

3 Using Statistics to Identify Spam

Boxplot of Log Likelihood Ratio for Spam and Ham

The log likelihood ratio (LLR) for 3116 test messages was computed using a naive Bayes approximation based on word frequencies found in manually classified training data. The test messages are grouped according to whether they are spam or ham. Notice most ham messages have values well below 0 and nearly all spam values are above 0.

Comparison of Type I and II Error Rates

The Type I and II error rates for the 3116 test messages are shown as a function of the threshold. For example, with a threshold of -43, all messages with an LLR value above -43 are classified as spam and those below as ham. In this case, 1% of ham is misclassified as spam and 2% of spam is misclassified as ham.
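
The two error rates at a given threshold can be computed as in the Python sketch below. It assumes that a Type I error means ham classified as spam and a Type II error means spam classified as ham; the function name `error_rates` and the toy LLR values are hypothetical.

```python
def error_rates(llr_values, is_spam, threshold):
    """Return (type_one_rate, type_two_rate) for a given LLR threshold.

    Messages with an LLR above the threshold are classified as spam.
    Type I (assumed): ham classified as spam.
    Type II (assumed): spam classified as ham.
    """
    ham = [v for v, spam in zip(llr_values, is_spam) if not spam]
    spam = [v for v, spam_flag in zip(llr_values, is_spam) if spam_flag]
    type_one = sum(v > threshold for v in ham) / len(ham)
    type_two = sum(v <= threshold for v in spam) / len(spam)
    return type_one, type_two


# Hypothetical LLR values for 4 ham and 4 spam messages.
llr = [-120.0, -80.0, -50.0, -30.0, -45.0, 10.0, 60.0, 200.0]
labels = [False, False, False, False, True, True, True, True]
rates_at_minus_43 = error_rates(llr, labels, threshold=-43.0)
```

Raising the threshold lowers the Type I rate and raises the Type II rate, which is the trade-off the plot displays.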

Example Tree from a Recursive Partition

This tree is a simple example of a recursive partition fitted model. It was fitted using the rpart() function and restricting the tree depth to 3 levels. The first yes–no question is whether the percentage of capitals in the message is less than 13. If not, the second question is whether there are fewer than 289 characters in the message. If the answer to this question is also no, then the next question is whether the message header contains an InReplyTo key. If the answer is again no, then the message is classified as spam. Of the 6232 messages in the training set, 77 ham and 643 spam fall into this leaf. The spam have been correctly classified and the 77 ham have been misclassified.

Comparison of Two Measures of Length for a Message

This scatter plot shows the relationship between the number of lines and the number of characters in the body of a message. The plot is on log scale, and 1 is added to all of the values before taking logs to address issues with empty bodies. The line is added for comparison purposes.

Use of Capitalization in Email

These boxplots compare the percentage of capital letters among all letters in a message body for spam and ham. The use of a log scale makes it easier to see that nearly 3/4 of the spam have more capital letters than nearly all of the ham.

Comparison of the Amount of Capitalization and the Size of the Message

This scatter plot examines the relationship between the percentage of capital letters among all letters in a message and the total number of characters in the message. Spam is marked by purple dots and ham by green. The darker color indicates overplotting. We see here that the spam tends to be longer and have more capital letters than ham.

Exploring Categorical Measures Derived from Email

These two mosaic plots use area to denote the proportion of messages that fall in each category. The plot on the top shows that messages with Re: in the subject line tend not to be spam. The bottom plot shows that messages from a user with a number at the end of their email address tend to be spam. However, few messages are sent from such users, so it is not clear how helpful this distinction will be in our classification problem.

Tree for Partitioning Email to Predict Spam

This tree was fitted using rpart() on 6232 messages. The default values for all of the arguments to rpart() were used. Notice the leftmost leaf classifies as ham those messages with fewer than 13% capitals, fewer than 4% HTML tags, and at least 1 forward. Eighteen spam messages fall into this leaf and so are misclassified, but 2240 ham messages are properly classified using these 3 yes–no questions.

Type I and II Errors for Recursive Partitioning

This plot displays the Type I and II errors for predicting spam as a function of the size of the complexity parameter in the rpart() function. The complexity parameter is a mechanism for specifying the threshold for choosing a split for a subgroup. Splits that do not achieve a gain in fit of at least the size of the parameter value provided are not made. The Type I error is minimized at a complexity parameter value of 0.001 for an error rate of 3.9%. The Type II error rate for this complexity parameter value is 10.5%.

4 Processing Robot and Sensor Log Files: Seeking a Circular Target

Example of the Course
This shows a sample path through the course. The robot starts in the lower left corner. The circular target can be seen at approximately (4.5, -6.5). There are two rectangular obstacles and one triangular obstacle. The horizontal dimensions range from -15 to +15, and the vertical from -8 to +8.
Log File Size
This shows the distribution of the size of the 100 log files.
Elapsed Time of 100 Experiments in Seconds
There appear to be 3 different groups in this distribution. Most of the experiments are completed between 1 and 16 minutes with a "center" of about 8 minutes. A smaller group of experiments is centered around 18 minutes. The final group includes those that do not find the target and use all of the 30 minutes allowed.
Distribution of the Changes in the Horizontal and Vertical Directions
We compute the change in the horizontal and vertical directions separately for each pair of consecutive records in each log to explore how far the robot typically moves between records.
Distribution of the Velocity of the Robots
This shows the bimodal distribution of the velocity of all of the robots across all log files. We compute the distance between consecutive points in each log and divide this by the time between these two records.
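
The velocity computation described here (distance between consecutive points divided by the time between the two records) can be sketched in Python as follows. The record layout `(time_seconds, x, y)` and the function name `velocities` are assumptions for this illustration, not the format of the actual log files.

```python
import math


def velocities(records):
    """Return the speed between each pair of consecutive records.

    records is a list of (time_seconds, x, y) tuples in log order.
    Pairs with no elapsed time are skipped to avoid dividing by zero.
    """
    speeds = []
    for (t0, x0, y0), (t1, x1, y1) in zip(records, records[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue
        distance = math.hypot(x1 - x0, y1 - y0)
        speeds.append(distance / elapsed)
    return speeds


# Hypothetical log excerpt: time in seconds, position in meters.
sample_log = [(0.0, 0.0, 0.0), (1.0, 0.3, 0.4), (3.0, 0.3, 1.4)]
sample_speeds = velocities(sample_log)
```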
Robot Path for the First Experiment
This displays the path from the first log file. The panel on the left shows the robot's movements from left to right across the course and then vertically along the side. The second panel shows this path relative to the entire course. This illustrates that the robot moved along the bottom side and only slightly vertically before the run terminated. This experiment lasted almost 19 minutes.
Display of All Experiments
This displays the path of the robot in each of the 100 experiments. The starting point is displayed in green with a circle. The direction of the robot corresponds to the shift in colors from green to red. The final location is marked with a blue x.
Sample Final Look
This is the path/shape seen by the robot in the final look of the first log file, JRSPdata_2010_03_10_12_12_31. The robot is in the center of the circle. At the top right of the circle, we see a circular-like object that might be the target. A straight edge corresponding to a rectangular obstacle appears at the bottom of the circle.
Enhanced Display of a Look
This figure shows the improved display of a robot's "look." We remove the misleading lines connecting the edges of the circular target and the 2 meter arc. This uses points(, type = "l") separately for each sub-group of points that forms a contiguous sub-arc at 2 meters, and for the points that are less than 2 meters away. It does not connect adjacent but disconnected sub-arcs.
All Final Looks
This shows the final look of each of the 100 log files. Which show a circle?
Density of Repeated Range Values
This shows the distribution of the repeated measurements of the range values when a robot revisited the same location. These are deviations from the mean of nominally identical values. There are some very extreme values (-1.58 and 1.68). The distribution has very large tails. Most of the observations are exactly 0.
Characteristics of Looks
This shows 4 different types of looks. In the first look, the robot sees nothing and so there is no circular target present. In the next 3 looks (moving row-wise), the robot sees a straight side or two straight sides that intersect. Again, there is no circular target. All of the remaining looks appear to contain a circular target. In the fifth look, the robot only sees the circle, while in the next 4 looks, it sees a circle and part of one or two obstacles. The last 3 looks are more complex. We see the circular target but, drawn in this manner, the circle appears to be connected to obstacles.
Looks Containing a Segment Identified as a Circle
These are the looks that were classified as containing the circular target. We see that most of the looks do indeed contain a shape that looks like the target. However, there are several that have confused a right angle corresponding to an obstacle in the course with the target and these seem to be false positives.
Looks Containing No Identified Circle
These are the looks that were classified as not containing the circular target. We see that most of the looks do indeed contain no indication of the target. However, there are several that do and suggest false negatives.
Patterns in the False Negatives
This shows the 9 looks in which the circular target appears to be present but which were not detected by our robot.evaluation() function. They are arranged to show two characteristics. The first of these is a circular target "connected" to another obstacle. The second pattern is a circular target that is very close to the robot (i.e., the center of the look) and so does not appear circular.
Misclassified Looks with a Target
These 9 looks are those that were misclassified as not containing the target. In all but one of these, the target is very close to the center of the robot. The seventh look is more problematic.

5 Strategies for Analyzing a 12-Gigabyte Data Set: Airline Flight Delays

Hourly Delay Quantiles

The airline delay quantiles (in minutes) for each hour of the day.

6 Pairs Trading

Historical Prices for the Two Indices, 1990–1995
This shows the time series of the two "stock" prices for the 5-year period. There is clearly a high correlation over time between the two financial indices. The values of the two series are quite different and are plotted on different scales, with the scale for one index on the right of the plot.


Historical Ratio for the Two Indices
This shows the ratio of the two "stock" prices from 1990 to 1995. The ratio appears to move around the mean until 1994 and then to rise above it. The horizontal lines show one and two standard deviations from the mean of the ratio.
Ratio of the Two Stock Prices for 1995–2010
The circles show the starting and closing positions for trades. We open a position when the ratio is outside of the threshold lines. We close that position when the ratio returns to the mean. The green circles identify the opening of a trading position; the red circles the corresponding close of the position. Circle 7 opens a position but the ratio never returns to the mean. We close the position on the final day of the series. Note that the threshold lines use the mean and standard deviation from the period 1990 to 1995.
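
The opening and closing rule described in this caption can be sketched in Python as below. This is an illustration under stated assumptions, not the case study's findNextPosition() code: the function name `find_positions`, the toy ratio series, and the choice to resume the search on the closing day are all hypothetical.

```python
def find_positions(ratio, mean, sd, k):
    """Return (open_index, close_index) pairs for a ratio series.

    A position opens when the ratio is outside mean +/- k * sd and
    closes when the ratio returns to (or crosses) the mean. A position
    still open at the end is closed on the final day.
    """
    upper, lower = mean + k * sd, mean - k * sd
    positions = []
    last = len(ratio) - 1
    i = 0
    while i < last:
        if ratio[i] > upper:
            opened_high = True
        elif ratio[i] < lower:
            opened_high = False
        else:
            i += 1
            continue
        close = last  # default: close on the final day of the series
        for j in range(i + 1, len(ratio)):
            back = ratio[j] <= mean if opened_high else ratio[j] >= mean
            if back:
                close = j
                break
        positions.append((i, close))
        i = close  # resume the search on the closing day
    return positions


# Hypothetical ratio series with mean 1.0 and standard deviation 0.2.
toy_ratio = [1.0, 1.3, 1.2, 1.0, 0.9, 0.6, 0.8, 1.05, 1.4, 1.3]
toy_positions = find_positions(toy_ratio, mean=1.0, sd=0.2, k=1.0)
```

In the toy series the third position never returns to the mean, so it is closed on the last day, as happens with circle 7 in the figure.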
A Simple Plot of the Ratio of Stock Prices
The mean of the ratio (\(\mu_{ratio}\)) is shown as a dashed line, with threshold lines at \(\mu_{ratio} \pm k \sigma_{ratio}\) for \(k = 0.85\).

Visualizing the First Three Positions
This shows the first 3 positions for the stock price ratio. It allows us to verify that our findNextPosition() function is working correctly.
Positions for \(k = 0.5\)
With a smaller value of k, we get many more opening and closing positions.
Profit for Different Values of k
For this training data, all values of k yielded a positive profit. For small values of k, there were many opportunities for positions and the profits were larger than for larger values of k. The maximum profit of 0.88 occurs at values of k between 0.61 and 0.62.
Two Simulated Time Series
The two time series are generated from the model with \(\rho = 0.99\) and \(\psi = 0.9\) and \(\beta_1 = c(.05, .1)\).
Density of Profit
This shows the distribution for 999 values of profit simulated from our time series model with \(\rho = 0.99\) and \(\psi = 0.9\). We see a large concentration very close to 0. We also see some slightly more commonly occurring values corresponding to the small modes/"bumps." The extreme values are very large in magnitude and also highly variable across simulations, as we expect.

7 Simulation Study of a Branching Process

Diagram of an Example Branching Process

This diagram shows a possible realization of the branching process studied by Aldous and Krebs. Each node in the tree represents a program. Jobs marked with an X have completed running, those marked R are currently running, and W nodes are waiting for their parent to complete before starting to run.

Empirical CDFs for the Sum of Three Exponentials

In a study of the sum of 3 independent exponential random variables with the same rate, 6,000 sample outcomes were generated and used to estimate the cumulative distribution function. The plot shows the empirical CDF for 16 values of the rate parameter, i.e., 0.1, 0.2, ..., 0.9, 1, 2, ..., 7. To help match each rate to its curve, a few of these rates are included in the figure next to their corresponding curves.

Empirical Distribution of the Number of Offspring for a Job

This figure shows the observed proportion of the number of offspring randomly generated for a parent with a birth time of 1 and a completion time of 6. The inter-arrival times of the parent's children are independent and follow an exponential distribution. The process was simulated 1,000 times, and the observed proportions of 0, 1, 2, ..., 9 offspring are plotted (black line segments). In addition, the true probabilities from the Poisson(2.5) distribution are plotted next to the observed proportions (grey line segments).
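
A simulation of this kind can be sketched in Python as follows. The rate of 0.5 per unit time is inferred here rather than taken from the caption: a Poisson(2.5) count over a lifetime of 6 − 1 = 5 time units corresponds to 2.5 / 5 = 0.5. The function name `count_offspring` and the seed are hypothetical.

```python
import random


def count_offspring(birth_time, completion_time, rate, rng):
    """Count children born before the parent completes.

    Inter-arrival times between children are independent exponential
    draws with the given rate, starting from the parent's birth time.
    """
    count = 0
    t = birth_time + rng.expovariate(rate)
    while t < completion_time:
        count += 1
        t += rng.expovariate(rate)
    return count


rng = random.Random(2024)  # fixed seed so the sketch is repeatable
# Rate 0.5 is an inferred value (2.5 expected offspring over 5 time units).
counts = [count_offspring(1.0, 6.0, 0.5, rng) for _ in range(1000)]
mean_offspring = sum(counts) / len(counts)
proportions = {n: counts.count(n) / len(counts) for n in range(10)}
```

The observed proportions in `proportions` play the role of the black line segments, and the sample mean should be close to the Poisson mean of 2.5.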

Visualization of a Randomly Generated Branching Process

This plot shows the lifetimes of each member of a randomly generated birth and assassination process. Each job's lifetime is represented by a grey line segment with endpoints at its birth and completion times. The Xs on the segment denote the birth times of the job's offspring. The dashed lines separate the generations. None of the jobs in the fourth generation of this instance of the process had offspring, so the process terminated there. Notice that one job in the second generation ran for a very long time and had just one child.

Visualization of a Randomly Generated Branching Process Over a Fixed Time Interval

This plot shows the lifetimes of each member of the first 5 generations of a simulated birth and assassination process that has been observed up to time 8. Each lifetime is represented by a grey line segment with endpoints at its birth and completion times. A job that has not completed by time 8 is censored and, consequently, we see only its offspring born before 8.

Scatterplots of the Number of Generations Against the Number of Offspring

These 4 scatter plots show the different behavior exhibited by the branching process as the birth and completion rates vary. Each simulation terminates when the process dies out or one of the following limits is reached: 200 generations; 100,000 offspring. One hundred simulations are run for each pair of rates.

Three-Dimensional Scatterplot of the Number of Offspring by Birth and Completion Rates

Each point in this three-dimensional scatter plot represents the upper quartile of the number of offspring in 400 random outcomes of the branching process for a particular pair of rates. The offspring are plotted on log base 10 scale, so the first category, i.e., [0, 0.5) corresponds to 1 to 3 offspring.

Image Map of the Proportion of Replicates That Reach the Simulation Limits

This image map shows the proportion of the 400 simulations for each pair of rates that reached 20 generations or 1000 offspring and so were terminated.

Proportion of Simulations with at Least 20 Offspring

This plot uses colors from the rainbow palette to represent the proportion of 400 random outcomes for each pair of rates that have at least 20 offspring.

8 A Self-Organizing Dynamic System with a Phase Transition

Movement on a Sample Grid
Panel (a) shows the initial state of a 3-by-5 grid containing 3 red and 3 blue cars. At time \(t = 1\), the red cars move horizontally. The red car on the bottom row "wraps around" to the first column of the same row. At time \(t = 2\), the blue cars move up within the same column. The blue car in cell \((1, 3)\) is blocked by the red car above it, which moved into that cell at time \(t = 1\). Accordingly, we obtain a sequence of grids indexed by time.
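
The movement rules in this caption can be sketched in Python as below. The grid encoding (lists of characters with row 0 at the top), the function names, and the toy 3-by-5 grid are assumptions for this illustration; the toy grid is not the one shown in the figure. Cars of one color are assumed to move simultaneously, so a car moves only if its target cell was empty at the start of the step.

```python
EMPTY, RED, BLUE = ".", "R", "B"


def step(grid, color, d_row, d_col):
    """Move every car of one color by (d_row, d_col), wrapping at edges.

    A car moves only if its target cell is empty in the grid as it
    stood at the start of the step; otherwise it stays in place.
    """
    rows, cols = len(grid), len(grid[0])
    new_grid = [row[:] for row in grid]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != color:
                continue
            tr, tc = (r + d_row) % rows, (c + d_col) % cols
            if grid[tr][tc] == EMPTY:
                new_grid[r][c] = EMPTY
                new_grid[tr][tc] = color
    return new_grid


def move_red(grid):
    """Red cars move one cell to the right."""
    return step(grid, RED, 0, 1)


def move_blue(grid):
    """Blue cars move one cell up (toward row 0)."""
    return step(grid, BLUE, -1, 0)


def show(grid):
    return ["".join(row) for row in grid]


# Hypothetical 3-by-5 grid: one red car is blocked by a blue car,
# and the red car in the last column wraps around to the first column.
start = [list(row) for row in [".....", ".RB..", "....R"]]
after_red = move_red(start)
after_blue = move_blue(after_red)
```

Reading the original grid while writing to a copy is what makes the update simultaneous: a car cannot move into a cell that another car of the same color vacates during the same step.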

Table 8.1. 

t = 0    t = 500    t = 1,000

Sample Free-Flowing Traffic Grids
The 3 panels show the grid at times t = 0 (the initial grid), t = 500, and t = 1000. Cars occupy 25% of the cells. After 500 iterations, the cars start to organize along diagonal lines. After 1000 iterations, the lines are becoming clearer.

Table 8.2. 

t = 0    t = 500    t = 1,000

Sample Deadlocked Traffic Grids
Three grids at different time steps showing the emergence of deadlocked traffic.
Simple Plot from image()
A simple call to image() displays the grid with the wrong colors and does not display the cells in the order we expect or want.

Table 8.3. 

Sample Grid Displays
The first panel shows a 100-by-100 grid with 50% of the cells occupied equally with red and blue cars. The second panel has the same dimensions but there are 5000 red cars and only 500 blue cars. The 3rd panel shows a small 3-by-4 grid where we placed the cars manually.

Table 8.4. 

A Grid before and after Moving the Red Cars
The left panel shows the initial 4-by-7 grid. The second panel shows the state of the grid after the red cars have moved.
Run Time of the Code as a Function of the Number of Grid Cells
This shows that as the number of grid cells, and with it the total number of cars, doubles, the time taken approximately doubles. We have also added a linear fit for the elapsed times as a function of grid size n. This plot was computed for square grids, but the relationship applies generally to the number of cells in a grid.
Comparison of Times for Different Computational Approaches

Table 8.5. 


This shows the relative speed of the different approaches, ranging from the naïve loop version, to the fastest vectorized approach, to using the C code. There is a very significant speedup from the loop-version of the code. However, this plot hides the significant speedup between the different vectorized approaches and especially the C code. Panel (b) shows these relative speedups when we omit the loop approaches and we can compare the remaining approaches on a more appropriate scale.

Table 8.6. 

\(\rho = .25\)    \(\rho = .33\)
\(\rho = .38\)    \(\rho = .38\)
\(\rho = .55\)    \(\rho = .65\)

Sample 1024-by-1024 Grids with Different Densities
Each grid has the same dimension and is run for 128,000 time steps. In the top-left, we have a free-flowing grid with a density of 0.25 in which the cars arrange themselves into diagonal lines. At a slightly higher density of 0.33 (top-right), we see one group of cars that are partially deadlocked but which continue to move. We also see bands of red and blue cars moving, but not parallel to each other. Some cars will move freely, at least until they meet the partially deadlocked group. The cars in that group move more slowly until they escape. This gives rise to different velocities. At a slightly higher density, we see additional groups but the same pattern of moving cars. For a different configuration at the same density (0.38) we see a very different structure with mostly deadlocked cars in two parallel groups running diagonally. Note the alternating colors of the diagonal bands. For higher densities, we see more localized groupings but the same deadlock.

Table 8.7. 

Average Velocity of Cars by Occupancy Density

These plots mirror those in [bib:Raissa2005] (page 3), with apparently more replications. They show the average proportion of cars moving in each time step on the vertical axis. The horizontal axis shows a range of car densities for the different grids. The 4 panels correspond to the different sizes of the square grid: 64, 128, 256, and 512. Each point in a plot corresponds to running the initial grid through 64,000 total iterations of both red and blue cars, i.e., 128,000 time steps. We see variation for a given density corresponding to different random grid configurations. We see that the density at which deadlock occurs decreases as the grid size increases. Importantly, we see that there are different points of equilibrium, not simply free-flowing or deadlocked. These are the intermediate states that were unexpected until reported in [bib:Raissa2005].

9 Simulating Blackjack

Density Plot Comparing the Winnings from Two Strategies

This plot shows the density of the gain from 1,000 $1 bets for two different strategies: the optimal strategy for an infinite deck, and a simple strategy that hits when the dealer shows a 6 or higher and the player's cards are under 17, and otherwise stands. A gain of $2 or -$2 results from winning or losing a double down, respectively. The simple strategy never doubles the bet.
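The rule for the simple strategy is short enough to write out directly. The following is a minimal sketch of the rule as stated in the caption, not the code used for the figure. The function name, the "hit"/"stand" return values, and the choice to encode the dealer's ace as 11 (so that it counts as a high card) are our assumptions.

```python
def simple_strategy(player_total, dealer_upcard):
    """Return "hit" or "stand" for the simple strategy.

    player_total: current value of the player's hand.
    dealer_upcard: value of the dealer's visible card (ace assumed to be 11).
    """
    if dealer_upcard >= 6 and player_total < 17:
        return "hit"
    # The simple strategy never doubles the bet, so standing is the only alternative.
    return "stand"

print(simple_strategy(12, 10))  # dealer shows a high card and we are under 17
print(simple_strategy(12, 5))   # dealer shows a low card
```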

A Comparison of the Optimal and Simple Strategies

The distribution of payoffs under the optimal strategy peaks at a higher value. Note that the variability is considerable: there is about a 40% chance of losing 5% or more of your money with the optimal strategy and about a 55% chance with the simple strategy.

Plot of the bet() Function

This plot shows the bet() function over the domain from -5 to 10. We can use this visualization to confirm that the function works as expected.

Comparison of Average Gain for Card Counting

A comparison of the fixed and varying bet amounts. For each set of 50 hands, we take the difference in the gains between a fixed bet and a bet adjusted according to the Hi-Low counting strategy. Values above 0 indicate that the gain for the counting strategy exceeded that of the fixed bet.
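The two ingredients of this comparison can be sketched as follows. This is only an illustration. The card values are those of the standard Hi-Lo count (2 through 6 count +1, 7 through 9 count 0, and tens, face cards, and aces count -1). The function names, the string encoding of ranks, and the per-hand gain lists are hypothetical, and how the bet is scaled from the count is deliberately left out.

```python
LOW_RANKS = {"2", "3", "4", "5", "6"}
HIGH_RANKS = {"10", "J", "Q", "K", "A"}

def hi_lo_value(rank):
    """Hi-Lo value of one card: +1 for low cards, -1 for high cards, 0 otherwise."""
    if rank in LOW_RANKS:
        return 1
    if rank in HIGH_RANKS:
        return -1
    return 0

def running_count(ranks):
    """Sum of Hi-Lo values over the cards seen so far."""
    return sum(hi_lo_value(rank) for rank in ranks)

def block_differences(fixed_gains, counting_gains, block_size=50):
    """For each block of hands, the counting gain minus the fixed-bet gain.

    Values above 0 mean the counting strategy did better in that block.
    """
    diffs = []
    for start in range(0, len(fixed_gains), block_size):
        stop = start + block_size
        diffs.append(sum(counting_gains[start:stop]) - sum(fixed_gains[start:stop]))
    return diffs

print(running_count(["2", "K", "7", "5", "A"]))
print(block_differences([1, -1, 1, 1], [2, -1, 1, 3], block_size=2))
```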

10 Baseball: Exploring Data in a Relational Database

Boxplots of Team Payrolls by League

These boxplots show the team payrolls for the American League (dark gray) and the National League (light gray) from 1985 to 2012. Payroll has been logged and is reported in millions of dollars. The American League appears to have greater spread in payroll than the National League. Also evident is the sharp increase in payrolls in the late 1980s and early 1990s.

Scatter Plot of Team Payroll by Year

In this scatter plot, the plotting symbol is a transparent gray and the year has been jittered slightly to avoid overplotting. The identifiers of the teams that won the World Series are added to the corresponding points, which are colored red. In almost every case, these dots are at or above the upper quartile of team payroll. Payroll is reported in millions of dollars and plotted on a log scale.

11 CIA Factbook Mashup

Display of Infant Mortality by Country on Google Earth

This screenshot of the Google Earth virtual earth browser displays circles scaled to the population size and colored according to the infant mortality rate for a country. The data are available from the CIA Factbook. The locations of the circles are determined from MaxMind's latitude and longitude of the country's geographic center. When the viewer clicks on a circle, a window pops up with more detailed information for that country.

Distribution of Infant Mortality for Countries in the CIA Factbook

This histogram of infant mortality rates shows a highly skewed distribution. Most countries have rates under 20 per 1000 live births and a few countries have rates between 80 and 125.

Population Distribution for Countries in the CIA Factbook

This histogram of the square root of population size for countries shows a highly skewed distribution with a mode around 1000.

Incorrect Map of Infant Mortality and Population

In this map each disk corresponds to a country's infant mortality and population. The size of the disk is proportional to the population and the color reflects the infant mortality rate. Notice the size of China's disk is too small – it has the highest population but one of the smallest disks. Other anomalies are apparent with closer inspection.

Screenshot of the ISO Country Code Mapping

This screenshot of the ISO Web site shows the ISO code for the United Kingdom as GB. Codes are also available in XML and CSV formats at

Map of Infant Mortality and Population

This map correctly matches the country demographic information with latitude and longitude. Notice that the symbols for India and China are now approximately the same size and are the largest symbols on the map. Also note that the symbol for the United Kingdom is now pale yellow, the color we would expect it to be, because it is no longer being confused with Gabon.
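One way such a mix-up can arise is sketched below. The sketch assumes that the two data sources use different two-letter coding schemes: a Factbook-style code in which "gb" is Gabon and "uk" is the United Kingdom, and ISO codes in which "GB" is the United Kingdom and "GA" is Gabon. The tiny dictionaries, the coordinates, and the function names are hypothetical and for illustration only.

```python
# Hypothetical records; the coordinates are illustrative values only.
factbook = {"gb": "Gabon", "uk": "United Kingdom"}   # assumed Factbook-style codes
lat_lon = {"GB": (54.0, -2.0), "GA": (-1.0, 11.75)}  # keyed by ISO code

# Assumed crosswalk from Factbook-style codes to ISO codes.
factbook_to_iso = {"gb": "GA", "uk": "GB"}

def naive_join(factbook, lat_lon):
    """Match on the raw two-letter code, as if both sources used the same scheme."""
    return {name: lat_lon[code.upper()]
            for code, name in factbook.items() if code.upper() in lat_lon}

def crosswalk_join(factbook, lat_lon, crosswalk):
    """Translate each code to ISO before looking up the location."""
    return {name: lat_lon[crosswalk[code]]
            for code, name in factbook.items()
            if code in crosswalk and crosswalk[code] in lat_lon}

print(naive_join(factbook, lat_lon))       # Gabon lands on the UK's location
print(crosswalk_join(factbook, lat_lon, factbook_to_iso))
```

In the naive join, Gabon's demographic values are drawn at the United Kingdom's location and the United Kingdom's own record finds no match.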

Default Google Earth Image

This screenshot of Google Earth displays the location of each country with a pushpin. The locations are from our latitude and longitude file. A more informative Google Earth visualization appears in Figure 11.1, “Display of Infant Mortality by Country on Google Earth”.

Screenshot of the CIA Factbook Rendered in Chrome

The Chrome browser renders an XML file using indentation and color to highlight the structure of the document. The top or root node of the file is <factbook> and it has an attribute called lastupdate, which indicates how recently the information was updated. The <news> nodes are indented one space, corresponding to their depth in the hierarchy, i.e., they are children of <factbook>.

Screenshot of the MaxMind Web Page with Country Latitude and Longitude

MaxMind makes available the average latitude and longitude for countries in several formats, including in a display on the Web as shown here in this screenshot.

Screenshot of the HTML Source of the MaxMind Web Page

This screenshot displays the HTML source of the Web page shown in Figure 11.9, “Screenshot of the MaxMind Web Page with Country Latitude and Longitude”. Notice that the latitudes and longitudes for the countries appear within a <pre> node in the HTML document.
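Because the values sit inside a single <pre> node, extracting them takes little code. The following is a minimal sketch using only the Python standard library. The sample HTML is hypothetical, and the layout assumed inside the <pre> node (a header line followed by comma-separated code, latitude, and longitude) is our assumption about the page, as are the class and function names.

```python
import csv
from html.parser import HTMLParser

class PreTextParser(HTMLParser):
    """Collect the text that appears inside <pre> nodes."""

    def __init__(self):
        super().__init__()
        self.in_pre = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.in_pre = True

    def handle_endtag(self, tag):
        if tag == "pre":
            self.in_pre = False

    def handle_data(self, data):
        if self.in_pre:
            self.chunks.append(data)

def parse_lat_lon(html):
    """Return {country code: (latitude, longitude)} from the <pre> text."""
    parser = PreTextParser()
    parser.feed(html)
    parser.close()
    lines = [ln for ln in "".join(parser.chunks).splitlines() if ln.strip()]
    rows = csv.reader(lines)
    next(rows, None)  # skip the assumed header line
    return {code: (float(lat), float(lon)) for code, lat, lon in rows}

# Hypothetical stand-in for the page source; values are illustrative.
SAMPLE_HTML = """<html><body><pre>
"iso 3166 country","latitude","longitude"
GB,54.0000,-2.0000
GA,-1.0000,11.7500
</pre></body></html>"""

print(parse_lat_lon(SAMPLE_HTML))
```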

12 Exploring Data Science Jobs with Web Scraping and Text Mining

Kaggle Jobs Board Web Page

This is the front page of the Kaggle jobs forum. All of the posts are available from this page and its successive pages, which are reached via the links 2, 3, .... We can click on the link for a specific job to read that post.


Job Post and Comments
This shows a very informal job posting and follow-up posts on the same page. There are very few details about the position being advertised.
Screenshot of


Job Post by Pandora, the free, personalized radio Web site
There are several paragraphs describing the company and the position. There is a bullet-list of required skills.
List of Search Results for Data Science Jobs on


We can enter the search term in various fields to obtain a list of matching job posts. These are shown below the text fields. Each entry corresponds to a job posting. There is some metadata in these items, but none of much interest to us. There are also numerous advertisements on the page, distracting us from the data of interest.
Job Post on
This is a cross-listed job posting on that actually comes from the


Web site. As with the


posts, much of the post is free-form text. There are several lists that provide metadata about the job. There is also structured content for similar jobs, listing the related skills.
Semi-Structured Job Post on Kaggle

This shows another Kaggle job posting. Again, much of the content is free-form text within different sections that have a title. There are 2 lists with metadata about education and experience qualifications necessary for the job.
Search Results for Data Scientist Jobs on


We enter the search term in the text field at the top of the Web page. The results are displayed below. Note that in addition to the link to each particular job, the results page also shows metadata about each job, e.g., salary range, location, and a list of required skills. If this is all the information we wanted, we would not have to scrape the actual postings.

Sample Job Post on

This job posting, like others on,


has several paragraphs of free-form text, and also some itemized lists including one listing the skills necessary for the position. Additionally, the post has the location and salary separately in the top-right corner. It also displays the preferred skills as a list of phrases.

Frequency of Selected Terms over all Kaggle Job Posts

This shows the number of occurrences of each of the selected terms across all 842 job posts on Kaggle as of January 2015.

Dotplot for Frequency of Skill Phrases across Data Science Job Posts

This shows the number of occurrences of different terms we selected across Data Science job postings on this Web site.

Word Cloud for Frequency of Skill Phrases Across Job Posts

This is a different display of the counts of different


across posts on