## 1 Predicting Location via Indoor Positioning Systems

**Floor Plan of the Test Environment**
In this floor plan, the 6 fixed access points are denoted by black
square markers, the offline/training data were collected at the
locations marked by grey dots, and the online measurements were
recorded at randomly selected points indicated with black dots. The
grey dots are spaced one meter apart.

**Empirical CDF of Orientation for the Hand-Held Device**
This empirical distribution function of orientation shows that there
are 8 basic orientations that are 45 degrees apart. We see from
the steps in the function that these orientations are not exactly 45,
90, 135, etc. Also, the 0 orientation is split into two groups,
one near 0 and the other near 360.

**Boxplots of Orientation for the Hand-Held Device**
These boxplots of the original orientation against the rounded value
confirm that the values have mapped correctly to 0, 45, 90, 135, etc.
The “outliers” at the top left corner of the plot are the
values near 360 that have been mapped to 0.
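
The rounding described above can be sketched in a few lines of R. This is only an illustration under stated assumptions: the function name `round_orientation` and the sample angles are made up for this example, and the mapping simply snaps each angle to the nearest multiple of 45 and treats 360 as 0.

```r
# Hypothetical helper: snap an orientation (degrees) to the nearest
# 45-degree reference angle, folding values near 360 back to 0.
round_orientation <- function(angles) {
  refs <- seq(0, by = 45, length.out = 9)        # 0, 45, ..., 360
  nearest <- sapply(angles, function(a) which.min(abs(a - refs)))
  c(refs[1:8], 0)[nearest]                       # the ninth reference (360) maps to 0
}

rounded <- round_orientation(c(1.2, 44.0, 92.5, 358.9))
rounded
```

Plotting the original angles against `rounded` with *boxplot()* would give a display like the one in the caption, with the values near 360 appearing as the high points in the 0 group.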

**Screenshot of the coffer.com MAC Address Lookup Form**
The coffer.com Web site offers lookup services to find the
MAC address for a vendor and vice versa.

**Counts of signals detected at each position**
Plotted at each location in the building is the total number of
signals detected from all access points for the offline data. Ideally
for each location, 110 signals were measured at 8 angles for each of 6
access points, for a total of 5280 recordings. These data include a
seventh MAC address, and not all signals were detected, so there are
about 5500 recordings at each location.

**Signal Strength by Angle for Each Access Point**
The boxplots in this figure represent signals for one location, which
is in the upper left corner of the floor plan. These boxes
are organized by access point and the angle of the hand-held
device. The dependence of signal strength on angle is evident at
several of the access points, e.g., `00:14:bf:b1:97:90`
in the top right panel of the figure.

**Distribution of Signal by Angle for Each Access Point**
The density curves shown here are for the signal strengths measured at
a single position. These 48 density plots represent each of the
access point and angle combinations.
There are roughly 110 observations in each panel. Some
look roughly normal while many others look skewed left.

**SD of Signal Strength by Mean Signal Strength**
The average and SD for the signals detected at each
location-angle-access point combination are plotted against each
other. The weak signals have low variability and the stronger signals
have greater variability.

**Comparison of Mean and Median Signal Strength**
This smoothed scatter plot shows the difference between the mean and
median signal strength for each combination of location, access point,
and angle against the number of observations. These differences are
close to 0 with a typical deviation of 1 to 2 dBm.

**Median Signal at Two Access Points and Two Angles**
These four heat maps provide a representation of
signal strength. The top two maps are for the access point
`00:14:bf:b1:97:90` and the angles 0 (left) and 135
(right). The two bottom heat maps represent the signal strength for
the access point `00:0f:a3:39:e1:c0` and the same two angles.

**Signal Strength vs. Distance to Access Point**
These 48 scatter plots show the relationship between the signal
strength and the distance to the access point for each of the 6 access
points and 8 orientations of the device. The shape is consistent
across panels, showing curvature in the relationship.

**Floor Plan with Predicted and Actual Locations**
The red line segments shown in the floor plan connect the test
locations (black dots) to their predicted locations (asterisks). The
top and bottom plots show the predictions for two different numbers of
nearest neighbors. In this model, we use as training data the average signal
strengths from each of the 166 offline locations (grey dots) to the
6 access points (black squares) for the 3 closest angles to the
angle at which the test data were measured.
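
A nearest-neighbor prediction of this kind can be sketched as follows. The function name `predict_xy`, the argument names, and the toy numbers are all assumptions made for illustration. The sketch also skips the step of averaging over the 3 closest angles and simply takes a ready-made matrix of average signal strengths.

```r
# Sketch of k-nearest-neighbor location prediction (all names hypothetical).
# train_ss: one row per training location, one column per access point
# train_xy: (x, y) positions, rows aligned with train_ss
predict_xy <- function(new_ss, train_ss, train_xy, k = 3) {
  # Euclidean distance in signal space from the new reading to each training row
  d <- sqrt(rowSums(sweep(train_ss, 2, new_ss)^2))
  nearest <- order(d)[seq_len(k)]
  # Predicted position: average of the k closest training positions
  colMeans(train_xy[nearest, , drop = FALSE])
}

# Toy data: 4 locations and 2 access points (made-up values)
train_ss <- rbind(c(-50, -70), c(-55, -65), c(-70, -50), c(-75, -45))
train_xy <- rbind(c(0, 0), c(1, 0), c(5, 5), c(6, 5))
est <- predict_xy(c(-52, -68), train_ss, train_xy, k = 2)
est
```

Here the two closest training rows are the first two, so the estimate is the midpoint of their positions.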

**Cross-Validated Selection of \(k\)**
This line plot shows the sum of squared errors as a function of the
number of neighbors, \(k\), used in predicting the location of a new
observation. The sums of squared errors are obtained via
cross-validation of the offline data.

## 2 Modeling Runners' Times in the Cherry Blossom Race

**Screen Shot of Cherry Blossom Run Web site**
This page contains links to each year's race results. The year 1999 is
the earliest for which they provide data. Men's and women's results
are listed separately.

**Screen Shot of the 2012 Male Results**
This screenshot shows the results, in race order, for men competing in
the 2012 Cherry Blossom 10 Mile Run. Notice that both 5-mile times
and net times are provided. We know that the `Time` column is net time
because it is so indicated in the header of the table.

**Screen Shot of Men's 2011 Race Results**
This screenshot shows the results, in race order, for men competing in
the 2011 Cherry Blossom road race. Notice that in 2011, 3 times are
recorded – the time to complete the first 5 miles and the gun
and net times for the full run. In contrast, the results from 2012 do
not provide gun time.

**Box Plot of Age by Year**
These side-by-side boxplots of age for each race year show a few
problems with the data for 2003 and 2006. The runners in these years
are unusually young.

**Box Plot of Age by Year**
These side-by-side boxplots of age for each race year show a
reasonable age distribution. For example, the lower quartiles for all
years range between 29 and 32. The problems identified earlier for
2003 and 2006 have been addressed.

**Default Scatter Plot for Run Time vs. Age for Male Runners**
This plot demonstrates that a simple scatter plot of run time by age
for the 70,000 male runners leads to such severe over-plotting that
the shape of the data is not discernible.

**Revised Scatter Plot of Male Runners**
This plot revises the simple scatter plot of Figure 2.6, “Default Scatter Plot for Run Time vs. Age for Male
Runners”, by changing the plotting symbol from a
circle to a disk, reducing the size of the plotting symbol, using a
transparent color for the disk, and adding a small amount of random
noise to age. Now we see the shape of the high density region
containing most of the runners and the slight upward trend of time
with increasing age.
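
The four changes listed above (disk symbol, smaller size, transparent color, jittered age) map onto standard base R graphics arguments. The sketch below uses simulated stand-in data rather than the race results. The variable names and the particular color and jitter amounts are assumptions for illustration only.

```r
set.seed(1)
# Simulated stand-in data (not the Cherry Blossom results)
age  <- sample(20:70, 5000, replace = TRUE)
time <- 70 + 0.3 * age + rnorm(5000, sd = 12)

faint <- rgb(0, 0, 1, alpha = 0.1)   # mostly transparent blue
pdf(NULL)                            # null device; omit to draw on screen
plot(jitter(age, amount = 0.5), time,
     pch = 19,                       # filled disk instead of open circle
     cex = 0.2,                      # small plotting symbol
     col = faint,
     xlab = "Age (years)", ylab = "Run time (minutes)")
invisible(dev.off())
```

With whole-year ages, a jitter amount of 0.5 spreads each age over a one-year band so that the vertical stripes disappear.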

**Smoothed Scatter Plot of Male Runners Race Times vs. Age**
This plot offers an alternative to the scatter plot of Figure 2.7, “Revised Scatter Plot of Male Runners”, which uses jittering and
transparent color to ameliorate the over-plotting. Here there is no
need to jitter age because the smoothing action essentially does that
for us by spreading an individual runner's (age, run time) pair over a
small region. The high density region has a very similar
shape to that in the earlier plot.

**Side-by-Side Boxplots of Male Runners' Run Time vs. Age**
This sequence of boxplots shows the quartiles of time for men grouped
into 10-year age intervals. As age increases, all the quartiles
increase. However, the box becomes asymmetrical with age, which
indicates that the upper quartile increases faster than the median and
lower quartile.

**Residual Plot from Fitting a Simple Linear Model of Performance to Age**
Shown here is a smoothed scatter plot of the residuals from the fit of
the simple linear model of run time to age for male runners who are 15 to
80 years old. Overlaid on the scatter plot are two curves. The
“curve” in purple is a solid horizontal line at \(y = 0\).
The green dashed curve is a local smooth of the residuals.

**Piecewise Linear and Loess Curves Fitted to Run Time vs. Age**
Here we have plotted the fitted curves from *loess()* and
a piecewise linear model with hinges at 30, 40, 50, and 60. These
curves follow each other quite closely. However, there appears to be
more curvature in the over 50 loess fit that is not captured in the
piecewise linear model.
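
One common way to fit a piecewise linear model with hinges is to add, for each hinge, a column that is zero below the hinge and grows linearly above it. The sketch below does this with *lm()* on simulated data. The data, the column names, and this particular construction are assumptions for illustration and may differ from how the figure was produced.

```r
set.seed(2)
# Simulated stand-in data with a change in slope at age 50
age  <- runif(2000, 15, 80)
time <- 80 + 0.2 * age + 0.5 * pmax(0, age - 50) + rnorm(2000, sd = 5)

hinges <- c(30, 40, 50, 60)
# One column per hinge: 0 below the hinge, (age - hinge) above it
over <- sapply(hinges, function(h) pmax(0, age - h))
colnames(over) <- paste0("over", hinges)
dat <- data.frame(time = time, age = age, over)

fit <- lm(time ~ ., data = dat)
coef(fit)   # each "over" coefficient is the change in slope at that hinge
```

The fitted curve is continuous because every added column is zero at its own hinge, and the slope in each age interval is the sum of the coefficients for `age` and all hinges below it.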

**Line Plot of the Number of Male Runners by Year**
This plot shows that the number of male runners in the Cherry Blossom
10-mile race has more than doubled from 1999 to 2012.

**Density Curves for the Age of Male Runners in 1999 and 2012**
These two density curves have quite different shapes. The 1999 male
runners have a broad, nearly flat mode where they are roughly evenly
distributed in age from 28 to 45. In contrast, the 2012 runners are
younger with a sharper peak just under 30 years and a right-skewed
distribution.

**Loess Curves Fit to Performance for 1999 and 2012 Male Runners**
This loess fit of run time to age for 2012 male runners sits above the
fit for 1999 male runners. The gap between these curves is about 5
minutes for most ages. The exception is in the late 40s to early 60s
where the curves are within 2–3 minutes of each other. Both curves
have a similar shape.

**Difference between Loess Curves**
This line plot shows the difference between the predicted run time
for 2012 and 1999 male runners.

**Screen Shot of One Runner's Web Page of Race Results**
This Web page at https://storage.athlinks.com contains
the race results of one runner who participated in the Cherry Blossom
run for 12 of the 14 years for which we have data. Notice that his
fastest time was from his most recent run in 2012 where he completed
the race in under 85 minutes. He was 45 at that time. Also, his
slowest time was 123 minutes in 2002 at the age of 35.

**Run Times for Multiple Races**
These line plots show the times for male runners who completed at
least 8 Cherry Blossom races. Each set of connected segments
corresponds to the run times for one athlete. Looking at all line
plots, we see a similar shape to the scatter plot in Figure 2.7, “Revised Scatter Plot of Male Runners”, i.e., an upward curve with
age. However, we can also see how an individual's performance
changes. For example, many middle-aged runners show a sharp increase
in run time with age but that is not the case for all. Some of them
improve and others change more slowly.

**Linear Fits of Run Time to Age for Individual Runners**
Here we have augmented the bottom-right line plot from Figure 2.17, “Run Times for Multiple Races” with the linear fit of
run time for each of the athletes. These are the 30 or so black dashed
line segments plotted on each individual runner's time series.

**Coefficients from Longitudinal Analysis of Athletes**
This scatter plot displays the slope of the fitted line to each of the
300+ runners who competed in at least 8 Cherry Blossom road
races. A negative coefficient indicates the runner is getting faster
as he ages. The plot also includes a fitted line. Notice that nearly all of the coefficients for those
over 50 are positive. The typical size of this coefficient for a
50-year-old is about one minute per year.

**Screen Shot of the Source for Men's 2012 Cherry Blossom Results**

**Screen Shot of the Source for Women's 2011 Cherry Blossom Results**
This screen shot is of the **HTML** source for the female results for
the 2011 Cherry Blossom road race. Notice that times given are for the
midpoint of the race (`5 Mile`) and for two finish
times (`Time` and `Net Tim`). Also
notice the leftmost column labeled `S`. While the
format is different from the male results for 2012, both are plain
text tables within `<pre>` nodes in an **HTML**
document.

## 3 Using Statistics to Identify Spam

**Boxplot of Log Likelihood Ratio for Spam and Ham**
The log likelihood ratio (LLR) for 3116 test messages was computed
using a naïve Bayes approximation based on word frequencies found in
manually classified training data. The test messages are grouped according
to whether they are spam or ham. Notice most ham messages have values
well below 0 and nearly all spam values are above 0.

**Comparison of Type I and II Error Rates**
The Type I and II error rates for the 3116 test messages are shown as
a function of the threshold. For
example, with a threshold of -43,
all messages with an LLR value above -43 are classified as spam and
those below as ham. In this case, 1% of ham is misclassified
as spam and 2% of spam is misclassified as ham.
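
The two error rates for a given threshold can be computed directly from the LLR values and the true labels. The sketch below assumes that a Type I error means ham flagged as spam and a Type II error means spam passed as ham. The function name `error_rates` and the toy values are made up for this example.

```r
# llr: log likelihood ratios; is_spam: TRUE for spam, FALSE for ham
error_rates <- function(llr, is_spam, threshold) {
  pred_spam <- llr > threshold               # classify as spam above the threshold
  c(typeI  = mean(pred_spam[!is_spam]),      # share of ham flagged as spam
    typeII = mean(!pred_spam[is_spam]))      # share of spam passed as ham
}

# Toy values (not the test messages from the figure)
llr     <- c(-120, -80, -50, -40, -10, 5, 30, 90)
is_spam <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
rates <- error_rates(llr, is_spam, threshold = -43)
rates
```

Evaluating `error_rates()` over a grid of thresholds, for example with *sapply()*, traces out two curves of the kind shown in the figure.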

**Example Tree from a Recursive Partition**
This tree is a simple example of a recursive partition fitted
model. It was fitted using the *rpart()* function and
restricting the tree depth to 3 levels. The first yes–no question is
whether the percentage of capitals in the message is less than 13. If
not, the second question is whether there are fewer than 289
characters in the message. If the answer to this question is also no,
then the next question is whether the message header contains an
`InReplyTo` key. If the answer is again no, then the
message is classified as spam. Of the 6232 messages in the training
set, 77 ham and 643 spam fall into this leaf. The spam have been
correctly classified and the 77 ham have been misclassified.

**Comparison of Two Measures of Length for a Message**
This scatter plot shows the relationship between the number of lines
and the number of characters in the body of a message. The plot is on
a log scale, and 1 is added to all of the values before taking logs to
address issues with empty bodies. The line is added for comparison purposes.

**Use of Capitalization in Email**
These plots compare the percentage of capital letters among all
letters in a message body for spam and ham. The use of a log scale
makes it easier to see that nearly 3/4 of the spam have more capital
letters than nearly all of the ham.

**Comparison of the Amount of Capitalization and the Size of the Message**
This scatter plot examines the relationship between the percentage of
capital letters among all letters in a message and the total number of
characters in the message. Spam is marked by purple dots and ham by
green. The darker color indicates overplotting. We see here that
the spam tends to be longer and have more capital letters than ham.

**Exploring Categorical Measures Derived from Email**
These two mosaic plots use area to denote the proportion of messages
that fall in each category. The plot on the top shows that messages
that have an `Re:` in the subject line tend not to be
spam. The bottom plot shows that messages from a user
with a number at the end of their email address tend to be
spam. However, few messages are sent from such users so it is not
clear how helpful this distinction will be in our classification
problem.

**Tree for Partitioning Email to Predict Spam**
This tree was fitted using *rpart()* on 6232
messages. The default values for all of the arguments to
*rpart()* were used. Notice the leftmost leaf classifies
as ham those messages with fewer than 13% capitals, fewer than 4%
**HTML** tags, and at least 1 forward. Eighteen spam messages fall into
this leaf and so are misclassified, but 2240 ham messages are properly
classified using these 3 yes–no questions.

**Type I and II Errors for Recursive Partitioning**
This plot displays the Type I and II error rates for predicting spam as a
function of the size of the complexity parameter in the
*rpart()* function. The complexity parameter is a
mechanism for specifying the threshold for choosing a split for a
subgroup. Splits that do not achieve a gain in fit of at least the
size of the parameter value provided are not made. The Type I error is
minimized at a complexity parameter value of 0.001 for an error rate
of 3.9%. The Type II error rate for this complexity parameter value
is 10.5%.
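
A sweep over the complexity parameter might look like the sketch below. It assumes the rpart package is installed and uses simulated stand-in features. The names `perCaps`, `numChars`, and `isSpam` are hypothetical. For brevity the error rate is measured on the same data used to fit the trees, whereas a real comparison would use held-out messages.

```r
library(rpart)
set.seed(3)

# Simulated stand-in for derived email features (not the data in the figure)
n <- 500
emails <- data.frame(perCaps  = runif(n, 0, 60),
                     numChars = rpois(n, 1500))
emails$isSpam <- factor(ifelse(emails$perCaps + rnorm(n, sd = 8) > 20,
                               "spam", "ham"))

cps <- c(0.1, 0.01, 0.001)
typeI <- sapply(cps, function(cp) {
  fit  <- rpart(isSpam ~ ., data = emails, method = "class",
                control = rpart.control(cp = cp))
  pred <- predict(fit, emails, type = "class")
  mean(pred[emails$isSpam == "ham"] == "spam")   # ham classified as spam
})
data.frame(cp = cps, typeI = typeI)
```

Smaller values of `cp` allow splits with smaller gains in fit, so the trees grow larger as `cp` decreases.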

## 4 Processing Robot and Sensor Log Files: Seeking a Circular Target

**Example of the Course**
This shows a sample path through the course.
The robot starts in the lower left corner.
The circular target can be seen at approximately (4.5, -6.5).
There are two rectangular obstacles and one triangular obstacle.
The horizontal dimensions range from -15 to +15, and the
vertical from -8 to +8.

**Log File Size**
This shows the distribution of the size of the 100 log files.

**Elapsed Time of 100 Experiments in Seconds**
There appear to be 3 different groups in this distribution.
Most of the experiments are completed between 1 and 16 minutes
with a "center" of about 8 minutes.
A smaller group of experiments is centered around 18 minutes.
The final group includes those that do not find the target and
use all of the 30 minutes allowed.

**Distribution of the Changes in the Horizontal and Vertical Directions**
We compute the change in the horizontal and vertical directions separately for
each pair of consecutive records in each log to explore how far the robot
typically moves between records.

**Distribution of the Velocity of the Robots**
This shows the bimodal distribution of the velocity of all of the robots
across all log files. We compute the distance between consecutive points in each log
and divide this by the time between these two records.
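
The velocity calculation described above amounts to two calls to *diff()*. The sketch below uses a made-up four-record log. The data frame `rec` and its column names are assumptions for illustration.

```r
# Toy log: time stamps in seconds and (x, y) positions (made-up values)
rec <- data.frame(t = c(0, 1, 2, 4),
                  x = c(0, 3, 3, 3),
                  y = c(0, 4, 4, 10))

# Distance between each pair of consecutive records
step <- sqrt(diff(rec$x)^2 + diff(rec$y)^2)
# Divide by the elapsed time between the same two records
velocity <- step / diff(rec$t)
velocity
```

The middle value is 0 because the robot did not move between the second and third records. Stationary records like this are one possible source of a mode near 0 in such a distribution.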

**Robot Path for the First Experiment**
This displays the path from the first log file.
The panel on the left shows the robot's movements from left to right across the course
and then vertically along the side.
The second panel shows this path relative to the entire course. This illustrates
that the robot moved along the bottom side and only slightly vertically before the run terminated.
This experiment lasted almost 19 minutes.

**Display of All Experiments**
This displays the path of the robot in each of the 100 experiments.
The starting point is displayed in green with a circle.
The direction of the robot corresponds to the shift in colors from green to red.
The final location is marked with a blue x.

**Sample Final Look**
This is the path/shape seen by the robot in the
final look of the first log file, `JRSPdata_2010_03_10_12_12_31`.
The robot is in the center of the circle.
At the top right of the circle, we see a circular-like object that might be the target.
A straight edge corresponding to a rectangular obstacle appears at the bottom of the circle.

**Enhanced Display of a Look**
This figure shows the improved display of a robot's "look."
We remove the misleading lines connecting the edges
of the circular target and the 2 meter arc.
This uses `points(, type = "l")`, but separately for each sub-group of points
that form a contiguous sub-arc of points at 2 meters, and for points that are less than 2 meters.
It does not connect adjacent but disconnected sub-arcs.

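One way to find the contiguous sub-arcs is run-length encoding. The sketch below labels each reading with the run it belongs to, so that each run can then be drawn on its own. The vector `rng`, the 2 meter cutoff as a constant, and the grouping approach are assumptions for illustration and may differ from the code behind the figure.

```r
# Toy range readings for one look; 2 is taken as the maximum range in meters
rng <- c(2, 2, 1.4, 1.3, 1.5, 2, 2, 2, 0.9, 1.0)
at_max <- rng >= 2

runs  <- rle(at_max)                                  # consecutive TRUE/FALSE runs
group <- rep(seq_along(runs$lengths), runs$lengths)   # run id for every reading
pieces <- split(seq_along(rng), group)                # indices of each sub-arc
pieces
```

Looping over `pieces` and calling `points(, type = "l")` once per element draws each sub-arc separately, so no line joins the end of one sub-arc to the start of the next.
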
**All Final Looks**
This shows the final look of each of the 100 log files.
Which show a circle?

**Density of Repeated Range Values**
This shows the distribution of the repeated measurements
of the range values when a robot revisited the same location.
These are deviations from the mean of nominally identical values.
There are some very extreme values (-1.58 and 1.68).
The distribution has very heavy tails. Most of the observations are
exactly 0.

**Characteristics of Looks**
This shows 4 different types of looks.
In the first look, the robot sees nothing and so there is no circular target present.
In the next 3 looks (moving row-wise), the robot sees a straight side or two straight sides that intersect.
Again, there is no circular target. All of the remaining looks appear to contain
a circular target. In the fifth look, the robot only sees the circle,
while in the next 4 looks, it sees a circle and part of one or two obstacles.
The last 3 looks are more complex. We see the circular target but, drawn in this manner,
the circle appears to be connected to obstacles.

**Looks Containing a Segment Identified as a Circle**
These are the looks that were classified as containing the circular target.
We see that most of the looks do indeed contain a shape that looks like the target.
However, there are several that have confused a right angle corresponding to an obstacle in the course
with the target and these seem to be false positives.

**Looks Containing No Identified Circle**
These are the looks that were classified as not containing the circular target.
We see that most of the looks do indeed contain no indication of the target.
However, there are several that do and suggest false negatives.

**Patterns in the False Negatives**
This shows the 9 looks in which the circular target appears to
be present but which were not detected by our *robot.evaluation()*
function. They are arranged to show two characteristics.
The first of these is a circular target "connected" to another obstacle.
The second pattern is a circular target that is very close to the
robot (i.e., the center of the look) and so does not appear circular.

**Misclassified Looks with a Target**
These 9 looks are those that were misclassified as not containing the target.
In all but one of these, the target is very close to the center of the robot.
The seventh look is more problematic.

## 5 Strategies for Analyzing a 12-Gigabyte Data Set: Airline Flight Delays

**Hourly Delay Quantiles**
The airline delay quantiles (in minutes) for each hour of the day.

## 6 Pairs Trading

**Historical Prices for the Two Indices, 1990–1995**
This shows the two "stock" prices for the 5-year period.
There is clearly a high correlation over time between the two financial indices.
The values of the two series are quite different and are plotted on different scales, with the scale for one of them on the right of the plot.

**Historical Ratio for the Two Indices**
This shows the *ratio* of the two "stock" prices from 1990 to 1995.
The ratio appears to move around the mean until 1994 and then to rise above it.
The horizontal lines show one and two standard deviations from the mean of the ratio.

**Ratio of the Two Series for 1995–2010**
The circles show the starting and closing positions for trades.
We open a position when the ratio is outside of the threshold lines.
We close that position when the ratio returns to the mean.
The green circles identify the opening of a trading position;
the red circles the corresponding close of the position.
Circle 7 opens a position but the ratio never returns to the mean.
We close the position on the final day of the series.
Note that the threshold lines use the mean and standard deviation
from the period 1990 to 1995.

**A Simple Plot of the Ratio of Two Stock Prices**
The mean of the ratio (\(\mu_{ratio}\)) is shown as a dashed line, together with threshold lines
at \(\mu_{ratio} \pm k \sigma_{ratio}\),
with \(k = 0.85\).
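
The opening and closing rule implied by these threshold lines can be sketched as follows. The function `find_position` is a hypothetical illustration and is not the *findNextPosition()* function mentioned in these captions. It opens a position on the first day the ratio is more than \(k\) standard deviations from the mean and closes it when the ratio next reaches the mean, or on the final day if it never does. The ratio values are made up.

```r
# Hypothetical sketch of one open/close cycle for a ratio series
find_position <- function(ratio, m, s, k = 0.85, start = 1) {
  idx <- seq(start, length(ratio))
  out <- idx[abs(ratio[idx] - m) > k * s]      # days outside the thresholds
  if (length(out) == 0) return(integer(0))     # no position to open
  open  <- out[1]
  above <- ratio[open] > m
  later <- seq(open, length(ratio))
  # Days on which the ratio has come back to (or crossed) the mean
  back  <- later[if (above) ratio[later] <= m else ratio[later] >= m]
  close <- if (length(back) > 0) back[1] else length(ratio)
  c(open = open, close = close)
}

ratio <- c(1.00, 1.02, 1.20, 1.15, 1.05, 0.98, 0.80, 0.85)
first  <- find_position(ratio, m = 1, s = 0.1)
second <- find_position(ratio, m = 1, s = 0.1, start = first[["close"]] + 1)
first
second
```

In this toy series the second position opens on day 7 and is closed on the last day because the ratio never returns to the mean.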

**Visualizing the First Three Positions**
This shows the first 3 positions for the stock price ratio.
It allows us to verify that our *findNextPosition()* function is working correctly.

**Positions for \(k = 0.5\)**
With a smaller value of \(k\), we get many more opening and closing positions.

**Profit for Different Values of k**
For this training data, all values of k yielded a positive profit.
For small values of k, there were many opportunities for positions and
the profits were larger than for increasing values of k.
The maximum profit of 0.88 occurs at values of k
between 0.61 and 0.62.

**Two Simulated Time Series**
The two time series are generated from the time series model
with \(\rho = 0.99\), \(\psi = 0.9\), and \(\beta_1 = c(.05, .1)\).

**Density of Profit**
This shows the distribution for 999 values of profit simulated
from our time series model
with \(\rho = 0.99\) and \(\psi = 0.9\).
We see a large concentration very close to 0. We also see
some slightly more commonly occurring values corresponding to the small modes/"bumps."
The extreme values are very extreme and also very variable across simulations, as we expect.

## 7 Simulation Study of a Branching Process

**Diagram of an Example Branching Process**
This diagram shows a possible realization of the birth and assassination process
studied by Aldous and Krebs. Each node in the tree represents a
program. Jobs marked with an “`X`” have
completed running, those marked “`R`”
are currently running, and “`W`” nodes
are waiting for their parent to complete before starting to run.

**Empirical CDFs for the Sum of Three Exponentials**
In a study of the sum of 3 independent exponential random
variables with the same rate, 6,000 sample outcomes were generated and
used to estimate the cumulative distribution function (CDF). The plot shows
the empirical CDF for 16 values of the rate parameter,
i.e., 0.1, 0.2, ..., 0.9, 1, 2, ..., 7.
To help match each rate to its curve, a few of these rates are included in the figure
next to their corresponding curves.
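
For a single rate, the empirical CDF can be estimated and checked against theory as sketched below. The check relies on the standard result that the sum of 3 independent exponentials with a common rate follows a gamma distribution with shape 3 and that rate. The rate of 0.5 and the evaluation points are arbitrary choices for this example.

```r
set.seed(4)
rate <- 0.5
n    <- 6000
# Sum of 3 independent exponential draws with the same rate
total <- rexp(n, rate) + rexp(n, rate) + rexp(n, rate)

Fhat <- ecdf(total)            # empirical CDF as a step function
x    <- c(2, 6, 12)
cmp  <- cbind(x,
              empirical = Fhat(x),
              gamma     = pgamma(x, shape = 3, rate = rate))
cmp
```

Repeating this for each rate and overlaying the curves, for example with `plot(Fhat)` followed by *lines()*, produces a display like the one described.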

**Empirical Distribution of the Number of Offspring for a Job**
This figure shows the observed proportion of the number of offspring
randomly generated for a parent with a birth time of 1 and a
completion time of 6. The inter-arrival times of the parent's children
are independent and follow a common exponential distribution. The process was simulated
1,000 times, and the observed proportion of 0, 1, 2, ..., 9 offspring
plotted (black line segments). In addition, the true probabilities
from the Poisson(2.5) distribution are plotted next to the observed
proportions (grey line segments).
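
The simulation can be sketched by accumulating exponential inter-arrival times until the parent completes. The rate of 0.5 used here is an assumption: it is the value for which the rate times the parent's lifetime (6 - 1 = 5) equals the Poisson mean of 2.5 quoted above. The function name `count_offspring` is also made up.

```r
set.seed(5)
# Assumed rate 0.5, so that rate * (completion - birth) = 2.5
count_offspring <- function(birth = 1, completion = 6, rate = 0.5) {
  t <- birth
  n <- 0
  repeat {
    t <- t + rexp(1, rate)       # time of the next child's birth
    if (t > completion) break    # parent has finished; no more children
    n <- n + 1
  }
  n
}

kids <- replicate(1000, count_offspring())
observed <- table(factor(kids, levels = 0:9)) / 1000
round(rbind(observed, poisson = dpois(0:9, lambda = 2.5)), 3)
```

The two rows should be close to each other, since arrivals with exponential gaps over a fixed interval have a Poisson-distributed count.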

**Visualization of a Randomly Generated Branching Process**
This plot shows the lifetimes of each member of a randomly generated
birth and assassination process. Each job's lifetime is
represented by a grey line segment with endpoints at its birth and
completion times. The `X`s on the segment denote the birth times of the
job's offspring. The dashed lines separate the generations. None of
the jobs in the fourth generation of this instance of the process had
offspring so the process terminated there. Notice that one job in the
second generation ran for a very long time and had just one child.

**Visualization of a Randomly Generated Branching Process Over a Fixed Time Interval**
This plot shows the lifetimes of each member of the first 5
generations of a simulated birth and assassination process that has
been observed up to time 8. Each lifetime is represented by a grey
line segment with endpoints at its birth and completion times. A job
that has not completed by 8 time steps is censored and consequently, we see
only its offspring born before 8.

**Scatterplots of the Number of Generations Against the Number of Offspring**
These 4 scatter plots show the different behavior exhibited by the
branching process as the birth and completion rates
vary. Each simulation terminates
when the process dies out or one of the following limits is reached:
200 generations; 100,000 offspring. One hundred simulations are run
for each pair of rates.

**Three-Dimensional Scatterplot of the Number of Offspring by Birth and Completion Rates**
Each point in this scatter plot represents the upper quartile of the
number of offspring in 400 random outcomes of the branching process
for a particular pair of rates. The offspring are plotted
on log base 10 scale, so the first category, i.e., [0, 0.5)
corresponds to 1 to 3 offspring.

**Image Map of the Proportion of Replicates That Reach the Simulation Limits**
This image map represents the proportion of the
400 simulations for each pair of rates
that reached 20 generations or 1000 offspring and so were
terminated.

**Proportion of Simulations with at Least 20 Offspring**
This plot uses colors from the rainbow palette to represent the
proportion of 400 random outcomes for each pair of rates that have at least 20 offspring.

## 8 A Self-Organizing Dynamic System with a Phase Transition

**Movement on a Sample Grid**
(a) shows the initial state of a 3-by-5 grid containing 3 red and 3 blue cars.
At time \(t = 1\), the red cars move horizontally.
The red car on the bottom row "wraps around" to the first column on the same row.
At time \(t = 2\), the blue cars move up within the same column.
The blue car in cell \((1, 3)\) is blocked by the red car above it
that moved into that cell at time \(t = 1\).
Accordingly, we obtain a sequence of grids indexed by time.
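
The red cars' move, including the wrap-around and the blocking rule, can be sketched with matrix indexing. The encoding (0 for empty, 1 for red, 2 for blue), the function name `move_red`, and the toy grid are assumptions for illustration. Matrix row 1 is simply the first row of the R matrix, which need not match the orientation of the figure.

```r
# Sketch: 0 = empty, 1 = red, 2 = blue (encoding chosen for this example)
move_red <- function(grid) {
  nc    <- ncol(grid)
  right <- c(2:nc, 1)                           # column to the right, wrapping around
  red   <- which(grid == 1, arr.ind = TRUE)     # (row, col) of every red car
  dest  <- cbind(red[, "row"], right[red[, "col"]])
  free  <- grid[dest] == 0                      # judged on the grid before any car moves
  grid[red[free, , drop = FALSE]]  <- 0         # vacate the old cells
  grid[dest[free, , drop = FALSE]] <- 1         # occupy the new cells
  grid
}

g <- matrix(0, 3, 5)
g[3, 5] <- 1      # red car in the last column: wraps around to column 1
g[1, 2] <- 1      # red car blocked by the blue car to its right
g[1, 3] <- 2
moved <- move_red(g)
moved
```

Because all destinations are checked against the grid as it was before the move, every red car decides at the same time, and a car directly behind another red car stays put for this step.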

**Sample Free-Flowing Traffic Grids**
The 3 panels show the grid at times t = 0 (the initial grid), t = 500, and t = 1000.
Cars occupy 25% of the cells. After 500 iterations, the cars start to organize
along diagonal lines. After 1000 iterations, the lines are becoming clearer.

**Sample Deadlocked Traffic Grids**
Three grids at different time steps showing the emergence of deadlocked traffic.

**Simple Plot from** *image()*
Using a simple call to *image()* to display the grid uses
the wrong colors and also does not display the cells in the order we expect or want.

**Sample Grid Displays**
The first panel shows a 100-by-100 grid with 50% of the cells occupied equally
with red and blue cars.
The second panel has the same dimensions but there are 5000 red cars and only 500 blue cars.
The third panel shows a small 3-by-4 grid where we placed the cars manually.

**A Grid before and after Moving the Red Cars**
The left panel shows the initial 4-by-7 grid.
The second panel shows the state of the grid after the red cars have moved.

**Run-Time for Code as a Function of the Number of Grid Cells**
This shows that as the number of grid cells, and the total number of cars, doubles,
the time taken approximately doubles. We have also added the
fit for the elapsed times as a function of grid size *n*.
This plot was computed for square grids, but applies generally to the number of
cells in a grid.

**Comparison of Times for Different Computational Approaches**
This shows the relative speed of the different approaches, ranging
from the naïve loop version, to the fastest vectorized approach, to
using the *C* code. There is a very significant speedup from the
loop-version of the code. However, this plot hides the significant
speedup between the different vectorized approaches and especially the
*C* code.
Panel (b) shows these relative speedups when we omit the loop approaches
and we can compare the remaining approaches on a more appropriate scale.

**Sample 1024-by-1024 Grids with Different Densities**
Each grid has the same dimension and is run for 128,000 time steps.
In the top-left, we have a free-flowing grid with a density of 0.25 in which the cars arrange
themselves into diagonal lines. At a slightly higher density of 0.33 (top-right),
we see one group of cars that are partially deadlocked but which continue to move.
We also see bands of red and blue cars moving, but not parallel to each other.
Some cars will move freely, at least until they meet the partially deadlocked group.
The cars in that group move more slowly until they escape. This gives rise to different velocities.
At a slightly higher density, we see additional groups but the same pattern of moving cars.
For a different configuration at the same density (0.38) we see a very different structure
with mostly deadlocked cars in two parallel groups running diagonally. Note the alternating colors of the diagonal bands.
For higher densities, we see more localized groupings but the same deadlock.

**Average Velocity of Cars by Occupancy Density**
These plots mirror those in [bib:Raissa2005] (page 3),
with apparently more replications.
They show the average proportion of cars moving in each time step on the vertical axis.
The horizontal axis shows a range of car densities for the different grids.
The 4 panels correspond to the different sizes of the square grid: 64, 128, 256, and 512.
Each point in a plot corresponds to running the initial grid through 64,000 total
iterations of both red and blue cars, i.e. 128,000 time steps.
We see variation for a given density corresponding to different random grid configurations.
We see that the density at which deadlock occurs decreases as the grid size increases.
Importantly, we see that there are different points of equilibrium, not
simply free-flowing or deadlocked. These are the intermediate states that were
unexpected until reported in [bib:Raissa2005].

## 9 Simulating Blackjack

**Density Plot Comparing the Winnings from Two Strategies**
This plot shows the density of the gain from 1,000 $1 bets for two
different strategies: the optimal strategy for an infinite deck, and a
simple strategy that hits when the dealer shows a 6 or higher and
the player's cards are under 17, and otherwise stands. A gain of $2 or
-$2 results from winning or losing a double down, respectively. The
simple strategy never doubles the bet.
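
The simple strategy is a single rule and can be written as a small function. The function name, the argument names, and the "hit"/"stand" return values are assumptions for this sketch. How an ace is counted when the dealer shows one is not stated in the caption, so the caller is assumed to pass whatever card value they want to use.

```r
# Sketch of the simple rule: hit when the dealer shows 6 or higher and
# the player's total is under 17; otherwise stand.
# dealer_up: value of the dealer's face-up card (ace handling left to the caller)
# player_total: current value of the player's hand
simple_strategy <- function(player_total, dealer_up) {
  if (dealer_up >= 6 && player_total < 17) "hit" else "stand"
}

simple_strategy(12, 10)   # dealer strong, player low
simple_strategy(12, 5)    # dealer weak
simple_strategy(17, 10)   # player already at 17
```

The rule never returns a double-down action, which is why the simple strategy's gains on a $1 bet cannot reach $2 or -$2 through doubling.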

**A Comparison of the Optimal and Simple Strategies**
The distribution of payoffs under the optimal strategy peaks at a
higher value. Note that the variability is considerable: there is
about a 40% chance of losing 5% or more of your money with the optimal
strategy and about a 55% chance with the simple strategy.

**Plot of the *bet()* Function**
This plot shows the *bet()* function over the domain from
-5 to 10. We can use this visualization to confirm that the function
works as expected.

**Comparison of Average Gain for Card Counting**
A comparison of the fixed and varying bet amounts.
For each set of 50 hands, we take the difference in the gains between
a fixed bet and a bet adjusted according to the Hi-Low counting
strategy. Values above 0 indicate that the gain for the counting strategy
exceeded the gain for the fixed bet.
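As a reminder of how the Hi-Low count itself is kept, here is a minimal sketch using the standard Hi-Low values: +1 for cards 2 through 6, 0 for 7 through 9, and -1 for tens, face cards, and aces. The function names and the card encoding are assumptions, and the sketch does not reproduce the *bet()* function that turns the count into a bet size.

```python
def hi_low_value(card):
    """Hi-Low value of one card; card is 2..10 or one of "J", "Q", "K", "A"."""
    if card in ("J", "Q", "K", "A") or card == 10:
        return -1
    if 2 <= card <= 6:
        return 1
    return 0  # 7, 8, 9

def running_count(cards):
    """Running Hi-Low count after seeing the given cards, in order."""
    return sum(hi_low_value(card) for card in cards)
```

A positive running count means more low cards than high cards have been seen, so the remaining deck is relatively rich in high cards. Over one complete deck the values cancel and the count returns to 0.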

## 10 Baseball: Exploring Data in a Relational Database

**Boxplots of Team Payrolls by League**
These boxplots show the team payrolls for the American League (dark
gray) and the National League (light gray) from 1985 to 2012. Payroll
has been logged and is reported in millions of dollars. The American
League appears to have greater spread in payroll than the National
League. Also evident is the sharp increase in payrolls in the late
1980s and early 1990s.

**Scatter Plot of Team Payroll by Year**
In this scatter plot, the plotting symbol is a transparent gray and
the year has been jittered slightly to avoid overplotting. The
identifier of each team that won the World Series is added to the
corresponding point, which is colored red. In almost every case,
these dots are at or above the upper quartile of team payroll.
Payroll is reported in millions of dollars and plotted on a log scale.

## 11 CIA Factbook Mashup

**Display of Infant Mortality by Country on Google Earth**
This screenshot of the Google Earth virtual earth browser displays
circles scaled to the population size and colored according to the
infant mortality rate for a country. The data are available from the
CIA Factbook. The locations of the circles are determined from
MaxMind's latitude and longitude of the country's geographic
center. When the viewer clicks on a circle, a window pops up with more
detailed information for that country.

**Distribution of Infant Mortality for Countries in the CIA Factbook**
This histogram of infant mortality rates shows a highly skewed
distribution. Most countries have rates under 20 per 1000 live
births and a few countries have rates between 80 and 125.

**Population Distribution for Countries in the CIA Factbook**
This histogram of the square root of population size for countries
shows a highly skewed distribution with a mode around 1000.

**Incorrect Map of Infant Mortality and Population**
In this map each disk corresponds to a country's infant mortality and
population. The size of the disk is proportional to the population and
the color reflects the infant mortality rate. Notice the size of
China's disk is too small – it has the highest population but one of
the smallest disks. Other anomalies are apparent with closer
inspection.

**Screenshot of the ISO Country Code Mapping**

**Map of Infant Mortality and Population**
This map correctly matches the country demographic information with
latitude and longitude. Notice now the symbols for India and China are
approximately the same size and the largest symbols on the map. Also note
that the symbol for the United Kingdom is now pale yellow, the color
we would expect it to be, because it is not being confused with Gabon.
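The kind of mismatch behind the incorrect map can be sketched with two hand-made records. The assumption here is that the Factbook identifies countries with its own two-letter codes, in which "uk" is the United Kingdom and "gb" is Gabon, while the latitude and longitude table uses ISO 3166 codes, in which "GB" is the United Kingdom and "GA" is Gabon. The dictionaries, coordinates, and function names below are illustrative only.

```python
# Hypothetical, hand-made records for illustration only.
factbook = {"uk": "United Kingdom", "gb": "Gabon"}      # Factbook-style codes
lat_lon = {"GB": (54.0, -2.0), "GA": (-1.0, 11.75)}     # ISO codes, approximate centers
factbook_to_iso = {"uk": "GB", "gb": "GA"}              # the code mapping

def naive_join(factbook, lat_lon):
    """Match codes directly, ignoring that the two code systems differ."""
    return {name: lat_lon[code.upper()]
            for code, name in factbook.items()
            if code.upper() in lat_lon}

def mapped_join(factbook, lat_lon, code_map):
    """Translate each Factbook code to ISO before looking up the location."""
    return {name: lat_lon[code_map[code]]
            for code, name in factbook.items()
            if code in code_map and code_map[code] in lat_lon}
```

With these records, `naive_join(factbook, lat_lon)` places Gabon at the United Kingdom's coordinates and drops the United Kingdom entirely, whereas `mapped_join(factbook, lat_lon, factbook_to_iso)` locates both countries correctly.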

**Default Google Earth Image**

**Screenshot of the CIA Factbook Rendered in Chrome**
The Chrome browser renders an **XML** file using indentation and color
to highlight the structure of the document. The top or root node of
the file is `<factbook>` and it has an attribute called `lastupdate`,
which indicates how recently the information was updated. The `<news>` nodes are
indented one space, corresponding to their depth in the hierarchy,
i.e., they are children of `<factbook>`.

**Screenshot of the MaxMind Web Page with Country Latitude and Longitude**
MaxMind makes available the average latitude and longitude for
countries in several formats, including in a display on the Web as
shown here in this screenshot.

**Screenshot of the HTML Source of the MaxMind Web Page**

## 12 Exploring Data Science Jobs with Web Scraping and Text Mining

**Kaggle Jobs Board Web Page**
This is the front page of the jobs forum. All of the posts are
available via this page and its successive pages, reached through the numbered links (2, 3, ...).
We can click on the link for a specific job to read that post.

**Job Post and Comments**
This shows a very informal job posting and follow-up posts on the same page.
There are very few details about the position being advertised.

**Screenshot of Job Post by Pandora, the free, personalized radio Web site**
There are several paragraphs describing the company and the position.
There is a bulleted list of required skills.

**List of Search Results for Data Science Jobs**
We can enter the search term in various fields to obtain a list of matching job posts.
These are shown below the text fields.
Each entry corresponds to a job posting. There is some metadata in these items,
but none of much interest to us. There are also numerous advertisements
on the page, distracting us from the data of interest.

**Job Post on monster.com**
This is a cross-listed job posting on monster.com that actually comes
from the cybercoders.com Web site.
As with the earlier posts, much of the post is free-form text.
There are several lists that provide metadata about the job.
There is also structured content for *similar jobs*,
listing the related skills.

**Semi-Structured Job Post on Kaggle**
This shows another job posting.
Again, much of the content is free-form text
within different titled sections.
There are 2 lists with metadata about
the education and experience qualifications necessary for the job.

**Search Results for Data Scientist Jobs on CyberCoders.com**
We enter the search term in the text field at the top of the Web page.
The results are displayed below. Note that in addition to the link
to each particular job, the results page also shows metadata about each
job, e.g., salary range, location, and a list of required skills.
If this is all the information we wanted, we would not have to scrape
the actual postings.

**Sample Job Post on CyberCoders.com**
This job posting, like others on cybercoders.com,
has several paragraphs of free-form text,
and also some itemized lists including one listing the skills necessary for the position.
Additionally, the post has the location and salary separately in the top-right
corner. It also displays the preferred skills as a list of phrases.

**Frequency of Selected Terms over all Kaggle Job Posts**
This shows the number of occurrences of each of the selected terms
across all 842 job posts on Kaggle as of January 2015.
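A count of this kind can be sketched as follows. The function name `term_counts` and the sample posts are hypothetical. Matching is case-insensitive and restricted to whole terms, which is an assumption about how occurrences were tallied for the figure.

```python
import re
from collections import Counter

def term_counts(posts, terms):
    """Count whole-term, case-insensitive occurrences of each term across posts."""
    counts = Counter({term: 0 for term in terms})
    for text in posts:
        for term in terms:
            # Lookarounds instead of \b so that terms ending in punctuation,
            # such as "C++", are still matched as whole terms.
            pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
            counts[term] += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

# Hypothetical posts, for illustration only.
sample_posts = [
    "We need R and Python; r skills a plus.",
    "Python, SQL, and machine learning.",
]
sample_terms = ["R", "Python", "SQL", "machine learning"]
```

Note that case-insensitive matching counts a lowercase "r" standing alone as an occurrence of R, so short terms may need manual checking.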

**Dotplot for Frequency of Skill Phrases across CyberCoders.com Data Science Job Posts**
This shows the number of occurrences of different terms we selected
across Data Science job postings on this Web site.

**Word Cloud for Frequency of Skill Phrases Across CyberCoders.com Job Posts**
This is a different display of the counts of different skill phrases
across posts on cybercoders.com.