1 Predicting Location via Indoor Positioning Systems
Floor Plan of the Test Environment
In this floor plan, the 6 fixed access points are denoted by black
square markers, the offline/training data were collected at the
locations marked by grey dots, and the online measurements were
recorded at randomly selected points indicated with black dots. The
grey dots are spaced one meter apart.
Empirical CDF of Orientation for the Hand-Held Device
This empirical distribution function of orientation shows that there
are 8 basic orientations that are 45 degrees apart. We see from
the steps in the function that these orientations are not exactly 45,
90, 135, etc. Also, the 0 orientation is split into two groups,
one near 0 and the other near 360.
Boxplots of Orientation for the Hand-Held Device
These boxplots of the original orientation against the rounded value
confirm that the values have been mapped correctly to 0, 45, 90, 135, etc.
The “outliers” at the top left corner of the plot are the
values near 360 that have been mapped to 0.
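A minimal sketch of this mapping (the function name and the input vector are illustrative, not from the text): round each angle to the nearest multiple of 45 degrees and fold 360 back onto 0.

    roundOrientation <- function(angles) {
      refs <- seq(0, by = 45, length = 9)          # 0, 45, ..., 315, 360
      nearest <- sapply(angles, function(a) which.min(abs(a - refs)))
      c(refs[1:8], 0)[nearest]                     # fold 360 back onto 0
    }

    # e.g., roundOrientation(c(358.2, 2.1, 44.8, 181.4)) yields 0 0 45 180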
Screenshot of the coffer.com MAC Address Lookup Form
The coffer.com Web site offers lookup services to find the
vendor for a MAC address and vice versa.
Counts of Signals Detected at Each Position
Plotted at each location in the building is the total number of
signals detected from all access points for the offline data. Ideally
for each location, 110 signals were measured at 8 angles for each of 6
access points, for a total of 5280 recordings. These data include a
seventh MAC address and not all signals were detected, so there are
about 5500 recordings at each location.
Signal Strength by Angle for Each Access Point
The boxplots in this figure represent signals for one location, which
is in the upper left corner of the floor plan. These boxes
are organized by access point and the angle of the hand-held
device. The dependence of signal strength on angle is evident at
several of the access points, e.g., 00:14:bf:b1:97:90
in the top right panel of the figure.
Distribution of Signal by Angle for Each Access Point
The density curves shown here are for the signal strengths measured at
one location. These 48 density plots represent each of the
access point and angle
combinations. There are roughly 110 observations in each panel. Some
look roughly normal while many others look left-skewed.
SD of Signal Strength by Mean Signal Strength
The average and SD for the signals detected at each
location-angle-access point combination are plotted against each
other. The weak signals have low variability and the stronger signals
have greater variability.
Comparison of Mean and Median Signal Strength
This smoothed scatter plot
shows the difference between the mean and
median signal strength for each combination of location, access point,
and angle against the number of observations. These differences are
close to 0 with a typical deviation of 1 to 2 dBm.
Median Signal at Two Access Points and Two Angles
These four heat maps provide a color
representation of the median
signal strength. The top two maps are for the access point
00:14:bf:b1:97:90
and the angles 0 (left) and 135
(right). The two bottom heat maps represent the signal strength for
the access point 00:0f:a3:39:e1:c0
and the same two angles.
Signal Strength vs. Distance to Access Point
These 48 scatter plots show the relationship between the signal
strength and the distance to the access point for each of the 6 access
points and 8 orientations of the device. The shape is consistent
across panels, showing curvature in the relationship.
Floor Plan with Predicted and Actual Locations
The red line segments shown in the floor plan connect the test
locations (black dots) to their predicted locations (asterisks). The
top and bottom plots show the predictions from two versions of the
model. In this model, we use as training data the average signal
strengths from each of the 166 offline locations (grey dots) to the
6 access points (black squares) for the 3 closest angles to the
angle at which the test data was measured.
Cross-Validated Selection of the Number of Neighbors
This line plot shows the sum of square errors as a function of the
number of neighbors used in predicting the location of a new
observation. The sums of squared errors are obtained via
cross-validation of the offline data.
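A hedged sketch of how such a curve could be computed; predKNN(), the offline data frame, and its posXY, posX, and posY columns are assumed names for illustration, not taken from the text.

    v  <- 11                                     # number of folds (assumed)
    ks <- 1:20
    locs  <- unique(offline$posXY)
    folds <- sample(rep(1:v, length.out = length(locs)))

    err <- sapply(ks, function(k) {
      sse <- 0
      for (f in 1:v) {
        testLocs <- locs[folds == f]
        train <- offline[!(offline$posXY %in% testLocs), ]
        test  <- offline[  offline$posXY %in% testLocs, ]
        est <- predKNN(train, test, k = k)       # hypothetical k-NN predictor
                                                 # returning an n x 2 matrix of (x, y)
        sse <- sse + sum((est - as.matrix(test[, c("posX", "posY")]))^2)
      }
      sse
    })
    plot(ks, err, type = "l",
         xlab = "Number of Neighbors", ylab = "Sum of Squared Errors")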
2 Modeling Runners' Times in the Cherry Blossom Race
Screen Shot of Cherry Blossom Run Web site
This page contains links to each year's race results. The year 1999 is
the earliest for which they provide data. Men's and women's results
are listed separately.
Screen Shot of the 2012 Male Results
This screenshot shows the results, in race order, for men competing in
the 2012 Cherry Blossom 10 Mile Run. Notice that both 5-mile times
and net times are provided. We know that the Time
column is net time because it is so indicated in the header of the
table.
Screen Shot of Men's 2011 Race Results
This screenshot shows the results, in race order, for men competing in
the 2011 Cherry Blossom road race. Notice that in 2011, 3 times are
recorded – the time to complete the first 5 miles and the gun
and net times for the full run. In contrast, the results from 2012 do
not provide gun time.
Box Plot of Age by Year
These side-by-side boxplots of age for each race year show a few
problems with the data for 2003 and 2006. The runners in these years
are unusually young.
Box Plot of Age by Year
These side-by-side boxplots of age for each race year show a
reasonable age distribution. For example, the lower quartiles for all
years range between 29 and 32. The problems identified earlier for
2003 and 2006 have been addressed.
Default Scatter Plot for Run Time vs. Age for Male
Runners
This plot demonstrates that a simple scatter plot
of run time by age
for the 70,000 male runners leads to such severe overplotting that
the shape of the data is not discernible.
Revised Scatter Plot of Male Runners
This plot revises the simple scatter plot of Figure 2.6, “Default Scatter Plot for Run Time vs. Age for Male
Runners” by changing the plotting symbol from a
circle to a disk, reducing the size of the plotting symbol, using a
transparent color for the disk, and adding a small amount of random
noise to age. Now we see the shape of the high density region
containing most of the runners and the slight upward trend of time
with increasing age.
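A sketch of these adjustments, assuming a data frame menRes with columns age and runTime (hypothetical names):

    darkish <- rgb(0.2, 0.2, 0.4, alpha = 0.1)            # transparent color
    plot(runTime ~ jitter(age, amount = 0.5), data = menRes,
         pch = 19, cex = 0.2, col = darkish,              # small solid disks
         xlab = "Age (years)", ylab = "Run Time (minutes)")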
Smoothed Scatter Plot of Male Runners Race Times vs. Age
This plot offers an alternative to the scatter plot of Figure 2.7, “Revised Scatter Plot of Male Runners” that uses jittering and
transparent colors to ameliorate the overplotting. Here there is no
need to jitter age because the smoothing action essentially does that
for us by spreading an individual runner's (age, run time) pair over a
small region. The high density region has a shape very similar to that
in the earlier plot.
Side-by-Side Boxplots of Male Runners' Run Time vs. Age
This sequence of boxplots shows the quartiles of time for men grouped
into 10-year age intervals. As age increases, all the quartiles
increase. However, the box becomes asymmetrical with age, which
indicates that the upper quartile increases faster than the median and
lower quartile.
Residual Plot from Fitting a Simple Linear Model of Performance to Age
Shown here is a smoothed scatter plot of the residuals from the fit of
the simple linear model of run time to age for male runners who are 15 to
80 years old. Overlaid on the scatter plot are two curves. The
“curve” in purple is a solid horizontal line at \(y = 0\).
The green dashed curve is a local smooth of the residuals.
Piecewise Linear and Loess Curves Fitted to Run Time vs. Age
Here we have plotted the fitted curves from loess() and
a piecewise linear model with hinges at 30, 40, 50, and 60. These
curves follow each other quite closely. However, there appears to be
more curvature in the loess fit for ages over 50 that is not captured
in the piecewise linear model.
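A sketch of the two fits, again assuming a data frame menRes with columns age and runTime; overAge() is an illustrative helper that constructs the hinge terms for the piecewise linear model.

    overAge <- function(age, hinge) pmax(0, age - hinge)
    pw <- lm(runTime ~ age + overAge(age, 30) + overAge(age, 40) +
                       overAge(age, 50) + overAge(age, 60), data = menRes)
    lo <- loess(runTime ~ age, data = menRes)

    ages <- 15:80
    plot(ages, predict(lo, data.frame(age = ages)), type = "l",
         xlab = "Age (years)", ylab = "Fitted Run Time (minutes)")
    lines(ages, predict(pw, data.frame(age = ages)), lty = 2)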
Line Plot of the Number of Male Runners by Year
This plot shows that the number of male runners in the Cherry Blossom
10-mile race has more than doubled from 1999 to 2012.
Density Curves for the Age of Male Runners in 1999 and 2012
These two density curves have quite different shapes. The 1999 male
runners have a broad, nearly flat mode where they are roughly evenly
distributed in age from 28 to 45. In contrast, the 2012 runners are
younger, with a sharper peak just under 30 years and a right-skewed
distribution.
Loess Curves Fit to Performance for 1999 and 2012 Male Runners
This loess fit of run time to age for 2012 male runners sits above the
fit for 1999 male runners. The gap between these curves is about 5
minutes for most ages. The exception is in the late 40s to early 60s
where the curves are within 2–3 minutes of each other. Both curves
have a similar shape.
Difference between Loess Curves
This line plot shows the difference between the predicted run time
for 2012 and 1999 male runners.
Screen Shot of One Runner's Web Page of Race Results
This Web page at https://storage.athlinks.com contains
the race results of one runner who participated in the Cherry Blossom
run for 12 of the 14 years for which we have data. Notice that his
fastest time was from his most recent run in 2012 where he completed
the race in under 85 minutes. He was 45 at that time. Also, his
slowest time was 123 minutes in 2002 at the age of 35.
Run Times for Multiple Races
These line plots show the times for male runners who completed at
least 8 Cherry Blossom races. Each set of connected segments
corresponds to the run times for one athlete. Looking at all line
plots, we see a similar shape to the scatter plot in Figure 2.7, “Revised Scatter Plot of Male Runners”, i.e., an upward curve with
age. However, we can also see how an individual's performance
changes. For example, many middle-aged runners show a sharp increase
in run time with age but that is not the case for all. Some of them
improve and others change more slowly.
Linear Fits of Run Time to Age for Individual Runners
Here we have augmented the bottom-right line plot from Figure 2.17, “Run Times for Multiple Races” with the linear fit of
run time to age for each of the athletes. These are the 30 or so black dashed
line segments plotted on each of the individual runners' time series.
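A sketch of these per-runner fits, assuming a data frame eightRaces with columns id, age, and runTime (hypothetical names):

    byRunner <- split(eightRaces, eightRaces$id)
    fits <- lapply(byRunner, function(d) lm(runTime ~ age, data = d))

    # overlay the fitted line on one runner's time series
    d <- byRunner[[1]]
    plot(runTime ~ age, data = d, type = "b")
    abline(fits[[1]], lty = 2)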
Coefficients from Longitudinal Analysis of Athletes
This scatter plot displays the slope of the fitted line to each of the
300+ runners who competed in at least 8 Cherry Blossom road
races. A negative coefficient indicates the runner is getting faster
as he ages. The plot includes a fitted line and a smooth
curve. Notice that nearly all of the coefficients for those
over 50 are positive. The typical size of this coefficient for a
50-year-old is about one minute per year.
Screen Shot of the Source for Men's 2012 Cherry Blossom Results
Screen Shot of the Source for Women's 2011 Cherry Blossom Results
This screen shot is of the HTML source for the female results for
the 2011 Cherry Blossom road race. Notice that times are given for the
midpoint of the race (5 Mile) and for two finish
times (Time and Net Tim). Also
notice the leftmost column labeled S. While the
format is different from the male results for 2012, both are plain
text tables within <pre>
nodes in an HTML document.
3 Using Statistics to Identify Spam
Boxplot of Log Likelihood Ratio for Spam and Ham
The log likelihood ratio (LLR) for 3116 test messages was computed
using a naïve Bayes approximation based on word frequencies found in
manually classified training data. The test messages are grouped according
to whether they are spam or ham. Notice most ham messages have values
well below 0 and nearly all spam values are above 0.
Comparison of Type I and II Error Rates
The Type I and Type II error rates for the 3116 test messages are shown as
a function of the threshold. For
example, with a threshold of -43,
all messages with an LLR value above -43 are classified as spam and
those below as ham. In this case, 1% of ham is misclassified
as spam and 2% of spam is misclassified as ham.
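A sketch of how the two error rates vary with the threshold, assuming a numeric vector llr of log likelihood ratios and a parallel logical vector isSpam for the test messages (both assumed names):

    taus   <- seq(-200, 100, by = 1)
    typeI  <- sapply(taus, function(tau) mean(llr[!isSpam] >  tau))  # ham classified as spam
    typeII <- sapply(taus, function(tau) mean(llr[ isSpam] <= tau))  # spam classified as ham

    matplot(taus, cbind(typeI, typeII), type = "l", lty = 1:2,
            xlab = "Threshold", ylab = "Error Rate")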
Example Tree from a Recursive Partition
This tree is a simple example of a recursive partition fitted
model. It was fitted using the rpart() function and
restricting the tree depth to 3 levels. The first yes–no question is
whether the percentage of capitals in the message is less than 13. If
not, the second question is whether there are fewer than 289
characters in the message. If the answer to this question is also no,
then the next question is whether the message header contains an
InReplyTo
key. If the answer is again no, then the
message is classified as spam. Of the 6232 messages in the training
set, 77 ham and 643 spam fall into this leaf. The spam have been
correctly classified and the 77 ham have been misclassified.
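A sketch of a depth-restricted fit like this one, assuming a data frame emailDF whose isSpam column is a factor and whose remaining columns are the derived message features:

    library(rpart)
    fit3 <- rpart(isSpam ~ ., data = emailDF, method = "class",
                  control = rpart.control(maxdepth = 3))
    plot(fit3, uniform = TRUE)
    text(fit3, use.n = TRUE)    # use.n shows the ham/spam counts in each leaf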
Comparison of Two Measures of Length for a Message
This scatter plot shows the relationship between the number of lines
and the number of characters in the body of a message. The plot is on
log scale, and 1 is added to all of the values before taking logs to
address issues with empty bodies. The line is added for comparison purposes.
Use of Capitalization in Email
These plots compare the percentage of capital letters among all
letters in a message body for spam and ham. The use of a log scale
makes it easier to see that nearly 3/4 of the spam have more capital
letters than nearly all of the ham.
Comparison of the Amount of Capitalization and the Size of the
Message
This scatter plot examines the relationship between the percentage of
capital letters among all letters in a message and the total number of
characters in the message. Spam is marked by purple dots and ham by
green. The darker color indicates overplotting. We see here that
the spam tends to be longer and have more capital letters than ham.
Exploring Categorical Measures Derived from Email
These two mosaic plots use area to denote the proportion of messages
that fall in each category. The plot on the top shows those messages
that have an Re:
in the subject line tend not to be
spam. The bottom plot shows that those messages that are from a user
with a number at the end of their email address tend to be
spam. However, few messages are sent from such users so it is not
clear how helpful this distinction will be in our classification
problem.
Tree for Partitioning Email to Predict Spam
This tree was fitted using rpart() on 6232
messages. The default values for all of the arguments to
rpart() were used. Notice the leftmost leaf classifies
as ham those messages with fewer than 13% capitals, fewer than 4%
HTML tags, and at least 1 forward. Eighteen spam messages fall into
this leaf and so are misclassified, but 2240 of the ham messages are properly
classified using these 3 yes–no questions.
Type I and II Errors for Recursive Partitioning
This plot displays the Type I and Type II error rates for predicting spam as a
function of the size of the complexity parameter in the
rpart() function. The complexity parameter is a
mechanism for specifying the threshold for choosing a split for a
subgroup. Splits that do not achieve a gain in fit of at least the
size of the parameter value provided are not made. The Type I error is
minimized at a complexity parameter value of 0.001 for an error rate
of 3.9%. The Type II error rate for this complexity parameter value
is 10.5%.
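A sketch of this search over complexity parameters, reusing the assumed emailDF for training and an assumed testDF with the same columns (isSpam levels "no"/"yes" are also assumptions):

    library(rpart)
    cps  <- c(0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05)
    errs <- t(sapply(cps, function(cp) {
      fit  <- rpart(isSpam ~ ., data = emailDF, method = "class",
                    control = rpart.control(cp = cp))
      pred <- predict(fit, newdata = testDF, type = "class")
      c(typeI  = mean(pred[testDF$isSpam == "no" ] == "yes"),   # ham called spam
        typeII = mean(pred[testDF$isSpam == "yes"] == "no"))    # spam called ham
    }))
    matplot(cps, errs, type = "b", log = "x",
            xlab = "Complexity Parameter", ylab = "Error Rate")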
4 Processing Robot and Sensor Log Files: Seeking a Circular Target
Example of the Course
This shows a sample path through the course.
The robot starts in the lower left corner.
The circular target can be seen at approximately (4.5, -6.5).
There are two rectangular obstacles and one triangular obstacle.
The horizontal dimensions range from -15 to +15, and the
vertical from -8 to +8.
Log File Size
This shows the distribution of the size of the 100 log files.
Elapsed Time of 100 Experiments in Seconds
There appear to be 3 different groups in this distribution.
Most of the experiments are completed between 1 and 16 minutes
with a "center" of about 8 minutes.
A smaller group of experiments is centered around 18 minutes.
The final group includes those that do not find the target and
use all of the 30 minutes allowed, ending at that point.
Distribution of the Changes in the Horizontal and Vertical Directions
We compute the change in the horizontal and vertical directions separately for
each pair of consecutive records in each log to explore how far the robot
typically moves between records.
Distribution of the Velocity of the Robots
This shows the bimodal distribution of the velocity of all of the robots
across all log files. We compute the distance between consecutive points in each log
and divide this by the time between these two records.
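A sketch of this velocity calculation for a single log, assuming a data frame with columns time (in seconds), x, and y:

    velocity <- function(log) {
      dist <- sqrt(diff(log$x)^2 + diff(log$y)^2)   # distance between consecutive records
      dist / diff(log$time)                         # divided by the elapsed time
    }

    # pooled across all logs (logs being an assumed list of such data frames):
    # plot(density(unlist(lapply(logs, velocity))))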
Robot Path for the First Experiment
This displays the path from the first log file.
The panel on the left shows the robot's movements from left to right across the course
and then vertically along the side.
The second panel shows this path relative to the entire course. This illustrates
that the robot moved along the bottom side and only slightly vertically before the run terminated.
This experiment lasted almost 19 minutes.
Display of All Experiments
This displays the path of the robot in each of the 100 experiments.
The starting point is displayed in green with a circle.
The direction of the robot corresponds to the shift in colors from green to red.
The final location is marked with a blue x.
Sample Final Look
This is the path/shape seen by the robot in the
final look of the first log file, JRSPdata_2010_03_10_12_12_31.
The robot is in the center of the circle.
At the top right of the circle, we see a circular-like object that might be the target.
A straight edge corresponding to a rectangular obstacle appears at the bottom of the circle.
Enhanced Display of a Look
This figure shows the improved display of a robot's "look."
We remove the misleading lines connecting the edges
of the circular target and the 2 meter arc.
This uses points() with type = "l",
but applied separately to each sub-group of points that forms a contiguous sub-arc,
i.e., runs of points at the 2 meter maximum range and runs of points that are less than 2 meters.
It does not connect adjacent but disconnected sub-arcs.
All Final Looks
This shows the final look of each of the 100 log files.
Which show a circle?
Density of Repeated Range Values
This shows the distribution of the repeated measurements
of the range values when a robot revisited the same location.
These are deviations from the mean of nominally identical values.
There are some very extreme values (-1.58 and 1.68).
The distribution has very heavy tails. Most of the observations are
exactly 0.
Characteristics of Looks
This shows 4 different types of looks.
In the first look, the robot sees nothing and so there is no circular target present.
In the next 3 looks (moving row-wise), the robot sees a straight side or two straight sides that intersect.
Again, there is no circular target. All of the remaining looks appear to contain
a circular target. In the fifth look, the robot only sees the circle,
while in the next 4 looks, it sees a circle and part of one or two obstacles.
The last 3 looks are more complex. We see the circular target but, drawn in this manner,
the circle appears to be connected to obstacles.
Looks Containing a Segment Identified as a Circle
These are the looks that were classified as containing the circular target.
We see that most of the looks do indeed contain a shape that looks like the target.
However, in several of them a right angle corresponding to an obstacle in the course
has been confused with the target; these appear to be false positives.
Looks Containing No Identified Circle
These are the looks that were classified as not containing the circular target.
We see that most of the looks do indeed contain no indication of the target.
However, there are several that do, suggesting false negatives.
Patterns in the False Negatives
This shows the 9 looks in which the circular target appears to
be present but which were not detected by our robot.evaluation()
function. They are arranged to show two characteristics.
The first of these is a circular target "connected" to another obstacle.
The second pattern is a circular target that is very close to the
robot (i.e., the center of the look) and so does not appear circular.
Misclassified Looks with a Target
These 9 looks are those that were misclassified as not containing the target.
In all but one of these, the target is very close to the center of the robot.
The seventh look is more problematic.
5 Strategies for Analyzing a 12-Gigabyte Data Set: Airline Flight Delays
Hourly Delay Quantiles
The airline delay quantiles (in minutes) for each hour of the day.
6 Pairs Trading
Historical Prices for the Two Indices, 1990–1995
This shows the two "stock" price series for the 5-year period.
There is clearly a high correlation over time between the two financial indices.
The values of the two series are quite different and are plotted on different scales, with the axis for the second series on the right side of the plot.
Historical Ratio for the Two Indices
This shows the ratio of the two "stock" prices from 1990 to 1995.
The ratio appears to move around the mean until 1994 and then to rise above it.
The horizontal lines show one and two standard deviations from the mean of the ratio.
Ratio of the Two Indices for 1995–2010
The circles show the starting and closing positions for trades.
We open a position when the ratio is outside of the threshold lines.
We close that position when the ratio returns to the mean.
The green circles identify the opening of a trading position;
the red circles the corresponding close of the position.
Circle 7 opens a position but the ratio never returns to the mean.
We close the position on the final day of the series.
Note that the threshold lines use the mean and standard deviation
from the period 1990 to 1995.
A Simple Plot of the Ratio of the Two Stock Prices
The mean of the ratio (\(\mu_{ratio}\)) is shown as a dashed line and threshold lines
indicating \(\mu_{ratio} \pm k \sigma_{ratio}\),
with \(k = 0.85\).
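A sketch of such a plot, assuming ratio is a numeric (or zoo) series of the price ratio:

    k <- 0.85
    m <- mean(ratio)
    s <- sd(ratio)
    plot(ratio, type = "l", ylab = "Price Ratio")
    abline(h = m, lty = 2)                          # mean of the ratio
    abline(h = m + c(-1, 1) * k * s, lty = 3)       # thresholds at k SDs from the mean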
Visualizing the First Three Positions
This shows the first 3 positions for the stock price ratio.
It allows us to verify that our findNextPosition() function is working correctly.
Positions for \(k = 0.5\)
With a smaller value of k, we get many more opening and closing positions.
Profit for Different Values of k
For this training period, all values of k yielded a positive profit.
For small values of k, there were many opportunities to open positions and
the profits were larger than for larger values of k.
The maximum profit of 0.88 occurs at values of k
between 0.61 and 0.62.
Two Simulated Time Series
The two time series are generated from our time series model
with \(\rho = 0.99\) and \(\psi = 0.9\) and \(\beta_1 = c(.05, .1)\).
Density of Profit
This shows the distribution for 999 values of profit simulated
from our time series model
with \(\rho = 0.99\) and \(\psi = 0.9\).
We see a large concentration very close to 0. We also see
some values that occur slightly more often, corresponding to the small modes or "bumps."
The extreme values are both very extreme and highly variable across simulations, as we expect.
7 Simulation Study of a Branching Process
Diagram of an Example Branching Process
This diagram shows a possible realization of the branching process
studied by Aldous and Krebs. Each node in the tree represents a
program. Jobs marked with an “X” have
completed running, those marked “R”
are currently running, and “W” nodes
are waiting for their parent to complete before starting to run.
Empirical CDFs for the Sum of Three Exponentials
In a study of the sum of 3 independent exponential random
variables with the same rate, 6,000 sample outcomes were generated and
used to estimate the CDF of the sum. The plot shows
the empirical CDF for 16 values of the rate parameter,
i.e., 0.1, 0.2, ..., 0.9, 1, 2, ..., 7.
To help match each rate to its curve, a few of these rates are included in the figure
next to their corresponding curves.
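A sketch of the simulation behind these curves; the function and object names are illustrative:

    sumOfThree <- function(rate, n = 6000)
      rexp(n, rate) + rexp(n, rate) + rexp(n, rate)

    rates <- c(seq(0.1, 1, by = 0.1), 2:7)          # the 16 rate values
    plot(ecdf(sumOfThree(rates[1])), xlim = c(0, 60), main = "",
         xlab = "Sum of Three Exponentials", ylab = "Empirical CDF")
    invisible(lapply(rates[-1], function(r) lines(ecdf(sumOfThree(r)))))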
Empirical Distribution of the Number of Offspring for a Job
This figure shows the observed proportion of the number of offspring
randomly generated for a parent with a birth time of 1 and a
completion time of 6. The inter-arrival times of the parent's children
are independent and follow an exponential distribution whose rate,
over the parent's lifetime of 5 time units, gives an expected 2.5 offspring. The process was simulated
1,000 times, and the observed proportion of 0, 1, 2, ..., 9 offspring
plotted (black line segments). In addition, the true probabilities
from the Poisson(2.5) distribution are plotted next to the observed
proportions (grey line segments).
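A sketch of this simulation, assuming the children's inter-arrival times have rate 0.5 so that the expected number of offspring over the lifetime from time 1 to time 6 is 2.5 (the function name is illustrative):

    genKidCount <- function(birth = 1, complete = 6, rate = 0.5) {
      arrivals <- birth + cumsum(rexp(100, rate))   # 100 draws is ample here
      sum(arrivals < complete)
    }

    kids    <- replicate(1000, genKidCount())
    obsProp <- table(factor(kids, levels = 0:9)) / length(kids)
    plot(0:9, obsProp, type = "h",
         xlab = "Number of Offspring", ylab = "Proportion")
    points(0:9 + 0.15, dpois(0:9, 2.5), type = "h", col = "grey")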
Visualization of a Randomly Generated Branching Process
This plot shows the lifetimes of each member of a randomly generated
birth and assassination process with given birth and completion
rates. Each job's lifetime is
represented by a grey line segment with endpoints at its birth and
completion times. The Xs on the segment denote the birth times of the
job's offspring. The dashed lines separate the generations. None of
the jobs in the fourth generation of this instance of the process had
offspring so the process terminated there. Notice that one job in the
second generation ran for a very long time and had just one child.
Visualization of a Randomly Generated Branching Process Over a Fixed Time Interval
This plot shows the lifetimes of each member of the first 5
generations of a simulated birth and assassination process that has
been observed up to time 8. Each lifetime is represented by a grey
line segment with endpoints at its birth and completion times. A job
that has not completed by time 8 is censored and, consequently, we see
only its offspring born before time 8.
Scatterplots of the Number of Generations Against the Number of Offspring
These 4 scatterplots show the different behavior exhibited by the
branching process as the birth and completion rates
vary. Each simulation terminates
when the process dies out or one of the following limits is reached:
200 generations; 100,000 offspring. One hundred simulations are run
for each (birth rate, completion rate) pair.
Three-Dimensional Scatterplot of the Number of Offspring by Birth and Completion Rates
Each point in this plot represents the upper quartile of the
number of offspring in 400 random outcomes of the branching process
for a particular (birth rate, completion rate) pair. The offspring are plotted
on log base 10 scale, so the first category, i.e., [0, 0.5)
corresponds to 1 to 3 offspring.
Image Map of the Proportion of Replicates That Reach the Simulation Limits
This image map represents the proportion of the
400 simulations for each (birth rate, completion rate)
pair that reached 20 generations or 1000 offspring and so were
terminated.
Proportion of Simulations with at Least 20 Offspring
This plot uses colors from the rainbow palette to represent the
proportion of 400 random outcomes for each pair that have at least 20 offspring.
8 A Self-Organizing Dynamic System with a Phase Transition
Movement on a Sample Grid
(a) shows the initial state of a 3-by-5 grid containing 3 red and 3 blue cars.
At time \(t = 1\), the red cars move horizontally.
The red car on the bottom row "wraps around" to the first column on the same row.
At time \(t = 2\), the blue cars move up within the same column.
The blue car in cell \((1, 3)\) is blocked by the red car above it
that moved to that cell at time \(t = 1\).
Accordingly, we obtain a sequence of grids indexed by time.
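A sketch of the red-car update on a matrix grid (0 = empty, 1 = red, 2 = blue), moving each red car one column to the right with wraparound unless its destination cell is currently occupied; this is an illustrative implementation, not the one used in the chapter.

    moveRed <- function(grid) {
      nr <- nrow(grid); nc <- ncol(grid)
      # destination of every cell, in column-major order: same row, next column (wrapping)
      dest <- cbind(rep(1:nr, nc), rep(c(2:nc, 1), each = nr))
      canMove <- grid == 1 & grid[dest] == 0
      newGrid <- grid
      newGrid[canMove] <- 0
      newGrid[dest[canMove, , drop = FALSE]] <- 1
      newGrid
    }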
Sample Free-Flowing Traffic Grids
The 3 panels show the grid at times t = 0 (the initial configuration), t = 500, and t = 1000.
Cars occupy 25% of the cells. After 500 iterations, the cars start to organize
along diagonal lines. After 1000 iterations, the lines are becoming clearer.
Sample Deadlocked Traffic Grids
Three grids at different time steps showing the emergence of deadlocked traffic.
Simple Plot from image()
Using a simple call to image() to display the grid uses
the wrong colors and also does not display the cells in the order we expect or want.
Sample Grid Displays
The first panel shows a 100-by-100 grid with 50% of the cells occupied equally
with red and blue cars.
The second panel has the same dimensions but there are 5000 red cars and only 500 blue cars.
The 3rd panel shows a small 3-by-4 grid where we placed the cars manually.
A Grid before and after Moving the Red Cars
The left panel shows the initial 4-by-7 grid.
The second panel shows the state of the grid after the red cars have moved.
Run-Time for Code as a Function of the Number of Grid Cells
This shows that as the number of grid cells, and the total number of cars, doubles,
the time taken approximately doubles. We have also added a
fitted line for the elapsed times as a function of the grid size n.
This plot was computed for square grids, but the relationship applies generally to the number of
cells in a grid.
Comparison of Times for Different Computational Approaches
This shows the relative speed of the different approaches, ranging
from the naïve loop version, to the fastest vectorized approach, to
using the C code. There is a very significant speedup relative to the
loop version of the code. However, this plot hides the significant
speedup between the different vectorized approaches and especially the
C code.
Panel (b) shows these relative speedups when we omit the loop approaches
so that we can compare the remaining approaches on a more appropriate scale.
Sample 1024-by-1024 Grids with Different Densities
Each grid has the same dimension and is run for 128,000 time steps.
In the top-left, we have a free-flowing grid with a density of 0.25 in which the cars arrange
themselves into diagonal lines. At a slightly higher density of 0.33 (top-right),
we see one group of cars that are partially deadlocked but which continue to move.
We also see bands of red and blue cars moving, but not parallel to each other.
Some cars will move freely, at least until they meet the partially deadlocked group.
The cars in that group move more slowly until they escape. This gives rise to different velocities.
At a slightly higher density, we see additional groups but the same pattern of moving cars.
For a different configuration at the same density (0.38) we see a very different structure
with mostly deadlocked cars in two parallel groups running diagonally. Note the alternating colors of the diagonal bands.
For higher densities, we see more localized groupings but the same deadlock.
Average Velocity of Cars by Occupancy Density
These plots mirror those in [bib:Raissa2005] (page 3),
with apparently more replications.
They show the average proportion of cars moving in each time step on the vertical axis.
The horizontal axis shows a range of car densities for the different grids.
The 4 panels correspond to the different sizes of the square grid: 64, 128, 256, and 512.
Each point in a plot corresponds to running the initial grid through 64,000 total
iterations of both red and blue cars, i.e. 128,000 time steps.
We see variation for a given density corresponding to different random grid configurations.
We see that the density at which deadlock occurs decreases as the grid size increases.
Importantly, we see that there are different points of equilibrium, not
simply free-flowing or deadlocked. These are the intermediate states that were
unexpected until reported in [bib:Raissa2005].
9 Simulating Blackjack
Density Plot Comparing the Winnings from Two Strategies
This plot shows the density of the gain from 1,000 $1 bets for two
different strategies: the optimal strategy for an infinite deck, and a
simple strategy that hits when the dealer shows a 6 or higher and
the player's cards are under 17, and otherwise stands. A gain of $2 or
-$2 results from winning or losing a double down, respectively. The
simple strategy never doubles the bet.
A Comparison of the Optimal and Simple Strategies
The distribution of payoffs under the optimal strategy peaks at a
higher value. Note that the variability is considerable: there is
about a 40% chance of losing 5% or more of your money with the optimal
strategy and about a 55% chance with the simple strategy.
Plot of the bet() Function
This plot shows the bet() function over the domain from
-5 to 10. We can use this visualization to confirm that the function
works as expected.
Comparison of Average Gain for Card Counting
A comparison of the fixed and varying bet amounts.
For each set of 50 hands, we take the difference in the gains between
a fixed bet and a bet adjusted according to the Hi-Low counting
strategy. Values above 0 occur when the gain for the counting strategy
exceeded that for the fixed bet.
10 Baseball: Exploring Data in a Relational Database
Boxplots of Team Payrolls by League
These boxplots show the team payrolls for the American League (dark
gray) and the National League (light gray) from 1985 to 2012. Payroll
has been logged and is reported in millions of dollars. The American
League appears to have greater spread in payroll than the National
League. Also evident is the sharp increase in payrolls in the late
1980s and early 1990s.
Scatter Plot of Team Payroll by Year
In this scatter plot, the plotting symbol is a transparent gray and
the year has been jittered slightly to avoid overplotting. The
identifiers of the teams that won the World Series are added to the
corresponding points, which are colored red. In almost every case,
these dots are at or above the upper quartile of team payroll.
Payroll is reported in millions of dollars and plotted on a log scale.
11 CIA Factbook Mashup
Display of Infant Mortality by Country on Google Earth
This screenshot of the Google Earth virtual earth browser displays
circles scaled to the population size and colored according to the
infant mortality rate for a country. The data are available from the
CIA Factbook. The locations of the circles are determined from
MaxMind's latitude and longitude of the country's geographic
center. When the viewer clicks on a circle, a window pops up with more
detailed information for that country.
Distribution of Infant Mortality for Countries in the CIA Factbook
This histogram of infant mortality rates shows a highly skewed
distribution. Most countries have rates under 20 per 1000 live
births and a few countries have rates between 80 and 125.
Population Distribution for Countries in the CIA Factbook
This histogram of the square root of population size for countries
shows a highly skewed distribution with a mode around 1000.
Incorrect Map of Infant Mortality and Population
In this map each disk corresponds to a country's infant mortality and
population. The size of the disk is proportional to the population and
the color reflects the infant mortality rate. Notice the size of
China's disk is too small – it has the highest population but one of
the smallest disks. Other anomalies are apparent with closer
inspection.
Screenshot of the ISO Country Code Mapping
Map of Infant Mortality and Population
This map correctly matches the country demographic information with
latitude and longitude. Notice now the symbols for India and China are
approximately the same size and the largest symbols on the map. Also note
that the symbol for the United Kingdom is now pale yellow, the color
we would expect it to be, because it is not being confused with Gabon.
Default Google Earth Image
Screenshot of the CIA Factbook Rendered in Chrome
The Chrome browser renders an XML file using indentation and color
to highlight the structure of the document. The top or root node of
the file is <factbook> and it has an attribute called
lastupdate, which indicates how recently the
information was updated. The <news> nodes are
indented one space, corresponding to their depth in the hierarchy,
i.e., they are children of <factbook>.
Screenshot of the MaxMind Web Page with Country Latitude and Longitude
MaxMind makes available the average latitude and longitude for
countries in several formats, including in a display on the Web as
shown here in this screenshot.
Screenshot of the HTML Source of the MaxMind Web Page
12 Exploring Data Science Jobs with Web Scraping and Text Mining
Kaggle Jobs Board Web Page
This is the front page of the jobs forum. All of the posts are
available from this page and its successive pages, reached via the links 2, 3, ....
We can click on the link for a specific job to read that post.
Job Post and Comments
This shows a very informal job posting and follow-up posts on the same page.
There are very few details about the position being advertised.
Screenshot of Job Post by Pandora, the free, personalized radio Web site
There are several paragraphs describing the company and the position.
There is a bullet-list of required skills.
List of Search Results for Data Science Jobs
We can enter the search term in various fields to obtain a list of matching job posts.
These are shown below the text fields.
Each entry corresponds to a job posting. There is some metadata in these items,
but none of much interest to us. There are also numerous advertisements
on the page, distracting us from the data of interest.
Job Post on monster.com
This is a cross-listed job posting on monster.com that actually comes
from the cybercoders.com Web site.
As with the other posts, much of the post is free-form text.
There are several lists that provide metadata about the job.
There is also structured content for similar jobs,
listing the related skills.
Semi-Structured Job Post on Kaggle
This shows another job posting.
Again, much of the content is free-form text
within different sections that have a title.
There are 2 lists with metadata about
education and experience qualifications necessary for the job.
Search Results for Data Scientist Jobs on CyberCoders.com
We enter the search term in the text field at the top of the Web page.
The results are displayed below. Note that in addition to the link
to each particular job, the results page also shows metadata about each
job, e.g., salary range, location, and a list of required skills.
If this is all the information we wanted, we would not have to scrape
the actual postings.
Sample Job Post on CyberCoders.com
This job posting, like others on cybercoders.com,
has several paragraphs of free-form text,
and also some itemized lists including one listing the skills necessary for the position.
Additionally, the post has the location and salary separately in the top-right
corner. It also displays the preferred skills as a list of phrases.
Frequency of Selected Terms over all Kaggle Job Posts
This shows the number of occurrences of each of the selected terms
across all 842 job posts on Kaggle by January 2015.
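A sketch of such a tally, assuming posts is a character vector with the text of each job post and terms is a character vector of the selected terms (both assumed names):

    counts <- sapply(terms, function(term)
      sum(grepl(term, tolower(posts), fixed = TRUE)))   # posts mentioning each term
    dotchart(sort(counts), xlab = "Number of Posts")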
Dotplot for Frequency of Skill Phrases across CyberCoder.com Data Science Job Posts
This shows the number of occurrences of different terms we selected
across Data Science job postings on this Web site.
Word Cloud for Frequency of Skill Phrases Across CyberCoder.com Job Posts
This is a different display of the counts of the different skill phrases
across posts on
cybercoders.com.