In this floor plan, the 6 fixed access points are denoted by black square markers, the offline/training data were collected at the locations marked by grey dots, and the online measurements were recorded at randomly selected points indicated with black dots. The grey dots are spaced one meter apart.
This empirical distribution function of orientation shows that there are 8 basic orientations that are 45 degrees apart. We see from the steps in the function that these orientations are not exactly 45, 90, 135, etc. Also, the 0 orientation is split into two groups, one near 0 and the other near 360.
The coffer.com Web site offers lookup services to find the MAC address for a vendor and vice versa.
Plotted at each location in the building is the total number of signals detected from all access points for the offline data. Ideally for each location, 110 signals were measured at 8 angles for each of 6 access points, for a total of 5280 recordings. These data include a seventh MAC address, and not all signals were detected, so there are about 5500 recordings at each location.
The boxplots in this figure represent the signals for one location, which is in the upper left corner of the floor plan. These boxes are organized by access point and by the angle of the hand-held device. The dependence of signal strength on angle is evident at several of the access points, e.g., 00:14:bf:97:90 in the top right panel of the figure.
The density curves shown here are for the signal strengths measured at one position. These 48 density plots represent each of the access point and angle combinations. There are roughly 110 observations in each panel. Some look roughly normal while many others look skewed left.
The average and SD for the signals detected at each location-angle-access point combination are plotted against each other. The weak signals have low variability and the stronger signals have greater variability.
The red line segments shown in the floor plan connect the test locations (black dots) to their predicted locations (asterisks). The top and bottom plots show the predictions for two different numbers of nearest neighbors. In this model, we use as training data the average signal strengths from each of the 166 offline locations (grey dots) to the 6 access points (black squares), for the 3 angles closest to the angle at which the test data were measured.
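The prediction scheme described in this caption, averaging the training signal strengths at each location and then choosing the training location whose signal vector is closest to the observed one, can be sketched as follows. This is a minimal illustration, not the book's code; the locations and signal values are made up.

```python
import math

# Hypothetical training data: location (x, y) -> average signal strengths
# (in dBm) to 3 access points, already averaged over angles.
train = {
    (0.0, 0.0): [-50.0, -70.0, -60.0],
    (1.0, 0.0): [-55.0, -65.0, -62.0],
    (0.0, 1.0): [-48.0, -72.0, -58.0],
}

def nearest_location(signal, train):
    """Return the training location whose average signal vector is
    closest (Euclidean distance in signal space) to the observation."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(train, key=lambda loc: dist(train[loc], signal))

print(nearest_location([-48.5, -71.5, -58.5], train))  # (0.0, 1.0)
```

Averaging over the 3 closest angles, as the caption describes, would simply change how the training vectors in `train` are computed.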
This page contains links to each year's race results. The year 1999 is the earliest for which they provide data. Men's and women's results are listed separately.
This screenshot shows the results, in race order, for men competing in
the 2012 Cherry Blossom 10 Mile Run. Notice that both 5-mile times
and net times are provided. We know that the Time
column is net time because it is so indicated in the header of the
table.
This screenshot shows the results, in race order, for men competing in the 2011 Cherry Blossom road race. Notice that in 2011, 3 times are recorded: the time to complete the first 5 miles, and the gun and net times for the full run. In contrast, the results from 2012 do not provide gun time.
These side-by-side boxplots of age for each race year show a few problems with the data for 2003 and 2006. The runners in these years are unusually young.
These side-by-side boxplots of age for each race year show a reasonable age distribution. For example, the lower quartile for all years range between 29 and 32. The problems identified earlier for 2003 and 2006 have been addressed.
This plot revises the simple scatter plot of Figure 2.6, “Default Scatter Plot for Run Time vs. Age for Male Runners” by changing the plotting symbol from a circle to a disk, reducing the size of the plotting symbol, using a transparent color for the disk, and adding a small amount of random noise to age. Now we see the shape of the high density region containing most of the runners and the slight upward trend of time with increasing age.
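The jittering step described here, adding a small amount of uniform noise to age so that tied integer ages do not stack on top of one another, can be sketched as below. The helper name and noise amount are illustrative choices, not from the book.

```python
import random

def jitter(values, amount=0.5, seed=42):
    """Add uniform noise in [-amount, amount] to each value so that
    repeated integer ages spread out instead of overplotting."""
    rng = random.Random(seed)
    return [v + rng.uniform(-amount, amount) for v in values]

ages = [30, 30, 30, 45]
print(jitter(ages))  # three formerly identical ages are now distinct
```

The transparency half of the fix is a plotting option (e.g., an alpha level on the disk color), applied when the jittered ages are drawn.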
This plot offers an alternative to the scatter plot of Figure 2.7, “Revised Scatter Plot of Male Runners”, which uses jittering and transparent color to ameliorate the effects of overplotting. Here there is no need to jitter age because the smoothing action essentially does that for us by spreading an individual runner's (age, run time) pair over a small region. The high density region has a shape very similar to that in the earlier plot.
This sequence of boxplots shows the quartiles of time for men grouped into 10-year age intervals. As age increases, all the quartiles increase. However, the box becomes asymmetrical with age, which indicates that the upper quartile increases faster than the median and lower quartile.
Shown here is a smoothed scatter plot of the residuals from the fit of the simple linear model of run time to age for male runners who are 15 to 80 years old. Overlaid on the scatter plot are two curves. The “curve” in purple is a solid horizontal line at \(y = 0\). The green dashed curve is a local smooth of the residuals.
This plot shows that the number of male runners in the Cherry Blossom 10-mile race has more than doubled from 1999 to 2012.
This loess fit of run time to age for 2012 male runners sits above the fit for 1999 male runners. The gap between these curves is about 5 minutes for most ages. The exception is in the late 40s to early 60s, where the curves are within 2–3 minutes of each other. Both curves have a similar shape.
This line plot shows the difference between the predicted run time for 2012 and 1999 male runners.
This Web page at https://storage.athlinks.com contains the race results of one runner who participated in the Cherry Blossom run for 12 of the 14 years for which we have data. Notice that his fastest time was from his most recent run in 2012 where he completed the race in under 85 minutes. He was 45 at that time. Also, his slowest time was 123 minutes in 2002 at the age of 35.
These line plots show the times for male runners who completed at least 8 Cherry Blossom races. Each set of connected segments corresponds to the run times for one athlete. Looking at all line plots, we see a similar shape to the scatter plot in Figure 2.7, “Revised Scatter Plot of Male Runners”, i.e., an upward curve with age. However, we can also see how an individual's performance changes. For example, many middle-aged runners show a sharp increase in run time with age but that is not the case for all. Some of them improve and others change more slowly.
Here we have augmented the bottom-right line plot from Figure 2.17, “Run Times for Multiple Races” with the linear fit of run time for each of the athletes. These are the 30 or so black dashed line segments plotted on each individual runner's time series.
This scatter plot displays the slope of the line fitted to each of the 300+ runners who competed in at least 8 Cherry Blossom road races. A negative coefficient indicates the runner is getting faster as he ages. The plot includes a least squares fitted line and a loess fitted curve. Notice that nearly all of the coefficients for those over 50 are positive. The typical size of this coefficient for a 50-year-old is about one minute per year.
This screen shot is of the HTML source for the male results for the
2012 Cherry Blossom road race. While the format is not quite the same
as the female results for 2011 (see Figure 2.21, “Screen Shot of the Source for Women's 2011 Cherry Blossom Results”), both are plain text tables within
<pre>
nodes in an HTML document.
This screen shot is of the HTML source for the female results for the 2011 Cherry Blossom road race. Notice that the times given are for the midpoint of the race (5 Mile) and for two finish times (Time and Net Tim). Also notice the leftmost column labeled S. While the format is different from the male results for 2012, both are plain text tables within <pre> nodes in an HTML document.
The log likelihood ratio for 3116 test messages was computed using a naive Bayes approximation based on word frequencies found in manually classified training data. The test messages are grouped according to whether they are spam or ham. Notice most ham messages have values well below 0 and nearly all spam values are above 0.
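The score behind this figure is a sum of per-word log likelihood ratios, with messages scoring above 0 classified as spam. A minimal sketch of that idea, with made-up word frequencies standing in for the ones estimated from the training data:

```python
import math

# Hypothetical estimates from classified training messages:
# P(word present | spam) and P(word present | ham).
p_spam = {"free": 0.40, "meeting": 0.02}
p_ham  = {"free": 0.05, "meeting": 0.30}

def log_lr(words):
    """Sum the log likelihood ratios for the message's known words.
    Positive totals point toward spam, negative toward ham."""
    return sum(math.log(p_spam[w] / p_ham[w]) for w in words if w in p_spam)

print(log_lr(["free"]))     # positive: evidence for spam
print(log_lr(["meeting"]))  # negative: evidence for ham
```

The threshold at 0 corresponds to the vertical reference in the figure: equal prior odds for spam and ham.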
This tree is a simple example of a recursive partition fitted
model. It was fitted using the rpart() function and
restricting the tree depth to 3 levels. The first yes–no question is
whether the percentage of capitals in the message is less than 13. If
not, the second question is whether there are fewer than 289
characters in the message. If the answer to this question is also no,
then the next question is whether the message header contains an
InReplyTo
key. If the answer is again no, then the
message is classified as spam. Of the 6232 messages in the training
set, 77 ham and 643 spam fall into this leaf. The spam have been
correctly classified and the 77 ham have been misclassified.
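The sequence of yes–no questions that leads to this spam leaf amounts to three nested conditionals. A sketch of just that path (the function and return labels are illustrative, not the fitted rpart() object):

```python
def classify(pct_capitals, n_chars, has_in_reply_to):
    """Follow the three splits in the caption down to the spam leaf:
    at least 13% capitals, at least 289 characters, no InReplyTo key."""
    if pct_capitals < 13:
        return "other subtree (capitals split: yes)"
    if n_chars < 289:
        return "other subtree (length split: yes)"
    if has_in_reply_to:
        return "other subtree (InReplyTo split: yes)"
    return "spam"

# A long, capital-heavy message with no InReplyTo header lands in the leaf.
print(classify(pct_capitals=20, n_chars=500, has_in_reply_to=False))  # spam
```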
This scatter plot shows the relationship between the number of lines and the number of characters in the body of a message. The plot is on log scale, and 1 is added to all of the values before taking logs to address issues with empty bodies. The line is added for comparison purposes.
This scatter plot examines the relationship between the percentage of capital letters among all letters in a message and the total number of characters in the message. Spam is marked by purple dots and ham by green. The darker color indicates overplotting. We see here that the spam tends to be longer and have more capital letters than ham.
These two mosaic plots use area to denote the proportion of messages that fall in each category. The plot on the top shows that messages with Re: in the subject line tend not to be spam. The bottom plot shows that messages sent from a user with a number at the end of their email address tend to be spam. However, few messages are sent from such users, so it is not clear how helpful this distinction will be in our classification problem.
This tree was fitted using rpart() on 6232 messages. The default values for all of the arguments to rpart() were used. Notice the leftmost leaf classifies as ham those messages with fewer than 13% capitals, fewer than 4% HTML tags, and at least 1 forward. Eighteen spam messages fall into this leaf and so are misclassified, but 2240 of the ham are properly classified using these 3 yes–no questions.
This plot displays the Type I and Type II errors for predicting spam as a function of the size of the complexity parameter in the rpart() function. The complexity parameter is a mechanism for specifying the threshold for choosing a split for a subgroup. Splits that do not achieve a gain in fit of at least the size of the parameter value provided are not made. The Type I error is minimized at a complexity parameter value of 0.001, for an error rate of 3.9%. The Type II error rate for this complexity parameter value is 10.5%.
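The Type I and Type II error rates plotted here are, respectively, the proportion of ham flagged as spam and the proportion of spam passed as ham. A sketch of how one point on the curves would be computed from true and predicted labels (the labels below are made up):

```python
def error_rates(truth, pred):
    """Type I: fraction of ham misclassified as spam.
    Type II: fraction of spam misclassified as ham."""
    ham_preds = [p for t, p in zip(truth, pred) if t == "ham"]
    spam_preds = [p for t, p in zip(truth, pred) if t == "spam"]
    type1 = sum(p == "spam" for p in ham_preds) / len(ham_preds)
    type2 = sum(p == "ham" for p in spam_preds) / len(spam_preds)
    return type1, type2

truth = ["ham"] * 8 + ["spam"] * 4
pred = ["ham"] * 7 + ["spam"] * 4 + ["ham"]  # one error of each type
print(error_rates(truth, pred))  # (0.125, 0.25)
```

Repeating this for models fitted at each complexity parameter value traces out the two curves in the figure.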
JRSPdata_2010_03_10_12_12_31. The robot is in the center of the circle. At the top right of the circle, we see a circular-like object that might be the target. A straight edge corresponding to a rectangular obstacle appears at the bottom of the circle.
This plot is created by calling points(, type = "l") separately for each sub-group of points that forms a contiguous sub-arc: points at 2 meters, and points that are less than 2 meters. It does not connect adjacent but disconnected sub-arcs.
The mean of the ratio (\(\mu_{ratio}\)) is shown as a dashed line and thresholds lines indicating \(\mu_{ratio} \pm k \sigma_{ratio}\), with \(k = 0.85\).
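Flagging values that fall outside the \(\mu_{ratio} \pm k \sigma_{ratio}\) band shown here can be sketched as below; the function name and example ratios are illustrative.

```python
import statistics

def outside_thresholds(ratios, k=0.85):
    """Return indices of ratios outside mean +/- k * (sample) sd,
    the threshold band drawn in the figure."""
    mu = statistics.mean(ratios)
    sigma = statistics.stdev(ratios)
    lo, hi = mu - k * sigma, mu + k * sigma
    return [i for i, r in enumerate(ratios) if r < lo or r > hi]

ratios = [1.0, 1.1, 0.9, 1.05, 3.0]
print(outside_thresholds(ratios))  # [4]: only the ratio of 3.0 is flagged
```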
In a study of the sum of 3 independent exponential random variables with the same rate, 6,000 sample outcomes were generated and used to estimate the cumulative distribution function. The plot shows the empirical CDF for 16 values of the rate parameter, i.e., 0.1, 0.2, ..., 0.9, 1, 2, ..., 7. To help match each rate to its curve, a few of these rates are included in the figure next to their corresponding curves.
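The simulation behind one of these curves, drawing 6,000 sums of 3 independent exponentials and evaluating the empirical CDF, can be sketched as:

```python
import random

def sample_sums(rate, n=6000, seed=1):
    """Draw n outcomes of the sum of 3 independent Exponential(rate)
    random variables."""
    rng = random.Random(seed)
    return [sum(rng.expovariate(rate) for _ in range(3)) for _ in range(n)]

def ecdf(samples, x):
    """Empirical CDF: the proportion of samples at or below x."""
    return sum(s <= x for s in samples) / len(samples)

sums = sample_sums(rate=1.0)
# The sum of 3 Exponential(1) variables is Gamma(3, 1), with mean 3.
print(sum(sums) / len(sums))
```

Evaluating `ecdf(sums, x)` over a grid of x values, once per rate, produces the 16 curves in the figure.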
This figure shows the observed proportion of the number of offspring randomly generated for a parent with a birth time of 1 and a completion time of 6. The inter-arrival times of the parent's children are independent and follow an exponential distribution with rate 0.5, so the number of offspring over the parent's lifetime of 5 is Poisson with mean 2.5. The process was simulated 1,000 times, and the observed proportions of 0, 1, 2, ..., 9 offspring plotted (black line segments). In addition, the true probabilities from the Poisson(2.5) distribution are plotted next to the observed proportions (grey line segments).
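The offspring counts come from accumulating exponential inter-arrival times until the parent's completion time is passed. A sketch of one replicate, assuming the rate of 0.5 implied by the Poisson(2.5) comparison over a lifetime of 5:

```python
import random

def n_offspring(rate, birth, completion, rng):
    """Count children born before the parent completes: step forward
    by Exponential(rate) inter-arrival times starting at the birth time."""
    t, count = birth, 0
    while True:
        t += rng.expovariate(rate)
        if t >= completion:
            return count
        count += 1

rng = random.Random(7)
counts = [n_offspring(0.5, birth=1, completion=6, rng=rng) for _ in range(1000)]
# The counts should be approximately Poisson with mean 0.5 * (6 - 1) = 2.5.
print(sum(counts) / len(counts))
```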
This plot shows the lifetimes of each member of a randomly generated birth and assassination process with fixed birth and completion rates. Each job's lifetime is represented by a grey line segment with endpoints at its birth and completion times. The Xs on the segment denote the birth times of the job's offspring. The dashed lines separate the generations. None of the jobs in the fourth generation of this instance of the process had offspring, so the process terminated there. Notice that one job in the second generation ran for a very long time and had just one child.
This plot shows the lifetimes of each member of the first 5 generations of a simulated birth and assassination process that has been observed up to time 8. Each lifetime is represented by a grey line segment with endpoints at its birth and completion times. A job that has not completed by time 8 is censored and consequently, we see only its offspring born before 8.
These plots mirror those in [bib:Raissa2005] (page 3), with apparently more replications. They show the average proportion of cars moving in each time step on the vertical axis. The horizontal axis shows a range of car densities for the different grids. The 4 panels correspond to the different sizes of the square grid: 64, 128, 256, and 512. Each point in a plot corresponds to running the initial grid through 64,000 total iterations of both red and blue cars, i.e., 128,000 time steps. We see variation for a given density corresponding to different random grid configurations. We see that the density at which deadlock occurs decreases as the grid size increases. Importantly, we see that there are different points of equilibrium, not simply free-flowing or deadlocked. These are the intermediate states that were unexpected until reported in [bib:Raissa2005].
This plot shows the distribution of the gain from 1,000 $1 bets for two different strategies: the optimal strategy for an infinite deck, and a simple strategy that hits when the dealer shows a 6 or higher and the player's cards are under 17, and otherwise stands. A gain of $2 or -$2 results from winning or losing a double down, respectively. The simple strategy never doubles the bet.
The distribution of payoffs under the optimal strategy peaks at a higher value. Note that the variability is considerable: there is about a 40% chance of losing 5% or more of your money with the optimal strategy and about a 55% chance with the simple strategy.
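The simple strategy compared here is a one-line decision rule: hit when the dealer shows a 6 or higher and the player's total is under 17, otherwise stand. A sketch of that rule (the function name is illustrative):

```python
def simple_strategy(dealer_card, player_total):
    """The simple strategy from the text: hit only when the dealer
    shows 6 or higher and the player's total is under 17."""
    if dealer_card >= 6 and player_total < 17:
        return "hit"
    return "stand"

print(simple_strategy(10, 15))  # hit
print(simple_strategy(5, 15))   # stand: dealer shows less than 6
print(simple_strategy(10, 18))  # stand: player already at 17 or more
```

Simulating 1,000 hands under this rule, versus the optimal strategy, produces the two payoff distributions in the figure.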
This plot shows the bet() function over the domain from -5 to 10. We can use this visualization to confirm that the function works as expected.
These boxplots show the team payrolls for the American League (dark gray) and the National League (light gray) from 1985 to 2012. Payroll has been logged and is reported in millions of dollars. The American League appears to have greater spread in payroll than the National League. Also evident is the sharp increase in payrolls in the late 1980s and early 1990s.
In this scatter plot, the plotting symbol is a transparent gray and the year has been jittered slightly to avoid overplotting. The identifiers of the teams that won the World Series are added to the corresponding points, which are colored red. In almost every case, these dots are at or above the upper quartile of team payroll. Payroll is reported in millions of dollars and plotted on a log scale.
This screenshot of the Google Earth virtual earth browser displays circles scaled to the population size and colored according to the infant mortality rate for a country. The data are available from the CIA Factbook. The locations of the circles are determined from MaxMind's latitude and longitude of the country's geographic center. When the viewer clicks on a circle, a window pops up with more detailed information for that country.
This histogram of infant mortality rates shows a highly skewed distribution. Most countries have rates under 20 per 1000 live births and a few countries have rates between 80 and 125.
This histogram of the square root of population size for countries shows a highly skewed distribution with a mode around 1000.
In this map each disk corresponds to a country's infant mortality and population. The size of the disk is proportional to the population and the color reflects the infant mortality rate. Notice the size of China's disk is too small: it has the highest population but one of the smallest disks. Other anomalies are apparent with closer inspection.
This screenshot of the ISO Web site shows the ISO code for the United
Kingdom as GB
. Codes are also available in XML
and CSV formats at https://www.iso.org/iso/home/standards/country_codes.htm.
This map correctly matches the country demographic information with latitude and longitude. Notice now the symbols for India and China are approximately the same size and the largest symbols on the map. Also note that the symbol for the United Kingdom is now pale yellow, the color we would expect it to be, because it is not being confused with Gabon.
This screenshot of Google Earth displays the location of each country with a pushpin. The locations are from our latitude and longitude file. A more informative Google Earth visualization appears in Figure 11.1, “Display of Infant Mortality by Country on Google Earth”.
The root node of the document is <factbook>, and it has an attribute called lastupdate, which indicates how recently the information was updated. The <news> nodes are indented one space, corresponding to their depth in the hierarchy, i.e., they are children of <factbook>.
MaxMind makes available the average latitude and longitude for countries in several formats, including in a display on the Web as shown here in this screenshot.
This screenshot displays the HTML source of the Web page shown in
Figure 11.9, “Screenshot of the MaxMind Web Page with Country Latitude and Longitude”. Notice that the latitudes and
longitudes for the countries appear within a <pre>
node in the HTML document.
We enter the search term in the text field at the top of the Web page. The results are displayed below. Note that in addition to the link to each particular job, the results page also shows metadata about each job, e.g., salary range, location, and a list of required skills. If this is all the information we wanted, we would not have to scrape the actual postings.
This job posting, like others on cybercoders.com,
has several paragraphs of free-form text, and also some itemized lists including one listing the skills necessary for the position. Additionally, the post has the location and salary separately in the top-right corner. It also displays the preferred skills as a list of phrases.
This plot shows the number of occurrences of each of the selected terms across all 842 job posts on Kaggle as of January 2015.
This plot shows the number of occurrences of the different terms we selected across Data Science job postings on this Web site.