This book contains 12 comprehensive case studies focused primarily on data manipulation, programming and computational aspects of statistical topics in authentic research applications. The aim is to expose students, researchers and faculty to the entire thought process of approaching the computations of a complete data analysis project. This differs from teaching a programming language; instead, the book illustrates how to think about programming with very concrete and complete examples. We also emphasize testing and validating computations and when and how to make them faster, and we give the reader insight into how high-level programming evolves.
Each chapter works through all of the computations and programming to acquire, transform and explore the data or create the simulations. We discuss different aspects of the analysis and show results. However, readers have the opportunity to take the analyses much further, building on the core computational work described in the chapters.
The case studies form 3 basic groups (with overlap in most chapters):
- data analysis and statistical methods
- simulation
- data technologies
The topics covered include
- exploratory data analysis (EDA),
- naïve Bayes,
- k-nearest neighbors,
- classification and regression trees,
- repeated measurements and time series,
- regression,
- non-linear least squares and optimization,
- cross validation,
- connections and text processing,
- regular expressions,
- UNIX shell tools,
- relational databases and SQL,
- scraping Web pages and HTML,
- XML, KML.
The chapters also provide rich examples of some more advanced aspects of R, including
- object-oriented programming with both S3 and reference classes,
- dealing with large data with, e.g., the bigmemory package,
- profiling R code,
- making code faster,
- interfacing to C code.