This book is about data science: a field that uses results from statistics, machine learning, and computer science to create predictive models. Because data science is so broad, it’s worth defining the term and outlining the approach we take in this book.
The statistician William S. Cleveland defined data science as an interdisciplinary field larger than statistics itself. We define data science as managing the process that can transform hypotheses and data into actionable predictions. Typical predictive analytic goals include predicting who will win an election, what products will sell well together, which loans will default, or which advertisements will be clicked on. The data scientist is responsible for acquiring the data, managing the data, choosing the modeling technique, writing the code, and verifying the results.
Because data science draws on so many disciplines, it’s often a “second calling.” Many of the best data scientists we meet started as programmers, statisticians, business intelligence analysts, or scientists. By adding a few more techniques to their repertoire, they became excellent data scientists. That observation drives this book: we introduce the practical skills needed by the data scientist by concretely working through all of the common project steps on real data. Some steps you’ll know better than we do, some you’ll pick up quickly, and some you may need to research further.
Machine learning, at its core, is concerned with algorithms that transform information into actionable intelligence. This makes machine learning well suited to the present-day era of Big Data. Without machine learning, it would be nearly impossible to keep up with the massive stream of information.
Given the growing prominence of R—a cross-platform, zero-cost statistical programming environment—there has never been a better time to start using machine learning. R offers a powerful but easy-to-learn set of tools that can assist you with finding data insights.
By combining hands-on case studies with the essential theory that you need to understand how things work under the hood, this book provides all the knowledge that you will need to start applying machine learning to your own projects.
The examples in this book were written for and tested with R Version 2.15.3 on both Microsoft Windows and Mac OS X, though they are likely to work with any recent version of R.
R is a wonderfully flexible platform and language for exploring, visualizing, and understanding data. I chose the quote from Alice in Wonderland to capture the flavor of statistical analysis today—an interactive process of exploration, visualization, and interpretation.
The second quote reflects the generally held notion that R is difficult to learn. What I hope to show you is that it doesn’t have to be. R is broad and powerful, with so many analytic and graphic functions available (more than 50,000 at last count) that it easily intimidates novice and experienced users alike. But there is rhyme and reason to the apparent madness. With guidelines and instructions, you can navigate the tremendous resources available, selecting the tools you need to accomplish your work with style, elegance, efficiency, and more than a little coolness.
I first encountered R several years ago, when applying for a new statistical consulting position. The prospective employer asked in the pre-interview material if I was conversant in R. Following the standard advice of recruiters, I immediately said yes, and set off to learn it. I was an experienced statistician and researcher, had 25 years’ experience as an SAS and SPSS programmer, and was fluent in a half dozen programming languages. How hard could it be? Famous last words.
This book invites the reader to learn about multivariate analysis, its modern ideas, innovative statistical techniques, and novel computational tools, as well as exciting new applications.
The need for a fresh approach to multivariate analysis derives from three recent developments. First, many of our classical methods of multivariate analysis have been found to yield poor results when faced with the types of huge, complex data sets that private companies, government agencies, and scientists are collecting today. Second, the questions now being asked of such data are very different from those asked of the much smaller data sets that statisticians were traditionally trained to analyze. Third, the computational costs of storing and processing data have crashed over the past decade, just as we have seen enormous improvements in computational power and equipment. Together, these rapid developments have made the efficient analysis of more complicated data far more feasible than ever before.
Multivariate statistical analysis is the simultaneous statistical analysis of a collection of random variables. It is partly a straightforward extension of the analysis of a single variable, where we would calculate, for example, measures of location and variation, check for violations of a particular distributional assumption, and detect possible outliers in the data. Multivariate analysis improves upon separate univariate analyses of each variable in a study because it incorporates information about the relationships between all the variables into the statistical analysis.
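As a small illustration of why separate univariate summaries can miss the relationships between variables, the sketch below compares two made-up data sets. It is written in Python with only the standard library, purely for illustration. The data values and the helper names (`mean`, `sample_variance`, `sample_covariance`, `correlation`) are assumptions introduced for this example and do not come from the text.

```python
from math import sqrt


def mean(values):
    """Arithmetic mean: a univariate measure of location."""
    return sum(values) / len(values)


def sample_variance(values):
    """Sample variance (n - 1 denominator): a univariate measure of variation."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def sample_covariance(xs, ys):
    """Sample covariance: a measure of how two variables vary together."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance rescaled by both standard deviations."""
    return sample_covariance(xs, ys) / sqrt(sample_variance(xs) * sample_variance(ys))


# Hypothetical data: y_shuffled holds the same values as y_ordered in a
# different order, so its univariate summaries are identical.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_ordered = [2.0, 4.0, 6.0, 8.0, 10.0]
y_shuffled = [6.0, 10.0, 2.0, 8.0, 4.0]

# The univariate view cannot tell the two y variables apart.
print(mean(y_ordered), mean(y_shuffled))
print(sample_variance(y_ordered), sample_variance(y_shuffled))

# The joint view reveals very different relationships with x.
print(correlation(x, y_ordered))
print(correlation(x, y_shuffled))
```

Both versions of y share the same mean and variance, so analyzing each variable on its own would treat them as equivalent. Only the covariance or correlation with x, which is the extra information a multivariate analysis brings in, separates a perfectly linear relationship from a weak negative one.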
This book series reflects the recent rapid growth in the development and application of R, the programming language and software environment for statistical computing and graphics. R is now widely used in academic research, education, and industry.
It is constantly growing, with new versions of the core software released regularly and more than 5,000 packages available. It is difficult for the documentation to keep pace with the expansion of the software, and this vital book series provides a forum for the publication of books covering many aspects of the development and application of R.
The scope of the series is wide, covering three main threads:
- Applications of R to specific disciplines such as biology, epidemiology, genetics, engineering, finance, and the social sciences.
- Using R for the study of topics of statistical methodology, such as linear and mixed modeling, time series, Bayesian methods, and missing data.
- The development of R, including programming, building packages, and graphics.
The books will appeal to programmers and developers of R software, as well as applied statisticians and data analysts in many fields. The books will feature detailed worked examples and R code fully integrated into the text, ensuring their usefulness to researchers, practitioners, and students.
Change is occurring at an accelerating rate; today is not like yesterday, and tomorrow will be different from today. Continuing today’s strategy is risky; so is turning to a new strategy. Therefore, tomorrow’s successful companies will have to heed three certainties:
- Global forces will continue to affect everyone’s business and personal life.
- Technology will continue to advance and amaze us.
- There will be a continuing push toward deregulation of the economic sector.
These three developments (globalization, technological advances, and deregulation) spell endless opportunities. But what is marketing and what does it have to do with these issues?
Marketing deals with identifying and meeting human and social needs. One of the shortest definitions of marketing is “meeting needs profitably.” Whether the marketer is Procter & Gamble, which notices that people feel overweight and want tasty but less fatty food and invents Olestra; or CarMax, which notes that people want more certainty when they buy a used automobile and invents a new system for selling used cars; or IKEA, which notices that people want good furniture at a substantially lower price and creates knock-down furniture—all illustrate a drive to turn a private or social need into a profitable business opportunity through marketing.
My 40-year career in marketing has produced some knowledge and even a little wisdom. Reflecting on the state of the discipline, it occurred to me that it is time to revisit the basic concepts of marketing.
First, I listed the 80 marketing concepts that are critical today and spent time mulling over their meanings and implications for sound business practice. My primary aim was to ascertain the best principles and practices for effective and innovative marketing. I found this journey to be filled with many surprises, yielding new insights and perspectives.
I didn’t want to write another 800-page textbook on marketing. And I didn’t want to repeat thoughts and passages that I have written in previous books. I wanted to present fresh and stimulating ideas and perspectives in a format that could be picked up, sampled, digested, and put down anytime. This short book is the result, and it was written with the following audiences in mind:
- Managers who have just learned that they need to know something about marketing; you could be a financial vice president, an executive director of a not-for-profit organization, or an entrepreneur about to launch a new product. You may not even have time to read Marketing for Dummies with its 300 pages. Instead you want to understand some key concepts and marketing principles presented by an authoritative voice, in a convenient way.
- Managers who may have taken a course on marketing some years ago and have realized that things have changed. You may want to refresh your understanding of marketing’s essential concepts and need to know the latest thinking about high-performance marketing.
- Professional marketers who might feel unanchored in the daily chaos of marketing events and want to regain some clarity and recharge their understanding by reading this book.