The Next Page / Data Driven: Why statistics is 'sexy'
Hal Varian, Google's chief economist, says statistician will be 'the sexy job in the next 10 years.' Chad Schafer explains why.
August 4, 2013 4:00 AM
By Chad Schafer
There is growing recognition of the value of careful analysis of the vast quantity of information on consumer and market behavior. Corporations repeatedly promote their use of "analytics" to "harness the power of big data." For statisticians, however, the value of data is old news. This year has been designated the "International Year of Statistics" to highlight the central importance of statistics in managing a 21st-century data overload.
Statistics in everyday life
You have two tickets to an upcoming Pirates game, but you are unable to attend. You decide to sell the tickets online, but at what price?
Your decision is based on several factors, including the quality of the Pirates' opponent, the weather forecast, the pitching match-up and current ticket prices. You also may rely on your intuition and any experience with previous sales.
There is uncertainty, both in the data (will it rain?) and in the best way to combine the data into the chosen sales price. Your choice is only an estimate of the ideal price.
This type of decision process is made formal by statistical methods of prediction. A statistical prediction model uses available past information to construct a mathematical formula to estimate the quantity of interest (in our example, the ideal price for the tickets).
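As a rough illustration of that idea (a sketch only, not a method described in this article), the ticket example can be written as a small least-squares fit in Python. The past sales figures, the two predictors and the function names fit_price_model and predict_price are all invented for the example.

```python
import numpy as np

# Hypothetical past sales, invented for illustration:
# each row is (opponent quality on a 1-5 scale, chance of rain from 0 to 1).
past_features = np.array([
    [1, 0.0],
    [2, 0.5],
    [3, 0.1],
    [4, 0.8],
    [5, 0.2],
    [3, 0.6],
])
# Hypothetical price (in dollars) each pair of tickets sold for.
past_prices = np.array([21.0, 19.0, 30.0, 26.0, 38.0, 25.0])


def fit_price_model(features, prices):
    """Fit price ~ b0 + b1*quality + b2*rain by ordinary least squares."""
    # Prepend a column of ones so the formula includes an intercept b0.
    design = np.column_stack([np.ones(len(features)), features])
    coefficients, *_ = np.linalg.lstsq(design, prices, rcond=None)
    return coefficients


def predict_price(coefficients, quality, rain_chance):
    """Apply the fitted formula to a new game."""
    return float(coefficients @ np.array([1.0, quality, rain_chance]))


model = fit_price_model(past_features, past_prices)
estimate = predict_price(model, quality=4, rain_chance=0.3)
print(f"Estimated price: ${estimate:.2f}")
```

With only six invented sales, the fitted formula is itself uncertain, which is the point of the example: the predicted price is an estimate, not the ideal price.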
Mike Shuker, senior marketing research consultant at Larimer-based Management Science Associates, leads projects that construct statistical models for these situations. Working with major U.S. and international symphonies, MSA uses historical concert programming features (such as time and day and soloist and composer information) to develop highly accurate statistical models to forecast demand for future concerts. This forecast helps the symphony with budgeting and planning, and it replaces the "intuitive" estimates previously used.
Given misconceptions about statistics, it is not surprising that some people would attempt to rebrand the field as "analytics." Statisticians are not accountants, nor are they the source of colorful charts in newspapers. The work of statisticians is neither tabulating nor archiving numerical information. Instead, the field of statistics is engaged in the pursuit of knowledge from data. Methods of statistical analysis are built on a foundation of mathematics and then utilized in all areas of science, medicine, public policy, industry and so forth.
In fact, statistical analysis is an integral part of the research process. The data gathered as part of a study (a clinical trial to test a new treatment for a disease, for example) rarely yield a clear-cut answer to the question that motivated that study. Statistical methods are constructed to handle the uncertainty present in the data and to translate those data into the strongest conclusions possible.
For example, the Biostatistical Center of the Pittsburgh-based National Surgical Adjuvant Breast and Bowel Project evaluates therapies for the prevention and treatment of breast and colorectal cancer. In more than 50 years of research, this group, led by University of Pittsburgh biostatistics professor Joseph Costantino, has made substantial contributions, including the initial studies that demonstrated the value of tamoxifen in reducing the incidence of breast cancer in women at high risk for the disease.
But none of these studies produced clean, simple-to-interpret results. It is never the case that 100 percent of the patients respond to the treatment under study with a positive outcome. What success rate is required in order to deem a treatment a success? Any success rate derived from the study necessarily will be an estimate. How many patients should be enrolled to ensure that the uncertainty in this estimate is acceptably low? How should one adjust for patients' age differences? These questions, and many more, are issues that the statisticians involved in these studies must address with mathematical tools in order to develop and apply appropriate statistical methods.
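To make two of those questions concrete, here is a minimal sketch using the textbook normal approximation for a success rate. It is not the NSABP's procedure: real trial designs rest on more elaborate calculations and adjust for factors such as age, which this sketch ignores. The function names and the example numbers are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def success_rate_interval(successes, patients, confidence=0.95):
    """Normal-approximation (Wald) interval for an estimated success rate."""
    rate = successes / patients
    # Two-sided critical value, about 1.96 for 95 percent confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * math.sqrt(rate * (1 - rate) / patients)
    return rate - margin, rate + margin


def patients_needed(margin, guessed_rate=0.5, confidence=0.95):
    """Smallest sample size whose interval half-width is at most `margin`.

    guessed_rate=0.5 is the most cautious guess: it gives the largest answer.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil(z**2 * guessed_rate * (1 - guessed_rate) / margin**2)


# Hypothetical study: 60 of 100 patients respond to the treatment.
low, high = success_rate_interval(60, 100)
print(f"Estimated success rate: 60%, plausible range {low:.1%} to {high:.1%}")

# Patients needed to pin the rate down to within 5 percentage points.
print(patients_needed(0.05))  # 385 under these assumptions
```

Under this approximation, halving the acceptable margin of error roughly quadruples the number of patients required, which is one reason enrollment is a statistical question as much as a logistical one.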
The statisticians of the NSABP are not consultants but full members of the scientific collaboration, contributing to all stages of design, execution and interpretation of studies. The growing number of institutes that bring together statisticians and scientists speaks to the importance of these contributions.
For more than 60 years, the Rand Corp. has performed research in areas ranging from energy policy to healthcare reform. The Statistics Group is an important component of these projects, providing the statistical analysis expertise that makes sense of the available data. The statistical analysis of public-policy data is particularly challenging because policy interventions often have outcomes that are difficult to quantify.
For example, the Pittsburgh office of Rand has an ongoing emphasis on research aimed at improving education. Statistician Claude Setodji collaborated on a study testing New York City's decision to end social promotion in schools. A major challenge in such a study is finding students to compare with those who are retained in grade by the new policy. Statisticians and education researchers worked together closely to find such a group based on the rules for who could and could not be promoted and found that the policy was having a positive effect on students.
Through their collaborative research, statisticians often develop subject-matter expertise, and their insights are highly valued. Multiple faculty members in Pitt's Department of Statistics are also members of the Department of Psychiatry. These include Yu Cheng, who has completed research ranging from assessments of smoking-cessation programs to the benefits of mental-health services as part of pediatric care, and the Statistics Department chairman, Satish Iyengar, who develops statistical methods for analyzing data from experiments in neuroscience.
Thirty years ago, it may not have seemed necessary to involve statisticians in neuroscience research because the amount of data was limited. Technological advancements have led to massive growth in the quantity and quality of information available. This is a recurring situation: Technology enables new, richer sources of information, but making sense of these data requires more sophisticated methods of analysis.
It is now possible to determine your entire DNA sequence in a reasonable amount of time -- and for less than $1,000. There is significant potential for improvements in the diagnosis and treatment of conditions and diseases with a genetic basis. For example, professors Bernie Devlin of the University of Pittsburgh and Kathryn Roeder of Carnegie Mellon University are part of a nationwide, multi-institution consortium studying the genetic causes of autism. Their recent work has demonstrated that there is a contribution to autism cases from genetic mutations that are present in children but not inherited from their parents.
This finding came only after statistical analysis of the full genetic sequence data of the affected children and their parents. Researchers also had to account for natural genetic variability. Ultimately, it was estimated that up to 15 percent of autism cases could be attributed to these mutations. Hence, although this is a significant step, there is much more work to be done to understand the causes of this condition.
Port Authority buses provide another example of the challenges and benefits of working with complex data collections. The buses are equipped with devices that record massive amounts of information regarding the timing and location of all stops. As part of CMU's Traffic21 initiative, statistics professor Bill Eddy has worked with graduate students analyzing these data with the objective of improving transit in the busy Downtown corridor.
To fully exploit the data, the team had to work to understand the synchronization of the traffic lights on Forbes and Fifth avenues through Oakland and how the synchronization affects the buses (for which the timing was not designed). This particularly affects the buses that run against traffic on Fifth Avenue and are forced to stop at every light. Results of the statistical analysis showed that simple adjustments in the timing pattern could shorten travel times.
Students, eager to gain experience in cutting-edge statistical analyses, have played a key role in each of these research projects. The level of interest in learning methods of statistical analysis has grown tremendously. Enrollment in many statistics courses at local universities has doubled over the past five years.
This demand is driven by employers that have seen tremendous gains from statistical approaches to market research, risk assessment and quality control. Eliezer Batista, technology manager at Alcoa, uses statistical methods to proactively identify and address production problems. These tools, which allow for the determination of ideal operating settings and for the reduction of variability in the aluminum-production process, have been instrumental in Alcoa's sustainability approach to reducing energy consumption.
Ross McGowan, director of data science at CivicScience in East Liberty, analyzes voluminous consumer-survey data to identify new marketing opportunities for companies. He and Mr. Batista recently received master's degrees in statistics from CMU. The relatively new program teaches statistical methods and practice to students of varied undergraduate backgrounds.
Statisticians often are criticized for their tendency to avoid making strong statements and instead to qualify any conclusions with cautionary statements regarding the limitations of analysis procedures. Statisticians are comfortable with uncertainty in their conclusions; they know it is an accurate reflection of the realities of working with data.
Chad Schafer is associate professor in the Department of Statistics at Carnegie Mellon University. He is a member of the McWilliams Center for Cosmology, a collaboration of astronomers, computer scientists and statisticians at CMU. He recently served as president of the Pittsburgh Chapter of the American Statistical Association.