The annual Joint Statistical Meetings (JSM) is among the largest statistics conferences in the world and, as it nears its 180th anniversary, among the oldest. Hosted by the American Statistical Association, the International Biometric Society, the Institute of Mathematical Statistics, the Statistical Society of Canada, and several other organizations, the event brings together thousands of participants from academia, industry, and government to share research and perspectives, and to encourage collaboration on a wide range of topics.
As it has in the past, Two Sigma sponsored JSM 2018, which was held from July 28 to August 2 in Vancouver. Below, several of the Two Sigma researchers in attendance provide an overview of some of the most interesting lectures and sessions they attended, and highlight some of the most important challenges statisticians face going forward.
Sessions on Named Lectures
JSM featured two Medallion Lectures this year:
- Statistical Inference for Complex Extreme Events
- Statistical Analysis of Large Tensors
Statistical Inference for Complex Extreme Events – Anthony Davison, Ecole Polytechnique Fédérale de Lausanne
The study of extreme events has become increasingly important in many fields of human activity due to their unexpected and catastrophic nature. Proper statistical modeling of these events allows us to predict their occurrence and manage their impact more accurately. Typical applications are now widely found in insurance, disaster prediction, and reliability engineering, but lately considerable interest has emerged from other areas, like finance and national security. Davison’s lecture at JSM offered the audience a brief overview of statistical modeling of extreme events, as well as some recent innovations he and his colleagues have made in this area.
Extreme events, by definition, are observed only rarely, making the modeling problem one of extrapolation rather than interpolation. As a consequence, traditional approaches to inferring the underlying distribution from i.i.d. samples generally do not work well, and models based on extreme value theory (EVT) are among the most popular alternatives. Lying at the heart of EVT is the Fisher-Tippett-Gnedenko theorem (Fisher and Tippett, 1928), which states that any limiting distribution of properly normalized maxima belongs to either the Gumbel, the Fréchet, or the (reverse) Weibull family, together known as the generalized extreme value (GEV) family. The GEV family is the only non-trivial family of distributions satisfying the max-stability relation, and its behavior is fully characterized by the shape parameter, a quantity that plays a critical role in the statistical modeling of extreme events.
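For context, the GEV family referred to above has distribution function (this display is added here for reference and follows standard EVT notation, not material from the lecture itself):

$$
G_{\mu,\sigma,\xi}(x) \;=\; \exp\!\left\{-\Bigl[1+\xi\,\tfrac{x-\mu}{\sigma}\Bigr]_{+}^{-1/\xi}\right\},
\qquad [y]_{+} = \max(y,0),
$$

interpreted as $\exp\{-e^{-(x-\mu)/\sigma}\}$ when $\xi = 0$. The shape parameter $\xi$ distinguishes the three sub-families ($\xi > 0$: Fréchet, heavy upper tail; $\xi = 0$: Gumbel; $\xi < 0$: reverse Weibull, bounded upper tail), and max-stability means that for every $n$ there exist constants $a_n > 0$ and $b_n$ with $G^n(a_n x + b_n) = G(x)$.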
While GEV is very helpful in modeling extreme events in relatively simple scenarios, for more complex settings like analyzing space-time data, models based on a random process instead of a single parametric family are usually necessary. By modeling the occurrence of extreme events as exceedances over high thresholds (Davison and Smith, 1990), Davison showed that there exists a natural way to derive a family of random processes, known as Pareto processes, to model extrema. He emphasized that likelihood inference under such models is computationally challenging, and techniques based on quasi-Monte Carlo approximation and gradient scores are being adopted in practice.
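As a concrete (and much simpler) illustration of the threshold-exceedance idea, the sketch below fits a generalized Pareto distribution to exceedances over a high threshold and extrapolates a tail probability. It uses synthetic data and an arbitrarily chosen threshold, and it is not the space-time Pareto-process methodology discussed in the lecture:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.standard_t(df=4, size=100_000)  # heavy-tailed synthetic data

u = np.quantile(x, 0.99)                # a high threshold (illustrative choice)
exceedances = x[x > u] - u

# Fit a generalized Pareto distribution to the exceedances; the shape
# parameter governs how heavy the tail is.
shape, loc, scale = stats.genpareto.fit(exceedances, floc=0.0)

# Extrapolate beyond the observed range:
# P(X > z) = P(X > u) * P(X - u > z - u | X > u)
p_u = np.mean(x > u)
z = 10.0
tail_prob = p_u * stats.genpareto.sf(z - u, shape, loc=0.0, scale=scale)
print(f"estimated P(X > {z}) = {tail_prob:.2e}")
```

In practice, threshold selection and diagnostics (for example, mean excess plots) are an important part of such an analysis.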
Going beyond theoretical developments in the statistical modeling of extreme events, Davison also covered several interesting case studies applying these ideas to real-world examples, bringing a welcome balance between theory and practice to the lecture. His closing remarks emphasized again how modeling extreme events differs from typical statistical inference, in terms of both the challenges and the opportunities that lie ahead.
Statistical Analysis of Large Tensors – Ming Yuan, Columbia University
The era of big data has not only seen the sheer volume of research data explode; it has also prompted statisticians to develop models for datasets with increasingly complex structures. For instance, in neuroscience, genetics, and many other scientific fields, it is now not uncommon for data to be collected directly as multi-dimensional arrays, i.e., tensors, a data type that has lately drawn considerable attention in statistics.
Yuan’s lecture served as a timely introduction to statistical modeling for tensors as well as a review of some progress made by the speaker and his collaborators on this topic. In particular, Yuan covered three active areas in tensor analysis:
- Tensor PCA, the primary goal of which is to study rank-one approximation to a tensor;
- Tensor completion, which addresses the problem of imputing missing entries of partially observed tensors; and
- Tensor sparsification, an area focused on finding sparse representations of a tensor.
The challenges in the analysis of tensors turn out to be both statistical and computational. For high-order tensors, many well-studied matrix concepts, such as the operator norm, turn out to be either not well defined or computationally intractable. For instance, de Silva and Lim (2008) discussed a case where a tensor of rank three can be approximated by a sequence of rank-two tensors with error converging to zero, implying that low-rank decompositions, such as the SVD for matrices, do not easily generalize to high-order tensors (Kolda and Bader, 2009). While tensor PCA is well defined conceptually, finding the exact solution is NP-hard and the associated optimization problem is highly non-convex. To make things worse, with noisy observations the MLE is not always consistent, and many interesting results have since been obtained on the properties of the MLE and other polynomial-time estimators. See Montanari and Richard (2014) for an interesting discussion of recent advances and an in-depth study of the statistical and computational aspects of some existing and new methods.
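As a small illustration of what rank-one approximation involves, the sketch below implements the alternating (higher-order) power iteration commonly used as a heuristic for tensor PCA on a 3-way tensor. It is a generic heuristic for illustration only, carries no global-optimality guarantee (the exact problem is NP-hard), and is not the specific methodology from Yuan's work:

```python
import numpy as np

def rank_one_approx(T, n_iter=200, seed=0):
    """Alternating power iteration for a rank-one approximation
    lambda * (u outer v outer w) of a 3-way tensor T."""
    rng = np.random.default_rng(seed)
    u, v, w = (rng.standard_normal(d) for d in T.shape)
    u, v, w = (a / np.linalg.norm(a) for a in (u, v, w))
    for _ in range(n_iter):
        u = np.einsum('ijk,j,k->i', T, v, w); u /= np.linalg.norm(u)
        v = np.einsum('ijk,i,k->j', T, u, w); v /= np.linalg.norm(v)
        w = np.einsum('ijk,i,j->k', T, u, v); w /= np.linalg.norm(w)
    lam = np.einsum('ijk,i,j,k->', T, u, v, w)
    return lam, u, v, w

# Example: a noisy rank-one "signal plus noise" tensor.
rng = np.random.default_rng(1)
a, b, c = (rng.standard_normal(d) for d in (20, 30, 40))
T = 5.0 * np.einsum('i,j,k->ijk', a / np.linalg.norm(a),
                    b / np.linalg.norm(b), c / np.linalg.norm(c))
T += 0.01 * rng.standard_normal(T.shape)
lam, u, v, w = rank_one_approx(T)
print(round(lam, 2))  # close to the planted signal strength of 5.0
```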
Going beyond tensor PCA, Yuan also discussed tensor completion and sparsification, two topics that have been extensively studied in matrix analysis. As in the case of PCA, well-known techniques that work well for matrices cease to be effective, and new methods are needed for tensors of higher order. Taking sparsification as an example, when applied to tensors, existing randomized sparsification schemes for matrices generally require the number of retained nonzero entries to grow as a polynomial in the side length with degree equal to half the order of the tensor (Nguyen et al., 2015). While such an exponential dependence is essentially optimal in the matrix case, recent work by Xia and Yuan (2017) showed that, with a more judicious sampling scheme, non-trivial improvements can be achieved for higher-order tensors, leaving tensor sparsification an interesting problem for further study.
Despite the aforementioned challenges, there have been several successful applications of tensor modeling across scientific areas. In neuroscience, especially neuroimaging analysis, modern electroencephalography (EEG) experiments naturally produce data with multiple dimensions, including time, frequency, space, and other experiment-specific characteristics, making tensor methods like the Tucker model part of the standard toolbox for understanding brain function. Signal processing is another area where tensor-based models have found promising applications, in problems like blind source separation, MIMO space-time coding, and audio/video processing. Not surprisingly, there has also been surging interest in the machine learning community in applying tensor methods (Anandkumar et al., 2016).
Yuan’s JSM lecture was a stimulating illustration of the new opportunities and challenges that technological advances bring to statisticians. Today’s rapidly increasing computational power makes it possible to research and apply sophisticated models that were previously intractable. At the same time, these advances are producing datasets of ever greater variety and complexity, calling for new methodologies to be developed.
Fisher Lecture – Susan Murphy, Harvard University
The Fisher Lectureship, named after Sir Ronald Fisher, one of the founding fathers of modern statistics, is an annual award conferred on a contemporary statistician to recognize the recipient’s meritorious achievement and scholarship in statistics. This year’s Fisher Lecture was delivered by Harvard’s Susan Murphy, who discussed some interesting progress her research group has made in the study of micro-randomized trials in mobile health. Quite a few ideas discussed in the lecture date back to Fisher’s original work on randomized experiments, factorial designs, and analysis of variance, though many new methodologies have since been developed.
Perhaps one of the most intriguing themes of this year’s Fisher Lecture was the application of randomized trials in a sequential, micro-level setting. The motivating application is delivering mobile intervention treatments, e.g., iPhone notifications, to participants who are trying to quit smoking. Research has indicated that stress predicts smoking relapse and that performing brief relaxation exercises helps buffer real-life stress. It is therefore of major policy interest to know whether pushing reminders to do such exercises can aid smoking cessation. Since the researchers can use sensors to monitor participants’ stress levels, the statistical question naturally becomes when, and how many times, the intervention should be provided.
Murphy formulated the research question as a dynamic decision-making problem with a time-varying stratification variable (for example, the participant’s stress level), with the goal in this particular study being to minimize future stress levels. The mobile interventions are randomly allocated to participants subject to a soft budget constraint. Murphy described an inverse-probability weighted estimator of the causal effect, which can also be generalized to construct an asymptotic F-test of causal hypotheses, such as whether there is any treatment effect or whether the treatment effect decays over time.
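To make the weighting idea concrete, here is a deliberately simplified sketch of an inverse-probability-weighted contrast for a micro-randomized trial in which the randomization probability at each decision point is known by design. The variable names, the synthetic data, and the simple difference-of-weighted-means estimand are illustrative assumptions, not Murphy's exact estimator or test statistic:

```python
import numpy as np

def ipw_effect(A, Y, p):
    """A: 0/1 treatment indicators at each decision point,
    Y: proximal outcome (e.g., a near-term stress measure, lower is better),
    p: known randomization probabilities.
    Returns the IPW estimate of E[Y(1)] - E[Y(0)]."""
    A, Y, p = map(np.asarray, (A, Y, p))
    treated = np.mean(A * Y / p)
    control = np.mean((1 - A) * Y / (1 - p))
    return treated - control

# Example with synthetic data: a small stress-reducing treatment effect.
rng = np.random.default_rng(1)
n = 10_000
p = np.full(n, 0.4)                      # budgeted randomization probability
A = rng.binomial(1, p)                   # whether a notification was sent
Y = 1.0 - 0.1 * A + rng.normal(0, 1, n)  # proximal stress outcome
print(ipw_effect(A, Y, p))               # close to -0.1
```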
Murphy’s lecture highlighted the great potential of causal inference under modern experimental designs made possible by technological innovation. Close collaboration among statisticians, computer scientists, and domain experts has become the new norm for exploring the immense opportunities offered by the advent of the big data era.
Additional Invited Sessions
Theory vs. Practice – Edward George, Trevor Hastie, Elizaveta Levina, John Petkau, Nancy Reid, Richard J Samworth, Robert Tibshirani, Larry Wasserman, Bin Yu
The evolution of statistics as a scientific discipline has been far from smooth sailing. Throughout the 20th century, the statistics community was haunted by clashes rooted in the dichotomy between frequentists and Bayesians. While those debates have largely subsided in recent years, another division within statistics is now growing and taking shape. On one end of the spectrum are statisticians who pride themselves on developing theory, treating mathematical rigor as an essential component. On the other end are practitioners who focus solely on solving real-world problems, making practicality the priority. With this division echoing the long-standing debate over theory versus practice in science more broadly, JSM this year featured an interesting panel discussion by a group of seasoned statisticians on how theory and practice may best interplay in statistics.
The panel discussion began with an array of questions put forth by the facilitator, including some notable queries about the current balance between theory and applications in the field, the value of theoretical work versus that of applied work, and the role of computation in statistical theory and applications. Several discussants shared their thoughts on these topics, and there seemed to be consensus that statisticians should be more open to accepting and adopting methods that have proven themselves empirically but lack a well-understood theory. Tibshirani cited deep neural networks as an example: despite their popularity in various applications, he pointed out, solid progress toward understanding their statistical properties has been made only recently (Cohen et al., 2018).
Some of the discussants expressed concern that the academic community still seems to prefer theoretical work over applied work, and several strongly argued that this perspective needs to change to help shape a better future for the field. Participants also asserted that the current journal review and tenure evaluation processes should put more weight on high-quality applied work, and that there should not be an intentional separation between theory and practice. Echoing this, UC Berkeley’s Yu suggested that applied work should naturally feed back to revise theory, while theory should provide insight and guidance for practice.
The panel discussion also featured audience questions, several of which related directly to how the community should rise to the challenges emerging from the sweeping changes happening throughout science and engineering. As Tukey envisioned more than half a century ago (Tukey, 1962), “it remains to us, to our willingness to take up the rocky road of real problems in preference to the smooth road of unreal assumptions, arbitrary criteria, and abstract results without real attachments. Who is for the challenge?”
Random Forests in Big Data, Machine Learning and Statistics – Hemant Ishwaran, Jean-Michel Poggi, Lisa Schlosser, Lucas Mentch, Ruoqing Zhu
Random forests, a nonparametric method pioneered by Breiman (2001), have become a powerful tool for many prediction problems in science and engineering. Despite the method’s roots in statistics and the wide success it enjoys in real-world applications, relatively little is known about its theoretical properties, and most of the discussion to date has taken place in other communities, such as machine learning. Interestingly, this year’s JSM featured several sessions on recent studies of random forests, with one session in particular devoted to methodological developments and applications of random forests in statistics.
A few of the talks in this session focused on statistical properties of random forests. For example, Ishwaran proposed a new subsampling approach to estimate standard errors and construct confidence intervals for variable importance measures when using random forests for classification, regression, and survival analysis. He emphasized that the new method is particularly computationally efficient for large-scale datasets. Zhu showed that when applying random forests in survival analysis, the censoring distribution can affect the consistency of the splitting rule, and a modification satisfying certain restrictions is needed, especially when applying the technique in high-dimensional settings.
There was also an interesting talk by Schlosser on distributional regression forests, which combine the random forest framework with GAMLSS (generalized additive models for location, scale, and shape). As with random forests, the model consists of distributional trees constructed from bootstrap samples. Each distributional tree is built by recursively partitioning the covariate space into disjoint segments and then fitting a distributional model from the GAMLSS families on each segment. Variable splitting is based on tests of association with the score function rather than on minimizing the residual sum of squares or the Gini criterion. For each covariate observation, rather than averaging the predicted values from individual trees, a predictive distribution is obtained by maximizing the combined likelihood across all distributional trees, allowing probabilistic statements to be made for each prediction. The proposed model was applied to precipitation forecasting in complex terrain, with encouraging results.
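The weighted-likelihood prediction step can be illustrated with a deliberately simplified sketch: it assumes a Gaussian response family (the simplest GAMLSS member) and ordinary scikit-learn regression trees grown on bootstrap samples, whereas Schlosser's method uses score-based splitting and general GAMLSS families. All function and variable names below are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils import resample

def fit_forest(X, y, n_trees=100, min_leaf=20, seed=0):
    """Grow ordinary regression trees on bootstrap samples."""
    rng = np.random.default_rng(seed)
    return [
        DecisionTreeRegressor(min_samples_leaf=min_leaf)
        .fit(*resample(X, y, random_state=int(rng.integers(1 << 31))))
        for _ in range(n_trees)
    ]

def predictive_gaussian(x_new, trees, X, y):
    """Weight each training point by how often it shares a leaf with x_new,
    then fit a Gaussian by weighted maximum likelihood (closed form)."""
    w = np.zeros(len(y))
    for tree in trees:
        leaves = tree.apply(X)
        leaf_new = tree.apply(x_new.reshape(1, -1))[0]
        in_leaf = (leaves == leaf_new).astype(float)
        if in_leaf.sum() > 0:
            w += in_leaf / in_leaf.sum()
    w /= w.sum()
    mu = np.average(y, weights=w)
    sigma = np.sqrt(np.average((y - mu) ** 2, weights=w))
    return mu, sigma  # parameters of the predictive distribution N(mu, sigma^2)

# Usage (illustrative): both the mean and the scale of y vary with X.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
y = X[:, 0] + rng.normal(scale=np.exp(X[:, 1]))
trees = fit_forest(X, y)
print(predictive_gaussian(X[0], trees, X, y))
```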
Multidimensional Monotonicity Discovery with mBART – Edward George, University of Pennsylvania
BART (Bayesian Additive Regression Trees) is a widely adopted nonparametric regression technique. It takes a Bayesian approach to building an ensemble of trees that approximates the target function. Edward George’s talk introduced two extensions of BART: mBART (Monotone BART) and mBARTD (Monotone Discovery with mBART). mBART assumes that the multivariate target function is monotone, i.e., monotone in each coordinate with all other coordinates held fixed, and approximates the target function by a sum of many individual monotone tree models. The resulting fit is shown to be much smoother than the vanilla BART fit, thanks to the regularization.
In practice, however, it is hard to be absolutely certain about the monotonicity assumption. As an alternative, mBARTD is capable of discovering monotone relationships instead of explicitly imposing the assumption. Relying on the fact that a well-behaved function can be decomposed as the sum of a monotone increasing function and a monotone decreasing function, the target function is first modeled as the sum of two monotone components with mBART, and monotonicity is then assessed by checking whether one of the components is close to zero. The different variants of BART were tested on examples including wiggly one-dimensional sine waves and higher-dimensional functions; mBARTD consistently outperformed BART by producing smoother fits and outperformed mBART by remaining flexible for non-monotone functions.
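To make the decomposition concrete (this display is an added illustration; the bounded-variation condition is the standard regularity assumption under which it holds): a function of bounded variation can be written as

$$
f(x) \;=\; f_{\uparrow}(x) + f_{\downarrow}(x),
\qquad f_{\uparrow} \text{ monotone increasing}, \quad f_{\downarrow} \text{ monotone decreasing},
$$

so mBARTD fits both components with mBART and then checks whether $f_{\uparrow} \approx 0$ (suggesting a monotone decreasing target) or $f_{\downarrow} \approx 0$ (suggesting a monotone increasing target); if neither component is negligible, the target is treated as non-monotone.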
Nonparametric Independence Testing via Mutual Information – Richard Samworth, University of Cambridge
Testing independence, or, more generally, measuring dependence, has been extensively studied in statistics, with early work dating back almost a century. While many classic methods like Pearson’s correlation (Pearson, 1920) have been widely applied, most of them assume a particular structure for the dependence relationship, which reduces the power of the resulting tests in more complex settings. Berrett and Samworth (2017) proposed a new nonparametric test of independence based on mutual information, i.e., the Kullback-Leibler divergence between the joint distribution and the product of the two marginal distributions. Unlike most existing methods, the test’s asymptotic power is shown to converge to one for alternatives in which the mutual information is bounded away from zero. Exact testing procedures were developed for three settings, namely testing with known marginals, testing with unknown marginals, and goodness-of-fit testing in regression, and the corresponding procedures are available in the R package IndepTest.
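For reference, the mutual information underlying the test is

$$
I(X;Y) \;=\; \mathrm{KL}\bigl(P_{XY}\,\big\|\,P_X \otimes P_Y\bigr)
\;=\; \int \log\!\left(\frac{\mathrm{d}P_{XY}}{\mathrm{d}(P_X \otimes P_Y)}\right)\mathrm{d}P_{XY},
$$

which is non-negative and equals zero if and only if $X$ and $Y$ are independent, so rejecting independence when an estimate of $I(X;Y)$ is sufficiently large gives a natural test; see Berrett and Samworth (2017) for the estimator and calibration actually used.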