Big-Data Analytics in the Resource-Constrained Regime: Statistical Limits and Computational Challenges

Yihong Wu, Bruce Hajek, R. Srikant: Electrical and Computer Engineering

Chandra Chekuri: Computer Science

Sewoong Oh: Industrial and Enterprise Systems Engineering

Addressing the Problem

Statistical inference involving massive datasets is increasingly the norm in modern data-analytic practice. Traditional approaches to statistical inference assume a data-constrained regime, in which collecting samples or data points is expensive. In modern big-data analytics, the amount of data available and the speed at which it is collected are growing at extreme rates. To cope with the statistical and computational challenges posed by such datasets, we need a new paradigm for statistical inference in the resource-constrained regime, where data are abundant but time and memory are scarce. The key questions remain unanswered: when dealing with big data, how do resource constraints affect the performance of inference and learning, and how can we strike the optimal tradeoff between statistical performance and resource efficiency?
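To make the tradeoff concrete, consider the elementary task of estimating a population mean from a stream of n samples under a memory budget of k items. A one-pass reservoir sample of size k yields an unbiased estimate whose standard error scales as 1/sqrt(k) rather than 1/sqrt(n): the memory budget, not the volume of data, governs statistical accuracy. The following minimal Python sketch illustrates the point; it is an illustrative example under our own assumptions (the function reservoir_sample and all parameter values are hypothetical), not a component of the proposed research.

    import random
    import statistics

    def reservoir_sample(stream, k, rng):
        """One-pass uniform sample of k items in O(k) memory (Algorithm R)."""
        reservoir = []
        for i, x in enumerate(stream):
            if i < k:
                reservoir.append(x)
            else:
                j = rng.randint(0, i)  # item i enters the sample with probability k/(i+1)
                if j < k:
                    reservoir[j] = x
        return reservoir

    rng = random.Random(0)
    n, k = 1_000_000, 100
    stream = (rng.gauss(0.0, 1.0) for _ in range(n))
    estimate = statistics.mean(reservoir_sample(stream, k, rng))
    # Standard error ~ 1/sqrt(k) = 0.1 under the memory budget,
    # versus ~ 1/sqrt(n) = 0.001 if all n samples could be stored.
    print(f"memory-constrained estimate of the mean: {estimate:.3f}")

Sharpening and generalizing this kind of accuracy-versus-memory calculation to richer inference problems is precisely the tradeoff at issue.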

Traditionally, theoretical computer scientists have focused on the computational and space complexity of worst-case problems rather than problems of a statistical nature. In contrast, statisticians are chiefly concerned with performing inference tasks optimally under various statistical models from a decision-theoretic perspective, without addressing constraints on computational and storage resources. The ever-growing amount of data, however, demands that we account for the computational and space complexity of inference and learning. To date, the connection between statistical inference and complexity theory for big data has remained largely untapped, although it has recently garnered a surge of attention from the computer science, engineering, and statistics communities. There is a pressing need to bridge the gap between these two viewpoints on statistical inference with massive datasets.

Research Goals

Combining statistical and computational perspectives, we plan to undertake a systematic study of inference and learning problems in big-data analytics. Beyond the information-theoretic determination of fundamental limits, the major innovation lies in identifying the optimal statistical performance achievable under resource constraints such as computational and space complexity, and in designing approximation algorithms with efficient implementations and provable guarantees, thereby quantifying the optimal tradeoff between statistical performance and resource utilization.
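One concrete instance of an approximation algorithm with a provable guarantee under a space constraint is the Count-Min sketch, which estimates item frequencies in a data stream using memory that depends only on the desired accuracy and failure probability, not on the stream length. The sketch below is a minimal illustrative implementation under our own assumptions: the class name and parameters are hypothetical, and the use of Python's built-in hash is a simplification of the pairwise-independent hashing the formal guarantee assumes.

    import math
    import random

    class CountMinSketch:
        """Approximate frequency counts in O(width * depth) memory.
        With width = ceil(e/eps) and depth = ceil(ln(1/delta)), the
        estimate exceeds the true count by at most eps * (stream length)
        with probability at least 1 - delta."""

        def __init__(self, eps, delta, seed=0):
            self.width = math.ceil(math.e / eps)
            self.depth = math.ceil(math.log(1.0 / delta))
            rng = random.Random(seed)
            self.salts = [rng.getrandbits(64) for _ in range(self.depth)]
            self.table = [[0] * self.width for _ in range(self.depth)]

        def _cols(self, item):
            # One column index per row, derived from a per-row salt.
            return (hash((salt, item)) % self.width for salt in self.salts)

        def update(self, item, count=1):
            for row, col in enumerate(self._cols(item)):
                self.table[row][col] += count

        def query(self, item):
            # Collisions only add counts, so the minimum over rows
            # never underestimates and bounds the overcount.
            return min(self.table[row][col] for row, col in enumerate(self._cols(item)))

    cms = CountMinSketch(eps=0.01, delta=0.01)
    for word in ["big", "data", "big", "limits", "big"]:
        cms.update(word)
    print(cms.query("big"))  # at least 3, and close to 3 with high probability

Characterizing when such space-efficient approximations attain, or provably fall short of, the statistically optimal performance is representative of the questions the proposed research will address.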