A Rate-Distortion Approach to Massive Data Set Analysis

Amy Joan Braverman
Ph.D., 1999
Advisor: Donald Ylvisaker
This work sets forth a methodology for compressing massive data sets so that the reduced version retains certain features of the original. The approach is based partly on quantization theory, a branch of signal processing concerned with lossy data compression. Data sets play the role of information sources and are regarded as fixed, not random. A statistical framework is proposed for studying relationships between summarized (compressed) and raw data sets. That framework is used to articulate various statistical properties of compressed data and, in particular, to bound or approximate the errors incurred by using summarized information in place of raw information. The methodology is demonstrated using data from the International Satellite Cloud Climatology Project.
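
The trade-off between the size of a summary and the error it introduces can be illustrated with a minimal sketch. It assumes a scalar Lloyd quantizer with squared-error distortion, which is a standard textbook construction and not necessarily the procedure developed in the dissertation. The function names (`lloyd_quantize`, `distortion`, `rate_bits`) and the toy data are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def nearest(x, reps):
    """Index of the representative closest to x under squared error."""
    return min(range(len(reps)), key=lambda j: (x - reps[j]) ** 2)


def lloyd_quantize(data, k, iters=50):
    """Summarize a fixed 1-D data set by k representatives (Lloyd's algorithm)."""
    pts = sorted(data)
    # Assumption: start from evenly spaced order statistics of the data.
    reps = [pts[(2 * j + 1) * len(pts) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        labels = [nearest(x, reps) for x in data]
        new_reps = []
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            # Keep the old representative if its cell is empty.
            new_reps.append(sum(members) / len(members) if members else reps[j])
        if new_reps == reps:
            break
        reps = new_reps
    # Recompute labels so they always match the returned representatives.
    labels = [nearest(x, reps) for x in data]
    return reps, labels


def distortion(data, reps, labels):
    """Mean squared error from replacing each raw value by its representative."""
    return sum((x - reps[lab]) ** 2 for x, lab in zip(data, labels)) / len(data)


def rate_bits(labels):
    """Empirical entropy of the cell labels, in bits per data point."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


# Toy data, treated as fixed rather than random.
data = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
for k in (1, 2):
    reps, labels = lloyd_quantize(data, k)
    print(k, rate_bits(labels), distortion(data, reps, labels))
```

In this sketch, a larger number of representatives costs more bits per data point but lowers the mean squared error, which is the kind of summary-versus-raw error relationship the abstract describes.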