Award Abstract # 1065125
AF: Medium: Taming Massive Data with Sub-Linear Algorithms

NSF Org: CCF
Division of Computing and Communication Foundations
Recipient: MASSACHUSETTS INSTITUTE OF TECHNOLOGY
Initial Amendment Date: February 25, 2011
Latest Amendment Date: February 25, 2011
Award Number: 1065125
Award Instrument: Standard Grant
Program Manager: Rahul Shah
CCF - Division of Computing and Communication Foundations
CSE - Directorate for Computer & Information Science & Engineering
Start Date: March 1, 2011
End Date: February 29, 2016 (Estimated)
Total Intended Award Amount: $1,160,930.00
Total Awarded Amount to Date: $1,160,930.00
Funds Obligated to Date: FY 2011 = $1,160,930.00
History of Investigator:
  • Ronitt Rubinfeld (Principal Investigator)
    ronitt@csail.mit.edu
  • Piotr Indyk (Co-Principal Investigator)
Recipient Sponsored Research Office: Massachusetts Institute of Technology
77 MASSACHUSETTS AVE
CAMBRIDGE
MA  US  02139-4301
(617)253-1000
Sponsor Congressional District: 07
Primary Place of Performance: Massachusetts Institute of Technology
77 MASSACHUSETTS AVE
CAMBRIDGE
MA  US  02139-4301
Primary Place of Performance Congressional District: 07
Unique Entity Identifier (UEI): E2NYLCDML6V1
Parent UEI: E2NYLCDML6V1
NSF Program(s): Algorithmic Foundations
Primary Program Source: 01001112DB NSF RESEARCH & RELATED ACTIVITIES
Program Reference Code(s): 7924, 7926, 9218, HPCC
Program Element Code(s): 7796
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

The rampant growth of massive data sets presents new challenges for data processing and analysis. To cope with this phenomenon, sublinear time and sublinear space algorithms that are capable of analyzing and extracting value from such immense inputs must be developed. This project aims to study sublinear time and space algorithms from a unified perspective, exploiting the synergies between them to gain a better understanding that will lead to faster, more space-efficient, and more widely applicable algorithms.

The proposed research has two core components. The first component is the study of sparse representations of large data sets. Sparse representations are useful for quickly analyzing and processing data. This component itself has two parts. First, it will lead to a better understanding of sublinear time sampling algorithms for data coming from a succinctly described distribution. The succinct descriptions considered include those defined by a small number of parameters, such as power laws, Gaussians, and histogram distributions. Second, it will improve the current state of knowledge of streaming algorithms, data sketching, compressive sensing, and sparse recovery techniques. Such techniques will, for example, have an impact on algorithms for acquiring and processing image, audio, and network data.

The second component aims to design novel statistical techniques for understanding various distributional quantities describing commonly used structured objects such as graphs. The focus in this component will be on sublinear time and space algorithms that estimate parameters of graphs.

This project will significantly advance the algorithmic foundations of computation with limited resources. It will develop highly efficient algorithms for analyzing massive data sets that arise in diverse areas including electronic commerce, health, network security, and scientific data collection.

The broader impacts of this project are in the education and mentoring of young researchers, including members of underrepresented groups. Community outreach activities will introduce primary school children to interesting mathematical and computer science ideas.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Dana Ron, Ronitt Rubinfeld, Muli Safra, Alex Samorodnitsky, Omri Weinstein, "Approximating the Influence of Monotone Boolean Functions in $O(\sqrt{n})$ Query Complexity," Transactions on Computation Theory, v.4(4), 2012, p.11.
Ronitt Rubinfeld, "Taming Big Probability Distributions," ACM Crossroads, v.19(1), 2012, p.24-28.
Sofya Raskhodnikova, Dana Ron, Ronitt Rubinfeld, and Adam Smith, "Sublinear Algorithms for Approximating String Compressibility," Algorithmica, v.65, 2013, p.685.
Ronitt Rubinfeld and Ning Xie, "Robust characterizations of k-wise independence over product spaces and related testing results," Random Structures & Algorithms, v.43, 2013, p.265.
Reut Levi, Dana Ron, and Ronitt Rubinfeld, "Testing Properties of Collections of Distributions," Theory of Computing, v.9, 2013, p.295.
Reut Levi, Dana Ron, and Ronitt Rubinfeld, "Testing Similar Means," SIAM Journal on Discrete Mathematics, v.28, 2014, p.1699-1724.
Sheela Devadas and Ronitt Rubinfeld, "A Self-Tester for Linear Functions over the Integers with an Elementary Proof of Correctness," Theory of Computing Systems (published online); also CoRR abs/1412.5484 (2014), 2015.
Albert Kim, Eric Blais, Aditya G. Parameswaran, Piotr Indyk, Samuel Madden, and Ronitt Rubinfeld, "Rapid Sampling for Visualizations with Ordering Guarantees," PVLDB, v.8, 2015.

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

The rampant growth of massive data sets presents new challenges for data processing and analysis. To cope with this phenomenon, sublinear time and sublinear space algorithms that are capable of analyzing and extracting value from such immense inputs must be developed. This project studies sublinear time and space algorithms from a unified perspective, exploiting the synergies between them to gain a better understanding that leads to faster, more space-efficient, and more widely applicable algorithms.

The research has two core components. The first component is the study of sparse representations of large data sets. Sparse representations are useful for quickly analyzing and processing data. Results in this part include: algorithms that exploit the sparsity of data to compress and represent large data sets; improved sparse recovery algorithms for inputs with structured sparsity, namely tree-sparse and block-sparse inputs; optimal sparse Fourier transform algorithms; and the best algorithms to date for the sparse recovery of matrices under the restricted isometry property condition.
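As a rough illustration of recovery from a linear sketch (an illustrative Python example using a CountSketch-style structure, not the project's algorithms; the parameters and the synthetic signal are made up for this sketch), the code below approximately recovers the largest entries of a sparse vector from far fewer linear measurements than the vector's dimension.

    import numpy as np

    rng = np.random.default_rng(0)

    n, k = 1000, 5          # ambient dimension and sparsity (illustrative)
    reps, buckets = 7, 40   # sketch parameters (illustrative)

    # A k-sparse signal with a few large entries.
    x = np.zeros(n)
    x[rng.choice(n, size=k, replace=False)] = rng.normal(10, 2, size=k)

    # CountSketch: in each repetition, hash coordinates to buckets with random signs.
    h = rng.integers(0, buckets, size=(reps, n))   # bucket of coordinate i in rep r
    s = rng.choice([-1.0, 1.0], size=(reps, n))    # sign of coordinate i in rep r

    # Linear measurements: reps * buckets numbers instead of n.
    sketch = np.zeros((reps, buckets))
    for r in range(reps):
        np.add.at(sketch[r], h[r], s[r] * x)

    # Estimate each coordinate as the median, over repetitions, of its signed bucket.
    est = np.median(s * sketch[np.arange(reps)[:, None], h], axis=0)

    # Keep the k largest estimates: an approximate sparse recovery of x.
    top = np.argsort(-np.abs(est))[:k]
    recovered = np.zeros(n)
    recovered[top] = est[top]
    print("recovery error:", np.linalg.norm(recovered - x))

Structured sparsity (tree- or block-sparse supports) and Fourier-sparse signals call for more specialized measurement and decoding schemes than this generic one.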

The second part of this research aims to achieve a better understanding of sublinear time sampling algorithms for very large data. Algorithms whose running time is sublinear in the size of the data were discovered. For example, approximation algorithms for the minimum vertex cover size (a classical combinatorial optimization problem on graphs) were found whose running time approaches the average degree of the graph, which is significantly smaller than the size of the description of the entire graph.
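To convey the flavor of sampling-based, sublinear-time estimation (a toy unbiased estimator, not the project's vertex-cover algorithm; the graph and sample size below are illustrative), the following sketch estimates a graph's average degree by probing a small random sample of vertices instead of reading the whole graph.

    import random

    def estimate_average_degree(degree_query, n, samples, rng=random):
        """Estimate the average degree of an n-vertex graph from `samples`
        degree queries. Unbiased; accurate when the degree distribution is
        not too skewed (heavy tails need more careful estimators)."""
        total = sum(degree_query(rng.randrange(n)) for _ in range(samples))
        return total / samples

    # Illustrative graph: adjacency stored as neighbor lists.
    adj = [[j for j in range(1000) if j != i and (i + j) % 7 == 0] for i in range(1000)]
    true_avg = sum(len(nbrs) for nbrs in adj) / len(adj)
    est = estimate_average_degree(lambda v: len(adj[v]), n=len(adj), samples=50)
    print(f"true average degree {true_avg:.1f}, sampled estimate {est:.1f}")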

A new model of local computation was introduced. This model allows one to compute locally just the part of the output that is needed, rather than the entire input/output relationship. For example, a resident being matched to a medical residency program need only compute which hospital the resident has been matched to, not the entire matching of all residents. Within this model, the first local algorithms were given for basic problems such as maximal independent set, data compression, sparse spanning graphs, and resource allocation.
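One standard way to realize such local queries, shown here as an illustrative sketch rather than the project's specific constructions, is to simulate a greedy algorithm under a random vertex ordering: a vertex belongs to the maximal independent set exactly when none of its lower-ranked neighbors does, so answering a single query explores only the relevant part of the graph.

    import random
    from functools import lru_cache

    def local_mis_oracle(adj, seed=0):
        """Return a query function in_mis(v) that answers whether vertex v lies in
        the maximal independent set chosen by greedy processing under a random
        vertex ordering, exploring only the neighborhood it needs."""
        rng = random.Random(seed)
        rank = {v: rng.random() for v in adj}      # random priority per vertex

        @lru_cache(maxsize=None)
        def in_mis(v):
            # v joins the MIS iff no lower-ranked neighbor joins it.
            return all(not in_mis(u) for u in adj[v] if rank[u] < rank[v])

        return in_mis

    # Tiny illustrative graph: a 5-cycle.
    adj = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
    in_mis = local_mis_oracle(adj)
    print({v: in_mis(v) for v in adj})  # a valid maximal independent set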

A large thrust of this project is in learning to understand data given as samples from a distribution over a very large domain. Such algorithms attempt to determine whether the distribution has certain structural properties; examples include being describable as a k-histogram distribution (commonly used in database applications), a monotone distribution, or a piecewise-polynomial distribution. Surprisingly, one can determine whether the distribution has such a property faster than one can learn the distribution. In this project, the first algorithms for learning and testing k-histogram distributions were given, and a framework for using structure to test classes of discrete distributions was introduced; the framework gives near-optimal sample complexity for testing many well-studied classes of distributions, including monotone, log-concave, t-modal, piecewise-polynomial, and Poisson Binomial distributions.
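For intuition about why testing can be cheaper than learning, a classical collision-based uniformity test (an illustrative sketch, not one of the project's algorithms; the domain size, sample size, and slack factor are made up) counts repeated samples: any distribution far from uniform collides with itself noticeably more often than the uniform distribution does, and this can be detected with far fewer samples than are needed to learn the distribution.

    import random
    from collections import Counter

    def collision_uniformity_test(samples, domain_size, slack=1.3):
        """Heuristic uniformity test: the collision probability equals 1/domain_size
        for the uniform distribution and is strictly larger for any other
        distribution. Accept if the empirical collision rate is within `slack`
        of the uniform rate."""
        m = len(samples)
        counts = Counter(samples)
        collisions = sum(c * (c - 1) // 2 for c in counts.values())
        rate = collisions / (m * (m - 1) // 2)
        return rate <= slack / domain_size

    n = 10_000
    uniform_samples = [random.randrange(n) for _ in range(2_000)]
    skewed_samples = [random.randrange(n // 50) for _ in range(2_000)]  # far from uniform

    print(collision_uniformity_test(uniform_samples, n))  # likely True
    print(collision_uniformity_test(skewed_samples, n))   # likely False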

A new approach to dealing with noisy sample data introduces "sampling correctors" to address corruptions in samples of an unknown distribution. Sampling correctors act as filters between the noisy data and the end user: they use structure that the distribution is purported to have in order to make "on-the-fly" corrections to samples drawn from probability distributions. Surprisingly, this work shows that there are settings in which sampling correctors are significantly faster than distribution learning algorithms.
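As a toy illustration of the filtering idea (not one of the project's correctors; the purported property, mixing rate, and corruption model below are invented for this example), if a distribution is purported to be close to uniform, replacing each incoming sample with a fresh uniform draw with small probability provably shrinks the stream's distance to uniform while keeping most of the original samples.

    import itertools
    import random

    def uniform_mixing_corrector(sample_stream, domain_size, mix=0.2, rng=random):
        """Toy sampling corrector for a distribution purported to be nearly uniform:
        with probability `mix`, replace the incoming sample by a fresh uniform draw.
        If the source is at total-variation distance d from uniform, the corrected
        stream is at distance (1 - mix) * d, at the price of discarding a `mix`
        fraction of the original samples."""
        for x in sample_stream:
            yield rng.randrange(domain_size) if rng.random() < mix else x

    # Corrupted source: element 0 is slightly over-represented (illustrative corruption).
    n = 100
    noisy = (0 if random.random() < 0.05 else random.randrange(n)
             for _ in itertools.count())
    corrected = uniform_mixing_corrector(noisy, n)
    print([next(corrected) for _ in range(10)])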

Several results are given w...
