NSF Org: CCF Division of Computing and Communication Foundations
Recipient:
Initial Amendment Date: February 25, 2011
Latest Amendment Date: February 25, 2011
Award Number: 1065125
Award Instrument: Standard Grant
Program Manager: Rahul Shah, CCF Division of Computing and Communication Foundations, Directorate for Computer & Information Science & Engineering (CSE)
Start Date: March 1, 2011
End Date: February 29, 2016 (Estimated)
Total Intended Award Amount: $1,160,930.00
Total Awarded Amount to Date: $1,160,930.00
Funds Obligated to Date:
History of Investigator:
Recipient Sponsored Research Office: 77 MASSACHUSETTS AVE, CAMBRIDGE, MA, US 02139-4301, (617) 253-1000
Sponsor Congressional District:
Primary Place of Performance: 77 MASSACHUSETTS AVE, CAMBRIDGE, MA, US 02139-4301
Primary Place of Performance Congressional District:
Unique Entity Identifier (UEI):
Parent UEI:
NSF Program(s): Algorithmic Foundations
Primary Program Source:
Program Reference Code(s):
Program Element Code(s):
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070
ABSTRACT
The rampant growth of massive data sets presents new challenges for data processing and analysis. To cope with this phenomenon, sublinear time and sublinear space algorithms capable of analyzing and extracting value from such immense inputs must be developed. This project aims to study sublinear time and space algorithms from a unified perspective, exploiting the synergies between the two settings to gain a deeper understanding that will lead to faster, more space-efficient, and more widely applicable algorithms.
The proposed research has two core components. The first component is the study of sparse representations of large data sets. Sparse representations are useful for quickly analyzing and processing data. This component has two parts. First, it will lead to a better understanding of sublinear time sampling algorithms for data drawn from a succinctly described distribution. The succinct descriptions considered include those defined by a small number of parameters, such as power laws, Gaussians, and histogram distributions. Second, it will improve the current state of knowledge of streaming algorithms, data sketching, compressive sensing, and sparse recovery techniques. Such techniques will, for example, have an impact on algorithms for acquiring and processing images, audio, and network data.
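To make the data-sketching idea concrete: a Count-Min sketch maintains approximate frequency counts over a stream in space far smaller than the input. The sketch below is a minimal illustrative implementation, not code from the award's publications, and the width/depth parameters are arbitrary choices.

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min sketch: approximate frequency counts of stream
    items in sublinear space. Each of `depth` rows hashes an item to one
    of `width` counters; the estimate is the minimum over the rows and
    never underestimates the true count."""

    def __init__(self, width=1024, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Row-specific pseudorandom hash into [0, width).
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Min over rows: collisions only inflate counters, so this is
        # always an overestimate of the true frequency.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))
```

The one-sided error (estimates never undercount) is what makes this sketch useful for heavy-hitter queries on massive streams.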
The second component aims to design novel statistical techniques for estimating distributional quantities of commonly used structured objects such as graphs. The focus in this component will be on sublinear time and space algorithms that estimate parameters of graphs.
This project will significantly advance the algorithmic foundations of computation with limited resources. It will develop highly efficient algorithms for analyzing massive data sets that arise in diverse areas including electronic commerce, health, network security, and scientific data collection.
The broader impacts of this project are in the education and mentoring of young researchers, including members of underrepresented groups. Community outreach activities will introduce primary school children to interesting ideas in mathematics and computer science.
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
The rampant growth of massive data sets presents new challenges for data processing and analysis. To cope with this phenomenon, sublinear time and sublinear space algorithms capable of analyzing and extracting value from such immense inputs must be developed. This project studies sublinear time and space algorithms from a unified perspective, exploiting the synergies between the two settings to gain a deeper understanding that leads to faster, more space-efficient, and more widely applicable algorithms.
The research has two core components. The first component is the study of sparse representations of large data sets. Sparse representations are useful for quickly analyzing and processing data.
Results in this component include: algorithms that exploit the sparsity of data to compress and represent large data sets; improved sparse recovery algorithms for inputs with structured sparsity, namely tree-sparse and block-sparse signals; optimal sparse Fourier transform algorithms; and the best known algorithms for sparse recovery of matrices satisfying the restricted isometry property.
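Sparse Fourier transform algorithms exploit the fact that a signal dominated by few frequencies can be analyzed from few samples. As a hedged illustration (a textbook special case, not the award's algorithm), the exactly 1-sparse case can be solved from just two time samples, because the ratio of consecutive samples encodes the frequency:

```python
import cmath

def recover_one_sparse(sample, n):
    """Recover (frequency, coefficient) of an exactly 1-sparse signal
    x[t] = a * exp(2*pi*i*f*t/n) from only two time samples.
    The ratio x[1]/x[0] equals exp(2*pi*i*f/n), so its phase reveals f."""
    x0, x1 = sample(0), sample(1)
    f = round(cmath.phase(x1 / x0) * n / (2 * cmath.pi)) % n
    return f, x0  # x[0] = a * exp(0) = a
```

Full sparse FFT algorithms reduce the k-sparse case to roughly this 1-sparse case by hashing frequencies into buckets with suitable filters; that machinery is omitted here.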
The second part of this research is to achieve a better understanding of sublinear time sampling algorithms for very large data sets. Algorithms that run in time sublinear in the size of the data were discovered. For example, approximation algorithms for the minimum vertex cover size (a classical combinatorial optimization problem on graphs) were found that run in time close to the average degree of the graph, which is significantly smaller than the description of the entire graph.
A new model of local computation was introduced. This model allows one to compute locally just the part of the output that one needs, rather than the whole input/output relationship. For example, a resident being matched to a medical residency program need only compute which hospital the resident has been matched to, not the entire matching of all residents. Within this model, the first local algorithms for basic problems such as maximal independent set, data compression, sparse spanning graphs, and resource allocation were given.
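One standard way to see how local computation of this kind can work, shown here as a generic textbook construction rather than the project's own algorithm, is to simulate greedy maximal independent set under a shared random vertex ranking: a vertex joins the set iff none of its lower-ranked neighbors does, so answering one membership query explores only a local neighborhood of the queried vertex.

```python
import random

def local_mis_member(v, neighbors, seed=0):
    """Local simulation of greedy MIS under a random vertex ranking.
    A vertex is in the independent set iff no lower-ranked neighbor is.
    Only v's neighborhood is explored; the worst-case recursion can
    still be large, and genuine local algorithms bound this exploration.
    `neighbors(u)` returns the adjacency list of u."""

    def rank(u):
        # Seeded pseudorandom rank, consistent across separate queries.
        return random.Random(f"{seed}:{u}").random()

    def in_mis(u):
        return all(not in_mis(w) for w in neighbors(u) if rank(w) < rank(u))

    return in_mis(v)
```

Independence and maximality hold for any tie-free ranking; the seeded hash is what makes answers to separate queries mutually consistent, which is the essence of the local computation model described above.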
A large thrust of this project is in learning to understand data given as samples from a distribution over a very large domain. Such algorithms attempt to determine whether a distribution has certain structural properties; examples include being describable as a k-histogram distribution (commonly used in database applications), a monotone distribution, or a piecewise-polynomial distribution. Surprisingly, one can often determine whether a distribution has such a property faster than one can learn the distribution. In this project, the first algorithms for learning and testing k-histogram distributions were given, and a framework for using structure to test classes of discrete distributions was introduced that yields near-optimal sample complexity for testing many well-studied classes of distributions, including monotone, log-concave, t-modal, piecewise-polynomial, and Poisson binomial distributions.
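The simplest instance of this flavor of distribution testing is the classical collision-based uniformity tester, which needs only about the square root of the domain size in samples. The sketch below is a generic illustration with an arbitrarily chosen threshold, not one of the award's algorithms.

```python
from itertools import combinations

def collision_uniformity_test(samples, domain_size, eps=0.25):
    """Accept iff the empirical collision rate is close to 1/domain_size.
    The uniform distribution minimizes collision probability at 1/n;
    distributions far from uniform collide noticeably more often.
    The (1 + eps**2) slack factor is an illustrative choice."""
    m = len(samples)
    collisions = sum(1 for a, b in combinations(samples, 2) if a == b)
    rate = collisions / (m * (m - 1) / 2)
    return rate <= (1 + eps ** 2) / domain_size
```

The key point, echoing the paragraph above, is that this decision uses far fewer samples than would be needed to learn the distribution itself.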
A new approach to dealing with noisy sample data introduces "sampling correctors" to address corruptions in samples of an unknown distribution. Sampling correctors, acting as filters between the noisy data and the end user, use structure that the distribution is purported to have in order to make "on-the-fly" corrections to samples drawn from probability distributions. Surprisingly, this work shows that there are settings in which sampling correctors are significantly faster than distribution learning algorithms.
Several results are given w...