Paper ID: 143 Title: More data speeds up training time in learning halfspaces over sparse vectors
Reviews

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
On page 2, you say that it is impossible to use standard reductions for
proving NP-hardness. I'd say that this issue is a bit more subtle: it
seems that some classes of reductions will not allow basing hardness of
learning on RP \neq NP. (Cite the paper by Applebaum, Barak and Xiao.)

You make the point that you are not basing your results on cryptographic assumptions. However, it is not clear (at least to me)
that the complexity-theoretic assumption you are using is necessarily
weaker than assuming the existence of one-way functions. Possibly, there
is some evidence that the assumption that there are average-case hard
functions in NP is "weaker" than the assumption that one-way functions
exist; there are some results by Noam Livne indicating such a possibility.
You should discuss the relative hardness of the assumptions you are using
vs. those that others have used. For example, is it obvious that your
result holds under a strong-enough cryptographic assumption, such
as the hardness of factoring? I suspect it should be possible.

For the upper bounds: I think that, using your techniques, it should be
possible to show that H_{n, k+1} can be learned using \tilde{O}(n^k/\epsilon^2)
samples. Is that right? Also, do you think it is possible to push the
lower bound (using a different assumption?) to show that, as k gets large,
the gap between the information-theoretic sample complexity and the sample
complexity of efficient learning increases?

---------------
1. Line 130: The proof **of** item 1.
2. On line 343, you mean y_k = b, right?

--

I think adding a discussion about your complexity assumption (with respect to other such assumptions) will enhance the value of your paper, and will be particularly appealing to a theory audience. In particular, your result shows that "hardness conjectures" in learning appear to be weaker than other conjectures that people have been willing to make in other areas.

I agree that showing a simple concept class for which a separation exists is interesting. In that sense, I really like your result. However, it may help to state explicitly in the writing that you do not necessarily claim your assumption to be weaker than the existence of one-way functions (as the current phrasing may implicitly suggest this).

Q2: Please summarize your review in 1-2 sentences
The paper presents a result showing that, under a complexity-theoretic
assumption, no polynomial-time algorithm can learn a specific concept
class (halfspaces) under a class of distributions (low-Hamming-weight
boolean vectors) with sample complexity that is sufficient for
information-theoretic learning. However, if a much larger sample is
provided, polynomial-time learning is possible.

The paper follows a long line of such works establishing separation
between information-theoretic sample complexity and computational
complexity, starting with Decatur, Goldreich, and Ron.

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper provides one of the most natural examples of a learning problem that becomes computationally tractable when given a sufficient amount of data, but is computationally intractable (though still information-theoretically tractable) when given a smaller quantity of data. This computational intractability is based on a complexity-theoretic assumption about the hardness of distinguishing satisfiable 3SAT formulas from random ones at a given clause density (more specifically, the 3MAJ variant of the conjecture).

The specific problem considered by the authors is learning halfspaces over 3-sparse vectors. The authors complement their negative results with nearly matching positive results (if one believes a significantly stronger complexity-theoretic conjecture -- that hardness persists even for random formulae whose density is n^mu over the satisfiability threshold). Sadly, the algorithmic results are described in the Appendix and are not discussed. They seem to be essentially modifications of Hazan et al.'s 2012 algorithm, though it would be greatly appreciated if the authors included a high-level discussion. Even if no formal proofs of correctness will fit in the body, a description of the algorithm would be helpful.

Overall I like the paper, but suggest that the authors significantly change the presentation. The proof of 3.1 is much clearer than the high-level intuition of the proof given on page 6 -- and that section can probably be significantly shortened (at no expense of clarity), allowing for some discussion of the algorithmic results of Section 4.

typos/suggestions:
Perhaps rephrase 'refuting random 3**' as 'distinguishing high-value formulae from random' (???)
p. 6: 'it would have necessarily return'
'with proper efficient algorithm'
p. 3: 'proof item 1'
appendix A: 'B(n)-decomposable' -> 'B(n)-decomposable matrices'
Q2: Please summarize your review in 1-2 sentences
I really like this paper, and believe this result might turn into one of the more classic examples of 'some problems have a different threshold for information theoretic vs computational tractability'.

Submitted by Assigned_Reviewer_7

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Summary of paper :

The paper considers the problem of learning half-spaces over 3-sparse inputs in {+1,-1,0}^n. While the information-theoretic sample complexity for this problem is of order n/eps^2, assuming the hardness of refuting random 3CNF formulas, it is shown that one cannot efficiently learn using only order n/eps^2 samples. In fact, under a stronger version of the same hardness assumption, it is shown that it is not possible to efficiently learn using order n^{1+mu}/eps^2 samples for appropriate mu in [0,0.5).
The hardness result is shown to hold for improper learning.
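To make the setup concrete, here is a minimal illustrative sketch of the concept class being discussed: 3-sparse examples in {+1,-1,0}^n labeled by a halfspace. The sampling distribution and the random weight vector below are purely illustrative, not the hard instances constructed in the paper.

```python
import random

n = 20  # ambient dimension (illustrative choice)

def sample_3sparse(n):
    """Draw an illustrative 3-sparse example in {+1,-1,0}^n:
    exactly 3 coordinates are set to +1 or -1, the rest are 0."""
    x = [0] * n
    for i in random.sample(range(n), 3):
        x[i] = random.choice([+1, -1])
    return x

def halfspace_label(w, x):
    """Label by the halfspace sign(<w, x>), breaking ties toward +1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

# A random target halfspace and one labeled example.
w = [random.uniform(-1, 1) for _ in range(n)]
x = sample_3sparse(n)
y = halfspace_label(w, x)
```

Note how each example has Hamming weight at most 3, so the label depends on only 3 coordinates of w, which is the source of the small information-theoretic sample complexity discussed above.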

On the other hand, it is shown that one can efficiently learn these half-spaces with a sample of size n^2/eps^2. Thus a true gap between the statistical and computational complexity of learning is shown. It is also shown that the gap between the sample complexity of information-theoretic learning and the sample complexity of efficient learning is not present when learning over 2-sparse vectors.

What can be said about extending the positive result of efficiently learning H_{n,3} with n^2/eps^2 samples to learning H_{n,k}? The result seems to rely mainly on using the Hazan et al. algorithm to learn H_{n,2} efficiently via Lemma A.4. Is it possible to extend something like Lemma A.4 to deal with H_{n,k}? What is the dependence on k in such a case?

Q2: Please summarize your review in 1-2 sentences
Overall the paper is well-written and a nice read. While the reduction/proof is fairly straightforward, it is quite insightful. Providing hardness results for improper learning has been notoriously hard, as the usual NP-hardness-type assumptions can't yield results for improper learning problems. This paper provides a new hardness result for improper learning without using the usual cryptographic hardness assumptions and reductions to lattice problems. The simplicity of the hardness assumption and reduction makes me believe that the result could be useful in other learning scenarios too. I believe the work is definitely worth publishing.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
Reviewer 4 seems to miss the two main points of the paper:
1) Relation to previous computational-sample tradeoff results: Reviewer 4 criticizes the result as just another computational-sample tradeoff in a long line of such works. However, all previous results are about highly synthetic hypothesis classes, a fact the authors of those previous papers themselves emphasized as a weakness of the approach. Here we consider halfspaces, one of the workhorse hypothesis classes in machine learning today.
2) Cryptographic vs. average-case complexity: Reviewer 4 complains that it is not clear whether our hardness assumptions are weaker than cryptographic ones. We did not intend to argue that the average-case assumption is more plausible than cryptographic assumptions. The reason we emphasized the difference between the two techniques is that it is very difficult to derive results for natural hypothesis classes using the former, while we show how to derive results for the natural hypothesis class of halfspaces using the latter.

Regarding Reviewer 4's concern about the possibility of using standard reductions from NP-hard problems in the context of improper learning: we do not claim that this is impossible; we merely note that nobody has yet succeeded in proving hardness of improper learning based on NP-hardness. You're right that our phrasing is confusing, and we will fix it.

To Reviewers 5 and 7:
Thanks for the kind review.
Thanks also for the suggestion to include the algorithm for the upper bound in the body of the text. Indeed, the upper bound uses the Hazan et al. method as a black box, but it requires additional new ideas.