Paper ID: 1320
Title: An Approximate, Efficient Solver for LP Rounding
Reviews

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Summary: The paper proposes a new efficient method to obtain approximate solutions to LP relaxations of integer linear programs. It argues that since the fractional solutions of the relaxation would need to be rounded in order to obtain the final solution, we could potentially get similar accuracy by rounding an approximate solution instead of the optimal one. To obtain an approximate solution, it solves a related "regularized quadratic penalty" formulation of the LP relaxation using an efficient stochastic coordinate descent algorithm. Experiments performed on tasks such as vertex cover, maximum independent set and multiway cut show that rounding the approximate solution provides similar accuracy and a decent speed-up compared to rounding the optimal solution.

Quality: Interesting paper. The algorithm appears technically correct but I didn't check the proofs thoroughly. The experiments are designed to test the main hypothesis that it is sufficient to compute an efficient, approximate solution of the relaxation.

Clarity: The paper is mostly clear. There are a large number of minor typos (some examples below). Perhaps the paper was not proof-read before submission. These can be easily corrected in subsequent versions.
- line 47: replace "specific form type" by "specific form" or "specific type"
- line 51: replace "approaches tasks" by "approaches for tasks"
- line 69: replace "to to" by "to"
- line 96: replace "Consider a graph G(V,E) denote a graph" by "Let G(V,E) denote a graph"

Originality: To the best of my knowledge, the proposed approach is new. The hypothesis that approximate solutions are sufficient for accurate rounding also appears to be new to this paper, at least within the ML community.

Experiments: The simple baseline of increasing the tolerance \epsilon for the Cplex-LP solver is not tested. Instead, the same value of \epsilon is used for both the Cplex-LP solver and the QP approximation. Is there any empirical evidence that the proposed method would be significantly faster than this simple baseline?

It would also be helpful to see how the method compares to specialized LP solvers for specific problems, for example, the LP solvers for MAP estimation.
Q2: Please summarize your review in 1-2 sentences
Interesting paper. I'm leaning towards accept. I hope the authors can provide a comparison with the simple baseline during the rebuttal phase.

Post-rebuttal comments: The rebuttal has answered my only (minor) concern regarding the choice of the tolerance values. If there is sufficient space for it, please include the experiments in the final version of the paper, or at least in a related technical report.

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Summary:
The paper considers LP relaxations in machine learning and studies the problem
of solving the LP relaxation only approximately and derives bounds on the
overall approximation rate when solving the integer program (by rounding the LP
solution).

Pros:
- very interesting problem (solve LP relaxation, knowing that in the end we will
round the solution). This is a relevant problem in many machine learning
fields.
- to the best of my knowledge the first work that gets approximation guarantees
for the overall algorithm (LP approximation by smoothing, only solve smoothed
LP to a some threshold, integer rounding).

Cons:
- I would skip the SCD algorithm in the paper and rather try to give more
intuition and proof sketches in the main part of the paper. I do not think a
reader can appreciate section 3.3 too much. It looks too me that the running
time is not very exciting, both n and 1/epsilon^2 appear (SGD does for example
not have the dependence on n in SVMs), why not solve the problem with a more
standard solver? It seems to me SCD is not the main contribution of the paper.
- the speed improvements seen in the experiments are not hugely impressive
(roughly one order of magnitude). Also there is probably some precision option
in CPLEX, what happens if you lower the precision in CPLEX LP?
- few language problems, nothing major.
- long appendix

Quality:
Experiments could be improved, some questions about the main theorems remain.

Clarity:
Language problems and suboptimal choices in the presentation (e.g. get rid of
3.3 and include more material from the appendix).

Originality:
To the best of my knowledge the main section (3.2) is original.

Significance:
LP relaxations are important tools for combinatorial optimization in machine
learning.

Comments:
39: Despite ?
40: .. Gurobi cannot scale .., the order 10 of speedup of your method is
probably not going to change this.
43: -- but often much more efficiently. Rewrite.
52: Cplex do no scale to
52: The Cplex LP solver was slower but it ran through in the time limit on
almost all problems.
55: Obviously message passing is related in that it also solves LP relaxations, but I fail to
see the connection to your method. The smoothing?
69: in which to articulate
74: would appreciate some citations
98: is a subset of V each edge (rephrase!)
102: vertex cover as a
104: for each vertex v \forall v \in V, rephrase!
130: one can easily
132: and an example such scheme (rephrase)
144: Figure caption needs more details about the parameters.
(3): min_x
(4): max_u
167: is the definition standard? if so, cite
177: \prod_[0,1]^n was not really introduced
176: to the linear program in
193: how is d defined, it is a concatenation of vectors and a matrix, what's the
norm on d?
191: the paragraph here seems to apply to general LPs, it is a bit confusing, as
it's introduced after (5)
236: \beta depends on optimal solution, how would we estimate this?
280: algorithm 2 is described (rephrase)
304: what does thetis stand for?
310: have rarely been used? for
315: Boolean
317: the resulting graphs
320: Gibbs sampling is not really state of the art. How exactly, annealing?
321: why didn't you also use an LP MAP solver such as the one in [14]?
331: have placed if (rephrase)
Figure 2: what about energies (apart of P, R, F1)?
357: in this paragraph you mention the appendix twice
360: the second family of
369: this experiment is to compare
418: Cplex-IP on large scale
960: describe difference between Table 3 + 4 in more detail
Q2: Please summarize your review in 1-2 sentences
Significant work, but I feel the writeup is lacking in many parts and the
experiments could be improved.

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Paper summary.
The authors propose an efficient scheme to solve LP relaxations of
combinatorial optimization problems. Their contribution is a novel scheme and
analysis that takes into account the original goal of constructing an integral
feasible solution from the relaxed solution. They prove that an approximate
solution is sufficient to construct an integral alpha-approximate solution to
the vertex cover problem. They also prove a convergence result for an
algorithm solving a suitably constructed QP approximation to a general
standard form LP problem. The proposed method is evaluated experimentally on
a number of combinatorial optimization problems and shown to be competitive
with Cplex, a state-of-the-art LP solver.


Review.
The paper is well written and the technical contributions are clear and well
motivated. I like the general direction of tailoring the solver of the
relaxation for the final goal of constructing good primal feasible solutions
and I believe the authors make valuable contributions in this direction.

The technical results make sense but I did not check carefully the
derivations. The dependence on the well-conditioning of the linear program
(\delta_P and \delta_D) seems useful for deriving the results but it is not
clear how these numbers and the proposed method would behave for more general
problems, say MAP-MRF LOCAL relaxations; in particular we know that modern LP
solvers such as Cplex and Gurobi are exceptionally robust to ill-conditioning.

The experiments are a mixed bag. It seems (Table 1) that the Cplex-LP results
are not comparable to the results by the proposed method: the speed up (as
measured by S) is provided by the proposed method but the solutions are very
much inferior (as measured by Q). As such it is not clear to me that one is
better than the other. The particular results in Table 1 (and Table 2) are
not referenced explicitly from the main text, but at least from Table 1 I
would disagree about the proposed method being comparable to Cplex (which is a
big ask, I know).

The claim that Cplex and Gurobi are not competitive for solving linear
programs arising from machine learning applications could be toned town a
little; see the recent study (Kappes et al., "A Comparative Study of Modern
Inference Techniques for Discrete Energy Minimization Problems", CVPR 2013),
which has shown that Cplex is often providing exceptional performance for
problems with up to a hundred thousand variables (a size comparable to many of
the problems solved in the submitted paper).

Overall I liked the paper and believe the contribution to be an important one.


Minor details.
Table 1 caption: please say "lower values of Q are better" and "higher values
of Q are better", as the previous sentence mentions both S and Q.
Reference [15], "Mathematical".
Q2: Please summarize your review in 1-2 sentences
A well written paper with an interesting direction and solid technical
contributions.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for their thorough reports and stimulating comments.

Reviewers 4 and 5 raised concerns about the many typographical and presentation errors. We are grateful that they persevered with their reviews despite these errors, which are being fixed as we revise the manuscript.

TO REVIEWER 4: Our experiments use comparable (fairly loose) tolerances for both Cplex-LP and Thetis solvers. Cplex is relatively insensitive to tolerance, while Thetis is faster at finding less accurate solutions. For both methods, the quality of integral solutions are not affected by the choice of tolerance. These observations are confirmed by the following table, which reports on the frb59-26-1 instance for the vertex cover problem for three values of epsilon: 10^{-1,-3,-5}.

Table: Quality and computational time of fractional and integral solutions for 3 values of epsilon for vertex cover on frb59-26-1.
Thetis Cplex-LP
| 1e-1 1e-3 1e-5 | 1e-1 1e-3 1e-5 |
Fractional Solution | 768.2 767.2 767 | 767 767 767 |
Integral Solution | 1534 1533 1533 | 1534 1534 1534 |
Solve time (s) | 0.65 6.89 14.33 | 5.06 4.48 4.47 |

Specialized LP solvers for MAP estimation: We have not yet had time to implement the algorithm from [14], but we repeated the experiment in Figure 2 using Cplex-IP. The following table shows quality metrics of the optimal IP solution of MAP estimation. We refer to Figure 2 of the paper for details of notation.
Cplex-IP
Dataset | P R F1 |
CoNLL | .87 .91 .89 |
TAC-KBP | .80 .80 .80 |

The solutions obtained using Thetis are of similar quality.

TO REVIEWER 5: We believe that the SCD method for QP is vital to our approach, as it allows low-accuracy solutions of the QP to be obtained
quickly. By contrast, the runtime of interior-point QP methods is not very sensitive to convergence tolerance. Moreover, an efficient
multicore implementation of SCD is easy, whereas parallelization of simplex and interior-point methods is much harder.

The analysis in Sections 3.2 and 3.3 allows us to close the loop: we are able to derive a worst-case complexity bound for the Thetis
approach.

Comparisons with stochastic gradient (SG): To obtain the desired accuracy in constraints and function value of the LP - captured by epsilon - we need to solve the QP to within a different, tighter tolerance, quantified by bar{epsilon}. Both tolerances are related to the choice of penalty parameter beta in the QP. Ignoring here the dependence on dimensions, we note the relationships beta ~ epsilon^{-1} (Theorem 5) and bar{epsilon} ~ beta^{-3} ~ epsilon^3. Expressing all quantities in terms of epsilon, and using Theorem 6 and subsequent discussion, we see an iteration complexity of epsilon^{-2} for SCD (ignoring log terms). (The linear convergence rate of SCD is instrumental to this favorable value.) By contrast, standard variants of SGD applied to the QP yield poorer complexity. For diminishing-step or constant-step variants of SG, we see complexity of epsilon^{-7}, while for robust SG, we see epsilon^{-10}. (Besides the inverse dependence on bar{epsilon} or its square, there is a contribution of order epsilon^{-2} from the conditioning of the QP. Further details will appear in the revision; )

As to dimension dependence: we note that beta depends on dimension (see e.g. lines 234-236) so that n and m enter into the SG complexity, as they do into SCD. We will investigate dimension dependence further.

We agree with the reviewer that high-quality QP solvers are available, but we believe that SCD has an advantage in obtaining low-accuracy solutions and in multicore implementation. A simple experiment compares Cplex's QP solver with SCD on equation 5 of our paper, arising from a vertex cover LP, setting beta=0.1 and solving to an accuracy of 10^{-3}. The resulting QP may be harder than the original LP. On these tests, SCD is two orders of magnitude faster.
Time (s)
Dataset | Cplex-QP | SCD |
Amazon | 1435.38 | 8.42 |
DBLP | 560.30 | 5.85 |
Google+ | - | 10.43 |
('-' indicates timeout after 1 hr).

Thetis vs Cplex-LP run time: On some instances we outperform Cplex by more than an order of magnitude. More importantly, we can obtain
results in cases where Cplex is not able to execute at all.

Prompted by the referee, we clarify our experimental methodology. Our instances to fall in to three categories:
(1) All methods could run;
(2) Cplex-LP could run, but not Cplex-IP;
(3) Only Thetis works.
In addition, we selected standard benchmark instances. We will study further the way in which 'real-world' problems are distributed among these categories.

The reviewer raises the possibility of speeding the computation by loosening the Cplex LP tolerance in solving the LP approximation. In resonse, we point to Tables 3 and 4 of the appendix, which repeats the experiment in Table 2 of the main manuscript by setting eps=0.1 and the default tolerance of Cplex-LP

TO REVIEWER 6: We agree that the paper should tone down its claim that Cplex and Gurobi are unsuited to machine learning applications, and we are grateful for the reference provided. We will make the more limited claim that we identify a class of problems where using approximate LP solutions for LP-rounding is a competitive approach.

In Table 1, Q measures the quality of the solution obtained using our approach as a ratio of the solution obtained by Cplex-IP (a proxy for the true optimal solution). Speedup S measures the ratio of the time taken by Cplex-LP and our approach to obtain a fractional solution. Our approximate LP solver generates integral solutions of comparable quality with Cplex-LP. This is our final quality goal, even if the intermediate solutions are of lower quality.