
Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper describes some bounds for the error made in the estimation of the posterior mean and variance of a target distribution when using expectation propagation. The bounds obtained compare well with the corresponding bounds of the Laplace approximation, which motivates the use of EP over such a method. A practical limitation of the paper is that it does not contain experiments assessing the results obtained.
Q2: Please summarize your review in 12 sentences
Interesting theoretical results about expectation propagation. The lack of experiments questions the quality of the paper.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
1. Given that the main goal is to estimate the mean, the mode approximation seems to be not the right comparison. I would suggest to include some comparison with other methods that directly estimate mean, including Monte Carlo (which should be O(1/(n\sqrt{m})), where $m$ is the sample size, here the $1/n$ factor is due to the concentration of p(x) itself), and quadrature based methods (O(1/(nm))), which should match with EP with m = n.
See, e.g., Huszar & Duvenaud Optimally weighted herding is Bayesian quadrature.
It is also worth to mention the asymptotic analysis of the other variational inference methods. For example, see Wang, Bo, and D. M. Titterington. "Convergence and asymptotic normality of variational Bayesian approximations for exponential family models with missing values." Proceedings of the 20th conference on Uncertainty in artificial intelligence. AUAI Press, 2004.
2. All the $f_i(x)$ are assumed to be $\beta_m$ logconcave. How would the result change when $\beta_m$ are different for different $f_i(x)$?
typos:
line 249: "mean x^*_{LC}" appendix line 912: "\beta_i is obtained the difference" appendix line 914: "to to"
Q2: Please summarize your review in 12 sentences
This paper establishes theoretical analysis of EP
for log concave distributions (with slow changing) under the asymptotic setting when the number n of "sites" (corresponding data points) increases to infinite. The main proof technique is based on an extension of BrascampLieb inequality which is of independent interest.
This work can have a potential significant importance given that theoretical properties of EP have not been well discussed before.
Submitted by Assigned_Reviewer_3
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper derives inference bounds for the ExpectationPropagation (EP) algorithm in variational inference. The results are developed for logconcave distributions with slowlychanging logs, relating the finitesample bounds to these obtained via Bernsteinvon Mises limits for the Canonical Gaussian Approximation (CGA). KullbackLeibler divergence costs of EP versus CGA are also discussed.
EP is a popular method that works well in practice but comes with few theoretical guarantees, so these developments are of interest.
There are of course several directions in which these results could be improved, which the authors already discuss. In practice, the current results are still restrictive, but they are certainly a step in the right direction.
My main discussion point is that EP is often more powerful for prediction than inference, but the paper only discusses inference approximation bounds. I expect that, even in cases when the sites are not logconcave, EP could perform badly in inference but reasonably in prediction. In that sense, the current bounds are not as interesting as they could be.
Some minor points:
* the notation $\phi_p$ to denote $\sum \phi_i$ is a bit confusing. * many of the formulae don't have punctuation marks at the end.
* The authors could cite "Expectation propagation as a way of life" by Andrew Gelman, Aki Vehtari, Pasi Jylaenki, Christian Robert, Nicolas Chopin, John P. Cunningham. * The references should be cleaned up.
Q2: Please summarize your review in 12 sentences
The paper is nicely written and it provides some theoretical guarantees for ExpectationPropagation which were previously unavailable. EP is widely used so these results should be of interest.
Submitted by Assigned_Reviewer_4
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper presents a novel approach to the analysis of expectation propagation (EP), proving several new results about its accuracy.
The paper is wellwritten and the methods provide a good foundation for future work in this area.
The main limitation, admitted by the authors, is that the theorems have overly restrictive conditions that rarely hold in practice.
So the paper's results do not directly apply to common uses of EP.
I was able to follow the derivations fairly easily except for equation (14).
In (14), beta_(i) and r_(i) are used without definition.
To make (14) more intuitive, it would help to add one sentence mentioning that hybrids are n*beta_m strongly logconcave.
The discussion of fixed points in section 1.4 is a bit odd since there are known cases where EP does not have fixed points, and sufficient conditions for a fixed point to exist were already given by Minka (UAI 2001).
These conditions are that factors must be bounded and the moments must exist.
If the factors are strongly logconcave, as in this paper, then the conditions are satisfied so a fixed point exists.
Obviously, if moments don't exist (e.g. a Cauchy distribution) then EP cannot have a fixed point.
Q2: Please summarize your review in 12 sentences
Novel theory about expectation propagation that is not quite strong enough to apply to practice, but a good starting point.
Q1:Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 5000 characters. Note
however, that reviewers and area chairs are busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
Response to reviewers
Dear
reviewers,
Thank you all very much for your time and your comments
on how to improve our presentation of our results.
A common theme
from your comments is a concern on whether our results and our methods are
of general interest.
We agree that the assumptions we use are
restrictive. However, the proof technique we introduce is flexible and
will extend to a more general setup (at the cost of heavier notation and
longer calculations). The easiest extension is to suppose only that hybrid
distributions are strongly logconcave, which is much less stringent than
making assumptions on likelihood sites, and covers many latent Gaussian
models, like logistic or probit regression. We have already managed to
extend Theorem 1 to this case, and we therefore believe our proofs could
serve as blueprints for more general results. Note that our results are
nonprobabilistic and make no assumptions on the datagenerating process
Most results of good behavior of statistical algorithms are of the form
"probability of bad behavior goes to 0", whereas we give results showing
that it's impossible for EP not to work under our assumptions, under
purely functional assumptions on the sites If one were willing to make
assumptions on the datagenerating process, one could also be less
stringent on what the sites should be like, giving a second avenue for
extending our results. Finally, note that, as pointed out by referee 4, EP
has proven time and time again to be extremely resilient to theoretical
analysis in the 15 years since it has been proposed. Our results, while
certainly restrictive and limited, represent a sizable step forward from
the current state of the art.
Many of you also feel that the
absence of simulations is a weakness of the paper. We chose to focus on
theoretical results because the literature already contains many
simulations studies comparing EP to other methods. Since the practical
effectiveness of EP is not in doubt, we tried to explain as much of our
results as the space constraints would allow.
Thank you again for
your time in reviewing this work.
Best regards ,
The
authors
Responses to specific comments
Reviewer 1: The CGA
was meant as a baseline approximate inference method. Comparaison against
other methods is interesting, but we feel like a more honest comparaison
would also include computational cost of each method, and goes way beyond
the scope of our work. It would also overlap with work such as Nickisch
and Rasmussen, 2008. Having multiple beta_m surprisingly changes nothing.
Indeed, in that case all hybrids and the priors are still (\sum
beta_m)strongly logconcave.
Reviewer 2: Using our results to
study prediction performance is a great idea. It seems to us that Th. 1
could be used to bound prediction error, relative to perfect Bayesian
inference, in a classification setting where predictions are made based on
posterior means and variances (using smoothness in the prediction
function). . Also, while we show that EP performs well on logconcave
sites, we believe that it could also work well on non logconcave sites,
both for prediction and classification.
Reviewer 3: Thank you very
much for correcting our mistake about the existence of fixedpoints. We'll
be sure to modify this point in future presentation.
Reviewer 6:
Our results are indeed specific to Gaussian EP. However, Gaussian EP does
represent an overwhelming majority of the practical uses of EP. It thus
seems to us like this isn't very damaging for the generality of our
results. It seems very hard to extend our current methods to nonGaussian
approximating families. 
