Paper ID: 749
Title: EDML for Learning Parameters in Directed and Undirected Graphical Models
Reviews

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors claim to present a simplified perspective on EDML that:
1. casts it as a general approach to continuous optimization;
2. makes immediate some results that were previously obtained for EDML only through more effort;
3. facilitates the design of new EDML algorithms for new classes of models.

First, EDML for Bayesian networks is reviewed as a method for optimizing the likelihood of the observed data, where each update step for the next parameters requires inference on the Bayesian network using the current parameters. The same analysis is then applied to undirected models (MRFs). As a last step, EDML is claimed to converge in one iteration for MRFs under complete data. This last setting is evaluated experimentally against conjugate gradient (CG) and L-BFGS on benchmark problems, including handwritten digits modeled with 2D-grid MRFs and 13 other networks amenable to jointree inference. EDML outperforms both CG and L-BFGS in running time until convergence to a given accuracy, even though its number of iterations may at times be larger.

Comments:
- The title is very similar to that of the UAI-10 paper and does not emphasize the main point claimed for this particular work, which is "observing EDML as a general approach for continuous optimization under complete data."
- Why would the overview of EDML in the UAI paper not fall under the definition of "continuous optimization"? I am not sure I see the difference in "perspective".
- To my understanding, the focus of previous work was application to Bayesian networks and comparison to EM, while here the focus is on undirected MRFs and comparison to CG and L-BFGS. The experiments indeed show the benefit of using this method over the other optimization techniques.
Q2: Please summarize your review in 1-2 sentences
The main contribution is providing a useful technique for parameter training in MRFs with complete data, by "transferring" EDML from Bayesian (directed) to undirected networks.
It wasn't clear to me why this is claimed to be a different view of the method.

Submitted by Assigned_Reviewer_7

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper provides a simplified perspective on EDML and illustrates the benefits of such a simplified view. First, existing results about EDML become much easier to prove. Second, and most importantly, this new formulation provides a systematic procedure for deriving new instances of EDML for other models. This is illustrated for Markov networks. The derived EDML is shown to be competitive with existing learning approaches.

Parameter estimation of graphical models is important. The paper, which is extremely well written, presents a very elegant view of this task. I also like the simplified view on EDML. It really helps in understanding its pros and cons. It also shows that EDML essentially implements a coordinate descent for parameter estimation. Indeed, such approaches have been proposed for graphical models --- for instance, iterative proportional fitting can be viewed as such a method --- however, I expect it to be too slow to scale well. Still, there are some recent advances for MaxEnt models,

Fang-Lan Huang, Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin:
Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models.
Journal of Machine Learning Research 11: 815-848 (2010)

which also nicely compares iterative scaling and coordinate descent. Moreover, it shows that there is not just one IPS approach in general but several; it just depends on the auxiliary function you are using. This should really be clarified.

In my opinion, the big pro of the present paper is that it shows that parameter estimation within graphical models requires iteratively solving rather simple convex optimization sub-problems. In many cases, each sub-problem is of simple form, too, and involves inferences made from the data at hand. I greatly value this clear view of what parameter estimation is.

However, there are also other gradient-based optimization techniques that avoid the line search and have an interesting connection to EM, such as

E. Bauer, D. Koller, Y. Singer:
Update Rules for Parameter Estimation in Bayesian Networks.
UAI 1997: 3-13

L.E. Ortiz, L. Pack Kaelbling:
Accelerating EM: An Empirical Study.
UAI 1999: 512-521

J. Fischer, K. Kersting:
Scaled CGEM: A Fast Accelerated EM.
ECML 2003: 133-144

M. Jamshidian and R. I. Jennrich:
Acceleration of the EM Algorithm by Using Quasi-Newton Methods.
Journal of the Royal Statistical Society B, 59(3):569–587, 1997.

I would expect these approaches to work almost as well but to run considerably faster than CG and L-BFGS.

Still, the transfer to the Markov network learning setting is interesting.

Q2: Please summarize your review in 1-2 sentences
To summarize,

+ Novel perspective on EDML that provides additional insight into it.
+ Derives an EDML approach for parameter learning of Markov networks

- The connection to CD and IPS approaches should be improved and also illustrated empirically.
- Many accelerated parameter estimation methods are not discussed. For instance, there are accelerated EM approaches that also make use of advanced gradient methods, such as scaled conjugate gradient, which avoids the line search by scaling an (approximate) Hessian.

Although I still think that the novel view is highly interesting, the discussion also revealed that some issues (see the other reviews) should be addressed. Still, since I think the simplified view is interesting, I lean towards accept.

Submitted by Assigned_Reviewer_8

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Summary:
Overall, I think the paper is potentially useful and a contribution to the community, since it puts forth a simpler framework for EDML (the previous works on EDML are quite dense). However, I was disappointed that the material is presented in a confusing manner and that a number of key issues, such as convergence and the relationship to EM, are not addressed. Further, the contribution to parameter estimation for undirected models is quite incremental.

Contribution:
This paper portrays EDML as a fixed-point iteration whose fixed points correspond to stationary points of the log likelihood. Each iteration requires performing inference and has a notion of soft evidence, like EM. Its application to complete-data learning for Markov networks reduces to a Jacobi-style, coordinate-descent-type method for the convex log likelihood and can be used as a black-box optimizer for Markov networks.
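Concretely, the kind of Jacobi-style fixed-point scheme I am describing looks roughly like the following (a hypothetical Python sketch, not taken from the paper; solve_subproblem is a placeholder for the solver of a single convex sub-problem, which in EDML would itself require inference with the current parameters):

import numpy as np

def jacobi_style_step(theta, solve_subproblem):
    # Every coordinate sub-problem is formed from the *current* iterate theta
    # and solved independently; all updates are then applied at once.
    return np.array([solve_subproblem(i, theta) for i in range(len(theta))])

def fixed_point_iterate(theta0, solve_subproblem, tol=1e-6, max_iters=100):
    # Iterate the update map until the parameters stop changing,
    # i.e., until a fixed point is reached (up to tolerance tol).
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        new_theta = jacobi_style_step(theta, solve_subproblem)
        if np.max(np.abs(new_theta - theta)) < tol:
            return new_theta
        theta = new_theta
    return theta

This is only a generic illustration of the update pattern, not the paper's exact update rule.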

Novelty:
The connection between EDML and continuous optimization is novel. However, the literature on alternatives to EM is vast, and I'm not sure how novel it is as a learning method under missing data. For complete-data Markov networks, the method is similar to basic, known methods for convex optimization.

Pros:
Attractive method for learning in the presence of missing data that points to fertile ground for future work. Unified treatment of directed and undirected graphical models is also interesting and potentially impactful. Many practitioners would benefit from reading the paper in order to think about alternative perspectives on parameter learning in the presence of unobserved data.

Cons:
1) The paper casts EDML as an optimization technique, but lacks necessary details for readers to understand this interpretation.
2) Exposition, particularly notation, is sometimes confusing.
3) Insufficient conceptual or experimental comparison to EM.
4) Experiments only address an overly simple case of parameter learning problems (complete data).

1) I found Section 2 very confusing. It first appears as though you're introducing some new family of optimization algorithms, but then it becomes apparent that you're discussing the general family of techniques described in the first paragraph of the related work section. I would reshape this section to first state that such a family exists and discuss its convergence properties. Then, I'd state that for the particular class of problems you're dealing with, these coordinate-wise minimizations are tractable. Claim 1 is boring, and more space should be made for previewing the sort of content discussed in the first paragraph of the related work section.

The paper also lacks a formal discussion of the method's convergence behavior. How do we know that the update rule at the top of page 5 will converge? I agree that its fixed points are stationary points of the marginal likelihood, but how do we know it will ever converge to one of them? This issue would be remedied if you were much more straightforward in Section 2 about what you need for such coordinate minimization techniques to converge. Fixed-point iteration is not guaranteed to converge; you need some sort of contraction. See, for example, proofs of the convergence of loopy belief propagation.
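The sort of condition I have in mind is the standard (Banach) contraction requirement, which is generic and not specific to EDML: if the update map T satisfies, for some constant 0 <= c < 1,

\[ \| T(\theta) - T(\theta') \| \le c \, \| \theta - \theta' \| \quad \text{for all feasible } \theta, \theta', \]

then the iteration \theta^{(t+1)} = T(\theta^{(t)}) converges to the unique fixed point at a geometric rate.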

Why does Section 3.4 have the title "Finding the Unique Optimum"? The subfunctions are convex, so each of them has a unique minimizer. However, the overall problem doesn't have a unique stationary point, and the algorithm you present in this section is for finding such a point. I understand that the updates at the top of page 5 are obtained by analytically minimizing simultaneously in every coordinate (where you do find a unique minimum), but the title is very confusing.

2) The notation in Section 3.2 (soft evidence) needs to be defined much more clearly. In Equation (3), is there implicitly a sum over the possible values that X_i can take on? I don't understand how you can get a marginal likelihood without summing over the possible values of each X_i. What's the difference between P(\eta_i|x) and P(X_i|x)? The corresponding appendix does not clear things up either. Furthermore, I didn't find Section 3.2 particularly illuminating about the method without some comparison to 'soft evidence' in EM.

Also, it's strange that you present an algorithm at the top of page 5 based on an earlier work and leave it to the reader to deduce that this can also be obtained by applying the procedure from page 2 to the log likelihood. You should, at the very least, say that the updates at the top of page 5 correspond to exact simultaneous minimization of the subfunctions. You should also discuss under what conditions such fixed point iteration will converge.

3) In general, I'd like to see much more discussion about the relationship between EDML and EM. EM has some desirable properties (monotonicity of the updates) and some downsides (it's maximizing a concave lower bound on the marginal likelihood rather than the actual likelihood). It would be good to juxtapose these with the properties of EDML. Of course it would have been nice to see experiments comparing the two as well.

4) The experiments section only considers complete-data learning, in which the log likelihood is convex. In this setting EDML is just another convex optimization method, so a much more rigorous discussion of its convergence behavior, memory requirements, per-iteration complexity, robustness to errors in approximate inference, etc. is necessary. For future work, I would make sure to do experiments with missing data and compare to EM. The two methods are not guaranteed to converge to the same value of the marginal likelihood, so comparing solution quality, and not just time, is necessary.

Minor points
What exactly are the semantics of your feasibility function g(y)? Are you always assuming it's a projection onto the feasible set? If it isn't, g() is not guaranteed to give you what you want. First you make it sound like a projection onto a feasible set, but then you use it as an indicator function for the set?
You should define somewhere early on what EDML stands for.
Is this technique stable if exact inference in the Bayes net is intractable? A discussion or experimental evaluation in this setting would be useful.
The update equation at the top of page 5 should have a number.

----- Post Author Response -----
In their response, the authors address many of the key omissions in their paper, such as the lack of a proof of convergence, by referring us to their other recent papers on EDML. This isn't quite sufficient to address the confusion that will arise for future readers of the paper. Some things, such as a comparison to EM, can be addressed by providing a pointer to the earlier papers. However, the discussion of EDML as an optimization technique needs to be able to stand on its own. It is not sufficient to present an optimization algorithm without proving that it converges or characterizing what sorts of points it will converge to.

Overall, I am familiar with the modern literature on inference and learning in graphical models, and this paper is not of sufficient quality to be presented at NIPS. My concerns with this paper are mostly presentation-related. For subsequent versions of the paper, I hope the authors will address the concerns that were raised. This is quite an important contribution, and I hope the authors will resubmit it (if it is rejected at NIPS).
Q2: Please summarize your review in 1-2 sentences
I think the paper is potentially useful and a contribution to the community, since it puts forth a simpler framework for EDML (the previous works on EDML are quite dense). However, I was disappointed that the material is presented in a confusing manner and that a number of key issues, such as convergence and the relationship to EM, are not addressed. Further, the contribution to parameter estimation for undirected models is quite incremental.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
One goal of this work is indeed to provide a different perspective on parameter estimation in probabilistic graphical models (directed and undirected alike). Namely, we seek an understanding of parameter estimation beyond what was provided by classical and well-studied frameworks such as EM. Reviewer 7 clearly appreciates this view and the insights it brings. Reviewer 8 recognizes the fertile ground that this perspective may open up for future work. Reviewer 6 recognizes the utility of this view in deriving new EDML algorithms for undirected models.

Reviewer 8 asked for a comparison between EDML and EM. Please note that previous work [5, 16] has already compared EDML to EM, theoretically and empirically. Given the proposed pure optimization perspective, we thought it was more appropriate to use the limited space available to relate EDML to existing optimization methods (such as CG and L-BFGS), under complete data. Our experiments also imply practical benefits for the incomplete data case, when EDML is used in lieu of gradient methods for the M-step of EM (comparisons with EM were also given for directed models in [5, 16]). On the other hand, [5, 16] include answers to several questions that Reviewer 8 raised, including discussions about convergence. We will next clarify the main points raised by each reviewer.

Reviewer 6
**********

#1 (The UAI paper):

1a) EDML in the UAI paper is a "specific" optimization algorithm for learning Bayesian networks (BNs) that invokes a lot of BN concepts.

1b) Viewing EDML in the simple form given in Section 2 does not invoke any BN concepts, which is what allows us to interpret EDML as a "general" optimization algorithm.

#2 (Other comments):

2a) For MRFs, EDML can take more than one iteration to converge under complete data (Lines 301 and 302).

2b) The title emphasizes that EDML can now be used for both directed and undirected models. We will think about a more informative title.

Reviewer 7
**********

#1 (Parameter estimation in graphical models):

1a) The reviewer's note about the big pro of the paper is insightful: showing that parameter estimation in graphical models requires iteratively solving rather simple convex optimization sub-problems.

1b) We also emphasize that creating all of these sub-problems requires performing inference "only once" per EDML iteration, which we think is interesting.

#2 (EDML as a coordinate descent (CD) method):

2a) EDML, viewed as a CD method, makes two choices:
1] Optimization in the different directions is done in parallel, i.e., the current parameters are used to generate all sub-problems.
2] Every parameter set is considered a direction of optimization.

2b) Huang et al. (2010) propose a unified framework and discuss CD methods that use both sequential and parallel updates; EDML falls in the latter class (the contrast is sketched below). We will clarify this connection in the paper.
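To make the distinction concrete, here is a minimal sketch (hypothetical Python pseudocode, not taken from our paper or from Huang et al.) contrasting a sequential (Gauss-Seidel-style) pass with the parallel (Jacobi-style) pass corresponding to choice 1] above; solve_subproblem is a placeholder for the solver of a single sub-problem:

import numpy as np

def sequential_cd_pass(theta, solve_subproblem):
    # Gauss-Seidel style: each sub-problem immediately sees the
    # coordinates already updated earlier in the same pass.
    theta = np.array(theta, dtype=float)
    for i in range(len(theta)):
        theta[i] = solve_subproblem(i, theta)
    return theta

def parallel_cd_pass(theta, solve_subproblem):
    # Jacobi style (the EDML choice): all sub-problems are generated
    # from the same current iterate, solved independently, and their
    # solutions are applied simultaneously.
    theta = np.asarray(theta, dtype=float)
    return np.array([solve_subproblem(i, theta) for i in range(len(theta))])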

#3 (Other comments):

3a) The reviewer points to interesting related work that we will definitely consider. We tried simple gradient descent (with and "without" line search), but did not find it competitive on our benchmarks. There are still many variations that are worth trying, as suggested by the reviewer.

Reviewer 8
**********

#1 (EDML vs EM):

1a) A theoretical and empirical comparison between EDML and EM was already done in previous work [5, 16], and showed advantages for EDML. The goal of this paper is not to re-compare EDML with EM, but to simplify and generalize EDML.

#2 (Sec 3.4):

2a) This section does not find a stationary point for the overall learning problem, but only finds the unique optimum of each convex sub-problem. This is only a single step of the EDML algorithm. At the beginning of Section 3.4, we mention explicitly that we are solving Equation 3 (the sub-problem).

#3 (Convergence and the convex sub-problems):

3a) The convergence analysis of the update equation solving the small convex sub-problems was already given in [16]: the update equation at the top of Page 5 is a simpler form of (Equation 4 in [16]), which was proved to converge by (Theorem 2 in [16]).

3b) While (Equation 4 in [16]) is guaranteed to converge, the EDML iterative procedure is not guaranteed to converge in general cases. Please refer to (Section 6 in [5]) for cases where EDML is guaranteed to converge.

3c) Equation 2 is a simplified form of the convex sub-problems obtained in previous work [16], and it is therefore possible to use Equation 4 in [16] to optimize it, as proposed in Section 3.4.

We will clarify these explicitly in the paper.

#4 (Markov Networks and Experiments):

4a) Beyond allowing us to derive EDML for Markov networks, our new formulation further lends itself to the derivation of EDML for new families of graphical models.

4b) In our experiments, the partition function is still non-trivial to compute. Thus, even with complete data, minimizing the number of evaluations of the partition function or its gradient is quite important.

#5 (Section 2 and the Related Work):

5a) Section 2 sheds light on a family of algorithms, emphasizing that the current feasible parameters are used to generate all of the sub-problems. IPF does not fall directly into this family. We will clarify this difference in Section 2.

#6 (Soft Evidence):

6a) There is a specific technical definition for the term soft evidence as we use it (see Footnote 4). P(\eta_i|x) and P(X_i|x) are different; details are given in [4, 16], but we will clarify this in the paper as well.

6b) The sum over x is a summation over all possible values of X. Note that Equation 3 is for a single sub-problem; the sub-problems become independent of one another.

#7 (Other comments):

7a) In Section 2, the feasibility function (g) takes an arbitrary point and returns a feasible point. Theorem 5 in the appendix adds a constraint on g to get some favorable behavior. We do not use g as an indicator function.

7b) We make no claims yet about the algorithm's behavior with approximate inference.