
Submitted by
Assigned_Reviewer_6
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
This paper considers the problem of nonparametric estimation of the conditional distribution when the conditioning variable is high-dimensional. The authors construct a tree-based structure for X
which is fixed a priori and build a mixture model on the tree (conditional
on the observed regressors for an observation).
Quality: The idea
is an interesting one and has the potential for working well with many
regressor variables. It is unfortunate that the reference for the METIS algorithm is missing and that the method is not clearly described. It seems unclear how pairwise distances are used to construct the partition. For example, it seems to me that, for a fixed level, the pairs x_1,x_2 and x_2,x_3 may meet the criteria while x_1,x_3 does not. It would be helpful to describe how the partitions arise. It would also be useful to mention that the authors are defining a mixture model (rather than a linear model, which is commonly used with wavelets).
There seems to be no
justification for the comment "The implication of this is that each scale
within a path is weighted to optimize the bias/variance tradeoff across
scales."
The paper would benefit from some discussion of the properties of the stick-breaking prior. The prior is infinite-dimensional. Is the prior truncated in the inference?
Clarity: The paper was unclear to me in places. Here are some examples: page 4, what is $\Xi$? Page 5, I am not clear what it means that "we let each $\psi_{j,k}$ simply be a Dirac delta function operating only on the $X$'s". Does it just mean that there is no regression on $X$ in the conditional densities?
Originality: There has been little work on applications of
flexible conditional distribution estimation with high-dimensional
regressors.
Significance: This has the potential to have a
substantial impact. Q2: Please summarize your review in 1-2 sentences
A very interesting idea. The paper would benefit from
some tightening up. Submitted by
Assigned_Reviewer_7
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
Summary: This paper addresses the problem of
conditional density estimation with a high-dimensional input space (p >> n), an important problem, as most (if not all) current models for nonparametric conditional density estimation do not scale to high dimensions. Moreover, datasets with high-dimensional inputs but
relatively small sample sizes are becoming increasingly common. The model
for the conditional density f(y|x) is defined in three stages. First, a
tree structure is defined over the input space. Second, given the tree
structure, C_{j,k}, the k^th partition of the X space at scale j, is
mapped to a lower dimensional space. Third, a local model for the
conditional density at each node j,k is selected which only depends on x
through its lower dimensional representation. In practice the lower
dimension representation is the indicator function 1(x \in C_{j,k}) and
the local model is Gaussian, i.e. f_{j,k}(y|x) = N(m_{j,k}, s_{j,k}).
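To fix ideas, here is a minimal sketch of the mixture as we read equation (1); all names are illustrative, and the rule for choosing the leaf j,k is precisely the question raised below.

    # Sketch of f(y|x, tau): a weighted average of the local Gaussians along the
    # root-to-leaf path through the cell containing x. Names are illustrative.
    from scipy.stats import norm

    def conditional_density(y, ancestors, pi, m, s):
        # ancestors: indices of the cells C_{j',k'} on the path down to (j,k)
        # pi[a]: path weight; m[a], s[a]: local Gaussian mean and sd
        return sum(pi[a] * norm.pdf(y, loc=m[a], scale=s[a]) for a in ancestors)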
The inference algorithm is “partially Bayesian”, as an empirical Bayes
approach is used for the tree structure (which is estimated via the METIS
algorithm). Then, a Gibbs sampling algorithm is used for posterior
inference of the weights and parameters (m_{j,k}, s_{j,k}) of each node
indexed by j,k of the tree.
Strengths: The authors develop a
novel method for high-dimensional conditional density estimation which
lacks any competitors. Performance is quite impressive when compared
with LASSO, CART, PC regression (even more so considering the added
flexibility of the model).
Weaknesses: Poor notation and lack of a clear representation of the model for the conditional density. The model is stated very vaguely, but then a more specific version is used. The paper would be more readable if the focus were only on the specific version of the model used in practice and included a clear mathematical description of the model. In equation (1), your model for the conditional density given the tree structure tau depends on some j,k. In particular, f(y|x,tau) is a weighted average over the local conditional densities of all ancestors of j,k. How is this cutoff j,k selected? In equation (1),
your notation C_{j',k'} \in A_{j,k} does not match your definition of the
sets, as C_{j',k'} is a subset of the X space and A_{j,k} is a subset of
the integers. Throughout the text the conditional distribution
function is used when conditional density should be, leading to some
incorrect statements (examples are pg. line 86 and pg. 6 line 322).
The word approximately is used throughout the text, but it may be more
sensible to replace approximately with equals and state that these are
model assumptions. What does f_{y|x} represent? The entire collection of conditional densities f_{y|x} = (f_{y|x}(\cdot|x), x \in X) or a specific f_{y|x}(y|x)? The same comments apply to F_{Y|X}. pg. 4 line 196, there is a missing reference. pg. 4 equation (2), should proportional be replaced with equals? And should the last pi_{j,k} = 1 - \sum V_{j',k'} (where the sum is over the ancestors of j,k)? Otherwise it involves a normalization term which may affect the full conditional of the V_{j,k}. pg. 5 line 239, the notation k_j(x_i) is used but not introduced. pg. 5 line 242, is the full conditional of V_{j,k} correct? Because the likelihood seems to involve a normalization term based on V_{j,k}. The paper by
Bhattacharya, Page, and Dunson (2013) Density Estimation and
Classification via Bayesian nonparametric learning of affine subspaces
could be an important reference/competitor to consider.
Quality:
The work is supported by some impressive examples but the model lacks a
clear mathematical explanation.
Clarity: I would suggest removing the vague representation of the model and focusing only on the specific model used in practice. The notation needs some cleaning.
Originality:
Similar to other regression techniques, the proposed model achieves
flexibility through a tree structure; however, its originality comes from
flexibility in modelling the entire conditional density not just the mean
function, which is achieved by allowing for local conditional densities at
each node of the tree. For inference in high dimensions, they must take an empirical Bayes approach with regard to the tree structure.
Significance: The results could potentially be very important, as they allow for conditional density estimation with high-dimensional inputs, and the experimental results are impressive. Q2: Please summarize your review in 1-2 sentences
Overall, this is a nice paper that addresses an important issue and has impressive performance over existing, less flexible models (in terms of both run time and accuracy). However, the exact model used needs to be clearly stated mathematically, and the notation should be cleaned up. Submitted by
Assigned_Reviewer_8
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes a model for conditional density
estimation in extremely high dimensions (p=10^6). First, a tree that
hierarchically partitions the sample space is constructed
deterministically from the data. The hierarchical partition consists of
cells C_{j,k} at different resolutions (j) and locations (k). The "edges"
in the tree are given prior weights according to stick-breaking, and this
helps determine to which cells the Xs are assigned. The Xs have hard
assignments to cells but these assignments are sampled. Given an
assignment of Xs to cells, the Xs are mapped to a lower dimensional space,
and from this embedding a distribution over Y|X is inferred. In this case
Y is modeled as Gaussian, but the method is open to other choices.
Experiments are performed on several simulated datasets and on
neuroimaging data. The proposed method beats several competitors on the
synthetic data in both time and accuracy. On the neuroimaging problems the
proposed method is competitive with other methods in terms of MSE, and
scales up much better to larger problems. Q2: Please summarize your review in 1-2 sentences
A tree-based model for conditional density estimation
is proposed for large p (~10^6) problems. The model is not very elegant
but it provides good experimental results.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 6000 characters. Note
however that reviewers and area chairs are very busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
Thank you for your useful comments and suggestions on
our manuscript. We will certainly modify the manuscript accordingly.
Please find below our responses to your comments.
Reply to
Reviewer #1:
METIS is described in detail in [1]. METIS operates
on a dissimilarity matrix, constructed via pairwise comparisons. For this
work, we followed the advice given in [2]. In short, we used an L2
distance scaled by the variances of each of the data points. The
partitions arose by recursively partitioning the graph defined by the nearest neighbors according to the above-described dissimilarity. We will clarify this in the revision.
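To make the construction concrete, the following sketch builds the scaled dissimilarity and nearest-neighbor graph; the exact variance scaling is our reading of [2], and the recursive METIS bisection is only indicated in a comment.

    # Nearest-neighbor graph on which METIS [1] is run recursively.
    import numpy as np
    from sklearn.neighbors import kneighbors_graph

    def knn_dissimilarity_graph(X, n_neighbors=10):
        X = np.asarray(X, dtype=float)
        # Variance-scaled L2 distance (our reading of the rebuttal and of [2]).
        Xs = (X - X.mean(axis=0)) / X.std(axis=0)
        # METIS then recursively bisects this graph, yielding the cells C_{j,k}.
        return kneighbors_graph(Xs, n_neighbors, mode='distance')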
We will provide a supplementary figure
demonstrating how the weights change across scale according to their
relative signal for the inference task.
We are using a
stick-breaking prior over an infinitely deep tree, but only go as deep as
the data allows; when we run out of data, we truncate the tree.
$\Xi$ is the embedding space, the space in which the data are embedded after dimensionality reduction. We will clarify this in the revision.
That $\psi_{j,k}$ is a Dirac delta function means that there is no regression on $X$ in the conditional densities; rather, we simply obtain the density for all those elements in the cell $C_{j,k}$.
Reply to
Reviewer #2:
Thank you for pointing out some inconsistencies in
our notation and distributions. We will resolve them all in the revision.
“pg. 4 line 196. Missing reference.” Karypis and Kumar (1999) [1]
“pg. 4 equation (2).” Yes, proportional should be replaced by equals. To guarantee that the weights associated with any path (s_1, …, s_L) sum to 1, we set the stick-breaking weights corresponding to subsets located at the last leaves equal to one, i.e. V_{L,j} = 1 for all j (Ishwaran and James, 2001) [3]. “pg. 5 line 239.” k_j(x_i) is a variable indicating the set at level j to which x_i has been allocated. “pg. 5 line 242.” Since the weights (pi_{s_1}, …, pi_{s_L}) sum to 1 for any path (s_1, …, s_L), there is no normalizing term depending on the stick-breaking weights. Please refer to Ishwaran and James (2001) [3] for more details on the full conditionals.
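The telescoping can be checked directly; a small illustrative sketch (following [3], with hypothetical stick values):

    # Weights along one root-to-leaf path:
    # pi_{s_l} = V_{s_l} * prod_{m<l} (1 - V_{s_m}); forcing the leaf stick to 1
    # makes the weights sum to exactly 1, so no normalizing term appears.
    import numpy as np

    def path_weights(V):
        V = np.array(V, dtype=float)   # sticks V_{s_1}, ..., V_{s_L}
        V[-1] = 1.0                    # leaf stick set to one, per [3]
        leftover = np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
        return V * leftover

    assert np.isclose(path_weights([0.3, 0.5, 0.9]).sum(), 1.0)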
Reply to Reviewer #3
“Line 216 claims that
the embedding function \psi_{j,k} is a Dirac delta”
See above.
“line 63 claim of NP-hardness is not cited or explained”
[4] is a nice review on variable selection. In brief, the problem is NP-hard because there are $2^p$ possible subsets of variables to select (we can choose any of the $\binom{p}{k}$ subsets for each $k$, and $\sum_{k=0}^{p} \binom{p}{k} = 2^p$).
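A one-line sanity check of the counting identity:

    # sum_{k=0}^{p} C(p, k) = 2^p possible subsets of p variables.
    from math import comb
    p = 20
    assert sum(comb(p, k) for k in range(p + 1)) == 2 ** p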
References:
[1] G. Karypis and V. Kumar (1999). A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing.
[2] W. K. Allard, G. Chen, and M. Maggioni (2012). Multiscale geometric methods for data sets II: Geometric multiresolution analysis. Applied and Computational Harmonic Analysis.
[3] H. Ishwaran and L. F. James (2001). Gibbs Sampling Methods for Stick-Breaking Priors. JASA.
[4] I. Guyon and A. Elisseeff (2003). An Introduction to Variable and Feature Selection. JMLR.