__ Summary and Contributions__: The paper focuses on developing methods for estimating point-wise dependency.

__ Strengths__: The paper is well written, introducing all the required details and motivating the need to study point-wise dependency. The strength of the paper is in the experiments.
1. Multiple application scenarios are studied, and they are broad and representative.
2. The comparisons are extensive, though they could be presented more convincingly (see below).
3. Self-supervised representation learning is a good and useful application.

__ Weaknesses__: The connection between Section 3.1 and Section 3.2 is hard to follow and could be written better. The paper discusses how PD can be naturally obtained when optimizing MI neural variational bounds, but the argument that these methods have large variance, and hence that other methods for PD estimation are needed, could be motivated better. The proposed methods address an interesting problem, but they follow from the density-ratio method for PMI. More details on the novelty of the proposed approach would be helpful to the reviewer, and some details in Section 3.1 could be abstracted and absorbed into related work, as this is not really the paper's contribution.
The results could also be presented and explained better, especially in connecting the high variance of the existing approaches to how the proposed approaches improve on it. For example, it is hard to understand how the results are better than SMILE in Figure 1. Given that the experiments are the major strength of the paper, this could be expressed more convincingly.

__ Correctness__: The claims and methodology are interesting and correct.

__ Clarity__: The paper is well-written and easy to read.

__ Relation to Prior Work__: The paper clearly articulates its position with respect to the prior literature.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: This paper focuses on estimating the point-wise dependency (PD), which measures the instance-level dependency between events taken by two random variables. The authors show that although PD can be obtained by optimizing mutual information (MI) neural variational bounds, this leads to large variance. The authors further propose two data-driven approaches to estimate PD: (1) Probabilistic Classifier and (2) Density-Ratio Fitting. The first casts the problem as binary classification, sampling data pairs from the joint density as positive examples and from the product of marginals as negative examples. The second directly minimizes the expected squared distance between the true and estimated PD.
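For reference, the two estimators summarized above can be written compactly as follows. The notation ($r$ for PD, $\hat r_\theta$ for its estimate, $c$ for the class label) is assumed here for illustration and may differ from the paper's; the classifier identity further assumes equal numbers of positive and negative pairs.

```latex
\begin{align}
  % PD as a density ratio
  r(x, y) &= \frac{p(x, y)}{p(x)\,p(y)} \\
  % (1) Probabilistic classifier: c = 1 for pairs from P_{XY}, c = 0 for pairs from P_X P_Y
  r(x, y) &= \frac{p(c = 1 \mid x, y)}{p(c = 0 \mid x, y)} \\
  % (2) Density-ratio fitting: the squared error expands into terms that need only samples
  \mathbb{E}_{P_X P_Y}\!\left[(\hat r_\theta - r)^2\right]
    &= \mathbb{E}_{P_X P_Y}\!\left[\hat r_\theta^{\,2}\right]
     - 2\,\mathbb{E}_{P_{XY}}\!\left[\hat r_\theta\right]
     + \text{const}
\end{align}
```

The cross term in the last line follows from $\mathbb{E}_{P_X P_Y}[\hat r_\theta\, r] = \mathbb{E}_{P_{XY}}[\hat r_\theta]$, and the constant $\mathbb{E}_{P_X P_Y}[r^2]$ does not depend on $\theta$.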
The authors applied their PD estimation method in several applications, including MI estimation (by plugging in the point-wise MI obtained by taking the log of PD), self-supervised representation learning (by using the contrastive learning approach, i.e., similar pairs having higher PD), and cross-modal retrieval (using audio and text data). The proposed method was shown to be empirically comparable to the baselines.
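To make the classifier-based pipeline concrete, here is a minimal sketch on a toy bivariate Gaussian. It is not the authors' implementation: the quadratic feature map, the Newton-fitted logistic regression, and all function names are assumptions chosen so that the example stays small (the paper uses neural networks). Negative pairs are formed by shuffling `y`, and the plug-in MI is the average estimated log-PD over joint samples.

```python
import numpy as np


def sample_joint(n, rho, rng):
    """Draw n pairs (x, y) from a bivariate Gaussian with correlation rho."""
    x = rng.standard_normal(n)
    y = rho * x + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)
    return x, y


def features(x, y):
    """Quadratic features; the true log-PD of a correlated Gaussian is linear in them."""
    return np.stack([np.ones_like(x), x**2, y**2, x * y], axis=1)


def fit_logistic(phi, labels, steps=25, ridge=1e-6):
    """Logistic regression fitted with Newton's method (illustrative, not tuned)."""
    w = np.zeros(phi.shape[1])
    for _ in range(steps):
        p = 0.5 * (1.0 + np.tanh(0.5 * (phi @ w)))  # numerically stable sigmoid
        grad = phi.T @ (p - labels) + ridge * w
        hess = (phi * (p * (1.0 - p))[:, None]).T @ phi + ridge * np.eye(phi.shape[1])
        w = w - np.linalg.solve(hess, grad)
    return w


def log_pd(w, x, y):
    """With balanced classes, the classifier logit estimates log p(x,y) / (p(x) p(y))."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    y = np.atleast_1d(np.asarray(y, dtype=float))
    return features(x, y) @ w


rng = np.random.default_rng(0)
rho, n = 0.5, 20000
x, y = sample_joint(n, rho, rng)
y_shuffled = rng.permutation(y)  # breaks the pairing: approximate samples from p(x) p(y)

# Positive class: joint pairs. Negative class: shuffled pairs.
phi = np.concatenate([features(x, y), features(x, y_shuffled)])
labels = np.concatenate([np.ones(n), np.zeros(n)])
w = fit_logistic(phi, labels)

mi_estimate = float(np.mean(log_pd(w, x, y)))  # plug-in MI: average log-PD over joint samples
mi_true = -0.5 * np.log(1.0 - rho**2)          # closed form for a bivariate Gaussian
```

Comparing `mi_estimate` with `mi_true` gives a quick sanity check. In this toy setting the feature map contains the true log-PD, so the example says nothing about how a neural classifier would behave on harder data.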

__ Strengths__: Point-wise dependency estimation is an interesting yet understudied research issue. I am glad to see efforts beyond just estimating the aggregated MI. The problem studied and the approaches taken seem pretty novel to me and are technically sound. The paper is well-written and organized. Intuitions are given as well as rigorous mathematical descriptions, which makes it very easy to follow, and I find it enjoyable to read. Besides, the evaluations, theoretical analysis and relevant discussion are also done with high standards.

__ Weaknesses__: 1. In Fig. 1, it seems that the probabilistic classifier approach is better than the density-ratio fitting approach, as it has both smaller bias and smaller variance. However, in Fig. 2, for another task, density-ratio fitting is consistently better than all other approaches. I wonder whether the authors have any insights into why the relative performance differs across tasks.
2. In the cross-modal learning section, no baselines were compared, and density-ratio fitting was not compared either. While I understand that the main purpose is to showcase the usage of PD, cross-modal learning is potentially an important application of PD estimation, so I would suggest that the authors compare against some SOTA baselines on this topic.

__ Correctness__: The approaches developed are technically sound. Both theoretical analysis and empirical evaluations are present and solid.

__ Clarity__: The paper is well-written, organized and extremely easy to follow.

__ Relation to Prior Work__: Relevant prior works are properly cited, discussed and compared.

__ Reproducibility__: Yes

__ Additional Feedback__: Please kindly see the Weaknesses section.

__ Summary and Contributions__: This paper studies estimating the point-wise dependency of data instances. For this purpose, two methods are proposed. Experiments on three tasks demonstrate the effectiveness of the proposed methods.

__ Strengths__: 1. The problem is important to the NeurIPS community.
2. The method is theoretically sound.
3. The empirical evaluation is extensive.

__ Weaknesses__: 1. The contribution is not significant, given existing neural methods for density ratio estimation and point-wise mutual information estimation.
2. The experiments are mainly conducted on toy data.

__ Correctness__: They are correct.

__ Clarity__: The paper is well written.

__ Relation to Prior Work__: The discussion is clear.

__ Reproducibility__: Yes

__ Additional Feedback__: This paper studies estimating the point-wise dependency of data instances. For this purpose, two methods are proposed. Experiments on three tasks demonstrate the effectiveness of the proposed methods.
The paper also has several weaknesses:
1. The contribution is not so significant.
1) This paper focuses on estimating the point-wise dependency of data instances, and the proposed methods are principled and theoretically sound. Despite this merit, similar problems have been extensively studied in the machine learning literature. For example, many prior works estimate the density ratio of two distributions by using the conjugate form of the f-divergence, and these methods can be used to estimate the point-wise dependency. Also, some other works use neural methods to estimate point-wise mutual information, and these can also estimate the point-wise dependency. Given these existing studies, the idea of the paper seems quite straightforward.
2) For point-wise dependency estimation, two methods are proposed. The first, the Probabilistic Classifier method, optimizes a classifier, which is then used to estimate the point-wise dependency. However, I feel that this method is a direct extension of GAN and f-divergence methods for density ratio estimation. The second, the Density-Ratio Fitting method, is also inspired by prior work. In this sense, this paper does not offer much new methodological insight.
3) The paper also points out that the problem of point-wise dependency estimation can be solved by existing neural estimators of mutual information. From my understanding, the proposed methods share a very similar methodology with these methods. What is the advantage of the proposed methods for estimating point-wise dependency over these related methods?
2. The experiments are only conducted on toy datasets.
Although the experiments in the paper are extensive, with three tasks considered, they are only conducted on toy datasets. For example, in application 1, different methods are evaluated with correlated Gaussian distributions; in application 2, two small datasets, MNIST and CIFAR, are used. For application 1, is it possible to evaluate with some other distributions? For application 2, is it possible to evaluate on ImageNet? For application 3, is there any baseline method to compare against?
-------------------------
Thanks to the authors for the clarity on the contribution and the additional experimental results! Overall, this is a solid work and I lean towards an accept.

__ Summary and Contributions__: In this paper, the authors study how to efficiently and effectively perform point-wise dependency estimation with neural methods. The main contributions can be summarized as follows.
C1. An interesting angle to address mutual information estimation is discussed.
C2. Probabilistic classifier and density-ratio fitting are proposed to enable effective point-wise dependency estimation.
C3. The value of point-wise dependency estimation is highlighted from empirical study.

__ Strengths__: S1. The authors suggest interesting perspectives to approach point-wise dependency estimation.
S2. Theoretical evidence is provided to enrich the discussion on mutual information estimation.
S3. The authors explore multiple applications to demonstrate the value of point-wise dependency estimation.

__ Weaknesses__: W1. It is difficult to clearly see the concrete difference or impact brought by either the probabilistic classifier or density-ratio fitting. This may be due to the presentation in Figure 1. Instead of a qualitative comparison, the authors could make this comparison more quantitative.
W2. The value of point-wise dependency estimation in cross-modal learning is a bit weak. The task discussed in Section 6 seems to be a typical classification or ranking problem. To better show the value of point-wise estimation, it is therefore important to compare with state-of-the-art baselines. Otherwise, only feasibility can be claimed, which limits the impact.
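As one possible form for the quantitative comparison suggested in W1, each estimator's repeated MI estimates could be summarized by bias, variance, and mean squared error against the known ground truth. The sketch below is only an illustration: the helper name `summarize_estimator` is hypothetical, and the numbers are placeholders, not results from the paper.

```python
import numpy as np


def summarize_estimator(estimates, true_value):
    """Bias, variance, and MSE of repeated estimates of a known quantity."""
    estimates = np.asarray(estimates, dtype=float)
    bias = float(estimates.mean() - true_value)
    variance = float(estimates.var())  # population variance, so mse == bias**2 + variance
    mse = float(np.mean((estimates - true_value) ** 2))
    return {"bias": bias, "variance": variance, "mse": mse}


# Hypothetical MI estimates from five independent runs (placeholder values).
runs = [1.9, 2.1, 2.0, 2.2, 1.8]
summary = summarize_estimator(runs, true_value=2.0)
```

Reporting such a table per estimator and per ground-truth MI level would make the bias and variance trade-off directly comparable across methods.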

__ Correctness__: In general, yes, but the presentation of the empirical results may need improvement.

__ Clarity__: Yes

__ Relation to Prior Work__: Yes

__ Reproducibility__: Yes

__ Additional Feedback__: For the self-supervised representation learning discussed in Section 5, how are similar pairs decided in MNIST and CIFAR10?