Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The proposed loss functions seem novel, and the theoretical analysis is well presented in support of their validity. These Bi-Tempered Logistic loss functions are variants of existing ones. They are derived by introducing a temperature into the exponential function and by replacing the softmax with a high-temperature generalization. Beyond these well-presented derivations, I'd like to see some statistical properties of these functions and how they compare with the existing ones. The authors claim that the proposed loss functions are robust to noise and outliers; they are encouraged to present more theoretical and empirical analysis on this point. The new loss function is not convex. Although the conventional logistic loss is not convex with respect to some parameters when a neural network is used, its convexity in the model outputs still enables researchers to analyze the performance of the learning algorithm theoretically. If this new non-convex function is used, is such analysis still possible?
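To make the construction discussed above concrete, here is a minimal sketch of a tempered logarithm and exponential in their standard Tsallis-style form, which I assume matches the paper's definitions. The function names `log_t` and `exp_t` and the sample values are illustrative choices, not taken from the paper.

```python
import numpy as np

def log_t(x, t):
    """Tempered logarithm (assumed standard form); equals np.log(x) at t == 1."""
    x = np.asarray(x, dtype=float)
    if t == 1.0:
        return np.log(x)
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

def exp_t(x, t):
    """Tempered exponential, inverse of log_t on its range; np.exp(x) at t == 1.

    For t > 1 the input must satisfy x < 1 / (t - 1); values outside that
    domain are not handled in this sketch.
    """
    x = np.asarray(x, dtype=float)
    if t == 1.0:
        return np.exp(x)
    base = np.maximum(1.0 + (1.0 - t) * x, 0.0)
    return base ** (1.0 / (1.0 - t))

# Penalty for assigning a tiny probability to the true class.
p = 1e-6
standard_penalty = -np.log(p)        # grows without bound as p -> 0
tempered_penalty = -log_t(p, 0.5)    # stays below 1 / (1 - t) = 2 for t = 0.5

# Tail weight for a strongly negative activation.
standard_tail = np.exp(-10.0)
tempered_tail = exp_t(-10.0, 1.5)    # heavier tail for t > 1
```

With a temperature below 1 the log-type penalty is bounded, and with a temperature above 1 the exponential decays polynomially rather than exponentially. These are the two mechanisms to which the robustness claims would presumably be attributed, and the statistical properties requested above would quantify them.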
The authors propose a loss controlled by two temperature parameters of the generalized logarithm (and exponential): one comes from the probability model and the other from the Bregman (beta) divergence. As expected, the proposed loss is robust to noise, which is confirmed by simple numerical experiments with good visualizations. The idea is similar to that of ref, using a Bregman divergence instead of Tsallis, so the proposal is not surprising. In statistics, it is well known that Bregman divergences (not only the beta divergence) lead to consistent and robust estimators, and that heavy-tailed distributions (not only the Bregman-dual link function) are insensitive to outliers, so the robustness of the proposed loss looks like a natural outcome. The former facts have been intensively investigated by Shinto Eguchi, Frank Nielsen, and their collaborators. The latter can be found in, for example, the famous book by Huber (1981). The overall organization of the paper is well considered, including a good introduction to Bregman divergences, and the theoretical discussion of the proposed loss is clear enough, so the quality and clarity of the paper are very high. The presented idea is nice, but its significance to the community is unclear.
The paper is well motivated and introduces a novel tunable class of losses for (DNN) classification that replaces the usual logistic loss. The experiments demonstrate the gain obtained by using this biparametric logistic loss and improve on AISTATS'19.
- Mention the general deformed logarithm and exponential (integral of a monotone function) and then introduce its specialization in Eq. 1. Cite the book by Jan Naudts, "Generalised Thermostatistics", Springer.
- Explain that the parameters are called temperatures because of their use in thermostatistics. I think the paper (and notably the abstract) would gain in readability by not mentioning the "temperature" of generalized thermostatistics.
- Cite the book by Amari (2016), "Information Geometry and Its Applications", and mention the conformal flattening of the Tsallis relative entropy and escort distributions.
- State whether the domain is open (convex) or not, and whether the Bregman generator is of Legendre type or not.
- Better explain the statement "However, the Tsallis based divergences do not result in proper loss functions" (AISTATS'19 paper).
Minor typos:
- Kullback Leibler divergence -> Kullback-Leibler divergence
- typo 106 -> Kullback-Leibler (KL) divergence
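For the suggestion about the general deformed logarithm, the intended presentation could look roughly as follows. This is a sketch of the standard definition from generalized thermostatistics; the symbols \phi and t are notational assumptions here and may differ from the paper's Eq. 1.

```latex
% Deformed logarithm generated by a positive, non-decreasing function \phi on (0, \infty)
\ln_\phi(u) = \int_1^u \frac{dv}{\phi(v)}, \qquad u > 0,
% with the deformed exponential \exp_\phi defined as the inverse of \ln_\phi on its range.

% Specialization \phi(v) = v^t with t \neq 1:
\log_t(u) = \int_1^u v^{-t}\, dv = \frac{u^{1-t} - 1}{1 - t},
\qquad
\exp_t(x) = \bigl[\, 1 + (1 - t)\, x \,\bigr]_+^{1/(1-t)}.

% The choice \phi(v) = v (that is, t \to 1) recovers the usual \ln and \exp.
```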