Reviews: U-Time: A Fully Convolutional Network for Time Series Segmentation Applied to Sleep Staging

- Originality: The authors tackle the well-known sleep staging task with U-Time, a new method for time-series segmentation derived from the widely used image segmentation model U-Net. Unlike previous studies, which mainly rely on RNN-based architectures for time-series analysis, the authors propose a fully convolutional model built on 1D convolutions, which imposes fewer limitations on temporal resolution and distributed computation. The sleep staging literature is well cited, but the paper needs more comparison with it, preferably in the related work section. The methods compared in Table 1 should be introduced earlier in the paper. References on time-series segmentation and classification should also be added from the methodological perspective.
- Quality: This is a complete study, and the proposed U-Time method is technically sound for the time-series segmentation problem. No theoretical analysis is provided, but the empirical results show that the method is promising and support the claim that it outperforms recurrent methods under some conditions. The authors also report detailed hyperparameter experiments in the supplementary material. The weakness of the study is that the baselines are not unified across datasets: S-EDF-39 has many methods for comparison, whereas each of the other datasets has either a single baseline (a different one per dataset) or none at all.
- Clarity: The paper is well written and well organized. The dataset descriptions could be condensed into a table instead of taking up a full page. Some qualitative analysis would help with model interpretability.
- Significance: The results of applying U-Time to multiple sleep staging datasets demonstrate that the proposed framework is useful for time-series segmentation in the sleep staging problem. It may also be helpful for other, more general time-series segmentation tasks. From a clinical perspective, good PSG/EEG segmentation and classification can help neurologists diagnose sleep-related diseases in less time, which may streamline the clinical workflow.

Post response: The authors did a great job responding to my question about architecture decisions, and I recommend summarizing some of this in the paper. On the other hand, I think the authors may have missed my point about error analysis. When I suggested using confusion matrices, I was not necessarily suggesting including them in the paper; I was suggesting using them as a tool to help explain the apparent systematic differences in predictive behavior between the models. That is, the authors have convinced me that their model performs better on N1 segments, but I have no idea why that is the case (which I consider the more interesting scientific question). I think the paper should be accepted as is, but it could be improved by including this type of error analysis. Also, Figure 1 in the author response is a great figure that I recommend including in the main paper.

--------------------------------------------

Overall: In this work, the authors propose a convolutional neural network model for sleep staging and test it on several sleep staging datasets. Overall, I found this to be a well-executed application paper with a meaningful, well-explained application, a reasonable approach, and a convincing evaluation. My main comments are that the model seems rather arbitrary and that the evaluation could use some error analysis.

Major comments:
1. The model, while reasonable, seemed very arbitrary. If the architecture was based on some theory or intuition about what would work in this application, the authors should explain it. If the authors tried several models with less success, then the sensitivity to these choices should be described. For example, in the Decoder section, the authors specify four transposed-convolution blocks with specific kernel sizes. How were these values selected? If, as I suspect, some architecture search was done, details of this search (e.g. sensitivity to the number of transposed-convolution blocks) should be included in the supplementary material.
2. I think the results section would benefit greatly from some error analysis. As it stands, the authors identify certain systematic differences in the errors between models, but never explain why they occur. Perhaps confusion matrices (some of which the authors included in the supplementary material) would be useful for this. As the authors note, the most glaring such difference appears to be that U-Time performs better on N1 segments, which are also the hardest to recognize. Why is this the case?

Minor comments:
1. Lines 43-45 "U-Time as opposed to...": What is the intuition behind this dataset invariance?
2. Figure 1 is hard to read in black and white. I recommend using a palette that is friendly to color-blind readers (e.g. https://davidmathlogic.com/colorblind/).
3. Could you expand on the "confidence scores" in lines 118-121? In particular, how are they generated, and do they have any specific interpretation beyond general "confidence" (e.g. as probabilities)?
4. Lines 177-179: Could you expand on the hyperparameter selection procedure? Why not re-tune for each dataset? Was cross-validation used for tuning on S-EDF-39?
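
To make the error analysis suggested in major comment 2 concrete, one possible approach is to compare row-normalized confusion matrices of two models and look at where the N1 row differs. The following is only a minimal sketch under stated assumptions: the five-stage ordering in `STAGES`, the toy label arrays, and the names `pred_a` and `pred_b` are hypothetical illustrations, not data, results, or code from the paper.

```python
import numpy as np

# Assumed stage ordering (hypothetical; adapt to the label encoding actually used).
STAGES = ["W", "N1", "N2", "N3", "REM"]


def confusion_matrix(y_true, y_pred, n_classes):
    """Count matrix with true stages as rows and predicted stages as columns."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)  # accumulates correctly for repeated index pairs
    return cm


def per_class_f1(cm):
    """Per-stage F1 score computed from a count confusion matrix."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    denom = 2 * tp + fp + fn
    return np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)


def row_normalize(cm):
    """Convert counts to P(predicted stage | true stage), row by row."""
    totals = cm.sum(axis=1, keepdims=True)
    return np.divide(cm, totals, out=np.zeros(cm.shape, dtype=float), where=totals > 0)


# Toy labels for illustration only (not results from any dataset).
y_true = np.array([0, 1, 1, 1, 2, 2, 3, 4, 4, 1])
pred_a = np.array([0, 1, 2, 1, 2, 2, 3, 4, 4, 0])  # hypothetical model A
pred_b = np.array([0, 2, 2, 0, 2, 2, 3, 4, 4, 1])  # hypothetical model B

cm_a = confusion_matrix(y_true, pred_a, len(STAGES))
cm_b = confusion_matrix(y_true, pred_b, len(STAGES))

# Positive entries: model A predicts that stage more often than model B
# for the given true stage; the N1 row shows where the two models diverge.
diff = row_normalize(cm_a) - row_normalize(cm_b)
n1 = STAGES.index("N1")
for stage, delta in zip(STAGES, diff[n1]):
    print(f"true N1 -> predicted {stage}: {delta:+.2f}")
```

In this toy example, the N1 row of `diff` shows that the gap between the two models comes from N1 epochs that the second model labels as N2. The same inspection on real predictions would indicate which stage confusions account for a difference in N1 performance.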

Paper ID:	2469
Title:	U-Time: A Fully Convolutional Network for Time Series Segmentation Applied to Sleep Staging
