You Don’t Say. . . Linguistic Features in Sarcasm Detection

We explore linguistic features that contribute to sarcasm detection. The features we investigate are a combination of text and word complexity features, stylistic features, and psychological features. We experiment with sarcastic tweets with and without context. The results of our experiments indicate that contextual information is crucial for sarcasm prediction. One important observation is that sarcastic tweets are typically incongruent with their context in terms of sentiment or emotional load.


Introduction
Sarcasm, or verbal irony, is a figurative language device employed to convey the opposite of what is actually being said. In verbal communication, a pause, intonation, or look can provide the cues necessary to determine whether there is sarcastic intent behind a comment. In writing, these social cues are inaccessible. Thus, we must rely on our understanding of the world, the speaker, and the context beyond the statement to discern between sarcasm and sincerity. This task has proven to be so subjective that social media users annotate their own comments with symbols and hashtags such as /s and #sarcasm to mark sarcastic intent on Reddit and Twitter, respectively. In fact, the dataset used in this paper was collected using such hashtags (Ghosh et al., 2020).
For machines, the lack of real-world knowledge is detrimental to their understanding of sarcasm, as it hinders many natural language processing applications. Beyond social-media conversations, assessing product reviews as positive or negative requires an understanding of both rhetorical and literary devices. Back in 2012, BIC rolled out a "For Her" line of pens, which led their intended female audience to poke fun at the misogynist message of the product. One reviewer commented, "Well at last pens for us ladies to use. . . now all we need is "for her" paper and I can finally learn to write!". While this review seems positive and gave the product four stars, our understanding of the social climate today leads us to conclude that this review is sarcastic and should be classified as such.
In social media communication, new slang words are introduced every day and emojis are often used to negate the sentiment of the text. In addition, stylistic devices and stylometric features are also often employed to convey a meaning opposite from its literal interpretation. While deep learning models can be very effective in their detection of sarcasm, they provide a "black box" approach that gives linguists little to no insight into what features are characteristic of sarcasm. The purpose of the current work is to learn linguistic patterns associated with sarcastic tweets and their contexts and determine which are the strongest indicators of sarcasm. The next step is to combine these observations with transformer-based architectures to achieve a better prediction accuracy.

Previous Work
The field of automatic sarcasm recognition has become quite active in recent years. The most recent event is the shared task (Ghosh et al., 2020) organized as a part of the 2nd FigLang workshop at ACL 2020. The task is typically framed as a binary classification task (sarcastic vs. non-sarcastic), considering either an utterance in isolation or in combination with contextual information. Early approaches to automatic sarcasm detection rely on different types of features, including sarcasm markers, word embeddings, emoticons, and patterns between positive and negative sentiment (e.g., Davidov et al. 2010; Tsur et al. 2010; González-Ibáñez et al. 2011; Riloff et al. 2013; Maynard and Greenwood 2014; Wallace et al. 2015; Ghosh et al. 2015; Joshi et al. 2015; Veale and Hao 2010; Liebrecht et al. 2013). Buschmeier et al. (2014) explore a range of features, mainly focused on sentiment, for the detection of verbal irony in product reviews. While this work provides a good baseline for irony classification, our data differs in that it includes a multi-speaker thread of context prior to the sarcastic remark. More recent approaches apply deep learning methods (e.g., Ghosh and Veale 2016; Tay et al. 2018; Wallace et al. 2015). There is a great amount of research exploring the role of contextual information for sarcasm detection (e.g., Joshi et al. 2015). Ghosh et al. (2020) report that almost all systems submitted as part of the shared task used a transformer architecture, such as BERT (Turc et al. 2019) or RoBERTa (Liu et al. 2020), or other variants. These performed better than RNN architectures, even without any task-specific fine-tuning. Unfortunately, it is difficult to interpret what these models capture about sarcastic tweets and their context. Our approach uses classical supervised algorithms to better understand which elements characterize sarcasm in a social media setting.
We categorize linguistic features, experiment with different combinations, and take context into account when performing our experiments.

Our Approach
Our approach utilizes a combination of complexity, stylometric, and psychological linguistic features to automatically detect the presence or absence of sarcasm in a given text. We intentionally experiment with classical machine learning classification algorithms to get a better understanding of the linguistic features contributing to the sarcasm detection task. Our linguistic intuition is that there will be a discordance between the linguistic features of responses labeled as sarcastic and those of their contexts. Sarcastic tweets are likely to be semantically or emotionally incongruent with their preceding tweets, while non-sarcastic tweets show a greater harmony with their context. To measure the emotional load of a response and its context, we extract a number of sentiment- and emotion-related features. We also look at the distribution of these features across the two classes. Furthermore, we test the performance of our classifier and the importance of our features by considering just the response tweet versus the response with its accompanying context.

Data Set
We use the Twitter Corpus from the CodaLab shared task on sarcasm detection (Ghosh et al., 2020). The training data consists of 2,500 tweets labeled 'SARCASM' and 2,500 tweets labeled 'NON SARCASM'; the balanced test data consists of an additional 1,800 labeled tweets. As described in Ghosh et al. (2020), this is a self-labeled data set in which tweets are annotated as sarcastic based on the hashtags used by their authors. The non-sarcastic tweets are those that do not contain the sarcasm hashtags but may be labeled with either positive or negative sentiment hashtags, such as '#happy'. Retweets, duplicates, quotes, etc., are excluded (see Ghosh et al. 2020 for more details). Each sarcastic and non-sarcastic tweet is accompanied by a hierarchical conversation thread, e.g., context/1 is the immediate context, context/0 is the context that preceded context/1, and so on. The training and test data include up to 19 preceding tweets labeled as context/0, context/1, . . . , context/19 (if available).

Feature Extraction
Our research focuses on the role linguistic features play in sarcasm detection. We classify our features into three categories: complexity, stylistic, and psychological. Abonizio et al. (2020) define complexity features as linguistic features that capture the overall objective of the content at the word and sentence level. Stylistic features use natural language processing techniques to obtain grammatical information and better understand the syntax and style of the document. Psychological features are most closely related to emotions and the cognitive aspects of NLP. We expand on these psychological features by utilizing VAD (Valence, Arousal, Dominance) (Warriner et al., 2013), emotional embeddings, and LIWC (Tausczik and Pennebaker, 2010). Lastly, we use word-level count vectors, word-level tf-idf, n-gram word-level tf-idf, and n-gram character-level tf-idf. We stack these features and refer to them as count vectors for the remainder of this paper.

LIWC
LIWC (Linguistic Inquiry and Word Count) (Tausczik and Pennebaker, 2010) is a text analysis program with a built-in dictionary that counts words in psychologically meaningful categories. After all the words have been reviewed, the module calculates the percentage of words that match each of the dictionary categories. We used LIWC to extract features that detect and categorize the meaning, emotional sentiment, and social relationships of the words in the data set.
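The stacked "count vector" features can be sketched as follows. This is a minimal illustration assuming scikit-learn; the toy tweets and the exact vectorizer settings are our own assumptions, not the paper's configuration.

```python
# Sketch: build word-level counts, word-level tf-idf, word n-gram tf-idf,
# and character n-gram tf-idf, then stack them into one feature matrix.
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

tweets = [
    "Yassss queen, you're so brave and bold.",
    "Congratulations on the well deserved award!",
]

vectorizers = [
    CountVectorizer(),                                        # word-level counts
    TfidfVectorizer(),                                        # word-level tf-idf
    TfidfVectorizer(ngram_range=(2, 3)),                      # word n-gram tf-idf
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # char n-gram tf-idf
]

# Fit each vectorizer and stack the resulting sparse matrices side by side.
features = hstack([v.fit_transform(tweets) for v in vectorizers])
print(features.shape)  # one row per tweet, one column per stacked feature
```

In practice the vectorizers would be fit on the training split only and reused on the test split.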

Valence, Arousal, Dominance (VAD)
VAD (Valence, Arousal, Dominance) (Warriner et al., 2013) includes almost 14,000 lemmas rated on a 1-9 scale according to the emotions evoked by the terms. Valence refers to the pleasantness of the word, arousal determines how dull or exciting the emotion is, and dominance ranges from submission to feeling in control. The VAD dimensions allow us to further explore the affective meanings of tweets and determine their viability as a predictor of sarcasm. We compute VAD scores for each "response" and use the three scores obtained as features in our classifiers. Furthermore, we explore using the scores as a measure of congruity between our responses and contexts. We calculate the VAD scores for each individual response and context and then subtract the respective context scores from the response scores. In other words, if a response receives a valence score of 8 and its context/0 receives a valence score of 2, the valence congruity score would be 6. We hypothesize that sarcastic tweets might show very little affective congruity compared to their non-sarcastic counterparts.
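The congruity computation above can be sketched as follows. The tiny lexicon here is invented for illustration; the paper uses the full Warriner et al. (2013) norms covering roughly 14,000 lemmas.

```python
# Toy sketch of the VAD congruity computation (hand-made lexicon, not the
# real Warriner et al. 2013 ratings).
TOY_VAD = {  # lemma -> (valence, arousal, dominance) on a 1-9 scale
    "happy": (8.0, 6.5, 7.0),
    "brave": (7.5, 6.0, 7.2),
    "hate":  (2.0, 6.0, 4.0),
    "angry": (2.5, 6.3, 5.0),
}

def vad_scores(tokens):
    """Average V, A, D over the tokens found in the lexicon."""
    hits = [TOY_VAD[t] for t in tokens if t in TOY_VAD]
    if not hits:
        return (0.0, 0.0, 0.0)
    return tuple(sum(dim) / len(hits) for dim in zip(*hits))

def vad_congruity(response_tokens, context_tokens):
    """Response score minus context score, per dimension."""
    r, c = vad_scores(response_tokens), vad_scores(context_tokens)
    return tuple(ri - ci for ri, ci in zip(r, c))

# A "happy" response to a "hate"-laden context: valence congruity of 6.0,
# mirroring the 8 - 2 = 6 example in the text.
print(vad_congruity(["so", "happy"], ["i", "hate", "this"]))
```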

VADER
VADER (Valence Aware Dictionary and sEntiment Reasoner) (Hutto and Gilbert, 2015) is a lexicon and rule-based tool built especially for sentiment analysis of social media texts. VADER maps lexical features to emotions and provides insight into the intensity of such emotions through a series of polarity indices. VADER considers capitalization, punctuation, degree modifiers, emojis, and negations to compute its negative, positive and neutral scores. Furthermore, VADER's compound score provides a normalized, weighted composite score for a given tweet.

Emotional Embeddings
The emotions conveyed in our data set are portrayed through emotional embeddings. Calculating the emotions of the text goes a level deeper than just looking at the word embeddings. Using a pretrained model from Hugging Face (Saravia et al., 2018), we categorize the tweets into six emotions: joy, anger, fear, surprise, sadness, and love. Figure 1 represents an example of the distribution of emotions between response and context/0 in the balanced training data set. The results support our intuition that sarcasm is typically associated with negative emotions. When the context is labeled as "anger", non-sarcastic tweets tend to respond with joy, while sarcastic tweets usually respond with anger. By contrast, when the context is labeled as "joy", non-sarcastic tweets overwhelmingly respond with joy, while sarcastic tweets still largely respond with anger. There are 1,216 instances of the same emotion expressed in both response and context for the non-sarcasm class, and 863 such instances for the sarcasm class. Sarcastic tweets are generally emotionally incongruent with their context, unless a negative emotion, e.g., anger, is involved.
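The same-emotion counts reported above can be sketched as a simple tally over (response, context) label pairs. The pairs below are toy data; in the paper each label would come from the pretrained Saravia et al. (2018) classifier.

```python
# Sketch: count threads whose response and context/0 share an emotion label,
# assuming the six-way labels have already been predicted for each tweet.
from collections import Counter

EMOTIONS = {"joy", "anger", "fear", "surprise", "sadness", "love"}

def congruence_counts(pairs):
    """Tally 'same' vs. 'different' emotion between response and context."""
    counts = Counter()
    for response_emotion, context_emotion in pairs:
        assert response_emotion in EMOTIONS and context_emotion in EMOTIONS
        counts["same" if response_emotion == context_emotion else "different"] += 1
    return counts

# Toy labeled pairs: (response emotion, context/0 emotion)
pairs = [("joy", "joy"), ("anger", "joy"), ("anger", "anger"), ("joy", "sadness")]
print(congruence_counts(pairs))
```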

Tweet-Context Similarity Scores
We use the standard document similarity estimation technique using word embeddings (GloVe, Pennington et al. 2014) and emotional embeddings (Saravia et al. 2018), which consists of measuring the similarity between the vector representations of the two documents. Let $x_1, \ldots, x_m$ and $y_1, \ldots, y_n$ be the emotion (or word embedding) vectors of two documents. Each document is represented by the centroid of its vectors, $C_x = \frac{1}{m}\sum_{i=1}^{m} x_i$ and $C_y = \frac{1}{n}\sum_{i=1}^{n} y_i$. The cosine similarity value between the two documents (e.g., a tweet and its context) is calculated as follows:

$$\mathrm{sim}(C_x, C_y) = \frac{\langle C_x, C_y \rangle}{\|C_x\|\,\|C_y\|}$$

where $\langle x, y \rangle$ denotes the inner product of two vectors $x$ and $y$. We compute two similarity scores: 1) semantic cosine similarity using word embeddings; 2) cosine similarity using emotional embeddings. Our linguistic intuition is that a sarcastic response is going to be semantically or emotionally incongruent with its context, and this is what creates the sarcasm effect.
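The centroid-based similarity can be sketched directly from the formula. The 3-dimensional toy vectors below are placeholders; the paper uses GloVe word embeddings and emotional embeddings in their place.

```python
# Sketch: cosine similarity between the centroids of two documents' vectors.
import numpy as np

def doc_similarity(word_vectors_x, word_vectors_y):
    """sim(C_x, C_y) = <C_x, C_y> / (||C_x|| * ||C_y||)."""
    c_x = np.mean(word_vectors_x, axis=0)  # C_x = (1/m) * sum of x_i
    c_y = np.mean(word_vectors_y, axis=0)  # C_y = (1/n) * sum of y_i
    return float(np.dot(c_x, c_y) / (np.linalg.norm(c_x) * np.linalg.norm(c_y)))

tweet = np.array([[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]])    # toy "response" vectors
context = np.array([[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]])  # toy "context" vectors
print(round(doc_similarity(tweet, context), 3))
```

A low score between a response and its context would signal the incongruity hypothesized above.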

Table 1: A sarcastic thread of tweets.

c/0: It's no secret that this president has routinely targeted religious and ethnic minorities. He has fanned the flames of hate against refugees, Muslims, Africans, immigrants, women and all racial and religious minorities.
c/1: He is routinely and openly hostile to any legitimate Congressional oversight. He has made clear his wanton corruption by soliciting a bribe from a foreign government for his personal political gain.
R: Yassss queen, you're so brave and bold.

Table 1 is an example of a sarcastic tweet whose context/0, context/1 and response received emotions of anger, anger, and joy, respectively. Table 2 represents a non-sarcastic thread of tweets where each message was classified as joy. This indicates that non-sarcastic tweets tend to be more emotionally similar to the preceding context, while sarcastic tweets tend to shift in emotion. As a result, when compared to its contexts, the sarcastic tweet received lower emotional similarity scores than the non-sarcastic tweet.

Feature Analysis
After running all of the features on the training data, we implemented SHAP (SHapley Additive exPlanations) (Lundberg and Lee, 2017) to determine which features are the most important for classification. SHAP is a game-theoretic technique that explains the predictions of our model by assigning each feature a Shapley value; plotting these values ranks the most important features in our model. The features selected by SHAP were used in our experiments and are referred to as our "select linguistic features". The top 20 features SHAP selects contain a combination of character features, such as character count, as well as a number of sentiment features, including VADER scores, emotion scores for both a response and its context, and VAD features.
Experimental Evaluation

Data Preprocessing
Our preprocessing procedure consists of steps to remove noisy and unnecessary data. First, we tokenize and lemmatize the tweets using NLTK (Loper and Bird, 2002). We also remove any instance of "@USER" due to the repetition of this token at the beginning of most tweets. Prior research demonstrated that classifiers did not tend to benefit from large quantities of additional context, and we noticed that a majority of the tweets only contained context/0 and context/1. While we plan to experiment further with additional context layers, in this work we only report on experiments that involve context/0 and context/1. We did not remove any stop words due to the small amount of text in each tweet. We also maintained punctuation and emojis, as they proved to be useful information during the extraction of certain features, such as VADER.
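The preprocessing steps can be sketched as follows. To keep the example self-contained we use a simple regex tokenizer and skip lemmatization; the paper tokenizes and lemmatizes with NLTK instead.

```python
# Rough sketch of the preprocessing step: strip @USER mentions, lowercase,
# and tokenize while keeping punctuation (and emojis); stop words are kept.
import re

def preprocess(tweet):
    tweet = tweet.replace("@USER", " ")
    # Tokens: runs of word characters, or single non-space punctuation marks.
    return re.findall(r"\w+|[^\w\s]", tweet.lower())

print(preprocess("@USER @USER Yassss queen, you're so brave and bold."))
```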

Results
We use a Random Forest classifier and run 21 different experiments, of which the most relevant ones are outlined in Table 3. The baseline scores represent an attention-based LSTM model described in Ghosh et al. (2018) and used in the CodaLab shared task. We look at how each feature performed on just the response versus the response and context. We notice that for the response alone, a combination of all count features and all linguistic features achieves the best F1 score of 67%. This score is further increased to 70% when the context is considered.
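The experimental setup can be sketched as below. The synthetic feature matrix stands in for the real stacked linguistic features, which we do not have access to here; only the classifier and metric mirror the paper's setup.

```python
# Toy sketch of the classification setup: a Random Forest over numeric
# linguistic features (e.g., VADER scores, VAD congruity, character count),
# evaluated with F1. The data below is synthetic, not the shared-task corpus.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Synthetic data: sarcastic tweets (label 1) get lower "congruity" features.
X_sarcastic = rng.normal(loc=-1.0, size=(100, 3))
X_literal = rng.normal(loc=1.0, size=(100, 3))
X = np.vstack([X_sarcastic, X_literal])
y = np.array([1] * 100 + [0] * 100)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print("train F1:", round(f1_score(y, clf.predict(X)), 2))
```

In the real experiments the model would of course be evaluated on the held-out balanced test set rather than on the training data.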

Conclusion
In this paper we explored the role various linguistic features play in computational sarcasm detection. We investigated a combination of text and word complexity features, stylistic features, and psychological features. The results of our experiments indicate that contextual information is crucial for sarcasm detection. We also observed that sarcastic tweets are often incongruent with their context in terms of sentiment or emotional load. Using a Random Forest classifier and the features we extracted, we obtain promising results. Our current work is concerned with combining these observations with transformer-based architectures to achieve better prediction accuracy.