A Comparative Study of Sentiment Analysis on Mask-Wearing Practices during the COVID-19 Pandemic

COVID-19 has become one of the most widely discussed subjects of recent times. Countries have taken many measures to prevent the spread of the virus, guided by international recommendations, which led to many disputes concerning wearing a face mask as a preventive measure against the virus. This study aims to assess and compare the overall accuracy, macro precision, macro recall, and macro F-measure of different decision models applied to COVID-19 mask-wearing practices via sentiment analysis. Tweets are labeled, and text pre-processing techniques such as stemming, normalization, tokenization, and stop-word removal are applied. Subsequently, the tweets are transformed into master feature vectors by applying various feature extraction, feature representation, feature selection, and word embedding techniques with five supervised machine learning decision models to predict sentiment toward mask-wearing practices from Twitter tweets. The highest macro F-measure and macro precision are found with hybrid-grams as feature extraction, TF-IDF as feature representation, and the Chi-squared test as feature selection; the highest macro recall is found with BOW as feature extraction, TF-IDF as feature representation, and the ANOVA F-value as feature selection. Hence, this study concludes that the Naive Bayes (NB) algorithm outperforms the other decision models when master feature vectors are applied, and that the master feature vectors also outperform the word embedding techniques.

being under pressure [4]. Wearing a face mask helps prevent an infected person from transmitting the disease to a healthy person, or protects the wearer by reducing the probability of infection. Wearing a face mask has been a controversial topic, and there have been a variety of suggestions. However, on April 3, 2020, the CDC (Centers for Disease Control and Prevention) of the US recommended that the public wear face masks [5]. Mask-wearing practices proved successful in the past, lessening the virus's threat during the 2003 SARS epidemic [6]. Social media has gained popularity in recent years. It is regularly used by the elderly, teenagers, academics, politicians, and entrepreneurs, irrespective of age or gender. Twitter is one of the most commonly used micro-blogging sites today. It confines users to tweets of 140 characters, yet allows variant spellings, improper use of punctuation marks, slang, and emojis [7]. This has led many researchers to work on sentiment analysis, a part of natural language processing whose primary focus is identifying and classifying sentiments or opinions expressed in text. Sentiments can be understood as the emotions expressed toward a service, brand, or campaign on social media platforms, and analyzing them via comments is a powerful way to understand how a brand, service, or campaign is perceived [8]. Sentiment analysis has been widely applied in many studies because social media users are free to express their opinions, thoughts, and suggestions on any viral or sensational issue [9]. Hence, it helps identify their impressions of certain topics and current events. Determining whether a given opinion carries positive or negative sentiment is considered a difficult task, since the features extracted from a sentence must carry strong sentiment cues, such as adjectives, to classify it.
Moreover, sentiment analysis helps to gauge satisfaction with goods or services before they are purchased; companies and corporations use this information to understand their products and selling patterns [10]. In sentiment analysis, supervised and unsupervised machine learning algorithms play a vital role. The data is split into training and test sets, and the training data is used to build a classifier that classifies the sentiments of given data [10]. If the instances are given as labeled, the task falls into the category of supervised learning [11].
The study presented in this paper aspires to achieve three main objectives. Firstly, to compare and contrast the performance measures of five supervised machine learning decision models, namely SVM (Support Vector Machine), NB (Naive Bayes), RF (Random Forest), KNN (K-Nearest Neighbour), and MaxEnt (Maximum Entropy). Secondly, to evaluate the best feature value representation with feature extraction as BOW (bag-of-words) and N-grams; within N-grams, three techniques are compared, i.e., bigram (N=2), trigram (N=3), and hybrid-grams (BOW + bigram + trigram). In feature representation, Binary Frequency (BR), TF (Term Frequency), TF-IDF (Term Frequency-Inverse Document Frequency), and CountVectorizer are used to convert the text into vectors, and in feature selection, three techniques are used, i.e., the Chi-squared test, information gain, and ANOVA F-value. Thirdly, to compare feature value representation against word embedding techniques, applying GloVe and Word2Vec embeddings. The corpus is developed from Twitter tweets by eliminating retweets and tweet duplicates, with pre-processing techniques filtering unwanted symbols, special characters, null values, and stop words and correcting misspelled words. It is then applied in the training model with 1200 tweets labeled as negative and positive, 600 tweets per class.
The rest of the paper is organized into the following sections. Related work is discussed in section 2. Section 3 discusses the implementation carried out in the research methodology. Section 4 includes the results. Section 5 is for the discussion of the study. Finally, the conclusion of the study is presented in section 6.

Related Work
COVID-19 is a newly emerged disease and therefore a new research area, and few studies on sentiment analysis of it had been conducted by the time this study was piloted. However, there have been some studies on sentiment analysis with supervised machine learning algorithms which, to a certain extent, are related to our study.
In Anjaria and Guddeti [12], the authors use Twitter as the primary data source. It is a popular platform that offers a quick and effective way to collect sentiments; the sentiment of a sentence is derived from the occurrence of words that can be positive or negative based on their tone [13]. The study applies several feature extraction techniques, such as unigram, bigram, and a (unigram + bigram) hybrid, and the results indicate that SVM with the hybrid approach gives better accuracy than NB, MaxEnt, and ANN. The study presents an overview of collecting tweets, cleaning data, and extracting features to predict the outcome.
Sethi et al. [14] conduct a study to analyze the emotions of Twitter users via tweets related to COVID-19. The proposed model is used to analyze the actual sentiments of the tweets. The dataset for the study is manually created via the Twitter API with the COVID19 and coronavirus hashtags. The study uses MaxEnt, MNB (Multinomial Naive Bayes), decision tree, RF, XGBoost, and SVM algorithms with N-gram representations as unigram, bigram, and unigram + bigram + trigram. The study concludes that the SVM and decision tree classifiers perform better, with SVM considered more stable during the experimental process.
Eshan and Hasan [15] pilot a study to detect abusive Bengali text using machine learning algorithms, with unigram, bigram, and trigram as feature extraction techniques and CountVectorizer and TF-IDF for feature representation. The study uses RF, SVM with polynomial and linear kernels, and MNB as algorithms, and finds the SVM linear kernel with TF-IDF to perform better than with CountVectorizer; however, the study does not apply any feature selection techniques. Comparative research by Mujtaba et al. [16] identifies the cause of death (COD) from autopsy reports. The study uses unigram, bigram, trigram, and hybrid-grams for feature extraction, representing the feature values using BR, TF, and TF-IDF. Furthermore, the Chi-squared test, information gain, and Pearson correlation approaches are used for feature selection. The study applies five popular decision models, i.e., SVM, J48, RF, NB, and KNN, to evaluate the performance of each classifier, and states that SVM performs better than the other decision models. Moreover, the study presents a complete methodological structure from data collection to evaluation metrics and confirms the classification accuracy under several feature value representations.
Samuel et al. [17] conduct a study on the progression of fear sentiment over the peak period of COVID-19 in the United States. The study finds a classification accuracy of 91% for short tweets with the NB decision model, while MaxEnt performs similarly well with an accuracy of 74% on short tweets; both models perform weaker on longer tweets. The study applies four decision models, i.e., NB, MaxEnt, Linear Regression, and K-Nearest Neighbor, albeit without any feature selection technique to select the best features from the document of words. Sharma and Daniels [18] conduct a study executing sentiment analysis on 2019 election Twitter tweets with Word2Vec word embedding and the RF classification model, showing improved accuracy compared to other feature representation schemes, i.e., BOW and TF-IDF. Hence, the study proposes that Word2Vec embedding improves the quality of features, signifying the importance of word embeddings in sentiment analysis. However, to the best of our knowledge, there is no existing study related to mask-wearing practices during the COVID-19 pandemic using sentiment analysis.

Methodology
The methodology of the study outlines the scheme for analyzing the sentiments towards mask-wearing practices during the COVID-19 pandemic. The study is divided into nine main steps, as illustrated in Figure 1 (Methodology of the Study). Each step carries certain tasks, discussed in the subsequent sections. The study uses supervised machine learning decision models trained and evaluated on training and test sets; we have taken 60% of the dataset for training and the remaining 40% for testing. The sentiment of each instance is manually labeled with one of two categories, which is used to create the prediction model. The model is then responsible for predicting the label for the unlabeled test set.
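The 60/40 split described above can be sketched with scikit-learn; the placeholder texts and labels below are illustrative, not the study's corpus.

```python
# Minimal sketch of the 60/40 stratified train/test split on a balanced
# 1200-tweet corpus (600 per class); the data itself is a placeholder.
from sklearn.model_selection import train_test_split

texts = [f"tweet {i}" for i in range(1200)]   # placeholder corpus
labels = [i % 2 for i in range(1200)]         # 600 positive, 600 negative

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, train_size=0.6, stratify=labels, random_state=42
)
print(len(X_train), len(X_test))  # 720 480
```

Stratifying on the labels keeps the 50/50 class balance intact in both splits.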

Data Collection
This study uses Twitter tweets focusing on mask-wearing practices during the COVID-19 pandemic as its primary data. Tweets are collected via the Twitter API using the Python Tweepy library [19]; we created a Twitter developer account to extract the tweets. The 5000 gathered tweets each contain the user ID, username, tweet text, and location of the tweet posted by the user. Tweets are gathered using the following three main criteria.
1) Tweets published during the COVID-19 pandemic are selected for the study. 2) Tweets are selected with relevant hashtags and search queries related to mask-wearing practices. 3) Tweets are extracted from the Twitter API as extended, untruncated tweet text.
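As a hedged sketch of criteria 2 and 3 (criterion 1 is enforced by the collection dates), fetched records might be filtered as follows; the `full_text` field matches Tweepy's extended tweet mode, while the hashtag list and record structure are illustrative assumptions.

```python
# Hypothetical query hashtags for mask-wearing practices (illustrative only).
HASHTAGS = {"#maskup", "#wearamask", "#facemask"}

def matches_criteria(tweet: dict) -> bool:
    """Keep original (non-retweet) tweets mentioning a relevant hashtag."""
    text = tweet.get("full_text", "")   # extended, untruncated tweet text
    if text.startswith("RT @"):        # drop retweets
        return False
    return any(tag in text.lower() for tag in HASHTAGS)

raw = [
    {"full_text": "Please #WearAMask in public spaces"},
    {"full_text": "RT @user: #wearamask everyone"},
    {"full_text": "Nice weather today"},
]
kept = [t for t in raw if matches_criteria(t)]
print(len(kept))  # 1
```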

Data Pre-processing and Data Labeling
The collected Twitter tweets are then moved to the pre-processing phase, where several techniques are applied to clean the data. The data contains several retweets from different users; hence, we keep one tweet and remove the duplicates. Then, unwanted characters, punctuation marks, and numerical values are removed, and tweets are converted into lowercase. Empty or null tweets are removed and misspelled words are corrected. Finally, stop words are filtered from the tweets, as shown in Figure 2.
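A minimal sketch of these cleaning steps, assuming a small illustrative stop-word list rather than NLTK's full list:

```python
import re

# Illustrative subset of stop words; a real run would use NLTK's list.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of"}

def clean_tweet(text: str) -> str:
    """Strip punctuation and digits, lowercase, and drop stop words."""
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    tokens = text.lower().split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)

tweets = [
    "Masks are GREAT!!! 100%",
    "Masks are GREAT!!! 100%",      # duplicate to be removed
    "I refuse to wear a mask.",
]
unique = list(dict.fromkeys(tweets))  # keep first occurrence, drop duplicates
cleaned = [clean_tweet(t) for t in unique]
print(cleaned)  # ['masks great', 'i refuse wear mask']
```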
After the pre-processing step, tweets are labeled with two classes, positive or negative, based on whether the posting user supports mask-wearing practices. A total of 1200 tweets are labeled across the positive and negative sentiments, as shown in Table 1; labeling an equal number of tweets for each class avoids the issue of data imbalance.

Feature Extraction & Feature Representation
The labeled data is transformed into vectors by applying two feature extraction strategies: i) BOW representation, and ii) N-gram word representation, producing sparse data with bigram, trigram, and hybrid-gram sequences of words. Scikit-learn provides several vectorizer objects to perform these strategies efficiently, and this study uses four feature-value representations: 1) CountVectorizer: one of the most basic ways to tokenize a collection of texts and build a vocabulary of known words. It counts the number of times a token appears in a document and uses this value as its weight [21]. 2) TF-IDF: TF-IDF (term frequency-inverse document frequency) is used to highlight the words that are more significant to a document within a collection of documents. IDF (inverse document frequency) weighs the importance of a term; words that appear in nearly every document, such as 'this', 'the', 'what', and 'if', rank lower because they add little significance to the document [22]. 3) TF: a feature representation technique describing how often a term occurs in a document, where terms are words, phrases, or tokens in a text [23]. 4) BR: the binary feature-value representation reflects only the presence of a feature (i.e., 1 for presence and 0 for absence), disregarding how often the term occurs in the document [24].
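In scikit-learn, the four representations can be approximated as below; `CountVectorizer(binary=True)` stands in for BR and `TfidfVectorizer(use_idf=False)` for plain TF (note scikit-learn L2-normalises these by default), which is a plausible mapping rather than the authors' exact code.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["masks save lives", "masks masks everywhere", "no mask no entry"]

counts = CountVectorizer().fit_transform(docs)               # raw token counts
binary = CountVectorizer(binary=True).fit_transform(docs)    # BR: 1/0 presence
tf     = TfidfVectorizer(use_idf=False).fit_transform(docs)  # TF (L2-normalised)
tfidf  = TfidfVectorizer().fit_transform(docs)               # TF-IDF weights

print(counts.shape)  # (3, 7): 3 documents, 7 vocabulary terms
```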

Feature Selection
This study then applies several feature selection techniques to gather informative features by excluding non-informative features from the sparse feature vector, which helps to reduce the feature size. A large number of features may produce less accurate models and tends to increase computational time and complexity [25]. In this study, three main feature selection techniques are applied to compare and contrast the performance of each decision model. The feature selection techniques are as follows.

1) Chi-Squared Test:
It is a statistical test used to determine whether there is a statistically significant difference between the expected and observed frequencies of features in the feature vector [25][26]. 2) Information Gain: an entropy-based feature selection method that measures the importance of a term for classification by the amount of information it contributes to the feature vector [27]. 3) ANOVA F-Value: Analysis of Variance (ANOVA) is a statistical method for comparing the means of two or more groups in a feature vector. It helps reduce the number of features by selecting the important ones, which significantly improves the computation time and accuracy of the decision models [28].
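All three selectors are available through scikit-learn's `SelectKBest`; `mutual_info_classif` is used here as an information-gain analogue, and the toy matrix is illustrative.

```python
import numpy as np
from sklearn.feature_selection import (
    SelectKBest, chi2, f_classif, mutual_info_classif,
)

# Toy data: 6 documents, 3 features, binary sentiment labels.
X = np.array([[1, 0, 3], [0, 1, 2], [2, 0, 4], [0, 2, 1], [3, 0, 5], [0, 3, 0]])
y = np.array([1, 0, 1, 0, 1, 0])

for score_fn in (chi2, f_classif, mutual_info_classif):
    X_best = SelectKBest(score_fn, k=2).fit_transform(X, y)
    print(score_fn.__name__, X_best.shape)  # each keeps the 2 strongest features
```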

Decision Models
There are numerous supervised machine learning decision models available for sentiment analysis [29]. However, each decision model has its own learning process, so picking the most applicable decision models for the study is critical. This study mainly uses five decision models, i.e., SVM (with linear and polynomial kernels), NB, MaxEnt, RF, and KNN, applied on the master feature vectors to find the best performing decision model.
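A plausible scikit-learn instantiation of these models (MaxEnt corresponds to `LogisticRegression`; the hyper-parameters are illustrative assumptions, not the study's settings):

```python
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Five decision models; SVM contributes two kernels (linear and polynomial).
models = {
    "SVM (linear)": SVC(kernel="linear"),
    "SVM (poly)": SVC(kernel="poly", degree=2),
    "NB": MultinomialNB(),
    "MaxEnt": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=100),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}
# Each model is later fitted on a master feature vector:
#     model.fit(X_train, y_train); model.predict(X_test)
```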

Creating Master Feature Vectors
The tweets are pre-processed with the support of the Natural Language Toolkit (NLTK) [30]. After pre-processing, features are extracted using BOW and N-grams as bigram, trigram, and hybrid-grams (BOW + bigram + trigram). The study then implements feature representation by applying BR, TF, TF-IDF, and CountVectorizer. Moreover, the best 1000 features are filtered from the feature vector using the Chi-squared test, information gain, and ANOVA F-value. Based on these feature extraction, feature representation, and feature selection criteria, 48 master feature vectors are formed, as shown in Table 2.
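One of the 48 master feature vector configurations, hybrid-grams with TF-IDF weighting, Chi-squared selection of the best 1000 features, and an NB classifier, can be expressed as a scikit-learn `Pipeline`; the parameters are a plausible reading of the setup, not the authors' exact code.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 3))),  # hybrid-grams: uni+bi+trigram
    ("chi2", SelectKBest(chi2, k=1000)),             # keep the best 1000 features
    ("nb", MultinomialNB()),                         # NB decision model
])
# Usage on a labeled corpus:
#     pipeline.fit(train_texts, train_labels)
#     predictions = pipeline.predict(test_texts)
```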

Applying Word Embedding Techniques
This study applies two word embedding techniques apart from the master feature vectors. Word2Vec and GloVe represent words in vector form (embeddings). Word2Vec embeddings are learned by training a shallow feed-forward neural network, while GloVe embeddings are learned via matrix factorization techniques [31][32].
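A hedged sketch of how such embeddings feed a classifier: each tweet is mapped to the average of its word vectors. Real Word2Vec/GloVe vectors would normally be loaded from a trained model (e.g. via gensim); the 4-dimensional toy vectors below are made up.

```python
import numpy as np

# Toy 4-dimensional word vectors; real GloVe/Word2Vec vectors are typically
# 100-300 dimensional.
toy_vectors = {
    "masks": np.array([0.2, 0.1, 0.0, 0.5]),
    "save":  np.array([0.4, 0.0, 0.3, 0.1]),
    "lives": np.array([0.1, 0.2, 0.2, 0.4]),
}

def tweet_embedding(tokens, vectors, dim=4):
    """Average the vectors of known tokens; zero vector if none are known."""
    vecs = [vectors[t] for t in tokens if t in vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

emb = tweet_embedding(["masks", "save", "lives"], toy_vectors)
print(emb.shape)  # (4,)
```

The resulting dense tweet vectors replace the sparse master feature vectors as classifier input.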

Results
The experiments show various findings for each portion of the methodology. Train and test accuracy, along with macro precision, macro recall, and macro F-measure, are reported for the word embedding techniques and the master feature vectors, comprising 288 analyses across 4 feature extraction and 4 feature representation schemes with 5 classification algorithms, as shown in Table 3. For the master feature vectors, the lowest macro precision (0.481), macro recall (0.483), and macro F-measure (0.347) were recorded, whereas the highest macro precision (0.863), macro recall (0.837), and macro F-measure (0.838) were achieved by the NB algorithm. Overall, NB performed better than the other algorithms. With word embedding techniques, the lowest macro precision (0.578), macro recall (0.578), and macro F-measure (0.578) were obtained by the NB algorithm, and the highest macro precision (0.652), macro recall (0.653), and macro F-measure (0.652) were achieved by the linear SVM algorithm.
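The macro-averaged metrics reported above can be computed with scikit-learn; the label arrays below are illustrative, not the study's predictions.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Illustrative true and predicted sentiment labels (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Macro averaging computes each metric per class, then takes the unweighted mean.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro"
)
acc = accuracy_score(y_true, y_pred)
print(round(acc, 3), round(precision, 3), round(recall, 3), round(f1, 3))
# 0.75 0.75 0.75 0.75
```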

Discussion
This study indicates that hybrid-grams and BOW perform well compared to bigram or trigram when applying the Chi-squared test, information gain, and ANOVA F-value as feature selection. The NB decision model performs significantly better than the other algorithms. Next to NB, MaxEnt performs well across all feature extraction, feature selection, and feature representation criteria mentioned in the study, although it does not perform well with trigram as feature extraction.
When assessing the Chi-squared test, information gain, and ANOVA F-value, the Chi-squared test gives better performance than information gain and the ANOVA F-value. In terms of feature representation, CountVectorizer performs comparatively well against the other feature representation schemes. This study shows that the KNN algorithm does not perform well under any of the three feature selection techniques compared to the other algorithms. The study finds the highest macro F-measure and macro precision with hybrid-grams as feature extraction, TF-IDF as feature representation, and the Chi-squared test as feature selection, while the highest macro recall is achieved with BOW as feature extraction, TF-IDF as feature representation, and the ANOVA F-value as feature selection, when applying NB as the decision model.
This study also uses word embedding techniques to assess the performance of the algorithms. Linear SVM performs well compared to the other algorithms, although the highest macro precision, macro recall, and macro F-measure values are considerably lower than those of the feature vector representation schemes. The NB algorithm has the lowest accuracy with both GloVe and Word2Vec word embedding techniques, as shown in Figures 3-5.