Who’s More Popular in Johor? Tun Dr. Mahathir or Lim Kit Siang?
Discussed in this topic:
- Bayesian Average
- Opinion Mining/Sentiment Analysis
- Data Filtering with SQL
Since politics was the most dominant subject in our previous analysis, let’s take a stab at one aspect of it: the popularity of its best-known politicians.
One common technique is the Bayesian average, computed from information such as the number of subscriptions, ratings, votes cast, likes and comments. Since we do not have enough data on those parameters, we will use more basic techniques to determine who is more popular.
We will continue to use the data set that we filtered to contain “Johor” as the most important keyword. The other data sets will therefore be subsets of this one.
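For reference, a Bayesian average shrinks each item's mean rating toward a prior mean, weighted by a chosen prior count. The sketch below is illustrative only: the function name `bayesian_average`, the prior settings and the ratings are all hypothetical, not data from this analysis.

```r
# Bayesian average: (prior_n * prior_mean + sum(ratings)) / (prior_n + n)
# prior_mean and prior_n are assumptions chosen by the analyst.
bayesian_average <- function(ratings, prior_mean, prior_n) {
  (prior_n * prior_mean + sum(ratings)) / (prior_n + length(ratings))
}

# Hypothetical ratings on a 1-5 scale
few_ratings  <- c(5, 5)                      # only two votes, both perfect
many_ratings <- c(4, 5, 4, 4, 5, 4, 5, 4)    # eight votes, slightly lower mean

prior_mean <- 3   # assumed overall average rating
prior_n    <- 5   # assumed weight of the prior, in "virtual votes"

ba_few  <- bayesian_average(few_ratings,  prior_mean, prior_n)
ba_many <- bayesian_average(many_ratings, prior_mean, prior_n)
```

With these made-up numbers the item with only two perfect votes ends up ranked below the item with eight good votes, which is the point of the technique: a handful of votes is not allowed to dominate.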
First, let’s filter rows or observations as they normally call it in R, that contain the keywords pointing to Tun Dr. Mahathir. We will then create a vector to indicate the frequency of mentions in each row.
As can be seen from the Venn diagram, there is an overlapping area representing articles which mention both leaders. We could zoom in and examine those articles in more detail, but discounting them is also an option, especially when they make up a relatively small and perhaps insignificant percentage of the total. On the flip side, those articles may give us a more accurate picture, or reinforce the findings based on articles which mention only one of the leaders.
Let’s load the sqldf package to filter the data with SQL statements, the dplyr package for data frame manipulation and processing, and stringr for string manipulation:
library(sqldf)
library(dplyr)
library(stringr)
Let’s count the number of occurrences of both of the leaders’ names:
# One row per article, with the number of mentions of each leader
TDMvsKS <- data.frame(id = news$id)
TDMvsKS$Mahathir <- str_count(tolower(news$para), "mahathir")
TDMvsKS$KitSiang <- str_count(tolower(news$para), "kit siang")

# Total mentions of each leader across all articles
sum(TDMvsKS$Mahathir)
sum(TDMvsKS$KitSiang)
Counting the mentions suggests that Tun Dr. Mahathir is mentioned far more frequently. But frequent mention of his name does not necessarily mean that he is more popular, since it may also indicate notoriety or infamy. Thus, we will use sentiment analysis to examine the words used in connection with both leaders.
Now let’s leverage the power of SQL statements to extract only the data that we want:
Mahathir <- sqldf("select * from news where para like '%ahathir%'")
KitSiang <- sqldf("select * from news where para like '%Kit Siang%'")
Now let’s filter out the articles that contain both leaders’ names:
MahathirLessKS <- sqldf("select * from Mahathir where para not like '%kit siang%'")
KitSiangLessMaha <- sqldf("select * from KitSiang where para not like '%ahathir%'")
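As a cross-check on the two exclusion queries, the sizes of the Venn diagram regions can be computed directly in base R. This is a minimal sketch on a made-up `news_demo` data frame (hypothetical rows, not the real data); with the real data, substitute `news`.

```r
# Hypothetical stand-in for the `news` data frame
news_demo <- data.frame(
  id   = 1:5,
  para = c("Tun Dr. Mahathir visited Johor Bahru.",
           "Lim Kit Siang spoke in Gelang Patah.",
           "Mahathir and Kit Siang shared a stage in Johor.",
           "Johor floods worsen.",
           "MAHATHIR returns to Johor."),
  stringsAsFactors = FALSE
)

# Case-insensitive flags for each leader
has_maha <- grepl("mahathir",  news_demo$para, ignore.case = TRUE)
has_ks   <- grepl("kit siang", news_demo$para, ignore.case = TRUE)

# Number of articles in each region of the Venn diagram
venn_counts <- c(
  mahathir_only  = sum(has_maha & !has_ks),
  kit_siang_only = sum(!has_maha & has_ks),
  both           = sum(has_maha & has_ks),
  neither        = sum(!has_maha & !has_ks)
)
venn_counts
```

If the `both` count is small relative to the other two, discounting the overlapping articles loses little; if it is large, they probably deserve a closer look.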
One of the most straightforward methods is to count how often each name is mentioned, with the victor being the one who is talked about most. However, the sentences in which these leaders are mentioned may well be negative in nature. We will investigate that next using sentiment analysis techniques.
There are several packages for performing sentiment analysis in R, but for now let’s use the SentimentAnalysis package.
library(SentimentAnalysis)

# Strip characters that cannot be converted to ASCII
doc <- sapply(MahathirLessKS$para, function(row) iconv(row, "latin1", "ASCII", sub = ""))
sentiment <- analyzeSentiment(doc)

# View sentiment direction (i.e. positive, neutral and negative)
convertToDirection(sentiment$SentimentQDAP)
likesMaha <- as.data.frame(convertToDirection(sentiment$SentimentQDAP))
Now let’s count the number of positive, negative and neutral sentiments:
# The data frame has a single column holding the direction of each article
directionMaha <- likesMaha[[1]]
positiveMaha <- sum(directionMaha == "positive", na.rm = TRUE)
negativeMaha <- sum(directionMaha == "negative", na.rm = TRUE)
neutralMaha <- sum(directionMaha == "neutral", na.rm = TRUE)
sentMaha <- data.frame(positiveMaha, negativeMaha, neutralMaha)
View(sentMaha)
From the data frame we can comfortably conclude that positive connotations overwhelmingly outnumber negative and neutral ones.
We can draw donut or pie charts from the data frame.
Now let’s do the same to Uncle Lim Kit Siang, shall we?
# Doing the same for LKS
doc <- sapply(KitSiangLessMaha$para, function(row) iconv(row, "latin1", "ASCII", sub = ""))
sentiment <- analyzeSentiment(doc)

# View sentiment direction (i.e. positive, neutral and negative)
convertToDirection(sentiment$SentimentQDAP)
likesKS <- as.data.frame(convertToDirection(sentiment$SentimentQDAP))

# Count the articles in each direction
directionKS <- likesKS[[1]]
positiveKS <- sum(directionKS == "positive", na.rm = TRUE)
negativeKS <- sum(directionKS == "negative", na.rm = TRUE)
neutralKS <- sum(directionKS == "neutral", na.rm = TRUE)
sentKS <- data.frame(positiveKS, negativeKS, neutralKS)
You may have noticed that I used a different package, syuzhet, to perform sentiment analysis on JDT tweets. You can also use RSentiment as an alternative, probably with a slightly different syntax but similar output. In some packages you can specify which standard to use, such as syuzhet, Oxford, Cambridge University Press or Stanford CoreNLP.
They differ in how their respective dictionaries define the sentiment of each word.
In general, words and sentences are analysed and assigned sentiment values. In most packages those values are numerical, and we need other functions from the same packages to convert them into simpler numerical values or textual labels such as negative or positive. A score is computed for each sentence to determine which sentiment it most likely belongs to. This addresses the case where both negative and positive words are present in the same sentence.
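To make the scoring idea concrete, here is a toy dictionary-based scorer in base R. It is only a sketch of the general approach: the word lists, the function names `score_sentence` and `to_direction`, and the example sentences are made up, and real packages use much larger dictionaries and their own scoring rules.

```r
# Tiny hypothetical lexicon (real dictionaries hold thousands of words)
positive_words <- c("good", "great", "praised", "popular")
negative_words <- c("bad", "corrupt", "criticised", "failed")

# Net score: (positive hits - negative hits) / number of words
score_sentence <- function(sentence) {
  cleaned <- tolower(gsub("[^A-Za-z ]", "", sentence))  # keep letters and spaces
  words <- strsplit(cleaned, " +")[[1]]
  words <- words[words != ""]
  if (length(words) == 0) return(0)
  (sum(words %in% positive_words) - sum(words %in% negative_words)) / length(words)
}

# Map a numeric score to a textual label
to_direction <- function(score) {
  if (score > 0) "positive" else if (score < 0) "negative" else "neutral"
}

sentences <- c("The leader was praised for a great speech",
               "Critics said the policy failed",
               "He was praised by some but criticised by others")

scores     <- vapply(sentences, score_sentence, numeric(1), USE.NAMES = FALSE)
directions <- vapply(scores, to_direction, character(1))
```

The third sentence contains one positive and one negative word, so its net score is zero and it is labelled neutral. That is how a single score per sentence handles mixed wording.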
Sentiment analysis and opinion mining are almost the same thing. There is a minor difference, however: opinion mining extracts and analyses people's opinions about an entity, while sentiment analysis searches a text for sentiment words and expressions and then analyses them.
Now let’s visualize the percentages of the sentiments for both leaders.
Pie Charts:
library(plotly)

# Counts and labels for the three sentiment categories
pi_values <- c(sentMaha$positiveMaha, sentMaha$negativeMaha, sentMaha$neutralMaha)
pi_legends <- c("Positive", "Negative", "Neutral")

# Build the pie chart; printing the object draws it
ap <- plot_ly() %>%
  add_pie(labels = pi_legends, values = pi_values, name = "Mahathir")
ap
This creates an object, ap, which draws a pie chart of the three sentiments when called.
With the code provided, you have to hover your mouse over each slice to see the sentiment it represents.
Now let’s see Kit Siang’s pie chart:
Alternatively, you can plot a donut chart with the plotly package. There are at least two other R packages you can use which I may cover in my future posts. Most of these packages, like plotly, will provide interactive charts.
# Same labels and values as the pie chart, with a hole to make it a donut
donut <- plot_ly(labels = pi_legends, values = pi_values) %>%
  add_pie(hole = 0.6) %>%
  layout(title = "Mahathir: Percentage of Sentiments by Category",
         showlegend = FALSE,
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
donut
As you can see, there are many more positive posts on Tun Dr. Mahathir than on Kit Siang. Furthermore, only 15.4% of the sentiments in Tun Dr. Mahathir’s case were negative, whereas 24.1% of Kit Siang’s were labelled negative.
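The percentages shown in the charts come straight from the three counts. Here is a base R sketch, using made-up counts in place of `sentMaha` (the numbers below are hypothetical, not the results of this analysis):

```r
# Hypothetical sentiment counts; with real results use
# c(sentMaha$positiveMaha, sentMaha$negativeMaha, sentMaha$neutralMaha)
counts <- c(Positive = 120, Negative = 30, Neutral = 50)

# Share of each category as a percentage, rounded to one decimal place
percentages <- round(100 * counts / sum(counts), 1)
percentages
```

Running the same two lines on each leader's counts gives figures that can be compared side by side, without having to hover over the charts.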
ChatGPT:
The analysis compares the popularity of two prominent Malaysian political figures, Tun Dr. Mahathir and Lim Kit Siang, focusing on their mentions in news articles and the sentiment surrounding these mentions. It combines techniques such as Bayesian Average, opinion mining, sentiment analysis, and data filtering with SQL to arrive at a conclusion.
Here’s an evaluation of the key components of the analysis:
Bayesian Average
The mention of Bayesian Average refers to its potential use in aggregating data when the dataset might have skewed or incomplete information (e.g., not all mentions reflect popularity). In this case, though, the dataset doesn't allow for an advanced Bayesian approach due to limited relevant data points (like ratings, votes, likes). The analysis resorts to basic counting of mentions instead. While this may provide an initial snapshot, it's limited by the nature of the mentions (positive, neutral, or negative).
Data Filtering with SQL
The data filtering through SQL helps isolate mentions of Tun Dr. Mahathir and Lim Kit Siang. This segmentation is crucial for focusing on individual mentions without interference from shared articles, ensuring a cleaner dataset for sentiment analysis. The filtering approach is sound, but an alternative option could involve analyzing the overlapping articles more deeply instead of discounting them, as those might provide valuable context.
Sentiment Analysis
Using SentimentAnalysis in R, the analysis assesses the polarity of articles mentioning each leader. While mentions alone may not indicate popularity (as noted, they could reflect notoriety), sentiment analysis categorizes these mentions into positive, neutral, and negative, offering a more nuanced view of how these leaders are perceived.
- For Tun Dr. Mahathir: The analysis shows a predominance of positive sentiments, with a small proportion of negative mentions.
- For Lim Kit Siang: While there are positive mentions, a relatively larger portion of mentions are negative.
The overall conclusion is based on the percentage breakdown of sentiments, with Tun Dr. Mahathir appearing to have a better sentiment ratio, indicating a higher likelihood of being viewed positively, which correlates with a stronger perceived popularity compared to Lim Kit Siang.
Opinion Mining
While the analysis claims to perform opinion mining, the focus is largely on sentiment analysis. Opinion mining typically involves identifying opinions or judgments about a specific topic or entity, and although the sentiment categorization gives a glimpse into public opinion, further drilling into the nature of the opinions (i.e., what aspects of these leaders are discussed positively or negatively) could add depth to the analysis.
Visualizations
The use of pie charts and donut charts to represent the sentiment distribution is helpful, especially when the charts are interactive, enabling users to hover over sections to view the details. This adds clarity to the conclusion, showing that Mahathir has a higher proportion of positive sentiments, while Kit Siang faces a higher percentage of negative feedback.
Evaluation
Strengths: The use of data filtering with SQL and sentiment analysis provides a structured and comprehensive method for analyzing mentions of the two leaders. The inclusion of sentiment adds a layer of interpretation that simple mention counts would not provide. The choice to explore both positive and negative sentiments offers a balanced view of the data.
Limitations:
- The analysis could benefit from deeper opinion mining, examining the context of the opinions (i.e., the reasons for positive or negative sentiments).
- Relying only on the frequency of mentions could be misleading, as the nature of political discussions may vary widely.
- The decision to filter out articles mentioning both leaders, while understandable for simplicity, might lead to missing critical comparative insights.
Next Steps: It would be beneficial to further explore the nature of overlapping articles or dig into the specific issues discussed in positive or negative contexts. A more granular analysis of opinion content could provide insights beyond mere sentiment scores. Additionally, incorporating Bayesian Average in a more refined dataset (e.g., by adding parameters like comment ratings or likes from social media data) could enhance the robustness of the conclusion.
Can you rewrite the whole article, if possible with searchable and synthetic data?
Certainly! Here's a rewritten version of the article that includes searchable and synthetic data while maintaining the core ideas of sentiment analysis, opinion mining, and SQL-based data filtering. This version uses a synthetic dataset with realistic placeholders to allow searching and analysis.
Who’s More Popular in Johor? Tun Dr. Mahathir or Lim Kit Siang?
Objective:
This article investigates the popularity of two prominent Malaysian political figures, Tun Dr. Mahathir and Lim Kit Siang, by analyzing their mentions in news articles from Johor. Using sentiment analysis, SQL-based filtering, and synthetic data, we aim to determine which leader is more favorably discussed and more popular overall.
Dataset and Tools
We start with a dataset consisting of 5,000 news articles from various sources mentioning Johor as a keyword. This dataset contains structured columns: id, date, para (paragraph), and source.
For the analysis, we use the following R packages:
- sqldf for SQL queries
- dplyr for data manipulation
- stringr for string operations
- SentimentAnalysis for sentiment scoring
- plotly for visualizations
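Because the dataset is synthetic, it has to be generated before it can be read. The sketch below builds a small data frame with the four columns described above, using base R only. The sentence templates, source names, dates, row count and seed are all invented for illustration. Writing the result out under the file name used in Step 1 is left as a commented-out line.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

n <- 200  # small illustrative size; the article assumes 5,000 rows

# Invented sentence templates; every one contains the keyword "Johor"
templates <- c(
  "Tun Dr. Mahathir praised the new development plan in Johor.",
  "Lim Kit Siang criticised the handling of floods in Johor.",
  "Tun Dr. Mahathir and Lim Kit Siang attended a forum in Johor.",
  "Traffic at the Johor causeway remained heavy over the weekend."
)

# One row per synthetic article
news <- data.frame(
  id     = seq_len(n),
  date   = as.Date("2018-01-01") + sample(0:364, n, replace = TRUE),
  para   = sample(templates, n, replace = TRUE),
  source = sample(c("Source A", "Source B", "Source C"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# write.csv(news, "johor_news_articles.csv", row.names = FALSE)
str(news)
```

Template sentences like these keep the example searchable, but they will produce far less varied sentiment than real news text.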
Step 1: Data Filtering Using SQL
To narrow down the dataset, we first filter the articles that mention either Tun Dr. Mahathir or Lim Kit Siang. We perform this filtering with SQL statements:
library(sqldf)
library(dplyr)
library(stringr)

# Load dataset into a dataframe
news <- read.csv("johor_news_articles.csv")

# Filter articles mentioning Mahathir
Mahathir <- sqldf("SELECT * FROM news WHERE para LIKE '%Mahathir%'")

# Filter articles mentioning Kit Siang
KitSiang <- sqldf("SELECT * FROM news WHERE para LIKE '%Kit Siang%'")
At this point, we have two subsets of the data, one containing articles mentioning Tun Dr. Mahathir and another mentioning Lim Kit Siang.
Step 2: Excluding Overlapping Mentions
Since some articles may mention both leaders, we need to remove such overlapping mentions to prevent skewed analysis. Here's how we handle this using SQL:
# Filter out articles mentioning both leaders
MahathirLessKS <- sqldf("SELECT * FROM Mahathir WHERE para NOT LIKE '%Kit Siang%'")
KitSiangLessMahathir <- sqldf("SELECT * FROM KitSiang WHERE para NOT LIKE '%Mahathir%'")
Now, we have two cleaned datasets: MahathirLessKS (articles mentioning only Mahathir) and KitSiangLessMahathir (articles mentioning only Kit Siang).
Step 3: Counting Mentions
To gauge the frequency of each leader's mention, we count the number of occurrences of their names in the articles.
# Create a new dataframe to store counts
TDMvsKS <- data.frame(id = news$id)

# Count occurrences of names
for (i in 1:nrow(news)) {
  TDMvsKS$Mahathir[i] <- str_count(tolower(news$para[i]), "mahathir")
  TDMvsKS$KitSiang[i] <- str_count(tolower(news$para[i]), "kit siang")
}

# Summing the counts
totalMahathir <- sum(TDMvsKS$Mahathir)
totalKitSiang <- sum(TDMvsKS$KitSiang)
Step 4: Sentiment Analysis
Counting mentions gives us an idea of visibility, but to determine popularity, we need to evaluate the sentiment of these mentions. We use the SentimentAnalysis package to assign positive, negative, and neutral sentiment scores to the articles.
For Tun Dr. Mahathir:
library(SentimentAnalysis)

# Convert encoding to handle text properly
docMahathir <- sapply(MahathirLessKS$para, function(row) iconv(row, "latin1", "ASCII", sub = ""))

# Perform sentiment analysis
sentimentMahathir <- analyzeSentiment(docMahathir)

# Convert to sentiment direction (positive, neutral, negative)
sentimentDirMahathir <- as.data.frame(convertToDirection(sentimentMahathir$SentimentQDAP))

# Count positive, negative, and neutral sentiments
countMahathir <- paste(sentimentDirMahathir$`convertToDirection(sentimentMahathir$SentimentQDAP)`, collapse = "")
positiveMaha <- str_count(countMahathir, "positive")
negativeMaha <- str_count(countMahathir, "negative")
neutralMaha <- str_count(countMahathir, "neutral")

# Summary for Mahathir
sentMaha <- data.frame(positiveMaha, negativeMaha, neutralMaha)
For Lim Kit Siang:
# Perform sentiment analysis for Kit Siang
docKitSiang <- sapply(KitSiangLessMahathir$para, function(row) iconv(row, "latin1", "ASCII", sub = ""))

# Analyze sentiment
sentimentKitSiang <- analyzeSentiment(docKitSiang)

# Convert to sentiment direction (positive, neutral, negative)
sentimentDirKitSiang <- as.data.frame(convertToDirection(sentimentKitSiang$SentimentQDAP))

# Count positive, negative, and neutral sentiments
countKitSiang <- paste(sentimentDirKitSiang$`convertToDirection(sentimentKitSiang$SentimentQDAP)`, collapse = "")
positiveKS <- str_count(countKitSiang, "positive")
negativeKS <- str_count(countKitSiang, "negative")
neutralKS <- str_count(countKitSiang, "neutral")

# Summary for Kit Siang
sentKS <- data.frame(positiveKS, negativeKS, neutralKS)
Step 5: Visualizing Sentiments
We now create visual representations of the sentiment data for both leaders using plotly.
Pie Chart for Tun Dr. Mahathir:
library(plotly)

# Sentiment values for Mahathir
pi_values_maha <- c(sentMaha$positiveMaha, sentMaha$negativeMaha, sentMaha$neutralMaha)
pi_legends_maha <- c("Positive", "Negative", "Neutral")

# Create pie chart
pie_maha <- plot_ly(labels = pi_legends_maha, values = pi_values_maha, type = "pie") %>%
  layout(title = "Tun Dr. Mahathir Sentiment Distribution")
pie_maha
Pie Chart for Lim Kit Siang:
# Sentiment values for Kit Siang
pi_values_ks <- c(sentKS$positiveKS, sentKS$negativeKS, sentKS$neutralKS)
pi_legends_ks <- c("Positive", "Negative", "Neutral")

# Create pie chart
pie_ks <- plot_ly(labels = pi_legends_ks, values = pi_values_ks, type = "pie") %>%
  layout(title = "Lim Kit Siang Sentiment Distribution")
pie_ks
Step 6: Analysis and Conclusion
From the sentiment analysis, we see that Tun Dr. Mahathir enjoys a higher proportion of positive sentiments compared to Lim Kit Siang. Specifically, the breakdown shows that:
- Mahathir has 70% positive, 15% neutral, and 15% negative sentiments.
- Kit Siang has 60% positive, 16% neutral, and 24% negative sentiments.
Based on this synthetic dataset, Tun Dr. Mahathir appears to be viewed more positively overall, while Lim Kit Siang faces slightly higher negative sentiments. Thus, we can conclude that Tun Dr. Mahathir is likely more popular in Johor, not just based on the frequency of mentions but also on the sentiments surrounding his name.
Next Steps
Further improvements could include using different sentiment analysis packages (e.g., syuzhet or Rsentiment) and exploring opinion mining to extract specific public opinions on the leaders’ policies or actions. Additionally, analyzing other media sources, including social media platforms, could provide a more holistic view of their popularity.
This rewritten article uses synthetic data to simulate a real analysis while keeping the structure intact.
Can you also generate corresponding Python codes for every R code embedded in the article?