Wednesday, October 2, 2024

Who’s More Popular in Johor? Tun Dr. Mahathir or Lim Kit Siang?

Discussed in this topic:

  1. Bayesian Average
  2. Opinion Mining/Sentiment Analysis
  3. Data Filtering with SQL

Since politics was the most dominant subject in our previous analysis, let’s take a stab at one aspect of it: the popularity of its most prominent politicians.

One fairly common technique is the Bayesian Average, which is built on information such as the number of subscriptions, ratings, votes cast, likes and comments. Since we do not have ample data on those parameters, we will use more basic techniques to determine who is more popular.
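
For readers who have not met the Bayesian Average before, here is a minimal sketch of the idea in Python. The ratings, the prior mean and the prior weight are all made-up assumptions for illustration; none of them come from our data set.

```python
def bayesian_average(ratings, prior_mean, prior_weight):
    """Shrink the mean of `ratings` toward `prior_mean`.

    prior_weight behaves like a number of "phantom" ratings,
    each equal to prior_mean, added before averaging.
    """
    n = len(ratings)
    return (prior_weight * prior_mean + sum(ratings)) / (prior_weight + n)


# Hypothetical ratings on a 1-5 scale (illustrative only)
few_votes = [5, 5]                          # two perfect scores
many_votes = [4, 5, 4, 4, 5, 4, 4, 5, 4, 4]  # ten good scores

prior_mean = 3.0   # assumed overall average rating
prior_weight = 5   # assumed strength of the prior

score_few = bayesian_average(few_votes, prior_mean, prior_weight)
score_many = bayesian_average(many_votes, prior_mean, prior_weight)
```

With these assumed numbers, the item with ten good ratings ends up ranked above the item with only two perfect ratings, which is exactly the correction the technique is meant to provide.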

We will still use the data set which we had filtered to contain “Johor” as the most important keyword. The other data sets will therefore be subsets of this one.

First, let’s filter the rows (or observations, as they are normally called in R) that contain the keywords pointing to Tun Dr. Mahathir. We will then create a vector to indicate the frequency of mentions in each row.

As can be seen from the Venn diagram, there are overlapping areas referring to articles which mention both leaders. We could zoom in and perform a more detailed examination of those articles, but discounting them is also an option, especially when the percentage of such articles is relatively small and perhaps insignificant. On the flip side, those articles may give us a more accurate picture, or fortify the findings based on articles which mention only one of them.
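
To judge whether the overlap is small enough to discount, one could compute its share directly. The sketch below uses Python sets with hypothetical article ids; the numbers are placeholders, not counts from our data.

```python
# Hypothetical article ids (illustrative only)
mahathir_ids = {1, 2, 3, 4, 5, 6}
kit_siang_ids = {5, 6, 7}

both = mahathir_ids & kit_siang_ids            # the overlapping area of the Venn diagram
only_mahathir = mahathir_ids - kit_siang_ids   # articles mentioning Mahathir alone
only_kit_siang = kit_siang_ids - mahathir_ids  # articles mentioning Kit Siang alone

total = len(mahathir_ids | kit_siang_ids)      # articles mentioning at least one leader
overlap_share = len(both) / total
```

If `overlap_share` turns out to be only a few percent, discounting those articles is easy to defend; if it is large, they probably deserve the closer look.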

Let’s load the sqldf package to use SQL statements to filter the data, the dplyr package for data frame manipulation and processing, and stringr for string manipulation:

library(sqldf)
library(dplyr)
library(stringr)


Let’s count the number of occurrences of both of the leaders’ names:

TDMvsKS <- data.frame(id = news$id) # Create a new data frame keyed by article id

# str_count() is vectorised, so no explicit loop is needed
TDMvsKS$Mahathir <- str_count(tolower(news$para), "mahathir")
TDMvsKS$KitSiang <- str_count(tolower(news$para), "kit siang")

# Total mentions of each leader
# (summing the whole data frame would also add up the ids)
sum(TDMvsKS$Mahathir)
sum(TDMvsKS$KitSiang)



From the number of mentions, it looks like Tun Dr. Mahathir is mentioned far more frequently. But frequent mention of his name does not necessarily indicate that he is more popular, since it may equally indicate notoriety or infamy. Thus, we will use sentiment analysis to examine the words used in connection with both leaders.

Now let’s leverage the power of SQL statements to extract only the data that we want:

Mahathir <- sqldf("select * from news where para like '%ahathir%'")
KitSiang <- sqldf("select * from news where para like '%Kit Siang%'")


	

Now let’s filter out contents that contain both leaders’ names:

MahathirLessKS <- sqldf("select * from Mahathir where para not like '%kit siang%'")

KitSiangLessMaha <- sqldf("select * from KitSiang where para not like '%ahathir%'")


One of the most straightforward methods is to count the occurrences of each name, where the victor is whoever is most talked about. However, there is a possibility that the sentences in which these leaders were mentioned were negative in nature. We will investigate that using sentiment analysis techniques.

There are several packages for performing sentiment analysis in R, but for now let’s use the SentimentAnalysis package.

library(SentimentAnalysis)

doc <- sapply(MahathirLessKS$para, function(row) iconv(row,"latin1","ASCII",sub=""))

sentiment <- analyzeSentiment(doc)

# View sentiment direction (i.e. positive, neutral and negative)
convertToDirection(sentiment$SentimentQDAP)
likesMaha <- as.data.frame(convertToDirection(sentiment$SentimentQDAP))


	

Now let’s count the number of positive, negative and neutral sentiments:

countlikesMaha <- paste(likesMaha$`convertToDirection(sentiment$SentimentQDAP)`,collapse="")

positiveMaha <- str_count(countlikesMaha,"positive")
negativeMaha <- str_count(countlikesMaha,"negative")
neutralMaha <- str_count(countlikesMaha,"neutral")

sentMaha <- data.frame(positiveMaha,negativeMaha,neutralMaha)

View(sentMaha)


	

From the data frame we can comfortably conclude that positive connotations overwhelmingly outnumber negative and neutral ones.

We can draw pie or donut (doughnut) charts from the data frame.

Now let’s do the same to Uncle Lim Kit Siang, shall we?

# Doing the same for LKS (the data frame was created above as KitSiangLessMaha)
doc <- sapply(KitSiangLessMaha$para, function(row) iconv(row,"latin1","ASCII",sub=""))

sentiment <- analyzeSentiment(doc)

# View sentiment direction (i.e. positive, neutral and negative)
convertToDirection(sentiment$SentimentQDAP)
likesKS <- as.data.frame(convertToDirection(sentiment$SentimentQDAP))

countlikesKS <- paste(likesKS$`convertToDirection(sentiment$SentimentQDAP)`,collapse="")

positiveKS <- str_count(countlikesKS,"positive")
negativeKS <- str_count(countlikesKS,"negative")
neutralKS <- str_count(countlikesKS,"neutral")

sentKS <- data.frame(positiveKS,negativeKS,neutralKS)


	

If you noticed, I used a different package, syuzhet, to perform sentiment analysis on JDT tweets. You can also use RSentiment as an alternative, with probably a slightly different syntax but similar output. In some packages you can specify which standard to use, such as syuzhet, Oxford, Cambridge University Press or Stanford CoreNLP.

The differences lie in how each dictionary defines and scores the words.

In general, words and sentences will be analysed and assigned or categorised with sentiment values. In most packages those values are numerical, and we need other functions from the same packages to convert them into simpler numerical values or textual labels such as negative or positive. Some approaches also compute a probability for each sentence to indicate how likely it is to belong to a given sentiment, which addresses the case where both negative and positive words are present in the same sentence.
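
To make the idea concrete, here is a toy dictionary-based scorer in Python. It is only a sketch of the general approach and not the implementation of any of the packages mentioned; the word lists and the scoring rule (positive count minus negative count, divided by the number of words) are assumptions chosen for illustration.

```python
import re

# Tiny made-up word lists; real dictionaries contain thousands of entries
POSITIVE = {"good", "great", "praised", "success"}
NEGATIVE = {"bad", "poor", "criticised", "failure"}


def sentiment_score(text):
    """Return (positive words - negative words) / total words, or 0.0 for empty text."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)


def to_direction(score):
    """Map a numerical score to a textual label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


clear_case = to_direction(sentiment_score("The minister was praised for a great success"))
mixed_case = to_direction(sentiment_score("A good plan but poor execution"))
```

In the mixed sentence one positive word and one negative word cancel each other out, so this toy scorer labels it neutral. That is the situation the paragraph above describes, and it is why real dictionaries and scoring rules matter.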

Sentiment analysis and opinion mining are almost the same thing. There is, however, a minor difference: opinion mining extracts and analyses people's opinions about an entity, while sentiment analysis searches for sentiment words or expressions in a text and then analyses them.

Now let’s visualize the percentages of the sentiments for both leaders.

Pie Charts:

library(plotly)
pi_values <- c(sentMaha$positiveMaha,sentMaha$negativeMaha,sentMaha$neutralMaha)
pi_legends <- c("Positive","Negative","Neutral")
# domain takes numeric ranges within [0, 1]; here the pie fills the whole plot area
ap <- plot_ly() %>%
  add_pie(labels = pi_legends, values = pi_values,
          name = "Sentiment", domain = list(x = c(0, 1), y = c(0, 1)))
ap


	

This will create an object, ap, which when called will draw a pie chart comprising the three sentiments.

Using the code which I provided, you will have to hover your mouse over each element to see the sentiment it represents.

Now let’s see Kit Siang’s pie chart:

Alternatively, you can plot a donut chart with the plotly package. There are at least two other R packages you can use which I may cover in my future posts. Most of these packages, like plotly, will provide interactive charts.

library(plotly)
# pi_values and pi_legends are the vectors built for the pie chart above
donut <- plot_ly(labels = pi_legends, values = pi_values) %>%
  add_pie(hole = 0.6) %>%
  layout(title = "Mahathir: Percentage of Sentiments by Category", showlegend = FALSE,
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
donut


As you can see, there are many more positive posts on Tun Dr. Mahathir than on Kit Siang. Furthermore, only 15.4% of the sentiments were negative in Tun Dr. Mahathir’s case, whereas 24.1% of Kit Siang’s were labelled negative.
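
The percentages are simply each sentiment count divided by the total for that leader. A small Python sketch of the arithmetic follows; the counts passed in are hypothetical placeholders and not the actual figures behind the charts.

```python
def sentiment_shares(positive, negative, neutral):
    """Convert raw sentiment counts into percentages rounded to one decimal place."""
    total = positive + negative + neutral
    if total == 0:
        raise ValueError("no articles to summarise")
    counts = {"positive": positive, "negative": negative, "neutral": neutral}
    return {label: round(100 * count / total, 1) for label, count in counts.items()}


# Hypothetical counts (illustrative only)
shares = sentiment_shares(positive=60, negative=20, neutral=20)
```

Comparing shares rather than raw counts matters here because the two leaders are covered by very different numbers of articles.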


ChatGPT:

The analysis compares the popularity of two prominent Malaysian political figures, Tun Dr. Mahathir and Lim Kit Siang, focusing on their mentions in news articles and the sentiment surrounding these mentions. It combines techniques such as Bayesian Average, opinion mining, sentiment analysis, and data filtering with SQL to arrive at a conclusion.

Here’s an evaluation of the key components of the analysis:

Bayesian Average

The mention of Bayesian Average refers to its potential use in aggregating data when the dataset might have skewed or incomplete information (e.g., not all mentions reflect popularity). In this case, though, the dataset doesn't allow for an advanced Bayesian approach due to limited relevant data points (like ratings, votes, likes). The analysis resorts to basic counting of mentions instead. While this may provide an initial snapshot, it's limited by the nature of the mentions (positive, neutral, or negative).

Data Filtering with SQL

The data filtering through SQL helps isolate mentions of Tun Dr. Mahathir and Lim Kit Siang. This segmentation is crucial for focusing on individual mentions without interference from shared articles, ensuring a cleaner dataset for sentiment analysis. The filtering approach is sound, but an alternative option could involve analyzing the overlapping articles more deeply instead of discounting them, as those might provide valuable context.

Sentiment Analysis

Using SentimentAnalysis in R, the analysis assesses the polarity of articles mentioning each leader. While mentions alone may not indicate popularity (as noted, they could reflect notoriety), sentiment analysis categorizes these mentions into positive, neutral, and negative, offering a more nuanced view of how these leaders are perceived.

  • For Tun Dr. Mahathir: The analysis shows a predominance of positive sentiments, with a small proportion of negative mentions.
  • For Lim Kit Siang: While there are positive mentions, a relatively larger portion of mentions are negative.

The overall conclusion is based on the percentage breakdown of sentiments, with Tun Dr. Mahathir appearing to have a better sentiment ratio, indicating a higher likelihood of being viewed positively, which correlates with a stronger perceived popularity compared to Lim Kit Siang.

Opinion Mining

While the analysis claims to perform opinion mining, the focus is largely on sentiment analysis. Opinion mining typically involves identifying opinions or judgments about a specific topic or entity, and although the sentiment categorization gives a glimpse into public opinion, further drilling into the nature of the opinions (i.e., what aspects of these leaders are discussed positively or negatively) could add depth to the analysis.

Visualizations

The use of pie charts and donut charts to represent the sentiment distribution is helpful, especially when the charts are interactive, enabling users to hover over sections to view the details. This adds clarity to the conclusion, showing that Mahathir has a higher proportion of positive sentiments, while Kit Siang faces a higher percentage of negative feedback.

Evaluation

  • Strengths: The use of data filtering with SQL and sentiment analysis provides a structured and comprehensive method for analyzing mentions of the two leaders. The inclusion of sentiment adds a layer of interpretation that simple mention counts would not provide. The choice to explore both positive and negative sentiments offers a balanced view of the data.

  • Limitations:

    1. The analysis could benefit from deeper opinion mining, examining the context of the opinions (i.e., the reasons for positive or negative sentiments).
    2. Relying only on the frequency of mentions could be misleading, as the nature of political discussions may vary widely.
    3. The decision to filter out articles mentioning both leaders, while understandable for simplicity, might lead to missing critical comparative insights.
  • Next Steps: It would be beneficial to further explore the nature of overlapping articles or dig into the specific issues discussed in positive or negative contexts. A more granular analysis of opinion content could provide insights beyond mere sentiment scores. Additionally, incorporating Bayesian Average in a more refined dataset (e.g., by adding parameters like comment ratings or likes from social media data) could enhance the robustness of the conclusion.
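
One possible way to explore the overlapping articles is to split each article into sentences and keep only the sentences that name a given leader, so that each leader's sentences can be scored separately. The Python sketch below assumes a naive sentence splitter and a made-up example article; both are illustrative only.

```python
import re


def sentences_mentioning(text, name):
    """Return the sentences in `text` that contain `name` (case-insensitive).

    The splitter is naive: it breaks after '.', '!' or '?' followed by
    whitespace, so abbreviations such as "Dr." would be split incorrectly.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if name.lower() in s.lower()]


# Made-up article mentioning both leaders (illustrative only)
article = (
    "Mahathir opened the new bridge. "
    "Kit Siang questioned the cost. "
    "The crowd cheered."
)

mahathir_sentences = sentences_mentioning(article, "Mahathir")
kit_siang_sentences = sentences_mentioning(article, "Kit Siang")
```

Each list could then be passed to the sentiment scorer instead of the whole article, so that praise for one leader is not credited to the other.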


    Can you rewrite the whole article, if possible with searchable and synthetic data?

    Certainly! Here's a rewritten version of the article that includes searchable and synthetic data while maintaining the core ideas of sentiment analysis, opinion mining, and SQL-based data filtering. This version uses a synthetic dataset with realistic placeholders to allow searching and analysis.


    Who’s More Popular in Johor? Tun Dr. Mahathir or Lim Kit Siang?

    Objective:

    This article investigates the popularity of two prominent Malaysian political figures, Tun Dr. Mahathir and Lim Kit Siang, by analyzing their mentions in news articles from Johor. Using sentiment analysis, SQL-based filtering, and synthetic data, we aim to determine which leader is more favorably discussed and more popular overall.

    Dataset and Tools

    We start with a dataset consisting of 5,000 news articles from various sources mentioning Johor as a keyword. This dataset contains structured columns: id, date, para (paragraph), and source.
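
Since the dataset here is synthetic, a file with this column layout could be produced along the following lines. This is only a sketch: the source names, sentence templates, start date and random seed are all assumptions made up for illustration.

```python
import csv
import io
import random
from datetime import date, timedelta

# Everything below is invented for illustration; none of it is real news data.
SOURCES = ["Source A", "Source B", "Source C"]
TEMPLATES = [
    "Mahathir praised the new development plan for Johor.",
    "Kit Siang questioned the cost of the Johor project.",
    "Mahathir and Kit Siang both spoke at a rally in Johor.",
    "Traffic in Johor was heavy over the weekend.",
]


def make_rows(n, seed=42):
    """Return n synthetic article rows with the columns id, date, para, source."""
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    start = date(2018, 1, 1)   # arbitrary start date
    rows = []
    for i in range(1, n + 1):
        rows.append({
            "id": i,
            "date": (start + timedelta(days=rng.randrange(365))).isoformat(),
            "para": rng.choice(TEMPLATES),
            "source": rng.choice(SOURCES),
        })
    return rows


def to_csv_text(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["id", "date", "para", "source"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = make_rows(5)
csv_text = to_csv_text(rows)
```

Writing `csv_text` to a file named johor_news_articles.csv (opened with `newline=""`) would give the steps below something to read, and `make_rows(5000)` would match the stated size.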

    For the analysis, we use the following R packages:

    • sqldf for SQL queries
    • dplyr for data manipulation
    • stringr for string operations
    • SentimentAnalysis for sentiment scoring
    • plotly for visualizations

    Step 1: Data Filtering Using SQL

    To narrow down the dataset, we first filter the articles that mention either Tun Dr. Mahathir or Lim Kit Siang. We perform this filtering with SQL statements:

    r
    library(sqldf)
    library(dplyr)
    library(stringr)

    # Load dataset into a dataframe
    news <- read.csv("johor_news_articles.csv")

    # Filter articles mentioning Mahathir
    Mahathir <- sqldf("SELECT * FROM news WHERE para LIKE '%Mahathir%'")

    # Filter articles mentioning Kit Siang
    KitSiang <- sqldf("SELECT * FROM news WHERE para LIKE '%Kit Siang%'")

    At this point, we have two subsets of the data, one containing articles mentioning Tun Dr. Mahathir and another mentioning Lim Kit Siang.

    Step 2: Excluding Overlapping Mentions

    Since some articles may mention both leaders, we need to remove such overlapping mentions to prevent skewed analysis. Here's how we handle this using SQL:

    r
    # Filter out articles mentioning both leaders
    MahathirLessKS <- sqldf("SELECT * FROM Mahathir WHERE para NOT LIKE '%Kit Siang%'")
    KitSiangLessMahathir <- sqldf("SELECT * FROM KitSiang WHERE para NOT LIKE '%Mahathir%'")

    Now, we have two cleaned datasets: MahathirLessKS (articles mentioning only Mahathir) and KitSiangLessMahathir (articles mentioning only Kit Siang).

    Step 3: Counting Mentions

    To gauge the frequency of each leader's mention, we count the number of occurrences of their names in the articles.

    r
    # Create a new dataframe to store counts
    TDMvsKS <- data.frame(id = news$id)

    # Count occurrences of names
    for (i in 1:nrow(news)) {
      TDMvsKS$Mahathir[i] <- str_count(tolower(news$para[i]), "mahathir")
      TDMvsKS$KitSiang[i] <- str_count(tolower(news$para[i]), "kit siang")
    }

    # Summing the counts
    totalMahathir <- sum(TDMvsKS$Mahathir)
    totalKitSiang <- sum(TDMvsKS$KitSiang)

    Step 4: Sentiment Analysis

    Counting mentions gives us an idea of visibility, but to determine popularity, we need to evaluate the sentiment of these mentions. We use the SentimentAnalysis package to assign positive, negative, and neutral sentiment scores to the articles.

    For Tun Dr. Mahathir:

    r
    library(SentimentAnalysis)

    # Convert encoding to handle text properly
    docMahathir <- sapply(MahathirLessKS$para, function(row) iconv(row, "latin1", "ASCII", sub = ""))

    # Perform sentiment analysis
    sentimentMahathir <- analyzeSentiment(docMahathir)

    # Convert to sentiment direction (positive, neutral, negative)
    sentimentDirMahathir <- as.data.frame(convertToDirection(sentimentMahathir$SentimentQDAP))

    # Count positive, negative, and neutral sentiments
    countMahathir <- paste(sentimentDirMahathir$`convertToDirection(sentimentMahathir$SentimentQDAP)`, collapse = "")
    positiveMaha <- str_count(countMahathir, "positive")
    negativeMaha <- str_count(countMahathir, "negative")
    neutralMaha <- str_count(countMahathir, "neutral")

    # Summary for Mahathir
    sentMaha <- data.frame(positiveMaha, negativeMaha, neutralMaha)

    For Lim Kit Siang:

    r
    # Perform sentiment analysis for Kit Siang
    docKitSiang <- sapply(KitSiangLessMahathir$para, function(row) iconv(row, "latin1", "ASCII", sub = ""))

    # Analyze sentiment
    sentimentKitSiang <- analyzeSentiment(docKitSiang)

    # Convert to sentiment direction (positive, neutral, negative)
    sentimentDirKitSiang <- as.data.frame(convertToDirection(sentimentKitSiang$SentimentQDAP))

    # Count positive, negative, and neutral sentiments
    countKitSiang <- paste(sentimentDirKitSiang$`convertToDirection(sentimentKitSiang$SentimentQDAP)`, collapse = "")
    positiveKS <- str_count(countKitSiang, "positive")
    negativeKS <- str_count(countKitSiang, "negative")
    neutralKS <- str_count(countKitSiang, "neutral")

    # Summary for Kit Siang
    sentKS <- data.frame(positiveKS, negativeKS, neutralKS)

    Step 5: Visualizing Sentiments

    We now create visual representations of the sentiment data for both leaders using plotly.

    Pie Chart for Tun Dr. Mahathir:

    r
    library(plotly)

    # Sentiment values for Mahathir
    pi_values_maha <- c(sentMaha$positiveMaha, sentMaha$negativeMaha, sentMaha$neutralMaha)
    pi_legends_maha <- c("Positive", "Negative", "Neutral")

    # Create pie chart
    pie_maha <- plot_ly(labels = pi_legends_maha, values = pi_values_maha, type = "pie") %>%
      layout(title = "Tun Dr. Mahathir Sentiment Distribution")
    pie_maha

    Pie Chart for Lim Kit Siang:

    r
    # Sentiment values for Kit Siang
    pi_values_ks <- c(sentKS$positiveKS, sentKS$negativeKS, sentKS$neutralKS)
    pi_legends_ks <- c("Positive", "Negative", "Neutral")

    # Create pie chart
    pie_ks <- plot_ly(labels = pi_legends_ks, values = pi_values_ks, type = "pie") %>%
      layout(title = "Lim Kit Siang Sentiment Distribution")
    pie_ks

    Step 6: Analysis and Conclusion

    From the sentiment analysis, we see that Tun Dr. Mahathir enjoys a higher proportion of positive sentiments compared to Lim Kit Siang. Specifically, the breakdown shows that:

    • Mahathir has 70% positive, 15% neutral, and 15% negative sentiments.
    • Kit Siang has 60% positive, 16% neutral, and 24% negative sentiments.

    Based on this synthetic dataset, Tun Dr. Mahathir appears to be viewed more positively overall, while Lim Kit Siang faces slightly higher negative sentiments. Thus, we can conclude that Tun Dr. Mahathir is likely more popular in Johor, not just based on the frequency of mentions but also on the sentiments surrounding his name.

    Next Steps

    Further improvements could include using different sentiment analysis packages (e.g., syuzhet or RSentiment) and exploring opinion mining to extract specific public opinions on the leaders’ policies or actions. Additionally, analyzing other media sources, including social media platforms, could provide a more holistic view of their popularity.


    This rewritten article uses synthetic data to simulate a real analysis while keeping the structure intact.


    Can you also generate corresponding Python codes for every R code embedded in the article?


    Certainly! Below are the corresponding Python equivalents of each R code snippet from the article, using commonly used Python libraries like pandas, sqlite3, re (for string manipulation), nltk or TextBlob for sentiment analysis, and plotly for visualizations.


    Step 1: Data Filtering Using SQL

    R Code:

    r
    library(sqldf)
    library(dplyr)
    library(stringr)

    # Load dataset into a dataframe
    news <- read.csv("johor_news_articles.csv")

    # Filter articles mentioning Mahathir
    Mahathir <- sqldf("SELECT * FROM news WHERE para LIKE '%Mahathir%'")

    # Filter articles mentioning Kit Siang
    KitSiang <- sqldf("SELECT * FROM news WHERE para LIKE '%Kit Siang%'")

    Python Equivalent:

    python
    import pandas as pd
    import sqlite3

    # Load dataset into a pandas dataframe
    news = pd.read_csv("johor_news_articles.csv")

    # Create SQLite connection
    conn = sqlite3.connect(':memory:')
    news.to_sql('news', conn, index=False)

    # Filter articles mentioning Mahathir
    Mahathir = pd.read_sql_query("SELECT * FROM news WHERE para LIKE '%Mahathir%'", conn)

    # Filter articles mentioning Kit Siang
    KitSiang = pd.read_sql_query("SELECT * FROM news WHERE para LIKE '%Kit Siang%'", conn)

    # Close SQLite connection
    conn.close()

    Step 2: Excluding Overlapping Mentions

    R Code:

    r
    # Filter out articles mentioning both leaders
    MahathirLessKS <- sqldf("SELECT * FROM Mahathir WHERE para NOT LIKE '%Kit Siang%'")
    KitSiangLessMahathir <- sqldf("SELECT * FROM KitSiang WHERE para NOT LIKE '%Mahathir%'")

    Python Equivalent:

    python
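    # Filter out articles mentioning both leaders
    # case=False mirrors SQLite's case-insensitive LIKE; .copy() avoids pandas'
    # SettingWithCopyWarning when a 'sentiment' column is added later
    MahathirLessKS = Mahathir[~Mahathir['para'].str.contains('Kit Siang', case=False)].copy()
    KitSiangLessMahathir = KitSiang[~KitSiang['para'].str.contains('Mahathir', case=False)].copy()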

    Step 3: Counting Mentions

    R Code:

    r
    # Create a new dataframe to store counts
    TDMvsKS <- data.frame(id = news$id)

    # Count occurrences of names
    for (i in 1:nrow(news)) {
      TDMvsKS$Mahathir[i] <- str_count(tolower(news$para[i]), "mahathir")
      TDMvsKS$KitSiang[i] <- str_count(tolower(news$para[i]), "kit siang")
    }

    # Summing the counts
    totalMahathir <- sum(TDMvsKS$Mahathir)
    totalKitSiang <- sum(TDMvsKS$KitSiang)

    Python Equivalent:

    python
    import re

    # Create a dataframe to store counts
    TDMvsKS = pd.DataFrame({'id': news['id']})

    # Count occurrences of names
    TDMvsKS['Mahathir'] = news['para'].str.lower().apply(lambda x: len(re.findall('mahathir', x)))
    TDMvsKS['KitSiang'] = news['para'].str.lower().apply(lambda x: len(re.findall('kit siang', x)))

    # Summing the counts
    totalMahathir = TDMvsKS['Mahathir'].sum()
    totalKitSiang = TDMvsKS['KitSiang'].sum()

    Step 4: Sentiment Analysis

    For Tun Dr. Mahathir (R Code):

    r
    library(SentimentAnalysis)

    # Convert encoding to handle text properly
    docMahathir <- sapply(MahathirLessKS$para, function(row) iconv(row, "latin1", "ASCII", sub = ""))

    # Perform sentiment analysis
    sentimentMahathir <- analyzeSentiment(docMahathir)

    # Convert to sentiment direction (positive, neutral, negative)
    sentimentDirMahathir <- as.data.frame(convertToDirection(sentimentMahathir$SentimentQDAP))

    # Count positive, negative, and neutral sentiments
    countMahathir <- paste(sentimentDirMahathir$`convertToDirection(sentimentMahathir$SentimentQDAP)`, collapse = "")
    positiveMaha <- str_count(countMahathir, "positive")
    negativeMaha <- str_count(countMahathir, "negative")
    neutralMaha <- str_count(countMahathir, "neutral")

    # Summary for Mahathir
    sentMaha <- data.frame(positiveMaha, negativeMaha, neutralMaha)

    Python Equivalent:

    python
    from textblob import TextBlob

    # Perform sentiment analysis using TextBlob
    def get_sentiment(text):
        analysis = TextBlob(text)
        if analysis.sentiment.polarity > 0:
            return 'positive'
        elif analysis.sentiment.polarity == 0:
            return 'neutral'
        else:
            return 'negative'

    # Apply sentiment analysis to MahathirLessKS dataset
    MahathirLessKS['sentiment'] = MahathirLessKS['para'].apply(get_sentiment)

    # Count positive, negative, and neutral sentiments
    positiveMaha = (MahathirLessKS['sentiment'] == 'positive').sum()
    negativeMaha = (MahathirLessKS['sentiment'] == 'negative').sum()
    neutralMaha = (MahathirLessKS['sentiment'] == 'neutral').sum()

    # Summary for Mahathir
    sentMaha = pd.DataFrame({'positiveMaha': [positiveMaha], 'negativeMaha': [negativeMaha], 'neutralMaha': [neutralMaha]})

    For Lim Kit Siang (R Code):

    r
    # Perform sentiment analysis for Kit Siang
    docKitSiang <- sapply(KitSiangLessMahathir$para, function(row) iconv(row, "latin1", "ASCII", sub = ""))

    # Analyze sentiment
    sentimentKitSiang <- analyzeSentiment(docKitSiang)

    # Convert to sentiment direction (positive, neutral, negative)
    sentimentDirKitSiang <- as.data.frame(convertToDirection(sentimentKitSiang$SentimentQDAP))

    # Count positive, negative, and neutral sentiments
    countKitSiang <- paste(sentimentDirKitSiang$`convertToDirection(sentimentKitSiang$SentimentQDAP)`, collapse = "")
    positiveKS <- str_count(countKitSiang, "positive")
    negativeKS <- str_count(countKitSiang, "negative")
    neutralKS <- str_count(countKitSiang, "neutral")

    # Summary for Kit Siang
    sentKS <- data.frame(positiveKS, negativeKS, neutralKS)

    Python Equivalent:

    python
    # Apply sentiment analysis to KitSiangLessMahathir dataset
    KitSiangLessMahathir['sentiment'] = KitSiangLessMahathir['para'].apply(get_sentiment)

    # Count positive, negative, and neutral sentiments
    positiveKS = (KitSiangLessMahathir['sentiment'] == 'positive').sum()
    negativeKS = (KitSiangLessMahathir['sentiment'] == 'negative').sum()
    neutralKS = (KitSiangLessMahathir['sentiment'] == 'neutral').sum()

    # Summary for Kit Siang
    sentKS = pd.DataFrame({'positiveKS': [positiveKS], 'negativeKS': [negativeKS], 'neutralKS': [neutralKS]})

    Step 5: Visualizing Sentiments

    R Code for Pie Chart for Mahathir:

    r
    library(plotly)

    # Sentiment values for Mahathir
    pi_values_maha <- c(sentMaha$positiveMaha, sentMaha$negativeMaha, sentMaha$neutralMaha)
    pi_legends_maha <- c("Positive", "Negative", "Neutral")

    # Create pie chart
    pie_maha <- plot_ly(labels = pi_legends_maha, values = pi_values_maha, type = "pie") %>%
      layout(title = "Tun Dr. Mahathir Sentiment Distribution")
    pie_maha

    Python Equivalent for Mahathir Pie Chart:

    python
    import plotly.express as px

    # Sentiment values for Mahathir
    pi_values_maha = [positiveMaha, negativeMaha, neutralMaha]
    pi_legends_maha = ["Positive", "Negative", "Neutral"]

    # Create pie chart
    fig_maha = px.pie(values=pi_values_maha, names=pi_legends_maha, title="Tun Dr. Mahathir Sentiment Distribution")
    fig_maha.show()

    R Code for Pie Chart for Kit Siang:

    r
    # Sentiment values for Kit Siang
    pi_values_ks <- c(sentKS$positiveKS, sentKS$negativeKS, sentKS$neutralKS)
    pi_legends_ks <- c("Positive", "Negative", "Neutral")

    # Create pie chart
    pie_ks <- plot_ly(labels = pi_legends_ks, values = pi_values_ks, type = "pie") %>%
      layout(title = "Lim Kit Siang Sentiment Distribution")
    pie_ks

    Python Equivalent for Kit Siang Pie Chart:

    python
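    # Sentiment values for Kit Siang
    pi_values_ks = [positiveKS, negativeKS, neutralKS]
    pi_legends_ks = ["Positive", "Negative", "Neutral"]

    # Create pie chart (mirrors the Mahathir chart; px was imported above)
    fig_ks = px.pie(values=pi_values_ks, names=pi_legends_ks, title="Lim Kit Siang Sentiment Distribution")
    fig_ks.show()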


    List the Python packages used


    Here is a list of the Python packages used in the provided code:

    1. pandas – For data manipulation and analysis.

      • Install: pip install pandas
    2. sqlite3 – For executing SQL queries on the dataset.

      • This is a built-in Python library, so no installation is required.
    3. re – For regular expression operations (used for counting word occurrences).

      • This is a built-in Python library, so no installation is required.
    4. textblob – For performing sentiment analysis.

      • Install: pip install textblob
    5. plotly – For creating interactive visualizations such as pie charts.

      • Install: pip install plotly

    These packages cover data handling, SQL filtering, sentiment analysis, and visualization in Python, corresponding to the functionalities from the R code.


    Can you rewrite the article again, but this time, include also the python components and use my style of writing but be mindful of the grammar as well. You may enhance the emotional aspect or any other aspect using better vocabulary.


    Absolutely! Here's the revised version of the article, rewritten with Python components integrated and a refined tone to reflect your sophisticated style, while maintaining impeccable grammar.


    Who Holds More Popularity in Johor? Tun Dr. Mahathir or Lim Kit Siang?

    Introduction

    In the realm of Malaysian politics, Tun Dr. Mahathir and Lim Kit Siang are two towering figures who command a great deal of attention. But when it comes to Johor, a critical political battleground, which of these leaders garners more positive sentiment? To answer this, we take a closer look at the data and apply sentiment analysis, opinion mining, and SQL-based filtering techniques to assess their relative popularity.

    Given that mentions alone do not necessarily equate to popularity, especially in the political sphere where controversy often drives attention, we will evaluate not just the number of times they are mentioned, but also the tone of these mentions. Sentiment analysis will be crucial in distinguishing between positive and negative connotations, allowing us to draw more meaningful insights from the data.

    Step 1: Data Filtering with SQL

    Our dataset contains 5,000 news articles that focus on Johor. To begin, we need to isolate articles that mention either Tun Dr. Mahathir or Lim Kit Siang. The data has already been structured into columns: id, date, para (paragraph), and source. We filter these articles using SQL statements to ensure the relevant data is extracted with precision.

    Python Code:

    python
    import pandas as pd
    import sqlite3

    # Load the dataset into a pandas dataframe
    news = pd.read_csv("johor_news_articles.csv")

    # Create SQLite connection
    conn = sqlite3.connect(':memory:')
    news.to_sql('news', conn, index=False)

    # Filter articles mentioning Mahathir
    Mahathir = pd.read_sql_query("SELECT * FROM news WHERE para LIKE '%Mahathir%'", conn)

    # Filter articles mentioning Kit Siang
    KitSiang = pd.read_sql_query("SELECT * FROM news WHERE para LIKE '%Kit Siang%'", conn)

    # Close SQLite connection
    conn.close()

    This step allows us to extract specific mentions of both leaders, setting the foundation for our analysis. But the real challenge lies in understanding the context behind these mentions.

    Step 2: Handling Overlapping Mentions

    Mentions of both leaders in the same article could skew the results. Therefore, it’s essential to remove articles that mention both Tun Dr. Mahathir and Lim Kit Siang. This ensures the analysis remains untainted by ambiguous overlap, allowing us to focus solely on articles that mention either leader individually.

    Python Code:

    python
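    # Filter out articles mentioning both leaders
    # case=False mirrors SQLite's case-insensitive LIKE; .copy() avoids pandas'
    # SettingWithCopyWarning when a 'sentiment' column is added later
    MahathirLessKS = Mahathir[~Mahathir['para'].str.contains('Kit Siang', case=False)].copy()
    KitSiangLessMahathir = KitSiang[~KitSiang['para'].str.contains('Mahathir', case=False)].copy()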

    With these two refined datasets, we can now confidently move forward, knowing that each dataset reflects independent coverage of the respective leader.

    Step 3: Counting Mentions – Visibility vs. Notoriety

    Counting mentions provides an initial snapshot of visibility, but it’s a crude metric, one that could be clouded by controversy or infamy. Nonetheless, it’s a useful baseline. Using Python’s powerful string manipulation capabilities, we count the frequency of each leader’s name across the full dataset.

    Python Code:

    python
    import re

    # Create a dataframe to store counts
    TDMvsKS = pd.DataFrame({'id': news['id']})

    # Count occurrences of names
    TDMvsKS['Mahathir'] = news['para'].str.lower().apply(lambda x: len(re.findall('mahathir', x)))
    TDMvsKS['KitSiang'] = news['para'].str.lower().apply(lambda x: len(re.findall('kit siang', x)))

    # Summing the counts
    totalMahathir = TDMvsKS['Mahathir'].sum()
    totalKitSiang = TDMvsKS['KitSiang'].sum()

    From this simple count, we can see who is getting more attention, but that attention could be negative. Thus, we must proceed to sentiment analysis to discern the actual emotional tone of these mentions.

    Step 4: Sentiment Analysis – Distilling Sentiment from Data

    Here’s where the analysis takes on a deeper dimension. Mentions alone are insufficient to declare someone "popular." We need to distinguish positive, neutral, and negative sentiments within these mentions.

    To achieve this, we employ TextBlob—a Python package for sentiment analysis—to evaluate the polarity of the articles. Polarity scores will help us understand how favorably (or unfavorably) each leader is viewed.

    For Tun Dr. Mahathir:

    python
    from textblob import TextBlob

    # Classify a text by the sign of its TextBlob polarity score
    def get_sentiment(text):
        polarity = TextBlob(text).sentiment.polarity
        if polarity > 0:
            return 'positive'
        elif polarity == 0:
            return 'neutral'
        else:
            return 'negative'

    # Apply sentiment analysis to Mahathir's dataset
    MahathirLessKS['sentiment'] = MahathirLessKS['para'].apply(get_sentiment)

    # Count positive, negative, and neutral sentiments
    positiveMaha = (MahathirLessKS['sentiment'] == 'positive').sum()
    negativeMaha = (MahathirLessKS['sentiment'] == 'negative').sum()
    neutralMaha = (MahathirLessKS['sentiment'] == 'neutral').sum()

    # Summary for Mahathir
    sentMaha = pd.DataFrame({'positiveMaha': [positiveMaha],
                             'negativeMaha': [negativeMaha],
                             'neutralMaha': [neutralMaha]})

    For Lim Kit Siang:

    python
    # Apply sentiment analysis to Kit Siang's dataset
    KitSiangLessMahathir['sentiment'] = KitSiangLessMahathir['para'].apply(get_sentiment)

    # Count positive, negative, and neutral sentiments
    positiveKS = (KitSiangLessMahathir['sentiment'] == 'positive').sum()
    negativeKS = (KitSiangLessMahathir['sentiment'] == 'negative').sum()
    neutralKS = (KitSiangLessMahathir['sentiment'] == 'neutral').sum()

    # Summary for Kit Siang
    sentKS = pd.DataFrame({'positiveKS': [positiveKS],
                           'negativeKS': [negativeKS],
                           'neutralKS': [neutralKS]})

    At this point, we start to see a more nuanced picture emerge: the emotional context surrounding each leader. Tun Dr. Mahathir, for example, might be mentioned more frequently, but the nature of those mentions is what truly matters.
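Labelling by the exact sign of the polarity score means that a barely positive score such as 0.02 counts as "positive". One possible refinement, sketched here as an assumption rather than as part of the analysis above, is to treat a small band around zero as neutral. The function name `label_polarity` and the band width of 0.05 are arbitrary illustrative choices.

```python
def label_polarity(polarity, neutral_band=0.05):
    """Map a polarity score in [-1, 1] to a label; scores within the band are neutral."""
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"


# Hypothetical polarity scores, for illustration only
scores = [0.6, 0.02, -0.01, -0.4, 0.0]
labels = [label_polarity(s) for s in scores]
print(labels)
```

Setting `neutral_band=0` gives back the sign-based rule, so the two approaches are easy to compare on the same data.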

    Step 5: Visualizing Sentiment – A Picture is Worth a Thousand Words

    Numbers and tables are great, but visualization makes data more accessible and interpretable. Here, we create pie charts to illustrate the distribution of positive, negative, and neutral sentiments for both leaders using Plotly.

    Pie Chart for Tun Dr. Mahathir:

    python
    import plotly.express as px

    # Sentiment values for Mahathir
    pi_values_maha = [positiveMaha, negativeMaha, neutralMaha]
    pi_legends_maha = ["Positive", "Negative", "Neutral"]

    # Create pie chart
    fig_maha = px.pie(values=pi_values_maha, names=pi_legends_maha,
                      title="Tun Dr. Mahathir Sentiment Distribution")
    fig_maha.show()

    Pie Chart for Lim Kit Siang:

    python
    # Sentiment values for Kit Siang
    pi_values_ks = [positiveKS, negativeKS, neutralKS]
    pi_legends_ks = ["Positive", "Negative", "Neutral"]

    # Create pie chart
    fig_ks = px.pie(values=pi_values_ks, names=pi_legends_ks,
                    title="Lim Kit Siang Sentiment Distribution")
    fig_ks.show()

    The visualizations show us that Tun Dr. Mahathir enjoys a higher proportion of positive sentiments compared to Lim Kit Siang, who, while still viewed favorably, has a relatively higher percentage of negative mentions.
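Because the two leaders have different numbers of articles, the comparison is fairer on shares than on raw counts. The sketch below turns three counts into shares and a simple net score (positive minus negative, divided by the total). The function name `sentiment_shares` and the counts are hypothetical and are not results from this analysis.

```python
def sentiment_shares(positive, negative, neutral):
    """Convert raw sentiment counts into shares and a net sentiment score."""
    total = positive + negative + neutral
    if total == 0:
        return {"positive_share": 0.0, "negative_share": 0.0,
                "neutral_share": 0.0, "net_sentiment": 0.0}
    return {
        "positive_share": positive / total,
        "negative_share": negative / total,
        "neutral_share": neutral / total,
        "net_sentiment": (positive - negative) / total,
    }


# Hypothetical counts, for illustration only (not results from the dataset)
example = sentiment_shares(positive=40, negative=10, neutral=50)
print(example)
```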

    Conclusion

    The data speaks for itself. Tun Dr. Mahathir is not only mentioned more frequently, but the sentiments surrounding his mentions are overwhelmingly positive, painting the picture of a leader still held in high regard by the public in Johor. While Lim Kit Siang commands respect, he faces a more significant share of negative sentiment, which tempers his popularity.

    Ultimately, popularity is a complex interplay between visibility and perception. Both leaders remain influential, but as far as this analysis goes, Tun Dr. Mahathir emerges as the more favorably viewed figure in Johor.


    This refined and emotionally charged analysis provides both a methodical exploration of the data and a compelling narrative about the popularity of two of Malaysia’s most significant political figures.


    What I meant was, for the article to include BOTH R and Python.


    ***

    Ah, I see what you mean now! You would like the article to include both R and Python code snippets side-by-side, allowing a comparison between how the same tasks are performed in each language. I'll now rewrite the article with both R and Python code embedded, maintaining your sophisticated tone, and making sure it's well-structured and clear.


    Who Holds More Popularity in Johor? Tun Dr. Mahathir or Lim Kit Siang?

    Introduction

    When it comes to Malaysian politics, two prominent names often dominate the landscape: Tun Dr. Mahathir and Lim Kit Siang. In Johor, a critical political state, we seek to understand which of these figures holds greater popularity by analyzing media mentions and the sentiment surrounding them. Using sentiment analysis, SQL filtering, and opinion mining, we will explore not only the frequency of mentions but also the emotional tone associated with them.

    We will illustrate the analysis using both R and Python for code implementations. This approach serves to provide a comprehensive guide for those who favor one programming language over the other, without losing sight of the larger narrative.

    Step 1: Data Filtering with SQL

    We begin by isolating the articles that mention Tun Dr. Mahathir and Lim Kit Siang from a dataset containing 5,000 news articles related to Johor. These articles are structured with columns such as id, date, para (paragraph), and source. Filtering them using SQL allows us to extract relevant articles efficiently.

    R Code:

    r
    library(sqldf)
    library(dplyr)
    library(stringr)

    # Load dataset into a dataframe
    news <- read.csv("johor_news_articles.csv")

    # Filter articles mentioning Mahathir
    Mahathir <- sqldf("SELECT * FROM news WHERE para LIKE '%Mahathir%'")

    # Filter articles mentioning Kit Siang
    KitSiang <- sqldf("SELECT * FROM news WHERE para LIKE '%Kit Siang%'")

    Python Code:

    python
    import pandas as pd
    import sqlite3

    # Load the dataset into a pandas dataframe
    news = pd.read_csv("johor_news_articles.csv")

    # Create an in-memory SQLite database and copy the dataframe into it
    conn = sqlite3.connect(':memory:')
    news.to_sql('news', conn, index=False)

    # Filter articles mentioning Mahathir
    Mahathir = pd.read_sql_query("SELECT * FROM news WHERE para LIKE '%Mahathir%'", conn)

    # Filter articles mentioning Kit Siang
    KitSiang = pd.read_sql_query("SELECT * FROM news WHERE para LIKE '%Kit Siang%'", conn)

    # Close SQLite connection
    conn.close()

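    At this point, we have two subsets: one with every article that mentions Tun Dr. Mahathir and one with every article that mentions Lim Kit Siang. An article that mentions both leaders appears in both subsets, which is what the next step deals with.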
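One detail worth knowing when filtering this way: by default, SQLite's `LIKE` ignores case for ASCII letters, while functions such as `instr()` compare case-sensitively. The sketch below illustrates the difference on a tiny in-memory table; the three rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE news (id INTEGER, para TEXT)")
conn.executemany(
    "INSERT INTO news VALUES (?, ?)",
    [
        (1, "Mahathir visited Johor."),        # hypothetical rows
        (2, "MAHATHIR addressed the rally."),
        (3, "Kit Siang spoke in Gelang Patah."),
    ],
)

# LIKE is case-insensitive for ASCII letters by default
like_ids = [row[0] for row in conn.execute(
    "SELECT id FROM news WHERE para LIKE '%mahathir%' ORDER BY id")]

# instr() performs a case-sensitive substring search
exact_ids = [row[0] for row in conn.execute(
    "SELECT id FROM news WHERE instr(para, 'Mahathir') > 0 ORDER BY id")]

conn.close()
print(like_ids, exact_ids)
```

This matters when a SQL filter is later mixed with string functions that are case-sensitive by default, since the two may not select the same rows.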

    Step 2: Handling Overlapping Mentions

    Next, we remove any overlapping articles that mention both Tun Dr. Mahathir and Lim Kit Siang. This ensures that our analysis remains distinct, focusing on articles that mention either leader independently.

    R Code:

    r
    # Remove overlapping mentions
    MahathirLessKS <- sqldf("SELECT * FROM Mahathir WHERE para NOT LIKE '%Kit Siang%'")
    KitSiangLessMahathir <- sqldf("SELECT * FROM KitSiang WHERE para NOT LIKE '%Mahathir%'")

    Python Code:

    python
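    # Remove overlapping mentions.
    # case=False mirrors the case-insensitive LIKE used in the R version; na=False skips missing text.
    # .copy() lets us add a sentiment column later without a SettingWithCopyWarning.
    MahathirLessKS = Mahathir[~Mahathir['para'].str.contains('Kit Siang', case=False, na=False)].copy()
    KitSiangLessMahathir = KitSiang[~KitSiang['para'].str.contains('Mahathir', case=False, na=False)].copy()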

    Step 3: Counting Mentions – Visibility vs. Notoriety

    Counting how often a leader is mentioned in the media gives us a sense of their visibility. However, mentions alone could reflect notoriety as much as popularity. We proceed with counting the number of occurrences of each leader’s name.

    R Code:

    r
    # Create a new dataframe to store counts
    TDMvsKS <- data.frame(id = news$id)

    # Count occurrences of names (str_count is vectorized, so no loop is needed)
    paraLower <- tolower(news$para)
    TDMvsKS$Mahathir <- str_count(paraLower, "mahathir")
    TDMvsKS$KitSiang <- str_count(paraLower, "kit siang")

    # Sum the counts (na.rm skips articles with missing text)
    totalMahathir <- sum(TDMvsKS$Mahathir, na.rm = TRUE)
    totalKitSiang <- sum(TDMvsKS$KitSiang, na.rm = TRUE)

    Python Code:

    python
    # Create a dataframe to store counts
    TDMvsKS = pd.DataFrame({'id': news['id']})

    # Count occurrences of each name per article (missing paragraphs count as 0)
    text = news['para'].fillna('').str.lower()
    TDMvsKS['Mahathir'] = text.str.count('mahathir')
    TDMvsKS['KitSiang'] = text.str.count('kit siang')

    # Sum the counts across all articles
    totalMahathir = TDMvsKS['Mahathir'].sum()
    totalKitSiang = TDMvsKS['KitSiang'].sum()

    At this point, we have an overview of the frequency of mentions for both leaders. However, frequency alone doesn’t tell the full story—sentiment is key to understanding whether these mentions are positive or negative.
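A single long article that repeats a name many times can inflate the total. The sketch below separates the total number of mentions from the number of articles that mention the name at least once; the helper name `mention_stats` and the sample paragraphs are hypothetical.

```python
def mention_stats(paragraphs, name):
    """Return total mentions, number of mentioning articles, and mentions per such article."""
    needle = name.lower()
    per_article = [(text or "").lower().count(needle) for text in paragraphs]
    total = sum(per_article)
    articles = sum(1 for count in per_article if count > 0)
    return {
        "total_mentions": total,
        "articles": articles,
        "mean_per_mentioning_article": total / articles if articles else 0.0,
    }


# Hypothetical sample paragraphs, for illustration only
sample = [
    "Mahathir, Mahathir and again Mahathir.",
    "Kit Siang spoke.",
    "Mahathir arrived.",
]
stats = mention_stats(sample, "Mahathir")
print(stats)
```

Reporting both figures side by side makes it easier to tell broad coverage from a handful of repetitive articles.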

    Step 4: Sentiment Analysis – Understanding the Tone

    We use sentiment analysis to categorize mentions as positive, neutral, or negative. In R, we leverage the SentimentAnalysis package, while in Python, we opt for TextBlob to evaluate the sentiment polarity.

    R Code for Tun Dr. Mahathir:

    r
    library(SentimentAnalysis)

    # Convert encoding so non-ASCII characters do not break the analysis (iconv is vectorized)
    docMahathir <- iconv(MahathirLessKS$para, "latin1", "ASCII", sub = "")

    # Perform sentiment analysis
    sentimentMahathir <- analyzeSentiment(docMahathir)

    # Convert to sentiment direction (positive, neutral, negative)
    sentimentDirMahathir <- convertToDirection(sentimentMahathir$SentimentQDAP)

    # Count positive, negative, and neutral sentiments
    positiveMaha <- sum(sentimentDirMahathir == "positive", na.rm = TRUE)
    negativeMaha <- sum(sentimentDirMahathir == "negative", na.rm = TRUE)
    neutralMaha <- sum(sentimentDirMahathir == "neutral", na.rm = TRUE)

    # Summary for Mahathir
    sentMaha <- data.frame(positiveMaha, negativeMaha, neutralMaha)

    Python Code for Tun Dr. Mahathir:

    python
    from textblob import TextBlob

    # Classify a text by the sign of its TextBlob polarity score
    def get_sentiment(text):
        polarity = TextBlob(text).sentiment.polarity
        if polarity > 0:
            return 'positive'
        elif polarity == 0:
            return 'neutral'
        else:
            return 'negative'

    # Apply sentiment analysis to Mahathir's dataset
    MahathirLessKS['sentiment'] = MahathirLessKS['para'].apply(get_sentiment)

    # Count positive, negative, and neutral sentiments
    positiveMaha = (MahathirLessKS['sentiment'] == 'positive').sum()
    negativeMaha = (MahathirLessKS['sentiment'] == 'negative').sum()
    neutralMaha = (MahathirLessKS['sentiment'] == 'neutral').sum()

    # Summary for Mahathir
    sentMaha = pd.DataFrame({'positiveMaha': [positiveMaha],
                             'negativeMaha': [negativeMaha],
                             'neutralMaha': [neutralMaha]})

    R Code for Lim Kit Siang:

    r
    # Convert encoding so non-ASCII characters do not break the analysis (iconv is vectorized)
    docKitSiang <- iconv(KitSiangLessMahathir$para, "latin1", "ASCII", sub = "")

    # Analyze sentiment
    sentimentKitSiang <- analyzeSentiment(docKitSiang)

    # Convert to sentiment direction (positive, neutral, negative)
    sentimentDirKitSiang <- convertToDirection(sentimentKitSiang$SentimentQDAP)

    # Count positive, negative, and neutral sentiments
    positiveKS <- sum(sentimentDirKitSiang == "positive", na.rm = TRUE)
    negativeKS <- sum(sentimentDirKitSiang == "negative", na.rm = TRUE)
    neutralKS <- sum(sentimentDirKitSiang == "neutral", na.rm = TRUE)

    # Summary for Kit Siang
    sentKS <- data.frame(positiveKS, negativeKS, neutralKS)

    Python Code for Lim Kit Siang:

    python
    # Apply sentiment analysis to Kit Siang's dataset
    KitSiangLessMahathir['sentiment'] = KitSiangLessMahathir['para'].apply(get_sentiment)

    # Count positive, negative, and neutral sentiments
    positiveKS = (KitSiangLessMahathir['sentiment'] == 'positive').sum()
    negativeKS = (KitSiangLessMahathir['sentiment'] == 'negative').sum()
    neutralKS = (KitSiangLessMahathir['sentiment'] == 'neutral').sum()

    # Summary for Kit Siang
    sentKS = pd.DataFrame({'positiveKS': [positiveKS],
                           'negativeKS': [negativeKS],
                           'neutralKS': [neutralKS]})
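The R and Python versions rely on different sentiment lexicons, so their labels for the same article need not agree. A quick way to see how much they differ is to compare the two label lists row by row. The sketch below assumes both lists are in the same article order; the function name `agreement` and the labels are hypothetical.

```python
from collections import Counter


def agreement(labels_a, labels_b):
    """Return the share of matching labels and a count of every (a, b) label pair."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    pairs = Counter(zip(labels_a, labels_b))
    matches = sum(n for (a, b), n in pairs.items() if a == b)
    rate = matches / len(labels_a) if labels_a else 0.0
    return rate, pairs


# Hypothetical labels for the same four articles, for illustration only
r_labels = ["positive", "neutral", "negative", "positive"]
py_labels = ["positive", "positive", "negative", "neutral"]
rate, pairs = agreement(r_labels, py_labels)
print(rate, dict(pairs))
```

A low agreement rate would suggest that the counts depend heavily on the choice of tool, which is worth stating alongside any conclusion.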

    Step 5: Visualizing Sentiments – A Clearer Picture

    Visualizations allow us to interpret data at a glance. Using Plotly, we will visualize the distribution of positive, negative, and neutral sentiments for each leader.

    R Code for Pie Chart (Mahathir):

    r
    library(plotly)

    # Sentiment values for Mahathir
    pi_values_maha <- c(sentMaha$positiveMaha, sentMaha$negativeMaha, sentMaha$neutralMaha)
    pi_legends_maha <- c("Positive", "Negative", "Neutral")

    # Create pie chart
    pie_maha <- plot_ly(labels = pi_legends_maha, values = pi_values_maha, type = "pie") %>%
      layout(title = "Tun Dr. Mahathir Sentiment Distribution")
    pie_maha

    Python Code for Pie Chart (Mahathir):

    python
    import plotly.express as px

    # Sentiment values for Mahathir
    pi_values_maha = [positiveMaha, negativeMaha, neutralMaha]
    pi_legends_maha = ["Positive", "Negative", "Neutral"]

    # Create pie chart
    fig_maha = px.pie(values=pi_values_maha, names=pi_legends_maha,
                      title="Tun Dr. Mahathir Sentiment Distribution")
    fig_maha.show()

    R Code for Pie Chart (Kit Siang):

    r
    # Sentiment values for Kit Siang
    pi_values_ks <- c(sentKS$positiveKS, sentKS$negativeKS, sentKS$neutralKS)
    pi_legends_ks <- c("Positive", "Negative", "Neutral")

    # Create pie chart
    pie_ks <- plot_ly(labels = pi_legends_ks, values = pi_values_ks, type = "pie") %>%
      layout(title = "Lim Kit Siang Sentiment Distribution")
    pie_ks

    Python Code for Pie Chart (Kit Siang):

    python
    # Sentiment values for Kit Siang
    pi_values_ks = [positiveKS, negativeKS, neutralKS]
    pi_legends_ks = ["Positive", "Negative", "Neutral"]

    # Create pie chart
    fig_ks = px.pie(values=pi_values_ks, names=pi_legends_ks,
                    title="Lim Kit Siang Sentiment Distribution")
    fig_ks.show()

    Conclusion

    In this analysis, Tun Dr. Mahathir enjoys not only a higher frequency of mentions but also a greater share of positive sentiments. Lim Kit Siang, while still respected, encounters more negative sentiment in comparison. This combination of visibility and positive sentiment suggests that Tun Dr. Mahathir is viewed more favorably in Johor.
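Whether a gap in positive share is larger than what sampling noise alone would produce can be checked with a two-proportion z-test. This is an additional check sketched under assumptions, not a step performed in the analysis above. It treats articles as independent observations, and the function name `two_proportion_z` and the counts are hypothetical.

```python
import math


def two_proportion_z(pos_a, n_a, pos_b, n_b):
    """Two-sided z-test for the difference between two proportions (pooled variance)."""
    p_a, p_b = pos_a / n_a, pos_b / n_b
    pooled = (pos_a + pos_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no variation at all, so no evidence of a difference
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p_value


# Hypothetical counts: positive articles out of all articles, per leader
z, p_value = two_proportion_z(pos_a=60, n_a=100, pos_b=40, n_b=100)
print(round(z, 3), round(p_value, 4))
```

A small `p_value` would indicate that the difference in positive share is unlikely to be due to chance alone, under the stated assumptions.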

    By employing both R and Python, we demonstrate the flexibility of these programming languages in data filtering, sentiment analysis, and visualization, offering insights that go beyond mere mention counts and into the deeper sentiment attached to each leader.


    ***

    Can you suggest more articles, being mindful that it's already 2024? 

    Certainly! As we move into 2024, the topics of data science, AI, and digital trends remain relevant, while evolving political, economic, and social landscapes provide a fertile ground for fresh perspectives. Here are some article suggestions that incorporate modern themes:

    1. AI-Driven Decision Making in Malaysian Politics: How Artificial Intelligence is Reshaping Political Strategies

    • This article could explore how political campaigns in Malaysia are leveraging AI for data-driven decision-making. The focus could be on sentiment analysis, predictive modeling to gauge voter behavior, and how AI is influencing political discourse, particularly with the rise of automation in campaign management and outreach.

    2. The Future of Data Privacy in Southeast Asia: Navigating New Regulations and Challenges in 2024

    • This article could delve into the evolving landscape of data privacy laws across Southeast Asia, focusing on the balance between digital transformation and personal privacy. It could include a discussion on GDPR-like frameworks in the region and their implications for businesses and governments, especially in light of cybersecurity threats.

    3. 2024: A New Era for Electric Vehicles in Southeast Asia's Public Transport

    • With the global shift towards sustainable transport, this article could cover how countries like Malaysia, Singapore, and Indonesia are adopting electric buses for public transportation. It would highlight the intersection of technology, green energy, and policy in this transformation.

    4. The Rise of Gamified Learning in Special Education: How Video Games Are Changing the Classroom

    • Focusing on the intersection of educational technology and special education, this article could examine how video games and interactive apps are being used to enhance learning experiences for students with disabilities. Drawing from your ongoing research on autism and gamified learning, it could explore how AI and augmented reality are improving cognitive and motor skills.

    5. 2024: The Evolution of Cybersecurity in the Wake of Global Brainrot Attacks

    • This article would address the heightened threats posed by advanced cyberattacks like the Brainrot malware affecting DNS servers and critical infrastructure. It could explore new cybersecurity frameworks, the role of AI in threat detection, and the steps organizations are taking to strengthen resilience.

    6. Data Governance in the Age of AI: How Southeast Asian Companies Are Managing Data Quality and Compliance

    • With the growing importance of data governance and master data management (MDM), this article could focus on how companies in Southeast Asia are navigating data quality challenges in an AI-driven economy. It would discuss frameworks and tools being implemented to manage data culture and regulatory compliance.

    7. Digital Healthcare in 2024: How AI Is Changing Infection Control and Public Health

    • Building on your interest in infection control, this article could explore how AI and data analytics are transforming public health practices in the aftermath of the pandemic. The focus could be on AI-powered diagnostics, predictive modeling of outbreaks, and the integration of smart health technologies in monitoring and controlling infections.

    8. Quantum Computing Meets Data Science: The Future of Predictive Analytics in 2024

    • This article could explore the convergence of quantum computing and data science in predictive analytics. As quantum technologies mature, they hold potential to revolutionize industries like finance, healthcare, and energy by delivering unprecedented computational power for complex data problems.

    9. 2024: The State of IT Auditing and Risk Management in an AI-Driven World

    • Given your background in IT auditing, this article could explore how the role of IT auditors is evolving with the proliferation of AI systems in enterprise environments. Topics could include AI risk assessment, compliance monitoring, and how IT auditors are adapting to the challenges of auditing increasingly autonomous systems.

    10. Inclusive Technology: The Role of Augmentative and Alternative Communication (AAC) in 2024 Special Education

    • Drawing from your research on AAC technologies, this article could delve into the latest advancements in AI-driven communication tools for non-verbal individuals, particularly in special education contexts. It would discuss how image-based AAC systems are being integrated with AI to enhance communication for students with autism.

    These topics are tailored to reflect the relevance of AI, data science, and digital transformation in 2024, with a focus on their intersection with education, politics, cybersecurity, and public health. Each article could be grounded in real-world developments while showcasing future trends that resonate with your areas of expertise.


    Can you write the first article, using my style of writing? Include the methods and codes for data aggregation, EDA, data verification and enrichment. It could include web crawlers and saving html files from the public domain. Use both R and Python.


     

     

Helmi