Analyzing Media Bias: A Text Analytics Approach to Distinguishing Liberal and Conservative News Content
Text analytics has been widely utilized in sentiment analysis, large language models (LLMs), and natural language processing (NLP), but its applications extend into media and social science as well. By analyzing unstructured data, text analytics can help identify biases and framing in how information is presented, enabling a better understanding of that information. This approach holds significant potential for uncovering insights into media content and social science research, contributing to a more nuanced understanding of societal dynamics and communication.
In this article, I will demonstrate text analytics techniques for calculating similarity scores between sentences and their application in domain-specific scenarios. The focus will be on computing similarity scores between news headlines and using them to distinguish between left- and right-leaning news agencies in the United States.
The article is structured in two main sections. The first part focuses on the coding aspect of text analytics, providing a step-by-step guide on how to implement text processing and analysis techniques. This section is essential for readers looking to understand the technical underpinnings and methodologies involved in analyzing textual data effectively. The second part of the article presents my findings, where I delve into the insights and patterns uncovered from the text analytics process. This division not only organizes the content logically but also caters to both technically inclined readers and those more interested in the outcomes of text analytics.
Part 1: Text Analytics in Python
I conducted primary data collection by compiling comparable news coverage from four different outlets: The New York Times and CNN on the left, and Fox News and NewsMax on the right. I collected only the headlines for nine different news stories, with each story covered by all four outlets. The goal was to assess how similar or different the scores generated by text analytics are when comparing these news agencies. I aim to determine whether outlets on the same side of the political spectrum are more similar to each other than to outlets on the opposite side, and whether text analytics tools can detect such differences.
The data I used is available on my GitHub; look for the file news.xlsx.
Unlike my previous articles, this one is focused entirely on Python. I’ve utilized various Python packages, and most of the analytics demonstrated in this article can be accomplished without coding from scratch.
To address this problem effectively, you’ll need to equip yourself with a set of specific Python packages. Below is a curated list of essential libraries that play a pivotal role in text analytics projects. If you’re planning to follow along with a hands-on example, make sure to also import your dataset file.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize
from nltk.stem import PorterStemmer
from gensim.models import Word2Vec
import pandas as pd
from sklearn.metrics import f1_score
import numpy as np
import nltk
# Load spreadsheet
news = pd.ExcelFile('news.xlsx')
# Load a sheet into a DataFrame by its name
news = news.parse('Sheet1')
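If you are following along, you may need to download a few NLTK resources first (the exact resource names can vary by NLTK version), and it helps to take a quick look at the data. The sketch below is my own addition; the column names it reveals (Agency, Topic, Headline) are the ones used throughout the rest of the article.
# One-time NLTK downloads (needed by word_tokenize and WordNetLemmatizer; names may vary by version)
nltk.download('punkt')
nltk.download('wordnet')
# Quick look at the structure of the data
print(news.shape)
print(news.columns.tolist())
news.head()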
I will be utilizing the power of Google’s Word2Vec model. Word2Vec, developed by a team of researchers at Google, represents a transformative approach to word embeddings, a technique pivotal in the field of natural language processing (NLP). This algorithm learns the meaning of a word from the contexts in which it appears, effectively capturing the syntactic and semantic nuances of language in a high-dimensional vector space. The pretrained model used here, trained on the Google News dataset, provides 300-dimensional vectors for roughly 3 million words and phrases.
It’s okay if you’re not familiar with this. This is a ready-made package from Google that we can easily take advantage of. This model will help me generate similarity scores between different words, which I will further demonstrate in this article.
Import and load the Word2Vec model:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')
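Once the vectors are loaded, you can already query word-level similarities directly. The word pairs below are illustrative examples of my own, not taken from the news dataset:
# Similarity between individual words (illustrative pairs only)
print(wv.similarity('president', 'senator'))
print(wv.similarity('president', 'pizza'))
# Words closest to a query word
print(wv.most_similar('election', topn=3))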
# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()
# Function to process headlines with lemmatization
def pre_processing_by_nltk(doc, lemmatize=True):
    # Tokenize the document into individual words and punctuation
    tokens = word_tokenize(doc)
    # Apply lemmatization if specified, using the lemmatizer initialized above
    if lemmatize:
        tokens = [lemmatizer.lemmatize(word) for word in tokens]
    # Convert tokens to lowercase
    return [w.lower() for w in tokens]
The above code provides a ready-made function that you can use to lemmatize the sentence. Lemmatization is a process in natural language processing (NLP) where the goal is to reduce a word to its base or root form, called a lemma. Unlike stemming, which often simply chops off endings from a word, lemmatization considers the context and uses a full understanding of the word, including its part of speech and meaning, to transform it into its canonical form. This process allows words with different forms but the same underlying meaning to be treated as the same entity, improving the performance of text-processing applications.
In the context of news articles, writers often employ a diverse vocabulary and complex sentence structures to convey nuanced viewpoints, evoke emotional responses, or highlight specific aspects of a story. Lemmatization aids in simplifying this complexity, making it easier to analyze the core themes and sentiments expressed in the articles. This is particularly beneficial for tasks like sentiment analysis, thematic categorization, or trend identification, where understanding the underlying meaning is essential.
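To make the difference from stemming concrete, here is a tiny standalone illustration; the example words are my own and chosen only to show the contrast (PorterStemmer was already imported above):
# Lemmatization vs. stemming on a few illustrative words (not from the dataset)
stemmer = PorterStemmer()
for word in ['studies', 'owners', 'headlines']:
    print(word, '-> lemma:', lemmatizer.lemmatize(word), '| stem:', stemmer.stem(word))
# e.g. 'studies' keeps the readable lemma 'study', while the stemmer truncates it to 'studi'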
Let's see how the function works:
pre_processing_by_nltk(news['Headline'][1])
Original headline: Pence Says He Won’t Endorse Trump, but Won’t Vote for Biden Either
Processed headline: ['pence', 'says', 'he', 'won', "'", 't', 'endorse', 'trump', ',', 'but', 'won', "'", 't', 'vote', 'for', 'biden', 'either']
Original headline: US threatens TikTok ban if Chinese owners don’t sell stakes in company
Processed headline: ['us', 'threatens', 'tiktok', 'ban', 'if', 'chinese', 'owner', 'do', "n't", 'sell', 'stake', 'in', 'company']
Processed words can be compared using Word2Vec, and a cosine similarity score can be derived. A potential issue might arise, for example, with ‘US’ (intended to mean the United States) being lowercased to ‘us’ during preprocessing. In practice, however, the pretrained Google News vectors capture enough context for such tokens that the comparison does not result in a complete loss of the intended meaning.
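A quick way to check this yourself (a small sanity check I'm adding here, not part of the original pipeline) is to confirm which case variants of a token actually exist in the model's vocabulary:
# Check which case variants of a token exist in the pretrained vocabulary
for token in ['US', 'us', 'TikTok', 'tiktok']:
    print(token, 'in vocabulary:', token in wv)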
Now, using the function, I pre-process the headline and store it in a new column.
news['processed_headline'] = news['Headline'].apply(pre_processing_by_nltk)
The new column will contain the preprocessed tokens for every row. These values will be used for the analysis. I have also kept only the columns I need, to make the dataset smaller.
news.rename(columns={'Topic ': 'Topic'}, inplace=True)
text_news = news[['Agency', 'Topic', 'Headline','processed_headline']]
Now, I define a few functions that will help me get the similarity scores. Again, it's okay if you don't understand every step; even I had trouble writing them and used ChatGPT for help.
from numpy.linalg import norm

def sentence_vector(words, mod):
    # Filter out words not in the model's vocabulary
    word_vectors = [mod[word] for word in words if word in mod]
    # Handle the case where the sentence contains no valid words after preprocessing
    if len(word_vectors) == 0:
        return np.zeros(mod.vector_size)
    # Return the average of the word vectors
    return np.mean(word_vectors, axis=0)

def cosine_similarity(vec1, vec2):
    # Guard against all-zero vectors (e.g., when no words were found in the vocabulary)
    if norm(vec1) == 0 or norm(vec2) == 0:
        return 0.0
    # Calculate the cosine similarity between two vectors
    return np.dot(vec1, vec2) / (norm(vec1) * norm(vec2))

def similarity_score(text1, text2, mod):
    # Get the sentence vectors for the two lists of words
    vec1 = sentence_vector(text1, mod)
    vec2 = sentence_vector(text2, mod)
    # Calculate and return the cosine similarity
    return cosine_similarity(vec1, vec2)
This is how you can now generate comparisons.
text1 = text_news['processed_headline'][0]
text2 = text_news['processed_headline'][1]
similarity = similarity_score(text1, text2, wv)
print(f"liberal Similarity score: {similarity}")
#Output will be
#liberal Similarity score: 0.7776579260826111
I will check the similarity scores of the headlines against each other, but I want to make sure I'm only comparing articles that are about the same story. The data has a 'Topic' column that groups the stories. This will help me compare only the articles that belong to the same category, ensuring my comparisons are fair and on-topic.
scores = []
for i in range(1, 10):  # Topics range from 1 to 9
    sub_data = text_news[text_news['Topic'] == i]
    agencies = sub_data['Agency'].unique()
    # Iterate over every pair of agencies within the topic
    for j in range(len(agencies)):
        for k in range(j + 1, len(agencies)):
            agency1 = agencies[j]
            agency2 = agencies[k]
            text1 = sub_data[sub_data['Agency'] == agency1]['processed_headline'].iloc[0]
            text2 = sub_data[sub_data['Agency'] == agency2]['processed_headline'].iloc[0]
            similarity = similarity_score(text1, text2, wv)
            scores.append({
                'Topic': i,
                'Agency1': agency1,
                'Agency2': agency2,
                'Similarity': similarity
            })
# Convert the list of dictionaries to a DataFrame
score_df = pd.DataFrame(scores)
# Display the DataFrame
score_df
The following code makes the dataset easier to interpret.
# Rank the similarity scores within each topic
score_df['Rank'] = score_df.groupby('Topic')['Similarity'].rank(ascending=False)
# Assign values based on the agencies
def assign_agency_group(row):
    left_agencies = ['NYT', 'CNN']
    right_agencies = ['Fox', 'NewsMax']
    if row['Agency1'] in left_agencies and row['Agency2'] in left_agencies:
        return 'left'
    elif row['Agency1'] in right_agencies and row['Agency2'] in right_agencies:
        return 'right'
    else:
        return 'left vs right'
score_df['Agency_Group'] = score_df.apply(assign_agency_group, axis=1)
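As a quick way to summarize the results before diving into them (this aggregation is my own addition; the column names follow the code above), you can average the scores and ranks by comparison type:
# Average similarity and average rank for each type of comparison
print(score_df.groupby('Agency_Group')['Similarity'].mean().sort_values(ascending=False))
print(score_df.groupby('Agency_Group')['Rank'].mean())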
The final dataset, score_df, shows the similarity scores, how each pair ranks within its topic, and what kind of comparison is being made (left vs left, right vs right, or left vs right).
The similarity scores offer a snapshot of the comparative analysis of headlines from the different news agencies. The dataset is organized by topic, ensuring a coherent flow of comparison. The Agency1 and Agency2 columns identify the two agencies under comparison, serving as a direct reference for each pairing. The Similarity column reflects the output of the model, quantifying the degree of similarity between the headlines and thus indicating how closely the narratives align or diverge.
Additionally, the Rank column presents a hierarchy of these scores within each topic, illustrating the relative closeness of the headlines in each unique set of comparisons (for instance, among the six possible combinations in topic 1). The final column, Agency_Group, categorizes the type of comparison being made (NYT vs CNN is left vs left, NYT vs Fox is left vs right, NewsMax vs CNN is right vs left, and Fox vs NewsMax is right vs right).
This code should be reproducible. If you keep the same worksheet format, adding more rows or switching from headlines to the full article text should not be an issue. Using more news agencies or otherwise changing the structure may require some tweaking.
Part 2: Media Bias Findings
Having completed the coding phase, I now have some findings regarding the model’s performance. My analysis was conducted through three distinct comparisons: firstly, assessing the similarities within left-leaning media outlets across all nine news items; secondly, evaluating the similarities within right-leaning media for the same news items; and thirdly, analyzing the differences between left-leaning and right-leaning media across all nine news pieces. These comparisons provide a comprehensive overview of the model’s ability to discern and quantify ideological nuances in media reporting.
NYT vs CNN
From this table, it is evident that The New York Times (NYT) and CNN have quite high similarity scores for most of the headlines. The three instances of rank one indicate that, for three of the nine topics, the left-leaning pair had the highest similarity score of any pairing.
Fox vs NewsMax
Though not as high as the left-leaning pair, the right-leaning agencies also exhibit decent similarity scores. However, only one headline ranks at the top in similarity among the right-leaning pair. Surprisingly, some headlines have very low scores as well and are ranked last. While we might expect these news agencies to align closely, variations in wording and phrasing may have contributed to the lower scores.
NYT vs Fox and NewsMax; CNN vs Fox and NewsMax
The comparison between right-leaning and left-leaning agencies offers a wealth of information and shows a diverse distribution of similarity scores. There are some surprising findings, such as the high similarity score of 0.91 between CNN (left) and Fox (right) for news headline 6, which is the highest similarity score across all combinations. However, in general, CNN and Fox tend to have lower similarity scores compared to other combinations. For example, for headline 5, the similarity score between CNN and Fox is only 0.47, and for headline 8, it is 0.58. These variations highlight the complexity of the relationship between the reporting styles of different news agencies.
A Particular Comparison: NYT vs Fox
The comparison between Fox News and The New York Times (NYT) reveals the most contrasting results among the pairs. Out of the nine different headlines analyzed, six received the lowest rank in terms of similarity between these two agencies. Surprisingly, headline 5 achieved the top rank among these pairs, but even then, the similarity score was 0.74, which is not particularly high considering that the best-ranking results typically range from 0.8 to 0.9. This indicates a significant divergence in the reporting styles and content between Fox News and The New York Times.
This observation highlights the distinctive approaches taken by The New York Times (NYT) and Fox News in reporting the same news event. Despite covering the same story, each outlet employs contrasting language and phrasing to convey its unique perspective or message. This divergence in word choice underscores the ideological differences between the outlets, illustrating how similar information can be framed in varying ways to align with or emphasize particular viewpoints.
Although the results are diverse and not definitively conclusive, they suggest that text analytics can indeed discern variations in how media outlets report news, reflecting their underlying ideological leanings. This study focused solely on news headlines, which tend to be brief and less varied in language than full-length articles. Despite this limitation, the findings hint at significant potential: by expanding the analysis to more detailed text, text analytics and similarity scoring methods could more effectively highlight the distinct language choices made by different news agencies. These linguistic preferences are not arbitrary; they are strategic, aiming to underscore each outlet's ideology and perspective, thereby shaping the message conveyed to the audience.
All code and files are available in my Git repository (link coming soon).