There are many powerful Natural Language Processing (NLP) tools for analyzing and interpreting textual data. This project is a practical exercise in applying one of the simplest NLP techniques to a neat dataset. The primary goal is to tell tweets by Donald Trump apart from tweets by Justin Trudeau with a simple yet effective approach.
This is an interesting dataset. It was shared by Moez Ali, an amazing professor who teaches Predictive Modelling and Big Data Analytics at Queen’s University. It contains 400 tweets by Donald Trump and Justin Trudeau.
import pandas as pd
Tweets from Donald Trump.
id | author | status |
---|---|---|
157 | Donald J. Trump | #JFKFiles https://t.co/AnPBSJFh3J |
152 | Donald J. Trump | After strict consultation with General Kelly, … |
105 | Donald J. Trump | The United States will be immediately implemen… |
114 | Donald J. Trump | ….for the Middle Class. The House and Senate… |
130 | Donald J. Trump | Thank you @LuisRiveraMarin! https://t.co/BK7sD… |
Tweets from Justin Trudeau.
id | author | status |
---|---|---|
345 | Justin Trudeau | RT @PMcanadien: En direct: le PM Trudeau souli… |
276 | Justin Trudeau | Merci à Nguyen Cong Hiep, du consulat canadien… |
336 | Justin Trudeau | Today, I spoke with Governor @GregAbbott_TX to… |
302 | Justin Trudeau | This afternoon, I met with Vietnam’s Secretary… |
323 | Justin Trudeau | RT @PattyHajdu: Focusing on prevention, increa… |
I held out 20% of the dataset for validation to see how well the model performs on tweets it has not seen.
from sklearn.model_selection import train_test_split
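As a rough sketch of what a 20% hold-out split can look like: the toy rows, the placeholder labels `author_a` and `author_b`, and the `stratify` and `random_state` arguments below are assumptions for illustration, not the original code. Only the `author` and `status` column names come from the tables above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataset. The texts and the labels
# "author_a"/"author_b" are invented for illustration.
toy = pd.DataFrame({
    "author": ["author_a"] * 5 + ["author_b"] * 5,
    "status": [
        "great rally tonight",
        "tax cuts are coming",
        "thank you everyone",
        "big news tomorrow",
        "jobs numbers are up",
        "merci à tous",
        "today I met with local leaders",
        "bonne journée à tous",
        "proud to support our communities",
        "en direct ce soir",
    ],
})

X_train, X_val, y_train, y_val = train_test_split(
    toy["status"],
    toy["author"],
    test_size=0.2,           # hold out 20% for validation, as in the post
    stratify=toy["author"],  # assumption: keep both authors represented in each split
    random_state=42,         # assumption: fixed seed so the split is reproducible
)

print(len(X_train), len(X_val))
```

Stratifying is optional, but with only two classes and a small dataset it keeps the validation set from being dominated by one author by chance.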
Since this project is for fun and the dataset is not huge, I am just using TfidfVectorizer instead of more advanced techniques.
from sklearn.feature_extraction.text import TfidfVectorizer
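A minimal sketch of the TF-IDF step, assuming default `TfidfVectorizer` settings (the post does not list its parameters) and using invented example texts. The point it illustrates is that the vocabulary and IDF weights are fitted on the training text only, then reused to transform the validation text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented example texts, not taken from the dataset.
train_texts = [
    "thank you for the great rally",
    "great jobs numbers today",
    "merci pour votre appui",
    "today I met with community leaders",
]
val_texts = ["thank you community leaders"]

# Default settings are an assumption here.
vectorizer = TfidfVectorizer()

# Learn the vocabulary and IDF weights from the training text only.
X_train_tfidf = vectorizer.fit_transform(train_texts)

# Reuse the fitted vocabulary so validation features line up with training features.
X_val_tfidf = vectorizer.transform(val_texts)

print(X_train_tfidf.shape, X_val_tfidf.shape)
```

Both matrices are sparse and share the same number of columns, one per term in the fitted vocabulary.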
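This logistic regression model achieved an accuracy score of 0.8875 on the validation set. With 20% of the 400 tweets held out, that is 71 of 80 tweets classified correctly, which is a strong result for such a simple approach.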
from sklearn.metrics import accuracy_score
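To show how the pieces can fit together, here is a hedged sketch that chains TF-IDF and logistic regression in a scikit-learn pipeline and scores it with `accuracy_score`. The texts, the labels `author_a` and `author_b`, and the use of `make_pipeline` with default hyperparameters are assumptions for illustration. It will not reproduce the 0.8875 reported above, which comes from the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

# Invented toy data; "author_a"/"author_b" are placeholder labels.
train_texts = [
    "great rally tonight thank you",
    "tax cuts are coming soon",
    "big news on jobs tomorrow",
    "merci à tous pour votre appui",
    "today I met with community leaders",
    "en direct ce soir avec les familles",
]
train_labels = ["author_a"] * 3 + ["author_b"] * 3
val_texts = ["thank you for the great rally", "merci pour votre appui"]
val_labels = ["author_a", "author_b"]

# The pipeline fits TF-IDF on the training text, then fits the classifier
# on the resulting features; predict() applies the same transform first.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

val_pred = model.predict(val_texts)

# Accuracy is the fraction of validation tweets whose predicted author
# matches the true author.
score = accuracy_score(val_labels, val_pred)
print(score)
```

Wrapping both steps in one pipeline avoids accidentally fitting the vectorizer on validation text.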
— Aug 21, 2024