Sure, I can help you with that! To scrape tweets for a specific hashtag on Twitter and perform sentiment analysis using Python, we will need a few different libraries. Here are the steps we will follow:
- Install the necessary libraries
- Authenticate with the Twitter API
- Scrape tweets from the desired hashtag
- Clean and preprocess the text data
- Perform sentiment analysis using nltk and Sastrawi
- Save the results to a CSV file
Let's get started!
### 1. Install the necessary libraries
We will need to install several Python libraries to complete this task. You can install them using pip, the Python package manager. Run the following command in your terminal or command prompt:
```bash
pip install tweepy pandas nltk Sastrawi
```
### 2. Authenticate with the Twitter API
To access Twitter's API, we will need to create a developer account and obtain API keys. Once you have your keys, you can use the following code to authenticate with the API:
```python
import tweepy

consumer_key = 'your_consumer_key'
consumer_secret = 'your_consumer_secret'
access_token = 'your_access_token'
access_token_secret = 'your_access_token_secret'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
```
Replace the placeholders with your own keys.
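If you want to confirm the keys work before scraping anything, tweepy's `verify_credentials` call is a quick sanity check:
```python
# Optional: raises an error if the credentials are invalid
user = api.verify_credentials()
print('Authenticated as @{}'.format(user.screen_name))
```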
### 3. Scrape tweets from the desired hashtag
Now we can use the Tweepy library to scrape tweets that contain our desired hashtag. We will use the `Cursor` object to paginate through the results and collect the username and text content of each tweet.
```python
import pandas as pd

tweets = []
query = '#yourhashtag'
for tweet in tweepy.Cursor(api.search_tweets,
                           q=query,
                           lang='id').items(1000):
    tweets.append([tweet.user.screen_name, tweet.text])

df = pd.DataFrame(tweets, columns=['username', 'content'])
```
This code will scrape the 1000 most recent tweets containing your hashtag and store the usernames and text content in a Pandas DataFrame called `df`.
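A quick way to confirm the scrape worked is to peek at the DataFrame:
```python
print(len(df), 'tweets collected')
print(df.head())
```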
### 4. Clean and preprocess the text data
Before we can perform sentiment analysis, we need to clean and preprocess the text data. This will involve removing any URLs, mentions, and hashtags, as well as tokenizing the text into individual words.
```python
import re
from nltk.tokenize import word_tokenize

def preprocess_text(text):
    # Remove URLs, mentions, and hashtags
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'@\S+', '', text)
    text = re.sub(r'#\S+', '', text)
    # Tokenize the text
    tokens = word_tokenize(text)
    return tokens

df['tokens'] = df['content'].apply(preprocess_text)
```
This code will add a new column to the DataFrame called `tokens`, which contains the preprocessed text data.
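To see what the function does, here is a quick check on a made-up example tweet (the handle and URL are invented):
```python
sample = 'Pelayanan di @TokoContoh cepat sekali! #yourhashtag https://t.co/abc123'
print(preprocess_text(sample))
# ['Pelayanan', 'di', 'cepat', 'sekali', '!']
```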
### 5. Perform sentiment analysis using nltk and Sastrawi
Now we can use nltk and Sastrawi to perform sentiment analysis on the text data. nltk does not ship a ready-to-load sentiment classifier, so we will train a simple Naive Bayes classifier on nltk's movie_reviews corpus and use it to label each tweet as positive ('pos') or negative ('neg'). Note that movie_reviews is an English corpus, so on Indonesian tweets the labels should be treated as illustrative; for real use you would train on an Indonesian sentiment dataset.
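Before wiring it into the pipeline, it helps to see what the Sastrawi stemmer does on its own; the example words below are the ones used in Sastrawi's README:
```python
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

stemmer = StemmerFactory().create_stemmer()
print(stemmer.stem('menjalankan'))   # jalan
print(stemmer.stem('perekonomian'))  # ekonomi
```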
```python
import nltk
import random
from nltk.corpus import movie_reviews
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

# Download the necessary nltk data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('movie_reviews')

# Build labelled documents from the movie_reviews corpus
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

# Use the 2000 most frequent words in the corpus as features
all_words = nltk.FreqDist(word.lower() for word in movie_reviews.words())
word_features = [word for word, _ in all_words.most_common(2000)]

def extract_features(document):
    # Mark which of the feature words appear in the document
    document_words = set(document)
    return {'contains({})'.format(word): (word in document_words)
            for word in word_features}

# Train the Naive Bayes classifier
featuresets = [(extract_features(words), category)
               for words, category in documents]
classifier = nltk.NaiveBayesClassifier.train(featuresets)

# Create an Indonesian stemmer
factory = StemmerFactory()
stemmer = factory.create_stemmer()

def analyze_sentiment(tokens):
    # Apply the stemmer to each token
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    # Extract the same features the classifier was trained on
    test_features = extract_features(stemmed_tokens)
    # Classify the tweet as positive ('pos') or negative ('neg')
    return classifier.classify(test_features)

df['sentiment'] = df['tokens'].apply(analyze_sentiment)
```
This code will add a new column to the DataFrame called `sentiment`, which contains the sentiment analysis results.
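To get a quick feel for the results, you can look at the label distribution:
```python
print(df['sentiment'].value_counts())
```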
### 6. Save the results to a CSV file
Finally, we can save the results to a CSV file using the `to_csv` method of the Pandas DataFrame.
```python
df[['username', 'content', 'sentiment']].to_csv('tweets.csv', index=False)
```
This code will save the DataFrame to a CSV file called `tweets.csv`, with columns for `username`, `content`, and `sentiment`.
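If you want to double-check the output, you can read the file straight back:
```python
print(pd.read_csv('tweets.csv').head())
```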
And that's it! You now have a Python script that can scrape tweets for a specific hashtag on Twitter, perform sentiment analysis using nltk and Sastrawi, and save the results to a CSV file.
Sure! Here's the complete code:
```python
import tweepy
import pandas as pd
import re
import random
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import movie_reviews
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

# Download the necessary nltk data (punkt is needed before tokenizing)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('movie_reviews')

# Authenticate with the Twitter API
consumer_key = 'your_consumer_key'
consumer_secret = 'your_consumer_secret'
access_token = 'your_access_token'
access_token_secret = 'your_access_token_secret'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

# Scrape tweets from the desired hashtag
tweets = []
query = '#yourhashtag'
for tweet in tweepy.Cursor(api.search_tweets,
                           q=query,
                           lang='id').items(1000):
    tweets.append([tweet.user.screen_name, tweet.text])

df = pd.DataFrame(tweets, columns=['username', 'content'])

# Clean and preprocess the text data
def preprocess_text(text):
    # Remove URLs, mentions, and hashtags
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'@\S+', '', text)
    text = re.sub(r'#\S+', '', text)
    # Tokenize the text
    return word_tokenize(text)

df['tokens'] = df['content'].apply(preprocess_text)

# Train a Naive Bayes classifier on the movie_reviews corpus
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

all_words = nltk.FreqDist(word.lower() for word in movie_reviews.words())
word_features = [word for word, _ in all_words.most_common(2000)]

def extract_features(document):
    document_words = set(document)
    return {'contains({})'.format(word): (word in document_words)
            for word in word_features}

featuresets = [(extract_features(words), category)
               for words, category in documents]
classifier = nltk.NaiveBayesClassifier.train(featuresets)

# Perform sentiment analysis with the classifier and the Sastrawi stemmer
factory = StemmerFactory()
stemmer = factory.create_stemmer()

def analyze_sentiment(tokens):
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    test_features = extract_features(stemmed_tokens)
    return classifier.classify(test_features)

df['sentiment'] = df['tokens'].apply(analyze_sentiment)

# Save the results to a CSV file
df[['username', 'content', 'sentiment']].to_csv('tweets.csv', index=False)
```
Remember to replace `#yourhashtag` and the authentication credentials with your own values. I hope this helps!