You can modify the code to measure classification accuracy by adding an LDA classifier and cross-validation. First, import the necessary modules and split the data into training and testing sets. Then fit the LDA model and evaluate its accuracy using cross-validation. Here's the modified code:
```python
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from nltk.tokenize.casual import casual_tokenize
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# Open and read the file
with open('/content/drive/MyDrive/Colab Notebooks/pass_task_3/articles+4.txt', 'r') as f:
    data = f.read()

# Assume each line represents a separate document
documents = data.split('\n')

# Build the TF-IDF matrix and center it by subtracting the column means
tfidf = TfidfVectorizer(tokenizer=casual_tokenize)
tfidf_docs = tfidf.fit_transform(raw_documents=documents).toarray()
tfidf_docs = pd.DataFrame(tfidf_docs)
tfidf_docs = tfidf_docs - tfidf_docs.mean()

# List of dimensions to test
dimensions = [2, 4, 8, 16, 32]

# Train PCA models with different dimensions and compare explained variance ratio
explained_variances = {}
for dim in dimensions:
    pca = PCA(n_components=dim)
    pca = pca.fit(tfidf_docs)
    total_explained_variance = np.sum(pca.explained_variance_ratio_)
    explained_variances[dim] = total_explained_variance
    print(f"Total explained variance for {dim} dimensions: {total_explained_variance:.3f}")

# Find and print the best number of dimensions
best_dim = max(explained_variances, key=explained_variances.get)
print(f"Best number of dimensions: {best_dim} with total explained variance: {explained_variances[best_dim]:.3f}")

# Assuming you have the target labels (0 for not spam and 1 for spam) for each document
# in a variable called 'labels'. Replace sms.spam with labels.
labels = [...]

# Apply PCA with the best number of dimensions
pca = PCA(n_components=best_dim)
pca_topic_vectors = pca.fit_transform(tfidf_docs)

# Evaluate accuracy on a single train/test split
X_train, X_test, y_train, y_test = train_test_split(pca_topic_vectors, labels, test_size=0.3, random_state=271828)
lda = LDA(n_components=1)
lda.fit(X_train, y_train)
score = round(lda.score(X_test, y_test), 3)
print(f"Accuracy: {score}")

# Evaluate accuracy with 10-fold cross-validation
lda = LDA(n_components=1)
scores = cross_val_score(lda, pca_topic_vectors, labels, cv=10)
print("Accuracy: {:.3f} (+/-{:.3f})".format(scores.mean(), scores.std() * 2))
```
Please note that you need to replace `sms.spam` with the target labels for each document. I have added a placeholder variable called `labels`; you need to replace it with the actual labels for each document.
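For example, if you keep the labels in a separate text file with one 0/1 value per line, you could load them like this (a minimal sketch; the file name `labels.txt` is hypothetical and shown purely for illustration):

```python
# Hypothetical labels file: one label (0 or 1) per line, aligned with the documents
with open('/content/drive/MyDrive/Colab Notebooks/pass_task_3/labels.txt', 'r') as f:
    labels = [int(line.strip()) for line in f if line.strip() != '']
```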
To compare the accuracy for each dimension, you can use a loop to iterate through the dimensions and perform cross-validation for each one. Here's the modified code, which also replaces `sms.spam` with `documents`:
```python
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from nltk.tokenize.casual import casual_tokenize
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# Open and read the file
with open('/content/drive/MyDrive/Colab Notebooks/pass_task_3/articles+4.txt', 'r') as f:
    data = f.read()

# Assume each line represents a separate document
documents = data.split('\n')

tfidf = TfidfVectorizer(tokenizer=casual_tokenize)
tfidf_docs = tfidf.fit_transform(raw_documents=documents).toarray()
tfidf_docs = pd.DataFrame(tfidf_docs)
tfidf_docs = tfidf_docs - tfidf_docs.mean()

# Assuming you have the target labels (0 for not spam and 1 for spam) for each document
# in a variable called 'labels'. Replace sms.spam with labels.
labels = [...]

# List of dimensions to test
dimensions = [2, 4, 8, 16, 32]

# Train PCA models with different dimensions and compare the accuracy
accuracies = {}
for dim in dimensions:
    pca = PCA(n_components=dim)
    pca_topic_vectors = pca.fit_transform(tfidf_docs)
    lda = LDA(n_components=1)
    scores = cross_val_score(lda, pca_topic_vectors, labels, cv=5)
    mean_accuracy = scores.mean()
    accuracies[dim] = mean_accuracy
    print(f"Accuracy for {dim} dimensions: {mean_accuracy:.3f} (+/-{scores.std() * 2:.3f})")

# Find and print the best number of dimensions based on accuracy
best_dim = max(accuracies, key=accuracies.get)
print(f"Best number of dimensions: {best_dim} with accuracy: {accuracies[best_dim]:.3f}")
```
Please note that you need to replace `sms.spam` with the target labels for each document. I have added a placeholder variable called `labels`; you need to replace it with the actual labels for each document.
The error you're encountering is because the length of `labels` is not consistent with the length of your data (`pca_topic_vectors`). It appears that the `labels` variable is a single value instead of an array with the same length as the number of documents. You need to make sure that the `labels` variable is an array or list of labels with the same length as the number of documents. Here's a sample code snippet that creates a list of random labels for demonstration purposes:
```python
import random

# Replace this part with the actual logic to get labels for your documents
labels = [random.randint(0, 1) for _ in range(len(documents))]
```
Replace the `labels = [...]` line in the previous code example with the above code snippet, or use the appropriate logic to create a list of labels for your documents. Remember that the sample code snippet creates random labels and should be replaced with the actual logic to get the correct labels for your documents.
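As a quick sanity check before training (a minimal sketch, assuming `documents` and `labels` are already defined), you can verify that the two sequences are aligned:

```python
# Every document needs exactly one label, otherwise cross_val_score will raise an error
assert len(labels) == len(documents), (
    f"Got {len(labels)} labels for {len(documents)} documents"
)
```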
You can use a loop to iterate through the dimensions and perform PCA transformation for each one. Here's the modified code:
```python
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from nltk.tokenize.casual import casual_tokenize

# Open and read the file
with open('/content/drive/MyDrive/Colab Notebooks/pass_task_3/articles+4.txt', 'r') as f:
    data = f.read()

# Assume each line represents a separate document
documents = data.split('\n')

tfidf = TfidfVectorizer(tokenizer=casual_tokenize)
tfidf_docs = tfidf.fit_transform(raw_documents=documents).toarray()
tfidf_docs = pd.DataFrame(tfidf_docs)
tfidf_docs = tfidf_docs - tfidf_docs.mean()

# List of dimensions to test
dimensions = [2, 4, 8, 16, 32]

# Train PCA models with different dimensions and transform the data
for dim in dimensions:
    pca = PCA(n_components=dim)
    pca = pca.fit(tfidf_docs)
    pca_topic_vectors = pca.transform(tfidf_docs)
    columns = ['topic{}'.format(i) for i in range(pca.n_components)]
    pca_topic_vectors = pd.DataFrame(pca_topic_vectors, columns=columns)
    print(f"PCA topic vectors for {dim} dimensions:")
    print(pca_topic_vectors.round(3).head(6))
    print()
```
This code will print the first 6 rows of the PCA topic vectors for each dimension.
This code computes the term frequency-inverse document frequency (TF-IDF) matrix for the given documents and then centers the data by subtracting the mean of each feature.
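If you want to confirm that the centering step worked, here's a minimal sketch (assuming `tfidf_docs` is the centered DataFrame from above); the column means should all be approximately zero:

```python
import numpy as np

# After mean-centering, every TF-IDF feature column should average to (roughly) zero
assert np.allclose(tfidf_docs.mean().to_numpy(), 0.0, atol=1e-9)
print("Centered TF-IDF matrix: column means are all ~0")
```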