Prompt Detail:
Question 1.1.1 Assign stemmed_message to the stemmed version of the word "vegetables".
stemmed_message = vocab_mapping.where("Word", "vegetables").column("Stem")[0]
stemmed_message
In this question, we look up the row for the word "vegetables" in the vocab_mapping table and assign its stem to the variable stemmed_message.
Question 1.1.2 What stem in the dataset has the most words that are shortened to it? Assign most_stem to that stem.
most_stem = vocab_mapping.group("Stem").sort("count", descending=True).column("Stem")[0]
most_stem
In this question, we first group the vocab_mapping table by the "Stem" column, which counts how many words map to each stem. Then we sort the resulting table by that count in descending order and take the stem in the first row.
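The group-and-count logic can be checked without the datascience package. The following is a minimal plain-Python sketch; the word/stem pairs are made up for illustration and are not rows from the real vocab_mapping table.

```python
from collections import Counter

# Hypothetical (word, stem) pairs standing in for rows of vocab_mapping.
toy_mapping = [
    ("vegetables", "veget"),
    ("vegetable", "veget"),
    ("vegetation", "veget"),
    ("running", "run"),
    ("runs", "run"),
    ("cat", "cat"),
]

# Count how many words map to each stem (the analogue of group("Stem")).
stem_counts = Counter(stem for _, stem in toy_mapping)

# Take the stem with the largest count (the analogue of sorting by "count"
# in descending order and reading the first row).
toy_most_stem, toy_count = stem_counts.most_common(1)[0]
print(toy_most_stem, toy_count)
```

If two stems share the largest count, either approach returns just one of them, so the result depends on row order.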
Question 1.1.3 What is the longest word in the dataset whose stem wasn't shortened? Assign that to longest_uncut. Break ties alphabetically from Z to A (so if your options are "albatross" or "batman", you should pick "batman").
# Add the length of each word and of its stem
tbl_with_lens = vocab_mapping.with_columns(
    "Word Length", vocab_mapping.apply(len, "Word"),
    "Stem Length", vocab_mapping.apply(len, "Stem")
)
tbl_with_dif = tbl_with_lens.with_column(
    "Difference", tbl_with_lens.column("Word Length") - tbl_with_lens.column("Stem Length")
)
# Keep words whose stem is the same length as the word (not shortened)
uncut = tbl_with_dif.where("Difference", 0)
# Keep only the longest of those, then break ties from Z to A
longest = uncut.where("Word Length", max(uncut.column("Word Length")))
longest_uncut = longest.sort("Word", descending=True).column("Word")[0]
longest_uncut
In this question, we first add columns to the vocab_mapping table containing the length of the word and the length of the stem. Then we add another column that stores the difference between the two lengths. If the word was not shortened, the difference is 0. We keep the rows where the difference is 0, restrict them to the rows with the maximum word length, sort those by word in descending order (Z to A), and take the first word. Note that Table.sort takes a single column, so the tie-break is done as a separate step rather than by passing a list of columns.
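The tie-breaking rule (longest first, then Z to A) can be sketched in plain Python by ranking each word on a (length, word) pair. The toy data below is hypothetical and only illustrates the ordering.

```python
# Hypothetical (word, stem) pairs; not taken from the real dataset.
toy_mapping = [
    ("albatross", "albatross"),
    ("batman", "batman"),
    ("zebra", "zebra"),
    ("waterfall", "waterfall"),
    ("vegetables", "veget"),
]

# Keep only words whose stem has the same length as the word (not shortened).
uncut_words = [word for word, stem in toy_mapping if len(word) == len(stem)]

# Rank by length first, then alphabetically. max() picks the longest word
# and, among equally long words, the one that comes latest in the alphabet.
toy_longest_uncut = max(uncut_words, key=lambda word: (len(word), word))
print(toy_longest_uncut)
```

Here "albatross" and "waterfall" both have nine letters, and the Z-to-A rule selects "waterfall".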
Question 1.2.1 Calculate the correlation coefficient for the association between proportion of words that are "outer" and the proportion of words that are "space" for every movie in the dataset, and assign it to outer_space_r.
outer_su = (outer - np.mean(outer)) / np.std(outer)
space_su = (space - np.mean(space)) / np.std(space)
outer_space_r = np.mean(outer_su * space_su)
outer_space_r
In this question, we first calculate the standardized units for the "outer" and "space" columns. Then, we compute the element-wise product of the standardized units and take the mean to find the correlation coefficient.
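To see why the mean of the products in standard units behaves like a correlation coefficient, here is a small standard-library sketch on made-up data. The helper names `standard_units` and `correlation` are illustrative, and `pstdev` is used because it matches the population standard deviation that `np.std` computes by default.

```python
from statistics import fmean, pstdev

def standard_units(values):
    """Convert values to standard units using the population SD."""
    mean = fmean(values)
    sd = pstdev(values)
    return [(v - mean) / sd for v in values]

def correlation(xs, ys):
    """Mean of the element-wise products of xs and ys in standard units."""
    xs_su = standard_units(xs)
    ys_su = standard_units(ys)
    return fmean([x * y for x, y in zip(xs_su, ys_su)])

# Toy data: y rises (or falls) exactly linearly with x.
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_r_up = correlation(toy_x, [2.0, 4.0, 6.0, 8.0])
toy_r_down = correlation(toy_x, [8.0, 6.0, 4.0, 2.0])
print(toy_r_up, toy_r_down)
```

A perfectly increasing linear relationship gives a value of about 1, and a perfectly decreasing one gives about -1, up to floating-point rounding.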
Question 1.2.2 Choose two different words in the dataset with a correlation higher than 0.2 or smaller than -0.2 that are not outer and space and plot a scatter plot with a line of best fit for them.
word_x = "chief"
word_y = "execut"
arr_x = movies.column(word_x)
arr_y = movies.column(word_y)
# Convert both columns to standard units and compute the correlation
x_su = (arr_x - np.mean(arr_x)) / np.std(arr_x)
y_su = (arr_y - np.mean(arr_y)) / np.std(arr_y)
r = np.mean(x_su * y_su)
# Slope and intercept of the regression line in original units
slope = r * np.std(arr_y) / np.std(arr_x)
intercept = np.mean(arr_y) - slope * np.mean(arr_x)
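The slope and intercept formulas for the line of best fit can be checked on toy data where the answer is known in advance. This is a standard-library sketch; `fit_line` is a hypothetical helper and the data is invented so that the points lie exactly on y = 2x + 1.

```python
from statistics import fmean, pstdev

def fit_line(xs, ys):
    """Return (slope, intercept) of the regression line of ys on xs."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sd_x, sd_y = pstdev(xs), pstdev(ys)
    # Correlation: mean of the products in standard units.
    r = fmean([((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
               for x, y in zip(xs, ys)])
    slope = r * sd_y / sd_x
    # The regression line passes through the point of averages.
    intercept = mean_y - slope * mean_x
    return slope, intercept

toy_x = [0.0, 1.0, 2.0, 3.0]
toy_y = [1.0, 3.0, 5.0, 7.0]  # exactly y = 2x + 1
toy_slope, toy_intercept = fit_line(toy_x, toy_y)
print(toy_slope, toy_intercept)
```

The intercept uses the slope times the mean of x, which is what makes the fitted line pass through the point of averages.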