Here are the updated answers with explanations for questions 1.1.1 to 1.3.1:
Question 1.1.1: Find the stemmed version of the word "vegetables":
stemmed_message = vocab_mapping.where("Word", "vegetables").column("Stem").item(0)
stemmed_message
Explanation: The code above searches the vocab_mapping table for the row whose "Word" is "vegetables", takes the value in its "Stem" column, and assigns it to the variable stemmed_message.
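To make the where / column / item chain concrete, here is a plain-Python sketch of the same lookup. The rows are made-up sample data rather than the real contents of vocab_mapping, and find_stem is a hypothetical helper name used only for illustration.

```python
# Hypothetical sample rows standing in for vocab_mapping (not the real data).
sample_mapping = [
    {"Stem": "veget", "Word": "vegetables"},
    {"Stem": "veget", "Word": "vegetable"},
    {"Stem": "run", "Word": "running"},
]

def find_stem(rows, word):
    """Return the stem of the first row whose Word matches the given word."""
    # Filtering plays the role of where(...); picking "Stem" plays the role of column(...).
    matches = [row["Stem"] for row in rows if row["Word"] == word]
    # Taking the first match plays the role of item(0); raises IndexError if absent.
    return matches[0]

print(find_stem(sample_mapping, "vegetables"))  # veget
```

As with item(0) on an empty column, the lookup fails if the word is not in the table, so it only suits words known to be present.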
Question 1.1.2: Find the stem in the dataset with the most words that are shortened to it:
most_stem = vocab_table.group("Stem").sort("count", descending=True).column("Stem").item(0)
most_stem
Explanation: The code above groups vocab_table by stem, which produces a "count" column holding the number of words that share each stem. It then sorts by that count in descending order and selects the first stem, the one with the most words.
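The group-then-sort idea can be sketched in plain Python with collections.Counter. The (stem, word) pairs below are invented for illustration and are not rows from the real vocab_table.

```python
from collections import Counter

# Hypothetical (stem, word) pairs standing in for vocab_table rows.
stem_word_pairs = [
    ("run", "running"), ("run", "runs"), ("run", "runner"),
    ("veget", "vegetables"), ("veget", "vegetable"),
    ("space", "space"),
]

# Count how many words map to each stem, which is what grouping by "Stem" does.
stem_counts = Counter(stem for stem, _ in stem_word_pairs)

# most_common(1) returns the single (stem, count) pair with the highest count.
top_stem, top_count = stem_counts.most_common(1)[0]
print(top_stem, top_count)  # run 3
```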
Question 1.1.3: Find the longest word in the dataset whose stem wasn't shortened:
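tbl_with_lens = vocab_table.with_columns(
    "Word Length", vocab_table.apply(len, "Word"),
    "Stem Length", vocab_table.apply(len, "Stem")
)
tbl_with_dif = tbl_with_lens.with_column(
    "Difference", tbl_with_lens.column("Word Length") - tbl_with_lens.column("Stem Length")
)
# Keep only words whose stem is as long as the word (nothing was cut off).
uncut = tbl_with_dif.where("Difference", 0)
# Keep the longest of those words, then break ties alphabetically from Z to A.
longest = uncut.where("Word Length", max(uncut.column("Word Length")))
longest_uncut = longest.sort("Word", descending=True).column("Word").item(0)
longest_uncut
Explanation: The code above adds columns with the length of the word and the length of the stem, then adds a column with the difference between those lengths. Next, it keeps the rows where the difference is 0 (meaning the word is not shortened), keeps only the words of maximum length, and sorts those alphabetically from Z to A to break ties. Filtering to the maximum length before sorting matters: a final sort on "Word" alone would reorder the whole table alphabetically and return the last word in the alphabet, not the longest one.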
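The same selection rule (longest unshortened word, ties broken from Z to A) can be sketched in plain Python with a tuple sort key. The word and stem pairs below are hypothetical examples, not values from the dataset.

```python
# Hypothetical (word, stem) pairs; not taken from the real dataset.
word_stem_pairs = [
    ("albatross", "albatross"),
    ("batmobile", "batmobile"),
    ("vegetables", "veget"),
    ("cat", "cat"),
]

# Keep only words whose stem has the same length as the word (not shortened).
uncut_words = [word for word, stem in word_stem_pairs if len(word) == len(stem)]

# A (length, word) tuple compares by length first and by the word second,
# so max() picks the longest word and breaks ties from Z to A.
longest_word = max(uncut_words, key=lambda word: (len(word), word))
print(longest_word)  # batmobile
```

Here "albatross" and "batmobile" both have nine letters, and "batmobile" wins the tie because it comes later alphabetically. "vegetables" is longer but is excluded because its stem was shortened.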
Question 1.2.1: Calculate the correlation coefficient for the association between the proportion of words that are "outer" and the proportion of words that are "space" for every movie in the dataset:
outer_su = (outer - np.mean(outer)) / np.std(outer)
space_su = (space - np.mean(space)) / np.std(space)
outer_space_r = np.mean(outer_su * space_su)
outer_space_r
Explanation: The code above calculates the standardized units for "outer" and "space" by subtracting the mean and dividing by the standard deviation element-wise. Then, it calculates the correlation coefficient by taking the mean of the product of the standardized units.
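As a check on the formula, here is a standard-library sketch of the same calculation. It uses statistics.pstdev, the population standard deviation, which matches the default behavior of np.std. The sample proportions are invented, and standard_units and correlation are helper names chosen for this illustration.

```python
from statistics import mean, pstdev

def standard_units(values):
    """Convert values to standard units using the population SD."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def correlation(xs, ys):
    """Return the mean of the products of paired standard units."""
    products = [x * y for x, y in zip(standard_units(xs), standard_units(ys))]
    return mean(products)

# Hypothetical word proportions for four movies (not real data).
outer_sample = [0.0, 0.001, 0.002, 0.003]
space_sample = [0.0, 0.002, 0.004, 0.006]
print(round(correlation(outer_sample, space_sample), 6))  # 1.0
```

The sample values lie exactly on a line with positive slope, so the coefficient comes out as 1; reversing one of the lists would give -1.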
Question 1.3.1: Draw a horizontal bar chart with two bars that show the proportion of Comedy movies in each dataset:
def comedy_proportion(table):
    # Return the proportion of movies in a table that have the Comedy genre.
    return np.count_nonzero(table.column("Genre") == "comedy") / table.num_rows
proportions_table = Table().with_columns(
    "Dataset", ["Training", "Test"],
    "Comedy Proportion", [comedy_proportion(train_movies), comedy_proportion(test_movies)]
)
proportions_table.barh("Dataset", "Comedy Proportion")
Explanation: The comedy_proportion function returns the proportion of Comedy movies in a given table. We then create a new table with dataset names and their corresponding comedy proportions, and plot a horizontal bar chart using this table.
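The proportion calculation itself can be sketched without any table library. The genre lists below are hypothetical stand-ins for the "Genre" columns of train_movies and test_movies, and genre_proportion is an illustrative helper name.

```python
def genre_proportion(genres, target="comedy"):
    """Return the fraction of entries in genres that equal target."""
    return sum(1 for genre in genres if genre == target) / len(genres)

# Hypothetical genre labels; the real values come from train_movies and test_movies.
train_genres = ["comedy", "thriller", "comedy", "comedy"]
test_genres = ["thriller", "comedy"]

print(genre_proportion(train_genres))  # 0.75
print(genre_proportion(test_genres))   # 0.5
```

The comparison is case-sensitive, so a label stored as "Comedy" would not match "comedy". The string in the comparison has to match the spelling used in the data.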