Noah L. Schrick 2023-02-12 21:10:10 -06:00
parent baa5d425bf
commit b24483b7a3

@@ -29,41 +29,44 @@
 "metadata": {},
 "source": [
 "# Problem 14.1\n",
+"An analyst at a subscription-based satellite radio company has been given a sample of data from their customer database, with the goal of finding groups of customers who are associated with one another. The data consist of company data, together with purchased demographic data that are mapped to the company data (see Table 14.13). The analyst decides to apply association rules to learn more about the associations between customers. Comment on this approach.\n",
+"\n",
 "This is a good approach for exploring associative relationships between customers. Since there is company data mixed with demographic data, the association rules can yield better results and demonstrate better associations since purchases can be examined with respect to age, location, number of dependents, and any other demographic data available."
 ]
 },
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": []
-},
 {
 "attachments": {},
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"# Problem 14.3"
+"# Problem 14.3\n",
+"We again consider the data in CourseTopics.csv describing course purchases at Statistics.com (see Problem 14.2 and data sample in Table 14.14). We want to provide a course recommendation to a student who purchased the Regression and Forecast courses. Apply user-based collaborative filtering to the data. You will get a Null matrix. Explain why this happens."
 ]
 },
 {
 "cell_type": "code",
-"execution_count": 3,
+"execution_count": 28,
 "metadata": {},
 "outputs": [
 {
-"ename": "KeyError",
-"evalue": "\"None of [Index(['userID', 'itemID', 'rating'], dtype='object')] are in the [columns]\"",
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"Computing the cosine similarity matrix...\n"
+]
+},
+{
+"ename": "ZeroDivisionError",
+"evalue": "float division",
 "output_type": "error",
 "traceback": [
 "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
-"\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)",
-"Cell \u001b[0;32mIn[3], line 18\u001b[0m\n\u001b[1;32m 15\u001b[0m \u001b[39m# Convert the data set into the format required by the surprise package\u001b[39;00m\n\u001b[1;32m 16\u001b[0m \u001b[39m# The columns must correspond to user id, item id and ratings (in that order)\u001b[39;00m\n\u001b[1;32m 17\u001b[0m reader \u001b[39m=\u001b[39m Reader(rating_scale\u001b[39m=\u001b[39m(\u001b[39m1\u001b[39m, \u001b[39m5\u001b[39m))\n\u001b[0;32m---> 18\u001b[0m data \u001b[39m=\u001b[39m Dataset\u001b[39m.\u001b[39mload_from_df(courses_df[[\u001b[39m'\u001b[39;49m\u001b[39muserID\u001b[39;49m\u001b[39m'\u001b[39;49m, \u001b[39m'\u001b[39;49m\u001b[39mitemID\u001b[39;49m\u001b[39m'\u001b[39;49m, \u001b[39m'\u001b[39;49m\u001b[39mrating\u001b[39;49m\u001b[39m'\u001b[39;49m]], reader)\n\u001b[1;32m 20\u001b[0m \u001b[39m# Split into training and test set\u001b[39;00m\n\u001b[1;32m 21\u001b[0m trainset, testset \u001b[39m=\u001b[39m train_test_split(data, test_size\u001b[39m=\u001b[39m\u001b[39m.25\u001b[39m, random_state\u001b[39m=\u001b[39m\u001b[39m1\u001b[39m)\n",
-"File \u001b[0;32m~/.local/lib/python3.10/site-packages/pandas/core/frame.py:3811\u001b[0m, in \u001b[0;36mDataFrame.__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 3809\u001b[0m \u001b[39mif\u001b[39;00m is_iterator(key):\n\u001b[1;32m 3810\u001b[0m key \u001b[39m=\u001b[39m \u001b[39mlist\u001b[39m(key)\n\u001b[0;32m-> 3811\u001b[0m indexer \u001b[39m=\u001b[39m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mcolumns\u001b[39m.\u001b[39;49m_get_indexer_strict(key, \u001b[39m\"\u001b[39;49m\u001b[39mcolumns\u001b[39;49m\u001b[39m\"\u001b[39;49m)[\u001b[39m1\u001b[39m]\n\u001b[1;32m 3813\u001b[0m \u001b[39m# take() does not accept boolean indexers\u001b[39;00m\n\u001b[1;32m 3814\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39mgetattr\u001b[39m(indexer, \u001b[39m\"\u001b[39m\u001b[39mdtype\u001b[39m\u001b[39m\"\u001b[39m, \u001b[39mNone\u001b[39;00m) \u001b[39m==\u001b[39m \u001b[39mbool\u001b[39m:\n",
-"File \u001b[0;32m~/.local/lib/python3.10/site-packages/pandas/core/indexes/base.py:6113\u001b[0m, in \u001b[0;36mIndex._get_indexer_strict\u001b[0;34m(self, key, axis_name)\u001b[0m\n\u001b[1;32m 6110\u001b[0m \u001b[39melse\u001b[39;00m:\n\u001b[1;32m 6111\u001b[0m keyarr, indexer, new_indexer \u001b[39m=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_reindex_non_unique(keyarr)\n\u001b[0;32m-> 6113\u001b[0m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_raise_if_missing(keyarr, indexer, axis_name)\n\u001b[1;32m 6115\u001b[0m keyarr \u001b[39m=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mtake(indexer)\n\u001b[1;32m 6116\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39misinstance\u001b[39m(key, Index):\n\u001b[1;32m 6117\u001b[0m \u001b[39m# GH 42790 - Preserve name from an Index\u001b[39;00m\n",
-"File \u001b[0;32m~/.local/lib/python3.10/site-packages/pandas/core/indexes/base.py:6173\u001b[0m, in \u001b[0;36mIndex._raise_if_missing\u001b[0;34m(self, key, indexer, axis_name)\u001b[0m\n\u001b[1;32m 6171\u001b[0m \u001b[39mif\u001b[39;00m use_interval_msg:\n\u001b[1;32m 6172\u001b[0m key \u001b[39m=\u001b[39m \u001b[39mlist\u001b[39m(key)\n\u001b[0;32m-> 6173\u001b[0m \u001b[39mraise\u001b[39;00m \u001b[39mKeyError\u001b[39;00m(\u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mNone of [\u001b[39m\u001b[39m{\u001b[39;00mkey\u001b[39m}\u001b[39;00m\u001b[39m] are in the [\u001b[39m\u001b[39m{\u001b[39;00maxis_name\u001b[39m}\u001b[39;00m\u001b[39m]\u001b[39m\u001b[39m\"\u001b[39m)\n\u001b[1;32m 6175\u001b[0m not_found \u001b[39m=\u001b[39m \u001b[39mlist\u001b[39m(ensure_index(key)[missing_mask\u001b[39m.\u001b[39mnonzero()[\u001b[39m0\u001b[39m]]\u001b[39m.\u001b[39munique())\n\u001b[1;32m 6176\u001b[0m \u001b[39mraise\u001b[39;00m \u001b[39mKeyError\u001b[39;00m(\u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39m{\u001b[39;00mnot_found\u001b[39m}\u001b[39;00m\u001b[39m not in index\u001b[39m\u001b[39m\"\u001b[39m)\n",
-"\u001b[0;31mKeyError\u001b[0m: \"None of [Index(['userID', 'itemID', 'rating'], dtype='object')] are in the [columns]\""
+"\u001b[0;31mZeroDivisionError\u001b[0m Traceback (most recent call last)",
+"Cell \u001b[0;32mIn[28], line 16\u001b[0m\n\u001b[1;32m 14\u001b[0m sim_options \u001b[39m=\u001b[39m {\u001b[39m'\u001b[39m\u001b[39mname\u001b[39m\u001b[39m'\u001b[39m: \u001b[39m'\u001b[39m\u001b[39mcosine\u001b[39m\u001b[39m'\u001b[39m, \u001b[39m'\u001b[39m\u001b[39muser_based\u001b[39m\u001b[39m'\u001b[39m: \u001b[39mTrue\u001b[39;00m} \u001b[39m# compute cosine similarities between users\u001b[39;00m\n\u001b[1;32m 15\u001b[0m algo \u001b[39m=\u001b[39m KNNBasic(sim_options\u001b[39m=\u001b[39msim_options)\n\u001b[0;32m---> 16\u001b[0m algo\u001b[39m.\u001b[39;49mfit(trainset)\n\u001b[1;32m 17\u001b[0m \u001b[39m#pred = algo.predict(str(823519), str(30), r_ui=4, verbose=True)\u001b[39;00m\n",
+"File \u001b[0;32m~/.local/lib/python3.10/site-packages/surprise/prediction_algorithms/knns.py:98\u001b[0m, in \u001b[0;36mKNNBasic.fit\u001b[0;34m(self, trainset)\u001b[0m\n\u001b[1;32m 95\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39mfit\u001b[39m(\u001b[39mself\u001b[39m, trainset):\n\u001b[1;32m 97\u001b[0m SymmetricAlgo\u001b[39m.\u001b[39mfit(\u001b[39mself\u001b[39m, trainset)\n\u001b[0;32m---> 98\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39msim \u001b[39m=\u001b[39m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mcompute_similarities()\n\u001b[1;32m 100\u001b[0m \u001b[39mreturn\u001b[39;00m \u001b[39mself\u001b[39m\n",
+"File \u001b[0;32m~/.local/lib/python3.10/site-packages/surprise/prediction_algorithms/algo_base.py:248\u001b[0m, in \u001b[0;36mAlgoBase.compute_similarities\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 246\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39mgetattr\u001b[39m(\u001b[39mself\u001b[39m, \u001b[39m\"\u001b[39m\u001b[39mverbose\u001b[39m\u001b[39m\"\u001b[39m, \u001b[39mFalse\u001b[39;00m):\n\u001b[1;32m 247\u001b[0m \u001b[39mprint\u001b[39m(\u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mComputing the \u001b[39m\u001b[39m{\u001b[39;00mname\u001b[39m}\u001b[39;00m\u001b[39m similarity matrix...\u001b[39m\u001b[39m\"\u001b[39m)\n\u001b[0;32m--> 248\u001b[0m sim \u001b[39m=\u001b[39m construction_func[name](\u001b[39m*\u001b[39;49margs)\n\u001b[1;32m 249\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39mgetattr\u001b[39m(\u001b[39mself\u001b[39m, \u001b[39m\"\u001b[39m\u001b[39mverbose\u001b[39m\u001b[39m\"\u001b[39m, \u001b[39mFalse\u001b[39;00m):\n\u001b[1;32m 250\u001b[0m \u001b[39mprint\u001b[39m(\u001b[39m\"\u001b[39m\u001b[39mDone computing similarity matrix.\u001b[39m\u001b[39m\"\u001b[39m)\n",
+"File \u001b[0;32m~/.local/lib/python3.10/site-packages/surprise/similarities.pyx:83\u001b[0m, in \u001b[0;36msurprise.similarities.cosine\u001b[0;34m()\u001b[0m\n",
+"\u001b[0;31mZeroDivisionError\u001b[0m: float division"
 ]
 }
 ],
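The Problem 14.1 answer above argues that mixing company purchase data with demographic attributes lets association rules relate the two. A minimal sketch of that idea, assuming mlxtend and entirely hypothetical column names (the Table 14.13 data itself is not shown in this diff):

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical stand-in for Table 14.13: purchase flags plus demographics.
customers = pd.DataFrame({
    'Premium':    [1, 0, 1, 1, 0],            # company data: subscription flags
    'Sports':     [1, 1, 0, 1, 0],
    'Region':     ['W', 'E', 'W', 'S', 'E'],  # purchased demographic data
    'Dependents': [0, 2, 1, 0, 3],
})

# One-hot encode the categorical field and binarize the numeric one so that
# every column is a binary "item" the rule miner can use.
basket = pd.get_dummies(customers, columns=['Region'])
basket['HasDependents'] = basket.pop('Dependents') > 0
basket = basket.astype(bool)

# Rules can now mix purchases and demographics in their antecedent and
# consequent sets, e.g. {Region_W, HasDependents} -> {Premium}.
itemsets = apriori(basket, min_support=0.2, use_colnames=True)
rules = association_rules(itemsets, metric='confidence', min_threshold=0.6)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])
```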
@@ -71,13 +74,21 @@
 "## Read in Course Topics data\n",
 "courses_df = pd.read_csv('Coursetopics.csv')\n",
 "\n",
+"# Convert to format usable for surprise similarities\n",
+"courses_df['Index'] = range(1, len(courses_df) + 1)\n",
+"course_melt = courses_df.melt(id_vars =['Index'], value_vars =['Intro', 'DataMining', 'Survey', 'Cat Data', 'Regression', 'Forecast', 'DOE', 'SW'], \n",
+" var_name ='Course', value_name ='Taken')\n",
+"\n",
+"\n",
 "reader = Reader(rating_scale=(0, 1))\n",
-"data = Dataset.load_from_df(courses_df['customerID', 'movieID', 'rating']], reader)\n",
+"data = Dataset.load_from_df(course_melt[['Index', 'Course', 'Taken']], reader)\n",
 "trainset = data.build_full_trainset()\n",
-"sim_options = {'name': 'cosine', 'user_based': True} # compute cosine similarities between items\n",
+"\n",
+"# NOTE: The following will error. This is expected and part of the question. Explanation in the corresponding answer.\n",
+"sim_options = {'name': 'cosine', 'user_based': True} # compute cosine similarities between users\n",
 "algo = KNNBasic(sim_options=sim_options)\n",
 "algo.fit(trainset)\n",
-"pred = algo.predict(str(823519), str(30), r_ui=4, verbose=True)"
+"#pred = algo.predict(str(823519), str(30), r_ui=4, verbose=True)"
 ]
 },
 {
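The melt step added above is what makes surprise's loader applicable: `Dataset.load_from_df` expects long-format (user, item, rating) triples, while Coursetopics.csv is a wide 0/1 matrix with one column per course. A toy illustration of the reshape, using two hypothetical course columns:

```python
import pandas as pd

wide = pd.DataFrame({'Intro': [1, 0], 'Regression': [0, 1]})
wide['Index'] = range(1, len(wide) + 1)  # synthetic user id, as in the commit

# melt() turns each course column into one row per (user, course) pair.
long = wide.melt(id_vars=['Index'], value_vars=['Intro', 'Regression'],
                 var_name='Course', value_name='Taken')
print(long)
#    Index      Course  Taken
# 0      1       Intro      1
# 1      2       Intro      0
# 2      1  Regression      0
# 3      2  Regression      1
```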
@@ -85,7 +96,26 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"# Problem 14.4"
+"The provided dataset is composed of Boolean values for \"have taken\" or \"have not taken\" various courses. The dataset represents \"have not taken\" with a 0, and \"taken\" with a 1. The dataset is considered a sparse matrix, since each user has only taken a few of the listed courses. Due to the sparsity, when computing the cosine between users, many computations involve comparing a user's \"not taken\" course to another user's \"not taken\" course. This leads to difficulties with the cosine computation since the denominator will be zero, causing a float division error. This can be remedied by using \"NULL\" values, which are supported in the surprise package."
+]
+},
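The zero denominator described in that answer is easy to reproduce outside surprise. A minimal sketch using the textbook cosine formula (surprise's Cython implementation differs in detail, but it fails the same way when a user's vector of commonly rated items is all zeros):

```python
def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return num / den  # denominator is 0.0 if either vector is all zeros

print(cosine([1, 0, 1], [1, 1, 0]))  # ~0.5: both users took some courses
print(cosine([0, 0, 0], [1, 1, 0]))  # ZeroDivisionError: float division
```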
+{
+"attachments": {},
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"# Problem 14.4\n",
+"The data shown in Table 14.15 and the output in Table 14.16 are based on a subset of a dataset on cosmetic purchases (Cosmetics.csv) at a large chain drugstore. The store wants to analyze associations among purchases of these items for purposes of point-of-sale display, guidance to sales personnel in promoting cross-sales, and guidance for piloting an eventual time-of-purchase electronic recommender\n",
+"system to boost cross-sales. Consider first only the data shown in Table 14.15, given in binary matrix form.\n",
+" a. Select several values in the matrix and explain their meaning.\n",
+" b. Consider the results of the association rules analysis shown in Table 14.16.\n",
+" i. For the first row, explain the “confidence” output and how it is calculated.\n",
+" ii. For the first row, explain the “support” output and how it is calculated.\n",
+" iii. For the first row, explain the “lift” and how it is calculated.\n",
+" iv. For the first row, explain the rule that is represented there in words.\n",
+" c. Now, use the complete dataset on the cosmetics purchases (in the file Cosmetics.csv). Using Python, apply association rules to these data (for apriori use min_support=0.1 and use_colnames=True, for association_rules use default parameters).\n",
+" i. Interpret the first three rules in the output in words.\n",
+" ii. Reviewing the first couple of dozen rules, comment on their redundancy and how you would assess their utility."
 ]
 }
 ],
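For part (c) of Problem 14.4, a sketch using the parameters the problem specifies (min_support=0.1 and use_colnames=True for apriori, defaults for association_rules), assuming mlxtend and assuming Cosmetics.csv contains only the 0/1 item columns shown in Table 14.15:

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# If the file carries a transaction-ID column, drop it before this step.
cosmetics_df = pd.read_csv('Cosmetics.csv').astype(bool)

itemsets = apriori(cosmetics_df, min_support=0.1, use_colnames=True)
rules = association_rules(itemsets)  # defaults: metric='confidence', min_threshold=0.8

# Quantities reported for each rule A -> B (parts b.i through b.iii):
#   support    = P(A and B)                fraction of baskets containing both
#   confidence = P(B | A) = P(A and B) / P(A)
#   lift       = confidence / P(B)         > 1 means A and B co-occur more
#                                          often than independence predicts
print(rules[['antecedents', 'consequents',
             'support', 'confidence', 'lift']].head(3))
```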