QM-7063-Learning-Practice-6/Schrick-Noah_Learning-Practice-6.ipynb
2023-03-07 16:29:07 -06:00

{
"cells": [
{
"cell_type": "code",
"execution_count": 190,
"metadata": {},
"outputs": [],
"source": [
"# Learning Practice 6 for the University of Tulsa's QM-7063 Data Mining Course\n",
"# Logistic Regression for Classification\n",
"# # Professor: Dr. Abdulrashid, Spring 2023\n",
"# Noah L. Schrick - 1492657\n",
"\n",
"%matplotlib inline\n",
"\n",
"from pathlib import Path\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.linear_model import LogisticRegression, LogisticRegressionCV\n",
"from sklearn.linear_model import LinearRegression, Lasso, Ridge, LassoCV, BayesianRidge\n",
"from dmba import stepwise_selection\n",
"from dmba import regressionSummary\n",
"from sklearn.model_selection import train_test_split\n",
"import statsmodels.api as sm\n",
"from pandas.plotting import scatter_matrix\n",
"import seaborn as sns\n",
"from dmba.metric import AIC_score"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Problem 10.3\n",
"\n",
"A company that manufactures riding mowers wants to identify the best sales prospects for an intensive sales campaign. In particular, the manufacturer is interested in classifying households as prospective owners or nonowners on the basis of Income (in $1000s) and Lot Size (in 1000 ft2). The marketing expert looked at a random sample of 24 households, given in the file RidingMowers.csv. \n",
"\n",
"Use all the data to fit a logistic regression of ownership on the two predictors.\n",
"\n",
"a. What percentage of households in the study were owners of a riding mower? \n",
"b. Create a scatter plot of Income vs. Lot Size using color or symbol to distinguish owners from nonowners. From the scatter plot, which class seems to have a higher average income, owners or nonowners? \n",
"c. Among nonowners, what is the percentage of households classified correctly? \n",
"d. To increase the percentage of correctly classified nonowners, should the cutoff probability be increased or decreased? \n",
"e. What are the odds that a household with a $60K income and a lot size of 20,000ft2 is an owner? \n",
"f. What is the classification of a household with a $60K income and a lot size of 20,000 ft2? Use cutoff = 0.5. \n",
"g. What is the minimum income that a household with 16,000 ft2 lot size should have before it is classified as an owner? "
]
},
{
"cell_type": "code",
"execution_count": 191,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Owner 50.0\n",
"Nonowner 50.0\n",
"Name: Ownership, dtype: float64\n"
]
}
],
"source": [
"mowers_df = pd.read_csv('RidingMowers.csv')\n",
"\n",
"# a\n",
"owner_pctg = mowers_df['Ownership'].value_counts(normalize=True) * 100\n",
"print(owner_pctg)\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Ownership\n",
"Nonowner 57.400\n",
"Owner 79.475\n",
"Name: Income, dtype: float64\n"
]
},
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# b \n",
"mowers_df.plot.scatter(x='Lot_Size', y='Income', legend=True)\n",
"owner_inc = mowers_df.groupby('Ownership')['Income'].mean()\n",
"print(owner_inc)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Classified Correctly: 80.0 %\n",
" actual p(0) p(1) predicted\n",
"13 Nonowner 0.418583 0.581417 Owner\n",
"14 Nonowner 0.838644 0.161356 Nonowner\n",
"17 Nonowner 0.936463 0.063537 Nonowner\n",
"18 Nonowner 0.958456 0.041544 Nonowner\n",
"20 Nonowner 0.979416 0.020584 Nonowner\n"
]
}
],
"source": [
"# c\n",
"predictors = ['Lot_Size', 'Income']\n",
"outcome = 'Ownership'\n",
"\n",
"X = pd.get_dummies(mowers_df[predictors], drop_first=True)\n",
"y = mowers_df[outcome]\n",
"classes = ['Owner', 'Nonowner']\n",
"\n",
"# split into training and validation\n",
"train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size=0.25, \n",
" random_state=1)\n",
"\n",
"logit_full = LogisticRegression(penalty=\"l2\", C=1e42, solver='liblinear')\n",
"logit_full.fit(train_X, train_y)\n",
"\n",
"logit_reg_pred = logit_full.predict_proba(valid_X)\n",
"full_result = pd.DataFrame({'actual': valid_y, \n",
" 'p(0)': [p[0] for p in logit_reg_pred],\n",
" 'p(1)': [p[1] for p in logit_reg_pred],\n",
" 'predicted': logit_full.predict(valid_X)})\n",
"full_result = full_result.sort_values(by=['p(1)'], ascending=False)\n",
"\n",
"subset_df = full_result.loc[full_result['actual'] == 'Nonowner']\n",
"\n",
"num_corr = 0\n",
"total = 0\n",
"for index, row in subset_df.iterrows(): \n",
" if (row['actual'] == row['predicted']):\n",
" num_corr += 1\n",
" total += 1\n",
" else:\n",
" total += 1\n",
"\n",
"print(\"Classified Correctly:\", num_corr/total*100.00, \"%\")\n",
"print(subset_df)\n",
"\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# d\n",
"Cutoff percentage should be decreased."
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Odds of event: 1.7719334017055501\n"
]
}
],
"source": [
"# e\n",
"data = [[20, 60]]\n",
"pred = pd.DataFrame(data, columns=['Lot_Size', 'Income'])\n",
"\n",
"logit_reg_pred_s = logit_full.predict_proba(pred)\n",
"p0 = [p[0] for p in logit_reg_pred_s]\n",
"p1 = [p[1] for p in logit_reg_pred_s]\n",
"full_result = pd.DataFrame({'p(0)': p0,\n",
" 'p(1)': p1,\n",
" 'predicted': logit_full.predict(pred)})\n",
"print(\"Odds of event:\", np.exp(p1[0]))\n"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Nonowner\n"
]
}
],
"source": [
"# f\n",
"print(full_result)\n"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"94.9000000000068\n"
]
}
],
"source": [
"# g. What is the minimum income that a household with 16,000 ft2 lot size should have before it is classified as an owner? \n",
"init = 60\n",
"while(True):\n",
" data = [[16, init]]\n",
" pred = pd.DataFrame(data, columns=['Lot_Size', 'Income'])\n",
"\n",
" logit_reg_pred_s = logit_full.predict_proba(pred)\n",
" p0 = [p[0] for p in logit_reg_pred_s]\n",
" p1 = [p[1] for p in logit_reg_pred_s]\n",
" full_result = pd.DataFrame({'p(0)': p0,\n",
" 'p(1)': p1,\n",
" 'predicted': logit_full.predict(pred)})\n",
" if(full_result['predicted'][0] == 'Nonowner'):\n",
" init = init + 0.025\n",
" else:\n",
" print(init)\n",
" break\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Problem 10.4\n",
"\n",
"The file eBayAuctions.csv contains information on 1972 auctions transacted on eBay.com during MayJune 2004. The goal is to use these data to build a model that will distinguish competitive auctions from non-competitive ones. A competitive auction is defined as an auction with at least two bids placed on the item being auctioned. The data include variables that describe the item (auction category), the seller (his or her eBay rating), and the auction terms that the seller selected (auction duration, opening price, currency, day of week of auction close). In addition, we have the price at which the auction closed. The goal is to predict whether or not an auction of interest will be competitive.\n",
"\n",
"Data preprocessing. Create dummy variables for the categorical predictors.\n",
"These include Category (18 categories), Currency (USD, GBP, Euro), EndDay\n",
"(MondaySunday), and Duration (1, 3, 5, 7, or 10 days).\n",
"\n",
"a. Create pivot tables for the mean of the binary outcome (Competitive?) as a function of the various categorical variables (use the original variables, not the dummies). Use the information in the tables to reduce the number of dummies that will be used in the model. For example, categories that appear most similar with respect to the distribution of competitive auctions could be combined. \n",
"b. Split the data into training (60%) and validation (40%) datasets. Run a logistic model with all predictors with a cutoff of 0.5. \n",
"c. If we want to predict at the start of an auction whether it will be competitive, we cannot use the information on the closing price. Run a logistic model with all predictors as above, excluding price. How does this model compare to the full model with respect to predictive accuracy? \n",
"d. Interpret the meaning of the coefficient for closing price. Does closing price have a practical significance? Is it statistically significant for predicting competitiveness of auctions? (Use a 10% significance level.) \n",
"e. Use stepwise regression as described in Section 6.4 to find the model with the best fit to the training data (highest accuracy). Which predictors are used? \n",
"f. Use stepwise regression to find the model with the highest accuracy on the validation data. Which predictors are used? \n",
"g. What is the danger of using the best predictive model that you found? \n",
"h. Explain how and why the best-fitting model and the best predictive models are the same or different. \n",
"i. Use regularized logistic regression with L1 penalty on the training data. Compare its selected predictors and classification performance to the best-fitting and best predictive models. \n",
"j. If the major objective is accurate classification, what cutoff value should be used? \n",
"k. Based on these data, what auction settings set by the seller (duration, opening price, ending day, currency) would you recommend as being most likely to lead to a competitive auction. "
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/tmp/ipykernel_27606/691198861.py:10: FutureWarning: The `inplace` parameter in pandas.Categorical.rename_categories is deprecated and will be removed in a future version. Removing unused categories will always return a new Categorical object.\n",
" auction_df.currency.cat.rename_categories(new_categories, inplace=True)\n"
]
}
],
"source": [
"# Pre-processing\n",
"orig_auction_df = pd.read_csv('eBayAuctions.csv')\n",
"auction_df = pd.read_csv('eBayAuctions.csv')\n",
"auction_df.columns = [c.replace(' ', '_') for c in auction_df.columns]\n",
"\n",
"auction_df['Duration'] = auction_df['Duration'].astype('category')\n",
"\n",
"auction_df['currency'] = auction_df['currency'].astype('category')\n",
"new_categories = {1: 'USD', 2: 'GBP', 3: 'Euro'}\n",
"auction_df.currency.cat.rename_categories(new_categories, inplace=True)\n",
"auction_df = pd.get_dummies(auction_df, prefix_sep='_', drop_first=True)\n",
"\n",
"category_cols = [col for col in auction_df.columns if 'Category_' in col]\n",
"endDay_cols = [col for col in auction_df.columns if 'endDay_' in col]\n",
"\n",
"for col in category_cols:\n",
" auction_df[col] = auction_df[col].astype('category')\n",
"\n",
"for col in endDay_cols:\n",
" auction_df[col] = auction_df[col].astype('category')\n"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Competitive?\n",
"Duration \n",
"1 0.521739\n",
"3 0.450704\n",
"5 0.686695\n",
"7 0.489142\n",
"10 0.544554\n",
" Competitive?\n",
"currency \n",
"EUR 0.551595\n",
"GBP 0.687075\n",
"US 0.519350\n",
" Competitive?\n",
"endDay_Mon \n",
"0 0.489466\n",
"1 0.673358\n",
" Competitive?\n",
"endDay_Sat \n",
"0 0.565083\n",
"1 0.427350\n",
" Competitive?\n",
"endDay_Sun \n",
"0 0.552020\n",
"1 0.485207\n",
" Competitive?\n",
"endDay_Thu \n",
"0 0.533333\n",
"1 0.603960\n",
" Competitive?\n",
"endDay_Tue \n",
"0 0.541366\n",
"1 0.532164\n",
" Competitive?\n",
"endDay_Wed \n",
"0 0.542963\n",
"1 0.480000\n",
" Competitive?\n",
"Category_Automotive \n",
"0 0.559086\n",
"1 0.353933\n",
" Competitive?\n",
"Category_Books \n",
"0 0.54171\n",
"1 0.50000\n",
" Competitive?\n",
"Category_Business/Industrial \n",
"0 0.539406\n",
"1 0.666667\n",
" Competitive?\n",
"Category_Clothing/Accessories \n",
"0 0.542903\n",
"1 0.504202\n",
" Competitive?\n",
"Category_Coins/Stamps \n",
"0 0.545220\n",
"1 0.297297\n",
" Competitive?\n",
"Category_Collectibles \n",
"0 0.535488\n",
"1 0.577406\n",
" Competitive?\n",
"Category_Computer \n",
"0 0.538223\n",
"1 0.666667\n",
" Competitive?\n",
"Category_Electronics \n",
"0 0.533125\n",
"1 0.800000\n",
" Competitive?\n",
"Category_EverythingElse \n",
"0 0.543223\n",
"1 0.235294\n",
" Competitive?\n",
"Category_Health/Beauty \n",
"0 0.552935\n",
"1 0.171875\n",
" Competitive?\n",
"Category_Home/Garden \n",
"0 0.534225\n",
"1 0.656863\n",
" Competitive?\n",
"Category_Jewelry \n",
"0 0.548148\n",
"1 0.365854\n",
" Competitive?\n",
"Category_Music/Movie/Game \n",
"0 0.524538\n",
"1 0.602978\n",
" Competitive?\n",
"Category_Photography \n",
"0 0.538540\n",
"1 0.846154\n",
" Competitive?\n",
"Category_Pottery/Glass \n",
"0 0.54252\n",
"1 0.35000\n",
" Competitive?\n",
"Category_SportingGoods \n",
"0 0.528139\n",
"1 0.725806\n",
" Competitive?\n",
"Category_Toys/Hobbies \n",
"0 0.542002\n",
"1 0.529915\n"
]
}
],
"source": [
"# a\n",
"dur_pivot = orig_auction_df.pivot_table(index =['Duration'],\n",
" values =['Competitive?'],\n",
" aggfunc ='mean')\n",
"print(dur_pivot)\n",
"\n",
"cur_pivot = orig_auction_df.pivot_table(index =['currency'],\n",
" values =['Competitive?'],\n",
" aggfunc ='mean')\n",
"print(cur_pivot)\n",
"\n",
"for col in endDay_cols:\n",
" date_pivot = auction_df.pivot_table(index = [col],\n",
" values =['Competitive?'],\n",
" aggfunc ='mean')\n",
" print(date_pivot)\n",
"\n",
"for col in category_cols:\n",
" cat_pivot = auction_df.pivot_table(index = [col],\n",
" values =['Competitive?'],\n",
" aggfunc ='mean')\n",
" print(cat_pivot)\n"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 1200x700 with 25 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt=scatter_matrix(orig_auction_df,diagonal='kde',figsize=(12,7))\n",
"\n",
"# Combine open and close price"
]
},
{
"cell_type": "code",
"execution_count": 167,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" actual p(0) p(1) predicted\n",
"480 1 0.000000 1.000000 1\n",
"512 1 0.000000 1.000000 1\n",
"1664 1 0.000000 1.000000 1\n",
"1704 1 0.000000 1.000000 1\n",
"1963 1 0.000000 1.000000 1\n",
"... ... ... ... ...\n",
"1863 0 0.963370 0.036630 0\n",
"1960 0 0.979843 0.020157 0\n",
"1955 0 0.995957 0.004043 0\n",
"1952 0 0.996900 0.003100 0\n",
"1967 0 0.998912 0.001088 0\n",
"\n",
"[789 rows x 4 columns]\n",
"Classified Correctly: 76.1723700887199 %\n"
]
}
],
"source": [
"# b\n",
"outcome = auction_df['Competitive?']\n",
"predictors = auction_df.drop('Competitive?',axis=1)\n",
"X = auction_df.drop(columns=['Competitive?'])\n",
"\n",
"\n",
"df_dummies=pd.get_dummies(predictors,drop_first=True)\n",
"#df_dummies.insert(0,'Intercept',[1]*len(df_dummies))\n",
"\n",
"train_X,valid_X,train_y,valid_y=train_test_split(df_dummies,outcome,test_size=0.40, random_state=1)\n",
"train_X_p = train_X\n",
"valid_X_p = valid_X\n",
"\n",
"\n",
"logit_full_p = LogisticRegression(penalty=\"l2\", C=1e42, solver='liblinear')\n",
"logit_full_p.fit(train_X, train_y)\n",
"\n",
"logit_reg_pred_p = logit_full_p.predict_proba(valid_X)\n",
"full_result_p = pd.DataFrame({'actual': valid_y, \n",
" 'p(0)': [p[0] for p in logit_reg_pred_p],\n",
" 'p(1)': [p[1] for p in logit_reg_pred_p],\n",
" 'predicted': logit_full_p.predict(valid_X)})\n",
"full_result_p = full_result_p.sort_values(by=['p(1)'], ascending=False)\n",
"print(full_result_p)\n",
"\n",
"num_corr = 0\n",
"total = 0\n",
"for index, row in full_result_p.iterrows(): \n",
" if (row['actual'] == row['predicted']):\n",
" num_corr += 1\n",
" total += 1\n",
" else:\n",
" total += 1\n",
"\n",
"inc_price_pctg = num_corr/total*100.00\n",
"\n",
"print(\"Classified Correctly:\", inc_price_pctg, \"%\")"
]
},
{
"cell_type": "code",
"execution_count": 130,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" actual p(0) p(1) predicted\n",
"1772 0 0.030589 0.969411 1\n",
"852 1 0.083026 0.916974 1\n",
"955 1 0.096801 0.903199 1\n",
"1836 1 0.097049 0.902951 1\n",
"1622 1 0.099306 0.900694 1\n",
"... ... ... ... ...\n",
"1081 1 0.910252 0.089748 0\n",
"348 0 0.910385 0.089615 0\n",
"1237 1 0.910617 0.089383 0\n",
"1955 0 0.940586 0.059414 0\n",
"1952 0 0.963785 0.036215 0\n",
"\n",
"[789 rows x 4 columns]\n",
"Classified Correctly: 63.37135614702155 %\n",
"When not including close price, the model is 1.202 times worse\n"
]
}
],
"source": [
"# c\n",
"new_predictors = predictors.drop('ClosePrice',axis=1)\n",
"\n",
"df_dummies=pd.get_dummies(new_predictors,drop_first=True)\n",
"df_dummies.insert(0,'Intercept',[1]*len(df_dummies))\n",
"\n",
"train_X,valid_X,train_y,valid_y=train_test_split(df_dummies,outcome,test_size=0.40, random_state=1)\n",
"\n",
"logit_full = LogisticRegression(penalty=\"l2\", C=1e42, solver='liblinear')\n",
"logit_full.fit(train_X, train_y)\n",
"\n",
"logit_reg_pred = logit_full.predict_proba(valid_X)\n",
"full_result = pd.DataFrame({'actual': valid_y, \n",
" 'p(0)': [p[0] for p in logit_reg_pred],\n",
" 'p(1)': [p[1] for p in logit_reg_pred],\n",
" 'predicted': logit_full.predict(valid_X)})\n",
"full_result = full_result.sort_values(by=['p(1)'], ascending=False)\n",
"print(full_result)\n",
"\n",
"num_corr = 0\n",
"total = 0\n",
"for index, row in full_result.iterrows(): \n",
" if (row['actual'] == row['predicted']):\n",
" num_corr += 1\n",
" total += 1\n",
" else:\n",
" total += 1\n",
"\n",
"not_inc_price_pctg = num_corr/total*100.00\n",
"print(\"Classified Correctly:\", not_inc_price_pctg, \"%\")\n",
" \n",
"\n",
"print(\"When not including close price, the model is\", inc_price_pctg/not_inc_price_pctg, \"times worse\")"
]
},
{
"cell_type": "code",
"execution_count": 132,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"intercept -0.36315695612599685\n",
" sellerRating ClosePrice OpenPrice Category_Automotive \\\n",
"coeff -0.000046 0.088855 -0.105865 1.758587 \n",
"\n",
" Category_Books Category_Business/Industrial \\\n",
"coeff 0.557255 -0.08761 \n",
"\n",
" Category_Clothing/Accessories Category_Coins/Stamps \\\n",
"coeff 0.323714 -0.033867 \n",
"\n",
" Category_Collectibles Category_Computer ... Duration_3 Duration_5 \\\n",
"coeff 0.171399 -0.609743 ... 1.256207 -0.108202 \n",
"\n",
" Duration_7 Duration_10 endDay_Mon endDay_Sat endDay_Sun \\\n",
"coeff -0.186949 0.315695 0.280735 -0.612956 -0.468657 \n",
"\n",
" endDay_Thu endDay_Tue endDay_Wed \n",
"coeff -0.56343 -0.198906 -0.712514 \n",
"\n",
"[1 rows x 32 columns]\n"
]
}
],
"source": [
"# d\n",
"print('intercept ', logit_full_p.intercept_[0])\n",
"print(pd.DataFrame({'coeff': logit_full_p.coef_[0]}, index=X.columns).transpose())\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Closing Price\n",
"The coefficient of closing price indicates that it has a positive effect on competitiveness. The coefficient is 0.089, which is considered statistically significant when using a p-value of 0.1."
]
},
{
"cell_type": "code",
"execution_count": 173,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Variables: sellerRating, ClosePrice, OpenPrice, currency_GBP, currency_US, Duration_3, Duration_5, Duration_7, Duration_10, Category_Automotive_1, Category_Books_1, Category_Business/Industrial_1, Category_Clothing/Accessories_1, Category_Coins/Stamps_1, Category_Collectibles_1, Category_Computer_1, Category_Electronics_1, Category_EverythingElse_1, Category_Health/Beauty_1, Category_Home/Garden_1, Category_Jewelry_1, Category_Music/Movie/Game_1, Category_Photography_1, Category_Pottery/Glass_1, Category_SportingGoods_1, Category_Toys/Hobbies_1, endDay_Mon_1, endDay_Sat_1, endDay_Sun_1, endDay_Thu_1, endDay_Tue_1, endDay_Wed_1\n",
"Start: score=1716.20, constant\n",
"Step: score=1676.05, add endDay_Mon_1\n",
"Step: score=1645.10, add ClosePrice\n",
"Step: score=1599.18, add OpenPrice\n",
"Step: score=1571.92, add Category_Health/Beauty_1\n",
"Step: score=1551.14, add currency_GBP\n",
"Step: score=1536.20, add Category_Coins/Stamps_1\n",
"Step: score=1524.50, add Category_Automotive_1\n",
"Step: score=1519.89, add Duration_5\n",
"Step: score=1515.38, add sellerRating\n",
"Step: score=1511.82, add Category_Clothing/Accessories_1\n",
"Step: score=1507.95, add Category_EverythingElse_1\n",
"Step: score=1505.33, add Category_Jewelry_1\n",
"Step: score=1503.52, add Category_Business/Industrial_1\n",
"Step: score=1501.89, add Category_SportingGoods_1\n",
"Step: score=1500.47, add Category_Pottery/Glass_1\n",
"Step: score=1500.47, unchanged None\n",
"['endDay_Mon_1', 'ClosePrice', 'OpenPrice', 'Category_Health/Beauty_1', 'currency_GBP', 'Category_Coins/Stamps_1', 'Category_Automotive_1', 'Duration_5', 'sellerRating', 'Category_Clothing/Accessories_1', 'Category_EverythingElse_1', 'Category_Jewelry_1', 'Category_Business/Industrial_1', 'Category_SportingGoods_1', 'Category_Pottery/Glass_1']\n",
"LinearRegression()\n"
]
}
],
"source": [
"# e\n",
"def train_model(variables):\n",
" if len(variables) == 0:\n",
" return None\n",
" model = LinearRegression()\n",
" model.fit(train_X[variables], train_y)\n",
" return model\n",
"\n",
"def score_model(model, variables):\n",
" if len(variables) == 0:\n",
" return AIC_score(train_y, [train_y.mean()] * len(train_y), model, df=1)\n",
" return AIC_score(train_y, model.predict(train_X[variables]), model)\n",
"\n",
"best_step_model, best_step_variables = stepwise_selection(train_X_p.columns, train_model, score_model, verbose=True)\n",
"print(best_step_variables)\n"
]
},
{
"cell_type": "code",
"execution_count": 175,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"LASSO\n",
"\n",
"Regression statistics\n",
"\n",
" Mean Error (ME) : 0.0219\n",
"Root Mean Squared Error (RMSE) : 0.4804\n",
" Mean Absolute Error (MAE) : 0.4766\n",
"\n",
"\n",
"LASSO CV\n",
"\n",
"Regression statistics\n",
"\n",
" Mean Error (ME) : 0.0218\n",
"Root Mean Squared Error (RMSE) : 0.4813\n",
" Mean Absolute Error (MAE) : 0.4776\n",
"Lasso-CV chosen regularization: 1.242215531068193\n"
]
}
],
"source": [
"print(\"LASSO\")\n",
"lasso = Lasso(alpha=1)\n",
"lasso.fit(train_X, train_y)\n",
"regressionSummary(valid_y, lasso.predict(valid_X))\n",
"print(\"\\n\")\n",
"\n",
"print(\"LASSO CV\")\n",
"lasso_cv = LassoCV(cv=5)\n",
"lasso_cv.fit(train_X, train_y)\n",
"regressionSummary(valid_y, lasso_cv.predict(valid_X))\n",
"print('Lasso-CV chosen regularization: ', lasso_cv.alpha_)\n"
]
},
{
"cell_type": "code",
"execution_count": 176,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"RIDGE\n",
"\n",
"Regression statistics\n",
"\n",
" Mean Error (ME) : 0.0172\n",
"Root Mean Squared Error (RMSE) : 0.4623\n",
" Mean Absolute Error (MAE) : 0.4303\n",
"\n",
"\n",
"BAYESIAN RIDGE\n",
"\n",
"Regression statistics\n",
"\n",
" Mean Error (ME) : 0.0179\n",
"Root Mean Squared Error (RMSE) : 0.4607\n",
" Mean Absolute Error (MAE) : 0.4367\n",
"Bayesian ridge chosen regularization: 16.53562606806346\n",
"\n",
"\n"
]
}
],
"source": [
"# f\n",
"print(\"RIDGE\")\n",
"ridge = Ridge(alpha=1)\n",
"ridge.fit(train_X, train_y)\n",
"regressionSummary(valid_y, ridge.predict(valid_X))\n",
"print(\"\\n\")\n",
"\n",
"print(\"BAYESIAN RIDGE\")\n",
"bayesianRidge = BayesianRidge()\n",
"bayesianRidge.fit(train_X, train_y)\n",
"regressionSummary(valid_y, bayesianRidge.predict(valid_X))\n",
"print('Bayesian ridge chosen regularization: ', bayesianRidge.lambda_ / bayesianRidge.alpha_)\n",
"print(\"\\n\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Best Model\n",
"RIDGE: Lowest ME (0.0172), lowest MAE (0.4303), second lowest RMSE (0.4623)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# g\n",
"The biggest concern with using Bayesian Ridge Regression is that the underlying model assumes a linear relationship. This linear relationship is not able to capture the logistic regression fit and accurately map all outcomes, as indicated by the high MAE and RMSE."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# h\n",
"The best-fitting models and the best predictive models can often differ due to many factors. A model that fits very well to the training data may be overfitted, leading to poor results when predicting future, unknown data. The best predictive model on the test data set may be too simplistic, and fail to properly represent data with abnormal or unique behavior unseen from the model found in the training set. Various errors are a good indicator of where a best-fit model may differ from the best predictive model."
]
},
{
"cell_type": "code",
"execution_count": 178,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" actual p(0) p(1) predicted\n",
"480 1 0.000000 1.000000 1\n",
"1661 1 0.000000 1.000000 1\n",
"1962 1 0.000000 1.000000 1\n",
"1704 1 0.000000 1.000000 1\n",
"1664 1 0.000000 1.000000 1\n",
"... ... ... ... ...\n",
"1863 0 0.962442 0.037558 0\n",
"1960 0 0.978774 0.021226 0\n",
"1955 0 0.995925 0.004075 0\n",
"1952 0 0.996922 0.003078 0\n",
"1967 0 0.998845 0.001155 0\n",
"\n",
"[789 rows x 4 columns]\n",
"Classified Correctly: 75.66539923954373 %\n"
]
}
],
"source": [
"# i\n",
"logit_full_1 = LogisticRegression(penalty=\"l1\", C=1e42, solver='liblinear')\n",
"logit_full_1.fit(train_X, train_y)\n",
"\n",
"logit_reg_pred_1 = logit_full_1.predict_proba(valid_X)\n",
"full_result_1 = pd.DataFrame({'actual': valid_y, \n",
" 'p(0)': [p[0] for p in logit_reg_pred_1],\n",
" 'p(1)': [p[1] for p in logit_reg_pred_1],\n",
" 'predicted': logit_full_1.predict(valid_X)})\n",
"full_result_1 = full_result_1.sort_values(by=['p(1)'], ascending=False)\n",
"print(full_result_1)\n",
"\n",
"num_corr = 0\n",
"total = 0\n",
"for index, row in full_result_1.iterrows(): \n",
" if (row['actual'] == row['predicted']):\n",
" num_corr += 1\n",
" total += 1\n",
" else:\n",
" total += 1\n",
"\n",
"pctg_1 = num_corr/total*100.00\n",
"print(\"Classified Correctly:\", pctg_1, \"%\")"
]
},
{
"cell_type": "code",
"execution_count": 189,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<AxesSubplot: xlabel='ClosePrice', ylabel='Competitive?'>"
]
},
"execution_count": 189,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# j\n",
"sns.regplot(x='ClosePrice', y='Competitive?', data=auction_df, logistic=True)\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# j\n",
"This plot alone does not give much insight into a good cutoff value. The logistic regression model is multi-variate, and many variables have differing coefficients. Using PCA, plotting more variables, and varying cutoff values to obtain error rates are necessary to experimentally find a good cutoff value. Using the default of 0.5 suffices for this problem, since the error rates are not abnormally high. Adjusting the cutoff value will alter both the true negative and false positive error rates."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# k\n",
"An auction that lasts 10 days contributes most strongly to a competitive auction. The ending day has multiple candidates that all negatively contribute to a competitive auction."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}