{"id":11451,"date":"2018-03-06T08:00:43","date_gmt":"2018-03-06T13:00:43","guid":{"rendered":"http:\/\/pzd.hmy.temporary.site\/?p=11451"},"modified":"2018-03-05T20:06:01","modified_gmt":"2018-03-06T01:06:01","slug":"machine-learning-with-python-scikit-learn-part-1","status":"publish","type":"post","link":"https:\/\/datascienceplus.com\/machine-learning-with-python-scikit-learn-part-1\/","title":{"rendered":"Machine Learning with Python scikit-learn; Part 1"},"content":{"rendered":"<p>Previously, I have written a blog post on <a href=\"https:\/\/datascienceplus.com\/machine-learning-with-r-caret-part-1\/\">machine learning with R<\/a> by Caret package. In this post, I will use the <code>scikit-learn<\/code> library in Python. As we did in the R post, we will predict power output given a set of environmental readings from various sensors in a natural gas-fired power generation plant.<\/p>\n<p>The real-world data we are using in this post consists of <a href=\"https:\/\/archive.ics.uci.edu\/ml\/datasets\/Combined+Cycle+Power+Plant\" rel=\"noopener\" target=\"_blank\">9,568 data points<\/a>, each with 4 environmental attributes collected from a Combined Cycle Power Plant over 6 years (2006-2011), and is provided by the University of California, Irvine at UCI Machine Learning Repository Combined Cycle Power Plant Data Set. 
You can find more details about the dataset on the UCI page.<\/p>\n<p>Import the libraries needed for Extract-Transform-Load (ETL) and Exploratory Data Analysis (EDA):<\/p>\n<pre>\r\nimport pandas as pd\r\nimport seaborn as sns\r\nimport statsmodels.api as sm\r\nimport matplotlib.pyplot as plt\r\n<\/pre>\n<h3>Load Data<\/h3>\n<pre>power_plant = pd.read_excel(\"Folds5x2_pp.xlsx\")<\/pre>\n<h2>Exploratory Data Analysis (EDA)<\/h2>\n<p>This is a step we should always perform before trying to fit a model, as it often leads to important insights about the data.<\/p>\n<pre>type(power_plant)\r\n<em>pandas.core.frame.DataFrame<\/em><\/pre>\n<p>See the first few rows:<\/p>\n<pre>\r\npower_plant.head()\r\n<em>AT\tV\tAP\tRH\tPE\r\n0\t14.96\t41.76\t1024.07\t73.17\t463.26\r\n1\t25.18\t62.96\t1020.04\t59.08\t444.37\r\n2\t5.11\t39.40\t1012.16\t92.14\t488.56\r\n3\t20.86\t57.32\t1010.24\t76.64\t446.48\r\n4\t10.82\t37.50\t1009.23\t96.62\t473.90<\/em><\/pre>\n<p>The columns in the DataFrame are:<\/p>\n<ul>\n<li>AT = Atmospheric Temperature in &deg;C<\/li>\n<li>V = Exhaust Vacuum Speed<\/li>\n<li>AP = Atmospheric Pressure<\/li>\n<li>RH = Relative Humidity<\/li>\n<li>PE = Power Output<\/li>\n<\/ul>\n<p>Power Output is the value we are trying to predict given the measurements above.<\/p>\n<p>Size of the DataFrame:<\/p>\n<pre>\r\npower_plant.shape\r\n<em>(9568, 5)<\/em><\/pre>\n<p>Data type of each column in the DataFrame:<\/p>\n<pre>\r\npower_plant.dtypes    # all columns are numeric\r\n<em>AT    float64\r\nV     float64\r\nAP    float64\r\nRH    float64\r\nPE    float64\r\ndtype: object<\/em><\/pre>\n<p>Are there any missing values in any of the columns?<\/p>\n<pre>power_plant.info()  # no missing values in any column\r\n<em>RangeIndex: 9568 entries, 0 to 9567\r\nData columns (total 5 columns):\r\nAT    9568 non-null float64\r\nV     9568 non-null float64\r\nAP    9568 non-null float64\r\nRH    9568 non-null float64\r\nPE    9568 non-null float64\r\ndtypes: float64(5)\r\nmemory usage: 373.8 
KB<\/em><\/pre>\n<h2>Visualize relationships between variables<\/h2>\n<p>Before we perform any modeling, it is a good idea to explore correlations between the predictors and the predictand. This step can be important as it helps us select appropriate models. If our features and the outcome are linearly related, we may start with linear regression models. However, if the relationships between the label and the features are non-linear, non-linear ensemble models such as random forests may perform better.<\/p>\n<h3>Correlation between Atmospheric Temperature and power output<\/h3>\n<pre>power_plant.plot(x ='AT', y = 'PE', kind =\"scatter\", \r\n                 figsize = [10,10],\r\n                 color =\"b\", alpha = 0.3, \r\n                fontsize = 14)\r\nplt.title(\"Temperature vs Power Output\", \r\n          fontsize = 24, color=\"darkred\")\r\nplt.xlabel(\"Atmospheric Temperature\", fontsize = 18) \r\nplt.ylabel(\"Power Output\", fontsize = 18)\r\nplt.show()<\/pre>\n<p>Gives this plot:<br \/>\n<a href=\"https:\/\/datascienceplus.com\/wp-content\/uploads\/2018\/03\/plot1pythonplot.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/datascienceplus.com\/wp-content\/uploads\/2018\/03\/plot1pythonplot-490x490.png\" alt=\"\" width=\"490\" height=\"490\" class=\"alignnone size-medium wp-image-11469\" srcset=\"https:\/\/datascienceplus.com\/wp-content\/uploads\/2018\/03\/plot1pythonplot-490x490.png 490w, https:\/\/datascienceplus.com\/wp-content\/uploads\/2018\/03\/plot1pythonplot-144x144.png 144w, https:\/\/datascienceplus.com\/wp-content\/uploads\/2018\/03\/plot1pythonplot.png 624w\" sizes=\"auto, (max-width: 490px) 100vw, 490px\" \/><\/a><\/p>\n<p>As shown in the above figure, there is a strong negative linear correlation between Atmospheric Temperature and Power Output.<\/p>\n<h3>Correlation between Exhaust Vacuum Speed and power output<\/h3>\n<pre>\r\npower_plant.plot(x ='V', y = 'PE',kind =\"scatter\", \r\n                 figsize = [10,10],\r\n                 color =\"g\", 
alpha = 0.3, \r\n                fontsize = 14)\r\nplt.title(\"Exhaust Vacuum Speed vs Power Output\", fontsize = 24, color=\"darkred\")\r\nplt.xlabel(\"Exhaust Vacuum Speed\", fontsize = 18) \r\nplt.ylabel(\"Power Output\", fontsize = 18)\r\nplt.show()<\/pre>\n<p>Gives this plot:<br \/>\n<a href=\"https:\/\/datascienceplus.com\/wp-content\/uploads\/2018\/03\/Exhaust-Vacuum-Speed-vs-Power-Output.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/datascienceplus.com\/wp-content\/uploads\/2018\/03\/Exhaust-Vacuum-Speed-vs-Power-Output-490x490.png\" alt=\"\" width=\"490\" height=\"490\" class=\"alignnone size-medium wp-image-11470\" srcset=\"https:\/\/datascienceplus.com\/wp-content\/uploads\/2018\/03\/Exhaust-Vacuum-Speed-vs-Power-Output-490x490.png 490w, https:\/\/datascienceplus.com\/wp-content\/uploads\/2018\/03\/Exhaust-Vacuum-Speed-vs-Power-Output-144x144.png 144w, https:\/\/datascienceplus.com\/wp-content\/uploads\/2018\/03\/Exhaust-Vacuum-Speed-vs-Power-Output.png 624w\" sizes=\"auto, (max-width: 490px) 100vw, 490px\" \/><\/a><\/p>\n<h3>Correlation between Atmospheric Pressure and power output<\/h3>\n<pre>\r\npower_plant.plot(x ='AP', y = 'PE',kind =\"scatter\", \r\n                 figsize = [10,10],\r\n                 color =\"r\", alpha = 0.3,\r\n                fontsize = 14)\r\nplt.title(\"Atmospheric Pressure vs Power Output\", fontsize = 24, color=\"darkred\")\r\nplt.xlabel(\"Atmospheric Pressure\", fontsize = 18) \r\nplt.ylabel(\"Power Output\", fontsize = 18)\r\nplt.show()<\/pre>\n<p>Gives this plot:<br \/>\n<a href=\"https:\/\/datascienceplus.com\/wp-content\/uploads\/2018\/03\/Atmospheric-Pressure-vs-Power-Output.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/datascienceplus.com\/wp-content\/uploads\/2018\/03\/Atmospheric-Pressure-vs-Power-Output-490x482.png\" alt=\"\" width=\"490\" height=\"482\" class=\"alignnone size-medium wp-image-11471\" 
srcset=\"https:\/\/datascienceplus.com\/wp-content\/uploads\/2018\/03\/Atmospheric-Pressure-vs-Power-Output-490x482.png 490w, https:\/\/datascienceplus.com\/wp-content\/uploads\/2018\/03\/Atmospheric-Pressure-vs-Power-Output.png 636w\" sizes=\"auto, (max-width: 490px) 100vw, 490px\" \/><\/a><\/p>\n<h3>Correlation between relative humidity and power output <\/h3>\n<pre>\r\npower_plant.plot(x ='RH', y = 'PE',kind =\"scatter\", \r\n                 figsize = [10,10],\r\n                 color =\"m\", alpha = 0.3)\r\nplt.title(\"Relative Humidity vs Power Output\", fontsize = 24, color=\"darkred\")\r\nplt.xlabel(\"Relative Humidity\", fontsize = 18) \r\nplt.ylabel(\"Power Output\", fontsize = 18)\r\nplt.show()  <\/pre>\n<p>Gives this plot:<br \/>\n<a href=\"https:\/\/datascienceplus.com\/wp-content\/uploads\/2018\/03\/Relative-Humidity-vs-Power-Output.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/datascienceplus.com\/wp-content\/uploads\/2018\/03\/Relative-Humidity-vs-Power-Output-488x490.png\" alt=\"\" width=\"488\" height=\"490\" class=\"alignnone size-medium wp-image-11472\" srcset=\"https:\/\/datascienceplus.com\/wp-content\/uploads\/2018\/03\/Relative-Humidity-vs-Power-Output-488x490.png 488w, https:\/\/datascienceplus.com\/wp-content\/uploads\/2018\/03\/Relative-Humidity-vs-Power-Output-144x144.png 144w, https:\/\/datascienceplus.com\/wp-content\/uploads\/2018\/03\/Relative-Humidity-vs-Power-Output.png 618w\" sizes=\"auto, (max-width: 488px) 100vw, 488px\" \/><\/a><\/p>\n<h3>Correlation heatmap <\/h3>\n<pre>corr = power_plant.corr()\r\nplt.figure(figsize = (9, 7))\r\nsns.heatmap(corr, cmap=\"RdBu\",\r\n            xticklabels=corr.columns.values,\r\n            yticklabels=corr.columns.values)\r\nplt.show()<\/pre>\n<p>Gives this plot:<br \/>\n<a href=\"https:\/\/datascienceplus.com\/wp-content\/uploads\/2018\/03\/correlation-heatmap.png\"><img loading=\"lazy\" decoding=\"async\" 
src=\"https:\/\/datascienceplus.com\/wp-content\/uploads\/2018\/03\/correlation-heatmap-490x399.png\" alt=\"\" width=\"490\" height=\"399\" class=\"alignnone size-medium wp-image-11473\" srcset=\"https:\/\/datascienceplus.com\/wp-content\/uploads\/2018\/03\/correlation-heatmap-490x399.png 490w, https:\/\/datascienceplus.com\/wp-content\/uploads\/2018\/03\/correlation-heatmap.png 503w\" sizes=\"auto, (max-width: 490px) 100vw, 490px\" \/><\/a><\/p>\n<p>As shown in the correlation heatmap above, the target is correlated with the features. However, we also observe correlation among the features, hence we have multi-collinearity problem. He will use regularization to check if the collinearity we observe has a significant impact on the performance of linear regression model.<\/p>\n<h2>Data Modeling<\/h2>\n<p>All the columns are numeric and there are no missing values, which makes our modeling task strightforward.<br \/>\nNow, let&#8217;s model our data to predict what the power output will be given a set of sensor readings. Our first model will be based on simple linear regression since we saw some linear patterns in our data based on the scatter plots and correlation heatmap during the exploration stage.<br \/>\nWe need a way of evaluating how well our linear regression model predicts power output as a function of input parameters. 
We can do this by splitting our initial dataset into a Training Set, used to train our model, and a Test Set, used to evaluate the model&#8217;s performance on unseen data.<\/p>\n<pre>\r\nfrom sklearn.model_selection import train_test_split\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import GridSearchCV\r\nfrom sklearn.preprocessing import StandardScaler\r\nfrom sklearn.pipeline import Pipeline\r\nfrom sklearn.linear_model import LinearRegression\r\nfrom sklearn.linear_model import Ridge\r\nfrom sklearn.linear_model import Lasso\r\nfrom sklearn.linear_model import ElasticNet\r\nfrom sklearn.tree import DecisionTreeRegressor\r\nfrom sklearn.ensemble import RandomForestRegressor\r\nfrom sklearn.ensemble import GradientBoostingRegressor\r\nfrom sklearn.svm import SVR\r\nfrom sklearn.metrics import mean_squared_error\r\nimport numpy as np<\/pre>\n<h3>Split data into training and test datasets<\/h3>\n<p>Let&#8217;s split the original dataset into training and test datasets: the training dataset is 80% of the whole dataset, and the test set is the remaining 20%. 
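Two of the imports above, Pipeline and cross_val_score, are not used until later parts of this series, but a brief illustration of why they pair well is useful: chaining the scaler and the regressor in a Pipeline makes cross-validation re-fit the scaler inside every fold, so no statistics leak from held-out data into preprocessing. This is only a sketch on synthetic data; the coefficients below are made-up stand-ins, not the fitted model.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the sensor readings
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = X @ np.array([-14.8, -2.9, 0.4, -2.3]) + rng.normal(scale=4.5, size=500)

# The scaler is re-fit on the training folds only within each CV split
pipe = Pipeline([("scale", StandardScaler()),
                 ("reg", LinearRegression())])
scores = cross_val_score(pipe, X, y, cv=5, scoring="r2")
print(scores.round(3))  # one R-squared score per fold
```

For this post we stick to a single train/test split, which is simpler and sufficient for a first model.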
In Python, we use the <code>train_test_split<\/code> function to achieve that.<\/p>\n<pre>X = power_plant.drop(\"PE\", axis = 1).values\r\ny = power_plant['PE'].values\r\ny = y.reshape(-1, 1)\r\n\r\n# Split into training and test set\r\n# 80% of the input for training and 20% for testing\r\n\r\nX_train, X_test, y_train, y_test = train_test_split(X, y,\r\n                                               test_size = 0.2, \r\n                                               random_state=42)\r\n\r\nTraining_to_original_ratio = round(X_train.shape[0]\/(power_plant.shape[0]), 2) * 100\r\n\r\nTesting_to_original_ratio = round(X_test.shape[0]\/(power_plant.shape[0]), 2) * 100\r\n\r\nprint('As shown below {}% of the data is for training and the rest {}% is for testing.'.format(Training_to_original_ratio, \r\n                                                                                               Testing_to_original_ratio))\r\nlist(zip([\"Training set\", \"Testing set\"],\r\n   [Training_to_original_ratio, Testing_to_original_ratio]))<\/pre>\n<p>As shown below, 80.0% of the data is for training and the remaining 20.0% is for testing.<\/p>\n<h2>Linear Regression<\/h2>\n<pre># Instantiate linear regression: linear_reg\r\n# Standardize features by removing the mean \r\n# and scaling to unit variance using the\r\n# StandardScaler() function\r\n\r\n# Apply Scaling to X_train and X_test\r\nstd_scale = StandardScaler().fit(X_train)\r\nX_train_scaled = std_scale.transform(X_train)\r\nX_test_scaled = std_scale.transform(X_test)\r\nlinear_reg = LinearRegression()\r\nreg_scaled = linear_reg.fit(X_train_scaled, y_train)\r\ny_train_scaled_fit = reg_scaled.predict(X_train_scaled)\r\nprint(\"R-squared for training dataset:{}\".\r\n      format(np.round(reg_scaled.score(X_train_scaled, y_train),\r\n                      2)))\r\nprint(\"Root mean square error: {}\".\r\n      format(np.round(np.sqrt(mean_squared_error(y_train, \r\n                                        y_train_scaled_fit)), 
2)))\r\ncoefficients = reg_scaled.coef_\r\nfeatures = list(power_plant.drop(\"PE\", axis = 1).columns)\r\nprint(\" \")\r\nprint('The coefficients of the features from the linear model:')\r\nprint(dict(zip(features, coefficients[0])))\r\nprint(\"\")\r\nprint(\"The intercept is {}\".format(np.round(reg_scaled.intercept_[0],3)))\r\n<em>R-squared for training dataset:0.93\r\nRoot mean square error: 4.57\r\n \r\nThe coefficients of the features from the linear model:\r\n{'AT': -14.763927385645419, 'V': -2.9496320985616462, 'AP': 0.36978031656087407, 'RH': -2.3121956560685817}\r\n\r\nThe intercept is 454.431<\/em><\/pre>\n<pre>\r\npred = reg_scaled.predict(X_test_scaled)\r\nprint(\"R-squared for test dataset:{}\".\r\n      format(np.round(reg_scaled.score(X_test_scaled, \r\n                                       y_test),  2)))\r\nprint(\"Root mean square error for test dataset: {}\".\r\n      format(np.round(np.sqrt(mean_squared_error(y_test, \r\n                                        pred)), 2)))\r\ntest = pd.DataFrame(pred, columns = [\"Prediction\"])\r\ntest[\"Observed\"] = y_test\r\nlowess = sm.nonparametric.lowess\r\nz = lowess(pred.flatten(), y_test.flatten())\r\ntest.plot(figsize = [10,10],\r\n          x =\"Prediction\", y = \"Observed\", kind = \"scatter\", color = 'darkred')\r\nplt.title(\"Linear Regression: Prediction Vs Test Data\", fontsize = 24, color = \"darkgreen\")\r\nplt.xlabel(\"Predicted Power Output\", fontsize = 18) \r\nplt.ylabel(\"Observed Power Output\", fontsize = 18)\r\nplt.plot(z[:,0], z[:,1], color = \"blue\", lw= 3)\r\nplt.show()\r\n<em>R-squared for test dataset:0.93\r\nRoot mean square error for test dataset: 4.5<\/em><\/pre>\n<p>The plot:<br \/>\n<a href=\"https:\/\/datascienceplus.com\/wp-content\/uploads\/2018\/03\/Linear-Regression-Prediction-Vs-Test-Data.png\"><img loading=\"lazy\" decoding=\"async\" 
src=\"https:\/\/datascienceplus.com\/wp-content\/uploads\/2018\/03\/Linear-Regression-Prediction-Vs-Test-Data-488x490.png\" alt=\"\" width=\"488\" height=\"490\" class=\"alignnone size-medium wp-image-11474\" srcset=\"https:\/\/datascienceplus.com\/wp-content\/uploads\/2018\/03\/Linear-Regression-Prediction-Vs-Test-Data-488x490.png 488w, https:\/\/datascienceplus.com\/wp-content\/uploads\/2018\/03\/Linear-Regression-Prediction-Vs-Test-Data-144x144.png 144w, https:\/\/datascienceplus.com\/wp-content\/uploads\/2018\/03\/Linear-Regression-Prediction-Vs-Test-Data.png 618w\" sizes=\"auto, (max-width: 488px) 100vw, 488px\" \/><\/a><\/p>\n<p>In this blog we saw non-regularized multivariate linear regression. You may also read the <a href=\"https:\/\/datascienceplus.com\/machine-learning-with-r-caret-part-1\/\">R version<\/a> of this post. In the second part of the post, we will work with regularized linear regression models. Next, we will see the other non-linear regression models. See you in the next post.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Previously, I have written a blog post on machine learning with R by Caret package. In this post, I will use the scikit-learn library in Python. As we did in the R post, we will predict power output given a set of environmental readings from various sensors in a natural gas-fired power generation plant. 
The [&hellip;]<\/p>\n","protected":false},"author":361,"featured_media":11475,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6],"tags":[29,92,231],"class_list":["post-11451","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-introduction","tag-linear-regression","tag-machine-learning","tag-python"],"views":7312,"_links":{"self":[{"href":"https:\/\/datascienceplus.com\/wp-json\/wp\/v2\/posts\/11451","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/datascienceplus.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/datascienceplus.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/datascienceplus.com\/wp-json\/wp\/v2\/users\/361"}],"replies":[{"embeddable":true,"href":"https:\/\/datascienceplus.com\/wp-json\/wp\/v2\/comments?post=11451"}],"version-history":[{"count":0,"href":"https:\/\/datascienceplus.com\/wp-json\/wp\/v2\/posts\/11451\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/datascienceplus.com\/wp-json\/wp\/v2\/media\/11475"}],"wp:attachment":[{"href":"https:\/\/datascienceplus.com\/wp-json\/wp\/v2\/media?parent=11451"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/datascienceplus.com\/wp-json\/wp\/v2\/categories?post=11451"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/datascienceplus.com\/wp-json\/wp\/v2\/tags?post=11451"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}