{"id":477,"date":"2020-02-23T19:39:52","date_gmt":"2020-02-23T19:39:52","guid":{"rendered":"https:\/\/fcpython.com\/?p=477"},"modified":"2020-12-18T20:08:55","modified_gmt":"2020-12-18T20:08:55","slug":"introduction-to-simple-linear-regression-in-python","status":"publish","type":"post","link":"https:\/\/fcpython.com\/machine-learning\/introduction-to-simple-linear-regression-in-python","title":{"rendered":"Introduction to Simple Linear Regression in Python"},"content":{"rendered":"<p><span style=\"font-size: inherit;\">Linear regression allows us to model the relationship between variables. This might allow us to predict a future outcome if we already know some information, or give us an insight into what is needed to reach a goal.<\/span><\/p>\n<div id=\"notebook\" class=\"border-box-sizing\" tabindex=\"-1\">\n<div id=\"notebook-container\" class=\"container\">\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>To fit a linear regression model, we need one dependent variable, whose changes we will study as one or more independent variables change. As an example, we could model how many goals are scored (dependent variable), as more shots are taken (independent variable). As we have just one independent variable, this is a simple linear regression &#8211; models that take in multiple independent variables are known as multiple linear regressions.<\/p>\n<p>This article is going to apply a simple linear regression model to squad value data against performance in the Premier League. This might help us to see how much a squad might need to invest to avoid relegation, make European spots or to create a data-driven target for our team.<\/p>\n<p>The steps that we are going to take include a quick look &amp; explore of our dataset, creating the model &amp; then making some assessments on the back of it. 
Then, we&#8217;ll calculate a better metric to improve our model. We will use the sklearn module to make this much less intimidating than it might seem right now! Let&#8217;s get the modules in place and read in a local dataset called positionsvsValue &#8211; which you can download <a href=\"http:\/\/www.sharecsv.com\/s\/b1315bc34df3d924f9bdbf9f67150f7d\/PositionsvsValue.csv\">here<\/a>.<\/p>\n<h3>Initial set-up &amp; exploration<\/h3>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[1]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython3\">\n<pre><span class=\"kn\">import<\/span> <span class=\"nn\">pandas<\/span> <span class=\"k\">as<\/span> <span class=\"nn\">pd<\/span>\n<span class=\"kn\">import<\/span> <span class=\"nn\">numpy<\/span> <span class=\"k\">as<\/span> <span class=\"nn\">np<\/span>\n\n<span class=\"kn\">import<\/span> <span class=\"nn\">matplotlib.pyplot<\/span> <span class=\"k\">as<\/span> <span class=\"nn\">plt<\/span>\n<span class=\"kn\">import<\/span> <span class=\"nn\">seaborn<\/span> <span class=\"k\">as<\/span> <span class=\"nn\">sns<\/span>\n<span class=\"o\">%<\/span><span class=\"k\">matplotlib<\/span> inline\n\n<span class=\"kn\">from<\/span> <span class=\"nn\">sklearn.model_selection<\/span> <span class=\"kn\">import<\/span> <span class=\"n\">train_test_split<\/span>\n<span class=\"kn\">from<\/span> <span class=\"nn\">sklearn.linear_model<\/span> <span class=\"kn\">import<\/span> <span class=\"n\">LinearRegression<\/span>\n<span class=\"kn\">from<\/span> <span class=\"nn\">sklearn<\/span> <span class=\"kn\">import<\/span> <span class=\"n\">metrics<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[2]:<\/div>\n<div class=\"inner_cell\">\n<div 
class=\"input_area\">\n<div class=\" highlight hl-ipython3\">\n<pre><span class=\"c1\">#load data<\/span>\n<span class=\"n\">data<\/span> <span class=\"o\">=<\/span> <span class=\"n\">pd<\/span><span class=\"o\">.<\/span><span class=\"n\">read_csv<\/span><span class=\"p\">(<\/span><span class=\"s2\">\"positionsvsValue.csv\"<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">data<\/span><span class=\"o\">.<\/span><span class=\"n\">head<\/span><span class=\"p\">()<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"output_wrapper\">\n<div class=\"output\">\n<div class=\"output_area\">\n<div class=\"prompt output_prompt\">Out[2]:<\/div>\n<div class=\"output_html rendered_html output_subarea output_execute_result\">\n<table class=\"dataframe\" border=\"1\">\n<thead>\n<tr style=\"text-align: right;\">\n<th><\/th>\n<th>League<\/th>\n<th>Season<\/th>\n<th>Team<\/th>\n<th>Squad<\/th>\n<th>Average Age<\/th>\n<th>Non-Homegrown<\/th>\n<th>Squad Value<\/th>\n<th>Avg Player Value<\/th>\n<th>GD<\/th>\n<th>Points<\/th>\n<th>Position<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<th>0<\/th>\n<td>EPL<\/td>\n<td>2008<\/td>\n<td>Chelsea FC<\/td>\n<td>28<\/td>\n<td>25.6<\/td>\n<td>21<\/td>\n<td>406.70<\/td>\n<td>14.53<\/td>\n<td>44<\/td>\n<td>83<\/td>\n<td>3<\/td>\n<\/tr>\n<tr>\n<th>1<\/th>\n<td>EPL<\/td>\n<td>2008<\/td>\n<td>Manchester United<\/td>\n<td>31<\/td>\n<td>24.3<\/td>\n<td>20<\/td>\n<td>356.10<\/td>\n<td>11.49<\/td>\n<td>44<\/td>\n<td>90<\/td>\n<td>1<\/td>\n<\/tr>\n<tr>\n<th>2<\/th>\n<td>EPL<\/td>\n<td>2008<\/td>\n<td>Liverpool FC<\/td>\n<td>28<\/td>\n<td>23.9<\/td>\n<td>24<\/td>\n<td>257.23<\/td>\n<td>9.19<\/td>\n<td>50<\/td>\n<td>86<\/td>\n<td>2<\/td>\n<\/tr>\n<tr>\n<th>3<\/th>\n<td>EPL<\/td>\n<td>2008<\/td>\n<td>Arsenal FC<\/td>\n<td>38<\/td>\n<td>21.3<\/td>\n<td>30<\/td>\n<td>250.85<\/td>\n<td>6.6<\/td>\n<td>31<\/td>\n<td>72<\/td>\n<td>4<\/td>\n<\/tr>\n<tr>\n<th>4<\/th>\n<td>EPL<\/td>\n<td>2008<\/td>\n<td>Tottenham 
Hotspur<\/td>\n<td>35<\/td>\n<td>22.5<\/td>\n<td>18<\/td>\n<td>212.60<\/td>\n<td>6.07<\/td>\n<td>0<\/td>\n<td>51<\/td>\n<td>8<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[3]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython3\">\n<pre><span class=\"n\">data<\/span><span class=\"o\">.<\/span><span class=\"n\">describe<\/span><span class=\"p\">()<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"output_wrapper\">\n<div class=\"output\">\n<div class=\"output_area\">\n<div class=\"prompt output_prompt\">Out[3]:<\/div>\n<div class=\"output_html rendered_html output_subarea output_execute_result\">\n<table class=\"dataframe\" border=\"1\">\n<thead>\n<tr style=\"text-align: right;\">\n<th><\/th>\n<th>Season<\/th>\n<th>Squad<\/th>\n<th>Average Age<\/th>\n<th>Non-Homegrown<\/th>\n<th>Squad 
Value<\/th>\n<th>GD<\/th>\n<th>Points<\/th>\n<th>Position<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<th>count<\/th>\n<td>220.000000<\/td>\n<td>220.000000<\/td>\n<td>220.000000<\/td>\n<td>220.000000<\/td>\n<td>220.000000<\/td>\n<td>220.000000<\/td>\n<td>220.000000<\/td>\n<td>220.000000<\/td>\n<\/tr>\n<tr>\n<th>mean<\/th>\n<td>2013.000000<\/td>\n<td>36.304545<\/td>\n<td>24.793636<\/td>\n<td>22.886364<\/td>\n<td>225.792909<\/td>\n<td>0.000000<\/td>\n<td>52.245455<\/td>\n<td>10.500000<\/td>\n<\/tr>\n<tr>\n<th>std<\/th>\n<td>3.169489<\/td>\n<td>5.410372<\/td>\n<td>1.136427<\/td>\n<td>5.377171<\/td>\n<td>183.079602<\/td>\n<td>27.061405<\/td>\n<td>17.569788<\/td>\n<td>5.779431<\/td>\n<\/tr>\n<tr>\n<th>min<\/th>\n<td>2008.000000<\/td>\n<td>21.000000<\/td>\n<td>21.300000<\/td>\n<td>8.000000<\/td>\n<td>22.500000<\/td>\n<td>-54.000000<\/td>\n<td>16.000000<\/td>\n<td>1.000000<\/td>\n<\/tr>\n<tr>\n<th>25%<\/th>\n<td>2010.000000<\/td>\n<td>33.000000<\/td>\n<td>23.975000<\/td>\n<td>19.000000<\/td>\n<td>99.662500<\/td>\n<td>-20.000000<\/td>\n<td>40.000000<\/td>\n<td>5.750000<\/td>\n<\/tr>\n<tr>\n<th>50%<\/th>\n<td>2013.000000<\/td>\n<td>36.000000<\/td>\n<td>24.800000<\/td>\n<td>22.000000<\/td>\n<td>158.275000<\/td>\n<td>-7.000000<\/td>\n<td>47.000000<\/td>\n<td>10.500000<\/td>\n<\/tr>\n<tr>\n<th>75%<\/th>\n<td>2016.000000<\/td>\n<td>40.000000<\/td>\n<td>25.500000<\/td>\n<td>26.000000<\/td>\n<td>299.782500<\/td>\n<td>20.250000<\/td>\n<td>64.250000<\/td>\n<td>15.250000<\/td>\n<\/tr>\n<tr>\n<th>max<\/th>\n<td>2018.000000<\/td>\n<td>54.000000<\/td>\n<td>28.100000<\/td>\n<td>41.000000<\/td>\n<td>1000.100000<\/td>\n<td>79.000000<\/td>\n<td>100.000000<\/td>\n<td>20.000000<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>So we have a 220-row dataset, with each 
row being a team in each Premier League season since 2008\/09. For each of the teams, we get squad sizes, ages, squad value (in millions of Euros) as well as performance data with goal difference, points &amp; position. The values are taken from Transfermarkt (once again, you can find the data <a href=\"http:\/\/www.sharecsv.com\/s\/b1315bc34df3d924f9bdbf9f67150f7d\/PositionsvsValue.csv\">here<\/a>).<\/p>\n<p>Our aim is to get a model together that would help us to predict a team&#8217;s points based on their squad value. Before we do that, we should check to see what the relationships are among some of the key variables. Let&#8217;s do that visually with a pair plot.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[4]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython3\">\n<pre><span class=\"n\">sns<\/span><span class=\"o\">.<\/span><span class=\"n\">pairplot<\/span><span class=\"p\">(<\/span><span class=\"n\">data<\/span><span class=\"p\">[[<\/span><span class=\"s1\">'Season'<\/span><span class=\"p\">,<\/span><span class=\"s1\">'GD'<\/span><span class=\"p\">,<\/span> <span class=\"s1\">'Squad Value'<\/span><span class=\"p\">,<\/span> <span class=\"s1\">'Points'<\/span><span class=\"p\">,<\/span> <span class=\"s1\">'Position'<\/span><span class=\"p\">]])<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"output_wrapper\">\n<div class=\"output\">\n<div class=\"output_area\">\n<div class=\"prompt output_prompt\">Out[4]:<\/div>\n<div class=\"output_text output_subarea output_execute_result\">\n<pre>&lt;seaborn.axisgrid.PairGrid at 0x1a261dbb50&gt;<\/pre>\n<\/div>\n<\/div>\n<div class=\"output_area\">\n<div class=\"prompt\"><\/div>\n<div class=\"output_png output_subarea \"><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-full wp-image-478\" 
src=\"https:\/\/fcpython.com\/wp-content\/uploads\/2020\/02\/pairplot1.png\" alt=\"Python Pairplot\" width=\"1440\" height=\"1416\" srcset=\"https:\/\/fcpython.com\/wp-content\/uploads\/2020\/02\/pairplot1.png 1440w, https:\/\/fcpython.com\/wp-content\/uploads\/2020\/02\/pairplot1-300x295.png 300w, https:\/\/fcpython.com\/wp-content\/uploads\/2020\/02\/pairplot1-1024x1007.png 1024w, https:\/\/fcpython.com\/wp-content\/uploads\/2020\/02\/pairplot1-768x755.png 768w\" sizes=\"(max-width: 1440px) 100vw, 1440px\" \/><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>Some interesting points to keep in mind:<\/p>\n<ul>\n<li>Points &amp; goal difference correlate really strongly, as you might expect.<\/li>\n<li>Squad value goes up as goal difference and points go up, but as more of a curve than a line.<\/li>\n<li>Squad value has increased over time (important! We&#8217;ll come back to this)<\/li>\n<\/ul>\n<p>Thinking back to our initial problem &#8211; modelling performance on squad value &#8211; we need to define what performance is. I think that we can answer this by seeing which of points and position correlates more strongly with squad value. 
Let&#8217;s check if position correlates more than points:<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[5]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython3\">\n<pre><span class=\"nb\">abs<\/span><span class=\"p\">(<\/span><span class=\"n\">data<\/span><span class=\"p\">[<\/span><span class=\"s1\">'Squad Value'<\/span><span class=\"p\">]<\/span><span class=\"o\">.<\/span><span class=\"n\">corr<\/span><span class=\"p\">(<\/span><span class=\"n\">data<\/span><span class=\"p\">[<\/span><span class=\"s1\">'Position'<\/span><span class=\"p\">]))<\/span> <span class=\"o\">&gt;<\/span> <span class=\"n\">data<\/span><span class=\"p\">[<\/span><span class=\"s1\">'Squad Value'<\/span><span class=\"p\">]<\/span><span class=\"o\">.<\/span><span class=\"n\">corr<\/span><span class=\"p\">(<\/span><span class=\"n\">data<\/span><span class=\"p\">[<\/span><span class=\"s1\">'Points'<\/span><span class=\"p\">])<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"output_wrapper\">\n<div class=\"output\">\n<div class=\"output_area\">\n<div class=\"prompt output_prompt\">Out[5]:<\/div>\n<div class=\"output_text output_subarea output_execute_result\">\n<pre>False<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>Seemingly not, so for the purpose of the article, we&#8217;re going to build our model around how many points you should expect for your squad value, not the position.<\/p>\n<h3>Building our Model<\/h3>\n<p>So let&#8217;s get to it. 
We&#8217;ll take the following steps:<\/p>\n<p>1) Get and reshape the two columns that we want to use in our model: Points &amp; Squad Value<\/p>\n<p>2) Split each of the two variables into a training set, and a test set. The train set will build our model, the test set will allow us to see how good the model is.<\/p>\n<p>3) Create an empty linear regression model, then fit it against our two training sets<\/p>\n<p>4) Examine and test the model<\/p>\n<p>Let&#8217;s work through each step<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[6]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython3\">\n<pre><span class=\"c1\">#1- Get our two columns into variables, then reshape them<\/span>\n\n<span class=\"n\">X<\/span> <span class=\"o\">=<\/span> <span class=\"n\">data<\/span><span class=\"p\">[<\/span><span class=\"s1\">'Squad Value'<\/span><span class=\"p\">]<\/span>\n<span class=\"n\">y<\/span> <span class=\"o\">=<\/span> <span class=\"n\">data<\/span><span class=\"p\">[<\/span><span class=\"s1\">'Points'<\/span><span class=\"p\">]<\/span>\n\n<span class=\"n\">X<\/span> <span class=\"o\">=<\/span> <span class=\"n\">X<\/span><span class=\"o\">.<\/span><span class=\"n\">values<\/span><span class=\"o\">.<\/span><span class=\"n\">reshape<\/span><span class=\"p\">(<\/span><span class=\"o\">-<\/span><span class=\"mi\">1<\/span><span class=\"p\">,<\/span><span class=\"mi\">1<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">y<\/span> <span class=\"o\">=<\/span> <span class=\"n\">y<\/span><span class=\"o\">.<\/span><span class=\"n\">values<\/span><span class=\"o\">.<\/span><span class=\"n\">reshape<\/span><span class=\"p\">(<\/span><span class=\"o\">-<\/span><span class=\"mi\">1<\/span><span class=\"p\">,<\/span><span class=\"mi\">1<\/span><span 
class=\"p\">)<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>We can use train_test_split to easily create our training and test sets. There are a couple of arguments that we pass in addition to the variables that will be split. There is test_size, which tells the function what proportion of the data should go into the test set (0.25 here, so a quarter of the rows). random_state is not necessary, but it sets a starting point for the random number generation involved in the split &#8211; if you want your data to look like this tutorial, keep this the same.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[7]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython3\">\n<pre><span class=\"c1\">#2- Use the train_test_split function to create our training sets &amp; test sets<\/span>\n<span class=\"n\">X_train<\/span><span class=\"p\">,<\/span> <span class=\"n\">X_test<\/span><span class=\"p\">,<\/span> <span class=\"n\">y_train<\/span><span class=\"p\">,<\/span> <span class=\"n\">y_test<\/span> <span class=\"o\">=<\/span> <span class=\"n\">train_test_split<\/span><span class=\"p\">(<\/span><span class=\"n\">X<\/span><span class=\"p\">,<\/span> <span class=\"n\">y<\/span><span class=\"p\">,<\/span> <span class=\"n\">test_size<\/span><span class=\"o\">=<\/span><span class=\"mf\">0.25<\/span><span class=\"p\">,<\/span> <span class=\"n\">random_state<\/span><span class=\"o\">=<\/span><span class=\"mi\">101<\/span><span class=\"p\">)<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render 
border-box-sizing rendered_html\">\n<p>Next up is creating the empty model, then fitting it with our training data. The sklearn package means that this only takes a couple of lines:<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[8]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython3\">\n<pre><span class=\"n\">lm<\/span> <span class=\"o\">=<\/span> <span class=\"n\">LinearRegression<\/span><span class=\"p\">()<\/span>\n<span class=\"n\">lm<\/span><span class=\"o\">.<\/span><span class=\"n\">fit<\/span><span class=\"p\">(<\/span><span class=\"n\">X_train<\/span><span class=\"p\">,<\/span><span class=\"n\">y_train<\/span><span class=\"p\">)<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"output_wrapper\">\n<div class=\"output\">\n<div class=\"output_area\">\n<div class=\"prompt output_prompt\">Out[8]:<\/div>\n<div class=\"output_text output_subarea output_execute_result\">\n<pre>LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>Holy shit, you&#8217;ve just made a linear regression model! Bit of an anticlimax until we do something with it&#8230;<\/p>\n<p>The final part is examining the model. This means seeing what conclusions it gives to answer our main question (value -&gt; performance), and importantly, how valid they are.<\/p>\n<p>We can start by checking the coefficient. This is the amount that we expect our response variable (points) to change for every unit that our predictor variable changes (squad value in m Euros). 
Simply, for every extra million we put into our squad value, how many extra points should we get? We find out with the .coef_ attribute of the fitted model.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[9]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython3\">\n<pre><span class=\"nb\">print<\/span><span class=\"p\">(<\/span><span class=\"n\">lm<\/span><span class=\"o\">.<\/span><span class=\"n\">coef_<\/span><span class=\"p\">)<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"output_wrapper\">\n<div class=\"output\">\n<div class=\"output_area\">\n<div class=\"prompt\"><\/div>\n<div class=\"output_subarea output_stream output_stdout output_text\">\n<pre>[[0.07152655]]\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>So on average, an extra million gets you about 0.07 points. Looks like we&#8217;re going to need an absolute war chest to stay up.<\/p>\n<p>We now need to test the model by checking predictions from the trained model against the test data that we know is true. Let&#8217;s check out a few ways of doing this. Firstly, we&#8217;ll create some predictions using lm.predict &#8211; we&#8217;ll feed it the real squad values from the test set, and it will predict the points based on the model. 
Then we&#8217;ll use this in 2 charts, firstly plotting the real data against the prediction line, then plotting the prediction against the true data.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[28]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython3\">\n<pre><span class=\"n\">predictions<\/span> <span class=\"o\">=<\/span> <span class=\"n\">lm<\/span><span class=\"o\">.<\/span><span class=\"n\">predict<\/span><span class=\"p\">(<\/span><span class=\"n\">X_test<\/span><span class=\"p\">)<\/span>\n\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">scatter<\/span><span class=\"p\">(<\/span><span class=\"n\">X_test<\/span><span class=\"p\">,<\/span> <span class=\"n\">y_test<\/span><span class=\"p\">,<\/span>  <span class=\"n\">color<\/span><span class=\"o\">=<\/span><span class=\"s1\">'purple'<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">plot<\/span><span class=\"p\">(<\/span><span class=\"n\">X_test<\/span><span class=\"p\">,<\/span> <span class=\"n\">predictions<\/span><span class=\"p\">,<\/span> <span class=\"n\">color<\/span><span class=\"o\">=<\/span><span class=\"s1\">'green'<\/span><span class=\"p\">,<\/span> <span class=\"n\">linewidth<\/span><span class=\"o\">=<\/span><span class=\"mi\">3<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">title<\/span><span class=\"p\">(<\/span><span class=\"s2\">\"EPL Squad value vs points - Model One\"<\/span><span class=\"p\">)<\/span>\n\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">show<\/span><span class=\"p\">()<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"output_wrapper\">\n<div class=\"output\">\n<div class=\"prompt\"><img decoding=\"async\" loading=\"lazy\" 
class=\"aligncenter wp-image-479 size-medium\" src=\"https:\/\/fcpython.com\/wp-content\/uploads\/2020\/02\/Screenshot-2020-02-23-at-18.13.27-300x214.png\" alt=\"Simple Linear Regression Test 1\" width=\"300\" height=\"214\" srcset=\"https:\/\/fcpython.com\/wp-content\/uploads\/2020\/02\/Screenshot-2020-02-23-at-18.13.27-300x214.png 300w, https:\/\/fcpython.com\/wp-content\/uploads\/2020\/02\/Screenshot-2020-02-23-at-18.13.27.png 756w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[11]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython3\">\n<pre><span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">scatter<\/span><span class=\"p\">(<\/span><span class=\"n\">y_test<\/span><span class=\"p\">,<\/span><span class=\"n\">predictions<\/span><span class=\"p\">)<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"output_wrapper\">\n<div class=\"output\">\n<div class=\"output_area\">\n<div class=\"prompt output_prompt\">Out[11]:<\/div>\n<div class=\"output_text output_subarea output_execute_result\">\n<pre>&lt;matplotlib.collections.PathCollection at 0x1a27b8ab90&gt;<\/pre>\n<\/div>\n<\/div>\n<div class=\"output_area\">\n<div class=\"prompt\"><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter wp-image-480 size-medium\" src=\"https:\/\/fcpython.com\/wp-content\/uploads\/2020\/02\/Screenshot-2020-02-23-at-18.13.35-300x214.png\" alt=\"\" width=\"300\" height=\"214\" srcset=\"https:\/\/fcpython.com\/wp-content\/uploads\/2020\/02\/Screenshot-2020-02-23-at-18.13.35-300x214.png 300w, https:\/\/fcpython.com\/wp-content\/uploads\/2020\/02\/Screenshot-2020-02-23-at-18.13.35.png 756w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell 
rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>Lots of values that match up well, and lots that don&#8217;t. Tough to see how far we are out, though. So let&#8217;s get a histogram to plot the differences between the predictions and the true data:<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[34]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython3\">\n<pre><span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">title<\/span><span class=\"p\">(<\/span><span class=\"s1\">'How many points out is each prediction?'<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">sns<\/span><span class=\"o\">.<\/span><span class=\"n\">distplot<\/span><span class=\"p\">((<\/span><span class=\"n\">y_test<\/span><span class=\"o\">-<\/span><span class=\"n\">predictions<\/span><span class=\"p\">),<\/span><span class=\"n\">bins<\/span><span class=\"o\">=<\/span><span class=\"mi\">50<\/span><span class=\"p\">,<\/span> <span class=\"n\">color<\/span> <span class=\"o\">=<\/span> <span class=\"s1\">'purple'<\/span><span class=\"p\">)\n<\/span><\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\"><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-medium wp-image-486\" src=\"https:\/\/fcpython.com\/wp-content\/uploads\/2020\/02\/Screenshot-2020-02-23-at-20.28.11-300x214.png\" alt=\"Linear Regression Histogram\" width=\"300\" height=\"214\" srcset=\"https:\/\/fcpython.com\/wp-content\/uploads\/2020\/02\/Screenshot-2020-02-23-at-20.28.11-300x214.png 300w, https:\/\/fcpython.com\/wp-content\/uploads\/2020\/02\/Screenshot-2020-02-23-at-20.28.11.png 756w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/div>\n<\/div>\n<div 
class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>There are a few where we are way out, by 30-40 points. But mostly, we are within 10 points or so either way.<\/p>\n<p>We are going to look to improve this, so to help with the comparison let&#8217;s use a metric called &#8216;mean absolute error&#8217;. This is simply the average size of the difference between the prediction and the truth, ignoring whether the prediction was too high or too low. Hopefully, we can reduce this with the next model.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[13]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython3\">\n<pre><span class=\"nb\">print<\/span><span class=\"p\">(<\/span><span class=\"s1\">'Mean Absolute Error:'<\/span><span class=\"p\">,<\/span> <span class=\"n\">metrics<\/span><span class=\"o\">.<\/span><span class=\"n\">mean_absolute_error<\/span><span class=\"p\">(<\/span><span class=\"n\">y_test<\/span><span class=\"p\">,<\/span> <span class=\"n\">predictions<\/span><span class=\"p\">))<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"output_wrapper\">\n<div class=\"output\">\n<div class=\"output_area\">\n<div class=\"prompt\"><\/div>\n<div class=\"output_subarea output_stream output_stdout output_text\">\n<pre>Mean Absolute Error: 9.728206663986418\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>Alternatively, we could put these in a table, rather than plot them. 
But that is a bit less friendly to work through.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[14]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython3\">\n<pre><span class=\"n\">df<\/span> <span class=\"o\">=<\/span> <span class=\"n\">pd<\/span><span class=\"o\">.<\/span><span class=\"n\">DataFrame<\/span><span class=\"p\">({<\/span><span class=\"s1\">'Actual'<\/span><span class=\"p\">:<\/span> <span class=\"n\">y_test<\/span><span class=\"o\">.<\/span><span class=\"n\">flatten<\/span><span class=\"p\">(),<\/span> <span class=\"s1\">'Predicted'<\/span><span class=\"p\">:<\/span> <span class=\"n\">predictions<\/span><span class=\"o\">.<\/span><span class=\"n\">flatten<\/span><span class=\"p\">()})<\/span>\n<span class=\"n\">df<\/span><span class=\"o\">.<\/span><span class=\"n\">head<\/span><span class=\"p\">()<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"output_wrapper\">\n<div class=\"output\">\n<div class=\"output_area\">\n<div class=\"prompt output_prompt\">Out[14]:<\/div>\n<div class=\"output_html rendered_html output_subarea output_execute_result\">\n<table class=\"dataframe\" border=\"1\">\n<thead>\n<tr style=\"text-align: right;\">\n<th><\/th>\n<th>Actual<\/th>\n<th>Predicted<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<th>0<\/th>\n<td>47<\/td>\n<td>40.995974<\/td>\n<\/tr>\n<tr>\n<th>1<\/th>\n<td>56<\/td>\n<td>42.834207<\/td>\n<\/tr>\n<tr>\n<th>2<\/th>\n<td>49<\/td>\n<td>43.510133<\/td>\n<\/tr>\n<tr>\n<th>3<\/th>\n<td>63<\/td>\n<td>80.843418<\/td>\n<\/tr>\n<tr>\n<th>4<\/th>\n<td>61<\/td>\n<td>55.204724<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[15]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div 
class=\" highlight hl-ipython3\">\n<pre><span class=\"n\">df<\/span><span class=\"p\">[<\/span><span class=\"s1\">'Actual'<\/span><span class=\"p\">]<\/span><span class=\"o\">.<\/span><span class=\"n\">corr<\/span><span class=\"p\">(<\/span><span class=\"n\">df<\/span><span class=\"p\">[<\/span><span class=\"s1\">'Predicted'<\/span><span class=\"p\">])<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"output_wrapper\">\n<div class=\"output\">\n<div class=\"output_area\">\n<div class=\"prompt output_prompt\">Out[15]:<\/div>\n<div class=\"output_text output_subarea output_execute_result\">\n<pre>0.6540205213240837<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h3 id=\"Improving-the-model\">Improving the model<\/h3>\n<p>When we took an exploratory look at the data, we found that team values had increased over the seasons. As such, comparing a 100m squad in 2008 to a 100m squad in 2018 probably isn&#8217;t fair.<\/p>\n<p>To counter this, we are going to create a new &#8216;Relative Value&#8217; column. This will take each team&#8217;s squad value in a season, and divide it by the highest squad value in the league that season. These values will be between 0 &amp; 1 and give a better impression of comparative buying power, hence performance in the league. 
Hopefully it will provide for a better model than the example above.<\/p>\n<p>Let&#8217;s create this column as a list, then add it to our dataframe.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[16]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython3\">\n<pre><span class=\"c1\">#Blank list<\/span>\n<span class=\"n\">relativeValue<\/span> <span class=\"o\">=<\/span> <span class=\"p\">[]<\/span>\n\n<span class=\"c1\">#Loop through each row<\/span>\n<span class=\"k\">for<\/span> <span class=\"n\">index<\/span><span class=\"p\">,<\/span> <span class=\"n\">team<\/span> <span class=\"ow\">in<\/span> <span class=\"n\">data<\/span><span class=\"o\">.<\/span><span class=\"n\">iterrows<\/span><span class=\"p\">():<\/span>\n    \n    <span class=\"c1\">#Obtain which season we are looking at<\/span>\n    <span class=\"n\">season<\/span> <span class=\"o\">=<\/span> <span class=\"n\">team<\/span><span class=\"p\">[<\/span><span class=\"s1\">'Season'<\/span><span class=\"p\">]<\/span>\n    \n    <span class=\"c1\">#Create a new dataframe with just this season<\/span>\n    <span class=\"n\">teamseason<\/span> <span class=\"o\">=<\/span> <span class=\"n\">data<\/span><span class=\"p\">[<\/span><span class=\"n\">data<\/span><span class=\"p\">[<\/span><span class=\"s1\">'Season'<\/span><span class=\"p\">]<\/span> <span class=\"o\">==<\/span> <span class=\"n\">season<\/span><span class=\"p\">]<\/span>\n    \n    <span class=\"c1\">#Find the max value<\/span>\n    <span class=\"n\">maxvalue<\/span> <span class=\"o\">=<\/span> <span class=\"n\">teamseason<\/span><span class=\"p\">[<\/span><span class=\"s1\">'Squad Value'<\/span><span class=\"p\">]<\/span><span class=\"o\">.<\/span><span class=\"n\">max<\/span><span class=\"p\">()<\/span>\n    \n    <span class=\"c1\">#Divide this row's value by the max value for the 
season<\/span>\n    <span class=\"n\">tempRelativeValue<\/span> <span class=\"o\">=<\/span> <span class=\"n\">team<\/span><span class=\"p\">[<\/span><span class=\"s1\">'Squad Value'<\/span><span class=\"p\">]<\/span><span class=\"o\">\/<\/span><span class=\"n\">maxvalue<\/span>\n    \n    <span class=\"c1\">#Append it to our list<\/span>\n    <span class=\"n\">relativeValue<\/span><span class=\"o\">.<\/span><span class=\"n\">append<\/span><span class=\"p\">(<\/span><span class=\"n\">tempRelativeValue<\/span><span class=\"p\">)<\/span>\n    \n<span class=\"c1\">#Add list to new column in main dataframe<\/span>\n<span class=\"n\">data<\/span><span class=\"p\">[<\/span><span class=\"s2\">\"Relative Value\"<\/span><span class=\"p\">]<\/span> <span class=\"o\">=<\/span> <span class=\"n\">relativeValue<\/span>\n\n<span class=\"n\">data<\/span><span class=\"o\">.<\/span><span class=\"n\">head<\/span><span class=\"p\">()<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"output_wrapper\">\n<div class=\"output\">\n<div class=\"output_area\">\n<div class=\"prompt output_prompt\">Out[16]:<\/div>\n<div class=\"output_html rendered_html output_subarea output_execute_result\">\n<table class=\"dataframe\" border=\"1\">\n<thead>\n<tr style=\"text-align: right;\">\n<th><\/th>\n<th>League<\/th>\n<th>Season<\/th>\n<th>Team<\/th>\n<th>Squad<\/th>\n<th>Average Age<\/th>\n<th>Non-Homegrown<\/th>\n<th>Squad Value<\/th>\n<th>Avg Player Value<\/th>\n<th>GD<\/th>\n<th>Points<\/th>\n<th>Position<\/th>\n<th>Relative Value<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<th>0<\/th>\n<td>EPL<\/td>\n<td>2008<\/td>\n<td>Chelsea FC<\/td>\n<td>28<\/td>\n<td>25.6<\/td>\n<td>21<\/td>\n<td>406.70<\/td>\n<td>14.53<\/td>\n<td>44<\/td>\n<td>83<\/td>\n<td>3<\/td>\n<td>1.000000<\/td>\n<\/tr>\n<tr>\n<th>1<\/th>\n<td>EPL<\/td>\n<td>2008<\/td>\n<td>Manchester 
United<\/td>\n<td>31<\/td>\n<td>24.3<\/td>\n<td>20<\/td>\n<td>356.10<\/td>\n<td>11.49<\/td>\n<td>44<\/td>\n<td>90<\/td>\n<td>1<\/td>\n<td>0.875584<\/td>\n<\/tr>\n<tr>\n<th>2<\/th>\n<td>EPL<\/td>\n<td>2008<\/td>\n<td>Liverpool FC<\/td>\n<td>28<\/td>\n<td>23.9<\/td>\n<td>24<\/td>\n<td>257.23<\/td>\n<td>9.19<\/td>\n<td>50<\/td>\n<td>86<\/td>\n<td>2<\/td>\n<td>0.632481<\/td>\n<\/tr>\n<tr>\n<th>3<\/th>\n<td>EPL<\/td>\n<td>2008<\/td>\n<td>Arsenal FC<\/td>\n<td>38<\/td>\n<td>21.3<\/td>\n<td>30<\/td>\n<td>250.85<\/td>\n<td>6.6<\/td>\n<td>31<\/td>\n<td>72<\/td>\n<td>4<\/td>\n<td>0.616794<\/td>\n<\/tr>\n<tr>\n<th>4<\/th>\n<td>EPL<\/td>\n<td>2008<\/td>\n<td>Tottenham Hotspur<\/td>\n<td>35<\/td>\n<td>22.5<\/td>\n<td>18<\/td>\n<td>212.60<\/td>\n<td>6.07<\/td>\n<td>0<\/td>\n<td>51<\/td>\n<td>8<\/td>\n<td>0.522744<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>Looking good: the four teams below Chelsea do indeed have lower squad values, which show up as lower relative values.<\/p>\n<p>Let&#8217;s get a pairplot to check out the new column&#8217;s relationship with the others.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[17]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython3\">\n<pre><span class=\"n\">sns<\/span><span class=\"o\">.<\/span><span class=\"n\">pairplot<\/span><span class=\"p\">(<\/span><span class=\"n\">data<\/span><span class=\"p\">[[<\/span><span class=\"s1\">'GD'<\/span><span class=\"p\">,<\/span> <span class=\"s1\">'Squad Value'<\/span><span class=\"p\">,<\/span> <span class=\"s1\">'Relative Value'<\/span><span class=\"p\">,<\/span> <span class=\"s1\">'Points'<\/span><span 
class=\"p\">,<\/span> <span class=\"s1\">'Position'<\/span><span class=\"p\">]])<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"output_wrapper\">\n<div class=\"output\">\n<div class=\"output_area\">\n<div class=\"prompt output_prompt\">Out[17]:<\/div>\n<div class=\"output_text output_subarea output_execute_result\">\n<pre>&lt;seaborn.axisgrid.PairGrid at 0x1a27e22950&gt;<\/pre>\n<\/div>\n<\/div>\n<div class=\"output_area\">\n<div class=\"prompt\"><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter wp-image-482 size-full\" src=\"https:\/\/fcpython.com\/wp-content\/uploads\/2020\/02\/Screenshot-2020-02-23-at-18.14.29.png\" alt=\"Pairplot\" width=\"1446\" height=\"1412\" srcset=\"https:\/\/fcpython.com\/wp-content\/uploads\/2020\/02\/Screenshot-2020-02-23-at-18.14.29.png 1446w, https:\/\/fcpython.com\/wp-content\/uploads\/2020\/02\/Screenshot-2020-02-23-at-18.14.29-300x293.png 300w, https:\/\/fcpython.com\/wp-content\/uploads\/2020\/02\/Screenshot-2020-02-23-at-18.14.29-1024x1000.png 1024w, https:\/\/fcpython.com\/wp-content\/uploads\/2020\/02\/Screenshot-2020-02-23-at-18.14.29-768x750.png 768w\" sizes=\"(max-width: 1446px) 100vw, 1446px\" \/><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>Relative value looks quite similar to squad value in many of its relationships, but looks to have a stronger correlation with points and goal difference. Hopefully this will give us a more accurate model. 
Let&#8217;s create a new one in the same way as above<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[18]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython3\">\n<pre><span class=\"c1\">#Assign relevant columns to variables and reshape them<\/span>\n<span class=\"n\">X<\/span> <span class=\"o\">=<\/span> <span class=\"n\">data<\/span><span class=\"p\">[<\/span><span class=\"s1\">'Relative Value'<\/span><span class=\"p\">]<\/span>\n<span class=\"n\">y<\/span> <span class=\"o\">=<\/span> <span class=\"n\">data<\/span><span class=\"p\">[<\/span><span class=\"s1\">'Points'<\/span><span class=\"p\">]<\/span>\n<span class=\"n\">X<\/span> <span class=\"o\">=<\/span> <span class=\"n\">X<\/span><span class=\"o\">.<\/span><span class=\"n\">values<\/span><span class=\"o\">.<\/span><span class=\"n\">reshape<\/span><span class=\"p\">(<\/span><span class=\"o\">-<\/span><span class=\"mi\">1<\/span><span class=\"p\">,<\/span><span class=\"mi\">1<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">y<\/span> <span class=\"o\">=<\/span> <span class=\"n\">y<\/span><span class=\"o\">.<\/span><span class=\"n\">values<\/span><span class=\"o\">.<\/span><span class=\"n\">reshape<\/span><span class=\"p\">(<\/span><span class=\"o\">-<\/span><span class=\"mi\">1<\/span><span class=\"p\">,<\/span><span class=\"mi\">1<\/span><span class=\"p\">)<\/span>\n\n<span class=\"c1\">#Create training and test sets for each of the two variables<\/span>\n<span class=\"n\">X_train<\/span><span class=\"p\">,<\/span> <span class=\"n\">X_test<\/span><span class=\"p\">,<\/span> <span class=\"n\">y_train<\/span><span class=\"p\">,<\/span> <span class=\"n\">y_test<\/span> <span class=\"o\">=<\/span> <span class=\"n\">train_test_split<\/span><span class=\"p\">(<\/span><span class=\"n\">X<\/span><span class=\"p\">,<\/span> <span class=\"n\">y<\/span><span 
class=\"p\">,<\/span> <span class=\"n\">test_size<\/span><span class=\"o\">=<\/span><span class=\"mf\">0.25<\/span><span class=\"p\">,<\/span> <span class=\"n\">random_state<\/span><span class=\"o\">=<\/span><span class=\"mi\">101<\/span><span class=\"p\">)<\/span>\n\n<span class=\"c1\">#Create an empty model, then train it against the variables<\/span>\n<span class=\"n\">lm<\/span> <span class=\"o\">=<\/span> <span class=\"n\">LinearRegression<\/span><span class=\"p\">()<\/span>\n<span class=\"n\">lm<\/span><span class=\"o\">.<\/span><span class=\"n\">fit<\/span><span class=\"p\">(<\/span><span class=\"n\">X_train<\/span><span class=\"p\">,<\/span><span class=\"n\">y_train<\/span><span class=\"p\">)<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"output_wrapper\">\n<div class=\"output\">\n<div class=\"output_area\">\n<div class=\"prompt output_prompt\">Out[18]:<\/div>\n<div class=\"output_text output_subarea output_execute_result\">\n<pre>LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>And we&#8217;ll again look at the coefficient to see what our model tells us to expect. 
We&#8217;ll divide it by 10 to see how many extra points our model expects from increasing our squad value by 10% of the most expensive team&#8217;s value.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[19]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython3\">\n<pre><span class=\"nb\">print<\/span><span class=\"p\">(<\/span><span class=\"n\">lm<\/span><span class=\"o\">.<\/span><span class=\"n\">coef_<\/span><span class=\"o\">\/<\/span><span class=\"mi\">10<\/span><span class=\"p\">)<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"output_wrapper\">\n<div class=\"output\">\n<div class=\"output_area\">\n<div class=\"prompt\"><\/div>\n<div class=\"output_subarea output_stream output_stdout output_text\">\n<pre>[[5.31884201]]\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[37]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython3\">\n<pre><span class=\"n\">predictions<\/span> <span class=\"o\">=<\/span> <span class=\"n\">lm<\/span><span class=\"o\">.<\/span><span class=\"n\">predict<\/span><span class=\"p\">(<\/span><span class=\"n\">X_test<\/span><span class=\"p\">)<\/span>\n\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">scatter<\/span><span class=\"p\">(<\/span><span class=\"n\">X_test<\/span><span class=\"p\">,<\/span> <span class=\"n\">y_test<\/span><span class=\"p\">,<\/span>  <span class=\"n\">color<\/span><span class=\"o\">=<\/span><span class=\"s1\">'purple'<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">plot<\/span><span class=\"p\">(<\/span><span class=\"n\">X_test<\/span><span class=\"p\">,<\/span> <span class=\"n\">predictions<\/span><span 
class=\"p\">,<\/span> <span class=\"n\">color<\/span><span class=\"o\">=<\/span><span class=\"s1\">'green'<\/span><span class=\"p\">,<\/span> <span class=\"n\">linewidth<\/span><span class=\"o\">=<\/span><span class=\"mi\">3<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">title<\/span><span class=\"p\">(<\/span><span class=\"s2\">\"Relative Squad value vs points - Model Two\"<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">show<\/span><span class=\"p\">()<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"output_wrapper\">\n<div class=\"output\">\n<div class=\"output_area\">\n<div class=\"prompt\"><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-medium wp-image-483\" src=\"https:\/\/fcpython.com\/wp-content\/uploads\/2020\/02\/Screenshot-2020-02-23-at-18.14.44-300x214.png\" alt=\"Scatter Plot\" width=\"300\" height=\"214\" srcset=\"https:\/\/fcpython.com\/wp-content\/uploads\/2020\/02\/Screenshot-2020-02-23-at-18.14.44-300x214.png 300w, https:\/\/fcpython.com\/wp-content\/uploads\/2020\/02\/Screenshot-2020-02-23-at-18.14.44.png 756w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>The model predicts just over 5 points. This seems to make sense, as the difference between top and bottom would often range around 53 or so points.<\/p>\n<p>So for every 10% that you are off of the most expensive team, our model suggests that you should expect to drop 5.3 points.<\/p>\n<p>Let&#8217;s run the same tests as before to check out whether or not this new model performs better. Firstly, the same two charts &#8211; the scatter plot &amp; the distribution of the errors. 
The scatter plot looks to have more of a correlation, and the distribution is also a bit tighter, with fewer big errors.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[22]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython3\">\n<pre><span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">scatter<\/span><span class=\"p\">(<\/span><span class=\"n\">y_test<\/span><span class=\"p\">,<\/span><span class=\"n\">predictions<\/span><span class=\"p\">)<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"output_wrapper\">\n<div class=\"output\">\n<div class=\"output_area\">\n<div class=\"prompt output_prompt\">Out[22]:<\/div>\n<div class=\"output_text output_subarea output_execute_result\">\n<pre>&lt;matplotlib.collections.PathCollection at 0x1a28ae3450&gt;<\/pre>\n<\/div>\n<\/div>\n<div class=\"prompt\"><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-medium wp-image-484\" src=\"https:\/\/fcpython.com\/wp-content\/uploads\/2020\/02\/Screenshot-2020-02-23-at-18.15.39-300x214.png\" alt=\"Scatterplot Linear Regression\" width=\"300\" height=\"214\" srcset=\"https:\/\/fcpython.com\/wp-content\/uploads\/2020\/02\/Screenshot-2020-02-23-at-18.15.39-300x214.png 300w, https:\/\/fcpython.com\/wp-content\/uploads\/2020\/02\/Screenshot-2020-02-23-at-18.15.39.png 756w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[35]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython3\">\n<pre><span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">title<\/span><span class=\"p\">(<\/span><span class=\"s1\">'How many points out is each prediction?'<\/span><span 
class=\"p\">)<\/span>\n<span class=\"n\">sns<\/span><span class=\"o\">.<\/span><span class=\"n\">distplot<\/span><span class=\"p\">((<\/span><span class=\"n\">y_test<\/span><span class=\"o\">-<\/span><span class=\"n\">predictions<\/span><span class=\"p\">),<\/span><span class=\"n\">bins<\/span><span class=\"o\">=<\/span><span class=\"mi\">50<\/span><span class=\"p\">,<\/span><span class=\"n\">color<\/span><span class=\"o\">=<\/span><span class=\"s1\">'purple'<\/span><span class=\"p\">);<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"output_wrapper\">\n<div class=\"output\">\n<div class=\"output_area\">\n<div class=\"prompt\"><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-medium wp-image-481\" src=\"https:\/\/fcpython.com\/wp-content\/uploads\/2020\/02\/Screenshot-2020-02-23-at-18.13.44-300x201.png\" alt=\"Linear Regression Histogram\" width=\"300\" height=\"201\" srcset=\"https:\/\/fcpython.com\/wp-content\/uploads\/2020\/02\/Screenshot-2020-02-23-at-18.13.44-300x201.png 300w, https:\/\/fcpython.com\/wp-content\/uploads\/2020\/02\/Screenshot-2020-02-23-at-18.13.44-768x516.png 768w, https:\/\/fcpython.com\/wp-content\/uploads\/2020\/02\/Screenshot-2020-02-23-at-18.13.44-272x182.png 272w, https:\/\/fcpython.com\/wp-content\/uploads\/2020\/02\/Screenshot-2020-02-23-at-18.13.44.png 810w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>To back up the eye test, we&#8217;ll use our mean absolute error metric &#8211; the average difference between the prediction and the truth. 
Our previous MAE was 9.7&#8230;<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[24]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython3\">\n<pre><span class=\"nb\">print<\/span><span class=\"p\">(<\/span><span class=\"s1\">'MAE:'<\/span><span class=\"p\">,<\/span> <span class=\"n\">metrics<\/span><span class=\"o\">.<\/span><span class=\"n\">mean_absolute_error<\/span><span class=\"p\">(<\/span><span class=\"n\">y_test<\/span><span class=\"p\">,<\/span> <span class=\"n\">predictions<\/span><span class=\"p\">))<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"output_wrapper\">\n<div class=\"output\">\n<div class=\"output_area\">\n<div class=\"prompt\"><\/div>\n<div class=\"output_subarea output_stream output_stdout output_text\">\n<pre>MAE: 8.972066563663786\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>So that&#8217;s nearly an 8% improvement&#8230; not a gamechanger, but I think we can agree that this model makes more sense than the one before. Not only does it fit better (the correlation between predictions and reality also increased significantly), but we know from our own knowledge of football that transfer fees and market values have inflated hugely over the length of our dataset.<\/p>\n<p>There are other oddities that you will have noticed, such as the extreme outliers (Leicester 15\/16, Chelsea 15\/16, Chelsea 18\/19) and the cluster of teams around the relegation places. All of these could do with their own further analysis, but that is beyond the scope of this tutorial. 
It would make for a really interesting piece in itself if you fancy trying your hand at it!<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<h3 class=\"prompt input_prompt\">Summary<\/h3>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>That just about covers our simple linear regression 101 &#8211; let&#8217;s summarise what we learned.<\/p>\n<p>1) Simple linear regression is an approach to explaining how one variable may affect another.<\/p>\n<p>2) We built a model to see how squad value affects points.<\/p>\n<p>3) We observed what the model suggested and saw how many points an extra million spent might gain.<\/p>\n<p>4) We checked the validity of the model and saw what the average error was.<\/p>\n<p>5) We repeated the above with a new variable, relative squad value, to create an improved model, reducing the error.<\/p>\n<p>Great effort making it this far. To develop these concepts further, you may want to gather data from other leagues to see if squad value is as closely related to winning as it is here. Otherwise, with aggregated event data, you could look to see how reliable shots or passes are as goal predictors.<\/p>\n<p>As for building your stats model knowledge, have a read up on multiple linear regression, and we will look to have an article up on this topic soon!<\/p>\n<p>If you have any questions, you&#8217;ll find us on Twitter <a href=\"https:\/\/twitter.com\/FC_Python\">@fc_python<\/a>.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Linear regression allows us to model the relationship between variables. 
This might allow us to predict a future outcome if we already know some&hellip;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[61],"tags":[62,17,63,64,42,65],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v19.13 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Introduction to Simple Linear Regression in Python - FC Python<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/fcpython.com\/machine-learning\/introduction-to-simple-linear-regression-in-python\" \/>\n<meta property=\"og:locale\" content=\"en_GB\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Introduction to Simple Linear Regression in Python - FC Python\" \/>\n<meta property=\"og:description\" content=\"Linear regression allows us to model the relationship between variables. 
This might allow us to predict a future outcome if we already know some&hellip;\" \/>\n<meta property=\"og:url\" content=\"https:\/\/fcpython.com\/machine-learning\/introduction-to-simple-linear-regression-in-python\" \/>\n<meta property=\"og:site_name\" content=\"FC Python\" \/>\n<meta property=\"article:published_time\" content=\"2020-02-23T19:39:52+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2020-12-18T20:08:55+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/fcpython.com\/wp-content\/uploads\/2020\/02\/pairplot1.png\" \/>\n<meta name=\"author\" content=\"FCPythonADMIN\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@FC__Python\" \/>\n<meta name=\"twitter:site\" content=\"@FC__Python\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"FCPythonADMIN\" \/>\n\t<meta name=\"twitter:label2\" content=\"Estimated reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"12 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/fcpython.com\/machine-learning\/introduction-to-simple-linear-regression-in-python#article\",\"isPartOf\":{\"@id\":\"https:\/\/fcpython.com\/machine-learning\/introduction-to-simple-linear-regression-in-python\"},\"author\":{\"name\":\"FCPythonADMIN\",\"@id\":\"https:\/\/fcpython.com\/#\/schema\/person\/ed81e5728929acd0f3f2d9bf824a0bd0\"},\"headline\":\"Introduction to Simple Linear Regression in Python\",\"datePublished\":\"2020-02-23T19:39:52+00:00\",\"dateModified\":\"2020-12-18T20:08:55+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/fcpython.com\/machine-learning\/introduction-to-simple-linear-regression-in-python\"},\"wordCount\":1818,\"publisher\":{\"@id\":\"https:\/\/fcpython.com\/#organization\"},\"keywords\":[\"machine 
learning\",\"Pandas\",\"sklearn\",\"transfer\",\"transfermarkt\",\"visualisation\"],\"articleSection\":[\"Machine Learning\"],\"inLanguage\":\"en-GB\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/fcpython.com\/machine-learning\/introduction-to-simple-linear-regression-in-python\",\"url\":\"https:\/\/fcpython.com\/machine-learning\/introduction-to-simple-linear-regression-in-python\",\"name\":\"Introduction to Simple Linear Regression in Python - FC Python\",\"isPartOf\":{\"@id\":\"https:\/\/fcpython.com\/#website\"},\"datePublished\":\"2020-02-23T19:39:52+00:00\",\"dateModified\":\"2020-12-18T20:08:55+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/fcpython.com\/machine-learning\/introduction-to-simple-linear-regression-in-python#breadcrumb\"},\"inLanguage\":\"en-GB\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/fcpython.com\/machine-learning\/introduction-to-simple-linear-regression-in-python\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/fcpython.com\/machine-learning\/introduction-to-simple-linear-regression-in-python#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/fcpython.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Introduction to Simple Linear Regression in Python\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/fcpython.com\/#website\",\"url\":\"https:\/\/fcpython.com\/\",\"name\":\"FC Python\",\"description\":\"Learning Python through football\",\"publisher\":{\"@id\":\"https:\/\/fcpython.com\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/fcpython.com\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-GB\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/fcpython.com\/#organization\",\"name\":\"FC 
Python\",\"url\":\"https:\/\/fcpython.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-GB\",\"@id\":\"https:\/\/fcpython.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/fcpython.com\/wp-content\/uploads\/2017\/12\/Logocomp9.png\",\"contentUrl\":\"https:\/\/fcpython.com\/wp-content\/uploads\/2017\/12\/Logocomp9.png\",\"width\":981,\"height\":1049,\"caption\":\"FC Python\"},\"image\":{\"@id\":\"https:\/\/fcpython.com\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/twitter.com\/FC__Python\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/fcpython.com\/#\/schema\/person\/ed81e5728929acd0f3f2d9bf824a0bd0\",\"name\":\"FCPythonADMIN\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-GB\",\"@id\":\"https:\/\/fcpython.com\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/7a172a6f730270fc0f8bb1a8ff958895?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/7a172a6f730270fc0f8bb1a8ff958895?s=96&d=mm&r=g\",\"caption\":\"FCPythonADMIN\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Introduction to Simple Linear Regression in Python - FC Python","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/fcpython.com\/machine-learning\/introduction-to-simple-linear-regression-in-python","og_locale":"en_GB","og_type":"article","og_title":"Introduction to Simple Linear Regression in Python - FC Python","og_description":"Linear regression allows us to model the relationship between variables. 
This might allow us to predict a future outcome if we already know some&hellip;","og_url":"https:\/\/fcpython.com\/machine-learning\/introduction-to-simple-linear-regression-in-python","og_site_name":"FC Python","article_published_time":"2020-02-23T19:39:52+00:00","article_modified_time":"2020-12-18T20:08:55+00:00","og_image":[{"url":"https:\/\/fcpython.com\/wp-content\/uploads\/2020\/02\/pairplot1.png"}],"author":"FCPythonADMIN","twitter_card":"summary_large_image","twitter_creator":"@FC__Python","twitter_site":"@FC__Python","twitter_misc":{"Written by":"FCPythonADMIN","Estimated reading time":"12 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/fcpython.com\/machine-learning\/introduction-to-simple-linear-regression-in-python#article","isPartOf":{"@id":"https:\/\/fcpython.com\/machine-learning\/introduction-to-simple-linear-regression-in-python"},"author":{"name":"FCPythonADMIN","@id":"https:\/\/fcpython.com\/#\/schema\/person\/ed81e5728929acd0f3f2d9bf824a0bd0"},"headline":"Introduction to Simple Linear Regression in Python","datePublished":"2020-02-23T19:39:52+00:00","dateModified":"2020-12-18T20:08:55+00:00","mainEntityOfPage":{"@id":"https:\/\/fcpython.com\/machine-learning\/introduction-to-simple-linear-regression-in-python"},"wordCount":1818,"publisher":{"@id":"https:\/\/fcpython.com\/#organization"},"keywords":["machine learning","Pandas","sklearn","transfer","transfermarkt","visualisation"],"articleSection":["Machine Learning"],"inLanguage":"en-GB"},{"@type":"WebPage","@id":"https:\/\/fcpython.com\/machine-learning\/introduction-to-simple-linear-regression-in-python","url":"https:\/\/fcpython.com\/machine-learning\/introduction-to-simple-linear-regression-in-python","name":"Introduction to Simple Linear Regression in Python - FC 
Python","isPartOf":{"@id":"https:\/\/fcpython.com\/#website"},"datePublished":"2020-02-23T19:39:52+00:00","dateModified":"2020-12-18T20:08:55+00:00","breadcrumb":{"@id":"https:\/\/fcpython.com\/machine-learning\/introduction-to-simple-linear-regression-in-python#breadcrumb"},"inLanguage":"en-GB","potentialAction":[{"@type":"ReadAction","target":["https:\/\/fcpython.com\/machine-learning\/introduction-to-simple-linear-regression-in-python"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/fcpython.com\/machine-learning\/introduction-to-simple-linear-regression-in-python#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/fcpython.com\/"},{"@type":"ListItem","position":2,"name":"Introduction to Simple Linear Regression in Python"}]},{"@type":"WebSite","@id":"https:\/\/fcpython.com\/#website","url":"https:\/\/fcpython.com\/","name":"FC Python","description":"Learning Python through football","publisher":{"@id":"https:\/\/fcpython.com\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/fcpython.com\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-GB"},{"@type":"Organization","@id":"https:\/\/fcpython.com\/#organization","name":"FC Python","url":"https:\/\/fcpython.com\/","logo":{"@type":"ImageObject","inLanguage":"en-GB","@id":"https:\/\/fcpython.com\/#\/schema\/logo\/image\/","url":"https:\/\/fcpython.com\/wp-content\/uploads\/2017\/12\/Logocomp9.png","contentUrl":"https:\/\/fcpython.com\/wp-content\/uploads\/2017\/12\/Logocomp9.png","width":981,"height":1049,"caption":"FC 
Python"},"image":{"@id":"https:\/\/fcpython.com\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/twitter.com\/FC__Python"]},{"@type":"Person","@id":"https:\/\/fcpython.com\/#\/schema\/person\/ed81e5728929acd0f3f2d9bf824a0bd0","name":"FCPythonADMIN","image":{"@type":"ImageObject","inLanguage":"en-GB","@id":"https:\/\/fcpython.com\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/7a172a6f730270fc0f8bb1a8ff958895?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/7a172a6f730270fc0f8bb1a8ff958895?s=96&d=mm&r=g","caption":"FCPythonADMIN"}}]}},"_links":{"self":[{"href":"https:\/\/fcpython.com\/wp-json\/wp\/v2\/posts\/477"}],"collection":[{"href":"https:\/\/fcpython.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fcpython.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fcpython.com\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/fcpython.com\/wp-json\/wp\/v2\/comments?post=477"}],"version-history":[{"count":3,"href":"https:\/\/fcpython.com\/wp-json\/wp\/v2\/posts\/477\/revisions"}],"predecessor-version":[{"id":530,"href":"https:\/\/fcpython.com\/wp-json\/wp\/v2\/posts\/477\/revisions\/530"}],"wp:attachment":[{"href":"https:\/\/fcpython.com\/wp-json\/wp\/v2\/media?parent=477"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fcpython.com\/wp-json\/wp\/v2\/categories?post=477"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fcpython.com\/wp-json\/wp\/v2\/tags?post=477"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}