{"id":30873,"date":"2020-11-30T21:10:23","date_gmt":"2020-12-01T03:10:23","guid":{"rendered":"http:\/\/pzd.hmy.temporary.site\/?p=30873"},"modified":"2020-11-30T21:10:23","modified_gmt":"2020-12-01T03:10:23","slug":"introduction-for-decision-tree","status":"publish","type":"post","link":"https:\/\/datascienceplus.com\/introduction-for-decision-tree\/","title":{"rendered":"Introduction for Decision Tree"},"content":{"rendered":"<p>Decision Tree falls under supervised machine learning, as the name suggests it is a tree-like structure that helps us to make decisions based on certain conditions. A decision tree can help us to solve both regression and classification problems.<\/p>\n<h3>What is Classification?<\/h3>\n<p>Classification is the process of dividing the data into different categories or groups by giving certain labels. For Example; categorize the transaction data based on whether the transaction is Fraud or Genuine. If we take the present epidemic as an example based on the symptoms like fever, cold and cough we categorize the patient as suffering from covid or not.<\/p>\n<h3>What is Regression?<\/h3>\n<p>Regression is a process to get the predictions which is a continuous value. For example; prediction the weight or predicting the sales or profit of the company etc.<br \/>\nA gentle introduction to the Decision tree:<br \/>\nA decision tree is a graphical representation that helps us to make decisions based on certain conditions. 
<\/p>\n<p>For example, making a decision whether to watch a movie or not.<br \/>\n<a href=\"https:\/\/datascienceplus.com\/wp-content\/uploads\/2020\/11\/Picture1.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/datascienceplus.com\/wp-content\/uploads\/2020\/11\/Picture1.png\" alt=\"\" width=\"484\" height=\"178\" class=\"alignnone size-full wp-image-30876\" \/><\/a><\/p>\n<h3>Important terminology in the decision tree:<\/h3>\n<p><b>Node:<\/b><\/p>\n<p>A decision tree is made up of several nodes:<\/p>\n<p>  1. Root Node: The root node represents the entire data and is the starting point of the tree. In the above example, the<br \/>\n    first node, where we check whether the movie belongs to Hollywood or not, is the<br \/>\n    root node from which the entire tree grows.<br \/>\n  2. Leaf Node: A leaf node is an end node of the tree, which can\u2019t be split into further nodes.<br \/>\n    In the above example, \u2018Watch movie\u2019 and \u2018Don\u2019t watch\u2019 are leaf nodes.<br \/>\n  3. Parent\/Child Nodes: A node that splits into further nodes is the parent of those successor nodes, and the<br \/>\n    nodes obtained from a split are the child nodes of the node above them. <\/p>\n<p><b>Branches:<\/b><\/p>\n<p>Branches are the arrows connecting nodes; they represent the flow from the starting\/root node down to a leaf node.<\/p>\n<p>How to select an attribute to create the tree or split a node:<br \/>\nWe use splitting criteria to select the attribute that best partitions the data.<\/p>\n<p>Here are the most important and useful methods for selecting the attribute on which to split the data: <\/p>\n<p><b>Information Gain:<\/b><\/p>\n<p>To select the attribute that tells us the most about the data, we split on the attribute from which we get the highest information gain. 
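<\/p>
<p>This selection rule can be sketched in Python. The sketch below is an illustration added here, not part of this post\u2019s R code; the class counts are taken from the play-tennis dataset used later in this article (9 Yes \/ 5 No overall; Outlook splits into Rainy 3\/2, Overcast 4\/0 and Sunny 2\/3):<\/p>

```python
from math import log2

def entropy(counts):
    # Entropy = - sum p(y) * log2 p(y) over the classes in a node
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# Overall class counts and the per-value counts for the Outlook attribute
total_entropy = entropy([9, 5])                      # about 0.9403
splits = [[3, 2], [4, 0], [2, 3]]                    # Rainy, Overcast, Sunny
n = sum(sum(s) for s in splits)                      # 14 rows in total
info = sum(sum(s) / n * entropy(s) for s in splits)  # weighted entropy, about 0.6935
info_gain = total_entropy - info
print(round(info_gain, 4))  # 0.2467
```

<p>The same computation is repeated for every attribute, and the attribute with the largest gain becomes the split. 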
For calculating information gain we use the metric<br \/>\nEntropy.<\/p>\n<p>\t       Information from attribute = \u2211 p(x) \u00b7 Entropy(x)<\/p>\n<p>Here, x represents a class in the attribute.<br \/>\nInformation Gain for an attribute = Total Entropy \u2013 Information from the attribute after splitting<br \/>\nEntropy:<br \/>\nEntropy is used to measure the impurity or disorder in the dataset.<\/p>\n<p>\t\t\tEntropy = &#8211; \u2211 p(y) \u00b7 log2 p(y)<\/p>\n<p>Here, y represents a class in the target variable.<\/p>\n<p><b>Gini Index:<\/b><\/p>\n<p>The Gini Index, also called Gini Impurity, measures how often a randomly chosen element would be misclassified if it were labelled randomly according to the class distribution in the node; like entropy, a lower value indicates a purer split.<\/p>\n<h2>R code<\/h2>\n<p>The dataset that we are looking into:<br \/>\n<a href=\"https:\/\/datascienceplus.com\/wp-content\/uploads\/2020\/11\/Picture1-1.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/datascienceplus.com\/wp-content\/uploads\/2020\/11\/Picture1-1.png\" alt=\"\" width=\"481\" height=\"326\" class=\"alignnone size-full wp-image-30922\" srcset=\"https:\/\/datascienceplus.com\/wp-content\/uploads\/2020\/11\/Picture1-1.png 481w, https:\/\/datascienceplus.com\/wp-content\/uploads\/2020\/11\/Picture1-1-100x69.png 100w, https:\/\/datascienceplus.com\/wp-content\/uploads\/2020\/11\/Picture1-1-90x62.png 90w\" sizes=\"auto, (max-width: 481px) 100vw, 481px\" \/><\/a><\/p>\n<p>We are going to build a decision tree model on the above dataset to decide whether to play outside or not.<br \/>\nTo build the tree, we first have to select the attribute with the highest information gain among all the attributes.<\/p>\n<h3>Calculating Total Entropy<\/h3>\n<pre>\r\nView(dataset)\r\n## changing the data into factor type \r\ndata = data.frame(lapply(dataset,factor))\r\nsummary(data) ## summary of the data \r\n### Calculating Total Entropy\r\ntable(data$Play)\r\n## -p(Yes)*log2 p(Yes)-p(No)*log2 p(No)\r\nTotalEntropy= 
-(9\/14)*log2(9\/14)-(5\/14)*log2(5\/14)\r\nTotalEntropy\r\n<em>&gt; View(dataset)\r\n&gt; ## changing the data into factor type \r\n&gt; data = data.frame(lapply(dataset,factor))\r\n&gt; summary(data) ## summary of the data \r\n     Outlook  Temperature   Humidity     wind    Play  \r\n Overcast:4   Cold:4      High  :7   Strong:6   No :5  \r\n Rainy   :5   Hot :4      Normal:7   Weak  :8   Yes:9  \r\n Sunny   :5   Mild:6                                   \r\n&gt; ### Calculating Total Entropy\r\n&gt; table(data$Play)\r\n\r\n No Yes \r\n  5   9 \r\n&gt; ## -p(Yes)*log2 p(Yes)-p(No)*log2 p(No)\r\n&gt; TotalEntropy= -(9\/14)*log2(9\/14)-(5\/14)*log2(5\/14)\r\n&gt; TotalEntropy\r\n[1] 0.940286\r\n<\/em>\r\n<\/pre>\n<p>Calculate the entropy for each class of Outlook and the information gain for Outlook.<\/p>\n<pre>\r\ntable(data$Play)\r\n## -p(Yes)*log2 p(Yes)-p(No)*log2 p(No)\r\nTotalEntropy= -(9\/14)*log2(9\/14)-(5\/14)*log2(5\/14)\r\nTotalEntropy\r\n## filtering Outlook data to calculate entropy\r\nlibrary(dplyr)\r\n## Calculate Entropy for Outlook\r\nOutlook_Rainy = data.frame(filter(select(data,Outlook,Play),Outlook=='Rainy'))\r\n\r\nView(Outlook_Rainy)\r\nEntropy_Rainy = -(3\/5)*log2(3\/5)-(2\/5)*log2(2\/5)\r\nEntropy_Rainy\r\n\r\nOutlook_Overcast = data.frame(filter(select(data,Outlook,Play),Outlook=='Overcast'))\r\nView(Outlook_Overcast)\r\nEntropy_Overcast=-(4\/4)*log2(4\/4)-0 ## since we don't have any No values\r\nEntropy_Overcast\r\n\r\nOutlook_Sunny = data.frame(filter(select(data,Outlook,Play),Outlook=='Sunny'))\r\nView(Outlook_Sunny)\r\nEntropy_Sunny = -(2\/5)*log2(2\/5)-(3\/5)*log2(3\/5)\r\nEntropy_Sunny\r\n\r\n# calculating Information for Outlook\r\n### Info = summation(p(x)*Entropy(x))\r\n\r\nOutlook_Info = ((5\/14)*Entropy_Rainy)+((4\/14)*Entropy_Overcast)+((5\/14)*Entropy_Sunny)\r\nOutlook_Info\r\n## Information gain\r\n## Info_gain = Total Entropy - Outlook_Info\r\nInfo_gain1 = TotalEntropy - Outlook_Info\r\nInfo_gain1\r\n<em>&gt; 
table(data$Play)\r\n\r\n No Yes \r\n  5   9 \r\n&gt; ## -p(Yes)*log2 p(Yes)-p(No)*log2 p(No)\r\n&gt; TotalEntropy= -(9\/14)*log2(9\/14)-(5\/14)*log2(5\/14)\r\n&gt; TotalEntropy\r\n[1] 0.940286\r\n&gt; ## filtering Outlook data to calculate entropy\r\n&gt; library(dplyr)\r\n&gt; ## Calculate Entropy for Outlook\r\n&gt; Outlook_Rainy = data.frame(filter(select(data,Outlook,Play),Outlook=='Rainy'))\r\n&gt; View(Outlook_Rainy)\r\n&gt; Entropy_Rainy = -(3\/5)*log2(3\/5)-(2\/5)*log2(2\/5)\r\n&gt; Entropy_Rainy\r\n[1] 0.9709506\r\n&gt; Outlook_Overcast = data.frame(filter(select(data,Outlook,Play),Outlook=='Overcast'))\r\n&gt; View(Outlook_Overcast)\r\n&gt; Entropy_Overcast=-(4\/4)*log2(4\/4)-0 ## since we don't have any No values\r\n&gt; Entropy_Overcast\r\n[1] 0\r\n&gt; Outlook_Sunny = data.frame(filter(select(data,Outlook,Play),Outlook=='Sunny'))\r\n&gt; View(Outlook_Sunny)\r\n&gt; Entropy_Sunny = -(2\/5)*log2(2\/5)-(3\/5)*log2(3\/5)\r\n&gt; Entropy_Sunny\r\n[1] 0.9709506\r\n&gt; Outlook_Info = ((5\/14)*Entropy_Rainy)+((4\/14)*Entropy_Overcast)+((5\/14)*Entropy_Sunny)\r\n&gt; Outlook_Info\r\n[1] 0.6935361\r\n&gt; ## Information gain\r\n&gt; ## Info_gain = Total Entropy - Outlook_Info\r\n&gt; Info_gain1 = TotalEntropy - Outlook_Info\r\n&gt; Info_gain1\r\n[1] 0.2467498\r\n<\/em>\r\n<\/pre>\n<p>In the same way, we calculate the entropy and information gain for all the remaining columns.<\/p>\n<p><a href=\"https:\/\/datascienceplus.com\/wp-content\/uploads\/2020\/11\/Screenshot-2020-11-26-at-4.46.03-PM.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/datascienceplus.com\/wp-content\/uploads\/2020\/11\/Screenshot-2020-11-26-at-4.46.03-PM.png\" alt=\"\" width=\"479\" height=\"150\" class=\"alignnone size-full wp-image-30926\" \/><\/a><\/p>\n<p>From the above table, Outlook has the highest information gain, so the attribute at the root node is Outlook.<\/p>\n<p><a href=\"https:\/\/datascienceplus.com\/wp-content\/uploads\/2020\/11\/Screenshot-2020-11-26-at-4.47.36-PM.png\"><img 
loading=\"lazy\" decoding=\"async\" src=\"https:\/\/datascienceplus.com\/wp-content\/uploads\/2020\/11\/Screenshot-2020-11-26-at-4.47.36-PM.png\" alt=\"\" width=\"479\" height=\"257\" class=\"alignnone size-full wp-image-30927\" \/><\/a><\/p>\n<p>From the above diagram, we can observe that Overcast has only the Yes class. So, there is no need for further splitting. But for the Rainy and Sunny Contains both Yes and No. Again the same process repeats.<\/p>\n<p>Till now, we have seen the manual process<\/p>\n<p>Here is the R code for Building a Decision Tree Model using C5.0 function and plot of the Decision Tree.<\/p>\n<pre>\r\nlibrary(C50)\r\n\r\n## Syntax  C5.0(Input_Columns, Target)\r\n\r\nmodel = C5.0(data[,1:4],data$Play)\r\nplot(model)\r\n<\/pre>\n<p><a href=\"https:\/\/datascienceplus.com\/wp-content\/uploads\/2020\/11\/Picture4-1.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/datascienceplus.com\/wp-content\/uploads\/2020\/11\/Picture4-1-490x229.png\" alt=\"\" width=\"490\" height=\"229\" class=\"alignnone size-medium wp-image-30928\" srcset=\"https:\/\/datascienceplus.com\/wp-content\/uploads\/2020\/11\/Picture4-1-490x229.png 490w, https:\/\/datascienceplus.com\/wp-content\/uploads\/2020\/11\/Picture4-1-1024x479.png 1024w, https:\/\/datascienceplus.com\/wp-content\/uploads\/2020\/11\/Picture4-1-768x359.png 768w, https:\/\/datascienceplus.com\/wp-content\/uploads\/2020\/11\/Picture4-1.png 1188w\" sizes=\"auto, (max-width: 490px) 100vw, 490px\" \/><\/a><\/p>\n<p>Resource Article: <a href=\"https:\/\/www.excelr.com\/blog\/data-science\/regression\/simple-linear-regression\">simple-linear-regression<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Decision Tree falls under supervised machine learning, as the name suggests it is a tree-like structure that helps us to make decisions based on certain conditions. A decision tree can help us to solve both regression and classification problems. What is Classification? 
Classification is the process of dividing the data into different categories or groups [&hellip;]<\/p>\n","protected":false},"author":6265,"featured_media":30951,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3],"tags":[459],"class_list":["post-30873","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-basic-statistics","tag-decision-trees"],"views":11712,"_links":{"self":[{"href":"https:\/\/datascienceplus.com\/wp-json\/wp\/v2\/posts\/30873","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/datascienceplus.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/datascienceplus.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/datascienceplus.com\/wp-json\/wp\/v2\/users\/6265"}],"replies":[{"embeddable":true,"href":"https:\/\/datascienceplus.com\/wp-json\/wp\/v2\/comments?post=30873"}],"version-history":[{"count":0,"href":"https:\/\/datascienceplus.com\/wp-json\/wp\/v2\/posts\/30873\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/datascienceplus.com\/wp-json\/wp\/v2\/media\/30951"}],"wp:attachment":[{"href":"https:\/\/datascienceplus.com\/wp-json\/wp\/v2\/media?parent=30873"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/datascienceplus.com\/wp-json\/wp\/v2\/categories?post=30873"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/datascienceplus.com\/wp-json\/wp\/v2\/tags?post=30873"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}