{"id":267,"date":"2015-08-03T21:26:07","date_gmt":"2015-08-03T20:26:07","guid":{"rendered":"http:\/\/rinscience.com\/?p=267"},"modified":"2017-04-28T18:23:26","modified_gmt":"2017-04-28T22:23:26","slug":"missing-values-in-r","status":"publish","type":"post","link":"https:\/\/datascienceplus.com\/missing-values-in-r\/","title":{"rendered":"How to Deal with Missing Values in R"},"content":{"rendered":"<p>A dataset is not always complete; entries for which information is not available are called <em>missing values<\/em>. In R, missing values are coded by the symbol <code>NA<\/code>. To identify missing values in your dataset, use the function <code>is.na()<\/code>.<\/p>\n<p>First, let's create a small dataset:<\/p>\n<pre>\r\nName &lt;- c(\"John\", \"Tim\", NA)\r\nSex &lt;- c(\"men\", \"men\", \"women\")\r\nAge &lt;- c(45, 53, NA)\r\ndt &lt;- data.frame(Name, Sex, Age)\r\n<\/pre>\n<p>Here is our dataset, called <code>dt<\/code>:<\/p>\n<pre>\r\ndt\r\n<em>Name\u00a0\u00a0 Sex Age\r\n1 John\u00a0\u00a0 men\u00a0 45\r\n2\u00a0 Tim\u00a0\u00a0 men\u00a0 53\r\n3\u00a0 &lt;NA&gt; women\u00a0 NA<\/em>\r\n<\/pre>\n<p>Now let's check for missing values in the dataset:<\/p>\n<pre>\r\nis.na(dt)\r\n<em>Name\u00a0\u00a0\u00a0 Sex\u00a0\u00a0 Age\r\nFALSE FALSE FALSE\r\nFALSE FALSE FALSE\r\nTRUE\u00a0 FALSE\u00a0 TRUE<\/em>\r\n<\/pre>\n<p>You can also find the <em>sum<\/em> (the total count) and the <em>proportion<\/em> of missing values in your dataset with the code below:<\/p>\n<pre>\r\nsum(is.na(dt))\r\nmean(is.na(dt))\r\n<em>2\r\n0.2222222<\/em>\r\n<\/pre>\n<p>When you import a dataset from another statistical application, missing values might be coded with a number, for example <code>99<\/code>. To let R know that this number represents a missing value, you need to recode it:<\/p>\n<pre>\r\ndt$Age[dt$Age == 99] &lt;- NA\r\n<\/pre>\n<p>Another useful function for dealing with missing values is <code>na.omit()<\/code>, which deletes incomplete observations, that is, rows containing at least one <code>NA<\/code>. 
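<\/p>\n<p>As a minimal sketch of what \"incomplete\" means here, base R's <code>complete.cases()<\/code> returns one logical value per row, <code>TRUE<\/code> when the row contains no <code>NA<\/code>. Applied to the <code>dt<\/code> created above (assuming it has not been modified since), it flags the rows that <code>na.omit()<\/code> would keep:<\/p>\n<pre>\r\n# TRUE marks rows with no missing values\r\ncomplete.cases(dt)\r\n# Subsetting on it keeps the same rows as na.omit(dt)\r\ndt[complete.cases(dt), ]\r\n<\/pre>\n<p>The logical vector also makes it easy to count the incomplete rows before dropping them, for example with <code>sum(!complete.cases(dt))<\/code>. 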
<\/p>\n<p>Let's look at another example, first creating another small dataset:<\/p>\n<pre>\r\nName &lt;- c(\"John\", \"Tim\", NA)\r\nSex &lt;- c(\"men\", NA, \"women\")\r\nAge &lt;- c(45, 53, NA)\r\ndt &lt;- data.frame(Name, Sex, Age)\r\n<\/pre>\n<p>Here is the dataset, again called <code>dt<\/code>:<\/p>\n<pre>\r\ndt\r\n<em>Name Sex Age\r\nJohn men  45\r\nTim  &lt;NA&gt;  53\r\n&lt;NA&gt; women NA\r\n<\/em><\/pre>\n<p>Now we will use the function to remove the rows with missing values:<\/p>\n<pre>\r\nna.omit(dt)\r\n<em>Name Sex Age\r\nJohn men  45<\/em><\/pre>\n<p>This was an introduction to dealing with missing values. To learn how to impute missing data, please read <a href=\"https:\/\/datascienceplus.com\/imputing-missing-data-with-r-mice-package\/\">this post<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A dataset is not always complete; entries for which information is not available are called missing values. In R, missing values are coded by the symbol NA. To identify missing values in your dataset, use the function is.na(). 
First, let&#8217;s create a small dataset: Name &lt;- c(&#8220;John&#8221;, &#8220;Tim&#8221;, NA) Sex &lt;- c(&#8220;men&#8221;, [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":272,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4],"tags":[37,232,46],"class_list":["post-267","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-management","tag-missing-values","tag-rstats","tag-tips-tricks"],"views":241835,"_links":{"self":[{"href":"https:\/\/datascienceplus.com\/wp-json\/wp\/v2\/posts\/267","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/datascienceplus.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/datascienceplus.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/datascienceplus.com\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/datascienceplus.com\/wp-json\/wp\/v2\/comments?post=267"}],"version-history":[{"count":0,"href":"https:\/\/datascienceplus.com\/wp-json\/wp\/v2\/posts\/267\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/datascienceplus.com\/wp-json\/wp\/v2\/media\/272"}],"wp:attachment":[{"href":"https:\/\/datascienceplus.com\/wp-json\/wp\/v2\/media?parent=267"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/datascienceplus.com\/wp-json\/wp\/v2\/categories?post=267"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/datascienceplus.com\/wp-json\/wp\/v2\/tags?post=267"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}