{"id":535,"date":"2021-06-08T01:27:39","date_gmt":"2021-06-08T01:27:39","guid":{"rendered":"https:\/\/practicalsecurityanalytics.com\/?p=535"},"modified":"2024-05-05T15:28:17","modified_gmt":"2024-05-05T15:28:17","slug":"pe-malware-machine-learning-dataset","status":"publish","type":"post","link":"https:\/\/practicalsecurityanalytics.com\/pe-malware-machine-learning-dataset\/","title":{"rendered":"PE Malware Machine Learning Dataset"},"content":{"rendered":"\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_82_2 ez-toc-wrap-left counter-hierarchy ez-toc-counter ez-toc-custom ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">Table of Contents<\/p>\n<span class=\"ez-toc-title-toggle\"><a href=\"#\" class=\"ez-toc-pull-right ez-toc-btn ez-toc-btn-xs ez-toc-btn-default ez-toc-toggle\" aria-label=\"Toggle Table of Content\"><span class=\"ez-toc-js-icon-con\"><span class=\"\"><span class=\"eztoc-hide\" style=\"display:none;\">Toggle<\/span><span class=\"ez-toc-icon-toggle-span\"><svg style=\"fill: #000000;color:#000000\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" class=\"list-377408\" width=\"20px\" height=\"20px\" viewBox=\"0 0 24 24\" fill=\"none\"><path d=\"M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z\" fill=\"currentColor\"><\/path><\/svg><svg style=\"fill: #000000;color:#000000\" class=\"arrow-unsorted-368013\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"10px\" height=\"10px\" viewBox=\"0 0 24 24\" version=\"1.2\" baseProfile=\"tiny\"><path d=\"M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z\"\/><\/svg><\/span><\/span><\/span><\/a><\/span><\/div>\n<nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link 
ez-toc-heading-1\" href=\"https:\/\/practicalsecurityanalytics.com\/pe-malware-machine-learning-dataset\/#Download\" >Download<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/practicalsecurityanalytics.com\/pe-malware-machine-learning-dataset\/#Terms_of_Use\" >Terms of Use<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/practicalsecurityanalytics.com\/pe-malware-machine-learning-dataset\/#Purpose\" >Purpose<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/practicalsecurityanalytics.com\/pe-malware-machine-learning-dataset\/#About_the_Dataset\" >About the Dataset<\/a><ul class='ez-toc-list-level-4' ><li class='ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/practicalsecurityanalytics.com\/pe-malware-machine-learning-dataset\/#Statistics\" >Statistics<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/practicalsecurityanalytics.com\/pe-malware-machine-learning-dataset\/#Layout\" >Layout<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/practicalsecurityanalytics.com\/pe-malware-machine-learning-dataset\/#Labels\" >Labels<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-8\" href=\"https:\/\/practicalsecurityanalytics.com\/pe-malware-machine-learning-dataset\/#Sources\" >Sources<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-9\" href=\"https:\/\/practicalsecurityanalytics.com\/pe-malware-machine-learning-dataset\/#Potential_Biases\" >Potential Biases<\/a><\/li><\/ul><\/li><\/ul><\/nav><\/div>\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Download\"><\/span>Download<span 
class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p class=\"\"><strong><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-luminous-vivid-amber-color\">WARNING! <\/mark><\/strong>The link will download an encrypted zip file that contains real malicious samples. Handle the contents with care, and use them only for legitimate purposes. I have removed the file extensions from all of the samples in order to prevent accidental execution; however, I still highly recommend opening the archive in a sandboxed environment. As an additional precaution, you should also change the permissions of the folder to deny &#8220;Execute&#8221; permissions to all files in the folder. Conducting your analysis on a non-Windows operating system will also help reduce risk.<\/p>\n\n\n\n<p class=\"\"><strong>OneDrive<\/strong>: <a href=\"https:\/\/1drv.ms\/u\/s!AsaC1RPcfUL1oB5qbdWOm-PIk2jX?e=mOlo6J\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/1drv.ms\/u\/s!AsaC1RPcfUL1oB5qbdWOm-PIk2jX?e=mOlo6J<\/a><br><strong>Password:<\/strong> infected<br>SHA256: A8B02407A1F8C77DD9DCCC229503A4F668083271EDCBB0289D53C28EBF51215E<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Terms_of_Use\"><\/span>Terms of Use<span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p class=\"\">If you use this dataset, please adhere to the following rules:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li class=\"\">Do not use the files for malicious purposes.<\/li>\n\n\n\n<li class=\"\">Let me know via comment, email, or tweet if you found this dataset useful.<\/li>\n\n\n\n<li class=\"\">Cite me as a source in any academic paper that leverages this dataset using the following contact information:\n<ul class=\"wp-block-list\">\n<li class=\"\"><strong>Name:<\/strong> Michael Lester<\/li>\n\n\n\n<li class=\"\"><strong>Email:<\/strong> michael.lester.main@gmail.com<\/li>\n\n\n\n<li class=\"\"><strong>Website:<\/strong> <a 
href=\"https:\/\/practicalsecurityanalytics.com\/\">https:\/\/www.practicalsecurityanalytics.com<\/a><\/li>\n<\/ul>\n<\/li>\n\n\n\n<li class=\"\">Provide feedback or recommendations on how to improve the dataset.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Purpose\"><\/span>Purpose<span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p class=\"\">The purpose of this dataset is to provide raw, labeled portable executables to security and AI researchers in order to improve cybersecurity in the industry. Many of the datasets that I have seen (such as this <a rel=\"noreferrer noopener\" href=\"https:\/\/www.kaggle.com\/c\/microsoft-malware-prediction\/data\" target=\"_blank\">dataset<\/a> from a Microsoft-sponsored Kaggle competition) do not provide the raw binary files themselves, but rather metadata that has already been extracted from the samples. This prevents a lot of potential learning that can come from exploring other features that could be extracted from the raw samples themselves.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"About_the_Dataset\"><\/span>About the Dataset<span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<h4 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Statistics\"><\/span>Statistics<span class=\"ez-toc-section-end\"><\/span><\/h4>\n\n\n\n<figure class=\"is-style-regular wp-block-table\"><table><tbody><tr><td><strong>Samples<\/strong><\/td><td>201,549<\/td><\/tr><tr><td><strong>Legitimate<\/strong><\/td><td>86,812<\/td><\/tr><tr><td><strong>Malicious<\/strong><\/td><td>114,737<\/td><\/tr><tr><td><strong>Compressed Size<\/strong><\/td><td>43.8 GB<\/td><\/tr><tr><td><strong>Uncompressed Size<\/strong><\/td><td>117 GB<\/td><\/tr><tr><td><strong>File Types<\/strong><\/td><td>All are Portable Executable files. Most are user-mode Portable Executable files (e.g. 
.exe, .dll, .scr).<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Layout\"><\/span>Layout<span class=\"ez-toc-section-end\"><\/span><\/h4>\n\n\n\n<p class=\"\">The dataset has the following folder structure:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\">samples\n<ul class=\"wp-block-list\">\n<li class=\"\">1<\/li>\n\n\n\n<li class=\"\">2<\/li>\n\n\n\n<li class=\"\">3<\/li>\n\n\n\n<li class=\"\">&#8230;<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li class=\"\">samples.csv<\/li>\n<\/ul>\n\n\n\n<p class=\"\">Each file in the &#8220;samples&#8221; folder is named after the value of the id field of its corresponding entry in the samples.csv file. The samples.csv file contains the labels for each of the samples in the samples folder.<\/p>\n\n\n\n<p class=\"\"><strong>Note: The extension has been removed from all the files in the samples directory in order to prevent accidental execution. In most cases, the extension would have to be restored manually in order to get the malware to execute properly. 
The proper extension can be determined by parsing the PE header.<\/strong><\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Labels\"><\/span>Labels<span class=\"ez-toc-section-end\"><\/span><\/h4>\n\n\n\n<p class=\"\">Each entry in the samples.csv file contains the following metadata fields:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><strong>Field<\/strong><\/td><td><strong>Description<\/strong><\/td><td><strong>Example<\/strong><\/td><\/tr><tr><td>id<\/td><td>The identifier for the sample that corresponds to the name of the file in the samples directory.<\/td><td>5<\/td><\/tr><tr><td>md5<\/td><td>The MD5 hash of the file.<\/td><td>ad27f1a72dda61d1659810c406f37ab8<\/td><\/tr><tr><td>sha1<\/td><td>The SHA1 hash of the file.<\/td><td>f8fd630c880257c7e74c1f87929993477453d989<\/td><\/tr><tr><td>sha256<\/td><td>The SHA256 hash of the file.<\/td><td>984d732c9f32197232918f2fce0aa9cedc1011d93e32acb4ad01e13f2f76d599<\/td><\/tr><tr><td>total<\/td><td>The total number of antivirus engines that scanned this file at the time of the query.<\/td><td>67<\/td><\/tr><tr><td>positives<\/td><td>The number of antivirus engines that flagged this file as malicious at the time of the query.<\/td><td>0<\/td><\/tr><tr><td>list<\/td><td>Either blacklist or whitelist, indicating whether the file is malicious or legitimate, respectively.<\/td><td>Whitelist<\/td><\/tr><tr><td>filetype<\/td><td>This field will always be exe for this dataset.<\/td><td>exe<\/td><\/tr><tr><td>submitted<\/td><td>The date that the sample was entered into my database.<\/td><td>6\/24\/2018 4:18:38 PM<\/td><\/tr><tr><td>user_id<\/td><td>Redacted.<\/td><td>1<\/td><\/tr><tr><td>length<\/td><td>The length of the file in bytes.<\/td><td>211,456<\/td><\/tr><tr><td>entropy<\/td><td>The Shannon entropy of the file. 
The values will range from 0 to 8.<\/td><td>2.231824<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Sources\"><\/span>Sources<span class=\"ez-toc-section-end\"><\/span><\/h4>\n\n\n\n<p class=\"\">Malicious samples in the dataset come primarily from the sources linked below.<\/p>\n\n\n\n<ul id=\"block-0f7e3031-d87a-45c9-b36f-efdf7c43f25c\" class=\"wp-block-list\">\n<li class=\"\"><a rel=\"noreferrer noopener\" href=\"https:\/\/virusshare.com\/\" target=\"_blank\">VirusShare<\/a><\/li>\n\n\n\n<li class=\"\"><a rel=\"noreferrer noopener\" href=\"https:\/\/malshare.com\/\" target=\"_blank\">MalShare<\/a><\/li>\n\n\n\n<li class=\"\"><a rel=\"noreferrer noopener\" href=\"https:\/\/github.com\/ytisf\/theZoo\" target=\"_blank\">TheZoo<\/a><\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Potential_Biases\"><\/span>Potential Biases<span class=\"ez-toc-section-end\"><\/span><\/h4>\n\n\n\n<p class=\"\">The majority of the samples came from easy-to-acquire locations. Large numbers of samples from very similar malware families tend to dominate the dataset. While the dataset does contain samples of more sophisticated malware from Advanced Persistent Threat actors, there are far fewer of those samples than there are of generic adware, spyware, and ransomware. As a result, the dataset may not be reflective of malware used in actual intrusions. Models trained on the dataset may or may not generalize to more advanced malware.<\/p>\n\n\n\n<p class=\"\">The majority of legitimate files came from instances of various versions of Windows 7 and above with a variety of different software downloaded and installed. There were only so many applications that I downloaded and tested. Other legitimate files came from the sources listed above but turned out to be false positives. 
This may give a particular bias towards Microsoft-produced software, as those binaries dominate the legitimate portion of the dataset.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The purpose of this dataset is to provide raw labeled portable executables to security and AI researchers in order to improve cyber security in the industry. Many of the datasets that I have seen (such as this dataset from a Microsoft-sponsored Kaggle competition) do not provide the raw binary files themselves, but rather metadata that has already been pre-extracted from the samples. This prevents a lot of potential learning that can come from exploring other features that could be extracted from the raw samples themselves.<\/p>\n","protected":false},"author":2,"featured_media":540,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"advgb_blocks_editor_width":"","advgb_blocks_columns_visual_guide":"","_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","ast-disable-related-posts":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"set","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[2,5],"tags":[11,10,6],"class_list":["post-535","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-blog-posts","category-executable-features-series","tag-download","tag-machine-learning","tag-malware"],"author_meta":{"display_name":"pracsec","author_link":"https:\/\/practicalsecurityanalytics.com\/author\/michael-lester-main\/"},"featured_img":"https:\/\/i0.wp.com\/practicalsecurityanalytics.com\/wp-content\/uploads\/2021\/06\/national-cancer-institute-fd0b-Bl4cFc-unsplash.jpg?fit=300%2C200&quality=100&ssl=1","jetpack_publicize_connections":[],"jetpack_featured_media_url":"https:\/\/i0.wp.com\/practicalsecurityanalytics.com\/wp-content\/uploads\/2021\/06\/national-cancer-institute-fd0b-Bl4cFc-unsplash.jpg?fit=1920%2C1280&quality=100&ssl=1","coauthors":[],"tax_additional":{"categories":{"linked":["<a href=\"https:\/\/practicalsecurityanalytics.com\/category\/blog-posts\/\" class=\"advgb-post-tax-term\">Blog Posts<\/a>","<a href=\"https:\/\/practicalsecurityanalytics.com\/category\/blog-posts\/executable-features-series\/\" class=\"advgb-post-tax-term\">Executable Features Series<\/a>"],"unlinked":["<span class=\"advgb-post-tax-term\">Blog Posts<\/span>","<span class=\"advgb-post-tax-term\">Executable Features 
Series<\/span>"]},"tags":{"linked":["<a href=\"https:\/\/practicalsecurityanalytics.com\/category\/blog-posts\/executable-features-series\/\" class=\"advgb-post-tax-term\">download<\/a>","<a href=\"https:\/\/practicalsecurityanalytics.com\/category\/blog-posts\/executable-features-series\/\" class=\"advgb-post-tax-term\">machine learning<\/a>","<a href=\"https:\/\/practicalsecurityanalytics.com\/category\/blog-posts\/executable-features-series\/\" class=\"advgb-post-tax-term\">malware<\/a>"],"unlinked":["<span class=\"advgb-post-tax-term\">download<\/span>","<span class=\"advgb-post-tax-term\">machine learning<\/span>","<span class=\"advgb-post-tax-term\">malware<\/span>"]}},"comment_count":"10","relative_dates":{"created":"Posted 5 years ago","modified":"Updated 2 years ago"},"absolute_dates":{"created":"Posted on June 8, 2021","modified":"Updated on May 5, 2024"},"absolute_dates_time":{"created":"Posted on June 8, 2021 1:27 am","modified":"Updated on May 5, 2024 3:28 pm"},"featured_img_caption":"","series_order":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/pbnFRW-8D","_links":{"self":[{"href":"https:\/\/practicalsecurityanalytics.com\/wp-json\/wp\/v2\/posts\/535","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/practicalsecurityanalytics.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/practicalsecurityanalytics.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/practicalsecurityanalytics.com\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/practicalsecurityanalytics.com\/wp-json\/wp\/v2\/comments?post=535"}],"version-history":[{"count":5,"href":"https:\/\/practicalsecurityanalytics.com\/wp-json\/wp\/v2\/posts\/535\/revisions"}],"predecessor-version":[{"id":2368,"href":"https:\/\/practicalsecurityanalytics.com\/wp-json\/wp\/v2\/posts\/535\/revisions\/2368"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/practicalsecurityanalytics.com\/wp-
json\/wp\/v2\/media\/540"}],"wp:attachment":[{"href":"https:\/\/practicalsecurityanalytics.com\/wp-json\/wp\/v2\/media?parent=535"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/practicalsecurityanalytics.com\/wp-json\/wp\/v2\/categories?post=535"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/practicalsecurityanalytics.com\/wp-json\/wp\/v2\/tags?post=535"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}