{"id":1752,"date":"2024-08-09T23:35:09","date_gmt":"2024-08-09T18:05:09","guid":{"rendered":"https:\/\/geekpython.in\/?p=1752"},"modified":"2024-08-09T23:35:10","modified_gmt":"2024-08-09T18:05:10","slug":"copy-on-write-in-pandas","status":"publish","type":"post","link":"https:\/\/geekpython.in\/copy-on-write-in-pandas","title":{"rendered":"Efficiently Manage Memory Usage in Pandas with Large Datasets"},"content":{"rendered":"\n<p>Pandas supports Copy-on-Write, an optimization technique that helps reduce memory use, particularly when working with large datasets.<\/p>\n\n\n\n<p>Starting with pandas 2.0, Copy-on-Write (CoW) is available as an opt-in mode, although it has not been fully implemented yet. Most of the optimizations that Copy-on-Write makes possible are already supported.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Aim of Copy-on-Write<\/h2>\n\n\n\n<p>As the name suggests, the data is copied only when it is modified. What does this mean?<\/p>\n\n\n\n<p>When a new DataFrame or Series is derived from an existing one, it initially shares the same memory as the original rather than creating a copy. 
When the data of either the original or the new DataFrame is modified, a new copy of the data is created for the object that is being modified.<\/p>\n\n\n\n<p>This avoids unnecessary copies, which reduces memory usage and improves performance when working with large datasets.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Enabling CoW in Pandas<\/h2>\n\n\n\n<p>CoW is not enabled by default in pandas 2.x, so we need to enable it using the <code>mode.copy_on_write<\/code> configuration option.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \" >import pandas as pd\n\n# Option 1\npd.options.mode.copy_on_write = True\n# Option 2\npd.set_option(\"mode.copy_on_write\", True)<\/pre><\/div>\n\n\n\n<p>You can use either option to turn on CoW globally in your environment.<\/p>\n\n\n\n<p><strong>Note: CoW will be enabled by default in Pandas 3.0, so get used to it early on.<\/strong><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Impact of CoW in Pandas<\/h2>\n\n\n\n<p>With CoW enabled, a single statement can never update more than one pandas object at a time. 
Here&#8217;s an example.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \" >import pandas as pd\ndf = pd.DataFrame({\"A\": [1, 2, 3], \"B\": [4, 5, 6]})\nsubset = df[\"A\"]\nsubset.iloc[0] = 10\ndf<\/pre><\/div>\n\n\n\n<p>With CoW, the above snippet does not modify <code>df<\/code>; it modifies only the data of <code>subset<\/code>.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:tex decode:true \" ># df\n\tA\tB\n0\t1\t4\n1\t2\t5\n2\t3\t6\n\n# subset\n\tA\n0\t10\n1\t2\n2\t3<\/pre><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">inplace Operations will Not Work<\/h3>\n\n\n\n<p>Similarly, an <code>inplace<\/code> operation on a column selected from <code>df<\/code> will not work with CoW enabled, because it can no longer modify the original <code>df<\/code> directly.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \" >df = pd.DataFrame({\"A\": [1, 2, 3], \"B\": [4, 5, 6]})\ndf[\"A\"].replace(1, 5, inplace=True)\ndf\n--------------------\n\tA\tB\n0\t1\t4\n1\t2\t5\n2\t3\t6<\/pre><\/div>\n\n\n\n<p>We can see that <code>df<\/code> remains unchanged, and pandas additionally emits a <code>ChainedAssignmentError<\/code> warning.<\/p>\n\n\n\n<p>The above operation can be performed correctly in two different ways. 
One is to avoid <code>inplace<\/code> and assign the result back to the column; the other is to call the <code>inplace<\/code> operation at the <code>DataFrame<\/code> level, which modifies the original <code>df<\/code> directly.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \" ># Avoid inplace\ndf = pd.DataFrame({\"A\": [1, 2, 3], \"B\": [4, 5, 6]})\ndf[\"A\"] = df[\"A\"].replace(1, 5)\ndf\n--------------------\n    A\tB\n0\t5\t4\n1\t2\t5\n2\t3\t6<\/pre><\/div>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \" ># Using inplace at DataFrame level\ndf = pd.DataFrame({\"A\": [1, 2, 3], \"B\": [4, 5, 6]})\ndf.replace({\"A\": {2: 34}}, inplace=True)\ndf\n--------------------\n    A\tB\n0\t1\t4\n1\t34\t5\n2\t3\t6<\/pre><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">Chained Assignment will Never Work<\/h3>\n\n\n\n<p>Modifying a DataFrame or Series through multiple indexing operations in a single line of code is called chained assignment.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \" ># CoW disabled\nwith pd.option_context(\"mode.copy_on_write\", False):\n    df = pd.DataFrame({\"A\": [1, 2, 3, 4], \"B\": [5, 6, 7, 8]})\n    df[\"B\"][df[\"A\"] &gt; 2] = 10\ndf<\/pre><\/div>\n\n\n\n<p>The above snippet tries to update column <code>B<\/code> of the original <code>df<\/code> in the rows where column <code>A<\/code> is greater than 2. 
This means the values at index 2 and 3 in column <code>B<\/code> will be modified.<\/p>\n\n\n\n<p>Since CoW is disabled, this operation is allowed, and the original <code>df<\/code> is modified.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:tex decode:true \" >    A\tB\n0\t1\t5\n1\t2\t6\n2\t3\t10\n3\t4\t10<\/pre><\/div>\n\n\n\n<p>However, this never works with CoW enabled in pandas; the assignment only reaches a temporary object, so <code>df<\/code> stays unchanged.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \" ># CoW enabled\ndf = pd.DataFrame({\"A\": [1, 2, 3, 4], \"B\": [5, 6, 7, 8]})\ndf[\"B\"][df[\"A\"] &gt; 2] = 10\ndf\n--------------------\n    A\tB\n0\t1\t5\n1\t2\t6\n2\t3\t7\n3\t4\t8<\/pre><\/div>\n\n\n\n<p>Instead, with Copy-on-Write, we can use <code>.loc<\/code> to select the rows and the column in a single indexing step and modify <code>df<\/code> directly.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \" ># CoW enabled\ndf = pd.DataFrame({\"A\": [1, 2, 3, 4], \"B\": [5, 6, 7, 8]})\ndf.loc[(df[\"A\"] == 1) | (df[\"A\"] &gt; 3), \"B\"] = 100\ndf<\/pre><\/div>\n\n\n\n<p>This modifies column <code>B<\/code> in the rows where column <code>A<\/code> is either 1 or greater than 3. 
The original <code>df<\/code> will look like the following.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:tex decode:true \" >\tA\tB\n0\t1\t100\n1\t2\t6\n2\t3\t7\n3\t4\t100<\/pre><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">Read-only Arrays<\/h3>\n\n\n\n<p>When a Series or DataFrame is accessed as a NumPy array, that array is read-only if it shares its data with the initial DataFrame or Series.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \" >df = pd.DataFrame({\"A\": [1, 2, 3, 4], \"B\": ['5', '6', '7', '8']})\narr = df.to_numpy()\narr\n--------------------\narray([[1, '5'],\n       [2, '6'],\n       [3, '7'],\n       [4, '8']], dtype=object)<\/pre><\/div>\n\n\n\n<p>In the above code snippet, <code>arr<\/code> is a copy because <code>df<\/code> holds columns of two different data types (<code>int<\/code> and <code>str<\/code>), which cannot be exposed as a single shared array. We can therefore modify <code>arr<\/code>.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \" >arr[1, 0] = 10\narr\n--------------------\narray([[1, '5'],\n       [10, '6'],\n       [3, '7'],\n       [4, '8']], dtype=object)<\/pre><\/div>\n\n\n\n<p>Now take a look at this case.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \" >df = pd.DataFrame({\"A\": [1, 2, 3, 4], \"B\": [5, 6, 7, 8]})\narr = df.to_numpy()\narr<\/pre><\/div>\n\n\n\n<p>Here, both columns of <code>df<\/code> have the same data type and are backed by a single NumPy array, so <code>arr<\/code> shares its data with <code>df<\/code>. 
This means <code>arr<\/code> will be read-only and cannot be modified in place.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \" >print(arr.flags.writeable)\narr[0, 0] = 10\narr\n--------------------\nFalse\nValueError: assignment destination is read-only<\/pre><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">Lazy Copy Mechanism<\/h3>\n\n\n\n<p>When two or more DataFrames share the same data, no copy is created immediately.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \" >df = pd.DataFrame({\"A\": [1, 2, 3], \"B\": [4, 5, 6]})\ndf2 = df.reset_index(drop=True)<\/pre><\/div>\n\n\n\n<p>Both <code>df<\/code> and <code>df2<\/code> refer to the same data in memory. The copy is triggered only when either DataFrame is modified.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \" >df2.iloc[0, 0] = 10\nprint(df2)\nprint(df)\n--------------------\n    A  B\n0  10  4\n1   2  5\n2   3  6\n   A  B\n0  1  4\n1  2  5\n2  3  6<\/pre><\/div>\n\n\n\n<p>But this copy is not always necessary. If we no longer need the initial <code>df<\/code>, we can simply reassign the result to the same variable (<code>df<\/code>), which creates a new object and drops the reference to the old one. 
This avoids the copy: since nothing else references the original data any more, the later modification does not trigger Copy-on-Write.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \" >df = pd.DataFrame({\"A\": [1, 2, 3], \"B\": [4, 5, 6]})\nprint(\"Initial reference:\", id(df))\ndf = df.reset_index(drop=True)\nprint(\"New reference:\", id(df))\ndf.iloc[0, 0] = 10\nprint(df)\n--------------------\nInitial reference: 138400246865760\nNew reference: 138400246860336\n    A  B\n0  10  4\n1   2  5\n2   3  6<\/pre><\/div>\n\n\n\n<p>The same lazy copy optimization is applied to the methods that don&#8217;t require a copy of the original data.<\/p>\n\n\n\n<p><strong>DataFrame.rename()<\/strong><\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \" >df = pd.DataFrame({\"A\": [1, 2, 3], \"B\": [4, 5, 6]})\ndf.rename(columns={\"A\": \"X\", \"B\": \"Y\"})\n--------------------\n\tX\tY\n0\t1\t4\n1\t2\t5\n2\t3\t6<\/pre><\/div>\n\n\n\n<p>When CoW is enabled, this method returns a new DataFrame that shares its data with the original instead of eagerly copying all of the data, as it does without CoW.<\/p>\n\n\n\n<p><strong>DataFrame.drop() for axis=1<\/strong><\/p>\n\n\n\n<p>The same mechanism is implemented for <code>DataFrame.drop()<\/code> with <code>axis=1<\/code> (<code>axis='columns'<\/code>).<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \" >df = pd.DataFrame({\"A\": [1, 2, 3], \"B\": [4, 5, 6], \"C\": [7, 8, 9]})\ndf.drop([\"A\"], axis=1)\n--------------------\n\tB\tC\n0\t4\t7\n1\t5\t8\n2\t6\t9<\/pre><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>Pandas will enable Copy-on-Write (CoW) by default in version 3.0. 
The optimizations that CoW makes possible lead to more efficient memory and resource management when working with large datasets.<\/p>\n\n\n\n<p>CoW also reduces unpredictable or inconsistent behavior and improves performance.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>\ud83c\udfc6<strong>Other articles you might be interested in if you liked this one<\/strong><\/p>\n\n\n\n<p>\u2705<a target=\"_blank\" rel=\"noreferrer noopener\" href=\"https:\/\/geekpython.in\/multiple-datasets-integration-using-pandas\">Merge, combine, and concatenate multiple datasets using pandas<\/a>.<\/p>\n\n\n\n<p>\u2705<a target=\"_blank\" rel=\"noreferrer noopener\" href=\"https:\/\/geekpython.in\/find-and-delete-duplicate-rows-from-dataset\">Find and delete duplicate rows from the dataset using pandas<\/a>.<\/p>\n\n\n\n<p>\u2705<a target=\"_blank\" rel=\"noreferrer noopener\" href=\"https:\/\/geekpython.in\/find-and-delete-mismatched-columns-from-dataframes-using-pandas\">Find and Delete Mismatched Columns From DataFrames Using pandas<\/a>.<\/p>\n\n\n\n<p>\u2705<a target=\"_blank\" rel=\"noreferrer noopener\" href=\"https:\/\/geekpython.in\/tempfile-in-python\">Create temporary files and directories using tempfile module in Python<\/a>.<\/p>\n\n\n\n<p>\u2705<a target=\"_blank\" rel=\"noreferrer noopener\" href=\"https:\/\/geekpython.in\/render-images-from-flask\">Upload and display images on the frontend using Flask<\/a>.<\/p>\n\n\n\n<p>\u2705<a target=\"_blank\" rel=\"noreferrer noopener\" href=\"https:\/\/geekpython.in\/impact-of-learning-rates-on-ml-and-dl-models\">How does the learning rate affect the ML and DL models<\/a>?<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p><strong>That&#8217;s all for now<\/strong><\/p>\n\n\n\n<p><strong>Keep Coding\u270c\u270c<\/strong><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Pandas supports Copy-on-Write, an optimization technique that helps 
improve memory use, particularly when working with large datasets. Starting from version 2.0 of Pandas, the Copy-on-Write (CoW) has taken effect but has not been fully implemented. Most of the optimizations that are possible through Copy-on-Write are supported. Aim of Copy-on-Write As the name suggests, the data [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":1754,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"ocean_post_layout":"","ocean_both_sidebars_style":"","ocean_both_sidebars_content_width":0,"ocean_both_sidebars_sidebars_width":0,"ocean_sidebar":"","ocean_second_sidebar":"","ocean_disable_margins":"enable","ocean_add_body_class":"","ocean_shortcode_before_top_bar":"","ocean_shortcode_after_top_bar":"","ocean_shortcode_before_header":"","ocean_shortcode_after_header":"","ocean_has_shortcode":"","ocean_shortcode_after_title":"","ocean_shortcode_before_footer_widgets":"","ocean_shortcode_after_footer_widgets":"","ocean_shortcode_before_footer_bottom":"","ocean_shortcode_after_footer_bottom":"","ocean_display_top_bar":"default","ocean_display_header":"default","ocean_header_style":"","ocean_center_header_left_menu":"","ocean_custom_header_template":"","ocean_custom_logo":0,"ocean_custom_retina_logo":0,"ocean_custom_logo_max_width":0,"ocean_custom_logo_tablet_max_width":0,"ocean_custom_logo_mobile_max_width":0,"ocean_custom_logo_max_height":0,"ocean_custom_logo_tablet_max_height":0,"ocean_custom_logo_mobile_max_height":0,"ocean_header_custom_menu":"","ocean_menu_typo_font_family":"","ocean_menu_typo_font_subset":"","ocean_menu_typo_font_size":0,"ocean_menu_typo_font_size_tablet":0,"ocean_menu_typo_font_size_mobile":0,"ocean_menu_typo_font_size_unit":"px","ocean_menu_typo_font_weight":"","ocean_menu_typo_font_weight_tablet":"","ocean_menu_typo_font_weight_mobile":"","ocean_menu_typo_transform":"","ocean_menu_typo_transform_tablet":"","ocean_menu_typo_transform_mobile":"","ocean_menu_typo
_line_height":0,"ocean_menu_typo_line_height_tablet":0,"ocean_menu_typo_line_height_mobile":0,"ocean_menu_typo_line_height_unit":"","ocean_menu_typo_spacing":0,"ocean_menu_typo_spacing_tablet":0,"ocean_menu_typo_spacing_mobile":0,"ocean_menu_typo_spacing_unit":"","ocean_menu_link_color":"","ocean_menu_link_color_hover":"","ocean_menu_link_color_active":"","ocean_menu_link_background":"","ocean_menu_link_hover_background":"","ocean_menu_link_active_background":"","ocean_menu_social_links_bg":"","ocean_menu_social_hover_links_bg":"","ocean_menu_social_links_color":"","ocean_menu_social_hover_links_color":"","ocean_disable_title":"default","ocean_disable_heading":"default","ocean_post_title":"","ocean_post_subheading":"","ocean_post_title_style":"","ocean_post_title_background_color":"","ocean_post_title_background":0,"ocean_post_title_bg_image_position":"","ocean_post_title_bg_image_attachment":"","ocean_post_title_bg_image_repeat":"","ocean_post_title_bg_image_size":"","ocean_post_title_height":0,"ocean_post_title_bg_overlay":0.5,"ocean_post_title_bg_overlay_color":"","ocean_disable_breadcrumbs":"default","ocean_breadcrumbs_color":"","ocean_breadcrumbs_separator_color":"","ocean_breadcrumbs_links_color":"","ocean_breadcrumbs_links_hover_color":"","ocean_display_footer_widgets":"default","ocean_display_footer_bottom":"default","ocean_custom_footer_template":"","ocean_post_oembed":"","ocean_post_self_hosted_media":"","ocean_post_video_embed":"","ocean_link_format":"","ocean_link_format_target":"self","ocean_quote_format":"","ocean_quote_format_link":"post","ocean_gallery_link_images":"on","ocean_gallery_id":[],"footnotes":""},"categories":[3,69,2],"tags":[15,70],"class_list":["post-1752","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ml","category-pandas","category-python","tag-ml","tag-pandas","entry","has-media"],"_links":{"self":[{"href":"https:\/\/geekpython.in\/wp-json\/wp\/v2\/posts\/1752","targetHints":{"allow":["GE
T"]}}],"collection":[{"href":"https:\/\/geekpython.in\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/geekpython.in\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/geekpython.in\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/geekpython.in\/wp-json\/wp\/v2\/comments?post=1752"}],"version-history":[{"count":4,"href":"https:\/\/geekpython.in\/wp-json\/wp\/v2\/posts\/1752\/revisions"}],"predecessor-version":[{"id":1757,"href":"https:\/\/geekpython.in\/wp-json\/wp\/v2\/posts\/1752\/revisions\/1757"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/geekpython.in\/wp-json\/wp\/v2\/media\/1754"}],"wp:attachment":[{"href":"https:\/\/geekpython.in\/wp-json\/wp\/v2\/media?parent=1752"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/geekpython.in\/wp-json\/wp\/v2\/categories?post=1752"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/geekpython.in\/wp-json\/wp\/v2\/tags?post=1752"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}