{"id":1167,"date":"2023-07-05T18:30:00","date_gmt":"2023-07-05T13:00:00","guid":{"rendered":"https:\/\/geekpython.in\/?p=1167"},"modified":"2024-03-01T17:11:55","modified_gmt":"2024-03-01T11:41:55","slug":"multiple-datasets-integration-using-pandas","status":"publish","type":"post","link":"https:\/\/geekpython.in\/multiple-datasets-integration-using-pandas","title":{"rendered":"Join, Merge, and Combine Multiple Datasets Using pandas"},"content":{"rendered":"\n<p>Data processing becomes critical when training a robust machine learning model. We occasionally need to restructure and add new data to the datasets to increase the efficiency of the data.<\/p>\n\n\n\n<p>We&#8217;ll look at how to combine multiple datasets and merge multiple datasets with the same and different column names in this article. We&#8217;ll use the&nbsp;<code>pandas<\/code>&nbsp;library&#8217;s following functions to carry out these operations.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>pandas.concat()<\/code><\/li>\n\n\n\n<li><code>pandas.merge()<\/code><\/li>\n\n\n\n<li><code>pandas.DataFrame.join()<\/code><\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"heading-preparing-sample-data\"><a href=\"https:\/\/geekpython.in\/multiple-datasets-integration-using-pandas#heading-preparing-sample-data\"><\/a>Preparing Sample Data<\/h2>\n\n\n\n<p>We&#8217;ll create sample datasets using&nbsp;<code>pandas.DataFrame()<\/code>&nbsp;function and then perform concatenating operations on them.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1012\" height=\"493\" src=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-4.jpeg\" alt=\"Sample data creation\" class=\"wp-image-1201\" srcset=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-4.jpeg 1012w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-4-300x146.jpeg 300w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-4-768x374.jpeg 768w\" 
sizes=\"auto, (max-width: 1012px) 100vw, 1012px\" \/><\/figure>\n\n\n\n<p>The code in the image will generate two datasets from&nbsp;<code>data<\/code>&nbsp;and&nbsp;<code>data1<\/code>&nbsp;using&nbsp;<code>pd.DataFrame(data)<\/code>&nbsp;and&nbsp;<code>pd.DataFrame(data1)<\/code>&nbsp;and store them in the variables&nbsp;<code>df1<\/code>&nbsp;and&nbsp;<code>df2<\/code>.<\/p>\n\n\n\n<p>Then, using the&nbsp;<code>.to_csv()<\/code>&nbsp;function,&nbsp;<code>df1<\/code>&nbsp;and&nbsp;<code>df2<\/code>&nbsp;will be saved in the&nbsp;<code>CSV<\/code>&nbsp;format as&nbsp;<code>'employee.csv'<\/code>&nbsp;and&nbsp;<code>'employee1.csv'<\/code>&nbsp;respectively.<\/p>\n\n\n\n<p>Here, the data that we created looks as shown in the following image.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"713\" height=\"472\" src=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-5.jpeg\" alt=\"Data preview\" class=\"wp-image-1202\" srcset=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-5.jpeg 713w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-5-300x199.jpeg 300w\" sizes=\"auto, (max-width: 713px) 100vw, 713px\" \/><\/figure>\n\n\n\n<h1 class=\"wp-block-heading\" id=\"heading-combining-data-using-concat\"><a href=\"https:\/\/geekpython.in\/multiple-datasets-integration-using-pandas#heading-combining-data-using-concat\"><\/a>Combining Data Using concat()<\/h1>\n\n\n\n<p>We can use the&nbsp;<code>pandas<\/code>&nbsp;library to analyze, modify, and do other things with our&nbsp;<strong>CSV<\/strong>&nbsp;(comma-separated value) data. 
The library includes the&nbsp;<code>concat()<\/code>&nbsp;function, which we will use to concatenate multiple datasets.<\/p>\n\n\n\n<p>There are two axes on which the datasets can be concatenated: the&nbsp;<strong>row axis<\/strong>&nbsp;and the&nbsp;<strong>column axis<\/strong>.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"1000\" src=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-6.jpeg\" alt=\"Visual representation of concatenation on different axis\" class=\"wp-image-1203\" srcset=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-6.jpeg 1400w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-6-300x214.jpeg 300w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-6-1024x731.jpeg 1024w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-6-768x549.jpeg 768w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"heading-combine-data-along-the-row-axis\"><a href=\"https:\/\/geekpython.in\/multiple-datasets-integration-using-pandas#heading-combine-data-along-the-row-axis\"><\/a>Combine Data Along the Row Axis<\/h2>\n\n\n\n<p>We previously created two datasets named&nbsp;<code>'employee.csv'<\/code>&nbsp;and&nbsp;<code>'employee1.csv'<\/code>. We&#8217;ll concatenate them vertically, which means the rows of the second dataset are stacked below the rows of the first.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">combine = pd.concat([dt, dt1])<\/pre><\/div>\n\n\n\n<p>The above code demonstrates the basic use of the&nbsp;<code>concat()<\/code>&nbsp;function. 
We passed a list of datasets(<code>objects<\/code>) that will be combined along the&nbsp;<strong>row axis<\/strong>&nbsp;by default.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"757\" height=\"349\" src=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-7.jpeg\" alt=\"Concatenated dataset\" class=\"wp-image-1204\" srcset=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-7.jpeg 757w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-7-300x138.jpeg 300w\" sizes=\"auto, (max-width: 757px) 100vw, 757px\" \/><\/figure>\n\n\n\n<p>The&nbsp;<code>concat()<\/code>&nbsp;function accepts some parameters that affect the concatenation of the data.<\/p>\n\n\n\n<p>The indices of the data are taken from their corresponding data, as seen in the above output. How do we create a new data index?<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"heading-the-ignoreindex-parameter\"><a href=\"https:\/\/geekpython.in\/multiple-datasets-integration-using-pandas#heading-the-ignoreindex-parameter\"><\/a><strong>The<\/strong>&nbsp;ignore_index&nbsp;<strong>Parameter<\/strong><\/h3>\n\n\n\n<p>When&nbsp;<code>ignore_index=True<\/code>&nbsp;is set, a new index from&nbsp;<code>0<\/code>&nbsp;to&nbsp;<code>n-1<\/code>&nbsp;is created. 
The default value is&nbsp;<code>False<\/code>, which is why the indices were repeated in the above example.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">set_index = pd.concat([dt, dt1], ignore_index=True)<\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"745\" height=\"345\" src=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-8.jpeg\" alt=\"Created new index\" class=\"wp-image-1205\" srcset=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-8.jpeg 745w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-8-300x139.jpeg 300w\" sizes=\"auto, (max-width: 745px) 100vw, 745px\" \/><\/figure>\n\n\n\n<p>As shown in the image above, the dataset contains a new index ranging from&nbsp;<code>0<\/code>&nbsp;to&nbsp;<code>7<\/code>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"heading-the-join-parameter\"><a href=\"https:\/\/geekpython.in\/multiple-datasets-integration-using-pandas#heading-the-join-parameter\"><\/a><strong>The<\/strong>&nbsp;join&nbsp;<strong>Parameter<\/strong><\/h3>\n\n\n\n<p>In the above image, we can see that the first four data points for the&nbsp;<code>Salary<\/code>&nbsp;and&nbsp;<code>No_of_awards<\/code>&nbsp;columns are missing.<\/p>\n\n\n\n<p>This is due to the&nbsp;<code>join<\/code>&nbsp;parameter, which by default is set to&nbsp;<code>\"outer\"<\/code>, so the result keeps every column from both datasets and fills the entries a dataset does not have with&nbsp;<code>NaN<\/code>. 
If it is set to&nbsp;<code>\"inner\"<\/code>, only the columns that are present in every dataset are kept.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">inner_join = pd.concat([dt, dt1], join=\"inner\")<\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"709\" height=\"343\" src=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-9.jpeg\" alt=\"Inner join\" class=\"wp-image-1206\" srcset=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-9.jpeg 709w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-9-300x145.jpeg 300w\" sizes=\"auto, (max-width: 709px) 100vw, 709px\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"heading-the-keys-parameter\"><a href=\"https:\/\/geekpython.in\/multiple-datasets-integration-using-pandas#heading-the-keys-parameter\"><\/a>The&nbsp;keys&nbsp;Parameter<\/h3>\n\n\n\n<p>The&nbsp;<code>keys<\/code>&nbsp;parameter adds an outer index level built from the given&nbsp;<strong>keys<\/strong>, which identifies the original dataset of each row in the concatenated result.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">keys = pd.concat([dt, dt1], keys=[\"Dataset1\", \"Dataset2\"])<\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"830\" height=\"350\" src=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-10.jpeg\" alt=\"Key index\" class=\"wp-image-1207\" srcset=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-10.jpeg 830w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-10-300x127.jpeg 300w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-10-768x324.jpeg 768w\" sizes=\"auto, (max-width: 830px) 100vw, 830px\" \/><\/figure>\n\n\n\n<p>The datasets were concatenated, and a multi-level index was created, with 
the first level representing the outermost index (<code>Dataset1<\/code>&nbsp;and&nbsp;<code>Dataset2<\/code>&nbsp;from the&nbsp;<code>keys<\/code>) and the second level representing the original index.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"heading-combine-data-along-the-column-axis\"><a href=\"https:\/\/geekpython.in\/multiple-datasets-integration-using-pandas#heading-combine-data-along-the-column-axis\"><\/a>Combine Data Along the Column Axis<\/h2>\n\n\n\n<p>In the previous section, the datasets were concatenated along the row axis. In this approach, we will place them&nbsp;<strong>side by side<\/strong>, that is, along the&nbsp;<strong>column axis<\/strong>, using the&nbsp;<code>axis<\/code>&nbsp;parameter.<\/p>\n\n\n\n<p>The&nbsp;<code>axis<\/code>&nbsp;parameter is set to&nbsp;<code>0<\/code>&nbsp;or&nbsp;<code>\"index\"<\/code>&nbsp;by default, which concatenates the datasets along the&nbsp;<strong>row axis<\/strong>, but if we change its value to&nbsp;<code>1<\/code>&nbsp;or&nbsp;<code>\"columns\"<\/code>, it concatenates the datasets along the&nbsp;<strong>column axis<\/strong>.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">combine_columns = pd.concat([dt, dt1], axis=\"columns\")\n#---------------------------OR---------------------------#\ncombine_columns = pd.concat([dt, dt1], axis=1)<\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1109\" height=\"240\" src=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-11.jpeg\" alt=\"Datasets concatenated along the column axis\" class=\"wp-image-1208\" srcset=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-11.jpeg 1109w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-11-300x65.jpeg 300w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-11-1024x222.jpeg 1024w, 
https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-11-768x166.jpeg 768w\" sizes=\"auto, (max-width: 1109px) 100vw, 1109px\" \/><\/figure>\n\n\n\n<h1 class=\"wp-block-heading\" id=\"heading-merging-data-using-merge\"><a href=\"https:\/\/geekpython.in\/multiple-datasets-integration-using-pandas#heading-merging-data-using-merge\"><\/a>Merging Data Using merge()<\/h1>\n\n\n\n<p>The&nbsp;<code>pandas.merge()<\/code>&nbsp;function merges two datasets based on common columns or indices.<\/p>\n\n\n\n<p>We&#8217;ll work with different sample data, which contains the information shown in the following image.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"729\" height=\"466\" src=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-12.jpeg\" alt=\"Sample data preview\" class=\"wp-image-1209\" srcset=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-12.jpeg 729w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-12-300x192.jpeg 300w\" sizes=\"auto, (max-width: 729px) 100vw, 729px\" \/><\/figure>\n\n\n\n<p>The&nbsp;<code>merge()<\/code>&nbsp;function takes&nbsp;<code>left<\/code>&nbsp;and&nbsp;<code>right<\/code>&nbsp;parameters, which are the two datasets to be merged.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"heading-the-how-parameter\"><a href=\"https:\/\/geekpython.in\/multiple-datasets-integration-using-pandas#heading-the-how-parameter\"><\/a>The&nbsp;how&nbsp;Parameter<\/h2>\n\n\n\n<p>We can now specify the type of merge we want to perform on these datasets by providing the&nbsp;<code>how<\/code>&nbsp;parameter. The&nbsp;<code>how<\/code>&nbsp;parameter allows for five different types of merges:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>inner<\/code>: Default. 
It only includes the values that match from both datasets.<\/li>\n\n\n\n<li><code>outer<\/code>: It includes all of the values from both datasets but fills the missing values with&nbsp;<code>NaN<\/code>&nbsp;(<strong>Not a Number<\/strong>).<\/li>\n\n\n\n<li><code>left<\/code>: It includes all of the values from the left dataset and replaces any missing values in the right dataset with&nbsp;<code>NaN<\/code>.<\/li>\n\n\n\n<li><code>right<\/code>: It includes all of the values from the right dataset and replaces any missing values in the left dataset with&nbsp;<code>NaN<\/code>.<\/li>\n\n\n\n<li><code>cross<\/code>: It creates the Cartesian product which means that the number of rows created will be equal to the product of the row counts of both datasets. If both datasets have four rows, then four times four (<code>4 * 4<\/code>) equals sixteen (<code>16<\/code>) rows.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"heading-examples\"><a href=\"https:\/\/geekpython.in\/multiple-datasets-integration-using-pandas#heading-examples\"><\/a>Examples<\/h3>\n\n\n\n<p><strong>Performing<\/strong>&nbsp;<code>inner<\/code>&nbsp;<strong>merge<\/strong><\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">inner_merging = pd.merge(dt1, dt2, how=\"inner\")<\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1044\" height=\"205\" src=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-13.jpeg\" alt=\"Inner merged dataset\" class=\"wp-image-1210\" srcset=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-13.jpeg 1044w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-13-300x59.jpeg 300w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-13-1024x201.jpeg 1024w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-13-768x151.jpeg 768w\" sizes=\"auto, (max-width: 1044px) 100vw, 1044px\" 
\/><\/figure>\n\n\n\n<p>We can see that only values with the same&nbsp;<code>Id<\/code>&nbsp;from both datasets have been included.<\/p>\n\n\n\n<p><strong>Performing<\/strong>&nbsp;<code>outer<\/code>&nbsp;<strong>merge<\/strong><\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">outer_merging = pd.merge(dt1, dt2, how=\"outer\")<\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1043\" height=\"262\" src=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-14.jpeg\" alt=\"Outer merged dataset\" class=\"wp-image-1211\" srcset=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-14.jpeg 1043w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-14-300x75.jpeg 300w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-14-1024x257.jpeg 1024w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-14-768x193.jpeg 768w\" sizes=\"auto, (max-width: 1043px) 100vw, 1043px\" \/><\/figure>\n\n\n\n<p>In the case of an&nbsp;<strong>outer merge<\/strong>, all of the values from both datasets were included, and the missing fields were filled in with&nbsp;<code>NaN<\/code>.<\/p>\n\n\n\n<p><strong>Performing<\/strong>&nbsp;<code>left<\/code>&nbsp;<strong>merge<\/strong><\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">left_merging = pd.merge(dt1, dt2, how=\"left\")<\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1044\" height=\"239\" src=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-15.jpeg\" alt=\"Left merged dataset\" class=\"wp-image-1212\" srcset=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-15.jpeg 1044w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-15-300x69.jpeg 300w, 
https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-15-1024x234.jpeg 1024w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-15-768x176.jpeg 768w\" sizes=\"auto, (max-width: 1044px) 100vw, 1044px\" \/><\/figure>\n\n\n\n<p>The matching values of the right dataset (<code>dt2<\/code>) were merged in the left dataset (<code>dt1<\/code>) and the values of the last four columns (<code>Project_id_final<\/code>,&nbsp;<code>Age<\/code>,&nbsp;<code>Salary<\/code>, and&nbsp;<code>No_of_awards<\/code>) were not found for&nbsp;<code>A4<\/code>, so they were filled in with&nbsp;<code>NaN<\/code>.<\/p>\n\n\n\n<p><strong>Performing<\/strong>&nbsp;<code>right<\/code>&nbsp;<strong>merge<\/strong><\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">right_merging = pd.merge(dt1, dt2, how=\"right\")<\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1007\" height=\"232\" src=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-16.jpeg\" alt=\"Right merged dataset\" class=\"wp-image-1213\" srcset=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-16.jpeg 1007w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-16-300x69.jpeg 300w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-16-768x177.jpeg 768w\" sizes=\"auto, (max-width: 1007px) 100vw, 1007px\" \/><\/figure>\n\n\n\n<p>The matching values of the left dataset (<code>dt1<\/code>) were merged in the right dataset (<code>dt2<\/code>) and the values of the first five columns (<code>Project_id_initial<\/code>,&nbsp;<code>Name<\/code>,&nbsp;<code>Role<\/code>,&nbsp;<code>Experience<\/code>, and&nbsp;<code>Qualification<\/code>) were not found for&nbsp;<code>A6<\/code>, so they were filled in with&nbsp;<code>NaN<\/code>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"heading-cross-merging-the-datasets\"><a 
href=\"https:\/\/geekpython.in\/multiple-datasets-integration-using-pandas#heading-cross-merging-the-datasets\"><\/a><strong>Cross Merging the Datasets<\/strong><\/h3>\n\n\n\n<p>The&nbsp;<code>how<\/code>&nbsp;parameter has five different types of merge, one of which is a&nbsp;<code>cross<\/code>&nbsp;merge.<\/p>\n\n\n\n<p>As previously stated, it generates the Cartesian product, with the number of rows formed equal to the product of row counts from both datasets. Take a look at the illustration below to get a better understanding.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"1000\" src=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-17.jpeg\" alt=\"Visual representation of the cross join\" class=\"wp-image-1214\" srcset=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-17.jpeg 1400w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-17-300x214.jpeg 300w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-17-1024x731.jpeg 1024w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-17-768x549.jpeg 768w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><\/figure>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">cross_merging = pd.merge(dt1, dt2, how=\"cross\")<\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1001\" height=\"518\" src=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-18.jpeg\" alt=\"Cross merged dataset\" class=\"wp-image-1215\" srcset=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-18.jpeg 1001w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-18-300x155.jpeg 300w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-18-768x397.jpeg 768w\" sizes=\"auto, (max-width: 1001px) 100vw, 1001px\" \/><\/figure>\n\n\n\n<p>Both datasets have four 
rows each, and each row from&nbsp;<code>dt1<\/code>&nbsp;is repeated four times (row count of&nbsp;<code>dt2<\/code>), resulting in a data set of sixteen rows.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"heading-the-on-lefton-andamp-righton-parameters\"><a href=\"https:\/\/geekpython.in\/multiple-datasets-integration-using-pandas#heading-the-on-lefton-andamp-righton-parameters\"><\/a>The&nbsp;on,&nbsp;left_on&nbsp;&amp;&nbsp;right_on&nbsp;Parameters<\/h2>\n\n\n\n<p>The&nbsp;<code>on<\/code>&nbsp;parameter accepts the name of a column or index(row) to join on. It could be a single name or a list of names.<\/p>\n\n\n\n<p>The&nbsp;<code>left_on<\/code>&nbsp;and&nbsp;<code>right_on<\/code>&nbsp;parameter takes a column or index(row) name from the left and right dataset to join on. They are used when both datasets have different column names to join on.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"heading-merging-datasets-on-the-same-column\"><a href=\"https:\/\/geekpython.in\/multiple-datasets-integration-using-pandas#heading-merging-datasets-on-the-same-column\"><\/a>Merging Datasets on the Same Column<\/h3>\n\n\n\n<p>To merge the datasets based on the same column, we can use the&nbsp;<code>on<\/code>&nbsp;parameter and pass the common column name that both datasets must have.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">merging_on_same_column = pd.merge(dt1, dt2, on='Id')<\/pre><\/div>\n\n\n\n<p>We are merging datasets&nbsp;<code>dt1<\/code>&nbsp;and&nbsp;<code>dt2<\/code>&nbsp;based on the&nbsp;<code>'Id'<\/code>&nbsp;column that they both share.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1009\" height=\"199\" src=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-19.jpeg\" alt=\"Data merged on same column\" class=\"wp-image-1216\" srcset=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-19.jpeg 1009w, 
https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-19-300x59.jpeg 300w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-19-768x151.jpeg 768w\" sizes=\"auto, (max-width: 1009px) 100vw, 1009px\" \/><\/figure>\n\n\n\n<p>The matching&nbsp;<code>Id<\/code>&nbsp;column values from both datasets were merged, and the non-matching values were removed.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"heading-merging-datasets-on-different-columns\"><a href=\"https:\/\/geekpython.in\/multiple-datasets-integration-using-pandas#heading-merging-datasets-on-different-columns\"><\/a>Merging Datasets on Different Columns<\/h3>\n\n\n\n<p>To merge different columns in the left and right datasets, use the&nbsp;<code>left_on<\/code>&nbsp;and&nbsp;<code>right_on<\/code>&nbsp;parameters.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">left_right_merging = pd.merge(dt1, dt2, left_on=\"Project_id_initial\", right_on='Project_id_final')<\/pre><\/div>\n\n\n\n<p>The joining column is the&nbsp;<code>\"Project_id_initial\"<\/code>&nbsp;column from the left dataset (<code>dt1<\/code>) and the&nbsp;<code>\"Project_id_final\"<\/code>&nbsp;column from the right dataset (<code>dt2<\/code>). 
The values shared by both columns will be used to merge them.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1076\" height=\"239\" src=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-20.jpeg\" alt=\"Dataset merged on different columns\" class=\"wp-image-1217\" srcset=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-20.jpeg 1076w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-20-300x67.jpeg 300w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-20-1024x227.jpeg 1024w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-20-768x171.jpeg 768w\" sizes=\"auto, (max-width: 1076px) 100vw, 1076px\" \/><\/figure>\n\n\n\n<p>As we can see, the dataset includes both columns, as well as matching rows based on the common values in both the&nbsp;<code>\"Project_id_initial\"<\/code>&nbsp;and&nbsp;<code>\"Project_id_final\"<\/code>&nbsp;columns.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"heading-changing-the-suffix-of-the-column\"><a href=\"https:\/\/geekpython.in\/multiple-datasets-integration-using-pandas#heading-changing-the-suffix-of-the-column\"><\/a>Changing the Suffix of the Column<\/h2>\n\n\n\n<p>You may have noticed that the merged dataset above has two&nbsp;<code>Id<\/code>&nbsp;columns labeled&nbsp;<code>Id_x<\/code>&nbsp;and&nbsp;<code>Id_y<\/code>. This is due to the&nbsp;<code>suffixes<\/code>&nbsp;parameter, which has the default values&nbsp;<code>_x<\/code>&nbsp;and&nbsp;<code>_y<\/code>: when the left and right datasets share a column name that is not used as a join key, these suffixes are appended to tell the two columns apart.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">chg_suffix = pd.merge(dt1, dt2, suffixes=[\"_1\", \"_2\"], left_on=\"Project_id_initial\", right_on='Project_id_final')<\/pre><\/div>\n\n\n\n<p>This will append the suffixes&nbsp;<code>\"_1\"<\/code>&nbsp;and&nbsp;<code>\"_2\"<\/code>&nbsp;to 
the overlapping columns. Because both datasets have the same column name&nbsp;<code>Id<\/code>, the&nbsp;<code>Id<\/code>&nbsp;column will appear to be&nbsp;<code>Id_1<\/code>&nbsp;in the left dataset and&nbsp;<code>Id_2<\/code>&nbsp;in the right dataset.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1121\" height=\"229\" src=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-21.jpeg\" alt=\"Suffix changed\" class=\"wp-image-1218\" srcset=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-21.jpeg 1121w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-21-300x61.jpeg 300w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-21-1024x209.jpeg 1024w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-21-768x157.jpeg 768w\" sizes=\"auto, (max-width: 1121px) 100vw, 1121px\" \/><\/figure>\n\n\n\n<h1 class=\"wp-block-heading\" id=\"heading-joining-datasets-using-join\"><a href=\"https:\/\/geekpython.in\/multiple-datasets-integration-using-pandas#heading-joining-datasets-using-join\"><\/a>Joining Datasets Using join()<\/h1>\n\n\n\n<p>The&nbsp;<code>join()<\/code>&nbsp;method works on the&nbsp;<code>DataFrame<\/code>&nbsp;object and joins the columns based on the index values. 
Let&#8217;s perform a basic join operation on the dataset.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">dt1.join(dt2, lsuffix=\"_1\", rsuffix=\"_2\")<\/pre><\/div>\n\n\n\n<p>The columns of the&nbsp;<code>dt2<\/code>&nbsp;dataset will be joined with the&nbsp;<code>dt1<\/code>&nbsp;dataset based on the&nbsp;<strong>index<\/strong>&nbsp;values.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1065\" height=\"217\" src=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-22.jpeg\" alt=\"Basic join operation output\" class=\"wp-image-1219\" srcset=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-22.jpeg 1065w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-22-300x61.jpeg 300w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-22-1024x209.jpeg 1024w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-22-768x156.jpeg 768w\" sizes=\"auto, (max-width: 1065px) 100vw, 1065px\" \/><\/figure>\n\n\n\n<p>Both datasets have the same&nbsp;<strong>index<\/strong>&nbsp;values (<code>0<\/code>,&nbsp;<code>1<\/code>,&nbsp;<code>2<\/code>, and&nbsp;<code>3<\/code>), so every row found a match and all the rows were kept.<\/p>\n\n\n\n<p>The&nbsp;<code>join()<\/code>&nbsp;method&#8217;s parameters can be used to control how the datasets are joined. The&nbsp;<code>join()<\/code>&nbsp;method, like the&nbsp;<code>merge()<\/code>&nbsp;function, includes&nbsp;<code>how<\/code>&nbsp;and&nbsp;<code>on<\/code>&nbsp;parameters.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>how<\/code>: Default value is&nbsp;<code>left<\/code>&nbsp;join. 
It is the same as the&nbsp;<code>how<\/code>&nbsp;parameter of the&nbsp;<code>merge()<\/code>&nbsp;function, but the difference is that it performs&nbsp;<strong>index-based<\/strong>&nbsp;joins.<\/li>\n\n\n\n<li><code>on<\/code>: The name of a column or index level in the calling (left) dataset to match against the index of the other dataset. If it is omitted, the join is performed index-on-index.<\/li>\n\n\n\n<li><code>lsuffix<\/code>&nbsp;and&nbsp;<code>rsuffix<\/code>: Used to append the suffix to the left and right datasets&#8217; overlapping columns.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"heading-examples-1\"><a href=\"https:\/\/geekpython.in\/multiple-datasets-integration-using-pandas#heading-examples-1\"><\/a>Examples<\/h2>\n\n\n\n<p><strong>Left join on an index<\/strong><\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">dt1.join(dt2.set_index(\"Id\"), on=\"Id\", how=\"left\")\n#-------------------------OR----------------------#\ndt1.join(dt2.set_index(\"Id\"), on=\"Id\")<\/pre><\/div>\n\n\n\n<p>In the above code, we use&nbsp;<code>set_index('Id')<\/code>&nbsp;to set the&nbsp;<code>Id<\/code>&nbsp;column of the&nbsp;<code>dt2<\/code>&nbsp;dataset as the index and perform a left join (<code>how=\"left\"<\/code>) on the&nbsp;<code>Id<\/code>&nbsp;column (<code>on=\"Id\"<\/code>) between&nbsp;<code>dt1<\/code>&nbsp;and&nbsp;<code>dt2<\/code>.<\/p>\n\n\n\n<p>This matches the values in the&nbsp;<code>Id<\/code>&nbsp;column of the&nbsp;<code>dt1<\/code>&nbsp;dataset against the&nbsp;<code>Id<\/code>&nbsp;index of the&nbsp;<code>dt2<\/code>&nbsp;dataset. 
If any values are missing, they will be filled in by&nbsp;<code>NaN<\/code>.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1037\" height=\"205\" src=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-23.jpeg\" alt=\"Left joined dataset\" class=\"wp-image-1220\" srcset=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-23.jpeg 1037w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-23-300x59.jpeg 300w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-23-1024x202.jpeg 1024w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-23-768x152.jpeg 768w\" sizes=\"auto, (max-width: 1037px) 100vw, 1037px\" \/><\/figure>\n\n\n\n<p>It&#8217;s the same as when we used the&nbsp;<code>merge()<\/code>&nbsp;function, but this time we&#8217;re joining based on the&nbsp;<strong>index<\/strong>.<\/p>\n\n\n\n<p><strong>Right join on an index<\/strong><\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">dt1.join(dt2.set_index(\"Id\"), on=\"Id\", how=\"right\")<\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1055\" height=\"213\" src=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-24.jpeg\" alt=\"Right joined dataset\" class=\"wp-image-1221\" srcset=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-24.jpeg 1055w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-24-300x61.jpeg 300w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-24-1024x207.jpeg 1024w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-24-768x155.jpeg 768w\" sizes=\"auto, (max-width: 1055px) 100vw, 1055px\" \/><\/figure>\n\n\n\n<p>We are joining the&nbsp;<code>dt1<\/code>&nbsp;dataset with the index of the&nbsp;<code>dt2<\/code>&nbsp;dataset based on the&nbsp;<code>Id<\/code>&nbsp;column. 
We got&nbsp;<code>NaN<\/code>&nbsp;in the first five columns for&nbsp;<code>A6<\/code>&nbsp;because there were no values specified in the&nbsp;<code>dt1<\/code>&nbsp;dataset.<\/p>\n\n\n\n<p><strong>Inner join on an index<\/strong><\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">dt1.join(dt2.set_index(\"Id\"), on=\"Id\", how=\"inner\")<\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1025\" height=\"202\" src=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-25.jpeg\" alt=\"Inner joined dataset\" class=\"wp-image-1222\" srcset=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-25.jpeg 1025w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-25-300x59.jpeg 300w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-25-768x151.jpeg 768w\" sizes=\"auto, (max-width: 1025px) 100vw, 1025px\" \/><\/figure>\n\n\n\n<p>The datasets were joined based on matching index values, i.e., both datasets&nbsp;<code>dt1<\/code>&nbsp;and&nbsp;<code>dt2<\/code>&nbsp;share&nbsp;<code>A1<\/code>,&nbsp;<code>A2<\/code>, and&nbsp;<code>A3<\/code>, so the values corresponding to these indices were joined.<\/p>\n\n\n\n<p><strong>Outer join on an index<\/strong><\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">dt1.join(dt2.set_index(\"Id\"), on=\"Id\", how=\"outer\")<\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1066\" height=\"249\" src=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-26.jpeg\" alt=\"Outer joined dataset\" class=\"wp-image-1223\" srcset=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-26.jpeg 1066w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-26-300x70.jpeg 300w, 
https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-26-1024x239.jpeg 1024w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-26-768x179.jpeg 768w\" sizes=\"auto, (max-width: 1066px) 100vw, 1066px\" \/><\/figure>\n\n\n\n<p>We performed an outer join, which includes all of the rows from both datasets based on the&nbsp;<code>Id<\/code>. Matching values are filled in, and missing values are filled with&nbsp;<code>NaN<\/code>.<\/p>\n\n\n\n<p><strong>Cross Join<\/strong><\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">dt1.join(dt2, how=\"cross\", lsuffix=\"_1\", rsuffix=\"_2\")<\/pre><\/div>\n\n\n\n<p>We didn&#8217;t pass the&nbsp;<code>on<\/code>&nbsp;parameter; instead, we set&nbsp;<code>how=\"cross\"<\/code>. A cross join pairs every row of&nbsp;<code>dt1<\/code>&nbsp;with every row of&nbsp;<code>dt2<\/code>, so the number of rows in the result is the product of both datasets&#8217; row counts. The&nbsp;<code>lsuffix<\/code>&nbsp;and&nbsp;<code>rsuffix<\/code>&nbsp;arguments distinguish the column names that appear in both datasets.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1019\" height=\"505\" src=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-27.jpeg\" alt=\"Cross joined dataset\" class=\"wp-image-1224\" srcset=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-27.jpeg 1019w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-27-300x149.jpeg 300w, https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/image-27-768x381.jpeg 768w\" sizes=\"auto, (max-width: 1019px) 100vw, 1019px\" \/><\/figure>\n\n\n\n<h1 class=\"wp-block-heading\" id=\"heading-conclusion\"><a href=\"https:\/\/geekpython.in\/multiple-datasets-integration-using-pandas#heading-conclusion\"><\/a>Conclusion<\/h1>\n\n\n\n<p>We&#8217;ve learned how to use&nbsp;<code>pandas.concat()<\/code>,&nbsp;<code>pandas.merge()<\/code>, and&nbsp;<code>pandas.DataFrame.join()<\/code>&nbsp;to combine, merge, and join DataFrames.<\/p>\n\n\n\n<p>The&nbsp;<code>concat()<\/code>&nbsp;function in&nbsp;<code>pandas<\/code>&nbsp;is a 
go-to option for combining the DataFrames due to its simplicity. However, if we want more control over how the data is joined and on which column in the DataFrame, the&nbsp;<code>merge()<\/code>&nbsp;function is a good choice. If we want to join data based on the index, we should use the&nbsp;<code>join()<\/code>&nbsp;method.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>\ud83c\udfc6<strong>Other articles you might be interested in if you liked this one<\/strong><\/p>\n\n\n\n<p>\u2705<a target=\"_blank\" href=\"https:\/\/geekpython.in\/python-assert\" rel=\"noreferrer noopener\">How to use assert statements for debugging in Python<\/a>?<\/p>\n\n\n\n<p>\u2705<a target=\"_blank\" href=\"https:\/\/geekpython.in\/unit-tests-in-python\" rel=\"noreferrer noopener\">How to write unit tests using the unittest module in Python<\/a>?<\/p>\n\n\n\n<p>\u2705<a target=\"_blank\" href=\"https:\/\/geekpython.in\/asterisk-in-python\" rel=\"noreferrer noopener\">What are the uses of asterisk(*) in Python<\/a>?<\/p>\n\n\n\n<p>\u2705<a target=\"_blank\" href=\"https:\/\/geekpython.in\/init-vs-new\" rel=\"noreferrer noopener\">What are the init and new methods in Python<\/a>?<\/p>\n\n\n\n<p>\u2705<a target=\"_blank\" href=\"https:\/\/geekpython.in\/using-transfer-learning-for-deep-learning-model\" rel=\"noreferrer noopener\">How to build a custom deep learning model using Python<\/a>?<\/p>\n\n\n\n<p>\u2705<a target=\"_blank\" href=\"https:\/\/geekpython.in\/tempfile-in-python\" rel=\"noreferrer noopener\">How to generate temporary files and directories using tempfile in Python<\/a>?<\/p>\n\n\n\n<p>\u2705<a target=\"_blank\" href=\"https:\/\/geekpython.in\/run-flask-app-from-the-command-line-in-windows\" rel=\"noreferrer noopener\">How to run the Flask app from the terminal<\/a>?<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p><strong>That&#8217;s all for now<\/strong><\/p>\n\n\n\n<p><strong>Keep 
coding\u270c\u270c<\/strong><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Data processing becomes critical when training a robust machine learning model. We occasionally need to restructure and add new data to the datasets to increase the efficiency of the data. We&#8217;ll look at how to combine multiple datasets and merge multiple datasets with the same and different column names in this article. We&#8217;ll use the&nbsp;pandas&nbsp;library&#8217;s [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":1169,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"ocean_post_layout":"","ocean_both_sidebars_style":"","ocean_both_sidebars_content_width":0,"ocean_both_sidebars_sidebars_width":0,"ocean_sidebar":"0","ocean_second_sidebar":"0","ocean_disable_margins":"enable","ocean_add_body_class":"","ocean_shortcode_before_top_bar":"","ocean_shortcode_after_top_bar":"","ocean_shortcode_before_header":"","ocean_shortcode_after_header":"","ocean_has_shortcode":"","ocean_shortcode_after_title":"","ocean_shortcode_before_footer_widgets":"","ocean_shortcode_after_footer_widgets":"","ocean_shortcode_before_footer_bottom":"","ocean_shortcode_after_footer_bottom":"","ocean_display_top_bar":"default","ocean_display_header":"default","ocean_header_style":"","ocean_center_header_left_menu":"0","ocean_custom_header_template":"0","ocean_custom_logo":0,"ocean_custom_retina_logo":0,"ocean_custom_logo_max_width":0,"ocean_custom_logo_tablet_max_width":0,"ocean_custom_logo_mobile_max_width":0,"ocean_custom_logo_max_height":0,"ocean_custom_logo_tablet_max_height":0,"ocean_custom_logo_mobile_max_height":0,"ocean_header_custom_menu":"0","ocean_menu_typo_font_family":"0","ocean_menu_typo_font_subset":"","ocean_menu_typo_font_size":0,"ocean_menu_typo_font_size_tablet":0,"ocean_menu_typo_font_size_mobile":0,"ocean_menu_typo_font_size_unit":"px","ocean_menu_typo_font_weight":"","ocean_menu_typo_font_weight_tablet":"","ocean_men
u_typo_font_weight_mobile":"","ocean_menu_typo_transform":"","ocean_menu_typo_transform_tablet":"","ocean_menu_typo_transform_mobile":"","ocean_menu_typo_line_height":0,"ocean_menu_typo_line_height_tablet":0,"ocean_menu_typo_line_height_mobile":0,"ocean_menu_typo_line_height_unit":"","ocean_menu_typo_spacing":0,"ocean_menu_typo_spacing_tablet":0,"ocean_menu_typo_spacing_mobile":0,"ocean_menu_typo_spacing_unit":"","ocean_menu_link_color":"","ocean_menu_link_color_hover":"","ocean_menu_link_color_active":"","ocean_menu_link_background":"","ocean_menu_link_hover_background":"","ocean_menu_link_active_background":"","ocean_menu_social_links_bg":"","ocean_menu_social_hover_links_bg":"","ocean_menu_social_links_color":"","ocean_menu_social_hover_links_color":"","ocean_disable_title":"default","ocean_disable_heading":"default","ocean_post_title":"","ocean_post_subheading":"","ocean_post_title_style":"","ocean_post_title_background_color":"","ocean_post_title_background":0,"ocean_post_title_bg_image_position":"","ocean_post_title_bg_image_attachment":"","ocean_post_title_bg_image_repeat":"","ocean_post_title_bg_image_size":"","ocean_post_title_height":0,"ocean_post_title_bg_overlay":0.5,"ocean_post_title_bg_overlay_color":"","ocean_disable_breadcrumbs":"default","ocean_breadcrumbs_color":"","ocean_breadcrumbs_separator_color":"","ocean_breadcrumbs_links_color":"","ocean_breadcrumbs_links_hover_color":"","ocean_display_footer_widgets":"default","ocean_display_footer_bottom":"default","ocean_custom_footer_template":"0","ocean_post_oembed":"","ocean_post_self_hosted_media":"","ocean_post_video_embed":"","ocean_link_format":"","ocean_link_format_target":"self","ocean_quote_format":"","ocean_quote_format_link":"post","ocean_gallery_link_images":"off","ocean_gallery_id":[],"footnotes":""},"categories":[2,69],"tags":[70,12,31],"class_list":["post-1167","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-python","category-pandas","tag-panda
s","tag-python","tag-python3","entry","has-media"],"_links":{"self":[{"href":"https:\/\/geekpython.in\/wp-json\/wp\/v2\/posts\/1167","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/geekpython.in\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/geekpython.in\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/geekpython.in\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/geekpython.in\/wp-json\/wp\/v2\/comments?post=1167"}],"version-history":[{"count":4,"href":"https:\/\/geekpython.in\/wp-json\/wp\/v2\/posts\/1167\/revisions"}],"predecessor-version":[{"id":1283,"href":"https:\/\/geekpython.in\/wp-json\/wp\/v2\/posts\/1167\/revisions\/1283"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/geekpython.in\/wp-json\/wp\/v2\/media\/1169"}],"wp:attachment":[{"href":"https:\/\/geekpython.in\/wp-json\/wp\/v2\/media?parent=1167"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/geekpython.in\/wp-json\/wp\/v2\/categories?post=1167"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/geekpython.in\/wp-json\/wp\/v2\/tags?post=1167"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}