{"id":174828,"date":"2024-05-20T08:00:53","date_gmt":"2024-05-20T12:00:53","guid":{"rendered":"https:\/\/www.kdnuggets.com\/?p=174828"},"modified":"2024-05-12T11:47:59","modified_gmt":"2024-05-12T15:47:59","slug":"essential-python-libraries-for-data-manipulation","status":"publish","type":"post","link":"https:\/\/www.kdnuggets.com\/essential-python-libraries-for-data-manipulation","title":{"rendered":"Essential Python Libraries for Data Manipulation"},"content":{"rendered":"<p><center><img decoding=\"async\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/wijaya-essential-python-libraries-1.png\" alt=\"Essential Python Libraries for Data Manipulation\" width=\"100%\" \/><br \/>\n<font size=\"-1\">Image generated with Midjourney<\/font><\/center><br \/>\n&nbsp;<\/p>\n<p>As a data professional, it\u2019s essential to understand how to process your data. Today, that means using a programming language to quickly manipulate your dataset and achieve the results you expect.<\/p>\n<p>Python is the most popular programming language among data professionals, and many of its libraries are helpful for data manipulation. 
From simple vector operations to parallelization, each use case has a library that can help.<\/p>\n<p>So, which Python libraries are essential for data manipulation? Let\u2019s get into it.<\/p>\n<p>&nbsp;<\/p>\n<h2>1. NumPy<\/h2>\n<p>&nbsp;<\/p>\n<p>The first library we will discuss is <a href=\"https:\/\/numpy.org\/\" target=\"_blank\" rel=\"noopener\">NumPy<\/a>. NumPy is an open-source library for scientific computing. It was developed in 2005 and has been used in many data science use cases.<\/p>\n<p>NumPy is a popular library that provides many valuable features for scientific computing, such as array objects, vectorized operations, and mathematical functions. Many data science use cases rely on complex table and matrix calculations, and NumPy simplifies that calculation process.<\/p>\n<p>Let\u2019s try NumPy with Python. Many data science platforms, such as Anaconda, have NumPy installed by default. 
But you can always install it via pip.<\/p>\n<div style=\"width: 98%; overflow: auto; padding-left: 10px; padding-bottom: 10px; padding-top: 10px; background: #F5F5F5;\">\n<pre><code>pip install numpy<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>After the installation, we can create simple arrays and perform array operations.<\/p>\n<div style=\"width: 98%; overflow: auto; padding-left: 10px; padding-bottom: 10px; padding-top: 10px; background: #F5F5F5;\">\n<pre><code>import numpy as np\r\n\r\na = np.array([1, 2, 3])\r\nb = np.array([4, 5, 6])\r\nc = a + b  # element-wise addition\r\nprint(c)<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Output: <code>[5 7 9]<\/code><\/p>\n<p>We can also perform basic statistics calculations with NumPy.<\/p>\n<div style=\"width: 98%; overflow: auto; padding-left: 10px; padding-bottom: 10px; padding-top: 10px; background: #F5F5F5;\">\n<pre><code>data = np.array([1, 2, 3, 4, 5, 6, 7])\r\nmean = np.mean(data)\r\nmedian = np.median(data)\r\nstd_dev = np.std(data)  # population standard deviation (ddof=0)\r\n\r\nprint(f\"The data mean:{mean}, median:{median} and standard deviation: {std_dev}\")<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Output: <code>The data mean:4.0, median:4.0 and standard deviation: 2.0<\/code><\/p>\n<p>It\u2019s also possible to perform linear algebra operations such as matrix multiplication.<\/p>\n<div style=\"width: 98%; overflow: auto; padding-left: 10px; padding-bottom: 10px; padding-top: 10px; background: #F5F5F5;\">\n<pre><code>x = np.array([[1, 2], [3, 4]])\r\ny = np.array([[5, 6], [7, 8]])\r\ndot_product = np.dot(x, y)  # matrix product of two 2-D arrays\r\n\r\nprint(dot_product)<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Output:<\/p>\n<p><code>[[19 22]<br \/>\n[43 50]]<\/code><\/p>\n<p>There is so much you can do with NumPy. From handling data to complex calculations, it\u2019s no wonder many libraries have NumPy as their base.<\/p>\n<p>&nbsp;<\/p>\n<h2>2. 
Pandas<\/h2>\n<p>&nbsp;<\/p>\n<p><a href=\"https:\/\/pandas.pydata.org\/\" target=\"_blank\" rel=\"noopener\">Pandas<\/a> is the most popular data manipulation Python library for data professionals. I am sure many data science courses use Pandas as the basis for any subsequent process.<\/p>\n<p>Pandas is popular because its API is intuitive yet versatile, so many data manipulation problems can be easily solved with it. Pandas allows the user to perform data operations and analyze data from various input formats such as CSV, Excel, SQL databases, or JSON.<\/p>\n<p>Pandas is built on top of NumPy, so many NumPy operations also work on Pandas objects.<\/p>\n<p>Let\u2019s try out the library. Like NumPy, it\u2019s usually available by default if you are using a data science platform such as Anaconda. However, you can follow the <a href=\"https:\/\/pandas.pydata.org\/getting_started.html\" target=\"_blank\" rel=\"noopener\">Pandas installation guide<\/a> if you are unsure.<\/p>\n<p>You can create a dataset from NumPy objects and get a DataFrame object (a table-like structure), then show its top five rows with the following code.<\/p>\n<div style=\"width: 98%; overflow: auto; padding-left: 10px; padding-bottom: 10px; padding-top: 10px; background: #F5F5F5;\">\n<pre><code>import numpy as np\r\nimport pandas as pd\r\n\r\nnp.random.seed(0)\r\n# Twelve month-end dates starting from January 2023\r\nmonths = pd.date_range(start='2023-01-01', periods=12, freq='M')\r\nsales = np.random.randint(10000, 50000, size=12)\r\ntransactions = np.random.randint(50, 200, size=12)\r\n\r\ndata = {\r\n    'Month': months,\r\n    'Sales': sales,\r\n    'Transactions': transactions\r\n}\r\ndf = pd.DataFrame(data)\r\ndf.head()<\/code><\/pre>\n<\/div>\n<p>&nbsp;<br \/>\n<center><img decoding=\"async\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Wijaya_Essential_Python_Libraries_for_Data_Manipulation_2-300x218.png\" alt=\"Essential Python Libraries for Data Manipulation\" 
width=\"50%\" \/><\/center><br \/>\n&nbsp;<\/p>\n<p>Then you can try several data manipulation activities, such as data selection.<\/p>\n<div style=\"width: 98%; overflow: auto; padding-left: 10px; padding-bottom: 10px; padding-top: 10px; background: #F5F5F5;\">\n<pre><code>df[df['Transactions'] &lt; 100]<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>You can also perform calculations on the data.<\/p>\n<div style=\"width: 98%; overflow: auto; padding-left: 10px; padding-bottom: 10px; padding-top: 10px; background: #F5F5F5;\">\n<pre><code>total_sales = df['Sales'].sum()\r\naverage_transactions = df['Transactions'].mean()<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Performing data cleaning with Pandas is also easy. For example, you can drop rows with missing values or fill them with the column mean.<\/p>\n<div style=\"width: 98%; overflow: auto; padding-left: 10px; padding-bottom: 10px; padding-top: 10px; background: #F5F5F5;\">\n<pre><code>df = df.dropna()  # drop rows that contain missing values\r\ndf = df.fillna(df.mean(numeric_only=True))  # alternatively, fill gaps with the column mean<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>There is so much you can do with Pandas for data manipulation. Check out <a href=\"https:\/\/www.kdnuggets.com\/7-steps-to-mastering-data-wrangling-with-pandas-and-python\" target=\"_blank\" rel=\"noopener\">Bala Priya\u2019s article on using Pandas for data manipulation<\/a> to learn more.<\/p>\n<p>&nbsp;<\/p>\n<h2>3. Polars<\/h2>\n<p>&nbsp;<\/p>\n<p><a href=\"https:\/\/pola.rs\/\" target=\"_blank\" rel=\"noopener\">Polars<\/a> is a relatively new data manipulation Python library designed for the swift analysis of large datasets. Polars boasts 30x performance gains compared to Pandas in several benchmark tests.<\/p>\n<p>Polars is built on top of Apache Arrow, so it manages memory efficiently for large datasets and allows for parallel processing. 
It also optimizes data manipulation performance using lazy execution, which delays computation until it\u2019s necessary.<\/p>\n<p>To install Polars, you can use the following command.<\/p>\n<div style=\"width: 98%; overflow: auto; padding-left: 10px; padding-bottom: 10px; padding-top: 10px; background: #F5F5F5;\">\n<pre><code>pip install polars<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>As with Pandas, you can create a Polars DataFrame with the following code.<\/p>\n<div style=\"width: 98%; overflow: auto; padding-left: 10px; padding-bottom: 10px; padding-top: 10px; background: #F5F5F5;\">\n<pre><code>import numpy as np\r\nimport polars as pl\r\n\r\nnp.random.seed(0)\r\nemployee_ids = np.arange(1, 101)\r\nages = np.random.randint(20, 60, size=100)\r\nsalaries = np.random.randint(30000, 100000, size=100)\r\n\r\ndf = pl.DataFrame({\r\n    'EmployeeID': employee_ids,\r\n    'Age': ages,\r\n    'Salary': salaries\r\n})\r\n\r\ndf.head()<\/code><\/pre>\n<\/div>\n<p>&nbsp;<br \/>\n<center><img decoding=\"async\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Wijaya_Essential_Python_Libraries_for_Data_Manipulation_3-232x300.png\" alt=\"Essential Python Libraries for Data Manipulation\" width=\"50%\" \/><\/center><br \/>\n&nbsp;<\/p>\n<p>However, there are differences in how we use Polars to manipulate data. For example, here is how we select data with Polars.<\/p>\n<div style=\"width: 98%; overflow: auto; padding-left: 10px; padding-bottom: 10px; padding-top: 10px; background: #F5F5F5;\">\n<pre><code>df.filter(pl.col('Age') &gt; 40)<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>The API is considerably more complex than that of Pandas, but it\u2019s helpful if you require fast execution for large datasets. 
On the other hand, you will see little benefit if the dataset is small.<\/p>\n<p>For more details, you can refer to <a href=\"https:\/\/www.kdnuggets.com\/pandas-vs-polars-a-comparative-analysis-of-python-dataframe-libraries\" target=\"_blank\" rel=\"noopener\">Josep Ferrer's article on how Polars differs from Pandas<\/a>.<\/p>\n<p>&nbsp;<\/p>\n<h2>4. Vaex<\/h2>\n<p>&nbsp;<\/p>\n<p><a href=\"https:\/\/vaex.io\/\">Vaex<\/a> is similar to Polars in that the library was developed specifically for manipulating large datasets. However, the two differ in how they process a dataset. For example, Vaex utilizes memory-mapping techniques, while Polars focuses on a multi-threaded approach.<\/p>\n<p>Vaex is best suited to datasets that are much bigger than those Polars is intended for. While Polars also targets large dataset manipulation, it works best on datasets that still fit into memory. Vaex, by contrast, is great for datasets that exceed the available memory.<\/p>\n<p>For the Vaex installation, it\u2019s better to refer to the <a href=\"https:\/\/vaex.io\/docs\/installing.html\" target=\"_blank\" rel=\"noopener\">documentation<\/a>, as it could break your system if it\u2019s not done correctly.<\/p>\n<p>&nbsp;<\/p>\n<h2>5. CuPy<\/h2>\n<p>&nbsp;<\/p>\n<p><a href=\"https:\/\/github.com\/cupy\/cupy\" target=\"_blank\" rel=\"noopener\">CuPy<\/a> is an open-source library that enables GPU-accelerated computing in Python. CuPy was designed as a replacement for NumPy and SciPy when you need to run calculations on NVIDIA CUDA or AMD ROCm platforms.<\/p>\n<p>This makes CuPy great for applications that require intense numerical computation and need to use GPU acceleration. 
CuPy can utilize the parallel architecture of the GPU, which is beneficial for large-scale computations.<\/p>\n<p>To install CuPy, refer to the GitHub repository, as there are many available versions and not all of them will suit the platform you use. For example, the command below is for the CUDA 11.x platform.<\/p>\n<div style=\"width: 98%; overflow: auto; padding-left: 10px; padding-bottom: 10px; padding-top: 10px; background: #F5F5F5;\">\n<pre><code>pip install cupy-cuda11x<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>The APIs are similar to NumPy, so you can use CuPy right away if you are already familiar with NumPy. For example, a CuPy calculation looks like this.<\/p>\n<div style=\"width: 98%; overflow: auto; padding-left: 10px; padding-bottom: 10px; padding-top: 10px; background: #F5F5F5;\">\n<pre><code>import cupy as cp\r\n\r\nx = cp.arange(10)  # array created on the GPU\r\ny = cp.array([2] * 10)\r\n\r\nz = x * y  # element-wise multiplication runs on the GPU\r\n\r\nprint(cp.asnumpy(z))  # copy the result back to the host as a NumPy array<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>CuPy rounds out this list of essential Python libraries, and it is especially useful if you regularly work with large-scale computations.<\/p>\n<p>&nbsp;<\/p>\n<h2>Conclusion<\/h2>\n<p>&nbsp;<br \/>\nAll the Python libraries we have explored are essential in certain use cases. 
NumPy and Pandas might be the basics, but libraries like Polars, Vaex, and CuPy are beneficial in specific environments.<\/p>\n<p>If you have any other library you deem essential, please share it in the comments!<br \/>\n&nbsp;<br \/>\n&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"The must-know Python libraries to improve your data manipulation workflow.\n","protected":false},"author":386,"featured_media":175190,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_seopress_robots_primary_cat":"none","_seopress_titles_title":"","_seopress_titles_desc":"","_seopress_robots_index":"","inline_featured_image":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"mc4wp_mailchimp_campaign":[],"footnotes":"","_links_to":"","_links_to_target":""},"categories":[5286],"tags":[203],"class_list":["post-174828","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-kdnuggets-originals","tag-python"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.kdnuggets.com\/wp-json\/wp\/v2\/posts\/174828","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.kdnuggets.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.kdnuggets.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.kdnuggets.com\/wp-json\/wp\/v2\/users\/386"}],"replies":[{"embeddable":true,"href":"https:\/\/www.kdnuggets.com\/wp-json\/wp\/v2\/comments?post=174828"}],"version-history":[{"count":11,"href":"https:\/\/www.kdnuggets.com\/wp-json\/wp\/v2\/posts\/174828\/revisions"}],"predecessor-version":[{"id":175236,"href":"https:\/\/www.kdnuggets.com\/wp-json\/wp\/v2\/posts\/174828\/revisions\/175236"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.kdnuggets.com\/wp-json\/wp\/v2\/media\/175190"}],"wp:attachment":[{"h
ref":"https:\/\/www.kdnuggets.com\/wp-json\/wp\/v2\/media?parent=174828"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.kdnuggets.com\/wp-json\/wp\/v2\/categories?post=174828"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.kdnuggets.com\/wp-json\/wp\/v2\/tags?post=174828"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}