{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "All the IPython Notebooks in **Python Files** lecture series by **[Dr. Milaan Parmar](https://www.linkedin.com/in/milaanparmar/)** are available @ **[GitHub](https://github.com/milaan9/05_Python_Files)**\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Python File I/O\n", "\n", "In this class, you'll learn about Python file operations. More specifically, opening a file, reading from it, writing into it, closing it, and various file methods that you should be aware of." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Files\n", "\n", "Files are named locations on disk to store related information. They are used to permanently store data in a non-volatile memory (e.g. hard disk).\n", "\n", "Since Random Access Memory (RAM) is volatile (it loses its data when the computer is turned off), we use files to store data permanently for future use.\n", "\n", "When we want to read from or write to a file, we need to open it first. When we are done, it needs to be closed so that the resources that are tied with the file are freed.\n", "\n", "Hence, in Python, a file operation takes place in the following order:\n", "\n", "1. Open a file\n", "2. Read from or write to the file (perform operation)\n", "3. Close the file\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Opening Files in Python\n", "\n", "Python has a built-in **`open()`** function to open a file. This function returns a file object, also called a handle, as it is used to read or modify the file accordingly.\n", "\n", "```python\n", ">>> f = open(\"test.txt\") # open file in current directory\n", ">>> f = open(\"C:/Python99/README.txt\") # specifying full path\n", "```\n", "\n", "We can specify the mode while opening a file. In mode, we specify whether we want to read **`r`**, write **`w`** or append **`a`** to the file. 
We can also specify if we want to open the file in text mode or binary mode.\n", "\n", "The default is reading in text mode. In this mode, we get strings when reading from the file.\n", "\n", "On the other hand, binary mode returns bytes and this is the mode to be used when dealing with non-text files like images or executable files.\n", "\n", "| Mode | Description |\n", "|:----:| :--- |\n", "| **`r`** | **Read** - Opens a file for reading only. The file pointer is placed at the beginning of the file. This is the default mode. | \n", "| **`t`** | **Text** - Opens in text mode (default). | \n", "| **`b`** | **Binary** - Opens in binary mode (e.g. images). | \n", "| **`x`** | **Create** - Opens a file for exclusive creation. If the file already exists, the operation fails. | \n", "| **`rb`** | Opens a file for reading only in binary format. The file pointer is placed at the beginning of the file. | \n", "| **`r+`** | Opens a file for both reading and writing. The file pointer is placed at the beginning of the file. | \n", "| **`rb+`** | Opens a file for both reading and writing in binary format. The file pointer is placed at the beginning of the file. | \n", "| **`w`** | **Write** - Opens a file for writing only. Overwrites the file if the file exists. If the file does not exist, creates a new file for writing. | \n", "| **`wb`** | Opens a file for writing only in binary format. Overwrites the file if the file exists. If the file does not exist, creates a new file for writing. | \n", "| **`w+`** | Opens a file for both writing and reading. Overwrites the existing file if the file exists. If the file does not exist, creates a new file for reading and writing. | \n", "| **`wb+`** | Opens a file for both writing and reading in binary format. Overwrites the existing file if the file exists. If the file does not exist, creates a new file for reading and writing. | \n", "| **`a`** | **Append** - Opens a file for appending. 
The file pointer is at the end of the file if the file exists. That is, the file is in the append mode. If the file does not exist, it creates a new file for writing. | \n", "| **`ab`** | Opens a file for appending in binary format. The file pointer is at the end of the file if the file exists. That is, the file is in the append mode. If the file does not exist, it creates a new file for writing. | \n", "| **`a+`** | Opens a file for both appending and reading. The file pointer is at the end of the file if the file exists. The file opens in the append mode. If the file does not exist, it creates a new file for reading and writing. |\n", "| **`ab+`** | Opens a file for both appending and reading in binary format. The file pointer is at the end of the file if the file exists. The file opens in the append mode. If the file does not exist, it creates a new file for reading and writing. | " ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:08.145188Z", "start_time": "2021-10-25T14:28:08.074879Z" }, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<_io.TextIOWrapper name='test.txt' mode='r' encoding='cp1252'>\n" ] } ], "source": [ "f = open(\"test.txt\") # equivalent to 'r' or 'rt'\n", "print(f) # <_io.TextIOWrapper name='test.txt' mode='r' encoding='cp1252'>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As the example above shows, printing the opened file object displays some information about it. An opened file offers several reading methods: **`read()`**, **`readline()`**, and **`readlines()`**. An opened file has to be closed with the **`close()`** method." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:08.207692Z", "start_time": "2021-10-25T14:28:08.151052Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<_io.TextIOWrapper name='test.txt' mode='w' encoding='cp1252'>\n" ] } ], "source": [ "f = open(\"test.txt\",'w') # write in text mode\n", "print(f)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:08.317071Z", "start_time": "2021-10-25T14:28:08.217459Z" } }, "outputs": [], "source": [ "f = open(\"logo.png\",'r+b') # read and write in binary mode" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unlike other languages, the character **`a`** does not imply the number 97 until it is encoded using **`ASCII`** (or other equivalent encodings).\n", "\n", "Moreover, the default encoding is platform dependent. On Windows, it is typically **`cp1252`**, while on Linux it is **`utf-8`**.\n", "\n", "So, we must not rely on the default encoding, or else our code will behave differently on different platforms.\n", "\n", "Hence, when working with files in text mode, it is highly recommended to specify the encoding type." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:08.426445Z", "start_time": "2021-10-25T14:28:08.321954Z" } }, "outputs": [], "source": [ "f = open(\"test.txt\", mode='r', encoding='utf-8')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Closing Files in Python\n", "\n", "When we are done with performing operations on the file, we need to properly close the file.\n", "\n", "Closing a file will free up the resources that were tied with the file. It is done using the **`close()`** method available in Python.\n", "\n", "Python has a garbage collector to clean up unreferenced objects but we must not rely on it to close the file." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:08.582216Z", "start_time": "2021-10-25T14:28:08.430355Z" } }, "outputs": [], "source": [ "f = open(\"test.txt\", encoding = 'utf-8')\n", "# perform file operations\n", "f.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This method is not entirely safe. If an exception occurs when we are performing some operation with the file, the code exits without closing the file.\n", "\n", "A safer way is to use a **[try-finally](https://github.com/milaan9/05_Python_Files/blob/main/004_Python_Exceptions_Handling.ipynb)** block." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:08.881538Z", "start_time": "2021-10-25T14:28:08.590028Z" } }, "outputs": [], "source": [ "f = open(\"test.txt\", encoding = 'utf-8')  # open before try: if open() fails, there is nothing to close\n", "try:\n", "    # perform file operations\n", "    pass\n", "finally:\n", "    f.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This way, we are guaranteeing that the file is properly closed even if an exception is raised that causes program flow to stop.\n", "\n", "The best way to close a file is by using the **`with`** statement. This ensures that the file is closed when the block inside the **`with`** statement is exited.\n", "\n", "We don't need to explicitly call the **`close()`** method. It is done internally.\n", "\n", "```python\n", ">>> with open(\"test.txt\", encoding = 'utf-8') as f:\n", "...     # perform file operations\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The file Object Attributes\n", "\n", "* **file.closed** - Returns `True` if the file is closed, `False` otherwise.\n", "* **file.mode** - Returns the access mode with which the file was opened.\n", "* **file.name** - Returns the name of the file." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:09.020214Z", "start_time": "2021-10-25T14:28:08.891306Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Name of the file: data.txt\n", "Closed or not : False\n", "Opening mode : wb\n" ] } ], "source": [ "# Open a file\n", "data = open(\"data.txt\", \"wb\")\n", "print (\"Name of the file: \", data.name)\n", "print (\"Closed or not : \", data.closed)\n", "print (\"Opening mode : \", data.mode)\n", "data.close() #closed data.txt file" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Writing to Files in Python\n", "\n", "In order to write into a file in Python, we need to open it in write **`w`**, append **`a`** or exclusive creation **`x`** mode.\n", "\n", "We need to be careful with the **`w`** mode, as it will overwrite into the file if it already exists. Due to this, all the previous data are erased.\n", "\n", "Writing a string or sequence of bytes (for binary files) is done using the **`write()`** method. This method returns the number of characters written to the file." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:09.124707Z", "start_time": "2021-10-25T14:28:09.028027Z" } }, "outputs": [], "source": [ "with open(\"test_1.txt\",'w',encoding = 'utf-8') as f:\n", " f.write(\"my first file\\n\")\n", " f.write(\"This file\\n\\n\")\n", " f.write(\"contains three lines\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This program will create a new file named **`test_1.txt`** in the current directory if it does not exist. If it does exist, it is overwritten.\n", "\n", "We must include the newline characters ourselves to distinguish the different lines." 
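, "\n", "\n",
"As a minimal sketch (the file name `writelines_demo.txt` and the temporary directory are hypothetical, chosen so that none of the files used in this class are touched), the **`writelines()`** method listed under Python File Methods writes several strings in one call. Like **`write()`**, it does not add newline characters for us:\n",
"\n",
"```python\n",
"import os\n",
"import tempfile\n",
"\n",
"# Hypothetical demo file inside a fresh temporary directory.\n",
"demo_path = os.path.join(tempfile.mkdtemp(), \"writelines_demo.txt\")\n",
"lines = [\"first line\", \"second line\", \"third line\"]\n",
"\n",
"with open(demo_path, \"w\", encoding=\"utf-8\") as f:\n",
"    # writelines() writes the strings as-is, so we append '\\n' ourselves.\n",
"    f.writelines(line + \"\\n\" for line in lines)\n",
"\n",
"with open(demo_path, encoding=\"utf-8\") as f:\n",
"    written = f.read()\n",
"\n",
"print(written)\n",
"```\n",
"\n",
"Without the added **`\\n`**, the three strings would end up joined on a single line."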
] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:09.269731Z", "start_time": "2021-10-25T14:28:09.134477Z" } }, "outputs": [], "source": [ "with open(\"test_2.txt\",'w',encoding = 'utf-8') as f:\n", " f.write(\"This is file\\n\")\n", " f.write(\"my\\n\")\n", " f.write(\"first file\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us append **`a`** some text to the file we have been reading:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:09.296102Z", "start_time": "2021-10-25T14:28:09.274614Z" } }, "outputs": [], "source": [ "with open(\"test_2.txt\",'a',encoding = 'utf-8') as f:\n", " f.write('This text has to be appended at the end')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:09.404501Z", "start_time": "2021-10-25T14:28:09.300006Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "done\n" ] } ], "source": [ "# open a file in current directory\n", "data = open(\"data_1.txt\", \"w\") # \"w\" write in text mode,\n", "data.write(\"Welcome to Dr. Milan Parmar's Python Tutorial\")\n", "print(\"done\")\n", "data.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Reading Files in Python\n", "\n", "To read a file in Python, we must open the file in reading **`r`** mode.\n", "\n", "There are various methods available for this purpose. We can use the **`read(size)`** method to read up to **`size`** characters (or bytes, in binary mode). If the **`size`** parameter is not specified, it reads and returns up to the end of the file.\n", "\n", "We can read the files from the sections above in the following way (note that **`test.txt`** is empty at this point, because opening it in **`w`** mode earlier erased its contents, while **`test_1.txt`** holds the lines we wrote):" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:09.525600Z", "start_time": "2021-10-25T14:28:09.418663Z" }, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'str'>\n", "\n" ] } ], "source": [ "f = open(\"test.txt\",'r',encoding = 'utf-8')\n", "txt = f.read() # read all the characters in the file\n", "print(type(txt))\n", "print(txt)\n", "f.close()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:09.662808Z", "start_time": "2021-10-25T14:28:09.529505Z" } }, "outputs": [ { "data": { "text/plain": [ "'my first'" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f = open(\"test_1.txt\",'r',encoding = 'utf-8')\n", "f.read(8) # read the first 8 characters" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:09.816134Z", "start_time": "2021-10-25T14:28:09.665737Z" } }, "outputs": [ { "data": { "text/plain": [ "' file'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f.read(5) # read the next 5 characters" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:09.924044Z", "start_time": "2021-10-25T14:28:09.820039Z" } }, "outputs": [ { "data": { "text/plain": [ "'\\nThis file\\n\\ncontains 
three lines\\n'" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f.read() # read in the rest till end of file" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:10.032936Z", "start_time": "2021-10-25T14:28:09.928929Z" } }, "outputs": [ { "data": { "text/plain": [ "''" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f.read() # further reading returns an empty string" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the **`read()`** method returns a newline as **`'\\n'`**. Once the end of the file is reached, we get an empty string on further reading.\n", "\n", "We can change our current file cursor (position) using the **`seek()`** method. Similarly, the **`tell()`** method returns our current position (in number of bytes)." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:10.141825Z", "start_time": "2021-10-25T14:28:10.037819Z" } }, "outputs": [ { "data": { "text/plain": [ "50" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f.tell() # get the current file position" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:10.235578Z", "start_time": "2021-10-25T14:28:10.145731Z" } }, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f.seek(0) # bring file cursor to initial position" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:10.351794Z", "start_time": "2021-10-25T14:28:10.239484Z" }, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "my first file\n", "This file\n", "\n", "contains three lines\n", "\n" ] } ], "source": [ "print(f.read()) 
# read the entire file" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can read a file line-by-line using a **[for loop](https://github.com/milaan9/03_Python_Flow_Control/blob/main/005_Python_for_Loop.ipynb)**. This is both efficient and fast." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:10.454334Z", "start_time": "2021-10-25T14:28:10.364489Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "my first file\n", "This file\n", "\n", "contains three lines\n" ] } ], "source": [ "f = open(\"test_1.txt\",'r',encoding = 'utf-8')\n", "for line in f:\n", " print(line, end = '')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this program, the lines in the file itself include a newline character **`\\n`**. So, we use the end parameter of the **`print()`** function to avoid two newlines when printing." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, we can use the **`readline()`** method to read individual lines of a file. This method reads a file till the newline, including the newline character." 
] }, { "cell_type": "code", "execution_count": 21, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:10.578847Z", "start_time": "2021-10-25T14:28:10.458240Z" } }, "outputs": [ { "data": { "text/plain": [ "'my first file\\n'" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f.seek(0) # bring file cursor to initial position\n", "f.readline()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:10.749264Z", "start_time": "2021-10-25T14:28:10.582755Z" } }, "outputs": [ { "data": { "text/plain": [ "'This file\\n'" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f.readline()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:10.874263Z", "start_time": "2021-10-25T14:28:10.760981Z" } }, "outputs": [ { "data": { "text/plain": [ "'\\n'" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f.readline()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:11.014891Z", "start_time": "2021-10-25T14:28:10.880126Z" } }, "outputs": [ { "data": { "text/plain": [ "'contains three lines\\n'" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f.readline()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lastly, the **`readlines()`** method returns a list of remaining lines of the entire file. All these reading methods return empty values when the end of file **(EOF)** is reached." 
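, "\n", "\n",
"The empty values at **EOF** can be checked directly. The sketch below assumes a hypothetical scratch file (`eof_demo.txt` in a temporary directory) so that it does not touch the files used in this class:\n",
"\n",
"```python\n",
"import os\n",
"import tempfile\n",
"\n",
"# Hypothetical scratch file, created only for this demonstration.\n",
"eof_path = os.path.join(tempfile.mkdtemp(), \"eof_demo.txt\")\n",
"with open(eof_path, \"w\", encoding=\"utf-8\") as f:\n",
"    f.write(\"line one\\nline two\\n\")\n",
"\n",
"with open(eof_path, encoding=\"utf-8\") as f:\n",
"    f.read()                       # consume the whole file\n",
"    eof_read = f.read()            # empty string at EOF\n",
"    eof_readline = f.readline()    # empty string at EOF\n",
"    eof_readlines = f.readlines()  # empty list at EOF\n",
"\n",
"print(repr(eof_read), repr(eof_readline), eof_readlines)\n",
"```\n",
"\n",
"Because an empty string is falsy, a common pattern is to keep calling **`readline()`** until it returns **`''`**."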
] }, { "cell_type": "code", "execution_count": 25, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:11.139894Z", "start_time": "2021-10-25T14:28:11.018801Z" }, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "['my first file\\n', 'This file\\n', '\\n', 'contains three lines\\n']" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f.seek(0) # bring file cursor to initial position\n", "f.readlines()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another way to get all the lines as a list is to call **`splitlines()`** on the string returned by **`read()`**. Unlike **`readlines()`**, it drops the trailing newline characters:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:11.279548Z", "start_time": "2021-10-25T14:28:11.142828Z" } }, "outputs": [ { "data": { "text/plain": [ "['my first file', 'This file', '', 'contains three lines']" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f.seek(0) # bring file cursor to initial position\n", "f.read().splitlines()" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:11.388926Z", "start_time": "2021-10-25T14:28:11.283456Z" }, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Welcome to Dr. Milan Parmar\n", "'s Python Tutorial\n" ] } ], "source": [ "# Open a file\n", "data = open(\"data_1.txt\", \"r+\")\n", "file_data = data.read(27) # read the first 27 characters only\n", "full_data = data.read() # read the rest of the file from the current cursor position\n", "print(file_data)\n", "print(full_data)\n", "data.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## File Positions\n", "\n", "The **`tell()`** method tells you the current position within the file; in other words, the next read or write will occur at that many bytes from the beginning of the file.\n", "\n", "The **`seek(offset[, from])`** method changes the current file position. 
The offset argument indicates the number of bytes to be moved. The from argument (called **`whence`** in Python's own documentation) specifies the reference position from where the bytes are to be moved.\n", "\n", "If from is set to 0, the beginning of the file is used as the reference position. If it is set to 1, the current position is used as the reference position. If it is set to 2 then the end of the file would be taken as the reference position. In text mode, only seeks relative to the beginning of the file are generally allowed; the exception is **`seek(0, 2)`**, which jumps to the very end of the file." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:11.545181Z", "start_time": "2021-10-25T14:28:11.392833Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "current position after reading 27 byte : 27\n", "Welcome to Dr. Milan Parmar\n", "Welcome to Dr. Milan Parmar's Python Tutorial\n", "position after reading file : 45\n" ] } ], "source": [ "# Open a file\n", "data = open(\"data_1.txt\", \"r+\")\n", "file_data = data.read(27) # read the first 27 characters only\n", "print(\"current position after reading 27 byte :\",data.tell())\n", "data.seek(0) # set the current position back to 0 (start of the file)\n", "full_data = data.read() # read the whole file\n", "print(file_data)\n", "print(full_data)\n", "print(\"position after reading file : \",data.tell())\n", "data.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Python File Methods\n", "\n", "There are various methods available with the file object. Some of them have been used in the above examples.\n", "\n", "Here is the complete list of methods in text mode with a brief description:\n", "\n", "| Method | Description |\n", "|:----| :--- |\n", "| **`close()`** | Closes an opened file. It has no effect if the file is already closed. | \n", "| **`detach()`** | Separates the underlying binary buffer from the **`TextIOBase`** and returns it. | \n", "| **`fileno()`** | Returns an integer number (file descriptor) of the file. | \n", "| **`flush()`** | Flushes the write buffer of the file stream. 
| \n", "| **`isatty()`** | Returns **`True`** if the file stream is interactive. | \n", "| **`read(n)`** | Reads at most `n` characters from the file. Reads till end of file if it is negative or `None`. | \n", "| **`readable()`** | Returns **`True`** if the file stream can be read from. | \n", "| **`readline(n=-1)`** | Reads and returns one line from the file. Reads in at most **`n`** characters if specified. | \n", "| **`readlines(n=-1)`** | Reads and returns a list of lines from the file. Reads in at most **`n`** bytes/characters if specified. | \n", "| **`seek(offset,from=SEEK_SET)`** | Changes the file position to **`offset`** bytes, in reference to `from` (start, current, end). | \n", "| **`seekable()`** | Returns **`True`** if the file stream supports random access. | \n", "| **`tell()`** | Returns the current file location. | \n", "| **`truncate(size=None)`** | Resizes the file stream to **`size`** bytes. If **`size`** is not specified, resizes to current location. | \n", "| **`writable()`** | Returns **`True`** if the file stream can be written to. | \n", "| **`write(s)`** | Writes the string **`s`** to the file and returns the number of characters written. | \n", "| **`writelines(lines)`** | Writes a list of **`lines`** to the file. Line separators are not added. | " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Deleting Files\n", "\n", "In a previous section, we saw how to make and remove a directory using the **[os](https://github.com/milaan9/04_Python_Functions/blob/main/007_Python_Function_Module.ipynb)** module (04_Python_Functions ➞ 007_Python_Function_Module ➞ Python Built-In Modules). To remove a file, we again use the **`os`** module." 
] }, { "cell_type": "code", "execution_count": 29, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:12.170196Z", "start_time": "2021-10-25T14:28:11.555923Z" } }, "outputs": [ { "ename": "FileNotFoundError", "evalue": "[WinError 2] The system cannot find the file specified: 'example.txt'", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mFileNotFoundError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;32mimport\u001b[0m \u001b[0mos\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[0mos\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mremove\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'example.txt'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mFileNotFoundError\u001b[0m: [WinError 2] The system cannot find the file specified: 'example.txt'" ] } ], "source": [ "import os\n", "os.remove('example.txt')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the file does not exist, the remove method will raise an error, so it is good to use a condition like this:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:19.870587Z", "start_time": "2021-10-25T14:28:19.861800Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The file does not exist\n" ] } ], "source": [ "import os\n", "if os.path.exists('./files/example.txt'):\n", " os.remove('./files/example.txt')\n", "else:\n", " print('The file does not exist')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## File Types" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### File with txt Extension\n", "\n", "File with **txt** extension is a very common form of data and we have covered it in the previous 
section. Let us move to the JSON file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### File with json Extension\n", "\n", "**JSON** stands for **J**ava**S**cript **O**bject **N**otation. In practice, it is a text (string) representation of data that closely resembles a JavaScript object or a Python dictionary." ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:22.613818Z", "start_time": "2021-10-25T14:28:22.597221Z" } }, "outputs": [], "source": [ "# dictionary\n", "person_dct= {\n", "    \"name\":\"Milaan\",\n", "    \"country\":\"England\",\n", "    \"city\":\"London\",\n", "    \"skills\":[\"Python\", \"MATLAB\",\"R\"]\n", "}\n", "# JSON: a string form of a dictionary (JSON requires double quotes around keys and strings)\n", "person_json = '{\"name\": \"Milaan\", \"country\": \"England\", \"city\": \"London\", \"skills\": [\"Python\", \"MATLAB\", \"R\"]}'\n", "\n", "# we use three quotes and spread it over multiple lines to make it more readable\n", "person_json = '''{\n", "    \"name\":\"Milaan\",\n", "    \"country\":\"England\",\n", "    \"city\":\"London\",\n", "    \"skills\":[\"Python\", \"MATLAB\",\"R\"]\n", "}'''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Changing JSON to Dictionary\n", "\n", "To change a JSON to a dictionary, first we import the json module and then we use the **`loads`** method." 
] }, { "cell_type": "code", "execution_count": 32, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:23.767169Z", "start_time": "2021-10-25T14:28:23.757406Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'dict'>\n", "{'name': 'Milaan', 'country': 'England', 'city': 'London', 'skills': ['Python', 'MATLAB', 'R']}\n", "Milaan\n" ] } ], "source": [ "import json\n", "# JSON\n", "person_json = '''{\n", "    \"name\":\"Milaan\",\n", "    \"country\":\"England\",\n", "    \"city\":\"London\",\n", "    \"skills\":[\"Python\", \"MATLAB\",\"R\"]\n", "}'''\n", "# let's change JSON to dictionary\n", "person_dct = json.loads(person_json)\n", "print(type(person_dct))\n", "print(person_dct)\n", "print(person_dct['name'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Changing Dictionary to JSON\n", "\n", "To change a dictionary to a JSON we use the **`dumps`** method from the json module." ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:24.390232Z", "start_time": "2021-10-25T14:28:24.380471Z" }, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'str'>\n", "{\n", "    \"name\": \"Milaan\",\n", "    \"country\": \"England\",\n", "    \"city\": \"London\",\n", "    \"skills\": [\n", "        \"Python\",\n", "        \"MATLAB\",\n", "        \"R\"\n", "    ]\n", "}\n" ] } ], "source": [ "import json\n", "# python dictionary\n", "person = {\n", "    \"name\":\"Milaan\",\n", "    \"country\":\"England\",\n", "    \"city\":\"London\",\n", "    \"skills\":[\"Python\", \"MATLAB\",\"R\"]\n", "}\n", "# let's convert it to json\n", "person_json = json.dumps(person, indent=4) # indent could be 2, 4, 8. It beautifies the json\n", "print(type(person_json))\n", "print(person_json)\n", "\n", "# When printed, the surrounding quotes are not shown, but person_json is a string.\n", "# JSON text has no separate Python type; it is stored as a str." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Saving as JSON File\n", "\n", "We can also save our data as a JSON file. To write one, we use the **`json.dump()`** function, which takes the dictionary and the output file object, along with optional arguments such as **`ensure_ascii`** and **`indent`**." ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:24.872665Z", "start_time": "2021-10-25T14:28:24.856065Z" } }, "outputs": [], "source": [ "import json\n", "# python dictionary\n", "person = {\n", "    \"name\":\"Milaan\",\n", "    \"country\":\"England\",\n", "    \"city\":\"London\",\n", "    \"skills\":[\"Python\", \"MATLAB\",\"R\"]\n", "}\n", "with open('json_example.json', 'w', encoding='utf-8') as f:\n", "    json.dump(person, f, ensure_ascii=False, indent=4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the code above, we set the file encoding to UTF-8 and use indentation. Passing **`ensure_ascii=False`** writes non-ASCII characters as-is instead of escaping them, and indentation makes the JSON file easy to read." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### File with csv Extension\n", "\n", "**CSV** stands for **C**omma **S**eparated **V**alues. CSV is a simple file format used to store tabular data, such as a spreadsheet or database table. CSV is a very common data format in data science.\n", "\n", "For example, create **csv_example.csv** in your working directory with the following contents:\n", "\n", "```csv\n", "\"name\",\"country\",\"city\",\"skills\"\n", "\"Milaan\",\"England\",\"London\",\"Python\"\n", "```" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:25.606081Z", "start_time": "2021-10-25T14:28:25.589481Z" }, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Column names are: name, country, city, skills\n", "\tMilaan is a teacher. He lives in England, London.\n", "Number of lines: 2\n" ] } ], "source": [ "import csv\n", "with open('csv_example.csv', newline='') as f:\n", "    csv_reader = csv.reader(f, delimiter=',') # csv.reader yields each row as a list of strings\n", "    line_count = 0\n", "    for row in csv_reader:\n", "        if line_count == 0:\n", "            # the first row holds the column names\n", "            print(f'Column names are: {\", \".join(row)}')\n", "            line_count += 1\n", "        else:\n", "            print(\n", "                f'\\t{row[0]} is a teacher. He lives in {row[1]}, {row[2]}.')\n", "            line_count += 1\n", "    print(f'Number of lines: {line_count}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### File with xlsx Extension\n", "\n", "To read Excel files, we need to install the **`xlrd`** package. We will cover this after we cover installing packages with **pip**. Note that **`xlrd`** version 2.0 and later reads only the legacy **.xls** format.\n", "\n", "```py\n", "import xlrd\n", "excel_book = xlrd.open_workbook('sample.xls')\n", "print(excel_book.nsheets)\n", "print(excel_book.sheet_names())\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### File with xml Extension\n", "\n", "**XML** is another structured data format that looks like HTML. In XML, the tags are not predefined. The first line is an XML declaration. The person tag is the root of the XML. 
The person has a gender attribute.\n", "\n", "```xml\n", "<?xml version=\"1.0\"?>\n", "<person gender=\"male\">\n", "  <name>Asabeneh</name>\n", "  <country>Finland</country>\n", "  <city>Helsinki</city>\n", "  <skills>\n", "    <skill>JavaScript</skill>\n", "    <skill>React</skill>\n", "    <skill>Python</skill>\n", "  </skills>\n", "</person>\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For more information on how to read an XML file, check the **[documentation](https://docs.python.org/3/library/xml.etree.elementtree.html)**." ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "ExecuteTime": { "end_time": "2021-10-25T14:28:26.743808Z", "start_time": "2021-10-25T14:28:26.716466Z" }, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Root tag: person\n", "Attribute: {'gender': 'male'}\n", "field: name\n", "field: country\n", "field: city\n", "field: skills\n" ] } ], "source": [ "import xml.etree.ElementTree as ET\n", "tree = ET.parse('xml_example.xml')\n", "root = tree.getroot()\n", "print('Root tag:', root.tag)\n", "print('Attribute:', root.attrib)\n", "# iterating over an element yields its direct children\n", "for child in root:\n", "    print('field:', child.tag)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 💻 Exercises ➞ Files\n", "\n", "### Exercises ➞ Level 1\n", "\n", "1. Write a function which counts the number of lines and the number of words in a text.\n", "   - a) Read the **[speech_barack_obama.txt](https://github.com/milaan9/05_Python_Files/blob/main/speech_barack_obama.txt)** file and count the number of lines and words\n", "   - b) Read the **[speech_michelle_obama.txt](https://github.com/milaan9/05_Python_Files/blob/main/speech_michelle_obama.txt)** file and count the number of lines and words\n", "   - c) Read the **[speech_donald_trump.txt](https://github.com/milaan9/05_Python_Files/blob/main/speech_donald_trump.txt)** file and count the number of lines and words\n", "   - d) Read the **[speech_melina_trump.txt](https://github.com/milaan9/05_Python_Files/blob/main/speech_melina_trump.txt)** file and count the number of lines and words\n", "\n", "2. 
Read the **[countries_data.json](https://github.com/milaan9/05_Python_Files/blob/main/countries_data.json)** data file and create a function that finds the ten most spoken languages\n", "\n", "   - ```py\n", "     # Your output should look like this:\n", "     print(most_spoken_languages('./countries_data.json', 10))\n", "     [(91, 'English'),\n", "      (45, 'French'),\n", "      (25, 'Arabic'),\n", "      (24, 'Spanish'),\n", "      (9, 'Russian'),\n", "      (9, 'Portuguese'),\n", "      (8, 'Dutch'),\n", "      (7, 'German'),\n", "      (5, 'Chinese'),\n", "      (4, 'Swahili'),\n", "      (4, 'Serbian')]\n", "     # Your output should look like this:\n", "     print(most_spoken_languages('./countries_data.json', 3))\n", "     [(91, 'English'),\n", "      (45, 'French'),\n", "      (25, 'Arabic')]\n", "     ```\n", "\n", "3. Read the **[countries_data.json](https://github.com/milaan9/05_Python_Files/blob/main/countries_data.json)** data file and create a function that returns a list of the ten most populated countries\n", "\n", "   - ```py\n", "     # Your output should look like this:\n", "     print(most_populated_countries('./countries_data.json', 10))\n", "     [\n", "       {'country': 'China', 'population': 1377422166},\n", "       {'country': 'India', 'population': 1295210000},\n", "       {'country': 'United States of America', 'population': 323947000},\n", "       {'country': 'Indonesia', 'population': 258705000},\n", "       {'country': 'Brazil', 'population': 206135893},\n", "       {'country': 'Pakistan', 'population': 194125062},\n", "       {'country': 'Nigeria', 'population': 186988000},\n", "       {'country': 'Bangladesh', 'population': 161006790},\n", "       {'country': 'Russian Federation', 'population': 146599183},\n", "       {'country': 'Japan', 'population': 126960000}\n", "     ]\n", "     # Your output should look like this:\n", "     print(most_populated_countries('./countries_data.json', 3))\n", "     [\n", "       {'country': 'China', 'population': 1377422166},\n", "       {'country': 'India', 'population': 1295210000},\n", "       {'country': 'United States of America', 'population': 323947000}\n", "     ]\n", "     ```\n", "\n", "\n", "### Exercises ➞ Level 2\n", "\n", "1. Extract all incoming email addresses as a list from the **[email_exchanges_big.txt](https://github.com/milaan9/05_Python_Files/blob/main/email_exchanges_big.txt)** file.\n", "\n", "2. Find the most common words in the English language. Name your function **`find_most_common_words`**. It takes two parameters, a string or a file and a positive integer indicating the number of words, and returns a list of tuples in descending order of frequency. Check the output:\n", "\n", "   - ```py\n", "     # Your output should look like this\n", "     print(find_most_common_words('sample.txt', 10))\n", "     [(10, 'the'),\n", "      (8, 'be'),\n", "      (6, 'to'),\n", "      (6, 'of'),\n", "      (5, 'and'),\n", "      (4, 'a'),\n", "      (4, 'in'),\n", "      (3, 'that'),\n", "      (2, 'have'),\n", "      (2, 'I')]\n", "\n", "     # Your output should look like this\n", "     print(find_most_common_words('sample.txt', 5))\n", "\n", "     [(10, 'the'),\n", "      (8, 'be'),\n", "      (6, 'to'),\n", "      (6, 'of'),\n", "      (5, 'and')]\n", "     ```\n", "\n", "3. Use the function **`find_most_common_words`** to find:\n", "   - a) The ten most frequent words used in **[Barack Obama's Speech.txt](https://github.com/milaan9/05_Python_Files/blob/main/speech_barack_obama.txt)**\n", "   - b) The ten most frequent words used in **[Michelle Obama's Speech.txt](https://github.com/milaan9/05_Python_Files/blob/main/speech_michelle_obama.txt)**\n", "   - c) The ten most frequent words used in **[Donald Trump's Speech.txt](https://github.com/milaan9/05_Python_Files/blob/main/speech_donald_trump.txt)**\n", "   - d) The ten most frequent words used in **[Melania Trump's Speech.txt](https://github.com/milaan9/05_Python_Files/blob/main/speech_melina_trump.txt)**\n", "\n", "4. Write a Python application that checks the similarity between two texts. It takes two files or strings as parameters and evaluates how similar the two texts are. 
For instance, check the similarity between the transcripts of **[Michelle Obama's Speech.txt](https://github.com/milaan9/05_Python_Files/blob/main/speech_michelle_obama.txt)** and **[Melania Trump's Speech.txt](https://github.com/milaan9/05_Python_Files/blob/main/speech_melina_trump.txt)**. You may need a few functions: one to clean the text (**`clean_text`**), one to remove support words (**`remove_support_words`**) and finally one to check the similarity (**`check_text_similarity`**). See the list of **[support_words](https://github.com/milaan9/05_Python_Files/blob/main/support_words.py)**.\n", "\n", "5. Find the 10 most repeated words in the **[romeo_and_juliet.txt](https://github.com/milaan9/05_Python_Files/blob/main/romeo_and_juliet.txt)** file.\n", "\n", "6. Read the **[hacker_news.csv](https://github.com/milaan9/05_Python_Files/blob/main/hacker_news.csv)** file and find out:\n", "   - a) Count the number of lines containing **`python`** or **`Python`**\n", "   - b) Count the number of lines containing **`JavaScript`**, **`javascript`** or **`Javascript`**\n", "   - c) Count the number of lines containing **`Java`** and not **`JavaScript`**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "hide_input": false, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "calc(100% - 180px)", "left": "10px", "top": "150px", "width": "204.797px" }, "toc_section_display": true, "toc_window_display": false }, "varInspector": { 
"cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }