Get Data!!!

The next step of the code would be to, given the file path, determine if it is a csv or a dot xlsx file. This was done using the split function in Python whereby it was split using the dot as the character that would determine where the string would be split. This works under the assumption that the file path only has one dot in it and that dot precedes the file type. When using the split function in Python essentially the output is an array with two terms 0 and 1 the first term or 0 index would be the terms before the dot and the 1 index would be the term after the dot which should be the file type. You can see this code in the line below, notice how the split function calls the dot as the separator and the use of the one within the array to indicate that we want the second term, the term after the dot.

filetype = filepath.split('.')[1]

Now that we have the file type we can determine if it is a csv or xlsx file and as such using an if condition, we can then go on to check for if the user has supplied column names or if they have supplied a sheet name. Now checking for column names or sheet names would be determined based on the file type that is supplied so if the file type does equate to a csv only then would they check for column names and even still there are two options, the person could supply column names or if they did not, we would have to check for both conditions if the column names array from the parameters is empty then we simply call the read_csv() function which is a part of the pandas library in python and we pass the file path which ultimately gives us the entire data frame.

If the column names array is not empty it would mean that Susan column name values have been passed as such, we then pass these column names into the names parameter of the read_csv() function.

    if (filetype == 'csv'):
        if (col_Names == []):
            dataFrame = pd.read_csv(filepath)

            if (test == 0):
                print('Set test to 1 to view sample datraframes, Default is the first 5 rows, set n to vary the number of rows.')
            elif (test == 1):
                print(dataFrame.head(n))
        elif (col_Names != []):
            dataFrame = pd.read_csv(filepath, names=col_Names)
            
            if (test == 0):
                print('Set test to 1 to view sample datraframes, Default is the first 5 rows, set n to vary the number of rows.')
            elif (test == 1):
                print(dataFrame.head(n))

        return dataFrame

Unlike the xlsx file the csv file does not have sheets to navigate to however further data filtering can be accomplished by specifying the column names as strings so as to build a data frame from the data in the csv file.

In the case of the xlsx file however we have the added dimension in that such files contain sheets with each sheet functioning like its own table this means that upon reading a .xlsx file it is imperative that we specify or at least default to the first sheet on the file.

Similarly, if the file type is an xlsx file and we do not pass a sheet number then it simply outputs the first sheet as the data frame. This would be seen from the use of the .sheet_names[ ] function taking 0 as the parameter, which is the default specified in the function parameters.

If a sheet number is passed, then the name of the sheet is extracted and parsed to the data frame allowing it to be returned. As it can be seen below when the sheet parameter is passed as it would only take an integer the sheet name is then parsed and a data frame that is constituted of that particular sheet is created and can then be returned by the function.

    elif (filetype == 'xlsx'):
        xlFile = pd.ExcelFile(filepath)  
        sheetName = xlFile.sheet_names[sheet]
        dataFrame = xlFile.parse(f'{sheetName}')
        if (test == 0):
            print('Set test to 1 to view sample datraframes, Default is the first 5 rows, set n to vary the numbe rof rows.')
        elif (test == 1):
            print(dataFrame.head(n))

        return dataFrame

Below is the code for the function listed in full notice that for all the parameters they are defaults as such the only compulsory parameter that is needed for the function to output a data frame is the file path parameter.

def read_from_file(filepath, test=0, n=5, col_Names = [], sheet = 0):
    filetype = filepath.split('.')[1]
    #This will read the csv and display the first 5 rows of the data.
    if (filetype == 'csv'):
        if (col_Names == []):
            dataFrame = pd.read_csv(filepath)
            
            if (test == 0):
                print('Set test to 1 to view sample datraframes, Default is the first 5 rows, set n to vary the number of rows.')
            elif (test == 1):
                print(dataFrame.head(n))
        elif (col_Names != []):
            dataFrame = pd.read_csv(filepath, names=col_Names)
            
            if (test == 0):
                print('Set test to 1 to view sample datraframes, Default is the first 5 rows, set n to vary the number of rows.')
            elif (test == 1):
                print(dataFrame.head(n))

        return dataFrame

    elif (filetype == 'xlsx'):
        xlFile = pd.ExcelFile(filepath)  
        sheetName = xlFile.sheet_names[sheet]
        dataFrame = xlFile.parse(f'{sheetName}')
        if (test == 0):
            print('Set test to 1 to view sample datraframes, Default is the first 5 rows, set n to vary the numbe rof rows.')
        elif (test == 1):
            print(dataFrame.head(n))

        return dataFrame

To print the entirety of the data frame one would simply have to enclose the function within the Python print function. Other than that, by equating the function to a particular variable in the different variable can be created and used for further data processing and visualization.

Get Data!!!

Published by Vyan Persad

Leave a comment Cancel reply

Share this:

Published by Vyan Persad

Leave a comment Cancel reply