Skip to content

Latest commit

 

History

History
122 lines (97 loc) · 4.6 KB

File metadata and controls

122 lines (97 loc) · 4.6 KB

Imports

import pandas as pd

The script starts by importing the pandas library, which is a powerful data manipulation and analysis tool in Python.

Function Definitions

1. load_and_rename_clinical_data

def load_and_rename_clinical_data(filepath='brca_metabric_clinical_data.csv'):
    clinical_data = pd.read_csv(filepath)
    clinical_data.rename(columns={
        'Overall Survival (Months)': 'OS_MONTHS',
        'Overall Survival Status': 'OS_STATUS',
        'Pam50 + Claudin-low subtype': 'CLAUDIN_SUBTYPE'
    }, inplace=True)
    return clinical_data
  • Purpose: Load clinical data from a CSV file and rename specific columns for easier reference.
  • Steps:
    1. Read CSV File: pd.read_csv(filepath) loads the CSV file into a DataFrame.
    2. Rename Columns: rename(columns={...}, inplace=True) renames specific columns:
      • 'Overall Survival (Months)' to 'OS_MONTHS'
      • 'Overall Survival Status' to 'OS_STATUS'
      • 'Pam50 + Claudin-low subtype' to 'CLAUDIN_SUBTYPE'
    3. Return DataFrame: The modified DataFrame is returned.

2. load_and_rename_nsd1_data

def load_and_rename_nsd1_data(filepath='NSD1_mRNA_expression.csv'):
    nsd1_data = pd.read_csv(filepath)
    nsd1_data.rename(columns={
        'Sample ID': 'Sample_ID',
        'NSD1: mRNA expression z-scores relative to all samples (log microarray)': 'NSD1'
    }, inplace=True)
    return nsd1_data
  • Purpose: Load NSD1 mRNA expression data from a CSV file and rename specific columns for easier reference.
  • Steps:
    1. Read CSV File: pd.read_csv(filepath) loads the CSV file into a DataFrame.
    2. Rename Columns: rename(columns={...}, inplace=True) renames specific columns:
      • 'Sample ID' to 'Sample_ID'
      • 'NSD1: mRNA expression z-scores relative to all samples (log microarray)' to 'NSD1'
    3. Return DataFrame: The modified DataFrame is returned.

3. merge_data

def merge_data(clinical_data, nsd1_data):
    merged_data = pd.merge(clinical_data, nsd1_data, left_on='Sample ID', right_on='Sample_ID')
    return merged_data
  • Purpose: Merge the clinical and NSD1 data based on the sample IDs.
  • Steps:
    1. Merge DataFrames: pd.merge(...) merges the two DataFrames on the 'Sample ID' column from the clinical data and the 'Sample_ID' column from the NSD1 data.
    2. Return Merged DataFrame: The merged DataFrame is returned.

4. save_data

def save_data(data, filepath='merged_clinical_nsd1_data.csv'):
    data.to_csv(filepath, index=False)
  • Purpose: Save the merged data to a CSV file.
  • Steps:
    1. Save to CSV: data.to_csv(filepath, index=False) saves the DataFrame to a CSV file without the index column.

5. clean_data

def clean_data(data):
    cleaned_data = data.dropna(subset=['OS_MONTHS', 'OS_STATUS', 'CLAUDIN_SUBTYPE', 'NSD1'])
    return cleaned_data
  • Purpose: Clean the data by removing rows with missing values in critical columns.
  • Steps:
    1. Drop Missing Values: data.dropna(subset=[...]) removes rows that have missing values in any of the specified columns.
    2. Return Cleaned DataFrame: The cleaned DataFrame is returned.

Main Function

def main():
    clinical_data = load_and_rename_clinical_data()
    nsd1_data = load_and_rename_nsd1_data()
    merged_data = merge_data(clinical_data, nsd1_data)
    cleaned_data = clean_data(merged_data)
    save_data(cleaned_data, 'cleaned_clinical_nsd1_data.csv')
    print("Cleaned data saved to 'cleaned_clinical_nsd1_data.csv'")
  • Purpose: Coordinate the loading, merging, cleaning, and saving of the data.
  • Steps:
    1. Load Clinical Data: load_and_rename_clinical_data() loads and renames the clinical data.
    2. Load NSD1 Data: load_and_rename_nsd1_data() loads and renames the NSD1 data.
    3. Merge Data: merge_data(clinical_data, nsd1_data) merges the two DataFrames.
    4. Clean Data: clean_data(merged_data) cleans the merged data by removing rows with missing values.
    5. Save Data: save_data(cleaned_data, 'cleaned_clinical_nsd1_data.csv') saves the cleaned data to a CSV file.
    6. Print Confirmation: print(...) prints a message confirming that the data has been saved.

Script Execution

if __name__ == "__main__":
    main()
  • Purpose: Ensure the main function is called when the script is run directly.
  • Steps:
    1. Check Execution Context: if __name__ == "__main__": checks if the script is being run directly.
    2. Call Main Function: main() is called to execute the main data processing steps.