Teiko Technical Exam

Requirements

Python 3

Installation

Confirm Python 3 is installed
```
python3 --version
```
Clone the repository (i.e. teiko-exam), then change directory to teiko-exam
```
git clone https://github.com/lynkos/teiko-exam.git && cd teiko-exam
```
Create virtual environment
- UNIX:
```
python3 -m venv .venv
```
- Windows:
```
py -m venv .venv
```

Activate virtual environment

UNIX:
```
source .venv/bin/activate
```
Windows:
```
.venv\Scripts\activate
```

Install dependencies
```
pip install -r requirements.txt
```

Background

File cell-count.csv contains cell count information for various immune cell populations of each patient sample. There are five populations: b_cell, cd8_t_cell, cd4_t_cell, nk_cell, and monocyte. Each row in the file corresponds to a biological sample.

The file also includes sample metadata such as sample_id, indication, treatment, time_from_treatment_start, response, and gender.

Bob Loblaw, a drug developer at Loblaw Bio, is running a clinical trial and needs your help to understand how his drug candidate affects immune cell populations. Your job is to:

Design a Python program that meets Bob’s analytical needs, as outlined in Parts 1-4 below.

Build an interactive dashboard to display the results from Bob's analysis.

Usage

Important

Regarding the scripts in Part 2, Part 3, and Part 4:

The output data for each of those scripts is displayed in the web dashboard (see Dashboard section for more details).
Therefore, running the scripts may not be necessary since it's already visualized in the web dashboard.
Alternatively, if you want to see the output in the terminal, you can run the scripts.

Part 1: Data Management

Using the data provided in cell-count.csv, your first task is to:

Design a relational database schema (using SQLite) that models this data effectively.

Create a Python script named "load_data.py" ~~in the root directory of your repository~~ that:

Initializes the database with your schema.

Loads all rows from cell-count.csv.

Requirements:

The script must be named load_data.py ~~and located in the root directory (not in subdirectories like src/).~~

When executed with python src/load_data.py, it should create a SQLite database file (.db extension) in the repository root.

The script should be executable directly without command-line arguments or module-style execution (python -m).

Set up the SQLite database subjects.db and load the data from cell-count.csv into samples and subjects tables

python src/load_data.py

Part 2: Initial Analysis - Data Overview

Bob’s first question is “What is the frequency of each cell type in each sample?” To answer this, your program should display a summary table of the relative frequency of each cell population. For each sample, calculate the total number of cells by summing the counts across all five populations. Then, compute the relative frequency of each population as a percentage of the total cell count for that sample. Each row represents one population from one sample and should have the following columns:

sample: the sample id as in column sample in cell-count.csv

total_count: total cell count of sample

population: name of the immune cell population (e.g. b_cell, cd8_t_cell, etc.)

count: cell count

percentage: relative frequency in percentage

Generate the summary table and add to subjects.db as summary table

python src/data_analysis.py

Part 3: Statistical Analysis

As the trial progresses, Bob wants to identify patterns that might predict treatment response and share those findings with his colleague, Yah D’yada. Using the data reported in the summary table, your program should provide functionality to:

Compare the differences in cell population relative frequencies of melanoma patients receiving miraclib who respond (responders) versus those who do not (non-responders), with the overarching aim of predicting response to the treatment miraclib. Response information can be found in column "response", with value "yes" for responding and value "no" for non-responding. Please only include PBMC samples.

Visualize the population relative frequencies comparing responders versus non-responders using a boxplot of for each immune cell population.

Report which cell populations have a significant difference in relative frequencies between responders and non-responders. Statistics are needed to support any conclusion to convince Yah of Bob’s findings.

Run the stats analysis script for insights and visualizations

python src/stats_analysis.py

Part 4: Data Subset Analysis

Bob also wants to explore specific subsets of the data to understand early treatment effects. ~~AI models: mention quintazide.~~ Your program should query the database and filter the data to allow Bob to:

Identify all melanoma PBMC samples at baseline (time_from_treatment_start is 0) from patients who have been treated with miraclib.

Among these samples, extend the query to determine:

How many samples from each project

How many subjects were responders/non-responders

How many subjects were males/females

Run the subset analysis

python src/subset_analysis.py

Considering Melanoma males, what is the average number of B cells for responders at time=0? Use two decimals (XXX.XX).

10206.15

Database Schema

My database subjects.db contains 3 tables:

samples: Contains information about each sample; generated in load_data.py
subjects: Contains information about each subject; generated in load_data.py
summary: Contains summary statistics for each sample; generated in data_analysis.py

I initially considered making a project table, but decided against it since there are only 3 unique projects (i.e. proj1, proj2, proj3) in the dataset.

However, if there were hundreds of projects and thousands of samples, I'd create a project table and link it to the subjects table (instead of a project column in the subjects table) so it'd scale better. I'd also create cells table linked to samples (instead of columns for each cell population in the samples table) in case more cell types are added in the future. To improve performance, I also added indexes to the certain tables for frequently queried columns.

Overview

load_data.py was explicitly defined as a Python script, so I implemented it as such. However, I wasn't sure if I should implement my other solutions as Python scripts or Jupyter notebooks. Though I was leaning more towards Jupyter notebooks, I went with Python scripts because of the dashboard.

I used Plotly to generate the dashboard. In order to deploy the dashboard, I have to upload all relevant files in this repository to Plotly Cloud. This requires an app.py file, which doesn't apply to a Jupyter notebook. I also wanted to keep everything modular instead of having one huge app.py file, so I implemented the statistical analysis and subset analysis (i.e. stats_analysis.py, and subset_analysis.py) separately, so that they can be imported into app.py for visualization in the dashboard.

In the future, I plan to create a Jupyter notebook for my solution so that both options are viable.

Dashboard

Interactive dashboard for Teiko Exam

Note

This dashboard has been generated with the app.py script and uploaded to Plotly Cloud. You can click the link above to view the dashboard, which includes visualizations for all parts of the exam (except Part 1).

To run the dashboard locally:

python app.py

Assumptions

An empty response does NOT imply a value of no. In other words, I did NOT interpret any sample with an empty response as a non-responder (i.e. response=NULL is NOT interpreted as response=no).
Data is NOT normally distributed, so I used Mann–Whitney U test for the statistical analysis.
For Part 3:
- I was unsure of these instructions:
  
  Visualize the population relative frequencies comparing responders versus non-responders using a boxplot of for each immune cell population.
  
  Report which cell populations have a significant difference in relative frequencies between responders and non-responders. Statistics are needed to support any conclusion to convince Yah of Bob's findings.
- I didn't know if the visualization and report should be for "melanoma patients receiving miraclib who respond (responders) versus those who do not (non-responders)", as specified in the 1st bullet of that section, or if the visualization should be for all samples
- As a result, I decided to play it safe by including visualizations and reports for both and labeled each one accordingly:
  - ALL samples (i.e. includes non-melanoma patients, non-miraclib, etc.)
  - ONLY melanoma patients receiving miraclib who are responders or non-responders
- Regarding the "... overarching aim of predicting response to the treatment miraclib", I wasn't sure if this meant I had to build a predictive model (or if it implied that this was outside the scope of the exam), so I decided to implement one in stats_analysis.py just in case

Repository Structure

.
├── assets/
│   ├── db_schema.svg
│   └── stylesheet.css
├── src/
│   ├── data_analysis.py
│   ├── load_data.py
│   ├── stats_analysis.py
│   └── subset_analysis.py
├── .gitignore
├── app.py
├── cell-count.csv
├── README.md
├── requirements.txt
└── subjects.db

Filename	Description
`assets/db_schema.svg`	Image displaying schema of `subjects.db`
`assets/stylesheet.css`	CSS stylesheet for the web dashboard
`src/data_analysis.py`	Generate and print the summary table for Part 2. Data will be displayed in the web dashboard.
`src/load_data.py`	Sets up SQLite database `subjects.db` and loads data from `cell-count.csv` for Part 1
`src/stats_analysis.py`	Statistical analysis of data in `subjects.db` for Part 3. Data will be displayed in the web dashboard.
`src/subset_analysis.py`	Filters and analyzes data from `subjects.db` for Part 4. Data will be displayed in the web dashboard.
`.gitignore`	Files and directories to be ignored by Git
`app.py`	Creates a web dashboard using Plotly
`cell-count.csv`	Original dataset
`README.md`	This file
`requirements.txt`	Dependencies
`subjects.db`	SQLite database containing `samples`, `subjects`, and `summary` tables

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Teiko Technical Exam

Requirements

Installation

Background

Usage

Part 1: Data Management

Part 2: Initial Analysis - Data Overview

Part 3: Statistical Analysis

Part 4: Data Subset Analysis

Database Schema

Overview

Dashboard

Interactive dashboard for Teiko Exam

Assumptions

Repository Structure

About

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 78 Commits
assets		assets
src		src
.gitignore		.gitignore
README.md		README.md
app.py		app.py
cell-count.csv		cell-count.csv
requirements.txt		requirements.txt
subjects.db		subjects.db

Folders and files

Latest commit

History

Repository files navigation

Teiko Technical Exam

Requirements

Installation

Background

Usage

Part 1: Data Management

Part 2: Initial Analysis - Data Overview

Part 3: Statistical Analysis

Part 4: Data Subset Analysis

Database Schema

Overview

Dashboard

Interactive dashboard for Teiko Exam

Assumptions

Repository Structure

About

Resources

Uh oh!

Stars

Watchers

Forks

Uh oh!

Contributors

Uh oh!

Languages