HELLO!

My name is Dongim, and I am passionate about data analysis and model training with machine learning and AI! With a strong background in natural language processing, geospatial analysis, and autonomous systems, I am highly motivated to solve real-world problems through impactful technologies.

8

Patents applied for

100+

Students tutored

5

Research experience

12+

Years of Playing Guitar

3

AI-related certifications

3rd

Degree in Taekwondo

SKILLS

PROGRAMMING: Python • C/C++ • R • MATLAB • HTML

MACHINE LEARNING: PyTorch • TensorFlow • NLTK • OpenCV • Scikit-learn

DEVOPS & TOOLS: AWS • Docker • VMware • PostgreSQL • Linux/Unix • Web Scraping • Django

PROJECTS

ROAD SAFETY INTELLIGENCE WITH AUGMENTED LLM

MIT Break Through Tech, Michelin Mobility Intelligence

Develop a natural language interface chatbot for geospatial analysis of LA crash data.

LangChain

Geospatial Analysis

Streamlit

Developing a natural language interface for geospatial analysis of LA crash data, using LangChain function calling to handle complex queries against the geospatial dataset. Performed exploratory data analysis (EDA) and integrated Points of Interest data with crash data, using the Haversine formula for accurate distance calculations, to enable automated spatial data extraction and analysis with LLMs. Developing a responsive Streamlit interface that integrates users' real-time GPS location to provide robust, interactive geospatial insights.
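As an illustration of the distance step, here is a minimal sketch of the Haversine formula applied to a crash location and a few Points of Interest. The function names, the sample coordinates, and the 0.5 km radius are hypothetical placeholders, not the project's actual code or data.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def pois_within(crash, pois, radius_km):
    """Return names of POIs within radius_km of a crash given as (lat, lon)."""
    return [name for name, lat, lon in pois
            if haversine_km(crash[0], crash[1], lat, lon) <= radius_km]


# Hypothetical sample data: one crash and two POIs near downtown LA.
crash = (34.05, -118.25)
pois = [("poi_a", 34.051, -118.25), ("poi_b", 34.15, -118.25)]
nearby = pois_within(crash, pois, radius_km=0.5)
```

For a full crash dataset, a vectorized or spatially indexed version would usually be preferable to this pairwise loop.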

FINE-TUNING ASR MODELS ON STUTTERING RECORDINGS

Olin Public Interest Technology

Fine-tune ASR models to reduce bias against stuttered speech.

ASR Models

SageMaker

LibriSpeech

Leading a team to address bias in Automatic Speech Recognition (ASR) models against stuttered speech using LibriSpeech/Stutter data. Identified disparities in Word Error Rate (WER): on stuttered speech, OpenAI’s Whisper showed a 2x increase and Facebook’s Wav2Vec a 6.4x increase. Reduced Whisper’s WER by 3% and Wav2Vec’s by 33% by removing repeated words. Fine-tuned Wav2Vec on AWS SageMaker, built word tokenizers covering 2268 characters for Chinese stuttered speech, and uploaded the model to Hugging Face.
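To make the WER comparison concrete, the sketch below computes WER as word-level edit distance divided by reference length, and shows how collapsing consecutive repeated words changes the score. It assumes the repeated-word removal is applied to consecutive duplicates in the transcript; the helper names and the sample sentences are hypothetical.

```python
def collapse_repeats(words):
    """Drop consecutive duplicate words, e.g. ['i', 'i', 'want'] -> ['i', 'want']."""
    out = []
    for word in words:
        if not out or out[-1] != word:
            out.append(word)
    return out


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: an ASR transcript that keeps stuttered repetitions.
reference = "i want to go home"
transcript = "i i i want to to go home"
raw_wer = wer(reference, transcript)
cleaned_wer = wer(reference, " ".join(collapse_repeats(transcript.split())))
```

One caveat of this approach is that it also removes legitimate repetitions (for example "that that"), so it can introduce deletions in fluent speech.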

BENCHMARKING STUTTERING RECORDINGS AGAINST ASR MODELS

Boston University, AImpower.org

Assess bias in Mandarin speech recognition using ASR models.

ASR Models

NLTK

Pydub

Evaluated leading ASR models (Whisper, Google Speech-to-Text, Wav2Vec, Azure, WeNet) to assess bias in recognizing Mandarin stuttered speech. Segmented 50+ hours of Mandarin speech data with labeled transcriptions using Pydub, processed it on BU’s Shared Computing Cluster, addressed hallucinations, and calculated WER, CER, BLEU (NLTK), WordNet Wu-Palmer Similarity, and GloVe Cosine Similarity. Demonstrated that more stutter segments lead to higher error rates; WeNet achieved a WER of 0.30, outperforming Wav2Vec's 0.52. Analyzed model performance across stutter types, revealing a 0.2 WER difference between sound repetitions and interjections. Authored comprehensive reports explaining model performance and bias analysis, including technical visualizations. Conducted weekly meetings and presentations with the client company’s CEO.

INVERTED PENDULUM ROBOT SIMULATION TEAM

Olin Autonomous Robot Training Lab

Train an inverted pendulum robot to balance in the upright position.

Reinforcement Learning

W&B

Hyperparameter Sweeps

Developed a custom Gym environment to train an inverted pendulum robot to balance in the upright position, featuring real-time simulation visualization. Applied the Proximal Policy Optimization (PPO) reinforcement learning (RL) algorithm, performing hyperparameter sweeps (e.g., learning rate, reward function, entropy coefficient) to optimize model performance. Used Weights & Biases for machine learning experiment tracking. Solved the CartPole swing-up problem, achieving stable balance in simulation through iterative RL training.
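The sweep over learning rate, reward function, and entropy coefficient can be pictured as a grid of configurations, one per training run. The sketch below is a plain-Python illustration only: the candidate values, the two reward functions, and the convention that theta = 0 means upright are all hypothetical, and it does not use the Weights & Biases API.

```python
import math
from itertools import product


def reward_cosine(theta, theta_dot):
    """Reward is highest (1.0) when the pole is upright (theta = 0, assumed convention)."""
    return math.cos(theta)


def reward_cosine_minus_velocity(theta, theta_dot):
    """Same as above, with a small penalty on angular velocity to discourage spinning."""
    return math.cos(theta) - 0.01 * theta_dot ** 2


REWARD_FNS = {
    "cosine": reward_cosine,
    "cosine_minus_velocity": reward_cosine_minus_velocity,
}

# Hypothetical search space; the real sweep values are not stated in the text.
sweep_space = {
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "entropy_coef": [0.0, 0.01],
    "reward_fn": list(REWARD_FNS),
}


def grid(space):
    """Yield one config dict per combination of hyperparameter values."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


configs = list(grid(sweep_space))  # each entry would drive one PPO training run
```

A grid is the simplest strategy; random or Bayesian search over the same space is a common alternative when the grid grows large.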

AI-GENERATED IMAGE CLASSIFICATION MODEL

ML Class Project

Build a deep learning model to classify AI-generated images from MidJourney.

CNN

Image Classification

MidJourney

Developed a CNN-based deep learning model to distinguish AI-generated images from MidJourney from real images across 11 classes. Applied data augmentation to improve generalization and used max-pooling layers to help limit overfitting. Achieved up to 80% accuracy on test data, demonstrating the model's capability to distinguish between real and AI-generated images.

UNDP SUDAN 2024 CONFLICT EVENTS ANALYSIS

United Nations Development Programme

Analyzed the relationship between conflict events, refugee movements, and food insecurity in Sudan.

H3 Indexing

Correlation Matrix

Data Visualization

Analyzed and visualized the relationship between conflict events, refugee movements, and food insecurity in Sudan from 2019 to 2024, using H3 hexagonal indexing for geospatial data analysis to identify conflict hotspots, trends, and humanitarian impacts across regions. Calculated correlation matrices, which revealed strong correlations between the variables (e.g., r=0.85 between conflict events and food insecurity levels by region, indicating that areas with more conflict tend to experience more severe food insecurity).
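An r value like the one quoted above is typically a Pearson correlation coefficient computed over per-region values. The sketch below shows that calculation under this assumption; the function name and the per-region numbers are made-up placeholders, not the project's data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Placeholder per-region values (one entry per region), for illustration only.
conflict_events = [12, 45, 30, 80, 5]
food_insecurity = [2.1, 3.4, 3.0, 4.2, 1.8]
r = pearson_r(conflict_events, food_insecurity)
```

A value near 1 indicates that regions with more conflict events also tend to have higher food insecurity scores; it does not by itself establish causation.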

CERTIFICATIONS

MACHINE LEARNING FOUNDATIONS

Cornell University, Jul 2024

DEEP LEARNING WITH PYTORCH: GENERATIVE ADVERSARIAL NETWORK

Coursera, Jun 2024

MACHINE LEARNING WITH PYTHON

IBM, Dec 2023