HELLO!
My name is Dongim, and I am passionate about data analysis and model training with machine learning and AI! With a strong background in natural language processing, geospatial analysis, and autonomous systems, I am highly motivated to solve real-world problems with impactful technology.
8
Patents applied for
100+
Students tutored
5
Research experiences
12+
Years of Playing Guitar
3
AI-related certifications
3rd
Degree in Taekwondo
SKILLS
PROGRAMMING: Python • C/C++ • R • MATLAB • HTML
MACHINE LEARNING: PyTorch • TensorFlow • NLTK • OpenCV • Scikit-learn
DEVOPS & TOOLS: AWS • Docker • VMware • PostgreSQL • Linux/Unix • Web Scraping • Django
PROJECTS
ROAD SAFETY INTELLIGENCE WITH AUGMENTED LLM
MIT Break Through Tech, Michelin Mobility Intelligence
Develop a natural language interface chatbot for geospatial analysis of LA crash data.
Developing a natural language interface for geospatial analysis of LA crash data, using LangChain function calling to handle complex queries over the geospatial dataset. Performed exploratory data analysis (EDA) and joined Points of Interest data with crash data, using the Haversine formula for great-circle distance calculations, to enable automated spatial data extraction and analysis with LLMs. Building a responsive Streamlit interface that incorporates users' real-time GPS location to provide interactive geospatial insights.
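As a rough illustration of the Haversine step, the sketch below computes the great-circle distance between a crash location and each point of interest, then keeps the points within a given radius. The function names, tuple layout, and sample coordinates are hypothetical and are not taken from the project's code.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def pois_within(crash, pois, radius_km):
    """Return names of POIs within radius_km of a crash.

    crash is a (lat, lon) tuple; pois is a list of (name, lat, lon) tuples.
    Both layouts are illustrative only.
    """
    lat, lon = crash
    return [name for name, plat, plon in pois
            if haversine_km(lat, lon, plat, plon) <= radius_km]


# Hypothetical sample data, roughly in the Los Angeles area.
crash = (34.05, -118.25)
pois = [("poi_a", 34.05, -118.25), ("poi_b", 35.05, -118.25)]
nearby = pois_within(crash, pois, radius_km=1.0)
```

A brute-force scan like this is fine for small datasets; a spatial index would be a natural next step for larger ones.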
FINE-TUNING ASR MODELS ON STUTTERING RECORDINGS
Olin Public Interest Technology
Fine-tune ASR models to reduce bias against stuttered speech.
Leading a team to address bias in Automatic Speech Recognition (ASR) models against stuttered speech using LibriSpeech/Stutter data. Identified disparities in Word Error Rate (WER): OpenAI’s Whisper showed a 2x increase and Facebook’s Wav2Vec a 6.4x increase when transcribing stuttered speech. Reduced Whisper’s WER by 3% and Wav2Vec’s by 33% by removing repeated words. Fine-tuned Wav2Vec on AWS SageMaker, built word tokenizers covering 2268 characters for Chinese stuttered speech, and uploaded the model to Hugging Face.
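To make the WER comparison concrete, here is a minimal sketch of word-level WER (edit distance divided by reference length) plus one possible way of removing repeated words, namely collapsing consecutive duplicates. The exact post-processing used in the project is not described here, so `collapse_repeats` is an assumption, and the sample sentences are invented.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def collapse_repeats(text):
    """Drop consecutive duplicate words (assumed stand-in for 'removing repeated words')."""
    out = []
    for word in text.split():
        if not out or out[-1] != word:
            out.append(word)
    return " ".join(out)


reference = "i want to go home"
hypothesis = "i i i want to to go home"   # invented transcript with repetitions
raw_wer = wer(reference, hypothesis)
cleaned_wer = wer(reference, collapse_repeats(hypothesis))
```

One caveat of this simple rule is that it also removes legitimate repetitions (for example "that that"), so it may slightly change correct transcripts.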
BENCHMARKING STUTTERING RECORDINGS AGAINST ASR MODELS
Boston University, AImpower.org
Assess bias in Mandarin speech recognition using ASR models.
Evaluated leading ASR models (Whisper, Google Speech-to-Text, Wav2Vec, Azure, WeNet) to assess bias in recognizing Mandarin stuttered speech. Segmented 50+ hours of Mandarin speech data with labeled transcriptions using Pydub, processed it on BU’s Shared Computing Cluster, addressed hallucinations, and calculated WER, CER, BLEU (NLTK), WordNet Wu-Palmer Similarity, and GloVe Cosine Similarity. Demonstrated that more stutter segments lead to higher error rates; WeNet achieved a WER of 0.30, outperforming Wav2Vec's 0.52. Analyzed model performance across stutter types, revealing a 0.2 WER difference between sound repetitions and interjections. Authored comprehensive reports explaining model performance and bias analysis, including technical visualizations, and conducted weekly meetings and presentations with the client company’s CEO.
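For the embedding-based metric, one common approach is to average the word vectors of each transcript and compare the averages with cosine similarity. The sketch below assumes that approach; the tiny two-dimensional vectors stand in for real GloVe embeddings, and the helper names are hypothetical.

```python
import math


def mean_vector(tokens, embeddings):
    """Average the vectors of tokens found in the embedding table; None if none are found."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[k] for v in vecs) / len(vecs) for k in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy embedding table (NOT real GloVe vectors).
toy_embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
ref_vec = mean_vector(["a", "b"], toy_embeddings)
hyp_vec = mean_vector(["a"], toy_embeddings)
similarity = cosine_similarity(ref_vec, hyp_vec)
```

Unlike WER, this score does not depend on word order, which is why it can complement edit-distance metrics when a transcript is semantically close but worded differently.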
INVERTED PENDULUM ROBOT SIMULATION TEAM
Olin Autonomous Robot Training Lab
Train an inverted pendulum robot to balance in the upright position.
Developed a custom Gym environment to train an inverted pendulum robot to balance in the upright position, featuring real-time simulation visualization. Applied the Proximal Policy Optimization (PPO) reinforcement learning (RL) algorithm, performing hyperparameter sweeps (e.g., learning rate, reward function, entropy coefficient) to optimize model performance, and used Weights & Biases for experiment tracking. Successfully solved the CartPole swing-up problem, achieving stable balance in simulation through iterative RL training.
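The shape of a custom environment can be sketched without any dependencies. The class below follows the classic Gym convention of `reset()` returning an observation and `step()` returning `(observation, reward, done, info)`; newer Gymnasium releases use a slightly different signature. It models a simplified torque-driven pendulum rather than the project's actual cart-and-pole robot, and all parameters and the reward are assumptions chosen for illustration.

```python
import math


class PendulumEnv:
    """Gym-style environment for a torque-driven pendulum; theta = 0 is upright."""

    def __init__(self, dt=0.02, gravity=9.81, length=1.0, mass=1.0,
                 max_torque=2.0, max_steps=500):
        self.dt, self.gravity, self.length, self.mass = dt, gravity, length, mass
        self.max_torque, self.max_steps = max_torque, max_steps
        self.theta, self.omega, self.steps = math.pi, 0.0, 0

    def _obs(self):
        # cos/sin encoding avoids the discontinuity at +/- pi
        return (math.cos(self.theta), math.sin(self.theta), self.omega)

    def reset(self):
        self.theta, self.omega, self.steps = math.pi, 0.0, 0  # start hanging down
        return self._obs()

    def step(self, torque):
        torque = max(-self.max_torque, min(self.max_torque, torque))
        # With theta measured from upright, gravity pushes the pendulum away from theta = 0.
        alpha = (self.gravity / self.length) * math.sin(self.theta) \
            + torque / (self.mass * self.length ** 2)
        self.omega += alpha * self.dt          # semi-implicit Euler integration
        self.theta += self.omega * self.dt
        self.steps += 1
        reward = math.cos(self.theta)          # +1 when upright, -1 when hanging down
        done = self.steps >= self.max_steps
        return self._obs(), reward, done, {}


env = PendulumEnv(max_steps=3)
obs = env.reset()
done = False
while not done:
    obs, reward, done, info = env.step(0.0)  # zero-torque placeholder policy
```

An RL algorithm such as PPO would replace the zero-torque placeholder with actions sampled from its policy, and the reward function is one of the design choices a hyperparameter sweep might vary.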
AI-GENERATED IMAGE CLASSIFICATION MODEL
ML Class Project
Build a deep learning model to classify AI-generated images from MidJourney.
Developed a CNN-based deep learning model to distinguish AI-generated images from MidJourney from real images, across 11 classes. Applied data augmentation to improve generalization and used max-pooling layers to downsample feature maps and help limit overfitting. Achieved up to 80% accuracy on test data, demonstrating the model's ability to separate real and AI-generated images.
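Two of the building blocks mentioned above can be shown in plain Python on a small grayscale "image" stored as nested lists. This is only an illustration of what a horizontal-flip augmentation and a 2x2 max-pooling step compute; the actual model presumably relied on a deep learning framework, and the specific augmentations used are not stated, so the flip is an assumed example.

```python
def hflip(image):
    """Horizontal flip, a common data augmentation (assumed example)."""
    return [row[::-1] for row in image]


def max_pool_2x2(image):
    """2x2 max pooling with stride 2; a trailing odd row or column is dropped."""
    rows, cols = len(image), len(image[0])
    return [
        [max(image[r][c], image[r][c + 1], image[r + 1][c], image[r + 1][c + 1])
         for c in range(0, cols - 1, 2)]
        for r in range(0, rows - 1, 2)
    ]


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
pooled = max_pool_2x2(image)   # each output value is the max of one 2x2 block
flipped = hflip(image)
```

Pooling halves each spatial dimension, so later layers see fewer activations, while flipping gives the model additional training views of the same image.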
UNDP SUDAN 2024 CONFLICT EVENTS ANALYSIS
United Nations Development Programme
Analyze the relationship between conflict events, refugee movements, and food insecurity in Sudan.
Analyzed and visualized the relationship between conflict events, refugee movements, and food insecurity in Sudan from 2019 to 2024, using H3 hexagonal indexing for geospatial analysis to identify conflict hotspots, trends, and humanitarian impacts across regions. Calculated correlation matrices, which revealed strong correlations between the variables (e.g., r = 0.85 between conflict events and food insecurity levels by region, indicating that areas with more conflict tend to experience more severe food insecurity).
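Assuming the r value quoted above is a Pearson correlation coefficient computed over per-region totals, the calculation can be sketched as follows. The function name and the sample numbers are made up for illustration and are not data from the analysis.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Invented per-region values: conflict event counts and a food insecurity score.
conflict_events = [12, 40, 7, 55, 23]
food_insecurity = [2.1, 3.8, 1.9, 4.5, 3.0]
r = pearson_r(conflict_events, food_insecurity)
```

A correlation matrix is simply this coefficient computed for every pair of variables; a high r describes association between regions and does not by itself establish causation.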