Elephant Rumble Isolation — “Visual Ear”

What inspired me

Elephants communicate with powerful, low-frequency rumbles that can travel long distances — but the field recordings that capture them often include the same kinds of low-frequency energy from modern life (cars, airplanes, generators). For this hackathon challenge, I wanted to build something that’s both useful to researchers and trustworthy to judges: a tool that cleans recordings while making every step visible.

The brief (and scientific context) provided by Dr. Mickey Pardo and the organizers pushed me toward an approach that’s interpretable DSP first, rather than a black-box model. That constraint ended up being the most interesting part of the project.

What I learned

  • How much STFT parameter choices matter for infrasonic and other very-low-frequency signals.
  • How to build a practical time–frequency masking pipeline that preserves harmonics while suppressing stationary mechanical noise.
  • How to make signal-processing work demo-friendly: a UI that shows raw vs. cleaned audio and what was removed, so the user can trust the result.
  • How messy real data is: mixed sample rates, inconsistent filenames, and timecode spreadsheets that require robust parsing.

How I built the project

This repo has several “faces” built around the same core isolation method:

1) Notebook exploration (Elephant_Voices.ipynb)

  • Used as the experiment lab: iterate on preprocessing, spectrogram settings, and quick feature tests (including clustering experiments) without breaking the demo tools.

2) A repeatable isolation core (rumble.py)

  • Preprocess
    • Convert to mono, resample to a target rate (8 kHz), and band-pass filter (default 5–1200 Hz).
  • High-resolution STFT for low fundamentals
    Low-frequency discrimination depends on frequency resolution: $$ \Delta f = \frac{f_s}{N_{\mathrm{FFT}}} $$ With $f_s = 8000\,\mathrm{Hz}$ and $N_{\mathrm{FFT}} = 8192$, $\Delta f \approx 0.98\,\mathrm{Hz}$ — fine enough to distinguish rumble fundamentals in the ~10–20 Hz range.
  • Harmonic comb emphasis (template matching)
    For each candidate fundamental $f_0 \in [8, 25]$ Hz, I build a “comb” template that boosts bins near harmonics: $$ T_{f_0}(f) = \sum_{k=1}^{K}\exp\left(-\frac{(f-kf_0)^2}{2\sigma^2}\right) $$ Then each time frame picks the best $f_0$ by scoring $T_{f_0}$ against the magnitude spectrum.
  • Noise-floor model
    Mechanical noise is often stationary (or slowly varying), so I estimate a per-frequency background using a robust statistic (median over time). The Dash UI can optionally sample a “pure noise” region to improve this estimate.
  • Soft mask (keep what looks harmonic, suppress the rest)
    The project uses a soft mask that trades off harmonic energy against the estimated background: $$ W = \left(\frac{S_{\mathrm{harm}}}{S_{\mathrm{harm}}+\alpha(B+\varepsilon)}\right)^p $$ where $S_{\mathrm{harm}}$ is the harmonic-weighted magnitude, $B$ is the noise floor, $\alpha$ controls aggressiveness, $p$ sharpens the mask, and $\varepsilon$ is a small constant that prevents division by zero. A code sketch of the full pipeline follows this list.
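
Putting the steps together, here is a minimal sketch of the isolation core with NumPy/SciPy. It is illustrative rather than a copy of rumble.py: the function name, hop size, comb width $\sigma$, and the defaults for $\alpha$ and $p$ are assumptions, not the repo's actual values.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, stft, istft, resample_poly

def isolate_rumble(x, sr, target_sr=8000, n_fft=8192, hop=2048,
                   f0_range=(8.0, 25.0), n_harmonics=8, sigma=1.0,
                   alpha=1.0, p=2.0):
    """Preprocess -> STFT -> comb scoring -> noise floor -> soft mask -> iSTFT."""
    # Preprocess: mono, resample to the target rate, band-pass 5-1200 Hz.
    if x.ndim > 1:
        x = x.mean(axis=1)
    if sr != target_sr:
        x = resample_poly(x, target_sr, sr)
        sr = target_sr
    sos = butter(4, [5, 1200], btype="bandpass", fs=sr, output="sos")
    x = sosfiltfilt(sos, x)

    # High-resolution STFT: bin spacing df = fs / n_fft ~ 0.98 Hz.
    freqs, _, Z = stft(x, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    S = np.abs(Z)

    # Comb templates T_f0(f) for candidate fundamentals in [8, 25] Hz.
    f0s = np.arange(f0_range[0], f0_range[1] + 0.5, 0.5)
    combs = np.zeros((len(f0s), len(freqs)))
    for i, f0 in enumerate(f0s):
        for k in range(1, n_harmonics + 1):
            combs[i] += np.exp(-((freqs - k * f0) ** 2) / (2 * sigma**2))

    # Per frame, score every template and keep the best comb as a weighting.
    scores = combs @ S                    # (n_f0s, n_frames)
    best = scores.argmax(axis=0)          # winning f0 index per frame
    S_harm = combs[best].T * S            # harmonic-weighted magnitude

    # Noise floor: robust per-frequency background (median over time).
    B = np.median(S, axis=1, keepdims=True)

    # Soft mask W = (S_harm / (S_harm + alpha*(B + eps)))**p, applied to the STFT.
    eps = 1e-10
    W = (S_harm / (S_harm + alpha * (B + eps))) ** p
    _, y = istft(W * Z, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    return y, sr
```

Scoring every comb against every frame is a single dense matrix product, which keeps the per-frame $f_0$ search cheap even at this FFT size.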

3) A CLI for repeatability (cli.py)

  • Turns the pipeline into a hackathon-ready tool: run on a file or folder, optionally slice by a calls CSV, save cleaned WAVs, plots, and a report.
  • Includes presets for common noise categories: airplane, car, generator.

4) A “Visual Ear” demo UI (dash_app.py)

  • Upload audio + a calls spreadsheet (CSV/TSV/XLSX).
  • Click “segment pills” to select start/end times.
  • Compare RAW vs CLEANED spectrograms (and optionally “removed audio”).
  • Use a spectral brush for targeted attenuation (non-destructive; see the sketch after this list).
  • Export a ZIP with artifacts (clean audio, spectrogram, metadata).
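
To make “non-destructive” concrete, here is a sketch of the brush idea (an assumed shape; dash_app.py's internals may differ): the brush only records a time–frequency rectangle and a gain, and the attenuation is applied to a copy of the STFT at render/export time, so the raw audio is never modified.

```python
import numpy as np

def apply_brush(Z, freqs, times, f_lo, f_hi, t_lo, t_hi, gain_db=-20.0):
    """Attenuate one time-frequency rectangle in a *copy* of the STFT Z."""
    Z = Z.copy()                                  # leave the original untouched
    fm = (freqs >= f_lo) & (freqs <= f_hi)        # frequency rows in the brush
    tm = (times >= t_lo) & (times <= t_hi)        # time columns in the brush
    Z[np.ix_(fm, tm)] *= 10 ** (gain_db / 20.0)   # dB -> linear gain
    return Z
```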

5) A realtime prototype (realtime_local.py, realtime_colab.py)

  • Streams audio from a mic (or Colab/Gradio) and runs the same style of harmonic detection + masking in short windows.
  • Intentionally outputs safe, descriptive events (e.g., “rumble started”, “overlap likely”) rather than claiming true semantic translation; a minimal sketch of this loop follows below.
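
As an illustration of that loop, a sketch assuming the sounddevice library and a hypothetical comb-score threshold (realtime_local.py's actual detection logic and event set may differ; this emits only the simplest events):

```python
import numpy as np
import sounddevice as sd  # assumed dependency for mic capture

SR, BLOCK, THRESH = 8000, 8192, 3.0   # ~1 s blocks; threshold is hypothetical
state = {"active": False}

# Same comb idea as offline: templates for f0 in [8, 25] Hz, 8 harmonics.
freqs = np.fft.rfftfreq(BLOCK, 1 / SR)
combs = np.stack([
    sum(np.exp(-((freqs - k * f0) ** 2) / 2.0) for k in range(1, 9))
    for f0 in np.arange(8.0, 25.5, 0.5)
])

def callback(indata, frames, time_info, status):
    mag = np.abs(np.fft.rfft(indata[:, 0] * np.hanning(frames)))
    score = (combs @ mag).max() / (mag.mean() + 1e-10)  # best comb vs. average
    if score > THRESH and not state["active"]:
        state["active"] = True
        print("rumble started")
    elif score < THRESH and state["active"]:
        state["active"] = False
        print("rumble ended")

with sd.InputStream(samplerate=SR, channels=1, blocksize=BLOCK,
                    callback=callback):
    sd.sleep(10_000)  # listen for 10 seconds
```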

Challenges I faced

  • Low-frequency physics meets digital constraints: capturing 10–20 Hz structure requires long analysis windows, but long windows reduce time resolution. Finding a workable balance was iterative (the worked numbers after this list make the tradeoff concrete).
  • Noise overlaps the signal: vehicles and generators can occupy the same bands as the rumble fundamental/harmonics. Over-aggressive removal can “erase” the elephant, while under-removal leaves the recording unusable.
  • Making it trustworthy: in scientific contexts, “it sounds cleaner” isn’t enough. The UI had to show before/after and what was removed so a user can sanity-check artifacts.
  • Hackathon practicality: reading varied spreadsheets, handling mixed sampling rates, and keeping the UI responsive mattered as much as the signal processing itself.
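
To put numbers on the first tradeoff, take the defaults above ($f_s = 8000\,\mathrm{Hz}$, $N_{\mathrm{FFT}} = 8192$) and an assumed hop of $H = 2048$ samples:

$$ T_{\mathrm{win}} = \frac{N_{\mathrm{FFT}}}{f_s} \approx 1.02\,\mathrm{s}, \qquad \Delta t = \frac{H}{f_s} = 0.256\,\mathrm{s} $$

Each frame sees about a second of audio to buy sub-hertz frequency resolution, so the analysis advances only about four times per second: coarse in time, but workable for rumbles that last several seconds.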

What’s next

  • Add objective evaluation metrics (SNR improvement, harmonic retention scores) and a small benchmark set.
  • Improve overlap handling with more explicit multi-source separation (still interpretable).
  • Explore a supervised “meaning” model only after building a labeled dataset, while keeping the current pipeline as a transparent front-end.
