Portfolio
Signal Processing & ML

Music ML — Playlist Classifier

Multi-label song classifier that predicts playlist assignments across 11 genre and mood categories. It is built on Sentence-BERT embeddings and a multi-output MLP, and is preceded by a full OCR-based data recovery pipeline that reconstructed an 18,000-song library corrupted during a platform migration.

18K songs recovered via OCR
11 playlist labels
01
OCR Extraction
Tesseract OCR runs over screenshots of the music library UI and extracts the raw title/artist text from each page image.
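A minimal sketch of this step, assuming the screenshots are image files on disk and that `pytesseract` and Pillow are installed. The function names and the per-page dictionary layout are illustrative and not taken from the project. The OCR call is passed in as a parameter so the line handling can be exercised without Tesseract.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def tesseract_ocr(image_path: Path) -> str:
    """Run Tesseract on one screenshot and return its raw text."""
    # Imported lazily so the rest of the sketch loads without OCR dependencies.
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(image_path))


def extract_pages(
    image_paths: Iterable[Path],
    ocr: Callable[[Path], str] = tesseract_ocr,
) -> Dict[str, List[str]]:
    """Map each page image name to its non-empty, stripped OCR lines."""
    pages: Dict[str, List[str]] = {}
    for path in sorted(image_paths):
        text = ocr(path)
        pages[path.name] = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return pages
```

Something like `extract_pages(Path("screenshots").glob("*.png"))` would then yield one list of raw lines per page; the `screenshots` directory name is a placeholder.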
02
OCR Cleaning & MusicBrainz Validation
Regex cleaning removes timestamps and UI fragments. Adjacent short lines are paired as Title–Artist candidates and validated against MusicBrainz (rate-limited at 1 req/sec).
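The cleaning, pairing, and rate-limited validation described above could look roughly like the following. The regex patterns, the UI fragment list, the `MAX_LEN` cutoff for a "short" line, and the `lookup` callable (standing in for a MusicBrainz query) are all assumptions made for illustration. Only the 1 request per second limit comes from the description.

```python
import re
import time
from typing import Callable, List, Sequence, Tuple

# Assumed patterns: track durations such as 3:45 or 1:02:33, and a few UI labels.
TIMESTAMP = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b")
UI_FRAGMENT = re.compile(r"^(shuffle|play|add to playlist|download)$", re.IGNORECASE)
MAX_LEN = 60  # assumed cutoff for what counts as a "short" line

Pair = Tuple[str, str]


def clean_lines(raw_lines: Sequence[str]) -> List[str]:
    """Strip timestamps and drop empty lines and known UI fragments."""
    cleaned = []
    for line in raw_lines:
        line = TIMESTAMP.sub("", line).strip(" -|\t")
        if line and not UI_FRAGMENT.match(line):
            cleaned.append(line)
    return cleaned


def pair_candidates(lines: Sequence[str]) -> List[Pair]:
    """Pair adjacent short lines as (title, artist) candidates."""
    pairs, i = [], 0
    while i + 1 < len(lines):
        title, artist = lines[i], lines[i + 1]
        if len(title) <= MAX_LEN and len(artist) <= MAX_LEN:
            pairs.append((title, artist))
            i += 2
        else:
            i += 1  # skip one line and try the next adjacent pair
    return pairs


def validate(
    pairs: Sequence[Pair],
    lookup: Callable[[str, str], bool],
    min_interval: float = 1.0,  # 1 request per second
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[Pair]:
    """Keep pairs that `lookup` confirms, spacing calls by `min_interval` seconds."""
    valid, last_call = [], None
    for title, artist in pairs:
        if last_call is not None:
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        if lookup(title, artist):
            valid.append((title, artist))
    return valid
```

Injecting `sleep` and `clock` keeps the rate limiter testable without real waiting; in practice `lookup` would wrap whichever MusicBrainz client the pipeline uses.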
03
Metadata Enrichment
Each song is enriched with Genius lyrics, Last.fm genre tags, NLTK VADER sentiment, and the NRCLex dominant emotion. The script is resume-safe: it skips rows that have already been processed.
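One way to make an enrichment script resume-safe is to append to a CSV and skip any song already present in it. The sketch below assumes a (title, artist) key, a hypothetical column layout in `FIELDS`, and an `enrich` callable standing in for the Genius, Last.fm, VADER, and NRCLex lookups; none of these names come from the project itself.

```python
import csv
from pathlib import Path
from typing import Callable, Dict, Iterable, Set, Tuple

# Assumed output columns for illustration.
FIELDS = ["title", "artist", "lyrics", "tags", "sentiment", "emotion"]


def processed_keys(out_path: Path) -> Set[Tuple[str, str]]:
    """Return the (title, artist) keys already written to the output file."""
    if not out_path.exists():
        return set()
    with out_path.open(newline="", encoding="utf-8") as f:
        return {(row["title"], row["artist"]) for row in csv.DictReader(f)}


def enrich_all(
    songs: Iterable[Tuple[str, str]],
    out_path: Path,
    enrich: Callable[[str, str], Dict[str, object]],
) -> int:
    """Append enriched rows for songs not yet in `out_path`; return rows written."""
    done = processed_keys(out_path)
    write_header = not out_path.exists() or out_path.stat().st_size == 0
    written = 0
    with out_path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        for title, artist in songs:
            if (title, artist) in done:
                continue  # already processed in an earlier run
            writer.writerow({"title": title, "artist": artist, **enrich(title, artist)})
            f.flush()  # persist each row so an interruption loses at most one song
            done.add((title, artist))
            written += 1
    return written
```

Rerunning the function after an interruption only calls `enrich` for songs that are missing from the file, which matters when each call hits rate-limited external APIs.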
04
Classification
Lyrics, tags, sentiment, and emotion are concatenated and encoded to 384-dim vectors via Sentence-BERT (all-MiniLM-L6-v2). An MLPClassifier with hidden layers of 256 and 128 units is wrapped in a MultiOutputClassifier, giving one binary classifier per playlist. Per-class probability thresholds (0.15–0.50) are tuned to balance recall against precision for each label.
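The sketch below shows how these pieces might fit together, assuming `sentence-transformers` and scikit-learn are installed. The text template in `build_text`, the two-label `LABELS` subset, and the values in `THRESHOLDS` are hypothetical: the description gives only the 0.15 to 0.50 range, not the tuned per-class values. The thresholding helpers are plain Python so they can be read independently of the model.

```python
from typing import Dict, List, Sequence

LABELS = ["lofi", "pop"]  # two of the 11 playlists, for brevity
# Hypothetical thresholds within the stated 0.15-0.50 range.
THRESHOLDS = {"lofi": 0.15, "pop": 0.50}


def build_text(lyrics: str, tags: Sequence[str], sentiment: float, emotion: str) -> str:
    """Concatenate the four feature groups into one string (assumed template)."""
    return f"{lyrics} [tags] {', '.join(tags)} [sentiment] {sentiment:.2f} [emotion] {emotion}"


def encode(texts: Sequence[str]):
    """Encode texts to 384-dimensional vectors with all-MiniLM-L6-v2."""
    from sentence_transformers import SentenceTransformer  # lazy import

    return SentenceTransformer("all-MiniLM-L6-v2").encode(list(texts))


def build_classifier():
    """One independent MLP per playlist label."""
    from sklearn.multioutput import MultiOutputClassifier
    from sklearn.neural_network import MLPClassifier

    base = MLPClassifier(hidden_layer_sizes=(256, 128), max_iter=300)
    return MultiOutputClassifier(base)


def positive_probs(proba_list) -> List[List[float]]:
    """Convert predict_proba output (one (n_samples, 2) array per label)
    into one row of positive-class probabilities per sample.
    Assumes both classes were seen in training, so column 1 is P(label = 1)."""
    n_samples = len(proba_list[0])
    return [[float(p[i][1]) for p in proba_list] for i in range(n_samples)]


def assign_playlists(
    proba_list,
    labels: Sequence[str],
    thresholds: Dict[str, float],
    default: float = 0.5,
) -> List[List[str]]:
    """Assign every label whose probability meets its own threshold."""
    return [
        [lab for lab, p in zip(labels, row) if p >= thresholds.get(lab, default)]
        for row in positive_probs(proba_list)
    ]
```

With a binary label matrix `Y` of shape (n_samples, n_labels), training would be `clf = build_classifier().fit(encode(texts), Y)`, followed by `assign_playlists(clf.predict_proba(encode(new_texts)), LABELS, THRESHOLDS)`. A lower threshold for a label trades precision for recall on that label only, since each label has its own classifier.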
Component: Details
Embedding: all-MiniLM-L6-v2 (384-dimensional Sentence-BERT vectors)
Classifier: MLPClassifier(hidden_layer_sizes=(256, 128), max_iter=300)
Wrapper: MultiOutputClassifier (independent binary classifier per label)
Class Balancing: Upsample with replacement to max class size before training
Runtime: ~10 min on Colab T4 GPU for 241K songs
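Upsampling with replacement to the largest class size can be sketched as follows. This version assumes each training row carries a single label returned by a `label_of` function, which is a simplification: how the project handles songs that belong to several playlists is not stated. It uses `random.choices` from the standard library rather than any particular library helper.

```python
import random
from collections import defaultdict
from typing import Callable, Hashable, List, Sequence, TypeVar

Row = TypeVar("Row")


def upsample_to_max(
    rows: Sequence[Row],
    label_of: Callable[[Row], Hashable],
    seed: int = 0,
) -> List[Row]:
    """Resample every class with replacement up to the largest class size."""
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)
    if not groups:
        return []

    rng = random.Random(seed)
    target = max(len(group) for group in groups.values())
    balanced: List[Row] = []
    for group in groups.values():
        balanced.extend(group)  # keep every original row
        balanced.extend(rng.choices(group, k=target - len(group)))  # draw extras
    rng.shuffle(balanced)
    return balanced
```

Balancing should be applied to the training split only, so that duplicated rows do not leak into validation data.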
rap: Genre
country: Genre
pop: Genre
edm / alt rock: Genre
christian: Theme
christmas: Seasonal
lofi: Mood / Style
feels: Mood
movie: Context
star: Curated
vocalist / instrumental: Style