Sparsh Marwah

Data Scientist

ML Pipelines · GenAI · MLOps

I build production ML systems — from AWS-native AutoML pipelines to agentic RAG applications — that turn messy data into measurable business impact.

About Me

A data scientist turning models into measurable outcomes — from AutoML pipelines to GenAI-powered assistants.

I'm a Data Scientist at Squark AI in Boston, where I architect AWS-native ML pipelines that train 35% faster on diverse structured and unstructured datasets. My work blends classical ML with modern GenAI — engineering custom preprocessing frameworks with BERT and Word2Vec embeddings, and validating prediction granularity through clustering and rigorous A/B testing.

Before grad school, I spent two years at Tredence Analytics in Bengaluru building predictive churn models and end-to-end MLOps pipelines in Databricks for a top-tier retail client with a 10M+ user base. That experience taught me that the fastest model rarely wins — the one that's deployed, monitored, and trusted does.

I hold an M.S. in Data Analytics Engineering from Northeastern University (GPA 3.8/4.0) with coursework in Data Mining, MLOps, and Data Management. I'm passionate about agentic AI, drift monitoring, and the unglamorous engineering work that keeps models alive in production.

3+ Years Experience

10M+ User Scale Impact

35% Faster Training Pipelines

3.8 GPA at Northeastern

Work Experience

3+ years across AI startups, enterprise analytics consulting, and public-sector data — shipping ML that moves business metrics.

Data Scientist

Jun 2025 – Present

Squark AI · Remote, United States

Accelerated model training speeds by 35% across diverse structured and unstructured datasets by developing an AWS cloud-native ML pipeline using H2O AutoML.
Boosted predictive accuracy by 12% by engineering a custom preprocessing framework with NLP embeddings (BERT, Word2Vec).
Reduced data retrieval latency by 25% and supported production reliability by integrating MinIO and S3 for artifact versioning.
Improved AUC by 15% by integrating clustering to redefine prediction granularity, validated via A/B testing.

Data Science Analyst

Jun 2021 – Jul 2023

Tredence Analytics Solutions · Bengaluru, India

Increased user retention by 18% for a top-tier retail client (10M+ user base) by developing predictive churn models with Random Forest and XGBoost.
Reduced deployment time by 40% by engineering end-to-end MLOps pipelines in Databricks with automated validation checks.
Engineered a 15% lift in cross-category product performance by designing A/B and multivariate tests and evaluating them with t-tests.
Enhanced KPI accuracy by 20% via real-time inference pipelines with automated drift monitoring and retraining.

Data Analyst Intern

Jun 2019 – Dec 2019

SJVN Ltd. · Shimla, India

Ensured 100% data integrity during inventory migrations by developing SQL queries for data validation.
Improved forecasting accuracy by 20% to reduce inventory stockouts via scikit-learn models in Python.
Enabled 20+ data-driven decisions weekly by creating interactive Tableau dashboards.

Skills & Technologies

A full-stack data toolkit — from SQL and PySpark to BERT embeddings, MLflow, and AWS.

Programming & Visualization

Python
Tableau
Power BI
Matplotlib
Seaborn

Statistical Analysis

Linear Regression
Time-Series
A/B Testing
Multivariate Testing
SHAP
Statistical Modeling

MLOps & Model Management

MLflow
Docker
MinIO
S3
Drift Monitoring
Model Validation

Data Engineering & Cloud

SQL
PySpark
Snowflake
BigQuery
Redshift
AWS
GCP
Databricks

Machine Learning & AI

Supervised Learning
Unsupervised Learning
Feature Engineering
Applied GenAI
Clustering
AutoML

GenAI Stack

LangChain
CrewAI
RAG
BERT
Word2Vec
H2O AutoML

Featured Projects

A snapshot of recent work — agentic GenAI, predictive modeling, and recommendation engines.

AI-Powered Resume & Job Description Matching Assistant

Agentic RAG system using CrewAI and LangChain to automate context-aware resume scoring and tailored cover letter generation.

CrewAI
LangChain
OpenAI
ChromaDB
Streamlit

Customer Churn Prediction

85%-accurate XGBoost model for identifying high-risk customer segments. Surfaces key retention drivers through statistical analysis to help businesses act before churn happens.

XGBoost
Statistical Analysis
Predictive Modeling
SHAP

Yelp Sentiment Analyzer & Recommender

91.4%-accurate neural network for sentiment classification on unstructured Yelp reviews, paired with a hybrid recommendation engine using KNN, SVD, and sentiment embeddings.

Neural Networks
NLP
KNN
SVD
Recommenders

JobRADAR — Big Tech Job Crawler

A job-notification agent that crawls Big Tech career pages every 15 minutes and pushes instant, personalized alerts. Reduces noise — finds roles that actually match.

Python
Web Scraping
Automation
Agents

Liver Cirrhosis Survival Prediction

Survival classification model for liver-cirrhosis patients — predicts outcomes from clinical features using a tuned classification pipeline. Healthcare ML in practice.

scikit-learn
Healthcare
Classification
Feature Engineering

MLOps — Machine Learning in Production

A working reference for productionizing ML models — covers the operational glue around training, serving, and monitoring that keeps models reliable post-deployment.

MLflow
Docker
CI/CD
Monitoring

Education

Formal training at the intersection of analytics, machine learning, and engineering.

Northeastern University

Master of Science in Data Analytics Engineering

Boston, MA · Sep 2023 – May 2025

GPA: 3.8 / 4.0

Coursework: Data Management in Analytics, Data Mining in Engineering, Machine Learning Operations

SRM Institute of Science & Technology

Bachelor of Technology in Computer Science and Engineering

Chennai, Tamil Nadu, India · Jul 2017 – May 2021

Get In Touch