Hi, I'm

Sparsh Marwah

Data Scientist

ML Pipelines · GenAI · MLOps

I build production ML systems — from AWS-native AutoML pipelines to agentic RAG applications — that turn messy data into measurable business impact.

Currently at Squark AI: healthcare and retail analytics, cloud-native ML pipelines
3+ years building ML systems
10M+ user scale impact
M.S., Northeastern University

About Me

A data scientist turning models into measurable outcomes — from AutoML pipelines to GenAI-powered assistants.

I'm a Data Scientist at Squark AI in Boston, where I architect AWS-native ML pipelines that train 35% faster on diverse structured and unstructured datasets. My work blends classical ML with modern GenAI — engineering custom preprocessing frameworks with BERT and Word2Vec embeddings, and validating prediction granularity through clustering and rigorous A/B testing.

Before grad school, I spent two years at Tredence Analytics in Bengaluru building predictive churn models and end-to-end MLOps pipelines in Databricks for a top-tier retail client with a 10M+ user base. That experience taught me that the fastest model rarely wins — the one that's deployed, monitored, and trusted does.

I hold an M.S. in Data Analytics Engineering from Northeastern University (GPA 3.8/4.0) with coursework in Data Mining, MLOps, and Data Management. I'm passionate about agentic AI, drift monitoring, and the unglamorous engineering work that keeps models alive in production.

3+ Years Experience
10M+ User Scale Impact
35% Faster Training Pipelines
3.8 GPA at Northeastern

Work Experience

3+ years across AI startups, enterprise analytics consulting, and public-sector data — shipping ML that moves business metrics.

Data Scientist

Jun 2025 – Present

Squark AI · Remote, United States

  • Accelerated model training by 35% across diverse structured and unstructured datasets by developing an AWS cloud-native ML pipeline using H2O AutoML.
  • Boosted predictive accuracy by 12% by engineering a custom preprocessing framework with NLP embeddings (BERT, Word2Vec).
  • Reduced data retrieval latency by 25% and supported production reliability by integrating MinIO and S3 for artifact versioning.
  • Improved AUC by 15% by integrating clustering to redefine prediction granularity, with the change validated via A/B testing.

Data Science Analyst

Jun 2021 – Jul 2023

Tredence Analytics Solutions · Bengaluru, India

  • Increased user retention by 18% for a top-tier retail client (10M+ user base) by developing predictive churn models with Random Forest and XGBoost.
  • Reduced deployment time by 40% by engineering end-to-end MLOps pipelines in Databricks with automated validation checks.
  • Drove a 15% lift in cross-category product performance by designing A/B and multivariate tests and evaluating them with t-tests.
  • Enhanced KPI accuracy by 20% via real-time inference pipelines with automated drift monitoring and retraining.

Data Analyst Intern

Jun 2019 – Dec 2019

SJVN Ltd. · Shimla, India

  • Ensured 100% data integrity during inventory migrations by developing SQL queries for data validation.
  • Improved forecasting accuracy by 20% to reduce inventory stockouts via scikit-learn models in Python.
  • Enabled 20+ data-driven decisions weekly by creating interactive Tableau dashboards.

Skills & Technologies

A full-stack data toolkit — from SQL and PySpark to BERT embeddings, MLflow, and AWS.

Programming & Visualization

  • Python
  • Tableau
  • Power BI
  • Matplotlib
  • Seaborn

Statistical Analysis

  • Linear Regression
  • Time-Series
  • A/B Testing
  • Multivariate Testing
  • SHAP
  • Statistical Modeling

MLOps & Model Management

  • MLflow
  • Docker
  • MinIO
  • S3
  • Drift Monitoring
  • Model Validation

Data Engineering & Cloud

  • SQL
  • PySpark
  • Snowflake
  • BigQuery
  • Redshift
  • AWS
  • GCP
  • Databricks

Machine Learning & AI

  • Supervised Learning
  • Unsupervised Learning
  • Feature Engineering
  • Applied GenAI
  • Clustering
  • AutoML

GenAI Stack

  • LangChain
  • CrewAI
  • RAG
  • BERT
  • Word2Vec
  • H2O AutoML

Featured Projects

A snapshot of recent work — agentic GenAI, predictive modeling, and recommendation engines.

AI-Powered Resume & Job Description Matching Assistant

Agentic RAG system using CrewAI and LangChain to automate context-aware resume scoring and tailored cover letter generation.

  • CrewAI
  • LangChain
  • OpenAI
  • ChromaDB
  • Streamlit

Customer Churn Prediction

85%-accurate XGBoost model for identifying high-risk customer segments. Surfaces key retention drivers through statistical analysis to help businesses act before churn happens.

  • XGBoost
  • Statistical Analysis
  • Predictive Modeling
  • SHAP

Yelp Sentiment Analyzer & Recommender

91.4%-accurate neural network for sentiment classification on unstructured Yelp reviews, paired with a hybrid recommendation engine using KNN, SVD, and sentiment embeddings.

  • Neural Networks
  • NLP
  • KNN
  • SVD
  • Recommenders

JobRADAR — Big Tech Job Crawler

A job-notification agent that crawls Big Tech career pages every 15 minutes and pushes instant, personalized alerts. Reduces noise — finds roles that actually match.

  • Python
  • Web Scraping
  • Automation
  • Agents

Liver Cirrhosis Survival Prediction

Survival classification model for liver-cirrhosis patients — predicts outcomes from clinical features using a tuned classification pipeline. Healthcare ML in practice.

  • scikit-learn
  • Healthcare
  • Classification
  • Feature Engineering

MLOps — Machine Learning in Production

A working reference for productionizing ML models — covers the operational glue around training, serving, and monitoring that keeps models reliable post-deployment.

  • MLflow
  • Docker
  • CI/CD
  • Monitoring

Education

Formal training at the intersection of analytics, machine learning, and engineering.

Northeastern University

Master of Science in Data Analytics Engineering

Boston, MA · Sep 2023 – May 2025

GPA: 3.8 / 4.0

Coursework: Data Management in Analytics, Data Mining in Engineering, Machine Learning Operations

SRM Institute of Science & Technology

Bachelor of Technology in Computer Science and Engineering

Chennai, Tamil Nadu, India · Jul 2017 – May 2021

Get In Touch

Open to data science roles, collaborations, and conversations about ML, GenAI, and MLOps. Let's build something.