Research Projects

Projects spanning retrieval-augmented generation, medical imaging, second-order optimization, NLP, and graph neural networks — all with detailed technical reports.

Systems & Data Engineering

Code Visualizer: Python Runtime Map

New

Code Visualizer is a Python-first runtime visualization app for learning how code executes. Users paste Python in the browser, run it through Pyodide in a Web Worker, and scrub a trace of lines, variables, scopes, object references, loops, functions, returns, stdout, and runtime errors.

Browser-only Python runtime · Snapshot-first tracing · Reference arrows + scope boxes · JSON trace export

Uses sys.settrace events plus static AST analysis to normalize runtime frames, preserve Python object identity, and show aliasing and in-place mutations truthfully. The UI includes timeline playback, line-to-step jumping, speed controls, object inspection, shareable #cv= links, and trace export.

React 19 · TypeScript · Vite · Pyodide · Web Workers · SVG · Vitest · ESLint
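The tracing idea behind Code Visualizer can be illustrated with a minimal sketch. It assumes nothing about the app's actual trace format: the `trace_lines` helper, the `<user>` filename, and the snapshot fields are hypothetical names chosen for illustration. It uses only the standard `sys.settrace` hook and stores `id()` next to each value so that aliases of the same object remain visible.

```python
import sys


def trace_lines(source: str):
    """Run `source` and record a snapshot at every line and return event."""
    steps = []
    code = compile(source, "<user>", "exec")

    def tracer(frame, event, arg):
        # Ignore frames that do not belong to the user's snippet.
        if frame.f_code.co_filename != "<user>":
            return None
        if event in ("line", "return"):
            steps.append({
                "event": event,
                "line": frame.f_lineno,
                "scope": frame.f_code.co_name,
                # Keep id() with repr() so two names bound to one object
                # can be drawn as references to the same box.
                "locals": {
                    name: (id(value), repr(value))
                    for name, value in frame.f_locals.items()
                    if not name.startswith("__")
                },
            })
        return tracer

    namespace = {}
    sys.settrace(tracer)
    try:
        exec(code, namespace)
    finally:
        sys.settrace(None)  # always remove the hook, even on errors
    return steps


steps = trace_lines("a = [1]\nb = a\nb.append(2)\n")
```

Line events fire before a line runs, so the final `return` snapshot is what shows `a` and `b` sharing one mutated list.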

E-Commerce Behavior Analytics Platform

Featured

A high-performance analytics platform that turns 385M+ raw e-commerce events into sub-second business intelligence. Three-tier cloud-native stack: React 18 + Material-UI frontend on Netlify, FastAPI backend on Google Cloud Run, PostgreSQL 14 with star schema on Cloud SQL.

300–600× faster · <1s queries · 385M events · 52GB dataset

Monthly partitioning (7 partitions) reduced scan size by 85% · 5 materialized views for pre-computed aggregations · B-tree indexing on product_id, user_session, event_type. Storage overhead: ~35% (~$3/month on Google Cloud SQL).

PostgreSQL 14 · FastAPI · React 18 · Star Schema · Material-UI · Recharts · Google Cloud SQL · Cloud Run · Docker
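As a rough sketch of the monthly partitioning described for the analytics platform, the helper below generates PostgreSQL declarative-partitioning DDL for seven monthly range partitions. The table name `events`, the partition key `event_time`, and the start month are assumptions made for illustration; only `product_id`, `user_session`, and `event_type` come from the project description.

```python
from datetime import date

# Hypothetical parent table; only the three indexed columns are from the text.
PARENT_DDL = """
CREATE TABLE events (
    event_time   timestamptz NOT NULL,  -- assumed partition key
    event_type   text,
    product_id   bigint,
    user_session text
) PARTITION BY RANGE (event_time);
"""

INDEX_DDL = [
    f"CREATE INDEX idx_events_{col} ON events USING btree ({col});"
    for col in ("product_id", "user_session", "event_type")
]


def month_starts(first: date, count: int):
    """Return count + 1 consecutive first-of-month dates starting at `first`."""
    out = []
    year, month = first.year, first.month
    for _ in range(count + 1):
        out.append(date(year, month, 1))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return out


def partition_ddl(parent: str, first: date, count: int):
    """Build one CREATE TABLE ... PARTITION OF statement per month."""
    bounds = month_starts(first, count)
    stmts = []
    for lo, hi in zip(bounds, bounds[1:]):
        name = f"{parent}_{lo:%Y_%m}"
        stmts.append(
            f"CREATE TABLE {name} PARTITION OF {parent} "
            f"FOR VALUES FROM ('{lo.isoformat()}') TO ('{hi.isoformat()}');"
        )
    return stmts


# Placeholder start month; the real date range is not stated in the text.
statements = partition_ddl("events", date(2019, 10, 1), 7)
```

With range partitions like these, a query filtered to one month only needs to scan that month's partition, which is the mechanism behind the reduced scan size.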

Generative & Vision

Medical Image Enhancement (Pix2Pix)

Implemented and extended the Pix2Pix conditional GAN for automated chest X-ray enhancement. Built a synthetic degradation pipeline (Gaussian noise with σ=15, 3×3 blur, JPEG compression at quality 50) to create paired training data from NIH ChestX-ray14 (4,999 frontal radiographs). Extended the generator with SAGAN-style self-attention modules at the U-Net bottleneck.

PSNR 39.97 dB · SSIM 0.9755 · 200 epochs · Tesla T4 (16GB)

Key finding: self-attention added 2.5M parameters and 50% training overhead but did not improve metrics, which suggests that X-ray enhancement is a largely local operation already well served by U-Net skip connections.

PyTorch · Pix2Pix / cGAN · U-Net · PatchGAN · Self-Attention · NIH ChestX-ray14 · PSNR / SSIM
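The degradation steps and the PSNR metric from the Pix2Pix project can be sketched in plain Python on a small grayscale image stored as nested lists. This is an illustration under stated assumptions, not the project's pipeline: the blur is assumed to be a 3×3 box filter (the text does not say which kernel was used), the JPEG step is left out because it needs an image codec, and the function names are hypothetical. A real pipeline would more likely use NumPy, OpenCV, or Pillow.

```python
import math
import random


def add_gaussian_noise(img, sigma=15.0, seed=0):
    """Add zero-mean Gaussian noise and clamp to the 8-bit range."""
    rng = random.Random(seed)
    return [
        [min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in row]
        for row in img
    ]


def box_blur_3x3(img):
    """Apply a 3x3 mean filter, replicating pixels at the borders."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            total = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(h - 1, max(0, y + dy))
                    xx = min(w - 1, max(0, x + dx))
                    total += img[yy][xx]
            row.append(round(total / 9))
        out.append(row)
    return out


def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    n = len(reference) * len(reference[0])
    mse = sum(
        (a - b) ** 2
        for ref_row, test_row in zip(reference, test)
        for a, b in zip(ref_row, test_row)
    ) / n
    return math.inf if mse == 0 else 10.0 * math.log10(max_value ** 2 / mse)


# Synthetic 16x16 test pattern standing in for a radiograph.
clean = [[(x * 7 + y * 13) % 256 for x in range(16)] for y in range(16)]
degraded = box_blur_3x3(add_gaussian_noise(clean))
score = psnr(clean, degraded)
```

Each (degraded, clean) pair produced this way plays the role of one paired training example for the conditional GAN.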

Optimization & Theory

nlTGCR: Second-Order Optimizer

Designed a scalable second-order optimization algorithm that uses the Fisher Information Matrix (FIM) as a symmetric positive-definite approximation to the Hessian. Applied a Nyström approximation (rank-k subsampling) for cheap FIM inversion and Kronecker-factored preconditioning (K-FAC) for linear layers. Used JAX-style JIT compilation to run matrix operations at close to compiled-code speed.

17× faster per epoch · +3.2 pp accuracy vs Adam (MLP) · 0.42s/epoch (5-layer MLP)

CIFAR-10 results: nlTGCR outperformed Adam/RMSProp on MLPs (54.52% vs 51.3%) and ran 17× faster per epoch. On CNNs, accuracy was comparable, because convolutional structure breaks the dense-Hessian assumption. Submitted to ICMLC '25.

PyTorch · Fisher Information Matrix · Nyström Approx. · K-FAC · JIT Compilation · CIFAR-10
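The two approximations named for nlTGCR can be written out in their standard textbook forms. This is a sketch of the general identities, not the project's exact formulation: the damping term \lambda, the sampled index set S, and the assumption that W is invertible are introduced here for illustration.

```latex
% Nystrom approximation of the Fisher matrix F (n x n) from k sampled columns S
C = F_{:,S} \in \mathbb{R}^{n \times k}, \qquad
W = F_{S,S} \in \mathbb{R}^{k \times k}, \qquad
F \approx \hat{F} = C\, W^{-1} C^{\top}

% Damped inverse via the Woodbury identity (\lambda > 0 is an assumed damping term)
(\hat{F} + \lambda I)^{-1}
  = \frac{1}{\lambda}\Bigl[\, I - C\,(\lambda W + C^{\top} C)^{-1} C^{\top} \Bigr]

% K-FAC for a linear layer with input a and pre-activation gradient g
F_{\text{layer}} \approx A \otimes G, \qquad
A = \mathbb{E}[a a^{\top}], \qquad
G = \mathbb{E}[g g^{\top}], \qquad
(A \otimes G)^{-1} \operatorname{vec}(\nabla W)
  = \operatorname{vec}\bigl(G^{-1}\, \nabla W\, A^{-1}\bigr)
```

In both cases the expensive n × n inverse is replaced by much smaller ones: a k × k solve for the Nyström form, and two per-layer factor inverses for K-FAC.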

NLP & Summarization

PEGASUS Scientific Paper Summarizer

Abstractive summarization pipeline for arXiv papers using google/pegasus-pubmed. Built a preprocessing pipeline (URL removal, LaTeX stripping, special-character handling) that preserves domain-specific vocabulary. Fine-tuned on 1,000 papers on an A100 (40GB) with 16-bit mixed precision; summaries are generated with beam search (width 4, length penalty 0.8).

ROUGE-1: 0.377 · ROUGE-2: 0.126 · ROUGE-L: 0.219 · 1,000 train / 100 val / 100 test

PEGASUS · PyTorch Lightning · Hugging Face Transformers · A100 / CUDA · AdamW (lr=2e-5) · 16-bit Mixed Precision
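The preprocessing steps listed for the summarizer (URL removal, LaTeX stripping, special-character handling) might look roughly like the sketch below. The regular expressions and the `clean_text` name are assumptions for illustration, not the project's actual rules. Hyphens, slashes, and Unicode letters are deliberately kept so that terms such as "IL-6" survive.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# \cite{...}, \citep[...]{...}, \ref{...}, \label{...}: drop command and argument.
CITE_RE = re.compile(r"\\(?:cite|ref|label)[a-zA-Z]*\*?(?:\[[^\]]*\])?\{[^}]*\}")
# Inline math such as $x^2$ (this also removes text between two dollar signs).
MATH_RE = re.compile(r"\$[^$]*\$")
# \textbf{word} -> word: keep the argument of formatting commands.
CMD_ARG_RE = re.compile(r"\\[a-zA-Z]+\*?\{([^{}]*)\}")
CMD_RE = re.compile(r"\\[a-zA-Z]+\*?")
# Anything other than word characters, whitespace, and common punctuation.
SPECIAL_RE = re.compile(r"[^\w\s.,;:!?()%/+\-']")
WS_RE = re.compile(r"\s+")
SPACE_BEFORE_PUNCT_RE = re.compile(r"\s+([.,;:!?])")


def clean_text(text: str) -> str:
    """Strip URLs, LaTeX markup, and stray symbols while keeping vocabulary."""
    text = URL_RE.sub(" ", text)
    text = CITE_RE.sub(" ", text)
    text = MATH_RE.sub(" ", text)
    text = CMD_ARG_RE.sub(r"\1", text)
    text = CMD_RE.sub(" ", text)
    text = SPECIAL_RE.sub(" ", text)
    text = WS_RE.sub(" ", text).strip()
    return SPACE_BEFORE_PUNCT_RE.sub(r"\1", text)


sample = r"Serum \textbf{IL-6} rose 40% \cite{smith2020}; see https://example.org/data for details."
cleaned = clean_text(sample)
```

The order matters: citation commands are removed before the generic command rule, otherwise the citation key would be kept as if it were text.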

Retcon: Local-First LM Adaptation Lab

New

Retcon is a reproducible research lab for domain-adaptive language model experiments. It turns local text, Markdown, JSONL, CSV, and Parquet corpora into cleaned, deduplicated, tokenized, config-hashed artifacts, then compares baseline and adapted model behavior across domain and general evaluation sets.

Eval contamination checks · LoRA adapter training · Partial-unfreeze comparisons · Controlled forgetting reports

Implements a full experiment loop: ingest, clean, dedup, contamination analysis, tokenization, baseline evaluation, reliability calibration, training, checkpoint evaluation, forgetting detection, strategy comparison, static reports, and dashboard scaffolding. Runs preserve provenance through config hashes, SQLite metrics, environment metadata, stage hashes, cost estimates, and artifact manifests.

Python 3.11 · Hugging Face · PEFT / LoRA · Accelerate · Typer · Pydantic · Streamlit · SQLite · Continual Learning
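Three of the Retcon stages (config hashing, deduplication, and contamination analysis) can be sketched with the standard library alone. The function names, the 12-character hash prefix, and the n-gram overlap rule are assumptions chosen for illustration; the project's own criteria are not described in the text.

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    """Hash a config by its canonical JSON form so key order does not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing documents."""
    return " ".join(text.lower().split())


def dedup(docs):
    """Keep the first occurrence of each document, compared after normalizing."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def ngrams(text: str, n: int):
    """Return the set of word n-grams in the normalized text."""
    tokens = normalize(text).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated(train_docs, eval_doc: str, n: int = 8) -> bool:
    """Flag an eval document that shares any word n-gram with the training set."""
    eval_grams = ngrams(eval_doc, n)
    return any(eval_grams & ngrams(doc, n) for doc in train_docs)


run_id = config_hash({"model": "example-model", "lr": 2e-5, "lora_rank": 8})
corpus = dedup(["Hello  world", "hello world", "Something else"])
```

Hashing the canonical config gives each run a stable identifier, so two runs with the same settings map to the same artifacts regardless of how the config file was ordered.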

RAG-BioQA

Retrieval-augmented generation framework for long-form biomedical question answering on the PubMedQA dataset. Dense retrieval via BioBERT embeddings + FAISS indexing. Re-ranking pipeline comparing BM25, ColBERT, and MonoT5. Generator fine-tuned with LoRA for parameter-efficient T5 adaptation.

BioBERT · FAISS · T5 + LoRA · ColBERT · MonoT5 · BM25 · PubMedQA
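Of the three re-rankers compared in RAG-BioQA, BM25 is simple enough to sketch directly. The version below scores already-retrieved passages with the common Okapi BM25 formula, using whitespace tokenization and default parameters `k1=1.5` and `b=0.75`; these choices and the function names are assumptions for illustration, and the project may well rely on a library implementation instead.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs, k1: float = 1.5, b: float = 0.75):
    """Score each document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1.0 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


def rerank(query: str, docs, top_k: int = 3):
    """Return the top_k documents ordered by descending BM25 score."""
    scores = bm25_scores(query, docs)
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in order[:top_k]]


candidates = [
    "aspirin reduces risk of stroke",
    "statins lower cholesterol levels",
    "exercise improves mood",
]
best = rerank("aspirin stroke", candidates, top_k=1)
```

In a pipeline like the one described, the candidate list would come from the dense FAISS retrieval stage, and the re-ranked passages would be passed to the generator.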

Graph ML

GNN Document Classification (CORA)

Document relationship modeling using Graph Neural Networks on the CORA dataset. Combined citation networks, co-authorship signals, and semantic similarity for graph construction. Implemented and compared GCN, GAT, and GraphSAGE architectures for document classification and clustering.

PyTorch Geometric · GCN · GAT · GraphSAGE · CORA Dataset · Citation Networks
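The graph construction step described for the GNN project, which combines citation links, co-authorship, and semantic similarity, can be sketched as a merge of three edge sources. The input shapes, the `build_edges` name, and the 0.8 cosine threshold are hypothetical; the text does not say how the three signals were weighted or thresholded.

```python
import math
from itertools import combinations


def cosine(u, v) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_edges(citations, authors, embeddings, sim_threshold: float = 0.8):
    """Merge three signals into one undirected edge map: (i, j) -> set of kinds."""
    edges = {}

    def add(i, j, kind):
        if i != j:
            edges.setdefault((min(i, j), max(i, j)), set()).add(kind)

    # 1. Citation links, treated as undirected.
    for src, dst in citations:
        add(src, dst, "cites")
    # 2. Co-authorship: two documents share at least one author.
    for i, j in combinations(sorted(authors), 2):
        if authors[i] & authors[j]:
            add(i, j, "coauthor")
    # 3. Semantic similarity above the (assumed) threshold.
    for i, j in combinations(sorted(embeddings), 2):
        if cosine(embeddings[i], embeddings[j]) >= sim_threshold:
            add(i, j, "similar")
    return edges


edges = build_edges(
    citations=[(0, 1)],
    authors={0: {"a"}, 1: {"b"}, 2: {"a"}},
    embeddings={0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 0.1]},
)
```

The resulting keys can be converted into the edge index that GCN, GAT, and GraphSAGE layers consume, with the edge kinds available as optional edge features.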

Applied AI

TasteMatch: AI Dietitian Chatbot

LLM-powered personal dietitian for users managing chronic conditions such as diabetes. Analyzes user preferences, kitchen inventory, and dietary restrictions to generate personalized meal recommendations. Checks nutritional facts against established diabetes care guidelines, including glycemic index checks and portion size calculations.

Ollama · FastAPI · LLMs · RAG · Diabetes Care · Conversational AI
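One way the glycemic index and portion size checks in TasteMatch could be connected is through glycemic load, commonly defined as GI × available carbohydrate (g) / 100. Using glycemic load here is an assumption, since the text only mentions glycemic index verification and portion calculations. The function names, the food values, and the budget in the example are placeholders rather than values taken from any guideline.

```python
def glycemic_load(glycemic_index: float, carbs_per_100g: float, portion_g: float) -> float:
    """Glycemic load of a portion: GI * carbohydrate grams in the portion / 100."""
    carbs_g = carbs_per_100g * portion_g / 100.0
    return glycemic_index * carbs_g / 100.0


def max_portion_g(glycemic_index: float, carbs_per_100g: float, gl_budget: float) -> float:
    """Largest portion (grams) whose glycemic load stays within gl_budget."""
    if glycemic_index <= 0 or carbs_per_100g <= 0:
        return float("inf")  # no carbohydrate contribution, so no limit from GL
    return gl_budget * 100.0 * 100.0 / (glycemic_index * carbs_per_100g)


# Placeholder food: GI 50, 20 g carbohydrate per 100 g.
load = glycemic_load(50, 20, 150)      # 150 g portion
limit = max_portion_g(50, 20, 10)      # portion that keeps the load at 10
```

A check like this lets a recommended portion be validated numerically instead of relying on the language model's own arithmetic.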

Research Agenda

My research explores how large language models and deep learning can be applied to real-world problems in healthcare and science. I'm particularly interested in:

  • Retrieval-Augmented Generation — improving factual accuracy and domain specificity in LLMs through dense retrieval and re-ranking pipelines
  • Medical Image Analysis — using GANs and attention mechanisms to enhance diagnostic quality of medical imagery
  • Digital Health — extracting meaningful signals from passive technology usage data to monitor functional decline in aging populations
  • Graph Neural Networks — modeling complex relational data for classification and clustering tasks

Current: Digital Health Monitoring

Active

At the Emory FIT Lab, I'm extracting and analyzing Amazon Alexa voice interaction logs to identify technology engagement patterns that correlate with functional decline in older adults. I'm also building automated data extraction pipelines with Python/Selenium and developing ML models to detect meaningful behavioral changes over time.

Python · Selenium · Digital Health · Time Series Analysis · Emory FIT Lab