in-progress

Grocery Lens

Python, Django, PostgreSQL, Docker, Playwright, spaCy, Grafana

Overview

Grocery Lens is a Django/PostgreSQL data product I've been building from scratch since July 2025. It combines custom dietary filters, recipe matching, basket synchronisation and NLP-assisted ingredient processing to help users navigate grocery data in ways retailer apps don't support.

The backend is the core challenge: a single PostgreSQL schema has to serve scraping, ingestion, data quality repair, search and analytics without losing data integrity as the product grows.

Product Features

  • Dietary filters - user-created diet templates with ingredient-level allow/restrict rules; publicly posted diets support community upvoting and downvoting
  • Recipe matching - compatible recipes surfaced per diet; background caching pre-computes matches after a diet is saved so results are instant on first load
  • Basket & favourites - add products to a grocery basket, favourite individual items, track changes over time
  • Price history - per-product price tracking across scrape runs; surfaced in product detail and the compare page
  • Compare page - side-by-side product comparison across retailers
  • D3.js dietary tree visualiser - interactive ingredient hierarchy view
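
As a rough illustration of the ingredient-level allow/restrict rules behind the dietary filters, here is a minimal sketch. The DietTemplate class, its field names and the rule that an explicit allow overrides a restriction are all assumptions made for the example, not the project's actual models; the lower-casing step only stands in for the real ingredient canonicalisation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DietTemplate:
    """Hypothetical diet template: restricted ingredients plus explicit exceptions."""

    name: str
    restricted: frozenset = field(default_factory=frozenset)
    allowed: frozenset = field(default_factory=frozenset)

    def violations(self, ingredients):
        """Return the ingredients that break this diet, in their original order."""
        found = []
        for raw in ingredients:
            # Naive normalisation; a stand-in for NLP-based canonicalisation.
            name = raw.strip().lower()
            # Assumed precedence: an explicit allow rule wins over a restrict rule.
            if name in self.restricted and name not in self.allowed:
                found.append(name)
        return found

    def is_compatible(self, ingredients):
        """A product or recipe is compatible when nothing violates the diet."""
        return not self.violations(ingredients)


vegan = DietTemplate(
    name="vegan",
    restricted=frozenset({"milk", "egg", "honey", "gelatine"}),
)
```

Calling vegan.violations(["Oats", " Milk ", "Honey"]) returns the offending ingredients rather than a bare yes/no, which is the kind of detail a product page could use to explain why an item fails a diet.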

What I Built

  • Production data platform: PostgreSQL schemas for semi-structured product and analytics workloads using JSONField, ArrayField, GIN indexes and integrity constraints
  • Replayable ingestion workflows: loading retailer scrapes, public datasets and CSV logs with upserts, transactional writes and structured repair tables for relinking, backfills and taxonomy improvement
  • NLP ingredient processing: canonicalisation and fuzzy matching with spaCy; also fixed a scraper memory leak that had grown to 5.9 GB across a run of 5,938 products
  • Concurrency-aware processing: bounded concurrency and retry-safe queues built on select_for_update(skip_locked=True), balancing throughput and reliability
  • Background caching: post-save signals trigger async pre-computation of recipe and product compatibility so users never wait for cold results
  • Privacy-aware analytics: retention controls, aggregation layers and reporting controls built for a live service
  • Observability: Grafana/Loki log aggregation, health checks, blue/green deployments and rollback-safe traffic switching
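
To make the bounded-concurrency idea above concrete, here is a minimal in-process sketch using asyncio. It is not the project's code: process_all, its limit and retries parameters and the backoff timings are hypothetical, and it does not cover the database side, where rows are claimed with select_for_update(skip_locked=True).

```python
import asyncio


async def process_all(items, worker, *, limit=4, retries=2):
    """Run worker(item) for every item, at most `limit` at a time, retrying failures."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item):
        async with semaphore:  # waits here once `limit` workers are in flight
            for attempt in range(retries + 1):
                try:
                    return await worker(item)
                except Exception:
                    if attempt == retries:
                        raise  # out of attempts: hand the error to gather()
                    # Short exponential backoff before the next attempt.
                    await asyncio.sleep(0.01 * 2 ** attempt)

    # return_exceptions=True collects failures as results instead of
    # raising the first one, so callers can inspect every item's outcome.
    return await asyncio.gather(
        *(run_one(item) for item in items), return_exceptions=True
    )
```

Results come back in input order, with an exception object in place of any item that failed on every attempt, so a caller can log or requeue just those items. The worker needs to be idempotent for the retries to be safe.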

Technical Choices

  • Django ORM + raw SQL - Django models for application logic; raw SQL for complex ingestion queries, upserts and repair workflows where the ORM adds unnecessary overhead
  • PostgreSQL - JSONField and ArrayField for semi-structured product data; GIN indexes for fast search; queue-backed background processing using select_for_update(skip_locked=True)
  • Playwright + BeautifulSoup - Playwright for JS-rendered retailer pages; BeautifulSoup for lighter scraping tasks
  • Docker + Caddy - containerised service behind Caddy, deployed blue/green with a health-check-gated swap script
  • Grafana + Loki - structured log aggregation for pipeline health and data quality trend monitoring
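
The raw-SQL upserts mentioned above are what make ingestion replayable: loading the same scrape twice should leave the data unchanged. Below is a small sketch of that pattern. It uses Python's built-in sqlite3 module purely as a stand-in so the example runs anywhere; PostgreSQL accepts the same INSERT ... ON CONFLICT ... DO UPDATE form, though a driver such as psycopg uses %s placeholders instead of ?. The product table, its columns and load_scrape are hypothetical.

```python
import sqlite3

# Same ON CONFLICT shape PostgreSQL accepts; table and columns are illustrative.
UPSERT_SQL = """
INSERT INTO product (retailer, sku, name, price_pence)
VALUES (?, ?, ?, ?)
ON CONFLICT (retailer, sku) DO UPDATE SET
    name = excluded.name,
    price_pence = excluded.price_pence
"""


def load_scrape(conn, rows):
    """Upsert one scrape batch in a single transaction; safe to replay."""
    with conn:  # commits on success, rolls back if any row fails
        conn.executemany(UPSERT_SQL, rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE product (
        retailer TEXT NOT NULL,
        sku TEXT NOT NULL,
        name TEXT NOT NULL,
        price_pence INTEGER NOT NULL CHECK (price_pence >= 0),
        UNIQUE (retailer, sku)
    )
    """
)

batch = [
    ("shop_a", "123", "Oat milk 1L", 180),
    ("shop_b", "987", "Oat milk 1L", 175),
]
load_scrape(conn, batch)
load_scrape(conn, batch)  # replaying the same scrape changes nothing
load_scrape(conn, [("shop_a", "123", "Oat milk 1L", 165)])  # later run updates the price
```

The UNIQUE constraint gives the upsert its conflict target, and the CHECK constraint means a batch containing a bad row is rejected as a whole rather than half-applied.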