in-progress
Grocery Lens
Python, Django, PostgreSQL, Docker, Playwright, spaCy, Grafana
Overview
Grocery Lens is a Django/PostgreSQL data product I've been building from scratch since July 2025. It combines custom dietary filters, recipe matching, basket synchronisation and NLP-assisted ingredient processing to help users navigate grocery data in ways retailer apps don't support.
The backend is the core challenge: designing a platform that handles scraping, ingestion, data quality repair, search and analytics from the same PostgreSQL schema - without losing data integrity as the product grows.
Product Features
- Dietary filters - user-created diet templates with ingredient-level allow/restrict rules; publicly posted diets support community upvoting and downvoting
- Recipe matching - compatible recipes surfaced per diet; background caching pre-computes matches after a diet is saved so results are instant on first load
- Basket & favourites - add products to a grocery basket, favourite individual items, track changes over time
- Price history - per-product price tracking across scrape runs; surfaced in product detail and the compare page
- Compare page - side-by-side product comparison across retailers
- D3.js dietary tree visualiser - interactive ingredient hierarchy view, available live
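The ingredient-level allow/restrict rules behind the dietary filters can be sketched roughly as below. This is a minimal illustration, not the project's actual model: `DietTemplate`, `normalise` and `violations` are hypothetical names, and the rule semantics (a restriction always wins; a non-empty allow-list rejects anything unlisted) are an assumption.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DietTemplate:
    """Hypothetical diet template holding ingredient-level rules."""
    name: str
    allowed: frozenset = field(default_factory=frozenset)
    restricted: frozenset = field(default_factory=frozenset)


def normalise(ingredient: str) -> str:
    # Stand-in for real canonicalisation: trim and lowercase only.
    return ingredient.strip().lower()


def violations(diet: DietTemplate, ingredients: list[str]) -> list[str]:
    """Return the normalised ingredients that break the diet, in input order."""
    broken = []
    for raw in ingredients:
        name = normalise(raw)
        if name in diet.restricted:
            broken.append(name)  # assumed rule: an explicit restriction always wins
        elif diet.allowed and name not in diet.allowed:
            broken.append(name)  # assumed rule: with an allow-list, unlisted items fail
    return broken


def is_compatible(diet: DietTemplate, ingredients: list[str]) -> bool:
    return not violations(diet, ingredients)


vegan = DietTemplate("vegan", restricted=frozenset({"milk", "egg", "honey"}))
print(violations(vegan, ["Oats", " Milk ", "sugar"]))  # ['milk']
```

Returning the offending ingredients rather than a bare boolean makes it possible to show a user why a product or recipe was excluded.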
What I Built
- Production data platform: PostgreSQL schemas for semi-structured product and analytics workloads using JSONField, ArrayField, GIN indexes and integrity constraints
- Replayable ingestion workflows: loading retailer scrapes, public datasets and CSV logs with upserts, transactional writes and structured repair tables for relinking, backfills and taxonomy improvement
- NLP ingredient processing: canonicalisation and fuzzy matching with spaCy; fixed a scraper memory leak that reached 5.9 GB over 5,938 products
- Concurrency-aware processing: bounded concurrency and retry-safe queues using select_for_update(skip_locked=True) to balance throughput and reliability
- Background caching: post-save signals trigger async pre-computation of recipe and product compatibility so users never wait for cold results
- Privacy-aware analytics: retention controls, aggregation layers and reporting controls built for a live service
- Observability: Grafana/Loki log aggregation, health checks, blue/green deployments and rollback-safe traffic switching
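To illustrate what "replayable" means for the ingestion workflows, here is a small sketch of an upsert inside a transaction. It is an assumption-laden stand-in: the `product` table, its columns and `load_scrape` are hypothetical, and SQLite replaces PostgreSQL so the example runs anywhere (both accept the `INSERT ... ON CONFLICT ... DO UPDATE` form used here).

```python
import sqlite3

# Hypothetical schema; the CHECK and UNIQUE constraints stand in for the
# integrity constraints mentioned above.
SCHEMA = """
CREATE TABLE product (
    retailer    TEXT NOT NULL,
    sku         TEXT NOT NULL,
    name        TEXT NOT NULL,
    price_pence INTEGER NOT NULL CHECK (price_pence >= 0),
    UNIQUE (retailer, sku)
)
"""

UPSERT = """
INSERT INTO product (retailer, sku, name, price_pence)
VALUES (:retailer, :sku, :name, :price_pence)
ON CONFLICT (retailer, sku) DO UPDATE SET
    name = excluded.name,
    price_pence = excluded.price_pence
"""


def load_scrape(conn: sqlite3.Connection, rows: list[dict]) -> None:
    """Load one scrape batch atomically; replaying the batch changes nothing."""
    with conn:  # commits on success, rolls the whole batch back on error
        conn.executemany(UPSERT, rows)


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

batch = [
    {"retailer": "shop_a", "sku": "1", "name": "Oat milk 1L", "price_pence": 180},
    {"retailer": "shop_b", "sku": "7", "name": "Oat milk 1L", "price_pence": 165},
]
load_scrape(conn, batch)
load_scrape(conn, batch)  # replaying the same scrape does not duplicate rows

count = conn.execute("SELECT COUNT(*) FROM product").fetchone()[0]
print(count)  # 2
```

Because the whole batch runs in one transaction, a row that violates a constraint rejects the batch as a unit instead of leaving a half-loaded scrape behind.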
Technical Choices
- Django ORM + raw SQL - Django models for application logic; raw SQL for complex ingestion queries, upserts and repair workflows where the ORM adds unnecessary overhead
- PostgreSQL - JSONField and ArrayField for semi-structured product data; GIN indexes for fast search; queue-backed processing using select_for_update(skip_locked=True) for background work
- Playwright + BeautifulSoup - Playwright for JS-rendered retailer pages; BeautifulSoup for lighter scraping tasks
- Docker + Caddy - containerised service behind Caddy, deployed blue/green with a health-check-gated swap script
- Grafana + Loki - structured log aggregation for pipeline health and data quality trend monitoring
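The health-check-gated swap mentioned under Docker + Caddy could look something like the sketch below. This is not the project's swap script: `wait_until_healthy`, `swap`, and the `check_candidate` and `switch_traffic` callables are hypothetical, with `switch_traffic` standing in for whatever actually repoints Caddy at the new container.

```python
import time
from typing import Callable


def wait_until_healthy(check: Callable[[], bool],
                       attempts: int = 5,
                       delay_s: float = 1.0) -> bool:
    """Poll a health check up to `attempts` times; True as soon as it passes."""
    for attempt in range(attempts):
        try:
            if check():
                return True
        except Exception:
            pass  # a probe that raises is treated like an unhealthy response
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False


def swap(active: str,
         check_candidate: Callable[[str], bool],
         switch_traffic: Callable[[str], None],
         attempts: int = 5,
         delay_s: float = 1.0) -> str:
    """Switch traffic to the idle colour only if it passes its health check."""
    candidate = "green" if active == "blue" else "blue"
    healthy = wait_until_healthy(lambda: check_candidate(candidate),
                                 attempts=attempts, delay_s=delay_s)
    if not healthy:
        return active  # rollback-safe: the live colour is never touched
    switch_traffic(candidate)
    return candidate


switched: list[str] = []
result = swap("blue",
              check_candidate=lambda colour: True,  # pretend the probe passes
              switch_traffic=switched.append,
              delay_s=0)
print(result, switched)  # green ['green']
```

Keeping the traffic switch behind the health gate means a failed deploy leaves the previous colour serving requests, so rollback amounts to doing nothing.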