Eirini Mantzouni

Data Analyst & Research Engineer · PhD

Data Analyst and Research Engineer with 8+ years building full-stack data applications — from anonymisation engines and policy analytics platforms to AI-powered dashboards and R packages — for EU institutions, government agencies, and research projects. Core stack: Python, R, SQL, Power BI, cloud deployment.

PhD in Quantitative Ecology & Statistical Modelling (University of Copenhagen, Marie Curie Fellowship). External Expert for the EU Scientific, Technical and Economic Committee for Fisheries (STECF) since 2017, contributing to stock assessments, regulatory impact analyses, and fisheries data infrastructure across 10+ Expert Working Groups. Currently contributing to European Commission projects (DG EMPL) involving AI-powered analytics and digital transformation.

Projects


Statistical Disclosure Control — Streamlit App

Python AI/LLM

Production anonymisation engine that protects individual-level datasets against re-identification using four methods (k-Anonymity, Local Suppression, PRAM, Noise Addition), selected automatically by a 40+ rule decision engine. Backward elimination risk analysis drives every downstream choice — from variable classification to per-QI protection parameters.

Features an adaptive retry loop with escalation and cross-method fallbacks, a composite utility score (Pearson correlation, KL divergence, optional ML validation), and optional AI-powered column classification via Cerebras Qwen 235B. Tested against 17 real-world Greek administrative datasets.
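A composite utility score of this kind can be sketched as a weighted blend of a correlation term and a KL-divergence term; the weights, binning, and blending below are illustrative, not the app's actual formula:

```python
import numpy as np
from scipy.stats import pearsonr, entropy

def composite_utility(orig, anon, bins=10):
    """Toy composite utility score: correlation preservation + distribution shift.
    `orig` and `anon` are equal-length 1-D numeric arrays."""
    # Correlation preservation: how closely the anonymised column tracks the original.
    corr = abs(pearsonr(orig, anon)[0])
    # Distribution shift: KL divergence between histograms on shared bin edges.
    edges = np.histogram_bin_edges(np.concatenate([orig, anon]), bins=bins)
    p = np.histogram(orig, bins=edges)[0] + 1e-9
    q = np.histogram(anon, bins=edges)[0] + 1e-9
    kl = entropy(p / p.sum(), q / q.sum())
    # Blend into a single 0-1 score; the 50/50 weights are arbitrary here.
    return 0.5 * corr + 0.5 * np.exp(-kl)

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
noisy = x + rng.normal(scale=0.1, size=1000)   # mild noise addition
print(round(composite_utility(x, noisy), 3))   # close to 1.0
```

Mild noise scores near 1.0; a heavily perturbed column drags both terms down, which is the behaviour a retry loop can escalate on.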

Stack: Python · Streamlit · Pandas · Plotly · scikit-learn · R/sdcMicro · Cerebras API

Live Demo · GitHub (private repo, available on request)

Technical Highlights

Frontend: Streamlit, custom CSS, session state management
Visualisation: Plotly (risk histograms, variable importance bars, before/after overlays)
Risk engine: custom pipeline (per-record ReID, backward elimination, structural risk, variable importance ranking)
Protection engine: 4 methods, 40+ selection rules, dynamic pipeline builder, multi-phase retry with escalation + fallbacks
Method selection: rules engine (RC, CAT, LDIV, DATE, QR, LOW, DP, HR rule families), suppression-gated kANON
Preprocessing: type-aware routing (6 priority tiers), adaptive tier loop (light to very aggressive), risk-weighted per-QI cardinality limits
Privacy metrics: k-anonymity, l-diversity (distinct + entropy), t-closeness (EMD/TVD), uniqueness rate, disclosure risk
AI integration: Cerebras Qwen 235B for column classification and method recommendation (optional)
R integration: sdcMicro for optimal local suppression and correlated noise (optional, Python fallback)
Testing: pytest (unit + integration), 17-dataset test suite covering real-world Greek administrative data

Architecture Decisions

  • Backward elimination as the foundation — per-variable risk contribution drives everything downstream: QI classification confidence, preprocessing aggressiveness, per-QI protection parameters, LOCSUPR importance weights, and GENERALIZE ordering
  • Suppression-gated method selection — before selecting kANON at any k value, the engine pre-estimates suppression rate from equivalence class sizes. If estimated suppression exceeds 25%, it switches to LOCSUPR or PRAM directly
  • Type-aware preprocessing before generic generalization — dates, ages, geography, and skewed numerics each get domain-specific transformations before the generic cardinality-reduction loop runs, preserving domain structure
  • Sensitive-column-scoped utility — utility is measured primarily on sensitive (analysis) columns, not QIs. The composite score reflects what downstream analysts actually care about
  • Proportional low-cardinality QI guard — columns with very few unique values relative to dataset size are demoted from QI status via a two-tier ratio test, preventing structural methods from suppressing 90%+ of their values
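The suppression-gating idea above can be sketched in a few lines of pandas; the 25% threshold matches the text, but the function names and the toy dataset are invented for illustration:

```python
import pandas as pd

def estimate_suppression_rate(df, quasi_identifiers, k=5):
    """Share of records in equivalence classes smaller than k — rows that
    k-anonymisation would have to suppress (or heavily generalise)."""
    class_sizes = df.groupby(quasi_identifiers, dropna=False)[quasi_identifiers[0]].transform("size")
    return float((class_sizes < k).mean())

def choose_method(df, quasi_identifiers, k=5, max_suppression=0.25):
    """Gate kANON behind a pre-estimated suppression rate (threshold illustrative)."""
    rate = estimate_suppression_rate(df, quasi_identifiers, k)
    return "kANON" if rate <= max_suppression else "LOCSUPR_or_PRAM"

df = pd.DataFrame({
    "age_band": ["30-39"] * 8 + ["80-89", "90+"],
    "region":   ["Attica"] * 8 + ["Crete", "Crete"],
})
print(estimate_suppression_rate(df, ["age_band", "region"], k=5))  # 0.2
print(choose_method(df, ["age_band", "region"], k=5))              # kANON
```

The cheap pre-check avoids running a full k-anonymisation pass only to discard it for excessive suppression.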

Policy Intelligence Platform — Django + HTMX

Django Python AI/LLM Cloud

Policy analytics platform tracking the EU Council Recommendation on Fair Transition across 27 member states. Ingests 1,000+ policy measures from MongoDB and provides 14 interactive visualisations with 11 simultaneous filter dimensions — every AI prompt receives the active filter state so responses always reflect what the user is looking at.

Includes a RAG-based semantic policy search (MongoDB vector search, top-5 retrieval with citations), a 6-section strategic intelligence framework, and AI-powered gap analysis. Deployed on Cloud Run with graceful degradation — the dashboard stays fully functional even without an API key.
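A minimal sketch of what the top-5 retrieval stage of such a RAG pipeline can look like with MongoDB Atlas `$vectorSearch`; the index name, field names, and oversampling factor are assumptions, not the platform's actual configuration:

```python
def build_semantic_search_pipeline(query_vector, top_k=5):
    """Aggregation pipeline for MongoDB Atlas $vectorSearch, projecting only the
    metadata an LLM needs to answer with citations. Names are illustrative."""
    return [
        {
            "$vectorSearch": {
                "index": "policy_vector_index",   # assumed index name
                "path": "embedding",              # assumed embedding field
                "queryVector": query_vector,
                "numCandidates": top_k * 20,      # oversample before exact scoring
                "limit": top_k,
            }
        },
        {
            # Keep only what the answer-generation prompt needs for citations.
            "$project": {
                "title": 1,
                "country": 1,
                "text": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]

pipeline = build_semantic_search_pipeline([0.1] * 384, top_k=5)
# collection.aggregate(pipeline) would then return the five best-matching measures.
```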

Stack: Python · Django · HTMX · MongoDB · Plotly · Tailwind CSS · Cerebras API · Cloud Run

Live Demo · GitHub (private repo, available on request)

Technical Highlights

Frontend: Django 5 + HTMX (partial page loads, no SPA), Tailwind CSS, responsive grid layout
Visualisation: 14+ Plotly chart types (choropleth, bar, pie, heatmap, stacked bar), responsive sizing
Data pipeline: MongoDB aggregation pipelines with allowDiskUse, QueryBuilder mapping 11 filter dimensions to $match/$unwind/$regex stages
AI integration: 8 generation functions (chart narratives, document analysis, in-depth analysis with 5 focus modes, Q&A, strategic analysis with 6 sections + synthesis, batch/country summaries, gap analysis)
RAG pipeline: query embedding, MongoDB $vectorSearch, top-5 retrieval, metadata enrichment, LLM answer with source citations
Prompt engineering: 8 specialised system personas, filter-aware context injection, <think> tag stripping for reasoning models
Caching: Django LocMem cache with namespace keys (ai:{type}:{md5}), filter-aware invalidation, 1-hour TTL
Deployment: Docker (python:3.12-slim), gunicorn (2 workers + 4 threads), Cloud Run (512MB, auto-scale 0-2), MongoDB Atlas
Seeding: seed_mongo.py generates 300 docs / 1,050 measures / 80 linkage groups across 14 EU countries

Architecture Decisions

  • Filter-scoped AI context — every AI prompt includes the active filter state as a human-readable preamble, so LLM responses reflect the user's current view
  • Deduplication at query time — linkage collection maps duplicates to canonical IDs. Matched canonicals expand to include all duplicates; when displaying, duplicates are collapsed
  • Document-grounded deep analysis — in-depth analysis uses the full parsed document text (up to 25K chars) rather than just metadata, enabling the LLM to cite specific provisions
  • Graceful AI degradation — when no API key is configured, all AI endpoints return informative placeholders, keeping the dashboard fully functional for data exploration
  • Slim Docker image — copies only the 6 modules Django actually imports, keeping the image under 300MB
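The ai:{type}:{md5} key scheme with filter-aware invalidation can be sketched framework-free (the real app uses Django's LocMem cache; the field names here are illustrative):

```python
import hashlib
import json

def ai_cache_key(ai_type: str, filters: dict, prompt: str) -> str:
    """Namespace cache key in the ai:{type}:{md5} shape. Hashing a canonical
    serialisation of the filter state makes the key filter-aware: change any
    filter and the lookup misses, so stale AI responses are never served."""
    payload = json.dumps({"filters": filters, "prompt": prompt}, sort_keys=True)
    digest = hashlib.md5(payload.encode("utf-8")).hexdigest()
    return f"ai:{ai_type}:{digest}"

k1 = ai_cache_key("chart_narrative", {"country": "EL", "year": 2024}, "Summarise")
k2 = ai_cache_key("chart_narrative", {"country": "DE", "year": 2024}, "Summarise")
print(k1 == k2)  # False: different filter state, different cache entry
```

Sorting the JSON keys keeps the digest stable regardless of the order filters were applied in.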

PD Gait Analysis Agent — LLM-Powered Clinical Reasoning

Python AI/Gemini RAG/MongoDB

A clinician asks a question in plain English; the agent writes Python code to analyse wearable sensor gait data, executes it in a sandbox, and returns a clinically contextualised answer with full reasoning trace. Built on the PHIA pattern (Nature Communications, Jan 2026) — the first application of this approach to PD gait monitoring data.

Custom ReAct loop (no LangChain), MongoDB Atlas vector search over 17 clinical knowledge chunks, and synthetic data modelling realistic PD subtypes with medication wearing-off and freezing episodes. Entire stack runs on free-tier services at zero cost.
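A toy illustration of how medication wearing-off can be baked into synthetic gait data; the logistic ramp, timing, and parameters below are invented for this sketch and are not the project's actual model:

```python
import numpy as np

def simulate_stride_variability(hours_since_dose, base=2.0, wearing_off_gain=1.5, rng=None):
    """Toy wearing-off model: stride-time variability (%) rises as the
    levodopa effect fades. All parameters are illustrative."""
    rng = rng or np.random.default_rng(0)
    # Logistic ramp centred at ~3.5 h mimics an ON -> OFF transition.
    off_effect = 1.0 / (1.0 + np.exp(-(hours_since_dose - 3.5)))
    noise = rng.normal(0, 0.1, size=np.shape(hours_since_dose))
    return base + wearing_off_gain * off_effect + noise

hours = np.array([0.5, 2.0, 4.0, 6.0])
print(np.round(simulate_stride_variability(hours), 2))  # variability climbs with time since dose
```

Structure like this gives the agent something clinically meaningful to discover when it computes variability by medication state.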

Stack: Python · Streamlit · Gemini 2.5 Flash · MongoDB Atlas · sentence-transformers · Pandas · NumPy

Live Demo · GitHub (private repo, available on request)

Technical Highlights

LLM: Gemini 2.5 Flash (Google AI Studio, free tier)
Agent framework: custom ReAct loop (no LangChain dependency) handling prompt parsing, tool dispatch, and iteration control
Code execution: sandboxed Python with pre-loaded pandas DataFrame and patient profile dict
RAG: MongoDB Atlas vector search, all-MiniLM-L6-v2 embeddings (sentence-transformers, runs locally), 17 clinical chunks
System prompt: ~35k chars assembled from role description, clinical knowledge, data schema, patient profile, 6 few-shot ReAct trajectories, tool descriptions
Data: synthetic gait data (step_length, stride_time, cadence, stride_variability, asymmetry_index, freezing_flag, medication_state, hours_since_dose)
Frontend: Streamlit with patient selector, context cards, example question buttons, expandable reasoning trace

Architecture Decisions

  • PHIA-inspired ReAct pattern — agent reasons, acts (code or RAG), observes, and iterates; produces verifiable computation rather than hallucinated statistics
  • Custom agent loop over LangChain — full control over prompt assembly, tool dispatch, and iteration limits without framework overhead
  • Embedded clinical knowledge in system prompt + RAG — critical thresholds and scoring criteria are always available; RAG supplements with deeper guideline details on demand
  • Synthetic data with clinical realism — medication wearing-off patterns, progressive deterioration, freezing episodes, and PD subtype signatures allow meaningful agent evaluation without real patient data
  • Zero-cost stack — Gemini free tier, MongoDB Atlas M0, local embeddings; total running cost $0
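A stripped-down version of such a custom ReAct loop, with a scripted stand-in for the LLM so the sketch runs without an API key; the prompt format and tool names are illustrative, not the project's:

```python
import re

def react_loop(llm, tools, question, max_iters=5):
    """Minimal framework-free ReAct skeleton: the model alternates Thought/Action
    lines; each Action is dispatched to a tool and its Observation is appended
    to the transcript before the next model call."""
    transcript = f"Question: {question}\n"
    for _ in range(max_iters):
        step = llm(transcript)                 # model emits the next Thought/Action
        transcript += step + "\n"
        final = re.search(r"Final Answer: (.*)", step)
        if final:
            return final.group(1), transcript
        action = re.search(r"Action: (\w+)\[(.*)\]", step)
        if action:
            name, arg = action.groups()
            obs = tools[name](arg)             # tool dispatch (e.g. run_code, rag_search)
            transcript += f"Observation: {obs}\n"
    return None, transcript                    # iteration limit reached

# Scripted stand-in for the LLM, so the loop is runnable offline.
script = iter([
    "Thought: compute mean cadence\nAction: run_code[df['cadence'].mean()]",
    "Thought: 104 steps/min is within normal range\nFinal Answer: 104 steps/min",
])
tools = {"run_code": lambda code: "104.0"}
answer, trace = react_loop(lambda t: next(script), tools, "What is the mean cadence?")
print(answer)  # 104 steps/min
```

The appeal of skipping LangChain is visible even at this scale: the entire control flow fits in one readable function.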

KBforge — Knowledge Base Builder for LLM Retrieval (RAG)

Python AI/Gemini

Streamlit app that turns domain literature into production-ready knowledge bases for LLM retrieval-augmented generation (RAG). Ingest evidence from PDFs, PubMed, or structured JSON — every chunk is embedded, tagged with domain features, and stored in a ChromaDB vector index that any RAG pipeline can query with semantic search and metadata filtering. No coding required.

Evolved from a hardcoded Alzheimer’s research pipeline into a fully configurable tool where domain experts swap vocabularies, not code. Smart deduplication with cosine-similarity calibration merges near-duplicates without losing tags. Coverage gap tracking shows exactly where your KB is thin before your LLM starts hallucinating. Auto-generated extraction prompts let users paste papers into ChatGPT/Claude/NotebookLM and import the structured output directly.
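The overlapping-chunk step of PDF ingestion can be sketched as a simple character-window chunker (the real pipeline also does section detection; the sizes below are illustrative):

```python
def chunk_text(text, size=800, overlap=150):
    """Overlapping character-window chunker. Overlap keeps sentences that
    straddle a boundary retrievable from both neighbouring chunks."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

doc = "".join(str(i % 10) for i in range(1500))
parts = chunk_text(doc, size=800, overlap=150)
print(len(parts), [len(p) for p in parts])  # 3 [800, 800, 200]
```

Each chunk would then be embedded, tagged with domain features, and written to the vector index.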

Stack: Python · Streamlit · ChromaDB · sentence-transformers · Pydantic · Gemini API · PyMuPDF · PubMed E-utilities

GitHub (private repo, available on request)

Technical Highlights

Frontend: Streamlit multipage app (Setup, Add Sources, Knowledge Base), session state management
Data models: Pydantic v2 (Chunk, SourceInfo, ProjectConfig); all pipeline modules receive ProjectConfig, no hardcoded vocabularies
Ingestion: three sources — JSON import with validation, PDF upload (PyMuPDF section detection + overlapping chunks), PubMed abstract search (NCBI E-utilities, free tier)
LLM extraction: Gemini 2.5 Flash via google-genai, dynamic tagging prompts built from ProjectConfig.features, structured JSON output mode
Prompt generation: auto-generated extraction prompts (Prompt A/B) from project config; users paste papers into external LLMs and import the JSON output
Embeddings: sentence-transformers (all-MiniLM-L6-v2 default, configurable); generic embedder accepts any model
Vector store: ChromaDB with persistent SQLite backend, HNSW index, local (no cloud setup required)
Deduplication: cosine similarity with tag merging (not discard); calibrator shows similarity histogram, percentile stats, threshold impact table, top-N similar pairs
Coverage: per-feature chunk counts, coverage type breakdown, gap warnings, min_chunks_per_feature threshold
Normalisation: fuzzy feature-name matching against ProjectConfig vocabulary (handles typos, case variations)

Architecture Decisions

  • ProjectConfig everywhere — every module receives a Pydantic ProjectConfig object; no hardcoded feature lists, coverage types, or model names. Switching domains means changing one config, not refactoring code
  • Calibration before deduplication — different embedding models produce different similarity distributions. The calibrator shows the actual distribution so users pick a threshold grounded in their data, not a magic number
  • Tag merging over discard — when chunks are near-duplicates, tags from both are merged into the survivor. Information is preserved even when text is deduplicated
  • Three-path ingestion for different workflows — JSON paste for power users, auto-generated prompts for LLM-assisted extraction (no API key needed), PubMed + PDF for automated bulk ingestion
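The tag-merging deduplication described above can be sketched as a greedy pass over embeddings; the threshold is illustrative (KBforge calibrates it against the actual similarity distribution of the chosen embedding model):

```python
import numpy as np

def dedupe_with_tag_merge(embeddings, tags, threshold=0.92):
    """Greedy near-duplicate merge: later chunks matching an earlier survivor
    are dropped, but their tags are merged into the survivor rather than
    discarded, so no metadata is lost."""
    vecs = np.asarray(embeddings, dtype=float)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    survivors, merged_tags = [], []
    for i, v in enumerate(vecs):
        for j, s in enumerate(survivors):
            if float(v @ vecs[s]) >= threshold:   # cosine similarity on unit vectors
                merged_tags[j] |= set(tags[i])    # keep the duplicate's tags
                break
        else:
            survivors.append(i)
            merged_tags.append(set(tags[i]))
    return survivors, merged_tags

emb = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]      # first two are near-duplicates
tags = [{"amyloid"}, {"tau"}, {"lifestyle"}]
keep, merged = dedupe_with_tag_merge(emb, tags)
print(keep)  # [0, 2] — chunk 1 dropped, its "tau" tag merged into chunk 0
```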

Spark Athens — Social Nightlife Platform

Supabase React

Multi-sided platform for Athens nightlife that connects three audiences through role-based interfaces. Users discover events happening tonight, see who’s going before they commit, and match with people at the same venue. Venues get a management dashboard for promotion, attendee tracking, and talent booking. Artists find gigs and build their audience through event integration.

Squad matching lets friend groups find events together and discover other groups going to the same place. Designed around cross-role network effects — each new user, venue, or artist makes the platform stronger for everyone else.

Stack: Supabase · React

GitHub (private repo, available on request)

Athens Events Hub — Professional Event Networking for SMEs

Supabase React

B2B event networking platform designed for Greek SMEs attending professional conferences, trade shows, and workshops. Solves the biggest pain point of business events: walking in blind. Attendees publish structured profiles with explicit intent (looking for / offering), browse other attendees before the event, and schedule qualified meetings in advance.

Organisers get attendee analytics, demographic breakdowns, and engagement tracking. Built around measurable ROI — every interaction is trackable, so SMEs know whether an event was worth attending. Aligned with EDIH digital transformation priorities for the Attica region.

Stack: Supabase · React

GitHub (private repo, available on request)

Skills & Tools

Languages

Python · R · SQL · DAX · JavaScript

Data & BI

Pandas · Power BI · Plotly · ggplot2 · Streamlit · Shiny

Web & Backend

Django · HTMX · Tailwind CSS · MongoDB · SQLite · PostgreSQL

AI / ML

LLM integration · RAG pipelines · scikit-learn · Prompt engineering

DevOps & Cloud

Docker · Google Cloud Run · Azure · GitHub Actions · WhiteNoise

Domain Expertise

Statistical disclosure · Fisheries science · EU policy analytics · Digital transformation

Certifications

Google

Advanced Data Analytics Specialization (2025)

Business Intelligence Specialization (2025)

Microsoft

Azure Machine Learning for Data Scientists (2025)

Power BI & Power Virtual Agents (2025)

Data Visualization & Reporting with Generative AI (2025)

UC Davis

Geospatial Analysis with ArcGIS (2025)

Education

PhD

Quantitative Ecology & Statistical Modelling — University of Copenhagen / DTU-Aqua, Denmark (2006–2010). Marie Curie Fellowship. Meta-analysis and hierarchical modelling of population dynamics.

MSc

Ecology & Environmental Management — University of Patras, Greece (2003–2006)

BSc

Biology — University of Patras, Greece (1998–2003)