May 2026
We present TraceUI, an open-source framework for AI agents and programmatic clients: from a single public website URL it recovers deep design context (colors, typography, logo, UI patterns) and uses that grounding to generate brand-native visual assets (ads, mockups, and related creatives). The workflow is URL-first: paste a URL, read the live site's design language, and generate outputs native to the brand. Unlike prior work that relies on manually supplied brand guidelines or freeform text prompts alone, TraceUI employs a visual-first brand-grounding approach: a headless browser crawls the target site, a domain-specific JavaScript scoring heuristic extracts authoritative design signals, and a layered multimodal prompt architecture steers large vision-language models (Gemini 2.5 Flash / Gemini 3 Pro) so that outputs stay faithful to the observed identity. We introduce the concept of
Visual-RAG: retrieval-augmented generation where retrieved artifacts are rendered visual representations of the target domain rather than text passages.
The same pipeline is shipped as open source so agent runtimes and integrations can embed URL→context→asset generation without maintaining parallel brand kits, grounding autonomous workflows in live-site evidence rather than hand-written briefs.
Publicly, TraceUI is positioned as an open-source framework that gives AI agents the full design context of any website. This paper makes that claim precise: agents and automated pipelines need not depend on static PDF guidelines or long prose briefs when a URL already encodes the brand's observable visual language. The product story is URL-first: paste a domain, read colors, typography, logo, and layout from the rendered page, then generate marketing assets that read as native to that brand.
Creating such creatives by hand remains expensive. Designers must study brand guidelines, match palettes and type, and keep dozens of format variants consistent as the number of brands scales.
Recent advances in text-to-image generation [1, 2, 3] and multimodal language models [4, 5] have opened the door to automated asset generation. However, existing pipelines face a fundamental grounding problem: the model has no authoritative access to the brand's actual visual language. Practitioners compensate by writing detailed text prompts, uploading reference images manually, or fine-tuning models per brand [6]. Each approach is labor-intensive, brittle, or operationally expensive.
TraceUI addresses grounding at its root. Rather than asking a human to describe a brand, the open-source framework observes it by crawling the live website, capturing full-page screenshots at high fidelity, and extracting structured design signals directly from the rendered DOM. Those signals become the normative context for every subsequent generation request. This visual-first grounding is the central innovation we expose both in production and in open source.
Contributions: (1) a URL-first brand-grounding pipeline that crawls a live website with a headless browser and extracts structured design signals (colors, typography, logo, UI patterns) directly from the rendered DOM; (2) a six-layer multimodal prompt architecture that keeps generated assets faithful to the observed brand identity without per-brand fine-tuning; (3) Visual-RAG, a retrieval-augmented generation formulation in which the retrieved context is rendered visual evidence rather than text; and (4) an open-source implementation together with a qualitative evaluation across 50 public brand websites.
Diffusion-based models [1, 2] and autoregressive vision-language models [3, 7] have dramatically raised the quality ceiling for AI-generated images. Rombach et al. [1] introduced Latent Diffusion Models (LDMs) that operate in a compressed latent space, enabling high-resolution synthesis at practical compute costs. Google's Imagen [2] and subsequent Gemini image generation work [4] demonstrated that scaling alongside improved text encoders produces further gains in photorealism and prompt adherence. TraceUI uses Gemini 3 Pro as its image generation backbone, benefiting from its native multimodal input: screenshots and text prompts are submitted together.
The emergence of large vision-language models (VLMs) such as GPT-4V [8], LLaVA [9], and Gemini [4, 5] has made it practical to condition generation on visual inputs. Gemini 2.5 Flash serves TraceUI's analysis stage: given three to five website screenshots, the model produces a structured design-language brief covering color palette, typography, UI patterns, and brand tone. The use of a fast model (Flash) for analysis and a high-capacity model (Pro) for generation follows the cascaded inference pattern described in [11, 12].
Structured extraction of web content using headless browsers has been studied in the context of web scraping [13], accessibility auditing [14], and visual regression testing [15]. Playwright [16] provides a cross-browser automation framework that exposes both the rendered DOM and composited pixel output. TraceUI builds on Playwright's Chromium backend to capture screenshots at 1.5× device-pixel ratio, enabling extraction of sub-pixel typographic details. Prior work on web-to-design-token extraction [17] focuses on CSS variable extraction; TraceUI goes further by using JavaScript DOM scoring to identify logo elements and parsing Google Fonts API URLs to enumerate typeface choices.
DreamBooth [6] and Textual Inversion [18] fine-tune diffusion models on small per-brand image sets to capture brand identity in model weights. These offer high fidelity but require per-brand training runs, which is impractical for multi-tenant offerings such as TraceUI, an open-source framework in which users self-serve new brands on demand. ControlNet [19] and IP-Adapter [20] allow reference-image conditioning at inference time without fine-tuning, but both require integration into the diffusion sampling loop. TraceUI takes a model-agnostic approach: brand grounding is enforced entirely through prompt engineering and in-context visual examples, making the technique applicable to any future VLM.
Retrieval-Augmented Generation [21] conditions language model outputs on retrieved documents rather than relying purely on parametric knowledge. TraceUI applies an analogous principle to the visual domain: instead of retrieving text passages, it retrieves live screenshots of the brand's website and injects them as in-context visual references. We term this Visual-RAG (V-RAG), a retrieval-augmented generation approach where the retrieved artifacts are rendered visual representations of the target domain, not text.
TraceUI is structured as a request-driven pipeline with five stages:
[CRAWL] → [ANALYSIS] → [CACHE] → [GENERATE] → [PERSIST]
Safety validation. Before initiating any network request, the URL hostname is resolved via DNS and the resulting IP address is checked against a blocklist of CIDR ranges covering RFC 1918 private networks, loopback, link-local, and cloud metadata endpoints. This prevents Server-Side Request Forgery (SSRF) attacks [22].
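A minimal sketch of this check in TypeScript, assuming Node's built-in DNS resolver and the ipaddr.js library for CIDR matching; the blocklist and function name are illustrative rather than TraceUI's exact range set:

```typescript
// Resolve the hostname first, then reject private, loopback, link-local,
// and cloud-metadata ranges before any crawl request is issued.
import { lookup } from "node:dns/promises";
import ipaddr from "ipaddr.js";

const BLOCKED_CIDRS = [
  "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", // RFC 1918 private ranges
  "127.0.0.0/8",                                   // loopback
  "169.254.0.0/16",                                // link-local incl. metadata endpoint
].map((c) => ipaddr.parseCIDR(c));

export async function assertSafeTarget(rawUrl: string): Promise<void> {
  const { hostname } = new URL(rawUrl);
  const { address } = await lookup(hostname); // resolve before fetching anything
  const ip = ipaddr.parse(address);
  const blocked = BLOCKED_CIDRS.some(
    ([range, bits]) => ip.kind() === range.kind() && ip.match(range, bits),
  );
  if (blocked) throw new Error(`Refusing to crawl ${hostname} (${address})`);
}
```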
Headless navigation. A Playwright Chromium instance is launched with a 1920×1080 viewport at 1.5× device-pixel ratio. Navigation uses a 35-second timeout with network-idle detection. The page is programmatically scrolled in 850-pixel increments to trigger lazy-loaded assets, ensuring images and fonts below the fold are captured.
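The navigation step could look roughly like the following Playwright sketch; the viewport, scale factor, timeout, and scroll increment are the values quoted above, while the per-step scroll delay and the helper name are assumptions:

```typescript
import { chromium } from "playwright";

export async function capturePage(url: string): Promise<Buffer> {
  const browser = await chromium.launch();
  const context = await browser.newContext({
    viewport: { width: 1920, height: 1080 },
    deviceScaleFactor: 1.5,
  });
  const page = await context.newPage();
  await page.goto(url, { waitUntil: "networkidle", timeout: 35_000 });

  // Scroll in 850 px increments so lazy-loaded images and fonts below
  // the fold are requested before the screenshot is taken.
  await page.evaluate(async () => {
    for (let y = 0; y < document.body.scrollHeight; y += 850) {
      window.scrollTo(0, y);
      await new Promise((r) => setTimeout(r, 150));
    }
    window.scrollTo(0, 0);
  });

  const png = await page.screenshot({ fullPage: true, type: "png" });
  await browser.close();
  return png;
}
```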
Design signal extraction. Three routines run as injected JavaScript: (1) Logo scoring: elements are scored by header/nav containment, home-link relationship, class/alt-text semantics, and aspect ratio; (2) Typography extraction: getComputedStyle reads font-family values from body, h1, and button; Google Fonts API URLs are parsed to map family names; (3) Favicon fallback: link hints are tried in order, with /favicon.ico as last resort.
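A condensed sketch of the injected routines, assuming the `page` handle from the previous sketch; the selector lists, score weights, and returned shape are illustrative simplifications of the heuristics described above:

```typescript
const signals = await page.evaluate(() => {
  // (2) Typography: computed font families for body, headings, and buttons.
  const fontOf = (sel: string) => {
    const el = document.querySelector(sel);
    return el ? getComputedStyle(el).fontFamily : null;
  };
  const typography = { body: fontOf("body"), heading: fontOf("h1"), button: fontOf("button") };

  // (1) Toy logo scorer; the real routine also weighs aspect ratio.
  const logo = [...document.querySelectorAll("img, svg")]
    .map((el) => {
      let score = 0;
      if (el.closest("header, nav")) score += 3;      // header/nav containment
      if (el.closest('a[href="/"]')) score += 2;      // home-link relationship
      const hint = `${el.getAttribute("class") ?? ""} ${el.getAttribute("alt") ?? ""}`.toLowerCase();
      if (hint.includes("logo")) score += 4;          // class/alt-text semantics
      return { src: el instanceof HTMLImageElement ? el.src : null, score };
    })
    .sort((a, b) => b.score - a.score)[0] ?? null;

  // (3) Favicon fallback: explicit <link> hints first, /favicon.ico last.
  const link = document.querySelector<HTMLLinkElement>(
    'link[rel="icon"], link[rel="shortcut icon"], link[rel="apple-touch-icon"]',
  );
  const favicon = link?.href ?? new URL("/favicon.ico", location.origin).href;

  return { typography, logo, favicon };
});
```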
Screenshot processing. Full-page PNGs are processed by Sharp: WebP at 2400px for client display, JPEG at 1280px for Gemini input.
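A sketch of the two derived renditions with Sharp; reading the 2400px and 1280px figures as target widths, and the quality settings, are assumptions:

```typescript
import sharp from "sharp";

export async function processScreenshot(fullPagePng: Buffer) {
  const display = await sharp(fullPagePng)
    .resize({ width: 2400, withoutEnlargement: true })
    .webp({ quality: 82 })            // client-facing rendition
    .toBuffer();

  const modelInput = await sharp(fullPagePng)
    .resize({ width: 1280, withoutEnlargement: true })
    .jpeg({ quality: 80 })            // rendition submitted to Gemini
    .toBuffer();

  return { display, modelInput };
}
```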
Screenshots are submitted to Gemini 2.5 Flash with a structured prompt requesting a design-language brief: primary/secondary/accent color values, heading and body typeface names, UI pattern vocabulary, and overall brand tone. Gemini 2.5 Flash was chosen for latency and cost; it produces acceptable design briefs in under four seconds.
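The analysis call might be issued roughly as follows with the Vertex AI Node SDK; the SDK choice, model identifier string, region, and prompt wording are assumptions, while the requested fields follow the brief described above:

```typescript
import { VertexAI } from "@google-cloud/vertexai";

const vertex = new VertexAI({ project: process.env.GCP_PROJECT!, location: "us-central1" });
const flash = vertex.getGenerativeModel({ model: "gemini-2.5-flash" });

export async function analyzeDesign(screenshotsJpeg: Buffer[]): Promise<string> {
  // Screenshots go in as inline image parts alongside the structured request.
  const imageParts = screenshotsJpeg.map((buf) => ({
    inlineData: { mimeType: "image/jpeg", data: buf.toString("base64") },
  }));
  const result = await flash.generateContent({
    contents: [{
      role: "user",
      parts: [
        ...imageParts,
        { text: "Produce a design-language brief: primary/secondary/accent colors (hex), " +
                "heading and body typefaces, UI pattern vocabulary, and overall brand tone." },
      ],
    }],
  });
  return result.response.candidates?.[0]?.content.parts[0]?.text ?? "";
}
```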
The design-language brief and base64-encoded screenshots are stored in a server-side in-memory map keyed by a UUID session token. Sessions expire after 45 minutes and are pruned every 5 minutes. This avoids re-running the full crawl (8–15 s) on every generation request during a working session.
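A minimal in-memory store matching the TTL and sweep interval above; the field names and token helper are illustrative:

```typescript
import { randomUUID } from "node:crypto";

interface BrandSession {
  brief: string;          // design-language brief from the analysis stage
  screenshots: string[];  // base64-encoded screenshot renditions
  expiresAt: number;
}

const SESSION_TTL_MS = 45 * 60 * 1000;
const sessions = new Map<string, BrandSession>();

export function createSession(brief: string, screenshots: string[]): string {
  const token = randomUUID();
  sessions.set(token, { brief, screenshots, expiresAt: Date.now() + SESSION_TTL_MS });
  return token;
}

export function getSession(token: string): BrandSession | undefined {
  const s = sessions.get(token);
  return s && s.expiresAt > Date.now() ? s : undefined;
}

// Sweep expired sessions every 5 minutes.
setInterval(() => {
  const now = Date.now();
  for (const [token, s] of sessions) if (s.expiresAt <= now) sessions.delete(token);
}, 5 * 60 * 1000);
```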
Image generation composes the six-layer prompt for each request and calls Gemini 3 Pro via Vertex AI. Requests use exponential backoff (base 1.5 s, jitter ±20%, max 6 retries, 20 s cap per attempt). RESOURCE_EXHAUSTED, 502/503/504, and network timeout errors are retried; semantic errors are surfaced immediately.
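A sketch of the retry wrapper; reading "base 1.5 s" as the initial delay with doubling, and the 20 s cap as a ceiling on each delay, are assumptions, and error classification is reduced to a caller-supplied predicate:

```typescript
export async function withBackoff<T>(
  fn: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,  // e.g. RESOURCE_EXHAUSTED, 502/503/504, timeouts
): Promise<T> {
  const BASE_MS = 1_500, CAP_MS = 20_000, MAX_RETRIES = 6;
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Semantic errors (and exhausted retry budgets) are surfaced immediately.
      if (attempt >= MAX_RETRIES || !isRetryable(err)) throw err;
      const delay = Math.min(CAP_MS, BASE_MS * 2 ** attempt);
      const jitter = delay * (Math.random() * 0.4 - 0.2); // ±20%
      await new Promise((r) => setTimeout(r, delay + jitter));
    }
  }
}
```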
Generated images are base64-decoded and uploaded to Google Cloud Storage under a per-user path. A 10-year signed URL is generated server-side. Direct client-to-storage access is disabled by storage rules. Metadata is written to Firestore under users/{uid}/generatedAssets/{id}.
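A persistence sketch using the Cloud Storage and Firestore client libraries; the bucket name, object-path layout, and metadata fields are illustrative:

```typescript
import { Storage } from "@google-cloud/storage";
import { Firestore } from "@google-cloud/firestore";

const storage = new Storage();
const db = new Firestore();

export async function persistAsset(uid: string, id: string, imageB64: string): Promise<string> {
  const file = storage
    .bucket("traceui-generated-assets")                // hypothetical bucket name
    .file(`users/${uid}/${id}.png`);
  await file.save(Buffer.from(imageB64, "base64"), { contentType: "image/png" });

  // Signed URL generated server-side; direct client access stays disabled.
  const [signedUrl] = await file.getSignedUrl({
    action: "read",
    expires: Date.now() + 10 * 365 * 24 * 60 * 60 * 1000, // ~10 years
  });

  await db.doc(`users/${uid}/generatedAssets/${id}`).set({
    url: signedUrl,
    createdAt: new Date(),
  });
  return signedUrl;
}
```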
The generation prompt is decomposed into six ordered layers.
Layers 1–3 are invariant; Layers 4–6 are request-specific. This separation of concerns allows independent iteration on framework-wide quality defaults, brand enforcement, and format support without cross-contamination.
Beyond output categories, 30+ style presets augment Layer 6 with highly specific aesthetic direction, including Swiss Industrial Brutalist (strict grid, registration marks, bold sans-serifs), Japanese Neo-Traditional Psychedelic (Hokusai-inspired linework, Rinpa cloud patterning, cobalt/crimson/amber palette), and 28 others. Each preset provides enough visual specificity for the model to converge on a recognizable style without overriding brand-identity constraints in Layers 1–4.
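A composition sketch for the layered prompt follows. Only the roles of Layer 2 (pixels-win grounding) and Layer 6 (output format and optional style preset) are pinned down by the text above; every other layer string below is a placeholder, not TraceUI's actual prompt text, and the assignment of the design brief to Layer 4 is an assumption:

```typescript
type Part = { text: string } | { inlineData: { mimeType: string; data: string } };

interface GenerationRequest {
  screenshots: string[];  // base64 renditions from the cached session
  brief: string;          // design-language brief from the analysis stage
  format: string;         // e.g. "1080x1080 social ad" (hypothetical value)
  stylePreset?: string;   // e.g. "Swiss Industrial Brutalist"
}

export function composePrompt(req: GenerationRequest): Part[] {
  // Layers 1-3: invariant across requests.
  const invariant: Part[] = [
    { text: "[Layer 1] Framework-wide quality defaults (placeholder)." },
    // Layer 2: observed pixels are normative over any text description.
    ...req.screenshots.map((data): Part => ({ inlineData: { mimeType: "image/jpeg", data } })),
    { text: "[Layer 2] Reproduce the visual identity shown in the screenshots above; where text and pixels disagree, the pixels win." },
    { text: "[Layer 3] Brand-enforcement defaults (placeholder)." },
  ];
  // Layers 4-6: request-specific.
  const perRequest: Part[] = [
    { text: `[Layer 4] Design-language brief: ${req.brief}` },      // placement assumed
    { text: "[Layer 5] Request-specific constraints (placeholder)." },
    { text: `[Layer 6] Output: ${req.format}.${req.stylePreset ? ` Style preset: ${req.stylePreset}.` : ""}` },
  ];
  return [...invariant, ...perRequest];
}
```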
SSRF Prevention. DNS-resolution-based IP blocklist prevents the crawl engine from probing internal services.
Authentication. Every API request is authenticated via a Firebase ID token verified server-side. No server-side sessions are maintained.
Rate Limiting. Per-IP and per-user limits at five granularities (see the configuration sketch below): general API (400/15 min), crawl (20/15 min/user), brief expansion (40/15 min/user), image generation (12/15 min/user), feedback (5/hour/user).
Storage Isolation. Cloud Storage rules deny all direct read/write access. All image retrieval goes through server-generated signed URLs.
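The per-route limits above could be expressed roughly as follows; express-rate-limit is an assumed middleware choice and the user-keyed generator is illustrative, while the windows and ceilings are the figures quoted above:

```typescript
import rateLimit from "express-rate-limit";

// Key per-user routes by Firebase uid when available, falling back to IP.
const keyByUser = (req: any) => req.user?.uid ?? req.ip;

export const limits = {
  api:      rateLimit({ windowMs: 15 * 60 * 1000, max: 400 }),                          // per IP
  crawl:    rateLimit({ windowMs: 15 * 60 * 1000, max: 20, keyGenerator: keyByUser }),
  brief:    rateLimit({ windowMs: 15 * 60 * 1000, max: 40, keyGenerator: keyByUser }),
  generate: rateLimit({ windowMs: 15 * 60 * 1000, max: 12, keyGenerator: keyByUser }),
  feedback: rateLimit({ windowMs: 60 * 60 * 1000, max: 5,  keyGenerator: keyByUser }),
};
```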
We evaluated brand fidelity qualitatively across 50 publicly known brand websites (tech, DTC, hospitality, SaaS), rating generated ad creatives on a 1–5 scale across color accuracy, typography match, tone match, and overall recognizability. The "pixels-win" grounding (Layer 2) was the single largest contributor to brand accuracy. In ablation runs where Layer 2 was removed, the model reverted to generic "tech startup" visual language in approximately 40% of cases when the design brief text was ambiguous or terse.
| Operation | Median | p95 |
|---|---|---|
| Full crawl (1 page) | 8.2 s | 14.1 s |
| Design brief analysis | 3.6 s | 6.8 s |
| Single image generation | 11.4 s | 22.7 s |
These figures reflect typical production traffic for the crawl-to-single-image path described above.
TraceUI instantiates a broader principle we call Visual-RAG: retrieval-augmented generation where the retrieved context is rendered visual content rather than text. For domains where the authoritative source of truth is visual (brand identity, interior design, fashion), grounding model outputs in retrieved visual artifacts produces more faithful results than grounding in text descriptions of those artifacts. This is analogous to the insight in text RAG [21] that retrieved passages produce better answers than the model's parametric memory, but applied to the visual modality.
TraceUI can generate ad creatives for any public website, raising potential misuse vectors around brand impersonation. Mitigations include requiring authenticated Google accounts, rate-limiting generation to practical creative workflows, and terms of service prohibiting misuse. Future work should investigate automated detection of brand impersonation in generated outputs.
TraceUI demonstrates that live website crawls combined with structured visual-signal extraction and a layered multimodal prompt architecture can produce brand-grounded ad creatives at quality levels suitable for production use, without per-brand model fine-tuning. The "pixels-win" brand fidelity principle (treating observed visual evidence as normative over text descriptions) is the key design decision that differentiates TraceUI from prior text-prompt-based creative-generation frameworks. We expect Visual-RAG to find application in other design domains: architecture moodboards, product styling, fashion lookbooks, and game asset generation where art direction is defined visually.