May 2026
We present TraceUI, an open-source framework for AI agents and programmatic clients: from a single public website URL it recovers deep design context (colors, typography, logo, UI patterns) and uses that grounding to generate brand-native visual assets (ads, mockups, and related creatives). The workflow is URL-first: paste a URL, read the live site's design language, and generate outputs native to the brand. Unlike prior work that relies on manually supplied brand guidelines or freeform text prompts alone, TraceUI employs a visual-first brand-grounding approach: a headless browser crawls the target site, a domain-specific JavaScript scoring heuristic extracts authoritative design signals, and a layered multimodal prompt architecture steers large vision-language models (Gemini 2.5 Flash / Gemini 3 Pro) so that outputs stay faithful to the observed identity. We introduce the concept of
Visual-RAG: retrieval-augmented generation where retrieved artifacts are rendered visual representations of the target domain rather than text passages.
The same pipeline is shipped as open source so agent runtimes and integrations can embed URL→context→asset generation without maintaining parallel brand kits, grounding autonomous workflows in live-site evidence rather than hand-written briefs.
Publicly, TraceUI is positioned as an open-source framework that gives AI agents the full design context of any website. This paper makes that claim precise: agents and automated pipelines need not depend on static PDF guidelines or long prose briefs when a URL already encodes the brand's observable visual language. The product story is URL-first: paste a domain, read colors, typography, logo, and layout from the rendered page, then generate marketing assets that read as native to that brand.
Creating such creatives by hand remains expensive. Designers must study brand guidelines, match palettes and type, and keep dozens of format variants consistent as the number of brands scales.
Recent advances in text-to-image generation [1, 2, 3] and multimodal language models [4, 5] have opened the door to automated asset generation. However, existing pipelines face a fundamental grounding problem: the model has no authoritative access to the brand's actual visual language. Practitioners compensate by writing detailed text prompts, uploading reference images manually, or fine-tuning models per brand [6]. Each approach is labor-intensive, brittle, or operationally expensive.
TraceUI addresses grounding at its root. Rather than asking a human to describe a brand, the open-source framework observes it by crawling the live website, capturing full-page screenshots at high fidelity, and extracting structured design signals directly from the rendered DOM. Those signals become the normative context for every subsequent generation request. This visual-first grounding is the central innovation we expose both in production and in open source.
Contributions: (1) a URL-first brand-grounding pipeline that crawls a live website with a headless browser and extracts structured design signals (colors, typography, logo, UI patterns) directly from the rendered DOM; (2) a six-layer multimodal prompt architecture that keeps generated assets faithful to the observed brand identity without per-brand fine-tuning; (3) Visual-RAG, a retrieval-augmented generation formulation in which the retrieved context is rendered visual evidence rather than text; and (4) an open-source implementation together with a qualitative evaluation across 50 public brand websites.
Diffusion-based models [1, 2] and autoregressive vision-language models [3, 7] have dramatically raised the quality ceiling for AI-generated images. Rombach et al. [1] introduced Latent Diffusion Models (LDMs) that operate in a compressed latent space, enabling high-resolution synthesis at practical compute costs. Google's Imagen [2] and subsequent Gemini image generation work [4] demonstrated that scaling alongside improved text encoders produces further gains in photorealism and prompt adherence. TraceUI uses Gemini 3 Pro as its image generation backbone, benefiting from its native multimodal input: screenshots and text prompts are submitted together.
The emergence of large vision-language models (VLMs) such as GPT-4V [8], LLaVA [9], and Gemini [4, 5] has made it practical to condition generation on visual inputs. Gemini 2.5 Flash serves TraceUI's analysis stage: given three to five website screenshots, the model produces a structured design-language brief covering color palette, typography, UI patterns, and brand tone. The use of a fast model (Flash) for analysis and a high-capacity model (Pro) for generation follows the cascaded inference pattern described in [11, 12].
Structured extraction of web content using headless browsers has been studied in the context of web scraping [13], accessibility auditing [14], and visual regression testing [15]. Playwright [16] provides a cross-browser automation framework that exposes both the rendered DOM and composited pixel output. TraceUI builds on Playwright's Chromium backend to capture screenshots at 1.5× device-pixel ratio, enabling extraction of sub-pixel typographic details. Prior work on web-to-design-token extraction [17] focuses on CSS variable extraction; TraceUI goes further by using JavaScript DOM scoring to identify logo elements and parsing Google Fonts API URLs to enumerate typeface choices.
DreamBooth [6] and Textual Inversion [18] fine-tune diffusion models on small per-brand image sets to capture brand identity in model weights. These offer high fidelity but require per-brand training runs, which is impractical for multi-tenant offerings such as TraceUI, an open-source framework in which users self-serve new brands on demand. ControlNet [19] and IP-Adapter [20] allow reference-image conditioning at inference time without fine-tuning, but both require integration into the diffusion sampling loop. TraceUI takes a model-agnostic approach: brand grounding is enforced entirely through prompt engineering and in-context visual examples, making the technique applicable to any future VLM.
Retrieval-Augmented Generation [21] conditions language model outputs on retrieved documents rather than relying purely on parametric knowledge. TraceUI applies an analogous principle to the visual domain: instead of retrieving text passages, it retrieves live screenshots of the brand's website and injects them as in-context visual references. We term this Visual-RAG (V-RAG), a retrieval-augmented generation approach where the retrieved artifacts are rendered visual representations of the target domain, not text.
TraceUI is structured as a request-driven pipeline with five stages:
[CRAWL] → [ANALYSIS] → [CACHE] → [GENERATE] → [PERSIST]
Safety validation. Before initiating any network request, the URL hostname is resolved via DNS and the resulting IP address is checked against a blocklist of CIDR ranges covering RFC 1918 private networks, loopback, link-local, and cloud metadata endpoints. This prevents Server-Side Request Forgery (SSRF) attacks [22].
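A minimal sketch of this check in TypeScript, assuming Node's built-in DNS resolver and the ipaddr.js library for CIDR matching; the blocklist and function name are illustrative rather than TraceUI's exact range set:

```typescript
// Resolve the hostname first, then reject private, loopback, link-local,
// and cloud-metadata ranges before any crawl request is issued.
import { lookup } from "node:dns/promises";
import ipaddr from "ipaddr.js";

const BLOCKED_CIDRS = [
  "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", // RFC 1918 private ranges
  "127.0.0.0/8",                                   // loopback
  "169.254.0.0/16",                                // link-local incl. metadata endpoint
].map((c) => ipaddr.parseCIDR(c));

export async function assertSafeTarget(rawUrl: string): Promise<void> {
  const { hostname } = new URL(rawUrl);
  const { address } = await lookup(hostname); // resolve before fetching anything
  const ip = ipaddr.parse(address);
  const blocked = BLOCKED_CIDRS.some(
    ([range, bits]) => ip.kind() === range.kind() && ip.match(range, bits),
  );
  if (blocked) throw new Error(`Refusing to crawl ${hostname} (${address})`);
}
```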
Headless navigation. A Playwright Chromium instance is launched with a 1920×1080 viewport at 1.5× device-pixel ratio. Navigation uses a 35-second timeout with network-idle detection. The page is programmatically scrolled in 850-pixel increments to trigger lazy-loaded assets, ensuring images and fonts below the fold are captured.
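The navigation step could look roughly like the following Playwright sketch; the viewport, scale factor, timeout, and scroll increment are the values quoted above, while the per-step scroll delay and the helper name are assumptions:

```typescript
import { chromium } from "playwright";

export async function capturePage(url: string): Promise<Buffer> {
  const browser = await chromium.launch();
  const context = await browser.newContext({
    viewport: { width: 1920, height: 1080 },
    deviceScaleFactor: 1.5,
  });
  const page = await context.newPage();
  await page.goto(url, { waitUntil: "networkidle", timeout: 35_000 });

  // Scroll in 850 px increments so lazy-loaded images and fonts below
  // the fold are requested before the screenshot is taken.
  await page.evaluate(async () => {
    for (let y = 0; y < document.body.scrollHeight; y += 850) {
      window.scrollTo(0, y);
      await new Promise((r) => setTimeout(r, 150));
    }
    window.scrollTo(0, 0);
  });

  const png = await page.screenshot({ fullPage: true, type: "png" });
  await browser.close();
  return png;
}
```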
Design signal extraction. Three routines run as injected JavaScript: (1) Logo scoring: elements are scored by header/nav containment, home-link relationship, class/alt-text semantics, and aspect ratio; (2) Typography extraction: getComputedStyle reads font-family values from body, h1, and button; Google Fonts API URLs are parsed to map family names; (3) Favicon fallback: link hints are tried in order, with /favicon.ico as last resort.
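A condensed sketch of the injected routines, assuming the `page` handle from the previous sketch; the selector lists, score weights, and returned shape are illustrative simplifications of the heuristics described above:

```typescript
const signals = await page.evaluate(() => {
  // (2) Typography: computed font families for body, headings, and buttons.
  const fontOf = (sel: string) => {
    const el = document.querySelector(sel);
    return el ? getComputedStyle(el).fontFamily : null;
  };
  const typography = { body: fontOf("body"), heading: fontOf("h1"), button: fontOf("button") };

  // (1) Toy logo scorer; the real routine also weighs aspect ratio.
  const logo = [...document.querySelectorAll("img, svg")]
    .map((el) => {
      let score = 0;
      if (el.closest("header, nav")) score += 3;      // header/nav containment
      if (el.closest('a[href="/"]')) score += 2;      // home-link relationship
      const hint = `${el.getAttribute("class") ?? ""} ${el.getAttribute("alt") ?? ""}`.toLowerCase();
      if (hint.includes("logo")) score += 4;          // class/alt-text semantics
      return { src: el instanceof HTMLImageElement ? el.src : null, score };
    })
    .sort((a, b) => b.score - a.score)[0] ?? null;

  // (3) Favicon fallback: explicit <link> hints first, /favicon.ico last.
  const link = document.querySelector<HTMLLinkElement>(
    'link[rel="icon"], link[rel="shortcut icon"], link[rel="apple-touch-icon"]',
  );
  const favicon = link?.href ?? new URL("/favicon.ico", location.origin).href;

  return { typography, logo, favicon };
});
```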
Screenshot processing. Full-page PNGs are processed by Sharp: WebP at 2400px for client display, JPEG at 1280px for Gemini input.
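A sketch of the two derived renditions with Sharp; reading the 2400px and 1280px figures as target widths, and the quality settings, are assumptions:

```typescript
import sharp from "sharp";

export async function processScreenshot(fullPagePng: Buffer) {
  const display = await sharp(fullPagePng)
    .resize({ width: 2400, withoutEnlargement: true })
    .webp({ quality: 82 })            // client-facing rendition
    .toBuffer();

  const modelInput = await sharp(fullPagePng)
    .resize({ width: 1280, withoutEnlargement: true })
    .jpeg({ quality: 80 })            // rendition submitted to Gemini
    .toBuffer();

  return { display, modelInput };
}
```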
Screenshots are submitted to Gemini 2.5 Flash with a structured prompt requesting a design-language brief: primary/secondary/accent color values, heading and body typeface names, UI pattern vocabulary, and overall brand tone. Gemini 2.5 Flash was chosen for latency and cost; it produces acceptable design briefs in under four seconds.
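The analysis call might be issued roughly as follows with the Vertex AI Node SDK; the SDK choice, model identifier string, region, and prompt wording are assumptions, while the requested fields follow the brief described above:

```typescript
import { VertexAI } from "@google-cloud/vertexai";

const vertex = new VertexAI({ project: process.env.GCP_PROJECT!, location: "us-central1" });
const flash = vertex.getGenerativeModel({ model: "gemini-2.5-flash" });

export async function analyzeDesign(screenshotsJpeg: Buffer[]): Promise<string> {
  // Screenshots go in as inline image parts alongside the structured request.
  const imageParts = screenshotsJpeg.map((buf) => ({
    inlineData: { mimeType: "image/jpeg", data: buf.toString("base64") },
  }));
  const result = await flash.generateContent({
    contents: [{
      role: "user",
      parts: [
        ...imageParts,
        { text: "Produce a design-language brief: primary/secondary/accent colors (hex), " +
                "heading and body typefaces, UI pattern vocabulary, and overall brand tone." },
      ],
    }],
  });
  return result.response.candidates?.[0]?.content.parts[0]?.text ?? "";
}
```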
The design-language brief and base64-encoded screenshots are stored in a server-side in-memory map keyed by a UUID session token. Sessions expire after 45 minutes and are pruned every 5 minutes. This avoids re-running the full crawl (8–15 s) on every generation request during a working session.
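A minimal in-memory store matching the TTL and sweep interval above; the field names and token helper are illustrative:

```typescript
import { randomUUID } from "node:crypto";

interface BrandSession {
  brief: string;          // design-language brief from the analysis stage
  screenshots: string[];  // base64-encoded screenshot renditions
  expiresAt: number;
}

const SESSION_TTL_MS = 45 * 60 * 1000;
const sessions = new Map<string, BrandSession>();

export function createSession(brief: string, screenshots: string[]): string {
  const token = randomUUID();
  sessions.set(token, { brief, screenshots, expiresAt: Date.now() + SESSION_TTL_MS });
  return token;
}

export function getSession(token: string): BrandSession | undefined {
  const s = sessions.get(token);
  return s && s.expiresAt > Date.now() ? s : undefined;
}

// Sweep expired sessions every 5 minutes.
setInterval(() => {
  const now = Date.now();
  for (const [token, s] of sessions) if (s.expiresAt <= now) sessions.delete(token);
}, 5 * 60 * 1000);
```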
Image generation composes the six-layer prompt for each request and calls Gemini 3 Pro via Vertex AI. Requests use exponential backoff (base 1.5 s, jitter ±20%, max 6 retries, 20 s cap per attempt). RESOURCE_EXHAUSTED, 502/503/504, and network timeout errors are retried; semantic errors are surfaced immediately.
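A sketch of the retry wrapper; reading "base 1.5 s" as the initial delay with doubling, and the 20 s cap as a ceiling on each delay, are assumptions, and error classification is reduced to a caller-supplied predicate:

```typescript
export async function withBackoff<T>(
  fn: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,  // e.g. RESOURCE_EXHAUSTED, 502/503/504, timeouts
): Promise<T> {
  const BASE_MS = 1_500, CAP_MS = 20_000, MAX_RETRIES = 6;
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Semantic errors (and exhausted retry budgets) are surfaced immediately.
      if (attempt >= MAX_RETRIES || !isRetryable(err)) throw err;
      const delay = Math.min(CAP_MS, BASE_MS * 2 ** attempt);
      const jitter = delay * (Math.random() * 0.4 - 0.2); // ±20%
      await new Promise((r) => setTimeout(r, delay + jitter));
    }
  }
}
```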
Generated images are base64-decoded and uploaded to Google Cloud Storage under a per-user path. A 10-year signed URL is generated server-side. Direct client-to-storage access is disabled by storage rules. Metadata is written to Firestore under users/{uid}/generatedAssets/{id}.
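A persistence sketch using the Cloud Storage and Firestore client libraries; the bucket name, object-path layout, and metadata fields are illustrative:

```typescript
import { Storage } from "@google-cloud/storage";
import { Firestore } from "@google-cloud/firestore";

const storage = new Storage();
const db = new Firestore();

export async function persistAsset(uid: string, id: string, imageB64: string): Promise<string> {
  const file = storage
    .bucket("traceui-generated-assets")                // hypothetical bucket name
    .file(`users/${uid}/${id}.png`);
  await file.save(Buffer.from(imageB64, "base64"), { contentType: "image/png" });

  // Signed URL generated server-side; direct client access stays disabled.
  const [signedUrl] = await file.getSignedUrl({
    action: "read",
    expires: Date.now() + 10 * 365 * 24 * 60 * 60 * 1000, // ~10 years
  });

  await db.doc(`users/${uid}/generatedAssets/${id}`).set({
    url: signedUrl,
    createdAt: new Date(),
  });
  return signedUrl;
}
```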
The generation prompt is decomposed into six ordered layers.
Layers 1–3 are invariant; Layers 4–6 are request-specific. This separation of concerns allows independent iteration on framework-wide quality defaults, brand enforcement, and format support without cross-contamination.
Beyond output categories, 30+ style presets augment Layer 6 with highly specific aesthetic direction, including Swiss Industrial Brutalist (strict grid, registration marks, bold sans-serifs), Japanese Neo-Traditional Psychedelic (Hokusai-inspired linework, Rinpa cloud patterning, cobalt/crimson/amber palette), and 28 others. Each preset provides enough visual specificity for the model to converge on a recognizable style without overriding brand-identity constraints in Layers 1–4.
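A composition sketch for the layered prompt follows. Only the roles of Layer 2 (pixels-win grounding) and Layer 6 (output format and optional style preset) are pinned down by the text above; every other layer string below is a placeholder, not TraceUI's actual prompt text, and the assignment of the design brief to Layer 4 is an assumption:

```typescript
type Part = { text: string } | { inlineData: { mimeType: string; data: string } };

interface GenerationRequest {
  screenshots: string[];  // base64 renditions from the cached session
  brief: string;          // design-language brief from the analysis stage
  format: string;         // e.g. "1080x1080 social ad" (hypothetical value)
  stylePreset?: string;   // e.g. "Swiss Industrial Brutalist"
}

export function composePrompt(req: GenerationRequest): Part[] {
  // Layers 1-3: invariant across requests.
  const invariant: Part[] = [
    { text: "[Layer 1] Framework-wide quality defaults (placeholder)." },
    // Layer 2: observed pixels are normative over any text description.
    ...req.screenshots.map((data): Part => ({ inlineData: { mimeType: "image/jpeg", data } })),
    { text: "[Layer 2] Reproduce the visual identity shown in the screenshots above; where text and pixels disagree, the pixels win." },
    { text: "[Layer 3] Brand-enforcement defaults (placeholder)." },
  ];
  // Layers 4-6: request-specific.
  const perRequest: Part[] = [
    { text: `[Layer 4] Design-language brief: ${req.brief}` },      // placement assumed
    { text: "[Layer 5] Request-specific constraints (placeholder)." },
    { text: `[Layer 6] Output: ${req.format}.${req.stylePreset ? ` Style preset: ${req.stylePreset}.` : ""}` },
  ];
  return [...invariant, ...perRequest];
}
```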
SSRF Prevention. DNS-resolution-based IP blocklist prevents the crawl engine from probing internal services.
Authentication. Every API request is authenticated via a Firebase ID token verified server-side. No server-side sessions are maintained.
Rate Limiting. Per-IP and per-user limits at five granularities (see the configuration sketch below): general API (400/15 min), crawl (20/15 min/user), brief expansion (40/15 min/user), image generation (12/15 min/user), feedback (5/hour/user).
Storage Isolation. Cloud Storage rules deny all direct read/write access. All image retrieval goes through server-generated signed URLs.
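The per-route limits above could be expressed roughly as follows; express-rate-limit is an assumed middleware choice and the user-keyed generator is illustrative, while the windows and ceilings are the figures quoted above:

```typescript
import rateLimit from "express-rate-limit";

// Key per-user routes by Firebase uid when available, falling back to IP.
const keyByUser = (req: any) => req.user?.uid ?? req.ip;

export const limits = {
  api:      rateLimit({ windowMs: 15 * 60 * 1000, max: 400 }),                          // per IP
  crawl:    rateLimit({ windowMs: 15 * 60 * 1000, max: 20, keyGenerator: keyByUser }),
  brief:    rateLimit({ windowMs: 15 * 60 * 1000, max: 40, keyGenerator: keyByUser }),
  generate: rateLimit({ windowMs: 15 * 60 * 1000, max: 12, keyGenerator: keyByUser }),
  feedback: rateLimit({ windowMs: 60 * 60 * 1000, max: 5,  keyGenerator: keyByUser }),
};
```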
We evaluated brand fidelity qualitatively across 50 publicly known brand websites (tech, DTC, hospitality, SaaS), rating generated ad creatives on a 1–5 scale across color accuracy, typography match, tone match, and overall recognizability. The "pixels-win" grounding (Layer 2) was the single largest contributor to brand accuracy. In ablation runs where Layer 2 was removed, the model reverted to generic "tech startup" visual language in approximately 40% of cases when the design brief text was ambiguous or terse.
| Operation | Median | p95 |
|---|---|---|
| Full crawl (1 page) | 8.2 s | 14.1 s |
| Design brief analysis | 3.6 s | 6.8 s |
| Single image generation | 11.4 s | 22.7 s |
These figures reflect typical production traffic for the crawl-to-single-image path described above.
TraceUI instantiates a broader principle we call Visual-RAG: retrieval-augmented generation where the retrieved context is rendered visual content rather than text. For domains where the authoritative source of truth is visual (brand identity, interior design, fashion), grounding model outputs in retrieved visual artifacts produces more faithful results than grounding in text descriptions of those artifacts. This is analogous to the insight in text RAG [21] that retrieved passages produce better answers than the model's parametric memory, but applied to the visual modality.
TraceUI can generate ad creatives for any public website, raising potential misuse vectors around brand impersonation. Mitigations include requiring authenticated Google accounts, rate-limiting generation to practical creative workflows, and terms of service prohibiting misuse. Future work should investigate automated detection of brand impersonation in generated outputs.
TraceUI demonstrates that live website crawls combined with structured visual-signal extraction and a layered multimodal prompt architecture can produce brand-grounded ad creatives at quality levels suitable for production use, without per-brand model fine-tuning. The "pixels-win" brand fidelity principle (treating observed visual evidence as normative over text descriptions) is the key design decision that differentiates TraceUI from prior text-prompt-based creative-generation frameworks. We expect Visual-RAG to find application in other design domains: architecture moodboards, product styling, fashion lookbooks, and game asset generation where art direction is defined visually.