Retrieval PoC - Document Generation | Project Northern Lights

Overview

The Retrieval PoC enables analysts to generate complete sustainability report sections through a conversational interface. Using a declarative template language, the system dynamically populates content by combining LLM-generated narratives with live data from Workiva factbooks.

Template-Driven: LaTeX templates with embedded tags for dynamic content injection
Workiva-Connected: Direct cell references hydrate templates with factbook data
Copilot Interface: Natural language commands via Microsoft 365 Copilot chat

Template Language Specification

The template language uses a tag-based syntax embedded in LaTeX documents. Tags are resolved at generation time, with content either retrieved from Workiva or generated by the LLM based on specification files.

Tag Syntax Reference

Tag Pattern	Type	Description
`{@content:section-id}`	LLM Generated	Generates content using the spec file at `/specs/section-id.yaml`
`{@content:section-id.subsection}`	LLM Generated	Nested content generation with hierarchical spec lookup
`{$workiva:sheet_name:cell_ref}`	Data Reference	Retrieves live value from Workiva spreadsheet cell
`{$workiva:sheet_name:range}`	Data Reference	Retrieves a range of cells (e.g., `A1:D10`)
`{#if condition}...{/if}`	Conditional	Conditionally include content based on Workiva values
`{#each range as item}...{/each}`	Iterator	Iterate over Workiva rows to generate repeated sections
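
As a rough illustration of how the inline tags in the table could be located, the following minimal sketch scans a template with a regular expression. The names `Tag`, `TAG_RE`, and `find_tags` are hypothetical and not taken from the PoC code; block tags (`{#if}`, `{#each}`) are deliberately left out.

```python
import re
from dataclasses import dataclass

# Hypothetical token type; field names are illustrative only.
@dataclass(frozen=True)
class Tag:
    kind: str   # "content" or "workiva"
    body: str   # text after the first colon, e.g. "env-intro" or "Factbook:B2"
    start: int  # offset of the opening brace in the template
    end: int    # offset just past the closing brace

# Matches {@content:...} and {$workiva:...}; tag bodies may not contain braces.
TAG_RE = re.compile(r"\{(?P<sigil>[@$])(?P<kind>content|workiva):(?P<body>[^{}]+)\}")

def find_tags(template: str) -> list[Tag]:
    """Return inline tags in document order, skipping mismatched sigils."""
    tags = []
    for m in TAG_RE.finditer(template):
        expected = "@" if m["kind"] == "content" else "$"
        if m["sigil"] != expected:
            continue  # e.g. {$content:...} is not a valid tag
        tags.append(Tag(m["kind"], m["body"], m.start(), m.end()))
    return tags
```

Because the pattern requires the sigil immediately after the opening brace, ordinary LaTeX braces such as `\subsection{...}` are ignored and only the inner tag is reported.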

Example: Environmental Section Template

templates/environmental-section.tex

% VSME Environmental Section - Generated Template
\section{Environmental Disclosures}

% Pull company name from Workiva factbook
\subsection{About {$workiva:Factbook:B2}}

% LLM-generated introduction based on spec file
{@content:env-intro}

% Climate transition plan status from factbook
\subsection{Climate Transition Status}
{#if {$workiva:C3:G4} == "TRUE"}
    The undertaking operates in high-impact sectors and has adopted
    the following transition measures:

    {@content:transition-plan.adopted}
{/if}

{#if {$workiva:C3:G4} == "FALSE"}
    {@content:transition-plan.not-applicable}
{/if}

% KPI display with live data
\begin{kpicard}
    \kpivalue{{$workiva:Metrics:C12}}
    \kpilabel{GHG Emissions Reduction}
    \kpichange{{$workiva:Metrics:D12} YoY}
\end{kpicard}

% Pollutant emissions table - iterate over data rows
\subsection{Pollutant Emissions}
{#each {$workiva:B4:B5:B23} as row}
    \emissionrow{{row.pollutant}}{{row.air}}{{row.water}}{{row.soil}}
{/each}
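
The conditional and iterator blocks in the template above could be expanded along the following lines. This is a sketch under stated assumptions: blocks are not nested, conditions only use the `== "VALUE"` form shown in the example, and cell values have already been fetched into a plain dictionary. `IF_RE`, `EACH_RE`, `render_blocks`, and the `cells` layout are hypothetical names, not part of the PoC.

```python
import re

# {#if {$workiva:REF} == "VALUE"} ... {/if}   (non-nested, equality only)
IF_RE = re.compile(r'\{#if \{\$workiva:([^{}]+)\} == "([^"]*)"\}(.*?)\{/if\}', re.DOTALL)
# {#each {$workiva:REF} as NAME} ... {/each}  (non-nested)
EACH_RE = re.compile(r"\{#each \{\$workiva:([^{}]+)\} as (\w+)\}(.*?)\{/each\}", re.DOTALL)

def render_blocks(template: str, cells: dict) -> str:
    """Expand {#if} and {#each} blocks using pre-fetched cell values.

    `cells` maps a reference such as "C3:G4" to a value, or to a list of
    row dictionaries for references used by {#each}.
    """
    def render_if(m):
        ref, expected, body = m.groups()
        # Keep the body only when the cell's string value equals the literal.
        return body if str(cells.get(ref)) == expected else ""

    def render_each(m):
        ref, name, body = m.groups()
        # {row.field} placeholders; the outer LaTeX braces are left untouched.
        field_re = re.compile(r"\{" + re.escape(name) + r"\.(\w+)\}")
        return "".join(
            field_re.sub(lambda f: str(row[f.group(1)]), body)
            for row in cells.get(ref, [])
        )

    return EACH_RE.sub(render_each, IF_RE.sub(render_if, template))
```

With `cells = {"C3:G4": "TRUE"}`, the first `{#if}` block in the template is kept and the `"FALSE"` branch is dropped. A production parser would likely need a real grammar to support nesting and richer conditions.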

Content Specification System

Each {@content:id} tag is backed by a YAML specification file that defines the generation parameters. Specs include style guidelines, VSME framework references, tone, and context.

specs/env-intro.yaml

id: env-intro
type: narrative
vsme_reference: B1-B7
paragraph_guidance: "29-56"

style:
  tone: professional, formal
  voice: third-person
  length: 150-250 words
  brand_guidelines: ./brand/kpmg-style.yaml

context:
  # Quoted so YAML reads each tag as a string rather than a flow mapping
  company_name: "{$workiva:Factbook:B2}"
  reporting_year: "{$workiva:Factbook:B5}"
  sector: "{$workiva:Factbook:B8}"

prompt: |
  Write an introductory paragraph for the Environmental section of a
  VSME sustainability report. The paragraph should:
  - Introduce the company's environmental commitment
  - Reference the reporting period and framework (EU VSME)
  - Set the context for detailed disclosures that follow
  - Maintain consistency with prior year reports if available

cache:
  enabled: true
  invalidate_on:
    - workiva:Factbook:B2  # Company name change
    - workiva:Factbook:B5  # Reporting year change

Specification Directory Structure

Project Structure

specs/
├── env-intro.yaml                 # Environmental introduction
├── transition-plan/
│   ├── adopted.yaml              # For companies with plans
│   └── not-applicable.yaml       # For non-applicable cases
├── emissions/
│   ├── scope1.yaml
│   ├── scope2.yaml
│   └── scope3.yaml
├── pollutants/
│   └── summary.yaml
└── brand/
    └── kpmg-style.yaml           # Shared style guidelines

templates/
├── environmental-section.tex
├── social-section.tex
└── governance-section.tex

cache/
└── content_cache.json           # Generated content cache
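
Given the directory layout above, the hierarchical spec lookup for dotted content IDs can be sketched as a simple path mapping. The function name `spec_path` and the relative `specs` root are assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Assumed root; the layout above shows specs/ at the project top level.
SPECS_ROOT = PurePosixPath("specs")

def spec_path(content_id: str, root: PurePosixPath = SPECS_ROOT) -> PurePosixPath:
    """Map a content ID to its spec file; dots become directory separators."""
    parts = content_id.split(".")
    if not all(parts):
        raise ValueError(f"empty segment in content id: {content_id!r}")
    return root.joinpath(*parts).with_suffix(".yaml")
```

For example, `spec_path("transition-plan.adopted")` yields `specs/transition-plan/adopted.yaml`, and `spec_path("env-intro")` yields `specs/env-intro.yaml`.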

Workiva Data Integration

The system reads directly from Workiva factbooks using the existing API integration. Cell references take the form sheet:cell or sheet:start:end, and sheet names such as C3 correspond to VSME framework codes.

Sample Workiva Factbook Structure (C3 - Climate)

Code	Sheet	Field	Value
C3-1-4	C3	High impact sector?	TRUE
C3-1-5	C3	Transition plan status	Not adopted...
C3-1-6	C3	Plan description	Freight transport...
C3-1-7	C3	Adoption date (Year)	2025

Cell Reference Syntax

Reference	Returns	Example
`{$workiva:C3:G4}`	Single cell value	`"TRUE"`
`{$workiva:C3:G4:G10}`	Range (array)	`["TRUE", "Not adopted...", ...]`
`{$workiva:B4:B5:B23}`	Row iterator	Pollutant data rows for `{#each}`
`{$workiva:Factbook:company_name}`	Named field	Uses mapper.yaml field definitions
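
The reference forms in the table could be parsed into a small structure as sketched below. `CellRef` and `parse_ref` are hypothetical names, and the rule that anything not shaped like `G4` is treated as a named field is an assumption. The table does not say how a plain range is distinguished from a row iterator, so both parse to the same shape here.

```python
import re
from dataclasses import dataclass
from typing import Optional

# A1-style cell address: column letters followed by a row number.
CELL_RE = re.compile(r"^[A-Z]+[0-9]+$")

@dataclass(frozen=True)
class CellRef:
    sheet: str
    start: Optional[str] = None  # e.g. "G4"
    end: Optional[str] = None    # set only for ranges and row iterators
    field: Optional[str] = None  # named field, resolved via mapper.yaml

def parse_ref(tag: str) -> CellRef:
    """Parse a full {$workiva:...} tag into its components."""
    prefix = "{$workiva:"
    if not (tag.startswith(prefix) and tag.endswith("}")):
        raise ValueError(f"not a workiva tag: {tag!r}")
    parts = tag[len(prefix):-1].split(":")
    if len(parts) == 2:
        sheet, target = parts
        if CELL_RE.match(target):
            return CellRef(sheet, start=target)
        return CellRef(sheet, field=target)  # assumed to be a named field
    if len(parts) == 3 and CELL_RE.match(parts[1]) and CELL_RE.match(parts[2]):
        return CellRef(parts[0], start=parts[1], end=parts[2])
    raise ValueError(f"unsupported reference: {tag!r}")
```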

Document Generation Flow

1. User Request: Analyst requests section via Copilot
2. Template Load: System loads LaTeX template
3. Tag Resolution: Parse and identify all tags
4. Data Fetch: Retrieve Workiva cell values
5. Content Gen: LLM generates narrative
6. Compile: Build PDF output
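
Steps 2 through 6 of the flow could be wired together roughly as follows. This is only a sketch: every collaborator (`load_template`, `fetch_cell`, `generate_text`, `compile_pdf`) is a hypothetical callable passed in by the caller, and block tags such as `{#if}` are assumed to have been expanded separately.

```python
import re
from typing import Callable

# Inline tags only: {$workiva:REF} and {@content:ID}.
TAG_RE = re.compile(r"\{(\$workiva|@content):([^{}]+)\}")

def generate_section(
    template_name: str,
    load_template: Callable[[str], str],
    fetch_cell: Callable[[str], str],
    generate_text: Callable[[str], str],
    compile_pdf: Callable[[str], bytes],
) -> bytes:
    source = load_template(template_name)               # 2. Template Load
    tags = set(TAG_RE.findall(source))                  # 3. Tag Resolution
    data = {ref: str(fetch_cell(ref))                   # 4. Data Fetch
            for kind, ref in tags if kind == "$workiva"}
    text = {cid: generate_text(cid)                     # 5. Content Gen
            for kind, cid in tags if kind == "@content"}

    def substitute(m):
        kind, key = m.groups()
        return data[key] if kind == "$workiva" else text[key]

    return compile_pdf(TAG_RE.sub(substitute, source))  # 6. Compile
```

Collecting all tags before fetching means each cell and each content block is resolved once per section, even if a tag appears several times in the template.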

Caching Strategy

Content Cache: Generated text is cached by content ID. The first generation is stored, and subsequent requests return the cached version.
Invalidation Rules: Cache entries are invalidated when the Workiva cells they depend on change (listed under invalidate_on in each spec file).
Force Regeneration: Users can request a fresh generation via Copilot, for example "Regenerate the environmental intro".
Version History: All generated content versions are preserved for audit trail and comparison.
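
The caching rules above could be modelled with a small in-memory structure like the one below. This is a sketch, not the ContentManager implementation: `CacheEntry`, `ContentCache`, and the idea of storing watched cell values alongside each entry are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CacheEntry:
    versions: list[str] = field(default_factory=list)      # oldest first, never pruned
    watched: dict[str, str] = field(default_factory=dict)  # cell ref -> value at generation time

class ContentCache:
    """Hypothetical in-memory cache keyed by content ID."""

    def __init__(self) -> None:
        self._entries: dict[str, CacheEntry] = {}

    def get(self, content_id: str, current_cells: dict[str, str]) -> Optional[str]:
        """Return the latest version, or None if missing or invalidated."""
        entry = self._entries.get(content_id)
        if entry is None or not entry.versions:
            return None
        # Invalidate when any watched cell differs from its value at generation time.
        for ref, old_value in entry.watched.items():
            if current_cells.get(ref) != old_value:
                return None
        return entry.versions[-1]

    def put(self, content_id: str, text: str, watched_cells: dict[str, str]) -> None:
        """Store a new version and remember the cell values it was built from."""
        entry = self._entries.setdefault(content_id, CacheEntry())
        entry.versions.append(text)
        entry.watched = dict(watched_cells)

    def history(self, content_id: str) -> list[str]:
        """All versions ever stored for this content ID, oldest first."""
        entry = self._entries.get(content_id)
        return list(entry.versions) if entry else []
```

In this sketch, a forced regeneration simply skips `get` and calls `put` with the new text; the earlier version stays in `history` for audit and comparison.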

Microsoft 365 Copilot Interface

The primary interface for document generation is the Microsoft 365 Copilot chat. Analysts use natural language commands to generate, edit, and refine report sections.

Example Copilot Interactions

Copilot Chat Session

Analyst: Generate the environmental section for the 2024 report

Copilot: I'll generate the Environmental section using the current Workiva
         factbook data. Let me fetch the latest values...

         Reading from Workiva:
         - Company: Acme Industries Ltd
         - Reporting Period: Jan 1 - Dec 31, 2024
         - High Impact Sector: Yes
         - Transition Plan Status: Adopted with targets

         Generating content...

         Here's your draft Environmental section. Key highlights:
         - Introduction paragraph (150 words)
         - Climate transition status with targets
         - 3 KPI cards with live metrics
         - Pollutant emissions table (12 rows)

         [Preview PDF] [Edit in Word] [Approve & Save]

---

Analyst: The tone in the introduction is too informal. Make it more formal.

Copilot: I'll regenerate the introduction with a more formal tone while
         keeping the same factual content.

         Updated introduction:
         "Acme Industries Ltd hereby presents its environmental disclosures
         in accordance with the EU Voluntary SME Sustainability Reporting
         Standard (VSME) for the fiscal year ended December 31, 2024..."

         [Accept Changes] [Try Another Version]

---

Analyst: Add a section about our water usage from the B4 data

Copilot: I'll add a water usage subsection. Let me check what data is
         available in the B4 (Pollutants) section of your factbook...

         Found water-related data:
         - B4-6 through B4-12: Water pollutant emissions
         - Metrics available for Nitrogen, Phosphorus, BOD

         [Generate Water Section] [Show Available Data]

Supported Commands

Command Pattern	Action
`"Generate [section] for [year]"`	Creates full section from template with current Workiva data
`"Show me the data for [framework code]"`	Displays relevant Workiva factbook values
`"Regenerate [content block]"`	Forces fresh LLM generation, ignoring cache
`"Make it more [tone]"`	Adjusts style/tone of most recent generated content
`"Add [topic] from [data source]"`	Incorporates additional Workiva data into section
`"Preview as PDF"`	Compiles current draft to PDF for review
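
In practice Copilot would be expected to handle intent recognition, but the command patterns in the table can also be expressed as a deterministic matcher, which may be useful for testing. The pattern list, the action names, and the `route` function below are all hypothetical.

```python
import re

# (pattern, action name) pairs mirroring the command table; names are illustrative.
COMMANDS = [
    (re.compile(r"^generate (?:the )?(?P<section>.+?) for (?:the )?(?P<year>\d{4})(?: report)?$", re.I),
     "generate_section"),
    (re.compile(r"^show me the data for (?P<code>.+)$", re.I), "show_data"),
    (re.compile(r"^regenerate (?:the )?(?P<block>.+)$", re.I), "regenerate"),
    (re.compile(r"^make it more (?P<tone>.+)$", re.I), "adjust_tone"),
    (re.compile(r"^add (?P<topic>.+?) from (?P<source>.+)$", re.I), "add_topic"),
    (re.compile(r"^preview as pdf$", re.I), "preview_pdf"),
]

def route(message: str):
    """Return (action, parameters) for the first matching pattern, else None."""
    text = message.strip().rstrip(".")
    for pattern, action in COMMANDS:
        m = pattern.match(text)
        if m:
            return action, m.groupdict()
    return None
```

For instance, `route("Regenerate the environmental intro")` returns `("regenerate", {"block": "environmental intro"})`. Free-form requests that do not fit a pattern return `None` and would fall through to the language model.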

System Architecture

User Layer: M365 Copilot, Word/PDF Preview
Processing: Template Engine, Tag Resolver, Content Cache, LaTeX Compiler
Generation: Azure OpenAI, Spec Loader, Style Enforcer
Data Layer: Workiva API, Factbook Reader, Cell Mapper

Technical Reference

Implementation Base

The PoC builds on the existing document generation prototype at experiments/test_doc_gen.py. Key components to extend:

ContentManager: Extend cache to support invalidation rules and version history
Generator: Replace mock LLM with Azure OpenAI integration, add spec file loading
LatexBuilder: Add new tag syntax parsing ({@content:}, {$workiva:}, conditionals, iterators)
WorkivaReader: New component; integrate with existing WorkivaAgent for cell reads (see app/api/routes/demo.py)

API Endpoints Required

New API Routes

POST /api/generate/section
  Body: { template: "environmental", year: 2024 }
  Returns: { pdf_url, word_url, content_hash }

POST /api/generate/regenerate
  Body: { content_id: "env-intro", force: true }
  Returns: { new_content, previous_version }

GET  /api/workiva/cells/{sheet}/{range}
  Returns: { cells: [...], last_updated }

POST /api/preview/pdf
  Body: { template, data_overrides }
  Returns: { pdf_url, warnings }
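
A client call to the first route could be prepared as sketched below using only the standard library. The base URL, the JSON content type, and the helper name `section_request` are assumptions; the route and body fields come from the list above.

```python
import json
from urllib.request import Request

def section_request(base_url: str, template: str, year: int) -> Request:
    """Build (but do not send) a POST /api/generate/section request."""
    body = json.dumps({"template": template, "year": year}).encode("utf-8")
    return Request(
        f"{base_url.rstrip('/')}/api/generate/section",
        data=body,
        headers={"Content-Type": "application/json"},  # assumed; not stated in the route list
        method="POST",
    )
```

The returned object can be passed to `urllib.request.urlopen` once a server is available; building the request separately keeps the payload easy to inspect in tests.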
