idr                                                         Z. Ruan, Ed.
Internet-Draft                                                    M. Han
Intended status: Standards Track                            China Unicom
Expires: 1 January 2027                                     30 June 2026


             BGP Extension for AI Compute Service Metadata
             draft-ruan-idr-ai-compute-service-metadata-00

Abstract

   This document defines a new optional transitive BGP Path Attribute
   named AI Compute Service Metadata, which carries three categories of
   dedicated Sub-TLVs for generative AI inference over inter-domain BGP
   routes.  The existing CATS framework and generic BGP compute metric
   drafts lack inference-specific metadata covering inference SLA,
   token billing cost, and KV cache prefix index.  Re-computing context
   without matched cache prefixes raises GPU overhead and user billing
   expense.

   The attribute is a top-level Path Attribute that encapsulates three
   independent Sub-TLVs to advertise inference SLA metrics, billing
   parameters, and cache prefix indexes.  All metadata carried within
   this Path Attribute is propagated across multi-domain routing
   fabrics, supporting two core use cases: matching compute instances
   based on latency and cost requirements, and dispatching tasks to
   nodes with matched reusable KV cache to reduce resource consumption.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 1 January 2027.

Ruan & Han               Expires 1 January 2027                 [Page 1]

Internet-Draft  BGP Extension for AI Compute Service Met       June 2026

Copyright Notice

   Copyright (c) 2026 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Revised BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction
     1.1.  Existing CATS and BGP Metadata Work
     1.2.  AI Inference-Specific Metadata Requirements
     1.3.  Scope of This Document
   2.  Terminology and Definitions
   3.  Overview of Design
   4.  AI Compute Service Metadata Encoding
     4.1.  AI Compute Metadata Path Attribute
       4.1.1.  AI Metadata Path Attribute Characteristics
       4.1.2.  Propagation and Attribute Level Processing
     4.2.  The Inference-SLA Sub-TLV (Dynamic Metadata)
     4.3.  The Inference-Billing Sub-TLV (Static Metadata)
     4.4.  The KV-Cache-Prefix Sub-TLV (Dynamic Metadata)
   5.  Core Use Cases
     5.1.  User Demand Oriented Instance Matching
     5.2.  Cache-Aware Model-Side Task Dispatch
   6.  Security Considerations
   7.  IANA Considerations
     7.1.  AI Compute Service Metadata Path Attribute
     7.2.  AI Compute Service Metadata Path Attribute Sub-Types
     7.3.  Capability Code Allocation
   8.  Normative References
   Authors' Addresses

1.  Introduction

1.1.  Existing CATS and BGP Metadata Work

   The IETF CATS Working Group defines a general compute-aware steering
   architecture in [I-D.ietf-cats-framework], while related IDR drafts
   define BGP extensions for distributing generic compute metrics.
   These efforts provide a basis for distributing compute-node state
   through BGP, but they do not define metadata specific to
   autoregressive Transformer inference workloads.

1.2.  AI Inference-Specific Metadata Requirements

   Autoregressive Transformer inference tasks, including LLM,
   multimodal vision-language, speech synthesis, and machine
   translation, commonly use KV caches to avoid redundant context
   computation.  A Prefix Index identifies an immutable shared context
   segment.  A cache miss triggers full prompt recomputation, which
   increases GPU load, TTFT, and token billing cost.  Existing generic
   BGP metrics do not provide the inference SLA indicators,
   differentiated billing tiers, or cache prefix identifiers needed for
   fine-grained task distribution.

1.3.  Scope of This Document

   This document extends the CATS/IDR BGP metric system with a
   standalone optional transitive BGP Path Attribute and three
   Sub-TLVs: inference SLA attributes, inference billing attributes,
   and KV cache prefix attributes.  The inference metadata is
   propagated through standard IBGP and EBGP route advertisement.  The
   metadata supports instance selection based on user latency and cost
   requirements, as well as cache-aware task dispatch based on
   advertised KV cache prefix indexes.

   This specification follows the IDR design of isolating service
   metadata in a dedicated top-level Path Attribute, thereby separating
   AI inference metadata from generic edge-compute metric namespaces.

2.  Terminology and Definitions

   *  Prefix Caching: An optimization widely adopted by autoregressive
      Transformer inference tasks to reuse precomputed KV tensors for
      identical fixed context segments, reducing redundant computation,
      hardware overhead, and inference latency.

   *  Prefix Index: Opaque lookup identifier for immutable shared token
      context segments, used to match reusable KV cache entries across
      distributed inference instances.  A Prefix Index may be
      implemented via cryptographic hash, vendor-assigned numeric ID,
      or composite version encoding.

   *  Fixed-Prefix-Key: A subtype of Prefix Index that uniquely
      identifies the immutable system prompt token sequence bound to a
      specific model and service function.

   *  TTFT (Time To First Token): Average wall-clock latency from
      request arrival to delivery of the first generated output token,
      measured in milliseconds.

   *  TPOT (Time Per Output Token): Average incremental latency between
      consecutive generated output tokens after the first token,
      measured in milliseconds.

3.  Overview of Design

   This document defines a single new optional transitive top-level BGP
   Path Attribute, named AI Compute Service Metadata.  The attribute
   contains the following three Sub-TLVs:

   *  Inference-SLA Sub-TLV: Carries real-time latency, throughput, and
      queue metrics of inference instances.

   *  Inference-Billing Sub-TLV: Carries token pricing parameters for
      cache-hit and cache-miss requests.

   *  KV-Cache-Prefix Sub-TLV: Carries Prefix Index identifiers for
      shared fixed context segments to support cache-aware steering.

   BGP speakers propagate the complete attribute across IBGP and EBGP
   sessions to distribute inference-resource state throughout the
   inter-domain fabric.
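
   As an illustration of how the three Sub-TLVs could share one
   attribute payload, the following minimal Python sketch packs the
   fields that Sections 4.2 through 4.4 list and walks the result while
   skipping unrecognized Sub-TLV types.  The Sub-Type values 0, 1, and
   2 come from Section 7.2.  The 1-octet type and 2-octet length
   framing, the choice of one tuple group per Sub-TLV, and all function
   names are assumptions made for this sketch; this document does not
   define them.

```python
import struct

# Sub-Type values taken from the table in Section 7.2.
SLA, BILLING, KV_PREFIX = 0, 1, 2


def sub_tlv(sub_type: int, value: bytes) -> bytes:
    # ASSUMPTION: 1-octet type + 2-octet length framing (not defined
    # by this document); all integers are in network byte order.
    return struct.pack("!BH", sub_type, len(value)) + value


def sla_value(model_id, function_id, avg_ttft_ms, avg_tpot_ms,
              current_tps, pending_queue_num) -> bytes:
    # Field order and widths follow Section 4.2 (16 octets total).
    return struct.pack("!HHHHII", model_id, function_id, avg_ttft_ms,
                       avg_tpot_ms, current_tps, pending_queue_num)


def billing_value(model_id, function_id, hit_price, miss_price,
                  price_unit) -> bytes:
    # Field order and widths follow Section 4.3 (13 octets total).
    return struct.pack("!HHIIB", model_id, function_id, hit_price,
                       miss_price, price_unit)


def kv_prefix_value(model_id, function_id, fixed_prefix_key: bytes) -> bytes:
    # Section 4.4 fields. ASSUMPTION: the key fills the rest of the
    # Sub-TLV, so its length is implied by the Sub-TLV length.
    return struct.pack("!HH", model_id, function_id) + fixed_prefix_key


def walk(payload: bytes):
    """Yield (sub_type, value) for every Sub-TLV in the payload."""
    offset = 0
    while offset < len(payload):
        if offset + 3 > len(payload):
            raise ValueError("truncated Sub-TLV header")
        sub_type, length = struct.unpack_from("!BH", payload, offset)
        offset += 3
        if offset + length > len(payload):
            raise ValueError("Sub-TLV length exceeds payload")
        yield sub_type, payload[offset:offset + length]
        offset += length


def known_sub_tlvs(payload: bytes):
    """Keep recognized Sub-TLVs; unknown types are skipped, not fatal."""
    return [(t, v) for t, v in walk(payload)
            if t in (SLA, BILLING, KV_PREFIX)]
```

   Calling known_sub_tlvs() on a payload that mixes the three defined
   types with an unrecognized type returns only the defined entries and
   leaves the rest of the payload usable, which mirrors the skip rule
   stated in this section.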

   Receiving BGP speakers parse the encapsulated Sub-TLV entries and
   use them for the two steering use cases described in this document.
   Unknown Sub-TLV types are skipped without discarding the parent
   attribute, in accordance with the Sub-TLV handling rules defined in
   [I-D.ietf-idr-5g-edge-service-metadata].

4.  AI Compute Service Metadata Encoding

4.1.  AI Compute Metadata Path Attribute

4.1.1.  AI Metadata Path Attribute Characteristics

   The AI Compute Service Metadata attribute is defined as an optional
   transitive BGP Path Attribute in accordance with [RFC4271].  BGP
   speakers that do not recognize this attribute code MUST retain the
   attribute value unmodified and, when they propagate the route to
   other peers, MUST pass the attribute along with the Partial bit in
   the Attribute Flags octet set to 1, as [RFC4271] requires for
   unrecognized optional transitive attributes.

   The attribute payload consists of a concatenated sequence of Sub-TLV
   structures.  Multiple instances of the same Sub-TLV type may appear
   in one attribute payload to advertise multiple model-function groups
   deployed on a single inference instance.

4.1.2.  Propagation and Attribute Level Processing

   When advertising routes through IBGP or EBGP sessions, BGP speakers
   SHALL propagate the complete AI Compute Metadata Path Attribute
   without modification, unless a local administrative policy
   explicitly filters the attribute.  Transit PE and BR routers forward
   BGP updates without parsing or modifying the encapsulated Sub-TLV
   content.

4.2.  The Inference-SLA Sub-TLV (Dynamic Metadata)

   This Sub-TLV carries real-time performance and queue metrics for
   model-function pairs to support latency-sensitive instance
   selection.  Within each repeated tuple group, the fields appear in
   the following transmission order:

   1.  Model-ID (uint16): Unique encoded identifier of the deployed
       inference model.

   2.  Function-ID (uint16): Enumerated identifier of the specific
       service function bound to the model instance.

   3.  Avg-TTFT (uint16): Rolling average Time-To-First-Token latency
       measured in milliseconds.

   4.  Avg-TPOT (uint16): Rolling average Time-Per-Output-Token
       incremental latency measured in milliseconds.

   5.  Current-TPS (uint32): Real-time token generation throughput
       expressed as tokens per second.

   6.  Pending-Queue-Num (uint32): Count of queued unprocessed
       inference tasks waiting for GPU accelerator resources.

4.3.  The Inference-Billing Sub-TLV (Static Metadata)

   This Sub-TLV defines token-consumption pricing tiers for each
   model-function pair.  It is updated only when commercial billing
   policies change.  The fields appear in the following order:

   1.  Model-ID (uint16): Unique encoded identifier of the deployed
       inference model.

   2.  Function-ID (uint16): Enumerated identifier of the specific
       service function bound to the model instance.

   3.  Hit-Price (uint32): Fixed-point unit cost per million output
       tokens for cache-hit inference requests with matched Prefix
       Index and available cache blocks.

   4.  Miss-Price (uint32): Fixed-point unit cost per million output
       tokens for full Prefill cache-miss inference requests without
       reusable cache.

   5.  Price-Unit (uint8): Enumerated settlement unit type for token
       billing, with value space managed by IANA.

4.4.  The KV-Cache-Prefix Sub-TLV (Dynamic Metadata)

   This Sub-TLV carries Prefix Index identifiers for immutable shared
   context segments bound to a model-function pair, enabling
   cache-aware task dispatch.  Within each repeated tuple group, the
   fields appear in the following order:

   1.  Model-ID (uint16): Unique encoded identifier of the deployed
       inference model.

   2.  Function-ID (uint16): Enumerated identifier of the specific
       service function bound to the model instance.

   3.  Fixed-Prefix-Key (octet string): Prefix Index for the service's
       fixed system prompt token sequence, used to match reusable KV
       cache entries across distributed compute nodes.

5.  Core Use Cases

   The three advertised Sub-TLV types support two steering use cases.
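
   Both use cases, detailed in Sections 5.1 and 5.2, reduce to
   filtering and ranking per-instance records.  The following minimal
   Python sketch shows one possible reading of those steps.  The
   Instance record, the MAX_QUEUE threshold, the interpretation of a
   "cache-eligible" node as one advertising the request's
   Fixed-Prefix-Key, and the queue-depth tie-break are assumptions made
   for this sketch; this document does not define them.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Metadata learned for one compute instance (hypothetical record)."""
    name: str
    model_id: int
    function_id: int
    avg_ttft_ms: int
    pending_queue: int
    hit_price: int
    miss_price: int
    prefix_keys: set = field(default_factory=set)


MAX_QUEUE = 100  # ASSUMPTION: cutoff for an "excessive" Pending-Queue-Num


def candidates(instances, model_id, function_id):
    # Step 1 of both use cases: filter on Model-ID and Function-ID.
    return [i for i in instances
            if (i.model_id, i.function_id) == (model_id, function_id)]


def match_by_demand(instances, model_id, function_id, priority,
                    prefix_key=None):
    """User demand oriented matching (Section 5.1)."""
    pool = candidates(instances, model_id, function_id)
    if priority == "cost":
        # ASSUMPTION: "cache-eligible" means the key is advertised.
        cached = [i for i in pool if prefix_key in i.prefix_keys]
        if cached:
            ranked = sorted(cached, key=lambda i: i.hit_price)
        else:
            ranked = sorted(pool, key=lambda i: i.miss_price)
    elif priority == "latency":
        usable = [i for i in pool if i.pending_queue <= MAX_QUEUE]
        ranked = sorted(usable, key=lambda i: i.avg_ttft_ms)
    else:
        raise ValueError("priority must be 'cost' or 'latency'")
    return ranked[0] if ranked else None


def dispatch_cache_aware(instances, model_id, function_id, prefix_key):
    """Cache-aware dispatch (Section 5.2): cache-matched nodes first."""
    pool = candidates(instances, model_id, function_id)
    # False sorts before True, so instances holding the key rank first.
    # ASSUMPTION: ties are broken by the shorter pending queue.
    ranked = sorted(pool, key=lambda i: (prefix_key not in i.prefix_keys,
                                         i.pending_queue))
    return ranked[0] if ranked else None
```

   Both functions return None when no advertised instance matches the
   requested model-function pair, leaving the fallback behavior to the
   caller.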
   In both cases, the matching logic correlates Model-ID and
   Function-ID to associate SLA, billing, and cache-prefix metadata
   with the same model-function service.

5.1.  User Demand Oriented Instance Matching

   When an end user submits an inference request with explicit latency
   and cost requirements, the BGP speaker filters advertised routes
   using the request's Model-ID and Function-ID.  It then uses metadata
   from the Inference-SLA and Inference-Billing Sub-TLVs to select
   suitable compute instances:

   1.  Filter routes matching the target Model-ID and Function-ID;

   2.  Sort candidate nodes according to user priority:

       *  Cost-prioritized users: Sort eligible instances by ascending
          Hit-Price, falling back to Miss-Price if no cache-eligible
          nodes exist;

       *  Latency-prioritized users: Sort eligible instances by
          ascending Avg-TTFT, filtered to exclude nodes with excessive
          Pending-Queue-Num;

   3.  Dispatch the user request to the highest-priority available
       compute instance.

5.2.  Cache-Aware Model-Side Task Dispatch

   After an inference request arrives at a local model service, the
   model orchestrator queries the BGP speaker's local metadata table to
   locate compute instances with matching KV cache resources and reduce
   redundant prefill computation:

   1.  Extract the Model-ID, Function-ID, and request-derived
       Fixed-Prefix-Key (a Prefix Index) from the incoming task;

   2.  Filter advertised routes matching the Model-ID and Function-ID,
       then match against Fixed-Prefix-Key entries inside the
       KV-Cache-Prefix Sub-TLV;

   3.  Prioritize compute instances carrying an identical Prefix Index
       to reuse precomputed KV cache segments, avoiding full prompt
       re-computation and reducing overall GPU resource consumption;

   4.  Forward the inference task to the highest-priority cache-matched
       remote compute node.

6.  Security Considerations

   1.  Confidential Data Isolation: BGP updates do not carry plaintext
       proprietary data, including raw system prompts, training
       documents, tokenizer vocabularies, or model weights.  Only
       anonymized Prefix Index identifiers and aggregated numeric
       telemetry are propagated across inter-domain routing planes.
       The Fixed-Prefix-Key is not intended to reveal the original
       prompt content without private vendor key material that is not
       exposed to the routing domain.

   2.  Backward Compatibility Safety: Legacy BGP speakers do not
       interpret an unknown AI Compute Service Metadata attribute; they
       pass it along with the route as an unrecognized optional
       transitive attribute [RFC4271], without affecting base IP
       forwarding.  Because unsupported devices do not process the
       attribute, it is not expected to introduce forwarding loops,
       blackholes, or route flapping on those devices.

   3.  Signaling Overload Mitigation: Threshold-based suppression of
       dynamic metadata updates reduces the likelihood that minor
       metric fluctuations will generate large numbers of BGP updates
       and destabilize inter-domain routing fabrics.

   4.  Attribute Integrity: Standard BGP session protection mechanisms,
       including TCP-AO and IPsec, can be applied to the sessions that
       carry the new Path Attribute to provide hop-by-hop integrity
       protection against tampering or spoofing between BGP peers.

   5.  Metadata Scope Restriction: AI compute metadata is attached only
       to dedicated AI service NLRI prefixes and is not associated with
       general Internet IPv4 or IPv6 NLRI, thereby limiting state-table
       expansion on transit routers.

7.  IANA Considerations

   This document requests three IANA allocations, described in
   Sections 7.1 through 7.3.

7.1.  AI Compute Service Metadata Path Attribute

   This document requests the following BGP Path Attribute allocation:

   +=======+============================================+=================+
   | Value | Description                                | Reference       |
   +=======+============================================+=================+
   | TBD   | AI Compute Service Metadata Path Attribute | [this document] |
   +-------+--------------------------------------------+-----------------+

7.2.  AI Compute Service Metadata Path Attribute Sub-Types

   +==========+===================+=================+
   | Sub-Type | Description       | Reference       |
   +==========+===================+=================+
   | 0        | Inference-SLA     | [this document] |
   +----------+-------------------+-----------------+
   | 1        | Inference-Billing | [this document] |
   +----------+-------------------+-----------------+
   | 2        | KV-Cache-Prefix   | [this document] |
   +----------+-------------------+-----------------+

7.3.  Capability Code Allocation

   This document requests that IANA allocate a new BGP Capability Code
   as follows:

   +======+========================================+
   | Code | Description                            |
   +======+========================================+
   | TBD  | AI Compute Service Metadata Capability |
   +------+----------------------------------------+

8.  Normative References

   [RFC4271]  Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A
              Border Gateway Protocol 4 (BGP-4)", RFC 4271,
              DOI 10.17487/RFC4271, January 2006.

   [I-D.ietf-cats-framework]
              Li, C., Du, Z., Boucadair, M., Contreras, L. M., and J.
              Drake, "A Framework for Computing-Aware Traffic Steering
              (CATS)", Work in Progress, Internet-Draft,
              draft-ietf-cats-framework-24, 2 April 2026.

   [I-D.ietf-idr-5g-edge-service-metadata]
              Dunbar, L., Majumdar, K., Li, C., Mishra, G. S., and Z.
              Du, "BGP Extension for 5G Edge Service Metadata", Work in
              Progress, Internet-Draft,
              draft-ietf-idr-5g-edge-service-metadata-33, 29 May 2026.

Authors' Addresses

   Zheng Ruan (editor)
   China Unicom
   Beijing
   China
   Email: ruanz6@chinaunicom.cn


   MengYao Han
   China Unicom
   Beijing
   China
   Email: hanmy12@chinaunicom.cn