teas                                                            Q. Xiong
Internet-Draft                                           ZTE Corporation
Intended status: Standards Track                             K. Kompella
Expires: 31 December 2026                                            HPE
                                                                 D. King
                                                    Lancaster University
                                                            29 June 2026


                  HPC/AI Scheduler Job Metadata Model
              draft-xkk-teas-hpc-scheduler-job-metadata-00

Abstract

   This document defines a scheduler-facing metadata model for High
   Performance Computing (HPC) and AI workloads. The model captures
   common job, workload, scheduler, tenant, timing, and task metadata
   that can be mapped from heterogeneous workload managers and
   orchestration platforms and used as context for network service
   intent.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 31 December 2026.

Copyright Notice

   Copyright (c) 2026 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction
   2.  Conventions Used in This Document
     2.1.  Requirements Language
   3.  Terminology
   4.  Model Scope
   5.  Model Structure
   6.  Relationship to Other Models
   7.  YANG Data Model
   8.  Security Considerations
   9.  IANA Considerations
   10. Acknowledgements
   11. References
     11.1.  Normative References
     11.2.  Informative References
   Appendix A.  Example
   Authors' Addresses

1.  Introduction

   HPC and AI workflows are commonly managed by workload managers and
   orchestration systems such as batch schedulers, Kubernetes-based
   training systems, workflow engines, and higher-level AI platforms.
   These systems maintain metadata about jobs, tasks, users, tenants,
   timing, resource requests, and workload structure.
   Examples of such systems include HPC workload managers such as Slurm,
   PBS Pro/OpenPBS, IBM Spectrum LSF, and Grid Engine-style schedulers,
   as well as AI and machine learning orchestration platforms based on
   Kubernetes, Kubeflow, Ray, Volcano, Kueue, Red Hat OpenShift AI,
   NVIDIA Base Command Manager, and NVIDIA Run:ai. These examples are
   illustrative; the model is intended to be independent of any specific
   scheduler or orchestration platform.

   The requirements reflected in this model are derived from the types
   of information commonly exposed by such workload schedulers and AI
   orchestration platforms, including workload identity, job structure,
   task or role information, timing, placement context, tenant or
   project context, and correlation identifiers. The intent is to carry
   the network-relevant subset of this information without requiring the
   network domain to adopt the native data model of any one scheduler.

   The representation of this metadata is platform-specific. For
   example, an HPC scheduler may identify jobs using scheduler-local job
   identifiers and queues, while a Kubernetes-based AI platform may use
   namespaces, custom resources, pod sets, and workload admission
   objects. A common metadata model allows the network-relevant
   portions of these platform-specific job descriptions to be
   represented in a consistent form.

   The broader HP-WAN context and current deployment considerations are
   described in [I-D.kcrh-hpwan-state-of-art] and
   [I-D.xhy-hpwan-framework]. This document focuses on the scheduler
   and job metadata needed to relate workload context to that network
   environment.

   Related work on machine learning cluster scheduling, including
   [I-D.kompella-rtgwg-mlnwsched], illustrates that job timing,
   placement, and resource context can be relevant beyond the compute
   scheduler itself.
   This document provides a platform-neutral way to carry scheduler and
   job metadata that can be used for correlation with network service
   intent.

   This document defines a YANG model for scheduler and job metadata.
   It does not define the requested network service itself and does not
   define how that service is realized in the network. The metadata
   defined here is intended to be used by a service intent model that
   expresses the desired connectivity outcome for the workload.

2.  Conventions Used in This Document

2.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

3.  Terminology

   This document defines common terminology used by the HPC/AI scheduler
   job metadata model, the HPC/AI service intent model, and the HPC/AI
   tunnel realization model.

   Workload:  A unit of work submitted to, or managed by, a workload
      manager or orchestration platform. A workload can be an HPC batch
      workload, an AI training workload, an inference workload, a data
      movement workflow, or another scheduled application-level
      activity.

   Job:  A scheduler-visible execution object associated with a
      workload. A job is identified by the originating scheduler or
      orchestration platform and can contain one or more tasks, roles,
      replicas, or execution units.

   Task:  A component of a job that represents a schedulable or
      executable part of the workload. Examples include an HPC task, an
      MPI rank group, a training worker, a parameter-server role, or a
      workflow stage.

   Scheduler:  A workload manager or orchestration system that creates,
      admits, places, or manages workloads and jobs.
      Examples include HPC batch schedulers and Kubernetes-based AI
      orchestration systems.

   Scheduler Job Metadata:  Platform-neutral context describing the
      originating scheduler, submitter, workload, job, task structure,
      and timing information. Scheduler job metadata identifies and
      describes the workload but does not request network connectivity.

   Service Intent:  A request for a network service associated with a
      workload or job. Service intent describes the desired
      connectivity outcome, including endpoints, communication pattern,
      timing, data movement, performance objectives, policy preferences,
      and admission state. It does not prescribe the network mechanism
      used to realize the service.

   Tunnel Realization:  The network-side realization of an admitted
      service intent. A tunnel realization can reference tunnels,
      paths, policy, protection, resource allocation, lifecycle state,
      and performance monitoring associated with the service intent.

   Correlation Identifier:  An identifier used to associate scheduler
      job metadata, service intent, and tunnel realization state across
      systems that may use different native identifiers.

4.  Model Scope

   The scheduler job metadata model provides workload context that can
   be consumed by a network service intent system. It includes
   identifiers and descriptive attributes that allow a network
   controller, orchestrator, or broker to correlate a network service
   request with the originating workload manager and job.

   The model is intended to be independent of a specific workload
   manager. Platform-specific identifiers are carried as metadata and
   do not imply that the network controller understands the internal
   scheduling behavior of the originating platform.

   This model is intended to provide a stable boundary between workload
   scheduling systems and IETF-defined interfaces used by data center
   and inter-data-center network orchestration systems.

5.  Model Structure

   module: ietf-hpc-scheduler-job-metadata
     +--rw hpc-scheduler-job-metadata
        +--rw scheduler
        |  +--rw scheduler-id?        string
        |  +--rw scheduler-name?      string
        |  +--rw scheduler-type?      identityref
        |  +--rw platform-instance?   string
        +--rw submitter
        |  +--rw tenant-id?    string
        |  +--rw project-id?   string
        |  +--rw namespace?    string
        |  +--rw user-id?      string
        |  +--rw account-id?   string
        +--rw workload
        |  +--rw workload-id?      string
        |  +--rw workload-name?    string
        |  +--rw workload-type?    identityref
        |  +--rw framework?        identityref
        |  +--rw priority?         priority-type
        |  +--rw queue?            string
        |  +--rw correlation-id?   string
        +--rw job
        |  +--rw job-id?         string
        |  +--rw job-name?       string
        |  +--rw job-array-id?   string
        |  +--rw job-size?       uint32
        |  +--rw task* [task-id]
        |     +--rw task-id       string
        |     +--rw task-name?    string
        |     +--rw task-role?    identityref
        |     +--rw task-index?   uint32
        +--rw timing
           +--rw submit-time?            yang:date-and-time
           +--rw earliest-start-time?    yang:date-and-time
           +--rw requested-start-time?   yang:date-and-time
           +--rw deadline?               yang:date-and-time
           +--rw requested-duration?     uint32
           +--rw duration-unit?          identityref

          Figure 1: Scheduler job metadata model structure

6.  Relationship to Other Models

   The naming relationship between these concepts is hierarchical.

   *  Scheduler job metadata in this document identifies and describes
      the workload.

   *  A service intent as per draft-xkk-teas-hpc-service-intent
      identifies the network service requested for that workload.

   *  A tunnel realization as per draft-xkk-teas-hpc-tunnel-realization
      identifies the network resources used to realize an admitted
      service intent.

   .----------------------------.
   |   Scheduler/Job Metadata   |
   | workload-id, job-id,       |
   | task-id, correlation-id    |
   '-------------+--------------'
                 |
                 | referenced by
                 v
   .-------------+--------------.
   |       Service Intent       |
   | intent-id, workload-ref,   |
   | endpoints, objectives      |
   '-------------+--------------'
                 |
                 | admitted and realized by
                 v
   .-------------+--------------.
   |     Tunnel Realization     |
   | realization-id, intent-ref,|
   | tunnel/path references     |
   '----------------------------'

                        Figure 2: Relationship

   A workload or job can have zero or more service intent instances. A
   service intent instance can have zero or more tunnel realization
   instances. A tunnel realization instance is associated with one
   service intent instance, although the underlying network service may
   use one or more tunnels, paths, or technology-specific constructs.

   The scheduler job metadata model provides context for a separate
   service intent request. A service intent instance can refer to the
   metadata instance using a workload identifier, job identifier, or
   correlation identifier. This separation allows multiple service
   intent requests to be associated with a single workload, and allows
   one service intent request to be updated or replaced without changing
   the scheduler metadata.

7.  YANG Data Model

   The YANG data model is as follows:

   module ietf-hpc-scheduler-job-metadata {
     yang-version 1.1;
     namespace
       "urn:ietf:params:xml:ns:yang:ietf-hpc-scheduler-job-metadata";
     prefix hpc-sched;

     import ietf-yang-types {
       prefix yang;
       reference
         "RFC 6991: Common YANG Data Types";
     }

     organization
       "IETF Traffic Engineering Architecture and Signaling (TEAS)
        Working Group";
     contact
       "WG Web:
        WG List:
        Editor: Quan Xiong
        Editor: Kireeti Kompella
        Editor: Daniel King";
     description
       "This module defines a scheduler-facing metadata model for
        High Performance Computing (HPC) and AI workloads. The model
        captures common job, workload, scheduler, tenant, timing, and
        task metadata that can be mapped from heterogeneous workload
        managers and orchestration platforms.
        Copyright (c) 2026 IETF Trust and the persons identified as
        authors of the code. All rights reserved.

        Redistribution and use in source and binary forms, with or
        without modification, is permitted pursuant to, and subject
        to the license terms contained in, the Revised BSD License
        set forth in Section 4.c of the IETF Trust's Legal Provisions
        Relating to IETF Documents
        (https://trustee.ietf.org/license-info).

        This version of this YANG module is part of RFC XXXX; see the
        RFC itself for full legal notices.";

     revision 2026-04-23 {
       description
         "Initial version of the HPC/AI scheduler job metadata model.";
       reference
         "RFC XXXX: HPC/AI Scheduler Job Metadata Model";
     }

     /*
      * Identity definitions
      */

     identity scheduler-type {
       description
         "Base identity for scheduler types.";
     }

     identity slurm {
       base scheduler-type;
       description
         "Slurm workload manager.";
     }

     identity pbs {
       base scheduler-type;
       description
         "PBS Pro/OpenPBS workload manager.";
     }

     identity lsf {
       base scheduler-type;
       description
         "IBM Spectrum LSF workload manager.";
     }

     identity kubernetes {
       base scheduler-type;
       description
         "Kubernetes-based orchestration platform.";
     }

     identity kubeflow {
       base scheduler-type;
       description
         "Kubeflow AI orchestration platform.";
     }

     identity workload-type {
       description
         "Base identity for workload types.";
     }

     identity hpc-batch {
       base workload-type;
       description
         "HPC batch workload.";
     }

     identity ai-training {
       base workload-type;
       description
         "AI training workload.";
     }

     identity ai-inference {
       base workload-type;
       description
         "AI inference workload.";
     }

     identity data-movement {
       base workload-type;
       description
         "Data movement workload.";
     }

     identity framework {
       description
         "Base identity for workload frameworks.";
     }

     identity mpi {
       base framework;
       description
         "Message Passing Interface (MPI) framework.";
     }

     identity tensorflow {
       base framework;
       description
         "TensorFlow machine learning framework.";
     }

     identity pytorch {
       base framework;
       description
         "PyTorch machine learning framework.";
     }

     identity task-role {
       description
         "Base identity for task roles.";
     }

     identity worker {
       base task-role;
       description
         "Worker role in distributed computation.";
     }

     identity parameter-server {
       base task-role;
       description
         "Parameter server role in distributed training.";
     }

     identity master {
       base task-role;
       description
         "Master/coordinator role.";
     }

     identity duration-unit {
       description
         "Base identity for duration units.";
     }

     identity seconds {
       base duration-unit;
       description
         "Duration in seconds.";
     }

     identity minutes {
       base duration-unit;
       description
         "Duration in minutes.";
     }
     identity hours {
       base duration-unit;
       description
         "Duration in hours.";
     }

     /*
      * Typedefs
      */

     typedef priority-type {
       type uint32 {
         range "0..1000";
       }
       description
         "Priority value type, with higher values indicating higher
          priority.";
     }

     /*
      * Groupings
      */

     grouping scheduler-grouping {
       description
         "Scheduler identification and metadata.";
       leaf scheduler-id {
         type string;
         description
           "Unique identifier for the scheduler instance.";
       }
       leaf scheduler-name {
         type string;
         description
           "Human-readable name of the scheduler.";
       }
       leaf scheduler-type {
         type identityref {
           base scheduler-type;
         }
         description
           "Type of scheduler or orchestration platform.";
       }
       leaf platform-instance {
         type string;
         description
           "Platform-specific instance identifier or version.";
       }
     }

     grouping submitter-grouping {
       description
         "Submitter and tenant context.";
       leaf tenant-id {
         type string;
         description
           "Tenant identifier for multi-tenant environments.";
       }
       leaf project-id {
         type string;
         description
           "Project identifier within the tenant.";
       }
       leaf namespace {
         type string;
         description
           "Namespace identifier (e.g., Kubernetes namespace).";
       }
       leaf user-id {
         type string;
         description
           "User identifier who submitted the workload.";
       }
       leaf account-id {
         type string;
         description
           "Accounting or billing account identifier.";
       }
     }

     grouping workload-grouping {
       description
         "Workload identification and metadata.";
       leaf workload-id {
         type string;
         description
           "Unique identifier for the workload.";
       }
       leaf workload-name {
         type string;
         description
           "Human-readable name of the workload.";
       }
       leaf workload-type {
         type identityref {
           base workload-type;
         }
         description
           "Type of workload.";
       }
       leaf framework {
         type identityref {
           base framework;
         }
         description
           "Computational framework used by the workload.";
       }
       leaf priority {
         type priority-type;
         description
           "Priority of the workload.";
       }
       leaf queue {
         type string;
         description
           "Queue or partition where the workload is submitted.";
       }
       leaf correlation-id {
         type string;
         description
           "Correlation identifier for cross-system tracing.";
       }
     }

     grouping task-grouping {
       description
         "Task-level metadata.";
       leaf task-id {
         type string;
         mandatory true;
         description
           "Unique identifier for the task within the job.";
       }
       leaf task-name {
         type string;
         description
           "Human-readable name of the task.";
       }
       leaf task-role {
         type identityref {
           base task-role;
         }
         description
           "Functional role of the task in the workload.";
       }
       leaf task-index {
         type uint32;
         description
           "Index or sequence number of the task.";
       }
     }

     grouping job-grouping {
       description
         "Job structure and task information.";
       leaf job-id {
         type string;
         description
           "Scheduler-specific job identifier.";
       }
       leaf job-name {
         type string;
         description
           "Human-readable job name.";
       }
       leaf job-array-id {
         type string;
         description
           "Job array identifier for array jobs.";
       }
       leaf job-size {
         type uint32;
         description
           "Total number of tasks or execution units in the job.";
       }
       list task {
         key "task-id";
         description
           "List of tasks comprising the job.";
         uses task-grouping;
       }
     }

     grouping timing-grouping {
       description
         "Timing and scheduling information.";
       leaf submit-time {
         type yang:date-and-time;
         description
           "Time when the workload was submitted to the scheduler.";
       }
       leaf earliest-start-time {
         type yang:date-and-time;
         description
           "Earliest time when the workload can start.";
       }
       leaf requested-start-time {
         type yang:date-and-time;
         description
           "Requested start time for the workload.";
       }
       leaf deadline {
         type yang:date-and-time;
         description
           "Deadline by which the workload should complete.";
       }
       leaf requested-duration {
         type uint32;
         description
           "Requested duration for the workload execution.";
       }
       leaf duration-unit {
         type identityref {
           base duration-unit;
         }
         description
           "Unit for the requested duration.";
       }
     }

     /*
      * Top-level container
      */

     container hpc-scheduler-job-metadata {
       description
         "Top-level container for HPC/AI scheduler job metadata.";
       container scheduler {
         description
           "Scheduler identification and metadata.";
         uses scheduler-grouping;
       }
       container submitter {
         description
           "Submitter and tenant context.";
         uses submitter-grouping;
       }
       container workload {
         description
           "Workload identification and metadata.";
         uses workload-grouping;
       }
       container job {
         description
           "Job structure and task information.";
         uses job-grouping;
       }
       container timing {
         description
           "Timing and scheduling information.";
         uses timing-grouping;
       }
     }
   }

8.  Security Considerations

   Scheduler and job metadata can reveal user, tenant, project,
   workload, timing, and operational information. Implementations need
   to protect the confidentiality and integrity of this information and
   restrict access to authorized workload managers, controllers,
   orchestrators, and network management systems.

9.  IANA Considerations

   IANA is requested to register one URI in the "IETF XML Registry"
   [RFC3688]. Following the format in [RFC3688], the following
   registration is requested:

      URI: urn:ietf:params:xml:ns:yang:ietf-hpc-scheduler-job-metadata
      Registrant Contact: The IESG.
      XML: N/A; the requested URI is an XML namespace.
   IANA is requested to register the following YANG module in the "YANG
   Module Names" registry [RFC6020].

      name: ietf-hpc-scheduler-job-metadata
      namespace:
         urn:ietf:params:xml:ns:yang:ietf-hpc-scheduler-job-metadata
      prefix: hpc-sched
      reference: RFC XXXX

10.  Acknowledgements

   The authors acknowledge the related HP-WAN framework and problem
   statement work that provides the broader context for this scheduler
   job metadata model.

11.  References

11.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

11.2.  Informative References

   [I-D.kcrh-hpwan-state-of-art]
              King, D., Chown, T., Rapier, C., Huang, D., and K. Yao,
              "Current State of the Art for High Performance Wide Area
              Networks", Work in Progress, Internet-Draft,
              draft-kcrh-hpwan-state-of-art-03, 20 October 2025.

   [I-D.kompella-rtgwg-mlnwsched]
              Kompella, K., Beeram, V. P., Mahale, A., Bhargava, R., and
              N. Geyer, "Scheduling Network Resources for Machine
              Learning Clusters", Work in Progress, Internet-Draft,
              draft-kompella-rtgwg-mlnwsched-02, 1 March 2026.

   [I-D.xhy-hpwan-framework]
              Xiong, Q., Huang, G., Yao, K., and C. Lin, "Framework for
              High Performance Wide Area Network (HP-WAN)", Work in
              Progress, Internet-Draft, draft-xhy-hpwan-framework-03,
              20 October 2025.

Appendix A.  Example

   This section provides an example of scheduler job metadata for a
   distributed AI training workload. The example demonstrates how
   platform-specific job information from a Kubernetes-based AI
   orchestration system is mapped to the common metadata model.
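
   The mapping step can be sketched in Python. This is an illustrative
   sketch only, not part of the model: the layout of the native record
   (keys such as "cluster", "owner", "uid", and "replicas") and the
   function name to_job_metadata are assumptions made for this example
   and do not correspond to any real scheduler API. Only the leaf names,
   the identity names, and the 0..1000 priority range are taken from the
   YANG module in Section 7.

```python
import json

MODULE = "ietf-hpc-scheduler-job-metadata"


def to_job_metadata(native):
    """Map a hypothetical Kubernetes-style job record to the common model.

    The keys read from `native` are illustrative assumptions, not a real
    platform API. Only leaves defined by the YANG module are emitted.
    """
    priority = native["priority"]
    if not 0 <= priority <= 1000:  # range of the priority-type typedef
        raise ValueError("priority outside 0..1000")

    # One list entry per replica. task-id is the list key, so it is made
    # unique by appending the replica position to the role name.
    tasks = [
        {
            "task-id": f"{replica['role']}-{index + 1}",
            "task-name": replica["pod"],
            "task-role": replica["role"],
            "task-index": index,
        }
        for index, replica in enumerate(native["replicas"])
    ]

    return {
        f"{MODULE}:hpc-scheduler-job-metadata": {
            "scheduler": {
                "scheduler-id": native["cluster"],
                "scheduler-type": "kubernetes",
            },
            "submitter": {
                "namespace": native["namespace"],
                "user-id": native["owner"],
            },
            "workload": {
                "workload-id": native["uid"],
                "workload-type": "ai-training",
                "framework": native["framework"],
                "priority": priority,
                "correlation-id": native["correlation"],
            },
            "job": {
                "job-id": native["name"],
                "job-size": len(tasks),
                "task": tasks,
            },
        }
    }


# Hypothetical native record; values mirror the scenario in this appendix.
native_job = {
    "cluster": "ai-orchestrator-1",
    "namespace": "ml-training",
    "owner": "researcher-bob",
    "uid": "distributed-training-001",
    "name": "job-2026-04-23-001",
    "framework": "pytorch",
    "priority": 100,
    "correlation": "corr-ai-training-001",
    "replicas": [
        {"role": "worker", "pod": "gpu-worker-west-1"},
        {"role": "worker", "pod": "gpu-worker-west-2"},
        {"role": "worker", "pod": "gpu-worker-east-1"},
    ],
}

metadata = to_job_metadata(native_job)
print(json.dumps(metadata, indent=2))
```

   The sketch keeps platform-specific identifiers as opaque strings and
   fills only the leaves for which the native record has a value, which
   is consistent with every leaf other than the task list key being
   optional in the model.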
   Consider a scenario where a user submits a distributed training job
   using Kubeflow on a Kubernetes cluster. The job runs three worker
   tasks, each represented by one entry in the task list.

   {
     "ietf-hpc-scheduler-job-metadata:hpc-scheduler-job-metadata": {
       "scheduler": {
         "scheduler-id": "ai-orchestrator-1",
         "scheduler-name": "AI-Training-Orchestrator",
         "scheduler-type": "kubernetes",
         "platform-instance": "nvidia-base-command-2.0"
       },
       "submitter": {
         "tenant-id": "ai-research-lab",
         "project-id": "distributed-ml-project",
         "namespace": "ml-training",
         "user-id": "researcher-bob",
         "account-id": "project-alpha"
       },
       "workload": {
         "workload-id": "distributed-training-001",
         "workload-name": "large-scale-llm-training",
         "workload-type": "ai-training",
         "framework": "pytorch",
         "priority": 100,
         "queue": "gpu-high-priority",
         "correlation-id": "corr-ai-training-001"
       },
       "job": {
         "job-id": "job-2026-04-23-001",
         "job-name": "llm-13b-distributed",
         "job-size": 3,
         "task": [
           {
             "task-id": "worker-1",
             "task-name": "gpu-worker-west-1",
             "task-role": "worker",
             "task-index": 0
           },
           {
             "task-id": "worker-2",
             "task-name": "gpu-worker-west-2",
             "task-role": "worker",
             "task-index": 1
           },
           {
             "task-id": "worker-3",
             "task-name": "gpu-worker-east-1",
             "task-role": "worker",
             "task-index": 2
           }
         ]
       },
       "timing": {
         "submit-time": "2026-04-23T09:00:00Z",
         "earliest-start-time": "2026-04-23T09:45:00Z",
         "requested-start-time": "2026-04-23T10:00:00Z",
         "deadline": "2026-04-23T12:00:00Z",
         "requested-duration": 120,
         "duration-unit": "minutes"
       }
     }
   }

Authors' Addresses

   Quan Xiong
   ZTE Corporation
   Email: xiong.quan@zte.com.cn


   Kireeti Kompella
   HPE
   Email: kireeti.ietf@gmail.com


   Daniel King
   Lancaster University
   Email: d.king@lancaster.ac.uk