teas Q. Xiong
Internet-Draft ZTE Corporation
Intended status: Standards Track K. Kompella
Expires: 31 December 2026 HPE
D. King
Lancaster University
29 June 2026
HPC/AI Scheduler Job Metadata Model
draft-xkk-teas-hpc-scheduler-job-metadata-00
Abstract
This document defines a scheduler-facing metadata model for High
Performance Computing (HPC) and AI workloads. The model captures
common job, workload, scheduler, tenant, timing, and task metadata
that can be mapped from heterogeneous workload managers and
orchestration platforms and used as context for network service
intent.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on 31 December 2026.
Copyright Notice
Copyright (c) 2026 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents (https://trustee.ietf.org/
license-info) in effect on the date of publication of this document.
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document. Code Components
Xiong, et al. Expires 31 December 2026 [Page 1]
Internet-Draft HPC/AI scheduler job metadata June 2026
extracted from this document must include Revised BSD License text as
described in Section 4.e of the Trust Legal Provisions and are
provided without warranty as described in the Revised BSD License.
Table of Contents
1. Introduction
2. Conventions Used in This Document
2.1. Requirements Language
3. Terminology
4. Model Scope
5. Model Structure
6. Relationship to Other Models
7. YANG Data Model
8. Security Considerations
9. IANA Considerations
10. Acknowledgements
11. References
11.1. Normative References
11.2. Informative References
Appendix A. Example
Authors' Addresses
1. Introduction
HPC and AI workflows are commonly managed by workload managers and
orchestration systems such as batch schedulers, Kubernetes-based
training systems, workflow engines, and higher-level AI platforms.
These systems maintain metadata about jobs, tasks, users, tenants,
timing, resource requests, and workload structure.
Examples of such systems include HPC workload managers such as Slurm,
PBS Pro/OpenPBS, IBM Spectrum LSF, and Grid Engine-style schedulers,
as well as AI and machine learning orchestration platforms based on
Kubernetes, Kubeflow, Ray, Volcano, Kueue, Red Hat OpenShift AI,
NVIDIA Base Command Manager, and NVIDIA Run:ai. These examples are
illustrative; the model is intended to be independent of any specific
scheduler or orchestration platform.
The requirements reflected in this model are derived from the types
of information commonly exposed by such workload schedulers and AI
orchestration platforms, including workload identity, job structure,
task or role information, timing, placement context, tenant or
project context, and correlation identifiers. The intent is to carry
the network-relevant subset of this information without requiring the
network domain to adopt the native data model of any one scheduler.
The representation of this metadata is platform-specific. For
example, an HPC scheduler may identify jobs using scheduler-local job
identifiers and queues, while a Kubernetes-based AI platform may use
namespaces, custom resources, pod sets, and workload admission
objects. A common metadata model allows the network-relevant
portions of these platform-specific job descriptions to be
represented in a consistent form.
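As a non-normative illustration of the paragraph above, the following Python sketch maps two platform-specific job descriptions onto the containers defined by this model. The input field names (JobId, Partition, metadata.namespace, and so on) and the function names are assumptions made for illustration; they are not defined by this document or taken from any scheduler's API.

```python
# Illustrative sketch only. The input field names (JobId, Partition,
# metadata.namespace, ...) are assumptions about what a scheduler adapter
# might receive; they are not defined by this document.


def from_hpc_batch(record: dict) -> dict:
    """Map a hypothetical HPC batch scheduler record to the common model."""
    return {
        "scheduler": {"scheduler-type": "slurm"},
        "submitter": {
            "user-id": record["UserId"],
            "account-id": record["Account"],
        },
        "workload": {
            "workload-type": "hpc-batch",
            "queue": record["Partition"],
        },
        # Scheduler-local job identifiers are carried as opaque strings.
        "job": {"job-id": str(record["JobId"]), "job-name": record["JobName"]},
    }


def from_kubernetes(obj: dict) -> dict:
    """Map a hypothetical Kubernetes-style object to the common model."""
    meta = obj["metadata"]
    return {
        "scheduler": {"scheduler-type": "kubernetes"},
        "submitter": {"namespace": meta["namespace"]},
        "workload": {"workload-type": "ai-training"},
        "job": {"job-id": meta["uid"], "job-name": meta["name"]},
    }


hpc = from_hpc_batch(
    {"JobId": 4711, "JobName": "cfd-run", "UserId": "alice",
     "Account": "proj-a", "Partition": "gpu"}
)
k8s = from_kubernetes(
    {"metadata": {"namespace": "ml-training", "name": "llm-train",
                  "uid": "c0ffee"}}
)

# Both platforms yield the same top-level containers.
print(sorted(hpc) == sorted(k8s))
```

The two adapters populate different leaves (queue and account-id in one case, namespace in the other), but a consumer only needs to understand the common container layout.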
The broader High Performance Wide Area Network (HP-WAN) context and
current deployment considerations are described in
[I-D.kcrh-hpwan-state-of-art] and [I-D.xhy-hpwan-framework]. This
document focuses on the scheduler and job metadata needed to relate
workload context to that network environment.
Related work on machine learning cluster scheduling, including
[I-D.kompella-rtgwg-mlnwsched], illustrates that job timing,
placement, and resource context can be relevant beyond the compute
scheduler itself. This document provides a platform-neutral way to
carry scheduler and job metadata that can be used for correlation
with network service intent.
This document defines a YANG model for scheduler and job metadata.
It does not define the requested network service itself and does not
define how that service is realized in the network. The metadata
defined here is intended to be used by a service intent model that
expresses the desired connectivity outcome for the workload.
2. Conventions Used in This Document
2.1. Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in BCP
14 [RFC2119] [RFC8174] when, and only when, they appear in all
capitals, as shown here.
3. Terminology
This document defines common terminology used by the HPC/AI scheduler
job metadata model, the HPC/AI service intent model, and the HPC/AI
tunnel realization model.
Workload: A unit of work submitted to, or managed by, a workload
manager or orchestration platform. A workload can be an HPC batch
workload, an AI training workload, an inference workload, a data
movement workflow, or another scheduled application-level
activity.
Job: A scheduler-visible execution object associated with a
workload. A job is identified by the originating scheduler or
orchestration platform and can contain one or more tasks, roles,
replicas, or execution units.
Task: A component of a job that represents a schedulable or
executable part of the workload. Examples include an HPC task, an
MPI rank group, a training worker, a parameter-server role, or a
workflow stage.
Scheduler: A workload manager or orchestration system that creates,
admits, places, or manages workloads and jobs. Examples include
HPC batch schedulers and Kubernetes-based AI orchestration
systems.
Scheduler Job Metadata: Platform-neutral context describing the
originating scheduler, submitter, workload, job, task structure,
and timing information. Scheduler job metadata identifies and
describes the workload but does not request network connectivity.
Service Intent: A request for a network service associated with a
workload or job. Service intent describes the desired
connectivity outcome, including endpoints, communication pattern,
timing, data movement, performance objectives, policy preferences,
and admission state. It does not prescribe the network mechanism
used to realize the service.
Tunnel Realization: The network-side realization of an admitted
service intent. A tunnel realization can reference tunnels,
paths, policy, protection, resource allocation, lifecycle state,
and performance monitoring associated with the service intent.
Correlation Identifier: An identifier used to associate scheduler
job metadata, service intent, and tunnel realization state across
systems that may use different native identifiers.
4. Model Scope
The scheduler job metadata model provides workload context that can
be consumed by a network service intent system. It includes
identifiers and descriptive attributes that allow a network
controller, orchestrator, or broker to correlate a network service
request with the originating workload manager and job.
The model is intended to be independent of a specific workload
manager. Platform-specific identifiers are carried as metadata and
do not imply that the network controller understands the internal
scheduling behavior of the originating platform.
This model is intended to provide a stable boundary between workload
scheduling systems and IETF-defined interfaces used by data center
and inter-data-center network orchestration systems.
5. Model Structure
module: ietf-hpc-scheduler-job-metadata
  +--rw hpc-scheduler-job-metadata
     +--rw scheduler
     |  +--rw scheduler-id?        string
     |  +--rw scheduler-name?      string
     |  +--rw scheduler-type?      identityref
     |  +--rw platform-instance?   string
     +--rw submitter
     |  +--rw tenant-id?    string
     |  +--rw project-id?   string
     |  +--rw namespace?    string
     |  +--rw user-id?      string
     |  +--rw account-id?   string
     +--rw workload
     |  +--rw workload-id?      string
     |  +--rw workload-name?    string
     |  +--rw workload-type?    identityref
     |  +--rw framework?        identityref
     |  +--rw priority?         priority-type
     |  +--rw queue?            string
     |  +--rw correlation-id?   string
     +--rw job
     |  +--rw job-id?         string
     |  +--rw job-name?       string
     |  +--rw job-array-id?   string
     |  +--rw job-size?       uint32
     |  +--rw task* [task-id]
     |     +--rw task-id       string
     |     +--rw task-name?    string
     |     +--rw task-role?    identityref
     |     +--rw task-index?   uint32
     +--rw timing
        +--rw submit-time?            yang:date-and-time
        +--rw earliest-start-time?    yang:date-and-time
        +--rw requested-start-time?   yang:date-and-time
        +--rw deadline?               yang:date-and-time
        +--rw requested-duration?     uint32
        +--rw duration-unit?          identityref
Figure 1: Scheduler job metadata model structure
6. Relationship to Other Models
The naming relationship between these concepts is hierarchical.
* Scheduler job metadata, defined in this document, identifies and
describes the workload.
* A service intent, as defined in draft-xkk-teas-hpc-service-intent,
identifies the network service requested for that workload.
* A tunnel realization, as defined in
draft-xkk-teas-hpc-tunnel-realization, identifies the network
resources used to realize an admitted service intent.
.----------------------------.
| Scheduler/Job Metadata |
| workload-id, job-id, |
| task-id, correlation-id |
'-------------+--------------'
|
| referenced by
v
.-------------+--------------.
| Service Intent |
| intent-id, workload-ref, |
| endpoints, objectives |
'-------------+--------------'
|
| admitted and realized by
v
.-------------+--------------.
| Tunnel Realization |
| realization-id, intent-ref,|
| tunnel/path references |
'----------------------------'
Figure 2: Relationship between scheduler job metadata, service
intent, and tunnel realization
A workload or job can have zero or more service intent instances. A
service intent instance can have zero or more tunnel realization
instances. A tunnel realization instance is associated with one
service intent instance, although the underlying network service may
use one or more tunnels, paths, or technology-specific constructs.
The scheduler job metadata model provides context for a separate
service intent request. A service intent instance can refer to the
metadata instance using a workload identifier, job identifier, or
correlation identifier. This separation allows multiple service
intent requests to be associated with a single workload, and allows
one service intent request to be updated or replaced without changing
the scheduler metadata.
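The reference relationship described above can be sketched as a simple lookup, shown here in Python as a non-normative illustration. The MetadataIndex class and the shape of the intent records are hypothetical; the service intent model itself is defined in a separate document and is only approximated here by an identifier and a reference.

```python
from typing import Dict, Optional, Tuple

# Leaves in this model that a service intent may use as a reference.
REFERENCE_LEAVES = (
    ("workload", "workload-id"),
    ("job", "job-id"),
    ("workload", "correlation-id"),
)


class MetadataIndex:
    """Index metadata instances by workload-id, job-id and correlation-id."""

    def __init__(self) -> None:
        self._by_key: Dict[Tuple[str, str], dict] = {}

    def add(self, metadata: dict) -> None:
        for container, leaf in REFERENCE_LEAVES:
            value = metadata.get(container, {}).get(leaf)
            if value is not None:
                self._by_key[(leaf, value)] = metadata

    def resolve(self, leaf: str, value: str) -> Optional[dict]:
        return self._by_key.get((leaf, value))


index = MetadataIndex()
metadata = {
    "workload": {"workload-id": "wl-1", "correlation-id": "corr-1"},
    "job": {"job-id": "job-1"},
}
index.add(metadata)

# Two hypothetical intents refer to the same workload in different ways.
intents = [
    {"intent-id": "intent-a", "ref": ("workload-id", "wl-1")},
    {"intent-id": "intent-b", "ref": ("correlation-id", "corr-1")},
]
resolved = [index.resolve(*intent["ref"]) for intent in intents]
```

Because the intents only hold references, either intent can be replaced or withdrawn without touching the stored metadata instance.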
7. YANG Data Model
The YANG data model is as follows:
module ietf-hpc-scheduler-job-metadata {
yang-version 1.1;
namespace "urn:ietf:params:xml:ns:yang:ietf-hpc-scheduler-job-metadata";
prefix hpc-sched;
import ietf-yang-types {
prefix yang;
reference
"RFC 6991: Common YANG Data Types";
}
organization
"IETF Traffic Engineering Architecture and Signaling (TEAS)
Working Group";
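contact
"WG Web: <https://datatracker.ietf.org/wg/teas/>
WG List: <mailto:teas@ietf.org>
Editor: Quan Xiong
<mailto:xiong.quan@zte.com.cn>
Editor: Kireeti Kompella
<mailto:kireeti.ietf@gmail.com>
Editor: Daniel King
<mailto:d.king@lancaster.ac.uk>";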
description
"This module defines a scheduler-facing metadata model for
High Performance Computing (HPC) and AI workloads. The model
captures common job, workload, scheduler, tenant, timing, and
task metadata that can be mapped from heterogeneous workload
managers and orchestration platforms.
Copyright (c) 2026 IETF Trust and the persons identified as
authors of the code. All rights reserved.
Redistribution and use in source and binary forms, with or
without modification, is permitted pursuant to, and subject
to the license terms contained in, the Revised BSD License
set forth in Section 4.c of the IETF Trust's Legal Provisions
Relating to IETF Documents
(https://trustee.ietf.org/license-info).
This version of this YANG module is part of RFC XXXX; see
the RFC itself for full legal notices.";
revision 2026-04-23 {
description
"Initial version of the HPC/AI scheduler job metadata model.";
reference
"RFC XXXX: HPC/AI Scheduler Job Metadata Model";
}
/*
* Identity definitions
*/
identity scheduler-type {
description
"Base identity for scheduler types.";
}
identity slurm {
base scheduler-type;
description
"Slurm workload manager.";
}
identity pbs {
base scheduler-type;
description
"PBS Pro/OpenPBS workload manager.";
}
identity lsf {
base scheduler-type;
description
"IBM Spectrum LSF workload manager.";
}
identity kubernetes {
base scheduler-type;
description
"Kubernetes-based orchestration platform.";
}
identity kubeflow {
base scheduler-type;
description
"Kubeflow AI orchestration platform.";
}
identity workload-type {
description
"Base identity for workload types.";
}
identity hpc-batch {
base workload-type;
description
"HPC batch workload.";
}
identity ai-training {
base workload-type;
description
"AI training workload.";
}
identity ai-inference {
base workload-type;
description
"AI inference workload.";
}
identity data-movement {
base workload-type;
description
"Data movement workload.";
}
identity framework {
description
"Base identity for workload frameworks.";
}
identity mpi {
base framework;
description
"Message Passing Interface (MPI) framework.";
}
identity tensorflow {
base framework;
description
"TensorFlow machine learning framework.";
}
identity pytorch {
base framework;
description
"PyTorch machine learning framework.";
}
identity task-role {
description
"Base identity for task roles.";
}
identity worker {
base task-role;
description
"Worker role in distributed computation.";
}
identity parameter-server {
base task-role;
description
"Parameter server role in distributed training.";
}
identity master {
base task-role;
description
"Master/coordinator role.";
}
identity duration-unit {
description
"Base identity for duration units.";
}
identity seconds {
base duration-unit;
description
"Duration in seconds.";
}
identity minutes {
base duration-unit;
description
"Duration in minutes.";
}
identity hours {
base duration-unit;
description
"Duration in hours.";
}
/*
* Typedefs
*/
typedef priority-type {
type uint32 {
range "0..1000";
}
description
"Priority value type, with higher values indicating higher priority.";
}
/*
* Groupings
*/
grouping scheduler-grouping {
description
"Scheduler identification and metadata.";
leaf scheduler-id {
type string;
description
"Unique identifier for the scheduler instance.";
}
leaf scheduler-name {
type string;
description
"Human-readable name of the scheduler.";
}
leaf scheduler-type {
type identityref {
base scheduler-type;
}
description
"Type of scheduler or orchestration platform.";
}
leaf platform-instance {
type string;
description
"Platform-specific instance identifier or version.";
}
}
grouping submitter-grouping {
description
"Submitter and tenant context.";
leaf tenant-id {
type string;
description
"Tenant identifier for multi-tenant environments.";
}
leaf project-id {
type string;
description
"Project identifier within the tenant.";
}
leaf namespace {
type string;
description
"Namespace identifier (e.g., Kubernetes namespace).";
}
leaf user-id {
type string;
description
"User identifier who submitted the workload.";
}
leaf account-id {
type string;
description
"Accounting or billing account identifier.";
}
}
grouping workload-grouping {
description
"Workload identification and metadata.";
leaf workload-id {
type string;
description
"Unique identifier for the workload.";
}
leaf workload-name {
type string;
description
"Human-readable name of the workload.";
}
leaf workload-type {
type identityref {
base workload-type;
}
description
"Type of workload.";
}
leaf framework {
type identityref {
base framework;
}
description
"Computational framework used by the workload.";
}
leaf priority {
type priority-type;
description
"Priority of the workload.";
}
leaf queue {
type string;
description
"Queue or partition where the workload is submitted.";
}
leaf correlation-id {
type string;
description
"Correlation identifier for cross-system tracing.";
}
}
grouping task-grouping {
description
"Task-level metadata.";
leaf task-id {
type string;
mandatory true;
description
"Unique identifier for the task within the job.";
}
leaf task-name {
type string;
description
"Human-readable name of the task.";
}
leaf task-role {
type identityref {
base task-role;
}
description
"Functional role of the task in the workload.";
}
leaf task-index {
type uint32;
description
"Index or sequence number of the task.";
}
}
grouping job-grouping {
description
"Job structure and task information.";
leaf job-id {
type string;
description
"Scheduler-specific job identifier.";
}
leaf job-name {
type string;
description
"Human-readable job name.";
}
leaf job-array-id {
type string;
description
"Job array identifier for array jobs.";
}
leaf job-size {
type uint32;
description
"Total number of tasks or execution units in the job.";
}
list task {
key "task-id";
description
"List of tasks comprising the job.";
uses task-grouping;
}
}
grouping timing-grouping {
description
"Timing and scheduling information.";
leaf submit-time {
type yang:date-and-time;
description
"Time when the workload was submitted to the scheduler.";
}
leaf earliest-start-time {
type yang:date-and-time;
description
"Earliest time when the workload can start.";
}
leaf requested-start-time {
type yang:date-and-time;
description
"Requested start time for the workload.";
}
leaf deadline {
type yang:date-and-time;
description
"Deadline by which the workload should complete.";
}
leaf requested-duration {
type uint32;
description
"Requested duration for the workload execution, expressed
in the unit given by the duration-unit leaf.";
}
leaf duration-unit {
type identityref {
base duration-unit;
}
description
"Unit for the requested duration.";
}
}
/*
* Top-level container
*/
container hpc-scheduler-job-metadata {
description
"Top-level container for HPC/AI scheduler job metadata.";
container scheduler {
description
"Scheduler identification and metadata.";
uses scheduler-grouping;
}
container submitter {
description
"Submitter and tenant context.";
uses submitter-grouping;
}
container workload {
description
"Workload identification and metadata.";
uses workload-grouping;
}
container job {
description
"Job structure and task information.";
uses job-grouping;
}
container timing {
description
"Timing and scheduling information.";
uses timing-grouping;
}
}
}
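The constraints expressed by the module (the 0..1000 priority range, the identity bases, and the uniqueness of the task list key) can be illustrated with a small Python check over JSON-decoded instance data. This is a hedged sketch, not a YANG validator: the function names are hypothetical, only identities defined in this module are recognized (identities derived in other modules would be reported as unknown), and a real deployment would be expected to use a YANG-aware toolchain instead.

```python
from typing import List

MODULE = "ietf-hpc-scheduler-job-metadata"

# Identities defined in this module, keyed by (container, leaf).
IDENTITIES = {
    ("scheduler", "scheduler-type"):
        {"slurm", "pbs", "lsf", "kubernetes", "kubeflow"},
    ("workload", "workload-type"):
        {"hpc-batch", "ai-training", "ai-inference", "data-movement"},
    ("workload", "framework"): {"mpi", "tensorflow", "pytorch"},
    ("timing", "duration-unit"): {"seconds", "minutes", "hours"},
}
TASK_ROLES = {"worker", "parameter-server", "master"}


def local_name(identity: str) -> str:
    """Strip this module's own prefix from a JSON identityref value."""
    prefix = MODULE + ":"
    return identity[len(prefix):] if identity.startswith(prefix) else identity


def check_metadata(md: dict) -> List[str]:
    """Return a list of problems found; an empty list means none found."""
    problems = []
    for (container, leaf), allowed in IDENTITIES.items():
        value = md.get(container, {}).get(leaf)
        if value is not None and local_name(value) not in allowed:
            problems.append(f"{container}/{leaf}: unknown identity {value!r}")
    priority = md.get("workload", {}).get("priority")
    if priority is not None and not 0 <= priority <= 1000:
        problems.append(f"workload/priority: {priority} outside 0..1000")
    seen = set()
    for task in md.get("job", {}).get("task", []):
        task_id = task.get("task-id")
        if task_id is None:
            problems.append("job/task: missing task-id key")
        elif task_id in seen:
            problems.append(f"job/task: duplicate task-id {task_id!r}")
        seen.add(task_id)
        role = task.get("task-role")
        if role is not None and local_name(role) not in TASK_ROLES:
            problems.append(f"job/task[{task_id}]: unknown role {role!r}")
    return problems


good = {"workload": {"priority": 100, "framework": "pytorch"},
        "job": {"task": [{"task-id": "w1", "task-role": "worker"}]}}
bad = {"workload": {"priority": 5000},
       "job": {"task": [{"task-id": "w1"}, {"task-id": "w1"}]}}
```

In this sketch, check_metadata(good) returns an empty list, while check_metadata(bad) reports the out-of-range priority and the duplicate task-id.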
8. Security Considerations
Scheduler and job metadata can reveal user, tenant, project,
workload, timing, and operational information. Implementations need
to protect the confidentiality and integrity of this information and
restrict access to authorized workload managers, controllers,
orchestrators, and network management systems.
9. IANA Considerations
IANA is requested to register one URI in the "IETF XML Registry"
[RFC3688]. Following the format in [RFC3688], the following
registration is requested:
URI: urn:ietf:params:xml:ns:yang:ietf-hpc-scheduler-job-metadata
Registrant Contact: The IESG.
XML: N/A; the requested URI is an XML namespace.
IANA is requested to register the following YANG module in the "YANG
Module Names" registry [RFC6020].
name: ietf-hpc-scheduler-job-metadata
namespace: urn:ietf:params:xml:ns:yang:ietf-hpc-scheduler-job-metadata
prefix: hpc-sched
reference: RFC XXXX
10. Acknowledgements
The authors acknowledge the related HP-WAN framework and problem
statement work that provides the broader context for this scheduler
job metadata model.
11. References
11.1. Normative References
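[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997.
[RFC3688] Mealling, M., "The IETF XML Registry", BCP 81, RFC 3688,
DOI 10.17487/RFC3688, January 2004.
[RFC6020] Bjorklund, M., Ed., "YANG - A Data Modeling Language for
the Network Configuration Protocol (NETCONF)", RFC 6020,
DOI 10.17487/RFC6020, October 2010.
[RFC6991] Schoenwaelder, J., Ed., "Common YANG Data Types",
RFC 6991, DOI 10.17487/RFC6991, July 2013.
[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
May 2017.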
11.2. Informative References
[I-D.kcrh-hpwan-state-of-art]
King, D., Chown, T., Rapier, C., Huang, D., and K. Yao,
"Current State of the Art for High Performance Wide Area
Networks", Work in Progress, Internet-Draft,
draft-kcrh-hpwan-state-of-art-03, 20 October 2025.
[I-D.kompella-rtgwg-mlnwsched]
Kompella, K., Beeram, V. P., Mahale, A., Bhargava, R., and
N. Geyer, "Scheduling Network Resources for Machine
Learning Clusters", Work in Progress, Internet-Draft,
draft-kompella-rtgwg-mlnwsched-02, 1 March 2026.
[I-D.xhy-hpwan-framework]
Xiong, Q., Huang, G., Yao, K., and C. Lin, "Framework for
High Performance Wide Area Network (HP-WAN)", Work in
Progress, Internet-Draft, draft-xhy-hpwan-framework-03,
20 October 2025.
Appendix A. Example
This section provides an example of scheduler job metadata for a
distributed AI training workload. The example demonstrates how
platform-specific job information from a Kubernetes-based AI
orchestration system is mapped to the common metadata model.
Consider a scenario where a user submits a distributed training job
using Kubeflow on a Kubernetes cluster. In this example the job
consists of three worker tasks; no parameter-server tasks are shown.
{
"ietf-hpc-scheduler-job-metadata:hpc-scheduler-job-metadata": {
"scheduler": {
"scheduler-id": "ai-orchestrator-1",
"scheduler-name": "AI-Training-Orchestrator",
"scheduler-type": "kubernetes",
"platform-instance": "nvidia-base-command-2.0"
},
"submitter": {
"tenant-id": "ai-research-lab",
"project-id": "distributed-ml-project",
"namespace": "ml-training",
"user-id": "researcher-bob",
"account-id": "project-alpha"
},
"workload": {
"workload-id": "distributed-training-001",
"workload-name": "large-scale-llm-training",
"workload-type": "ai-training",
"framework": "pytorch",
"priority": 100,
"queue": "gpu-high-priority",
"correlation-id": "corr-ai-training-001"
},
"job": {
"job-id": "job-2026-04-23-001",
"job-name": "llm-13b-distributed",
"job-size": 3,
"task": [
{
"task-id": "worker-1",
"task-name": "gpu-worker-west-1",
"task-role": "worker",
"task-index": 0
},
{
"task-id": "worker-2",
"task-name": "gpu-worker-west-2",
"task-role": "worker",
"task-index": 1
},
{
"task-id": "worker-3",
"task-name": "gpu-worker-east-1",
"task-role": "worker",
"task-index": 2
}
]
},
"timing": {
"submit-time": "2026-04-23T09:00:00Z",
"earliest-start-time": "2026-04-23T09:45:00Z",
"requested-start-time": "2026-04-23T10:00:00Z",
"deadline": "2026-04-23T12:00:00Z",
"requested-duration": 120,
"duration-unit": "minutes"
}
}
}
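The timing leaves in the example are mutually consistent: the requested start time plus the requested duration does not exceed the deadline. The following Python sketch shows one way a consumer might check this. It is illustrative only: the function names are hypothetical, the check itself is not required by the model, and the timestamp parsing handles only the "Z" and numeric-offset forms used in the example rather than the full yang:date-and-time syntax.

```python
from datetime import datetime, timedelta

# Seconds per duration-unit identity defined in the module.
UNIT_SECONDS = {"seconds": 1, "minutes": 60, "hours": 3600}


def parse_time(value: str) -> datetime:
    # Older Python versions do not accept a trailing "Z" in
    # datetime.fromisoformat, so normalize it to an explicit offset.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def fits_before_deadline(timing: dict) -> bool:
    """True if requested start plus requested duration meets the deadline."""
    start = parse_time(timing["requested-start-time"])
    deadline = parse_time(timing["deadline"])
    seconds = timing["requested-duration"] * UNIT_SECONDS[timing["duration-unit"]]
    return start + timedelta(seconds=seconds) <= deadline


# Timing values copied from the example above.
example_timing = {
    "submit-time": "2026-04-23T09:00:00Z",
    "earliest-start-time": "2026-04-23T09:45:00Z",
    "requested-start-time": "2026-04-23T10:00:00Z",
    "deadline": "2026-04-23T12:00:00Z",
    "requested-duration": 120,
    "duration-unit": "minutes",
}
print(fits_before_deadline(example_timing))
```

With the example values, a 120-minute run starting at 10:00 ends exactly at the 12:00 deadline, so the check succeeds; a requested duration of 121 minutes would fail it.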
Authors' Addresses
Quan Xiong
ZTE Corporation
Email: xiong.quan@zte.com.cn
Kireeti Kompella
HPE
Email: kireeti.ietf@gmail.com
Daniel King
Lancaster University
Email: d.king@lancaster.ac.uk