<?xml version="1.0" encoding="US-ASCII"?>
<!-- edited with XMLSPY v5 rel. 3 U (http://www.xmlspy.com)
     by Daniel M Kohn (private) -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!ENTITY rfc2119 PUBLIC "" "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
]>
<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="3"?>
<?rfc tocindent="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<rfc category="std" docName="draft-yang-dmsc-distributed-model-01"
     ipr="trust200902">
  <front>
    <title abbrev="DSMC Architecture">Distributed AI model architecture for
    microservices communication and computing power scheduling</title>

    <author fullname="Hui Yang" initials="H" surname="Yang">
      <organization>Beijing University of Posts and
      Telecommunications</organization>

      <address>
        <postal>
          <street>10 Xitucheng Road, Haidian District</street>

          <city>Beijing</city>

          <code>100876</code>

          <region>Beijing</region>

          <country>China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>yanghui@bupt.edu.cn</email>
      </address>
    </author>

    <author fullname="Tiankuo Yu" initials="T" surname="Yu">
      <organization>Beijing University of Posts and
      Telecommunications</organization>

      <address>
        <postal>
          <street>10 Xitucheng Road, Haidian District</street>

          <city>Beijing</city>

          <code>100876</code>

          <region>Beijing</region>

          <country>China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>yutiankuo@bupt.edu.cn</email>
      </address>
    </author>

    <author fullname="Qiuyan Yao" initials="Q" surname="Yao">
      <organization>Beijing University of Posts and
      Telecommunications</organization>

      <address>
        <postal>
          <street>10 Xitucheng Road, Haidian District</street>

          <city>Beijing</city>

          <code>100876</code>

          <region>Beijing</region>

          <country>China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>yqy89716@bupt.edu.cn</email>
      </address>
    </author>

    <author fullname="Zepeng Zhang" initials="Z" surname="Zhang">
      <organization>Beijing University of Posts and
      Telecommunications</organization>

      <address>
        <postal>
          <street>10 Xitucheng Road, Haidian District</street>

          <city>Beijing</city>

          <code>100876</code>

          <region>Beijing</region>

          <country>China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>2024140574@bupt.cn</email>
      </address>
    </author>

    <date day="1" month="March" year="2025"/>

    <area>IETF Area</area>

    <workgroup>DSMC Working Group</workgroup>

    <keyword>distributed AI, service architecture</keyword>

    <abstract>
      <t>This document describes the distributed AI micromodel computing power
      scheduling service architecture.</t>
    </abstract>
  </front>

  <middle>
    <section anchor="intro" title="Introduction">
      <t>The Distributed AI Micromodel Computing Power Scheduling Service
      Architecture is a structured framework designed to address the
      challenges of scalability, flexibility, and efficiency in modern AI
      systems. By integrating model segmentation, micro-model deployment, and
      microservice orchestration, this architecture enables the effective
      allocation and management of computing resources across distributed
      environments. The primary focus lies in leveraging model segmentation to
      decompose large AI models into smaller, modular micro-models, which are
      executed collaboratively across distributed nodes.</t>

      <t>The architecture is organized into four tightly integrated layers,
      each with distinct roles and responsibilities that together ensure
      seamless functionality:</t>

      <t>Business Layer: This layer acts as the interface between the
      user-facing applications and the underlying system. It encapsulates AI
      capabilities as microservices, enabling modular deployment, elastic
      scaling, and independent version control. By routing user requests
      through service gateways, it ensures efficient interaction with back-end
      micro-models while balancing workloads. The business layer also
      facilitates collaboration between multiple micro-models, allowing them
      to function as part of a cohesive distributed system.</t>

      <t>Control Layer: The control layer is the central coordination hub,
      responsible for task scheduling, resource allocation, and the
      implementation of model segmentation strategies. It decomposes large AI
      models into smaller, manageable components, assigns tasks to specific
      nodes, and ensures synchronized execution across distributed
      environments. This layer dynamically balances compute and network
      resources while adapting to system demands, ensuring high efficiency for
      training and inference workflows.</t>

      <t>Computing Power Layer: As the execution core, this layer translates
      the decisions made by the control layer into distributed computation. It
      executes segmented micro-models on diverse hardware resources such as
      GPUs, CPUs, and accelerators, optimizing parallelism and fault
      tolerance. By coordinating with the control layer, it ensures that tasks
      are executed efficiently while leveraging distributed orchestration
      frameworks to handle diverse workloads.</t>

      <t>Data Layer: The data layer underpins the entire system by managing
      secure storage, access, and transmission of data. It provides the
      necessary datasets, intermediate results, and metadata required for
      executing segmented micro-models. Privacy protection mechanisms, such as
      federated learning and differential privacy, ensure data security and
      compliance, while distributed database operations guarantee consistent
      access and high availability across nodes.</t>

      <t>At the heart of this architecture is model segmentation, which serves
      as the foundation for effectively distributing computation and
      optimizing resource utilization. The control layer breaks down models
      into smaller micro-models using strategies such as layer-based,
      business-specific, or block-based segmentation. These micro-models are
      then deployed as independent services in the business layer, where they
      are dynamically scaled and orchestrated to meet real-time demands. The
      computing power layer executes these tasks using parallel processing
      techniques and advanced scheduling algorithms, while the data layer
      ensures secure and efficient data flow to support both training and
      inference tasks.</t>

      <t>By tightly integrating these layers, the architecture addresses
      critical challenges such as balancing compute and network resources,
      synchronizing distributed micro-models, and minimizing communication
      overhead. This cohesive design enables AI systems to achieve high
      performance, scalability, and flexibility across dynamic and
      resource-intensive workloads.</t>

      <t>This document outlines the design principles, key components, and
      operational advantages of the Distributed AI Micromodel Computing Power
      Scheduling Service Architecture, emphasizing how model segmentation,
      micro-models, and microservices form the foundation for scalable and
      efficient distributed AI systems.</t>
    </section>

    <section title="Conventions used in this document">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119">RFC 2119</xref>.</t>
    </section>

    <section title="Terminology">
      <t>TBD</t>
    </section>

    <section title="Scenarios and requirements">
      <section title="AI Microservice model scenario requirements">
        <t>As artificial intelligence technology evolves at an accelerating
        pace, the scale and intricacy of AI models continue to expand. The
        traditional monolithic application or centralized inference and
        training model is progressively becoming inadequate for swiftly
        changing business demands. Encapsulating AI capabilities within a
        microservices architecture confers substantial advantages in system
        flexibility, scalability, and service governance. By decoupling
        models through microservices, an independent AI model service can
        circumvent the bottlenecks that arise from deep coupling with other
        business logic components, and it can scale elastically during
        surges in requests or training load. Given the rapid iteration and
        upgrade cycles of AI models, a microservice architecture facilitates
        the coexistence of multiple model versions, enables gray-scale
        (canary) releases, and supports rapid rollbacks, thereby minimizing
        the impact on the overall system.</t>

        <t>The computing power requirements of AI microservice models are
        often extremely demanding. On the one hand, the training or
        inference process usually involves massive data processing and
        high-density parallel computing, requiring the collaborative work of
        various hardware resources such as GPUs, CPUs, FPGAs, and NPUs. On
        the other hand, if the model is large or the request volume is high,
        the computing power of a single machine is often insufficient to
        meet business needs; it is then necessary to perform parallel
        computing across multiple nodes in a distributed mode and to release
        resources during idle periods to improve utilization. Such
        distributed training or inference typically relies on efficient
        communication strategies to synchronize model parameters or
        gradients, and methods such as AllReduce or All-to-All are often
        used to reduce communication overhead and ensure model
        consistency.</t>
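
        <t>The following sketch is illustrative only and is not part of this
        specification: it shows how a worker's local gradient can be
        synchronized with AllReduce through the torch.distributed API. A
        single-process "gloo" group is used so that the example is
        self-contained; a real deployment would run one rank per node,
        typically over NCCL or MPI on a high-bandwidth network.</t>

        <figure>
          <artwork><![CDATA[
# Illustrative sketch: AllReduce-based gradient synchronization.
# A one-process "gloo" group is created so the snippet runs locally;
# in a real cluster each rank is a separate worker process.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

grad = torch.tensor([0.1, -0.2, 0.3])       # this worker's local gradient
dist.all_reduce(grad, op=dist.ReduceOp.SUM)  # in-place sum across workers
grad /= dist.get_world_size()                # average so all workers apply
                                             # the same update
print(grad)

dist.destroy_process_group()
]]></artwork>
        </figure>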

        <t>In a distributed system, the network plays a crucial role. A
        large number of model parameters and gradients must be exchanged
        frequently during computation, which places high demands on network
        bandwidth and latency. In large-scale cluster scenarios, the design
        of the network topology and the choice of communication framework
        cannot be ignored. Only in a high-bandwidth, low-latency network
        environment, combined with an appropriate communication library
        (such as NCCL or MPI), can the cluster fully exploit its computing
        potential and avoid communication becoming the bottleneck of overall
        performance.</t>
      </section>

      <section title="Distributed Micro model Service Flow">
        <t>In the distributed AI micro-model computing power scheduling
        service architecture, the core of the business process is how to
        realize the multi-node layout and collaborative work of the model to
        ensure efficient parameter synchronization and communication.
        Typically, a model is trained and evaluated using a deep learning
        framework during development, and then container-ized or mirrored to
        package the model and its dependencies into a service that can be
        deployed independently. Then, these encapsulated model services are
        registered to the system's microservice management platform for
        subsequent unified scheduling and access.</t>

        <t>Once the micro-model is deployed to a distributed cluster,
        computing power orchestration and resource scheduling allocate
        computing resources such as GPUs or CPUs according to real-time
        load, business priority, and hardware topology, and use container
        orchestration tools (such as Kubernetes) to start the corresponding
        service instances on each node. When distributed cooperation is
        needed, frameworks such as NCCL and Horovod are used to complete
        inter-process communication. Requests from upper business systems or
        users usually arrive at an API gateway or service gateway first and
        are then distributed to the target service instances according to
        load balancing or other routing policies. If distributed inference
        is needed, multiple nodes cooperate to perform segmented model
        inference and summarize the results, which are finally returned to
        the requester. In this process, real-time monitoring and an elastic
        scaling mechanism play an important role in ensuring system
        stability and optimizing resource utilization. At the monitoring
        level, through a unified data acquisition and analysis platform, the
        system can track core indicators such as GPU utilization, network
        traffic, and request latency of each service node, raise timely
        alarms in case of failures, performance bottlenecks, or insufficient
        resources, and perform automatic failover or take nodes
        offline.</t>

        <t>In addition, the distributed micro-model business flow needs to
        be combined with a data backflow mechanism. The large volume of
        logs, user feedback, and interaction records generated during
        inference can be further used to train new models or to optimize the
        performance of existing models, provided that it can be returned to
        the data platform in a way that meets privacy and compliance
        requirements.</t>
      </section>
    </section>

    <section title="Key issues and challenges">
      <section title=" Balancing Compute and Network Resources under Constraints">
        <t>With the continuous growth of AI model size and business demand,
        the computing power resources of a single node or single cluster are
        often difficult to support high-intensity training and inference
        tasks, and it is prone to the problem of insufficient computing power
        or sharp rise in cost. Through the distributed architecture to
        coordinate computing resources between multiple nodes and multiple
        regions, it can improve the overall efficiency and fault tolerance to
        a certain extent. However, distributed deployment also brings higher
        complexity, which not only considers the differences of heterogeneous
        hardware (such as GPU, CPU, FPGA, etc.), but also needs to balance the
        allocation of computing power under different network topology and
        bandwidth conditions.</t>

        <t>When computing and network resources are scarce, it is necessary
        to dynamically schedule and allocate computing power according to
        business priority, model scale, and real-time load, and to combine
        strategic queuing, elastic scaling, and cross-cluster resource
        collaboration to improve overall service efficiency. In this
        process, the model partitioning/parallelism scheme plays a key role.
        On the one hand, the model can be decomposed across multiple nodes
        by means of tensor partitioning or pipeline parallelism, with each
        node responsible only for a specific submodule or slice. On the
        other hand, for inference scenarios, the input data can flow through
        a series of model microservice nodes to form a pipelined processing
        mode, making full use of scattered computing resources. Splitting
        the model for parallel execution in this way not only avoids
        excessive computing pressure on a single server, but also maximizes
        the use of the GPU/CPU computing power of idle nodes when network
        resources permit, achieving balance and optimization between compute
        and network resources.</t>
      </section>

      <section title=" Data Collaboration Challenges under Block Isolation">
        <t>In many distributed systems, large-scale data is usually split into
        multiple data blocks, which are stored and processed separately.
        Although this improves data security and processing efficiency, it
        also brings challenges to data coordination. When multiple nodes or
        microservice modules need to share or exchange data, the interface and
        call sequence must be defined in advance, and the consistency and
        concurrency control level must be managed. Especially when different
        data blocks have cross-node dependencies, how to effectively schedule,
        load and distribute data has become one of the key bottlenecks of
        system scalability and computational efficiency.</t>

        <t>A key difficulty lies in synchronizing data across distributed
        nodes while minimizing latency and avoiding bottlenecks. Cross-node
        dependencies require precise scheduling to ensure data arrives at the
        correct location and time without conflicts. As the scale of data and
        the number of nodes grow, the management overhead for maintaining
        these dependencies can increase exponentially, particularly when
        network bandwidth or latency constraints exacerbate delays.
        Additionally, ensuring data consistency across multiple data blocks
        during concurrent access or updates adds another layer of complexity.
        High levels of concurrency can increase the risk of inconsistencies,
        data races, and synchronization issues, demanding advanced mechanisms
        to enforce data integrity.</t>

        <t>Traditional distributed communication strategies, such as AllReduce
        and All-to-All, are widely used and remain effective in addressing
        certain data collaboration needs in training and inference tasks. For
        example, AllReduce is well-suited for data parallel scenarios, where
        all nodes compute on the same model with different data splits, and
        gradients or weights are synchronized via aggregation and broadcast.
        Similarly, All-to-All is valuable in more complex distributed tasks
        that require frequent intermediate data exchanges across nodes.
        However, these methods are not without limitations. As data and system
        complexity grow, they can lead to increased communication overhead,
        especially in scenarios where synchronization is uneven or poorly
        timed.</t>

        <t>The effectiveness of these traditional methods relies on careful
        tuning and precise execution. Improper timing of data exchange can
        lead to long waiting times, underutilization of resources, and even
        data mismatches. Although approaches such as AllReduce and
        All-to-All provide reliable communication frameworks, their
        scalability and efficiency are often limited by challenges such as
        cross-node synchronization, network variation, and system
        heterogeneity. Continuous improvement and innovation in distributed
        communication and data collaboration strategies are therefore needed
        to overcome the challenges posed by block isolation.</t>
      </section>
    </section>

    <section title="Distributed solution based on model segmentation ">
      <t>Based on the key problems and challenges, a distributed AI
      micro-model computing power scheduling service architecture is proposed,
      which can be divided into four layers: business layer, control layer,
      computing power layer, and data layer. The hierarchical relationship is
      shown in Figure 1. The specific architecture diagram is shown in Figure.
      2. The function module can realize the soft cooperation of the control
      layer and the hard isolation of the data layer, and the specific
      structure is shown in Figure 3.</t>

      <figure>
        <artwork name="Fig. 1 Hierarchical relationships"><![CDATA[ ---------------------------------
|          Business layer         |
|                 |               |
|           Control layer         |
|                 |               |
|      Computing power layer      |
|                 |               |
|             Data layer          |
 ---------------------------------]]></artwork>
      </figure>

      <figure>
        <artwork name="Fig. 2  Architecture of computing power scheduling service for distributed AI micromodel"><![CDATA[ -----------------------------------------------------------------------------------------------------------------------------------------------------------
|                       -----------      -----------                                      -----------      -----------                                      |
|                      |Service A/1|    |Service B/1|                                    |Service A/2|    |Service B/2|                                     |
|                       -----|-----      -----|-----                                      -----|-----      -----|-----                                      |
|                            |                |                                                |                |                                           |
|                            |                |                                                |                |                                           |
|                       -----------------------------                                    -----------------------------                                      |
|                      |  Microservices Gateway -1   |                                  |  Microservices Gateway -2   |                                     |
|                       ------------|----------------                                    -----------|-----------------                                      |
|                                   |                                                               |                                                       |
|                              -----|-----                                                     -----|-----                                                  |
|                             | Interface |                                                   | Interface |                                                 |
|                             | address 1 |- - - - - - - - - - - - - - - - - - - - - - - - - -| address 2 |----------------------------------               |
|                              -----\-----                                                     -----/-----            Address caching        |              |
|                                     \                                                            /                                         |              |
|                                       \                                                        /                                           |              |
|           --------------------        --\-------------                          -------------/--       --------------------                |              |
|          | Functional modules |------| Service Router |------------------------| Service Router |-----| Functional modules |               |              |
|           --------------------        -------\--------                          --------/-------       --------------------                |              |
|                                                \                                      /                                                    |              |
|                                                  \                                  /                                            ----------------------   |
|                                                    \                              /                                --------     | Service Registration |  |
                                                       \                          /                                 |  Feign | ---| and Discovery Centre |  |
|                                                        \                      /                                    --------      ----------------------   |
|                                                          \                  /                                                              |              |
|                                                            \              /                                                                |              |
|                                --------------------        --\----------/--                                                                |              |
|                               | Functional modules |------| Service Router |                                                               |              |
|                                --------------------        --------|-------                                                                |              |
|                                                                    |                                                                       |              |
|                                                                    |                                                                       |              |
|                                                               -----|-----                                                                  |              |
|                                                              | Interface |                                           Address caching       |              |
|                                                              | address 3 |-----------------------------------------------------------------               |
|                                                               -----|-----                                                                                 |
|                                                                    |                                                                                      |
|                                                        ------------|----------------                                                                      |
|                                                       |  Microservices Gateway -3   |                                                                     |
|                                                        -----------------------------                                                                      |
|                                                             |                |                                                                            |
|                                                             |                |                                                                            |
|                                                        -----|-----      -----|-----                                                                       |
|                                                       |Service A/3|    |Service B/3|                                                                      |
|                                                        -----------      -----------                                                                       |
|                                                                                                                                                           |
|                                                                                                                                                           |
 -----------------------------------------------------------------------------------------------------------------------------------------------------------]]></artwork>
      </figure>

      <figure>
        <artwork name="Fig. 3 Functional modules"><![CDATA[
                                           RPC  | REST API
                                                |
 -----------------------------------------------|---------------------------------------
|                      -|-* * *-----------------|---------------------|-                |
|                     |        Task         management      module      |               |
|                      -|---|-------------------|-----------------------                |
|                       |   |                   |                                       |
|                 ------    |                   |                                       |
|                |          |                   |                                       |
|     ---|-* * *-|-|--      |         -|-* * *--|---|-----|---                          |
|    |  Asynchronous  |     |        |  AI Model Segmentation |                         |
|    |   task queue   |     |        |     and aggregation    |                         |
|    |     module     |     |        |          module        |                         |
|     ---|-* * *-|-|--      |         -|-* * *--|---|-----|---                          |
|                |          |          |        |   |     |                             |
|                |          |  --------         |   |     |                             |
|                |          | |                 |   |      -------------                |
|               -|-* * *----|-|-                |   |                   |               |
|              | Log management |               |  -|-* * *--|-|---    -|-* * *---|-|-  |
|              |    system      |               | | Fault-tolerant |  | Model storage | |
|               ----------------                | |    mechanism   |  |     module    | |
|                                               |  ----------------    -|-* * *---|-|-  |
|                                               |                                       |
|    Control layer                              |           (Soft collaboration)        |
------------------------------------------------|--------------------------------------- 
                                                |
 -----------------------------------------------|---------------------------------------
|                                               |                                       |
|                                               |                                       |
|                                     -|-* * *--|-----|-|--                             |
|                                    |     Distributed     |                            |
|                                    | unified cooperation |                            |
|                                    |       module        |                            |
|                                     -|-* * *---|----|-|--                             |
|                                                |                                      |
|                                     -|-* * *---|----|-|--                             |
|                                    |   Load balancing    |                            |
|                                    |    and resource     |                            |
|                                    |allocation mechanism |                            |
|                                     -|-* * *---|----|-|--                             |
|                                                |                                      |
|                                     -|-* * *---|----|-|--                             |
|                                    |   Computing power   |                            |
|                                    |  execution  module  |                            |
|                                     -|-* * *---|----|-|--                             |
|                                                |    |                                 |
|                                                |     -------------                    |
|                                                |                  |                   |
|                                     -|-* * *---|----|-|--        -|-* * *----|-|-     |
|                                     |         Data       |      | Fault tolerance|    |
|                                     |  management module |      |  and recovery  |    |
|                                     -|-* * *---|----|-|--       |     module     |    |
|                                                |                 ----------------     |
|                                     -|-* * *---|----|-|--                             |
|                                    |   Computing power   |                            |
|                                    |    resource pool    |                            |
|                                     -|-* * *---|----|-|--                             |
|                                                |                                      |
|  Computing power layer                         |                                      |
|                                                |                                      |
 ------------------------------------------------|--------------------------------------
 ------------------------------------------------|--------------------------------------
|                                                | Packing data                         |
|                                                |                                      |
|                                       -|-* * *-|-|-                                   |
|       Data layer                     |   Database  |                                  |
|                                       -------------               (Hard isolation)    |
 ---------------------------------------------------------------------------------------]]></artwork>
      </figure>

      <section title="Business layer">
        <t>The business layer is the core of the whole system and hosts the
        main business logic and microservice components. It interacts with
        the user-side front-end presentation layer, receives requests from
        various channels, processes them according to models or business
        rules, and returns the results to the upper layer or synchronizes
        them to other microservices. Typically, the business layer is
        deployed on a microservice container platform (such as Kubernetes)
        and is managed by a service gateway or API gateway, with a service
        registration and discovery center maintaining communication and
        load balancing between microservices. Internal communication can use
        RPC, REST APIs, or Feign-based remote calls.</t>

        <section title="Microservices and Micromodels">
          <t>Microservices and micro-models manifest as multiple services
          (e.g., "Service 1", "Service 2", "Service 3", up to "Service n")
          that invoke each other at the business and logical layers. Each
          service encapsulates a separate model or a functional slice of a
          model, and when these services communicate with each other via
          RPC, REST APIs, or an internal event bus, the overall effect of
          distributed micro-model coordination is formed. Through the
          service registration and discovery center, these micro-models can
          automatically discover each other's available instances when
          needed, allowing computing power and network resources to be
          flexibly scaled and balanced in large-scale concurrent
          scenarios.</t>
        </section>

        <section title="Microservice Gateway and API Gateway">
          <t>The microservice gateway and API gateway provide traffic
          scheduling and a unified entry point in the business layer. The
          microservice gateway mainly serves internal service calls;
          through load balancing, routing rules, and security policy
          configuration, it makes communication between business modules
          more efficient and stable.</t>

          <t>The API gateway faces external clients or the front-end layer,
          providing users with a consistent HTTP or gRPC interface. At the
          same time, it is responsible for authentication, rate limiting,
          circuit breaking, and monitoring, ensuring that the impact on
          internal services remains controllable when external requests
          surge.</t>
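
          <t>As a purely illustrative sketch, the following token-bucket
          implementation shows one common way a gateway can enforce the rate
          limiting mentioned above; the capacity and refill rate are example
          values and are not requirements of this architecture.</t>

          <figure>
            <artwork><![CDATA[
# Illustrative token-bucket rate limiter (example values only).
import time

class TokenBucket:
    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec          # tokens added per second
        self.capacity = capacity          # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True                   # request is admitted
        return False                      # request is rejected or queued

bucket = TokenBucket(rate_per_sec=5, capacity=10)
accepted = sum(bucket.allow() for _ in range(100))
print(f"{accepted} of 100 burst requests accepted")
]]></artwork>
          </figure>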
        </section>

        <section title="Service Registration and Discovery Center">
          <t>The service registration and discovery center records the
          network address, version information, and health status of all
          available microservices in the system, so that other modules or
          gateways in the business layer can promptly find the correct
          target instance when they need to call a microservice. For
          example, in a "real-time recommendation and user behavior
          analysis" business, when the "user portrait generation"
          microservice needs to be called, the system first queries the
          registration and discovery center for the service's load status
          and list of available instances, and then selects an appropriate
          node according to the load balancing strategy. This not only
          prevents a single point of failure, but also automatically updates
          routing information as microservice instances are added or
          removed.</t>

          <t>The service registration and discovery center spares business
          function modules from manually maintaining complex service
          addresses and dependencies. Each microservice only needs to
          actively register its own information after startup, and when an
          instance goes offline or crashes, the registry updates its state
          accordingly. Common implementations include Eureka, Consul, and
          ZooKeeper. These registration and discovery centers can be deeply
          integrated with microservice gateways or load-balancing layers to
          achieve highly available governance in distributed
          environments.</t>

          <t>Each service registers its interface address with the registry,
          and a caller finds the interface address of the target service
          through the registry before initiating the call. Interface calls
          are made peer-to-peer; although a registry exists, it only plays
          the role of controlling the call flow and does not relay the calls
          themselves.</t>
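
          <t>The following minimal in-memory sketch illustrates the
          registration, heartbeat, and discovery behavior described above.
          It is a toy example with invented class and field names;
          production systems would rely on Eureka, Consul, or ZooKeeper.</t>

          <figure>
            <artwork><![CDATA[
# Toy in-memory registration and discovery centre (illustrative only).
import random
import time

class ServiceRegistry:
    def __init__(self, ttl_seconds=30):
        self.ttl = ttl_seconds
        self.instances = {}   # service name -> {address: last heartbeat}

    def register(self, name, address):
        self.instances.setdefault(name, {})[address] = time.time()

    def heartbeat(self, name, address):
        self.register(name, address)      # refreshing == re-registering

    def discover(self, name):
        """Return one healthy instance address (random pick as a
        simple load-balancing strategy)."""
        now = time.time()
        healthy = [addr for addr, ts in self.instances.get(name, {}).items()
                   if now - ts < self.ttl]
        return random.choice(healthy) if healthy else None

registry = ServiceRegistry()
registry.register("user-portrait", "10.0.0.5:8080")
registry.register("user-portrait", "10.0.0.6:8080")
print(registry.discover("user-portrait"))
]]></artwork>
          </figure>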
        </section>
      </section>

      <section title="Control layer ">
        <t>The control layer is mainly responsible for scheduling and managing
        various tasks and resources in the distributed AI system, including
        task creation, allocation, exception handling, and key processes such
        as model segmentation, training, and aggregation. Through the fine
        design of the control layer, it can realize the parallel operation of
        multiple models with high efficiency and high availability, and make
        timely scheduling and fault tolerance when the computing power is
        insufficient.</t>

        <section title="Task management module">
          <t>The task management module is the "hub" of the control layer.
          It receives different types of task requests from the business
          layer or data layer, such as model training, model inference, or
          batch data processing, and allocates tasks to nodes for execution
          according to real-time load conditions and computing power
          resource information. The task management module usually maintains
          a task queue or task priority queue, sorting tasks by FCFS
          (first-come, first-served), FIFO (first-in, first-out), or a
          weight-based scheduling policy. Internally, the module interfaces
          with a service registration and discovery center or a resource
          orchestration system (e.g., Kubernetes) to dynamically obtain key
          metrics such as the health, bandwidth, and memory usage of
          available nodes (GPU/CPU). Some advanced implementations also use
          load balancing strategies or node affinity algorithms to choose
          the best placement for tasks and trigger auto-scaling or resource
          recycling when the overall cluster load reaches a threshold.</t>
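
          <t>A minimal sketch of the placement decision described above is
          given below. The metric names, thresholds, and node names are
          assumptions made for illustration and do not prescribe an
          implementation.</t>

          <figure>
            <artwork><![CDATA[
# Illustrative placement rule: pick the least-loaded healthy node
# that satisfies the task's memory requirement (invented metrics).
def pick_node(task, nodes):
    candidates = [n for n in nodes
                  if n["healthy"] and n["free_mem_gb"] >= task["mem_gb"]]
    if not candidates:
        return None              # would trigger queuing or auto-scaling
    return min(candidates, key=lambda n: n["gpu_util"])

nodes = [
    {"name": "gpu-node-1", "healthy": True,  "gpu_util": 0.85, "free_mem_gb": 4},
    {"name": "gpu-node-2", "healthy": True,  "gpu_util": 0.30, "free_mem_gb": 24},
    {"name": "gpu-node-3", "healthy": False, "gpu_util": 0.10, "free_mem_gb": 32},
]
task = {"name": "train-micro-model-7", "mem_gb": 8}
target = pick_node(task, nodes)
print(f"{task['name']} -> {target['name'] if target else 'queued'}")
]]></artwork>
          </figure>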
        </section>

        <section title="Exception task queue module ">
          <t>The exception task queue module plays the role of "fault buffer",
          which is used to capture and store exceptions that occur during the
          execution of tasks. In distributed AI systems, network jitter, node
          failure or data exception often cause some tasks to fail or hang for
          a long time. The exception task queue module is designed to collect
          and isolate these abnormal tasks, so that they do not block the main
          task queue and affect the overall performance. This module
          continuously monitors the error logs and timeouts during the
          training or inference process.When an exception is found, the
          detailed information of the corresponding task (e.g., task ID,
          exception type, execution log, etc.) is transferred to a separate
          exception queue and recorded in the fault tracking system.</t>
        </section>

        <section title="Log management system ">
          <t>The log management module is responsible for tracking all
          critical operations and events during the distributed training,
          inference and scheduling process. This module usually uses a
          centralized log storage and analysis framework to efficiently
          retrieve and aggregate log data even when the system is large. This
          module not only records the timestamps and execution results of
          events such as model segmentation, computing power allocation, and
          communication synchronization, but also collects hardware metrics
          (such as GPU utilization, memory usage, and I/O throughput) of each
          node during execution. When failure symptoms or performance
          bottlenecks are detected in the logs, such as slow training or
          frequent node timeouts, the log management module pushes the
          information to the abnormal task queue module or alert system, which
          assists the operations and Development teams to make timely
          diagnosis and troubleshooting. Through the centralized management
          and visual analysis of log data, it can also provide reliable data
          basis for subsequent model optimization, resource budgeting and
          business decision-making.</t>
        </section>

        <section title="Model segmentation interface">
          <t>This interface is mainly used to receive configuration
          information related to segmentation strategy or algorithm. Through
          this interface, the caller (e.g., a task management module, a
          business layer, or a scheduling system) can specify the splitting
          mode (per layer, per service, per block, etc.) and the corresponding
          parameter restrictions for each policy, such as the range of the
          number of layers to be split, the heuristic rules of the tabu search
          algorithm, the number of shared layers for multiple tasks, and the
          privacy protection requirements. The interface is typically provided
          in the form of a REST API, gRPC, RPC, or messaging middleware,
          giving the upstream system the flexibility to send or update
          policies.</t>
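
          <t>The snippet below shows, purely as a hypothetical example, the
          kind of segmentation policy a caller might submit through this
          interface. All field names are invented for illustration and are
          not defined by this document.</t>

          <figure>
            <artwork><![CDATA[
# Hypothetical segmentation-policy payload (field names invented).
import json

segmentation_policy = {
    "model_id": "resnet50-v2",
    "mode": "layer",              # "layer" | "business" | "block"
    "layer_range": {"min_layers_per_slice": 2, "max_layers_per_slice": 8},
    "search": {"algorithm": "tabu", "max_iterations": 200},
    "shared_layers": 0,           # used by business segmentation
    "privacy": {"differential_privacy": False, "secure_node_only": []},
}

# The policy would typically be sent over REST or gRPC, e.g.:
#   requests.post("https://controller.example/segmentation/policies",
#                 json=segmentation_policy)
print(json.dumps(segmentation_policy, indent=2))
]]></artwork>
          </figure>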
        </section>

        <section title="Model segmentation module">
          <t>Model segmentation is a key innovation in distributed AI
          architectures, offering a more efficient and flexible way to
          allocate computational resources and manage workloads. Within the
          control layer, segmentation strategies are carefully selected based
          on specific objectives, such as improving parallelism, optimizing
          resource utilization, or meeting privacy requirements. These
          strategies are tightly integrated into the system, with each
          segmented component packaged as a modular microservice to ensure
          seamless deployment and operation in distributed environments.
          Figure 4 shows the framework of the model segmentation and
          aggregation module.</t>

          <t>Layer-based segmentation divides a model according to its
          structural hierarchy, segmenting the network layer by layer. Each
          resulting sub-model, typically consisting of one or more layers, is
          assigned to different nodes for parallel execution. This method is
          particularly effective for deep neural networks with significant
          depth and computational complexity. For example, in a deep
          convolutional neural network (CNN) for image classification, the
          initial convolutional layers responsible for extracting features
          might be executed on Node A, the intermediate fully connected layers
          on Node B, and the output classification layer on Node C. To enhance
          efficiency, heuristic or tabu search algorithms can determine
          optimal segmentation points by considering factors like
          computational load, inter-node communication overhead, and overall
          network latency. This strategy is especially valuable in real-time
          inference scenarios, such as autonomous driving, where computational
          throughput and low latency are critical for decision-making.</t>
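
          <t>The following PyTorch sketch is offered only as an illustration
          under simplifying assumptions (a purely sequential model, with cut
          points chosen by hand rather than by heuristic or tabu search). It
          shows how layer-based segmentation can split one model into stages
          that would run on different nodes, with activations passed between
          them.</t>

          <figure>
            <artwork><![CDATA[
# Illustrative layer-based segmentation of a sequential model.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),   # features   -> "Node A"
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 64), nn.ReLU(),                # middle     -> "Node B"
    nn.Linear(64, 10),                           # classifier -> "Node C"
)

# Segmentation points (layer indices); a heuristic or tabu search would
# normally choose these from profiling data.
cut_points = [2, 6]
layers = list(model)
stages = [nn.Sequential(*layers[i:j])
          for i, j in zip([0] + cut_points, cut_points + [len(layers)])]

x = torch.randn(1, 3, 32, 32)
for stage_id, stage in enumerate(stages):
    x = stage(x)   # in a real deployment: send x to the next node
    print(f"stage {stage_id} output shape: {tuple(x.shape)}")
]]></artwork>
          </figure>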

          <t>Business segmentation is usually applied to multi-task learning
          scenarios, where the same "backbone" model is derived into several
          sub-models (or sub-tasks) according to business requirements, and
          the co-training or inference of multiple tasks is realized by
          sharing part of the network structure or parameters. For example, an
          e-commerce platform may care about recommendations, ad click
          prediction, and user personas at the same time, and these
          requirements can be split into different "branches" on the "common
          part" of the same model, which share feature extraction layers, and
          each have task-specific output or fine-tuning layers.</t>

          <t>Block-based segmentation provides maximum flexibility by dividing
          the model into smaller, independent chunks of computation that can
          be executed on separate nodes. Unlike layer-based or business-based
          segmentation, this approach does not adhere to the structural
          hierarchy or task boundaries of the model. Instead, it focuses on
          resource adaptability and efficient computation in heterogeneous
          environments. For example, in a federated learning system for
          healthcare, hospitals can train local model blocks on sensitive
          patient data. These blocks perform their computations securely
          on-site, and only encrypted intermediate results are aggregated
          globally. Similarly, in high-density cloud environments,
          block-based segmentation can dynamically allocate computational
          tasks to available hardware.</t>
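
          <t>Purely as an illustration of the federated pattern mentioned
          above, the sketch below trains a toy least-squares model at three
          simulated sites and aggregates only the parameters; the data,
          learning rate, and number of rounds are invented for the
          example.</t>

          <figure>
            <artwork><![CDATA[
# Toy federated averaging: data never leaves the sites, only parameters do.
import numpy as np

def local_update(weights, local_data, lr=0.1):
    """One gradient step of least-squares regression on a site's data."""
    X, y = local_data
    grad = X.T @ (X @ weights - y) / len(y)
    return weights - lr * grad

rng = np.random.default_rng(0)
sites = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(3)]
global_w = np.zeros(3)

for round_id in range(5):
    local_ws = [local_update(global_w.copy(), data) for data in sites]
    global_w = np.mean(local_ws, axis=0)   # only parameters are shared
    print(f"round {round_id}: global weights = {np.round(global_w, 3)}")
]]></artwork>
          </figure>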

          <t>In addition to the above common segmentation methods, for
          scenarios that need to take into account data privacy or compliance
          requirements, privacy protection logic can also be built into the
          segmentation strategy, such as putting sensitive data related
          calculations into a separate secure node, or performing differential
          privacy processing on gradient information and then aggregating.
          Through the multi-level and multi-angle model segmentation scheme,
          the control layer can maximize the use of distributed computing
          power, and flexibly schedule AI tasks in a multi-business and
          multi-data source environment.</t>

          <figure>
            <artwork name="Fig. 4 AI Model Segmentation and aggregation module"><![CDATA[ ---------------------------------------------------------------------------
|                                -----------------------------------------  |
|    --|-* * *---------|-|--    | Task requests are collected and stored  | |
|   | AI Model Segmentation |   |                       |                 | |
|   |    and aggregation    | --|      The feature algorithm extracts     | |
|   |         module        |   |           the generated features        | |
|    --|-* * *---------|-|--    |                       |                 | |
|                               |          The data matching algorithm    | |
|                               |           performs the task grouping    | |
|    -----------------------    |                       |                 | |
|   | Layer segmentation    |   |                 Model training          | |
|   | Business segmentation |---|                       |                 | |
|   | Block segmentation    |   |         Model parameter aggregation     | |
|    -----------------------     -----------------------------------------  |
 ---------------------------------------------------------------------------]]></artwork>
          </figure>
        </section>

        <section title="Model segmentation scheduling ">
          <t>After model segmentation, the control layer undertakes the key
          task of scheduling the execution of the segmented sub-models.
          Scheduling is more than just assigning tasks to nodes; it must
          optimize collaboration efficiency, minimize resource idleness, and
          reduce data bias across distributed systems. The scheduling process
          requires careful consideration of factors such as task timing,
          resource availability, data dependency, and system load to determine
          the optimal execution order and synchronization strategy for each
          submodel.</t>

          <t>To manage incoming requests effectively, the scheduling algorithm
          must decide how tasks are prioritized and allocated. For instance,
          using a First Come, First Serve (FCFS) strategy ensures that tasks
          are executed in the order they arrive. However, this approach may
          leave some nodes underutilized if tasks vary significantly in
          complexity or resource requirements. To address such inefficiencies,
          advanced scheduling methods like priority queues or dynamic
          insertion algorithms can be employed. These methods prioritize tasks
          based on urgency, computational cost, or value to the system,
          ensuring that high-priority or time-sensitive tasks are assigned
          computational resources more quickly. For example, in a real-time
          fraud detection system, high-risk transactions can be processed
          immediately by prioritizing their execution, while lower-risk
          transactions are queued for later.</t>
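
          <t>The following sketch illustrates the priority-queue behavior
          described above under invented risk scores: a later-arriving
          high-risk task is executed before earlier low-risk tasks, while a
          counter preserves first-come, first-served order within each
          priority level.</t>

          <figure>
            <artwork><![CDATA[
# Illustrative priority queue for task scheduling (invented tasks/scores).
import heapq
import itertools

counter = itertools.count()   # tie-breaker keeps FCFS order per priority
queue = []

def submit(priority, name):
    heapq.heappush(queue, (priority, next(counter), name))

submit(5, "low-risk transaction #1")    # arrives first
submit(5, "low-risk transaction #2")
submit(1, "high-risk transaction #3")   # arrives last, jumps the queue

while queue:
    priority, _, name = heapq.heappop(queue)
    print(f"executing (priority {priority}): {name}")
]]></artwork>
          </figure>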

          <t>At the same time, to ensure correctness and consistency in a
          distributed environment, appropriate communication points must be
          arranged after each step of the partitioned computation to avoid
          data disorder or excessive delay. For scenarios in which training
          or inference is highly time-sensitive, exclusive GPU/CPU nodes can
          be reserved for critical tasks at the scheduling level, or timing
          synchronization mechanisms can be enabled to ensure that all
          sub-models complete their updates and feedback within the same
          iteration cycle.</t>
        </section>

        <section title="Model segmentation aggregation">
          <t>Once all calculations distributed across different nodes or
          sub-models are completed, the intermediate results or parameters
          must be aggregated to produce the final output, whether it is a
          model prediction result or updated model parameters. The aggregation
          module plays a pivotal role in consolidating these outputs into a
          unified result, ensuring consistency and accuracy in distributed AI
          workflows.</t>

          <t>The aggregation process typically employs strategies such as
          voting, weighted averaging, or attention mechanisms to combine the
          outputs of sub-models. For instance, in an ensemble-based
          recommendation system, each sub-model might provide a recommendation
          score, and the aggregation module could compute a weighted average
          based on the performance or confidence of each sub-model. Similarly,
          in distributed neural networks, attention mechanisms can be used to
          assign different importance to outputs from various nodes, enabling
          more precise aggregation based on task-specific contexts. These
          strategies ensure that the aggregated result reflects the strengths
          and contributions of individual sub-models while maintaining overall
          coherence.</t>
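
          <t>As a simple numerical illustration (the scores, confidences,
          and votes below are invented), the following sketch shows weighted
          averaging and majority voting, two of the aggregation strategies
          mentioned above.</t>

          <figure>
            <artwork><![CDATA[
# Illustrative aggregation of sub-model outputs (invented numbers).
import numpy as np

scores      = np.array([0.72, 0.55, 0.90])  # outputs of sub-models A, B, C
confidences = np.array([0.9,  0.4,  0.7])   # per-sub-model weights

weights = confidences / confidences.sum()   # normalise the weights
aggregated = float(np.dot(weights, scores))
print(f"weighted-average score: {aggregated:.3f}")

# Majority vote over class predictions is another option:
votes = np.array([1, 1, 0])                 # class label from each sub-model
print(f"majority vote: class {np.bincount(votes).argmax()}")
]]></artwork>
          </figure>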

          <t>However, aggregation in distributed systems is inherently
          challenging due to the possibility of node failures or delays.
          Network jitter, node outages, or computation delays can prevent
          certain nodes from returning their results in time, potentially
          disrupting the aggregation process. To address this, the control
          layer incorporates fault-tolerant mechanisms such as timeout
          retries, data playback, or redundant computation strategies. For
          example, if a node fails to provide its result within a specified
          time frame, the system might either retry the computation on the
          same node or reassign the task to a different node. In scenarios
          where redundancy is feasible, multiple nodes can perform the same
          computation, ensuring that at least one result is available for
          aggregation.</t>

          <t>The aggregation module also monitors system-wide performance to
          evaluate the trade-off between computational benefits and
          coordination overhead. By refining fault-tolerant logic and
          aggregation strategies, the control layer ensures that the
          advantages of distributed computation&mdash;such as scalability and
          parallelism&mdash;are not offset by excessive synchronization or
          error-handling delays. For example, in large-scale model training,
          the aggregation process might include gradient averaging or
          parameter summation across nodes, with mechanisms to handle delayed
          or missing gradients, ensuring that the global model converges
          effectively despite intermittent node failures.</t>
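
          <t>The sketch below is illustrative and makes the simplifying
          assumption that late or missing gradients are simply dropped for
          the round; it shows how aggregation can proceed over the
          responders only, so that a single slow or failed node does not
          block the global update.</t>

          <figure>
            <artwork><![CDATA[
# Illustrative fault-tolerant gradient aggregation (invented values).
import numpy as np

def aggregate_with_timeout(results):
    """results: mapping node -> gradient array, or None if missing/late."""
    received = [g for g in results.values() if g is not None]
    if not received:
        raise RuntimeError("no gradients received; retry or reassign round")
    return np.mean(received, axis=0)   # average over responders only

round_results = {
    "node-a": np.array([0.10, -0.20, 0.05]),
    "node-b": np.array([0.12, -0.18, 0.07]),
    "node-c": None,   # timed out; handled later by retry or redundancy
}
print(aggregate_with_timeout(round_results))
]]></artwork>
          </figure>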
        </section>
      </section>

      <section title="Computing power layer">
        <t>The computing power layer is the execution core of the distributed
        artificial intelligence system, which converts the strategies and
        decisions of the control layer into actual calculations. This layer
        processes tasks, manages resources, and executes distributed models
        across nodes, ensuring that the computational benefits of model
        segmentation are fully realized. By integrating advanced scheduling,
        resource allocation, and fault tolerance mechanisms, the computing
        power layer ensures the efficient execution of tasks while
        maintaining system stability under dynamic loads.</t>

        <t>The model segmentation strategy at the control layer determines how
        the sub-models or operators are distributed over the nodes. The
        computing power layer, in turn, optimizes resource allocation and
        execution to align with the segmentation design, ensuring that data
        dependencies and computational workflows are effectively managed.
        Through dynamic orchestration, parallel processing, and feedback
        mechanisms, this layer provides high performance and scalability for
        large-scale distributed AI systems.</t>

        <section title="Calculation of micro-model parameters">
          <t>In the micro-model parameter calculation phase, the computing
          power layer receives scheduling instructions from the control
          layer and obtains the aggregated model information provided by the
          distributed unified collaboration module. The input usually
          includes a structural description of the micro-model (e.g.,
          different network topologies such as convolutional networks, DNNs,
          or Transformers) and the corresponding data fragments or data
          blocks. In addition, the computing power layer takes into account
          the requirements of the business layer, such as inference latency,
          training accuracy, and throughput, to pre-allocate and schedule
          resources before execution.</t>

          <t>When the micro-model and data are ready, the computing power
          execution module loads the corresponding operators onto GPUs,
          CPUs, or other hardware acceleration units according to the
          pre-selected computing framework (such as TensorFlow, PyTorch, or
          a self-developed lightweight AI inference engine), and performs
          parallel computation according to the parallelism configuration
          provided by the distributed unified collaboration module. For
          larger convolutional layers or attention mechanisms, the system
          may adopt communication patterns such as AllReduce or All-to-All
          to distribute computing tasks, performing synchronization or
          gradient updates after each iteration completes. For lightweight
          AI models, the computing power layer gives priority to
          fast-responding nodes to meet low-latency application scenarios.
          Throughout this process, the load balancing and resource
          allocation mechanism monitors the load of each resource pool (such
          as "computing power resource pool 1", "computing power resource
          pool 2", etc.) in real time and makes dynamic adjustments when a
          node hits a performance bottleneck or has idle resources, reducing
          calculation waiting time and improving overall throughput.</t>
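
          <t>For illustration, the following Python sketch shows one
          synchronized training step of this kind using torch.distributed.
          It assumes the process group has already been initialized with the
          NCCL backend (one process per GPU, launched externally, for
          example with torchrun); the model, batch, and plain SGD update are
          placeholders rather than components specified by this
          document.</t>

          <figure>
            <artwork><![CDATA[
import torch
import torch.distributed as dist

def synchronized_step(model, batch, targets, loss_fn, lr=0.01):
    """One local step whose gradients are averaged across all nodes."""
    loss = loss_fn(model(batch), targets)
    loss.backward()
    world = dist.get_world_size()
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            # sum the gradient over every node, then average it
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world
            p -= lr * p.grad          # placeholder SGD update
            p.grad.zero_()
    return loss.item()
]]></artwork>
          </figure>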

          <t>When the calculation is finished, the computing power layer
          summarizes the execution of each micro-model, generates records
          covering calculation delay, model metrics (such as loss or
          accuracy), and hardware utilization, and archives these records
          through the data management module in preparation for the next
          distributed computing power parameter update.</t>
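
          <t>The shape of such an archived record might resemble the
          following sketch; the field names and values are purely
          illustrative assumptions and are not defined by this
          architecture.</t>

          <figure>
            <artwork><![CDATA[
import json, time

# Hypothetical execution record handed to the data management module.
record = {
    "micro_model_id": "mm-017",
    "resource_pool": "computing power resource pool 1",
    "calc_delay_ms": 182.4,
    "metrics": {"loss": 0.342, "accuracy": 0.913},
    "hardware_utilization": {"gpu": 0.87, "mem": 0.64},
    "finished_at": time.time(),
}
print(json.dumps(record, indent=2))   # archived for the next update
]]></artwork>
          </figure>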
        </section>

        <section title="Distributed computing power parameter update ">
          <t>In the distributed computing power parameter update stage, the
          computing power layer globally merges and synchronizes the
          intermediate results or model gradients calculated in the previous
          step, and then feeds the updated model parameters back to the
          control layer or data layer. The input usually includes the
          training gradients uploaded by each node, model weight chunks, and
          node health status. The distributed unified collaboration module
          combines fault tolerance and recovery mechanisms to ensure that
          parameters can still be aggregated smoothly when some nodes are
          delayed or fail.</t>

          <t>According to the business requirements and model scale, the
          computing power layer chooses an appropriate parallel
          communication strategy, such as Ring AllReduce, Tree AllReduce, or
          gradient compression followed by aggregation, to reduce network
          bandwidth consumption and accelerate the synchronization of model
          parameters. For large models built on Transformer or attention
          structures, the computing power layer can distribute model
          parameters to different resource pools to be updated in parallel
          with the help of block or pipeline parallelism, and the partial
          results are then collected and summarized at the master node or
          master process.</t>
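
          <t>As a simple illustration of "gradient compression followed by
          aggregation", the sketch below applies top-k sparsification before
          merging; the 1 percent keep ratio and the dense reconstruction
          step are assumptions for this example, not a recommended
          configuration.</t>

          <figure>
            <artwork><![CDATA[
import numpy as np

def compress(grad, ratio=0.01):
    """Keep only the largest-magnitude entries of a flat gradient."""
    k = max(1, int(grad.size * ratio))
    idx = np.argsort(np.abs(grad))[-k:]       # indices of the top-k
    return idx, grad[idx]

def aggregate(compressed, size):
    """Average the sparse contributions into a dense parameter update."""
    total = np.zeros(size)
    for idx, values in compressed:
        total[idx] += values
    return total / len(compressed)

node_grads = [np.random.randn(1000) for _ in range(4)]   # one per node
merged = aggregate([compress(g) for g in node_grads], size=1000)
]]></artwork>
          </figure>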

          <t>After the distributed parameter update is completed, the
          computing power layer sends the final model weights or inference
          engine image back to the control layer to be registered in the
          model warehouse as the "latest version of the model", and may also
          synchronize some intermediate features or labels to the data layer
          for subsequent analysis. At the same time, the fault tolerance and
          recovery module evaluates the stability and performance of each
          node based on the monitoring data collected during training and
          updating, providing a decision basis for the next iteration cycle
          or for scheduling new tasks.</t>
        </section>

        <section title="Distributed unified Collaboration module">
          <t>The distributed unified collaboration module sits at the core
          of the entire computing power layer. It is responsible for
          receiving and integrating task instructions from the control layer
          (such as the model segmentation strategy and the training or
          inference goals) and for interfacing effectively with the
          underlying computing power resource pools. Its inputs include
          information about the architecture of the individual micro-models
          or aggregated models, the type of computation to be performed
          (training or inference), and an overview of the hardware available
          in the current cluster. Its output is a global orchestration
          instruction for computing resources and computing processes, which
          guides the computing power execution module and the other
          functional modules to work together. A distributed unified
          collaboration module will typically work with a service registry
          or cluster orchestration system (e.g., Kubernetes, YARN), or may
          have a built-in distributed communication framework (e.g., NCCL,
          Horovod) to manage and synchronize multiple GPUs or multiple
          nodes. Its most prominent feature is that it can dynamically map
          different sub-models or operators to the most appropriate nodes
          according to the model block information and computing
          requirements, so that distributed computation maintains high
          throughput and scalability in multi-task, multi-model
          environments.</t>
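
          <t>A hypothetical sketch of such a mapping decision is given
          below: each sub-model is placed greedily on a matching node with
          the most free memory. The node and sub-model descriptors and the
          greedy policy are invented for illustration and do not represent a
          defined interface of the collaboration module.</t>

          <figure>
            <artwork><![CDATA[
def place_sub_models(sub_models, nodes):
    """Return {sub_model_id: node_id} placement decisions."""
    placement = {}
    for sm in sorted(sub_models, key=lambda s: s["mem_gb"], reverse=True):
        candidates = [n for n in nodes
                      if n["device"] == sm["device"]
                      and n["free_mem_gb"] >= sm["mem_gb"]]
        best = max(candidates, key=lambda n: n["free_mem_gb"])
        best["free_mem_gb"] -= sm["mem_gb"]        # reserve the memory
        placement[sm["id"]] = best["id"]
    return placement

sub_models = [{"id": "attention-block", "device": "gpu", "mem_gb": 8},
              {"id": "embedding", "device": "gpu", "mem_gb": 4}]
nodes = [{"id": "node-1", "device": "gpu", "free_mem_gb": 10},
         {"id": "node-2", "device": "gpu", "free_mem_gb": 16}]
print(place_sub_models(sub_models, nodes))
# -> {'attention-block': 'node-2', 'embedding': 'node-1'}
]]></artwork>
          </figure>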
        </section>

        <section title="Load balancing and resource allocation mechanism ">
          <t>The load balancing and resource allocation mechanism monitors
          the load of each computing resource pool (such as GPU clusters,
          CPU clusters, heterogeneous accelerators, etc.) in real time and,
          combined with the task scheduling strategy given by the
          distributed unified collaboration module, decides how to
          distribute the computing load between nodes. The input mainly
          consists of the real-time status of each node (idle capacity, free
          memory, computing power utilization) and a description of the
          hardware requirements of the task to be assigned (e.g., how many
          GPUs are needed, whether mixed-precision training is supported).
          The output is the specific node allocation scheme and task routing
          instructions, which guide the computing power execution module to
          deliver computing tasks to the optimal location.</t>
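
          <t>The following sketch illustrates one possible scoring policy
          that turns such node status information into a routing
          instruction; the field names and the weighting of idle capacity
          versus free memory are assumptions made only for this
          example.</t>

          <figure>
            <artwork><![CDATA[
def route_task(task, node_status):
    """Pick a node with enough free GPUs and the best idle score."""
    eligible = [n for n in node_status
                if n["free_gpus"] >= task["gpus_needed"]
                and (n["mixed_precision"] or not task["needs_amp"])]

    def score(n):
        return 0.6 * (1 - n["utilization"]) + 0.4 * n["free_mem_frac"]

    best = max(eligible, key=score)
    return {"task": task["id"], "node": best["id"]}  # routing instruction

nodes = [{"id": "n1", "free_gpus": 2, "utilization": 0.7,
          "free_mem_frac": 0.5, "mixed_precision": True},
         {"id": "n2", "free_gpus": 4, "utilization": 0.3,
          "free_mem_frac": 0.8, "mixed_precision": True}]
print(route_task({"id": "t-42", "gpus_needed": 2, "needs_amp": True},
                 nodes))   # -> {'task': 't-42', 'node': 'n2'}
]]></artwork>
          </figure>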
        </section>

        <section title="Computing power execution module ">
          <t>According to the instructions from the distributed unified
          collaboration module and the load balancing module, the computing
          power execution module loads the specific micro-model or operator
          onto the corresponding node to run. The inputs include model
          parameters, network topology, and data blocks; the outputs are the
          computed inference results or intermediate training gradients. The
          module can run on multiple servers through containerization (e.g.,
          Docker, Kubernetes Pods) and work with AI frameworks (TensorFlow,
          PyTorch, etc.) or self-developed inference engines to flexibly
          switch the execution environment and underlying computing
          power.</t>
        </section>

        <section title="Data management module ">
          <t>The data management module exchanges the necessary features,
          labels, and metadata with the control layer and the business
          layer. Its input sources usually include data sets that have
          already been chunked or segmented, as well as intermediate results
          generated during model execution (e.g., local gradients, temporary
          features). Its outputs are updated snapshots of model parameters
          or preprocessed feature data for later use. The data management
          module can support highly concurrent reads and writes with the
          help of a distributed file system (HDFS), object storage (e.g.,
          S3), or message queues (Kafka, RabbitMQ), and handles small-scale,
          high-frequency data queries with database or cache systems.</t>
        </section>

        <section title="Fault tolerance and recovery module">
          <t>The fault tolerance and recovery module continuously monitors
          the heartbeat, load, and network status of each node while the
          system is running. Once an anomaly is detected, the fault
          information is reported to the distributed unified collaboration
          module and the automatic fault-tolerance logic is triggered. The
          inputs are real-time cluster health data, task execution logs, and
          node failure reports. The output is a series of decision
          instructions, including restarting tasks, reallocating resources,
          or rolling back to the last stable snapshot. This often relies on
          self-healing driven by automation tooling (Ansible, SaltStack,
          etc.) or cluster orchestration (Kubernetes), or it may use a
          checkpoint-and-resume training process that records the current
          iteration number and intermediate parameters when a crash occurs
          and waits for the node to recover before continuing
          execution.</t>
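
          <t>A minimal sketch of heartbeat-based failure detection is shown
          below; the two-second heartbeat interval, the three-missed-beats
          threshold, and the report_failure() callback are assumptions for
          illustration rather than values prescribed by this document.</t>

          <figure>
            <artwork><![CDATA[
import time

HEARTBEAT_INTERVAL_S = 2.0
MISSED_BEATS_LIMIT = 3

def detect_failures(last_heartbeat, report_failure, now=None):
    """Report every node whose last heartbeat is older than the limit."""
    now = now if now is not None else time.time()
    deadline = MISSED_BEATS_LIMIT * HEARTBEAT_INTERVAL_S
    failed = [node for node, ts in last_heartbeat.items()
              if now - ts > deadline]
    for node in failed:
        report_failure(node)   # collaboration module decides: restart,
    return failed              # reallocate, or roll back to a snapshot

beats = {"node-1": time.time(), "node-2": time.time() - 30.0}
print(detect_failures(beats, report_failure=lambda n: None))  # ['node-2']
]]></artwork>
          </figure>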
        </section>

        <section title="Computing resource pool ">
          <t>A computing resource pool represents a collection of underlying
          hardware that actually provides computing power. Each pool may
          correspond to a different type or specification of hardware, such
          as GPU server farms, CPU clusters, FPGA/ASIC accelerator cards, or
          even hybrid computing power spanning cloud and local data centers.
          Its inputs are usually the task assignments and model execution
          requirements issued by the load balancing and resource allocation
          mechanism; its outputs are the inference results or training
          outputs produced by the computation, together with relevant
          performance indicators (such as temperature, power consumption,
          and throughput) that are fed back to upper modules for
          analysis.</t>
        </section>
      </section>

      <section title="Data layer">
        <t>The data layer is the backbone of distributed AI systems, enabling
        efficient data management while ensuring privacy protection,
        scalability, and seamless integration with other layers, including
        control, computing, and business layers. It plays a pivotal role in
        storing, transmitting, and processing diverse datasets, supporting
        distributed training, inference, and model segmentation workflows.
        Through its robust design, the data layer balances security and
        performance while maintaining the flexibility required by dynamic,
        large-scale AI systems.</t>

        <section title="Privacy protection">
          <t>Privacy protection is at the core of the data layer, ensuring
          secure data handling across the entire AI workflow. Multiple
          databases (e.g., DB1, DB2, ..., DBn) store datasets from various
          business domains or sensitivity levels, enabling the system to
          manage and segregate data efficiently. For high-sensitivity
          scenarios, such as healthcare or financial applications, only
          encrypted or desensitized data fields are stored and transmitted.
          For instance, patient medical records might be encrypted locally,
          and only aggregated gradients or anonymized insights are shared
          during federated learning tasks.</t>

          <t>When the system executes model training or inference, the control
          layer determines the appropriate data transmission strategy based on
          predefined privacy policies. Federated learning ensures that raw
          data remains localized, sharing only intermediate model gradients or
          parameters, while differential privacy adds noise to data or
          computations to prevent individual information leakage.</t>
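
          <t>As a minimal illustration of differential privacy injection,
          the sketch below clips a local gradient and adds Gaussian noise
          before it leaves the node; the clipping norm and noise multiplier
          are illustrative assumptions, not recommended parameter
          values.</t>

          <figure>
            <artwork><![CDATA[
import numpy as np

def privatize_gradient(grad, clip_norm=1.0, noise_multiplier=1.1,
                       rng=np.random.default_rng()):
    """Clip the gradient to a fixed norm, then add Gaussian noise."""
    norm = np.linalg.norm(grad)
    clipped = grad * min(1.0, clip_norm / (norm + 1e-12))
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=grad.shape)
    return clipped + noise   # only this noisy gradient is shared

local_grad = np.random.randn(16)
shared = privatize_gradient(local_grad)   # safe to transmit off-node
]]></artwork>
          </figure>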

          <t>To further strengthen security, the data layer integrates
          advanced privacy-preserving technologies, such as homomorphic
          encryption, multi-party secure computation, and differential privacy
          injection. These techniques enable micro-models and segmented
          workflows to process data securely while complying with privacy
          regulations. For instance, in a cross-database integration scenario,
          the data layer ensures that access control policies and metadata
          updates prevent unauthorized sharing of sensitive data, maintaining
          compliance without hindering system performance.</t>
        </section>

        <section title="Database maintenance and update">
          <t>The data layer's database infrastructure ensures reliable
          storage, high availability, and scalability, supporting the
          execution of micro-models and model segmentation workflows.
          Distributed databases are deployed to manage datasets associated
          with various system segments, enabling parallel operations and
          efficient data provisioning for training and inference tasks.</t>

          <t>To handle high-concurrency environments, the data layer
          leverages distributed database architectures such as NoSQL,
          NewSQL, and relational databases, each selected based on the
          nature of the workload: <list style="symbols">
              <t>NoSQL databases (e.g., HBase, Cassandra) are ideal for
              handling unstructured or semi-structured data, such as logs
              and user behavior data, offering high write throughput and
              horizontal scalability.</t>

              <t>NewSQL systems (e.g., TiDB) provide a hybrid solution,
              balancing transactional consistency with scalability, making
              them suitable for workloads requiring real-time updates, such
              as model parameter synchronization.</t>

              <t>Relational databases (e.g., MySQL, PostgreSQL) handle
              structured datasets, such as model version histories or
              feature engineering outputs, ensuring strong consistency and
              query efficiency.</t>
            </list></t>

          <t>The data layer ensures data consistency and fault tolerance
          through mechanisms such as master-slave replication, shard-based
          architectures, and automated failover. For example, if a database
          shard responsible for storing training gradients becomes
          unavailable, the system redirects queries to backup replicas or
          initiates a failover process to restore service. Regular incremental
          backups and disaster recovery protocols safeguard critical data
          against long-term loss due to network or hardware failures.</t>

          <t>Real-time monitoring tools, such as Prometheus and ELK Stack,
          track database performance metrics, including query latency,
          synchronization delays, and disk usage. If anomalies are detected,
          automated alerts trigger recovery actions such as reallocating
          workloads, rerouting queries, or scaling database resources to
          prevent bottlenecks. For instance, during a high-demand scenario
          like a shopping festival, the data layer may dynamically scale up
          storage resources to accommodate surging user activity logs,
          ensuring uninterrupted data availability for recommendation
          models.</t>
        </section>
      </section>
    </section>

    <section anchor="iana" title="IANA Considerations">
      <t>TBD</t>
    </section>

    <section title="Acknowledgement">
      <t>TBD</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.2119"?>
    </references>


  </back>
</rfc>
