<?xml version="1.0" encoding="US-ASCII"?>
<!-- edited with XMLSPY v5 rel. 3 U (http://www.xmlspy.com)
     by Daniel M Kohn (private) -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!ENTITY rfc2119 PUBLIC "" "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
]>
<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="3"?>
<?rfc tocindent="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<rfc category="std" docName="draft-yang-dmsc-distributed-model-01"
     ipr="trust200902">
  <front>
    <title abbrev="DSMC Architecture">Distributed AI model architecture for
    microservices communication and computing power scheduling</title>

    <author fullname="Hui Yang" initials="H" surname="Yang">
      <organization>Beijing University of Posts and
      Telecommunications</organization>

      <address>
        <postal>
          <street>10 Xitucheng Road, Haidian District</street>

          <city>Beijing</city>

          <code>100876</code>

          <region>Beijing</region>

          <country>China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>yanghui@bupt.edu.cn</email>
      </address>
    </author>

    <author fullname="Tiankuo Yu" initials="T" surname="Yu">
      <organization>Beijing University of Posts and
      Telecommunications</organization>

      <address>
        <postal>
          <street>10 Xitucheng Road, Haidian District</street>

          <city>Beijing</city>

          <code>100876</code>

          <region>Beijing</region>

          <country>China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>yutiankuo@bupt.edu.cn</email>
      </address>
    </author>

    <author fullname="Qiuyan Yao" initials="Q" surname="Yao">
      <organization>Beijing University of Posts and
      Telecommunications</organization>

      <address>
        <postal>
          <street>10 Xitucheng Road, Haidian District</street>

          <city>Beijing</city>

          <code>100876</code>

          <region>Beijing</region>

          <country>China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>yqy89716@bupt.edu.cn</email>
      </address>
    </author>

    <author fullname="Zepeng Zhang" initials="Z" surname="Zhang">
      <organization>Beijing University of Posts and
      Telecommunications</organization>

      <address>
        <postal>
          <street>10 Xitucheng Road, Haidian District</street>

          <city>Beijing</city>

          <code>100876</code>

          <region>Beijing</region>

          <country>China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>2024140574@bupt.cn</email>
      </address>
    </author>

    <date day="1" month="March" year="2025"/>

    <area>IETF Area</area>

    <workgroup>DSMC Working Group</workgroup>

    <keyword>distributed AI, service architecture</keyword>

    <abstract>
      <t>This document describes the distributed AI micromodel computing power
      scheduling service architecture.</t>
    </abstract>
  </front>

  <middle>
    <section anchor="intro" title="Introduction">
      <t>The Distributed AI Micromodel Computing Power Scheduling Service
      Architecture is a structured framework designed to address the
      challenges of scalability, flexibility, and efficiency in modern AI
      systems. By integrating model segmentation, micro-model deployment, and
      microservice orchestration, this architecture enables the effective
      allocation and management of computing resources across distributed
      environments. The primary focus lies in leveraging model segmentation to
      decompose large AI models into smaller, modular micro-models, which are
      executed collaboratively across distributed nodes.</t>

      <t>The architecture is organized into four tightly integrated layers,
      each with distinct roles and responsibilities that together ensure
      seamless functionality:</t>

      <t>Business Layer: This layer acts as the interface between the
      user-facing applications and the underlying system. It encapsulates AI
      capabilities as microservices, enabling modular deployment, elastic
      scaling, and independent version control. By routing user requests
      through service gateways, it ensures efficient interaction with back-end
      micro-models while balancing workloads. The business layer also
      facilitates collaboration between multiple micro-models, allowing them
      to function as part of a cohesive distributed system.</t>

      <t>Control Layer: The control layer is the central coordination hub,
      responsible for task scheduling, resource allocation, and the
      implementation of model segmentation strategies. It decomposes large AI
      models into smaller, manageable components, assigns tasks to specific
      nodes, and ensures synchronized execution across distributed
      environments. This layer dynamically balances compute and network
      resources while adapting to system demands, ensuring high efficiency for
      training and inference workflows.</t>

      <t>Computing Power Layer: As the execution core, this layer translates
      the decisions made by the control layer into distributed computation. It
      executes segmented micro-models on diverse hardware resources such as
      GPUs, CPUs, and accelerators, optimizing parallelism and fault
      tolerance. By coordinating with the control layer, it ensures that tasks
      are executed efficiently while leveraging distributed orchestration
      frameworks to handle diverse workloads.</t>

      <t>Data Layer: The data layer underpins the entire system by managing
      secure storage, access, and transmission of data. It provides the
      necessary datasets, intermediate results, and metadata required for
      executing segmented micro-models. Privacy protection mechanisms, such as
      federated learning and differential privacy, ensure data security and
      compliance, while distributed database operations guarantee consistent
      access and high availability across nodes.</t>

      <t>At the heart of this architecture is model segmentation, which serves
      as the foundation for effectively distributing computation and
      optimizing resource utilization. The control layer breaks down models
      into smaller micro-models using strategies such as layer-based,
      business-specific, or block-based segmentation. These micro-models are
      then deployed as independent services in the business layer, where they
      are dynamically scaled and orchestrated to meet real-time demands. The
      computing power layer executes these tasks using parallel processing
      techniques and advanced scheduling algorithms, while the data layer
      ensures secure and efficient data flow to support both training and
      inference tasks.</t>

      <t>By tightly integrating these layers, the architecture addresses
      critical challenges such as balancing compute and network resources,
      synchronizing distributed micro-models, and minimizing communication
      overhead. This cohesive design enables AI systems to achieve high
      performance, scalability, and flexibility across dynamic and
      resource-intensive workloads.</t>

      <t>This document outlines the design principles, key components, and
      operational advantages of the Distributed AI Micromodel Computing Power
      Scheduling Service Architecture, emphasizing how model segmentation,
      micro-models, and microservices form the foundation for scalable and
      efficient distributed AI systems.</t>
    </section>

    <section title="Conventions used in this document">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119">RFC 2119</xref>.</t>
    </section>

    <section title="Terminology">
      <t>TBD</t>
    </section>

    <section title="Scenarios and requirements">
      <section title="AI Microservice model scenario requirements">
        <t>As artificial intelligence technology evolves at an accelerating
        pace, the scale and intricacy of AI models continue to expand. The
        traditional monolithic application or centralized inference and
        training model is progressively becoming inadequate for swiftly
        changing business demands. Encapsulating AI capabilities within a
        microservices architecture confers substantial advantages in system
        flexibility, scalability, and service governance. By decoupling
        models through microservices, an independent AI model service can
        circumvent the bottlenecks that arise from deep coupling with other
        business logic components, and it can scale elastically during
        surges in requests or training load. Given the rapid iteration and
        upgrade cycles of AI models, a microservice architecture facilitates
        the coexistence of multiple model versions, enables gray-scale
        (canary) releases, and supports rapid rollbacks, thereby minimizing
        the impact on the overall system.</t>

        <t>The computing power requirements of AI microservice models are
        often extremely demanding. On the one hand, the training or
        inference process usually involves massive data processing and
        high-density parallel computing, requiring the collaborative work of
        various hardware resources such as GPUs, CPUs, FPGAs, and NPUs. On
        the other hand, if the model is large or the request volume is high,
        the computing power of a single machine is often insufficient to
        meet business needs; it is then necessary to perform parallel
        computing across multiple nodes in a distributed mode and to release
        resources during idle periods to improve utilization. Such
        distributed training or inference typically relies on efficient
        communication strategies to synchronize model parameters or
        gradients, and methods such as AllReduce or All-to-All are often
        used to reduce communication overhead and ensure model
        consistency.</t>
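
        <t>The following sketch is illustrative only and is not part of this
        specification: it shows how a worker's local gradient can be
        synchronized with AllReduce through the torch.distributed API. A
        single-process "gloo" group is used so that the example is
        self-contained; a real deployment would run one rank per node,
        typically over NCCL or MPI on a high-bandwidth network.</t>

        <figure>
          <artwork><![CDATA[
# Illustrative sketch: AllReduce-based gradient synchronization.
# A one-process "gloo" group is created so the snippet runs locally;
# in a real cluster each rank is a separate worker process.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

grad = torch.tensor([0.1, -0.2, 0.3])       # this worker's local gradient
dist.all_reduce(grad, op=dist.ReduceOp.SUM)  # in-place sum across workers
grad /= dist.get_world_size()                # average so all workers apply
                                             # the same update
print(grad)

dist.destroy_process_group()
]]></artwork>
        </figure>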

        <t>In a distributed system, the network plays a crucial role. A
        large number of model parameters and gradients must be exchanged
        frequently during computation, which places high demands on network
        bandwidth and latency. In large-scale cluster scenarios, the design
        of the network topology and the choice of communication framework
        cannot be ignored. Only in a high-bandwidth, low-latency network
        environment, combined with an appropriate communication library
        (such as NCCL or MPI), can the cluster fully exploit its computing
        potential and avoid communication becoming the bottleneck of overall
        performance.</t>
      </section>

      <section title="Distributed Micro model Service Flow">
        <t>In the distributed AI micro-model computing power scheduling
        service architecture, the core of the business process is how to
        realize the multi-node layout and collaborative work of the model to
        ensure efficient parameter synchronization and communication.
        Typically, a model is trained and evaluated using a deep learning
        framework during development, and then container-ized or mirrored to
        package the model and its dependencies into a service that can be
        deployed independently. Then, these encapsulated model services are
        registered to the system's microservice management platform for
        subsequent unified scheduling and access.</t>

        <t>Once the micro-model is deployed to a distributed cluster,
        computing power orchestration and resource scheduling allocate
        computing resources such as GPUs or CPUs according to real-time
        load, business priority, and hardware topology, and use container
        orchestration tools (such as Kubernetes) to start the corresponding
        service instances on each node. When distributed cooperation is
        needed, frameworks such as NCCL and Horovod are used to complete
        inter-process communication. Requests from upper business systems or
        users usually arrive at an API gateway or service gateway first and
        are then distributed to the target service instances according to
        load balancing or other routing policies. If distributed inference
        is needed, multiple nodes cooperate to perform segmented model
        inference and summarize the results, which are finally returned to
        the requester. In this process, real-time monitoring and an elastic
        scaling mechanism play an important role in ensuring system
        stability and optimizing resource utilization. At the monitoring
        level, through a unified data acquisition and analysis platform, the
        system can track core indicators such as GPU utilization, network
        traffic, and request latency of each service node, raise timely
        alarms in case of failures, performance bottlenecks, or insufficient
        resources, and perform automatic failover or take nodes
        offline.</t>

        <t>In addition, the distributed micro-model business flow needs to
        be combined with a data backflow mechanism. The large volume of
        logs, user feedback, and interaction records generated during
        inference can be further used to train new models or to optimize the
        performance of existing models, provided that it can be returned to
        the data platform in a way that meets privacy and compliance
        requirements.</t>
      </section>
    </section>

    <section title="Key issues and challenges">
      <section title=" Balancing Compute and Network Resources under Constraints">
        <t>With the continuous growth of AI model size and business demand,
        the computing power resources of a single node or single cluster are
        often difficult to support high-intensity training and inference
        tasks, and it is prone to the problem of insufficient computing power
        or sharp rise in cost. Through the distributed architecture to
        coordinate computing resources between multiple nodes and multiple
        regions, it can improve the overall efficiency and fault tolerance to
        a certain extent. However, distributed deployment also brings higher
        complexity, which not only considers the differences of heterogeneous
        hardware (such as GPU, CPU, FPGA, etc.), but also needs to balance the
        allocation of computing power under different network topology and
        bandwidth conditions.</t>

        <t>When computing and network resources are scarce, it is necessary
        to dynamically schedule and allocate computing power according to
        business priority, model scale, and real-time load, and to combine
        strategic queuing, elastic scaling, and cross-cluster resource
        collaboration to improve overall service efficiency. In this
        process, the model partitioning/parallelism scheme plays a key role.
        On the one hand, the model can be decomposed across multiple nodes
        by means of tensor partitioning or pipeline parallelism, with each
        node responsible only for a specific submodule or slice. On the
        other hand, for inference scenarios, the input data can flow through
        a series of model microservice nodes to form a pipelined processing
        mode, making full use of scattered computing resources. Splitting
        the model for parallel execution in this way not only avoids
        excessive computing pressure on a single server, but also maximizes
        the use of the GPU/CPU computing power of idle nodes when network
        resources permit, achieving balance and optimization between compute
        and network resources.</t>
      </section>

      <section title=" Data Collaboration Challenges under Block Isolation">
        <t>In many distributed systems, large-scale data is usually split into
        multiple data blocks, which are stored and processed separately.
        Although this improves data security and processing efficiency, it
        also brings challenges to data coordination. When multiple nodes or
        microservice modules need to share or exchange data, the interface and
        call sequence must be defined in advance, and the consistency and
        concurrency control level must be managed. Especially when different
        data blocks have cross-node dependencies, how to effectively schedule,
        load and distribute data has become one of the key bottlenecks of
        system scalability and computational efficiency.</t>

        <t>A key difficulty lies in synchronizing data across distributed
        nodes while minimizing latency and avoiding bottlenecks. Cross-node
        dependencies require precise scheduling to ensure data arrives at the
        correct location and time without conflicts. As the scale of data and
        the number of nodes grow, the management overhead for maintaining
        these dependencies can increase exponentially, particularly when
        network bandwidth or latency constraints exacerbate delays.
        Additionally, ensuring data consistency across multiple data blocks
        during concurrent access or updates adds another layer of complexity.
        High levels of concurrency can increase the risk of inconsistencies,
        data races, and synchronization issues, demanding advanced mechanisms
        to enforce data integrity.</t>

        <t>Traditional distributed communication strategies, such as AllReduce
        and All-to-All, are widely used and remain effective in addressing
        certain data collaboration needs in training and inference tasks. For
        example, AllReduce is well-suited for data parallel scenarios, where
        all nodes compute on the same model with different data splits, and
        gradients or weights are synchronized via aggregation and broadcast.
        Similarly, All-to-All is valuable in more complex distributed tasks
        that require frequent intermediate data exchanges across nodes.
        However, these methods are not without limitations. As data and system
        complexity grow, they can lead to increased communication overhead,
        especially in scenarios where synchronization is uneven or poorly
        timed.</t>

        <t>The effectiveness of these traditional methods relies on careful
        tuning and precise execution. Improper timing of data exchange can
        lead to long waiting times, underutilization of resources, and even
        data mismatches. Although approaches such as AllReduce and
        All-to-All provide reliable communication frameworks, their
        scalability and efficiency are often limited by challenges such as
        cross-node synchronization, network variation, and system
        heterogeneity. Continuous improvement and innovation in distributed
        communication and data collaboration strategies are therefore needed
        to overcome the challenges posed by block isolation.</t>
      </section>
    </section>

    <section title="Distributed solution based on model segmentation ">
      <t>Based on the key problems and challenges, a distributed AI
      micro-model computing power scheduling service architecture is proposed,
      which can be divided into four layers: business layer, control layer,
      computing power layer, and data layer. The hierarchical relationship is
      shown in Figure 1. The specific architecture diagram is shown in Figure.
      2. The function module can realize the soft cooperation of the control
      layer and the hard isolation of the data layer, and the specific
      structure is shown in Figure 3.</t>

      <figure>
        <artwork name="Fig. 1 Hierarchical relationships"><![CDATA[ ---------------------------------
|          Business layer         |
|                 |               |
|           Control layer         |
|                 |               |
|      Computing power layer      |
|                 |               |
|             Data layer          |
 ---------------------------------]]></artwork>
      </figure>

      <figure>
        <artwork name="Fig. 2  Architecture of computing power scheduling service for distributed AI micromodel"><![CDATA[ -----------------------------------------------------------------------------------------------------------------------------------------------------------
|                       -----------      -----------                                      -----------      -----------                                      |
|                      |Service A/1|    |Service B/1|                                    |Service A/2|    |Service B/2|                                     |
|                       -----|-----      -----|-----                                      -----|-----      -----|-----                                      |
|                            |                |                                                |                |                                           |
|                            |                |                                                |                |                                           |
|                       -----------------------------                                    -----------------------------                                      |
|                      |  Microservices Gateway -1   |                                  |  Microservices Gateway -2   |                                     |
|                       ------------|----------------                                    -----------|-----------------                                      |
|                                   |                                                               |                                                       |
|                              -----|-----                                                     -----|-----                                                  |
|                             | Interface |                                                   | Interface |                                                 |
|                             | address 1 |- - - - - - - - - - - - - - - - - - - - - - - - - -| address 2 |----------------------------------               |
|                              -----\-----                                                     -----/-----            Address caching        |              |
|                                     \                                                            /                                         |              |
|                                       \                                                        /                                           |              |
|           --------------------        --\-------------                          -------------/--       --------------------                |              |
|          | Functional modules |------| Service Router |------------------------| Service Router |-----| Functional modules |               |              |
|           --------------------        -------\--------                          --------/-------       --------------------                |              |
|                                                \                                      /                                                    |              |
|                                                  \                                  /                                            ----------------------   |
|                                                    \                              /                                --------     | Service Registration |  |
                                                       \                          /                                 |  Feign | ---| and Discovery Centre |  |
|                                                        \                      /                                    --------      ----------------------   |
|                                                          \                  /                                                              |              |
|                                                            \              /                                                                |              |
|                                --------------------        --\----------/--                                                                |              |
|                               | Functional modules |------| Service Router |                                                               |              |
|                                --------------------        --------|-------                                                                |              |
|                                                                    |                                                                       |              |
|                                                                    |                                                                       |              |
|                                                               -----|-----                                                                  |              |
|                                                              | Interface |                                           Address caching       |              |
|                                                              | address 3 |-----------------------------------------------------------------               |
|                                                               -----|-----                                                                                 |
|                                                                    |                                                                                      |
|                                                        ------------|----------------                                                                      |
|                                                       |  Microservices Gateway -3   |                                                                     |
|                                                        -----------------------------                                                                      |
|                                                             |                |                                                                            |
|                                                             |                |                                                                            |
|                                                        -----|-----      -----|-----                                                                       |
|                                                       |Service A/3|    |Service B/3|                                                                      |
|                                                        -----------      -----------                                                                       |
|                                                                                                                                                           |
|                                                                                                                                                           |
 -----------------------------------------------------------------------------------------------------------------------------------------------------------]]></artwork>
      </figure>

      <figure>
        <artwork name="Fig. 3 Functional modules"><![CDATA[
                                           RPC  | REST API
                                                |
 -----------------------------------------------|---------------------------------------
|                      -|-* * *-----------------|---------------------|-                |
|                     |        Task         management      module      |               |
|                      -|---|-------------------|-----------------------                |
|                       |   |                   |                                       |
|                 ------    |                   |                                       |
|                |          |                   |                                       |
|     ---|-* * *-|-|--      |         -|-* * *--|---|-----|---                          |
|    |  Asynchronous  |     |        |  AI Model Segmentation |                         |
|    |   task queue   |     |        |     and aggregation    |                         |
|    |     module     |     |        |          module        |                         |
|     ---|-* * *-|-|--      |         -|-* * *--|---|-----|---                          |
|                |          |          |        |   |     |                             |
|                |          |  --------         |   |     |                             |
|                |          | |                 |   |      -------------                |
|               -|-* * *----|-|-                |   |                   |               |
|              | Log management |               |  -|-* * *--|-|---    -|-* * *---|-|-  |
|              |    system      |               | | Fault-tolerant |  | Model storage | |
|               ----------------                | |    mechanism   |  |     module    | |
|                                               |  ----------------    -|-* * *---|-|-  |
|                                               |                                       |
|    Control layer                              |           (Soft collaboration)        |
------------------------------------------------|--------------------------------------- 
                                                |
 -----------------------------------------------|---------------------------------------
|                                               |                                       |
|                                               |                                       |
|                                     -|-* * *--|-----|-|--                             |
|                                    |     Distributed     |                            |
|                                    | unified cooperation |                            |
|                                    |       module        |                            |
|                                     -|-* * *---|----|-|--                             |
|                                                |                                      |
|                                     -|-* * *---|----|-|--                             |
|                                    |   Load balancing    |                            |
|                                    |    and resource     |                            |
|                                    |allocation mechanism |                            |
|                                     -|-* * *---|----|-|--                             |
|                                                |                                      |
|                                     -|-* * *---|----|-|--                             |
|                                    |   Computing power   |                            |
|                                    |  execution  module  |                            |
|                                     -|-* * *---|----|-|--                             |
|                                                |    |                                 |
|                                                |     -------------                    |
|                                                |                  |                   |
|                                     -|-* * *---|----|-|--        -|-* * *----|-|-     |
|                                     |         Data       |      | Fault tolerance|    |
|                                     |  management module |      |  and recovery  |    |
|                                     -|-* * *---|----|-|--       |     module     |    |
|                                                |                 ----------------     |
|                                     -|-* * *---|----|-|--                             |
|                                    |   Computing power   |                            |
|                                    |    resource pool    |                            |
|                                     -|-* * *---|----|-|--                             |
|                                                |                                      |
|  Computing power layer                         |                                      |
|                                                |                                      |
 ------------------------------------------------|--------------------------------------
 ------------------------------------------------|--------------------------------------
|                                                | Packing data                         |
|                                                |                                      |
|                                       -|-* * *-|-|-                                   |
|       Data layer                     |   Database  |                                  |
|                                       -------------               (Hard isolation)    |
 ---------------------------------------------------------------------------------------]]></artwork>
      </figure>

      <section title="Business layer">
        <t>The business layer is the core of the whole system and hosts the
        main business logic and microservice components. It interacts with
        the user-side front-end presentation layer, receives requests from
        various channels, processes them according to models or business
        rules, and returns the results to the upper layer or synchronizes
        them to other microservices. Typically, the business layer is
        deployed on a microservice container platform (such as Kubernetes)
        and is managed by a service gateway or API gateway, with a service
        registration and discovery center maintaining communication and
        load balancing between microservices. Internal communication can use
        RPC, REST APIs, or Feign-based remote calls.</t>

        <section title="Microservices and Micromodels">
          <t>Microservices and micro-models manifest as multiple services
          (e.g., "Service 1", "Service 2", "Service 3", up to "Service n")
          that invoke each other at the business and logical layers. Each
          service encapsulates a separate model or a functional slice of a
          model, and when these services communicate with each other via
          RPC, REST APIs, or an internal event bus, the overall effect of
          distributed micro-model coordination is formed. Through the
          service registration and discovery center, these micro-models can
          automatically discover each other's available instances when
          needed, allowing computing power and network resources to be
          flexibly scaled and balanced in large-scale concurrent
          scenarios.</t>
        </section>

        <section title="Microservice Gateway and API Gateway">
          <t>The microservice gateway and API gateway provide traffic
          scheduling and a unified entry point in the business layer. The
          microservice gateway mainly serves internal service calls;
          through load balancing, routing rules, and security policy
          configuration, it makes communication between business modules
          more efficient and stable.</t>

          <t>The API gateway faces external clients or the front-end layer,
          providing users with a consistent HTTP or gRPC interface. At the
          same time, it is responsible for authentication, rate limiting,
          circuit breaking, and monitoring, ensuring that the impact on
          internal services remains controllable when external requests
          surge.</t>
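
          <t>As a purely illustrative sketch, the following token-bucket
          implementation shows one common way a gateway can enforce the rate
          limiting mentioned above; the capacity and refill rate are example
          values and are not requirements of this architecture.</t>

          <figure>
            <artwork><![CDATA[
# Illustrative token-bucket rate limiter (example values only).
import time

class TokenBucket:
    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec          # tokens added per second
        self.capacity = capacity          # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True                   # request is admitted
        return False                      # request is rejected or queued

bucket = TokenBucket(rate_per_sec=5, capacity=10)
accepted = sum(bucket.allow() for _ in range(100))
print(f"{accepted} of 100 burst requests accepted")
]]></artwork>
          </figure>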
        </section>

        <section title="Service Registration and Discovery Center">
          <t>The service registration and discovery center records the
          network address, version information, and health status of all
          available microservices in the system, so that other modules or
          gateways in the business layer can promptly find the correct
          target instance when they need to call a microservice. For
          example, in a "real-time recommendation and user behavior
          analysis" business, when the "user portrait generation"
          microservice needs to be called, the system first queries the
          registration and discovery center for the service's load status
          and list of available instances, and then selects an appropriate
          node according to the load balancing strategy. This not only
          prevents a single point of failure, but also automatically updates
          routing information as microservice instances are added or
          removed.</t>

          <t>The service registration and discovery center spares business
          function modules from manually maintaining complex service
          addresses and dependencies. Each microservice only needs to
          actively register its own information after startup, and when an
          instance goes offline or crashes, the registry updates its state
          accordingly. Common implementations include Eureka, Consul, and
          ZooKeeper. These registration and discovery centers can be deeply
          integrated with microservice gateways or load-balancing layers to
          achieve highly available governance in distributed
          environments.</t>

          <t>Each service registers its interface address with the registry,
          and a caller finds the interface address of the target service
          through the registry before initiating the call. Interface calls
          are made peer-to-peer; although a registry exists, it only plays
          the role of controlling the call flow and does not relay the calls
          themselves.</t>
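
          <t>The following minimal in-memory sketch illustrates the
          registration, heartbeat, and discovery behavior described above.
          It is a toy example with invented class and field names;
          production systems would rely on Eureka, Consul, or ZooKeeper.</t>

          <figure>
            <artwork><![CDATA[
# Toy in-memory registration and discovery centre (illustrative only).
import random
import time

class ServiceRegistry:
    def __init__(self, ttl_seconds=30):
        self.ttl = ttl_seconds
        self.instances = {}   # service name -> {address: last heartbeat}

    def register(self, name, address):
        self.instances.setdefault(name, {})[address] = time.time()

    def heartbeat(self, name, address):
        self.register(name, address)      # refreshing == re-registering

    def discover(self, name):
        """Return one healthy instance address (random pick as a
        simple load-balancing strategy)."""
        now = time.time()
        healthy = [addr for addr, ts in self.instances.get(name, {}).items()
                   if now - ts < self.ttl]
        return random.choice(healthy) if healthy else None

registry = ServiceRegistry()
registry.register("user-portrait", "10.0.0.5:8080")
registry.register("user-portrait", "10.0.0.6:8080")
print(registry.discover("user-portrait"))
]]></artwork>
          </figure>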
        </section>
      </section>

      <section title="Control layer ">
        <t>The control layer is mainly responsible for scheduling and managing
        various tasks and resources in the distributed AI system, including
        task creation, allocation, exception handling, and key processes such
        as model segmentation, training, and aggregation. Through the fine
        design of the control layer, it can realize the parallel operation of
        multiple models with high efficiency and high availability, and make
        timely scheduling and fault tolerance when the computing power is
        insufficient.</t>

        <section title="Task management module">
          <t>The task management module is the "hub" of the control layer.
          It receives different types of task requests from the business
          layer or data layer, such as model training, model inference, or
          batch data processing, and allocates tasks to nodes for execution
          according to real-time load conditions and computing power
          resource information. The task management module usually maintains
          a task queue or task priority queue, sorting tasks by FCFS
          (first-come, first-served), FIFO (first-in, first-out), or a
          weight-based scheduling policy. Internally, the module interfaces
          with a service registration and discovery center or a resource
          orchestration system (e.g., Kubernetes) to dynamically obtain key
          metrics such as the health, bandwidth, and memory usage of
          available nodes (GPU/CPU). Some advanced implementations also use
          load balancing strategies or node affinity algorithms to choose
          the best placement for tasks and trigger auto-scaling or resource
          recycling when the overall cluster load reaches a threshold.</t>
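
          <t>A minimal sketch of the placement decision described above is
          given below. The metric names, thresholds, and node names are
          assumptions made for illustration and do not prescribe an
          implementation.</t>

          <figure>
            <artwork><![CDATA[
# Illustrative placement rule: pick the least-loaded healthy node
# that satisfies the task's memory requirement (invented metrics).
def pick_node(task, nodes):
    candidates = [n for n in nodes
                  if n["healthy"] and n["free_mem_gb"] >= task["mem_gb"]]
    if not candidates:
        return None              # would trigger queuing or auto-scaling
    return min(candidates, key=lambda n: n["gpu_util"])

nodes = [
    {"name": "gpu-node-1", "healthy": True,  "gpu_util": 0.85, "free_mem_gb": 4},
    {"name": "gpu-node-2", "healthy": True,  "gpu_util": 0.30, "free_mem_gb": 24},
    {"name": "gpu-node-3", "healthy": False, "gpu_util": 0.10, "free_mem_gb": 32},
]
task = {"name": "train-micro-model-7", "mem_gb": 8}
target = pick_node(task, nodes)
print(f"{task['name']} -> {target['name'] if target else 'queued'}")
]]></artwork>
          </figure>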
        </section>

        <section title="Exception task queue module ">
          <t>The exception task queue module plays the role of "fault buffer",
          which is used to capture and store exceptions that occur during the
          execution of tasks. In distributed AI systems, network jitter, node
          failure or data exception often cause some tasks to fail or hang for
          a long time. The exception task queue module is designed to collect
          and isolate these abnormal tasks, so that they do not block the main
          task queue and affect the overall performance. This module
          continuously monitors the error logs and timeouts during the
          training or inference process.When an exception is found, the
          detailed information of the corresponding task (e.g., task ID,
          exception type, execution log, etc.) is transferred to a separate
          exception queue and recorded in the fault tracking system.</t>
        </section>

        <section title="Log management system ">
          <t>The log management module is responsible for tracking all
          critical operations and events during the distributed training,
          inference and scheduling process. This module usually uses a
          centralized log storage and analysis framework to efficiently
          retrieve and aggregate log data even when the system is large. This
          module not only records the timestamps and execution results of
          events such as model segmentation, computing power allocation, and
          communication synchronization, but also collects hardware metrics
          (such as GPU utilization, memory usage, and I/O throughput) of each
          node during execution. When failure symptoms or performance
          bottlenecks are detected in the logs, such as slow training or
          frequent node timeouts, the log management module pushes the
          information to the abnormal task queue module or alert system, which
          assists the operations and Development teams to make timely
          diagnosis and troubleshooting. Through the centralized management
          and visual analysis of log data, it can also provide reliable data
          basis for subsequent model optimization, resource budgeting and
          business decision-making.</t>
        </section>

        <section title="Model segmentation interface">
          <t>This interface is mainly used to receive configuration
          information related to segmentation strategy or algorithm. Through
          this interface, the caller (e.g., a task management module, a
          business layer, or a scheduling system) can specify the splitting
          mode (per layer, per service, per block, etc.) and the corresponding
          parameter restrictions for each policy, such as the range of the
          number of layers to be split, the heuristic rules of the tabu search
          algorithm, the number of shared layers for multiple tasks, and the
          privacy protection requirements. The interface is typically provided
          in the form of a REST API, gRPC, RPC, or messaging middleware,
          giving the upstream system the flexibility to send or update
          policies.</t>
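
          <t>The snippet below shows, purely as a hypothetical example, the
          kind of segmentation policy a caller might submit through this
          interface. All field names are invented for illustration and are
          not defined by this document.</t>

          <figure>
            <artwork><![CDATA[
# Hypothetical segmentation-policy payload (field names invented).
import json

segmentation_policy = {
    "model_id": "resnet50-v2",
    "mode": "layer",              # "layer" | "business" | "block"
    "layer_range": {"min_layers_per_slice": 2, "max_layers_per_slice": 8},
    "search": {"algorithm": "tabu", "max_iterations": 200},
    "shared_layers": 0,           # used by business segmentation
    "privacy": {"differential_privacy": False, "secure_node_only": []},
}

# The policy would typically be sent over REST or gRPC, e.g.:
#   requests.post("https://controller.example/segmentation/policies",
#                 json=segmentation_policy)
print(json.dumps(segmentation_policy, indent=2))
]]></artwork>
          </figure>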
        </section>

        <section title="Model segmentation module">
          <t>Model segmentation is a key innovation in distributed AI
          architectures, offering a more efficient and flexible way to
          allocate computational resources and manage workloads. Within the
          control layer, segmentation strategies are carefully selected based
          on specific objectives, such as improving parallelism, optimizing
          resource utilization, or meeting privacy requirements. These
          strategies are tightly integrated into the system, with each
          segmented component packaged as a modular microservice to ensure
          seamless deployment and operation in distributed environments.
          Figure 4 shows the framework of the model segmentation and
          aggregation module.</t>

          <t>Layer-based segmentation divides a model according to its
          structural hierarchy, segmenting the network layer by layer. Each
          resulting sub-model, typically consisting of one or more layers, is
          assigned to different nodes for parallel execution. This method is
          particularly effective for deep neural networks with significant
          depth and computational complexity. For example, in a deep
          convolutional neural network (CNN) for image classification, the
          initial convolutional layers responsible for extracting features
          might be executed on Node A, the intermediate fully connected layers
          on Node B, and the output classification layer on Node C. To enhance
          efficiency, heuristic or tabu search algorithms can determine
          optimal segmentation points by considering factors like
          computational load, inter-node communication overhead, and overall
          network latency. This strategy is especially valuable in real-time
          inference scenarios, such as autonomous driving, where computational
          throughput and low latency are critical for decision-making.</t>
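
          <t>The following PyTorch sketch is offered only as an illustration
          under simplifying assumptions (a purely sequential model, with cut
          points chosen by hand rather than by heuristic or tabu search). It
          shows how layer-based segmentation can split one model into stages
          that would run on different nodes, with activations passed between
          them.</t>

          <figure>
            <artwork><![CDATA[
# Illustrative layer-based segmentation of a sequential model.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),   # features   -> "Node A"
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 64), nn.ReLU(),                # middle     -> "Node B"
    nn.Linear(64, 10),                           # classifier -> "Node C"
)

# Segmentation points (layer indices); a heuristic or tabu search would
# normally choose these from profiling data.
cut_points = [2, 6]
layers = list(model)
stages = [nn.Sequential(*layers[i:j])
          for i, j in zip([0] + cut_points, cut_points + [len(layers)])]

x = torch.randn(1, 3, 32, 32)
for stage_id, stage in enumerate(stages):
    x = stage(x)   # in a real deployment: send x to the next node
    print(f"stage {stage_id} output shape: {tuple(x.shape)}")
]]></artwork>
          </figure>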

          <t>Business segmentation is usually applied to multi-task learning
          scenarios, where the same "backbone" model is derived into several
          sub-models (or sub-tasks) according to business requirements, and
          the co-training or inference of multiple tasks is realized by
          sharing part of the network structure or parameters. For example, an
          e-commerce platform may care about recommendations, ad click
          prediction, and user personas at the same time, and these
          requirements can be split into different "branches" on the "common
          part" of the same model, which share feature extraction layers, and
          each have task-specific output or fine-tuning layers.</t>

          <t>Block-based segmentation provides maximum flexibility by dividing
          the model into smaller, independent chunks of computation that can
          be executed on separate nodes. Unlike layer-based or business-based
          segmentation, this approach does not adhere to the structural
          hierarchy or task boundaries of the model. Instead, it focuses on
          resource adaptability and efficient computation in heterogeneous
          environments. For example, in a federated learning system for
          healthcare, hospitals can train local model blocks on sensitive
          patient data. These blocks perform their computations securely
          on-site, and only encrypted intermediate results are aggregated
          globally. Similarly, in high-density cloud environments,
          block-based segmentation can dynamically allocate computational
          tasks to available hardware.</t>
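
          <t>Purely as an illustration of the federated pattern mentioned
          above, the sketch below trains a toy least-squares model at three
          simulated sites and aggregates only the parameters; the data,
          learning rate, and number of rounds are invented for the
          example.</t>

          <figure>
            <artwork><![CDATA[
# Toy federated averaging: data never leaves the sites, only parameters do.
import numpy as np

def local_update(weights, local_data, lr=0.1):
    """One gradient step of least-squares regression on a site's data."""
    X, y = local_data
    grad = X.T @ (X @ weights - y) / len(y)
    return weights - lr * grad

rng = np.random.default_rng(0)
sites = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(3)]
global_w = np.zeros(3)

for round_id in range(5):
    local_ws = [local_update(global_w.copy(), data) for data in sites]
    global_w = np.mean(local_ws, axis=0)   # only parameters are shared
    print(f"round {round_id}: global weights = {np.round(global_w, 3)}")
]]></artwork>
          </figure>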

          <t>In addition to the above common segmentation methods, for
          scenarios that need to take into account data privacy or compliance
          requirements, privacy protection logic can also be built into the
          segmentation strategy, such as putting sensitive data related
          calculations into a separate secure node, or performing differential
          privacy processing on gradient information and then aggregating.
          Through the multi-level and multi-angle model segmentation scheme,
          the control layer can maximize the use of distributed computing
          power, and flexibly schedule AI tasks in a multi-business and
          multi-data source environment.</t>

          <figure>
            <artwork name="Fig. 4 AI Model Segmentation and aggregation module"><![CDATA[ ---------------------------------------------------------------------------
|                                -----------------------------------------  |
|    --|-* * *---------|-|--    | Task requests are collected and stored  | |
|   | AI Model Segmentation |   |                       |                 | |
|   |    and aggregation    | --|      The feature algorithm extracts     | |
|   |         module        |   |           the generated features        | |
|    --|-* * *---------|-|--    |                       |                 | |
|                               |          The data matching algorithm    | |
|                               |           performs the task grouping    | |
|    -----------------------    |                       |                 | |
|   | Layer segmentation    |   |                 Model training          | |
|   | Business segmentation |---|                       |                 | |
|   | Block segmentation    |   |         Model parameter aggregation     | |
|    -----------------------     -----------------------------------------  |
 ---------------------------------------------------------------------------]]></artwork>
          </figure>
        </section>

        <section title="Model segmentation scheduling ">
          <t>After model segmentation, the control layer undertakes the key
          task of scheduling the execution of the segmented sub-models.
          Scheduling is more than just assigning tasks to nodes; it must
          optimize collaboration efficiency, minimize resource idleness, and
          reduce data bias across distributed systems. The scheduling process
          requires careful consideration of factors such as task timing,
          resource availability, data dependency, and system load to determine
          the optimal execution order and synchronization strategy for each
          submodel.</t>

          <t>To manage incoming requests effectively, the scheduling algorithm
          must decide how tasks are prioritized and allocated. For instance,
          using a First Come, First Serve (FCFS) strategy ensures that tasks
          are executed in the order they arrive. However, this approach may
          leave some nodes underutilized if tasks vary significantly in
          complexity or resource requirements. To address such inefficiencies,
          advanced scheduling methods like priority queues or dynamic
          insertion algorithms can be employed. These methods prioritize tasks
          based on urgency, computational cost, or value to the system,
          ensuring that high-priority or time-sensitive tasks are assigned
          computational resources more quickly. For example, in a real-time
          fraud detection system, high-risk transactions can be processed
          immediately by prioritizing their execution, while lower-risk
          transactions are queued for later.</t>
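
          <t>The following sketch illustrates the priority-queue behavior
          described above under invented risk scores: a later-arriving
          high-risk task is executed before earlier low-risk tasks, while a
          counter preserves first-come, first-served order within each
          priority level.</t>

          <figure>
            <artwork><![CDATA[
# Illustrative priority queue for task scheduling (invented tasks/scores).
import heapq
import itertools

counter = itertools.count()   # tie-breaker keeps FCFS order per priority
queue = []

def submit(priority, name):
    heapq.heappush(queue, (priority, next(counter), name))

submit(5, "low-risk transaction #1")    # arrives first
submit(5, "low-risk transaction #2")
submit(1, "high-risk transaction #3")   # arrives last, jumps the queue

while queue:
    priority, _, name = heapq.heappop(queue)
    print(f"executing (priority {priority}): {name}")
]]></artwork>
          </figure>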

          <t>At the same time, to ensure correctness and consistency in a
          distributed environment, appropriate communication points must be
          arranged after each step of the partitioned computation to avoid
          data disorder or excessive delay. For scenarios in which training
          or inference is highly time-sensitive, exclusive GPU/CPU nodes can
          be reserved for critical tasks at the scheduling level, or timing
          synchronization mechanisms can be enabled to ensure that all
          sub-models complete their updates and feedback within the same
          iteration cycle.</t>
        </section>

        <section title="Model segmentation aggregation">
          <t>Once all calculations distributed across different nodes or
          sub-models are completed, the intermediate results or parameters
          must be aggregated to produce the final output, whether it is a
          model prediction result or updated model parameters. The aggregation
          module plays a pivotal role in consolidating these outputs into a
          unified result, ensuring consistency and accuracy in distributed AI
          workflows.</t>

          <t>The aggregation process typically employs strategies such as
          voting, weighted averaging, or attention mechanisms to combine the
          outputs of sub-models. For instance, in an ensemble-based
          recommendation system, each sub-model might provide a recommendation
          score, and the aggregation module could compute a weighted average
          based on the performance or confidence of each sub-model. Similarly,
          in distributed neural networks, attention mechanisms can be used to
          assign different importance to outputs from various nodes, enabling
          more precise aggregation based on task-specific contexts. These
          strategies ensure that the aggregated result reflects the strengths
          and contributions of individual sub-models while maintaining overall
          coherence.</t>
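
          <t>As a simple numerical illustration (the scores, confidences,
          and votes below are invented), the following sketch shows weighted
          averaging and majority voting, two of the aggregation strategies
          mentioned above.</t>

          <figure>
            <artwork><![CDATA[
# Illustrative aggregation of sub-model outputs (invented numbers).
import numpy as np

scores      = np.array([0.72, 0.55, 0.90])  # outputs of sub-models A, B, C
confidences = np.array([0.9,  0.4,  0.7])   # per-sub-model weights

weights = confidences / confidences.sum()   # normalise the weights
aggregated = float(np.dot(weights, scores))
print(f"weighted-average score: {aggregated:.3f}")

# Majority vote over class predictions is another option:
votes = np.array([1, 1, 0])                 # class label from each sub-model
print(f"majority vote: class {np.bincount(votes).argmax()}")
]]></artwork>
          </figure>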

          <t>However, aggregation in distributed systems is inherently
          challenging due to the possibility of node failures or delays.
          Network jitter, node outages, or computation delays can prevent
          certain nodes from returning their results in time, potentially
          disrupting the aggregation process. To address this, the control
          layer incorporates fault-tolerant mechanisms such as timeout
          retries, data playback, or redundant computation strategies. For
          example, if a node fails to provide its result within a specified
          time frame, the system might either retry the computation on the
          same node or reassign the task to a different node. In scenarios
          where redundancy is feasible, multiple nodes can perform the same
          computation, ensuring that at least one result is available for
          aggregation.</t>

          <t>The aggregation module also monitors system-wide performance to
          evaluate the trade-off between computational benefits and
          coordination overhead. By refining fault-tolerant logic and
          aggregation strategies, the control layer ensures that the
          advantages of distributed computation&mdash;such as scalability and
          parallelism&mdash;are not offset by excessive synchronization or
          error-handling delays. For example, in large-scale model training,
          the aggregation process might include gradient averaging or
          parameter summation across nodes, with mechanisms to handle delayed
          or missing gradients, ensuring that the global model converges
          effectively despite intermittent node failures.</t>
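
          <t>The sketch below is illustrative and makes the simplifying
          assumption that late or missing gradients are simply dropped for
          the round; it shows how aggregation can proceed over the
          responders only, so that a single slow or failed node does not
          block the global update.</t>

          <figure>
            <artwork><![CDATA[
# Illustrative fault-tolerant gradient aggregation (invented values).
import numpy as np

def aggregate_with_timeout(results):
    """results: mapping node -> gradient array, or None if missing/late."""
    received = [g for g in results.values() if g is not None]
    if not received:
        raise RuntimeError("no gradients received; retry or reassign round")
    return np.mean(received, axis=0)   # average over responders only

round_results = {
    "node-a": np.array([0.10, -0.20, 0.05]),
    "node-b": np.array([0.12, -0.18, 0.07]),
    "node-c": None,   # timed out; handled later by retry or redundancy
}
print(aggregate_with_timeout(round_results))
]]></artwork>
          </figure>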
        </section>
      </section>

      <section title="Computing power layer">
        <t>The computing power layer is the execution core of the distributed
        artificial intelligence system, which converts the strategies and
        decisions of the control layer into actual calculations. This layer
        processes tasks, manages resources, and executes distributed models
        across nodes, ensuring that the computational benefits of model
        segmentation are fully realized. By integrating advanced scheduling,
        resource allocation, and fault tolerance mechanisms, the computing
        power layer ensures the efficient execution of tasks while
        maintaining system stability under dynamic loads.</t>

        <t>The model segmentation strategy at the control layer determines how
        the sub-models or operators are distributed over the nodes. The
        computing power layer, in turn, optimizes resource allocation and
        execution to align with the segmentation design, ensuring that data
        dependencies and computational workflows are effectively managed.
        Through dynamic orchestration, parallel processing, and feedback
        mechanisms, this layer provides high performance and scalability for
        large-scale distributed AI systems.</t>

        <section title="Calculation of micro-model parameters">
          <t>In the micro-model parameter calculation phase, the computing
          power layer receives scheduling instructions from the control
          layer and obtains the aggregated model information provided by the
          distributed unified collaboration module. The input usually
          includes a structural description of the micro-model (e.g.,
          different network topologies such as convolutional networks, DNNs,
          or Transformers) and the corresponding data fragments or data
          blocks. In addition, the computing power layer takes into account
          the requirements of the business layer, such as inference latency,
          training accuracy, and throughput, to pre-allocate and schedule
          resources before execution.</t>

          <t>When the micro-model and data are ready, the computing power
          execution module loads the corresponding operators onto GPUs,
          CPUs, or other hardware acceleration units according to the
          pre-selected computing framework (such as TensorFlow, PyTorch, or
          a self-developed lightweight AI inference engine), and performs
          parallel computation according to the parallelism configuration
          provided by the distributed unified collaboration module. For
          larger convolutional layers or attention mechanisms, the system
          may adopt communication patterns such as AllReduce or All-to-All
          to distribute computing tasks, performing synchronization or
          gradient updates after each iteration completes. For lightweight
          AI models, the computing power layer gives priority to
          fast-responding nodes to meet low-latency application scenarios.
          Throughout this process, the load balancing and resource
          allocation mechanism monitors the load of each resource pool (such
          as "computing power resource pool 1", "computing power resource
          pool 2", etc.) in real time and makes dynamic adjustments when a
          node hits a performance bottleneck or has idle resources, reducing
          calculation waiting time and improving overall throughput.</t>
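
          <t>For illustration, the following Python sketch shows one
          synchronized training step of this kind using torch.distributed.
          It assumes the process group has already been initialized with the
          NCCL backend (one process per GPU, launched externally, for
          example with torchrun); the model, batch, and plain SGD update are
          placeholders rather than components specified by this
          document.</t>

          <figure>
            <artwork><![CDATA[
import torch
import torch.distributed as dist

def synchronized_step(model, batch, targets, loss_fn, lr=0.01):
    """One local step whose gradients are averaged across all nodes."""
    loss = loss_fn(model(batch), targets)
    loss.backward()
    world = dist.get_world_size()
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            # sum the gradient over every node, then average it
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world
            p -= lr * p.grad          # placeholder SGD update
            p.grad.zero_()
    return loss.item()
]]></artwork>
          </figure>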

          <t>When the calculation is finished, the computing power layer
          summarizes the execution of each micro-model, generates records
          covering calculation delay, model metrics (such as loss or
          accuracy), and hardware utilization, and archives these records
          through the data management module in preparation for the next
          distributed computing power parameter update.</t>
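
          <t>The shape of such an archived record might resemble the
          following sketch; the field names and values are purely
          illustrative assumptions and are not defined by this
          architecture.</t>

          <figure>
            <artwork><![CDATA[
import json, time

# Hypothetical execution record handed to the data management module.
record = {
    "micro_model_id": "mm-017",
    "resource_pool": "computing power resource pool 1",
    "calc_delay_ms": 182.4,
    "metrics": {"loss": 0.342, "accuracy": 0.913},
    "hardware_utilization": {"gpu": 0.87, "mem": 0.64},
    "finished_at": time.time(),
}
print(json.dumps(record, indent=2))   # archived for the next update
]]></artwork>
          </figure>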
        </section>

        <section title="Distributed computing power parameter update ">
          <t>In the distributed computing power parameter update stage, the
          computing power layer globally merges and synchronizes the
          intermediate results or model gradients calculated in the previous
          step, and then feeds the updated model parameters back to the
          control layer or data layer. The input usually includes the
          training gradients uploaded by each node, model weight chunks, and
          node health status. The distributed unified collaboration module
          combines fault tolerance and recovery mechanisms to ensure that
          parameters can still be aggregated smoothly when some nodes are
          delayed or fail.</t>

          <t>According to the business requirements and model scale, the
          computing power layer chooses an appropriate parallel
          communication strategy, such as Ring AllReduce, Tree AllReduce, or
          gradient compression followed by aggregation, to reduce network
          bandwidth consumption and accelerate the synchronization of model
          parameters. For large models built on Transformer or attention
          structures, the computing power layer can distribute model
          parameters to different resource pools to be updated in parallel
          with the help of block or pipeline parallelism, and the partial
          results are then collected and summarized at the master node or
          master process.</t>
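
          <t>As a simple illustration of "gradient compression followed by
          aggregation", the sketch below applies top-k sparsification before
          merging; the 1 percent keep ratio and the dense reconstruction
          step are assumptions for this example, not a recommended
          configuration.</t>

          <figure>
            <artwork><![CDATA[
import numpy as np

def compress(grad, ratio=0.01):
    """Keep only the largest-magnitude entries of a flat gradient."""
    k = max(1, int(grad.size * ratio))
    idx = np.argsort(np.abs(grad))[-k:]       # indices of the top-k
    return idx, grad[idx]

def aggregate(compressed, size):
    """Average the sparse contributions into a dense parameter update."""
    total = np.zeros(size)
    for idx, values in compressed:
        total[idx] += values
    return total / len(compressed)

node_grads = [np.random.randn(1000) for _ in range(4)]   # one per node
merged = aggregate([compress(g) for g in node_grads], size=1000)
]]></artwork>
          </figure>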

          <t>After the distributed parameter update is completed, the
          computing power layer sends the final model weights or inference
          engine image back to the control layer to be registered in the
          model warehouse as the "latest version of the model", and may also
          synchronize some intermediate features or labels to the data layer
          for subsequent analysis. At the same time, the fault tolerance and
          recovery module evaluates the stability and performance of each
          node based on the monitoring data collected during training and
          updating, providing a decision basis for the next iteration cycle
          or for scheduling new tasks.</t>
        </section>

        <section title="Distributed unified Collaboration module">
          <t>The distributed unified collaboration module sits at the core
          of the entire computing power layer. It is responsible for
          receiving and integrating task instructions from the control layer
          (such as the model segmentation strategy and the training or
          inference goals) and for interfacing effectively with the
          underlying computing power resource pools. Its inputs include
          information about the architecture of the individual micro-models
          or aggregated models, the type of computation to be performed
          (training or inference), and an overview of the hardware available
          in the current cluster. Its output is a global orchestration
          instruction for computing resources and computing processes, which
          guides the computing power execution module and the other
          functional modules to work together. A distributed unified
          collaboration module will typically work with a service registry
          or cluster orchestration system (e.g., Kubernetes, YARN), or may
          have a built-in distributed communication framework (e.g., NCCL,
          Horovod) to manage and synchronize multiple GPUs or multiple
          nodes. Its most prominent feature is that it can dynamically map
          different sub-models or operators to the most appropriate nodes
          according to the model block information and computing
          requirements, so that distributed computation maintains high
          throughput and scalability in multi-task, multi-model
          environments.</t>
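
          <t>A hypothetical sketch of such a mapping decision is given
          below: each sub-model is placed greedily on a matching node with
          the most free memory. The node and sub-model descriptors and the
          greedy policy are invented for illustration and do not represent a
          defined interface of the collaboration module.</t>

          <figure>
            <artwork><![CDATA[
def place_sub_models(sub_models, nodes):
    """Return {sub_model_id: node_id} placement decisions."""
    placement = {}
    for sm in sorted(sub_models, key=lambda s: s["mem_gb"], reverse=True):
        candidates = [n for n in nodes
                      if n["device"] == sm["device"]
                      and n["free_mem_gb"] >= sm["mem_gb"]]
        best = max(candidates, key=lambda n: n["free_mem_gb"])
        best["free_mem_gb"] -= sm["mem_gb"]        # reserve the memory
        placement[sm["id"]] = best["id"]
    return placement

sub_models = [{"id": "attention-block", "device": "gpu", "mem_gb": 8},
              {"id": "embedding", "device": "gpu", "mem_gb": 4}]
nodes = [{"id": "node-1", "device": "gpu", "free_mem_gb": 10},
         {"id": "node-2", "device": "gpu", "free_mem_gb": 16}]
print(place_sub_models(sub_models, nodes))
# -> {'attention-block': 'node-2', 'embedding': 'node-1'}
]]></artwork>
          </figure>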
        </section>

        <section title="Load balancing and resource allocation mechanism ">
          <t>The load balancing and resource allocation mechanism monitors
          the load of each computing resource pool (such as GPU clusters,
          CPU clusters, heterogeneous accelerators, etc.) in real time and,
          combined with the task scheduling strategy given by the
          distributed unified collaboration module, decides how to
          distribute the computing load between nodes. The input mainly
          consists of the real-time status of each node (idle capacity, free
          memory, computing power utilization) and a description of the
          hardware requirements of the task to be assigned (e.g., how many
          GPUs are needed, whether mixed-precision training is supported).
          The output is the specific node allocation scheme and task routing
          instructions, which guide the computing power execution module to
          deliver computing tasks to the optimal location.</t>
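
          <t>The following sketch illustrates one possible scoring policy
          that turns such node status information into a routing
          instruction; the field names and the weighting of idle capacity
          versus free memory are assumptions made only for this
          example.</t>

          <figure>
            <artwork><![CDATA[
def route_task(task, node_status):
    """Pick a node with enough free GPUs and the best idle score."""
    eligible = [n for n in node_status
                if n["free_gpus"] >= task["gpus_needed"]
                and (n["mixed_precision"] or not task["needs_amp"])]

    def score(n):
        return 0.6 * (1 - n["utilization"]) + 0.4 * n["free_mem_frac"]

    best = max(eligible, key=score)
    return {"task": task["id"], "node": best["id"]}  # routing instruction

nodes = [{"id": "n1", "free_gpus": 2, "utilization": 0.7,
          "free_mem_frac": 0.5, "mixed_precision": True},
         {"id": "n2", "free_gpus": 4, "utilization": 0.3,
          "free_mem_frac": 0.8, "mixed_precision": True}]
print(route_task({"id": "t-42", "gpus_needed": 2, "needs_amp": True},
                 nodes))   # -> {'task': 't-42', 'node': 'n2'}
]]></artwork>
          </figure>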
        </section>

        <section title="Computing power execution module ">
          <t>According to the instructions from the distributed unified
          collaboration module and the load balancing module, the computing
          power execution module loads the specific micro-model or operator
          onto the corresponding node to run. The inputs include model
          parameters, network topology, and data blocks; the outputs are the
          computed inference results or intermediate training gradients. The
          module can run on multiple servers through containerization (e.g.,
          Docker, Kubernetes Pods) and work with AI frameworks (TensorFlow,
          PyTorch, etc.) or self-developed inference engines to flexibly
          switch the execution environment and underlying computing
          power.</t>
        </section>

        <section title="Data management module ">
          <t>The data management module exchanges the necessary features,
          labels, and metadata with the control layer and the business
          layer. Its input sources usually include data sets that have
          already been chunked or segmented, as well as intermediate results
          generated during model execution (e.g., local gradients, temporary
          features). Its outputs are updated snapshots of model parameters
          or preprocessed feature data for later use. The data management
          module can support highly concurrent reads and writes with the
          help of a distributed file system (HDFS), object storage (e.g.,
          S3), or message queues (Kafka, RabbitMQ), and handles small-scale,
          high-frequency data queries with database or cache systems.</t>
        </section>

        <section title="Fault tolerance and recovery module">
          <t>The fault tolerance and recovery module continuously monitors
          the heartbeat, load, and network status of each node while the
          system is running. Once an anomaly is detected, the fault
          information is reported to the distributed unified collaboration
          module and the automatic fault-tolerance logic is triggered. The
          inputs are real-time cluster health data, task execution logs, and
          node failure reports. The output is a series of decision
          instructions, including restarting tasks, reallocating resources,
          or rolling back to the last stable snapshot. This often relies on
          self-healing driven by automation tooling (Ansible, SaltStack,
          etc.) or cluster orchestration (Kubernetes), or it may use a
          checkpoint-and-resume training process that records the current
          iteration number and intermediate parameters when a crash occurs
          and waits for the node to recover before continuing
          execution.</t>
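
          <t>A minimal sketch of heartbeat-based failure detection is shown
          below; the two-second heartbeat interval, the three-missed-beats
          threshold, and the report_failure() callback are assumptions for
          illustration rather than values prescribed by this document.</t>

          <figure>
            <artwork><![CDATA[
import time

HEARTBEAT_INTERVAL_S = 2.0
MISSED_BEATS_LIMIT = 3

def detect_failures(last_heartbeat, report_failure, now=None):
    """Report every node whose last heartbeat is older than the limit."""
    now = now if now is not None else time.time()
    deadline = MISSED_BEATS_LIMIT * HEARTBEAT_INTERVAL_S
    failed = [node for node, ts in last_heartbeat.items()
              if now - ts > deadline]
    for node in failed:
        report_failure(node)   # collaboration module decides: restart,
    return failed              # reallocate, or roll back to a snapshot

beats = {"node-1": time.time(), "node-2": time.time() - 30.0}
print(detect_failures(beats, report_failure=lambda n: None))  # ['node-2']
]]></artwork>
          </figure>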
        </section>

        <section title="Computing resource pool ">
          <t>A computing resource pool represents a collection of underlying
          hardware that actually provides computing power. Each pool may
          correspond to a different type or specification of hardware, such
          as GPU server farms, CPU clusters, FPGA/ASIC accelerator cards, or
          even hybrid computing power spanning cloud and local data centers.
          Its inputs are usually the task assignments and model execution
          requirements issued by the load balancing and resource allocation
          mechanism; its outputs are the inference results or training
          outputs produced by the computation, together with relevant
          performance indicators (such as temperature, power consumption,
          and throughput) that are fed back to upper modules for
          analysis.</t>
        </section>
      </section>

      <section title="Data layer">
        <t>The data layer is the backbone of distributed AI systems, enabling
        efficient data management while ensuring privacy protection,
        scalability, and seamless integration with other layers, including
        control, computing, and business layers. It plays a pivotal role in
        storing, transmitting, and processing diverse datasets, supporting
        distributed training, inference, and model segmentation workflows.
        Through its robust design, the data layer balances security and
        performance while maintaining the flexibility required by dynamic,
        large-scale AI systems.</t>

        <section title="Privacy protection">
          <t>Privacy protection is at the core of the data layer, ensuring
          secure data handling across the entire AI workflow. Multiple
          databases (e.g., DB1, DB2, ..., DBn) store datasets from various
          business domains or sensitivity levels, enabling the system to
          manage and segregate data efficiently. For high-sensitivity
          scenarios, such as healthcare or financial applications, only
          encrypted or desensitized data fields are stored and transmitted.
          For instance, patient medical records might be encrypted locally,
          and only aggregated gradients or anonymized insights are shared
          during federated learning tasks.</t>

          <t>When the system executes model training or inference, the control
          layer determines the appropriate data transmission strategy based on
          predefined privacy policies. Federated learning ensures that raw
          data remains localized, sharing only intermediate model gradients or
          parameters, while differential privacy adds noise to data or
          computations to prevent individual information leakage.</t>
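
          <t>As a minimal illustration of differential privacy injection,
          the sketch below clips a local gradient and adds Gaussian noise
          before it leaves the node; the clipping norm and noise multiplier
          are illustrative assumptions, not recommended parameter
          values.</t>

          <figure>
            <artwork><![CDATA[
import numpy as np

def privatize_gradient(grad, clip_norm=1.0, noise_multiplier=1.1,
                       rng=np.random.default_rng()):
    """Clip the gradient to a fixed norm, then add Gaussian noise."""
    norm = np.linalg.norm(grad)
    clipped = grad * min(1.0, clip_norm / (norm + 1e-12))
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=grad.shape)
    return clipped + noise   # only this noisy gradient is shared

local_grad = np.random.randn(16)
shared = privatize_gradient(local_grad)   # safe to transmit off-node
]]></artwork>
          </figure>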

          <t>To further strengthen security, the data layer integrates
          advanced privacy-preserving technologies, such as homomorphic
          encryption, multi-party secure computation, and differential privacy
          injection. These techniques enable micro-models and segmented
          workflows to process data securely while complying with privacy
          regulations. For instance, in a cross-database integration scenario,
          the data layer ensures that access control policies and metadata
          updates prevent unauthorized sharing of sensitive data, maintaining
          compliance without hindering system performance.</t>
        </section>

        <section title="Database maintenance and update">
          <t>The data layer's database infrastructure ensures reliable
          storage, high availability, and scalability, supporting the
          execution of micro-models and model segmentation workflows.
          Distributed databases are deployed to manage datasets associated
          with various system segments, enabling parallel operations and
          efficient data provisioning for training and inference tasks.</t>

          <t>To handle high-concurrency environments, the data layer
          leverages distributed database architectures such as NoSQL,
          NewSQL, and relational databases, each selected based on the
          nature of the workload: <list style="symbols">
              <t>NoSQL databases (e.g., HBase, Cassandra) are ideal for
              handling unstructured or semi-structured data, such as logs
              and user behavior data, offering high write throughput and
              horizontal scalability.</t>

              <t>NewSQL systems (e.g., TiDB) provide a hybrid solution,
              balancing transactional consistency with scalability, making
              them suitable for workloads requiring real-time updates, such
              as model parameter synchronization.</t>

              <t>Relational databases (e.g., MySQL, PostgreSQL) handle
              structured datasets, such as model version histories or
              feature engineering outputs, ensuring strong consistency and
              query efficiency.</t>
            </list></t>

          <t>The data layer ensures data consistency and fault tolerance
          through mechanisms such as master-slave replication, shard-based
          architectures, and automated failover. For example, if a database
          shard responsible for storing training gradients becomes
          unavailable, the system redirects queries to backup replicas or
          initiates a failover process to restore service. Regular incremental
          backups and disaster recovery protocols safeguard critical data
          against long-term loss due to network or hardware failures.</t>

          <t>Real-time monitoring tools, such as Prometheus and ELK Stack,
          track database performance metrics, including query latency,
          synchronization delays, and disk usage. If anomalies are detected,
          automated alerts trigger recovery actions such as reallocating
          workloads, rerouting queries, or scaling database resources to
          prevent bottlenecks. For instance, during a high-demand scenario
          like a shopping festival, the data layer may dynamically scale up
          storage resources to accommodate surging user activity logs,
          ensuring uninterrupted data availability for recommendation
          models.</t>
        </section>
      </section>
    </section>

    <section anchor="iana" title="IANA Considerations">
      <t>TBD</t>
    </section>

    <section title="Acknowledgement">
      <t>TBD</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.2119"?>
    </references>


  </back>
</rfc>
