Publications
* denotes equal contribution.
2025
- [SIGCOMM] CEIO: A Cache-Efficient Network I/O Architecture for NIC-CPU Data Paths. Bowen Liu*, Xinyang Huang*, Qijing Li, Zhuobin Huang, Yijun Sun, Wenxue Li, Junxue Zhang, Ping Yin, and Kai Chen. In 39th ACM Special Interest Group on Data Communication (SIGCOMM 2025), 2025.
An efficient Input/Output (I/O) data path between NICs and CPUs/DRAMs is critical for supporting datacenter applications that require high-performance network transmission, especially as link speeds scale to 100 Gbps and beyond. Traditional I/O acceleration strategies, such as Data Direct I/O (DDIO) and Remote Direct Memory Access (RDMA), perform suboptimally due to inefficient utilization of the Last-Level Cache (LLC). This paper presents CEIO, a novel cache-efficient network I/O architecture that employs proactive rate control and elastic buffering to achieve zero LLC misses in the I/O data path while ensuring the effectiveness of DDIO and RDMA under various network conditions. We have implemented CEIO on commodity SmartNICs and incorporated it into the widely used DPDK and RDMA libraries. Experiments with a well-optimized RPC framework and a distributed file system under realistic workloads demonstrate that CEIO achieves up to 2.9x higher throughput and 1.9x lower P99.9 latency than prior work.
- [SIGCOMM] Revisiting RDMA Reliability for Lossy Fabrics. Wenxue Li, Xiangzhou Liu, Yunxuan Zhang, Zihao Wang, Wei Gu, Gaoxiong Zeng, Shoushou Ren, Xinyang Huang, Zhenghang Ren, Bowen Liu, Junxue Zhang, and Kai Chen. In 39th ACM Special Interest Group on Data Communication (SIGCOMM 2025), 2025.
- [ATC] FLB: Fine-grained Load Balancing for Lossless Datacenter Networks. Jinbin Hu, Wenxue Li, Xiangzhou Liu, Junfeng Wang, Bowen Liu, Ping Yin, Jianxin Wang, Jiawei Huang, and Kai Chen. In 2025 USENIX Annual Technical Conference (ATC 2025), 2025.
Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE) cooperating with Priority Flow Control (PFC) has been widely deployed in production datacenters to enable low-latency, lossless transmission. At the same time, modern datacenters typically offer parallel transmission paths between any pair of end-hosts, underscoring the importance of load balancing. However, the well-studied load balancing mechanisms designed for lossy datacenter networks (DCNs) are ill-suited for such lossless environments. Through extensive experiments, we are among the first to comprehensively inspect the interactions between PFC and load balancing, and we uncover that existing fine-grained rerouting schemes can be counterproductive: they spread congested flows across more paths, further aggravating PFC’s head-of-line (HoL) blocking. Motivated by this, we present FLB, a Fine-grained Load Balancing scheme for lossless DCNs. At its core, FLB employs threshold-free rerouting to effectively balance traffic load and improve link utilization under normal conditions, and leverages timely congested-flow isolation to eliminate HoL blocking on non-congested flows when congestion occurs. We have fully implemented an FLB prototype, and our evaluation results show that FLB reduces the PFC PAUSE rate by up to 96% and avoids HoL blocking, translating to up to 45% improvement in goodput over CONGA+DCQCN and 40%, 36%, 29%, and 18% reduction in average flow completion time (FCT) over LetFlow+Swift, MP-RDMA, Proteus+DCQCN, and LetFlow+PCN, respectively.
- [APNet] Cache-Aware I/O Rate Control for RDMA. Qijing Li, Xinyang Huang, Bowen Liu, Pengbo Li, Junxue Zhang, and Kai Chen. In 9th Asia-Pacific Workshop on Networking (APNet 2025), 2025.
Remote Direct Memory Access (RDMA) has become a cornerstone technology in modern datacenter networks due to its high throughput and extremely low latency. However, recent works have revealed that congestion arises in the "last mile" of the RDMA I/O path, between DRAM and CPU registers, due to inefficiencies in the memory hierarchy, where severe cache misses and memory bandwidth contention degrade performance. We identify the root cause of this I/O congestion as the speed mismatch between network ingress and CPU processing, which leads to data accumulation and, eventually, Last-Level Cache (LLC) overflow. To address this issue, we propose RhyR, a credit-based rate control mechanism that dynamically aligns network ingress speed with CPU processing speed. Our preliminary evaluation on eRPC over RDMA, a widely used RPC framework, demonstrates that RhyR effectively mitigates I/O congestion, reducing tail latency by up to 1.40x and improving throughput by up to 1.35x compared to prior work.
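To make the credit-based idea above concrete, here is a minimal sketch of generic credit-based rate control in Python. It is not RhyR's implementation; the `CreditReceiver` class, the `LLC_BUDGET` value, and the grant/consume API are hypothetical and only illustrate the general principle that the sender may push only as much data as the CPU has already drained, keeping un-consumed data within a cache-resident budget.

```python
# Toy model of credit-based I/O rate control (illustrative only; not RhyR's code).
# The NIC/sender may only transmit bytes it has been granted credit for, and credit
# is replenished only after the CPU consumes data, so the amount of un-consumed
# data resident in the cache never exceeds LLC_BUDGET bytes.

LLC_BUDGET = 2 * 1024 * 1024  # assumed cache-resident budget in bytes (hypothetical value)


class CreditReceiver:
    def __init__(self, budget: int = LLC_BUDGET):
        self.budget = budget
        self.outstanding = 0  # bytes granted to the sender but not yet consumed by the CPU

    def grant(self, requested: int) -> int:
        """Grant up to `requested` bytes of credit, capped by the remaining budget."""
        credit = min(requested, self.budget - self.outstanding)
        self.outstanding += credit
        return credit

    def consume(self, nbytes: int) -> None:
        """The CPU has processed `nbytes`; return that much credit to the pool."""
        self.outstanding = max(0, self.outstanding - nbytes)


# Usage: the sender asks for credit before each burst; if the CPU stalls, grants
# shrink toward zero and network ingress is paced to match CPU processing speed.
rx = CreditReceiver()
sent = rx.grant(64 * 1024)   # sender may transmit `sent` bytes now
rx.consume(sent)             # CPU drains the data, freeing credit for the next burst
```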
- [OSDI] Enabling Efficient GPU Communication over Multiple NICs with FuseLink. Zhenghang Ren, Yuxuan Li, Zilong Wang, Xinyang Huang, Wenxue Li, Kaiqiang Xu, Xudong Liao, Yijun Sun, Bowen Liu, Han Tian, Junxue Zhang, Mingfei Wang, Zhizhen Zhong, Guyue Liu, Ying Zhang, and Kai Chen. In Proceedings of the 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2025), 2025.
Machine learning (ML) clusters stack multiple network interface cards (NICs) within each server to improve GPU communication bandwidth. However, existing systems fall short of fully utilizing these NICs because they statically bind GPU traffic to NICs and are constrained by PCIe bottlenecks. This leads to suboptimal performance under imbalanced traffic, such as when GPUs serve different LLM requests or train models with varying communication patterns. We propose FuseLink to enable efficient GPU communication over multiple NICs. FuseLink extends the inter-server network by integrating high-speed intra-server connections and leverages GPUs to efficiently relay traffic to idle NICs. We implement FuseLink and integrate it into NCCL, so that ML applications can use FuseLink seamlessly without code modifications. Compared to NCCL with PXN, FuseLink achieves 212 GBps bandwidth between two inter-server GPUs and speeds up first-token generation in LLM serving by 1.06-2.89x, mixture-of-experts (MoE) training by up to 1.3x, and recommendation model training by up to 1.2x.
- [S&P] Edge Unlearning is Not "on Edge"! An Adaptive Exact Unlearning System on Resource-Constrained Devices. Xiaoyu Xia, Ziqi Wang, Ruoxi Sun, Bowen Liu, Ibrahim Khalil, and Minhui Xue. In 46th IEEE Symposium on Security and Privacy (S&P 2025), 2025.
The right to be forgotten mandates that machine learning models enable the erasure of a data owner’s data and information from a trained model. Removing data from the dataset alone is inadequate, as machine learning models can memorize information from the training data, increasing the potential privacy risk to users. To address this, multiple machine unlearning techniques have been developed and deployed. Among them, approximate unlearning is a popular solution, but recent studies report that its unlearning effectiveness is not fully guaranteed. Another approach, exact unlearning, tackles this issue by discarding the data and retraining the model from scratch, but at the cost of considerable computational and memory resources. However, not all devices have the capability to perform such retraining. In numerous machine learning applications, such as edge devices, Internet-of-Things (IoT) devices, mobile devices, and satellites, resources are constrained, posing challenges for deploying existing exact unlearning methods. In this study, we propose a Constraint-aware Adaptive Exact Unlearning System at the network Edge (CAUSE), an approach to enabling exact unlearning on resource-constrained devices. Aiming to minimize retraining overhead by storing sub-models on the resource-constrained device, CAUSE innovatively applies a Fibonacci-based replacement strategy and adaptively updates the number of shards in the user-based data partition process. To further improve memory efficiency, CAUSE leverages model pruning to save memory via compression with minimal accuracy sacrifice. The experimental results demonstrate that CAUSE significantly outperforms other representative systems in realizing exact unlearning on resource-constrained devices by 9.23%-80.86% in unlearning speed, 66.21%-83.46% in energy consumption, and 5.26%-194.13% in accuracy.
- [AAAI] PFedCS: A Personalized Federated Learning Method for Enhancing Collaboration among Similar Classifiers. Siyuan Wu, Yongzhe Jia, Bowen Liu, Haolong Xiang, Xiaolong Xu, and Wanchun Dou. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI 2025), 2025.
Personalized federated learning (PFL) has recently gained significant attention for its ability to address traditional federated learning’s (FL) poor convergence on highly heterogeneous data and its lack of personalized solutions. Existing mainstream approaches either perform personalized aggregation based on a specific model architecture to leverage global knowledge or achieve personalization by exploiting client similarities. However, the former overlooks the discrepancies in client data distributions by indiscriminately aggregating all clients, while the latter lacks fine-grained collaboration among classifiers relevant to local tasks. In view of these challenges, we propose PFedCS, a Personalized Federated learning method for Enhancing Collaboration among Similar Classifiers, which aims to improve each client’s accuracy on its local tasks. Concretely, it leverages awareness of client classifier similarities to address the above problems. By iteratively measuring the distance between clients’ classifier parameters and clustering with each client as a cluster center, the central server adaptively identifies collaborating clients with similar data distributions. In addition, a distance-constrained aggregation method is designed to generate customized collaborative classifiers that guide local training. Extensive experimental evaluations on three datasets demonstrate that our method achieves state-of-the-art performance.
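As a rough illustration of the classifier-similarity idea described above (not PFedCS's actual algorithm), the sketch below clusters clients by the pairwise distance between their flattened classifier parameters, treating each client as its own cluster center, and builds a collaborative classifier per client as a distance-weighted average of its similar neighbours. The function name, the distance threshold `tau`, and the inverse-distance weighting are all assumptions made for illustration.

```python
# Hypothetical sketch of similarity-aware classifier aggregation (not PFedCS's code).
import numpy as np

def collaborative_classifiers(client_params, tau=1.0):
    """client_params: list of 1-D numpy arrays (flattened classifier weights per client).
    Returns one aggregated classifier per client, averaging only over "similar" clients
    (pairwise L2 distance <= tau), weighted inversely by distance."""
    P = np.stack(client_params)                                      # (n, d)
    dists = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)   # (n, n) pairwise L2
    out = []
    for i in range(len(client_params)):
        similar = np.where(dists[i] <= tau)[0]   # client i acts as its own cluster center
        w = 1.0 / (1.0 + dists[i, similar])      # closer clients receive larger weights
        w = w / w.sum()
        out.append((w[:, None] * P[similar]).sum(axis=0))
    return out

# Usage with three toy clients whose classifiers have 4 parameters each:
clients = [np.zeros(4), np.ones(4) * 0.1, np.ones(4) * 5.0]
personalized = collaborative_classifiers(clients, tau=1.0)  # the distant client stays mostly on its own
```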
2024
- [TMC] EdgeShield: Enabling collaborative DDoS mitigation at the edge. Xiaoyu Xia, Feifei Chen, Qiang He, Ruikun Luo, Bowen Liu, Caslon Chua, Rajkumar Buyya, and Yun Yang. In IEEE Transactions on Mobile Computing (TMC 2024), 2024.
Edge computing (EC) enables low-latency services by pushing computing resources to the network edge. Due to the geographic distribution and limited capacities of edge servers, EC systems face the challenge of edge distributed denial-of-service (DDoS) attacks. Existing systems designed to fight cloud DDoS attacks cannot mitigate edge DDoS attacks effectively due to new attack characteristics. In addition, those systems are typically activated only after an attack has been detected, which is not always realistic in EC systems. DDoS mitigation needs to be cohesively integrated with workload migration at the edge to ensure timely responses to edge DDoS attacks. In this paper, we present EdgeShield, a novel DDoS mitigation system that leverages edge servers’ computing resources collectively to defend against edge DDoS attacks without the need for attack detection. Aiming to maximize system throughput over time without causing significant service delays, EdgeShield monitors service delays and migrates workloads across an EC system with adaptive mitigation strategies. The experimental results show that EdgeShield significantly outperforms state-of-the-art solutions in both system throughput and service delay.
2022
- [UIC] An Intelligent Resource Scheduling Method With Edge Channel Deployment for BPM. Bowen Liu, Wanchun Dou, Xiaokang Zhou, Xuyun Zhang, Lianyong Qi, Fei Dai, and Chaochao Chen. In 19th IEEE International Conference on Ubiquitous Intelligence and Computing (UIC 2022, Outstanding Paper Award), 2022.
Edge computing is a novel computing paradigm that offers various kinds of resources at the network edge. In edge computing, terminal users connect to edge servers via wireless networks, and each wireless link contains multiple channels. These wireless channels are a limited resource, and different channels have different costs and service abilities. The dynamic changes in users’ status make it hard to find an appropriate channel deployment method that satisfies BPM requirements. With this observation, it is a tricky challenge to trade off system cost (rental price) against service ability (number of served users). In view of this challenge, an intelligent resource scheduling method, named EdgeIRS, is proposed in this paper. EdgeIRS accommodates as many users as possible at the edge while minimizing the cost of deploying channel resources in an online manner. Its performance is analyzed theoretically, and experiments verify the superiority of the method.