Ennan Zhai (翟恩南)

Senior Staff Engineer / Research Scientist
Network Research at Alibaba Cloud

Email: ennan.zhai (at) alibaba-inc.com

I am currently a Director of Network Research at Alibaba Cloud. My research focuses on building high-performance, reliable networking systems using techniques from programming languages, verification, and programmable hardware.

Prior to joining Alibaba, I was a research scientist and lecturer in the Computer Science Department at Yale University until June 2018. During that time, I worked with Ruzica Piskac, Mahesh Balakrishnan, and Avi Silberschatz on building cloud failure auditing systems, and with Joan Feigenbaum on tracking-resistant anonymous systems. I was also an instructor for the Building Distributed Systems course.

I received my Ph.D. from Yale University in 2015, under the guidance of Bryan Ford. My dissertation focused on building the first cloud-reliability auditing system, Independence-as-a-Service (INDaaS), which proactively detects deep, unexpected dependencies that could cause cloud-scale correlated failures. This work was published in OSDI'14.



Selected Publications  (My full publication list)

NSDI'25 SimAI: Unifying Architecture Design and Performance Tuning for Large-Scale Large Language Model Training with Scalability and Precision.
Xizheng Wang, Qingxu Li, Yichi Xu, Gang Lu, Dan Li, Li Chen, Heyang Zhou, Linkang Zheng, Sen Zhang, Yikai Zhu, Yang Liu, Pengcheng Zhang, Kun Qian, Kunling He, Jiaqi Gao, Ennan Zhai, Dennis Cai, and Binzhang Fu.
22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI'25), Apr 2025.
[PDF]. [Source Code]
SIGCOMM'24 Alibaba HPN: A Data Center Network for Large Language Model Training.
Kun Qian, Yongqing Xi, Jiamin Cao, Jiaqi Gao, Yichi Xu, Yu Guan, Binzhang Fu, Xuemei Shi, Fangbo Zhu, Rui Miao, Chao Wang, Peng Wang, Pengcheng Zhang, Xianlong Zeng, Eddie Ruan, Zhiping Yao, Ennan Zhai, and Dennis Cai.
ACM SIGCOMM (SIGCOMM'24), Aug 2024.
[PDF]
SIGCOMM'24 Crux: GPU-Efficient Communication Scheduling for Deep Learning Training.
Jiamin Cao, Yu Guan, Kun Qian, Jiaqi Gao, Wencong Xiao, Jianbo Dong, Binzhang Fu, Dennis Cai, and Ennan Zhai.
ACM SIGCOMM (SIGCOMM'24), Aug 2024.
Best Paper Honorable Mention.
[PDF]. [LLM Job Dataset].
SIGCOMM'24 A General and Efficient Approach to Verifying Traffic Load Properties under Arbitrary k Failures.
Ruihan Li, Yifei Yuan, Fangdan Ye, Mengqi Liu, Ruizhen Yang, Yang Yu, Tianchen Guo, Qing Ma, Xianlong Zeng, Chenren Xu, Dennis Cai, and Ennan Zhai.
ACM SIGCOMM (SIGCOMM'24), Aug 2024.
[PDF]
SIGCOMM'24 Relational Network Verification.
Xieyang Xu, Yifei Yuan, Zachary Kincaid, Arvind Krishnamurthy, Ratul Mahajan, David Walker, and Ennan Zhai.
ACM SIGCOMM (SIGCOMM'24), Aug 2024.
[PDF]. [Code and Dataset].
OSDI'24 Burstable Cloud Block Storage with Data Processing Units.
Junyi Shu, Kun Qian, Ennan Zhai, Xuanzhe Liu, and Xin Jin.
18th USENIX Symposium on Operating Systems Design and Implementation (OSDI'24), Jul, 2024.
[PDF]
NSDI'24 Sirius: Composing Network Function Chains into P4-Capable Edge Gateways.
Jiaqi Gao, Jiamin Cao, Yifan Li, Mengqi Liu, Ming Tang, Dennis Cai, and Ennan Zhai.
21st USENIX Symposium on Networked Systems Design and Implementation (NSDI'24), Apr, 2024.
[PDF]
NSDI'24 Reasoning about Network Traffic Load Property at Production Scale.
Ruihan Li, Fangdan Ye, Yifei Yuan, Ruizhen Yang, Bingchuan Tian, Tianchen Guo, Hao Wu, Xiaobo Zhu, Zhongyu Guan, Qing Ma, Xianlong Zeng, Chenren Xu, Dennis Cai, and Ennan Zhai.
21st USENIX Symposium on Networked Systems Design and Implementation (NSDI'24), Apr, 2024.
[PDF]
NSDI'24 LuoShen: A Hyper-Converged Programmable Gateway for Multi-Tenant Multi-Service Edge Clouds.
Tian Pan, Kun Liu, Xionglie Wei, Yisong Qiao, Jun Hu, Zhiguo Li, Jun Liang, Tiesheng Cheng, Wenqiang Su, Jie Lu, Yuke Hong, Zhengzhong Wang, Zhi Xu, Chongjing Dai, Peiqiao Wang, Xuetao Jia, Jianyuan Lu, Enge Song, Jun Zeng, Biao Lyu, Ennan Zhai, Jiao Zhang, Tao Huang, Dennis Cai, and Shunmin Zhu.
21st USENIX Symposium on Networked Systems Design and Implementation (NSDI'24), Apr, 2024.
[PDF]
SOSP'23 Automated Verification of an In-Production DNS Authoritative Engine.
Naiqian Zheng, Mengqi Liu, Yuxing Xiang, Linjian Song, Dong Li, Feng Han, Nan Wang, Yong Ma, Zhuo Liang, Dennis Cai, Ennan Zhai, Xuanzhe Liu, and Xin Jin.
29th ACM Symposium on Operating Systems Principles (SOSP'23), Oct 2023.
[PDF]
SIGCOMM'23 XRON: A Hybrid Elastic Cloud Overlay Network for Video Conferencing at Planetary Scale.
Bingyang Wu, Kun Qian, Bo Li, Yunfei Ma, Qi Zhang, Zhigang Jiang, Jiayu Zhao, Dennis Cai, Ennan Zhai, Xuanzhe Liu, and Xin Jin.
ACM SIGCOMM (SIGCOMM'23), Sep 2023.
[PDF]
SIGCOMM'23 CellFusion: Multipath Vehicle-to-Cloud Video Streaming with Network Coding in the Wild.
Yunzhe Ni, Zhilong Zheng, Xianshang Lin, Fengyu Gao, Xuan Zeng, Yirui Liu, Guang Yang, Yuanchao Su, Dennis Cai, Hongqiang Harry Liu, Chenren Xu, Ennan Zhai, and Yunfei Ma.
ACM SIGCOMM (SIGCOMM'23), Sep 2023.
[PDF]
NSDI'23 Norma: Towards Practical Network Load Testing.
Yanqing Chen, Bingchuan Tian, Chen Tian, Li Dai, Yu Zhou, Mengjing Ma, Ming Tang, Hao Zheng, Zhewen Yang, Guihai Chen, Dennis Cai, and Ennan Zhai.
20th USENIX Symposium on Networked Systems Design and Implementation (NSDI'23), Apr, 2023.
[PDF]
SIGCOMM'22 Meissa: Scalable Network Testing for Programmable Data Planes.
Naiqian Zheng, Mengqi Liu, Ennan Zhai, Hongqiang Harry Liu, Yifan Li, Kaicheng Yang, Xuanzhe Liu, and Xin Jin.
ACM SIGCOMM (SIGCOMM'22), Aug 2022.
[PDF]
SIGCOMM'22 Predictable vFabric on Informative Data Plane.
Shuai Wang, Kaihui Gao, Kun Qian, Dan Li, Rui Miao, Bo Li, Yu Zhou, Ennan Zhai, Chen Sun, Jiaqi Gao, Dai Zhang, Binzhang Fu, Frank Kelly, Dennis Cai, Hongqiang Harry Liu, and Ming Zhang.
ACM SIGCOMM (SIGCOMM'22), Aug 2022.
[PDF]
NSDI'22 Cetus: Releasing P4 Programmers from the Chore of Trial and Error Compiling.
Yifan Li, Jiaqi Gao, Ennan Zhai, Mengqi Liu, Kun Liu, and Hongqiang Harry Liu.
19th USENIX Symposium on Networked Systems Design and Implementation (NSDI'22), Apr, 2022.
[PDF]
OOPSLA'21 Static Detection of Silent Misconfigurations with Deep Interaction Analysis.
Jialu Zhang, Ruzica Piskac, Ennan Zhai, and Tianyin Xu.
36th ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA'21), Oct, 2021.
[PDF]
SIGCOMM'21 Aquila: A Practically Usable Verification System for Production-Scale Programmable Data Planes.
Bingchuan Tian, Jiaqi Gao, Mengqi Liu, Ennan Zhai, Yanqing Chen, Yu Zhou, Li Dai, Feng Yan, Mengjing Ma, Ming Tang, Jie Lu, Xionglie Wei, Hongqiang Harry Liu, Ming Zhang, Chen Tian, and Minlan Yu.
ACM SIGCOMM (SIGCOMM'21), Aug 2021.
[PDF]
SIGCOMM'21 Campion: Debugging Router Configuration Differences.
Alan Tang, Siva Kesava Reddy Kakarla, Ryan Beckett, Ennan Zhai, Matt Brown, Todd Millstein, Yuval Tamir, and George Varghese.
ACM SIGCOMM (SIGCOMM'21), Aug 2021.
[PDF]
SIGCOMM'20 Lyra: A Cross-Platform Language and Compiler for Data Plane Programming on Heterogeneous ASICs.
Jiaqi Gao, Ennan Zhai, Hongqiang Harry Liu, Rui Miao, Yu Zhou, Bingchuan Tian, Chen Sun, Dennis Cai, Ming Zhang, and Minlan Yu.
ACM SIGCOMM (SIGCOMM'20), Aug 2020.
[PDF].
SIGCOMM'20 Accuracy, Scalability, Coverage - A Practical Configuration Verifier on a Global WAN.
Fangdan Ye, Da Yu, Ennan Zhai, Hongqiang Harry Liu, Bingchuan Tian, Qiaobo Ye, Chunsheng Wang, Xin Wu, Tianchen Guo, Cheng Jin, Duncheng She, Qing Ma, Biao Cheng, Hui Xu, Ming Zhang, Zhiliang Wang, and Rodrigo Fonseca.
ACM SIGCOMM (SIGCOMM'20), Aug 2020.
[PDF].
PETS'20 PriFi: Low-Latency Anonymity for Organizational Networks.
Ludovic Barman, Italo Dacosta, Mahdi Zamani, Ennan Zhai, Apostolos Pyrgelis, Bryan Ford, Joan Feigenbaum, and Jean-Pierre Hubaux.
20th Privacy Enhancing Technologies Symposium (PETS'20), Jul, 2020.
[PDF]. [Talk Video].
NSDI'20 Check before You Change: Preventing Correlated Failures in Service Updates.
Ennan Zhai, Ang Chen, Ruzica Piskac, Mahesh Balakrishnan, Bingchuan Tian, Bo Song, and Haoliang Zhang.
17th USENIX Symposium on Networked Systems Design and Implementation (NSDI'20), Feb, 2020.
[PDF]. [Talk Slides].
FAST'20 Lock-Free Collaboration Support for Cloud Storage Services with Operation Inference and Transformation.
Jian Chen, Minghao Zhao, Zhenhua Li, Ennan Zhai, Tianyin Xu, Feng Qian, Hongyi Chen, and Yunhao Liu.
18th USENIX Conference on File and Storage Technologies (FAST'20), Feb, 2020.
[PDF].
SIGCOMM'19 Safely and Automatically Updating In-Network ACL Configurations with Intent Language.
Bingchuan Tian, Xinyi Zhang, Ennan Zhai, Hongqiang Harry Liu, Qiaobo Ye, Chunsheng Wang, Xin Wu, Zhiming Ji, Yihong Sang, Ming Zhang, Da Yu, Chen Tian, Haitao Zheng, and Ben Y. Zhao.
ACM SIGCOMM (SIGCOMM'19), Aug 2019.
[PDF].
MobiCom'19 Mobile Gaming on Personal Computers with Direct Android Emulation.
Qifan Yang, Zhenhua Li, Yunhao Liu, Hai Long, Yuanchao Huang, Jiaming He, Tianyin Xu, and Ennan Zhai.
25th International Conference on Mobile Computing and Networking (MobiCom'19), Oct, 2019.
[PDF]. [System with 50M users].
FAST'18 Towards Web-based Delta Synchronization for Cloud Storage Services.
He Xiao, Zhenhua Li, Ennan Zhai, Tianyin Xu, Yang Li, Yunhao Liu, Quanlu Zhang, and Yao Liu.
16th USENIX Conference on File and Storage Technologies (FAST'18), Feb, 2018.
[PDF]. [Source Code]. [The morning paper].
OOPSLA'17 An Auditing Language for Preventing Correlated Failures in the Cloud.
Ennan Zhai, Ruzica Piskac, Ronghui Gu, Xun Lao, and Xi Wang.
32nd ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA'17), Oct, 2017.
[PDF].
OOPSLA'17 Synthesizing Configuration File Specifications with Association Rule Learning.
Mark Santolucito, Ennan Zhai, Rahul Dhodapkar, Aaron Shim, and Ruzica Piskac.
32nd ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA'17), Oct, 2017.
[PDF].
VLDB'17 Resisting Tag Spam by Leveraging Implicit User Behaviors.
Ennan Zhai, Zhenhua Li, Zhenyu Li, Fan Wu, and Guihai Chen.
Proceedings of the VLDB Endowment, Vol. 10, No. 3.
43rd International Conference on Very Large Data Bases (VLDB'17), Aug, 2017.
[PDF].
NSDI'16 AnonRep: Towards Tracking-Resistant Anonymous Reputation.
Ennan Zhai, David Isaac Wolinsky, Ruichuan Chen, Ewa Syta, Chao Teng, and Bryan Ford.
13th USENIX Symposium on Networked Systems Design and Implementation (NSDI'16), Mar, 2016.
[PDF]. [Talk Slides]. [Source Code].
CAV'16 Probabilistic Automated Language Learning for Configuration Files.
Mark Santolucito, Ennan Zhai, and Ruzica Piskac.
28th International Conference on Computer Aided Verification (CAV'16), Jul, 2016.
[PDF]
OSDI'14 Heading Off Correlated Failures through Independence-as-a-Service.
Ennan Zhai, Ruichuan Chen, David Isaac Wolinsky, and Bryan Ford.
11th USENIX Symposium on Operating Systems Design and Implementation (OSDI'14), Oct, 2014.
[PDF]. [Technical Report]. [Talk Slides]. [Talk Video]. [The Register News].


Professional Service


Last Update: Aug 2024