Email: (@ yuemmawang (. google com))

Google Scholar Profile

(Emma reserves the copyright of all the photos on this website.)

Professional Experience

Staff Software Engineer, Google DeepMind

April 2023 -- Present

  • Leading the Multimodal Serving team at Google DeepMind.
    • Designed and implemented a serving system to cater to the serving needs of different multimodal models.
    • Optimized three generations of Veo for serving, achieving up to a 30% speedup for Veo 1 and over 5x for Veo 2 and Veo 3.
    • Designed and optimized the serving system for Genie 3, achieving tens-of-times reductions in latency and improvements in throughput, resulting in 24 fps and sub-second response latency.
  • Optimized serving performance of Gemini-based products.
    • 3.4x speedup for Project Astra, which was launched at Google I/O 2024.
    • Up to 2.6x speedup for Search, and 1.6x for a long-context model for a Cloud product.
    • Google Tech Impact Award 2024
    • Google Editor's Choice Perfy Award 2024
    • Google Gold Perfy Award 2024 x 2
    • Google Silver Perfy Award 2024

Senior Software Engineer, Google

April 2021 -- April 2023

  • Optimized serving performance of Google's large language model products, leading a virtual team of over 10 engineers.
    • Achieved up to a 2.5x speedup for Bard (later renamed Gemini), 1.7x for Search, and 1.3x for Cloud.
    • Applied optimizations across the stack, including compilers (e.g., fusion, lowering, scheduling), models (e.g., sharding optimization and rewriting computations to be more TPU-friendly), and compiler flag autotuning.
    • Proposed a systematic and general performance analysis methodology, widely used for onboarding new performance engineers.
    • Google Cloud Tech Impact Award 2023
    • Google Editor's Choice Perfy Award 2023
    • Google Silver Perfy Award 2023 x 3
  • Productionized the automatic model partitioning algorithm from Alpa (co-authored blog post), implemented optimizations to scale it Google-wide, and led the team to open-source it in PyTorch XLA.
  • Conducted model partitioning autotuning, speeding up serving of an earlier version of Bard by 10%.
  • Designed the 2nd generation of the automatic system for deploying multi-pass machine learning compiler autotuning to fleetwide Google ML workloads, saving significant fleetwide TPU resources.
    • After founding the project and the team, I handed it off to a new TLM and moved on to other projects. The new system recently finished landing most of its optimizations to the Google fleet and was awarded a Google Gold Perfy Award in 2025.

Software Engineer, Google

Nov 2019 -- April 2021

Research Assistant, Harvard University

Sep 2013 -- Sep 2019

  • Conducted deep and systematic performance analysis for machine learning workloads and extracted architectural insights to optimize those workloads.
  • See my dissertation.

Research Intern, Facebook

Mentors: Xiaodong Wang and Carole-Jean Wu

Nov 2018 -- Feb 2019

  • Compared performance across different deep learning frameworks and identified the sources of performance differences in depth.
  • Extracted insights from the analysis results to optimize Caffe2.

Software Engineering Intern, Google Platforms

Mentor: Hui Huang

May -- Aug 2018

  • Benchmarked the 3rd generation of Tensor Processing Units (TPU v3) with state-of-the-art deep learning workloads.
  • Predicted potential bottlenecks of Cloud TPU v3.
  • Quantified the impact of NUMA-aware allocation for Cloud TPU v3.
  • Silver Perfy Award in 2019 Q1 at Google.

Software Engineering Intern, Google Brain

Mentor: Cliff Young

Sep -- Dec 2017

  • Benchmarked the 2nd generation of Tensor Processing Units (TPU v2) with state-of-the-art deep learning workloads and analyzed their bottlenecks.
  • Quantified performance scalability and speedup of Cloud TPU v2.

Parallel Computing Intern, Intel Labs

Mentor: Victor Lee

July 2015 -- Jan 2016

  • Developed a set of tools to characterize CPU workloads, extracting platform-independent features including memory locality, memory footprint, and branch entropy.

Research Assistant, Shanghai Jiao Tong University

Mentor: Prof. Bo Yuan

Sep 2011 -- July 2013

  • Optimized a Bayesian network learning algorithm and implemented it on GPU.
  • Achieved a 143× speedup on GPU over CPU.
  • Applied this method to networks of up to 125 nodes.

Research Assistant, Shanghai Jiao Tong University

Mentor: Prof. Xiaoyao Liang

June 2012 -- July 2013

  • Designed a hardware profile-guided scheduler for green datacenters.
  • Reduced datacenter energy cost by up to 54% while maintaining fairly balanced processor utilization.

Publications

Genie Team. Genie 3: A new frontier for world models.

Veo Team. Veo: Our state-of-the-art video generation model.

Gemini Team. "Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities." arXiv preprint arXiv:2507.06261 (2025).

Gemma Team. "Gemma 2: Improving open language models at a practical size." arXiv preprint arXiv:2408.00118 (2024).

Gemini Team. "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context." arXiv preprint arXiv:2403.05530 (2024).

Gemma Team. "Gemma 2: Improving open language models at a practical size." arXiv preprint arXiv:2408.00118 (2024).

Gemini Team. "Gemini: A family of highly capable multimodal models, 2024." arXiv preprint arXiv:2312.11805 10 (2024).

Schiemer, Martin, Clemens JS Schaefer, Jayden Parker Vap, Mark James Horeni, Yu Emma Wang, Juan Ye, and Siddharth Joshi. "Hadamard domain training with integers for class incremental quantized learning." arXiv preprint arXiv:2310.03675 (2023).

Clemens JS Schaefer, Navid Lambert-Shirzad, Xiaofan Zhang, Chiachen Chou, Tom Jablin, Jian Li, Elfie Guo, Caitlin Stanton, Siddharth Joshi, and Yu Emma Wang. "Augmenting hessians with inter-layer dependencies for mixed-precision post-training quantization." arXiv preprint arXiv:2306.04879 (2023).

Clemens JS Schaefer, Elfie Guo, Caitlin Stanton, Xiaofan Zhang, Tom Jablin, Navid Lambert-Shirzad, Jian Li, Chiachen Chou, Siddharth Joshi, and Yu Emma Wang. "Mixed precision post training quantization of neural networks with sensitivity guided search." arXiv preprint arXiv:2302.01382 (2023).

Zhang, Xiaofan, Zongwei Zhou, Deming Chen, and Yu Emma Wang. "AutoDistill: An end-to-end framework to explore and distill hardware-efficient language models." arXiv preprint arXiv:2201.08539 (2022).

Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten P Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang et al. "Glam: Efficient scaling of language models with mixture-of-experts." In International conference on machine learning, pp. 5547-5569. PMLR, 2022.

Phitchaya Mangpo Phothilimthana, Amit Sabne, Nikhil Sarda, Karthik Srinivasa Murthy, Yanqi Zhou, Christof Angermueller, Mike Burrows, Sudip Roy, Ketan Mandke, Rezsa Farahani, Yu Emma Wang et al. "A flexible approach to autotuning multi-pass machine learning compilers." In 2021 30th International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 1-16. IEEE, 2021.

Sameer Kumar, Yu Emma Wang, Cliff Young, James Bradbury, Anselm Levskaya, Blake Hechtman, Dehao Chen, HyoukJoong Lee, Mehmet Deveci, Naveen Kumar, Pankaj Kanwar, Shibo Wang, Skye Wanderman-Milne, Steve Lacy, Tao Wang, Tayo Oguntebi, Yazhou Zu, Yuanzhong Xu, Andy Swing, "Exploring the limits of Concurrency in ML Training on Google TPUs." MLSys (2021).

Yu Emma Wang, Carole-Jean Wu, Xiaodong Wang, Kim Hazelwood, David Brooks, "Exploiting Parallelism Opportunities with Deep Learning Frameworks." arXiv preprint arXiv:1908.04705 (2019).

Yu Emma Wang, Gu-Yeon Wei, David Brooks, "A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms." MLSys (2020).

(The arXiv version of the above paper) Yu Emma Wang, Gu-Yeon Wei, David Brooks, "Benchmarking TPU, GPU and CPU for Deep Learning." arXiv preprint arXiv:1907.10701 (2019).

Yu Emma Wang, Yuhao Zhu, Glenn G. Ko, Brandon Reagen, Gu-Yeon Wei, and David Brooks. "Demystifying Bayesian Inference Workloads." IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 177-189. IEEE, 2019.

Yu Emma Wang, Victor Lee, Gu-Yeon Wei, and David Brooks. "Predicting New Workload or CPU Performance by Analyzing Public Datasets." ACM Transactions on Architecture and Code Optimization (TACO). vol. 15, no. 4 (2019): 53:1–53:21.

Yu Emma Wang, Weikang Qian, Shuchang Zhang, Xiaoyao Liang, and Bo Yuan. "A Learning Algorithm for Bayesian Networks and Its Efficient Implementation on GPU," IEEE Transactions on Parallel and Distributed Systems. vol. 27, no. 1 (2016): 17–30.

Weichao Tang, Yu Emma Wang, Haopeng Liu, Tao Zhang, Chao Li, and Xiaoyao Liang. "Exploring Hardware Profile-Guided Green Datacenter Scheduling." International Conference on Parallel Processing (ICPP), pp. 11-20. 2015.

Dissertation

Yu Emma Wang. "Performance Analysis for Machine Learning Applications." PhD Dissertation, Harvard University, Nov 2019.

Patents

Yu Emma Wang, Dehao Chen, Phitchaya Mangpo Phothilimthana, "Deploying optimization profiles for compiling computer programs in data centers." 2023.

Hyojun Kim, Xiao Yu, Yu Emma Wang, Phitchaya Mangpo Phothilimthana, "Caching compilation outputs using optimization profiles." 2022.

Yu Emma Wang, Thomas Benjamin Jablin, Caitlin King Stanton, "Workload scheduling using queues with different priorities." 2022.

Open-Source Software

Feel free to download our software and use it in your projects. If you do, please cite the corresponding papers.

ParaDnn

ParaDnn is a tool that enables systematic performance analysis for deep learning platforms.

Mille Crepe Bench

Mille Crepe Bench is a multi-layer performance analysis tool for deep learning frameworks.

BayesSuite

BayesSuite is a Bayesian inference benchmark suite based on Stan.

BN-GPU

BN-GPU is a GPU implementation of a Bayesian network learning algorithm.

Professional Service

Technical Program Committee

  • Machine Learning and Systems Rising Stars 2024
  • Conference on Machine Learning and Systems (MLSys) 2024
  • ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) 2024
  • ACM/IEEE Supercomputing Conference (SC) 2023
  • MLBench workshop in MLSys'23
  • Conference on Machine Learning and Systems (MLSys) 2023
  • ACM/IEEE Supercomputing Conference (SC) 2022
  • Conference on Machine Learning and Systems (MLSys) 2022
  • MLBench workshop in MLSys'21

Journal reviews

  • IEEE Computer Architecture Letters (CAL)
  • ACM Transactions on Architecture and Code Optimization (TACO)

Talks

Demystifying Bayesian Inference Workloads

  • ISPASS, Madison, WI, March 2019.
  • ADA Symposium, Ann Arbor, MI, April 2019.

A Systematic Methodology for Analysis of Deep Learning Platforms

  • Google, Aug 2018.
  • Google, Aug 2018.
  • Google, Sep 2018. (No, the three lines are not typos.)
  • Facebook, Sep 2018.
  • ADA Center, Dec 2018.
  • IBM, March 2019.
  • Micron, May 2019.
  • MLSys, March 2020.

Photography

I enjoy food, traveling, photography, and interacting with people. These are samples of the results.

For more photos please refer to my 500PX page. Copyright reserved :)