My Research

My PhD research focuses on performance analysis for machine learning applications, including deep learning and Bayesian inference. Proper performance analysis can reveal system and architectural bottlenecks, provide essential information for choosing frameworks and platforms, and lead to performance optimizations.

For deep learning, I extracted the performance implications of key design features from deep learning frameworks (such as TensorFlow and Caffe2). For more details please refer to this paper. I also proposed a systematic analysis methodology that can reveal deeper insights that are difficult to discover with traditional approaches. For more details please refer to this paper. ParaDnn, a tool implementing this methodology, is available for analyzing other deep learning platforms.

Bayesian inference is an important branch of machine learning. However, its computational characteristics are less studied in the community. I proposed BayesSuite to facilitate research on such applications. Please refer to this paper for more details.


Publications

Sameer Kumar, Yu Emma Wang, Cliff Young, James Bradbury, Anselm Levskaya, Blake Hechtman, Dehao Chen, HyoukJoong Lee, Mehmet Deveci, Naveen Kumar, Pankaj Kanwar, Shibo Wang, Skye Wanderman-Milne, Steve Lacy, Tao Wang, Tayo Oguntebi, Yazhou Zu, Yuanzhong Xu, Andy Swing, "Exploring the limits of Concurrency in ML Training on Google TPUs." MLSys (2021).

Yu Emma Wang, Carole-Jean Wu, Xiaodong Wang, Kim Hazelwood, David Brooks, "Exploiting Parallelism Opportunities with Deep Learning Frameworks." arXiv preprint arXiv:1908.04705 (2019).

Yu Emma Wang, Gu-Yeon Wei, David Brooks, "A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms." MLSys (2020).

(The arXiv version of the above paper) Yu Emma Wang, Gu-Yeon Wei, David Brooks, "Benchmarking TPU, GPU and CPU for Deep Learning." arXiv preprint arXiv:1907.10701 (2019).

Yu Emma Wang, Yuhao Zhu, Glenn G. Ko, Brandon Reagen, Gu-Yeon Wei, and David Brooks. "Demystifying Bayesian Inference Workloads." IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 177-189. IEEE, 2019.

Yu Emma Wang, Victor Lee, Gu-Yeon Wei, and David Brooks. "Predicting New Workload or CPU Performance by Analyzing Public Datasets." ACM Transactions on Architecture and Code Optimization (TACO). vol. 15, no. 4 (2019): 53:1–53:21.

Yu Emma Wang, Weikang Qian, Shuchang Zhang, Xiaoyao Liang, and Bo Yuan. "A Learning Algorithm for Bayesian Networks and Its Efficient Implementation on GPU," IEEE Transactions on Parallel and Distributed Systems. vol. 27, no. 1 (2016): 17–30.

Weichao Tang, Yu Emma Wang, Haopeng Liu, Tao Zhang, Chao Li, and Xiaoyao Liang. "Exploring Hardware Profile-Guided Green Datacenter Scheduling." International Conference on Parallel Processing (ICPP), pp. 11-20. 2015.


Yu Emma Wang. "Performance Analysis for Machine Learning Applications." PhD Dissertation, Harvard University, Nov 2019.

Open-Source Software

Feel free to download our software and use it in your projects. If you do, please cite the corresponding papers.


ParaDnn

ParaDnn is a tool that enables systematic performance analysis for deep learning platforms.

Mille Crepe Bench

Mille Crepe Bench is a multi-layer performance analysis tool for deep learning frameworks.


BayesSuite

BayesSuite is a Bayesian inference benchmark suite based on Stan.


BN-GPU

BN-GPU is a GPU implementation of a Bayesian network learning algorithm.


Talks

Demystifying Bayesian Inference Workloads

  • ISPASS, Madison, WI, March 2019.
  • ADA Symposium, Ann Arbor, MI, April 2019.

A Systematic Methodology for Analysis of Deep Learning Platforms

  • Google, Aug 2018.
  • Google, Aug 2018.
  • Google, Sep 2018. (No, the three lines are not typos.)
  • Facebook, Sep 2018.
  • ADA Center, Dec 2018.
  • IBM, March 2019.
  • Micron, May 2019.
  • MLSys, March 2020.

Research Experience

Research Assistant, Harvard University

Sep 2013 -- Present

  • Conducted deep and systematic performance analysis for machine learning workloads and extracted architectural insights to optimize those workloads.
  • (See my dissertation.)

Research Intern, Facebook

Mentors: Xiaodong Wang and Carole-Jean Wu

Nov 2018 -- Feb 2019

  • Compared performance across different deep learning frameworks and identified the sources of performance differences in depth.
  • Extracted insights from the analysis results to optimize Caffe2.

Software Engineering Intern, Google Platforms

Mentor: Hui Huang

May -- Aug 2018

  • Benchmarked the 3rd-generation Tensor Processing Unit (TPU v3) with state-of-the-art deep learning workloads.
  • Predicted potential bottlenecks of Cloud TPU v3.
  • Quantified the impact of NUMA-aware allocation for Cloud TPU v3.
  • Shared Silver Perfy Award in 2019 Q1 at Google.

Software Engineering Intern, Google Brain

Mentor: Cliff Young

Sep -- Dec 2017

  • Benchmarked the 2nd-generation Tensor Processing Unit (TPU v2) with state-of-the-art deep learning workloads and analyzed its bottlenecks.
  • Quantified performance scalability and speedup of Cloud TPU v2.

Parallel Computing Intern, Intel Labs

Mentor: Victor Lee

July 2015 -- Jan 2016

  • Developed a set of tools to characterize CPU workloads and extract platform-independent features, including memory locality, memory footprint, and branch entropy.

Research Assistant, Shanghai Jiao Tong University

Mentor: Prof. Bo Yuan

Sep 2011 -- July 2013

  • Optimized a Bayesian network learning algorithm and implemented it on GPUs.
  • Achieved a 143× speedup on GPU over CPU.
  • Applied this method to networks of up to 125 nodes.

Research Assistant, Shanghai Jiao Tong University

Mentor: Prof. Xiaoyao Liang

June 2012 -- July 2013

  • Designed a hardware profile-guided scheduler for green datacenters.
  • Reduced datacenter energy cost by up to 54% while maintaining balanced processor utilization.


I enjoy food, traveling, photography, and interacting with people. Below are some samples of the results.

For more photos, please see my 500px page. Copyright reserved :)