Keywords (tags) and Publication List
Hua, Fei; Chen, Yanhao; Jin, Yuwei; Zhang, Chi; Hayes, Ari; Zhang, Youtao; Zhang, Eddy Z: AutoBraid: A Framework for Enabling Efficient Surface Code Communication in Quantum Computing. Conference, 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2021), Association for Computing Machinery, 2021, ISBN 9781450385572. Tags: Compiler optimization, Quantum Computing, Quantum Error Correction.
Chen, Yanhao; Hua, Fei; Jin, Yuwei; Zhang, Eddy Z: BGPQ: A Heap-Based Priority Queue Design for GPUs. Conference, 50th International Conference on Parallel Processing (ICPP 2021), Association for Computing Machinery, New York, NY, USA, 2021, ISBN 9781450390682. Tags: Batched Heap, GPUs, Priority Queue.
Zhang, Chi; Hayes, Ari B; Qiu, Longfei; Jin, Yuwei; Chen, Yanhao; Zhang, Eddy Z: Time-Optimal Qubit Mapping. Conference, Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2021), Association for Computing Machinery, New York, NY, USA, 2021. Tags: A* search, Compilation technique, QFT, Quantum, Qubit mapping.
Hua, Fei; Zhang, Eddy Z: Optimizing Surface Code Braiding. Workshop, The 3rd International Workshop on Quantum Compilation (IWQC 2019), 2019. Tags: Compilation technique, Quantum, Surface code.
Fu, Hao; Zhu, Mingzheng; Wu, Wenli; Zhang, Eddy Z; Tan, Haisheng: Towards Optimal Qubit Mapping in NISQ Era. Workshop, The 3rd International Workshop on Quantum Compilation (IWQC 2019), 2019. Tags: Compilation technique, Quantum, Qubit mapping.
Hayes, Ari B; Hua, Fei; Huang, Jin; Chen, Yanhao; Zhang, Eddy Z: Decoding CUDA Binary. Conference, Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO 2019), IEEE Press, Washington, DC, USA, 2019, ISBN 9781728114361. Tags: Code generation, Code translation and transformation, CUDA, GPU, Instruction set architecture (ISA).
Chen, Yanhao; Hayes, Ari B; Zhang, Chi; Salmon, Timothy; Zhang, Eddy Z: Locality-Aware Software Throttling for Sparse Matrix Operation on GPUs. Conference, 2018 USENIX Annual Technical Conference (USENIX ATC 18), USENIX Association, Boston, MA, 2018, ISBN 978-1-939133-01-4. Tags: GPU, Program locality, Sparse matrix, Spmv.
Hayes, Ari B; Li, Lingda; Hedayati, Mohammad; He, Jiahuan; Zhang, Eddy Z; Shen, Kai: GPU Taint Tracking. Conference, 2017 USENIX Annual Technical Conference (USENIX ATC 17), USENIX Association, Santa Clara, CA, 2017, ISBN 978-1-931971-38-6.
Li, Lingda; Geda, Robel; Hayes, Ari B; Chen, Yanhao; Chaudhari, Pranav; Zhang, Eddy Z; Szegedy, Mario: A Simple Yet Effective Balanced Edge Partition Model for Parallel Computing. Conference, Proceedings of the 2017 ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS 2017), SIGMETRICS ’17 Abstracts, Association for Computing Machinery, Urbana-Champaign, Illinois, USA, 2017, ISBN 9781450350327. Tags: Data sharing, Edge-partition, GPU, Graph model, Program locality.
Li, Lingda; Geda, Robel; Hayes, Ari B; Chen, Yanhao; Chaudhari, Pranav; Zhang, Eddy Z; Szegedy, Mario: A Simple Yet Effective Balanced Edge Partition Model for Parallel Computing. Journal article, SIGMETRICS Perform. Eval. Rev., 45(1), p. 6, 2017, ISSN 0163-5999. Tags: Data sharing, Edge-partition, GPU, Graph model, Program locality.
Li, Pengcheng; Hu, Xiaoyu; Chen, Dong; Brock, Jacob; Luo, Hao; Zhang, Eddy Z; Ding, Chen: LD: Low-Overhead GPU Race Detection Without Access Monitoring. Journal article, ACM Transactions on Architecture and Code Optimization (TACO), 14(1), 2017, ISSN 1544-3566. Tags: GPU race detection, Instrumentation-free, Value-based checking.
Catarata, Jan; Corbett, Scott; Stern, Harry; Szegedy, Mario; Vyskocil, Tomas; Zhang, Zheng: The Moser-Tardos Resample algorithm: Where is the limit? (an experimental inquiry). Workshop, 2017.
Hayes, Ari B; Li, Lingda; Chavarría-Miranda, Daniel; Song, Shuaiwen Leon; Zhang, Eddy Z: Orion: A Framework for GPU Occupancy Tuning. Conference, Proceedings of the 17th International Middleware Conference (Middleware 2016), Association for Computing Machinery, Trento, Italy, 2016, ISBN 9781450343008. Tags: Concurrent-program compilation, GPU occupancy tuning, Register allocation, Shared memory allocation.
Li, Lingda; Hayes, Ari B; Song, Shuaiwen Leon; Zhang, Eddy Z: Tag-Split Cache for Efficient GPGPU Cache Utilization. Conference, Proceedings of the 2016 International Conference on Supercomputing (ICS 2016), Association for Computing Machinery, Istanbul, Turkey, 2016, ISBN 9781450343619. Tags: Cache organization, GPGPU, Spatial Locality.
Tao, Dingwen; Song, Shuaiwen Leon; Krishnamoorthy, Sriram; Wu, Panruo; Liang, Xin; Zhang, Eddy Z; Kerbyson, Darren; Chen, Zizhong: New-Sum: A Novel Online ABFT Scheme For General Iterative Methods. Conference, Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC 2016), Association for Computing Machinery, Kyoto, Japan, 2016, ISBN 9781450343145. Tags: Algorithm-based fault tolerance (ABFT), Checkpoint, Checksum, Iterative methods, Online error detection, Resilience, Rollback recovery, Silent data corruption (SDC).
Li, Ang; Song, Shuaiwen Leon; Kumar, Akash; Zhang, Eddy Z; Chavarría-Miranda, Daniel; Corporaal, Henk: Critical points based register-concurrency autotuning for GPUs. Conference, 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE 2016), Dresden, Germany, March 14-18, 2016, IEEE, 2016.
Renart, E G; Zhang, E Z; Nath, B: Towards a GPU SDN controller. Workshop, 2015 International Conference and Workshops on Networked Systems (NetSys), 2015.
Egielski, Ian J; Huang, Jesse; Zhang, Eddy Z: Massive Atomics for Massive Parallelism on GPUs. Conference, Proceedings of the 2014 International Symposium on Memory Management (ISMM 2014), Association for Computing Machinery, Edinburgh, United Kingdom, 2014, ISBN 9781450329217. Tags: Atomics, Concurrency, GPU, Parallelism.
Hayes, Ari B; Zhang, Eddy Z: Unified On-Chip Memory Allocation for SIMT Architecture. Conference, Proceedings of the 28th ACM International Conference on Supercomputing (ICS 2014), Association for Computing Machinery, Munich, Germany, 2014, ISBN 9781450326421. Tags: Compiler optimization, Concurrency, GPU, Register allocation, Shared memory allocation.
Wu, Bo; Zhao, Zhijia; Zhang, Eddy Zheng; Jiang, Yunlian; Shen, Xipeng: Conference, Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2013), Association for Computing Machinery, Shenzhen, China, 2013, ISBN 9781450319225. Tags: Computational complexity, Data transformation, GPGPU, Memory coalescing, Runtime optimizations, Thread-data remapping.
Shen, Xipeng; Liu, Yixun; Zhang, Eddy Z; Bhamidipati, Poornima: An Infrastructure for Tackling Input-Sensitivity of GPU Program Optimizations. Journal article, Int. J. Parallel Program., 41(6), pp. 855–869, 2013, ISSN 0885-7458. Tags: Cross-input adaptation, CUDA, Empirical search, G-ADAPT, GPU, Program optimizations.
Zhang, E Z; Jiang, Y; Shen, X: The Significance of CMP Cache Sharing on Contemporary Multithreaded Applications. Journal article, IEEE Transactions on Parallel and Distributed Systems, 23(2), pp. 367-374, 2012.
Wu, B; Zhang, E Z; Shen, X: Enhancing Data Locality for Dynamic Simulations through Asynchronous Data Transformations and Adaptive Control. Conference, 2011 International Conference on Parallel Architectures and Compilation Techniques, 2011.
Guo, Z; Zhang, E Z; Shen, X: Correctly Treating Synchronizations in Compiling Fine-Grained SPMD-Threaded Programs for CPU. Conference, 2011 International Conference on Parallel Architectures and Compilation Techniques, 2011.
Tian, Kai; Zhang, Eddy; Shen, Xipeng: A Step towards Transparent Integration of Input-Consciousness into Dynamic Program Optimizations. Conference, Proceedings of the 2011 ACM International Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA ’11), Association for Computing Machinery, Portland, Oregon, USA, 2011, ISBN 9781450309400. Tags: Dynamic optimizations, Dynamic version selection, Java virtual machine, Just-in-time compilation, Proactivity, Seminal behaviors.
Zhang, Eddy Z; Jiang, Yunlian; Guo, Ziyu; Tian, Kai; Shen, Xipeng: On-the-Fly Elimination of Dynamic Irregularities for GPU Computing. Conference, Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XVI), Association for Computing Machinery, Newport Beach, California, USA, 2011, ISBN 9781450302661. Tags: Cpu-gpu pipelining, Data transformation, GPGPU, Memory coalescing, Thread data remapping, Thread divergence.
Tian, Kai; Jiang, Yunlian; Zhang, Eddy Z; Shen, Xipeng: An Input-Centric Paradigm for Program Dynamic Optimizations. Conference, Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA ’10), Association for Computing Machinery, Reno/Tahoe, Nevada, USA, 2010, ISBN 9781450302036. Tags: Dynamic optimizations, Dynamic version selection, Java virtual machine, Just-in-time compilation, Proactivity, Seminal behaviors.
Zhang, Eddy Z; Jiang, Yunlian; Guo, Ziyu; Shen, Xipeng: Conference, Proceedings of the 24th ACM International Conference on Supercomputing (ICS ’10), Association for Computing Machinery, Tsukuba, Ibaraki, Japan, 2010, ISBN 9781450300186. Tags: Cpu-gpu pipelining, Data transformation, GPGPU, Thread divergence, Thread-data remapping.
Jiang, Yunlian; Zhang, Eddy Z; Tian, Kai; Mao, Feng; Gethers, Malcom; Shen, Xipeng; Gao, Yaoqing: Exploiting Statistical Correlations for Proactive Prediction of Program Behaviors. Conference, Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO ’10), Association for Computing Machinery, Toronto, Ontario, Canada, 2010, ISBN 9781605586359. Tags: Correlation, Program behavior analysis.
Zhang, Eddy Z; Jiang, Yunlian; Shen, Xipeng: Does Cache Sharing on Modern CMP Matter to the Performance of Contemporary Multithreaded Programs? Conference, Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’10), Association for Computing Machinery, Bangalore, India, 2010, ISBN 9781605588773. Tags: Chip multiprocessors, Parallel program optimizations, Shared cache, Thread scheduling.
Casale, Giuliano; Zhang, Eddy Z; Smirni, Evgenia: KPC-Toolbox: Best Recipes for Automatic Trace Fitting Using Markovian Arrival Processes. Journal article, Perform. Eval., 67(9), pp. 873–896, 2010, ISSN 0166-5316. Tags: Automatic fitting, MAP characterization, Markovian arrival process (MAP), Phase-type distribution, Temporal dependence, Time series modeling.
Liu, Yixun; Zhang, E Z; Shen, X: A cross-input adaptive framework for GPU program optimizations. Conference, 2009 IEEE International Symposium on Parallel Distributed Processing, 2009.
Mao, Feng; Zhang, Eddy Z; Shen, Xipeng: Influence of Program Inputs on the Selection of Garbage Collectors. Conference, Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE ’09), Association for Computing Machinery, Washington, DC, USA, 2009, ISBN 9781605583754. Tags: Cross-input program analysis, Input-specific selection, Minimum possible heap size, Profiling, Selection of garbage collectors.
Casale, G; Zhang, E Z; Smirni, E: KPC-Toolbox: Simple Yet Effective Trace Fitting Using Markovian Arrival Processes. Conference, 2008 Fifth International Conference on Quantitative Evaluation of Systems, 2008.
2021
title = {AutoBraid: A Framework for Enabling Efficient Surface Code Communication in Quantum Computing},
author = {Hua, Fei and Chen, Yanhao and Jin, Yuwei and Zhang, Chi and Hayes, Ari and Zhang, Youtao and Zhang, Eddy Z.},
url = {https://doi.org/10.1145/3466752.3480072},
doi = {10.1145/3466752.3480072},
isbn = {9781450385572},
year = {2021},
date = {2021-10-21},
booktitle = {54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2021)},
pages = {925–936},
publisher = {Association for Computing Machinery},
abstract = {Quantum computers can solve problems that are intractable using the most powerful classical computer. However, qubits are fickle and error prone. It is necessary to actively correct errors in the execution of a quantum circuit. Quantum error correction (QEC) codes are developed to enable fault-tolerant quantum computing. With QEC, one logical circuit is converted into an encoded circuit. Most studies on quantum circuit compilation focus on NISQ devices which have 10-100 qubits and are not fault-tolerant. In this paper, we focus on the compilation for fault-tolerant quantum hardware. In particular, we focus on optimizing communication parallelism for the surface code based QEC. The execution of surface code circuits involves non-trivial geometric manipulation of a large lattice of entangled physical qubits. A two-qubit gate in surface code is implemented as a virtual “pipe” in space-time called a braiding path. The braiding paths should be carefully routed to avoid congestion. Communication between qubits is considered the major bottleneck as it involves scheduling and searching for simultaneous paths between qubits. We provide a framework for efficiently scheduling braiding paths. We discover that for quantum programs with a local parallelism pattern, our framework guarantees an optimal solution, while the previous greedy-heuristic-based solution cannot. Moreover, we propose an extension to the local parallelism analysis framework to address the communication bottleneck. Our framework achieves orders of magnitude improvement after addressing the communication bottleneck.},
keywords = {Compiler optimization, Quantum Computing, Quantum Error Correction},
pubstate = {published},
tppubtype = {conference}
}
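The abstract above frames surface-code communication as routing non-overlapping braiding paths on a lattice. As a rough illustration only (this is not AutoBraid's algorithm, and the lattice and coordinates are invented), the sketch below routes each path with breadth-first search and reserves its cells, so a later path must detour around an earlier one or report congestion:

```python
from collections import deque

def route(width, height, src, dst, blocked):
    """BFS for a shortest path from src to dst on a width x height
    lattice, avoiding cells already claimed by other paths."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        cur = queue.popleft()
        if cur == dst:
            path = []
            while cur is not None:
                path.append(cur)
                cur = prev[cur]
            return path[::-1]
        x, y = cur
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nxt[0] < width and 0 <= nxt[1] < height
                    and nxt not in blocked and nxt not in prev):
                prev[nxt] = cur
                queue.append(nxt)
    return None  # congestion: no free path at the moment

# Route two crossing "braiding paths" on a 6x5 lattice; the second
# must detour around the cells reserved by the first.
blocked = set()
p1 = route(6, 5, (0, 2), (4, 2), blocked)
blocked.update(p1)
p2 = route(6, 5, (2, 0), (2, 4), blocked)
```

A scheduler built on such a primitive would decide which paths to route in the same time step; the paper's contribution is doing that scheduling with optimality guarantees for locally parallel circuits.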
title = {BGPQ: A Heap-Based Priority Queue Design for GPUs},
author = {Chen, Yanhao and Hua, Fei and Jin, Yuwei and Zhang, Eddy Z.},
url = {https://doi.org/10.1145/3472456.3472463},
doi = {10.1145/3472456.3472463},
isbn = {9781450390682},
year = {2021},
date = {2021-08-09},
booktitle = {50th International Conference on Parallel Processing (ICPP 2021)},
pages = {10},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
abstract = {Programming today’s many-core processor is challenging. Due to the enormous amount of parallelism, synchronization is expensive. We need efficient data structures for providing automatic and scalable synchronization methods. In this paper, we focus on the priority queue data structure. We develop a heap-based priority queue implementation called BGPQ. BGPQ uses batched key nodes as the internal data representation, exploits both task parallelism and data parallelism, and is linearizable. We show that BGPQ achieves up to 88X speedup compared with four state-of-the-art CPU parallel priority queue implementations and up to 11.2X speedup over an existing GPU implementation. We also apply BGPQ to search problems, including 0-1 Knapsack and A* search. We achieve 45X-100X and 12X-46X speedup respectively over best known concurrent CPU priority queues.},
keywords = {Batched Heap, GPUs, Priority Queue},
pubstate = {published},
tppubtype = {conference}
}
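To make the "batched" idea in the BGPQ abstract concrete, here is a deliberately simplified toy in Python: every operation moves a whole batch of keys at once, which is the property that lets a GPU thread block cooperate on one operation. It is not BGPQ itself (which stores sorted key batches inside heap nodes and merges them in a data-parallel, linearizable way); the class name and batch size are invented for the example:

```python
import heapq

class BatchedPQ:
    """Toy batched priority queue: insert and delete-min move
    batch_size keys per call, amortizing per-key overhead."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.heap = []

    def insert_batch(self, keys):
        assert len(keys) == self.batch_size
        for k in keys:  # a GPU version would merge sorted batches instead
            heapq.heappush(self.heap, k)

    def delete_min_batch(self):
        n = min(self.batch_size, len(self.heap))
        return [heapq.heappop(self.heap) for _ in range(n)]

pq = BatchedPQ(4)
pq.insert_batch([9, 3, 7, 1])
pq.insert_batch([8, 2, 6, 4])
smallest = pq.delete_min_batch()  # the four globally smallest keys
```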
title = {Time-Optimal Qubit Mapping},
author = {Zhang, Chi and Hayes, Ari B. and Qiu, Longfei and Jin, Yuwei and Chen, Yanhao and Zhang, Eddy Z.},
url = {https://doi.org/10.1145/3445814.3446706},
doi = {10.1145/3445814.3446706},
year = {2021},
date = {2021-04-19},
booktitle = {Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2021)},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
keywords = {A* search, Compilation technique, QFT, Quantum, Qubit mapping},
pubstate = {published},
tppubtype = {conference}
}
2019
title = {Optimizing Surface Code Braiding},
author = {Fei Hua and Eddy Z. Zhang},
year = {2019},
date = {2019-11-07},
booktitle = {The 3rd International Workshop on Quantum Compilation (IWQC 2019)},
keywords = {Compilation technique, Quantum, Surface code},
pubstate = {published},
tppubtype = {workshop}
}
title = {Towards Optimal Qubit Mapping in NISQ Era},
author = {Hao Fu and Mingzheng Zhu and Wenli Wu and Eddy Z. Zhang and Haisheng Tan},
year = {2019},
date = {2019-11-07},
booktitle = {The 3rd International Workshop on Quantum Compilation (IWQC 2019)},
keywords = {Compilation technique, Quantum, Qubit mapping},
pubstate = {published},
tppubtype = {workshop}
}
title = {Decoding CUDA Binary},
author = {Ari B Hayes and Fei Hua and Jin Huang and Yanhao Chen and Eddy Z Zhang},
isbn = {9781728114361},
year = {2019},
date = {2019-01-01},
booktitle = {Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO 2019)},
pages = {229–241},
publisher = {IEEE Press},
address = {Washington, DC, USA},
abstract = {NVIDIA’s software does not offer translation of assembly code to binary for their GPUs, since the specifications are closed-source. This work fills that gap. We develop a systematic method of decoding the Instruction Set Architectures (ISAs) of NVIDIA’s GPUs, and generating assemblers for different generations of GPUs. Our framework enables cross-architecture binary analysis and transformation. Making the ISA accessible in this manner opens up a world of opportunities for developers and researchers, enabling numerous optimizations and explorations that are unachievable at the source-code level. Our infrastructure has already benefited and been adopted in important applications including performance tuning, binary instrumentation, resource allocation, and memory protection.},
keywords = {Code generation, Code translation and transformation, CUDA, GPU, Instruction set architecture (ISA)},
pubstate = {published},
tppubtype = {conference}
}
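One standard way to decode a closed ISA, in the spirit of the systematic method the abstract describes (the paper's actual procedure is more involved), is differential probing: assemble instructions that differ in exactly one operand and XOR the encodings to locate that operand's bit field. The toy assembler below stands in for the real black-box tool, and its encoding layout is entirely hypothetical:

```python
def toy_assemble(opcode, rd, rs):
    """Stand-in for the black-box assembler being probed (hypothetical
    32-bit encoding: opcode in bits 24-31, rd in 8-15, rs in 0-7)."""
    return (opcode << 24) | (rd << 8) | rs

def infer_field_bits(assemble, base_args, field, trials=8):
    """Assemble variants differing only in one operand and XOR each
    encoding against the baseline; the union of flipped bits is the
    inferred position of that operand's bit field."""
    baseline = assemble(**base_args)
    mask = 0
    for value in range(1, trials):
        variant = dict(base_args, **{field: value})
        mask |= baseline ^ assemble(**variant)
    return mask

base = {"opcode": 0x5C, "rd": 0, "rs": 0}
rd_mask = infer_field_bits(toy_assemble, base, "rd")
rs_mask = infer_field_bits(toy_assemble, base, "rs")
```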
2018
title = {Locality-Aware Software Throttling for Sparse Matrix Operation on GPUs},
author = {Yanhao Chen and Ari B Hayes and Chi Zhang and Timothy Salmon and Eddy Z Zhang},
url = {https://www.usenix.org/conference/atc18/presentation/chen-yanhao},
isbn = {978-1-939133-01-4},
year = {2018},
date = {2018-01-01},
booktitle = {2018 USENIX Annual Technical Conference (USENIX ATC 18)},
pages = {413–426},
publisher = {USENIX Association},
address = {Boston, MA},
keywords = {GPU, Program locality, Sparse matrix, Spmv},
pubstate = {published},
tppubtype = {conference}
}
2017
title = {GPU Taint Tracking},
author = {Ari B Hayes and Lingda Li and Mohammad Hedayati and Jiahuan He and Eddy Z Zhang and Kai Shen},
url = {https://www.usenix.org/conference/atc17/technical-sessions/presentation/hayes},
isbn = {978-1-931971-38-6},
year = {2017},
date = {2017-01-01},
booktitle = {2017 USENIX Annual Technical Conference (USENIX ATC 17)},
pages = {209–220},
publisher = {USENIX Association},
address = {Santa Clara, CA},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
title = {A Simple Yet Effective Balanced Edge Partition Model for Parallel Computing},
author = {Lingda Li and Robel Geda and Ari B Hayes and Yanhao Chen and Pranav Chaudhari and Eddy Z Zhang and Mario Szegedy},
url = {https://doi.org/10.1145/3078505.3078520},
doi = {10.1145/3078505.3078520},
isbn = {9781450350327},
year = {2017},
date = {2017-01-01},
booktitle = {Proceedings of the 2017 ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems (Sigmetrics 2017)},
pages = {6},
publisher = {Association for Computing Machinery},
address = {Urbana-Champaign, Illinois, USA},
series = {SIGMETRICS ’17 Abstracts},
abstract = {Graph edge partition models have recently become an appealing alternative to graph vertex partition models for distributed computing, due both to their flexibility in balancing loads and to their performance in reducing communication cost. In this paper, we propose a simple yet effective graph edge partitioning algorithm. In practice, our algorithm provides good partition quality while maintaining low partition overhead. It also outperforms similar state-of-the-art edge partition approaches, especially for power-law graphs. In theory, previous work showed that an approximation guarantee of O(d_max √(log n log k)) applies to graphs with m = Ω(k^2) edges (n is the number of vertices, and k is the number of partitions). We further rigorously prove that this approximation guarantee holds for all graphs. We also demonstrate the applicability of the proposed edge partition algorithm in real parallel computing systems. We draw our example from GPU program locality enhancement and demonstrate that the graph edge partition model applies not only to distributed computing with many compute nodes, but also to parallel computing in a single node with a many-core processor.},
keywords = {Data sharing, Edge-partition, GPU, Graph model, Program locality},
pubstate = {published},
tppubtype = {conference}
}
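For intuition about what a balanced edge partition optimizes, here is a minimal greedy sketch (not the paper's algorithm, whose approximation guarantee comes from a different construction): each edge is assigned to the partition that already replicates most of its endpoints, with partition load as the tie-breaker, trading off vertex replication against balance:

```python
def edge_partition(edges, k):
    """Greedy balanced edge partition: place each edge where its
    endpoints are already replicated (to limit communication),
    preferring lighter partitions on ties."""
    replicas = [set() for _ in range(k)]  # vertices copied into each part
    loads = [0] * k                       # edges per part
    assignment = []
    for u, v in edges:
        best = min(range(k),
                   key=lambda i: (-len({u, v} & replicas[i]), loads[i]))
        replicas[best].update((u, v))
        loads[best] += 1
        assignment.append(best)
    return assignment, loads, replicas

# A triangle clusters into one partition; the isolated edge balances load.
edges = [(0, 1), (1, 2), (2, 0), (3, 4)]
assignment, loads, replicas = edge_partition(edges, 2)
```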
title = {A Simple Yet Effective Balanced Edge Partition Model for Parallel Computing},
author = {Lingda Li and Robel Geda and Ari B Hayes and Yanhao Chen and Pranav Chaudhari and Eddy Z Zhang and Mario Szegedy},
url = {https://doi.org/10.1145/3143314.3078520},
doi = {10.1145/3143314.3078520},
issn = {0163-5999},
year = {2017},
date = {2017-01-01},
journal = {SIGMETRICS Perform. Eval. Rev.},
volume = {45},
number = {1},
pages = {6},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
abstract = {Graph edge partition models have recently become an appealing alternative to graph vertex partition models for distributed computing, due both to their flexibility in balancing loads and to their performance in reducing communication cost. In this paper, we propose a simple yet effective graph edge partitioning algorithm. In practice, our algorithm provides good partition quality while maintaining low partition overhead. It also outperforms similar state-of-the-art edge partition approaches, especially for power-law graphs. In theory, previous work showed that an approximation guarantee of O(d_max √(log n log k)) applies to graphs with m = Ω(k^2) edges (n is the number of vertices, and k is the number of partitions). We further rigorously prove that this approximation guarantee holds for all graphs. We also demonstrate the applicability of the proposed edge partition algorithm in real parallel computing systems. We draw our example from GPU program locality enhancement and demonstrate that the graph edge partition model applies not only to distributed computing with many compute nodes, but also to parallel computing in a single node with a many-core processor.},
keywords = {Data sharing, Edge-partition, GPU, Graph model, Program locality},
pubstate = {published},
tppubtype = {article}
}
title = {LD: Low-Overhead GPU Race Detection Without Access Monitoring},
author = {Pengcheng Li and Xiaoyu Hu and Dong Chen and Jacob Brock and Hao Luo and Eddy Z Zhang and Chen Ding},
url = {https://doi.org/10.1145/3046678},
doi = {10.1145/3046678},
issn = {1544-3566},
year = {2017},
date = {2017-01-01},
journal = {ACM Transactions on Architecture and Code Optimization (TACO 2017)},
volume = {14},
number = {1},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
abstract = {Data race detection has become an important problem in GPU programming. Previous designs of CPU race-checking tools are mainly task parallel and incur high overhead on GPUs due to access instrumentation, especially when monitoring the many thousands of threads routinely used by GPU programs. This article presents a novel data-parallel solution designed and optimized for the GPU architecture. It includes compiler support and a set of runtime techniques. It uses value-based checking, which detects the races reported in previous work, finds new races, and supports race-free deterministic GPU execution. More important, race checking is massively data parallel and does not introduce divergent branching or atomic synchronization. Its slowdown is less than 5× for over half of the tests and 10× on average, which is orders of magnitude more efficient than the cuda-memcheck tool by Nvidia and the methods that use fine-grained access instrumentation.},
keywords = {GPU race detection, Instrumentation-free, Value-based checking},
pubstate = {published},
tppubtype = {article}
}
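The core observation behind value-based checking is that a race makes the program's values depend on the schedule, so races can be exposed by comparing values rather than monitoring every access. The sketch below is an illustrative model, not LD's two-pass design: it enumerates interleavings of two non-atomic `x += 1` updates and flags a race when different schedules yield different final values:

```python
from itertools import permutations

def final_value(order):
    """Replay one interleaving of two threads that each perform a
    non-atomic x += 1, modeled as separate read and write steps."""
    x = 0
    regs = {}
    for tid, step in order:
        if step == "read":
            regs[tid] = x
        else:                 # write back the thread's register + 1
            x = regs[tid] + 1
    return x

steps = [(0, "read"), (0, "write"), (1, "read"), (1, "write")]
# keep per-thread program order (read before write), vary the rest
schedules = [p for p in permutations(steps)
             if all(p.index((t, "read")) < p.index((t, "write"))
                    for t in (0, 1))]
outcomes = {final_value(p) for p in schedules}
racy = len(outcomes) > 1  # schedule-dependent values reveal the race
```

The lost-update interleaving produces 1 while serialized execution produces 2, so the value comparison catches the race without instrumenting any individual access.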
title = {The Moser-Tardos Resample algorithm: Where is the limit? (an experimental inquiry)},
author = {Jan Catarata and Scott Corbett and Harry Stern and Mario Szegedy and Tomas Vyskocil and Zheng Zhang},
doi = {10.1137/1.9781611974768.13},
year = {2017},
date = {2017-01-01},
pages = {159-171},
keywords = {},
pubstate = {published},
tppubtype = {workshop}
}
2016
title = {Orion: A Framework for GPU Occupancy Tuning},
author = {Ari B Hayes and Lingda Li and Daniel Chavarría-Miranda and Shuaiwen Leon Song and Eddy Z Zhang},
url = {https://doi.org/10.1145/2988336.2988355},
doi = {10.1145/2988336.2988355},
isbn = {9781450343008},
year = {2016},
date = {2016-01-01},
booktitle = {Proceedings of the 17th International Middleware Conference (Middleware 2016)},
publisher = {Association for Computing Machinery},
address = {Trento, Italy},
abstract = {An important feature of modern GPU architectures is variable occupancy. Occupancy measures the ratio between the actual number of threads actively running on a GPU and the maximum number of threads that can be scheduled on a GPU. High-occupancy execution enables a large number of threads to run simultaneously and to hide memory latency, but may increase resource contention. Low-occupancy execution leads to less resource contention, but is less capable of hiding memory latency. Occupancy tuning is an important and challenging problem. A program running at two different occupancy levels can have three to four times difference in performance. We introduce Orion, the first GPU program occupancy tuning framework. The Orion framework automatically generates and chooses occupancy-adaptive code for any given GPU program. It is capable of finding the (near-)optimal occupancy level by combining static and dynamic tuning techniques. We demonstrate the efficiency of Orion with twelve representative benchmarks from the Rodinia benchmark suite and CUDA SDK evaluated on two different GPU architectures, obtaining up to 1.61 times speedup, 62.5% memory resource saving, and 6.7% energy saving compared to the baseline of optimized code compiled by nvcc.},
keywords = {Concurrent-program compilation, GPU occupancy tuning, Register allocation, Shared memory allocation},
pubstate = {published},
tppubtype = {conference}
}
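Occupancy as described in the abstract is bounded by whichever per-SM resource runs out first. The sketch below computes that bound; the default hardware limits are typical-but-hypothetical numbers (real GPUs also round register and shared-memory allocations to a hardware granularity, ignored here), so this is an estimate of the quantity Orion tunes, not Orion itself:

```python
def estimate_occupancy(regs_per_thread, smem_per_block, threads_per_block,
                       regs_per_sm=65536, smem_per_sm=98304,
                       max_threads_per_sm=2048, max_blocks_per_sm=32):
    """Occupancy = active threads / max schedulable threads on one SM,
    limited by registers, shared memory, threads, or blocks."""
    by_regs = regs_per_sm // max(1, regs_per_thread * threads_per_block)
    by_smem = (smem_per_sm // smem_per_block) if smem_per_block \
        else max_blocks_per_sm
    by_threads = max_threads_per_sm // threads_per_block
    blocks = min(by_regs, by_smem, by_threads, max_blocks_per_sm)
    return blocks * threads_per_block / max_threads_per_sm

full = estimate_occupancy(32, 0, 256)  # 8 resident blocks: full occupancy
half = estimate_occupancy(64, 0, 256)  # register pressure caps it at 4
```

Doubling the per-thread register count halves the resident blocks here, which is exactly the kind of resource/occupancy trade-off an occupancy tuner must navigate.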
title = {Tag-Split Cache for Efficient GPGPU Cache Utilization},
author = {Lingda Li and Ari B Hayes and Shuaiwen Leon Song and Eddy Z Zhang},
url = {https://doi.org/10.1145/2925426.2926253},
doi = {10.1145/2925426.2926253},
isbn = {9781450343619},
year = {2016},
date = {2016-01-01},
booktitle = {Proceedings of the 2016 International Conference on Supercomputing (ICS 2016)},
publisher = {Association for Computing Machinery},
address = {Istanbul, Turkey},
abstract = {Modern GPUs employ cache to improve memory system efficiency. However, a large amount of cache space is underutilized due to the irregular memory accesses and poor spatial locality commonly exhibited in GPU applications. Our experiments show that using smaller cache lines could improve cache space utilization, but doing so frequently suffers significant performance loss by introducing a large number of extra cache requests. In this work, we propose a novel cache design named tag-split cache (TSC) that enables fine-grained cache storage to address the problem of cache space underutilization while keeping the number of memory requests unchanged. TSC divides the tag into two parts to reduce storage overhead, and it supports multiple cache line replacement in one cycle. TSC can also automatically adjust cache storage granularity to avoid performance loss for applications with good spatial locality. Our evaluation shows that TSC improves the baseline cache performance by 17.2% on average across a wide range of applications. It also outperforms other previous techniques significantly.},
keywords = {Cache organization, GPGPU, Spatial Locality},
pubstate = {published},
tppubtype = {conference}
}
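A quick back-of-the-envelope model shows the underutilization the abstract targets. With strided accesses and coarse lines, each access claims a whole line but uses only a few bytes; finer-grained storage claims far less space for the same access stream. This is only the motivating arithmetic, not the tag-split mechanism itself:

```python
def storage_touched(addresses, line_size):
    """Cache storage consumed when each access allocates a whole
    aligned line of line_size bytes (assume no line reuse)."""
    lines = {addr // line_size for addr in addresses}
    return len(lines) * line_size

# 64 strided 4-byte loads, 128 bytes apart: every load hits a new line.
addrs = [i * 128 for i in range(64)]
coarse = storage_touched(addrs, 128)  # 64 lines x 128 B of cache claimed
fine = storage_touched(addrs, 32)     # 64 lines x  32 B of cache claimed
```

Only 4 of every 128 bytes stored are actually used in the coarse case; fine-grained storage cuts the claimed space by 4x here, which is the utilization gap TSC recovers without multiplying memory requests.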
title = {New-Sum: A Novel Online ABFT Scheme For General Iterative Methods},
author = {Dingwen Tao and Shuaiwen Leon Song and Sriram Krishnamoorthy and Panruo Wu and Xin Liang and Eddy Z Zhang and Darren Kerbyson and Zizhong Chen},
url = {https://doi.org/10.1145/2907294.2907306},
doi = {10.1145/2907294.2907306},
isbn = {9781450343145},
year = {2016},
date = {2016-01-01},
booktitle = {Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC 2016)},
pages = {43–55},
publisher = {Association for Computing Machinery},
address = {Kyoto, Japan},
abstract = {Emerging high-performance computing platforms, with large component counts and lower power margins, are anticipated to be more susceptible to soft errors in both logic circuits and memory subsystems. We present an online algorithm-based fault tolerance (ABFT) approach to efficiently detect and recover soft errors for general iterative methods. We design a novel checksum-based encoding scheme for matrix-vector multiplication that is resilient to both arithmetic and memory errors. Our design decouples the checksum updating process from the actual computation, and allows adaptive checksum overhead control. Building on this new encoding mechanism, we propose two online ABFT designs that can effectively recover from errors when combined with a checkpoint/rollback scheme. These designs are capable of addressing scenarios under different error rates. Our ABFT approaches apply to a wide range of iterative solvers that primarily rely on matrix-vector multiplication and vector linear operations. We evaluate our designs through comprehensive analytical and empirical analysis. Experimental evaluation on the Stampede supercomputer demonstrates the low performance overheads incurred by our two ABFT schemes for preconditioned CG (0.4% and 2.2%) and preconditioned BiCGSTAB (1.0% and 4.0%) for the largest SPD matrix from UFL Sparse Matrix Collection. The evaluation also demonstrates the flexibility and effectiveness of our proposed designs for detecting and recovering various types of soft errors in general iterative methods.},
keywords = {Algorithm-based fault tolerance (abft), Checkpoint, Checksum, Iterative methods, Online error detection, Resilience, Rollback recovery, Silent data corruption (sdc)},
pubstate = {published},
tppubtype = {conference}
}
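The checksum idea that online ABFT schemes build on can be shown in a few lines. For a correct product y = A x, the identity sum(y) = (eᵀA)·x must hold, where eᵀA is a precomputed column-sum vector, so a mismatch flags a silent error without recomputing the product. This is the classic invariant only; the paper's New-Sum encoding extends it to tolerate errors in the checksum update itself:

```python
def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def checksum(A):
    """Column sums c = e^T A, computed once per matrix."""
    return [sum(col) for col in zip(*A)]

def abft_check(A, x, y, tol=1e-9):
    """Verify the ABFT invariant sum(y) == (e^T A) . x for y = A x."""
    c = checksum(A)
    return abs(sum(y) - sum(ci * xi for ci, xi in zip(c, x))) <= tol

A = [[2.0, 1.0], [0.0, 3.0]]
x = [1.0, 2.0]
y = matvec(A, x)
ok = abft_check(A, x, y)   # clean run passes the check
y[0] += 5.0                # inject a silent data corruption
caught = not abft_check(A, x, y)
```

Detection alone is not recovery; as the abstract notes, the online schemes pair this kind of check with checkpoint/rollback to restore a clean state.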
title = {Critical points based register-concurrency autotuning for GPUs},
author = {Ang Li and Shuaiwen Leon Song and Akash Kumar and Eddy Z Zhang and Daniel Chavarría-Miranda and Henk Corporaal},
editor = {Luca Fanucci and Jürgen Teich},
url = {http://ieeexplore.ieee.org/document/7459506/},
year = {2016},
date = {2016-01-01},
booktitle = {2016 Design, Automation & Test in Europe Conference & Exhibition (DATE 2016), Dresden, Germany, March 14-18, 2016},
pages = {1273–1278},
publisher = {IEEE},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
2015
title = {Towards a GPU SDN controller},
author = {E G Renart and E Z Zhang and B Nath},
year = {2015},
date = {2015-01-01},
booktitle = {2015 International Conference and Workshops on Networked Systems (NetSys)},
pages = {1-5},
keywords = {},
pubstate = {published},
tppubtype = {workshop}
}
2014
title = {Massive Atomics for Massive Parallelism on GPUs},
author = {Ian J Egielski and Jesse Huang and Eddy Z Zhang},
url = {https://doi.org/10.1145/2602988.2602993},
doi = {10.1145/2602988.2602993},
isbn = {9781450329217},
year = {2014},
date = {2014-01-01},
booktitle = {Proceedings of the 2014 International Symposium on Memory Management (ISMM 2014)},
pages = {93–103},
publisher = {Association for Computing Machinery},
address = {Edinburgh, United Kingdom},
abstract = {One important type of parallelism exploited in many applications is reduction type parallelism. In these applications, the order of the read-modify-write updates to one shared data object can be arbitrary as long as there is an imposed order for the read-modify-write updates. The typical way to parallelize these types of applications is to first let every individual thread perform local computation and save the results in thread-private data objects, and then merge the results from all worker threads in the reduction stage. All applications that fit into the map reduce framework belong to this category. Additionally, the machine learning, data mining, numerical analysis and scientific simulation applications may also benefit from reduction type parallelism. However, the parallelization scheme via the usage of thread-private data objects may not be viable in massively parallel GPU applications. Because the number of concurrent threads is extremely large (at least tens of thousands), thread-private data object creation may lead to memory space explosion problems. In this paper, we propose a novel approach to deal with shared data object management for reduction type parallelism on GPUs. Our approach exploits fine-grained parallelism while at the same time maintaining good programmability. It is based on the usage of intrinsic hardware atomic instructions. Atomic operation may appear to be expensive since it causes thread serialization when multiple threads atomically update the same memory object at the same time. However, we discovered that, with appropriate atomic collision reduction techniques, the atomic implementation can outperform the non-atomics implementation, even for benchmarks known to have high performance non-atomics GPU implementations. In the meantime, the usage of atomics can greatly reduce coding complexity as neither thread-private object management nor explicit thread-communication (for the shared data objects protected by atomic operations) is necessary.},
keywords = {Atomics, Concurrency, GPU, Parallelism},
pubstate = {published},
tppubtype = {conference}
}
title = {Unified On-Chip Memory Allocation for SIMT Architecture},
author = {Ari B Hayes and Eddy Z Zhang},
url = {https://doi.org/10.1145/2597652.2597685},
doi = {10.1145/2597652.2597685},
isbn = {9781450326421},
year = {2014},
date = {2014-01-01},
booktitle = {Proceedings of the 28th ACM International Conference on Supercomputing (ICS 2014)},
pages = {293–302},
publisher = {Association for Computing Machinery},
address = {Munich, Germany},
abstract = {The popularity of general purpose Graphic Processing Unit (GPU) is largely attributed to the tremendous concurrency enabled by its underlying architecture — single instruction multiple thread (SIMT) architecture. It keeps the context of a significant number of threads in registers to enable fast “context switches” when the processor is stalled due to execution dependence, memory requests, etc. The SIMT architecture has a large register file evenly partitioned among all concurrent threads. Per-thread register usage determines the number of concurrent threads, which strongly affects the whole program performance. Existing register allocation techniques, extensively studied in the past several decades, are oblivious to the register contention due to the concurrent execution of many threads. They are prone to making optimization decisions that benefit single thread but degrade the whole application performance. Is it possible for compilers to make register allocation decisions that can maximize the whole GPU application performance? We tackle this important question from two different aspects in this paper. We first propose a unified on-chip memory allocation framework that uses scratch-pad memory to help: (1) alleviate single-thread register pressure; (2) increase whole application throughput. Secondly, we propose a characterization model for the SIMT execution model in order to achieve a desired on-chip memory partition given the register pressure of a program. Overall, we discovered that it is possible to automatically determine an on-chip memory resource allocation that maximizes concurrency while ensuring good single-thread performance at compile-time. We evaluated our techniques on a representative set of GPU benchmarks with non-trivial register pressure. We are able to achieve up to 1.70 times speedup over the baseline of the traditional register allocation scheme that maximizes single thread performance.},
keywords = {Compiler optimization, Concurrency, GPU, Register allocation, Shared memory allocation},
pubstate = {published},
tppubtype = {conference}
}
2013
title = {Complexity Analysis and Algorithm Design for Reorganizing Data to Minimize Non-Coalesced Memory Accesses on GPU},
author = {Bo Wu and Zhijia Zhao and Eddy Zheng Zhang and Yunlian Jiang and Xipeng Shen},
url = {https://doi.org/10.1145/2442516.2442523},
doi = {10.1145/2442516.2442523},
isbn = {9781450319225},
year = {2013},
date = {2013-01-01},
booktitle = {Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2013)},
pages = {57–68},
publisher = {Association for Computing Machinery},
address = {Shenzhen, China},
abstract = {The performance of Graphic Processing Units (GPU) is sensitive to irregular memory references. Some recent work shows the promise of data reorganization for eliminating non-coalesced memory accesses that are caused by irregular references. However, all previous studies have employed simple, heuristic methods to determine the new data layouts to create. As a result, they either do not provide any performance guarantee or are effective to only some limited scenarios. This paper contributes a fundamental study to the problem. It systematically analyzes the inherent complexity of the problem in various settings, and for the first time, proves that the problem is NP-complete. It then points out the limitations of existing techniques and reveals that in practice, the essence for designing an appropriate data reorganization algorithm can be reduced to a tradeoff among space, time, and complexity. Based on that insight, it develops two new data reorganization algorithms to overcome the limitations of previous methods. Experiments show that an assembly composed of the new algorithms and a previous algorithm can circumvent the inherent complexity in finding optimal data layouts, making it feasible to minimize non-coalesced memory accesses for a variety of irregular applications and settings that are beyond the reach of existing techniques.},
keywords = {Computational complexity, Data transformation, GPGPU, Memory coalescing, Runtime optimizations, Thread-data remapping},
pubstate = {published},
tppubtype = {conference}
}
title = {An Infrastructure for Tackling Input-Sensitivity of GPU Program Optimizations},
author = {Xipeng Shen and Yixun Liu and Eddy Z Zhang and Poornima Bhamidipati},
url = {https://doi.org/10.1007/s10766-012-0236-3},
doi = {10.1007/s10766-012-0236-3},
issn = {0885-7458},
year = {2013},
date = {2013-01-01},
journal = {Int. J. Parallel Program.},
volume = {41},
number = {6},
pages = {855–869},
publisher = {Kluwer Academic Publishers},
address = {USA},
abstract = {Graphic processing units (GPU) have become increasingly adopted for the enhancement of computing throughput. However, the development of a high-quality GPU application is challenging, due to the large optimization space and complex unpredictable effects of optimizations on GPU program performance. Many recent efforts have been employing empirical search-based auto-tuners to tackle the problem, but few of them have concentrated on the influence of program inputs on the optimizations. In this paper, based on a set of CUDA and OpenCL kernels, we report some evidences on the importance for auto-tuners to adapt to program input changes, and present a framework, G-ADAPT+, to address the influence by constructing cross-input predictive models for automatically predicting the (near-)optimal configurations for an arbitrary input to a GPU program. G-ADAPT+ is based on source-to-source compilers, specifically, Cetus and ROSE. It supports the optimizations of both CUDA and OpenCL programs.},
keywords = {Cross-input adaptation, CUDA, Empirical search, G-ADAPT, GPU, Program optimizations},
pubstate = {published},
tppubtype = {article}
}
2012
title = {The Significance of CMP Cache Sharing on Contemporary Multithreaded Applications},
author = {E Z Zhang and Y Jiang and X Shen},
year = {2012},
date = {2012-01-01},
journal = {IEEE Transactions on Parallel and Distributed Systems},
volume = {23},
number = {2},
pages = {367-374},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
2011
title = {Enhancing Data Locality for Dynamic Simulations through Asynchronous Data Transformations and Adaptive Control},
author = {B Wu and E Z Zhang and X Shen},
year = {2011},
date = {2011-01-01},
booktitle = {2011 International Conference on Parallel Architectures and Compilation Techniques},
pages = {243-252},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
title = {Correctly Treating Synchronizations in Compiling Fine-Grained SPMD-Threaded Programs for CPU},
author = {Z Guo and E Z Zhang and X Shen},
year = {2011},
date = {2011-01-01},
booktitle = {2011 International Conference on Parallel Architectures and Compilation Techniques},
pages = {310-319},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
title = {A Step towards Transparent Integration of Input-Consciousness into Dynamic Program Optimizations},
author = {Kai Tian and Eddy Zhang and Xipeng Shen},
url = {https://doi.org/10.1145/2048066.2048103},
doi = {10.1145/2048066.2048103},
isbn = {9781450309400},
year = {2011},
date = {2011-01-01},
booktitle = {Proceedings of the 2011 ACM International Conference on Object Oriented Programming Systems Languages and Applications},
pages = {445–462},
publisher = {Association for Computing Machinery},
address = {Portland, Oregon, USA},
series = {OOPSLA ’11},
abstract = {Dynamic program optimizations are critical for the efficiency of applications in managed programming languages and scripting languages. Recent studies have shown that exploitation of program inputs may enhance the effectiveness of dynamic optimizations significantly. However, current solutions for enabling the exploitation require either programmers’ annotations or intensive offline profiling, impairing the practical adoption of the techniques. This current work examines the basic feasibility of transparent integration of input-consciousness into dynamic program optimizations, particularly in managed execution environments. It uses transparent learning across production runs as the basic vehicle, and investigates the implications of cross-run learning on each main component of input-conscious dynamic optimizations. It proposes several techniques to address some key challenges for the transparent integration, including randomized inspection-instrumentation for cross-user data collection, a sparsity-tolerant algorithm for input characterization, and selective prediction for efficiency protection. These techniques make it possible to automatically recognize the relations between the inputs to a program and the appropriate ways to optimize it. The whole process happens transparently across production runs; no need for offline profiling or programmer intervention. Experiments on a number of Java programs demonstrate the effectiveness of the techniques in enabling input-consciousness for dynamic optimizations, revealing the feasibility and potential benefits of the new optimization paradigm in some basic settings.},
keywords = {Dynamic optimizations, Dynamic version selection, Java virtual machine, Just-in-time compilation, Proactivity, Seminal behaviors},
pubstate = {published},
tppubtype = {conference}
}
title = {On-the-Fly Elimination of Dynamic Irregularities for GPU Computing},
author = {Eddy Z Zhang and Yunlian Jiang and Ziyu Guo and Kai Tian and Xipeng Shen},
url = {https://doi.org/10.1145/1950365.1950408},
doi = {10.1145/1950365.1950408},
isbn = {9781450302661},
year = {2011},
date = {2011-01-01},
booktitle = {Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems},
pages = {369–380},
publisher = {Association for Computing Machinery},
address = {Newport Beach, California, USA},
series = {ASPLOS XVI},
abstract = {The power-efficient massively parallel Graphics Processing Units (GPUs) have become increasingly influential for general-purpose computing over the past few years. However, their efficiency is sensitive to dynamic irregular memory references and control flows in an application. Experiments have shown great performance gains when these irregularities are removed. But it remains an open question how to achieve those gains through software approaches on modern GPUs. This paper presents a systematic exploration to tackle dynamic irregularities in both control flows and memory references. It reveals some properties of dynamic irregularities in both control flows and memory references, their interactions, and their relations with program data and threads. It describes several heuristics-based algorithms and runtime adaptation techniques for effectively removing dynamic irregularities through data reordering and job swapping. It presents a framework, G-Streamline, as a unified software solution to dynamic irregularities in GPU computing. G-Streamline has several distinctive properties. It is a pure software solution and works on the fly, requiring no hardware extensions or offline profiling. It treats both types of irregularities at the same time in a holistic fashion, maximizing the whole-program performance by resolving conflicts among optimizations. Its optimization overhead is largely transparent to GPU kernel executions, jeopardizing no basic efficiency of the GPU application. Finally, it is robust to the presence of various complexities in GPU applications. Experiments show that G-Streamline is effective in reducing dynamic irregularities in GPU computing, producing speedups between 1.07 and 2.5 for a variety of applications.},
keywords = {Cpu-gpu pipelining, Data transformation, GPGPU, Memory coalescing, Thread data remapping, Thread divergence},
pubstate = {published},
tppubtype = {conference}
}
2010
title = {An Input-Centric Paradigm for Program Dynamic Optimizations},
author = {Kai Tian and Yunlian Jiang and Eddy Z Zhang and Xipeng Shen},
url = {https://doi.org/10.1145/1869459.1869471},
doi = {10.1145/1869459.1869471},
isbn = {9781450302036},
year = {2010},
date = {2010-01-01},
booktitle = {Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications},
pages = {125–139},
publisher = {Association for Computing Machinery},
address = {Reno/Tahoe, Nevada, USA},
series = {OOPSLA ’10},
abstract = {Accurately predicting program behaviors (e.g., locality, dependency, method calling frequency) is fundamental for program optimizations and runtime adaptations. Despite decades of remarkable progress, prior studies have not systematically exploited program inputs, a deciding factor for program behaviors. Triggered by the strong and predictive correlations between program inputs and behaviors that recent studies have uncovered, this work proposes to include program inputs into the focus of program behavior analysis, cultivating a new paradigm named input-centric program behavior analysis. This new approach consists of three components, forming a three-layer pyramid. At the base is program input characterization, a component for resolving the complexity in program raw inputs and the extraction of important features. In the middle is input-behavior modeling, a component for recognizing and modeling the correlations between characterized input features and program behaviors. These two components constitute input-centric program behavior analysis, which (ideally) is able to predict the large-scope behaviors of a program’s execution as soon as the execution starts. The top layer of the pyramid is input-centric adaptation, which capitalizes on the novel opportunities that the first two components create to facilitate proactive adaptation for program optimizations. By centering on program inputs, the new approach resolves a proactivity-adaptivity dilemma inherent in previous techniques. Its benefits are demonstrated through proactive dynamic optimizations and version selection, yielding significant performance improvement on a set of Java and C programs.},
keywords = {Dynamic optimizations, Dynamic version selection, Java virtual machine, Just-in-time compilation, Proactivity, Seminal behaviors},
pubstate = {published},
tppubtype = {conference}
}
title = {Streamlining GPU Applications on the Fly: Thread Divergence Elimination through Runtime Thread-Data Remapping},
author = {Eddy Z Zhang and Yunlian Jiang and Ziyu Guo and Xipeng Shen},
url = {https://doi.org/10.1145/1810085.1810104},
doi = {10.1145/1810085.1810104},
isbn = {9781450300186},
year = {2010},
date = {2010-01-01},
booktitle = {Proceedings of the 24th ACM International Conference on Supercomputing},
pages = {115–126},
publisher = {Association for Computing Machinery},
address = {Tsukuba, Ibaraki, Japan},
series = {ICS ’10},
abstract = {Because of their tremendous computing power and remarkable cost efficiency, GPUs (graphic processing unit) have quickly emerged as a kind of influential platform for high performance computing. However, as GPUs are designed for massive data-parallel computing, their performance is subject to the presence of condition statements in a GPU application. On a conditional branch where threads diverge in which path to take, the threads taking different paths have to run serially. Such divergences often cause serious performance degradations, impairing the adoption of GPU for many applications that contain non-trivial branches or certain types of loops. This paper presents a systematic investigation in the employment of runtime thread-data remapping for solving that problem. It introduces an abstract form of GPU applications, based on which, it describes the use of reference redirection and data layout transformation for remapping data and threads to minimize thread divergences. It discusses the major challenges for practical deployment of the remapping techniques, most notably, the conflict between the large remapping overhead and the need for the remapping to happen on the fly because of the dependence of thread divergences on runtime values. It offers a solution to the challenge by proposing a CPU-GPU pipelining scheme and a label-assign-move (LAM) algorithm to virtually hide all the remapping overhead. At the end, it reports significant performance improvement produced by the remapping for a set of GPU applications, demonstrating the potential of the techniques for streamlining GPU applications on the fly.},
keywords = {Cpu-gpu pipelining, Data transformation, GPGPU, Thread divergence, Thread-data remapping},
pubstate = {published},
tppubtype = {conference}
}
title = {Exploiting Statistical Correlations for Proactive Prediction of Program Behaviors},
author = {Yunlian Jiang and Eddy Z Zhang and Kai Tian and Feng Mao and Malcom Gethers and Xipeng Shen and Yaoqing Gao},
url = {https://doi.org/10.1145/1772954.1772989},
doi = {10.1145/1772954.1772989},
isbn = {9781605586359},
year = {2010},
date = {2010-01-01},
booktitle = {Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization},
pages = {248–256},
publisher = {Association for Computing Machinery},
address = {Toronto, Ontario, Canada},
series = {CGO ’10},
abstract = {This paper presents a finding and a technique on program behavior prediction. The finding is that surprisingly strong statistical correlations exist among the behaviors of different program components (e.g., loops) and among different types of program level behaviors (e.g., loop trip-counts versus data values). Furthermore, the correlations can be beneficially exploited: They help resolve the proactivity-adaptivity dilemma faced by existing program behavior predictions, making it possible to gain the strengths of both approaches--the large scope and earliness of offline-profiling--based predictions, and the cross-input adaptivity of runtime sampling-based predictions. The main technique contributed by this paper centers on a new concept, seminal behaviors. Enlightened by the existence of strong correlations among program behaviors, we propose a regression based framework to automatically identify a small set of behaviors that can lead to accurate prediction of other behaviors in a program. We call these seminal behaviors. By applying statistical learning techniques, the framework constructs predictive models that map from seminal behaviors to other behaviors, enabling proactive and cross-input adaptive prediction of program behaviors. The prediction helps a commercial compiler, the IBM XL C compiler, generate code that runs up to 45% faster (5%-13% on average), demonstrating the large potential of correlation-based techniques for program optimizations.},
keywords = {Correlation, Program behavior analysis},
pubstate = {published},
tppubtype = {conference}
}
title = {Does Cache Sharing on Modern CMP Matter to the Performance of Contemporary Multithreaded Programs?},
author = {Eddy Z Zhang and Yunlian Jiang and Xipeng Shen},
url = {https://doi.org/10.1145/1693453.1693482},
doi = {10.1145/1693453.1693482},
isbn = {9781605588773},
year = {2010},
date = {2010-01-01},
booktitle = {Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming},
pages = {203–212},
publisher = {Association for Computing Machinery},
address = {Bangalore, India},
series = {PPoPP ’10},
abstract = {Most modern Chip Multiprocessors (CMP) feature shared cache on chip. For multithreaded applications, the sharing reduces communication latency among co-running threads, but also results in cache contention. A number of studies have examined the influence of cache sharing on multithreaded applications, but most of them have concentrated on the design or management of shared cache, rather than a systematic measurement of the influence. Consequently, prior measurements have been constrained by the reliance on simulators, the use of out-of-date benchmarks, and the limited coverage of deciding factors. The influence of CMP cache sharing on contemporary multithreaded applications remains preliminarily understood. In this work, we conduct a systematic measurement of the influence on two kinds of commodity CMP machines, using a recently released CMP benchmark suite, PARSEC, with a number of potentially important factors on program, OS, and architecture levels considered. The measurement shows some surprising results. Contrary to commonly perceived importance of cache sharing, neither positive nor negative effects from the cache sharing are significant for most of the program executions, regardless of the types of parallelism, input datasets, architectures, numbers of threads, and assignments of threads to cores. After a detailed analysis, we find that the main reason is the mismatch of current development and compilation of multithreaded applications and CMP architectures. By transforming the programs in a cache-sharing-aware manner, we observe up to 36% performance increase when the threads are placed on cores appropriately.},
keywords = {Chip multiprocessors, Parallel program optimizations, Shared cache, Thread scheduling},
pubstate = {published},
tppubtype = {conference}
}
title = {KPC-Toolbox: Best Recipes for Automatic Trace Fitting Using Markovian Arrival Processes},
author = {Giuliano Casale and Eddy Z Zhang and Evgenia Smirni},
url = {https://doi.org/10.1016/j.peva.2009.12.003},
doi = {10.1016/j.peva.2009.12.003},
issn = {0166-5316},
year = {2010},
date = {2010-01-01},
journal = {Perform. Eval.},
volume = {67},
number = {9},
pages = {873–896},
publisher = {Elsevier Science Publishers B. V.},
address = {NLD},
abstract = {We present the KPC-Toolbox, a library of MATLAB scripts for fitting workload traces into Markovian Arrival Processes (MAPs) in an automatic way based on the recently proposed Kronecker Product Composition (KPC) method. We first present detailed sensitivity analysis that builds intuition on which trace descriptors are the most important for queueing performance, stressing the advantages of matching higher-order correlations of the process rather than higher-order moments of the distribution. Given that the MAP parameterization space can be very large, we focus on first determining the order of the smallest MAP that can fit the trace well using the Bayesian Information Criterion (BIC). The KPC-Toolbox then automatically derives a MAP that captures accurately the most essential features of the trace. Extensive experimentation illustrates the effectiveness of the KPC-Toolbox in fitting traces that are well documented in the literature as very challenging to fit, showing that the KPC-Toolbox offers a simple and powerful solution to fitting accurately trace data into MAPs. We provide a characterization of moments and correlations that can be fitted exactly by KPC, thus showing the wider applicability of the method compared to small order MAPs. We also consider the fitting of phase-type (PH-type) distributions, which are an important specialization of MAPs that are useful for describing traces without correlations in their time series. We illustrate that the KPC methodology can be easily adapted to PH-type fitting and present experimental results on networking and disk drive traces showing that the KPC-Toolbox can also match accurately higher-order moments of the inter-arrival times in place of correlations.},
keywords = {Automatic fitting, MAP characterization, Markovian arrival process (MAP), Phase-type distribution, Temporal dependence, Time series modeling},
pubstate = {published},
tppubtype = {article}
}
2009
title = {A cross-input adaptive framework for GPU program optimizations},
author = {Yixun Liu and E Z Zhang and X Shen},
year = {2009},
date = {2009-01-01},
booktitle = {2009 IEEE International Symposium on Parallel Distributed Processing},
pages = {1-10},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
title = {Influence of Program Inputs on the Selection of Garbage Collectors},
author = {Feng Mao and Eddy Z Zhang and Xipeng Shen},
url = {https://doi.org/10.1145/1508293.1508307},
doi = {10.1145/1508293.1508307},
isbn = {9781605583754},
year = {2009},
date = {2009-01-01},
booktitle = {Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments},
pages = {91–100},
publisher = {Association for Computing Machinery},
address = {Washington, DC, USA},
series = {VEE ’09},
abstract = {Many studies have shown that the best performer among a set of garbage collectors tends to be different for different applications. Researchers have proposed application-specific selection of garbage collectors. In this work, we concentrate on a second dimension of the problem: the influence of program inputs on the selection of garbage collectors. We collect tens to hundreds of inputs for a set of Java benchmarks, and measure their performance on Jikes RVM with different heap sizes and garbage collectors. A rigorous statistical analysis produces four-fold insights. First, inputs influence the relative performance of garbage collectors significantly, causing large variations to the top set of garbage collectors across inputs. Profiling one or few runs is thus inadequate for selecting the garbage collector that works well for most inputs. Second, when the heap size ratio is fixed, one or two types of garbage collectors are enough to stimulate the top performance of the program on all inputs. Third, for some programs, the heap size ratio significantly affects the relative performance of different types of garbage collectors. For the selection of garbage collectors on those programs, it is necessary to have a cross-input predictive model that predicts the minimum possible heap size of the execution on an arbitrary input. Finally, based on regression techniques, we demonstrate the predictability of the minimum possible heap size, indicating the potential feasibility of the input-specific selection of garbage collectors.},
keywords = {Cross-input program analysis, Input-specific selection, Minimum possible heap size, Profiling, Selection of garbage collectors},
pubstate = {published},
tppubtype = {conference}
}
2008
title = {KPC-Toolbox: Simple Yet Effective Trace Fitting Using Markovian Arrival Processes},
author = {G Casale and E Z Zhang and E Smirni},
year = {2008},
date = {2008-01-01},
booktitle = {2008 Fifth International Conference on Quantitative Evaluation of Systems},
pages = {83-92},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}