Keywords (tags) and Publication List
Hua, Fei; Chen, Yanhao; Jin, Yuwei; Zhang, Chi; Hayes, Ari; Zhang, Youtao; Zhang, Eddy Z: AutoBraid: A Framework for Enabling Efficient Surface Code Communication in Quantum Computing. Conference, 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2021), Association for Computing Machinery, 2021, ISBN 9781450385572. Tags: Compiler optimization, Quantum Computing, Quantum Error Correction.
Chen, Yanhao; Hua, Fei; Jin, Yuwei; Zhang, Eddy Z: BGPQ: A Heap-Based Priority Queue Design for GPUs. Conference, 50th International Conference on Parallel Processing (ICPP 2021), Association for Computing Machinery, New York, NY, USA, 2021, ISBN 9781450390682. Tags: Batched Heap, GPUs, Priority Queue.
Zhang, Chi; Hayes, Ari B; Qiu, Longfei; Jin, Yuwei; Chen, Yanhao; Zhang, Eddy Z: Time-Optimal Qubit Mapping. Conference, Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2021), Association for Computing Machinery, New York, NY, USA, 2021. Tags: A* search, Compilation technique, QFT, Quantum, Qubit mapping.
Hua, Fei; Zhang, Eddy Z: Optimizing Surface Code Braiding. Workshop, The 3rd International Workshop on Quantum Compilation (IWQC 2019), 2019. Tags: Compilation technique, Quantum, Surface code.
Fu, Hao; Zhu, Mingzheng; Wu, Wenli; Zhang, Eddy Z; Tan, Haisheng: Towards Optimal Qubit Mapping in NISQ Era. Workshop, The 3rd International Workshop on Quantum Compilation (IWQC 2019), 2019. Tags: Compilation technique, Quantum, Qubit mapping.
Hayes, Ari B; Hua, Fei; Huang, Jin; Chen, Yanhao; Zhang, Eddy Z: Decoding CUDA Binary. Conference, Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO 2019), IEEE Press, Washington, DC, USA, 2019, ISBN 9781728114361. Tags: Code generation, Code translation and transformation, CUDA, GPU, Instruction set architecture (ISA).
Chen, Yanhao; Hayes, Ari B; Zhang, Chi; Salmon, Timothy; Zhang, Eddy Z: Locality-Aware Software Throttling for Sparse Matrix Operation on GPUs. Conference, 2018 USENIX Annual Technical Conference (USENIX ATC 18), USENIX Association, Boston, MA, 2018, ISBN 978-1-939133-01-4. Tags: GPU, Program locality, Sparse matrix, Spmv.
Hayes, Ari B; Li, Lingda; Hedayati, Mohammad; He, Jiahuan; Zhang, Eddy Z; Shen, Kai: GPU Taint Tracking. Conference, 2017 USENIX Annual Technical Conference (USENIX ATC 17), USENIX Association, Santa Clara, CA, 2017, ISBN 978-1-931971-38-6.
Li, Lingda; Geda, Robel; Hayes, Ari B; Chen, Yanhao; Chaudhari, Pranav; Zhang, Eddy Z; Szegedy, Mario: A Simple Yet Effective Balanced Edge Partition Model for Parallel Computing. Conference, Proceedings of the 2017 ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS 2017), SIGMETRICS ’17 Abstracts, Association for Computing Machinery, Urbana-Champaign, Illinois, USA, 2017, ISBN 9781450350327. Tags: Data sharing, Edge-partition, GPU, Graph model, Program locality.
Li, Lingda; Geda, Robel; Hayes, Ari B; Chen, Yanhao; Chaudhari, Pranav; Zhang, Eddy Z; Szegedy, Mario: A Simple Yet Effective Balanced Edge Partition Model for Parallel Computing. Journal article, SIGMETRICS Perform. Eval. Rev., 45(1), p. 6, 2017, ISSN 0163-5999. Tags: Data sharing, Edge-partition, GPU, Graph model, Program locality.
Li, Pengcheng; Hu, Xiaoyu; Chen, Dong; Brock, Jacob; Luo, Hao; Zhang, Eddy Z; Ding, Chen: LD: Low-Overhead GPU Race Detection Without Access Monitoring. Journal article, ACM Transactions on Architecture and Code Optimization (TACO), 14(1), 2017, ISSN 1544-3566. Tags: GPU race detection, Instrumentation-free, Value-based checking.
Catarata, Jan; Corbett, Scott; Stern, Harry; Szegedy, Mario; Vyskocil, Tomas; Zhang, Zheng: The Moser-Tardos Resample algorithm: Where is the limit? (an experimental inquiry). Workshop, 2017.
Hayes, Ari B; Li, Lingda; Chavarría-Miranda, Daniel; Song, Shuaiwen Leon; Zhang, Eddy Z: Orion: A Framework for GPU Occupancy Tuning. Conference, Proceedings of the 17th International Middleware Conference (Middleware 2016), Association for Computing Machinery, Trento, Italy, 2016, ISBN 9781450343008. Tags: Concurrent-program compilation, GPU occupancy tuning, Register allocation, Shared memory allocation.
Li, Lingda; Hayes, Ari B; Song, Shuaiwen Leon; Zhang, Eddy Z: Tag-Split Cache for Efficient GPGPU Cache Utilization. Conference, Proceedings of the 2016 International Conference on Supercomputing (ICS 2016), Association for Computing Machinery, Istanbul, Turkey, 2016, ISBN 9781450343619. Tags: Cache organization, GPGPU, Spatial Locality.
Tao, Dingwen; Song, Shuaiwen Leon; Krishnamoorthy, Sriram; Wu, Panruo; Liang, Xin; Zhang, Eddy Z; Kerbyson, Darren; Chen, Zizhong: New-Sum: A Novel Online ABFT Scheme For General Iterative Methods. Conference, Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC 2016), Association for Computing Machinery, Kyoto, Japan, 2016, ISBN 9781450343145. Tags: Algorithm-based fault tolerance (ABFT), Checkpoint, Checksum, Iterative methods, Online error detection, Resilience, Rollback recovery, Silent data corruption (SDC).
Li, Ang; Song, Shuaiwen Leon; Kumar, Akash; Zhang, Eddy Z; Chavarría-Miranda, Daniel; Corporaal, Henk: Critical points based register-concurrency autotuning for GPUs. Conference, 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE 2016), Dresden, Germany, March 14-18, 2016, IEEE, 2016.
Renart, E G; Zhang, E Z; Nath, B: Towards a GPU SDN controller. Workshop, 2015 International Conference and Workshops on Networked Systems (NetSys), 2015.
Egielski, Ian J; Huang, Jesse; Zhang, Eddy Z: Massive Atomics for Massive Parallelism on GPUs. Conference, Proceedings of the 2014 International Symposium on Memory Management (ISMM 2014), Association for Computing Machinery, Edinburgh, United Kingdom, 2014, ISBN 9781450329217. Tags: Atomics, Concurrency, GPU, Parallelism.
Hayes, Ari B; Zhang, Eddy Z: Unified On-Chip Memory Allocation for SIMT Architecture. Conference, Proceedings of the 28th ACM International Conference on Supercomputing (ICS 2014), Association for Computing Machinery, Munich, Germany, 2014, ISBN 9781450326421. Tags: Compiler optimization, Concurrency, GPU, Register allocation, Shared memory allocation.
Wu, Bo; Zhao, Zhijia; Zhang, Eddy Zheng; Jiang, Yunlian; Shen, Xipeng: Conference, Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2013), Association for Computing Machinery, Shenzhen, China, 2013, ISBN 9781450319225. Tags: Computational complexity, Data transformation, GPGPU, Memory coalescing, Runtime optimizations, Thread-data remapping.
Shen, Xipeng; Liu, Yixun; Zhang, Eddy Z; Bhamidipati, Poornima: An Infrastructure for Tackling Input-Sensitivity of GPU Program Optimizations. Journal article, Int. J. Parallel Program., 41(6), pp. 855–869, 2013, ISSN 0885-7458. Tags: Cross-input adaptation, CUDA, Empirical search, G-ADAPT, GPU, Program optimizations.
Zhang, E Z; Jiang, Y; Shen, X: The Significance of CMP Cache Sharing on Contemporary Multithreaded Applications. Journal article, IEEE Transactions on Parallel and Distributed Systems, 23(2), pp. 367-374, 2012.
Wu, B; Zhang, E Z; Shen, X: Enhancing Data Locality for Dynamic Simulations through Asynchronous Data Transformations and Adaptive Control. Conference, 2011 International Conference on Parallel Architectures and Compilation Techniques, 2011.
Guo, Z; Zhang, E Z; Shen, X: Correctly Treating Synchronizations in Compiling Fine-Grained SPMD-Threaded Programs for CPU. Conference, 2011 International Conference on Parallel Architectures and Compilation Techniques, 2011.
Tian, Kai; Zhang, Eddy; Shen, Xipeng: A Step towards Transparent Integration of Input-Consciousness into Dynamic Program Optimizations. Conference, Proceedings of the 2011 ACM International Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA ’11), Association for Computing Machinery, Portland, Oregon, USA, 2011, ISBN 9781450309400. Tags: Dynamic optimizations, Dynamic version selection, Java virtual machine, Just-in-time compilation, Proactivity, Seminal behaviors.
Zhang, Eddy Z; Jiang, Yunlian; Guo, Ziyu; Tian, Kai; Shen, Xipeng: On-the-Fly Elimination of Dynamic Irregularities for GPU Computing. Conference, Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XVI), Association for Computing Machinery, Newport Beach, California, USA, 2011, ISBN 9781450302661. Tags: Cpu-gpu pipelining, Data transformation, GPGPU, Memory coalescing, Thread data remapping, Thread divergence.
Tian, Kai; Jiang, Yunlian; Zhang, Eddy Z; Shen, Xipeng: An Input-Centric Paradigm for Program Dynamic Optimizations. Conference, Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA ’10), Association for Computing Machinery, Reno/Tahoe, Nevada, USA, 2010, ISBN 9781450302036. Tags: Dynamic optimizations, Dynamic version selection, Java virtual machine, Just-in-time compilation, Proactivity, Seminal behaviors.
Zhang, Eddy Z; Jiang, Yunlian; Guo, Ziyu; Shen, Xipeng: Conference, Proceedings of the 24th ACM International Conference on Supercomputing (ICS ’10), Association for Computing Machinery, Tsukuba, Ibaraki, Japan, 2010, ISBN 9781450300186. Tags: Cpu-gpu pipelining, Data transformation, GPGPU, Thread divergence, Thread-data remapping.
Jiang, Yunlian; Zhang, Eddy Z; Tian, Kai; Mao, Feng; Gethers, Malcom; Shen, Xipeng; Gao, Yaoqing: Exploiting Statistical Correlations for Proactive Prediction of Program Behaviors. Conference, Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO ’10), Association for Computing Machinery, Toronto, Ontario, Canada, 2010, ISBN 9781605586359. Tags: Correlation, Program behavior analysis.
Zhang, Eddy Z; Jiang, Yunlian; Shen, Xipeng: Does Cache Sharing on Modern CMP Matter to the Performance of Contemporary Multithreaded Programs? Conference, Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’10), Association for Computing Machinery, Bangalore, India, 2010, ISBN 9781605588773. Tags: Chip multiprocessors, Parallel program optimizations, Shared cache, Thread scheduling.
Casale, Giuliano; Zhang, Eddy Z; Smirni, Evgenia: KPC-Toolbox: Best Recipes for Automatic Trace Fitting Using Markovian Arrival Processes. Journal article, Perform. Eval., 67(9), pp. 873–896, 2010, ISSN 0166-5316. Tags: Automatic fitting, MAP characterization, Markovian arrival process (MAP), Phase-type distribution, Temporal dependence, Time series modeling.
Liu, Yixun; Zhang, E Z; Shen, X: A cross-input adaptive framework for GPU program optimizations. Conference, 2009 IEEE International Symposium on Parallel Distributed Processing, 2009.
Mao, Feng; Zhang, Eddy Z; Shen, Xipeng: Influence of Program Inputs on the Selection of Garbage Collectors. Conference, Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE ’09), Association for Computing Machinery, Washington, DC, USA, 2009, ISBN 9781605583754. Tags: Cross-input program analysis, Input-specific selection, Minimum possible heap size, Profiling, Selection of garbage collectors.
Casale, G; Zhang, E Z; Smirni, E: KPC-Toolbox: Simple Yet Effective Trace Fitting Using Markovian Arrival Processes. Conference, 2008 Fifth International Conference on Quantitative Evaluation of Systems, 2008.
2021
title = {AutoBraid: A Framework for Enabling Efficient Surface Code Communication in Quantum Computing},
author = {Hua, Fei and Chen, Yanhao and Jin, Yuwei and Zhang, Chi and Hayes, Ari and Zhang, Youtao and Zhang, Eddy Z.},
url = {https://doi.org/10.1145/3466752.3480072},
doi = {10.1145/3466752.3480072},
isbn = {9781450385572},
year = {2021},
date = {2021-10-21},
booktitle = {54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2021)},
pages = {925–936},
publisher = {Association for Computing Machinery},
abstract = {Quantum computers can solve problems that are intractable using the most powerful classical computer. However, qubits are fickle and error prone. It is necessary to actively correct errors in the execution of a quantum circuit. Quantum error correction (QEC) codes are developed to enable fault-tolerant quantum computing. With QEC, one logical circuit is converted into an encoded circuit. Most studies on quantum circuit compilation focus on NISQ devices which have 10-100 qubits and are not fault-tolerant. In this paper, we focus on the compilation for fault-tolerant quantum hardware. In particular, we focus on optimizing communication parallelism for the surface code based QEC. The execution of surface code circuits involves non-trivial geometric manipulation of a large lattice of entangled physical qubits. A two-qubit gate in surface code is implemented as a virtual “pipe” in space-time called a braiding path. The braiding paths should be carefully routed to avoid congestion. Communication between qubits is considered the major bottleneck as it involves scheduling and searching for simultaneous paths between qubits. We provide a framework for efficiently scheduling braiding paths. We discover that for quantum programs with a local parallelism pattern, our framework guarantees an optimal solution, while the previous greedy-heuristic-based solution cannot. Moreover, we propose an extension to the local parallelism analysis framework to address the communication bottleneck. Our framework achieves orders of magnitude improvement after addressing the communication bottleneck.},
keywords = {Compiler optimization, Quantum Computing, Quantum Error Correction},
pubstate = {published},
tppubtype = {conference}
}
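The abstract above frames surface-code communication as routing non-overlapping braiding paths on a lattice. As a rough illustration only (this is not AutoBraid's algorithm, and the lattice and coordinates are invented), the sketch below routes each path with breadth-first search and reserves its cells, so a later path must detour around an earlier one or report congestion:

```python
from collections import deque

def route(width, height, src, dst, blocked):
    """BFS for a shortest path from src to dst on a width x height
    lattice, avoiding cells already claimed by other paths."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        cur = queue.popleft()
        if cur == dst:
            path = []
            while cur is not None:
                path.append(cur)
                cur = prev[cur]
            return path[::-1]
        x, y = cur
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nxt[0] < width and 0 <= nxt[1] < height
                    and nxt not in blocked and nxt not in prev):
                prev[nxt] = cur
                queue.append(nxt)
    return None  # congestion: no free path at the moment

# Route two crossing "braiding paths" on a 6x5 lattice; the second
# must detour around the cells reserved by the first.
blocked = set()
p1 = route(6, 5, (0, 2), (4, 2), blocked)
blocked.update(p1)
p2 = route(6, 5, (2, 0), (2, 4), blocked)
```

A scheduler built on such a primitive would decide which paths to route in the same time step; the paper's contribution is doing that scheduling with optimality guarantees for locally parallel circuits.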
title = {BGPQ: A Heap-Based Priority Queue Design for GPUs},
author = {Chen, Yanhao and Hua, Fei and Jin, Yuwei and Zhang, Eddy Z.},
url = {https://doi.org/10.1145/3472456.3472463},
doi = {10.1145/3472456.3472463},
isbn = {9781450390682},
year = {2021},
date = {2021-08-09},
booktitle = {50th International Conference on Parallel Processing (ICPP 2021)},
pages = {10},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
abstract = {Programming today’s many-core processor is challenging. Due to the enormous amount of parallelism, synchronization is expensive. We need efficient data structures for providing automatic and scalable synchronization methods. In this paper, we focus on the priority queue data structure. We develop a heap-based priority queue implementation called BGPQ. BGPQ uses batched key nodes as the internal data representation, exploits both task parallelism and data parallelism, and is linearizable. We show that BGPQ achieves up to 88X speedup compared with four state-of-the-art CPU parallel priority queue implementations and up to 11.2X speedup over an existing GPU implementation. We also apply BGPQ to search problems, including 0-1 Knapsack and A* search. We achieve 45X-100X and 12X-46X speedup respectively over best known concurrent CPU priority queues.},
keywords = {Batched Heap, GPUs, Priority Queue},
pubstate = {published},
tppubtype = {conference}
}
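To make the "batched" idea in the BGPQ abstract concrete, here is a deliberately simplified toy in Python: every operation moves a whole batch of keys at once, which is the property that lets a GPU thread block cooperate on one operation. It is not BGPQ itself (which stores sorted key batches inside heap nodes and merges them in a data-parallel, linearizable way); the class name and batch size are invented for the example:

```python
import heapq

class BatchedPQ:
    """Toy batched priority queue: insert and delete-min move
    batch_size keys per call, amortizing per-key overhead."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.heap = []

    def insert_batch(self, keys):
        assert len(keys) == self.batch_size
        for k in keys:  # a GPU version would merge sorted batches instead
            heapq.heappush(self.heap, k)

    def delete_min_batch(self):
        n = min(self.batch_size, len(self.heap))
        return [heapq.heappop(self.heap) for _ in range(n)]

pq = BatchedPQ(4)
pq.insert_batch([9, 3, 7, 1])
pq.insert_batch([8, 2, 6, 4])
smallest = pq.delete_min_batch()  # the four globally smallest keys
```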
title = {Time-Optimal Qubit Mapping},
author = {Zhang, Chi and Hayes, Ari B. and Qiu, Longfei and Jin, Yuwei and Chen, Yanhao and Zhang, Eddy Z.},
url = {https://doi.org/10.1145/3445814.3446706},
doi = {10.1145/3445814.3446706},
year = {2021},
date = {2021-04-19},
booktitle = {Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2021)},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
keywords = {A* search, Compilation technique, QFT, Quantum, Qubit mapping},
pubstate = {published},
tppubtype = {conference}
}
2019
title = {Optimizing Surface Code Braiding},
author = {Fei Hua and Eddy Z. Zhang},
year = {2019},
date = {2019-11-07},
booktitle = {The 3rd International Workshop on Quantum Compilation (IWQC 2019)},
keywords = {Compilation technique, Quantum, Surface code},
pubstate = {published},
tppubtype = {workshop}
}
title = {Towards Optimal Qubit Mapping in NISQ Era},
author = {Hao Fu and Mingzheng Zhu and Wenli Wu and Eddy Z. Zhang and Haisheng Tan},
year = {2019},
date = {2019-11-07},
booktitle = {The 3rd International Workshop on Quantum Compilation (IWQC 2019)},
keywords = {Compilation technique, Quantum, Qubit mapping},
pubstate = {published},
tppubtype = {workshop}
}
title = {Decoding CUDA Binary},
author = {Ari B Hayes and Fei Hua and Jin Huang and Yanhao Chen and Eddy Z Zhang},
isbn = {9781728114361},
year = {2019},
date = {2019-01-01},
booktitle = {Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO 2019)},
pages = {229–241},
publisher = {IEEE Press},
address = {Washington, DC, USA},
abstract = {NVIDIA’s software does not offer translation of assembly code to binary for their GPUs, since the specifications are closed-source. This work fills that gap. We develop a systematic method of decoding the Instruction Set Architectures (ISAs) of NVIDIA’s GPUs, and generating assemblers for different generations of GPUs. Our framework enables cross-architecture binary analysis and transformation. Making the ISA accessible in this manner opens up a world of opportunities for developers and researchers, enabling numerous optimizations and explorations that are unachievable at the source-code level. Our infrastructure has already benefited and been adopted in important applications including performance tuning, binary instrumentation, resource allocation, and memory protection.},
keywords = {Code generation, Code translation and transformation, CUDA, GPU, Instruction set architecture (ISA)},
pubstate = {published},
tppubtype = {conference}
}
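One standard way to decode a closed ISA, in the spirit of the systematic method the abstract describes (the paper's actual procedure is more involved), is differential probing: assemble instructions that differ in exactly one operand and XOR the encodings to locate that operand's bit field. The toy assembler below stands in for the real black-box tool, and its encoding layout is entirely hypothetical:

```python
def toy_assemble(opcode, rd, rs):
    """Stand-in for the black-box assembler being probed (hypothetical
    32-bit encoding: opcode in bits 24-31, rd in 8-15, rs in 0-7)."""
    return (opcode << 24) | (rd << 8) | rs

def infer_field_bits(assemble, base_args, field, trials=8):
    """Assemble variants differing only in one operand and XOR each
    encoding against the baseline; the union of flipped bits is the
    inferred position of that operand's bit field."""
    baseline = assemble(**base_args)
    mask = 0
    for value in range(1, trials):
        variant = dict(base_args, **{field: value})
        mask |= baseline ^ assemble(**variant)
    return mask

base = {"opcode": 0x5C, "rd": 0, "rs": 0}
rd_mask = infer_field_bits(toy_assemble, base, "rd")
rs_mask = infer_field_bits(toy_assemble, base, "rs")
```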
2018
title = {Locality-Aware Software Throttling for Sparse Matrix Operation on GPUs},
author = {Yanhao Chen and Ari B Hayes and Chi Zhang and Timothy Salmon and Eddy Z Zhang},
url = {https://www.usenix.org/conference/atc18/presentation/chen-yanhao},
isbn = {978-1-939133-01-4},
year = {2018},
date = {2018-01-01},
booktitle = {2018 USENIX Annual Technical Conference (USENIX ATC 18)},
pages = {413–426},
publisher = {USENIX Association},
address = {Boston, MA},
keywords = {GPU, Program locality, Sparse matrix, Spmv},
pubstate = {published},
tppubtype = {conference}
}
2017
title = {GPU Taint Tracking},
author = {Ari B Hayes and Lingda Li and Mohammad Hedayati and Jiahuan He and Eddy Z Zhang and Kai Shen},
url = {https://www.usenix.org/conference/atc17/technical-sessions/presentation/hayes},
isbn = {978-1-931971-38-6},
year = {2017},
date = {2017-01-01},
booktitle = {2017 USENIX Annual Technical Conference (USENIX ATC 17)},
pages = {209–220},
publisher = {USENIX Association},
address = {Santa Clara, CA},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
title = {A Simple Yet Effective Balanced Edge Partition Model for Parallel Computing},
author = {Lingda Li and Robel Geda and Ari B Hayes and Yanhao Chen and Pranav Chaudhari and Eddy Z Zhang and Mario Szegedy},
url = {https://doi.org/10.1145/3078505.3078520},
doi = {10.1145/3078505.3078520},
isbn = {9781450350327},
year = {2017},
date = {2017-01-01},
booktitle = {Proceedings of the 2017 ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems (Sigmetrics 2017)},
pages = {6},
publisher = {Association for Computing Machinery},
address = {Urbana-Champaign, Illinois, USA},
series = {SIGMETRICS ’17 Abstracts},
abstract = {Graph edge partition models have recently become an appealing alternative to graph vertex partition models for distributed computing, due both to their flexibility in balancing loads and to their performance in reducing communication cost. In this paper, we propose a simple yet effective graph edge partitioning algorithm. In practice, our algorithm provides good partition quality while maintaining low partition overhead. It also outperforms similar state-of-the-art edge partition approaches, especially for power-law graphs. In theory, previous work showed that an approximation guarantee of O(d_max √(log n log k)) applies to graphs with m = Ω(k^2) edges (n is the number of vertices, and k is the number of partitions). We further rigorously prove that this approximation guarantee holds for all graphs. We also demonstrate the applicability of the proposed edge partition algorithm in real parallel computing systems. We draw our example from GPU program locality enhancement and demonstrate that the graph edge partition model applies not only to distributed computing with many compute nodes, but also to parallel computing in a single node with a many-core processor.},
keywords = {Data sharing, Edge-partition, GPU, Graph model, Program locality},
pubstate = {published},
tppubtype = {conference}
}
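For intuition about what a balanced edge partition optimizes, here is a minimal greedy sketch (not the paper's algorithm, whose approximation guarantee comes from a different construction): each edge is assigned to the partition that already replicates most of its endpoints, with partition load as the tie-breaker, trading off vertex replication against balance:

```python
def edge_partition(edges, k):
    """Greedy balanced edge partition: place each edge where its
    endpoints are already replicated (to limit communication),
    preferring lighter partitions on ties."""
    replicas = [set() for _ in range(k)]  # vertices copied into each part
    loads = [0] * k                       # edges per part
    assignment = []
    for u, v in edges:
        best = min(range(k),
                   key=lambda i: (-len({u, v} & replicas[i]), loads[i]))
        replicas[best].update((u, v))
        loads[best] += 1
        assignment.append(best)
    return assignment, loads, replicas

# A triangle clusters into one partition; the isolated edge balances load.
edges = [(0, 1), (1, 2), (2, 0), (3, 4)]
assignment, loads, replicas = edge_partition(edges, 2)
```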
title = {A Simple Yet Effective Balanced Edge Partition Model for Parallel Computing},
author = {Lingda Li and Robel Geda and Ari B Hayes and Yanhao Chen and Pranav Chaudhari and Eddy Z Zhang and Mario Szegedy},
url = {https://doi.org/10.1145/3143314.3078520},
doi = {10.1145/3143314.3078520},
issn = {0163-5999},
year = {2017},
date = {2017-01-01},
journal = {SIGMETRICS Perform. Eval. Rev.},
volume = {45},
number = {1},
pages = {6},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
abstract = {Graph edge partition models have recently become an appealing alternative to graph vertex partition models for distributed computing, due both to their flexibility in balancing loads and to their performance in reducing communication cost. In this paper, we propose a simple yet effective graph edge partitioning algorithm. In practice, our algorithm provides good partition quality while maintaining low partition overhead. It also outperforms similar state-of-the-art edge partition approaches, especially for power-law graphs. In theory, previous work showed that an approximation guarantee of O(d_max √(log n log k)) applies to graphs with m = Ω(k^2) edges (n is the number of vertices, and k is the number of partitions). We further rigorously prove that this approximation guarantee holds for all graphs. We also demonstrate the applicability of the proposed edge partition algorithm in real parallel computing systems. We draw our example from GPU program locality enhancement and demonstrate that the graph edge partition model applies not only to distributed computing with many compute nodes, but also to parallel computing in a single node with a many-core processor.},
keywords = {Data sharing, Edge-partition, GPU, Graph model, Program locality},
pubstate = {published},
tppubtype = {article}
}
title = {LD: Low-Overhead GPU Race Detection Without Access Monitoring},
author = {Pengcheng Li and Xiaoyu Hu and Dong Chen and Jacob Brock and Hao Luo and Eddy Z Zhang and Chen Ding},
url = {https://doi.org/10.1145/3046678},
doi = {10.1145/3046678},
issn = {1544-3566},
year = {2017},
date = {2017-01-01},
journal = {ACM Transactions on Architecture and Code Optimization (TACO 2017)},
volume = {14},
number = {1},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
abstract = {Data race detection has become an important problem in GPU programming. Previous designs of CPU race-checking tools are mainly task parallel and incur high overhead on GPUs due to access instrumentation, especially when monitoring the many thousands of threads routinely used by GPU programs. This article presents a novel data-parallel solution designed and optimized for the GPU architecture. It includes compiler support and a set of runtime techniques. It uses value-based checking, which detects the races reported in previous work, finds new races, and supports race-free deterministic GPU execution. More important, race checking is massively data parallel and does not introduce divergent branching or atomic synchronization. Its slowdown is less than 5× for over half of the tests and 10× on average, which is orders of magnitude more efficient than the cuda-memcheck tool by Nvidia and the methods that use fine-grained access instrumentation.},
keywords = {GPU race detection, Instrumentation-free, Value-based checking},
pubstate = {published},
tppubtype = {article}
}
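The core observation behind value-based checking is that a race makes the program's values depend on the schedule, so races can be exposed by comparing values rather than monitoring every access. The sketch below is an illustrative model, not LD's two-pass design: it enumerates interleavings of two non-atomic `x += 1` updates and flags a race when different schedules yield different final values:

```python
from itertools import permutations

def final_value(order):
    """Replay one interleaving of two threads that each perform a
    non-atomic x += 1, modeled as separate read and write steps."""
    x = 0
    regs = {}
    for tid, step in order:
        if step == "read":
            regs[tid] = x
        else:                 # write back the thread's register + 1
            x = regs[tid] + 1
    return x

steps = [(0, "read"), (0, "write"), (1, "read"), (1, "write")]
# keep per-thread program order (read before write), vary the rest
schedules = [p for p in permutations(steps)
             if all(p.index((t, "read")) < p.index((t, "write"))
                    for t in (0, 1))]
outcomes = {final_value(p) for p in schedules}
racy = len(outcomes) > 1  # schedule-dependent values reveal the race
```

The lost-update interleaving produces 1 while serialized execution produces 2, so the value comparison catches the race without instrumenting any individual access.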
title = {The Moser-Tardos Resample algorithm: Where is the limit? (an experimental inquiry)},
author = {Jan Catarata and Scott Corbett and Harry Stern and Mario Szegedy and Tomas Vyskocil and Zheng Zhang},
doi = {10.1137/1.9781611974768.13},
year = {2017},
date = {2017-01-01},
pages = {159-171},
keywords = {},
pubstate = {published},
tppubtype = {workshop}
}
2016
title = {Orion: A Framework for GPU Occupancy Tuning},
author = {Ari B Hayes and Lingda Li and Daniel Chavarría-Miranda and Shuaiwen Leon Song and Eddy Z Zhang},
url = {https://doi.org/10.1145/2988336.2988355},
doi = {10.1145/2988336.2988355},
isbn = {9781450343008},
year = {2016},
date = {2016-01-01},
booktitle = {Proceedings of the 17th International Middleware Conference (Middleware 2016)},
publisher = {Association for Computing Machinery},
address = {Trento, Italy},
abstract = {An important feature of modern GPU architectures is variable occupancy. Occupancy measures the ratio between the actual number of threads actively running on a GPU and the maximum number of threads that can be scheduled on a GPU. High-occupancy execution enables a large number of threads to run simultaneously and to hide memory latency, but may increase resource contention. Low-occupancy execution leads to less resource contention, but is less capable of hiding memory latency. Occupancy tuning is an important and challenging problem. A program running at two different occupancy levels can have three to four times difference in performance. We introduce Orion, the first GPU program occupancy tuning framework. The Orion framework automatically generates and chooses occupancy-adaptive code for any given GPU program. It is capable of finding the (near-)optimal occupancy level by combining static and dynamic tuning techniques. We demonstrate the efficiency of Orion with twelve representative benchmarks from the Rodinia benchmark suite and CUDA SDK evaluated on two different GPU architectures, obtaining up to 1.61 times speedup, 62.5% memory resource saving, and 6.7% energy saving compared to the baseline of optimized code compiled by nvcc.},
keywords = {Concurrent-program compilation, GPU occupancy tuning, Register allocation, Shared memory allocation},
pubstate = {published},
tppubtype = {conference}
}
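Occupancy as described in the abstract is bounded by whichever per-SM resource runs out first. The sketch below computes that bound; the default hardware limits are typical-but-hypothetical numbers (real GPUs also round register and shared-memory allocations to a hardware granularity, ignored here), so this is an estimate of the quantity Orion tunes, not Orion itself:

```python
def estimate_occupancy(regs_per_thread, smem_per_block, threads_per_block,
                       regs_per_sm=65536, smem_per_sm=98304,
                       max_threads_per_sm=2048, max_blocks_per_sm=32):
    """Occupancy = active threads / max schedulable threads on one SM,
    limited by registers, shared memory, threads, or blocks."""
    by_regs = regs_per_sm // max(1, regs_per_thread * threads_per_block)
    by_smem = (smem_per_sm // smem_per_block) if smem_per_block \
        else max_blocks_per_sm
    by_threads = max_threads_per_sm // threads_per_block
    blocks = min(by_regs, by_smem, by_threads, max_blocks_per_sm)
    return blocks * threads_per_block / max_threads_per_sm

full = estimate_occupancy(32, 0, 256)  # 8 resident blocks: full occupancy
half = estimate_occupancy(64, 0, 256)  # register pressure caps it at 4
```

Doubling the per-thread register count halves the resident blocks here, which is exactly the kind of resource/occupancy trade-off an occupancy tuner must navigate.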
title = {Tag-Split Cache for Efficient GPGPU Cache Utilization},
author = {Lingda Li and Ari B Hayes and Shuaiwen Leon Song and Eddy Z Zhang},
url = {https://doi.org/10.1145/2925426.2926253},
doi = {10.1145/2925426.2926253},
isbn = {9781450343619},
year = {2016},
date = {2016-01-01},
booktitle = {Proceedings of the 2016 International Conference on Supercomputing (ICS 2016)},
publisher = {Association for Computing Machinery},
address = {Istanbul, Turkey},
abstract = {Modern GPUs employ cache to improve memory system efficiency. However, a large amount of cache space is underutilized due to the irregular memory accesses and poor spatial locality commonly exhibited in GPU applications. Our experiments show that using smaller cache lines could improve cache space utilization, but doing so frequently suffers significant performance loss by introducing a large number of extra cache requests. In this work, we propose a novel cache design named tag-split cache (TSC) that enables fine-grained cache storage to address the problem of cache space underutilization while keeping the number of memory requests unchanged. TSC divides the tag into two parts to reduce storage overhead, and it supports multiple cache line replacement in one cycle. TSC can also automatically adjust cache storage granularity to avoid performance loss for applications with good spatial locality. Our evaluation shows that TSC improves the baseline cache performance by 17.2% on average across a wide range of applications. It also outperforms other previous techniques significantly.},
keywords = {Cache organization, GPGPU, Spatial Locality},
pubstate = {published},
tppubtype = {conference}
}
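A quick back-of-the-envelope model shows the underutilization the abstract targets. With strided accesses and coarse lines, each access claims a whole line but uses only a few bytes; finer-grained storage claims far less space for the same access stream. This is only the motivating arithmetic, not the tag-split mechanism itself:

```python
def storage_touched(addresses, line_size):
    """Cache storage consumed when each access allocates a whole
    aligned line of line_size bytes (assume no line reuse)."""
    lines = {addr // line_size for addr in addresses}
    return len(lines) * line_size

# 64 strided 4-byte loads, 128 bytes apart: every load hits a new line.
addrs = [i * 128 for i in range(64)]
coarse = storage_touched(addrs, 128)  # 64 lines x 128 B of cache claimed
fine = storage_touched(addrs, 32)     # 64 lines x  32 B of cache claimed
```

Only 4 of every 128 bytes stored are actually used in the coarse case; fine-grained storage cuts the claimed space by 4x here, which is the utilization gap TSC recovers without multiplying memory requests.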
title = {New-Sum: A Novel Online ABFT Scheme For General Iterative Methods},
author = {Dingwen Tao and Shuaiwen Leon Song and Sriram Krishnamoorthy and Panruo Wu and Xin Liang and Eddy Z Zhang and Darren Kerbyson and Zizhong Chen},
url = {https://doi.org/10.1145/2907294.2907306},
doi = {10.1145/2907294.2907306},
isbn = {9781450343145},
year = {2016},
date = {2016-01-01},
booktitle = {Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC 2016)},
pages = {43–55},
publisher = {Association for Computing Machinery},
address = {Kyoto, Japan},
abstract = {Emerging high-performance computing platforms, with large component counts and lower power margins, are anticipated to be more susceptible to soft errors in both logic circuits and memory subsystems. We present an online algorithm-based fault tolerance (ABFT) approach to efficiently detect and recover soft errors for general iterative methods. We design a novel checksum-based encoding scheme for matrix-vector multiplication that is resilient to both arithmetic and memory errors. Our design decouples the checksum updating process from the actual computation, and allows adaptive checksum overhead control. Building on this new encoding mechanism, we propose two online ABFT designs that can effectively recover from errors when combined with a checkpoint/rollback scheme. These designs are capable of addressing scenarios under different error rates. Our ABFT approaches apply to a wide range of iterative solvers that primarily rely on matrix-vector multiplication and vector linear operations. We evaluate our designs through comprehensive analytical and empirical analysis. Experimental evaluation on the Stampede supercomputer demonstrates the low performance overheads incurred by our two ABFT schemes for preconditioned CG (0.4% and 2.2%) and preconditioned BiCGSTAB (1.0% and 4.0%) for the largest SPD matrix from UFL Sparse Matrix Collection. The evaluation also demonstrates the flexibility and effectiveness of our proposed designs for detecting and recovering various types of soft errors in general iterative methods.},
keywords = {Algorithm-based fault tolerance (abft), Checkpoint, Checksum, Iterative methods, Online error detection, Resilience, Rollback recovery, Silent data corruption (sdc)},
pubstate = {published},
tppubtype = {conference}
}
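The checksum idea that online ABFT schemes build on can be shown in a few lines. For a correct product y = A x, the identity sum(y) = (eᵀA)·x must hold, where eᵀA is a precomputed column-sum vector, so a mismatch flags a silent error without recomputing the product. This is the classic invariant only; the paper's New-Sum encoding extends it to tolerate errors in the checksum update itself:

```python
def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def checksum(A):
    """Column sums c = e^T A, computed once per matrix."""
    return [sum(col) for col in zip(*A)]

def abft_check(A, x, y, tol=1e-9):
    """Verify the ABFT invariant sum(y) == (e^T A) . x for y = A x."""
    c = checksum(A)
    return abs(sum(y) - sum(ci * xi for ci, xi in zip(c, x))) <= tol

A = [[2.0, 1.0], [0.0, 3.0]]
x = [1.0, 2.0]
y = matvec(A, x)
ok = abft_check(A, x, y)   # clean run passes the check
y[0] += 5.0                # inject a silent data corruption
caught = not abft_check(A, x, y)
```

Detection alone is not recovery; as the abstract notes, the online schemes pair this kind of check with checkpoint/rollback to restore a clean state.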
title = {Critical points based register-concurrency autotuning for GPUs},
author = {Ang Li and Shuaiwen Leon Song and Akash Kumar and Eddy Z Zhang and Daniel Chavarría-Miranda and Henk Corporaal},
editor = {Luca Fanucci and Jürgen Teich},
url = {http://ieeexplore.ieee.org/document/7459506/},
year = {2016},
date = {2016-01-01},
booktitle = {2016 Design, Automation & Test in Europe Conference & Exhibition (DATE 2016), Dresden, Germany, March 14-18, 2016},
pages = {1273–1278},
publisher = {IEEE},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
2015
title = {Towards a GPU SDN controller},
author = {E G Renart and E Z Zhang and B Nath},
year = {2015},
date = {2015-01-01},
booktitle = {2015 International Conference and Workshops on Networked Systems (NetSys)},
pages = {1-5},
keywords = {},
pubstate = {published},
tppubtype = {workshop}
}
2014
title = {Massive Atomics for Massive Parallelism on GPUs},
author = {Ian J Egielski and Jesse Huang and Eddy Z Zhang},
url = {https://doi.org/10.1145/2602988.2602993},
doi = {10.1145/2602988.2602993},
isbn = {9781450329217},
year = {2014},
date = {2014-01-01},
booktitle = {Proceedings of the 2014 International Symposium on Memory Management (ISMM 2014)},
pages = {93–103},
publisher = {Association for Computing Machinery},
address = {Edinburgh, United Kingdom},
abstract = {One important type of parallelism exploited in many applications is reduction type parallelism. In these applications, the order of the read-modify-write updates to one shared data object can be arbitrary as long as there is an imposed order for the read-modify-write updates. The typical way to parallelize these types of applications is to first let every individual thread perform local computation and save the results in thread-private data objects, and then merge the results from all worker threads in the reduction stage. All applications that fit into the map reduce framework belong to this category. Additionally, the machine learning, data mining, numerical analysis and scientific simulation applications may also benefit from reduction type parallelism. However, the parallelization scheme via the usage of thread-private data objects may not be viable in massively parallel GPU applications. Because the number of concurrent threads is extremely large (at least tens of thousands), thread-private data object creation may lead to memory space explosion problems. In this paper, we propose a novel approach to deal with shared data object management for reduction type parallelism on GPUs. Our approach exploits fine-grained parallelism while at the same time maintaining good programmability. It is based on the usage of intrinsic hardware atomic instructions. Atomic operation may appear to be expensive since it causes thread serialization when multiple threads atomically update the same memory object at the same time. However, we discovered that, with appropriate atomic collision reduction techniques, the atomic implementation can outperform the non-atomics implementation, even for benchmarks known to have high performance non-atomics GPU implementations. In the meantime, the usage of atomics can greatly reduce coding complexity as neither thread-private object management nor explicit thread-communication (for the shared data objects protected by atomic operations) is necessary.},
keywords = {Atomics, Concurrency, GPU, Parallelism},
pubstate = {published},
tppubtype = {conference}
}
title = {Unified On-Chip Memory Allocation for SIMT Architecture},
author = {Ari B Hayes and Eddy Z Zhang},
url = {https://doi.org/10.1145/2597652.2597685},
doi = {10.1145/2597652.2597685},
isbn = {9781450326421},
year = {2014},
date = {2014-01-01},
booktitle = {Proceedings of the 28th ACM International Conference on Supercomputing (ICS 2014)},
pages = {293–302},
publisher = {Association for Computing Machinery},
address = {Munich, Germany},
abstract = {The popularity of general purpose Graphic Processing Unit (GPU) is largely attributed to the tremendous concurrency enabled by its underlying architecture — single instruction multiple thread (SIMT) architecture. It keeps the context of a significant number of threads in registers to enable fast “context switches” when the processor is stalled due to execution dependence, memory requests, etc. The SIMT architecture has a large register file evenly partitioned among all concurrent threads. Per-thread register usage determines the number of concurrent threads, which strongly affects the whole program performance. Existing register allocation techniques, extensively studied in the past several decades, are oblivious to the register contention due to the concurrent execution of many threads. They are prone to making optimization decisions that benefit single thread but degrade the whole application performance. Is it possible for compilers to make register allocation decisions that can maximize the whole GPU application performance? We tackle this important question from two different aspects in this paper. We first propose a unified on-chip memory allocation framework that uses scratch-pad memory to help: (1) alleviate single-thread register pressure; (2) increase whole application throughput. Secondly, we propose a characterization model for the SIMT execution model in order to achieve a desired on-chip memory partition given the register pressure of a program. Overall, we discovered that it is possible to automatically determine an on-chip memory resource allocation that maximizes concurrency while ensuring good single-thread performance at compile-time. We evaluated our techniques on a representative set of GPU benchmarks with non-trivial register pressure. We are able to achieve up to 1.70 times speedup over the baseline of the traditional register allocation scheme that maximizes single thread performance.},
keywords = {Compiler optimization, Concurrency, GPU, Register allocation, Shared memory allocation},
pubstate = {published},
tppubtype = {conference}
}
2013
title = {Complexity Analysis and Algorithm Design for Reorganizing Data to Minimize Non-Coalesced Memory Accesses on GPU},
author = {Bo Wu and Zhijia Zhao and Eddy Zheng Zhang and Yunlian Jiang and Xipeng Shen},
url = {https://doi.org/10.1145/2442516.2442523},
doi = {10.1145/2442516.2442523},
isbn = {9781450319225},
year = {2013},
date = {2013-01-01},
booktitle = {Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2013)},
pages = {57–68},
publisher = {Association for Computing Machinery},
address = {Shenzhen, China},
abstract = {The performance of Graphic Processing Units (GPU) is sensitive to irregular memory references. Some recent work shows the promise of data reorganization for eliminating non-coalesced memory accesses that are caused by irregular references. However, all previous studies have employed simple, heuristic methods to determine the new data layouts to create. As a result, they either do not provide any performance guarantee or are effective to only some limited scenarios. This paper contributes a fundamental study to the problem. It systematically analyzes the inherent complexity of the problem in various settings, and for the first time, proves that the problem is NP-complete. It then points out the limitations of existing techniques and reveals that in practice, the essence for designing an appropriate data reorganization algorithm can be reduced to a tradeoff among space, time, and complexity. Based on that insight, it develops two new data reorganization algorithms to overcome the limitations of previous methods. Experiments show that an assembly composed of the new algorithms and a previous algorithm can circumvent the inherent complexity in finding optimal data layouts, making it feasible to minimize non-coalesced memory accesses for a variety of irregular applications and settings that are beyond the reach of existing techniques.},
keywords = {Computational complexity, Data transformation, GPGPU, Memory coalescing, Runtime optimizations, Thread-data remapping},
pubstate = {published},
tppubtype = {conference}
}
title = {An Infrastructure for Tackling Input-Sensitivity of GPU Program Optimizations},
author = {Xipeng Shen and Yixun Liu and Eddy Z Zhang and Poornima Bhamidipati},
url = {https://doi.org/10.1007/s10766-012-0236-3},
doi = {10.1007/s10766-012-0236-3},
issn = {0885-7458},
year = {2013},
date = {2013-01-01},
journal = {Int. J. Parallel Program.},
volume = {41},
number = {6},
pages = {855–869},
publisher = {Kluwer Academic Publishers},
address = {USA},
abstract = {Graphic processing units (GPU) have become increasingly adopted for the enhancement of computing throughput. However, the development of a high-quality GPU application is challenging, due to the large optimization space and complex unpredictable effects of optimizations on GPU program performance. Many recent efforts have been employing empirical search-based auto-tuners to tackle the problem, but few of them have concentrated on the influence of program inputs on the optimizations. In this paper, based on a set of CUDA and OpenCL kernels, we report some evidences on the importance for auto-tuners to adapt to program input changes, and present a framework, G-ADAPT+, to address the influence by constructing cross-input predictive models for automatically predicting the (near-)optimal configurations for an arbitrary input to a GPU program. G-ADAPT+ is based on source-to-source compilers, specifically, Cetus and ROSE. It supports the optimizations of both CUDA and OpenCL programs.},
keywords = {Cross-input adaptation, CUDA, Empirical search, G-ADAPT, GPU, Program optimizations},
pubstate = {published},
tppubtype = {article}
}
2012
title = {The Significance of CMP Cache Sharing on Contemporary Multithreaded Applications},
author = {E Z Zhang and Y Jiang and X Shen},
year = {2012},
date = {2012-01-01},
journal = {IEEE Transactions on Parallel and Distributed Systems},
volume = {23},
number = {2},
pages = {367-374},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
2011
title = {Enhancing Data Locality for Dynamic Simulations through Asynchronous Data Transformations and Adaptive Control},
author = {B Wu and E Z Zhang and X Shen},
year = {2011},
date = {2011-01-01},
booktitle = {2011 International Conference on Parallel Architectures and Compilation Techniques},
pages = {243-252},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
title = {Correctly Treating Synchronizations in Compiling Fine-Grained SPMD-Threaded Programs for CPU},
author = {Z Guo and E Z Zhang and X Shen},
year = {2011},
date = {2011-01-01},
booktitle = {2011 International Conference on Parallel Architectures and Compilation Techniques},
pages = {310-319},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
title = {A Step towards Transparent Integration of Input-Consciousness into Dynamic Program Optimizations},
author = {Kai Tian and Eddy Zhang and Xipeng Shen},
url = {https://doi.org/10.1145/2048066.2048103},
doi = {10.1145/2048066.2048103},
isbn = {9781450309400},
year = {2011},
date = {2011-01-01},
booktitle = {Proceedings of the 2011 ACM International Conference on Object Oriented Programming Systems Languages and Applications},
pages = {445–462},
publisher = {Association for Computing Machinery},
address = {Portland, Oregon, USA},
series = {OOPSLA ’11},
abstract = {Dynamic program optimizations are critical for the efficiency of applications in managed programming languages and scripting languages. Recent studies have shown that exploitation of program inputs may enhance the effectiveness of dynamic optimizations significantly. However, current solutions for enabling the exploitation require either programmers’ annotations or intensive offline profiling, impairing the practical adoption of the techniques. This current work examines the basic feasibility of transparent integration of input-consciousness into dynamic program optimizations, particularly in managed execution environments. It uses transparent learning across production runs as the basic vehicle, and investigates the implications of cross-run learning on each main component of input-conscious dynamic optimizations. It proposes several techniques to address some key challenges for the transparent integration, including randomized inspection-instrumentation for cross-user data collection, a sparsity-tolerant algorithm for input characterization, and selective prediction for efficiency protection. These techniques make it possible to automatically recognize the relations between the inputs to a program and the appropriate ways to optimize it. The whole process happens transparently across production runs; no need for offline profiling or programmer intervention. Experiments on a number of Java programs demonstrate the effectiveness of the techniques in enabling input-consciousness for dynamic optimizations, revealing the feasibility and potential benefits of the new optimization paradigm in some basic settings.},
keywords = {Dynamic optimizations, Dynamic version selection, Java virtual machine, Just-in-time compilation, Proactivity, Seminal behaviors},
pubstate = {published},
tppubtype = {conference}
}
title = {On-the-Fly Elimination of Dynamic Irregularities for GPU Computing},
author = {Eddy Z Zhang and Yunlian Jiang and Ziyu Guo and Kai Tian and Xipeng Shen},
url = {https://doi.org/10.1145/1950365.1950408},
doi = {10.1145/1950365.1950408},
isbn = {9781450302661},
year = {2011},
date = {2011-01-01},
booktitle = {Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems},
pages = {369–380},
publisher = {Association for Computing Machinery},
address = {Newport Beach, California, USA},
series = {ASPLOS XVI},
abstract = {The power-efficient massively parallel Graphics Processing Units (GPUs) have become increasingly influential for general-purpose computing over the past few years. However, their efficiency is sensitive to dynamic irregular memory references and control flows in an application. Experiments have shown great performance gains when these irregularities are removed. But it remains an open question how to achieve those gains through software approaches on modern GPUs. This paper presents a systematic exploration to tackle dynamic irregularities in both control flows and memory references. It reveals some properties of dynamic irregularities in both control flows and memory references, their interactions, and their relations with program data and threads. It describes several heuristics-based algorithms and runtime adaptation techniques for effectively removing dynamic irregularities through data reordering and job swapping. It presents a framework, G-Streamline, as a unified software solution to dynamic irregularities in GPU computing. G-Streamline has several distinctive properties. It is a pure software solution and works on the fly, requiring no hardware extensions or offline profiling. It treats both types of irregularities at the same time in a holistic fashion, maximizing the whole-program performance by resolving conflicts among optimizations. Its optimization overhead is largely transparent to GPU kernel executions, jeopardizing no basic efficiency of the GPU application. Finally, it is robust to the presence of various complexities in GPU applications. Experiments show that G-Streamline is effective in reducing dynamic irregularities in GPU computing, producing speedups between 1.07 and 2.5 for a variety of applications.},
keywords = {Cpu-gpu pipelining, Data transformation, GPGPU, Memory coalescing, Thread data remapping, Thread divergence},
pubstate = {published},
tppubtype = {conference}
}
2010
title = {An Input-Centric Paradigm for Program Dynamic Optimizations},
author = {Kai Tian and Yunlian Jiang and Eddy Z Zhang and Xipeng Shen},
url = {https://doi.org/10.1145/1869459.1869471},
doi = {10.1145/1869459.1869471},
isbn = {9781450302036},
year = {2010},
date = {2010-01-01},
booktitle = {Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications},
pages = {125–139},
publisher = {Association for Computing Machinery},
address = {Reno/Tahoe, Nevada, USA},
series = {OOPSLA ’10},
abstract = {Accurately predicting program behaviors (e.g., locality, dependency, method calling frequency) is fundamental for program optimizations and runtime adaptations. Despite decades of remarkable progress, prior studies have not systematically exploited program inputs, a deciding factor for program behaviors. Triggered by the strong and predictive correlations between program inputs and behaviors that recent studies have uncovered, this work proposes to include program inputs into the focus of program behavior analysis, cultivating a new paradigm named input-centric program behavior analysis. This new approach consists of three components, forming a three-layer pyramid. At the base is program input characterization, a component for resolving the complexity in program raw inputs and the extraction of important features. In the middle is input-behavior modeling, a component for recognizing and modeling the correlations between characterized input features and program behaviors. These two components constitute input-centric program behavior analysis, which (ideally) is able to predict the large-scope behaviors of a program’s execution as soon as the execution starts. The top layer of the pyramid is input-centric adaptation, which capitalizes on the novel opportunities that the first two components create to facilitate proactive adaptation for program optimizations. By centering on program inputs, the new approach resolves a proactivity-adaptivity dilemma inherent in previous techniques. Its benefits are demonstrated through proactive dynamic optimizations and version selection, yielding significant performance improvement on a set of Java and C programs.},
keywords = {Dynamic optimizations, Dynamic version selection, Java virtual machine, Just-in-time compilation, Proactivity, Seminal behaviors},
pubstate = {published},
tppubtype = {conference}
}
title = {Streamlining GPU Applications on the Fly: Thread Divergence Elimination through Runtime Thread-Data Remapping},
author = {Eddy Z Zhang and Yunlian Jiang and Ziyu Guo and Xipeng Shen},
url = {https://doi.org/10.1145/1810085.1810104},
doi = {10.1145/1810085.1810104},
isbn = {9781450300186},
year = {2010},
date = {2010-01-01},
booktitle = {Proceedings of the 24th ACM International Conference on Supercomputing},
pages = {115–126},
publisher = {Association for Computing Machinery},
address = {Tsukuba, Ibaraki, Japan},
series = {ICS ’10},
abstract = {Because of their tremendous computing power and remarkable cost efficiency, GPUs (graphic processing unit) have quickly emerged as a kind of influential platform for high performance computing. However, as GPUs are designed for massive data-parallel computing, their performance is subject to the presence of condition statements in a GPU application. On a conditional branch where threads diverge in which path to take, the threads taking different paths have to run serially. Such divergences often cause serious performance degradations, impairing the adoption of GPU for many applications that contain non-trivial branches or certain types of loops. This paper presents a systematic investigation in the employment of runtime thread-data remapping for solving that problem. It introduces an abstract form of GPU applications, based on which, it describes the use of reference redirection and data layout transformation for remapping data and threads to minimize thread divergences. It discusses the major challenges for practical deployment of the remapping techniques, most notably, the conflict between the large remapping overhead and the need for the remapping to happen on the fly because of the dependence of thread divergences on runtime values. It offers a solution to the challenge by proposing a CPU-GPU pipelining scheme and a label-assign-move (LAM) algorithm to virtually hide all the remapping overhead. At the end, it reports significant performance improvement produced by the remapping for a set of GPU applications, demonstrating the potential of the techniques for streamlining GPU applications on the fly.},
keywords = {Cpu-gpu pipelining, Data transformation, GPGPU, Thread divergence, Thread-data remapping},
pubstate = {published},
tppubtype = {conference}
}
title = {Exploiting Statistical Correlations for Proactive Prediction of Program Behaviors},
author = {Yunlian Jiang and Eddy Z Zhang and Kai Tian and Feng Mao and Malcom Gethers and Xipeng Shen and Yaoqing Gao},
url = {https://doi.org/10.1145/1772954.1772989},
doi = {10.1145/1772954.1772989},
isbn = {9781605586359},
year = {2010},
date = {2010-01-01},
booktitle = {Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization},
pages = {248–256},
publisher = {Association for Computing Machinery},
address = {Toronto, Ontario, Canada},
series = {CGO ’10},
abstract = {This paper presents a finding and a technique on program behavior prediction. The finding is that surprisingly strong statistical correlations exist among the behaviors of different program components (e.g., loops) and among different types of program level behaviors (e.g., loop trip-counts versus data values). Furthermore, the correlations can be beneficially exploited: They help resolve the proactivity-adaptivity dilemma faced by existing program behavior predictions, making it possible to gain the strengths of both approaches--the large scope and earliness of offline-profiling--based predictions, and the cross-input adaptivity of runtime sampling-based predictions. The main technique contributed by this paper centers on a new concept, seminal behaviors. Enlightened by the existence of strong correlations among program behaviors, we propose a regression based framework to automatically identify a small set of behaviors that can lead to accurate prediction of other behaviors in a program. We call these seminal behaviors. By applying statistical learning techniques, the framework constructs predictive models that map from seminal behaviors to other behaviors, enabling proactive and cross-input adaptive prediction of program behaviors. The prediction helps a commercial compiler, the IBM XL C compiler, generate code that runs up to 45% faster (5%-13% on average), demonstrating the large potential of correlation-based techniques for program optimizations.},
keywords = {Correlation, Program behavior analysis},
pubstate = {published},
tppubtype = {conference}
}
title = {Does Cache Sharing on Modern CMP Matter to the Performance of Contemporary Multithreaded Programs?},
author = {Eddy Z Zhang and Yunlian Jiang and Xipeng Shen},
url = {https://doi.org/10.1145/1693453.1693482},
doi = {10.1145/1693453.1693482},
isbn = {9781605588773},
year = {2010},
date = {2010-01-01},
booktitle = {Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming},
pages = {203–212},
publisher = {Association for Computing Machinery},
address = {Bangalore, India},
series = {PPoPP ’10},
abstract = {Most modern Chip Multiprocessors (CMP) feature shared cache on chip. For multithreaded applications, the sharing reduces communication latency among co-running threads, but also results in cache contention. A number of studies have examined the influence of cache sharing on multithreaded applications, but most of them have concentrated on the design or management of shared cache, rather than a systematic measurement of the influence. Consequently, prior measurements have been constrained by the reliance on simulators, the use of out-of-date benchmarks, and the limited coverage of deciding factors. The influence of CMP cache sharing on contemporary multithreaded applications remains preliminarily understood. In this work, we conduct a systematic measurement of the influence on two kinds of commodity CMP machines, using a recently released CMP benchmark suite, PARSEC, with a number of potentially important factors on program, OS, and architecture levels considered. The measurement shows some surprising results. Contrary to commonly perceived importance of cache sharing, neither positive nor negative effects from the cache sharing are significant for most of the program executions, regardless of the types of parallelism, input datasets, architectures, numbers of threads, and assignments of threads to cores. After a detailed analysis, we find that the main reason is the mismatch of current development and compilation of multithreaded applications and CMP architectures. By transforming the programs in a cache-sharing-aware manner, we observe up to 36% performance increase when the threads are placed on cores appropriately.},
keywords = {Chip multiprocessors, Parallel program optimizations, Shared cache, Thread scheduling},
pubstate = {published},
tppubtype = {conference}
}
title = {KPC-Toolbox: Best Recipes for Automatic Trace Fitting Using Markovian Arrival Processes},
author = {Giuliano Casale and Eddy Z Zhang and Evgenia Smirni},
url = {https://doi.org/10.1016/j.peva.2009.12.003},
doi = {10.1016/j.peva.2009.12.003},
issn = {0166-5316},
year = {2010},
date = {2010-01-01},
journal = {Perform. Eval.},
volume = {67},
number = {9},
pages = {873–896},
publisher = {Elsevier Science Publishers B. V.},
address = {NLD},
abstract = {We present the KPC-Toolbox, a library of MATLAB scripts for fitting workload traces into Markovian Arrival Processes (MAPs) in an automatic way based on the recently proposed Kronecker Product Composition (KPC) method. We first present detailed sensitivity analysis that builds intuition on which trace descriptors are the most important for queueing performance, stressing the advantages of matching higher-order correlations of the process rather than higher-order moments of the distribution. Given that the MAP parameterization space can be very large, we focus on first determining the order of the smallest MAP that can fit the trace well using the Bayesian Information Criterion (BIC). The KPC-Toolbox then automatically derives a MAP that captures accurately the most essential features of the trace. Extensive experimentation illustrates the effectiveness of the KPC-Toolbox in fitting traces that are well documented in the literature as very challenging to fit, showing that the KPC-Toolbox offers a simple and powerful solution to fitting accurately trace data into MAPs. We provide a characterization of moments and correlations that can be fitted exactly by KPC, thus showing the wider applicability of the method compared to small order MAPs. We also consider the fitting of phase-type (PH-type) distributions, which are an important specialization of MAPs that are useful for describing traces without correlations in their time series. We illustrate that the KPC methodology can be easily adapted to PH-type fitting and present experimental results on networking and disk drive traces showing that the KPC-Toolbox can also match accurately higher-order moments of the inter-arrival times in place of correlations.},
keywords = {Automatic fitting, MAP characterization, Markovian arrival process (MAP), Phase-type distribution, Temporal dependence, Time series modeling},
pubstate = {published},
tppubtype = {article}
}
2009
title = {A cross-input adaptive framework for GPU program optimizations},
author = {Yixun Liu and E Z Zhang and X Shen},
year = {2009},
date = {2009-01-01},
booktitle = {2009 IEEE International Symposium on Parallel Distributed Processing},
pages = {1-10},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
title = {Influence of Program Inputs on the Selection of Garbage Collectors},
author = {Feng Mao and Eddy Z Zhang and Xipeng Shen},
url = {https://doi.org/10.1145/1508293.1508307},
doi = {10.1145/1508293.1508307},
isbn = {9781605583754},
year = {2009},
date = {2009-01-01},
booktitle = {Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments},
pages = {91–100},
publisher = {Association for Computing Machinery},
address = {Washington, DC, USA},
series = {VEE ’09},
abstract = {Many studies have shown that the best performer among a set of garbage collectors tends to be different for different applications. Researchers have proposed application-specific selection of garbage collectors. In this work, we concentrate on a second dimension of the problem: the influence of program inputs on the selection of garbage collectors. We collect tens to hundreds of inputs for a set of Java benchmarks, and measure their performance on Jikes RVM with different heap sizes and garbage collectors. A rigorous statistical analysis produces four-fold insights. First, inputs influence the relative performance of garbage collectors significantly, causing large variations to the top set of garbage collectors across inputs. Profiling one or few runs is thus inadequate for selecting the garbage collector that works well for most inputs. Second, when the heap size ratio is fixed, one or two types of garbage collectors are enough to stimulate the top performance of the program on all inputs. Third, for some programs, the heap size ratio significantly affects the relative performance of different types of garbage collectors. For the selection of garbage collectors on those programs, it is necessary to have a cross-input predictive model that predicts the minimum possible heap size of the execution on an arbitrary input. Finally, based on regression techniques, we demonstrate the predictability of the minimum possible heap size, indicating the potential feasibility of the input-specific selection of garbage collectors.},
keywords = {Cross-input program analysis, Input-specific selection, Minimum possible heap size, Profiling, Selection of garbage collectors},
pubstate = {published},
tppubtype = {conference}
}
2008
title = {KPC-Toolbox: Simple Yet Effective Trace Fitting Using Markovian Arrival Processes},
author = {G Casale and E Z Zhang and E Smirni},
year = {2008},
date = {2008-01-01},
booktitle = {2008 Fifth International Conference on Quantitative Evaluation of Systems},
pages = {83-92},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}