Author: "Jouppi, Norman P." - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Jouppi, Norman P."' showing total 321 results

1. FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search

Author: Dotzel, Jordan, Wu, Gang, Li, Andrew, Umar, Muhammad, Ni, Yun, Abdelfattah, Mohamed S., Zhang, Zhiru, Cheng, Liqun, Dixon, Martin G., Jouppi, Norman P., Le, Quoc V., and Li, Sheng
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Quantization has become a mainstream compression technique for reducing model size, computational requirements, and energy consumption for modern deep neural networks (DNNs). With improved numerical support in recent hardware, including multiple variants of integer and floating point, mixed-precision quantization has become necessary to achieve high-quality results with low model cost. Prior mixed-precision methods have performed either a post-training quantization search, which compromises on accuracy, or a differentiable quantization search, which leads to high memory usage from branching. Therefore, we propose the first one-shot mixed-precision quantization search that eliminates the need for retraining in both integer and low-precision floating point models. We evaluate our search (FLIQS) on multiple convolutional and vision transformer networks to discover Pareto-optimal models. Our approach improves upon uniform precision, manual mixed-precision, and recent integer quantization search methods. With integer models, we increase the accuracy of ResNet-18 on ImageNet by 1.31% and ResNet-50 by 0.90% with equivalent model cost over previous methods. Additionally, for the first time, we explore a novel mixed-precision floating-point search and improve MobileNetV2 by up to 0.98% compared to prior state-of-the-art FP8 models. Finally, we extend FLIQS to simultaneously search a joint quantization and neural architecture space and improve the ImageNet accuracy by 2.69% with similar model cost on a MobileNetV2 search space., Comment: Accepted to AutoML 2024
Published: 2023

2. Corona: System Implications of Emerging Nanophotonic Technology

Author: Vantrease, Dana, Schreiber, Robert, Monchiero, Matteo, McLaren, Moray, Jouppi, Norman P., Fiorentino, Marco, Davis, Al, Binkert, Nathan, Beausoleil, Raymond G., and Ahn, Jung Ho
Subjects: Computer Science - Hardware Architecture, Computer Science - Emerging Technologies, Computer Science - Networking and Internet Architecture
Abstract: We expect that many-core microprocessors will push performance per chip from the 10 gigaflop to the 10 teraflop range in the coming decade. To support this increased performance, memory and inter-core bandwidths will also have to scale by orders of magnitude. Pin limitations, the energy cost of electrical signaling, and the non-scalability of chip-length global wires are significant bandwidth impediments. Recent developments in silicon nanophotonic technology have the potential to meet these off- and on- stack bandwidth requirements at acceptable power levels. Corona is a 3D many-core architecture that uses nanophotonic communication for both inter-core communication and off-stack communication to memory or I/O devices. Its peak floating-point performance is 10 teraflops. Dense wavelength division multiplexed optically connected memory modules provide 10 terabyte per second memory bandwidth. A photonic crossbar fully interconnects its 256 low-power multithreaded cores at 20 terabyte per second bandwidth. We have simulated a 1024 thread Corona system running synthetic benchmarks and scaled versions of the SPLASH-2 benchmark suite. We believe that in comparison with an electrically-connected many-core alternative that uses the same on-stack interconnect power, Corona can provide 2 to 6 times more performance on many memory-intensive workloads, while simultaneously reducing power., Comment: This edition is recompiled from proceedings of ISCA-35 (the 35th International Symposium on Computer Architecture, June 21 - 25, 2008, Beijing, China) and has minor formatting differences. 13 pages; 11 figures
Published: 2023
Full Text: View/download PDF

3. RETROSPECTIVE: Corona: System Implications of Emerging Nanophotonic Technology

Author: Vantrease, Dana, Schreiber, Robert, Monchiero, Matteo, McLaren, Moray, Jouppi, Norman P., Fiorentino, Marco, Davis, Al, Binkert, Nathan, Beausoleil, Raymond G., and Ahn, Jung Ho
Subjects: Computer Science - Hardware Architecture, Computer Science - Networking and Internet Architecture
Abstract: The 2008 Corona effort was inspired by a pressing need for more of everything, as demanded by the salient problems of the day. Dennard scaling was no longer in effect. A lot of computer architecture research was in the doldrums. Papers often showed incremental subsystem performance improvements, but at incommensurate cost and complexity. The many-core era was moving rapidly, and the approach with many simpler cores was at odds with the better and more complex subsystem publications of the day. Core counts were doubling every 18 months, while per-pin bandwidth was expected to double, at best, over the next decade. Memory bandwidth and capacity had to increase to keep pace with ever more powerful multi-core processors. With increasing core counts per die, inter-core communication bandwidth and latency became more important. At the same time, the area and power of electrical networks-on-chip were increasingly problematic: To be reliably received, any signal that traverses a wire spanning a full reticle-sized die would need significant equalization, re-timing, and multiple clock cycles. This additional time, area, and power was the crux of the concern, and things looked to get worse in the future. Silicon nanophotonics was of particular interest and seemed to be improving rapidly. This led us to consider taking advantage of 3D packaging, where one die in the 3D stack would be a photonic network layer. Our focus was on a system that could be built about a decade out. Thus, we tried to predict how the technologies and the system performance requirements would converge in about 2018. Corona was the result of this exercise; now, 15 years later, it's interesting to look back at the effort., Comment: 2 pages. Proceedings of ISCA-50: 50 years of the International Symposia on Computer Architecture (selected papers) June 17-21 Orlando, Florida
Published: 2023

4. TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings

Author: Jouppi, Norman P., Kurian, George, Li, Sheng, Ma, Peter, Nagarajan, Rahul, Nai, Lifeng, Patil, Nishant, Subramanian, Suvinay, Swing, Andy, Towles, Brian, Young, Cliff, Zhou, Xiang, Zhou, Zongwei, and Patterson, David
Subjects: Computer Science - Hardware Architecture, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Performance
Abstract: In response to innovations in machine learning (ML) models, production workloads changed radically and rapidly. TPU v4 is the fifth Google domain specific architecture (DSA) and its third supercomputer for such ML models. Optical circuit switches (OCSes) dynamically reconfigure its interconnect topology to improve scale, availability, utilization, modularity, deployment, security, power, and performance; users can pick a twisted 3D torus topology if desired. Much cheaper, lower power, and faster than Infiniband, OCSes and underlying optical components are <5% of system cost and <3% of system power. Each TPU v4 includes SparseCores, dataflow processors that accelerate models that rely on embeddings by 5x-7x yet use only 5% of die area and power. Deployed since 2020, TPU v4 outperforms TPU v3 by 2.1x and improves performance/Watt by 2.7x. The TPU v4 supercomputer is 4x larger at 4096 chips and thus ~10x faster overall, which along with OCS flexibility helps large language models. For similar sized systems, it is ~4.3x-4.5x faster than the Graphcore IPU Bow and is 1.2x-1.7x faster and uses 1.3x-1.9x less power than the Nvidia A100. TPU v4s inside the energy-optimized warehouse scale computers of Google Cloud use ~3x less energy and produce ~20x less CO2e than contemporary DSAs in a typical on-premise data center., Comment: 15 pages; 16 figures; to be published at ISCA 2023 (the International Symposium on Computer Architecture)
Published: 2023

5. Searching for Fast Model Families on Datacenter Accelerators

Author: Li, Sheng, Tan, Mingxing, Pang, Ruoming, Li, Andrew, Cheng, Liqun, Le, Quoc, and Jouppi, Norman P.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Neural Architecture Search (NAS), together with model scaling, has shown remarkable progress in designing high accuracy and fast convolutional architecture families. However, as neither NAS nor model scaling considers sufficient hardware architecture details, they do not take full advantage of the emerging datacenter (DC) accelerators. In this paper, we search for fast and accurate CNN model families for efficient inference on DC accelerators. We first analyze DC accelerators and find that existing CNNs suffer from insufficient operational intensity, parallelism, and execution efficiency. These insights let us create a DC-accelerator-optimized search space, with space-to-depth, space-to-batch, hybrid fused convolution structures with vanilla and depthwise convolutions, and block-wise activation functions. On top of our DC accelerator optimized neural architecture search space, we further propose a latency-aware compound scaling (LACS), the first multi-objective compound scaling method optimizing both accuracy and latency. Our LACS discovers that network depth should grow much faster than image size and network width, which is quite different from previous compound scaling results. With the new search space and LACS, our search and scaling on datacenter accelerators results in a new model series named EfficientNet-X. EfficientNet-X is up to more than 2X faster than EfficientNet (a model series with state-of-the-art trade-off on FLOPs and accuracy) on TPUv3 and GPUv100, with comparable accuracy. EfficientNet-X is also up to 7X faster than recent RegNet and ResNeSt on TPUv3 and GPUv100.
Published: 2021

6. In-Datacenter Performance Analysis of a Tensor Processing Unit

Author: Jouppi, Norman P., Young, Cliff, Patil, Nishant, Patterson, David, Agrawal, Gaurav, Bajwa, Raminder, Bates, Sarah, Bhatia, Suresh, Boden, Nan, Borchers, Al, Boyle, Rick, Cantin, Pierre-luc, Chao, Clifford, Clark, Chris, Coriell, Jeremy, Daley, Mike, Dau, Matt, Dean, Jeffrey, Gelb, Ben, Ghaemmaghami, Tara Vazir, Gottipati, Rajendra, Gulland, William, Hagmann, Robert, Ho, C. Richard, Hogberg, Doug, Hu, John, Hundt, Robert, Hurt, Dan, Ibarz, Julian, Jaffey, Aaron, Jaworski, Alek, Kaplan, Alexander, Khaitan, Harshit, Koch, Andy, Kumar, Naveen, Lacy, Steve, Laudon, James, Law, James, Le, Diemthu, Leary, Chris, Liu, Zhuyuan, Lucke, Kyle, Lundin, Alan, MacKean, Gordon, Maggiore, Adriana, Mahony, Maire, Miller, Kieran, Nagarajan, Rahul, Narayanaswami, Ravi, Ni, Ray, Nix, Kathy, Norrie, Thomas, Omernick, Mark, Penukonda, Narayana, Phelps, Andy, Ross, Jonathan, Ross, Matt, Salek, Amir, Samadiani, Emad, Severn, Chris, Sizikov, Gregory, Snelham, Matthew, Souter, Jed, Steinberg, Dan, Swing, Andy, Tan, Mercedes, Thorson, Gregory, Tian, Bo, Toma, Horia, Tuttle, Erick, Vasudevan, Vijay, Walter, Richard, Wang, Walter, Wilcox, Eric, and Yoon, Doe Hyun
Subjects: Computer Science - Hardware Architecture, Computer Science - Learning, Computer Science - Neural and Evolutionary Computing
Abstract: Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU)---deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X - 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X - 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU., Comment: 17 pages, 11 figures, 8 tables. To appear at the 44th International Symposium on Computer Architecture (ISCA), Toronto, Canada, June 24-28, 2017
Published: 2017

7. CACTI-IO Technical Report

Author: Jouppi, Norman, Kahng, Andrew, Muralimanohar, Naveen, and Srinivas, Vaishnav
Abstract: We describe CACTI-IO, an extension to CACTI that includes power, area and timing models for the IO and PHY of the off-chip memory interface for various server and mobile configurations. CACTI-IO enables quick design space exploration of the off-chip IO along with the DRAM and cache parameters. We describe the models added to CACTI-IO that help include the off-chip impact to the tradeoffs between memory capacity, bandwidth and power. This technical report also provides three standard configurations for the input parameters (DDR3, LPDDR2 and Wide-IO) and illustrates how the models can be modified for a custom configuration. The models are validated against SPICE simulations and show that we are within 0-15% error for different configurations. We also compare with measured results. Pre-2018 CSE ID: CS2012-0986
Published: 2012

8. A Machine Learning Supercomputer with an Optically Reconfigurable Interconnect and Embeddings Support

Author: Jouppi, Norman P., primary and Swing, Andy, additional
Published: 2023
Full Text: View/download PDF

9. Hyperscale Hardware Optimized Neural Architecture Search

Author: Li, Sheng, primary, Andersen, Garrett, additional, Chen, Tao, additional, Cheng, Liqun, additional, Grady, Julian, additional, Huang, Da, additional, Le, Quoc V., additional, Li, Andrew, additional, Li, Xin, additional, Li, Yang, additional, Liang, Chen, additional, Lu, Yifeng, additional, Ni, Yun, additional, Pang, Ruoming, additional, Tan, Mingxing, additional, Wicke, Martin, additional, Wu, Gang, additional, Zhu, Shengqi, additional, Ranganathan, Parthasarathy, additional, and Jouppi, Norman P., additional
Published: 2023
Full Text: View/download PDF

10. Policies Impacting Cache Hit Rates

Author: Balasubramonian, Rajeev, Jouppi, Norman P., and Muralimanohar, Naveen
Published: 2011
Full Text: View/download PDF

11. Technology

Author: Balasubramonian, Rajeev, Jouppi, Norman P., and Muralimanohar, Naveen
Published: 2011
Full Text: View/download PDF

12. Interconnection Networks within Large Caches

Author: Balasubramonian, Rajeev, Jouppi, Norman P., and Muralimanohar, Naveen
Published: 2011
Full Text: View/download PDF

13. Basic Elements of Large Cache Design

Author: Balasubramonian, Rajeev, Jouppi, Norman P., and Muralimanohar, Naveen
Published: 2011
Full Text: View/download PDF

14. Organizing Data in CMP Last Level Caches

Author: Balasubramonian, Rajeev, Jouppi, Norman P., and Muralimanohar, Naveen
Published: 2011
Full Text: View/download PDF

15. Wear-Leveling Techniques for Nonvolatile Memories

Author: Wang, Jue, Dong, Xiangyu, Xie, Yuan, Jouppi, Norman P., and Xie, Yuan, editor
Published: 2014
Full Text: View/download PDF

16. A Circuit-Architecture Co-optimization Framework for Exploring Nonvolatile Memory Hierarchies

Author: Dong, Xiangyu, Jouppi, Norman P., Xie, Yuan, and Xie, Yuan, editor
Published: 2014
Full Text: View/download PDF

17. The Role of Photonics in Future Datacenter Networks

Author: Davis, Al, Jouppi, Norman P., McLaren, Moray, Muralimanohar, Naveen, Schreiber, Robert S., Binkert, Nathan, Ahn, Jung-Ho, Kachris, Christoforos, editor, Bergman, Keren, editor, and Tomkos, Ioannis, editor
Published: 2013
Full Text: View/download PDF

18. CMOS Nanophotonics: Technology, System Implications, and a CMP Case Study

Author: Ahn, Jung Ho, Beausoleil, Raymond G., Binkert, Nathan, Davis, Al, Fiorentino, Marco, Jouppi, Norman P., McLaren, Moray, Monchiero, Matteo, Muralimanohar, Naveen, Schreiber, Robert, Vantrease, Dana, Silvano, Cristina, editor, Lajolo, Marcello, editor, and Palermo, Gianluca, editor
Published: 2011
Full Text: View/download PDF

19. Memory Modeling with CACTI

Author: Muralimanohar, Naveen, Ahn, Jung Ho, Jouppi, Norman P., Leupers, Rainer, editor, and Temam, Olivier, editor
Published: 2010
Full Text: View/download PDF

20. A Domain-Specific Supercomputer for Training Deep Neural Networks: Google's TPU supercomputers train deep neural networks 50x faster than general-purpose supercomputers running a high-performance computing benchmark.

Author: JOUPPI, NORMAN P., DOE HYUN YOON, KURIAN, GEORGE, SHENG LI, PATIL, NISHANT, LAUDON, JAMES, CLIFF YOUNG, and PATTERSON, DAVID
Subjects: *ARTIFICIAL neural networks, *SUPERCOMPUTER design & construction
Abstract: The article examines the construction of a supercomputer by the high technology firm Google, featuring the first production domain specific architecture (DSA) for training deep neural networks (DNNs).
Published: 2020
Full Text: View/download PDF

21. Concluding Remarks

Author: Balasubramonian, Rajeev, Jouppi, Norman P., and Muralimanohar, Naveen
Published: 2011
Full Text: View/download PDF

22. Searching for Fast Model Families on Datacenter Accelerators

Author: Li, Sheng, primary, Tan, Mingxing, additional, Pang, Ruoming, additional, Li, Andrew, additional, Cheng, Liqun, additional, Le, Quoc V., additional, and Jouppi, Norman P., additional
Published: 2021
Full Text: View/download PDF

23. Ten Lessons From Three Generations Shaped Google’s TPUv4i: Industrial Product

Author: Jouppi, Norman P., primary, Hyun Yoon, Doe, additional, Ashcraft, Matthew, additional, Gottscho, Mark, additional, Jablin, Thomas B., additional, Kurian, George, additional, Laudon, James, additional, Li, Sheng, additional, Ma, Peter, additional, Ma, Xiaoyu, additional, Norrie, Thomas, additional, Patil, Nishant, additional, Prasad, Sushma, additional, Young, Cliff, additional, Zhou, Zongwei, additional, and Patterson, David, additional
Published: 2021
Full Text: View/download PDF

24. Isolation in commodity multicore processors

Author: Aggarwal, Nidhi, Ranganathan, Parthasarathy, Jouppi, Norman P., and Smith, James E.
Subjects: Resource sharing software, Multiple core processors -- Usage
Published: 2007

25. Wear-Leveling Techniques for Nonvolatile Memories

Author: Wang, Jue, primary, Dong, Xiangyu, additional, Xie, Yuan, additional, and Jouppi, Norman P., additional
Published: 2013
Full Text: View/download PDF

26. A Circuit-Architecture Co-optimization Framework for Exploring Nonvolatile Memory Hierarchies

Author: Dong, Xiangyu, primary, Jouppi, Norman P., additional, and Xie, Yuan, additional
Published: 2013
Full Text: View/download PDF

27. The Role of Photonics in Future Datacenter Networks

Author: Davis, Al, primary, Jouppi, Norman P., additional, McLaren, Moray, additional, Muralimanohar, Naveen, additional, Schreiber, Robert S., additional, Binkert, Nathan, additional, and Ahn, Jung-Ho, additional
Published: 2012
Full Text: View/download PDF

28. Multi-Core Cache Hierarchies

Author: Balasubramonian, Rajeev, primary, Jouppi, Norman P., additional, and Muralimanohar, Naveen, additional
Published: 2011
Full Text: View/download PDF

29. A Domain-Specific Architecture for Deep Neural Networks.

Author: JOUPPI, NORMAN P., YOUNG, CLIFF, PATIL, NISHANT, and PATTERSON, DAVID
Subjects: *ARTIFICIAL neural networks, *APPLICATION-specific integrated circuits, *SERVER farms (Computer network management), *ENERGY consumption
Abstract: The article discusses the use of Google's tensor processing units (TPUs) to improve performance per watt of deep neural networks in the company's datacenters, with better energy efficiency than central processing units (CPUs) and graphics processing units (GPUs) in similar technologies.
Published: 2018
Full Text: View/download PDF

30. CMOS Nanophotonics: Technology, System Implications, and a CMP Case Study

Author: Ahn, Jung Ho, primary, Beausoleil, Raymond G., additional, Binkert, Nathan, additional, Davis, Al, additional, Fiorentino, Marco, additional, Jouppi, Norman P., additional, McLaren, Moray, additional, Monchiero, Matteo, additional, Muralimanohar, Naveen, additional, Schreiber, Robert, additional, and Vantrease, Dana, additional
Published: 2010
Full Text: View/download PDF

31. The Multicluster Architecture: Reducing Processor Cycle Time Through Partitioning

Author: Farkas, Keith I., Chow, Paul, Jouppi, Norman P., and Vranesic, Zvonko
Published: 1999
Full Text: View/download PDF

32. Google's Training Chips Revealed: TPUv2 and TPUv3

Author: Norrie, Thomas, primary, Patil, Nishant, additional, Yoon, Doe Hyun, additional, Kurian, George, additional, Li, Sheng, additional, Laudon, James, additional, Young, Cliff, additional, Jouppi, Norman P., additional, and Patterson, David, additional
Published: 2020
Full Text: View/download PDF

33. The Future Evolution of High-Performance Microprocessors

Author: Jouppi, Norman P., primary
Published: 2004
Full Text: View/download PDF

34. The Future Evolution of High-Performance Microprocessors

Author: Jouppi, Norman P., Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Nierstrasz, Oscar, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Sudan, Madhu, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Vardi, Moshe Y., Series editor, Weikum, Gerhard, Series editor, Bougé, Luc, editor, and Prasanna, Viktor K., editor
Published: 2005
Full Text: View/download PDF

35. Designing, packaging, and testing a 300-MHz, 115 W ECL microprocessor

Author: Jouppi, Norman P., Boyle, Patrick, and Fitch, John S.
Subjects: Processor Architecture, Research and Development, New Technique, Circuit Design, Chip Packaging, Performance Improvement, Microprocessor, Emitter-coupled logic, RISC, Microprocessors -- Design and construction, Emitter-coupled logic -- Design and construction, Reduced-instruction-set computers -- Design and construction
Published: 1994

36. Computer technology and architecture: an evolving interaction

Author: Hennessy, John L. and Jouppi, Norman P.
Subjects: Microprocessor, Integrated Circuits, Trends, Future of Computing, Technology
Published: 1991

37. Design of a High Performance VLSI Processor

Author: Hennessy, John L., Jouppi, Norman P., Przybylski, Steven, Rowen, Christopher, Gross, Thomas, and Bryant, Randal, editor
Published: 1983
Full Text: View/download PDF

38. TV: An nMOS Timing Analyzer

Author: Jouppi, Norman P. and Bryant, Randal, editor
Published: 1983
Full Text: View/download PDF

39. A high-speed optical multidrop bus for computer interconnections

Author: Tan, Michael R.T., Rosenberg, Paul, Yeo, Jong-Souk, McLaren, Moray, Mathai, Sagi, Morris, Terry, Kuo, Huei Pei, Straznicky, Joseph, Jouppi, Norman P., and Wang, Shih-Yuan
Subjects: Bus architecture, Buses (Computers) -- Research, Optical communications -- Research
Published: 2009

40. Architecting efficient interconnects for large caches with CACTI 6.0

Author: Muralimanohar, Naveen, Balasubramonian, Rajeev, and Jouppi, Norman P.
Subjects: Microprocessor, Microprocessor upgrade, Connector, Central processing units -- Design and construction, Microprocessors -- Design and construction, Connectors -- Usage, Connectors -- Design and construction
Published: 2008

41. Heterogeneous chip multiprocessors

Author: Kumar, Rakesh, Tullsen, Dean M., Jouppi, Norman P., and Ranganathan, Parthasarathy
Subjects: Multiprocessors -- Usage, Multiprocessors -- Analysis
Published: 2005

42. In-Datacenter Performance Analysis of a Tensor Processing Unit

Author: Jouppi, Norman P., primary, Young, Cliff, additional, Patil, Nishant, additional, Patterson, David, additional, Agrawal, Gaurav, additional, Bajwa, Raminder, additional, Bates, Sarah, additional, Bhatia, Suresh, additional, Boden, Nan, additional, Borchers, Al, additional, Boyle, Rick, additional, Cantin, Pierre-luc, additional, Chao, Clifford, additional, Clark, Chris, additional, Coriell, Jeremy, additional, Daley, Mike, additional, Dau, Matt, additional, Dean, Jeffrey, additional, Gelb, Ben, additional, Ghaemmaghami, Tara Vazir, additional, Gottipati, Rajendra, additional, Gulland, William, additional, Hagmann, Robert, additional, Ho, C. Richard, additional, Hogberg, Doug, additional, Hu, John, additional, Hundt, Robert, additional, Hurt, Dan, additional, Ibarz, Julian, additional, Jaffey, Aaron, additional, Jaworski, Alek, additional, Kaplan, Alexander, additional, Khaitan, Harshit, additional, Killebrew, Daniel, additional, Koch, Andy, additional, Kumar, Naveen, additional, Lacy, Steve, additional, Laudon, James, additional, Law, James, additional, Le, Diemthu, additional, Leary, Chris, additional, Liu, Zhuyuan, additional, Lucke, Kyle, additional, Lundin, Alan, additional, MacKean, Gordon, additional, Maggiore, Adriana, additional, Mahony, Maire, additional, Miller, Kieran, additional, Nagarajan, Rahul, additional, Narayanaswami, Ravi, additional, Ni, Ray, additional, Nix, Kathy, additional, Norrie, Thomas, additional, Omernick, Mark, additional, Penukonda, Narayana, additional, Phelps, Andy, additional, Ross, Jonathan, additional, Ross, Matt, additional, Salek, Amir, additional, Samadiani, Emad, additional, Severn, Chris, additional, Sizikov, Gregory, additional, Snelham, Matthew, additional, Souter, Jed, additional, Steinberg, Dan, additional, Swing, Andy, additional, Tan, Mercedes, additional, Thorson, Gregory, additional, Tian, Bo, additional, Toma, Horia, additional, Tuttle, Erick, additional, Vasudevan, Vijay, additional, Walter, Richard, additional, Wang, Walter, additional, Wilcox, Eric, additional, and Yoon, Doe Hyun, additional
Published: 2017
Full Text: View/download PDF

43. History-Assisted Adaptive-Granularity Caches (HAAG$) for high performance 3D DRAM architectures

Author: Chen, Ke, Li, Sheng, Ahn, Jung Ho, Muralimanohar, Naveen, Zhao, Jishen, Xu, Cong, O, Seongil, Xie, Yuan, Brockman, Jay B., and Jouppi, Norman P.
Abstract: 3D-stacked DRAM has the potential to provide high performance and large capacity memory for future high performance computing systems and datacenters, and the integration of a dedicated logic die opens up opportunities for architectural enhancements such as DRAM row-buffer caches. However, high performance and cost-effective row-buffer cache designs remain challenging for 3D memory systems. In this paper, we propose History-Assisted Adaptive-Granularity Cache (HAAG$) that employs an adaptive caching scheme to support full associativity at various granularities, and an intelligent history-assisted predictor to support a large number of banks in 3D memory systems. By increasing the row-buffer cache hit rate and avoiding unnecessary data caching, HAAG$ significantly reduces memory access latency and dynamic power. Our design works particularly well for manycore CPUs running (irregular) memory intensive applications where memory locality is hard to exploit. Evaluation results show that with memory-intensive CPU workloads, HAAG$ can outperform the state-of-the-art row buffer cache by 33.5%.
Published: 2015

44. Techniques for Data Mapping and Buffering to Exploit Asymmetry in Multi-Level Cell (Phase Change) Memory

Author: Yoon, HanBin, Muralimanohar, Naveen, Meza, Justin, Mutlu, Onur, and Jouppi, Norman P.
Subjects: Hardware_MEMORYSTRUCTURES, 90699 Electrical and Electronic Engineering not elsewhere classified, FOS: Electrical engineering, electronic engineering, information engineering, Computer Engineering
Abstract: Phase Change Memory (PCM) is a promising alternative to DRAM to achieve high memory capacity at low cost per bit. Adding to its better projected scalability, PCM can also store multiple bits per cell (called multi-level cell, MLC), offering higher bit density. However, MLC requires precise sensing and control of PCM cell resistance, which incur higher memory access latency and energy. We propose a new approach to mapping and buffering data in MLC PCM to improve memory system performance and energy efficiency. The latency and energy to read or write to MLC PCM varies depending on the resistance state of the multi-level cell, such that one bit in a multi-bit cell can be accessed at lower latency and energy than another bit in the same cell. We propose to exploit this asymmetry between the different bits by decoupling the bits and mapping them to logically separate memory addresses. This exposes reduced read latency and energy in one half of the memory space, and reduced write latency and energy in the other half of memory, to system software. We effectively utilize the reduced latency and energy by mapping read-intensive pages to the read-efficient half of memory, and write-intensive pages to the write-efficient half. Decoupling the bits also provides flexibility in the way data is buffered in the memory device, which we exploit to manipulate the physical row buffer as two logical row buffers for increased data locality in the row buffer. Our evaluations for a multi-core system show that our proposal improves system performance by 19.2%, memory energy efficiency by 14.4%, and thread fairness by 19.3% over the state-of-the-art MLC PCM baseline system that does not employ bit decoupling. The improvements are robust across a wide variety of workloads and system configurations.
Published: 2013
Full Text: View/download PDF

45. CACTI-IO: CACTI With OFF-Chip Power-Area-Timing Models

Author: Jouppi, Norman P., primary, Kahng, Andrew B., additional, Muralimanohar, Naveen, additional, and Srinivas, Vaishnav, additional
Published: 2015
Full Text: View/download PDF

46. History-Assisted Adaptive-Granularity Caches (HAAG$) for High Performance 3D DRAM Architectures

Author: Chen, Ke, primary, Li, Sheng, additional, Ahn, Jung Ho, additional, Muralimanohar, Naveen, additional, Zhao, Jishen, additional, Xu, Cong, additional, O, Seongil, additional, Xie, Yuan, additional, Brockman, Jay B., additional, and Jouppi, Norman P., additional
Published: 2015
Full Text: View/download PDF

47. Endurance-aware cache line management for non-volatile caches

Author: Wang, Jue, Dong, Xiangyu, Xie, Yuan, and Jouppi, Norman P.
Abstract: Nonvolatile memories (NVMs) have the potential to replace low-level SRAM or eDRAM on-chip caches because NVMs save standby power and provide large cache capacity. However, limited write endurance is a common problem for NVM technologies, and today's cache management might result in unbalanced cache write traffic, causing heavily written cache blocks to fail much earlier than others. Although wear-leveling techniques for NVM-based main memories exist, we cannot simply apply them to NVM-based caches. This is because cache writes have intraset variations as well as interset variations, while writes to main memories only have interset variations. To solve this problem, we propose i2WAP, a new cache management policy that can reduce both inter- and intraset write variations. i2WAP has two features: Swap-Shift, an enhancement based on existing main memory wear leveling to reduce cache interset write variations, and Probabilistic Set Line Flush, a novel technique to reduce cache intraset write variations. Implementing i2WAP only needs two global counters and two global registers. In one of our studies, i2WAP can improve the NVM cache lifetime by 75% on average and up to 224%. We also validate that i2WAP is effective in systems with different cache configurations and workloads. © 2014 ACM.
Published: 2014

48. Efficient Data Mapping and Buffering Techniques for Multilevel Cell Phase-Change Memories

Author: Yoon, Hanbin, primary, Meza, Justin, additional, Muralimanohar, Naveen, additional, Jouppi, Norman P., additional, and Mutlu, Onur, additional
Published: 2014
Full Text: View/download PDF

49. Endurance-aware cache line management for non-volatile caches

Author: Wang, Jue, primary, Dong, Xiangyu, additional, Xie, Yuan, additional, and Jouppi, Norman P., additional
Published: 2014
Full Text: View/download PDF

50. The nonuniform distribution of instruction-level and machine parallelism and its effect on performance

Author: Jouppi, Norman P.
Subjects: Performance Measurement, Parallel Processing, Parallelism, Pipelining, Instruction Execution, Processor Architecture, Estimation, Models of Computation, Scientific Research, New Technique
Published: 1989
