172 results on '"UltraSPARC"'
Search Results
2. Ultrasparc Instruction Level Characterization of Java Virtual Machine Workload
- Author
-
Barisone, Andrea, Bellotti, Francesco, Berta, Riccardo, De Gloria, Alessandro, John, Lizy Kurian, editor, and Maynard, Ann Marie Grizzaffi, editor
- Published
- 2000
3. Hardware-assisted circumvention of self-hashing software tamper resistance.
- Author
-
van Oorschot, P.C., Somayaji, A., and Wurster, G.
- Abstract
Self-hashing has been proposed as a technique for verifying software integrity. Appealing aspects of this approach to software tamper resistance include the promise of being able to verify the integrity of software independent of the external support environment, as well as the ability to integrate code protection mechanisms automatically. In this paper, we show that the rich functionality of most modern general-purpose processors (including UltraSparc, x86, PowerPC, AMD64, Alpha, and ARM) facilitates an automated, generic attack which defeats such self-hashing. We present a general description of the attack strategy and multiple attack implementations that exploit different processor features. Each of these implementations is generic in that it can defeat self-hashing employed by any user-space program on a single platform. Together, these implementations defeat self-hashing on most modern general-purpose processors. The generality and efficiency of our attack suggest that self-hashing is not a viable strategy for high-security tamper resistance on modern computer systems. [ABSTRACT FROM PUBLISHER]
- Published
- 2005
4. Power Analysis and Implementation of Low-Power Design for Test Architecture for UltraSPARC Chip Multiprocessor
- Author
-
D. Jackuline Moni, Y. Amar Babu, and John Bedford Solomon
- Subjects
Power gating, UltraSPARC, Computer science, business.industry, Design for testing, Clock gating, Hardware_PERFORMANCEANDRELIABILITY, Power analysis, Network on a chip, Embedded system, Scalability, Hardware_INTEGRATEDCIRCUITS, System on a chip, business
- Abstract
Designing low-power architectures with scalability in mind presents a challenge for modern System on Chip (SoC) and Network on Chip (NoC) designs, all the more so when these designs also incorporate a Design for Testability (DFT) architecture. DFT has become de facto standard practice. From a low-power perspective, it might seem easy to suggest powering down, power gating, clock gating, or DVFS for a particular core. From a DFT perspective, however, this presents a unique challenge, as the scan chains and their associated clocks have to be active for verification to take place. Power-gated or clock-gated low-power strategies can also present difficulties for on-chip debug, especially in modern SoCs and NoCs, which tend to have long test data registers. The drive to low-power design should not impact yield, design confidence, or test confidence. In this paper, a novel architecture is proposed to improve observability and controllability at the individual core level while optimizing 20% of power consumption on an UltraSPARC chip multiprocessor.
- Published
- 2017
5. Performance of the OpenMP and MPI implementations on ultrasparc system
- Author
-
K. Ko P. Sone, K. Zaya, and A. V. Bogdanov
- Subjects
UltraSPARC, Computer science, lcsh:T57-57.97, lcsh:Mathematics, SPARC System, OpenMP, Parallel computing, Software_PROGRAMMINGTECHNIQUES, lcsh:QA1-939, Computer Science Applications, Parallel Programming, MPI (Message Passing Interface), Computational Theory and Mathematics, Modeling and Simulation, lcsh:Applied mathematics. Quantitative methods, Implementation
- Abstract
This paper targets programmers and developers interested in utilizing parallel programming techniques to enhance application performance. The Oracle Solaris Studio software provides state-of-the-art optimizing and parallelizing compilers for C, C++ and Fortran, an advanced debugger, and optimized mathematical and performance libraries. Also included are an extremely powerful performance analysis tool for profiling serial and parallel applications, a thread analysis tool to detect data races and deadlock in memory parallel programs, and an Integrated Development Environment (IDE). The Oracle Message Passing Toolkit software provides the high-performance MPI libraries and associated run-time environment needed for message passing applications that can run on a single system or across multiple compute systems connected with high performance networking, including Gigabit Ethernet, 10 Gigabit Ethernet, InfiniBand and Myrinet. Examples of OpenMP and MPI are provided throughout the paper, including their usage via the Oracle Solaris Studio and Oracle Message Passing Toolkit products for development and deployment of both serial and parallel applications on SPARC and x86/x64 based systems. Throughout this paper it is demonstrated how to develop and deploy an application parallelized with OpenMP and/or MPI.
- Published
- 2015
6. The Oracle Sparc T5 16-Core Processor Scales to Eight Sockets
- Author
-
Sebastian Turullols, Ali Vahidsafa, Sivaramakrishnan Ram, Sumti Jairath, Paul N. Loewenstein, John R. Feehrer, and David Smentek
- Subjects
Multi-core processor, UltraSPARC, Computer performance, Computer science, Parallel computing, ComputerSystemsOrganization_PROCESSORARCHITECTURES, computer.software_genre, Oracle, Hardware and Architecture, SPARC T5, SPARC T4, Operating system, Bandwidth (computing), Electrical and Electronic Engineering, computer, Software
- Abstract
The Oracle Sparc T5 processor more than doubles the throughput of the Sparc T4 processor, while increasing per-thread performance, scalability, power efficiency, and I/O bandwidth. The authors detail the improvements and new features leading to this latest Oracle Sparc processor.
- Published
- 2013
7. UDSM Trends Comparison: From Technology Roadmap to UltraSparc Niagara2
- Author
-
Mariagrazia Graziano, Azzurra Pulimeno, and Gianluca Piccinini
- Subjects
Very-large-scale integration, Engineering, UltraSPARC, business.industry, Transistor, Multiprocessing, law.invention, Reliability engineering, Hardware and Architecture, law, Microsystem, Dynamic demand, Electronic engineering, Technology roadmap, Electrical and Electronic Engineering, Inefficiency, business, Software
- Abstract
The increased leakage, yield inefficiency, process, power supply, and temperature variations have significant aftereffects on the performance of complex VLSI architectures, especially if mapped on ultra deep sub micrometer (UDSM) technologies. In this paper we assess the technology trend based on three industrial technologies (90, 65, and 45 nm) using a state-of-the-art processor as benchmark: the UltraSparc Niagara 2 from Sun Microsystems. We analyze frequency, dynamic and static power, and area after synthesis, varying power supply voltage and temperature. We then compare these exhaustive analyses of system level performance as a function of technology to ITRS device level estimations. The results suggest that this prediction can be of help when addressing both the technological scaling and the variability scenario of the selected technology. We believe that correctly predicting specific values on performance variations when realistic conditions and technologies are changed could provide valuable information for the architect. Our analysis advises the designer on the effective applicability of the ITRS trends to system performance, but also pinpoints that a reliable system level prediction should better take into account the design complexity.
- Published
- 2012
8. Effective Utilization of Multicore Processor for Unified Threat Management Functions
- Author
-
Sudhakar Gummadi and Radhakrishnan Shanmugasundaram
- Subjects
Multi-core processor, UltraSPARC, Artificial Intelligence, Computer Networks and Communications, Computer science, Packet processing, Problem statement, Workload, Parallel computing, Unified threat management, Execution time, Software, Scheduling (computing)
- Abstract
Problem statement: Multicore and multithreaded CPUs have become the new approach for increasing the performance of processor-based systems. Numerous applications benefit from the use of multiple cores. Unified threat management is one such application that has multiple functions to be implemented at high speeds. Increasing the performance of the system by knowing the nature of the functionality and effectively utilizing multiple processors for each of the functions warrants detailed experimentation. In this study, some of the functions of Unified Threat Management were implemented using multiple processors for each of the functions. Approach: This evaluation was conducted on a Sun Fire T1000 server with a Sun UltraSPARC T1 multicore processor. OpenMP parallelization methods were used for scheduling the logical CPUs for the parallelized application. Results: Execution time for some of the implemented UTM functions was analyzed to arrive at an effective allocation and parallelization methodology that is dependent on the hardware and the workload. Conclusion/Recommendations: Based on the analysis, the type of parallelization method for the implemented UTM functions is suggested.
- Published
- 2012
9. Dynamic Allocation of CPUs in Multicore Processor for Performance Improvement in Network Security Applications
- Author
-
Anand Nagar, Radhakrishnan Shanmugasundaram, and Sudhakar Gummadi
- Subjects
Multi-core processor, UltraSPARC, Computer Networks and Communications, Computer science, Packet processing, Workload, Parallel computing, Execution time, Scheduling (computing), Processor affinity, Artificial Intelligence, Central processing unit, Performance improvement, Software
- Abstract
Problem statement: Multicore and multithreaded CPUs have become the new approach for increasing the performance of processor-based systems. Numerous applications benefit from the use of multiple cores. Increasing the performance of the system by increasing the number of CPUs of the multicore processor for a given application warrants detailed experimentation. In this study, the results of experimentation with dynamic allocation/deallocation of CPUs based on the workload conditions for packet processing in a security application are analyzed and presented. Approach: This evaluation was conducted on a Sun Fire T1000 server with a Sun UltraSPARC T1 multicore processor. The OpenMP tasking feature is used for scheduling the logical CPUs for the parallelized application. Dynamic allocation of a CPU to a process is done depending on the workload characterization. Results: Execution time for packet processing was analyzed to arrive at an effective dynamic allocation methodology that is dependent on the hardware and the workload. Conclusion/Recommendations: Based on the analysis, the methodology and the allocation of the number of CPUs for the parallelized application are suggested.
- Published
- 2011
10. A 40 nm 16-Core 128-Thread SPARC SoC Processor
- Author
-
Changku Hwang, Jinuk Luke Shin, A S Leon, K.W. Tam, Timothy P. Johnson, Dawei Huang, Hongping Li, A. Strong, Francis Schumacher, Bruce Petrick, Ha Pham, and A. Smith
- Subjects
Engineering, Multi-core processor, UltraSPARC, business.industry, CPU cache, SerDes, Hardware_PERFORMANCEANDRELIABILITY, Thread (computing), Memory controller, Multithreading, Embedded system, Hardware_INTEGRATEDCIRCUITS, System on a chip, Electrical and Electronic Engineering, business
- Abstract
This fourth generation UltraSPARC T3 SoC processor implements sixteen 8-threaded SPARC cores to double on-chip thread count and throughput performance over its previous generation. It enhances glueless scalability to enable up to 512 threads in a 4-way system. A 16-Bank 6 MB L2 Cache, a 512 GB/s hierarchical crossbar and a 312-lane SerDes I/O of 2.4 Tb/s support the bandwidth required by the large number of threads. This SoC processor integrates the memory controller, PCIE 2.0, 10 Gb Ethernet ports, and required cache coherency support in multi-chip configurations. Multiple clock and power domains are used to optimize performance and power for the SoC components. Extensive power management features, from architecture to circuit techniques, optimize both active and idle power. The 377 mm² die includes 1 billion transistors in a flip-chip ceramic package with 2117 pins. The chip is fabricated in TSMC's 40 nm high-performance process with 11 Cu metals and four transistor types.
- Published
- 2011
11. Spin-based reader-writer synchronization for multiprocessor real-time systems
- Author
-
James H. Anderson and Bjorn B. Brandenburg
- Subjects
Multi-core processor, Control and Optimization, UltraSPARC, Computer Networks and Communications, Computer science, Multiprocessing, Parallel computing, computer.software_genre, Lock (computer science), Blocking (computing), Computer Science Applications, Task (computing), Control and Systems Engineering, Modeling and Simulation, Synchronization (computer science), Operating system, Electrical and Electronic Engineering, Semaphore, computer
- Abstract
Reader preference, writer preference, and task-fair reader-writer locks are shown to cause undue blocking in multiprocessor real-time systems. Phase-fair reader-writer locks, a new class of reader-writer locks, are proposed as an alternative. Three local-spin phase-fair lock algorithms, one with constant remote-memory-reference complexity, are presented and demonstrated to be efficiently implementable on common hardware platforms. Both task- and phase-fair locks are evaluated and contrasted to mutex locks in terms of hard and soft real-time schedulability, each under both global and partitioned scheduling, under consideration of runtime overheads on a multicore Sun "Niagara" UltraSPARC T1 processor. Formal bounds on worst-case blocking are derived for all considered lock types.
- Published
- 2010
12. Utilizing Predictors for Efficient Thermal Management in Multiprocessor SoCs
- Author
-
Tajana Rosing, Ayse K. Coskun, and Kenneth C. Gross
- Subjects
Engineering, UltraSPARC, Temperature control, business.industry, Energy management, Multiprocessing, Hardware_PERFORMANCEANDRELIABILITY, Computer Graphics and Computer-Aided Design, Reliability engineering, Embedded system, Lookup table, Performance engineering, Hardware_INTEGRATEDCIRCUITS, System on a chip, Autoregressive–moving-average model, Electrical and Electronic Engineering, business, Software
- Abstract
Conventional thermal management techniques are reactive, as they take action after temperature reaches a threshold. Such approaches do not always minimize and balance the temperature, and they control temperature at a noticeable performance cost. This paper investigates how to use predictors for forecasting temperature and workload dynamics, and proposes proactive thermal management techniques for multiprocessor system-on-chips. The predictors we study include autoregressive moving average modeling and lookup tables. We evaluate several reactive and predictive techniques on an UltraSPARC T1 processor and an architecture-level simulator. Proactive methods achieve significantly better thermal profiles and performance in comparison to reactive policies.
- Published
- 2009
13. Performance issues in emerging homogeneous multi-core architectures
- Author
-
Tarek El-Ghazawi, Gregory B. Newby, and Abdullah Kayi
- Subjects
Multi-core processor, UltraSPARC, Computer architecture, Hardware and Architecture, Computer science, Modeling and Simulation, Fast Fourier transform, Benchmark (computing), Overhead (computing), x86, Benchmarking, Software, Cache coherence
- Abstract
Multi-core architectures have emerged as the dominant architecture for both desktop and high-performance systems. Multi-core systems introduce many challenges that need to be addressed to achieve the best performance. Therefore, benchmarking of these processors is necessary to identify the possible performance issues. In this paper, a broad range of homogeneous multi-core architectures is investigated in terms of essential performance metrics. To measure performance, we used micro-benchmarks from High-Performance Computing Challenge (HPCC), NAS Parallel Benchmarks (NPB), LMbench, and an FFT benchmark. Performance analysis is conducted on multi-core systems from UltraSPARC and x86 architectures, including systems based on Conroe, Kentsfield, Clovertown, Santa Rosa, Barcelona, Niagara, and Victoria Falls processors. Also, the effect of multi-core architectures on cluster performance is examined using a Clovertown-based cluster. Finally, cache coherence overhead is analyzed using a full-system simulator. Experimental analysis and observations in this study provide for a better understanding of the emerging homogeneous multi-core systems.
- Published
- 2009
14. Coherency Hub Design for Multisocket Sun Servers with CoolThreads Technology
- Author
-
Stephen E. Phillips, John R. Feehrer, P. Rotker, Paul Gingras, M. Shih, John R. Heath, and P. Yakutis
- Subjects
UltraSPARC, business.industry, Computer science, Distributed computing, Node (networking), Memory bandwidth, Pipeline (software), Hardware and Architecture, Server, Embedded system, Multithreading, Electrical and Electronic Engineering, business, Software, Cache coherence
- Abstract
To bring the benefits of CMT to larger workloads, these systems had to scale beyond a single socket. Because CMT requires massive memory bandwidth to achieve adequate throughput performance, the challenge was to develop a coherency link and fabric that would allow performance to scale along with thread count in a multinode (that is, multisocket) system. This article presents CoHub's coherency scheme, ASIC design, and transaction flows, and discusses the engineering challenges created by 800-MHz operation and a six-stage pipeline budget. The basic principles embodied in the multinode coherency protocol and CoHub design will be important building blocks for future multinode CMT systems with higher node counts.
- Published
- 2009
15. Efficient SIMD optimization for media processors
- Author
-
Ce Shi and Jian-peng Zhou
- Subjects
UltraSPARC, Speedup, Computer science, General Engineering, Parallel computing, ComputerSystemsOrganization_PROCESSORARCHITECTURES, Program optimization, computer.software_genre, Instruction set, Computer architecture, Factor (programming language), Compiler, SSE2, SIMD, Hardware_CONTROLSTRUCTURESANDMICROPROGRAMMING, computer, computer.programming_language
- Abstract
Single instruction multiple data (SIMD) instructions are often implemented in modern media processors. Although SIMD instructions are useful in multimedia applications, most compilers do not have good support for SIMD instructions. This paper focuses on SIMD instruction generation for media processors. We present an efficient code optimization approach that is integrated into a retargetable C compiler. SIMD instructions are generated by finding and combining the same operations in programs. Experimental results for the UltraSPARC VIS instruction set show that a speedup factor of up to 2.639 is obtained.
- Published
- 2008
16. Implementation of an 8-Core, 64-Thread, Power-Efficient SPARC Server on a Chip
- Author
-
King C. Yen, Amit Kumar, David J. Greenhill, Umesh Gajanan Nawathe, Mahmud Hassan, and Aparna Ramachandran
- Subjects
Ethernet, Multi-core processor, Engineering, UltraSPARC, business.industry, Thread (computing), Chip, law.invention, Microprocessor, UltraSPARC T2, law, Embedded system, Hardware_INTEGRATEDCIRCUITS, System on a chip, Electrical and Electronic Engineering, business
- Abstract
The second in the Niagara series of processors (Niagara2) from Sun Microsystems is based on the power-efficient chip multi-threading (CMT) architecture optimized for Space, Watts (Power), and Performance (SWaP) [SWaP Rating = Performance/(Space × Power)]. It doubles the throughput performance and performance/watt, and provides >10× improvement in floating point throughput performance as compared to UltraSPARC T1 (Niagara1). There are two 10 Gb Ethernet ports on chip. Niagara2 has eight SPARC cores, each supporting concurrent execution of eight threads for 64 threads total. Each SPARC core has a floating point and graphics unit and an advanced cryptographic unit which provides high enough bandwidth to run the two 10 Gb Ethernet ports encrypted at wire speeds. There is a 4 MB Level 2 cache on chip. Each of the four on-chip memory controllers controls two FBDIMM channels. Niagara2 has 503 million transistors on a 342 mm² die packaged in a flip-chip glass ceramic package with 1831 pins. The chip is built in Texas Instruments' 65 nm 11LM triple-Vt CMOS process. It operates at 1.4 GHz at 1.1 V and consumes 84 W.
- Published
- 2008
17. A Power-Efficient High-Throughput 32-Thread SPARC Processor
- Author
-
K.W. Tam, Francis Schumacher, P. Kongetira, D. Weisner, W. Bryg, Jinuk Luke Shin, A. Strong, and Ana Sonia Leon
- Subjects
Multi-core processor, Engineering, UltraSPARC, business.industry, CPU cache, Register file, Embedded system, Low-power electronics, Hardware_INTEGRATEDCIRCUITS, Electrical and Electronic Engineering, Physical design, Crossbar switch, business, Dram
- Abstract
This first generation of "Niagara" SPARC processors implements a power-efficient Chip Multi-Threading (CMT) architecture which maximizes overall throughput performance for commercial workloads. The target performance is achieved by exploiting high bandwidth rather than high frequency, thereby reducing hardware complexity and power. The UltraSPARC T1 processor combines eight four-threaded 64-b cores, a floating-point unit, a high-bandwidth interconnect crossbar, a shared 3-MB L2 Cache, four DDR2 DRAM interfaces, and a system interface unit. Power and thermal monitoring techniques further enhance CMT performance benefits, increasing overall chip reliability. The 378-mm² die is fabricated in Texas Instruments' 90-nm CMOS technology with nine layers of copper interconnect. The chip contains 279 million transistors and consumes a maximum of 63 W at 1.2 GHz and 1.2 V. Key functional units employ special circuit techniques to provide the high bandwidth required by a CMT architecture while optimizing power and silicon area. These include a highly integrated integer register file, a high-bandwidth interconnect crossbar, the shared L2 cache, and the IO subsystem. Key aspects of the physical design methodology are also discussed.
- Published
- 2007
18. Workload characterization and prediction: A pathway to reliable multi-core systems
- Author
-
Yiorgos Makris, Monir Zaman, and Ali Ahmadi
- Subjects
Power management, Multi-core processor, Engineering, UltraSPARC, business.industry, Reliability (computer networking), Real-time computing, Benchmark (computing), Workload, business, Reliability engineering, Parsec, Power (physics)
- Abstract
As a result of technology scaling, power density of multi-core chips increases and leads to temperature hot-spots which accelerate device aging and chip failure. Moreover, intense efforts to reduce power consumption by employing low-power techniques decrease the reliability of new design generations. Traditionally, reactive thermal/power management techniques have been used to take appropriate action when the temperature reaches a threshold. However, these approaches do not always balance temperature and, as a result, may degrade system reliability. Therefore, to distribute temperature evenly across all cores, a proactive mechanism is needed to forecast future workload characteristics and the corresponding temperature, in order to make decisions before hot spots occur. Such proactive methods rely on an engine to precisely predict future workload characteristics. In this work, we first discuss the state-of-the-art methods for predicting workload dynamics and we compare their performance. We, then, introduce a prediction method based on Support Vector Regression (SVR), which accurately predicts the workload behavior several steps ahead. To evaluate the effectiveness of our approach, we use several programs from the PARSEC benchmark suite on an UltraSPARC T1 processor running the Sun Solaris operating system and we extract architectural traces. Then, the extracted traces are used to generate power and thermal profiles for each core using the McPAT and Hot-Spot simulators. Our results show that the proposed method forecasts workload dynamics and power very accurately and outperforms previous prediction techniques.
- Published
- 2015
19. Design and implementation of an embedded 512-KB level-2 cache subsystem
- Author
-
Mandeep Singh, Jinuk Luke Shin, Bruce Petrick, and Ana Sonia Leon
- Subjects
Random access memory, Hardware_MEMORYSTRUCTURES, UltraSPARC, Computer science, business.industry, CPU cache, law.invention, Microprocessor, law, Embedded system, Static random-access memory, Cache, Electrical and Electronic Engineering, business
- Abstract
Dual on-chip 512-KB unified second level (L2) caches for an UltraSparc processor are implemented using 0.13-μm technology. Each 512-KB unit is implemented using 34 million transistors to achieve 1.4 GHz and 2.6 W at 1.3 V and 85 °C. This fully integrated subsystem is composed of conventional data and tag SRAMs along with datapaths, controller, and test engines. The unit achieves one of the shortest on-chip L2 cache latencies reported for 64-bit microprocessors, with a data latency of only four cycles including ECC correction for 128-bit data. In addition, balanced custom and automated design methodologies are used to achieve the aggressive design cycle. Architectural and physical design solutions to build this integrated short latency L2 cache are discussed.
- Published
- 2005
20. Direct simulation for discrete mixture distributions
- Author
-
Paul Fearnhead
- Subjects
Statistics and Probability, Mathematical optimization, UltraSPARC, Posterior probability, Forward–backward algorithm, Poisson distribution, Mixture model, Theoretical Computer Science, Set (abstract data type), symbols.namesake, Computational Theory and Mathematics, Component (UML), symbols, Applied mathematics, Statistics, Probability and Uncertainty, Particle filter, Mathematics
- Abstract
We demonstrate how to perform direct simulation for discrete mixture models. The approach is based on directly calculating the posterior distribution using a set of recursions which are similar to those of the Forward-Backward algorithm. Our approach is more practicable than existing perfect simulation methods for mixtures. For example, we analyse 1096 observations from a 2 component Poisson mixture, and 240 observations under a 3 component Poisson mixture (with unknown mixture proportions and Poisson means in each case). Simulating samples of 10,000 perfect realisations took about 17 minutes and an hour respectively on a 900 MHz ultraSPARC computer. Our method can also be used to perform perfect simulation from Markov-dependent mixture models. A byproduct of our approach is that the evidence of our assumed models can be calculated, which enables different models to be compared.
- Published
- 2005
21. Hardware-Assisted Circumvention of Self-Hashing Software Tamper Resistance
- Author
-
Glenn Wurster, P. C. van Oorschot, and Anil Somayaji
- Subjects
UltraSPARC, Exploit, business.industry, Computer science, Processor design, PowerPC, Cryptography, Software, Embedded system, Industrial relations, x86, Electrical and Electronic Engineering, business, Tamper resistance
- Abstract
Self-hashing has been proposed as a technique for verifying software integrity. Appealing aspects of this approach to software tamper resistance include the promise of being able to verify the integrity of software independent of the external support environment, as well as the ability to integrate code protection mechanisms automatically. In this paper, we show that the rich functionality of most modern general-purpose processors (including UltraSparc, x86, PowerPC, AMD64, Alpha, and ARM) facilitates an automated, generic attack which defeats such self-hashing. We present a general description of the attack strategy and multiple attack implementations that exploit different processor features. Each of these implementations is generic in that it can defeat self-hashing employed by any user-space program on a single platform. Together, these implementations defeat self-hashing on most modern general-purpose processors. The generality and efficiency of our attack suggest that self-hashing is not a viable strategy for high-security tamper resistance on modern computer systems.
- Published
- 2005
22. A dual-core 64-bit ultraSPARC microprocessor for dense server applications
- Author
-
V. Mathur, Howard L. Levy, D. Bistry, Bruce Petrick, Ana Sonia Leon, Jinseung Son, Ha Pham, Mandeep Singh, Jinuk Luke Shin, U. Nair, Toshinari Takayanagi, N. Moon, and Jeffrey Y. Su
- Subjects
Multi-core processor, Engineering, Hardware_MEMORYSTRUCTURES, UltraSPARC, Blade server, business.industry, CPU cache, Hardware_PERFORMANCEANDRELIABILITY, Integrated circuit design, Chip, Memory controller, law.invention, Microprocessor, law, Embedded system, Hardware_INTEGRATEDCIRCUITS, Electrical and Electronic Engineering, business
- Abstract
A dual-core 64-bit microprocessor optimized for compute-dense systems such as rack-mount and blade servers for network computing was developed. The chip consists of two UltraSPARC II cores, each with its own 512 kB L2 cache, a DDR-1 memory controller, and symmetric multiprocessor bus (JBus) controllers. The 206-mm² die is fabricated in 0.13-μm CMOS technology with seven layers of Cu and a low-k dielectric. The chip offers a highly efficient performance-per-watt ratio with a typical power dissipation of 23 W at 1.3 V and 1.2 GHz. A short design cycle was achieved by leveraging existing designs wherever possible and developing effective design methodologies and flows. Significant design challenges faced by this project are described. These include deep-submicron design issues, such as negative bias temperature instability (NBTI), leakage, coupling noise, intra-die process variation, and electromigration (EM). A second important design challenge was implementing a high-performance L2 cache subsystem with a short four-cycle core-to-L2 latency including ECC.
- Published
- 2005
23. Dynamic Data Layouts for Cache-Conscious Implementation of a Class of Signal Transforms
- Author
-
Neungsoo Park and Viktor K. Prasanna
- Subjects
Signal processing, UltraSPARC, Memory hierarchy, Factorization, CPU cache, Computer science, Dynamic data, Signal Processing, Fast Fourier transform, Pentium, Parallel computing, Cache, Electrical and Electronic Engineering
- Abstract
Effective utilization of cache memories is a key factor in achieving high performance for computing large signal transforms. Nonunit stride access in the computation of large signal transforms results in poor cache performance, leading to severe degradation in the overall performance. In this paper, we develop a cache-conscious technique, called a dynamic data layout, to improve the performance of large signal transforms. In our approach, data reorganization is performed between computation stages to reduce cache misses. We develop an efficient search algorithm to determine an optimal tree with the minimum execution time among possible factorization trees based on the size of the signal transform and the data access stride. Our approach is applied to compute the fast Fourier transform (FFT) and the Walsh-Hadamard transform (WHT). Experiments were performed on Alpha 21264, MIPS R10000, UltraSPARC III, and Pentium 4. Experimental results show that our FFT and WHT achieve performance improvement of up to 3.52 times over other state-of-the-art FFT and WHT packages. The proposed optimization is portable across various platforms.
- Published
- 2004
24. A Flexible, Fast, and Optimal Modeling Approach Applied to Crew Rostering at London Underground
- Author
-
Stephen Norris and ManMohan S. Sodhi
- Subjects
Flexibility (engineering) ,Mathematical optimization ,UltraSPARC ,Theory of computation ,Crew ,General Decision Sciences ,Crew rostering ,Ranging ,Management Science and Operations Research ,Solver ,Optimal modeling ,Mathematics - Abstract
We present a general modeling approach to crew rostering and its application to computer-assisted generation of rotation-based rosters (or rotas) at the London Underground. Our goals were flexibility, speed, and optimality, and our approach is unique in that it achieves all three. Flexibility was important because requirements at the Underground are evolving and because specialized approaches in the literature did not meet our flexibility-implied need to use standard solvers. We decompose crew rostering into stages that can each be solved with a standard commercial MILP solver. Using a 167 MHz Sun UltraSparc 1 and CPLEX 4.0 MILP solver, we obtained high-quality rosters in runtimes ranging from a few seconds to a few minutes within 2% of optimality. Input data were taken from different depots with crew sizes ranging from 30–150 drivers, i.e., with number of duties ranging from about 200–1000. Using an argument based on decomposition and aggregation, we prove the optimality of our approach for the overall crew rostering problem.
- Published
- 2004
25. Tiling, block data layout, and memory hierarchy performance
- Author
-
Viktor K. Prasanna, Neungsoo Park, and Bo Hong
- Subjects
Hardware_MEMORYSTRUCTURES ,UltraSPARC ,Memory hierarchy ,Computer science ,Translation lookaside buffer ,Parallel computing ,Matrix multiplication ,LU decomposition ,law.invention ,Computational Theory and Mathematics ,Hardware and Architecture ,law ,Signal Processing ,Cache ,Block (data storage) ,Cholesky decomposition - Abstract
Recently, several experimental studies have been conducted on block data layout in conjunction with tiling as a data transformation technique to improve cache performance. In this paper, we analyze cache and translation look-aside buffer (TLB) performance of such alternate layouts (including block data layout and Morton layout) when used in conjunction with tiling. We derive a tight lower bound on TLB performance for standard matrix access patterns, and show that block data layout and Morton layout achieve this bound. To improve cache performance, block data layout is used in concert with tiling. Based on the cache and TLB performance analysis, we propose a data block size selection algorithm that finds a tight range for optimal block size. To validate our analysis, we conducted simulations and experiments using tiled matrix multiplication, LU decomposition, and Cholesky factorization. For matrix multiplication, simulation results using UltraSparc II parameters show that tiling and block data layout, with a block size given by our block size selection algorithm, reduce TLB misses by up to 93 percent compared with other techniques. The total miss cost is reduced considerably. Experiments on several platforms show that tiling with block data layout achieves up to 50 percent performance improvement over other techniques that use conventional layouts. Morton layout is also analyzed and compared with block data layout. Experimental results show that matrix multiplication using block data layout is up to 15 percent faster than that using Morton data layout.
- Published
- 2003
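Illustrative note on record 25: the two layouts named in the abstract can be sketched as index mappings from a matrix coordinate to a linear memory offset. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the matrix is assumed square with its side divisible by the block size, blocks are assumed to be stored in row-major order with row-major order inside each block, and the Morton variant assumes the column index supplies the even bits.

```python
def block_layout_index(i, j, n, b):
    """Offset of element (i, j) in an n x n matrix stored as contiguous b x b blocks.

    Assumption: blocks are ordered row-major, and elements are row-major
    within each block.
    """
    if n % b != 0:
        raise ValueError("n must be a multiple of the block size b")
    blocks_per_row = n // b
    block_id = (i // b) * blocks_per_row + (j // b)
    return block_id * b * b + (i % b) * b + (j % b)


def morton_index(i, j, bits=16):
    """Offset of element (i, j) in Morton (Z-order) layout.

    Assumption: bits of j occupy the even positions and bits of i the odd
    positions of the result.
    """
    z = 0
    for k in range(bits):
        z |= ((j >> k) & 1) << (2 * k)
        z |= ((i >> k) & 1) << (2 * k + 1)
    return z


if __name__ == "__main__":
    # Each 2 x 2 block of a 4 x 4 matrix occupies four consecutive offsets.
    for row in range(4):
        print([block_layout_index(row, col, 4, 2) for col in range(4)])
```

With either mapping, all elements of one tile sit in a contiguous address range, which is the property the abstract relies on to reduce the number of pages a tile touches.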
26. Data remapping for design space optimization of embedded memory systems
- Author
-
Rodric Rabbah and Krishna V. Palem
- Subjects
UltraSPARC ,Memory hierarchy ,Computer science ,C dynamic memory allocation ,CPU cache ,business.industry ,Design space exploration ,Distributed computing ,Optimizing compiler ,computer.software_genre ,Hardware and Architecture ,Embedded system ,Itanium ,Compiler ,business ,computer ,Software - Abstract
In this article, we present a novel linear time algorithm for data remapping, that is, (i) lightweight; (ii) fully automated; and (iii) applicable in the context of pointer-centric programming languages with dynamic memory allocation support. All previous work in this area lacks one or more of these features. We proceed to demonstrate a novel application of this algorithm as a key step in optimizing the design of an embedded memory system. Specifically, we show that by virtue of locality enhancements via data remapping, we may reduce the memory subsystem needs of an application by 50%, and hence concomitantly reduce the associated costs in terms of size, power, and dollar-investment (61%). Such a reduction overcomes key hurdles in designing high-performance embedded computing solutions. Namely, memory subsystems are very desirable from a performance standpoint, but their costs have often limited their use in embedded systems. Thus, our innovative approach offers the intriguing possibility of compilers playing a significant role in exploring and optimizing the design space of a memory subsystem for an embedded design. To this end and in order to properly leverage the improvements afforded by a compiler optimization, we identify a range of measures for quantifying the cost-impact of popular notions of locality, prefetching, regularity of memory access, and others. The proposed methodology will become increasingly important, especially as the needs for application specific embedded architectures become prevalent. In addition, we demonstrate the wide applicability of data remapping using several existing microprocessors, such as the Pentium and UltraSparc. Namely, we show that remapping can achieve a performance improvement of 20% on the average. Similarly, for a parametric research HPL-PD microprocessor, which characterizes the new Itanium machines, we achieve a performance improvement of 28% on average.
All of our results are achieved using applications from the DIS, Olden and SPEC2000 suites of integer and floating point benchmarks.
- Published
- 2003
27. Implementation of a third-generation 1.1-GHz 64-bit microprocessor
- Author
-
G.K. Konstadinidis, K. Normoyle, Samson Wong, S. Bhutani, H. Stuimer, T. Johnson, A. Smith, D.Y. Cheung, F. Romano, Shifeng Yu, Sung-Hun Oh, V. Melamed, S. Narayanan, D. Bunsey, Cong Khieu, K.J. Wu, R. Schmitt, A. Dumlao, M. Sutera, Jade Chau, K.J. Lin, and W.S. Coates
- Subjects
Very-large-scale integration ,Hardware_MEMORYSTRUCTURES ,UltraSPARC ,business.industry ,Computer science ,Transistor ,Memory bandwidth ,Hardware_PERFORMANCEANDRELIABILITY ,Chip ,law.invention ,Microprocessor ,Read-write memory ,Hardware_GENERAL ,law ,Hardware_INTEGRATEDCIRCUITS ,System on a chip ,Cache ,Electrical and Electronic Engineering ,business ,Computer hardware ,Hardware_LOGICDESIGN - Abstract
This third-generation 1.1-GHz 64-bit UltraSPARC microprocessor provides 1-MB on-chip level-2 cache, 4-Gb/s off-chip memory bandwidth, and a new 200 MHz JBus interface that supports one to four processors. The 87.5-million transistor chip is implemented in a seven-layer-metal copper 0.13-µm CMOS process and dissipates 53 W at 1.3 V and 1.1 GHz.
- Published
- 2002
28. The DAQ system with a RACEway switch for the PHOBOS experiment at RHIC
- Author
-
A. Sukhanov, P. Sarin, and P. Kulinich
- Subjects
Nuclear and High Energy Physics ,Engineering ,UltraSPARC ,business.industry ,Gigabit Ethernet ,Disk array ,Programmable logic device ,Data acquisition ,Nuclear Energy and Engineering ,Computer data storage ,Electrical and Electronic Engineering ,business ,Control logic ,Computer hardware ,VMEbus - Abstract
The PHOBOS data acquisition system based on a RACEway switching network is described. Occupying a single VME crate, the system utilizes 22 PPC750 CPUs working in parallel to compress data from 135 168 silicon pad detectors and an UltraSPARC VME host for event building and data storage. Lossless Huffman coding is used for compression; this reduces the event size fourfold. The two-host disk array is used to stage data before sending them over Gigabit Ethernet to the Relativistic Heavy Ion Collider (RHIC) central computing facility. All trigger and control logic is formed using universal programmable logic VME modules, which can be programmed in situ, even when the system is running. The event building and run control software is written using the ROOT framework. The slow control and configuration makes use of an Oracle database to store configuration and monitoring parameters. The system has been taking data from the PHOBOS experiment at RHIC since June 2000. The achieved data-taking rate is 280 events/s or 28 MB/s; with additional disk arrays it can potentially reach 80 MB/s.
- Published
- 2002
29. The Sun Fireplane Interconnect
- Author
-
A. Charlesworth
- Subjects
Interconnection ,UltraSPARC ,Workstation ,Computer science ,business.industry ,Reliability (computer networking) ,law.invention ,Sun Microsystems ,Hardware and Architecture ,law ,Server ,Embedded system ,Bandwidth (computing) ,Electrical and Electronic Engineering ,business ,Software ,Fireplane - Abstract
A computing system's internal interconnect is a key determiner of its cost, performance, and reliability. The Sun Fireplane Interconnect, used inside the Sun Microsystems Ultrasparc III generation of servers and workstations, builds on three generations of interconnects, and provides a significant increase in performance and system bandwidth.
- Published
- 2002
30. Optimization of the assignment of circuit cards to assembly lines in electronics assembly
- Author
-
Kimberly P. Ellis and Sudeer Bhoja
- Subjects
Engineering ,UltraSPARC ,Workstation ,Quadratic assignment problem ,business.industry ,Strategy and Management ,Real-time computing ,Process (computing) ,Management Science and Operations Research ,Work in process ,Industrial and Manufacturing Engineering ,Line (electrical engineering) ,law.invention ,Computer engineering ,law ,Electronics ,business ,Assignment problem - Abstract
Process planning is an important and integral function for ensuring efficient operations in printed circuit card assembly systems. This paper presents a new approach for solving the circuit card to assembly line assignment problem to minimize assembly time. This problem occurs frequently in process planning for electronic assembly systems and involves considering other interrelated process planning problems. The line assignment problem is formulated as a large-scale mixed-integer programming problem and then solved using problem decomposition along with the branch-and-bound algorithm. Techniques for improving the solution time are discussed, and the solution approach is demonstrated using industry representative data sets from Lucent Technologies. For the data sets considered, the solution approach provides solutions within 3% of optimal in approximately 6 min of computation time on a Sun UltraSparc 2 Workstation. The solution approach developed for addressing the line assignment problem can serve as a use...
- Published
- 2002
31. Coherency Hub Design for Multi-socket Sun Servers with CoolThreads (TM) Technology
- Author
-
Stephen E. Phillips, John R. Feehrer, Paul Gingras, P. Rotker, Milton Shih, Peter Yakutis, and John R. Heath
- Subjects
UltraSPARC ,Transaction processing ,Computer science ,Multiprocessing ,02 engineering and technology ,computer.software_genre ,Pipeline (software) ,020202 computer hardware & architecture ,UltraSPARC T2 ,Hardware and Architecture ,Server ,Scalability ,0202 electrical engineering, electronic engineering, information engineering ,Operating system ,Electrical and Electronic Engineering ,computer ,Database transaction ,Software - Abstract
This paper describes the micro-architecture of a Coherency Hub (CoHub) ASIC for a 4-socket highly-threaded multiprocessor using Sun's UltraSPARC T2 Plus processor. UltraSPARC T2 Plus is an 8-core CMT processor in the Sun Servers with CoolThreads (TM) Technology family. CoHub enables cost-effective scaling to 4 nodes with a total thread count of 256 and near-linear performance scaling on transaction processing workloads. Extending a 2-node "glueless" system to a 4-node system without processor changes was a key requirement. CoHub broadcasts snoop requests, serializes requests to the same address, and consolidates snoop responses. It communicates with nodes via serial links, using a proprietary link layer implemented over FBDIMM. We present the coherency scheme, ASIC design, transaction flows, and engineering challenges created by 800 MHz operation and 6-stage pipeline budget. We report performance scalability results measured on commercial server benchmarks.
- Published
- 2017
32. Can a Light Typing Discipline Be Compatible with an Efficient Implementation of Finite Fields Inversion?
- Author
-
Emanuele Cesena, Daniele Canavese, Marco Pedicini, Rachid Ouchary, and Luca Roversi
- Subjects
cryptography ,UltraSPARC ,Binary number ,lambda calculus ,Efficient implementation ,Lambda ,ARM architecture ,binary fields ,Finite field ,Finite fields ,Multiplicative inverse ,Lambda calculus ,Time complexity ,computer ,Algorithm ,computer.programming_language ,Mathematics - Abstract
We focus on the fragment TFA of λ-calculus. It contains terms which normalize in polynomial time only. Inside TFA we translated BEA, a well known, imperative and fast algorithm which calculates the multiplicative inverse of binary finite fields. The translation suggests how to categorize the operations of BEA in sets which drive the design of a variant that we called DCEA. On several common architectures we show that these two algorithms have comparable performances, while on UltraSPARC and ARM architectures the variant we synthesized from a purely functional source can go considerably faster than BEA.
- Published
- 2014
33. Communication Efficient BSP Algorithm for All Nearest Smaller Values Problem
- Author
-
Chun-Hsi Huang and Xin He
- Subjects
UltraSPARC ,Bounded set ,Computer Networks and Communications ,Computer science ,Computation ,Graph theory ,Multiprocessing ,All nearest smaller values ,Load balancing (computing) ,Computational geometry ,Theoretical Computer Science ,Bulk synchronous parallel ,Artificial Intelligence ,Hardware and Architecture ,Algorithm ,Software - Abstract
We present a BSP (Bulk Synchronous Parallel) algorithm for solving the All Nearest Smaller Values Problem (ANSVP), a fundamental problem in both graph theory and computational geometry. Our algorithm achieves optimal sequential computation time and uses only three communication supersteps. In the worst case, each communication phase takes no more than an (np+p)-relation, where p is the number of the processors. In addition, our average-case analysis shows that, on random inputs, the expected communication requirements for all three steps are bounded above by a p-relation, which is independent of the problem size n. Experiments have been carried out on an SGI Origin 2000 with 32 R10000 processors and a SUN Enterprise 4000 multiprocessing server supporting 8 UltraSPARC processors, using the MPI libraries. The results clearly demonstrate the communication efficiency and load balancing for computation.
- Published
- 2001
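Illustrative note on record 33: the problem the abstract addresses asks, for every element of a sequence, for the closest position on each side that holds a strictly smaller value. The sketch below is the classic sequential stack-based solution, which runs in linear time. It is not the BSP algorithm of the paper, and the function names are hypothetical.

```python
def nearest_smaller_left(values):
    """For each position, return the index of the nearest strictly smaller
    value to its left, or None when no such value exists."""
    result = []
    stack = []  # indices whose values are strictly increasing, bottom to top
    for idx, value in enumerate(values):
        # Discard candidates that are not smaller than the current value.
        while stack and values[stack[-1]] >= value:
            stack.pop()
        result.append(stack[-1] if stack else None)
        stack.append(idx)
    return result


def nearest_smaller_right(values):
    """Same query to the right, obtained by running the left scan on the
    reversed sequence and mapping the indices back."""
    n = len(values)
    mirrored = nearest_smaller_left(values[::-1])
    return [None if m is None else n - 1 - m for m in reversed(mirrored)]


if __name__ == "__main__":
    data = [3, 1, 4, 1, 5, 9, 2, 6]
    print(nearest_smaller_left(data))
    print(nearest_smaller_right(data))
```

Each index is pushed and popped at most once, so the scan does a constant amount of work per element on average; a parallel version has to preserve that bound while splitting the sequence across processors.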
34. Optimization of H.263 video encoding using a single processor computer: performance tradeoffs and benchmarking
- Author
-
Shahriar M. Akramullah, M.L. Liou, and Ishfaq Ahmad
- Subjects
UltraSPARC ,Speedup ,Workstation ,Computer performance ,business.industry ,Computer science ,Pentium ,Optimizing compiler ,Visual Instruction Set ,Frame rate ,law.invention ,Videotelephony ,Instruction set ,law ,Embedded system ,Media Technology ,Electrical and Electronic Engineering ,business ,Encoder ,Computer hardware - Abstract
We present the optimization and performance evaluation of a software-based H.263 video encoder. The objective is to maximize the encoding rate without losing the picture quality on an ordinary single processor computer such as a PC or a workstation. This requires optimization at all design and implementation phases, including algorithmic enhancements, efficient implementations of all encoding modules, and taking advantage of certain architectural features of the machine. We design efficient algorithms for DCT and fast motion estimation, and exploit various techniques to speed up the processing, including a number of compiler optimizations and removal of redundant operations. For exploiting the architectural features of the machine, we make use of low-level machine primitives such as Sun UltraSPARC's visual instruction set and Intel's multimedia extension, which accelerate the computation in a single instruction stream multiple data stream fashion. Extensive benchmarking is carried out on three platforms: a 167-MHz Sun UltraSPARC-1 workstation, a 233-MHz Pentium II PC, and a 600-MHz Pentium III PC. We examine the effect of each type of optimization for every coding mode of H.263, highlighting the tradeoffs between quality and complexity. The results also allow us to make an interesting comparison between the workstation and the PCs. The encoder yields 45.68 frames per second (frames/s) on the Pentium III PC, 18.13 frames/s on the Pentium II PC, and 12.17 frames/s on the workstation for QCIF resolution video with high perceptual quality at reasonable bit rates, which are sufficient for most of the general switched telephone networks based video telephony applications. The paper concludes by suggesting optimum coding options.
- Published
- 2001
35. MIDAS-W: a workstation-based incoherent scatter radar data acquisition system
- Author
-
T. Grydeland, Philip J. Erickson, A. M. Gorczyca, and John M. Holt
- Subjects
Atmospheric Science ,Electromagnetics ,010504 meteorology & atmospheric sciences ,Workstation ,Computer science ,[SDU.STU]Sciences of the Universe [physics]/Earth Sciences ,02 engineering and technology ,01 natural sciences ,law.invention ,Data processing system ,Data acquisition ,Software ,law ,0202 electrical engineering, electronic engineering, information engineering ,Earth and Planetary Sciences (miscellaneous) ,Software system ,lcsh:Science ,0105 earth and related environmental sciences ,[SDU.OCEAN]Sciences of the Universe [physics]/Ocean, Atmosphere ,UltraSPARC ,business.industry ,[SDU.OCEAN] Sciences of the Universe [physics]/Ocean, Atmosphere ,lcsh:QC801-809 ,020206 networking & telecommunications ,Geology ,Astronomy and Astrophysics ,lcsh:QC1-999 ,lcsh:Geophysics. Cosmic physics ,Space and Planetary Science ,Baseband ,[SDU.STU] Sciences of the Universe [physics]/Earth Sciences ,lcsh:Q ,business ,Computer hardware ,lcsh:Physics - Abstract
The Millstone Hill Incoherent Scatter Data Acquisition System (MIDAS) is based on an abstract model of an incoherent scatter radar. This model is implemented in a hierarchical software system, which serves to isolate hardware and low-level software implementation details from higher levels of the system. Inherent in this is the idea that implementation details can easily be changed in response to technological advances. MIDAS is an evolutionary system, and the MIDAS hardware has, in fact, evolved while the basic software model has remained unchanged. From the earliest days of MIDAS, it was realized that some functions implemented in specialized hardware might eventually be implemented by software in a general-purpose computer. MIDAS-W is the realization of this concept. The core component of MIDAS-W is a Sun Microsystems UltraSparc 10 workstation equipped with an Ultrarad 1280 PCI bus analog to digital (A/D) converter board. In the current implementation, a 2.25 MHz intermediate frequency (IF) is bandpass sampled at 1 µs intervals and these samples are multicast over a high-speed Ethernet which serves as a raw data bus. A second workstation receives the samples, converts them to filtered, decimated, complex baseband samples and computes the lag-profile matrix of the decimated samples. Overall performance is approximately ten times better than the previous MIDAS system, which utilizes a custom digital filtering module and array processor based correlator. A major advantage of MIDAS-W is its flexibility. A portable, single-workstation data acquisition system can be implemented by moving the software receiver and correlator programs to the workstation with the A/D converter. When the data samples are multicast, additional data processing systems, for example for raw data recording, can be implemented simply by adding another workstation with suitable software to the high-speed network. 
Testing of new data processing software is also greatly simplified, because a workstation with the new software can be added to the network without impacting the production system. MIDAS-W has been operated in parallel with the existing MIDAS-1 system to verify that incoherent scatter measurements by the two systems agree. MIDAS-W has also been used in a high-bandwidth mode to collect data on the November, 1999, Leonid meteor shower. Key words: Electromagnetics (instruments and techniques; signal processing and adaptive antennas) – Ionosphere (instruments and techniques)
- Published
- 2000
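Illustrative note on record 35: the abstract states that a 2.25 MHz intermediate frequency is bandpass sampled at 1 µs intervals, which corresponds to a 1 MHz sample rate. The sketch below works out where a tone at the IF centre appears after such sampling. The function name is hypothetical, and the code only restates standard aliasing arithmetic; it says nothing about how the MIDAS-W software receiver is actually written.

```python
def aliased_frequency(f_signal_hz, f_sample_hz):
    """Apparent frequency, folded into [0, f_sample_hz / 2], of a tone at
    f_signal_hz after sampling at f_sample_hz."""
    folded = f_signal_hz % f_sample_hz
    return min(folded, f_sample_hz - folded)


if __name__ == "__main__":
    f_if = 2_250_000      # 2.25 MHz intermediate frequency (from the abstract)
    f_sample = 1_000_000  # one sample every 1 microsecond (from the abstract)
    print(aliased_frequency(f_if, f_sample))  # 250000
```

Under these figures the IF centre lands at 250 kHz, one quarter of the sample rate.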
36. A low-jitter 1.9-V CMOS PLL for UltraSPARC microprocessor applications
- Author
-
David J. Allstot and Hee-Tae Ahn
- Subjects
Engineering ,UltraSPARC ,business.industry ,Detector ,Hardware_PERFORMANCEANDRELIABILITY ,Phase-locked loop ,CMOS ,Low-power electronics ,PLL multibit ,Hardware_INTEGRATEDCIRCUITS ,Electronic engineering ,Sensitivity (control systems) ,Electrical and Electronic Engineering ,business ,Hardware_LOGICDESIGN ,Jitter - Abstract
A phase-locked loop (PLL) for CMOS UltraSPARC microprocessor applications uses a loop filter referenced to a quiet power supply and achieves measured clock period jitter of ±25 ps at 360 MHz. The fully integrated CMOS PLL uses a charge-pump phase/frequency detector, a single-capacitor loop filter, and a feedforward error correction architecture. Loop characteristics are analyzed and verified by measurements. The measured sensitivity of clock period jitter to supply voltage is 2.6 ps/100 mV over an analog supply-voltage range of 1.6-2.1 V; the measured output operating frequency range is 8.5-660 MHz. Fabricated in an area of 310 × 280 µm² in a 0.25-µm CMOS process, the PLL dissipates 25 mW from a 1.9-V supply.
- Published
- 2000
37. [Untitled]
- Author
-
Sylvain Lelait and Andreas Krall
- Subjects
Loop unrolling ,UltraSPARC ,Multimedia ,Computer science ,Programming language ,Visual Instruction Set ,Parallel computing ,computer.software_genre ,Theoretical Computer Science ,Instruction set ,Vectorization (mathematics) ,Code generation ,Hardware_CONTROLSTRUCTURESANDMICROPROGRAMMING ,computer ,Machine code ,Software ,Information Systems ,AltiVec - Abstract
The huge processing power needed by multimedia applications has led to multimedia extensions in the instruction set of microprocessors which exploit subword parallelism. Examples of these extended instruction sets are the Visual Instruction Set of the UltraSPARC processor, the AltiVec instruction set of the PowerPC processor, the MMX and ISS extensions of the Pentium processors, and the MAX-2 instruction set of the HP PA-RISC processor. Currently, these extensions can only be used by programs written in assembly language, through system libraries or by calling specialized macros in a high-level language. Therefore, these instructions are not used by most applications. We propose two code generation techniques to produce native code using these multimedia extensions for programs written in a high-level language: classical vectorization and vectorization by unrolling. Vectorization by unrolling is simpler than classical vectorization since data dependence analysis is reduced to acyclic control flow graph analysis. Furthermore, we address the problem of unaligned memory accesses. This can be handled by both static analysis and dynamic runtime checking. Preliminary experimental results for a code generator for the UltraSPARC VIS instruction set show that speedups of up to a factor of 4.8 are possible, and that vectorization by unrolling is much simpler but as effective as classical vectorization.
- Published
- 2000
38. Alternatives to Coscheduling a Network of Workstations
- Author
-
Ajit Banerjee, Chita R. Das, Shailabh Nagar, and Anand Sivasubramaniam
- Subjects
UltraSPARC ,Workstation ,Computer Networks and Communications ,Computer science ,business.industry ,Distributed computing ,Message Passing Interface ,CPU time ,Workload ,Coscheduling ,Gang scheduling ,Theoretical Computer Science ,Scheduling (computing) ,law.invention ,Artificial Intelligence ,Hardware and Architecture ,law ,Embedded system ,Myrinet ,business ,Software - Abstract
Efficient scheduling of processes on processors of a Network of Workstations (NOW) is essential for good system performance. However, the design of such schedulers is challenging because of the complex interaction between several system and workload parameters. Coscheduling, though desirable, is impractical for such a loosely coupled environment. Two operations, waiting for a message and arrival of a message, can be used to take remedial actions that can guide the behavior of the system toward coscheduling using local information. We present a taxonomy of three possibilities for each of these two operations, leading to a design space of 3×3 scheduling mechanisms. This paper presents an extensive implementation and evaluation exercise in studying these mechanisms. Adhering to the philosophy that scheduling and communication are intertwined and should be studied in conjunction, a complete communication substrate for UltraSPARC workstations, connected by Myrinet and running Solaris 2.5.1, has been developed. This platform provides the entire Message Passing Interface (MPI) to readily run off-the-shelf MPI applications by employing protected low-latency user-level messaging. Several applications can concurrently use this interface. This platform has been used to design, implement, and uniformly evaluate nine scheduling strategies with a mixture of concurrent real applications with varying communication intensities. This includes five new schemes (Periodic Boost, Periodic Boost with Spin Block, Spin Yield, Periodic Boost with Spin Yield, Dynamic Coscheduling with Spin Yield) that are presented in this paper. In addition to our evaluations of the pros and cons of each mechanism in terms of throughput, response time, CPU utilization, and fairness, it is shown that Periodic Boost is a promising approach for scheduling processes on a NOW.
- Published
- 1999
39. Vying for the lead in high-performance processors
- Author
-
G. Lauterbach
- Subjects
UltraSPARC ,General Computer Science ,Computer science ,Operating system ,computer.software_genre ,computer - Abstract
With other major vendors delaying products or contending with corporate change, Sun's Ultrasparc III could have a window of opportunity in the competition among high-performance, 64-bit processors. In an interview with Computer, Gary Lauterbach, Ultrasparc III's chief architect, describes this processor's key features (its extended pipeline, speculation mechanisms, two-cycle load latency, and memory system design) and compares Ultrasparc III to its major competitors.
- Published
- 1999
40. Dynamic instrumentation of threaded applications
- Author
-
Zhichen Xu, Barton P. Miller, and Oscar Naim
- Subjects
UltraSPARC ,Java ,business.industry ,Computer science ,Instrumentation ,Multiprocessing ,Thread (computing) ,Software_PROGRAMMINGTECHNIQUES ,computer.software_genre ,Data structure ,Computer Graphics and Computer-Aided Design ,Operating system ,Timer ,Instrumentation (computer programming) ,business ,computer ,Java applet ,Context switch ,Computer hardware ,Software ,computer.programming_language - Abstract
The use of threads is becoming commonplace in both sequential and parallel programs. This paper describes our design and initial experience with non-trace based performance instrumentation techniques for threaded programs. Our goal is to provide detailed performance data while maintaining control of instrumentation costs. We have extended Paradyn's dynamic instrumentation (which can instrument programs without recompiling or relinking) to handle threaded programs. Controlling instrumentation costs means efficient instrumentation code and avoiding locks in the instrumentation. Our design is based on low contention data structures. To associate performance data with individual threads, we have all threads share the same instrumentation code and assign each thread with its own private copy of performance counters or timers. The asynchrony in a threaded program poses a major challenge to dynamic instrumentation. To implement time-based metrics on a per-thread basis, we need to instrument thread context switches, which can cause instrumentation code to interleave. Interleaved instrumentation can not only corrupt performance data, but can also cause a scenario we call self-deadlock where an instrumentation code deadlocks a thread. We introduce thread-conscious locks to avoid self-deadlock, and per-thread virtual CPU timers to reduce the chance of interleaved instrumentation accessing the same performance counter or timer, and to reduce the number of expensive timer calls at thread context switches. Our initial implementation is on SPARC Solaris 2.5 and 2.6 including multiprocessor Sun UltraSPARC Enterprise machines. We tested our tool on large multithreaded applications, including the Java Virtual Machine (JVM). We show how our new techniques helped us to speed up a Java graphics native method by 42% and consequently increase by 24% the amount of work that can be done in unit time in a game applet.
- Published
- 1999
41. UltraSPARC-III: designing third-generation 64-bit performance
- Author
-
T. Horel and G. Lauterbach
- Subjects
UltraSPARC ,Workstation ,Computer science ,business.industry ,computer.software_genre ,law.invention ,Sun Microsystems ,Hardware and Architecture ,law ,Embedded system ,Server ,Scalability ,Operating system ,Electrical and Electronic Engineering ,business ,computer ,Software - Abstract
The UltraSPARC-III is the third generation of Sun Microsystems' most powerful microprocessors, which are at the heart of Sun's computer systems. These systems, ranging from desktop workstations to large, mission critical servers, require the highest performance that the UltraSPARC line has to offer. The newest design permits vendors the scalability to build systems consisting of 1,000+ UltraSPARC processors. Furthermore, the design ensures compatibility with all existing SPARC applications and the Solaris operating system. The UltraSPARC-III design extends Sun's SPARC Version 9 architecture, a 64-bit extension to the original 32-bit SPARC architecture that traces its roots to the Berkeley RISC-I processor. The UltraSPARC-III design target is a 600-MHz, 70-watt, 13-mm die to be built in 0.25-micron CMOS with six metal layers for signals, clocks, and power.
- Published
- 1999
42. [Untitled]
- Author
-
Jean-Charles Henrion
- Subjects
UltraSPARC ,Computer science ,business.industry ,Software implementation ,Software ,Embedded system ,Error correcting ,Electrical and Electronic Engineering ,business ,Error detection and correction ,Computer communication networks ,Working environment ,Computer network ,Coding (social sciences) - Abstract
Today, Forward Error Correcting (FEC) codes are mainly implemented in hardware, and many believe that their complexity prohibits their software implementation. This paper presents in detail how the performance of a software implementation can be significantly improved. Different levels of optimization which are independent of the working environment are presented and discussed. The coding throughput of 100 Mbps on an UltraSparc 1 shows that FEC codes can be easily added to multimedia applications without requiring dedicated hardware support. As a case study, we use FEC codes to protect AAL5-PDUs from cell losses in ATM networks.
- Published
- 1999
43. On the use of subword parallelism in medical image processing
- Author
-
Koen De Bosschere, Jan Van Campenhout, Mark Christiaens, and Bjorn De Sutter
- Subjects
Loop unrolling ,UltraSPARC ,Computer Networks and Communications ,Data parallelism ,Computer science ,Pipeline (computing) ,Loop fusion ,Speculative execution ,Task parallelism ,Multiprocessing ,Parallel computing ,ComputerSystemsOrganization_PROCESSORARCHITECTURES ,Scalable parallelism ,Computer Graphics and Computer-Aided Design ,Theoretical Computer Science ,Artificial Intelligence ,Hardware and Architecture ,Very long instruction word ,Superscalar ,Memory-level parallelism ,Implicit parallelism ,Instruction-level parallelism ,Software - Abstract
Parallel implementations of algorithms for medical image processing mostly focus on the use of multiprocessor parallelism. Modern processor architectures, however, provide several additional forms of parallelism at the processor level: subword parallelism, speculative execution, superscalar pipelining, very long instruction word, etc. In this article, we show that well-known parallelization techniques for multiprocessor systems can be used to exploit subword parallelism. Loop unrolling, loop fusion and if-hoisting prove to be valuable to achieve this goal. To illustrate this, we transformed the inner loops of a positron emission tomography image reconstruction algorithm. We achieved a speed-up of 45% on Sun's UltraSPARC processor.
- Published
- 1998
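As an illustration of the subword parallelism mentioned in the abstract above, the following minimal sketch adds four 8-bit values packed into one 32-bit word using ordinary integer arithmetic. It is not taken from the article and does not use UltraSPARC VIS instructions; the helper names `pack_u8`, `unpack_u8`, and `packed_add_u8` are hypothetical, and each lane wraps modulo 256.

```python
# Sketch only: emulates a 4-lane, 8-bit packed add inside one 32-bit word.
# Helper names are hypothetical and not drawn from the cited article.

HIGH_BITS = 0x80808080  # top bit of each 8-bit lane
LOW_BITS = 0x7F7F7F7F   # low seven bits of each 8-bit lane


def pack_u8(lanes):
    """Pack four values in 0..255 into one 32-bit word, lane 0 lowest."""
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & 0xFF) << (8 * i)
    return word


def unpack_u8(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def packed_add_u8(a, b):
    """Add the four lanes of a and b independently (each modulo 256)."""
    # Adding only the low seven bits of each lane cannot carry into a neighbour.
    low_sum = (a & LOW_BITS) + (b & LOW_BITS)
    # The top bit of each lane is the XOR of both top bits and the carry-in,
    # which is already sitting in bit 7 of low_sum.
    return (low_sum ^ ((a ^ b) & HIGH_BITS)) & 0xFFFFFFFF


if __name__ == "__main__":
    x = pack_u8([1, 2, 3, 250])
    y = pack_u8([10, 20, 30, 10])
    print(unpack_u8(packed_add_u8(x, y)))
```

One word-sized operation here does the work of four byte-sized additions, which is the kind of gain that unrolling an inner loop by the lane count is meant to expose.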
44. Scheduling with implicit information in distributed systems
- Author
-
Andrea C. Arpaci-Dusseau, Alan Mainwaring, and David E. Culler
- Subjects
UltraSPARC ,Computer science ,Distributed algorithm ,Computer Networks and Communications ,Hardware and Architecture ,Distributed computing ,Principal mechanism ,Coscheduling ,Parallel computing ,Software ,Scheduling (computing) - Abstract
Implicit coscheduling is a distributed algorithm for time-sharing communicating processes in a cluster of workstations. By observing and reacting to implicit information, local schedulers in the system make independent decisions that dynamically coordinate the scheduling of communicating processes. The principal mechanism involved is two-phase spin-blocking: a process waiting for a message response spins for some amount of time, and then relinquishes the processor if the response does not arrive. In this paper, we describe our experience implementing implicit coscheduling on a cluster of 16 UltraSPARC I workstations; this has led to contributions in three main areas. First, we more rigorously analyze the two-phase spin-block algorithm and show that spin time should be increased when a process is receiving messages. Second, we present performance measurements for a wide range of synthetic benchmarks and for seven Split-C parallel applications. Finally, we show how implicit coscheduling behaves under different job layouts and scaling, and discuss preliminary results for achieving fairness.
- Published
- 1998
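The two-phase spin-block mechanism described in the abstract above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the function name `spin_then_block`, its parameters, and the use of a `threading.Event` to stand in for an arriving message response are all assumptions.

```python
# Sketch only: two-phase waiting on a threading.Event.
# Phase 1 polls (spins); phase 2 gives up the processor by blocking.
import threading
import time


def spin_then_block(response, spin_seconds, block_timeout=None):
    """Wait for `response` to be set; report how the wait ended.

    Returns "spun" if the response arrived while spinning, "blocked" if it
    arrived after the waiter had blocked, or "timeout" if block_timeout
    elapsed without a response.
    """
    deadline = time.monotonic() + spin_seconds
    # Phase 1: spin, on the bet that the response arrives soon.
    while time.monotonic() < deadline:
        if response.is_set():
            return "spun"
    # Phase 2: relinquish the processor until the response arrives.
    if response.wait(block_timeout):
        return "blocked"
    return "timeout"


if __name__ == "__main__":
    reply = threading.Event()
    # Hypothetical sender that answers after 50 ms.
    threading.Timer(0.05, reply.set).start()
    print(spin_then_block(reply, spin_seconds=0.001, block_timeout=2.0))
```

The `spin_seconds` parameter corresponds to the spin time that the abstract says should be increased while a process is receiving messages; how to choose it is the subject of the paper and is not attempted here.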
45. Parallel adaptive mesh refinement techniques for plasticity problems
- Author
-
Mark T. Jones, William J. Barry, and Paul E. Plassmann
- Subjects
UltraSPARC ,Workstation ,Cost efficiency ,Adaptive mesh refinement ,Computer science ,General Engineering ,Process (computing) ,Parallel computing ,law.invention ,Computational science ,Parallel processing (DSP implementation) ,Mesh generation ,law ,Component-based software engineering ,Software - Abstract
The accurate modeling of the nonlinear properties of materials can be computationally expensive. Parallel computing offers an attractive way of solving such problems; however, the efficient use of these systems requires the vertical integration of a number of very different software components. We explore the solution of two- and three-dimensional, small-strain plasticity problems. We consider a finite-element formulation of the problem with adaptive refinement of an unstructured mesh to accurately model plastic transition zones. We present a framework for the parallel implementation of such complex algorithms. This framework, using libraries from the SUMAA3d project, allows a user to build a parallel finite-element application without writing any parallel code. To demonstrate the effectiveness of this approach on widely varying parallel architectures, we present experimental results from an IBM SP parallel computer and an ATM-connected network of Sun UltraSparc workstations. The results detail the parallel performance of the computational phases of the application during the process while the material is incrementally loaded.
- Published
- 1998
46. UltraSPARC-IIi: expanding the boundaries of a system on a chip
- Author
-
T.P. Johnson, C.D. Furman, A. Tzeng, K.B. Normoyle, M.A. Csoppenszky, and J. Mostoufi
- Subjects
UltraSPARC ,business.industry ,Computer science ,Processor design ,law.invention ,Microprocessor ,Hardware and Architecture ,law ,Embedded system ,Conventional PCI ,System on a chip ,Electrical and Electronic Engineering ,business ,Software - Abstract
This processor uses a significant amount of integration and other techniques to enable the construction of cost-efficient SPARC computer systems that retain excellent absolute performance.
- Published
- 1998
47. Designing UltraSparc for testability
- Author
-
M.E. Levitt
- Subjects
Focus (computing) ,Engineering ,UltraSPARC ,business.industry ,Design for testing ,media_common.quotation_subject ,Processor design ,Volume (computing) ,Multiplexer ,Debugging ,Hardware and Architecture ,Embedded system ,Electrical and Electronic Engineering ,business ,Software ,Testability ,media_common - Abstract
With a focus on a short time to volume production, the UltraSparc microprocessor design incorporated innovative features that optimize test, debug and manufacture. The following areas are discussed: goals; cost-benefit analysis; scan design; decoded multiplexer; test generation flow; custom circuit blocks; boundary cell design; embedded array testing; and clock control features.
- Published
- 1997
48. Challenges to combining general-purpose and multimedia processors
- Author
-
Andrew Wolfe, A. Peleg, S. Rathnam, P. Song, M. Schlansker, P. K. Dubey, Ruby B. Lee, Thomas M. Conte, and Matthew D. Jennings
- Subjects
UltraSPARC ,General Computer Science ,Multimedia ,Computer science ,ComputerSystemsOrganization_PROCESSORARCHITECTURES ,computer.software_genre ,Instruction set ,Set (abstract data type) ,Application-specific integrated circuit ,General purpose ,Computer architecture ,Operating system ,x86 ,Compiler ,computer ,MMX - Abstract
Multimedia processor media extensions to general purpose processors present new challenges to the compiler writer, language designer, and microarchitect. Multimedia workloads have always held an important role in embedded applications, such as video cards or set top boxes, but these workloads are becoming increasingly common in general purpose computing as well. Over the past three years the major vendors of general purpose processors (GPPs) have announced extensions to their instruction set architectures that supposedly enhance the performance of multimedia workloads. These include North Carolina MAX 2 extensions to Hewlett-Packard PA-RISC, MMX for Intel's x86, UltraSparc's VIS, and MDMX extensions to MIPS V. Merging these new multimedia instructions with existing GPPs poses several challenges. Also, some doubt remains as to whether multimedia extensions are a real development or just a competition-induced fad in the GPP industry. If it is indeed a development, how must current processor microarchitectures change in reaction? And if they change, can GPPs and MMPs apply application specific integrated circuit (ASIC) solutions to the same problems?
- Published
- 1997
49. Relax-and-retime
- Author
-
Adithya Parandhaman, Shankar Ganesh Ramasubramanian, Swagath Venkataramani, and Anand Raghunathan
- Subjects
Combinational logic ,Logic synthesis ,UltraSPARC ,Computer science ,Path (graph theory) ,Parallel computing ,Retiming ,Hardware_LOGICDESIGN ,Electronic circuit ,Efficient energy use - Abstract
Recovery based design (RBD) is a promising approach for the design of energy-efficient circuits under variations. RBD instruments circuits with mechanisms to identify and correct timing violations, thereby allowing reduced guard bands or design margins. In addition, RBD enables aggressive voltage overscaling to a point where timing errors occur even under nominal conditions. A major barrier to the widespread adoption of RBD is that traditional design practices and synthesis tools result in circuits with so-called "path walls", leading to an explosion in the number of timing errors beyond a certain critical operating voltage. To alleviate this effect, previous techniques focused on combinational circuit optimizations such as sizing, use of dual-Vt cells, re-structuring, etc. to favorably reshape the path delay distribution. However, these techniques are limited by the inherent sequential structure of the circuit, which defines the boundaries of the combinational logic. In this work, we explore a completely different approach to synthesize circuits for RBD. We propose the use of retiming, a well-known and powerful sequential optimization technique to redefine the boundaries of combinational logic, thereby creating new opportunities for RBD that cannot be explored by previous techniques. We make the key observation that, in retiming circuits with RBD (unlike classical retiming), it is acceptable for a few paths in the circuit to exceed the clock period. Using this insight, we propose a synthesis methodology, Relax-and-Retime, wherein the original circuit is relaxed by ignoring timing constraints on selected paths that are bottlenecks to retiming. When classical minimum period retiming is employed on this relaxed circuit, the path wall is shifted to a lower delay, thus allowing additional voltage overscaling. The Relax-and-Retime methodology judiciously selects bottleneck paths by trading off recovery overheads caused by timing errors due to these paths with the opportunities for retiming. We utilize the proposed methodology to synthesize a wide range of benchmarks including arithmetic circuits, ISCAS89 benchmarks and modules from the UltraSPARC T1 processor. Our results demonstrate 9-25% (average of 15.3%) improvement in overall energy compared to a well-optimized baseline with RBD.
- Published
- 2013
50. UltraSparc I: a four-issue processor supporting multimedia
- Author
-
M. Tremblay and J.M. O'Connor
- Subjects
UltraSPARC ,Reduced instruction set computing ,Multimedia ,business.industry ,Computer science ,Pipeline burst cache ,Image processing ,computer.software_genre ,Instruction set ,Computer architecture ,Hardware and Architecture ,Superscalar ,Electrical and Electronic Engineering ,Graphics ,business ,FR-V ,computer ,Software ,Computer hardware - Abstract
UltraSparc I is a second-generation superscalar processor. It is a high performance, highly integrated, four-issue superscalar processor based on the SPARC Version 9 64-bit RISC architecture. We have extended the core instruction set to include graphics instructions that provide the most common operations related to two dimensional image processing; two- and three-dimensional graphics and image compression algorithms; and parallel operations on pixel data with 8-, 16-, and 32-bit components. Additional, new memory access instructions support the very high bandwidth requirements typical of graphics and multimedia applications.
- Published
- 1996