Search Results
49 results for "IEEE 754"
2. Floating-Point Systems
- Author
-
LaMeres, Brock J.
- Published
- 2024
- Full Text
- View/download PDF
3. Efficient ASIC Implementation of Artificial Neural Network with Posit Representation of Floating-Point Numbers
- Author
-
Gupta, Abheek, Gupta, Anu, and Gupta, Rajiv (in a volume edited by Bansal, Hari Om, Ajmera, Pawan K., Joshi, Sandeep, Bansal, Ramesh C., and Shekhar, Chandra)
- Published
- 2023
- Full Text
- View/download PDF
4. FPGA Based Efficient IEEE 754 Floating Point Multiplier for Filter Operations
- Author
-
Selvi, C. Thirumarai, Amudha, J., and Sankarasubramanian, R. S. (in a volume edited by Arunachalam, V. and Sivasankaran, K.)
- Published
- 2021
- Full Text
- View/download PDF
5. Validation of a Formal Floating-Point Model for the Interactive Proof Assistant Isabelle/HOL
- Author
-
Lindström, Olof
- Abstract
This thesis aims to validate the formal floating-point model implemented in the Higher-Order Logic (HOL) proof assistant Isabelle, according to the IEEE 754 Standard. By integrating a testing environment with the proof assistant, the generation and processing of a large quantity of test vectors is made possible, and the resulting empirical data can be collected and analyzed. As a result of previous research, a substantial amount of work has already been put into the construction of a testing framework tailored specifically for Isabelle’s formal floating-point model. Therefore, the contribution of this thesis is mainly to utilize the framework for conducting the testing; however, certain additions and modifications to its components are also made. This includes adding support for testing comparison operations, as well as making the two floating-point formats half-precision (16-bit) and quadruple-precision (128-bit) available for testing. Furthermore, the framework is extended to allow for infinite deterministic testing of all combinations of formats, operations, and rounding modes that are implemented. A total of 116 combinations are tested simultaneously, and the results can be monitored in real time through a command line tool. The evaluation finds that all the properties of the formal model subject to testing can be considered validated. This conclusion is based on the empirical evidence pertaining to approximately 850 million processed test vectors, among which not a single one failed.
- Published
- 2024
6. Analysis of Posit and Bfloat Arithmetic of Real Numbers for Machine Learning
- Author
-
Aleksandr Yu. Romanov, Alexander L. Stempkovsky, Ilia V. Lariushkin, Georgy E. Novoselov, Roman A. Solovyev, Vladimir A. Starykh, Irina I. Romanova, Dmitry V. Telpukhov, and Ilya A. Mkrtchan
- Subjects
Machine learning, floating point, posit, IEEE 754, benchmark, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
Modern computational tasks must not only guarantee a predefined accuracy but also produce results quickly. Optimizing calculations that use floating point numbers, as opposed to integers, is a non-trivial task, so there is a need to explore new ways to improve such operations. This paper presents an analysis and comparison of various floating point formats: float, posit, and bfloat. Neural networks are among the areas where the choice of number format is most acute, which is why we pay special attention to algorithms of linear algebra and artificial intelligence when assessing the efficiency of the new data types in this area. The results show that software implementations of posit16 and posit32 have high accuracy but are not particularly fast; on the other hand, bfloat16 differs little from float32 in accuracy but significantly surpasses it in performance for large amounts of data and complex machine learning algorithms. Thus, posit16 can be used in systems with less stringent performance requirements, in conditions of limited computer memory, and in cases where bfloat16 cannot provide the required accuracy. As for bfloat16, it can speed up systems based on the IEEE 754 standard, but it cannot solve all the problems of conventional floating point arithmetic. Although posits and bfloats are not a full-fledged replacement for float, they provide, under certain conditions, advantages that can be useful for implementing machine learning algorithms.
- Published
- 2021
- Full Text
- View/download PDF
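The float32/bfloat16 trade-off discussed in this record is easy to see in code. The Python sketch below (an illustration, not the authors' benchmark code) converts a float32 bit pattern to bfloat16 by truncation; real converters typically round to nearest even instead:

```python
import struct

def float32_to_bfloat16_trunc(x: float) -> float:
    """Truncate a float32 to bfloat16 by zeroing the low 16 bits.

    bfloat16 keeps float32's 8 exponent bits but only 7 of its
    23 significand bits, so truncation is a single mask.
    (Hardware usually rounds to nearest even rather than truncating.)
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

print(float32_to_bfloat16_trunc(3.14159))   # 3.140625: only ~3 decimal digits survive
```

The identical exponent range is what makes bfloat16 a drop-in for float32 in training loops: overflow behavior is unchanged, only precision is reduced.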
7. Stochastic rounding: implementation, error analysis and applications
- Author
-
Matteo Croci, Massimiliano Fasi, Nicholas J. Higham, Theo Mary, and Mantas Mikaitis
- Subjects
floating-point arithmetic, rounding error analysis, IEEE 754, binary16, bfloat16, machine learning, Science
- Abstract
Stochastic rounding (SR) randomly maps a real number x to one of the two nearest values in a finite precision number system. The probability of choosing either of these two numbers is 1 minus their relative distance to x. This rounding mode was first proposed for use in computer arithmetic in the 1950s and it is currently experiencing a resurgence of interest. If used to compute the inner product of two vectors of length n in floating-point arithmetic, it yields an error bound with constant [Formula: see text] with high probability, where u is the unit round-off. This is not necessarily the case for round to nearest (RN), for which the worst-case error bound has constant nu. A particular attraction of SR is that, unlike RN, it is immune to the phenomenon of stagnation, whereby a sequence of tiny updates to a relatively large quantity is lost. We survey SR by discussing its mathematical properties and probabilistic error analysis, its implementation, and its use in applications, with a focus on machine learning and the numerical solution of differential equations.
- Published
- 2022
- Full Text
- View/download PDF
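The rounding rule described in this abstract (round to a neighbour with probability one minus its distance) can be sketched on the integer grid in a few lines of Python; this illustrates the idea only, not the survey's floating-point implementation:

```python
import random

def stochastic_round(x: float) -> int:
    """Round x to floor(x) or ceil(x); the probability of each
    neighbour is one minus its distance to x, so E[result] = x."""
    lo = int(x // 1)           # floor of x
    frac = x - lo              # distance to the lower neighbour
    return lo + (1 if random.random() < frac else 0)

random.seed(0)
samples = [stochastic_round(2.3) for _ in range(100_000)]
print(sum(samples) / len(samples))   # close to 2.3: unbiased on average
```

The unbiasedness shown here is exactly why stochastic rounding avoids the stagnation phenomenon the abstract mentions: tiny updates survive in expectation instead of always rounding away.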
8. Algorithm 1014: An Improved Algorithm for hypot(x,y).
- Author
-
Borges, Carlos F.
- Subjects
- ALGORITHMS, LIBRARIES
- Abstract
We develop fast and accurate algorithms for evaluating √(x² + y²) for two floating-point numbers x and y. Library functions that perform this computation are generally named hypot(x,y). We compare five approaches that we will develop in this article to the current resident library function that is delivered with Julia 1.1 and to the code that has been distributed with the C math library for decades. We will investigate the accuracy of our algorithms by simulation. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
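The pitfall that makes hypot(x,y) nontrivial is that the textbook formula overflows long before the true result does. The classic scaling workaround below is a hedged Python sketch of that baseline problem and fix; it is not Borges's Algorithm 1014:

```python
import math

def naive_hypot(x: float, y: float) -> float:
    return math.sqrt(x * x + y * y)   # x*x overflows for large |x|

def scaled_hypot(x: float, y: float) -> float:
    """Classic scaling guard: factor out the larger magnitude so the
    squares stay in range. (Not Algorithm 1014, which goes further
    to improve the accuracy of the final rounding.)"""
    a, b = abs(x), abs(y)
    if a < b:
        a, b = b, a
    if a == 0.0:
        return 0.0
    return a * math.sqrt(1.0 + (b / a) ** 2)

big = 1e200
print(naive_hypot(big, big))    # inf: x*x overflowed
print(scaled_hypot(big, big))   # about 1.414e200, the correct magnitude
```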
9. Optimal Architecture of Floating-Point Arithmetic for Neural Network Training Processors
- Author
-
Muhammad Junaid, Saad Arslan, TaeGeon Lee, and HyungWon Kim
- Subjects
floating-points, IEEE 754, convolutional neural network (CNN), MNIST dataset, Chemical technology, TP1-1185
- Abstract
The convergence of artificial intelligence (AI) is one of the critical technologies of the recent fourth industrial revolution. The AIoT (Artificial Intelligence Internet of Things) is expected to be a solution that aids rapid and secure data processing. While the success of AIoT demands low-power neural network processors, most recent research has focused on accelerator designs only for inference. The growing interest in self-supervised and semi-supervised learning now calls for processors that offload the training process in addition to inference. Incorporating training with high accuracy goals requires the use of floating-point operators. Higher-precision floating-point arithmetic architectures in neural networks tend to consume a large area and much energy, so an energy-efficient, compact accelerator is required. The proposed architecture incorporates training in 32-bit, 24-bit, 16-bit, and mixed precisions to find the optimal floating-point format for low-power, small edge devices. The proposed accelerator engines have been verified on an FPGA for both inference and training on the MNIST image dataset. The combination of a 24-bit custom FP format with 16-bit Brain FP achieved an accuracy of more than 93%. ASIC implementation of this optimized mixed-precision accelerator using TSMC 65 nm reveals an active area of 1.036 × 1.036 mm² and energy consumption of 4.445 µJ per training of one image. Compared with the 32-bit architecture, the size and the energy are reduced by 4.7 and 3.91 times, respectively. Therefore, a CNN structure using floating-point numbers with an optimized data path will contribute significantly to developing the AIoT field, which requires a small area, low energy, and high accuracy.
- Published
- 2022
- Full Text
- View/download PDF
10. Towards a correctly-rounded and fast power function in binary64 arithmetic
- Author
-
Hubrecht, Tom, Jeannerod, Claude-Pierre, and Zimmermann, Paul (DI-ENS, École normale supérieure - Paris; ARIC, LIP, ENS de Lyon; CARAMBA, Inria Nancy - Grand Est / LORIA)
- Subjects
efficiency, IEEE 754, double precision, power function, Computer Science [cs], correct rounding, binary64 format
- Abstract
This is the extended version of an article published in the proceedings of ARITH 2023. We design algorithms for the correct rounding of the power function x^y in the binary64 IEEE 754 format, for all rounding modes, modulo the knowledge of hardest-to-round cases. Our implementation of these algorithms largely outperforms previous correctly-rounded implementations and is not far from the efficiency of current mathematical libraries, which are not correctly rounded. Still, we expect our algorithms can be further improved for speed. The proofs of correctness are fully detailed, with the goal of enabling a formal proof of these algorithms. We hope this work will motivate the next IEEE 754 revision committee to require correct rounding for mathematical functions.
- Published
- 2023
11. Improving Performance of Floating Point Division on GPU and MIC
- Author
-
Huang, Kun and Chen, Yifeng (in a volume edited by Wang, Guojun, Zomaya, Albert, Martinez, Gregorio, and Li, Kenli)
- Published
- 2015
- Full Text
- View/download PDF
12. RadixInsert, a much faster stable algorithm for sorting floating-point numbers.
- Author
-
Maus, Arne
- Subjects
COMPUTER science, ALGORITHMS, SMART cards, COMPUTER operating systems, INTEGERS
- Abstract
The problem addressed in this paper is that we want to sort an array a[] of n floating point numbers conforming to the IEEE 754 standard, both in the 64-bit double precision and the 32-bit single precision formats, on a multi-core computer with p real cores and shared memory (an ordinary PC). We do this by introducing a new stable sorting algorithm, RadixInsert, both in a sequential version and with two parallel implementations. RadixInsert is tested on two different machines, a 2-core laptop and a 4-core desktop, outperforming the non-stable Quicksort-based algorithms from the Java library -- both the sequential Arrays.sort() and the merge-based parallel Arrays.parallelSort() -- for 500
1.5). RadixInsert is in practice O(n), but as with Quicksort it might be possible to construct numbers for which RadixInsert degenerates to an O(n²) algorithm. However, this worst case was not found when sorting the seven quite different distributions reported in this paper. Finally, the extra memory used by RadixInsert, both in its sequential and parallel versions, is n plus some minor arrays, whereas the sequential Quicksort in the Java library needs essentially no extra memory. The merge-based Arrays.parallelSort() in the Java library, however, needs the same n extra memory as RadixInsert. [ABSTRACT FROM AUTHOR]
- Published
- 2019
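Radix-sorting IEEE 754 numbers relies on a well-known order-preserving mapping from float bit patterns to unsigned integers. The Python sketch below shows that mapping; it is an assumption of this summary that RadixInsert uses some variant of it, and the paper's own implementation is in Java:

```python
import struct

def float_sort_key(x: float) -> int:
    """Map a float64 to an unsigned integer whose natural order matches
    the float order. Negative floats (sign bit set): flip all 64 bits;
    non-negative floats: flip only the sign bit. After this transform a
    plain radix sort on the integer keys sorts the floats correctly."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    if bits & (1 << 63):
        return bits ^ 0xFFFFFFFFFFFFFFFF
    return bits | (1 << 63)

data = [3.5, -2.0, 0.0, -0.5, 7.25]
print(sorted(data, key=float_sort_key))   # [-2.0, -0.5, 0.0, 3.5, 7.25]
```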
13. IEEE 754 floating-point addition for neuromorphic architecture.
- Author
-
George, Arun M., Sharma, Rahul, and Rao, Shrisha
- Subjects
- BUILDING additions, FLOATING-point arithmetic
- Abstract
• IEEE-754 compliant floating-point addition system for neuromorphic architectures.
• Stage-wise computation for floating-point addition of two numbers.
• Encoding scheme proposed to reduce inter-ensemble error.
• Experiments performed to determine the most suitable value of radius.
• Estimated total number of neurons required to implement such a system.
Neuromorphic computing is regarded as one of the promising alternatives to the traditional von Neumann architecture. In this paper, we consider the problem of doing arithmetic on neuromorphic systems and propose an architecture for IEEE 754 compliant addition on a neuromorphic system. A novel encoding scheme is also proposed for reducing the inter-neural-ensemble error. The complex task of floating point addition is divided into sub-tasks such as exponent alignment, mantissa addition, and overflow-underflow handling. We use a cascaded approach to add the two mantissas of the given floating-point numbers and then apply our encoding scheme to reduce the error produced in this approach. Overflow and underflow are handled by approximating on XOR logic. Implementations of sub-components such as the right shifter and multiplexer are also specified. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
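The sub-tasks named in this abstract (exponent alignment, mantissa addition, renormalization) follow the textbook floating-point addition recipe, which can be sketched in Python with frexp/ldexp. This shows the conventional algorithm being decomposed, not the paper's neuromorphic encoding:

```python
import math

def fp_add_steps(x: float, y: float) -> float:
    """Toy decomposition of floating-point addition into the sub-tasks
    the paper names: decode, align exponents, add significands,
    renormalise. (A sketch of the textbook algorithm only.)"""
    mx, ex = math.frexp(x)            # x = mx * 2**ex, 0.5 <= |mx| < 1
    my, ey = math.frexp(y)
    if ex < ey:                       # make x the larger-exponent operand
        mx, ex, my, ey = my, ey, mx, ex
    my = math.ldexp(my, ey - ex)      # exponent alignment: shift smaller operand
    m = mx + my                       # significand addition
    return math.ldexp(m, ex)          # renormalisation, handled here by ldexp

print(fp_add_steps(1.5, 0.25))   # 1.75
```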
14. Le calcul sur ordinateur
- Author
-
Goualard, Frédéric and Jermann, Christophe (Laboratoire des Sciences du Numérique de Nantes (LS2N), Nantes Université)
- Subjects
floating point, IEEE 754, Computer Science [cs]/Computer Arithmetic, arithmetic, integers
- Abstract
Booklet accompanying Christophe Jermann's talk on computer-based calculation, presented at the 2023 academic day of the IREM des Pays de la Loire.
- Published
- 2023
15. Investigation of posits and IEEE-754 floating points : In hardware implementations of addition and multiplication operations
- Author
-
Kylväjä, Juho (Faculty of Information Technology and Communication Sciences, Tampere University)
- Subjects
UNUM, IEEE 754, Master's Programme in Electrical Engineering, posit, floating-point, arithmetic
- Abstract
This thesis investigates a relatively new alternative representation for floating-point arithmetic, the type-3 UNUM (posit), as a replacement for the widely used IEEE 754 floating-point standard. The main focus is on the arithmetic operations of addition and multiplication. First, a literature review of the posit and IEEE 754 floating-point formats, their special cases, overflow and underflow behavior, and rounding methods is conducted. Then the hardware implementation steps of posit and IEEE 754 addition and multiplication are shown. In addition, the tools used to analyze the chosen designs and the testbench flow designed for behavioral verification are described. Finally, the results are examined, followed by the conclusion. The thesis concludes that posits could replace the currently dominant IEEE 754 standard due to their better accuracy around one and better dynamic range with 8-, 16-, and 32-bit numbers. However, the synthesis results show that the FPU achieves better area, delay, and power scores than the posit designs chosen in this thesis. Furthermore, implementing compatible processors for posits would require substantial work and time. Overall, posits have great potential to replace the IEEE 754 standard, and it will be interesting to see how future studies on posits affect the future of floating-point arithmetic in hardware.
- Published
- 2023
16. Reliable Computing with GNU MPFR
- Author
-
Zimmermann, Paul (in a volume edited by Fukuda, Komei, Hoeven, Joris van der, Joswig, Michael, and Takayama, Nobuki)
- Published
- 2010
- Full Text
- View/download PDF
17. Software Implementation of the IEEE 754R Decimal Floating-Point Arithmetic
- Author
-
Cornea, Marius, Anderson, Cristina, and Tsen, Charles (in a volume edited by Filipe, Joaquim, Shishkov, Boris, and Helfert, Markus)
- Published
- 2008
- Full Text
- View/download PDF
18. The CORE-MATH Project
- Author
-
Alexei Sibidanov, Paul Zimmermann, and Stéphane Glondu (University of Victoria, Canada; CARAMBA, Inria Nancy - Grand Est / LORIA)
- Subjects
IEEE 754, efficiency, Computer Science [cs]/Data Structures and Algorithms [cs.DS], correct rounding
- Abstract
The CORE-MATH project aims at providing open-source mathematical functions with correct rounding that can be integrated into current mathematical libraries. This article demonstrates the CORE-MATH methodology on two functions: the binary32 power function (powf) and the binary64 cube root function (cbrt). CORE-MATH already provides a full set of correctly rounded C99 functions for single precision (binary32). These functions provide similar or, in some cases, up to threefold speedups with respect to the GNU libc mathematical library, which is not correctly rounded. This work offers a prospect of a mandatory requirement of correct rounding for mathematical functions in the next revision of the IEEE 754 standard.
- Published
- 2022
- Full Text
- View/download PDF
19. Approximate Computing for Low Power and Security in the Internet of Things.
- Author
-
Gao, Mingze, Wang, Qian, Arafin, Md Tanvir, Lyu, Yongqiang, and Qu, Gang
- Subjects
- INTERNET of things, COMPUTER networks, COMPUTER systems, MATHEMATICAL ability, DIGITAL watermarking
- Abstract
To save resources for Internet of Things (IoT) devices, a proposed approach segments operands and corresponding basic arithmetic operations that can be carried out by approximate function units for almost all applications. The approach also increases the security of IoT devices by hiding information for IP watermarking, digital fingerprinting, and lightweight encryption. [ABSTRACT FROM PUBLISHER]
- Published
- 2017
- Full Text
- View/download PDF
21. Generating Random Floating-Point Numbers by Dividing Integers: A Case Study
- Author
-
Frédéric Goualard (Laboratoire des Sciences du Numérique de Nantes (LS2N), CNRS / IMT Atlantique / Université de Nantes)
- Subjects
Discrete mathematics, Floating point, Uniform distribution (continuous), Computer science, Computer Arithmetic, Floating-point number, Binary number, Division (mathematics), IEEE floating point, Article, Integer, IEEE 754, Error analysis, Point (geometry), Random number
- Abstract
A method widely used to obtain IEEE 754 binary floating-point numbers with a standard uniform distribution involves drawing an integer uniformly at random and dividing it by another, larger integer. We survey the various instances of this algorithm used in actual software and point out their properties and drawbacks, particularly from the standpoint of numerical software testing and data anonymization.
- Published
- 2020
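The divide-an-integer scheme this article surveys can be sketched as follows. Drawing 53 random bits and dividing by 2^53 is one common variant (an illustration of the general scheme, not one of the specific generators audited in the paper):

```python
import random

def uniform01() -> float:
    """One instance of the divide-an-integer scheme: draw an integer in
    [0, 2**53) and divide by 2**53. Every result is a multiple of 2**-53,
    so most representable floats in [0, 1) can never be produced -- one
    of the drawbacks this line of work points out."""
    return random.getrandbits(53) / (1 << 53)

random.seed(42)
x = uniform01()
print(x)   # a uniform draw on the 2**53-point grid in [0, 1)
```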
22. Area Efficient Floating Point Addition Unit With Error Detection Logic.
- Author
-
Aswani, T.S. and Premanand, B.
- Abstract
Applications that involve a large dynamic range make use of floating point operations. Addition is one of the most complex operations in a floating point unit. This paper proposes an area-efficient floating-point addition unit with error detection logic. Existing leading zero anticipators (LZA) and error detection logic help to reduce the delay of a general floating point unit, but are not area efficient. Here a single-precision, area-efficient floating point addition unit is designed using an efficient carry select adder together with error detection logic. The efficient carry select adder is developed using a Binary to Excess-1 Converter instead of a Ripple Carry Adder for cin=‘1’. The proposed design is simulated using ModelSim and tested on Xilinx. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
23. On the definition of unit roundoff.
- Author
-
Rump, Siegfried and Lange, Marko
- Subjects
- FLOATING-point arithmetic, ALGORITHMS, ARITHMETIC mean
- Abstract
The result of a floating-point operation is usually defined to be the floating-point number nearest to the exact real result, together with a tie-breaking rule. This is called the first standard model of floating-point arithmetic, and the analysis of numerical algorithms is often based solely on it. In addition, a second standard model is used, specifying the maximum relative error with respect to the computed result. In this note we take a more general perspective. For an arbitrary finite set of real numbers we identify the rounding that minimizes the relative error in the first or the second standard model. The optimal 'switching points' are the arithmetic or the harmonic means of adjacent floating-point numbers. Moreover, the maximum relative error of both models is minimized by taking the geometric mean. If the maximum relative error in one model is α, then α/(1−α) is the maximum relative error in the other model. These maximal errors, that is, the unit roundoff, are characteristic constants of a given finite set of reals: the floating-point model to be optimized identifies the rounding and the unit roundoff. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
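The three candidate 'switching points' named in this abstract (arithmetic, harmonic, and geometric means of adjacent floats) can be checked numerically. Exact rational arithmetic is used below so the demonstration itself is free of rounding; this is a sketch of the ordering of the means, not the paper's analysis:

```python
import math
from fractions import Fraction

a = 1.0
b = math.nextafter(a, 2.0)         # the adjacent float64 just above 1.0
A, B = Fraction(a), Fraction(b)    # exact rational values of the two floats
arith = (A + B) / 2                # optimal switching point, first model
harm = 2 * A * B / (A + B)         # optimal switching point, second model
# For distinct positive reals, harmonic < geometric < arithmetic mean,
# so the geometric mean (which minimizes the max error of both models)
# sits strictly between the two models' switching points:
print(harm ** 2 < A * B < arith ** 2)   # True
```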
24. Algorithms for Stochastically Rounded Elementary Arithmetic Operations in IEEE 754 Floating-Point Arithmetic
- Author
-
Fasi, Massimiliano and Mikaitis, Mantas
- Abstract
We present algorithms for performing the five elementary arithmetic operations (+, −, ×, ÷, and √) in floating-point arithmetic with stochastic rounding, and demonstrate the value of these algorithms by discussing various applications where stochastic rounding is beneficial. The algorithms require that the hardware be compliant with the IEEE 754 floating-point standard and that a floating-point pseudorandom number generator be available. The goal of these techniques is to emulate stochastic rounding when the underlying hardware does not support this rounding mode, as is the case for most existing CPUs and GPUs. By simulating stochastic rounding in software, one has the possibility to explore the behavior of this rounding mode and develop new algorithms even without having access to hardware implementing stochastic rounding -- once such hardware becomes available, it suffices to replace the proposed algorithms by calls to the corresponding hardware routines. When stochastically rounding double precision operations, the algorithms we propose are between 7.3 and 19 times faster than implementations that use the GNU MPFR library to simulate extended precision. We test our algorithms on various tasks, including summation algorithms and solvers for ordinary differential equations, where stochastic rounding is expected to bring advantages. Funding agencies: Royal Society of London; Istituto Nazionale di Alta Matematica, INdAM-GNCS Project 2020; UK Research & Innovation (UKRI), Engineering & Physical Sciences Research Council (EPSRC), EP/P020720/1.
- Published
- 2021
- Full Text
- View/download PDF
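One ingredient of the software emulation this abstract describes is recovering the exact rounding error of an operation with an error-free transformation. Below is the classic two-sum (due to Knuth), sketched in Python; it is a building block of this kind of algorithm, not the paper's full stochastic-rounding routine:

```python
def two_sum(a: float, b: float) -> tuple[float, float]:
    """Knuth's error-free transformation: returns s = fl(a + b) and the
    exact error e such that a + b = s + e in real arithmetic (barring
    overflow). Stochastic-rounding emulations use e to decide, at random,
    whether to nudge s to the neighbouring float."""
    s = a + b
    t = s - a
    e = (a - (s - t)) + (b - t)
    return s, e

s, e = two_sum(1.0, 2.0 ** -60)
print(s)   # 1.0: the rounded sum
print(e)   # 2**-60: the part that rounding discarded
```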
25. Algorithms for Stochastically Rounded Elementary Arithmetic Operations in IEEE 754 Floating-Point Arithmetic
- Author
-
Massimiliano Fasi and Mantas Mikaitis
- Subjects
Floating point ,numerical analysis ,Computer science ,Elementary arithmetic ,Double-precision floating-point format ,010103 numerical & computational mathematics ,02 engineering and technology ,01 natural sciences ,0202 electrical engineering, electronic engineering, information engineering ,Computer Science (miscellaneous) ,stochastic rounding ,0101 mathematics ,Arithmetic ,Pseudorandom number generator ,Rounding ,Numerical analysis ,Volume (computing) ,Floating-point arithmetic ,Extended precision ,error-free transformation ,IEEE floating point ,020202 computer hardware & architecture ,Computer Science Applications ,Human-Computer Interaction ,IEEE 754 ,Computer Science::Mathematical Software ,numerical algorithm ,Algorithm ,Information Systems - Abstract
We present algorithms for performing the five elementary arithmetic operations (+, −, ×, ÷, and √) in floating point arithmetic with stochastic rounding, and demonstrate the value of these algorithms by discussing various applications where stochastic rounding is beneficial. The algorithms require that the hardware be compliant with the IEEE 754 floating-point standard and that a floating-point pseudorandom number generator be available. The goal of these techniques is to emulate stochastic rounding when the underlying hardware does not support this rounding mode, as is the case for most existing CPUs and GPUs. By simulating stochastic rounding in software, one has the possibility to explore the behavior of this rounding mode and develop new algorithms even without having access to hardware implementing stochastic rounding; once such hardware becomes available, it suffices to replace the proposed algorithms by calls to the corresponding hardware routines. When stochastically rounding double precision operations, the algorithms we propose are between 7.3 and 19 times faster than the implementations that use the GNU MPFR library to simulate extended precision. We test our algorithms on various tasks, including summation algorithms and solvers for ordinary differential equations, where stochastic rounding is expected to bring advantages.
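As a rough illustration of the idea (not the paper's algorithms, which operate entirely within the target precision and use error-free transformations), stochastic rounding from binary64 down to binary32 can be sketched in software; the function name and structure below are mine:

```python
import numpy as np

def stochastic_round_to_f32(x: float, rng: np.random.Generator) -> np.float32:
    """Round a binary64 value to binary32 stochastically: round up with
    probability proportional to the distance from the lower neighbour."""
    down = np.float32(x)                     # round-to-nearest candidate
    if float(down) == x:
        return down                          # x is exactly representable
    # The other binary32 neighbour lies on the far side of x from `down`.
    if float(down) < x:
        up = np.nextafter(down, np.float32(np.inf))
    else:
        up = np.nextafter(down, np.float32(-np.inf))
    lo, hi = (down, up) if float(down) < float(up) else (up, down)
    p_up = (x - float(lo)) / (float(hi) - float(lo))
    return hi if rng.random() < p_up else lo
```

Averaged over many roundings, the result is unbiased: the expected value equals the exact input, which is the property that makes stochastic rounding attractive for summation and ODE solvers.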
- Published
- 2021
- Full Text
- View/download PDF
26. Simultaneous Floating-Point Sine and Cosine for VLIW Integer Processors.
- Author
-
Jeannerod, Claude-Pierre and Jourdan-Lu, Jingyan
- Abstract
Graphics and signal processing applications often require that sines and cosines be evaluated at the same floating-point argument, and in such cases a very fast computation of the pair of values is desirable. This paper studies how 32-bit VLIW integer architectures can be exploited in order to perform this task accurately for IEEE single precision (including subnormals). We describe software implementations for sinf, cosf, and sincosf over [-pi/4, pi/4] that have a proven 1-ulp accuracy and whose latency on STMicroelectronics' ST231 VLIW integer processor is 19, 18, and 19 cycles, respectively. Such performances are obtained by introducing a novel algorithm for simultaneous sine and cosine that combines univariate and bivariate polynomial evaluation schemes. [ABSTRACT FROM PUBLISHER]
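The basic shape of a simultaneous sine/cosine evaluation on [-pi/4, pi/4] can be sketched with polynomials in x²; note the paper uses carefully optimized (and partly bivariate) polynomial schemes with proven 1-ulp bounds, whereas the plain Taylor coefficients below are only a stand-in reaching roughly 1e-8 absolute error on this interval:

```python
# Truncated Taylor coefficients; a real implementation would use minimax
# polynomials tuned for the target precision.
SIN_C = [1.0, -1.0/6, 1.0/120, -1.0/5040, 1.0/362880]   # odd powers of x
COS_C = [1.0, -0.5, 1.0/24, -1.0/720, 1.0/40320]        # even powers of x

def sincos(x):
    """Evaluate sin(x) and cos(x) together for x in [-pi/4, pi/4].
    Both polynomials are evaluated in x*x (Horner), so the squaring and
    much of the work is shared between the two results."""
    x2 = x * x
    s = SIN_C[-1]
    for c in reversed(SIN_C[:-1]):
        s = s * x2 + c
    c = COS_C[-1]
    for k in reversed(COS_C[:-1]):
        c = c * x2 + k
    return s * x, c
```

Sharing the x² computation and evaluating the two Horner chains in parallel is exactly the kind of instruction-level parallelism a VLIW integer processor can exploit.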
- Published
- 2012
- Full Text
- View/download PDF
27. Interval arithmetic over finitely many endpoints.
- Author
-
Rump, Siegfried
- Subjects
- *
INTERVAL analysis , *ARITHMETIC , *PROOF theory , *MATHEMATICAL analysis , *TRANSCENDENTAL numbers , *FLOATING-point arithmetic - Abstract
To my knowledge all definitions of interval arithmetic start with real endpoints and prove properties. Then, for practical use, the definition is specialized to finitely many endpoints, where many of the mathematical properties are no longer valid. There seems to be no treatment of how to choose this finite set of endpoints so as to preserve as many mathematical properties as possible. Here we define interval endpoints directly using a finite set which, for example, may be based on the IEEE 754 floating-point standard. The corresponding interval operations emerge naturally from the corresponding power set operations. We present necessary and sufficient conditions on this finite set to ensure desirable mathematical properties, many of which are not satisfied by other definitions. For example, an interval product contains zero if and only if one of the factors does. The key feature of the theoretical foundation is that 'endpoints' of intervals are not points but non-overlapping closed, half-open or open intervals, each of which can be regarded as an atomic object. By using non-closed intervals among its 'endpoints', intervals containing 'arbitrarily large' and 'arbitrarily close to but not equal to' a real number can be handled. The latter may be zero defining 'tiny' numbers, but also any other quantity including transcendental numbers. Our scheme can be implemented straightforwardly using the IEEE 754 floating-point standard. [ABSTRACT FROM AUTHOR]
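The zero-containment property mentioned above can be checked on a toy closed-interval product; this sketch ignores the paper's atomic endpoints and the outward (directed) rounding a real implementation needs, so it only illustrates the property, not the construction:

```python
def interval_mul(a, b):
    """Naive product of two closed intervals (lo, hi) over floats.
    A faithful implementation would round the lower bound toward -inf
    and the upper bound toward +inf."""
    products = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(products), max(products))

def contains_zero(iv):
    return iv[0] <= 0.0 <= iv[1]
```

For these exact (rounding-free) cases, the product interval contains zero precisely when one of the factors does.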
- Published
- 2012
- Full Text
- View/download PDF
28. An Improved Algorithm for hypot(A,B)
- Author
-
Borges, Carlos and Applied Mathematics
- Subjects
fused multiply-add ,IEEE 754 ,floating point ,hypot() - Abstract
We develop a fast and accurate algorithm for evaluating √(a² + b²) for two floating point numbers a and b. Library functions that perform this computation are generally named hypot(a,b). We will compare four approaches that we develop in this paper to the current resident library function that is delivered with Julia 1.1 and to the code that has been distributed with the C math library for decades. We will demonstrate the performance of our algorithms by simulation.
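The reason hypot() is a library function at all, rather than a one-liner, is that the textbook formula overflows and underflows spuriously. A minimal sketch of the classic scaling fix (not the paper's improved algorithms, which go further using fused multiply-add):

```python
import math

def naive_hypot(a: float, b: float) -> float:
    # Overflows to inf whenever a*a or b*b exceeds the double range,
    # even though the true result is representable.
    return math.sqrt(a * a + b * b)

def scaled_hypot(a: float, b: float) -> float:
    # Factor out the larger magnitude so the squared ratio is at most 1.
    a, b = abs(a), abs(b)
    if a < b:
        a, b = b, a
    if a == 0.0:
        return 0.0
    r = b / a
    return a * math.sqrt(1.0 + r * r)
```

For example, naive_hypot(1e200, 1e200) returns inf, while scaled_hypot returns a finite value near 1.414e200.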
- Published
- 2019
29. Optimal Architecture of Floating-Point Arithmetic for Neural Network Training Processors.
- Author
-
Junaid, Muhammad, Arslan, Saad, Lee, TaeGeon, and Kim, HyungWon
- Subjects
FLOATING-point arithmetic ,ARTIFICIAL intelligence ,INDUSTRY 4.0 ,ELECTRONIC data processing ,SUPERVISED learning ,INTERNET of things - Abstract
The convergence of artificial intelligence (AI) is one of the critical technologies in the recent fourth industrial revolution. The AIoT (Artificial Intelligence Internet of Things) is expected to be a solution that aids rapid and secure data processing. While the success of AIoT demanded low-power neural network processors, most of the recent research has been focused on accelerator designs only for inference. The growing interest in self-supervised and semi-supervised learning now calls for processors offloading the training process in addition to the inference process. Incorporating training with high accuracy goals requires the use of floating-point operators. The higher precision floating-point arithmetic architectures in neural networks tend to consume a large area and energy. Consequently, an energy-efficient/compact accelerator is required. The proposed architecture incorporates training in 32 bits, 24 bits, 16 bits, and mixed precisions to find the optimal floating-point format for low power and smaller-sized edge device. The proposed accelerator engines have been verified on FPGA for both inference and training of the MNIST image dataset. The combination of 24-bit custom FP format with 16-bit Brain FP has achieved an accuracy of more than 93%. ASIC implementation of this optimized mixed-precision accelerator using TSMC 65nm reveals an active area of 1.036 × 1.036 mm² and energy consumption of 4.445 µJ per training of one image. Compared with 32-bit architecture, the size and the energy are reduced by 4.7 and 3.91 times, respectively. Therefore, the CNN structure using floating-point numbers with an optimized data path will significantly contribute to developing the AIoT field that requires a small area, low energy, and high accuracy. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
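Reduced-precision formats like the 16-bit Brain FP used above can be emulated in software by manipulating the binary32 bit pattern; the sketch below truncates (round-toward-zero) for simplicity, whereas hardware typically rounds to nearest:

```python
import numpy as np

def to_bfloat16(x: np.ndarray) -> np.ndarray:
    """Emulate bfloat16 ("Brain FP") by zeroing the low 16 bits of each
    binary32 value: same 8-bit exponent, mantissa cut to 7 explicit bits."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)
```

Because bfloat16 keeps the full binary32 exponent, the emulation never changes the representable range, only the precision (relative error below 2⁻⁷ per truncation).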
30. Processor Design Using 32 Bit Single Precision Floating Point Unit
- Author
-
Mr. Anand S. Burud and Dr. Pradip C. Bhaskar
- Subjects
Floating point unit ,IEEE 754 ,Electronics & Communication Engineering ,Hardware_ARITHMETICANDLOGICSTRUCTURES - Abstract
Floating-point operations have found intensive application in various fields that require high-precision computation, owing to their great dynamic range, high accuracy, and simple operation rules. High accuracy is needed in the design and research of floating-point processing units. With the growing need for floating-point operations in high-speed digital signal processing and logical operations, the requirements on fast hardware floating-point arithmetic units have become increasingly demanding. The ALU is one of the most essential components in a processor, and is ordinarily the part of the processor that is designed first. In this paper, a fast IEEE 754 compliant 32-bit floating-point arithmetic unit designed using VHDL code is presented; all addition operations were tested on Xilinx and verified successfully. Mr. Anand S. Burud | Dr. Pradip C. Bhaskar "Processor Design Using 32 Bit Single Precision Floating Point Unit" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-2 | Issue-4, June 2018, URL: https://www.ijtsrd.com/papers/ijtsrd12912.pdf
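The binary32 layout such a unit manipulates (1 sign bit, 8 exponent bits, 23 mantissa bits) can be inspected in a few lines; this is the standard IEEE 754 field split, shown here as a Python sketch rather than anything from the paper's VHDL:

```python
import struct

def decode_binary32(x: float):
    """Split a binary32 value into its (sign, biased exponent, mantissa)
    fields. The exponent bias is 127; normal values carry a hidden
    leading 1 bit not stored in the mantissa field."""
    (bits,) = struct.unpack('<I', struct.pack('<f', x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return sign, exponent, mantissa
```

For instance, 1.0 decodes to sign 0, exponent field 127 (unbiased 0), mantissa 0.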
- Published
- 2018
31. Area Efficient Floating Point Addition Unit With Error Detection Logic
- Author
-
B. Premanand and T.S. Aswani
- Subjects
Adder ,Floating point ,Computer science ,business.industry ,Binary number ,Floating-point unit ,Double-precision floating-point format ,02 engineering and technology ,LZA ,Single-precision floating-point format ,020202 computer hardware & architecture ,ModelSim ,IEEE 754 ,0202 electrical engineering, electronic engineering, information engineering ,General Earth and Planetary Sciences ,Carry-select adder ,VHDL ,Hardware_ARITHMETICANDLOGICSTRUCTURES ,business ,Xilinx ,FPGA ,Computer hardware ,General Environmental Science - Abstract
Applications that involve a large dynamic range make use of floating point operations. Addition is one of the complex operations in a floating point unit. This paper proposes an area efficient floating-point addition unit with error detection logic. Existing Leading Zero Anticipators (LZA) and error detection logic help to reduce the delay of a general floating point unit, but are not area efficient. Here a single precision area efficient floating point addition unit is designed using an efficient Carry Select Adder together with the error detection logic. The efficient Carry Select Adder is developed using a Binary to Excess-1 Converter instead of a Ripple Carry Adder for cin=‘1’. The proposed design is simulated using ModelSim and tested on Xilinx.
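The Binary-to-Excess-1 trick can be modeled in software: a conventional carry-select block computes each sub-sum twice (once per possible incoming carry) with two ripple adders, while the BEC variant derives the cin=1 result from the cin=0 result by simply adding one. A behavioural sketch (my own modeling, not the paper's RTL):

```python
def ripple_add(a: int, b: int, cin: int, width: int):
    """Behavioural model of a ripple-carry adder block: (sum, carry-out)."""
    s = a + b + cin
    return s & ((1 << width) - 1), (s >> width) & 1

def bec(x: int, width: int):
    """Binary-to-Excess-1 Converter: x + 1 with carry-out. Replaces the
    second ripple adder of a conventional carry-select block."""
    s = x + 1
    return s & ((1 << width) - 1), (s >> width) & 1

def carry_select_add(a: int, b: int, width: int = 16, block: int = 4):
    """Carry-select adder built from BEC blocks: (sum, final carry)."""
    result, carry = 0, 0
    mask = (1 << block) - 1
    for i in range(0, width, block):
        ab, bb = (a >> i) & mask, (b >> i) & mask
        s0, c0 = ripple_add(ab, bb, 0, block)  # sum assuming cin = 0
        s1, c1 = bec(s0, block)                # sum assuming cin = 1
        c1 |= c0                               # carry-out when cin = 1
        if carry:
            result |= s1 << i
            carry = c1
        else:
            result |= s0 << i
            carry = c0
    return result, carry
```

The area saving comes from the BEC needing far fewer gates than the full adder it replaces, at the cost of one extra increment in the selected path.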
- Published
- 2016
- Full Text
- View/download PDF
32. Enabling High Performance Posit Arithmetic Applications Using Hardware Acceleration
- Author
-
van Dam, Laurens (author) and van Dam, Laurens (author)
- Abstract
The demand for higher precision arithmetic is increasing due to the rapid development of new computing paradigms. The novel posit number representation system, as introduced by John L. Gustafson, claims to be able to provide more accurate answers to mathematical problems with an equal or smaller number of bits compared to the well-established IEEE 754 floating point standard. In this work, the performance of the posit number format in terms of decimal accuracy is analyzed and compared to alternative number representations. A framework for performing high-precision posit arithmetic in reconfigurable logic is presented. The supported arithmetic operations can be performed without rounding off intermediate results, minimizing the loss of decimal accuracy. The proposed posit arithmetic units achieve approximately 250 MPOPS for addition, 160 MPOPS for multiplication and 180 MPOPS for accumulation operations. A hardware accelerator for performing Level 1 BLAS operations on (sparse) posit column vectors is presented. For the calculation of the vector dot product for an input vector length of 10^6 elements, a speedup of approximately 15000x compared to software is achieved. The decimal accuracy is improved by one decimal of accuracy on average compared to posit emulation in software, and two additional decimals of accuracy are achieved compared to calculation using the IEEE 754 floating point format. A study of the application of posit arithmetic in the field of bioinformatics is performed. The effect on decimal accuracy of the pair-HMM forward algorithm by replacing traditional floating point arithmetic with posit arithmetic is analyzed. It is shown that the maximum achievable decimal accuracy using posit arithmetic is higher compared to the IEEE floating point format for the same number of required bits.
The design of a hardware accelerator for the pair-HMM forward algorithm using posit arithmetic is proposed for two different interfaces: a streaming-based accelerator and an ac, ISBN 978-94-6186-957-9, Electrical Engineer | Embedded Systems
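Unlike IEEE 754, a posit has a run-length-encoded "regime" field before its exponent and fraction, which is what gives it tapered accuracy. A toy decoder for an 8-bit posit illustrates the field layout (es=0 is used here for simplicity; the 2022 posit standard fixes es=2, and NaR/zero are the only specials handled):

```python
def decode_posit8(p: int, es: int = 0) -> float:
    """Decode an 8-bit posit with es exponent bits into a Python float."""
    if p == 0x00:
        return 0.0
    if p == 0x80:
        return float('nan')                  # NaR ("Not a Real")
    sign = -1.0 if p & 0x80 else 1.0
    if p & 0x80:
        p = (-p) & 0xFF                      # two's complement for negatives
    s = format(p & 0x7F, '07b')              # the 7 bits after the sign
    run = len(s) - len(s.lstrip(s[0]))       # regime: leading run of equal bits
    k = run - 1 if s[0] == '1' else -run
    rest = s[run + 1:]                       # skip the regime terminator bit
    e = int(rest[:es], 2) if es and rest[:es] else 0
    frac = rest[es:]
    f = 1.0 + (int(frac, 2) / (1 << len(frac)) if frac else 0.0)
    useed = 2.0 ** (1 << es)
    return sign * useed ** k * 2.0 ** e * f
```

Values near 1 get long fraction fields (high accuracy), while extreme values spend their bits on the regime instead, which is the "tapered" trade-off the thesis evaluates.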
- Published
- 2018
33. Interval arithmetic with fixed rounding mode
- Subjects
IEEE 754 ,successor ,predecessor ,rounding mode ,interval arithmetic ,chop rounding - Abstract
We discuss several methods to simulate interval arithmetic operations using floating-point operations with fixed rounding mode. In particular we present formulas using only rounding to nearest and using only chop rounding (towards zero). The latter was the default and only rounding on GPU (Graphics Processing Unit) and cell processors, which in turn are very fast and therefore attractive in scientific computations.
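The crudest round-to-nearest-only formula is to widen the rounded result by one ulp on each side; the paper's contribution is tighter formulas (and chop-rounding variants), so the sketch below is only the baseline idea:

```python
import math

def add_enclosure(a: float, b: float):
    """A safe (not tightest) enclosure of the exact sum a + b using only
    round-to-nearest: the correctly rounded sum is within one ulp of the
    exact result, so stepping one ulp each way brackets it."""
    s = a + b
    return math.nextafter(s, -math.inf), math.nextafter(s, math.inf)
```

The enclosure is two ulps wide even when the sum is exact; the interest of the paper's formulas is precisely to avoid that overestimation.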
- Published
- 2016
34. Preservation of Lyapunov-Theoretic Proofs: From Real to Floating-Point Numbers
- Author
-
Maisonneuve, Vivien, Centre de Recherche en Informatique (CRI), MINES ParisTech - École nationale supérieure des mines de Paris, and Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)
- Subjects
proof preservation ,[INFO.INFO-SC]Computer Science [cs]/Symbolic Computation [cs.SC] ,[INFO.INFO-PF]Computer Science [cs]/Performance [cs.PF] ,ellipse ,IEEE 754 ,[INFO.INFO-AU]Computer Science [cs]/Automatic Control Engineering ,Lyapunov stability ,rounding errors ,[INFO.INFO-ES]Computer Science [cs]/Embedded Systems ,floating-point ,control system ,[INFO.INFO-MO]Computer Science [cs]/Modeling and Simulation ,[MATH.MATH-NA]Mathematics [math]/Numerical Analysis [math.NA] - Abstract
Feron presents how Lyapunov-theoretic proofs of stability can be migrated toward computer-readable and verifiable certificates of control software behavior by relying on Floyd's and Hoare's proof systems. We address the issue of errors resulting from the use of floating-point arithmetic: we present an approach to translate Feron's proof invariants on real arithmetic into similar invariants on floating-point numbers, and show how our methodology applies to prove stability, thus allowing one to verify whether the stability invariant still holds when the controller is implemented. We study in detail the open-loop system of Feron's paper. We also use the same approach for Feron's closed-loop system, but the constraints are too tight to show stability in this second case: more leeway should be introduced in the proof on real numbers, otherwise the resulting system might be unstable.
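The flavour of such a translated invariant can be sketched as a runtime check: the real-arithmetic invariant is an ellipsoid V(x) = xᵀPx ≤ 1, and the floating-point version must tolerate a rounding slack. The matrices and the slack value below are illustrative only, not the paper's derived bounds:

```python
def quad_form(P, x):
    """x^T P x for a 2x2 matrix P and 2-vector x, in plain Python."""
    return (P[0][0] * x[0] * x[0] + (P[0][1] + P[1][0]) * x[0] * x[1]
            + P[1][1] * x[1] * x[1])

def invariant_preserved(A, P, x, slack=1e-9):
    """One step of x <- A x: if x satisfies the ellipsoid invariant
    x^T P x <= 1, check it still holds afterwards up to a slack term
    standing in for a formally derived rounding-error bound."""
    if quad_form(P, x) > 1.0:
        return True                 # precondition not met; nothing to check
    y = [A[0][0] * x[0] + A[0][1] * x[1],
         A[1][0] * x[0] + A[1][1] * x[1]]
    return quad_form(P, y) <= 1.0 + slack
```

Whether a given slack is sound is exactly what the paper's translation of the proof establishes; picking it by hand, as here, proves nothing.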
- Published
- 2013
35. Low Cost Floating-Point Extensions to a Fixed-Point SIMD Datapath
- Author
-
Kolumban, Gaspar
- Subjects
IEEE 754 ,ePUMA ,VPE ,floating-point ,fixed-point datapath ,SIMD - Abstract
The ePUMA architecture is a novel master-multi-SIMD DSP platform aimed at low-power computing, for example in embedded or hand-held devices. It is both a configurable and scalable platform, designed for multimedia and communications. Numbers with both integer and fractional parts are often used in computers because many important algorithms make use of them, for example in signal and image processing. A good way of representing these types of numbers is with a floating-point representation. The ePUMA platform currently supports a fixed-point representation, so the goal of this thesis is to implement twelve basic floating-point arithmetic operations and two conversion operations on an already existing datapath, conforming as closely as possible to the IEEE 754-2008 standard for floating-point representation. The implementation should be done at a low cost in hardware and power consumption. The target frequency is 500MHz. The implementation is compared with dedicated DesignWare components and with floating-point done in software on ePUMA. This thesis presents a solution that on average increases the VPE datapath hardware cost by 15%, while power consumption also increases by 15% on average. The highest clock frequency with the solution is 473MHz. The target clock frequency of 500MHz is thus not achieved, but considering the lack of register retiming in the synthesis step, 500MHz can most likely be reached with this design.
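The two conversion operations bridging the worlds can be sketched for a common fixed-point format; Q1.15 is chosen here for illustration, as the thesis does not pin the format down in this abstract:

```python
def q15_to_float(q: int) -> float:
    """Interpret a 16-bit two's-complement Q1.15 fixed-point word as a
    float in [-1, 1)."""
    if q & 0x8000:
        q -= 1 << 16               # undo two's-complement wrapping
    return q / float(1 << 15)

def float_to_q15(x: float) -> int:
    """Saturating float-to-Q1.15 conversion, rounding to nearest."""
    q = int(round(x * (1 << 15)))
    q = max(-(1 << 15), min((1 << 15) - 1, q))
    return q & 0xFFFF
```

Saturation on overflow (rather than wrap-around) is the usual DSP convention, since clipping distorts a signal far less than wrapping does.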
- Published
- 2013
36. A Pseudo-Random Bit Generator Using Three Chaotic Logistic Maps
- Author
-
François, Michael, Defour, David, Laboratoire d'Informatique Fondamentale d'Orléans (LIFO), Université d'Orléans (UO)-Institut National des Sciences Appliquées - Centre Val de Loire (INSA CVL), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA), Digits, Architectures et Logiciels Informatiques (DALI), Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier (LIRMM), Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)-Université de Perpignan Via Domitia (UPVD), and LIRMM (UM, CNRS)
- Subjects
[INFO.INFO-AR]Computer Science [cs]/Hardware Architecture [cs.AR] ,IEEE 754 ,Logistic map ,Chaotic map ,PRBG ,Cryptography ,Pseudo-random - Abstract
A novel pseudo-random bit generator (PRBG) using three chaotic logistic maps is proposed. At each iteration, the algorithm generates sequences of 32-bit blocks by starting from randomly chosen initial seeds. The impact of relying on the IEEE 754-2008 floating-point representation format for the generator is also taken into account. The performance of the generator is evaluated through various statistical analyses. The results show that the produced sequences possess high statistical randomness and a good level of security, which makes the generator suitable for cryptographic applications.
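The general construction can be sketched as follows; the combination step (XOR of the three maps' fractional bits) and the parameter value are my stand-ins, since the paper's exact mixing function is not given in this abstract:

```python
def logistic_prbg_bits(seeds, r=3.9999, n_blocks=4):
    """Iterate three logistic maps x <- r*x*(1-x) from seeds in (0, 1)
    and emit 32-bit blocks built from their fractional parts.
    Illustrative sketch only; the paper's combination differs."""
    x, y, z = seeds
    out = []
    for _ in range(n_blocks):
        x = r * x * (1.0 - x)
        y = r * y * (1.0 - y)
        z = r * z * (1.0 - z)
        block = (int(x * 2**32) ^ int(y * 2**32) ^ int(z * 2**32)) & 0xFFFFFFFF
        out.append(block)
    return out
```

Because every consumer evaluates the same IEEE 754 double-precision operations, the stream is bit-reproducible across platforms, which is why the representation format matters to the design.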
- Published
- 2013
37. Simultaneous floating-point sine and cosine for VLIW integer processors
- Author
-
Claude-Pierre Jeannerod, Jingyan Jourdan-Lu, Arithmetic and Computing (ARIC), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire de l'Informatique du Parallélisme (LIP), École normale supérieure - Lyon (ENS Lyon)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure - Lyon (ENS Lyon)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Centre National de la Recherche Scientifique (CNRS), Compilation Expertise Center, STMicroelectronics [Grenoble] (ST-GRENOBLE), École normale supérieure de Lyon (ENS de Lyon)-Université Claude Bernard Lyon 1 (UCBL), and Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure de Lyon (ENS de Lyon)-Université Claude Bernard Lyon 1 (UCBL)
- Subjects
[INFO.INFO-AR]Computer Science [cs]/Hardware Architecture [cs.AR] ,Floating point ,floating-point arithmetic ,Computer science ,02 engineering and technology ,Parallel computing ,trigonometric function ,Single-precision floating-point format ,C software implementation ,0202 electrical engineering, electronic engineering, information engineering ,Sine ,Arithmetic ,ACM: C.: Computer Systems Organization/C.1: PROCESSOR ARCHITECTURES/C.1.1: Single Data Stream Architectures/C.1.1.2: RISC/CISC, VLIW architectures ,unit in the last place ,Signal processing ,[INFO.INFO-AO]Computer Science [cs]/Computer Arithmetic ,ACM: B.: Hardware/B.2: ARITHMETIC AND LOGIC STRUCTURES/B.2.4: High-Speed Arithmetic ,020206 networking & telecommunications ,instruction level parallelism ,IEEE floating point ,020202 computer hardware & architecture ,VLIW integer processor ,Very long instruction word ,IEEE 754 ,Unit in the last place ,Integer (computer science) - Abstract
Accepted for publication in the proceedings of the 23rd IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP 2012).; International audience; Graphics and signal processing applications often require that sines and cosines be evaluated at the same floating-point argument, and in such cases a very fast computation of the pair of values is desirable. This paper studies how 32-bit VLIW integer architectures can be exploited in order to perform this task accurately for IEEE single precision. We describe software implementations for sinf, cosf, and sincosf over [-pi/4, pi/4] that have a proven 1-ulp accuracy and whose latency on STMicroelectronics' ST231 VLIW integer processor is 19, 18, and 19 cycles, respectively. Such performances are obtained by introducing a novel algorithm for simultaneous sine and cosine that combines univariate and bivariate polynomial evaluation schemes.
- Published
- 2012
38. Interval arithmetic over finitely many endpoints
- Subjects
IEEE 754 ,Mathematical properties ,Interval arithmetic ,Finitely many endpoints - Abstract
To my knowledge all definitions of interval arithmetic start with real endpoints and prove properties. Then, for practical use, the definition is specialized to finitely many endpoints, where many of the mathematical properties are no longer valid. There seems to be no treatment of how to choose this finite set of endpoints so as to preserve as many mathematical properties as possible. Here we define interval endpoints directly using a finite set which, for example, may be based on the IEEE 754 floating-point standard. The corresponding interval operations emerge naturally from the corresponding power set operations. We present necessary and sufficient conditions on this finite set to ensure desirable mathematical properties, many of which are not satisfied by other definitions. For example, an interval product contains zero if and only if one of the factors does. The key feature of the theoretical foundation is that "endpoints" of intervals are not points but non-overlapping closed, half-open or open intervals, each of which can be regarded as an atomic object. By using non-closed intervals among its "endpoints", intervals containing "arbitrarily large" and "arbitrarily close to but not equal to" a real number can be handled. The latter may be zero defining "tiny" numbers, but also any other quantity including transcendental numbers. Our scheme can be implemented straightforwardly using the IEEE 754 floating-point standard. © 2012 Springer Science + Business Media B.V.
- Published
- 2012
39. How to Square Floats Accurately and Efficiently on the ST231 Integer Processor
- Author
-
Guillaume Revy, Jingyan Jourdan-Lu, Christophe Monat, Claude-Pierre Jeannerod, Computer arithmetic (ARENAIRE), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire de l'Informatique du Parallélisme (LIP), École normale supérieure de Lyon (ENS de Lyon)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure de Lyon (ENS de Lyon)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Centre National de la Recherche Scientifique (CNRS), Laboratoire de l'Informatique du Parallélisme (LIP), Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS), ARENAIRE - Arithmétique des ordinateurs, STMicroelectronics [Grenoble] (ST-GRENOBLE), Digits, Architectures et Logiciels Informatiques (DALI), Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier (LIRMM), Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS)-Université de Perpignan Via Domitia (UPVD), Arithmetic and Computing (ARIC), École normale supérieure - Lyon (ENS Lyon)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure - Lyon (ENS Lyon)-Université Claude Bernard Lyon 1 (UCBL), Compilation Expertise Center, Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)-Université 
de Perpignan Via Domitia (UPVD), Centre National de la Recherche Scientifique (CNRS)-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-École normale supérieure - Lyon (ENS Lyon)-Centre National de la Recherche Scientifique (CNRS)-Université de Lyon-Université Claude Bernard Lyon 1 (UCBL), and Université de Lyon-École normale supérieure - Lyon (ENS Lyon)
- Subjects
Floating point ,Exploit ,Computer science ,binary floating-point arithmetic ,Rounding ,[INFO.INFO-AO]Computer Science [cs]/Computer Arithmetic ,Parameterized complexity ,squaring ,020206 networking & telecommunications ,010103 numerical & computational mathematics ,02 engineering and technology ,Parallel computing ,correct rounding ,instruction level parallelism ,01 natural sciences ,IEEE floating point ,VLIW integer processor ,Very long instruction word ,IEEE 754 ,C software implementation ,0202 electrical engineering, electronic engineering, information engineering ,0101 mathematics ,Latency (engineering) ,Instruction-level parallelism - Abstract
International audience; We consider the problem of computing IEEE floating-point squares by means of integer arithmetic. We show how to exploit the specific properties of squaring in order to design and implement algorithms that have much lower latency than those for general multiplication, while still guaranteeing correct rounding. Our algorithms are parameterized by the floating-point format, aim at high instruction-level parallelism (ILP) exposure, and cover all rounding modes. We show further that their C implementation for the binary32 format yields efficient codes for targets like the ST231 VLIW integer processor from ST Microelectronics, with a latency at least 1.75x smaller than that of general multiplication in the same context.
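Two of the properties that make squaring cheaper than general multiplication are visible even in a behavioural model: the sign vanishes, and the single significand product needs roughly half the partial products. The sketch below models an integer-only binary32 square with round-to-nearest-even for normal inputs whose square is also normal; it is a Python illustration of the idea, not the paper's ST231 code:

```python
import struct

def _bits(x):
    return struct.unpack('<I', struct.pack('<f', x))[0]

def _from_bits(b):
    return struct.unpack('<f', struct.pack('<I', b))[0]

def square_binary32(x: float) -> float:
    """Correctly rounded square of a normal binary32 value (normal result
    assumed; subnormals and overflow are not handled in this sketch)."""
    b = _bits(x)
    e = ((b >> 23) & 0xFF) - 127            # unbiased exponent; sign ignored
    m = (b & 0x7FFFFF) | (1 << 23)          # significand in [2^23, 2^24)
    p = m * m                               # exact 47- or 48-bit product
    if p >= 1 << 47:                        # significand^2 in [2, 4)
        shift, e2 = 24, 2 * e + 1
    else:                                   # significand^2 in [1, 2)
        shift, e2 = 23, 2 * e
    q, r = p >> shift, p & ((1 << shift) - 1)
    half = 1 << (shift - 1)
    if r > half or (r == half and (q & 1)): # round to nearest, ties to even
        q += 1
        if q == 1 << 24:                    # rounding carried out
            q >>= 1
            e2 += 1
    return _from_bits(((e2 + 127) << 23) | (q & 0x7FFFFF))
```

Note the renormalization needs only a one-position choice (shift by 23 or 24), since a squared significand always lies in [1, 4); general multiplication must handle the same range but cannot drop the sign logic or share partial products.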
- Published
- 2011
- Full Text
- View/download PDF
40. Techniques and tools for implementing IEEE 754 floating-point arithmetic on VLIW integer processors
- Author
-
Christian Bertin, Jean-Michel Muller, Hervé Knochel, Christophe Mouilleron, Guillaume Revy, Jingyan Jourdan-Lu, Christophe Monat, Claude-Pierre Jeannerod, STMicroelectronics [Grenoble] (ST-GRENOBLE), Computer arithmetic (ARENAIRE), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire de l'Informatique du Parallélisme (LIP), École normale supérieure - Lyon (ENS Lyon)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure - Lyon (ENS Lyon)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Centre National de la Recherche Scientifique (CNRS), Electronique, Informatique, Automatique et Systèmes (ELIAUS), Université de Perpignan Via Domitia (UPVD), École normale supérieure de Lyon (ENS de Lyon)-Université Claude Bernard Lyon 1 (UCBL), and Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure de Lyon (ENS de Lyon)-Université Claude Bernard Lyon 1 (UCBL)
- Subjects
Computer science ,Optimizing compiler ,correct rounding ,instruction-level parallelism ,code generation ,02 engineering and technology ,Parallel computing ,Single-precision floating-point format ,Software ,polynomial evaluation ,C software implementation ,0202 electrical engineering, electronic engineering, information engineering ,Code generation ,[INFO.INFO-SC]Computer Science [cs]/Symbolic Computation [cs.SC] ,binary floating-point arithmetic ,business.industry ,[INFO.INFO-AO]Computer Science [cs]/Computer Arithmetic ,020206 networking & telecommunications ,IEEE floating point ,020202 computer hardware & architecture ,VLIW integer processor ,IEEE 754 ,Very long instruction word ,[INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC] ,Instruction-level parallelism ,business ,Integer (computer science) - Abstract
International audience; Recently, some high-performance IEEE 754 single precision floating-point software has been designed, which aims at best exploiting some features (integer arithmetic, parallelism) of the STMicroelectronics ST200 Very Long Instruction Word (VLIW) processor. We review here the techniques and software tools used or developed for this design and its implementation, and how they allowed very high instruction-level parallelism (ILP) exposure. Those key points include a hierarchical description of function evaluation algorithms, the exploitation of the standard encoding of floating-point data, the automatic generation of fast and accurate polynomial evaluation schemes, and some compiler optimizations.
- Published
- 2010
- Full Text
- View/download PDF
41. Software Aspects of IEEE Floating-Point Computations for Numerical Applications in High Energy Physics
- Author
-
Arnold, Jeffrey
- Published
- 2010
42. Bringing fast floating-point arithmetic into embedded integer processors
- Author
-
Bertin, Christian, Jeannerod, Claude-Pierre, Monat, Christophe, STMicroelectronics [Grenoble] (ST-GRENOBLE), Computer arithmetic (ARENAIRE), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire de l'Informatique du Parallélisme (LIP), École normale supérieure de Lyon (ENS de Lyon)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure de Lyon (ENS de Lyon)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Centre National de la Recherche Scientifique (CNRS), École normale supérieure - Lyon (ENS Lyon)-Université Claude Bernard Lyon 1 (UCBL), and Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure - Lyon (ENS Lyon)-Université Claude Bernard Lyon 1 (UCBL)
- Subjects
binary floating-point arithmetic ,polynomial evaluation ,VLIW integer processor ,IEEE 754 ,C software implementation ,[INFO.INFO-AO]Computer Science [cs]/Computer Arithmetic ,[INFO.INFO-ES]Computer Science [cs]/Embedded Systems ,instruction-level parallelism ,code generation ,[INFO.INFO-MS]Computer Science [cs]/Mathematical Software [cs.MS] - Published
- 2010
43. Mata Matters: Overflow, underflow and the IEEE floating-point format
- Author
-
Linhart, Jean Marie
- Subjects
missing values ,subnormal number ,normalized number ,IEEE 754 ,double precision ,MathematicsofComputing_NUMERICALANALYSIS ,underflow ,format ,binary ,overflow ,hexadecimal ,Research Methods/ Statistical Methods ,denormalized number - Abstract
Mata is Stata’s matrix language. The Mata Matters column shows how Mata can be used interactively to solve problems and as a programming language to add new features to Stata. In this quarter’s column, we investigate underflow and overflow and then delve into the details of how floating-point numbers are stored in the IEEE 754 floating-point standard. We show how to test for overflow and underflow. We demonstrate how to use the %21x format to see underflow and the %16H, %16L, %8H, and %8L formats for displaying the byte content of doubles and floats.
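The same inspection can be done outside Stata: Python's float.hex() plays roughly the role of Mata's %21x format (exact significand bits plus binary exponent), and classification against the smallest normal double identifies gradual underflow. A small sketch, with names of my choosing:

```python
import math
import sys

def classify(x: float) -> str:
    """Classify a double as overflowed, subnormal, or normal/zero."""
    if math.isnan(x):
        return 'nan'
    if math.isinf(x):
        return 'overflow (inf)'
    # Nonzero magnitudes below the smallest normal double (about
    # 2.2250738585072014e-308) are subnormal: gradual underflow.
    if x != 0.0 and abs(x) < sys.float_info.min:
        return 'subnormal (gradual underflow)'
    return 'normal or zero'

# float.hex() exposes the exact stored bits, like Stata's %21x:
# (1.0).hex() == '0x1.0000000000000p+0'
```

Multiplying 1e308 by 10 overflows to inf, while dividing 1e-308 by 1e10 lands in the subnormal range rather than snapping to zero, which is the behaviour the column demonstrates in Mata.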
- Published
- 2008
- Full Text
- View/download PDF
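The overflow and underflow behaviour and the bit-level view of doubles described in the abstract above can be reproduced outside Stata. The following is a minimal Python sketch (standard library only) of the same experiments the column performs with the %21x and byte-display formats; `double_bits` is an illustrative helper, not part of Mata:

```python
import struct
import sys

def double_bits(x: float) -> str:
    """Hex string of the 64 raw bits of a double (sign, 11-bit biased
    exponent, 52-bit fraction), similar in spirit to Stata's %16H display."""
    return struct.pack(">d", x).hex()

# Overflow: doubling the largest finite double yields +infinity.
assert sys.float_info.max * 2 == float("inf")

# Underflow is gradual: below the smallest normalized double come the
# subnormal (denormalized) numbers, and only then exactly zero.
tiny = sys.float_info.min            # smallest normalized double
subnormal = tiny / 2                 # subnormal: smaller than tiny, not zero
assert 0.0 < subnormal < tiny
assert subnormal / 2**60 == 0.0      # finally underflows to exact zero

print(double_bits(1.0))              # 3ff0000000000000
print(double_bits(-2.0))             # c000000000000000
```

The hex strings make the field boundaries visible: `3ff` is the biased exponent 1023 (representing 2^0) with sign bit clear, and the 52 fraction bits of 1.0 are all zero.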
44. Design of single precision float adder (32-bit numbers) according to IEEE 754 standard using VHDL
- Author
-
Universitat Politècnica de Catalunya. Departament d'Enginyeria Electrònica, Slovenská technická univerzita v Bratislave, Stopjaková, Viera, Zálusky, Roman, Barrabés Castillo, Arturo, Universitat Politècnica de Catalunya. Departament d'Enginyeria Electrònica, Slovenská technická univerzita v Bratislave, Stopjaková, Viera, Zálusky, Roman, and Barrabés Castillo, Arturo
- Abstract
Project carried out within a mobility programme with the Slovenská Technická Univerzita v Bratislave, Fakulta Elecktrotechniky a Informatiky. Floating-point arithmetic is by far the most widely used way of approximating real-number arithmetic for numerical calculations on modern computers. For a long time, each computer had a different arithmetic: bases, significand and exponent sizes, formats, etc. Each manufacturer implemented its own model, which hindered portability between machines until the IEEE 754 standard appeared, defining a single, universal format. The aim of this project is to implement a 32-bit binary floating-point adder/subtractor according to the IEEE 754 standard, using the hardware description language VHDL. (The record also carries Spanish and Catalan versions of this same abstract, omitted here as duplicates.)
- Published
- 2012
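As background to the adder design above, the binary32 field split the thesis starts from (1 sign bit, 8 exponent bits, 23 fraction bits) can be sketched in Python rather than VHDL; `binary32_fields` is an illustrative helper, not code from the thesis:

```python
import struct

def binary32_fields(x: float) -> tuple:
    """Split a float, stored as IEEE 754 binary32, into the three fields a
    hardware adder operates on: sign (1 bit), biased exponent (8 bits),
    and fraction (23 bits)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

# 1.0  -> sign 0, biased exponent 127 (i.e. 2^0), fraction 0
assert binary32_fields(1.0) == (0, 127, 0)
# -0.5 -> sign 1, biased exponent 126 (i.e. 2^-1), fraction 0
assert binary32_fields(-0.5) == (1, 126, 0)
# 1.5  -> implicit leading 1 plus fraction 0.5: top fraction bit set
assert binary32_fields(1.5) == (0, 127, 1 << 22)
```

An adder must align the operands' significands by the difference of the two exponent fields before adding, which is why this decomposition is the natural starting point.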
45. Worst Cases of a Periodic Function for Large Arguments
- Author
-
D. Stehle, Vincent Lefèvre, Guillaume Hanrot, Paul Zimmermann, Curves, Algebra, Computer Arithmetic, and so On (CACAO), INRIA Lorraine, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université Henri Poincaré - Nancy 1 (UHP)-Université Nancy 2-Institut National Polytechnique de Lorraine (INPL)-Centre National de la Recherche Scientifique (CNRS)-Université Henri Poincaré - Nancy 1 (UHP)-Université Nancy 2-Institut National Polytechnique de Lorraine (INPL)-Centre National de la Recherche Scientifique (CNRS), Computer arithmetic (ARENAIRE), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire de l'Informatique du Parallélisme (LIP), École normale supérieure de Lyon (ENS de Lyon)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure de Lyon (ENS de Lyon)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Centre National de la Recherche Scientifique (CNRS), Peter Kornerup and Jean-Michel Muller, École normale supérieure - Lyon (ENS Lyon)-Université Claude Bernard Lyon 1 (UCBL), and Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure - Lyon (ENS Lyon)-Université Claude Bernard Lyon 1 (UCBL)
- Subjects
Polynomial ,Floating point ,Computational complexity theory ,floating-point arithmetic ,Heuristic (computer science) ,[INFO.INFO-AO]Computer Science [cs]/Computer Arithmetic ,Double-precision floating-point format ,010103 numerical & computational mathematics ,Function (mathematics) ,correct rounding ,periodic function ,01 natural sciences ,010101 applied mathematics ,Periodic function ,IEEE 754 ,Trigonometric functions ,0101 mathematics ,Algorithm ,Mathematics ,worst case - Abstract
We consider the problem of finding hard-to-round cases of a periodic function for large floating-point inputs, more precisely when the function cannot be efficiently approximated by a polynomial. This is one of the last few issues that prevent guaranteeing an efficient computation of correctly rounded transcendentals for the whole IEEE-754 double-precision format. We present the first non-naive algorithm for this problem, with a heuristic complexity of $O(2^{0.676 p})$ for a precision of $p$ bits. The efficiency of the algorithm is demonstrated on the largest IEEE-754 double-precision binade for the sine function, and some corresponding bad cases are given. We can hope that all the worst cases of the trigonometric functions over their whole domain will be found within a few years, a task that was considered out of reach until now.
- Published
- 2007
- Full Text
- View/download PDF
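To see why large arguments are the hard case for correctly rounded periodic functions, one can contrast a library sine with a naive reduction modulo the double nearest to 2π. This small Python illustration is not the paper's algorithm, only a motivation for it:

```python
import math

# sin(x) for huge x requires reducing x modulo the *true* 2*pi.
# fmod below reduces exactly, but modulo the double nearest to 2*pi;
# that representation error, multiplied by the ~1.6e21 periods that fit
# in 1e22, makes the naively reduced angle essentially unrelated to the
# correctly reduced one.
x = 1e22
naive = math.fmod(x, 2 * math.pi)

print(math.sin(x))       # a good libm reduces the argument exactly
print(math.sin(naive))   # sine of a different, naively reduced angle
assert -1.0 <= math.sin(x) <= 1.0
assert 0.0 <= naive < 2 * math.pi
```

Both printed values are legitimate sines of *some* angle, which is exactly why the worst-case search in this paper is needed: correctness for large arguments cannot be judged from the output alone.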
46. Error Bounds on Complex Floating-Point Multiplication
- Author
-
Paul Zimmermann, Richard P. Brent, Colin Percival, Curves, Algebra, Computer Arithmetic, and so On (CACAO), INRIA Lorraine, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), and Institut National de Recherche en Informatique et en Automatique (Inria)-Université Henri Poincaré - Nancy 1 (UHP)-Université Nancy 2-Institut National Polytechnique de Lorraine (INPL)-Centre National de la Recherche Scientifique (CNRS)-Université Henri Poincaré - Nancy 1 (UHP)-Université Nancy 2-Institut National Polytechnique de Lorraine (INPL)-Centre National de la Recherche Scientifique (CNRS)
- Subjects
Arithmetic underflow ,Floating point ,[INFO.INFO-DS]Computer Science [cs]/Data Structures and Algorithms [cs.DS] ,Double-precision floating-point format ,010103 numerical & computational mathematics ,01 natural sciences ,roundoff error ,0101 mathematics ,Arithmetic ,error analysis ,Mathematics ,Discrete mathematics ,Algebra and Number Theory ,Applied Mathematics ,Complex multiplication ,[MATH.MATH-CV]Mathematics [math]/Complex Variables [math.CV] ,IEEE floating point ,010101 applied mathematics ,Computational Mathematics ,complex multiplication ,IEEE 754 ,Product (mathematics) ,floating-point number ,Multiplication ,Round-off error ,[MATH.MATH-NA]Mathematics [math]/Numerical Analysis [math.NA] - Abstract
Given floating-point arithmetic with $t$-digit base-$\beta$ significands in which all arithmetic operations are performed as if calculated to infinite precision and rounded to a nearest representable value, we prove that the product of complex values $z_0$ and $z_1$ can be computed with maximum absolute error $|z_0|\,|z_1|\,\frac{1}{2}\beta^{1-t}\sqrt{5}$. In particular, this provides relative error bounds of $2^{-24}\sqrt{5}$ and $2^{-53}\sqrt{5}$ for IEEE 754 single and double precision arithmetic respectively, provided that overflow, underflow, and denormals do not occur. We also provide the numerical worst cases for IEEE 754 single and double precision arithmetic.
- Published
- 2007
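Since every finite double is an exact rational, the paper's binary64 bound $|z_0||z_1| \cdot 2^{-53}\sqrt{5}$ can be spot-checked with exact arithmetic. A Python sketch, assuming the platform multiplies complex numbers with the naive four-multiplication formula (as CPython does); `exact_product` and `within_bound` are illustrative names:

```python
from fractions import Fraction

def exact_product(z0: complex, z1: complex):
    """Exact real and imaginary parts of z0*z1: every finite double is an
    exact rational, so Fraction arithmetic gives the true product."""
    a, b = Fraction(z0.real), Fraction(z0.imag)
    c, d = Fraction(z1.real), Fraction(z1.imag)
    return a * c - b * d, a * d + b * c

def within_bound(z0: complex, z1: complex) -> bool:
    """Does the rounded product respect |error| <= |z0||z1| * 2^-53 * sqrt(5)?
    Compared via squared magnitudes so everything stays rational."""
    w = z0 * z1                                  # rounded: 4 multiplies, 2 adds
    re, im = exact_product(z0, z1)
    err_sq = (Fraction(w.real) - re) ** 2 + (Fraction(w.imag) - im) ** 2
    bound_sq = (abs(z0) * abs(z1) * 2.0 ** -53) ** 2 * 5
    return float(err_sq) <= bound_sq * (1 + 1e-9)  # slack: bound itself is a float

assert within_bound(1.1 + 2.2j, 3.3 - 4.4j)
assert within_bound(complex(1 / 3, 1 / 7), complex(1 / 9, -1 / 11))
```

A check like this cannot prove the bound, of course; the paper's contribution is the proof and the worst cases that show the $\sqrt{5}$ factor is attained.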
47. Codificación binaria de int y float
- Author
-
Universitat Politècnica de València. Escuela Técnica Superior de Ingenieros de Telecomunicación - Escola Tècnica Superior d'Enginyers de Telecomunicació, González Téllez, Alberto, Universitat Politècnica de València. Escuela Técnica Superior de Ingenieros de Telecomunicación - Escola Tècnica Superior d'Enginyers de Telecomunicació, and González Téllez, Alberto
- Abstract
An example is used to describe how the C data types int and float are encoded in binary.
- Published
- 2008
48. Computing Floating-Point Square Roots via Bivariate Polynomial Evaluation.
- Author
-
Jeannerod, Claude-Pierre, Knochel, Herve, Monat, Christophe, and Revy, Guillaume
- Subjects
- *FLOATING-point arithmetic , *SQUARE root , *COMPUTER systems , *POLYNOMIALS , *FIXED point theory , *DATA analysis , *COMPUTER software - Abstract
In this paper, we show how to reduce the computation of correctly rounded square roots of binary floating-point data to the fixed-point evaluation of some particular integer polynomials in two variables. By designing parallel and accurate evaluation schemes for such bivariate polynomials, we show further that this approach allows for high instruction-level parallelism (ILP) exposure, and thus, potentially low-latency implementations. Then, as an illustration, we detail a C implementation of our method in the case of IEEE 754-2008 binary32 floating-point data (formerly called single precision in the 1985 version of the IEEE 754 standard). This software implementation, which assumes 32-bit unsigned integer arithmetic only, is almost complete in the sense that it supports special operands, subnormal numbers, and all rounding-direction attributes, but not exception handling (that is, status flags are not set). Finally, we have carried out experiments with this implementation on the ST231, an integer processor from the STMicroelectronics' ST200 family, using the ST200 family VLIW compiler. The results obtained demonstrate the practical interest of our approach in that context: for all rounding-direction attributes, the generated assembly code is optimally scheduled and has indeed low latency (23 cycles). [ABSTRACT FROM AUTHOR]
- Published
- 2011
- Full Text
- View/download PDF
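The paper above computes correctly rounded square roots using integer arithmetic only. In the same spirit, correct rounding of a given sqrt result can at least be *verified* with exact rational arithmetic. A Python sketch with a hypothetical helper name; this is a checking procedure, not the paper's bivariate polynomial evaluation method:

```python
from fractions import Fraction
import math

def is_correctly_rounded_sqrt(x: float) -> bool:
    """Verify with exact rational arithmetic that math.sqrt(x) is a closest
    double to the true square root of x. The half-ulp window is slightly
    generous just below a power of two, so a correctly rounded result
    always passes the check."""
    r = math.sqrt(x)
    half_ulp = Fraction(math.ulp(r)) / 2
    lo, hi = Fraction(r) - half_ulp, Fraction(r) + half_ulp
    # sqrt is monotone, so: true sqrt in [lo, hi]  <=>  lo^2 <= x <= hi^2
    return lo * lo <= Fraction(x) <= hi * hi

# IEEE 754 requires sqrt to be correctly rounded, so these all hold:
assert is_correctly_rounded_sqrt(2.0)
assert is_correctly_rounded_sqrt(0.1)
assert all(is_correctly_rounded_sqrt(float(n)) for n in range(1, 100))
```

Note the squaring trick: it sidesteps computing the irrational square root entirely, which is the same idea that lets the paper stay within integer arithmetic on the ST231.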
49. Disseny d'un sumador de punt flotant de precisió simple (32 bits) basat en l'estàndard IEEE 754 utilitzant VHDL
- Author
-
Barrabés Castillo, Arturo, Universitat Politècnica de Catalunya. Departament d'Enginyeria Electrònica, Slovenská technická univerzita v Bratislave, Stopjaková, Viera, and Zálusky, Roman
- Subjects
Lògica programable ,Anàlisi numèrica ,Electrònica digital ,VHDL (Llenguatge de descripció de maquinari) ,IEEE 754 ,Enginyeria electrònica::Circuits electrònics [Àrees temàtiques de la UPC] ,VHDL ,Floating point arithmetic ,Aritmética de punto flotante ,VHDL (Computer hardware description language) ,Numerical analysis - Abstract
Project carried out within a mobility programme with the Slovenská Technická Univerzita v Bratislave, Fakulta Elecktrotechniky a Informatiky. Floating-point arithmetic is by far the most widely used way of approximating real-number arithmetic for numerical calculations on modern computers. For a long time, each computer had a different arithmetic: bases, significand and exponent sizes, formats, etc. Each manufacturer implemented its own model, which hindered portability between machines until the IEEE 754 standard appeared, defining a single, universal format. The aim of this project is to implement a 32-bit binary floating-point adder/subtractor according to the IEEE 754 standard, using the hardware description language VHDL. (The record also carries Spanish and Catalan versions of this same abstract, omitted here as duplicates.)