While the set ℝ of real numbers and its subset ℚ of rational numbers
are known to everyone, the way in which
computers treat them is perhaps less well known. Since
machines have limited resources, only a finite subset 𝔽 of
ℚ ⊂ ℝ can be represented. The numbers in this subset are called normalized
floating-point numbers; they were standardized by the IEEE (Institute of Electrical and Electronics Engineers) in 1985. The standard was most recently updated in 2019 as IEEE 754-2019, which specifies interchange and arithmetic formats and methods for binary and decimal floating-point arithmetic in computer programming environments.
IEEE Floating Point Numbers
Computers have both an integer mode and a floating-point mode for representing numbers.
The integer mode is used to perform calculations on integer values. Computers
usually perform scientific calculations in floating point arithmetic, and a real number is
represented in a normalized floating point binary form. That is, computers store a real
number x ∈ ℝ in the normalized form
\( \displaystyle \quad x = \pm \left( \cdot d_1 d_2 \ldots d_n \right)_{\beta} \times \beta^{e} , \quad \)
where \( \displaystyle \quad \left( \cdot d_1 d_2 \ldots d_n \right)_{\beta} \quad \) is called the mantissa and e is an integer called the exponent. The digits of the mantissa satisfy the inequalities 0 ≤ dᵢ ≤ β − 1. In most computers β = 2; other values of β used in some computers are 10 and 16. The
term ‘normalized’ means that the leading digit d₁ is always nonzero unless the number represented is 0. That is, d₁ ≠ 0 or d₁ = d₂ = ⋯ = dₙ = 0. The representation 0.00357 is not in normalized form; it should be written as 0.357 × 10⁻².
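As a brief sketch in Python (not one of this text's Mathematica examples), the standard library's `math.frexp` decomposes a float into exactly this normalized base-2 form, returning a mantissa with a nonzero leading binary digit and an integer exponent:

```python
import math

# math.frexp(x) returns (m, e) with x = m * 2**e and 0.5 <= |m| < 1,
# i.e., the normalized base-2 form whose leading binary digit is nonzero.
m, e = math.frexp(0.00357)
print(m, e)               # mantissa in [0.5, 1), integer exponent
print(m * 2**e)           # reconstructs 0.00357 exactly

# The number 0 is the only value represented with a zero mantissa.
print(math.frexp(0.0))    # (0.0, 0)
```

The reconstruction `m * 2**e` is exact because both factors come from the stored binary representation; no rounding occurs.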
There are several reasons for using the normalized floating-point representation.
Normalization ensures the highest possible precision by fully using the length of the mantissa.
By fixing a single standard format, it also reduces errors in computations and simplifies comparisons and arithmetic operations.
The set 𝔽 is fully characterized by the base β, the number
of significant digits n, and the range (L, U) (with L < 0 and U > 0) of
variation of the exponent e. Thus, it is denoted by 𝔽(β, n, L, U).
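These four parameters can be read off directly for the machine's own floating-point system. As an illustrative sketch in Python, `sys.float_info` exposes them for the hardware double-precision type, which on typical platforms is IEEE 754 binary64, i.e., 𝔽(2, 53, −1021, 1024):

```python
import sys

fi = sys.float_info
# For IEEE 754 double precision these are typically:
#   radix    (beta) = 2
#   mant_dig (n)    = 53
#   min_exp  (L)    = -1021
#   max_exp  (U)    = 1024
print(fi.radix, fi.mant_dig, fi.min_exp, fi.max_exp)
```

The finite ranges explain why doubles have roughly 15–16 significant decimal digits and why overflow and underflow occur near β^U and β^L.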
Although the Wolfram Language gives exact answers whenever possible, it also utilizes approximate numbers. Using N, you can obtain a numerical approximation to an exact quantity with any desired precision or accuracy. In calculations involving arbitrary-precision approximate numbers, the Wolfram Language tracks the propagation of the numerical error. The use of high-precision numbers can yield accurate results where other numerical systems fail.
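The same idea of requesting a result to an arbitrary number of digits has analogues outside the Wolfram Language. As a hedged Python sketch (not Wolfram's `N`), the standard-library `decimal` module lets you set the working precision, while ordinary binary doubles exhibit the familiar representation error:

```python
from decimal import Decimal, getcontext

# Work with 50 significant digits instead of the ~16 of double precision.
getcontext().prec = 50
third = Decimal(1) / Decimal(3)
print(third)             # 1/3 to 50 significant digits

# In binary double precision, 0.1 + 0.2 is not exactly 0.3,
# because 0.1, 0.2, and 0.3 have no finite base-2 expansion.
print(0.1 + 0.2 == 0.3)  # False
```

Unlike the Wolfram Language, `decimal` does not automatically track error propagation; the user chooses a fixed precision for the context.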
Example 1:
■
End of Example 1
Example 2: Mathematica controls the precision and accuracy of numerical calculations; see the corresponding documentation.
■