While the set ℝ of real numbers and its subset ℚ of rational numbers
are known to everyone, the way in which
computers treat them is perhaps less well known. Since
machines have limited resources, only a finite subset 𝔽 of
ℚ ⊂ ℝ can be represented. The numbers in this subset are called normalized
floating-point numbers; they were standardized by the IEEE (Institute of Electrical and Electronics Engineers) in 1985. The standard was most recently updated in 2019 as IEEE 754-2019, which specifies interchange and arithmetic formats and methods for binary and decimal floating-point arithmetic in computer programming environments.
IEEE Floating Point Numbers
Computers have both an integer mode and a floating-point mode for representing numbers.
The integer mode is used to perform calculations on integer values. Computers
usually perform scientific calculations in floating point arithmetic, and a real number is
represented in a normalized floating point binary form. That is, computers store a real
number x ∈ ℝ in the normalized form
\( \displaystyle \quad x = \pm \left( \cdot d_1 d_2 \ldots d_n \right)_{\beta} \times \beta^{e} , \quad \)
where \( \displaystyle \quad \left( \cdot d_1 d_2 \ldots d_n \right)_{\beta} \quad \) is called the mantissa and e is an integer called the exponent. The digits of the mantissa satisfy the inequalities 0 ≤ dᵢ ≤ β − 1. In most computers β = 2; other values of β used in some computers are 10 and 16. The
term ‘normalized’ means that the leading digit d₁ is always nonzero unless the number represented is 0. That is, d₁ ≠ 0 or d₁ = d₂ = ⋯ = dₙ = 0. The representation 0.00357 is not in normalized form; it should be written as 0.357 × 10⁻².
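As a brief sketch in Python (not one of this text's Mathematica examples), the standard library's `math.frexp` decomposes a float into exactly this normalized base-2 form, returning a mantissa with a nonzero leading binary digit and an integer exponent:

```python
import math

# math.frexp(x) returns (m, e) with x = m * 2**e and 0.5 <= |m| < 1,
# i.e., the normalized base-2 form whose leading binary digit is nonzero.
m, e = math.frexp(0.00357)
print(m, e)               # mantissa in [0.5, 1), integer exponent
print(m * 2**e)           # reconstructs 0.00357 exactly

# The number 0 is the only value represented with a zero mantissa.
print(math.frexp(0.0))    # (0.0, 0)
```

The reconstruction `m * 2**e` is exact because both factors come from the stored binary representation; no rounding occurs.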
There are several reasons for using the normalized floating-point representation.
Normalization ensures the highest possible precision by fully using the length of the mantissa.
By fixing a single standard format, it also reduces errors in computations and simplifies comparisons and arithmetic operations.
The set 𝔽 is fully characterized by the base β, the number
of significant digits n, and the range (L, U) (with L < 0 and U > 0) of
variation of the exponent e. Thus, it is denoted by 𝔽(β, n, L, U).
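These four parameters can be read off directly for the machine's own floating-point system. As an illustrative sketch in Python, `sys.float_info` exposes them for the hardware double-precision type, which on typical platforms is IEEE 754 binary64, i.e., 𝔽(2, 53, −1021, 1024):

```python
import sys

fi = sys.float_info
# For IEEE 754 double precision these are typically:
#   radix    (beta) = 2
#   mant_dig (n)    = 53
#   min_exp  (L)    = -1021
#   max_exp  (U)    = 1024
print(fi.radix, fi.mant_dig, fi.min_exp, fi.max_exp)
```

The finite ranges explain why doubles have roughly 15–16 significant decimal digits and why overflow and underflow occur near β^U and β^L.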
Although the Wolfram Language gives exact answers whenever possible, it also utilizes approximate numbers. Using N, you can obtain a numerical approximation to an exact quantity with any desired precision or accuracy. In calculations involving arbitrary-precision approximate numbers, the Wolfram Language tracks the propagation of the numerical error. The use of high-precision numbers can yield accurate results where other numerical systems fail.
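The same idea of requesting a result to an arbitrary number of digits has analogues outside the Wolfram Language. As a hedged Python sketch (not Wolfram's `N`), the standard-library `decimal` module lets you set the working precision, while ordinary binary doubles exhibit the familiar representation error:

```python
from decimal import Decimal, getcontext

# Work with 50 significant digits instead of the ~16 of double precision.
getcontext().prec = 50
third = Decimal(1) / Decimal(3)
print(third)             # 1/3 to 50 significant digits

# In binary double precision, 0.1 + 0.2 is not exactly 0.3,
# because 0.1, 0.2, and 0.3 have no finite base-2 expansion.
print(0.1 + 0.2 == 0.3)  # False
```

Unlike the Wolfram Language, `decimal` does not automatically track error propagation; the user chooses a fixed precision for the context.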
Example 1:
■
End of Example 1
Example 2: Mathematica controls the precision and accuracy of numerical calculations; see the corresponding documentation.
■