Add or subtract floating point numbers (IEEE 754)

About the calculator

This calculator can be used to add or subtract 2 binary IEEE-754 floating point numbers. You can select whether the input numbers are binary floating point numbers in binary or hexadecimal representation or whether they are decimal numbers. If they are decimal numbers, they are converted into binary floating point numbers before being added or subtracted.

Layout

An IEEE-754 floating point number consists of a sign, an exponent part and a mantissa.

A binary floating point number is made up of 3 parts. It begins with a sign bit. This is 0 if the number is positive or 1 if it is negative.

This is followed by an exponent (also known as characteristic). It stores how many places the binary point of the binary number had to be shifted to the right or left in order to normalize it. The exponent is always stored as a non-negative value: a bias is added to it so that a negative exponent, which results when the binary point had to be shifted to the right for normalization, can also be stored.

The exponent is followed by a mantissa. This is determined by shifting the binary point to the left or right until it is exactly behind the first 1 of the binary number (normalization). The fractional digits are then written into the mantissa.

data type   size       exponent   mantissa   bias
binary16    16 bits    5 bits     10 bits    15
binary32    32 bits    8 bits     23 bits    127
binary64    64 bits    11 bits    52 bits    1023
binary128   128 bits   15 bits    112 bits   16383
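
As a rough illustration of this layout, the following Python sketch splits a bit string into its three parts using the field widths from the table. The names FORMATS, split_fields and bias_of are made up for this example and are not part of the calculator.

```python
# Field widths (exponent bits, mantissa bits) taken from the table above.
FORMATS = {
    "binary16": (5, 10),
    "binary32": (8, 23),
    "binary64": (11, 52),
    "binary128": (15, 112),
}

def split_fields(bits, fmt="binary32"):
    """Split a bit string into (sign, biased exponent, mantissa digits)."""
    exp_bits, man_bits = FORMATS[fmt]
    if len(bits) != 1 + exp_bits + man_bits:
        raise ValueError("bit string has the wrong length for " + fmt)
    sign = int(bits[0])
    exponent = int(bits[1:1 + exp_bits], 2)   # still includes the bias
    mantissa = bits[1 + exp_bits:]            # fractional digits only
    return sign, exponent, mantissa

def bias_of(fmt="binary32"):
    """The bias equals 2^(exponent bits - 1) - 1, e.g. 127 for binary32."""
    exp_bits, _ = FORMATS[fmt]
    return 2 ** (exp_bits - 1) - 1

print(split_fields("01000011011101100100011100000000"))
# (0, 134, '11101100100011100000000')
```

The formula in bias_of reproduces the bias column of the table for all four data types.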

Addition of binary floating point numbers with the same sign

Two binary IEEE-754 floating point numbers that have the same sign can be added using the following steps:

  1. Convert exponents to decimal numbers
  2. Prepend implicit 1 to mantissas
  3. Shift binary point to align exponents
  4. Add mantissas
  5. Normalization
  6. Rounding
  7. Convert exponent to binary number
  8. Assemble floating point number

Step 1 and step 7 can be omitted if the difference between the two exponents is determined in binary form and the exponent is modified in its binary form when the binary point is shifted.

To illustrate the procedure, the following 2 binary floating point numbers should be added together:

01000011011101100100011100000000

and

01000001010100110011100011011101

1. Convert exponents to decimal numbers:

The first step is to convert the exponents of the two binary floating point numbers into decimal numbers. The bias can be subtracted, but does not have to be. In this example, the bias is not subtracted from the exponent.

number 1:
10000110₂ = 128 + 4 + 2 = 134

number 2:
10000010₂ = 128 + 2 = 130

2. Prepend implicit 1 to mantissas:

Next, "1." is written in front of the two mantissas:

number 1:
1.11101100100011100000000

number 2:
1.10100110011100011011101

In combination with the exponents, the numbers are therefore:

number 1:
1.111011001000111₂ ∙ 2^(134 − bias)

number 2:
1.10100110011100011011101₂ ∙ 2^(130 − bias)
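
To make the meaning of "1.mantissa ∙ 2^(exponent − bias)" concrete, here is a small Python sketch that evaluates it exactly. It assumes a normal number (exponent field neither all zeros nor all ones); the helper names significand and exact_value are chosen for this example only.

```python
from fractions import Fraction

def significand(mantissa):
    """Prepend the implicit 1 and read all digits as one integer."""
    return int("1" + mantissa, 2)

def exact_value(exponent, mantissa, bias=127):
    """Value of 1.mantissa * 2^(exponent - bias) as an exact fraction."""
    # The integer from significand() counts units of 2^scale.
    scale = exponent - bias - len(mantissa)
    return significand(mantissa) * Fraction(2) ** scale

print(float(exact_value(134, "11101100100011100000000")))  # 246.27734375
```

Working with the significand as an integer avoids any string handling of the binary point; the position of the point is carried by the scale alone.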

3. Shift binary point to align exponents:

If the exponents of the two numbers differ, the difference between the larger and smaller exponent is calculated. The binary point of the number with the smaller exponent is then shifted to the left and the exponent is adjusted accordingly until the exponents of the two numbers are equal.

The difference of the two exponents is 4 and the exponent of the second number is smaller than that of the first number. Therefore, the binary point of the second number must be shifted 4 places to the left.

number 2:
0.000110100110011100011011101₂ ∙ 2^(134 − bias)
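
The alignment can also be sketched with integers. Instead of moving the binary point of the smaller number to the left, this hypothetical align helper appends zeros to the number with the larger exponent, which has the same effect and keeps every bit. The function names and the integer representation are assumptions of this example.

```python
def align(exp_a, sig_a, exp_b, sig_b, frac_bits=23):
    """Bring two significands (integers with `frac_bits` fractional bits)
    to a common exponent. Returns (exponent, sig_a, sig_b, frac_bits)."""
    diff = abs(exp_a - exp_b)
    if exp_a >= exp_b:
        sig_a <<= diff        # append `diff` zeros to the larger number
    else:
        sig_b <<= diff
    return max(exp_a, exp_b), sig_a, sig_b, frac_bits + diff

def show(sig, frac_bits):
    """Render an integer significand with its binary point."""
    digits = format(sig, "b").zfill(frac_bits + 1)
    return digits[:-frac_bits] + "." + digits[-frac_bits:]

exp, a, b, frac = align(134, 0b111101100100011100000000,
                        130, 0b110100110011100011011101)
print(exp, show(b, frac))  # 134 0.000110100110011100011011101
```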

4. Add mantissas:

The mantissas are added in the same way as two decimal numbers are added by hand. The two numbers are written one below the other so that their binary points line up. A row is kept clear for the carry, either below the two numbers or above them. Then, starting from the right, the digit of the upper number, the digit of the lower number and, if present, the carry are added together. It is important that the sum of the digits is always written as a binary number, so 10₂ instead of 2 and 11₂ instead of 3. If the sum in a column consists of one digit, it is written in the same column under the solution line. If it consists of two digits, the back digit is written in the solution line and the front digit is written in the carry row one column to the left.

  1.11101100100011100000000    
+ 0.000110100110011100011011101
 11 1111       111             
 10.000001101111010100011011101

The result of the addition is therefore:
10.000001101111010100011011101₂ ∙ 2^(134 − bias)
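
The column-by-column addition can be written down as a short Python sketch. It works on digit strings without the binary point, padded to the same length; add_binary is a name invented for this example.

```python
def add_binary(a, b):
    """Add two equally long strings of binary digits column by column."""
    carry = 0
    out = []
    for x, y in zip(reversed(a), reversed(b)):
        total = int(x) + int(y) + carry   # 0, 1, 2 (binary 10) or 3 (binary 11)
        out.append(str(total % 2))        # back digit goes into the solution line
        carry = total // 2                # front digit goes into the carry row
    if carry:
        out.append("1")
    return "".join(reversed(out))

# Both numbers from the example, written with 27 fractional digits.
print(add_binary("1111011001000111000000000000",
                 "0000110100110011100011011101"))
# 10000001101111010100011011101
```

Reinserting the binary point 27 digits from the right gives the sum shown above.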

5. Normalization:

It can happen that after adding, the binary point no longer appears after the first 1. In this case, the binary point must be shifted behind the first 1 again and the exponent is adjusted accordingly.

The binary point must be shifted one place to the left so that it is positioned after the first 1.

1.0000001101111010100011011101₂ ∙ 2^(135 − bias)

6. Rounding:

The number must be rounded so that there are only as many digits after the binary point as can be stored in the mantissa of the binary floating point number. For floating point numbers with 32 bits, this would be 23, for example.

There are different rounding modes. You can always round up, always round down or always round towards 0. However, the most sensible and most frequently used mode is rounding to the nearest representable number. This is done by looking at what follows the least significant bit (the last bit that fits into the mantissa). If the least significant bit is followed by a 0, the number is nearer to the representable number whose absolute value is smaller. If the least significant bit is followed by a 1 and there is another 1 in any later position, the number is nearer to the representable number whose absolute value is greater. If the bit after the least significant bit is a 1 and no further 1 follows, the number lies exactly between 2 representable numbers. In this case it is rounded so that the least significant bit is 0 afterwards (round half to even).

For floating point numbers with 32 bits, 23 bits are available for the mantissa. However, the number has 28 fractional digits. It must therefore be rounded.

1.0000001101111010100011[0]11101₂ ∙ 2^(135 − bias)

The last bit that fits into the mantissa is shown in square brackets. As the bit after this bit is a 1 and is followed by several more ones, the number is closer to the next larger representable number and must therefore be rounded up.

1.00000011011110101000111₂ ∙ 2^(135 − bias)
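
Rounding to the nearest representable number, with ties going to the even neighbour, can be sketched as follows. The significand is again treated as an integer, and round_nearest_even is a hypothetical helper that is not taken from the calculator.

```python
def round_nearest_even(sig, extra_bits):
    """Drop the lowest `extra_bits` bits of `sig`, rounding to nearest
    and, on a tie, to the value whose last kept bit is 0."""
    if extra_bits <= 0:
        return sig                             # nothing to round
    kept = sig >> extra_bits
    remainder = sig & ((1 << extra_bits) - 1)  # the bits that are cut off
    half = 1 << (extra_bits - 1)               # a 1 followed only by zeros
    if remainder > half or (remainder == half and kept & 1):
        kept += 1   # may add a digit at the front; then normalize again
    return kept

# 28 fractional digits must be reduced to 23, so 5 bits are dropped.
rounded = round_nearest_even(0b10000001101111010100011011101, 5)
print(format(rounded, "b"))  # 100000011011110101000111
```

If rounding up turns a significand of all ones into a 1 followed by zeros, the number has to be normalized once more; the sketch leaves that to the caller.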

7. Convert exponent to binary number:

If the bias was subtracted in the first step, it must now be added to the exponent again. The biased exponent is then converted into a binary number.

135 converted into a binary number is: 10000111₂

8. Assemble floating point number:

Finally, all components are combined to form a binary floating point number in the IEEE-754 standard.

The sign bit corresponds to the sign bit of the two summands.

The exponent bits calculated in step 7 are written to the exponent part. If the exponent bits are not sufficient to completely fill the exponent part, the exponent part is filled with zeros at the front.

The bits after the binary point are written to the mantissa part. If these bits are not sufficient to completely fill the mantissa, the mantissa is filled with zeros at the trailing end.

The result is:
01000011100000011011110101000111
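
The final two steps, and a cross-check of the whole example, might look like this in Python. The check relies on the standard struct module to convert between bit patterns and binary32 values; assemble, to_float and to_bits are names made up for this sketch, and it assumes a normalized 24-bit significand.

```python
import struct

def assemble(sign, exponent, significand):
    """Join sign bit, 8 exponent bits and the 23 digits after the binary point."""
    # format(..., "024b")[1:] drops the implicit leading 1 again.
    return str(sign) + format(exponent, "08b") + format(significand, "024b")[1:]

def to_float(bits):
    """Interpret a 32-character bit string as a binary32 value."""
    return struct.unpack(">f", int(bits, 2).to_bytes(4, "big"))[0]

def to_bits(x):
    """Round x to binary32 and return its bit string."""
    return format(int.from_bytes(struct.pack(">f", x), "big"), "032b")

result = assemble(0, 135, 0b100000011011110101000111)
print(result)  # 01000011100000011011110101000111

a = to_float("01000011011101100100011100000000")
b = to_float("01000001010100110011100011011101")
print(to_bits(a + b) == result)  # True
```

For this example the sum a + b is exact in Python's 64-bit floats, so the only rounding happens when struct packs it into 32 bits, which mirrors step 6.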

Subtraction of binary floating point numbers with the same sign

2 numbers that are in binary IEEE-754 floating point format and have the same sign can be subtracted using the following steps:

  1. Convert exponents to decimal numbers
  2. Prepend implicit 1 to mantissas
  3. Shift binary point to align exponents
  4. Subtract mantissas
  5. Normalization
  6. Rounding
  7. Convert exponent to binary number
  8. Assemble floating point number

The subtraction of 2 binary floating point numbers differs from the addition of binary floating point numbers mainly in step 4.

To illustrate the procedure, the following 2 binary floating point numbers are to be subtracted:

minuend:
00111100011010110111000000100000

subtrahend:
00111101100010110001101110000110

1. Convert exponents to decimal numbers:

First, the exponents are converted into decimal numbers. You can subtract the bias, but you don't have to.

number 1:
01111000₂ = 2^6 + 2^5 + 2^4 + 2^3
 = 64 + 32 + 16 + 8
 = 120

number 2:
01111011₂ = 2^6 + 2^5 + 2^4 + 2^3 + 2^1 + 2^0
 = 64 + 32 + 16 + 8 + 2 + 1
 = 123

2. Prepend implicit 1 to mantissas:

"1." is written in front of the two mantissas:

number 1:
1.110101101110000001₂ ∙ 2^(120 − bias)

number 2:
1.0001011000110111000011₂ ∙ 2^(123 − bias)

3. Shift binary point to align exponents:

If the two exponents differ, the difference between the larger exponent value and the smaller exponent value is calculated. The binary point of the number with the smaller exponent is then shifted to the left and the exponent is adjusted accordingly so that both numbers then have the same exponent.

The exponent of number 1 is smaller than that of number 2 and the difference between the exponents is 3. Therefore, the binary point of number 1 must be shifted 3 places to the left and the exponent adjusted accordingly.

0.001110101101110000001₂ ∙ 2^(123 − bias)

4. Subtract mantissas:

If you want to subtract in the decimal system and the subtrahend is greater than the minuend, then you swap the minuend and the subtrahend and change the sign of the result. For example, if you want to calculate 3 − 5, first calculate 5 − 3 and write a minus sign in front of the result. This is also how you do it with binary numbers.

To subtract 2 numbers, a method of subtraction using regrouping or the so-called Austrian method can be used.

Regrouping:

The numbers are written one below the other so that their binary points are underneath each other. The larger of the two numbers must always be at the top.

The second number is greater than the first and therefore the second number must be at the top.

 1.0001011000110111000011
-0.0011101011011100000010
  .                      

The columns are run through from right to left and if the lower digit is not greater than the upper digit, then the lower digit is subtracted from the upper digit.

 1.0001011000110111000011
-0.0011101011011100000010
                         
  .             011000001

If the lower digit is greater than the upper digit, a leading 1 must be prepended to the upper digit. This 1 is subtracted from the upper digit one column to the left. This can be written down by crossing out the number that you want to change and writing the number that you want to replace it with above it. It is important that the numbers are not interpreted as decimal numbers, but as binary numbers. 10 therefore stands for the decimal 2.

             010         
 1.0001011000110111000011
-0.0011101011011100000010
  .            1011000001

The lower digit is greater than the changed upper digit. Therefore, a leading 1 is prepended to the changed upper digit and the upper digit one column to the left is reduced by 1.

            01010         
 1.0001011000110111000011
-0.0011101011011100000010
  .           11011000001

If a leading 1 is to be placed in front of the upper digit of the current column and the upper digit in the column one place further to the left is 0, then the upper digit in the left-hand column must also be prefixed with a leading 1 before it can be reduced by 1, and the upper digit one column further to the left must be reduced by 1. And if this is also a 0, then it must also be prefixed with a leading 1 and so on.

        010101001010         
 1.0001011000110111000011
-0.0011101011011100000010
  .         1011011000001

 0 101010101001010101001010         
 1.0001011000110111000011
-0.0011101011011100000010
 0.1101101101011011000001

As the minuend and the subtrahend have been swapped, the sign of the result must be reversed.

-0.1101101101011011000001₂ ∙ 2^(123 − bias)

Austrian method:

The numbers are written one below the other so that their binary points are underneath each other. The larger of the two numbers must always be at the top. One line is left blank below the two numbers.

The second number is greater than the first and therefore the second number must be at the top.

 1.0001011000110111000011
-0.0011101011011100000010
                         
  .                      

Then go through the columns from right to left and subtract the lower digit from the upper digit.

 1.0001011000110111000011
-0.0011101011011100000010
                         
  .             011000001

If the lower digit is greater than the upper digit, a leading 1 must be prepended to the upper digit. This 1 is written one column to the left below the lower number. It is important that the upper digit prepended by the 1 is read as a binary number, so 10₂ = 2 or 11₂ = 3.

 1.0001011000110111000011
-0.0011101011011100000010
              1          
  .            1011000001

If there is a 1 below the lower number in a column, this is added to the lower number in your mind before it is subtracted from the upper number. If the sum is greater than the upper digit, then the upper digit must be prepended by a leading 1 again.

 1.0001011000110111000011
-0.0011101011011100000010
             11          
  .           11011000001

 1.0001011000110111000011
-0.0011101011011100000010
 1 1111 1111 11          
 0.1101101101011011000001

As the minuend and the subtrahend have been swapped, the sign of the result must be reversed.

-0.1101101101011011000001₂ ∙ 2^(123 − bias)
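
The Austrian method translates almost directly into code. The following Python sketch works on digit strings without the binary point and expects the larger number first; subtract_binary is a name chosen for this example.

```python
def subtract_binary(upper, lower):
    """Subtract equally long binary digit strings column by column.
    `upper` must not be smaller than `lower`."""
    borrow = 0   # the 1 noted below the lower number
    out = []
    for x, y in zip(reversed(upper), reversed(lower)):
        top = int(x)
        bottom = int(y) + borrow   # add the noted 1 to the lower digit
        if bottom > top:
            top += 2               # prepend a leading 1: binary 10 or 11
            borrow = 1             # note a 1 one column to the left
        else:
            borrow = 0
        out.append(str(top - bottom))
    if borrow:
        raise ValueError("upper number is smaller than lower number")
    return "".join(reversed(out))

# The swapped operands from the example, 22 fractional digits each.
print(subtract_binary("10001011000110111000011",
                      "00011101011011100000010"))
# 01101101101011011000001
```

With the binary point put back after the first digit, this is the difference 0.1101101101011011000001 from above; the minus sign still has to be added because the operands were swapped.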

5. Normalization:

If the binary point is not after the first 1, then the binary point must be shifted there. The exponent is adjusted accordingly.

The binary point is not placed after the first 1, so the binary point must be shifted.

-1.101101101011011000001₂ ∙ 2^(122 − bias)
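
Normalization in either direction can be sketched with one hypothetical helper. It treats the magnitude as an integer with a known number of fractional bits and assumes that the adjusted exponent stays in the range of normal numbers.

```python
def normalize(sig, exponent, frac_bits):
    """Place the binary point directly after the leading 1.
    Returns the adjusted (exponent, frac_bits); `sig` itself is unchanged."""
    if sig == 0:
        raise ValueError("zero has no leading 1")
    new_frac_bits = sig.bit_length() - 1   # all digits after the leading 1
    return exponent + (new_frac_bits - frac_bits), new_frac_bits

# 0.1101101101011011000001 with exponent 123: point moves one place right.
print(normalize(0b01101101101011011000001, 123, 22))  # (122, 21)
```

The same helper covers the addition example, where the point moves one place to the left: normalize(0b10000001101111010100011011101, 134, 27) returns (135, 28).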

6. Rounding:

If the number has more fractional digits than the mantissa of the binary floating point number has bits, the number must be rounded. The most frequently used rounding mode, and the one that makes the most sense for most applications, is rounding to the nearest representable number. If the number is located exactly between 2 representable numbers (the digit after the least significant bit is a 1 and is not followed by another 1 in any position), the number is rounded to the one that has a 0 in the least significant bit.

The number has 21 fractional digits. These fit into the 23 bits of the mantissa. This means that the number does not need to be rounded.

7. Convert exponent to binary number:

If the bias was subtracted in step 1, it is now added again. The biased exponent is then converted into a binary number.

122 converted to a binary number is: 1111010₂

8. Assemble floating point number:

Finally, all 3 parts of the binary floating point number are combined.

If the absolute value of the first input number (the minuend) is greater than that of the second input number (the subtrahend), the result keeps the sign of the two input numbers. Otherwise the sign must be reversed.

The binary number calculated in step 7 is written to the exponent part. If this does not fill all the exponent bits, the exponent part is filled with zeros at the front.

The fractional digits of the normalized and rounded binary number are written to the mantissa. If these do not completely fill the mantissa, it is filled with zeros at the end.

The result is:

10111101010110110101101100000100

The number is negative because the first input number is smaller than the second and the two numbers therefore had to be swapped for subtraction.

Binary floating point numbers with different signs

In order to add or subtract 2 binary IEEE-754 floating point numbers as described above, they must have the same sign. If this is not the case, the calculation can still be performed by first changing the sign of one of the two numbers and then performing a subtraction instead of an addition or an addition instead of a subtraction.
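
The sign handling can be sketched as a small lookup. The function to_same_sign and the strings "add" and "sub" are illustrative choices; the sketch always flips the sign of the second number.

```python
def to_same_sign(operation, sign_a, sign_b):
    """Rewrite 'a operation b' so that both operands carry the sign of a.
    Signs are given as sign bits: 0 for positive, 1 for negative."""
    if sign_a == sign_b:
        return operation, sign_a, sign_b
    swapped = {"add": "sub", "sub": "add"}[operation]
    return swapped, sign_a, sign_a   # the sign of b has been flipped

print(to_same_sign("add", 0, 1))  # ('sub', 0, 0): a + (-b) becomes a - b
print(to_same_sign("sub", 1, 0))  # ('add', 1, 1): (-a) - b becomes (-a) + (-b)
```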

