
Monday, September 20, 2010

On the hazards of integer arithmetic

A short essay on why integer arithmetic is 'bad'. I'm not out to start a holy war; I just ran into some trouble with a few of the facets of using integers...

My simulator, as a base component, needs to compute energies according to some model, and we'd like what it computes to agree with the parameter sets and models used in various literature sources. It does just fine (and has for a while) on the current NUPACK parameters, but I recently needed to use a different parameter set to test some parts. It had been a while since I'd used that set, and due to the nature of the model it actually runs through a different underlying implementation.

So, that's the setup. The test set is ~55k different test cases, in 6 different classes of test (the largest class has ~32k cases, the smallest just 2). My program tested fine on nearly every test case, except for exactly 2048 from the class with 33344 [why yes, I did mean 'k'] test cases, and 8 from the class with 4956 test cases.

After several tries at bucketing my cases vs the 'good' ones to figure out which part of the distribution had this failure, I ended up with these two sets:
In [122]: buckets7
Out[122]:
[{'(10, 2)': 256,
'(13, 2)': 256,
'(14, 2)': 256,
'(2, 10)': 256,
'(2, 13)': 256,
'(2, 14)': 256,
'(2, 22)': 256,
'(22, 2)': 256}]
In [123]: buckets8
Out[123]:
[{...
'(10, 1)': 2304,
'(11, 2)': 256,
'(12, 2)': 256,
'(15, 2)': 256,
'(16, 2)': 256,
'(17, 2)': 256,
'(18, 2)': 256,
'(19, 2)': 256,
'(20, 2)': 256,
'(21, 2)': 256,
'(23, 2)': 256,
'(24, 2)': 256,
'(25, 2)': 256,
'(26, 2)': 256,
'(27, 2)': 256,
'(28, 2)': 256,
'(29, 2)': 256,
'(30, 2)': 256,
... }]
Note that the ...'s are where I chopped out symmetric entries and all the ones of the form (1, x) or (x, 1) [a large part of the test set]. There are some very suspicious absences in the 'good' list; notably, the two sets don't intersect at all!

So, what's going on here? Now we get to the hazards of integer arithmetic. As it happens, the energy for these cases has a particular dependence on the sizes [or rather the sum of the sizes; e.g. for (12, 2) we care about n = 12 + 2 = 14]. For large n, we know that energy(n) ~= energy(0) + k * log(n), for some constant k and initial value energy(0). This approximation is not a good one for our actual physical system at small n, so what we do is take a lot of measurements at those small n and fit really good numbers to them. For all the parameter sets we know that for n > 30 we can just use the logarithm approximation.

Some of these sets have every value of n up to 30 specified, but depending on the quality of the set and other factors, others may only specify n <= 9. Since we'd rather look up a value than take an expensive logarithm every time we compute an energy, the normal practice is to store a lookup table of energy(n) for n <= 30, taking the log only if we get an n > 30 (unlikely for our systems). When the parameter set doesn't define all of those entries, we need to compute them ourselves. This is easy though: we can use the logarithm approximation to fill in the rest, and all we need is that same k we needed for n > 30. If the last measured parameter is at length j (so we know energy(j) but not energy(j+1) for our parameter set), we can compute lengths j+1, ..., 30 using the log.
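The table-filling scheme above can be sketched as follows. This is a minimal illustration, not the simulator's actual code; the constant and the measured values are made-up numbers:

```python
import math

K = 1.075                                # hypothetical log-extrapolation constant (kcal/mol)
measured = [4.1, 4.5, 4.9, 5.1, 5.2]     # hypothetical energy(1)..energy(5)

def build_table(measured, k, max_n=30):
    """Extend the measured energies up to max_n using the log approximation."""
    table = list(measured)
    j = len(table)                       # length of the last measured entry
    for n in range(j + 1, max_n + 1):
        # energy(n) = energy(j) + k * log(n / j)
        table.append(table[j - 1] + k * math.log(n / j))
    return table

def energy(n, table, k):
    """Look up energy(n); fall back to the log formula for n beyond the table."""
    if n <= len(table):
        return table[n - 1]
    return table[-1] + k * math.log(n / len(table))
```

This sketch uses the "jump directly from energy(j)" form of the extrapolation; as discussed below, the stepwise form is equivalent on the reals.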

With me so far?

Oh, what about integers? Well, you'd think that given the use of logs and so on, these parameters might be real numbers and not integers, and in theory that's actually the case. In practice, though, all the energies are measured to two decimal places (e.g. '2.45 kcal/mol'), and so some programmers decided it was better to do everything in integer arithmetic. So everything gets multiplied by 100 and stored as an integer; this made a lot of sense historically, given how much cheaper integer computation was than floating point. Nowadays that's not really an issue [I think?], though.
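That fixed-point convention amounts to something like this (the helper names are mine, not from any real codebase):

```python
def to_fixed(energy_kcal):
    """Store an energy like 2.45 kcal/mol as the integer 245."""
    return round(energy_kcal * 100)

def from_fixed(value):
    """Convert the integer representation back to kcal/mol."""
    return value / 100
```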

So, let's spot the error. For the moment, assume k = 107.5 (or so) is our constant, and that position j holds the last measured value.

From simple log arithmetic, we know:
log(a/b) = log(a) - log(b)

So if we have a sequence of numbers x[0], x[1], ..., x[j], and know that they follow this logarithmic form, we can calculate subsequent numbers:
 x[n+1] = x[n] + k * log((n+1) / n)
So for any position j < m <= 30, we can calculate:
 x[m] = x[j] + k * log((j+1)/j) + k * log((j+2)/(j+1)) + ... + k * log(m / (m-1))
      = (x[j] + k * log((j+1)/j) + ... + k * log((m-1)/(m-2))) + k * log(m / (m-1))
      = x[m-1] + k * log(m / (m-1))

Or, grouping them the other way:
 x[m] = x[j] + k * log((j+1)/j) + k * log((j+2)/(j+1)) + ... + k * log(m / (m-1))
      = x[j] + (k * log((j+1)/j) + k * log((j+2)/(j+1)) + ... + k * log(m / (m-1)))
      = x[j] + k * log(m / j)

So, either method gives you an equivalent algorithm for generating all the values up to 30, but there's an important difference: in the first method you 'remember' one piece of information as you go, x[m-1]; in the second you remember j (which you then need, because you also use x[j]). When implementing this, I used the first method: our other parameter sets are all floating-point numbers anyway, and so I could avoid storing j in addition to the normal loop variable.

Now, even though the algorithms are equivalent on the reals, on integers (with our k) they differ by 1 in several places due to compounded rounding issues:

index 11: 434 vs 434
index 12: 443 vs 444
index 13: 452 vs 452
index 14: 460 vs 460
index 15: 467 vs 468
index 16: 474 vs 475
index 17: 481 vs 481
...
index 23: 514 vs 514
index 24: 519 vs 518
index 25: 523 vs 523
...
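The drift above is easy to reproduce with a short sketch. The starting point (j = 10, x[10] = 424) and the use of round() for the integer conversion are my reconstruction for illustration, not values taken from the real parameter set, though they happen to reproduce the rows shown:

```python
import math

K = 107.5    # the log-extrapolation constant, in hundredths of kcal/mol
J = 10       # hypothetical index of the last measured value
X_J = 424    # hypothetical measured energy at index J (also in hundredths)

# Method 1: step from the previous, already-rounded integer value.
stepped = {J: X_J}
for n in range(J + 1, 31):
    stepped[n] = round(stepped[n - 1] + K * math.log(n / (n - 1)))

# Method 2: jump directly from the last measured value; round only once.
direct = {n: round(X_J + K * math.log(n / J)) for n in range(J + 1, 31)}

for n in range(J + 1, 31):
    marker = '' if stepped[n] == direct[n] else '  <-- differs'
    print(f'index {n}: {stepped[n]} vs {direct[n]}{marker}')
```

Rounding after every step lets the error compound, while the direct jump rounds only once; hence the off-by-one disagreements.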

And there you have it: integer rounding issues leading to a very slight disagreement in the computed energies, and to me spending a while tracking down the problem. At the precision I deal with, the right solution is probably to convert everything to floating point at the input stage - the later operations on energies all need floating point anyway - but on the internals side it's a pain in the neck to change everything over.