Stream: cql
Topic: Decimal equivalence / predecessor / successor and intervals
JP (Jul 20 2021 at 21:52):
There are a couple of potential inconsistencies in the CQL spec that I'm looking to get feedback on, in particular from @Chris Moesel
The decimal equivalence operator does the comparison at the precision of the least precise operand. IOW,
1.1 ~ 1.199999 = true
1.1 ~ 1.100000 = true
That implies that these intervals are equivalent:
[1.1, 1.2) ~ [1.1, 1.1] ~ [1.1, 1.1999999]
The predecessor / successor operations for decimals are defined as the minimum step size of decimal:
For Decimal, predecessor is equivalent to subtracting the minimum precision value for the Decimal type, or 10^-8.
This is potentially problematic for CQL implementations that support higher levels of precision in that their minimum step size _could_ be more granular.
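For illustration, a minimal sketch of that current definition using Python's decimal module (the helper names here are just for illustration, not anything from the spec or either engine):

from decimal import Decimal

MIN_STEP = Decimal("1E-8")  # the spec's minimum precision value for Decimal

def successor(d: Decimal) -> Decimal:
    # current definition: always add the fixed minimum step,
    # regardless of the precision the value was written at
    return d + MIN_STEP

def predecessor(d: Decimal) -> Decimal:
    return d - MIN_STEP

print(successor(Decimal("1.1")))   # 1.10000001
print(predecessor(Decimal("1.2"))) # 1.19999999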
That definition also runs somewhat counter to the explanation of how to handle precision in the collapse / expand operations:
Note that because the semantics for overlaps and meets are themselves defined in terms of the interval successor and predecessor operators, sets of Date-, DateTime-, or Time-based intervals that are only defined to a particular precision will calculate meets and overlaps at that precision.
It seems to me that decimal intervals should also be handled with the same rules for precision.
If the per argument is null, a per value will be constructed based on the coarsest precision of the boundaries of the intervals in the input set.
This means that, given expand { [1.0, 1.2000000) },
you'd get a default per of .1
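That default-per selection might look something like this (a sketch only, assuming the decimal places a boundary is written with count as its precision):

from decimal import Decimal

def default_per(boundaries):
    # per = one unit at the coarsest precision (fewest decimal places)
    # among the written boundaries
    coarsest = min(-b.as_tuple().exponent for b in boundaries)
    return Decimal(1).scaleb(-coarsest)

print(default_per([Decimal("1.0"), Decimal("1.2000000")]))  # 0.1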
Intuitively, if I run that I'd expect a result that's { [1.0, 1.1), [1.1, 1.2) }
Instead, in either the JS or Java engine you get a result like:
{ [1.0, 1.09999999], [1.1, 1.1999999] }
This is due to the fact that the predecessor and successor operations for decimals are defined as the minimum step size.
I think that result is potentially incorrect because:
- It assumes a specific level of precision in these particular engines that may not be present in other implementations
- It constructs boundaries at a greater precision than the "per", which is counter-intuitive given the initial truncation behavior.
- It implies additional precision in the calculated intervals that's not present in the original numbers.
I think the resolution is to define the successor / predecessor operations in terms of the precision of the original decimal, the same as the other types. This would mean:
- More precise implementations could support a smaller step size.
- The predecessor / successor definitions would be consistent with the equivalence definition
- The results of expand / collapse would preserve the original precision and give the "intuitively correct" result.
- Additional precision would not be implied by the result of expand / collapse.
So, the end result would be (these would be equivalent based on the decimal equivalence semantics):
expand { [1.0, 1.2000000] } =>
{ [1.0, 1.1), [1.1, 1.2), [1.2, 1.3) } OR
{ [1.0, 1.0], [1.1, 1.1], [1.2, 1.2] }
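As a rough sketch of the proposed precision-aware behavior (Python again; this is just the proposal described above, with trailing zeroes normalized away, which is itself an assumption):

from decimal import Decimal

def step_of(d: Decimal) -> Decimal:
    # one unit in the value's last significant decimal place;
    # normalize() drops trailing zeroes (e.g. 1.2000000 -> 1.2)
    return Decimal(1).scaleb(d.normalize().as_tuple().exponent)

def successor(d: Decimal) -> Decimal:
    return d + step_of(d)

def predecessor(d: Decimal) -> Decimal:
    return d - step_of(d)

print(successor(Decimal("1.1")))   # 1.2
print(predecessor(Decimal("1.2"))) # 1.1
# so expand { [1.0, 1.2000000) } at a per of 0.1 could produce
# { [1.0, 1.1), [1.1, 1.2) } instead of { [1.0, 1.09999999], [1.1, 1.19999999] }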
Chris Moesel (Jul 21 2021 at 02:57):
I don't think I have the mental capacity to process this right now (after a long day of work) but I'll try to take a look and provide feedback tomorrow! If the day goes by and you haven't heard from me, feel free to send a reminder, @JP!
JP (Jul 21 2021 at 18:39):
I'll try to boil it down a bit.
Both DateTime and Decimal are precision-aware for equivalence (though behavior of each is different):
1.56 ~ 1.5 = true
@2018 ~ @2018-01 = false
Both DateTime and Decimal are precision-aware for expand/collapse boundaries and selecting the "per":
If the per argument is null, a per value will be constructed based on the coarsest precision of the boundaries of the intervals in the input set. For example, a list of DateTime-based intervals where the boundaries are a mixture of hours and minutes will expand at the hour precision.
DateTime is precision-aware for predecessor/successor:
Successor(@2018) = @2019
Predecessor(@2019-02) = @2019-01
Decimal is NOT precision-aware for predecessor/successor, rather it's "hard-coded" to minimum step-size
For Decimal, predecessor is equivalent to subtracting the minimum precision value for the Decimal type, or 10^-8.
This gives counter-intuitive and, I'd argue, incorrect results.
One issue is that the predecessor / successor operators should not increase the precision of the original value. This leads to potential errors in downstream calculations that are precision-aware.
1.4 ~ 1.45 = true
Successor(1.4) = 1.40000001
1.40000001 ~ 1.45 = false
The predecessor of 2*10^4 (20,000) should be 1*10^4 (10,000) because it's defined at a precision of 10^4. Similarly, the successor of 1*10^-4 (0.0001) should be 2*10^-4 (0.0002) because it's defined at a precision of 10^-4.
The DateTime analog which is obviously incorrect is:
Successor(@2018) = @2018-01-01T00:00:00.0000001
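To make the precision leak concrete, a quick sketch of just the fixed-step arithmetic (Python, decimal module):

from decimal import Decimal

succ = Decimal("1.4") + Decimal("1E-8")   # the current fixed-step successor
print(succ)                               # 1.40000001
print(-succ.as_tuple().exponent)          # 8 decimal places, where the input had 1
# any downstream comparison that keys off the least precise operand
# now sees eight decimal places instead of one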
Another issue is that CQL permits implementations to support a greater precision than the minimum required. In that scenario, the "successor" and "predecessor" would be broken in that the specified step-size is larger than the minimum step-size supported by the implementation.
Making predecessor/successor precision-aware and combining that with the expand/collapse operators gives you results that are more intuitive (IMO), don't introduce potential precision-related errors downstream, and allow specific implementations to support arbitrary levels of precision.
Old:
expand { [1.0, 1.2000000] } =>
{ [1.0, 1.09999999], [1.1, 1.1999999], [1.2, 1.2999999] }
New:
expand { [1.0, 1.2000000] } =>
{ [1.0, 1.1), [1.1, 1.2), [1.2, 1.3) } OR
{ [1.0, 1.0], [1.1, 1.1], [1.2, 1.2] }
Chris Moesel (Jul 22 2021 at 14:54):
Thanks, @JP. If I recall correctly, using precision in predecessor/successor has been proposed before, and, after some discussion, we ultimately decided against it. I think you make some convincing points above, but I'd like to see if I can dig up any notes from those old conversations to see what our arguments were against this approach. Decimal precision is kind of tricky and confusing, so I expect that trying to address inconsistencies in one area might introduce inconsistencies in another.
A few notes right away though...
I think that your interpretation (or at least examples) of decimal equivalence are not quite correct. According to the spec (emphasis mine):
For decimals, equivalent means the values are the same with the comparison done on values rounded to the precision of the least precise operand; trailing zeroes after the decimal are ignored in determining precision for equivalent comparison.
So, actually:
1.1 ~ 1.199999 = false // since 1.199999 gets rounded to 1.2
which in turn means:
[1.1, 1.2) !~ [1.1, 1.1] // at least according to current predecessor definition
// and
[1.1, 1.2) ~ [1.1, 1.1999999] // but would NOT be equivalent if we change the predecessor definition
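A sketch of that reading in Python (assuming half-up rounding, which the spec text doesn't pin down; the helper names are just for illustration):

from decimal import Decimal, ROUND_HALF_UP

def places(d: Decimal) -> int:
    # decimal places, ignoring trailing zeroes after the decimal point
    return max(0, -d.normalize().as_tuple().exponent)

def equivalent(a: Decimal, b: Decimal) -> bool:
    # round both operands to the precision of the least precise one, then compare
    q = Decimal(1).scaleb(-min(places(a), places(b)))
    return a.quantize(q, ROUND_HALF_UP) == b.quantize(q, ROUND_HALF_UP)

print(equivalent(Decimal("1.1"), Decimal("1.199999")))  # False: 1.199999 rounds to 1.2
print(equivalent(Decimal("1.1"), Decimal("1.100000")))  # True: trailing zeroes ignored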
The [1.1, 1.2) ~ [1.1, 1.1999999] example does point to one inconsistency we would introduce if we change the definition of predecessor. In the Author's Guide introduction to intervals, it indicates that Interval[3.0, 5.0) "contains all the real numbers >= 3.0 and < 5.0." But the definition of interval End (in the CQL Reference) says: "If the high boundary of the interval is open, this operator returns the predecessor of the high value of the interval." So if we change predecessor to use precision, then Interval[3.0, 5.0) becomes Interval[3.0, 4.9] and does not contain all real numbers < 5 (for example, 4.95).
Regarding the unexpected results of expand/collapse, you noted that you expected { [1.0, 1.1), [1.1, 1.2) } but got { [1.0, 1.09999999], [1.1, 1.1999999] }. Based on the current definitions, those are indeed equivalent. Execution engines could potentially detect these situations and translate to the simpler representation and still be spec-compliant.
If we change the definition of predecessor/successor, then expand { [1.0, 1.2000000) } becomes { [1.0, 1.1), [1.1, 1.2) }, which (according to the proposed definition) is actually the same as { [1.0, 1.0], [1.1, 1.1] }. The expanded set now contains gaps between 1.0 and 1.1 and after 1.1. The original unexpanded interval contains 1.05 and 1.11; the new expanded set of intervals does not. I'd suggest that it is a problem if expanded intervals do not encompass all the same numbers as the original interval contained.
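To make that gap concern concrete, a sketch under the plain real-number reading of closed intervals (not precision-aware containment):

from decimal import Decimal

def contains(low: Decimal, high: Decimal, x: Decimal) -> bool:
    # closed-interval containment over the reals
    return low <= x <= high

x = Decimal("1.05")
print(Decimal("1.0") <= x < Decimal("1.2000000"))      # True: in the original interval
print(contains(Decimal("1.0"), Decimal("1.0"), x)
      or contains(Decimal("1.1"), Decimal("1.1"), x))  # False: in neither expanded interval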
All that said, I do understand the concern about introducing more precision than was originally specified -- but I wanted to bring at least the above into the discussion before looking into it more.
JP (Jul 22 2021 at 15:15):
The expanded set now contains gaps between 1.0 and 1.1 and after 1.1.
I don't think this is the case given precision-aware equivalence. Since the intervals would be defined at a granularity of .1, the intervals { [1.0, 1.0], [1.1, 1.1] } would meet. Increments smaller than .1 simply don't exist or can't be represented. That's actually an important part of why the operators should not introduce additional precision. It creates gaps where there formerly were none.
I think the DateTime comparison is relevant here again. There's no gap between { [@2018, @2018], [@2019, @2019] }, and there's no gap between { [@2018-01-01, @2018-01-01], [@2018-01-02, @2018-01-02] }. The precision at which the boundaries are specified determines that.
Interval[3.0, 5.0) becomes Interval[3.0, 4.9] and does not contain all real numbers < 5 (for example, 4.95).
The same applies with the current definition, except that rather than 4.95, it's 4.9999999995. There's still the same "gap" between 5 and the next number smaller than 5; it's just hard-coded to be the minimum decimal increment defined by the spec. It's not the case today that the intervals generated contain "all real numbers < 5". They contain "all real numbers < 5 that can be represented by the current minimum decimal precision listed in the spec, even though more precise implementations may actually be able to represent more of the numbers < 5." For any given level of precision there will always be an arbitrarily small gap. The way around the "gap" is to make the interval calculations precision-aware, such that any gap that's smaller than the level of precision is not a gap. :smile:
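The same kind of check against the current fixed-step definition (a sketch; the closed form of Interval[3.0, 5.0) here uses the spec's 10^-8 predecessor):

from decimal import Decimal

x = Decimal("4.9999999995")                   # more precise than 10^-8
print(Decimal("3.0") <= x < Decimal("5.0"))   # True: inside the half-open interval
print(x <= Decimal("4.99999999"))             # False: outside the fixed-step closed form [3.0, 4.99999999]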
JP (Jul 22 2021 at 15:23):
I'd suggest that it is a problem if expanded intervals do not encompass all the same numbers as the original interval contained.
I think this too is already handled. Given precision-aware successors/predecessors the new expanded intervals are actually broader than the input intervals:
{ [1.0, 1.0], [1.1, 1.1] } -> does contain 1.19999999999999, given precision-aware equivalence.
This is consistent with guidance in the expand operation definition:
If the interval boundaries are more precise than the per quantity, the more precise values will be truncated to the precision specified by the per quantity. In these cases, the resulting list of intervals may be more broad than the input range due to this truncation. For example:
JP (Jul 22 2021 at 15:30):
Actually, I take that back. Given this bit:
For decimals, equivalent means the values are the same with the comparison done on values rounded to the precision of the least precise operand; trailing zeroes after the decimal are ignored in determining precision for equivalent comparison
you are correct. It wouldn't contain that.
JP (Jul 22 2021 at 15:33):
There's another inconsistency here, which is that the "expand" operator specifies truncation rather than rounding:
If the interval boundaries are more precise than the per quantity, the more precise values will be truncated to the precision specified by the per quantity.
JP (Jul 22 2021 at 15:39):
So there are at least two issues here that make it difficult to produce an implementation of "expand" for decimals:
- The successor / predecessor operations that use a fixed precision
- Truncation vs rounding in expand and equivalent, respectively (see the sketch below)
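For example, with a per of 0.1 the two rules already disagree on a boundary like 1.19 (a sketch; quantize just stands in for the spec's truncation and rounding wording):

from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

per = Decimal("0.1")
boundary = Decimal("1.19")
print(boundary.quantize(per, ROUND_DOWN))     # 1.1 -- truncated, per the expand definition
print(boundary.quantize(per, ROUND_HALF_UP))  # 1.2 -- rounded, per the equivalent definition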
JP (Jul 22 2021 at 17:25):
If rounding were used in the interval construction instead of truncation, you still run into the scenario where the constructed intervals may exclude values in the original set. It depends on whether a lower boundary is rounded up or an upper boundary is rounded down. So changing the "expand" definition for decimals to use rounding does not solve that particular issue on its own.
JP (Jul 22 2021 at 17:51):
Both DateTime and Decimals represent continuous variables to an arbitrary level of precision. DateTimes use a different base for each level of precision (365, 12, 30, 24, 60, 60...) so it's more clear that comparisons are only valid for a given level of precision. That makes me think rounding is the wrong answer for decimal equivalence. That said, comparing decimals only to the precision of the coarsest operand leads to some counter-intuitive results.
1 >= 1.05 = null
@2018 >= @2018-01-02 = null
Hmm...
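Purely hypothetically, a precision-aware comparison that returns null when the answer depends on digits beyond the coarsest operand's precision might look like this (a sketch only; nothing in the spec defines this, and None stands in for CQL's null):

from decimal import Decimal, ROUND_DOWN
from typing import Optional

def places(d: Decimal) -> int:
    return max(0, -d.normalize().as_tuple().exponent)

def greater_or_equal(a: Decimal, b: Decimal) -> Optional[bool]:
    # compare only down to the precision of the coarsest operand;
    # if the operands agree at that precision but differ below it, the result is uncertain
    q = Decimal(1).scaleb(-min(places(a), places(b)))
    ta, tb = a.quantize(q, ROUND_DOWN), b.quantize(q, ROUND_DOWN)
    if ta == tb and (a != ta or b != tb):
        return None
    return a >= b

print(greater_or_equal(Decimal("1"), Decimal("1.05")))  # None, like @2018 >= @2018-01-02
print(greater_or_equal(Decimal("1"), Decimal("2.05")))  # False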