Different Approaches To Handling Tied Ranks
Reposted from Chris Webb's blog with the author's permission.
Even if blogging about MDX feels, these days, a bit like blogging about COBOL (I’ll be returning to DAX soon, I promise), here’s an interesting MDX problem I came across the other day that I thought was writing about.
Calculating ranks is one of those things in MDX that is slightly trickier than it first appears. There’s a RANK function, of course, but in order to get good performance from it you need to know what you’re doing. It’s fairly widely known that with normal ranks what you need to do is to order the set you’re using before you find the rank of a tuple inside that set. Consider the following query on Adventure Works:
WITH
MEMBER MEASURES.REGULARRANK AS
RANK([Customer].[Customer].CURRENTMEMBER,
ORDER(
[Customer].[Customer].[Customer].MEMBERS
, [Measures].[Internet Sales Amount]
, BDESC))
SELECT
{[Measures].[Internet Sales Amount], [Measures].REGULARRANK}
ON COLUMNS,
ORDER(
[Customer].[Customer].[Customer].MEMBERS
, [Measures].[Internet Sales Amount]
, BDESC)
ON ROWS
FROM [Adventure Works]
It’s unbearably slow (in fact I killed the query rather than wait for it to complete) because, in the calculated member, what we’re doing is ordering the set of all customers every time we calculate a rank. Obviously we don’t need to do this, so the solution to this problem is of course to order the set just once, use a named set to store the result, and then reference the named set in the calculated member as follows:
WITH
SET
ORDEREDCUSTOMERS AS
ORDER(
[Customer].[Customer].[Customer].MEMBERS
, [Measures].[Internet Sales Amount]
, BDESC)
MEMBER MEASURES.REGULARRANK AS
RANK([Customer].[Customer].CURRENTMEMBER,
ORDEREDCUSTOMERS)
SELECT
{[Measures].[Internet Sales Amount], [Measures].REGULARRANK}
ON COLUMNS,
ORDEREDCUSTOMERS
ON ROWS
FROM [Adventure Works]
This query now executes in 5 seconds on my laptop. You probably knew all this already though.
But what happens if you need to handle tied ranks? The approach above doesn’t give you tied ranks because the RANK function, in its two-parameter form, simply finds the position of a tuple in a set, and no two tuples can occupy the same position in a set. That’s why you get results like this:
Even though Courtney A. Edwards and Jackson L. Liu have the same value for Internet Sales Amount, their ranks are 799 and 800 respectively, because Courtney A. Edwards comes before Jackson L. Liu in the ORDEREDCUSTOMERS set.
BOL tells us of the three-parameter form of RANK that does give us tied ranks. This is how you use it:
WITH
SET
ORDEREDCUSTOMERS AS
ORDER(
[Customer].[Customer].[Customer].MEMBERS
, [Measures].[Internet Sales Amount]
, BDESC)
MEMBER [Measures].REGULARRANKTIED AS
RANK([Customer].[Customer].CURRENTMEMBER
,[Customer].[Customer].[Customer].MEMBERS,[Measures].[Internet Sales Amount])
SELECT
{[Measures].[Internet Sales Amount], [Measures].REGULARRANKTIED}
ON COLUMNS,
ORDEREDCUSTOMERS
ON ROWS
FROM [Adventure Works]
But, unfortunately, the performance is as bad as the original version of the non-tied rank calculation, ie incredibly bad, because once again we’re sorting the set every time we calculate a rank. So how can we get tied ranks and good performance?
The first approach I tried was to use a recursive calculation, which used the named set approach to calculate the non-tied rank and then checked to see if the CurrentMember on Customer had the same value for Internet Sales Amount as the member before it in the set of Ordered Customers; if it did, it displayed the rank of the previous Customer. Here it is:
WITH
SET
ORDEREDCUSTOMERS AS
ORDER(
[Customer].[Customer].[Customer].MEMBERS
, [Measures].[Internet Sales Amount]
, BDESC)
MEMBER MEASURES.REGULARRANK AS
RANK([Customer].[Customer].CURRENTMEMBER, ORDEREDCUSTOMERS)
MEMBER MEASURES.TIEDRANK AS
IIF(
[Measures].[Internet Sales Amount] =
(ORDEREDCUSTOMERS.ITEM(MEASURES.REGULARRANK-2), [Measures].[Internet Sales Amount])
AND MEASURES.REGULARRANK>1
, (ORDEREDCUSTOMERS.ITEM(MEASURES.REGULARRANK-2), MEASURES.TIEDRANK)
,MEASURES.REGULARRANK)
SELECT
{[Measures].[Internet Sales Amount], MEASURES.TIEDRANK,[Measures].REGULARRANK}
ON COLUMNS,
ORDEREDCUSTOMERS
ON ROWS
FROM [Adventure Works]
Now this particular query performs pretty well – 6 seconds on my laptop, only marginally worse than the non-tied rank. And it gives the correct results; the middle column of values below shows the tied rank:
Unfortunately, the performance of this approach varies a lot depending on the number of tied ranks that are present in the set. If we slice the query by the year 2001, when there were a lot more customers with tied ranks, as follows:
WITH
SET
ORDEREDCUSTOMERS AS
ORDER(
[Customer].[Customer].[Customer].MEMBERS
, [Measures].[Internet Sales Amount]
, BDESC)
MEMBER MEASURES.REGULARRANK AS
RANK([Customer].[Customer].CURRENTMEMBER, ORDEREDCUSTOMERS)
MEMBER MEASURES.TIEDRANK AS
IIF(
[Measures].[Internet Sales Amount] =
(ORDEREDCUSTOMERS.ITEM(MEASURES.REGULARRANK-2), [Measures].[Internet Sales Amount])
AND MEASURES.REGULARRANK>1
, (ORDEREDCUSTOMERS.ITEM(MEASURES.REGULARRANK-2), MEASURES.TIEDRANK)
,MEASURES.REGULARRANK)
SELECT
{[Measures].[Internet Sales Amount], MEASURES.TIEDRANK,[Measures].REGULARRANK}
ON COLUMNS,
ORDEREDCUSTOMERS
ON ROWS
FROM [Adventure Works]
WHERE([Date].[Calendar Year].&[2001])
…then performance gets really bad once again.
Then I came up with a new approach. After ordering the set of all Customers, I made a second pass over it and created a second set with exactly the same number of items in it: for every customer in the first set, in the second set I added the current Customer if that Customer did not have a tied rank; if the Customer did have a tied rank, I added the first Customer in the original set that shared its tied rank. So if there were four customers, A, B, C and D, and if A had sales of 1, B had sales of 2, C had sales of 2 and D had sales of 3, then this new set would contain the members A, B, B, D. I could then say, for Customer C, that it was the third Customer in the original set, but the third item in the new set was B, and that was the Customer whose rank I needed to display for Customer C. So each item in this second set gives us the member whose rank we need to display for the member in the same position in the set of ordered Customers.
Here’s the MDX:
WITH
SET
ORDEREDCUSTOMERS AS
ORDER(
[Customer].[Customer].[Customer].MEMBERS
, [Measures].[Internet Sales Amount]
, BDESC)
MEMBER MEASURES.REGULARRANK AS
RANK([Customer].[Customer].CURRENTMEMBER, ORDEREDCUSTOMERS)
SET CUSTOMERSWITHTIES AS
GENERATE(
{INTERSECT({ORDEREDCUSTOMERS.ITEM(0)} AS FIRSTTIE,{}), ORDEREDCUSTOMERS} AS ORDEREDCUSTOMERS2
, IIF(
({ORDEREDCUSTOMERS2.CURRENT AS CURRCUST}.ITEM(0), [Measures].[Internet Sales Amount]) =
({ORDEREDCUSTOMERS2.ITEM(ORDEREDCUSTOMERS2.CURRENTORDINAL-2) AS PREVCUST}.ITEM(0), [Measures].[Internet Sales Amount])
, IIF(
(PREVCUST.ITEM(0), [Measures].[Internet Sales Amount])
=
(FIRSTTIE.ITEM(0), [Measures].[Internet Sales Amount])
, {FIRSTTIE}
, {PREVCUST AS FIRSTTIE})
, {CURRCUST})
, ALL)
MEMBER MEASURES.HASTIE AS
RANK([Customer].[Customer].CURRENTMEMBER, CUSTOMERSWITHTIES)
MEMBER MEASURES.TIEDRANK AS
(MEASURES.REGULARRANK, CUSTOMERSWITHTIES.ITEM(MEASURES.REGULARRANK-1))
SELECT
{[Measures].[Internet Sales Amount], MEASURES.TIEDRANK,[Measures].REGULARRANK}
ON COLUMNS,
ORDEREDCUSTOMERS
ON ROWS
FROM [Adventure Works]
The named set CUSTOMERSWITHTIES is where the interesting stuff happens. I’m iterating over the set ORDEREDCUSTOMERS using the GENERATE function, and using inline named sets to store the current Customer in the iteration, the previous Customer, and the first Customer containing the shared tied rank (see here for a similar example of using named sets). It consistently executes in 12 seconds regardless of how you slice the query, so it’s not as good as the best performance of the recursive approach but it’s much, much better than the worst performance of the recursive approach. If anyone has any other ideas on how to solve this problem, I’d love to hear them. I’m still sure there’s a better way of doing this…
Of course, what I really want is for the Formula Engine to be able to optimise queries containing set functions like Order in scenarios like this – I’d want it to know that when a particular set operation returns the same result for a block of cells, it should only perform that set operation once. However, even this wouldn’t necessarily be good enough in all cases – there are plenty of situations where you need to perform the same expensive set operation like a sort or a filter in multiple similar calculations, and you’d like to share the result of this set operation between calculations. For example, you might have a calculated member that counted the number of Customers who bought something both this month and in the previous month, and a second calculated member that counted the number of Customers who not only bought this month and in the previous month and spent more than $1000. In both cases you end up finding the set of Customers who bought this month and last month, which may take a long time to do. This is why I think it would be useful to be able to have calculated members return set objects, which can then be cached, so you can share the set between multiple other calculated members; if you agree, please vote on this Connect.
Chris has been working with Microsoft BI tools since he started using beta 3 of OLAP Services back in the late 90s. Since then he has worked with Analysis Services in a number of roles (including three years spent with Microsoft Consulting Services) and he is now an independent consultant specialising in complex MDX, Analysis Services cube design and Analysis Services query performance problems. His company website can be found at http://www.crossjoin.co.uk and his blog can be found at http://cwebbbi.wordpress.com/ . |
Tags: mdx