Using Cross Joins

Many SQL books and tutorials recommend that you “avoid cross joins” or “beware of Cartesian products” when writing your SELECT statements, which occur when you don't express joins between your tables.  It’s true that you need to ensure that your join conditions are adequately stated so that you don’t accidentally produce this effect, but it is not true that you should avoid these types of joins in every situation.
Cross Joins produce results that consist of every combination of rows from two or more tables.  That means if table A has 3 rows and table B has 2 rows, a CROSS JOIN will result in 6 rows.  There is no relationship established between the two tables – you literally just produce every possible combination.
The danger here, of course, is that if you have table A with 10,000 rows and Table B with 30,000 rows, and you accidentally create the product of these two tables, you will end up with a 300,000,000 row result -- probably not a good idea.  (Though it is great for creating test data and the like.)
So, how can this ever be useful?  Actually, if you do lots of report writing in SQL, a CROSS JOIN can be your best friend.
Suppose you need to write a report that returns total sales for each Store and each Product.  You might come up with this:
SELECT Store, Product, SUM(Sales) as TotalSales
FROM Sales
GROUP BY Store, Product
Easy enough – except when the requirement states “show $0 if a store had no sales of a particular product”.  The above query won’t do that – it returns no rows at all if a store had no sales for a particular product. 
The solution?  Well, hopefully in your database you have a table of Stores and a table of Products.  A cross join of the two results will return 1 row per combination of Store and Product:
SELECT S.Store, P.Product
FROM Stores S
CROSS JOIN Products P
That result is the perfect starting point for the results we wish to return -- now we just need to return the sales for each combination.  We already have written that in our first attempt, so now we just need to combine the two:
SELECT S.Store, P.Product, ISNULL(C.TotalSales,0) as TotalSales
FROM Stores S
CROSS JOIN Products P
LEFT OUTER JOIN
    (SELECT Store, Product, SUM(Sales) as TotalSales
     FROM Sales
     GROUP BY Store, Product) C
ON
  S.Store = C.Store AND
  P.Product = C.Product
The SELECT is derived logically from our requirements.  We start by considering all combinations of stores and products, and from there we show any matching sales data.  Our primary, driving rowset is actually not the transaction table, but rather the cross join of two entity tables!  It might seem very counter intuitive if you haven't approached the problem from this angle before, but it leads to very simple and elegant ways to solve rather complicated problems using SQL.
The solution uses what I call “the report writers magic formula”:
(A x B ) -> (C)
In my made up notation, the above reads “A cross joined with B, left outer joined to C ”.  A and B represent master tables of entities in your database, and C represents a summarized derived table of a transactional table in your database.
Some important things to note:
  • All criteria for the transactions, such as date ranges and/or transaction types, need to be done in the inner transaction summary query. 
  • The summarized transactional sub-query needs to be properly grouped so that it returns 1 row per combination of A and B.  Typically, this means that if the PK of table A is “A_ID” and the PK is table B is “B_ID”, then the derived table C should be grouped by A_ID, B_ID.
  • All criteria that determines which entities to show on your report – i.e., certain regions or only “active” products – should be done on the outer query.
Take the previous SELECT statement, for example: Note that the inner SELECT is grouped by Product and Store, which ensures that we return 1 row per combination of Product/Store -- which perfectly matches what the cross join creates.  If we wanted to show only data for 2005, we would put the filter on the TransactionDate column within the inner SELECT (since that is the part of the statement in which we collect and summarize our transactions), but if we want only ProductID #23, we do that in the outer SELECT (since that is where we determine the population of Stores and Products to return):
SELECT S.Store, P.Product, ISNULL(C.TotalSales,0) as TotalSales
FROM Stores S
CROSS JOIN Products P
LEFT OUTER JOIN
    (SELECT Store, Product, SUM(Sales) as TotalSales
     FROM Sales
     WHERE TransactionDate between '1/1/2005' and '12/31/2005'
     GROUP BY Store, Product) C
ON
  S.Store = C.Store AND
  P.Product = C.Product
WHERE
  P.Product = 23
The CROSS JOIN technique can apply to many situations – to return total labor cost by office by month, even if month X has no labor cost, you can do a cross join of Offices with a table of all months.   Another classic example is showing all GL transactions for a specific set of companies and accounts, returning all accounts and companies even when they have no activity.
The important thing is to practice with very small sets of sample data until you get a feel for how it works.  Also, you should explicitly state CROSS JOIN in your SELECT so that it is very clear that you intend for this to happen and it is not the result of missing joins. 

0 comments:

Post a Comment