Sybase Aggregate Queries

Unlike many dialects of SQL, Transact SQL allows you to reference columns which are NOT part of a Group By clause in an aggregate query. Thus the following is permissable...

select salesman,
      
office,
       avg(invoice)
  from turnover
 group by salesman


which might produce something like...


salesman     office
-----------  -----------  --------
Al Smith     Slough       1,654.89
Fred Bloggs  Swindon      1,956.56

Jim Smith    Devizes      2,243.98
Jim Welsh    Slough       2,987.34
Paul Jones   Cirencester  2,654.23

Pete Pring   Swindon      2,124.87

etc...


Note that the dependent column, 'office' has no relevance to the aggregation - it's simply a look-up. This can lead to unexpected effects if used without thought as in the following example, taken from Sybase's own documentation...
select type, title_id, avg(price), avg(advance) 
from titles
group by type
type title_id 
------------ -------- ----- -------
business BU1032 13.73 6,281.25
business BU1111 13.73 6,281.25
business BU2075 13.73 6,281.25
business BU7832 13.73 6,281.25
mod_cook MC2222 11.49 7,500.00
mod_cook MC3021 11.49 7,500.00
UNDECIDED MC3026 NULL NULL
popular_comp PC1035 21.48 7,500.00
popular_comp PC8888 21.48 7,500.00
popular_comp PC9999 21.48 7,500.00
psychology PS1372 13.50 4,255.00
psychology PS2091 13.50 4,255.00
psychology PS2106 13.50 4,255.00
psychology PS3333 13.50 4,255.00
psychology PS7777 13.50 4,255.00
trad_cook TC3218 15.96 6,333.33
trad_cook TC4203 15.96 6,333.33
trad_cook TC7777 15.96 6,333.33

(18 rows affected)
See what's happening? The title_id is completely irrelevant to the averages in the row.  At first glance it looks as if the titles are all generating the same figures but the figures are actually based on the first column, 'type'. So the ability to use non-participating columns in aggregate queries is a useful one but to be applied with care.


And the other way around...


Sybase also allows you to group by a column which does not appear in the query's select statement. Thus...

select salesman,
       count(invoice)
  from turnover
 group by salesman, office

salesman
-----------  -----------

Al Smith              12
Al Smith            NULL
Al Smith            NULL

Al Smith            NULL

Fred Bloggs           19

Fred Bloggs         NULL

Fred Bloggs         NULL

Fred Bloggs         NULL

Jim Smith              9

Jim Smith           NULL

Jim Smith           NULL

Jim Smith           NULL

and so on...

Notice the 'gotcha' here. Because, in this case, salesmen work out of offices, there's no valid data for any salesman except at his own office. On the other hand, if salesmen occassionaly worked out of other offices, this might be usefull. Then again, it would be more usefull if the office were shown.


Look! No hands...


Transact SQL will allow you to group by an expression that does not contain an aggregate function as in...

select salesman,
       avg(invoice),
       invoice * commission
  from turnover
 group by invoice * commission

salesman
-----------  -----------  ----------
Al Smith        1,654.89      221.87
Al Smith        1,654.89      134.65
Al Smith        1,654.89      206.34

...and so on. Again, this example is somewhat meaningless but that's the problem with aggregations, used without thought they can provide very confusing output.


Nested aggregates.


Rather more helpfully, Sybase permits nesting of aggregate functions, so that the following very usefull query is entirely legal...

select max(avg(invoice))
from turnover

group by salesman

-------------
      2987.34



Handling Null Values.


If the grouping expression produces a NULL value, the NULL is placed in a row as if it were a real value. If more than one NULL is produced they are all, quite logically, placed into a single NULL row...

select avg(payment),
       count(*)
  from salesman
 group by invoice

-----------  -----------
       NULL            2
   1,225.00            1
   1,500.00            4
   1,750.00            1


and so on. This is one of those really usefull things that can be extremely helpfull in certain circumstances, such as when you can't understand why a report isn't delivering the expected result. Notice that we used count(*) rather than count(payment) to find out how many NULL rows we have. That's because counting NULLs against a column returns nothing, no matter how many instances of NULL there are.

Where There's a Where...

If we use a where clause in a grouping query, the where clause is honoured first. In other words, only the set returned by the where clause is grouped on. Hence...


select
salesman,
      
office,
       avg(invoice)
  from turnover
 where invoice > 3000
 group by salesman


might produce something like...


salesman     office
-----------  -----------  --------

Jim Welsh    Slough       3,201.34
Paul Jones   Cirencester  3,501.76


Compare this with our first example to see how dramatically this alters things.

However, there is another of those gotchas waiting because, if you do this...


select
salesman,
      
office,
       avg(invoice),
       avg(cost)
  from turnover
 where invoice > 3000
 group by salesman

you might get this...

salesman     office
-----------  -----------  -----------  -----------

Jim Welsh    Slough          3,201.34      1098.12

Paul Jones   Cirencester     3,501.76       875.32


which makes these guys look like geniuses. The problem is that the last column is a gold-plated, copper-bottomed lie.

Think about it. We told Sybase to find the average from the turnover where the invoice value was greater than 3,000. We then told it to find the average of the cost per invoice. What we didn't do was tie the 'cost' column to the where clause so it happily calculated the average across all the invoices for each salesman, some of which were obviously very low. The solution is to use a having clause...

Have I Got a Clause for You...

The having clause works on a group by in the same way as a where clause works on an ordinary query. Re-writing our example above as...


select
salesman,
      
office,
       avg(invoice),
       avg(cost)
  from turnover
 
 group by salesman
 having
invoice > 3000

Gets us something like this...

salesman     office
-----------  -----------  -----------  -----------
Jim Welsh    Slough          3,201.34     2,822.02
Paul Jones   Cirencester     3,501.76     3,019.95


which looks a lot more realistic. The having clause has dropped out all the irrelevant rows and concentrated only on the invoices that we're interested in.

Caveat emptor...

Sybase aggregate queries are one of the system's most powerful features but, as always, with power comes responsibility. Always test aggregate queries on small but representative data sets to ensure that they really are doing what you intend them to.


Find out more by searching Google here...

Google