T-SQL - 'Group By' Vs 'Distinct'

Just sharing, no big deal!

'Distinct' is the same as 'Group By'.

'Distinct' may achieve the same result as 'Group By' depending on the data.

1.) T-SQL_1:

SELECT cUserID,max(cDate),max(cDeviceID),max(cUsername) FROM Trans_Table
Group by cUserID
Order by cUserID


2.) T-SQL_2:

SELECT distinct(cUserID),max(cDate),max(cDeviceID),max(cUsername) FROM Trans_Table
Group by cUserID
Order by cUserID


T-SQL_1 and T-SQL_2 return the same result



3.) T-SQL_3:

SELECT distinct(cUserID),max(cDate),max(cDeviceID),max(cUsername) FROM Trans_Table
Group by cUserID,cDate
Order by cUserID


This is a valid t-sql, except that it actually will break the distinct selection of cUserID.

4.) T-SQL_4:

SELECT distinct(cUserID),cDate,cDeviceID,cUsername FROM Trans_Table
Order by cUserID


5.) T-SQL_5:

SELECT distinct(cUserID),min(cDate),min(cDeviceID),min(cUsername) FROM Trans_Table
Group by cUserID
Order by cUserID


T-SQL_4 and T-SQL_5 return same result.

T-SQL_4 and T-SQL_5 may return same result or may not depending on the data of cDate,cDeviceID,cUsername.

6.) T-SQL_6:

SELECT distinct(cUserID),min(cDate),min(cDeviceID),min(cUsername) FROM Trans_Table
Order by cUserID


This returns error message: Column 'TBL_Raw_Trans_201302.cUserID' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

It can only work if you change it to become T-SQL_5.

In other words, 'Group By' actually means 'Distinct'.

'Distinct' may achieve the same result as 'Group By' depending on the data.

'Distinct' will result in nested results for non-distinct selection (column).

In other words, 'Distinct' only performs 'Group By' for the column which is selected for 'Distinct'.

Case Study:

T-SQL_7_1 - Full table selection:

SELECT cUserID,cDate,cDeviceID,cUsername 
FROM Trans_Table 
order by cUserID 



T-SQL_7_2:

SELECT distinct(cUserID),cDate,cDeviceID,cUsername 
FROM Trans_Table 
order by cUserID 





In this scenario, T-SQL_7_1 returns the same result as T-SQL_7_2

T-SQL_7_3:

SELECT distinct(cUserID),min(cDate),min(cDeviceID),min(cUsername) 
FROM Trans_Table 
Group by cUserID 
Order by cUserID 





T-SQL_7_4:

SELECT cUserID,min(cDate),min(cDeviceID),min(cUsername) 
FROM Trans_Table 
Group by cUserID 
Order by cUserID 





T-SQL_7_3 is the same as T-SQL_7_4 - these two queries will always yield the same result irregardless of the data.

T-SQL_7_5:

SELECT distinct(cUserID),max(cDate),max(cDeviceID),max(cUsername) 
FROM Trans_Table 
Group by cUserID 
Order by cUserID 


T-SQL_7_6:

SELECT cUserID,max(cDate),max(cDeviceID),max(cUsername) 
FROM Trans_Table 
Group by cUserID 
Order by cUserID 


T-SQL_7_5 is the same as T-SQL_7_6 - these two queries will always yield the same result irregardless of the data.

In retrospect, the official usage of 'Distinct' defined by Microsoft is that it is meant to return a one-column table

that contains the distinct values from the specified column.

http://technet.microsoft.com/en-us/library/ee634943.aspx

Hence, how is distinct really useful in a multiple column selection scenario ?

It can be used to identify differences in columns' data.

Using the following query, we can identify that cUserID has 3 different associated data with respect to cDate.

SELECT distinct(cUserID),cDate,cDeviceID,cUsername 
FROM Trans_Table 
order by cUserID 

and with the following query, cUserID has 2 different associated data with respect to cDeviceID.

SELECT distinct(cUserID),cDeviceID,cUsername 
FROM Trans_Table 
order by cUserID


In other words, if you really want to return a table with multi column selection, use 'Group By' instead of 'Distinct'.

Comments