Just sharing, no big deal!
'Distinct' is the same as 'Group By'.
'Distinct' may achieve the same result as 'Group By' depending on the data.
1.) T-SQL_1:
2.) T-SQL_2:
T-SQL_1 and T-SQL_2 return the same result
3.) T-SQL_3:
This is a valid t-sql, except that it actually will break the distinct selection of cUserID.
4.) T-SQL_4:
5.) T-SQL_5:
T-SQL_4 and T-SQL_5 return same result.
T-SQL_4 and T-SQL_5 may return same result or may not depending on the data of cDate,cDeviceID,cUsername.
6.) T-SQL_6:
This returns error message: Column 'TBL_Raw_Trans_201302.cUserID' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
It can only work if you change it to become T-SQL_5.
In other words, 'Group By' actually means 'Distinct'.
'Distinct' may achieve the same result as 'Group By' depending on the data.
'Distinct' will result in nested results for non-distinct selection (column).
In other words, 'Distinct' only performs 'Group By' for the column which is selected for 'Distinct'.
Case Study:
T-SQL_7_1 - Full table selection:
T-SQL_7_2:
In this scenario, T-SQL_7_1 returns the same result as T-SQL_7_2
T-SQL_7_3:
T-SQL_7_4:
T-SQL_7_3 is the same as T-SQL_7_4 - these two queries will always yield the same result irregardless of the data.
T-SQL_7_5:
T-SQL_7_6:
T-SQL_7_5 is the same as T-SQL_7_6 - these two queries will always yield the same result irregardless of the data.
In retrospect, the official usage of 'Distinct' defined by Microsoft is that it is meant to return a one-column table
that contains the distinct values from the specified column.
http://technet.microsoft.com/en-us/library/ee634943.aspx
Hence, how is distinct really useful in a multiple column selection scenario ?
It can be used to identify differences in columns' data.
Using the following query, we can identify that cUserID has 3 different associated data with respect to cDate.
and with the following query, cUserID has 2 different associated data with respect to cDeviceID.
In other words, if you really want to return a table with multi column selection, use 'Group By' instead of 'Distinct'.
'Distinct' may achieve the same result as 'Group By' depending on the data.
1.) T-SQL_1:
SELECT cUserID,max(cDate),max(cDeviceID),max(cUsername) FROM Trans_Table Group by cUserID Order by cUserID
2.) T-SQL_2:
SELECT distinct(cUserID),max(cDate),max(cDeviceID),max(cUsername) FROM Trans_Table Group by cUserID Order by cUserID
T-SQL_1 and T-SQL_2 return the same result
3.) T-SQL_3:
SELECT distinct(cUserID),max(cDate),max(cDeviceID),max(cUsername) FROM Trans_Table Group by cUserID,cDate Order by cUserID
This is a valid t-sql, except that it actually will break the distinct selection of cUserID.
4.) T-SQL_4:
SELECT distinct(cUserID),cDate,cDeviceID,cUsername FROM Trans_Table Order by cUserID
5.) T-SQL_5:
SELECT distinct(cUserID),min(cDate),min(cDeviceID),min(cUsername) FROM Trans_Table Group by cUserID Order by cUserID
T-SQL_4 and T-SQL_5 may return same result or may not depending on the data of cDate,cDeviceID,cUsername.
6.) T-SQL_6:
SELECT distinct(cUserID),min(cDate),min(cDeviceID),min(cUsername) FROM Trans_Table Order by cUserID
This returns error message: Column 'TBL_Raw_Trans_201302.cUserID' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
It can only work if you change it to become T-SQL_5.
'Distinct' may achieve the same result as 'Group By' depending on the data.
'Distinct' will result in nested results for non-distinct selection (column).
In other words, 'Distinct' only performs 'Group By' for the column which is selected for 'Distinct'.
Case Study:
T-SQL_7_1 - Full table selection:
SELECT cUserID,cDate,cDeviceID,cUsername FROM Trans_Table order by cUserID
T-SQL_7_2:
SELECT distinct(cUserID),cDate,cDeviceID,cUsername FROM Trans_Table order by cUserID
In this scenario, T-SQL_7_1 returns the same result as T-SQL_7_2
T-SQL_7_3:
SELECT distinct(cUserID),min(cDate),min(cDeviceID),min(cUsername) FROM Trans_Table Group by cUserID Order by cUserID
T-SQL_7_4:
SELECT cUserID,min(cDate),min(cDeviceID),min(cUsername) FROM Trans_Table Group by cUserID Order by cUserID
T-SQL_7_3 is the same as T-SQL_7_4 - these two queries will always yield the same result irregardless of the data.
T-SQL_7_5:
SELECT distinct(cUserID),max(cDate),max(cDeviceID),max(cUsername) FROM Trans_Table Group by cUserID Order by cUserID
T-SQL_7_6:
SELECT cUserID,max(cDate),max(cDeviceID),max(cUsername) FROM Trans_Table Group by cUserID Order by cUserID
T-SQL_7_5 is the same as T-SQL_7_6 - these two queries will always yield the same result irregardless of the data.
In retrospect, the official usage of 'Distinct' defined by Microsoft is that it is meant to return a one-column table
that contains the distinct values from the specified column.
http://technet.microsoft.com/en-us/library/ee634943.aspx
Hence, how is distinct really useful in a multiple column selection scenario ?
It can be used to identify differences in columns' data.
Using the following query, we can identify that cUserID has 3 different associated data with respect to cDate.
SELECT distinct(cUserID),cDate,cDeviceID,cUsername FROM Trans_Table order by cUserID
and with the following query, cUserID has 2 different associated data with respect to cDeviceID.
SELECT distinct(cUserID),cDeviceID,cUsername FROM Trans_Table order by cUserID
In other words, if you really want to return a table with multi column selection, use 'Group By' instead of 'Distinct'.
Comments