I think that 'data wrangling' takes a special kind of problem solving mentality. Keeping data current, cleaning it, merging it, aggregating it, and creating summary (or data warehouse) tables for reporting purposes... For MANY years I used Express installations of SQL Server 2000 as my ultimate data Swiss Army knife. It was like a secret weapon. You can crank out a couple of databases, make a DTS package, add some SQL flavor, ODBC connections to whatever you need - and viola - have a working and update-able consolidated data source in no time flat.
It never really mattered what kind of beast you were up against - Access, Paradox, Oracle, MySQL, Excel, delimited files, or just text files with shitty formatting. Say what you want about SQL Server 2000 and DTS - but it was hella flexible (probably unmatched in ease and flexibility still to this day, IMHO).
Although I spend a good deal of time with Oracle these days - I have recently taken on some large migrations using SQL Server 2008, which pushed me to dig up my archaic library of SQL code I used to cut-n-paste back and forth in the 2000 days in order to get just-about-anythinhg done. I'm going to be posting some of the more useful and interesting ones
It may be simple, or it may be complicated - but when you're searching for a hammer - the last thing you want is a bunch of meat tenderizers. Its damned close, but not exactly what you need. Today I'm posting a simple SQL Server cursor loop statement that you can add to your collection of templates.
Disclaimer: Now some of you might be asking: Isn't this pretty elementary? Yes, but maybe not everyone reading this has your "l33t hax0r skillz", ok there, Captain Ahab?
Screw that noise. I'm all about creating specific solutions to specific problems - and nothing is more usable, extensible, and straight-up customizable than a simple T-SQL cursor loop. It doesn't matter if youre updating rows on a separate table, doing a complicated INSERT INTO, doing some multi-variable lookups, or hell - maybe all of the above. Use your imagination. A good starting point prevents you from getting "code-blocked" and staring at a blank text editor or query screen as your jaws hangs open like a common mouth breather.
I write complicated ass SQL all day, and I open templates like this to modify all the time. I've got a whole library of them just for that purpose. I mean, c'mon lets get real, who actually sits down and writes stuff from scratch EVERY time? I'm all about the self-plagiarism, especially if it saves me time at the end of the day.
This example is supposed to be a simple as possible, while also being semi-plausible. Hypothetically, lets say you were building a summary table of all employees broken down by office, salary, department, etc. What we are doing is setting a True flag in a field on the summary table for the employee who has the highest salary for that particular office (I know it sounds odd - but like I said, its hypothetical, this could just as easily be doing a top flag based on department, or number of sales, seniority, etc. - I just thought that using salary by office is an easily understood one).
Here is an explanation in English about what the example is doing - data-wise...
First, we grab all the individual Office Ids - then go one by one through each of the different offices ranking the employees by salary, taking the highest paid Employee ID and then updating that same tables "Top Salary in the Office" field for that particular employee.
--Ryan Robitaille (ryrobes.com) 12/1/2009
--First define our return variable for the cursor and other misc vars
DECLARE @top_paid_in_office_emp_id VARCHAR(15), @office_id VARCHAR(15)
--here is the actual cursor select statement
DECLARE all_offices_cursor CURSOR FOR
SELECT DISTINCT office_id FROM coded_offices
ORDER BY office_id
--open that beast so we can use it
--start iterating through the cursor result set
FETCH NEXT FROM all_offices_cursor
--only pull in rows that we havent seen yet, else stop
WHILE @@FETCH_STATUS = 0
--start actual looping statement using our cursor variables
SET @top_paid_in_office_emp_id = (SELECT top 1 employee_id
WHERE office_id = @office_id
ORDER BY salary DESC)
UPDATE office_roster_summary SET top_paid_flag = 'T'
WHERE employee_id = @top_paid_in_office_emp_id
AND office_id = @office_id
--end looping statement
FETCH NEXT FROM all_offices_cursor
--grab the next var(s) from the cursor and repeat, baby!
--Close the cursor, and deallocate it for good housekeeping sake...
Depending on your data - when building a summary table (or Data Warehouse Fact table, or Analysis Services Cube for that matter) these types of "flag" fields often make reporting on such items much easier.