Analytical Functions

I mentioned recently that you sometimes see people programming in sql. Well, one entry to that is the analytical function. The syntax always escapes me, and the idea of them are always a little bit foggy. So now that I had an opportunity to use it I thought I’d jot it down here for future reference to myself.

You can read all about analytical (or window) functions elsewhere I know. But this is for me! A window function allows you to compute some value over a running window of your data, say a running average or running sum based off of some value. I’ve used that to have a running monthly average of some expense over a year for my expense data (where you’d partition over a year for each transaction and order the window by transaction_date). For example:

SELECT
  amount,
  avg(amount) OVER (PARTITION BY EXTRACT(YEAR FROM transaction_date) ORDER BY transaction_date) as running_monthly_average
FROM transactions
WHERE category = "food"
ORDER BY transaction_date DESC

But today I had a different usecase: I wanted to find duplicate entries in my data! And I’m glad I did, turned out that I had quite a few. This led me into the ranking functions, specifically LEAD() and LAG() which does a pretty simple thing really. They show the next or previous record respectively, for the column you select. This allows some pretty funky stuff:

    SELECT
      transaction_date,
      LEAD(transaction_date) OVER (ORDER BY transaction_date DESC) AS next_date,
    FROM transactions
    ORDER BY transaction_date DESC

will simply give you the date for each row, and the date for the next row. Since I knew that duplicate entries would specifically have the same title, date, amount (and potentially category) I could extract all of those into their own CTE and then just select the sequential rows that have identical titles, dates and names.

WITH
  dupl AS (
    SELECT
      transaction_date AS date,
      LEAD(transaction_date) OVER (ORDER BY transaction_date DESC) AS next_date,
      title,
      LEAD(title) OVER (ORDER BY transaction_date DESC) AS next_title,
      amount,
      LEAD(amount) OVER (ORDER BY transaction_date DESC) AS next_amount,
      category,
      LEAD(category) OVER (ORDER BY transaction_date DESC) AS next_cat,
    FROM transactions
    ORDER BY transaction_date DESC, amount, title
  )
SELECT date, title, amount, category
FROM dupl
WHERE
  next_date = date
  AND next_title = title
  AND next_amount = amount
ORDER BY amount DESC, date DESC

Et Presto! Only the entries that have been duplicated show up! Pretty sweet when you actually have a usecase for a sql function. I’m coming for you next =VLOOKUP()