dive into mark

You are here: dive into markArchivesApril 2003Ask Dr. SQL: task scheduler

Friday, April 4, 2003

Ask Dr. SQL: task scheduler

Suppose, for the sake of argument, that we have a series of tasks. Each of which has a begin date/time and an end date/time. Each task is run only once, and the results of the completed tasks are stored in a separate table along with various result data. Thus:

select * from task;

task_uid  begin_date               end_date
========= ======================== ========================
1         2003-04-04 00:30:00.000  2003-04-04 01:00:00.000
2         2003-04-04 00:45:00.000  2003-04-04 01:15:00.000
3         2003-04-04 01:00:00.000  2003-04-04 01:30:00.000
4         2003-04-05 10:00:00.000  2003-04-05 10:30:00.000

(4 row(s) affected)


select * from result;

result_uid  task_uid
=========-- =========
1           1

(1 row(s) affected)

(The result table also contains other data pertaining to the results of the completed task, but it’s irrelevant to our current assignment.)

As you will see in a minute, we will need to know what time it is. No two DBMSes have the same syntax for this; here we’ll be using Microsoft SQL Server 2000. Oracle weenies will select sysdate your ridiculous little dual table. Other DBMSes have a variety of maddeningly inconsistently named functions for the purpose.

It is currently 12:49 AM on April 4, 2003:

select current_timestamp;

========================
2003-04-04 00:49:23.597

(1 row(s) affected)

We are writing a scheduler, which needs to decide which tasks to run. The scheduler is run once a day. Tasks which have not been started, should be run now, along with all the other tasks whose starting date will come up before the scheduler is run again. Tasks that were never completed but whose end date has already passed, should not be run. (Yes, this means that some tasks may end up being skipped altogether, if the scheduler slips. For reasons known only to management, this is acceptable and even desired.)

Furthermore, we shall ignore the possibility of racing conditions, where a task is still “in progress” but not completed by the time the scheduler runs again. Tasks are assumed to run instantaneously. This assumption gives us the willies, and the dangers of said assumption have been brought up in numerous meetings and have been the subject of numerous CYA memos, but for now we are simply going to proceed without worrying about it.

However—and I am so incredibly not making this up—we are not allowed to assume that tasks set to begin in the future have not been completed. The only indication of whether a task has been completed or not is whether it is listed in the result table. This is affectionately known as the E-TWAP (Evan’s Time Warp Requirement).

Now, here’s our assignment: we need to find the first non-completed task that should have begun by now but hasn’t ended yet (i.e. the current time is between the task’s begin date and end date), plus all other non-completed tasks due to start within the next 24 hours of the first task.

This is a real-world example. I couldn’t possibly make up something this contrived.

First, let’s focus on figuring out the date of the first task we want to run. We can use the handy (but little-used) between keyword to find only tasks where the current time is between the begin_date and end_date, then use the min function on begin_date to find the earliest date among those tasks:

select min(t.begin_date) as first_date
from task t
where current_timestamp between t.begin_date and t.end_date
;

first_date
========================
2003-04-04 00:30:00.000

(1 row(s) affected)

Of course, this doesn’t take into account the tasks that have already completed. For that, we’ll need to add a where not exists clause and do a subquery on the result table:

select min(t.begin_date) as first_date
from task t
where not exists (
  select r.task_uid
  from result r
  where t.task_uid = r.task_uid
)
and current_timestamp between t.begin_date and t.end_date
;

first_date
========================
2003-04-04 00:45:00.000

(1 row(s) affected)

Now we‘re going to use this date to do calcuations against every other begin_date in the task table (to find all the tasks within the next 24 hours of this date). For this we use a CROSS JOIN. For larger result sets (more than one row), CROSS JOIN will give you what’s called a Cartesian Product: every row of the first table crossed with every row of the second table. And that’s technically what we’ll get here, but since the second table will only have one row and one column in it, it’s a trivial case: that value just kind of gets “tacked on” as an extra column to the first table.

(Note that we‘re going to take our previous SELECT statement, give it a name (m), and use it inline in a larger SELECT statement. This is called a derived table, and not all databases can do it. An alternative for those that can’t: turn the previous statement into a view and then CROSS JOIN with the view.)

select t.task_uid
, t.begin_date
, m.first_date
from task t
cross join (
  select min(t.begin_date) as first_date
  from task t
  where not exists (
    select r.task_uid
    from result r
    where t.task_uid = r.task_uid
  )
  and current_timestamp between t.begin_date and t.end_date
) m
;

task_uid  begin_date               first_date
========= ======================== ========================
1         2003-04-04 00:30:00.000  2003-04-04 00:45:00.000
2         2003-04-04 00:45:00.000  2003-04-04 00:45:00.000
3         2003-04-04 01:00:00.000  2003-04-04 00:45:00.000
4         2003-04-05 10:00:00.000  2003-04-04 00:45:00.000

(4 row(s) affected)

This is great as far as it goes, but once again we‘re going to need to look at the result table to see which of these tasks have already been completed. We did this once before, to figure out the date of the first non-completed task that we cared about, but now we’ll have to do it again to filter out the tasks in the outer portion of the SELECT.

select t.task_uid
, t.begin_date
, m.first_date
from task t
cross join (
  select min(t.begin_date) as first_date
  from task t
  where not exists (
    select r.task_uid
    from result r
    where t.task_uid = r.task_uid
  )
  and current_timestamp between t.begin_date and t.end_date
) m
where not exists (
  select r.task_uid
  from result r
  where t.task_uid = r.task_uid
)
;

task_uid  begin_date               first_date
========= ======================== ========================
2         2003-04-04 00:45:00.000  2003-04-04 00:45:00.000
3         2003-04-04 01:00:00.000  2003-04-04 00:45:00.000
4         2003-04-05 10:00:00.000  2003-04-04 00:45:00.000

(3 row(s) affected)

The final step is to compare each task’s begin_date to first_date. Specifically, we want all tasks whose begin_date is between first_date and first_date + 1 day. No two DBMSes use the same date functions; here I’m using Microsoft SQL Server’s DATEADD function. Oracle weenies would use ADD_DAYS, and so forth.

select t.task_uid
, t.begin_date
, m.first_date
from task t
cross join (
  select min(t.begin_date) as first_date
  from task t
  where not exists (
    select r.task_uid
    from result r
    where t.task_uid = r.task_uid
  )
  and current_timestamp between t.begin_date and t.end_date
) m
where not exists (
  select r.task_uid
  from result r
  where t.task_uid = r.task_uid
)
and t.begin_date between m.first_date and dateadd(day, 1, m.first_date)
;

task_uid  begin_date               first_date
========= ======================== ========================
2         2003-04-04 00:45:00.000  2003-04-04 00:45:00.000
3         2003-04-04 01:00:00.000  2003-04-04 00:45:00.000

(2 row(s) affected)

Filed under ,

11 comments

  1. I could probably come up with alternative ways of doing this, but I fear my head would explode with the effort.

    Still think how much fun it will be to modify this code once the requirements change - and they will.

    How big is this table going to get? With a couple of ‘not exists’ and a nice inline table (as us Oracle weenies call it) I would suspect it may not be that lightning fast. Its possible that the lights wont dim when you fire this up every day but I wouldn’t like to bet on it.

    But on a positive note, with requirements like these good developers should never be out of work.

    Comment by Andy Todd — Friday, April 4, 2003 @ 5:51 am

  2. “This assumption gives us the willies”

    Oh, it should. If anything is likely to screw up, it’ll be this. I had a problem with some similarities to this about a year ago, and someone in my group kindly made the assumption that everything happens in less than a half-second (who knows where they got that figure from) without letting anyone know. Needless to say, it was a few days of debugging to figure out what went wrong. The culprit was beaten, rightly, with their own shoes.

    Comment by GaryF — Friday, April 4, 2003 @ 8:05 am

  3. For what it’s worth, “select current_timestamp;” is
    how it is done in PostgreSQL, too. I think it
    might even be in some SQL standard or other. Of course,
    that’s irrelevant, because Oracle is *the* standard!

    Comment by Sean Neakums — Friday, April 4, 2003 @ 8:50 am

  4. I’m sorry, but this code explanation sounds dreadfully too complex.

    Given the requirements as outlined here, and the fact that you’ve stated that these requirements have a very good chance of changing in the future, it would be prudant to abstract and re-factor the code as much as you can.

    This current encarnation looks WAAY too spaghetti like to be actually maintanable, let alone functional.

    One question crops up in my mind… How are you defining that a particular task has actually been completed in it’s assigned time slot?

    With that in mind, if you added a new table named:

    tasks_completed

    and had the following columns:

    id, task_id, completed

    you could update this table when running the schedular, and then later query for tasks that were completed quite easily.

    As it stands now, I feel the code given is representative of a codebase that looks to be highly challenging, to say the least. ;)

    Nothing against the authors skill. Please don’t get me wrong. I just feel it would be prudent to really break down the taks into more manageable chunks. (pun intended. ;)

    My .02 cents.

    Comment by Chris Simmons — Friday, April 4, 2003 @ 9:33 am

  5. Really nothing to with post, but I noticed you didn’t use the samp tag. Everthing is marked up in code.

    Comment by Anonymous — Friday, April 4, 2003 @ 10:07 am

  6. Ha ha ha. Silly rabbit, you assume I have any control whatsoever over the database design.

    Comment by Mark — Friday, April 4, 2003 @ 10:22 am

  7. For time and sql related questions, I usually turn to “Developing time-oriented database applications in sql” by Richard Snodgrass. 2000. Morgan Kaufmann. ISBN 1-55860-436-7. Lots of code examples for several dbms’.

    Comment by Brad Smith — Friday, April 4, 2003 @ 10:55 am

  8. You are scaring me Mark, it sounds like you moved back north to work in Yardley.

    We had a very very scarily similar problem, and I would ask you to have your design people call me so I can tell them how WRONG they are. We assumed up here too that transactions would be completed in a split second, so that a new feature we enabled would send emails when the transaction was done.

    The problem was if you started one transaction, held it open, started a new one, then saved the second one, then saved the first one, you’d only get one email, when you should get two, one for each transaction.

    Someone conveniently forgot that multiple people might actually be using an ENTERPRISE computer system.

    Comment by Adrian — Friday, April 4, 2003 @ 3:42 pm

  9. Addendum to #5 - Insanely perfect (X)HTML:

    <pre><code>select current_timestamp;</code>

    <samp>========================
    2003-04-04 00:49:23.597

    (1 row(s) affected)</samp></pre>

    Comment by Anonymous — Friday, April 4, 2003 @ 8:31 pm

  10. The constraints you’re working under remind me of the Dilbert strip where Dogbert is waving a staff, chanting “Out, Out, you Demons of Stupidity!”

    Comment by dws — Friday, April 4, 2003 @ 10:13 pm

  11. Making an initial view that only selects unfinished tasks would reduce your complexity a lot, and at least that rule seems likely to hold.

    Also, if you make it a view, you can left outer join, instead of using Where Not Exists, and I’m pretty sure the join will be faster.

    create ciew unfinished_task as
    select t.*
    from task t left join result r
    on t.task_uid = r.task_uid
    where
    r.task_uid is null
    go

    Then m becomes
    select min(u.begin_date)
    from unfinished_task u

    And in fact, this would usefully be a user-defined function, since it’s always a scalar. This is new in SQL Server 2K.

    Thus:

    create function dbo.earliest_unfinished_task_datetime()
    returns datetime
    as
    begin
    declare @ret datetime
    select @ret = min(u.begin_date) from unfinished_task u

    return @ret
    end
    go

    and your final statement becomes a much nicer:

    select * from unfinished_task u
    where u.begin_date
    between dbo.earliest_unfinished_task_datetime and dateadd(day, 1, earliest_unfinished_task_datetime)

    You might also choose to stick the dateadd in another function, so that if they change their mind on the 1-day window, you just change it in one place.

    While I agree that T/SQL is an unproductive language to program in, when performance is a high priority, you really can milk it.

    Of course, just -try- to do unit testing on it. Ouch.

    Comment by Jeremy Dunck — Saturday, April 5, 2003 @ 12:02 am

Respond privately

I am no longer accepting public comments on this post, but you can use this form to contact me privately. (Your message will not be published.)



Recent Stuff For You, Special Price Stay Here
  • Greasemonkey Hacks
Good Stuff Buy The Cow Go Away
Dive Into Python
Powered by Google Drink The Milk Don't Steal

 

posts / comments
© 2001-8 Mark Pilgrim