Hello,
Long time I am puzzled over the following task. This problem should be
similar to many systems, and it would be interesting to know what are best
practices to solve this.
The setup:
Table, that contains log of events. Typical fields would be: 1) uniqueID; 2)
evtTime; 3)evtType. Then there are some additional fields that carry
additional logged event data.
Sample data:
ID evtTime evtType
1 10-11-2005 A
2 13-11-2005 B
3 20-11-2005 C
4 22-11-2005 B
Now, if I want to perform analysis on this data, typical need would be to
analyze sequence of event types. In the sample data above, the sequence is
A-B-C-B (ordering by date-time). I find it very complex to create a query
that would let me find sequence of events by given pattern.
For example, I could use query to discover the next element in sequence and,
thus, use this expression for querying:
SELECT evtType, (SELECT TOP 1 evtType FROM SampleData WHERE
evtTime>Level1.evtTime ORDER BY logDate) AS nextType FROM SampleData AS
Level1
It would be possible to extract also the second next event type in sequence
in similar fashion, but I am afraid that the complexity of the overall query
would be unreasonable and performance would be slow. Plus, if I want also,
for example, count days when the "event type A was in force", I need to add
nex subquery for the evtTime:
SELECT evtType, evtTime, (SELECT TOP 1 evtType FROM SampleData WHERE
evtTime>Level1.evtTime ORDER BY logDate) AS nextType, (SELECT TOP 1 evtTime
FROM SampleData WHERE evtTime>Level1.evtTime ORDER BY logDate) AS nextTime
FROM SampleData AS Level1
Only then I could find the sequences that, for example, have "event A in
force at least 10 days, then followed by event B, and then by C".
How this standard need could be solved so that I don't need to create large
temporary tables, populated with data by raw code?
PavilsOn Tue, 15 Nov 2005 16:30:55 +0200, Pavils Jurjans wrote:
(snip)
>The setup:
>Table, that contains log of events. Typical fields would be: 1) uniqueID; 2
)
>evtTime; 3)evtType. Then there are some additional fields that carry
>additional logged event data.
>Sample data:
>ID evtTime evtType
>1 10-11-2005 A
>2 13-11-2005 B
>3 20-11-2005 C
>4 22-11-2005 B
>Now, if I want to perform analysis on this data, typical need would be to
>analyze sequence of event types. In the sample data above, the sequence is
>A-B-C-B (ordering by date-time). I find it very complex to create a query
>that would let me find sequence of events by given pattern.
(snip)
>Only then I could find the sequences that, for example, have "event A in
>force at least 10 days, then followed by event B, and then by C".
Hi Pavils,
If this is your typical reporting need, then I recommend that you
transform the data from events to durations (or states, or whatever term
is appropriate in your case).
CREATE TABLE Durations
(StartDate smalldatetime NOT NULL,
EndDate smalldatetime DEFAULT NULL,
EventType char(1) NOT NULL,
PRIMARY KEY (StartDate),
UNIQUE (EndDate),
FOREIGN KEY (EndDate) REFERENCES Durations (StartDate),
CHECK(EndDate > StartDate)
)
In most cases, I'd make Durations a view instead of a table. This might
well be an exception to that rule. The view could be slow, since it
would require a self-join. If your queries would then involve joining
the view to itself, you'd effectively be joining the base table to an
exponential increasing number of copies of itself.
Durations as indexed view would be ideal, but I don't think it's
possible to write a query that computes the Durations correctly without
breaking the limitations for indexed views.
That leaves you with two options for how to populate the table:
1. Remove the Events table; change the stored procedure or front-end
code that currently popultes the Events table to make the correct
changes to the Durations table, or
2. Keep the Events table; leave front-end and stored procedures
unchanges, but add insert, update and delete triggers on the Events
table to copy over all changes to the Durations table.
Once you have the Durations table, the query you give in your final
example would be:
SELECT d1.StartDate, d2,StartDate, d3.StartDate, d3.EndDate
FROM Durations AS d1
INNER JOIN Durations AS d2
ON d2.StartDate = d1.EndDate
INNER JOIN Durations AS d3
ON d3.StartDate = d2.EndDate
WHERE d1.EndDate >= DATEADD(day, 10, d1.StartDate)
AND d1.EventType = 'A'
AND d2.EventType = 'B'
AND d3.EventType = 'C'
Best, Hugo
--
(Remove _NO_ and _SPAM_ to get my e-mail address)|||Thanks, Hugo for you input.
I studied it carefully. To extend my proposed sample to my real case, I must
add that the events are registered for large number of subjects, so the
events table also has reference to subject ID. Your query then could look
like
SELECT d1.StartDate, d2,StartDate, d3.StartDate, d3.EndDate
FROM Durations AS d1
INNER JOIN Durations AS d2
ON d2.StartDate = d1.EndDate AND d2.subID=d1.subID
INNER JOIN Durations AS d3
ON d3.StartDate = d2.EndDate AND d3.subID=d2.subID
WHERE d1.EndDate >= DATEADD(day, 10, d1.StartDate)
AND d1.EventType = 'A'
AND d2.EventType = 'B'
AND d3.EventType = 'C'
Perhaps that just adds another load to the view. Building a temp table could
be solution for some standard reports, but the data are updated every minute
and some reports are asking another criteria on the series, so perhaps view
or dynamically generated query, while potentially slow, could be the most
realistic alternative.
However, I am now a bit worried that it seems that theoretically there could
be cases when the dates of succeeding events could be identical, so
apparently I need another order field. It seems that the best (could easily
be used for self-joins) would be carrying reference to previous event ID,
but it involves difficulty how to order the events in the right order by
simple SQL server means. The only option left then is a clumsy "orderNumber"
that will be a hassle to populate and it will be even hasslier to insert a
new event in existing series. Lots of code will be required to keep that
structure intact with no errors in numeration. Perhaps I need to weigh how
much I need a query that would give ordered list of events vs doing this
ordering by code.
Another concern is making self-joins vs using subqueries. I could get the
fields from succeeding event quite easily, and also getting fields from next
succeeding event seems no problem. The question is only about how effective
is this. Perhaps only tests will show this. I am sort of dismayed that so
few tools are built in MSSQL to analyse record series - that is typical need
in complex analysis and caching all the analysis sample in application
memory for proprietary code analysis sometimes is very ineffective.
Regards,
Pavils Jurjans|||On Mon, 21 Nov 2005 19:37:59 +0200, Pavils Jurjans wrote:
>Thanks, Hugo for you input.
>I studied it carefully. To extend my proposed sample to my real case, I mus
t
>add that the events are registered for large number of subjects, so the
>events table also has reference to subject ID. Your query then could look
>like
>SELECT d1.StartDate, d2,StartDate, d3.StartDate, d3.EndDate
>FROM Durations AS d1
>INNER JOIN Durations AS d2
> ON d2.StartDate = d1.EndDate AND d2.subID=d1.subID
>INNER JOIN Durations AS d3
> ON d3.StartDate = d2.EndDate AND d3.subID=d2.subID
>WHERE d1.EndDate >= DATEADD(day, 10, d1.StartDate)
>AND d1.EventType = 'A'
>AND d2.EventType = 'B'
>AND d3.EventType = 'C'
Hi Pavils,
Yes, that's how you would change it. I'd probably also add the subject
to the SELECT clause.
>Perhaps that just adds another load to the view.
Depends. With the right indexes it *might* even speed things up.
> Building a temp table could
>be solution for some standard reports, but the data are updated every minut
e
>and some reports are asking another criteria on the series, so perhaps view
>or dynamically generated query, while potentially slow, could be the most
>realistic alternative.
Yes.
But if that's to slow, then consider permanently storing both the
beginning and the end of the events. There are various ways to do this:
1. Change it in whatever currently produces the data.
2. Use a stored procedure to handle insertions; add code to that SP to
fill the end date of the "previous" event.
3. Create a trigger on the table that will populate end dates as
appropriate.
>However, I am now a bit worried that it seems that theoretically there coul
d
>be cases when the dates of succeeding events could be identical, so
>apparently I need another order field.
Or you could store the time in addition to the date. The smalldatetime
datatype stores date + time with a precision of minutes; the datetime
datatype even has a precision of 3/1000th of a second.
>It seems that the best (could easily
>be used for self-joins) would be carrying reference to previous event ID,
If you choose that, then remove the the end date. I included that to
serve as a reference to the previous event. If you use a different
reference, there's no need to keep the end date.
However, don't ever rely on a system-generated identifier to serve as
the only key for a table. If the "real" key is clumsy, you can add a
system-generated identifier to serve as a "surrogate" key.
>but it involves difficulty how to order the events in the right order by
>simple SQL server means.
Are the events inserted in chronological order? In that case, it's quite
easy to have the system generate an ID value that will keep increasing.
That would make ordering trivial.
(snip)
> I am sort of dismayed that so
>few tools are built in MSSQL to analyse record series - that is typical nee
d
>in complex analysis and caching all the analysis sample in application
>memory for proprietary code analysis sometimes is very ineffective.
You might want to investigate Analysis Services. I must admit that the
only thing I know about it is it's name (in some circles, that's enough
to qualify as "expert"...). But who knows - maye it's just the tool for
your needs!
Best, Hugo
--
(Remove _NO_ and _SPAM_ to get my e-mail address)
No comments:
Post a Comment