generate_series
I like to keep track of how many miles I'm driving in my car. One conceivable way of doing this is to create a table in a Postgres database in which I can track this information.
CREATE TABLE mileage (
observed_date DATE,
observed_mileage INTEGER
);
Unfortunately, I'm not always the most regular data collector. I often collect this data with gaps of days or months between each reading.
INSERT INTO mileage(observed_date, observed_mileage) VALUES
('2018-05-21', 84088),
('2018-05-26', 84201),
('2018-06-13', 84910);
I want to get some sense for how much I'm driving each day, and one reasonable way I might do that is to linearly interpolate the mileage between readings. For instance, if I see a reading of 10,000 on August 1st and a reading of 11,000 on August 10th, I want to see that on average I drove 100 miles each day 1-10 August.
How can we do this in Postgres? First, we pair the data up:
SELECT LAG(observed_date) OVER (ORDER BY observed_date) AS lag_date
, LAG(observed_mileage) OVER (ORDER BY observed_mileage) AS lag_mi
, observed_date AS obs_date
, observed_mileage AS obs_mi
FROM mileage;
which yields result:
lag_date | lag_mi | obs_date | obs_mi
------------+--------+------------+--------
| | 2018-05-21 | 84088
2018-05-21 | 84088 | 2018-05-26 | 84201
2018-05-26 | 84201 | 2018-06-13 | 84910
Then, we generate a series between each pair of dates:
WITH paired_dates AS (
SELECT LAG(observed_date) OVER (ORDER BY observed_date) AS lag_date
, LAG(observed_mileage) OVER (ORDER BY observed_mileage) AS lag_mi
, observed_date AS obs_date
, observed_mileage AS obs_mi
FROM mileage
)
SELECT *
FROM paired_dates
, generate_series(lag_date, obs_date, INTERVAL '1 day') days(driven_date)
LIMIT 10;
which yields result:
lag_date | lag_mi | obs_date | obs_mi | driven_date
------------+--------+------------+--------+------------------------
2018-05-21 | 84088 | 2018-05-26 | 84201 | 2018-05-21 00:00:00+00
2018-05-21 | 84088 | 2018-05-26 | 84201 | 2018-05-22 00:00:00+00
2018-05-21 | 84088 | 2018-05-26 | 84201 | 2018-05-23 00:00:00+00
2018-05-21 | 84088 | 2018-05-26 | 84201 | 2018-05-24 00:00:00+00
2018-05-21 | 84088 | 2018-05-26 | 84201 | 2018-05-25 00:00:00+00
2018-05-21 | 84088 | 2018-05-26 | 84201 | 2018-05-26 00:00:00+00
2018-05-26 | 84201 | 2018-06-13 | 84910 | 2018-05-26 00:00:00+00
2018-05-26 | 84201 | 2018-06-13 | 84910 | 2018-05-27 00:00:00+00
2018-05-26 | 84201 | 2018-06-13 | 84910 | 2018-05-28 00:00:00+00
2018-05-26 | 84201 | 2018-06-13 | 84910 | 2018-05-29 00:00:00+00
Note that 2018-05-26
occurs twice in the driven_date
column. We can fix that by stopping our series just before
getting to the later date:
WITH paired_dates AS (
SELECT LAG(observed_date) OVER (ORDER BY observed_date) AS lag_date
, LAG(observed_mileage) OVER (ORDER BY observed_mileage) AS lag_mi
, observed_date AS obs_date
, observed_mileage AS obs_mi
FROM mileage
)
SELECT *
FROM paired_dates
, generate_series(lag_date, obs_date - INTERVAL '1 minute', INTERVAL '1 day') days(driven_date)
LIMIT 10;
Anyway, now we need to calculate the actual number of miles driven on the driven_date
.
WITH paired_dates AS (
SELECT LAG(observed_date) OVER (ORDER BY observed_date) AS lag_date
, LAG(observed_mileage) OVER (ORDER BY observed_mileage) AS lag_mi
, observed_date AS obs_date
, observed_mileage AS obs_mi
FROM mileage
)
SELECT driven_date
, (obs_mi - lag_mi)::NUMERIC / (obs_date - lag_date) AS miles_driven
FROM paired_dates
, generate_series(lag_date, obs_date - INTERVAL '1 minute', INTERVAL '1 day') days(driven_date)
LIMIT 10;
which yields result:
driven_date | miles_driven
---------------------+---------------------
2018-05-21 00:00:00 | 22.6000000000000000
2018-05-22 00:00:00 | 22.6000000000000000
2018-05-23 00:00:00 | 22.6000000000000000
2018-05-24 00:00:00 | 22.6000000000000000
2018-05-25 00:00:00 | 22.6000000000000000
2018-05-26 00:00:00 | 39.3888888888888889
2018-05-27 00:00:00 | 39.3888888888888889
2018-05-28 00:00:00 | 39.3888888888888889
2018-05-29 00:00:00 | 39.3888888888888889
2018-05-30 00:00:00 | 39.3888888888888889
And thus I have achieved the desired result.