Padding a Time Series in R

When analyzing and visualizing a new dataset, you’ll often find yourself working with data over time. Most software assumes that a time series is collected at regular intervals, without gaps. While this is usually true of data collected in a laboratory experiment, the assumption is often wrong when working with “dirty” data sources found in the wild.

This can lead to irregularities in many charts. For example, imagine the following dataset:

time observations
2011/11/01 12
2011/12/01 100
2012/01/01 320
2012/06/01 7

Note that the gaps between the data points vary in size, from 1 month to 5 months. When we visualize this using d3, the default is to connect the data points directly, which suggests a gradual shift from one value to the next.
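To make that implied shift concrete, here is a rough sketch in R that works out the value a straight connecting line would suggest for March 1, 2012. It assumes straight-line (linear) segments between points; the names `observed` and `implied` are illustrative only and not part of any original script.

```r
# Rebuild the example data; `observed` is an illustrative name.
observed <- data.frame(
  time = as.Date(c("2011-11-01", "2011-12-01", "2012-01-01", "2012-06-01")),
  observations = c(12, 100, 320, 7)
)

# Linearly interpolate the value that a straight line between the
# January and June points implies for March 1, 2012.
implied <- approx(
  x = as.numeric(observed$time),
  y = observed$observations,
  xout = as.numeric(as.Date("2012-03-01"))
)$y

implied
```

The result falls somewhere between 320 and 7, even though nothing was actually recorded for March.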

While this might work for some cases, you may actually want to fill in the gaps in the data like so:

time observations
2011/11/01 12
2011/12/01 100
2012/01/01 320
2012/02/01 0
2012/03/01 0
2012/04/01 0
2012/05/01 0
2012/06/01 7

This would result in a very different chart!

There are many ways to pad the data. I have written scripts in many languages to accomplish this, but settled on R as the quickest way to transform my data. R is an open source programming language and software environment for statistical computing and graphics.

Here’s a quick way to pad your dataset with zero values for missing dates:
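As a minimal sketch of one way to do this, assuming the data is in a data frame with `time` and `observations` columns: build the full sequence of months, merge it with the observed data, and replace the resulting `NA` values with zeros. The names `raw.data` and `all.dates` are hypothetical, and this is not necessarily the exact script originally used.

```r
# Example data from above; `raw.data` is a hypothetical name.
raw.data <- data.frame(
  time = as.Date(c("2011/11/01", "2011/12/01", "2012/01/01", "2012/06/01"),
                 format = "%Y/%m/%d"),
  observations = c(12, 100, 320, 7)
)

# Every month between the first and last observed date.
all.dates <- data.frame(
  time = seq(min(raw.data$time), max(raw.data$time), by = "month")
)

# Keep every month from all.dates; months with no observation get NA.
merged.data <- merge(all.dates, raw.data, by = "time", all.x = TRUE)

# Pad the gaps with zeros.
merged.data$observations[is.na(merged.data$observations)] <- 0

merged.data
```

Note that this treats a missing month as a true zero, which only makes sense when the absence of a row really means nothing was observed.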

This will result in the following dataset:

> merged.data
        time observations
1 2011-11-01           12
2 2011-12-01          100
3 2012-01-01          320
4 2012-02-01            0
5 2012-03-01            0
6 2012-04-01            0
7 2012-05-01            0
8 2012-06-01            7

A substantial portion of any data visualization project involves cleaning, transforming, and analyzing data. Although R can be intimidating at first, it is a powerful open source tool for working with your data.

This entry was posted by Irene Ros (@ireneros) on June 06, 2012 in Data.
