Time Series: Prototype & Possibilities
Internal-only exploration of time series use cases in W&B
Created on February 28 | Last edited on June 1
This use case exploration follows an excellent TensorFlow Colab to prototype the basics of working with time series data and models in Weights & Biases. I use a dataset of weather measurements (temperature, wind speed, humidity, etc) and compare model predictions across some simple neural networks (linear, convolutional, and recurrent). I summarize workflow recommendations, product challenges & opportunities, and potential next steps on the ProductML side.
Contents
- Workflow setup and recommendations
  - Save cleanest data as Artifact; load with custom function
  - Set up fixed, named windows of validation data
  - Log multiple formats: step index, timestamp, and human-readable
  - Log ground truth as a wandb.Table in a distinct run
  - Toggle true and predicted line series via PanelPlot
- Visual examples
  - Detailed view with loss curves
  - Flexible evaluation across data windows
- Challenges & opportunities
  - PanelPlot UX details
  - Exploratory visualization in W&B
  - Core Weave: plot query, multi-key Table lookup
  - Workflow versioning and orchestration across Runs, Tables, and Artifacts
  - Grouping across models by type / ad-hoc field
- Potential next steps
  - Prototype flexible & efficient data versioning
  - Increase model complexity: compare to traditional, SOTA, or multi-modal (e.g. images) approaches
  - Grouping workarounds & orchestration prototypes
  - More complex & real-world use cases
- P.S. Logging time series data
  - Log a Pandas DataFrame directly
  - Log rows of data and explicit columns
Resource links
Workflow setup and recommendations
Time series data is easier to visualize, organize, and understand in W&B with a few modifications to the standard workflow:
Save cleanest data as Artifact; load with custom function
Starting with the initial raw data, explore and clean up the features—e.g. change the format, fill in missing values, normalize—as much as possible before versioning the cleanest possible data as an Artifact.
You might log the whole dataframe as one Table:
wandb.log({"raw_data": wandb.Table(dataframe=data_df)})
or choose to fix a split into train/val/test before logging.
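A minimal sketch of fixing a chronological split before versioning, assuming a time-sorted DataFrame. The `data_df` stand-in and the `chronological_split` helper are illustrative; the fractions match the 0.7/0.2/0.1 values discussed in the comments. The wandb calls are shown commented since they require a live run.

```python
import pandas as pd

data_df = pd.DataFrame({"temperature": range(10)})  # stand-in for the weather data

def chronological_split(df, train_frac=0.7, val_frac=0.2):
    """Split sequential data by position; never shuffle time series rows."""
    n = len(df)
    i = int(round(n * train_frac))
    j = int(round(n * (train_frac + val_frac)))
    return df.iloc[:i], df.iloc[i:j], df.iloc[j:]

train_df, val_df, test_df = chronological_split(data_df)

# Then version each split, e.g. as Tables inside one Artifact:
# run = wandb.init(project="timeseries")
# artifact = wandb.Artifact("weather-splits", type="dataset")
# for name, split in [("train", train_df), ("val", val_df), ("test", test_df)]:
#     artifact.add(wandb.Table(dataframe=split), name)
# run.log_artifact(artifact)
```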
Set up fixed, named windows of validation data
To precisely compare model performance, designate a few specific windows—timestamped sequences of consecutive data points—as validation data. Log these as named wandb.Tables—e.g. I use "val_samples_0", "val_samples_1", "val_samples_2"... as my Table names. By keeping the name constant, I can guarantee that I am comparing the forecasts of different models on the same input data.
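The windowing step above can be sketched as follows; the 24-point window size and the `val_df` stand-in are illustrative, but the constant `val_samples_{i}` names are the pattern used throughout this report. The wandb call is commented since it requires a live run.

```python
import pandas as pd

val_df = pd.DataFrame({"temperature": range(100)})  # stand-in validation split
WINDOW = 24  # consecutive points per window; illustrative size

# Fixed, named windows: constant names guarantee every model run logs
# predictions against exactly the same input sequences.
windows = {
    f"val_samples_{i}": val_df.iloc[start : start + WINDOW]
    for i, start in enumerate(range(0, 3 * WINDOW, WINDOW))
}

# for name, win in windows.items():
#     wandb.log({name: wandb.Table(dataframe=win)})
```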
Log multiple formats: step index, timestamp, and human-readable
I found three formats for managing time stamps useful for different purposes, and I recommend logging all of them to each Table as distinct columns:
- time step index: literally log the index of the input time step—this will be offset relative to any data window you've defined, so if your model reads in a sliding window of 3 data points, the first time step would be at index 3 and not 0.
- timestamp: log the numerical timestamp as Python datetime.timestamp()—this can be converted to a human-readable timestamp from the Table UI with .toTimestamp()
- human-readable timestamp: log the string version of the timestamp which you would like readers to see (the Table-supplied timestamp only has day-level resolution and its format cannot be customized)
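The three columns above can be built like this; the start date, display format, and window size of 3 are illustrative. The wandb call is commented since it requires a live run.

```python
from datetime import datetime, timedelta

WINDOW = 3  # model input length; the step index is offset by this amount
start = datetime(2016, 1, 1)  # illustrative start date

rows = []
for i in range(5):
    ts = start + timedelta(hours=i)
    rows.append([
        WINDOW + i,                      # time step index, offset by the window
        ts.timestamp(),                  # numerical timestamp in seconds
        ts.strftime("%Y-%m-%d %H:%M"),   # human-readable string for display
    ])

# wandb.log({"val_samples_0": wandb.Table(
#     data=rows, columns=["step", "timestamp", "time_readable"])})
```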
Log ground truth as a wandb.Table in a distinct run
To compare model performance against the ground truth, log the ground truth values for each named window as a separate wandb.Table—e.g. I use a distinct run named "ground_truth" to log all the labels (five Tables of validation data).
Toggle true and predicted line series via PanelPlot
Create a PanelPlot from every named Table / window of validation data. On each PanelPlot, you can visually compare model predictions to each other and to the ground truth by toggling each run on/off independently. This run visibility is controlled through the left sidebar in the Workspace or the run set below a panel grid in a Report.
Below, each line can be shown or hidden by checking the box for each run name.
[PanelPlot: toggle predictions from the Ground truth, Baseline, Linear, Dense 64-64, and LSTM 32 runs]
Visual examples
Detailed view with loss curves
[Run set: Time series models, 8 runs]
Flexible evaluation across data windows
Comparing a few hyperparameter settings for a toy CNN. For these training and evaluation runs, the model reads in three time steps at a time.
[Run set: Convolutional models, 5 runs]
Challenges & opportunities
These are the broad product areas where we can direct efforts if we want to fully and smoothly support time series use cases. Some of these are more fundamental issues/roadmap questions than others: e.g. the Core Weave support, ad-hoc grouping, and workflow versioning are significant and deeply-integrated pieces that we will need to address eventually, regardless of whether we pursue time series, while we could probably punt on exploratory visualization. PanelPlot UX improvements are mostly polish/visual details that could easily be separated from core platform engineering work, yet they are crucial for practitioners to be able to work with time series in W&B.
PanelPlot UX details
Configuring the PanelPlot for each validation data Table is a tricky manual process. Timestamp formats have several tradeoffs, and the typical interactive functionality of W&B charts (pan, zoom, hover over labels, change colors) is incomplete.
Extensible timestamp format & Table row indexes
We have a conversion option as a modifier on a Table column, but it only expects a particular input format and only supports resolution down to the hour. Managing the mapping from time step indexes, to numerical timestamps in seconds, to display-format strings is complicated, and individual users' cases will vary substantially, making it difficult to pick one best default format for any of these stages. Currently I log three columns to each Table to address different time-based needs. Having explicit access to the Table row index would simplify a lot of this work.
More sophisticated visualizations for time series
We could support much more powerful interactive visualizations for time series: annotating/labeling points or subregions, averaging regions, using a different color to fill in regions where curve A is above B vs B is above A, faceting, switching between points/lines/bars/etc. Lots of fascinating and beautiful ideas in this Observable series.
Downstream steps & integration
Exporting PanelPlots as static images or Table data into more sophisticated analysis software would be very useful.
Exploratory visualization in W&B
While DataFrames are easy to log as Tables, exploratory data analysis from such Tables is difficult. PanelPlot and the grouped histograms from the Table view allow a narrow band of options, with no further configuration or customization of the Vega spec. New derived columns (such as the average of two existing columns) cannot be plotted, and each line has to come from a different run. Subsets of Tables or different formats for the columns cannot be generated flexibly/dynamically/from the UI.
Matplotlib integration or conversion
Matplotlib is a better fit for this exploratory visualization stage—especially for iterating on the raw data before arriving at the clean, fully-preprocessed, and versioned train/val/test split—and most users are already familiar with it. As a first pass, we could make it easier to associate matplotlib figures with specific runs/views of the Table. We could also expand our library of Custom Charts and show users how to convert their equivalent data (e.g. to log an interactive heatmap to W&B in one line instead of importing a static image file from the saved Matplotlib figure).
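As a concrete reference point for the "first pass" above: wandb already accepts matplotlib figures directly in `wandb.log`, so exploratory plots can be attached to a run today. The data and labels below are illustrative; the wandb call is commented since it requires a live run.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted runs
import matplotlib.pyplot as plt

# Quick exploratory plot of one raw feature before cleaning/versioning.
temps = [10.2, 11.5, 9.8, 12.0, 13.1]  # illustrative values
fig, ax = plt.subplots()
ax.plot(range(len(temps)), temps, marker="o")
ax.set_xlabel("time step")
ax.set_ylabel("temperature")

# wandb.log({"raw_temperature": fig})  # logs the figure to the current run
```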
Core Weave: plot query, multi-key Table lookup
These are very fundamental issues—we need to be able to visualize the results of a query in Weave, e.g. grouped by a column, or showing data from a derived column. We also need a pattern for flexibly finding Tables by multiple keys (e.g. model A generated predictions on dataset X: where is this Table stored, and how can I find it in the fewest clicks?). Shawn suggested that Jobs/Launch will be able to help here.
Workflow versioning and orchestration across Runs, Tables, and Artifacts
This is a broad area of active efforts, including Model Registry/Cards, Launch, high-level Artifacts API, Weave templates, etc. ProductML efforts in the current product could be sufficient to make better prototypes, but eventually these will need to merge cleanly with the product roadmap.
Static artifacts vs dynamic generators for batched training data
How do we recommend that folks organize and manage their static Artifacts versus dynamic batched data for time series? Do we have additional layers of runs and artifacts? Does the preprocessing code go in the Artifact metadata or as hyperparameters on this "preprocessing" run? We likely don't need too much product work here, but we do need clear alignment on and examples of best practices.
Best practices for versioning workflow stages
We have great patterns for managing experiments to track and understand training recipes and hyperparameters, visualize results, and analyze/compare a narrow set of models. How do practitioners factor out, track, and manage the upstream steps of preprocessing, feature engineering, and data splitting? How do they organize, manage, and flexibly evaluate mature models across clear data subsets? We need clearer recommendations and examples of how to do this, and likely product/UI work to facilitate it, beyond the current approach of super careful naming and tons of trial-and-error.
Grouping across models by type / ad-hoc field
In Tables & PanelPlot, we don't provide a way to group results or metrics. Below, I've trained several variants of three model types: dense (orange), convolutional (mint), and lstm (blue). If I group by model type, you can see that lstm has the highest average loss, and two separate groups of predictions appear on the PanelPlots—but the lines all look like the same brownish color (live workspace here). If we could average the PanelPlot lines, the way we can with grouping in standard W&B line plots, I could visually explore hypotheses like "LSTM models tend to overestimate and convolutional models tend to underestimate spikes in temperature".
Managing the model evaluation phase via run group names
The best workaround idea I have is to write a function which loads previously logged Tables of predictions, averages the correct columns, and logs a new averaged Table which I can turn into a PanelPlot from the UI. This PanelPlot will show new runs—one for each group. It's unclear how these new grouped runs relate semantically to the model variants themselves, and the grouping would be a fixed snapshot, impossible to update as I train more models. This breaks our usually-dynamic run-tracking functionality, and highlights another complexity: giving a group name G to each run R which evaluates an existing model variant M of model type N on validation dataset D and logs this to table T. This group needs to be meaningfully associated with the generic evaluation of N-type models on D_0, D_1, D_2, etc, distinctly associated with the specific model variant M_x, and with all evaluations performed against the specific data in T_x. Current options fall short:
- One string in the group name can't accommodate all of this.
- Flexible ad-hoc tagging can't enable grouping (runs can't be grouped by tag from the UI).
- Multiple config variables are very hard to set up in advance—these need to be edited/adjusted afterwards from the API.
- Table key names are fixed and need to be regenerated repeatedly for each evaluation phase.
In summary, organizing runs and tracking experiments in this paradigm is brittle and near-impossible.

Live example of grouped predictions—looks different from the workspace because of run color and timestamp issues in addition to missing the grouping functionality.
[Run set: 26 runs]
Potential next steps
Prototype flexible & efficient data versioning
For simplicity, the current example assumes one fixed version of the dataset and one possible path from raw data to the training batches. In practice, we need to support branching decisions and track meaningful hyperparameters and states at several levels. This can be accomplished via more detailed customization of Artifacts, Tables, run grouping/tagging, and specific recent prototypes like Dylan's workflow for versioned tabular data.
Feature engineering from preprocessing & format choices
This example converts the raw numerical wind speed, time of year, and day to sine / cosine wave forms to account for periodicity before creating the training data. It normalizes the distinct data splits before training. We may want to clean up, reformat, parameterize, or otherwise preprocess individual features or full splits differently before training.
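The sine/cosine treatment above can be sketched like this, using hour-of-day as a stand-in periodic feature (the column names are illustrative, not from the original Colab). Encoding the phase as a (sin, cos) pair keeps 23:00 numerically close to 00:00.

```python
import numpy as np
import pandas as pd

# Stand-in periodic feature: hour of day, period = 24.
df = pd.DataFrame({"hour": np.arange(24)})
period = 24.0

# Map the periodic value onto the unit circle so the model sees continuity
# across the wrap-around point.
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / period)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / period)
```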
Selection of input features X and predicted features Y
This example relies on 19 input weather features (wind speed, humidity, etc) to predict one output feature (temperature). We could use the same approach—same workflow, essentially identical code—to (1) explore and evaluate the significance/relevance of different input features, e.g. train on only 3 features, add in 3 more features to denote the geolocation, rank the features by their relative contribution and (2) predict additional labels, e.g. humidity in addition to temperature.
Sampling & sliding window configuration
We might be interested in varying:
- the subsampling frequency—one per day, one per hour, one per minute—possibly with different choices across features
- the size of the sliding window for model input: does the model look at 1 time step, 3 time steps, 24 time steps, etc in order to predict the next time step?
- the size of the sliding window for model output: does the model predict one time step or a sequence of several?
- time offset between input and prediction: we might read a full 24 hours' worth of data to predict the data for the next day, or a week's worth of data to predict the next hour, etc.
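The window parameters listed above can be sketched as a single helper; this is a minimal illustration, not the Colab's `WindowGenerator`, and the toy series and parameter values are assumptions.

```python
import numpy as np

series = np.arange(10, dtype=float)  # toy univariate series
IN_WIDTH, OUT_WIDTH, OFFSET = 3, 1, 1  # input window, label window, gap

def make_windows(x, in_width, out_width, offset):
    """Pair each input window with the label window `offset` steps after it."""
    X, Y = [], []
    last = len(x) - in_width - (offset - 1) - out_width + 1
    for i in range(last):
        X.append(x[i : i + in_width])
        label_start = i + in_width + offset - 1
        Y.append(x[label_start : label_start + out_width])
    return np.array(X), np.array(Y)

X, Y = make_windows(series, IN_WIDTH, OUT_WIDTH, OFFSET)
```

Varying `IN_WIDTH`, `OUT_WIDTH`, and `OFFSET` covers the input size, output size, and input/prediction time offset dimensions listed above.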
Train / val / test splits
Easily choose different subsets of the full raw data for training vs validation vs testing.
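One hedged sketch of such subset selection is a walk-forward scheme, where a sequence of (train, test) horizons replaces a single fixed split; the function and parameter values are illustrative.

```python
def walk_forward_splits(n, n_splits, test_size):
    """Yield (train_indices, test_indices) pairs with an expanding train window.

    Each successive split trains on all data before a horizon of `test_size`
    points, keeping the sequential ordering intact.
    """
    for k in range(1, n_splits + 1):
        train_end = n - (n_splits - k + 1) * test_size
        yield range(0, train_end), range(train_end, train_end + test_size)

splits = list(walk_forward_splits(n=20, n_splits=3, test_size=4))
```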
Integrating predictions and observations
In production, we will receive new real observations alongside our model's predictions. We may want to explore active/online learning or otherwise dynamically integrate observed values to update our model.
Increase model complexity: compare to traditional, SOTA, or multi-modal (e.g. images) approaches
The specific models are toy examples: tiny networks with simple dense, convolutional, and LSTM architectures. A more in-depth report could explore:
- comparison to traditional/statistical approaches: LightGBM/XGBoost, ARIMA, etc.
- SOTA approaches with transformers, RL, latest publications on other NN variants
- more complex or multimodal: e.g. satellite images, heatmaps, or videos at each time step, instead of or in addition to tabular features
Grouping workarounds & orchestration prototypes
We could extend this example to iterate on Model Registry, Launch, Report templates, and other new MLOps features. A first pass could be implemented with very careful grouping, naming, and Weave/Table/Report configuration in the existing product (without relying on any of the features in active development).
Manage models as primary objects
Focus on optimizing a particular model across many experiments, sweeps, data splits, etc. These models could become Model Cards, be finetuned/retrained/updated via Launch commands, be analyzed in a high-level evaluation dashboard, etc.
Enhance model evaluation workflows
This example is a sufficient foundation for a more general and powerful use case. Imagine I've trained models A, B, C, and D, and I have datasets X, Y, Z (from different months or years of weather measurements). How can I easily evaluate all models on X, or compare A&B on Y, evaluate new model E on X and Y, see the biggest discrepancies across all models on X, spot the biggest differences between B & D on Z, or otherwise answer all these questions from the UI? Most of these can be answered through very careful choices about grouping, naming, Artifacts, Weave, and Reports, but we could make it so much smoother.
Prototype active/online learning workflows
Once a model is deployed, how do we combine its predictions with actual observations? How do we retrain/improve the model as we accumulate ground truth over time, or perhaps dynamically integrate the real-time data to adjust predictions? How do we test larger hypotheses on or orchestrate larger updates to the full model registry—e.g. if we discover that we can safely lower sampling frequency by 20% and save a bunch of storage/compute costs, or if we build a new weather monitoring station and want to incorporate its data?
More complex & real-world use cases
Collaborations and in-depth case studies with existing projects—like forecasting energy demand for the grid, output from existing solar farms, changes in ground or cloud cover from satellite over time—would make for more compelling reports, help us better understand and address actual customer needs, and provide an exciting opportunity to support the climate/environmental sustainability space.
P.S. Logging time series data
To visualize time series data via wandb.Table, there are several options:
Log a Pandas DataFrame directly
This is the easiest path—use it when exploring your dataset and understanding features.
wandb.log({"timeseries_data": wandb.Table(dataframe=data_df)})
Log rows of data and explicit columns
This format gives more precise control: select exactly the columns you care about and log timestamps in the correct format.
data_with_time = [[data[i], timestamps[i]] for i in range(len(data))]
wandb.log({"timeseries_data": wandb.Table(
    data=data_with_time,
    columns=["temperature", "time"])})
Comments

On "In production, we will receive new real observations alongside our model's predictions...":
This is true; I could imagine just logging a new copy (with a signifying date) of the ground truth artifact as the new data comes in.

On `"train_fraction" : 0.7, "val_fraction" : 0.2, "test_fraction" : 0.1`:
One note is that for time series, the standard train/val/test split is very uncommon. Because the datasets are required to be sequential, the train/test splits are usually framed in terms of a horizon, and validation is done after a sliding window is constructed. It's very common to construct a sequence of train/test splits.