Official implementation of Investigating Stack Depth in xLSTM Architectures for Vibration Time Series Prediction, trained on MTSU's HPC cluster using DeepSpeed ZeRO Stage 2.
This study evaluates five xLSTM configurations ranging from 1 to 5 stacked mLSTM-sLSTM block pairs for one-step-ahead vibration displacement forecasting. Results reveal a strongly non-monotonic relationship between stack depth and predictive accuracy, with the 1-stack configuration achieving the best performance (R² = 0.9869) and the 4-stack configuration undergoing near-complete training collapse.
- [05.01.2026] Repository released with full training code, DeepSpeed config, and SLURM job script.
Vibration time series prediction from industrial interferometer signals presents a challenging forecasting problem due to high-frequency content, large dynamic range, and intermittent spike artifacts. This repository investigates how stacking depth in xLSTM architectures affects forecasting performance on this domain.
Five architectures are evaluated, from a single mLSTM-sLSTM block pair (1-stack) up to five sequential block pairs (5-stack), with all other architectural and training parameters held constant. All experiments were conducted on two NVIDIA RTX A5000 GPUs using DeepSpeed ZeRO Stage 2 optimization and FP16 mixed precision.
The xLSTM model is built from two complementary recurrent block types stacked to varying depths:
mLSTM Block: operates on a 16×16 matrix-valued memory tensor updated at each timestep via outer-product operations. Gate outputs are constrained to [0.05, 0.95] to prevent saturation, and the memory update is scaled by 0.1 to prevent exponential growth.
sLSTM Block: follows a conventional gated recurrent structure with scalar memory cells, applying LayerNorm to the cell state before the tanh activation for stable activation distributions across long sequences.
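The mLSTM description above can be illustrated with a minimal pure-Python sketch of one memory update. Only the gate range [0.05, 0.95], the 0.1 update scale, and the outer-product form come from the text. The exact update rule `C_t = f * C_{t-1} + 0.1 * i * (value ⊗ key)` and the names `clamped_gate` and `mlstm_memory_step` are assumptions for illustration, not the code in `xlstm_v4_deepspeed.py`.

```python
import math

GATE_MIN, GATE_MAX = 0.05, 0.95  # gate range stated in the description
UPDATE_SCALE = 0.1               # memory update scale stated in the description


def clamped_gate(pre_activation):
    """Numerically stable sigmoid, clamped to [GATE_MIN, GATE_MAX]."""
    if pre_activation >= 0:
        s = 1.0 / (1.0 + math.exp(-pre_activation))
    else:
        e = math.exp(pre_activation)
        s = e / (1.0 + e)
    return min(max(s, GATE_MIN), GATE_MAX)


def mlstm_memory_step(memory, key, value, forget_pre, input_pre):
    """One ASSUMED update: C_t = f * C_{t-1} + 0.1 * i * (value outer key)."""
    f = clamped_gate(forget_pre)
    i = clamped_gate(input_pre)
    return [
        [f * memory[r][c] + UPDATE_SCALE * i * value[r] * key[c]
         for c in range(len(key))]
        for r in range(len(value))
    ]


if __name__ == "__main__":
    dim = 16  # 16x16 memory, as in the description
    memory = [[0.0] * dim for _ in range(dim)]
    key = [1.0] * dim
    value = [2.0] * dim
    memory = mlstm_memory_step(memory, key, value, forget_pre=0.0, input_pre=0.0)
    print(memory[0][0])  # equals 0.1 * 0.5 * 2.0 * 1.0
```

Because the forget gate can never exceed 0.95 and each write is scaled by 0.1, repeated updates with bounded keys and values stay bounded in this sketch, which is the stated purpose of both constraints.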
The stacking scheme for each configuration is: N mLSTM blocks followed by N sLSTM blocks, where N ∈ {1, 2, 3, 4, 5}. LayerNorm and Dropout (p = 0.1) are inserted between consecutive same-type blocks to stabilize gradient flow. Input sequences of 30 timesteps are projected from 1 dimension to a 128-dimensional hidden representation before being passed through the stacked blocks. The hidden state at the final timestep is projected to a single displacement prediction.
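To make the layer ordering concrete, the sketch below lists the modules for a given N as plain strings. It follows one reading of the paragraph above: LayerNorm and Dropout appear only between blocks of the same type, not between the last mLSTM and the first sLSTM. The function name `build_stack_plan` and the label strings are hypothetical.

```python
def build_stack_plan(num_pairs, hidden_size=128, dropout=0.1):
    """Return the ordered layer labels for `num_pairs` mLSTM-sLSTM block pairs."""
    if num_pairs < 1:
        raise ValueError("num_pairs must be at least 1")
    plan = [f"Linear(1 -> {hidden_size})"]  # input projection
    for block_type in ("mLSTM", "sLSTM"):   # all mLSTM blocks first, then all sLSTM
        for index in range(num_pairs):
            if index > 0:
                # Inserted only between consecutive blocks of the same type.
                plan.append("LayerNorm")
                plan.append(f"Dropout(p={dropout})")
            plan.append(block_type)
    plan.append(f"Linear({hidden_size} -> 1) on final timestep")  # output head
    return plan


if __name__ == "__main__":
    for layer in build_stack_plan(2):
        print(layer)
```

Under this reading, an N-stack model contains 2N recurrent blocks and 2(N − 1) LayerNorm/Dropout pairs, so the 1-stack model has no inter-block normalization at all.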
Vibration displacement signals (nm) were collected via interferometer from a lathe machine over time (ms). Raw signals exhibit two primary artifacts that require correction before training: a consistent downward drift and intermittent sharp spikes from sensor noise.
Figure 1. Representative raw displacement signal showing downward slope drift and a sharp spike artifact at approximately 185,000 ms.
Preprocessing steps applied before training:
- Linear regression detrending to remove measurement drift.
- Spike correction using a 3-standard-deviation threshold in 10,000 ms windows, with linear interpolation over detected outliers.
- Sequence construction with length 30 and stride 1 for one-step-ahead forecasting.
- RobustScaler normalization fit on the training set and applied to validation.
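The first two preprocessing steps can be sketched in pure Python as follows. This is an illustration under stated assumptions rather than the repository's implementation: the windows are taken to be non-overlapping, the threshold is measured from each window's own mean, timestamps are assumed sorted, and the names `detrend_linear` and `correct_spikes` are hypothetical.

```python
import statistics


def detrend_linear(times, values):
    """Subtract the least-squares line fitted to (times, values)."""
    t_mean = statistics.fmean(times)
    v_mean = statistics.fmean(values)
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    slope = sxy / sxx
    intercept = v_mean - slope * t_mean
    return [v - (slope * t + intercept) for t, v in zip(times, values)]


def correct_spikes(times, values, window_ms=10_000.0, n_sigma=3.0):
    """Interpolate linearly over points beyond n_sigma std devs of their window."""
    n = len(values)
    is_spike = [False] * n
    start = 0
    while start < n:
        # Grow the window until it spans window_ms (assumes sorted timestamps).
        end = start
        while end < n and times[end] - times[start] < window_ms:
            end += 1
        window = values[start:end]
        mean = statistics.fmean(window)
        std = statistics.pstdev(window)
        for k in range(start, end):
            is_spike[k] = std > 0 and abs(values[k] - mean) > n_sigma * std
        start = end

    cleaned = list(values)
    k = 0
    while k < n:
        if not is_spike[k]:
            k += 1
            continue
        run_end = k
        while run_end < n and is_spike[run_end]:
            run_end += 1
        left, right = k - 1, run_end  # nearest clean neighbours, if they exist
        for j in range(k, run_end):
            if left >= 0 and right < n:
                frac = (times[j] - times[left]) / (times[right] - times[left])
                cleaned[j] = values[left] + frac * (values[right] - values[left])
            elif left >= 0:
                cleaned[j] = values[left]
            elif right < n:
                cleaned[j] = values[right]
        k = run_end
    return cleaned
```

Detrending runs before spike correction in this sketch so that the drift does not inflate each window's standard deviation and mask genuine outliers.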
The dataset comprises 22 CSV files for training (~7.6M sequences) and 1 CSV file for validation (~605K sequences).
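The sequence counts follow from the windowing rule: with length 30 and stride 1, a series of n samples yields n − 30 input/target pairs. A minimal sketch of that construction is shown below; the function name `make_sequences` is hypothetical.

```python
def make_sequences(series, seq_len=30, stride=1):
    """Build (input window, next value) pairs for one-step-ahead forecasting."""
    inputs, targets = [], []
    for start in range(0, len(series) - seq_len, stride):
        inputs.append(series[start:start + seq_len])  # 30 past samples
        targets.append(series[start + seq_len])       # the sample that follows
    return inputs, targets


if __name__ == "__main__":
    demo = list(range(35))
    x, y = make_sequences(demo)
    print(len(x), y[0], y[-1])
```

Windows should be built per file rather than across file boundaries, otherwise a window would mix samples from two unrelated recordings.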
| Configuration | R² | RMSE (nm) | MAE (nm) | Epochs | Train Time (h) |
|---|---|---|---|---|---|
| 1-Stack | 0.9869 | 74.41 | 38.88 | 35 | 3.16 |
| 2-Stack | 0.9828 | 85.33 | 44.93 | 29 | 15.29 |
| 3-Stack | 0.9650 | 121.60 | 60.26 | 13 | 3.38 |
| 4-Stack | 0.5286 | 446.25 | 65.00 | 43 | 45.65 |
| 5-Stack | 0.8796 | 225.54 | 72.81 | 13 | 17.54 |
Evaluated on 605,772 validation sequences. Training time and epoch count reflect the full training run including early stopping.
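For reference, the three reported metrics can be computed as in the sketch below, using their standard definitions. The function name `regression_metrics` is hypothetical, and the repository may compute these with scikit-learn instead.

```python
import math


def regression_metrics(actual, predicted):
    """Return R^2, RMSE, and MAE for paired actual/predicted values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(a - p) for a, p in zip(actual, predicted)) / n,
    }
```

RMSE squares each error before averaging, so a small number of very large errors raises it far more than MAE. That is consistent with the 4-stack row, where RMSE is roughly seven times the MAE.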
The 4-stack configuration represents a qualitatively different failure mode: despite the longest training run (45.65 hours, 43 epochs), the model became trapped in a poor local minimum and generated large-magnitude hallucinated predictions on a subset of high-amplitude inputs. The 5-stack partially recovered (R² = 0.8796), suggesting a different convergence trajectory.
1-Stack: Near-perfect overlap between predicted and actual displacement across the full dataset. In the detail view, the model correctly captures rapid direction reversals, peak sharpness, and zero-crossings with only marginal underestimation at the highest-amplitude spikes.
2-Stack: Produces visually similar tracking but with slightly more divergence at high-amplitude burst events, consistent with its higher RMSE.
3-Stack: Shows increasing divergence at peak amplitude events. The detail view confirms growing phase misalignment in the 400–600 step window, consistent with the degraded R² = 0.9650.
4-Stack: Exhibits the most visually striking failure: a hallucinated spike near time step 120,000 reaching approximately 10,000 nm with no corresponding feature in the actual signal. The predicted axis range extends to ±10,000 nm, confirming that the model generates large-magnitude out-of-distribution predictions on a subset of inputs.
5-Stack: Shows a different failure signature: a spurious deep negative spike near time step 120,000 and asymmetric amplitude envelope clipping. The detail view confirms the predicted line has become noticeably smoother and less reactive than the actual signal.
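Spurious spikes like those described for the 4-stack and 5-stack models can be located automatically by flagging predictions whose magnitude exceeds the envelope of the actual signal. The sketch below is one simple way to do that. The function name `out_of_envelope` and the 1.5× margin are hypothetical choices, not part of the repository.

```python
def out_of_envelope(actual, predicted, margin=1.5):
    """Return indices where |prediction| exceeds margin * max |actual|."""
    limit = margin * max(abs(a) for a in actual)
    return [i for i, p in enumerate(predicted) if abs(p) > limit]


if __name__ == "__main__":
    actual = [100.0, -200.0, 150.0]
    predicted = [90.0, -180.0, 10_000.0]
    print(out_of_envelope(actual, predicted))  # the last index is flagged
```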
All configurations that converged successfully exhibit an S-shaped nonlinear bias: residuals are increasingly negative at large negative predicted displacements and increasingly positive at large positive ones. This regression-toward-the-mean behavior is a structural property of MSE-trained xLSTM on this dataset and grows in magnitude with stack depth.
1-Stack: Tightest residual distribution, near-zero mean of −2.24 nm.
2-Stack: Almost identical S-shaped pattern with a negligible mean bias of −1.24 nm.
3-Stack: S-curve deepens noticeably; residuals reach approximately −2,000 nm at the most negative predicted values. Mean bias increases to −10.92 nm.
4-Stack: Qualitatively different failure mode. For predicted displacements beyond approximately 4,000 nm, residuals plunge in a tight diagonal arc to below −15,000 nm. Mean bias −11.63 nm.
5-Stack: Partially recovers with a small positive mean bias of +4.90 nm, but shows a structured downward arc for predicted values in the 0 to −2,000 nm range, with residuals dropping to approximately −7,000 nm.
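The residual patterns above can be summarized numerically by averaging residuals within bins of predicted displacement. The sketch below assumes the convention residual = actual − predicted, which the text does not state explicitly. The function names and the 1,000 nm bin width are hypothetical.

```python
import math


def mean_bias(actual, predicted):
    """Mean residual, with residual ASSUMED to be actual - predicted."""
    return sum(a - p for a, p in zip(actual, predicted)) / len(actual)


def binned_mean_residuals(actual, predicted, bin_width=1000.0):
    """Mean residual per predicted-value bin; keys are the bins' lower edges."""
    sums = {}
    for a, p in zip(actual, predicted):
        key = math.floor(p / bin_width) * bin_width
        total, count = sums.get(key, (0.0, 0))
        sums[key] = (total + (a - p), count + 1)
    return {key: total / count for key, (total, count) in sorted(sums.items())}
```

An S-shaped bias shows up as bin means that are negative in the lowest bins, near zero in the middle, and positive in the highest bins, even when the overall mean bias is close to zero.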
.
├── xlstm_v4_deepspeed.py           # Main training script
├── run_xlstm_v4_deepspeed.slurm    # SLURM job submission script
├── ds_config.json                  # DeepSpeed ZeRO Stage 2 configuration
├── figures/                        # Result plots and architecture diagram
└── README.md
Prerequisites: Python 3.8+, PyTorch with CUDA, and a SLURM-managed cluster with NVIDIA GPUs.
Install dependencies:
pip install deepspeed torch numpy pandas scikit-learn matplotlib tqdm
Place CSV files (two columns: time, displacement) into the following structure:
Data/
├── Train/    # 22 CSV files (~7.6M training sequences)
└── Test/     # 1 CSV file (~605K validation sequences)
Update DATA_FOLDER in xlstm_v4_deepspeed.py to match your path.
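Before launching a long run, it can help to confirm that the folder layout and files parse as expected. The sketch below uses only the standard library and is not taken from the training script. The helper names are hypothetical, and rows that do not parse as two numbers (such as a header row, if the files have one) are skipped.

```python
import csv
from pathlib import Path


def list_split_files(data_folder):
    """Return the sorted CSV paths found under Train/ and Test/."""
    root = Path(data_folder)
    return {split: sorted((root / split).glob("*.csv")) for split in ("Train", "Test")}


def load_displacement_csv(path):
    """Read (time, displacement) columns, skipping non-numeric rows."""
    times, displacements = [], []
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if len(row) < 2:
                continue
            try:
                t, d = float(row[0]), float(row[1])
            except ValueError:
                continue  # header or malformed row
            times.append(t)
            displacements.append(d)
    return times, displacements
```

With the layout described above, `list_split_files` should report 22 files under `Train` and 1 under `Test`.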
Open xlstm_v4_deepspeed.py and set NUM_LAYERS to the desired stack depth (1–5):
NUM_LAYERS = 1  # Change this to 1, 2, 3, 4, or 5
Submit the job using the provided SLURM script:
sbatch run_xlstm_v4_deepspeed.slurm
The script launches DeepSpeed across 2 GPUs on the research-gpu partition:
deepspeed --num_gpus=2 xlstm_v4_deepspeed.py \
--deepspeed \
--deepspeed_config ds_config.json
The ds_config.json enables ZeRO Stage 2 with FP16 mixed precision:
{
"train_batch_size": 256,
"train_micro_batch_size_per_gpu": 128,
"fp16": { "enabled": true, "initial_scale_power": 16 },
"zero_optimization": { "stage": 2, "overlap_comm": true }
}

| Setting | Value |
|---|---|
| Optimizer | Adam, lr = 1×10⁻³, weight decay = 1×10⁻⁵ |
| LR Scheduler | ReduceLROnPlateau (patience=3, factor=0.5) |
| Gradient Clipping | max norm 1.0 |
| Early Stopping | patience = 10 epochs |
| Batch Size | 256 (128 per GPU) |
| Sequence Length | 30 timesteps |
| Hidden Size | 128 |
| Memory Dimension | 16×16 (mLSTM) |
| Dropout | 0.1 |
| Hardware | 2Γ NVIDIA RTX A5000 (24 GB each) |
| Precision | FP16 via DeepSpeed auto loss scaling |
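The batch-size values are tied together by DeepSpeed's rule that `train_batch_size` equals the per-GPU micro-batch size times the gradient accumulation steps times the number of GPUs. The sketch below checks that arithmetic for the configuration shown earlier and derives the initial FP16 loss scale, which DeepSpeed sets to 2 raised to `initial_scale_power`. The helper name `gradient_accumulation_steps` is hypothetical, and the config text is copied here so the example is self-contained.

```python
import json

DS_CONFIG_TEXT = """
{
  "train_batch_size": 256,
  "train_micro_batch_size_per_gpu": 128,
  "fp16": { "enabled": true, "initial_scale_power": 16 },
  "zero_optimization": { "stage": 2, "overlap_comm": true }
}
"""


def gradient_accumulation_steps(config, num_gpus):
    """Solve train_batch_size = micro_batch * accumulation_steps * num_gpus."""
    global_batch = config["train_batch_size"]
    per_step = config["train_micro_batch_size_per_gpu"] * num_gpus
    if global_batch % per_step != 0:
        raise ValueError("train_batch_size is not divisible by micro_batch * num_gpus")
    return global_batch // per_step


config = json.loads(DS_CONFIG_TEXT)
accumulation = gradient_accumulation_steps(config, num_gpus=2)
initial_loss_scale = 2 ** config["fp16"]["initial_scale_power"]

if __name__ == "__main__":
    print(accumulation, initial_loss_scale)
```

With 2 GPUs the accumulation step count works out to 1, so every optimizer step sees exactly one micro-batch of 128 sequences per GPU. Running on a different GPU count requires changing the batch values to keep the product consistent.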
- The 1-stack configuration achieves the best performance across all metrics (R² = 0.9869, RMSE = 74.41 nm, MAE = 38.88 nm) and is the recommended default for this task.
- Performance degrades progressively through 3-stack, collapses catastrophically at 4-stack (R² = 0.5286, 45.65 h training), then partially recovers at 5-stack.
- The nonlinear S-shaped residual bias (regression toward the mean at extreme amplitudes) is a consistent structural property of MSE-trained xLSTM on high-dynamic-range vibration signals, and grows in magnitude with depth.
- DeepSpeed ZeRO Stage 2 enabled stable training across all configurations but did not resolve fundamental optimization difficulties in deep recurrent stacks.
If you find this work useful, please consider citing:
@article{xlstm_stack_depth_vibration,
title = {Investigating Stack Depth in xLSTM Architectures for Vibration Time Series Prediction},
year = {2026}
}

[1] X. Fan, C. Tao, and J. Zhao, "Advanced stock price prediction with xLSTM-based models: Improving long-term forecasting," in 2024 11th International Conference on Soft Computing & Machine Intelligence (ISCMI). IEEE, 2024, pp. 117–123.
[2] M. Alharthi and A. Mahmood, "xLSTMTime: Long-Term Time Series Forecasting with xLSTM," AI, vol. 5, no. 3, pp. 1482–1495, Aug. 2024. doi: 10.3390/ai5030071











