RMSprop ensures that the error is correctly propagated backwards through the chain of neural networks that compose the LSTM. The method has been used successfully for training LSTMs to generate complex sequences (Graves, 2013), and the PyTorch library provides an implementation of it. The error criterion of the model was the mean squared error loss:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2,$$

where y_i are the observed values and ŷ_i the model predictions. When the model is trained on a training set, the error on this set decreases with every iteration, or epoch.
In order to prevent over-fitting, the process where the model starts memorizing the training set instead of learning the general structure of the data, the performance of the model on the validation set is checked after every epoch.
When the model performance stops improving on the validation set, this is a good indicator that the model is starting to over-fit on the training set, and we can stop the training of the model. This is the classic early stopping method.
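To make the training procedure concrete, the sketch below shows how such a loop could be set up in PyTorch. This is an illustration rather than the code used in this work: the model, the data loaders, and the learning rate, momentum, and patience values are placeholder assumptions.

```python
import copy
import torch
import torch.nn as nn
import torch.optim as optim

def train_with_early_stopping(model, train_loader, val_loader,
                              max_epochs=200, patience=10):
    """Train with RMSprop and the MSE loss, stopping once the
    validation loss no longer improves (classic early stopping)."""
    criterion = nn.MSELoss()
    # Placeholder learning rate and momentum values.
    optimizer = optim.RMSprop(model.parameters(), lr=1e-3, momentum=0.9)

    best_val, best_state, stale_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)   # mean squared error
            loss.backward()                 # backpropagate the error
            optimizer.step()                # RMSprop weight update

        # Check the performance on the validation set after every epoch.
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x), y).item()
                           for x, y in val_loader) / len(val_loader)

        if val_loss < best_val:
            best_val, stale_epochs = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break                       # model starts to over-fit

    model.load_state_dict(best_state)       # keep the best model seen
    return model
```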
When training a neural network, there are many parameters that can be chosen that have an impact on the performance. However, the optimal values of these parameters are problem-dependent (the so-called No Free Lunch theorem; Wolpert and Macready, 1997), and finding them is a computationally expensive task. In our model, there are three sets of hyperparameters: the first set is intrinsic to the architecture of the model itself, the second belongs to the learning method, and the third to the training method.
The hyperparameters were obtained by manually tweaking their values over the course of 15-20 runs and evaluating their performance using 7-fold CV. For our particular case, we observed no significant changes in the accuracy of the model caused by this tweaking.
Because the model seemed robust under this tweaking, we decided not to perform an exhaustive search for the optimal values of the hyperparameters. We now provide an overview of the important hyperparameters and the values we gave them.
A final list of the parameters is summarized at the end of this section. In the model itself, the LSTM has a few hyperparameters that have an impact on its performance. The first is the number of neurons in the hidden layer of the LSTM. This number must be large enough to ensure it can encode the process behind the data, but not so large that it leads to over-fitting and unnecessary computational cost.
This number has been determined by performing CV, and we found the best performance to be around 50 neurons. Next there is the number of layers in the LSTM.
Adding layers increases the complexity and computational cost of the model. In our search we tested multiple stacked LSTM layers, but did not find any significant performance increase, and set the number of layers to 1. Finally, there is the option to make the LSTM bidirectional, where the model has access to both future and past states; this is unnecessary in the context of the Dst forecast, so it has been set to false.
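The architecture described so far can be written down in a few lines of PyTorch. The input size and the final linear layer mapping the hidden state to a Dst value are assumptions made for illustration; the hidden size, the number of layers, and the bidirectional flag follow the values discussed above.

```python
import torch
import torch.nn as nn

class DstLSTM(nn.Module):
    """Sketch of the LSTM-NN described in the text: a single
    unidirectional LSTM layer with 50 hidden neurons."""

    def __init__(self, n_features: int):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_features,  # solar wind (+ Dst) inputs
                            hidden_size=50,         # determined via CV
                            num_layers=1,           # no gain from more layers
                            bidirectional=False,    # no access to future states
                            batch_first=True)
        self.head = nn.Linear(50, 1)                # hidden state -> Dst value

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(x)           # out: (batch, seq_len, 50)
        return self.head(out[:, -1])    # forecast from the last time step
```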
The RMSProp method has many tweakable parameters, but we will focus on the two most important: the learning rate and the momentum. The learning rate is the most important parameter and controls how strongly the model weights are changed by the error gradient. Too large a learning rate can cause unstable learning, with the performance swinging wildly during training and preventing convergence. Too small a learning rate can cause the model to barely change, converging too slowly or not at all. The momentum parameter modulates the effective learning rate throughout the training process and accelerates training.
Most often, the momentum parameter is chosen close to 1. We settled on values for the learning rate and the momentum through the CV runs described above. Finally, when training the model, we can also set a few parameters that affect the performance. Two parameters are important: the number of epochs and the batch size of the training set. The number of epochs decides how many times we loop over the full training set during training.
This number must be large enough that the model has time to converge to the ideal solution before the training is stopped by the early stopping method. The batch size determines how many samples of the training data are given to the model before the error is computed and backpropagation is applied. Setting the batch size to one corresponds to so-called online learning, where the model is trained separately on every sample.
The opposite is offline learning, i.e., computing the error over the entire training set before the weights are updated. Offline learning is almost never used, as it fails to learn the more outlying cases, while online learning is more prone to over-fitting. Using a small batch size is therefore typically recommended. We found that setting the batch size to 64 gave fast convergence and did not have a large impact on the performance.
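In PyTorch, the batch size is set when the training data are wrapped in a DataLoader. The sketch below uses random placeholder tensors in place of the real input sequences and Dst targets.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Placeholders: 1,024 samples of 24-step sequences with 10 features each.
x = torch.randn(1024, 24, 10)
y = torch.randn(1024, 1)

train_loader = DataLoader(TensorDataset(x, y), batch_size=64, shuffle=True)
# batch_size=1 would correspond to online learning, while
# batch_size=len(x) would correspond to offline (full-batch) learning.
```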
The hyperparameters thus fall into three groups, summarized at the end of this section: the LSTM hyperparameters, the RMSProp hyperparameters, and the training hyperparameters.

In this section, the evaluation of the model is discussed and a comparison of the model to some of the latest forecasting models is made. Finally, the forecast is visually inspected. The first analysis examines whether the LSTM-NN model performs better than the persistence model defined in section 2. Table 1 displays the results for the metrics defined in section 2. Only the linear-fit relation of the persistence model is consistently better than that of the LSTM-NN model for every forecasting horizon.
Table 1.

The MAE increases with the forecasting horizon, and always remains smaller than the RMSE. This would indicate that the persistence model could serve as a strong benchmark for nowcasts of the Dst index. Taking into account the results reported in Table 1, we conclude that using the more complicated LSTM-NN model results in better forecasts.
A comparison with the model by Gruet et al. (2018) is given in Table 2. By performing 10-fold CV, we can measure the variance in performance caused by the choice of training and test set, as 10-fold CV replicates the effect of training the model for 10 different choices of training and test set. From this we can learn two things.
The first has already been discussed in section 2: by choosing an ideal training and test set, it is possible to cause an artificial improvement of the model performance. Computing the variance caused by the choice of training and test set makes it possible to determine whether the reported performance is an outlier or expected. The results are shown in Figure 2.
The average performance of the LSTM-NN and the persistence model from the CV is indicated by the blue and dark green bars, and the standard deviation on the performance is indicated by the error bars. The uncertainty interval obtained from this experiment has also been included in Table 2. We find that for both RMSE and correlation, the reported performance of the LSTM-NN and the persistence model lies inside the variation, indicating that there is no artificial improvement of the results by choosing an ideal training and test set.
Figure 2. Using 10-fold cross-validation, the LSTM-NN model and the persistence model were evaluated on 10 different training and test sets with the RMSE and correlation coefficient, to determine the impact of the choice of set on the performance.
The left panel displays the correlation coefficient, whose values we want as close to 1 as possible. The right panel displays the RMSE, whose values we want to minimize. All of these values are also summarized in Table 2, and the values reported by Gruet et al. (2018) have also been added to Figure 2. Looking first at the results for the correlation in the left bar chart, they seem to agree with the observations made by Gruet et al. (2018). Finally, a visual inspection of the forecast is presented.
Figure 3. Each plot shows the observed Dst index (Truth) and the forecast Dst index (Pred). Every row shows the same storm, and every column corresponds to a different forecasting horizon. Notice that the LSTM-NN model systematically forecasts the main phase of the storm too late, by a number of hours equivalent to the forecasting horizon.
This brings us to the main problem of this paper. The purpose of the experiment was to create an LSTM-NN model that forecasts the Dst index with the same accuracy and correlation as other published architectures.
We managed to create such a model, but, when visually inspecting the forecast, we observed a distinct time shift between forecast and observation. If geomagnetic storms are forecast only when they start, it means the LSTM-NN model will not give us any more information than the persistence model. While it is not possible to say whether the models from Gruet et al. (2018) suffer from the same problem, the standard metrics alone cannot rule it out. An additional problem that most modern machine learning techniques have to face is that rare events cannot be properly forecast.
Neural networks trained by gradient descent require that patterns show up frequently in the data, yet dangerous super-storms, like the Halloween or the Carrington events, are extremely rare. In order to forecast them, the networks must nevertheless learn to identify them.
Possible solutions to this issue can be of four types: (1) data augmentation by duplication, where months with a high number of storms are used multiple times in a single epoch, (2) generative data augmentation, where a second machine learning technique, like auto-encoders or generative adversarial networks, is used to generate artificial storms, (3) augmentation by computer simulations, using 3D models of the interaction of the solar wind and the magnetosphere of the planet to artificially generate data with large storms, and (4) multi-tier machine learning architectures, where multiple models specialize in the detection of different types of inputs and storm strengths. A minimal sketch of the first option is given below.
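As an illustration, option (1) can be approximated by over-sampling storm-rich samples during training. The sketch below uses PyTorch's WeightedRandomSampler; the storm criterion (a simple Dst threshold) and the 10x over-sampling factor are assumptions chosen purely for illustration.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, WeightedRandomSampler

# Placeholders for the real input sequences and Dst targets.
x = torch.randn(1024, 24, 10)
y = torch.randn(1024, 1) * 50.0 - 20.0

# Hypothetical criterion: treat samples with Dst < -100 nT as storms
# and make them ten times more likely to be drawn in each epoch.
weights = torch.ones(len(y))
weights[y.squeeze() < -100.0] = 10.0

sampler = WeightedRandomSampler(weights, num_samples=len(weights),
                                replacement=True)
loader = DataLoader(TensorDataset(x, y), batch_size=64, sampler=sampler)
```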
These solutions are out of the scope of the present paper but will be studied in future work.

Section 2 revealed that the LSTM-NN model failed to give an accurate forecast of the Dst index, and in particular of geomagnetic storms, despite the evaluation metrics indicating that the model should have high accuracy and correlation. For other published forecasting models, it is often unclear whether this behavior was overlooked or whether the model simply did not have this problem. At the very least, we expect the warping measure to be able to detect the forecast made by a persistence model.
What follows is first a brief overview of the DTW algorithm, followed by the modifications we made to tailor the algorithm to our specific problem. The DTW algorithm is a method first developed for speech recognition that is now commonly used in the fields of economics, biology, and database analysis (see e.g., Berndt and Clifford, 1994).
DTW is mainly used as a measure to investigate how much a sample time series matches, or is contained in, a target time series. The strength of DTW is that it can compare two time series even though they might be shifted or stretched in time, a property that is essential to our goal. This section summarizes the algorithm developed by Berndt and Clifford (1994). A visualization of this algorithm is shown in Figure 4. Take two time series, Q and S, of length n and m, respectively, and arrange them on an n × m grid.
Each grid point (i, j) then corresponds to an alignment of q_i and s_j. An alignment is given a cost by a distance function d(q_i, s_j). The DTW algorithm then searches for a path, the so-called warping path P, in this grid that minimizes the sum of these distances.
The warping path P can be defined as

$$P = p_1, p_2, \ldots, p_K, \qquad p_k = (i_k, j_k).$$

The path must then minimize the cost function:

$$\mathrm{DTW}(Q, S) = \min_{P} \sum_{k=1}^{K} d(p_k).$$

The path is subject to the following conditions:

Boundary conditions: the beginning and the end of the sequences are matched, i.e., p_1 = (1, 1) and p_K = (n, m);

Continuity: there are no gaps, every point is mapped to at least one other point;

Monotonicity: the path cannot go backwards in time, i.e., i_{k-1} <= i_k and j_{k-1} <= j_k.

Figure 4. (A) Shows two time series we want to compare. (B) Illustrates the cumulative distance matrix, together with a warping window w and the ideal warping path P in blue.
(C) Illustrates the warping path P aligning the two time series.

In order to find this optimal path, the following dynamic programming technique can be used: the cumulative distance γ(i, j) is computed recursively as

$$\gamma(i, j) = d(q_i, s_j) + \min\left\{\gamma(i-1, j),\; \gamma(i, j-1),\; \gamma(i-1, j-1)\right\}.$$

Once all the cumulative distances are computed, the optimal warping path can be found by starting at the point (n, m) and tracing backwards through the grid, taking the smallest value each time.
This is displayed in Figure 5.

Figure 5. Illustration of how the warping path is determined from the distance matrix D.

A warping window constraint can be added to the algorithm, restricting the path to points with |i_k - j_k| <= w (Equation 17). This window will change the warping cost and the warping path P. Faster and better implementations of this algorithm exist (see e.g., the FastDTW algorithm of Salvador and Chan, 2007).
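For reference, a compact NumPy implementation of this dynamic programming scheme, written for this overview rather than taken from any particular library, could look as follows. The optional window argument implements the warping window constraint of Equation 17.

```python
import numpy as np

def dtw(q, s, window=None):
    """Dynamic time warping between 1-D sequences q and s.
    Returns the warping cost and the warping path P."""
    n, m = len(q), len(s)
    w = max(n, m) if window is None else max(window, abs(n - m))

    # Cumulative distance matrix, padded with infinities so that
    # out-of-range neighbors are never selected.
    gamma = np.full((n + 1, m + 1), np.inf)
    gamma[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = abs(q[i - 1] - s[j - 1])            # local distance d(q_i, s_j)
            gamma[i, j] = d + min(gamma[i - 1, j],
                                  gamma[i, j - 1],
                                  gamma[i - 1, j - 1])

    # Trace back from (n, m), taking the smallest value each time.
    path, (i, j) = [(n, m)], (n, m)
    while (i, j) != (1, 1):
        i, j = min([(i - 1, j), (i, j - 1), (i - 1, j - 1)],
                   key=lambda p: gamma[p])
        path.append((i, j))
    return gamma[n, m], path[::-1]
```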
It should be mentioned that the DTW algorithm does not satisfy the necessary properties to be a metric. For example, it is easy to see that the algorithm does not satisfy the triangle inequality. Consequently, this method will be called a measure, and not a metric. Our measure does not make use of the warping cost, and instead uses the information contained in the warping path P.
The measure is then able to determine exactly how a forecast time series is shifted in time compared to the true, or observed, time series. Take two time series M and O, where M is the predicted time series and O the true time series.
The warping constraint defined in Equation 17 is applied, and w is set equal to the forecasting horizon time. In addition, points of M may not be aligned with points of O that lie in their relative future. This can be done as a modification of the warping constraint defined in Equation 17:

$$0 \le i_k - j_k \le w,$$

where i_k indexes M and j_k indexes O. This is also illustrated in Figure 6. For every point of M, the measure then counts with which time shift it is aligned to points of O, and normalizes these counts into percentages; the percentages reflect how time series M is shifted compared to time series O. We now present the results of this measure applied to the persistence model prediction and the LSTM-NN model prediction.

Figure 6. Overview of the warping measure.
(A) Displays the two time series O and M that are compared for alignment. The warping algorithm is adapted with a new window w that prevents comparing values of M with values of O that lie in the future, as seen in (B).
(C) Shows the alignment from (B), from which the measure is determined. Notice that there is no alignment from M to points in O that lie in the relative future of M.

The warping measure is first applied to the forecast of the persistence model. The persistence model can be seen as the textbook example for this algorithm.
Assuming that the persistence model is defined as

$$M(t) = O(t - t_f),$$

with t_f the forecasting horizon, every value of M should be detected as shifted by exactly t_f. The persistence model is applied to the test set defined in section 2. The results confirm our expectations: except for a few percent of the values, all values are detected to be shifted by the forecasting horizon.
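This textbook behavior can be reproduced in a small, self-contained experiment, sketched below with synthetic data and a hypothetical 3-h forecasting horizon. It reuses the dtw() function from the earlier listing, whose symmetric window serves here as a stand-in for the one-sided constraint of our measure, and tallies the time shifts i_k - j_k along the warping path.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
horizon = 3                         # hypothetical forecasting horizon (hours)

# Synthetic "observed" Dst-like series with a storm-shaped dip.
t = np.arange(200)
observed = -50.0 * np.exp(-((t - 100) / 10.0) ** 2) + rng.normal(0, 1, t.size)

# Persistence forecast: M(t) = O(t - t_f); the wrap-around introduced
# by np.roll at the start of the series is ignored in this sketch.
forecast = np.roll(observed, horizon)

# Windowed DTW with w equal to the forecasting horizon.
_, path = dtw(forecast, observed, window=horizon)

# Tally the time shifts i_k - j_k along the warping path.
shifts = Counter(i - j for i, j in path)
total = sum(shifts.values())
for shift, count in sorted(shifts.items()):
    print(f"shift {shift:2d} h: {100.0 * count / total:5.1f}%")
# Nearly all alignments report a shift equal to the forecasting horizon.
```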
Table 3. The row-normalized fractions of the warping measure on the persistence model.

One potential problem arises when the time shift between the two compared time series is very large. First, the algorithm will take longer to run, as the window size w needs to be much larger. Second, because the boundary conditions require the beginning and end of both sequences to match, the algorithm might give shifted results if the time series is too short.
Take the extreme example shown in Figure 7, showing two persistence models with respective time shifts s_1 and s_2. If the time shift s_2 is very large and the time series is short, the number of values counted at shift s_2 is made insignificant by the part of the path enforced by the boundary condition.
Potential changes to the algorithm that could account for this problem are a topic for future work.

Figure 7. An illustration of a potential problem with the measure. (A) Illustrates the result for a persistence model with a small time shift. (B) Illustrates the case of a persistence model with a large time shift. When the time series is too short, the counts will be dominated by the green-colored block, while the actual shift will appear very small due to the normalization.
Applying the measure to the forecast of the LSTM-NN model, the highest percentages are located on the offset diagonal, identical to the results of the persistence model. As discussed before, this indicates that a shift in time exists between the observations and the model predictions.
This confirms our visual observation of the results discussed in section 2. What follows is a statistical analysis of the Dst index itself. The autocorrelation of the Dst, shown in Figure 8B, reveals that the index is strongly correlated with its own recent past. This can also be seen in the lag plot, shown in Figure 8A. This could explain why the persistence model has such high accuracy and correlation when evaluated with the metrics of section 2.
We believe that this also explains why the linear-fit parameters of the persistence model are so high.

Figure 8. (A) Displays three lag plots of the Dst, with lag times of 1, 3, and 5 h. The color of each point represents how many times this value was encountered in the data. Notice the strong linear behavior for all three lag times. (B) Shows the autocorrelation of the Dst index for up to 15 h of lag.
(C) Shows the partial autocorrelation of the Dst index. Notice that there is almost no correlation left past 2 h.

The partial autocorrelation is also an important quantity: it measures the correlation of the series with a lagged copy of itself after removing the contributions of the intermediate lags. The partial autocorrelation of the Dst can be seen in Figure 8C. The autocorrelation properties of the Dst index are most likely the cause of the problem in the forecast. Direct workarounds consist of either changing the input or the output. A first solution is not to include the Dst index in the input vectors, as done by Wu and Lundstedt (1997). This gives a forecast based purely on the solar wind parameters.
Another solution is to de-trend the Dst time series, and instead forecast the change in the Dst, $\Delta Dst(t) = Dst(t) - Dst(t-1)$. This parameter has also been introduced by Wintoft and Wik (2018). De-trending should remove the strong dependence on past values; computing the autocorrelation confirms this, as seen in Figure 9B. Notice that the autocorrelation now behaves almost identically to the partial autocorrelation, displayed in Figure 9C.

Figure 9. Notice the lack of any linear relation between the lagged values.

We therefore trained the LSTM-NN model again to forecast the differentiated Dst. As input we use, next to the parameters described in section 2, the past values of the differentiated Dst.
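The autocorrelation analysis behind Figures 8 and 9 can be reproduced with the statsmodels library. In the sketch below, a random walk serves as a stand-in for the hourly Dst values; like the Dst, it is strongly autocorrelated until it is differentiated.

```python
import numpy as np
from statsmodels.tsa.stattools import acf, pacf

# Stand-in for the hourly Dst series: a strongly autocorrelated random walk.
rng = np.random.default_rng(1)
dst = np.cumsum(rng.normal(0.0, 2.0, 5000))

# Autocorrelation and partial autocorrelation up to 15 h of lag.
autocorr = acf(dst, nlags=15)
partial_autocorr = pacf(dst, nlags=15)

# Differentiate the series, i.e., forecast the change in Dst instead:
# the autocorrelation of the differentiated series drops off immediately.
delta_dst = np.diff(dst)
autocorr_delta = acf(delta_dst, nlags=15)
```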
For longer forecasting horizons, the correlation coefficient decreases sharply, and the prediction efficiency becomes close to zero. The RMSE does not increase substantially when the forecasting horizon increases. The results of the DTW measure are shown in Table 6. Notice the absence of a persistence effect, as most values are no longer on the offset diagonal. This means that the forecast most likely no longer resembles the observed time series, and the evaluation with the DTW algorithm loses much of its meaning.
Table 5.

Using this as a basis, it will be much more transparent when a forecasting model provides us with a genuinely accurate forecast. Finally, we discuss the possible causes of why forecasting the Dst is so difficult. We believe that there are two problems that we have not yet taken into account. The first is the variation of the geo-effectiveness of the quiet solar wind, mainly caused by how the tilt of the Earth affects the interaction of the magnetosphere with the solar wind.
Together with the inclination of the equatorial plane of the Sun, this causes a yearly variation which was not taken into account in this experiment. The second is that we believe it is misguided to forecast the Dst index using the solar wind data measured at L1.
These measurements are taken too close to the Earth, which places an intrinsic limit on how far into the future we can forecast. We believe that having measurements at L5 would provide a large improvement in our ability to provide timely forecasts, as discussed by Hapgood (2017).

While the evaluation scores have indicated that the LSTM-NN model is comparable to the latest publications, visual inspection shows that the model's forecast behavior is similar to that of a persistence model, using the last known input of the Dst as its output.
Although the prediction performs better than the persistence model, showing that some information can be learned from the solar wind, the LSTM-NN model effectively fails in its objective. In order to detect this new type of error, a new test is developed based on the DTW algorithm, to measure the shift between observation and model prediction.
DTW can compare two time series within a time window, instead of comparing two values at the same timestamp as done by the RMSE and the correlation coefficient, allowing the detection of temporal trends. From the output of the DTW algorithm, a least-distance mapping between the two time series is first obtained, which can then be used to compare the timestamps of the points mapped to each other.
This gives us a measure of the time warp between these two time series, from which we can infer a potential persistence effect. When this new measure was applied to the persistence model, the results were as expected, and completely captured the temporal behavior of the persistence model.
When the measure was applied to the time series forecasting of the LSTM-NN model, it detected the temporal lag in the forecast, proving its usefulness. Finally, the possible origin of this lag was discussed by observing the autocorrelation of the time series, together with possible different experiments that do not suffer this temporal lag.
It was shown that the forecasting of the differentiated Dst did not have this temporal lag. Future studies focusing on forecasting the differentiated Dst could provide more transparent results. We believe that new research also has to explore the effect of the variability of the solar wind interacting with the magnetosphere as a function of the Earth's tilt and the inclination of the solar equatorial plane.
Finally, we believe that the fact that the observational data are measured at L1 plays a big role in limiting the forecast horizon of the Dst index. The effect of having measurements at L5 should be further explored in future work, using simulations to provide the artificial measurements. As a concluding remark, we would like to emphasize that researchers should be very prudent when reporting results of time series forecasting with the metrics defined in section 2.
These metrics fail to capture behaviors that are only seen when taking into account the temporal dimension of the forecasting, and could provide misleading results. Publicly available datasets were analyzed in this study. BL performed and analyzed the experiment, developed and tested the new technique, and wrote the manuscript.
JA planned the study and has provided substantial intellectual contribution and interpretation of the results. All authors took part in the manuscript revision and have read and approved the submitted version. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Akasofu, S.-I. (1981). Energy coupling between the solar wind and the magnetosphere. Space Sci. Rev.

Bala, R. Improvements in short-term forecasting of geomagnetic activity. Space Weather.

Barkhatov, N. Long-term forecasting of solar activity indices using neural networks.

Berndt, D. J., and Clifford, J. (1994). Using dynamic time warping to find patterns in time series.

Box, G. Time Series Analysis: Forecasting and Control.

Boyle, C. Empirical polar cap potentials.

Burton, R. K., McPherron, R. L., and Russell, C. T. (1975). An empirical relationship between interplanetary conditions and Dst. J. Geophys. Res.

Camporeale, E. (2019). The challenge of machine learning in space weather: nowcasting and forecasting. Space Weather.
Council, N.