Scientific Reports volume 15, Article number: 30177 (2025)
Photovoltaic (PV) power is significantly influenced by meteorological fluctuations, and its forecasting accuracy is critical for power system dispatching and economic operation. To enhance forecasting precision, this paper proposes a hybrid framework integrating signal decomposition, parallel forecasting, and weight optimization. First, the Thompson-Tau criterion and Newton interpolation are applied to handle outliers and missing data, and key meteorological factors are selected using the Pearson correlation coefficient to reduce input dimensionality. Second, the power sequence is decomposed into multi-scale subsequences using Ensemble Empirical Mode Decomposition (EEMD), which are then reconstructed into low-frequency components (reflecting trend features) and high-frequency components (capturing random fluctuations) based on sample entropy. Third, a parallel XGBoost-LSTM forecasting structure is constructed: XGBoost models the low-frequency components to capture global patterns, while LSTM processes the high-frequency components to learn temporal dependencies. Finally, the Snake Optimization (SO) algorithm is introduced to dynamically optimize the combination weights, enabling adaptive fusion of the forecasting results. Experimental results demonstrate that the proposed model significantly outperforms standalone benchmark methods. Compared with Particle Swarm Optimization (PSO), the Sparrow Search Algorithm (SSA), and equal-weight assignment for fusing the high- and low-frequency forecasts, the SO algorithm attains the lowest forecasting errors. The proposed method thus offers a novel route to high-precision PV power forecasting through multi-modal feature fusion and optimized weight allocation.
With the rapid increase in the proportion of renewable energy installed capacity, power systems face growing challenges in integrating renewable energy. Photovoltaic (PV) power forecasting technology is critical for mitigating the conflict between large-scale renewable energy integration and efficient utilization. However, forecasting accuracy is significantly affected by multiple meteorological factors (e.g., solar irradiance, temperature), leading to strong intermittency and stochastic fluctuations in power output curves. These characteristics pose substantial challenges to real-time grid dispatching, electricity market transactions, and energy storage system charging/discharging strategies [1, 2].
PV power forecasting can be categorized into three types based on time horizon: short-term, medium-term, and long-term forecasting [3]. Short-term forecasting covers periods within one day, primarily providing precise data for real-time power system dispatching to balance power generation and load demand while ensuring grid frequency stability. Medium-term forecasting spans several days to a month, supporting unit commitment and maintenance scheduling. Long-term forecasting extends from months to years, offering strategic guidance for power system planning, energy structure adjustment, and PV plant investment decisions. Among these, short-term forecasting demands the highest accuracy due to its direct impact on real-time grid operations.
Current mainstream PV power forecasting methods are divided into physics-driven and data-driven approaches. Physics-driven models rely on mathematical equations derived from PV conversion mechanisms, array layouts, and inverter characteristics to simulate the quantitative relationship between meteorological conditions (e.g., irradiance, temperature) and power output [4]. In contrast, data-driven methods leverage historical operational data and employ statistical or artificial intelligence algorithms to uncover mappings between meteorological factors, temporal features, and power generation [5]. Data-driven methods can be further subdivided into traditional machine learning, deep learning, and hybrid models.
Traditional machine learning models establish mappings from input features to output power through historical-data training to forecast unknown data. Representative methods include Random Forests (RF) [6], Support Vector Machines (SVM) [7], and Gradient Boosting Trees (XGBoost) [8], among others. For example, Ref. [9] proposes a Multi-Objective Slime Mould Algorithm (MOSMA)-optimized SVM model, which categorizes weather types based on irradiance fluctuation thresholds and dynamically adjusts kernel function parameters to suppress forecasting errors under extreme weather conditions. Deep learning models automatically learn deep features from data using multi-layer neural networks without manual feature engineering [10]; examples include Long Short-Term Memory (LSTM) networks, Convolutional Neural Networks (CNN), and attention mechanisms [11]. These models excel in modeling multivariate couplings under complex weather conditions due to their independence from detailed physical mechanisms. However, factors such as cloud movement and atmospheric scattering cause PV signals to exhibit non-stationarity, multi-scale fluctuations, and high dimensionality, making it difficult for single models to balance long-term trend fitting and short-term fluctuation capture [12].
To overcome these limitations, current research focuses on multi-dimensional improvements, such as data preprocessing, signal decomposition, and model architecture optimization. Ref. [13] employs Kernel Principal Component Analysis (KPCA) for feature extraction and noise reduction, reconstructing low-noise datasets by retaining principal component energy subspaces and integrating Convolutional Recurrent Units (CRU) to enhance forecasting robustness through spatiotemporal feature fusion. Ref. [14] introduces grey relational analysis to quantify similarity among historical samples in irradiance and temperature sequences, selects high-correlation training data to reduce heterogeneity, and constructs a cascaded temporal convolutional network-multi-head attention mechanism model to achieve local-global feature synergy. In the domain of signal decomposition, Ref. [15] uses Empirical Mode Decomposition (EMD) to decouple non-stationary power signals into intrinsic mode functions (IMFs), effectively suppressing high-frequency noise. Ref. [16] combines Bayesian hyperparameter optimization with LSTM to develop a time-frequency dual-domain subsequence forecasting framework, independently modeling trend and residual components of IMFs to improve stability. For architecture optimization, Ref. [17] designs a two-stage LSTM with sliding time windows to handle ultra-short-term power fluctuations, while XGBoost achieves efficient historical-data training through gradient-boosted tree splitting strategies and parallel feature-gain computation [18].
However, existing models still exhibit limitations: while LSTM effectively captures long-term temporal dependencies, its capability in modeling high-dimensional feature interactions and non-stationary abrupt changes remains inadequate [19]; XGBoost excels at nonlinear regression of meteorological factors but struggles to precisely describe continuous temporal dynamics due to the inherent discretization of tree-based structures. The complementary strengths of these two models provide a theoretical foundation for constructing hybrid frameworks. To address these gaps, this paper proposes a hybrid LSTM-XGBoost-SO forecasting model, enhancing accuracy through multi-stage collaborative optimization. First, the Thompson-Tau criterion detects outliers, the Newton interpolation method imputes missing data, and Pearson correlation coefficients identify key meteorological features to reduce redundancy. Second, Ensemble Empirical Mode Decomposition (EEMD) separates power signals into multi-scale components; these are categorized and reconstructed into low-frequency (LF) and high-frequency (HF) subsequences based on sample entropy. Third, a heterogeneous XGBoost-LSTM parallel architecture is built, leveraging XGBoost's regression strength for LF components and LSTM's temporal modeling capability for HF components. Finally, the Snake Optimizer (SO) simulates swarm intelligence to globally optimize model fusion weights, circumventing local optima traps inherent in traditional parameter tuning. Test results demonstrate that the proposed signal decomposition–parallel forecasting–weight optimization framework outperforms standalone models, validating the efficacy of multimodal fusion and bio-inspired optimization. The main contributions of this paper are summarized as follows:
(1) Innovative Signal Decomposition Framework: A hybrid signal reconstruction method combining Ensemble Empirical Mode Decomposition (EEMD) and sample entropy is proposed. By adaptively injecting white noise to suppress mode mixing in traditional EMD and quantifying subsequence complexity using sample entropy, this method achieves precise separation of low-frequency trend components and high-frequency components in PV power signals.
(2) Heterogeneous Parallel Forecasting Architecture: A dual-channel XGBoost-LSTM model is designed, capitalizing on XGBoost’s strength in high-dimensional nonlinear regression for low-frequency components and LSTM’s capability in modeling temporal dependencies for high-frequency components. This architecture overcomes performance bottlenecks of single models in feature interaction and temporal dynamics.
(3) Bio-inspired Optimization Strategy: A Snake Optimizer (SO)–based hyperparameter search mechanism is developed, simulating snake foraging behavior to achieve adaptive global optimization. This strategy resolves the local optima and slow convergence issues in traditional grid search and genetic algorithms, significantly improving forecasting accuracy and stability.
Photovoltaic (PV) power generation data may contain missing or anomalous values during acquisition, necessitating data imputation and outlier removal. For missing values, interpolation methods can be applied. This paper employs the Thompson-Tau method to identify and eliminate outliers. Specifically, irradiance—the meteorological factor most strongly correlated with PV power—is integrated with power data. The irradiance values are divided into s intervals of equal size. The power data samples within the i-th irradiance-power interval are denoted as P_i = {P_{i,1}, P_{i,2}, …, P_{i,m}}, where i = 1, 2, …, s and P_{i,1} ≤ P_{i,2} ≤ … ≤ P_{i,m}. Here, m represents the number of power points within each irradiance-power interval. The Thompson-Tau method is applied to detect and remove outliers in each interval.
Calculate the mean power value within an interval:
where P_{i,j} represents the individual power values within the i-th interval.
The standard deviation of the power data in the i-th interval is calculated as:
The absolute deviation of each power sample in the interval is defined as:
Within the i-th interval, the power sample with the largest absolute deviation δ_{i,j} is identified as a potential outlier. A maximum deviation implies that the power value is either unusually high or low within its irradiance interval, significantly increasing its likelihood of being anomalous. The threshold τ in the Thompson-Tau method is computed as:
where t denotes the critical value of the t-distribution for the power samples, and α represents the significance level; its value determines the trade-off between data sufficiency and outlier sensitivity. In this paper, α = 0.05 is selected. A power value is classified as an outlier if δ_{i,j} ≥ τS_i; otherwise, it is retained as a valid data point.
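As a hedged illustration, the per-interval cleaning step above can be sketched as follows (assuming NumPy and SciPy are available; the partitioning of samples into irradiance intervals is omitted, and the function names are illustrative, not from the paper):

```python
import numpy as np
from scipy.stats import t


def thompson_tau_filter(p, alpha=0.05):
    """Iteratively remove outliers from one irradiance interval's power
    samples using the modified Thompson-Tau test: flag the sample with
    the largest absolute deviation and drop it if delta >= tau * S."""
    p = np.asarray(p, dtype=float).copy()
    while p.size > 2:
        m = p.size
        mean, s = p.mean(), p.std(ddof=1)
        if s == 0:
            break
        delta = np.abs(p - mean)           # absolute deviations
        j = int(np.argmax(delta))          # candidate outlier
        t_crit = t.ppf(1 - alpha / 2, m - 2)
        tau = t_crit * (m - 1) / (np.sqrt(m) * np.sqrt(m - 2 + t_crit**2))
        if delta[j] >= tau * s:            # outlier: remove and re-test
            p = np.delete(p, j)
        else:                              # no further outliers
            break
    return p
```

Applied per interval, this removes, for example, an isolated 10 kW spike among samples clustered near 1 kW while retaining the rest.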
For the detected anomalies, this paper employs the Newton interpolation method for data imputation. The Newton interpolation method involves constructing a polynomial function N(x) that approximates f(x) using known function values at finite discrete points.
Given n + 1 distinct interpolation points (x_i, f(x_i)), i = 0, 1, …, n, the n-th degree Newton interpolation formula can be expressed as:
where f[x_0, x_1, …, x_n] represents the n-th order divided difference. It is derived recursively from the divided differences of each order:
As shown in Eqs. (5) and (6), the Newton interpolation method features computational simplicity. When adding a new known point and its corresponding function value, the formula requires only one additional term. After anomaly identification, nearby measured values are used to fill the power anomalies through Newton interpolation. This approach ensures temporal data integrity and facilitates subsequent forecasting tasks.
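A minimal sketch of this imputation step, assuming NumPy (the function name is illustrative): the divided-difference table of Eq. (6) is built in place, and the Newton form of Eq. (5) is evaluated Horner-style, so adding one more known point only appends one coefficient.

```python
import numpy as np


def newton_interp(xs, ys, x):
    """Evaluate the Newton divided-difference interpolating polynomial
    through the points (xs[i], ys[i]) at x."""
    xs = np.asarray(xs, dtype=float)
    coef = np.asarray(ys, dtype=float).copy()
    n = len(xs)
    # Build divided differences in place: coef[k] becomes f[x_0, ..., x_k].
    for k in range(1, n):
        coef[k:] = (coef[k:] - coef[k - 1:-1]) / (xs[k:] - xs[:n - k])
    # Horner-style evaluation of the Newton form.
    result = coef[-1]
    for k in range(n - 2, -1, -1):
        result = result * (x - xs[k]) + coef[k]
    return result
```

For a missing power sample, nearby measured (time, power) pairs would be passed as (xs, ys) and the gap's timestamp as x.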
Meteorological variables often differ in measurement units. To ensure computational efficiency and preserve inter-variable relationships, this paper applies min-max normalization to the data, as expressed in Eq. (7):
where x_nor is the normalized value, x is the raw value, and x_min and x_max represent the minimum and maximum values of each feature dimension, respectively. However, irrelevant or weakly correlated features may increase model complexity, reduce computational efficiency, and degrade forecasting accuracy. To address this, the Pearson correlation coefficient is introduced to quantify feature relevance and select key meteorological factors (beyond irradiance) strongly associated with PV power output. The Pearson correlation coefficient measures the linear relationship between two variables, calculated as:
where x̄ and ȳ denote the mean values of variables x and y, respectively, and r_xy represents their linear correlation strength.
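Both preprocessing formulas are short enough to sketch directly (a NumPy sketch; function names are illustrative):

```python
import numpy as np


def minmax_normalize(x):
    """Min-max normalization of Eq. (7): scale a feature to [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())


def pearson_r(x, y):
    """Pearson correlation coefficient r_xy of Eq. (8) between x and y."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xd, yd = x - x.mean(), y - y.mean()
    return float((xd * yd).sum() / np.sqrt((xd**2).sum() * (yd**2).sum()))
```

In the feature-selection step, pearson_r would be evaluated between each candidate meteorological series and the PV power series, keeping only strongly correlated factors.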
Photovoltaic (PV) power exhibits intermittency, stochasticity, and volatility, which degrade forecasting accuracy when raw signals are directly fed into models. To address this, we employ Ensemble Empirical Mode Decomposition (EEMD) to decompose PV power signals. EEMD mitigates mode mixing—a phenomenon in traditional EMD where components of different frequencies are blended into a single Intrinsic Mode Function (IMF)—by iteratively adding white noise to the original signal and averaging multiple EMD decompositions. This approach also enhances computational robustness. The detailed procedure is as follows:
Set the number of white noise additions M, i.e., the total ensemble averaging iterations.
Generate a new signal by superimposing white noise onto the original signal.
where i = 1, 2, …, M denotes the i-th white noise addition, and X_i(t) represents the composite signal of the original PV power X(t) and the i-th white noise N_i(t).
Perform Empirical Mode Decomposition (EMD) on each composite signal ({{text{X}}_{text{i}}}({text{t}})), repeating M times to extract Intrinsic Mode Functions (IMFs).
where M(t) and L(t) are the upper and lower envelopes of X_i(t), respectively; V(t) is the average of the upper and lower envelopes; IMF_{i,j}(t) is the j-th IMF component decomposed from the i-th noise-added signal.
Compute the ensemble-averaged IMFs as the final decomposition result.
where r_{i,j}(t) is the residual component, and J is the total number of IMFs.
After decomposing the data into multiple components, rational reconstruction is crucial to enhance signal quality and enable component-specific model analysis, thereby improving overall forecasting accuracy. To quantify the complexity of each signal component, this paper employs sample entropy—a robust metric for assessing time series complexity. Sample entropy exhibits two key advantages: (1) independence from data length, ensuring reliable analysis even with limited datasets; (2) insensitivity to parameter selection, providing consistent evaluation across diverse signal characteristics. These properties make it suitable for reconstructing components with varying complexity levels in photovoltaic power signals.
where L is the sequence length, B^{n+1}(r) denotes the number of vector pairs of dimension n + 1 whose maximum absolute difference between corresponding elements is less than or equal to the similarity tolerance r, and B^{n}(r) represents the analogous count for vectors of dimension n.
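A simplified NumPy sketch of this complexity measure (counting matches over all length-n and length-(n+1) templates, with the common convention r = 0.2 × the series' standard deviation; the function name and defaults are illustrative):

```python
import numpy as np


def sample_entropy(series, n=2, r_frac=0.2):
    """Sample entropy SampEn(n, r, L): negative log of the ratio
    B^{n+1}(r) / B^{n}(r) of template-match counts."""
    x = np.asarray(series, dtype=float)
    L = x.size
    r = r_frac * x.std()

    def count_matches(m):
        # All length-m templates; self-matches are excluded.
        templates = np.array([x[i:i + m] for i in range(L - m + 1)])
        count = 0
        for i in range(len(templates)):
            d = np.max(np.abs(templates - templates[i]), axis=1)
            count += int(np.sum(d <= r)) - 1
        return count

    b_n, b_n1 = count_matches(n), count_matches(n + 1)
    if b_n == 0 or b_n1 == 0:
        return np.inf
    return -np.log(b_n1 / b_n)
```

A regular signal (e.g., a smooth sinusoid) yields a lower SampEn than white noise, which is exactly the property used below to separate low- from high-frequency IMFs.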
The workflow of the proposed hybrid forecasting model is illustrated in Fig. 1.
Workflow of the hybrid forecasting model.
The proposed hybrid forecasting framework operates through the following steps:
Step 1: Data Preprocessing – Clean historical PV power data by identifying and imputing missing values and outliers; select optimal meteorological features using correlation analysis.
Step 2: Data Decomposition – Partition preprocessed data into training and testing sets, then decompose PV power signals into multiple subsequences using Ensemble Empirical Mode Decomposition (EEMD).
Step 3: Signal Reconstruction – Quantify subsequence complexity by sample entropy, categorizing them into high-frequency (HF) and low-frequency (LF) components, followed by entropy-based reconstruction.
Step 4: Hybrid Forecasting – Model specialization: LSTM captures temporal dependencies in volatile HF components, while XGBoost regresses stable LF trends. Train each model with historical features until convergence.
Step 5: Weight Optimization – Apply the Snake Optimizer (SO) algorithm to globally optimize model weighting coefficients, minimizing forecasting residuals.
Step 6: Testing Phase – Process testing data through Steps 2–3, then feed reconstructed signals and features into trained LSTM and XGBoost models.
Step 7: Performance Evaluation – Assess forecasting accuracy using Root Mean Square Error (RMSE) and Mean Absolute Error (MAE), iteratively tuning hyperparameters for optimal performance.
The Long Short-Term Memory (LSTM) network is a forecasting algorithm designed to address the weakness of recurrent neural networks (RNNs) in capturing long-term temporal dependencies. It introduces three gating mechanisms (input gate, forget gate, and output gate) to regulate information flow in memory cells: Input Gate: Receives and updates valid information into the memory cell. Forget Gate: Selectively discards partial information from the previous memory cell state. Output Gate: Determines the final output based on weighted information from the input and forget gates.
LSTM network structure.
The LSTM architecture at time step t is illustrated in Fig. 2. Key computational steps are defined as follows: the forget gate combines the previous hidden state h_{t−1} and the current input x_t, processing them through a Sigmoid activation function to generate the retention ratio for the cell state C_{t−1}.
where W_fx and W_fh are the weight matrices connecting x_t and h_{t−1}, respectively; b_f is the bias vector, and σ denotes the Sigmoid activation function.
The input gate controls the proportion of input information entering the memory cell through activation functions. By integrating the previous hidden-state information, it generates a new candidate state C̃_t using the hyperbolic tangent (tanh) activation function, as shown in Eqs. (15)-(17). Here, the value i_t (ranging within [0, 1]) determines the weight of the candidate state C̃_t; W_ix and W_ih represent the weight matrices associated with x_t and h_{t−1}, respectively, while b_i and b_c are the corresponding bias vectors. The current cell state C_t is derived by multiplying f_t with the previous cell state C_{t−1}, multiplying i_t with the candidate state C̃_t, and summing the results.
The output gate is responsible for generating the final output by merging information from the forget gate and input gate. As shown in Eqs. (18)-(19), the cell state C_t is processed through the tanh activation function, and the resulting value is multiplied by the output gate o_t to obtain the current LSTM output h_t. Here, W_ox denotes the weight matrix for the current input x_t, W_oh represents the weight matrix associated with the previous hidden state h_{t−1}, and b_o is the bias vector.
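The gate equations above can be condensed into one NumPy time-step function (a sketch only; the dictionary layout for weights and biases is an illustrative convention, not the paper's implementation):

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W maps each gate name ('f', 'i', 'c', 'o')
    to its (W_gx, W_gh) pair; b maps each name to its bias vector."""
    f_t = sigmoid(W['f'][0] @ x_t + W['f'][1] @ h_prev + b['f'])      # forget gate
    i_t = sigmoid(W['i'][0] @ x_t + W['i'][1] @ h_prev + b['i'])      # input gate
    c_tilde = np.tanh(W['c'][0] @ x_t + W['c'][1] @ h_prev + b['c'])  # candidate state
    c_t = f_t * c_prev + i_t * c_tilde                                # new cell state
    o_t = sigmoid(W['o'][0] @ x_t + W['o'][1] @ h_prev + b['o'])      # output gate
    h_t = o_t * np.tanh(c_t)                                          # new hidden state
    return h_t, c_t
```

Because o_t ∈ (0, 1) and tanh(C_t) ∈ (−1, 1), every component of h_t stays within (−1, 1), which is the bounded-output behavior the gating structure guarantees.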
The XGBoost algorithm operates by performing feature splitting to generate new regression tree models. Each new regression tree primarily fits the residuals of previous tree models. Through iterative training of tree models to approximate forecasting residuals, the algorithm progressively reduces forecasting errors and enhances model performance.
For regression problems, let M denote the total number of integrated trees, m index the m-th tree in the ensemble, f_m represent the regression function of the m-th tree, and N is the set of regression functions. The forecasted value ŷ_i for sample x_i is computed as shown in Eq. (20):
The forecasted value for each sample corresponds to the leaf node scores of each regression tree. By summing these scores across all trees, the model fitted by the t-th regression tree is formulated as:
ŷ_i^{(t)} denotes the forecasted value of the i-th sample after t model iterations, ŷ_i^{(t−1)} represents the forecast from the (t−1)-th iteration, and f_t(x_i) is the newly added regression tree.
The optimization process is performed through iterative training to fit forecasting residuals, with the objective function defined as Eqs. (22)-(23):
∑_{i=1}^{n} l(y_i, ŷ_i^{(t)}) denotes the loss function, where a smaller value indicates better model fitting; ∑_{m=1}^{M} Ω(f_m) represents the regularization term in the objective function; λ is the penalty factor controlling leaf-node scores, T denotes the number of leaf nodes, γ is the penalty factor regulating the number of leaf nodes, and ω_j corresponds to the score of the j-th leaf node. The XGBoost model employs a second-order Taylor expansion to approximate the loss function, enhancing numerical precision and accelerating training convergence. By incorporating regularization terms that penalize both the number of leaf nodes and the norm of leaf weights, the trained model achieves simplified structures and mitigates overfitting.
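The additive update of Eq. (21) can be illustrated with a deliberately minimal gradient-boosting sketch: depth-1 stumps fit the current residuals and are summed with a learning rate. This is not XGBoost itself (the second-order expansion and the Ω regularization term are omitted), just the residual-fitting principle; all names and hyperparameter values are illustrative.

```python
import numpy as np


def fit_stump(x, residual):
    """Find the split on x that best fits residual with two leaf means."""
    best = None
    for s in np.unique(x)[:-1]:
        left, right = residual[x <= s], residual[x > s]
        sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if best is None or sse < best[0]:
            best = (sse, s, left.mean(), right.mean())
    _, s, lv, rv = best
    return lambda q: np.where(q <= s, lv, rv)


def boost(x, y, n_trees=50, lr=0.3):
    """Additive model: each new stump fits the residuals of the ensemble
    so far, mirroring y^(t) = y^(t-1) + f_t(x)."""
    pred = np.full_like(y, y.mean(), dtype=float)
    trees = []
    for _ in range(n_trees):
        tree = fit_stump(x, y - pred)     # fit current residuals
        pred = pred + lr * tree(x)        # update the ensemble forecast
        trees.append(tree)
    base = y.mean()
    return lambda q: base + lr * sum(t(q) for t in trees)
```

On a simple step-shaped target, the residuals shrink geometrically with each added tree, which is the error-reduction mechanism the text describes.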
The photovoltaic power generation forecasting investigated in this paper falls under regression problems. To validate the algorithm’s effectiveness, the Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) are adopted as model evaluation metrics, with their formulas defined in Eqs. (24)-(25):
where y_{pred,i} denotes the i-th forecasted value, y_i represents the i-th true value, and n is the total number of samples.
As previously described, after obtaining forecasts from the individual models, determining optimal weighting coefficients between the LSTM and XGBoost models is critical. This paper employs the Snake Optimizer (SO) algorithm to achieve this objective. The SO workflow comprises the following steps:
Step 1: Objective Function Definition. The goal is to minimize the Root Mean Square Error (RMSE) between the weighted combination of model forecasting and the true values:
where f(α, β) is the fitness function, α, β ∈ [0, 1], y_{true,j} denotes the j-th true value, and y_{pred1,j}, y_{pred2,j} are the forecasts from the LSTM and XGBoost models, respectively.
Step 2: Population Initialization. Generate an initial population of N snake individuals with randomized positions and lengths:
where l_i denotes the length of the i-th individual, N represents the population size, P_i indicates the position of the i-th snake individual, and L corresponds to its characteristic length parameter.
Step 3: Iterative Position & Fitness Update. Update snake positions and fitness values iteratively:
where ξ is a uniformly distributed random number within [0, 1], p_best^{(t)} denotes the current best position, l_i^{(t)} represents the fitness value, p_i^{(t)} is the updated position vector, and f(p_i^{(t)}) corresponds to the current objective function value.
Step 4: Fitness Evaluation. Compute the new fitness based on Eq. (31). If the fitness at the updated position is improved, update both the position and length:
Step 5: Termination & Output. Repeat Steps 1–4 until convergence criteria are met. The algorithm terminates by returning the globally optimal position as the final result.
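The fitness function of Step 1 is straightforward to sketch. Since the full SO position-update rules are lengthy, the search below uses a plain uniform random search over (α, β) ∈ [0, 1]² as a hedged stand-in for SO — it minimizes the same fused-forecast RMSE objective, but is not the optimizer the paper uses; function names are illustrative.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean square error of Eq. (24)."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def optimize_weights(y_true, y_lstm, y_xgb, n_iter=2000, seed=0):
    """Search (alpha, beta) in [0, 1]^2 minimizing the RMSE of the fused
    forecast alpha * y_lstm + beta * y_xgb. A uniform random search
    stands in here for the Snake Optimizer."""
    rng = np.random.default_rng(seed)
    best_w, best_f = (0.5, 0.5), np.inf
    for _ in range(n_iter):
        a, b = rng.uniform(0, 1, size=2)
        f = rmse(y_true, a * y_lstm + b * y_xgb)
        if f < best_f:
            best_w, best_f = (a, b), f
    return best_w, best_f
```

When the true series is an unequal mix of the two component forecasts, the searched weights recover that mix far more accurately than the equal-weight baseline, illustrating why weight optimization beats fixed 0.5/0.5 fusion.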
The historical dataset from Australia’s DKASC is selected, which includes photovoltaic power generation, wind speed, temperature, global irradiance, diffuse irradiance, and other meteorological variables with a 5-minute sampling interval. Data spanning from April to October 2015 are extracted for analysis. The Ensemble Empirical Mode Decomposition (EEMD) method is applied with 100 iterations of Gaussian white noise added. The decomposed sub-sequences are illustrated in Fig. 3.
Sub-sequences decomposed based on EEMD.
As shown in Fig. 3, the fluctuation amplitudes of the sub-sequences gradually attenuate. Introducing all decomposed signals into the model would elevate model complexity, thereby compromising computational efficiency. Furthermore, error accumulation across sub-sequences could amplify the final forecasting error. To address this, we leverage the Sample Entropy (SampEn) metric to quantify the complexity of each sub-sequence. The SampEn values are listed in Table 1.
Signal reconstruction is performed based on Sample Entropy (SampEn) values. As shown in Table 1, IMF3, IMF4, IMF5, IMF6, IMF7, and IMF8 exhibit higher SampEn values, indicating greater complexity and stronger stochasticity in these components. These six sub-sequences are merged into high-frequency components. The remaining sub-sequences, with relatively lower and comparable SampEn values, are aggregated into low-frequency components.
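This grouping step reduces to summing IMFs on either side of a SampEn threshold (a NumPy sketch; the threshold value is illustrative, whereas the paper selects the six high-SampEn IMFs of Table 1 directly):

```python
import numpy as np


def reconstruct(imfs, sampens, threshold):
    """Sum IMFs into high- and low-frequency components: IMFs whose
    sample entropy exceeds the threshold form the HF component, the
    rest form the LF component."""
    imfs = np.asarray(imfs, dtype=float)
    mask = np.asarray(sampens) > threshold
    hf = imfs[mask].sum(axis=0)   # complex, stochastic sub-sequences
    lf = imfs[~mask].sum(axis=0)  # smooth trend sub-sequences
    return hf, lf
```

By construction, the HF and LF components sum back to the full decomposed signal, so no information is lost in the reconstruction.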
To identify the most influential meteorological factors correlated with photovoltaic power generation, Pearson correlation coefficients are computed. The correlation strengths between meteorological variables and PV power output are summarized in Table 2.
As shown in Table 2, the meteorological factors exhibiting strong correlations with photovoltaic (PV) power generation are global irradiance (0.9944), temperature (0.6038), and diffuse irradiance (0.5916). These three factors, with significantly higher correlation coefficients compared to others, are classified as strongly relevant factors, while the remaining variables (e.g., wind speed, relative humidity) are categorized as weakly relevant factors.
Based on this analysis, the strongly relevant factors—global irradiance, temperature, and diffuse irradiance—are either individually or combinatorially integrated into new composite features. These features, combined with historical PV power data, serve as inputs to the proposed model. For comparative validation, weakly relevant factors and their combinations are also tested within the proposed framework. It should be noted that during model training, historical data of global irradiance, temperature, and diffuse irradiance are incorporated and combined with the corresponding historical PV power generation data as model inputs, while local weather forecast data should be employed for actual PV power forecasting.
To identify optimal input features for PV power forecasting, selected meteorological factors (combined with processed high-frequency components and historical PV power data) are fed into the LSTM model. Identical network configurations are maintained: 4 hidden layers, 20 training epochs, and a batch size of 32. Because the attention mechanism has been widely adopted in time-series forecasting for its capability to dynamically assign higher weights to critical information in input sequences, this paper conducts a comparative analysis between LSTM and an attention-based model to demonstrate LSTM's effectiveness in photovoltaic power forecasting. In addition, Bidirectional Long Short-Term Memory (BiLSTM) and Gated Recurrent Unit (GRU) models, both structurally analogous to LSTM, are included in the tests for comparative analysis of forecasting performance.
Figure 4 shows the forecasting results for the last 970 data points in the dataset based on different feature inputs for the LSTM model. It highlights significant power fluctuations during the 200–300, 500–600, and 800–900 time intervals, with observable periodic patterns across these segments. Notably, the forecasting using Global Horizontal Irradiance (GHI) as input shows more pronounced volatility. This increased sensitivity allows the model to more closely track actual power values during peak events, which is the main reason for its superior accuracy metrics (i.e., minimal error values). For high-frequency data, the most relevant features are inherently related to short-term fluctuations, transient variations, and localized correlations; adding more features increases model complexity, which can introduce redundant information or noise, thus weakening the model’s ability to capture core high-frequency patterns.
LSTM forecasting results under different meteorological factors.
As shown in Table 3, the combination of GHI (Global Irradiance) with historical power data achieved the lowest RMSE (0.2773) and MAE (0.1692), outperforming other single or composite features. The Temp&GHI (Temperature & Global Irradiance) combination also demonstrated improved accuracy compared to Temp&DHI, attributed to the stronger correlation of GHI with PV power (Pearson coefficient: 0.9944). While the Temp&GHI&DHI combination introduced higher dimensionality, its error metrics remained lower than those of GHI&DHI due to the complementary effect of Temp (Temperature). Analysis of the forecasting results indicates that the attention-mechanism-based model achieves its minimum RMSE and MAE under the combination of GHI and power output. However, on the optimal feature set, LSTM outperforms the attention mechanism in forecasting accuracy. Notably, the fundamental structure of BiLSTM is similar to that of LSTM but includes an additional backward-propagating hidden layer. As shown in Table 3, BiLSTM's forecasting results with the Temp&GHI feature combination closely match LSTM performance, with MAE nearly reaching the optimal accuracy. Furthermore, its forecasting across different features shows excellent stability on average, indicating that BiLSTM overcomes historical dependency limitations by processing sequence data bidirectionally. However, PV power data often contains transient, irregular noise, and BiLSTM tends to interpret such noise as meaningful features during bidirectional processing, leading to two main issues: (1) under optimal feature sets, its forecasting accuracy does not exceed that of LSTM; (2) in terms of computational efficiency, LSTM training is significantly faster than BiLSTM, approaching GRU-level performance. This confirms that LSTM is the preferred choice for high-frequency power forecasting in terms of both computational efficiency and forecasting accuracy.
For the low-complexity low-frequency components, the XGBoost model is configured with 100 trees and a maximum depth of 5. Inputs include meteorological factors, the processed low-frequency components, and historical PV power data. In contrast, Temporal Convolutional Networks (TCN) use dilated causal convolutions to expand their receptive fields, effectively capturing both short-term fluctuations and long-term trends in time series. For further performance benchmarking, this paper includes the Transformer model, widely adopted in recent time-series forecasting research, as a reference framework. The Transformer uses self-attention to model global dependencies across entire input sequences, demonstrating a strong ability to capture non-local dependencies that can span arbitrary time intervals. Unlike TCN, which relies on convolutional operations, the Transformer focuses mainly on contextual correlations between distant time steps. Because of the Transformer's large parameter count and the structural similarity between its encoder and decoder, a lightweight encoder-only variant is used for comparison to enable faster forecasting. The forecasting results and corresponding evaluation metrics are shown in Fig. 5 and Table 4.
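A minimal sketch of the stated low-frequency model configuration (100 trees, maximum depth 5) is shown below. Since the paper's XGBoost build is not reproduced here, scikit-learn's `GradientBoostingRegressor` stands in with the same tree count and depth; the smooth synthetic series imitates a reconstructed low-frequency component and is not the paper's data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
t = np.arange(500)
# Smooth synthetic "low-frequency component" with a slow drift.
low_freq = np.sin(2 * np.pi * t / 250) + 0.001 * t
# Synthetic meteorological inputs loosely tied to the same cycle.
temp = 20 + 5 * np.sin(2 * np.pi * t / 250) + rng.normal(scale=0.2, size=500)
ghi = 800 * np.clip(np.sin(2 * np.pi * t / 250), 0, None) + rng.normal(scale=10, size=500)

# Features: Temp, GHI, and the lag-1 low-frequency value (historical power proxy).
X = np.column_stack([temp[:-1], ghi[:-1], low_freq[:-1]])
y = low_freq[1:]

model = GradientBoostingRegressor(n_estimators=100, max_depth=5, random_state=0)
model.fit(X[:400], y[:400])
rmse = np.sqrt(np.mean((model.predict(X[400:]) - y[400:]) ** 2))
```

Because the reconstructed low-frequency component is smooth, even this shallow ensemble fits it with small residual error, which is the property the paper exploits when assigning XGBoost to the trend part of the signal.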
XGBoost Forecasting Results Under Different Meteorological Factors.
Figure 5 displays the forecasting results for the last 970 data points in the dataset for different feature inputs to the XGBoost model. As with the high-frequency data, distinct fluctuations and peaks appear in certain intervals. Given that low-frequency components are characterized by longer time scales, gentler fluctuations, and weaker short-term correlations, their variations are predominantly driven by long-term trends and external environmental factors. In such contexts, incorporating additional features can help the model overcome the inherent informational limitations of low-frequency data. However, introducing features with excessively low correlations may lead the model to misidentify redundant information. As shown in the figure, the forecast using temperature (Temp) alone as input exhibits pronounced phase lag and significant errors at peak events, confirming that weakly correlated features impair the model's recognition capability.
As indicated in Table 4, the Temp&GHI (temperature and global irradiance) combination achieved the lowest errors when used as input features. This is attributed to the inherent smoothness and low complexity of the reconstructed low-frequency components: introducing weakly correlated factors (e.g., DHI) not only increases input dimensionality and computational overhead but also degrades forecasting accuracy. Among the evaluated forecasting models, the Transformer demonstrates the best performance in terms of RMSE, while XGBoost achieves the lowest MAE. Photovoltaic power forecasting inherently involves long-horizon temporal forecasting, where minimizing the average deviation across the entire forecast horizon is paramount; consequently, MAE should be prioritized as the primary evaluation metric. Furthermore, when comparing computational costs, the Transformer requires approximately 21 times more parameters than XGBoost under identical input conditions (GHI with low-frequency components). Critically, the Transformer's parameter count grows rapidly with increasing feature dimensionality, a limitation absent in XGBoost. Regarding training efficiency, XGBoost completes training within 15 s, whereas the Transformer requires approximately 65 s. Thus, deploying XGBoost for low-frequency power forecasting represents the optimal choice, balancing computational efficiency and forecasting accuracy.
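For reference, the two metrics compared throughout this section can be computed as follows; the toy arrays are purely illustrative. The example also shows why the two metrics can rank models differently: RMSE penalizes a single large deviation quadratically, while MAE reflects the average deviation over the whole horizon.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    e = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(e ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    e = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.mean(np.abs(e)))

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]  # one large error at the last step
print(rmse(y_true, y_pred))  # 1.0
print(mae(y_true, y_pred))   # 0.5
```

A model with occasional large peak errors can thus have a worse RMSE but a better MAE than a competitor, which is consistent with the Transformer/XGBoost split observed in Table 4.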
The optimal single models, XGBoost with Temp&GHI and LSTM with GHI, are combined via the Snake Optimizer (SO), which determines adaptive weights. The final forecast is the weighted sum of the two models' outputs, as illustrated in Fig. 6 and Table 5.
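The weighted fusion step can be sketched as below. A coarse grid search stands in for the Snake Optimizer (the SO algorithm itself is a metaheuristic not reproduced here), and the two synthetic "model outputs" are illustrative stand-ins for the XGBoost and LSTM forecasts.

```python
import numpy as np

rng = np.random.default_rng(1)
actual = np.sin(np.linspace(0, 6 * np.pi, 300)) ** 2          # stand-in PV profile
pred_xgb = actual + rng.normal(scale=0.10, size=300)          # noisier model output
pred_lstm = actual + rng.normal(scale=0.05, size=300)         # less noisy model output

def fused_rmse(w):
    """RMSE of the combined forecast w * pred_xgb + (1 - w) * pred_lstm."""
    fused = w * pred_xgb + (1 - w) * pred_lstm
    return float(np.sqrt(np.mean((fused - actual) ** 2)))

# Grid search over the weight as a stand-in for the SO algorithm's search.
weights = np.linspace(0.0, 1.0, 101)
best_w = weights[np.argmin([fused_rmse(w) for w in weights])]
```

Because the two error sources here are independent, the best weight lies strictly between 0 and 1, and the fused forecast beats either model alone; this is the effect the SO-optimized weights exploit on the real HF/LF component forecasts.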
Forecasting Results of Single and Hybrid Models Under Different Meteorological Factors.
Testing results demonstrate that the hybrid model reduces RMSE by 66.08% and MAE by 64.69% compared to XGBoost, and RMSE by 31.33% and MAE by 61.70% compared to LSTM. The hybrid model's forecast curve aligns more closely with the measured values, particularly at volatility peaks (as shown in Fig. 6), validating its superior accuracy and robustness over the individual models. To investigate the impact of the SO algorithm on the proposed model, ablation and comparative experiments are conducted. Without the SO algorithm, both the high- and low-frequency components decomposed by EEMD are fed into the LSTM model for separate forecasting, and the two component forecasts are combined with equal weights. The final results are presented in Table 6.
Results show that the EEMD-LSTM model without the SO algorithm achieves a minimum RMSE of 0.3166, an 11.71% increase in error compared to the LSTM model trained solely on high-frequency components (Table 3). This validates the reliability of employing distinct forecasting models for the high- and low-frequency components. Furthermore, compared with the results in Table 5, the SO-optimized model significantly outperforms the equal-weight LSTM model for high-/low-frequency component forecasting, achieving a 65.44% reduction in RMSE and a 66.73% reduction in MAE. These results confirm the SO algorithm's efficacy in enhancing forecasting accuracy.
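The relative error reductions quoted throughout this section follow the usual definition; a minimal helper, with purely illustrative numbers rather than the paper's table entries:

```python
def reduction_pct(baseline: float, improved: float) -> float:
    """Relative error reduction in percent: how much lower `improved` is than `baseline`."""
    return (baseline - improved) / baseline * 100.0

# Illustrative values only (not taken from the paper's tables):
print(round(reduction_pct(0.50, 0.17), 2))  # 66.0
```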
To further investigate the adaptability of the SO algorithm, comparative experiments are conducted against Particle Swarm Optimization (PSO), the Sparrow Search Algorithm (SSA), and equal-weight assignment for the high- and low-frequency component forecasts. The results are summarized in Table 7.
Results demonstrate that all models incorporating optimization algorithms exhibit significant reductions in test error compared to their non-optimized counterparts, validating the efficacy of the algorithmic enhancement strategy. Furthermore, the proposed SO algorithm achieves the lowest forecasting errors among the compared optimization algorithms. This superiority stems from its capability to adjust the combination weights adaptively, improving the fusion of the high-frequency (HF) and low-frequency (LF) component forecasts. Such adaptive optimization addresses the limitations of static weighting strategies in handling complex data interdependencies, significantly improving the model's applicability in photovoltaic (PV) power forecasting scenarios.
This paper employs the Thompson-Tau-Newton interpolation method to process photovoltaic (PV) power generation data, enhancing its completeness and smoothness. Key meteorological factors strongly correlated with PV power output (global irradiance, temperature, and diffuse irradiance) are identified using Pearson correlation coefficients, thereby reducing model input dimensionality. The preprocessed data are decomposed into 15 intrinsic mode functions (IMFs) using Ensemble Empirical Mode Decomposition (EEMD) and subsequently reconstructed into high-frequency components (IMF3-IMF8) and low-frequency components (the remaining IMFs) based on sample entropy (SampEn) analysis. A hybrid forecasting framework is then developed: XGBoost is applied to the low-frequency components for trend fitting, while LSTM captures high-frequency temporal dynamics. This decomposition structure effectively integrates the strengths of both algorithms, reducing model complexity and improving training efficiency. To further enhance accuracy, the Snake Optimizer (SO) is introduced to determine optimal weighting coefficients between XGBoost and LSTM. Experimental results demonstrate the superiority of the proposed hybrid model: compared to standalone XGBoost, RMSE and MAE decrease by 66.08% and 64.69%, respectively; compared to standalone LSTM, the reductions reach 31.33% and 61.70%. Furthermore, compared with Particle Swarm Optimization (PSO), the Sparrow Search Algorithm (SSA), and equal-weight assignment for the high- and low-frequency component forecasts, the proposed SO algorithm achieves the lowest forecasting errors. These findings validate the hybrid model's accuracy and robustness under complex meteorological conditions.
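The sample entropy criterion used to split the IMFs can be illustrated with a compact, self-contained implementation. This is a simplified sketch of the standard SampEn(m, r) definition; the defaults m = 2 and r = 0.2·std are common conventions and are not necessarily the paper's settings.

```python
import numpy as np

def sample_entropy(x, m=2, r_factor=0.2):
    """Sample entropy of a 1-D series; higher values indicate more irregularity.
    Tolerance r is r_factor times the series standard deviation."""
    x = np.asarray(x, dtype=float)
    r = r_factor * np.std(x)
    n = len(x)

    def count_matches(mm):
        # All templates of length mm; count pairs within Chebyshev distance r.
        templates = np.array([x[i:i + mm] for i in range(n - mm + 1)])
        count = 0
        for i in range(len(templates) - 1):
            d = np.max(np.abs(templates[i + 1:] - templates[i]), axis=1)
            count += int(np.sum(d <= r))
        return count

    b = count_matches(m)      # template matches of length m
    a = count_matches(m + 1)  # template matches of length m + 1
    return -np.log(a / b) if a > 0 and b > 0 else np.inf

# Irregular (noise-like) series score higher than smooth (trend-like) ones,
# which is the basis for sorting IMFs into high- and low-frequency groups.
rng = np.random.default_rng(0)
white = rng.normal(size=500)
smooth = np.sin(np.linspace(0, 8 * np.pi, 500))
assert sample_entropy(white) > sample_entropy(smooth)
```

IMFs with entropy above a chosen threshold would be summed into the high-frequency component for the LSTM, and the rest into the low-frequency component for XGBoost, mirroring the reconstruction step described above.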
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
State Grid Wuxi Power Supply Company, Wuxi, 214111, China
Ying Xu
State Grid Jiangsu Electric Power Co., Ltd, 210019, Jiangsu, China
Xinrong Ji
State Grid Nantong Power Supply Company, Nantong, 226006, China
Zhengyang Zhu
Ying Xu: led the overall research design and methodology conception, and played a key role in drafting and revising the manuscript. Xinrong Ji: conception, study design, execution, analysis, data analysis and interpretation. Zhengyang Zhu: study design, execution, validation of the proposed forecasting method and acquisition of data.
Correspondence to Ying Xu.
The authors declare no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
Xu, Y., Ji, X. & Zhu, Z. A photovoltaic power forecasting method based on the LSTM-XGBoost-EEDA-SO model. Sci Rep 15, 30177 (2025). https://doi.org/10.1038/s41598-025-16368-9