Machine learning-based prediction of soiling losses in photovoltaic modules under different cleaning frequencies: an experimental investigation | Scientific Reports – Nature

Scientific Reports volume 16, Article number: 17416 (2026)
Dust accumulation on photovoltaic panels diminishes performance and reduces energy output, especially in arid regions. This study uses four identical modules in Roorkee, India, from October to December to examine the impact of cleaning frequency on photovoltaic (PV) performance. The reference panel is cleaned daily, while the remaining panels are cleaned weekly, biweekly, and monthly. Alongside short-circuit current measurements, environmental parameters including global horizontal irradiance, ambient temperature, wind speed, and relative humidity are continuously recorded. Soiling loss (%) is examined as the primary performance indicator under the various cleaning intervals to track the progression of dust accumulation and its impact on module performance. Experimental data are used to develop an empirical regression model that describes the trend of dust accumulation. The daily average soiling loss ranges between 0.17 and 0.21%. Furthermore, machine learning models, including decision tree, K-nearest neighbour, support vector regression, artificial neural network, and a stacking ensemble method, are developed to predict soiling loss from environmental variables and cleaning frequency. The stacking model consistently outperforms the other models across all months, with root mean square error of 0.03–0.045, mean absolute error below 0.03, and R² = 0.999. Moreover, statistical analyses such as Bland–Altman plots and the Wilcoxon signed-rank test are employed to validate the significance and agreement of the predicted outcomes. The study highlights the benefits of data-driven solutions for predictive operation and maintenance of solar photovoltaic systems and provides insight into how cleaning frequency reduces soiling losses.
Solar energy is used extensively for heating, water desalination, cooling, and electricity generation, serving applications that range from domestic and industrial to agricultural sectors^1,2. About half of the world’s electricity is expected to come from wind and solar power alone by 2050³. By then, around two-thirds of India’s electricity is expected to come from solar and wind. The price of solar energy has dropped by around 85% since 2010⁴. The primary drivers of the exponential growth of solar PV in India are the sharp decline in solar energy prices and the Indian government’s policies and subsidy schemes promoting the use of solar PV technology to produce power. India has abundant solar insolation owing to its location inside the tropical belt, enabling it to fulfil its daily electrical requirements. It receives an average solar insolation of 4 to 7 kWh/m² and around 2300 to 3200 h of sunlight annually⁵. Numerous factors affect the efficiency of solar photovoltaic power generating systems, as shown in Fig. 1, such as the type of material, solar cell spacing, module area, tilt angle and orientation, environmental conditions, and surface dust on the solar photovoltaic (PV) panels. Surface dust is the most frequently occurring factor influencing solar photovoltaic panel performance^6,7,8. The accumulation of dust on the surface of solar panels can change the electrical characteristics of the panel array. These changes can reverse-bias the panels, which in turn causes a loss of generated power⁹. One study synthesized bio-derived TiO₂ nanoparticles using plant extract and demonstrated improved photovoltaic performance through enhanced light absorption and charge transport, highlighting the importance of material-level enhancements for solar cell efficiency¹⁰. Soiling rates vary between 0.05% and 0.55% per day in India^11,12, while in Dhaka, Bangladesh, the rate is 0.78%¹³.
Factors contributing to dust accumulation and its impact on PV modules.
Large-scale solar farms in remote locations are particularly affected by soiling, because regular cleaning and inspection can be challenging and costly given the expense of labour and long-distance travel^14,15. Automated soiling detection is therefore desirable for determining how much power is lost by dirty photovoltaic modules. Inadequate maintenance of the cleanliness of solar photovoltaic panel surfaces leads to significant economic losses. Consequently, precise and effective techniques for identifying dust buildup on panel surfaces are crucial, since they allow prompt cleaning and help ensure safe and efficient operation. Dust and ambient temperature energy losses were quantified using artificial neural network (ANN) and extreme learning machine (ELM) algorithms¹⁶. The ELM and ANN models achieved prediction accuracies of 91.42% and 90.69%, respectively. Multivariate Linear Regression (MLR) and ANN models were used to estimate dust-related energy and economic losses in solar panels¹⁷. The ANN and MLR models estimated dust-related cost and energy losses with accuracies of 89.97% and 86.78%, respectively. In another study, an Adaptive Neuro-Fuzzy Inference System (ANFIS) was used to predict the performance of dust-exposed solar modules¹⁸. The ANFIS model achieved a root mean square error (RMSE) of 0.18719 and a coefficient of determination (R²) of 0.99803 for monocrystalline silicon PV modules, compared with an RMSE of 0.87098 and an R² of 0.99714 for polycrystalline PV modules. The study in¹⁹ predicted dust-induced PV panel performance deterioration in Qatar using ANN and MLR models. The ANN model had an R² of 0.537 and a mean square error (MSE) of 0.0038, while the MLR model had an R² of 0.167 and an MSE of 0.0082, so the ANN model outperformed the MLR model. The study in²⁰ estimated dust losses using artificial neural networks, reporting a normalized root mean square error (NRMSE) of 6.79 and a correlation coefficient (R) of 0.91.
Another study²¹ used ANN, MLR, an Interactive Multivariate Linear Regression Model (MLRWI), and Response Surface Methodology (RSM) to predict the loss caused by dust on the solar PV module surface. The artificial neural network produced better predictive results than the other models, with an R² of 0.813 and an RMSE of 0.026. Further studies, with the machine learning (ML) methods they implemented and their performance, are summarised in Table 1.
The studies summarised in Fig. 2 show that reported soiling losses vary widely with location, exposure duration, and local climatic conditions. Observed rates in the literature span roughly 0.1% to 1.1% per day, with the highest values typically found in arid, dusty environments (e.g. Bahrain, Qatar) and lower values in regions with occasional rainfall or wind cleaning. Differences in measurement period, panel tilt, dust type, and cleaning practice (and the diversity of experimental protocols) make direct comparison difficult. Overall, the literature indicates a clear need for (a) longer-term, standardized measurements across diverse climates, (b) studies that relate soiling loss to measured environmental drivers (global horizontal irradiance (GHI), wind speed (WS), relative humidity (RH), dust deposition rate (DDR)), and (c) predictive models (ML approaches) validated against controlled experiments^36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55. These gaps motivate the present work, which combines experimental measurements with ML models to produce robust soiling-loss predictions.
The main contributions of this work are:
Natural, real-world dust build-up on PV panels is studied under four cleaning frequencies (daily, weekly, biweekly, monthly) rather than under controlled dust deposition, offering practical insights.
Two empirical models are established, one for PV short-circuit current (I_sc) prediction and one for soiling loss (SL), covering both electrical performance and soiling effects.
A stacking model for SL prediction is compared with other ML models (ANN, SVM, KNN, DT), providing both empirical and data-driven comparisons.
Bland–Altman analysis and the Wilcoxon signed-rank test are applied for model validation, adding statistical support for the reliability of the models.
Soiling loss and study duration across locations.
The methodology of the present work is shown in Fig. 3. The experimental setup consists of four PV panels subjected to different cleaning frequencies (P1-daily, P2-weekly, P3-biweekly, and P4-monthly). GHI, RH, WS, ambient temperature (AT), and the short-circuit current (Isc) of each panel were recorded using a data logger. Based on the collected data, two empirical models were developed: an Isc model as a function of the environmental parameters GHI, AT, WS, and RH, and a soiling loss (SL) model as a function of RH, WS, AT, and cleaning frequency (CF). To improve predictive capability, machine learning models (ANN, SVM, KNN, DT, and stacking ensemble) were implemented for SL prediction. The performance of the empirical and ML models was compared using the evaluation metrics RMSE, mean absolute percentage error (MAPE), MAE, MSE, and R². Statistical validation using Bland–Altman analysis and the Wilcoxon signed-rank test was carried out to assess the significance and agreement of the models.
Methodology of the work.
The experimental work was conducted in Roorkee, India (29.86°N, 77.89°E), located in the Indo-Gangetic plain and characterised by a subtropical climate with distinct winter, summer, and monsoon seasons. The study period, October to December, represents the dry-winter season with frequent dust-laden winds, moderate humidity, and occasional foggy conditions, making it suitable for investigating soiling effects. Rainfall-induced natural cleaning was avoided to maintain experimental consistency and isolate the effects of environmental variables and manual cleaning intervals.
The experimental setup, shown in Fig. 4, consisted of four identical crystalline silicon PV modules, each rated at the same electrical capacity (P_max = 20 W), installed outdoors at a fixed tilt angle of 30° with a south-facing orientation to maximize solar exposure. All the PV modules were mounted on a common frame to ensure that they experienced identical environmental conditions such as GHI, AT, RH, and WS.
The present study was conducted during the dry season because the primary objective is to evaluate the impact of manual cleaning at fixed, predefined intervals (daily, weekly, biweekly, monthly) under controlled accumulation conditions. During the monsoon season, frequent rainfall events act as natural cleaning mechanisms. Such stochastic and uncontrolled cleaning would interfere with the predefined manual cleaning schedules and compromise the controlled comparison between different cleaning frequencies.
Experimental setup: (1–4) PV modules with different cleaning frequencies, (5) weather station, (6) data logger, and (7) PV analyser.
The data acquisition architecture shown in Fig. 4 is as follows:
The ATMEGA2560-based data logger measures:
Global horizontal irradiance (GHI).
Ambient temperature (AT).
Relative humidity (RH).
PV module current and voltage.
Timestamp via real-time clock (RTC).
The Weather Station is separately indicated as the source of:
Wind Speed (WS).
The experiment is conducted under natural outdoor exposure, allowing dust to accumulate under real environmental conditions. Although dust composition may vary geographically, the measured electrical performance and environmental parameters inherently capture the net effect of dust deposition and adhesion. This approach ensures that the dataset reflects realistic soiling behaviour and provides a reproducible foundation for predictive modelling.
To evaluate the impact of cleaning frequency on soiling losses, the four panels were exposed to different cleaning regimens. Panel P1 served as the clean reference and was cleaned daily, whilst P2, P3, and P4 were cleaned weekly, biweekly, and every four weeks, respectively. Cleaning took place at 06:00 using distilled water and a soft, lint-free cloth to avoid scratches or additional surface contamination. Electrical measurements comprised the short-circuit current of each module, recorded at consistent intervals as the principal performance metric for dust build-up. Meteorological parameters were recorded continuously, including global horizontal irradiance, ambient temperature, relative humidity, and wind speed. The data were obtained using calibrated sensors integrated with a data logging system, ensuring synchronous environmental and electrical recordings. The sensors used for data collection and real-time observation of the solar PV system are presented in Table 2.
Electrical and environmental parameters (GHI, AT, RH, WS, I_SC, and voltage) were sampled hourly, at fixed and uniform intervals, using the ATMEGA2560-based data acquisition system, ensuring temporal consistency throughout the experimental period. For modelling, the recorded data were aggregated into daily average values to represent the cumulative effect of dust deposition under each cleaning frequency. Figure 5 illustrates the cleaning schedule followed during the experimental period from October to December, highlighting the maintenance intervals adopted for each panel.
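The aggregation of hourly samples into daily averages described above can be sketched as follows. This is a minimal illustration only: the function name `daily_means` and the sample values are hypothetical, and the study's own processing pipeline is not described in the text.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def daily_means(records):
    """Average (timestamp, value) samples into one value per calendar day."""
    by_day = defaultdict(list)
    for stamp, value in records:
        by_day[stamp.date()].append(value)
    # Return the days in chronological order.
    return {day: mean(values) for day, values in sorted(by_day.items())}


# Hypothetical hourly short-circuit current samples (A) for two days.
hourly_isc = [
    (datetime(2024, 10, 1, 10), 0.30),
    (datetime(2024, 10, 1, 11), 0.34),
    (datetime(2024, 10, 2, 10), 0.28),
    (datetime(2024, 10, 2, 11), 0.32),
]
daily_isc = daily_means(hourly_isc)
```

The same helper could be applied to each logged channel (GHI, AT, RH, WS) before modelling.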
Cleaning schedule timeline (Oct-Dec 2024).
To account for the influence of external climatic variables, meteorological parameters were recorded continuously during the experimental study. Figure 6 illustrates the daily average variation of global horizontal irradiance and wind speed, while Fig. 7 presents the daily average variation of relative humidity and ambient temperature from October to December 2024. A statistical description of the collected data is given in Table 3. These measurements show the changing environmental conditions under which the PV panels operated and provide context for analysing the soiling effect and validating the empirical and machine learning models.
Daily average global horizontal irradiance and wind speed during the experimental period (October–December 2024).
Figure 6 shows that the wind speeds stayed in the moderate range (~ 1–3 m/s) during the experiment. Under such conditions, particle transport and deposition mechanisms are likely to dominate over aerodynamic removal, explaining the observed positive correlation between wind speeds and soiling loss.
Daily average relative humidity and ambient temperature during the experimental study (October–December 2024).
The abrupt increase in relative humidity as shown in Fig. 7 accompanied by a drop in ambient temperature corresponds to short-duration high-humidity events commonly observed during the winter season in the Indo-Gangetic Plain. These events are typically associated with fog formation, condensation, or transient cloud cover rather than measurable rainfall.
The monthly data analysis from October to December 2024 shows how PV performance is affected both by changes in the environment and by how often the panels are cleaned. The daily-cleaned panel (P1) consistently exhibited higher short-circuit current values, while progressive reductions were observed in P2, P3, and most significantly in P4, reflecting the cumulative effect of soiling at longer cleaning intervals. The variability of irradiance, indicated by higher standard deviation values, further underscores the dynamic operating environment of the panels. These findings confirm that both environmental conditions and soiling accumulation substantially impact PV output, thereby establishing the need for predictive models. Accordingly, the subsequent section develops empirical models to express I_SC as a function of GHI, AT, RH, and WS, and to quantify soiling loss as a function of RH, WS, AT, and CF, thus providing a foundational framework for later comparison with machine learning models.
The experimental dataset was utilised to derive empirical regression models for short-circuit current and soiling loss in percentage, using global horizontal irradiance, ambient temperature, wind speed, relative humidity, and cleaning frequency as predictors. Multiple linear regression is adopted to establish these relationships.
The regression equation obtained for I_SC, Eq. (1), is:
Table 4 presents the estimated regression coefficients for the empirical model of short-circuit current⁵⁶. The results clearly highlight GHI as the most dominant predictor (Estimate = 0.001265, p < 0.001), consistent with the physical dependence of I_SC on solar irradiance. Cleaning frequency also shows a highly significant negative effect (p < 0.001), indicating that the current falls as the number of days between cleanings increases. Ambient temperature has a small but significant positive effect, while relative humidity shows a minor negative influence. In contrast, wind speed was statistically insignificant (p = 0.238), confirming its limited role in determining I_SC.
The regression equation for soiling loss (SL), Eq. (2), is expressed as:
Table 5 presents the estimated regression coefficients for the empirical model of soiling loss. The results show that cleaning frequency is the most significant predictor (Estimate = 0.18487, p < 0.001), highlighting the strong impact of longer cleaning intervals on increased soiling losses. Relative humidity also exhibits a significant positive influence (p = 0.007), which may be attributed to dust adhesion and cementing effects under humid conditions. Ambient temperature has a significant negative effect (p < 0.001), suggesting that higher temperatures may reduce relative deposition or increase self-cleaning effects. Wind speed shows a marginally significant positive influence (p = 0.048), reflecting its dual role in either removing or redistributing dust. In contrast, GHI is statistically insignificant (p = 0.989), indicating that irradiance itself does not directly drive soiling loss but instead affects PV output through I_SC.
To further ensure model robustness, adjusted R² and residual diagnostics were evaluated before and after removing GHI. The change in adjusted R² was negligible (< 0.001), confirming that irradiance does not contribute to predictive power in the SL formulation. The removal therefore improves model parsimony without compromising explanatory strength, consistent with regression theory principles.
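As a rough illustration of the multiple linear regression step, the sketch below fits an ordinary least squares model by solving the normal equations in plain Python. The data are synthetic, generated from a known linear rule in relative humidity and cleaning frequency, and the function names are hypothetical; the coefficients reported in Tables 4 and 5 come from the study's own data, which are not reproduced here.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes the system is non-singular (no zero pivot is handled).
    """
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_mlr(features, target):
    """Ordinary least squares; returns [intercept, b1, ..., bp]."""
    rows = [[1.0] + list(f) for f in features]  # prepend intercept column
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Synthetic rows of (RH in %, CF in days); the target follows a known rule.
X = [(40, 1), (55, 7), (60, 14), (70, 28), (45, 14), (65, 7)]
y = [0.5 + 0.01 * rh + 0.18 * cf for rh, cf in X]
coeffs = fit_mlr(X, y)  # approximately [0.5, 0.01, 0.18]
```

In practice a statistics package would also report the standard errors and p-values discussed above; this sketch recovers only the coefficient estimates.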
The performance of the four PV modules was evaluated in terms of short-circuit current (I_SC), soiling ratio (SR), and soiling loss (SL%). SR is computed from I_SC as in Eq. (3): SR = I_{sc soiled} / I_{sc clean},
where I_{sc soiled} is the short-circuit current of the soiled panel and I_{sc clean} is that of the clean reference panel (P1). The soiling loss percentage (SL%) is then calculated as in Eq. (4): SL% = (1 − SR) × 100.
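A minimal Python sketch of the soiling ratio and soiling loss calculation follows, assuming the standard definitions (SR as the soiled-to-clean short-circuit current ratio, and SL% as its complement expressed in percent). The function names and the example currents are hypothetical.

```python
def soiling_ratio(isc_soiled, isc_clean):
    """Ratio of the soiled panel's I_sc to the clean reference panel's I_sc."""
    return isc_soiled / isc_clean


def soiling_loss_percent(isc_soiled, isc_clean):
    """Soiling loss in percent: the shortfall of SR from 1."""
    return (1.0 - soiling_ratio(isc_soiled, isc_clean)) * 100.0


# Hypothetical readings: clean reference 0.400 A, soiled panel 0.380 A.
sr = soiling_ratio(0.380, 0.400)
sl = soiling_loss_percent(0.380, 0.400)
```

Because both currents are measured under the same irradiance on a common frame, the ratio cancels most of the irradiance dependence and isolates the optical loss due to dust.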
To further improve predictive accuracy beyond the empirical formulations, machine learning algorithms were employed using the experimental dataset described in Sect. 3. The empirical SL model achieved a high determination coefficient (R² = 0.978) with low RMSE and MAE; however, the mean absolute percentage error (MAPE = 28%) remained relatively high, reflecting systematic nonlinearities and residual bias. To address these limitations, an ML-based predictive framework was developed.
The experimental dataset consists of both environmental and operational parameters: global horizontal irradiance (GHI), ambient temperature (AT), wind speed (WS), relative humidity (RH), reference short-circuit current of the clean panel (I_SC(clean)), and cleaning frequency (CF). Categorical variables such as cleaning interval were encoded using one-hot encoding, while all continuous features were standardized to zero mean and unit variance. To assess the relative contribution of the selected input variables, a sensitivity analysis using the CAM approach was carried out, as described in Sect. 2.4. Figure 8 presents the scatter matrix of the selected features (CF, GHI, AT, WS, RH, and SL), excluding the month variable. The off-diagonal scatter plots represent the pairwise relationships between the parameters, whereas the diagonal histograms illustrate the distribution of each parameter.
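The two preprocessing steps named above, one-hot encoding of the cleaning interval and standardization of continuous features, can be sketched in plain Python as below. The helper names and sample values are hypothetical, and the population standard deviation is used for scaling, which is an assumption about the convention the study followed.

```python
from statistics import mean, pstdev


def one_hot(values):
    """Encode categorical labels as 0/1 indicator vectors.

    Returns the sorted category list and one indicator row per input value.
    """
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


def standardize(values):
    """Scale values to zero mean and unit (population) variance.

    Assumes the values are not all identical (non-zero spread).
    """
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical cleaning-interval labels and relative humidity readings (%).
categories, encoded = one_hot(["weekly", "biweekly", "monthly", "weekly"])
z = standardize([40.0, 55.0, 60.0, 70.0])
```

Standardization matters most for distance-based and kernel-based learners such as KNN and SVM, where features with large numeric ranges would otherwise dominate.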
Pairwise matrix of experimental features and relationship with soiling losses.
Global horizontal irradiance, ambient temperature, relative humidity, and wind speed have substantial temporal autocorrelation, so successive measurements are not statistically independent. Randomly shuffling time-dependent samples destroys the temporal structure and leaks temporal information, enabling the model to learn indirectly from later observations. This may produce overly optimistic assessments and inflated performance measures (e.g., R²). To avoid this, the dataset was divided into 80% training and 20% testing observations using a time-ordered split. This method retains temporal causality and gives a realistic estimate of how the model generalizes in practice.
Time-ordered data splitting for temporally correlated environmental data.
Figure 9 compares random and time-ordered (blocked) data splitting for temporally autocorrelated environmental variables (GHI, RH, AT, WS). Random splitting leaks temporal information and inflates performance measures by including samples from comparable time periods in both training and testing. Time-ordered splitting preserves chronology and enables a realistic assessment of generalization.
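A time-ordered 80/20 split of the kind described can be sketched as follows. The function name and the sample tuples are hypothetical; the only assumption is that each sample carries a sortable time index as its first element.

```python
def time_ordered_split(samples, train_fraction=0.8):
    """Split samples chronologically: earliest part trains, latest part tests.

    Each sample is a tuple whose first element is a sortable time index.
    """
    ordered = sorted(samples, key=lambda s: s[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Hypothetical (day_index, soiling_loss) samples, deliberately out of order.
samples = [(day, 0.1 * day) for day in range(10, 0, -1)]
train, test = time_ordered_split(samples)
```

Unlike a shuffled split, every test sample here is later than every training sample, so the evaluation never uses information from the future.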
In this work, the cosine amplitude method (CAM) was used to assess the correlation between the input and output data. The mathematical formulation of CAM is given in Eq. (5).
Equation (5) reflects the connection between the cosine function and the dot product. The normalised inner product of the input vector X_i and the output vector Y equals zero when the two vectors are at right angles to one another and one when they are collinear. Features with higher CAM scores (closer to 1) therefore show greater directional similarity (and hence sensitivity) to the output than those with lower scores, as shown in Fig. 10.
Cosine amplitude scores (CAM) for feature importance.
CAM is the cosine similarity between each feature vector (GHI, AT, WS, RH, and CF) and the target vector, soiling loss (%), and it is scale-invariant. Figure 10 indicates that CF is the strongest driver of soiling loss, whereas the meteorological variables show moderate CAM scores.
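The cosine amplitude score can be computed with a few lines of Python. The sketch below assumes the usual CAM form, the dot product of a feature vector and the target vector divided by the product of their Euclidean norms; the function name and the example vectors are hypothetical.

```python
from math import sqrt


def cosine_amplitude(x, y):
    """Cosine amplitude score: dot(x, y) / (||x|| * ||y||).

    Assumes neither vector is all zeros.
    """
    dot = sum(a * b for a, b in zip(x, y))
    return dot / sqrt(sum(a * a for a in x) * sum(b * b for b in y))


# Collinear vectors score 1; orthogonal vectors score 0.
collinear_score = cosine_amplitude([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal_score = cosine_amplitude([1.0, 0.0], [0.0, 1.0])

# Scale invariance: rescaling a feature leaves its score unchanged.
scaled_score = cosine_amplitude([10.0, 20.0, 30.0], [2.0, 4.0, 6.0])
```

Note that the score is computed on raw (uncentred) vectors, so it differs from the Pearson correlation coefficient, which subtracts the means first.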
To evaluate potential multicollinearity among environmental predictors, the Variance Inflation Factor (VIF) was computed for GHI, RH, and AT. The obtained VIF values were 1.023 (GHI), 1.5987 (RH), and 1.5809 (AT), all of which are substantially below the threshold value of 5. These results confirm the absence of significant multicollinearity and demonstrate that the regression coefficients are stable and not adversely affected by linear dependency among predictors.
To further validate feature sensitivity, permutation importance analysis⁵⁷ was performed, as shown in Fig. 11. Cleaning frequency was identified as the dominant predictor of soiling loss, consistent with physical dust accumulation mechanisms. Environmental variables such as ambient temperature, relative humidity, irradiance, and wind speed exhibited smaller but meaningful contributions. The inset plot provides a detailed view of environmental feature importance. These findings support the robustness and physical consistency of the predictive model.
Permutation-based feature importance analysis.
After pre-processing the data, soiling loss was estimated using several ML methods: DT, KNN, SVM, ANN, and stacking. The stacking ensemble combined ANN, SVM, and DT as base learners, with a gradient boosting regressor (GBR) as the meta-learner. GBR refined the base predictions by capturing residual nonlinear patterns, resulting in improved accuracy and reduced bias. Simulations were run in MATLAB R2024a on a Dell laptop with a Core i9-11900H processor and 32 GB of RAM.
An ANN is intended to replicate the neuronal organisation of the human brain by using interconnected layers to capture complicated interactions, as seen in Fig. 12^58,59. In this study, the model is trained with backpropagation to minimize the prediction error. In Eq. (6), w_n are the weights corresponding to each input x_n, b is the bias, and the final predicted output is represented by Y.
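A single-neuron form consistent with the symbols defined above can be written as follows. The activation function f and the number of inputs N are notational assumptions introduced here for illustration; the network architecture used in the study is not specified in this passage.

```latex
Y = f\left( \sum_{n=1}^{N} w_n x_n + b \right)
```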
ANN model.
Figure 13 shows how a support vector machine (SVM) uses kernel functions to separate data in high-dimensional spaces and capture complicated relationships for classification and regression⁶⁰. In this work it estimates SL, as expressed in Eq. (7),
where Z is the input vector, W is the weight vector, and B is the bias term.
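A common way to write the support vector regression function with these symbols is shown below. The feature map \varphi induced by the kernel is an assumption added for illustration (for a linear kernel, \varphi(Z) = Z); the kernel actually used in the study is not stated in this passage.

```latex
f(Z) = W^{\top} \varphi(Z) + B
```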
Support vector machine (regressor).
Regression trees (RT), as seen in Fig. 14, are decision trees used to predict a continuous variable, here SL, by segmenting the data according to defined criteria and computing the mean target value for each subgroup. They are interpretable, resilient to outliers, and handle non-linear relationships well⁶¹. The proposed approach employs regression trees by segmenting the feature space and predicting the target variable SL inside each segment, as given in Eq. (8).
Decision tree (regressor).
In the regression tree of Eq. (8), N is the total number of terminal nodes (leaves), each region Z_n represents a partition of the feature space, C_n is the mean target value within that region, and I(X ∈ Z_n) is an indicator function that equals 1 if X belongs to Z_n and 0 otherwise.
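With the symbols just defined, the regression tree prediction takes the standard piecewise-constant form below. The notation \hat{Y}(X) for the predicted soiling loss is an assumption made here for illustration.

```latex
\hat{Y}(X) = \sum_{n=1}^{N} C_n \, I(X \in Z_n)
```

Since the regions Z_n do not overlap, exactly one indicator is non-zero for any input X, so the prediction is simply the mean target value of the leaf that X falls into.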
The K-nearest neighbours (KNN) algorithm is a simple, non-parametric method that predicts outputs based on the average of the k closest data points in the feature space as shown in Fig. 15. In the context of this work, KNN estimates soiling loss by finding similar conditions of GHI, AT, WS, CF and RH from experimental data. It is intuitive and effective for capturing local patterns without requiring an explicit training phase.
K-nearest neighbours.
In the KNN regression equation, Eq. (9), K is the number of nearest neighbours, N_K(X) is the set of those neighbours, and Y_i are their target values, so the prediction is the average of the target values of the K closest points.
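KNN regression as described, averaging the targets of the K nearest training points, can be sketched in a few lines of Python. Euclidean distance is assumed as the distance measure, and the function name and toy data are hypothetical.

```python
from math import dist


def knn_predict(train_x, train_y, query, k=3):
    """Predict by averaging the targets of the k nearest training points.

    Uses Euclidean distance; assumes k does not exceed the training set size.
    """
    ranked = sorted(zip(train_x, train_y), key=lambda pair: dist(pair[0], query))
    nearest = ranked[:k]
    return sum(y for _, y in nearest) / k


# Toy one-feature example: the three nearest neighbours of 1.0 are 0, 1, and 2.
train_x = [(0.0,), (1.0,), (2.0,), (10.0,)]
train_y = [0.0, 1.0, 2.0, 10.0]
prediction = knn_predict(train_x, train_y, (1.0,), k=3)
```

Because the prediction depends on distances, features should be on comparable scales first, which is one reason the continuous inputs are standardized.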
The stacking model is an ensemble learning approach that combines multiple base learners to improve predictive performance, as shown in Fig. 16. In this work, ANN, SVM, and DT were used as base learners to capture diverse data patterns, and their outputs were blended by a Gradient Boosting Regressor (GBR) as the meta-learner. This framework draws on the strengths of each individual model while compensating for their weaknesses. As a result, the stacking model achieved higher accuracy and robustness than the single-model approaches.
For L base learners, the stacking prediction is given in Eq. (10),
where m_1(x), …, m_L(x) are the base learner outputs (ANN, SVM, and DT in this work) and g(⋅) is the meta-learner (GBR) that combines them to produce the final output.
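In this notation the stacking prediction can be written compactly as below, where \hat{y}(x) denotes the final predicted soiling loss (a symbol assumed here for illustration) and L = 3 for the ANN, SVM, and DT base learners.

```latex
\hat{y}(x) = g\big( m_1(x),\, m_2(x),\, \dots,\, m_L(x) \big)
```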
Stacking ensemble model.
In this study, conventional random k-fold cross-validation was not employed because the dataset represents a physically time-ordered environmental process. Environmental variables such as irradiance, temperature, humidity, and wind speed exhibit strong temporal autocorrelation and causal continuity. Randomized cross-validation would mix past and future observations, introducing information leakage and leading to overly optimistic performance estimates, particularly for stacking models where the meta-learner learns second-order correlations. To ensure physically realistic and leakage-free validation, a time-ordered training–testing strategy was adopted. As illustrated in Fig. 17, the dataset is divided chronologically into a training window (earlier observations) and a testing window (later unseen observations). The base learners (ANN, SVM, and DT) were trained exclusively on the training window, and their predictions were used to train the meta-learner (Gradient Boosting Regressor). The trained stacking model was then evaluated only on the testing window, which contained future unseen samples.
Time-aware training of stacking ensemble without cross-validation leakage.
The hyperparameters of all machine learning models, including ANN, SVM, DT, KNN, and the GBR used in the stacking ensemble, were selected using a systematic tuning procedure based on grid search combined with validation on the training dataset⁶². The hyperparameter combinations presented in Table 6 were used for all machine learning models in order to minimize prediction error while avoiding overfitting. For each model, a range of candidate hyperparameters was evaluated, and the optimal configuration was selected based on minimum RMSE and stable generalization performance. The same training dataset and evaluation criteria were applied consistently across all models to ensure fair comparison.
It is important to evaluate the precision of the prediction model. A variety of measures have been used to evaluate the precision of predicting PV output power production¹², which include:
(a) Mean Absolute Error (MAE): The average of the absolute differences between actual and predicted values, giving equal weight to all errors, as expressed in Eq. (11).
(b) Mean Square Error (MSE): The average of the squared differences between actual and predicted values, which penalizes larger errors more heavily, as expressed in Eq. (12).
(c) Root Mean Square Error (RMSE): The square root of the MSE, which expresses the prediction error in the same units as the target variable, as expressed in Eq. (13).
(d) Coefficient of Determination (R²): The proportion of variance in the actual data that is explained by the model, with values closer to 1 indicating a better fit, as expressed in Eq. (14).
(e) Mean Absolute Percentage Error (MAPE): The average absolute error as a percentage of the actual values, useful for assessing relative accuracy, as expressed in Eq. (15).
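The five metrics listed above can be sketched in Python using their standard textbook definitions. The function names and the sample values are hypothetical, and the definitions are assumed to match Eqs. (11) to (15).

```python
from math import sqrt


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean square error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error, in the units of the target."""
    return sqrt(mse(actual, predicted))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def mape(actual, predicted):
    """Mean absolute percentage error; assumes no actual value is zero."""
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical actual and predicted soiling losses (%).
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
scores = {
    "MAE": mae(actual, predicted),
    "MSE": mse(actual, predicted),
    "RMSE": rmse(actual, predicted),
    "R2": r_squared(actual, predicted),
    "MAPE": mape(actual, predicted),
}
```

MAPE divides each error by the actual value, so it grows quickly when actual soiling losses are close to zero, even if the absolute errors are small.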
This section discusses the performance and findings of the proposed models. The test results under actual environmental conditions in the Roorkee area, India, are presented by month and cleaning frequency. The empirical model derived from the experimental data is compared with the machine learning models.
The performance of the four PV modules was evaluated in terms of short-circuit current (Isc), soiling ratio (SR), soiling loss (SL%), and current–voltage (I–V) and power–voltage (P–V) characteristics. The daily-cleaned panel (P1) was taken as the clean reference, while P2, P3, and P4 represent panels cleaned at weekly, biweekly, and monthly intervals, respectively.
Short-circuit current ((:{I}_{SC:})) was adopted as the primary soiling indicator because dust accumulation predominantly reduces optical transmission, directly affecting photocurrent generation. Since Isc is approximately proportional to irradiance, it provides a linear and direct measure of optical attenuation. In contrast, power output (P_max) incorporates nonlinear temperature and fill factor effects, which may obscure pure dust-related losses.
In addition to $I_{SC}$-based metrics, I–V and P–V curves were generated using a calibrated PV analyzer for each panel at different cleaning intervals. These curves provide detailed insight into the effect of dust accumulation not only on the short-circuit current but also on the maximum power point ($P_{MPP}$), open-circuit voltage ($V_{oc}$), and fill factor (FF). Figures 18 and 19 show the daily average $I_{SC}$ and the daily soiling ratio (SR) trend over the experimental period, while Fig. 20 illustrates the monthly soiling loss in percent.
Daily average variation of short-circuit current for PV panels with different cleaning frequencies.
Daily variation of soiling ratio for PV panels cleaned at different intervals.
Monthly average soiling loss (%) for PV panels with different cleaning intervals: P2 (clean weekly), P3 (clean biweekly), and P4 (clean monthly).
Figures 18, 19 and 20 collectively illustrate the impact of cleaning frequency on PV performance. The daily-cleaned panel (P1) maintained the highest $I_{SC}$, while P2–P4 showed progressive reductions with longer cleaning intervals. The soiling ratio (SR) exhibited a stepwise decline within each cleaning cycle, steepest for the monthly-cleaned panel (P4). Monthly average soiling losses confirmed this trend, increasing from 1–1.5% (P2) to 2.5% (P3) and 5% (P4). These results clearly demonstrate that extended cleaning intervals accelerate dust-induced performance degradation.
PV analyser measurement of I–V and P–V characteristics at GHI 415 W/m²: (a) P1: clean daily (b) P2: clean weekly (c) P3: clean biweekly (d) P4: clean monthly.
Figure 21 shows the I–V and P–V characteristics of the four PV panels under different cleaning frequencies. The clean reference panel (P1, cleaned daily) achieved the highest maximum power point ($P_{MPP}$ = 6.00 W) and current at MPP ($I_{mpp}$ = 0.370 A). Panels with reduced cleaning frequency showed progressive reductions in both $I_{mpp}$ and $P_{MPP}$: P2 (weekly) produced 5.80 W, P3 (biweekly) dropped to 5.59 W, and P4 (monthly) showed the lowest performance at 5.31 W. The open-circuit voltage ($V_{oc}$) remained relatively stable across all panels, indicating that dust accumulation primarily impacts the short-circuit current and the maximum power output.
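To put the maximum-power values quoted for Fig. 21 on a common scale, the short sketch below expresses each panel's power at the maximum power point as a percentage loss relative to the daily-cleaned reference. The input values are the ones stated in the text; the relative-loss formula and the variable names are illustrative assumptions, and the result reflects a single measurement at GHI 415 W/m² rather than an averaged loss.

```python
# Maximum power (W) per panel as quoted for Fig. 21
p_mpp = {
    "P1 (daily)": 6.00,
    "P2 (weekly)": 5.80,
    "P3 (biweekly)": 5.59,
    "P4 (monthly)": 5.31,
}

reference = p_mpp["P1 (daily)"]

# Relative power loss with respect to the clean reference, in percent
power_loss = {
    name: (reference - value) / reference * 100.0
    for name, value in p_mpp.items()
}

for name, loss in power_loss.items():
    print(f"{name}: {loss:.2f}% below reference")
```

Under this assumption the losses are roughly 3.3% (weekly), 6.8% (biweekly), and 11.5% (monthly) for that particular measurement.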
The validation performance of the empirical $I_{SC}$ models for the four PV panels under different cleaning frequencies during October–December is shown in Fig. 22 in terms of (a) RMSE, (b) MAE, and (c) MAPE. Panels P1–P3 (daily, weekly, and biweekly cleaning) maintained low errors (RMSE ≤ 0.009 A, MAE ≤ 0.007 A, MAPE = 1–1.5%), whereas P4 (monthly cleaning) exhibited significantly higher deviations (RMSE up to 0.020 A, MAE = 0.017 A, MAPE = 3.7% in December), highlighting the negative impact of extended cleaning intervals on model accuracy.
Month-wise and cleaning-frequency-wise performance evaluation of the empirical model: (a) RMSE (b) MAE (c) MAPE.
Figure 23 shows that all four PV panels have an R² value of 0.99 or above, indicating that the empirical modelling framework reliably describes the changes in $I_{SC}$ under various cleaning conditions. The gradual decline in R² with reduced cleaning frequency highlights the sensitivity of empirical models to dust accumulation patterns.
Experimental vs. empirical model R² plot (Oct–Dec 2024) for (a) P1: clean daily (b) P2: clean weekly (c) P3: clean biweekly (d) P4: clean monthly.
The three PV panels are compared from October to December in terms of measured vs. predicted soiling loss (SL, %), as shown in Fig. 24. The estimated regression line explains most of the variation (monthly R² = 0.97–0.98) with small absolute errors (RMSE = 0.35, MAE = 0.27). The moderate relative error (MAPE = 26–31%) suggests systematic bias or nonlinear effects that the basic empirical fit cannot capture. The following part uses machine learning to reduce this relative error.
Experimental vs. empirical SL model R-squared plot for month (a) October (b) November (c) December.
The empirical SL model’s performance is shown in Fig. 25. Panel (a) displays the regression plot of actual against predicted soiling loss values. The data points lie close to the fitted regression line, giving a high coefficient of determination (R² = 0.978). The model’s low RMSE (0.364) and MAE (0.278) confirm that it captures the trend. However, the mean absolute percentage error (MAPE) remains high (28%), reflecting large relative deviations, especially at lower SL values. Panel (b) shows the residual distribution, where errors are centred around zero but include outliers. This residual spread points to systematic deviations not fully represented by the empirical formulation, which motivates the use of machine learning algorithms in the following section to reduce relative error while maintaining a high R².
Overall empirical SL model plot of (a) R-squared (b) residual.
Although the empirical SL model achieved a high coefficient of determination (R² ≈ 0.978), the MAPE value (~28%) appears relatively high. This is primarily due to the sensitivity of MAPE to small denominator values. Since several SL observations fall within low ranges (below 2%), even small absolute deviations lead to inflated percentage errors. Furthermore, the linear regression framework may not fully capture nonlinear dust accumulation patterns, contributing to structural bias at low SL levels. This limitation motivated the introduction of machine learning algorithms to make percentage-based predictions more accurate.
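The denominator effect described above can be illustrated with a small numerical sketch. The values are hypothetical and chosen only to show how the same absolute deviation produces very different percentage errors at low and high soiling loss; `absolute_percentage_error` is an illustrative helper name.

```python
def absolute_percentage_error(actual, predicted):
    """Absolute error of one prediction as a percentage of the actual value."""
    return abs(actual - predicted) / abs(actual) * 100.0

abs_error = 0.3  # identical absolute deviation in SL percentage points (hypothetical)

# Low soiling loss: the same deviation is a large fraction of the target
ape_low = absolute_percentage_error(1.0, 1.0 + abs_error)

# Higher soiling loss: the same deviation is a small fraction of the target
ape_high = absolute_percentage_error(5.0, 5.0 + abs_error)

print(f"APE at SL = 1%: {ape_low:.1f}%")
print(f"APE at SL = 5%: {ape_high:.1f}%")
```

With these inputs the percentage error is about 30% at SL = 1% and about 6% at SL = 5%, even though the absolute error is identical.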
Experimental data were used to develop the machine learning models. The model inputs are ambient temperature (AT), global horizontal irradiance (GHI), relative humidity (RH), wind speed (WS), and cleaning frequency (CF), while the target variable is the soiling loss (SL) of the solar PV module. The dataset size was 5 × 336, divided 80:20 for training and testing, as shown in Table 7. SL is predicted using stacking, ANN, SVM, DT, and KNN models.
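A minimal sketch of an 80:20 partition of 336 samples is given below. It assumes a seeded random shuffle and rounding of the test-set size, neither of which is stated in the text, so the resulting counts need not match Table 7 exactly. The function name `train_test_split_indices` is hypothetical.

```python
import random

def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices reproducibly and split them into train and test sets."""
    indices = list(range(n_samples))
    rng = random.Random(seed)   # local generator, so global random state is untouched
    rng.shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = train_test_split_indices(336)
print(len(train_idx), len(test_idx))
```

For sequentially recorded data, a chronological split is a common alternative to random shuffling because it keeps later observations out of the training set.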
Statistical metrics are needed to assess the reliability and robustness of the machine learning models’ predictions. This research evaluated the soiling loss models using MAE, RMSE, MAPE, and R². These measures show the models’ capacity to reduce prediction errors, capture data variability, and generalize across environmental conditions.
The stacking ensemble model showed high predictive performance, with predicted and observed responses closely aligned in both the training and testing datasets (Fig. 26). The residual plots in Fig. 27 show residuals scattered randomly around zero, confirming the model’s reliability and the lack of systematic bias. The performance metrics in Table 8 demonstrate the model’s robustness, with R² values of 0.9995 (training) and 0.9997 (testing) and low error values (RMSE: 0.0566 and 0.0456; MAE: 0.0404 and 0.0333). The stacking model generalizes effectively across datasets and outperforms the individual models, making it a very accurate framework for soiling loss prediction.
R² plot of (a) Training (b) Testing data set.
Residual plot of (a) Training (b) Testing data set.
To examine whether cleaning frequency (CF) dominates the learning process, a feature ablation study was conducted by evaluating models trained using (i) the full feature set, (ii) CF alone, and (iii) environmental variables alone.
The full model consistently achieved the lowest prediction error. Although CF-only models exhibit strong correlation with soiling loss due to their causal relationship, they produce significantly higher absolute errors than the full model. Conversely, models trained exclusively on environmental variables perform poorly. Table 9 shows that the environmental variables contribute important additional information and that cleaning frequency does not mask what the model learns from them.
The scatter plots in Fig. 28 show that the ANN model predicted soiling loss accurately, although its departures from the ideal prediction line are somewhat larger than those of the stacking model, especially at higher response levels. The residual plots in Fig. 29 reflect this tendency, with residuals spreading more broadly and displaying patterns at extreme values, indicating a small bias in specific ranges. The performance metrics in Table 10 support this result, with R² values of 0.9923 (training) and 0.9806 (testing) and higher error levels (RMSE: 0.2138 and 0.2822; MAE: 0.1277 and 0.1358). The ANN model is highly predictive, but its error distribution and somewhat lower accuracy than the stacking model suggest that it cannot completely capture nonlinear data variability.
R² plot of (a) Training (b) Testing data set.
Residual plot of (a) Training (b) Testing data set.
Compared to the ANN and stacking models, the DT model predicted well but with lower accuracy. The scatter plot (Fig. 30) shows that although predicted responses track the actual values, deviations from the ideal prediction line are greater at higher response levels. The residual plots in Fig. 31 show larger dispersion and structured patterns, suggesting overfitting in specific regions. This is supported by the performance metrics (Table 11), with R² values of 0.9767 (training) and 0.9777 (testing) and higher error levels (RMSE: 0.3766 and 0.4020; MAE: 0.1989 and 0.2090). The DT model captures the relationship between the inputs and soiling loss, but its restricted generalization and higher residual spread make it less suitable than ensemble approaches.
R² plot of (a) Training (b) Testing data set.
Residual plot of (a) Training (b) Testing data set.
The scatter plots in Fig. 32 show that the values predicted by the Support Vector Machine (SVM) model broadly matched the observed responses. Significant departures from the ideal prediction line, especially at higher response levels, indicate limits in capturing extreme cases. The residual plots in Fig. 33 show heteroscedasticity, with errors spreading further at higher response levels. The performance metrics (Table 12) indicate lower R² values (0.9527) and higher error values (RMSE: 0.5365 and 0.4418; MAE: 0.3124 and 0.2917) than ANN and stacking. SVM generalizes from training to testing better than DT, but its lower accuracy and higher residual spread limit it compared to the stacking ensemble.
R² plot of (a) Training (b) Testing data set.
Residual plot of (a) Training (b) Testing data set.
As seen in the scatter plots (Fig. 34), the k-Nearest Neighbor (kNN) model had mixed predictive performance, with predicted values following the actual responses but deviating at higher response levels. The residual plots in Fig. 35 show higher error dispersion and systematic bias in extreme ranges, indicating reduced robustness. The performance metrics (Table 13) show good generalization on the test set but a poor fit during training, with R² values of 0.7662 (training) and 0.9701 (testing). The errors were also larger, with RMSE values of 1.1926 (training) and 0.4657 (testing) and MAE values of 0.6948 and 0.3904. Despite good testing accuracy, the kNN model’s large training error and residual spread make it less dependable than stacking.
R² plot of (a) Training (b) Testing data set.
Residual plot of (a) Training (b) Testing data set.
The higher testing R² compared to training R² for the KNN model is attributed to the local interpolation nature of KNN and the distribution of samples in feature space, rather than data leakage. Since testing samples fall within well-represented regions of the training feature space, stable prediction performance is achieved.
To assess potential overfitting and validate the generalization capability of the proposed stacking ensemble model, learning curve analysis is performed in accordance with statistical learning theory. The training and validation errors were evaluated as a function of increasing training data size. The learning curves demonstrate that although the training error decreases with increasing sample size, the validation error converges to a stable and closely aligned value without divergence. The narrow gap between training and validation errors confirms that the stacking model does not suffer from overfitting and generalizes well to unseen data.
The ensemble structure successfully balances the bias-variance trade-offs, so the validation error does not increase as the model capacity increases. These results provide theoretical and empirical evidence that the high R² values achieved by the stacking model are due to robust learning rather than memorization of the experimental dataset.
Learning curve for training and testing.
The learning curves in Fig. 36 show the convergence of training and testing RMSE for the stacking ensemble, indicating strong generalization and the absence of overfitting.
To assess the robustness of the stacking model against potential measurement noise, controlled Gaussian perturbations (± 3%) were introduced to environmental input variables. The model was retrained using the same train–test partition to ensure consistency. Figure 37 illustrates the residual distribution comparison between the original inputs and perturbed inputs.
Residual distribution: original vs. noisy inputs (± 3%).
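The perturbation step can be sketched as follows. This is a minimal illustration that interprets "± 3%" as zero-mean multiplicative Gaussian noise with a relative standard deviation of 3%; that interpretation, the function name `perturb`, and the sample irradiance values are all assumptions rather than details given in the text.

```python
import random

def perturb(values, relative_sigma=0.03, seed=42):
    """Apply zero-mean multiplicative Gaussian noise to a list of measurements."""
    rng = random.Random(seed)  # seeded for reproducibility
    return [v * (1.0 + rng.gauss(0.0, relative_sigma)) for v in values]

# Hypothetical GHI readings in W/m^2 (not measured values)
ghi = [415.0, 520.0, 610.0, 480.0]
noisy_ghi = perturb(ghi)

for clean, noisy in zip(ghi, noisy_ghi):
    print(f"{clean:7.1f} -> {noisy:7.1f} ({(noisy / clean - 1.0) * 100:+.2f}%)")
```

Applying the same procedure to each environmental input while leaving the cleaning frequency unchanged keeps the perturbation limited to quantities that are subject to sensor noise.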
Model accuracy diminishes with longer cleaning intervals, with weekly cleaning reaching R² = 0.995 and low RMSE (between 0.05 and 0.1), whereas monthly cleaning drops R² to 0.964 and raises RMSE above 0.4, as shown in Fig. 38. In all intervals, the stacking model had the lowest error (e.g., MAPE < 10%, RMSE ≈ 0.05) and the highest R² (> 0.99), demonstrating its robustness relative to the individual models.
Model performance comparison (MAPE vs. RMSE, bubble ∝ R²) across cleaning frequency intervals (a) clean weekly (b) clean biweekly (c) clean monthly.
Figure 39 demonstrates that stacking had the lowest MSE at all cleaning intervals: 0.003 (weekly), 0.001 (biweekly), and 0.001 (monthly). Empirical and KNN models had the largest errors, 0.022–0.294 and 0.144–0.158, respectively, especially during longer cleaning intervals. These findings confirm that stacking provides the most accurate and consistent forecasts regardless of cleaning frequency.
Model MSE comparison across cleaning-frequency intervals (a) clean weekly (b) clean biweekly (c) clean monthly.
Figure 40 shows the variation of MAE with cleaning interval. The stacking model had the lowest MAE values (0.030 (weekly), 0.022 (biweekly), and 0.018 (monthly)), whereas the empirical and KNN models had the largest errors (0.468 and 0.337, respectively). This demonstrates stacking’s ability to reduce prediction error under prolonged soiling.
Model MAE heatmap across cleaning intervals: P2: clean weekly, P3: clean biweekly, P4: clean monthly.
These findings show that stacking is the most accurate method for soiling loss estimation across cleaning frequencies and is resilient to increasing soiling buildup.
Table 14 indicates that stacking consistently outperformed other models with RMSE ranging from 0.03 to 0.045, MAE ≤ 0.03, and R² = 0.999 throughout all months. Empirical and KNN models had the largest errors (e.g., MAPE up to 35.38% and MAE > 0.4), especially in December, proving the stacking ensemble’s better resilience and dependability.
Shewhart control charts of RMSE for the forecasting models over October–December 2024 are shown in Fig. 41. A Shewhart control chart tracks a performance measure over time against statistically derived control limits. The upper and lower control limits (UCL and LCL) set the permitted range of variation, and values outside these limits indicate instability or unexpected swings. The central line (CL) shows the process mean. The stacking model showed the most consistent predictive accuracy throughout all months, with an average RMSE of 0.04 and tight control limits (UCL = 0.06, LCL = 0.01). ANN showed RMSE variation between 0.13 and 0.30 (mean 0.20), DT between 0.27 and 0.33 (mean 0.29), and KNN peaked at 0.49 (mean 0.38), suggesting greater variability. The stacking ensemble therefore combines low errors (Figs. 42 and 43) with process stability.
Shewhart control charts of RMSE for different predictive models (Oct–Dec 2024).
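Control limits of this kind can be computed with a short sketch. The version below assumes an individuals chart in which the process standard deviation is estimated from the average moving range divided by 1.128 (the d2 constant for subgroups of size 2) and the limits are placed at ±3 standard deviations; the text does not state which chart variant was used. The RMSE series is illustrative, not taken from Table 14.

```python
def shewhart_individuals_limits(values):
    """Return (LCL, CL, UCL) for an individuals chart based on the moving range."""
    n = len(values)
    center = sum(values) / n
    moving_ranges = [abs(values[i] - values[i - 1]) for i in range(1, n)]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    sigma = mr_bar / 1.128                   # d2 constant for moving ranges of span 2
    ucl = center + 3.0 * sigma
    lcl = max(0.0, center - 3.0 * sigma)     # RMSE cannot be negative
    return lcl, center, ucl

# Illustrative monthly RMSE values for one model (hypothetical)
rmse_series = [0.030, 0.040, 0.045]
lcl, cl, ucl = shewhart_individuals_limits(rmse_series)
in_control = all(lcl <= v <= ucl for v in rmse_series)

print(f"LCL = {lcl:.3f}, CL = {cl:.3f}, UCL = {ucl:.3f}, in control: {in_control}")
```

With only a few monthly points the estimated limits are themselves uncertain, so such a chart is better read as a consistency check than as a formal test.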
The Shewhart control chart of RMSE (Fig. 41) shows that prediction errors remain well within the statistical control limits across all months, confirming stable model performance and the absence of instability due to non-stationarity. The empirical regression model also maintained consistent performance, with R² values between 0.97 and 0.98 across different months, further supporting the temporal robustness of the predictive framework. The environmental variables recorded during the experimental period exhibited natural variability while remaining within the same physical operating regime, enabling the model to learn stable relationships between environmental drivers and soiling loss. Since the models rely on physically meaningful predictors such as cleaning frequency, humidity, and irradiance, the learned relationships remain consistent over time. These results confirm that the predictive models demonstrate stable performance across different months without evidence of significant parameter drift or non-stationarity.
Month-wise comparison of MAPE (%) across predictive models.
Month-wise comparison of MAE across predictive models.
The Shewhart control chart analysis shows that the stacking model predicts soiling loss with the lowest prediction errors and retains stability within restricted limits, making it the most dependable strategy⁶³.
To ensure that the superior performance of the stacking model was not merely a consequence of increased model flexibility, several safeguards were employed:
Independent testing evaluation (80:20 split) demonstrated that training and testing R² values were nearly identical (0.9995 vs. 0.9997), indicating strong generalization without overfitting.
Residual analysis showed random dispersion without systematic patterns.
Month-wise and cleaning-frequency-wise evaluations confirmed consistent performance across operational conditions.
Control chart stability analysis demonstrated low variance and stable error distribution across months.
These results collectively indicate that the improved performance of the stacking model arises from its ability to capture nonlinear environmental interactions rather than merely from increased complexity.
To evaluate the effectiveness of the proposed machine learning framework, its performance was compared with the semi-empirical regression model developed in Sect. 3.2 under identical validation conditions. The empirical model represents a physics-based baseline using environmental predictors such as irradiance, temperature, humidity, wind speed, and cleaning frequency. As shown in Table 12, the stacking ensemble achieved significantly lower prediction error (RMSE = 0.03–0.045, MAE ≤ 0.03) compared to the empirical model (RMSE = 0.348–0.386, MAE = 0.266–0.294). This represents an approximately 85–90% reduction in prediction error. These results demonstrate the superior predictive capability of the proposed machine learning framework in capturing nonlinear soiling dynamics compared to conventional semi-empirical models.
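The size of the error reduction can be bracketed from the ranges quoted above. The sketch below pairs the extremes of the reported stacking and empirical error ranges, which is an assumption because the month-by-month pairing is not given in this paragraph; `percent_reduction` is an illustrative helper name.

```python
def percent_reduction(baseline, improved):
    """Relative reduction of an error metric with respect to a baseline, in percent."""
    return (1.0 - improved / baseline) * 100.0

# RMSE: empirical 0.348-0.386, stacking 0.03-0.045 (ranges quoted in the text)
rmse_reduction_min = percent_reduction(0.348, 0.045)  # smallest possible gain
rmse_reduction_max = percent_reduction(0.386, 0.030)  # largest possible gain

# MAE: empirical 0.266-0.294, stacking taken at its upper bound of 0.03
mae_reduction_min = percent_reduction(0.266, 0.03)
mae_reduction_max = percent_reduction(0.294, 0.03)

print(f"RMSE reduction: {rmse_reduction_min:.1f}% to {rmse_reduction_max:.1f}%")
print(f"MAE reduction:  {mae_reduction_min:.1f}% to {mae_reduction_max:.1f}%")
```

Under this pairing the bounds come out at roughly 87–92% for RMSE and 89–90% for MAE, in line with the approximate figure quoted above.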
After performance assessment, statistical analysis verified machine learning model dependability and robustness. Although error measurements like RMSE, MAE, MAPE, and R² assess accuracy, they do not adequately resolve bias between estimated and actual soiling loss levels. A Bland–Altman (BA) plot was utilized to visually evaluate agreement, highlight recurring deviations, and indicate prevalent prediction error boundaries. A non-parametric Wilcoxon signed-rank test was employed to see whether predicted and actual values differed significantly. These graphical and inferential methods analyse model performance comprehensively.
To evaluate the statistical independence of the experimental observations, the autocorrelation function (ACF) of the soiling loss time series was analyzed, as shown in Fig. 44. Since environmental and PV performance data are collected sequentially, temporal autocorrelation may reduce the effective sample size and affect model validity⁶⁴.
The ACF results show that autocorrelation values decrease rapidly and remain within the 95% confidence bounds for most lags. Only short-term correlations are observed at very small lags, while longer lags exhibit negligible autocorrelation. This indicates weak temporal dependence and confirms that the observations are sufficiently independent for predictive modelling.
Furthermore, the natural variability in environmental parameters, including irradiance, temperature, humidity, and wind speed, along with different cleaning frequencies, ensured diverse operating conditions across samples. This variability further supports the effective independence of observations and validates the robustness of the machine learning models.
Autocorrelation function of daily-averaged soiling loss residuals.
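A sample autocorrelation function and its approximate 95% bounds can be sketched as follows. The sketch assumes the standard biased estimator and the large-sample white-noise bound ±1.96/√n; the text does not state which estimator was used, and the short input series is hypothetical.

```python
import math

def autocorrelation(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag (biased estimator)."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t + k] for t in range(n - k)) / denom
        for k in range(max_lag + 1)
    ]

def white_noise_bound(n, z=1.96):
    """Approximate 95% confidence bound for the ACF of an uncorrelated series."""
    return z / math.sqrt(n)

# Hypothetical short series, for illustration only
series = [1.0, 2.0, 3.0, 4.0, 5.0]
acf = autocorrelation(series, max_lag=2)
bound = white_noise_bound(len(series))

print([round(r, 3) for r in acf], round(bound, 3))
```

Lags whose autocorrelation falls outside the bound indicate temporal dependence that reduces the effective number of independent samples.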
The Bland–Altman (BA) analysis assessed the agreement between predicted and actual soiling loss values. Unlike traditional error measures, the BA plot shows the bias and the limits of agreement, which indicate systematic and random deviations of the model predictions, respectively. A lower bias and narrower limits of agreement imply that the model predictions match the data more closely. This approach helps validate whether machine learning models can reproduce experimental observations across operating conditions.
The arrangement of dots around zero illustrates the degree of concordance between predictions and actual values, with tighter clustering near the red bias line signifying enhanced consistency. The dispersion within the limits of agreement (LoA) indicates the model’s variability, while outliers situated far beyond the LoA denote instances of inaccurate predictions, as depicted in Figs. 45 and 46, respectively.
Bland–Altman plot for (a) Empirical model (b) Stacking ML model (c) ANN (d) DT (e) SVM (f) KNN.
Model biases with limits of agreement (vertical lines).
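The quantities shown in a Bland–Altman plot can be computed with the short sketch below. It assumes the conventional definitions: bias as the mean of the paired differences (here predicted minus actual) and limits of agreement at bias ± 1.96 times the sample standard deviation of the differences. The function name and the sample values are hypothetical.

```python
import statistics

def bland_altman(actual, predicted, z=1.96):
    """Return (bias, lower LoA, upper LoA) for paired actual and predicted values."""
    diffs = [p - a for a, p in zip(actual, predicted)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)   # sample standard deviation of the differences
    return bias, bias - z * sd, bias + z * sd

# Hypothetical soiling-loss values (%), for illustration only
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 4.0]

bias, loa_lower, loa_upper = bland_altman(actual, predicted)
print(f"bias = {bias:.3f}, LoA = [{loa_lower:.3f}, {loa_upper:.3f}]")
```

In the plot itself, each difference is drawn against the mean of the corresponding actual and predicted values, with horizontal lines at the bias and at the two limits.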
Figure 47 shows the Wilcoxon signed-rank test comparing actual and predicted soiling loss values to assess the models’ predictive ability. A statistically non-significant result (p > 0.05) suggests that model predictions match the experimental results, whereas a significant result (p < 0.05) indicates consistent differences between predicted and actual values. Despite having the lowest median difference (0.0016), the stacking model has a statistically significant p-value (p = 0.0083), meaning that its very small deviation is consistent enough to be detected. The empirical (p = 0.593), decision tree (p = 0.276), and KNN (p = 0.428) models had non-significant p-values, indicating no statistically significant difference between their predictions and actual values.
Wilcoxon signed-rank test plot for (a) Empirical model (b) Stacking ML model (c) ANN (d) DT (e) SVM (f) KNN.
Even with strong numerical performance, ANN (p = 2.71e-42) and SVM (p = 7.4e-24) showed highly significant differences, suggesting systematic prediction errors. Stacking is the most reliable technique because its bias is minimal, whereas the empirical and tree-based approaches are statistically equivalent to the measurements in median but less robust. Although the Wilcoxon signed-rank test yielded a statistically significant p-value (p < 0.01) for the stacking model, indicating that the median difference between predicted and actual values is not exactly zero, the magnitude of this deviation was extremely small (≈ 0.001–0.002). Given the relatively large sample size, even minor deviations can become statistically significant. However, absolute error metrics (RMSE and MAE) remained very low, suggesting that the detected bias is negligible in practical terms. Therefore, the stacking model demonstrates high predictive accuracy with minimal practical bias rather than perfect agreement.
Bland–Altman analysis and the Wilcoxon signed-rank test can give different p-values because they employ different statistical methods. The Bland–Altman approach estimates the p-value using a paired t-test to see whether the mean difference (bias) between actual and predicted values is significantly different from zero. The Wilcoxon signed-rank test, on the other hand, tests whether the median of the paired differences deviates significantly from zero without assuming normality. BA focuses on systematic bias in the mean, whereas Wilcoxon examines median differences, so the p-values may differ. Together, the two methods give a more complete statistical assessment of model performance.
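The sketch below illustrates why a very small but consistent deviation can produce a significant Wilcoxon result. It is a simplified pure-Python version that uses the large-sample normal approximation without tie or continuity corrections, which is an assumption; in practice a library routine such as `scipy.stats.wilcoxon` would normally be used. The data are hypothetical.

```python
import math

def wilcoxon_signed_rank(actual, predicted):
    """Return (W+, z, two-sided p) using the normal approximation."""
    diffs = [p - a for a, p in zip(actual, predicted) if p != a]  # drop zero differences
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # extend j over a run of tied absolute differences
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0   # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean_w = n * (n + 1) / 4.0
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail probability
    return w_plus, z, p

# Hypothetical case: tiny but always-positive prediction offsets
actual = [1.0 + 0.1 * i for i in range(30)]
predicted = [a + 0.001 * (i + 1) for i, a in enumerate(actual)]

w_plus, z, p_value = wilcoxon_signed_rank(actual, predicted)
print(f"W+ = {w_plus}, z = {z:.2f}, p = {p_value:.2e}")
```

Here every offset is well below 0.05 in absolute terms, yet the test is significant because all offsets share the same sign, which is the distinction between statistical and practical significance drawn above.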
Although the dataset originates from a single geographical site and season, meaningful domain shifts exist within the data due to temporal variation in environmental conditions and operational variation in cleaning frequency. Figure 48 illustrates covariate distribution shifts of global horizontal irradiance across months, confirming changes in the input feature space.
Figures 49 and 50 further demonstrate that the stacking model maintains stable RMSE across temporally distinct months and across different cleaning frequencies. The consistency of predictive performance under these distributional and operational shifts indicates robust within-domain generalization rather than simple interpolation of identical conditions. While the present study does not claim cross-climate transferability, the proposed framework demonstrates strong robustness within the studied domain.
Covariate distribution shift of GHI across months.
Temporal domain shift evaluation of stacking model.
Operational domain shift evaluation of stacking model.
To statistically validate the observed performance dominance of the stacking ensemble, paired Diebold–Mariano tests were conducted on squared prediction error sequences. The results in Table 15 indicate that the stacking model significantly outperforms ANN, DT, SVM, KNN, and empirical models, with DM statistics ranging from −3.97 to −12.08 and corresponding p-values well below 0.05.
The negative DM statistics confirm that the stacking model consistently yields lower prediction errors than competing models. These findings demonstrate that the superior performance of the stacking ensemble is statistically significant and not attributable to random variation.
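A simplified form of the Diebold–Mariano statistic on squared errors can be sketched as follows. This version assumes a one-step horizon with no autocovariance correction and a normal reference distribution, which may differ from the exact variant used for Table 15. The error sequences are hypothetical.

```python
import math
import statistics

def diebold_mariano(errors_a, errors_b):
    """Return (DM statistic, two-sided p) for the squared-error loss differential.

    A negative statistic means model A has the lower squared error on average.
    """
    d = [ea ** 2 - eb ** 2 for ea, eb in zip(errors_a, errors_b)]
    n = len(d)
    mean_d = statistics.mean(d)
    var_d = statistics.variance(d)           # sample variance of the loss differential
    dm = mean_d / math.sqrt(var_d / n)
    p = math.erfc(abs(dm) / math.sqrt(2.0))  # two-sided normal tail probability
    return dm, p

# Hypothetical prediction errors: model A (small errors) vs. model B (larger errors)
errors_model_a = [0.01, -0.02, 0.03, -0.01, 0.02]
errors_model_b = [0.30, -0.20, 0.40, -0.10, 0.25]

dm_stat, dm_p = diebold_mariano(errors_model_a, errors_model_b)
print(f"DM = {dm_stat:.2f}, p = {dm_p:.4f}")
```

For autocorrelated error sequences, the variance term is usually replaced by a long-run variance estimate that includes autocovariances up to the forecast horizon.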
This study investigated natural soiling on solar panels subjected to different cleaning protocols and developed empirical and machine learning models for predicting soiling loss. Experimental analysis and data-driven methods were combined to test, evaluate, and predict the performance of PV systems under real-world soiling. Ensemble learning outperforms empirical approaches in accuracy and robustness. The experimental study on four PV panels with different cleaning frequencies (daily, weekly, biweekly, monthly) confirmed that natural soiling significantly impacts PV performance, with higher losses under longer intervals.
This study contributes to the field by establishing a validated experimental–empirical–machine learning framework for real-world soiling prediction under controlled cleaning intervals.
The proposed stacking model significantly outperforms the semi-empirical baseline, demonstrating improved predictive accuracy and robustness under identical validation conditions.
Unlike purely simulation-based studies, the proposed approach is grounded in field measurements and incorporates temporal causality-aware validation, making it both scientifically rigorous and practically deployable.
The findings provide a reproducible methodology for future PV degradation studies and open avenues for intelligent, data-driven operation and maintenance optimization in solar energy systems.
Two empirical models were created: one for $I_{SC}$ prediction (dominated by GHI) and another for soiling loss (SL), mainly affected by RH and cleaning frequency. The SL model has R² = 0.978 but a high MAPE of 28%, indicating nonlinear effects that the empirical fit does not capture.
Machine learning delivered considerable improvements, with the stacking ensemble achieving the highest accuracy (R² = 0.9997, RMSE = 0.0456, MAE = 0.0333) and surpassing the individual models (ANN, DT, SVM, KNN).
Model accuracy falls as cleaning frequency decreases, but stacking remains strong (MSE ≤ 0.003, MAE < 0.03) even with monthly cleaning.
Statistical validation using Bland–Altman and Wilcoxon signed-rank tests confirmed stacking’s superiority, with minimal bias and narrowest limits of agreement, although minor statistically detectable differences were observed in some cases.
Overall, the integrated experimental–empirical–ML framework demonstrates that ensemble-based data-driven models can reliably predict soiling loss, enabling optimized maintenance scheduling and predictive O&M strategies for PV systems.
While prediction error differences may appear numerically small, their operational significance becomes substantial when translated into cumulative energy losses over extended periods. As shown in Table 3, soiling losses exceeding 5% were observed under extended cleaning intervals. Accurate prediction of soiling progression enables optimized maintenance scheduling, improving energy yield and reducing operational costs. The proposed model is developed based on short-term dry-season data, during which module surface properties are assumed constant. Long-term surface degradation, coating wear, and micro-roughness evolution may influence dust adhesion behaviour and soiling accumulation rates. Such effects represent gradual structural drift and would require multi-season or multi-year datasets for comprehensive modelling. Therefore, the current framework is primarily applicable to short- and medium-term predictive maintenance planning.
The present model was developed using data collected during dry environmental conditions to ensure controlled soiling accumulation. Extreme events such as rainfall-induced natural cleaning or dust storms introduce regime shifts that require representative training data for accurate prediction. Future work will incorporate multi-season datasets including rainfall and extreme environmental conditions to enhance model robustness and generalization capability.
Data are available upon request.
AdaBoost
Autoencoder
Artificial neural network
Ambient temperature
Backpropagation neural network
Cleaning frequency
Convolutional neural network
Cell temperature
Diffuse horizontal irradiance
Atmospheric pressure
Particulate matter (PM10 / PM2.5)
Relative humidity
Direct normal irradiance
Wind speed
Global horizontal irradiance
K-nearest neighbor
Linear regression
Long short-term memory
Mean absolute error
Mean absolute percentage error
Machine learning
Multilayer perceptron
Mean square error
Root mean square error
Recurrent neural network
Seasonal auto regressive integrated moving average with exogenous variables
Sunshine hour
Soiling loss
Soiling ratio
Support vector machine
Support vector regression
Random forest
RGB images of solar panels
Decision tree
Extreme learning machine
Gated recurrent unit
Extreme gradient boosting
Coefficient of determination
Current at maximum power point
Short-circuit current
Short-circuit current of clean reference panel
Short-circuit current of soiled panel
Power output of clean panel
Maximum power output
Power output of soiled panel
Temperature of dusty panel
PV module temperature
Voltage at maximum power point
Open-circuit voltage
Sampaio, P. G. V. & González, M. O. A. Photovoltaic solar energy: Conceptual framework. Renew. Sustain. Energy Rev. 74, 590–601 (2017).
Siecker, J., Kusakana, K. & Numbi, E. B. A review of solar photovoltaic systems cooling technologies. Renew. Sustain. Energy Rev. 79, 192–203 (2017).
Fan, S. et al. A deep residual neural network identification method for uneven dust accumulation on photovoltaic (PV) panels. Energy 239, 122302 (2022).
Article Google Scholar
Sayyah, A., Horenstein, M. N. & Mazumder, M. K. Energy yield loss caused by dust deposition on photovoltaic panels. Sol. Energy. 107, 576–604 (2014).
Article ADS Google Scholar
Ilse, K. et al. Techno-economic assessment of soiling losses and mitigation strategies for solar power generation. Joule 3, 2303–2321 (2019).
Article Google Scholar
Fan, S., Wang, Y., Cao, S., Sun, T. & Liu, P. A novel method for analyzing the effect of dust accumulation on energy efficiency loss in photovoltaic (PV) system. Energy 234, 121112 (2021).
Article Google Scholar
Hussain, A., Batra, A. & Pachauri, R. An experimental study on effect of dust on power loss in solar photovoltaic module. Renew. Wind Water Sol. 4, 9 (2017).
Article Google Scholar
Chen, J. et al. Study on impacts of dust accumulation and rainfall on PV power reduction in East China. Energy 194, 116915 (2020).
Article CAS Google Scholar
Fitriyanah, D. N., Saputra, R. D. P., Abadi, I. & Musyafa, A. Optimal cleaning robot on solar panels with time-sequence input based on internet of things. Int. J. Electr. Comput. Eng. 15 (1), 2088–8708 (2025).
Sharma, S. et al. Titanium (IV) oxide nanoparticles synthesized using Nyctanthes arbor-tristis extract for enhanced photovoltaic performance in dye-sensitized solar cell. Res. Chem. Intermed. https://doi.org/10.1007/s11164-025-05866-0 (2025).
Article Google Scholar
Appels, R. et al. Effect of soiling on photovoltaic modules. Sol Energy 96, 283–291 (2013).
Article ADS Google Scholar
BBC, Saharan Dust Cloud Sweeps over UK Covering Cars in an Orange Powder, BBC. https://www.bbc.co.uk/newsround/66734529 (2023).
Garofalide, S. et al. Saharan dust storm aerosol characterization of the event (9 to 13 may 2020) over European AERONET sites. Atmosphere 13 (2022).
Conceiç˜ao, R. et al. Collares- Pereira, Saharan dust transport to Europe and its impact on photovoltaic performance: a case study of soiling in Portugal. Sol. Energy. 160, 94–102 (2018).
Article ADS Google Scholar
Korevaar, M., Mes, J., Nepal, P. G. & Snijders, M.X. van. Novel soiling detection system for solar panels, in: 33rd Eur. Photovolt. Sol. Energy Conf. Exhib. https://doi.org/10.4229/EUPVSEC20172017-6BV.2.11 (2017).
Article Google Scholar
Muller, M. et al. An in-depth field validation of DUSST: a novel low-maintenance soiling measurement device. Prog. Photovolt. Res. Appl. 29, 953–967 https://doi.org/10.1002/pip.3415 (2021).
Aïssa, B., Scabbia, G., Figgis, B. W., Garcia Lopez, J. & Bermudez Benito, V. PV-soiling f ield-assessment of Mars optical sensor operating in the harsh desert environment of the state of Qatar. Sol Energy. 239, 139–146 (2022).
Article ADS Google Scholar
Campos, L. et al. Autonomous measurement system for photovoltaic and radiometer soiling losses. 1336–1349.
Yang, M., Ji, J., Member, S. & Guo, B. Soiling quantification using an image-based method: effects of imaging conditions. IEEE J. Photovolt. 1–8. (2020).
Coello, M. & Boyle, L. Simple model for predicting time series soiling of photovoltaic panels. IEEE J. Photovolt. 9, 1382–1387 (2019).
Article Google Scholar
You, S., Lim, Y. J., Dai, Y. & Wang, C. H. On the temporal modelling of solar photovoltaic soiling: energy and economic impacts in seven cities. Appl. Energy 228, 1136–1146 (2018).
Article ADS Google Scholar
Eder, G. et al. COLOURED BIPV Market, vol. 15, research and development IEA PVPS Task, p. 57. Report IEA-PVPS T15-07 (2019).
Polo, J. et al. Modeling soiling losses for rooftop PV systems in suburban areas with nearby forest in Madrid. Renew. Energy. 178, 420–428 (2021).
Article Google Scholar
Chen, J. et al. Study on impacts of dust accumulation and rainfall on PV power reduction in East China. Energy 194, 116915 (2020).
Article CAS Google Scholar
Hammad, B., Al-Abed, M., Al-Ghandoor, A., Al-Sardeah, A. & Al-Bashir, A. Modeling and analysis of dust and temperature effectson photovoltaic systems’ performance and optimal cleaning frequency: Jordan case study. Renew. Sustain. Energy Rev. 82, 2218–2234 (2018).
Article Google Scholar
Adıgüzel, E., Özer, E., Akgündo˘ gdu, A. & Yılmaz, A. E. Prediction of dust particle size effect on efficiency of photovoltaic modules with ANFIS: An experimental study in Aegean region, Turkey. Sol. Energy. 177, 690–702 (2019).
Article ADS Google Scholar
Javed, W., Guo, B. & Figgis, B. Modeling of photovoltaic soiling loss as a function of environmental variables. Sol. Energy. 157, 397–407 (2017).
Article ADS Google Scholar
Simal Pérez, N., Alonso-Montesinos, J. & Batlles, F. J. Estimation of soiling losses from an experimental photovoltaic plant using artificial intelligence techniques. Appl. Sci. 11, 1516 (2021).
Article Google Scholar
Zitouni, H. et al. Experimental investigation and modeling of photovoltaic soiling loss as a function of environmental variables: A case study of semi-arid climate. Solar Energy Mater. Sol Cells. 221, 110874 (2021).
Article CAS Google Scholar
Jamil, I. et al. Predictive evaluation of solar energy variables for a large-scale solar power plant based on triple deep learning forecast models. Alex Eng. J. 76, 51–73 (2023).
Article Google Scholar
Pavan, A. M., Mellit, A., De Pieri, D. & Kalogirou, S. A. A comparison between BNN and regression polynomial methods for the evaluation of the effect of soiling in large scale photovoltaic plants. Appl. Energy. 108, 392–401 (2013).
Article ADS Google Scholar
Velásquez, R. M. A. & Ezcurra, T. T. P. Dust analysis in photo-voltaic solar plants with satellite data. Ain Shams Eng. J. 15, 102314 (2023).
Article Google Scholar
Elshazly, E. et al. Effect of dust and high temperature on photovoltaics performance in the new capital area. WSEAS Trans. Environ. Dev. 17 (1), 360–370 (2021).
Article Google Scholar
Saidan, M. et al. Experimental study on the effect of dust deposition on solar photovoltaic panels in desert environment. Renew. Energy. 92, 499–505 (2016).
Article Google Scholar
Dida, M. et al. Output power loss of crystalline silicon photovoltaic modules due to dust accumulation in Saharan environment. Renew. Sustainable Energy Rev. 124, 109787 (2020).
Article CAS Google Scholar
Javed, W., Guo, B., Figgis, B., Pomares, L. M. & Aïssa, B. Multi-year field assessment of seasonal variability of photovoltaic soiling and environmental factors in a desert environment. Sol. Energy. 211, 1392–1402 (2024).
Article ADS Google Scholar
Skomedal, Å., Haug, H. & Marstein, E. S. Endogenous soiling rate determination and detection of cleaning events in utility-scale PV plants. IEEE J. Photovolt. 9 (3), 858–863 (2023).
Article Google Scholar
Tanesab, J., Parlevliet, D., Whale, J. & Urmee, T. Energy and economic losses caused by dust on residential photovoltaic (PV) systems deployed in different climate areas. Renew. Energy. 120, 401–412 (2024).
Article Google Scholar
Ilse, K. K. et al. Comprehensive analysis of soiling and cementation processes on PV modules in Qatar. Solar Energy Mater. Solar Cells. 186, 309–323 (2024).
Article Google Scholar
Javed, W., Guo, B., Wubulikasimu, Y. & Figgis, B. W. Photovoltaic performance degradation due to soiling and characterization of the accumulated dust. In 2016 IEEE International Conference on Power and Renewable Energy (ICPRE), pp. 580–584. (IEEE, 2016).
Alnaser, N. et al. Dust accumulation study on the BAPCO 0.5 MWp PV project at University of Bahrain. Int. J. Power Renew. Energy Syst. 2 (1), 53 (2015).
Google Scholar
Guo, B., Javed, W., Figgis, B. W. & Mirza, T. Effect of dust and weather conditions on photovoltaic performance in Doha, Qatar. In 2015 First Workshop on Smart Grid and Renewable Energy (SGRE) pp. 1–6. (IEEE, 2015).
Fuentealba, E. et al. Photovoltaic performance and LCOE comparison at the coastal zone of the Atacama Desert, Chile. Energy. Conv. Manag. 95, 181–186 (2015).
Article CAS ADS Google Scholar
Shirakawa, M. A. et al. Microbial colonization affects the efficiency of photovoltaic panels in a tropical environment. J. Environ. Manag. 157, 160–167 (2015).
Article PubMed Google Scholar
Awwad, R., Shehadeh, M. & Al-Salaymeh, A. Experimental investigation of dust effect on the performance of photovoltaic systems in Jordan. Proceedings of GCREEDER 2013, pp. 10–13. (2013).
Adinoyi, M. J. & Said, S. A. Effect of dust accumulation on the power outputs of solar photovoltaic modules. Renew. Energy. 60, 633–636 (2013).
Article Google Scholar
Caron, J. R. & Littmann, B. Direct monitoring of energy lost due to soiling on first solar modules in California. IEEE J. Photovolt. 3 (1), 336–340 (2012).
Article Google Scholar
Hassan, A., Rahoma, U. A., Elminir, H. K. & Fathy, A. Effect of airborne dust concentration on the performance of pv modules. J. Astron. Soc. Egypt. 13 (1), 24–38 (2005).
Google Scholar
Rehman, S. & El-Amin, I. Performance evaluation of an off-grid photovoltaic system in Saudi Arabia. Energy 46 (1), 451–458 (2012).
Article Google Scholar
Ju, F. & Fu, X. Research on impact of dust on solar photovoltaic (PV) performance. In 2011 International Conference on Electrical and Control Engineering, pp. 3601–3606. (IEEE, 2011).
García, M., Marroyo, L., Lorenzo, E. & Pérez, M. Soiling and other optical losses in solar-tracking PV plants in Navarra. Prog. Photovoltaics Res. Appl. 19 (2), 211–217 (2011).
Article Google Scholar
Pavan, A. M., Mellit, A. & De Pieri, D. The effect of soiling on energy production for large-scale photovoltaic plants. Sol. Energy. 85 (5), 1128–1136 (2011).
Article ADS Google Scholar
Kaldellis, J. K., Kokala, A. & Kapsali, M. Natural air pollution deposition impact on the efficiency of PV panels in urban environment. Fresenius Environ. Bull. 19 (12), 2864–2872 (2010).
CAS Google Scholar
Kimber, A., Mitchell, L., Nogradi, S. & Wenger, H. The effect of soiling on large grid-connected photovoltaic systems in California and the southwest region of the United States. In 2006 IEEE 4th World Conference on Photovoltaic Energy Conversion, vol. 2, pp. 2391–2395. (IEEE, 2006).
Asl-Soleimani, E., Farhangi, S. & Zabihi, M. The effect of tilt angle, air pollution on performance of photovoltaic systems in Tehran. Renew. Energy. 24 (3–4), 459–468 (2001).
Article CAS Google Scholar
Sharma, S., Raina, G., Yadav, S. & Sinha, S. A comparative evaluation of different PV soiling estimation models using experimental investigations. Energy. Sustain. Dev. 73, 280–291 (2023).
Article Google Scholar
Khan, A., Ali, A., Khan, J., Ullah, F. & Faheem, M. Using permutation-based feature importance for improved machine learning model performance at reduced costs. IEEE Access., 13, 36421–36435.
Ahmed, M., Qasem, N. A., Abido, M., Antar, M. A. & Zubair, S. M. On using artificial neural network models for a thermodynamically-balanced humidification- dehumidification system: design and rating analysis. Energy Convers. Manage: X. 18, 100380 (2023).
Google Scholar
Tahir, M. F. & Saqib, M. A. Optimal scheduling of electrical power in energy-deficient scenarios using artificial neural network and Bootstrap aggregating. Int. J. Electr. Power Energy Syst. 83, 49–57 (2016).
Article Google Scholar
Muhammad, I. & Yan, Z. Supervised machine learning approaches: a survey. ICTACT J. Soft Comput. 5, (2015).
Yang, L., Liu, S., Tsoka, S. & Papageorgiou, L. G. A regression tree approach using mathematical programming. Expert Syst. Appl. 78, 347–357 (2017).
Article Google Scholar
Bischl, B. et al. Hyperparameter optimization: Foundations, algorithms, best practices, and open challenges. WIREs Data Min. Knowl. Discov. 13 (2), e1484 (2023).
Article MathSciNet Google Scholar
Cyril Voyant, G. et al. Alexis Fouilloy,2017. Machine learning methods for solar radiation forecasting: A review. Renew. Energy 105, 569–582 .
Qu, Y., Xu, J., Sun, Y. & Liu, D. A Temporal distributed hybrid deep learning model for day-ahead distributed PV power forecasting. Appl. Energy. 304, 1–14 (2021).
Article Google Scholar
This research was conducted with financial support from UPES, Dehradun, India. The authors thank the Research & Development Department at UPES, Dehradun, Uttarakhand, India, for its support under Grant Number UPES/R&D-SoAE/25062025/27.
Open access funding provided by Manipal University Jaipur.
Electrical Cluster, School of Advanced Engineering, UPES, Dehradun, 248007, India
Ashutosh Shukla & Rupendra Kumar Pachauri
Miyan Research Institute, International University of Business, Agriculture and Technology, Dhaka, 1230, Bangladesh
Rupendra Kumar Pachauri
UCRD & CSE-APEX, Chandigarh University, Mohali, Punjab, India
Ranjan Walia
Department of Electrical Engineering, Manipal University Jaipur, Jaipur, India
Vinay Gupta
Ashutosh Shukla (AS): Conceptualization, Methodology, Writing – original draft, Software, Visualization. Rupendra Kumar Pachauri (RKP): Methodology, Data curation, Writing – review and editing, Supervision. Ranjan Walia (RW): Investigation, Writing – review and editing, Supervision. Vinay Gupta (VG): Investigation, Software, Visualization, Data analysis, Writing – review and editing.
Correspondence to Vinay Gupta.
The authors declare no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Shukla, A., Pachauri, R.K., Walia, R. et al. Machine learning-based prediction of soiling losses in photovoltaic modules under different cleaning frequencies: an experimental investigation. Sci Rep 16, 17416 (2026). https://doi.org/10.1038/s41598-026-45485-2
Received: 17 December 2025
Accepted: 19 March 2026
Published: 14 April 2026
Version of record: 05 June 2026
DOI: https://doi.org/10.1038/s41598-026-45485-2
Scientific Reports (Sci Rep)
ISSN 2045-2322 (online)
© 2026 Springer Nature Limited