Scientific Reports volume 16, Article number: 3278 (2026)
With the ongoing global transition towards clean energy, the photovoltaic industry has rapidly entered a new stage of large-scale development. To overcome the limitations of single-modality image-based photovoltaic module fault detection models, this study proposes Photovoltaic-DETR, a multimodal fault detection model based on RT-DETR. The model efficiently processes infrared hotspot images, infrared images, and visible light images of photovoltaic modules. First, a lightweight backbone network is constructed from the self-designed ORPELAN and ReLA Block modules, incorporating an auxiliary reversible branch to efficiently extract spatial features of photovoltaic modules. Second, a reconstructed feature fusion network is proposed, which integrates an attention-scale sequence fusion mechanism with a re-parameterization method to reduce channel concatenation redundancy. Finally, dynamic upsampling and downsampling are performed by the DySample module during feature fusion, enhancing the model’s perception ability. Experimental results on the UAV-captured photovoltaic module hotspot fault detection dataset, the public infrared photovoltaic module datasets (GB_HSP_modified, PV_Train_Val_28_12), and a self-made visible light dataset show that, compared to the RT-DETR model, Photovoltaic-DETR improves mAP@50% by 2.9, 4.9, 2.6, and 5.1 percentage points, respectively. The model’s parameter count is reduced by 28.6%, and its computational load is decreased by 28.5%. These results fully demonstrate the excellent adaptability of Photovoltaic-DETR in multimodal fault detection for photovoltaic modules, providing a solid technical foundation for industrial multimodal photovoltaic module fault detection.
In recent years, with the continuous advancement of the global energy transition, photovoltaic power generation, as a key component of clean and renewable energy, has developed rapidly worldwide. By the end of June 2025, the installed capacity of photovoltaic power generation in China reached approximately 1.1 billion kilowatts, a year-on-year increase of 54.1%, with 606 million kilowatts from centralized photovoltaic systems and 493 million kilowatts from distributed photovoltaic systems1. China has maintained its position as the world’s leader in photovoltaic installed capacity for several consecutive years. The photovoltaic industry is gradually becoming a significant force in ensuring energy security and achieving the “dual carbon” strategic goals2. However, during the long-term operation of photovoltaic power plants, the surfaces of photovoltaic modules are highly susceptible to external environmental factors. For example, dust, leaves, bird droppings, and snow may accumulate on the module surfaces, leading to reduced light intensity, which significantly affects power generation efficiency. Additionally, defects such as cracks, microcracks, delamination, and hotspots may occur during manufacturing, transportation, or operation. These faults not only reduce the performance of photovoltaic modules but may also pose potential safety risks. Therefore, fault detection for photovoltaic modules is of great practical significance and plays a crucial role in ensuring the stable operation of photovoltaic power plants and improving overall power generation efficiency.
Traditional photovoltaic module inspection methods mainly rely on manual inspections or auxiliary diagnostics with infrared imaging3 and electroluminescence (EL) imaging4 devices. However, manual inspection is inefficient and subjective, and fails to meet the real-time monitoring needs of large-scale photovoltaic power plants. In photovoltaic operation and maintenance, traditional manual inspection incurs high costs and exhibits low recognition rates for hidden faults, resulting in an annual power generation loss of 5%−12%5. Although imaging-based detection can provide some module status information, it depends heavily on imaging conditions, equipment costs, and post-hoc image analysis, limiting its application in complex environments. In contrast, computer vision-based photovoltaic module image detection methods offer significant advantages: by analyzing surface images of photovoltaic modules, faults can be recognized intelligently without additional hardware, with high efficiency, low cost, and ease of deployment, making this a key research direction in recent years.
With the development of deep learning technology, convolutional neural networks (CNN) have made significant progress in object detection and have gradually been applied to photovoltaic module fault detection tasks. Among them, YOLO series models have gained widespread attention due to their high detection speed. Huang et al.6 proposed integrating the ACF (Adaptive Complementary Fusion) module into YOLOv5 for photovoltaic module defect detection using electroluminescence (EL) images, enhancing the model’s ability to fuse spatial and channel information. This method increased recall rate, mAP@50%, and mAP@50–95% by 5.2, 0.8, and 2.3 percentage points, respectively, while reducing parameter count, model size, and inference time, with a frame rate improvement of about 5%. Although this method balances accuracy and efficiency, it still depends on EL images, limiting its versatility. Deng et al.7 conducted a lightweight modification of YOLOv4 by replacing CSPDarknet-53 with GhostNet, adding depthwise separable convolutions, using the ECA attention mechanism, and replacing the activation function with S-T-ReLU. The results showed that mAP@50% increased by 1.06%, FLOPs were reduced by 89.11%, parameter count decreased by 82.77%, and FPS improved by 35.34%. This model has advantages in resource-constrained scenarios, but the accuracy gains are modest. Xie et al.8 proposed the ST-YOLO method, optimizing YOLOv8s for photovoltaic module defect detection to enhance real-time detection performance and accuracy; specific performance metrics were not provided, requiring further quantification. Li et al.9 proposed the GBH-YOLOv5 model, which optimizes multi-scale small defect recognition while accelerating inference and reducing parameter count. This method is particularly well suited to small-target detection, but its structure remains relatively complex. Overall, YOLO models still face challenges in detecting small-scale defects and targets in complex backgrounds, as well as issues with feature representation and limited generalization ability. Moreover, CNN architectures have certain limitations in modeling long-range dependencies, which affects their ability to capture global features.
Meanwhile, multimodal fusion technology has rapidly advanced in the field of photovoltaic module fault detection. IEA case studies demonstrate that multimodal automated detection systems can reduce fault response times to 2 h and decrease unplanned downtime by 60%. Kim et al.10 report that closed-loop operation and maintenance systems typically lower inspection costs by 28% and increase annual power generation by 9.5%. Transfer learning and online monitoring technologies also play crucial complementary roles. Chen et al.11 achieved a 15% improvement in mAP@50% and reduced annotation costs by 60–70% after fine-tuning their “general-to-special” transfer framework on small-scale photovoltaic power datasets. Zhao et al.12 demonstrated an adaptive online model with accuracy fluctuations below 2% over two years of deployment, laying the foundation for low-cost deployment and dynamic operational adaptation. Zhang et al.13 proposed a fusion framework based on cross-modal attention mechanisms, integrating infrared thermal imaging, visible light, and electroluminescence data; on large-scale datasets, this approach achieved a 9.2% improvement in mAP@50% compared to single-modal infrared models. Li et al.14 designed a cross-modal feature alignment module achieving 89.7% accuracy for detecting minute defects like microcracks, reducing false positive rates by 34% compared to YOLOv11s. Wang et al.15 reduced the parameter count of a multimodal DETR model to 5.8 M via knowledge distillation, maintaining 92% detection accuracy for edge device adaptation. However, existing studies suffer from issues such as simplistic feature fusion and inadequate adaptability to dynamic operating conditions, providing directions for improvement in this research. In summary, multimodal photovoltaic module fault detection holds broad application prospects and research value.
In recent years, the Transformer architecture has achieved breakthrough progress in natural language processing16 and has gradually been introduced to computer vision tasks17. Its core advantage lies in modeling long-range dependencies through the self-attention mechanism, enabling simultaneous attention to both local and global features. Based on this, Transformer has been widely applied in object detection tasks, leading to a series of improved models based on DETR (Detection Transformer)18. Compared to traditional CNN-based detection models, DETR eliminates the need for manually designed anchor boxes and can complete object detection in an end-to-end manner, demonstrating greater robustness and generalization ability in complex scenarios. However, the original DETR model faces issues with convergence speed, computational overhead, and small target detection. Researchers have proposed various improvements, such as Deformable DETR19, Anchor DETR20, DAB-DETR21, and RT-DETR (Real-Time DETR)22. Among them, RT-DETR significantly improves inference speed while maintaining high detection accuracy, making it well-suited for practical applications in photovoltaic module fault detection.
In photovoltaic module fault detection research, as shown in Table 1 below, although existing methods have improved detection accuracy, they still face issues such as missed and false detections, particularly in multimodal photovoltaic module target detection. Moreover, these models suffer from high computational complexity and slow inference speed, making it difficult to meet real-time detection requirements. Additionally, feature extraction capabilities are limited, and they cannot effectively adapt to the diverse forms and complex textures of photovoltaic module faults in multimodal scenarios. To address these issues, this paper proposes Photovoltaic-DETR, a multimodal fault detection model for photovoltaic modules. The core improvements of this model include the following three aspects:
Backbone Network Design: Combining online convolutional re-parameterization (OREPA) with efficient layer aggregation networks (ELAN), the ORPELAN and ReLA Block modules are designed to construct an efficient and lightweight backbone network. By introducing an auxiliary reversible branch, this design effectively alleviates the loss of semantic information in multimodal photovoltaic module fault detection caused by traditional multi-path feature fusion under deep supervision. Additionally, the multi-branch structure introduced by the ORPELAN module greatly enriches the feature space, thereby enhancing the model’s ability to recognize complex fault shapes in photovoltaic modules.
Feature Fusion: In the feature fusion stage, we propose the ARF-Encoder (Attentional Re-parameterized Fusion Encoder) module, which integrates an attention-scale sequence fusion mechanism with re-parameterization ideas. This effectively mitigates channel concatenation redundancy and insufficient use of cross-scale information, thus improving the multi-scale feature interaction capability in multimodal photovoltaic module fault models while reducing computational costs during inference.
Dynamic Sampling: Building on the multi-scale feature representation, the DySample dynamic upsampling and downsampling module is introduced. By learning sampling offsets and resampling features with a fixed bilinear kernel, this module enhances the model’s ability to perceive fine-grained features, maintaining good detection performance especially for small targets and complex backgrounds.
In summary, Photovoltaic-DETR is a lightweight and feature-enhanced improvement built on the RT-DETR framework. It aims to overcome the limitations of single-modality image-based photovoltaic module fault detection models and to improve fault detection accuracy while reducing computational resource consumption. Experimental results on the UAV-captured photovoltaic hotspot fault detection dataset, the public infrared photovoltaic module datasets GB_HSP_modified and PV_Train_Val_28_12, and a self-created visible light dataset show that, compared to the RT-DETR model, Photovoltaic-DETR increases mAP@50% by 2.9, 4.9, 2.6, and 5.1 percentage points, respectively, while reducing parameter count by 28.6% and computational load by 28.5%. These results fully demonstrate the excellent adaptability of Photovoltaic-DETR in multimodal photovoltaic module fault detection, providing a solid technical foundation for industrial multimodal photovoltaic module target detection.
RT-DETR is the first Transformer framework that supports real-time end-to-end object detection23, eliminating the reliance on traditional non-maximum suppression (NMS) used in object detection, thus avoiding the negative impact of NMS on inference speed and detection accuracy. This model is specifically designed for real-time applications, achieving high detection accuracy while ensuring low latency. To strike a good balance between accuracy and computational cost, this study uses the lightweight RT-DETR-R18 as the baseline model, with its overall structure shown in Fig. 1. RT-DETR consists of three main modules: the backbone network, an efficient hybrid encoder, and a Transformer decoder with auxiliary prediction heads. The backbone network adopts the classic ResNet-18 structure to extract multi-level semantic features from the image. After the output from the S5 layer of the backbone network, the features are first input into the same-scale interaction module (AIFI) to model the relationships between different locations within the same scale. Then, a CNN-based cross-scale feature fusion module (CCFM) is used to fuse and enhance features from different depths. After feature extraction and fusion, RT-DETR utilizes the Transformer decoder to model global features and outputs the final class predictions and bounding box regression results.
RT-DETR structure.
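To make the AIFI step concrete, the following is a minimal PyTorch sketch of same-scale interaction on the S5 feature map: the map is flattened into a token sequence, passed through one standard Transformer encoder layer, and reshaped back. The head count and FFN width are illustrative assumptions, and the 2D positional encoding used by RT-DETR is omitted for brevity.

```python
import torch.nn as nn

class AIFISketch(nn.Module):
    """Same-scale feature interaction on S5: flatten -> self-attention -> restore.
    Head count and FFN width are assumptions; positional encoding is omitted."""
    def __init__(self, c=256, heads=8, ffn=1024):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model=c, nhead=heads, dim_feedforward=ffn, batch_first=True)

    def forward(self, s5):                       # s5: [B, C, H, W]
        b, c, h, w = s5.shape
        tokens = s5.flatten(2).permute(0, 2, 1)  # [B, H*W, C] token sequence
        tokens = self.layer(tokens)              # intra-scale self-attention
        return tokens.permute(0, 2, 1).reshape(b, c, h, w)
```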
To address the limitations of photovoltaic module fault detection models restricted to single-modality images, and to leverage the superior generalization capability of Transformer models while meeting the high accuracy requirements of photovoltaic module fault detection tasks, and simultaneously minimize computational cost, this paper proposes a multimodal photovoltaic module fault detection model—Photovoltaic-DETR. The model consists of three core components: the backbone network, encoder, and decoder, with its overall structure shown in Fig. 2. The multimodal fault detection process of Photovoltaic-DETR includes five core stages: data preprocessing, feature extraction, multi-scale fusion, dynamic sampling, and detection output. In the backbone network, we introduce the self-designed ORPELAN and ReLA Block modules to reconstruct the original network structure. This not only enhances feature extraction capabilities but also significantly reduces the model complexity. In the encoder stage, we propose an improved ARF-Encoder module to optimize the original attention-scale fusion framework (ASF), which enhances the model’s ability to perceive small defect areas, significantly reduces computational overhead, and maintains high accuracy simultaneously. Furthermore, to further strengthen the preservation of feature details, the DySample dynamic sampling mechanism24 is introduced on top of the ARF-Encoder structure. This mechanism adaptively upsamples and downsamples the features, effectively preserving multi-scale information and enhancing the model’s responsiveness to fine-grained features.
Photovoltaic-DETR structure.
The backbone network, as the core part of feature extraction, is improved in this study by incorporating the ADown module25 and the specially designed Layer Aggregation Online Re-parameterization (ORPELAN) module to enhance overall feature modeling capability and inference efficiency. Specifically, ORPELAN combines the ideas of Cross-Stage Partial (CSP) and ELAN26 and utilizes online structure re-parameterized convolution27, further enhancing feature extraction capabilities. At the same time, it retains the advantages brought by the CSP and ELAN structures. This improvement effectively mitigates the potential semantic information loss problem in traditional multi-path feature fusion under deep supervision by introducing an auxiliary reversible branch. Furthermore, the multi-branch structure introduced by the ORPELAN module significantly enriches the feature space, thereby enhancing the model’s ability to recognize complex-shaped targets. Since photovoltaic module faults often exhibit small scales, low contrast, and blurry edges, traditional pooling operations can lead to critical information loss during downsampling. To address this issue, this study introduces the ADown module at each stage of the backbone network, replacing conventional pooling operations. The ADown module can adaptively select the most appropriate downsampling strategy, more effectively preserving detailed features and improving the model’s ability to perceive small-scale faults. The structure of the improved backbone network is summarized in Table 2.
The structure of the ADown module is shown in Fig. 3. First, the input feature map undergoes an average pooling operation, after which it is split into two parts along the channel dimension. One part first undergoes a max pooling operation followed by convolution, while the other part is directly processed by convolution. Finally, the outputs from both paths are concatenated to form the final output of the ADown module. This design effectively enhances feature retention during the downsampling process by combining different types of feature compression and extraction methods.
Structure of the ADown module.
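A minimal PyTorch sketch of the ADown flow just described is given below; the even channel split, kernel sizes, and pooling windows follow the YOLOv9 reference design and should be treated as assumptions rather than the exact configuration used here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_act(c_in, c_out, k, s):
    """Basic conv -> BN -> SiLU block used by both branches."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU())

class ADown(nn.Module):
    """Downsampling via two complementary branches, then channel concatenation."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.cv1 = conv_bn_act(c_in // 2, c_out // 2, 3, 2)  # strided conv branch
        self.cv2 = conv_bn_act(c_in // 2, c_out // 2, 1, 1)  # max-pool branch

    def forward(self, x):
        x = F.avg_pool2d(x, 2, stride=1, padding=0)   # initial average pooling
        x1, x2 = x.chunk(2, dim=1)                    # split along channels
        x1 = self.cv1(x1)                             # direct strided convolution
        x2 = self.cv2(F.max_pool2d(x2, 3, stride=2, padding=1))  # max pool + conv
        return torch.cat((x1, x2), dim=1)
```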
As shown in Fig. 4, ORPELAN is composed of the OREPA convolution combined with the CSP and ELAN connection methods. Online Re-parameterized Convolution (OREPA) aims to simplify the complex block structures during the training phase, reducing computational and memory overhead during training while maintaining high performance during inference. Traditional re-parameterization methods often use multi-branch and multi-layer structures to improve model performance. However, as the model complexity increases, the training costs rise significantly, especially in terms of GPU memory consumption and computation, leading to longer training times and a substantial increase in resource requirements. OREPA addresses this issue effectively by introducing two stages: block linearization and block compression, as shown in Fig. 5.
Structure of the ORPELAN module.
Online Convolutional Reparameterization Process.
In the block linearization stage, the core idea of OREPA is to remove the nonlinear normalization layers in the model and replace them with linear scaling layers. While normalization layers help smooth the loss landscape and accelerate convergence, their nonlinear characteristics increase training complexity. By using linear scaling layers, OREPA retains the normalization layers’ benefit of diversifying the optimization directions of different branches. These scaling layers have learnable parameters and, being linear operations, can be merged directly into the adjacent convolution layers during training, effectively reducing both computational load and memory usage. This improvement not only keeps the model lightweight and efficient in tasks such as photovoltaic module fault detection but also improves training efficiency.
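As a quick numerical check of this linearization idea, the snippet below folds a per-channel linear scaling layer directly into the preceding convolution’s weights; all tensor shapes and values are random and purely illustrative.

```python
import torch
import torch.nn.functional as F

c_in, c_out = 4, 8
x = torch.randn(1, c_in, 16, 16)
w = torch.randn(c_out, c_in, 3, 3)   # convolution weights
scale = torch.randn(c_out)           # learnable linear scaling layer (no nonlinearity)

# Training-time view: convolution followed by per-channel scaling
y_scaled = F.conv2d(x, w, padding=1) * scale.view(1, -1, 1, 1)

# Folded view: the scaling merges directly into the convolution weights
y_folded = F.conv2d(x, w * scale.view(-1, 1, 1, 1), padding=1)

print(torch.allclose(y_scaled, y_folded, atol=1e-5))  # True
```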
In the block compression stage, OREPA uses equivalence transformation techniques to compress the multi-branch, multi-layer structure used during training into a single convolution layer. Specifically, through a convolution kernel merging strategy, multiple convolution layers and their branch structures are fused into an end-to-end convolution operation:

$$Y = W_{N} * \left( \cdots \left( W_{2} * \left( W_{1} * X \right) \right) \right)$$

Here, $W_i$ represents the weights of the $i$-th convolution layer, and $*$ represents the convolution operation. The input X first passes through the initial convolution kernel W1, then sequentially through the following convolution layers, and finally produces the output Y. Because every step is linear, the chain collapses into a single equivalent kernel, so OREPA significantly reduces the computation spent on intermediate feature maps and effectively lowers the overall overhead. Moreover, this method simplifies the model’s multi-branch structure, keeping the inference phase simple and efficient.
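The block-compression identity can be verified in the same spirit: two stacked convolutions collapse into one equivalent kernel because convolution is linear. The sketch below merges a 1×1 convolution followed by a 3×3 convolution; the channel sizes are arbitrary illustrative choices.

```python
import torch
import torch.nn.functional as F

c_in, c_mid, c_out = 4, 8, 6
x = torch.randn(1, c_in, 16, 16)
w1 = torch.randn(c_mid, c_in, 1, 1)   # first convolution W1 (1x1)
w2 = torch.randn(c_out, c_mid, 3, 3)  # second convolution W2 (3x3)

# Sequential view: Y = W2 * (W1 * X)
y_seq = F.conv2d(F.conv2d(x, w1), w2, padding=1)

# Compressed view: fold W1 into W2, yielding a single equivalent 3x3 kernel
w_merged = torch.einsum('omhw,mi->oihw', w2, w1[:, :, 0, 0])
y_merged = F.conv2d(x, w_merged, padding=1)

print(torch.allclose(y_seq, y_merged, atol=1e-4))  # True
```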
In tasks such as multimodal photovoltaic module fault detection, accurately distinguishing small-scale features from complex backgrounds is crucial. However, traditional feature fusion modules face two major challenges: (1) insufficient multiscale feature fusion capability, which can lead to the loss of detailed information; (2) a tendency to introduce high computational complexity when enhancing expressive power, especially during the inference phase. To address these issues, this paper proposes the ARF-Encoder feature fusion network, which combines the advantages of Attention-Scale Sequence Fusion (ASF)28 and re-parameterization mechanisms29. As shown in Fig. 6, the ARF-Encoder consists of the RepELAN module, the SSFF (Scale-Sensitive Feature Fusion) module, and the Triple Feature Encoder (TFE) module. RepELAN, derived from the re-parameterized layer aggregation mechanism introduced in YOLOv9, combines multi-branch structures and re-parameterization techniques to ensure high detection accuracy while significantly reducing computational overhead. The TFE module is specifically designed to enhance the detection of small, dense targets (such as small foreign objects and defects on photovoltaic modules); it better captures fine-grained feature information by concatenating feature maps from large, medium, and small scales in the spatial dimension. The SSFF module fuses feature maps of different scales and modalities from the backbone network, utilizing a scale-aware mechanism to achieve more efficient and precise multiscale semantic fusion. The finely extracted features from the TFE module are further transmitted through the PANet structure30 to the various feature branches, and ultimately integrated with the multiscale information generated by SSFF into high-resolution feature maps for subsequent object detection tasks.
Structure of the ARF-Encoder module.
To more effectively identify densely overlapping fault targets on the multimodal surface of photovoltaic modules, this paper introduces the Triple Feature Encoder (TFE) module. By simulating the shape and appearance variations that occur at different scales during image magnification, the module enhances the model’s ability to perceive fine details. Because the spatial resolutions of the feature layers in the backbone network differ, traditional FPN fusion strategies typically only upsample the small-sized feature maps and simply add them to the higher-level feature maps, neglecting the rich detail contained in the larger-sized feature maps. This limitation restricts the model’s ability to finely detect small targets. The structure of the TFE module is shown in Fig. 7.
Structure of the TFE module.
The TFE module explicitly separates feature maps of large, medium, and small sizes and performs feature enhancement on each, strengthening the expression of fine-grained details. Specifically, the large-sized feature maps are first processed by a convolution module to reduce the channel dimension to C, and then downsampled using a hybrid structure of max pooling and average pooling. This approach preserves high-resolution details while enhancing robustness to spatial translation. The small-sized feature maps, after channel adjustment via convolution, are upsampled using nearest-neighbor interpolation, preserving local details and preventing the loss of information for small targets. Finally, the large, medium, and small feature maps are concatenated along the channel dimension after convolutional fusion, forming a fused feature map that contains multiscale details and is used to more accurately detect small faults in photovoltaic modules.
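The sketch below mirrors the TFE flow just described: the large map is channel-reduced and downsampled with a max/avg pooling hybrid, the small map is channel-reduced and upsampled with nearest-neighbor interpolation, and the three scales are concatenated. Averaging the two pooling outputs and the 2× scale ratios are implementation assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TFE(nn.Module):
    """Triple Feature Encoder sketch: align large/medium/small maps, then concat."""
    def __init__(self, c_large, c_medium, c_small, c):
        super().__init__()
        self.cv_l = nn.Conv2d(c_large, c, 1)
        self.cv_m = nn.Conv2d(c_medium, c, 1)
        self.cv_s = nn.Conv2d(c_small, c, 1)

    def forward(self, x_l, x_m, x_s):
        # Large map: reduce channels to C, then hybrid max+avg pooling downsample (2x)
        x_l = self.cv_l(x_l)
        x_l = 0.5 * (F.max_pool2d(x_l, 2) + F.avg_pool2d(x_l, 2))
        # Medium map: channel alignment only
        x_m = self.cv_m(x_m)
        # Small map: reduce channels, then nearest-neighbor upsample (2x)
        x_s = F.interpolate(self.cv_s(x_s), scale_factor=2, mode="nearest")
        return torch.cat((x_l, x_m, x_s), dim=1)  # fused map with 3C channels
```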
In multimodal photovoltaic module fault detection tasks, the complex backgrounds and small target scales mean that traditional feature pyramid structures have certain limitations in multiscale feature fusion. Most existing methods use feature pyramid networks (FPN) for fusion, but they typically rely on simple addition or concatenation operations to process feature maps at different scales. This approach fails to fully exploit the deep semantic relationships between multiscale feature maps. To address this issue, this paper introduces the Scale-Sensitive Feature Fusion (SSFF) module to more effectively integrate multiscale feature maps, which demonstrates particular advantages in fusing the global semantic information of deep feature maps with the fine-grained detail of shallow feature maps.
The SSFF module extracts feature maps of different resolutions (S, M, L) from the backbone network, constructing a scale-sequence feature representation to capture spatial semantic information at different levels of the image. Specifically, P3, P4, and P5 are convolved with multiple two-dimensional Gaussian kernels with increasing standard deviations, generating smooth multiscale feature maps that enhance their representational ability at different scales. The process is as follows:

$$F_{\sigma}(x, y) = G_{\sigma}(x, y) * f(x, y), \qquad G_{\sigma}(x, y) = \frac{1}{2\pi\sigma^{2}} e^{-\frac{x^{2} + y^{2}}{2\sigma^{2}}}$$

Here, f represents the two-dimensional (2D) feature map, and $G_{\sigma}$ is a 2D Gaussian filter; convolving f with a series of such filters, whose standard deviations σ gradually increase, produces the smoothed feature maps $F_{\sigma}$.
Structure of the SSFF module.
Subsequently, inspired by multi-frame video processing techniques, feature maps of different scales are stacked along a new scale (depth) dimension to construct a scale sequence, and three-dimensional convolution (3D convolution) is used to extract the scale-sequence features. Since the resolutions of feature maps at different scales are not consistent, nearest-neighbor interpolation is employed to adjust them to the same resolution as P3. P3 is chosen as the alignment reference because it has a higher spatial resolution and contains a large amount of detailed information closely related to the detection of small targets, such as fine cracks and dust.
As shown in Fig. 8, the core process of the SSFF module is as follows: first, 1 × 1 convolutions unify the channel dimensions of P4 and P5 to 256; then, nearest-neighbor interpolation aligns the spatial dimensions of P4 and P5 to match P3. Each feature map is expanded from a three-dimensional tensor [H, W, C] to a four-dimensional tensor [D, H, W, C] using the unsqueeze operation, and the feature maps at different scales are concatenated along the depth dimension to form a unified scale-sequence feature map. The scale-sequence semantics are extracted through 3D convolution, 3D batch normalization, and the SiLU activation function. Finally, the processed fused feature map is added to the upper layer’s output along the channel dimension, forming a high-quality feature map with stronger multiscale semantic representation capabilities.
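A compact PyTorch sketch of this pipeline follows; the 256-channel width comes from the text, while the way the scale axis is collapsed (a depth-3 3D kernel with no depth padding) is an implementation assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SSFF(nn.Module):
    """Scale-sequence fusion sketch: align P4/P5 to P3, stack along depth, 3D conv."""
    def __init__(self, c3, c4, c5, c=256):
        super().__init__()
        self.cv3 = nn.Conv2d(c3, c, 1)   # 1x1 convs unify channels to 256
        self.cv4 = nn.Conv2d(c4, c, 1)
        self.cv5 = nn.Conv2d(c5, c, 1)
        # depth-3 kernel consumes the scale axis (no depth padding)
        self.conv3d = nn.Conv3d(c, c, (3, 3, 3), padding=(0, 1, 1))
        self.bn, self.act = nn.BatchNorm3d(c), nn.SiLU()

    def forward(self, p3, p4, p5):
        h, w = p3.shape[2:]
        p3 = self.cv3(p3)
        p4 = F.interpolate(self.cv4(p4), size=(h, w), mode="nearest")
        p5 = F.interpolate(self.cv5(p5), size=(h, w), mode="nearest")
        # unsqueeze each map to [B, C, 1, H, W] and concatenate along depth
        seq = torch.cat([t.unsqueeze(2) for t in (p3, p4, p5)], dim=2)
        seq = self.act(self.bn(self.conv3d(seq)))  # [B, C, 1, H, W]
        return seq.squeeze(2)                      # fused map, added to upper branch
```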
In the original RT-DETR network, the upsampling operation uses nearest-neighbor interpolation, a method widely used in lightweight detection networks due to its simple computation and low resource overhead. However, because this method calculates new pixel values by copying neighboring pixels, it can lead to excessive smoothing, causing small multimodal targets (such as delamination, hotspots, dust, scratches, and other surface defect features of photovoltaic panels) to become blurred or lost during the upsampling process. To overcome this issue, this paper introduces the DySample upsampling method. Unlike traditional kernel-based upsampling techniques, DySample employs a point-sampling strategy, learning sampling locations and using a fixed bilinear interpolation kernel to perform upsampling without relying on high-resolution features to guide the input.
The DySample module, as shown in Fig. 9, takes as input a feature map X of size H×W×C. First, a sampling point generator based on a static scope factor learns a feature offset O, which is used to produce a sampling set S of size sH×sW×2g. Next, the grid sampling function applies the learned positions to the input feature map X, producing an upsampled feature map X′ of size sH×sW×C.
Structure of the DySample module.
Specifically, given a feature map X of size C×H×W and a point sampling set S of size 2g×sH×sW, where 2g represents the x and y coordinates, the grid_sample function resamples X at the positions in the sampling set S, generating a feature map X′ of size C×sH×sW. The upsampling process is described by the following formula:

$$X' = \mathrm{grid\_sample}(X, S)$$

Here, X represents the input feature map, X′ the upsampled feature map, and S the sampling set.
Static Scope Factor Sampling Point Generation.
The sampling point generator generates the sampling set S, as shown in Fig. 10. Given a fixed static scope factor of 0.25 to constrain the offset range and a feature map X of size C×H×W, a linear layer with input and output channel sizes of C and 2gs², respectively, is used to generate an offset O of size 2gs²×H×W. The offset is then reshaped to 2g×sH×sW through pixel reorganization. The sampling set S is the sum of the offset O and the original sampling grid G. This controls the local search range of each upsampling point, preventing excessive overlap of sampling positions and thereby reducing blurry boundaries and error propagation in the output feature map. The computational process is given by the following equations:

$$O = 0.25 \cdot \mathrm{linear}(X), \qquad S = G + O$$
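Putting the pieces together, a simplified single-group DySample sketch is shown below (g = 1 for clarity; the grouped version reshapes channels analogously). The offset normalization at the end is a simplified assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DySample(nn.Module):
    """Point-sampling upsampler sketch (single group): S = G + O, O = 0.25*linear(X)."""
    def __init__(self, c, scale=2):
        super().__init__()
        self.s = scale
        # linear layer (1x1 conv): C -> 2*s^2 offset channels (x/y per output point)
        self.offset = nn.Conv2d(c, 2 * scale * scale, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        s = self.s
        # offsets constrained by the static scope factor, pixel-shuffled to sH x sW
        o = F.pixel_shuffle(0.25 * self.offset(x), s)   # [B, 2, sH, sW]
        o = o.permute(0, 2, 3, 1)                       # [B, sH, sW, 2]
        # original sampling grid G in normalized [-1, 1] coordinates
        ys = torch.linspace(-1, 1, s * h, device=x.device)
        xs = torch.linspace(-1, 1, s * w, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((gx, gy), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
        # convert pixel offsets to normalized units (simplified), then resample
        sample = grid + o * torch.tensor([2.0 / w, 2.0 / h], device=x.device)
        return F.grid_sample(x, sample, mode="bilinear", align_corners=False)
```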
In this paper, the upsampling part of the ARF-Encoder is replaced with this mechanism, allowing the step size of the sampling points to adaptively adjust based on changes in the input feature content. This enhances the flexibility and sensitivity of the sampling process. This mechanism significantly strengthens the model’s ability to adapt to feature variations, providing support for building a more robust photovoltaic module detection system.
As shown in Fig. 11, the dataset selected for this experiment is the aerial-captured infrared hotspot image dataset of photovoltaic modules. Through image augmentation techniques such as flipping and cropping, a total of 2,200 images of photovoltaic module faults in the infrared hotspot state were obtained. These images include three common types of photovoltaic module infrared hotspot faults: Diode Fault, Cell Fault, and Hotspot, which meet the practical requirements of the photovoltaic industry for infrared hotspot fault detection in photovoltaic modules. In this experiment, the dataset was randomly divided into a training set, validation set, and test set at a ratio of 8:1:1, i.e., 1,760 images for training, 220 images for validation, and 220 images for testing.
Example image of photovoltaic module hot spot dataset.
As shown in Fig. 12, this dataset was compiled from publicly available datasets such as the FlyingJiang dataset, OpenML, and Roboflow, and annotated using labelimg for photovoltaic module foreign objects and defect images, totaling 2,048 images. It includes 4 types of foreign objects and 2 common types of photovoltaic module faults, meeting the practical detection requirements of the photovoltaic industry for foreign objects and defects in photovoltaic modules. In this experiment, the dataset was randomly divided into a training set and a validation set at an 8:2 ratio, with 1,638 images for training and 410 images for validation.
Example image from the foreign object and defect dataset for photovoltaic modules.
As shown in Fig. 13, the public datasets selected are the aerial infrared defect datasets for photovoltaic inspection of solar panels: GB_HSP_modified30, which contains 1,468 images covering three types of defects, namely component cracks (CRP), glass breakage (GB), and hot spots (HSP); and as shown in Fig. 14, PV_Train_Val_28_1231, which includes 2,781 images with five types of defects, i.e., ShortCircuitString, ShortCircuitCell-LowPowerCell, Crack, MicroCrack, and OtherError. These datasets meet the requirements for detecting defects in photovoltaic modules in the actual photovoltaic industry. In the experiment, the datasets were randomly divided into training sets and validation sets at a ratio of 8:2.
Example image from the GB_HSP_modified dataset.
Example image from the PV_Train_Val_28_12 dataset.
The experiments were conducted on a 64-bit Windows 11 operating system. The hardware configuration included an Intel Core i5-12400 F processor and an NVIDIA GeForce RTX 4060 Ti GPU with 16 GB of VRAM. The detailed software environment is summarized in Table 3, and the training hyperparameters are listed in Table 4.
To assess the performance of the model, this experiment uses four evaluation metrics: Precision (P), Average Precision (AP), the number of parameters (Parameters), and computational complexity (GFLOPs). mAP@50% represents the average precision value when the Intersection over Union (IoU) threshold is set to 0.5. The number of parameters is used to measure the model’s scale and complexity, calculated by summing the number of weight parameters in each layer. GFLOPs is used to evaluate the model’s computational complexity and runtime efficiency. The formulas for calculating these evaluation metrics are as follows:
$$P = \frac{TP}{TP + FP}$$

In the formula, TP (True Positive) represents correctly detected targets, i.e., detections that match actually existing targets, and FP (False Positive) represents incorrect detections, i.e., detections of targets that do not actually exist, producing false positives.
$$AP = \int_{0}^{1} P(R)\,dR, \qquad mAP = \frac{1}{N} \sum_{i=1}^{N} AP_{i}$$

In the formula, AP measures the model’s performance on a single category, calculated as the area under the Precision-Recall curve, and mAP, a key evaluation metric for multi-class detection tasks, is the average of AP across all categories, where N represents the number of categories.
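As a toy illustration of these two formulas, the snippet below computes AP as the area under a hypothetical precision-recall curve and averages per-class APs into mAP; all numbers are made up.

```python
import numpy as np

# Hypothetical precision-recall points for one fault class
recall    = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
precision = np.array([1.0, 0.9, 0.85, 0.7, 0.6, 0.4])

ap = np.trapz(precision, recall)   # AP: area under the P-R curve
print(f"AP  = {ap:.3f}")

# mAP: mean of per-class APs over N categories (other values are illustrative)
aps = [ap, 0.72, 0.65]
print(f"mAP = {np.mean(aps):.3f}")
```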
The number of parameters (Parameters) is a key indicator of the model’s complexity and capacity, counting weights, biases, and so on. A larger parameter count typically indicates greater learning and expressive power, allowing the model to handle more complex data and tasks, but it also tends to slow computation. Together with GFLOPs, this metric helps evaluate the model’s computational complexity and efficiency, providing important guidance for optimizing model performance.
To further validate the effectiveness of each improvement module, experiments were conducted on the RT-DETR model with the addition of each module using the infrared hotspot image dataset. The experimental results are shown in Table 5.
From the experimental results in Table 5, it can be concluded that each improvement module enhances the model’s performance. Using the first row, which has no added modules, as the baseline, improving the backbone network raises mAP@50% to 72.8 (↑0.6) and Recall (R%) to 71.5 (↑1.2), while computational complexity (GFLOPs) drops to 35.8 (↓21.1) and the parameter count to 14.1 (↓4.8). This indicates that the module reduces redundant feature processing and enhances target detection accuracy. When the ARF-Encoder module is added, mAP@50% further increases to 73.6 (↑0.8) and Recall (R%) to 75.8 (↑3.5), indicating that this module optimizes the network structure and feature fusion, thereby improving model performance. Finally, when DySample is added, mAP@50% reaches 75.0 (↑1.4) and Recall (R%) 72.1 (↑1.4), showing that it improves the model’s detection rate and effectively avoids missing small targets. Incrementally adding modules and comparing the metrics clearly demonstrates the positive impact of each module on performance, validating the effectiveness and necessity of each module in improving detection accuracy and the soundness of the ablation analysis in evaluating module contributions.
Comparison chart of various indicators in ablation experiments.
To more intuitively observe the effectiveness of the ablation study, the impact of different modules or stages (RT-DETR, backbone, ARF-Encoder, DySample, etc.) on model performance is explored by comparing key metrics. As shown in Fig. 15, from the trend of the metrics, mAP@50% remains at a high and stable level during the RT-DETR and backbone stages, and continues to stay stable after the introduction of subsequent modules, reflecting the model’s stability in terms of accuracy. The Params/M value is low with little variation, indicating that the model’s parameter count is well-controlled, has lightweight potential, and is suitable for deployment on edge devices in photovoltaic power plants. The Recall (R%) significantly increases in the backbone stage, demonstrating that the backbone network has strong feature extraction capabilities, which allows it to more comprehensively capture module targets and reduce missed detections, ensuring comprehensive detection. Precision (P%) gradually increases from RT-DETR to DySample, indicating that the model’s false positive rate decreases and detection accuracy improves. GFLOPs in the backbone stage significantly decrease, indicating that the backbone network reduces computational overhead while maintaining performance. In subsequent modules, GFLOPs remain within a reasonable range, making model inference more efficient, which aligns with the real-time requirements for photovoltaic module detection. From an experimental perspective, each module optimizes Recall, Precision, and controls parameter count and computational complexity, while maintaining detection accuracy. This achieves a balance between accurate detection and efficient inference, making the model better suited to the large-scale module inspection needs of photovoltaic power plants.
To verify the performance improvement of the proposed Photovoltaic-DETR model in photovoltaic module fault detection, a comparison was made with several current mainstream object detection algorithms, including RT-DETR32, YOLOv6n33, YOLOv6s34, YOLOv8n, YOLOv8s35, YOLOv9s36, YOLOv10n37, YOLOv10s38, YOLOv11n39, YOLOv11s40, YOLOv12n41, and YOLOv12s42, alongside the proposed Photovoltaic-DETR. The experimental results are shown in Table 6.
From the comparative experiment data, it can be seen that the Photovoltaic-DETR model demonstrates clear advantages in photovoltaic module fault detection over existing mainstream models (RT-DETR, YOLOv6n, and 12 other models). In detection accuracy, its mAP@50% reaches 75.0, significantly higher than all comparison models (e.g., YOLOv12n reaches only 67.6). It therefore delivers higher overall detection accuracy for photovoltaic module faults, accurately identifying various fault targets, reducing false positives and missed detections, and better adapting to the diverse detection needs of photovoltaic modules in industry. It also surpasses the other models on per-fault detection accuracy (e.g., YOLOv6n at 74.2 and YOLOv8n at 73.3), capturing photovoltaic module faults more comprehensively and reducing omissions due to insufficient model capability, which is of great significance for ensuring the operation and maintenance quality of photovoltaic power plants and for timely identification of potential faults. Overall, Photovoltaic-DETR outperforms existing mainstream models in key metrics such as detection accuracy, lightweight design, detection reliability, and computational efficiency, validating its effectiveness in photovoltaic module fault detection and offering a better technical solution for industrial photovoltaic module target detection and for more intelligent, efficient, and accurate photovoltaic operation and maintenance.
To comprehensively verify the generalization ability of the proposed model, this study introduces three datasets for cross-modal testing: ① a self-created visible light dataset for photovoltaic modules, consisting of 2,048 images; ② the publicly available GB_HSP_modified dataset, which includes 1,468 aerial infrared images of photovoltaic panel defects; ③ the public dataset PV_Train_Val_28_12, which covers 2,781 photovoltaic module images. The self-created visible light dataset includes 3 types of foreign objects and 2 types of module fault scenarios, with the training and validation sets constructed in an 8:2 ratio. The GB_HSP_modified dataset focuses on three core defects: component cracks (CRP), glass damage (GB), and hotspots (HSP), strictly aligned with the actual detection needs of the photovoltaic industry. In the experiment, the training and validation sets are randomly divided at an 8:2 ratio. This approach ensures the comprehensiveness and authenticity of the model’s generalization performance evaluation. The specific experimental data is shown in Table 7.
From the results in Table 7, it can be seen that the proposed Photovoltaic-DETR shows significant advantages across two different modalities and scenarios, validating its effectiveness in photovoltaic module fault detection under multimodal conditions. In the aerial infrared photovoltaic defect detection task, Photovoltaic-DETR achieves an mAP@50% of 77.7% and Precision of 82.5%, which are improvements of 3.4% and 6.6% over YOLOv12n, and clearly outperform RT-DETR (72.6%, 73.0%). Notably, although Photovoltaic-DETR has slightly higher parameter count (14.4 M) and computational complexity (40.7 GFLOPs) compared to the YOLO series models, it achieves a significantly higher detection accuracy, demonstrating its ability to effectively identify subtle defects such as component cracks, glass damage, and hotspots in infrared scenarios. This indicates strong cross-domain adaptability. In complex visible light scenarios, Photovoltaic-DETR achieves an mAP@50% of 75.7% and Precision of 79.5%, significantly outperforming all comparison models, with mAP@50% improving by 4.9% and Precision by 4.6% compared to RT-DETR. In contrast, YOLO series models generally show a substantial decrease in performance on this dataset, with YOLOv12n achieving only 63.9% mAP@50%, indicating poor robustness in complex visible light backgrounds. In comparison, Photovoltaic-DETR, leveraging efficient multiscale feature fusion and attention mechanisms, is able to more accurately distinguish background noise from module defects, demonstrating good adaptability to real-world photovoltaic inspection scenarios while maintaining reasonable computational overhead. Photovoltaic-DETR achieves leading performance in both infrared and visible light multimodal scenarios, ensuring high detection accuracy while meeting the need for lightweight deployment. Compared to the YOLO series, the model shows stronger robustness in complex backgrounds and modality differences; compared to RT-DETR, it achieves higher detection accuracy with lower computational cost. This fully demonstrates that the proposed Photovoltaic-DETR model has superior generalization ability and application potential in multimodal photovoltaic module fault detection.
To clarify the core advantages of the multimodal approach of Photovoltaic-DETR, this study selects two typical unimodal solutions, namely unimodal infrared and unimodal visible light, and conducts a quantitative comparison with the multimodal solution of Photovoltaic-DETR under the same experimental conditions (the same dataset, hardware environment, and evaluation metrics). From four key dimensions—detection accuracy, computational complexity, hardware requirements, and implementation cost—the added value of multimodal fusion and the trade-off between cost and benefit are verified.
This study sets up three types of comparative schemes, as detailed below:
Unimodal Infrared Scheme: Based on RT-DETR, it only inputs infrared images that are consistent with the infrared modal data of the multimodal scheme.
Unimodal Visible Light Scheme: Also based on RT-DETR, it only inputs visible light images that are consistent with the visible light modal data of the multimodal scheme.
Multimodal Scheme (Photovoltaic-DETR): It inputs dual-modal images (infrared + visible light) and enables the ORPELAN, ARF-Encoder, and DySample modules.
All schemes adopt unified experimental conditions: the hardware environment uniformly uses the setup specified in Sect. “Experimental environment and training settings”, i.e., Windows 11 + Intel Core i5-12400 F + NVIDIA GeForce RTX 4060 Ti (16 GB VRAM); the evaluation metrics uniformly include mAP50%, number of parameters, GFLOPs, and implementation cost. Specific information is shown in Table 8 below.
The comparison results between unimodal and multimodal methods obtained from the three comparison schemes are shown in Table 9 as follows:
As can be seen from Table 9 above: In terms of detection accuracy, the multimodal scheme (Photovoltaic-DETR) achieves an mAP50% of 75.7% on the infrared dataset, which is a 14.9% improvement compared to the 70.8% of the unimodal infrared scheme; on the visible light dataset, its mAP50% reaches 77.7%, representing a 19.2% increase over the 68.5% of the unimodal visible light scheme. The cross-modal average mAP50% is 76.0%, which is 14.6% higher than the 70.8% of the unimodal infrared scheme and 7.6% higher than the 68.5% of the unimodal visible light scheme, demonstrating obvious advantages in detection accuracy. In terms of computational complexity, the multimodal scheme has a parameter count of 14.4 M, which is 27.6% lower than the 19.9 M of both the unimodal infrared and unimodal visible light schemes; its GFLOPs stand at 40.7, a 28.5% reduction compared to the 56.9 of the unimodal schemes, resulting in lower consumption of computational resources. Although in terms of implementation cost, the procurement cost of a single set of inspection equipment for the multimodal scheme is 118,000 yuan, which is higher than the 85,000 yuan of the unimodal infrared scheme and 62,000 yuan of the unimodal visible light scheme, it can significantly reduce fault losses. From the perspective of long-term operation and maintenance, it exhibits remarkable cost-effectiveness. In summary, the multimodal scheme (Photovoltaic-DETR) demonstrates outstanding advantages in detection accuracy, computational complexity control, and long-term cost-effectiveness.
To more intuitively demonstrate the performance of the Photovoltaic-DETR model on infrared hotspot images, infrared images, and visible light images of photovoltaic modules, several representative detection results from Photovoltaic-DETR are selected and visually compared with the original RT-DETR model.
Comparison of detection results between RT-DETR and Photovoltaic-DETR in infrared hotspot images of photovoltaic modules.
Using the optimal weights obtained from training, testing was performed on the test set, and the detection results for the photovoltaic module infrared hotspot images are shown in Fig. 16. It can be intuitively seen that Photovoltaic-DETR achieves accurate detection for photovoltaic module faults of different types, shapes, sizes, and even backgrounds, demonstrating excellent performance. For example: in the first set of images, RT-DETR shows hotspot confidences of 0.56 and 0.60, which Photovoltaic-DETR improves to 0.65 and 0.70. In the second set of images, Bypass Diode detection improves from 0.84 and 0.71 in RT-DETR to 0.85 and 0.88 in Photovoltaic-DETR, and Hotspot detection improves from 0.43 and 0.32 to 0.50 and 0.34. In the third set, for the GB defect, Photovoltaic-DETR’s detection boxes are more tightly aligned with the target than RT-DETR’s. In the final set, the confidence for the corresponding targets in Photovoltaic-DETR is generally higher than that in RT-DETR.
As shown in Fig. 17, to further evaluate the model’s classification and recognition capabilities across different defect detection tasks, a confusion matrix was plotted based on the results from the validation set. The confusion matrix provides a clear representation of the model’s recognition performance across various categories, including the number of correctly classified instances and the misclassification situations between categories.
Comparison of confusion matrices before and after model improvement.
The improved model shows an increase in recognition accuracy, with the correct identification of Bypass Diode increasing from 247 to 253, and the correct identification of Cell Fault increasing from 461 to 465.
After training and validating with the photovoltaic module infrared image training and validation datasets, the optimal weights obtained from the training process were used to test on the test set, resulting in the detection outcomes for the photovoltaic module infrared images shown in Fig. 18. It can be intuitively seen that Photovoltaic-DETR accurately detects photovoltaic module faults of different types, shapes, sizes, and backgrounds, demonstrating excellent performance. In the first set of images, for the “GB” fault, RT-DETR has a confidence of 0.36, while Photovoltaic-DETR increases to 0.85, a difference of 0.49. In the second set, for the “HSP” fault, RT-DETR has a confidence of 0.74, while Photovoltaic-DETR improves to 0.76. In the third set, for the “HSP” fault, RT-DETR has a confidence of 0.62, while Photovoltaic-DETR rises to 0.68, a difference of 0.06. In the final set, although both models have the same confidence, RT-DETR misses a detection, while Photovoltaic-DETR successfully detects both faults. As shown in Fig. 19, Photovoltaic-DETR likewise achieves accurate detection for photovoltaic module faults of different types, shapes, sizes, and backgrounds. In the first set of images, for faults such as “Microcrack”, the confidence of RT-DETR is middling, while that of Photovoltaic-DETR is significantly higher. In the second set, across multiple types of photovoltaic module faults, the confidence of Photovoltaic-DETR is generally higher than that of RT-DETR, and its recognition of faults is more accurate. In the third set, when detecting faults such as “Microcrack”, the confidence of Photovoltaic-DETR is also superior to that of RT-DETR. In the final set, RT-DETR misses detections, whereas Photovoltaic-DETR detects all faults with good confidence. In summary, Photovoltaic-DETR outperforms RT-DETR in both confidence and fault detection completeness, has better detection capability for different types of photovoltaic module faults, and can provide strong support for the accurate identification, operation, and maintenance of photovoltaic modules.
Comparison of detection results between RT-DETR and Photovoltaic-DETR in infrared images of photovoltaic modules.
Comparison of Detection Results Between RT-DETR and Photovoltaic-DETR Under Infrared Images of PV_Train_Val_28_12.
After 150 iterations of training and validation using the training and validation sets, the optimal weights obtained from training were used to test on the validation set, resulting in the detection outcomes for the photovoltaic module visible light images shown in Fig. 20. It can be intuitively seen that Photovoltaic-DETR accurately detects photovoltaic module faults of different types, shapes, sizes, and backgrounds, demonstrating excellent performance. In the first set of images, for the “bird-drop” fault detection, RT-DETR has a confidence of 0.79, while Photovoltaic-DETR increases to 0.88, with a difference of 0.09. In the second set, both models detect “dusty,” but several small targets of “bird-drop” on the photovoltaic panel are missed by RT-DETR, while Photovoltaic-DETR successfully detects multiple “bird-drop” small targets. In the third and fourth sets, for clean photovoltaic modules (“clean”) and polluted photovoltaic modules (“dusty”), RT-DETR has a confidence of 0.97 and 0.97, while Photovoltaic-DETR increases to 0.99 and 0.98, respectively. In the final set, for snow-covered modules, RT-DETR has a confidence of 0.91, while Photovoltaic-DETR improves to 0.95.
Comparison of detection results between RT-DETR and Photovoltaic-DETR under visible light for photovoltaic modules.
To verify the scalability of Photovoltaic-DETR in large-scale photovoltaic power plants, this study conducted a case simulation based on the actual operation and maintenance scenario of a 100 MWp centralized photovoltaic power plant (located in Hefei, Anhui Province, covering approximately 2,000 mu, about 133 hectares, with roughly 400,000 modules). The specific data and analysis are as follows:
As can be seen from Table 10, although the initial equipment procurement cost of Photovoltaic-DETR is slightly higher than that of the single-modality model, due to a 5-percentage-point reduction in the model’s missed detection rate, the annual loss caused by missed fault detections is 59.5% lower than that of the single-modality model. The total cost is only 22.9% higher than that of traditional manual inspection. Moreover, as the operation period of the power plant increases (after the completion of equipment depreciation), the advantage in total cost will be further expanded—starting from the 6th year, the total annual cost can be reduced to 920,000 yuan, with a cost reduction rate of 56.2%. In addition, the dual-camera UAV is compatible with the simultaneous collection of infrared and visible light data, eliminating the need for separate inspection trips. Compared with single-modality detection, the annual inspection time is shortened by 40% (reduced from the original 20 days per inspection to 12 days per inspection), which further reduces the time cost of operation and maintenance.
To address the limitations of photovoltaic module fault detection models constrained by single image modalities, this study proposes a multimodal photovoltaic module fault detection model based on RT-DETR, named Photovoltaic-DETR. The model efficiently processes infrared hotspot images, infrared images, and visible light images of photovoltaic modules. First, the self-designed ORPELAN and ReLA Block modules construct a lightweight backbone network and introduce an auxiliary reversible branch to efficiently extract the spatial features of photovoltaic modules. Second, a restructured feature fusion network is proposed, which combines the attention-scale sequence fusion mechanism and re-parameterization methods to reduce redundancy in channel concatenation. Finally, the DySample module is used during feature fusion to achieve dynamic upsampling and downsampling, enhancing the model’s perception ability. Experimental validation on the UAV-captured photovoltaic hotspot fault detection dataset, the GB_HSP_modified and PV_Train_Val_28_12 public infrared photovoltaic module datasets, and a self-created visible light dataset shows that, compared to the RT-DETR model, Photovoltaic-DETR improves mAP@50% by 2.9, 4.9, 2.6, and 5.1 percentage points, respectively, while reducing the model’s parameter count by 28.6% and computational load by 28.5%. These results fully demonstrate the superior adaptability of Photovoltaic-DETR in multimodal photovoltaic module fault detection and provide a solid technical foundation for industrial multimodal photovoltaic module target detection.
Although Photovoltaic-DETR achieves a balance between accuracy and lightweight design in multimodal photovoltaic module fault detection, further validation experiments and scenario adaptation tests reveal that the methodology still has the following limitations:
Insufficient Adaptability to Unseen Fault Types: the training data in this study covers 14 common fault types (e.g., hotspots, microcracks, dust occlusion), but the model adapts poorly to “unseen faults” that may arise during the actual operation of photovoltaic systems, such as reduced light transmittance due to aging of encapsulation materials and atypical damage caused by hail impact.
Scarcity of Annotated Data: annotating photovoltaic faults relies on professional equipment (infrared thermometers, EL detectors) and operation and maintenance experience, so public datasets remain limited in scale. Although data augmentation techniques (e.g., flipping, cropping, brightness perturbation) can expand the dataset, performance fluctuations still occur in generalization tests.
Model Performance Significantly Affected by Environmental Factors Such as Illumination, Temperature, and Weather: under direct strong light, the contrast of visible light images becomes imbalanced, which greatly degrades the model's detection of small foreign-object targets; under low illumination, the temperature signature of infrared hotspots is weakened, leading to a marked increase in misjudgments of weak hotspots; and in rainy or snowy weather, the model lacks adaptive capability for dynamic environments (a contrast-normalization sketch follows this list).
Uncertainty Quantification and Potential Error Sources Not Yet Characterized:
Data Acquisition Errors: deviations in UAV flight height (on the order of 0.5 m) cause fluctuations in image resolution, producing a deviation of up to 15% in the measured proportion of photovoltaic module fault areas;
Uncertainty in Model Decision-Making: the current model makes deterministic predictions and does not output confidence intervals; for ambiguous fault samples (e.g., microcracks with blurred edges), this may introduce decision-making risks in operation and maintenance (a Monte Carlo dropout sketch for interval estimates also follows this list);
Data Imbalance Issue: in the training data, common-fault samples (dust occlusion, hotspots) account for 65%, while rare-fault samples (diode failure, local delamination) account for only 5%, biasing detection toward common faults at the expense of rare ones.
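A conventional mitigation for the strong-light contrast imbalance noted above is histogram-based contrast normalization at the preprocessing stage. The following sketch is illustrative only and is not part of Photovoltaic-DETR; it applies OpenCV's CLAHE to the luminance channel, with the clip limit and tile size as assumed tuning parameters.

import cv2

def normalize_contrast(bgr_image, clip_limit=2.0, tile_grid=(8, 8)):
    """Rebalance contrast under direct strong light via CLAHE on the luminance channel."""
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB)  # work in LAB so color is left untouched
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid)
    l = clahe.apply(l)  # equalize luminance locally, tile by tile
    return cv2.cvtColor(cv2.merge((l, a, b)), cv2.COLOR_LAB2BGR)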
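On the uncertainty point, one lightweight way to attach interval estimates to detection confidences without retraining is Monte Carlo dropout: keep dropout layers active at inference and aggregate repeated forward passes. The PyTorch sketch below is an illustration under the assumption that the detector returns a dictionary with a fixed-size per-query "scores" tensor (as in DETR-style models); it is not a description of the paper's implementation.

import torch

def mc_dropout_scores(model, image, n_samples=20):
    """Approximate a 95% interval for detection scores via Monte Carlo dropout."""
    model.eval()
    for m in model.modules():  # re-enable dropout only; batch norm stays in eval mode
        if isinstance(m, torch.nn.Dropout):
            m.train()
    runs = []
    with torch.no_grad():
        for _ in range(n_samples):
            runs.append(model(image)["scores"])  # assumed output format
    scores = torch.stack(runs)  # (n_samples, n_queries); query count is fixed in DETR-style models
    mean, std = scores.mean(dim=0), scores.std(dim=0)
    return mean, mean - 1.96 * std, mean + 1.96 * std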
Online Learning for Adapting to Unseen Faults: to address the issues of “unseen faults” and data timeliness, an online learning framework will be introduced. After deployment, edge devices will collect new fault samples in real time, enabling the model to learn new fault features quickly without forgetting existing knowledge. A lightweight online training module will be designed for UAV edge terminals, with a single incremental training session kept within 10 min to meet on-site real-time update requirements; the replay-based sketch below illustrates the core idea.
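The sketch below is a minimal illustration of the replay idea behind such a framework (helper names are hypothetical; the actual module design remains future work): a small reservoir of past samples is mixed into each incremental batch so that new fault types are learned without overwriting existing knowledge.

import random

class ReplayBuffer:
    """Fixed-size reservoir of past fault samples to limit catastrophic forgetting."""
    def __init__(self, capacity=500):
        self.capacity, self.samples, self.seen = capacity, [], 0

    def add(self, sample):
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(sample)
        else:
            i = random.randrange(self.seen)  # reservoir sampling keeps a uniform subsample
            if i < self.capacity:
                self.samples[i] = sample

def incremental_step(model, optimizer, criterion, new_batch, buffer, replay_n=8):
    """One short update mixing fresh field samples with replayed historical ones."""
    replay = random.sample(buffer.samples, min(replay_n, len(buffer.samples)))
    batch = list(new_batch) + replay
    optimizer.zero_grad()
    loss = sum(criterion(model(x), y) for x, y in batch) / len(batch)
    loss.backward()
    optimizer.step()
    for sample in new_batch:  # newly seen faults become future replay material
        buffer.add(sample)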
Enhancing Industrial Practicality: Design of a Photovoltaic-DETR + SCADA/IoT Closed-Loop System.
Data Layer: the model's detection results (fault type, location, confidence) will be output in a standardized JSON format and transmitted to the photovoltaic power plant's SCADA system in real time via the MQTT protocol (see the publishing sketch after this list).
Application Layer: The SCADA system will integrate operational data such as inverter output power and irradiance to build a fault impact assessment model (e.g., prediction of power loss caused by hotspots) and automatically trigger operation and maintenance work orders (e.g., fault location navigation, maintenance priority ranking).
Hardware Adaptation: A lightweight model deployment package will be developed, which can be directly installed on the power plant’s edge servers (recommended configuration: Intel Xeon E3-1230 v6 CPU, 16GB RAM, NVIDIA Tesla T4 GPU) or UAV on-board terminals (recommended configuration: NVIDIA Jetson Nano 4GB) to meet the needs of different deployment scenarios.
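A sketch of the data-layer output described above (illustrative only: the topic name, field names, and broker address are assumptions, and the power-loss figure is a first-order estimate rather than the SCADA-side assessment model). It serializes one detection to JSON and publishes it with the paho-mqtt client:

import json
import paho.mqtt.client as mqtt

def estimate_power_loss_kw(fault_area_fraction, module_rating_kw, irradiance_w_m2):
    """First-order loss estimate: affected area fraction scaled by current irradiance."""
    return fault_area_fraction * module_rating_kw * irradiance_w_m2 / 1000.0

detection = {
    "fault_type": "hotspot",
    "module_id": "A12-R03-M17",  # assumed plant naming scheme
    "bbox": [412, 208, 498, 290],  # pixel coordinates in the source frame
    "confidence": 0.93,
    "est_power_loss_kw": estimate_power_loss_kw(0.06, 0.55, 820.0),
}

client = mqtt.Client()  # paho-mqtt 1.x style constructor
client.connect("scada.plant.local", 1883)  # assumed broker address
client.publish("pv/faults/detections", json.dumps(detection), qos=1)
client.disconnect()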
The data presented in this study are available within the article. Further inquiries can be directed to the corresponding author.
Department of Energy Conservation and Technology Equipment, National Energy Administration. China New Energy Storage Development Report (National Energy Administration, 2025).
Chen, W. & Li, L. Practical analysis of installing rooftop distributed photovoltaic power stations in small office buildings under the background of the dual carbon target. Sol. Energy (07), 28–34 (2025).
Sui, X. et al. Progress in infrared image colorization technology. Acta Optica Sinica 1–33 (2020).
Geng, J. et al. Key technology for self-diagnosis of zero-value insulators in transmission lines based on electroluminescence effect. J. Electr. Eng. 1–17 (2020).
International Energy Agency (IEA). PV Power Systems: Operation and Maintenance Best Practices (IEA, 2024).
Pan, W. et al. Enhanced photovoltaic panel defect detection via adaptive complementary fusion in YOLO-ACF. Sci. Rep. 14 (1), 26425 (2024).
Defect detection of solar panels based on improved YOLOv4. J. South China Normal Univ. (Natural Sci. Ed.) 55 (5), 21–30. https://doi.org/10.6054/j.jscnun.2023059 (2023).
Xie, H. et al. ST-YOLO: A defect detection method for photovoltaic modules based on infrared thermal imaging and machine vision technology. PLoS One 19 (12), e0310742. https://doi.org/10.1371/journal.pone.0310742 (2024).
Li, L., Wang, Z. & Zhang, T. GBH-YOLOv5: Ghost Convolution with BottleneckCSP and Tiny Target Prediction Head Incorporating YOLOv5 for PV Panel Defect Detection. Electronics 12, 561 (2023).
Kim, S. et al. Closed-loop O&M system integrating multimodal fault detection and SCADA for utility-scale PV plants. Appl. Energy 387, 125891 (2025).
Chen, Y. et al. Transfer learning for low-resource multimodal photovoltaic defect detection. Neural Comput. Appl. 36 (15), 11234–11246 (2024).
Zhao, L. et al. Online adaptive multimodal model for dynamic photovoltaic fault detection. IEEE Trans. Sustain. Energy 16 (2), 1089–1100 (2025).
Zhang, X. et al. Cross-modal attention fusion for multimodal photovoltaic defect detection. Sol. Energy 268, 112890 (2024).
Li, J. et al. Transformer-based cross-modal feature alignment for subtle defect detection in photovoltaic modules. IEEE Trans. Industr. Inf. 21 (3), 2456–2466 (2025).
Wang, Z. et al. Knowledge-distilled multimodal DETR for edge-based photovoltaic fault detection. IEEE Internet Things J. 11 (18), 17234–17245 (2024).
Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS) 5998–6008 (2017).
Dosovitskiy, A. et al. An image is worth 16×16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR) (2021).
Carion, N. et al. End-to-end object detection with transformers. In European Conference on Computer Vision (ECCV) 213–229 (2020).
Zhu, X. et al. Deformable DETR: Deformable transformers for end-to-end object detection. In International Conference on Learning Representations (ICLR) (2021).
Wang, H. et al. Anchor DETR: Query design for transformer-based detector. In AAAI Conference on Artificial Intelligence (AAAI) (2023).
Liu, S. et al. DAB-DETR: Dynamic anchor boxes are better queries for DETR. In International Conference on Learning Representations (ICLR) (2022).
Sun, Z. et al. RT-DETR: Real-time DEtection TRansformer. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 24697–24707 (2023).
Han, K. et al. Transformer in transformer. Adv. Neural Inf. Process. Syst. 34, 15908–15919 (2021).
Jin, L. & Wei, L. A defect detection method for solar panels based on improved YOLOv10n. J. Chongqing Technol. Bus. Univ. (Natural Sci. Ed.) 1–9 (2020).
Vivant, A. L., Garmyn, D. & Piveteau, P. Listeria monocytogenes, a down-to-earth pathogen. Front. Cell. Infect. Microbiol. 3, 87 (2013).
Wang, C. Y., Yeh, I. H. & Liao, H. Y. M. YOLOv9: Learning what you want to learn using programmable gradient information. In European Conference on Computer Vision 1–21 (Springer Nature Switzerland, 2025).
Hu, M. et al. Online convolutional re-parameterization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 568–577 (2022).
Kang, M. et al. A novel YOLO model with attentional scale sequence fusion for cell instance segmentation. Image Vis. Comput. 147, 105057 (2024).
Liu, W. et al. Learning to upsample by learning to sample. In Proceedings of the IEEE/CVF International Conference on Computer Vision 6027–6037 (2023).
Yolo Thermal. GB_HSP_modified dataset. Roboflow Universe. https://universe.roboflow.com/yolo-thermal-naplj/gb_hsp_modified (2023).
Faculty of Engineering, Fayoum University. PV_Train_Val_28_12 dataset. Roboflow Universe. https://universe.roboflow.com/faculty-of-engineering-fayoum-university/pv_train_val_28_12.
Saltık, A. O., Allmendinger, A. & Stein, A. Comparative analysis of YOLOv9, YOLOv10 and RT-DETR for real-time weed detection. In European Conference on Computer Vision 177–193 (Springer Nature Switzerland, 2024).
Li, C. et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv preprint arXiv:2209.02976 (2022).
Yang, X. et al. UAV-deployed deep learning network for real-time multi-class damage detection using model quantization techniques. Autom. Constr. 159, 105254 (2024).
Wang, A. et al. NVW-YOLOv8s: An improved YOLOv8s network for real-time detection and segmentation of tomato fruits at different ripeness stages. Comput. Electron. Agric. 219, 108833 (2024).
Zhang, Y. et al. Enhanced object detection in low-visibility haze conditions with YOLOv9s. PLoS One 20 (2), e0317852 (2025).
Wei, Z. & Wei, Y. YOLOv10n-based defect detection in power insulators: Attention enhancement and feature fusion optimisation. IEEE Access (2025).
Phat, N. T., Giang, N. L. & Duy, B. D. GAN-UAV-YOLOv10s: Improved YOLOv10s network for detecting small UAV targets in mountainous conditions based on infrared image data. Neural Comput. Appl. 1–13 (2025).
Wang, A. et al. A remote sensing image object detection model based on improved YOLOv11. Electronics (2025).
He, L. et al. Research on object detection and recognition in remote sensing images based on YOLOv11. Sci. Rep. 15 (1), 14032 (2025).
Ji, Y. et al. Transmission line defect detection algorithm based on improved YOLOv12. Electronics 14 (12), 2432 (2025).
Sapkota, R. et al. YOLOv12 to its genesis: A decadal and comprehensive review of the You Only Look Once (YOLO) series. arXiv preprint arXiv:2406.19407 (2024).
The authors would like to thank the editors and anonymous reviewers for their constructive comments and valuable suggestions.
This research was supported by the Graduate Innovation Fund Project of Anhui University of Science and Technology [Project No. 2024CX2061], and by the University-level General Project of Anhui University of Science and Technology under Grant [QNYB2021-10].
School of Electrical and Information Engineering, Anhui University of Science and Technology, Huainan, 232001, Anhui, China
Shuaishuai Yu, Fubao Gan, Tao Han, Xi Feng & Ke Chen
School of Computer Science and Engineering, Anhui University of Science and Technology, Huainan, 232001, Anhui, China
Shuainan Hou
Author Contributions: S.Y.: conceptualization, investigation, writing – original draft preparation, writing – review and editing. F.G.: validation, writing – original draft preparation, visualization, supervision. T.H.: methodology, formal analysis, writing – review and editing. S.H.: writing – original draft preparation, visualization. X.F.: validation, supervision. K.C.: conceptualization. All authors have read and agreed to the published version of the manuscript.
Correspondence to Fubao Gan.
The authors declare no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
Yu, S., Gan, F., Han, T. et al. Multimodal fault detection model for photovoltaic modules. Sci Rep 16, 3278 (2026). https://doi.org/10.1038/s41598-025-28603-4