CPVPD-2024: A Chinese photovoltaic plant dataset derived via a topography-enhanced deep learning framework – Nature

Scientific Data volume 12, Article number: 1601 (2025)
As the largest global photovoltaic (PV) market, China experiences continuous rapid growth in PV installed capacity, playing a crucial role in achieving carbon peaking and neutrality goals through this central pillar of the energy transition. To address data fragmentation and inconsistency in current PV datasets, this study develops the 2024 China Photovoltaic Power Plant Vector Dataset (CPVPD-2024) using a deep semantic segmentation framework (DSFA-SwinNet) with geospatial verification. The dataset comprehensively covers all 34 provincial-level administrative regions of China, achieving an overall Precision of 90.38% and Intersection over Union (IoU) of 81.78% in test zones, demonstrating significant improvements in identifying PV array gaps and detecting small-scale distributed power plants. Research results indicate that the total installed PV area in China reached 4,520.47 km² by 2024, exhibiting a characteristic spatial pattern dominated by agrivoltaic systems with concentrated distribution in arid regions. As the first national panel-level PV vector dataset, it enables precise PV site selection, ecological assessments, and AI-driven remote sensing analysis.
Photovoltaic (PV) power generation, a core technology for the global low-carbon energy transition, is reshaping the traditional energy landscape with its cleanliness, safety, and sustainability1. Under China's carbon peaking and carbon neutrality strategy, the PV industry has achieved remarkable growth. In 2023, China accounted for nearly 34% of total global PV installed capacity, more than doubling the capacity installed in Japan2. By the end of December 2024, newly installed PV capacity in China reached 277 GW, a year-on-year increase of 28%, while cumulative PV capacity exceeded 887 GW, a growth of 45.5% compared to 20233, solidifying China's leading position in the global PV market4,5. However, the absence of high-precision, high-resolution, and time-sensitive spatial data on PV power plants severely restricts scientific decision-making in carbon sink accounting and ecological impact assessment6,7, because conventional datasets systematically overestimate true PV coverage by including adjacent non-generation features such as maintenance accessways and equipment gaps. Current mainstream PV datasets can be classified into three categories based on their data sources:
Machine learning training data based on manual annotation and data augmentation techniques8.
Spatial distribution data from remote sensing image interpretation9,10.
Inventory data sourced from statistical reports of energy departments11,12.
Bradbury et al. develop a distributed PV statistical dataset for California by integrating manual reporting and registration data, which contains geographic information of over 19,000 PV panels13. However, the spatial representation is limited to point coordinates without geometric patterns or boundary information of PV arrays, consequently failing to meet refined planning requirements14. In the field of remote sensing interpretation, Malof et al. first apply a random forest classifier to pixel-level classification of remote sensing images in 201615, representing a breakthrough in automated methods for distributed PV identification. This method leverages the advantages of remote sensing technology in wide-area coverage and dynamic monitoring, significantly improving data acquisition efficiency while promoting the development of multi-source remote sensing datasets. For instance, Jiang et al. construct a multi-scene annotated dataset covering both distributed ground-mounted PV and rooftop PV systems based on multi-source remote sensing data from Jiangsu Province16. Meanwhile, Kasmi et al. develop a specialized dataset integrating aerial images, segmentation masks and technical parameters, with particular focus on rooftop PV characteristics17.
Current remote sensing interpretation methods primarily achieve target extraction through machine learning algorithms that integrate PV panel spectral features with manually designed rules18. Xia et al. innovatively apply a random forest classifier to the short-wave infrared band of Sentinel time-series data, successfully establishing an indicator for identifying water-related PV installation types19. Zhang et al. develop the first 30 m resolution distribution map of PV power plants in China through integration of random forest classification and manual visual correction20. Liu et al. construct a high-precision spatiotemporal dataset for 2015 and 2020 using Landsat-8 imagery, which demonstrates significantly improved classification accuracy through multi-stage post-processing21. Feng et al. incorporate Sentinel-2 imagery and topographic features with active learning strategies to create a 10 m resolution dataset for ground-mounted PV power plants in China22. However, these traditional feature-engineering-based methods struggle with complex PV scenes because manually designed features cannot adequately represent the diverse shapes of PV modules and complex environmental backgrounds23,24.
With the breakthrough of deep learning, remote sensing interpretation methods based on Convolutional Neural Networks (CNNs) demonstrate significant advantages25,26. Yu et al. develop the DeepSolar framework, which successfully establishes a high-fidelity solar deployment database covering the contiguous United States27. Li et al. create the first 20 m resolution global annual PV dataset by integrating U-Net architecture with positive-unlabelled learning methods, demonstrating the technical superiority of deep learning in complex feature extraction and generalization capabilities28. Nevertheless, existing datasets are plagued by three systematic errors.
Limited by medium-to-low-resolution imagery, gaps and shaded areas of dense PV arrays are frequently misclassified as installed regions, while the small-scale distributed PV systems are prone to identification omissions29.
Spectral similarity between PV panels and features like water bodies or building roofs causes inter-class confusion30, with false detection rates increasing significantly under cloudy or snow-covered conditions.
The data update cycle fails to match the rapid expansion of PV installations31, falling short of dynamic monitoring needs.
As illustrated in Fig. 1, current PV power plant datasets exhibit notable accuracy gaps. Panel (b) entirely misses the small-scale PV power plant area on the right. Panel (c), though detecting this area, fails to recognize the regular gaps within the array. Panel (d), while partially capturing the gap features, still misses smaller-scale gaps and narrow spaces between panels.
Examples of vector data misclassification. (a) Tianditu imagery covering 100.7810°E–100.8127°E, 38.7310°N–38.7389°N. (b) Data from the study by Liu et al.21 (c) Results from the study by Zhang et al.20 (d) Findings from the study by Feng et al.22 The resolutions are 30 m for (b) and (c), and 10 m for (d).
To address these challenges, this study proposes a PV power plant identification framework that combines deep semantic segmentation with geospatial verification, and constructs the 2024 China Photovoltaic Power Plant Vector Dataset (CPVPD-2024). The main components of the framework are illustrated in Fig. 2.
Overall architecture diagram.
First, based on the spatial stratified sampling strategy, this study integrates the 30 m resolution annual China Land Cover Dataset (CLCD) with 15-arc-second global elevation data from the General Bathymetric Chart of the Oceans (GEBCO) to construct a training sample set covering 15 terrain-landcover combination types. The set includes nine surface types such as cultivated land and bare land across low, medium, and high-altitude gradients, effectively enhancing generalization capability for complex geographic environments.
Second, the Dynamic Spatial-Frequency Attention SwinNet (DSFA-SwinNet) semantic segmentation architecture is developed, which incorporates the Dynamic Spatial-Frequency Attention (DSFA) mechanism to jointly optimize spatial texture features and frequency-domain edge responses. This integration significantly improves extraction accuracy of multi-scale PV panel features. During model training, Bayesian hyperparameter optimization combined with grid search algorithms is applied to enhance learning of PV features and accelerate convergence speed.
Finally, a multi-level morphological post-processing workflow is designed, utilizing the Canny edge detection algorithm for candidate region extraction, geometric constraints of PV panels for spectral noise filtering, and manual topological correction through QGIS platform to optimize geometric precision of vector boundaries.
The framework achieves the first panel-level vector characterization of PV power plants at national scale across all 34 provincial-level administrative regions in China. It quantifies spatial distribution patterns of 4,520.47 km² PV installations, attaining 90.38% precision and 81.78% Intersection over Union (IoU), while maintaining stable recognition precision above 87% for critical land types including cultivated land and grassland. Results indicate that PV development in China during 2024 displays a distinct spatial pattern dominated by agrivoltaics with clustering in arid regions. Cultivated land and grassland contributed nearly 70% of the national PV installed-capacity area, with low-altitude cultivated land (22.96%) and mid-altitude grassland (25.23%) as the core types. These findings demonstrate spatial coupling between PV industry and agricultural systems, along with strategic support from resource endowments in arid and semi-arid regions for renewable energy deployment.
In response to the policy target proposed by China to achieve over 550 GW of PV installed capacity by 2025, this study selects 2024 as the baseline year to capture spatial response characteristics of PV infrastructure development during the policy implementation window. Data involved in this study are presented in Table 1.
This study selects 2024 World Imagery Wayback32 and Tianditu satellite images33 as main data sources. As a national geographic information service platform, Tianditu provides satellite imagery (0.5–2 m resolution) featuring complete coverage and high geometric accuracy, ensuring sub-meter precision for PV power plant spatial feature extraction. World Imagery Wayback (0.6–1.2 m resolution) complements full territorial coverage including Taiwan Province through standardized global image services by ESRI.
For data acquisition, World Imagery Wayback serves as the data source for imagery of Hong Kong Special Administrative Region, Macao Special Administrative Region and Taiwan Province, while Tianditu imagery is uniformly adopted for other regions. Standardized access and automated downloading of multi-source satellite imagery are achieved through integration of QGIS built-in WMS connectors with GDAL data processing modules via PyQGIS programming interface.
To build original training data, the 30 m resolution PV map of China (PV 2020) produced by Zhang et al.34 serves as prior knowledge to guide localization of PV power plant samples. To enhance sample representativeness, a spatial stratified sampling strategy is implemented with environmental controls using CLCD35 at 30 m resolution and global elevation data from GEBCO36 at 15-arc-second resolution. The CLCD data encompasses nine land cover types: cultivated land, forest land, shrubland, grassland, water bodies, snow and ice, bare land, artificial surface, and wetlands, achieving 79.31% overall accuracy based on 5,463 visual interpretation samples.
To address the radiation heterogeneity issues in multi-source remote sensing imagery caused by sensor parameter variations, dynamic imaging phases, and atmospheric disturbances (including aerosol scattering, cloud occlusion, and solar radiation fluctuations), this study first removes the alpha channel containing non-terrain information from Tianditu imagery to eliminate redundant data and comply with input specifications of deep learning frameworks. Adaptive histogram equalization combined with gamma correction algorithms is then applied to dynamically balance image contrast and brightness.
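The two simplest steps of this preprocessing, dropping the alpha channel and applying gamma correction, can be sketched as follows. This is a minimal illustration only: the function names, the nested-list image layout, and the default gamma value are assumptions, and the adaptive histogram equalization step is deliberately omitted because the source does not give its parameters.

```python
def drop_alpha(pixel):
    """Keep only the R, G, B components of an (R, G, B, A) pixel."""
    return pixel[:3]


def gamma_correct(value, gamma):
    """Gamma-correct one 8-bit channel value (0-255).

    Convention assumed here: out = 255 * (in / 255) ** (1 / gamma),
    so gamma > 1 brightens mid-tones and gamma < 1 darkens them.
    """
    normalized = value / 255.0
    return round(255.0 * normalized ** (1.0 / gamma))


def preprocess(image_rgba, gamma=1.2):
    """Strip the alpha channel and gamma-correct every remaining channel.

    image_rgba is a list of rows, each row a list of (R, G, B, A) tuples.
    The default gamma of 1.2 is a hypothetical value, not one from the paper.
    """
    return [
        [tuple(gamma_correct(c, gamma) for c in drop_alpha(px)) for px in row]
        for row in image_rgba
    ]
```

In practice a contrast-limited adaptive histogram equalization routine from an image library would be applied before or after the gamma step; the order used by the authors is not stated.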
The spatial distribution of Chinese PV power plants is characterized by wide-area coverage, dispersion, and geographical heterogeneity. As shown in Fig. 3, their spatial patterns exhibit significant spatial differentiation.
Examples of Chinese PV power plants. All images are from Tianditu. (a) Located on cultivated land. (b) Located in forest land. (c) Located in shrubland. (d) Located on grassland. (e) Located on water bodies. (f) Located on bare land. (g) Located on ground-level artificial surface. (h) Located on rooftop artificial surface.
To enhance geographical representativeness of samples, this study integrates CLCD and GEBCO data to establish sampling zones through spatial stratified sampling. Elevation gradients are classified into three tiers following standard altitudinal zonation of China:
Low Elevation Areas (LEA): <500 m.
Mid Elevation Areas (MEA): 500–2000 m.
High Elevation Areas (HEA): >2000 m.
These tiers are cross-referenced with nine land cover types in CLCD to theoretically construct 27 sampling categories. Leveraging prior knowledge from the PV 2020 dataset, 1,200 sample points are randomly generated nationwide near vector edges. Subsequently, this study extracts 256 m × 256 m rectangular image units centered on the sample point coordinates. To mitigate sample imbalance, visual verification is conducted to screen and exclude image samples with either excessive or insufficient PV power plant coverage. Through this process, 1,006 valid samples representing 15 distinct geographical-combination types are selected for model training and validation. Additionally, 560 testing-zone samples are generated using the same method to check the validity of the CPVPD-2024 dataset, covering 17 geographical-combination types. The spatial distribution of sampling and testing zones is shown in Fig. 4.
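The stratification logic, three elevation tiers crossed with nine CLCD land cover types, can be sketched as below. The function and constant names are illustrative assumptions; the thresholds and class names are taken from the text, and elevations of exactly 500 m and 2000 m are assigned to MEA following the stated ranges.

```python
from itertools import product

# The nine CLCD land cover types listed in the text.
LAND_COVER_TYPES = [
    "cultivated land", "forest land", "shrubland", "grassland", "water bodies",
    "snow and ice", "bare land", "artificial surface", "wetlands",
]
TIERS = ["LEA", "MEA", "HEA"]


def elevation_tier(elevation_m):
    """Map an elevation in metres to LEA (<500), MEA (500-2000) or HEA (>2000)."""
    if elevation_m < 500:
        return "LEA"
    if elevation_m <= 2000:
        return "MEA"
    return "HEA"


def stratum(elevation_m, land_cover):
    """Return the sampling category label, e.g. 'MEA-grassland'."""
    if land_cover not in LAND_COVER_TYPES:
        raise ValueError(f"unknown land cover type: {land_cover!r}")
    return f"{elevation_tier(elevation_m)}-{land_cover}"


# All theoretical categories: 3 tiers x 9 land cover types = 27.
ALL_STRATA = [f"{tier}-{lc}" for tier, lc in product(TIERS, LAND_COVER_TYPES)]
```

Only a subset of the 27 theoretical categories actually contains PV plants, which is why the training samples cover 15 combinations and the testing samples cover 17.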
Spatial distribution map of sampling and testing zones.
For annotation, a binary mask approach is employed to delineate PV power plant areas from background features. The PV power plant areas are annotated as (255, 255, 255), while non-PV areas are marked as (0, 0, 0). The sampling dataset has a pixel ratio of 1:1.49 between PV and non-PV areas. To enhance model adaptability to complex environmental conditions, this study employs multiple data augmentation techniques to expand the dataset, such as brightness adjustment, darkness adjustment, Gaussian noise addition, mirror transformation, and random scaling. The augmentation effects are illustrated in Fig. 5. The final dataset is divided into training set and validation set at a ratio of 75% to 25%, containing 4,509 and 1,503 samples, respectively.
Examples of data augmentation effects. (a) Original image. (b) Brightness increases by 50%. (c) Brightness decreases by 50%. (d) Horizontal flip. (e) Gaussian noise injection with standard deviation σ = 0.05. (f) Random scaling.
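Several of the augmentations listed above can be sketched on a single-channel image stored as nested lists. This is a simplified illustration, not the released pipeline: the helper names are assumptions, random scaling is omitted, and the noise standard deviation of 0.05 is interpreted on a 0 to 1 intensity scale (so it is multiplied by 255 here). Geometric transforms are applied to the mask as well as the image so that labels stay aligned.

```python
import random


def clip8(value):
    """Round and clamp a value to the 8-bit range 0-255."""
    return max(0, min(255, round(value)))


def adjust_brightness(image, factor):
    """Scale every pixel by factor (1.5 = +50%, 0.5 = -50%)."""
    return [[clip8(p * factor) for p in row] for row in image]


def horizontal_flip(grid):
    """Mirror a 2D grid left to right."""
    return [row[::-1] for row in grid]


def add_gaussian_noise(image, sigma=0.05, rng=None):
    """Add zero-mean Gaussian noise; sigma is given on a 0-1 intensity scale."""
    rng = rng or random.Random()
    return [[clip8(p + rng.gauss(0.0, sigma * 255)) for p in row] for row in image]


def augment(image, mask, rng=None):
    """Return the original (image, mask) pair plus four augmented variants."""
    return [
        (image, mask),
        (adjust_brightness(image, 1.5), mask),
        (adjust_brightness(image, 0.5), mask),
        (horizontal_flip(image), horizontal_flip(mask)),
        (add_gaussian_noise(image, 0.05, rng), mask),
    ]
```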
This study adopts the DSFA-SwinNet37 to achieve multi-scale segmentation of PV power plants. The DSFA-SwinNet uses the Swin-Transformer as the backbone network of the encoder. Based on the DSFA in the refined skip connection strategy and the Pyramid Attention Refinement (PAR) bottleneck structure, it combines long-range dependencies and internal PV features to realize pixel-level segmentation of PV arrays and individual PV panels in terms of spatial distribution.
To strengthen the model's learning of prior knowledge on PV spatial distribution and speed up convergence, 1,250 sample-label pairs are randomly selected from the training set for hyperparameter optimization experiments and divided into training and validation sets at a ratio of 75% to 25%. Using the IoU as the optimization metric, Bayesian and grid search algorithms are employed to sequentially optimize the optimizer hyperparameters (lr, momentum, weight_decay), the mixed loss function weights (α, β, γ, flooding), and the loss weight parameters (W1, W2, W3, W4, W5) of the segmentation heads in each network layer. The search process is illustrated in Fig. 6, with the search configuration and tuning results detailed in Table 2.
Heatmap of IoU variations under different parameter combinations during optimization. (a) Learning rate. (b) Momentum. (c) Weight decay. (d) Weight of Binary Cross-Entropy (BCE) loss. (e) Weight of Dice loss. (f) Weight of Lovasz-Softmax loss. (g) Flooding, as proposed by Ishida et al.39, prevents overfitting through loss clipping. (h–l) Weights of decoder output feature maps across different layers, from high to low, in the mixed loss function.
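One plausible way the tuned quantities fit together is sketched below. It is an assumption-laden illustration rather than the authors' loss: the per-head loss is taken as α·BCE + β·Dice + γ·Lovász, the heads are summed with weights W1 to W5, and flooding is applied to the total using the form |L − b| + b from Ishida et al. The Lovász term is passed in as a precomputed number because its implementation is beyond the scope of this sketch, and all function names are hypothetical.

```python
import math


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over flat lists of probabilities and 0/1 labels."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss: 1 - (2*|P∩T| + s) / (|P| + |T| + s)."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)


def flood(loss, b):
    """Flooding: keeps the training loss from sinking below the flood level b."""
    return abs(loss - b) + b


def mixed_loss(probs, targets, alpha, beta, gamma, lovasz=0.0):
    """Weighted sum of BCE, Dice and a precomputed Lovasz term for one head."""
    return (alpha * bce_loss(probs, targets)
            + beta * dice_loss(probs, targets)
            + gamma * lovasz)


def deep_supervised_loss(head_losses, head_weights, flooding):
    """Combine per-head losses with weights W1..W5, then apply flooding."""
    total = sum(w * l for w, l in zip(head_weights, head_losses))
    return flood(total, flooding)
```

Whether flooding is applied per head or to the combined loss is not stated in the text; the sketch applies it once to the total.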
The preprocessed satellite images serve as input to the trained model. To address the dimensional constraints of high-resolution satellite images exceeding model input limits, this study designs a prediction framework based on edge padding and tile fusion, implemented through the following steps:
Using the top-left corner of image as the origin, pad pure white pixels (255,255,255) to the right and bottom to extend the image width and height to integer multiples of the input size n of the model, and record the padding amount to ensure subsequent geographic spatial consistency.
A sliding window tiling mechanism with n × n window size and stride length n performs non-overlapping segmentation of the padded image. The tiles are assembled into four-dimensional tensors for batch processing.
The tensor batches are fed into the model for prediction.
Prediction results are mosaicked and trimmed according to the recorded padding amounts from step 1, removing padded edges to eliminate boundary artifacts.
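The four steps above can be sketched as follows. To keep the example short it works on a single-channel image stored as nested lists and takes the model as an arbitrary `predict_tile` callable; both are simplifying assumptions, as are the function names. The padding value 255 corresponds to the white fill described in step 1.

```python
def pad_to_multiple(image, n, fill=255):
    """Pad right and bottom so height and width become multiples of n."""
    height, width = len(image), len(image[0])
    pad_h = (-height) % n
    pad_w = (-width) % n
    padded = [row + [fill] * pad_w for row in image]
    padded += [[fill] * (width + pad_w) for _ in range(pad_h)]
    return padded, (pad_h, pad_w)


def split_tiles(padded, n):
    """Cut the padded image into non-overlapping n x n tiles, row-major order."""
    tiles = []
    for top in range(0, len(padded), n):
        for left in range(0, len(padded[0]), n):
            tiles.append([row[left:left + n] for row in padded[top:top + n]])
    return tiles


def mosaic(tiles, n, padded_h, padded_w):
    """Reassemble row-major n x n tiles into one padded_h x padded_w grid."""
    cols = padded_w // n
    out = [[0] * padded_w for _ in range(padded_h)]
    for idx, tile in enumerate(tiles):
        top, left = (idx // cols) * n, (idx % cols) * n
        for r in range(n):
            out[top + r][left:left + n] = tile[r]
    return out


def predict_large_image(image, n, predict_tile):
    """Pad, tile, predict each tile, mosaic, then trim the padding away."""
    height, width = len(image), len(image[0])
    padded, _ = pad_to_multiple(image, n)
    predictions = [predict_tile(tile) for tile in split_tiles(padded, n)]
    full = mosaic(predictions, n, len(padded), len(padded[0]))
    return [row[:width] for row in full[:height]]
```

Because padding is only added to the right and bottom, trimming to the original height and width restores exact pixel alignment with the source image, which is what preserves the georeferencing.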
To address salt-and-pepper noise and discrete misclassified patches in the pixel-level segmentation results, this study proposes a multi-level morphological filtering method based on geometric characteristics of PV components. The Canny edge-detection algorithm is utilized to extract gradient-significant regions within the binary image. Subsequently, connected-component analysis identifies potential PV power-station clusters. Tiny noise is eliminated through the application of an area threshold. Based on the rectangular geometry of individual PV panels, two criteria are established:
Elongated rectangle aspect-ratio thresholding to exclude irregular objects.
Minimum-enclosing-rectangle angle calculation with directional-consistency constraints to select targets matching the regular PV-panel arrangement.
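A minimal sketch of the filtering logic, applied to components whose minimum enclosing rectangles have already been measured, is given below. The `Component` fields, all three threshold values, and the use of the median angle as the dominant direction are assumptions made for illustration; the text does not give the actual thresholds. The sketch also assumes that elongated rectangles are the ones retained.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Component:
    area_px: int       # pixel count of the connected component
    width: float       # side lengths of its minimum enclosing rectangle
    height: float
    angle_deg: float   # orientation of the rectangle, in [0, 180)


def aspect_ratio(c):
    """Long side divided by short side (infinite if the short side is zero)."""
    short, long_ = sorted((c.width, c.height))
    return float("inf") if short == 0 else long_ / short


def angle_difference(a, b):
    """Smallest difference between two orientations, treating 0 and 180 as equal."""
    d = abs(a - b) % 180
    return min(d, 180 - d)


def filter_components(components, min_area=20, min_aspect=1.5, max_angle_dev=15.0):
    """Apply the area, aspect-ratio and directional-consistency checks in turn."""
    large = [c for c in components if c.area_px >= min_area]
    elongated = [c for c in large if aspect_ratio(c) >= min_aspect]
    if not elongated:
        return []
    # Dominant direction estimated as the median angle (hypothetical choice;
    # it is not robust when angles straddle the 0/180 wrap-around).
    dominant = median(c.angle_deg for c in elongated)
    return [c for c in elongated
            if angle_difference(c.angle_deg, dominant) <= max_angle_dev]
```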
The georeferencing information comprising longitude and latitude ranges along with map projection parameters from the original imagery is systematically transferred to the filtered binary masks, followed by batch conversion into vector polygons to guarantee spatial consistency between vectorized outputs and satellite imagery. For enhanced vector precision, this study implements manual quality inspection through QGIS platform integrated with high-resolution Tianditu imagery, utilizing topological verification tools to rectify contour jaggedness and discontinuity artifacts while optimizing irregular polygons based on standard PV array spatial distribution patterns. The finalized vector data undergoes integration with provincial administrative boundaries, ultimately producing a nationwide PV power plant vector dataset that comprehensively covers all 34 provincial-level administrative units across China.
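The transfer of georeferencing from the image extent to mask pixels can be illustrated with a simple linear mapping. The sketch assumes a north-up image in geographic coordinates whose extent is given as (lon_min, lat_min, lon_max, lat_max); the function names are hypothetical, and a production workflow would normally rely on a GIS library's affine transform and polygonization tools instead.

```python
def pixel_to_lonlat(row, col, bounds, shape):
    """Return the (lon, lat) of the centre of pixel (row, col).

    bounds = (lon_min, lat_min, lon_max, lat_max); shape = (n_rows, n_cols).
    Row 0 is the northernmost row, column 0 the westernmost column.
    """
    lon_min, lat_min, lon_max, lat_max = bounds
    n_rows, n_cols = shape
    lon = lon_min + (col + 0.5) * (lon_max - lon_min) / n_cols
    lat = lat_max - (row + 0.5) * (lat_max - lat_min) / n_rows
    return lon, lat


def pixel_ring(row, col, bounds, shape):
    """Return the closed ring (5 vertices) outlining one pixel in lon/lat."""
    lon_min, lat_min, lon_max, lat_max = bounds
    n_rows, n_cols = shape
    dx = (lon_max - lon_min) / n_cols
    dy = (lat_max - lat_min) / n_rows
    west, east = lon_min + col * dx, lon_min + (col + 1) * dx
    north, south = lat_max - row * dy, lat_max - (row + 1) * dy
    return [(west, north), (east, north), (east, south), (west, south), (west, north)]
```

For example, with the extent of Fig. 1(a), `pixel_to_lonlat(0, 0, (100.7810, 38.7310, 100.8127, 38.7389), shape)` returns a point just inside the north-west corner of the image.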
The CPVPD-2024 dataset38 is available at the Zenodo repository (Link: https://doi.org/10.5281/zenodo.15618227).
The CPVPD-2024 dataset contains 34 independent vector layers named following provincial administrative divisions of China, stored in standard ESRI Shapefile format. Each layer includes five components: geometric data in .shp format, spatial index in .shx format, projection definition in .prj format, attribute tables in .dbf format, and character encoding files in .cpg format. The total uncompressed storage size reaches 1.36 GB, with compressed ZIP archive size of 332 MB.
All vector data uses EPSG:4326 (WGS 84 geographic coordinate system) as the spatial reference. The attribute table includes six fields: Uid, Area, latitude, longitude, landcover, and dem, as shown in Table 3.
The Uid field provides unique identification for each PV power plant feature. Area represents feature area in square meters, calculated using ESRI:102025 (Asia North Albers Equal Area Conic) projection. Latitude and longitude fields show centroid coordinates of each PV feature in decimal degrees under EPSG:4326. Landcover and dem fields derive from 2023 CLCD land cover data and GEBCO elevation data respectively, indicating dominant land cover type and mean elevation value for each PV site.
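As a usage illustration, the six attribute fields can be mirrored in a small record type and aggregated by land cover. The field names follow the attribute table; the Python types, the assumption that `landcover` holds a text label, and the sample values in the usage note are made up for the example and do not come from the dataset.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class PVFeature:
    Uid: str          # unique identifier (type assumed)
    Area: float       # feature area in square metres
    latitude: float   # centroid latitude, decimal degrees (EPSG:4326)
    longitude: float  # centroid longitude, decimal degrees (EPSG:4326)
    landcover: str    # dominant land cover type (encoding assumed)
    dem: float        # mean elevation in metres


def area_km2_by_landcover(features):
    """Sum feature areas per land cover type, converting m^2 to km^2."""
    totals = defaultdict(float)
    for feature in features:
        totals[feature.landcover] += feature.Area / 1_000_000
    return dict(totals)
```

With three made-up features of 2,000,000 m² and 500,000 m² on cultivated land and 1,000,000 m² on grassland, the function returns 2.5 km² and 1.0 km² for the two classes.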
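The test zone sample dataset covers 17 typical geographical combination types across China, comprising 560 standard sample units of 256 m × 256 m, with a total area of 3,670 hm², as spatially distributed in Fig. 4. A confusion matrix quantifies the effectiveness of the CPVPD-2024 dataset, defining PV power plant areas as the positive class and non-PV areas as the negative class, with four metrics calculated: Precision, Recall, F1-score, and IoU. The metrics are defined in Eqs. (1–4).
\[ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{1} \]
\[ \mathrm{Recall} = \frac{TP}{TP + FN} \tag{2} \]
\[ \mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{3} \]
\[ \mathrm{IoU} = \frac{TP}{TP + FP + FN} \tag{4} \]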
True positive (TP) represents the sample ratio of correctly classified PV power plant pixels; true negative (TN) represents the sample ratio of correctly classified background pixels; false positive (FP) represents the sample ratio of misclassified background pixels; false negative (FN) represents the sample ratio of misclassified PV power plant pixels.
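The four metrics can be computed from a predicted mask and a reference mask as sketched below, assuming the standard confusion-matrix definitions: Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1-score as the harmonic mean of the two, and IoU = TP/(TP+FP+FN). The function names are illustrative, and any non-zero mask value (such as 255) is treated as PV.

```python
def confusion_counts(pred_mask, true_mask):
    """Count TP, FP, FN, TN pixels; non-zero values mean 'PV'."""
    tp = fp = fn = tn = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def segmentation_metrics(tp, fp, fn):
    """Return Precision, Recall, F1-score and IoU (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}
```

True negatives do not enter any of the four metrics, which is why the scores are insensitive to how much background each test tile contains.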
Table 4 presents the validation metrics for the 17 terrain-land cover combinations and the overall dataset in the test zone, comprehensively demonstrating the performance of CPVPD-2024. The results reveal significant variations in PV identification accuracy across different geographical environments. The HEA-Woodland combination achieves the highest Recall (98.63%), with an F1-score of 92.54% and IoU reaching 86.18%, primarily attributed to the distinct spectral characteristics of PV in forested high-altitude regions. Meanwhile, MEA-Artificial Surface also performs well, with 98.22% Recall, 95.00% F1-score, and 90.50% IoU, benefiting from the regular arrangement and clear boundaries of PV arrays in built-up areas. MEA-Shrubland attains the optimal Precision (97.79%), alongside 88.36% Recall, 92.84% F1-score, and 86.63% IoU, suggesting exceptional PV identification accuracy in shrub-dominated mid-elevation areas. In contrast, LEA-Woodland shows the lowest Recall (76.81%) due to vegetation occlusion in low-elevation forests. Notably, HEA-Bare Land demonstrates the optimal F1-score (95.44%) and IoU (92.04%), while maintaining high Precision (94.99%) and Recall (95.89%), indicating superior PV identification performance in open high-altitude regions.
Overall, CPVPD-2024 demonstrates robust performance across the entire test set with 90.38% Precision, 89.65% Recall and 81.78% IoU, verifying its stability and reliability in complex environments. A particularly significant finding is the consistent maintenance of over 75% IoU for Cultivated Land across all three elevation zones from low to high altitude, which confirms strong method adaptability for agrivoltaics applications.
As shown in Fig. 7, the CPVPD-2024 dataset exhibits high consistency with manual annotations across PV power plant regions with diverse texture features, spatial densities, and spectral characteristics. The visual comparison results indicate that the annotated PV array regions in the dataset provide complete coverage and accurately reflect the actual distribution of various PV facilities. Notably, FN primarily appear at the edges of target regions, likely due to blurred boundary features under complex lighting conditions. As illustrated in Fig. 7(d,e), FP are predominantly distributed in densely vegetated areas, reflecting the annotation challenges caused by spectral similarities between vegetation and PV components.
Visual comparison of ground truth labels and the CPVPD-2024 dataset in the test zone, accompanied by color-coded confusion matrices. FP and FN are shown in blue and red, respectively.
The CPVPD-2024 dataset indicates that the total installed area of PV power plants in China during 2024 reaches 4,520.47 km², demonstrating a pronounced spatial distribution pattern with high-density distributed PV systems concentrated in eastern regions and large-scale centralized PV installations dispersed across western territories.
Figure 8 demonstrates the recognition performance of the CPVPD-2024 dataset across 12 representative geographical environments, highlighting significant breakthroughs in identifying both inter-panel gaps and inter-array spacing of PV systems.
CPVPD-2024 Dataset Samples. (a) LEA-Artificial Surface. (b) LEA-Cultivated Land. (c) LEA-Grassland. (d) LEA-Waterbody. (e) MEA-Artificial Surface. (f) MEA-Bare Land. (g) MEA-Cultivated Land. (h) MEA-Grassland. (i) MEA-Woodland. (j) HEA-Grassland. (k) HEA-Woodland. (l) HEA-Shrubland.
Figure 9 shows a visualization example of the post-processing framework for PV power plant detection. As shown in the column of Fig. 9(a), the DSFA-SwinNet model can accurately segment the PV panel areas while avoiding false detection of panel gaps. However, due to the complex background of remote sensing images, the model still exhibits residual misclassifications, primarily manifested as shadows on bare land, vegetated areas between PV arrays, and easily confused features such as roads or walls around the power station. Additionally, due to the limitations of pixel-level segmentation, the initial results also contain some salt-and-pepper noise.
Example of PV power plant detection results under different steps. (1) 88.2628°E–88.3460°E, 43.5225°N–43.5449°N. (2) 88.9548°E–88.9989°E, 44.0089°N–44.0206°N. (3) 109.5649°E–109.7744°E, 40.2509°N–40.3101°N. (4) 113.5810°E–113.6192°E, 43.8273°N–43.8375°N.
After applying the multi-level morphological filtering method based on the geometric characteristics of PV components for post-processing, as shown in the column of Fig. 9(b), small noise points and discrete patches are effectively filtered out while preserving the overall structural features of the PV components. However, this method still leaves behind some larger elongated patches, mainly originating from mountain shadows, cloud cover, or linear features with spectral characteristics similar to PV arrays, such as roads. Finally, as shown in the column of Fig. 9(c), manual adjustments further optimize the results, precisely eliminating false-detected patches and ensuring the integrity of vector boundaries and geospatial accuracy.
The implementation details and exemplary resources of the CPVPD-2024 dataset framework are publicly accessible at https://github.com/cookie1129gu/CPVPD-2024.git. The released codebase includes:
• Test set samples with validation code for performance evaluation.
• Pre-trained model parameters.
• Preprocessing/postprocessing scripts.
• Training and batch prediction pipelines.
• Data utilization and analytical code samples.
Ali, T. H. et al. Application of Artifical Intelligence in Construction Waste Management. in 2019 8th International Conference on Industrial Technology and Management (ICITM) 50–55. https://doi.org/10.1109/ICITM.2019.8710680 (2019).
Hossain, M. S., Wadi Al-Fatlawi, A., Kumar, L., Fang, Y. R. & Assad, M. E. H. Solar PV high-penetration scenario: an overview of the global PV power status and future growth. Energy Syst. https://doi.org/10.1007/s12667-024-00692-6 (2024).
The Construction Situation of Photovoltaic Power Generation in 2024. National Energy Administration (NEA). https://www.nea.gov.cn/20250221/f04452701c914d51a89d0c0ea6f4acd1/c.html.
Masson, G. et al. A Snapshot of the Global PV Market. in 2024 IEEE 52nd Photovoltaic Specialist Conference (PVSC) 0566–0568. https://doi.org/10.1109/PVSC57443.2024.10749131 (2024).
Wang, F. & Liu, W. The Current Status, Challenges, and Future of China’s Photovoltaic Industry: A Literature Review and Outlook. Energies 17, 5694 (2024).
Wang, J. et al. Mapping national-scale photovoltaic power stations using a novel enhanced photovoltaic index and evaluating carbon reduction benefits. Energy Convers. Manag. 318, 118894 (2024).
Calvert, K., Pearce, J. M. & Mabee, W. E. Toward renewable energy geo-information infrastructures: Applications of GIScience and remote sensing that build institutional capacity. Renew. Sustain. Energy Rev. 18, 416–429 (2013).
Stowell, D. et al. A harmonised, high-coverage, open dataset of solar photovoltaic installations in the UK. Sci. Data 7, 1–15 (2020).
Ortiz, A. et al. An Artificial Intelligence Dataset for Solar Energy Locations in India. Sci. Data 9, 1–13 (2022).
Sterl, S. et al. An all-Africa dataset of energy model “supply regions” for solar photovoltaic and wind power. Sci. Data 9, 664 (2022).
Kruitwagen, L. et al. A global inventory of photovoltaic solar energy generating units. Nature 598, 604–610 (2021).
Hasan, S., Blinov, A., Chub, A. & Vinnikov, D. Solar PV Generation and Consumption Dataset of an Estonian Residential Dwelling. Sci. Data 12, 481 (2025).
Bradbury, K. et al. Distributed solar photovoltaic array location and extent dataset for remote sensing object identification. Sci. Data 3, 160106 (2016).
Zhou, Y., Wilmink, D., Zeman, M., Isabella, O. & Ziar, H. A geographic information system-based large scale visibility assessment tool for multi-criteria photovoltaic planning on urban building roofs. Renew. Sustain. Energy Rev. 188, 113885 (2023).
Lu, W., Chen, J. & Xue, F. Using computer vision to recognize composition of construction waste mixtures: A semantic segmentation approach. Resour. Conserv. Recycl. 178, 106022 (2022).
Abdallah, M. et al. Artificial intelligence applications in solid waste management: A systematic research review. Waste Manag. 109, 231–246 (2020).
Deng, J. et al. ImageNet: A large-scale hierarchical image database. in 2009 IEEE Conference on Computer Vision and Pattern Recognition 248–255. https://doi.org/10.1109/CVPR.2009.5206848 (2009).
Plakman, V., Rosier, J. & Vliet, J. van. Solar park detection from publicly available satellite imagery. GIScience Remote Sens. (2022).
Xia, Z., Li, Y., Guo, X. & Chen, R. High-resolution mapping of water photovoltaic development in China through satellite imagery. Int. J. Appl. Earth Obs. Geoinformation 107, 102707 (2022).
Zhang, X., Xu, M., Wang, S., Huang, Y. & Xie, Z. Mapping photovoltaic power plants in China using Landsat, random forest, and Google Earth Engine. Earth Syst. Sci. Data 14, 3743–3755 (2022).
Liu, J., Wang, J. & Li, L. Vectorized solar photovoltaic installation dataset across China in 2015 and 2020. Sci. Data 11, 1446 (2024).
Feng, Q. et al. A 10-m national-scale map of ground-mounted photovoltaic power stations in China of 2020. Sci. Data 11, 1–15 (2024).
Chen, Q. et al. Remote sensing of photovoltaic scenarios: Techniques, applications and future directions. Appl. Energy 333, 120579 (2023).
Lu, N., Li, L. & Qin, J. PV Identifier: Extraction of small-scale distributed photovoltaics in complex environments from high spatial resolution remote sensing images. Appl. Energy 365, 123311 (2024).
Chen, Y., Zhou, J., Ge, Y. & Dong, J. Uncovering the rapid expansion of photovoltaic power plants in China from 2010 to 2022 using satellite data and deep learning. Remote Sens. Environ. 305, 114100 (2024).
Guo, Z. et al. TransPV: Refining photovoltaic panel detection accuracy through a vision transformer-based deep learning model. Appl. Energy 355, 122282 (2024).
Yu, J., Wang, Z., Majumdar, A. & Rajagopal, R. DeepSolar: A Machine Learning Framework to Efficiently Construct a Solar Deployment Database in the United States. Joule 2, 2605–2617 (2018).
Li, A. et al. Global photovoltaic solar panel dataset from 2019 to 2022. Sci. Data 12, 637 (2025).
Wang, J. et al. PVNet: A novel semantic segmentation model for extracting high-quality photovoltaic panels in large-scale systems from high-resolution remote sensing imagery. Int. J. Appl. Earth Obs. Geoinformation 119, 103309 (2023).
Ji, C. et al. Solar photovoltaic module detection using laboratory and airborne imaging spectroscopy data. Remote Sens. Environ. 266, 112692 (2021).
Xia, Z. et al. Mapping the rapid development of photovoltaic power stations in northwestern China using remote sensing. Energy Rep. 8, 4117–4127 (2022).
ArcGIS Living Atlas of the World. World Imagery Wayback. https://livingatlas.arcgis.com/wayback/.
National Geospatial Information Public Service Platform, Tianditu. https://www.tianditu.gov.cn/.
Zhang, X., Wang, S., Huang, Y., Xie, Z. & Xu, M. The dataset of photovoltaic power plant distribution in China by 2020. https://doi.org/10.5281/zenodo.6849477 (2022).
Yang, J. & Huang, X. The 30 m annual land cover datasets and its dynamics in China from 1985 to 2023. https://doi.org/10.5281/zenodo.12779975 (2024).
GEBCO | General Bathymetric Chart of the Oceans. https://www.gebco.net/.
Lin, S., Yang, Y., Liu, X. & Tian, L. DSFA-SwinNet: A Multi-Scale Attention Fusion Network for Photovoltaic Areas Detection. Remote Sens. 17, 332 (2025).
Yang, Y., Lin, S. & Liu, X. CPVPD-2024: A photovoltaic plant vector dataset derived from Chinese remote sensing imagery via a topography-enhanced deep learning framework with dynamic spatial-frequency attention. Zenodo https://doi.org/10.5281/zenodo.15618227 (2025).
Ishida, T., Yamane, I., Sakai, T., Niu, G. & Sugiyama, M. Do We Need Zero Training Loss After Achieving Zero Training Error? in Proceedings of the 37th International Conference on Machine Learning 4604–4614. https://doi.org/10.48550/arXiv.2002.08709 (PMLR, 2020).
This study was supported by the National Key R&D Program (No. 2020YFB2104400).
College of Computer Science, Beijing University of Technology, Chaoyang District, Beijing, 100124, China
Yang Yang, Shaofu Lin, Rongtian Lu & Xiliang Liu
All authors contributed to conceiving and designing this study, as well as drafting the manuscript. Y.Y. carried out data processing, performed technical validation, conducted related experiments, and wrote the manuscript. S.L. defined the research topic, oversaw this study, and revised the manuscript. R.L. participated in manuscript revision. X.L. provided methodological guidance, supervised this study, and contributed to manuscript revision.
Correspondence to Shaofu Lin or Xiliang Liu.
The authors declare no competing interests.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
Yang, Y., Lin, S., Lu, R. et al. CPVPD-2024: A Chinese photovoltaic plant dataset derived via a topography-enhanced deep learning framework. Sci Data 12, 1601 (2025). https://doi.org/10.1038/s41597-025-05891-z
DOI: https://doi.org/10.1038/s41597-025-05891-z
Scientific Data (Sci Data)
ISSN 2052-4463 (online)
© 2025 Springer Nature Limited