Validation and Optimisation of Common Clear Sky Models with a use case for North-East Bulgaria

One of the most significant error contributors to preliminary design tools for Photovoltaic power systems is related to the simple parametric Clear Sky models. Therefore, this paper focuses on providing a methodology and a more sophisticated open-source tool for 3 commonly used Clear Sky models. This includes all relevant steps involved in the process from filtering the raw meteorological data, identification of Clear Sky regions, data redistribution to genetic optimization of selected model parameter, etc.use case is built upon a multiyear dataset obtained from TU Varna meteorological station between 2012-2016. A significantly higher density distribution of Clear sky segments was identified during the summer through the Clear Sky Identification algorithm. To avoid the risk of overfitting the models to purely summer months and poor model fits in winter months, which was found to be the case with the legacy model, the underrepresented clear sky regions (based on θ) were replicated until uniform distribution is attained. Subsequently, a genetic optimization was applied to selected parameters in the Clear Sky algorithms and the updated models showed a significant improvement in low winter months (θ) and even overall performance boost RMSE / MAE /R2. Furthermore, such validations and optimizations are recommended prior to any design or real-time PV-system analysis for the specific location.


Introduction
Living in a decade of renewable energy and in particular, solar energy, photovoltaics (PV) became a mainstream energy source with potentially decreasing governmental subsidies on the industry side, but ever increasing interest (Reno & Hansen, 2016). Moreover, a trendreaching the asymptotic effect of economically valuable efficiency increase in academic studies can also be seen in recent years. Therefore, the need to improve the modelling accuracy and energy potential became paramount.
Through the last decades multiple Clear Sky models were developed with the goal of creating a universal one capable of estimating the Global Horizontal Irradiance (GHI) in a cloudless sky. These values have been shown to be of a great importance for calculating the solar generation potential of PV-systems and more importantly Direct Beam (DBI) & Global Diffuse Irradiance (GDI) from GHI.
There are two types of Clear Sky Models, first physical parameterization models like REST2, SMARTS, etc, which physically estimate the solar absorption from O3, NO2, mixed gasses, water precipitation, also the aerosol extinction and most importantly the Rayleigh scattering. These state of the art models achieve very high performance and consequently are the choice for satellite remote sensing systems like Galileo (ESA, 2017) . However, they have one important limitationthe necessity for at least 10 hard to acquire parameters. (Gueymard, 2008) .
Simpler models on the other hand like the ones by Robledo & Soler (Robledo & Soler, 2000), Innechen and Perez (Ineichen & Perez, 2002), Hottel (Hottel, 1976), are widely used in tools for preliminary PV-system sizing, performance evaluation and equipment calibration like in one of the widely used tools for preliminary PV-system design -PV-Syst. For such models, many parameters like Linkie Turbidity (TL) for example are constant for all locations and even all seasons. These simplifications and the corresponding models have a major drawback of being extremely sensitive to local meteorological variations. Multiple validation studies have been conducted across the globe and the findings were vastly different depending on the location, type of climate and the specific model. (Badescu et al., 2012) (Reno et al., 2014.) Moreover, the majority of the models were developed with the goal of estimating the overall energy yield of a system and therefore have a great accuracy varia-

2
Overview of the process In order to achieve the goals of the paper, first, a high-level overview of the methodology is adopted. The result of this methodology and the algorithm that followed afterwards can be seen in the functional flow diagram in Figure 1. As it can be seen from the diagram, first the raw multiyear meteorological data is loaded, filtered for days containing erroneous readings or missing data and last but not least the training set for optimization is selected, see Section 3.
The next crucial step, see Section 4, is the identification of the clear sky regions (regions with Global horizontal irradiance (GHI) unscatterred by clouds or other atmospheric phenomena). This is achieved by comparing consecutive GHI measurements, their speed of change and other thresholds with the best fitting Clear sky model (iteratively chosen).
Once the Clear Sky regions are identified, all clear models are executed and a statistical analysis of their best fit is performed, see Section 5 and Section 6. Those statistical results are then used for validation purposes for first choosing the most accurate model at the location of interest and then for initial value of the global optimization of the empirical parameters.

Data Acquisition and Preprocessing
The Technical University of Varna (TU-Varna) multiyear meteorological measurements were obtained for the period between 2012-2016 year from the meteorological station of 27°56.3065'E ). This dataset includes measurements from two sensors for Global Horizontal Irradiance (GHI), for Ambient temperature (Tamb), Windspeed (V) and Atmospheric Pressure (p), where a single sensor provides humidity data (η). The time-sampling of the station is 10 min and is remotely logged on a centralized system without any preprocessing or filtering. Therefore, after a short investigation, it was concluded that the raw dataset contains several errors. Those are: missing loggings of some of the sensors, missing timestamps, clearly incorrect values of the windspeed and the most significant a clear mismatch between timestamp and the GHI measurements. In order to reduce the effect of those errors to minimum, but keep the sample size similar, three actions were applied. First, all days with missing measurements of GHI or timestamps (days with less 144 samples) were excluded from the data.
Once the first automatic filter is applied, the solar zenith (θz) angle is calculated using Equations 1-6 in Appendix A , (Duffie et al., 1985). By definition, θz is between -90° and +90°, when the sun is above the horizon. Therefore, the algorithm automatically detects the days with mismatch between the timestamps and the GHI measurements, by looking for positive GHI when the sun is below horizon, adding 1.5° threshold for accounting phenomena like reflection and cloud scattering of diffuse irradiance.
Some of these filtered days, however, have errorless measurements, with the only exception of their timestamp. Therefore, an algorithm was developed to move the whole dataset for the problematic days back or further and allocate the first GHI measurement to the sunrise or θz = -90°, see Figure 2. Since this solution works for the majority, but not for all daysthese out of sync days and their data correctness visually verified and if necessary, filtered out.

Identification of Clear sky period
Since the meteorological system logs measurements on continuous steps, there arises the need for identification of Clear Sky periods from the ones with clouds, reflections or sensor errors. The most robust way for this is the usage of direct (beam) normal Irradiance (DNI) and GHI data. However, due to the lack of such data, alternative method can be applied. It is a modification of the proposed criteria by Reno et all (Reno & Hansen, 2016) which does not take into account the " L-length" criteria, but imposes new thresholds and additional criteria limiting the absolute error of the Clear Sky Model below 12.5%. Initially, the filtering algorithm assumes and uses the best original Clear Sky models from literature, Hottel, (Hottel, 1976) and applies all filter criteria based on it. For more information on the Hottel, please refer to Section 5. For all filters and threshold values please refer to Table 1, taking into account that the applied moving window was 4 samples or 40 min.  Figure 3 (a), (b), it can be seen that the algorithm is capable of identifying Clear Sky samples from different seasons, resulting in high or low GHI maximums at solar noon, depending on θz and Earth's declination, β.
Moreover, Figure 3(b) clearly validates the need for an optimization of the Hottel's Model for North-East Bulgaria showing an underestimation of between 3-7% during the winter months. Additionally, the figures also verify the chosen thresholds for maximum and mean errors by allowing identification with not optimally fitted Clear Sky model. The samples and periods with scattered clouds are also cleaned from the dataset, as it can be seen with the GHI peaks in Figure 3(b), (c), not meeting almost all criteria.

Page | 5
Please note the filtered periods at 9:30 in Figure 3(c) and at 17:20 in Figure 3(d), which is a good illustration of the effectiveness of numerical differentiation threshold, ±12.5 W/m 2 . This is not for the whole moving window (multiple samples), but for the single measurement, where all other criteria are satisfied (including σ of the moving window).
All in all, the distribution of the identified samples is crucial for the optimization process, see Section 6.
Two distributions are analyzed, see Figure 4. Figure 4(a), given on the left-hand side of the diagram, clearly shows that the original measurements of GHI are with higher density on the low spectrum and are exponentially inversely decreasing to the measured GHI. After identification of the Clear Sky periods, the distribution becomes more uniform (maximum deviation of 23%), with two minor exceptions for the range of 30-70 W/m 2 and higher than 900 W/m 2 . Therefore, no specific measures need to be taken for the GHI distribution.
This, however, is not the case for the distribution on a day-of-the-year basis, see Figure 4(b). As expected, the density is shifted to the summer, where the cloudless periods are substantially higher than those during the winters and the autumns. If the dataset had been left uncorrected, it would have led to a GHI underestimation for the simpler Clear Sky models during autumn and winter seasons. Therefore, the need for weighting these sample periods is applied through a replication of the measurements during the underrepresented measurement periods. This dataset will be hereinafter referred to as the renormalized one.

Selected Clear Sky Models
For validation purposes, several common Clear Sky models are selected. These are Hottel (Hottel, 1976), and Kasten (Kasten, 1980), Robledo and Soler (Robledo & Soler, 2000) The main reason for this choice is governed by the availability of measurements by the meteorological station. The authors acknowledge that more complex and state of the art models like REST2, Bird and MAC are likely to provide better accuracy, but this is impossible to validate or optimize with confidence without retrieval of data for atmospheric parameters like atmospheric ozone and NO2 content, ground reflectance, etc. These are also very rarely available measurements during the initial stages of PV-system development and hence it is not feasible to expect unless GIS information is integrated in the process.
It should be noted that the short description of the models may not provide all the information necessary for the reproduction directly from the paper-in such cases, please refer to the source provided for calculation of the hidden parameter estimations, like the extraterrestrial radiation, see Equation 1 and Appendix A for the rest of the solar angles.

=
( 1.0011 + 0.034221 + 0.00128 + 0.000719 cos 2 + 0.00007 2 (1) The second model, (Hottel 1976), uses the zenith angle (θ), the local altitude and empirical coefficients for the estimation of the beam radiation atmospheric transmittance coefficient (τb), see Equation 3. Please note that a0, a1, k are a linear functions of the local altitude and the empirically derived coefficients for the different seasons at lower altitudes below 2.5 km (Hottel, 1976). The correction factors for the Mid-attitude winter of a0*1.03; a1*1.01 was also used as described by Hottel. Similarly, with empirically derived diffuse transmittance coefficients from Liu and Jordan (τd) the diffused part of the GHI is calculated, see Equation 4. (Liu, B.Y.H., 1962) Last, the two radiation components added together determining the final GHI.
The third investigated legacy model was developed by Kasten. (Kasten, 1980) This model has been modified multiple times in the past to result ultimately in models such as Innechen and Perez. It calculates the GHI based on the local altitude(h) in meters, the zenith angle (θ), but most importantly it takes into account the atmospheric turbidity (TL) and the air mass (AM). Whereas the TLs are intrinsically complex for determination and with high variability from the local climate and environmental pollution, the air mass is straight forwardsee Equation 9 as given in (Myers, 2013) = 0.84 (θ ) −0.027 ( ℎ1 + ℎ2 ( −1)) (12)

Validation and Optimization
Once the basis of the models is implemented in Matlab, their performance can now be assessed through statistical parameters for the TU-Varna dataset. For overall(annual) performance evaluation, some classical Clear Sky performance statistical parameters were chosen-RMSE, MAE, R 2 . This choice was mainly motivated by the possibility to compare results with other validation studies such as (Badescu et al., 2012), (Engerer & Mills, 2015), (Reno et al., 2014.) Then through careful selection of the empirical parameters for optimization and choosing a meaningful optimization algorithm, it is possible to improve the overall (annual) performance of the Clear Sky algorithm for the specific site of interest. A distinguishable feature of the current method is the goal of improving not only the overall (annual) performance but more importantly to minimize the error in high solar zenith angles (θ) for short-term (intra-day) simulations and short-term forecasting (in matters of minutes and hours). This is of particular importance for the TU Varna efforts to provide a multiscale end-to-end PV modelling. For this purpose, the optimization was based on the renormalized density of clear sky regions, presented in Section 4.
Both convex and global optimizations were considered, but the latter was chosen due to the fundamentally non-linear nature of the models. The specific type of optimization was the built-in Genetic Algorithm (GA) due to the possibility to run generations in parallel. The monitored performance criteria are recommended to be Mean Absolute Error (MAE) instead of the commonly used RMSE because of the intrinsically high variability of external phenomena such as reflection in urban the environment. The tolerance for the GA was set to be 1˟10 -3 or 80 generations. The tool has also implemented MSE, SSE, R 2 , RMSE metrics and the user can optimize/validate with respect to its goals. Please note that there is not always a physical justification behind the lower and the upper constrains for the selected parameters and the results should be looked from entirely data regression point of view. For the sake of users employing Clear Sky algorithms only for Annual energy yield only, the same optimizations were performed for the original dataset without any artificial renormalization.
Starting from the simplest model Robledo and Soler, 3 empirical parameters were fitted through the GA. For easier follow up of the parameters, they will be called as follows a1 = 1159.24, a2 = 1.179, a3= -0.019. Since it is clear that a1 is related to the solar constant, a lower deviation band of ±10% was given, whereas a2, a3 were constrained for up to ±30%. The Tunned parameters and the multiyear performance of the models can be seen in Table 2. Both RMSE and MAE are improved with the renormalized database, with the decrease of MAE being 73%, while RMSE is only 3.1%. This justifies the choice of using MAE as optimization weighting function and is to be expected due to the noisy Clear Sky Identified dataset. These are slightly higher errors than similar validation studies (Engerer & Mills, 2015) for different climates, which can be explained by variations of the local climate in Varna but also the tunning of the Clear Sky identification algorithm. A closer analysis of the Robledo and Soler performance, see Figure 5-6, shows the significant overestimation of the original model at all GHI values. The relative error against the zenith angle, however, is not constant and is steadily growing with the increase of θ, reaching ~35% at sunrise/sunset. This is in agreement with other legacy findings for similar models (Engerer & Mills, Page | 8 2015), (Badescu et al., 2012) and despite of the lower GHI values at these periods may have huge complications in the future European smart grid system. The updated model improves this behavior drastically. Moreover, the model has a perfect fit during the summer months, as it can be seen in Figure 7. The only relative downside of the optimization was found around noon during the winter months, as seen in Figure 8. The model that provided the best performance for Varna appeared to be the one proposed by Hottel. Even before the optimization it was capable of estimating almost exactly summer days but had some deviations during the winter with underestimation of GHI, see Figure 9-10. The optimization limits of the Genetic Algorithm were bounded to 10% deviation for all parameters. The RMSE found was 22.18 and 21.71 for the original and the optimized model respectively, see Table 3. This is an increase of 2.1 % for the RMSE and the MAE is improved only by 2.7%. The optimized model was also observed to behave noticeably better in the winter months than the original one as can be seen in Figure 11, 12.

ISSN 2603-316X (Online)
Published: 2021-06-09 Page | 9  The original Kasten, which has a Linkie turbidity (TL) factor for every month was found to be the least accurate for the meteorological conditions in the tested dataset in terms of MAE. The RMSE on the other hand was of the same order as the other two models. The original model is overestimating all Clear Sky GHI measurements for Varna. A decision has been made, to optimize not only the TLs, but also the factor a1 = 0.84 which is a direct multiplication of the estimated GHI and the factor of the exponent a2 = 0.027. The limits were ±15% for a1, ±10% for a2 and [-10% +50%] for the TLs, since the TLs in cities are expected to be higher than the standard ones. The multiyear results from the tests and the optimizations can be seen in Table 4 with most noticeable improvement from all previous Page | 10 models 15% for RMSE and 70% decrease of MAE. In a more detailed analysis, however, see Figure  13-16, it could be observed that both the original and optimized models underperform for almost all conditionslow elevation angles (α = 90 -θ), winter months, etc. with only relatively advantageous estimations around the summer months and high elevation angles for the optimized model, see Figure  15. It can, therefrom, be concluded that tools and simulations using the optimized Kasten, are not suitable for the meteorological conditions of cities in North East Bulgaria.