From 9106cbc55a1018f89016f987ddda43bfe74a0cb8 Mon Sep 17 00:00:00 2001 From: Tim Sparer Date: Thu, 21 May 2026 22:49:43 +0000 Subject: [PATCH 01/32] Add first draft of s.o.p for model evaluation --- SOPs/Model-Evaluation-ML-metrics.md | 160 ++++++++++++++++++++++++++++ 1 file changed, 160 insertions(+) create mode 100644 SOPs/Model-Evaluation-ML-metrics.md diff --git a/SOPs/Model-Evaluation-ML-metrics.md b/SOPs/Model-Evaluation-ML-metrics.md new file mode 100644 index 0000000..bfb5e04 --- /dev/null +++ b/SOPs/Model-Evaluation-ML-metrics.md @@ -0,0 +1,160 @@ +# Model Validation, Interpretation, and Testing Guidelines + +After running your model (either with a given set of historic parameters (e.g. those from past years) or with parameters found via cross-validation), You will need to interpret the results. To do so, you will need to first assess how well the sample data (sales) matches the population (buildings to be assessed) training data, then note any mix of data, practical, or real-world factors that may be impacting your model performance (novel market or time trends?), and finally interpret the formal model performance statistics to select the best model (where best means the most accurate + generalizable model, that conforms to IAAO standards). + +--- + +1. Assessing how representative your sales (sample) are, relative to the population (all buildings-to-be-assessed). + - a. Balance tests + - b. Visual inspection + - c. Not missing at random + - d. Domain specific approach + +2. Noting any real-world changes that may impact your model, and/or interactions between data and model that may effect your results. (Model drift, data drift). +3. Interpreting Model performance (evaluating machine learning and assessment metrics). + +--- + +## 1. Assessing how representative your sales-sample is, relative to the population (all buildings-to-be-assessed). 
+ +To ensure that a model can generalize from a sample (the sales data that we train the model on) to the population (all buildings being assessed), we need to check that our sales sample contains buildings that are similar to those in the entire population: That is, we should generally see buildings with the same composition of features, in the same proportions, in the sales-sample and in the population. We check for this in several ways (primarily with statistical tests and visual inspections of distributions). + +*(Note: if there are factors that cause some properties (and their attendant sale-prices) to be over-represented in the sample, the model will over-index to these types of properties, likely leading to over or under-valuation).* + +--- + +Ways that we test for differences between the sample and the population (and some possible corrections): In a perfectly matched sample, no feature would predict inclusion/exclusion of a property in the sales-sample. That is, all properties, of all types would have an equal chance of being sold in a given year. These tests allow us to test that assumption (and to develop possible corrections). + +### A. Balance Tests + +*(See the "Statistical Tests" section of the model performance report.)* In a perfectly matched sample, no feature would predict inclusion/exclusion of a property in the sales-sample. Any feature that predicts inclusion in the sales set at a level greater than chance (statistical significance) suggests that this feature is over-or under-represented in the sample and will likely bias your results. (This is especially the case for features that also turn out to have high shap values in your results). The p value (for each feature in the report) tells you that a feature predicts inclusion in the sample at a level greater than expected-due-to-chance, while the Beta value gives you a relative sense of the weight (importance) and direction (include vs exclude) of that feature. 
(In our report, asterisks, represent statistically significant predictors). + +- **a. Caveats/Real world observations:** Our 2026 reports indicate that there may be some imbalance in our sample (see, for example, # of bedrooms, baths, various ACS5 values). We currently don't correct for this. + +- **b. Caveats/Real world observations/solution:** As many of our features may be correlated (e.g. baths and bedrooms), or have other interactions (e.g. geography, ACS5, characteristics), and some of those predict inclusion in the sales-sample – we may be underestimating the divergence between our sample and population. More specifically certain types of buildings in certain neighborhoods may be over-represented in the sales sample. (We currently don't analyze balance at a neighborhood level). If we want to address this in the future, we may want to attempt some sort of dimensionality/feature reduction, and then re-run the balance tests on the reduced feature space. If there are problems with sales-volume, this may be worthwhile. If needed, one can attempt to apply a correction with Inverse propensity scoring (which upweights the value of errors on under-sampled properties) – (example suggestion for a time sensitive sample IPW here: https://github.com/ccao-data/model-res-avm/issues/297) - Note this may not drastically improve overall accuracy metrics., but might improve neighborhood or township level performance, in particular vertical equity. + +- **c.** To validate your investigations from the balance tests, you can look at the standardized mean differences (between sample + population) for each feature. (Larger differences = more likely deviation between sample and population). + +### B. Visual Inspection + +See empirical distributions on the performance report. The distribution of a feature in the sales-set should visually match those in the population set. (Note: We only calculate this at the full sample level, it may differ at the township or neighborhood level). 
If needed can apply a KS test to see that the feature distributions (between sample and population) are the same. + +### C. Non-Missing at Random + +"Non-missing at random" means that some feature (or a particular value of a feature) is missing in a way that's correlated with your outcome variable, or some other variable in the dataset. This can sometimes indicate systemic under-sampling. In our case, it is somewhat controlled for by the fact that LightGBM (lgbm) handles nulls natively when it chooses splits*. Because of this, we currently don't track correlations of nulls as rigorously as we otherwise might, though you can get a sense of the percentage of nulls-by-feature by looking at the "Missingness" heading in the Feature report. + +\* (https://medium.com/@andrywmarques/how-lgbm-deals-with-missing-values-bd361636357f. https://coder-wang-uspsa.medium.com/how-do-xgboost-lightgbm-and-catboost-handle-missing-features-e541da94d528) + +### D. A Quick Domain Specific Sanity Check + +Compare year-over-year changes in assessed values for sold and unsold houses. (This is documented in the performance report under "Change In and Out of Sample"). The sold and unsold properties should have roughly similar changes in assessed values. + +--- + +## 2. Note any real-world changes that may impact your model, and/or interactions between data and model that may affect your results. (Model drift, data drift). + +This step is less specific, but is useful: Since our model uses temporal features, are there any recent changes (market trends etc.) that may impact it? While large changes in major variables (general increases in sales price across the board) should be obvious in the model results, it's useful to think through other dynamics: For example, have there been changes in preferences (e.g. people moving to the suburbs for more space) that the model should be picking up on? 
A useful sanity check for this sort of change is to compare the SHAP values from a newly trained model to those from a prior model; is the new model picking up the newly important variable (did its SHAP value increase)? To check for data drift over time, you can compare the feature distributions from a recent year to those of a prior year (formally, you could do a KS test, though we do not do so currently – you can eyeball the "Distributions of Features" in the feature report). Note: Since we use a boosted model with historic data, and retrain each year, we are less subject to major problems with data or model drift. However, this can still pose an issue if we have high temporal volatility and low recent sales volume. + +https://www.ibm.com/think/topics/model-drift + +--- + +## 3. Interpret Model Performance Statistics + +We calculate several measures to assess the efficacy of the model. These measures range from more traditional machine learning metrics to assessment-specific metrics. The machine learning metrics we calculate are RMSE, mDape, and R squared. The assessment metrics that we calculate and attend to are (primarily) Median Ratio and Coefficient of Dispersion (C.o.D) for accuracy and precision (respectively). We also supplement our analysis with measures of vertical equity ("how equal are our assessments at all price levels"): PRB, PRD, MKI, and ratio curves. (These assessment metrics are explained in detail in this article: Mass Appraisal For The Masses: The Basics - by Lars Doucet – with the exception of MKI, a reference to which can be found in Chris Berry's "Reassessing the Property Tax"). + +Our approach is to fit candidate models using the machine learning (m.l.) metrics, and then compare/evaluate our candidate models based on both the m.l. metrics and the assessment metrics. We can think of the assessment-specific metrics as both offering acceptable bounds for candidate models* and as a way to "break ties" between models that are otherwise similar in terms of m.l. 
metrics *(e.g. even if a model shows good m.l. fits, if it is outside the acceptable assessment bounds, we may ignore it). + +Note: This discussion presumes a train-test breakout, where we fit the model on a subset of our data (training set) and calculate the performance measures on data that the model has not seen (the test set). We use this approach to avoid overfitting (see below for specifics) and ensure that our model is generalizable out-of-sample. + +--- + +### Model Fitting: Machine Learning Metrics + +**Fitting the model:** We aim to fit the model using proper scoring rules, of which RMSE is our primary measure. + +#### RMSE – Root Mean Squared Error + +This is generally the metric that we use to formally fit the models. It is the square root of the mean of the squared errors (actual – forecast). Smaller (closer to zero) is better. Note that since it squares the error (which allows for smooth model fitting), it will tend to overweight large errors and/or large outliers (which can be an issue for skewed data, like housing prices). + +Note that RMSE works best on normally distributed data, and our data is generally skewed, with high-value outliers. As such, fitting a model with this measure will tend to regress toward the mean (leading to under-valuations of pricey properties, and over-valuations of lower-priced properties). We can adjust for this by using different objective functions (quantile functions, bespoke penalty terms), or by finding candidate models with low RMSE but making a final model choice based on ratio curves and vertical equity measures*. That said, even after accounting for regression to the mean, we have found RMSE to lead to accurate and reasonably precise models (as judged by other measures) and its status as a proper scoring rule gives us some safeguards against other biases. 
+ +*(For example, perhaps we test two models; one with an additional feature that primarily leads to more accurate valuations for lower valued properties (sale_price < median_sale_price), given the skew in our data (high value properties contribute proportionally more to RMSE) we might find that this feature doesn't move our RMSE calculation very much, but does improve our vertical equity based on our vertical equity measures. We could justify selecting the model+new-feature based on their vertical equity improvements, rather than the RMSE alone).* + +--- + +**RMSE interpretability:** An additional reason to use RMSE is its interpretability. RMSE is on the same scale as your outcome value, and can be interpreted with reference to the mean, median, and standard deviation of your sample data (either train or test set). Since RMSE is structurally similar to measures of variability, such as standard deviation (the average distance of sample or population values from the mean), you can often interpret RMSE in relation to SD. (E.g. If one thinks of the mean as the simplest "model" of a distribution, then one can interpret the standard deviation in a manner similar to RMSE- the average deviation of your observations from your mean. More complex+accurate models should have an RMSE lower than the standard deviation of your test data). This insight is also useful for model comparison, as the standard deviation can be used as a baseline, or scale, with which to benchmark RMSE values from candidate models. (E.g. if two candidate models differ by some magnitude of RMSE how "large" or "trivial" is that difference, relative to the underlying standard deviation of your sample, or test data?) + +See here for further discussion: https://stats.stackexchange.com/questions/242787/how-to-interpret-root-mean-squared-error-rmse-vs-standard-deviation - more formally, here: Shmueli, G., Bruce, P. C., Stephens, M., & Patel, N. R. (2016). 
Data Mining for Business Analytics: Concepts, Techniques, and Applications with JMP Pro (3rd Edition). Wiley. + +--- + +#### mDape (Median Absolute Percentage Error) + +*(used in Zillow model, discussion here)* +https://www.zillow.com/zestimate/ +https://stats.stackexchange.com/questions/596324/is-median-absolute-percentage-error-useless#:~:text=Percentage%20Error%20Asymmetry:%20A%20significant%20drawback%20of,same%20factor%20yields%20only%20a%2050%25%20error. + +mDape is a useful sanity check for our models. Since mDape is a median (of the absolute percentage error of each observation in the test set, (abs((actual-forecast)/actual)*100) ), it is more robust to outliers than other measures, and thus complements RMSE. Note, we shouldn't use this as an optimization metric (as it is not a proper scoring rule, and treats over-forecasts and under-forecasts asymmetrically)**, but it is useful for comparing model performance in a manner that is more robust to outliers (since it is a median). For example, in a case where two models differ slightly in their RMSE, we may accept a model with a slightly higher RMSE if it has a lower mDape than its competitor model. (This arrangement would likely signal that the other "low RMSE" model might simply be fitting toward some high-value outliers, at the expense of the median property). + +\**(e.g. suppose our forecast = 2 and the actual value = 1; the APE is then 100%, whereas an actual value of 3 (with the same forecast of 2) gives an APE of 33%).* + +--- + +#### R-squared + +We report R-squared because it is a common and somewhat interpretable metric (R-squared typically falls between 0 and 1, though it can be negative when computed on held-out data, and values closer to 1 are generally considered to signify better model fits). However, given the myriad problems with R squared ([see here: Is R-squared Useless? | UVA Library](https://library.virginia.edu/data/articles/is-r-squared-useless)) we shouldn't make any model justification decisions based on it. 
At most, we can check for consistency with RMSE at local levels (e.g. townships). Where R-squared and RMSE disagree about goodness of fit, default to the RMSE, and perhaps investigate why the R-squared is low. (Note that R squared is sensitive to scale, variance, and nonlinearities in the underlying data, so these may be causes of discrepancies between R squared and RMSE). + +- R-squared does not measure goodness of fit. +- R-squared does not measure predictive error. +- R-squared does not allow you to compare models using transformed responses. +- R-squared does not measure how one variable explains another (it's not causal). + +See also: Shalizi notes https://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/10/lecture-10.pdf + +For a more sensible interpretation of goodness-of-fit (in the spirit of how people think R-squared works), simply eyeball the table in the model performance report for "Estimate vs Actual (Individual Obs.)" (look for changes between model estimates and actuals). + +--- + +### Train/Test Splits + +We follow a standard train/test split to test our models for overfitting. Overfitting is when our model picks up spurious correlations in our sample, and erroneously uses those to predict sales. This makes our model less generalizable. + +When we fit the model, we hold out a portion of our sample data (the test-set) and do not train, or fit, the model on that data. (The training data is the sample data minus the test-set). We fit the model on the training data, predict sales for the training data, and calculate the model's performance metrics (RMSE, mDape, etc.); we then make predictions on the test data and calculate the same performance metrics there. We use the difference between these two sets (train and test metrics) to gauge how much a given model may overfit. + +Generally, the wider the difference between these two measures, the more likely it is that the model is overfit. 
For example, an RMSE of 10k on our sales training predictions, and an RMSE of 179k on the test predictions (with a standard deviation of 180k on both the train and test actual distributions) would very likely indicate that the model is overfit to the training data (because of the large difference between train and test RMSE, especially relative to the standard deviation). + +A note: If we only compare model performance in terms of test sets (e.g. looking at differences in RMSE between two models, without looking at train/test splits), we may accidentally select a model that is overfit (even though we are analyzing test-set scores). For example, if we are looking at the test results from 2 separate models trained on the same train/test splits, and one model has an RMSE of 75k, while a second model has an RMSE of 80k (and the standard deviation for both train/test splits is 180k), we might select the first (75k) model. However, if there is a large difference between the train and test RMSE for the 75k model (e.g. its train-set RMSE is 10k), but not for the 80k model (perhaps RMSE for this model's train set is 40k), we might be less inclined to pick the 75k model (due to the possibility of the 75k model being overfit). (Note: we wouldn't be able to tell definitively if the 80k model is more generalizable: we might want to test additional models and see their train/test splits, and test accuracy, to get a better sense of whether the 75k model was anomalous. That said, the large gap, relative to the 2nd model (and the standard deviation of the underlying data) may be sign enough of overfitting). + +Further reading: Bias Variance Trade-off +- https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff +- https://www.ibm.com/think/topics/bias-variance-tradeoff + +--- + +Finally: If your sample is not a good match for your population, good train-test splits will only take you so far (it will be hard to generalize any model). 
This is because your sample will lack representative training data. This is why balance tests (see earlier section) are important. + +--- + +### Assessment Metrics + +Interpretations and acceptable ranges for assessment metrics can be found here: + +https://github.com/ccao-data/wiki/blob/master/SOPs/Sales-Ratio-Studies.md + +Longer descriptions here: Mass Appraisal For The Masses: The Basics - by Lars Doucet +https://progressandpoverty.substack.com/p/mass-appraisal-for-the-masses-the + +--- + +### Practical Process + +Pull all current data, and fit + predict with last year's hyper-parameters. This model can act as a baseline for any improvements you may intend to make. At this stage, you will mostly be using the model reports to check for data drift (year-over-year changes in features; see the feature report), model+data drift (changes in SHAP values), data that is not missing at random, and balance tests (are some features and property types over-represented?). You can attempt to make corrections at this stage (e.g. applying IPW to correct for imbalanced features), but most likely you will just have to call this out as a concern in your model. + +--- + +### Tune the Hyperparameters + +Carry out a "CV run" with GitHub Actions. This will use a Bayesian optimizer to search for the best-fitting hyper-parameters. Assess the quality of the model, using the metrics discussed previously. Compare the newly tuned model to the old model. In addition to the machine learning and assessment metrics, you may wish to look at changes in assessments across townships. Are they relatively similar across models? Are there any big swings in one model but not the other? + +(If there is no clear better model, but there are swings across differing townships, this may indicate that your models have plateaued and are fitting to local noise. 
In this case, you likely need to make a judgement call as to which model to select based on the particular swings/changes and outside information, and call-out the townships with large swings for desk review). From 1d656fc81629907e2c266221dc5d2d4556cbc8c4 Mon Sep 17 00:00:00 2001 From: TimCookCountyDS Date: Fri, 22 May 2026 11:07:11 -0500 Subject: [PATCH 02/32] Update Model-Evaluation-ML-metrics.md copy-edit rmv training --- SOPs/Model-Evaluation-ML-metrics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/SOPs/Model-Evaluation-ML-metrics.md b/SOPs/Model-Evaluation-ML-metrics.md index bfb5e04..593638a 100644 --- a/SOPs/Model-Evaluation-ML-metrics.md +++ b/SOPs/Model-Evaluation-ML-metrics.md @@ -1,6 +1,6 @@ # Model Validation, Interpretation, and Testing Guidelines -After running your model (either with a given set of historic parameters (e.g. those from past years) or with parameters found via cross-validation), You will need to interpret the results. To do so, you will need to first assess how well the sample data (sales) matches the population (buildings to be assessed) training data, then note any mix of data, practical, or real-world factors that may be impacting your model performance (novel market or time trends?), and finally interpret the formal model performance statistics to select the best model (where best means the most accurate + generalizable model, that conforms to IAAO standards). +After running your model (either with a given set of historic parameters (e.g. those from past years) or with parameters found via cross-validation), You will need to interpret the results. 
To do so, you will need to first assess how well the sample data (sales) matches the population (buildings to be assessed), then note any mix of data, practical, or real-world factors that may be impacting your model performance (novel market or time trends?), and finally interpret the formal model performance statistics to select the best model (where best means the most accurate + generalizable model, that conforms to IAAO standards). --- From ceea4dfebf98935a9314f747b612a67ffd125fc4 Mon Sep 17 00:00:00 2001 From: TimCookCountyDS Date: Fri, 22 May 2026 11:08:00 -0500 Subject: [PATCH 03/32] Update Model-Evaluation-ML-metrics.md copy edit 2 --- SOPs/Model-Evaluation-ML-metrics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/SOPs/Model-Evaluation-ML-metrics.md b/SOPs/Model-Evaluation-ML-metrics.md index 593638a..8639bef 100644 --- a/SOPs/Model-Evaluation-ML-metrics.md +++ b/SOPs/Model-Evaluation-ML-metrics.md @@ -1,6 +1,6 @@ # Model Validation, Interpretation, and Testing Guidelines -After running your model (either with a given set of historic parameters (e.g. those from past years) or with parameters found via cross-validation), You will need to interpret the results. To do so, you will need to first assess how well the sample data (sales) matches the population (buildings to be assessed), then note any mix of data, practical, or real-world factors that may be impacting your model performance (novel market or time trends?), and finally interpret the formal model performance statistics to select the best model (where best means the most accurate + generalizable model, that conforms to IAAO standards). +After running your model (either with a given set of historic parameters (e.g. those from past years) or with parameters found via cross-validation), You will need to interpret the results. 
To do so, you will need to first assess how well the sample data (sales) matches the population (buildings to be assessed), then note any mix of data, practical, or real-world factors that may be impacting your model performance (novel market or time trends), and finally interpret the formal model performance statistics to select the best model (where best means the most accurate + generalizable model, that conforms to IAAO standards). --- From a632697b39c434dff9a87418638f1064f5033f6a Mon Sep 17 00:00:00 2001 From: TimCookCountyDS Date: Fri, 22 May 2026 11:08:26 -0500 Subject: [PATCH 04/32] Update Model-Evaluation-ML-metrics.md --- SOPs/Model-Evaluation-ML-metrics.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/SOPs/Model-Evaluation-ML-metrics.md b/SOPs/Model-Evaluation-ML-metrics.md index 8639bef..374e573 100644 --- a/SOPs/Model-Evaluation-ML-metrics.md +++ b/SOPs/Model-Evaluation-ML-metrics.md @@ -1,6 +1,7 @@ # Model Validation, Interpretation, and Testing Guidelines -After running your model (either with a given set of historic parameters (e.g. those from past years) or with parameters found via cross-validation), You will need to interpret the results. To do so, you will need to first assess how well the sample data (sales) matches the population (buildings to be assessed), then note any mix of data, practical, or real-world factors that may be impacting your model performance (novel market or time trends), and finally interpret the formal model performance statistics to select the best model (where best means the most accurate + generalizable model, that conforms to IAAO standards). +After running your model (either with a given set of historic parameters (e.g. those from past years) or with parameters found via cross-validation), You will need to interpret the results. 
To do so, you will need to first assess how well the sample data (sales) matches the population (buildings to be assessed), then note any mix of data, practical, or real-world factors that may be impacting your model performance (e.g. +novel market or time trends), and finally interpret the formal model performance statistics to select the best model (where best means the most accurate + generalizable model, that conforms to IAAO standards). --- From ea84cff67020bf83298b1013fdf687858bb1999f Mon Sep 17 00:00:00 2001 From: TimCookCountyDS Date: Fri, 22 May 2026 11:09:09 -0500 Subject: [PATCH 05/32] Update Model-Evaluation-ML-metrics.md heading --- SOPs/Model-Evaluation-ML-metrics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/SOPs/Model-Evaluation-ML-metrics.md b/SOPs/Model-Evaluation-ML-metrics.md index 374e573..35b3b43 100644 --- a/SOPs/Model-Evaluation-ML-metrics.md +++ b/SOPs/Model-Evaluation-ML-metrics.md @@ -4,7 +4,7 @@ After running your model (either with a given set of historic parameters (e.g. t novel market or time trends), and finally interpret the formal model performance statistics to select the best model (where best means the most accurate + generalizable model, that conforms to IAAO standards). --- - +## This Guide Covers 1. Assessing how representative your sales (sample) are, relative to the population (all buildings-to-be-assessed). - a. Balance tests - b. 
Visual inspection From 41e659f2259d187d3a39bb5e9b7ace1e95348684 Mon Sep 17 00:00:00 2001 From: TimCookCountyDS Date: Fri, 22 May 2026 11:09:54 -0500 Subject: [PATCH 06/32] Update Model-Evaluation-ML-metrics.md --- SOPs/Model-Evaluation-ML-metrics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/SOPs/Model-Evaluation-ML-metrics.md b/SOPs/Model-Evaluation-ML-metrics.md index 35b3b43..3c40485 100644 --- a/SOPs/Model-Evaluation-ML-metrics.md +++ b/SOPs/Model-Evaluation-ML-metrics.md @@ -12,7 +12,7 @@ novel market or time trends), and finally interpret the formal model performance - d. Domain specific approach 2. Noting any real-world changes that may impact your model, and/or interactions between data and model that may effect your results. (Model drift, data drift). -3. Interpreting Model performance (evaluating machine learning and assessment metrics). +3. Interpreting Model performance (using machine learning and assessment metrics). --- From feb5527d8ec59262d39ae407bfc95649edae6502 Mon Sep 17 00:00:00 2001 From: TimCookCountyDS Date: Fri, 22 May 2026 11:36:42 -0500 Subject: [PATCH 07/32] Update Model-Evaluation-ML-metrics.md copy edits --- SOPs/Model-Evaluation-ML-metrics.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/SOPs/Model-Evaluation-ML-metrics.md b/SOPs/Model-Evaluation-ML-metrics.md index 3c40485..89100d9 100644 --- a/SOPs/Model-Evaluation-ML-metrics.md +++ b/SOPs/Model-Evaluation-ML-metrics.md @@ -28,7 +28,7 @@ Ways that we test for differences between the sample and the population (and som ### A. Balance Tests -*(See the "Statistical Tests" section of the model performance report.)* In a perfectly matched sample, no feature would predict inclusion/exclusion of a property in the sales-sample. Any feature that predicts inclusion in the sales set at a level greater than chance (statistical significance) suggests that this feature is over-or under-represented in the sample and will likely bias your results. 
(This is especially the case for features that also turn out to have high shap values in your results). The p value (for each feature in the report) tells you that a feature predicts inclusion in the sample at a level greater than expected-due-to-chance, while the Beta value gives you a relative sense of the weight (importance) and direction (include vs exclude) of that feature. (In our report, asterisks, represent statistically significant predictors). +*(See the "Statistical Tests" section of the model performance report.)* In a perfectly matched sample, no feature would predict inclusion/exclusion of a property in the sales-sample. Any feature that predicts inclusion in the sales set at a level greater than chance (statistical significance) suggests that this feature is over-or under-represented in the sample and will likely bias your results. (This is especially the case for features that also turn out to have high shap values in your results). To check this, we run a simple logistic regression predicting the likelihood-of-a-sale, given a property's features. The resulting p values (for each feature in the report) tells you that a feature predicts inclusion in the sample at a level greater than expected-due-to-chance, while the Beta value gives you a relative sense of the weight (importance) and direction (include vs exclude) of that feature. (In our report, asterisks, represent statistically significant predictors). (Low p-values suggest statistical significance, high magnitudes for the Betas suggest a large impact). When a feature is predictive of inclusion in the sample, this means that your sample is likely biased towards properties with this feature, and may thus value these, or other properties inaccurately. - **a. Caveats/Real world observations:** Our 2026 reports indicate that there may be some imbalance in our sample (see, for example, # of bedrooms, baths, various ACS5 values). We currently don't correct for this. 
@@ -38,7 +38,7 @@ Ways that we test for differences between the sample and the population (and som ### B. Visual Inspection -See empirical distributions on the performance report. The distribution of a feature in the sales-set should visually match those in the population set. (Note: We only calculate this at the full sample level, it may differ at the township or neighborhood level). If needed can apply a KS test to see that the feature distributions (between sample and population) are the same. +See empirical distributions on the performance report. The distribution of a feature in the sales-set should visually match those in the population set. (Note: We only calculate this at the full sample level, it may differ at the township or neighborhood level, and perhaps ought to be investigated further). If needed can apply a KS test to see that the feature distributions (between sample and population) are the same. ### C. Non-Missing at Random From 815be14a8431185c6e6e21d9a8a90e93a655ee67 Mon Sep 17 00:00:00 2001 From: TimCookCountyDS Date: Fri, 22 May 2026 11:38:00 -0500 Subject: [PATCH 08/32] Update Model-Evaluation-ML-metrics.md outline edit --- SOPs/Model-Evaluation-ML-metrics.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/SOPs/Model-Evaluation-ML-metrics.md b/SOPs/Model-Evaluation-ML-metrics.md index 89100d9..4726536 100644 --- a/SOPs/Model-Evaluation-ML-metrics.md +++ b/SOPs/Model-Evaluation-ML-metrics.md @@ -30,11 +30,11 @@ Ways that we test for differences between the sample and the population (and som *(See the "Statistical Tests" section of the model performance report.)* In a perfectly matched sample, no feature would predict inclusion/exclusion of a property in the sales-sample. Any feature that predicts inclusion in the sales set at a level greater than chance (statistical significance) suggests that this feature is over-or under-represented in the sample and will likely bias your results. 
(This is especially the case for features that also turn out to have high shap values in your results). To check this, we run a simple logistic regression predicting the likelihood-of-a-sale, given a property's features. The resulting p values (for each feature in the report) tells you that a feature predicts inclusion in the sample at a level greater than expected-due-to-chance, while the Beta value gives you a relative sense of the weight (importance) and direction (include vs exclude) of that feature. (In our report, asterisks, represent statistically significant predictors). (Low p-values suggest statistical significance, high magnitudes for the Betas suggest a large impact). When a feature is predictive of inclusion in the sample, this means that your sample is likely biased towards properties with this feature, and may thus value these, or other properties inaccurately. -- **a. Caveats/Real world observations:** Our 2026 reports indicate that there may be some imbalance in our sample (see, for example, # of bedrooms, baths, various ACS5 values). We currently don't correct for this. +- **Caveats/Real world observations:** Our 2026 reports indicate that there may be some imbalance in our sample (see, for example, # of bedrooms, baths, various ACS5 values). We currently don't correct for this. -- **b. Caveats/Real world observations/solution:** As many of our features may be correlated (e.g. baths and bedrooms), or have other interactions (e.g. geography, ACS5, characteristics), and some of those predict inclusion in the sales-sample – we may be underestimating the divergence between our sample and population. More specifically certain types of buildings in certain neighborhoods may be over-represented in the sales sample. (We currently don't analyze balance at a neighborhood level). If we want to address this in the future, we may want to attempt some sort of dimensionality/feature reduction, and then re-run the balance tests on the reduced feature space. 
If there are problems with sales-volume, this may be worthwhile. If needed, one can attempt to apply a correction with Inverse propensity scoring (which upweights the value of errors on under-sampled properties) – (example suggestion for a time sensitive sample IPW here: https://github.com/ccao-data/model-res-avm/issues/297) - Note this may not drastically improve overall accuracy metrics., but might improve neighborhood or township level performance, in particular vertical equity. +- **Caveats/Real world observations/solution:** As many of our features may be correlated (e.g. baths and bedrooms), or have other interactions (e.g. geography, ACS5, characteristics), and some of those predict inclusion in the sales-sample – we may be underestimating the divergence between our sample and population. More specifically certain types of buildings in certain neighborhoods may be over-represented in the sales sample. (We currently don't analyze balance at a neighborhood level). If we want to address this in the future, we may want to attempt some sort of dimensionality/feature reduction, and then re-run the balance tests on the reduced feature space. If there are problems with sales-volume, this may be worthwhile. If needed, one can attempt to apply a correction with Inverse propensity scoring (which upweights the value of errors on under-sampled properties) – (example suggestion for a time sensitive sample IPW here: https://github.com/ccao-data/model-res-avm/issues/297) - Note this may not drastically improve overall accuracy metrics., but might improve neighborhood or township level performance, in particular vertical equity. -- **c.** To validate your investigations from the balance tests, you can look at the standardized mean differences (between sample + population) for each feature. (Larger differences = more likely deviation between sample and population). 
+- To validate your investigations from the balance tests, you can look at the standardized mean differences (between sample + population) for each feature. (Larger differences = more likely deviation between sample and population). ### B. Visual Inspection From 9ca3b0b84a7960e3e5db3c2673ac58bd69aff246 Mon Sep 17 00:00:00 2001 From: TimCookCountyDS Date: Fri, 22 May 2026 11:46:41 -0500 Subject: [PATCH 09/32] Update Model-Evaluation-ML-metrics.md update broke assessment metrics links --- SOPs/Model-Evaluation-ML-metrics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/SOPs/Model-Evaluation-ML-metrics.md b/SOPs/Model-Evaluation-ML-metrics.md index 4726536..f78e691 100644 --- a/SOPs/Model-Evaluation-ML-metrics.md +++ b/SOPs/Model-Evaluation-ML-metrics.md @@ -62,7 +62,7 @@ https://www.ibm.com/think/topics/model-drift ## 3. Interpret Model Performance Statistics -We calculate several measures to assess the efficacy of the model. These measures range from more traditional machine learning metrics to assessment-specific metrics. The machine learning metrics we calculate are RMSE, mDape, and R squared. The assessment metrics that we calculate and attend to are (primarily) Median Ratio and Coefficient of Dispersion (C.o.D) for accuracy and precision (respectively). We also supplement our analysis with measures of vertical equity ("how equal are our assessments at all price levels") these are PRB, PRD, and MKI, and ratio curves. (These assessment metrics are explained in detail in this article: Mass Appraisal For The Masses: The Basics - by Lars Doucet – with the exception of MKI, a reference to which can be found In Chris Berry's "Reassessing the Property Tax"). +We calculate several measures to assess the efficacy of the model. These measures range from more traditional machine learning metrics to assessment-specific metrics. The machine learning metrics we calculate are RMSE, mDape, and R squared. 
The assessment metrics that we calculate and attend to are (primarily) Median Ratio and Coefficient of Dispersion (C.o.D) for accuracy and precision (respectively). We also supplement our analysis with measures of vertical equity ("how equal are our assessments at all price levels"): PRB, PRD, MKI, and ratio curves. (These assessment metrics are explained in detail in this article: [Mass Appraisal For The Masses: The Basics - by Lars Doucet](https://progressandpoverty.substack.com/p/mass-appraisal-for-the-masses-the) – with the exception of MKI, a reference to which can be found in Chris Berry's ["Reassessing the Property Tax"](https://law.yale.edu/sites/default/files/area/center/corporate/spring2022_paper_berrychristopher_2-24-22.pdf). See CCAO's [guide to Sales-Ratio-Studies](https://github.com/ccao-data/wiki/blob/master/SOPs/Sales-Ratio-Studies.md) as well.) Our approach is to fit candidate models using the machine learning (m.l.) metrics, and then compare/evaluate our candidate models based on both the m.l. metrics and the assessment metrics. We can think of the assessment-specific metrics as both offering acceptable bounds for candidate models* and as a way to "break ties" between models that are otherwise similar in terms of m.l. metrics *(e.g. even if a model shows good m.l. fits, if it is outside the acceptable assessment bounds, we may ignore it). 
From c2d46f3076a94ecb616cc06e9aac05f0d8014087 Mon Sep 17 00:00:00 2001 From: TimCookCountyDS Date: Fri, 22 May 2026 11:52:14 -0500 Subject: [PATCH 10/32] Update Model-Evaluation-ML-metrics.md --- SOPs/Model-Evaluation-ML-metrics.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/SOPs/Model-Evaluation-ML-metrics.md b/SOPs/Model-Evaluation-ML-metrics.md index f78e691..353d004 100644 --- a/SOPs/Model-Evaluation-ML-metrics.md +++ b/SOPs/Model-Evaluation-ML-metrics.md @@ -64,9 +64,10 @@ https://www.ibm.com/think/topics/model-drift We calculate several measures to assess the efficacy of the model. These measures range from more traditional machine learning metrics to assessment-specific metrics. The machine learning metrics we calculate are RMSE, mDape, and R squared. The assessment metrics that we calculate and attend to are (primarily) Median Ratio and Coefficient of Dispersion (C.o.D) for accuracy and precision (respectively). We also supplement our analysis with measures of vertical equity ("how equal are our assessments at all price levels"): PRB, PRD, MKI, and ratio curves. (These assessment metrics are explained in detail in this article: [Mass Appraisal For The Masses: The Basics - by Lars Doucet](https://progressandpoverty.substack.com/p/mass-appraisal-for-the-masses-the) – with the exception of MKI, a reference to which can be found in Chris Berry's ["Reassessing the Property Tax"](https://law.yale.edu/sites/default/files/area/center/corporate/spring2022_paper_berrychristopher_2-24-22.pdf). See CCAO's [guide to Sales-Ratio-Studies](https://github.com/ccao-data/wiki/blob/master/SOPs/Sales-Ratio-Studies.md) as well.) -Our approach is to fit candidate models using the machine learning (m.l.) metrics, and then compare/evaluate our candidate models based on both the m.l. metrics and the assessment metrics. 
We can think of the assessment-specific metrics as both offering acceptable bounds for candidate models* and as a way to "break ties" between models that are otherwise similar in terms of m.l. metrics *(e.g. even if a model shows good m.l. fits, if it is outside the acceptable assessment bounds, we may ignore it). +Our approach is to fit candidate models using the machine learning (m.l.) metrics, and then compare/evaluate our candidate models based on both the m.l. metrics and the assessment metrics. We can think of the assessment-specific metrics as both offering acceptable bounds for candidate models* and as a way to "break ties" between models that are otherwise similar in terms of m.l. metrics +*(e.g. even if a model shows good m.l. fits, if it is outside the acceptable assessment bounds, we may ignore it). -Note: This discussion presumes a train-test breakout, where we fit the model on a subset of our data (training set) and calculate the performance measures on data that the model has not seen (the test set). We use this approach to avoid overfitting (see below for specifics) and ensure that our model is generalizable out-of-sample. +Note: This discussion presumes a train-test breakout, where we fit the model on a subset of our data (training set) and calculate the performance measures on data that the model has not seen (the test set). We use this approach to avoid overfitting (see below for specifics) and ensure that our model is generalizable out-of-sample. (Specifics on those break-outs later). 
--- From 037f4ec2bbb82d34c2e365bd695127498cd02cc3 Mon Sep 17 00:00:00 2001 From: TimCookCountyDS Date: Fri, 22 May 2026 11:53:59 -0500 Subject: [PATCH 11/32] Update Model-Evaluation-ML-metrics.md --- SOPs/Model-Evaluation-ML-metrics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/SOPs/Model-Evaluation-ML-metrics.md b/SOPs/Model-Evaluation-ML-metrics.md index 353d004..f34a1db 100644 --- a/SOPs/Model-Evaluation-ML-metrics.md +++ b/SOPs/Model-Evaluation-ML-metrics.md @@ -81,7 +81,7 @@ This is generally the metric that we use to formally fit the models. It is just Note that RMSE works best on normally distributed data, and our data is generally skewed, with high-value outliers. As such, fitting a model with this measure will tend to regress toward the mean (leading to under-valuations of pricey-properties, and over-valuations of lower-priced properties). We can adjust for this by using different objective functions (quantile functions, bespoke penalty terms); OR by finding candidate models with low RMSE, but making a final model choice based on ratio-curves and vertical equity measures*. That said, even after accounting for regression to the mean, we have found RMSE to lead to accurate and reasonably precise models (as judged by other measures) and its status as a proper-scoring rule gives us some safeguards against other biases. -*(For example, perhaps we test two models; one with an additional feature that primarily leads to more accurate valuations for lower valued properties (sale_price < median_sale_price), given the skew in our data (high value properties contribute proportionally more to RMSE) we might find that this feature doesn't move our RMSE calculation very much, but does improve our vertical equity based on our vertical equity measures. 
We could justify selecting the model+new-feature based on their vertical equity improvements, rather than the RMSE alone).* +*(For example, perhaps we test two models; one with an additional feature that primarily leads to more accurate valuations for lower valued properties (sale_price < median_sale_price), given the skew in our data (high value properties contribute proportionally more to RMSE) we might find that this feature doesn't move our RMSE calculation very much, but does improve our vertical equity based on our vertical equity measures. We could justify selecting the model+new-feature based on their vertical equity improvements, rather than the RMSE alone). --- From 21a6aa51800a81a35bb6233cbc3211bc29a961de Mon Sep 17 00:00:00 2001 From: TimCookCountyDS Date: Fri, 22 May 2026 11:59:30 -0500 Subject: [PATCH 12/32] Update Model-Evaluation-ML-metrics.md --- SOPs/Model-Evaluation-ML-metrics.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/SOPs/Model-Evaluation-ML-metrics.md b/SOPs/Model-Evaluation-ML-metrics.md index f34a1db..de77df7 100644 --- a/SOPs/Model-Evaluation-ML-metrics.md +++ b/SOPs/Model-Evaluation-ML-metrics.md @@ -93,9 +93,8 @@ See here for further discussion: https://stats.stackexchange.com/questions/24278 #### mDape (Median Absolute Percentage Error) -*(used in Zillow model, discussion here)* -https://www.zillow.com/zestimate/ -https://stats.stackexchange.com/questions/596324/is-median-absolute-percentage-error-useless#:~:text=Percentage%20Error%20Asymmetry:%20A%20significant%20drawback%20of,same%20factor%20yields%20only%20a%2050%25%20error. +*(used in [Zillow model](https://www.zillow.com/zestimate/), discussion [here](https://stats.stackexchange.com/questions/596324/is-median-absolute-percentage-error-useless#:~:text=Percentage%20Error%20Asymmetry:%20A%20significant%20drawback%20of,same%20factor%20yields%20only%20a%2050%25%20error)* + mDape is a useful sanity check for our models. 
Since mDape is a median (of the absolute percentage error of each observation in the test set, (abs((actual-forecast)/actual)*100) ), it is more robust to outliers than other measures, and thus complements RMSE. Note, we shouldn't use this as an optimization metric (as it is not a proper scoring rule, and treats over-forecasts and under-forecasts asymmetrically)**, but it is useful for comparing model performance in a manner that is more robust to outliers (since it is a median). For example, in a case where two models differ slightly in their RMSE, we may accept a model with a slightly higher RMSE if it has a lower mDape than its competitor model (this arrangement would likely signal that the other "low RMSE" model might simply be fitting toward some high value outliers, at the expense of the median property). @@ -112,7 +111,7 @@ We report R-squared because it is a common and somewhat interpretable metric (R - R-squared does not allow you to compare models using transformed responses. - R-squared does not measure how one variable explains another (it's not causal). -See also: Shalizi notes https://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/10/lecture-10.pdf +See also: Prof. [Cosma Shalizi's notes](https://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/10/lecture-10.pdf) For a more sensible interpretation of goodness-of-fit, (in the spirit of how people think R-squared works) – simply eyeball the table in the model performance report for "Estimate vs Actual (Individual Obs.)" (look for changes between model estimates and actuals). 
From deeaec7caeed3228ca67c1a276da01676b9a0bfa Mon Sep 17 00:00:00 2001 From: TimCookCountyDS Date: Fri, 22 May 2026 14:52:03 -0500 Subject: [PATCH 13/32] Update Model-Evaluation-ML-metrics.md --- SOPs/Model-Evaluation-ML-metrics.md | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/SOPs/Model-Evaluation-ML-metrics.md b/SOPs/Model-Evaluation-ML-metrics.md index de77df7..a586ad5 100644 --- a/SOPs/Model-Evaluation-ML-metrics.md +++ b/SOPs/Model-Evaluation-ML-metrics.md @@ -87,7 +87,7 @@ Note that RMSE works best on normally distributed data, and our data is generall **RMSE interpretability:** An additional reason to use RMSE is its interpretability. RMSE is on the same scale as your outcome value, and can be interpreted with reference to the mean, median, and standard deviation of your sample data (either train or test set). Since RMSE is structurally similar to measures of variability, such as standard deviation (the average distance of sample or population values from the mean), you can often interpret RMSE in relation to SD. (E.g. If one thinks of the mean as the simplest "model" of a distribution, then one can interpret the standard deviation in a manner similar to RMSE- the average deviation of your observations from your mean. More complex+accurate models should have an RMSE lower than the standard deviation of your test data). This insight is also useful for model comparison, as the standard deviation can be used as a baseline, or scale, with which to benchmark RMSE values from candidate models. (E.g. if two candidate models differ by some magnitude of RMSE how "large" or "trivial" is that difference, relative to the underlying standard deviation of your sample, or test data?) -See here for further discussion: https://stats.stackexchange.com/questions/242787/how-to-interpret-root-mean-squared-error-rmse-vs-standard-deviation - more formally, here: Shmueli, G., Bruce, P. C., Stephens, M., & Patel, N. R. (2016). 
Data Mining for Business Analytics: Concepts, Techniques, and Applications with JMP Pro (3rd Edition). Wiley. +See [here for further discussion](https://stats.stackexchange.com/questions/242787/how-to-interpret-root-mean-squared-error-rmse-vs-standard-deviation) - more formally, here: Shmueli, G., Bruce, P. C., Stephens, M., & Patel, N. R. (2016). Data Mining for Business Analytics: Concepts, Techniques, and Applications with JMP Pro (3rd Edition). Wiley. --- @@ -139,12 +139,9 @@ Finally: If your sample is not a good match for your population, good train-tes ### Assessment Metrics -Interpretations and acceptable ranges for assessment metrics can be found here: +Interpretations and acceptable ranges for assessment metrics can be found in our [companion sale-ratio-studies wiki](https://github.com/ccao-data/wiki/blob/master/SOPs/Sales-Ratio-Studies.md) -https://github.com/ccao-data/wiki/blob/master/SOPs/Sales-Ratio-Studies.md - -longer descriptions here: Mass Appraisal For The Masses: The Basics - by Lars Doucet -https://progressandpoverty.substack.com/p/mass-appraisal-for-the-masses-the +longer descriptions here: [Mass Appraisal For The Masses: The Basics - by Lars Doucet](https://progressandpoverty.substack.com/p/mass-appraisal-for-the-masses-the) --- From 9b6de2cd09a85101b54b646411bd129f9fad27c3 Mon Sep 17 00:00:00 2001 From: TimCookCountyDS Date: Fri, 22 May 2026 14:58:25 -0500 Subject: [PATCH 14/32] Update Model-Evaluation-ML-metrics.md Edit links --- SOPs/Model-Evaluation-ML-metrics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/SOPs/Model-Evaluation-ML-metrics.md b/SOPs/Model-Evaluation-ML-metrics.md index a586ad5..686932b 100644 --- a/SOPs/Model-Evaluation-ML-metrics.md +++ b/SOPs/Model-Evaluation-ML-metrics.md @@ -44,7 +44,7 @@ See empirical distributions on the performance report. 
The distribution of a fe "Non-missing at random" means that some feature (or a particular value of a feature) is missing in a way that's correlated with your outcome variable, or some other variable in the dataset. This can sometimes indicate systemic under-sampling. In our case, it is somewhat controlled for by the fact that lgbm actually incorporates Nulls as a predictor*, because of this, we currently don't track correlations of nulls as rigorously as we otherwise might, though you can get a sense of the percentage of nulls-by-feature by looking at the "Missingness" heading in the Feature report. -\* (https://medium.com/@andrywmarques/how-lgbm-deals-with-missing-values-bd361636357f. https://coder-wang-uspsa.medium.com/how-do-xgboost-lightgbm-and-catboost-handle-missing-features-e541da94d528) +\* [lgbm handling missing values](https://medium.com/@andrywmarques/how-lgbm-deals-with-missing-values-bd361636357f. https://coder-wang-uspsa.medium.com/how-do-xgboost-lightgbm-and-catboost-handle-missing-features-e541da94d528) ### D. A Quick Domain Specific Sanity Check From 202745dd609994e4f7c4a5cf433f0be40f013ede Mon Sep 17 00:00:00 2001 From: TimCookCountyDS Date: Fri, 22 May 2026 14:59:51 -0500 Subject: [PATCH 15/32] Update Model-Evaluation-ML-metrics.md edit hyperlinks 2 --- SOPs/Model-Evaluation-ML-metrics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/SOPs/Model-Evaluation-ML-metrics.md b/SOPs/Model-Evaluation-ML-metrics.md index 686932b..70e958c 100644 --- a/SOPs/Model-Evaluation-ML-metrics.md +++ b/SOPs/Model-Evaluation-ML-metrics.md @@ -44,7 +44,7 @@ See empirical distributions on the performance report. The distribution of a fe "Non-missing at random" means that some feature (or a particular value of a feature) is missing in a way that's correlated with your outcome variable, or some other variable in the dataset. This can sometimes indicate systemic under-sampling. 
In our case, it is somewhat controlled for by the fact that lgbm actually incorporates Nulls as a predictor*, because of this, we currently don't track correlations of nulls as rigorously as we otherwise might, though you can get a sense of the percentage of nulls-by-feature by looking at the "Missingness" heading in the Feature report. -\* [lgbm handling missing values](https://medium.com/@andrywmarques/how-lgbm-deals-with-missing-values-bd361636357f. https://coder-wang-uspsa.medium.com/how-do-xgboost-lightgbm-and-catboost-handle-missing-features-e541da94d528) +\* ( [lgbm handling missing values](https://medium.com/@andrywmarques/how-lgbm-deals-with-missing-values-bd361636357f). see [also](https://coder-wang-uspsa.medium.com/how-do-xgboost-lightgbm-and-catboost-handle-missing-features-e541da94d528) ) ### D. A Quick Domain Specific Sanity Check From 748e94268edf62dacbb3fd8bdcabd74208727489 Mon Sep 17 00:00:00 2001 From: TimCookCountyDS Date: Fri, 22 May 2026 15:00:52 -0500 Subject: [PATCH 16/32] Update Model-Evaluation-ML-metrics.md close parens --- SOPs/Model-Evaluation-ML-metrics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/SOPs/Model-Evaluation-ML-metrics.md b/SOPs/Model-Evaluation-ML-metrics.md index 70e958c..67fa187 100644 --- a/SOPs/Model-Evaluation-ML-metrics.md +++ b/SOPs/Model-Evaluation-ML-metrics.md @@ -93,7 +93,7 @@ See [here for further discussion](https://stats.stackexchange.com/questions/2427 #### mDape (Median Absolute Percentage Error) -*(used in [Zillow model](https://www.zillow.com/zestimate/), discussion [here](https://stats.stackexchange.com/questions/596324/is-median-absolute-percentage-error-useless#:~:text=Percentage%20Error%20Asymmetry:%20A%20significant%20drawback%20of,same%20factor%20yields%20only%20a%2050%25%20error)* +*(used in [Zillow model](https://www.zillow.com/zestimate/), discussion 
[here](https://stats.stackexchange.com/questions/596324/is-median-absolute-percentage-error-useless#:~:text=Percentage%20Error%20Asymmetry:%20A%20significant%20drawback%20of,same%20factor%20yields%20only%20a%2050%25%20error) )* mDape is a useful sanity check for our models. Since mDape is a median (of the absolute percentage error of each observation in the test set, (abs((actual-forecast)/actual)*100) ), it is more robust to outliers than other measures, and thus complements RMSE. Note, we shouldn't use this as an optimization metric (as it is not a proper scoring rule, and treats over-forecasts and under-forecasts asymmetrically)**, but it is useful for comparing model performance in a manner that is more robust to outliers (since it is a median). For example, in a case where two models differ slightly in their RMSE, we may accept a model with a slightly higher RMSE if it has a lower mDape than its competitor model (this arrangement would likely signal that the other "low RMSE" model might simply be fitting toward some high value outliers, at the expense of the median property). 
From 9ab70d8a4a5ef328aa8d5689ba75073b5b7ff0d9 Mon Sep 17 00:00:00 2001 From: TimCookCountyDS Date: Fri, 22 May 2026 16:06:56 -0500 Subject: [PATCH 17/32] Update README.md Save link in readme --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index 21fb8d2..d6d1224 100644 --- a/README.md +++ b/README.md @@ -31,6 +31,7 @@ * [Sales Ratio Studies](SOPs/Sales-Ratio-Studies.md) * [Desk Review](SOPs/Desk-Review.md) * [Open Data](SOPs/Open-Data.md) +* [Model Evaluation and Machine Learning Metrics](SOPs/Model-Evaluation-ML-metrics.md) ### Residential From 3a61cb280f53bfc0a3a24662298051e989a04ec2 Mon Sep 17 00:00:00 2001 From: Tim Sparer Date: Fri, 5 Jun 2026 23:27:45 +0000 Subject: [PATCH 18/32] Updated model eval wiki with BIlly's edits --- SOPs/Model-Evaluation-ML-metrics.md | 141 ++++++++++++---------------- 1 file changed, 61 insertions(+), 80 deletions(-) diff --git a/SOPs/Model-Evaluation-ML-metrics.md b/SOPs/Model-Evaluation-ML-metrics.md index 67fa187..d00a104 100644 --- a/SOPs/Model-Evaluation-ML-metrics.md +++ b/SOPs/Model-Evaluation-ML-metrics.md @@ -1,158 +1,139 @@ # Model Validation, Interpretation, and Testing Guidelines -After running your model (either with a given set of historic parameters (e.g. those from past years) or with parameters found via cross-validation), You will need to interpret the results. To do so, you will need to first assess how well the sample data (sales) matches the population (buildings to be assessed), then note any mix of data, practical, or real-world factors that may be impacting your model performance (e.g. -novel market or time trends), and finally interpret the formal model performance statistics to select the best model (where best means the most accurate + generalizable model, that conforms to IAAO standards). +After running the model you will need to interpret the results. 
First, assess how well the sample data (sales sample, training data) matches the population (assessment set, all parcels to be assessed), then note any factors that may be impacting your model performance, such as data issues or novel market trends. Finally, interpret the formal model performance statistics to select the most accurate and generalizable model that conforms to IAAO standards. ---- -## This Guide Covers -1. Assessing how representative your sales (sample) are, relative to the population (all buildings-to-be-assessed). +**Overview:** + +1. Assessing how representative your sales sample is of the assessment set. - a. Balance tests - b. Visual inspection - c. Not missing at random - d. Domain specific approach -2. Noting any real-world changes that may impact your model, and/or interactions between data and model that may effect your results. (Model drift, data drift). -3. Interpreting Model performance (using machine learning and assessment metrics). - ---- - -## 1. Assessing how representative your sales-sample is, relative to the population (all buildings-to-be-assessed). - -To ensure that a model can generalize from a sample (the sales data that we train the model on) to the population (all buildings being assessed), we need to check that our sales sample contains buildings that are similar to those in the entire population: That is, we should generally see buildings with the same composition of features, in the same proportions, in the sales-sample and in the population. We check for this in several ways (primarily with statistical tests and visual inspections of distributions). +2. Noting any real-world housing market changes that may impact your model, and/or interactions between data and model that may affect your results (model drift, data drift). 
-*(Note: if there are factors that cause some properties (and their attendant sale-prices) to be over-represented in the sample, the model will over-index to these types of properties, likely leading to over or under-valuation).* +3. Interpreting model performance (evaluating machine learning and assessment metrics). --- -Ways that we test for differences between the sample and the population (and some possible corrections): In a perfectly matched sample, no feature would predict inclusion/exclusion of a property in the sales-sample. That is, all properties, of all types would have an equal chance of being sold in a given year. These tests allow us to test that assumption (and to develop possible corrections). +## 1. Assessing How Representative Your Sales Sample Is of the Assessment Set -### A. Balance Tests +To ensure that a model can generalize from a sample to the population, we need to check that our sales sample contains buildings that are similar to those in the entire population. We should generally see parcels with the same composition of features, in the same proportions, in the sales (sample) and assessment set (population). We check for this with statistical tests and visual inspections of distributions. -*(See the "Statistical Tests" section of the model performance report.)* In a perfectly matched sample, no feature would predict inclusion/exclusion of a property in the sales-sample. Any feature that predicts inclusion in the sales set at a level greater than chance (statistical significance) suggests that this feature is over-or under-represented in the sample and will likely bias your results. (This is especially the case for features that also turn out to have high shap values in your results). To check this, we run a simple logistic regression predicting the likelihood-of-a-sale, given a property's features. 
The resulting p values (for each feature in the report) tells you that a feature predicts inclusion in the sample at a level greater than expected-due-to-chance, while the Beta value gives you a relative sense of the weight (importance) and direction (include vs exclude) of that feature. (In our report, asterisks, represent statistically significant predictors). (Low p-values suggest statistical significance, high magnitudes for the Betas suggest a large impact). When a feature is predictive of inclusion in the sample, this means that your sample is likely biased towards properties with this feature, and may thus value these, or other properties inaccurately. +> **Note:** If there are factors that cause some parcels to be over-represented in the sales sample, the model will over-index to these types of properties, likely leading to over- or undervaluation. -- **Caveats/Real world observations:** Our 2026 reports indicate that there may be some imbalance in our sample (see, for example, # of bedrooms, baths, various ACS5 values). We currently don't correct for this. +### Testing for and Correcting Differences Between Sales (Sample) and the Assessment Set (Population) -- **Caveats/Real world observations/solution:** As many of our features may be correlated (e.g. baths and bedrooms), or have other interactions (e.g. geography, ACS5, characteristics), and some of those predict inclusion in the sales-sample – we may be underestimating the divergence between our sample and population. More specifically certain types of buildings in certain neighborhoods may be over-represented in the sales sample. (We currently don't analyze balance at a neighborhood level). If we want to address this in the future, we may want to attempt some sort of dimensionality/feature reduction, and then re-run the balance tests on the reduced feature space. If there are problems with sales-volume, this may be worthwhile. 
If needed, one can attempt to apply a correction with Inverse propensity scoring (which upweights the value of errors on under-sampled properties) – (example suggestion for a time sensitive sample IPW here: https://github.com/ccao-data/model-res-avm/issues/297) - Note this may not drastically improve overall accuracy metrics., but might improve neighborhood or township level performance, in particular vertical equity. +In a perfectly matched sample, no feature would predict whether a parcel is more or less likely to be part of the sales sample — all properties of all types would have an equal chance of being sold in a given year. These tests allow us to test that assumption and to develop possible corrections: -- To validate your investigations from the balance tests, you can look at the standardized mean differences (between sample + population) for each feature. (Larger differences = more likely deviation between sample and population). +#### I. Balance Tests -### B. Visual Inspection +(See the "Statistical Tests" section of the model performance report.) Any feature that significantly predicts inclusion in the sales sample is likely over- or under-represented in the sample and will likely bias results. This is especially the case for features that also turn out to have high SHAP values. The p-value for each feature in the report tells you whether that feature predicts inclusion in the sales sample at a level greater than expected, while the Beta value gives you a relative sense of the weight (importance) and direction (include vs exclude) of that feature. In our report, asterisks represent statistically significant predictors. -See empirical distributions on the performance report. The distribution of a feature in the sales-set should visually match those in the population set. (Note: We only calculate this at the full sample level, it may differ at the township or neighborhood level, and perhaps ought to be investigated further). 
If needed can apply a KS test to see that the feature distributions (between sample and population) are the same. +- **a.** Our 2026 reports indicate that there may be some imbalance in our sample (see, for example, # of bedrooms, baths, various ACS5 values). We currently don't correct for this. -### C. Non-Missing at Random +- **b.** Many of our features may be correlated (baths and bedrooms) or interact with one another (geography, ACS5, characteristics), and some of those predict inclusion in the sales sample. We may be underestimating the divergence between the sales sample and assessment set, and certain types of buildings in certain neighborhoods may be over-represented in the sales sample. We don't currently analyze balance at a neighborhood level. We may want to attempt some sort of dimensionality/feature reduction and then re-run the balance tests on the reduced feature space. One potential option is to apply a correction with inverse propensity weighting (IPW), which upweights the value of errors on under-sampled properties. Note that this may not drastically improve overall accuracy metrics, but might improve neighborhood or township level performance, in particular vertical equity. -"Non-missing at random" means that some feature (or a particular value of a feature) is missing in a way that's correlated with your outcome variable, or some other variable in the dataset. This can sometimes indicate systemic under-sampling. In our case, it is somewhat controlled for by the fact that lgbm actually incorporates Nulls as a predictor*, because of this, we currently don't track correlations of nulls as rigorously as we otherwise might, though you can get a sense of the percentage of nulls-by-feature by looking at the "Missingness" heading in the Feature report. +- **c.** To validate your investigations from the balance tests, look at the standardized mean differences between the sales sample and assessment set for each feature. 
Larger differences suggest that the sales sample is more likely to deviate from the assessment set. -\* ( [lgbm handling missing values](https://medium.com/@andrywmarques/how-lgbm-deals-with-missing-values-bd361636357f). see [also](https://coder-wang-uspsa.medium.com/how-do-xgboost-lightgbm-and-catboost-handle-missing-features-e541da94d528) ) +#### II. Visual Inspection -### D. A Quick Domain Specific Sanity Check +See empirical distributions on the performance report. The distribution of a feature in the sales sample should visually match its distribution in the assessment set. We only calculate this for the full sales sample, but it may differ at the township or neighborhood level. We could apply a KS test to check whether the feature distributions in the sales sample and assessment set are the same. -Compare year over year changes in assessed values for sold-and-unsold houses. (This is documented in the performance report under "Change In and Out of Sample"). The sold and unsold properties should have roughly similar changes in assessed values. +#### III. Non-Missing at Random ---- +"Non-missing at random" means that some feature or a particular value of a feature is missing in a way that's correlated with the outcome variable or another variable in the dataset. This can indicate systemic undersampling. In our case, it is somewhat controlled for by the fact that lgbm actually incorporates missing values as a predictor\*. Because of this, we currently don't track correlations of nulls as rigorously as we otherwise might, though you can get a sense of the percentage of missingness for each feature by looking at the "Missingness" heading in the Feature report. -## 2. Note any real-world changes that may impact your model, and/or interactions between data and model that may effect your results. (Model drift, data drift). 
+> \* See: [How does LGBM deal with missing values?](https://www.google.com/search?q=how+does+lgbm+deal+with+missing+values) | [How do XGBoost, LightGBM, and CatBoost Handle Missing Features?](https://www.google.com/search?q=how+do+xgboost+lightgbm+catboost+handle+missing+features) -This step is less specific, but is useful: Since our model uses temporal features, are there any recent changes (market trends etc.) that may impact it? While large changes in major variables (general increases in sales price across the board) should be obvious in the model results, it's useful to think through other dynamics: For example, have there been changes in preferences (e.g. people moving to the suburbs for more space) that the model should be picking up on? A useful sanity check for this sort of change is to compare the SHAPS on a newly trained model to a prior model; is the new model picking up the newly important variable (did its SHAP increase)? To check for data drift over time, you can compare the feature distributions from a recent year to those of a prior (formally, you could do a KS test, though we do not do so currently – you can eyeball the "Distributions of Features" in the feature report). Note: Since we use a boosted model with historic data, and retrain each year, we are less subject to major problems with data or model drift: However, this can still pose an issue if we have high temporal volatility and low recent sales volume. +#### IV. Domain Specific Sanity Check -https://www.ibm.com/think/topics/model-drift +Compare year-over-year changes in assessed values for sold-and-unsold houses. This is documented in the performance report under "Change In and Out of Sample". The sold and unsold properties should have roughly similar changes in assessed values. --- -## 3. Interpret Model Performance Statistics +## 2. Note Any Housing Market Trends That May Impact Your Model -We calculate several measures to assess the efficacy of the model. 
These measures range from more traditional machine learning metrics to assessment-specific metrics. The machine learning metrics we calculate are RMSE, mDape, and R squared. The assessment metrics that we calculate and attend to are (primarily) Median Ratio and Coefficient of Dispersion (C.o.D) for accuracy and precision (respectively). We also supplement our analysis with measures of vertical equity ("how equal are our assessments at all price levels") these are PRB, PRD, and MKI, and ratio curves. (These assessment metrics are explained in detail in this article: [Mass Appraisal For The Masses: The Basics - by Lars Doucet](https://progressandpoverty.substack.com/p/mass-appraisal-for-the-masses-the) – with the exception of MKI, a reference to which can be found In Chris Berry's ["Reassessing the Property Tax"](https:/law.yale.edu/sites/default/files/area/center/corporate/spring2022_paper_berrychristopher_2-24-22.pdf) - See CCAO's [guide to Sales-Ratio-Studies](https://github.com/ccao-data/wiki/blob/master/SOPs/Sales-Ratio-Studies.md) as well. +Note any housing market trends that may impact your model and/or interactions between data and model that may affect your results (model drift, data drift). Since our model uses temporal features, are there any recent trends that may impact it? While large, across-the-board changes in sale prices should be obvious in the model results, it's useful to compare changes in the model's assignment of feature importance (SHAP, gain) to trends in the sales sample (example: changing consumer preferences across years should be reflected in changes in SHAP values between models trained on data from separate years). To check for data drift over time, you can compare the feature distributions from a recent year to those of a prior year. Formally, you could do a KS test, though we do not currently do so. Less formally, you could eyeball the "Distributions of Features" in the feature report. 
-Our approach is to fit candidate models using the machine learning (m.l.) metrics, and then compare/evaluate our candidate models based on both the m.l. metrics and the assessment metrics. We can think of the assessment-specific metrics as both offering acceptable bounds for candidate models* and as a way to "break ties" between models that are otherwise similar in terms of m.l. metrics -*(e.g. even if a model shows good m.l. fits, if it is outside the acceptable assessment bounds, we may ignore it). +> **Note:** Since we use a boosted model with historic data and retrain each year, we are less subject to major problems with data or model drift. However, this can still pose an issue if we have high temporal volatility and low recent sales volume. -Note: This discussion presumes a train-test breakout, where we fit the model on a subset of our data (training set) and calculate the performance measures on data that the model has not seen (the test set). We use this approach to avoid overfitting (see below for specifics) and ensure that our model is generalizable out-of-sample. (Specifics on those break-outs later). +See: https://www.ibm.com/think/topics/model-drift --- -### Model Fitting: Machine Learning Metrics +## 3. Interpreting Model Performance Statistics -**Fitting the model:** We aim to fit the model using proper scoring rules, of which RMSE is our primary measure. +We calculate traditional machine learning metrics and assessment-specific metrics to assess the model. The machine learning metrics we calculate are RMSE, mDape, and R-squared. The assessment metrics that we calculate and attend to are Median Ratio and Coefficient of Dispersion (COD) for accuracy and precision (respectively). We supplement our analysis with measures of vertical equity (how equal are assessments at all price levels). These are PRB, PRD, MKI, and ratio curves. 
These assessment metrics are explained in detail in [Mass Appraisal For The Masses: The Basics by Lars Doucet](https://www.google.com/search?q=Mass+Appraisal+For+The+Masses+The+Basics+Lars+Doucet). MKI is outlined in depth in [A Gini measure for vertical equity in property assessments](https://www.google.com/search?q=gini+measure+vertical+equity+property+assessments). See also our guide to ratio studies: https://github.com/ccao-data/wiki/blob/master/SOPs/Sales-Ratio-Studies.md -#### RMSE – Root Mean Squared Error +Our approach is to fit candidate models using the machine learning (ML) metrics then compare/evaluate our candidate models based on both the ML metrics and the assessment metrics. We can think of the assessment metrics as acceptable bounds for candidate models and as a way to "break ties" between models that are otherwise similar in terms of ML metrics. For example, if a model shows good ML fits but is outside the acceptable assessment bounds, we might ignore it. -This is generally the metric that we use to formally fit the models. It is just the mean of the, squared-errors (prediction – forecast), rescaled with a square-root. Smaller (closer to zero) is better. Note that since it squares the error (which allows for smooth-model fitting), it will tend to overweight large errors and/or large outliers, (which can be an issue for skewed data, like housing prices). +> **Note:** This discussion presumes a train-test breakout, where we fit the model on a subset of our data (training set) and calculate the performance measures on data that the model has not seen (the test set). We use this approach to avoid overfitting (see below for specifics) and ensure that our model is generalizable out-of-sample. -Note that RMSE works best on normally distributed data, and our data is generally skewed, with high-value outliers. 
As such, fitting a model with this measure will tend to regress toward the mean (leading to under-valuations of pricey-properties, and over-valuations of lower-priced properties). We can adjust for this by using different objective functions (quantile functions, bespoke penalty terms); OR by finding candidate models with low RMSE, but making a final model choice based on ratio-curves and vertical equity measures*. That said, even after accounting for regression to the mean, we have found RMSE to lead to accurate and reasonably precise models (as judged by other measures) and its status as a proper-scoring rule gives us some safeguards against other biases. -*(For example, perhaps we test two models; one with an additional feature that primarily leads to more accurate valuations for lower valued properties (sale_price < median_sale_price), given the skew in our data (high value properties contribute proportionally more to RMSE) we might find that this feature doesn't move our RMSE calculation very much, but does improve our vertical equity based on our vertical equity measures. We could justify selecting the model+new-feature based on their vertical equity improvements, rather than the RMSE alone). +### Model Fitting — Machine Learning Metrics ---- +#### Fitting the Model -**RMSE interpretability:** An additional reason to use RMSE is its interpretability. RMSE is on the same scale as your outcome value, and can be interpreted with reference to the mean, median, and standard deviation of your sample data (either train or test set). Since RMSE is structurally similar to measures of variability, such as standard deviation (the average distance of sample or population values from the mean), you can often interpret RMSE in relation to SD. (E.g. If one thinks of the mean as the simplest "model" of a distribution, then one can interpret the standard deviation in a manner similar to RMSE- the average deviation of your observations from your mean. 
More complex+accurate models should have an RMSE lower than the standard deviation of your test data). This insight is also useful for model comparison, as the standard deviation can be used as a baseline, or scale, with which to benchmark RMSE values from candidate models. (E.g. if two candidate models differ by some magnitude of RMSE how "large" or "trivial" is that difference, relative to the underlying standard deviation of your sample, or test data?) +We aim to fit the model using proper scoring rules, of which RMSE is our primary measure. -See [here for further discussion](https://stats.stackexchange.com/questions/242787/how-to-interpret-root-mean-squared-error-rmse-vs-standard-deviation) - more formally, here: Shmueli, G., Bruce, P. C., Stephens, M., & Patel, N. R. (2016). Data Mining for Business Analytics: Concepts, Techniques, and Applications with JMP Pro (3rd Edition). Wiley. +#### RMSE (Root Mean Squared Error) ---- +This is generally the metric that we use to formally fit the models. It is the mean of the squared errors, rescaled with a square-root. Closer to zero is better. Note that since it squares the error it can penalize larger errors more heavily, which can particularly be an issue for skewed data like housing prices. -#### mDape (Median Absolute Percentage Error) +RMSE works best on normally distributed data and our data is generally skewed, with high-value outliers. Fitting a model with this measure will tend to regress toward the mean, leading to under-valuations of pricey properties, and over-valuations of lower-priced properties. We can adjust for this by using different objective functions (quantile functions, bespoke penalty terms) or by finding several candidate models with low RMSE and making a final model choice based on ratio-curves and vertical equity measures. Despite regression to the mean, we have found RMSE to lead to accurate and precise models and its status as a proper-scoring rule safeguards against other biases. 
-*(used in [Zillow model](https://www.zillow.com/zestimate/), discussion [here](https://stats.stackexchange.com/questions/596324/is-median-absolute-percentage-error-useless#:~:text=Percentage%20Error%20Asymmetry:%20A%20significant%20drawback%20of,same%20factor%20yields%20only%20a%2050%25%20error) )* +As an example of the interplay between RMSE and assessment metrics, suppose we test a model with an additional feature that leads to more accurate valuations for properties with sale prices below the median. Given the skew in our data (high value properties contribute proportionally more to RMSE) we might find that this feature doesn't move our RMSE calculation very much but does improve vertical equity. We could justify selecting the model with the new feature based on its vertical equity improvements, rather than RMSE alone. +**RMSE interpretability:** An additional reason to use RMSE is its interpretability. RMSE is on the same scale as the outcome value, and can be interpreted with reference to the mean, median, and standard deviation of the sales sample. Since RMSE is structurally similar to measures of variability such as standard deviation, you can interpret RMSE in relation to standard deviation. (If one thinks of the mean as the simplest "model" of a distribution, then one can interpret the standard deviation in a manner similar to RMSE — the average deviation of your observations from your mean.) More accurate models should have an RMSE lower than the standard deviation of your test data. This insight is also useful for model comparison, as the standard deviation can be used as a baseline to benchmark RMSE values from candidate models. (For example, if two candidate models differ by some magnitude of RMSE, how "large" or "trivial" is that difference, relative to the underlying standard deviation of your sample, or test data?) -mDape is a useful sanity check for our models. 
Since mDape is a median (of the absolute percentage error of each observation in the test set, (abs((actual-forecast)/actual)*100) ), it is more robust to outliers than other measures, and thus complements RMSE. Note, we shouldn't use this as an optimization metric (as it is not a proper scoring rule, and treats over-forecasts and underforecasts assymetrically)**, but it is useful for comparing model performance in manner that is more robust to outliers (since it is a median). For example, in a case where two models differ slightly in their RMSE, we may accept a model with a slightly higher RMSE if it has a lower mDape than its competitor model (This arrangement would likely signal that the other "low RMSE" model might simply be fitting toward some high value outliers, at the expense of the median property). +Further discussion [here](https://stats.stackexchange.com/questions/242787/how-to-interpret-root-mean-squared-error-rmse-vs-standard-deviation) — more formally, here: [Shmueli, G., Bruce, P. C., Stephens, M., & Patel, N. R. (2016). *Data Mining for Business Analytics: Concepts, Techniques, and Applications with JMP Pro* (3rd Edition). Wiley.](https://www.amazon.fr/Data-Mining-Business-Analytics-Applications/dp/1118877438/) -\**(e.g. suppose our Forecast =2, and the actual value =1; this will cause the APE =100%, Whereas an actual value of 3 (with a forecast of 2) will lead to an APE of 33%).* +#### mDape (Median Absolute Percentage Error) ---- +[Zillow's performance metric of choice](https://stats.stackexchange.com/questions/596324/is-median-absolute-percentage-error-useless#:~:text=Percentage%20Error%20Asymmetry:%20A%20significant%20drawback%20of,same%20factor%20yields%20only%20a%2050%25%20error). Since mDape is a [median-based metric](https://en.wikipedia.org/wiki/Mean_absolute_percentage_error), it is more robust to outliers than other measures and complements RMSE. 
We shouldn't use mDape as an optimization metric as it is not a proper scoring rule, and treats over-forecasts and underforecasts asymmetrically, but it is useful for comparing model performance in a manner that is more robust to outliers. For example, in a case where two models differ slightly in their RMSE, we may accept a model with a slightly higher RMSE if it has a lower mDape than its competitor. This arrangement would likely signal that the other "low RMSE" model might simply be fitting toward some high value outliers at the expense of the median property. -#### R-squared +#### R-Squared -We report R-squared because it is a common and somewhat interpretable metric (R square varies between 0 and 1, and values of R squared closer to 1 are generally considered to signify better model fits), however given the myriad problems with R squared ([see here: Is R-squared Useless? | UVA Library](https://library.virginia.edu/data/articles/is-r-squared-useless)) we shouldn't make any model justification decisions based off it. At most, we can check for consistency with RMSE at local levels (e.g. townships) - (In cases where there is a discrepancy between goodness of fit as suggested by R-squared vs RMSE – default to the RMSE, and, perhaps investigate reasons for underlying low R-squared (Note that R squared is sensitive to scale, variance, and nonlinearities in the underlying data, so these may be causes of discrepancies between R squared and RMSE). +We report R-squared because it is a common and somewhat interpretable metric. R-squared varies between 0 and 1, and values of R-squared closer to 1 may suggest better model fits. Given the myriad problems with R-squared (see: [Is R-squared Useless? | UVA Library](https://library.virginia.edu/data/articles/is-r-squared-useless)) we shouldn't base any model decisions off it. At most we can check for consistency with RMSE within reporting geographies. 
In cases where there is a discrepancy between goodness-of-fit as suggested by R-squared and RMSE, default to the RMSE and investigate why the R-squared disagrees. R-squared is sensitive to scale, variance, and nonlinearities in the underlying data, so these may be causes of discrepancies between R-squared and RMSE. -- R-squared does not measure goodness of fit. -- R-squared does not measure predictive error. -- R-squared does not allow you to compare models using transformed responses. +- R-squared does not necessarily measure goodness-of-fit. +- R-squared does not necessarily measure predictive error. - R-squared does not measure how one variable explains another (it's not causal). -See also: prof [Cosma Shalizi's notes](https://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/10/lecture-10.pdf) - -For a more sensible interpretation of goodness-of-fit, (in the spirit of how people think R-squared works) – simply eyeball the table in the model performance report for "Estimate vs Actual (Individual Obs.)" (look for changes between model estimates and actuals). +See [Shalizi notes](https://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/10/lecture-10.pdf) ---- +For a more sensible interpretation of goodness-of-fit, simply eyeball the "Estimate vs Actual (Individual Obs.)" table in the model performance report. Look for differences between model estimates and actuals. ### Train/Test Splits -We follow a standard train/test split to test our models for overfitting. Overfitting is when our model picks-up spurious correlations in our sample, and erroneously uses those to predict sales. This makes our model less generalizable. +We follow a standard train/test split to test our models for overfitting. Overfitting is when our model picks up spurious correlations in our sample and uses those to predict sales. In other words, our model studies the sample so deeply it loses the ability to imagine a sample that looks different. 
It will do a good job predicting on the sample it was trained on but will perform worse on observations it has never seen. The ability of a model to handle situations it hasn't seen is called its "generalizability". -When we fit the model, we hold out a portion of our sample data (the test-set) and do not train, or fit, the model on that data. (The training data is the sample data minus the test-set). We fit the model on the training data and then predict sales for the training data and calculate the model's performance metrics (RMSE, mDape, etc.); we then make predictions on the test data and calculate the test data's performance metrics. We use the difference between these two sets (train and test metrics) to gauge how much a given model may overfit. +When we fit the model we hold out a portion of our sample data, the test set. We do not train, or fit, the model on the test set. The training data is the sample data minus the test set. We fit the model on the training data and then predict sales for both the training data and the test set, then calculate the model's performance metrics (RMSE, mDape, etc.) for both. We use the difference between these two sets (train and test metrics) to gauge how much the model is overfit. -Generally, the wider the difference between these two measures, the more likely it is that the model is overfit. For example an RMSE of 10k on our sales training predictions, and an RMSE of 179k on the test predictions (with a standard deviation of 180k on both the train and test actual distributions) would very likely indicate that the model is overfit to the training data (because of the large difference between train and test RMSE, especially relative to the standard deviation). +Generally, the larger the difference between these two measures the more likely it is that the model is overfit. 
An RMSE of $10k on our sales training predictions and an RMSE of $179k on the test predictions with a standard deviation of $180k on both the train and test actual distributions would indicate that the model is overfit to the training data (because of the large difference between train and test RMSE, especially relative to the standard deviation). -A note: If we only compare model performance in terms of test sets (e.g. looking at differences in RMSE between two models, without looking at train test splits), we may accidentally select a model that is overfit (even though we are analyzing test-set scores). For example, if we are looking at the test results from 2 separate models trained on the same train/test splits, and one model has an RMSE of 75k, while a second model has an RMSE of 80k (and the standard deviation for both train test splits is 180k), we might select the first (75k) model. However, if there is a large difference between the train and test sets for the 75k model (e.g. the train set is RMSE is 10k), but not for the 80k model (perhaps RMSE for this model's train set is 40k)- we might be less inclined to pick the 75k model (due to the possibility of the 75k model being overfit). (Note: we wouldn't be able to tell definitively if the 80k model is more generalizable: we might want to test additional models and see their train/test splits, and test accuracy, to get a better sense of whether the 75k model was anomalous. That said, the large gap, relative to the 2nd model (and the standard deviation of the underlying data) may be sign enough of overfitting). +> **Note:** If we only compare model performance in terms of test sets (that is, by looking at differences in RMSE between the test sets for two models, without looking at train/test splits), we may accidentally select a model that is overfit even though we are analyzing test-set scores. 
For example, for two separate models trained on the same train/test splits, one model has an RMSE of $75k and the other has an RMSE of $80k. The standard deviation for both train-test splits is $180k. We might be inclined to select the first ($75k) model, but if there is a larger difference between the train and test sets for the $75k RMSE model (whose hypothetical train-set RMSE is $10k) versus that of the $80k RMSE model (whose hypothetical train-set RMSE is $40k), we should consider the possibility that the $75k model is overfit. While the large gap, relative to the 2nd model and the standard deviation of the underlying data may be sign enough of overfitting, we can't tell definitively if the $80k model is more generalizable and we would want to test additional models to get a better sense of whether the $75k model was anomalous. -Further reading: Bias Variance Trade-off +Further reading — Bias Variance Trade-off: - https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff - https://www.ibm.com/think/topics/bias-variance-tradeoff ---- - -Finally: If your sample is not a good match for your population, good train-test splits will only take you so far (it will be hard to generalize any model). This is because your sample will lack representative training data. This is why balance tests (see earlier section) are important. - ---- +> **Finally:** If your sample is not a good match for your population, good train-test splits will only take you so far. This is because your sample lacks representative training data. This is why balance tests (see earlier section) are important. 
### Assessment Metrics -Interpretations and acceptable ranges for assessment metrics can be found in our [companion sale-ratio-studies wiki](https://github.com/ccao-data/wiki/blob/master/SOPs/Sales-Ratio-Studies.md) +Interpretations and acceptable ranges for assessment metrics can be found here: +https://github.com/ccao-data/wiki/blob/master/SOPs/Sales-Ratio-Studies.md -longer descriptions here: [Mass Appraisal For The Masses: The Basics - by Lars Doucet](https://progressandpoverty.substack.com/p/mass-appraisal-for-the-masses-the) +Longer descriptions here: [Mass Appraisal For The Masses: The Basics — by Lars Doucet](https://www.google.com/search?q=Mass+Appraisal+For+The+Masses+The+Basics+Lars+Doucet) --- -### Practical Process - -Pull all current data, and fit + predict with last years hyper-parameters, This model can act as a baseline for any improvements you may intend to make. At this stage, you will mostly be using the model reports to check for data-drift (year-over-year changes in features- see the feature-report), model+data drift (changes in shaps) data that is not-missing at random, and balance tests (are some features and property types over-represented?). You can attempt to make corrections at this stage (e.g. applying IPW to correct for imbalanced features), but most likely you will just have to call this out as a concern in your model. - ---- +## Practical Process -### Tune the Hyperparameters +1. **Pull all current data** then train and predict with last year's hyperparameters. This model can act as a baseline for any improvements you may intend to make. At this stage, you can use the model reports to check for year-over-year changes in features in the feature report, changes in feature importance (SHAP values, gain, etc.), data that is not-missing-at-random, and parity in feature distributions between the sales sample and the assessment set (balance tests). 
You can attempt to make corrections like IPW at this stage, but most likely you will just have to note these issues as concerns in your model. -Carry out a "CV run" with github actions. This will use a Bayesian optimizer to search for the best-fitting hyper-parameters. Assess the quality of the model, using the metrics discussed previously. Compare the newly tuned model to the old model. In addition to the machine learning and assessment metrics, you may wish to look at changes in assessments across townships? Are they relatively similar across models? Are there any big swings in one model but not the other? +2. **Tune the hyperparameters:** Carry out a "CV run" with GitHub Actions. This will use a Bayesian optimizer to search for the best fitting hyperparameters, using cross-validation. Assess the quality of the model, using the metrics outlined above. Compare the newly tuned model to the old model. In addition to the machine learning and assessment metrics, look at changes in assessments across townships. Are they relatively similar across models? Are there any big swings in one model but not the other? -(If there is no clear better model, but there are swings across differing townships, this may indicate that your models have plateaued and are fitting to local noise. In this case, you likely need to make a judgement call as to which model to select based on the particular swings/changes and outside information, and call-out the townships with large swings for desk review). +3. **If there is no clearly better model**, but there are swings in ML or assessment metrics across differing townships, this may indicate that your models have plateaued and are fitting to local noise. In this case, you likely need to make a judgement call as to which model to select based on the particular swings/changes and outside information and mention any instabilities or deviations (across models) in your desk review email. 
From 7ebc9822592092d7dfe471b1cafdcae4aa141f49 Mon Sep 17 00:00:00 2001 From: TimCookCountyDS Date: Fri, 5 Jun 2026 18:43:10 -0500 Subject: [PATCH 19/32] Update Model-Evaluation-ML-metrics.md fixed link --- SOPs/Model-Evaluation-ML-metrics.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/SOPs/Model-Evaluation-ML-metrics.md b/SOPs/Model-Evaluation-ML-metrics.md index d00a104..06d0138 100644 --- a/SOPs/Model-Evaluation-ML-metrics.md +++ b/SOPs/Model-Evaluation-ML-metrics.md @@ -123,8 +123,7 @@ Further reading — Bias Variance Trade-off: ### Assessment Metrics -Interpretations and acceptable ranges for assessment metrics can be found here: -https://github.com/ccao-data/wiki/blob/master/SOPs/Sales-Ratio-Studies.md +Interpretations and acceptable ranges for assessment metrics can be found [here.](https://github.com/ccao-data/wiki/blob/master/SOPs/Sales-Ratio-Studies.md) Longer descriptions here: [Mass Appraisal For The Masses: The Basics — by Lars Doucet](https://www.google.com/search?q=Mass+Appraisal+For+The+Masses+The+Basics+Lars+Doucet) From 83a08ab146cd7bcf632c554bc2e7ca2a0e0a154a Mon Sep 17 00:00:00 2001 From: TimCookCountyDS Date: Fri, 5 Jun 2026 18:44:30 -0500 Subject: [PATCH 20/32] Update Model-Evaluation-ML-metrics.md fix bias variance links --- SOPs/Model-Evaluation-ML-metrics.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/SOPs/Model-Evaluation-ML-metrics.md b/SOPs/Model-Evaluation-ML-metrics.md index 06d0138..d686d1f 100644 --- a/SOPs/Model-Evaluation-ML-metrics.md +++ b/SOPs/Model-Evaluation-ML-metrics.md @@ -115,9 +115,7 @@ Generally, the larger the difference between these two measures the more likely > **Note:** If we only compare model performance in terms of test sets (that is, by looking at differences in RMSE between the test sets for two models, without looking at train/test splits), we may accidentally select a model that is overfit even though we are analyzing test-set scores. 
For example, for two separate models trained on the same train/test splits, one model has an RMSE of $75k and the other has an RMSE of $80k. The standard deviation for both train-test splits is $180k. We might be inclined to select the first ($75k) model, but if there is a larger difference between the train and test sets for the $75k RMSE model (whose hypothetical train-set RMSE is $10k) versus that of the $80k RMSE model (whose hypothetical train-set RMSE is $40k), we should consider the possibility that the $75k model is overfit. While the large gap, relative to the 2nd model and the standard deviation of the underlying data may be sign enough of overfitting, we can't tell definitively if the $80k model is more generalizable and we would want to test additional models to get a better sense of whether the $75k model was anomalous. -Further reading — Bias Variance Trade-off: -- https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff -- https://www.ibm.com/think/topics/bias-variance-tradeoff +Further reading — [Bias Variance Trade-off](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff), [IBM's notes](https://www.ibm.com/think/topics/bias-variance-tradeoff) > **Finally:** If your sample is not a good match for your population, good train-test splits will only take you so far. This is because your sample lacks representative training data. This is why balance tests (see earlier section) are important. 
From fa8a18e55902f5ed99e4d6f2050ad4168373fd15 Mon Sep 17 00:00:00 2001 From: TimCookCountyDS Date: Mon, 8 Jun 2026 14:43:56 -0500 Subject: [PATCH 21/32] Update SOPs/Model-Evaluation-ML-metrics.md Co-authored-by: Nicole Jardine <138712135+ccao-jardine@users.noreply.github.com> --- SOPs/Model-Evaluation-ML-metrics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/SOPs/Model-Evaluation-ML-metrics.md b/SOPs/Model-Evaluation-ML-metrics.md index d686d1f..0e91bb0 100644 --- a/SOPs/Model-Evaluation-ML-metrics.md +++ b/SOPs/Model-Evaluation-ML-metrics.md @@ -42,7 +42,7 @@ See empirical distributions on the performance report. The distribution of a fea #### III. Non-Missing at Random -"Non-missing at random" means that some feature or a particular value of a feature is missing in a way that's correlated with the outcome variable or another variable in the dataset. This can indicate systemic undersampling. In our case, it is somewhat controlled for by the fact that lgbm actually incorporates missing values as a predictor\*. Because of this we currently don't track correlations of nulls as rigorously as we otherwise might, though you can get a sense of the percentage of missingness for each feature by looking at the "Missingness" heading in the Feature report. +"Missing not at random" means that some feature or a particular value of a feature is missing in a way that's correlated with the outcome variable or another variable in the dataset. This can indicate systemic undersampling. In our case, it is somewhat controlled for by the fact that lgbm actually incorporates missing values as a predictor\*. Because of this we currently don't track correlations of nulls as rigorously as we otherwise might, though you can get a sense of the percentage of missingness for each feature by looking at the "Missingness" heading in the Feature report. 
> \* See: [How does LGBM deal with missing values?](https://www.google.com/search?q=how+does+lgbm+deal+with+missing+values) | [How do XGBoost, LightGBM, and CatBoost Handle Missing Features?](https://www.google.com/search?q=how+do+xgboost+lightgbm+catboost+handle+missing+features) From f29ab19e049d47b85286766ba95594cc8b9cf4de Mon Sep 17 00:00:00 2001 From: TimCookCountyDS Date: Mon, 8 Jun 2026 14:45:07 -0500 Subject: [PATCH 22/32] Update SOPs/Model-Evaluation-ML-metrics.md Co-authored-by: Nicole Jardine <138712135+ccao-jardine@users.noreply.github.com> --- SOPs/Model-Evaluation-ML-metrics.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/SOPs/Model-Evaluation-ML-metrics.md b/SOPs/Model-Evaluation-ML-metrics.md index 0e91bb0..be07026 100644 --- a/SOPs/Model-Evaluation-ML-metrics.md +++ b/SOPs/Model-Evaluation-ML-metrics.md @@ -48,7 +48,9 @@ See empirical distributions on the performance report. The distribution of a fea #### IV. Domain Specific Sanity Check -Compare year-over-year changes in assessed values for sold-and-unsold houses. This is documented in the performance report under "Change In and Out of Sample". The sold and unsold properties should have roughly similar changes in assessed values. +Compare year-over-year changes in assessed values for sold and unsold houses. This is documented in the performance report under "Change In and Out of Sample". + +The sold and unsold properties should have roughly similar changes in assessed values, on the assumption that sold and unsold properties have similar characteristics and assessment histories. 
--- From 91f0a8e067d62be1328c52567b98fd2a046d990f Mon Sep 17 00:00:00 2001 From: TimCookCountyDS Date: Mon, 8 Jun 2026 14:46:00 -0500 Subject: [PATCH 23/32] Update SOPs/Model-Evaluation-ML-metrics.md Co-authored-by: Nicole Jardine <138712135+ccao-jardine@users.noreply.github.com> --- SOPs/Model-Evaluation-ML-metrics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/SOPs/Model-Evaluation-ML-metrics.md b/SOPs/Model-Evaluation-ML-metrics.md index be07026..21f0d9e 100644 --- a/SOPs/Model-Evaluation-ML-metrics.md +++ b/SOPs/Model-Evaluation-ML-metrics.md @@ -66,7 +66,7 @@ See: https://www.ibm.com/think/topics/model-drift ## 3. Interpreting Model Performance Statistics -We calculate traditional machine learning metrics and assessment-specific metrics to assess the model. The machine learning metrics we calculate are RMSE, mDape, and R-squared. The assessment metrics that we calculate and attend to are Median Ratio and Coefficient of Dispersion (COD) for accuracy and precision (respectively). We supplement our analysis with measures of vertical equity (how equal are assessments at all price levels). These are PRB, PRD, MKI, and ratio curves. These assessment metrics are explained in detail in [Mass Appraisal For The Masses: The Basics by Lars Doucet](https://www.google.com/search?q=Mass+Appraisal+For+The+Masses+The+Basics+Lars+Doucet). MKI is outlined in depth in [A Gini measure for vertical equity in property assessments](https://www.google.com/search?q=gini+measure+vertical+equity+property+assessments). See also our guide to ratio studies: https://github.com/ccao-data/wiki/blob/master/SOPs/Sales-Ratio-Studies.md +We calculate traditional machine learning metrics and assessment-specific metrics to assess the model. The machine learning metrics we calculate are RMSE, MdAPE, and R-squared. The assessment metrics that we calculate and attend to are Median Ratio and Coefficient of Dispersion (COD) for accuracy and precision (respectively). 
We supplement our analysis with measures of vertical equity (how equal are assessments at all price levels). These are PRB, PRD, MKI, and ratio curves. These assessment metrics are explained in detail in [Mass Appraisal For The Masses: The Basics by Lars Doucet](https://progressandpoverty.substack.com/p/mass-appraisal-for-the-masses-the). MKI is outlined in depth in [A Gini measure for vertical equity in property assessments](https://doi.org/10.63642/1357-1419.1225). See also our guide to ratio studies: https://github.com/ccao-data/wiki/blob/master/SOPs/Sales-Ratio-Studies.md Our approach is to fit candidate models using the machine learning (ML) metrics then compare/evaluate our candidate models based on both the ML metrics and the assessment metrics. We can think of the assessment metrics as acceptable bounds for candidate models and as a way to "break ties" between models that are otherwise similar in terms of ML metrics. For example, if a model shows good ML fits but is outside the acceptable assessment bounds, we might ignore it. From 3c7a57b12af8f21b2d3047a57de7b8964fe076b0 Mon Sep 17 00:00:00 2001 From: TimCookCountyDS Date: Mon, 8 Jun 2026 14:47:13 -0500 Subject: [PATCH 24/32] Update SOPs/Model-Evaluation-ML-metrics.md Co-authored-by: Nicole Jardine <138712135+ccao-jardine@users.noreply.github.com> --- SOPs/Model-Evaluation-ML-metrics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/SOPs/Model-Evaluation-ML-metrics.md b/SOPs/Model-Evaluation-ML-metrics.md index 21f0d9e..dd0158d 100644 --- a/SOPs/Model-Evaluation-ML-metrics.md +++ b/SOPs/Model-Evaluation-ML-metrics.md @@ -91,7 +91,7 @@ As an example of the interplay between RMSE and assessment metrics, suppose we t Further discussion [here](https://stats.stackexchange.com/questions/242787/how-to-interpret-root-mean-squared-error-rmse-vs-standard-deviation) — more formally, here: [Shmueli, G., Bruce, P. C., Stephens, M., & Patel, N. R. (2016). 
*Data Mining for Business Analytics: Concepts, Techniques, and Applications with JMP Pro* (3rd Edition). Wiley.](https://www.amazon.fr/Data-Mining-Business-Analytics-Applications/dp/1118877438/) -#### mDape (Median Absolute Percentage Error) +MdAPE [Zillow's performance metric of choice](https://stats.stackexchange.com/questions/596324/is-median-absolute-percentage-error-useless#:~:text=Percentage%20Error%20Asymmetry:%20A%20significant%20drawback%20of,same%20factor%20yields%20only%20a%2050%25%20error). Since mDape is a [median-based metric](https://en.wikipedia.org/wiki/Mean_absolute_percentage_error), it is more robust to outliers than other measures and complements RMSE. We shouldn't use mDape as an optimization metric as it is not a proper scoring rule, and treats over-forecasts and underforecasts asymmetrically, but it is useful for comparing model performance in a manner that is more robust to outliers. For example, in a case where two models differ slightly in their RMSE, we may accept a model with a slightly higher RMSE if it has a lower mDape than its competitor. This arrangement would likely signal that the other "low RMSE" model might simply be fitting toward some high value outliers at the expense of the median property. 
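A minimal sketch of the RMSE versus MdAPE trade-off described above, using made-up sale prices rather than output from our pipeline. The function names `rmse` and `mdape` and the two hypothetical models are illustrative assumptions only. Model A follows a single high-value sale closely, while Model B is closer on the typical homes.

```python
import math
import statistics


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the sale prices."""
    return math.sqrt(statistics.fmean((a - p) ** 2 for a, p in zip(actual, predicted)))


def mdape(actual, predicted):
    """Median absolute percentage error, as a fraction of the actual sale price."""
    return statistics.median(abs(a - p) / a for a, p in zip(actual, predicted))


# Hypothetical sale prices: four typical homes and one high-value outlier.
sales = [200_000, 250_000, 300_000, 350_000, 2_000_000]

# Hypothetical Model A: exact on the outlier, 15% high on every typical home.
model_a = [230_000, 287_500, 345_000, 402_500, 2_000_000]
# Hypothetical Model B: 2% high on typical homes, 10% low on the outlier.
model_b = [204_000, 255_000, 306_000, 357_000, 1_800_000]

for name, preds in [("A", model_a), ("B", model_b)]:
    print(f"Model {name}: RMSE = {rmse(sales, preds):,.0f}, MdAPE = {mdape(sales, preds):.1%}")
```

In this toy data, Model A has the lower RMSE because it fits the outlier, yet Model B has the lower MdAPE because it is closer on the median property. That is the pattern in which we might accept the slightly higher RMSE.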
From a8061672284c50ad3a7f8f0eff3254e311e30c46 Mon Sep 17 00:00:00 2001 From: TimCookCountyDS Date: Mon, 8 Jun 2026 14:48:42 -0500 Subject: [PATCH 25/32] Update SOPs/Model-Evaluation-ML-metrics.md Co-authored-by: Nicole Jardine <138712135+ccao-jardine@users.noreply.github.com> --- SOPs/Model-Evaluation-ML-metrics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/SOPs/Model-Evaluation-ML-metrics.md b/SOPs/Model-Evaluation-ML-metrics.md index dd0158d..a755582 100644 --- a/SOPs/Model-Evaluation-ML-metrics.md +++ b/SOPs/Model-Evaluation-ML-metrics.md @@ -93,7 +93,7 @@ Further discussion [here](https://stats.stackexchange.com/questions/242787/how-t MdAPE -[Zillow's performance metric of choice](https://stats.stackexchange.com/questions/596324/is-median-absolute-percentage-error-useless#:~:text=Percentage%20Error%20Asymmetry:%20A%20significant%20drawback%20of,same%20factor%20yields%20only%20a%2050%25%20error). Since mDape is a [median-based metric](https://en.wikipedia.org/wiki/Mean_absolute_percentage_error), it is more robust to outliers than other measures and complements RMSE. We shouldn't use mDape as an optimization metric as it is not a proper scoring rule, and treats over-forecasts and underforecasts asymmetrically, but it is useful for comparing model performance in a manner that is more robust to outliers. For example, in a case where two models differ slightly in their RMSE, we may accept a model with a slightly higher RMSE if it has a lower mDape than its competitor. This arrangement would likely signal that the other "low RMSE" model might simply be fitting toward some high value outliers at the expense of the median property. +Median absolute percentage error (MdAPE) is a [median-based metric](https://support.numxl.com/hc/en-us/articles/115001223503-MdAPE-Median-Absolute-Percentage-Error). It is more robust to outliers than other measures and complements RMSE. 
We shouldn't use MdAPE as an optimization metric as it is not a proper scoring rule, and treats over-forecasts and under-forecasts asymmetrically, but it is useful for comparing model performance in a manner that is more robust to outliers. For example, in a case where two models differ slightly in RMSE, we may accept a model with a slightly higher RMSE if it has a lower MdAPE. This arrangement would likely signal that the other "low RMSE" model might simply be fitting toward some high value outliers at the expense of the median property. #### R-Squared From e856513b778fec02cac915eea41952676c9e289a Mon Sep 17 00:00:00 2001 From: TimCookCountyDS Date: Tue, 9 Jun 2026 12:02:43 -0500 Subject: [PATCH 26/32] Update SOPs/Model-Evaluation-ML-metrics.md Co-authored-by: Nicole Jardine <138712135+ccao-jardine@users.noreply.github.com> --- SOPs/Model-Evaluation-ML-metrics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/SOPs/Model-Evaluation-ML-metrics.md b/SOPs/Model-Evaluation-ML-metrics.md index a755582..c2f43e2 100644 --- a/SOPs/Model-Evaluation-ML-metrics.md +++ b/SOPs/Model-Evaluation-ML-metrics.md @@ -40,7 +40,7 @@ In a perfectly matched sample, no feature would predict whether a parcel is more See empirical distributions on the performance report. The distribution of a feature in the sales sample should visually match those in the assessment set. We only calculate this for the full sales sample, but it may differ at the township or neighborhood-level. We could apply a KS test to check if the feature distributions between the sales sample and assessment set are the same. -#### III. Non-Missing at Random +#### III. Missing Not at Random "Missing not at random" means that some feature or a particular value of a feature is missing in a way that's correlated with the outcome variable or another variable in the dataset. This can indicate systemic undersampling. 
In our case, it is somewhat controlled for by the fact that lgbm actually incorporates missing values as a predictor\*. Because of this we currently don't track correlations of nulls as rigorously as we otherwise might, though you can get a sense of the percentage of missingness for each feature by looking at the "Missingness" heading in the Feature report. From 52fd82dd914f0cdf7cb92dd7ed7ba357dd609993 Mon Sep 17 00:00:00 2001 From: TimCookCountyDS Date: Tue, 9 Jun 2026 12:16:10 -0500 Subject: [PATCH 27/32] Update Model-Evaluation-ML-metrics.md corrected inaccurate links for lgbm missingness handling --- SOPs/Model-Evaluation-ML-metrics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/SOPs/Model-Evaluation-ML-metrics.md b/SOPs/Model-Evaluation-ML-metrics.md index c2f43e2..e054cbc 100644 --- a/SOPs/Model-Evaluation-ML-metrics.md +++ b/SOPs/Model-Evaluation-ML-metrics.md @@ -44,7 +44,7 @@ See empirical distributions on the performance report. The distribution of a fea "Missing not at random" means that some feature or a particular value of a feature is missing in a way that's correlated with the outcome variable or another variable in the dataset. This can indicate systemic undersampling. In our case, it is somewhat controlled for by the fact that lgbm actually incorporates missing values as a predictor\*. Because of this we currently don't track correlations of nulls as rigorously as we otherwise might, though you can get a sense of the percentage of missingness for each feature by looking at the "Missingness" heading in the Feature report. 
-> \* See: [How does LGBM deal with missing values?](https://www.google.com/search?q=how+does+lgbm+deal+with+missing+values) | [How do XGBoost, LightGBM, and CatBoost Handle Missing Features?](https://www.google.com/search?q=how+do+xgboost+lightgbm+catboost+handle+missing+features) +> \* See: [How does LGBM deal with missing values?](https://medium.com/@andrywmarques/how-lgbm-deals-with-missing-values-bd361636357f) [docs](https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html#missing-value-handle) | [How do XGBoost, LightGBM, and CatBoost Handle Missing Features?](https://coder-wang-uspsa.medium.com/how-do-xgboost-lightgbm-and-catboost-handle-missing-features-e541da94d528) #### IV. Domain Specific Sanity Check From e6fffa8ee2f865793e069d9de352fd59a04d00ff Mon Sep 17 00:00:00 2001 From: TimCookCountyDS Date: Tue, 9 Jun 2026 12:21:31 -0500 Subject: [PATCH 28/32] Update SOPs/Model-Evaluation-ML-metrics.md Co-authored-by: Nicole Jardine <138712135+ccao-jardine@users.noreply.github.com> --- SOPs/Model-Evaluation-ML-metrics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/SOPs/Model-Evaluation-ML-metrics.md b/SOPs/Model-Evaluation-ML-metrics.md index e054cbc..2c2e2ba 100644 --- a/SOPs/Model-Evaluation-ML-metrics.md +++ b/SOPs/Model-Evaluation-ML-metrics.md @@ -109,7 +109,7 @@ For a more sensible interpretation of goodness-of-fit, simply eyeball the table ### Train/Test Splits -We follow a standard train/test split to test our models for overfitting. Overfitting is when our model picks up spurious correlations in our sample and uses those to predict sales. In other words, our model studies the sample so deeply it loses the ability to imagine a sample that looks different. It will do a good job predicting on the sample it was trained on but will perform worse on observations it has never seen. The ability of a model to handle situations it hasn't seen is called its "generalizability". 
+We follow a standard [train/test split](https://github.com/ccao-data/model-res-avm/#using-training_data) to test our models for overfitting. Overfitting occurs when a model picks up spurious correlations in the training sample and uses those to predict sales. In other words, overfitting occurs when a model studies the training sample so deeply it loses the ability to generalize predictions to properties that are different from the training set. An overfit model will do a good job predicting on the sample it was trained on but will perform worse on observations it has never seen. The ability of a model to handle situations it hasn't seen is called its "generalizability". A good model is one that can generalize its predictions, such that it reasonably accurately predicts the sale prices of properties it has not been trained on. When we fit the model we hold out a portion of our sample data, the test set. We do not train, or fit, the model on the test set. The training data is the sample data minus the test set. We fit the model on the training data and then predict sales for both the training data and the test set, then calculate the model's performance metrics (RMSE, mDape, etc.) for both. We use the difference between these two sets (train and test metrics) to gauge how much the model is overfit. 
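The hold-out procedure above can be sketched as follows. This is an illustration under stated assumptions, not our pipeline: the data is simulated, and `fit_line` (one-feature least squares) and `fit_memorizer` (a nearest-neighbour lookup that memorizes the training set) are hypothetical stand-ins for real candidate models. The point is only to show how the train and test metrics are computed and compared.

```python
import math
import random


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted prices."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def fit_line(xs, ys):
    """Ordinary least squares on one feature; returns a predict function."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_memorizer(xs, ys):
    """1-nearest-neighbour lookup: reproduces the training data exactly."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda pair: abs(pair[0] - x))[1]


random.seed(0)
# Simulated (hypothetical) data: price driven by square footage plus noise.
sqft = [random.uniform(800, 3500) for _ in range(200)]
price = [150 * s + random.gauss(0, 40_000) for s in sqft]

# Hold out the last 20% as the test set; it is never used for fitting.
cut = int(0.8 * len(sqft))
train_x, test_x = sqft[:cut], sqft[cut:]
train_y, test_y = price[:cut], price[cut:]

results = {}
for name, fit in [("line", fit_line), ("memorizer", fit_memorizer)]:
    predict = fit(train_x, train_y)
    train_rmse = rmse(train_y, [predict(x) for x in train_x])
    test_rmse = rmse(test_y, [predict(x) for x in test_x])
    results[name] = (train_rmse, test_rmse)
    print(f"{name}: train RMSE {train_rmse:,.0f}, test RMSE {test_rmse:,.0f}, "
          f"gap {test_rmse - train_rmse:,.0f}")
```

The memorizer scores a train RMSE of zero because it simply looks up each training sale, so its entire test error shows up as a train-test gap. The fitted line cannot memorize the noise, so its train and test errors should be of similar size.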
From e0f0765fb5d980d7bff3a55f2bef2059ddc3951d Mon Sep 17 00:00:00 2001 From: TimCookCountyDS Date: Tue, 9 Jun 2026 12:24:44 -0500 Subject: [PATCH 29/32] Update Model-Evaluation-ML-metrics.md fixed progress and poverty link --- SOPs/Model-Evaluation-ML-metrics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/SOPs/Model-Evaluation-ML-metrics.md b/SOPs/Model-Evaluation-ML-metrics.md index 2c2e2ba..b4f9e85 100644 --- a/SOPs/Model-Evaluation-ML-metrics.md +++ b/SOPs/Model-Evaluation-ML-metrics.md @@ -125,7 +125,7 @@ Further reading — [Bias Variance Trade-off](https://en.wikipedia.org/wiki/Bias Interpretations and acceptable ranges for assessment metrics can be found [here.](https://github.com/ccao-data/wiki/blob/master/SOPs/Sales-Ratio-Studies.md) -Longer descriptions here: [Mass Appraisal For The Masses: The Basics — by Lars Doucet](https://www.google.com/search?q=Mass+Appraisal+For+The+Masses+The+Basics+Lars+Doucet) +Longer descriptions here: [Mass Appraisal For The Masses: The Basics — by Lars Doucet](https://progressandpoverty.substack.com/p/mass-appraisal-for-the-masses-the) --- From 68ff750dd27e63f775c6cc6ca9c2b1b59338f2ba Mon Sep 17 00:00:00 2001 From: TimCookCountyDS Date: Tue, 9 Jun 2026 12:40:11 -0500 Subject: [PATCH 30/32] Apply suggestion from @ccao-jardine Co-authored-by: Nicole Jardine <138712135+ccao-jardine@users.noreply.github.com> --- SOPs/Model-Evaluation-ML-metrics.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/SOPs/Model-Evaluation-ML-metrics.md b/SOPs/Model-Evaluation-ML-metrics.md index b4f9e85..d319606 100644 --- a/SOPs/Model-Evaluation-ML-metrics.md +++ b/SOPs/Model-Evaluation-ML-metrics.md @@ -131,6 +131,8 @@ Longer descriptions here: [Mass Appraisal For The Masses: The Basics — by Lars ## Practical Process +Our [annual model checklist](https://github.com/ccao-data/model-res-avm/blob/master/.github/ISSUE_TEMPLATE/annual-model-checklist.md) details the technical steps necessary to run candidate models each year. 
Below is an overview of how to reason statistically about model candidates to choose a final one. + 1. **Pull all current data** then train and predict with last year's hyperparameters. This model can act as a baseline for any improvements you may intend to make. At this stage, you can use the model reports to check for year-over-year changes in features in the feature report, changes in feature importance (SHAP values, gain, etc.), data that is not-missing-at-random, and parity in feature distributions between the sales sample and the assessment set (balance tests). You can attempt to make corrections like IPW at this stage, but most likely you will just have to note these issues as concerns in your model. 2. **Tune the hyperparameters:** Carry out a "CV run" with GitHub Actions. This will use a Bayesian optimizer to search for the best fitting hyperparameters, using cross-validation. Assess the quality of the model, using the metrics outlined above. Compare the newly tuned model to the old model. In addition to the machine learning and assessment metrics, look at changes in assessments across townships. Are they relatively similar across models? Are there any big swings in one model but not the other? From f5e5d1a495dad4c863b6fa2d6a90d3c77711a22d Mon Sep 17 00:00:00 2001 From: TimCookCountyDS Date: Tue, 9 Jun 2026 14:32:06 -0500 Subject: [PATCH 31/32] Apply suggestion from @ccao-jardine Co-authored-by: Nicole Jardine <138712135+ccao-jardine@users.noreply.github.com> --- SOPs/Model-Evaluation-ML-metrics.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/SOPs/Model-Evaluation-ML-metrics.md b/SOPs/Model-Evaluation-ML-metrics.md index d319606..be91498 100644 --- a/SOPs/Model-Evaluation-ML-metrics.md +++ b/SOPs/Model-Evaluation-ML-metrics.md @@ -68,7 +68,9 @@ See: https://www.ibm.com/think/topics/model-drift We calculate traditional machine learning metrics and assessment-specific metrics to assess the model. 
The machine learning metrics we calculate are RMSE, MdAPE, and R-squared. The assessment metrics that we calculate and attend to are Median Ratio and Coefficient of Dispersion (COD) for accuracy and precision (respectively). We supplement our analysis with measures of vertical equity (how equal are assessments at all price levels). These are PRB, PRD, MKI, and ratio curves. These assessment metrics are explained in detail in [Mass Appraisal For The Masses: The Basics by Lars Doucet](https://progressandpoverty.substack.com/p/mass-appraisal-for-the-masses-the). MKI is outlined in depth in [A Gini measure for vertical equity in property assessments](https://doi.org/10.63642/1357-1419.1225). See also our guide to ratio studies: https://github.com/ccao-data/wiki/blob/master/SOPs/Sales-Ratio-Studies.md -Our approach is to fit candidate models using the machine learning (ML) metrics then compare/evaluate our candidate models based on both the ML metrics and the assessment metrics. We can think of the assessment metrics as acceptable bounds for candidate models and as a way to "break ties" between models that are otherwise similar in terms of ML metrics. For example, if a model shows good ML fits but is outside the acceptable assessment bounds, we might ignore it. +Our approach while fitting candidate models is to follow ML best practices. During model training and fitting, we use standard machine learning (ML) metrics, like RMSE (described below). + +To compare and evaluate our candidate models, however, we use both ML metrics and assessment metrics. We rely heavily on median ratio, COD, and vertical equity in our recommendation for what should be the final model. > **Note:** This discussion presumes a train-test breakout, where we fit the model on a subset of our data (training set) and calculate the performance measures on data that the model has not seen (the test set). 
We use this approach to avoid overfitting (see below for specifics) and ensure that our model is generalizable out-of-sample. From 267b1e479b7ffe5257ade21a1f9b6952d8324b5f Mon Sep 17 00:00:00 2001 From: TimCookCountyDS Date: Thu, 25 Jun 2026 11:45:08 -0500 Subject: [PATCH 32/32] Update SOPs/Model-Evaluation-ML-metrics.md Co-authored-by: William Ridgeway <10358980+wrridgeway@users.noreply.github.com> --- SOPs/Model-Evaluation-ML-metrics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/SOPs/Model-Evaluation-ML-metrics.md b/SOPs/Model-Evaluation-ML-metrics.md index be91498..cf3b863 100644 --- a/SOPs/Model-Evaluation-ML-metrics.md +++ b/SOPs/Model-Evaluation-ML-metrics.md @@ -18,7 +18,7 @@ After running the model you will need to interpret the results. First, assess ho ## 1. Assessing How Representative Your Sales Sample Is of the Assessment Set -To ensure that a model can generalize from a sample to the population, we need to check that our sales sample contains buildings that are similar to those in the entire population. We should generally see parcels with the same composition of features, in the same proportions, in the sales (sample) and assessment set (population). We check for this with statistical tests and visual inspections of distributions. +To ensure that a model is generalizable, we need to check that our sales sample is similar to the population. We should see parcels with the same composition of features, in the same proportions, in the sales (sample) and assessment set (population). We check for this with statistical tests and visual inspections of distributions. > **Note:** If there are factors that cause some parcels to be over-represented in the sales sample, the model will over-index to these types of properties, likely leading to over- or undervaluation.