Modeling Investment Potential in Santa Cruz Residential Zones

The Challenge

Santa Cruz County sits at the intersection of high demand, constrained supply, and dramatic price variation. For planners and investors alike, the central question is deceptively simple: which neighborhoods are underpriced relative to their desirability, and where does spatial modeling reveal investment opportunity that traditional analysis misses?

Standard housing market assessments rely on comparable sales and market trends. They miss what spatial statistics can reveal — that housing values are not independent observations but are deeply shaped by the values of neighboring properties, proximity to amenities, and patterns of spatial clustering that only become visible through rigorous modeling.

Our Approach

We began by assembling a multi-source spatial dataset at the census tract level, drawing on American Community Survey 5-Year Estimates for socioeconomic indicators, Santa Cruz County GIS Portal zoning shapefiles, and OpenStreetMap for amenity locations including parks, libraries, retail, and food establishments.

Our analysis focused exclusively on residential zones — tracts classified as R-1 (single-family) and RM (multi-family) — to ensure homogeneity in land use characteristics across the study area.

Building the Desirability Score

We constructed a composite desirability score from five standardized variables: median household income and the distances to parks, shopping centers, food establishments, and libraries. The four distance variables were negated before standardization so that closer proximity raised the score.

[Figure: Desirability Score Components by Zone Type. Average standardized scores across R-1 and RM residential zones in Santa Cruz County.]

The resulting score effectively distinguished high-amenity, high-income tracts from those with lower accessibility and economic indicators.
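
The score construction described above can be sketched in a few lines. This is a minimal illustration, not the project's code: the study used R, and plain Python is used here only to keep the example dependency-free. The equal-weight average of the five components, the function names, and the four-tract data are all assumptions made for the example; the write-up does not state how the standardized variables were combined.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize values to mean 0 and population standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)  # a constant variable carries no signal
    return [(v - mu) / sigma for v in values]


def desirability_scores(income, distances):
    """Combine income with negated distances into one score per tract.

    income    -- list of median household incomes, one per tract
    distances -- dict mapping amenity name to a list of distances (same tract order)
    """
    components = [zscore(income)]
    for amenity in sorted(distances):
        negated = [-d for d in distances[amenity]]  # closer => larger value
        components.append(zscore(negated))
    # ASSUMPTION: equal-weight mean of the five standardized components.
    return [mean([comp[i] for comp in components]) for i in range(len(income))]


# Hypothetical data for four tracts (not from the study).
income = [62000, 88000, 75000, 121000]
distances = {
    "parks": [2.1, 0.9, 1.4, 0.3],
    "shopping": [3.0, 1.2, 2.2, 0.8],
    "food": [1.5, 0.7, 1.0, 0.4],
    "libraries": [4.0, 2.5, 3.1, 1.1],
}
scores = desirability_scores(income, distances)
for tract, score in enumerate(scores):
    print(f"tract {tract}: {score:+.2f}")
```

Because every component is a z-score, the composite is centered on zero: positive values mark tracts that are richer and closer to amenities than the study-area average.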

Regression Modeling

We first fit an Ordinary Least Squares (OLS) regression predicting log-transformed median home values from five predictors: median household income, education levels, renter percentages, the composite desirability score, and zoning classification. The OLS model explained approximately 47.5% of the variance in log home values (R² ≈ 0.475).
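
Written out, the baseline specification looks roughly like the following. The symbol names are ours, and coding zoning as a single RM indicator (with R-1 as the reference category) is an assumption; the write-up does not say how the zoning variable was encoded.

```latex
\ln(V_i) = \beta_0 + \beta_1\,\mathrm{Income}_i + \beta_2\,\mathrm{Educ}_i
         + \beta_3\,\mathrm{Renter}_i + \beta_4\,D_i + \beta_5\,\mathrm{RM}_i + \varepsilon_i,
\qquad
R^2 = 1 - \frac{\sum_i \hat{\varepsilon}_i^{\,2}}{\sum_i \bigl(\ln V_i - \overline{\ln V}\bigr)^2} \approx 0.475
```

Here V_i is the median home value of tract i, D_i is its desirability score, and the residuals are the quantities tested for spatial autocorrelation in the next step.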

However, diagnostic testing revealed what we expected: the OLS residuals were not randomly distributed across space. Global Moran's I for the residuals was 0.316 (z = 6.36, p < .001), indicating statistically significant positive spatial autocorrelation. Neighboring tracts tended to have similar residuals, a clear signal that the non-spatial model was leaving spatial structure unexplained.
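
For readers unfamiliar with the statistic, here is a small sketch of how a global Moran's I and a z-score can be computed. It is an illustration under stated assumptions rather than the project's procedure: it uses row-standardized weights and a permutation-based z-score, whereas the write-up does not specify the weights matrix or the inference method. The eight-tract "chain" and its residuals are made up.

```python
import random


def row_standardize(neighbors):
    """Convert {i: [j, ...]} neighbor lists into weights where each row sums to 1."""
    return {i: {j: 1.0 / len(js) for j in js} for i, js in neighbors.items() if js}


def morans_i(values, weights):
    """Global Moran's I: (n / S0) * sum_ij w_ij * d_i * d_j / sum_i d_i^2."""
    n = len(values)
    mu = sum(values) / n
    dev = [v - mu for v in values]
    s0 = sum(w for row in weights.values() for w in row.values())
    cross = sum(w * dev[i] * dev[j] for i, row in weights.items() for j, w in row.items())
    return (n / s0) * cross / sum(d * d for d in dev)


def permutation_z(values, weights, n_perm=999, seed=0):
    """Compare the observed I with I under random relabeling of tracts."""
    rng = random.Random(seed)
    observed = morans_i(values, weights)
    shuffled = list(values)
    sims = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        sims.append(morans_i(shuffled, weights))
    mean_sim = sum(sims) / n_perm
    sd_sim = (sum((s - mean_sim) ** 2 for s in sims) / (n_perm - 1)) ** 0.5
    z = (observed - mean_sim) / sd_sim
    # One-sided pseudo p-value for positive autocorrelation.
    p = (1 + sum(s >= observed for s in sims)) / (n_perm + 1)
    return observed, z, p


# Hypothetical example: 8 tracts in a row, each adjacent to the next.
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}
residuals = [0.9, 0.8, 0.6, 0.2, -0.1, -0.5, -0.7, -1.0]  # made-up OLS residuals
weights = row_standardize(neighbors)
i_obs, z, p_value = permutation_z(residuals, weights)
print(f"Moran's I = {i_obs:.3f}, z = {z:.2f}, pseudo p = {p_value:.3f}")
```

A value near zero suggests no spatial pattern, positive values indicate that similar residuals cluster together, and negative values indicate a checkerboard-like pattern.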

The Spatial Lag Model

To address this spatial dependence, we fit a spatial lag model (SLM), which adds the spatially weighted average of neighboring tracts' home values as a predictor. The SLM fit better than OLS, with a higher log-likelihood and a lower AIC, which is consistent with home values in one tract being meaningfully influenced by values in adjacent tracts.

The spatial autoregressive parameter was significant and positive: a tract's home values are shaped not just by its own characteristics but by the values of nearby homes. This is intuitive to anyone who has worked in real estate, but the statistical confirmation allows us to build more reliable investment scoring tools.
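
In standard notation, a spatial lag model and the fit criterion mentioned above take the following general form. The symbols are the conventional ones and are not taken from the project's code: W is the row-standardized spatial weights matrix, ρ the spatial autoregressive parameter, X the predictor matrix from the OLS step, k the number of estimated parameters, and L-hat the maximized likelihood.

```latex
\ln V = \rho\, W \ln V + X\beta + \varepsilon
\;\Longrightarrow\;
\ln V = (I_n - \rho W)^{-1}\,(X\beta + \varepsilon),
\qquad
\mathrm{AIC} = 2k - 2\ln\hat{L}
```

The reduced form on the right shows why a positive ρ matters: a change in one tract's characteristics propagates to its neighbors, and to their neighbors in turn, through the inverse term.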

The Investment Score

Residuals from the spatial lag model were combined with the desirability score into a composite investment score. A negative residual means a tract's observed median home value is below what the model predicts. The highest-scoring tracts, which pair strong neighborhood quality with prices below the model's prediction, were concentrated in central Live Oak and midtown Santa Cruz. These areas have favorable accessibility and socioeconomic characteristics but have not yet seen the price appreciation those factors typically drive.
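
One simple way to combine the two ingredients is sketched below. The write-up does not give the actual formula, so treat this as an assumption: the sketch standardizes the sign-flipped residual (so that "priced below prediction" becomes a positive number) and adds it to the standardized desirability score with equal weight. Tract labels and values are hypothetical.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize values to mean 0 and population standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def investment_scores(slm_residuals, desirability):
    """Higher score = more desirable AND priced further below the model's prediction."""
    # Residual = observed log price - predicted log price, so flip the sign.
    underpricing = zscore([-r for r in slm_residuals])
    quality = zscore(desirability)
    # ASSUMPTION: equal weighting of the two standardized components.
    return [u + q for u, q in zip(underpricing, quality)]


# Hypothetical tracts (labels and numbers are illustrative only).
tracts = ["A", "B", "C", "D"]
slm_residuals = [-0.20, 0.15, 0.05, -0.02]
desirability = [0.9, 1.1, -0.8, -0.4]

ranked = sorted(
    zip(tracts, investment_scores(slm_residuals, desirability)),
    key=lambda pair: pair[1],
    reverse=True,
)
for tract, score in ranked:
    print(f"{tract}: {score:+.2f}")
```

In this toy example tract A ranks first because it is both desirable and priced below prediction, while tract B, which is slightly more desirable but priced above prediction, falls behind it.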

[Figure: Top Investment Potential Zones: Price vs. Desirability. Census tracts where desirability scores significantly exceed current median home values, indicating potential undervaluation.]

What We Learned

This project reinforced a core principle of our spatial analytics practice: location is not just a feature of real estate — it is a statistical structure that must be modeled explicitly. The OLS model was a reasonable starting point, but the spatial lag model revealed patterns of under- and over-valuation that would have been invisible without accounting for spatial dependence.

The LISA (Local Indicators of Spatial Association) cluster maps were particularly valuable for identifying specific census tracts that may be over- or undervalued, providing direction for future data collection, policy decisions, or investment strategies.
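
As a rough illustration of what sits behind a LISA cluster map, the sketch below computes a local Moran's I for each tract and assigns the usual quadrant label. It is a simplified example under assumptions: it uses row-standardized neighbor averages, skips the permutation test that is normally used to mask non-significant tracts, and runs on made-up values, since the write-up does not say which variable was mapped.

```python
def local_morans(values, neighbors):
    """Return (local_I, quadrant) per tract using row-standardized neighbor averages."""
    n = len(values)
    mu = sum(values) / n
    dev = [v - mu for v in values]
    m2 = sum(d * d for d in dev) / n  # variance used to scale the statistic
    results = []
    for i in range(n):
        js = neighbors.get(i, [])
        lag = sum(dev[j] for j in js) / len(js) if js else 0.0
        local_i = dev[i] * lag / m2
        if dev[i] >= 0 and lag >= 0:
            quadrant = "High-High"  # high value among high neighbors
        elif dev[i] < 0 and lag < 0:
            quadrant = "Low-Low"    # low value among low neighbors
        elif dev[i] >= 0:
            quadrant = "High-Low"   # high outlier among low neighbors
        else:
            quadrant = "Low-High"   # low outlier among high neighbors
        results.append((local_i, quadrant))
    return results


# Hypothetical example: 7 tracts in a row, each adjacent to the next.
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 7] for i in range(7)}
values = [1.0, 1.2, -0.8, 1.1, -1.0, -1.1, -0.9]  # made-up standardized values

lisa = local_morans(values, neighbors)
for tract, (local_i, quadrant) in enumerate(lisa):
    print(f"tract {tract}: I_i = {local_i:+.2f}  {quadrant}")
```

In an investment context the Low-High tracts are the interesting ones: a low value surrounded by high-value neighbors is a candidate for undervaluation, subject to a significance check.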

The methodology is scalable and transferable to other jurisdictions where zoning shapefiles, census data, and real estate market indicators are available — making it a practical tool for planners, developers, and policy-makers working across California's complex housing landscape.

Project lead: Ian Klassen. Analysis conducted using R (sf, spdep, tmap, ggplot2). Data sources: U.S. Census Bureau ACS, Santa Cruz County GIS Portal, OpenStreetMap, Zillow Research.