Tips for Taps Blog

Water service boundary project by SimpleLab

How to Access the First Map of US Water Utilities

 

SimpleLab Releases Open-Source Dataset of US Water System Service Areas

This post is relevant for Version 3.0.0 of the water service boundary layer. 

Who does your water system (utility) serve? What people, and where do they live?

In theory, this information should be obvious and readily available. After all, we have maps of every public and private building in the United States. We have maps of national park boundaries. We even have maps of the entire electrical grid and of all the primary and secondary roads across the US. Public or private, the infrastructure across the US is generally well-documented. Unfortunately, this is not the case for water systems.

The majority of states do not provide (or even monitor) which water systems serve which areas! 

SimpleLab decided to do something about this problem. While what we’ve released is a work in progress, it’s open source, the first of its kind, and lays the foundation for a new wave of data science in the water sector.

Motivation

Interest in water system boundaries is at an all-time high. Federal spending on drinking water systems should exceed $100 billion in the next 10 years. How this money is allocated depends in part on who gets water from these systems. In order to better serve marginalized populations that have been disproportionately exposed to contamination, federal funding decisions need to consider the social, economic, and racial/ethnic demographics of every water system. Unless policymakers, non-profits, academics, and agencies are provided with dependable water system boundaries, the task of understanding the socio-demographic profiles of these water systems will be extraordinarily difficult.

At SimpleLab, we manage some of the country’s most sophisticated water quality APIs and products. But so far, we have had to rely on lower resolution spatial scales to assign addresses to water system boundaries. 

Just as we began organizing to estimate water system service boundaries nationwide in late 2021, we learned that the Environmental Policy Innovation Center (EPIC) was pursuing the same questions. SimpleLab entered into a partnership to lead the technical work and rapidly prototype a new methodology for estimating water service boundaries in an open-source repository on GitHub. EPIC co-financed this work, and has been coalition-building to inform methodology, beneficiaries and use cases of this data. EPIC’s work has been critical in bringing these methods to stakeholders and advancing the national conversation on water system boundaries. See their post on the big picture impacts of this work here.

This blog post is a deep dive into the new methodology developed by SimpleLab (CSO Jess Goddard, Head of Data Engineering Ryan Shepherd, and Data Engineer Noor Brody) with support from a SimpleLab data science advisor, Rich Pauloo of Water Data Lab.

What’s a Water System Boundary?

Let’s start with a few terms. In this project, a water system refers to any water system that serves at least 25 people or at least 15 service connections year-round. By definition, this is a community water system (CWS). You’re probably familiar with the term “utility”, which we use interchangeably with the term water system here.

At its core, a water system boundary is the spatial extent of service for a water system. A water system boundary can cover a very small or a very large area. It can be contiguous or cover separate areas.

Let’s take an example. Here in Figure 1 is Ventura, California. Most of the city is served by the Ventura Water Department. This is the boundary that currently exists in California’s database of water systems (see here).

Figure 1. Water system boundary for Ventura Water Department.


While no data is perfect, this boundary provides a pretty close approximation of the people served by the Ventura Water Department. 

This project is about finding known boundaries, like the one shown above for Ventura, and approximating or modeling boundaries where they are not publicly available or do not exist. 

How We Did It – The Bird’s Eye View

The TEMM Approach

The methodology we put forward is called the Tiered Explicit, Match, and Model approach, or TEMM approach for short. The name reflects exactly how we created a nationwide layer of water systems. The TEMM approach is composed of three hierarchical tiers, arranged by data and model fidelity.

  1. First, we use explicit water service boundaries like the picture above. These are spatial polygon data, typically provided at the state-level. Occasionally we get these data from utilities directly, which are standardized and stored on Internet of Water’s repository here. We call systems with explicit boundaries Tier 1.

  2. In the absence of explicit water service boundary data, we use a matching algorithm to match water systems to the boundary of a town or city (TIGER Census Place polygons). When a water system and TIGER Census Place match one-to-one, we label this Tier 2. Sometimes multiple water systems match to the same TIGER Census Place, and we use a “best pick” algorithm to assign the most likely water system to that area. For systems that didn’t make the cut for the “best” match, we resort to a Tier 3 approach, but we retain the centroid of the matched TIGER Census Place as a candidate for the best centroid. 

  3. Finally, in the absence of an explicit water service boundary (Tier 1) or a TIGER Census Place polygon match (Tier 2), a statistical model trained on labeled boundary data (Tier 1) is used to estimate a reasonable radius at the best available centroid, generating a circular water system boundary (Tier 3).
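The Tier 3 step in the list above can be pictured as drawing a circle of the modeled radius around the best available centroid. Here is a minimal Python sketch of that idea. It assumes a radius in kilometers and a simple flat-earth approximation; the function name, vertex count, and example coordinates are illustrative and are not taken from the project repository.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, used for the degree conversion


def circular_boundary(lat, lon, radius_km, n_vertices=64):
    """Return a closed ring of (lat, lon) points approximating a circle.

    Uses a local equirectangular approximation, which is reasonable for
    radii of a few tens of kilometers away from the poles.
    """
    # A degree of latitude is roughly constant in length; a degree of
    # longitude shrinks with the cosine of the latitude.
    km_per_deg_lat = math.pi * EARTH_RADIUS_KM / 180.0
    km_per_deg_lon = km_per_deg_lat * math.cos(math.radians(lat))

    ring = []
    for i in range(n_vertices):
        angle = 2.0 * math.pi * i / n_vertices
        d_lat = radius_km * math.sin(angle) / km_per_deg_lat
        d_lon = radius_km * math.cos(angle) / km_per_deg_lon
        ring.append((lat + d_lat, lon + d_lon))
    ring.append(ring[0])  # repeat the first vertex to close the ring
    return ring


# Hypothetical centroid and modeled radius, for illustration only.
ring = circular_boundary(lat=34.28, lon=-119.29, radius_km=5.0)
print(len(ring))  # 64 vertices plus the closing vertex
```

A production pipeline would more likely buffer the centroid in a projected coordinate system with a GIS library; the hand-rolled version here just keeps the geometry visible.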

In fewer words, Figure 2 below demonstrates the approach. Each water system is assigned one of three tiers, depending on the available data. 

Figure 2. The Tiered Explicit, Match, and Model hierarchy for water system boundary estimation.


Each water system is assigned either an explicit boundary, a matched boundary (to Census Places), or a modeled boundary. Confidence decreases moving from Tier 1 (Explicit) to Tier 3 (Modeled) water system boundary types.
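The hierarchy can be read as a short cascade of checks. The sketch below is one way to express it in Python. The field names and the example identifier are hypothetical; only the ordering of the tiers comes from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SystemEvidence:
    """What is known about one water system (hypothetical fields)."""
    pwsid: str
    has_explicit_boundary: bool          # a labeled polygon exists (Tier 1)
    is_best_place_match: bool            # won the match to a Census Place (Tier 2)
    has_usable_centroid: bool            # a centroid good enough to buffer
    modeled_radius_km: Optional[float]   # None when the model cannot be applied


def assign_tier(system: SystemEvidence) -> Optional[int]:
    """Return 1, 2, or 3, or None when no geometry can be produced."""
    if system.has_explicit_boundary:
        return 1
    if system.is_best_place_match:
        return 2
    if system.has_usable_centroid and system.modeled_radius_km is not None:
        return 3
    return None


# Made-up example: no explicit boundary, lost the Place match, but can be modeled.
example = SystemEvidence("XX0000001", False, False, True, 2.4)
print(assign_tier(example))  # 3
```

Because the checks run from highest to lowest confidence, a system with an explicit boundary never falls through to a matched or modeled one.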

Open Data Science

The data pipeline developed in this work can dynamically regenerate the TEMM spatial layer based on the addition of new Tier 1 explicit boundaries, refinement of the Tier 2 matching algorithm, improvements to the Tier 3 modeled boundaries, and changes to any other data dependencies (e.g., improved centroid location to solve the pancake problem). 

This pipeline is provided under the MIT License online in a public GitHub repository, along with output TEMM spatial layer results in GeoPackage format for download on HydroShare. Contributor guidelines are available on GitHub, detailing the processes by which contributions may be made to the project.

Key Results

In total, the final SimpleLab TEMM data layer represents systems that deliver tap water to 307.7 million people served by 45,973 water systems. This amounts to 97% of the population reportedly served by active community water systems in SDWIS and 93% of active community water systems in SDWIS [1].

Together, around 267 million people (84.4% of the population) are covered by either a Tier 1 or a Tier 2 spatial boundary (Table 1). This is an underestimate, because many systems assigned to Tier 2b are likely to be moved to Tier 2a with further refinement. The remaining approximately 41 million people (12.9%) represented in the layer are covered by a less accurate Tier 3 boundary. These results indicate relatively high confidence in the spatial accuracy of the resulting TEMM water boundary layer for a majority of the population served by community water systems.

Table 1. Water system boundary types by system size and population served

| Tier | Number of Water Systems | Percent of Water Systems | Population Served (millions) | Percent of Total Population Served |
|---|---|---|---|---|
| 1 | 17,645 | 35.7 | 156 | 49.3 |
| 2 | 11,079 | 22.42 | 111 | 35.1 |
| 3 | 17,249 | 34.9 | 41 | 12.9 |
| No Geometry | 3,451 | 6.98 | 7.6 | 2.4 |
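The headline figures can be re-derived from the rows of Table 1. The short Python check below copies the counts and populations from the table. Because the populations are rounded to the nearest million, the recomputed population shares can differ from the published column by about a tenth of a percentage point.

```python
# (number of systems, population served in millions), copied from Table 1.
table_1 = {
    "1": (17_645, 156.0),
    "2": (11_079, 111.0),
    "3": (17_249, 41.0),
    "No Geometry": (3_451, 7.6),
}

total_systems = sum(n for n, _ in table_1.values())
total_pop = sum(p for _, p in table_1.values())

for tier, (n, pop) in table_1.items():
    print(f"{tier:>11}: {100 * n / total_systems:5.1f}% of systems, "
          f"{100 * pop / total_pop:5.1f}% of population")

# Systems that received any geometry (Tiers 1 to 3).
systems_with_geometry = total_systems - table_1["No Geometry"][0]
# Population covered by the two higher-confidence tiers.
tier_1_2_pop = table_1["1"][1] + table_1["2"][1]

print(systems_with_geometry)  # 45973
print(tier_1_2_pop)           # 267.0
```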

 

Within the Tier 3 category, there are more small and very small water systems than medium to very large systems. This results in a relatively high number of systems representing a small percentage of the overall population.

In Figure 3, we show the proportion of population covered by each of the different tiers, in a diagram mimicking the location of US states.

Figure 3. Proportion of people served by water systems with different boundary tiers, by state.

Methods – A detailed view

Data Sources

The following data sources were used in the development of the TEMM layer.

Table 2. Data sources used in developing water system boundaries.

| Data Source | Abbreviation | Description | Link |
|---|---|---|---|
| EPA Safe Drinking Water Information Systems | SDWIS | Public water systems “master list” with key attribute data and some tabular geographic data like cities served | SDWIS MODEL |
| EPA Enforcement and Compliance History Online | ECHO | Public water system facilities archive, drawing lat/long data for facilities centroids from FRS | ECHO Exporter |
| EPA Facilities Registry Service | FRS | FRS regularly updates facilities data with lat/long information, which pipes into ECHO | FRS Geospatial |
| US Census Bureau TIGER/Line (also called “TIGER Census Places”) | TIGER Census Places | US Census data of places (cities and towns) used to identify potential service area boundaries | TIGER/Line Shapefiles accessed using R package tigris |
| Unregulated Contaminant Monitoring Rule | UCMR | UCMR 3 and UCMR 4 provide data on pwsid and zip codes served, which can provide higher integrity centroids where needed | UCMR Occurrence data |
| Homeland Infrastructure Foundation-level Data | MHP | Mobile home park centroids | MHP centroids |
| Labeled Water System Boundaries | N/A | URLs on state pages for various water service boundary sources or utility-provided boundaries | Utility provided boundaries |

 

Tier 1

Sixteen states have readily available, explicit boundaries (Tier 1). We consider boundaries readily available when they are hosted online and available for download. These states are AR, AZ, CA, CT, IL, KS, MO, NC, NJ, NM, OK, PA, RI, TX, UT, and WA, as seen in Figure 4. We know of a few other states that have water system boundaries, but these are not readily accessible to the public.

Figure 4. Tier 1 water system boundary data accessibility by state.


Tier 2

For systems without explicit boundaries, we match these to TIGER Census Places–which are cities and towns with boundaries that can serve as an approximate water system boundary. 

To match water systems to TIGER Census Places we compile many data sets (outlined in Table 2 above). First, a master list of active community water systems is derived from SDWIS data. Second, supporting locational information–such as city served–is joined from supporting SDWIS tables. Federal water facility data (i.e., ECHO, FRS) exist with spatial centroids (latitude/longitude of the water system facility), but no federal sources provide explicit spatial boundaries. While ECHO is a superset of FRS centroids, both data sources are used because they result in improved matching. Many water system facility centroids in the ECHO database are poor quality because they are merely the centroid of the county or state of the water system. In these cases, we substitute higher quality centroids from other datasets where available, such as UCMR (zip code centroids for select water systems), MHP (specific lat/long coordinates for mobile home parks), or the centroid of a matched TIGER Census Place (when the water system matches the Place, but is not the “best pick” to receive the TIGER boundary). 
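The centroid substitution described above amounts to a fallback chain. Below is a minimal Python sketch. The source keys, the `echo_is_admin_centroid` flag, and especially the fallback order are assumptions made for illustration; the text lists the alternative sources but does not rank them.

```python
from typing import Dict, Optional, Tuple

Centroid = Tuple[float, float]  # (latitude, longitude)

# Assumed fallback order; the post names these sources but does not rank them.
FALLBACK_ORDER = ("mhp", "ucmr", "tiger_place")


def pick_centroid(
    candidates: Dict[str, Optional[Centroid]],
    echo_is_admin_centroid: bool,
) -> Optional[Tuple[str, Centroid]]:
    """Return (source, centroid) for the best available centroid, if any."""
    echo = candidates.get("echo")
    # Keep the ECHO centroid when it looks like a real facility location.
    if echo is not None and not echo_is_admin_centroid:
        return ("echo", echo)
    # Otherwise substitute the first higher-quality source that is available.
    for source in FALLBACK_ORDER:
        centroid = candidates.get(source)
        if centroid is not None:
            return (source, centroid)
    # Nothing better exists: keep the low-quality ECHO point, if present.
    return ("echo", echo) if echo is not None else None


# Made-up coordinates: ECHO only has a county centroid, UCMR has a zip centroid.
choice = pick_centroid(
    {"echo": (36.0, -120.0), "ucmr": (36.21, -119.35), "mhp": None},
    echo_is_admin_centroid=True,
)
print(choice)  # ('ucmr', (36.21, -119.35))
```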

With this data in place, we begin a series of matches. We match water system name, city served name, and spatial attributes of water systems (centroids) to TIGER Census Places (shapefiles), which are assumed to represent reasonable proxy water system boundaries. Multiple matches are possible among the different match strategies, so a set of steps outlined in the methodology is taken to assign the best match. Best matches are based on rules that are validated against systems with existing labeled boundaries.
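To make the name-based strategy concrete, here is a small Python sketch of matching a water system name to a place name after stripping generic words. The token list and the exact-equality rule are assumptions for illustration; the matching rules used in the project are documented in its repository and may be considerably more involved.

```python
import re

# Hypothetical list of generic words to drop before comparing names.
GENERIC_TOKENS = {
    "city", "of", "town", "village", "the", "water", "dept", "department",
    "system", "utility", "utilities", "municipal", "works",
}


def normalize_name(name: str) -> str:
    """Lowercase, strip punctuation, and drop generic utility words."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(t for t in tokens if t not in GENERIC_TOKENS)


def name_matches(system_name: str, place_name: str) -> bool:
    """True when the cleaned system name equals the cleaned place name."""
    cleaned_system = normalize_name(system_name)
    return cleaned_system != "" and cleaned_system == normalize_name(place_name)


# Made-up names, for illustration only.
print(name_matches("City of Springfield Water Dept.", "Springfield"))  # True
print(name_matches("Shady Acres Mobile Home Park", "Springfield"))     # False
```

A name match alone can be ambiguous (many states have a Springfield), which is one reason the approach also compares city served and centroid location.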

Many water systems match TIGER Census Places one-to-one. Still, multiple water systems can match to the same TIGER Census Place. Where multiple water systems match to the same TIGER Census Place, we employ a “best pick” methodology that assigns one water system to the Place based on population size and other attributes such as the recorded city served. The primary uncertainty in this approach stems from the matching itself, and work to validate the matches is ongoing.
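A “best pick” step of this kind can be sketched as a ranking over the candidate systems for a single Place. In the Python sketch below, the fields and the ranking rule (a city-served match first, then population) are assumptions; the text says only that population size and attributes such as city served are considered.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """One water system that matched a given Census Place (hypothetical fields)."""
    pwsid: str
    population_served: int
    city_served_matches_place: bool


def best_pick(candidates: List[Candidate]) -> Candidate:
    """Choose one system for a Place; the others fall through to Tier 3."""
    if not candidates:
        raise ValueError("need at least one candidate")
    # Assumed ranking: a city-served match outranks population, and
    # population breaks ties. The project's actual rules may differ.
    return max(
        candidates,
        key=lambda c: (c.city_served_matches_place, c.population_served),
    )


# Made-up candidates for a single Place.
winner = best_pick([
    Candidate("XX0000010", 1_200, False),
    Candidate("XX0000011", 45_000, True),
    Candidate("XX0000012", 300, True),
])
print(winner.pwsid)  # XX0000011
```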

Tier 3

In the absence of explicit spatial boundaries (Tier 1) and matched TIGER Census Place proxy boundaries (Tier 2), we statistically model an estimated boundary (Tier 3). Model specification hinges on the correlation between the radius of Tier 1 convex hulls (the response variable) [2], and predictors that explain this response, such as service connection count, population served, ownership type and so on. 
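One plausible way to turn a Tier 1 polygon into a single “radius” label is to take its convex hull and compute the radius of the circle with the same area, r = sqrt(A / π). The Python sketch below does this for planar coordinates (for example, kilometers in a projected coordinate system). The equal-area reading is an assumption based on the endnote about convex hull areas, and the function names are illustrative.

```python
import math


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Shoelace formula for a simple polygon."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def equal_area_radius(points):
    """Radius of the circle whose area equals the convex hull's area."""
    return math.sqrt(polygon_area(convex_hull(points)) / math.pi)


# Made-up L-shaped service area in kilometers: its own area is 6, its hull area is 9.
l_shape = [(0, 0), (4, 0), (4, 1), (1, 1), (1, 3), (0, 3)]
print(polygon_area(convex_hull(l_shape)))  # 9.0
print(round(equal_area_radius(l_shape), 3))
```

The example shows why the hull matters: a circle fitted to the hull area covers the concave notch of the L, which is closer to how a circular buffer behaves than a circle fitted to the raw polygon area.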

We experimented with 3 different models: random forest, xgboost, and multiple linear regression. These different statistical and machine learning models have different assumptions and performance, which are discussed in the sections that follow.

Ultimately, we selected the multiple linear regression model as our final model because it is computationally efficient, easily interpretable, provides confidence intervals to characterize uncertainty in the modeled boundary (this may be useful depending on the application of the model results), and finally because it avoids overfitting [3].

Linear regression uses correlations between features of a dataset (called predictors) to predict an outcome (i.e. the response variable). There is sound rationale for linear regression in the context of this problem. A strong (and intuitive) linear relationship is observed between Tier 1 water system radii and service connections. The linear model fit outperforms other models, is easily interpretable, and provides standard error metrics. 

We experimented with different model specifications, including a model that combined the correlated population and service connection variables; however, this led to negligible improvement in the model fit and less interpretable model coefficients. Thus, in the final model specification we regress only on service connections. Interaction terms are added for owner type, service area type code, and wholesaler status. A simple linear regression on service connections alone has an R² of 0.56; including these extra terms substantially improves the model fit (R² = 0.66) and reduces test error.
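For readers who want to see the baseline in code, here is a minimal Python sketch of a simple linear regression of radius on service connections, with R² computed from the residuals. The data points are made up, and fitting on log10-transformed values is an assumption suggested by the log-scale axes of Figure 5. The interaction terms from the final specification are omitted.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(x, y, intercept, slope):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Made-up (service connections, radius in km) pairs, for illustration only.
connections = [20, 150, 900, 4_000, 25_000, 120_000]
radius_km = [0.3, 0.9, 1.8, 4.5, 9.0, 22.0]

x = [math.log10(c) for c in connections]
y = [math.log10(r) for r in radius_km]
intercept, slope = fit_line(x, y)
print(f"slope={slope:.3f}, R^2={r_squared(x, y, intercept, slope):.3f}")
```

Adding an interaction between service connections and a categorical variable such as owner type lets each category have its own slope, which is one way to read the improvement in fit reported above.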

In Figure 5 we plot the predicted radius versus the actual radius for water systems with known boundaries. The relatively close clustering around the line indicates the goodness of fit of our model, with some notable exceptions. The model tends to under-predict the actual radius for very large systems and over-predict the radius for very small systems.

Figure 5. Linear model predictions versus actual observations of radii distance (log scale).

 


Goodness of fit and error metrics are in Table 8. The linear model outperformed machine learning approaches with a higher R² and lower error.

Table 8. Goodness of fit and error metrics.

| Metric | Estimate |
|---|---|
| R² | 0.6622484 |
| Root Mean Squared Error | 0.3429215 |
| Mean Absolute Error | 0.2656831 |
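The two error metrics in Table 8 are simple to compute from paired actual and predicted values. The Python sketch below uses made-up numbers; both metrics are expressed in the units of the response variable, whatever scale the model is fit on.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: penalizes large misses more heavily."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mae(actual, predicted):
    """Mean absolute error: the typical size of a miss."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n


# Made-up values, for illustration only.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.0, 4.5]
print(round(rmse(actual, predicted), 4))  # 0.6124
print(mae(actual, predicted))             # 0.5
```

RMSE is never smaller than MAE, and a large gap between them points to a few large errors, such as the very large systems the model under-predicts.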

 

More details on the model, comparisons with other approaches, and a deeper-dive into the results can be found in the repository.

Key Limitations

As with any approach, there are limitations to the TEMM approach and the underlying data. We’re actively working to address the limitations that we can. Perhaps the single most important improvement will come from more US states collecting, maintaining, and sharing their water system boundary shapefiles.

Limitations of existing data include:

  • Lack of representation in Tier 1 data for Native American water systems

  • Poor quality spatial data provided by ECHO or FRS that is the basis of the centroid for matched or modeled water systems (Tiers 2 and 3)

  • Missing or incomplete data 

Limitations of the methodology include:

  • Incorrect matches between water system data and TIGER Census Place data are possible where underlying data is inaccurate

  • Rules-based algorithm for matching requires more fine-tuning and validation

  • The model over-predicts radii for very small systems and under-predicts radii for very large systems

Key Contributions

We developed a “Tiered Explicit, Match, and Model” (TEMM) approach to compile a nationwide water system boundary layer via an open source data pipeline. Such a dataset is unprecedented: easily-accessible, machine-readable, and clean water system spatial boundary data is presently siloed across states – if available online at all. 

Key contributions include:

  • Bringing together multiple spatial datasets where they exist, and developing algorithms and models to “fill in” missing data where it does not exist;

  • Producing a spatial layer that includes 45,973 community water systems and a total population of 307.7 million people;

  • Covering most people (267 million, 84% of the population) with a Tier 1 or Tier 2 spatial boundary, which has relatively high accuracy [4];

  • Providing an open-access data pipeline with contributor guidelines and an MIT license for re-use.

We imagine that the TEMM spatial layer will serve as an appropriate input for analyses that depend on nationwide water service coverage boundaries, recognizing that the methodology is ongoing in its development. 

SimpleLab Products

The TEMM layer now underlies City Water Project (Beta) and our Water Quality Search APIs. No other water quality data product nationwide relies on water system boundaries.


Endnotes

  1. 3,451 water systems have no tier or geometry. This is because (1) they had no labeled data, (2) they failed to match to a TIGER Census Place boundary, and (3) we were unable to model them because their population or service connection count was less than 15, or they had no good centroid to rely on.
  2. Labeled radii are calculated from the convex hull area of Tier 1 systems instead of simple water system area because they better represent calculated Tier 3 circular buffers.
  3. “Overfitting” is the tendency of a statistical or machine learning model to fit well to “seen” training data, but generalize poorly to “unseen” testing data. We need the model to generalize well to unseen water systems. The linear model results in the lowest error of the models tested, which may at first seem surprising but is less so given the strong linear relationship between water system radius and service connections. One might expect machine learning approaches like random forest and xgboost to outperform the linear model. In fact, they do on training sets, but they tend to overfit the training set and do not generalize as well to test sets. Simply put, we do not have enough of the high-dimensional data that machine learning models require to outperform classical approaches like linear regression in the context of this problem.
  4. While versions of this data layer prior to version 3.0.0 had a higher percentage of the population represented in Tier 1 or Tier 2, this was due to assigning multiple water systems to the same TIGER Census Place polygon, which over-inflated the number of people represented in higher quality geometry tiers.
About The Author
Jess Goddard

CHIEF SCIENCE OFFICER

