Tips for Taps Blog

Water service boundary project by SimpleLab

How to Access the First Map of US Water Utilities

 

SimpleLab Releases Open-Source Dataset of US Water System Service Areas

Who does your water system serve? What people, and where do they live?

In theory, this information should be obvious and readily available. After all, we have maps of every public and private building in the United States. We have maps of national park boundaries. We even have maps of the entire electrical grid and of all the primary and secondary roads across the US. Public or private, infrastructure across the US is generally well documented. Unfortunately, this is not the case for water systems.

The majority of states do not publish, or even track, which water systems serve which areas!

SimpleLab decided to do something about this problem. What we’ve released is a work in progress, but it is open source, the first of its kind, and lays the foundation for a new wave of data science in the water sector.

Motivation

Interest in water system boundaries is at an all-time high. Federal spending on drinking water systems should exceed $100 billion in the next 10 years. How this money is allocated depends in part on who gets water from these systems. To better serve marginalized populations that have been disproportionately exposed to contamination, federal funding decisions need to consider the social, economic, and racial/ethnic demographics of every water system. Unless policymakers, non-profits, academics, and agencies have dependable water system boundaries, understanding the socio-demographic profiles of these water systems will be extraordinarily difficult.

At SimpleLab, we manage some of the country’s most sophisticated water quality APIs and products. But so far, we have had to rely on lower resolution spatial scales to assign addresses to water system boundaries. 

Just as we began organizing to estimate water system service boundaries nationwide, we learned that the Environmental Policy Innovation Center (EPIC) was pursuing the same questions. SimpleLab entered into a two-month partnership with EPIC to lead the technical work and rapidly prototype a new methodology for estimating water service boundaries in an open-source repository on GitHub. EPIC financed the first two months of this work and has been building a coalition to inform the methodology, beneficiaries, and use cases of this data. EPIC’s work has been critical in bringing these methods to stakeholders and advancing the national conversation on water system boundaries. See their post on the big picture impacts of this work here.

This blog post is a deep dive into the new methodology developed by SimpleLab (CSO Jess Goddard, Head of Data Engineering Ryan Shepherd, and Data Engineer Noor Brody) with support from long-time data collaborator Water Data Lab (Rich Pauloo).

What’s a Water System Boundary?

Let’s start with a few terms. In this project, a water system refers to any water system that serves at least 25 people or at least 15 service connections year-round. By definition, this is a community water system (CWS). You’re probably familiar with the term “utility”, which we use interchangeably with “water system” here.

At its core, a water system boundary is the spatial extent of service for a water system. A water system boundary can cover a very small or a very large area. It can be contiguous or cover separate areas.

Let’s take an example. Figure 1 shows Ventura, California. Most of the city is served by the Ventura Water Department. This is the boundary that currently exists in California’s database of water systems (see here).

Figure 1. Water system boundary for Ventura Water Department.

While no data is perfect, this boundary provides a pretty close approximation of the people served by the Ventura Water Department. 

This project is about finding known boundaries, like the one shown above for Ventura, and approximating or modeling boundaries where they are not publicly available or do not exist. 

How We Did It – The Bird’s Eye View

The TEMM Approach

The methodology we put forward is called a Tiered, Explicit, Match, and Model approach–or TEMM approach, for short. The name of the approach reflects exactly how we created a nationwide layer of water systems. The TEMM is composed of three hierarchical tiers, arranged by data and model fidelity. 

  1. First, we use explicit water service boundaries like the picture above. These are spatial polygon data, typically provided at the state-level. We call systems with explicit boundaries Tier 1.

  2. In the absence of explicit water service boundary data, we use a matching algorithm to match water systems to the boundary of a town or city (TIGER place polygons). When a water system and TIGER place match one-to-one, we label this Tier 2a. When multiple water systems match to the same TIGER place, we label this Tier 2b. Tier 2b reflects overlapping boundaries for multiple systems. We hope to resolve this in a future iteration. 

  3. Finally, in the absence of an explicit water service boundary (Tier 1) or a TIGER place polygon match (Tier 2a or Tier 2b), a statistical model trained on explicit water service boundary data (Tier 1) estimates a reasonable radius around the provided water system centroid, and that radius is used to draw a circular water system boundary (Tier 3).

Figure 2 below demonstrates the approach. Each water system is assigned one of three tiers, depending on the available data. 

Figure 2. The Tiered Explicit, Match, and Model hierarchy for water system boundary estimation.

Each water system is assigned either an explicit boundary, a matched boundary (to Census Places), or a modeled boundary. Confidence decreases moving from Tier 1 (Explicit) to Tier 3 (Modeled) water system boundary types.
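
The tier hierarchy described above can be sketched as a small decision function. This is only an illustration of the ordering (explicit first, then matched, then modeled), not the pipeline's actual code; the function name and its two inputs are hypothetical.

```python
from typing import Optional


def assign_tier(has_explicit_boundary: bool,
                systems_sharing_place: Optional[int]) -> str:
    """Return a TEMM tier label for one water system.

    has_explicit_boundary: True when a state-provided polygon exists.
    systems_sharing_place: number of water systems (including this one)
        matched to the same TIGER place, or None when no place matched.
    Both inputs are hypothetical names used only for this sketch.
    """
    if has_explicit_boundary:
        return "1"  # explicit boundary always wins
    if systems_sharing_place is not None and systems_sharing_place >= 1:
        # one-to-one match is Tier 2a; a shared place is Tier 2b
        return "2a" if systems_sharing_place == 1 else "2b"
    return "3"  # fall back to a modeled circular boundary


print(assign_tier(False, 1))
```

Checking the explicit boundary first means a system with Tier 1 data never falls through to a lower-confidence tier, even if it also matches a TIGER place.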

Open Data Science

The data pipeline developed in this work can dynamically regenerate the TEMM spatial layer based on the addition of new Tier 1 explicit boundaries, refinement of the Tier 2 matching algorithm, improvements to the Tier 3 modeled boundaries, and changes to any other data dependencies (e.g., improved centroid location to solve the pancake problem). 

This pipeline is provided under the MIT License in a public GitHub repository, along with the output TEMM spatial layer in GeoJSON, shapefile, and CSV formats, available for download on HydroShare. Contributor guidelines on GitHub detail how contributions may be made to the project.
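
As a rough idea of how the GeoJSON output might be consumed, the sketch below parses a tiny stand-in FeatureCollection with the standard library and counts features per tier. The property names ("pwsid", "tier") and the sample values are assumptions made for illustration; check the schema of the downloaded file before relying on them.

```python
import json
from collections import Counter

# Tiny stand-in for the downloaded layer. Property names and values are
# hypothetical and exist only to make this sketch runnable.
SAMPLE_GEOJSON = """
{"type": "FeatureCollection", "features": [
  {"type": "Feature", "properties": {"pwsid": "XX0000001", "tier": "1"},
   "geometry": {"type": "Point", "coordinates": [-100.0, 40.0]}},
  {"type": "Feature", "properties": {"pwsid": "XX0000002", "tier": "3"},
   "geometry": {"type": "Point", "coordinates": [-101.0, 41.0]}},
  {"type": "Feature", "properties": {"pwsid": "XX0000003", "tier": "3"},
   "geometry": {"type": "Point", "coordinates": [-102.0, 42.0]}}
]}
"""


def count_by_tier(collection: dict, tier_field: str = "tier") -> Counter:
    """Count features per tier label in a GeoJSON FeatureCollection."""
    return Counter(
        feature["properties"].get(tier_field)
        for feature in collection["features"]
    )


layer = json.loads(SAMPLE_GEOJSON)
tier_counts = count_by_tier(layer)
print(tier_counts)
```

For a real file, replace `json.loads(SAMPLE_GEOJSON)` with `json.load(open(path))` on the downloaded GeoJSON.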

Key Results

In total, the final SimpleLab TEMM data layer represents systems that deliver tap water to 306.88 million people served by 44,919 water systems. This amounts to 97.22% of the population reportedly served by active community water systems in SDWIS and 90.85% of active community water systems in SDWIS.[1]

Together, around 190 million people (62.16% of the population) are covered by either a Tier 1 or a Tier 2a spatial boundary (Table 1). This is likely an underestimate, because many systems assigned to Tier 2b are likely to move to Tier 2a with further refinement. The remaining approximately 116 million people (37.84%) represented in the layer are covered by a less accurate Tier 2b or Tier 3 boundary. These results indicate relatively high confidence in the spatial accuracy of the resulting TEMM water boundary layer for a majority of the population served by community water systems.

Table 1. Water system boundary types by system size and population served.



| Tier | Number of Water Systems | Percent of Water Systems | Population Served (millions) | Percent of Total Population Served |
| --- | --- | --- | --- | --- |
| 1 | 14,607 | 32.5 | 122 | 39.8 |
| 2a | 9,488 | 21.1 | 69 | 22.4 |
| 2b | 10,104 | 22.5 | 92 | 30.1 |
| 3 | 10,720 | 23.9 | 24 | 7.8 |
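
The system counts and shares in Table 1 can be recomputed with a few lines of arithmetic. The sketch below uses only the figures printed in the table. Because the population column is rounded to whole millions, population shares computed this way differ slightly from the published percentages (for example, about 62.2% rather than 62.16% for Tier 1 plus Tier 2a).

```python
# Rows from Table 1: tier -> (number of systems, population served in millions).
TABLE_1 = {
    "1": (14607, 122),
    "2a": (9488, 69),
    "2b": (10104, 92),
    "3": (10720, 24),
}

total_systems = sum(systems for systems, _ in TABLE_1.values())
total_population = sum(pop for _, pop in TABLE_1.values())


def pct(part: float, whole: float) -> float:
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


# Share of systems in each tier.
system_share = {tier: pct(n, total_systems) for tier, (n, _) in TABLE_1.items()}

# Share of population covered by the higher-confidence tiers (1 and 2a).
high_confidence_share = pct(TABLE_1["1"][1] + TABLE_1["2a"][1], total_population)

print(total_systems, system_share, high_confidence_share)
```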

 

Within the Tier 3 category, there are more small and very small water systems than medium to very large systems. This results in a relatively high number of systems representing a small percentage of the overall population. 

In Figure 3, we show the proportion of the population covered by each tier, in a grid diagram that mimics the geographic position of US states.

Figure 3. Proportion of people served by water systems with different boundary tiers, by state. 

Native American Territories

Water systems in Native American territories are labeled by their EPA region. We can see from Figure 4 below that no tribal systems in our dataset have Tier 1 data. 

Figure 4. Proportion of people served by water systems in Native American territories with different boundary tiers, by EPA region.

Methods – A detailed view

Data Sources

The following data sources were used in the development of the TEMM layer.

Table 2. Data sources used in developing water system boundaries.

| Data Source | Abbreviation | Description | Link |
| --- | --- | --- | --- |
| EPA Safe Drinking Water Information Systems | SDWIS | Public water systems “master list” with key attribute data and some tabular geographic data like cities served | SDWIS MODEL |
| EPA Enforcement and Compliance History Online | ECHO | Public water system facilities archive, drawing lat/long data for facilities centroids from FRS | ECHO Exporter |
| EPA Facilities Registry Service | FRS | FRS regularly updates facilities data with lat/long information, which pipes into ECHO | FRS Geospatial |
| US Census Bureau TIGER/Line (also called “TIGER Places”) | TIGER Places | US Census data of places (cities and towns) used to identify potential service area boundaries | TIGER/Line Shapefiles accessed using R package tigris |
| Unregulated Contaminant Monitoring Rule | UCMR | UCMR 3 and UCMR 4 provide data on pwsid and zip codes served, which can provide higher integrity centroids where needed | UCMR Occurrence data |
| Homeland Infrastructure Foundation-level Data | MHP | Mobile home park centroids | MHP centroids |
| Labeled Water System Boundaries | N/A | URLs on state pages for various water service boundary sources | |

Tier 1

We were able to identify 12 states with readily available, explicit boundaries (Tier 1). By readily available, we mean boundaries that are hosted online and available for download. These states are AZ, CA, CT, KS, MO, NC, NJ, NM, OK, PA, TX, and WA, as seen in Figure 5. We know of a few other states that have water system boundaries, but these are not readily accessible to the public.

Figure 5. Tier 1 water system boundary data accessibility by state.

Tier 2

For systems without explicit boundaries, we match these to Census TIGER Places–which are cities and towns with boundaries that can serve as an approximate water system boundary. Across all water systems (including those with Tier 1 data), 58% of water systems match a TIGER Place. 

We calculate that 83.8% of Tier 2 matched TIGER Place boundaries spatially intersect their assigned explicit Tier 1 boundary (when present). Intersection does not imply perfect overlap, but a majority of cases had more than 75% spatial overlap.

To match water systems to TIGER Places we compile many data sets (outlined in Table 2 above). First, a master list of active community water systems is derived from SDWIS data. Second, supporting locational information–such as city served–is joined from supporting SDWIS tables. Federal water facility data (i.e., ECHO, FRS) exist with spatial centroids (latitude/longitude of the water system facility), but no federal sources provide explicit spatial boundaries. While ECHO is a superset of FRS centroids, both data sources are used because they result in improved matching. Many water system facility centroids in the ECHO database are poor quality because they are merely the centroid of the county or state of the water system. In these cases, we substitute higher quality centroids from other datasets where available, such as UCMR (zip code centroids for select water systems) and MHP (specific lat/long coordinates for mobile home parks).
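
The centroid substitution described above amounts to a fallback rule: keep the ECHO centroid unless it is just a county or state centroid, and otherwise prefer a better source when one exists. The sketch below is one way to express that. The function name, the quality flag, and the order in which MHP is tried before UCMR are assumptions for illustration, not the pipeline's documented behavior.

```python
from typing import Dict, Optional, Tuple

Coordinate = Tuple[float, float]  # (latitude, longitude)


def pick_centroid(
    echo: Optional[Coordinate],
    echo_is_admin_centroid: bool,
    alternatives: Dict[str, Coordinate],
) -> Optional[Tuple[str, Coordinate]]:
    """Choose a centroid and report which source it came from.

    echo_is_admin_centroid: hypothetical flag, True when the ECHO point is
        merely the centroid of the system's county or state.
    alternatives: optional higher-quality points keyed by source name,
        e.g. {"mhp": (lat, lon), "ucmr": (lat, lon)}.
    """
    if echo is not None and not echo_is_admin_centroid:
        return ("echo", echo)
    # Assumed preference: specific MHP coordinates before UCMR zip centroids.
    for source in ("mhp", "ucmr"):
        if source in alternatives:
            return (source, alternatives[source])
    # Nothing better is available: keep the low-quality point if there is one.
    return ("echo", echo) if echo is not None else None


print(pick_centroid((40.0, -100.0), True, {"ucmr": (41.0, -101.0)}))
```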

With this data in place, we begin a series of matches. We match water system name, city served name, and spatial attributes of water systems (centroids) to Census TIGER Places (shapefiles), which are assumed to represent reasonable proxy water system boundaries. Multiple matches are possible among the different match strategies, so a set of steps outlined in the methodology is taken to assign the best match. Best matches are based on rules that are validated against systems with existing labeled boundaries.
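
To give a flavor of the name-based part of this matching, here is a minimal sketch that normalizes names and looks for exact matches between a water system (or its city served) and a list of place names. The list of generic words, the function names, and the example names are all made up for illustration. The actual rules, including the spatial match and the ranking of competing matches, are in the repository.

```python
import re
from typing import List, Optional

# Hypothetical list of generic words to drop before comparing names.
GENERIC_WORDS = {
    "city", "town", "village", "of", "water", "department", "dept",
    "system", "utility", "utilities", "municipal", "works", "district",
}


def normalize(name: Optional[str]) -> str:
    """Lowercase, strip punctuation, and drop generic utility words."""
    tokens = re.findall(r"[a-z0-9]+", (name or "").lower())
    return " ".join(t for t in tokens if t not in GENERIC_WORDS)


def match_places(system_name: str, city_served: Optional[str],
                 place_names: List[str]) -> List[str]:
    """Return place names whose normalized form equals the normalized
    water system name or city served name."""
    keys = {normalize(system_name), normalize(city_served)} - {""}
    return [place for place in place_names if normalize(place) in keys]


# Made-up example names.
matches = match_places("Springfield Water Dept", "SPRINGFIELD",
                       ["Springfield", "Shelbyville"])
print(matches)
```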

Many water systems match TIGER Places one-to-one (these are labeled Tier 2a). Still, multiple water systems can match the same TIGER Place (these are labeled Tier 2b), in which case the proxy boundaries for those systems overlap perfectly. This makes sense, as many urban areas contain multiple water systems. See Figure 6, which shows how Tier 2a and Tier 2b boundaries differ.

Figure 6. Tier 2a and Tier 2b boundaries and match assignment.

Ongoing work on this part of the approach includes developing rules to assign all Tier 2b systems to either Tier 2a or Tier 3–so that we have no one boundary assigned to more than one system. The primary uncertainty in this approach stems from the matching itself, and work to validate the matches is ongoing.

Tier 3

In the absence of explicit spatial boundaries (Tier 1) and matched TIGER Place proxy boundaries (Tier 2), we statistically model an estimated boundary (Tier 3). Model specification hinges on the correlation between the radius of Tier 1 convex hulls (the response variable), and predictors that explain this response, such as service connection count, population served, ownership type and so on.[2]
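
One plausible way to turn a Tier 1 polygon into a single "radius" is to take the area of its convex hull and compute the radius of a circle with the same area, r = sqrt(A / π). Whether the pipeline uses exactly this formula is an assumption here. The sketch below works on planar (projected) coordinates and uses only the standard library.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def convex_hull(points: List[Point]) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o: Point, a: Point, b: Point) -> float:
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(vertices: List[Point]) -> float:
    """Shoelace formula for a simple polygon."""
    n = len(vertices)
    twice_area = sum(
        vertices[i][0] * vertices[(i + 1) % n][1]
        - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2


def equivalent_radius(points: List[Point]) -> float:
    """Radius of the circle whose area equals the convex hull area
    (assumed definition, see the lead-in)."""
    return math.sqrt(polygon_area(convex_hull(points)) / math.pi)


# An L-shaped service area: polygon area is 7, convex hull area is 11.5.
l_shape = [(0, 0), (4, 0), (4, 1), (1, 1), (1, 4), (0, 4)]
print(polygon_area(convex_hull(l_shape)), equivalent_radius(l_shape))
```

The L-shaped example shows why the hull matters: the hull fills in the concave corner, so the resulting radius is larger than one computed from the raw polygon area.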

We experimented with three different models: random forest, XGBoost, and multiple linear regression. These statistical and machine learning models have different assumptions and performance, which are discussed in the sections that follow.

Ultimately, we selected the multiple linear regression model as our final model because it is computationally efficient, easily interpretable, provides confidence intervals to characterize uncertainty in the modeled boundary (this may be useful depending on the application of the model results), and finally because it avoids overfitting.[3]

Linear regression uses correlations between features of a dataset (called predictors) to predict an outcome (i.e. the response variable). There is sound rationale for linear regression in the context of this problem. A strong (and intuitive) linear relationship is observed between Tier 1 water system radii and service connections. The linear model fit outperforms other models, is easily interpretable, and provides standard error metrics. 

We experimented with different model specifications, including one that combined the correlated population and service connection variables. However, this led to negligible improvement in model fit and less interpretable coefficients. Thus, in the final model specification we regress only on service connections. Interaction terms are added for owner type, service area type code, and wholesaler status. A simple linear regression on service connections alone has an R² = 0.56; including these extra terms substantially improves the model fit (R² = 0.66) and reduces test error.
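
The simplest version of this model, a regression of radius on service connections alone, can be sketched with ordinary least squares. The sketch below assumes both variables are log10-transformed (suggested by the log scale in Figure 7, but not stated explicitly), omits the interaction terms, and uses made-up training pairs rather than real Tier 1 data.

```python
import math
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_radius(connections: float, slope: float, intercept: float) -> float:
    """Predicted radius, undoing the assumed log10 transform."""
    return 10 ** (intercept + slope * math.log10(connections))


# Made-up (service connections, radius in meters) pairs, for illustration only.
TOY_DATA = [(100, 500.0), (1_000, 1_500.0), (10_000, 5_000.0), (100_000, 15_000.0)]

log_connections = [math.log10(c) for c, _ in TOY_DATA]
log_radii = [math.log10(r) for _, r in TOY_DATA]
slope, intercept = fit_line(log_connections, log_radii)

print(predict_radius(5_000, slope, intercept))
```

A Tier 3 boundary would then be a circle of the predicted radius drawn around the system's centroid.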

In Figure 7 we plot the predicted radius versus the actual radius for water systems with known boundaries. The relatively close clustering around the line indicates the goodness of fit of our model, with some notable exceptions. The model tends to under-predict the actual radius for very large systems and over-predict the radius for very small systems.

Figure 7. Linear model predictions versus actual observations of radii distance (log scale).

Goodness of fit and error metrics are in Table 8. The linear model outperformed the machine learning approaches, with a higher R² and lower error.

Table 8. Goodness of fit and error metrics.

| Metric | Estimate |
| --- | --- |
| R² | 0.6622484 |
| Root Mean Squared Error | 0.3429215 |
| Mean Absolute Error | 0.2656831 |
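
For readers less familiar with these metrics, the sketch below shows the standard definitions of R², root mean squared error, and mean absolute error. The post does not state the units of the table's error values; given the log scale in Figure 7 they may be on the log-transformed radius, which is an assumption. The numbers in the example are made up.

```python
import math
from typing import List


def r_squared(actual: List[float], predicted: List[float]) -> float:
    """1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def rmse(actual: List[float], predicted: List[float]) -> float:
    """Square root of the mean squared residual."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def mae(actual: List[float], predicted: List[float]) -> float:
    """Mean of the absolute residuals."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up values to exercise the functions.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]
print(r_squared(actual, predicted), rmse(actual, predicted), mae(actual, predicted))
```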

 

More details on the model, comparisons with other approaches, and a deeper-dive into the results can be found in the repository.

Key Limitations

As with any approach, there are limitations to the TEMM approach and the underlying data. We’re actively working to address the limitations that we can. Perhaps the single most important improvement will come from more US states collecting, maintaining, and sharing their water system boundary shapefiles.

Limitations of existing data include:

  • Lack of representation in Tier 1 data for Native American water systems

  • Poor quality spatial data provided by ECHO or FRS that is the basis of the centroid for matched or modeled water systems (Tiers 2 and 3)

  • Missing or incomplete data 

Limitations of the methodology include:

  • Incorrect matches between water system data and Census TIGER Place data are possible where underlying data is inaccurate

  • Rules-based algorithm for matching requires more fine-tuning and validation

  • Model over-predicts radii for very small systems and under-predicts radii for very large systems

Key Contributions

We developed a “Tiered Explicit, Match, and Model” (TEMM) approach to compile a nationwide water system boundary layer via an open source data pipeline. Such a dataset is unprecedented: easily accessible, machine-readable, and clean water system spatial boundary data is presently siloed across states, if it is available online at all.

Key contributions include:

  • Bringing together multiple spatial datasets where they exist, and developing algorithms and models to “fill in” missing data where it does not exist

  • A spatial layer covering 44,919 community water systems and a total population of 306,876,850

  • Coverage of most people (282 million, 92.21% of the population) by a Tier 1 or Tier 2 spatial boundary, which has relatively high accuracy

  • Open-access data pipeline with contributor guidelines and MIT license for re-use

We imagine that the TEMM spatial layer will serve as an appropriate input for analyses that depend on nationwide water service coverage boundaries, recognizing that the methodology is still under development.

SimpleLab Products

The TEMM layer now underlies City Water Project (Beta) and our PRO Water Quality products and APIs. No other water quality data product nationwide relies on water system boundaries.

Endnotes

1. We drop 4,552/49,445 (9.2%) of water systems from this analysis for one of three reasons: (i) they lack a latitude or longitude (and hence we cannot “center” them anywhere), (ii) they fall outside of the 50 US States (e.g., territories like Puerto Rico, Guam, American Samoa), or (iii) their reported population or service connection count is less than 15, which is the minimum service connection count required for a system to be classified as a “community water system”. We will continue quality assessment of this data and approach so that more systems can be included in a future version.

2. Labeled radii are calculated from the convex hull area of Tier 1 systems instead of simple water system area because they better represent calculated Tier 3 circular buffers.

3. “Overfitting” is the tendency of a statistical or machine learning model to fit well to “seen” training data, but generalize poorly to “unseen” testing data. We need the model to generalize well to unseen water systems. The linear model results in the lowest error of the models tested. This may seem surprising at first, but it is not, considering the strong linear relationship between water system radius and service connections. One might expect machine learning approaches like random forest and XGBoost to outperform the linear model. In fact, they do on training sets, but they tend to overfit the training set and do not generalize as well to test sets. Simply put, we do not have enough of the high-dimensional data that machine learning models require to outperform classical approaches like linear regression in the context of this problem.
