ML Clustering of a Waste Water Treatment Plant


Introduction

This dataset comes from the daily measures of sensors in an urban wastewater treatment plant. The aim is to classify the operational state of the plant using clustering techniques in order to predict faults through the state variables of the plant at each stage of the treatment process. The original authors described this as an ill-structured domain.

The data, problem aim and backup information is available from UCI Machine Learning Repository.[^9]

I will tackle this using Python and scikit-learn. All the code and data can be cloned from GitHub. I don’t cover much of the code here, but have covered it in other posts.

Hint from the Field. $1 of spending on water and sewer investment returns $2.03 in revenue. (21st Century Infrastructure Report, prepared for Governor Rick Snyder, November 30, 2016, p. 12, https://www.michigan.gov/documents/snyder/21st_Century_Infrastructure_Commission_Final_Report_1_544276_7.pdf) (accessed Nov. 2020)

Clustering

Clustering refers to a set of algorithms that find subgroups, or clusters, in a data set. The aim is to partition the data into distinct groups so that observations within a cluster are similar to each other, while observations in different groups differ from each other.

For this exercise we have a set of $n$ observations, or instances, each with $p$ features, or attributes. The $n$ observations correspond to the days of measurements taken within the plant (527 instances) of $p$ features (38 attributes). The goal is to use the measurements taken from the sensors to group the observations by operational state of the plant in order to predict faults. We can use the results from this unsupervised method as a basis for future work, combining additional information in a supervised learning algorithm.

Clustering can utilise many methods, including K-Means, Affinity Propagation, Mean Shift, Spectral Clustering, Ward hierarchical clustering, Agglomerative Clustering, DBSCAN, OPTICS, Gaussian mixtures and BIRCH, to name a few. For this problem we will use two commonly known methods – K-Means and Agglomerative clustering.

K-means clustering

The K-means algorithm aims to choose centroids that minimise the inertia, or within-cluster sum-of-squares criterion:

\sum_{i=0}^{n}\min_{\mu_j \in C} (||x_i - \mu_j||^2)
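As a quick illustration, scikit-learn exposes this criterion as the fitted model's `inertia_` attribute; the sketch below verifies that on random stand-in data (the array `X` here is illustrative, not the plant measurements):

```python
import numpy as np
from sklearn.cluster import KMeans

# Random stand-in data for the scaled sensor matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# inertia_ is the within-cluster sum of squares from the formula above:
# for each cluster, sum the squared distances of its points to its centroid.
wcss = sum(
    np.sum((X[kmeans.labels_ == j] - centre) ** 2)
    for j, centre in enumerate(kmeans.cluster_centers_)
)
print(np.isclose(kmeans.inertia_, wcss))  # True
```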

Hierarchical Agglomerative clustering

A dendrogram – a diagram depicting a tree – can represent bottom-up, or agglomerative, clustering. We build the dendrogram starting from the leaves and combining clusters up to the trunk (hence depicted as an inverted tree). This method scales well to many instances or clusters, can capture clusters of various sizes and shapes, and can be used with many pairwise distances, including Euclidean, Manhattan, cosine or precomputed.
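A minimal sketch of this bottom-up construction with SciPy, on random stand-in data (the array `X` is illustrative, not the plant measurements):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Random stand-in data; each row is one observation.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))

# Ward linkage builds the tree bottom-up, merging the closest pair of
# clusters at each step; Z records the n-1 merges and their distances.
Z = linkage(X, method="ward")
print(Z.shape)  # (29, 4)

# Cutting the tree into a fixed number of clusters assigns a label per leaf.
labels = fcluster(Z, t=3, criterion="maxclust")
print(len(set(labels)))  # 3
```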

Wastewater Sources and Flow Rates

The overall aim of wastewater treatment is to remove enough solid and dissolved organic material that the wastewater is safe for discharge to the intended receiving water or other media (that can be rivers, streams, ponds, lakes, oceans, or groundwater – or overland). Treated water is not intended as potable or even swimmable or fishable water, except in very rare cases. – Hopcroft[^7]

Table 3.1 Comparative composition of raw wastewater[^7]

| Constituent | Low strength | Medium strength | High strength |
| --- | --- | --- | --- |
| 5-day Biochemical Oxygen Demand ($BOD_5$), mg/L | 100-200 | 190-220 | 350-400 |
| Chemical Oxygen Demand (COD), mg/L | 240-260 | 430-500 | 800-1000 |
| Total solids, mg/L | 300-400 | 700-800 | 1200-1250 |
| Total suspended solids (TSS), mg/L | 100-120 | 210-220 | 350-400 |
| Volatile suspended solids as a % of TSS | 75-81% | 75-81% | 75-81% |

Table 3.2 EPA discharge limits for wastewater treatment plants (40 CFR 133.102)[^7]

| Time | Total $BOD_5$ | Suspended solids |
| --- | --- | --- |
| Monthly average | 30 mg/L and >85% removal | 30 mg/L or 85% removal |
| Weekly average | 45 mg/L | 45 mg/L |

Appendix 6: Typical effluent quality following various levels of treatment – NWQMS[^8]

| Treatment | BOD mg/L | TSS mg/L |
| --- | --- | --- |
| Raw wastewater | 150-500 | 150-450 |
| A – Pre | 140-350 | 140-350 |
| B – Primary | 120-250 | 80-200 |
| C – Secondary | 20-30 | 25-40 |
| D – Nutrient removal | 5-20 | 5-20 |

Treatment Process

Wastewater treatment involves various processes used inline or in series to get the required effluent quality. In normal sequence, the principal processes are:

  • pre-treatment which removes gross solids, coarse suspended and floating matter
  • primary treatment which removes settleable solids, by sedimentation
  • secondary treatment which removes most of the remaining contaminants including fine suspended solids, colloidal and dissolved organic matter by biological aerobic processes or chemical treatment (where chemical treatment produces an effluent of similar quality to that achieved by biological processes)
  • nutrient removal which further reduces the levels of nitrogen and phosphorus following secondary treatment
  • disinfection of effluent, which reduces pathogens to levels acceptable for the reuse or discharge of the treated wastewater
  • advanced wastewater treatment which further improves the quality of effluent by processes such as sand filtration, ion exchange and microfiltration.

Primary treatment is the first process in the wastewater treatment plant to remove a significant fraction of organic particulate matter (suspended solids). These suspended solids contribute to the biochemical oxygen demand ($BOD_5$) of the wastewater. Thus, removing suspended solids also reduces $BOD_5$. The process is important because the reduction of suspended solids and $BOD_5$:

  • lowers the oxygen demand,
  • decreases the rate of energy consumption, and
  • reduces operational problems with downstream biological treatment processes.

Primary treatment also serves the important function of removing scum and inert particulate matter that the grit chamber did not remove. The scum comprises grease, oil, plastic, leaves, rags, hair, and other floatable material.[^2]

Secondary treatment may include the first three treatments in series or combined in varying configurations. Secondary treatment is a prerequisite for advanced wastewater treatment and disinfection. Advanced wastewater treatment is used for discharges to sensitive environmental areas or for effluent reuse, where health considerations are paramount. – NWQMS[^8]

For info on subsequent treatments, peruse the reference material at the end of the post.

Machine Learning

Frame the Problem

The data set description provided a class distribution of 13 classes, induced by the original donor’s conceptual clustering algorithm. The authors’ classification was defined in chronological order, with the same operational state recurring over different days; hence I considered days with the same operational state as one cluster. On this basis, I could assign the classification from a minimum of 3 clusters up to the 13.

The original class distribution was:

  • Class 1: Normal situation (275 days)
  • Class 2: Secondary settler problems-1 (1 day)
  • Class 3: Secondary settler problems-2 (1 day)
  • Class 4: Secondary settler problems-3 (1 day)
  • Class 5: Normal situation with performance over the mean (115 days)
  • Class 6: Solids overload-1 (3 days)
  • Class 7: Secondary settler problems-4 (1 day)
  • Class 8: Storm-1 (1 day)
  • Class 9: Normal situation with low influent (69 days)
  • Class 10: Storm-2 (1 day)
  • Class 11: Normal situation (53 days)
  • Class 12: Storm-3 (1 day)
  • Class 13: Solids overload-2 (1 day)

I would consider combining these classes to a minimum of 3 distinct classes:

  • Class 1A – Normal situation
  • Class 2A – Normal situation with performance over the mean
  • Class 3A – Normal situation with low influent

Hence, I would look for an optimum cluster number between 3 and 13. If the algorithm returned something outside this range, I would need to rethink the chosen methods for this data set.

Get the Data

The data, problem objective and backup information is available from UCI Machine Learning Repository.[^9]

I created a folder structure in line with Cookiecutter Data Science: a logical, reasonably standardised, but flexible project structure for doing and sharing data science work.

To set up a new project, cookiecutter installation is a prerequisite. Install by running the following command and navigate through the prompts:

# install prerequisite python package cookiecutter
python -m pip install cookiecutter

# start a new project
cookiecutter https://github.com/drivendata/cookiecutter-data-science

Explore the Data

The following is a list of the attributes provided in the data set. I keep this handy, as I found referencing just the number keeps things simple.

The data set didn’t give any units, hence assumptions by the user are required on this front.

| N | Attrib. | Description |
| --- | --- | --- |
| 1 | Q-E | input flow to plant |
| 2 | ZN-E | input Zinc to plant |
| 3 | PH-E | input pH to plant |
| 4 | DBO-E | input Biological demand of oxygen to plant |
| 5 | DQO-E | input chemical demand of oxygen to plant |
| 6 | SS-E | input suspended solids to plant |
| 7 | SSV-E | input volatile suspended solids to plant |
| 8 | SED-E | input sediments to plant |
| 9 | COND-E | input conductivity to plant |
| 10 | PH-P | input pH to primary settler |
| 11 | DBO-P | input Biological demand of oxygen to primary settler |
| 12 | SS-P | input suspended solids to primary settler |
| 13 | SSV-P | input volatile suspended solids to primary settler |
| 14 | SED-P | input sediments to primary settler |
| 15 | COND-P | input conductivity to primary settler |
| 16 | PH-D | input pH to secondary settler |
| 17 | DBO-D | input Biological demand of oxygen to secondary settler |
| 18 | DQO-D | input chemical demand of oxygen to secondary settler |
| 19 | SS-D | input suspended solids to secondary settler |
| 20 | SSV-D | input volatile suspended solids to secondary settler |
| 21 | SED-D | input sediments to secondary settler |
| 22 | COND-D | input conductivity to secondary settler |
| 23 | PH-S | output pH |
| 24 | DBO-S | output Biological demand of oxygen |
| 25 | DQO-S | output chemical demand of oxygen |
| 26 | SS-S | output suspended solids |
| 27 | SSV-S | output volatile suspended solids |
| 28 | SED-S | output sediments |
| 29 | COND-S | output conductivity |
| 30 | RD-DBO-P | performance input Biological demand of oxygen in primary settler |
| 31 | RD-SS-P | performance input suspended solids to primary settler |
| 32 | RD-SED-P | performance input sediments to primary settler |
| 33 | RD-DBO-S | performance input Biological demand of oxygen to secondary settler |
| 34 | RD-DQO-S | performance input chemical demand of oxygen to secondary settler |
| 35 | RD-DBO-G | global performance input Biological demand of oxygen |
| 36 | RD-DQO-G | global performance input chemical demand of oxygen |
| 37 | RD-SS-G | global performance input suspended solids |
| 38 | RD-SED-G | global performance input sediments |

To help me better understand how these sensor measurements related to the plant, I sketched out a typical plant and marked up the values to help me visualise any relationships between the attributes.

wtp_flow_diagram_detail_2000.jpeg

First pass observations:

  • 2 Zn-E stated only at the input to the plant
  • pH values didn’t change from input to output
  • 4 DBO-E was lower than 11 DBO-P, though I expected DBO to decrease going through the plant
  • SS and COND stayed consistent from input to output

Correlations between attributes

A heat map of the data set can offer an easy to visualise representation of the potential correlations (strong/weak) between the attributes.

wtp_correlation_heatmap_all_features.png
Features
Reading along the x-axis, scan up to see which attributes have strong (red) or weak (blue) correlations with the current attribute.

I will use these correlations as a basis to drop unnecessary attributes.

1 Q-E has minimal correlation to all the others. But this is the key feature of the plant.
2 ZN-E has minimal correlations. Since they only reported Zn at the input, I will drop it.
3 PH-E has a high correlation with 10 PH-P, 16 PH-D and 23 PH-S. The pH doesn’t change much from the influent to the effluent.
9 COND-E has a high correlation with 15 COND-P, 22 COND-D and 29 COND-S. The conductivity remains similar throughout the plant.

Prepare the Data

Pull in the raw data using Pandas. Run a few data checks. For this exercise, I didn’t want to arbitrarily remove any outliers, so I kept them. This might be something to fine-tune at a later stage, if warranted.

Data Cleaning

Missing values are key to determining a plan of attack. Most of the attributes have some missing data, as expected. The key attribute driving this plant is the input flow rate. Since this is the main process stream, I dropped the daily observation for any missing flow value.

# Drop any observation (row) missing the plant inflow (column 1, Q-E).
dataset.dropna(axis=0, how='any', subset=[1], inplace=True)

For all the remaining sensor measurements, replacement with the column mean should be an acceptable first pass.
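A minimal pandas sketch of this mean imputation, on a toy frame with illustrative values:

```python
import pandas as pd

# Toy frame standing in for two of the sensor columns (values are illustrative).
df = pd.DataFrame({"SS-E": [100.0, None, 140.0], "PH-E": [7.1, 7.3, None]})

# First pass: fill each remaining gap with the mean of its own column.
df = df.fillna(df.mean())
print(df["SS-E"].tolist())  # [100.0, 120.0, 140.0]
```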

Feature Selection

I will drop ZN-E, PH and COND as discussed earlier in the correlations section.

In addition, I dropped the performance attributes ($A_0, A_1, A_2, …$), as each is just the mathematical relationship between an attribute’s value upstream ($A_0$) and downstream ($A_1$) of a stage in the plant, per the following equation (equation 1):
\text{P } \% = \bigg(1 - \frac{A_1}{A_0} \bigg) \times 100\%

\text{RD-DBO-G} = \bigg(1 - \frac{\text{DBO-S}}{\text{DBO-E}} \bigg) \times 100\%

and plugging in mean values from the data set as an example:
\text{RD-DBO-G} = \bigg( 1 - \frac{20}{180} \bigg) \times 100\% = 89\%
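The same calculation as a small helper function (the function name is my own, for illustration):

```python
def performance(a0, a1):
    """Percent reduction between an upstream (a0) and downstream (a1) value."""
    return (1 - a1 / a0) * 100

# Global BOD reduction using the data-set means quoted above.
print(round(performance(180, 20)))  # 89
```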

Feature Engineering

The following are some examples of feature engineering that may be worth considering at a later stage:

  • decomposing the day – useful if you analyse additional information, for example seasons or temperatures, as the plot evinces some cyclic behaviour in the flow rates
  • transformation of features – one might consider this for the flow rate, given its large scale compared to the other attribute values
  • discretizing continuous features
  • aggregation of features – SSV and SS might be candidates.
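As a sketch of the first item, pandas can decompose a date index into calendar features (the dates here are illustrative, not from the data set):

```python
import pandas as pd

# Derive calendar features from a date index; these would let a model
# pick up seasonal or cyclic behaviour in the flow rates.
idx = pd.to_datetime(["1990-01-15", "1990-04-01", "1990-07-20"])
calendar = pd.DataFrame({
    "month": idx.month,
    "quarter": idx.quarter,
    "dayofyear": idx.dayofyear,
})
print(calendar["quarter"].tolist())  # [1, 2, 3]
```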

Feature Scaling

Given the large disparity between the input flow values and the other attribute values, scaling of the data is appropriate to ensure the algorithms don’t place greater emphasis on the flow purely because of its units.

In addition, it is good practice to use feature scaling on clustering techniques as it may constrict stretched clusters and improve performance. Scaling will not guarantee that the clusters will be spherical, but should improve it.
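A sketch of the scaling step with scikit-learn's `StandardScaler`, using illustrative flow and pH values (not the plant data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative values: flow (~40,000) dwarfs pH (~7) purely because of units.
X = np.array([[40000.0, 7.1], [45000.0, 7.3], [35000.0, 7.2]])

# Standardise each column to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
print(np.allclose(X_scaled.mean(axis=0), 0))  # True
print(np.allclose(X_scaled.std(axis=0), 1))   # True
```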

Short-listing and Fine-Tuning Models

My short-listing for this exercise was the two common techniques of clustering (K-means and Hierarchical).

Fine-tuning the hyper-parameters wasn’t considered at this stage given the sample size, but it is worth doing given more data or subsequent analysis.

Results

Find the Optimum Number of Clusters

K-means clustering

Using the so-called Elbow Method, best shown via a chart of the number of clusters versus the inertia. The inertia usually falls abruptly at the start, then the drop becomes smaller; this kink, or elbow, defines the optimum number of clusters.

The figure shows the optimum number of clusters is 3.
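The elbow computation can be sketched as follows; the blob data is a synthetic stand-in for the scaled plant measurements, not the real data set:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in with three well-separated groups.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [8, 8], [0, 8]], random_state=0)

# Inertia for each candidate number of clusters.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 8)}

# Inertia keeps falling as k grows; the elbow is where the drop flattens out.
drop_to_3 = inertias[2] - inertias[3]
drop_past_3 = inertias[3] - inertias[4]
print(drop_to_3 > drop_past_3)  # True — the curve flattens after k = 3
```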

wtp_inertia_vs_k_plot_elblow_plot.png
In some cases this elbow is well pronounced; in our case it isn’t, so another check is to plot the silhouette scores and find the peak representing the optimum number of clusters.

The Silhouette Score Method also peaks at 3 clusters.
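A sketch of the silhouette check, on synthetic stand-in data with three well-separated groups (not the plant data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in with three well-separated groups.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [8, 8], [0, 8]], random_state=0)

# Silhouette score for each candidate number of clusters; higher is better.
scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X))
          for k in range(2, 8)}
best_k = max(scores, key=scores.get)
print(best_k)  # 3
```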

wtp_silhouette_score_vs_k_plot.png
Another option is to plot a silhouette analysis for various cluster counts. This plot shows the distribution (height) and score (wider is better) of each cluster. The dashed vertical line represents the silhouette score. The aims of the chart are:

  • to ensure all clusters extend to the right beyond the score line
  • the vertical score line sits furthest to the right
  • the height of each cluster is similar, for a consistent distribution.

Plots k = 3 and k = 5 satisfy the objectives, but k = 3 has the highest score, further confirming that the optimum number of clusters is 3.

wtp_silhouette_analysis_plot.png

Hierarchical Agglomerative clustering

The dendrogram is a bottom-up design that starts at the bottom with one cluster per observation. A horizontal line connects the closest pair of clusters, at a distance represented by the pair’s vertical lines. This continues upward until one final cluster encompasses all the observations.

To find the optimum number of clusters from the dendrogram, locate the largest vertical distance between any two horizontal lines that can be drawn completely across the diagram.

Based on this, the largest vertical gap occurs between the horizontal lines at Euclidean distances of roughly 40 and 45. Hence, if you were to draw a horizontal line just above 40, it would bisect 3 vertical lines, which is the optimum number of clusters.
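This visual rule can also be automated by finding the largest gap between consecutive merge heights in the linkage matrix; a sketch on synthetic data with three well-separated groups (not the plant data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.datasets import make_blobs

# Synthetic stand-in with three well-separated groups of 20 points each.
X, _ = make_blobs(n_samples=60, centers=[[0, 0], [8, 8], [0, 8]], random_state=0)

Z = linkage(X, method="ward")
heights = Z[:, 2]            # merge distances, in increasing order
gaps = np.diff(heights)      # gap between consecutive merges

# Cutting between the two merges with the largest gap leaves this many clusters:
k = len(X) - (np.argmax(gaps) + 1)
print(k)  # 3
```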

wtp_dendogram.png

Train the Model and Prediction

Using the optimum number of clusters of 3, we can train the model using K-Means and Agglomerative Clustering.

# Training the K-Means model on the dataset
from sklearn.cluster import KMeans

# Review the elbow plot, silhouette plots, and dendrogram, then set the number of clusters.
n_clusters = 3

kmeans = KMeans(n_clusters=n_clusters, init='k-means++', random_state=0)
y_kmeans = kmeans.fit_predict(X)

# Training the Hierarchical Clustering model on the dataset
from sklearn.cluster import AgglomerativeClustering
hc = AgglomerativeClustering(n_clusters=n_clusters, affinity='euclidean', linkage='ward')
y_hc = hc.fit_predict(X)

Prediction Remapping

As the cluster index values are assigned arbitrarily, I remapped the K-Means and Agglomerative predictions so that indexes 1, 2, 3 run from the cluster with the lowest number of observations (index 1) to the highest (index 3).

I did this to aid in comparing and visualising the resulting graphs, keeping in mind that reading from left to right (1 to 3) runs from the lowest number of observations to the highest.
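The remapping can be sketched as follows; the label array is a toy example, not the actual predictions:

```python
import numpy as np

# Toy labels: the ids a clusterer assigns are arbitrary. Remap them so that
# 1 is the smallest cluster and 3 the largest.
labels = np.array([0, 2, 2, 1, 2, 1, 2])

counts = {c: int(np.sum(labels == c)) for c in np.unique(labels)}
order = sorted(counts, key=counts.get)                 # smallest cluster first
mapping = {old: new for new, old in enumerate(order, start=1)}
remapped = np.array([mapping[c] for c in labels])
print(remapped.tolist())  # [1, 3, 3, 2, 3, 2, 3]
```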

Predicted Clusters versus attributes

As both K-means and Agglomerative provided the same number of optimum clusters, I will only show the results from one method here (K-means). I show both charts in the notebook at GitHub.

The distribution of observations to clusters for K-means was 2, 244, 263, and for Agglomerative was 2, 199, 308. Since the smallest cluster only contains 2 observations, they will not provide much variation and the box plots will show as a flat line, or a collapsed box.

Scatter Plot

This scatter plot is for illustrative purposes only, as it shows just 2 dimensions (x and y) while our dataset has 20+ dimensions, which cannot translate correctly onto a 2D plot. So we cannot take much from this plot regarding overall cluster formation.

wtp_scatterplot_cod_perf_input_clusters.png

Box Plots

Input flow

The average input flow rate (Q-E) hovers around the 40k mark (units unknown). Cluster 3 has a tighter spread than cluster 2.

wtp_kmeans_clusters_3_input_flow_and_zinc_to_the_plant.png
Suspended Solids (SS)

The aim of primary sedimentation is to remove 50-70% of the suspended solids. The effluent suspended solids should be below 30 mg/L per EPA 40 CFR 133.102.

wtp_kmeans_clusters_3_suspended_solids_from_influent_to_effluent.png
The input suspended solids (SS-P) into the primary settler are just over 200 mg/L. Cluster 3 has a few outliers, with one upward of 2000 mg/L.

Note: The SS-D and SS-S y-axes were updated to show a refined range, due to the couple of large outliers at SS-P.

Cluster 1 SS-S effluent remained unchanged from the influent SS-P, around 175 mg/L. This could suggest some overflow or plant-down condition, or it could be a sensor error, given the minimal number of observations. Clusters 2 and 3 effluent SS-S remained below 25 mg/L.

The bottom row of the chart shows the performance values, which relate directly to the attribute values at each point, as shown in equation 1.

  • All clusters performed as expected at the primary settler, at around 60%.
  • For the effluent SS-G, cluster 1 performed poorly at 30%, while clusters 2 and 3 performed well at 90%.

The data set didn’t provide the performance of the suspended solids after the secondary settler, but we can approximate it with equation 1, using the data-set means:
\text{RD-SS-S} \% = \bigg( 1 - \frac{\text{SS-S}}{\text{SS-D}} \bigg) \times 100\% \approx 76\%

Biological Oxygen Demand (BOD / DBO)

The aim of the primary settler is to remove approximately 25-40% of the BOD. The effluent BOD should be below 30 mg/L per EPA 40 CFR 133.102.

All clusters’ BOD (DBO-E) input is around 200 mg/L. The BOD drops across the settlers; the effluent DBO-S remains unchanged for cluster 1, as discussed above, but clusters 2 and 3 leave at < 25 mg/L, a performance or reduction of 90%.

wtp_kmeans_clusters_3_biological_oxygen_demand_from_influent_to_effluent.png
Chemical Oxygen Demand (COD / DQO)

Influent COD (DQO-E) ranges around 400 mg/L for the clusters. Effluent DQO-S reduced to 100 mg/L except for cluster 1 as discussed.

The performance of the COD (DQO) at the primary settler wasn’t provided in the data set, but we can calculate the average with equation 1, at 38%.

\text{RD-DQO-P} \% = \bigg( 1 - \frac{\text{DQO-D}}{\text{DQO-E}} \bigg) \times 100\% = \bigg( 1 - \frac{254}{407} \bigg) \times 100\% = 38\%

Effluent DQO-G shows a reduction of 80% for clusters 2 and 3.

wtp_kmeans_clusters_3_chemical_oxygen_demand_from_influent_to_effluent.png
Sediment (SED)

Most of the sediment is removed during primary settling, at 90% performance. The secondary settlers all but removed effluent SED for clusters 2 and 3, with 99% performance.

Note: The SED-D and SED-S y-axes were updated to show a refined range, due to the couple of outliers at SED-P.

Once again, we can determine the average SED performance at the secondary settler from equation 1:
\text{RD-SED-S} \% = \bigg( 1 - \frac{\text{SED-S}}{\text{SED-D}} \bigg) \times 100\% = \bigg( 1 - \frac{0.03}{0.41} \bigg) \times 100\% = 93\%

wtp_kmeans_clusters_3_sediments_from_influent_to_effluent.png

Closing and Next Step

This exercise shows how to apply clustering to a real-world operating plant.

It is a start, and the charts may provide some useful information for a trained operator.

Additional information on actual faults would help extend this pipeline to supervised learning, refining an overall online model that could predict potential faults in advance, allowing time for the operators to plan maintenance or repair and keep the plant running at its rated capacity.

Additional reading: [^1], [^3], [^4], [^5], [^6]

References

[^1]: Cheremisinoff, Nicholas P.. (1996). Biotechnology for Waste and Wastewater Treatment. William Andrew Publishing/Noyes. Retrieved from AIChE Knovel
[^2]: Mackenzie L. Davis, Ph.D., P.E., BCEE (2020). Water and Wastewater Engineering: Design Principles and Practice, Second Edition (McGraw-Hill Education). AccessEngineering
[^3]: Soli J Arceivala; Dr. Shyam R. Asolekar. Wastewater Treatment for Pollution Control and Reuse, Third Edition. Guidelines for Planning and Designing Treatment Plants and CETPs, Chapter (McGraw Hill Education (India) Private Limited, 2007). AccessEngineering
[^4]: S D Khepar, A M Michael and S K Sondhi. Water Wells and Pumps, Second Edition (McGraw Hill Education). https://www.accessengineeringlibrary.com/content/book/9780070657069
[^5]: Nalco Water, an Ecolab Company. Nalco Water Handbook, Fourth Edition (McGraw-Hill Education: 2018). https://www.accessengineeringlibrary.com/content/book/9781259860973
[^6]: Indranil Goswami, Ph.D., P.E. Civil Engineering All-In-One PE Exam Guide: Breadth and Depth, Third Edition. (McGraw-Hill Education: 2015). https://www.accessengineeringlibrary.com/content/book/9780071821957
[^7]: Hopcroft, Francis. Wastewater Treatment Concepts and Practices, Momentum Press, 2014. ProQuest Ebook Central, http://ebookcentral.proquest.com/lib/slq/detail.action?docID=1816350.
[^8]: NWQMS National Water Quality Management Strategy. Australian Guidelines for Sewage Systems – Effluent Management, 1997.
[^9]: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science. Dataset link https://archive.ics.uci.edu/ml/datasets/water+treatment+plant