1. Introduction
Data collection has recently made a real breakthrough thanks to the wide variety of sensors and devices capable of transmitting information from almost anywhere. With the current increase in data storage capacity, more data can be stored than can be processed. In practice, when processing this amount of information, the problem of incomplete or missing data has to be addressed. The management of data from water networks [1] or from hydrological resources [2,3,4] is no exception. The problem of data loss is especially challenging when it occurs in long bursts of consecutive values.
Aigues de Vic S. A. (AVSA) decided three years ago to renew its SCADA (Supervisory Control And Data Acquisition) system because it was becoming obsolete. AVSA is the enterprise responsible for the water supply of the city of Vic. The SCADA is a tool for the technicians of the Water Purification Plant (WPP), where the water of the Ter river is purified, and for the operators of the Water Distribution System (WDS). The old system was useful for receiving information from the sensors and making decisions, but not for remotely configuring and controlling the devices. For example, in the case of a pumping system, it is possible to see the pumps’ configuration; still, if it is necessary to reduce the pumped water flow, the operator must go to where the pumps are located and do it manually.
In the SCADA renewal process, great importance was given to preserving the data collected by the equipment to be replaced. To avoid the loss of the information accumulated by the old SCADA system during the last four years, the data had to be imported from the old database into the new one. During this procedure, the historical data series were verified with the aim of not importing unusable data, and some problems related to missing data were detected. This case study comes from the data collected by the level sensor of the deposit located in the main water reservoir of the city of Vic. It is crucial to preserve these data since they can be used to find consumption patterns in the city of Vic and then to build models that allow detecting anomalies in the operation of the water distribution network. As drinking water is a scarce resource in many regions of the planet, the efficient management of water becomes strategic. It is therefore necessary to know the behaviour of distribution networks in order to improve their operation and to build reliable models that support predictive maintenance and early fault detection functionalities. This is why modern distribution networks monitor the status of hundreds of variables through SCADA equipment. Technological improvements such as those we are proposing will be implemented progressively, and we expect improvements in management to be incremental as well. At this point, and before exploiting this huge amount of data, its quality should be guaranteed. In this sense, one of the problems encountered is the occasional loss of data due to sensor failures, sensor re-calibrations or communication failures that take time to be repaired and therefore cause the loss of entire bursts of data.
Ensuring data integrity and filling data gaps as credibly as possible is a critical step for the later exploitation of these data; it is the first step in the processing chain, so errors made at this stage are easily propagated to the following ones. The aim of this paper is to present an effective method of data completion.
Different estimators based on interpolation or on linear prediction techniques have been used to restore lost samples [5,6,7] with acceptable results but, in the case of large numbers of consecutive lost samples, the performance of these estimators falls dramatically. However, this type of data loss is frequent, as it is caused by a communication failure between the Programmable Logic Controller (PLC) to which the sensor is connected and the central SCADA server where the data are stored. The reason seems to be that classical data estimation methods hardly exploit patterns that occur on a combination of time scales, such as daily and weekly scales, as happens in our case study. In contrast, methods based on tensor decomposition are capable of revealing these patterns when the data are properly ordered in a multidimensional way [8,9,10,11,12,13].
In a previous study [14], an ad hoc method that combines a tensor decomposition and linear prediction techniques was implemented. The method was specially adapted to work with the water deposit level signals and to deal with long bursts of lost samples. That approach was compared with other reconstruction methods based on tensor techniques found in the literature, providing better results for these specific conditions. The work in [14] also presented a continuity correction method that guaranteed the continuity of the signal of the recovered data. The method presented in this paper is reminiscent of [14] in some aspects, but introduces some novelties that improve the performance very considerably.
This method starts by filling the lost burst values to avoid missing elements before tensorizing the data. The main differences and contributions are: (1) to fill empty values, the most straightforward interpolation is chosen, which we call the ramp method, discarding other computationally more intensive strategies such as [5,6,7] or the one used in [14]; (2) a new way to organize the tensor, called burst centered tensorization, is introduced; and (3) a two-step reconstruction process is applied by concatenating two tensor decompositions with different tensor cores. Each time the data are reconstructed, the continuity correction method is carried out. The first tensor decomposition has a very small-dimension core and obtains a rough approximation, which is refined through a second reconstruction performed by applying a tensor decomposition with a bigger core. Note that, although several tensor decompositions exist, the two most extended and well-known are the Tucker [15,16,17] and the CANDECOMP/PARAFAC (CP) [18,19] decompositions, which are the two considered in this work as well as in [14]. Although different types of tensor decompositions could be combined at this point, the results obtained employing CP and Tucker are very close, so, in order to keep the text simple, this study only considers these two. References [11,12,13,20] can provide the reader with a quality introduction to tensor algebra.
For the type of signals treated, when data losses are distributed uniformly or even in short bursts of fewer than 30–40 samples, all methods work more or less alike. Above that length, tensor-based methods, like the one we propose in this work, show better performance. In practice, it is observed that the length of the bursts of data lost during a SCADA communication cut-off can be much longer than 40 samples. The present work responds to the need to improve the performance of the data completion methods currently used. The main contribution of this research is to improve the data reconstruction methodology developed in [14], whose results are taken as a reference since they were better than those of the proven tensor methods that already exist in the literature.
We have focused on a type of signal obtained from a drinking water distribution network because we have a large amount of data with which to quantify the quality of the solution provided and compare it with other reference algorithms. Besides, the proposed algorithm can be used to complete data of other types, above all data that are directly or indirectly related to human activity and its hourly, daily and weekly patterns. In addition, tensor techniques allow revealing and expressing in a very compact way complicated relationships between modes of different dimensions, so we think they can also be very useful for the development of models for the early detection of failures in water transport systems.
Henceforth, the work is organized as follows. In the Related previous work section, we briefly present the algorithm developed in [14] so that the reader can appreciate the differences with the presented method. In the Materials and methods section, the details for reproducing the results are explained. Aspects related to the database and its pre-processing are treated briefly because they are the same as those in [14]; the same applies to the tensor concepts and to the explanation of the algorithm evaluation and its performance quantification. So that the article can be read in a simple way, all the steps of the algorithm are explained in this section, whether they are contributions of this work, such as the burst centered tensorization, the signal smoothing and the double tensor decomposition approach, or contributions of [14], such as the continuity correction. In the Results section, since the proposed method uses a combination of two tensor decompositions, a study of the combinations of decomposition sizes that provide the best results for both the CP and Tucker models is performed. Then the algorithm is tested in parts, applying each of the proposed improvements one by one and then all together, in order to quantify their impact on the whole algorithm. Finally, the most remarkable aspects are summarized in the Conclusions section.
3. Materials and Methods
3.1. Used Database
The historical data used to perform the simulations are provided by Aigues de Vic S.A. (AVSA). Their Supervisory Control And Data Acquisition (SCADA) system collects approximately 1300 different signals. Specifically, the data used in the simulations are provided by a water level sensor located in the deposit of Castell d’en Planes, which is the water reserve of the city of Vic. Data from this sensor were collected from 1 October 2015 onwards. The data used for the simulations were verified, discarding the weeks with an excess of lost data because, in those weeks, the results of the algorithms cannot be quantified by comparing them with the actual data.
3.2. Tucker and CP Tensor Decompositions
A tensor is a container that can arrange data in N ways or dimensions. An N-way tensor of real elements is denoted as $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and its elements as $x_{i_1 i_2 \ldots i_N}$. According to this, an $I \times 1$ vector is considered a tensor of order one, and an $I \times J$ (or $J \times I$) matrix a tensor of order two.
The procedure of reshaping lower-dimensional original data organized in a vector or in a matrix into a tensor is referred to as tensorization. The procedure of reshaping the elements of a tensor into matrices or vectors is named matricization and vectorization, respectively.
Tensor decompositions are a very useful tool for revealing patterns between the dimensions in which the data are organized. In particular, low order tensor decompositions provide a simplified version of the data while making the relation between dimensions explicit.
In the case of a 3-dimensional tensor $\mathcal{X} \in \mathbb{R}^{I \times J \times K}$, the approximations are given in the form of a smaller core tensor $\mathcal{G} \in \mathbb{R}^{L \times M \times N}$ (where usually $L < I$, $M < J$, and $N < K$) and the $L$, $M$ and $N$ eigenvectors of modes -1, -2, and -3. The eigenvectors of each mode are organized as column vectors in their respective matrices $\mathbf{A} \in \mathbb{R}^{I \times L}$, $\mathbf{B} \in \mathbb{R}^{J \times M}$, and $\mathbf{C} \in \mathbb{R}^{K \times N}$. The size $(L, M, N)$ of the core determines the level of the decomposition.
Many tensor decompositions are known but, of all of them, the most widely used are the Tucker [15] and the CP [18] decompositions. These two are briefly presented below for the three-dimensional case.
In the 3-way Tucker model, the core is defined by the parameters L, M, N, relative to the size of the core $\mathcal{G}$ of the decomposition, and the decomposition is expressed as Tucker(L, M, N) according to:

$\mathcal{X} \approx \mathcal{G} \times_1 \mathbf{A} \times_2 \mathbf{B} \times_3 \mathbf{C}$  (1)

where the symbol $\times_i$ is the i-way product of a tensor by a matrix; such a tensor operation is defined, for instance, in [21]. The matrices of Equation (1) in terms of the column eigenvectors are $\mathbf{A} = [\mathbf{a}_1 \; \mathbf{a}_2 \; \cdots \; \mathbf{a}_L]$, $\mathbf{B} = [\mathbf{b}_1 \; \mathbf{b}_2 \; \cdots \; \mathbf{b}_M]$ and $\mathbf{C} = [\mathbf{c}_1 \; \mathbf{c}_2 \; \cdots \; \mathbf{c}_N]$.
The 3-way CANDECOMP/PARAFAC (from CANonical DECOMPosition/PARAllel FACtorization) model is commonly known as CP and can be seen as a particular case of the Tucker decomposition when the core $\mathcal{G}$ is diagonal ($L = M = N = D$). Taking this observation into account, the CP decomposition can be written in the same terms as the Tucker decomposition, as follows:

$\mathcal{X} \approx \mathcal{D} \times_1 \mathbf{A} \times_2 \mathbf{B} \times_3 \mathbf{C}$  (2)

In the case of the CP decomposition, all dimensions have the same number D of eigenvectors, and the decomposition depends on the single parameter D, so it is referenced as CP(D). It is frequent to see the CP(D) decomposition written as a function of the elements $\lambda_d$ of the diagonal of $\mathcal{D}$, such as:

$\mathcal{X} \approx \sum_{d=1}^{D} \lambda_d \, \mathbf{a}_d \circ \mathbf{b}_d \circ \mathbf{c}_d$

where the symbol ∘ stands for the outer product, and the column vectors $\mathbf{a}_d$, $\mathbf{b}_d$ and $\mathbf{c}_d$ are related with the matrices of Equation (2) according to $\mathbf{A} = [\mathbf{a}_1 \; \cdots \; \mathbf{a}_D]$, $\mathbf{B} = [\mathbf{b}_1 \; \cdots \; \mathbf{b}_D]$ and $\mathbf{C} = [\mathbf{c}_1 \; \cdots \; \mathbf{c}_D]$.
The algebra of tensors is explained in detail, often with the support of graphical illustrations, in [10,11,12,20].
Figure 2 shows a unified representation of both 3D tensor decompositions.
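As an illustration of how a Tucker(L, M, N) model of the form of Equation (1) can be computed in practice, the following NumPy sketch implements a truncated higher-order SVD (HOSVD). It is a minimal illustration, not the decomposition algorithm used in this work; the function names are ours.

```python
import numpy as np

def unfold(T, mode):
    # Mode-n matricization: move the chosen axis first, flatten the rest.
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_dot(T, M, mode):
    # i-way product of tensor T by matrix M along the given mode.
    moved = np.moveaxis(T, mode, 0)
    res = np.tensordot(M, moved, axes=(1, 0))
    return np.moveaxis(res, 0, mode)

def tucker_hosvd(X, ranks):
    # Factor matrices: leading left singular vectors of each unfolding.
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(X, mode), full_matrices=False)
        factors.append(U[:, :r])
    # Core: project X onto the factor subspaces.
    core = X
    for mode, U in enumerate(factors):
        core = mode_dot(core, U.T, mode)
    return core, factors

def tucker_reconstruct(core, factors):
    # X_hat = core x_1 A x_2 B x_3 C, as in Equation (1).
    X = core
    for mode, U in enumerate(factors):
        X = mode_dot(X, U, mode)
    return X
```

For a tensor whose multilinear rank does not exceed the requested core size, this reconstruction is exact; for larger tensors it gives the low-rank approximation used throughout the paper.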
3.3. Double Decomposition Approach
In this section, the proposed data completion method is presented, which achieves better results on the data reconstruction than the previous one developed in [14].
Figure 3 is used to clarify the explanation by showing the different parts as blocks. As already mentioned, only the two most widely known tensor decomposition models have been considered, these being the Tucker and the CP. Thus, the configuration of both decomposition algorithms has been analyzed with the aim of taking the biggest possible advantage of each one.
The first step consists of applying the Smoothing process described in Section 3.3.1, which contributes to making the tensor decomposition a bit more effective. The tensor decomposition produces a continuous response. The sensor, however, measures the level as a percentage of the deposit capacity with a resolution of 1%, providing a discrete signal. When the signal level oscillates around a quantization point, oscillations occur between adjacent discrete values. The Smoothing process is applied on the received data and corrects this effect.
The second step consists of applying an imputation method to fill the missing data. The selected one is the ramp method, which draws a line joining the known extreme values that delimit the burst, as explained in Section 3.3.2. It is a very rough approximation, but it does not need any configuration, which brings simplicity to the algorithm.
Once the data have no empty values, and the positions of the lost burst values have been saved, the burst centered tensorization explained in Section 3.3.3 is applied. This is a new tensorization that places the original burst positions exactly in the central positions of the tensor. Note that this new tensorization does not modify the dimensions of the resulting tensor, which remain $288 \times 7 \times K$, because they are related to the hourly, daily and weekly patterns.
Figure 3 shows the best algorithm configuration for the particular case of $K = 7$ when the tensor decompositions considered are (a) Tucker and (b) CP. In both cases, after applying the first three blocks corresponding to the operations smoothing, ramp method and burst centered tensorization, we obtain the tensor $\mathcal{X} \in \mathbb{R}^{288 \times 7 \times K}$. To recall, its first dimension indexes the SCADA measurements taken every 5 min over 24 h, its second dimension indexes the 7 days that complete a week, and its third dimension indexes the number K of weeks considered (which is an odd number: 3 or 7 in the tests performed).
The next step takes this tensor as input and gives its reconstruction as output. That reconstruction is performed following the Tucker(4,6,1) decomposition in (a) and the CP(1) decomposition in (b). At that point, the samples of the reconstruction occupying the positions of the lost data burst are extracted, and a continuity correction with the original data is applied according to the details of Section 3.3.4. Then, the corrected data are put back into the original tensor in substitution of the values provided by the ramp method.
Following the processing chain, this tensor is taken and modeled using a Tucker(4,7,7) decomposition in (a) and a CP(15) decomposition in (b) to obtain a second reconstruction. Again, the set of samples of this reconstruction located in the positions of the lost burst are taken to apply the continuity correction with the original data. The data obtained are used to complete the burst of missing data.
The Tucker(4,6,1) and Tucker(4,7,7) decompositions shown in Figure 3a correspond to the values that, in our database, optimize the recovery of a burst of 200 lost samples when the tensorization of size $288 \times 7 \times 7$ and the Tucker decomposition are used. The CP(1) and CP(15) decompositions shown in Figure 3b optimize the recovery of a burst of 200 lost samples using the same tensorization of size $288 \times 7 \times 7$ and the CP decomposition. The results are a statistical measure obtained after running 1000 simulations.
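The chaining of the two decompositions described above can be sketched as follows. This is a simplified illustration, assuming a one-pass truncated HOSVD as the low-rank Tucker approximation and omitting the intermediate continuity correction; the mask marks the burst positions filled by the ramp method, and all names are ours.

```python
import numpy as np

def _unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def _refold(M, mode, shape):
    full = [shape[mode]] + [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(M.reshape(full), 0, mode)

def tucker_approx(X, ranks):
    # Low multilinear-rank approximation via sequential SVD truncation.
    Y = X
    for mode, r in enumerate(ranks):
        U, s, Vt = np.linalg.svd(_unfold(Y, mode), full_matrices=False)
        Y = _refold(U[:, :r] @ np.diag(s[:r]) @ Vt[:r], mode, Y.shape)
    return Y

def double_decomposition(X, mask, small=(4, 6, 1), big=(4, 7, 7)):
    # Step 1: a small-core model gives a rough estimate of the
    # missing entries (mask == True); known entries are kept as-is.
    X1 = tucker_approx(X, small)
    X = np.where(mask, X1, X)
    # Step 2: a larger-core model refines the estimate.
    X2 = tucker_approx(X, big)
    return np.where(mask, X2, X)
```

The two calls with different core sizes mirror the rough-then-refined strategy of Figure 3; only the burst positions are ever overwritten, so the known data remain untouched.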
3.3.1. Signal Smoothing
The ad hoc smoothing algorithm adopted was developed considering the way the sensor works. The samples are processed in groups with the same integer value, taking into account whether the signal is increasing, decreasing, or at a relative minimum or maximum. The blocks of samples with identical integer value A are processed keeping in mind the values of the contiguous blocks. The procedure is effortless; more elaborate filtering methods exist, but they introduce delays in the signal, which is why this straightforward solution has been chosen instead. If the block corresponds to a signal increment, a line with a positive slope is built between the extreme values A − 0.5 and A + 0.5. If the block corresponds to a signal decrement, a line with a negative slope is built similarly. If a local maximum or minimum is detected, the block is replaced by a triangular shape with the corresponding orientation.
Figure 4 shows the smoothing performed through an example.
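A possible implementation of this block-wise smoothing is sketched below. It is our reading of the procedure described above; the handling of runs at the very beginning and end of the signal is a simplifying assumption, and the function name is ours.

```python
import numpy as np

def smooth_quantized(x):
    # x: 1-D signal quantized to integer percent levels.
    x = np.asarray(x, dtype=float)
    y = x.copy()
    # Locate runs (blocks) of identical values.
    edges = np.flatnonzero(np.diff(x) != 0) + 1
    starts = np.concatenate(([0], edges))
    ends = np.concatenate((edges, [len(x)]))
    for s, e in zip(starts, ends):
        A, n = x[s], e - s
        prev_v = x[s - 1] if s > 0 else A
        next_v = x[e] if e < len(x) else A
        if prev_v < A < next_v:          # rising block: ramp A-0.5 -> A+0.5
            y[s:e] = np.linspace(A - 0.5, A + 0.5, n)
        elif prev_v > A > next_v:        # falling block: ramp A+0.5 -> A-0.5
            y[s:e] = np.linspace(A + 0.5, A - 0.5, n)
        elif A >= prev_v and A >= next_v and n > 1:   # local maximum: triangle
            half = n // 2
            y[s:s + half] = np.linspace(A - 0.5, A + 0.5, half)
            y[s + half:e] = np.linspace(A + 0.5, A - 0.5, n - half)
        elif A <= prev_v and A <= next_v and n > 1:   # local minimum: triangle
            half = n // 2
            y[s:s + half] = np.linspace(A + 0.5, A - 0.5, half)
            y[s + half:e] = np.linspace(A - 0.5, A + 0.5, n - half)
    return y
```

Because each block is replaced in place, the smoothed signal has the same length as the input and introduces no delay, unlike conventional low-pass filtering.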
3.3.2. The Ramp Method
The ramp method consists of filling the lost data by drawing a line between the last known sample before the start of the lost burst, $x[n_0]$, and the first known sample after the end of the lost burst, $x[n_0 + B + 1]$, where B is the length of the lost data burst in number of samples. So, considering a lost burst of B samples and the index i going from 1 to B, to use a constant increment (or decrement) m and fill the entire lost burst, it must be:

$m = \dfrac{x[n_0 + B + 1] - x[n_0]}{B + 1}, \qquad \hat{x}[n_0 + i] = x[n_0] + i\,m$
Figure 5 shows the performance of the
ramp method.
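The ramp method reduces to a few lines of code. In this sketch (the function name is ours), the burst occupies positions start … start+B−1 of the series:

```python
import numpy as np

def ramp_fill(x, start, B):
    # Fill B consecutive missing samples x[start:start+B] with a straight
    # line joining the last known sample before the burst and the first
    # known sample after it.
    x = np.asarray(x, dtype=float).copy()
    x0 = x[start - 1]            # last known sample before the burst
    xf = x[start + B]            # first known sample after the burst
    m = (xf - x0) / (B + 1)      # constant increment per sample
    x[start:start + B] = x0 + m * np.arange(1, B + 1)
    return x
```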
3.3.3. Burst Centered Tensorization
In the data completion methods that use tensor techniques, as far as we know, the way of organizing the data into a tensor does not take into account the position of the lost data. In this work, a burst centered tensorization method is proposed, where the data selected to fill the tensor depend on the burst location. Figure 6 shows this process. In Figure 6a, the dark blue window shows the selection of data as proposed in [14], in the typical way and with the data presented in a one-dimensional view. Through the dark blue window, it can be seen that the burst is not exactly located in the center of the selected data, which would mean being in the center of the dark blue window, even if the week containing the burst is selected as the central week. This happens because the burst is hardly ever located in the middle of a week, which only occurs if the burst lies exactly in the center of Thursday. The burst centered tensorization forces the burst to be located in the center of the selected data. The cyan window in Figure 6a shows this new data selection, where the burst is placed exactly in the center of the window.
Figure 6b–d show the data tensorized in the typical way. Figure 6e–g show the data tensorized with the burst centered tensorization method, which places the burst in the core of the tensor (in the center of the central day of the central week, which is located in the middle of the tensor). As explained in [14], lost data bursts never exceed one day, meaning that their length, B, is always less than 288 samples and that the burst can be located in the center of a day. Thus, given a burst of B samples in a tensor $\mathcal{X} \in \mathbb{R}^{288 \times 7 \times K}$, the burst samples are placed in the central day of the central week, with initial position $p = 0.5(288 - B)$ within that day. Note that the daily cycles of the burst centered tensorization rarely start at 00:00 and the weeks do not start on Mondays, as occurred in [14]; however, there are always the same number of samples before and after the lost burst, which hardly ever happens with the previous tensorization method.
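The selection of the K-week window that centres the burst amounts to simple index arithmetic. The following sketch (function names and the exact indexing convention are our assumptions) computes the window start in the full series and tensorizes it into the 288 × 7 × K arrangement:

```python
import numpy as np

SAMPLES_PER_DAY = 288          # 24 h of 5-min samples
DAYS_PER_WEEK = 7

def burst_centered_window(burst_start, B, K):
    # Start index (in the full series) of the K-week window that places
    # the burst in the middle of the central day of the central week.
    offset = ((K - 1) // 2) * DAYS_PER_WEEK * SAMPLES_PER_DAY \
             + (DAYS_PER_WEEK // 2) * SAMPLES_PER_DAY \
             + (SAMPLES_PER_DAY - B) // 2
    return burst_start - offset

def tensorize(x, win_start, K):
    # Reshape the selected window into a (288, 7, K) tensor
    # indexed as (sample-of-day, day-of-week, week).
    n = K * DAYS_PER_WEEK * SAMPLES_PER_DAY
    win = np.asarray(x[win_start:win_start + n])
    return win.reshape(K, DAYS_PER_WEEK, SAMPLES_PER_DAY).transpose(2, 1, 0)
```

With this convention, the first burst sample lands at tensor position (0.5(288 − B), 3, (K − 1)/2), i.e., the centre of the central day of the central week.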
3.3.4. The Continuity Correction
This procedure was developed in order to maintain the continuity between the estimate provided by a tensor decomposition in its vector form, $\hat{x}[n]$, and the known values of the original signal $x[n]$ at the edges of the burst. As consistently observed, after a low-rank tensor reconstruction, the samples in the burst positions follow the original signal pretty well, but with significant discontinuities at the extremes. Considering $x[n_0]$ to be the last original known sample before the burst and $\hat{x}[n_0]$ the sample from the tensor reconstruction in that position, we define the initial burst offset as $o_0 = \hat{x}[n_0] - x[n_0]$. Similarly, for a lost burst of length B, the final burst offset can be defined as $o_{B+1} = \hat{x}[n_0 + B + 1] - x[n_0 + B + 1]$. The corrected estimates $\tilde{x}[n_0 + i]$ are computed as follows:

$\tilde{x}[n_0 + i] = \hat{x}[n_0 + i] - \left( o_0 + \dfrac{i}{B + 1}\,(o_{B+1} - o_0) \right), \qquad i = 1, \ldots, B$
Figure 7 shows graphically the
continuity correction applied.
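A sketch of this correction follows; the linear interpolation of the edge offsets across the burst is our reading of the method, and the function name is ours.

```python
import numpy as np

def continuity_correction(x_orig, x_rec, start, B):
    # Remove the edge discontinuities of a reconstructed burst
    # x_rec[start:start+B] by subtracting a linear interpolation of the
    # offsets measured at the two known samples that delimit the burst.
    o0 = x_rec[start - 1] - x_orig[start - 1]   # initial burst offset
    oF = x_rec[start + B] - x_orig[start + B]   # final burst offset
    i = np.arange(1, B + 1)
    offset = o0 + (oF - o0) * i / (B + 1)
    return x_rec[start:start + B] - offset
```

By construction, the corrected burst joins the known samples at both edges without a jump, while preserving the shape of the tensor reconstruction.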
3.4. Algorithm Performance Evaluation
To test all the methods under the same conditions, one thousand different starting positions are first randomly selected from the 77 weeks of previously verified historical data. This set of starting positions determines the groups of consecutive samples which are deleted to simulate the bursts of missing data. The strategy, the data set, the block of 77 consecutive weeks, and the burst lengths B are the same as used in [14], in order to compare the evolution of the algorithm performances. When an algorithm replenishes the missing burst, the Mean Square Error (MSE) per sample with respect to the original data is computed. The same algorithm processes those one thousand randomly selected cases, and the MSE per sample is taken as the parameter to evaluate its performance. Before calculating the MSE, the reconstructed signal is adapted to the sensor resolution of 1% by rounding the values with decimals to the nearest integer. Then, considering $\hat{x}[n_i]$ to be the samples provided by a completion algorithm and $x[n_i]$ to be the true values that had been eliminated from the verified data set to simulate a lost burst of length B, the MSE per sample is computed as:

$\mathrm{MSE} = \dfrac{1}{B} \sum_{i=1}^{B} \left( \hat{x}[n_i] - x[n_i] \right)^2$
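A direct implementation of this metric, including the rounding to the sensor's 1% resolution (the function name is ours):

```python
import numpy as np

def mse_per_sample(x_hat, x_true):
    # MSE per sample over the B samples of a reconstructed burst,
    # after rounding the estimates to the sensor's 1% resolution.
    x_hat = np.rint(np.asarray(x_hat, dtype=float))
    x_true = np.asarray(x_true, dtype=float)
    return np.mean((x_hat - x_true) ** 2)
```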
5. Conclusions
Completing data lost in bursts remains a difficult challenge and is where most data completion methods fail. However, data being lost in bursts is quite common: it is often associated with the failure of a component involved in capturing or transmitting the data. The contribution of [14] presented an ad hoc data completion method to recover data lost in bursts that outperforms the methods available in the literature for the proposed application. This work significantly improves the method in [14], which is taken as a reference for the evaluation of the new algorithm and for comparison purposes. The incorporated novelties are a smoothing process, a new tensorization we have called burst centered tensorization, and the application of two tensor decompositions, one after the other, using two different cores.
It is difficult to evaluate a data completion method in a practical framework. To do so, in this article an experiment was carried out using only verified sets of samples from the historical data of the level sensor, avoiding the use in the experiments of the weeks containing real lost samples. The set of verified data is used by randomly erasing a burst of data and later applying the data reconstruction (completion) method. The error of the reconstruction method can be calculated because the original data are known. To compare the tested methods statistically, the same 1000 simulated lost bursts are restored with each of them. Then the MSE per sample, which is the mean of the error generated over these 1000 simulations, is calculated for each method. In this way, the results of the tested methods can be compared under the same conditions.
It is important to emphasize that, comparing the MSEs obtained on the same database with the same distribution and lengths of lost frames, the proposed method reduces the MSE by between 24% and 40% with respect to the best results obtained in [14]. Note in Figure 11 that the MSE corresponding to a burst of 100 samples falls from 0.71, the best result obtained in [14], to 0.50, the best result obtained with the new methodology, which means a reduction of the MSE of approximately 39.5%. In the case of the burst of 200 samples, the MSE falls from 1.28 to 0.97, which is approximately a 24.2% reduction. Note that the best results in [14] were obtained using an imputation method based on two linear predictors that are difficult to adjust, since they require estimating the length of two FIR filters and, besides, must be continuously adapted. Introducing a double tensor decomposition in the proposed method gives stability and robustness. From the studies carried out, we see that the first tensor decomposition must have a small core that keeps only the most relevant interactions between modes, while the second decomposition must have a larger core to capture the interactions between modes in greater detail. A signal reconstruction example of this procedure is shown in Figure 12 using Tucker(4,6,1) and Tucker(4,7,7), the best core configurations for the double decomposition according to the tests for the Tucker model in the case of a burst length of 200 samples.
The rest of the results are as expected. If we use more data for the restoration, i.e., seven-week rather than three-week tensors, we obtain a lower MSE in the tests, although the percentage reductions are small and tend to stabilise as new data are added. It is also expected that long lost bursts (200 samples in our experiments) yield a higher MSE than shorter bursts of 100 samples. The experiments also indicate that it makes more sense to use 7-week rather than 3-week tensorizations when the lost bursts are long (200 samples), since the difference obtained in terms of MSE is more significant than in the case of shorter bursts (100 samples), for which it could seem very small.
As also pointed out in the introduction, the presented algorithm can be exported to complete data in other types of problems, especially those in which the data capture patterns related to human activity, because in these cases there are interactions at different time scales, be they hourly, daily, weekly, seasonal, etc. Gas or electricity networks also fall into this category. However, it must be kept in mind that, when the algorithm is used in another context, it will probably be necessary to recalculate the optimal dimensions of the decompositions. Finally, before developing models to monitor plants, enable predictive maintenance or discover failures at a very early stage, it is necessary to have reliable data. Tensor algebra is a helpful tool for completing data, and it will surely also be very useful in the development of predictive maintenance models.