Principal Components Analysis (PCA) of Time Series Using Simulated Sine Waves

Leif Peterson, Ph.D.

November 15, 2019

Abstract

Simulated data in the form of sine waves with noise were analyzed as time series using principal component analysis (PCA). Multiple replicates of each underlying signal were used so that the number of features exceeded the number of extracted principal components (PCs). Both the covariance matrix and the correlation matrix were used during eigendecomposition, and the standardized (scaled) and non-standardized (unscaled) PC scores were obtained from each run. A total of eight patterns were evaluated, each consisting of mixtures of amplitudes, wavelengths, and phase shifts. Results indicate that use of the correlation matrix for eigendecomposition typically only introduced small phase shifts among the extracted PC scores (over time, or interval); however, correlation was able to extract a PC vector which correlated strongly with the sum of the data, representing the composite cycle. While the composite cycle is not the dominant cycle in the sense of the greatest power spectrum peak from FFT, the composite nevertheless represents the entire dataset. It is important to realize that PCA cannot de-convolute a dataset into PC score vectors that reflect the individual underlying signals; rather, PCA can only extract shared information (correlation, covariance) between multiple time series. In conclusion, it is important that researchers know specifically whether covariance or correlation is used for eigendecomposition, and whether the standardized or non-standardized PC scores are generated by the algorithms employed for running PCA on time series. The choices made surrounding these options can have a profound effect on results, especially if dimension reduction is being performed to collapse data for further analyses.

1 Principal Components Analysis, PCA

Principal components analysis (PCA) is one of the most popular linear dimension reduction techniques. During PCA, PC scores are based on the summed (linear) components extracted during eigendecomposition of a covariance matrix or correlation matrix.

1.1 Covariance and Correlation Matrices

By definition, the population covariance matrix is

$$\boldsymbol{\Sigma} \overset{\mathrm{def}}{=} E\left[(\mathbf{x}-\boldsymbol{\mu})^{\top}(\mathbf{x}-\boldsymbol{\mu})\right], \tag{1}$$

where x is a p-tuple (x_i1, x_i2, ..., x_ip) for the i th object and μ is a p-tuple with means for the p features. Because the population covariance matrix Σ cannot be measured directly, we are limited to working with the sample covariance matrix, given as

$$\mathbf{S} = \frac{1}{n}\sum_{i=1}^{n}(\mathbf{x}_i - \bar{\mathbf{x}})^{\top}(\mathbf{x}_i - \bar{\mathbf{x}}). \tag{2}$$

Let D = diag(S) be the diagonal matrix of sample variances; the sample correlation matrix is then

$$\mathbf{R} = \mathbf{D}^{-1/2}\,\mathbf{S}\,\mathbf{D}^{-1/2}. \tag{3}$$

During computation, elements rjk  of R can be determined individually using the relationship

$$r_{jk} = \frac{s_{jk}}{\sqrt{s_{jj}\,s_{kk}}} = \frac{s_{jk}}{s_j s_k}. \tag{4}$$

If the means of columns of X are known, covariance elements sjk  of S can be directly calculated using the functional form

$$s_{jk} = \frac{1}{n}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k), \tag{5}$$

whereas, if the standard deviations and correlations are already known, the covariance elements can be calculated using

$$s_{jk} = r_{jk}\, s_j\, s_k. \tag{6}$$

It is assumed that the sample covariance matrix S is an accurate and reliable estimate of the population covariance matrix Σ. When the sample size n is large relative to p this assumption is less problematic, but when n ≪ p the eigenvalues of S are biased: the small eigenvalues are underestimated and the large eigenvalues are overestimated. Moreover, several of the eigenvalues become zero, so S loses full rank and becomes singular; it is therefore not positive definite and cannot be inverted.
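This rank deficiency is easy to demonstrate numerically. The sketch below is a minimal illustration with arbitrary sizes (it uses numpy's sample covariance, which divides by n − 1 rather than the 1/n of Eq. (2)); it shows that when n ≪ p at most n − 1 eigenvalues of S are non-zero, so S is singular.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 100                              # n << p (illustrative sizes only)
X = rng.normal(size=(n, p))
S = np.cov(X, rowvar=False)                 # p x p sample covariance matrix
evals = np.linalg.eigvalsh(S)[::-1]         # eigenvalues sorted in descending order
print(np.sum(evals > 1e-10))                # at most n - 1 = 19 non-zero eigenvalues
```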

Eigendecomposition of the correlation matrix, R. Correlation-based PCA (RPCA) using the correlation matrix R is the traditional form of PCA. Let R be the feature × feature (p × p) correlation matrix. By the Principal Axis Theorem, there exists an orthogonal (rotation) matrix E and a diagonal matrix Λ such that E^⊤ R E = Λ. The columns of E are the eigenvectors, and the diagonal entries of Λ are the eigenvalues λ_1, λ_2, ..., λ_p. NXG Explorer uses the QR algorithm for eigendecomposition of a real square symmetric matrix. The QR algorithm returns the eigenvalues of R sorted in descending order (λ_(1) > λ_(2) > ... > λ_(p)) as well as the accompanying eigenvector matrix with its columns sorted in accordance with the eigenvalues. We also enforce the sum of each kth column of E to be positive, that is, Σ_j^p e_jk ≥ 0, by simply reversing the sign of each element in the column if Σ_j^p e_jk < 0.
Eigendecomposition of the covariance matrix, S. Performing eigendecomposition on the covariance matrix simply involves swapping out correlation with covariance in the Principal Axis Theorem to arrive at E^⊤ S E = Λ. The QR algorithm of Explorer will return the eigenvalues of S sorted in descending order (λ_(1) > λ_(2) > ... > λ_(p)) as well as the accompanying eigenvector matrix with its columns sorted in accordance with the eigenvalues.
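As a concrete sketch of the two eigendecompositions described above (not the NXG Explorer code; numpy's symmetric eigensolver is used here in place of a QR routine, and numpy's covariance divides by n − 1 rather than 1/n), the helper below returns the eigenvalues in descending order with the positive column-sum sign convention applied:

```python
import numpy as np

def eigendecompose(X, use_correlation=False):
    """Eigendecomposition of the covariance matrix S or correlation matrix R of X (n x p).
    Returns eigenvalues in descending order and the matching eigenvector columns,
    each column flipped so that its sum is non-negative."""
    M = np.corrcoef(X, rowvar=False) if use_correlation else np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(M)        # symmetric eigensolver; returns ascending order
    order = np.argsort(evals)[::-1]         # re-sort descending: lambda_(1) >= ... >= lambda_(p)
    evals, evecs = evals[order], evecs[:, order]
    flip = np.where(evecs.sum(axis=0) < 0, -1.0, 1.0)
    return evals, evecs * flip              # enforce sum_j e_jk >= 0 for each column k
```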

1.2 Non-standardized (unscaled) principal component scores from the covariance matrix, F̂.

The non-standardized principal component score, “PC score,” for the i th object and k th principal component when the covariance matrix S is used for eigendecomposition is

$$\hat{f}_{ik} = (x_{i1} - \bar{x}_1)\hat{e}_{1k} + (x_{i2} - \bar{x}_2)\hat{e}_{2k} + \cdots + (x_{ip} - \bar{x}_p)\hat{e}_{pk} = (\mathbf{x}_i - \bar{\mathbf{x}})^{\top}\hat{\mathbf{e}}_k, \tag{7}$$

which translates to the matrix operation

$$\hat{\mathbf{F}}_{n\times p} = \mathbf{X}^{c}_{n\times p}\,\hat{\mathbf{E}}_{p\times p}. \tag{8}$$

Thus, for an input n × p data matrix X, there will be an n × p matrix F̂, which contains principal components (PCs) in columns and objects in rows. For each object there will be p PC scores in each row of F̂, and in each column of F̂ there will be n PC scores, one for each object. We have observed that these non-standardized PC scores are equivalent to the PCA output of most software packages and to what is reported in the majority of textbooks.
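A minimal sketch of Eq. (8), assuming the eigendecompose helper above and a placeholder data matrix X:

```python
# Non-standardized PC scores from the covariance matrix (Eq. 8)
X = np.random.default_rng(7).normal(size=(200, 5))   # placeholder n x p data matrix
evals, E_hat = eigendecompose(X, use_correlation=False)
Xc = X - X.mean(axis=0)                              # column-wise mean centering
F_hat = Xc @ E_hat                                   # n x p: PCs in columns, objects in rows
```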

1.3 Non-standardized principal component scores from the correlation matrix, F̂.

When the correlation matrix R is used, the non-standardized PC scores are based on the relationship

$$\begin{aligned}
\hat{f}_{ik} &= \left(\frac{x_{i1}-\bar{x}_1}{s_1}\right)\hat{e}_{1k} + \left(\frac{x_{i2}-\bar{x}_2}{s_2}\right)\hat{e}_{2k} + \cdots + \left(\frac{x_{ip}-\bar{x}_p}{s_p}\right)\hat{e}_{pk} \\
&= z_{i1}\hat{e}_{1k} + z_{i2}\hat{e}_{2k} + \cdots + z_{ip}\hat{e}_{pk} \\
&= \mathbf{z}_i^{\top}\hat{\mathbf{e}}_k,
\end{aligned} \tag{9}$$

which in matrix notation is

$$\hat{\mathbf{F}}_{n\times p} = \mathbf{Z}_{n\times p}\,\hat{\mathbf{E}}_{p\times p}. \tag{10}$$

Note that, although the mean-zero standardized z-scores are used when eigendecomposition is performed on the correlation matrix, this does not result in standardized PC scores. Standardizing PC scores requires rescaling by the eigenvalues, which is accomplished through the loadings introduced below.
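The correlation-matrix counterpart (Eq. (10)) simply replaces the mean-centered matrix with column-wise z-scores (sketch, continuing the names from the snippets above):

```python
# Non-standardized PC scores from the correlation matrix (Eq. 10)
evals_r, E_hat_r = eigendecompose(X, use_correlation=True)
Z = (X - X.mean(axis=0)) / X.std(axis=0)             # mean-zero, unit-variance z-scores
F_hat_r = Z @ E_hat_r                                # still NOT standardized PC scores
```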

Note: In the notation that follows, whenever the covariance matrix S is used for eigendecomposition, the mean-centered data matrix Xc  must be used for calculation of PC scores. Likewise, when the correlation matrix R is used for eigendecomposition, the mean-zero standardized data matrix Z should be used in PC score calculations.
Cleaning up the Eigenvector Matrix, E. A constraint for the eigenvector matrix E is that any eigenvectors, i.e., columns of E, whose corresponding eigenvalues are zero need to be zeroed out. After eigendecomposition there can nevertheless be non-zero elements of E in these null eigenvectors; they are meaningless, so they must be reset to zero. When done, the appropriate dimensions of E will be p × m, where m is the number of eigenvectors whose eigenvalues are non-zero.
Filtering the Eigenvector Matrix, E. The next step is typically to use only eigenvectors whose eigenvalues are greater than one, λ_j > 1. This can further reduce the number m of columns remaining in E.

After removing eigenvectors (columns) of E with null eigenvalues or with λ < 1, the final number of valid columns will be m ≤ p, and the PC score matrix is now determined as

$$\hat{\mathbf{F}}_{n\times m} = \mathbf{X}^{c}_{n\times p}\,\hat{\mathbf{E}}_{p\times m}. \tag{11}$$
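A sketch of the cleaning and filtering steps followed by Eq. (11), continuing the names above: eigenvectors whose eigenvalues are numerically zero or not greater than 1 are dropped before the scores are computed.

```python
# Keep only eigenvectors with eigenvalue > 1 (this also removes null eigenvectors)
keep = evals > 1.0
E_m = E_hat[:, keep]            # p x m filtered eigenvector matrix
lam_m = evals[keep]             # the m retained eigenvalues
F_hat_m = Xc @ E_m              # Eq. (11): n x m non-standardized PC scores
```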

Loading matrix, L̂. Standardized PC scores are scaled by the reciprocal of the square root of the eigenvalue, 1/√λ_j, for each eigenvector. In order to introduce the eigenvalues into the calculations, the loadings are used, which represent the correlation between each input feature and the given PC. The loading matrix is a square non-symmetric correlation matrix with elements representing the correlation between each of the p features and each PC in the columns of F̂. The loading for the jth (j = 1, 2, ..., p) feature and kth (k = 1, 2, ..., p) PC is given by

$$\hat{l}_{jk} = \hat{e}_{jk}\sqrt{\hat{\lambda}_k}, \tag{12}$$

which, in terms of matrix operations is

$$\hat{\mathbf{L}} = \hat{\mathbf{E}}\,\hat{\boldsymbol{\Lambda}}^{1/2}. \tag{13}$$

It is important to remember that the loading matrix L̂ is a p × p square non-symmetric matrix which has the same dimensions as the covariance matrix S or correlation matrix R. However, if principal components are only kept for the m eigenvalues that are greater than 1, i.e., λ_j > 1, then L̂ will have dimensions p × m. Operationally, this is typically the case, as most of the correlation loads on the first few PCs when there is moderate to strong multicollinearity. However, if most of the features are not correlated, then m → p.
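A sketch of Eqs. (12)-(13), with a numerical check of the loadings-are-correlations interpretation for the correlation-matrix case (names continue from the snippets above):

```python
# Loading matrix (Eq. 13): column k of the eigenvector matrix scaled by sqrt(lambda_k)
L_hat = E_m * np.sqrt(lam_m)                     # p x m loadings, covariance-based run

# For the correlation-matrix run, l_jk equals the correlation between feature j and PC k
p = X.shape[1]
L_hat_r = E_hat_r * np.sqrt(evals_r)             # p x p loadings, correlation-based run
F_r = Z @ E_hat_r                                # non-standardized PC scores from R
corr_feature_pc = np.corrcoef(Z.T, F_r.T)[:p, p:]
print(np.allclose(L_hat_r, corr_feature_pc))     # True (up to floating-point error)
```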

If we assume that eigendecomposition was performed on the covariance matrix S , once the loading matrix is obtained, the standardized PC scores are determined using the relationship

$$\hat{\mathbf{F}} = \mathbf{X}_c\,\hat{\mathbf{L}}\,(\hat{\mathbf{L}}^{\top}\hat{\mathbf{L}})^{-1}, \tag{14}$$

where L̂ = Ê Λ̂^{1/2} is the loading matrix that reveals the correlation between each feature and each PC. This is really a regression problem, for which the goal is to estimate E(f|x), where f are the PC scores and x is the observable data. We have

$$E(\mathbf{f}\mid\mathbf{x}) = \boldsymbol{\mu}_f + \boldsymbol{\Sigma}_{fx}\,\boldsymbol{\Sigma}_{xx}^{-1}(\mathbf{x} - \boldsymbol{\mu}_x). \tag{15}$$

Because the factor scores have zero mean, the mean vector μ_f drops out, and since the covariance between the factor scores and the data is simply the loadings, Σ_fx = L̂^⊤, we get

$$E(\mathbf{f}\mid\mathbf{x}) = \hat{\mathbf{L}}^{\top}\boldsymbol{\Sigma}_{xx}^{-1}(\mathbf{x} - \boldsymbol{\mu}_x). \tag{16}$$

Now, since we know the loading matrix is L̂ = Ê Λ̂^{1/2}, and since the matrix used for eigendecomposition can be written in spectral form (e.g., for the correlation matrix, R = Ê Λ̂ Ê^⊤, so that R^{-1} = Ê Λ̂^{-1} Ê^⊤), the estimate in (14) simplifies to

$$\begin{aligned}
\hat{\mathbf{F}} &= \mathbf{X}_c\,\hat{\mathbf{L}}\,(\hat{\mathbf{L}}^{\top}\hat{\mathbf{L}})^{-1} \\
&= \mathbf{X}_c\,\hat{\mathbf{E}}\hat{\boldsymbol{\Lambda}}^{1/2}\left(\hat{\boldsymbol{\Lambda}}^{1/2}\hat{\mathbf{E}}^{\top}\hat{\mathbf{E}}\hat{\boldsymbol{\Lambda}}^{1/2}\right)^{-1} \\
&= \mathbf{X}_c\,\hat{\mathbf{E}}\hat{\boldsymbol{\Lambda}}^{1/2}\hat{\boldsymbol{\Lambda}}^{-1} \\
&= \mathbf{X}_c\,\hat{\mathbf{E}}\hat{\boldsymbol{\Lambda}}^{-1/2} \\
&= \mathbf{X}_c\,\hat{\mathbf{E}}\hat{\boldsymbol{\Lambda}}^{-1}\hat{\mathbf{E}}^{\top}\hat{\mathbf{E}}\hat{\boldsymbol{\Lambda}}^{1/2}
= \mathbf{X}_c\,\mathbf{R}^{-1}\hat{\mathbf{L}}.
\end{aligned} \tag{17}$$

Therefore, while most textbooks and software packages define F̂ = X_c Ê as the non-standardized PC scores, using X_c L̂ (L̂^⊤ L̂)^{-1} to obtain standardized PC scores equates to using F̂ = X_c Ê Λ̂^{-1/2}, essentially the non-standardized PC scores divided (scaled) by √λ_j. By default, NXG Explorer uses standardized PC scores for 2D score plots of PC1 vs. PC2, since their scales are somewhat smaller and distributed more like the standard normal distribution.
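The equivalence in Eq. (17) can also be verified numerically (sketch, continuing the names above): the regression form with the loadings reproduces the non-standardized scores divided by the square roots of the eigenvalues.

```python
# Verify Eq. (17): Xc L (L'L)^{-1} equals Xc E Lambda^{-1/2}
F_std_reg = Xc @ L_hat @ np.linalg.inv(L_hat.T @ L_hat)   # regression form using the loadings
F_std_dir = (Xc @ E_m) / np.sqrt(lam_m)                   # non-standardized scores scaled by 1/sqrt(lambda)
print(np.allclose(F_std_reg, F_std_dir))                  # True
```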

Table 1: Rules-of-thumb for the matrices required to obtain non-standardized and standardized PC scores from eigendecomposition performed on the covariance matrix or correlation matrix.



Eigendecomposition performed on | PC scores        | Matrix operation
Covariance matrix, S            | Non-standardized | F̂ = X_c Ê
Covariance matrix, S            | Standardized     | F̂ = X_c Ê Λ̂^{-1/2}
Correlation matrix, R           | Non-standardized | F̂ = Z Ê
Correlation matrix, R           | Standardized     | F̂ = Z Ê Λ̂^{-1/2}

Here F̂ is n×m, X_c and Z are n×p, Ê is p×m, and Λ̂^{-1/2} is m×m.
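The four combinations in Table 1 can be wrapped in a small helper. The sketch below uses a hypothetical pc_scores function name and assumes the eigendecompose helper defined earlier; it simply makes the covariance/correlation and standardized/non-standardized choices explicit:

```python
def pc_scores(X, use_correlation=False, standardized=True, eigen_min=1.0):
    """PC scores per Table 1. Eigenvectors with eigenvalue <= eigen_min are dropped."""
    evals, E = eigendecompose(X, use_correlation)
    keep = evals > eigen_min
    E, lam = E[:, keep], evals[keep]
    if use_correlation:
        D = (X - X.mean(axis=0)) / X.std(axis=0)     # z-scores for the correlation case
    else:
        D = X - X.mean(axis=0)                       # mean-centered data for the covariance case
    F = D @ E                                        # non-standardized PC scores
    return F / np.sqrt(lam) if standardized else F
```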



Correlation or Covariance? When judging whether to use the correlation matrix R or the covariance matrix S of the features (time series), the only issue one needs to consider is whether the scales of the time series are similar. If they are, then naturally, covariance should be employed; however, if the scales of the time series are not similar, then correlation should be used.

2 Requirements for Running PCA on Time Series

There are a few rules-of-thumb which must be taken into consideration when applying PCA to time series. Firstly, each time series, x(t), must represent a single feature or variable; that is, the objects represent the values of a given time series at each time interval, and the feature itself represents the entire time series. PCA will not return meaningful results if a single time series is spread across multiple features. With regard to data formatting, this means that if columns represent features or variables, and rows represent objects, then each time series forms a single column. Essentially, each time series is an entire feature whose values are spread down the rows.

After setting up the data, the data matrix X will have dimensions t× p , where t is the number of time periods (rows) and p is the number of time series being evaluated (columns).

Note: Readers need to be aware that in financial engineering and neurology, PCA of time series is a common occurrence. In financial engineering, the daily log-returns of the prices of multiple assets are used as the time series, and the covariance and correlation matrices are obtained for the assets being considered (followed by eigendecomposition). In neurology, the time series analyzed for patients or research subjects are time-specific EEG signals recorded in multiple channels. Regarding feature scales, if all daily asset price returns are in units of USD, and all EEG values have the same units, then the scales will be viewed as being similar. Certainly, you would not combine asset prices in EUR, USD, and Japanese Yen together during a PCA run, which would cause strong dissimilarity between feature scales.

Readers should also note that in financial engineering, PCA has been performed both on the assets and on the days. For example, assume the 30 stocks in the DOW-30 and 1000 daily price returns (4 years of trading, i.e., 250 days/year). The correlation (covariance) matrix can be computed for the 30 stocks (30 features) with the daily returns used as the objects (time series); or, the correlation (covariance) matrix can be computed for the 1000 days (1000 features), with the 30 stocks used as objects, but the latter is not PCA on time series. The point is that in financial engineering, researchers use PCA on assets to understand how each asset correlates with ("loads on") each PC, and use PCA on days to identify hidden groupings of days, which can be viewed as a type of clustering of days.

3 Simulation of Sine Waves

For simulation purposes, we initially chose to use straightforward sine waves for PCA. We assume that the time series, x(t), can be strictly based on a sine wave defined as

$$\begin{aligned}
x(t) &= A\sin(\omega t + \varphi) \\
&= A\sin(2\pi f t + \varphi) \\
&= A\sin\!\left(2\pi\left(\tfrac{1}{\lambda}\right)t + \varphi\right) \\
&= A\sin(2\pi t/\lambda + \varphi),
\end{aligned} \tag{18}$$

where t is the time or interval, A is the amplitude, f is the frequency (1/λ, where λ is the wavelength), and φ is the phase shift. For a fixed frequency, it is easier to use the wavelength (period) directly in the equation for x(t) by simply dividing 2πt by λ, adding the phase shift φ before taking the sine, and then multiplying the result by the amplitude A. A small amount of noise was also added to each value of x(t) so that we could simulate a number of features having the same underlying sine wave but differing slightly from one another. This was accomplished by adding a noise value of U(0,1) * 0.01 to each x(t) value, where U(0,1) is a random uniform variate in the range [0,1].
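A sketch of one simulated series, using the hypothetical function name sine_series: Eq. (18) in wavelength form plus the U(0,1) * 0.01 noise term.

```python
import numpy as np

def sine_series(t, amplitude=1.0, wavelength=100.0, phase=0.0, rng=None):
    """x(t) = A*sin(2*pi*t/lambda + phi) + U(0,1)*0.01, evaluated at the array of intervals t."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.uniform(0.0, 1.0, size=np.shape(t)) * 0.01
    return amplitude * np.sin(2.0 * np.pi * np.asarray(t) / wavelength + phase) + noise
```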

For each experimental setup, we used a set of four underlying sine waves (without noise), each having a different amplitude, wavelength, or phase shift. For each of the four underlying sine waves, we simulated a set of 25 time series (features), each of which had the same parameters (amplitude, wavelength, shift) but with a small amount of noise added. Therefore, within each set of 25 time series based on the same underlying sine wave, the time series were not exactly the same because of the added noise. The total number of features (time series, or sine waves) was therefore p = 100. We also set T = 500, so that t = 1, 2, ..., 500, and the final dataset consisted of 100 time series each having 500 intervals. It was important not to use a wholly different sine wave for each simulated time series, as this would not allow comparison between the PC scores derived from the simulated data and the unique underlying pattern shared within each set of 25 time series.

The following notation was used to represent each set of 25 time series that were similar: X1  represented time series 1-25, X2  represented time series 26-50, X3  represented time series 51-75, and X4  represented time series 76-100. For plotting purposes, we also added together and multiplied together all of the time series, and plotted the sum and product with labels SUM(X1:X4) and PROD(X1:X4) in each plot. (Note that the sum and product of all signals were not used in PCA, but rather, only in plotting). The various simulation setups and their results are presented in the following paragraphs.
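A sketch of the full simulated dataset: 4 underlying signals, 25 noisy replicates each, and T = 500 intervals, giving a 500 × 100 matrix with one time series per column (the parameter values shown reproduce the first setup below; they would be changed for each pattern).

```python
T, replicates = 500, 25
t = np.arange(1, T + 1)                            # intervals t = 1, 2, ..., 500
rng = np.random.default_rng(42)

# (amplitude, wavelength, phase) for the four underlying signals; identical here (first pattern)
signal_params = [(1.0, 100.0, 0.0)] * 4

columns = [sine_series(t, A, wl, ph, rng)
           for (A, wl, ph) in signal_params
           for _ in range(replicates)]
X = np.column_stack(columns)                       # 500 x 100 data matrix (X1:X4 in blocks of 25)
```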

Same wavelength, no amplitude or phase. What happens when all 100 simulated time series are similar with high multicollinearity? This was simulated using the same underlying wavelength, without amplitude changes or a phase shift:
$$\begin{aligned}
X1 &= \sin(2\pi t/100) + U(0,1)\cdot 0.01 \\
X2 &= \sin(2\pi t/100) + U(0,1)\cdot 0.01 \\
X3 &= \sin(2\pi t/100) + U(0,1)\cdot 0.01 \\
X4 &= \sin(2\pi t/100) + U(0,1)\cdot 0.01
\end{aligned} \tag{19}$$

When all 100 signals were similar and had the same underlying wavelength, only 1 PC was extracted, and it appeared to represent the signal exactly. This is expected when all features are highly correlated: all of the features load on one PC, and the explained variance is based on only the first principal eigenvalue, λ_1. Quite naturally, the single PC also coincided with the composite sum of the signals.
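This is easy to confirm with the sketches above: with all 100 series sharing one underlying wavelength, only a single eigenvalue of the covariance matrix exceeds 1, so only one PC survives the filter (illustrative check; exact eigenvalues depend on the random noise).

```python
evals, _ = eigendecompose(X, use_correlation=False)
print(np.sum(evals > 1.0))                                   # 1: a single dominant PC
scores = pc_scores(X, use_correlation=False, standardized=True)
print(scores.shape)                                          # (500, 1)
```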

Note: In all of the output images which follow, the lines shown for the composite sum of all 100 x(t) values at each interval t, i.e., SUM(X1:X4), as well as the product of all x(t) at each value of t, PROD(X1:X4), were not used during the PCA runs. Rather, these two lines are simply plotted to reveal the composite sums and products of the original cycles so that they can be compared with the extracted PCs.

[Figure: standardized PC scores, covariance matrix]

As shown below, the non-standardized PC scores are much larger than the standardized scores, since they are not scaled by 1/√λ_j:

[Figure: non-standardized PC scores, covariance matrix]

Use of the correlation matrix for eigendecomposition resulted in similar PC scores when compared with covariance:

[Figure: standardized PC scores, correlation matrix]

Again, the non-standardized PC scores display too large a scale when reflecting patterns among the signals:

[Figure: non-standardized PC scores, correlation matrix]

Same wavelength, different amplitude, no phase. Next, we used the same underlying wavelength for all 100 simulated time series, but used a changing amplitude for each set of 25:
$$\begin{aligned}
X1 &= \sin(2\pi t/100) + U(0,1)\cdot 0.01 \\
X2 &= 1.5\sin(2\pi t/100) + U(0,1)\cdot 0.01 \\
X3 &= 2\sin(2\pi t/100) + U(0,1)\cdot 0.01 \\
X4 &= 2.5\sin(2\pi t/100) + U(0,1)\cdot 0.01
\end{aligned} \tag{20}$$

This resulted in a finding similar to the previous experiment, where the features were still highly correlated after the introduction of changing amplitude, but with the same wavelength. Obviously, with such high multicollinearity there would be only 1 PC extracted from the covariance matrix.

[Figures: standardized and non-standardized PC score plots from the covariance and correlation matrices]

Same wavelength, different phase. For this simulation, for each set of 25 time series with noise, we varied the phase but used the same wavelength:
$$\begin{aligned}
X1 &= \sin\!\left(2\pi t/100 - \tfrac{\pi}{4}\right) + U(0,1)\cdot 0.01 \\
X2 &= \sin\!\left(2\pi t/100 - \tfrac{2\pi}{4}\right) + U(0,1)\cdot 0.01 \\
X3 &= \sin\!\left(2\pi t/100 - \tfrac{3\pi}{4}\right) + U(0,1)\cdot 0.01 \\
X4 &= \sin(2\pi t/100 - \pi) + U(0,1)\cdot 0.01
\end{aligned} \tag{21}$$

The PCA results when there was only a π/4 phase shift for the same wavelength indicate that only two PCs were extracted, as shown in the images below:

[Figures: standardized and non-standardized PC score plots from the covariance and correlation matrices]

Same wavelength, different phase, with amplitude change. Next, we added an amplitude change for each set of 25 time series and retained the π/4 phase shift:
$$\begin{aligned}
X1 &= \sin\!\left(2\pi t/100 - \tfrac{\pi}{4}\right) + U(0,1)\cdot 0.01 \\
X2 &= 1.5\sin\!\left(2\pi t/100 - \tfrac{2\pi}{4}\right) + U(0,1)\cdot 0.01 \\
X3 &= 2\sin\!\left(2\pi t/100 - \tfrac{3\pi}{4}\right) + U(0,1)\cdot 0.01 \\
X4 &= 2.5\sin(2\pi t/100 - \pi) + U(0,1)\cdot 0.01
\end{aligned} \tag{22}$$

Results indicate that still only 2 PCs were extracted, as shown in the images below:

[Figures: standardized and non-standardized PC score plots from the covariance and correlation matrices]

Different wavelength, no phase or amplitude. Here we only employed a different wavelength for each set of 25 simulated time series:
$$\begin{aligned}
X1 &= \sin(2\pi t/50) + U(0,1)\cdot 0.01 \\
X2 &= \sin(2\pi t/100) + U(0,1)\cdot 0.01 \\
X3 &= \sin(2\pi t/200) + U(0,1)\cdot 0.01 \\
X4 &= \sin(2\pi t/300) + U(0,1)\cdot 0.01
\end{aligned} \tag{23}$$

These results suggest that when only the wavelength is changed, 4 PCs are extracted, implying a greater degree of orthogonality (non-correlation) between the sets of 25 simulated signals. In addition, the PC score vectors appear as composite signals representing the summation of the 4 cycles. It is also interesting to note that, even though the signals themselves have no difference in scale (amplitude), the extracted PCs do introduce scale differences.

[Figure: standardized PC scores, covariance matrix]

The non-standardized PC scores display too large a scale when reflecting patterns among the signals:

[Figure: non-standardized PC scores, covariance matrix]

Use of the correlation matrix for eigendecomposition resulted in a PC score vector (i.e., PC3) which tracked with the composite sum of the signals. Although the PCs cannot de-convolute the data into the constituent individual cycles, one of them does nevertheless reflect the composite cycle (sum) of the data:

[Figure: standardized PC scores, correlation matrix]

Again, the non-standardized PC scores display too large a scale when reflecting patterns among the signals:

[Figure: non-standardized PC scores, correlation matrix]

Different wavelength, different amplitude, without phase. Using different wavelengths, we now add amplitude changes:
$$\begin{aligned}
X1 &= \sin(2\pi t/50) + U(0,1)\cdot 0.01 \\
X2 &= 1.5\sin(2\pi t/100) + U(0,1)\cdot 0.01 \\
X3 &= 2\sin(2\pi t/200) + U(0,1)\cdot 0.01 \\
X4 &= 2.5\sin(2\pi t/300) + U(0,1)\cdot 0.01
\end{aligned} \tag{24}$$

When both amplitude and wavelength changes are introduced, it seems that the resulting PCs remove the scale and tend to represent the original signals more closely than in the previous results, where the signals had no change in scale (amplitude).

[Figure: standardized PC scores, covariance matrix]

Like before, the non-standardized PC scores display too large a scale:

[Figure: non-standardized PC scores, covariance matrix]

When amplitude is introduced to the varying wavelengths, unlike before, correlation was not able to extract a PC score vector which tracked with the composite sum of signals:

[Figure: standardized PC scores, correlation matrix]

The non-standardized PC scores, again, displayed too large a scale:

[Figure: non-standardized PC scores, correlation matrix]

Purely negative correlation (r = −1), phase shift = π. Purely negative correlation, that is, a correlation coefficient of r = −1, was invoked by introducing a phase shift of π in half of the time series:
$$\begin{aligned}
X1 &= \sin(2\pi t/100) + U(0,1)\cdot 0.01 \\
X2 &= \sin(2\pi t/100) + U(0,1)\cdot 0.01 \\
X3 &= \sin(2\pi t/100 - \pi) + U(0,1)\cdot 0.01 \\
X4 &= \sin(2\pi t/100 - \pi) + U(0,1)\cdot 0.01
\end{aligned} \tag{25}$$

Under negative correlation, where the signals cancel each other out, only 1 PC vector is obtained.

[Figures: standardized and non-standardized PC score plots from the covariance and correlation matrices]

Zero correlation (r = 0), phase shift = π/2. Zero correlation (r = 0) was introduced by using a phase shift of π/2 for half of the time series:

$$\begin{aligned}
X1 &= \sin(2\pi t/100) + U(0,1)\cdot 0.01 \\
X2 &= \sin(2\pi t/100) + U(0,1)\cdot 0.01 \\
X3 &= \sin(2\pi t/100 - \pi/2) + U(0,1)\cdot 0.01 \\
X4 &= \sin(2\pi t/100 - \pi/2) + U(0,1)\cdot 0.01
\end{aligned} \tag{26}$$

When the correlation is zero and the two sets of signals are orthogonal, as expected, 2 PC vectors were obtained.

[Figures: standardized and non-standardized PC score plots from the covariance and correlation matrices]

4 Discussion

It is important to realize that PCA will not de-convolute a signal the way FFT will, or at least pull out something like power spectrum values to reveal the major wavelengths of the dominant cycles. When there are mixtures of various signals in a dataset, PCA will determine the degree of correlation between the signals and return orthogonal signals (PC score vectors) based on the unique correlation patterns identified. But PCA will not return PC score vectors for the original underlying signals in the data the way FFT will; rather, the returned signals will be based on collapsed information hinged to correlation. For example, if there are two underlying signals, one based on sine and one on cosine, and there is a small degree of correlation between them, then the PC score vectors that PCA returns will be based on the shared information between the two signals, whereas FFT will provide the wavelengths of the original two signals independent of their correlation. In summary, the important issue with PCA is that it can only return PC score vectors based on shared information between signals, and nothing regarding the original unique signals in a dataset.

While some reports have suggested that the covariance matrix must always be used for eigendecomposition of time series, our results indicate that use of correlation typically only introduces a small phase shift in the PC vectors, which may or may not be important for more complex data. Irrespective of the phase shift in PC vector values introduced by correlation, when performing dimension reduction the difference between using covariance and correlation for eigendecomposition would likely not cause a strong bias. An important finding when using correlation for eigendecomposition was that one of the extracted PC vectors tracked closely with the composite (sum) of all the cyclical data. Although the composite cycle (sum of cycles) is not the dominant cycle in terms of an FFT power spectrum, this observation is important since the composite signal represents the overall underlying signal in a dataset.

Lastly, the standardized (scaled) PC scores based on a least-squares fit with the loading matrix should be used because their scale is much smaller than that of the non-standardized PC scores, which are not scaled by the eigenvalues. One of the advantages of using the standardized PC scores based on the loading matrix is that columns of the loading matrix are automatically zeroed out for any eigenvalues which are zero, since the loadings are defined as l_jk = e_jk √λ_k. When raw PC scores are used, by contrast, there can be residual non-informative values remaining in the eigenvector matrix, which have to be zeroed out manually for any zero eigenvalues or when the PCs are filtered, e.g., for λ_j > 1.