Correlation does not imply causation… but it does give you a hint

This is just a short note on plotting a correlation matrix using the seaborn package within Python. I’ve found that this is the best way of showing the similarity between arrays to people who are unfamiliar with correlations. It also allows you to add some colour to your plots, which is always a nice thing! It can be used for a multitude of purposes, so I have left the variable names in my code (at the bottom) as general as possible, so that it can be copied and pasted by other users.

For those who have not seen these matrices before, they show the similarity between different arrays. If two arrays have a correlation value of 1.0, they are perfectly correlated (i.e. they vary in exactly the same way), while a correlation value of 0.0 means that there is no similarity between the two at all. This can be used to compare datasets with one another if you are looking for a similar pattern.
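If you want to see what these values look like in practice, here is a minimal sketch using NumPy’s corrcoef (my own illustration, not part of the plotting code below):

import numpy as np

a = np.arange(100, dtype=float)
b = 2 * a + 1                    # a perfect linear relationship with a
c = np.random.rand(100)          # unrelated random values
print(np.corrcoef(a, b)[0, 1])   # 1.0 – perfectly correlated
print(np.corrcoef(a, c)[0, 1])   # close to 0 – no real relationship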

Also, it is worth noting that one of the principal statements made in statistics is that,

“Correlation does not imply causation”

So you should also have some further information to back up any correlation between arrays.

An example of one of these correlation matrices can be seen below, which shows the comparison of 54 arrays with each other (i.e. I have taken each array and cross-correlated it with the other 53 arrays). The squares with a darker tone have a higher correlation than those with a lighter tone.

[Figure: correlation matrix for 54 arrays]

Your first step is putting your correlation values into a pandas.DataFrame; you can then just use the code below to create the matrix! The DataFrame should contain the full dataset, and the code masks one half so that you end up with this triangle shape (otherwise you would get a mirror image of the values across the diagonal). I have used absolute values, as I didn’t want to deal with negative correlations at this stage (a value of -1.0 means a perfect match, but flipped).

If you don’t have any correlation values, I’d recommend reading up on cross-correlation, which is a method for obtaining them. I might produce a blog post on this at a later date, but it is worth reading into it yourself so that you can fully understand the output.
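In the meantime, one common recipe for getting a single correlation value out of two equal-length arrays looks something like this (a sketch of my own – the helper name is hypothetical, not from this post’s code):

import numpy as np

def normalised_xcorr_max(x, y):
    """Peak of the normalised cross-correlation of two equal-length
    arrays – a single similarity value in the range [-1, 1]."""
    x = (x - x.mean()) / (x.std() * len(x))
    y = (y - y.mean()) / y.std()
    return np.correlate(x, y, mode='full').max()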

— Roseanne


import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale=1.5)

def corr_mat_plot(correlation_mat, show=True, outfile=None):
    """
    Plots the correlation matrix as an image plot to show where the
    highest correlation between arrays is.
    """
    # Mask the upper triangle so that the plot doesn't mirror the values
    mask = np.zeros_like(correlation_mat, dtype=bool)
    mask[np.triu_indices_from(mask)] = True

    # Set up the figure
    fig, ax = plt.subplots(figsize=(10, 10))

    # Draw the matrix (absolute values, so anti-correlations plot as positive)
    sns.heatmap(np.abs(correlation_mat), cmap=sns.cubehelix_palette(8, as_cmap=True),
                mask=mask, vmin=0, vmax=1, square=True, xticklabels=50, yticklabels=50,
                cbar_kws={"shrink": .8, "label": "Correlation value"}, ax=ax)

    plt.title("Correlation between the arrays")

    if outfile:
        fig.savefig(outfile)
    if show:
        plt.show()

    return fig
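To give an idea of how you might call it, here is a hypothetical example – the data and the output filename are made up, and the correlation values come from pandas’ built-in DataFrame.corr():

import pandas as pd

# hypothetical data: 500 samples of 54 arrays, one array per column
data = pd.DataFrame(np.random.randn(500, 54))
correlation_mat = data.corr()   # pairwise correlations, 54 x 54

corr_mat_plot(correlation_mat, show=True, outfile="corr_matrix.png")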

Gutenberg-Richter and fish?

This post is for explaining the basics behind two key statistical seismology terms: Gutenberg-Richter and Poisson distributions.

Gutenberg-Richter

The Gutenberg-Richter law is a relationship which every seismologist knows – for those who are not so aware (like me just over a year ago), it refers to an expression which relates the total number of earthquakes in any given region to the magnitude, by the following equation:

\log_{10} N = a - bM

where N is the total number of earthquakes of magnitude M or greater, a is a constant describing the overall seismicity rate of the region, b is another constant which depends on the seismicity in the area (close to 1 in seismically active areas), and M is the magnitude. This can also be seen in the plot below.

[Figure: the Gutenberg-Richter distribution for a b value of 1. Code for this is at the end of the post.]

What this expression does is relate the frequency of earthquakes to their magnitude, i.e., there are lots of small earthquakes and very few large ones – which makes sense.
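To make that concrete, here is a tiny sketch with made-up constants (a = 4, b = 1 – purely illustrative, not from any particular catalogue):

a, b = 4.0, 1.0  # illustrative constants
for M in range(1, 6):
    print("M = {}: N = {:g}".format(M, 10 ** (a - b * M)))
# -> 1000, 100, 10, 1, 0.1: each unit of magnitude means ten times fewer events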

At the moment, I am creating synthetic seismograms (see Make some noise for how to make the seismic noise), and as I am trying to make my seismograms as realistic as possible, it is only logical to want my seismic events to follow a Gutenberg-Richter distribution as well. I have also added a term for setting a minimum magnitude, as there is often a ‘fall-off’ in the number of recorded events at the lower end of the magnitude range, because small events are harder to actually pick up in real life.

Poisson distribution

You are probably wondering where the fish part of my title comes into play – well, that’s because when I add my events, I do so with Poisson-spaced inter-event times (also below), with magnitudes that follow this distribution (i.e., lots of small and few large earthquakes). For those still not following: ‘poisson’ is French for fish… (ba dum tss)

Anyway, a Poisson process is used for the spacing of inter-event times because earthquake occurrence is commonly said to follow a Poisson distribution. This is a rule which assigns probabilities to the number of occurrences within a given time, for a known average rate. This can be seen in the mathematical formula below,

P(n \geq 1, t, \tau) = 1 - e^{-t/\tau}

where the left-hand term is the probability of at least one earthquake occurring within the time t, given an average recurrence time \tau. This can also be written using \tau = \frac{1}{\lambda}, where \lambda is the rate, i.e., P = 1 - e^{-\lambda t}.

So, if the average recurrence time were 31 days, then after 25 days there would be a 55% probability of at least one event. A Poisson distribution can be easily incorporated, as we just need to produce random numbers which scale with this \lambda term, as seen in the code at the end of this post.
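As a quick sanity check of that number:

from math import exp

tau = 31.0  # average recurrence time (days)
t = 25.0    # elapsed time (days)
p = 1 - exp(-t / tau)
print("{:.0%}".format(p))  # -> 55%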

In summary, I utilise both Gutenberg-Richter and Poisson statistics for my events: the magnitudes are scaled according to Gutenberg-Richter, and the events are spaced in time as per a Poisson distribution. I have supplied both functions (including how to do the Gutenberg-Richter plot) below.

— Roseanne


import numpy as np
import matplotlib.pyplot as plt

def gutenberg_richter(b=1.0, size=1, mag_min=0.0):
    """Generate a sequence of earthquake magnitudes
    according to the G-R law, logN = a - bM.
    Returns both the G-R magnitudes and the
    normalised version.
    """
    # inverse-CDF sampling: map uniform randoms onto the G-R distribution
    g = mag_min + np.log10(1.0 - np.random.rand(size)) / (-1 * b)
    gn = g / g.max()

    return g, gn

# code for plotting the G-R distribution
g, gn = gutenberg_richter(size=10**8)  # unpack both outputs; histogram the raw magnitudes
y, bine = np.histogram(g)
binc = 0.5 * (bine[1:] + bine[:-1])
plt.plot(binc, y, '.-')
plt.yscale('log')
plt.xlabel("Magnitude")
plt.ylabel("Frequency")
plt.show()

import random

def poisson_interevent(lamb, number_of_events, st_event_2, samp_rate):
    """Finds the inter-event times for the events using a Poisson process,
    given lamb and number_of_events. We can use the random.expovariate
    function in Python, as this generates exponentially distributed random
    numbers with a rate of lambda. Taking the cumulative sum of these
    values gives the times at which to place the events, with Poisson
    inter-event times. These times are then used as the positions in the
    noise at which we place the events.
    lamb = lambda value for the Poisson process
    number_of_events = how many events you want
    st_event_2 = your event
    samp_rate = sampling rate
    """
    poisson_values = 0
    while poisson_values == 0:
        # draw exponentially distributed inter-event times
        poisson_values = [int(random.expovariate(lamb)) for i in range(number_of_events)]
        # the cumulative sum gives the actual event times
        poisson_times = np.cumsum(poisson_values)
        for i in range(len(poisson_times) - 1):
            # if two events would overlap, reset and draw again
            if poisson_times[i + 1] - poisson_times[i] <= len(st_event_2) / samp_rate:
                poisson_values = 0

    return poisson_values, poisson_times
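As a purely hypothetical example of calling this, with a 10 s event sampled at 100 Hz and an average spacing of two minutes (the stand-in array here is just a placeholder for a real event trace):

st_event_2 = np.zeros(10 * 100)  # stand-in for a 10 s event trace at 100 Hz
vals, times = poisson_interevent(lamb=1 / 120., number_of_events=10,
                                 st_event_2=st_event_2, samp_rate=100.)
print(times)  # cumulative, Poisson-spaced event times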

Make some noise

I spent a long time looking at how to ‘create noise’ in order to make some synthetic seismograms, so I thought that I would put up my code in case anyone ends up in the same spot as me! I take several steps in order to model this:

  • Load in some typical seismic noise (I have taken mine from a quiet day near the Tungurahua volcano in Ecuador), which has been detrended and demeaned.
  • Take the Fast Fourier Transform (FFT) of this (this puts the data into the frequency domain).
  • Smooth the FFT data.
  • Multiply this by the FFT of white noise.
  • Take the Inverse Fast Fourier Transform (IFFT) of this (this takes it back into the time domain).

The results of this are shown below, where the green is our white noise, the blue is our real seismic noise, and the pink is our synthetic seismic noise.

[Figure: creating seismic noise]

There are a few other intermediate steps in this code (such as looping through so that the noise is built in segments), but it is quite a simple process! A few other libraries are loaded beforehand, such as ObsPy and NumPy, though you will probably have loaded these in already if you are doing this.
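If you just want the core of those steps without the segmenting and ObsPy plumbing, it boils down to something like this minimal sketch (my own stand-in, assuming real_noise is a 1-D NumPy array of recorded noise):

import numpy as np

def synth_noise_sketch(real_noise, smooth_win=20):
    """Colour white noise with the (smoothed) amplitude spectrum of real noise."""
    spec = np.abs(np.fft.rfft(real_noise))                # amplitude spectrum of the real noise
    kernel = np.ones(smooth_win) / smooth_win
    spec_smooth = np.convolve(spec, kernel, mode='same')  # smooth the spectrum
    white_fft = np.fft.rfft(np.random.normal(0, 1, len(real_noise)))
    # multiply in the frequency domain, then go back to the time domain
    return np.fft.irfft(white_fft * spec_smooth, n=len(real_noise))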

Now go and make some noise!

— Roseanne


import numpy as np
from obspy import Trace, UTCDateTime

def movingaverage(values, window):
    """Simple moving-average smoother (an assumed helper – the original
    code calls movingaverage() without defining it)."""
    return np.convolve(values, np.ones(window) / window, mode='same')

def noise_segmenting(poisson_times, st_event_2, st_t, noise_level, samp_rate, delta):
    """Creates a noise array big enough to host all of the events.
    Creates the noise by multiplying white noise by the seismic noise in the
    frequency domain, then inverse FFTs it and scales it to whatever SNR
    level is defined, to output the full noise array.
    poisson_times = array of times where the seismic events will go (boundary for the noise)
    st_event_2 = the event that will be added in later (again, this is a boundary)
    st_t = seismic noise stream that you are basing your synthetic on
    samp_rate, delta = trace properties of st_t
    """
    # end point for the noise, so that it covers all events
    noise_lim = (poisson_times[-1] + len(st_event_2)) * 2  # gives some time after the last event

    # load in the seismic noise to base the synthetic on (a 2-minute slice)
    st_noise_start_t = UTCDateTime("2015-01-22T01:00:00")
    st_noise_end_t = UTCDateTime("2015-01-22T01:02:00")
    test_trace = st_t[0].slice(st_noise_start_t, st_noise_end_t)
    test_trace_length = int(len(test_trace) / test_trace.stats.sampling_rate)

    # set the boundary for how many loops etc
    minutes_long = noise_lim / st_event_2.stats.sampling_rate
    noise_loops = int(np.ceil(minutes_long / 2.0))  # how many 2-minute loops we need

    # zero array to hold each noise segment
    noise_array = np.zeros([noise_loops, len(test_trace)])

    # loop for the number of noise segments needed
    for j in range(noise_loops):
        # average the seismic noise over twenty 2-minute demeaned samples
        # (rfft of n samples gives n//2 + 1 frequency bins)
        tung_n_fft = np.zeros([20, len(test_trace) // 2 + 1])
        for i in range(20):
            st_noise = st_t[0].slice(st_noise_start_t + (i * test_trace_length),
                                     st_noise_end_t + (i * test_trace_length))
            noise_averaging = st_noise.detrend('linear').detrend('demean').normalize()
            # store the amplitude spectrum (assigning the complex FFT into a
            # float array would silently discard the imaginary part)
            tung_n_fft[i] = np.abs(np.fft.rfft(noise_averaging.data))

        # work out the average spectrum
        ave = np.average(tung_n_fft, axis=0)
        # smooth the data
        aves = movingaverage(ave, 20)

        # create normalised white noise
        whitenoise = np.random.normal(0, 1, len(noise_averaging))
        whitenoise_n = Trace(whitenoise).normalize()
        # FFT the white noise
        wn_n_fft = np.fft.rfft(whitenoise_n.data)

        # multiply the FFT of the white noise by the smoothed seismic noise
        # spectrum, then inverse FFT the product back into the time domain
        newnoise_fft = wn_n_fft * aves
        newnoise = np.fft.irfft(newnoise_fft, n=len(st_noise))

        noise_array[j] = newnoise

    # stitch the segments together and transform the noise into an ObsPy trace
    full_noise_array = np.ravel(noise_array)
    full_noise_array_n = Trace(np.float32(full_noise_array)).normalize()
    full_noise_array_n_scaled = Trace(full_noise_array_n.data * noise_level)
    full_noise_array_n_scaled.stats.sampling_rate = samp_rate
    full_noise_array_n_scaled.stats.delta = delta

    return full_noise_array_n_scaled