Correlation does not imply causation... but it does give you a hint

This is a short note on plotting a correlation matrix using the seaborn package in Python. I’ve found this to be the best way of showing the similarity between arrays to people who are unfamiliar with correlations. It also lets you add some colour to your plots, which is always a nice thing! It can be used for many purposes, so I have kept the variable names in my code (at the bottom) as general as possible, so that it can be copied and pasted by other users.

For those who have not seen these matrices before, they show the similarity between different arrays. If two arrays have a correlation value of 1.0, they are perfectly correlated (i.e. they follow exactly the same pattern, although not necessarily with the same values), and a correlation value of 0.0 means that there is no linear relationship between the two. This can be used to compare datasets with one another if you are looking for a similar pattern.
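As a quick illustration of these values, here is a minimal sketch using numpy's `corrcoef` (the Pearson correlation coefficient). The arrays and the `pearson` helper are made up for this example and are not part of my plotting code.

```python
import numpy as np

x = np.arange(10, dtype=float)

# A scaled and shifted copy of x: different values, same pattern
same_pattern = 2.0 * x + 5.0
# The same pattern turned upside down: falls as x rises
flipped = -x
# A pattern that is symmetric about the midpoint of x, so it has
# no linear relationship with x
unrelated = (x - x.mean()) ** 2


def pearson(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


r_same = pearson(x, same_pattern)    # close to +1
r_flipped = pearson(x, flipped)      # close to -1
r_unrelated = pearson(x, unrelated)  # close to 0

print(r_same, r_flipped, r_unrelated)
```

Note that the third pair is clearly related (one is a function of the other), yet the correlation is close to zero, because the coefficient only picks up linear relationships.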

Also, it is worth noting that one of the principal statements made in statistics is that,

“Correlation does not imply causation”

So you should also have some further information to back up the correlation between arrays.

An example of one of these correlation matrices can be seen below, which shows the comparison of 54 arrays with each other (i.e. I have taken each array and cross-correlated it with the other 53 arrays). The squares with a darker tone have a higher correlation than those with a lighter tone.

Correlation matrix for 54 arrays

Your first step is to put your correlation values into a pandas.DataFrame; you can then use the code below to create the matrix! This table should contain the full (square) set of correlation values; the code masks the upper half so that the plot has this triangle shape (otherwise you would end up with the same values mirrored across the diagonal). I have used absolute values as I didn’t want to deal with negative correlation at this stage (this is when the arrays match but one is inverted, so that one rises as the other falls).
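If you are starting from the raw arrays rather than from correlation values, one possible way to build that table is sketched here. It uses pandas' `DataFrame.corr()` (plain Pearson correlation) instead of cross-correlation, and the random data, the column names and the array sizes are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data: 54 arrays of 200 samples each, one array per column
data = pd.DataFrame(
    rng.normal(size=(200, 54)),
    columns=[f"array_{i}" for i in range(54)],
)

# Pairwise Pearson correlation between every pair of columns (54 x 54)
correlation_mat = data.corr()

# Boolean mask that is True on and above the diagonal; seaborn hides
# the cells where the mask is True, leaving only the lower triangle
mask = np.triu(np.ones(correlation_mat.shape, dtype=bool))

n_visible = int((~mask).sum())
print(correlation_mat.shape, n_visible)
```

The resulting `correlation_mat` is square and symmetric with ones on the diagonal, which is why only one triangle needs to be drawn.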

If you don’t have any correlation values, I’d recommend reading up on cross-correlation, which is one way of obtaining these correlation values. I might write a blog post on this at a later date, but it is worth reading into it yourself so that you can fully understand the output.
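To give a rough idea of what such a calculation can look like, here is a minimal sketch of a normalised cross-correlation using numpy's `correlate`. The helper name `max_normalised_xcorr`, the choice to remove the mean, and the choice to keep the peak value over all lags are my assumptions for this example; there are several other conventions, so treat it as a starting point rather than the definitive method.

```python
import numpy as np


def max_normalised_xcorr(a, b):
    """Peak (by magnitude) of the normalised cross-correlation of a and b.

    Hypothetical helper: both arrays have their mean removed and the
    result is scaled so that the values lie between -1 and 1.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a - a.mean()
    b = b - b.mean()
    norm = np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))
    if norm == 0:
        # A constant array has no pattern to correlate with
        return 0.0
    # Correlation at every possible lag between the two arrays
    xcorr = np.correlate(a, b, mode="full") / norm
    return float(xcorr[np.argmax(np.abs(xcorr))])


# Made-up test signal: a single bump, and a copy of it shifted by 30 samples
samples = np.arange(200)
bump = np.exp(-0.5 * ((samples - 80) / 5.0) ** 2)
shifted = np.roll(bump, 30)

peak_same = max_normalised_xcorr(bump, bump)
peak_shifted = max_normalised_xcorr(bump, shifted)
peak_inverted = max_normalised_xcorr(bump, -shifted)

print(peak_same, peak_shifted, peak_inverted)
```

The shifted copy still gives a value close to 1 because the cross-correlation tries every lag, and the inverted copy gives a value close to -1, which is the negative correlation that the absolute value in my plotting code hides.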

— Roseanne


import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale=1.5)

def corr_mat_plot(correlation_mat, show = True, outfile = None):
    """
    Plots the correlation matrix in an image plot to show where the
    highest correlation between arrays is.
    """
    # Make the mask for the upper triangle so that it doesn't mirror image the values
    # np.bool was removed from recent NumPy releases; the built-in bool works everywhere
    mask = np.zeros_like(correlation_mat, dtype=bool)
    mask[np.triu_indices_from(mask)] = True

    # Set up the figure (the font scale is already set once, after the imports)
    fig, ax = plt.subplots(figsize=(10, 10))

    # Draw matrix
    sns.heatmap(np.abs(correlation_mat), cmap = sns.cubehelix_palette(8, as_cmap=True),
                mask=mask, vmin = 0,vmax=1, square=True, xticklabels=50, yticklabels=50,
                cbar_kws = {"shrink": .8, "label" : ("Correlation value")}, ax=ax)

    plt.title("Correlation between the arrays")

    # Save before showing, as the figure may be closed once plt.show() returns
    if outfile:
        fig.savefig(outfile)

    if show:
        plt.show()

    return fig
