A correlation between two variables means that when one variable changes, the other tends to change as well. The relationship can be positive or negative.
Height and weight are correlated. As height of a population increases, weight tends to increase. This is a positive correlation.
As a population’s fruit and vegetable consumption increases, the mortality rate for heart disease decreases, which is a negative correlation.
Pearson’s correlation coefficient is the most commonly used measure of correlation. It is represented by the Greek letter rho (ρ) for a population parameter and by r for a sample statistic.
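For reference, the sample coefficient r for n paired observations can be written as below. The symbols (x_i and y_i for the paired values, x̄ and ȳ for their sample means) are standard textbook notation and are not taken from the text above.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}
```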
This gives a number between -1 and +1. In medicine, the size of a correlation is commonly interpreted as:
- 0.9 A very strong correlation
- 0.7 A strong correlation
- 0.5 A moderate correlation
- 0.3 A weak correlation
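The interpretation scale above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: it assumes the listed values are lower bounds on the absolute value of r, the "negligible" label for smaller values is a hypothetical addition, and the height and weight figures are made up for the example.

```python
import math


def pearson_r(xs, ys):
    """Pearson's sample correlation coefficient, computed from its definition."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


def strength(r):
    """Label |r| using the scale above, read as lower bounds (an assumption)."""
    size = abs(r)
    if size >= 0.9:
        return "very strong"
    if size >= 0.7:
        return "strong"
    if size >= 0.5:
        return "moderate"
    if size >= 0.3:
        return "weak"
    return "negligible"  # hypothetical label, not part of the scale in the text


# Made-up illustrative data, not taken from any study.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 77, 88]

r = pearson_r(heights_cm, weights_kg)
print(round(r, 3), strength(r))
```

The sign of r gives the direction of the relationship, so the label is based on the absolute value only.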
Below is an example of calculating Pearson’s correlation coefficient using information stored in a database, instead of a spreadsheet.
The data is obtained from the China-Cornell-Oxford Project (The China Study). The example examines the link between blood cholesterol and the amount of animal protein consumed as a percentage of total protein. Male and female data are very similar. Only the female data is shown in the table.
Surveys were conducted in 1983–1984 and 1989–1990. The study consisted of 6,500 people in 65 counties from 25 provinces. In each county, two villages (xiang) were selected with 25 men and 25 women from different families selected from each village. Blood, urine and food samples were obtained for analysis, questionnaires were completed and three-day diet information was recorded.
The A89AllVariables table contains the following information:
- Province Code
- Province Name
- Sex: M, F, T (combined M and F)
- Xiang: Xiang 3 combines data from xiang 1 and 2
- P001: Total cholesterol mg/dL
- D036: % animal protein / total protein consumption
- plus an additional 360 variables
SELECT 'Correlation P001 & D036: ' AS Labels,
       CASE WHEN Sex = 'M' THEN 'Male' ELSE 'Female' END AS Sex,
       (SUM(P001 * D036) - ((SUM(P001) * SUM(D036)) / COUNT(P001)))
       / SQRT((SUM(P001 * P001) - ((SUM(P001) * SUM(P001)) / COUNT(P001)))
            * (SUM(D036 * D036) - ((SUM(D036) * SUM(D036)) / COUNT(D036)))) AS Correlation
FROM A89AllVariables
GROUP BY Sex
This produces the following results, which match those obtained from the LibreOffice spreadsheet program to 7 decimal places.
Correlation P001 & D036:  Female  0.65
Correlation P001 & D036:  Male    0.67
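The query relies on the sums-based form of Pearson's formula: the sum of the products minus the product of the sums divided by n, over the square root of the two corresponding sums-of-squares terms. A minimal sketch of how that form can be checked against the definition is shown below, using Python's sqlite3 module. The in-memory table reuses the A89AllVariables, Sex, P001 and D036 names from the text, but the rows are made-up values and not China Study data. The sqrt function is registered explicitly, on the assumption that the local SQLite build may not provide it.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# Register sqrt() ourselves, since not every SQLite build ships math functions.
conn.create_function("sqrt", 1, math.sqrt)

# Made-up rows: (Sex, P001 cholesterol mg/dL, D036 % animal protein).
rows = [
    ("F", 120.0, 5.0), ("F", 135.0, 12.0), ("F", 128.0, 9.0), ("F", 150.0, 20.0),
    ("M", 118.0, 6.0), ("M", 140.0, 15.0), ("M", 125.0, 8.0), ("M", 155.0, 22.0),
]
conn.execute("CREATE TABLE A89AllVariables (Sex TEXT, P001 REAL, D036 REAL)")
conn.executemany("INSERT INTO A89AllVariables VALUES (?, ?, ?)", rows)

# Same sums-based formula as the query in the text, one result per sex.
QUERY = """
SELECT Sex,
       (SUM(P001 * D036) - SUM(P001) * SUM(D036) / COUNT(P001))
       / sqrt((SUM(P001 * P001) - SUM(P001) * SUM(P001) / COUNT(P001))
            * (SUM(D036 * D036) - SUM(D036) * SUM(D036) / COUNT(D036))) AS Correlation
FROM A89AllVariables
GROUP BY Sex
"""
sql_r = dict(conn.execute(QUERY).fetchall())


def pearson_r(xs, ys):
    """Pearson's r from deviations about the means, as an independent check."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


direct_r = {}
for sex in ("F", "M"):
    xs = [p for s, p, d in rows if s == sex]
    ys = [d for s, p, d in rows if s == sex]
    direct_r[sex] = pearson_r(xs, ys)
    # The two columns should agree to many decimal places.
    print(sex, round(sql_r[sex], 7), round(direct_r[sex], 7))
```

The sums-based form suits SQL because it needs only aggregate functions in a single pass, although it can lose precision when the values are large relative to their spread.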