GER1000 : Chap 2 Association
Quick Summary on Chapter 2:
1. Association does not mean Causation
If a graph shows two variables moving together, it does not mean that one causes the other.
E.g. the trend in margarine consumption per household closely tracks the divorce rate. You cannot, based on this graph alone, deduce that the divorce rate is increased by having margarine in one's house.
1) Deterministic relationship
The value of one variable can be determined exactly from the value of the other variable.
e.g. Celsius and Fahrenheit (linked by the formula F = 1.8C + 32)
2) Statistical relationship
Natural variability exists in the measurements of the two variables. Only the average pattern of one variable can be described given the value of the other variable.
2. A scatter diagram can be used to assess correlation
Standard deviation is used to describe the spread or variability of data around the average
We can calculate the standard unit (SU) for each variable.
We can also calculate the Correlation Coefficient (R).
- Nature of relationship
Whether the pattern in the scatter diagram is linear or non-linear
- Direction of relationship
Positive or negative, given by the sign of the gradient of the line
- Strength of relationship
Strong: the points cluster closely around the line
Weak: the points scatter loosely around the line
Correlation Coefficient (r)
r always lies between -1 and 1
Values near 0 indicate the weakest linear association; values near -1 or 1 the strongest
3. Calculation of Standard Unit
Take the value of the selected variable and subtract the average,
then divide the result by the standard deviation (SD)
To calculate the standard deviation of a list of numbers:
- 1. Work out the mean (the simple average of the numbers)
- 2. For each number: subtract the mean and square the result
- 3. Work out the mean of those squared differences
- 4. Take the square root of the result
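A minimal Python sketch of these steps, using made-up height data (note this follows steps 1-4, i.e. the population SD that divides by n; a sample SD divides by n - 1):

```python
def population_sd(values):
    """Standard deviation via steps 1-4 above."""
    mean = sum(values) / len(values)                    # step 1: the mean
    squared_diffs = [(v - mean) ** 2 for v in values]   # step 2: squared differences
    variance = sum(squared_diffs) / len(squared_diffs)  # step 3: mean of those squares
    return variance ** 0.5                              # step 4: square root

def standard_units(values):
    """Standard unit: (value - average) / SD."""
    mean = sum(values) / len(values)
    sd = population_sd(values)
    return [(v - mean) / sd for v in values]

heights = [160, 165, 170, 175, 180]  # hypothetical data
print(standard_units(heights))       # symmetric data gives SUs symmetric around 0
```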
4. Calculation of r
- Multiply each value's SU by the SU of its paired value from the other variable
- Sum the products
- Average them
Steps for Calculating r
We will begin by listing the steps in the calculation of the correlation coefficient. The data we are working with are paired data, each pair of which is denoted by (xi, yi).
- We begin with a few preliminary calculations. The quantities from these calculations will be used in subsequent steps:
- Calculate x̄, the mean of all of the first coordinates xi.
- Calculate ȳ, the mean of all of the second coordinates yi.
- Calculate sx, the sample standard deviation of all of the first coordinates xi.
- Calculate sy, the sample standard deviation of all of the second coordinates yi.
- Use the formula (zx)i = (xi – x̄) / sx to calculate a standardized value for each xi.
- Use the formula (zy)i = (yi – ȳ) / sy to calculate a standardized value for each yi.
- Multiply corresponding standardized values: (zx)i(zy)i
- Add the products from the last step together.
- Divide the sum from the previous step by n – 1, where n is the total number of points in our set of paired data. The result of all of this is the correlation coefficient r.
# In Excel, r can be computed with the CORREL(A:A, B:B) function
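The steps above translate directly to Python (hypothetical data; the sum of products is divided by n - 1 as in the final step):

```python
import statistics

def correlation(xs, ys):
    """r = sum of products of standard units, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample SD (divides by n - 1)
    zx = [(x - x_bar) / s_x for x in xs]   # standardized x values
    zy = [(y - y_bar) / s_y for y in ys]   # standardized y values
    products = [a * b for a, b in zip(zx, zy)]
    return sum(products) / (n - 1)

xs = [1, 2, 3, 4, 5]   # made-up paired data
ys = [2, 4, 5, 4, 5]
print(correlation(xs, ys))  # about 0.77, a fairly strong positive association
```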
5. What affects r:
r will not be affected by
- Interchanging the two variables
- Adding a number to all values of a variable
- Multiplying all values of a variable by a positive number
# Do not drop outliers from the data without justification
# r measures only linear association
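A quick Python check of these invariance properties on made-up data (the last line also shows that multiplying by a negative number flips the sign of r):

```python
import statistics

def pearson_r(xs, ys):
    """Pearson correlation via products of deviations."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)

xs = [1, 3, 4, 6, 8]   # hypothetical paired data
ys = [2, 3, 5, 8, 9]
base = pearson_r(xs, ys)

assert abs(pearson_r(ys, xs) - base) < 1e-9                     # interchange the variables
assert abs(pearson_r([x + 100 for x in xs], ys) - base) < 1e-9  # add a constant
assert abs(pearson_r([3 * x for x in xs], ys) - base) < 1e-9    # multiply by a positive number
assert abs(pearson_r([-x for x in xs], ys) + base) < 1e-9       # a NEGATIVE multiplier flips the sign
```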
Impact of outliers:
The further the outlier is from the line, the greater its impact on the correlation.
Such points are called influential outliers. In the lecture example, with one influential outlier included, r is -0.75, a fairly strong negative correlation; after removing it, r is only 0.01, which is close to 0 and indicates a very weak association.
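A made-up numerical illustration of an influential outlier (the specific r values here are hypothetical, not those from the notes):

```python
import statistics

def pearson_r(xs, ys):
    """Pearson correlation via products of deviations."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)

# A loose cluster with almost no linear association
xs = [1, 2, 3, 4, 5]
ys = [3, 1, 4, 2, 3]
print(pearson_r(xs, ys))                 # weak (about 0.14)

# Add one influential outlier far from the cluster
print(pearson_r(xs + [20], ys + [-15]))  # strongly negative (about -0.97)
```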
6. Ecological Correlation
Ecological correlation is computed based on aggregated data (group averages or rates), not on the individuals in the data set.
In general, when the associations for both individuals and aggregates are in the same direction, the ecological correlation, based on aggregates, will typically overstate the strength of the association in individuals. That is because the variability among individuals is eliminated once we use the group aggregates.
We can use ecological correlation to examine, for example, the association between average exposures to a risk factor across various countries and the corresponding outcomes.
The ecological fallacy tells us that it is not appropriate to draw conclusions about individuals based on aggregated data, while the atomistic fallacy shows that the correlation observed among individuals may not apply to aggregated data.
In the illustrating graph, each of the four groups shows a negative linear association on its own, but when all the groups are plotted together, the overall association is positive.
Ecological correlation ≠ correlation based on individuals
We cannot assume that a correlation based on aggregates will hold for individuals.
e.g. since CS students as a group are smart, concluding that a particular CS student is smart
[Atomistic fallacy]
The opposite of the ecological fallacy.
Suppose there are 3 groups of individuals, and there is a positive linear association between the variables within each of the 3 groups.
DO NOT generalise to an aggregate-level correlation from the correlation observed among individuals, as no clear correlation of the aggregates can be inferred from the diagram.
The fallacy is extending a trend seen at the individual level to everything and everyone else.
e.g. at a personal level, the more time a student spends on the internet, the lower the grades (time online can be spent playing Bandori or watching YouTube). But across countries, some countries in Africa with little internet usage still have worse grades than Singapore.
In short: do not generalise a trend to aggregates based on observing a similar trend in separate small groups of individuals.
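The contrast between the two levels can be seen with three made-up groups where the within-group association is perfectly negative but the correlation of the group averages is perfectly positive:

```python
import statistics

def pearson_r(xs, ys):
    """Pearson correlation via products of deviations."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)

# Three hypothetical groups: within each, the association is perfectly NEGATIVE
groups = [
    ([0, 1, 2],    [2, 1, 0]),
    ([5, 6, 7],    [7, 6, 5]),
    ([10, 11, 12], [12, 11, 10]),
]
for gx, gy in groups:
    print(pearson_r(gx, gy))        # -1.0 within every group

# The ecological correlation uses only the group averages -> perfectly POSITIVE
mean_xs = [statistics.mean(gx) for gx, _ in groups]
mean_ys = [statistics.mean(gy) for _, gy in groups]
print(pearson_r(mean_xs, mean_ys))  # 1.0 on the aggregates
```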
7. Attenuation effect
Restricting the range of a variable affects r.
Imagine a scatter plot shaped like an ellipse.
When we chop the plot into a smaller section (limiting the range of one variable),
the remaining section looks more like a circle,
and a circular cloud corresponds to an r closer to 0.
This happens because the range is restricted, not merely because the sample size shrinks.
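A small simulation of the attenuation effect (made-up data generated with a fixed seed; the slope, noise level, and band limits are illustrative):

```python
import random
import statistics

def pearson_r(xs, ys):
    """Pearson correlation via products of deviations."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)

random.seed(42)  # reproducible simulation
xs = [random.uniform(0, 100) for _ in range(500)]
ys = [0.5 * x + random.gauss(0, 10) for x in xs]  # linear trend plus noise

full_r = pearson_r(xs, ys)

# Restrict the range of x to a narrow band (40..60)
band = [(x, y) for x, y in zip(xs, ys) if 40 <= x <= 60]
bx = [x for x, _ in band]
by = [y for _, y in band]
restricted_r = pearson_r(bx, by)

print(full_r, restricted_r)  # the restricted r is much closer to 0
```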
8. Linear regression
The regression line has the form y = ax + b.
In Excel, a and b can be found with the SLOPE and INTERCEPT functions.
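A sketch of the least-squares calculation behind Excel's SLOPE and INTERCEPT, using the fact that the slope equals r times (SD of y) / (SD of x):

```python
import statistics

def slope_intercept(xs, ys):
    """Least-squares line y = a*x + b (what Excel's SLOPE/INTERCEPT compute)."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    r = sum((x - x_bar) * (y - y_bar)
            for x, y in zip(xs, ys)) / ((len(xs) - 1) * s_x * s_y)
    a = r * s_y / s_x        # slope: r times the ratio of the SDs
    b = y_bar - a * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return a, b

# Made-up points that lie exactly on y = 2x + 1
a, b = slope_intercept([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # slope ~ 2, intercept ~ 1
```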
9. Data removal
Removal of data from the analysis may change the whole picture of the analysis and result in a different or even opposite inference.
-> The analysis can be skewed
10. Regression fallacy
Based on the regression line, which is the line of best fit.
Excel (note the y-range comes first):
INTERCEPT(known_ys, known_xs)
SLOPE(known_ys, known_xs)
If one measurement of a variable is extreme, the paired measurement of the other variable will tend to be closer to the average (regression to the mean).
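A simulation of regression to the mean on hypothetical standardized data with an assumed correlation of 0.5: among pairs whose x measurement is extreme, the average y sits much closer to 0 (the average):

```python
import random
import statistics

random.seed(0)
r = 0.5  # assumed correlation between the two standardized variables

# Generate pairs (x, y) already in standard units with correlation r
pairs = []
for _ in range(10000):
    x = random.gauss(0, 1)
    y = r * x + (1 - r ** 2) ** 0.5 * random.gauss(0, 1)
    pairs.append((x, y))

# Select the pairs whose x measurement is extreme (more than 1.5 SU above average)
extreme = [(x, y) for x, y in pairs if x > 1.5]
mean_x = statistics.mean(x for x, _ in extreme)
mean_y = statistics.mean(y for _, y in extreme)
print(mean_x, mean_y)  # mean_y is much closer to 0 than mean_x
```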
11. Simpson's paradox
A trend that appears in different groups of data but disappears or reverses when these groups are combined.
The paradox can be resolved when causal relations are appropriately taken care of in the statistical modeling.
An example:
UC Berkeley gender bias
One of the best-known examples of Simpson's paradox is a study of gender bias among graduate school admissions to University of California, Berkeley. The admission figures for the fall of 1973 showed that men applying were more likely than women to be admitted, and the difference was so large that it was unlikely to be due to chance.[14][15]
| Gender | Applicants | Admitted |
| --- | --- | --- |
| Men | 8442 | 44% |
| Women | 4321 | 35% |
But when examining the individual departments, it appeared that six out of 85 departments were significantly biased against men, whereas only four were significantly biased against women. In fact, the pooled and corrected data showed a "small but statistically significant bias in favor of women".[15] The data from the six largest departments are listed below, the top two departments by number of applicants for each gender italicised.
| Department | Men: Applicants | Men: Admitted | Women: Applicants | Women: Admitted |
| --- | --- | --- | --- | --- |
| A | *825* | 62% | 108 | 82% |
| B | *560* | 63% | 25 | 68% |
| C | 325 | 37% | *593* | 34% |
| D | 417 | 33% | 375 | 35% |
| E | 191 | 28% | *393* | 24% |
| F | 373 | 6% | 341 | 7% |
The research paper by Bickel et al.[15] concluded that women tended to apply to competitive departments with low rates of admission even among qualified applicants (such as in the English Department), whereas men tended to apply to less-competitive departments with high rates of admission among the qualified applicants (such as in engineering and chemistry).
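The reversal can be checked numerically from the six-department table (the admit rates are rounded percentages, so the combined totals are approximate):

```python
# Admission figures for the six largest departments (applicants, admit rate),
# taken from the table above; rates are rounded, so totals are approximate.
depts = {
    "A": (825, 0.62, 108, 0.82),
    "B": (560, 0.63, 25, 0.68),
    "C": (325, 0.37, 593, 0.34),
    "D": (417, 0.33, 375, 0.35),
    "E": (191, 0.28, 393, 0.24),
    "F": (373, 0.06, 341, 0.07),
}

men_apps = sum(m for m, _, _, _ in depts.values())
men_admits = sum(m * rm for m, rm, _, _ in depts.values())
women_apps = sum(w for _, _, w, _ in depts.values())
women_admits = sum(w * rw for _, _, w, rw in depts.values())

print(f"Men overall:   {men_admits / men_apps:.0%}")      # about 45%
print(f"Women overall: {women_admits / women_apps:.0%}")  # about 30%

# Yet department by department, women have the higher admit rate in 4 of 6
better_for_women = [d for d, (_, rm, _, rw) in depts.items() if rw > rm]
print(better_for_women)  # ['A', 'B', 'D', 'F']
```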