STA 108

Homework 7

Due: Mar 6th 2020 (by 2:10 pm)

Please clearly print your name, course name, section number, student ID number, and the names of students with whom you have discussed homework problems at the top of the first page of your homework, as follows.

Name

Course STA108 Section

Student ID#

Study group members

Additional requirements can be found in the course syllabus.

Question 1: Binary covariate and interactions (50 pts)

A scientist wants to study the factors affecting length of hospital stay (y) and identified two covariates: age of patient (x1) and severity of disease (x2, equal to 1 if the disease is life-threatening and 0 otherwise). Since there might be an interaction effect (denoted as x3 = x1 × x2), we consider the following regression model

y = β0 + β1 x1 + β2 x2 + β3 x3 + ε.

The scientist wants to decide whether the severity of disease is needed in the model and proposes to test H0: β2 = β3 = 0 vs. Ha: β2 ≠ 0 or β3 ≠ 0. In order to test this hypothesis, the scientist fits two models. The first model considers 100 patients with severe diseases, with the R output as below.

STA 108: HW 7

1

UC Davis

The second model considers 100 patients without severe diseases, with the R output as below.

1. (20 pts) Based on the information provided, can we test the above hypothesis at the 0.05 significance level? If so, carry out the test. If not, explain.

2. (30 pts) Describe how you will test this hypothesis if you have access to the data.

Question 2: Puzzling p-values (40 pts)

When analyzing a set of data, you run into an R output as follows. Given your eight weeks of training in statistics, you sense something strange in this result….


1. (20 pts) Point out what is strange in this R output, and explain how this could

happen.

2. (20 pts) Will this phenomenon become a problem? Explain your answer.

Question 3: Testing linear combinations of parameters (10 pts)

Suppose that we have fit a linear regression, for i = 1, . . . , n,

yi = β0 + β1 xi,1 + β2 xi,2 + εi,

where ε1, ε2, . . . , εn are i.i.d. errors with E[εi] = 0 and var(εi) = σ². Describe how you will test H0: β1 − 2β2 = 0 vs. Ha: β1 − 2β2 ≠ 0.


STA 108 Applied Statistical Methods:

Regression Analysis

Multiple Linear Regression

Shizhe Chen, PhD

Winter 2020

1

Important Instructions

I The first four sections (motivation, model, inference, model selection) will be taught as before
I The rest of the slides are provided as reading materials
I Read the slides before class at your own pace
I Bring questions to lectures!
I Learn to use online resources for self-learning
I ≤ 10 pts in the final exam

2

Outline

I Motivation

I Multiple Linear Regression Model

I Statistical Inference

I Model selection

I Least Squares Estimation

I Generalized Least Squares

3

Motivation

4

New Requests from Clients

So far, we have successfully answered the following questions.

1. Are sales positively associated with TV advertising budgets?

2. What will the amount of sales be if the TV advertising budget

is 20 (in thousands of dollars)?

However, our clients suddenly realize that there are more avenues

for advertisement than just on TV.

5

Exploratory Data Analysis

(See Section 3.1 in Code MultipleLinearRegression.html.)

6

New Questions of Interest

1. Which advertising budgets (TV, radios, or newspaper) are

associated with sales?

2. What is the mean difference of sales per unit difference in TV

advertising budget when other budgets are held constant?

3. What is the expected sales given a fixed combination of

budgets?

4. How much will sales increase by increasing the TV advertising

budget by 20 (thousand of dollars)?

5. What is the maximum expected sales given a total budget of

100 (thousand of dollars)?

7

New Questions of Interest

1. Which advertising budgets (TV, radios, or newspaper) are

associated with sales? (Linear regression)

2. What is the mean difference of sales per unit difference in TV

advertising budget when other budgets are held

constant?(Linear regression)

3. What is the expected sales given a fixed combination of

budgets? (Linear regression)

4. How much will sales increase by increasing the TV advertising

budget by 20 (thousand of dollars)? (Causal inference)

5. What is the maximum expected sales given a total budget of

100 (thousand of dollars)? (Optimization)

8

Multiple Linear Regression Model

9

Multiple Linear Regression Model

The multiple linear regression model takes the form

y = β0 + x1 β1 + . . . + xp βp + ε,

where
I y ∈ R is the real-valued response
I xj ∈ R is the jth covariate
I β0 is the intercept term
I βj is the regression slope for the jth covariate
I ε ∈ R is the error term with E[ε] = 0 and var(ε) = σ²

Note: The term “linear” refers to the fact that the mean is a

linear function of the unknown parameters β0 , . . . , βp .

10

With n Observations

With n observations of y and x1, . . . , xp, the complete model becomes

y1 = β0 + x11 β1 + x12 β2 + · · · + x1p βp + ε1
y2 = β0 + x21 β1 + x22 β2 + · · · + x2p βp + ε2
⋮
yn = β0 + xn1 β1 + xn2 β2 + · · · + xnp βp + εn,

where the error terms are assumed to have the following properties
I E[εi] = 0
I var(εi) = σ² (constant for all i)
I cov(εi, εj) = 0 for i ≠ j

Note: Sometimes stronger assumptions are imposed, such as that the εi's are i.i.d. with mean 0 and variance σ².

11

Interpretation of Multiple Linear Regression

In a multiple linear regression model y = β0 + x1 β1 + x2 β2 + . . . + xp βp + ε, β1 is the expected mean difference in y per unit difference in x1 if x2, . . . , xp are held constant (or adjusting/controlling for x2, . . . , xp).

Example: Suppose that y is the systolic blood pressure of newborns, x1 is days of age, and x2 is the weight at birth in ounces. We say that

I β1: We estimate that two groups of newborns with the same weight at birth who differ by one day of age will have systolic blood pressure that differs on average by 5.89 mm Hg (95% CI: 4.42, 7.36).
I β2: We estimate that two groups of newborns with the same age who differ by one ounce at birth will have systolic blood pressure that differs on average by 0.13 mm Hg (95% CI: 0.05, 0.20).

12

Interpretation of Multiple Linear Regression

In a multiple linear regression model y = β0 + x1 β1 + x2 β2 + . . . + xp βp + ε, β1 is the expected mean difference in y per unit difference in x1 if x2, . . . , xp are held constant (or adjusting/controlling for x2, . . . , xp).

Note: Interpretation of parameters might not make sense in the

multiple linear regression!

13

Interpretation of Multiple Linear Regression: Special Cases

In y = β0 + x1 β1 + x2 β2 + . . . + xp βp + ε, what if
I x3 = x1 × x2?
The effect of x1 on y differs depending on the value of x2.
Example: When comparing two groups of newborns that differ by one day of age and have the same birth weight, the difference in mean systolic blood pressure depends on the babies' birth weight, decreasing by 0.13 mm Hg for each ounce difference in birth weight.
I x2 = x1³?
x1 is constant if x2 is held constant. Interpret all the terms that depend on x1!
I x1, . . . , xp are dummy variables for a categorical variable z that has p + 1 categories?
x1 = 1 means x2 = · · · = xp = 0. How will you interpret β1?

14

Multiple Linear Regression Model: Categorical Covariates

Consider the question in Homework #1, where we coded xi as 0 or 1 to distinguish between ducks and pandas.

I What if we found out that there are actually red pandas and

raccoons in the data set?

I xi = 0, 1, 2, 3 for ducks, pandas, red pandas, or raccoons?

I xi1 = 1 for a panda, xi2 = 1 for a red panda, xi3 = 1 for a

raccoon, and zero otherwise!

15

Multiple Linear Regression Model: Categorical Covariates

(cont.)

Create K − 1 dummy variables for a categorical variable with K

categories

Also known as the ANalysis Of VAriance, a.k.a., ANOVA

16

Multiple Linear Regression Model: Polynomial Regression1

This slide is incomplete. Take notes!

Consider a true (and unknown!) model where y is non-linear in x:

y = (x − 3)⁴ + ε ⟺ y = x⁴ − 12x³ + 54x² − 108x + 81 + ε

Suppose that we have n observations of x and y. Can we learn the above model using linear regression?

¹Check out Wolfram Alpha

17
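The course's companion code is in R; purely as an illustration (not the course's code), here is a numpy sketch showing that the quartic model, although nonlinear in x, is linear in the coefficients and can be fit by ordinary least squares on the powers of x. Noise is omitted so the recovery is exact; in practice an ε term would be added.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 6.0, size=200)
y = (x - 3.0) ** 4            # noiseless for an exact, deterministic check

# Design matrix with columns 1, x, x^2, x^3, x^4:
# the model is nonlinear in x but linear in the unknown coefficients.
X = np.vander(x, N=5, increasing=True)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Expansion: (x - 3)^4 = 81 - 108x + 54x^2 - 12x^3 + x^4
```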

Building a Linear Model

Suppose that you are interested in studying the relationship

between y and x1 , and you have the resources to collect data (via

experiments or surveys). How will you build a linear model?

y = β0 + x1 β1 + . . . + xp βp +

Consider the following scenarios

I y is the body weight and x1 is the length of sleep per day

I y is the lung function (measured by forced expiratory volume,

or FEV) and x1 is a dummy variable for smoking

I y is the occurrence of a heart attack and x1 is a dummy variable for depression

18

Classification of Variables

This slide is incomplete. Take notes!

I Variable of interest (or exposure, treatment, etc.)

I Response variable (or outcome)

I Confounder

I Effect modifier

I Precision variable

I Instrument

19

Confounding

I Confounding is an effect of some uncontrolled variable on the

response variable that hinders interpretation of the relationship

between the response and the predictor variable of interest

I Confounding describes real or imagined effects that distort the

relationship one wishes to observe between the predictor

variable and response

I More of a problem for observational studies

I in contrast to a designed experiment

Examples: effects of smoking on lung function (measured by forced

expiratory volume, FEV) may be confounded by age

20

Controlling for Confounding

I Implicitly with appropriate study designs

I Explicitly by measuring it and including it in the model

21

Controlling for Confounding: Design

I Match the observations that are similar in terms of

confounding variables (confounders), e.g., comparing the FEV

between smokers and non-smokers of the same age

I Relatively easy to implement

I Infeasible when there are too many confounders

I Conduct a randomized experiment, e.g., randomly assign

participants to the smoking group or the non-smoking group

I Destroy all confounding possibilities (grant causality)

I Infeasible in many cases

22

Controlling for Confounding: Model-Based

Using knowledge in this class, we can
I Fit the unadjusted model
y = β0 + x1 β1 + ε
I Fit the adjusted model
y = β0′ + x1 β1′ + x2 β2′ + ε
I Compare the fitted values of β̂1 and β̂1′
I Eyeballing
I Hypothesis testing

You can also use other advanced statistical methods to control for confounders, e.g., propensity scores (for measured confounders) or instrumental variables (for unmeasured confounders).

23

Effect Modifier

I Variable that modifies the effect (or association) of the

variable of interest on the response

I Modeling approaches

I Stratify analysis (for categorical variables)

I Multiple linear regression with an interaction term

y = β0 + x1 β1 + x2 β2 + x1 x2 β3 + ε

24

Precision Variable

I Variable that only affects the response variable

I Improve the precision of the model fits if included

25

Statistical Inference

26

Inferences about Multiple β̂j s

Assume that q < p and we want to test whether a reduced (smaller) model is sufficient:
H0 : βq+1 = · · · = βp = 0 vs. Ha : at least one of βq+1, . . . , βp ≠ 0
Compare the residual sum of squares eᵀe for the full and reduced (smaller) models:
(a) Full model:
yi = β0 + Σ_{j=1}^{q} βj xij + Σ_{k=q+1}^{p} βk xik + εi
(b) Reduced model:
yi = β0 + Σ_{j=1}^{q} βj xij + Σ_{k=q+1}^{p} 0 × xik + εi
27
Inferences about Multiple β̂j ’s - cont
This slide is incomplete. Take notes!
Test Statistic:

F = { [L(β̂reduced) − L(β̂full)] / (dfreduced − dffull) } / { L(β̂full) / dffull }
  = { [L(β̂reduced) − L(β̂full)] / [(n − q − 1) − (n − p − 1)] } / { L(β̂full) / (n − p − 1) }
  ∼ F(p−q, n−p−1),

where
I L(β̂reduced) is the residual sum of squares for the reduced model
I L(β̂full) is the residual sum of squares for the full model
I dfreduced = n − q − 1 is the error degrees of freedom for the reduced model
I dffull = n − p − 1 is the error degrees of freedom for the full model
28
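The course demonstrates this test in R; as an illustrative sketch only (variable names are mine), the nested-model F statistic can be computed in numpy/scipy on simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p, q = 100, 3, 1           # full model: 3 covariates; reduced: first 1
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, 0.0, 0.0]) + rng.normal(size=n)

def rss(Xm, y):
    """Residual sum of squares L(beta_hat) for a least squares fit."""
    beta = np.linalg.lstsq(Xm, y, rcond=None)[0]
    r = y - Xm @ beta
    return r @ r

L_full = rss(X, y)                  # error df = n - p - 1
L_red = rss(X[:, : q + 1], y)       # error df = n - q - 1
F = ((L_red - L_full) / (p - q)) / (L_full / (n - p - 1))
p_value = stats.f.sf(F, p - q, n - p - 1)
```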
Testing Linear Combinations of Parameters
This slide is incomplete. Take notes!
What if we want to test
H0 : 2β1 + β3 = 0 vs. Ha : 2β1 + β3 ≠ 0
I Transform the data to fit a new linear regression
I Use the Wald test
I Estimators follow a multivariate normal distribution
(asymptotically)
I Need to know the covariance
29
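The Wald approach above can be sketched numerically (this is my own numpy illustration, not the course's R code): estimate β̂ and its covariance σ̂²(XᵀX)⁻¹, then form the t-type statistic cᵀβ̂ / se(cᵀβ̂) for c = (0, 2, 0, 1)ᵀ.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([0.5, 1.0, 0.3, -2.0])   # satisfies 2*b1 + b3 = 0
y = X @ beta_true + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat
sigma2_hat = e @ e / (n - p - 1)

c = np.array([0.0, 2.0, 0.0, 1.0])            # tests c^T beta = 2*b1 + b3
se = np.sqrt(sigma2_hat * c @ XtX_inv @ c)    # needs the full covariance
t_stat = (c @ beta_hat) / se
p_value = 2 * stats.t.sf(abs(t_stat), df=n - p - 1)
```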
Multiple Linear Regression with R
(See Section 3 in Code MultipleLinearRegression.html.)
30
Model Selection
31
Constructing Statistical Models
I Use domain knowledge
I Use statistical methods for model selection
32
Selecting an appropriate model
Model selection has many meanings
I It can mean selecting covariates and how they are to be
included in the model
known as feature selection, and sometimes feature
generation
I It can mean choosing between a regression tree (random
forest), neural network (deep learning), or logistic regression
I It can mean selecting a λ-value in the lasso
known as tuning parameter selection
We will focus on the feature/variable selection
33
Trade-off on Complexity
There are two keys for a method to do well on your data:
I It doesn’t completely miss important structure
I It admits a proper amount of complexity for the number of
observations in your data.
34
The “proper” amount of complexity
How can we tell if we are choosing a model that is...
sufficiently complex (to capture the signal)...
but not overly complex! (such that we overfit our training data)?
²It's also discussed in the notes.
35
The “proper” amount of complexity
How can we tell if we are choosing a model that is...
sufficiently complex (to capture the signal)...
but not overly complex! (such that we overfit our training data)?
The proper degree of complexity is extremely data-dependent
I Information criteria
I Loss: residual sum of squares (or likelihood) evaluated using
cross-validation
Task: Search and read about overfitting²
²It's also discussed in the notes.
35
Information Criteria
Information criterion measures the trade-off between the goodness
of fit and model complexity. In general, the information criterion
admits the following form:
information criterion = loss in model fitting + penalty on model
complexity
I Select a model that minimizes the chosen information
criterion.
I A good information criterion needs to carefully balance the trade-off between model fitting and model complexity
I Two famous information criteria: AIC and BIC
36
Akaike Information Criterion (AIC)
Formulated by Japanese statistician Hirotugu Akaike in 1973.
AIC = 2k + n log(RSS/n)  or  AIC = 2k − 2 log Lk
I Loss of model fitting: log of residual sum of squares (for linear
regression), or two times negative log-likelihood (in general)
I Penalty of model complexity: two times the number of
variables k.
When the sample size n is large, AIC penalizes model complexity relatively lightly (its penalty does not grow with n), so it tends to select larger models than BIC.
37
Bayesian Information Criterion (BIC)
BIC was developed by statistician Gideon E. Schwarz in 1978, and is therefore also known as the Schwarz information criterion (SIC).

BIC = log(n)k + n log(RSS/n)  or  BIC = log(n)k − 2 log Lk
I Loss of model fitting: log of residual sum of squares (for linear
regression), or two times negative log-likelihood (in general)
I Penalty of model complexity: log(n) times the number of
variables
38
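As a hedged numpy sketch (names are mine), both criteria can be computed from a least squares fit using the RSS formulas above; adding pure-noise covariates lowers the RSS slightly but pays a larger penalty.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)    # true model uses only x

def ic(Xm, y):
    """Return (AIC, BIC) with k = number of columns and loss n*log(RSS/n)."""
    n = len(y)
    beta = np.linalg.lstsq(Xm, y, rcond=None)[0]
    r = y - Xm @ beta
    k = Xm.shape[1]
    loss = n * np.log(r @ r / n)
    return 2 * k + loss, np.log(n) * k + loss

X1 = np.column_stack([np.ones(n), x])                  # correct model
X2 = np.column_stack([X1, rng.normal(size=(n, 5))])    # + 5 noise covariates

aic1, bic1 = ic(X1, y)
aic2, bic2 = ic(X2, y)
```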
Model Selection
Choosing a proper complexity for your model
I Information criteria
I Loss function: residual sum of squares (or likelihood)
evaluated using cross-validation
39
Split-Sample Validation
Q: What is overfitting?
Ideally, we wish to have two datasets, where we can
I explore all possible models on the training data
I evaluate/confirm our findings on the test data
In practice, often times we only have one dataset....
so we split the dataset into a training set and a test set.
40
How to Split?
What proportion should be used for training vs testing/evaluating?
Often something like 2/3 training - 1/3 testing is good.
This still feels like inefficient data use.
Is there some way to use the majority of the data for both training
and testing?
41
Cross-validation - I
Let’s use cross-validation:
I Partition our data into multiple folds...
I Each time use 1 fold as test, and all other folds as training
42
K-fold Cross-validation
(Figure: schematic of k-fold cross-validation; in each of the k rows, one fold is held out as the test set and the remaining folds are used for training.)
43
Leave-one-out Cross-validation
44
Cross-validation II
Procedure for a k-fold validation
1. Randomly split the dataset into k non-overlapping subsets
(folds)
2. For i = 1, . . . , k,
2.1 Fit the chosen model on the data excluding the ith fold
2.2 Evaluate the loss of the fitted model on the ith fold, denoted as Li
3. The final loss is (Σ_{i=1}^{k} Li) / k.
45
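The k-fold procedure above can be sketched in a few lines of numpy (an illustration under my own simulated data, not the course's R code):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 100, 5
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

idx = rng.permutation(n)
folds = np.array_split(idx, k)            # k non-overlapping subsets

losses = []
for i in range(k):
    test = folds[i]
    train = np.concatenate([folds[j] for j in range(k) if j != i])
    # Fit on all folds except the ith...
    X_tr = np.column_stack([np.ones(train.size), x[train]])
    beta = np.linalg.lstsq(X_tr, y[train], rcond=None)[0]
    # ...and evaluate the loss L_i on the held-out ith fold.
    X_te = np.column_stack([np.ones(test.size), x[test]])
    losses.append(np.mean((y[test] - X_te @ beta) ** 2))

cv_loss = np.mean(losses)   # final loss: average of the k fold losses
```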
Model Selection
Choosing a proper complexity for your model
I Information criteria
I Loss function: residual sum of squares (or likelihood)
evaluated using cross-validation
Next, how to select the “best” model?
46
Selection Procedure
I Best subset selection
I Stepwise selection
I Penalization
47
Best Subset Selection
An intuitive procedure:
1. List all possibilities
2. The model with the best criteria (AIC, BIC, CV loss) wins
Algorithm: Best Subset Selection
1. Choose a selection criterion
2. For k = 1, . . . , p,
2.1 Fit all possible models that contain exactly k covariates
2.2 Pick the best among these models (the one with the best
criterion)
2.3 Denote this model as Mk
3. Select a single best model among M1 , . . . , Mp
48
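A minimal numpy sketch of best subset selection (my own illustration, using BIC as the criterion): enumerate all 2^p subsets and keep the one with the smallest criterion value.

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 4
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=n)  # only covariates 0, 2 matter

def bic(cols):
    """BIC of the model with an intercept plus the covariates in cols."""
    Xm = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta = np.linalg.lstsq(Xm, y, rcond=None)[0]
    r = y - Xm @ beta
    return np.log(n) * Xm.shape[1] + n * np.log(r @ r / n)

# Enumerate all 2^p subsets and keep the one with the smallest BIC.
best = min((subset
            for k in range(p + 1)
            for subset in itertools.combinations(range(p), k)),
           key=bic)
```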
Best Subset Selection
An intuitive procedure:
1. List all possibilities
2. The model with the best criteria (AIC, BIC, CV loss) wins
There are a lot of models to fit (2p )!!!
49
Stepwise Selection
Stepwise selection:
I Computationally efficient alternative
I Explore a restricted set of models
No guarantee that it can find the best possible model!
Two directions
I Forward
I Backward
50
Forward Stepwise Selection
Algorithm: Forward Adding
1. Choose a selection criterion
2. Let M0 denote the null model, which contains no predictors
3. For k = 0, . . . , p − 1
3.1 Fit all p − k models by adding one additional variable to Mk
3.2 Pick the best among these p − k models; denote this model as Mk+1
4. Select a single best model among M0 , . . . , Mp using the
chosen criterion.
You may stop early if you use p-values as the criterion...
51
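The forward-adding algorithm can be sketched as follows (again a hedged numpy illustration with BIC as the criterion, not the course's R code); note it fits only O(p²) models instead of 2^p.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = 4.0 * X[:, 1] + rng.normal(size=n)    # only covariate 1 matters

def bic(cols):
    Xm = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta = np.linalg.lstsq(Xm, y, rcond=None)[0]
    r = y - Xm @ beta
    return np.log(n) * Xm.shape[1] + n * np.log(r @ r / n)

models = [[]]                              # M0: the null model
for k in range(p):
    current = models[-1]
    candidates = [j for j in range(p) if j not in current]
    # Add the single variable that most improves the criterion.
    best_j = min(candidates, key=lambda j: bic(current + [j]))
    models.append(current + [best_j])

best_model = min(models, key=bic)          # choose among M0, ..., Mp
```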
Backward Stepwise Selection
Algorithm: Backward Deleting
1. Choose a selection criterion
2. Let Mp denote the full model, which contains all predictors
3. For k = p, p − 1, . . . , 1
3.1 Fit all k models by deleting one variable from Mk
3.2 Pick the best among these k models; denote this model as Mk−1
4. Select a single best model among M0 , . . . , Mp using the
chosen criterion.
You may stop early if you use p-values as the criterion...
(See Section 3.7 in Code MultipleLinearRegression.html.)
52
Penalization
We can approximate the best subset selection using
sparsity-inducing penalties.
Choose β = (β0 , β1 , . . . , βp )T , to minimize
L(β) + λP (β)
53
Lasso
One notable example is the lasso, where P (β) = |β1 | + . . . + |βp |, a function that penalizes complexity in β.
Minimizing L(β) + λP (β) will "magically" give a sparse model (meaning: many β̂j are exactly equal to 0).
To modulate the degree of sparsity, we change λ.
Q: How would you choose λ?
54
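Why does the ℓ1 penalty give exact zeros? In the simplest (orthonormal) case the lasso reduces to soft-thresholding each coefficient; here is a small numpy sketch of that building block (my own illustration, not the course's code): the minimizer of 0.5(z − b)² + λ|b| is sign(z)·max(|z| − λ, 0), which maps small coefficients exactly to 0.

```python
import numpy as np

def soft_threshold(z, lam):
    """Minimizer of 0.5*(z - b)**2 + lam*abs(b): the lasso building block."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([3.0, -0.4, 0.1, -2.5])
shrunk = soft_threshold(z, lam=0.5)
# Coefficients with |z| <= lambda become exactly 0; larger λ, more zeros.
```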
Summary of Model Selection
I When?
I How?
I Why?
55
Least Squares Estimation
56
Linear Model in Matrix Notation
y1 = β0 + x11 β1 + x12 β2 + · · · + x1p βp + ε1
y2 = β0 + x21 β1 + x22 β2 + · · · + x2p βp + ε2
⋮
yn = β0 + xn1 β1 + xn2 β2 + · · · + xnp βp + εn,
where the error terms are assumed to have the following properties: E[εi] = 0, var(εi) = σ², and cov(εi, εj) = 0 for i ≠ j.
y = Xβ + ε,
E[ε] = 0 and var(ε) = σ²I
57
Linear Model in Matrix Notation, Step I
⎡ y1 ⎤   ⎡ 1  x11  x12  · · ·  x1p ⎤ ⎡ β0 ⎤   ⎡ ε1 ⎤
⎢ y2 ⎥ = ⎢ 1  x21  x22  · · ·  x2p ⎥ ⎢ β1 ⎥ + ⎢ ε2 ⎥
⎢ ⋮  ⎥   ⎢ ⋮    ⋮    ⋮    ⋱    ⋮  ⎥ ⎢ ⋮  ⎥   ⎢ ⋮  ⎥
⎣ yn ⎦   ⎣ 1  xn1  xn2  · · ·  xnp ⎦ ⎣ βp ⎦   ⎣ εn ⎦
We write
y = Xβ + ε,
where
I y is the n × 1 response column vector
I X is the n × (p + 1) design matrix
I β is the (p + 1) × 1 coefficient column vector
I ε is the n × 1 random error column vector
58
Linear Model in Matrix Notation, Step II
y = Xβ + ε.
Recall: The error terms are assumed to have the following properties: E[εi] = 0, var(εi) = σ², and cov(εi, εj) = 0 for i ≠ j.
I E[ε] = 0
I var(ε) = σ²I
59
Linear Model in Matrix Notation
y1 = β0 + x11 β1 + x12 β2 + · · · + x1p βp + ε1
y2 = β0 + x21 β1 + x22 β2 + · · · + x2p βp + ε2
⋮
yn = β0 + xn1 β1 + xn2 β2 + · · · + xnp βp + εn,
where the error terms are assumed to have the following properties: E[εi] = 0, var(εi) = σ², and cov(εi, εj) = 0 for i ≠ j.
y = Xβ + ε,
E[ε] = 0 and var(ε) = σ²I
60
Some Notes on Linear Models
y = Xβ + ε
1. Usually the first column of X is the vector of ones, corresponding to the intercept β0
2. The jth column of X, denoted as Xj , is the jth predictor variable for the n observations
3. ε is the random part of the model (y is random because ε is random)
4. E[y] = E[Xβ + ε] = E[Xβ] + E[ε] = Xβ, i.e., E[y] is a linear combination of {Xj : j = 1, . . . , p + 1}
61
Multiple Linear Regression Model: Examples
Consider the model
y = β0 + x1 β1 + ε
and suppose that we observe
(y1 , y2 , y3 , y4 , y5 ) = (1, 4, 3, 8, 9)
(x11 , x21 , x31 , x41 , x51 ) = (0, 1, 2, 3, 4)
Now, represent these data using matrix notation
y = Xβ + ε
62
Multiple Linear Regression Models: Advertising Data
y = Xβ + ε,
E[ε] = 0 and var(ε) = σ²I
In the advertising data, the responses y and the covariate X are
63
Least Squares Estimator for Multiple Linear Regression
Estimates β̂0 , β̂1 , . . . , β̂p are least squares estimators of β0 , β1 , . . . , βp if they minimize

L(β) ≡ Σ_{i=1}^{n} ( yi − β0 − Σ_{j=1}^{p} xij βj )²

Graphical Interpretation: (figure from The Elements of Statistical Learning, Hastie et al.)
64
Linear Model in Matrix Notation
⎡ y1 ⎤   ⎡ 1  x11  x12  · · ·  x1p ⎤ ⎡ β0 ⎤   ⎡ ε1 ⎤
⎢ y2 ⎥ = ⎢ 1  x21  x22  · · ·  x2p ⎥ ⎢ β1 ⎥ + ⎢ ε2 ⎥
⎢ ⋮  ⎥   ⎢ ⋮    ⋮    ⋮    ⋱    ⋮  ⎥ ⎢ ⋮  ⎥   ⎢ ⋮  ⎥
⎣ yn ⎦   ⎣ 1  xn1  xn2  · · ·  xnp ⎦ ⎣ βp ⎦   ⎣ εn ⎦
We write
y = Xβ + ε,
where
I y is the n × 1 response column vector
I X is the n × (p + 1) design matrix
I β is the (p + 1) × 1 coefficient column vector
I ε is the n × 1 random error column vector
65
Least Squares Estimator for Multiple Linear Regression
This slide is incomplete. Take notes!
Solve the least squares problem in matrix notation:

minimize over β:  ‖y − Xβ‖₂² = Σ_{i=1}^{n} ( yi − β0 − Σ_{j=1}^{p} xij βj )²,

where ‖·‖₂ is the ℓ2-norm: ‖y − Xβ‖₂² = (y − Xβ)ᵀ(y − Xβ).
Claim: β̂ = (XᵀX)⁻¹Xᵀy.
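The claimed closed form can be checked numerically; a small numpy sketch (my own illustration) compares (XᵀX)⁻¹Xᵀy against the library least squares solver:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

# Closed-form normal equations solution, (X^T X)^{-1} X^T y
beta_closed = np.linalg.solve(X.T @ X, X.T @ y)
# Numerical least squares as an independent reference
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
```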
66
Simple Linear Regression as a Special Case
This slide is incomplete. Take notes!
Recall: In simple linear regression, we have

β̂1 = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{n} (xi − x̄)²  and  β̂0 = ȳ − x̄ β̂1 .
67
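A quick numpy check (my own illustration) that the matrix formula with X = [1, x] reproduces the textbook simple linear regression formulas:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 40
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(size=n)

# Matrix formula with design matrix X = [1, x]
X = np.column_stack([np.ones(n), x])
b0_mat, b1_mat = np.linalg.solve(X.T @ X, X.T @ y)

# Textbook simple linear regression formulas
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - x.mean() * b1
```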
Properties of the LSE in Matrix Notation
Projection
Residuals
Sampling Distribution
Underfitting and Overfitting
Multicollinearity
68
Understanding Projection
(Figure: a three-dimensional vector in (x, y, z) space.)
69
Understanding Projection
(Figure: projecting the three-dimensional vector onto the (x, y) plane.)
70
Understanding Projection
(Figure: projecting the three-dimensional vector onto the y axis.)
71
Understanding Regression via Projection
Let x0 be a vector of ones. Then, we have

Xβ = (x0  x1  x2  . . .  xp) (β0 , β1 , . . . , βp)ᵀ
   = x0 β0 + x1 β1 + x2 β2 + · · · + xp βp  ∈ R(X)

Here R(X) is the subspace spanned by the columns of X.
72
Orthogonal Projection onto columns of X
Theorem: The observation vector y can be uniquely decomposed
as y = ŷ + e, where
ŷ ∈ R(X), e ∈ R(X)⊥ ,
with
R(X)⊥ = orthogonal complement of R(X)
In other words, if two vectors a ∈ R(X) and b ∈ R(X)⊥ , we have
aT b = 0.
73
Properties of the LSE in Matrix Notation
Projection X
Residuals
Sampling Distribution
Underfitting and Overfitting
Multicollinearity
74
Residuals
Again, residuals can be interpreted as what's left over after projecting y onto the column space of X.
Definition: The residual vector is
e = y − ŷ = y − Xβ̂
Definition: The residual sum of squares is defined as
eᵀe = Σ_{i=1}^{n} ei² = (y − Xβ̂)ᵀ(y − Xβ̂)
75
Hat Matrix and Fitted Values
Definition: Let ŷ = X β̂ = P y denote the fitted values of y,
where
P = X(X T X)−1 X T .
Then P is called the hat matrix. The residuals can be rewritten as
e = y − ŷ = y − P y = (I − P )y.
Interpretation: P is a projection matrix that projects y onto
R(X).
76
Properties of Projection Matrix
The projection matrix P satisfies the following properties:
I P is the projection matrix onto R(X).
I I − P is the projection matrix onto R(X)⊥ .
I PX = X
I (I − P )X = 0
I Projection matrices are idempotent, i.e.,
(I − P )(I − P ) = (I − P ) and P P = P .
I P (I − P ) = 0.
77
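These properties of the hat matrix are easy to verify numerically; a numpy sketch (my own illustration) builds P = X(XᵀX)⁻¹Xᵀ and checks idempotence, PX = X, (I − P)X = 0, and P(I − P) = 0:

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
P = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix

I = np.eye(n)
checks = [
    np.allclose(P @ P, P),                # idempotent
    np.allclose(P @ X, X),                # P X = X
    np.allclose((I - P) @ X, 0),          # (I - P) X = 0
    np.allclose(P @ (I - P), 0),          # P (I - P) = 0
]
```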
Sum of Squares
Recall: The three sums of squares are
I Total Sum of Squares: Σ_{i=1}^{n} (yi − ȳ)²
I Explained Sum of Squares: Σ_{i=1}^{n} (ŷi − ȳ)²
I Residual Sum of Squares: Σ_{i=1}^{n} (yi − ŷi)²
78
Sum of Squares in Matrix Form
Let J = 11ᵀ be an n × n matrix of ones. Then we have

Total Sum of Squares = Σ_{i=1}^{n} (yi − ȳ)² = yᵀ( I − (1/n) J )y
Explained Sum of Squares = Σ_{i=1}^{n} (ŷi − ȳ)² = yᵀ( P − (1/n) J )y
Residual Sum of Squares = Σ_{i=1}^{n} (yi − ŷi)² = yᵀ( I − P )y
79
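The matrix identities above can be confirmed numerically; a numpy sketch (my own illustration, for a model that includes an intercept) checks that TSS = ESS + RSS:

```python
import numpy as np

rng = np.random.default_rng(10)
n, p = 60, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
J = np.ones((n, n))                        # J = 1 1^T

tss = y @ (np.eye(n) - J / n) @ y          # y^T (I - J/n) y
ess = y @ (P - J / n) @ y                  # y^T (P - J/n) y
rss = y @ (np.eye(n) - P) @ y              # y^T (I - P) y
```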
Sum of Squares (cont.)
Therefore, we have

Total Sum of Squares = Σ_{i=1}^{n} (yi − ȳ)²
= yᵀ( I − (1/n) J )y
= yᵀ( I − (1/n) J + P − P )y
= yᵀ( P − (1/n) J )y + yᵀ( I − P )y
= Explained Sum of Squares + Residual Sum of Squares
80
Sum of Squares: Degrees of Freedom
The corresponding degrees of freedom are
I Total Sum of Squares:
dfT = n − 1
I Explained Sum of Squares:
dfE = p
I Residual Sum of Squares:
dfR = n − p − 1
81
Coefficient of Multiple Determination
Recall: The coefficient of multiple determination is defined as

R² = Σ_{i=1}^{n} (ŷi − ȳ)² / Σ_{i=1}^{n} (yi − ȳ)² = 1 − Σ_{i=1}^{n} (yi − ŷi)² / Σ_{i=1}^{n} (yi − ȳ)².

Interpretation: Gives the amount of variation in y that is explained by the linear relationship with the covariates X.
Note: When interpreting R² values, note that
I 0 ≤ R² ≤ 1
I Large R² values do not necessarily imply a good model
I The more covariates we include in our model, the higher R² is (Why?)
82
Adjusted Coefficient of Multiple Determination
Problem of R²: Including more and more predictors can artificially inflate R²
I Capitalizing on spurious effects present in noisy data
I Phenomenon of over-fitting the data

The adjusted R² is a relative measure of fit:

R²ₐ = 1 − [ Σ_{i=1}^{n} (yi − ŷi)² / dfR ] / [ Σ_{i=1}^{n} (yi − ȳ)² / dfT ] = 1 − σ̂² / s²y ,

where s²y = (1/(n − 1)) Σ_{i=1}^{n} (yi − ȳ)² is the sample estimate of the variance of y.
83
Properties of the LSE in Matrix Notation
Projection X
Residuals X
Sampling Distribution
Underfitting and Overfitting
Multicollinearity
84
Sampling Distribution: Prerequisite
In multiple linear regression, the sampling distribution is the joint distribution of (β̂0 , . . . , β̂p)ᵀ.
We need to introduce some new concepts to describe the distribution of β̂:
I Covariance between two random vectors a ∈ Rᵐ and b ∈ Rⁿ
I Multivariate normal distribution
85
Covariance Matrix
I Covariance between two random vectors a ∈ Rᵐ and b ∈ Rⁿ:
cov(a, b) = ⎡ cov(a1, b1)  cov(a1, b2)  · · ·  cov(a1, bn) ⎤
            ⎢ cov(a2, b1)  cov(a2, b2)  · · ·  cov(a2, bn) ⎥
            ⎢      ⋮             ⋮        ⋱         ⋮      ⎥
            ⎣ cov(am, b1)  cov(am, b2)  · · ·  cov(am, bn) ⎦
I Covariance between Aa and Bb:
cov(Aa, Bb) = A cov(a, b) Bᵀ
I Partitioned covariance: if a = (a1, a2)ᵀ and b = (b1, b2)ᵀ, then
cov(a, b) = ⎡ cov(a1, b1)  cov(a1, b2) ⎤
            ⎣ cov(a2, b1)  cov(a2, b2) ⎦
I var(a) = cov(a, a)
86
Multivariate Normal Distribution
Let x = (x1 , . . . , xp)ᵀ and let x ∼ Np(µ, Σ). The multivariate normal density function is

f(x) = (2π)^(−p/2) |Σ|^(−1/2) exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) ),

where
I µ = (µ1 , . . . , µp)ᵀ is the p-dimensional mean vector
I Σ = ⎡ σ11  σ12  · · ·  σ1p ⎤
      ⎢ σ21  σ22  · · ·  σ2p ⎥
      ⎢  ⋮     ⋮    ⋱     ⋮  ⎥   is the covariance matrix
      ⎣ σp1  σp2  · · ·  σpp ⎦
87
Properties of Multivariate Normal Distribution
Properties of the mean and covariance parameters:
I µj ∈ R for all j
I σjj > 0 for all j
I σij = ρij √(σii σjj), where ρij is the correlation between Xi and Xj
I σij² ≤ σii σjj for all i, j ∈ {1, . . . , p}

The marginals of a multivariate normal are univariate normal:
Xj ∼ N(µj , σjj) for all j ∈ {1, . . . , p}

88

Affine Transformation of Multivariate Normal

Let x ∼ Np(µ, Σ).
Let A = {aij}n×p be a non-random matrix and consider a non-random vector b = (b1 , . . . , bn)ᵀ.
Define y = Ax + b with A ≠ 0n×p. Then

y ∼ Nn(Aµ + b, AΣAᵀ)

Linear combinations of normal variables are normally distributed!!!

89

Conditional Distribution of Multivariate Random Variables

If Σ is positive definite and

⎡ x ⎤      ⎛ ⎡ µx ⎤   ⎡ Σx   Σxy ⎤ ⎞
⎢   ⎥ ∼ N ⎜ ⎢    ⎥ , ⎢          ⎥ ⎟ ,
⎣ y ⎦      ⎝ ⎣ µy ⎦   ⎣ Σyx  Σy  ⎦ ⎠

then the conditional distribution of x given y is

x | y ∼ N( µx + Σxy Σy⁻¹ (y − µy),  Σx − Σxy Σy⁻¹ Σyx )

Important Property: x and y are independent if and only if Σxy = 0.

90

Sampling Distribution: Our Model Assumptions

Assume that
1. The errors have mean zero: E[ε] = 0.
2. The errors are uncorrelated with common variance: var(ε) = σ²I.
These imply that
1. E[y] = E[Xβ + ε] = Xβ
2. var(y) = var(Xβ + ε) = var(ε) = σ²I

91

Mean and Variance of Our Estimates

This slide is incomplete. Take notes!

1. The least squares estimate is unbiased: E[β̂] = β

2. The covariance matrix of the least squares estimate is

var(β̂) = σ 2 (X T X)−1 .

Proof:

See Wikipedia for the proof of Gauss-Markov theorem (BLUE)

92

How about e?

This slide is incomplete. Take notes!

Recall that e = y − ŷ = (I − P)y:
1. E[e] = 0
2. var(e) = σ²(I − P)
3. E[eᵀe] = (n − p − 1)σ².
4. Implication: An unbiased estimate of σ² is

σ̂² = eᵀe / (n − p − 1)

93
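Why divide by n − p − 1? Since E[eᵀe] = σ² tr(I − P), the divisor is the trace of I − P. A numpy sketch (my own illustration) checks that tr(I − P) = n − p − 1, which is what makes σ̂² unbiased:

```python
import numpy as np

rng = np.random.default_rng(11)
n, p = 25, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
P = X @ np.linalg.inv(X.T @ X) @ X.T

# E[e^T e] = sigma^2 * trace(I - P), and trace(I - P) = n - p - 1,
# so sigma_hat^2 = e^T e / (n - p - 1) has expectation sigma^2.
trace_IP = np.trace(np.eye(n) - P)
```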

Sampling Distributions

I Asymptotic Distributions

I Central Limit Theorem

I Bootstrap

I Exact Distributions

I Normality assumption: multivariate t-distribution

I Other distribution assumptions…

Construct confidence intervals/regions given the sampling

distribution as in simple linear regression

94

Properties of the LSE in Matrix Notation

Projection X

Residuals X

Sampling Distribution X

Underfitting and Overfitting

Multicollinearity

95

What Happens When you Underfit the Model?

Suppose that the true underlying model is

y = Xβ + Zη + ε,
but we instead fit the model
y = Xβ + ε.
In fact, in real data applications, we often do this because it is impossible to know which variables are in the true underlying model.
For simplicity, assume that the columns of X and Z are linearly independent.

96

Bias due to underfitting

Naive Argument: I am only interested in the parameters β, so why bother estimating η?
Claim: If we fit the smaller model, E[β̂] ≠ β. The estimates we get are biased! Even the fitted values are biased!
I E[β̂] = β + (XᵀX)⁻¹XᵀZη
I E[ŷ] = Xβ + X(XᵀX)⁻¹XᵀZη

97
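The bias formula can be verified exactly by working with a noiseless response, so that β̂ equals its expectation (a numpy sketch of my own, not the course's R example):

```python
import numpy as np

rng = np.random.default_rng(12)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = rng.normal(size=(n, 1))
beta = np.array([1.0, 2.0])
eta = np.array([3.0])

# Noiseless response, so the fitted beta_hat equals E[beta_hat].
y = X @ beta + Z @ eta
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# E[beta_hat] = beta + (X^T X)^{-1} X^T Z eta
bias_formula = beta + np.linalg.inv(X.T @ X) @ X.T @ Z @ eta
```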

Example I: Underfitting

Suppose that the true model is
yi = β0 + β1 xi + β2 xi² + εi,
but instead we fit the model
yi = β0 + β1 xi + εi.
What is the bias of β̂1?

98

Variance if we Underfit the Model?

Suppose that the true underlying model is

y = Xβ + Zη + ε,

but we instead fit the model

y = Xβ + ε.

Claim: cov(β̂) = σ²(XᵀX)⁻¹. But

E[σ̂²] = E[eᵀe/(n − p − 1)] = σ² + ηᵀZᵀ(I − P_X)Zη/(n − p − 1) > σ².

Implication: We overestimate the variance!
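A simulation sketch of this inflation (hypothetical numbers; the true σ² is 1, but dropping the quadratic term pushes the average of σ̂² well above 1):

```python
import random

random.seed(1)

# True model y = 1 + 2x + 2x^2 + eps with eps ~ N(0, 1), but we fit
# only y = b0 + b1*x and estimate sigma^2 by SSE/(n - 2).
x = [i / 10 for i in range(1, 21)]
n = len(x)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)

def sigma2_hat(y):
    """SSE/(n-2) from the underfit simple linear regression."""
    ybar = sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)

est = [sigma2_hat([1 + 2 * xi + 2 * xi ** 2 + random.gauss(0, 1) for xi in x])
       for _ in range(2000)]
avg = sum(est) / len(est)
print(avg)   # well above the true sigma^2 = 1
```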


Example in R on Underfitting

Scenario I: True β = (1, 1)ᵀ

y = Xβ + ε

Fit the correct model:

y = Xβ + ε

Scenario II: True β = (1, 1)ᵀ

y = Xβ + Zη + ε

Underfit the model:

y = Xβ + ε

(See Section 3.5 in Code MultipleLinearRegression.html.)

What Happens When You Overfit the Model?

Suppose that the true underlying model is

y = X1β1 + ε,

but we instead fit the model

y = X1β1 + X2β2 + ε = Xβ + ε,

where X = [X1 X2] and β = (β1ᵀ, β2ᵀ)ᵀ.

What will happen to our estimates β̂1?

Bias due to Overfitting

Claim: if we fit the larger model, E[β̂] = β. The estimates we get are unbiased! Even the fitted values and σ̂² are unbiased!

Proof:

Why don’t we keep Overfitting then?

Claim: The variance of β̂ will be larger! The full result is too complicated, so we will skip it.

Scenario: True β1 = (1, 1)ᵀ

y = X1β1 + ε

Overfit the model:

y = X1β1 + X2β2 + ε

(See Section 3.6 in Code MultipleLinearRegression.html.)
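A simulation sketch of both claims (hypothetical numbers): adding an irrelevant predictor z leaves the slope on x unbiased, but its sampling variance grows by roughly a factor of 1/(1 − r²), where r = corr(x, z).

```python
import random

random.seed(2)

# True model: y = 1 + 2x + eps; z is an irrelevant predictor correlated
# with x that the overfit model y ~ x + z includes anyway.
n, reps = 30, 2000
x = [random.gauss(0, 1) for _ in range(n)]
z = [0.8 * xi + random.gauss(0, 0.6) for xi in x]   # corr(x, z) ~ 0.8

def center(v):
    m = sum(v) / len(v)
    return [vi - m for vi in v]

xc, zc = center(x), center(z)
sxx = sum(a * a for a in xc)
szz = sum(b * b for b in zc)
sxz = sum(a * b for a, b in zip(xc, zc))

small, big = [], []
for _ in range(reps):
    y = [1 + 2 * xi + random.gauss(0, 1) for xi in x]
    yc = center(y)
    sxy = sum(a * b for a, b in zip(xc, yc))
    szy = sum(b * c for b, c in zip(zc, yc))
    small.append(sxy / sxx)                                       # fit y ~ x
    big.append((szz * sxy - sxz * szy) / (sxx * szz - sxz ** 2))  # fit y ~ x + z

def mean_var(v):
    m = sum(v) / len(v)
    return m, sum((vi - m) ** 2 for vi in v) / (len(v) - 1)

m_small, v_small = mean_var(small)
m_big, v_big = mean_var(big)
print(m_small, m_big)   # both near the true slope 2: unbiased either way
print(v_small, v_big)   # the overfit slope has the larger variance
```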


Summary of Effects of Underfitting and Overfitting

               β̂          ŷ          σ̂²               cov(β̂)
Underfitting   biased      biased     biased upward     still σ²(XᵀX)⁻¹
Overfitting    unbiased    unbiased   unbiased          increased

Model selection: How many variables are sufficient, and which variables should we include?

- ???
- ???
- ???

Properties of the LSE in Matrix Notation

Projection ✓

Residuals ✓

Sampling Distribution ✓

Underfitting and Overfitting ✓

Multicollinearity

Understanding Multicollinearity using Projection

Multicollinearity is a phenomenon in which one predictor variable

in a multiple regression model can be linearly predicted from the

others with a substantial degree of accuracy.

1. This means that there are strong linear dependencies among

the columns of X.

2. We refer to such an X as an almost singular matrix.

3. Does multicollinearity affect E[β̂]?

4. Does multicollinearity affect var(β̂)?

5. What are the implications of multicollinearity?

Solutions:

- Remove excessive variables
- Shrinkage estimators
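Questions 3–5 above can be answered with a small numerical sketch (hypothetical data): the expectation is unaffected, but for two centered predictors var(β̂1) = σ²/(Sxx(1 − r²)), so the variance explodes as corr(x1, x2) → 1.

```python
import random

random.seed(5)

# For two centered predictors, var(beta1_hat) = sigma^2 / (Sxx * (1 - r^2)).
# Make x2 an almost exact copy of x1 and watch the factor blow up.
n = 100
x1 = [random.gauss(0, 1) for _ in range(n)]

def slope_var_factor(noise_sd):
    """1 / (Sxx * (1 - r^2)) when x2 = x1 + noise with the given sd."""
    x2 = [a + random.gauss(0, noise_sd) for a in x1]
    m1, m2 = sum(x1) / n, sum(x2) / n
    sxx = sum((a - m1) ** 2 for a in x1)
    szz = sum((b - m2) ** 2 for b in x2)
    sxz = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    r2 = sxz ** 2 / (sxx * szz)
    return 1 / (sxx * (1 - r2))

print(slope_var_factor(1.0))    # moderate correlation: modest variance
print(slope_var_factor(0.05))   # near-collinear: variance explodes
```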


Generalized Least Squares


Linear Regression Assumptions

Model:

y = Xβ + ε

1. Constant variance assumption: var(ε) = σ²I.

2. Uncorrelated errors.

What if these assumptions are violated? How does it affect our solution if we fit ordinary multiple linear regression?

Non-constant Variance and Correlated Error

Suppose that

y = Xβ + ε

1. Non-constant variance: var(ε) = σ²V for some positive definite matrix V.

2. Correlated errors.

How should we estimate β?

Motivating Example: Clustered Data

(See Section 3.8 in Code MultipleLinearRegression.html.)


Model Generation with R


R fit using Ordinary Multiple Linear Regression

Model:

y = Xβ + ε,

where E[ε] = 0 and var(ε) = σ²V with σ² = 1.

(See Section 3.8 in Code MultipleLinearRegression.html.)

Generalized Least Squares

Suppose that we have

y = Xβ + ε,

where E[ε] = 0 and var(ε) = σ²V.

Goal: Transform the above model so that it has uncorrelated data points, and then fit the multiple linear regression.

Generalized Least Squares

Claim: Let V = UDUᵀ (singular value decomposition) and let K = U D^(−1/2) Uᵀ. Then K is the inverse of the square root of V.

- Create y* = Ky.
- Create X* = KX.
- Create ε* = Kε.

Then

y* = X*β + ε*,

where E[ε*] = 0 and var(ε*) = σ²I.

Now we can fit the multiple linear regression we have already learnt, using the new data points y* and X*.

(See Section 3.8 in Code MultipleLinearRegression.html.)
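For a diagonal V the claim is easy to check numerically (a Python sketch with hypothetical numbers, separate from the course's R code): K = V^(−1/2) just rescales each observation, and ordinary least squares on the starred data coincides with the GLS formula (XᵀV⁻¹X)⁻¹XᵀV⁻¹y.

```python
import random

random.seed(3)

# Heteroscedastic no-intercept model: y_i = 2*x_i + e_i, var(e_i) = v_i,
# i.e. V = diag(v) with sigma^2 = 1.
n = 200
x = [random.uniform(1, 5) for _ in range(n)]
v = [0.5 + xi ** 2 for xi in x]                 # error variance grows with x
y = [2 * xi + random.gauss(0, vi ** 0.5) for xi, vi in zip(x, v)]

# Whitened data: y*_i = y_i / sqrt(v_i), x*_i = x_i / sqrt(v_i).
xs = [xi / vi ** 0.5 for xi, vi in zip(x, v)]
ys = [yi / vi ** 0.5 for yi, vi in zip(y, v)]

# Ordinary least squares on the whitened data ...
beta_star = sum(a * b for a, b in zip(xs, ys)) / sum(a * a for a in xs)
# ... equals the GLS formula applied to the original data:
beta_gls = (sum(xi * yi / vi for xi, yi, vi in zip(x, y, v))
            / sum(xi * xi / vi for xi, vi in zip(x, v)))
print(beta_star, beta_gls)   # identical, and both near the true beta = 2
```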


Mean and Variance of Generalized Least Squares Estimator

Model:

y* = X*β + ε*,

where E[ε*] = 0 and var(ε*) = σ²I with σ² = 1.

GLS estimate:

β̂GLS = (XᵀV⁻¹X)⁻¹XᵀV⁻¹y.

Claim:

1. E[β̂GLS] = β.

2. var(β̂GLS) = σ²(XᵀV⁻¹X)⁻¹.

3. Residual sum of squares: (y − Xβ̂GLS)ᵀV⁻¹(y − Xβ̂GLS).

GLS versus OLS

Assume the following model:

y = Xβ + ε

with E[ε] = 0 and var(ε) = σ²V for some positive definite matrix V.

What are the properties of ordinary multiple linear regression under this model?

- E[β̂] = β
- var(β̂) = σ²(XᵀX)⁻¹XᵀVX(XᵀX)⁻¹

Weighted Least Squares for Unequal Variances

Consider the linear regression model

yi = βxi + εi,

where var(ε) = σ² diag(1/w1, …, 1/wn).

For unknown w1, …, wn, use iteratively reweighted least squares.
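The last line can be sketched as follows (a minimal Python illustration under an assumed variance model var(e_i) = c·x_i² with unknown c; this is not the course's procedure): alternate between estimating the variance function from squared residuals and refitting with weights w_i = 1/var̂_i.

```python
import random

random.seed(4)

# y_i = 3*x_i + e_i with sd(e_i) = 1.5*x_i, so var(e_i) = c*x_i^2, c unknown.
n = 200
x = [random.uniform(1, 5) for _ in range(n)]
y = [3 * xi + random.gauss(0, 1.5 * xi) for xi in x]

def wls_slope(w):
    """Weighted LS slope for y = beta*x + e: sum(w x y) / sum(w x^2)."""
    return (sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
            / sum(wi * xi * xi for wi, xi in zip(w, x)))

beta = wls_slope([1.0] * n)        # step 0: ordinary (unweighted) LS
for _ in range(3):                 # iterate: re-estimate weights, refit
    r2 = [(yi - beta * xi) ** 2 for xi, yi in zip(x, y)]
    # estimate c by regressing squared residuals on x^2 (no intercept)
    c = sum(r * xi ** 2 for r, xi in zip(r2, x)) / sum(xi ** 4 for xi in x)
    beta = wls_slope([1 / (c * xi ** 2) for xi in x])
print(beta)   # close to the true slope 3
```

Note that the constant c cancels in the slope itself; estimating it still matters if you go on to compute standard errors.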

