STA 108
Homework 7
Due: Mar 6th 2020 (by 2:10 pm)
Please clearly print your name, course name, section number, student ID number,
and names of students who you have discussed homework problems with on top of
the first page of your homework as follows.
Name
Course STA108 Section
Student ID#
Study group members
Additional requirements can be found in the course syllabus.
Question 1: Binary covariate and interactions (50 pts)
A scientist wants to study the factors affecting length of hospital stay (y ) and identified two covariates: age of patient (x1 ) and the severity of disease (x2 , 1 if it is a
life-threatening disease and 0 otherwise). Since there might be an interaction effect
(denoted as x3 = x1 × x2 ), we consider the following regression model
y = β0 + β1 x1 + β2 x2 + β3 x3 + ε.
The scientist wants to decide whether the severity of disease is needed in the model
and proposes to test: H0 : β2 = β3 = 0 vs. Ha : β2 ≠ 0 or β3 ≠ 0. In order to test this
hypothesis, the scientist fits two models. The first model considers 100 patients with
severe diseases with the R output as below.
The second model considers 100 patients without severe diseases with the R output as
below.
1. (20 pts) Based on the information provided, can we test the above hypothesis at
the 0.05 significance level? If so, carry out the test. If not, explain.
2. (30 pts) Describe how you will test this hypothesis if you have access to the data.
Question 2: Puzzling p-values (40 pts)
When analyzing a set of data, you run into an R output as follows.
Given your eight weeks of training in statistics, you sense something strange
in this result….
1. (20 pts) Point out what is strange in this R output, and explain how this could
happen.
2. (20 pts) Will this phenomenon become a problem? Explain your answer.
Question 3: Testing linear combinations of parameters (10 pts)
Suppose that we have fit a linear regression, for i = 1, . . . , n,
yi = β0 + β1 xi,1 + β2 xi,2 + εi ,
where ε1 , ε2 , . . . , εn are i.i.d. errors with E[εi ] = 0 and var(εi ) = σ². Describe how you
will test H0 : β1 − 2β2 = 0 vs. Ha : β1 − 2β2 ≠ 0.
STA 108 Applied Statistical Methods:
Regression Analysis
Multiple Linear Regression
Shizhe Chen, PhD
Winter 2020
1
Important Instructions
I The first four sections (motivation, model, inference, model
selection) will be taught as before
I The rest of the slides are provided as reading materials
I Read the slides before class at your own pace
I Bring questions to lectures!
I Learn to use online resources for self-learning
I ≤ 10 pts in the final exam
2
Outline
I Motivation
I Multiple Linear Regression Model
I Statistical Inference
I Model selection
I Least Squares Estimation
I Generalized Least Squares
3
Motivation
4
New Requests from Clients
So far, we have successfully answered the following questions.
1. Are sales positively associated with TV advertising budgets?
2. What will the amount of sales be if the TV advertising budget
is 20 (in thousands of dollars)?
However, our clients suddenly realize that there are more avenues
for advertisement than just on TV.
5
Exploratory Data Analysis
(See Section 3.1 in Code MultipleLinearRegression.html.)
6
New Questions of Interest
1. Which advertising budgets (TV, radio, or newspaper) are
associated with sales?
2. What is the mean difference of sales per unit difference in TV
advertising budget when other budgets are held constant?
3. What is the expected sales given a fixed combination of
budgets?
4. How much will sales increase by increasing the TV advertising
budget by 20 (thousands of dollars)?
5. What is the maximum expected sales given a total budget of
100 (thousands of dollars)?
7
New Questions of Interest
1. Which advertising budgets (TV, radio, or newspaper) are
associated with sales? (Linear regression)
2. What is the mean difference of sales per unit difference in TV
advertising budget when other budgets are held
constant?(Linear regression)
3. What is the expected sales given a fixed combination of
budgets? (Linear regression)
4. How much will sales increase by increasing the TV advertising
budget by 20 (thousands of dollars)? (Causal inference)
5. What is the maximum expected sales given a total budget of
100 (thousands of dollars)? (Optimization)
8
Multiple Linear Regression Model
9
Multiple Linear Regression Model
The multiple linear regression model takes the form
y = β0 + x1 β1 + . . . + xp βp + ε,
where
I y ∈ R is the real-valued response
I xj ∈ R is the jth covariate
I β0 is the intercept term
I βj is the regression slope for the jth covariate
I ε ∈ R is the error term with E[ε] = 0 and var(ε) = σ²
Note: The term “linear” refers to the fact that the mean is a
linear function of the unknown parameters β0 , . . . , βp .
10
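For illustration, a model of this form can be fit in R with lm(). A minimal sketch on simulated data (the variable names and coefficient values below are made up):

  set.seed(1)
  n  <- 100
  x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
  y  <- 1 + 2*x1 - 0.5*x2 + 0*x3 + rnorm(n)   # true (beta0, beta1, beta2, beta3) = (1, 2, -0.5, 0)
  fit <- lm(y ~ x1 + x2 + x3)                 # least squares fit of the multiple linear regression
  summary(fit)                                # estimates, standard errors, t-tests, R^2
  coef(fit)                                   # the fitted coefficients beta-hat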
With n Observations
With n observations of y and x1 , . . . , xp , the complete model
becomes
y1 = β0 + x11 β1 + x12 β2 + · · · + x1p βp + ε1
y2 = β0 + x21 β1 + x22 β2 + · · · + x2p βp + ε2
..
.
yn = β0 + xn1 β1 + xn2 β2 + · · · + xnp βp + εn ,
where the error terms are assumed to have the following properties
I E[εi ] = 0
I var(εi ) = σ² (constant for all i)
I cov(εi , εj ) = 0 for j ≠ i
Note: Sometimes stronger assumptions are imposed, such as the εi 's
are i.i.d. with mean 0 and variance σ².
11
Interpretation of Multiple Linear Regression
In a multiple linear regression model
y = β0 + x1 β1 + x2 β2 + . . . + xp βp + ε, β1 is the expected mean
difference in y per unit difference in x1 when x2 , . . . , xp are held
constant (adjusting/controlling for x2 , . . . , xp ).
Example: Suppose that y is the systolic blood pressure of
newborns, x1 is days of age, and x2 is the weight at birth in
ounces. We say that
I β1 : We estimate that two groups of newborns with the same
weight at birth and who differ by one day of age will have
systolic blood pressure that differs on average by 5.89 mm Hg
(95% CI: 4.42, 7.36).
I β2 : We estimate that two groups of newborns with the same
age and who differ by one ounce at birth will have systolic
blood pressure that differs on average by 0.13 mm Hg (95%
CI: 0.05, 0.20).
12
Interpretation of Multiple Linear Regression
In a multiple linear regression model
y = β0 + x1 β1 + x2 β2 + . . . + xp βp + ε, β1 is the expected mean
difference in y per unit difference in x1 when x2 , . . . , xp are held
constant (adjusting/controlling for x2 , . . . , xp ).
Note: The interpretation of a parameter might not make sense in
a multiple linear regression!
13
Interpretation of Multiple Linear Regression: Special Cases
In y = β0 + x1 β1 + x2 β2 + . . . + xp βp + ε, what if
I x3 = x1 × x2 ?
The effect of x1 on y will differ depending on the value of x2 .
Example: When comparing two groups of newborns that differ
by one day of age and with the same birthweight, the
difference in systolic blood pressure depends on the babies’
birthweight, with the difference in mean systolic blood
pressure decreasing by 0.13 mm Hg for each ounce difference
in birthweight.
I x2 = x1³ ?
x1 is constant if x2 is held constant.
Interpret all the terms that depend on x1 !
I x1 , . . . , xp are dummy variables for a categorical variable z
that has p + 1 categories?
x1 = 1 means x2 = · · · = xp = 0.
How will you interpret β1 ?
14
Multiple Linear Regression Model: Categorical Covariates
Consider the question in Homework #1, where we code xi as 0 or 1 to
distinguish between ducks and pandas
I What if we found out that there are actually red pandas and
raccoons in the data set?
I xi = 0, 1, 2, 3 for ducks, pandas, red pandas, or raccoons?
I xi1 = 1 for a panda, xi2 = 1 for a red panda, xi3 = 1 for a
raccoon, and zero otherwise!
15
Multiple Linear Regression Model: Categorical Covariates
(cont.)
Create K − 1 dummy variables for a categorical variable with K
categories
This setting is also known as the ANalysis Of VAriance (ANOVA)
16
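For illustration, R's lm() builds the K − 1 dummy variables automatically from a factor. A minimal sketch using the ducks/pandas/red pandas/raccoons example, with simulated data (all numbers illustrative):

  set.seed(1)
  species <- factor(sample(c("duck", "panda", "red panda", "raccoon"), 40, replace = TRUE))
  # simulate group means that differ by species ("duck" is the alphabetically first, reference level)
  y <- 5 + 2*(species == "panda") + 3*(species == "red panda") + 1*(species == "raccoon") + rnorm(40)
  fit <- lm(y ~ species)        # lm() creates K - 1 = 3 dummy variables behind the scenes
  summary(fit)                  # each coefficient compares a species against the reference (duck)
  model.matrix(fit)[1:5, ]      # inspect the dummy coding used in the design matrix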
Multiple Linear Regression Model: Polynomial Regression1
This slide is incomplete. Take notes!
Consider a true (and unknown!) model where y is non-linear in x
y = (x − 3)⁴ + ε ⟺ y = x⁴ − 12x³ + 54x² − 108x + 81 + ε
Suppose that we have n observations of x and y, can we learn the
above model using linear regression?
1
Check out Wolfram Alpha
17
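For illustration, the quartic model above is still linear in its coefficients, so it can be fit by least squares once powers of x are included as covariates. A minimal sketch on simulated data:

  set.seed(1)
  n <- 200
  x <- runif(n, 0, 6)
  y <- (x - 3)^4 + rnorm(n)
  # A degree-4 polynomial in x is linear in the unknown coefficients,
  # so ordinary least squares applies.
  fit <- lm(y ~ poly(x, 4, raw = TRUE))   # raw = TRUE: coefficients of x, x^2, x^3, x^4
  coef(fit)                               # should be close to (81, -108, 54, -12, 1)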
Building a Linear Model
Suppose that you are interested in studying the relationship
between y and x1 , and you have the resources to collect data (via
experiments or surveys). How will you build a linear model?
y = β0 + x1 β1 + . . . + xp βp + ε
Consider the following scenarios
I y is the body weight and x1 is the length of sleep per day
I y is the lung function (measured by forced expiratory volume,
or FEV) and x1 is a dummy variable for smoking
I y is the occurrence of a heart attack and x1 is a dummy
variable for depression
18
Classification of Variables
This slide is incomplete. Take notes!
I Variable of interest (or exposure, treatment, etc.)
I Response variable (or outcome)
I Confounder
I Effect modifier
I Precision variable
I Instrument
19
Confounding
I Confounding is an effect of some uncontrolled variable on the
response variable that hinders interpretation of the relationship
between the response and the predictor variable of interest
I Confounding describes real or imagined effects that distort the
relationship one wishes to observe between the predictor
variable and response
I More of a problem for observational studies
I in contrast to a designed experiment
Example: the effect of smoking on lung function (measured by forced
expiratory volume, FEV) may be confounded by age
20
Controlling for Confounding
I Implicitly with appropriate study designs
I Explicitly by measuring it and including it in the model
21
Controlling for Confounding: Design
I Match the observations that are similar in terms of
confounding variables (confounders), e.g., comparing the FEV
between smokers and non-smokers of the same age
I Relatively easy to implement
I Infeasible when there are too many confounders
I Conduct a randomized experiment, e.g., randomly assign
participants to the smoking group or the non-smoking group
I Destroys all confounding possibilities (grants causal interpretation)
I Infeasible in many cases
22
Controlling for Confounding: Model-Based
Using knowledge in this class, we can
I Fit the unadjusted model
y = β0 + x1 β1 + ε
I Fit the adjusted model
y = β0′ + x1 β1′ + x2 β2′ + ε′
I Compare the fitted values of β1 and β1′
I Eyeballing
I Hypothesis testing
You can also use other advanced statistical methods to control for
unmeasured confounders, e.g., propensity score, instrumental
variable
23
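A minimal simulated sketch of the unadjusted-versus-adjusted comparison (here x2 is a confounder that affects both x1 and y; all names and numbers illustrative):

  set.seed(1)
  n  <- 500
  x2 <- rnorm(n)                  # confounder (e.g., age)
  x1 <- 0.8*x2 + rnorm(n)         # exposure is associated with the confounder
  y  <- 1 + 0.5*x1 + 2*x2 + rnorm(n)
  coef(lm(y ~ x1))                # unadjusted beta_1: pulled away from the true 0.5
  coef(lm(y ~ x1 + x2))           # adjusted beta_1': close to 0.5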
Effect Modifier
I Variable that modifies the effect (or association) of the
variable of interest on the response
I Modeling approaches
I Stratify analysis (for categorical variables)
I Multiple linear regression with an interaction term
y = β0 + x1 β1 + x2 β2 + x1 x2 β3 + ε
24
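In R, the interaction model above can be written as x1 * x2, which expands to x1 + x2 + x1:x2. A minimal sketch on simulated data (names and numbers illustrative):

  set.seed(1)
  n  <- 300
  x1 <- rnorm(n)
  x2 <- rbinom(n, 1, 0.5)               # binary effect modifier
  y  <- 1 + 2*x1 + 0.5*x2 - 1.5*x1*x2 + rnorm(n)
  fit <- lm(y ~ x1 * x2)                # shorthand for x1 + x2 + x1:x2
  coef(fit)                             # the x1:x2 coefficient estimates beta_3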
Precision Variable
I Variable that only affects the response variable
I Improve the precision of the model fits if included
25
Statistical Inference
26
Inferences about Multiple β̂j 's
Assume that q < p and we want to test if a reduced (smaller) model is sufficient:
H0 : βq+1 = · · · = βp = 0 vs. Ha : at least one βk ≠ 0.
Compare the residual sum of squares eᵀe for the full and reduced (smaller) models:
(a) Full model: yi = β0 + Σ_{j=1}^q βj xij + Σ_{k=q+1}^p βk xik + εi
(b) Reduced model: yi = β0 + Σ_{j=1}^q βj xij + Σ_{k=q+1}^p 0 × xik + εi
27
Inferences about Multiple β̂j 's (cont.)
This slide is incomplete. Take notes!
Test statistic:
tF = [ (L(β̂reduced) − L(β̂full)) / (dfreduced − dffull) ] / [ L(β̂full) / dffull ]
   = [ (L(β̂reduced) − L(β̂full)) / ((n − q − 1) − (n − p − 1)) ] / [ L(β̂full) / (n − p − 1) ]
   ∼ F(p−q, n−p−1),
where
I L(β̂reduced) is the residual sum of squares for the reduced model
I L(β̂full) is the residual sum of squares for the full model
I dfreduced is the error degrees of freedom for the reduced model
I dffull is the error degrees of freedom for the full model
28
Testing Linear Combinations of Parameters
This slide is incomplete. Take notes!
What if we want to test H0 : 2β1 + β3 = 0 vs. Ha : 2β1 + β3 ≠ 0?
I Transform the data to fit a new linear regression
I Use the Wald test
I Estimators follow a multivariate normal distribution (asymptotically)
I Need to know the covariance
29
Multiple Linear Regression with R
(See Section 3 in Code MultipleLinearRegression.html.)
30
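For illustration, the nested-model F test above can be carried out with anova() on two lm() fits. A minimal sketch with simulated data (all variable names and numbers are illustrative, not the course data):

  set.seed(1)
  n  <- 100
  x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
  y  <- 1 + 2*x1 + 0*x2 + 0*x3 + rnorm(n)
  fit_reduced <- lm(y ~ x1)              # reduced model under H0: beta_2 = beta_3 = 0
  fit_full    <- lm(y ~ x1 + x2 + x3)    # full model
  anova(fit_reduced, fit_full)           # F statistic with (p - q, n - p - 1) = (2, 96) df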
Model Selection
31
Constructing Statistical Models
I Use domain knowledge
I Use statistical methods for model selection
32
Selecting an appropriate model
Model selection has many meanings
I It can mean selecting covariates and how they are to be included in the model, known as feature selection (and sometimes feature generation)
I It can mean choosing between a regression tree (random forest), a neural network (deep learning), or logistic regression
I It can mean selecting a λ-value in the lasso, known as tuning parameter selection
We will focus on feature/variable selection
33
Trade-off on Complexity
There are two keys for a method to do well on your data:
I It doesn't completely miss important structure
I It admits a proper amount of complexity for the number of observations in your data.
34
The "proper" amount of complexity
How can we tell if we are choosing a model that is... sufficiently complex (to capture the signal)... but not overly complex (such that we overfit our training data)?
The proper degree of complexity is extremely data-dependent
I Information criteria
I Loss: residual sum of squares (or likelihood) evaluated using cross-validation
Task: Search and read about overfitting²
² It's also discussed in the notes.
35
Information Criteria
An information criterion measures the trade-off between the goodness of fit and model complexity. In general, an information criterion admits the following form:
information criterion = loss in model fitting + penalty on model complexity
I Select the model that minimizes the chosen information criterion.
I A good information criterion needs to carefully balance the trade-off between model fitting and model complexity
I Two famous information criteria: AIC and BIC
36
Akaike Information Criterion (AIC)
Formulated by Japanese statistician Hirotugu Akaike in 1973.
AIC = 2k + n log(RSS/n)   or   AIC = 2k − 2 log Lk
I Loss of model fitting: log of the residual sum of squares (for linear regression), or two times the negative log-likelihood (in general)
I Penalty of model complexity: two times the number of variables k.
When the sample size n is large, AIC tends to be conservative in "punishing" model complexity.
37
Bayesian Information Criterion (BIC)
BIC was developed by statistician Gideon E. Schwarz in 1978. Therefore, BIC is also known as the Schwarz information criterion (SIC).
BIC = log(n) k + n log(RSS/n)   or   BIC = log(n) k − 2 log Lk
I Loss of model fitting: log of the residual sum of squares (for linear regression), or two times the negative log-likelihood (in general)
I Penalty of model complexity: log(n) times the number of variables
38
Model Selection
Choosing a proper complexity for your model
I Information criteria
I Loss function: residual sum of squares (or likelihood) evaluated using cross-validation
39
Split-Sample Validation
Q: What is overfitting?
Ideally, we wish to have two datasets, where we can
I explore all possible models on the training data
I evaluate/confirm our findings on the test data
In practice, oftentimes we only have one dataset... so we split the dataset into a training set and a test set.
40
How to Split?
What proportion should be used for training vs testing/evaluating? Often something like 2/3 training - 1/3 testing is good.
This still feels like inefficient data use. Is there some way to use the majority of the data for both training and testing?
41
Cross-validation I
Let's use cross-validation:
I Partition our data into multiple folds...
I Each time use 1 fold as test, and all other folds as training
42
K-fold Cross-validation
[Figure: schematic of K-fold cross-validation; in each round one fold is held out as the test set and the remaining folds are used for training.]
43
Leave-one-out Cross-validation
44
Cross-validation II
Procedure for a k-fold validation
1. Randomly split the dataset into k non-overlapping subsets (folds)
2. For i = 1, . . . , k,
  2.1 Fit the chosen model on the data excluding the ith fold
  2.2 Evaluate the loss of the fitted model on the ith fold, denoted as Li
3. The final loss is Σ_{i=1}^k Li / k.
45
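A minimal R sketch of the k-fold procedure above, using squared-error loss on simulated data (the fold count, names, and data are illustrative):

  set.seed(1)
  n <- 120; k <- 5
  x <- rnorm(n); y <- 1 + 2*x + rnorm(n)
  dat  <- data.frame(x = x, y = y)
  fold <- sample(rep(1:k, length.out = n))              # step 1: random, non-overlapping folds
  loss <- numeric(k)
  for (i in 1:k) {                                      # step 2
    fit     <- lm(y ~ x, data = dat[fold != i, ])       # 2.1 fit on data excluding fold i
    pred    <- predict(fit, newdata = dat[fold == i, ]) # 2.2 evaluate on fold i
    loss[i] <- mean((dat$y[fold == i] - pred)^2)
  }
  mean(loss)                                            # step 3: average loss over the k folds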
Model Selection
Choosing a proper complexity for your model
I Information criteria
I Loss function: residual sum of squares (or likelihood) evaluated using cross-validation
Next, how to select the "best" model?
46
Selection Procedure
I Best subset selection
I Stepwise selection
I Penalization
47
Best Subset Selection
An intuitive procedure:
1. List all possibilities
2. The model with the best criterion (AIC, BIC, CV loss) wins
Algorithm: Best Subset Selection
1. Choose a selection criterion
2. For k = 1, . . . , p,
  2.1 Fit all possible models that contain exactly k covariates
  2.2 Pick the best among these models (the one with the best criterion)
  2.3 Denote this model as Mk
3. Select a single best model among M1 , . . . , Mp
48
Best Subset Selection
An intuitive procedure:
1. List all possibilities
2. The model with the best criterion (AIC, BIC, CV loss) wins
There are a lot of models to fit (2^p)!!!
49
Stepwise Selection
Stepwise selection:
I Computationally efficient alternative
I Explores a restricted set of models
No guarantee that it can find the best possible model!
Two directions
I Forward
I Backward
50
Forward Stepwise Selection
Algorithm: Forward Adding
1. Choose a selection criterion
2. Let M0 denote the null model, which contains no predictors
3. For k = 0, . . . , p − 1,
  3.1 Fit all p − k models by adding one additional variable to Mk
  3.2 Pick the best among these p − k models; denote this model as Mk+1
4. Select a single best model among M0 , . . . , Mp using the chosen criterion.
You may stop early if you use p-values as the criterion...
51
Backward Stepwise Selection
Algorithm: Backward Deleting
1. Choose a selection criterion
2. Let Mp denote the full model, which contains all predictors
3. For k = p, . . . , 1,
  3.1 Fit all k models by deleting one variable from Mk
  3.2 Pick the best among these k models; denote this model as Mk−1
4. Select a single best model among M0 , . . . , Mp using the chosen criterion.
You may stop early if you use p-values as the criterion...
(See Section 3.7 in Code MultipleLinearRegression.html.)
52
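For illustration, base R's step() performs forward and backward stepwise selection, using AIC as the default criterion. A minimal sketch on simulated data (all names illustrative):

  set.seed(1)
  n <- 100
  X <- data.frame(matrix(rnorm(n * 5), n, 5))   # covariates X1, ..., X5
  X$y <- 1 + 2*X$X1 - 1*X$X3 + rnorm(n)         # only X1 and X3 matter
  full <- lm(y ~ ., data = X)
  null <- lm(y ~ 1, data = X)
  step(null, scope = formula(full), direction = "forward")   # forward adding
  step(full, direction = "backward")                         # backward deleting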
Penalization
We can approximate best subset selection using sparsity-inducing penalties.
Choose β = (β0 , β1 , . . . , βp )ᵀ to minimize
L(β) + λ P(β)
53
Lasso
One notable example is the lasso, where
P(β) = |β1 | + . . . + |βp |,
a function that penalizes complexity in β.
Minimizing this penalized criterion will "magically" give a sparse model (meaning: many β-values exactly equal to 0).
To modulate the degree of sparsity, we change λ.
Q: How would you choose λ?
54
Summary of Model Selection
I When?
I How?
I Why?
55
Least Squares Estimation
56
Linear Model in Matrix Notation
y1 = β0 + x11 β1 + x12 β2 + · · · + x1p βp + ε1
y2 = β0 + x21 β1 + x22 β2 + · · · + x2p βp + ε2
..
.
yn = β0 + xn1 β1 + xn2 β2 + · · · + xnp βp + εn ,
where the error terms are assumed to have the following properties: E[εi ] = 0, var(εi ) = σ², and cov(εi , εj ) = 0 for j ≠ i.
y = Xβ + ε,   E[ε] = 0 and var(ε) = σ²I
57
Linear Model in Matrix Notation, Step I
In matrix form, y = Xβ + ε with
y = (y1 , . . . , yn )ᵀ,  β = (β0 , β1 , . . . , βp )ᵀ,  ε = (ε1 , . . . , εn )ᵀ,
and the ith row of X equal to (1, xi1 , xi2 , . . . , xip ).
We write y = Xβ + ε, where
I y is the n × 1 response column vector
I X is the n × (p + 1) design matrix
I β is the (p + 1) × 1 vector of regression coefficients
I ε is the n × 1 random error column vector
58
Linear Model in Matrix Notation, Step II
y = Xβ + ε.
Recall: The error terms are assumed to have the following properties: E[εi ] = 0, var(εi ) = σ², and cov(εi , εj ) = 0 for j ≠ i.
I E[ε] = 0
I var(ε) = σ²I
59
Linear Model in Matrix Notation
y1 = β0 + x11 β1 + x12 β2 + · · · + x1p βp + ε1
y2 = β0 + x21 β1 + x22 β2 + · · · + x2p βp + ε2
..
.
yn = β0 + xn1 β1 + xn2 β2 + · · · + xnp βp + εn ,
where the error terms are assumed to have the following properties: E[εi ] = 0, var(εi ) = σ², and cov(εi , εj ) = 0 for j ≠ i.
y = Xβ + ε,   E[ε] = 0 and var(ε) = σ²I
60
Some Notes on Linear Models
y = Xβ + ε
1. Usually the first column of X is the vector of ones, corresponding to the intercept β0
2. The jth column of X, denoted as Xj , is the jth predictor variable for the n observations
3. ε is the random part of the model (y is random because ε is random)
4. E[y] = E[Xβ + ε] = E[Xβ] + E[ε] = Xβ, i.e., E[y] is a linear combination of {xj : j = 1, . . . , p + 1}
61
Multiple Linear Regression Model: Examples
Consider the model
y = β0 + x1 β1 + ε
and suppose that we observe
(y1 , y2 , y3 , y4 , y5 ) = (1, 4, 3, 8, 9)
(x11 , x21 , x31 , x41 , x51 ) = (0, 1, 2, 3, 4)
Now, represent these data using matrix notation y = Xβ + ε.
62
Multiple Linear Regression Models: Advertising Data
y = Xβ + ε,   E[ε] = 0 and var(ε) = σ²I
In the advertising data, the responses y and the covariate X are
63
Least Squares Estimator for Multiple Linear Regression
Estimates β̂0 , β̂1 , . . . , β̂p are least squares estimators of β0 , β1 , . . . , βp if they minimize
L(β) ≡ Σ_{i=1}^n ( yi − β0 − Σ_{j=1}^p xij βj )²
Graphical interpretation: least squares in 2D [Elements of Statistical Learning: Hastie et al.]
64
Linear Model in Matrix Notation
In matrix form, y = Xβ + ε with
y = (y1 , . . . , yn )ᵀ,  β = (β0 , β1 , . . . , βp )ᵀ,  ε = (ε1 , . . . , εn )ᵀ,
and the ith row of X equal to (1, xi1 , xi2 , . . . , xip ).
We write y = Xβ + ε, where
I y is the n × 1 response column vector
I X is the n × (p + 1) design matrix
I β is the (p + 1) × 1 vector of regression coefficients
I ε is the n × 1 random error column vector
65
Least Squares Estimator for Multiple Linear Regression
This slide is incomplete. Take notes!
Solve the least squares problem in matrix notation: minimize over β
‖y − Xβ‖₂² = Σ_{i=1}^n ( yi − β0 − Σ_{j=1}^p xij βj )²,
where ‖·‖₂ is the ℓ2-norm, so that ‖y − Xβ‖₂² = (y − Xβ)ᵀ(y − Xβ).
Claim: β̂ = (XᵀX)⁻¹Xᵀy.
66
Simple Linear Regression as a Special Case
This slide is incomplete. Take notes!
Recall: In simple linear regression, we have
β̂1 = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)²   and   β̂0 = ȳ − x̄ β̂1 .
67
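The closed-form claim β̂ = (XᵀX)⁻¹Xᵀy can be checked numerically against lm(). A minimal sketch with simulated data (names illustrative):

  set.seed(1)
  n <- 50
  x1 <- rnorm(n); x2 <- rnorm(n)
  y  <- 1 + 2*x1 - 3*x2 + rnorm(n)
  X  <- cbind(1, x1, x2)                      # design matrix with a leading column of ones
  beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # solves the normal equations (X'X) beta = X'y
  cbind(beta_hat, coef(lm(y ~ x1 + x2)))      # the two columns agree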
Properties of the LSE in Matrix Notation
Projection
Residuals
Sampling Distribution
Underfitting and Overfitting
Multicollinearity
68
Understanding Projection
[Figure: a point in (x, y, z) space. We have a three-dimensional vector.]
69
Understanding Projection
[Figure: projection onto the (x, y) plane. We are projecting the three-dimensional vector onto the (x, y) plane.]
70
Understanding Projection
[Figure: projection onto the y axis. We are projecting the three-dimensional vector onto the y axis.]
71
Understanding Regression via Projection
Let x0 be a vector of ones. Then we have
Xβ = (x0 x1 x2 . . . xp )(β0 , β1 , . . . , βp )ᵀ = x0 β0 + x1 β1 + x2 β2 + · · · + xp βp ∈ R(X)
Here R(X) is the subspace spanned by the columns of X.
72
Orthogonal Projection onto columns of X
Theorem: The observation vector y can be uniquely decomposed as
y = ŷ + e,  where ŷ ∈ R(X), e ∈ R(X)⊥,
with R(X)⊥ the orthogonal complement of R(X). In other words, if two vectors a ∈ R(X) and b ∈ R(X)⊥, we have aᵀb = 0.
73
Properties of the LSE in Matrix Notation
Projection X
Residuals
Sampling Distribution
Underfitting and Overfitting
Multicollinearity
74
Residuals
Again, residuals can be interpreted as what's left over after projecting y onto X.
Definition: The residual vector is e = y − ŷ = y − Xβ̂.
Definition: The residual sum of squares is defined as
eᵀe = Σ_{i=1}^n ei² = (y − Xβ̂)ᵀ(y − Xβ̂)
75
Hat Matrix and Fitted Values
Definition: Let ŷ = Xβ̂ = P y denote the fitted values of y, where P = X(XᵀX)⁻¹Xᵀ. Then P is called the hat matrix.
The residuals can be rewritten as e = y − ŷ = y − P y = (I − P)y.
Interpretation: P is a projection matrix that projects y onto R(X).
76
Properties of Projection Matrix
The projection matrix P satisfies the following properties:
I P is the projection matrix onto R(X).
I I − P is the projection matrix onto R(X)⊥.
I PX = X
I (I − P)X = 0
I Projection matrices are idempotent, i.e., (I − P)(I − P) = (I − P) and PP = P.
I P(I − P) = 0.
77
Sum of Squares
Recall: The three sums of squares are
I Total Sum of Squares: Σ_{i=1}^n (yi − ȳ)²
I Explained Sum of Squares: Σ_{i=1}^n (ŷi − ȳ)²
I Residual Sum of Squares: Σ_{i=1}^n (yi − ŷi)²
78
Sum of Squares in Matrix Form
Let J = 11ᵀ be an n × n matrix of ones. Then we have
Total Sum of Squares = Σ_{i=1}^n (yi − ȳ)² = yᵀ( I − (1/n)J )y
Explained Sum of Squares = Σ_{i=1}^n (ŷi − ȳ)² = yᵀ( P − (1/n)J )y
Residual Sum of Squares = Σ_{i=1}^n (yi − ŷi)² = yᵀ( I − P )y
79
Sum of Squares (cont.)
Therefore, we have
Total Sum of Squares = Σ_{i=1}^n (yi − ȳ)²
= yᵀ( I − (1/n)J )y
= yᵀ( I − (1/n)J + P − P )y
= yᵀ( P − (1/n)J )y + yᵀ( I − P )y
= Explained Sum of Squares + Residual Sum of Squares
80
Sum of Squares: Degrees of Freedom
The corresponding degrees of freedom are
I Total Sum of Squares: dfT = n − 1
I Explained Sum of Squares: dfE = p
I Residual Sum of Squares: dfR = n − p − 1
81
Coefficient of Multiple Determination
Recall: The coefficient of multiple determination is defined as
R² = Σ_{i=1}^n (ŷi − ȳ)² / Σ_{i=1}^n (yi − ȳ)² = 1 − Σ_{i=1}^n (yi − ŷi)² / Σ_{i=1}^n (yi − ȳ)².
Interpretation: Gives the amount of variation in y that is explained by the linear relationships with the covariates X.
Note: When interpreting R² values, note that
I 0 ≤ R² ≤ 1
I Large R² values do not necessarily imply a good model
I The more covariates we include in our model, the higher R² is (Why?)
82
Adjusted Coefficient of Multiple Determination
Problem with R²: Including more and more predictors can artificially inflate R²
I Capitalizing on spurious effects present in noisy data
I Phenomenon of over-fitting the data
The adjusted R² is a relative measure of fit:
Rα² = 1 − [ Σ_{i=1}^n (yi − ŷi)² / dfR ] / [ Σ_{i=1}^n (yi − ȳ)² / dfT ] = 1 − σ̂² / s²y ,
where s²y = (1/(n − 1)) Σ_{i=1}^n (yi − ȳ)² is the sample estimate of the variance of y.
83
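A minimal numerical check of the hat matrix, the sum-of-squares decomposition, and R² on simulated data (names illustrative):

  set.seed(1)
  n <- 60
  x1 <- rnorm(n); x2 <- rnorm(n)
  y  <- 1 + x1 + 2*x2 + rnorm(n)
  X  <- cbind(1, x1, x2)
  P  <- X %*% solve(t(X) %*% X) %*% t(X)       # hat matrix P = X (X'X)^{-1} X'
  y_hat <- P %*% y                             # fitted values
  SS_tot <- sum((y - mean(y))^2)
  SS_exp <- sum((y_hat - mean(y))^2)           # explained sum of squares
  SS_res <- sum((y - y_hat)^2)                 # residual sum of squares
  c(SS_tot, SS_exp + SS_res)                   # the decomposition holds
  c(SS_exp / SS_tot, summary(lm(y ~ x1 + x2))$r.squared)   # both equal R^2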
Properties of the LSE in Matrix Notation
Projection X
Residuals X
Sampling Distribution
Underfitting and Overfitting
Multicollinearity
84
Sampling Distribution: Prerequisite
In multiple linear regression, the sampling distribution is the joint distribution of (β̂0 , . . . , β̂p )ᵀ. We need to introduce some new concepts to describe the distribution of β̂.
I Covariance between two random vectors a ∈ Rᵐ and b ∈ Rⁿ
I Multivariate normal distribution
85
Covariance Matrix
I Covariance between two random vectors a ∈ Rᵐ and b ∈ Rⁿ:
cov(a, b) = [ cov(a1 , b1) cov(a1 , b2) · · · cov(a1 , bn) ; cov(a2 , b1) cov(a2 , b2) · · · cov(a2 , bn) ; . . . ; cov(am , b1) cov(am , b2) · · · cov(am , bn) ],
the m × n matrix with (i, j) entry cov(ai , bj).
I Covariance between Aa and Bb: cov(Aa, Bb) = A cov(a, b) Bᵀ
I Partitioned covariance: for a = (a1ᵀ, a2ᵀ)ᵀ and b = (b1ᵀ, b2ᵀ)ᵀ,
cov(a, b) = [ cov(a1 , b1) cov(a1 , b2) ; cov(a2 , b1) cov(a2 , b2) ]
I var(a) = cov(a, a)
86
Multivariate Normal Distribution
Let x = (x1 , . . . , xp )ᵀ and let x ∼ Np(µ, Σ). The multivariate normal density function is
f(x) = 1 / ( (2π)^(p/2) |Σ|^(1/2) ) × exp( −(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ) ),
where
I µ = (µ1 , . . . , µp )ᵀ is the p-dimensional mean vector
I Σ = [ σ11 · · · σ1p ; . . . ; σp1 · · · σpp ] is the p × p covariance matrix
87
Properties of Multivariate Normal Distribution
Properties of the mean and covariance parameters:
I µj ∈ R for all j
I σjj > 0 for all j
I σij = ρij √(σii σjj), where ρij is the correlation between Xi and Xj
I σij² ≤ σii σjj for all i, j ∈ {1, . . . , p}
The marginals of a multivariate normal are univariate normal:
Xj ∼ N(µj , σjj) for all j ∈ {1, . . . , p}
88
Affine Transformation of Multivariate Normal
Let x ∼ Np (µ, Σ).
Let A = {aij }n×p be a non-random matrix and consider a
non-random vector b = (b1 , . . . , bn )T .
Define y = Ax + b with A ≠ 0n×p . Then
y ∼ Nn(Aµ + b, AΣAᵀ)
Linear combinations of normal variables are normally
distributed!!!
89
Conditional Distribution of Multivariate Random Variables
If Σ is positive definite and x and y are jointly multivariate normal,
( x , y ) ∼ N( (µx , µy) , [ Σx Σxy ; Σyx Σy ] ),
then the conditional distribution of x given y is
x | y ∼ N( µx + Σxy Σy⁻¹ (y − µy) , Σx − Σxy Σy⁻¹ Σyx ).
Important Property: x and y are independent if and only if
Σxy = 0.
90
Sampling Distribution: Our Model Assumptions
Assume that
1. The errors have mean zero: E[ε] = 0.
2. The errors are uncorrelated with common variance:
var(ε) = σ²I.
These imply that
1. E[y] = E[Xβ + ε] = Xβ
2. var(y) = var(Xβ + ε) = var(ε) = σ²I
91
Mean and Variance of Our Estimates
This slide is incomplete. Take notes!
1. The least squares estimate is unbiased: E[β̂] = β
2. The covariance matrix of the least squares estimate is
var(β̂) = σ²(XᵀX)⁻¹.
Proof:
See Wikipedia for the proof of Gauss-Markov theorem (BLUE)
92
How about e?
This slide is incomplete. Take notes!
Recall that e = y − ŷ = (I − P )y:
1. E[e] = 0
2. var(e) = σ²[I − P]
3. E[eᵀe] = (n − p − 1)σ².
4. Implication: An unbiased estimate of σ² is
σ̂² = eᵀe / (n − p − 1)
93
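A minimal numerical check, on simulated data, that σ̂² = eᵀe/(n − p − 1) and σ̂²(XᵀX)⁻¹ match what summary() and vcov() report (names illustrative):

  set.seed(1)
  n <- 80
  x1 <- rnorm(n); x2 <- rnorm(n)
  y  <- 1 + x1 - x2 + rnorm(n)
  fit <- lm(y ~ x1 + x2)
  e   <- resid(fit)
  X   <- model.matrix(fit)
  sigma2_hat <- sum(e^2) / (n - 2 - 1)                   # e'e / (n - p - 1) with p = 2
  c(sigma2_hat, summary(fit)$sigma^2)                    # the two agree
  all.equal(vcov(fit), sigma2_hat * solve(t(X) %*% X))   # sigma^2-hat (X'X)^{-1}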
Sampling Distributions
I Asymptotic Distributions
I Central Limit Theorem
I Bootstrap
I Exact Distributions
I Normality assumption: multivariate t-distribution
I Other distribution assumptions…
Construct confidence intervals/regions given the sampling
distribution as in simple linear regression
94
Properties of the LSE in Matrix Notation
Projection X
Residuals X
Sampling Distribution X
Underfitting and Overfitting
Multicollinearity
95
What Happens When you Underfit the Model?
Suppose that the true underlying model is
y = Xβ + Zη + ε,
but we instead fit the model
y = Xβ + ε.
In fact, in real data applications we often do this, because it is
impossible for us to know which variables are in the true underlying
model.
For simplicity, assume that the columns of X and Z are linearly
independent.
96
Bias due to underfitting
Naive Argument: I am only interested in the parameters β, so
why bother estimating η?
Claim: if we fit the smaller model, E[β̂] ≠ β. The estimates we
get are biased! Even the fitted values are biased!
I E[β̂] = β + (XᵀX)⁻¹XᵀZη
I E[ŷ] = Xβ + X(XᵀX)⁻¹XᵀZη
97
Example I: Underfitting
Suppose that the true model is
yi = β0 + β1 xi + β2 xi² + εi
but instead we fit the model
yi = β0 + β1 xi + εi
What is the bias of β1 ?
98
Variance if we Underfit the Model?
Suppose that the true underlying model is
y = Xβ + Zη + ε,
but we instead fit the model
y = Xβ + ε.
Claim: cov(β̂) = σ²(XᵀX)⁻¹. But
E[σ̂²] = E[ eᵀe / (n − p − 1) ] = σ² + ηᵀZᵀ(I − PX)Zη / (n − p − 1) > σ².
Implication: We overestimate the variance!
99
Example in R on Underfitting
Scenario I: True β = (1, 1)ᵀ
y = Xβ + ε
Fit the correct model
y = Xβ + ε
Scenario II: True β = (1, 1)ᵀ
y = Xβ + Zη + ε
Underfit the model
y = Xβ + ε
(See Section 3.5 in Code MultipleLinearRegression.html.)
100
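The official example is in Section 3.5 of the course code; the following is only a minimal simulated sketch of Scenario II, showing that omitting Zη biases β̂ when X and Z are correlated (all numbers illustrative):

  set.seed(1)
  n <- 1000
  z <- rnorm(n)
  x <- 0.7*z + rnorm(n)           # x is correlated with the omitted variable z
  y <- 1 + 1*x + 2*z + rnorm(n)   # true beta = (1, 1), eta = 2
  coef(lm(y ~ x))                 # underfit: the slope is pulled away from 1
  coef(lm(y ~ x + z))             # correct model: the slope is close to 1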
What Happens When you Overfit the Model?
Suppose that the true underlying model is
y = X1 β1 + ε,
but we instead fit the model
y = X1 β1 + X2 β2 + ε = Xβ + ε,
where X = (X1 , X2) and β = (β1ᵀ, β2ᵀ)ᵀ.
What will happen to our estimates β̂1 ?
101
Bias due to Overfitting
Claim: if we fit the larger model, E[β̂] = β. The estimates we get
are unbiased! Even the fitted values and σ̂² are unbiased!
Proof:
102
Why don’t we keep Overfitting then?
Claim: The variance of β̂ will be larger!!! Too complicated and we
will skip the results.
Scenario: True β1 = (1, 1)ᵀ
y = X1 β1 + ε
Overfit the model
y = X1 β1 + X2 β2 + ε
(See Section 3.6 in Code MultipleLinearRegression.html.)
103
Summary of Effects of Underfitting and Overfitting
               β̂          ŷ          σ̂²               cov(β̂)
Underfitting   biased      biased      biased upward     still σ²(XᵀX)⁻¹
Overfitting    unbiased    unbiased    unbiased          increased
Model selection: How many variables are sufficient, and what
variables should we include?
I ???
I ???
I ???
104
Properties of the LSE in Matrix Notation
Projection X
Residuals X
Sampling Distribution X
Underfitting and Overfitting X
Multicollinearity
105
Understanding Multicollinearity using Projection
Multicollinearity is a phenomenon in which one predictor variable
in a multiple regression model can be linearly predicted from the
others with a substantial degree of accuracy.
1. This means that there are strong linear dependencies among
the columns of X.
2. We refer to such an X as an almost singular matrix.
3. Does multicollinearity affect E[β̂]?
4. Does multicollinearity affect var(β̂)?
5. What are the implications of multicollinearity?
Solutions
I Remove excessive variables
I Shrinkage estimator
106
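A small simulated illustration of multicollinearity using base R only (kappa() reports the condition number of the design matrix; all names and numbers illustrative):

  set.seed(1)
  n  <- 100
  x1 <- rnorm(n)
  x2 <- x1 + rnorm(n, sd = 0.01)              # nearly a copy of x1: X is almost singular
  y  <- 1 + x1 + x2 + rnorm(n)
  fit <- lm(y ~ x1 + x2)
  summary(fit)$coefficients[, "Std. Error"]   # very large standard errors for x1 and x2
  kappa(cbind(1, x1, x2))                     # very large condition number of X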
Generalized Least Squares
107
Linear Regression Assumptions
Model:
y = Xβ + ε
1. Constant variance assumption: var(ε) = σ²I.
2. Uncorrelated errors.
What if these assumptions are violated? How does it affect
our solution if we fit the ordinary multiple linear regression?
108
Non-constant Variance and Correlated Error
Suppose that
y = Xβ + ε
1. Non-constant variance: var(ε) = σ²V for some positive
definite matrix V.
2. Correlated errors.
How should we estimate β?
109
Motivating Example: Clustered Data
(See Section 3.8 in Code MultipleLinearRegression.html.)
110
Model Generation with R
111
R fit using Ordinary Multiple Linear Regression
Model:
y = Xβ + ε,
where E[ε] = 0 and var(ε) = σ²V with σ² = 1.
(See Section 3.8 in Code MultipleLinearRegression.html.)
112
Generalized Least Squares
Suppose that we have
y = Xβ + ε,
where E[ε] = 0 and var(ε) = σ²V.
Goal: Transform the above model so that it has uncorrelated data
points, and then fit the multiple linear regression.
113
Generalized Least Squares
Claim: Let V = UDUᵀ be the eigendecomposition (singular value
decomposition) of V, and let K = UD^(−1/2)Uᵀ. Then K is the
inverse of the square root of V.
I Create y∗ = Ky.
I Create X∗ = KX.
I Create ε∗ = Kε.
Then,
y∗ = X∗β + ε∗,
where E[ε∗] = 0 and var(ε∗) = σ²I.
Then, we can fit the multiple linear regression we have already
learnt using the new data points y∗ and X∗.
(See Section 3.8 in Code MultipleLinearRegression.html.)
114
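A minimal simulated sketch of this transformation: compute K = UD^(−1/2)Uᵀ from the eigendecomposition of V, whiten y and X, fit ordinary least squares, and compare with the closed-form GLS estimate (the AR(1)-type V below is illustrative):

  set.seed(1)
  n <- 100
  x <- rnorm(n)
  X <- cbind(1, x)
  V <- 0.5^abs(outer(1:n, 1:n, "-"))    # an illustrative positive definite correlation matrix
  e <- t(chol(V)) %*% rnorm(n)          # errors with variance sigma^2 V, sigma^2 = 1
  y <- X %*% c(1, 2) + e
  ed <- eigen(V)
  K  <- ed$vectors %*% diag(1 / sqrt(ed$values)) %*% t(ed$vectors)   # K = U D^(-1/2) U'
  ystar <- drop(K %*% y)                # transformed response
  Xstar <- K %*% X                      # transformed design matrix
  coef(lm(ystar ~ Xstar - 1))           # OLS on the transformed data
  drop(solve(t(X) %*% solve(V) %*% X, t(X) %*% solve(V) %*% y))   # closed-form GLS, same numbers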
Mean and Variance of Generalized Least Squares Estimator
Model:
y∗ = X∗β + ε∗,
where E[ε∗] = 0 and var(ε∗) = σ²I with σ² = 1.
GLS Estimate:
β̂GLS = (XᵀV⁻¹X)⁻¹XᵀV⁻¹y.
Claim:
1. E[β̂GLS] = β.
2. var(β̂GLS) = σ²(XᵀV⁻¹X)⁻¹.
3. Residual sum of squares: (y − Xβ)ᵀV⁻¹(y − Xβ).
115
GLS versus OLS
Assume the following model:
y = Xβ + ε
with var(ε) = σ²V for some positive definite matrix V and
E[ε] = 0.
What are the properties of ordinary multiple linear regression
under this model?
I E[β̂] = β
I var(β̂) = σ²(XᵀX)⁻¹XᵀVX(XᵀX)⁻¹
116
Weighted Least Squares for Unequal Variances
Consider the linear regression model
yi = βxi + εi ,
where var(ε) = σ² diag(1/w1 , . . . , 1/wn ).
For unknown w1 , . . . , wn , use iteratively reweighted least squares.
117
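For known weights, weighted least squares is available through the weights argument of lm(). A minimal sketch matching the model on this slide (simulated data, illustrative):

  set.seed(1)
  n <- 200
  x <- runif(n, 1, 10)
  w <- 1 / x                              # known weights: var(eps_i) = sigma^2 / w_i = sigma^2 * x_i
  y <- 2*x + rnorm(n, sd = sqrt(x))       # heteroscedastic errors
  fit_wls <- lm(y ~ x - 1, weights = w)   # the model on the slide has no intercept
  coef(fit_wls)                           # close to the true slope 2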
