Notebook 3: dealing with missing data in a non-oracle setting

This notebook explains how to use the LinearRegressorNA class to estimate, in the presence of genuinely missing values, the coefficients of the underlying model, assumed to be linear. In practice, if the dataset contains missing values, note the following facts:

  • we cannot compute the regularized quadratic loss, $$\frac{1}{2n}\|y-X\beta\|^2 + \frac{\lambda}{2}\|\beta\|^2,$$ since we do not have access to the complete data matrix $X$.

  • we assume a heterogeneous MCAR mechanism. The probabilities $p_j$, $j=1,\dots,d$, that an entry of column $j$ is observed are estimated from the binary matrix $D$ coding for the observed entries (see the sketch after cell In [3] below).

In this notebook, we use synthetic data in which we introduce missing values, but we then treat the complete matrix as unknown.

In [1]:
import numpy as np
from random import randint
import matplotlib
import matplotlib.pyplot as plt

# Raise the Agg path chunk size so plots with many points render quickly
matplotlib.rcParams['agg.path.chunksize'] = 10000
In [2]:
## Synthetic data
np.random.seed(42)
n = int(1e5)
d = 3

# Covariance matrix H with prescribed eigenvalues 1, 1/2, ..., 1/d:
# draw a random matrix, symmetrize it to get an orthonormal eigenbasis P,
# then rebuild H from P and the prescribed spectrum.
Rdm_M = np.random.randn(d, d)
H_temp = Rdm_M @ Rdm_M.T
diag = np.diag(1 / np.arange(1, d + 1))
P = np.linalg.eig(H_temp)[1]
H = P @ diag @ P.T
X = np.random.multivariate_normal(np.zeros(d), H, n)

# Linear model y = X @ beta_true + Gaussian noise
beta_true = np.random.randn(d)
sigma = 1
y = X @ beta_true + sigma * np.random.randn(n)
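
As a quick sanity check (our own addition, not part of the original notebook), the spectrum of $H$ should be exactly $1, 1/2, \dots, 1/d$:

# Eigenvalues of H, largest first; expected: [1, 1/2, ..., 1/d]
print(np.sort(np.linalg.eigvalsh(H))[::-1])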
In [3]:
# Heterogeneous MCAR setting: entry (i, j) is observed with probability p[j]
p = np.random.uniform(low=0.7, high=1, size=d)
D_init = np.zeros((n, d))
for j in range(d):
    D_init[:, j] = np.random.binomial(n=1, p=p[j], size=n)

# Missing entries are coded by 0 (the mask D_init has 1 for observed entries)
X_NA = X * D_init

# Drop the rows in which every entry is missing
index = D_init.sum(axis=1) != 0
X_NA = X_NA[index]
X = X[index]
y = y[index]
D = D_init[index]
n, d = X.shape
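
In the non-oracle setting the observation probabilities $p_j$ are unknown; as announced above, they can be estimated from the mask $D$ alone. A minimal sketch (the name `p_hat` is ours, not part of the library):

# Column-wise fraction of observed entries: an unbiased estimate of p
p_hat = D.mean(axis=0)
print(np.c_[p, p_hat])  # true vs estimated observation probabilities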
In [4]:
from sgd_lin_na_realdata import *  # provides LinearRegressorNA, sgdNA and avsgdNA
In [5]:
nepoch = 100
beta0 = np.zeros(d)
model = LinearRegressorNA(X=X, D=D, y=y, strength=0)  # strength: regularization strength (0: no penalty)
L = model.lip_max()
step = 1 / (2 * L)
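
We have not reproduced lip_max here. For the complete-data quadratic loss, the usual per-sample smoothness constant is the largest squared row norm, so a plausible stand-in, under that assumption only (the library's constant may also account for the missingness probabilities), would be:

# Hypothetical stand-in for model.lip_max(): for f_i(beta) = 0.5*(x_i @ beta - y_i)**2,
# the gradient x_i * (x_i @ beta - y_i) is ||x_i||^2-Lipschitz in beta.
L_guess = np.max(np.sum(X_NA ** 2, axis=1))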
In [6]:
# SGD with a decaying step schedule (choice_step="sqrt")
beta_sgd = sgdNA(model,
                 beta0,
                 nepoch,
                 choice_step="sqrt",
                 step=step,
                 callback_modulo=n)
In [7]:
# SGD with a constant step (choice_step="cst")
beta_sgd_cst = sgdNA(model,
                     beta0,
                     nepoch,
                     choice_step="cst",
                     step=step,
                     callback_modulo=n)
In [8]:
# AvSGD: averaged SGD, the iterates are averaged along the run
beta_sgd_av = avsgdNA(model,
                      beta0,
                      nepoch,
                      step=step,
                      callback_modulo=n)
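
The three calls above differ only in the step-size schedule and in whether the iterates are averaged. As a point of reference, here is a minimal generic sketch of these strategies; the gradient oracle `grad_i` and every name below are our own assumptions, not the implementation of sgd_lin_na_realdata:

def sgd_sketch(grad_i, d, n_samples, nepoch, step, schedule="sqrt", average=False, seed=0):
    """Generic (averaged) SGD loop; a sketch, not the library's code."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(d)
    beta_bar = np.zeros(d)  # Polyak-Ruppert running average of the iterates
    t = 0
    for _ in range(nepoch):
        for i in rng.integers(0, n_samples, size=n_samples):  # one pass over the data
            t += 1
            gamma = step / np.sqrt(t) if schedule == "sqrt" else step  # "sqrt" vs "cst"
            beta = beta - gamma * grad_i(i, beta)
            beta_bar += (beta - beta_bar) / t  # online mean of all iterates
    return beta_bar if average else beta

Averaging is what allows a large, even constant, step while still converging to the optimum, which the comparison below illustrates.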
In [9]:
import pandas as pd

# Gather the three estimates next to the true coefficients
res_df = pd.DataFrame()
res_df['SGD'] = beta_sgd
res_df['SGD_cst'] = beta_sgd_cst
res_df['AvSGD'] = beta_sgd_av
res_df['True'] = beta_true
In [10]:
res_df
Out[10]:
        SGD   SGD_cst     AvSGD      True
0 -0.772760 -0.805541 -0.773249 -0.794006
1 -0.362042 -0.372288 -0.360295 -0.344326
2  0.959762  0.999712  0.960420  0.917108

We evaluate each estimate by computing its distance to the true parameter $\beta$: $$\|\hat{\beta}-\beta\|_2.$$

In [11]:
norm_sgd = np.linalg.norm(beta_sgd - beta_true)
norm_sgdcst = np.linalg.norm(beta_sgd_cst - beta_true)
norm_avsgd = np.linalg.norm(beta_sgd_av - beta_true)
In [12]:
print('Distance to the minimizer - SGD    ', "%.2e" %(norm_sgd))
print('Distance to the minimizer - SGD cst', "%.2e" %(norm_sgdcst))
print('Distance to the minimizer - avSGD  ', "%.2e" %(norm_avsgd))
Distance to the minimizer - SGD     5.08e-02
Distance to the minimizer - SGD cst 8.80e-02
Distance to the minimizer - avSGD   5.06e-02
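
Consistently with the theory, the decaying-step and averaged variants come closer to $\beta$ than constant-step SGD, whose iterates only oscillate in a neighborhood of the optimum.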