feat_sel

Task 1

Interior-point (affine scaling) algorithm applied to the LP: maximize $x_1 + 2x_2$ subject to $x_1 + x_2 \le 8$, $x_1, x_2 \ge 0$ (written in standard form with a slack variable $x_3$). The optimal solution is $(x_1, x_2) = (0, 8)$.

In [1]:
import numpy as np
np.set_printoptions(precision=3, suppress=True)


# LP data: maximize x1 + 2*x2 subject to x1 + x2 + x3 = 8, x >= 0 (x3 is a slack)
c = np.array([1, 2, 0])
A = np.array([[1, 1, 1]])

b = np.array([8])

alpha = 0.5  # step-size parameter of the interior point method


# Affine-scaling iterations, starting from a strictly interior feasible point
x = np.array([1, 3, 4])
x_old = np.array([0, 0, 0])

while np.linalg.norm(x_old - x) > 0.001:
    x_old = x
    D = np.diag(x)                # scaling matrix built from the current iterate
    D1 = np.linalg.inv(D)

    x_tilde = D1 @ x              # scaled point (the all-ones vector)
    A_tilde = A @ D
    c_tilde = c @ D

    # Project the scaled objective onto the null space of A_tilde
    P = np.identity(3) - A_tilde.T @ np.linalg.inv(A_tilde @ A_tilde.T) @ A_tilde
    p_tilde = P @ c_tilde

    #v = A_tilde @ c_tilde
    #w = np.linalg.solve(A_tilde @ A_tilde.T, v)
    #p1 = A_tilde.T @ w - c_tilde

    # mu: magnitude of the most negative component of the projected direction
    mu = np.max([abs(v) for v in p_tilde if v < 0])

    x_tilde = x_tilde + alpha / mu * p_tilde
    x = D @ x_tilde               # map back to the original space
    print(x)
[1.046 4.954 2.   ]
[0.934 6.066 1.   ]
[0.724 6.776 0.5  ]
[0.465 7.285 0.25 ]
[0.249 7.626 0.125]
[0.125 7.812 0.062]
[0.063 7.906 0.031]
[0.031 7.953 0.016]
[0.016 7.977 0.008]
[0.008 7.988 0.004]
[0.004 7.994 0.002]
[0.002 7.997 0.001]
[0.001 7.999 0.   ]
[0.    7.999 0.   ]
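
As a sanity check (a minimal sketch, assuming SciPy is installed; it is not used elsewhere in this notebook), the same LP can be solved directly with scipy.optimize.linprog, which should return the vertex the affine-scaling iterates converge to:

from scipy.optimize import linprog

# linprog minimizes, so negate c to maximize x1 + 2*x2 subject to x1 + x2 + x3 = 8
res = linprog(c=-c, A_eq=A, b_eq=b, bounds=[(0, None)] * 3)
print(res.x)     # expected to match the limit of the iterates above: [0, 8, 0]
print(-res.fun)  # optimal objective value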

Task 3

Solution Approach

Sets and Indices

$i \in I=\{1,2,\dots,n\}$: Set of observations.

$j \in J=\{0,1,2,\dots,p\}$: Set of features, where index $0$ corresponds to the intercept.

$\ell \in L = J \setminus \{0\}$: Set of features, excluding the intercept.

Parameters

$s \in \mathbb{N}$: Number of features to include in the model, not counting the intercept.

Decision Variables

$\beta_j \in \mathbb{R}$: Weight of feature $j$, representing the change in the response variable per unit-change of feature $j$.

$z_\ell \in \{0,1\}$: 1 if the weight of feature $\ell$ is forced to be exactly zero, and 0 otherwise. Auxiliary variable used to enforce the budget constraint.

Objective Function

  • Training error: Minimize the Sum of Absolute Residuals (the L1 loss), which is more robust to outliers than the squared loss (the absolute values are linearized below):
\begin{equation} \text{Min} \quad Z = \sum_{i \in I}\left|y_i-\sum_{j \in J}\beta_{j}x_{ij}\right| \end{equation}

Constraints

  • Mutual exclusivity: For each feature $\ell$, if $z_\ell=1$, then $\left|\beta_\ell\right|=0$
\begin{equation} \left|\beta_\ell\right|\leq M (1-z_\ell) \end{equation}

Here the big-M is an upper bound on the $\beta$ variables, which are in principle unbounded. We could introduce an artificial upper bound, and we could also add a lasso penalty to the objective function:

\begin{equation} \text{Min} \quad Z = \sum_{i \in I}\left|y_i-\sum_{j \in J}\beta_{j}x_{ij}\right| + \lambda \sum_{\ell \in L}\left|\beta_\ell\right| \end{equation}

Alternatively, we can declare each pair $(\beta_\ell, z_\ell)$ as a Special Ordered Set of type 1, meaning that at most one of the two variables can be different from zero (the variables may be either integer or continuous):

\begin{equation} (\beta_\ell, z_\ell): \text{SOS-1} \quad \forall \ell \in L \tag{1} \end{equation}
  • Budget constraint: Exactly $|L| - s$ feature weights must be equal to zero:
\begin{equation} \sum_{\ell \in L}z_\ell = |L| - s \tag{2} \end{equation}
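
Since the absolute values in the objective and in the big-M constraint are not linear, the implementation below uses the standard split into non-negative parts; the auxiliary variables $e_i^{+}, e_i^{-}$ and $\beta_\ell^{+}, \beta_\ell^{-}$ introduced here correspond to errp/errn and betap/betan in the code:

\begin{align}
y_i-\sum_{j \in J}\beta_{j}x_{ij} &= e_i^{+}-e_i^{-}, \quad e_i^{+},e_i^{-}\geq 0 && \forall i \in I \\
\beta_\ell &= \beta_\ell^{+}-\beta_\ell^{-}, \quad \beta_\ell^{+},\beta_\ell^{-}\geq 0 && \forall \ell \in L \\
\beta_\ell^{+}+\beta_\ell^{-} &\leq M(1-z_\ell) && \forall \ell \in L
\end{align}

At the optimum each pair sums to the corresponding absolute value, so the objective becomes $\text{Min} \quad Z = \sum_{i \in I}\left(e_i^{+}+e_i^{-}\right)$.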

This model, by means of constraint (2), implicitly considers all ${{p} \choose s}$ feature subsets at once. However, we also need to find the value of $s$ that maximizes the performance of the regression on unseen observations. Notice that the training error decreases monotonically as more features are added, so it is not advisable to use it as the performance metric. Instead, we should estimate the Mean Squared Error (MSE) via cross-validation. This metric is defined as $\text{MSE}=\frac{1}{n}\sum_{i=1}^{n}{(y_i-\hat{y}_i)^2}$, where $y_i$ and $\hat{y}_i$ are the observed and predicted values for the $i$-th observation, respectively. We then fine-tune $s$ by grid search, which is affordable since the set of candidate values is small; a sketch of this loop follows.
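
The cross-validation loop itself is not part of the optimization model; a minimal sketch of the grid search over $s$ is given below, assuming a hypothetical helper fit_l1_subset(X, y, s) that builds and solves the MIP of the next cells and returns the coefficient vector with the intercept in position 0:

import numpy as np
from sklearn.model_selection import KFold

def cv_mse(X, y, s, n_splits=5):
    """Cross-validated MSE of the best s-feature L1 model (sketch)."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    fold_mse = []
    for tr, va in kf.split(X):
        # fit_l1_subset is a hypothetical wrapper around the MIP built below
        coef = fit_l1_subset(X[tr], y[tr], s)
        yhat = coef[0] + X[va] @ coef[1:]   # intercept + linear part
        fold_mse.append(np.mean((y[va] - yhat) ** 2))
    return np.mean(fold_mse)

# Grid search: evaluate each candidate s and keep the one with the lowest MSE
# best_s = min(range(1, Xtrain_std.shape[1] + 1), key=lambda s: cv_mse(Xtrain_std, ytrain, s))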

In [2]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import math

import gurobipy as gp
from gurobipy import GRB

from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
In [3]:
# Load data and split into train (80%) and test (20%)
boston = load_boston()
X = boston['data']
y = boston['target']
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2,random_state=10101)
/home/marco/.local/lib/python3.8/site-packages/scikit_learn-1.0.1-py3.8-linux-x86_64.egg/sklearn/utils/deprecation.py:87: FutureWarning: Function load_boston is deprecated; `load_boston` is deprecated in 1.0 and will be removed in 1.2.

    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np


        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_housing
        housing = fetch_california_housing()

    for the California housing dataset and::

        from sklearn.datasets import fetch_openml
        housing = fetch_openml(name="house_prices", as_frame=True)

    for the Ames housing dataset.
    
  warnings.warn(msg, category=FutureWarning)
In [4]:
scaler = StandardScaler()
scaler.fit(Xtrain)
Xtrain_std = scaler.transform(Xtrain)
Xtest_std = scaler.transform(Xtest)
In [5]:
Xtrain_std.shape[1]
Out[5]:
13
In [6]:
m = gp.Model("fitting")

# Sets
p = Xtrain_std.shape[1] # n of features
n = Xtrain_std.shape[0] # n of observations (data points)

I=range(n)
J=range(p+1)
L=range(1,p+1)

# Parameters
M=10000
s=5 # features in the model

# Create the decision variables
beta = {}
for j in J:
    beta[j] = m.addVar(lb=-float('inf'), vtype=GRB.CONTINUOUS, name=f"beta_{j}")

zeta = {}
for j in L:
    zeta[j] = m.addVar(vtype=GRB.BINARY, name="zeta"+str(j))
    
# auxiliary variables to handle absolute values
betap = m.addVars(L, lb=0.0, vtype=GRB.CONTINUOUS, name="betap")
betan = m.addVars(L, lb=0.0, vtype=GRB.CONTINUOUS, name="betan")
        
errp = m.addVars(I, lb=0.0, vtype=GRB.CONTINUOUS, name="errp")
errn = m.addVars(I, lb=0.0, vtype=GRB.CONTINUOUS, name="errn")


# The objective is to minimize the sum of absolute residuals
m.modelSense = gp.GRB.MINIMIZE
m.setObjective(gp.quicksum(errp[i] + errn[i] for i in I))


# Residual split: errp[i] - errn[i] equals the i-th residual
for i in I:
    m.addConstr(errp[i] - errn[i] == ytrain[i] - beta[0] - gp.quicksum(beta[j]*Xtrain_std[i][j-1] for j in L))

# Budget: at least p - s weights are forced to zero, i.e. at most s features are selected
m.addConstr(gp.quicksum(zeta[j] for j in L) >= p - s)

# Big-M link: if zeta[j] = 1 then betap[j] = betan[j] = 0, hence beta[j] = 0
for j in L:
    m.addConstr(betap[j] - betan[j] == beta[j])
    m.addConstr(betap[j] + betan[j] <= M*(1 - zeta[j]))

#for j in L:
    # If zeta[i]=1, then beta[i] = 0
    # m.addSOS(GRB.SOS_TYPE1, [zeta[j], beta[j]])
    

# Solve
m.write("feat_sel.lp")
#m.display()
m.optimize()


if m.status == gp.GRB.status.OPTIMAL:
    print('\nSum of Absolute Residuals (Training Error): %g' % m.ObjVal)
    print('\nbetas:')
    for j in range(p+1):
        if math.fabs(beta[j].X) > 1e-4:
            print('%s %g' % (j, beta[j].X))
    print('\nzetas:')
    for j in range(1,p+1):
        #if zeta[j].X>0+0.0001:
        print('%s %g' % (j, zeta[j].X))
else:
    print('No solution')
Academic license - for non-commercial use only - expires 2022-09-27
Using license file /opt/gurobi/gurobi.lic
Gurobi Optimizer version 9.1.2 build v9.1.2rc0 (linux64)
Thread count: 4 physical cores, 8 logical processors, using up to 8 threads
Optimize a model with 431 rows, 861 columns and 6555 nonzeros
Model fingerprint: 0x0cd87ed7
Variable types: 848 continuous, 13 integer (13 binary)
Coefficient statistics:
  Matrix range     [3e-04, 1e+04]
  Objective range  [1e+00, 1e+00]
  Bounds range     [1e+00, 1e+00]
  RHS range        [5e+00, 1e+04]
Found heuristic solution: objective 9112.8000000
Presolve time: 0.01s
Presolved: 431 rows, 861 columns, 6555 nonzeros
Variable types: 848 continuous, 13 integer (13 binary)

Root relaxation: objective 1.248587e+03, 451 iterations, 0.03 seconds

    Nodes    |    Current Node    |     Objective Bounds      |     Work
 Expl Unexpl |  Obj  Depth IntInf | Incumbent    BestBd   Gap | It/Node Time

     0     0 1248.58702    0    9 9112.80000 1248.58702  86.3%     -    0s
H    0     0                    1589.0568379 1248.58702  21.4%     -    0s
H    0     0                    1418.0948020 1248.58702  12.0%     -    0s
     0     0 1258.80541    0    8 1418.09480 1258.80541  11.2%     -    0s
     0     0 1259.93204    0    8 1418.09480 1259.93204  11.2%     -    0s
     0     0 1268.34333    0    8 1418.09480 1268.34333  10.6%     -    0s
H    0     0                    1409.0547620 1268.34333  10.0%     -    0s
     0     0 1284.79586    0    9 1409.05476 1284.79586  8.82%     -    0s
     0     0 1284.79586    0    9 1409.05476 1284.79586  8.82%     -    0s
     0     0 1295.29475    0    9 1409.05476 1295.29475  8.07%     -    0s
     0     0 1303.99077    0    9 1409.05476 1303.99077  7.46%     -    0s
     0     2 1303.99077    0    9 1409.05476 1303.99077  7.46%     -    0s
*   20    18               6    1385.3583203 1303.99077  5.87%  21.4    0s
*   68    11               8    1360.5415651 1303.99077  4.16%  28.4    0s

Cutting planes:
  MIR: 4

Explored 152 nodes (3686 simplex iterations) in 0.43 seconds
Thread count was 8 (of 8 available processors)

Solution count 6: 1360.54 1385.36 1409.05 ... 9112.8

Optimal solution found (tolerance 1.00e-04)
Best objective 1.360541565094e+03, best bound 1.360541565094e+03, gap 0.0000%

Sum of Absolute Residuals (Training Error): 1360.54

betas:
0 21.7699
6 3.65536
8 -0.953577
11 -1.74308
12 1.64879
13 -3.5069

zetas:
1 1
2 1
3 1
4 1
5 1
6 -0
7 1
8 -0
9 1
10 1
11 -0
12 -0
13 -0
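
To estimate out-of-sample performance as discussed above, the fitted coefficients can be evaluated on the held-out test set. A minimal sketch, assuming the cells above have been run:

# Collect the fitted intercept and feature weights from the Gurobi variables
b0 = beta[0].X
w = np.array([beta[j].X for j in L])

# Predict on the standardized test set and report the test MSE
yhat_test = b0 + Xtest_std @ w
mse_test = np.mean((ytest - yhat_test) ** 2)
print(f"Test MSE with s={s} selected features: {mse_test:.2f}")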