(Math)
Let $D$ be the distribution over the data points $(x, y)$, and let $H$ be the hypothesis class, in which one would like to find a function $f$ that has a small expected loss $L(f)$ by minimizing the empirical loss $\hat{L}(f)$. A few definitions:
• The best function among all (measurable) functions is called the Bayes hypothesis: $f^* = \arg\inf_f L(f)$.
• The best function in the hypothesis class is denoted as $f_{\text{opt}} = \arg\inf_{f \in H} L(f)$.
• The function that minimizes the empirical loss in the hypothesis class is denoted as $\hat{f}_{\text{opt}} = \arg\inf_{f \in H} \hat{L}(f)$.
• The function output by the algorithm is denoted as $\hat{f}$. (It can be different from $\hat{f}_{\text{opt}}$ since the optimization may not find the best solution.)
• The difference between the loss of $f^*$ and $f_{\text{opt}}$ is called the approximation error: $\epsilon_{\text{app}} = L(f_{\text{opt}}) - L(f^*)$, which measures the error introduced in building the model/hypothesis class.
• The difference between the loss of $f_{\text{opt}}$ and $\hat{f}_{\text{opt}}$ is called the estimation error: $\epsilon_{\text{est}} = L(\hat{f}_{\text{opt}}) - L(f_{\text{opt}})$, which measures the error introduced by using finite data to approximate the distribution $D$.
• The difference between the loss of $\hat{f}_{\text{opt}}$ and $\hat{f}$ is called the optimization error: $\epsilon_{\text{opt}} = L(\hat{f}) - L(\hat{f}_{\text{opt}})$, which measures the error introduced in optimization.
• The difference between the loss of $f^*$ and $\hat{f}$ is called the excess risk: $\epsilon_{\text{exc}} = L(\hat{f}) - L(f^*)$, which measures the distance from the output of the algorithm to the best solution possible.
(1) Show that $\epsilon_{\text{exc}} = \epsilon_{\text{app}} + \epsilon_{\text{est}} + \epsilon_{\text{opt}}$.
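
A sketch of one way to verify this, telescoping the three definitions so that each intermediate loss appears once with each sign and cancels:

\begin{align*}
\epsilon_{\text{app}} + \epsilon_{\text{est}} + \epsilon_{\text{opt}}
&= \bigl(L(f_{\text{opt}}) - L(f^*)\bigr) + \bigl(L(\hat{f}_{\text{opt}}) - L(f_{\text{opt}})\bigr) + \bigl(L(\hat{f}) - L(\hat{f}_{\text{opt}})\bigr) \\
&= L(\hat{f}) - L(f^*) = \epsilon_{\text{exc}}.
\end{align*}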





Comments:
This means that to get better performance, one can: 1) build a hypothesis class closer to the ground truth (reducing $\epsilon_{\text{app}}$); 2) collect more data (reducing $\epsilon_{\text{est}}$); 3) improve the optimization (reducing $\epsilon_{\text{opt}}$).




(2) Typically, when one has enough data, the empirical loss concentrates around the expected loss: there exists $\epsilon_{\text{con}} > 0$ such that for any $f \in H$, $|\hat{L}(f) - L(f)| \le \epsilon_{\text{con}}$. Show that in this case, $\epsilon_{\text{est}} \le 2\epsilon_{\text{con}}$.
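
A sketch of the standard argument, applying the concentration bound at both $\hat{f}_{\text{opt}}$ and $f_{\text{opt}}$, together with the fact that $\hat{f}_{\text{opt}}$ minimizes the empirical loss over $H$:

\begin{align*}
\epsilon_{\text{est}} = L(\hat{f}_{\text{opt}}) - L(f_{\text{opt}})
&\le \hat{L}(\hat{f}_{\text{opt}}) + \epsilon_{\text{con}} - L(f_{\text{opt}}) && \text{(concentration at $\hat{f}_{\text{opt}}$)} \\
&\le \hat{L}(f_{\text{opt}}) + \epsilon_{\text{con}} - L(f_{\text{opt}}) && \text{($\hat{f}_{\text{opt}}$ minimizes $\hat{L}$ over $H$)} \\
&\le 2\epsilon_{\text{con}} && \text{(concentration at $f_{\text{opt}}$)}.
\end{align*}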

Comments:
This means that to get a small estimation error, the number of data points should be large enough that concentration happens. The number of data points needed to achieve concentration level $\epsilon_{\text{con}}$ is called the sample complexity, which is an important topic in learning theory and statistics.
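
As an illustration of a concrete sample-complexity bound (under assumptions beyond the problem statement: $n$ i.i.d. samples, a loss bounded in $[0,1]$, and a finite hypothesis class $H$), Hoeffding's inequality combined with a union bound gives

\[
\Pr\Bigl[\exists f \in H : |\hat{L}(f) - L(f)| \ge \epsilon_{\text{con}}\Bigr] \le 2|H|\, e^{-2n\epsilon_{\text{con}}^2} \le \delta
\quad \text{for} \quad n \ge \frac{\log(2|H|/\delta)}{2\epsilon_{\text{con}}^2},
\]

so in this setting the sample complexity grows as $O(\epsilon_{\text{con}}^{-2}\log|H|)$; infinite classes require tools such as VC dimension or Rademacher complexity.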
