It's an online test. Question 1 (10 marks): Short-answer questions: answer each of ...

Question

It's an online test.

Question 1 (10 marks): Short-answer questions: answer each of the following questions in a short paragraph.

1) Given the following datasets:

(A)   (B)   (C)

I. If we want to apply a clustering technique to each dataset, would it be better to apply k-means or DBSCAN? Explain why. (4 marks)

Answer:
- (A) DBSCAN is better. (1 mark)
- (B) DBSCAN is better. (1 mark)
- (C) K-means or DBSCAN. (1 mark)
- Because DBSCAN doesn’t assume a cluster shape, while k-means is suitable for spherical clusters. (1 mark)

II. In figure (A), you can observe some noise in the dataset. (3 marks)
● Which step(s) in the typical Data Science process will help to identify and fix this noise?
● Briefly explain each step.
● Clearly indicate the order of the step(s) as part of your answer.

Answer:
- Data preparation / data exploration (1 mark)
- Order: Data preparation / data exploration (1 mark)
- There should be a detailed explanation for each step (1 mark)

2) Suppose you have a data set that includes two categorical and three numerical columns. (If you don't know the name, you can sketch an example picture.) (3 marks)

i) Name two kinds of graphs that can be used to visualise categorical data.

Answer: Barplot, pie chart (0.5 marks each)

ii) Propose a simple analysis to explore the relationship between a categorical and a numerical column.

Answer: Boxplot by category (1 mark)

iii) Propose a simple analysis to explore the relationship between two numerical columns.

Answer: Scatterplot (1 mark)

Question 2 (4 marks): Consider the following iris dataset, which is used to train a classifier. The attributes are sepal_length, sepal_width, petal_length and petal_width. The class labels are in the ‘target’ column. The dataset contains 150 observations: the first 50 observations are of type ‘Iris-setosa’, the middle 50 observations are ‘Iris-virginica’, and the last 50 observations are ‘Iris-versicolor’.

It is required to train a classifier with 3-fold cross validation. Please answer the following questions in plain English, and explain (you may draw diagrams to explain).

1. What are the necessary step(s) to preprocess the data? Explain why preprocessing is important. (2 marks)

Answer:
- Check for data-entry errors (outliers or typos), which result in a few observations following different logic; extra white space, which affects string comparison; and impossible values, e.g. negative age. (0.5 mark, should mention at least two steps)
- Prepare the data by managing the order of the rows, as they are listed by class. If the rows are not randomized, the training/test data will not be reliable. (1 mark)
- Data preprocessing is important because it helps models to perform better, keeping in mind that “Garbage in equals garbage out”. (0.5 mark)

2. Apply 3-fold cross validation to the dataset, explaining the process step by step. You may wish to include a diagram as part of your answer. (2 marks)

Answer:
- Split the data into 3 equal folds. Use 2 folds for training and 1 for testing, iteratively. (1 mark)
- K-fold cross validation is useful with small datasets. (1 mark)

Question 3 (3 marks): Consider a sample of 30 loan applicants with two variables: Income range (Low / High) and Years of employment (1-5 / >5). 15 out of these 30 were granted the loan. Now, we want to build a Decision Tree on this data. In the figure below, we split the population using the two input variables Income and Years of employment.

Which split produces more homogeneous sub-nodes according to the Gini index (the equation is given at the end of the exam paper)? Explain why. (3 marks)

Answer:
- For split on Income: (1 mark)
Gini for sub-node Low income =
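The figure with the per-node counts and the exam's Gini equation are not included in the text above, so the comparison in Question 3 can only be sketched under assumptions. The Python sketch below assumes the common Gini impurity form, 1 - (p^2 + q^2), where lower means more homogeneous; some courses instead use the score p^2 + q^2, where higher is better, which ranks the splits the same way. The counts in `income_split` and `years_split` are hypothetical placeholders, chosen only so that each split covers 30 applicants with 15 loans granted, and the function names are illustrative.

```python
def gini_impurity(granted, total):
    """Gini impurity of a binary node: 1 - (p^2 + q^2)."""
    if total == 0:
        return 0.0
    p = granted / total      # share of applicants granted the loan
    q = 1.0 - p              # share of applicants refused the loan
    return 1.0 - (p * p + q * q)


def weighted_gini(sub_nodes):
    """Size-weighted average impurity over the sub-nodes of one split.

    sub_nodes is a list of (granted, total) pairs, one pair per sub-node.
    """
    n = sum(total for _, total in sub_nodes)
    return sum(total / n * gini_impurity(granted, total)
               for granted, total in sub_nodes)


# HYPOTHETICAL counts: the real values are in the exam figure, which is missing.
income_split = [(2, 10), (13, 20)]   # (granted, total) for Low, High income
years_split = [(6, 14), (9, 16)]     # (granted, total) for 1-5, >5 years

income_gini = weighted_gini(income_split)
years_gini = weighted_gini(years_split)

# The split with the lower weighted impurity has the more homogeneous sub-nodes.
better = "Income" if income_gini < years_gini else "Years of employment"
print(f"Income: {income_gini:.3f}  Years of employment: {years_gini:.3f}")
print("More homogeneous split (for these placeholder counts):", better)
```

With the real counts from the figure substituted in, the same two calls to `weighted_gini` give the numbers needed for the comparison.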

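For Question 2, the row shuffling and 3-fold procedure described in the answer can be sketched in plain Python. This is a minimal illustration, not a prescribed solution: it works on row indices only, assumes the 150 rows are ordered in three class blocks of 50 as the question states, and uses illustrative names (`make_folds`, `cross_validation_rounds`, `classes_in`) and an arbitrary seed.

```python
import random

CLASS_BLOCK = 50  # rows 0-49, 50-99, 100-149 each hold one class, as in the question


def make_folds(n_rows, k=3, shuffle=True, seed=0):
    """Split row indices into k equal folds (assumes n_rows is divisible by k)."""
    indices = list(range(n_rows))
    if shuffle:
        random.Random(seed).shuffle(indices)  # seed is an arbitrary choice
    fold_size = n_rows // k
    return [indices[i * fold_size:(i + 1) * fold_size] for i in range(k)]


def cross_validation_rounds(folds):
    """Yield (train_rows, test_rows): each fold is the test set exactly once."""
    for i, test in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, test


def classes_in(rows):
    """Class blocks (0, 1, 2) represented in a list of row indices."""
    return sorted({row // CLASS_BLOCK for row in rows})


unshuffled = make_folds(150, shuffle=False)
shuffled = make_folds(150, shuffle=True)

for name, folds in (("unshuffled", unshuffled), ("shuffled", shuffled)):
    for round_no, (train, test) in enumerate(cross_validation_rounds(folds), start=1):
        print(name, "round", round_no,
              "train size", len(train), "test size", len(test),
              "train classes", classes_in(train), "test classes", classes_in(test))
```

Without shuffling, every test fold contains a single class that never appears in the corresponding training folds, which is why the answer asks for the rows to be randomized first.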