In this lab you will create a web crawler. The web crawler will begin at a given URL (web address)...

Question

COMSC111 Lab 6

In this lab you will create a web crawler. The web crawler will begin at a given URL (web address) and create a list of all subURLs (links to other pages). As long as there is a URL left on the list or until a maximum number of URLs have been visited, the web crawler will get the next URL from the list, find its subURLs, and add them to the list.

Setting up your project:
1. Create a new project (Lab 6) and two new packages (webcrawler and tests) in that project.
2. Unzip the provided .zip file. You should get a folder named jsjf and two other files.
3. Put URLParser.java in the webcrawler package and TestWebCrawler.java in the tests package you just created. (Note: you may choose different package, class, and method names, but you must update my code appropriately if you do.)
4. Now put the jsjf folder in the src directory of your project. The src folder will now have three folders (webcrawler, tests, and jsjf), and your project will have three packages (you may need to refresh it).
5. Next, copy your LinkedQueue implementation from last week into the jsjf folder and use it in Part 2. If you have lost or do not have this, talk to me before recreating it.

Part 1:
Complete the implementation of ArrayUnorderedList. You will need to implement the addToRear method.

Part 2:
Create a WebCrawler class in the webcrawler package. Your class will not have a main method, so use the provided TestWebCrawler to test your code.

Setting up the WebCrawler class:
1. Create a new class named WebCrawler in your webcrawler package.
2. Add a class variable (type int) to hold the maximum number of URLs to be visited. Give it a default value of 100.
3. Add a class variable to store the starting URL as a String.
4. Add a constructor that has a String argument and uses it to set the value of the variable from step 3.
5. Add a default constructor that sets the starting URL to a default value (e.g. http://www.acm.org).
6. Write setter methods for the two class variables.
7. Write a method crawlAndPrint (instructions below).

Looking at the URLParser code may help you with the above steps if there are some that you do not understand. The above steps are also similar to some of the steps in Lab 2.

Crawling the web:
Here is an algorithm for crawling the web. You should implement this in a method named crawlAndPrint with a void return type and no arguments (look in TestWebCrawler to see how it will be used).

    create a new ArrayUnorderedList named visitedList
    create a new LinkedQueue named pendingQueue
    add the starting URL to pendingQueue
    while pendingQueue is not empty and number of visited URLs

Shalini · Accepted Answer

lab6files/Lab6files/assignment1september222020-1-ksc1g1hs.pdf
Assignment 1
Eric Johnson and Madhav Mani
September 23, 2020
Assignment Due Dates (Fall 2020):
• Attempt: October 2, 2020 at 11:59PM CST
• Completion: October 9, 2020 at 11:59PM CST
Please follow the guidelines on Canvas for assignment attempts and completion. In particular, the attempt
must be a Jupyter notebook or PDF that clearly enumerates the problem attempts, and the completed
assignment must be code and a PDF containing completed solutions. Please use complete sentences in
your solutions. All figures should have labeled axes, legends, and appropriate annotations.
Please read the entire assignment before getting too deep into any one problem so that you can properly
allocate your time and questions.
Learning Objectives:
The module learning objectives assessed in this assignment are that a student will be able to:
• PCS-1 Calculate summary quantities from data or a distribution.
• PCS-2 Construct probability distributions.
• PCS-3 Implement simple algorithms, especially to generate synthetic data sets.
• TS Understand and discuss core concepts in probability.
• MVD Graph core concepts in probability.
• NQP Illustrate the effect of parameter dependence on distributions.
You can learn more about where to study and practice these objectives in the curriculum alignment table.
Throughout this assignment, there will be notes like this:
Note: This is a note.
These are hints, intuitions, or elaborations on the problems. Read these if you are feeling lost, annoyed, or
you have extra time, but they are not necessary to complete the problems.
https://canvas.northwestern.edu/courses/127241/pages/how-to-assignment-attempts
https://canvas.northwestern.edu/courses/127241/pages/how-to-completed-assignments
https://canvas.northwestern.edu/courses/127241/pages/module-one-the-basics
1. This first problem is focused on plotting and generating theoretical and empirical probability
distributions. Complete the following tasks:
(a) PCS-2,3, MVD Make a figure with 2 rows and 4 columns. In the top row, plot binomial, Gaussian,
Poisson, and uniform PDFs, and in the bottom row plot the corresponding CDFs. For each type
of distribution, show 3 curves with varying parameters (even the uniform distribution! What
are the parameters of a uniform distribution?). Include a legend that shows the values of the
relevant parameters for each of the three curves. Label the axes and provide titles above each
column.
Note: This problem is all about getting familiar with some basic tools in plotting and figure-
making. Make sure to look at the scipy.stats module when you’re looking to generate these
distributions.
(b) PCS-2,3 , MVD Repeat the previous problem with empirical PDFs and CDFs in a new figure.
That is, generate data sets by sampling random variables from each of these distributions. Use
the same three sets of parameters for each type of distribution. You can choose how many
samples to generate. eCDFs should not be binned.
Note: Again this problem might seem repetitive, but it will help you start to work out what
parts of plotting make sense, and what you’ll need to ask about.
Hopefully this problem also underscores the difference between empirical and theoretical PDFs
and CDFs. Theoretical distributions are nice and smooth, while empirical distributions are always
somewhat noisy. In this figure (and always), you should never use a binned histogram to form
your CDF; instead, use the code in the notes to get an exact CDF for your data. This is
important because putting data into bins is an arbitrary process and it destroys information.
When possible, you should always do calculations with eCDFs.
(c) PCS-1 Report the mean, standard deviation, median, and IQR of each of the distributions in
parts (a) and (b). For the theoretical distributions, you can use theory to compute these values,
but indicate that you are doing so. The values for part (b) must be generated from the data
shown in your figure.
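
The unbinned eCDF and the four summary quantities asked for above can be sketched with the Python standard library alone. This is a minimal illustration, not the code from the course notes: the helper names `ecdf` and `summarize`, the seed, and the standard-normal sample are all assumptions made for the example.

```python
import random
import statistics


def ecdf(samples):
    """Return the sorted values and the exact (unbinned) empirical CDF at each one."""
    xs = sorted(samples)
    n = len(xs)
    # After sorting, the value at 0-based index i has (i + 1) of the n points at or below it.
    ys = [(i + 1) / n for i in range(n)]
    return xs, ys


def summarize(samples):
    """Mean, sample standard deviation, median, and interquartile range."""
    # method="inclusive" interpolates linearly between the observed data points.
    q1, _, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(samples),
        "std": statistics.stdev(samples),
        "median": statistics.median(samples),
        "iqr": q3 - q1,
    }


rng = random.Random(0)  # fixed seed (an arbitrary choice) so the run is repeatable
data = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # hypothetical sample, not course data
xs, ys = ecdf(data)
stats = summarize(data)
```

Plotting `ys` against `xs` as a step curve gives the eCDF without any binning. Quartile conventions differ between libraries, so the IQR from another tool may differ slightly from this one for small samples.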
©JohnsonMani2020WDYDS
2. In this problem we will consider the behavior of a “random walker” (although on a weekend night it
may be a drunken walker). More specifically, we consider the case of a very strange person who, in
moving down the road, flips a coin, and if the coin is heads, she takes a step to the left, and if it is
tails, a step to the right.
(a) PCS-3, MVD Write a function that returns the path of such a random walker given a number
of steps, M, and a coin bias p. Plot the path of a single random walker generated by your
function for M = 100 and p = 0.5.
(b) PCS-3, MVD Generate trajectories (paths) for N = 100 walkers, each taking M = 150 steps
with P(left) = p = 0.5. Create a plot showing all of these trajectories as a function of step
number.
(c) PCS-1, MVD Calculate the mean and standard deviation of the position of the walkers at each
of their M = 150 steps. Plot the mean and standard deviation on top of the trajectories (the
alpha keyword might be useful to maintain figure clarity). Describe (in words) how the mean
and standard deviation scale with the number of steps.
(d) PCS-2, MVD, TS Plot the distribution of positions that the walkers have gotten to after 144
steps. Using the Central Limit Theorem, explain why this distribution might be Gaussian.
(e) PCS-3, NQP Repeat parts (c) and (d) for N = 100 walkers for M = 150 steps and p = 0.75.
Discuss what changes after biasing the walk in a specific direction.
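
One possible shape for the simulation in the parts above is sketched below, using only the standard library. The function name `random_walk`, the choice of -1 for a step to the left and +1 for a step to the right, the starting position 0, and the seed are assumptions for illustration; plotting is left out.

```python
import random
import statistics


def random_walk(num_steps, p_left=0.5, rng=random):
    """Return the positions after 0..num_steps steps.

    Each step is -1 (left) with probability p_left, otherwise +1 (right).
    """
    path = [0]  # assumed starting position
    for _ in range(num_steps):
        step = -1 if rng.random() < p_left else 1
        path.append(path[-1] + step)
    return path


rng = random.Random(1)  # fixed seed (an arbitrary choice) so the run is repeatable
num_walkers, num_steps = 100, 150
walks = [random_walk(num_steps, 0.5, rng) for _ in range(num_walkers)]

# Mean and standard deviation across walkers at every step number.
mean_by_step = [statistics.fmean([w[k] for w in walks]) for k in range(num_steps + 1)]
std_by_step = [statistics.pstdev([w[k] for w in walks]) for k in range(num_steps + 1)]
```

Each entry of `walks` is one trajectory indexed by step number, so `walks[i][144]` is the position of walker `i` after 144 steps. Changing the `0.5` argument to `0.75` gives the biased case.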
In fact, what you have just simulated is the diffusion of molecules in one dimension. Molecules in
a gas can be reliably modeled as random walkers that move with some characteristic speed, quantified
by their diffusion constant, D. The probability of observing a molecule at a given position x at time t
after it has been released at a central location can be written
$$P(x, t) = \frac{1}{\sqrt{4 \pi D t}} \, e^{-\frac{x^2}{4 D t}},$$
which you can verify for yourself is a Gaussian distribution in x.
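
One way to carry out that check is to compare the expression with the standard normal density. The symbols \mu (mean) and \sigma^2 (variance) are not in the original and are introduced here only for the comparison.

```latex
\mathcal{N}(x;\mu,\sigma^2)
  = \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
\quad\xrightarrow{\;\mu = 0,\ \sigma^2 = 2Dt\;}\quad
\frac{1}{\sqrt{4\pi D t}}
    \exp\!\left(-\frac{x^2}{4Dt}\right)
  = P(x, t)
```

Under this identification the distribution is centered on the release point and its standard deviation is \sqrt{2Dt}.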
Note: This problem is meant to introduce you to the ways that subtle changes to how you simulate
processes can allow you to gain intuition on a much wider variety of processes. Reinterpreting a coin
flip as a random walker has just allowed you to simulate molecular diffusion!
You should also start to see how writing code and generating figures that can show you how results
depend on parameters can be very useful. In particular, going forward, you should always feel
comfortable ignoring our parameter prescriptions to see what happens in different scenarios. We’ll
generally ask you to look at interesting situations, but as we progress, more of this is left to you!
https://en.wikipedia.org/wiki/Random_walk
https://medium.com/i-math/the-drunkards-walk-explained-48a0205d304
3. In this problem we will work with some data from an experiment on bacterial chemotaxis, as discussed
in the course notes. You can download this data here. This link leads to a text file containing a list
of numbers. These numbers are the angular velocity of spinning bacteria recorded at a rate of 60
measurements per second.
(a) MVD Plot the entire list as a function of time-step (how many seconds in each time-step?).
Describe what you see. How many values are there? What are the largest and smallest values?
Is this a useful figure? What might be a better way of looking at this data? Show such a figure.
Note: Part of the reason we want you to become strong figure-makers is exactly this: when
you get a new data set, you may have no idea what it will look like. You will desperately hope
that there are some obvious features, but often, like the data given here, it will look like a mess.
Using plotting techniques and different calculations can help you orient yourself.
(b) PCS-1 Calculate the mean, standard deviation, median, and IQR of the angular velocity. Report
and comment on these values.
(c) PCS-3 As mentioned in the text, these bacteria are attempting to move, so they are alternating
between spinning one direction and another. Verify that you see this behavior. Assemble a list of
the lengths of the time intervals that the bacterium is spending spinning one direction. Assemble
a similar list of interval lengths for when the bacterium is spinning the other direction.
Note: If you are new to programming, this task may seem daunting, but it can readily be
attacked if you step back and try to solve the problem on paper before returning to
your notebook. In particular, try to imagine yourself, as a human, going through
this data one observation at a time. How can you tell that the bacterium has switched
direction? Is there a simple criterion you can write down to make this same assessment?
Can you convert that criterion into something the computer understands? If you are having
trouble here, make sure to ask a question about how to detect switching; this isn’t
inherently a Python or coding problem! This is part of the struggle of how to think quantitatively.
Once you have done this step, the problem again becomes a coding problem: how are you going
to set up your loop, what do you need to keep track of, where are you storing these intervals? If
you are having trouble here, ask about coding. Tips on coding practices will likely help.
Also make sure that you are checking the results that you get. If you get negative time intervals,
something has gone wrong. If all your time intervals are the same, something has gone wrong.
If you don’t get anything but your code runs without error, guess what? Something has gone
wrong. This is ok! This is a part of learning to code, and you should never feel like you are bad
at this because it doesn’t work the first time. When this happens, insert some print statements,
shorten your loop, and see what happens. After a while, go back to the drawing board, see if a
different code structure will help. Finally, if it’s been 30 minutes and you have no idea what’s
going on, ASK FOR HELP! (And move on to the next problem.) This course is not meant to
be an exercise in futility.
(d) PCS-2 , MVD Plot the PDFs and CDFs of time intervals for each direction of spinning.
Calculate the mean, standard deviation, median, and IQR of both lists and format them as a
table. Comment on the values.
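
The interval bookkeeping described in the parts above can be sketched as follows. This is one possible approach under stated assumptions, not the course's method: it treats the sign of the angular velocity as the spinning direction, the helper name `run_lengths_by_direction` is hypothetical, and the toy list stands in for the real data file.

```python
from itertools import groupby

SAMPLE_RATE_HZ = 60  # from the text: 60 measurements per second


def run_lengths_by_direction(omega):
    """Split a signal into runs of constant sign and return run durations in seconds.

    Returns (positive_runs, negative_runs).
    """
    positive, negative = [], []
    # Assumption: direction is the sign of the angular velocity (+1, -1, or 0).
    for sign, run in groupby(omega, key=lambda w: (w > 0) - (w < 0)):
        duration = sum(1 for _ in run) / SAMPLE_RATE_HZ
        if sign > 0:
            positive.append(duration)
        elif sign < 0:
            negative.append(duration)
        # Samples that are exactly zero end the current run and are not counted.
    return positive, negative


# Hypothetical toy signal, not the course data:
# 3 samples one way, 2 the other way, then 4 back again.
toy = [5.0, 4.2, 6.1, -3.0, -2.5, 1.0, 2.0, 3.0, 1.5]
pos, neg = run_lengths_by_direction(toy)
```

Two caveats apply to a sketch like this. The first and last runs are cut off by the start and end of the recording, and a noisy signal that hovers near zero can produce many spurious one-sample runs, so a pure sign test may need refinement for real measurements.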
https://www.dropbox.com/s/x8vgebx7jxdyktz/omega.txt?dl=0
The remaining problems in this assignment are bonus problems. You do not need to attempt or
complete these problems to receive a “B” on the assignment. To receive a higher grade you must
acceptably complete most of the bonus problems.
(e) BONUS: PCS-1 , TS It turns out that the bacterium’s switching from one direction to another
is an example of a Poisson process.
