COMSC111 Lab 6

In this lab you will create a web crawler. The web crawler will begin at a given URL (web address) and create a list of all subURLs (links to other pages). As long as there is a URL left on the list and fewer than a maximum number of URLs have been visited, the web crawler will get the next URL from the list, find its subURLs, and add them to the list.

Setting up your project:
1. Create a new project (Lab 6) and two new packages (webcrawler and tests) in that project.
2. Unzip the provided .zip file. You should get a folder named jsjf and two other files.
3. Put URLParser.java in the webcrawler package and TestWebCrawler.java in the tests package you just created. (Note: you may choose different package, class, and method names, but you must update my code appropriately if you do.)
4. Now put the jsjf folder in the src directory of your project. The src folder will now have three folders (webcrawler, tests, and jsjf), and your project will have three packages (you may need to refresh it).
5. Next, copy your LinkedQueue implementation from last week into the jsjf folder and use it in Part 2. If you have lost it or do not have it, talk to me before recreating it.

Part 1: Complete the implementation of ArrayUnorderedList. You will need to implement the addToRear method.

Part 2: Create a WebCrawler class in the webcrawler package. Your class will not have a main method, so use the provided TestWebCrawler to test your code.

Setting up the WebCrawler class:
1. Create a new class named WebCrawler in your webcrawler package.
2. Add a class variable (type int) to hold the maximum number of URLs to be visited. Give it a default value of 100.
3. Add a class variable to store the starting URL as a String.
4. Add a constructor that has a String argument and uses it to set the value of the variable from step 3.
5. Add a default constructor that sets the starting URL to a default value (e.g. http://www.acm.org).
6. Write setter methods for the two class variables.
7. Write a method crawlAndPrint (instructions below).

Looking at the URLParser code may help you with the above steps if there are some that you do not understand. The above steps are also similar to some of the steps in Lab 2.

Crawling the web: Here is an algorithm for crawling the web. You should implement this in a method named crawlAndPrint with a void return type and no arguments (look in TestWebCrawler to see how it will be used).

    create a new ArrayUnorderedList named visitedList
    create a new LinkedQueue named pendingQueue [1]
    add the starting URL to pendingQueue
    while pendingQueue is not empty and number of visited URLs < maxurls [2]
        get nexturl from pendingQueue
        if nexturl is not already in visitedList [3]
            print nexturl
            add nexturl to visitedList
            create a new URLParser [4]
            call the getsuburls method with the new URLParser [5]
            for each String s in suburlslist
                if s is not in visitedList
                    add s to pendingQueue

Notes:
1. You may use the Java API Queue if you have not finished LinkedQueue.
2. See step 2 above.
3. Look at the contains method.
4. URLParser parser = new URLParser(nexturlstring);
5. This is an instance method, so use parser.getsuburls(). It will return an ArrayUnorderedList of Strings.

The output of my code with TestWebCrawler looks like this:

    https://www.acm.org
    http://dl.acm.org/pubs.cfm
    http://www.sigmicro.org
    http://campus.acm.org/public/brochures/sb_control.cfm?refpg=s
    http://awards.acm.org
    http://awards.acm.org/contact-us
    http://awards.acm.org/guides
    http://awards.acm.org/sponsors
    http://awards.acm.org/award-nominations
    http://awards.acm.org/committees

or this:

    https://www.rwu.edu
    http://rwu.curriculog.com
    http://rogercentral.rwu.edu
    http://bridges.rwu.edu/
    http://gmail.rwu.edu/
    http://law.rwu.edu
    http://www.rwu.edu
    http://law.rwu.edu/admission/plan-your-visit
    http://law.rwu.edu/admission
    http://social.rwu.edu

Note: your output does not need to match exactly, but should look similar.

Part 3: What would you have to change in Part 2 to use the Java Collections API instead of your implementation from Part 1? State the line numbers in your code and a very brief explanation of the change needed.

What to submit: For Part 1, you should submit ArrayUnorderedList.java. For Part 2, you should submit WebCrawler.java. For Part 3, put your answer in the text field of your Bridges submission.
Shalini answered on Oct 02 2021
lab6files/Lab6files/assignment1september222020-1-ksc1g1hs.pdf
Assignment 1
Eric Johnson and Madhav Mani
September 23, 2020
Assignment Due Dates (Fall 2020):
• Attempt: October 2, 2020 at 11:59PM CST
• Completion: October 9, 2020 at 11:59PM CST
Please follow the guidelines on Canvas for assignment attempts and completion. In particular, the attempt
must be a Jupyter notebook or PDF that clearly enumerates the problem attempts, and the completed
assignment must be code and a PDF containing completed solutions. Please use complete sentences in
your solutions. All figures should have labeled axes, legends, and appropriate annotations.
Please read the entire assignment before getting too deep into any one problem so that you can properly
allocate your time and questions.
Learning Objectives:
The module learning objectives assessed in this assignment are that a student will be able to:
• PCS-1 Calculate summary quantities from data or a distribution.
• PCS-2 Construct probability distributions.
• PCS-3 Implement simple algorithms, especially to generate synthetic data sets.
• TS Understand and discuss core concepts in probability.
• MVD Graph core concepts in probability.
• NQP Illustrate the effect of parameter dependence on
distributions.
You can learn more about where to study and practice these objectives in the curriculum alignment table.
Throughout this assignment, there will be notes like this:
Note: This is a note.
These are hints, intuitions, or elaborations on the problems. Read these if you are feeling lost, annoyed, or
you have extra time, but they are not necessary to complete the problems.
https://canvas.northwestern.edu/courses/127241/pages/how-to-assignment-attempts
https://canvas.northwestern.edu/courses/127241/pages/how-to-completed-assignments
https://canvas.northwestern.edu/courses/127241/pages/module-one-the-basics
1. This first problem is focused on plotting and generating theoretical and empirical probability
distributions. Complete the following tasks:
(a) PCS-2,3, MVD Make a figure with 2 rows and 4 columns. In the top row, plot binomial, Gaussian,
Poisson, and uniform PDFs, and in the bottom row plot the corresponding CDFs. For each type
of distribution, show 3 curves with varying parameters (even the uniform distribution! What
are the parameters of a uniform distribution?). Include a legend that shows the values of the
relevant parameters for each of the three curves. Label the axes and provide titles above each
column.
Note: This problem is all about getting familiar with some basic tools in plotting and
figure-making. Make sure to look at the scipy.stats module when you’re looking to generate these
distributions.
(b) PCS-2,3 , MVD Repeat the previous problem with empirical PDFs and CDFs in a new figure.
That is, generate data sets by sampling random variables from each of these distributions. Use
the same three sets of parameters for each type of distribution. You can choose how many
samples to generate. eCDFs should not be binned.
Note: Again this problem might seem repetitive, but it will help you start to work out what
parts of plotting make sense, and what you’ll need to ask about.
Hopefully this problem also underscores the difference between empirical and theoretical PDFs
and CDFs. Theoretical distributions are nice and smooth while empirical distributions are always
somewhat noisy. In this figure (and always), you should never use a binned histogram to form
your CDF, you should use the code in the notes to get an exact CDF for your data. This is
important because putting data into bins is an arbitrary process and it destroys information.
When possible, you should always do calculations with eCDFs.
(c) PCS-1 Report the mean, standard deviation, median, and IQR of each of the distributions in
parts (a) and (b). For the theoretical distributions, you can use theory to compute these values,
but indicate that you are doing so. The values for part (b) must be generated from the data
shown in your figure.
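As a rough illustration of the empirical parts of this problem, here is a minimal sketch of an unbinned eCDF and the four summary quantities. It uses only the Python standard library rather than scipy.stats, and it is not the code from the course notes. The helper names `ecdf` and `summarize`, the seed, and the sample size are assumptions chosen for the example.

```python
import random
import statistics


def ecdf(samples):
    """Return (xs, ps): the sorted samples and the fraction of data <= each one.

    No binning is involved; every observation gets its own step.
    """
    xs = sorted(samples)
    n = len(xs)
    ps = [(i + 1) / n for i in range(n)]
    return xs, ps


def summarize(samples):
    """Mean, sample standard deviation, median, and IQR of a data set."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(samples),
        "std": statistics.stdev(samples),
        "median": statistics.median(samples),
        "iqr": q3 - q1,
    }


rng = random.Random(0)  # fixed seed (an arbitrary choice) so the run is repeatable
data = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # synthetic Gaussian samples
xs, ps = ecdf(data)
print(summarize(data))
```

Plotting `xs` against `ps` as a step curve gives the eCDF. Different quartile conventions give slightly different IQR values, so it is worth stating which one you used.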
©JohnsonMani2020WDYDS
2. In this problem we will consider the behavior of a “random walker” (although on a weekend night it
may be a drunken walker). More specifically, we consider the case of a very strange person who, in
moving down the road, flips a coin, and if the coin is heads, she takes a step to the left, and if it is
tails, a step to the right.
(a) PCS-3, MVD Write a function that returns the path of such a random walker given a number
of steps, M, and a coin bias p. Plot the path of a single random walker generated by your
function for M = 100 and p = 0.5.
(b) PCS-3, MVD Generate trajectories (paths) for N = 100 walkers, each taking M = 150 steps
with P(left) = p = 0.5. Create a plot showing all of these trajectories as a function of step
number.
(c) PCS-1 , MVD Calculate the mean and standard deviation of the position of the walkers at each
of their M = 150 steps. Plot the mean and standard deviation on top of the trajectories (the
alpha keyword might be useful to maintain figure clarity). Describe (in words) how the mean
and standard deviation scale with the number of steps.
(d) PCS-2 , MVD , TS Plot the distribution of positions that the walkers have gotten to after 144
steps. Using the Central Limit Theorem, explain why this distribution might be Gaussian.
(e) PCS-3 , NQP Repeat parts c and d for N = 100 walkers for M = 150 steps and p = 0.75.
Discuss what changes after biasing the walk in a specific direction.
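One possible way to set up the walker and the per-step statistics is sketched below. It uses only the standard library, and the function name `random_walk`, its parameter names, the seed, and the convention that a left step is -1 are all assumptions made for the example.

```python
import random
import statistics


def random_walk(num_steps, p_left=0.5, rng=random):
    """Return the walker's position after each step, starting from 0.

    Heads (probability p_left) is a step left (-1); tails is a step right (+1).
    The returned list has num_steps + 1 entries, including the start.
    """
    position = 0
    path = [position]
    for _ in range(num_steps):
        position += -1 if rng.random() < p_left else 1
        path.append(position)
    return path


rng = random.Random(1)  # fixed seed (arbitrary) for a repeatable run
num_walkers, num_steps = 100, 150
walks = [random_walk(num_steps, 0.5, rng) for _ in range(num_walkers)]

# Mean and standard deviation across walkers at each step index.
mean_by_step = [
    statistics.fmean([w[k] for w in walks]) for k in range(num_steps + 1)
]
std_by_step = [
    statistics.stdev([w[k] for w in walks]) for k in range(num_steps + 1)
]
```

Passing a different `p_left` (for example 0.75) reuses the same code for the biased walk, and `mean_by_step` and `std_by_step` can be drawn over the trajectories.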
In fact, what you have just simulated is the diffusion of molecules in one dimension. Molecules in
a gas can be reliably modeled as random walkers that move with some characteristic speed, quantified
by their diffusion constant, D. The probability of observing a molecule at a given position after it
has been released at a central location can be written
$$P(x, t) = \frac{1}{\sqrt{4\pi D t}} \, e^{-\frac{x^2}{4Dt}},$$
which you can verify for yourself is a Gaussian distribution in x.
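A quick numerical check of that claim is sketched below: comparing the expression with a zero-mean normal density suggests a variance of 2Dt, and a simple Riemann sum confirms the normalization and variance. The values of D and t, the grid, and the function names are arbitrary choices for the example.

```python
import math


def diffusion_pdf(x, t, D):
    """P(x, t) for one-dimensional diffusion from a point release at x = 0."""
    return math.exp(-x * x / (4.0 * D * t)) / math.sqrt(4.0 * math.pi * D * t)


def normal_pdf(x, sigma):
    """Zero-mean Gaussian density with standard deviation sigma."""
    return math.exp(-x * x / (2.0 * sigma * sigma)) / (sigma * math.sqrt(2.0 * math.pi))


D, t = 0.5, 3.0                # arbitrary example values
sigma = math.sqrt(2.0 * D * t)  # matching the exponents gives sigma^2 = 2 D t

dx = 0.01
xs = [-20.0 + dx * i for i in range(4001)]  # grid wide enough to cover the tails

total = sum(diffusion_pdf(x, t, D) for x in xs) * dx             # should be close to 1
variance = sum(x * x * diffusion_pdf(x, t, D) for x in xs) * dx  # should be close to 2 D t
print(total, variance)
```

The variance grows linearly in t, so the width of the distribution grows like the square root of time, which is the same scaling the random walkers show with step number.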
Note: This problem is meant to introduce you to the ways that subtle changes to how you simulate
processes can allow you to gain intuition on a much wider variety of processes. Reinterpreting a coin
flip as a random walker has just allowed you to simulate molecular diffusion!
You should also start to see how writing code and generating figures that can show you how results
depend on parameters can be very useful. In particular, going forward, you should always feel
comfortable ignoring our parameter prescriptions to see what happens in different scenarios. We’ll
generally ask you to look at interesting situations, but as we progress, more of this is left to you!
https://en.wikipedia.org/wiki/Random_walk
https://medium.com/i-math/the-drunkards-walk-explained-48a0205d304
3. In this problem we will work with some data from an experiment on bacterial chemotaxis, as discussed
in the course notes. You can download this data here. This link leads to a text file containing a list
of numbers. These numbers are the angular velocity of spinning bacteria recorded at a rate of 60
measurements per second.
(a) MVD Plot the entire list as a function of time-step (how many seconds in each time-step?).
Describe what you see. How many values are there? What are the largest and smallest values?
Is this a useful figure? What might be a better way of looking at this data? Show such a figure.
Note: Part of the reason we want you to become strong figure-makers is exactly this: when
you get a new data set, you may have no idea what it will look like. You will desperately hope
that there are some obvious features, but often, like the data given here, it will look like a mess.
Using plotting techniques and different calculations can help you orient yourself.
(b) PCS-1 Calculate the mean, standard deviation, median, and IQR of the angular velocity. Report
and comment on these values.
(c) PCS-3 As mentioned in the text, these bacteria are attempting to move, so they are alternating
between spinning one direction and another. Verify that you see this behavior. Assemble a list of
the lengths of the time intervals that the bacterium is spending spinning one direction. Assemble
a similar list of interval lengths for when the bacterium is spinning the other direction.
Note: If you are new to programming, this task may seem daunting, but it can readily be
attacked if you step back and try to solve the problem on paper before returning to
your notebook. In particular, try to think about you, yourself, as a human going through
this data one observation at a time. How can you tell that the bacterium has switched
direction? Is there a simple criterion you can write down to make this same assessment?
Can you convert that criterion into something the computer understands? If you are having
trouble here, make sure to ask a question about how to detect switching; this isn’t
inherently a Python or coding problem! This is part of the struggle of how to think quantitatively.
Once you have done this step, the problem again becomes a coding problem: how are you going
to set up your loop, what do you need to keep track of, where are you storing these intervals? If
you are having trouble here, ask about coding. Tips on coding practices will likely help.
Also make sure that you are checking the results that you get. If you get negative time intervals,
something has gone wrong. If all your time intervals are the same, something has gone wrong.
If you don’t get anything but your code runs without error, guess what? Something has gone
wrong. This is ok! This is a part of learning to code, and you should never feel like you are bad
at this because it doesn’t work the first time. When this happens, insert some print statements,
shorten your loop, and see what happens. After a while, go back to the drawing board, see if a
different code structure will help. Finally, if it’s been 30 minutes and you have no idea what’s
going on, ASK FOR HELP! (And move on to the next problem.) This course is not meant to
be an exercise in futility.
(d) PCS-2 , MVD Plot the PDFs and CDFs of time intervals for each direction of spinning.
Calculate the mean, standard deviation, median, and IQR of both lists and format them as a
table. Comment on the values.
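For the interval-finding step, one possible criterion is the sign of the angular velocity. The sketch below groups consecutive measurements by sign and converts run lengths to seconds using the stated rate of 60 measurements per second. Using the sign as the switching criterion, dropping exact zeros, and the name `spin_intervals` are assumptions. Noisy data near zero may need a threshold or smoothing instead.

```python
from itertools import groupby


def spin_intervals(omega, dt=1.0 / 60.0):
    """Split a velocity series into runs of constant sign.

    Returns (positive, negative): the duration in seconds of each run of
    positive and negative angular velocity. Exact zeros form their own
    runs and are dropped, which is an assumption of this sketch.
    """
    positive, negative = [], []
    # The key maps each value to +1, 0, or -1; groupby yields one run per sign change.
    for sign, run in groupby(omega, key=lambda w: (w > 0) - (w < 0)):
        duration = sum(1 for _ in run) * dt
        if sign > 0:
            positive.append(duration)
        elif sign < 0:
            negative.append(duration)
    return positive, negative


# Tiny made-up series (not the real data), with dt = 1 so durations equal counts.
pos, neg = spin_intervals([1, 2, -1, -1, -1, 3, 0, 4], dt=1.0)
print(pos, neg)
```

The first and last runs are cut off by the start and end of the recording, so you may want to consider whether to keep them.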
https://www.dropbox.com/s/x8vgebx7jxdyktz/omega.txt?dl=0
The remaining problems in this assignment are bonus problems. You do not need to attempt or
complete these problems to receive a “B” on the assignment. To receive a higher grade you must
acceptably complete most of the bonus problems.
(e) BONUS: PCS-1, TS It turns out that the bacterium’s switching from one direction to another
is an example of a Poisson process. One characteristic of a Poisson process is that the
time intervals between events (the switches from one direction to another) are exponentially
distributed, that is
$$P(t) = \lambda e^{-\lambda t}.$$
For an exponential distribution, the mean is given by 1/λ. Using your calculation of the mean,
provide an estimate for λ. You may use either list of intervals or both combined.
Note: This is your first time (in this course) where you are confronting a model with data. This
is definitely something that you will be interested in doing with your own data, and this is the
simplest: parameter estimation. In the next module, we’ll elaborate on this idea in much more
detail.
(f) BONUS: PCS-2,3, MVD, TS, NQP Let’s say there is a paper that suggests that the value
of λ should be 0.2. Use a Bayesian analysis as in the text to qualitatively assess whether these
data are consistent with that result. Perform the analysis using both a non-informative prior
and an informed prior based on the result from the paper. Show a series of figures as in the text
where you have incorporated only the first time interval from your list, the first two, the first
ten, etc. Only show increments that are interesting (do not show more than 4-5 increments).
Qualitatively, how long does it take for the posterior to stop changing? Based on these posteriors,
does it seem that λ = 0.2 is consistent with your data?
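To make the two bonus parts concrete, here is a minimal sketch of a point estimate of λ and a grid-based posterior for exponentially distributed intervals, where the likelihood of n intervals with sum S is proportional to λ^n e^{-λS}. This is not necessarily the method used in the course text. The grid, the Gaussian shape and width of the informed prior, the function names, and the placeholder intervals are all assumptions made for illustration.

```python
import math


def grid_posterior(intervals, lambdas, log_prior):
    """Normalized posterior weights over a grid of rate values."""
    n = len(intervals)
    total = sum(intervals)
    # Exponential log-likelihood: n * log(lambda) - lambda * sum(t_i).
    log_post = [log_prior(lam) + n * math.log(lam) - lam * total for lam in lambdas]
    peak = max(log_post)  # subtract the peak before exponentiating to avoid underflow
    weights = [math.exp(lp - peak) for lp in log_post]
    norm = sum(weights)
    return [w / norm for w in weights]


def flat_log_prior(lam):
    """Non-informative prior: constant over the grid."""
    return 0.0


def informed_log_prior(lam, center=0.2, width=0.05):
    """Gaussian prior around the published value; the width is a made-up choice."""
    return -0.5 * ((lam - center) / width) ** 2


lambdas = [0.005 * k for k in range(1, 201)]  # grid from 0.005 to 1.0
intervals = [4.1, 7.3, 2.2, 9.8, 5.0, 3.6]    # placeholder values, not the real measurements

lambda_hat = len(intervals) / sum(intervals)   # point estimate: 1 / mean interval

# Posterior after the first 1, 2, and all intervals, with the flat prior.
for n_used in (1, 2, len(intervals)):
    post = grid_posterior(intervals[:n_used], lambdas, flat_log_prior)
    best = lambdas[post.index(max(post))]
    print(n_used, round(best, 3))
```

Swapping `flat_log_prior` for `informed_log_prior` gives the second analysis. The returned values are probability weights on the grid points, so divide by the grid spacing if you want a density.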
Note:
If you can see past the daunting Bayesian component, you should notice that this problem
practices something we’ve explored at various points: how does adding more data affect my
outcomes? Furthermore, the posing of the question hints at something you won’t see until
Module 3: hypothesis testing. If you...