This file contains 10 questions to be completed about a topic that interests you; however, the main assignment is 3000 words.
[IB9CS0] [Big Data Analytics] Term {Two} [2019-2020]
WARWICK BUSINESS SCHOOL

Big Data Analytics: [10%] Assignment 2 – Asking good questions

Neha Gupta
Data Science Lab, Behavioural Science, Warwick Business School, The University of Warwick
Teaching materials designed by Prof. Suzy Moat and Prof. Tobias Preis

This coursework is worth 10% of your final marks for this course. It is due in on Tuesday 18th February 2020 by 12 noon, along with Assignment 1.

In this course, we have been showing you how our everyday interactions with technology are creating huge amounts of data capturing human behaviour worldwide. We have begun to outline how this sort of data can help us measure what is happening in the world, and even make better predictions about how people might behave in the future. Businesses and public organisations are increasingly becoming aware that these sorts of insights can help support their decision making. For scientists, this data is also fascinating, providing measurements of human behaviour at a speed and scale which was previously impossible.

To gain value from these data sources, you need knowledge of both programming and statistics. However – this is not enough! These datasets also present a dangerous trap – an infinite universe of uninteresting questions, to which you can present answers that are quantitatively correct, but which nobody cares about. A lack of understanding of what data is actually available and what the limitations of these datasets are can also lead to very time-consuming attempts to answer inappropriate and impossible questions.

In this exercise, you will draw on your business school education and the insights you have gained from this course to try and identify a good question for your final Big Data Analytics project, which illustrates how online data can provide insights into human behaviour in the offline world. You should start by considering what data is available, and try to develop a question that strikes a good compromise between being interesting and being feasible to answer.
We are of course not looking for perfection – but we want to give you a chance to demonstrate your awareness of both what can be done and what is valuable, and use this to design an interesting project with which you can demonstrate your newly developed data science skills. Try brainstorming some ideas, and ask for each idea – can I find a question that is just as easy to answer, but even more interesting? Can I find a question which is just as interesting, but even easier to answer?

We do want you to make use of the new sources of data that are becoming available, as well as your new knowledge of R. Your question should therefore involve one of the following kinds of online data: Google Trends data, data on Wikipedia page views or data retrieved from the Flickr API. Your question needs to link this online data with another source of data which reflects human behaviour in the offline world: for example, financial data, national statistics, or any other data source you find interesting. See page 4 for some suggestions of where you can find various kinds of economic and social indicators.

For assessment, please submit a PDF providing answers to the questions on the next page.
Big Data Analytics: Assignment 2 February 2020

You should keep your answers to under 2 pages of A4, with borders of 2.54cm and a font size of 11pt. Read through all the questions first, as answering some of the later questions might make you realise you should modify your answers to earlier questions.

1) What is the question you wish to ask? (1%)
In one sentence, state the question you wish to ask. We will give you marks for stating an interesting question clearly.

2) What would your dream result be? (1%)
Imagine you have finished your analysis, and you have found the best result you could dream of. Describe your dream result in two simple sentences that a member of the general public would understand. Often, if your question is not interesting enough, it becomes difficult to summarise the findings in so few words – so we want to make sure you can!

3) What reason do you have to believe you might find this result? (1%)
In two to three sentences, explain why you might expect to find this result. In your final project, you will not lose marks if you do not find a significant result (nor gain marks if you do). However, you do need to provide a good explanation as to why you might expect to find the result you describe – otherwise it wouldn’t be worth your time looking into this idea.

4) What data will you use? (1%)
Explain what Google Trends, Wikipedia page view or Flickr data you will use. Provide a link to the source of the data that is not from Google Trends, Wikipedia or Flickr. Please ensure that this link will take us directly to the data when we click on it. If there is a very good technical reason for which you cannot do this, explain what this reason is and clearly describe the source of the data.

5) How will you read the data into R for analysis? (1%)
Outline what steps would be required to read both your online data and other data into R and carry out any pre-processing required before your statistical analysis. We are looking for an understanding of the basic steps you would need to carry out - you do not need to provide code.

6) What statistical method will you use to analyse the data? (1%)
Describe the statistical approach you will use to answer your question. Describe any assumptions of this analysis. Make sure that the approach you describe is capable of delivering an answer to your question in line with the dream result you described in question 2.

7) Which R functions will you use to carry out the statistical analysis? (1%)
Name the R functions that will allow you to carry out the statistical analysis (not the data pre-processing), and which will allow you to check any assumptions of your statistical analysis. If these R functions have not been used in the course, specify the R package that they are in.

8) Describe the form of the data. Do you have enough data? (1%)
Is the data daily, weekly, monthly, something else? How many data points will you be able to analyse? Given the statistical approach you have described, is this sample size large enough to give you a chance of uncovering a significant result? (If not, you need to rethink your question!)

9) How would you describe your dream result to a professional audience? (2%)
Imagine you have finished your analysis and you have found your dream result. You have been asked to write an executive summary of your results for a professional audience. What would you write? Give some background motivation to your question, briefly describe your finding, and then indicate what this finding might mean. Keep your summary under 125 words.

Further guidance on developing your question

It is likely that your question will fall into one of three categories.
These are as follows:

• Nowcasting offline behaviour with online data
  Tip: see examples of nowcasting that we covered in the lectures, in particular in Week 3.
• Predicting offline behaviour with online data
  Tip: be careful to ensure that the statistics you are proposing will genuinely allow you to make predictions.
• Measuring offline behaviour with online data, where the offline behaviour was previously difficult or impossible to measure
  Tip: these can be tricky questions to set up correctly. Be careful to ensure that it really makes sense to use Google Trends, Wikipedia page views or Flickr data to measure the offline behaviour you are interested in. First, check that there is not a more obvious offline alternative that you could use. (If there is, you probably need to think of another question, as you do need to propose a question using Google Trends, Wikipedia page view or Flickr data.) Second, check that you can make a convincing argument that the measurements produced with online data are likely to be valuable, and not either too noisy or too biased in some way.

Finding inspiration for your question

In the course, we cover a number of examples of analyses using online data to provide insight into human behaviour in the real world.
Here is another example from the Bank of England, where they use data on Google searches to nowcast unemployment rates and house prices:

Using internet search data as economic indicators
Nick McLaren and Rachana Shanbhogue
https://www.bankofengland.co.uk/-/media/boe/files/quarterly-bulletin/2011/using-internet-search-data-as-economic-indicators.pdf

You might also be interested in this paper looking into the relationship between Bitcoin prices and Google Trends and Wikipedia page view data:

BitCoin meets Google Trends and Wikipedia: Quantifying the relationship between phenomena of the Internet era
Ladislav Kristoufek
https://www.nature.com/articles/srep03415

You can find more examples on the course reading list:
https://rl.talis.com/3/warwick/lists/E1DCBC50-A278-379F-F255-3D29A326D2AB.html

To ensure that the question you propose is interesting, note that it should not simply replicate one of the analyses covered in the course (or it will be difficult to argue for the value of your analysis). If you are keen to propose a very similar analysis to one of these analyses, make sure your answer to question 2 makes it clear what insights your dream result would provide beyond the knowledge we already have from the previous analysis.

Data source suggestions

One of the challenges you have to address in this assignment is to find a source of offline data you can use in your project. There are many sources of fascinating real world data available online. Here are a few suggestions – but feel free to find others!
Open administrative data – UK

• data.gov.uk
  o UK Government project to make non-personal UK government data available as open data
    http://data.gov.uk/
• London Datastore
  o Official site providing free access