The trend towards cloud data warehouses: What does it mean for the data scientist?

Introduction

The competencies required to do data science in business are evolving as businesses move from traditional Data Warehouse (DW) architectures to modern cloud systems. The trend towards greater outsourcing of infrastructure, resource management, data management, data governance and information security should allow the data scientist to focus more effort on creating value from insights by building data products, and less on managing complex data pipelines. However, businesses vary substantially in their adoption of new technologies, and cloud providers offer a diverse range of services. This variation means the work of a data scientist is likely to be quite different in a business using a modern cloud DW compared to one using a traditional DW, or none at all. Industry sector, the prevalence of data literacy in senior management and technology choices, past, present and future, all play a role in determining how businesses use data science (McKinsey Analytics, 2018). A data scientist will typically need a foot in both worlds, traditional and modern, as well as an understanding of where their business has been, where it is heading, and the internal and external challenges it faces, if they are to succeed in creating value from data. An aspiring data scientist should remain up to date with current trends, develop skills that generalise across multiple business domains and understand a potential employer's data strategy, since the type of work they may be expected to perform in the future may vary substantially from one business to the next.

Literature Review

First-generation DWs were built to allow reporting and analysis to run on infrastructure separate from transactional relational database management systems (RDBMS) (Ponniah, 2010). They required significant effort to design and manage the ingestion of data into the warehouse via complex extract, transform and load (ETL) functions that changed source data into the formats and structures required for storage in the DW (Ponniah, 2010, p. 282). Early DWs were not robust to change in the internal or external environment, and significant effort was required to maintain data integrity and keep ETL functions updated as business processes changed, new data sources were added, and new technologies or user behaviours emerged. Their architectures also required careful optimisation: as the volume of data in a warehouse grows over time, query performance suffers if entire schemas of historical data must be traversed to produce results. Improvements in data warehousing can be understood as a series of responses to such challenges. DWs have evolved to take advantage of scalable infrastructure, both storage and compute, to remove or automate database administration tasks, and to streamline more data management processes.

Challenges with traditional DW architecture

Data typically flows through an ETL pipeline in batches, creating latency between real-time data generated by transactions or users and its availability for analysis or reporting. Although this can be mitigated to some extent by additional rewrite/merge operations (Cuzzocrea et al., 2010), it is a constraint of traditional DW architecture that has only recently been overcome with streaming technologies and the loading of untransformed data to the cloud.
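To illustrate the batch ETL pattern described above, the sketch below extracts a (hypothetical) CSV export from a transactional system, transforms it in memory with pandas and loads the result into a warehouse-style table. The file, column and table names are invented for the example; the latency discussed above arises because data only becomes queryable once each scheduled batch completes.

```python
# Minimal batch ETL sketch (illustrative only): file, column and table names
# are hypothetical. Data is only available for analysis after the batch
# finishes, which is the source of the latency discussed above.
import sqlite3
import pandas as pd

def run_batch_etl(csv_path: str, conn: sqlite3.Connection) -> None:
    # Extract: read the latest batch exported from the transactional system.
    orders = pd.read_csv(csv_path, parse_dates=["order_date"])

    # Transform: reshape the data into the structure the warehouse expects.
    daily_sales = (
        orders.assign(order_day=orders["order_date"].dt.date)
        .groupby(["order_day", "product_id"], as_index=False)["amount"]
        .sum()
        .rename(columns={"amount": "total_amount"})
    )

    # Load: append the transformed batch to the reporting table.
    daily_sales.to_sql("fact_daily_sales", conn, if_exists="append", index=False)

# Example usage (hypothetical file):
# run_batch_etl("orders_2024_01_31.csv", sqlite3.connect("warehouse.db"))
```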
Queries to a traditional DW require views of historical data to be materialised based on the query parameters. The time it takes to materialise these views, and the prohibitive cost of storing all possible materialised views, can lead to sub-optimal query performance for some users and highlight a trade-off between query performance and cost. Monitoring demand for certain query types can help optimise the speed/cost trade-off (Kehua & Diasse, 2014), as can distributed processing or periodically updating the data structure instead of using static data cubes (Dehne et al., 2015). Modern cloud DWs approach this challenge with higher degrees of automation and integrate query optimisation into their release cycles (Yan et al., 2018).

Traditional DWs copied data multiple times as it moved from the RDBMS to staging, then to warehouse storage or data marts, incurring additional cost to store multiple versions of the same data. Storage was typically over-allocated for an expected future capacity rather than for current requirements. Cloud providers overcome this challenge with elastic pricing, provisioning storage on demand. Data lineage can also be difficult to confirm when data is copied and transformed multiple times, often without metadata linking it to its source. The traditional response to this challenge was to add more complexity, and future technical debt, to ETL processes (Cui et al., 2000; Variar, 2001).

Traditional DWs are unable to integrate unstructured data because they are schema-on-write and require schemas to be defined before data is loaded into the warehouse. This challenge has been mitigated by the development of schema-on-read, non-relational data management technologies such as NoSQL databases and Hadoop, which allow unstructured or semi-structured data to be integrated into a warehouse or distributed file system. There is no clear optimal approach, as performance varies by use case (Yassien & Desouky, 2016), and a data scientist should expect to work with a range of data management technologies depending on the enterprise architecture of their business.

As a consequence of these challenges, data science teams working in a traditional DW environment should expect to focus heavily on maintaining and validating the integrity of the ETL pipeline as business processes change, while IT teams and database administrators spend substantial effort optimising infrastructure and system performance.

The increasing variety, velocity and volume of data

The modern era generates substantially more data than when traditional DWs were introduced. The variety, velocity and volume of data in the modern world are driven by rapid technology innovation, increased online activity and the expansion of the Internet of Things (IoT). Big Data is becoming even bigger as businesses continue to develop use cases for data science in search of competitive advantage (McKinsey Analytics, 2018). This creates a feedback loop in which increased data generation drives innovation in data processing and vice versa. As technology evolves, the data scientist needs to continually adapt to new challenges.

Trends in modern DW architecture

In 2016, 'cloud computing' and 'DW' were the top two keywords in academic publications relating to Big Data or Business Intelligence (Liang & Liu, 2018), indicating a shift in industry focus away from the traditional DW towards the cloud. One clear trend in cloud DWs is to alleviate data transfer and computational bottlenecks by moving computation to where the data is (Assunçao et al., 2018).
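As a simplified illustration of moving computation to where the data lives, the sketch below contrasts pulling every row into the client and aggregating locally with pushing the aggregation down to the database as SQL. The events table is invented for the example; with a cloud DW the connection would simply come from the vendor's driver, but the pattern is the same.

```python
# Pushdown sketch (illustrative only): the events table is hypothetical.
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE events (user_id INTEGER, event_type TEXT, value REAL);
    INSERT INTO events VALUES
        (1, 'click', 1.0), (1, 'purchase', 20.0),
        (2, 'click', 1.0), (2, 'purchase', 35.0);
    """
)

# Anti-pattern: pull every raw row to the client, then aggregate locally.
all_rows = pd.read_sql("SELECT * FROM events", conn)
local_totals = all_rows.groupby("event_type")["value"].sum()

# Pushed-down alternative: the database does the work and returns only the
# small aggregated result, so far less data crosses the network.
pushed_totals = pd.read_sql(
    "SELECT event_type, SUM(value) AS total FROM events GROUP BY event_type",
    conn,
)
print(pushed_totals)
```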
A second trend is to reduce the complexity of loading data into cloud storage by removing transformation functions and storing data in its original format, an approach referred to as a data lake (Sawadogo et al., 2019). ETL is giving way to ELT (extract, load, transform), with schema-on-read replacing schema-on-write. A related innovation is columnar storage, which stores data of the same type together to achieve greater compression and reduce storage requirements (Wandelt et al., 2018). These changes enable leading cloud providers such as Amazon, Microsoft, Google and relative newcomer Snowflake (Dageville et al., 2016) to offer massively parallel processing (MPP) and independently scalable storage and compute infrastructure on a pay-as-you-go basis. As many of the technologies used in a modern cloud DW are open source, such as Apache Spark for distributed processing and streaming of real-time data, innovation is likely to continue. On-premises storage will likely be reserved for only the most private business data, and Harrison (2019) suggests distributed file systems used for data lake storage, such as Apache Hadoop, will give way to cheap, scalable object storage such as Azure Blob Storage or Amazon S3 that can be provisioned on demand.

New challenges

Big Data and cloud DWs offer potential for substantial value creation, but also pose a number of new challenges in change management, security, data management and data governance. Migrating to a cloud DW can require significant code revision in applications that use an existing DW, although Antova et al. (2018) review tools to translate between different SQL variants and identify database functionality mismatches.

Businesses concerned about privacy may see the cloud as a target for malicious actors. Data can be encrypted at rest or in transit, but operations on encrypted data are only theoretically feasible and are not implemented in cloud DWs at present. Ahmadian & Marinescu (2020) define information leakage as the "inadvertent disclosure of sensitive information through correlation of records from several collections of a data warehouse". Addressing this challenge is an active research field, with approaches involving distributed secret sharing (Moghadam et al., 2017) and privacy-preserving clustering (Zobayed et al., 2020). Fernandez et al. (2020) offer a new architectural concept, the Data Station: an encapsulated data lake with a strictly governed data catalogue that brings data-unaware tasks and queries to secure data while keeping the schema and data hidden from the user. This introduces significant new challenges for resource management but demonstrates a novel approach to restricting information leakage. Cloud vendors and their customers also need to ensure encryption keys are managed very carefully.

Schema-on-read shifts the responsibility for expert domain knowledge from the DW administrator to the data consumer, since domain knowledge becomes essential to interpret data pulled from the cloud, assess its quality and validate errors. Another challenge is that duplicate datasets are often found in data lakes when no pre-defined schema is required. Metadata becomes more important with schema-on-read, and Sawadogo et al. (2018) propose a graph-based metadata system that tracks intra-object, inter-object and global metadata and provides semantic enrichment, data indexing, link generation and conservation, data versioning and usage tracking at scale.
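To make the schema-on-read and metadata points above concrete, the sketch below reads semi-structured JSON records whose fields vary from record to record, imposes a tabular schema only at read time and attaches simple lineage metadata. The records and source name are invented for the example.

```python
# Schema-on-read sketch (illustrative only): raw records keep their original
# shape in the "lake"; a schema is imposed, and lineage recorded, only when
# the data is read for analysis.
import json
from datetime import datetime, timezone

import pandas as pd

raw_records = [
    json.dumps({"user": "a", "device": {"os": "ios"}, "spend": 12.5}),
    json.dumps({"user": "b", "spend": 3.0}),  # no device field
    json.dumps({"user": "c", "device": {"os": "android"}, "country": "NZ"}),
]

# Impose a flat, tabular schema at read time; missing fields become NaN.
df = pd.json_normalize([json.loads(r) for r in raw_records])

# Attach simple lineage metadata so consumers can trace the data to its
# source and assess its quality.
df["_source_file"] = "events_2024_01_31.json"
df["_loaded_at"] = datetime.now(timezone.utc).isoformat()

# Columns now include 'user', 'spend', 'device.os' and 'country',
# plus the two metadata fields.
print(df)
```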
Not all clouds are created equal, and the level of service outsourced to a vendor determines the type of work needed internally to produce insight from data. A smorgasbord of acronyms has developed over recent years, including Data Warehouse as a Service (DWaaS), Platform as a Service (PaaS), Analytics as a Service (AaaS) and Software as a Service (SaaS). These should define the work undertaken by the vendor to configure and maintain the service, but with cloud vendors quick to adopt the latest jargon it can be difficult to differentiate between them. The level of outsourcing will affect the work done by data science and IT teams, reducing the effort spent managing data ingestion and ETL and providing infrastructure and security services to varying degrees (Fisher, 2018). Cloud vendors are also offering deeper and more expensive solution stacks to cater for businesses with insufficient internal expertise (Assunçao et al., 2015). While this may be valuable to businesses without data science capability, it risks devaluing data science expertise already present in a business.

A splinternet brought about by decentralised internet standards (Hoffman et al., 2020), or regional regulation such as the European Union's General Data Protection Regulation (GDPR), may lead to more effort managing data sovereignty, privacy requirements and model interpretability. It is unclear at this point how large these impacts are likely to be: legal scholars Gil González & de Hert (2019) describe GDPR provisions as vague and subjective, and the implications for data science remain uncertain but potentially complex.

Conclusion

Data science requires a range of business and technical skills. Specific roles are not easily differentiated and are often combined in a single person. Rawlings-Goss (2019) separates data science roles into the following generic job titles:

• Data Scientist (Problem Solvers)
• Data Analyst (Translators)
• Data Architect (Builders)
• Data Engineer (Testers)
• Database Administrator (Librarians or Archivists)
• Business Analyst (Communicators)
• Data and Analytics Manager (Coaches)

While it is easy to map these roles to the tasks required in a traditional DW environment, outsourcing to cloud vendors can free some of these resources, in database administration, data engineering and data security, for other priorities (Fisher, 2018), but may also place new requirements on the data science team. Role requirements are also likely to vary with the level of service a vendor provides, e.g. PaaS versus AaaS. Modern cloud DW platforms offer a significant opportunity for data scientists to increase their focus on adding value to a business and reduce the effort spent maintaining complex data pipelines. This should allow more time to advise data consumers on how to use the available data to understand their business, and to build data products that create value. Data scientists will need to acquire new skills continuously to manage increasing volumes of data across multiple business domains, where previously engineers and analysts specialised in a single domain and sometimes even a single RDBMS. A balanced team should contain a mix of specialists and generalists who contribute in different ways: a specialist provides deep domain expertise, while a generalist with broad knowledge can derive more value from their understanding of the connections between domains.