PYTHON ASSIGNMENT: Send the code, paste the generated plots into a Word document, and explain what each plot concludes. Let us use data analytical skills to determine which factors contribute to higher medical costs....

Rajashekar answered on Jul 22 2021
Name:        Date:
PYTHON ASSIGNMENT
1. Insurance Dataset
Use data analytical skills to determine which factors contribute to higher medical costs. The insurance.csv dataset is related to individual medical costs billed by health insurance companies. It also includes some personal information.
1.1. Questions
1. We will examine if bmi has an impact on the medical costs. Put the bmi on the x-axis. The color of each point will be set according to whether the patient is a smoker. Set the transparency to be 0.7. Be sure to include the color bar, and set appropriate labels for x-axis, y-axis and the color bar. What business insights can you get?
Insights
The most obvious trend is that non-smokers accumulate lower charges on average than smokers.
Other insights that can be derived from the plot are as follows:
1. Most non-smokers incur charges of no more than 15000, with a few outliers that do not exceed 40000.
2. The BMI of non-smokers is fairly evenly distributed between 20 and 40.
3. Smokers with a BMI between 15 and 30 incur higher charges than non-smokers, ranging between 15000 and 30000.
4. A significant number of smokers with a BMI between 30 and 40 incur the highest charges, ranging from 30000 to 50000.
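The plot described in question 1 could be produced along the following lines. This is a minimal sketch, not the original answer's code: it assumes the file is loaded into a DataFrame with columns named bmi, smoker and charges, that smoker holds the strings 'yes' and 'no', and the function name plot_bmi_vs_charges is hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_bmi_vs_charges(df: pd.DataFrame):
    """Scatter BMI against charges, colored by smoker status."""
    # Assumption: 'smoker' holds 'yes' / 'no'; encode it as 1 / 0 for coloring.
    smoker_code = (df["smoker"] == "yes").astype(int)

    fig, ax = plt.subplots(figsize=(8, 5))
    points = ax.scatter(df["bmi"], df["charges"], c=smoker_code,
                        cmap="coolwarm", alpha=0.7)
    ax.set_xlabel("BMI")
    ax.set_ylabel("Medical charges")

    # Color bar with one tick per category.
    cbar = fig.colorbar(points, ax=ax, ticks=[0, 1])
    cbar.set_label("Smoker (0 = no, 1 = yes)")
    return fig, ax
```

Typical use would be df = pd.read_csv("insurance.csv"), then plot_bmi_vs_charges(df) followed by plt.show().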
2. We further compare the distribution of the medical costs of smokers and that of non-smokers. Plot the distribution of medical costs of smokers first. Then on the same figure, plot the distribution of medical costs of non-smokers and set the transparency to 0.6. The number of bins is 12 for both plots. Set appropriate labels and legends.
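A sketch of the overlaid histograms for question 2 follows. It assumes the same column names as above (smoker with 'yes'/'no' values, and charges); the function name plot_charges_by_smoker is made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_charges_by_smoker(df: pd.DataFrame):
    """Overlay the charge distributions of smokers and non-smokers."""
    smokers = df.loc[df["smoker"] == "yes", "charges"]
    non_smokers = df.loc[df["smoker"] == "no", "charges"]

    fig, ax = plt.subplots(figsize=(8, 5))
    # Smokers first, then non-smokers on top with transparency 0.6.
    ax.hist(smokers, bins=12, label="Smokers")
    ax.hist(non_smokers, bins=12, alpha=0.6, label="Non-smokers")
    ax.set_xlabel("Medical charges")
    ax.set_ylabel("Number of patients")
    ax.legend()
    return fig, ax
```

Note that each call to hist picks its own bin edges, so the two sets of bars do not necessarily line up; passing a shared array of edges to bins would align them if that is preferred.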
3. We study whether age is an important factor by comparing the distribution of medical costs of young people and that of elder people. On the same plot, generate a histogram of medical costs of patients younger than 40 years old, and then another histogram representing the rest of the patients. Set the transparency of the second histogram to 0.7. The number of bins is 15 for both histograms. Set appropriate labels and legends. What can you conclude from this figure?
Insights
1. The majority of young patients incur low charges, which is consistent with better health at a lower age. Most young patients incur charges below 10000, with smaller groups incurring around 20000 and 35000.
2. Compared with young patients, far fewer of the other patients have charges around 10000 (200-250 compared with about 350 young patients). These patients incur higher costs on average than younger patients, with the highest costs exceeding 60000.
3. The costs incurred increase with age.
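The readings taken from the histogram can be cross-checked numerically. The sketch below is one possible check rather than part of the original answer; it assumes columns named age and charges, and the helper name summarize_charges_by_age_group is hypothetical.

```python
import pandas as pd


def summarize_charges_by_age_group(df: pd.DataFrame, cutoff: int = 40) -> pd.DataFrame:
    """Count, median and maximum charges for patients below and at/above the cutoff."""
    # Label each patient by age group, using the same split as the histogram.
    group = df["age"].lt(cutoff).map(
        {True: f"under {cutoff}", False: f"{cutoff} and over"}
    )
    return df.groupby(group)["charges"].agg(["count", "median", "max"])
```

Comparing the median column of the two rows gives a single number for "older patients incur higher costs" that does not depend on the choice of bins.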
4. Open-ended question. Now it is your turn to discover something interesting and valuable! What else can you conclude from this dataset using the data visualization skills we learned? Generate two more figures and explain your findings.
Insights
When we compare how charges are distributed for male and female patients, the distributions are mostly similar, although a higher number of male patients incur larger charges between 30000 and 50000. This indicates that the charges incurred by patients are determined mostly by other factors, such as age and smoking, as explored earlier.
Insights
1. Women aged between 30 and 53 incur the highest charges.
2. Men aged between 43 and 52 incur the highest charges.
3. This indicates that women are admitted to the hospital over a wider range of ages than men.
Insights
The south-east region has the highest number of smokers and consequently incurs the highest charges. The number of non-smokers is evenly distributed across all regions, with the south-west region having a higher number of non-smokers compared to smokers.
2. Bike Rental Dataset
This is the daily version of the Capital Bikeshare System dataset from the UCI Machine Learning Repository. The dataset contains information about the daily count of bike rental checkouts in Washington, D.C.’s bikeshare program between 2011 and 2012. It also includes information about the weather and seasonal/temporal features for that day (like whether it was a weekday).
2.1. Questions
1. Understand Trends. Generate a line chart to show the checkouts over time by using day column as the x-axis and cnt column as the y-axis. Label the x-axis as ‘Day’, and y-axis as ‘Check Outs’. What can you conclude?
Insights
1. The general trend in both years is that the number of checkouts steadily increases until it peaks mid-year.
2. Checkouts then steadily decrease until the end of the year and pick up again as the next year progresses, following the same trend as the previous year.
3. The overall number of checkouts increases significantly in the second year.
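The line chart behind these observations could be drawn roughly as follows. This sketch assumes the DataFrame has the day and cnt columns named in the question; the function name plot_daily_checkouts is hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_daily_checkouts(bikes: pd.DataFrame):
    """Line chart of total checkouts (cnt) over time (day)."""
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(bikes["day"], bikes["cnt"])
    ax.set_xlabel("Day")
    ax.set_ylabel("Check Outs")
    return fig, ax
```

With two years of daily data on one axis, the mid-year peaks and the higher level in the second year should both be visible in a single figure.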
2. Explore Relationships. We will plot the daily count of bikes that were checked out by casual/non-registered users against the temperature. Color the points to be ‘#539cab’. Set the transparency to be 0.7. Be sure to include appropriate labels for x-axis and y-axis. What insight can you get?
Insights
1. People rent bikes less when it is colder, as is evident from the graph, which shows checkouts ranging from 1500 to 4000 up to a temp value of 0.4. This suggests that other factors, such as road conditions and body temperature, also play a role.
2. The highest number of checkouts occurs at mild temperatures, with the majority ranging between 4000 and 8000.
3. As the temperature increases further, checkouts decrease but remain significantly higher than at lower temperatures.
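A sketch of the scatter plot requested in question 2 is given below. It assumes columns named temp and casual in the DataFrame; the function name plot_casual_vs_temp and the axis label wording are illustrative choices.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_casual_vs_temp(bikes: pd.DataFrame):
    """Scatter casual checkouts against temperature in a single color."""
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.scatter(bikes["temp"], bikes["casual"], color="#539cab", alpha=0.7)
    ax.set_xlabel("Temperature")
    ax.set_ylabel("Casual Check Outs")
    return fig, ax
```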
3. Explore Relationships with Multidimensional Information. We will plot the daily count of bikes that were checked out by casual/non-registered users against the temperature. The color of each point will be set according to whether it is a working day. Set the transparency to be 0.7. Be sure to include appropriate labels for x-axis and y-axis. Change the legend of the color bar to whether it is a working day. What additional insights can you get?
Insights
People rent bikes on working days much more than on non-working days. This suggests that people use bikes to commute to work more than for leisure on non-working days.
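Because the plot in this question covers casual users only, it may be worth checking this reading against the numbers. The following is a minimal sketch of such a check, assuming a workingday column coded 0/1 and a casual column; the helper name casual_by_workingday is hypothetical.

```python
import pandas as pd


def casual_by_workingday(bikes: pd.DataFrame) -> pd.DataFrame:
    """Number of days, mean and median casual checkouts per day type."""
    # Assumption: workingday is coded 1 for working days and 0 otherwise.
    day_type = bikes["workingday"].map({0: "non-working day", 1: "working day"})
    return bikes.groupby(day_type)["casual"].agg(["count", "mean", "median"])
```

If the mean for one day type is clearly larger, that supports or contradicts the visual impression directly; running the same summary on the registered column would show whether commuting behaviour differs between the two user groups.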
4. Examine Distributions. Let’s first build a histogram of the registered bike checkouts with the number of bins as 10. Set appropriate labels. Also set the title to be “Distribution of Registered Check Outs”.
Insights
The number of checkouts per day mostly falls in the 3000-4000 range.
5. Compare Distributions. We now compare the distributions of registered and casual checkouts. To make the figure easy to understand, in addition to the histogram we made for the previous question, we will set the transparency of the casual one to 0.8 and the number of bins to 5. Set appropriate labels.
Insights
1. Casual renters generally rent fewer times than registered renters, suggesting that most casual renters are not returning customers.
2. The number of days with a given level of casual checkouts steadily decreases, up to a maximum of about 3000 checkouts.
3. From this we can infer that casual renters check out only once, or that their checkouts happen only on non-working days or holidays.
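The comparison figure for question 5 could be sketched as below, using the bin counts and transparency from the question. The column names registered and casual are assumed, and plot_registered_vs_casual is a hypothetical name.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_registered_vs_casual(bikes: pd.DataFrame):
    """Overlay histograms of registered and casual daily checkouts."""
    fig, ax = plt.subplots(figsize=(8, 5))
    # Registered: 10 bins, as in the previous question.
    ax.hist(bikes["registered"], bins=10, label="Registered")
    # Casual: 5 bins, drawn on top with transparency 0.8.
    ax.hist(bikes["casual"], bins=5, alpha=0.8, label="Casual")
    ax.set_xlabel("Check Outs per Day")
    ax.set_ylabel("Number of Days")
    ax.legend()
    return fig, ax
```

Since each bar counts days rather than customers, the figure shows how daily totals are distributed; it does not by itself show how often an individual renter returns.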
6. How do the temperatures change across the seasons? You need to choose the type of visualization that best serves this purpose. What are the mean and median temperatures?
Insights
1. The temperature distribution differs from season to season, as shown in the graphs.
2. The mean temperature varies by season: 0.3 in winter, 0.55 in spring, 0.72 in summer and 0.41 in fall.
3. As expected, summer has the highest temperatures, reaching 0.85, and the highest average, while winter has the lowest average and a minimum of 0.1.
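One way to obtain both the figure and the mean and median values is sketched below. A box plot is used here as one reasonable choice of visualization, not necessarily the one in the original answer; the column names season and temp are assumed, and temperature_by_season is a hypothetical name.

```python
import matplotlib.pyplot as plt
import pandas as pd


def temperature_by_season(bikes: pd.DataFrame):
    """Box plot of temp per season plus a table of mean and median temp."""
    stats = bikes.groupby("season")["temp"].agg(["mean", "median"])

    seasons = sorted(bikes["season"].unique())
    data = [bikes.loc[bikes["season"] == s, "temp"].to_numpy() for s in seasons]

    fig, ax = plt.subplots(figsize=(8, 5))
    ax.boxplot(data)  # boxes are drawn at positions 1..len(seasons)
    ax.set_xticks(range(1, len(seasons) + 1))
    ax.set_xticklabels([str(s) for s in seasons])
    ax.set_xlabel("Season")
    ax.set_ylabel("Temperature")
    return stats, fig, ax
```

The returned stats table has one row per season code, so the mapping from codes to season names should be checked against the dataset's documentation before quoting values by name.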
7. What else can you conclude from this dataset by using various data exploration?
Insights
1. If we categorise the conditions of the day into 'Clear', 'Misty' and 'Rainy' and then plot the number of checkouts, we observe that people check out a similar number of bikes on misty and clear days, with the highest number of checkouts on clear days.
2. On rainy days these numbers drop significantly, possibly due to road conditions and safety considerations.
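The categorisation described above might look like the following sketch. It assumes a weathersit column with integer codes and assumes that codes 1, 2 and 3 correspond to 'Clear', 'Misty' and 'Rainy'; both the mapping and the name mean_checkouts_by_weather are illustrative, not taken from the original code.

```python
import pandas as pd

# Assumed mapping from weather codes to the three labels used in the insights.
WEATHER_LABELS = {1: "Clear", 2: "Misty", 3: "Rainy"}


def mean_checkouts_by_weather(bikes: pd.DataFrame) -> pd.Series:
    """Average daily checkouts (cnt) for each weather label."""
    weather = bikes["weathersit"].map(WEATHER_LABELS)
    # Days with a code outside the mapping become NaN and are left out.
    return bikes.groupby(weather)["cnt"].mean()
```

Calling .plot(kind="bar") on the result would give a bar chart of the three averages.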
3. Source code
import pandas as pd  # import relevant libraries
import numpy as np
import matplotlib.pyplot as plt  # This is the tool we will use to perform EDA
# Run in a notebook cell or shell, not as Python code:
# pip install --upgrade matplotlib  # required to upgrade matplotlib to latest version for some code to...