Background & Context
There is huge demand for used cars in the Indian market today. As sales of new cars have slowed in the recent past, the pre-owned car market has continued to grow and is now larger than the new car market. Cars4U is a budding tech start-up that aims to find a foothold in this market.
In 2018-19, while new car sales were recorded at 3.6 million units, around 4 million second-hand cars were bought and sold. The slowdown in new car sales could mean that demand is shifting towards the pre-owned market. In fact, some car sellers replace their old cars with pre-owned cars instead of buying new ones. Unlike new cars, where price and supply are fairly deterministic and managed by OEMs (Original Equipment Manufacturers), except for dealership-level discounts that come into play only in the last stage of the customer journey, used cars are very different beasts with huge uncertainty in both pricing and supply. With this in mind, the pricing scheme for used cars becomes important for growing in the market.
As a senior data scientist at Cars4U, you have to come up with a pricing model that can effectively predict the price of used cars and can help the business in devising profitable strategies using differential pricing. For example, if the business knows the market price, it will never sell anything below it.
Objective
- Explore and visualize the dataset.
- Build a linear regression model to predict the prices of used cars.
- Generate a set of insights and recommendations that will help the business.
Data Dictionary
- S.No.: Serial Number
- Name: Name of the car which includes Brand name and Model name
- Location: The location (city) in which the car is being sold or is available for purchase
- Year: Manufacturing year of the car
- Kilometers_driven: The total kilometers driven in the car by the previous owner(s) in KM.
- Fuel_Type: The type of fuel used by the car. (Petrol, Diesel, Electric, CNG, LPG)
- Transmission: The type of transmission used by the car. (Automatic / Manual)
- Owner: Type of ownership
- Mileage: The standard mileage offered by the car company in kmpl or km/kg
- Engine: The displacement volume of the engine in CC.
- Power: The maximum power of the engine in bhp.
- Seats: The number of seats in the car.
- New_Price: The price of a new car of the same model in INR Lakhs (1 Lakh = 100,000)
- Price: The price of the used car in INR Lakhs (1 Lakh = 100,000)
Best Practices for Notebook:
- The notebook should be well-documented, with inline comments explaining the functionality of code and markdown cells containing comments on the observations and insights.
- The notebook should be run from start to finish in a sequential manner before submission.
- It is preferable to remove all warnings and errors before submission.
- The notebook should be submitted as an HTML file (.html) and NOT as a notebook file (.ipynb)
Best Practices for Presentation:
- The presentation should be made keeping in mind that the audience will be a business leader like CMO, COO, CFO, or CEO.
- The key points in the presentation should be the following:
  - business overview of the problem and solution approach
  - key findings and insights which can drive business decisions
  - model overview and performance summary
  - business recommendations
- Focus on explaining the takeaways in an easy-to-understand manner.
- The inclusion of the potential benefits of implementing the solution will give you the edge.
- Copying and pasting from the notebook is not a good idea, and it is better to avoid showing codes unless they are the focal point of your presentation.
- The presentation should be submitted as a PDF file (.pdf) and NOT as a .pptx file.
- A presentation template has been provided for reference.
Submission Guidelines:
- There are two parts to the submission:
  - A well-commented Jupyter notebook [format - .html]
  - A presentation as you would present to the top management/business leaders [format - .pdf]
- Any assignment found to be copied or plagiarized from other groups will not be graded and will be awarded zero marks.
- Please ensure timely submission, as any submission made post-deadline will not be accepted for evaluation.
- A submission will not be evaluated if
  - it is submitted post-deadline, or
  - more than 2 files are submitted
FAQ - Cars4U
1. How to convert Mileage from kmpl to km/kg?
kmpl and km/kg are units associated with different types of fuel, but both units refer to the distance covered (in km) per unit of fuel. So, there is no need to convert between them. The units can be stripped off and the numerical values can be used as they are.
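As a minimal sketch of stripping the units, assuming the Mileage values are strings such as "19.67 kmpl" or "26.6 km/kg" (the sample values below are illustrative, not taken from the dataset), the leading number can be extracted with pandas:

```python
import pandas as pd

# Illustrative sample values in the format the data dictionary describes.
mileage = pd.Series(["26.6 km/kg", "19.67 kmpl", "18.2 kmpl", None])

# Keep the leading number and discard the unit; missing entries stay NaN.
mileage_num = mileage.str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype(float)

print(mileage_num.tolist())
```

The same pattern keeps kmpl and km/kg values on one numeric scale, which matches the advice above that no conversion is needed.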
2. Is it advisable to treat 'Seats' as a categorical variable?
If a numerical variable does not have too many distinct values, one can try treating each distinct value as a different category. The best approach in such a case would be to try out both approaches (numerical and categorical treatment) and keep the one that gives better model performance.
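The two treatments could be set up side by side as in the sketch below. The small data frame is made up for illustration; comparing model performance on each version is left to the modelling step.

```python
import pandas as pd

# Made-up sample; only the 'Seats' column name comes from the data dictionary.
df = pd.DataFrame({"Seats": [5, 7, 5, 2, 8]})

# Numerical treatment: use the column as it is.
X_num = df[["Seats"]]

# Categorical treatment: one dummy column per distinct value,
# with the first level dropped to avoid redundant columns.
X_cat = pd.get_dummies(df[["Seats"]], columns=["Seats"], drop_first=True)

print(list(X_cat.columns))
```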
3. Since we have missing values in the dataset, what is the best way to handle or treat those missing values?
There are multiple ways of dealing with missing values.
- It is generally preferred to drop the missing values in the target variable.
- It is generally preferred to impute the missing values in the independent variables using a suitable strategy.
  - For an unskewed numerical variable, the mean of the variable can be used to impute missing values.
  - For a skewed numerical variable, the median of the variable can be used to impute missing values.
  - For a categorical variable, the most frequent value can be used to impute missing values.
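The strategies listed above can be sketched as follows. The data frame is a made-up example, and the choice of which column counts as skewed is an assumption for illustration; in practice it should be checked from the distribution of each variable.

```python
import numpy as np
import pandas as pd

# Made-up sample using column names from the data dictionary.
df = pd.DataFrame({
    "Price": [1.75, 12.5, np.nan, 6.0, 4.5],
    "Mileage": [26.6, np.nan, 18.2, 20.0, 22.0],   # treated as unskewed here
    "Power": [58.0, 126.0, 88.0, np.nan, 500.0],   # treated as skewed here
    "Fuel_Type": ["CNG", "Diesel", "Petrol", "Diesel", None],
})

# Target variable: drop the rows where it is missing.
df = df.dropna(subset=["Price"]).copy()

# Unskewed numerical variable: impute with the mean.
df["Mileage"] = df["Mileage"].fillna(df["Mileage"].mean())

# Skewed numerical variable: impute with the median.
df["Power"] = df["Power"].fillna(df["Power"].median())

# Categorical variable: impute with the most frequent value.
df["Fuel_Type"] = df["Fuel_Type"].fillna(df["Fuel_Type"].mode()[0])

print(df)
```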
4. I am trying to convert the Mileage, Engine, and Power columns to numeric, and it gives me the following error:
could not convert string to float: 'null'
How to resolve?
It is good practice to ensure that all values in a column are numeric before converting the column from object to numeric type. If any non-numeric values (like 'null') are encountered, they must be dealt with first, and then the column should be converted to numeric.
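One possible way to handle this, sketched on made-up Power values, is to strip the unit and let `pd.to_numeric` turn anything that is still not a number (such as 'null') into NaN, which can then be imputed like any other missing value:

```python
import pandas as pd

# Made-up sample values; 'null' entries mimic the error message above.
power = pd.Series(["58.16 bhp", "null bhp", "126.2 bhp", "null"])

# Remove the unit and surrounding whitespace.
cleaned = power.str.replace("bhp", "", regex=False).str.strip()

# errors="coerce" converts non-numeric strings such as 'null' to NaN
# instead of raising "could not convert string to float".
power_num = pd.to_numeric(cleaned, errors="coerce")

print(power_num.tolist())
```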
5. Some values in the New_Price column have units as Lakh while some have units as Cr. How to convert between the units and extract the numbers?
The following conversion rate can be used:
1 Cr = 100 Lakh
Please refer to the income_to_num function in the FIFA Data Preprocessing notebook regarding the extraction of numerical values after conversion.
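The sketch below is not the income_to_num function from the course notebook. It is a hypothetical helper, `new_price_to_lakh`, written only to illustrate the conversion rate above, and it assumes values are formatted like "8.61 Lakh" or "1.5 Cr".

```python
import math

def new_price_to_lakh(value):
    """Hypothetical helper: convert '8.61 Lakh' or '1.5 Cr' to a float in Lakhs."""
    if not isinstance(value, str):
        return math.nan  # missing or already non-string values
    parts = value.strip().split()
    if len(parts) != 2:
        return math.nan  # unexpected format
    number, unit = parts
    try:
        amount = float(number)
    except ValueError:
        return math.nan
    # 1 Cr = 100 Lakh, as stated above.
    multiplier = {"lakh": 1.0, "cr": 100.0}.get(unit.lower())
    return amount * multiplier if multiplier is not None else math.nan

print(new_price_to_lakh("8.61 Lakh"))
print(new_price_to_lakh("1.5 Cr"))
```

On a data frame, a helper like this could be applied with something like `df["New_Price"].apply(new_price_to_lakh)`.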
6. X = sm.add_constant(X) is not working for my project as a new column is not created. Why is this happening? How can this be resolved?
add_constant() does not add a constant column to the data if a constant column already exists in it. Please check if the data has a constant column before using add_constant().
None of the independent variables should be constant, since they all have variability to begin with, so the step at which a variable became constant has to be identified. The outlier treatment step is a good place to start.
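A quick check for constant columns could look like the sketch below. The data frame is made up, with 'Seats' deliberately set to a single value to mimic a column flattened by an earlier preprocessing step.

```python
import pandas as pd

# Made-up predictors; 'Seats' has become constant for illustration.
X = pd.DataFrame({
    "Year": [2012, 2015, 2018, 2011],
    "Seats": [5, 5, 5, 5],
})

# A column with a single distinct value is a constant column.
constant_cols = [col for col in X.columns if X[col].nunique(dropna=False) == 1]

print(constant_cols)
```

Running a check like this after each preprocessing step helps locate where the column became constant. statsmodels' `add_constant` also accepts a `has_constant` argument (`"skip"` by default), but forcing the column with `has_constant="add"` would hide the underlying issue rather than fix it.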
7. What should one do if the p-values are high (> 0.05) for some dummies and not for the other dummies of a categorical variable?
The dummy variables with p-value > 0.05 should be dropped one by one until there are no such variables. After removing each high p-value variable, the regression should be run again, and the p-values of all the variables should be checked.
8. What should one do if the VIF is high (> 5) for some dummies and not for the other dummies of a categorical variable?
The VIF values for dummy variables can be ignored.
If, however, the VIF value is inf or NaN, then one should check if one of the dummy variables was dropped during one-hot encoding. If the VIF value is still inf or NaN, a dummy variable other than the one dropped by drop_first=True should be dropped, and the VIF values should be checked again.
For example, if a categorical variable 'Season' has four levels 'Spring', 'Summer', 'Fall' and 'Winter', and using drop_first=True drops the dummy variable for 'Fall', then one can keep the dummy variable for 'Fall', drop the dummy variable for 'Summer' instead, and then check the VIF values.
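The 'Season' example can be reproduced with pandas as in the sketch below, using a made-up column. The VIF check itself is not shown; it would be rerun on whichever set of dummies is kept.

```python
import pandas as pd

# Made-up data for the 'Season' example.
df = pd.DataFrame({"Season": ["Spring", "Summer", "Fall", "Winter", "Summer"]})

# drop_first=True drops the first level in sorted order, which is 'Fall' here.
default_dummies = pd.get_dummies(df["Season"], prefix="Season", drop_first=True)

# Alternative: create all dummies, then drop a level of your choice ('Summer').
all_dummies = pd.get_dummies(df["Season"], prefix="Season")
custom_dummies = all_dummies.drop(columns="Season_Summer")

print(list(default_dummies.columns))
print(list(custom_dummies.columns))
```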