A machine learning project that predicts the price of used cars in the UK

Last update: Oct 13, 2022

Car Price Prediction

Image Credit: AA Cars

Project Overview

Scraped data on 3,000 used cars from the AA Cars website using Python and BeautifulSoup.
Cleaned the data and built a model to help estimate the price of cars at auction.
Built a Flask web app and deployed it to the cloud.

Packages/Tools Used

Python Version: 3.9
BeautifulSoup
Requests
NumPy
Matplotlib
Seaborn
scikit-learn

Data

The data was scraped from multiple pages of the AA Cars site and stored as a CSV file. The scraped data contains the following fields:

Name
Price
Year
Mileage
Engine
Transmission
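
A minimal sketch of how listings could be parsed into these fields and written out as CSV. The HTML snippet, the class names (`listing`, `name`, `price` and so on) and the helpers `parse_listings` and `rows_to_csv` are assumptions made for illustration; the real AA Cars markup is not shown in this project and will differ.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup standing in for one results page; the real site differs.
SAMPLE_PAGE = """
<div class="listing">
  <span class="name">Ford Fiesta 1.0 Zetec</span>
  <span class="price">£8,495</span>
  <span class="year">2017</span>
  <span class="mileage">32,150 miles</span>
  <span class="engine">Petrol</span>
  <span class="transmission">Manual</span>
</div>
<div class="listing">
  <span class="name">VW Golf 1.4 TSI</span>
  <span class="price">£10,250</span>
  <span class="year">2016</span>
  <span class="mileage">41,000 miles</span>
  <span class="engine">Petrol</span>
</div>
"""

FIELDS = ["name", "price", "year", "mileage", "engine", "transmission"]


def parse_listings(html):
    """Return one dict per listing; fields missing from the page become None."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_="listing"):
        row = {}
        for field in FIELDS:
            tag = card.find("span", class_=field)
            row[field] = tag.get_text(strip=True) if tag else None
        rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialise the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = parse_listings(SAMPLE_PAGE)
csv_text = rows_to_csv(rows)
print(csv_text)
```

To cover multiple pages, the same parser would be called once per downloaded page and the rows concatenated before writing the file.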

Data Cleaning

The features (columns) contained messy entries, which were tidied using custom functions. The following steps were taken:

Removed duplicate rows because they would skew the analysis.
Deleted the rows with missing values because they made up less than 1% of the data.
Extracted the manufacturer of each car from the name column.
Corrected some values in the manufacturer column by merging similar values and fixing those that were wrongly extracted.
Removed the pound sign and the commas from the values in the price column.
Created an age column by subtracting the values in the year column from the current year at the time, 2021. Age is easier to work with than year.
Removed the commas, spaces and the word "miles" from all values in the mileage column.
Corrected some values in the engine and transmission columns by merging similar values and fixing those that were wrongly extracted.
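
The cleaning steps could be sketched with a few small functions like the ones below. This is an illustration only, not the project's actual code: the function names, the `ALIASES` map and the sample rows are all hypothetical, and only the reference year 2021 comes from the text.

```python
import re

REFERENCE_YEAR = 2021  # the "current year" given in the text

# Hypothetical alias map for merging manufacturer variants; the corrections
# actually applied in the project are not listed in the text.
ALIASES = {"vw": "Volkswagen", "land": "Land Rover"}


def clean_price(raw):
    """Strip the pound sign, commas and whitespace: '£10,250' -> 10250."""
    return int(re.sub(r"[£,\s]", "", raw))


def clean_mileage(raw):
    """Keep only the digits: '41,000 miles' -> 41000."""
    return int(re.sub(r"\D", "", raw))


def extract_manufacturer(name):
    """Use the first word of the name, merged through the alias map."""
    first_word = name.split()[0]
    return ALIASES.get(first_word.lower(), first_word)


def clean_rows(rows):
    """Drop duplicates and rows with missing values, then tidy each column."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(sorted(row.items()))
        has_missing = any(value in (None, "") for value in row.values())
        if key in seen or has_missing:
            continue
        seen.add(key)
        cleaned.append({
            "manufacturer": extract_manufacturer(row["name"]),
            "price": clean_price(row["price"]),
            "age": REFERENCE_YEAR - int(row["year"]),
            "mileage": clean_mileage(row["mileage"]),
        })
    return cleaned


# Made-up sample rows: one duplicate and one row with a missing mileage.
raw_rows = [
    {"name": "VW Golf 1.4 TSI", "price": "£10,250", "year": "2016", "mileage": "41,000 miles"},
    {"name": "VW Golf 1.4 TSI", "price": "£10,250", "year": "2016", "mileage": "41,000 miles"},
    {"name": "Ford Fiesta 1.0 Zetec", "price": "£8,495", "year": "2017", "mileage": None},
]
cleaned = clean_rows(raw_rows)
print(cleaned)
```

Taking the first word of the name is a simplification; two-word manufacturers are the kind of wrongly extracted value that the alias map is meant to correct.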

Exploratory Data Analysis

The number of cars from each manufacturer
The number of cars from each year
The number of cars with each engine type
The number of cars with each transmission type
A word cloud of all car manufacturers
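
As a rough illustration, each of these counts is a frequency table over one column. The sketch below uses only the standard library; the records and the helper `count_by` are hypothetical, and the project itself presumably plotted the counts with Matplotlib or Seaborn.

```python
from collections import Counter

# Made-up records standing in for the cleaned data set.
cars = [
    {"manufacturer": "Ford", "year": 2017, "engine": "Petrol", "transmission": "Manual"},
    {"manufacturer": "Ford", "year": 2018, "engine": "Diesel", "transmission": "Manual"},
    {"manufacturer": "Volkswagen", "year": 2017, "engine": "Petrol", "transmission": "Automatic"},
]


def count_by(records, column):
    """Count how many records share each value of the given column."""
    return Counter(record[column] for record in records)


manufacturer_counts = count_by(cars, "manufacturer")
year_counts = count_by(cars, "year")
transmission_counts = count_by(cars, "transmission")

# most_common() orders the values from most to least frequent.
print(manufacturer_counts.most_common())
```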

Model Building

The 'name' and 'year' columns were dropped because they were no longer needed once the manufacturer and age columns had been created.
The categorical features (name, colour and transmission) were transformed into numerical data, and all feature values were scaled to the same range.
Linear Regression, Ridge Regression, Random Forest Regressor, AdaBoost Regressor and Support Vector Regressor models were built.
Root mean squared error (RMSE), the square root of the mean of the squared differences between the true and predicted values, was the metric used to evaluate model performance.
The CatBoost Regressor model had the best performance, and its hyperparameters were tuned using GridSearchCV to improve it further.
The model was tested on new data and gave good results.
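
The preprocessing and the metric can be sketched in plain Python as follows. The text does not say which encoding or scaling method was used, so one-hot encoding and min-max scaling are assumptions here, and the helper names and sample values are hypothetical. The `rmse` function follows the standard definition: the square root of the mean of the squared errors.

```python
import math


def one_hot(values):
    """Assumed encoding: one 0/1 column per category, in sorted order."""
    categories = sorted(set(values))
    encoded = [[1 if value == category else 0 for category in categories]
               for value in values]
    return encoded, categories


def min_max_scale(values):
    """Assumed scaling: map values linearly onto the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values for illustration.
encoded, categories = one_hot(["Manual", "Automatic", "Manual"])
scaled_age = min_max_scale([1, 3, 5])

actual_prices = [8000, 10000, 12000]
predicted_prices = [8500, 9500, 12000]
score = rmse(actual_prices, predicted_prices)
print(round(score, 2))
```

Because RMSE is expressed in the same unit as the target, a score here reads directly as a typical error in pounds, which makes the models easy to compare.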

A Flask web app is currently under construction.

NB: I am open to constructive criticism of this project.

Owner

Victor Umunna
