A machine learning project that predicts the price of used cars in the UK

Overview

Car Price Prediction

Car Image

Image Credit: AA Cars

Project Overview

  • Scraped data on 3,000 used cars from the AA Cars website using Python and BeautifulSoup.
  • Cleaned the data and built a model to help determine the price of cars on auction.
  • Building a Flask web app to deploy to the cloud (currently under construction).

Packages/Tools Used

  • Python Version: 3.9
  • BeautifulSoup
  • Requests
  • NumPy
  • Matplotlib
  • Seaborn
  • scikit-learn

Data

The data was scraped from multiple pages of the AA Cars site and stored as a CSV file. The scraped data contains:

  • Name
  • Price
  • Year
  • Mileage
  • Engine
  • Transmission
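
As a rough illustration of how these six fields might be pulled from a results page with BeautifulSoup, the sketch below parses a made-up HTML snippet. The markup, the class names and the helper names `parse_listings` and `to_csv_text` are assumptions for illustration only; the real AA Cars page structure is not described in this README.

```python
import csv
import io

from bs4 import BeautifulSoup

# Made-up markup standing in for one results page. The class names are
# hypothetical; the real site's HTML is not documented in this README.
SAMPLE_HTML = """
<div class="listing">
  <h3 class="name">Ford Fiesta 1.0 Zetec</h3>
  <span class="price">£8,495</span>
  <span class="year">2017</span>
  <span class="mileage">32,150 miles</span>
  <span class="engine">Petrol</span>
  <span class="transmission">Manual</span>
</div>
<div class="listing">
  <h3 class="name">BMW 3 Series 2.0 320d</h3>
  <span class="price">£14,250</span>
  <span class="year">2016</span>
  <span class="mileage">58,000 miles</span>
  <span class="engine">Diesel</span>
</div>
"""

FIELDS = ["name", "price", "year", "mileage", "engine", "transmission"]


def parse_listings(html):
    """Return one dict per listing; a field missing from the page becomes None."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_="listing"):
        row = {}
        for field in FIELDS:
            tag = card.find(class_=field)
            row[field] = tag.get_text(strip=True) if tag else None
        rows.append(row)
    return rows


def to_csv_text(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = parse_listings(SAMPLE_HTML)
csv_text = to_csv_text(rows)
```

In a real run the HTML would come from fetching each results page (for example with Requests) rather than from a string, and the CSV text would be written to a file. The second sample listing deliberately lacks a transmission so that the missing value shows up as `None`.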

Data Cleaning

The features (columns) contained messy entries and were tidied using custom functions. The following steps were taken:

  • Removed duplicate rows because they would skew the analysis.
  • Deleted the rows with missing values because they make up less than 1% of the data.
  • Extracted the manufacturer of each car from the name column.
  • Corrected some of the values in the manufacturer column by merging similar values and fixing those that were wrongly extracted.
  • Removed the pound symbol and the commas from the values in the price column.
  • Created an age column by subtracting the values in the year column from the current year, 2021. Age is easier to work with than the raw year.
  • Removed the commas, spaces and the word "miles" from the values in the mileage column.
  • Corrected some of the values in the engine and transmission columns by merging similar values and fixing those that were wrongly extracted.
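
The string-cleaning steps above could look roughly like the sketch below. The function names, the `ALIASES` table and the sample values are assumptions made for illustration; the project's actual custom functions are not shown in this README.

```python
import re

CURRENT_YEAR = 2021  # the reference year this README uses for the age column

# Hypothetical alias table for merging manufacturer variants and fixing
# names that a "first word" rule would extract wrongly.
ALIASES = {
    "land": "Land Rover",
    "mercedes": "Mercedes-Benz",
    "mercedes-benz": "Mercedes-Benz",
    "vw": "Volkswagen",
}


def clean_price(raw):
    """Strip the pound sign, commas and whitespace, e.g. '£8,495' becomes 8495."""
    return int(re.sub(r"[£,\s]", "", raw))


def clean_mileage(raw):
    """Keep only the digits, e.g. '32,150 miles' becomes 32150."""
    return int(re.sub(r"\D", "", raw))


def extract_manufacturer(name):
    """Take the first word of the listing name, then apply the alias table."""
    parts = name.split()
    if not parts:
        return None
    first = parts[0]
    return ALIASES.get(first.lower(), first)


def car_age(year):
    """Age in years relative to CURRENT_YEAR."""
    return CURRENT_YEAR - int(year)
```

For example, `extract_manufacturer("Land Rover Discovery 2.0")` gives `"Land Rover"` through the alias table, while `extract_manufacturer("BMW 3 Series")` keeps `"BMW"` unchanged.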

Exploratory Data Analysis

  • The number of cars from each manufacturer (plot: Car manufacturer distribution)

  • The number of cars from each year (plot: Year distribution)

  • The number of cars with each engine type (plot: Car engine distribution)

  • The number of cars with each transmission type (plot: Car transmission distribution)

  • A word cloud of all car manufacturers (plot: Car manufacturer wordcloud)
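
The counts behind each distribution plot can be computed in a few lines. This is a minimal sketch using only the standard library; the sample rows and the helper name `count_by` are made up for illustration, and the plotting itself (done with Matplotlib and Seaborn in this project) is not shown.

```python
from collections import Counter

# Made-up cleaned rows, for illustration only.
cars = [
    {"manufacturer": "Ford", "year": 2017, "engine": "Petrol", "transmission": "Manual"},
    {"manufacturer": "BMW", "year": 2016, "engine": "Diesel", "transmission": "Automatic"},
    {"manufacturer": "Ford", "year": 2019, "engine": "Petrol", "transmission": "Manual"},
]


def count_by(rows, column):
    """Count how many rows share each value of the given column."""
    return Counter(row[column] for row in rows)


manufacturer_counts = count_by(cars, "manufacturer")
transmission_counts = count_by(cars, "transmission")
```

Calling `manufacturer_counts.most_common()` lists the values from most to least frequent, which is the order a bar chart of the distribution would typically use.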

Model Building

  • The 'name' and 'year' columns were dropped because the manufacturer and age columns derived from them already carry the relevant information.
  • The categorical features (manufacturer, engine and transmission) were transformed into numerical data, and all feature values were scaled to the same range.
  • Linear Regression, Ridge Regression, Random Forest Regressor, AdaBoost Regressor, Support Vector Regressor and CatBoost Regressor models were built.
  • Root mean squared error (RMSE), the square root of the mean of the squared differences between the true and predicted values, was the metric used to evaluate the models.
  • The CatBoost Regressor had the best performance, and its hyperparameters were tuned using GridSearchCV to improve it further.
  • The tuned model was tested on new data and gave good predictions.
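
To make the evaluation metric concrete, here is a small standard-library sketch of RMSE. The function name `rmse` and the example prices are assumptions for illustration; the project itself may have used scikit-learn's metric functions instead.

```python
import math


def rmse(y_true, y_pred):
    """Square root of the mean squared difference between truth and prediction."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up prices in pounds: the two predictions are off by 300 and 400.
true_prices = [8000, 12000]
predicted_prices = [8300, 11600]
error = rmse(true_prices, predicted_prices)
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones, and the result is in the same unit as the price.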

A Flask web app is currently under construction.

NB: I am open to constructive criticism of this project.

Owner
Victor Umunna