RevoU Python Assignment
Project Summary
- Imported and joined the datasets from the source files.
- Cleaned the data by removing null values, outliers, and irrelevant values, and engineered new features.
- Conducted exploratory data analysis on the dataset and conveyed the important findings.
- Created user segmentation using k-means clustering.
Insights
- The number of orders rose over time, peaking at 6,581 orders in November 2017. It then fluctuated until the last recorded month, August 2018, which had 6,313 orders.
- The peak day is Tuesday, with a total of 17,660 orders.
- Credit cards were the most used payment type at 73.21%, followed by boleto at 19.33%, vouchers at 5.93%, and debit cards at 1.53%.
- Obtained 4 clusters based on the elbow method and silhouette analysis.
Project Files
For a more comprehensive analysis and visualization, please open the project files.
Project Background
Python is a challenging yet exciting programming language for data work. In this assignment we practiced Python skills such as data cleaning and exploratory data analysis. We also practiced more advanced skills such as user segmentation using cluster analysis. We used Google Colab as our Python notebook tool.
Data Scope, Goals & Objectives
For this assignment we used the Olist customers dataset from Kaggle, provided by Olist, a Brazilian e-commerce company. The dataset has information on 100k orders from 2016 to 2018 made at multiple marketplaces in Brazil. Its features allow viewing an order from multiple dimensions: from order status, price, payment, and freight performance to customer location, product attributes, and finally reviews written by customers. Olist also released a geolocation dataset that relates Brazilian zip codes to lat/lng coordinates.
Goals
Gustavo just started an e-commerce startup based in Brazil that recently opened a website to sell its products. Fortunately, Gustavo launched the website just as COVID-19 hit, which made the business grow faster than ever. However, Gustavo is still not using targeted marketing, which hurts the marketing budget because only a fraction of users come back to the website. Gustavo needs your help to increase the marketing conversion rate through more targeted marketing based on customer segmentation, so that it does not hurt the budget.
Objectives
- Conducting data cleaning using various methods so the result would be more accurate.
- Exploratory data analysis of the dataset to help Gustavo find the problems within his e-commerce company and make important decisions.
- Creating user segmentation using cluster analysis to help targeted marketing.
Data Analysis
Note: only the important steps are shown, to keep the explanation of the analysis simple.
Data Preparation & Cleaning
Data Prep
Prepared the Python environment by loading the necessary libraries.
Import Dataset
Imported the datasets from Google Sheets using a CSV reader.
Handling Data
Removed irrelevant features from the dataset.
Removed duplicate values from the dataset.
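As a rough illustration of these two steps, here is a minimal pandas sketch. The toy table and the `internal_note` column are hypothetical stand-ins, not the actual features we dropped.

```python
import pandas as pd

# Hypothetical toy orders table; column names are illustrative only.
orders = pd.DataFrame({
    "order_id": ["a1", "a2", "a2", "a3"],
    "customer_id": ["c1", "c2", "c2", "c3"],
    "internal_note": ["x", "y", "y", "z"],  # stands in for an irrelevant feature
})

# Drop a column that is not needed for the analysis.
orders = orders.drop(columns=["internal_note"])

# Drop fully duplicated rows, keeping the first occurrence.
orders = orders.drop_duplicates().reset_index(drop=True)

print(orders)
```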
Feature Format
Converted the relevant features to datetime format, which is necessary for analyzing the data further.
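A conversion like this might look as follows in pandas. The column name `order_purchase_timestamp` and the sample values are assumptions for the sketch; the derived month and weekday columns show why a datetime type helps later.

```python
import pandas as pd

# Toy data; the column name and timestamps are assumed for illustration.
orders = pd.DataFrame({
    "order_id": ["a1", "a2"],
    "order_purchase_timestamp": ["2017-11-24 10:15:00", "2018-08-07 18:30:00"],
})

# Parse the text column into real datetime values.
orders["order_purchase_timestamp"] = pd.to_datetime(orders["order_purchase_timestamp"])

# With a datetime column, month and weekday can be derived directly.
orders["order_month"] = orders["order_purchase_timestamp"].dt.to_period("M")
orders["order_day"] = orders["order_purchase_timestamp"].dt.day_name()

print(orders)
```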
Handling Outliers
Removed the outliers from the data so that extreme values would not distort the analysis results.
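The write-up does not state which outlier rule was applied. One common choice is the 1.5 × IQR rule, sketched below on made-up payment values; treat both the rule and the data as assumptions.

```python
import pandas as pd

# Made-up payment values with one obvious extreme value.
payments = pd.DataFrame(
    {"payment_value": [20.0, 35.0, 50.0, 65.0, 80.0, 95.0, 5000.0]}
)

# Interquartile range of the column.
q1 = payments["payment_value"].quantile(0.25)
q3 = payments["payment_value"].quantile(0.75)
iqr = q3 - q1

# Keep only rows inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
cleaned = payments[payments["payment_value"].between(lower, upper)]

print(cleaned)
```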
Combining Data
Combined the datasets using the join function.
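A join of this kind can be sketched with `DataFrame.merge`. The key `order_id`, the left join, and the toy rows are assumptions; the actual keys and join type in the notebook may differ.

```python
import pandas as pd

# Toy tables; names and values are illustrative only.
orders = pd.DataFrame({
    "order_id": ["a1", "a2", "a3"],
    "customer_id": ["c1", "c2", "c3"],
})
payments = pd.DataFrame({
    "order_id": ["a1", "a2", "a2"],
    "payment_type": ["credit_card", "boleto", "voucher"],
    "payment_value": [100.0, 40.0, 10.0],
})

# Left join: keep every order, attach payment rows where they exist.
merged = orders.merge(payments, on="order_id", how="left")

print(merged)
```

Note that an order with two payment rows appears twice after the join, and an order without payments gets missing values, so row counts are worth checking after merging.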
Exploratory Data Analysis
Evaluative descriptive statistics
Numerical feature description using the describe function in Python.
Actual Delivery Time
The minimum actual delivery time is -7 and the maximum is 28, which gives a range of 35. This range is quite large compared to the average actual delivery time of 10.09. Mean > median > mode indicates a positive skew. The standard deviation is 6.06, and the kurtosis of 0.12 indicates a distribution close to normal.
Payment Sequential
The minimum payment sequential value is 1 and the maximum is 29, which gives a range of 28. This range is quite large compared to the average payment sequential value of 1.10. Mean > median > mode indicates a positive skew. The standard deviation is 0.74, and the kurtosis of 338.85 indicates data that is far more peaked than a normal distribution.
Payment Installments
The minimum payment installments value is 0 and the maximum is 24, which gives a range of 24. This range is quite large compared to the average payment installments value of 2.61. Mean > median > mode indicates a positive skew. The standard deviation is 2.44, and the kurtosis of 3.10 indicates data that is more peaked than a normal distribution.
Payment Value
The minimum payment value is 0 and the maximum is 344.34, which gives a range of 344.34. This range is quite large compared to the average payment value of 109.37. Mean > median > mode indicates a positive skew. The standard deviation is 72.53, and the kurtosis of 0.52 indicates a distribution close to normal.
Categorical feature description using the describe function in Python.
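The statistics discussed in this section can be reproduced along these lines. This is a sketch on made-up values, not the project data. Note that pandas' `kurt()` reports excess kurtosis, so a normal distribution scores about 0.

```python
import pandas as pd

# Made-up sample; values are illustrative only.
df = pd.DataFrame({
    "payment_installments": [1, 1, 1, 2, 3, 10],
    "payment_type": ["credit_card", "credit_card", "boleto",
                     "credit_card", "voucher", "credit_card"],
})

# Numerical summary: count, mean, std, min, quartiles, max.
numeric_summary = df["payment_installments"].describe()

# Categorical summary: count, unique, top (most frequent), freq.
categorical_summary = df["payment_type"].describe()

# Shape of the distribution: skew > 0 means a longer right tail.
skewness = df["payment_installments"].skew()
excess_kurtosis = df["payment_installments"].kurt()

print(numeric_summary, categorical_summary, skewness, excess_kurtosis, sep="\n")
```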
Number of orders per month
The graph shows the number of orders per month from October 2016 to August 2018. The number of orders rose over time, peaking at 6,581 orders in November 2017. It then fluctuated until the last recorded month, August 2018, which had 6,313 orders.
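A monthly count like the one plotted here could be computed as below. The column names and the handful of toy orders are assumptions for the sketch.

```python
import pandas as pd

# Toy orders; column names and timestamps are assumed.
orders = pd.DataFrame({
    "order_id": ["a1", "a2", "a3", "a4", "a5", "a6"],
    "order_purchase_timestamp": pd.to_datetime([
        "2017-10-03", "2017-11-10", "2017-11-24",
        "2017-11-25", "2017-12-02", "2017-12-15",
    ]),
})

# Count distinct orders in each calendar month.
month = orders["order_purchase_timestamp"].dt.to_period("M")
orders_per_month = orders.groupby(month)["order_id"].nunique()

# Month with the highest number of orders.
peak_month = orders_per_month.idxmax()

print(orders_per_month)
print("Peak month:", peak_month)
```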
Daily orders trend
The graph shows the number of orders by the day of the week on which the order was placed. It is meant to reveal the daily trend and the peak days when buyers place orders. The peak day is Tuesday, with a total of 17,660 orders.
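The weekday breakdown can be sketched in a similar way. The timestamps below are made up; reindexing keeps the days in calendar order and shows days without orders as 0.

```python
import pandas as pd

# Made-up purchase timestamps.
timestamps = pd.Series(pd.to_datetime([
    "2018-08-07", "2018-08-14", "2018-08-08",
]))

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Count orders per weekday name, in calendar order.
orders_per_day = (
    timestamps.dt.day_name()
    .value_counts()
    .reindex(weekday_order, fill_value=0)
)

peak_day = orders_per_day.idxmax()
print(orders_per_day)
print("Peak day:", peak_day)
```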
Percentage of orders per payment type
The graph shows the share of orders by payment type. It is meant to reveal which payment types buyers use most often. Credit cards were the most used at 73.21%, followed by boleto at 19.33%, vouchers at 5.93%, and debit cards at 1.53%.
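Percentages like these come from a normalized frequency count. A minimal sketch on made-up payment rows, so the numbers below do not match the project results:

```python
import pandas as pd

# Made-up payment rows: 7 credit card, 2 boleto, 1 voucher.
payments = pd.DataFrame({
    "payment_type": ["credit_card"] * 7 + ["boleto"] * 2 + ["voucher"],
})

# Share of each payment type as a percentage, largest first.
share = (
    payments["payment_type"]
    .value_counts(normalize=True)
    .mul(100)
    .round(2)
)

print(share)
```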
User Segmentation
Preparing the data for cluster analysis
How many clusters? We used the elbow method and silhouette analysis.
The elbow method shows that 4 clusters is the optimal number.
The silhouette analysis also shows that 4 clusters is the optimal number.
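Both checks can be sketched with scikit-learn as follows. The data here is synthetic (four well-separated blobs from `make_blobs`), chosen only so the example is self-contained; it is not the Olist data, and the range of k values is an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D data with four well-separated groups (illustration only).
X, _ = make_blobs(
    n_samples=400,
    centers=[[-10, -10], [-10, 10], [10, -10], [10, 10]],
    cluster_std=1.0,
    random_state=42,
)

inertias, silhouettes = {}, {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_                         # elbow method input
    silhouettes[k] = silhouette_score(X, model.labels_)  # higher is better

# Elbow: look for the k where inertia stops dropping sharply.
# Silhouette: pick the k with the highest average score.
best_k = max(silhouettes, key=silhouettes.get)
print("Best k by silhouette:", best_k)
```

In practice the inertia values are plotted against k and the "elbow" is read off the curve by eye, while the silhouette score gives a single number per k to compare.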
Creating cluster using K-Means
K-Means clustering was used because the data has more numerical features than categorical features.
- The distribution of data across the clusters is quite good (no cluster has a small count).
- Cluster 0: prefers to pay by credit card, with relatively normal payment values >> can be named: "Middle Class"
- Cluster 1: prefers to pay (predominantly) by credit card, with relatively huge payment values >> can be named: "The Real Sultans"
- Cluster 2: prefers to pay by boleto (cash), with relatively normal payment values >> can be named: "Boleto Lovers"
- Cluster 3: prefers to pay by voucher and debit card (cash), with relatively small payment values >> can be named: "Frugal Living"
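Cluster profiles like the ones above can be obtained by fitting K-Means and then summarizing each cluster. The sketch below is a simplified assumption of that workflow: the two features, the one-hot encoding, the scaling step, and the toy rows are all illustrative, and cluster numbers will not match the ones in this write-up.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy customers; values are made up for illustration.
customers = pd.DataFrame({
    "payment_type": ["credit_card"] * 6 + ["boleto"] * 2
                    + ["voucher", "voucher", "debit_card"],
    "payment_value": [100.0, 110.0, 120.0, 900.0, 950.0, 1000.0,
                      100.0, 115.0, 15.0, 20.0, 25.0],
})

# One-hot encode the categorical feature and scale everything,
# since K-Means relies on Euclidean distance.
features = pd.get_dummies(customers, columns=["payment_type"], dtype=float)
scaled = StandardScaler().fit_transform(features)

# Fit K-Means with 4 clusters and attach the labels.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
customers["cluster"] = kmeans.fit_predict(scaled)

# Profile each cluster: size, average payment, dominant payment type.
profile = customers.groupby("cluster").agg(
    n_customers=("payment_value", "size"),
    mean_payment=("payment_value", "mean"),
    top_payment_type=("payment_type", lambda s: s.mode().iloc[0]),
)
print(profile)
```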
Recommendation
- The market is predominantly populated by the middle class, so discounts or promos would help boost sales.
- The real sultans also make up a significant share of the population, and these people tend to be more loyal and come back to buy more. Increasing benefits or creating a customer loyalty program would help retain these customers, since they can be significant revenue drivers.