RevoU Spreadsheet & Statistics Assignment
Project Summary
- Cleaned the dataset using various function in spreadsheet, particularly filter and sort function.
- Performed basic statistical analysis such as statistical description and data distribution analysis.
- Performed Exploratory Data Analysis (EDA) such as univariate and multivariate analysis to get meaningful insights.
- Performed advanced statistical analysis such as linear regression and hypothesis testing.
Insights
CASE 1
- By location, it can be concluded that Mont Kiara, KL occupies the top position with 710 listings, followed by KLCC, KL with 593 and Desa ParkCity with 353.
- By property type, it can be concluded that condominiums occupy the top position with 1171 listings, followed by serviced residences with 673 listings and condominiums (corners) with 602 listings.
- By furnishing, it can be concluded that Partly furnished occupies the top position with 2653 listings, followed by Fully Furnished with 1249 listings and undefined with 560 listings.
CASE 2
- Over time the number of orders has increased until it reaches the highest number of orders in March 2018 of 345 orders.
- The peak day is Tuesday, where the total number of orders reached 894 orders.
- By payment type, the most payment type was credit card with a percentage of 73.21%, followed by boleto with a percentage of 19.33%, voucher at 5.93%, and finally debit card at 1.53%.
- 80% of payments are generated by top States, so that stakeholders can focus on taking action on that State.
CASE 3
- Z = 0.853 < 1.645, therefore hypothesis is not proven.
- P-Score = 0.196 > 0.05 (alpha), therefore failed to reject H0.
Project Files
For a more comprehensive analysis and visualization, please open the project files.Project Background
As a data analyst, we are expected to be able to do some work related to spreadsheets such as data cleaning and EDA. Therefore, we want to test our level of comprehension by doing this assignment. This will help us hone our spreadsheet skill as a data analyst.
Data Scope, Goals & Objectives
CASE 1 | Property Listings in Kuala Lumpur
This dataset was provided by kaggle. Originally from tabular result of scraping a property listing website for properties for sale in Kuala Lumpur, Malaysia. Only the overview page was scraped so individual property details are scarce.
CASE 2 | Brazilian E-Commerce Public Dataset by Olist
This is a Brazilian ecommerce public dataset of orders made at Olist Store. The dataset has information of 100k orders from 2016 to 2018 made at multiple marketplaces in Brazil. Its features allows viewing an order from multiple dimensions: from order status, price, payment and freight performance to customer location, product attributes and finally reviews written by customers. We also released a geolocation dataset that relates Brazilian zip codes to lat/lng coordinates.
CASE 3 | A/B Test Result Dataset
This is a classified e-commerce company data. The dataset is about test result from an experimental landing page. The dataset includes user id, timestamp, group, and converted.
Goals
Learn spreadsheets skills such as data cleaning and EDA. Those skills would be really helpful in data analysis work environment.
Objectives
- Clean the dataset using various function in spreadsheet, particularly filter and sort function.
- Perform basic statistical analysis such as statistical description and data distribution analysis.
- Perform Exploratory Data Analysis (EDA) such as univariate and multivariate analysis to get meaningful insights.
- Perform advanced statistical analysis such as linear regression and hypothesis testing.
Intermediate Assignment
CASE 1
1. Clean up the data by removing records with empty price, splitting price column to get numbers only with number format.
We used filter and split text into column functions.
2. Share some insight that we can get using univariate analysis.
The minimum price value is 408 and the maximum is 130,000,000 which results in a price range of 129,999,592. This value is quite large when compared to the average payment value of 1,290,090. The mean > median > mode indicates a positive skew. The standard deviation is 3,405,036.07. Kurtosis 29,129.32 shows peaked data compared to normal distribution.
The graph shows the calculation of the listing per number of rooms. This graph is made to see the number of rooms with the highest percentage. It can be concluded that 3 rooms occupy the top position with a listing of 966, followed by 3+1 rooms with 908 and 4+1 rooms with 628.
The graph shows the calculation of the listing per property location. This graph was created to see the location of the property with the highest percentage. It can be concluded that Mont Kiara, KL occupies the top position with 710 listings, followed by KLCC, KL with 593 and Desa ParkCity with 353.
The graph shows the calculation of the listing per bathroom type. This graph is made to see the type of bathroom with the highest percentage. It can be concluded that 2 bathrooms occupy the top position with 1621 listings, followed by 3 bathrooms with 847 listings and 4 bathrooms with 797 listings.
The graph shows the calculation of the listing per number of car parks. This graph was created to see the number of car parks with the highest listings. It can be concluded that 2 car parks occupy the top position with 1512 listings, followed by Undefined (null value) with 1510 listings and 1 car park with 1055 listings.
The graph shows the calculation of the listing per property type. This graph is made to see the number of property types with the highest listings. It can be concluded that condominiums occupy the top position with 1171 listings, followed by serviced residences with 673 listings and condominiums (corners) with 602 listings.
The graph shows the calculation of listings per type of furnishings. This graph was created to see the number of types of furnishings with the highest percentage. It can be concluded that Partly furnished occupies the top position with 2653 listings, followed by Fully Furnished with 1249 listings and undefined with 560 listings.
CASE 2
3. Combine orders dataset, customers dataset, and order payment dataset into 1 sheet.
In combining datasets, the vlookup
and importrange
functions in the spreadsheet are used.
4. Based on the combined data above, do some data cleaning.
- We check whether there are a duplicate record and clean them using remove duplicate function.
- We check whether there are null/empty values and remove them using sort and filter function.
5. Do descriptive analysis to clean the data by spotting outliers.
The minimum payment value is 0.01 and the maximum is 2784.16 which results in a payment value range of 2784.15. This value is quite large when compared to the average payment value of 154.95. The mean > median > mode indicates a positive skew. The standard deviation is 197.50. Kurtosis 2267.71 shows peaked data compared to normal distribution.
The minimum value for actual delivery time is 0 days and the maximum is 79 days which results in an actual delivery time range of 79 days. This value is quite large when compared to the average payment value of 11.77 days. The mean > median > mode indicates a positive skew. The standard deviation is 8.74. Kurtosis 484.94 showed peaked data compared to normal distribution.
6. Create basic charts from the order data above.
The graph shows a comparison between the number of orders vs the current month from October 2016 to August 2018. Over time the number of orders has increased until it reaches the highest number of orders in March 2018 of 345 orders. Then the graph went up and down again in April 2018 to August 2018.
The graph shows the comparison between the number of orders vs the day on which the order was made. This chart is made to see Daily Trends/peak days where buyers place orders. It can be concluded that the peak day is Tuesday, where the total number of orders reached 894 orders.
The graph shows the comparison between the number of orders vs. payment type. This graph is made to see what payment types are most often used by buyers. The most obtained were credit cards with a percentage of 76.2%, followed by boleto with a percentage of 20.2%, vouchers at 2% and finally debit cards at 1%.
7. Explore what insight that we can get using bivariate or multivariate analysis.
The graph shows the comparison between the number of payments per state/country. State SP is in the highest payment with a total payment of 256,765. Followed by State RJ, MG, RS and so on. 80% of payments are generated by State SP to DF, so that stakeholders can focus on taking action on that State.
There is a very weak linear correlation between payment value and payment installments.
Advanced Assignment
8. Find variables that corellates with each other using scatterplot, and make linear regression between variables.
There is a strong linear correlation between freight value and product weight (g) - Correlation 0.66
There is a strong linear correlation between freight value and product volume (cm3) - Correlation 0.64
There is a very strong linear correlation between product weight (g) and product volume (cm3) - Correlation 0.84
There is a very strong linear correlation between payment value and price - Correlation 0.93
CASE 3
9.
An e-commerce company is revamping a landing page after various analysis and research, and want to experiment whether the new page will give better conversion rate before rolling out to wilder audience.
Given the experiment results from the control and experimental group we have a hypothesis that the new page (treatment group) will give a better conversion rate.
- Test whether hypothesis is proven or not using z-test.
- Determine if the test reject or fail to reject the hyphotesis.
Home Projects