This project presents a complete Exploratory Data Analysis (EDA) and Multiple Linear Regression modeling workflow using the California Housing dataset.
The task simulates a real-world Data Analyst role in a US Real Estate firm, focusing on data understanding, visualization, dimensionality reduction, and predictive modeling.
The implementation uses Python, pandas, matplotlib, seaborn, and scikit-learn in a Jupyter Notebook.
The dataset contains information about housing blocks in California.
| Feature | Description |
|---|---|
| longitude | How far west a house is (values are negative in California; more negative = farther west) |
| latitude | How far north a house is (higher = farther north) |
| housingMedianAge | Median age of houses in a block |
| totalRooms | Total rooms in a block |
| totalBedrooms | Total bedrooms in a block |
| population | Population in a block |
| households | Number of households in a block |
| medianIncome | Median household income (in tens of thousands USD) |
| medianHouseValue | Median house value (target variable) |
| oceanProximity | Proximity of houses to the ocean |
Note: Due to temporary access issues with `sklearn.datasets.fetch_california_housing`, the dataset was downloaded manually and loaded locally.
- Perform thorough data exploration and understanding
- Analyze skewness and outliers
- Study feature correlations
- Apply Principal Component Analysis (PCA)
- Build a Multiple Linear Regression model
- Evaluate the model using multiple performance metrics
- Interpret model strengths and limitations
- Loaded dataset using pandas
- Displayed:
  - Dataset shape
  - Column names
  - First 10 rows
- Used `.info()` and `.describe()` to understand structure and statistics
- Identified missing values in `totalBedrooms`
- Computed feature variances to find the largest-variance feature
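
The loading and exploration steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the inline CSV is a made-up four-row stand-in for the manually downloaded file (whose name and path are not stated here), and the variable names are assumptions.

```python
import io

import pandas as pd

# Made-up stand-in rows; with the real file, pass its local path to read_csv instead.
sample_csv = """longitude,latitude,housingMedianAge,totalRooms,totalBedrooms,population,households,medianIncome,medianHouseValue,oceanProximity
-122.23,37.88,41,880,129,322,126,8.3252,452600,NEAR BAY
-122.22,37.86,21,7099,1106,2401,1138,8.3014,358500,NEAR BAY
-118.30,34.05,52,1467,,496,177,2.5625,152300,<1H OCEAN
-119.80,36.78,30,2500,500,1400,480,3.1000,95000,INLAND
"""
df = pd.read_csv(io.StringIO(sample_csv))

# Basic structure: shape, column names, first rows
print(df.shape)
print(df.columns.tolist())
print(df.head(10))

# Dtypes, non-null counts, and summary statistics
df.info()
print(df.describe())

# Missing values per column, and the numeric feature with the largest variance
missing_counts = df.isnull().sum()
variances = df.var(numeric_only=True)
largest_variance_feature = variances.idxmax()
print(missing_counts)
print(largest_variance_feature)
```

Raw variances depend on each column's units, so the largest-variance feature here mostly reflects scale rather than importance.
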
- Plotted histograms for all numeric features
- Experimented with different bin sizes
- Observed right-skewed distributions in:
  - `medianIncome`
  - `totalRooms`
  - `population`
- Used box plots to detect outliers in:
  - `medianIncome`
  - `averageRooms`
  - `population`
- Identified several high-value outliers affecting distributions
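
As a rough sketch of the skewness and outlier checks above, the snippet below measures skew, flags outliers with the 1.5 × IQR rule (the convention box-plot whiskers typically follow), and applies a `log1p` transform as one common way to reduce right skew. The data is synthetic (a log-normal sample standing in for a column such as `population`), and the choice of `log1p` is an assumption, not necessarily what the notebook does.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed sample standing in for a column such as `population`
population = pd.Series(rng.lognormal(mean=7.0, sigma=0.8, size=2000), name="population")

# Positive skewness indicates a long right tail
skew_before = population.skew()

# 1.5 * IQR rule: points beyond the fences are flagged as outliers
q1 = population.quantile(0.25)
q3 = population.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = population[(population < lower_fence) | (population > upper_fence)]

# log1p compresses large values, which pulls in the right tail
skew_after = np.log1p(population).skew()

print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
print(f"outliers flagged: {len(outliers)}")
```
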
- Computed correlation matrix
- Visualized correlations using a heatmap
- Used correlation insights for feature selection
- Observed weak linear correlation of latitude and longitude with house prices
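
The correlation step can be illustrated with a small sketch. The data below is synthetic: the target is constructed to depend on income only, so the numbers say nothing about the real dataset; they only show how the matrix is computed and ranked.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

# Synthetic columns; the target is built from income plus noise, independent of longitude
median_income = rng.normal(4.0, 1.5, n)
longitude = rng.uniform(-124.0, -114.0, n)
median_house_value = 40000 * median_income + rng.normal(0, 30000, n)

df = pd.DataFrame(
    {
        "medianIncome": median_income,
        "longitude": longitude,
        "medianHouseValue": median_house_value,
    }
)

# Pearson correlation matrix of all numeric columns
corr = df.corr()

# Correlation of each feature with the target, ranked by absolute strength
target_corr = corr["medianHouseValue"].drop("medianHouseValue")
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked)
```

With seaborn available, `sns.heatmap(corr, annot=True)` renders the same matrix as a heatmap. Pearson correlation only captures linear association, so a weak coefficient does not rule out a nonlinear relationship.
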
- Scatter plot:
  - Longitude vs Latitude
  - Color mapped by Median House Value
  - Point size scaled by Population
- Revealed higher house prices near coastal regions
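
A scatter plot of this kind might be drawn as sketched below. The coordinates and values are random placeholders, so the coastal pattern will not appear here; the divisor used for point sizes and the colormap are arbitrary choices for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
n = 300

# Random placeholder data roughly covering California's bounding box
longitude = rng.uniform(-124.3, -114.3, n)
latitude = rng.uniform(32.5, 42.0, n)
population = rng.integers(100, 5000, n)
median_house_value = rng.uniform(50_000, 500_000, n)

fig, ax = plt.subplots(figsize=(8, 6))
points = ax.scatter(
    longitude,
    latitude,
    c=median_house_value,  # color encodes house value
    s=population / 100,    # point area scales with population
    cmap="viridis",
    alpha=0.5,
)
fig.colorbar(points, ax=ax, label="Median house value")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
```
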
- Standardized features
- Applied PCA
- Plotted Explained Variance Ratio (Scree Plot)
- Selected top 2 principal components
- Scatter plot of PC1 vs PC2, colored by house value
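
The PCA steps above can be sketched as follows, assuming scikit-learn's `StandardScaler` and `PCA`. The feature matrix is synthetic, with columns on deliberately different scales to show why standardization comes first.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n = 400

# Synthetic features on very different scales; the first two are strongly correlated
base = rng.normal(size=n)
X = np.column_stack(
    [
        base * 1000 + rng.normal(0, 100, n),
        base * 300 + rng.normal(0, 50, n),
        rng.normal(4, 1.5, n),
        rng.normal(30, 12, n),
    ]
)

# Standardize so that no feature dominates purely because of its units
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA on all components to inspect the explained variance ratio
pca = PCA()
pca.fit(X_scaled)
explained = pca.explained_variance_ratio_
print(explained)

# Project onto the top two principal components
X_2d = PCA(n_components=2).fit_transform(X_scaled)
print(X_2d.shape)
```

Plotting `explained` against the component index gives the scree plot, and the two columns of `X_2d` are the PC1 and PC2 coordinates for the scatter plot.
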
- Built a scikit-learn Pipeline including:
  - Feature scaling
  - Linear Regression
- Excluded latitude and longitude based on correlation analysis
- Extracted:
  - Regression coefficients
  - Intercept
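
A pipeline of this shape might look like the sketch below. The step names (`"scaler"`, `"regressor"`) and the two synthetic features are assumptions for illustration; the notebook's actual feature set is not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n = 500

# Two synthetic features and a target that is linear in both, plus noise
X = np.column_stack([rng.normal(4, 1.5, n), rng.normal(30, 12, n)])
y = 40000 * X[:, 0] + 500 * X[:, 1] + rng.normal(0, 10000, n)

# Scaling and regression chained so both are fit together
model = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("regressor", LinearRegression()),
    ]
)
model.fit(X, y)

# Coefficients and intercept live on the final step of the pipeline
coefficients = model.named_steps["regressor"].coef_
intercept = model.named_steps["regressor"].intercept_
print(coefficients, intercept)
```

Because the features are standardized inside the pipeline, each coefficient is the change in the target per one standard deviation of that feature, and the intercept equals the mean of the training target.
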
Model performance was evaluated using:
- Mean Squared Error (MSE)
- Mean Absolute Error (MAE)
- R² Score
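
For reference, these three metrics can be computed with `sklearn.metrics` as in the sketch below. The arrays are small made-up values chosen so the arithmetic is easy to check by hand.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions; errors are 0.5, 0.0, -0.5, -1.0
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors: 1.5 / 4
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors: 2.0 / 4
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot

print(mse, mae, r2)
```

MSE penalizes large errors more heavily than MAE, while R² reports the share of target variance the model explains relative to always predicting the mean.
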
- Q.1) Explain the most widely used methods to eliminate skewness in column features.
- Q.2) Why is scaling required before PCA?
- Q.3) Is a high R² score always good?
- Q.4) Is a low training loss always preferred?

Answers to all of the above questions are given in the Jupyter notebook.
- Predicted vs Actual values
- Residuals vs Predicted values
- Together, these plots reveal the limitations of linear regression
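
The two diagnostic plots can be sketched as below. The data is synthetic, with a deliberately curved relationship, to show how a straight-line fit leaves a visible pattern in the residuals; it is an illustration under that assumption, not the notebook's result.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)

# Synthetic target with a quadratic relationship that a linear model cannot fully capture
x = rng.uniform(0, 10, 300)
y = 2.0 * x**2 + rng.normal(0, 5, 300)

model = LinearRegression().fit(x.reshape(-1, 1), y)
y_pred = model.predict(x.reshape(-1, 1))
residuals = y - y_pred

fig, (ax_fit, ax_res) = plt.subplots(1, 2, figsize=(12, 5))

# Predicted vs actual: points on the diagonal are perfect predictions
ax_fit.scatter(y, y_pred, alpha=0.5)
ax_fit.plot([y.min(), y.max()], [y.min(), y.max()], color="red")
ax_fit.set_xlabel("Actual")
ax_fit.set_ylabel("Predicted")

# Residuals vs predicted: a curved band signals a relationship the model missed
ax_res.scatter(y_pred, residuals, alpha=0.5)
ax_res.axhline(0, color="red")
ax_res.set_xlabel("Predicted")
ax_res.set_ylabel("Residual")
```

A well-specified linear model would show residuals scattered evenly around zero; curvature or a funnel shape points to nonlinearity or non-constant variance.
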