[Image: multiple code windows and terminals arranged in a grid, symbolizing the parallel, efficient nature of mini-batch processing in SGD]

Lesson 7 - Coding Lesson: Implementing Mini-Batch SGD in Python

Code commanders, prepare for an upgrade! 🚀 In this coding lesson, we're taking our Gradient Descent implementation to the next level by coding Mini-Batch Stochastic Gradient Descent (SGD) in Python! You'll build upon your Batch GD code from Lesson 3, adding the crucial elements of mini-batches and epochs to create a more efficient and powerful optimization algorithm. Let's code SGD and unleash its speed!


Reusing and Modifying Your Batch GD Code

We'll start by reusing and modifying the Batch Gradient Descent code you wrote in Lesson 3. This will save us time and highlight the key changes needed to implement Mini-Batch SGD.

Start by opening your Jupyter Notebook or Colab notebook from Lesson 3 (or create a new notebook and copy over your Batch GD code if needed). We'll modify the batch_gradient_descent function to create our mini_batch_gradient_descent function.

We'll keep the calculate_mse and calculate_gradients functions exactly the same – they will be reused in our Mini-Batch SGD implementation. No need to rewrite those!

import numpy as np # NumPy is used by both helper functions below

# (Code from Lesson 3 - calculate_mse function - REUSE AS IS)
def calculate_mse(y_true, y_predicted):
    """Calculates the Mean Squared Error."""
    n = len(y_true)
    mse = (1 / (2 * n)) * np.sum((y_true - y_predicted)**2)
    return mse

# (Code from Lesson 3 - calculate_gradients function - REUSE AS IS)
def calculate_gradients(X, y_true, y_predicted):
    """Calculates gradients of MSE cost function."""
    n = len(y_true)
    errors = y_predicted - y_true
    gradient_beta0 = (1 / n) * np.sum(errors)
    gradient_beta1 = (1 / n) * np.sum(X * errors)
    return gradient_beta0, gradient_beta1

Reuse calculate_mse and calculate_gradients: Ensure you have these two functions defined in your notebook (copy them over if needed).
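Before moving on, you can sanity-check the two reused helpers on a tiny hand-computable example (the arrays below are made up for illustration; they are not part of the lesson's dataset). If predictions are perfect, both the cost and the gradients should be exactly zero; if every prediction is off by +1, the MSE should be 0.5 and the gradients easy to verify by hand:

```python
import numpy as np

def calculate_mse(y_true, y_predicted):
    """Calculates the Mean Squared Error."""
    n = len(y_true)
    mse = (1 / (2 * n)) * np.sum((y_true - y_predicted)**2)
    return mse

def calculate_gradients(X, y_true, y_predicted):
    """Calculates gradients of MSE cost function."""
    n = len(y_true)
    errors = y_predicted - y_true
    gradient_beta0 = (1 / n) * np.sum(errors)
    gradient_beta1 = (1 / n) * np.sum(X * errors)
    return gradient_beta0, gradient_beta1

X_check = np.array([1.0, 2.0, 3.0])
y_check = np.array([2.0, 4.0, 6.0])  # exactly y = 2x

# Perfect predictions: cost and gradients should all be zero
print(calculate_mse(y_check, 2.0 * X_check))                     # 0.0
print(calculate_gradients(X_check, y_check, 2.0 * X_check))      # (0.0, 0.0)

# Every prediction off by +1: errors are all 1, so
# MSE = (1/6) * 3 = 0.5, grad_beta0 = 1.0, grad_beta1 = (1+2+3)/3 = 2.0
print(calculate_mse(y_check, y_check + 1.0))                     # 0.5
print(calculate_gradients(X_check, y_check, y_check + 1.0))      # (1.0, 2.0)
```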


Step 1: Implement Mini-Batch SGD Function (Core Logic)

Now, let's create the mini_batch_gradient_descent function. We'll start with the core logic of iterating through epochs and mini-batches.

from sklearn.utils import shuffle # Import for shuffling

def mini_batch_gradient_descent(X, y, learning_rate, n_epochs, mini_batch_size):
    """Implements Mini-Batch Stochastic Gradient Descent for linear regression."""
    beta_0 = 0
    beta_1 = 0
    history_cost = []
    n_data = len(X)

    for epoch in range(n_epochs):
        # Shuffle data at the start of each epoch (NEW!)
        X_shuffled, y_shuffled = shuffle(X, y)

        for i in range(0, n_data, mini_batch_size): # Iterate through mini-batches (NEW!)
            X_mini_batch = X_shuffled[i:i + mini_batch_size]
            y_mini_batch = y_shuffled[i:i + mini_batch_size]

            # --- Gradient and Parameter Update Code will go here --- 
            pass # Placeholder

        # Calculate and store cost for full dataset (Moved to epoch loop)
        y_predicted_full = beta_0 + beta_1 * X
        cost = calculate_mse(y, y_predicted_full)
        history_cost.append(cost)

    return beta_0, beta_1, history_cost

Implement mini_batch_gradient_descent (Initial Structure): Copy and paste this code structure into a new cell. Let's understand the key additions for SGD:

- An outer epoch loop: each epoch is one full pass over the training data.
- Shuffling at the start of every epoch (via sklearn's shuffle utility), so mini-batches differ from epoch to epoch.
- An inner loop that slices the shuffled data into mini-batches of size mini_batch_size.
- The cost is computed once per epoch on the full dataset, so history_cost stores one value per epoch (not per update).

This structure sets up the core loops for epochs and mini-batches. Now, let's fill in the gradient calculation and parameter update logic inside the mini-batch loop.
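To see exactly how the inner loop carves the data up, it can help to run the slicing logic on its own with a toy array (the sizes here are arbitrary, chosen so the batch size doesn't divide the data evenly):

```python
import numpy as np

n_data = 10
mini_batch_size = 4
X_demo = np.arange(n_data)  # stand-in for the shuffled data

# Same pattern as the inner loop: start indices 0, 4, 8
batches = [X_demo[i:i + mini_batch_size] for i in range(0, n_data, mini_batch_size)]
for b in batches:
    print(b)
# [0 1 2 3]
# [4 5 6 7]
# [8 9]   <- the last batch is smaller when n_data % mini_batch_size != 0
```

NumPy slicing never raises an error when the end index runs past the array, which is why the final, partial mini-batch comes out automatically.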


Step 2: Add Gradient Calculation and Parameter Update (Mini-Batch Loop)

Let's complete the mini_batch_gradient_descent function by adding the gradient calculation and parameter update steps within the mini-batch loop.

# (Add this code inside the 'for i in range(0, n_data, mini_batch_size):' loop, replacing 'pass')

            y_predicted_mini_batch = beta_0 + beta_1 * X_mini_batch # Predictions for mini-batch
            gradient_beta0, gradient_beta1 = calculate_gradients(X_mini_batch, y_mini_batch, y_predicted_mini_batch) # Gradients on mini-batch

            beta_0 = beta_0 - learning_rate * gradient_beta0
            beta_1 = beta_1 - learning_rate * gradient_beta1

Complete Mini-Batch Loop Logic: Copy and paste the code snippet above to replace the pass statement in your mini_batch_gradient_descent function. This code block does the following for each mini-batch:

- Computes predictions for the current mini-batch only.
- Calculates gradients using just the mini-batch data (reusing calculate_gradients unchanged).
- Updates beta_0 and beta_1 immediately, so the parameters change many times within a single epoch.

With this code added, your mini_batch_gradient_descent function is now complete! It implements the full Mini-Batch SGD algorithm.
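If you'd like to check the finished function end to end, here is a fully assembled version run on synthetic data with a known intercept and slope. The data-generation numbers, seeds, and hyperparameters below are illustrative choices, not values from the lesson, and this sketch shuffles with NumPy's permutation, which plays the same role as sklearn's shuffle utility:

```python
import numpy as np

def calculate_mse(y_true, y_predicted):
    n = len(y_true)
    return (1 / (2 * n)) * np.sum((y_true - y_predicted)**2)

def calculate_gradients(X, y_true, y_predicted):
    n = len(y_true)
    errors = y_predicted - y_true
    return (1 / n) * np.sum(errors), (1 / n) * np.sum(X * errors)

def mini_batch_gradient_descent(X, y, learning_rate, n_epochs, mini_batch_size):
    beta_0, beta_1 = 0.0, 0.0
    history_cost = []
    n_data = len(X)
    rng = np.random.default_rng(0)           # seeded for reproducibility
    for epoch in range(n_epochs):
        perm = rng.permutation(n_data)       # shuffle indices each epoch
        X_shuffled, y_shuffled = X[perm], y[perm]
        for i in range(0, n_data, mini_batch_size):
            X_mb = X_shuffled[i:i + mini_batch_size]
            y_mb = y_shuffled[i:i + mini_batch_size]
            y_pred_mb = beta_0 + beta_1 * X_mb
            g0, g1 = calculate_gradients(X_mb, y_mb, y_pred_mb)
            beta_0 -= learning_rate * g0
            beta_1 -= learning_rate * g1
        history_cost.append(calculate_mse(y, beta_0 + beta_1 * X))
    return beta_0, beta_1, history_cost

# Synthetic data: y = 4 + 3x + noise (illustrative parameters)
rng = np.random.default_rng(42)
X = 2 * rng.random(200)
y = 4 + 3 * X + rng.normal(0, 0.5, 200)

b0, b1, costs = mini_batch_gradient_descent(X, y, 0.05, 100, 32)
print(f"beta_0 ~ {b0:.2f}, beta_1 ~ {b1:.2f}")  # should land near 4 and 3
```

If the learned parameters land close to the values used to generate the data, and the cost history trends downward, your implementation is working.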


Step 3: Run Mini-Batch SGD and Compare with Batch GD

Let's run our Mini-Batch SGD implementation and compare its performance and behavior to Batch Gradient Descent.

# Set hyperparameters for Mini-Batch SGD
learning_rate_sgd = 0.01
n_epochs_sgd = 10 # Each epoch makes many mini-batch updates, so fewer epochs are often enough
mini_batch_size = 32

# Run Mini-Batch SGD
learned_beta0_sgd, learned_beta1_sgd, cost_history_sgd = mini_batch_gradient_descent(X, y, learning_rate_sgd, n_epochs_sgd, mini_batch_size)

print(f"SGD Learned Intercept (beta_0): {learned_beta0_sgd:.2f}")
print(f"SGD Learned Slope (beta_1): {learned_beta1_sgd:.2f}")

# (Code for plotting cost history and regression line comparison - REUSE from Lesson 3 Coding - Step 4, but adapt labels for SGD)

Run Mini-Batch SGD and Compare: Copy and paste this code block (and reuse your plotting code from Lesson 3, adapting labels as needed) into a new cell and run it. This code:

- Sets the SGD hyperparameters: learning rate, number of epochs, and mini-batch size.
- Runs mini_batch_gradient_descent on the same X and y you used for Batch GD.
- Prints the learned intercept and slope so you can compare them with your Batch GD results.

Examine the plots and printed parameters. How does Mini-Batch SGD compare to Batch GD in terms of convergence speed (epochs vs. iterations), cost function behavior (smoothness vs. noise), and the final learned regression line?

Expected output plots: (1) Cost function comparison plot showing SGD cost potentially decreasing faster per epoch than Batch GD per iteration, but possibly with more oscillations. (2) Regression line comparison plot showing both SGD and Batch GD lines fitting the data reasonably well, potentially very similar.
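If you want to make the epochs-versus-iterations comparison concrete in numbers rather than plots, you can count parameter updates directly. The sketch below uses compact reimplementations of both algorithms on synthetic data; all numbers, seeds, and hyperparameters are illustrative assumptions, not values from the lesson:

```python
import numpy as np

rng = np.random.default_rng(0)
X = 2 * rng.random(200)
y = 4 + 3 * X + rng.normal(0, 0.5, 200)  # illustrative synthetic data
n = len(X)

def gradients(X, y, b0, b1):
    err = (b0 + b1 * X) - y
    return err.mean(), (X * err).mean()

def cost(X, y, b0, b1):
    return ((y - (b0 + b1 * X))**2).sum() / (2 * len(y))

lr = 0.05

# Batch GD: one parameter update per full pass over the data
b0, b1 = 0.0, 0.0
for _ in range(50):
    g0, g1 = gradients(X, y, b0, b1)
    b0, b1 = b0 - lr * g0, b1 - lr * g1
batch_updates = 50
batch_cost = cost(X, y, b0, b1)

# Mini-batch SGD: many parameter updates per epoch
b0, b1, batch = 0.0, 0.0, 32
for _ in range(50):
    perm = rng.permutation(n)
    Xs, ys = X[perm], y[perm]
    for i in range(0, n, batch):
        g0, g1 = gradients(Xs[i:i + batch], ys[i:i + batch], b0, b1)
        b0, b1 = b0 - lr * g0, b1 - lr * g1
sgd_updates = 50 * int(np.ceil(n / batch))
sgd_cost = cost(X, y, b0, b1)

print(f"Batch GD:       {batch_updates} updates, final cost {batch_cost:.4f}")
print(f"Mini-batch SGD: {sgd_updates} updates, final cost {sgd_cost:.4f}")
```

With the same number of epochs and learning rate, Batch GD makes one update per epoch while mini-batch SGD makes ceil(n / batch) updates per epoch, which is why SGD typically reaches a lower cost in the same number of passes over the data.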

Experiment and Reflect

Congratulations, Mini-Batch SGD coder! 🎉 You've implemented Mini-Batch Stochastic Gradient Descent and compared it to Batch GD. Now, let's experiment and reflect on the differences.

Stop and Think

Compare the 'Cost Function Comparison' plot. Is the cost function history for SGD smoother or noisier than for Batch GD? Does SGD seem to converge faster in terms of epochs compared to Batch GD in terms of iterations? What do you think causes these differences?


Experiment Tasks

Try the following experiments and note how the results change:

1. Mini-batch size: Re-run Mini-Batch SGD with mini_batch_size = 1 (pure stochastic GD) and with mini_batch_size = len(X) (equivalent to Batch GD). How does the noisiness of the cost history change as the mini-batch size grows?
2. Number of epochs: Increase and decrease n_epochs_sgd. Roughly how many epochs does Mini-Batch SGD need to reach a cost similar to Batch GD's final cost?

Coding Lesson Complete - SGD Master!

Fantastic job, Mini-Batch SGD implementer! 🚀 You've successfully coded Mini-Batch Stochastic Gradient Descent in Python, compared it to Batch GD, and experimented with mini-batch sizes and epochs. You now have a practical, hands-on understanding of SGD and its advantages. Share your code, plots, and experimental observations in the course forum – keep coding and keep optimizing!