CMU 10-714 Assignments 实验笔记

~~随缘更新ing~~

前置要求：微积分、线性代数(可能需要一点矩阵论)、DL基础、Python(熟悉 numpy 和 pytorch 更好)。

几乎不使用任何第三方库实现几个深度学习库，包括自动微分、常见神经网络、数据加载器、优化器，并使用 CUDA 实现 GPU 加速。代码量适中，但是基本 pytorch 中有的东西都有了。如果还想深入了解 torch.compiler 相关内容，可以继续学习Machine Learning Compiler，通过 TVM 介绍了 AI Compiler 中的大部分内容。

实验环境为 Arch Linux，环境如下：

既然这门课为 DL Systems, 本文将重点放在实现上，而不是各种数学公式的推导。

由于官方自动评分系统目前不再接受非选课学生注册，因此本代码仅保证能够通过已有测试样例。

本人不是特别熟悉 python 语法，实现过程中可能有不符合 python 规范的地方，欢迎大家指正。

资源存档

课程官网：Deep Learning Systems

实验要求即测试：Assignments

环境配置

采用 conda 进行包管理。 Arch 用户直接 yay 安装 anaconda 即可。

source /opt/anaconda/bin/activate base
conda --crate needle python=3.12
conda activate needle
pip install pytest numpy mugrade

HW0

实验仓库地址：Assignment 1

第一个 homework 主要用于简要复习 Machine Learning 的内容。需要实现包括训练两层的线性神经网络在内的 7 个函数。

Basic `add` function

用来熟悉评测系统，简单的返回 a + b 即可。

1 2	`def add(x, y): return x + y`

然后根据实验提示看看 test 文件内的测试框架，确保无误(谁家好人 a + b 能写错)后运行 python3 -m pytest -k "add" 进行测试。

paese_mnist

读取 MNIST 手写数字数据集。LeCun 老爷子的博客内对这个数据集的格式作了详细介绍，直接看 FILE FORMATS FOR THE MNIST DATABASE 这个部分即可。

整个数据集分为 4 个文件，分别为 train_image, train_label, test_imgae, test_label。整个数据集直接使用原始的二进制来存储，前 $2$ 个字节固定为 $0$，第 $3$ 个字节表明数据类型，第 $4$ 个字节表明维度数。之后的 $n * 4$ 个字节表明数据的各个维度大小，其中 $n$ 为将第 $4$ 个字节视为 uint8 之后的大小。之后即 MNIST 数据集的实际图片。

例如 0x00 0x00 0x08 0x03 60000 28 28 表明，数据集采用 uint8 格式存储(第 $4$ 个字节为 0x08,约定 0x08 为 uint8)，数据集有 $3$ 个维度(第 $4$ 个字节为 0x03)，之后的 $3 * 4 = 12$ 个字节表明数据集的维度为 (60000 * 28 * 28)。

具体实现中，使用 gzip 库对文件进行 按字节 读取，最后进行标准化即可(每个像素灰度值除以 $255$)。

def parse_mnist(image_filename, label_filename):
    with gzip.open(image_filename, 'rb') as image:
        zero, dtype, dims = struct.unpack('>HBB', image.read(4))
        shape = tuple(struct.unpack('>I', image.read(4))[0] for _ in range(dims))
        image_data = np.frombuffer(image.read(), dtype=np.uint8).reshape(shape[0], -1)
        image_data = image_data.astype(np.float32) / 255.0

    with gzip.open(label_filename, 'rb') as label:
        zero, dtype, dims = struct.unpack('>HBB', label.read(4))
        shape = tuple(struct.unpack('>I', label.read(4))[0] for _ in range(dims))
        label_data = np.frombuffer(label.read(), dtype=np.uint8).reshape(shape[0])

    return (image_data, label_data)

Softmax loss

实现交叉熵损失函数。照着公式写即可，不再赘述：

def softmax_loss(Z, y):
    sum_log = np.log(np.sum(np.exp(Z), axis=1)) # shape (batch_size)
    prob = Z[np.arange(Z.shape[0]), y]
    return np.mean(sum_log - prob)

SGD for Softmax Regression

实现 softmax 回归的一个 epoch 上的训练过程。

首先从数据中每次拿出一个 batch 大小的数据，然后根据公式算即可。label 转换为 one-hot 的小 trick：I_y = np.eye(num_classes)[y_batch]

def softmax_regression_epoch(X, y, theta, lr = 0.1, batch=100):
    m = X.shape[0]   
    num_classes = theta.shape[1]

    for start in range(0, m, batch):
        end = min(start + batch, m)
        X_batch = X[start:end]
        y_batch = y[start:end]
        batch_size = X_batch.shape[0]

        logits = X_batch @ theta
        exp_logits = np.exp(logits)
        Z = exp_logits / np.sum(exp_logits, axis=1, keepdims=True)
        I_y = np.eye(num_classes)[y_batch]

        grad = X_batch.T @ (Z - I_y) / batch_size
        theta -= lr * grad

SGD for Two-layer nn

实现一个双层感知机在一个 epoch 上的训练过程。

ReLu 本质为和 $0$ 取最大值，张量版本的 max 为 np.maximum。需要注意的是，除法运算能提前就提前，否则可能会导致精度不够。

def nn_epoch(X, y, W1, W2, lr = 0.1, batch=100):
    def ReLU(X):
        return np.maximum(0, X)

    m = X.shape[0]
    num_classes = W2.shape[1]

    for start in range(0, m, batch):
        end = min(start + batch, m)
        X_batch = X[start:end, ]
        y_batch = y[start:end]
        batch_size = X_batch.shape[0]

        Z1 = ReLU(X_batch @ W1)
        Z2_exp = np.exp(Z1 @ W2)
        I_y = np.eye(num_classes)[y_batch]
        G2 = Z2_exp / np.sum(Z2_exp, axis=1, keepdims=True) - I_y
        G1 = G2 @ W2.T * (Z1 > 0).astype(np.float32)

        G_W1 = X_batch.T @ G1 / batch_size
        G_W2 = Z1.T @ G2 / batch_size

        W1 -= lr * G_W1
        W2 -= lr * G_W2

Softmax Regression by cpp

使用 cpp 实现上面的 Softmax 回归。

与 Python 版本不懂，cpp 没有那么多的语法糖，需要手写矩阵乘法，需要处理多维数据映射为一维。不过在处理 ont-hot 方面比 Python 自由了很多。

void softmax_regression_epoch_cpp(const float *X, const unsigned char *y,
								  float *theta, size_t m, size_t n, size_t k,
								  float lr, size_t batch)
{
    for (size_t i = 0;i < m;i += batch) {
        assert(batch != 0);
        size_t batch_size = std::min(batch, m - i);
        const float *X_b = X + i * n;
        const unsigned char *y_b = y + i;

        float *logits = new (std::nothrow) float[batch_size * k];
        float *grad_logits = new (std::nothrow) float[batch_size * k];
        float *grad_theta = new (std::nothrow) float[n * k]();

        if (!logits || !grad_logits || !grad_theta) {
            delete[] logits;
            delete[] grad_logits;
            delete[] grad_theta;
            return ;
        }

        // Forward pass: copute logits = X_b * theta
        for (size_t b = 0;b < batch_size;b ++) {
            for (size_t c = 0;c < k;c ++) {
                float tot  = 0.0f;
                for (size_t d = 0;d < n;d ++) {
                    tot += X_b[b * n + d] * theta[d * k + c];
                }
                logits[b * k + c] = tot;
            }
        } 

        // Softmax and loss gradient
        for (size_t b = 0;b < batch_size;b ++) {
            float max_logit = logits[b * k];
            for (size_t c = 1;c < k;c ++) {
                if (logits[b * k + c] > max_logit) {
                    max_logit = logits[b * k + c];
                }
            }

            // grad_logits = exp(x @ theta)
            float sum_exp = 0.0f;
            for (size_t c = 0;c < k;c ++) {
                float exp_val = std::exp(logits[b * k + c] - max_logit);
                grad_logits[b * k + c] = exp_val;
                sum_exp += exp_val;
            }

            // grad_logits = Z
            int true_class = static_cast<int>(y_b[b]);
            for (size_t c = 0;c < k;c ++) {
                grad_logits[b * k + c] /= sum_exp;
            }
            // grad_logits = Z - I_y
            grad_logits[b * k + true_class] -= 1.0f;
        }

        // Backward
        for (size_t b = 0;b < batch_size;b ++) {
            for (size_t d = 0;d < n;d ++) {
                float x_val = X_b[b * n + d];
                for (size_t c = 0;c < k;c ++) {
                    // in-place transpose
                    grad_theta[d * k + c] += x_val * grad_logits[b * k + c];
                }
            }
        }

        float scale = lr / static_cast<float>(batch_size);
        for (size_t j = 0;j < n * k;j ++) {
            theta[j] -= scale * grad_theta[j];
        }

        delete[] logits;
        delete[] grad_logits;
        delete[] grad_theta;
    }  
}

HW0 总结

建议看完 Lecture 2 之后再写这个实验。没经过西瓜书拷打的同学一开始看到满屏幕的公式应该会很懵，不过好在熟悉推导之后实现起来还是比较简单的，很适合用来熟悉 NumPy 和基础 DL 模型。

HW1

这个 homework 共有 6 个部分：算子正向传播，梯度反向传播，拓扑排序、反向自动微分、softmax 损失以及双层感知机的 SGD。

Forward & Backward Computation

需要实现基本算子的前向传播和反向传播。

前两个部分关系比较紧密，这里放在一起说了。实现算子的时候，需要时刻清楚每一步的张量形状。还需要处理 NumPy 中无处不在的广播。

推导梯度的时候需要注意：当前算子的 out_grad 形状和当前算子的前向运算形状相同，例如令 $A \in \mathbb{R}^{m \times d}$，$B \in \mathbb{R}^{d \times n}$，当前算子为 Matmul(A, B)，则从后一层传过来的 out_grad 形状为 $\mathbb{R}^{m \times n}$。

对应参数的梯度形状和该参数相同。即当 $X \in \mathbb{R}^{m \times n}$，则 $\frac{\partial L}{\partial \mathbf{X}} \in \mathbb{R}^{m \times n}$。

需要注意的是，前向传播中操作对象是 NDarray，即只进行单纯的数值运算。而反向传播需要扩展计算图，操作对象为 ndl.Tensor。强烈建议通读 needle/autograd.py 中的 class Tensor 的源码，对理解框架很有用。

PowerScalar

所有元素对一个标量进行幂运算。输入输出形状相同，张量梯度形式和标量相同。建议所有梯度都返回元组，方便后续写 AD 的时候进行解包。

class PowerScalar(TensorOp):
    def __init__(self, scalar: int):
        self.scalar = scalar

    def compute(self, a: NDArray) -> NDArray:
        return array_api.power(a, self.scalar)

    def gradient(self, out_grad: Tensor, node: Tensor):
        a = node.inputs[0]
        grad = self.scalar * power_scalar(a, self.scalar - 1)
        return (out_grad * grad, )

EWiseDiv

张量对应元素之间的除法。输入输出形状相同，张量梯度形式和标量相同。分别对分子分母求偏导数即可。

class EWiseDiv(TensorOp):
    def compute(self, a, b):
        return array_api.divide(a, b)

    def gradient(self, out_grad, node):
        lhs, rhs = node.inputs
        lhs_grad = power_scalar(rhs, -1)
        rhs_grad = negate(divide(lhs, power_scalar(rhs, 2)))
        return (out_grad * lhs_grad, out_grad * rhs_grad)

DivScalar 所有元素对一个标量进行除法。输入输出形状相同，张量梯度形式和标量相同。对分子求导即可。

class EWiseDiv(TensorOp):
    def compute(self, a, b):
        return array_api.divide(a, b)

    def gradient(self, out_grad, node):
        lhs, rhs = node.inputs
        lhs_grad = power_scalar(rhs, -1)
        rhs_grad = negate(divide(lhs, power_scalar(rhs, 2)))
        return (out_grad * lhs_grad, out_grad * rhs_grad)

Transpose

广义的矩阵转置，实际更类似 swap 轴的操作。设 $A \in \mathbb{R}^{2, 3, 4, 5}$, 则 transpose(A, axes=(1, 3, )) 结果为 $A \in \mathbb{R}^{2, 5, 4, 3}$，即第 $1$ 轴和第 $3$ 个轴交换。当 axes=None 时，默认交换最后两个轴。

因为 $Y = X^T$，所以 $Y_{ji} = X_{ij}$。对损失 $L$ 求导：

\[ \frac{\partial L}{\partial X_{ij}}=\sum_{k, l}\frac{\partial L}{\partial Y_{kl}}\cdot\frac{\partial Y_{kl}}{\partial X_{ij}} \]

而 $\frac{\partial Y_{kl}}{\partial X_{ij}} = 1$ 当且仅当 $k = j$ 且 $l = i$，否则为 $0$。

因此有：

\[ \frac{\partial L}{\partial X_{ij}} = \frac{\partial L}{\partial Y_{ji}} \]

即

\[ \frac{\partial L}{\partial X} = {(\frac{\partial L}{\partial Y})}^T \]

所以 transpose 的梯度即为 out_grad 在相同维度上进行一次转置。

class Transpose(TensorOp):
    def __init__(self, axes: Optional[tuple] = None):
        self.axes = axes

    def compute(self, a):
        ndim = a.ndim
        if self.axes is None:
            axis1, axis2 = ndim - 2, ndim - 1
        else:
            axis1, axis2 = self.axes
        return array_api.swapaxes(a, axis1, axis2)

    def gradient(self, out_grad, node):
        return (transpose(out_grad, self.axes), )

Reshape

本质是重新对内存下标进行映射。

根据链式法则有：

\[ \frac{\partial L}{\partial x_i}=\sum_j\frac{\partial L}{\partial y_j}\cdot\frac{\partial y_j}{\partial x_i} \]

由于 reshape 只是对元素的重排列，某个 $y_j$ 要么是由某个唯一的 $x_i$ 直接复制过来，要么和 $x_i$ 没有任何关系，有 $\frac{\partial{y_j}}{\partial{x_i}} = 1$，当且仅当 $j = i$ 时成立。

即

\[ \frac{\partial L}{\partial x_i}=\frac{\partial L}{\partial y_j} \]

即把 $\frac{\partial L}{\partial y}$ 按相反的方式进行 reshape 即得到 $\frac{\partial L}{\partial y}$。

class Reshape(TensorOp):
    def __init__(self, shape):
        self.shape = shape

    def compute(self, a):
        return array_api.reshape(a, self.shape)

    def gradient(self, out_grad, node):
        shape = node.inputs[0].shape
        return (reshape(out_grad, shape), )

BoardCast

广播操作本质是沿某个轴进行复制，根据链式法则有：

\[ \frac{\partial L}{\partial x_i}=\sum_{j\in\text{由 }x_i\text{ 复制得到的位置}}\frac{\partial L}{\partial y_j} \cdot \frac{\partial{y_j}}{\partial{x_i}} \]

由于 $y_j = x_i$，即 $\frac{\partial{y_j}}{\partial{x_i}} = 1$，即沿着广播的轴对 out_grad 进行求和即可。由于 NumPy 中的广播可能会升维，所以最后进行一次 reshape 对齐 $X$ 的维度即可。

class BroadcastTo(TensorOp):
    def __init__(self, shape):
        self.shape = shape

    def compute(self, a):
        return array_api.broadcast_to(a, self.shape).copy()

    def gradient(self, out_grad, node):
        input_shape = node.inputs[0].shape
        output_shape = self.shape

        if input_shape == output_shape:
            return (out_grad, )

        ndim_added = len(output_shape) - len(input_shape)
        axes = list(range(ndim_added)) # 添加的维度
        for i, (in_dim, out_dim) in enumerate(zip(input_shape, output_shape[ndim_added:])):
            if in_dim == 1 and out_dim > 1: # 广播的维度
                axes.append(ndim_added + i)

        if axes:
            out_grad = summation(out_grad, tuple(axes))

        out_grad = reshape(out_grad, input_shape)
        return (out_grad,)

Summation

这个算子是沿特定轴进行求和，结果进行降维。如果 axes=None，则对全局进行求和，返回标量。

以 $y = x1 + x2$ 举例，有

\[ \frac{\partial y}{\partial x_1}=1,\quad\frac{\partial y}{\partial x_2}=1 \]

由链式法则有：

\[ \begin{aligned} & \frac{\partial L}{\partial x_1}=\frac{\partial L}{\partial y}\cdot1=\frac{\partial L}{\partial y} \\ & \frac{\partial L}{\partial x_2}=\frac{\partial L}{\partial y}\cdot1=\frac{\partial L}{\partial y} \end{aligned} \]

即 $y$ 的梯度值对所有参与求和的 $x$ 值都是相同的。所以只需要将 out_grad 沿着求和的轴广播回去即可。

class Summation(TensorOp):
    def __init__(self, axes: Optional[tuple] = None):
        self.axes = axes

    def compute(self, a):
        return array_api.sum(a, axis=self.axes)

    def gradient(self, out_grad, node):
        input_shape = node.inputs[0].shape

        # 全局求和
        if self.axes is None:
            new_shape = (1, ) * len(input_shape)
            return (broadcast_to(reshape(out_grad, new_shape), input_shape), )

        # 标准化处理，避免 slef.axes 中出现负数轴
        axes = tuple(ax % len(input_shape) if ax < 0 else ax for ax in self.axes)
        new_shape = list(input_shape)
        # 把求和的那几个轴改为1，方便进行广播
        for ax in axes:
            new_shape[ax] = 1

        return (broadcast_to(reshape(out_grad, tuple(new_shape)), input_shape), )

Matmul

广义矩阵乘法。NumPy 支持高维矩阵乘法，只需要最后两维度满足矩阵乘法的要求即可，前面的维度都视为 batch。如果维度不匹配，会进行升维或广播。

在计算梯度的时候，根据 Lecture 1中介绍的凑形状方法，可以得到如下两个表达式：

\[ \frac{\partial L}{\partial Y}\frac{\partial Y}{\partial A} = G B^T \\ \frac{\partial L}{\partial Y}\frac{\partial Y}{\partial B} = A^T G \]

但是 NumPy 会进行 reshape 和 boardcast，根据链式法则，需要进行一次规约。沿着广播轴进行求和，并 reshape 为输入的形状即可。

class MatMul(TensorOp):
    def compute(self, a, b):
        return array_api.matmul(a, b)

    def gradient(self, out_grad, node):
        lhs, rhs = node.inputs
        lhs_grad = matmul(out_grad, transpose(rhs))
        rhs_grad = matmul(transpose(lhs), out_grad)

        def reduce_to_shape(grad, shape):
            if grad.shape == shape:
                return grad
            ndim_added = len(grad.shape) - len(shape)
            axes = list(range(ndim_added))
            for i, (s, gs) in enumerate(zip(shape, grad.shape[ndim_added:])):
                if s == 1 and gs > 1:
                    axes.append(ndim_added + i)
            if axes:
                grad = summation(grad, tuple(axes))
            return reshape(grad, shape)

        return (reduce_to_shape(lhs_grad, lhs.shape),
                reduce_to_shape(rhs_grad, rhs.shape))

Negate

按元素取负，梯度即为乘 -1。

class Negate(TensorOp):
    def compute(self, a):
        ### BEGIN YOUR SOLUTION
        return array_api.negative(a)
        ### END YOUR SOLUTION

    def gradient(self, out_grad, node):
        ### BEGIN YOUR SOLUTION
        return (negate(out_grad), ) 
        ### END YOUR SOLUTION

Log & Exp

这两个都没啥好说的。

class Log(TensorOp):
    def compute(self, a):
        return array_api.log(a)

    def gradient(self, out_grad, node):
        a = node.inputs[0]
        return (divide(out_grad, a), )

class Exp(TensorOp):
    def compute(self, a):
        return array_api.exp(a)

    def gradient(self, out_grad, node):
        a = node.inputs[0]
        return (out_grad * exp(a), )

Topological Sort

这部分需要实现拓扑排序。文档里叽里呱啦说了半天，说白了就是 后续遍历的 DFS 序即为逆拓扑序。

def find_topo_sort(node_list: List[Value]) -> List[Value]:
    visited = dict()
    topo_order = []
    for node in node_list:
        if not visited.get(node, False):
            topo_sort_dfs(node, visited, topo_order)
    return topo_order

def topo_sort_dfs(node: Value, visited: dict, topo_order: List[Value]) -> None:
    """Post-order DFS"""
    for fa in node.inputs:
        if not visited.get(fa, False):
            topo_sort_dfs(fa, visited, topo_order)
    visited[node] = True
    topo_order.append(node)

有同学会问，为什么不用 BFS 的方式求。答维护每个节点的入度是很大的开销。

reverse mode AD

开始措自动微分了。核心代码和 Lecture 3 中讲的都差不多。autograd.py 中还给我们提供了一个 sum_node_list 函数，对一系列 node 进行求和，对应伪代码中 $=_j 的部分。

由于 ops 中的算子都是继承自 TensorOp，会自动调用 make_from_op，然后拓展计算图。

def compute_gradient_of_variables(output_tensor, out_grad):
    node_to_output_grads_list: Dict[Tensor, List[Tensor]] = {}
    node_to_output_grads_list[output_tensor] = [out_grad]

    reverse_topo_order = list(reversed(find_topo_sort([output_tensor])))

    for node in reverse_topo_order:
        node.grad = sum_node_list(node_to_output_grads_list[node])

        # not a initial/leaf node
        if len(node.inputs) > 0:
            grad = node.op.gradient(node.grad, node)
            for i, fa in enumerate(node.inputs):
                node_to_output_grads_list.setdefault(fa, [])
                node_to_output_grads_list[fa].append(grad[i])

Softmax loss

用 ndl.Tensor 的算子实现 softmax loss。为了方便计算，已经很贴心的把 $y$ 替换成了独热编码。没啥好说的，照着公式写即可。

def softmax_loss(Z: ndl.Tensor, y_one_hot: ndl.Tensor):
    sum_log = ndl.log(ndl.summation(ndl.exp(Z), axes=(1, )))
    prob = ndl.summation(ndl.multiply(Z, y_one_hot), axes=(1, ))
    tot_loss = ndl.summation(sum_log - prob)
    return ndl.divide_scalar(tot_loss, Z.shape[0])

SGD for nn

利用前面的逐渐，实现一个双层感知机一个 epoch 的训练过程。需要注意，这里传入的 $y$ 是 label 类型，需要转换为独热编码。然后在更新权重的时候，建议转换成 NumPy 计算，否则会把不必要的运算加入计算图，导致计算图指数级增长。

def nn_epoch(X, y, W1, W2, lr=0.1, batch=100):
    m = X.shape[0]
    num_classes = W2.shape[1]
    y_one_hot = np.eye(num_classes)[y]

    for start in range(0, m, batch):
        end = min(start + batch, m)
        X_batch = X[start:end, ]
        y_batch = y_one_hot[start:end]
        batch_size = X_batch.shape[0]

        X_tensor = ndl.Tensor(X_batch)
        y_tensor = ndl.Tensor(y_batch)
        Z1 = ndl.relu(ndl.matmul(X_tensor, W1))
        Z2 = ndl.matmul(Z1, W2)
        loss = softmax_loss(Z2, y_tensor)
        loss.backward()

        new_W1 = ndl.Tensor(W1.numpy() - lr * W1.grad.numpy())
        new_W2 = ndl.Tensor(W2.numpy() - lr * W2.grad.numpy())
        W1, W2 = new_W1, new_W2

    return W1, W2

HW1 总结

强度已经明显大了很多，到后面不能用 NumPy 的时候不知道怎么办QAQ。

CMU10714

#10714

CMU 10-714 Assignments 实验笔记

https://acmicpc.top/2026/05/23/CMU-10717/

作者

江欣婷

发布于

2026年5月23日

许可协议

MIT 6.S081 All in one 下一篇

CMU 10-714 Assignments 实验笔记

CMU 10-714 Assignments 实验笔记

资源存档

环境配置

HW0

Basic add function

paese_mnist

Softmax loss

SGD for Softmax Regression

SGD for Two-layer nn

Softmax Regression by cpp

HW0 总结

HW1

Forward & Backward Computation

Topological Sort

reverse mode AD

Softmax loss

SGD for nn

HW1 总结

Basic `add` function