10 Essential Machine Learning Interview Questions

Toptal sourced essential questions that the best machine learning engineers can answer. Driven from our community, we encourage experts to submit questions and offer feedback.

Interview Questions

What is stratified cross-validation and when should we use it?

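Cross-validation is a technique for dividing data between training and validation sets. In typical cross-validation, this split is done randomly. In stratified cross-validation, the split preserves the ratio of the categories in both the training and validation datasets.

For example, if we have a dataset with 10% of category A and 90% of category B and we use stratified cross-validation, we will have the same proportions in training and validation. In contrast, if we use simple cross-validation, in the worst case we may find that there are no samples of category A in the validation set.

Stratified cross-validation may be applied in the following scenarios:

On a dataset with multiple categories. The smaller the dataset and the more imbalanced the categories, the more important it is to use stratified cross-validation.
On a dataset with data of different distributions. For example, in a dataset for autonomous driving, we may have images taken during the day and at night. If we do not ensure that both types are present in training and validation, we will have generalization problems.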
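
As a rough illustration of the idea, the sketch below builds a single stratified split in plain Python; cross-validation would repeat this for each fold. The function name stratified_split and its parameters validation_fraction and seed are hypothetical names chosen for this example, not part of any library.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, validation_fraction=0.2, seed=0):
    """Return (train_indices, validation_indices), preserving class ratios."""
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, validation = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction from every class, and at least one sample.
        n_val = max(1, round(len(indices) * validation_fraction))
        validation.extend(indices[:n_val])
        train.extend(indices[n_val:])
    return sorted(train), sorted(validation)


# 10% category A and 90% category B, as in the example above.
labels = ["A"] * 10 + ["B"] * 90
train_idx, val_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
val_counts = Counter(labels[i] for i in val_idx)
print("train:", dict(train_counts))
print("validation:", dict(val_counts))
```

Both sets keep the 10/90 ratio, so category A can never be missing from the validation set.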

Why do ensembles typically have higher scores than individual models?

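An ensemble is the combination of multiple models to create a single prediction. The key idea for making better predictions is that the models should make different errors. That way, the errors of one model are compensated by the right guesses of the other models, and the score of the ensemble is higher.

We need diverse models to create an ensemble. Diversity can be achieved by:

Using different ML algorithms. For example, you can combine logistic regression, k-nearest neighbors, and decision trees.
Using different subsets of the data for training. This is called bagging.
Giving a different weight to each of the samples of the training set. If this is done iteratively, weighting the samples according to the errors of the ensemble, it’s called boosting.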
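
To see why different errors matter, here is a minimal sketch of a majority-vote ensemble. The three sets of predictions are made up for illustration: each hypothetical model is wrong on two samples, but never on the same ones.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine predictions (one list per model) by majority vote per sample."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


def accuracy(predicted, actual):
    """Fraction of samples where the prediction matches the true label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


truth = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
model_a = [0, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # wrong on samples 0 and 9
model_b = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]  # wrong on samples 1 and 5
model_c = [1, 0, 0, 1, 0, 1, 0, 1, 1, 0]  # wrong on samples 2 and 7

ensemble = majority_vote([model_a, model_b, model_c])

for name, predicted in [("a", model_a), ("b", model_b), ("c", model_c)]:
    print(name, accuracy(predicted, truth))
print("ensemble", accuracy(ensemble, truth))
```

Each model scores 0.8 on its own, while the vote scores 1.0 because every mistake is outvoted by the two other models. If the models made the same mistakes, the vote would not help.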

Many winning solutions to data science competitions are ensembles. However, in real-life machine learning projects, engineers need to find a balance between execution time and accuracy.

What is regularization? Can you give some examples of regularization techniques?

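Regularization is any technique that aims to improve the validation score, sometimes at the cost of reducing the training score.

Some regularization techniques:

L1 tries to minimize the absolute values of the model’s parameters. It produces sparse parameters.
L2 tries to minimize the squared values of the model’s parameters. It produces parameters with small values.
Dropout is a technique applied to neural networks that randomly sets some of the neurons’ outputs to zero during training. This forces the network to learn better representations of the data by preventing complex interactions between the neurons: each neuron needs to learn useful features.
Early stopping stops training when the validation score stops improving, even if the training score is still improving. This prevents overfitting on the training dataset.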
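
Two of these techniques are easy to sketch in a few lines. The example below computes L1 and L2 penalty terms for a small weight vector and applies a simple early stopping rule to a made-up sequence of validation scores. The names strength and patience are assumptions for this example; real training frameworks expose similar options under their own names.

```python
def l1_penalty(weights, strength):
    """L1 term added to the loss: strength times the sum of absolute values."""
    return strength * sum(abs(w) for w in weights)


def l2_penalty(weights, strength):
    """L2 term added to the loss: strength times the sum of squared values."""
    return strength * sum(w * w for w in weights)


def early_stopping_epoch(validation_scores, patience=2):
    """Return the epoch index at which training would stop.

    Training stops once the validation score has failed to improve
    for `patience` consecutive epochs.
    """
    best_score = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(validation_scores):
        if score > best_score:
            best_score = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(validation_scores) - 1  # never triggered: ran to the end


weights = [0.5, -2.0, 0.0, 1.5]
l1 = l1_penalty(weights, strength=0.1)
l2 = l2_penalty(weights, strength=0.1)

# Hypothetical validation scores, one per epoch.
validation_scores = [0.70, 0.75, 0.78, 0.77, 0.78, 0.76, 0.75]
stop_epoch = early_stopping_epoch(validation_scores, patience=2)
print(l1, l2, stop_epoch)
```

In practice we would also keep a copy of the model from the best epoch, since the last epochs before stopping are slightly worse on validation.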

What is an imbalanced dataset? Can you list some ways to deal with it?

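An imbalanced dataset is one that has different proportions of target categories. For example, a dataset of medical images in which we have to detect some illness will typically have many more negative samples than positive samples, e.g., 98% of images are without the illness and 2% of images are with the illness.

There are different options to deal with imbalanced datasets:

Oversampling or undersampling. Instead of sampling with a uniform distribution from the training dataset, we can use other distributions so the model sees a more balanced dataset.
Data augmentation. We can add data in the less frequent categories by modifying existing data in a controlled way. In the example dataset, we could flip the images with illnesses, or add noise to copies of the images in such a way that the illness remains visible.
Using appropriate metrics. In the example dataset, if we had a model that always made negative predictions, it would achieve an accuracy of 98%. There are other metrics, such as precision, recall, and F-score, that describe the accuracy of the model better when using an imbalanced dataset.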
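
The point about metrics can be checked directly. The sketch below scores a model that always predicts the negative class on a 98/2 dataset. One assumption to note: when a ratio has a zero denominator (for example, precision when nothing is predicted positive), this example reports 0.0, which is a common convention but not the only one.

```python
def classification_metrics(actual, predicted, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    # Assumption: report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# 98 images without the illness (0) and 2 images with it (1).
actual = [0] * 98 + [1] * 2
always_negative = [0] * 100

metrics = classification_metrics(actual, always_negative)
print(metrics)
```

Accuracy looks excellent at 0.98, but recall and F1 are both 0.0, which shows that the model never finds a single positive case.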

Why do we need a validation set and a test set? What is the difference between them?

When training a model, we divide the available data into three separate sets:

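The training dataset is used for fitting the model’s parameters. However, the accuracy that we achieve on the training set is not reliable for predicting whether the model will be accurate on new samples.
The validation dataset is used to measure how well the model does on examples that weren’t part of the training dataset. The metrics computed on the validation data can be used to tune the hyperparameters of the model. However, every time we evaluate the validation data and make decisions based on those scores, we are leaking information from the validation data into our model. The more evaluations, the more information is leaked. So we can end up overfitting to the validation data, and once again the validation score won’t be reliable for predicting the behavior of the model in the real world.
The test dataset is used to measure how well the model does on previously unseen examples. It should only be used once we have tuned the hyperparameters using the validation set.

So if we omit the test set and only use a validation set, the validation score won’t be a good estimate of the generalization of the model.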
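
A minimal sketch of such a three-way split is shown below. The function name and the 60/20/20 proportions are assumptions for illustration; the right proportions depend on how much data is available.

```python
import random


def train_validation_test_split(samples, validation_fraction=0.2,
                                test_fraction=0.2, seed=0):
    """Shuffle the samples and cut them into three disjoint sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)

    n_test = round(len(shuffled) * test_fraction)
    n_validation = round(len(shuffled) * validation_fraction)

    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    train = shuffled[n_test + n_validation:]
    return train, validation, test


samples = list(range(100))
train, validation, test = train_validation_test_split(samples)
print(len(train), len(validation), len(test))

# Typical workflow:
# 1. Fit the parameters on `train`.
# 2. Compare hyperparameter choices on `validation`, as often as needed.
# 3. Evaluate the final model on `test` once, at the very end.
```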

Can you explain the differences between supervised, unsupervised, and reinforcement learning?

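In supervised learning, we train a model to learn the relationship between input data and output data. We need to have labeled data to be able to do supervised learning.

With unsupervised learning, we only have unlabeled data. The model learns a representation of the data. Unsupervised learning is frequently used to initialize the parameters of the model when we have a lot of unlabeled data and a small fraction of labeled data. We first train an unsupervised model and, after that, we use the weights of the model to train a supervised model.

In reinforcement learning, the model has some input data and a reward depending on the output of the model. The model learns a policy that maximizes the reward. Reinforcement learning has been applied successfully to strategic games such as Go and even classic Atari video games.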

What are some factors that explain the success and recent rise of deep learning?

The success of deep learning in the past decade can be explained by three main factors:

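More data. The availability of massive labeled datasets allows us to train models with more parameters and achieve state-of-the-art scores. Other ML algorithms do not scale as well as deep learning when it comes to dataset size.
GPU. Training models on a GPU can reduce the training time by orders of magnitude compared to training on a CPU. Currently, cutting-edge models are trained on multiple GPUs or even on specialized hardware.
Improvements in algorithms. ReLU activation, dropout, and complex network architectures have also been very significant factors.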

What is data augmentation? Can you give some examples?

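Data augmentation is a technique for synthesizing new data by modifying existing data in such a way that the target is not changed, or it is changed in a known way.

Computer vision is one of the fields where data augmentation is very useful. There are many modifications that we can do to images:

Resize
Flip horizontally or vertically
Rotate
Add noise
Deform
Modify colors

Each problem needs a customized data augmentation pipeline. For example, on OCR, doing flips will change the text and won’t be beneficial; however, resizes and small rotations may help.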
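
A few of these modifications can be sketched on a tiny grayscale image stored as a nested list of pixel values between 0 and 1. This representation and the function names are assumptions made to keep the example self-contained; real pipelines usually rely on an image library.

```python
import random


def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def flip_vertical(image):
    """Reverse the order of the rows."""
    return [row[:] for row in image[::-1]]


def add_noise(image, scale=0.05, seed=0):
    """Add Gaussian noise to every pixel, clipped to the range [0, 1]."""
    rng = random.Random(seed)
    return [
        [min(1.0, max(0.0, pixel + rng.gauss(0.0, scale))) for pixel in row]
        for row in image
    ]


image = [
    [0.0, 0.5, 1.0],
    [0.2, 0.4, 0.6],
]

augmented = [flip_horizontal(image), flip_vertical(image), add_noise(image)]
for variant in augmented:
    print(variant)
```

Each variant keeps the label of the original image, so one labeled sample yields several training samples.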

What are convolutional networks? Where can we use them?

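Convolutional networks are a class of neural networks that use convolutional layers instead of fully connected layers. On a fully connected layer, all the output units have weights connecting to all the input units. On a convolutional layer, we have some weights that are repeated over the input.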

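The advantage of convolutional layers over fully connected layers is that the number of parameters is far smaller. This results in better generalization of the model. For example, if we want to learn a transformation from a 10x10 image to another 10x10 image, we will need 10,000 weights if using a fully connected layer. If we use two convolutional layers, the first one having nine filters and the second one having one filter, with a kernel size of 3x3, we will have only 162 weights: 81 in the first layer and 81 in the second, because the filter in the second layer spans all nine channels produced by the first. These counts ignore bias terms.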
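
The parameter counts can be checked with a few lines of arithmetic. The sketch below assumes standard convolutions, in which every filter spans all of its input channels, and it ignores bias terms. Under that assumption the second layer alone contributes 3 x 3 x 9 = 81 weights. The function names are hypothetical.

```python
def dense_weights(n_inputs, n_outputs):
    """Every output unit has one weight per input unit."""
    return n_inputs * n_outputs


def conv2d_weights(in_channels, out_channels, kernel_size):
    """Each filter holds kernel_size x kernel_size weights per input channel."""
    return in_channels * out_channels * kernel_size * kernel_size


# Fully connected: 10x10 input pixels to 10x10 output pixels.
dense_total = dense_weights(10 * 10, 10 * 10)

# Two convolutional layers with 3x3 kernels:
# 1 input channel -> 9 filters, then 9 channels -> 1 filter.
first_layer = conv2d_weights(in_channels=1, out_channels=9, kernel_size=3)
second_layer = conv2d_weights(in_channels=9, out_channels=1, kernel_size=3)
conv_total = first_layer + second_layer

print(dense_total, first_layer, second_layer, conv_total)
```

The convolutional count does not depend on the image size, while the fully connected count grows with the square of the number of pixels.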

Convolutional networks are applied where data has a clear dimensionality structure. Time series analysis is an example where one-dimensional convolutions are used; for images, 2D convolutions are used; and for volumetric data, 3D convolutions are used.

Computer vision has been dominated by convolutional networks since 2012, when AlexNet won the ImageNet challenge.

What is the curse of dimensionality? Can you list some ways to deal with it?

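The curse of dimensionality is when the training data has a high feature count, but the dataset does not have enough samples for a model to learn correctly from so many features. For example, a training dataset of 100 samples with 100 features will be very hard to learn from because the model will find random relations between the features and the target. However, if we had a dataset of 100k samples with 100 features, the model could probably learn the correct relations between the features and the target.

There are different options to fight the curse of dimensionality:

Feature selection. Instead of using all the features, we can train on a smaller subset of features.
Dimensionality reduction. There are many techniques that reduce the dimensionality of the features. Principal component analysis (PCA) and autoencoders are examples of dimensionality reduction techniques.
L1 regularization. Because it produces sparse parameters, L1 helps to deal with high-dimensionality input.
Feature engineering. It’s possible to create new features that sum up multiple existing features. For example, we can get statistics such as the mean or median.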
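
The "random relations" mentioned above can be made visible with a small experiment. In the sketch below, the features and the target are all independent random noise, so any correlation between them is spurious. We measure the strongest correlation found among 100 features with few samples and with many samples. The larger dataset uses 5,000 samples rather than 100k only to keep the example fast; the function names are hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / (var_x * var_y) ** 0.5


def strongest_spurious_correlation(n_samples, n_features, rng):
    """Largest |correlation| between pure-noise features and a noise target."""
    target = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    strongest = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        strongest = max(strongest, abs(pearson(feature, target)))
    return strongest


rng = random.Random(0)
few_samples = strongest_spurious_correlation(100, 100, rng)
many_samples = strongest_spurious_correlation(5000, 100, rng)
print("100 samples: ", few_samples)
print("5000 samples:", many_samples)
```

With few samples, at least one noise feature looks noticeably related to the target by chance, and a model can latch onto it. With many samples, these chance correlations shrink toward zero.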

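There is more to interviewing than tricky technical questions, so these are intended merely as a guide. Not every “A” candidate worth hiring will be able to answer them all, nor does answering them all guarantee an “A” candidate. At the end of the day, hiring remains an art, a science, and a lot of work.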
