6 Months of Data Science

Since my title flipped from consultant to data scientist six months ago, I’ve experienced a higher level of job satisfaction than I would have thought possible. To celebrate my first half year in this engaging field, here are six lessons I’ve collected along the way.

#1 — Read the arXiv paper

You're probably aware that reviewing arXiv is a good idea: it's a wellspring of remarkable ideas and state-of-the-art advancements.

I’ve been pleasantly surprised, though, by the amount of actionable advice I come across on the platform. For example, I might not have access to 16 TPUs and $7k to train BERT from scratch, but the recommended hyperparameter settings from the Google Brain team are a great place to start fine-tuning (check Appendix A.3).

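To make that concrete, here's a minimal sketch of what a fine-tuning sweep over the Appendix A.3 ranges might look like. The batch sizes, learning rates, and epoch counts are the paper's suggested values; the `fine_tune_and_score` function is just a placeholder for your own training loop.

```python
from itertools import product

# Fine-tuning ranges recommended in Appendix A.3 of the BERT paper
# (Devlin et al., 2018): batch size, Adam learning rate, number of epochs.
BATCH_SIZES = [16, 32]
LEARNING_RATES = [5e-5, 3e-5, 2e-5]
NUM_EPOCHS = [2, 3, 4]

def fine_tune_and_score(batch_size, learning_rate, epochs):
    """Placeholder: plug in your own fine-tuning + validation routine
    (Keras, Hugging Face Transformers, etc.) and return a metric."""
    return 0.0

best_score, best_config = -1.0, None
for bs, lr, n_epochs in product(BATCH_SIZES, LEARNING_RATES, NUM_EPOCHS):
    score = fine_tune_and_score(bs, lr, n_epochs)
    if score > best_score:
        best_score, best_config = score, {"batch_size": bs, "lr": lr, "epochs": n_epochs}

print("Best config found:", best_config, "score:", best_score)
```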

Hopefully, your favorite new package will have an enlightening read on arXiv to add color to its documentation. For example, I learned to deploy BERT using the supremely readable and abundantly useful write-up on ktrain, a library that sits atop Keras and provides a streamlined machine learning interface for text, image, and graph applications.

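For a flavor of what that looks like in practice, here's a rough sketch of ktrain's BERT text-classification workflow, based on its documented tutorial API. The data folder, class names, and batch size are placeholders, and exact arguments may vary between ktrain versions.

```python
import ktrain
from ktrain import text

# Load labeled text files from 'data/train' and 'data/test' (placeholder
# paths), preprocessing them for BERT.
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_folder(
    "data",
    maxlen=256,
    preprocess_mode="bert",
    train_test_names=["train", "test"],
    classes=["neg", "pos"],
)

# Build a BERT classifier and wrap it in a ktrain Learner.
model = text.text_classifier("bert", train_data=(x_train, y_train), preproc=preproc)
learner = ktrain.get_learner(model, train_data=(x_train, y_train),
                             val_data=(x_test, y_test), batch_size=6)

# Fine-tune with the one-cycle policy at a BERT-friendly learning rate.
learner.fit_onecycle(2e-5, 1)

# Bundle the model and preprocessing into a predictor for deployment.
predictor = ktrain.get_predictor(learner.model, preproc)
predictor.save("bert_predictor")
```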

#2 — Listen to podcasts for tremendous situational awareness

Podcasts won’t improve your coding skills but will improve your understanding of recent developments in machine learning, popular packages and tools, unanswered questions in the field, new approaches to old problems, underlying psychological insecurities common across the profession, etc.

The podcasts I listen to day-to-day have helped me feel engaged and up to date with fast-moving developments in data science.

Here are my favorite podcasts right now:

Recently I’ve been particularly excited to learn about advancements in NLP, follow the latest developments in GPUs and cloud computing, and question the potential symbiosis between advancements in artificial neural nets and neurobiology.

#3 — Read GitHub Issues

Based on my experience trawling this ocean of complaints for giant tuna of wisdom, here are three potential wins:

  1. I often get ideas from the ways others are using and/or misusing a package

  2. It's also useful to understand the kinds of situations in which a package tends to break, so you can develop a sense of potential failure points in your own work

  3. As you're in your pre-work phase of setting up your environment and conducting model selection, you'd do well to take the responsiveness of the developers and the community into account before adding an open source tool to your pipeline (see the sketch after this list)

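Here's the sketch mentioned above: one hedged way to gauge a project's issue responsiveness with GitHub's public REST API before you commit to it. The repository name is a placeholder, and unauthenticated requests are rate-limited.

```python
import statistics
from datetime import datetime

import requests

REPO = "owner/some-package"  # placeholder: the repo you're evaluating

# Recently closed issues, newest first (unauthenticated: ~60 requests/hour).
resp = requests.get(
    f"https://api.github.com/repos/{REPO}/issues",
    params={"state": "closed", "sort": "updated", "direction": "desc", "per_page": 50},
    headers={"Accept": "application/vnd.github+json"},
    timeout=10,
)
resp.raise_for_status()

def parse(ts):
    # GitHub timestamps look like "2021-01-01T00:00:00Z"
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

# Pull requests also appear in the issues endpoint; skip them.
days_open = [
    (parse(i["closed_at"]) - parse(i["created_at"])).days
    for i in resp.json()
    if "pull_request" not in i and i.get("closed_at")
]

if days_open:
    print(f"Median days from open to close: {statistics.median(days_open)}")
```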

#4 — Understand the algorithm-hardware link

I’ve done a lot of NLP in the last six months, so let’s talk about BERT again.

In October 2018, BERT emerged and shook the world. Kind of like Superman after leaping a tall building in a single bound (crazy to think Superman couldn’t fly when originally introduced!)

BERT represented a step-change in the capacity of machine learning to tackle text processing tasks. Its state-of-the-art results rest on the parallelism of its transformer architecture running on Google's TPU chips.

Understanding the implications of TPU and GPU-based machine learning is important for advancing your own capabilities as a data scientist. It is also a critical step toward sharpening your intuition about the inextricable link between machine learning software and the physical constraints of the hardware on which it runs.

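To get a feel for that link on your own machine, here's a small sketch (assuming TensorFlow 2.x is installed) that times the same matrix multiplication on CPU and, if one is visible, on GPU. The matrix size is arbitrary; only the relative difference matters.

```python
import time
import tensorflow as tf

def time_matmul(device, n=4000, repeats=3):
    """Time an n x n matrix multiplication on the given device."""
    with tf.device(device):
        a = tf.random.uniform((n, n))
        b = tf.random.uniform((n, n))
        # Warm-up run so one-time setup cost isn't measured.
        _ = tf.matmul(a, b).numpy()
        start = time.perf_counter()
        for _ in range(repeats):
            _ = tf.matmul(a, b).numpy()  # .numpy() forces the op to finish
        return (time.perf_counter() - start) / repeats

print(f"CPU: {time_matmul('/CPU:0'):.3f} s per matmul")
if tf.config.list_physical_devices("GPU"):
    print(f"GPU: {time_matmul('/GPU:0'):.3f} s per matmul")
else:
    print("No GPU visible to TensorFlow on this machine.")
```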

With Moore’s law petering out around 2010, increasingly creative approaches will be needed to overcome the limitations in the data science field and continue to make progress toward truly intelligent systems.

Nvidia presentation showing transistors per square millimeter by year, highlighting the stagnation in transistor count around 2010 and the rise of GPU-based computing.

I’m bullish on the rise of ML model-computing hardware co-design, increased reliance on sparsity and pruning, and even “no-specialized hardware” machine learning that looks to disrupt the dominance of the current GPU-centric paradigm.

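As a toy illustration of what pruning means (plain NumPy, not any particular library's API), you can zero out the smallest-magnitude weights of a layer and see how sparse it becomes:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(256, 256))  # stand-in for a trained dense layer

def magnitude_prune(w, sparsity=0.9):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) < threshold, 0.0, w)

pruned = magnitude_prune(weights, sparsity=0.9)
print(f"Fraction of zero weights after pruning: {np.mean(pruned == 0.0):.2%}")
```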

#5 — Learn from the Social Sciences

There’s a lot our young field can learn from the reproducibility crisis in the Social Sciences that took place in the mid-2010s (and which, to some extent, is still taking place):

Comic by Randall Munroe of xkcd

In 2011, an academic crowdsourced collaboration set out to reproduce 100 published experimental and correlational psychological studies. It failed: just 36% of the replications reported statistically significant results, compared to 97% of the original studies.

Psychology's reproducibility crisis reveals the danger, and the responsibility, that comes with attaching the label "science" to shaky methodology.

Data science needs testable, reproducible approaches to its problems. To eliminate p-hacking, data scientists need to set limits on how they investigate their data for predictive features and on the number of tests they run to evaluate metrics.

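One concrete guardrail, sketched here with statsmodels (the p-values are invented for illustration): if you do end up running many tests, at least correct for multiple comparisons instead of reporting the one that happened to come out significant.

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from testing many candidate features/metrics.
p_values = [0.003, 0.04, 0.012, 0.30, 0.049, 0.21, 0.001, 0.08]

# Holm correction controls the family-wise error rate across all tests.
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")

for p, p_adj, keep in zip(p_values, p_adjusted, reject):
    print(f"raw p={p:.3f}  adjusted p={p_adj:.3f}  significant={keep}")
```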

There are many tools that can help with experiment management. I have experience with MLflow; this excellent article by Ian Xiao mentions six others, as well as suggestions across four other areas of the machine learning workflow.

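For what it's worth, the core of MLflow's tracking API is only a few calls; here's a minimal sketch (the experiment, parameter, and metric names are placeholders):

```python
import mlflow

mlflow.set_experiment("churn-model")  # experiment name is a placeholder

with mlflow.start_run():
    # Record what you tried...
    mlflow.log_param("model_type", "gradient_boosting")
    mlflow.log_param("max_depth", 5)

    # ...and how it did, so every run is reproducible and comparable.
    mlflow.log_metric("val_f1", 0.83)
    mlflow.log_metric("val_precision", 0.81)

    # Optionally attach artifacts such as plots or the serialized model:
    # mlflow.log_artifact("confusion_matrix.png")
```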

We can also draw many lessons from the fair share of missteps and algorithmic malpractice within the data science field in recent years:

For example, interested parties need look no further than social engineering recommendation engines, discriminatory credit algorithms, and criminal justice systems that deepen the status quo. I’ve written a bit about these social ills and how to avoid them with effective human-centered design.

The good news is that there are many intelligent and driven practitioners working to address these challenges and prevent future breaches of public trust. Check out Google's PAIR, Columbia's FairTest, and IBM's Explainability 360. Collaborations with social science researchers can yield fruitful results, such as this project on algorithms to audit for discrimination.

Of course, there are many other things we can learn from the social sciences, such as how to give an effective presentation:

It's crucial to study the social sciences to understand where human intuition about data inference is likely to fail. Humans are very good at drawing conclusions from data in certain situations. The ways in which our reasoning breaks down are highly systematic and predictable.

Much of what we understand about this aspect of human psychology is captured in Daniel Kahneman's excellent Thinking, Fast and Slow. It should be required reading for anyone interested in the decision sciences.

One element of Kahneman’s research that’s likely to be immediately relevant to your work is his treatment of the anchoring effect, which “occurs when people consider a particular value for an unknown quantity.”

When communicating results from modeling (i.e. numbers representing accuracy, precision, recall, f-1, etc.), data scientists need to take special care to manage expectations. It can be useful to provide a degree of hand-waviness on a scale of “we are still hacking away at this problem, and these metrics are likely to change” to “this is the final product, and this is about how we expect our ML solution to perform in the wild.”

If you’re presenting intermediate results, Kahneman would recommend providing a range of values for each metric, rather than specific digits. For example, “The f-1 score, which represents the harmonic mean of other metrics represented in this table (precision and recall), falls roughly between 80–85%. This indicates some room for improvement.” This “hand-wavy” communication strategy decreases the risk that the audience will anchor on the specific value you’re sharing, rather than gain a directionally correct message about the results.

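One honest way to produce that kind of range, rather than eyeballing it, is a simple bootstrap over the test set. Here's a sketch with scikit-learn; the labels below are dummy data, so substitute your own.

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)

# Dummy test-set labels and predictions that agree roughly 85% of the time.
y_true = rng.integers(0, 2, size=500)
y_pred = np.where(rng.random(500) < 0.85, y_true, 1 - y_true)

scores = []
for _ in range(1000):
    idx = rng.integers(0, len(y_true), size=len(y_true))  # resample with replacement
    scores.append(f1_score(y_true[idx], y_pred[idx]))

low, high = np.percentile(scores, [2.5, 97.5])
print(f"F1 is roughly between {low:.2f} and {high:.2f} (95% bootstrap interval)")
```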

#6 — Connect data to business outcomes

Before you start work, make sure that the problem you’re solving is worth solving.

Your organization isn’t paying you to build a model with 90% accuracy, write them a report, piddle around in Jupyter Notebook, or even to enlighten yourself and others on the quasi-magical properties of graph databases.

You’re there to connect data to business outcomes.

I hope that these tips are at least somewhat helpful to you — drop a note in the comments to let me know how you use them. And of course, if you enjoyed this article, follow me on Medium, LinkedIn, and Twitter.

Disclaimer: I’m including affiliate links for book recommendations. Buying on Amazon through this link helps support my writing on Data Science — thanks in advance.

Translated from: https://towardsdatascience.com/6-months-data-science-e875e69aab0a