[Fun Series] How an Engineer Uploads Videos vs. How Everyone Else Does: (1) Generating Structured Data

Date: 2019-12-09


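Background

When I learned that the number of videos to upload had grown from 20 to 100, I knew this absolutely could not be done by hand anymore. They always take it for granted that if entering one record takes 1 minute, then 20 records take 20 minutes, so 100 records must take 100 minutes, right? Sometimes I really want to ask them: has it never occurred to you that people make mistakes? The more data there is, the more likely an error becomes, yet the data itself cannot tolerate any slip. So what is supposed to guarantee its correctness? My good looks? Hardly.

Most of the time, arguments like this end with them saying, "I don't understand the technical side, you figure it out." So I don't bother arguing. I will get it done as fast as I can; but whether or not you acknowledge how complex a task is has no effect on how complex it actually is.

Back to the problem itself: how to handle the 100 new records, and the many more that will follow, really does need a thorough solution.

The raw data I received

Here is a roughly representative description of the data I was given. The discussion below uses just 10 records as an example.

A Word document: a list of questions

Suppose its content is: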

1.【smart-transform】取自 Atom 的 babeljs&coffeescript&typescript 智能转 es5 库
2.【YFMemoryLeakDetector】人人都能理解的 iOS 内存泄露检测工具类
3.【玩转树莓派】使用 sinopia 搭建私有 npm 服务器
4.【小技巧解决大问题】使用 frp 突破阿里云主机无弹性公网 IP 不能用做 Web 服务器的限制
5.【树莓派自动化应用实例】整点提醒本身休息五分钟
6. 借助 frp 随时随地访问本身的树莓派
7.【LuaJIT版】从零开始在 macOS 上配置 Lua 开发环境
8.【最新版】从零开始在 macOS 上配置 Lua 开发环境
9. 关于混合应用开发的将来的一些思考
10.记录我发现的第一个关于 Google 的 Bug

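Yes, the content contains all kinds of Chinese punctuation. Quite a few of them don't understand why engineers prefer English punctuation, and some even use it as grounds to say we never learned punctuation properly in primary school. I can't be bothered to explain. Since this is what was delivered, and it is only plain text, I take it exactly as given. Punctuation habits are a harmless matter anyway.

Another Word document: the Luis semantic-analysis result for each question

Microsoft's Luis semantic analysis service is, loosely speaking, AI-adjacent; look it up if you are interested. From the client's point of view, you give it a text string and it returns the unique identifier of the pre-registered answer that best matches that string. Each such unique ID is called an intent. Each request yields at most one best-matching intent.

Based on the existing Word document of questions, our back-end colleague sent over another Word document: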

1. smart_transform
2. memory_leakDetector
3. sinopia_npm
4. frp_ip
5. tip_rest
6. frp_anywhere
7. luajit_macos
8. lua_macos
9. app_future
10. google_bug

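Unstructured data again. Evidently our dear back-end colleague simply did the data entry and ran none of the unit tests that should have gone with it. The problems are being left for me to find. A long, long time ago I imagined that every engineer naturally worked with all sorts of automated test cases, just like the agile practices and rapid iteration described in books. In reality, much of the so-called agile development I have seen merely defers the cost and piles up technical debt. And debt always has to be repaid sooner or later. Verifying 100 questions one by one is very time-consuming, and in the end only a few will turn out to be wrong. In other words, done by hand, the time spent searching for problems would probably far exceed the time spent dealing with the problems actually found. So automated batch testing is the obvious choice. Quickly assembling a good-enough batch-testing toolchain for the scenario at hand should be a required skill for every engineer.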
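As a rough illustration of such a batch check, the sketch below walks through a table of expected intents and collects every question whose top intent differs. The function queryIntent is an assumption: it stands in for whatever client actually calls the Luis endpoint, and is replaced here by a fake lookup so the example runs offline.

```javascript
// Hypothetical sketch: batch-check that each question resolves to its expected intent.
// `queryIntent` is an assumed async function (question text -> top intent string);
// it is NOT a real Luis API, just a placeholder for the actual client call.
async function checkIntents(expected, queryIntent) {
  const mismatches = [];
  for (const [intent, question] of Object.entries(expected)) {
    const actual = await queryIntent(question);
    if (actual !== intent) {
      mismatches.push({ question, expected: intent, actual });
    }
  }
  return mismatches;
}

// Usage with a fake lookup table instead of a real network call.
const fakeAnswers = {
  "借助 frp 随时随地访问本身的树莓派": "frp_anywhere",
  "记录我发现的第一个关于 Google 的 Bug": "lua_macos", // deliberately wrong entry
};
const fakeQuery = async (question) => fakeAnswers[question];

checkIntents(
  {
    frp_anywhere: "借助 frp 随时随地访问本身的树莓派",
    google_bug: "记录我发现的第一个关于 Google 的 Bug",
  },
  fakeQuery
).then((mismatches) => console.log(mismatches));
```

Only the mismatches are reported, which fits the situation described above: most entries are fine, and the few wrong ones are what matter.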

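A set of videos, sorted into folders with only a rough kind of regularity

Again a representative description; the structure looks roughly like this: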

/videos/树莓派/【smart-transform】取自 Atom 的 babeljs&coffeescript&typescript 智能转 es5 库.mp4
/videos/树莓派/【YFMemoryLeakDetector】人人都能理解的 iOS 内存泄露检测工具类.mp4
/videos/树莓派/【玩转树莓派】使用 sinopia 搭建私有 npm 服务器.mp4
/videos/树莓派/【小技巧解决大问题】使用 frp 突破阿里云主机无弹性公网 IP 不能用做 Web 服务器的限制.mp4
/videos/frp/【树莓派自动化应用实例】整点提醒本身休息五分钟.mp4
/videos/frp/借助 frp 随时随地访问本身的树莓派.mp4
/videos/Lua/【LuaJIT版】从零开始在 macOS 上配置 Lua 开发环境.mp4
/videos/Lua/【最新版】从零开始在 macOS 上配置 Lua 开发环境.mp4
/videos/Lua/关于混合应用开发的将来的一些思考.mp4
/videos/Lua/记录我发现的第一个关于 Google 的 Bug.mp4

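Target data requirements

Each intent must be associated with its question

Clearly the intent should serve as each record's unique id. For ease of processing I simply wrote the data as a JS module. The reason for not using JSON directly is that a module is more flexible than a JSON file (it can carry comments, for instance, which JSON cannot) and easier to extend later, should that ever be needed.

This step has to be done by hand, or rather, someone always has to do it by hand. For the sake of efficiency, one person on the team has to take on this role.

After a rough pass, the first version of intent_info.js looks something like this: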

module.exports = {
  /* 树莓派 */
  "smart_transform":"【smart-transform】取自 Atom 的 babeljs&coffeescript&typescript 智能转 es5 库",
  "memory_leakDetector":"【YFMemoryLeakDetector】人人都能理解的 iOS 内存泄露检测工具类",
  "sinopia_npm":"【玩转树莓派】使用 sinopia 搭建私有 npm 服务器",
  "frp_ip":"【小技巧解决大问题】使用 frp 突破阿里云主机无弹性公网 IP 不能用做 Web 服务器的限制",
  /* frp */
  "tip_rest":"【树莓派自动化应用实例】整点提醒本身休息五分钟",
  "frp_anywhere":"借助 frp 随时随地访问本身的树莓派",
  /* Lua */
  "luajit_macos":"【LuaJIT版】从零开始在 macOS 上配置 Lua 开发环境",
  "lua_macos":"【最新版】从零开始在 macOS 上配置 Lua 开发环境",
  "app_future":"关于混合应用开发的将来的一些思考",
  "google_bug":"记录我发现的第一个关于 Google 的 Bug",
}
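Part of that manual work, pairing the numbered lines of the two documents, could in principle be scripted once the text has been pasted out of Word. The following is only a sketch under that assumption; parseNumbered and pairIntents are hypothetical helpers, and the result still needs a human review.

```javascript
// Sketch: pair two numbered plain-text lists by their leading number.
// Assumes each line looks like "3. text" or "10.text" (space after the dot optional).
function parseNumbered(text) {
  const byNumber = new Map();
  for (const line of text.split(/\r?\n/)) {
    const match = /^\s*(\d+)\.\s*(.+?)\s*$/.exec(line);
    if (match) byNumber.set(Number(match[1]), match[2]);
  }
  return byNumber;
}

// Build an { intent: question } object; throw on any inconsistency.
function pairIntents(questionText, intentText) {
  const questions = parseNumbered(questionText);
  const intents = parseNumbered(intentText);
  if (questions.size !== intents.size) {
    throw new Error(`Count mismatch: ${questions.size} questions, ${intents.size} intents`);
  }
  const result = {};
  for (const [number, intent] of intents) {
    if (!questions.has(number)) throw new Error(`No question numbered ${number}`);
    if (Object.prototype.hasOwnProperty.call(result, intent)) {
      throw new Error(`Duplicate intent: ${intent}`);
    }
    result[intent] = questions.get(number);
  }
  return result;
}

const paired = pairIntents(
  "9. 关于混合应用开发的将来的一些思考\n10.记录我发现的第一个关于 Google 的 Bug",
  "9. app_future\n10. google_bug"
);
console.log(paired);
```

Failing loudly on a count mismatch or a duplicate intent follows the same fail-fast attitude used elsewhere in this article.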

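Sorting

Sorting calls for a new field, order. I did not add it by hand, though; the ordering is expressed directly in the JSON-like structure above, by moving lines around. The sorting is done by someone else who is comfortable with technical files, so this keeps the operation simple.

After they reordered it, intent_info.js might look like this: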

module.exports = {
  /* 树莓派 */
  "smart_transform":"【smart-transform】取自 Atom 的 babeljs&coffeescript&typescript 智能转 es5 库",
  "memory_leakDetector":"【YFMemoryLeakDetector】人人都能理解的 iOS 内存泄露检测工具类",
  "sinopia_npm":"【玩转树莓派】使用 sinopia 搭建私有 npm 服务器",
  "frp_ip":"【小技巧解决大问题】使用 frp 突破阿里云主机无弹性公网 IP 不能用做 Web 服务器的限制",
  /* Lua */
  "luajit_macos":"【LuaJIT版】从零开始在 macOS 上配置 Lua 开发环境",
  "lua_macos":"【最新版】从零开始在 macOS 上配置 Lua 开发环境",
  "app_future":"关于混合应用开发的将来的一些思考",
  "google_bug":"记录我发现的第一个关于 Google 的 Bug",
  /* frp */
  "tip_rest":"【树莓派自动化应用实例】整点提醒本身休息五分钟",
  "frp_anywhere":"借助 frp 随时随地访问本身的树莓派",
}

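Entries nearer the top are displayed first. The order field is generated by leaning on a not entirely dependable behavior of Node: iterating over an object visits keys in the order they were written. This holds in Node and in Android browsers, but not in Safari. It also holds only for keys that are not integer-like; keys such as "1" or "10" are visited first, in ascending numeric order, regardless of where they were written. Code should not normally depend on this, but for now I only need something good enough, and this behavior of Node is unlikely to change in the short term.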
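The key-order behavior is easy to see in a few lines. The snippet below is a small demonstration, not part of the project; withOrder is a hypothetical helper showing what an explicit alternative could look like if the implicit order ever became a problem.

```javascript
// String keys that are not integer-like come back in the order they were written.
const byWriteOrder = { sinopia_npm: "c", frp_ip: "d", app_future: "a" };
const writeOrderKeys = Object.keys(byWriteOrder);
// writeOrderKeys is ["sinopia_npm", "frp_ip", "app_future"]

// Integer-like keys jump to the front, in ascending numeric order.
const mixed = { b: 1, "10": 2, a: 3, "2": 4 };
const mixedKeys = Object.keys(mixed);
// mixedKeys is ["2", "10", "b", "a"]

// Hypothetical alternative: an array of [intent, title] pairs makes the order explicit.
function withOrder(pairs) {
  return pairs.map(([intent, title], i) => ({ intent, title, order: i + 1 }));
}

const ordered = withOrder([
  ["frp_ip", "title A"],
  ["app_future", "title B"],
]);
console.log(writeOrderKeys, mixedKeys, ordered);
```

Since every intent in this project is a name like frp_ip rather than a number, the integer-key case does not arise here.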

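Categories

A few days later, sure enough, a new requirement arrived: there are too many videos and they are too messy, so each video should get a category, and videos should be viewable by category.

Fine, here are your categories: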

module.exports = {
  /* 树莓派 */
  "树莓派":"_category",
  "smart_transform":"【smart-transform】取自 Atom 的 babeljs&coffeescript&typescript 智能转 es5 库",
  "memory_leakDetector":"【YFMemoryLeakDetector】人人都能理解的 iOS 内存泄露检测工具类",
  "sinopia_npm":"【玩转树莓派】使用 sinopia 搭建私有 npm 服务器",
  "frp_ip":"【小技巧解决大问题】使用 frp 突破阿里云主机无弹性公网 IP 不能用做 Web 服务器的限制",
  /* Lua */
  "Lua":"_category",
  "luajit_macos":"【LuaJIT版】从零开始在 macOS 上配置 Lua 开发环境",
  "lua_macos":"【最新版】从零开始在 macOS 上配置 Lua 开发环境",
  "app_future":"关于混合应用开发的将来的一些思考",
  "google_bug":"记录我发现的第一个关于 Google 的 Bug",
  /* frp */
  "frp":"_category",
  "tip_rest":"【树莓派自动化应用实例】整点提醒本身休息五分钟",
  "frp_anywhere":"借助 frp 随时随地访问本身的树莓派",
}

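Several new fields with the value _category were added. Whenever a value equals _category, its key is treated as a category marker, and the entries that follow it belong to that category. This approach inevitably drew some sighs. But much of the time the technical strategy you choose has to be weighed against the state of the project and all its constraints. I had only a few dozen minutes to re-plan and reorganize 100 records, so there really was no room to think much further. Requirements always change and nobody knows what tomorrow will bring; one step further and it would have become "over-engineering". Besides, intents in this project follow their own naming convention, so it is safe to assume that an intent and a category name will never collide.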
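To make the marker convention concrete, here is a minimal sketch that splits such an object into groups. splitByCategory and CATEGORY_MARKER are names made up for this illustration; the sketch also relies on the key-order behavior discussed earlier.

```javascript
// Sketch: turn the flat { key: value } object into [{ category, intents }] groups.
// A key whose value is "_category" starts a new group; other keys join the current one.
const CATEGORY_MARKER = "_category";

function splitByCategory(intentInfo) {
  const groups = [];
  let current = null;
  for (const [key, value] of Object.entries(intentInfo)) {
    if (value === CATEGORY_MARKER) {
      current = { category: key, intents: [] };
      groups.push(current);
      continue;
    }
    if (!current) {
      throw new Error(`Intent "${key}" appears before any category marker`);
    }
    current.intents.push(key);
  }
  return groups;
}

const groups = splitByCategory({
  "Lua": "_category",
  "lua_macos": "【最新版】从零开始在 macOS 上配置 Lua 开发环境",
  "frp": "_category",
  "frp_anywhere": "借助 frp 随时随地访问本身的树莓派",
});
console.log(JSON.stringify(groups));
```

Throwing when an intent precedes every marker is an extra guard added in this sketch; it catches a marker line that was accidentally deleted or moved.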

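Associating questions with videos

After reading the sufficiently trustworthy structured data from intent_info.js, I build the association between questions and videos dynamically. This may require some adjustments to question titles and video titles. To avoid omissions, if a title has no matching video, or matches more than one, the script crashes immediately. Heavy-handed, but far less trouble than comparing and checking entries one by one afterwards. Given the characteristics of the question and video titles, I wrapped this in a dedicated function: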

/* Get the local path of the video matching a title.
To avoid silent errors, throw if no video or more than one video matches.

@return  The video path, relative to this script's directory.
 */
function localVideoPath(title)
{
  const path = require("path")
  const fs = require("fs-plus")

  const videoDir = path.resolve(__dirname, "./videos")

  /* All .mov/.mp4 files under ./videos, as paths relative to this script. */
  const videos = fs.listTreeSync(videoDir)
                  .filter(item => [".mov", ".mp4"].includes(path.extname(item)))
                  .map(item => path.relative(__dirname, item))

  /* A title must match exactly one video; otherwise throw. */
  let matchedPath = null

  for (const item of videos) {
    if (item.includes(title)) {
      if (matchedPath) {
        const tip = `Fatal: more than one video matches "${title}":
        ${matchedPath}
        ${item}`

        throw new Error(tip)
      }

      matchedPath = item
    }
  }

  if (!matchedPath) {
    const tip = `Fatal: no video matches this title:\n${title}`

    throw new Error(tip)
  }

  return matchedPath
}
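One property of the substring test is worth noting: a title that happens to be contained in a longer title is reported as a duplicate. If that ever gets in the way, a stricter variant could compare the file name exactly. The sketch below is such a variant, not the function used in this project; findUniqueVideo is a hypothetical name, and it takes the list of paths as a parameter so it can run without touching the file system.

```javascript
const path = require("path");

// Sketch: require exactly one video whose file name (without extension) equals the title.
function findUniqueVideo(title, videoPaths) {
  const matches = videoPaths.filter(
    (videoPath) => path.basename(videoPath, path.extname(videoPath)) === title
  );
  if (matches.length !== 1) {
    throw new Error(
      `Expected exactly one video for "${title}", found ${matches.length}: ${JSON.stringify(matches)}`
    );
  }
  return matches[0];
}

const samplePaths = [
  "videos/frp/借助 frp 随时随地访问本身的树莓派.mp4",
  "videos/frp/借助 frp 随时随地访问本身的树莓派（续）.mp4", // made-up file for the demo
];
const found = findUniqueVideo("借助 frp 随时随地访问本身的树莓派", samplePaths);
console.log(found);
```

The trade-off is that exact matching no longer tolerates file names with extra prefixes or suffixes, which the substring version accepts.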

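The code speaks for itself

The complete logic for automatically producing the structured data is below, all in make_data.js.

/* Generate the data file, including ordering and category information. */

/* Running this script generates the data automatically. */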
makeDataWithOrder()
function makeDataWithOrder()
{
  const fs = require('fs-extra')
  const path = require('path')

  const intentInfo = require("./intent_info.js")

  let intentInfoNew = []
  let index = 1

  /* In Node, iteration visits keys in the order they were written.
  This does not hold everywhere, e.g. in iOS browsers.
  For now it is good enough. */
  let category = ""
  for (let intent in intentInfo) {
    if (intentInfo[intent] === "_category") { /* A category marker, not a video. */
      category = intent
      continue
    }
    let title = intentInfo[intent]
    const local_path = localVideoPath(title)
    intentInfoNew.push({
      "type":"video",
      "content":"",
      "intent": intent,
      "title": title,
      "order": index,
      "local_video_path": local_path,
      "ext": path.extname(local_path),
      "category":category,
    })

    ++ index
  }

  localVideoLoseCheck(intentInfoNew)
  const dataPath = path.resolve(__dirname, "./data.json")
  /* Pretty-print with 2 spaces so the file is readable. */
  fs.writeJsonSync(dataPath, intentInfoNew, { spaces: 2 })
  console.log(`Done! Data written to ${dataPath}`)
}

/* Make sure every video is referenced by some intent, so that no video is left out.
Throws if any video has no matching question.
 */
function localVideoLoseCheck(intents)
{
  const path = require("path")
  const fs = require("fs-plus")

  /* Map every video path to false, meaning "not referenced yet". */
  const videoDir = path.resolve(__dirname, "./videos")
  const videoDict = fs.listTreeSync(videoDir)
                  .filter(item => [".mov", ".mp4"].includes(path.extname(item)))
                  .map(item => path.relative(__dirname, item))
                  .reduce((sum, item) => {
                    sum[item] = false
                    return sum
                  }, {})

  /* Mark the videos that an intent points to. */
  for (const item of intents) {
    videoDict[item.local_video_path] = true
  }

  /* Collect the videos that nothing referenced. */
  const loses = Object.keys(videoDict).filter(item => !videoDict[item])

  if (loses.length) {
    const tip = `The following ${loses.length} videos have no matching question:
    ${JSON.stringify(loses)}`
    throw new Error(tip)
  }
}

/* Get the local path of the video matching a title.
To avoid silent errors, throw if no video or more than one video matches.

@return  The video path, relative to this script's directory.
 */
function localVideoPath(title)
{
  const path = require("path")
  const fs = require("fs-plus")

  const videoDir = path.resolve(__dirname, "./videos")

  /* All .mov/.mp4 files under ./videos, as paths relative to this script. */
  const videos = fs.listTreeSync(videoDir)
                  .filter(item => [".mov", ".mp4"].includes(path.extname(item)))
                  .map(item => path.relative(__dirname, item))

  /* A title must match exactly one video; otherwise throw. */
  let matchedPath = null

  for (const item of videos) {
    if (item.includes(title)) {
      if (matchedPath) {
        const tip = `Fatal: more than one video matches "${title}":
        ${matchedPath}
        ${item}`

        throw new Error(tip)
      }

      matchedPath = item
    }
  }

  if (!matchedPath) {
    const tip = `Fatal: no video matches this title:\n${title}`

    throw new Error(tip)
  }

  return matchedPath
}

Run the following in the project directory:

node ./make_data.js

and we get the structured data we want:

[
  {
    "type": "video",
    "content": "",
    "intent": "smart_transform",
    "title": "【smart-transform】取自 Atom 的 babeljs&coffeescript&typescript 智能转 es5 库",
    "order": 1,
    "local_video_path": "videos/树莓派/【smart-transform】取自 Atom 的 babeljs&coffeescript&typescript 智能转 es5 库.mp4",
    "ext": ".mp4",
    "category": "树莓派"
  },
  {
    "type": "video",
    "content": "",
    "intent": "memory_leakDetector",
    "title": "【YFMemoryLeakDetector】人人都能理解的 iOS 内存泄露检测工具类",
    "order": 2,
    "local_video_path": "videos/树莓派/【YFMemoryLeakDetector】人人都能理解的 iOS 内存泄露检测工具类.mp4",
    "ext": ".mp4",
    "category": "树莓派"
  },
  {
    "type": "video",
    "content": "",
    "intent": "sinopia_npm",
    "title": "【玩转树莓派】使用 sinopia 搭建私有 npm 服务器",
    "order": 3,
    "local_video_path": "videos/树莓派/【玩转树莓派】使用 sinopia 搭建私有 npm 服务器.mp4",
    "ext": ".mp4",
    "category": "树莓派"
  },
  {
    "type": "video",
    "content": "",
    "intent": "frp_ip",
    "title": "【小技巧解决大问题】使用 frp 突破阿里云主机无弹性公网 IP 不能用做 Web 服务器的限制",
    "order": 4,
    "local_video_path": "videos/树莓派/【小技巧解决大问题】使用 frp 突破阿里云主机无弹性公网 IP 不能用做 Web 服务器的限制.mp4",
    "ext": ".mp4",
    "category": "树莓派"
  },
  {
    "type": "video",
    "content": "",
    "intent": "luajit_macos",
    "title": "【LuaJIT版】从零开始在 macOS 上配置 Lua 开发环境",
    "order": 5,
    "local_video_path": "videos/Lua/【LuaJIT版】从零开始在 macOS 上配置 Lua 开发环境.mp4",
    "ext": ".mp4",
    "category": "Lua"
  },
  {
    "type": "video",
    "content": "",
    "intent": "lua_macos",
    "title": "【最新版】从零开始在 macOS 上配置 Lua 开发环境",
    "order": 6,
    "local_video_path": "videos/Lua/【最新版】从零开始在 macOS 上配置 Lua 开发环境.mp4",
    "ext": ".mp4",
    "category": "Lua"
  },
  {
    "type": "video",
    "content": "",
    "intent": "app_future",
    "title": "关于混合应用开发的将来的一些思考",
    "order": 7,
    "local_video_path": "videos/Lua/关于混合应用开发的将来的一些思考.mp4",
    "ext": ".mp4",
    "category": "Lua"
  },
  {
    "type": "video",
    "content": "",
    "intent": "google_bug",
    "title": "记录我发现的第一个关于 Google 的 Bug",
    "order": 8,
    "local_video_path": "videos/Lua/记录我发现的第一个关于 Google 的 Bug.mp4",
    "ext": ".mp4",
    "category": "Lua"
  },
  {
    "type": "video",
    "content": "",
    "intent": "tip_rest",
    "title": "【树莓派自动化应用实例】整点提醒本身休息五分钟",
    "order": 9,
    "local_video_path": "videos/frp/【树莓派自动化应用实例】整点提醒本身休息五分钟.mp4",
    "ext": ".mp4",
    "category": "frp"
  },
  {
    "type": "video",
    "content": "",
    "intent": "frp_anywhere",
    "title": "借助 frp 随时随地访问本身的树莓派",
    "order": 10,
    "local_video_path": "videos/frp/借助 frp 随时随地访问本身的树莓派.mp4",
    "ext": ".mp4",
    "category": "frp"
  }
]
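As one possible way to consume this file for the "view by category" requirement, the records can be sorted by order and grouped by category. This is only a sketch; groupByCategory is a hypothetical helper, and the sample records are shortened stand-ins for the entries above.

```javascript
// Sketch: group records from data.json by category, keeping them sorted by `order`.
function groupByCategory(records) {
  const sorted = [...records].sort((a, b) => a.order - b.order);
  const grouped = new Map();
  for (const record of sorted) {
    if (!grouped.has(record.category)) grouped.set(record.category, []);
    grouped.get(record.category).push(record.intent);
  }
  return grouped;
}

// Shortened sample records; real ones would come from reading data.json.
const sampleRecords = [
  { intent: "tip_rest", order: 9, category: "frp" },
  { intent: "lua_macos", order: 6, category: "Lua" },
  { intent: "luajit_macos", order: 5, category: "Lua" },
];
const byCategory = groupByCategory(sampleRecords);
console.log([...byCategory]);
```

A Map is used so that categories come out in the order of their first record, which mirrors the order chosen in intent_info.js.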

References

[Fun Series] How an Engineer Uploads Videos vs. How Everyone Else Does: source code project