如何用R语言参破每一个复仇者的口头禅？

时间 2019-11-30

标签如何语言每一个复仇者口头禅繁體版

原文原文链接

盼望着，盼望着，《复联3》终于在国内上映。《复仇者联盟：无限战争》的表现也不负众望，国内上映3天后票房即达12亿元，目前豆瓣评分为8.5。git

不用说你也知道，“复仇者联盟”里每一个成员都性格迥异，因此说话用词都有各自鲜明的特色。那他们说话都爱用哪些词儿？windows

国外有几位漫威的铁杆粉丝把每一个复仇者的说话习惯用 R 语言可视化了出来，图中每一个词对应的条形长度，表明了他比其余复仇者更爱说这个词的程度。api

咱们能够看到，美队老爱喊别人名字，特别是托尼（emmmmmm...）；黑豹常常念叨一些很高大上的词（好比朋友，国王），不像蜘蛛侠，满嘴嗯啊个不停（好比嘿，啊，呃），还跟个孩子似的；浩克和鹰眼说的最多的是黑寡妇，不过两人喊得称呼却不一样（缘由你猜）；幻视和绯红女巫颇有共同话题，因此这是俩人互生爱慕的缘由？果真，雷神念叨最多的仍是老弟洛基，并且总是想着“宇宙大事”，说的话都和第三部《无限战争》紧密相关；至于洛基嘛，意料之中的常常哔哔“权力”“王位”这些，可是跟洛基同样也渴望权力的奥创却说话不同，人家说的词就颇有诗意。bash

这么有意思的可视化图形是怎么作出来的呢？秘笈以下：app

首先咱们会用到如下 R 语言包：ide

library(dplyr)
library(grid)
library(gridExtra)
library(ggplot2)
library(reshape2)
library(cowplot)
library(jpeg)
library(extrafont)
复制代码

有些人可能认为使用“清除全部”代码行很很差，可是在脚本顶部用它能够确保在执行脚本时，脚本不会依赖不当心遗留在工做区内的任何对象。函数

rm(list = ls())
复制代码

这是包含全部复仇者图像的文件夹：字体

dir_images <- "C:\\Users\\Matt\\Documents\\R\\Avengers"
setwd(dir_images)
复制代码

设置字体ui

windowsFonts(Franklin=windowsFont("Franklin Gothic Demi"))
复制代码

各个复仇者名字的简化版

character_names <- c("black_panther","black_widow","bucky","captain_america",
                    "falcon","hawkeye","hulk","iron_man",
                    "loki","nick_fury","rhodey","scarlet_witch",
                    "spiderman","thor","ultron","vision")
image_filenames <- paste0(character_names, ".jpg")
复制代码

读取和简化版复仇者名字对应的图像文件的函数this

read_image <- function(filename){
 char_name <- gsub(pattern = "\\.jpg$", "", filename)
 img <- jpeg::readJPEG(filename)
 return(img)
}
复制代码

将全部图像读取为一个列表

all_images <- lapply(image_filenames, read_image)
复制代码

为这列图像分配名字，这样后面就能够被字符检索到了

names(all_images) <- character_names
复制代码

其实使用图像名字很简单，好比下面这个例子

# clear the plot window
grid.newpage()
# draw to the plot window
grid.draw(rasterGrob(all_images[['vision']]))
复制代码

获取文本数据这几位漫威粉并无将他们本身的电影台词数据集分享出来，不过咱们能够在 IMSDB 上下载，而后用文本分析技术稍做处理。若是原做者后面将本身的数据集公开，咱们会第一时间分享。

加载本地数据集。

修正人物名字的大小写

capitalize <- Vectorize(function(string){
 substr(string,1,1) <- toupper(substr(string,1,1))
 return(string)
})

proper_noun_list <- c("clint","hydra","steve","tony",
                     "sam","stark","strucker","nat","natasha",
                     "hulk","tesseract", "vision",
                     "loki","avengers","rogers", "cap", "hill")

# Run the capitalization function
word_data <- word_data %>%
 mutate(word = ifelse(word %in% proper_noun_list, capitalize(word), word)) %>%
 mutate(word = ifelse(word == "jarvis", "JARVIS", word))
复制代码

注意前面的简化版人物名字，不要匹配文本数据框中已经处理好格式的人物名字。

unique(word_data$Speaker)
## [1] "Black Panther" "Black Widow" "Bucky"
## [4] "Captain America" "Falcon" "Hawkeye"
## [7] "Hulk" "Iron Man" "Loki"
## [10] "Nick Fury" "Rhodey" "Scarlet Witch"
## [13] "Spiderman" "Thor" "Ultron"
## [16] "Vision"
复制代码

制做一个查询表，将简写的文件名转换为美观的人物名字

character_labeler <- c(`black_panther` = "Black Panther",
                      `black_widow` = "Black Widow",
                      `bucky` = "Bucky",
                      `captain_america` = "Captain America",
                      `falcon` = "Falcon", `hawkeye` = "Hawkeye",
                      `hulk` = "Hulk", `iron_man` = "Iron Man",
                      `loki` = "Loki", `nick_fury` = "Nick Fury",
                      `rhodey` = "Rhodey",`scarlet_witch` ="Scarlet Witch",
                      `spiderman`="Spiderman", `thor`="Thor",
                      `ultron` ="Ultron", `vision` ="Vision")
复制代码

得到两个不一样版本的人物名字

其中一个版本用来展现（由于美观），另外一个版本用于简单的组织和引用图像文件（由于简单）。

convert_pretty_to_simple <- Vectorize(function(pretty_name){
 # pretty_name = "Vision"
 simple_name <- names(character_labeler)[character_labeler==pretty_name]
 # simple_name <- as.vector(simple_name)
 return(simple_name)
})
# convert_pretty_to_simple(c("Vision","Thor"))
# just for fun, the inverse of that function
convert_simple_to_pretty <- function(simple_name){
 # simple_name = "vision"
 pretty_name <- character_labeler[simple_name] %>% as.vector()
 return(pretty_name)
}
# example
convert_simple_to_pretty(c("vision","black_panther"))
## [1] "Vision" "Black Panther"

复制代码

为文本数据框添加简化版人物名字。

word_data$character <- convert_pretty_to_simple(word_data$Speaker)
复制代码

为每一个人物分配一个主要颜色。

character_palette <- c(`black_panther` = "#51473E",
                      `black_widow` = "#89B9CD",
                      `bucky` = "#6F7279",
                      `captain_america` = "#475D6A",
                      `falcon` = "#863C43", `hawkeye` = "#84707F",
                      `hulk` = "#5F5F3F", `iron_man` = "#9C2728",
                      `loki` = "#3D5C25", `nick_fury` = "#838E86",
                      `rhodey` = "#38454E",`scarlet_witch` ="#620E1B",
                      `spiderman`="#A23A37", `thor`="#323D41",
                      `ultron` ="#64727D", `vision` ="#81414F" )
复制代码

制做水平方向的条形图

avengers_bar_plot <- word_data %>%
 group_by(Speaker) %>%
 top_n(5, amount) %>%
 ungroup() %>%
 mutate(word = reorder(word, amount)) %>%
 ggplot(aes(x = word, y = amount, fill = character))+
 geom_bar(stat = "identity", show.legend = FALSE)+
 scale_fill_manual(values = character_palette)+
 scale_y_continuous(name ="Log Odds of Word",
                    breaks = c(0,1,2)) +
 theme(text = element_text(family = "Franklin"),
       # axis.title.x = element_text(size = rel(1.5)),
       panel.grid = element_line(colour = NULL),
       panel.grid.major.y = element_blank(),
       panel.grid.minor = element_blank(),
       panel.background = element_rect(fill = "white",
                                   colour = "white"))+
 # theme(strip.text.x = element_text(size = rel(1.5)))+
 xlab("")+
 coord_flip()+
 facet_wrap(~Speaker, scales = "free_y")
avengers_bar_plot
复制代码

看起来很不错。

可是咱们想画个更酷炫的图：用每一个复仇者的照片来填充条形图。

也就是说咱们只在条形图区域内展现出复仇者的照片，在条形区域之外的地方则不展现（以下图所示）。

若是想作到这点，咱们须要显示一个透明的条形，而后在条形的末尾画一个白色的条形，延伸至图像边缘覆盖人物照片的剩余部分。

在数据框部分，咱们如今想用所需的值的余数来补充数字值，以实现总体最大化，这样当把值和余数相加时，全部数值都会增长到同一最大数值，以一样的格式将不一样行组合到一块儿。

max_amount <- max(word_data$amount)
word_data$remainder <- (max_amount - word_data$amount) + 0.2
复制代码

只提取每一个复仇者说的最多的5个词

word_data_top5 <- word_data %>%
 group_by(character) %>%
 arrange(desc(amount)) %>%
 slice(1:5) %>%
 ungroup()
复制代码

将数量&余数转换为长格式

这样能保证每一个人物和所说词语的匹配关系有两个 entry，一个用以真实数量（“amount”），一个用以选择在哪里结束，达到常见的最大值（“remainder”）。

这会将“amount”和“remainder”重叠为一个单独的列称为“variable”，表示是什么值，而另外一个列“value”包含来自这些值中每个值的数字。

word_data_top5_m <- melt(word_data_top5, measure.vars = c("amount","remainder"))
复制代码

Variable 是一个值是真实数量仍是补充数量的标记。

如今咱们按顺序将它们放在一块儿，和在melt函数中的肯定它们的顺序相反。不然“amount”和“remainder”会以相反的顺序展示在图形中。

word_data_top5_m$variable2 <- factor(word_data_top5_m$variable,
                                    levels = rev(levels(word_data_top5_m$variable)))
复制代码

为一我的物展现前 5 个词语数据的函数

以简单的形式声明人物名字，好比用 black_panther 而不是 Black Panther。

plot_char <- function(character_name){
 # example: character_name = "black_panther"
 # plot details that we might want to fiddle with
 # thickness of lines between bars
 bar_outline_size <- 0.5
 # transparency of lines between bars
 bar_outline_alpha <- 0.25
 #
 # The function takes the simple character name,
 # but here, we convert it to the pretty name,
 # because we'll want to use that on the plot.
 pretty_character_name <- convert_simple_to_pretty(character_name)

 # Get the image for this character,
 # from the list of all images.
 temp_image <- all_images[character_name]

 # Make a data frame for only this character
 temp_data <- word_data_top5_m %>%
   dplyr::filter(character == character_name) %>%
   mutate(character = character_name)

 # order the words by frequency
 # First, make an ordered vector of the most common words
 # for this character
   ordered_words <- temp_data %>%
     mutate(word = as.character(word)) %>%
     dplyr::filter(variable == "amount") %>%
     arrange(value) %>%
     `[[`(., "word")

   # order the words in a factor,
   # so that they plot in this order,
   # rather than alphabetical order
   temp_data$word = factor(temp_data$word, levels = ordered_words)

 # Get the max value,
 # so that the image scales out to the end of the longest bar
 max_value <- max(temp_data$value)
 fill_colors <- c(`remainder` = "white", `value` = "white")

 # Make a grid object out of the character's image
 character_image <- rasterGrob(all_images[[character_name]],
                               width = unit(1,"npc"),
                               height = unit(1,"npc"))

 # make the plot for this character
 output_plot <- ggplot(temp_data)+
   aes(x = word, y = value, fill = variable2)+
   # add image
   # draw it completely bottom to top (x),
   # and completely from left to the the maximum log-odds value (y)
   # note that x and y are flipped here,
   # in prep for the coord_flip()
   annotation_custom(character_image,
                     xmin = -Inf, xmax = Inf, ymin = 0, ymax = max_value) +
   geom_bar(stat = "identity", color = alpha("white", bar_outline_alpha),
            size = bar_outline_size, width = 1)+
   scale_fill_manual(values = fill_colors)+
   theme_classic()+
   coord_flip(expand = FALSE)+
   # use a facet strip,
   # to serve as a title, but with color
   facet_grid(. ~ character, labeller = labeller(character = character_labeler))+
   # figure out color swatch for the facet strip fill
   # using character name to index the color palette
   # color= NA means there's no outline color.
   theme(strip.background = element_rect(fill = character_palette[character_name],
                                         color = NA))+
   # other theme elements
   theme(strip.text.x = element_text(size = rel(1.15), color = "white"),
         text = element_text(family = "Franklin"),
         legend.position = "none",
         panel.grid = element_blank(),
         axis.text.x = element_text(size = rel(0.8)))+
   # omit the axis title for the individual plot,
   # because we'll have one for the entire ensemble
   theme(axis.title = element_blank())
 return(output_plot)
}
复制代码

将 X 轴名称用为全部复仇者主图像的名称

plot_x_axis_text <- paste("Tendency to use this word more than other characters do",
                         "(units of log odds ratio)", sep = "\n")
复制代码

下面是函数在这里的工做示例

sample_plot <- plot_char("black_panther")+
 theme(axis.title = element_text())+
 # x lab is still declared as y lab
 # because of coord_flip()
 ylab(plot_x_axis_text)
sample_plot
复制代码

为什么咱们这里的水平轴上还带着很是奇怪的“对数差别比”？

由于随着数字增大，差别也会随之增大（具体数学知识这里再也不讲述）；将它们转换为对数尺度，能够约束变化幅度的大小，方便咱们在屏幕上展现。

若是想将这些对数差别转化为简单的几率形式，能够用以下函数：

logit2prob <- function(logit){
 odds <- exp(logit)
 prob <- odds / (1 + odds)
 return(prob)
}
复制代码

这样处理后水平轴会以下所示：

logit2prob(seq(0, 2.5, 0.5))
## [1] 0.5000000 0.6224593 0.7310586 0.8175745 0.8807971 0.9241418
复制代码

注意此序列中连续项目之间的差别在慢慢消失：

diff(logit2prob(seq(0, 2.5, 0.5)))
## [1] 0.12245933 0.10859925 0.08651590 0.06322260 0.04334474
复制代码

Okay，如今咱们制做出了一个图···

咱们接着将函数应用到列表中全部复仇者身上，将全部绘图放入一个列表对象。

all_plots <- lapply(character_names, plot_char)
复制代码

从绘图中提取轴名称的函数

不只仅是文本，还有其它画出的信息。

你能够选择提取 X 轴名称仍是 Y 轴名称：

get_axis_grob <- function(plot_to_pick, which_axis){
 # plot_to_pick <- sample_plot
 tmp <- ggplot_gtable(ggplot_build(plot_to_pick))
 # tmp$grobs
 # find the grob that looks like
 # it would be the x axis
 axis_x_index <- which(sapply(tmp$grobs, function(x){
   # for all the grobs,
   # return the index of the one
   # where you can find the text
   # "axis.title.x" or "axis.title.y"
   # based on input argument `which_axis`
   grepl(paste0("axis.title.",which_axis), x)}
 ))
 axis_grob <- tmp$grobs[[axis_x_index]]
 return(axis_grob)
} 
复制代码

提取轴名称 Grob

px_axis_x <- get_axis_grob(sample_plot, "x")
px_axis_y <- get_axis_grob(sample_plot, "y")
复制代码

下面是使用这些提取的轴的方法：

grid.newpage()
grid.draw(px_axis_x) 
复制代码

将全部绘图排成一个对象

big_plot <- arrangeGrob(grobs = all_plots)
复制代码

将 X 轴嵌入绘图的底部，由于每一个图并无 X 轴，而咱们想让它们都有 X 轴。

注意这时绘图会看着很不协调，高度差很少是宽度的十倍。

big_plot_w_x_axis_title <- arrangeGrob(big_plot,
                                      px_axis_x,
                                      heights = c(10,1))
grid.newpage()
grid.draw(big_plot_w_x_axis_title)
复制代码

绘图所占的空间大小不一，由于每一个图的词汇长度不一样。

这样看起来有些混乱。

一般咱们会用 facet_grid() 或 facet_wrap() 来确保绘图整洁有序，但这里却不能使用由于每一个图的背景图各不相同，没法像数据框中的其它列同样映射到平面上（由于背景图像实际上并不是数据框的一部分）。

使用 cowplot 而非 arrangeGrob

这样绘图的轴会垂直对齐：

big_plot_aligned <- cowplot::plot_grid(plotlist = all_plots, align = 'v', nrow = 4)
复制代码

和以前同样，将X轴名称添加至绘图对齐后网格的下方。

big_plot_w_x_axis_title_aligned <- arrangeGrob(big_plot_aligned,
                                              px_axis_x,
                                              heights = c(10,1))
复制代码

下面是将总体效果图绘制在屏幕上的方法：

grid.newpage()
grid.draw(big_plot_w_x_axis_title_aligned)
复制代码

很好！

保存最终图像：

ggsave(big_plot_w_x_axis_title_aligned,
      file = "Avengers_Word_Usage.png",
      width = 12, height = 6.3)
复制代码

这样，咱们就可视化出了《复联》中各个复仇者都最爱说那些话！