马尔科夫链随机文本生成器

时间 2019-11-19

标签随机文本成器繁體版

原文原文链接

说明：node

有一种基于马尔可夫链算法的随机文本生成方法，它利用任何一个现有的某种语言的文本（如一本英文小说），能够构造出由这个文本中的语言使用状况而造成的统计模型，并经过该模型生成的随机文本将具备与原文本相似的统计性质（即具备相似写做风格）。算法

该算法的基本原理是将输入当作是由一些互相重叠的短语构成的序列，其将每一个短语分割为两个部分：一部分是由多个词构成的前缀，另外一部分是只包含一个词的后缀。在生成文本时依据原文本的统计性质（即前缀肯定的状况下，获得全部可能的后缀），随机地选择某前缀后面的特定后缀。在此，假设前缀长度为两个单词，则马尔可夫链（Markov Chain）随机文本生成算法以下：数组

设w1和w2为文本的前两个词数据结构

输出w1和w2app

循环：函数

随机地选出w3，它是原文本中w1w2为前缀的后缀中的一个ui

输出w3spa

w1 = w2指针

w2 = w3code

重复循环

下面将经过一个例子来讲明该算法原理，假设有一个原文以下：

Show your flowcharts and conceal your tables and I will be mystified. Show your tables and your flowcharts will be obvious.

下面是上述原文的一些前缀和其后缀（注意只是部分）的统计：

Prefix	Suffix
Show your	flowcharts tables
your flowcharts	and will
flowcharts and	conceal
flowcharts willl	be
your tables	and and
will be	mystified. obvious.
be mystified.	Show
be obvious.	(end)

基于上述文本，按照马尔可夫链（Markov Chain）算法随机文本生成文本时，首先输出的是Show your，而后随机取出flowcharts或tables。若是为前者，则接下来的前缀就变成your flowcharts，而下一个后缀应该是and或will；若是为tables，则接下来的前缀就变成your tables，而下一个词就应该是and。这样继续下去，直到产生出足够多的输出，或在查找后缀时遇到告终束标志。

编写一个程序从文件中读入一个英文文本，利用马尔可夫链（Markov Chain）算法，基于文本中固定长度的短语的出现频率，生成一个最大单词数目不超过N的新文本到给定文件中。程序要求前缀词的个数为2，最大单词数目N由标准输入得到。

说明：

为了获得更好的统计特性，在此标点符号等非字母字符（如’ “ . , ? – ()等）也被当作单词的一部分，即“words”和“words.”是不一样的单词。所以，在此将“词”定义为由“空白界定的字符串”；

对于同一个前缀的后缀按出现顺序排放（无论该后缀是否已存在）；

在处理文本时，文件结束标志也将做为某一前缀的一个后缀，如上面示例（说明：在为文件最后两个前缀单词“be obvious.”读取后缀时，遇到文件结束，即其没有相应后缀，此时可用一个特殊标记来表示其后缀，如，可存储一个自定义的特殊串（如“(end)”）做为其后缀来表示当前状态，即文件结束）；

对于某一前缀，按以下方式来随机选择其后缀（若是某一前缀只有一个后缀，将直接选择该后缀）：

n = (int)(rrand() * N);

在此N为某一前缀的全部后缀的总数，n为所肯定的后缀在该前缀的后缀序列中的序号（从0开始计数，即n为0时选取第一个后缀，为1时选取第二个后缀，以此类推）。在此，随机数生成函数rrand()的定义以下：

double seed = 997;

double rrand()
{
    double lambda = 3125.0;
    double m = 34359738337.0;
    double r;
    seed = fmod(lambda*seed, m); //要包含头文件#include <math.h>
    r = seed/ m;
    return r;
}

注意：为了保证运行结果的肯定性，请务必使用本文提供的随机数生成函数。

在下面条件知足时文本生成结束：1）遇到后缀为文件结束标志；或2）生成文本的单词数达到所设定的最大单词数。在程序实现时，当读到文件（结束）尾时，可将一个特殊标志赋给后缀串suffix变量。

【输入形式】

建立英文文本文件“article.txt”进行统计分析，并从标准输入中读入一个正整数做为生成文本时的最大单词数。

【输出形式】

将生成文本输出到当前目录下文件“markov.txt”中。单词间以一个空格分隔，最后一个单词后空格无关紧要。

【样例输入】

若当前目录下文件article.txt中内容以下：

I will give you some advice about life.

Eat more roughage;

Do more than others expect you to do and do it pains;

Remember what life tells you;

do not take to heart every thing you hear.

do not spend all that you have.

do not sleep as long as you want;

Whenever you say "I love you", please say it honestly;

Whevever you say "I am sorry", please look into the other person's eyes;

Whenever you find your wrongdoing, be quick with reparation!

Whenever you make a phone call smil when you pick up the phone, because someone feel it!

Understand rules completely and change them reasonably;

Remember, the best love is to love others unconditionally rather than make demands on them;

Comment on the success you have attained by looking in the past at the target you wanted to achieve most;

In love and cooking, you must give 100% effort - but expect little appreciation.

从标准输入中输入的单词个数为：

1000

【样例输出】

当前目录下所生成的文件markov.txt中内容以下：

I will give you some advice about life. Eat more roughage; Do more than others expect you to do and do it pains; Remember what life tells you; do not take to heart every thing you hear. do not take to heart every thing you hear. do not spend all that you have. do not sleep as long as you want; Whenever you find your wrongdoing, be quick with reparation! Whenever you find your wrongdoing, be quick with reparation! Whenever you find your wrongdoing, be quick with reparation! Whenever you find your wrongdoing, be quick with reparation! Whenever you say "I am sorry", please look into the other person's eyes; Whenever you say "I am sorry", please look into the other person's eyes; Whenever you make a phone call smil when you pick up the phone, because someone feel it! Understand rules completely and change them reasonably; Remember, the best love is to love others unconditionally rather than make demands on them; Comment on the success you have attained by looking in the past at the target you wanted to achieve most; In love and cooking, you must give 100% effort - but expect little appreciation.

【样例说明】

按照本文介绍的马尔可夫链（Markov Chain）算法将生成相关输出文件。

【使用什么数据结构？】

使用如上图所示的数据结构，创建一个数据结构State存储状态和一个后缀链表Suffix，这样创建数据结构的缘由是每经过hash找到一个前缀pref的State，就能够经过这个这个State的suf链表寻找随机生成的后缀，另外为了加快速度也能够用二叉搜索树代替这个这个suf后缀链表。这里为了加快速度用了一些技巧，就是在State加了一个变量SufNum,这个变量的目的就是使插入的时候不用按照传统链表插入到末尾这种方法，而是直接在头部插入，读取的时候经过(SufNum-the_index_you_want)次next操做就能够找到所须要的后缀Suf了。

提及来很简单，但实现起来仍是十分的困难，做者在代码里加了两个C语言经常使用的函数memcpy和strdup

如下是 memcpy() 函数的声明。

void *memcpy(void *str1, const void *str2, size_t n)
参数
str1 -- 这是指针数组，其中的内容将被复制到目标，类型强制转换为void*类型的指针。

str2 -- 这是要复制的数据源的指针，void*类型的指针型铸造。

n -- 这是要被复制的字节数。

返回值
这个函数返回一个指针到目的地，str1。

可参考网址 http://www.cplusplus.com/reference/cstring/memcpy/

Example Code:

#include <stdio.h>
#include <string.h>

int main ()
{
   const char src[50] = "test";
   char dest[50];
   printf("Before:%s\n", dest);
   memcpy(dest, src, strlen(src)+1);
   printf("After: %s\n", dest);
   return(0);
}

结果：

Before:

After: test

而strdup是个字符串的复制，这个函数会单独alloc一块新的记忆体，不像strcpy函数同样须要本身准备两个记忆体。

调用以后须要用free()函数释放掉。

#include <string.h>
#include <assert.h>
#include <stdlib.h>

int main(void)
{
    const char *s1 = "String";
    char *s2 = strdup(s1);
    assert(strcmp(s1, s2) == 0);
    free(s2);
}

部分函数还使用了inline内联函数

inline函数优势：
传统程序的函数调用须要不停的调用栈，当有函数须要频繁调用的时候，那就会致使栈溢出或者效率不高等其余问题，用inline函数至关于把函数源代码直接“嵌入”到函数调用点
inline函数缺点：
若是调用inline函数的地方过多，也可能形成代码膨胀。

有了如上基础以后咱们能够从创建State的hash表，而后编写增长后缀函数和查找函数，就能够实现马尔可夫链随机文本的生成了。

这里的hash方法使用的是NHASH为5000011的BKDR算法，写有不少效率更高的方法，能够上网去寻找替代。

博主的代码:

  1 #include <math.h>
  2 #include <stdio.h>
  3 #include <stdlib.h>
  4 #include <string.h>
  5 const int NHASH = 5000011;
  6 const int PREFIX_NUM = 2;
  7 typedef struct State State;
  8 typedef struct Suffix Suffix;
  9 struct State {
 10   char *pref[PREFIX_NUM];
 11   Suffix *suf;
 12   State *next;
 13   unsigned int sufNum;
 14 };
 15 struct Suffix {
 16   char *word;
 17   Suffix *next;
 18 };
 19 State *statetab[NHASH];
 20 
 21 /*利用了BKDR HASH方法，这里还可使用别的HASH方法减小冲突*/
 22 unsigned int hash(char *s[PREFIX_NUM]) {
 23   unsigned int seed = 131;
 24   unsigned int hash = 0;
 25   unsigned int i;
 26   for (i = 0; i < PREFIX_NUM; i++) {
 27     char *str = s[i];
 28     while (*str)
 29       hash = hash * seed + (*str++);
 30   }
 31   return hash % NHASH;
 32 }
 33 
 34 /*查找前缀数组prefix[PREFIX_NUM]是否在哈希表中出现*/
 35 State *lookup(char *prefix[PREFIX_NUM], int isBuild) {
 36   /*If isBuild is true,it will be a new node*/
 37   int i, h;
 38   h = hash(prefix);
 39   State *sp = statetab[h];
 40   while (sp != NULL) {
 41     for (i = 0; i < PREFIX_NUM; ++i) {
 42       if (strcmp(prefix[i], sp->pref[i]))
 43         break;
 44     }
 45     if (i == PREFIX_NUM) //找到了就返回
 46       return sp;
 47     sp = sp->next;
 48   }
 49   if (isBuild) {
 50     sp = malloc(sizeof(State));
 51     for (i = 0; i < PREFIX_NUM; ++i) {
 52       sp->pref[i] = prefix[i];
 53     }
 54     sp->suf = NULL;
 55     sp->sufNum = 0;
 56     sp->next = statetab[h]; //头插法
 57     statetab[h] = sp;
 58   }
 59   return sp;
 60 }
 61 
 62 /*直接在头结点插入后缀，减小插入时间*/
 63 State *addsuffix(State *sp, char *suffix);
 64 inline State *addsuffix(State *sp, char *suffix) {
 65   Suffix *suf = malloc(sizeof(Suffix));
 66   suf->word = suffix;
 67   suf->next = sp->suf;
 68   sp->sufNum++;
 69   sp->suf = suf;
 70   return sp;
 71 }
 72 
 73 /*往数据结构中插入一个新的项，使用inline内联函数加快速度*/
 74 void add(char *prefix[PREFIX_NUM], char *suffix);
 75 inline void add(char *prefix[PREFIX_NUM], char *suffix) 
 76 {
 77   State *sp = NULL;
 78   sp = lookup(prefix, 1);
 79   sp = addsuffix(sp, suffix);
 80   // memmove(prefix, prefix + 1, sizeof(prefix[0]));
 81   memcpy(prefix, prefix + 1, sizeof(prefix[0]));
 82   prefix[1] = suffix;
 83 }
 84 
 85 double seed = 997;
 86 
 87 /*如上面所要求的随机生成器*/
 88 double rrand() {
 89   double lambda = 3125.0;
 90   double m = 34359738337.0;
 91   double r;
 92   seed = fmod(lambda * seed, m); //要包含头文件#include <math.h>
 93   r = seed / m;
 94   return r;
 95 }
 96 
 97 void build(char *prefix[PREFIX_NUM], FILE *f);
 98 inline void build(char *prefix[PREFIX_NUM], FILE *f) {
 99   char buf[40];
100   while (fscanf(f, "%39s", buf) != EOF) {
101     add(prefix, strdup(buf));
102   }
103 }
104 
105 /*从这里生成markov链*/
106 void generate(int nwords, FILE *OUT) {
107   State *sp;
108   char *prefix[PREFIX_NUM], *w;
109   unsigned int i;
110   for (i = 0; i < PREFIX_NUM; ++i)
111     prefix[i] = "\0";
112 
113   for (i = 0; i < nwords; ++i) {
114     sp = lookup(prefix, 0);
115     Suffix *suf = sp->suf;
116     if (sp->sufNum == 1) {
117       w = suf->word;
118     } else {
119       int n = sp->sufNum - (int)(rrand() * sp->sufNum) - 1;
120       while (suf != NULL) {
121         if (n == 0) {
122           w = suf->word;
123           break;
124         }
125         suf = suf->next;
126         n--;
127       }
128     }
129     if (strcmp(w, "(end)") == 0)
130       break;
131     fprintf(OUT, "%s ", w);
132     memcpy(prefix, prefix + 1, (PREFIX_NUM - 1) * sizeof(Suffix));
133     // strcpy(prefix[0], prefix[1]);
134     prefix[1] = w;
135   }
136 }
137 
138 int main() {
139   unsigned long int nwords;
140   unsigned int i;
141   scanf("%lu", &nwords);
142   char *prefix[PREFIX_NUM];
143   for (i = 0; i < PREFIX_NUM; ++i)
144     prefix[i] = "\0";
145   FILE *in = fopen("article.txt", "r");
146   build(prefix, in);
147   fclose(in);
148   /*末尾处加(end)*/
149   add(prefix, "(end)");
150   FILE *out = fopen("markov.txt", "w");
151   generate(nwords, out);
152   fclose(out);
153   return 0;
154 }