While working on question-answering research, I wanted to use deep learning to capture sentence semantics and compute the similarity between two questions. This calls for a dataset of similar questions, but such datasets are hard to obtain, especially high-quality ones. So the idea is to use today's best translation systems to construct the data.
Besides, machine translation is currently one of the most successful applications of NLP, and the reason deep learning works so well there is the huge amount of high-quality parallel corpora available.
Google Translate cannot be scraped with a plain request: the request needs a tk value, and the tk value is generated from the query sentence plus a tkk value.
Fetch the main page with requests, then a simple regex on the page text is enough to extract the tkk value (this step used to be done through a JS script; see the references below).
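A minimal sketch of this step, assuming the page still embeds a TKK='...'; assignment in its inline script (the same logic lives in get_tkk in the full class further down):

import re
import requests

headers = {'user-agent': 'Mozilla/5.0'}
page = requests.get('https://translate.google.cn', headers=headers)
# The tkk value is embedded in an inline script as TKK='xxxxx.yyyyy';
match = re.search(r"TKK='(.+?)';", page.text)
tkk = match.group(1) if match else None
print(tkk)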
Implemented via a JS script: gettk.js
// Bit-mixing helper used by the tk algorithm
var b = function (a, b) {
    for (var d = 0; d < b.length - 2; d += 3) {
        var c = b.charAt(d + 2),
            c = "a" <= c ? c.charCodeAt(0) - 87 : Number(c),
            c = "+" == b.charAt(d + 1) ? a >>> c : a << c;
        a = "+" == b.charAt(d) ? a + c & 4294967295 : a ^ c
    }
    return a
}

// Compute the tk value from the query string a and the page's TKK value
var tk = function (a, TKK) {
    //console.log(a,TKK);
    for (var e = TKK.split("."), h = Number(e[0]) || 0, g = [], d = 0, f = 0; f < a.length; f++) {
        var c = a.charCodeAt(f);
        128 > c ? g[d++] = c : (2048 > c ? g[d++] = c >> 6 | 192 : (55296 == (c & 64512) && f + 1 < a.length && 56320 == (a.charCodeAt(f + 1) & 64512) ? (c = 65536 + ((c & 1023) << 10) + (a.charCodeAt(++f) & 1023), g[d++] = c >> 18 | 240, g[d++] = c >> 12 & 63 | 128) : g[d++] = c >> 12 | 224, g[d++] = c >> 6 & 63 | 128), g[d++] = c & 63 | 128)
    }
    a = h;
    for (d = 0; d < g.length; d++) a += g[d], a = b(a, "+-a^+6");
    a = b(a, "+-3^+b+-f");
    a ^= Number(e[1]) || 0;
    0 > a && (a = (a & 2147483647) + 2147483648);
    a %= 1E6;
    return a.toString() + "." + (a ^ h)
}
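From Python this script can be evaluated with the PyExecJS package (this is what get_tk does in the class below); a minimal sketch, where the TKK value is just a made-up placeholder:

import execjs  # PyExecJS

# Compile gettk.js and call its tk(query, TKK) function;
# the TKK value below is a placeholder, not a real one.
ctx = execjs.compile(open('gettk.js').read())
tk = ctx.call('tk', 'how are you', '434674.96463358')
print(tk)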
Plug the tk value and the sentence to be translated into the request URL; the two directions differ only in the source/target language parameters:
Chinese → English: sl=zh-CN&tl=en
English → Chinese: sl=en&tl=zh-CN
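For reference, the full request URL as assembled by get_last_url in the code below looks like this (for English → Chinese swap the sl/tl values; <tk> and <url-encoded sentence> are filled in per query):

https://translate.google.cn/translate_a/single?client=t&sl=zh-CN&tl=en&hl=zh-CN&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&ie=UTF-8&oe=UTF-8&source=btn&ssel=3&tsel=3&kc=0&tk=<tk>&q=<url-encoded sentence>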
The program adds a simple check to decide whether the input is Chinese or English.
The response is a nested, three-level list; list[0][0][0] is the translation we need.
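For illustration, roughly how the translation is pulled out of that nested list (the literal below only mimics the shape of a real response; its content is made up):

import json

# Illustrative only: a response shaped like the real one (content made up)
response_text = '[[["你好吗","how are you",null,null,1]],null,"en"]'
data = json.loads(response_text)
print(data[0][0][0])  # the translated text lives at [0][0][0]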
import requests
import urllib.parse
import re
import json
import execjs


class Goole_translate():
    def __init__(self):
        self.url_base = 'https://translate.google.cn'
        self.headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'}
        self.get_tkk()

    def get_tkk(self):
        '''Fetch the main page and extract the tkk value with a regex.'''
        page = requests.get(self.url_base, headers=self.headers)
        tkks = re.findall(r"TKK='(.+?)';", page.text)
        if tkks:
            self.tkk = tkks[0]
            return self.tkk
        else:
            raise Exception('tkk not found')

    def translate(self, query_string):
        last_url = self.get_last_url(query_string)
        response = requests.get(last_url, headers=self.headers)
        if response.status_code != 200:
            # The tkk may have expired; refresh it and retry once
            self.get_tkk()
            last_url = self.get_last_url(query_string)
            response = requests.get(last_url, headers=self.headers)
        content = response.content  # bytes
        text = content.decode()     # str; these two steps can be replaced by text = response.text
        dict_text = json.loads(text)  # the response body is JSON
        result = dict_text[0][0][0]
        return result

    def get_tk(self, query_string):
        '''Compute the tk value by calling the tk function in gettk.js.'''
        tem = execjs.compile(open(r"gettk.js").read())
        tk = tem.call('tk', query_string, self.tkk)
        return tk

    def query_string(self, query_string):
        '''URL-encode the query string as UTF-8; strings that are already URL-safe do not need this.'''
        query_url_trans = urllib.parse.quote(query_string)
        return query_url_trans

    def get_last_url(self, query_string):
        '''Build the final request URL, choosing the language direction automatically.'''
        url_parm = 'sl=en&tl=zh-CN'
        for uchar in query_string:
            if uchar >= u'\u4e00' and uchar <= u'\u9fa5':  # contains Chinese characters
                url_parm = 'sl=zh-CN&tl=en'
                break
        url_param_part = self.url_base + "/translate_a/single?"
        url_param = url_param_part + "client=t&" + url_parm + "&hl=zh-CN&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&ie=UTF-8&oe=UTF-8&source=btn&ssel=3&tsel=3&kc=0&"
        url_get = url_param + "tk=" + str(self.get_tk(query_string)) + "&q=" + str(self.query_string(query_string))
        return url_get


if __name__ == "__main__":
    query_string = 'how are you'
    gt = Goole_translate()
    en = gt.translate(query_string)
    print(en)
Frequent requests may get the client blocked (I have not tested this); you can add a delay between requests or use an IP proxy.
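A minimal sketch of both measures, using requests' standard proxies parameter; the proxy address here is only a placeholder:

import time
import random
import requests

def polite_get(url, headers=None):
    # Random delay between requests to reduce the chance of being blocked
    time.sleep(random.uniform(1, 5))
    # Route the request through a proxy; the address below is a placeholder
    proxies = {'https': 'http://127.0.0.1:1080'}
    return requests.get(url, headers=headers, proxies=proxies, timeout=10)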
References:
http://www.cnblogs.com/by-dream/p/6554340.html
https://blog.csdn.net/boyheroes/article/details/78681357
Baidu Translate: automatically detect whether the input is Chinese or English, then fetch the translation result.
# coding=utf-8
import requests
import json


class Baidu_translate():
    def __init__(self):
        self.headers = {"User-Agent": "Mozilla/5.0 (Linux; Android 5.1.1; Nexus 6 Build/LYZ28E) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Mobile Safari/537.36"}
        self.lang_detect_url = "https://fanyi.baidu.com/langdetect"
        self.trans_url = "https://fanyi.baidu.com/basetrans"

    def get_lang(self, query_string):
        '''Detect the language of the query automatically.'''
        data = {'query': query_string}
        response = requests.post(self.lang_detect_url, data=data, headers=self.headers)
        return json.loads(response.text)['lan']

    def translate(self, query_string):
        '''Translate zh -> en or en -> zh depending on the detected language.'''
        lang = self.get_lang(query_string)
        data = {"query": query_string, "from": "zh", "to": "en"} if lang == "zh" else {"query": query_string, "from": "en", "to": "zh"}
        response = requests.post(self.trans_url, data=data, headers=self.headers)
        result = json.loads(response.text)["trans"][0]["dst"]
        return result


if __name__ == '__main__':
    query_string = 'how are you'
    bt = Baidu_translate()
    en = bt.translate(query_string)
    print(en)
Reference: https://blog.csdn.net/blues_f/article/details/79319461
Translating Chinese → English → Chinese produces a corpus of similar question pairs.
'''
Generate paraphrases of Chinese questions via round-trip translation, to build a
similar-question-pair dataset that can be used to train semantic similarity models.
Google Translate: Chinese -> English
Baidu Translate:  English -> Chinese
Note: no IP proxy is configured; beware of being blocked if you run this heavily
(only a simple random delay is used).
'''
import time
import random
from goole_trans import Goole_translate
from baidu_trans import Baidu_translate

gt = Goole_translate()
bt = Baidu_translate()

r_file = 'data/zh.txt'
w_file = 'data/zh_en_zh.txt'

fw = open(w_file, 'w', encoding='utf8')
with open(r_file, 'r', encoding='utf8') as f:
    for line in f:
        r = random.random() * 10
        time.sleep(r)
        ls = line.strip().split('\t')
        query_string = ls[0]
        g_en = gt.translate(query_string)
        b_zh = bt.translate(g_en)
        fw.write(query_string + '\t' + g_en + '\t' + b_zh + '\n')
        print('q_zh:', query_string)
        print('g_en:', g_en)
        print('b_zh:', b_zh)
        print('\n')
fw.close()
q_zh: 下周有什么好产品?
g_en: What are the good products next week?
b_zh: 下周的好产品是什么?
------------
q_zh: 第一次使用,额度多少?
g_en: What is the amount of the first use?
b_zh: 第一次使用的数量是多少?
------------
q_zh: 我何时能够经过微粒贷借钱
g_en: When can I borrow money from micro-credit?
b_zh: 我何时能够从小额信贷中借钱?
------------
q_zh: 借款后多长时间给打电话
g_en: How long does it take to make a call after borrowing?
b_zh: 借钱后打电话须要多长时间?
------------
q_zh: 没看到微粒贷
g_en: Didn't see the micro-credit
b_zh: 没有看到小额信贷
------------
q_zh: 原来的手机号不用了,怎么换
g_en: The original mobile phone number is not used, how to change
b_zh: 原来的手机号码没有用,怎么改
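To turn the written zh_en_zh.txt into training pairs for a similarity model, a minimal sketch; the file layout (original \t English \t back-translation) follows the write above, and pairing the first and third columns is my own suggestion:

# Read data/zh_en_zh.txt and keep (original, back-translation) as candidate similar-question pairs
pairs = []
with open('data/zh_en_zh.txt', 'r', encoding='utf8') as f:
    for line in f:
        cols = line.rstrip('\n').split('\t')
        if len(cols) == 3 and cols[0] != cols[2]:  # drop pairs where the back-translation is identical
            pairs.append((cols[0], cols[2]))
print(len(pairs), 'similar-question pairs')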
Neither program sets up an IP proxy, so beware of being blocked if you make frequent requests (only a simple random delay is used). If there are updates later, see the GitHub repo.