I'm having problems dealing with Unicode characters in text fetched from different web pages (on different sites). I am using BeautifulSoup.
The problem is that the error is not always reproducible; it sometimes works with some pages, and sometimes, it barfs by throwing a UnicodeEncodeError. I have tried just about everything I can think of, and yet I have not found anything that works consistently without throwing some kind of Unicode-related error.
One of the sections of code that is causing problems is shown below:
agent_telno = agent.find('div', 'agent_contact_number')
agent_telno = '' if agent_telno is None else agent_telno.contents[0]
p.agent_info = str(agent_contact + ' ' + agent_telno).strip()
Here is a stack trace produced on SOME strings when the snippet above is run:
Traceback (most recent call last):
  File "foobar.py", line 792, in <module>
    p.agent_info = str(agent_contact + ' ' + agent_telno).strip()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
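For what it's worth, the character in the traceback, u'\xa0', is a non-breaking space, which is outside the ASCII range, so any attempt to run it through the ASCII codec fails this way. A minimal sketch reproducing the same error follows; note my snippet above is Python 2, where str() on a unicode value does an implicit ASCII encode, so the Python 3 equivalent below makes that encode explicit (the sample text is made up):

```python
# '\xa0' is NO-BREAK SPACE, code point 160 -- outside ASCII's 0-127 range,
# so encoding it with the 'ascii' codec always raises UnicodeEncodeError.
text = 'Contact: 020 7946 0000' + '\xa0'  # hypothetical scraped value

try:
    text.encode('ascii')
except UnicodeEncodeError as err:
    # Same error class and message shape as in the traceback above.
    print(err)
```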
I suspect that this is because some pages (or more specifically, pages from some of the sites) may be encoded, whilst others may be unencoded. All the sites are based in the UK and provide data meant for UK consumption - so there are no issues relating to internationalization or dealing with text written in anything other than English.
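To illustrate why differently-encoded pages could behave inconsistently: the same byte sequence decodes to different text depending on which codec is applied, so a non-breaking space may or may not appear depending on a page's encoding. A small sketch with a made-up byte string:

```python
# The same bytes yield different text under different codecs.
# b'\xc2\xa0' is a UTF-8-encoded NO-BREAK SPACE, but under Latin-1 it
# decodes to two characters: 'Â' followed by the non-breaking space.
raw = b'Tel:\xc2\xa020 7946 0000'  # hypothetical bytes from one of the pages

as_utf8 = raw.decode('utf-8')      # one NBSP between 'Tel:' and the number
as_latin1 = raw.decode('latin-1')  # 'Â' plus NBSP -- one character longer

print(len(as_utf8), len(as_latin1))
```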
Does anyone have any ideas as to how to solve this so that I can CONSISTENTLY fix this problem?