Mojibake When Scraping Chinese Web Pages with Python

With some time to kill, I wrote a little Python crawler to grab a page from 集思录 (jisilu.cn).

# -*- coding: utf-8 -*-

import requests

# Browser-like headers so the request looks like an ordinary visit
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6",
}

url = "https://www.jisilu.cn/home/explore/sort_type-new__day-0__page-1"
html = requests.get(url, headers=headers).text
print(html)

The Chinese text in the page it fetched came back as mojibake:

For example:
<a target="_blank" href="https://www.jisilu.cn/question/76907">è¡ç¾ä¸­çç¾æ°ï¼è¿æ¯äººåï¼ åçè¿ä¸æ®µè¯ï¼æ´»è±è±çåç§åï¼</a>
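This particular flavor of garbage usually means UTF-8 bytes were decoded as ISO-8859-1 (Latin-1), which is what requests tends to fall back to when the HTTP headers don't declare a charset. A quick sketch of my own to reproduce the effect (not code from the site):

# Sketch: mis-decoding UTF-8 bytes as ISO-8859-1 produces this kind of mojibake
s = u"集思录"
garbled = s.encode("utf-8").decode("iso-8859-1")      # decode with the wrong charset
print(garbled)                                        # garbage, like the link text above
print(garbled.encode("iso-8859-1").decode("utf-8"))   # undoing the mistake recovers 集思录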

Mojibake in a scraped page can usually be fixed with html.decode("<page encoding>").encode("utf-8"), so I opened the page source and looked at its fourth line:

<meta content="text/html;charset=utf-8" http-equiv="Content-Type" />

So jisilu.cn is already serving UTF-8 and no decode() should be needed at all... which left me rather stumped.
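For reference, the usual decode()/encode() routine mentioned above looks roughly like this, assuming a hypothetical page that really is served in GBK (a sketch only, not the code for this site):

raw = requests.get(url, headers=headers).content   # raw bytes from the server
text = raw.decode("gbk")                           # decode with the page's actual encoding
utf8_bytes = text.encode("utf-8")                  # re-encode as UTF-8 if bytes are needed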

After half an hour of fiddling, I tried changing html = requests.get(url, headers=headers).text to html = requests.get(url, headers=headers).content just to see what would happen, and the mojibake was gone.
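Spelled out, the change amounts to something like this (my reconstruction; on Python 2, printing the raw bytes to a UTF-8 terminal already displays them correctly, while on Python 3 you would decode them explicitly):

url = "https://www.jisilu.cn/home/explore/sort_type-new__day-0__page-1"
resp = requests.get(url, headers=headers)
html = resp.content                # raw bytes -- requests applies no guessed decoding
print(html.decode("utf-8"))        # decode with the charset the page itself declares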

The difference between .text and .content

The official requests documentation has this to say:

We can read the content of the server’s response. Consider the GitHub timeline again:

>>> import requests
>>> r = requests.get('https://api.github.com/events')
>>> r.text
u'[{"repository":{"open_issues":0,"url":"https://github.com/…

Requests will automatically decode content from the server. Most unicode charsets are seamlessly decoded.

When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property:

>>> r.encoding
'utf-8'
>>> r.encoding = 'ISO-8859-1'

If you change the encoding, Requests will use the new value of r.encoding whenever you call r.text. You might want to do this in any situation where you can apply special logic to work out what the encoding of the content will be. For example, HTTP and XML have the ability to specify their encoding in their body. In situations like this, you should use r.content to find the encoding, and then set r.encoding. This will let you use r.text with the correct encoding.

Requests will also use custom encodings in the event that you need them. If you have created your own encoding and registered it with the codecs module, you can simply use the codec name as the value of r.encoding and Requests will handle the decoding for you.

The gist is that .text guesses the page's encoding, and for most sites the guess is right; when it isn't, you can tell requests which encoding to use through the response's encoding attribute. So the code above could also be written as:

url = "https://www.jisilu.cn/home/explore/sort_type-new__day-0__page-1"
html = requests.get(url,headers=headers)
html.encoding = "utf-8" 
print html.text

.content, by contrast, is one level lower: it returns the raw bytes from the server without any decoding at all, so a wrong guess can never mangle them; you decode them yourself once you know the page's real encoding (or, on Python 2, simply print them to a UTF-8 terminal).
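To see what is going on under the hood, something like this is handy (a sketch; r.apparent_encoding is the encoding requests detects by sniffing the response body, as opposed to r.encoding, which is guessed from the headers):

r = requests.get(url, headers=headers)
print(type(r.content))            # bytes: the raw, undecoded payload
print(type(r.text))               # text (unicode), decoded using r.encoding
print(r.encoding)                 # guessed from the HTTP headers (likely ISO-8859-1 here)
print(r.apparent_encoding)        # detected from the body itself (should be utf-8 here)
r.encoding = r.apparent_encoding  # make r.text use the detected encoding
print(r.text[:200])               # the Chinese now displays correctly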
