1. Common scraper exceptions and how to handle them, explained with a simple scraper. Key points:
(1) Exception 1: the page does not exist on the server (or an error occurs while fetching it). When this happens, the program returns an HTTP error, such as "404 Page Not Found" or "500 Internal Server Error".
(2) Exception 2: the server itself does not exist (i.e., the link cannot be opened, or the URL is mistyped); in that case, urlopen returns a None object.
P.S.: Sometimes the page is fetched from the server successfully, yet an exception still occurs because the page's content is not quite what we expected.
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

try:
    html = urlopen("http://pythonscraping.com/pages/page1.html")
    # print(html.read())
# Check: does the page exist on the server (or did fetching it raise an error)?
except HTTPError as e:
    print(e)
else:
    bsobj = BeautifulSoup(html.read(), "html.parser")
    # Check: does the server exist (i.e., can the link be opened, or is the URL mistyped)?
    if html is None:
        print("URL is not found")
    else:
        print(bsobj.h1)
        # print(bsobj.title)
# The code above, rewritten so the exception checks are more complete and more readable:
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    try:
        bsobj = BeautifulSoup(html.read(), "html.parser")
        title = bsobj.body.h1
    except AttributeError as e:
        return None
    return title

title1 = getTitle("http://pythonscraping.com/pages/page1.html")
if title1 is None:
    print("Title could not be found")
else:
    print(title1)
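Worth noting: in current Python 3, a hostname that cannot be resolved (the "server does not exist" case) actually makes urlopen raise urllib.error.URLError rather than quietly yielding None. A hedged sketch of catching both failure modes (the function name fetch is my own, not from the original):

```python
# Sketch only: fetch() is a hypothetical helper, not part of the original code.
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def fetch(url):
    try:
        html = urlopen(url)
    except HTTPError as e:   # page missing on a live server (404, 500, ...)
        return "HTTPError: %d" % e.code
    except URLError as e:    # server itself unreachable, or hostname is wrong
        return "URLError: %s" % e.reason
    return html.read()
```

Because HTTPError is a subclass of URLError, the HTTPError clause must come first; swapping the two except clauses would send every 404 into the URLError branch.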
Running this code produced an error:
IndentationError: unexpected indent (process finished with exit code 1)
Googling showed that Tab and Space must not be mixed. In the original code, the line starting with def was indented with a tab; after deleting that tab indent, the problem was solved. The exact cause of this problem still needs to be investigated carefully!!!
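The tab/space conflict is easy to reproduce without a scraper at all. A minimal sketch (the source string below is made up for illustration): Python 3 refuses to compile a block whose lines are indented with a mix of tabs and spaces, raising TabError, a subclass of IndentationError.

```python
# The function body below indents its first line with a tab ("\t")
# and its second line with spaces; Python 3 rejects the mix.
src = "def f():\n\tx = 1\n        y = 2\n"
try:
    compile(src, "<demo>", "exec")
    result = "compiled"
except TabError as e:  # TabError is a subclass of IndentationError
    result = type(e).__name__
print(result)
```

This is why deleting the stray tab fixed the script: once every indent used the same character, the tokenizer no longer saw an ambiguous indentation level.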