Periodically Fetching a File over HTTP with Python

Category: Python

2007-07-26 18:56

The script below can be extended into a simple fetching tool that polls a URL on a timer.

#!/usr/bin/python
# Python 2 script: poll a URL every 10 seconds using a conditional GET.
import urllib2, time

class ErrorHandler(urllib2.HTTPDefaultErrorHandler):
    # Return non-2xx responses (including 304) as objects instead of raising.
    def http_error_default(self, req, fp, code, msg, headers):
        result = urllib2.HTTPError(req.get_full_url(), code, msg, headers, fp)
        result.status = code
        return result

URL = 'http://www.ibm.com/developerworks/js/ajax1.js'
req = urllib2.Request(URL)
mgr = urllib2.build_opener(ErrorHandler())
modified = None  # Last-Modified value from the most recent response, if any

while True:
    ns = mgr.open(req)
    if 'last-modified' in ns.headers:
        modified = ns.headers.get('last-modified')
    if ns.code == 304:
        print '''
==============================
    NOT MODIFIED
==============================
'''
    elif ns.code == 200:
        print ns.read()
    else:
        print 'there is an error'
    if modified is None:
        # No Last-Modified header was sent: fall back to the current time,
        # formatted as an HTTP date string rather than a raw float timestamp.
        modified = time.strftime('%a, %d %b %Y %H:%M:%S GMT', time.gmtime())
    req.add_header('If-Modified-Since', modified)
    time.sleep(10)
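
The script above targets Python 2 (`urllib2`). In Python 3 that module was split into `urllib.request` and `urllib.error`, so the same conditional GET idea might be sketched roughly as follows. The helper names `build_request`, `fetch_if_modified`, and `poll`, as well as the 30-second timeout, are assumptions introduced here for illustration and are not part of the original post.

```python
import time
import urllib.error
import urllib.request


def build_request(url, last_modified=None):
    """Return a GET request, made conditional when a Last-Modified value is known."""
    req = urllib.request.Request(url)
    if last_modified:
        # Send the server's own Last-Modified string back unchanged.
        req.add_header("If-Modified-Since", last_modified)
    return req


def fetch_if_modified(url, last_modified=None, opener=None):
    """Fetch url once and return (status, body, last_modified).

    body is None when the server answers 304 Not Modified.
    """
    opener = opener or urllib.request.build_opener()
    try:
        # The 30-second timeout is an assumed value, not from the original script.
        with opener.open(build_request(url, last_modified), timeout=30) as resp:
            newest = resp.headers.get("Last-Modified", last_modified)
            return resp.status, resp.read(), newest
    except urllib.error.HTTPError as err:
        # By default urllib raises on 304, so translate it into a normal result.
        if err.code == 304:
            return 304, None, last_modified
        raise


def poll(url, interval=10):
    """Print the body whenever the resource changes; runs until interrupted."""
    last_modified = None
    while True:
        try:
            status, body, last_modified = fetch_if_modified(url, last_modified)
        except urllib.error.URLError as err:
            print("there is an error:", err)
        else:
            if status == 304:
                print("NOT MODIFIED")
            else:
                print(body.decode("utf-8", errors="replace"))
        time.sleep(interval)
```

Calling `poll('http://www.ibm.com/developerworks/js/ajax1.js')` would mirror the loop in the original script. Unlike the original, this sketch catches the 304 exception instead of installing a custom error handler, and it only sends `If-Modified-Since` once the server has actually supplied a `Last-Modified` value.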

Tags: HTTP, Python