Getting Started with the requests Library: Hands-On Practice
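- Scraping a JD.com product page
- Scraping an Amazon product page
- Submitting search keywords to Baidu/360
- Looking up the location of an IP address
- Downloading and saving an image from the web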
 
1. Scraping a JD.com product page
Example product: Huawei nova 3
    import requests

    def GetHTMLText(url):
        try:
            r = requests.get(url)
            r.raise_for_status()  # raise HTTPError for 4xx/5xx responses
            r.encoding = r.apparent_encoding  # guess the encoding from the body
            return r.text[:1000]
        except requests.RequestException:
            # Return the message so the caller does not print None
            return "Scrape failed"

    if __name__ == '__main__':
        url = "https://item.jd.com/30185690434.html"
        print(GetHTMLText(url))
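The try/except in the function above depends on raise_for_status() turning a bad status code into an exception. The following is a minimal sketch of that behavior, assuming a Response object built by hand so that no network access is needed; the helper name check_status is hypothetical and not part of requests.

```python
import requests

def check_status(status_code):
    """Return True if raise_for_status() accepts the code, False if it raises."""
    # Build a bare Response by hand; nothing is sent over the network.
    response = requests.Response()
    response.status_code = status_code
    try:
        response.raise_for_status()
        return True
    except requests.HTTPError:
        return False

print(check_status(200))
print(check_status(404))
```

Codes in the 4xx and 5xx ranges raise requests.HTTPError, which is a subclass of requests.RequestException, so a handler for the latter also catches connection errors and timeouts.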
 

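2. Scraping an Amazon product page
Some websites have anti-scraping mechanisms. Common strategies include:
- Blocking based on request headers
- Blocking based on user behavior
- Blocking through dynamically generated pages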
 
Reference example:

    import requests

    def GetHTMLText(url):
        try:
            # Send a browser-like User-Agent instead of the requests default
            headers = {"user-agent": "Mozilla/5.0"}
            r = requests.get(url, headers=headers)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
            return r.text[:1000]
        except requests.RequestException:
            # Return the message so the caller does not print None
            return "Scrape failed"

    if __name__ == '__main__':
        url = "https://www.amazon.cn/gp/product/B01M8L5Z3Y"
        print(GetHTMLText(url))
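To see why a header check can reject a default request, it may help to compare the User-Agent that requests would send with and without the override. This is only a sketch: it prepares the requests through a Session without sending them, and the variable names plain and spoofed are illustrative.

```python
import requests

url = "https://www.amazon.cn/gp/product/B01M8L5Z3Y"
session = requests.Session()

# Prepare (but do not send) a request that relies on the library defaults.
plain = session.prepare_request(requests.Request("GET", url))

# Prepare the same request with a browser-like User-Agent.
spoofed = session.prepare_request(
    requests.Request("GET", url, headers={"user-agent": "Mozilla/5.0"})
)

print(plain.headers["User-Agent"])    # identifies itself as python-requests
print(spoofed.headers["User-Agent"])
```

Header names are matched case-insensitively, so the lowercase user-agent key replaces the session's default User-Agent rather than adding a second header.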
 
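3. Submitting search keywords to Baidu/360
Pass the keyword through the params argument of requests.get; for Baidu the search endpoint takes it in the wd parameter.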
    import requests

    def Get(url):
        headers = {'user-agent': 'Mozilla/5.0'}
        key_word = {'wd': 'python'}
        try:
            r = requests.get(url, headers=headers, params=key_word)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
            print(r.request.url)  # the URL actually requested, query string included
            return r.text
        except requests.RequestException:
            return "Scrape failed"

    if __name__ == '__main__':
        url = "http://www.baidu.com/s"
        print(len(Get(url)))
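The way params ends up in the URL can be checked without sending anything. The sketch below prepares a request and reads back its URL; build_search_url is a hypothetical helper, and the 360 endpoint and its q parameter are assumptions that do not appear in the original code.

```python
import requests

def build_search_url(base_url, param_name, keyword):
    """Return the URL requests would fetch for the given query parameter."""
    request = requests.Request("GET", base_url, params={param_name: keyword})
    # prepare() encodes params into the query string; nothing is sent.
    return request.prepare().url

print(build_search_url("http://www.baidu.com/s", "wd", "python"))
# Assumed 360 endpoint and parameter name:
print(build_search_url("http://www.so.com/s", "q", "python"))
```

This is the same URL that r.request.url reports after a real call, which makes it a convenient way to confirm the parameter name before scraping.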
 

4. Looking up the location of an IP address
Use the IP138 query interface, appending the address to the end of the URL:
http://m.ip138.com/ip.asp?ip=ipaddress
    import requests

    url = "http://m.ip138.com/ip.asp?ip="
    ip = input()  # input() already returns a str
    try:
        r = requests.get(url + ip)
        r.raise_for_status()
        print(r.status_code)
        print(r.text[-500:])  # print the last 500 characters of the page
    except requests.RequestException:
        print("failed")
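Because the address is concatenated straight into the URL, it can be worth validating the input first. The following is a sketch under that assumption, using the standard ipaddress module; build_lookup_url is a hypothetical helper and is not part of the original script.

```python
import ipaddress

BASE_URL = "http://m.ip138.com/ip.asp?ip="

def build_lookup_url(raw_ip):
    """Return the lookup URL for raw_ip, or None if it is not a valid IP address."""
    try:
        # ip_address() accepts IPv4 and IPv6 strings and raises ValueError otherwise.
        address = ipaddress.ip_address(raw_ip.strip())
    except ValueError:
        return None
    return BASE_URL + str(address)

print(build_lookup_url("202.204.80.112"))
print(build_lookup_url("not-an-ip"))
```

The returned URL can then be passed to requests.get as in the script above. Whether the endpoint handles IPv6 addresses is not stated in the original, so treat that case as untested.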
 

5. Downloading and saving an image from the web
    import requests
    import os

    url = "http://n.sinaimg.cn/sinacn12/w495h787/20180315/1923-fyscsmv9949374.jpg"
    root = "C://Users/Administrator/Desktop/spider/first week/imgs/"
    path = root + url.split('/')[-1]  # name the file after the last URL segment

    try:
        if not os.path.exists(root):
            os.makedirs(root)  # also creates any missing parent directories
        if not os.path.exists(path):
            r = requests.get(url)
            r.raise_for_status()
            with open(path, 'wb') as f:  # the with block closes the file
                f.write(r.content)
            print("saved successfully!")
        else:
            print("file already exists!")
    except (requests.RequestException, OSError):
        print("spider fail")
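The script above can be wrapped into reusable functions. The sketch below is one possible variant, not the original author's method: it derives the file name with urlsplit so that a query string does not end up in the name, and it streams the body in chunks instead of holding it all in memory. The names filename_from_url and download_image, the 10-second timeout, and the chunk size are all assumptions.

```python
import os
from urllib.parse import urlsplit

import requests

def filename_from_url(url):
    """Take the last path segment of the URL, ignoring any query string."""
    return os.path.basename(urlsplit(url).path)

def download_image(url, root, chunk_size=8192):
    """Save the image at url under root and return the local path."""
    os.makedirs(root, exist_ok=True)
    path = os.path.join(root, filename_from_url(url))
    if os.path.exists(path):
        return path  # already downloaded, skip the request
    # stream=True defers the body download until iter_content() is consumed.
    with requests.get(url, stream=True, timeout=10) as r:
        r.raise_for_status()
        with open(path, "wb") as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)
    return path
```

A call such as download_image(url, root) with the url and root values from the script above would perform the same download; errors surface as requests.RequestException or OSError for the caller to handle.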
 

