Goal: scrape the articles in the "最新看点" (latest highlights) module of www.imgii.com
Preparation:
I. Python modules:
1. requests — HTTP request library
2. pymysql — MySQL database access
3. lxml — HTML parser (used here as BeautifulSoup's backend)
4. BeautifulSoup4 — HTML parsing
II. HTML analysis:
1. Article list
2. Article detail page
3. Pagination module
Crawling approach
1. From the list module, collect each article's detail-page URL and cover image
2. Visit each detail page, analyze its pagination module, and collect the URLs of the article's pages
3. Assemble the pages in order, fetch each page's content, and merge it
4. Save the article title and content to the database
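The four steps above can be sketched as a small pipeline with the network and database calls stubbed out. The `fetch_*` helpers here are hypothetical stand-ins for the real scraping code shown later, and the sample URLs are made up:

```python
# Minimal sketch of the four-step pipeline; fetch_list/fetch_pages/
# fetch_content are hypothetical stubs, not the real scraper.

def fetch_list():
    # step 1: would return (detail_url, cover_url) pairs from the list module
    return [("http://www.imgii.com/a1.html", "http://www.imgii.com/a1.jpg")]

def fetch_pages(detail_url):
    # step 2: would parse the detail page's pagination block
    return [detail_url, detail_url + "?page=2"]

def fetch_content(page_url):
    # step 3: would extract the article body of one page
    return "<p>content of %s</p>" % page_url

def crawl():
    articles = []
    for detail_url, cover in fetch_list():
        pages = fetch_pages(detail_url)                 # kept in order
        body = "".join(fetch_content(u) for u in pages) # merge all pages
        articles.append((detail_url, body))             # step 4 would INSERT here
    return articles
```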
Code implementation
Basic configuration:
import pymysql
from bs4 import BeautifulSoup

import PomeloRequest
import PomeloCommon

conn = pymysql.connect(host='47.107.243.191', user='root', password='hobi2018',
                       database='Hobi-SpiderData', charset='utf8')
cursor = conn.cursor()
website = 'http://www.imgii.com'
host = 'www.imgii.com'
1. Fetching the article-list URLs
main.py
def getWebArticleList():
    response = PomeloRequest.getRequest(website, host)
    content = BeautifulSoup(response, 'lxml').select('#new_pic a')
    urlList = []
    for article in content:
        url = article.get('href')
        urlList.append(url)
    return urlList
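What `select('#new_pic a')` does, collect the `href` of every `<a>` inside the element with `id="new_pic"`, can be sketched with the standard library alone, which is handy for testing the extraction logic without BeautifulSoup installed. The sample HTML below is assumed, not taken from the real site:

```python
from html.parser import HTMLParser

# Stdlib-only sketch of extracting hrefs from inside #new_pic;
# the sample HTML is an assumption for illustration.

class NewPicLinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.inside = 0          # tag depth inside the #new_pic container
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.inside:
            self.inside += 1
            if tag == "a" and "href" in attrs:
                self.links.append(attrs["href"])
        elif attrs.get("id") == "new_pic":
            self.inside = 1

    def handle_endtag(self, tag):
        if self.inside:
            self.inside -= 1

sample = '<div id="new_pic"><a href="/a1.html">a1</a><a href="/a2.html">a2</a></div>'
parser = NewPicLinkParser()
parser.feed(sample)
print(parser.links)  # → ['/a1.html', '/a2.html']
```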
2. Fetching the article detail content
def getArticleContent():
    articleUrl = getWebArticleList()
    for url in articleUrl:
        print(url)
        response = PomeloRequest.getRequest(url, host)
        html = BeautifulSoup(response, 'lxml')
        # title
        title = html.select('.entry-title_3')[0].get_text()
        # pagination: if the article has extra pages, collect their links
        pageInfo = html.select('.fenye a')
        pageUrl = pageInfoHandle(pageInfo)
        # prepend the article's first-page link
        pageUrl.insert(0, url)
        # fetch and merge the content of every page
        content = ''
        for item in pageUrl:
            print(item)
            new_content = getArticlePageContent(item)
            content = str(content) + str(new_content)
        try:
            data = (title, content)
            sql = "insert into WebArticleContent(WebSiteId,Title,Type,Content,CreateTime) values(1,%s,'01',%s,now())"
            cursor.execute(sql, data)
            conn.commit()
        except Exception as e:
            print('insert failed:', e)
            conn.rollback()
    cursor.close()
    conn.close()
    return content

def getArticlePageContent(url):
    response = PomeloRequest.getRequest(url, host)
    html = BeautifulSoup(response, 'lxml')
    # individual paragraphs of one page
    content = html.select('.entry-content p')
    # merge the paragraphs into the full content of this page
    new_content = ''
    for item in content:
        new_content = str(new_content) + str(item)
    return new_content
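The INSERT above binds `title` and `content` through `%s` placeholders rather than formatting them into the SQL string, which avoids both quoting bugs and SQL injection. The same pattern can be sketched with the standard-library `sqlite3` (which uses `?` where pymysql uses `%s`); the table layout follows the post's schema, and the sample title/content values are made up:

```python
import sqlite3

# Parameterized-insert sketch using stdlib sqlite3 (placeholder is ?
# instead of pymysql's %s); table layout mirrors WebArticleContent.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE WebArticleContent (
    WebSiteId INTEGER, Title TEXT, Type TEXT, Content TEXT, CreateTime TEXT)""")

title, content = "demo title", "<p>page 1</p><p>page 2</p>"
cur.execute(
    "INSERT INTO WebArticleContent(WebSiteId, Title, Type, Content, CreateTime) "
    "VALUES (1, ?, '01', ?, datetime('now'))",
    (title, content),   # values are bound, never interpolated into the SQL
)
conn.commit()

cur.execute("SELECT Title, Content FROM WebArticleContent")
row = cur.fetchone()
print(row)  # → ('demo title', '<p>page 1</p><p>page 2</p>')
```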
3. Pagination handling
def pageInfoHandle(pageInfo):
    urlList = []
    for item in pageInfo:
        urlList.append(item.get('href'))
    return PomeloCommon.listDistinct(urlList)
4. Entry point
if __name__ == "__main__":
    getArticleContent()
Shared helper functions
PomeloRequest.py
import requests

def getRequest(website, host):
    headers = {
        "Host": host,
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36",
        "Referer": host
    }
    response = requests.get(website, headers=headers)
    return response.text
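If `requests` is unavailable, the same headers can be attached with the standard-library `urllib.request`. Constructing the `Request` object does not touch the network (`urlopen(req)` would perform the actual GET), so the header setup can be verified offline; this is a stdlib alternative, not the post's code:

```python
import urllib.request

# Stdlib alternative to the requests-based getRequest; building the
# Request only prepares headers, urlopen(req) would send the GET.
def buildRequest(website, host):
    headers = {
        "Host": host,
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36",
        "Referer": host,
    }
    return urllib.request.Request(website, headers=headers)

req = buildRequest("http://www.imgii.com", "www.imgii.com")
print(req.get_header("Host"))  # → www.imgii.com
```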
Result