Target: the articles in the "Latest Highlights" (最新看点) module of www.imgii.com

Preparation:
I. Python modules:
1. requests — HTTP requests
2. pymysql — MySQL database access
3. lxml — HTML parser (used here as the BeautifulSoup backend)
4. BeautifulSoup4 — HTML parsing

II. HTML analysis
1. Article list:

2. Article detail:

3. Pagination module:


Crawling approach
1. From the module's article list, collect each article's detail URL and cover image.
2. Visit each article detail page, analyze its pagination module, and collect the URLs of the article's pages.
3. Assemble the pages in order, fetch each page's content, and merge it into one article body.
4. Save the article title and content to the database.
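The fetch-and-merge part of these steps can be dry-run with a stubbed fetcher; every URL and HTML fragment below is invented for illustration, and no network or database is involved:

```python
# Miniature dry run of steps 2-3: the article's first page plus its
# pagination links, fetched in order and concatenated.
FAKE_PAGES = {
    'http://www.imgii.com/post/1.html':   '<p>page 1</p>',
    'http://www.imgii.com/post/1_2.html': '<p>page 2</p>',
}

def fetch(url):
    # stand-in for the real HTTP request
    return FAKE_PAGES[url]

def merge_article(first_page, extra_pages):
    urls = [first_page] + extra_pages   # keep the pages in reading order
    return ''.join(fetch(u) for u in urls)

merged = merge_article('http://www.imgii.com/post/1.html',
                       ['http://www.imgii.com/post/1_2.html'])
print(merged)  # <p>page 1</p><p>page 2</p>
```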


Implementation
Basic configuration:

import pymysql
from bs4 import BeautifulSoup

import PomeloCommon
import PomeloRequest

conn = pymysql.connect(host='47.107.243.191', user='root', password='hobi2018', database='Hobi-SpiderData', charset='utf8')
cursor = conn.cursor()

website = 'http://www.imgii.com'
host = 'www.imgii.com'
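The INSERT statement used later writes to a WebArticleContent table whose definition is not shown in the post. A plausible schema matching those columns (only the column names come from the INSERT; the types and the Id column are my assumption):

```sql
-- Assumed schema; adjust types to match the real table.
CREATE TABLE WebArticleContent (
    Id         INT AUTO_INCREMENT PRIMARY KEY,
    WebSiteId  INT NOT NULL,            -- 1 = www.imgii.com in this crawler
    Title      VARCHAR(255) NOT NULL,
    Type       CHAR(2) NOT NULL,        -- '01' in this crawler
    Content    MEDIUMTEXT,              -- merged HTML of all the article's pages
    CreateTime DATETIME NOT NULL
) DEFAULT CHARSET = utf8;
```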

1. Fetching the article list URLs
main.py

def getWebArticleList():
    response = PomeloRequest.getRequest(website, host)
    content = BeautifulSoup(response, 'lxml').select('#new_pic a')

    urlList = []
    for article in content:
        url = article.get('href')
        urlList.append(url)
    return urlList
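For readers without lxml installed, the same `#new_pic a` extraction can be illustrated with only the standard library; the sample HTML below is invented, and the parser assumes `#new_pic` is a div:

```python
from html.parser import HTMLParser

class NewPicLinks(HTMLParser):
    """Collect href values of <a> tags inside the element with id="new_pic"."""
    def __init__(self):
        super().__init__()
        self.depth = 0    # > 0 while we are inside #new_pic
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth == 0:
            if attrs.get('id') == 'new_pic':
                self.depth = 1
        else:
            if tag == 'div':
                self.depth += 1   # track nested divs so we know when to leave
            if tag == 'a' and 'href' in attrs:
                self.links.append(attrs['href'])

    def handle_endtag(self, tag):
        if self.depth and tag == 'div':
            self.depth -= 1

parser = NewPicLinks()
parser.feed('<div id="new_pic">'
            '<a href="http://www.imgii.com/a.html">a</a>'
            '<a href="http://www.imgii.com/b.html">b</a>'
            '</div>'
            '<a href="/elsewhere.html">x</a>')
print(parser.links)  # only the two links inside #new_pic
```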

2. Fetching the article detail content

def getArticleContent():
    articleUrl = getWebArticleList()
    for url in articleUrl:
        print(url)
        response = PomeloRequest.getRequest(url, host)

        html = BeautifulSoup(response, 'lxml')
        # article title
        title = html.select('.entry-title_3')[0].get_text()

        # pagination handling: returns the pagination links, if any
        pageInfo = html.select('.fenye a')
        pageUrl = pageInfoHandle(pageInfo)
        # prepend the article's first page
        pageUrl.insert(0, url)

        # fetch every page and merge its content
        content = ''
        for item in pageUrl:
            print(item)
            new_content = getArticlePageContent(item)
            content = str(content) + str(new_content)
        try:
            data = (title, content)
            sql = "insert into WebArticleContent(WebSiteId,Title,Type,Content,CreateTime) values(1,%s,'01',%s,now())"
            cursor.execute(sql, data)
            conn.commit()
        except Exception as e:
            # log the actual error instead of swallowing it silently
            print('insert failed:', e)
            conn.rollback()

    cursor.close()
    conn.close()


def getArticlePageContent(url):
    response = PomeloRequest.getRequest(url, host)
    html = BeautifulSoup(response, 'lxml')
    # the page's individual paragraphs
    content = html.select('.entry-content p')

    # merge the paragraphs into this page's full content
    new_content = ''
    for item in content:
        new_content = str(new_content) + str(item)

    return new_content
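The string-accumulation loop above works, but `''.join` does the same merge in one step (plain strings are used here for illustration; with BeautifulSoup it would be `''.join(str(p) for p in content)`):

```python
# equivalent of the += loop in getArticlePageContent
paragraphs = ['<p>part one</p>', '<p>part two</p>', '<p>part three</p>']
page_content = ''.join(paragraphs)
print(page_content)  # <p>part one</p><p>part two</p><p>part three</p>
```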

3. Pagination handling

def pageInfoHandle(pageInfo):
    urlList = []
    for item in pageInfo:
        urlList.append(item.get('href'))
    # listDistinct de-duplicates the pagination links (helper not shown here)
    return PomeloCommon.listDistinct(urlList)
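`PomeloCommon.listDistinct` is not shown in the post. Assuming it is an order-preserving de-duplication (order matters, because the pages must stay in reading sequence), a one-line stdlib equivalent:

```python
def list_distinct(items):
    # dict keys keep insertion order (Python 3.7+), so duplicates are
    # dropped without disturbing the page sequence
    return list(dict.fromkeys(items))

pages = ['/post/1_2.html', '/post/1_3.html', '/post/1_2.html']
print(list_distinct(pages))  # ['/post/1_2.html', '/post/1_3.html']
```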

4. Entry point

if __name__ == "__main__":
    getArticleContent()

Shared helper functions
PomeloRequest.py

import requests


def getRequest(website, host):
    headers = {
        "Host": host,
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36",
        "Referer": host
    }
    # an explicit timeout keeps the crawler from hanging on a dead connection
    response = requests.get(website, headers=headers, timeout=10)

    return response.text
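A hardened sketch of this helper, with retries and an explicit status check; the retry count, backoff, and full-URL Referer are my choices, not from the post:

```python
import time
import requests

def build_headers(host):
    # same header set as getRequest; the Referer is sent as a full URL
    # here, which is what the header formally expects
    return {
        "Host": host,
        "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/76.0.3809.100 Safari/537.36"),
        "Referer": "http://" + host,
    }

def get_request(url, host, timeout=10, retries=3):
    """Fetch url, retrying transient failures; raises on a final failure."""
    for attempt in range(retries):
        try:
            response = requests.get(url, headers=build_headers(host),
                                    timeout=timeout)
            response.raise_for_status()  # fail loudly on 4xx/5xx
            return response.text
        except requests.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(1 + attempt)  # simple linear backoff between retries
```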

Results