
Python bs4 HTML content makes no sense - Stack Overflow

顺晟科技

2022-10-19 12:15:36


I have this function that searches product listings on a second-hand marketplace website.

Sometimes the function fails because the HTML it operates on is different. For some products the HTML is fine, while for other listings on the same page the HTML is completely different, as if it hadn't loaded. Yet in my browser, both products show a similar HTML structure.

import requests
from bs4 import BeautifulSoup


def cj_search(location='portugal', search_term='gtx+1080'):
    headers = {'user-agent': 'Mozilla/5.0'}
    page_num = 1
    while True:
        page = requests.get(f'https://www.custojusto.pt/{location}/q/{search_term}?o={page_num}&sp=1&st=a', headers=headers)
        soup = BeautifulSoup(page.text, 'lxml')
        products = soup.find_all('div', class_='container_related')

        # Stop once a page comes back with no listings
        if not products:
            break

        for product in products:
            print(product)
            # Get the data
            product_name = product.find('h2', class_='title_related').find('b').text
            product_price = float(product.find('h5', class_='price_related').text.strip()[:-2])
            product_link = product.find('a')['href']
            print(f'Name: {product_name}\nPrice: {product_price}€\nLink: {product_link}\n')

        # Advance to the next results page (incrementing this inside the
        # for-loop above would skip pages, one per product printed)
        page_num += 1
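Since the reported failure mode is listings whose expected tags are simply missing, one way to keep the loop from crashing is to check each `find` result before dereferencing it. The sketch below reuses the class names from the question, but the sample HTML and the `parse_listing` helper are illustrative, not part of the original post:

```python
from bs4 import BeautifulSoup

HTML = """
<div class="container_related">
  <h2 class="title_related"><b>GTX 1080</b></h2>
  <h5 class="price_related"> 250 € </h5>
  <a href="/item/1">link</a>
</div>
<div class="container_related">
  <!-- incomplete listing: no title or price tags -->
  <a href="/item/2">link</a>
</div>
"""

def parse_listing(product):
    """Return (name, price, link), or None if the markup is incomplete."""
    title = product.find('h2', class_='title_related')
    price = product.find('h5', class_='price_related')
    link = product.find('a')
    # Skip listings whose HTML did not load the expected tags
    if not (title and title.find('b') and price and link):
        return None
    return (title.find('b').text,
            float(price.text.strip()[:-2]),
            link['href'])

soup = BeautifulSoup(HTML, 'html.parser')
results = [parse_listing(p)
           for p in soup.find_all('div', class_='container_related')]
```

Here the second, incomplete listing parses to `None` instead of raising `AttributeError`, so the surrounding loop can simply skip it.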

Most of the time I get HTML like this:


And at other times I get this HTML, which often shows up in the middle/end of the product listings:


Edit (tried Selenium)

To create the soup, I also tried the following approach, but it did not load the same HTML either:


Answer:

The id of the second div suggests it is JavaScript-generated content. BeautifulSoup will let you parse the HTML, but that is no help when you also have to deal with content generated by JavaScript.

Have you tried using Selenium? It is a very powerful module that runs a browser for you. You can essentially remote-control a browser from code, and use it both for testing and for web scraping. That lets you interact with the page itself, just as you would in a browser.

Edit:

I was mistaken: there is actually no JS-generated content in that div. But if necessary, you can skip it, and you can indeed do that with BeautifulSoup, by looking for a partial id or class that matches a regex. So import `re` and change your function. The key is to add a snippet that finds the divs causing your problem.

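The advice above (import `re`, match a partial id or class) can be sketched as follows. The `^ad_` pattern and the sample HTML are placeholders for whatever div is actually breaking the parsing:

```python
import re
from bs4 import BeautifulSoup

HTML = """
<div class="container_related">listing one</div>
<div id="ad_banner_123">problematic div</div>
<div class="container_related">listing two</div>
"""

soup = BeautifulSoup(HTML, 'html.parser')

# find_all accepts a compiled regex as an attribute filter,
# so a partial id is enough to locate the offending divs
for bad in soup.find_all('div', id=re.compile(r'^ad_')):
    bad.decompose()  # remove the div (and its children) from the tree

# The remaining tree contains only the listings you can actually parse
products = soup.find_all('div', class_='container_related')
```

After `decompose()`, the problematic divs are gone from the tree, so the normal `container_related` loop never sees them.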