
Python bs4 HTML content makes no sense - Stack Overflow

顺晟科技

2022-10-19 12:15:36


I have this function that searches product listings on a second-hand marketplace website.

Sometimes the function fails because the HTML it operates on is different. For some products the HTML is fine, while for other listings on the same page the HTML is completely different, as if it hadn't loaded. Yet in my browser, both products show a similar HTML structure.

import requests
from bs4 import BeautifulSoup


def cj_search(location='portugal', search_term='gtx+1080'):
    headers = {'user-agent': 'Mozilla/5.0'}
    page_num = 1
    while True:
        page = requests.get(f'https://www.custojusto.pt/{location}/q/{search_term}?o={page_num}&sp=1&st=a', headers=headers)
        soup = BeautifulSoup(page.text, 'lxml')
        products = soup.find_all('div', class_='container_related')

        # Stop once a page comes back with no listings
        if not products:
            break

        for product in products:
            print(product)
            # Get the data
            product_name = product.find('h2', class_='title_related').find('b').text
            product_price = float(product.find('h5', class_='price_related').text.strip()[:-2])
            product_link = product.find('a')['href']
            print(f'Name: {product_name}\nPrice: {product_price}€\nLink: {product_link}\n')

        # Advance to the next results page (incrementing this inside the
        # for-loop above would skip pages, one per product printed)
        page_num += 1
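Since the reported failure mode is listings whose expected tags are simply missing, one way to keep the loop from crashing is to check each `find` result before dereferencing it. The sketch below reuses the class names from the question, but the sample HTML and the `parse_listing` helper are illustrative, not part of the original post:

```python
from bs4 import BeautifulSoup

HTML = """
<div class="container_related">
  <h2 class="title_related"><b>GTX 1080</b></h2>
  <h5 class="price_related"> 250 € </h5>
  <a href="/item/1">link</a>
</div>
<div class="container_related">
  <!-- incomplete listing: no title or price tags -->
  <a href="/item/2">link</a>
</div>
"""

def parse_listing(product):
    """Return (name, price, link), or None if the markup is incomplete."""
    title = product.find('h2', class_='title_related')
    price = product.find('h5', class_='price_related')
    link = product.find('a')
    # Skip listings whose HTML did not load the expected tags
    if not (title and title.find('b') and price and link):
        return None
    return (title.find('b').text,
            float(price.text.strip()[:-2]),
            link['href'])

soup = BeautifulSoup(HTML, 'html.parser')
results = [parse_listing(p)
           for p in soup.find_all('div', class_='container_related')]
```

Here the second, incomplete listing parses to `None` instead of raising `AttributeError`, so the surrounding loop can simply skip it.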

Most of the time I get HTML like this:


And at other times I get this HTML, which often shows up in the middle/end of the product listings:


Edit (tried Selenium)

To create the soup, I also tried the following approach, but it did not load the same HTML either:


Answer:

The id of the second div suggests it is JavaScript-generated content. BeautifulSoup will let you parse the HTML, but that is no help when you also have to deal with content generated by JavaScript.

Have you tried using Selenium? It is a very powerful module that runs a browser for you. You can essentially remote-control a browser from code, and use it both for testing and for web scraping. That lets you interact with the page itself, just as you would in a browser.

Edit:

I was mistaken: there is actually no JS-generated content in that div. But if necessary, you can skip it, and you can indeed do that with BeautifulSoup, by looking for a partial id or class that matches a regex. So import `re` and change your function. The key is to add a snippet that finds the divs causing your problem.

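The advice above (import `re`, match a partial id or class) can be sketched as follows. The `^ad_` pattern and the sample HTML are placeholders for whatever div is actually breaking the parsing:

```python
import re
from bs4 import BeautifulSoup

HTML = """
<div class="container_related">listing one</div>
<div id="ad_banner_123">problematic div</div>
<div class="container_related">listing two</div>
"""

soup = BeautifulSoup(HTML, 'html.parser')

# find_all accepts a compiled regex as an attribute filter,
# so a partial id is enough to locate the offending divs
for bad in soup.find_all('div', id=re.compile(r'^ad_')):
    bad.decompose()  # remove the div (and its children) from the tree

# The remaining tree contains only the listings you can actually parse
products = soup.find_all('div', class_='container_related')
```

After `decompose()`, the problematic divs are gone from the tree, so the normal `container_related` loop never sees them.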