Python - Saving output to a DataFrame using BeautifulSoup - Stack Overflow

顺晟科技

2022-10-18 12:58:57

I'm new to web scraping. I'm trying to scrape data from a news site.

I have this code:

from bs4 import BeautifulSoup as soup
import pandas as pd
import requests

detik_url = "https://news.detik.com/indeks/2"
detik_url

html = requests.get(detik_url)

bsobj = soup(html.content, 'lxml')
bsobj

for link in bsobj.findAll("h3"):
  print("Headline : {}".format(link.text.strip()))

links = []
for news in bsobj.findAll('article',{'class':'list-content__item'}):
  links.append(news.a['href'])

for link in links:
  page = requests.get(link)
  bsobj = soup(page.content)
  div = bsobj.findAll('div',{'class':'detail__body itp_bodycontent_wrapper'})
  for p in div:
    print(p.find('p').text.strip())

How can I store the scraped content in a CSV file using a Pandas DataFrame?

顺晟科技：

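You can store the content in a Pandas DataFrame and then write that structure to a CSV file.

Suppose you want to save all the text from p.find('p').text.strip() in a CSV file together with a header. You can keep the header (the column name) in any variable, for example head.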
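As a minimal sketch of that idea, the snippet below builds a one-column DataFrame and writes it out as CSV. The column name "content" and the two placeholder strings are assumptions for illustration only; in the real script the list would hold the scraped paragraphs. It writes to an in-memory buffer so the result can be inspected, but a file name such as 'csv_name.csv' works the same way.

```python
import io

import pandas as pd

head = "content"  # hypothetical column name; any string works

# Placeholder data standing in for the scraped paragraph text
paragraphs = [
    "First paragraph of article one.",
    "First paragraph of article two.",
]

# One row per paragraph, one column named after `head`
df = pd.DataFrame(paragraphs, columns=[head])

# index=False keeps the row numbers out of the CSV
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```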

So, take this loop from your code. The change goes at the marked line:

for link in links:
  page = requests.get(link)
  bsobj = soup(page.content)
  div = bsobj.findAll('div',{'class':'detail__body itp_bodycontent_wrapper'})
  for p in div:                 # <----- Here we make the changes
    print(p.find('p').text.strip())

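Alternatively, instead of the for loop you can use a list comprehension and save straight to CSV.

# Instead of the snippet above, replace the whole "for p in div" loop.
# So, from your code above:
.....
bsobj = soup(page.content)
div = bsobj.findAll('div',{'class':'detail__body itp_bodycontent_wrapper'})
# Remove the whole "for p in div:" loop and use this instead:
df = pd.DataFrame([p.find('p').text.strip() for p in div], columns=[head])
....
df.to_csv('csv_name.csv', index=False)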
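One thing to watch with the list-comprehension version: if it stays inside the "for link in links" loop, df is rebuilt on every pass and to_csv overwrites the file each time, so only the last page survives. A possible way around this, sketched below, is a single nested comprehension over all pages. The inline HTML strings and the column name "content" are made up for illustration, and "html.parser" is used only so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup
import pandas as pd

BODY_CLASS = "detail__body itp_bodycontent_wrapper"  # class string from the question
head = "content"  # hypothetical column name

# Made-up HTML standing in for the content of each downloaded page
pages = [
    '<div class="detail__body itp_bodycontent_wrapper"><p> Story one. </p></div>',
    '<div class="detail__body itp_bodycontent_wrapper"><p>Story two.</p></div>',
]

# Outer loop: every page; inner loop: every matching body <div> on that page
texts = [
    div.find("p").text.strip()
    for html in pages
    for div in BeautifulSoup(html, "html.parser").find_all("div", {"class": BODY_CLASS})
]

# Build the DataFrame once, after all pages have been processed
df = pd.DataFrame(texts, columns=[head])
```

With real pages, the strings in pages would be replaced by the page.content of each response, followed by one df.to_csv call at the end.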

You can also collect the text from every page in a plain list, build the DataFrame from that list after the loop, and write it to the CSV file:

import pandas as pd

# Column header for the CSV; "content" is only an example name
head = "content"

generated_text = []  # list that collects the text from every page

for link in links:
  page = requests.get(link)
  bsobj = soup(page.content, 'lxml')
  div = bsobj.findAll('div',{'class':'detail__body itp_bodycontent_wrapper'})
  for p in div:
    # add a print statement here if you want to see the output
    generated_text.append(p.find('p').text.strip())  # <---- save the data in the list


# then write this into a csv file using pandas; first create a
# dataframe from the list

df = pd.DataFrame(generated_text, columns = [head])

# save this into a csv file

df.to_csv('csv_name.csv', index = False)
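The loop above raises AttributeError if a body div has no p tag, and it does not record which article each paragraph came from. The sketch below is one possible variation that handles both. The function name scrape_to_frame, the fetch parameter, the "url" column, and the example.com pages are all hypothetical, introduced only so the logic can run offline.

```python
from bs4 import BeautifulSoup
import pandas as pd

BODY_CLASS = "detail__body itp_bodycontent_wrapper"  # class string from the question


def scrape_to_frame(links, fetch, head="content"):
    """Return a DataFrame with one row per article body that has a <p>.

    `fetch` is a hypothetical callable that takes a URL and returns its HTML.
    """
    rows = []
    for link in links:
        page = BeautifulSoup(fetch(link), "html.parser")
        for div in page.find_all("div", {"class": BODY_CLASS}):
            p = div.find("p")
            if p is None:
                continue  # skip bodies without a <p> instead of raising
            rows.append({"url": link, head: p.text.strip()})
    return pd.DataFrame(rows, columns=["url", head])


# Offline demonstration: made-up HTML in place of real HTTP responses
fake_pages = {
    "https://example.com/a": '<div class="detail__body itp_bodycontent_wrapper"><p> First story. </p></div>',
    "https://example.com/b": '<div class="detail__body itp_bodycontent_wrapper"><span>No paragraph</span></div>',
}
df = scrape_to_frame(list(fake_pages), fake_pages.get)
```

With the real site, fetch could be something like lambda url: requests.get(url).content, and the result can be saved with df.to_csv('csv_name.csv', index=False) as before.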
