Python - BeautifulSoup filter function: multiple conditions - Stack Overflow

顺晟科技

2022-10-18 13:36:47

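I am writing a Python script that scrapes a web page and stores the information in a list. The page is not well structured, so the built-in filters do not work for me. I am therefore trying to write a custom filter function to pass to find_all().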

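What I want to filter looks like this:

<td id="td343_23" style=display:>text</td>

where:

"td343_23" is different for each element, so a regular expression is needed.

The "td" tag should be matched, but "th" tags with similar attributes should be excluded.

"style=display:" or "style" (no value) should be matched, but "style=display:none" should be excluded.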

What I did is:

import re
from bs4 import BeautifulSoup

def visibility(tag):
    displaynone = re.compile('(?!display:none)')
    idvalue = re.compile('td[0-9]+_[0-9]+')
    return bool(displaynone.search(tag.get('style'))) and tag.select('td') and bool(idvalue.search(tag.get('id')))

parsed = BeautifulSoup(html, 'html.parser')
filtered = parsed.find_all(visibility)

The code above does not work at all. Please tell me how to write the filter condition.

Any help would be appreciated. Thanks.

Edit 2021/09/21

The web page is somewhat private, so I cannot provide the URL. However, here is a sample of the HTML.

<html>
<div class="block">
<thread>
<tr>
<th id="td18_8" style>include</th>
<th id="td18_9" style="display:">include</th>
<th id="td18_10" style="display:none">exclude</th>
</tr>
</thread>
<thread>
<tr>
<th id="td19_8" style>include</th>
<th id="td19_9" style="display:">include</th>
<th id="td19_10" style="display:none">exclude</th>
</tr>
</thread>
</div>
</boxy>
<html>

顺晟科技：

Regex is not necessary for this; in principle it can also be done with a selector. Sample HTML:

<html>
<div class="block">
<thread>
<tr>
<th id="td18_8" style>include</th>
<th id="td18_9" style="display:">include</th>
<th id="td18_10" style="display:none">exclude</th></tr>
</thread>
<thread>
<tr>
<th id="td19_8" style>include</th>
<th id="td19_9" style="display:">include</th>
<th id="td19_10" style="display:none">exclude</th>
</tr>
</thread>
</div>
</boxy>
<html>
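
As a side note on why the function in the question fails: tag.get('style') returns None for tags without a style attribute, which makes search() raise a TypeError; the negative lookahead '(?!display:none)' finds an empty match somewhere in any string, so it never excludes anything; and tag.select('td') looks for td descendants instead of testing the tag itself. A minimal sketch of a repaired function-based filter follows. The sample markup, the constant names, and the decision to require a style attribute are assumptions made for illustration, not part of the original post.

```python
import re
from bs4 import BeautifulSoup

# Assumed patterns: the id format from the question, and a hidden style
# that tolerates optional whitespace around the colon.
ID_PATTERN = re.compile(r"td[0-9]+_[0-9]+")
HIDDEN_PATTERN = re.compile(r"display\s*:\s*none")


def visibility(tag):
    """Return True for <td> tags with a matching id that are not hidden."""
    # Test the tag's own name; select() would search its descendants.
    if tag.name != "td":
        return False
    # Assumption: only tags that carry a style attribute are wanted.
    if not tag.has_attr("style"):
        return False
    # Fall back to "" so the regex calls never receive None.
    tag_id = tag.get("id") or ""
    style = tag.get("style") or ""
    return bool(ID_PATTERN.fullmatch(tag_id)) and not HIDDEN_PATTERN.search(style)


# Hypothetical sample modelled on the markup described in the question.
html = """
<table>
  <tr>
    <th id="td18_8" style>header</th>
    <td id="td18_9" style>first</td>
    <td id="td18_10" style="display:">second</td>
    <td id="td18_11" style="display:none">hidden</td>
    <td id="other" style>no matching id</td>
  </tr>
</table>
"""

parsed = BeautifulSoup(html, "html.parser")
filtered = parsed.find_all(visibility)

for cell in filtered:
    print(cell["id"], cell.get_text())
```

With this sample, only the cells td18_9 and td18_10 should pass the filter.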

Working example:
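
A minimal sketch of the selector approach, assuming a BeautifulSoup version with CSS selector support through select(). The sample markup and the exact selector are illustrative assumptions: an attribute selector cannot express the full td[0-9]+_[0-9]+ pattern, so the id is only checked for the "td" prefix here, and a style written as "display: none" (with a space) would need an extra :not() clause.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modelled on the markup described in the question.
html = """
<table>
  <tr>
    <th id="td18_8" style>header</th>
    <td id="td18_9" style>first</td>
    <td id="td18_10" style="display:">second</td>
    <td id="td18_11" style="display:none">hidden</td>
    <td id="other" style>no matching id</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# td elements only, id starting with "td", a style attribute present,
# and no "display:none" substring in the style value.
visible_cells = soup.select('td[id^="td"][style]:not([style*="display:none"])')

for cell in visible_cells:
    print(cell["id"], cell.get_text())
```

With this sample, the loop should print td18_9 and td18_10; the th element, the hidden cell, and the cell whose id does not start with "td" are skipped.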

