Configuring Cheerio
In this guide, we'll cover how to configure Cheerio to work with different types of documents, and how to use and configure the different parsers that ship with the library.
Parsing HTML with parse5
By default, Cheerio uses the parse5 parser for HTML documents. parse5 is an excellent project that rigorously conforms to the HTML standard. However, if you need to modify parsing options for HTML input, you may pass an extra object to .load():
const cheerio = require('cheerio');
const $ = cheerio.load('<noscript><h1>Nested Tag!</h1></noscript>', {
  scriptingEnabled: false,
});
For example, if you want the contents of <noscript> tags to be parsed as HTML, you can set the scriptingEnabled option to false.
For a full list of options and their effects, have a look at the API documentation.
Fragment Mode
By default, parse5 treats documents it receives as full HTML documents and will structure content in an <html> document element with nested <head> and <body> tags.
const $ = cheerio.load('<li>Apple</li><li>Banana</li>');
$.html(); // => '<html><head></head><body><li>Apple</li><li>Banana</li></body></html>'
parse5 also supports a "fragment mode" that allows you to parse HTML fragments, rather than complete documents. To use this mode, pass a boolean indicating whether you are parsing a full document as the third argument to the .load() method:
// Note that we are passing `false`, as we are not parsing a full document.
const $ = cheerio.load('<li>Apple</li><li>Banana</li>', {}, false);
$.html(); // => '<li>Apple</li><li>Banana</li>'
This will parse the HTML fragment on its own, rather than treating it as part of a larger document, so no <html>, <head>, or <body> elements are added around it.
Parsing XML with htmlparser2
By default, Cheerio uses htmlparser2 for XML documents. htmlparser2 is a fast and memory-efficient parser that can handle both HTML and XML. To parse XML, pass the xml option to .load():
const $ = cheerio.load('<ul id="fruits">...</ul>', {
  xml: true,
});
If you need to customize the parsing options for XML input, you may pass an object as the xml option to .load(), with the options you want to change:
const $ = cheerio.load('<ul id="fruits">...</ul>', {
  xml: {
    withStartIndices: true,
  },
});
When xml is set, the default options are:
{
  xmlMode: true, // Enable htmlparser2's XML mode.
  decodeEntities: true, // Decode HTML entities.
  withStartIndices: false, // Add a `startIndex` property to nodes.
  withEndIndices: false, // Add an `endIndex` property to nodes.
}
The options in the xml object are taken directly from htmlparser2, so any option that can be used in htmlparser2 is valid in Cheerio as well.
For a full list of options and their effects, see the API documentation.
Using htmlparser2 for HTML
Some users may wish to parse markup with the htmlparser2 library, and traverse and manipulate the resulting structure with Cheerio. This may be the case for those upgrading from pre-1.0 releases of Cheerio (which relied on htmlparser2), for those dealing with invalid markup (because htmlparser2 is more forgiving), or for those operating in performance-critical situations (because htmlparser2 is often faster and the resulting DOM consumes less memory).
To support these cases, you can simply disable xmlMode inside of the xml option:
const $ = cheerio.load('<ul id="fruits">...</ul>', {
  xml: {
    // Disable `xmlMode` to parse HTML with htmlparser2.
    xmlMode: false,
  },
});
.load() also accepts an htmlparser2-compatible data structure as its first argument. Users may install htmlparser2, use it to parse input, and pass the result to .load():
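import * as cheerio from 'cheerio';
import * as htmlparser2 from 'htmlparser2';

// `document` is the markup string and `options` are htmlparser2 parser options.
const dom = htmlparser2.parseDocument(document, options);
const $ = cheerio.load(dom);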
The caveat of this method is that it will still use parse5's serializer, so the resulting output will be HTML, not XML, and will not respect any of the supplied options. Disabling xmlMode, as shown above, is therefore the recommended approach.
You can also use Cheerio's slim export, which always uses htmlparser2. This avoids loading parse5, which saves some bytes, e.g. in browser environments:
import * as cheerio from 'cheerio/slim';
Conclusion
In this guide, we explored how to configure Cheerio for parsing HTML and XML documents using parse5 and htmlparser2 respectively. We also discussed how to modify parsing options and use htmlparser2 directly.