
Configuring Cheerio

In this guide, we'll cover how to configure Cheerio to work with different types of documents, and how to use and configure the parsers that ship with the library.

Parsing HTML with parse5

By default, Cheerio uses the parse5 parser for HTML documents. parse5 rigorously conforms to the HTML standard. If you need to change how HTML input is parsed, you can pass an options object as the second argument to .load():

const cheerio = require('cheerio');
const $ = cheerio.load('<noscript><h1>Nested Tag!</h1></noscript>', {
  scriptingEnabled: false,
});

For example, if you want the contents of <noscript> tags to be parsed as HTML, you can set the scriptingEnabled option to false.

For a full list of options and their effects, see the API documentation.

Fragment Mode

By default, parse5 treats the input it receives as a full HTML document and places the content in an <html> document element with nested <head> and <body> tags.

const $ = cheerio.load('<li>Apple</li><li>Banana</li>');

$.html(); // => '<html><head></head><body><li>Apple</li><li>Banana</li></body></html>'

parse5 also supports a "fragment mode" that allows you to parse HTML fragments rather than complete documents. To use this mode, pass a boolean as the third argument to .load(), indicating whether you are parsing a full document:

// Note that we are passing `false`, as we are not parsing a full document.
const $ = cheerio.load('<li>Apple</li><li>Banana</li>', {}, false);

$.html(); // => '<li>Apple</li><li>Banana</li>'

This parses the HTML fragment on its own, rather than treating it as part of a larger document, so no <html>, <head>, or <body> wrappers are added.

Parsing XML with htmlparser2

By default, Cheerio uses htmlparser2 for XML documents. htmlparser2 is a fast and memory-efficient parser that can handle both HTML and XML. To parse XML, pass the xml option to .load():

const $ = cheerio.load('<ul id="fruits">...</ul>', {
  xml: true,
});

If you need to customize the parsing options for XML input, you can pass an object as the xml option to .load(), containing the options you want to change:

const $ = cheerio.load('<ul id="fruits">...</ul>', {
  xml: {
    withStartIndices: true,
  },
});

When xml is set, the default options are:

{
  xmlMode: true, // Enable htmlparser2's XML mode.
  decodeEntities: true, // Decode HTML entities.
  withStartIndices: false, // Add a `startIndex` property to nodes.
  withEndIndices: false, // Add an `endIndex` property to nodes.
}

The options in the xml object are passed directly to htmlparser2, so any option that htmlparser2 accepts is valid in Cheerio as well.

For a full list of options and their effects, see the API documentation.

Using htmlparser2 for HTML

Some users may wish to parse markup with the htmlparser2 library, and traverse and manipulate the resulting structure with Cheerio. This may be the case for those upgrading from pre-1.0 releases of Cheerio (which relied on htmlparser2), for those dealing with invalid markup (because htmlparser2 is more forgiving; see footnote 1), or for those operating in performance-critical situations (because htmlparser2 is often faster and the resulting DOM consumes less memory).

To support these cases, you can disable xmlMode inside the xml option:

const $ = cheerio.load('<ul id="fruits">...</ul>', {
  xml: {
    // Disable `xmlMode` to parse HTML with htmlparser2.
    xmlMode: false,
  },
});

.load() also accepts an htmlparser2-compatible data structure as its first argument. You can install htmlparser2, use it to parse the input, and pass the result to .load():

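import * as cheerio from 'cheerio';
import * as htmlparser2 from 'htmlparser2';

// `document` is the markup string to parse; `options` holds htmlparser2 parser options.
const dom = htmlparser2.parseDocument(document, options);

const $ = cheerio.load(dom);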

One caveat: this approach still uses parse5's serializer, so the output will be HTML rather than XML and will not respect any of the supplied options. Disabling xmlMode, as shown above, is therefore the recommended approach.

Tip

You can also use Cheerio's slim export, which always uses htmlparser2. This avoids loading parse5, which saves some bytes, e.g., in browser environments:

import * as cheerio from 'cheerio/slim';

Conclusion

In this guide, we explored how to configure Cheerio for parsing HTML and XML documents using parse5 and htmlparser2, respectively. We also discussed how to modify parsing options and how to use htmlparser2 directly.

Footnotes

  1. Note that "more forgiving" means htmlparser2 has error-correcting mechanisms that don't always match the standards observed by web browsers. This behavior may be useful when parsing non-HTML content.