Extracting Data with the extract Method

The extract method lets you pull data out of an HTML document and collect it in a plain object. It takes a map object as its argument: each key names a property to create on the returned object, and each value is the selector or descriptor used to extract that property's value.

To use the extract method, first import the library and load an HTML document. For example:

import * as cheerio from 'cheerio';

const $ = cheerio.load(`
  <ul>
    <li>One</li>
    <li>Two</li>
    <li class="blue sel">Three</li>
    <li class="red">Four</li>
  </ul>
`);

Once the document is loaded, you can call the extract method on the loaded object to extract data from it.

Here are some examples of how to use the extract method:

// Extract the text content of the first .red element
const data = $.extract({
  red: '.red',
});

This returns an object with a red property whose value is the text content of the first .red element.

To extract the text content of all .red elements, wrap the selector in an array:

// Extract the text content of all .red elements
const data = $.extract({
  red: ['.red'],
});

This returns an object with a red property whose value is an array containing the text content of every .red element.

To be more specific about what you'd like to extract, you can pass an object with a selector and a value property. For example, to extract the text content of the first .red element and the href attribute of the first a element:

const data = $.extract({
  red: '.red',
  links: {
    selector: 'a',
    value: 'href',
  },
});

The value property specifies the name of the property to extract from the selected elements. In this case, we are extracting the href attribute from the first <a> element. This uses Cheerio's prop method under the hood.

value defaults to textContent, which extracts the text content of the element.

href is one of the attributes with special logic inside the prop method: it is resolved relative to the document's URL. The document's URL is set automatically when you load the document with fromURL. Otherwise, use the baseURI option to specify the document's URL.

Many other props are available here; see the prop method for details. For example, to extract the outerHTML of all .red elements:

const data = $.extract({
  red: [
    {
      selector: '.red',
      value: 'outerHTML',
    },
  ],
});

You can also extract data from multiple nested elements by specifying an object as the value. For example, to extract the text content of all .red elements and the first .blue element in the first <ul> element, and the text content of all .sel elements in the second <ul> element:

const data = $.extract({
  ul1: {
    selector: 'ul:first',
    value: {
      red: ['.red'],
      blue: '.blue',
    },
  },
  ul2: {
    // :eq() is zero-based, so index 1 selects the second <ul>.
    selector: 'ul:eq(1)',
    value: {
      sel: ['.sel'],
    },
  },
});

This returns an object with ul1 and ul2 properties. The ul1 property is an object with a red property, whose value is an array of the text content of all .red elements in the first <ul> element, and a blue property, whose value is the text content of the first .blue element in that list. The ul2 property is an object with a sel property, whose value is an array of the text content of all .sel elements in the second <ul> element.

Finally, you can pass a function as the value property. The function is called with each selected element and the key of the property:

const data = $.extract({
  links: [
    {
      selector: 'a',
      value: (el, key) => {
        const href = $(el).attr('href');
        return `${key}=${href}`;
      },
    },
  ],
});

This extracts the href attribute of every <a> element and returns, for each one, a string of the form links=href_value, where href_value is the value of the href attribute. The returned object has a links property whose value is an array of these strings.

Putting it all together

Let's fetch Cheerio's releases page from GitHub and extract the name, release date, and release notes of each release:

import * as cheerio from 'cheerio';

const $ = await cheerio.fromURL(
  'https://github.com/cheeriojs/cheerio/releases',
);

const data = $.extract({
  releases: [
    {
      // First, we select individual release sections.
      selector: 'section',
      // Then, we extract the release date, name, and notes from each section.
      value: {
        // Selectors are executed within the context of the selected element.
        name: 'h2',
        date: {
          selector: 'relative-time',
          // The actual release date is stored in the `datetime` attribute.
          value: 'datetime',
        },
        notes: {
          selector: '.markdown-body',
          // We are looking for the HTML content of the element.
          value: 'innerHTML',
        },
      },
    },
  ],
});