Extracting Data with the extract Method
The extract method allows you to extract data from an HTML document and store it in an object. The method takes a map object as a parameter, where the keys are the names of the properties to be created on the returned object, and the values are the selectors or descriptors used to extract the values.
To use the extract method, you first need to import the library and load an HTML document. For example:
import * as cheerio from 'cheerio';

const $ = cheerio.load(`
  <ul>
    <li>One</li>
    <li>Two</li>
    <li class="blue sel">Three</li>
    <li class="red">Four</li>
  </ul>
`);
Once you have loaded the document, you can use the extract method on the loaded object to extract data from the document.

Here are some examples of how to use the extract method:
// Extract the text content of the first .red element
const data = $.extract({
  red: '.red',
});
This will return an object with a red property, whose value is the text content of the first .red element.
To extract the text content of all .red elements, you can wrap the selector in an array:
// Extract the text content of all .red elements
const data = $.extract({
  red: ['.red'],
});
This will return an object with a red property, whose value is an array of the text content of all .red elements.
To be more specific about what you'd like to extract, you can pass an object with a selector and a value property. For example, to extract the text content of the first .red element and the href attribute of the first a element:
const data = $.extract({
  red: '.red',
  links: {
    selector: 'a',
    value: 'href',
  },
});
The value property can be used to specify the name of the property to extract from the selected elements. In this case, we are extracting the href attribute from the first <a> element. This uses Cheerio's prop method under the hood.
value defaults to textContent, which extracts the text content of the element.
Because href is an attribute with special logic inside the prop method, hrefs will be resolved relative to the document's URL. The document's URL is set automatically when using fromURL to load the document. Otherwise, use the baseURI option to specify the document's URL.
There are many props available here; have a look at the prop method for details. For example, to extract the outerHTML of all .red elements:
const data = $.extract({
  red: [
    {
      selector: '.red',
      value: 'outerHTML',
    },
  ],
});
You can also extract data from multiple nested elements by specifying an object as the value. The nested selectors are then evaluated within the element matched by the outer selector. For example, assuming a document with at least two <ul> elements, to extract the text content of all .red elements and the first .blue element in the first <ul> element, and the text content of all .sel elements in the second <ul> element:
const data = $.extract({
  ul1: {
    selector: 'ul:first',
    value: {
      red: ['.red'],
      blue: '.blue',
    },
  },
  ul2: {
    // :eq() is zero-based, so index 1 selects the second <ul>.
    selector: 'ul:eq(1)',
    value: {
      sel: ['.sel'],
    },
  },
});
This will return an object with ul1 and ul2 properties. The ul1 property will be an object with a red property, whose value is an array of the text content of all .red elements in the first <ul> element, and a blue property, whose value is the text content of the first .blue element in that list. The ul2 property will be an object with a sel property, whose value is an array of the text content of all .sel elements in the second <ul> element.
Finally, you can pass a function as the value property. The function will be called with each of the selected elements, and the key of the property:
const data = $.extract({
  links: [
    {
      selector: 'a',
      value: (el, key) => {
        const href = $(el).attr('href');
        return `${key}=${href}`;
      },
    },
  ],
});
This will extract the href attribute of all <a> elements and return a string in the form links=href_value for each element, where href_value is the value of the href attribute. The returned object will have a links property whose value is an array of these strings.
Putting it all together
Let's fetch Cheerio's releases page from GitHub and extract the name, date, and notes of each release listed there:
import * as cheerio from 'cheerio';

const $ = await cheerio.fromURL(
  'https://github.com/cheeriojs/cheerio/releases',
);

const data = $.extract({
  releases: [
    {
      // First, we select individual release sections.
      selector: 'section',
      // Then, we extract the release date, name, and notes from each section.
      value: {
        // Selectors are executed within the context of the selected element.
        name: 'h2',
        date: {
          selector: 'relative-time',
          // The actual release date is stored in the `datetime` attribute.
          value: 'datetime',
        },
        notes: {
          selector: '.markdown-body',
          // We are looking for the HTML content of the element.
          value: 'innerHTML',
        },
      },
    },
  ],
});