An HTML scraping library focused on ease of use.
In this library, matching patterns are described as HTML DOM trees. You can write patterns intuitively and extract the desired content easily.
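```rust
use easy_scraper::Pattern;

// Document to scrape: a list with three items.
let doc = r#"
<ul>
    <li>1</li>
    <li>2</li>
    <li>3</li>
</ul>
"#;

// The pattern is itself a DOM tree; {{foo}} is a placeholder.
let pat = Pattern::new(r#"
<ul>
    <li>{{foo}}</li>
</ul>
"#).unwrap();

let ms = pat.matches(doc);

assert_eq!(ms.len(), 3);
assert_eq!(ms[0]["foo"], "1");
assert_eq!(ms[1]["foo"], "2");
assert_eq!(ms[2]["foo"], "3");
```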
Any DOM tree is a valid pattern. You can write placeholders inside DOM trees.
```html
```
A pattern matches if it is a subset of the document.
If the document is:
```html
```
these trees are subsets of it.
```html
```
```html
```
```html
```
So the match result is:
```json
[
    { "foo": "1" },
    { "foo": "2" },
    { "foo": "3" }
]
```
Child nodes in a pattern can match any descendant in the document because of the subset rule.
For example, this pattern
```html
```
matches against this document.
```html
```
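The descendant rule can be illustrated with a small standalone sketch. It does not use the library and is not its implementation: the `Node` type and both functions are hypothetical, and sibling order and placeholders are ignored to keep the idea visible.

```rust
/// A minimal element tree: a tag name plus child elements (hypothetical type).
struct Node {
    tag: &'static str,
    children: Vec<Node>,
}

/// Returns true if `pattern` can be embedded at `doc`: the tags must agree,
/// and every pattern child must match somewhere below `doc`.
fn matches_at(pattern: &Node, doc: &Node) -> bool {
    pattern.tag == doc.tag
        && pattern
            .children
            .iter()
            .all(|child| has_descendant_match(child, doc))
}

/// Searches all descendants of `doc` (not just direct children) for a match.
fn has_descendant_match(pattern: &Node, doc: &Node) -> bool {
    doc.children
        .iter()
        .any(|d| matches_at(pattern, d) || has_descendant_match(pattern, d))
}
```

Under these assumptions, a pattern `div > li` matches a document `div > ul > li`, because the `li` is found as a descendant of the `div` rather than as a direct child.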
To avoid useless matches, siblings in a pattern are restricted to match only consecutive children of the same parent.
For example, this pattern
```html
```
does not match this document.
```html
```
And for this document,
```html
```
the match results are:
```json
[
    { "foo": "1", "bar": "2" },
    { "foo": "2", "bar": "3" }
]
```
`{ "foo": "1", "bar": "3" }` is not included, because those two nodes are not consecutive children.
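As a rough illustration of the consecutive-sibling rule, the sketch below pairs up adjacent items with a sliding window. It models only the binding behaviour described above; `consecutive_pairs` is a hypothetical helper, not part of the library.

```rust
/// Binds (`foo`, `bar`) to every pair of adjacent siblings, in document order.
fn consecutive_pairs<'a>(siblings: &[&'a str]) -> Vec<(&'a str, &'a str)> {
    // `windows(2)` yields each run of two neighbouring elements.
    siblings.windows(2).map(|w| (w[0], w[1])).collect()
}
```

For the siblings `1`, `2`, `3` this yields `(1, 2)` and `(2, 3)`, and never `(1, 3)`.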
You can allow other nodes between siblings by writing `...` in the pattern.
```html
```
Match result for this pattern is:
```json
[
    { "foo": "1", "bar": "2" },
    { "foo": "1", "bar": "3" },
    { "foo": "2", "bar": "3" }
]
```
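One way to picture the effect of `...` is that the second sibling may be any later sibling, not just the next one. The sketch below enumerates those pairs; `gapped_pairs` is a hypothetical helper used for illustration only, and it says nothing about how the library searches internally.

```rust
/// Binds (`foo`, `bar`) to every ordered pair of siblings where `bar`
/// comes somewhere after `foo`, with any number of nodes in between.
fn gapped_pairs<'a>(siblings: &[&'a str]) -> Vec<(&'a str, &'a str)> {
    let mut pairs = Vec::new();
    for i in 0..siblings.len() {
        for j in (i + 1)..siblings.len() {
            pairs.push((siblings[i], siblings[j]));
        }
    }
    pairs
}
```

For the siblings `1`, `2`, `3` this produces `(1, 2)`, `(1, 3)`, `(2, 3)`, the same three bindings listed in the result above.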
If you want to match siblings as a subsequence instead of a consecutive run, you can use the `subseq` pattern.
```html
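<table>
    <tr><th>AAA</th><td>aaa</td></tr>
    <tr><th>BBB</th><td>bbb</td></tr>
    <tr><th>CCC</th><td>ccc</td></tr>
    <tr><th>DDD</th><td>ddd</td></tr>
    <tr><th>EEE</th><td>eee</td></tr>
</table>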
```
For this document,
```html
<table subseq>
    <tr><th>AAA</th><td>{{a}}</td></tr>
    <tr><th>BBB</th><td>{{b}}</td></tr>
    <tr><th>DDD</th><td>{{d}}</td></tr>
</table>
```
this pattern matches.
```json
[
    {
        "a": "aaa",
        "b": "bbb",
        "d": "ddd"
    }
]
```
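Subsequence matching can be sketched without the library as follows. Rows are modelled as `(header, value)` pairs and pattern rows as `(header, variable)` pairs; both the representation and the `match_subsequence` function are assumptions made for this illustration. The sketch returns only the first match it finds, scanning left to right.

```rust
use std::collections::BTreeMap;

/// Matches pattern rows against document rows as a subsequence:
/// order is preserved, but unmatched document rows may sit in between.
fn match_subsequence(
    doc: &[(&str, &str)],
    pattern: &[(&str, &str)],
) -> Option<BTreeMap<String, String>> {
    let mut bindings = BTreeMap::new();
    let mut rows = doc.iter();
    for &(header, var) in pattern {
        // Advance through the remaining document rows until the header is found;
        // rows skipped here can no longer be used by later pattern rows.
        let &(_, value) = rows.find(|row| row.0 == header)?;
        bindings.insert(var.to_string(), value.to_string());
    }
    Some(bindings)
}
```

With the five document rows and three pattern rows from the example, this binds `a`, `b`, and `d` while skipping the `CCC` and `EEE` rows. A pattern listing `DDD` before `AAA` would not match, since order must be preserved.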
You can specify attributes in patterns. An attribute pattern matches when the pattern's attributes are a subset of the document's attributes.
This pattern
```html
```
matches this document.
```html
```
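The attribute subset rule amounts to a simple containment check. The sketch below shows that check on plain maps; `attrs_match` and the attribute names in the usage note are hypothetical, and placeholders inside attribute values are not handled here.

```rust
use std::collections::BTreeMap;

/// Returns true when every attribute in `pattern` appears in `doc`
/// with the same value. Extra attributes in `doc` are allowed.
fn attrs_match<'a>(
    pattern: &BTreeMap<&'a str, &'a str>,
    doc: &BTreeMap<&'a str, &'a str>,
) -> bool {
    pattern
        .iter()
        .all(|(name, value)| doc.get(name) == Some(value))
}
```

For instance, a pattern element carrying only `kind="video"` would match a document element carrying `kind="video"` and `id="x"`, but not the other way around.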
You can also write placeholders in attributes.
```html
<a href="{{url}}">{{title}}</a>
```
Match result for
```html
<a href="https://www.google.com">Google</a>
<a href="https://www.yahoo.com">Yahoo</a>
```
this document is:
```json
[
    { "url": "https://www.google.com", "title": "Google" },
    { "url": "https://www.yahoo.com", "title": "Yahoo" }
]
```
You can write placeholders at arbitrary positions in a text node.
```html
```
Match result for
```html
```
this document is:
```json
[
    { "a": "1", "b": "2" },
    { "a": "3", "b": "4" },
    { "a": "5", "b": "6" }
]
```
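To make the idea of placeholders inside text concrete, here is a simplified standalone matcher. It is a sketch under stated assumptions, not the library's algorithm: `match_text` is a hypothetical function, each placeholder captures the shortest run of text up to the next literal, adjacent placeholders are rejected, and no whitespace normalisation is done.

```rust
use std::collections::BTreeMap;

/// Matches `text` against `template`, where `{{name}}` captures text
/// and everything else must appear verbatim.
fn match_text(template: &str, text: &str) -> Option<BTreeMap<String, String>> {
    let mut bindings = BTreeMap::new();
    let (mut tpl, mut txt) = (template, text);
    while let Some(open) = tpl.find("{{") {
        // The literal before the placeholder must be a prefix of the text.
        txt = txt.strip_prefix(&tpl[..open])?;
        let close = open + tpl[open..].find("}}")?;
        let name = tpl[open + 2..close].trim();
        tpl = &tpl[close + 2..];
        // The capture runs up to the next literal, or to the end of the text
        // when the placeholder is the last thing in the template.
        let literal_end = tpl.find("{{").unwrap_or(tpl.len());
        let literal = &tpl[..literal_end];
        let len = if tpl.is_empty() {
            txt.len()
        } else if literal.is_empty() {
            return None; // adjacent placeholders are ambiguous in this sketch
        } else {
            txt.find(literal)?
        };
        bindings.insert(name.to_string(), txt[..len].to_string());
        txt = &txt[len..];
    }
    // Any trailing literal must match the rest of the text exactly.
    (tpl == txt).then_some(bindings)
}
```

With a hypothetical template such as `A: {{a}}, B: {{b}}`, the text `A: 1, B: 2` binds `a` to `1` and `b` to `2`, which is the shape of the first result row above.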
You can also write placeholders at arbitrary positions in attribute values.
```html
```
Match result for
```html
```
this document is:
```json
[
    { "userid": "foo", "username": "Foo" },
    { "userid": "bar", "username": "Bar" },
    { "userid": "baz", "username": "Baz" }
]
```
The pattern `{{var:*}}` matches a whole subtree and captures it as a string.
```html
```
These are invalid:
```html
```
```html
<ul>
    <li></li>
    {{foo:*}}
    <li></li>
</ul>
```
License: MIT