web-grep

What is this?

Grep for HTML or XML.

```bash
$ echo '<a>Hello</a>' | web-grep '<a>{}</a>'
Hello
```

```bash
$ echo '<a>Hello</a>' | web-grep '<a>{html}</a>' --json
{"html":"Hello"}
```

```bash
# List all <p> innerHTML
$ cat << EOM | web-grep '<p>{}</p>'
<p>hello</p>
<p>world</p>
EOM
hello
world
```

```bash

# Filtering with attributes

$ cat << EOM | web-grep '

{}

'

hello

world

EOM
world
```

```bash

# The placeholder {} can be an attribute

$ cat << EOM | web-grep '

'

hello

world

EOM
here
```

How does it work?

This is just a CLI for an awesome library, tanakh/easy-scraper.

Installation

  1. Install cargo
  2. Then,

Usage

```bash
$ web-grep <QUERY> [INPUT]
```

QUERY is an HTML (XML) pattern.

A pattern is a valid HTML structure that has placeholders in place of innerHTML or attribute values. web-grep offers several kinds of placeholders for different cases.

Placeholders

Anonymous Placeholder {}

If you need exactly one placeholder in the pattern, use {}.

```html

{}

```

```html

{}

```

web-grep outputs every text matched by {}.

```bash
$ echo "<p>1</p><p>2</p><p>3</p>" | web-grep "<p>{}</p>"
1
2
3
```

Numbered Placeholders {n}

```html
<a href="{1}">{2}</a>
```

web-grep outputs the matched texts for {1}, {2}, ... in numeric order, separated by a tab (\t).

```bash
$ echo '<a href=hoge>fuga</a>' | web-grep "<a href={2}>{1}</a>"
fuga	hoge
```

The delimiter can be specified with -F.

```bash
$ echo '<a href=hoge>fuga</a>' | web-grep "<a href={2}>{1}</a>" -F ' '
fuga hoge
```
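Since each match is one line of delimiter-separated fields, the output is easy to consume from a shell loop. The following is a minimal sketch, not part of web-grep itself: `sample_output` is a hypothetical stand-in for real web-grep output (the second row is made up), and the variable names `text` and `href` are chosen only for illustration.

```bash
# Stand-in for real web-grep output: one match per line, fields separated
# by a tab, in the order {1}, {2}. The rows here are illustrative only.
sample_output="$(printf 'fuga\thoge\nfoo\tbar')"

# A literal tab character to use as the field separator.
tab="$(printf '\t')"

# Split each line on tabs only, so field values containing spaces stay intact.
formatted="$(printf '%s\n' "$sample_output" | while IFS="$tab" read -r text href; do
  printf 'text=%s href=%s\n' "$text" "$href"
done)"

printf '%s\n' "$formatted"
```

In real use, the `printf '%s\n' "$sample_output"` stage would be replaced by the web-grep invocation. Note that `read` treats a tab as whitespace, so empty fields collapse; if a placeholder can match empty text, choosing a non-whitespace delimiter with -F and setting `IFS` to match may be safer.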

Named Placeholders {xxx}

```html
<a href="{href}">{innerHTML}</a>
```

The output can be formatted as JSON with --json.

```bash
$ echo '<a href=hoge>fuga</a>' | web-grep "<a href={href}>{html}</a>" --json
{"href":"hoge","html":"fuga"}
```