A Rust library the crate help you Write a function similar to that of Python's request repository ([Python
] BS4 library).
Each new requests initializes a cache instance which stores the parsed data in key value pairs
When you get an instance of connect, you can call the parser method to parse the data in the form of closure
parser
select function support a lot css selection syntax, this is mainly to simplify the work of parsing the domValue
, so not required some Value
type data| Support Css Selector List| | ---- | | .class.class | | .class | | #id| |element.class| |element element| |[attr=value]| |[attr~value]| |element|
Add #[derive(DBfile)]
to struct, you can use func DBStore::to_csv
put data in a csv file
```rust let data = Cache::new(); let client = Requests::new(&data); let rq = client.connect("https://www.qq.com/", Headers::Default);
#[derive(DBfile, Debug)]
struct Link<'a> {
href: &'a str,
link_name: String,
}
rq.free_parse(|mut p| {
p.free_select("li.nav-item a",|n| {
let links = n.iter().map(|x| {
Link { href: x.attr("href").expect("extra href error"), link_name: x.text() }
}).collect::<Vec<Link>>();
DBStore::to_csv(links, "D:\\links.csv", "a", true);
});
});
```
Open links.csv
view result:
rust
href,link_name
http://news.qq.com/,新闻
https://v.qq.com/?isoldly=1,视频
http://gongyi.qq.com/,公益
https://new.qq.com/ch/milite/,军事
https://sports.qq.com/,体育
https://sports.qq.com/nba/,NBA
https://new.qq.com/ch/ent/,娱乐
https://new.qq.com/ch/finance/,财经
https://new.qq.com/ch/tech/,科技
https://new.qq.com/ch/fashion/,时尚
https://new.qq.com/ch/auto/,汽车
http://house.qq.com/,房产
https://new.qq.com/ch/edu/,教育
https://new.qq.com/ch/cul/,文化
https://new.qq.com/ch/astro/,星座
https://new.qq.com/ch/games/,游戏
http://book.qq.com/,文学
https://v.qq.com/tv/,热剧
https://new.qq.com/ch/antip/,抗肺炎
http://new.qq.com/ch/history/,历史
http://sports.qq.com/premierleague/,英超
http://sports.qq.com/cba/,CBA
https://new.qq.com/ch2/star,明星
https://new.qq.com/ch/finance_licai/,理财
https://new.qq.com/ch/kepu/,科普
https://new.qq.com/ch/health/,健康
https://auto.qq.com/car_public/index.shtml,车型
http://www.jia360.com,家居
https://new.qq.com/ch/baby/,育儿
https://new.qq.com/ch/emotion/,情感
https://new.qq.com/ch/comic/,动漫
https://new.qq.com/omv/,享看
http://tianqi.qq.com/index.htm,天气
https://new.qq.com/omn/author/5107513,较真
https://v.qq.com/channel/variety,综艺
https://new.qq.com/ch/cul_ru/,新国风
https://new.qq.com/ch/world/,国际
http://sports.qq.com/csocce/csl/,中超
http://fans.sports.qq.com/#/,社区
http://v.qq.com/movie/,电影
https://new.qq.com/ch/finance_stock/,证券
https://new.qq.com/ch/digi/,数码
https://new.qq.com/ch2/makeup,美容
https://new.qq.com/ch/topic/,话题
https://new.qq.com/ch/life/,生活
http://kid.qq.com/,儿童
http://www.qq.com/map/,全部
```rust let data = Cache::new(); let client = Requests::new(&data); let mut rq = client.connect("https://www.qq.com/", Headers::Default);
rq.parser(|p| { p.findall("a", |x| { x.attr("href").mapor(false, |v| v.starts_with("http")) }, "href") }, "href"); // data.print() ```
Use data.print
you can view the value stored as the [href
] key. It is a value enumeration type that contains most data types.
rust
Key -- "href" Value -- LIST(["https://qzone.qq.com", "https://qzone.qq.com", "https://mail.qq.com", "https://mail.qq.com/cgi-bin/loginpage", "http://news.qq.com/", "https://v.qq.com/?isoldly=1", "http://gongyi.qq.com/",...]
Headers defines three types of request headers. The default Header::default
has only one user agent
, or it can be without any Headers::None
.also use JSON string to make a request header containing useragent
and host
, this code:
rust
let headers = r#"{"user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36", "host": "www.qq.com"}"#;
let store = Cache::new();
let client = Requests::new(&store);
let mut p = client.connect("https://www.qq.com", Headers::JSON(headers));
If you need more request header fields, you need to use add corresponding fields in headers.rs
When you use the connect
method to connect to a URL, you can use the parser method to write the work of parsing HTML, the current parser has find
and find_ all
method
the parser must return Value
type
If you use find find_all the code:
rust
rq.parser(|p| {
p.find_all("a", |x| {
x.attr("href").map_or(false, |v| v.starts_with("http"))
}, "href")
}, "href")
The first parameter of parser is obtained by closure automatically saves the value to value:: list
In general, you may need to handle the parsing manually and customize the returned Value
, please use p.select
function , this example code:
```rust let data = Cache::new(); let client = Requests::new(&data); let mut parser = client.connect("https://www.qq.com", Headers::Default); parser.parser(|p| { let mut result = HashMap::new();
let navs = p.select("li.nav-item", |nodes| {
let navs = nodes.into_iter().map(|n| {
let mut item = HashMap::new();
n.find(Name("a")).next().map_or(HashMap::from([("".to_string(), Value::NULL)]), |a| {
let nav_name = a.text();
let nav_href = a.attr("href").map_or(String::from(""), |x| x.to_string());
item.insert("nav_name".to_string(), Value::STR(nav_name));
item.insert("nav_href".to_string(), Value::STR(nav_href));
item
})
}).collect::<Vec<HashMap<String, Value>>>();
Value::VECMAP(navs)
});
let news = p.select("ul.yw-list", |nodes| {
let mut news = Vec::new();
for node in nodes {
for n in node.find(Class("news-top")) {
for a in n.find(Name("a")) {
let title = a.text();
news.push(title);
}
}
}
Value::LIST(news)
});
result.insert("titles".to_owned(), news);
result.insert("nav".to_owned(), navs);
Value::MAP(result)
}, "index");
data.print();
```
```rust
pub enum Value {
/// 字符串类型
STR(String),
/// 字符串列表
LIST(Vec
/// map类型的列表
VECMAP(Vec<HashMap<String, Value>>),
/// map类型
MAP(HashMap<String, Value>)
} ``` Add the data type you need in the Value.rs
use rayon
library test the concurrency, this have a simple code:
```rust
let data = Cache::new();
let client = Requests::new(&data);
let urls = ["https://www.baidu.com", "https://www.qq.com", "https://www.163.com"];
let _ = urls.pariter().map(|url| {
let mut p = client.connect(url, Headers::Default);
p.parser(|p| {
p.findall("a", |f| f.attr("href").mapor(false, |v| v.startswith("http://")), "href")
}, format!("{}link", url).asstr());
p.parser(|p| {
p.find("title", |f| f.text() != "", "text")
}, format!("{}_title", url).as_str());
})
.map(|_| String::from("")).collect::<String>();
match data.get("https://www.qq.com_title") {
Value::STR(i) => assert_eq!(i, "腾讯首页"),
_ => panic!("")
};
if let Value::STR(i) = data.get("https://www.163.com_title") {
assert_eq!(i, "网易");
}
if let Value::STR(i) = data.get("https://www.baidu.com_title") {
assert_eq!(i, "百度一下,你就知道");
}
```