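Abot is an open-source C# web crawler framework built for speed and flexibility. It takes care of the low-level plumbing (multithreading, HTTP requests, scheduling, link parsing, etc.), so developers only need to register for events to process the page data.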
Features
Install Abot with NuGet
- PM> Install-Package Abot
Below are some Abot code examples
- private static async Task DemoSimpleCrawler()
- {
- var config = new CrawlConfiguration
- {
- // Crawl at most 10 pages
- MaxPagesToCrawl = 10,
- // Wait at least 3 seconds between requests to the same domain
- MinCrawlDelayPerDomainMilliSeconds = 3000
- };
- var crawler = new PoliteWebCrawler(config);
-
- // Subscribe a handler that logs each completed page
- crawler.PageCrawlCompleted += PageCrawlCompleted;
-
- var crawlResult = await crawler.CrawlAsync(new Uri("https://google.com"));
- }
-
- private static async Task DemoSinglePageRequest()
- {
- var pageRequester = new PageRequester(new CrawlConfiguration(), new WebContentExtractor());
-
- var crawledPage = await pageRequester.MakeRequestAsync(new Uri("http://google.com"));
- Log.Logger.Information("{result}", new
- {
- url = crawledPage.Uri,
- status = Convert.ToInt32(crawledPage.HttpResponseMessage.StatusCode)
- });
- }
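DemoSimpleCrawler subscribes a PageCrawlCompleted handler without showing its body. A minimal sketch of such a handler follows, assuming the Abot2.Crawler namespace and writing to the console rather than a logger. The class name CrawlHandlers and the output format are illustrative choices, not part of Abot.

```csharp
using System;
using Abot2.Crawler;

internal static class CrawlHandlers
{
    // Illustrative handler with the signature the PageCrawlCompleted event expects.
    public static void PageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
    {
        var page = e.CrawledPage;

        // HttpResponseMessage may be null when the request itself failed.
        var response = page.HttpResponseMessage;
        var status = response == null ? "no response" : ((int)response.StatusCode).ToString();

        // Guard against missing content before measuring it.
        var length = page.Content?.Text?.Length ?? 0;

        Console.WriteLine($"{page.Uri} status={status} length={length}");
    }
}
```

If the handler lives in a separate class like this, the subscription would read crawler.PageCrawlCompleted += CrawlHandlers.PageCrawlCompleted;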
Global configuration
Abot's Abot2.Poco.CrawlConfiguration class exposes a large number of configuration options, which you can set to suit your needs.
- var crawlConfig = new CrawlConfiguration();
- crawlConfig.CrawlTimeoutSeconds = 100;
- crawlConfig.MaxConcurrentThreads = 10;
- crawlConfig.MaxPagesToCrawl = 1000;
- crawlConfig.UserAgentString = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36";
- crawlConfig.ConfigurationExtensions.Add("SomeCustomConfigValue1", "1111");
- crawlConfig.ConfigurationExtensions.Add("SomeCustomConfigValue2", "2222");
- // etc...
Registering events
You can register for Abot's crawl events to observe each step of the crawl.
- crawler.PageCrawlStarting += crawler_ProcessPageCrawlStarting;
- crawler.PageCrawlCompleted += crawler_ProcessPageCrawlCompleted;
- crawler.PageCrawlDisallowed += crawler_PageCrawlDisallowed;
- crawler.PageLinksCrawlDisallowed += crawler_PageLinksCrawlDisallowed;
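The subscriptions above refer to handler methods that are not shown. As a rough sketch, three of them might look like the following, assuming the event argument types from the Abot2.Crawler namespace. The wrapping class name and the console messages are illustrative, and the methods are static here only so the block stands alone.

```csharp
using System;
using Abot2.Crawler;

internal static class CrawlEventHandlers
{
    // Raised just before a page is requested.
    public static void crawler_ProcessPageCrawlStarting(object sender, PageCrawlStartingArgs e)
    {
        Console.WriteLine($"About to crawl {e.PageToCrawl.Uri.AbsoluteUri}");
    }

    // Raised when a page is skipped, for example by a crawl rule.
    public static void crawler_PageCrawlDisallowed(object sender, PageCrawlDisallowedArgs e)
    {
        Console.WriteLine($"Did not crawl {e.PageToCrawl.Uri.AbsoluteUri}: {e.DisallowedReason}");
    }

    // Raised when a page was crawled but its links will not be followed.
    public static void crawler_PageLinksCrawlDisallowed(object sender, PageLinksCrawlDisallowedArgs e)
    {
        Console.WriteLine($"Did not crawl links on {e.CrawledPage.Uri.AbsoluteUri}: {e.DisallowedReason}");
    }
}
```

A crawler_ProcessPageCrawlCompleted handler would have the same shape, taking a PageCrawlCompletedArgs and reading e.CrawledPage.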
https://github.com/sjdirect/abot