Blocking Crawler Access with NGINX

Administrator, 2021-9-27

1. Blocking crawlers by User-agent in robots.txt

User-agent: SemrushBot
Disallow: /
User-agent: SemrushBot-SA
Disallow: /
User-agent: bingbot
Disallow: /
User-agent: DotBot
Disallow: /
User-agent: MegaIndex.ru
Disallow: /
User-agent: MauiBot
Disallow: /
User-agent: AhrefsBot
Disallow: /
User-agent: MJ12bot
Disallow: /
User-agent: BLEXBot
Disallow: /
User-agent: Googlebot
Disallow: /

User-agent: *
Disallow: /api/
Disallow: /data/
Disallow: /source/
Disallow: /install/
Disallow: /template/
Disallow: /config/
Disallow: /uc_client/
Disallow: /uc_server/
Disallow: /static/
Disallow: /admin.php
Disallow: /search.php
Disallow: /member.php
Disallow: /api.php
Disallow: /misc.php
Disallow: /connect.php
Disallow: /forum.php*
Disallow: /home.php*
Disallow: /thread*
Disallow: /forum.php?mod=redirect*
Disallow: /forum.php?mod=post*
Disallow: /home.php?mod=spacecp*
Disallow: /userapp.php?mod=app&*
Disallow: /*?mod=misc*
Disallow: /*?mod=attachment*
Disallow: /*mobile=yes*
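
The effect of these rules can be checked offline. The following is a minimal sketch, assuming Python's standard urllib.robotparser module and a shortened copy of the rules above; the name ExampleBot and the helper allowed() are made up for illustration. The wildcard lines are left out on purpose, because this parser only does plain prefix matching and does not interpret "*" inside a path.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the rules above (wildcard lines intentionally omitted,
# since urllib.robotparser only does prefix matching on paths).
ROBOTS_TXT = """\
User-agent: SemrushBot
Disallow: /
User-agent: AhrefsBot
Disallow: /

User-agent: *
Disallow: /api/
Disallow: /admin.php
"""


def allowed(user_agent: str, path: str) -> bool:
    """Return True if the rules let user_agent fetch path (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # ExampleBot is an invented crawler name that falls under "User-agent: *".
    for agent, path in [
        ("SemrushBot", "/index.html"),
        ("ExampleBot", "/index.html"),
        ("ExampleBot", "/api/users"),
    ]:
        print(agent, path, allowed(agent, path))
```

Note that robots.txt is only a request: a crawler that ignores it is not stopped by these rules.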

2. Blocking crawlers in the nginx.conf configuration file

server {
	listen       80;

	# Serve a robots.txt that asks every crawler to stay off the whole site.
	# default_type sets the Content-Type; a separate add_header would send a second Content-Type header.
	location = /robots.txt {
		default_type text/plain;
		return 200 "User-agent: *\nDisallow: /\n";
	}

	location / {
		# Block scraping tools such as Scrapy
		if ($http_user_agent ~* "Scrapy|Curl|HttpClient") {
			return 403;
		}

		# Block the listed user agents and requests with an empty User-Agent
		if ($http_user_agent ~* "FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|Baiduspider-render|SemrushBot|bingbot|DotBot|MegaIndex\.ru|MauiBot|BLEXBot|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|Bytespider|Ezooms|Googlebot|JikeSpider|^$") {
			return 403;
		}

		# Block request methods other than GET, HEAD, and POST
		if ($request_method !~ ^(GET|HEAD|POST)$) {
			return 403;
		}
	}
}
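
Before reloading nginx, it can help to check which requests the three conditions in location / would reject. The sketch below is only a model of that logic, assuming Python's re module behaves like nginx's case-insensitive ~* match for these simple alternations; the function name is_blocked() and the sample User-Agent strings are made up for illustration.

```python
import re

# Names taken from the two User-Agent checks in the config above.
TOOL_NAMES = ["Scrapy", "Curl", "HttpClient"]
LISTED_NAMES = [
    "FeedDemon", "Indy Library", "Alexa Toolbar", "AskTbFXTV",
    "Baiduspider-render", "SemrushBot", "bingbot", "DotBot", "MegaIndex.ru",
    "MauiBot", "BLEXBot", "AhrefsBot", "CrawlDaddy", "CoolpadWebkit", "Java",
    "Feedly", "UniversalFeedParser", "ApacheBench", "Microsoft URL Control",
    "Swiftbot", "ZmEu", "oBot", "jaunty", "Python-urllib",
    "lightDeckReports Bot", "YYSpider", "DigExt", "HttpClient", "MJ12bot",
    "heritrix", "Bytespider", "Ezooms", "Googlebot", "JikeSpider",
]

# Case-insensitive substring match, like nginx's "~*" operator.
TOOL_UA = re.compile("|".join(map(re.escape, TOOL_NAMES)), re.IGNORECASE)
# The trailing "^$" alternative matches an empty User-Agent.
LISTED_UA = re.compile(
    "|".join(map(re.escape, LISTED_NAMES)) + "|^$", re.IGNORECASE
)
ALLOWED_METHODS = {"GET", "HEAD", "POST"}


def is_blocked(method: str, user_agent: str = "") -> bool:
    """Return True if the modeled config would answer 403 (hypothetical helper)."""
    if TOOL_UA.search(user_agent):
        return True
    if LISTED_UA.search(user_agent):
        return True
    # The method check is case-sensitive, as in the nginx "!~" test.
    return method not in ALLOWED_METHODS


if __name__ == "__main__":
    browser = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
               "(KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36")
    for method, ua in [("GET", "Scrapy/2.5.0"), ("GET", ""),
                       ("DELETE", browser), ("GET", browser)]:
        print(method, repr(ua[:30]), is_blocked(method, ua))
```

Because the match is a substring test, short entries such as Java or oBot also block any User-Agent that merely contains those letters, so the list is worth reviewing against real access logs.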

