We have recently received many reports from users that their websites are frequently very slow or even unreachable, with high CPU usage and high overall server load. When our engineers analyzed the site logs, they found large numbers of unfamiliar spiders continuously crawling customers' sites. In our experience, in the big-data era all kinds of data crawlers (security scanners, opinion-monitoring tools, AI large-model training harvesters, and so on) are constantly scanning and collecting website data; in the cases we analyzed, this type of traffic accounted for more than 99% of total site traffic. This is clearly where the problem lies, so it is worth blocking unnecessary spider traffic to reduce the load on the site.

Below we share how to block some of these less useful spiders by configuring web.config. The setup is simple: log in to your server, open web.config with Notepad or a similar editor, find the rewrite node shown below, and add the crawler-blocking rule:
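Before blocking anything, it helps to confirm from the logs which user agents actually dominate your traffic. The sketch below counts user agents in an IIS W3C log file; it assumes the log's `#Fields:` header includes a `cs(User-Agent)` column (the default layout), and the file path is a placeholder:

```python
from collections import Counter

def top_user_agents(log_path: str, limit: int = 10):
    """Count user agents in an IIS W3C log.

    Assumes the space-delimited W3C format, where the '#Fields:' header
    line names the columns and spaces inside the user agent are encoded
    as '+', so a plain split() keeps each field intact.
    """
    counts = Counter()
    ua_index = None
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if line.startswith("#Fields:"):
                fields = line.split()[1:]
                if "cs(User-Agent)" in fields:
                    ua_index = fields.index("cs(User-Agent)")
            elif not line.startswith("#") and ua_index is not None:
                parts = line.split()
                if len(parts) > ua_index:
                    counts[parts[ua_index]] += 1
    return counts.most_common(limit)
```

Running this against a day's log (e.g. `top_user_agents(r"C:\inetpub\logs\LogFiles\W3SVC1\u_ex240101.log")`) quickly shows whether a handful of bots is responsible for most requests.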
<system.webServer>
<modules runAllManagedModulesForAllRequests="true">
<add type="Kesion.APPCode.HttpModule" name="HttpModule" />
<remove name="Session" />
<add name="Session" type="System.Web.SessionState.SessionStateModule" />
</modules>
<security>
<requestFiltering>
<requestLimits maxAllowedContentLength="262144000" />
</requestFiltering>
</security>
<defaultDocument>
<files>
<clear />
<add value="index.aspx" />
<add value="index.html" />
</files>
</defaultDocument>
<handlers>
<add name="html" path="*.html" verb="*" type="System.Web.UI.PageHandlerFactory" />
<add name="all" path="*" verb="*" modules="IsapiModule" scriptProcessor="%windir%\Microsoft.NET\Framework\v4.0.30319\aspnet_isapi.dll" resourceType="Unspecified" requireAccess="None" preCondition="classicMode,runtimeVersionv2.0,bitness32" />
<remove name="ExtensionlessUrlHandler-Integrated-4.0" />
<remove name="OPTIONSVerbHandler" />
<remove name="TRACEVerbHandler" />
<add name="ExtensionlessUrlHandler-Integrated-4.0" path="*." verb="*" type="System.Web.Handlers.TransferRequestHandler" preCondition="integratedMode,runtimeVersionv4.0" />
</handlers>
<staticContent>
<!-- <mimeMap fileExtension=".mp4" mimeType="application/octet-stream" />
<mimeMap fileExtension=".woff" mimeType="application/x-woff" />
<mimeMap fileExtension="." mimeType="application/octet-stream" />
-->
<mimeMap fileExtension=".vue" mimeType="text/html" />
</staticContent>
<directoryBrowse enabled="false" />
<httpErrors errorMode="Custom">
<remove statusCode="404" />
<error statusCode="404" prefixLanguageFilePath="" path="/index.aspx?c=Go404" responseMode="ExecuteURL" />
</httpErrors>
<rewrite>
<rules>
<!-- Block non-mainstream crawlers -->
<rule name="Block Non-Mainstream Crawlers" stopProcessing="true">
<match url=".*" />
<conditions>
<add input="{HTTP_USER_AGENT}" pattern="spider|scanner|curl|MegaIndex|BLEXBot|Qwantify|semrush|serpstatbot|hubspot|python|Bytespider|Go-http-client|Java|PhantomJS|Scrapy|Webdup|AcoonBot|AhrefsBot|Ezooms|EdisterBot|EC2LinkFinder|jikespider|Purebot|MJ12bot|WangIDSpider|WBSearchBot|Wotbox|xbfMozilla|Yottaa|YandexBot|Jorgee|SWEBot|spbot|TurnitinBot-Agent|mail.RU|perl|Wget|Xenu|ZmEu|^$"
ignoreCase="true" />
<add input="{HTTP_USER_AGENT}" pattern="Googlebot|Bingbot|Sogou|360Spider|Baiduspider" negate="true" />
</conditions>
<action type="AbortRequest" />
</rule>
<rule name="https" stopProcessing="true">
<match url="(.*)" />
<conditions>
<add input="{HTTPS}" pattern="^OFF$" />
<add input="{PATH_INFO}" pattern="^websystem" />
</conditions>
<action type="Redirect" url="https://{HTTP_HOST}/{R:1}" redirectType="Temporary" />
</rule>
</rules>
</rewrite>
</system.webServer>
Explanation of the rules added above:
The following condition blocks a default set of unknown spiders (matching is case-insensitive, and the trailing `^$` also blocks requests with an empty user agent). To block additional spiders, append their names to the pattern in the same way:
<add input="{HTTP_USER_AGENT}" pattern="spider|scanner|curl|MegaIndex|BLEXBot|Qwantify|semrush|serpstatbot|hubspot|python|Bytespider|Go-http-client|Java|PhantomJS|Scrapy|Webdup|AcoonBot|AhrefsBot|Ezooms|EdisterBot|EC2LinkFinder|jikespider|Purebot|MJ12bot|WangIDSpider|WBSearchBot|Wotbox|xbfMozilla|Yottaa|YandexBot|Jorgee|SWEBot|spbot|TurnitinBot-Agent|mail.RU|perl|Wget|Xenu|ZmEu|^$" ignoreCase="true" />
And this condition lets the mainstream spiders through (Google, Bing, Sogou, 360, and Baidu):
<add input="{HTTP_USER_AGENT}" pattern="Googlebot|Bingbot|Sogou|360Spider|Baiduspider" negate="true" />
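The combined effect of the two conditions can be checked offline before you touch the server. The sketch below reproduces them with Python's `re` module; IIS rewrite conditions use .NET regular expressions and match case-insensitively by default, but for this plain alternation the behavior is the same (the upper/lower-case duplicate tokens are omitted here, since matching is case-insensitive anyway):

```python
import re

# Block list from the first condition (ignoreCase="true").
BLOCK = re.compile(
    r"spider|scanner|curl|MegaIndex|BLEXBot|Qwantify|semrush|serpstatbot|"
    r"hubspot|python|Bytespider|Go-http-client|Java|PhantomJS|Scrapy|Webdup|"
    r"AcoonBot|AhrefsBot|Ezooms|EdisterBot|EC2LinkFinder|jikespider|Purebot|"
    r"MJ12bot|WangIDSpider|WBSearchBot|Wotbox|xbfMozilla|Yottaa|YandexBot|"
    r"Jorgee|SWEBot|spbot|TurnitinBot-Agent|mail\.RU|perl|Wget|Xenu|ZmEu|^$",
    re.IGNORECASE,
)
# Whitelist from the second condition (negate="true" in the rule).
ALLOW = re.compile(r"Googlebot|Bingbot|Sogou|360Spider|Baiduspider", re.IGNORECASE)

def is_blocked(user_agent: str) -> bool:
    """Return True if the rewrite rule would abort this request.

    Both conditions must hold: the UA matches the block list AND does
    not match the mainstream-crawler whitelist.
    """
    return bool(BLOCK.search(user_agent)) and not ALLOW.search(user_agent)

print(is_blocked("Mozilla/5.0 (compatible; AhrefsBot/7.0)"))    # True: blocked
print(is_blocked("Mozilla/5.0 (compatible; Baiduspider/2.0)"))  # False: whitelisted
print(is_blocked(""))                                           # True: empty UA blocked
```

Note that `Baiduspider` matches the block list too (it contains "spider"), which is exactly why the whitelist condition is required: without it, the mainstream search engines would be locked out along with the unwanted bots.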
With the rewrite rules above in place, most unwanted spider traffic can be blocked, which noticeably improves the stability of the site.
Of course, where resources allow, we strongly recommend also enabling a WAF for your site, for example Alibaba Cloud's website WAF product, configured with the appropriate filtering rules.
For reference, the user-agent names of the major spiders:
Google: googlebot
Baidu: baiduspider
Baidu mobile: baiduboxapp
Yahoo: slurp
Alexa: ia_archiver
MSN: msnbot
Bing: bingbot
AltaVista: scooter
Lycos: lycos_spider_(t-rex)
AllTheWeb: fast-webcrawler
Inktomi: slurp
Youdao: YodaoBot and OutfoxBot
热土: Adminrtspider
Sogou: sogou spider
SOSO: sosospider
360 Search: 360spider