Kesion Online School (科汛网校) V10/V11

Blocking Junk Spider User-Agents in IIS to Reduce Server Load

2022/7/15 14:56:02

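Many users have recently reported that their sites are often very slow or even unreachable, with very high CPU usage and high overall server load. After analyzing the site logs, our engineers found large numbers of unidentified spiders crawling customers' sites continuously. In our experience this is typical of the big-data era: data crawlers of all kinds (security scanners, public-opinion monitoring, AI large-model training, and so on) scan and harvest website data nonstop, and on the affected sites this kind of traffic made up more than 99% of total traffic. This is clearly where the problem lies, so unnecessary spiders should be blocked to reduce the load on the site. Below we show how to block rarely needed spiders through web.config. The setup is simple: log in to your server, open web.config in Notepad or a similar editor, find the rewrite node shown below, and add the "Block Non-Mainstream Crawlers" rule (highlighted in red in the original article):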

<system.webServer>
  <modules runAllManagedModulesForAllRequests="true">
    <add type="Kesion.APPCode.HttpModule" name="HttpModule" />
    <remove name="Session" />
    <add name="Session" type="System.Web.SessionState.SessionStateModule" />
  </modules>
  <security>
    <requestFiltering>
      <requestLimits maxAllowedContentLength="262144000" />
    </requestFiltering>
  </security>
  <defaultDocument>
    <files>
      <clear />
      <add value="index.aspx" />
      <add value="index.html" />
    </files>
  </defaultDocument>
  <handlers>
    <add name="html" path="*.html" verb="*" type="System.Web.UI.PageHandlerFactory" />
    <add name="all" path="*" verb="*" modules="IsapiModule" scriptProcessor="%windir%\Microsoft.NET\Framework\v4.0.30319\aspnet_isapi.dll" resourceType="Unspecified" requireAccess="None" preCondition="classicMode,runtimeVersionv2.0,bitness32" />
    <remove name="ExtensionlessUrlHandler-Integrated-4.0" />
    <remove name="OPTIONSVerbHandler" />
    <remove name="TRACEVerbHandler" />
    <add name="ExtensionlessUrlHandler-Integrated-4.0" path="*." verb="*" type="System.Web.Handlers.TransferRequestHandler" preCondition="integratedMode,runtimeVersionv4.0" />
  </handlers>
  <staticContent>
    <!-- <mimeMap fileExtension=".mp4" mimeType="application/octet-stream" />
    <mimeMap fileExtension=".woff" mimeType="application/x-woff" />
    <mimeMap fileExtension="." mimeType="application/octet-stream" />
    -->
    <mimeMap fileExtension=".vue" mimeType="text/html" />
  </staticContent>
  <directoryBrowse enabled="false" />
  <httpErrors errorMode="Custom">
    <remove statusCode="404" />
    <error statusCode="404" prefixLanguageFilePath="" path="/index.aspx?c=Go404" responseMode="ExecuteURL" />
  </httpErrors>
  <rewrite>
    <rules>
      <!-- Block non-mainstream crawlers -->
      <rule name="Block Non-Mainstream Crawlers" stopProcessing="true">
        <match url=".*" />
        <conditions>
          <add input="{HTTP_USER_AGENT}" pattern="spider|scanner|curl|MegaIndex|MegaIndex.ru|BLEXBot|Qwantify|qwantify|semrush|Semrush|serpstatbot|hubspot|python|Bytespider|Go-http-client|Java|PhantomJS|SemrushBot|Scrapy|Webdup|AcoonBot|AhrefsBot|Ezooms|EdisterBot|EC2LinkFinder|jikespider|Purebot|MJ12bot|WangIDSpider|WBSearchBot|Wotbox|xbfMozilla|Yottaa|YandexBot|Jorgee|SWEBot|spbot|TurnitinBot-Agent|mail.RU|perl|Python|Wget|Xenu|ZmEu|^$" ignoreCase="true" />
          <add input="{HTTP_USER_AGENT}" pattern="Googlebot|Bingbot|Sogou|360Spider|Baiduspider" negate="true" />
        </conditions>
        <action type="AbortRequest" />
      </rule>
      <rule name="https" stopProcessing="true">
        <match url="(.*)" />
        <conditions>
          <add input="{HTTPS}" pattern="^OFF$" />
          <add input="{PATH_INFO}" pattern="^websystem" />
        </conditions>
        <action type="Redirect" url="https://{HTTP_HOST}/{R:1}" redirectType="Temporary" />
      </rule>
    </rules>
  </rewrite>
</system.webServer>


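Notes on the rule added above:

The following condition blocks a default set of unidentified spiders. To block additional spiders, append a keyword from their User-Agent to the pattern, separated by `|`: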

<add input="{HTTP_USER_AGENT}" pattern="spider|scanner|curl|MegaIndex|MegaIndex.ru|BLEXBot|Qwantify|qwantify|semrush|Semrush|serpstatbot|hubspot|python|Bytespider|Go-http-client|Java|PhantomJS|SemrushBot|Scrapy|Webdup|AcoonBot|AhrefsBot|Ezooms|EdisterBot|EC2LinkFinder|jikespider|Purebot|MJ12bot|WangIDSpider|WBSearchBot|Wotbox|xbfMozilla|Yottaa|YandexBot|Jorgee|SWEBot|spbot|TurnitinBot-Agent|mail.RU|perl|Python|Wget|Xenu|ZmEu|^$" ignoreCase="true" />

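The following condition lets the mainstream spiders through (Google, Bing, Sogou, 360, and Baidu). Conditions in a rule must all match by default, so a request is blocked only when its User-Agent matches the block list above and does not match this allow list. That is why Baiduspider and 360Spider still pass even though their names contain "spider":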

<add input="{HTTP_USER_AGENT}" pattern="Googlebot|Bingbot|Sogou|360Spider|Baiduspider" negate="true" />
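As a sketch of how the block list can be extended: suppose your logs show an unwanted crawler whose User-Agent contains the keyword `ExampleBot` (a hypothetical name used only for illustration; substitute the keyword you actually see). Adding it to the first condition might look like this:

```xml
<!-- "ExampleBot" is a hypothetical keyword added just before the final ^$ alternative. -->
<!-- ^$ matches an empty User-Agent, so requests that send no UA are blocked as well. -->
<add input="{HTTP_USER_AGENT}" pattern="spider|scanner|curl|MegaIndex|MegaIndex.ru|BLEXBot|Qwantify|qwantify|semrush|Semrush|serpstatbot|hubspot|python|Bytespider|Go-http-client|Java|PhantomJS|SemrushBot|Scrapy|Webdup|AcoonBot|AhrefsBot|Ezooms|EdisterBot|EC2LinkFinder|jikespider|Purebot|MJ12bot|WangIDSpider|WBSearchBot|Wotbox|xbfMozilla|Yottaa|YandexBot|Jorgee|SWEBot|spbot|TurnitinBot-Agent|mail.RU|perl|Python|Wget|Xenu|ZmEu|ExampleBot|^$" ignoreCase="true" />
```

Because matching is case-insensitive (`ignoreCase="true"`), one spelling of each keyword is enough. The pattern is a regular expression, so a literal dot in a keyword should be written as `\.` if you want it matched exactly.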


With the rewrite rule above in place, most spider crawling is blocked, which noticeably improves the stability of the site.
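The rule above uses `AbortRequest`, which drops the connection without sending a response. If you would rather have blocked clients receive an explicit HTTP status, one possible variant (our assumption, not part of the original configuration) is the URL Rewrite module's `CustomResponse` action. The status text below is illustrative only:

```xml
<!-- Drop-in replacement for <action type="AbortRequest" /> inside the -->
<!-- "Block Non-Mainstream Crawlers" rule; the two conditions stay unchanged. -->
<action type="CustomResponse" statusCode="403" statusReason="Forbidden" statusDescription="Crawler blocked by User-Agent rule" />
```

`AbortRequest` is the cheaper option under heavy crawler load, since no response has to be generated. A 403 can be easier to recognize when you are checking whether the rule fires as intended.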


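Of course, if conditions allow, we strongly recommend enabling a WAF for your site, for example by purchasing Alibaba Cloud's Web Application Firewall (WAF) product and configuring appropriate filtering rules.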



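Appendix: User-Agent names of the major spiders:

Google spider: googlebot

Baidu spider: baiduspider

Baidu mobile spider: baiduboxapp

Yahoo spider: slurp

Alexa spider: ia_archiver

MSN spider: msnbot

Bing spider: bingbot

AltaVista spider: scooter

Lycos spider: lycos_spider_(t-rex)

AllTheWeb spider: fast-webcrawler

Inktomi spider: slurp

Youdao spider: YodaoBot and OutfoxBot

Retu (热土) spider: Adminrtspider

Sogou spider: sogou spider

SOSO spider: sosospider

360 Search spider: 360spider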

