Sunday, August 5, 2007

ASP.NET Request Logger and Crawler Killer

Shows a simple way to log requests and deny requests that come from <enter annoying bot name here>. Can easily be turned on or off with a database entry, without causing an app recycle.

If you have ever had a web site get hit in the middle of peak hours by a nasty crawler / bot that doesn't fully observe the robots exclusion standard, tying up your pages and generating huge amounts of database access, then you know you need good metrics to identify the problem.
This is a simple logging class that:
1) Grabs key information from each request and logs it into a SQL Server table.
2) Can be programmed to identify certain "nastybots" by their User-Agent string and reply with a 401 Access Denied.
3) Can easily be turned on and off by simply updating a row in a SQL Server database table, which will NOT cause an application restart.
The basic concept here is to intercept a request before Page processing or any database access has begun. The easiest way to do that is to handle the Application_PreRequestHandlerExecute event in Global.asax, where you can simply make a static class method call, like so:

protected void Application_PreRequestHandlerExecute(object sender, EventArgs e)
{
    RequestLogger.Logger.LogRequest(sender as HttpApplication);
}

When LogRequest is called, it checks two private fields, _loggingOn and _denyBots, and behaves accordingly. If _loggingOn is true, it grabs the items we want from the Request object and writes a row into your Requests SQL table. The list I have is short, but you can add many more items if your needs differ.
If _denyBots is true, it performs an IsCrawler check (ASP.NET's built-in Browser.Crawler detection plus a Regex of User-Agent test strings of your choosing) and issues a 401 Access Denied response to any request that matches, which stops the bot dead in its tracks before it can do any damage. Not even a Page object is created.
The class populates the two state fields through a method that checks the Cache and reloads from the database every ten minutes. So you can change the state in the database and be sure that within ten minutes the class will pick up the change and switch behavior, without recycling your app the way rewriting web.config or another watched file would.
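The SQL behind the switches isn't shown here, but dbo.GetRequestLogState and its backing table might look something like the sketch below. The RequestLogState table name, its column names, and the seed row are just placeholders; the only things the class actually depends on are the procedure name and getting back a single row with the logging flag first and the deny flag second.

-- Hypothetical single-row table holding the two switches.
-- Table and column names are placeholders; only dbo.GetRequestLogState
-- and the (logging first, deny second) column order come from the class code.
CREATE TABLE dbo.RequestLogState
(
    LoggingOn bit NOT NULL,
    DenyBots  bit NOT NULL
);

INSERT INTO dbo.RequestLogState (LoggingOn, DenyBots) VALUES (1, 0);
GO

CREATE PROCEDURE dbo.GetRequestLogState
AS
BEGIN
    SET NOCOUNT ON;
    -- Column order matters: the class reads GetBoolean(0) then GetBoolean(1).
    SELECT TOP 1 LoggingOn, DenyBots FROM dbo.RequestLogState;
END
GO

-- Flip the switches at run time; the class picks the change up
-- within ten minutes, when the cached values expire.
UPDATE dbo.RequestLogState SET LoggingOn = 1, DenyBots = 1;
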
Here's the code for the logging class:



using System;
using System.Configuration;
using System.Data;
using System.Data.SqlClient;
using System.Text.RegularExpressions;
using System.Web;
using System.Web.Caching;
 
namespace RequestLogger
{
    public static class Logger
    {
        private static bool _loggingOn=true;
        private static bool _denyBots=false;
        private static string _connectionString = ConfigurationManager.AppSettings["connectionString"];
 
        public static void LogRequest(HttpApplication app)
        {
            HttpRequest request = app.Request;
            EnsureSwitches(app);
            bool isCrawler = IsCrawler(request);

            if (_loggingOn)
            {
                // some bots send no User-Agent or Referer, so guard against nulls
                string userAgent = request.UserAgent ?? "";
                string requestPath = request.Url.AbsolutePath;
                string referer = request.UrlReferrer != null ? request.UrlReferrer.AbsolutePath : "";
                string userIp = request.UserHostAddress;
                string isCrawlerStr = isCrawler.ToString();

                using (SqlConnection cn = new SqlConnection(_connectionString))
                using (SqlCommand cmd = new SqlCommand("dbo.insertRequest", cn))
                {
                    cmd.CommandType = CommandType.StoredProcedure;
                    cmd.Parameters.AddWithValue("@UserAgent", userAgent);
                    cmd.Parameters.AddWithValue("@RequestPath", requestPath);
                    cmd.Parameters.AddWithValue("@Referer", referer);
                    cmd.Parameters.AddWithValue("@RemoteIp", userIp);
                    cmd.Parameters.AddWithValue("@IsCrawler", isCrawlerStr);
                    try
                    {
                        cn.Open();
                        cmd.ExecuteNonQuery();
                    }
                    catch (SqlException ex)
                    {
                        // this is just for quick debugging, can be commented out:
                        app.Response.Write(ex.Message);
                    }
                }
            }

            // deny crawlers even when logging has been switched off
            if (isCrawler && _denyBots)
                DenyAccess(app);
        }
 
        private static void EnsureSwitches(HttpApplication app)
        {
            // Refresh _loggingOn / _denyBots from the database at most once
            // every ten minutes, using the Cache expiration as the timer.
            if (app.Context.Cache["_loggingOn"] == null)
            {
                using (SqlConnection cn = new SqlConnection(_connectionString))
                using (SqlCommand cmd = new SqlCommand("dbo.GetRequestLogState", cn))
                {
                    cmd.CommandType = CommandType.StoredProcedure;
                    cn.Open();
                    using (SqlDataReader rdr = cmd.ExecuteReader())
                    {
                        if (rdr.Read())
                        {
                            _loggingOn = rdr.GetBoolean(0);
                            _denyBots = rdr.GetBoolean(1);
                        }
                    }
                }
                app.Context.Cache.Insert("_loggingOn", _loggingOn, null, DateTime.Now.AddMinutes(10),
                                         Cache.NoSlidingExpiration);
                app.Context.Cache.Insert("_denyBots", _denyBots, null, DateTime.Now.AddMinutes(10),
                                         Cache.NoSlidingExpiration);
            }
            else
            {
                _loggingOn = (bool) app.Context.Cache["_loggingOn"];
                _denyBots = (bool) app.Context.Cache["_denyBots"];
            }
        }
 
        private static void DenyAccess(HttpApplication app)
        {
            app.Response.StatusCode = 401;
            app.Response.StatusDescription = "Access Denied";
            app.Response.Write("401 Access Denied");
            app.CompleteRequest();
        }
 
 
        public static bool IsCrawler(HttpRequest request)
        {
            // Start with ASP.NET's built-in browser-capabilities detection.
            // To use this method as a deny list instead, change the next line
            // to "bool isCrawler = false;" and put only the bots you want to
            // deny in the Regex below.
            bool isCrawler = request.Browser.Crawler;
            // The built-in browser files miss several common crawlers,
            // so fall back to matching the User-Agent string.
            if (!isCrawler)
            {
                // add any additional known crawlers to the Regex below
                Regex regEx = new Regex("Slurp|Ask|Teoma", RegexOptions.IgnoreCase);
                isCrawler = request.UserAgent != null && regEx.IsMatch(request.UserAgent);
            }
            return isCrawler;
        }
    }
}
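
The dbo.insertRequest procedure and the Requests table aren't shown above either, so here is a rough sketch of what they might look like. The parameter names match the AddWithValue calls in the class; the table name, the column types, and the extra RequestId / RequestedOn columns are my own choices, so adjust them to taste.

-- Hypothetical logging table; column names mirror the stored procedure
-- parameters used by the class, everything else is an assumption.
CREATE TABLE dbo.Requests
(
    RequestId   int IDENTITY(1,1) PRIMARY KEY,
    RequestedOn datetime NOT NULL DEFAULT GETDATE(),
    UserAgent   nvarchar(512)  NULL,
    RequestPath nvarchar(1024) NULL,
    Referer     nvarchar(1024) NULL,
    RemoteIp    varchar(45)    NULL,
    IsCrawler   varchar(5)     NULL   -- the class passes "True" / "False" as text
);
GO

CREATE PROCEDURE dbo.insertRequest
    @UserAgent   nvarchar(512),
    @RequestPath nvarchar(1024),
    @Referer     nvarchar(1024),
    @RemoteIp    varchar(45),
    @IsCrawler   varchar(5)
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO dbo.Requests (UserAgent, RequestPath, Referer, RemoteIp, IsCrawler)
    VALUES (@UserAgent, @RequestPath, @Referer, @RemoteIp, @IsCrawler);
END
GO

If you would rather store IsCrawler as a bit, change the column and parameter types and pass isCrawler directly instead of isCrawler.ToString().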
