Measure a Website's Recurring Readership with Bise
Traducciones al EspañolEstamos traduciendo nuestros guías y tutoriales al Español. Es posible que usted esté viendo una traducción generada automáticamente. Estamos trabajando con traductores profesionales para verificar las traducciones de nuestro sitio web. Este proyecto es un trabajo en curso.
Introduction
Bise is a command-line program that generates simple reports about a website’s regular readership size, as a concept distinct from total hits or unique visitors. It uses raw web server access logs as its input data, and bases its output on a number of user-configurable metrics.
Typical output looks like this:
April 19 - May 03
Source Uniques Regulars
---------------------------------------
All visitors 1227 179
RSS feed 232 111
JSON feed 8 2
Front page 426 54
From Twitter 39 1
From web searches 910 6
Bise is intended for use by bloggers and other people who self-host content on their own websites. It aims to complement more thorough visitor-analysis tools with this simple, specific report regarding estimated audience size.
In this guide you will:
Set up Bise for use on your Linux-based system
See readership-level reports based on your website’s access logs
Fine-tune Bise to match your posting frequency and desired readership data
Use Bise with Cron to have it mail you regular reports
Have Bise output JSON data that other programs can process
Regular Readers
Bise defines a regular reader as any visitor website who meets the following criteria:
They do not appear to be an automated indexer, crawler, or some other kind of bot.
They spent at least two sessions visiting your website, with each session separated by at least a day, during the last couple of weeks or so.
Bise disregards bots because it is interested only in human readership. It then gives special consideration to those who visit more than once, in order to separate one-time or very occasional visitors from readers who, through their repeated visits, show a deeper and sustained interest in the work that your website posts.
So, if your website’s logs indicate that a user at a certain IP address spent a few minutes on the first of the month clicking around your website a bit, and then returned to click around a bit more a week later, Bise would consider that user as a “regular” when analyzing your website’s readership during the first half of that month.
Before You Begin
To use Bise, you should have the following:
A website running on Apache, or another web server configured to write out its access logs in the Common Log Format. Visit the Apache section for help with installing Apache.
Access to those logs! Bise needs read-access to those log files in order to work. The If You Don’t Have Read-Access to the Logs section will provide suggestions if you don’t currently have read access.
Cpanminus, to install Bise’s prerequisite libraries.
Cron, if you plan to run Bise on a regular schedule. Any Linux machine almost certainly has this installed as well.
Note Any other scheduling software that can run command-line scripts for you will also work, but this guide will demonstrate using Bise with Cron, specifically.
Installing Bise
At the time of this writing, Bise lacks any kind of one-step installation solution. You will instead have to fetch it from its public source repository and manage the installation of its prerequisites yourself:
First, visit Bise’s page on GitHub and download or clone its source directory.
Alternately, use
git
on the command line to clone it locally:git clone https://github.com/jmacdotorg/bise.git
Note You can follow the How to Install Git guide ifgit
is not installed on your system.In your terminal, set your current working directory to your new
bise
directory:cd bise
Create a fresh configuration file by copying the example config that’s included in the cloned repository:
cp conf/conf-example.yaml conf/conf.yaml
Note By default, Bise expects to read its setup configuration from theconf/conf.yaml
location. This will be explained further in the Test Bise section.
Install Prerequisites
Install Bise’s prerequisites using cpanm
:
If you already have
cpanm
installed on your machine, you can run this command to automatically install all the Perl libraries that Bise needs:sudo cpanm --installdeps .
If you do not have
cpanm
installed, then you have two options:Install
cpanm
, as described in this Linode guide. Then, run the command described above.Run this command, which will load and run a temporary copy of
cpanm
and then proceed to install Bise’s dependencies:curl -fsSL https://cpanmin.us | perl - --sudo --installdeps .
sudo
command or the --sudo
option from the above commands. If you do, the libraries will be installed in your home directory’s perl5/
subdirectory, rather than installing them as root at system level. Doing so may require further configuration to allow perl
to load libraries from that location. When run without sudo
, the install command’s output will show this further guidance.Test Bise
The cloned repository contains a bin/
folder, and the Bise binary is located in this folder. Once you have completed the steps in the previous section, try running:
bin/bise
The program should run immediately, printing a table with a lot of zeros, and then exiting:
April 18 - May 02
Source Uniques Regulars
---------------------------------------
All visitors 0 0
RSS feed 0 0
JSON feed 0 0
Front page 0 0
From Twitter 0 0
From web searches 0 0
If you see something that looks like the above output, then you have successfully installed Bise’s prerequisite libraries and set Bise up with a default configuration file.
By default, Bise looks for a config file in ../conf/conf.yaml
, relative to its own location on the filesystem. In your cloned repository, you previously created a configuration file in this location. So, the command runs as expected.
You could further customize Bise’s installation by moving the executable file found in bin/bise
to some other location, such as /usr/local/bin
. You would then need to run Bise with its -c
command-line option. This option specifies a config-file path.
bin/bise
, within the copy of its cloned or downloaded source directory.Running Bise from the Command Line
Returning our attention to the output table from the previous section, we must admit that it doesn’t look very interesting, reporting only rows of zeroes. This happened because we did not pass it any logs to analyze! Let’s amend that in order to see some more meaningful data. Then we will proceed to explore Bise’s configuration file in the next section so that we can fine-tune its behavior.
To use Bise effectively, you need to:
- Determine the location of your website’s access logs, and then:
- Make sure you have read-access to them
Locating your Website’s Access Logs
The location of your website’s access logs varies by instance. On a typical Debian-based setup, Apache keeps its logs in /var/log/apache2/
. On CentOS, Apache logs are kept in /var/log/httpd/
. If your logs are not in either of these directories, then you should be able to determine their location through your web server software’s configuration files.
Within that directory, access logs (as opposed to error logs) have filenames that begin with access.log
. Past access logs will have a numerical suffix in the filename. Older access files may also be gzip
-archived, ending with a .gz
extension. These are the files that Bise wants to know about.
Running Bise with Logs
Bise accepts the location of your access logs as a command line argument. Bise is able to scan both plain-text and gzip
-archived log files.
Bise will scan the provided log files in order from newest to oldest. It will stop once it reaches reports from more than two weeks ago. You can use a fileglob to hand it all the access logs in your log directory. Bise will process only those files it needs to before delivering its report.
For example, this command will run Bise with all your Apache server’s access logs:
bin/bise /var/log/apache2/*access.log*
Bise may take a few moments to process this data, especially for websites that receive (and log) significant levels of traffic. On finishing its scan, Bise should print a table containing interesting non-zero numbers, like this:
April 19 - May 03
Source Uniques Regulars
---------------------------------------
All visitors 1227 179
RSS feed 232 111
JSON feed 8 2
Front page 426 54
From Twitter 39 1
From web searches 910 6
In this table, the Uniques
column expresses a count of unique IP addresses that don’t appear to belong to bots. The Regulars
column counts returning visitors meeting the criteria described earlier in this article.
Bise’s output can be customized. The six rows in this table are defined by conf/conf.yaml
. We’ll take a closer look at that file in the Configuring Bise section.
If You Don’t Have Read-Access to the Logs
By default, Apache keeps its log files visible to only administrative users. Your own user account might not have the right permissions to read them. Bise won’t work until you resolve this situation.
If you receive a Permission denied
error when attempting to view the contents of your machine’s log directory, then this is the case with your Apache setup:
ls -l /var/log/apache2/
There are several ways to address this. These two methods assume that you have sudo
rights on the machine:
You could run Bise as root, via the
sudo
command. This is a relatively safe procedure, since Bise has a strictly read-only relationship with its data.As a safer alternative, you could add yourself to the group that owns
/var/log/apache2/
. On Debian, this group is typicallyadm
. Executing this command should give you the necessary read-access to the log directory:sudo adduser [your-username] adm
Note After adding yourself to the group, you will need to log out of the system and log back in again. Then, you can run Bise successfully.
Configuring Bise
Open the file conf/conf.yaml
in your favorite text editor. As its main task, the file defines the rows appearing in Bise’s output table. This includes:
- Each row’s presence in Bise’s output
- The row’s label
- The criteria that each row uses to come up with its count of unique and regular non-bot visitors to your website
It also lets you define a couple of other, optional behavioral settings for Bise.
Default Configuration
Bise’s default configuration file defines six rows. On some lines, it includes comments that clarify its activity:
- File: conf/conf.yaml
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
reports: - label: All visitors test_type: path_regex test: | /$ # Match all requests whose paths end in '/'. |html$|htm$ # And all explicit requests for .html or .htm files. |xml$|json$ # And all requests for .xml (RSS) and .json (feed) files. - label: RSS feed test_type: path test: /atom.xml - label: JSON feed test_type: path test: /feed.json - label: Front page test_type: path test: / - label: From Twitter test_type: referer_regex test: \bt\.co\b # Match all reqs referred from Twitter's "t.co" URLs. - label: From web searches test_type: referer_regex test: \bgoogle.com|\bduckduckgo.com|\bbing.com
If you’re happy with the behavior of the default rows, you can certainly continue using them as-is! You can also modify or remove these report-row directives, or add new ones, depending upon your needs.
There are four kinds of rows you can define, each of which examines a different part of your access logs. These correspond to the values for the test_type
parameter: path, path_regex, referer_regex, and agent_regex.
Three of the row types involve the use of regular expressions. You should probably understand the basics of this text-processing technology before defining your own row definitions with any of these types.
Note also that Bise ignores whitespace in regular expressions, allowing you to write more complex regexes with inline comments, as one of the examples below will illustrate.
Let’s step through the file’s available test_type
configuration directives, and then examine the other configuration options.
test_type: path
Row definitions with a test_type
set to path
will count any access whose requested URL path matches the value of test
, exactly.
The following row definition will count any request for the path /
, and only that path, as a “Front page” access:
1 2 3
- label: Front page test_type: path test: /
test_type: path_regex
Counts any access whose requested URL path matches the value of test
, evaluated as a regular expression.
The following “All visitors” definition from the default configuration will match any request path that ends in an HTML, XML, or JSON filename, as well as any request ending in /
. This means that a request for /
, in the default configuration, will match both this row and the “Front page” one defined above.
1 2 3 4 5 6
- label: All visitors test_type: path_regex test: | /$ # Match all requests whose paths end in '/'. |html$|htm$ # And all explicit requests for .html or .htm files. |xml$|json$ # And all requests for .xml (RSS) and .json (feed) files.
As noted earlier, Bise’s regular expression processor ignores whitespace, allowing configuration files to add newlines and commentary in the middle of regexes like this.
test_type: referer_regex
Counts any access whose referer URL matches the value of test
, evaluated as a regular expression.
This line from the default configuration will count any visit that arrived by way of a t.co
-based URL as “From Twitter”. t.co
is Twitter’s own URL shortening service. Therefore, a matching request probably came from a link posted to Twitter.
1 2 3
- label: From Twitter test_type: referer_regex test: \bt\.co\b
test_type: agent_regex
Counts any access whose User-agent string matches the value of test
, evaluated as a regular expression.
This configuration (not found in the default file) would add a row to the output table describing visits from clients using Perl’s LWP toolkit:
1 2 3
- label: Using LWP test_type: agent_regex test: libwww-perl
Other configuration options
The configuration file lets you set these optional directives as well:
days_to_consider
: The number of days that Bise will examine when it scans logs. Defaults to 14 (that is, two weeks).regular_interval_days
: The minimum number of days in between a visitor’s earliest and most recent visits in order for Bise to count that visitor as a “regular” reader. Defaults to 1.
Running Bise as a cron Task
Once you have Bise creating meaningful reports about your website’s readership, consider having your system run it regularly. For example, you could automatically run the report once a week. The Cron utility can be used to schedule this task.
Cron’s normal behavior is to mail you anything a scheduled program prints as output or error messages. So, you can use Cron to receive periodic emails about your website’s readership levels.
For example, this crontab
line will run Bise at 12AM every Monday and mail you the results:
0 0 * * 1 /home/your-username/bise/bin/bise /var/log/apache2/access.log*
You will want to tune the precise syntax of the command to your own Bise setup, of course.
Getting JSON data instead of a table
Bise can output a JSON data structure instead of a plain-text table, allowing you to feed its results as data into other programs. Accomplishing this is as easy as running the bise
program with an additional -j
flag.
The output will look similar to this:
{
"start_time":"2020-04-20T18:02:18",
"reports":[
{
"uniques":1213,
"regulars":155,
"label":"All visitors"
},
{
"uniques":226,
"label":"RSS feed",
"regulars":103
},
{
"uniques":11,
"label":"JSON feed",
"regulars":2
},
{
"uniques":426,
"label":"Front page",
"regulars":46
},
{
"uniques":33,
"regulars":0,
"label":"From Twitter"
},
{
"uniques":917,
"label":"From web searches",
"regulars":5
},
{
"uniques":2,
"label":"Using LWP",
"regulars":0
}
],
"end_time":"2020-05-04T18:02:18"
}
More Information
You may wish to consult the following resources for additional information on this topic. While these are provided in the hope that they will be useful, please note that we cannot vouch for the accuracy or timeliness of externally hosted materials.
This page was originally published on