This section describes the syntax used for adding entries to the manual and global lists.
Any line beginning with a
# (hash) is a comment and will not be treated as a list entry. Comments appear in the Comment field on the filter-reject page and are applied to all following entries (until the next comment). Empty lines and lines containing only white-space are also ignored as non-entries.
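For instance, a small list fragment with a comment applying to the entries that follow it might look like this (the domains are illustrative):

```text
# Video sharing sites (this comment applies to both entries below)
youtube.com
vimeo.com
```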
To block an entire Web site, simply enter its domain. For example, the entry
google.com will block everything under google.com, including sub-domains such as
images.google.com. It is also possible to block an individual sub-domain; for example, entering images.google.com blocks only that sub-domain.
To block a single page or directory on a Web site, enter the URL up to the point at which the filter should stop matching. For example, to block all pages under
http://www.domain.com/directory, simply enter that URL into the list. The list parsing is intelligent enough to handle both complete URLs and partial URLs (such as domain.com/directory, with no scheme).
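Put together, the simple forms described above might appear in a list like this:

```text
# Block an entire site, including all of its sub-domains
google.com
# Block only one directory of a site (a partial URL; no scheme required)
www.domain.com/directory
```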
For ease of use, the www sub-domain is stripped from "simple" list entries, leaving the bare domain (e.g., the entry
www.google.com is interpreted the same as
google.com). If it is necessary to block the
www sub-domain specifically, the advanced syntax described below may be used.
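For instance, assuming the pattern is matched against the full URL of each request (as described in the next section), an advanced entry along these lines could target the www sub-domain without being simplified away (a sketch, not the only possible form):

```text
# Match only URLs on the www sub-domain of google.com
REGEX:google.com:^https?://www\.google\.com/
```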
Specific parts of a Web site or even ranges of sites can be blocked by using a regular-expression entry. These entries are somewhat more complicated, but also much more powerful. Two different styles of regular-expression entry are supported: the simplified
REGEX: style and the more advanced
PCRE: style; both make use of the Perl Compatible Regular Expressions (PCRE) engine and its pattern syntax. Equivalent examples of each entry style are provided below:
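For instance, the following two entries are equivalent; because simple REGEX: entries are case-insensitive by default, the PCRE: form needs the i modifier to match the same URLs. Both match the text watch anywhere in a URL on any youtube.com site:

```text
REGEX:youtube.com:watch
PCRE:youtube.com:/watch/i
```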
REGEX: style takes the form of three colon-separated fields: the
REGEX entry prefix, the host or domain name, and the PCRE pattern. All three fields are case-insensitive.
PCRE: style takes the form of three colon-separated fields: the
PCRE entry prefix, the host or domain name, and the PCRE pattern in a Perl-style delimited format. The pattern may optionally be followed by any combination of modifiers representing flags supported by the PCRE engine. Unlike the simplified syntax, the pattern in this type of entry is case-sensitive unless the
i modifier is appended.
PCRE:-style entries support delimited patterns similar to those used by Perl and PHP. Any non-alphanumeric, non-white-space, non-backslash character may be used as a delimiter. The Perl match operator
m is supported but not required. Examples of valid delimited patterns include
/foo/, #foo#, and m<foo>. Valid modifiers are
imsuxADJUX, of which only
imsU are currently supported. The
g modifier has no effect on match patterns and is silently dropped. An error will appear in the content-filter log if an invalid or unsupported modifier is used, but the entry (minus the bad modifier) will be accepted anyway.
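A few valid PCRE:-style entries, showing different delimiters and the optional m operator (the hosts and patterns are illustrative):

```text
# Slash delimiters with the i (case-insensitive) modifier
PCRE:youtube.com:/watch/i
# Hash delimiters, no modifiers (a case-sensitive match)
PCRE:youtube.com:#watch#
# The m operator with bracket-style delimiters
PCRE:youtube.com:m<watch>i
```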
Clearly, most users will prefer the
REGEX: style, since its syntax is far simpler and more forgiving. In either case, the specified pattern (the third field) is matched against the full URL of each request to the specified host (the second field). For example, the entry
REGEX:youtube.com:watch will match any URL containing the text
watch on any
youtube.com Web site.
For performance reasons, pattern matching is performed against "normalized" domains. As an example, the normalized domain for the entry
m.youtube.com is youtube.com; therefore, the entry will be matched against any YouTube sub-domain, not just
m.youtube.com. Alternatively, a wildcard (
*, or asterisk) can be used to apply a match to all domains (e.g.,
REGEX:*:porn will match any URL containing the text
porn on any Web site). Please note, however, that matching an entry against all domains does incur a performance penalty. The extent of this penalty depends on several factors, but on filters with many clients or many global wildcard entries, the effect can be quite significant. For this reason, entries of this type should be considered a last resort.
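In practice, prefer a host-scoped entry over a wildcard one whenever the target sites are known; the two entries already mentioned illustrate the difference:

```text
# Host-scoped: the pattern is tested only against requests to youtube.com
REGEX:youtube.com:watch
# Wildcard: the pattern is tested against every request (slower)
REGEX:*:porn
```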
The syntax of PCRE patterns is described fully in the PCRE documentation; however, the following can be used as a quick reference:
^ and $ match the beginning and end of a URL, respectively
( and ) treat a series of characters as a single group
* matches 0 or more of the preceding group/character
+ matches 1 or more of the preceding group/character
? matches exactly 0 or 1 of the preceding group/character
. matches any single character
[^/] matches any single character except for a slash (/)
\d matches any single digit (0-9)
\w matches any single word character (a letter, a digit, or an underscore)
\b matches the start or end of a word (the boundary between a word character and a non-word character)
\ can be used in front of any special character to treat it literally
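The first worked example is a plain domain entry:

```text
youtube.com
```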
This is a basic domain entry; it tells the content filter to blacklist or whitelist all pages on all Web sites that are part of the
youtube.com domain. This would match not only video pages but also every other page in the domain, such as the site's home page.
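The next worked example is a sub-domain entry:

```text
mail.google.com
```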
This is a basic sub-domain entry; it tells the content filter to blacklist or whitelist all pages on all Web sites which are part of the
mail.google.com domain. This would also match sub-domains further down; for example, it would affect
chatenabled.mail.google.com. It would not match any other Google sub-domain — for example,
images.google.com would be unaffected.
# Social networking
This is another basic domain entry; this time it is preceded by a comment. The
# Social networking line will not be interpreted as a list entry; however, it will appear in the Comment field on the filter-reject page. This is useful for explaining why a page has been blocked; it can also be used to (for example) give the name of the person who added the entry and/or the date they added it.
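The next worked example is a wildcard regular-expression entry:

```text
REGEX:*:porn
```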
The * after the first colon indicates that this rule is a "wildcard" entry; the content filter will try to match the pattern against every URL request that passes through it. As mentioned above, this incurs a certain performance hit, so it is important to use this type of rule only when absolutely necessary.
The porn at the end indicates that the entry should match if the text
porn is found anywhere in the URL. (Note that this entry will also match a word such as
anti-pornography, since it still contains the text porn.)
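The next worked example is a more involved wildcard entry; reconstructed from the component-by-component description that follows, it would take roughly this form:

```text
REGEX:*:^https?://[^/]+\.edu[:/]
```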
The * after the first colon indicates that this rule is a "wildcard" entry; the content filter will try to match the expression on every Web site that passes through it. Once again, this incurs a certain performance hit.
The ^ after the second colon is an anchor meaning the expression should match only at the very beginning of the URL (without this, the expression would match anywhere).
The ? means "zero or one of the preceding character"; in this case the preceding character is an s (as in https?), so the pattern matches both http and https URLs.
The [^/]+ matches one or more (
+) of any character that is not a slash (
[^/]). Matching only non-slash characters ensures that we only look at the first part of the URL (the domain).
\.edu matches the text
.edu. The backslash is necessary because, in regular expressions, a dot by itself means "any character."
The [^/]+ matches one or more (
+) of any character that is not a slash (
[^/]). Matching only non-slash characters ensures that we only look at the first part of the URL (the domain).
This rule would blacklist or whitelist all
.edu Web sites (
mit.edu and so on).
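The final worked example is a host-scoped regular-expression entry, again reconstructed from the description that follows:

```text
REGEX:reddit.com:\b(cat|dog)s?\b
```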
The reddit.com after the first colon indicates that this rule should be matched only against Web sites under the
reddit.com domain. Therefore, this rule would affect
reddit.com, ssl.reddit.com, and so on.
The \b(cat|dog)s?\b at the end indicates that the entry should match if any of the following whole words appears anywhere in the URL: cat, cats, dog, or dogs. (The
\b matches the start or end of a whole word; the
(x|y) structure means "either
x or y"; and the
s? means that the letter
s may or may not appear.)
Because of the "whole word" requirement, this rule would not match, for example, the word
bulldog. However, it would still match
dog-catcher, since the hyphen makes dog a separate word.