Cipafilter Documentation - List entry syntax
Posted by Jim Giseburt on 11 April 2017 04:32 PM

This section describes the syntax used for adding entries to the manual and global lists.

Any line beginning with a # (hash) is a comment and will not be treated as a list entry. Comments appear in the Comment field on the filter-reject page and are applied to all following entries (until the next comment). Empty lines and lines containing only white-space are also ignored as non-entries.
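The comment rules above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not the filter's actual parser; the function name parse_list is hypothetical, and the sketch assumes that leading white-space before the # is ignored, which the text above does not state.

```python
def parse_list(text):
    """Pair each list entry with the most recent comment above it."""
    entries = []
    comment = ""  # no comment has been seen yet
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # empty and white-space-only lines are not entries
        if stripped.startswith("#"):
            # A comment applies to every entry until the next comment.
            comment = stripped[1:].strip()
            continue
        entries.append((stripped, comment))
    return entries


sample = """\
# Social networking
reddit.com

# Video and mail
youtube.com
mail.google.com
"""

for entry, comment in parse_list(sample):
    print(f"{entry:<18} comment: {comment}")
```

Here youtube.com and mail.google.com both carry the second comment, because no other comment line separates them.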

To block an entire Web site, simply enter its domain. For example, the entry google.com will block everything at google.com, www.google.com, images.google.com, etc. It is also possible to block an individual sub-domain; for example, groups.google.com.

To block a single page or directory on a Web site, enter the URL up to the point at which the filter should stop matching. For example, to block all pages under http://www.domain.com/directory, simply enter that into the list. The list parser handles both complete URLs and partial URLs (such as www.domain.com/directory).

For ease of use, the www sub-domain is stripped from "simple" list entries, leaving the base domain (e.g., the entry www.google.com is interpreted the same as google.com). If it is necessary to block the www sub-domain specifically, the advanced syntax may be used.
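As a rough illustration of how simple entries behave, the following Python sketch strips the scheme and the www sub-domain, then compares hosts by suffix and paths by prefix. The helper names are hypothetical, and the exact comparison rules (in particular, treating the path as a plain prefix) are assumptions based on the description above rather than the filter's real implementation.

```python
def split_host_path(text):
    """Split a full or partial URL into (host, path), dropping scheme and www."""
    text = text.strip()
    lowered = text.lower()
    for scheme in ("http://", "https://"):
        if lowered.startswith(scheme):
            text = text[len(scheme):]
            break
    host, _, path = text.partition("/")
    host = host.lower()
    if host.startswith("www."):
        host = host[4:]  # "simple" entries ignore the www sub-domain
    return host, ("/" + path if path else "")


def simple_entry_matches(entry, url):
    """Assumed rule: host equals the entry or is a sub-domain; path is a prefix."""
    entry_host, entry_path = split_host_path(entry)
    url_host, url_path = split_host_path(url)
    host_ok = url_host == entry_host or url_host.endswith("." + entry_host)
    return host_ok and url_path.startswith(entry_path)


print(simple_entry_matches("google.com", "http://images.google.com/search"))
print(simple_entry_matches("groups.google.com", "http://images.google.com/"))
print(simple_entry_matches("http://www.domain.com/directory",
                           "www.domain.com/directory/page.html"))
```

Comparing against "." plus the entry host, rather than the bare host, keeps an entry such as google.com from matching an unrelated domain like notgoogle.com.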

Specific parts of a Web site or even ranges of sites can be blocked by using a regular-expression entry. These entries are somewhat more complicated, but also much more powerful. Two different styles of regular-expression entry are supported: the simplified REGEX: style and the more advanced PCRE: style; both make use of the Perl Compatible Regular Expressions (PCRE) engine and its pattern syntax. Equivalent examples of each entry style are provided below:

REGEX:domain.com:foo.*bar

The simplified REGEX: style takes the form of three colon-separated fields: the REGEX entry prefix, the host or domain name, and the PCRE pattern. All three fields are case-insensitive.

PCRE:domain.com:/foo.*bar/i

The advanced PCRE: style takes the form of three colon-separated fields: the PCRE entry prefix, the host or domain name, and the PCRE pattern in a Perl-style delimited format. The pattern may optionally be followed by any combination of modifiers representing flags supported by the PCRE engine. Unlike the simplified syntax, the pattern in this type of entry is not case-insensitive unless the i modifier is supplied, as in this example.

PCRE:-style entries support delimited patterns similar to those used by Perl and PHP. Any non-alphanumeric, non-white-space, non-backslash character may be used as a delimiter. The Perl match operator m is supported but not required. Examples of valid patterns include: /foo/, %foo%, m<foo>. Valid modifiers are imsuxADJUX, of which only imsU are currently supported. The g modifier has no effect on match patterns and is silently dropped. An error will appear in the content-filter log if an invalid or unsupported modifier is used, but the entry (minus the bad modifier) will be accepted anyway.
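The delimiter and modifier rules above can be sketched as a small Python parser. This is an approximation for illustration only: the function name parse_delimited is hypothetical, the pairing of bracket-style delimiters (such as < with >) is assumed from the m<foo> example and from Perl and PHP conventions, and the real filter may differ in details.

```python
SUPPORTED = set("imsU")  # modifiers the text above lists as supported
PAIRED = {"<": ">", "(": ")", "[": "]", "{": "}"}  # assumed bracket pairs


def parse_delimited(spec):
    """Return (pattern, kept_modifiers, rejected_modifiers)."""
    if spec.startswith("m"):
        spec = spec[1:]  # optional Perl match operator
    if len(spec) < 2:
        raise ValueError("pattern too short")
    opener = spec[0]
    if opener.isalnum() or opener.isspace() or opener == "\\":
        raise ValueError("invalid delimiter: %r" % opener)
    closer = PAIRED.get(opener, opener)
    end = spec.rfind(closer)
    if end <= 0:
        raise ValueError("missing closing delimiter")
    pattern, modifiers = spec[1:end], spec[end + 1:]
    kept = "".join(m for m in modifiers if m in SUPPORTED)
    # g is dropped silently; anything else unsupported would be logged.
    rejected = "".join(m for m in modifiers if m not in SUPPORTED and m != "g")
    return pattern, kept, rejected


for spec in ("/foo.*bar/i", "%foo%", "m<foo>", "/foo/gx"):
    print(spec, "->", parse_delimited(spec))
```

In the last case the g modifier disappears without comment, while x is reported as rejected, mirroring the rule that the entry is accepted minus the bad modifier.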

Clearly, most users will prefer the REGEX: style, since its syntax is far simpler and more forgiving. In either case, the specified pattern (the third field) is matched against the full URL of each request to the specified host (the second field). For example, the entry REGEX:youtube.com:watch will match any URL containing the text watch on any youtube.com Web site.

For performance reasons, pattern matching is performed against "normalized" domains. As an example, the normalized domain for the entry REGEX:m.youtube.com: is youtube.com; therefore, the entry will be matched against any YouTube sub-domain, not just m.youtube.com. Alternatively, a wildcard (*, or asterisk) can be used to apply a match to all domains (e.g., REGEX:*:porn will match any URL containing the text porn on any Web site). Please note, however, that matching an entry against all domains does incur a performance penalty. The extent of this penalty depends on several factors, but on filters with many clients or many global wildcard entries, the effect can be quite significant. For this reason, entries of this type should be considered a last resort.
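A minimal Python sketch of the matching behavior described above follows. It uses Python's re module as a stand-in for the PCRE engine, and the names normalized_domain and regex_entry_matches are hypothetical. The normalization shown (keeping the last two host labels) is a naive assumption for illustration; it would be wrong for domains such as example.co.uk, and the filter's real normalization is not documented here.

```python
import re
from urllib.parse import urlsplit


def normalized_domain(host):
    """Naive assumption: the normalized domain is the last two labels."""
    return ".".join(host.lower().split(".")[-2:])


def regex_entry_matches(entry, url):
    """Check a simplified REGEX: entry against a full URL."""
    # Split on the first two colons only; the pattern may contain colons.
    prefix, host, pattern = entry.split(":", 2)
    if prefix.upper() != "REGEX":
        raise ValueError("not a REGEX: entry")
    url_host = urlsplit(url).hostname or ""
    if host != "*" and normalized_domain(host) != normalized_domain(url_host):
        return False
    # REGEX: entries are case-insensitive.
    return re.search(pattern, url, re.IGNORECASE) is not None


print(regex_entry_matches("REGEX:youtube.com:watch",
                          "https://www.youtube.com/watch?v=abc"))
print(regex_entry_matches("REGEX:m.youtube.com:watch",
                          "https://accounts.youtube.com/watch"))
print(regex_entry_matches("REGEX:*:porn",
                          "http://example.com/anti-pornography"))
```

The second call returns True even though the entry names m.youtube.com, because both hosts normalize to youtube.com in this sketch.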

The syntax of PCRE patterns is described fully in the PCRE documentation; however, the following can be used as a quick reference:

  • ^ and $ match the beginning and end of a URL, respectively
  • ( and ) treat a series of characters as a single group
  • * matches 0 or more of the preceding group/character
  • + matches 1 or more of the preceding group/character
  • ? matches exactly 0 or 1 of the preceding group/character
  • . matches any single character
  • [^/] matches any single character except for /
  • \d matches any single digit (0-9)
  • \w matches any single word character (a-z, 0-9, and _)
  • \b matches the start or end of a word (the boundary between a word character and a non-word character)
  • \ can be used in front of any special character to treat it literally
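The quick reference can be tried out with Python's re module, which agrees with PCRE for the constructs listed above. The patterns and sample strings below are made up for illustration; each row holds a pattern, a test string, and whether a match is expected.

```python
import re

checks = [
    (r"^http", "http://example.com/", True),            # ^ anchors at the start
    (r"\.com/$", "http://example.com/", True),          # $ anchors at the end
    (r"(ab)+c", "xababc", True),                        # group repeated 1+ times
    (r"colou?r", "color", True),                        # ? makes the u optional
    (r"a.c", "abc", True),                              # . is any single character
    (r"//[^/]+/", "http://example.com/path", True),     # run of non-slash characters
    (r"\d\d", "page42", True),                          # two digits
    (r"\bcat\b", "/my-cat/", True),                     # whole word cat
    (r"\bcat\b", "vacation", False),                    # cat inside a longer word
    (r"a\.c", "abc", False),                            # escaped dot is literal
]

results = [bool(re.search(pattern, text)) == expected
           for pattern, text, expected in checks]
print(all(results))
```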

Blacklist examples

youtube.com

This is a basic domain entry; it tells the content filter to blacklist or whitelist all pages on all Web sites which are part of the youtube.com domain. This would not only match video pages, but also, for example, accounts.youtube.com and m.youtube.com.

mail.google.com

This is a basic sub-domain entry; it tells the content filter to blacklist or whitelist all pages on all Web sites which are part of the mail.google.com domain. This would also match sub-domains further down; for example, it would affect chatenabled.mail.google.com. It would not match any other Google sub-domain — for example, images.google.com would be unaffected.

# Social networking
reddit.com

This is another basic domain entry; this time it is preceded by a comment. The # Social networking line will not be interpreted as a list entry; however, it will appear in the Comment field on the filter-reject page. This is useful for explaining why a page has been blocked; it can also be used to (for example) give the name of the person who added the entry and/or the date they added it.

REGEX:*:porn

The * after the first colon indicates that this rule is a "wildcard" entry — the content filter will try to match the pattern against every URL request that passes through it. As mentioned above, this does incur a certain performance hit, so it is important to use this type of rule only when absolutely necessary.

The porn at the end indicates that the entry should match if the text porn is found anywhere in the URL. (Note that this entry will also match the word anti-pornography, for example, since it still contains porn.)

REGEX:*:^https?://[^/]+\.edu[:/]

The * after the first colon indicates that this rule is a "wildcard" entry: the content filter will try to match the expression against every URL request that passes through it. Once again, this does incur a certain performance hit.

The ^ after the second colon is an anchor that means the expression should only be matched at the very beginning of the URL (without this, the expression would match anywhere).

https?:// matches http:// or https:// (the ? means "zero or one of the preceding character" — in this case, the preceding character is an s).

[^/]+ matches one or more (+) of any character that is not a slash ([^/]). Matching only non-slash characters ensures that we only look at the first part of the URL (the domain).

\.edu matches the text .edu. The backslash is necessary because, in regular expressions, a dot by itself means "any character."

Finally, [:/] matches either a : or a /. This is useful to ensure that the pattern matches only at the very end of the domain name (otherwise, it might match a domain like example.education.com).

This rule would blacklist or whitelist all .edu Web sites (harvard.edu, mit.edu, and so on).
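The pattern from this example can be checked against a few sample URLs using Python's re module as a stand-in for the PCRE engine. The URLs are illustrative, and the IGNORECASE flag mirrors the case-insensitivity of REGEX: entries.

```python
import re

edu = re.compile(r"^https?://[^/]+\.edu[:/]", re.IGNORECASE)

samples = [
    "http://www.harvard.edu/",
    "https://mit.edu:8443/courses",
    "http://example.education.com/",
    "http://example.com/site.edu/",
]

matches = {url: bool(edu.search(url)) for url in samples}
for url, hit in matches.items():
    print(f"{url:<32} {'match' if hit else 'no match'}")
```

The last two URLs do not match: in example.education.com the text .edu is not followed by : or /, and in the final URL the .edu appears after the first slash, which [^/]+ cannot cross.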

REGEX:reddit.com:\b(cat|dog)s?\b

The reddit.com after the first colon indicates that this rule should be matched only against Web sites under the reddit.com domain. Therefore, this rule would affect www.reddit.com, ssl.reddit.com, and so on.

The \b(cat|dog)s?\b at the end indicates that the entry should match if any of the following whole words appear anywhere in the URL: cat, cats, dog, dogs. (\b matches the start or end of a whole word; the (x|y) structure means "either x or y"; and the s? means that the letter s may or may not appear.)

Because of the "whole word" requirement, this rule would not match, for example, the words vacation or bulldog. However, it would still match dog-catcher, since the hyphen makes two separate words.
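The whole-word behavior can be confirmed with Python's re module, used here as a stand-in for the PCRE engine. The reddit.com paths below are invented for illustration.

```python
import re

pets = re.compile(r"\b(cat|dog)s?\b", re.IGNORECASE)

samples = [
    "https://www.reddit.com/r/cats/",
    "https://www.reddit.com/r/vacation/",
    "https://www.reddit.com/r/bulldog/",
    "https://www.reddit.com/r/dog-catcher/",
    "https://www.reddit.com/r/dogs_rule/",
]

hits = {url: bool(pets.search(url)) for url in samples}
for url, hit in hits.items():
    print(f"{url:<40} {'match' if hit else 'no match'}")
```

The last URL does not match because the underscore counts as a word character, so there is no word boundary after dogs.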

©Cipafilter 2017. All Rights Reserved.