URL Matcher

From UntangleWiki

The URL Matcher Syntax describes all or part of a website.

| Example | Matches | Does not match |
|---|---|---|
| example.com | http://example.com/, http://www.example.com/, http://example.com/foo | http://example.net |
| example.com/bar | http://example.com/bar/test.html, http://www.example.com/bar | http://example.com/foo |
| *porn* | http://pornsite.com/ | http://foobar.com |
| example???.com/ | http://example123.com | http://example1.com |
| example.com/foo | http://example.com/foo, http://abc.example.com/foobar | http://example.com/ |
| *?.gov/* | http://www.whitehouse.gov/, http://www.opm.gov/ | http://www.govtoday.co.uk/ |

URL Matchers use globs, which are described in more depth in the Glob Matcher documentation.

Important notes:

  • The left side of the rule is anchored with the regular expression "^([a-zA-Z_0-9-]*\.)*". "foo.com" will match "foo.com" and "abc.foo.com" but not "afoo.com".
  • The right side of the rule is anchored with the regular expression ".*$". "foo.com" will match "foo.com/test.html" because it is actually "foo.com.*$". "foo.com/bar" is "foo.com/bar.*$", which will match "foo.com/bar/baz" and "foo.com/bar2". Likewise, "foo" becomes "foo.*", which will match "foobar.com" and "foo.com".
  • "http://" and "https://" are stripped from the rule.
  • URIs are case-sensitive, but domains are not. The URL Matcher is therefore case-sensitive, except that the domain is converted to lowercase before evaluation. Any part of the rule that should match against the domain must be written in lowercase.
  • "www." is automatically stripped from the rule. This prevents a frequent misconfiguration in which a user adds a block rule for something like "www.pornsite.com", which blocks "www.pornsite.com" but not plain "pornsite.com". If you truly want to match only www.pornsite.com and not pornsite.com, use "*www.pornsite.com"; the "*" matches zero or more characters, so the rule no longer begins with "www.".
  • Similarly, "*." is stripped from the rule for the same reason. If you truly want to match all subdomains but not the main domain, use "*?.foo.com".
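
The notes above can be sketched in Python as a rough model of how a rule might be turned into a regular expression. This is not the actual implementation: the function names are hypothetical, the order in which prefixes are stripped is an assumption, "?" is assumed to match exactly one character (as the table suggests), only "*" and "?" are treated as wildcards, and a URL without a path is assumed to be evaluated as "host/".

```python
import re

# Anchors quoted in the notes above.
LEFT_ANCHOR = r"^([a-zA-Z_0-9-]*\.)*"
RIGHT_ANCHOR = r".*$"
SCHEMES = ("http://", "https://")


def strip_scheme(text):
    """Remove a leading "http://" or "https://" if present."""
    for scheme in SCHEMES:
        if text.startswith(scheme):
            return text[len(scheme):]
    return text


def normalize_rule(rule):
    """Strip the scheme, then a leading "www.", then a leading "*."."""
    rule = strip_scheme(rule)
    for prefix in ("www.", "*."):  # stripping order is an assumption
        if rule.startswith(prefix):
            rule = rule[len(prefix):]
    return rule


def normalize_url(url):
    """Strip the scheme, lowercase the domain, keep the case of the path."""
    host, _, path = strip_scheme(url).partition("/")
    # Assumption: a URL with no path is evaluated as "host/".
    return host.lower() + "/" + path


def glob_to_regex(glob):
    """Translate "*" and "?"; every other character is matched literally."""
    parts = []
    for ch in glob:
        if ch == "*":
            parts.append(".*")  # zero or more characters
        elif ch == "?":
            parts.append(".")   # assumed: exactly one character
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


def rule_to_regex(rule):
    """Wrap the translated glob in the left and right anchors."""
    return LEFT_ANCHOR + glob_to_regex(normalize_rule(rule)) + RIGHT_ANCHOR


def url_matches(rule, url):
    """Return True if the normalized URL matches the anchored rule."""
    return re.match(rule_to_regex(rule), normalize_url(url)) is not None
```

Under these assumptions, url_matches("foo.com", "http://abc.foo.com") is true and url_matches("foo.com", "http://afoo.com") is false, and url_matches("*?.foo.com", "http://foo.com") is false because "?" needs at least one character in front of ".foo.com".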