#!/usr/bin/perl -- # # CGIProxy 2.0.1 # # nph-proxy.cgi-- CGIProxy 2.0.1: a proxy in the form of a CGI script. # Retrieves the resource at any HTTP or FTP URL, updating embedded URLs # in HTML resources to point back through this script. By default, no # user info is sent to the server. Options include text-only proxying # to save bandwidth, cookie filtering, ad filtering, script removal, # user-defined encoding of the target URL, and more. Requires Perl 5. # # Copyright (C) 1996, 1998-2002 by James Marshall, james@jmarshall.com # All rights reserved. # # For the latest, see http://www.jmarshall.com/tools/cgiproxy/ # # # IMPORTANT NOTE ABOUT ANONYMOUS BROWSING: # CGIProxy was originally made for indirect browsing more than # anonymity, but since people are using it for anonymity, I've tried # to make it as anonymous as possible. Suggestions welcome. For best # anonymity, browse with JavaScript turned off. In fact, that's the # only reliable way, in spite of what certain anonymity vendors claim. # Anonymity is pretty good, but may not be bulletproof. For example, # if even a single JavaScript statement can be run, your anonymity can # be compromised. I've tried to remove JS from every place it can # exist, but please tell me if I missed any. Also, browser plugins or # other executable extensions may be able to reveal you to a server. # If you find any way your anonymity can be compromised even with scripts # turned off, please let me know. # # # CONFIGURATION: # # None required in most situations. On some servers, these might be # required (all in the "user configuration" section): # . If you're using another HTTP or SSL proxy, set $HTTP_PROXY, # $SSL_PROXY, and $NO_PROXY as needed. If those proxies use # authentication, set $PROXY_AUTH and $SSL_PROXY_AUTH accordingly. # . If this is running on an SSL server that doesn't use port 443, set # $RUNNING_ON_SSL_SERVER=1 (otherwise, the default of '' is fine). # # Options include: # . Set $TEXT_ONLY, $REMOVE_COOKIES, $REMOVE_SCRIPTS, $FILTER_ADS, # $HIDE_REFERER, and $INSERT_ENTRY_FORM as desired. Set # $REMOVE_SCRIPTS if anonymity is important. # . To let the user choose all of those settings (except $TEXT_ONLY), # set $ALLOW_USER_CONFIG=1. # . To change the encoding format of the URL, modify the # proxy_encode() and proxy_decode() routines. The default # routines are suitable for simple PATH_INFO compliance. # . To encode cookies, modify the cookie_encode() and cookie_decode() # routines. # . You can restrict which servers this proxy will access, with # @ALLOWED_SERVERS and @BANNED_SERVERS. # . Similarly, you can specify allowed and denied server lists for # both cookies and scripts. # . For security, you can ban access to private IP ranges, with # @BANNED_NETWORKS. # . If filtering ads, you can customize this with a few settings. # . To insert your own block of HTML into each page, set $INSERT_HTML # or $INSERT_FILE. # . As a last resort, if you really can't run this script as NPH, # you can try to run it as non-NPH by setting $NOT_RUNNING_AS_NPH=1. # BUT, read the notes and warnings above that line. Caveat surfor. # . For crude load-balancing among a set of proxies, set @PROXY_GROUP. # . Other config is possible; see the user configuration section. # . If heavy use of this proxy puts a load on your server, see the # "NOTES ON PERFORMANCE" section below. # # For more info, read the comments regarding any config options you set. # # This script MUST be installed as a non-parsed header (NPH) script. # In Apache and many other servers, this is done by simply starting the # filename with "nph-". It MAY be possible to fake it as a non-NPH # script, MOST of the time, by using the $NOT_RUNNING_AS_NPH feature. # This is not advised. See the comments by that option for warnings. # # # TO USE: # Start a browsing session by calling the script with no parameters. # You can bookmark pages you browse to through the proxy, or link to # the URLs that are generated. # # # NOTES ON PERFORMANCE: # Unfortunately, this has gotten slower through the versions, mostly # because of optional new features. Configured equally, version 1.3 # takes 25% longer to run than 1.0 or 1.1 (based on *cough* highly # abbreviated testing). Compiling takes about 50% longer. # Leaving $REMOVE_SCRIPTS=1 adds 25-50% to the running time. # Remember that we're talking about tenths of a second here. Most of # the delay experienced by the user is from waiting on two network # connections. These performance issues only matter if your server # CPU is getting overloaded. Also, these only matter when retrieving # HTML, because it's the HTML modification that takes all the time. # If you can, use mod_perl. Starting with version 1.3.1, this should # work under mod_perl, which requires Perl 5.004 or later. If you use # mod_perl, be careful to install this as an NPH script, i.e. set the # "PerlSendHeader Off" configuration directive. For more info, see the # mod_perl documentation. # If you use mod_perl and modify this script, see the note near the # "reset 'a-z'" line below, regarding UPPER_CASE and lower_case # variables. # # # TO DO: # What I want to hear about: # . Any HTML tags not being converted here. # . Any method of introducing JavaScript or other script, that's not # being filtered out here. # . Any script MIME types other than those already in @SCRIPT_MIME_TYPES. # . Any MIME types other than text/html that have links that need to # be converted. # # plug any other script holes (e.g. MSIE-proprietary, other MIME types?) # This could use cleaner URL-encoding all over ($base_url, etc.) # more error checking? # find a simple encryption technique for proxy_encode() # support more protocols, like mailto: or gopher: # For ad filtering, add option to disable images from servers other than # that of the containing HTML page? Is it worth it? # # # BUGS: # Anonymity may not not perfect. In particular, there may be some remaining # JavaScript holes. # URLs generated by JavaScript or similar mechanisms won't be re-proxy'ed # correctly. JavaScript in general may not work as expected. # Since ALL of your cookies are sent to this script (which then chooses # the relevant ones), some cookies could conceivably be dropped if # you accumulate a whole lot. I haven't seen this happen yet. # # # I first wrote this in 1996 as an experiment to allow indirect browsing. # The original seed was a program I wrote for Rich Morin's article # in the June 1996 issue of Unix Review, online at # http://www.cfcl.com/tin/P/199606.shtml. # # Confession: I didn't originally write this with the spec for HTTP # proxies in mind, and there are probably some violations of the protocol # (at least for proxies). This whole thing is one big violation of the # proxy model anyway, so I hereby rationalize that the spec can be widely # interpreted here. If there is demand, I can make it more conformant. # The HTTP client and server components should be fine; it's just the # special requirements for proxies that may not be followed. # #-------------------------------------------------------------------------- use strict ; use Socket ; # First block below is config variables, second block is sort-of config # variables, third block is persistent constants, fourth block is would-be # persistent constants (not set until needed), and last block is variables. use vars qw( $TEXT_ONLY $REMOVE_COOKIES $REMOVE_SCRIPTS $FILTER_ADS $HIDE_REFERER $INSERT_ENTRY_FORM $ALLOW_USER_CONFIG @ALLOWED_SERVERS @BANNED_SERVERS @BANNED_NETWORKS $NO_COOKIE_WITH_IMAGE @ALLOWED_COOKIE_SERVERS @BANNED_COOKIE_SERVERS @ALLOWED_SCRIPT_SERVERS @BANNED_SCRIPT_SERVERS @BANNED_IMAGE_URL_PATTERNS $RETURN_EMPTY_GIF $INSERT_HTML $INSERT_FILE $ANONYMIZE_INSERTION $FORM_AFTER_INSERTION $INSERTION_FRAME_HEIGHT $RUNNING_ON_SSL_SERVER $NOT_RUNNING_AS_NPH $HTTP_PROXY $SSL_PROXY $NO_PROXY $PROXY_AUTH $SSL_PROXY_AUTH $MINIMIZE_CACHING $SESSION_COOKIES_ONLY @PROXY_GROUP $USER_AGENT $USE_PASSIVE_FTP_MODE $SHOW_FTP_WELCOME $USE_POST_ON_START $REMOVE_TITLES $NO_BROWSE_THROUGH_SELF $NO_LINK_TO_START $MAX_REQUEST_SIZE $QUIETLY_EXIT_PROXY_SESSION $OVERRIDE_SECURITY $PROXIFY_SCRIPTS $PROXIFY_COMMENTS @SCRIPT_MIME_TYPES @OTHER_TYPES_TO_REGISTER @TYPES_TO_HANDLE $NON_TEXT_EXTENSIONS $PROXY_VERSION @MONTH @WEEKDAY %UN_MONTH @BANNED_NETWORK_ADDRS $RUNNING_ON_IIS @NO_PROXY $NO_CACHE_HEADERS @ALL_TYPES %MIME_TYPE_ID $SCRIPT_TYPE_REGEX $TYPES_TO_HANDLE_REGEX $THIS_HOST $ENV_SERVER_PORT $ENV_SCRIPT_NAME $THIS_SCRIPT_URL $HAS_BEGUN $CUSTOM_INSERTION $HTTP_VERSION $HTTP_1_X $URL $now $packed_flags $encoded_URL $doing_insert_here $env_accept $e_remove_cookies $e_remove_scripts $e_filter_ads $e_insert_entry_form $e_hide_referer $images_are_banned_here $scripts_are_banned_here $cookies_are_banned_here $scheme $authority $path $host $port $username $password $cookie_to_server %auth $script_url $url_start $url_start_inframe $url_start_noframe $is_in_frame $expected_type $base_url $base_scheme $base_host $base_path $base_unframes $default_style_type $default_script_type $status $headers $body $is_html $response_sent $debug ) ; # Under mod_perl, persistent constants only need to be initialized once, so # use this one-time block to do so. unless ($HAS_BEGUN) { #-------------------------------------------------------------------------- # user configuration #-------------------------------------------------------------------------- # If set, then proxy traffic will be restricted to text data only, to save # bandwidth (though it can still be circumvented with uuencode, etc.). $TEXT_ONLY= 0 ; # set to 1 to allow only text data, 0 to allow all # If set, then prevent all cookies from passing through the proxy. To allow # cookies from some servers, set this to 0 and see @ALLOWED_COOKIE_SERVERS # and @BANNED_COOKIE_SERVERS below. You can also prevent cookies with # images by setting $NO_COOKIE_WITH_IMAGE below. # Note that this only affects cookies from the target server. The proxy # script sends its own cookies for other reasons too, like to support # authentication. This flag does not stop these cookies from being sent. $REMOVE_COOKIES= 0 ; # If set, then remove as much scripting as possible. If anonymity is # important, this is strongly recommended! Better yet, turn off script # support in your browser. # On the HTTP level: # . prevent transmission of script MIME types (which only works if the server # marks them as such, so a malicious server could get around this, but # then the browser probably wouldn't execute the script). # . remove Link: headers that link to a resource of a script MIME type. # Within HTML resources: # . remove . # . remove intrinsic event attributes from tags, i.e. attributes whose names # begin with "on". # . remove where "type" attribute is a script MIME type. # . remove various HTML tags that appear to link to a script MIME type. # . remove script macros (aka Netscape-specific "JavaScript entities"), # i.e. any attributes containing the string "&{" . # . remove "JavaScript conditional comments". # . remove MSIE-specific "dynamic properties". # To allow scripts from some sites but not from others, set this to 0 and # see @ALLOWED_SCRIPT_SERVERS and @BANNED_SCRIPT_SERVERS below. # See @SCRIPT_MIME_TYPES below for a list of which MIME types are filtered out. # I do NOT know for certain that this removes all script content! It removes # all that I know of, but I don't have a definitive list of places scripts # can exist. If you do, please send it to me. EVEN RUNNING A SINGLE # JAVASCRIPT STATEMENT CAN COMPROMISE YOUR ANONYMITY! Just so you know. # Richard Smith has a good test site for anonymizing proxies, at # http://users.rcn.com/rms2000/anon/test.htm # Note that turning this on removes most popup ads! :) $REMOVE_SCRIPTS= 1 ; # If set, then filter out images that match one of @BANNED_IMAGE_URL_PATTERNS, # below. Also removes cookies attached to images, as if $NO_COOKIE_WITH_IMAGE # is set. # To remove most popup advertisements, also set $REMOVE_SCRIPTS=1 above. $FILTER_ADS= 0 ; # If set, then don't send a Referer: [sic] header with each request # (i.e. something that tells the server which page you're coming from # that linked to it). This is a minor privacy issue, but a few sites # won't send you pages or images if the Referer: is not what they're # expecting. If a page is loading without images or a link seems to be # refused, then try turning this off, and a correct Referer: header will # be sent. # This is only a problem in a VERY small percentage of sites, so few that # I'm kinda hesitant to put this in the entry form. Other arrangements # have their own problems, though. $HIDE_REFERER= 1 ; # If set, insert a compact version of the URL entry form at the top of each # page. This will also display the URL currently being viewed. # When viewing a page with frames, then a new top frame is created and the # insertion goes there. # If you want to customize the appearance of the form, modify the routine # mini_start_form() near the end of the script. # If you want to insert something other than this form, see $INSERT_HTML and # $INSERT_FILE below. # Users should realize that options changed via the form only take affect when # the form is submitted by entering a new URL or pressing the "Go" button. # Selecting an option, then following a link on the page, will not cause # the option to take effect. # Users should also realize that anything inserted into a page may throw # off any precise layout. The insertion will also be subject to # background colors and images, and any other page-wide settings. $INSERT_ENTRY_FORM= 1 ; # If set, then allow the user to control $REMOVE_COOKIES, $REMOVE_SCRIPTS, # $FILTER_ADS, $HIDE_REFERER, and $INSERT_ENTRY_FORM. Note that they # can't fine-tune any related options, such as the various @ALLOWED... and # @BANNED... lists. $ALLOW_USER_CONFIG= 1 ; # Create your own proxy_encode() and proxy_decode() to tranform the target # URL to and from the format that will be stored in PATH_INFO. The encoded # form should only contain characters that are legal in PATH_INFO. This # varies by server, but using only printable chars, no "?" or "#", and no # two adjacent slashes ("//") works on most servers. Don't let PATH_INFO # contain the strings "./", "/.", "../", or "/..", or else it may get # compressed like a pathname somewhere. Try not to make the resulting # string too long, either. # Of course, proxy_decode() must exactly undo whatever proxy_encode() does. # Make proxy_encode() as fast as possible-- it's a major bottleneck for the # whole program. # Because of the simplified absolute URL resolution in full_url(), there may # be ".." segments in the default encoding here, notably in the first path # segment. Normally, that's just an HTML mistake, but please tell me if # you see any privacy exploit with it. # Note that a few sites have embedded applications (like applets or Shockwave) # that expect to access URLs relative to the page's URL. This means they # may not work if the encoded target URL can't be treated like a base URL, # e.g. that it can't be appended with something like "../data/foo.data" # to get that expected data file. In such cases, the default encoding below # should let these sites work fine, as should any other encoding that can # support URLs relative to it. sub proxy_encode { my($URL)= @_ ; $URL=~ s#^([\w+.-]+)://#$1/# ; # http://xxx -> http/xxx # $URL=~ s/(.)/ sprintf('%02x',ord($1)) /ge ; # each char -> 2-hex # $URL=~ tr/a-zA-Z/n-za-mN-ZA-M/ ; # rot-13 return $URL ; } sub proxy_decode { my($enc_URL)= @_ ; # $enc_URL=~ tr/a-zA-Z/n-za-mN-ZA-M/ ; # rot-13 # $enc_URL=~ s/([0-9A-Fa-f]{2})/ sprintf("%c",hex($1)) /ge ; $enc_URL=~ s#^([\w+.-]+)/#$1://# ; # http/xxx -> http://xxx return $enc_URL ; } # Encode cookies before they're sent back to the user. # The return value must only contain characters that are legal in cookie # names and values, i.e. only printable characters, and no ";", ",", "=", # or white space. # cookie_encode() is called twice for each cookie: once to encode the cookie # name, and once to encode the cookie value. The two are then joined with # "=" and sent to the user. # cookie_decode() must exactly undo whatever cookie_encode() does. # Also, cookie_encode() must always encode a given input string into the # same output string. This is because browsers need the cookie name to # identify and manage a cookie, so the name must be consistent. # This is not a bottleneck like proxy_encode() is, so speed is not critical. sub cookie_encode { my($cookie)= @_ ; # $cookie=~ s/(.)/ sprintf('%02x',ord($1)) /ge ; # each char -> 2-hex # $cookie=~ tr/a-zA-Z/n-za-mN-ZA-M/ ; # rot-13 $cookie=~ s/(\W)/ '%' . sprintf('%02x',ord($1)) /ge ; # simple URL-encoding return $cookie ; } sub cookie_decode { my($enc_cookie)= @_ ; $enc_cookie=~ s/%([\da-fA-F]{2})/ pack('C', hex($1)) /ge ; # URL-decode # $enc_cookie=~ tr/a-zA-Z/n-za-mN-ZA-M/ ; # rot-13 # $enc_cookie=~ s/([0-9A-Fa-f]{2})/ sprintf("%c",hex($1)) /ge ; return $enc_cookie ; } # Use @ALLOWED_SERVERS and @BANNED_SERVERS to restrict which servers a user # can visit through this proxy. Any URL at a host matching a pattern in # @BANNED_SERVERS will be forbidden. In addition, if @ALLOWED_SERVERS is # not empty, then access is allowed *only* to servers that match a pattern # in it. In other words, @BANNED_SERVERS means "ban these servers", and # @ALLOWED_SERVERS (if not empty) means "allow only these servers". If a # server matches both lists, it is banned. # These are each a list of Perl 5 regular expressions (aka patterns or # regexes), not literal host names. To turn a hostname into a pattern, # replace every "." with "\.", add "^" to the beginning, and add "$" to the # end. For example, "www.example.com" becomes "^www\.example\.com$". To # match *every* host ending in something, leave out the "^". For example, # "\.example\.com$" matches every host ending in ".example.com". For more # details about Perl regular expressions, see the Perl documentation. (They # may seem cryptic at first, but they're very powerful once you know how to # use them.) @ALLOWED_SERVERS= () ; @BANNED_SERVERS= () ; # If @BANNED_NETWORKS is set, then forbid access to these hosts or networks. # This is done by IP address, not name, so it provides more certain security # than @BANNED_SERVERS above. # Specify each element as a decimal IP address-- all four integers for a host, # or one to three integers for a network. For example, '127.0.0.1' bans # access to the local host, and '192.168' bans access to all IP addresses # in the 192.168 network. Sorry, no banning yet for subnets other than # 8, 16, or 24 bits. # IF YOU'RE RUNNING THIS ON OR INSIDE A FIREWALL, THIS SETTING IS STRONGLY # RECOMMENDED!! In particular, you should ban access to other machines # inside the firewall that the firewall machine itself may have access to. # Otherwise, external users will be able to access any internal hosts that # the firewall can access. Even if that's what you intend, you should ban # access to any hosts that you don't explicitly want to expose to outside # users. # In addition to the recommended defaults below, add all IP addresses of your # server machine if you want to protect it like this. # After you set this, YOU SHOULD TEST to verify that the proxy can't access # the IP addresses you're banning! # This feature is simple now but will be more complete in future releases. # How would you like this to be extended? What would be useful to you? @BANNED_NETWORKS= ('127.0.0.1', '192.168', '10') ; # Settings to fine-tune cookie filtering, if cookies are not banned altogether # (by user checkbox or $REMOVE_COOKIES above). # Use @ALLOWED_COOKIE_SERVERS and @BANNED_COOKIE_SERVERS to restrict which # servers can send cookies through this proxy. They work like # @ALLOWED_SERVERS and @BANNED_SERVERS above, both in how their precedence # works, and that they're lists of Perl 5 regular expressions. See the # comments there for details. # If non-empty, only allow cookies from servers matching one of these patterns. # Comment this out to allow all cookies (subject to @BANNED_COOKIE_SERVERS). #@ALLOWED_COOKIE_SERVERS= ('\bslashdot\.org$') ; # Reject cookies from servers matching these patterns. @BANNED_COOKIE_SERVERS= ( '\.doubleclick\.net$', '\.preferences\.com$', '\.imgis\.com$', '\.adforce\.com$', '\.focalink\.com$', '\.flycast\.com$', '\.go\.com$', '\.avenuea\.com$', '\.linkexchange\.com$', '\.pathfinder\.com$', '\.burstnet\.com$', '\btripod\.com$', '\bgeocities\.yahoo\.com$', '\.mediaplex\.com$', ) ; # Set this to reject cookies returned with images. This actually prevents # cookies returned with any non-text resource. $NO_COOKIE_WITH_IMAGE= 1 ; # Settings to fine-tune script filtering, if scripts are not banned altogether # (by user checkbox or $REMOVE_SCRIPTS above). # Use @ALLOWED_SCRIPT_SERVERS and @BANNED_SCRIPT_SERVERS to restrict which # servers you'll allow scripts from. They work like @ALLOWED_SERVERS and # @BANNED_SERVERS above, both in how their precedence works, and that # they're lists of Perl 5 regular expressions. See the comments there for # details. @ALLOWED_SCRIPT_SERVERS= () ; @BANNED_SCRIPT_SERVERS= () ; # Various options to help filter ads and stop cookie-based privacy invasion. # These are only effective if $FILTER_ADS is set above. # @BANNED_IMAGE_URL_PATTERNS uses Perl patterns. If an image's URL # matches one of the patterns, it will not be downloaded (typically for # ad-filtering). For more information on Perl regular expressions, see # the Perl documentation. # Note that most popup ads will be removed if scripts are removed (see # $REMOVE_SCRIPTS above). # If ad-filtering is your primary motive, consider using one of the many # proxies that specialize in that. The classic is from JunkBusters, at # http://www.junkbusters.com . # Reject images whose URL matches any of these patterns. This is just a # sample list; add more depending on which sites you visit. @BANNED_IMAGE_URL_PATTERNS= ( 'ad\.doubleclick\.net/ad/', '\b[a-z](\d+)?\.doubleclick\.net(:\d*)?/', '\.imgis\.com\b', '\.adforce\.com\b', '\.avenuea\.com\b', '\.go\.com(:\d*)?/ad/', '\.eimg\.com\b', '\bexcite\.netscape\.com(:\d*)?/.*/promo/', '/excitenetscapepromos/', '\.yimg\.com(:\d*)?.*/promo/', '\bus\.yimg\.com/[a-z]/(\w\w)/\1', '\bus\.yimg\.com/[a-z]/\d-/', '\bpromotions\.yahoo\.com(:\d*)?/promotions/', '\bcnn\.com(:\d*)?/ads/', 'ads\.msn\.com\b', '\blinkexchange\.com\b', '\badknowledge\.com\b', '/SmartBanner/', '\bdeja\.com/ads/', '\bimage\.pathfinder\.com/sponsors', 'ads\.tripod\.com', 'ar\.atwola\.com/image/', '\brealcities\.com/ads/', '\bnytimes\.com/ad[sx]/', '\busatoday\.com/sponsors/', '\busatoday\.com/RealMedia/ads/', '\bmsads\.net/ads/', '\batdmt\.com/[a-z]/', ) ; # If set, replace banned images with 1x1 transparent GIF. $RETURN_EMPTY_GIF= 1 ; # If either $INSERT_HTML or $INSERT_FILE is set, then that HTML text or the # contents of that named file (respectively) will be inserted into any HTML # page retrieved through this proxy. $INSERT_HTML takes precedence over # $INSERT_FILE. # When viewing a page with frames, a new top frame is created and the # insertions go there. # NOTE: Any HTML you insert should not have relative URLs in it! The problem # is that there is no appropriate base URL to resolve them with. So only use # absolute URLs in your insertion. (If you use relative URLs anyway, then # a) if $ANONYMIZE_INSERTION is set, they'll be resolved relative to this # script's URL, which isn't great, or b) if $ANONYMIZE_INSERTION==0, # they'll be unchanged and the browser will simply resolve them relative # to the current page, which is usually worse.) # The frame handling means that it's fairly easy for a surfer to bypass this # insertion, by pretending in effect to be in a frame. There's not much we # can do about that, since a page is retrieved the same way regardless of # whether it's in a frame. This script uses a parameter in the URL to # communicate to itself between calls, but the user can merely change that # URL to make the script think it's retrieving a page for a frame. Also, # many browsers let the user expand a frame's contents into a full window. # [The warning in earlier versions about setting $INSERT_HTML to '' when using # mod_perl and $INSERT_FILE no longer applies. It's all handled elsewhere.] # As with $INSERT_ENTRY_FORM, note that any insertion may throw off any # precise layout, and the insertion is subject to background colors and # other page-wide settings. #$INSERT_HTML= "