Cross-Server Server Side Includes (SSI) Solved?

UPDATED: Case Wiki page added: RemoteSSI.cgi

UPDATED: Jeremy Smith had some very nice input into the program design. I'm playing with the idea of restricting the SERVER_PROTOCOL environment variable to "INCLUDED" to get around any issues with having to set up a list of known hosts. This way, if you try to access the script directly or through some other HTTP method in an attempt to abuse it, it just won't work. The only way to use it is as an SSI, so abuse could only come from people who have logged in to upload their pages. I still haven't fully figured out Greg and Jeremy's caching idea yet, but I'm working on understanding that now. Updated original entry below...

I had a bit of inspiration today and solved the problem of dropping http://blog.case.edu content into http://www.case.edu pages.

#!/usr/bin/perl
use CGI qw(:standard); # provides header
use LWP::Simple; # provides getprint


###########################################################
## This work is licensed under the                       ##
## Creative Commons Attribution License.                 ##
## To view a copy of this license, visit                 ##
## http://creativecommons.org/licenses/by/2.0/           ##
## or send a letter to:                                  ##
##     Creative Commons                                  ##
##     559 Nathan Abbott Way                             ##
##     Stanford, California 94305, USA.                  ##
##                                                       ##
## Essentially this means you can do whatever            ##
## you want with the code as long as you credit          ##
## Grayden MacLennan (grayden.maclennan@case.edu)        ##
## as the original author.                               ##
##                                                       ##
## This document was last modified on September 21, 2005 ##
###########################################################


######
# remoteSSI.cgi
# by Grayden MacLennan
# grayden.maclennan@case.edu
#
# 2005-09-21
#
# This is a VERY simple program that grabs content from any
# http-accessible source and spits it out again.
#
# The original motivation behind this program was to allow
# me to do Server Side Includes (SSI) on one server while the
# included content sits on another server.  Normally, SSI only
# works within a local file system, so this program in effect
# gives a LOCAL path to a REMOTE resource.
#
#
# --Example of usage--
#
#  What you'd do for a normal local file:
#
#    <!--#include virtual="/somepath/includefile.inc" -->
#
#  What you'd LOVE to do but can't:
#
#    <!--#include virtual="http://remote.server.com/somepath/includefile.inc" -->
#
#  What you do to get around the problem:
#
#    <!--#include virtual="/cgi-bin/remoteSSI.cgi?url=http://remote.server.com/somepath/includefile.inc" -->
#
######

# Step 1 - print a header so the CGI won't throw an Internal Server Error 500
print header;

# Step 2 - if this is being used for SSI purposes, reprint anything it finds at the URL (bad URL generates nothing)
if ( ( $ENV{"SERVER_PROTOCOL"} || "" ) eq "INCLUDED" ) {
	my $url = param("url");
	getprint($url) if defined $url;   # skip the fetch when no url parameter was passed
}
else {
	print "Hey - this isn't a proxy server!";
}
# Step 3 - There is no Step 3!

Trackbacks

Trackback URL for this entry is: http://blog.case.edu/gtm4/mt-tb.cgi/2780

Cross Server SSIs
Excerpt: Here is my stab at it, Grayden. I restricted proxying to an explicitly allowed set of hosts, made printing...
Weblog: Jeremy Smith's blog
Tracked: September 22, 2005 04:23 PM

Comments

A good start. An even better approach is to use caching so you don't have to query this server for every page load. Considering the load on this server can get pretty high, the tradeoff in implementing the cache might be worth it. Every millisecond counts.

Posted by Gregory Szorc on September 22, 2005 05:08 AM

In addition to Greg's comment, I was thinking about this a bit more and realized that this is essentially an ad-hoc proxy, which could open up some interesting security questions. It could be used for hiding the true destination of your web browsing session, and as Greg mentioned, it could possibly also be used as a form of denial of service for the server that hosts the script if you flood it with large requests.

Thoughts on how to fix it to make it more robust/secure without losing the basic functionality? I suppose I could add some code that checks to make sure the forwarded URL is in a particular desired domain like case.edu, but is there a more elegant solution?
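
The domain check floated above could look roughly like the following. This is only a sketch: the names host_of, url_is_allowed and @ALLOWED_DOMAINS are made up for illustration, and it assumes that "in the case.edu domain" means case.edu itself plus any subdomain. URLs carrying a user:pass@ part are refused outright, since they are an easy way to disguise the real host.

```perl
use strict;
use warnings;

# Assumption: only case.edu and its subdomains may be included.
my @ALLOWED_DOMAINS = ('case.edu');

# Pull the lowercase host name out of an http(s) URL, or return undef.
sub host_of {
    my ($url) = @_;
    return undef unless defined $url;
    # The authority is everything between "http(s)://" and the next /, ? or #.
    my ($authority) = $url =~ m{\Ahttps?://([^/?#]*)}i;
    return undef unless defined $authority;
    return undef if $authority =~ /\@/;    # refuse user:pass@host tricks
    $authority =~ s/:\d*\z//;              # drop an optional port
    return undef unless $authority =~ /\A[A-Za-z0-9.-]+\z/;
    return lc $authority;
}

# True (1) if the URL's host is an allowed domain or a subdomain of one.
sub url_is_allowed {
    my ($url) = @_;
    my $host = host_of($url);
    return 0 unless defined $host;
    for my $domain (@ALLOWED_DOMAINS) {
        return 1 if $host eq $domain;
        return 1 if $host =~ /\.\Q$domain\E\z/;
    }
    return 0;
}
```

In the script, the call to getprint would then only run when url_is_allowed( param("url") ) returns true.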

UPDATE: Turned proxy capability OFF by restricting to SSI behavior only

Posted by Grayden MacLennan on September 22, 2005 12:57 PM

Hmm. Interesting. Lately I have mostly been looking at techniques and thinking about software engineering / maintainability issues. So this is kind of tangential.

This could be pretty useful for including across multiple servers you control. I'd discourage anybody from doing things like this:

1. Create a centralized code server for public use.
2. People pull code from the server at run time to power their applications.

The major advantage of that kind of model is that, if you work your caching right, you *always* have an up-to-date version of the code from the repository. The major disadvantage is that you may be screwed if the repository goes down, or ceases to exist, etc. I have this problem with Bugzilla right now, because apparently it has something that pulls graphics at run time (I don't know why) from some website that no longer provides them (at least not by the protocol used in my version of Bugzilla).

You're also eroding your security somewhat, as the host requesting the include is now vulnerable to attacks from the host providing the include. This can probably be overcome with some relatively simple crypto stuff (you could just sign all of the code you want to include with your private key, and then have the including host check that signature against your public key, I guess).
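
As a rough illustration of the "sign it, then check it" idea, here is a sketch that swaps the public-key signature for something simpler: an HMAC over the included content, keyed with a secret that both servers are assumed to share. That is a different technique from real public-key signing, and the function names sign_include and verify_include are invented for this example. Digest::SHA ships with Perl.

```perl
use strict;
use warnings;
use Digest::SHA qw(hmac_sha256_hex);

# Compute the tag the providing server would publish next to the content.
sub sign_include {
    my ($content, $secret) = @_;
    return hmac_sha256_hex($content, $secret);
}

# Return 1 if $tag matches $content under the shared secret, else 0.
sub verify_include {
    my ($content, $tag, $secret) = @_;
    return 0 unless defined $tag;
    my $expected = sign_include($content, $secret);
    return 0 unless length($tag) == length($expected);
    # Compare every character so timing does not reveal where the tags differ.
    my $diff = 0;
    $diff |= ord( substr $expected, $_, 1 ) ^ ord( substr $tag, $_, 1 )
        for 0 .. length($expected) - 1;
    return $diff == 0 ? 1 : 0;
}
```

The including host would fetch both the content and its tag, and print the content only when verify_include returns 1.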

I guess I would tend to prefer, where possible, a permanent local copy be maintained as a backup, so that you can automatically fall back on that copy if the included copy is not available. If you made the cache a backup/revision system, that could have uses too... you end up trading off between convenience of using a remote file vs. stability. Depends on what you're trying to accomplish, I suppose.
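
One way the cache-plus-fallback idea might be sketched: serve a cached copy while it is fresh, otherwise fetch from the remote server, and if that fails fall back to whatever stale copy is on disk. Everything here is an assumption made for illustration: the names cached_fetch, read_file and write_file, the one-file-per-URL cache layout, and the convention that the fetcher is a code reference returning the content or undef on failure.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use File::Spec;

# Slurp a file; returns '' if it cannot be read.
sub read_file {
    my ($file) = @_;
    open my $fh, '<', $file or return '';
    local $/;
    my $data = <$fh>;
    close $fh;
    return defined $data ? $data : '';
}

# Overwrite a file with new content; silently gives up if it cannot be opened.
sub write_file {
    my ($file, $content) = @_;
    open my $fh, '>', $file or return;
    print {$fh} $content;
    close $fh;
}

# Return content for $url: fresh cache first, then a live fetch, then a stale copy.
sub cached_fetch {
    my ($url, $cache_dir, $max_age, $fetch) = @_;
    my $file = File::Spec->catfile( $cache_dir, md5_hex($url) );

    # 1. A cache entry younger than $max_age seconds needs no network call.
    if ( -e $file && time - (stat $file)[9] <= $max_age ) {
        return read_file($file);
    }

    # 2. Otherwise ask the remote server and refresh the cache on success.
    my $content = $fetch->($url);
    if ( defined $content ) {
        write_file( $file, $content );
        return $content;
    }

    # 3. Remote server unavailable: fall back to the stale copy, if any.
    return -e $file ? read_file($file) : '';
}
```

With LWP::Simple, the fetcher could simply be sub { get( $_[0] ) }, since get returns undef on failure.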

Just some musings.

Posted by Seth Johnson on September 23, 2005 05:42 AM

I created http://wiki.case.edu/CaseBlog:HowTo#How_do_I_Include_Content_from_a_Blog.40Case_Web_Page_on_www.case.edu_Web_Page, as this is a common question that comes into blog-admin@case.edu. And I believe many people are going to welcome the use of this CGI.

Posted by jms18 on September 26, 2005 06:34 PM

The rewritten CGI currently supports pulling content from any server - does anyone see a need to limit the sources to blog.case.edu or to *.case.edu or so on?

One other problem with the current implementation is that it requires the "use" command, which is forbidden in the SafePerl environment available to individual Aurora account users. I'm working with Tom Sterin to get the CGI tossed into the public CGI folder in order to make it available as a server-wide resource, since the main CGI folder doesn't have those limitations.

There is also still a bit of code tweaking left to do. I'm still debating the caching idea since Aurora seems to do some caching of its own already, and I'd like Tom's input on that in terms of potential for load on Aurora. If we're going to restrict the includes to *.case.edu, then I'll also have to toss in some of the code from Jeremy's excellent whack at the problem.

I'll keep updating the body of the post as updates get made.

Posted by Grayden MacLennan on September 26, 2005 06:53 PM

You could toss this code up on the Case Wiki using the source code extension to make it print nicely.

Posted by Gregory Szorc on September 26, 2005 08:38 PM

I spoke with Tom today, and there are some definite concerns about server load that we still have to sort out. This was originally intended to be a quick fix for one person (me), but now I've sort of stepped ahead of myself by implying that this will be a useful resource for the whole campus without actually checking first to make sure that's ok. Tom and I are going to chat again next Wednesday to talk about how things should be handled, and if this will be doable on a large scale.

Posted by Grayden MacLennan on September 27, 2005 07:51 PM

"there are some definite concerns about server load"

What? What concerns?

Posted by jms18 on September 27, 2005 09:34 PM

"What? What concerns?"

Basically Tom is swamped and hasn't had a chance to look at the code yet or determine what it would do to Aurora. It's concern based on incomplete information rather than concern based on seeing what's there and not liking it.

Posted by Grayden MacLennan on September 27, 2005 09:50 PM

Okay, well, I've seen the code. It's not like it's trying to compute the Bayesian result of the fetched page or anything. It's a simple network call and print statement. I (probably) generate more load on a server when I check my email than this little script will generate in a day.

Posted by jms18 on September 27, 2005 10:39 PM

It's quite true that the processing load of the script is minimal, but will the LAG of having to wait around for a foreign server to get off its ass and return a snippet of text before building the final page and being able to complete the http request impact Aurora's performance? It seems to me that the server process responsible for each individual request would spend a lot of time (relatively speaking) just sitting around idly waiting for another server, but then again I'm not sure about how the OS or server software handle things like that. Thoughts?

Posted by Grayden MacLennan on September 28, 2005 12:27 AM

Timeout conditions can be controlled using LWP::UserAgent (instead of LWP::Simple), and timeout expenses can be mitigated using caching. In my code snippet, I manually set LWP::UserAgent's timeout to 10 seconds, so that would be the maximum amount of lag time introduced.
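
Jeremy's snippet is not reproduced in this post, so the following is only a minimal sketch of the same idea, with invented function names (build_agent, fetch_or_empty). One caveat: LWP::UserAgent's timeout counts seconds without activity on the connection, not total request time, so a slowly trickling response can in principle take longer than 10 seconds overall.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Build a user agent that gives up after $timeout seconds of inactivity.
sub build_agent {
    my ($timeout) = @_;
    return LWP::UserAgent->new( timeout => $timeout );
}

# Fetch $url and return its body, or '' on any failure (bad URL, timeout, 404...).
sub fetch_or_empty {
    my ($ua, $url) = @_;
    my $response = $ua->get($url);
    return '' unless $response->is_success;
    my $body = $response->decoded_content;
    return defined $body ? $body : '';
}

# In remoteSSI.cgi this would stand in for getprint, for example:
#   print fetch_or_empty( build_agent(10), param("url") );
```

Returning an empty string on failure keeps the original behaviour, where a bad URL adds nothing to the page.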

Posted by jms18 on September 28, 2005 02:06 AM
