Alistair Maclean's Web Site
Apache Course Notes

An Introductory Course in the Apache Web Server


Contents

1.1 What this course intends to show.
1.2 History
1.3 What is a web server
1.4 Versions
1.5 How is Apache laid out
1.5.1 Locations and Directories
1.5.1.1 Linux
1.5.1.2 Windows
1.5.2 Files
1.5.2.1 Linux
1.5.2.2 Windows
1.6 Configuration
1.6.1 The basics
1.6.2 Virtual hosting
1.6.2.1 Example
1.6.3 Server Side Includes (SSI)
1.7 Compiling
1.7.1 Binaries
1.7.2 Source
1.7.2.1 Configuring
1.7.2.2 Make
1.7.2.3 Make Install
1.8 Modules
1.9 Maintaining your server
1.9.1 Running
1.9.2 What log files show
1.9.3 Robots.txt
1.10 Adding more capability
1.10.1 PHP
1.10.1.1 Example
1.10.2 Cold Fusion
1.10.2.1 Example
1.10.3 PERL
1.10.4 CGI
1.10.4.1 Example
1.10.5 FrontPage Extensions
1.10.6 Security
1.10.6.1 SSL
1.10.6.2 Hiding Returns
1.10.6.3 .htaccess
1.11 Information Sources
1.12 Advanced Apache

 

1.1 What this course intends to show.

This course is intended for those who wish to implement an Apache-based web site. It will show what the critical parts of the system are and where to find them, how to configure them, some maintenance issues, and where to look when it comes time to improve the feature set of the web server.

It is expected that those taking the course will have some concept of what the internet is and will know certain terms, if not their implementation, e.g., DNS, HTML, TCP/IP.

The Apache organization recently released version 2.0 of the Apache Web Server. These notes make some reference to this product, though they have not been revised to fully account for it. This document therefore mainly covers Apache 1.3 releases on Windows and Linux platforms.

1.2 History

(From www.apache.org)

In February of 1995, the most popular server software on the Web was the public domain HTTP daemon developed by Rob McCool at the National Center for Supercomputing Applications, University of Illinois, Urbana-Champaign. However, development of that httpd had stalled after Rob left NCSA in mid-1994, and many webmasters had developed their own extensions and bug fixes that were in need of a common distribution. A small group of these webmasters, contacted via private e-mail, gathered together for the purpose of coordinating their changes (in the form of "patches"). Brian Behlendorf and Cliff Skolnick put together a mailing list, shared information space, and logins for the core developers on a machine in the California Bay Area, with bandwidth donated by HotWired. By the end of February, eight core contributors formed the foundation of the original Apache Group.

Using NCSA httpd 1.3 as a base, we added all of the published bug fixes and worthwhile enhancements we could find, tested the result on our own servers, and made the first official public release (0.6.2) of the Apache server in April 1995. By coincidence, NCSA restarted their own development during the same period, and Brandon Long and Beth Frank of the NCSA Server Development Team joined the list in March as honorary members so that the two projects could share ideas and fixes.

The early Apache server was a big hit, but we all knew that the codebase needed a general overhaul and redesign. During May-June 1995, while Rob Hartill and the rest of the group focused on implementing new features for 0.7.x (like pre-forked child processes) and supporting the rapidly growing Apache user community, Robert Thau designed a new server architecture (code-named Shambhala) which included a modular structure and API for better extensibility, pool-based memory allocation, and an adaptive pre-forking process model. The group switched to this new server base in July and added the features from 0.7.x, resulting in Apache 0.8.8 (and its brethren) in August.

After extensive beta testing, many ports to obscure platforms, a new set of documentation (by David Robinson), and the addition of many features in the form of our standard modules, Apache 1.0 was released on December 1, 1995.

Less than a year after the group was formed, the Apache server passed NCSA's httpd as the #1 server on the Internet.

1.3 What is a web server

 http protocol
 html pages
 Images
 Applications

A web server is a program that accepts a request for a document and serves up that document. The design of most web servers is covered by a standard called RFC 2616 (RFC = Request for Comments) that covers how servers are to respond to a protocol definition called HTTP/1.1 (HTTP = HyperText Transfer Protocol).

A web server is therefore a program that listens for requests and dishes out the requested file. It does nothing more in its simplest form; because of this, writing applications that operate using web servers can become quite complicated. A basic web server does not retain information about a page between requests by a user or client. Apache provides mechanisms to allow such data to be retained, as most real web servers do, through the use of various add-in modules.

Documents may be any form of file, though a number have been standardized. You will notice phrases like "text/html" or "image/jpeg" all over the place when chugging through HTML files and when doing work on web servers; these are the MIME types. MIME types help both the server and the clients understand the types of files being moved around. I have never seen a complete list of these definitions, and new ones are being invented all the time. Have a look in your configuration files (mime.types and magic) for the MIME types that your server understands.
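To make the idea of a MIME table concrete, here is a minimal sketch in Python of how a mime.types style file maps file extensions to types. It is not how Apache itself is written. The sample entries, the function names and the "text/plain" fallback are assumptions made for illustration; your own mime.types file will list many more types.

```python
# Sample text in the layout of a mime.types file: a MIME type followed by
# the file extensions mapped to it. These entries are illustrative only.
SAMPLE_MIME_TYPES = """\
# MIME type        Extensions
text/html          html htm
image/jpeg         jpeg jpg jpe
image/gif          gif
application/pdf    pdf
"""


def parse_mime_types(text):
    """Return a dict mapping each file extension to its MIME type."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        mime_type, *extensions = line.split()
        for ext in extensions:
            mapping[ext.lower()] = mime_type
    return mapping


def mime_type_for(filename, mapping, default="text/plain"):
    """Look up the MIME type for a file name by its extension.

    The default for unknown files is an assumption for this sketch.
    """
    _, dot, ext = filename.rpartition(".")
    if not dot:
        return default
    return mapping.get(ext.lower(), default)


if __name__ == "__main__":
    table = parse_mime_types(SAMPLE_MIME_TYPES)
    for name in ("index.html", "elf.JPG", "README"):
        print(name, "->", mime_type_for(name, table))
```

The server does a lookup of this kind before it sends a file, so that the browser knows whether it is being handed a page, an image or something else.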

Web servers, like Apache, can run applications, or at least applications can be launched as a result of a client accessing a page that requests an application be launched. In most cases such applications are written to make use of CGI (Common Gateway Interface), but in recent times more integrated methods have been defined and are in use. This change is due to performance and resource needs. CGI, while allowing programs to be written in almost any language, requires an initial process to be forked or spawned off, allowing the requested application to run. If the web server CGI application is being used by 100 people simultaneously, then 100 processes will be created, each with its own memory and processing needs. The new methods attempt to put in place a processing engine, within the Apache environment, allowing for just one process and reuse of resources, but with the downside of limiting the development languages that can be used to create these applications.

1.4 Versions

 Windows
 Linux / Unix
 Others
 httpd -V

Apache is available for many hardware platforms and operating systems. In almost all cases it runs similarly. Apache.org is not necessarily responsible for an implementation on a particular hardware/OS combination; that may be the responsibility of some third party, for instance the machine vendor (IBM, Sun).

The version we will use is the Linux release. Of the Windows and Linux releases, the Linux release is by far the more stable, and the recommended version. I would have to say that installing the Windows version is easier, though the Linux RPM versions are not too hard to handle.

Supported platforms include:

AIX, BeOS, BSD, Darwin, Digital Unix, HP-UX, IRIX, Linux, NetWare, OS/2, OS/390, QNX, Solaris, Win32, Mac, Mac OS X.

The current versions are 1.3.26 and 2.0.39 (as of July 2002).

To determine the version you are using:

/usr/sbin/httpd -V or /usr/local/apache2/bin/httpd -V

This will inform you of the version, various compile options, and the directories it thinks it is using. To print out all command line help use:

/usr/sbin/httpd -h

1.5 How is Apache laid out

 Directory structure
 Configuration files
 Executables
 Document root
 Logs
 What is where?

httpd.conf
httpd (executable)
httpd (script)
access_log

1.5.1 Locations and Directories

Probably the best way of finding Apache in a Linux system you are not familiar with is to do a find:

 find / -name httpd

This should locate all the httpd directories and executables. Look for something in a /bin or /sbin directory (the directory name may be preceded by any path). Run a located executable with the -V parameter to display its version and compile settings:

 httpd -V

This will show the compile options that specify where the configuration directory is (we hope). From there you should be able to rummage around through the configuration files to locate all the other file locations.

In Windows a search in explorer for Apache.exe should find the standard install (the path of which is something like d:\Program Files\Apache Group\Apache).

If executables have been renamed, or the system has been compiled in a non-standard way, neither of these methods may work immediately. At that point you may have to resort to solutions like hunting for running processes that show the characteristics of the web server (multiple forked processes with the same name and consecutive PID numbers).

1.5.1.1 Linux

This is a nasty issue. Because Linux distributions are up to the personal tastes of their creators, the locations of files and directories vary immensely. In general the following can be found:

Configuration files

 /etc/httpd/
  SuSE: they're right there
  RedHat: they're down another level in /etc/httpd/conf/

Binaries

 /usr/sbin

Where the server thinks the pages are

/home/httpd/
/usr/local/apache1.x.x/
/usr/httpd/
/var/httpd/
/var/www/
/opt/www/

Log files

/var/log/httpd

with a link to /etc/httpd/log or any log directory in the pages tree.

1.5.1.2 Windows

In Windows the entire directory structure is stored under:
d:\Program Files\Apache Group\Apache\

Where d:\ is some drive letter you choose at install time. The install directory tree can be altered at this juncture, but probably won't be.

1.5.2 Files

1.5.2.1 Linux

The primary configuration file is httpd.conf, though you will often see others such as srm.conf and access.conf. The latter two are included in most recent releases for completeness' sake only, as all configuration parameters are stored in httpd.conf.

In the Linux distribution the server executable is normally called "httpd"; it requires execute permissions and is normally in the /usr/sbin/ directory. Other notable executables from the distribution can include "suexec," "rotatelogs" and "apachectl," also stored in the /usr/bin/ directory, though not necessarily. I have noticed that apachectl seems only to be available if you build the server yourself; distros don't seem to include it.

By default, the log files you get are "error_log" and "access_log." If you alter your configuration files you can create "agent_log," "referer_log," and "transfer_log" files. There is also an option to integrate these (access, agent, referer) into a common "access_log" file.

1.5.2.2 Windows

The primary configuration file is httpd.conf, though you will often see others such as srm.conf and access.conf. The latter two are included in most recent releases for completeness' sake only.

In the Windows distribution the main server executable is a shell (apache.exe) that loads and launches "ApacheCore.dll."

By default, the log files you get are "error_log" and "access_log." If you alter your configuration files you can create "agent_log," "referer_log," and "transfer_log" files. There is also an option to integrate these (access, agent, referer) into a common "access_log" file.

There is a move afoot to have all modules in the Windows versions of Apache named the same as on all other platforms. This means that many of the loadable modules will be changing names from XXXX.DLL to XXXX.so.

1.6 Configuration

common / essential configuration options
 /usr/sbin/httpd -T
 Virtual hosts
 SSI

1.6.1 The basics

There are many configuration options in httpd.conf; many are unused by mere mortals. A sampling of those that are needed is given below. It is not exhaustive and should only be taken as a guide. httpd.conf is overflowing with comments, and most of the time these are VERY instructive. Of course, it's when you get to some that are not so good that you run into problems.

NB. To comment out a command - make it inactive - put a # symbol at the start of the line. Conversely to enable a command remove the # symbol that appears at the start of a line.

There are several sources of useful information, such as the Apache web site, and various books. Not all the books are entirely current though, so watch out: Caveat Emptor.

Rob McCool always gets mentioned!

Option: ServerType
What it does: Sets up how the server will be started. Always use standalone.
Example: standalone

Option: ServerRoot
What it does: Where the configuration files will be found.
Example: e:/Program Files/Apache Group/Apache or /etc/httpd

Option: Listen
What it does: The TCP port number the server will listen on for requests. Useful if you need to run multiple servers on a single machine.
Example: 80

Option: ServerAdmin
What it does: The email address of the person that gets all the flak when things blow up.
Example: santa@x-mas.org

Option: ServerName
What it does: The name by which the server is known.
Example: blitzen.x-mas.org

Option: DocumentRoot
What it does: Really important. The start of the directory tree where the HTML pages will reside.
Example: /home/httpd/htdocs

Option: <Directory "/home/httpd/htdocs">
What it does: The start of a section that defines how security and other options apply to the Document Root directory of the site.

Option: AccessFileName
What it does: The name of the file that defines how security is implemented in each folder. If a file of this name is found in a directory then its parameters are used instead of any others.
Example: .htaccess

Option: HostnameLookups
What it does: This controls whether the server tries to resolve the DNS names for the IP addresses that are making requests. Leave it off if you value time, your clients' time that is.
Example: off

Option: LogLevel
What it does: If you want to fill the logs with messages this is your way to do it. Changing this value through the settings will increase or decrease the verbosity of the server's log output.
Example: warn

Option: IndexOptions
What it does: If you so set up the server, a directory listing is created when there is no HTML file with the correct default name. I prefer to switch this feature off, not wanting folks to amble round my system freely, though visitors then get an error (403 Forbidden) instead of a listing.
Example: FancyIndexing

Option: <Location >
What it does: This is a block of configuration, commented out by default, that establishes a mechanism for checking on the status of the server. The /server-status block works through mod_status; the similar /server-info block works through mod_info.
Example: /server-status

Option: <VirtualHost >
What it does: This is another block command that allows the setting up of additional web sites on the same server. This is called virtual hosting.
Example: www.theelves.com

Having messed with the httpd.conf file, or any of the other files, you have two options: start the server and hope, or check the config files to see if they are OK. The latter method is done by using

/usr/sbin/httpd -T (Linux)
or
apache -T (Windows)

This will either show where the errors are, or display the message "Syntax OK."

If you have errors, fix them. Normally the server will not start if there are configuration issues.

1.6.2 Virtual hosting

Virtual hosting is a method of allowing the web server to play host to multiple web sites. These sites may have different names, or different IP addresses or both. The sites may also be on different ports. Most web hosting companies use this method to rapidly and efficiently host many web sites on a single server.

The central part to setting up a Virtual host is understanding how the virtual host blocks function in the configuration files. Any parameters that have already been declared in the configuration file that are not overridden by a similar configuration option in the virtual hosts blocks carry through. In more recent versions of Apache, you will notice in the configuration file (httpd.conf) that at some point it states that any parameter from that point can be used both in and out of the virtual host blocks. This is important as any setting you make from this point in the configuration file, that lies outside a virtual host block will be in effect for the virtual hosts, unless the virtual host block alters it.

The NameVirtualHost parameter names the IP address on which the server answers requests for name-based virtual hosts. It also gives you a way to catch browsers that don't support HTTP/1.1, or don't specify a domain name (just an IP address).

The name of each virtual host can be an IP address or a domain name. What happens depends on how the browser asks and how the virtual hosts are named:

A) The browser asks for a domain name and the virtual hosts are DNS names (www.name.com)
	the system jumps to the virtual host block and goes from there
B) The browser asks for an IP address and the virtual hosts are IP addresses (a.b.c.d)
	the system jumps to the virtual host block and goes from there
C) The browser asks for a domain name and the virtual hosts are IP addresses
	the system does a look up on a DNS to turn the name into an address
	then jumps to the virtual host entry
D) The browser asks for an IP address and the virtual hosts are domain names
	the system does a reverse look up on a DNS to turn the address into a name
	then jumps to the virtual host entry

By IP address request, consider a browser user doing:
	http://123.123.123.010/santasdirtysecrets.html
By domain request, consider a browser user doing:
	http://www.x-mas.org/santadirtysecrets.htm
			

1.6.2.1 Example

# Virtual Hosts blocks
NameVirtualHost 123.123.123.010

<VirtualHost www.x-mas.org>
	ServerName www.x-mas.org
	ServerAdmin santa@x-mas.org
	DocumentRoot /opt/www/x-mas/htdocs
	ErrorLog /var/log/httpd/x-mas/error_log
	TransferLog /var/log/httpd/x-mas/transfer_log
</VirtualHost>

<VirtualHost wishes.x-mas.org>
	ServerName wishes.x-mas.org
	ServerAdmin youhope@x-mas.org
	DocumentRoot /opt/www/wish/htdocs
	ErrorLog /var/log/httpd/x-mas/wishes.x-mas.org-error_log
	TransferLog /var/log/httpd/x-mas/wishes.x-mas.org-transfer_log
</VirtualHost>

<VirtualHost www.theelves.com>
	ServerName www.theelves.com
	ServerAdmin lordhighelf@theelves.com
	DocumentRoot /opt/www/theelves/htdocs
	ErrorLog /var/log/httpd/theelves.com-error_log
	TransferLog /var/log/httpd/x-mas/theelves.com-transfer_log
</VirtualHost>

This would set up three virtual hosts, all named and all operating on their own DNS registration names. It is also possible to change all the <VirtualHost > entries to be IP addresses. This may have its advantages, as some older browsers cannot handle the HTTP/1.1 feature that enables virtual hosting. If a browser cannot handle the virtual naming it will fall through to the DocumentRoot that has been specified for the whole server (should such an option exist), or to the Virtual host block specified by NameVirtualHost.
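The fall-through behaviour can be pictured with a short Python sketch. It is a simplified model and not Apache's own code. The table copies the three example blocks above, the function name is hypothetical, and treating the first block as the default is an assumption of this sketch.

```python
# Hypothetical table mirroring the example blocks: ServerName -> DocumentRoot.
# In this sketch the first entry doubles as the fall-through default.
VIRTUAL_HOSTS = [
    ("www.x-mas.org", "/opt/www/x-mas/htdocs"),
    ("wishes.x-mas.org", "/opt/www/wish/htdocs"),
    ("www.theelves.com", "/opt/www/theelves/htdocs"),
]


def select_document_root(host_header, virtual_hosts=VIRTUAL_HOSTS):
    """Pick a DocumentRoot from the Host header of a request.

    host_header is None when the browser sent no Host header
    (for example an old browser that does not speak HTTP/1.1).
    """
    if host_header:
        # Host names are case-insensitive and may carry a ":port" suffix.
        name = host_header.split(":", 1)[0].strip().lower()
        for server_name, document_root in virtual_hosts:
            if server_name == name:
                return document_root
    # No Host header, or no block with that name: use the first block.
    return virtual_hosts[0][1]


if __name__ == "__main__":
    for header in ("wishes.x-mas.org", "WWW.TheElves.com:80", None):
        print(header, "->", select_document_root(header))
```

A request that names wishes.x-mas.org lands in its own tree, while a request with no name at all ends up in the first block.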

Lastly, depending on the setup of your server, you will likely need to check that your /etc/hosts (d:\winnt\system32\drivers\etc\hosts) file has entries for each of your virtual domains. If the DNS you are talking to can quickly resolve the names, all well and good, but in case things are slow, add entries like:

    123.123.123.010          x-mas.org      theelves.com

to your hosts file.

1.6.3 Server Side Includes (SSI)

Server side includes (SSI) are a feature of the server that allows pieces of content needed by many pages to be integrated into the output by the server. SSI is useful to page designers as it cuts down on the material they have to create. All pages could, for instance, have the same heading and ending without the need to create every page with this information.

The mod_include module is generally responsible for server side includes, and in more recent distributions is enabled by default.
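As a rough illustration of what the server does with an include, the Python sketch below replaces each include directive in a page with the text of the named fragment. It is not mod_include itself and handles only the one directive shown. The dictionary standing in for files on disk, the function name and the sample fragments are all assumptions made for the example.

```python
import re

# Matches the SSI directive <!--#include virtual="/path" -->
INCLUDE_DIRECTIVE = re.compile(r'<!--#include\s+virtual="([^"]+)"\s*-->')

# Marker used in this sketch when a fragment cannot be found.
ERROR_MARKER = "[an error occurred while processing this directive]"


def expand_includes(page, fragments):
    """Replace each include directive in page with the named fragment.

    fragments maps a path to its text; it stands in for files on disk.
    """
    def substitute(match):
        return fragments.get(match.group(1), ERROR_MARKER)

    return INCLUDE_DIRECTIVE.sub(substitute, page)


if __name__ == "__main__":
    shared = {"/header.html": "<h1>X-Mas</h1>", "/footer.html": "<hr>"}
    source = (
        '<!--#include virtual="/header.html" -->\n'
        "<p>Wish list goes here</p>\n"
        '<!--#include virtual="/footer.html" -->'
    )
    print(expand_includes(source, shared))
```

Every page that carries the two directives gets the same heading and ending, and changing the fragment changes all of them at once.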

1.7 Compiling

general introduction

   ./configure
   needs
   options

If you need to hide the server, that is, make it not reply with its version, then you need to alter that in the compile process.

Apache is covered by an Open Source License, one aspect of which is that the source code is freely available. This means you can alter it, recompile it with new options or just poke about in it. In general, it would be unwise to alter it as you would lose any semblance of support offered by the community, unless you manage to do something pretty great and get it accepted as a new feature in the code base. But compiling it is a very real option, one that comes up far more frequently than you might think.

Versions of Apache are released in most Linux distributions, several versions of Unix and on other platforms, too. Some of these installations are OEM defaults, as in RedHat and even Solaris. The version in any particular distribution is likely to be out-of-date by the time you load the CD. New versions are released to fix security holes and bugs, so there is a real need to install new versions once in a while.

New versions come in two flavors: source and binary.

1.7.1 Binaries

If you have a supported OS version and there is an available binary, you might think that would be good enough. Not necessarily so. The binary will have been created with various options that may not compare to what came with your original setup, or a version of the server you compiled more recently. Install the binary on a spare machine to make sure that there are no issues before proceeding.

For RedHat RPM distributions:
    rpm -e apache.yourlast-version   Uninstall old
    rpm -i apache.new-version.rpm   Install new

For the tarball installs:
    tar -xzvf apache.new-version.tar.gz
        (x = expand, z = use gzip, v = be verbose, f = use the following file)

ZIP, EXE or MSI installs in Windows

For these installs execute an EXE install, expand a ZIP install and execute the setup if it's not done automatically, or right click on the MSI file and select Install from the menu.

Once you have dug around to find your new Apache, check it to see if it has what you need. The version check is useful to determine what options were used in its creation. If you can get it running, the /server-info function will also identify which modules have been built in to the binary. Alter the configuration files to add any of the features you had installed in your old system, record everything you did, and redo it on the main server.

1.7.2 Source

For most operating systems, releases only come in source form, which, though intimidating, is not as bad as it first appears. So many before us have had to compile this and other Open Source projects on so many occasions that the process for doing the compile is now fairly straightforward.

You will get a tar file or a ZIP file of the source, this needs to be extracted and placed somewhere. There are many places this can be done, in your home directory is one, or in a common location. Remember that you are going to create a replacement web server, so you probably don't want to do this on the working, production, server box.

For Linux you will need to have installed the development packages, giving you the gcc compiler and its various libraries. On Windows, the readme suggests that you can use Microsoft C++ and some Borland compilers (not yet the free command line BC 5), but I think the system is set up to make use of the free GNU C compiler.

When you install the source it will create a number of directories that contain the source and various Make files.

The compile is a three step process:
    Configure the options you want
    Make the system with these options
    Create the install

There is a way of generating the install in such a manner that it is easier to build your own install ZIP or tar (see the Make Install section below).

1.7.2.1 Configuring

To configure the install you need to specify the parameter you want to change and the value you want to change it to. The documentation lists a significant number of the settings.

Use ./configure --help to list all the options the interface can help you with. It's a long list but worth chugging through. Additionally, a help file called "INSTALL" has more information on doing compile configuration, though not all of it seems up to date - I have in mind here references to PHP.

Alternatively, you can edit one of the header files (include/httpd.h) and make gross changes to the configuration from there. I suspect that this is frowned upon in high Apache circles, but it is one of the few places that allows you to set all options permanently.

1.7.2.2 Make

There really isn't a great deal to do here, except sit back and bite your nails. If there are errors you will find out rapidly. If you do get errors, there is normally a good reason for it. Remember that the chances that you are the first user ever to use a certain command parameter on the operating system are small to non-existent, so assume you are wrong before sending flame mail to an Apache newsgroup.

1.7.2.3 Make Install

One issue I have here is that on pressing enter after this innocuous command anything you already have installed is blown away by the new install - Blam! just like that. You should always have backups, as they say.

If you want to offset the whole install so that it can be tarred or zipped, to be placed on another system, you can override the install processing using something like the following:

    make install-quiet root=/tmp/apache_root/

This creates the system but with all the files now relative to the directory you specified.

You may notice that several extra executables are created. In my instance these included apachectl, a program that allows you to test the initialization of Apache in a number of ways, as well as "suexec," log handlers and the programs that create the security password files for use with .htaccess (or equivalents).

1.8 Modules

    installing
    configuring
    status

Apache is designed to allow add-on modules to be included in two distinct ways:

    Built-in (static, compiled in -- SO)
    Add-on (dynamic -- DSO)

Some add-ons allow either of these two methods to be implemented, while other add-ons recommend one manner over the other. The only way to find out the preferred method is to read the documentation for the particular add-on. If you intend to add many static modules over a period of time you will become very familiar with rebuilding Apache with an ever growing list of parameters. Most, though not all, modules are written in C. Apache has a large API that developers can use to build modules that serve to fix various problems (like mod_speling [sic], which allows the server to make guesses at misspelt URLs or page names a user is asking for).

In general, adding a new module will have the following stages:

    Download the code (few are binaries)
    Determine the best mode of operation (compiled in or dynamic)
    Move code to a suitable location (often dictated by documentation)

Either,
    Set Apache parameters and recompile, including the new module
    Re-install Apache with the new module
Or
    Compile the module
    Install new dynamic module to appropriate location

Then, in both cases,
    Restart Apache
    Test Apache to make sure it still works
    Add some test pages to exercise module
    Test module
    Notify whomever that the module is now in place and ready for use

1.9 Maintaining your server

    running
    error_log
    access_log
    referer_log
    agent_log
    transfer_log
    Common logs
    Analysis and statistical tools
    favicon.ico      The bane of log readers
    tail -f command
    log rotation

1.9.1 Running

Running your Apache server is a task that varies by operating system. In Windows the latest version of the install (1.3.17) allows Apache to be run as a service. After the install you will find that the Win2K or NT Services applet has an entry for Apache, and that after reboot the server will be running.

In Linux, assuming you have a standard distribution, there is frequently a series of scripts in a directory under /etc/. This directory changes places by distribution, but in RedHat and SuSE it's in
      /etc/rc.d/init.d
The file you are looking for goes variously by the name of "apache" or "httpd." It is a script file; you can view it by using "more httpd." The script takes three parameters: start, stop and restart. Restart will stop then start the server. These scripts do all the necessary look up to find the process IDs when you are trying to stop the server. Use them; they are very good.

Generally, in Linux, Apache starts up at boot time. You should find a reference to "httpd" in one of the files in the start up directories (rc3.d or rc5.d).

If you get errors on startup it's generally because you failed to set the directory permissions correctly for the server to get at its files, or the configuration file has an error. Note that to start Apache on the standard port (80) you have to be root.

1.9.2 What log files show

The Log files display various details of the operation of the server. Each log file has its own duties and shows its own style of data.

Log File        Displays

error_log       Error messages and things like start up and termination messages
access_log      IP addresses, access time and the page requested
referer_log     The place the client came from and the page requested
agent_log       The type of browser or search engine that accessed the site
transfer_log    This is a clone of the access log

Should you prefer, the access_log, referer_log and agent_log can all be integrated into one file. If you look at the configuration files you will see an entry for "combined." Uncomment it and put everything in one bucket. Error data still goes to the error_log though.

Using the configuration files you can change the names of all these log files. It's probably not worth it to do this as several statistical analysis tools default to these names.

    An error_log entry:

[Sun May 07 14:12:03 2000] [error] [client 192.168.1.36] File does not exist: e:/www/phone.gif
			

    An access_log entry

127.0.0.1 - - [06/Apr/2000:14:10:24 -0700] "GET /default2.htm HTTP/1.1" 200 162
			

Having got megabytes of log file, what do you do with it? To begin with you will likely saunter through them with your eyes glazing over at all the IP addresses. You will probably be asked to say how many pages have been accessed, or whether search engines are getting to the site. These questions and many more can be sorted out using various log analysis tools. Due to the widespread nature of Apache, there are several tools that you can use. Some of these cost money and generally do a nice job; others are free and vary in quality of output. I have been using a tool called Analog (www.analog.cx).
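For a feel of what such tools do underneath, here is a small Python sketch that picks apart access_log lines of the form shown above and counts successful requests per page. It is not a replacement for a real analysis tool. The function names are hypothetical, and counting only status 200 is an assumption made to keep the example short.

```python
import re
from collections import Counter

# Common Log Format: host ident user [time] "request" status bytes
LOG_LINE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_access_line(line):
    """Return the fields of one access_log line as a dict, or None."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    # The request is normally "METHOD path PROTOCOL".
    parts = fields["request"].split()
    fields["path"] = parts[1] if len(parts) >= 2 else ""
    return fields


def count_pages(lines):
    """Count successful (status 200) requests per requested path."""
    hits = Counter()
    for line in lines:
        fields = parse_access_line(line)
        if fields and fields["status"] == 200:
            hits[fields["path"]] += 1
    return hits


if __name__ == "__main__":
    sample = '127.0.0.1 - - [06/Apr/2000:14:10:24 -0700] "GET /default2.htm HTTP/1.1" 200 162'
    print(count_pages([sample, sample]).most_common())
```

Feeding it the lines of a real access_log (for instance by opening the file and passing the file object) gives a quick tally of the most requested pages.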

A word of warning. In the configuration of the server you may be tempted to use the "HostnameLookups" option to cause all those 123.453.231.010 addresses to be resolved to something meaningful (like grinch.northpole.com) and have a clearer idea of who is calling. Unfortunately, while this is not too bad for those people that are coming through some proxy for which there is a DNS entry, those accessing you from, say, a cable modem or a DSL connection with a fixed IP not listed on a DNS will find that the requested page takes forever (a few seconds) to be displayed. The DNS request has to time out before the page is displayed, and you get a log entry that is still just an IP number.

I wrote a small program that looks at the log file and, at your leisure rather than that of the browsing public, goes off and tries to replace the IP addresses with resolvable names.
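The program itself is not reproduced in these notes. As a minimal sketch of the same idea, and not the original program, the Python below replaces the address at the start of each log line with a name where one can be found. The function names are hypothetical, and the resolver is passed in as a parameter so the logic can be tried without touching a real DNS.

```python
import re
import socket

# Dotted-quad address at the start of a log line.
LEADING_IP = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3})\b")


def lookup_name(ip):
    """Reverse-resolve ip; return the address unchanged if there is no name."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return ip


def resolve_log_lines(lines, resolver=lookup_name):
    """Yield log lines with the leading IP address replaced by a host name.

    Each address is resolved once and cached, since the same clients
    tend to appear many times in a log.
    """
    cache = {}
    for line in lines:
        match = LEADING_IP.match(line)
        if match:
            ip = match.group(1)
            if ip not in cache:
                cache[ip] = resolver(ip)
            line = cache[ip] + line[match.end():]
        yield line


if __name__ == "__main__":
    # A stand-in resolver so the example runs without a network.
    names = {"192.168.1.36": "grinch.northpole.com"}
    sample = ['192.168.1.36 - - [07/May/2000:14:12:03 -0700] "GET / HTTP/1.0" 200 10']
    for resolved in resolve_log_lines(sample, resolver=lambda ip: names.get(ip, ip)):
        print(resolved)
```

Because the look ups happen after the fact, a slow or missing DNS entry costs only your own time, not the visitor's.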

If you are lucky, you will get to see a perennial error in your error_log file that involves a file not being found. The file is favicon.ico. It's not your page designers that have gone nuts; it is a person using a Microsoft IE 4.0+ browser who is bookmarking your fabulous site for future reference. To get rid of the error either create a null file with this name and place it in the root of the web site, or get a 32x32 pixel, 256 color GIF image, name it favicon.ico and place it in the root of the web site. The icon so created will show up as an image in the user's browser when they bookmark your site.

To some people there is nothing more satisfying than to see activity on their web site through the constantly growing logs. In Linux the tool for this is just part of the package; if you are using NT or Windows 2000, you need to install one of the Posix toolkits to get it:
    tail -f error_log

It is a simple command, which I prefer to put on the access_log or the referer_log file as these give me more fun; the joyless ones like to know about errors first. This will follow (-f) the growing log files, displaying any new entries almost as they are made.

Log rotation is a maintenance chore that can be automated to an extent. Using cron (or the Schedule service and the AT command on NT) you can arrange for the various logs to be dealt with at the end of some time period. Because the logs can become quite large, it is generally advised to rotate and compress them. On Linux this is usually handled by a utility called logrotate (in /usr/sbin/), which comes with most distributions rather than with Apache itself; the Apache distribution includes its own rotatelogs program for piped logs. logrotate whips out the log file, renames it and creates a new empty replacement, with Apache restarted for a moment so that it starts writing to the new file. It is generally invoked through cron, and there are scripts on the web you can use to modify the rotation process, including compression of the old file to save space.
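To make the rename, replace and compress steps concrete, here is a small sketch of them. It is not how logrotate itself is written; the function name rotate_log and the date-stamp naming scheme are assumptions for the example, and telling the server to reopen its logs afterwards is deliberately left out.

```python
import gzip
import os
import shutil
import time


def rotate_log(path, stamp=None):
    """Rename path to path.<stamp>, gzip it, and leave an empty file at path.

    Returns the name of the compressed archive.
    """
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    rotated = "%s.%s" % (path, stamp)
    os.rename(path, rotated)          # move the old log out of the way
    open(path, "a").close()           # create the empty replacement
    with open(rotated, "rb") as src, gzip.open(rotated + ".gz", "wb") as dst:
        shutil.copyfileobj(src, dst)  # compress to save space
    os.remove(rotated)
    return rotated + ".gz"
```

A running server keeps its log file open, so in practice the rename has to be followed by a restart of Apache before the new, empty file starts to fill.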

1.9.3 Robots.txt

    What is it & what does it do?
    How is it configured?

The "robots.txt" file is a file that web search engines consult to limit their searches. Compliance is voluntary: I suppose someone could create an unscrupulous search engine that indexed everything on your site, but in general purveyors of search technology have a hard enough time just indexing the sites they can hit. If you don't have a "robots.txt" file then you will start to see "file not found" messages for it in your error_log file.

Robots.txt is an ASCII text file, placed in the root of the web site, containing a line for each directory you don't want the search engine spiders to look at.

user-agent: *
Disallow: /mydirectory/
Disallow: /my-other-directory/
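To see how a well-behaved spider would read these rules, the sketch below feeds them to the robots.txt parser in Python's standard library. The spider name "ExampleBot" and the helper function allowed are invented for the illustration.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /mydirectory/
Disallow: /my-other-directory/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(path, agent="ExampleBot"):
    """True if a well-behaved spider called agent may fetch path."""
    return parser.can_fetch(agent, path)
```

Here allowed("/mydirectory/page.html") comes back false while allowed("/index.html") comes back true. Nothing stops a badly behaved spider from ignoring the file altogether.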
			

1.10 Adding more capability

    PHP
    Cold Fusion
    Perl
    CGI
    FrontPage Extensions
    Security
        SSL
        port 443
        Hiding returns, see compilation
        .htaccess

1.10.1 PHP

PHP is a scripting language that looks a lot like C. It is implemented as a module that can either be compiled into Apache or, preferably, added dynamically through the APXS interface. PHP is a lightweight engine that runs well with Apache; it is fast and generally seems quite reliable. Other than compiling the PHP module and copying it to the libraries directory your Apache uses, you will need to modify the httpd.conf file to cause PHP pages to be sent to the engine for interpretation. There are generally two modifications to make:

LoadModule       Loads the PHP module into the server when it starts
AddType          Adds a MIME definition that indicates the file extension to be used

You will likely have to compile your PHP installation. The latest version (4.0.6) compiles in a similar manner to Apache, with the same three step process (configure, make, make install); under Linux this produces one file (libphp4.so). For Windows you download an already compiled version, install it, and set the appropriate parameters in the Apache configuration files. Note also that PHP under Windows has a php.ini file that needs to be tweaked a little.

1.10.1.1 Example

(in test.php)

<html>
<head>
<title>Test</title>
</head>
<body bgcolor="white">
     <?php
     for ($i=8; $i<20; $i++)
     {
        echo "<br><font style=\"font-size:".$i."pt\">Hello</font>";
     }
     ?>
</body>
</html>
			

1.10.2 Cold Fusion

Cold Fusion is a commercial product made by Allaire (latterly Macromedia) that fills some of the same space that PHP does. Under Windows, Cold Fusion makes use of IIS, but in Linux it can make use of Apache and a few other web servers. Cold Fusion is an interpreted scripting language; its syntax looks more like HTML on steroids than a traditional programming language.

Even for Linux, the Cold Fusion install is entirely binary. There is no compiling to be done, just make sure that you have a supported distribution or the correct libraries.

1.10.2.1 Example

(in test.cfm)

<html>
<head>
<title>Test</title>
</head>
<body bgcolor="white">
    <cfloop  index="i" from="8" to="20">
        <cfoutput>
            <BR>
            <FONT style="font-size:#i#pt">Hello</font>
        </cfoutput>
    </cfloop>
</body>
</html>
			

1.10.3 mod_perl

While you can run Perl through CGI scripts, it is not always the most efficient way of doing it, as a process is forked and the Perl engine invoked for each CGI request. This can lead to resource issues and slow performance. mod_perl is an effort to redress these problems: with it installed, the Perl engine is built into the server and stays loaded between requests, which reduces resource consumption and leads to faster start up of the mod_perl application. The library has many features, and add-ons to this add-on include the ability to add ASP (Active Server Pages) functionality to your Apache server.

Unlike plain Perl CGI scripts, mod_perl applications can, with such add-ons, be coded directly into HTML pages.

1.10.4 CGI

The Common Gateway Interface (CGI) was the original means of getting applications to run on web servers; even today you can install most of the additional language features listed here as CGI based tools. CGI makes use of the standard Input/Output channels in the Operating System. To send data to a CGI application, the server writes it out and the CGI application reads it on STDIN; the application writes its reply on STDOUT, from where the server returns it to the browser. CGI applications are therefore limited only by the ability of the language they are written in to use these standard I/O channels.

To get the most performance possible, many CGI applications are written in C or C++. Essentially the CGI application is written as a filter program.

CGI also specifies a means of getting at certain of the server's environment parameters (the query string of the request, for instance); this allows the CGI application to better understand the environment it is in. Lest we forget: web servers are stateless - they don't remember anything between user actions. Your CGI program will therefore have to be sent much data, and return it, if you envisage building systems using it.

1.10.4.1 Example

in HTML file

<html>
<head>
    <title>Test CGI</title>
</head>
<body bgcolor="white">
    <a href="myapp?name=ThisText">Click Me</a>
</body>
</html>
			

in C module

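#include <stdio.h>
#include <stdlib.h>
int main(void)
{
    /* A query containing "=" arrives in QUERY_STRING, not in argv */
    const char *query = getenv("QUERY_STRING");
    /* A CGI response begins with a header and a blank line */
    printf("Content-type: text/plain\n\n");
    /* Your code goes here */
    printf("%s", query ? query : "");    /* display the input string */
    return 0;
}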
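
The same filter idea carries over to any language that can read the environment and write to STDOUT. As a rough sketch, here is a Python counterpart that picks the "name" field out of the query string sent by the link above; the function name build_response is made up for the example, and a real application would want more care over what it echoes back.

```python
import os
import sys
from urllib.parse import parse_qs


def build_response(environ):
    """Build a complete CGI reply (header block, blank line, body)."""
    fields = parse_qs(environ.get("QUERY_STRING", ""))
    name = fields.get("name", [""])[0]
    return "Content-type: text/plain\r\n\r\n" + name + "\n"


if __name__ == "__main__":
    # The server sets QUERY_STRING; the reply goes back on STDOUT.
    sys.stdout.write(build_response(os.environ))
```

Keeping the reply in a function that takes the environment as an argument makes it easy to try out without a web server at all.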
			

1.10.5 FrontPage Extensions

We live in a world in which
1) Microsoft is a major player
2) Not everyone wants to know how to write HTML
This has led to a product line by Microsoft called FrontPage. It is a web site creation tool. It's got many fancy bells and whistles, including the ability to use pre-built functions (hover buttons for instance) and to upload web sites to the web server. This last function is easy to implement on NT or Win2K but a trifle harder on Linux. Note, I am not sure if there is a way to set up FrontPage on an Apache server that is hosted on a Windows machine - I think MS expects it all to be done through IIS - but you could always try.

There seem to be several ways of allowing your users or clients to make use of FrontPage (other than telling them to use FTP and learn to code the way Real Geeks do). Firstly, Microsoft has a web page and various downloads to implement FrontPage on Unix and Linux servers.

    http://msdn.microsoft.com/workshop/languages/fp/2000/unixfpse.asp

There is an installation shell script, a patch script, and a 14Mb install file.

Secondly, there is an organization called RtR (http://www.rtr.com/fpsupport/) that seems to have a FrontPage extensions add-in that is independent of MS (but then again it may not be).

Thirdly, there are various links around the web for FrontPage modules for Apache. How these work is left to you to determine.

The Microsoft version of the FrontPage Extensions installs and updates Apache. In older versions of Apache (1.3.6 for instance) the install even modified (replaced) the Apache executables - a feature I am not fond of.

With the extensions installed, a FrontPage user will be able to connect to your Apache site, upload and download their pages and make use of all the special features that FrontPage enables.

1.10.6 Security

1.10.6.1 SSL

SSL is a security standard that Netscape introduced. It has two particular features:
    You will use port 443
    Traffic between you and the client browser will be (lightly) encrypted

There is tons of information about implementing SSL on Apache, as there are dozens of configuration features. Suffice it to say that the SSL settings sit alongside the standard configuration settings (with mod_ssl, for instance, a secure site is an ordinary <VirtualHost> block on port 443 with SSL directives such as SSLEngine added to it).

1.10.6.2 Hiding Returns

There are many tools and systems on the internet that are used to find out about servers; some are benign, some not so. If you really want a web server to sit on the internet and NEED to have some world wide authority determine that you are running an Apache server version 1.2.3.4, then leave it alone, because that information is readily accessible. On the other hand you may not want to be so brazen. In recent months an Internet worm called Ramen has been doing the rounds, zapping web sites running Red Hat distributions using security holes in services such as wu-ftpd. Consider that Apache, while not directly responsible for this security issue, could be in some future attack, and you realise why paranoia is your friend.

In the section on compilation, I mentioned that you could edit the httpd.h file. In here is a constant called SERVER_BASEVERSION that sets the server's version number, and is the value passed back when a client queries the server. By changing this value you can at least confuse, for a short time, anyone trying to get detailed information about the configuration of your system. Additionally you can set the configuration file value ServerTokens to limit the information emanating from the server in its response headers.
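
To see what is actually being given away, look at the Server line in the response headers. The sketch below pulls that line out of a raw HTTP response; the two sample responses are hypothetical, made up to show roughly what a talkative server and one restricted with a setting such as "ServerTokens Prod" might send, and the function name server_banner is an assumption for the example.

```python
def server_banner(raw_response):
    """Return the value of the Server header in a raw HTTP response, or None."""
    head = raw_response.split("\r\n\r\n", 1)[0]
    for line in head.split("\r\n")[1:]:      # skip the status line
        name, _, value = line.partition(":")
        if name.strip().lower() == "server":
            return value.strip()
    return None


# Hypothetical replies, before and after tightening ServerTokens.
VERBOSE = ("HTTP/1.1 200 OK\r\n"
           "Server: Apache/1.3.20 (Unix) PHP/4.0.6\r\n"
           "Content-Type: text/html\r\n\r\n<html>")
TERSE = ("HTTP/1.1 200 OK\r\n"
         "Server: Apache\r\n"
         "Content-Type: text/html\r\n\r\n<html>")
```

The first reply tells a stranger the server version, the platform and an installed module; the second admits only to being Apache.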

You might also like to look in the configuration files to see if you have two small blocks enabled that allow server-status and server-info to be requested from the server. If these are not limited to the internal network by using "deny from all" and then "allow from 192.168.1." (or similar) then an outsider can review your server setup, especially with server-info enabled.
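
The effect of that pair of settings can be pictured as a simple membership test on the client's address. This is only an illustration of the idea, not Apache's own code; it assumes the partial address "192.168.1." stands for the whole 192.168.1.0/24 range, and the function name may_view_status is invented for the example.

```python
import ipaddress

# Same range as "allow from 192.168.1." (an assumption about the internal network).
INTERNAL = ipaddress.ip_network("192.168.1.0/24")


def may_view_status(client_ip):
    """Mimic 'deny from all' followed by 'allow from 192.168.1.'."""
    return ipaddress.ip_address(client_ip) in INTERNAL
```

A machine at 192.168.1.42 gets through; one at 203.0.113.7 does not.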

1.10.6.3 .htaccess

The Apache security model allows for two ways of establishing individual directory level security. Either you can edit the <Directory > blocks in the httpd.conf file, or you can create .htaccess files in the directories you wish to secure. While there are operating system level means of achieving directory security, in Unix/Linux it seems to be frowned upon (in NT directory rights are generally the best way of implementing security).

.htaccess files (or any other name you give them (remembering to change the entries in the configuration file

1.11 Information Sources

Apache web site
PHP web site
MySQL web site
Postgres web site
registry.apache.org
books

Apache information sources are important. There are few commercial outlets that provide support (Red Hat and IBM might, for a price), so most support is by web chat sites and newsgroups. Some parts are better supported than others; PHP has a very good web site, for instance.

For module information the primary source seems to be the module registry at Apache, but I think as you look around you will find other sources.

There are a large number of books on Apache. Many are quite out of date. Be aware that version 1.3.6 and above are quite different in structure to what went before, and that version 2.0 and above will be different again. There have been annotated code listings published; use the real thing, as the annotated listings are so out of date by publishing time they don't even make good toilet paper dispensers. Several books give large run downs on the configuration options (O'Reilly for one). These options may or may not be ordered as they appear in the configuration files; it's a matter of taste as to whether this is a good thing or not. I prefer to see the parameters listed somewhere in a book in the order they are found in the configuration files. The IDG books also give a run down on some of the modules that can be added, and the parameters these add to the brew.

Lastly, though it should be your first stop, a default installation of Apache includes an htdocs/manual directory that contains a very nice run down on all the parameters of the standard system. It is not altogether clear from the pages how to do things, but it is a starting point from which to go.

1.12 Advanced Apache

Topics I have not covered include redirection and running clustered Apache. These are topics that commercial sites might use to build maintainable, scalable Apache installations, but in the limited time available they are beyond what I can tell you about.

Monitoring and log analysis are also subjects that have profound impact and are complex issues in their own right. In a commercial situation your marketing team might want to know page hits and hit rates to determine if a particular campaign is working. Tools for this are a matter of personal taste; wander over to Google to find your personal poison.
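
To give a flavour of what such tools do underneath, here is a bare-bones sketch that counts page hits from an access log. It assumes the log is in the Common Log Format and only counts requests answered with a 2xx status; the function name page_hits is made up for the example, and a real analysis package does a great deal more.

```python
import re
from collections import Counter

# Common Log Format: host ident user [time] "METHOD path protocol" status bytes
REQUEST = re.compile(r'^\S+ \S+ \S+ \[[^\]]*\] "(?:\S+) (\S+)[^"]*" (\d{3}) ')


def page_hits(lines):
    """Count successful (2xx) requests per path."""
    hits = Counter()
    for line in lines:
        match = REQUEST.match(line)
        if match and match.group(2).startswith("2"):
            hits[match.group(1)] += 1
    return hits
```

Calling page_hits on an open access_log and then asking the result for most_common(10) gives a crude top ten of pages.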

Integration of other server types. We have not touched on how we go about integrating other types of servers (e.g., WAP servers). These may become important additions in coming years; currently they are "maturing" technologies, so tread carefully.


© Copyright A. Maclean 2001-