WWW FAQs: How do I add a captcha to my web form?

2007-10-24: Captchas are a great way to slow spammers down. Spammers and other annoying jerks have discovered ways to abuse websites that offer contact forms, guestbooks, feedback pages, and so on. When spammers use automated tools to attack these pages, they can post many unwanted messages very quickly.

To slow down these attacks, many websites use CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) software. "Turing tests" in general are used to distinguish between computers and real people. In particular, captcha software (the term is commonly used as an ordinary noun, uncapitalized) tries to provide a test that humans can easily pass, but computers will hopefully fail.

A typical captcha involves a picture of text— usually with the text rotated, distorted, colored and otherwise creatively altered. Human beings have no trouble reading the text, but simple computer programs can't. A well-written captcha can often keep out even OCR (Optical Character Recognition) programs.

What the heck is a Turing Test?

A "Turing test" is any test that attempts to distinguish human beings from computers. The idea of a Turing test is credited to the computer science pioneer Alan Turing, who first described it in 1950. For more information, see Wikipedia's Turing test entry.

Captchas Are Not Perfect... Not Even Close

Sounds like a good idea— so what's the catch? Well, there are several problems:

1. Computers can break 'em anyway... although amateur programmers won't have an easy time doing so. Greg Mori and Jitendra Malik's Breaking a Visual CAPTCHA discusses advanced techniques that can be used to crack even fairly sophisticated captcha systems.

2. Some humans can't break 'em! Obviously, blind users can't solve a visual captcha. Better captcha systems also offer an audio-based option. Even then, deafblind users (those who are both deaf and blind) are locked out. Sites employing captchas should at least consider offering special accounts to those with special needs in this area. One solution is to offer a telephone number— and make sure you accept TDD relay calls! These are voice calls placed through an interpreter. Your telephone support staff should be educated about this and encouraged to create accounts or carry out other captcha-protected tasks on behalf of legitimate users who contact you via phone.

3. Captchas can take up extensive CPU resources (that is, slow down your web server) or require features not present on your website (for example, some web hosts don't include the GD library in their PHP offering, probably because they don't realize how easy it is).

4. Bad guys will, in some cases, hire humans to do the data entry instead, or at least to do the captcha-solving part. If your troublemakers are determined to get past the captcha, they can.

So, does this mean you shouldn't use a captcha? Not at all. Some sites (mine included) are faced with many abusive, unwelcome form submissions every day. For sites like these, a well-designed captcha system makes all the difference. Just don't expect perfection, be sure to include the audio option, and offer alternatives for deafblind users. Note that the telephone can be a valid alternative - deafblind users with access to the Internet via braille interfaces likely also have access to TDD relay services which allow them to place voice calls through an interpreter.

How To Implement A Captcha

That's enough about the why (and the why not). How do we implement a practical captcha system?

There are many dynamic web programming languages in the world, and I can't cover all of them in every article. Here I'll assume you are using PHP. If you're not using PHP, you should consider it! PHP is the most popular tool of its kind, and it runs on all major web servers.

I have written a simple captcha system in PHP which you can use on your own site. It's easy to set up and extremely easy to plug into your own PHP code. And you can try a live demo here.

The only catches are:

1. Your server's PHP must include the GD library, with support for JPEG and FreeType (TrueType font output). If it does, you will see that mentioned in your phpinfo page. If not, complain to your web host— they aren't doing a good job if they still don't give you this widely expected feature in 2007.

2. If you don't mind the Bitstream Vera font or the sound of my voice, great! If you aren't crazy about those two things, you'll need to provide your own recordings of the letters of the alphabet and your own TrueType font file (.ttf file) as described in the next two steps.

3. Optional: if you do decide to replace the audio samples, you'll need to record your own, perhaps using Audacity. Name them a.wav, b.wav, etc. (all lower case) and upload them to the sounds subdirectory of the captcha directory. Then convert them to the correct raw audio format by running the convert-wav-to-ub.pl Perl script in my sounds directory. That script requires the sox utility. sox can be installed as a package in most Linux distributions. If that's not the case for your server (for instance, because your hosting is on Windows), visit the sox project page. Or just use my provided audio samples and don't worry about it.

4. Optional: if you decide to replace the font, you'll need a decent-looking TrueType font file (.ttf). And you need to know the file path on the server to that TrueType font file. For copyright reasons, fonts are sometimes not included on web server systems. Look in /usr/share/fonts (Linux/Unix) or c:/windows/fonts (Windows), or upload TrueType files of your own. Of course you can just use the Bitstream Vera font I provide and not worry about this.

5. Your web server's PHP setup must include the session feature, and it must be set up properly. For more information about this issue, see my article how do I keep track of user sessions?
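
If you'd rather not dig through the phpinfo page, the requirements above can be checked from PHP itself. Here is a minimal sketch, not part of captcha.zip; the function name captchaCheckRequirements is made up for this illustration. It relies on the assumption that PHP only defines functions such as imagejpeg and imagettftext when the matching support was compiled in.

```php
<?php
// Hypothetical helper, not part of captcha.zip: reports whether the
// PHP features this captcha relies on appear to be available.
function captchaCheckRequirements()
{
    return array(
        // GD core: needed to create the image at all
        'gd'       => function_exists('imagecreatetruecolor'),
        // JPEG output support in GD
        'jpeg'     => function_exists('imagejpeg'),
        // FreeType (TrueType font) support in GD
        'freetype' => function_exists('imagettftext'),
        // Session support
        'session'  => function_exists('session_start'),
    );
}

// Print one line per feature
foreach (captchaCheckRequirements() as $feature => $ok) {
    echo $feature . ': ' . ($ok ? 'available' : 'MISSING') . "\n";
}
```

Upload it under a temporary name, visit it once, and delete it again. Anything reported as MISSING is worth raising with your web host.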

OK with all of the requirements? Great! Let's install my captcha system. And then I'll take a look at the PHP code to show you how it works and give you ideas for creating your own, in PHP or any other web development language.

1. Download captcha.zip and extract it, then upload all of the files to your site, keeping the same arrangement of directories. This is easy: just drag the captcha folder to your web space and everything else will follow automatically. Don't start copying the files one by one; that's a sure recipe for confusion.

2. Optional: edit the file captcha-settings.php, changing the $captchaFont variable to the path of a TrueType font file (.ttf file) that exists on your web server. This must be a file system path, not a URL. If you're not clear on this, just use my standard setting, or upload a new TrueType font file to the fonts folder and change the font file name in captcha-settings.php. You can also change $captchaSounds to point to a different directory of sound samples, but only if you have recorded your own letter sounds as wav files and converted them with the convert-wav-to-ub.pl Perl script as described above.

If you are running this code on Microsoft Internet Information Server, or on an Apache configuration that doesn't allow .htaccess files to limit access to directories, then consider fetching the font file and audio samples from an alternate location outside your web space and changing these settings accordingly. Otherwise, users can download the font and the audio samples and analyze them.

Does it matter if users download the font and audio samples? If you have created your own audio samples and chosen an uncommon (but legible) font, yes— denying crackers access to these raw materials can make breaking your captcha tougher. But if you are using my standard font and samples, then crackers have the option of downloading those already. In that case there is much less to be gained by hiding these files on your particular server.

If you are running Apache with .htaccess support turned on in the usual way, the .htaccess file I have provided should do an acceptable job of blocking browser access and preventing direct downloading of fonts and audio samples.

3. Test it out! If you copied the captcha folder into the main folder of your web space, you can access a demonstration page easily. The demonstration page is an example of a customer contact form that sends a message only if the customer correctly enters a Captcha code.

First, edit the demo.php file in your captcha folder and change this line to specify your email address:


$address = 'your_email_address@example.com';

Next, just access the demo page with your browser:

http://www.yoursitenamehere.com/captcha/demo.php

If all goes well, you will see a picture containing a few letters, and you'll also be offered a link that allows you to listen to the letters instead. That feature is vital for blind and vision-impaired users, so I encourage you to keep it in place.

Type the letters you see in the picture into the text field, enter a message in the message field, and click "Send Your Message." If you entered the code correctly, your message will be emailed to the address you specified in demo.php and you'll see an acknowledgement. If you did not enter the code correctly, you'll see an error message and be given another chance to enter the verification code.

Using The Captcha In Your Own PHP Pages

That's fine, but how do you use this captcha in your own script? Very easily. All you have to do is use a require statement to load the captcha code... at the very beginning of your page. And I do mean the very beginning. Before the DOCTYPE, before the HTML element, before anything. That's because, under the right circumstances, the captcha code outputs an image (in JPEG format) or speech (in WAV format) directly to the web browser. And if you have already written anything at all to the user, then it's too late. So please, pay attention to this requirement!

Here's the relevant code from the beginning of test.php:


<?php
// Always at the VERY TOP of the page.
require 'captcha.php';
?>

OK, we've loaded the captcha code! Now it's safe to continue with our own PHP code and HTML code.

But how do we display the captcha image in our page? And how do we create a link to the captcha sound?

Easily! The captcha code provides two convenient functions, captchaImgUrl and captchaWavUrl. The first returns the URL of the captcha image, and the second returns the URL of the captcha sound. Here's the part of test.php that displays them:


<!-- Display the image -->
<img src="<?php echo captchaImgUrl()?>">
<!-- Link to the sound -->
<a href="<?php echo captchaWavUrl()?>">Listen To This</a>

Now we've displayed the image and the sound. And you already know how to collect input from the user... right? If not, read my article how can I receive form submissions? before continuing.

All caught up with form submissions? Great! I'll assume you have a form field like this one where the user is expected to enter the captcha code:


<input type="text" name="captcha"/>

But how do we compare the user's response to the correct response? Again, it's easy. The captcha code has already stored the correct response in $_SESSION['captchacode']. Just compare the user's input to that value to find out if the user did the right thing. Assuming you have a form field named captcha where the user has entered the captcha code, you can code it like this:


if ($_POST['captcha'] == $_SESSION['captchacode']) {
  // The user gave the right answer
}

"But what do I do inside that 'if' statement, exactly?"

That depends on your application. Why are you using my Captcha in the first place? Because you have content that should only appear, or actions that should only take place, when the Captcha code has been correctly entered. So put that content (or perform those actions) only inside this if statement (between the { and the }).

Keep in mind that PHP lets you shift in and out of "HTML mode" in the middle of your code. So it's easy to output certain HTML only when the test is successful.

"Hey, doesn't that mean the user can see the captcha code?" No. PHP session data is not saved on the user's computer. Only a short random "cookie" is saved on the user's computer. PHP uses that cookie to look up a session file stored on the server that contains the real details, such as the captcha code. So there's no security hole here.

"Is $_SESSION a global variable? Do I need a global statement?"

$_SESSION is a "superglobal," so you don't have to worry about importing it into functions with the global statement.
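
The simple == comparison shown earlier in this section assumes that both the form field and the session value are present. A slightly more defensive version might look like the sketch below. The function captchaAnswerMatches is hypothetical and not part of captcha.zip, and forgiving stray spaces and capital letters is a design choice of this sketch, based on the assumption that codes consist of lowercase letters only.

```php
<?php
// Hypothetical helper, not part of captcha.zip.
// $submitted: raw value from the form (may be missing or not a string)
// $expected:  the code stored in the session (may be missing)
function captchaAnswerMatches($submitted, $expected)
{
    // Reject anything that is not a string, and an empty expected code,
    // so a missing field can never "match" a missing session value.
    if (!is_string($submitted) || !is_string($expected) || $expected === '') {
        return false;
    }
    // Assumption: codes are lowercase letters, so forgive stray
    // whitespace and capital letters in what the user typed.
    $submitted = strtolower(trim($submitted));
    // hash_equals (PHP 5.6 and later) does a strict string comparison.
    return hash_equals($expected, $submitted);
}

// Typical use, after captcha.php has been loaded:
$submitted = isset($_POST['captcha']) ? $_POST['captcha'] : null;
$expected  = isset($_SESSION['captchacode']) ? $_SESSION['captchacode'] : null;
if (captchaAnswerMatches($submitted, $expected)) {
    // The user gave the right answer
}
```

Because missing values are rejected up front, a request that leaves out the captcha field entirely is treated as a wrong answer rather than raising a notice.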

Clearing the Captcha Code

It works! But there's one more crucial step: the captcha system doesn't know it worked, not yet. So the same captcha code will be shown to the user if they return to the page during their current session. That's not right— they should get a new code the next time they want to perform an action that should be reasonably difficult to abuse. We can fix that by calling captchaDone():


if ($_POST['captcha'] == $_SESSION['captchacode']) {
  // The user gave the right answer
  captchaDone();
}
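
To get a feel for what a call like captchaDone() has to accomplish, here is one plausible sketch. It is a guess at the idea, not the actual code in captcha.php: the function name exampleCaptchaDone is made up, and it takes the session array as a parameter so it can be tried without a live session.

```php
<?php
// Sketch only: one plausible way a function like captchaDone() could
// work. The real captcha.php may do more, or do it differently.
function exampleCaptchaDone(array &$session)
{
    // Forget the solved code so that the next page load has to
    // generate a fresh one
    unset($session['captchacode']);
}

// Try it on a stand-in for $_SESSION
$session = array('captchacode' => 'abcd', 'other' => 1);
exampleCaptchaDone($session);
```

Only the captcha code is removed; anything else stored in the session is left alone.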

At this point you can go ahead and complete the task that you created the captcha for— usually sending email to your staff, creating a new account, or posting a comment to a public web page or blog... useful things that stop being useful when automated programs can do them 500 times in ten seconds!

A Complete Example

Still confused? Check out this complete, working example of a page that takes advantage of my Captcha to send a message to the webmaster if and only if the user enters the right Captcha code (also found in demo.php in captcha.zip). You might want to use this page as a starting point for your own work.


<?php
// Always at the VERY TOP of the page.
// The opening php tag above has to be the
// VERY FIRST thing in your page, NO blank lines,
// no NOTHING EVER, or it will NOT work. Yes, really!
require 'captcha.php';

// Now $captchaimg and $captchawav are set and we can introduce
// those links wherever we like in the page. We can also
// access the captcha code as $_SESSION['captchacode']
// and verify what the user enters in our form, as shown
// below.

// Where to send the messages users enter in the contact form
// (change to your address if you really use this)
$myaddress = 'YOURADDRESS@YOURSITE.com';
?>

<?php
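// Start with an empty error list so the foreach loop further
// down is safe even before the form has been submitted
$errors = array();
if (!empty($_POST['send'])) {
  if (!isset($_POST['captcha']) || $_POST['captcha'] != $_SESSION['captchacode']) {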
    $errors[] = "You did not enter the letters shown in the image.";
  }
  if (!sizeof($errors)) {
    // IMPORTANT: If you don't call this the
    // user will keep getting the SAME code!
    captchaDone();
    $message = $_POST['message'];
    mail($myaddress, 'Contact Form Submission', $message);
    // Notice we can shift in and out of "HTML mode"
    // to display certain HTML only when the
    // user passes the test
?>
<html>
<head>
<title>Message Sent</title>
</head>
<body>
<h1>Message Sent</h1>
Thank you for using our handy contact form.
<p>
<!-- Generate a link back to ourselves -->
<a href="<?php echo $_SERVER['SCRIPT_URL']?>">Contact Us Again</a>
</body>
</html>
<?php
    // Exit now to prevent the original form from
    // appearing again
    exit(0);
  }
}
?>
<html>
<head>
<title>Contact Us</title>
</head>
<body>
<h1>Contact Us</h1>
<?php
foreach ($errors as $error) {
  echo("<p>$error</p>\n");
}
?>
<p>
<form method="POST" action="<?php echo $_SERVER['SCRIPT_URL']?>">
<p>
<b>Verification Code</b>
<p>
To prove you are a human being, you must enter the lowercase letters shown
below in the field on the right. Thank you for your understanding!
<p>
<img style="vertical-align: middle" src="<?php echo captchaImgUrl()?>">&nbsp;&nbsp;<input name="captcha" size="8"/>
<a href="<?php echo captchaWavUrl()?>">Listen To This</a>
<p>
Please enter your message in the text field below. Then click
"Send Your Message."
<p>
<textarea name="message" rows="10" cols="60">
</textarea>
<p>
<input type="submit" name="send" value="Send Your Message"/>
</form>
</body>
</html>

Notice that this page is really two pages in one. If the send button has already been clicked to submit the form (if ($_POST['send']) { ... }), then we check whether the Captcha code is correct, send the message and display a page that acknowledges this. If not— or if the user has entered the Captcha code incorrectly— we display the contact form.

Remembering That The Captcha Has Already Been Completed

If you are writing a simple comment form, you're probably finished at this point. But if there are many more pages that become accessible after the captcha is completed, you are probably wondering how to remember that fact. This is actually outside the scope of this article— it's a session-handling question, and I cover those in my article how do I keep track of user sessions? But in a nutshell, since we already have PHP sessions turned on in this code, you can easily set a session variable like this:


$_SESSION['already-passed-captcha'] = 1;

Then you simply verify that setting on later visits to your script. You could, for instance, choose not to display the captcha image and the link to the captcha sound when this variable is set, and also choose not to check whether the user entered the captcha code again.
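
As a rough sketch of that idea, the two helpers below set and check the flag. Their names are made up for this example, and they take the session array as a parameter so they can be tried outside a web request; in a real page you would pass $_SESSION after captcha.php has been loaded.

```php
<?php
// Hypothetical helpers for illustration only.

// Record that this visitor has passed the captcha
function markCaptchaPassed(array &$session)
{
    $session['already-passed-captcha'] = 1;
}

// True if the captcha should still be shown and checked
function captchaStillRequired(array $session)
{
    return empty($session['already-passed-captcha']);
}

// Try it on a stand-in for $_SESSION
$session = array();
if (captchaStillRequired($session)) {
    // ...show the image, check the code, and on success:
    markCaptchaPassed($session);
}
```

On later visits captchaStillRequired returns false, so the page can skip both the captcha display and the check.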

How It Works

Great, we have a captcha system that works! And it was easy to set up. But just how does it work?

You'll notice that captcha.php must be brought in to your script with a require statement before anything is written to the page. That's because, in addition to generating the random string of letters that makes up the captcha code, captcha.php also generates the actual image or speech sounds. If you look at the beginning of captcha.php, you'll see this code:


session_start();
if ($_GET['captchaimg']) {
  captchaSendImg();
} elseif ($_GET['captchawav']) {
  captchaSendWav();
} else {
  captchaCode();
}

What's happening here?

The first line turns on PHP's session management system— which gives us a place to remember information, such as the captcha code, from one HTTP request to the next. The next line checks to see whether our script was called upon to generate the captcha image itself... instead of the page the image is in! You might not be aware that PHP is not limited to generating web pages. We can also write images and audio files to the browser. By generating a URL that points back to the same script, but including a parameter called captchaimg, we give ourselves a way to tell the difference between a request for an image and a request for the page itself. If we are being asked for an image, we call captchaSendImg(), which generates the image via PHP's GD graphics functions and then exits the PHP script completely before anything else can be output. captchaSendWav does the same job for audio.

If neither is set, we call captchaCode(), which generates a new random string of letters— if we don't already have one in the current session, that is. The letters are drawn from a slightly abbreviated alphabet (see the $chars variable). We do that to avoid letters that are easily mistaken for numbers or for one another when shown on the screen or spoken out loud.
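
The logic just described can be sketched in a few lines. This is an illustration rather than the code from captcha.php: the function name, the length of five and the reduced alphabet are all assumptions (see the $chars variable in captcha.php for the real alphabet).

```php
<?php
// Sketch only: the alphabet and length below are assumptions, not
// the values used in captcha.php.
function exampleCaptchaCode(array &$session, $length = 5,
                            $chars = 'abcdefhjkmnprstuvwxyz')
{
    // Reuse the code already stored for this session, if any
    if (!empty($session['captchacode'])) {
        return $session['captchacode'];
    }
    $code = '';
    $max = strlen($chars) - 1;
    for ($i = 0; $i < $length; $i++) {
        // Pick one character at random from the reduced alphabet
        $code .= $chars[mt_rand(0, $max)];
    }
    // Remember it for the rest of the session
    $session['captchacode'] = $code;
    return $code;
}

// Try it on a stand-in for $_SESSION
$session = array();
echo exampleCaptchaCode($session), "\n";
```

Calling it a second time with the same array returns the same code, which is why clearing the code after a successful submission matters. On PHP 7 and later, random_int is a stronger choice than mt_rand if you adapt this.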

We then set the $captchaimg and $captchawav variables to point right back to the script as I described above, using $_SERVER['SCRIPT_URL'], which holds the host-relative URL of the script we started from (test.php, for example). Be aware that SCRIPT_URL is not provided by every server: Apache sets it when mod_rewrite is active, so if it turns out to be empty on your server, $_SERVER['SCRIPT_NAME'] is the usual substitute. We add ?captchaimg=1 to simulate a GET-method form submission. We also add captchasalt, which is simply a random value to prevent unwanted caching of images and speech.
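
Building such a URL takes only a line or two. The sketch below shows the general shape; buildCaptchaUrl is a made-up name, and the real captchaImgUrl() and captchaWavUrl() functions may be written differently.

```php
<?php
// Sketch only; not the code from captcha.php.
function buildCaptchaUrl($scriptUrl, $param)
{
    // Random "salt" so browsers and proxies do not reuse a cached copy
    $salt = mt_rand();
    $url = $scriptUrl . '?' . $param . '=1&captchasalt=' . $salt;
    // Escape for safe use inside an HTML attribute
    return htmlspecialchars($url, ENT_QUOTES);
}

echo buildCaptchaUrl('/captcha/test.php', 'captchaimg'), "\n";
```

The salt has no meaning to the script itself. It only makes each URL different, so a cached image from an earlier code is never shown in place of the current one.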

Generating the Image

The captcha image is created using the PHP GD functions, a standard library of graphics functions based on the gd library, which I originally created back in 1994. These days gd is maintained by the PHP community.

The imagecreatetruecolor call creates a "true-color" image, which in this case gives us smooth grays and antialiasing to prevent a jagged appearance. We set a white background by allocating a color with red, green and blue all at the maximum level (255), and then filling the image with the imagefilledrectangle function.

Once the stage has been set, we can generate the text. We do that with the imagettftext function, which generates text using any TrueType font— one of the nicest capabilities of GD. We display each character at a slightly different angle, generating a random angle with the mt_rand function. PHP's traditional rand function usually calls the built-in random number generator of the system it's running on— which is often a lousy one. mt_rand is a higher-quality random number generator also offered in PHP and I recommend its use.

Notice that we also position and size the characters somewhat randomly. We do all of these things to make life tougher for those who would like to automatically crack our captcha with OCR (Optical Character Recognition) software. Sophisticated attacks may succeed against this captcha, but simple attacks will not. Frustrating those who attack every web form they see with generic cracking software is our primary goal here.

"Don't we have to 'seed' the random number generator?"

No. Reasonably modern versions of PHP automatically call mt_srand() before generating the first random value.

After writing the text, we add a sprinkling of "snow," short antialiased lines that can appear anywhere in the image. We do this to create additional problems for OCR software. Humans generally have no trouble reading the letters through the snow.
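
Put together, the drawing steps described in this section look roughly like the sketch below. It is not the code from captcha.php: the function names, image size, angle range and amount of snow are assumptions picked for illustration, and the snow lines here are plain rather than antialiased. The drawing function needs GD with FreeType support and a real .ttf path, so only the planning function is called here.

```php
<?php
// Sketch only: sizes, ranges and function names are assumptions.

// Decide where and how each character is drawn. Kept separate from
// the drawing so the randomization is easy to inspect.
function planCaptchaGlyphs($code)
{
    $glyphs = array();
    $x = 10;
    for ($i = 0; $i < strlen($code); $i++) {
        $glyphs[] = array(
            'char'  => $code[$i],
            'size'  => mt_rand(18, 24),   // font size
            'angle' => mt_rand(-20, 20),  // degrees, counter-clockwise
            'x'     => $x + mt_rand(0, 4),
            'y'     => mt_rand(30, 38),   // baseline position
        );
        $x += 28;
    }
    return $glyphs;
}

// Draw the planned characters plus some "snow", send a JPEG, and exit.
// $fontPath is a file system path to a .ttf file.
function sendCaptchaImage($code, $fontPath)
{
    $width = 20 + 28 * strlen($code);
    $height = 50;
    $im = imagecreatetruecolor($width, $height);
    $white = imagecolorallocate($im, 255, 255, 255);
    $black = imagecolorallocate($im, 0, 0, 0);
    // White background
    imagefilledrectangle($im, 0, 0, $width - 1, $height - 1, $white);
    foreach (planCaptchaGlyphs($code) as $g) {
        imagettftext($im, $g['size'], $g['angle'], $g['x'], $g['y'],
                     $black, $fontPath, $g['char']);
    }
    // "Snow": short lines scattered at random over the image
    for ($i = 0; $i < 30; $i++) {
        $sx = mt_rand(0, $width - 1);
        $sy = mt_rand(0, $height - 1);
        imageline($im, $sx, $sy,
                  $sx + mt_rand(-4, 4), $sy + mt_rand(-4, 4), $black);
    }
    header('Content-type: image/jpeg');
    imagejpeg($im);
    exit(0);
}

// Inspect the random layout for a sample code
print_r(planCaptchaGlyphs('abc'));
```

Each run prints a different layout, which is the point: an attacker cannot rely on any character appearing at a fixed size, angle or position.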

Once the image has been drawn, we send it directly to the browser and then exit completely, ending the script without generating a page:


header("Content-type: image/jpeg");
imagejpeg($im);
exit(0);

Why does this work? Because we're here in response to a request for an image, not a page! Recall that captchaImgUrl() returned a link to our script with the ?captchaimg=1 parameter added. We embedded this link in our test page with an img element. The captcha.php code, running at the very beginning of the "page," spotted this and called captchaSendImg. The Content-type: image/jpeg header tells the browser that we're generating an image instead of a page. And the imagejpeg function writes the image data directly to the browser. This way, there's no need to clutter up the file system with temporary captcha image and audio files. We generate 'em when we need 'em.

Generating the Speech

The visual part of the captcha is pretty standard stuff. But the audio portion is a bit more sophisticated. The captchaSendWav function assembles the .ub files, which are simply raw 8-bit, 22 kHz audio samples, into a valid WAV-format audio file which just about every system can play. But it does a little more than that.

If you use my own voice samples, there's a risk that an attacker has those samples too. And that person could write simple code to recognize the same samples when they come back as a WAV file.

To frustrate this type of attack, I do two things:

1. I randomly select a small volume change factor, and multiply every sample by that factor. This doesn't make a perceptible difference to humans, but it frustrates simplistic attempts to write code that recognizes my speech samples.

2. I average in "extra" samples here and there in a random way. Again, this makes little difference to humans but creates a real problem for programs trying to recognize my original letter sounds in a naive fashion.
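
The two tricks above can be sketched as follows. This is an illustration under stated assumptions, not the code from captcha.php: the function name, the 90% to 100% volume range and the one-in-fifty insertion rate are invented here. The input is a string of unsigned 8-bit samples, in which the value 128 means silence.

```php
<?php
// Sketch only: ranges and names are assumptions.
// $raw is a string of unsigned 8-bit samples (128 = silence).
function perturbSamples($raw)
{
    // 1. One random volume factor for the whole clip (90% to 100%)
    $factor = mt_rand(90, 100) / 100;
    $out = '';
    $prev = 128;
    $len = strlen($raw);
    for ($i = 0; $i < $len; $i++) {
        // Scale around the silence level, then clamp to 0..255
        $s = (int) round((ord($raw[$i]) - 128) * $factor + 128);
        $s = max(0, min(255, $s));
        // 2. Now and then, slip in an extra sample that is the
        //    average of its two neighbors
        if (mt_rand(0, 49) === 0) {
            $out .= chr((int) (($prev + $s) / 2));
        }
        $out .= chr($s);
        $prev = $s;
    }
    return $out;
}

// A short test signal: silence followed by a louder stretch
$in = str_repeat(chr(128), 1000) . str_repeat(chr(200), 1000);
$out = perturbSamples($in);
echo strlen($in), ' samples in, ', strlen($out), " samples out\n";
```

The output is never shorter than the input and differs from run to run, so a byte-for-byte comparison against the original letter samples no longer works.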

Generating a valid WAV audio stream isn't hard. The WAVE soundfile format page provided by Marina Bosi of Stanford University explains the format, and PHP's pack function (borrowed from Perl) provides an elegant way to output data in whatever binary format is required.
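
As an illustration of how pack fits in, here is a minimal sketch that wraps raw samples in a standard 44-byte WAV header. It assumes mono, 8-bit unsigned PCM audio at 22050 Hz; the exact sample rate used by captcha.php is not stated above, and the function name is made up.

```php
<?php
// Sketch only. Assumes mono, 8-bit unsigned PCM at 22050 Hz.
function buildWavFile($samples, $rate = 22050)
{
    $channels = 1;
    $bits = 8;
    $blockAlign = $channels * $bits / 8;  // bytes per sample frame
    $byteRate = $rate * $blockAlign;      // bytes per second
    $dataLen = strlen($samples);

    // "V" = 32-bit little-endian, "v" = 16-bit little-endian
    $header = 'RIFF' . pack('V', 36 + $dataLen) . 'WAVE'
            . 'fmt ' . pack('V', 16)      // size of the fmt chunk
            . pack('v', 1)                // format 1 = uncompressed PCM
            . pack('v', $channels)
            . pack('V', $rate)
            . pack('V', $byteRate)
            . pack('v', $blockAlign)
            . pack('v', $bits)
            . 'data' . pack('V', $dataLen);
    return $header . $samples;
}

// One hundred samples of silence
$wav = buildWavFile(str_repeat(chr(128), 100));
echo strlen($wav), " bytes\n";
```

The two length fields are the part that is easiest to get wrong: the first counts everything after the first eight bytes, and the second counts only the sample data.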

I end the process by outputting the WAV data directly to the browser, just as I did with the image. Unlike with the image, however, we also need to specify the Content-length: header explicitly. This shouldn't be necessary, but in practice some web browsers don't cooperate if we don't spell out the size of the complete WAV file for them:


$clength = strlen($wav);
header("Content-length: $clength");
header("Content-type: audio/wav");
echo($wav);
exit(0);

Conclusion

Keeping spammers and other abusive lowlifes out of your website is a tough job, and no one can do it perfectly. But by making use of a well-designed captcha system, you can "raise the bar" for spammers and keep simple automated attacks out.

Providing both audio and visual captchas keeps our web pages accessible to a wide audience. Even so, good ethics (and possibly the Americans with Disabilities Act as well) require us to provide a practical alternative for deafblind users. TDD relay telephone service may be an acceptable alternative if your website offers the option of telephone support.

The spammers will be back. Captchas alone won't keep them out! Especially if you don't design your scripts carefully to prevent tactics such as posting dozens of links, or tricking a contact form into mailing dozens of additional people. A captcha is not a substitute for a well-designed site that watches out for inappropriate form values and other untrustworthy behavior on the user's part. But a good captcha system can be a healthy part of your website's security strategy, slowing down attacks and announcing that your site is not an easy target.
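
To make that last point concrete, here is a small sketch of the kind of checks a contact form might run in addition to the captcha. The helper names and the limit of three links are arbitrary choices for this example, not part of captcha.zip.

```php
<?php
// Sketch only: hypothetical helpers and an arbitrary link limit.

// True if a value destined for an email header (a subject or reply
// address, say) contains a line break, the usual sign of an attempt
// to smuggle in extra headers such as Bcc:
function hasHeaderInjection($value)
{
    return preg_match('/[\r\n]/', $value) === 1;
}

// Count http:// and https:// links in a message body
function countLinks($message)
{
    return preg_match_all('#https?://#i', $message, $matches);
}

// True if the message carries more links than we are willing to accept
function messageLooksLikeSpam($message, $maxLinks = 3)
{
    return countLinks($message) > $maxLinks;
}

// Example: a reply address with an extra header tacked on
var_dump(hasHeaderInjection("a@example.com\r\nBcc: b@example.com"));
```

Checks like these are cheap, and they catch abuse that a captcha cannot, such as a human who solves the code and then pastes in a page of links.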

Legal Note: yes, you may use sample HTML, Javascript, PHP and other code presented above in your own projects. You may not reproduce large portions of the text of the article without our express permission.


Copyright 1994-2014 Boutell.Com, Inc. All Rights Reserved.