Master's Degree: Regular Expressions

INTRODUCTION

This class will teach you how to use regular expressions to turbo-charge your title searches.

QUESTIONS ANSWERED

  • The basic and intermediate search methods still gave me way too many results to sort through. Is there any way to refine them further?

LESSON 1: WHAT IS A REGULAR EXPRESSION?

At its most basic, "regular expression" is a programmer's term for "pattern." When you do a regular expression search (available from our complex title search by specifying "regular expression" instead of a "word" or "fuzzy" search), you use special characters and codes to make your search very specific.

A simple example is the slang word "def." With our "word" search, the term you specify can also be found inside longer words. So if you put in "def" as your search term, you'll get every title with "def" in it, like "Defending Your Life" and "The Bank Defaulter". By using a regular expression search along with the proper characters and codes, you can specify that "def" must be a word all by itself. You can also specify that it has to be at the beginning or end of the title, among many other things.

For most searches at the IMDb, you'll never need to use a regular expression search. But if you need titles whose names meet certain specific conditions, it will be very useful. (If other aspects of the title are important, like the year it was made or whether or not we have quotes from it, try our advanced searching class.)

IMPORTANT NOTE

This lesson in regular expressions should not be considered a comprehensive tutorial for the use of regular expressions in Perl, Unix or other operating systems, editors or languages that use them. We are only covering those elements that we believe will be useful to our users.

LESSON 2: CONSTRUCTING REGULAR EXPRESSIONS

A regular expression is constructed using special characters and codes to define what the pattern is and where it must be. We'll split the characters and codes into sections based on their functions.

Combining

The characters and codes you'll find below can be combined together to increase the specificity or range of your search. We'll occasionally demonstrate combinations in the sections below, but these are not the only combinations. Feel free to experiment with different combinations when you need to.

Anchors

An anchor sets where the pattern must occur in the title of a movie or TV show. There are three anchors relevant to the IMDb.

^ - Sets an anchor at the beginning of a title. A regular expression of the form ^de would find all titles beginning with "de".

$ - Sets an anchor at the end of a title. A regular expression of the form de$ would find all titles ending with "de".

\b - This sets a word boundary. \bde would find all titles that had words beginning with "de". de\b would find all titles that had words ending with "de". \bde\b would find all titles where "de" was a word in and of itself.

By combining these anchors, you can create some fairly specific searches. As we mentioned in the introduction, "def" occurs in a lot of words. To find a title with the slang "def" as a word by itself, you would search for \bdef\b. To find a title that starts with "def" as a word, you would search for ^def\b (a leading \b is not necessary because the ^ already anchors you at the beginning of the title).

TIP: Anchors are very useful for finding one-word titles. They're also helpful for finding all titles that start or end with a specific word.
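
For readers who want to try the anchors outside the site, here is a minimal sketch using Python's re module. It is not how the IMDb search is implemented; the list of titles is a made-up sample, the search_titles helper is hypothetical, and ignoring upper/lower case is an assumption.

```python
import re

# Sample titles for illustration only (not real search results).
titles = [
    "Defending Your Life",
    "The Bank Defaulter",
    "Def Jam",
    "So Def",
    "Mos Def Live",
]

def search_titles(pattern, candidates):
    """Return the candidates containing a match for pattern, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [title for title in candidates if regex.search(title)]

starts_with_def = search_titles(r"^def", titles)     # title begins with "def"
ends_with_def = search_titles(r"def$", titles)       # title ends with "def"
def_as_word = search_titles(r"\bdef\b", titles)      # "def" as a word by itself
first_word_def = search_titles(r"^def\b", titles)    # first word is exactly "def"

print(starts_with_def)  # ['Defending Your Life', 'Def Jam']
print(ends_with_def)    # ['So Def']
print(def_as_word)      # ['Def Jam', 'So Def', 'Mos Def Live']
print(first_word_def)   # ['Def Jam']
```

Note how \bdef\b skips "Defending" and "Defaulter", which the plain ^def pattern or a word search would still pick up.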

Special Characters

If you are not familiar with the word "character," it is a programming term meaning any letter, number or symbol. A, 2 and ? are all characters.

Special characters are used for matching specific patterns within words. Though our word search will find "old" in "gold" and "holder," using special characters can help you specify where in the word "old" must occur, or even find patterns where any combination or range of letters or numbers occurs between others... or doesn't occur.

There are six special characters relevant to the IMDb.

+ - This matches one or more instances of the character right before it. ba+ will match "ba" or "baa" or even "baaa". You can put the + in the middle of expressions, so ba+d would match "bad" or "baad" and so on.

* - This matches zero or more instances of the character right before it. ba* will match "b", "ba", "baa", or even "baaa". You can put the * in the middle of expressions, so ba*d would match "bd", "bad" or "baad" and so on.

? - This matches zero or one instance of the character right before it. ba? will only match "b" or "ba". You can put the ? in the middle of expressions, so ba?d would match "bd" or "bad".

[] - When you use brackets, you can specify a variety of different characters that must match, or must not match. [eo] will match either E or O. For example b[eo]d will match "bed" or "bod", but not "bid" or "bud".

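. - This matches exactly one instance of any character. \bb[eo]d.\b would match "body" and other words matching the pattern (including foreign words like "boda"), but not "bed", because nothing follows the "d". To make the extra character optional, combine the . with a ?: \bb[eo]d.?\b would match "bed" as well as "body".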

^ - Yes, we just saw this in the Anchors section, but when placed within brackets, it means "not." For example b[^eiou]d would match "bad" but not "bed", "bid", "bod" or "bud".
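
The special characters above can be checked with a short sketch in Python's re module (an illustration, not the IMDb search itself). The matches helper is hypothetical; it uses re.fullmatch so that each word must fit the pattern from start to finish, whereas a title search would find the pattern anywhere in the title.

```python
import re

def matches(pattern, words):
    """Return the words that fit the pattern from start to finish."""
    return [word for word in words if re.fullmatch(pattern, word)]

plus = matches(r"ba+d", ["bd", "bad", "baad"])        # one or more "a"
star = matches(r"ba*d", ["bd", "bad", "baad"])        # zero or more "a"
optional = matches(r"ba?d", ["bd", "bad", "baad"])    # zero or one "a"
either = matches(r"b[eo]d", ["bed", "bod", "bid", "bud"])            # "e" or "o"
negated = matches(r"b[^eiou]d", ["bad", "bed", "bid", "bod", "bud"])  # not e, i, o, u
any_char = matches(r"b[eo]d.", ["bed", "body", "boda"])  # exactly one extra character

print(plus)      # ['bad', 'baad']
print(star)      # ['bd', 'bad', 'baad']
print(optional)  # ['bd', 'bad']
print(either)    # ['bed', 'bod']
print(negated)   # ['bad']
print(any_char)  # ['body', 'boda']
```

The last line shows that the period stands for exactly one character: "bed" has nothing after the "d", so it does not fit b[eo]d.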

Ranges & Sequences

Let's say you want to find something in a range, say the letter B followed by the number 1, 2, 3, 4 or 5. In the section above, we mentioned how you can pick a few numbers or letters and put them in brackets. Thus to do that search, you could use b[12345]. But that's a lot of typing.

The brackets let you specify ranges as well as individual characters. Thus regular expressions let you substitute [1-5] for [12345]. Similarly, you can do this with ranges of letters, so [e-l] substitutes for typing all the letters between E and L.

Of course, even typing [0-9] is a lot of typing when you really just want to specify any number. Because that's such a common sequence, you can substitute a two-character code for it.

\d - matches any digit (0-9).

\w - matches any digit or letter (and the underscore), but not punctuation or other symbols.

You can also combine ranges by just putting them in one after another. [1-4e-k] will match any number from 1 to 4 or any letter from e through k.
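
As a rough check of the ranges and codes above, here is a small sketch in Python's re module, whose \d and \w behave like the codes described here for ordinary letters and digits. The sample strings are made up for illustration.

```python
import re

samples = ["b1", "b5", "b6", "bx"]

# [12345] and [1-5] are two spellings of the same set of characters.
explicit = [s for s in samples if re.fullmatch(r"b[12345]", s)]
ranged = [s for s in samples if re.fullmatch(r"b[1-5]", s)]

# \d is shorthand for a digit; \w covers letters, digits and the underscore.
digits = [c for c in "a7_!" if re.fullmatch(r"\d", c)]
word_chars = [c for c in "a7_!" if re.fullmatch(r"\w", c)]

# Ranges placed one after another inside the same brackets are combined.
combined = [c for c in ["0", "3", "e", "k", "m"] if re.fullmatch(r"[1-4e-k]", c)]

print(explicit)    # ['b1', 'b5']
print(ranged)      # ['b1', 'b5']
print(digits)      # ['7']
print(word_chars)  # ['a', '7', '_']
print(combined)    # ['3', 'e', 'k']
```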

Matching Multiples

The special characters above are pretty powerful, yet in a way they're also limited. Using e+, you can specify that the pattern must have one or more E's in a row in it. But that could return results that have 50 E's in a row as well as results that have just one or two.

You can specify two E's in a row by using e{2}. Note that these are braces (curly brackets), not parentheses; you type them by hitting the bracket keys with the shift held down.

Let's say, though, that you didn't want to search for exactly 2 E's in a row. Instead, you wanted all titles with at least 2 in a row, but no more than 4. You can create a range, but instead of using a dash, you use a comma. To get the titles with 2-4 E's in a row, you would type e{2,4}.
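
The counts in braces can be tried out with a short sketch in Python's re module. The word list is made up for illustration, and the sketch says nothing about how the IMDb search itself treats longer runs.

```python
import re

words = ["be", "bee", "beee", "beeee", "beeeee"]

# re.fullmatch requires the whole word to fit, so the counts are exact.
exactly_two = [w for w in words if re.fullmatch(r"be{2}", w)]
two_to_four = [w for w in words if re.fullmatch(r"be{2,4}", w)]

# re.search looks anywhere in the word, so a longer run of E's still
# contains two in a row and is reported as a match.
contains_two = [w for w in words if re.search(r"e{2}", w)]

print(exactly_two)   # ['bee']
print(two_to_four)   # ['bee', 'beee', 'beeee']
print(contains_two)  # ['bee', 'beee', 'beeee', 'beeeee']
```

The last result suggests that, in an unanchored search, an upper limit only bites when something else in the pattern (a different letter, a word boundary or an anchor) marks where the run of E's has to stop.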

One or the other

Although brackets will let you specify a few characters or range of characters to do an "or" match (i.e. [ejp] will match e, j or p), you can also do this with entire patterns.

If you went through the class on adding data, you'll remember the "pipe" character. It can look like either a straight vertical line or a sort of stretched colon, depending on the font on your screen or the printing on your keyboard. On most keyboards it looks like a stretched colon, and in most fonts used on computer screens (especially in web browsers) it looks like a line. On the average American keyboard it is typed by hitting shift and the backslash (\) key, though its placement may vary with international or specialized layouts.

When you use the pipe, you must have a complete expression on both sides. So, if you were searching for any titles that included "bed" or "body" anywhere in them (i.e. "body" could be within "nobody"), you would use bed|body as your expression. If you wanted to put word boundaries on them so that the titles had to have "bed" or "body" as distinct words, you would use \bbed\b|\bbody\b as your expression.
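
The two pipe expressions above can be compared with a minimal sketch in Python's re module. The titles are sample strings chosen for illustration, and ignoring upper/lower case is an assumption rather than a statement about the IMDb search.

```python
import re

# Sample titles for illustration only.
titles = ["Nobody's Fool", "Bedtime Story", "Body Heat", "The Bed", "Bad Boys"]

# "bed" or "body" anywhere in the title, even inside a longer word.
anywhere = re.compile(r"bed|body", re.IGNORECASE)

# "bed" or "body" only as distinct words; each side of the pipe is a
# complete expression with its own word boundaries.
whole_word = re.compile(r"\bbed\b|\bbody\b", re.IGNORECASE)

anywhere_hits = [t for t in titles if anywhere.search(t)]
whole_word_hits = [t for t in titles if whole_word.search(t)]

print(anywhere_hits)    # ["Nobody's Fool", 'Bedtime Story', 'Body Heat', 'The Bed']
print(whole_word_hits)  # ['Body Heat', 'The Bed']
```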

AND THAT'S IT

Those of you who are old hands at regular expressions and are reading this just to see what we had to say on them or how we've implemented their use will notice that we've left out some functions, but the functions listed above represent pretty much all you'll need within the confines of the way we present information.

If you're a really techy person, our last class is called "Tech info about IMDb." It covers some of the equipment we use to serve our millions of monthly users and gives some info on how the database is constructed. If you're not really into tech stuff, you'll probably find it boring, and it is not essential to getting the most out of the database. We provide it mainly for students. If you're interested, you can head on over. Otherwise, consider yourself a graduate of the IMDb University with honors. Congratulations.