Master's Degree: Regular Expressions

INTRODUCTION

This class will teach you how to use regular expressions to turbo-charge your title searches.

LESSON 1: WHAT IS A REGULAR EXPRESSION?

At its most basic, "regular expressions" is a fancy programmer's way of saying "patterns." When you do a regular expression search (available from our complex title search by specifying "regular expression" instead of a "word" or "fuzzy" search), you're using special characters and codes to get very specific in your search.

A simple example is the slang word "def." With our "word" search, the words you specify can be found within other words. So if you put in "def" as your search term, you'll get every title with "def" in it, like "Defending Your Life" and "The Bank Defaulter". By using a regular expression search along with the proper characters and codes, you can specify that "def" must be a word all by itself. You can also specify that it has to be at the beginning or end of the title, and many other things.

For most searches at the IMDb, you'll never need to use a regular expression search. But if you need titles whose names meet certain specific conditions, this will be very useful. (If other aspects of the title are important, like whether or not we have quotes from it or the year it was made, try our advanced searching class.)

IMPORTANT NOTE

This lesson should not be considered a comprehensive tutorial on regular expressions as used in Perl, Unix or the other operating systems, editors and languages that support them. We are only covering those elements that we believe will be useful to our users.

LESSON 2: CONSTRUCTING REGULAR EXPRESSIONS

A regular expression is constructed using special characters and codes to define what the pattern is and where it must be. We'll split the characters and codes into sections based on their functions.

Combining

The characters and codes you'll find below can be combined to increase the specificity or range of your search. We'll occasionally demonstrate combinations in the sections below, but these are not the only combinations.
Feel free to experiment with different combinations when you need to.

Anchors

An anchor sets where the pattern must occur in the title of a movie or TV show. There are three anchors relevant to the IMDb.

^ - Sets an anchor at the beginning of a title. A regular expression that starts with ^ matches only titles that begin with the pattern that follows it.

$ - Sets an anchor at the end of a title. A regular expression that ends with $ matches only titles that end with the pattern before it.

\b - Sets a word boundary. A word must start or end at the point where \b appears.

By combining these anchors, you can create some fairly specific searches. As we mentioned in the introduction, "def" occurs in a lot of words. To find a title with the slang "def" as a word by itself, you would search for \bdef\b.

TIP: Anchors are very useful for finding one-word titles. They're also helpful for finding all titles that start or end with a specific word.

Special Characters

If you are not familiar with the word "character," it is a programming term meaning any letter, number or symbol. A, 2 and ? are all characters.

Special characters are used for matching specific patterns within words. Though our word search will find "old" in "gold" and "holder," using special characters can help you specify where in the word "old" must occur, or even find patterns where any combination or range of letters or numbers occurs between others... or doesn't occur. There are six special characters relevant to the IMDb.

+ - Matches one or more instances of the character right before it.

* - Matches zero or more instances of the character right before it.

? - Matches zero or one instance of the character right before it.

[] - When you use brackets, you can specify a variety of different characters that must match, or must not match.

. - Matches exactly one instance of any character.

^ - Yes, we just saw this in the Anchors section, but when placed first within brackets, it means "not."

Ranges & Sequences

Let's say you want to find something in a range, say the letter B followed by the number 1, 2, 3, 4 or 5.
In the section above, we mentioned how you can pick a few numbers or letters and put them in brackets. Thus, to do that search, you could use B[12345].

The brackets let you specify ranges as well as individual characters. Thus regular expressions let you substitute [1-5] for [12345], so the same search becomes B[1-5].

Of course, even typing [0-9] is a lot of typing when you really just want to specify any number. Because that's such a common sequence, you can substitute a two-character code for it.

\d - Matches any digit (0-9).

\w - Matches any digit or letter (and, in most regular expression engines, the underscore), but not punctuation or other symbols.

You can also combine ranges by just putting them in one after another inside the brackets.

Matching Multiples

The special characters above are pretty powerful, yet in a way they're also limited. Using +, you can ask for one or more E's in a row, but you can't say exactly how many. You can specify just two E's by putting the count in braces after the letter: E{2}.

Let's say, though, that you didn't just want to search for two E's in a row. You wanted all titles with at least two in a row, but no more than four. You can create a range, but instead of using a dash, you use a comma. To get the titles with 2-4 E's, you would type E{2,4}.

One or the other

Brackets will let you specify a few characters or a range of characters to do an "or" match, but they only choose between single characters. To choose between whole expressions, you need the "pipe" character (|).

If you went through the class on adding data, you'll remember the pipe. It can look like either a straight vertical line or a sort of stretched colon on your screen or keyboard, depending on the font. On most keyboards it looks like a stretched colon, and in most fonts used on computer screens (especially in web browsers) it looks like a line. It is generally typed by holding Shift and pressing the backslash (\) key on the average American keyboard, though its placement may vary with international or specialized layouts.

When you use the pipe, you must have a complete expression on both sides. So, if you were searching for any titles that included "bed" or "body" anywhere in them (i.e.
"body" could be within "nobody"), you would use AND THAT'S ITThose of you who are old hands at regular expressions and are reading this just to see what we had to say on them or how we've implemented their use will notice that we've left out some functions, but the functions listed above represent pretty much all you'll need within the confines of the way we present information. If you're a really techy person, our last class is called "Tech info about IMDb," which goes into some stuff on some of the equipment we use to serve our millions of monthly users and some info on how the database is constructed. If you're not really into tech stuff, you'll probably find it boring and it is not essential to getting the most out of the database. We provide it mainly for students. If you're interested, you can head on over. Otherwise, consider yourself a graduate of the IMDb University with honors. Congratulations. |