All Articles

How to Write Regular Expressions: A Beginner's Guide

·15 min read

What Are Regular Expressions?

Regular expressions (regex or regexp) are sequences of characters that define a search pattern. They are one of the most powerful tools in a developer's toolkit, used for searching, matching, and manipulating text. Every major programming language supports regex, and once you learn the syntax, you can apply it everywhere — from JavaScript and Python to command-line tools like grep and sed.

At first glance, regular expressions can look intimidating — something like ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$ seems like line noise. But regex is built from simple, composable building blocks. Once you learn the individual pieces, even complex patterns become readable.

Getting Started: Literal Matches

The simplest regex is a literal string. The pattern hello matches the exact text "hello" wherever it appears. Most characters match themselves — letters, digits, and many symbols are literal by default.

Pattern: hello
Text:    "say hello to the world"
Match:       ^^^^^

Try this pattern in our Regex Tester to see matches highlighted in real time.

Metacharacters: The Building Blocks

Regex becomes powerful through metacharacters — characters with special meanings. Here are the essential ones:

The Dot (.)

Matches any single character except a newline. The pattern h.t matches "hat", "hit", "hot", "h9t", and even "h t".

Pattern: h.t
Matches: "hat" ✓  "hit" ✓  "hot" ✓  "ht" ✗  "hoot" ✗

Anchors (^ and $)

Anchors don't match characters — they match positions:

^hello     — matches "hello world" but not "say hello"
world$     — matches "hello world" but not "world cup"
^hello$    — matches exactly "hello" and nothing else

Escaping (\)

To match a metacharacter literally, precede it with a backslash. To match an actual dot, use \.. To match a dollar sign, use \$.

Pattern: \$\d+\.\d{2}
Matches: "$19.99" ✓  "$5.00" ✓  "19.99" ✗

Character Classes

Character classes let you match one character from a specific set. Defined with square brackets:

[aeiou]      — matches any vowel
[0-9]        — matches any digit (same as \d)
[a-zA-Z]     — matches any letter
[^0-9]       — matches anything that is NOT a digit (^ negates inside [])
[a-z0-9_]    — matches lowercase letters, digits, or underscore

Shorthand Character Classes

Regex provides shortcuts for commonly used character classes:

\b\d{3}-\d{4}\b    — matches "555-1234" but not "12345-67890"
\w+@\w+\.\w+     — basic (naive) email pattern

Quantifiers: How Many Times?

Quantifiers specify how many times the preceding element should be matched:

colou?r        — matches "color" and "colour"
\d{3}          — matches exactly three digits: "123" ✓ "12" ✗
\d{2,4}        — matches 2 to 4 digits: "12" ✓ "1234" ✓ "12345" partial
go+gle         — matches "gogle", "google", "gooogle", etc.
https?://      — matches "http://" and "https://"

Greedy vs Lazy Quantifiers

By default, quantifiers are greedy — they match as much text as possible. Adding ? after a quantifier makes it lazy, matching as little as possible:

Text:    "<b>bold</b> and <b>more bold</b>"

Greedy:  <b>.*</b>     → matches "<b>bold</b> and <b>more bold</b>"
Lazy:    <b>.*?</b>    → matches "<b>bold</b>" (first match)

This is one of the most important concepts in regex. When scraping HTML or parsing delimited text, lazy quantifiers prevent over-matching.

Groups and Capturing

Parentheses () serve two purposes: grouping and capturing.

Grouping

Parentheses group parts of a pattern so you can apply quantifiers to the whole group:

(ha)+          — matches "ha", "haha", "hahaha"
(ab|cd)        — matches "ab" or "cd"
(https?://)?   — optionally matches "http://" or "https://"

Capturing Groups

Parentheses also capture the matched text, which you can reference later — in replacements or in your code:

// JavaScript: extracting date parts
const pattern = /(\d{4})-(\d{2})-(\d{2})/;
const match = "2026-02-25".match(pattern);
// match[1] = "2026", match[2] = "02", match[3] = "25"

// Named capturing groups (modern syntax)
const pattern = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
const match = "2026-02-25".match(pattern);
// match.groups.year = "2026"

Non-Capturing Groups

If you need grouping but don't need the captured value, use (?:...) for a slight performance benefit:

(?:https?://)?(www\.)?example\.com

Alternation: OR Logic

The pipe | acts as a logical OR. It matches the pattern on either side:

cat|dog          — matches "cat" or "dog"
(Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day  — matches any day name
\.(jpg|png|gif)$ — matches files ending in .jpg, .png, or .gif

Lookahead and Lookbehind

Lookahead and lookbehind are zero-width assertions — they check what comes before or after the current position without consuming characters:

\d+(?=px)        — matches digits followed by "px": "16px" → "16"
\d+(?!px)        — matches digits NOT followed by "px"
(?<=\$)\d+       — matches digits preceded by "$": "$99" → "99"
(?<!\d)\d{3}(?!\d) — matches exactly 3 digits not adjacent to others

Lookaround is incredibly useful for password validation, where you need to check for multiple conditions at the same position:

// Password: at least 8 chars, one uppercase, one digit, one special
^(?=.*[A-Z])(?=.*\d)(?=.*[!@#$%^&*]).{8,}$

Common Regex Patterns

Here are battle-tested patterns for everyday tasks. Test them with our Regex Tester and refer to the Regex Cheatsheet for quick reference.

Email Address

^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$

This covers the vast majority of real-world email addresses. A fully RFC 5322-compliant regex would be thousands of characters long — this pragmatic version is what most applications use.

URL

https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)

Phone Number (US)

^(\+1)?[-.\s]?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}$

// Matches: 555-123-4567, (555) 123-4567, +1 555.123.4567

IP Address (IPv4)

^((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)$

Hex Color Code

^#([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})$

// Matches: #FF5733, #f00, #abc123

HTML Tag

<([a-z][a-z0-9]*)\b[^>]*>(.*?)<\/\1>

The \1 is a backreference that matches whatever the first capturing group matched, ensuring the closing tag matches the opening tag.

Regex cannot fully parse HTML — nested and malformed tags will break it. For real HTML parsing, use a proper parser like DOMParser or BeautifulSoup. Regex is fine for quick-and-dirty extraction from well-formed HTML.

Regex in JavaScript

JavaScript supports regex natively through the RegExp object and regex literals:

// Regex literal
const pattern = /\d{3}-\d{4}/g;

// RegExp constructor (useful for dynamic patterns)
const dynamic = new RegExp("\\d{3}-\\d{4}", "g");

// Common methods
"call 555-1234".match(/\d{3}-\d{4}/);      // ["555-1234"]
"call 555-1234".search(/\d{3}-\d{4}/);     // 5 (index)
"call 555-1234".replace(/\d{3}-\d{4}/, "XXX-XXXX"); // "call XXX-XXXX"
/\d{3}/.test("123");                         // true

// matchAll for multiple matches with groups
const text = "2026-02-25 and 2026-03-15";
const matches = [...text.matchAll(/(\d{4})-(\d{2})-(\d{2})/g)];

JavaScript Regex Flags

Regex in Python

import re

# Basic matching
match = re.search(r'\d{3}-\d{4}', 'call 555-1234')
print(match.group())  # "555-1234"

# Find all matches
matches = re.findall(r'\d+', 'scores: 95, 87, 92')
# ['95', '87', '92']

# Named groups
match = re.search(r'(?P<year>\d{4})-(?P<month>\d{2})', '2026-02-25')
print(match.group('year'))  # "2026"

# Substitution
result = re.sub(r'\b(\w)', lambda m: m.group(1).upper(), 'hello world')
# "Hello World"

# Compile for reuse
pattern = re.compile(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')
is_valid = bool(pattern.match('user@example.com'))

Always use raw strings (r"...") for regex patterns in Python to avoid double-escaping backslashes.

Performance Pitfalls

Poorly written regex can cause catastrophic backtracking — where the engine tries an exponential number of paths before failing. This can freeze your application or cause denial-of-service vulnerabilities (ReDoS).

Patterns to Avoid

// ❌ Catastrophic backtracking
(a+)+b           — exponential on strings like "aaaaaaaaaaac"
(a|a)+b          — same problem with alternation
(.*a){10}        — nested quantifiers with overlapping matches

// ✅ Safe alternatives
a+b              — no nested quantifiers
[^b]*b           — negated character class instead of .*

Tips for Performant Regex

Testing and Debugging Regex

Writing regex without a testing tool is like writing code without a debugger. Use our Regex Tester to:

Keep our Regex Cheatsheet bookmarked for quick reference when you forget whether \b means backspace or word boundary.

Step-by-Step: Building a Real Pattern

Let's build a regex to validate a date in YYYY-MM-DD format, step by step:

  1. Start with the year: \d{4} — four digits.
  2. Add the separator: \d{4}-
  3. Add the month (01–12): (0[1-9]|1[0-2])
  4. Add another separator and the day (01–31): (0[1-9]|[12]\d|3[01])
  5. Anchor it: ^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$
const datePattern = /^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$/;

datePattern.test("2026-02-25");  // true
datePattern.test("2026-13-01");  // false (invalid month)
datePattern.test("2026-02-32");  // false (invalid day)
datePattern.test("26-02-25");    // false (2-digit year)

Note that this pattern doesn't validate whether the day is actually valid for the given month (e.g., February 30). For full date validation, combine regex with a date library.

Wrapping Up

Regular expressions reward practice. Start with simple literal matches, add character classes and quantifiers, then gradually incorporate groups and lookaround. The key is to build patterns incrementally and test each step — and the Regex Tester makes that workflow seamless. Within a few weeks of regular use, you'll find yourself reaching for regex instinctively whenever you need to search, validate, or transform text.

Try These Tools

Related Articles

Enjoy this article? Buy us a coffee to support free tools and guides.