Build an Amazon Scraper Using Your Chrome Profile
How to build a simple Amazon scraper using your Chrome profile
Look, I'll be honest with you - scraping Amazon isn't exactly a walk in the park. They've got some pretty sophisticated anti-bot mechanisms, and if you go at it the wrong way, you'll be staring at CAPTCHA screens faster than you can say "web scraping." But here's the thing: there's a clever way to do it that makes Amazon think you're just... well, you.
Let me walk you through how I built this scraper. Whether you're a business person trying to understand the technical side or a developer looking to build something similar, I'll break it down so it actually makes sense.
Come with me!
The Big Idea: use your own Chrome
This is where most people get it wrong. They fire up a fresh Selenium instance, maybe throw in some proxy rotation, and wonder why Amazon is blocking them after three requests. Sound familiar? Here's the secret sauce: use your actual Chrome profile.
Think about it - your browser has your login sessions, your cookies, your browsing history. To Amazon, it looks like you browsing their site. Not some suspicious headless browser making requests at 3 AM.
At the very beginning, we need to find the folder where our Chrome profile is stored.
To do that, type chrome://version/ into the address bar.
There you'll immediately see the path to your profile.
For me, it looks like this:
C:\Users\myusername\AppData\Local\Google\Chrome\User Data\Profile 1
So the path we care about is:
C:\Users\myusername\AppData\Local\Google\Chrome\User Data\
For convenience, let's create a .bat file (my example is on Windows, but it works almost the same on Linux/macOS).
Inside the .bat file, add:
"C:\Program Files\Google\Chrome\Application\chrome.exe" --remote-debugging-port=9333 --user-data-dir="C:\Users\myusername\AppData\Local\Google\Chrome\User Data\"
Great! The most important part here is the port: 9333.
You can choose (almost) any number - I just picked this one.
Now, when you run the .bat file, Chrome will open with your profile already loaded.
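By the way, if you're on macOS or Linux, the same trick works from a terminal. The binary and profile paths below are typical defaults, not necessarily yours - check chrome://version/ just like before:

# macOS (typical install location)
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9333 --user-data-dir="$HOME/Library/Application Support/Google/Chrome"

# Linux (typical install location)
google-chrome --remote-debugging-port=9333 --user-data-dir="$HOME/.config/google-chrome"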
Time to look at the code!
We want to connect Selenium to Chrome.
Let's grab Python by the head and get f*cking started!
class DriverManager:
    def connect(self):
        options = Options()
        options.add_experimental_option("debuggerAddress", f"localhost:{Config.CHROME_DEBUG_PORT}")
        self.driver = webdriver.Chrome(options=options)
See that debuggerAddress bit? That's connecting Selenium to your already running Chrome browser. You start Chrome with remote debugging enabled (more on that in a sec), and boom - Selenium can control your regular browsing session.
Wait, How Do I Actually Do This?
On your machine, start Chrome like this:
...\chrome.exe --remote-debugging-port=9333
That's it. Chrome runs normally, just like we said, but now it's listening on port 9333 for commands. When the scraper connects, it's like having a really fast, tireless assistant controlling your own browser.
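Not sure it's actually listening? Chrome exposes a small DevTools HTTP endpoint on that port, so a quick sanity check looks like this:

curl http://localhost:9333/json/version

If you get back a little JSON blob describing your browser, the debugging port is up and Selenium will be able to attach to it.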
The beautiful part? If Amazon throws a CAPTCHA at you (and sometimes they will), you just solve it manually. The scraper waits patiently, and once you click those traffic lights or whatever, it continues on its merry way.
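If you want the "waits patiently" part in code, here's one simple way to do it - a minimal sketch, assuming Amazon's interstitial CAPTCHA form is the thing to watch for (the selector and the helper name are my assumptions, not necessarily what the full repo does):

import time

# Amazon's "are you a robot?" page posts to /errors/validateCaptcha;
# treating that form as the "we're blocked" marker is an assumption
CAPTCHA_FORM = 'form[action="/errors/validateCaptcha"]'

def wait_out_captcha(get_soup, poll_seconds=5):
    # get_soup is any callable that returns a fresh BeautifulSoup of the
    # current page; solve the CAPTCHA in the visible Chrome window and
    # this loop notices on the next poll
    while get_soup().select_one(CAPTCHA_FORM):
        time.sleep(poll_seconds)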
Simple but effective project structure
I'm a big believer in keeping things clean and modular. Here's how I structured this:
src/
├── main.py # app entry point
├── config.py # all the boring configuration stuff
├── routes.py # API endpoints
└── scraper/
├── driver_manager.py # handles chrome connection
├── scraper.py # scraping logic
└── data_extractor.py # parses and cleans the data
One Browser to Rule Them All!
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait

from config import Config


class DriverManager:
    def __init__(self):
        self.driver = None
        self.wait = None

    def connect(self):
        options = Options()
        # attach to the Chrome instance that's already running with remote debugging
        options.add_experimental_option("debuggerAddress", f"localhost:{Config.CHROME_DEBUG_PORT}")
        self.driver = webdriver.Chrome(options=options)
        self.wait = WebDriverWait(self.driver, Config.SELENIUM_TIMEOUT)
        return self.driver
This singleton pattern ensures we're reusing the same browser connection. Why? Because starting up a new Chrome instance every time is expensive (both in time and resources), and more importantly, you lose all that precious session data.
The WebDriverWait is there for those moments when Amazon's JavaScript takes a hot second to load. Trust me, you need this.
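The Flask route you'll see later calls driver_manager.get_driver(). Conceptually that's just a lazy "connect once, reuse forever" accessor - something like this sketch (the repo's actual version may differ):

    def get_driver(self):
        # connect on first use, then keep handing back the same session
        if self.driver is None:
            self.connect()
        return self.driver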
The Scraper: where the MAGIC happens
Here's where we actually grab the data:
def search(self, query):
    response = self._get_response(f"https://www.amazon.com/s?k={query}&ref=cs_503_search")
    results = []
    # each search result lives in its own div with role="listitem"
    for listitem_el in response.soup.select('div[role="listitem"]'):
        product_container_el = listitem_el.select_one(".s-product-image-container")
        if not product_container_el:
            continue
I'm using BeautifulSoup here because, let's face it, it's way more pleasant to work with than XPath or Selenium's built-in element finders. Once the page loads, I grab the HTML and let BeautifulSoup parse it. Simple as that.
Tip: Amazon's search results hang each product off a div with role="listitem". That attribute is pretty stable across their site variations. I learned this the hard way after my scraper broke twice because I was relying on class names that Amazon kept changing.
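One thing the snippets keep calling but never show is _get_response. In spirit it's just "navigate, wait, parse". Here's a minimal sketch of that idea - the little Response wrapper is invented for illustration, and the real repo may structure it differently:

import time
from collections import namedtuple

from bs4 import BeautifulSoup

from config import Config  # the Config class shown later in this post

Response = namedtuple("Response", ["url", "soup"])  # hypothetical wrapper

class Scraper:
    def __init__(self, driver):
        self.driver = driver

    def _get_response(self, url):
        self.driver.get(url)
        time.sleep(Config.PAGE_LOAD_DELAY)  # give Amazon's staged rendering time to finish
        return Response(url=url, soup=BeautifulSoup(self.driver.page_source, "html.parser"))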
Prices are... priceless
Amazon's pricing HTML is... interesting. Sometimes there's a sale price, sometimes there isn't. Sometimes the regular price is crossed out, sometimes it's not even there. Here's how I handle it:
def get_price_from_elements(price1_el, price2_el):
    regular_price = None
    sale_price = None
    current_price = None

    if price1_el and price2_el:
        # both prices exist = the item is on sale
        sale_price = DataExtractor._parse_currency_value(price1_el.text)
        regular_price = DataExtractor._parse_currency_value(price2_el.text)
        current_price = sale_price
    elif price1_el:
        # only one price = the regular price
        regular_price = DataExtractor._parse_currency_value(price1_el.text)
        current_price = regular_price

    # same keys the JSON response uses
    return {
        "current_price": current_price,
        "regular_price": regular_price,
        "sale_price": sale_price,
    }
The _parse_currency_value method is where things get spicy:
def _parse_currency_value(s: str):
    # extract the currency symbol ($, €, £, etc.)
    currency_match = re.match(r'^[^\s&0-9]+', s)
    currency = currency_match.group(0) if currency_match else None
    # grab all the digits and convert to a float
    digits = ''.join(re.findall(r'\d+', s))
    if not digits:
        raise ValueError("Invalid price!")
    amount = float(digits) / 100  # the last two digits are the cents
    # same shape the API exposes for each price
    return {"formatted": s.strip(), "amount": amount, "symbol": currency}
Why divide by 100? Because "1999" should be $19.99, not $1,999. This handles all sorts of currency formats without breaking a sweat.
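A couple of quick examples, worked out by hand, of what that logic produces:

DataExtractor._parse_currency_value("$19.99")     # amount 19.99, symbol "$"
DataExtractor._parse_currency_value("€1.299,00")  # amount 1299.0, symbol "€"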
The Flask API - Make it happen!
I wrapped everything in a simple Flask API because, honestly, who wants to mess with Python imports every time they need to scrape something?
@api.route('/search', methods=['GET'])
def search():
    query = request.args.get('query', '')
    if not query:
        return jsonify({"error": "query required"}), 400
    try:
        driver = driver_manager.get_driver()
        scraper = Scraper(driver)
        result = scraper.search(query)
        return jsonify(result)
    except Exception as e:
        return jsonify({"error": str(e)}), 500
Now you can just:
curl "http://localhost:5000/search?query=mechanical+keyboard"
And get back nice, clean JSON:
{
  "query": "mechanical keyboard",
  "count": 20,
  "results": [
    {
      "id": "item-xyz",
      "title": "Cherry MX Blue Mechanical Gaming Keyboard",
      "url": "https://www.amazon.com/dp/B08...",
      "image_src": "https://m.media-amazon.com/images/...",
      "price": {
        "current_price": {
          "formatted": "$89.99",
          "amount": 89.99,
          "symbol": "$"
        },
        "regular_price": { ... },
        "sale_price": null
      }
    }
  ]
}
The product details
Search results are great, but sometimes you need the full details:
def get_product(self, url):
    response = self._get_response(url)

    product_title_el = response.soup.select_one("#productTitle")
    # current (possibly discounted) price
    price1_el = response.soup.select_one('.a-price > .a-offscreen')
    # struck-through list price, only present when the item is on sale
    price2_el = response.soup.select_one('.a-price.a-text-price > .a-offscreen')

    if not product_title_el:
        return None

    return {
        "url": url,
        "title": data_extractor.get_product_title(product_title_el.text),
        "price": data_extractor.get_price_from_elements(price1_el, price2_el),
        "images": data_extractor.get_images_from_product(response.soup.select("#altImages .imageThumbnail"))
    }
One cool trick here is the image upscaling. Amazon gives you tiny thumbnails by default, but with a little regex magic:
def get_images_from_product(image_elements):
    images = []
    for image_element in image_elements:
        img = image_element.select_one('img')
        if not img or not img.get('src'):
            continue  # skip thumbnails without an image source
        src = img.get('src')
        # replace the size token to get the full-resolution version
        new_src = re.sub(r'\.[^/]*?_\.(jpg|jpeg|png|webp)$', r'._AC_SL1500_.\1', src)
        images.append(new_src)
    return images
That regex finds Amazon's size indicator in the URL and replaces it with AC_SL1500 - their code for "give me the big version."
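For example, a thumbnail URL shaped like the one below (the filename is made up) comes out as the SL1500 version:

src = "https://m.media-amazon.com/images/I/41abcd1234._AC_US40_.jpg"  # invented thumbnail
re.sub(r'\.[^/]*?_\.(jpg|jpeg|png|webp)$', r'._AC_SL1500_.\1', src)
# -> "https://m.media-amazon.com/images/I/41abcd1234._AC_SL1500_.jpg"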
Yeah, I know you’re a hardcoder, but please don’t hardcode anything
I keep all the tunable parameters in one place:
class Config:
    CHROME_DEBUG_PORT = 9333
    API_HOST = '0.0.0.0'
    API_PORT = 5000
    SELENIUM_TIMEOUT = 30
    PAGE_LOAD_DELAY = 5
That PAGE_LOAD_DELAY? Critical. Amazon's pages load in stages, and if you try to parse too early, you'll miss half the data. Five seconds is my sweet spot, but YMMV depending on your internet speed.
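If the fixed delay ever feels too blunt, one alternative (not what the snippets above do) is to lean on the WebDriverWait we already created in DriverManager and wait for a concrete element instead - here I reuse the same container selector search() relies on:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

# wait is the WebDriverWait created in DriverManager.connect();
# block until at least one product image container exists,
# instead of sleeping for a fixed PAGE_LOAD_DELAY
wait.until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, ".s-product-image-container")))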
Yeah we did it!
Let me break down the advantages of using your own browser:
1. You're INVINCI... sorry! INVISIBLE (Mostly)
Using your real browser profile means you have:
- Your actual cookies
- Your login session (if you're logged in)
- Your browsing history
- Your browser fingerprint
All of this makes you look like a regular user, not a bot.
2. CAPTCHA? No way
When Amazon gets suspicious, you just solve the CAPTCHA like a normal person. The scraper waits, you click, life goes on.
3. Simple to maintain
No complicated proxy rotation, no headless browser detection workarounds, no constantly updating user agents. Just straightforward code that works.
4. Easy to debug
Because you can see the browser, debugging is trivial. Selector not working? Open the dev tools in your browser and figure it out.
Let's Be Real - limitations
This approach is perfect for:
- Personal projects
- Building a prototype
- Low-volume scraping
- Understanding how Amazon's frontend works
But it's not great for:
- High-volume production scraping
- Running on servers (you need a desktop environment)
- Parallel requests (one browser = one request at a time)
- Completely automated, hands-off operation
For professional use, consider an API
If you're running a business that needs reliable, high-volume Amazon data, you probably want something more robust. Managing your own scraping infrastructure gets complicated fast - you need proxies, CAPTCHA solving services, constant maintenance as Amazon changes their HTML...
For production use cases, I'd recommend checking out Amazon Instant Data API from our friends at DataOcean. They handle all the headaches of maintaining scrapers at scale, dealing with rate limits, rotating IPs, and keeping up with Amazon's changes. Sometimes paying for a good API beats maintaining your own infrastructure.
The Code Structure: a good approach
One thing I want to emphasize is the separation of concerns. Notice how:
- driver_manager.py only handles browser connections
- scraper.py only handles page navigation and element location
- data_extractor.py only handles parsing and cleaning data
- routes.py only handles HTTP requests
This isn't just me being pedantic. When Amazon changes their HTML (and they will), you only need to update the selectors in scraper.py. When you want to add a new data field, you just extend data_extractor.py. Clean architecture saves your sanity.
Expect the worst
Amazon's HTML isn't always consistent. Sometimes fields are missing. Sometimes they use different class names for the same thing. That's why I check everything:
if not image_el or not link_el or not h2:
continue
Better to skip a product than crash the entire scraper because one listing is malformed.
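In context, those guards sit right at the top of the search loop. The link and title selectors below are my assumptions about Amazon's current markup, so treat this as a sketch:

for listitem_el in response.soup.select('div[role="listitem"]'):
    image_el = listitem_el.select_one(".s-product-image-container img")
    link_el = listitem_el.select_one("a.a-link-normal")  # assumption
    h2 = listitem_el.select_one("h2")                    # product title heading
    if not image_el or not link_el or not h2:
        continue  # skip malformed or sponsored-only listings instead of crashing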
Any thoughts?
Building a scraper is part art, part science. The technical bits are straightforward once you understand them, but the real skill is in making architectural decisions that save you time down the road.
Using your own browser via remote debugging is one of those "why didn't I think of this sooner?" solutions. It's elegant, it works, and it keeps things simple.
Is it perfect? No. Will it scale to millions of requests? Also no. But for what it is - a clean, maintainable, easy-to-understand scraper that actually works - I'm pretty happy with it.
Now go forth and scrape responsibly. And seriously, if you need production-scale scraping, check out that DataOcean API, or just contact me if your needs go well beyond what a simple API can give you. Your future self will thank you.
Want to build something similar?
The code structure I showed you works for pretty much any website. Just swap out the selectors, adjust the data extraction logic, and you're good to go. The browser-connection approach is universal.
Questions?
Drop them in the comments. I'm always happy to talk scraping strategies, Python architecture, or why BeautifulSoup is superior to XPath (fight me).
Happy scraping! 🚀
👉 You can find the full Amazon scraper code on our GitHub, feel free to check it out https://github.com/letsscrapecom/simple-amazon-scraper