Advanced DOM interaction in Puppeteer: disabling JavaScript, loading HTML without visiting the site, error handling, delaying and scrolling the page
Contents
1. How to use Puppeteer: installation and quick start
3.1 What exactly does Puppeteer get – the original HTML or the DOM after JavaScript finishes?
3.2 How to disable (stop) all JavaScript in Puppeteer
3.3 How to get the HTML source in Puppeteer instead of the DOM after creating a page with JavaScript
3.4 How to pass HTML pages from a string (from a database) to Puppeteer
3.5 How to save the current state of a web page in Puppeteer when an error occurs
3.6 How to pause script and action execution in Puppeteer for a while. Delaying Puppeteer scripts
3.7 How to take a screenshot of a page with images in Puppeteer if the site uses lazy loading
3.8 How to scroll a page in Puppeteer
3.9 How to do gradual page scrolling in Puppeteer
Before we move on to the next practical examples of using Puppeteer, let's dive a little deeper into the theory behind Puppeteer and look at the options and actions that can affect the generation of the DOM of a web page.
These tips will help you better understand how Puppeteer works and build on your existing Puppeteer skills. For example, you will be able to:
- Disable JavaScript;
- Choose whether you want to work with the original HTML or the current DOM;
- Improve the appearance of screenshots if lazy loading of images is enabled on the page;
- Load HTML code from a database or other sources, instead of from the site;
- Save the current DOM state and take a screenshot when an error occurs;
- Pause JavaScript and Puppeteer for a certain amount of time;
- Move down the page or scroll the page gradually.
3.1 What exactly does Puppeteer get – the original HTML or the DOM after JavaScript finishes?
The content of a web page consists of HTML code received from a web server. But modern technologies (HTML 5 and JavaScript) allow you to change the content of a web page on the fly: JavaScript can add new elements, change the properties of existing elements, remove elements, and perform a variety of actions with them. JavaScript can simply clear all the HTML code received from the web server and create a completely different page.
HTML DOM (Document Object Model) is the content of a web page at the current moment, after JavaScript has added, removed, or changed elements.
That is, when a web page is loaded, its DOM corresponds to the HTML code received from the server. Subsequently, the DOM may not change (for static pages), or change greatly under the influence of JavaScript.
You may have already noticed that for some sites, the saved web page looks completely different from what you saw on your screen. This happens because when you perform the “Save As” operation, you save the HTML that the server sent, but do not save the current DOM.
In practical terms, the DOM is the HTML that forms the current version of the web page.
See also: How to save the entire HTML DOM in its current state
If you get the HTML code of a page, for example, using the cURL utility (as well as many other tools designed to download individual web pages or create local copies of websites), then the original HTML code will be downloaded.
What exactly does Puppeteer output when we output HTML or save a web page? Puppeteer outputs the HTML DOM as it looks after JavaScript has finished running. That is, the result of saving the same page using cURL and Puppeteer may differ (primarily for sites that dynamically create a web page).
So, if your goal is to get the HTML DOM of a site after JavaScript has finished running, then Puppeteer will perform this task by default.
3.2 How to disable (stop) all JavaScript in Puppeteer
To disable JavaScript in Puppeteer, call the following method:
await page.setJavaScriptEnabled(false);
Call it before navigating to the web page. If you call it after navigation, it will have no visible effect, because the page's JavaScript has already executed.
An example of how to do this will be given below.
3.3 How to get the HTML source in Puppeteer instead of the DOM after creating a page with JavaScript
To get the source code of a web page (HTML) instead of the DOM, disable JavaScript execution as follows:
await page.setJavaScriptEnabled(false)
An example of running Puppeteer with JavaScript disabled:
const puppeteer = require('puppeteer');

async function run() {
  // URL from the command line, falling back to the test page used in this article
  const url = process.argv[2] || 'https://hackware.ru/?p=19226';

  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  const customUserAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36';
  await page.setUserAgent(customUserAgent);

  // Must be called before page.goto(), otherwise the page's scripts have already run
  await page.setJavaScriptEnabled(false);
  await page.goto(url);

  // With JavaScript disabled, the markup contains no changes made by scripts
  const html = await page.content();
  console.log(html);

  await browser.close();
}

run();
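Let's compare the output for the same page with JavaScript enabled (default) and with JavaScript disabled. In the commands below, js-dis.js is the script shown above, and js-en.js is the same script without the setJavaScriptEnabled(false) call.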
Search for the specified HTML tags on a page with JavaScript enabled:
node js-en.js | grep -i -E --color '<pre|<code'
We won't go into the details of the result. Just note that only CODE tags were found in the resulting HTML code.
Now let's search for the same HTML tags on the same page, but with JavaScript disabled:
node js-dis.js | grep -i -E --color '<pre|<code'
As you can see, only PRE tags were found.
The results show that by default, the PRE tag is used in HTML markup, but when forming the DOM of a page, JavaScript significantly changes the HTML markup, including replacing PRE tags with CODE.
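As an alternative to disabling JavaScript, the response object returned by page.goto() exposes the body the server sent, so the source and the DOM can be captured in a single run. The following is only a sketch, not the method used in this article: fetchSourceAndDom() and countTags() are hypothetical helper names, and countTags() is a rough stand-in for the grep commands above (it counts opening tags rather than matching lines).

```javascript
// Hypothetical helpers for illustration; they are not part of Puppeteer.

// Returns the HTML body as sent by the server and the DOM serialized
// after the page's scripts have run.
async function fetchSourceAndDom(page, url) {
  const response = await page.goto(url);
  // page.goto() can resolve to null (for example, for about:blank)
  const source = response ? await response.text() : null;
  const dom = await page.content();
  return { source, dom };
}

// Counts opening tags by name, ignoring case.
// Assumes tagNames contains plain tag names such as 'pre' or 'code'.
function countTags(html, tagNames) {
  const counts = {};
  for (const name of tagNames) {
    const matches = html.match(new RegExp('<' + name + '\\b', 'gi'));
    counts[name] = matches ? matches.length : 0;
  }
  return counts;
}

// Usage inside an async function, with `page` created by browser.newPage():
// const { source, dom } = await fetchSourceAndDom(page, 'https://hackware.ru/?p=19226');
// console.log(countTags(source, ['pre', 'code']), countTags(dom, ['pre', 'code']));
```

Reading the response body keeps JavaScript enabled, so the same page object can still be used for the DOM afterwards.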
3.4 How to pass HTML pages from a string (from a database) to Puppeteer
How do you load a page in Puppeteer from a database or a string rather than from a site, and why would you need to?
In Puppeteer, you can pass the contents of a web page as HTML code, without opening the URL. This can be useful, for example, in the following situations:
- you want to take a screenshot of a web page that is generated from HTML code (this replaces the following sequence of actions: save HTML to file → open file in web browser → take a screenshot)
- you want to get the DOM of the page from HTML after JavaScript has finished
To do this, you can use the Page.setContent() method: https://pptr.dev/api/puppeteer.page.setcontent
The following script passes HTML code to Puppeteer, and then prints out the resulting DOM (which is very different from the original HTML – you can see this by disabling JavaScript):
const puppeteer = require('puppeteer');

async function run() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  const customUserAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36';
  await page.setUserAgent(customUserAgent);

  // Uncomment to see the original HTML without the changes made by the script
  //await page.setJavaScriptEnabled(false);

  // setContent() returns a promise, so wait for it before reading the DOM
  await page.setContent(`<!doctype html>
<html lang="en-US">
<head>
  <meta charset="UTF-8" />
  <title>Working with elements</title>
</head>
<body>
  <div id="div1">The text above has been created dynamically.</div>
</body>
</html>
<script>
  // Remember the markup before the script changes anything
  const source_html = document.documentElement.outerHTML;

  // Creates an element with the given text and inserts it before #div1
  function addElement(value, content) {
    const newDiv = document.createElement(value);
    const newContent = document.createTextNode(content);
    newDiv.appendChild(newContent);
    const currentDiv = document.getElementById("div1");
    document.body.insertBefore(newDiv, currentDiv);
  }

  addElement('div', 'Hi there and greetings!');
  addElement('h2', 'Original HTML code:');
  addElement('pre', source_html);
  addElement('ul', '');
  for (let i = 0; i < 10; i++) {
    addElement('li', i);
  }

  const dom = document.documentElement.outerHTML;
  addElement('h2', 'New HTML DOM:');
  addElement('pre', dom);
</script>`);

  const html = await page.content();
  console.log(html);

  await browser.close();
}

run();
The page itself is not meaningful. Its main purpose is to show how much the original HTML (left) and the final DOM (right) can differ for the same page. In the following screenshot, the DOM does not fit in the console window (you can see less than half):
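The first use case listed above, taking a screenshot of a page that exists only as an HTML string, can be sketched as follows. screenshotFromHtml() is a hypothetical helper name, and the sketch assumes that `browser` was obtained from puppeteer.launch() and that the HTML string has already been read from the database or another source.

```javascript
// Hypothetical helper: renders an HTML string and saves a full-page screenshot.
// `browser` is assumed to be a Browser instance returned by puppeteer.launch().
async function screenshotFromHtml(browser, html, path) {
  const page = await browser.newPage();
  try {
    // Resolves once the load event has fired for the supplied markup
    await page.setContent(html, { waitUntil: 'load' });
    await page.screenshot({ path, fullPage: true });
  } finally {
    // Close the tab even if rendering or the screenshot fails
    await page.close();
  }
}

// Usage inside an async function:
// await screenshotFromHtml(browser, '<h1>Hello</h1>', 'from-string.png');
```

Opening a fresh page for each string keeps one document's scripts and state from leaking into the next screenshot.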
3.5 How to save the current state of a web page in Puppeteer when an error occurs
Very soon, we will move on to interacting with web page elements from Puppeteer (in simple terms, we will click buttons and enter data in text fields, and then click buttons again). By the way, we will press the first button from Puppeteer in the next example.
If an error occurs, the script will crash and it will be unclear what exactly happened. As shown above, analyzing the HTML code may not help, since Puppeteer works with a DOM that exists only while the browser is running and disappears after Puppeteer finishes working.
How can we make it so that if a crash occurs due to an error, Puppeteer would save the current state of the DOM and take a screenshot?
For this, we can use the try-catch construct:
try { } catch (error) { }
This is a JavaScript construct, not a Puppeteer library construct.
The following code fragment tries to perform an action. If an error occurs, it prints the error text, takes a screenshot of the web page, and saves the current DOM contents to a file:
try {
  // ACTIONS
} catch (error) {
  console.error(error);

  // Screenshot of the page at the moment of the error
  await page.screenshot({ path: 'fail.png', fullPage: true });

  // Current DOM of the page
  const html = await page.content();
  const fs = require('fs');
  await fs.promises.writeFile('failed.htm', html);
  console.log('The failed file was saved!');
}
Let's look at the full example of code that will cause an error – we are trying to click a button that does not exist on the page.
const fs = require('fs');
const puppeteer = require('puppeteer');

// Pauses the script for the given number of milliseconds
function delay(time) {
  return new Promise(function(resolve) {
    setTimeout(resolve, time);
  });
}

async function run() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  const customUserAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36';
  await page.setViewport({width: 1440, height: 3440});
  await page.setUserAgent(customUserAgent);

  await page.goto('https://hackware.ru/?p=19226');

  for (let i = 0; i < 1; i++) {
    await page.evaluate(() => {
      window.scrollTo(0, document.body.scrollHeight);
    });

    try {
      // This button does not exist on the page, so page.$() returns null
      const button = await page.$('div > div.c-loadMoreButton.u-grid-columns > a');
      await button.click();
      await delay(4000);
    } catch (error) {
      console.error(error);

      // Save a screenshot and the current DOM for later analysis
      await page.screenshot({ path: 'fail.png', fullPage: true });
      const html = await page.content();
      await fs.promises.writeFile('failed.htm', html);
    }
  }

  await browser.close();
}

run();
The button that Puppeteer is trying to click is missing, so page.$() returns null, and calling click() on null causes an error. The error message is printed to the console:
TypeError: Cannot read properties of null (reading 'click')
    at run (/home/mial/bin/tests/puppeteer/fail.js:28:17)
Two files will also be created: fail.png and failed.htm, containing a screenshot and the DOM of the page at the time the error occurred.
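If several scripts need the same error handling, the catch logic can be moved into a reusable function. The sketch below is one possible arrangement: saveFailureState() and clickOrThrow() are hypothetical helper names, and the file naming scheme (a base name plus .png and .htm) is an assumption modeled on the example above.

```javascript
const fs = require('node:fs/promises');

// Hypothetical helper: saves a screenshot and the current DOM of `page`
// to <baseName>.png and <baseName>.htm, and returns the two file names.
async function saveFailureState(page, baseName = 'fail') {
  const screenshotPath = `${baseName}.png`;
  const htmlPath = `${baseName}.htm`;
  await page.screenshot({ path: screenshotPath, fullPage: true });
  await fs.writeFile(htmlPath, await page.content());
  return { screenshotPath, htmlPath };
}

// Hypothetical helper: clicks the first element matching `selector`.
// page.$() resolves to null when nothing matches, so the explicit check
// replaces the generic TypeError with a message that names the selector.
async function clickOrThrow(page, selector) {
  const element = await page.$(selector);
  if (element === null) {
    throw new Error(`No element matches selector: ${selector}`);
  }
  await element.click();
}

// Usage inside an async function:
// try {
//   await clickOrThrow(page, 'div > div.c-loadMoreButton.u-grid-columns > a');
// } catch (error) {
//   console.error(error);
//   await saveFailureState(page, 'fail');
// }
```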
3.6 How to pause script and action execution in Puppeteer for a while. Delaying Puppeteer scripts
If you ran the examples from the previous parts against different sites, you may have noticed that actions are not performed prematurely and the script does not end too early. By default, page.goto() resolves only after the page's load event has fired, so if the page needs time to load, Puppeteer waits and then receives the HTML code or takes screenshots correctly.
So if you are thinking about pausing Puppeteer until JavaScript finishes, there's a good chance you don't need to: Puppeteer may well do the job correctly by default.
However, I've seen situations where you really need to wait for Puppeteer to complete operations when interacting with web pages (like waiting for the next part of the page to load).
The Puppeteer library used to have its own methods for pausing a script (such as page.waitForTimeout()), but they have been deprecated and removed from the library.
However, you can use the following construct to pause Puppeteer before the web browser starts the next programmed action:
function delay(time) {
  return new Promise(function(resolve) {
    setTimeout(resolve, time);
  });
}

// some actions
// await button.click();
await delay(4000);
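Two variations on this construct may be worth knowing. Node.js 15 and later ships a promise-based setTimeout in the timers/promises module, which can replace the hand-written delay() function, and page.waitForSelector() waits for an element to appear instead of assuming it is already on the page. The sketch below combines both; clickWhenPresent() is a hypothetical helper name, and the default timeout and pause values are arbitrary assumptions.

```javascript
// Built into Node.js 15 and later: a promise-based version of setTimeout
const { setTimeout: sleep } = require('node:timers/promises');

// Hypothetical helper: waits until `selector` appears, clicks it,
// then pauses so that the page can load new content.
async function clickWhenPresent(page, selector, { timeout = 10000, pauseMs = 4000 } = {}) {
  // Rejects with a timeout error if nothing matches within `timeout` ms
  const button = await page.waitForSelector(selector, { timeout });
  await button.click();
  await sleep(pauseMs);
}

// Usage inside an async function:
// await clickWhenPresent(page, 'div > div.c-loadMoreButton.u-grid-columns > a');
```

A fixed pause is still needed here because the script has no specific element to wait for after the click. When such an element is known, waiting for it is usually more reliable than waiting for a fixed time.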
A practical example of using the delay – if you remove the delay from your script, the result will not be as expected:
const fs = require('fs');
const puppeteer = require('puppeteer');

// Pauses the script for the given number of milliseconds
function delay(time) {
  return new Promise(function(resolve) {
    setTimeout(resolve, time);
  });
}

async function run() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  const customUserAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36';
  await page.setViewport({width: 1440, height: 3440});
  await page.setUserAgent(customUserAgent);

  await page.goto('https://www.zdnet.com/topic/artificial-intelligence/');

  // Scroll down and click the "load more" button five times
  for (let i = 0; i < 5; i++) {
    await page.evaluate(() => {
      window.scrollTo(0, document.body.scrollHeight);
    });

    try {
      const button = await page.$('div > div.c-loadMoreButton.u-grid-columns > a');
      await button.click();
      // Give the page time to load the next portion of content
      await delay(4000);
    } catch (error) {
      console.error(error);
      await page.screenshot({ path: 'fail.png', fullPage: true });
      const html = await page.content();
      await fs.promises.writeFile('failed.htm', html);
    }
  }

  await page.screenshot({ path: 'delay.jpg', fullPage: true });

  await browser.close();
}

run();
Result of running the script (you can comment out the line “await delay(4000);” to make sure that you really need to wait for the page to load):
3.7 How to take a screenshot of a page with images in Puppeteer if the site uses lazy loading
Lazy loading is a technique that postpones loading certain parts of a web page, especially images, until they are needed. Instead of loading everything at once, which is called “eager” loading, the browser does not request certain resources until the user interacts with the page in a way that requires those resources.
In practice this means, for example, that images are loaded only after you scroll to them. Note the following screenshot: at the top of the screen, the images have loaded, but at the bottom of the screen, they have not. It is useless to set a delay and wait for the images to load, because they will not load until a user scrolls to them.
But what to do in this case when using Puppeteer? There are at least two options:
- Scroll to the images to load them – exactly as the site logic suggests.
- Increase the size of the web browser window so that the images get to the first screen without scrolling.
We will consider both options, starting with the second one, since it is very simple.
In the previous parts, we have already become familiar with the setViewport() method, which allows you to set the size of the virtual web browser window. Now we will use it again, but set the height value much larger, for example:
const puppeteer = require('puppeteer');

async function run() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  const customUserAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36';
  // A very tall viewport puts the lazily loaded images on the first screen
  await page.setViewport({width: 1440, height: 3440});
  await page.setUserAgent(customUserAgent);

  await page.goto('https://www.zdnet.com/topic/artificial-intelligence/');

  await page.screenshot({ path: 'lazy-download.jpg', fullPage: true });

  await browser.close();
}

run();
As you can see, all the images have now loaded, since the website thinks that all the page content is on the first screen.
The viewport size used in this example is not the limit. For example, the following size does not cause an error either:
await page.setViewport({width: 14400, height: 14400});
So you can experiment with the size.
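When experimenting with the size, it can help to check programmatically whether the images have loaded instead of inspecting each screenshot by eye. The following sketch counts the images that the browser has not finished loading; countUnloadedImages() is a hypothetical helper name. It is only an approximation: a site that first shows a placeholder picture and swaps in the real one later will report its images as loaded.

```javascript
// Hypothetical helper: returns how many <img> elements on the page
// have not finished loading (or failed to load).
async function countUnloadedImages(page) {
  return page.evaluate(() => {
    // This function runs inside the browser, not in Node.js
    const images = Array.from(document.images);
    return images.filter((img) => !(img.complete && img.naturalWidth > 0)).length;
  });
}

// Usage inside an async function, after page.goto():
// const missing = await countUnloadedImages(page);
// console.log(`Images not loaded: ${missing}`);
```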
3.8 How to scroll a page in Puppeteer
To scroll a page, you can use the following code:
await page.evaluate(() => { window.scrollTo(0, document.body.scrollHeight); });
Note that you will most likely need to set a delay before the next action so that the web page can load the necessary data.
Example:
const puppeteer = require('puppeteer');

// Pauses the script for the given number of milliseconds
function delay(time) {
  return new Promise(function(resolve) {
    setTimeout(resolve, time);
  });
}

async function run() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  const customUserAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36';
  await page.setViewport({width: 1440, height: 1440});
  await page.setUserAgent(customUserAgent);

  await page.goto('https://www.zdnet.com/topic/artificial-intelligence/');

  // Jump to the very bottom of the page
  await page.evaluate(() => {
    window.scrollTo(0, document.body.scrollHeight);
  });
  // Give the page time to load the images
  await delay(4000);

  await page.screenshot({ path: 'scrolled.jpg', fullPage: true });

  await browser.close();
}

run();
You can see that all the images have loaded now (note that the screenshot is not perfect – the ad block that should be stuck to the bottom of the screen is in the wrong place).
3.9 How to do gradual page scrolling in Puppeteer
Setting a large height of the web browser window or scrolling to the very bottom of the page usually works well for loading lazily loaded content. In addition to these options, you can also use gradual page scrolling. An example of the code is shown below:
const puppeteer = require('puppeteer');

// Pauses the script for the given number of milliseconds
function delay(time) {
  return new Promise(function(resolve) {
    setTimeout(resolve, time);
  });
}

async function run() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  const customUserAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36';
  await page.setViewport({width: 1440, height: 800});
  await page.setUserAgent(customUserAgent);

  await page.goto('https://www.zdnet.com/topic/artificial-intelligence/');

  // Full height of the page and height of one screen
  const height = await page.evaluate(() => {
    return document.body.scrollHeight;
  });
  const windowHeight = page.viewport().height;
  const steps = height / windowHeight + 1;

  // Scroll one screen at a time with a short pause after each step
  for (let i = 0; i < steps; i++) {
    const currentHeight = windowHeight * i;
    await page.evaluate((top) => {
      window.scrollTo(0, top);
    }, currentHeight);
    await delay(300);
  }

  await page.screenshot({ path: 'smooth-scroll.jpg', fullPage: true });

  await browser.close();
}

run();
This is a screenshot after smooth scrolling – as you can see, all the images have loaded now and in general the page does not have the problem that was present in the previous example when simply moving down:
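On pages that keep growing while you scroll, the height measured before the loop can become outdated. One way to handle this is to re-read the page height after every step. The sketch below shows this idea under a few assumptions: scrollGradually() is a hypothetical helper name, the viewport has been set with setViewport(), and maxSteps is an arbitrary safety limit for pages with endless scrolling.

```javascript
const { setTimeout: sleep } = require('node:timers/promises');

// Hypothetical helper: scrolls down one viewport at a time and re-reads the
// page height after every step, because lazily loaded content can make the
// page grow. Returns the list of offsets it scrolled to.
async function scrollGradually(page, pauseMs = 300, maxSteps = 200) {
  // Assumes a viewport is set; page.viewport() is null otherwise
  const viewportHeight = page.viewport().height;
  const visited = [];
  let y = 0;
  for (let step = 0; step < maxSteps; step++) {
    await page.evaluate((top) => window.scrollTo(0, top), y);
    visited.push(y);
    await sleep(pauseMs);
    const pageHeight = await page.evaluate(() => document.body.scrollHeight);
    y += viewportHeight;
    // Stop once the next offset would be past the end of the page
    if (y >= pageHeight) break;
  }
  return visited;
}

// Usage inside an async function, after page.goto():
// await scrollGradually(page, 300);
// await page.screenshot({ path: 'smooth-scroll.jpg', fullPage: true });
```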