Interacting with DOM in Puppeteer: how to get HTML code and extract various tags (text, images, links)

Contents

1. How to use Puppeteer: installation and quick start

2. Interacting with DOM in Puppeteer: how to get HTML code and extract various tags (text, images, links)

2.1 How to get HTML of a web page in Puppeteer

2.2 How to get HTML from a web page and save it to a file in Puppeteer

2.3 How to extract all specific elements (tags) from HTML (DOM) in Puppeteer

2.4 How to extract all “p” tags (paragraphs of text) in Puppeteer

2.5 How to extract all “a” tags (links) in Puppeteer

2.6 How to extract all image links in Puppeteer

2.7 How to extract the page title in Puppeteer

2.8 How to use Page.$eval() and Page.$$eval() methods to select elements

2.9 How to find all elements by class name in Puppeteer

2.10 How to find an element by id in Puppeteer

2.11 How to find an element in Puppeteer by the text it contains

3. Advanced interacting with DOM in Puppeteer: disabling JavaScript, loading HTML without visiting the site, error handling, delaying and scrolling the page


2.1 How to get HTML of a web page in Puppeteer

In my view, the main purpose of Puppeteer is to get the HTML and DOM of a web page, not to take screenshots. Let's start by getting the HTML of a web page.

Create a file html.js with the following content:

const puppeteer = require('puppeteer');

async function run() {
	const browser = await puppeteer.launch();
	const page = await browser.newPage();
	const customUserAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36';
  
	await page.setViewport({width: 1440, height: 3440});
	await page.setUserAgent(customUserAgent);
	await page.goto('https://w-e-b.site/?act=client-tls-fingerprinting');
	
	const html = await page.content();
	console.log(html);
  
	await browser.close();
}

run();

Run the file like this:

node html

2.2 How to get HTML from a web page and save it to a file in Puppeteer

The previous script prints all of the HTML code to the console window, which is probably not what you want. To save the HTML code to a file, redirect the output of the script:

node html > page.htm

You can also save the HTML (DOM) code you get from Puppeteer to a file without using output redirection. To do this, create a file html-to-file.js with the following content:

const puppeteer = require('puppeteer');

async function run() {
	const browser = await puppeteer.launch();
	const page = await browser.newPage();
	const customUserAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36';
  
	await page.setViewport({width: 1440, height: 3440});
	await page.setUserAgent(customUserAgent);
	await page.goto('https://w-e-b.site/?act=client-tls-fingerprinting');
	
	const html = await page.content();
	const fs = require('fs');
	fs.writeFile("test.htm", html, function(err) {
		if (err) {
			return console.log(err);
		}
		console.log("The file was saved!");
	});
  
	await browser.close();
}

run();

Run the file like this:

node html-to-file

Running this program will save the page source code obtained by Puppeteer to the file test.htm.
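
The callback version above starts the write and then moves straight on to closing the browser. As a minimal sketch, the same step can be written with the promise-based fs API so that the script waits for the file to be written. The helper name saveHtml is an illustrative assumption, not part of Puppeteer:

```javascript
const fs = require('fs/promises');

// Hypothetical helper: write the HTML string to a file and resolve
// with the file path once the write has finished.
async function saveHtml(filePath, html) {
	await fs.writeFile(filePath, html, 'utf8');
	return filePath;
}

// Inside run(), after `const html = await page.content();`:
//   await saveHtml('test.htm', html);
//   await browser.close();
```

If the write fails, the promise rejects, so the error surfaces in run() instead of being logged inside a callback.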

2.3 How to extract all specific elements (tags) from HTML (DOM) in Puppeteer

In Puppeteer, you can use CSS selectors, which let you identify elements of the page and extract their content or interact with them, for example by filling in text fields and clicking buttons. We'll cover interaction later; for now, let's start with examples of extracting specific tags.

In the following examples, only the selector passed to the querySelectorAll method changes. We can also choose which property of the elements to retrieve. The properties used in this article are:

innerText
innerHTML
src
href

In the following examples, we will make extensive use of the Page.evaluate() method, which executes the specified function in the context of the page and returns the result.

Inside the Page.evaluate() method, we will use “regular” JavaScript, that is, without using the Puppeteer library.
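
Since only the selector and the property change from one example to the next, both can be passed in as arguments. The sketch below assumes a helper named collectProperty (an illustrative name); it relies on Page.evaluate() forwarding any extra arguments to the function it runs in the page:

```javascript
// Runs inside the page: collect one property from every element
// that matches a CSS selector. Both parameter names are illustrative.
function collectProperty(selector, property) {
	return Array.from(document.querySelectorAll(selector))
		.map(element => element[property]);
}

// With Puppeteer, arguments after the function are passed to it:
//   const links = await page.evaluate(collectProperty, 'a', 'href');
//   const texts = await page.evaluate(collectProperty, 'p', 'innerText');
```

The function only uses “regular” JavaScript and the page's own document object, so it can also be pasted into the browser console to try a selector.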

2.4 How to extract all “p” tags (paragraphs of text) in Puppeteer

Let's create a file extract-p.js with the following content:

const puppeteer = require('puppeteer');

async function run() {
	const browser = await puppeteer.launch();
	const page = await browser.newPage();
	const customUserAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36';
  
	await page.setViewport({width: 1440, height: 3440});
	await page.setUserAgent(customUserAgent);
	await page.goto('https://hackware.ru/?p=19139');

	const tags = await page.evaluate(() => {
		return Array.from(document.querySelectorAll('p'))
			.map(heading => heading.innerText);
	});
  
	for (const tag of tags) {
		console.log(tag);
	}
  
	await browser.close();
}

run();

Let's run:

node extract-p

The purpose of this script is to extract all “p” tags from the page https://hackware.ru/?p=19139

If you are familiar with programming but do not write JavaScript, the following code fragment may confuse you:

return Array.from(document.querySelectorAll('p'))
	.map(heading => heading.innerText);

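It may look as if the statement ends after the first line, so that the second line is never executed. In fact, the statement continues until the semicolon: JavaScript treats a line break inside an expression like any other whitespace, so the “.map()” call on the second line belongs to the same “return” statement. The exception is a line break placed immediately after the word “return”, which does end the statement. Written on one line, the code looks like this: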

return Array.from(document.querySelectorAll('p')).map(heading => heading.innerText);

Because JavaScript code often chains several method calls in a row, and line breaks inside an expression are not significant, this multi-line notation is commonly used to improve readability.
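
To see the difference, here is a small sketch in plain JavaScript (no Puppeteer needed). The array and function names are made up for the illustration:

```javascript
const words = ['alpha', 'beta'];

// The expression starts on the same line as "return",
// so the chained call on the next line is part of it.
function chained() {
	return Array.from(words)
		.map(word => word.toUpperCase());
}

// A line break directly after "return" ends the statement:
// the function returns undefined and the next line never runs.
function broken() {
	return
	Array.from(words).map(word => word.toUpperCase());
}

console.log(chained()); // [ 'ALPHA', 'BETA' ]
console.log(broken());  // undefined
```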

To make the previous example more useful, let's narrow the selector:

const puppeteer = require('puppeteer');

async function run() {
	const browser = await puppeteer.launch();
	const page = await browser.newPage();
	const customUserAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36';
  
	await page.setViewport({width: 1440, height: 3440});
	await page.setUserAgent(customUserAgent);
	await page.goto('https://hackware.ru/?p=19139');

	const tags = await page.evaluate(() => {
		return Array.from(document.querySelectorAll('div > div.entrytext > p'))
			.map(heading => heading.innerText);
	});
  
	for (const tag of tags) {
		console.log(tag);
	}
  
	await browser.close();
}

run();

Now only paragraphs of text inside the article will be found, without text in sidebars, footers and other areas that are not interesting to us.

Later we will look at how to build selector strings, and you will see where this one comes from:

div > div.entrytext > p

2.5 How to extract all “a” tags (links) in Puppeteer

Let's create a file extract-art-a.js with the following content:

const puppeteer = require('puppeteer');

async function run() {
	const browser = await puppeteer.launch();
	const page = await browser.newPage();
	const customUserAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36';
  
	await page.setViewport({width: 1440, height: 3440});
	await page.setUserAgent(customUserAgent);
	await page.goto('https://hackware.ru/?p=19139');

	const tags = await page.evaluate(() => {
		return Array.from(document.querySelectorAll('a'))
			.map(heading => heading.href);
	});
  
	for (const tag of tags) {
		console.log(tag);
	}
  
	await browser.close();
}

run();

This code will extract all links from the specified page. To extract only links from the article, you need to replace the selector with the following line (this is the selector for the page specified in the example):

div > div.entrytext > p > a

Also pay attention to the line:

.map(heading => heading.href);

For links, the property we are interested in is “href”.

The following example will show only external links (links that point to a different domain than the one where the analyzed page is hosted):

const puppeteer = require('puppeteer');

async function run() {
	const browser = await puppeteer.launch();
	const page = await browser.newPage();
	const customUserAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36';

	await page.setUserAgent(customUserAgent);
	await page.goto('https://hackware.ru/?p=19139');

	const tags = await page.evaluate(() => {
		return Array.from(document.querySelectorAll('div > div.entrytext > p > a'))
			.map(heading => heading.href);
	});
	
	const hostname = await page.evaluate(() => {
		return window.location.hostname;
	});	
	const re = new RegExp(hostname);
	
	for (const tag of tags) {
		if (!re.test(tag) && tag) {
			console.log(tag);
		}
	}

	await browser.close();
}

run();
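
One caveat about the RegExp test above: the dots in the hostname act as wildcards, and the pattern matches anywhere in the URL, so a link such as https://example.com/?ref=hackware.ru would be counted as internal. As a sketch of a stricter check, the hostnames can be compared after parsing the link with the standard URL class. The helper name isExternalLink is an assumption made for this example:

```javascript
// Hypothetical helper: true if `link` is an http(s) URL whose hostname
// differs from `pageHostname`. Subdomains count as different hosts.
function isExternalLink(link, pageHostname) {
	let url;
	try {
		url = new URL(link);
	} catch (err) {
		return false; // empty or unparsable href
	}
	if (url.protocol !== 'http:' && url.protocol !== 'https:') {
		return false; // mailto:, javascript:, tel: and similar
	}
	return url.hostname !== pageHostname;
}

// In the loop above, this would replace the RegExp test:
//   if (isExternalLink(tag, hostname)) console.log(tag);
```

Whether subdomains such as www should be treated as the same site depends on the task; this sketch treats them as external.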

2.6 How to extract all image links in Puppeteer

Let's create a file extract-art-img.js with the following content:

const puppeteer = require('puppeteer');

async function run() {
	const browser = await puppeteer.launch();
	const page = await browser.newPage();
	const customUserAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36';
  
	await page.setViewport({width: 1440, height: 3440});
	await page.setUserAgent(customUserAgent);
	await page.goto('https://hackware.ru/?p=19139');

	const tags = await page.evaluate(() => {
		return Array.from(document.querySelectorAll('img'))
			.map(heading => heading.src);
	});
  
	for (const tag of tags) {
		console.log(tag);
	}
  
	await browser.close();
}

run();

This code will show all links to images on the specified page. To show only links to images from the article, you need to use the following selector:

div > div.entrytext > p > a > img

Note that for images the property we are interested in is “src”.

2.7 How to extract the page title in Puppeteer

Create a file extract-art-title.js with the following contents:

const puppeteer = require('puppeteer');

async function run() {
	const browser = await puppeteer.launch();
	const page = await browser.newPage();
	const customUserAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36';
  
	await page.setViewport({width: 1440, height: 3440});
	await page.setUserAgent(customUserAgent);
	await page.goto('https://hackware.ru/?p=19139');

	const tags = await page.evaluate(() => {
		return Array.from(document.querySelectorAll('title'))
			.map(heading => heading.innerText);
	});
  
	console.log(tags[0]);
  
	await browser.close();
}

run();

Note that instead of looping through the entire array of tags, we accessed the array element directly:

console.log(tags[0]);

By the way, Puppeteer has a special method for titles, and you can use it like this:

const puppeteer = require('puppeteer');

async function run() {
	const browser = await puppeteer.launch();
	const page = await browser.newPage();
	const customUserAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36';
  
	await page.setUserAgent(customUserAgent);
	await page.goto('https://hackware.ru/?p=19214');
	const title = await page.title();
	console.log(title);
  
	await browser.close();
}

run();

2.8 How to use Page.$eval() and Page.$$eval() methods to select elements

Above, we looked at options for using the page.evaluate() method. The main purpose of this method is to execute JavaScript code in the context of an open web page. That is, it is a universal method that is used for different purposes. We used it to execute “regular” JavaScript (without Puppeteer methods), which selected elements by a specific selector.

But for these purposes (selecting elements by a specific selector), there are special Puppeteer methods:

  • Page.$eval() – this method finds the first element on the page that matches the selector and passes it as the first argument to pageFunction. This method is suitable for searching by element id, or if you only need one (the first) element.
  • Page.$$eval() – this method finds all elements that match the selector and passes them as an array as the first argument to pageFunction. This method is suitable for searching by class name, as well as by tag name.
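
As a rough mental model only (this is not Puppeteer's actual implementation), the page-side behavior of the two methods can be sketched in plain JavaScript. The names evalFirst and evalAll and the doc parameter are assumptions made for the illustration. In real Puppeteer, pageFunction runs inside the browser and only its return value is sent back to Node.js:

```javascript
// Conceptual stand-in for Page.$eval(): first match only.
// Like Page.$eval(), it fails if nothing matches the selector.
function evalFirst(doc, selector, pageFunction) {
	const element = doc.querySelector(selector);
	if (!element) {
		throw new Error('No element matches selector "' + selector + '"');
	}
	return pageFunction(element);
}

// Conceptual stand-in for Page.$$eval(): all matches as an array.
// With no matches, pageFunction receives an empty array.
function evalAll(doc, selector, pageFunction) {
	return pageFunction(Array.from(doc.querySelectorAll(selector)));
}
```

The practical consequence is that a Page.$eval() call for an element that may be missing should be wrapped in error handling, while Page.$$eval() simply yields an empty result.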

Let's consider using the Page.$$eval() method to search for paragraphs of text. The code below will also output the total number of paragraphs found:

const puppeteer = require('puppeteer');

async function run() {
	const browser = await puppeteer.launch();
	const page = await browser.newPage();
	const customUserAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36';

	await page.setViewport({width: 1440, height: 3440});
	await page.setUserAgent(customUserAgent);
	await page.goto('https://hackware.ru/?p=19214');


	const tags = await page.$$eval('div > div.entrytext > p', elements => {
		return elements.map(el => el.innerText);
	});
	for (const tag of tags) {
		console.log(tag);
	}
	
	const count = await page.$$eval('div > div.entrytext > p', elements => elements.length);
	console.log('Total: ' + count);
	
	await browser.close();
}

run();

2.9 How to find all elements by class name in Puppeteer

In the following parts, we will get to know selectors in more detail, since we will use them to perform such actions as entering data into text fields and forms, searching the site, clicking buttons, getting the value of a certain element, etc.

For now, we will only briefly look at a few uses of selectors. For example, the following code will find all elements whose class is “code”:

const puppeteer = require('puppeteer');

async function run() {
	const browser = await puppeteer.launch();
	const page = await browser.newPage();
	const customUserAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36';

	await page.setViewport({width: 1440, height: 3440});
	await page.setUserAgent(customUserAgent);
	await page.goto('https://hackware.ru/?p=19214');


	const tags = await page.$$eval('.code', elements => {
		return elements.map(el => el.innerText);
	});
	for (const tag of tags) {
		console.log(tag);
		console.log('============================================');
	}
	
	const count = await page.$$eval('.code', elements => elements.length);
	console.log('Total: ' + count);
	
	await browser.close();
}

run();

2.10 How to find an element by id in Puppeteer

To find an element by its id, we can use Page.$eval() instead of Page.$$eval(). The following code will print the text of the element with the id “creditline”:

const puppeteer = require('puppeteer');

async function run() {
	const browser = await puppeteer.launch();
	const page = await browser.newPage();
	const customUserAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36';

	await page.setViewport({width: 1440, height: 3440});
	await page.setUserAgent(customUserAgent);
	await page.goto('https://hackware.ru/?p=19214');

	const tag = await page.$eval('#creditline', el => el.innerText);
	console.log(tag);
	
	await browser.close();
}

run();

You can get different properties for different element types:

const searchValue = await page.$eval('#search', el => el.value);

const preloadHref = await page.$eval('link[rel=preload]', el => el.href);

const html = await page.$eval('.main-container', el => el.outerHTML);

2.11 How to find an element in Puppeteer by the text it contains

You can select elements by more than their tag, class, or id. Let's look at an example of how to find all elements that contain a certain text. In this example, we will search for “p” tags that contain the string “код” (Russian for “code”):

const puppeteer = require('puppeteer');

async function run() {
	const browser = await puppeteer.launch();
	const page = await browser.newPage();
	const customUserAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36';

	await page.setViewport({width: 1440, height: 3440});
	await page.setUserAgent(customUserAgent);
	await page.goto('https://hackware.ru/?p=19214');
	
	const tags = await page.$$eval('div > div.entrytext > p', elements => {
		// Keep only the paragraphs that contain the text and return
		// their text: DOM elements cannot be serialized back to Node.js
		return elements
			.filter(el => el.innerText.includes('код'))
			.map(el => el.innerText);
	});
	
	for (const tag of tags) {
		console.log(tag);
	}
		
	console.log('Total: ' + tags.length);
	
	await browser.close();
}

run();
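
The search string does not have to be hard-coded inside the page function. As a sketch, it can be passed as an extra argument to Page.$$eval(), which forwards it to the function after the array of elements. The function name textsContaining is an illustrative assumption:

```javascript
// Runs inside the page: return the text of every element whose
// text contains `needle`. Names are illustrative.
function textsContaining(elements, needle) {
	return elements
		.map(element => element.innerText)
		.filter(text => text.includes(needle));
}

// With Puppeteer:
//   const tags = await page.$$eval('div > div.entrytext > p', textsContaining, 'код');
```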

You can also search by the contents of other tags, for example by button labels. Once an element is found, you can act on it: the following example finds a button by its label (“Поиск”, Russian for “Search”) and then emulates a click on it:

const puppeteer = require('puppeteer');

async function run() {
	const browser = await puppeteer.launch();
	const page = await browser.newPage();
	const customUserAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36';

	await page.setViewport({width: 1440, height: 3440});
	await page.setUserAgent(customUserAgent);
	await page.goto('https://hackware.ru/?p=19214');
	
	await page.$$eval('button', buttons => {
		for (const button of buttons) {
			if (button.textContent === 'Поиск') {
				button.click();
				break; // Clicking the first matching button and exiting the loop
			}
		}
	});
	
	await browser.close();
}

run();
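
The strict comparison above only matches if the button's textContent is exactly the label, with no surrounding spaces or line breaks, and the script gives no indication when nothing was clicked. A slightly more forgiving variant, as a sketch, trims the text and reports whether a button was found. The function name clickButtonByLabel is an assumption made for this example:

```javascript
// Runs inside the page: click the first button whose label, with
// surrounding whitespace removed, equals `label`.
// Returns true if a button was clicked, false otherwise.
function clickButtonByLabel(buttons, label) {
	const button = buttons.find(b => b.textContent.trim() === label);
	if (!button) {
		return false;
	}
	button.click();
	return true;
}

// With Puppeteer:
//   const clicked = await page.$$eval('button', clickButtonByLabel, 'Поиск');
//   if (!clicked) console.log('Button not found');
```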

Continue reading: Advanced interacting with DOM in Puppeteer: disabling JavaScript, loading HTML without visiting the site, error handling, delaying and scrolling the page
