I strongly advise against evaluating or parsing crawled data in Node.js if you are using puppeteer for crawling anyway. With puppeteer you already have a browser with a great sandbox for JavaScript code, running in another process. Why give up that isolation and "rebuild" a parser in your Node.js script? If the parsing breaks, your whole script fails. In the worst case, you might even expose your machine to serious risks by running untrusted code inside your main thread.
Instead, try to do as much parsing as possible inside the context of the page. You can even do an evil eval call there. The worst that could happen? Your browser hangs or crashes.
Imagine the following HTML page (very much simplified). You are trying to read the text which is pushed into an array. The only information you have is that there is an additional attribute id which is set to target-data.
<html>
<body>
    <!-- ... -->
    <script>
        var arr = [];
        // some complex code...
        arr.push({
            id: 'not-interesting-data',
            data: 'some data you do not want to crawl',
        });
        // more complex code here...
        arr.push({
            id: 'target-data',
            data: 'THIS IS THE DATA YOU WANT TO CRAWL', // <---- You want to get this text
        });
        // more code...
        arr.push({
            id: 'some-irrelevant-data',
            data: 'again, you do not want to crawl this',
        });
    </script>
    <!-- ... -->
</body>
</html>
Here is a simple example of what your code might look like right now:
await page.goto('http://...');
const crawledJsCode = await page.evaluate(() => document.querySelector('script').innerHTML);
In this example, the script extracts the JavaScript code from the page. Now we have the JavaScript code from the page and we "only" need to parse it, right? Well, this is the wrong approach. Don't try to rebuild a parser inside Node.js. Just use the browser. There are basically two approaches you can take to do that in your case:

1. Replace the native functions that process the data with your own "fake functions"
2. Parse the data inside the page context using JSON.parse, a regex or eval (eval only if really necessary)

Approach 1: In this approach you are replacing native browser functions with your own "fake functions". Example:
const originalPush = Array.prototype.push;
Array.prototype.push = function (item) {
    if (item && item.id === 'target-data') {
        const data = item.data; // This is the data we are trying to crawl
        window.exposedDataFoundFunction(data); // send this data back to Node.js
    }
    return originalPush.apply(this, arguments);
};
This code replaces the original Array.prototype.push function with our own function. Everything works as normal, but when an item with our target id is pushed into an array, a special condition is triggered. To inject this function into the page before any of its scripts run, you can use page.evaluateOnNewDocument. To receive the data in Node.js, you have to expose a function to the browser via page.exposeFunction:
// called via window.exposedDataFoundFunction from within the fake Array.prototype.push function
await page.exposeFunction('exposedDataFoundFunction', data => {
// handle the data in Node.js
});
Now it doesn't really matter how complex the code of the page is, whether it runs inside some asynchronous handler or whether the page changes the surrounding code. As long as the target data is pushed into an array, we will get it.
You can use this approach for a lot of crawling. Check how the data is processed and replace the low-level functions processing the data with your own proxy versions of them.
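To see the mechanics without a browser, here is a minimal Node-only simulation of the fake-push technique. The exposedDataFoundFunction callback is a stand-in for the function you would expose via page.exposeFunction, and the "page script" is inlined:

```javascript
// Collected results; in Puppeteer this would live on the Node.js side.
const found = [];
const exposedDataFoundFunction = data => found.push(data); // stand-in for the exposed function

// Install the fake push, exactly as it would be injected into the page.
const originalPush = Array.prototype.push;
Array.prototype.push = function (item) {
    if (item && item.id === 'target-data') {
        exposedDataFoundFunction(item.data); // intercept the interesting item
    }
    return originalPush.apply(this, arguments); // keep normal behavior
};

// Simulate the crawled page's script running:
const arr = [];
arr.push({ id: 'not-interesting-data', data: 'some data you do not want to crawl' });
arr.push({ id: 'target-data', data: 'THIS IS THE DATA YOU WANT TO CRAWL' });
arr.push({ id: 'some-irrelevant-data', data: 'again, you do not want to crawl this' });

Array.prototype.push = originalPush; // restore the original

console.log(found); // [ 'THIS IS THE DATA YOU WANT TO CRAWL' ]
```

Note that the interception is transparent: the page's array still receives all three items, we just get a copy of the one we care about.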
Approach 2: Let's assume the first approach does not work for some reason. The data is in some script tag, but you are not able to get it by using fake functions.
Then you should still parse the data, but not inside your Node.js environment. Do it inside the page context. You could run a regular expression there or use JSON.parse. But do it before returning the data to Node.js. This approach has the benefit that if your code crashes the environment for some reason, it will only be your browser that crashes, not your main script.
To give some example code: instead of running the code from the original "bad code" sample, we change it to this:
const crawledJsCode = await page.evaluate(() => {
    const code = document.querySelector('script').innerHTML; // instead of returning this
    const match = code.match(/some tricky regex which extracts the data you want/); // we run our regex in the browser
    return match; // and only return the results
});
This will only return the parts of the code we need, which can then be further processed from within Node.js.
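As a sketch of what such an in-page regex might look like (the pattern and the sample script text below are made-up assumptions, not anything from the real page), the matching logic is the same whether it runs inside page.evaluate or in plain Node:

```javascript
// Hypothetical contents of the crawled <script> tag:
const code = `
arr.push({
    id: 'target-data',
    data: 'THIS IS THE DATA YOU WANT TO CRAWL',
});
`;

// A made-up regex that captures the data field of the push whose id is 'target-data'.
const match = code.match(/id:\s*'target-data',\s*data:\s*'([^']*)'/);

console.log(match[1]); // THIS IS THE DATA YOU WANT TO CRAWL
```

Inside page.evaluate you would return match (or match[1]) instead of logging it, so only the extracted string crosses back into Node.js.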
Independent of which approach you choose, both ways are much better and more secure than running unknown code inside your main thread. If you absolutely have to process the data in your Node.js environment, use a regular expression for it as shown in the answer from trincot. You should never use eval to run untrusted code.
I think using an AST generator like Esprima or other AST tools is the easiest way to read and work with source code.
Honestly, once you figure out how to run Esprima and generate an "Abstract Syntax Tree" from the source code, you will find it surprisingly simple to read the resulting tree structure that represents the code you just parsed, and to convert the information into anything you want.
It may seem daunting at first, but honestly, it is not. You'll be surprised: AST tools like Esprima were made exactly for purposes similar to what you are trying to do, in order to make the job easy.
AST tools are born from years' worth of research into how to read and manipulate source code, so I highly recommend them.
Give them a try!
To help you understand what various ASTs look like, have a look at https://astexplorer.net. It is super useful for seeing what the AST structures produced by various tools look like.
Oh, one last thing! To traverse an AST, you can use something like https://github.com/estools/estraverse. It will make life easy.
Scraping is going to be ugly. With some assumptions about the string you are trying to parse, you could:
Convert that string to valid JSON:
To do this reliably you'd have to write a parser that is just as complex as a JSON parser, but with some assumptions, it can probably be simplified to this:
// Sample data
var someCrawledJSCode = `
    const somecode = 'somevalue';
    arr.push({
        data1: {
            prices: [{
                prop1: 'hi',
                prop2: 'hello',
            }, {
                prop1: 'foo',
                prop2: 'bar',
            }]
        }
    });`;

var obj;
someCrawledJSCode.replace(/\.push\(([^]*?)\)/, (_, notJson) => {
    // Try to turn the string into valid JSON:
    // 1. string literals should not be enclosed in single, but double quotes
    // 2. property names should be enclosed in double quotes
    // 3. there should be no trailing comma after the last property
    var json = notJson.replace(/'((\\.|[^\\'])*)'/g, '"$1"')
        .replace(/(\w+):/g, '"$1":')
        .replace(/,\s*}/g, '}');
    obj = JSON.parse(json);
});
console.log(obj);