Dump PDF from HEP Website
Background
As we know, many publishers in China have provided e-books for home study since the COVID-19 epidemic. But some of them, like HEP, only allow you to read online, which is a pity.
Steps
Open a reader page such as this one, open the developer tools, and switch to the Network tab. You will find some PDF file requests. Download one and open it? Oh no, it's encrypted. These files start with the magic bytes BB 51 1F 73, but a regular PDF file should start with 25 50 44 46 (the ASCII characters %PDF). Certainly, you could dig tenaciously through the page's code and eventually find a way to decrypt them.
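To confirm the mismatch yourself, the header comparison can be sketched as follows. This is only an illustration: `looksLikePdf` and `hexHead` are hypothetical helper names, not part of the site's code or of pdf.js.

```javascript
// The four bytes "%PDF" that open an unencrypted PDF file.
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46];

// Hypothetical helper: true when the data starts with the PDF signature.
// Accepts an ArrayBuffer, a Uint8Array, or a plain array of byte values.
function looksLikePdf(bytes) {
  const head = new Uint8Array(bytes).subarray(0, PDF_MAGIC.length);
  return (
    head.length === PDF_MAGIC.length &&
    PDF_MAGIC.every((b, i) => head[i] === b)
  );
}

// Hypothetical helper: the first n bytes as spaced upper-case hex.
function hexHead(bytes, n = 4) {
  return Array.from(new Uint8Array(bytes).subarray(0, n), (b) =>
    b.toString(16).toUpperCase().padStart(2, "0")
  ).join(" ");
}
```

In the browser console you could pass in the result of `await (await fetch(url)).arrayBuffer()`, where `url` is one of the PDF requests copied from the Network tab.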
There is an easier route, though: the page uses pdf.js to render the document, and that is the weak point!
Type pdfjsLib.version in the console; here it reports version 2.2.228. Now download the pdf_viewer.js of that version and modify it. Replace line 7558, return _possibleConstructorReturn ..., with:
let ret = "{ expression after `return` }";
window.hook = window.hook || [];
hook.push(ret);
return ret;
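The idea behind this patch is simply to keep a reference to every object that passes through that return statement. A minimal sketch of the same capture pattern, using a stand-in function (`createViewer` is hypothetical, since the real expression from pdf_viewer.js is not reproduced here):

```javascript
// Stand-in for the patched code path in pdf_viewer.js. createViewer and the
// object shape are assumptions made for illustration only.
function createViewer(container, bytes) {
  const ret = {
    container,
    // Mimics an asynchronous getData() that resolves to the document bytes.
    pdfDocument: { getData: async () => bytes },
  };
  globalThis.hook = globalThis.hook || []; // create the list on first use
  globalThis.hook.push(ret); // remember every object that passes through
  return ret; // the caller sees the same value as before
}

const boxA = { id: "page-0" };
const boxB = { id: "page-1" };
createViewer(boxA, new Uint8Array([0x25, 0x50, 0x44, 0x46]));
createViewer(boxB, new Uint8Array([0x25, 0x50, 0x44, 0x46]));

// Later, look an object up by the container it renders into.
const found = globalThis.hook.find((entry) => entry.container === boxB);
```

Because the patched line still returns the same value, the page should behave as before; the only side effect is the global list.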
Install the Header Editor extension and create a rule that redirects the page's pdf_viewer.js to your modified copy, or use the overrides feature in the developer tools.
Reload the page. You can now dump the decrypted PDF data with hook[i].pdfDocument.getData(). Here is a code snippet that dumps the pages automatically:
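// Run in the console of the reader page after the patched pdf_viewer.js has loaded.
for (const [index, box] of canvas_box.childNodes.entries()) {
  // if (index < 210) continue; // skip earlier pages
  box.scrollIntoView(); // scroll the page box into view so it gets rendered
  // Wait until a canvas exists in the box, then give it a moment to settle.
  while (!box.querySelector("canvas"))
    await new Promise((r) => setTimeout(r, 50));
  await new Promise((r) => setTimeout(r, 200));
  // Find the hooked object that renders into this box.
  const entry = hook.find((entry) => entry.container === box);
  const a = document.createElement("a");
  a.href = URL.createObjectURL(new Blob([await entry.pdfDocument.getData()]));
  a.download = `${index}.pdf`.padStart(7, "0"); // e.g. index 5 -> 005.pdf
  a.click();
}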
You may need to turn off "Ask where to save each file before downloading" in browser settings.
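One detail of the snippet worth noting: padStart(7, ...) pads the whole file name, extension included, so it yields three-digit page numbers. For a book with 1000 or more pages the names would no longer all have the same width and would stop sorting in page order. A small sketch of a helper with an adjustable width (`pageFileName` is a hypothetical name, not part of the original snippet):

```javascript
// Hypothetical helper: zero-pad only the page index, then add the extension.
// `digits` is the number of digits to pad to (an assumption; pick it from the page count).
function pageFileName(index, digits = 3) {
  return `${String(index).padStart(digits, "0")}.pdf`;
}

// Same result as the original expression for small books:
const same = pageFileName(5) === `${5}.pdf`.padStart(7, "0");

// Wider padding for a longer book:
const wide = pageFileName(5, 4);
```

With the default width this matches the names produced by the snippet, so it can replace the a.download expression directly.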
By the Way
curl -o a.pdf https://wkobwp.sciencereading.cn/api/file/{id}/getDocumentbuffer