MetaProduct User Forums

Can the OE Internal Browser parse JavaScript?

George Robinson 6/18/2016 8:00:34 PM
Can the OE Internal Browser parse the JavaScript on the page
http://www.nxp.com/products/discretes-and-logic/logic/hct:GRP_10199?tab=Products

in order to download all of the cache.nxp.com/documents/data_sheet/*.PDF files linked from that page?

George Robinson 6/18/2016 8:06:09 PM
Aaaarggh!

My post above was supposed to go into the Offline Explorer category.
Oleg Chernavin 6/18/2016 8:33:36 PM
Yes, it is possible with 7.2 version. Please install the latest Offline Explorer Pro:

http://www.metaproducts.com/download/betas/opsetup.exe

Then import the Project settings:

http://www.metaproducts.com/files/NXP.txt

Use the Import - Project Settings - Load from Text File on the main toolbar. Then start downloading the Project.

Best regards,
Oleg Chernavin
MP Staff
George Robinson 6/19/2016 5:21:46 AM
> Oleg wrote:
> Yes, it is possible with 7.2 version. Please install the latest Offline Explorer Pro:

Do you mean that earlier versions could not do it but v7.2 can ?
If "yes", could you elaborate what makes the v7.2 so special ?
I am just curious - I am a programmer myself so I will understand any explanation.

> Then import the Project settings:

I had no problem importing the Project settings and the project appears to run, but it saves only 28 files from the directory cache.nxp.com/documents/data_sheet/ ...but there should be hundreds of them.

FYI: I am getting a lot of errors in the log:
ERROR - 2016.6.18 23:52:32 - Error creating file: Error code=00000003 Path not found Referer=javascript:'' Attempts=0

----------------------------------------------------------------------------------------------------------------------------------
ANOTHER QUESTION:

I was trying to understand how Offline Explorer works using this help link:
http://help.metaproducts.com/offline-explorer-pro/#understanding-the-download-process

Unfortunately I was not able to grasp the difference in Project settings between:
1) The mapping of a website (enumerating the links)
2) Saving the files to a local disk.

Specifically, I am confused about which OE Project settings refer to the mapping/traversing/spidering and which settings refer to the actual downloading of the files to the local disk.

For example I noticed, that all of the PDF files that interest me are in the directory:
cache.nxp.com/documents/data_sheet/

...but the contents of this directory cannot be listed so OE needs to get the links to these PDF files from:
www.nxp.com/products/discretes-and-logic/logic/hct/*.*
So, apparently OE needs to limit the mapping/traversing/spidering process to the above directory and below.

...but OE needs to save the PDF files to the local disk only from the directory:
cache.nxp.com/documents/data_sheet/

Unfortunately cache.nxp.com IS NOT BELOW www.nxp.com !!!!


In the Project settings, I cannot see the separation between the directory used for mapping/traversing/spidering and the directory for downloading the PDF files that are to be saved locally.

I have the same conceptual problem in the branch "The Filters" of the Project settings tree.
I do not know whether these filters refer to the files that will be spidered or to the files that will be retrieved and saved locally.
For example: If I only enable the PDF file type in "The Filters" branch, will OE still traverse and parse the .html and .js files, that are the sources of links to these PDF files ?


I wish all of that was explained in the help section at:
http://help.metaproducts.com/offline-explorer-pro/


George
Oleg Chernavin 6/19/2016 8:53:02 PM
That's strange - my download lasted 2 hours and I then counted 227 files in the cache.nxp.com\documents folder.

I also checked a number of pages from the Saved Pages tab - I used the Documentation link to click through the PDF links - they were all openable offline.

The Products/Parts column lists 175 items, so 227 PDF files looks OK.

Version 7.2 was necessary because it introduces the ability to autoscroll pages X times before saving them. As you can see, the Products/Parts list gets populated as you scroll the page down.

I fixed the errors in the log; however, they had no effect on the download at all. Anyway, here is the updated version:

http://www.metaproducts.com/download/betas/opsetup.exe

I also noticed some links that lead to non-English languages. If they are unnecessary, you may get rid of them easily. Select the Project, click the Properties button, go to the URL Filters - Filename section and add:

lang_cd=

to the Excluded list.

As for how I filtered the links - I allowed all servers and directories to be downloaded. But the URL Filters - Filename - Included keywords list allows only the files that lead to PDF links and PDF links as well.

Oleg.
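
As a rough illustration of the keyword filtering described in this post, here is a minimal Python sketch. It is not Offline Explorer's code. The function name, the keyword lists, and the rule that an excluded keyword overrides an included one are all assumptions made for the example, since the actual lists in NXP.txt are not shown in this thread. The EXAMPLE.pdf and /about URLs are made up as well.

```python
def url_passes_filters(url, included, excluded):
    """Return True if a URL would be kept by a simple keyword filter."""
    lowered = url.lower()
    # Assumption: any excluded keyword rejects the URL outright.
    if any(keyword in lowered for keyword in excluded):
        return False
    # Otherwise at least one included keyword has to appear in the URL.
    return any(keyword in lowered for keyword in included)


# Hypothetical keyword lists, chosen only to mirror the description above:
# keep pages that lead to PDF links, keep the PDFs, drop other languages.
INCLUDED = ["tab=products", ".pdf"]
EXCLUDED = ["lang_cd="]

candidates = [
    "http://www.nxp.com/products/discretes-and-logic/logic/hct:GRP_10199?tab=Products",
    "http://cache.nxp.com/documents/data_sheet/EXAMPLE.pdf",
    "http://cache.nxp.com/documents/data_sheet/EXAMPLE.pdf?lang_cd=zh",
    "http://www.nxp.com/about",
]

for url in candidates:
    print(url_passes_filters(url, INCLUDED, EXCLUDED), url)
```

The point of the sketch is that no server or directory restriction is needed: the listing page and the PDFs pass because of their filenames, wherever they are hosted.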
George Robinson 6/20/2016 12:32:29 PM
> Oleg wrote:
> That's strange - my download lasted 2 hours

Yes, Internet Explorer is terribly slow. It borders on unusable.
I thought that its API would be faster...but apparently not so.

> Oleg wrote:
> and I then counted 227 files in the cache.nxp.com\documents folder.

That is only because you downloaded more than just the datasheets for these products, so each product has several PDFs downloaded (e.g. a Datasheet and an ApplicationNote).
Try limiting the PDF downloads to files from the directory /documents/data_sheet in order to download only datasheets.

Also, note that you have the same files under slightly different filenames.
I found that adding FileExt=.pdf to the URL Filter's included filename keywords limits the PDF duplicates.

> The Products/Parts column lists 175 items, so 227 PDF files looks OK.

Note, that many products did not have their datasheets downloaded, such as:
74HC(T)03
74HC(T)191
74HC(T)2G16
74HC(T)2G34
74HC(T)3G16

... I think this is because they do not have the keyword "tab=Buy_Parametric_Tab" in their corresponding filenames appearing on the starting page:
http://www.nxp.com/products/discretes-and-logic/logic/hct:GRP_10199?cof=0&am=0&tab=Products

However, I noticed that OE's Internal Download Code does better than the IE and can download datasheets for these missing products without the Buy_Parametric_Tab and can do it much faster.
See my project properties at:
http://tinyurl.com/OEprops12

However I found out that even with the Internal Download Code, the datasheets for the following products did not get downloaded:
74HC4049
74HC4050

Also, notice that the other project named "NXP_All_files_1st_Level" did not find any files referring to the two products above.
Do you know why the internal code missed these products but not the others?

It would be nice if the internal code could treat the return value from the ng-click event as just another .js or .html file to be parsed for links.
<a class="ng-scope ng-binding" ng-click="showProdInfoPop($event, rowData.prodCodeObj) "

> I also noticed some links that lead to non-English languages. If they are unnecessary, you may get rid of them easily. Select the Project, click the Properties button, go to the URL Filters - Filename section and add: lang_cd= to the Excluded list.

Thanks for noticing that. Indeed, datasheets in Chinese should not be mingled with the other ones ;)

> As for how I filtered the links - I allowed all servers and directories to be downloaded. But the URL Filters - Filename - Included keywords list allows only the files that lead to PDF links and PDF links as well.

It is still unclear to me how OE's Download Process works. For example, in my project named "NXP_Internal_Most_Datasheets", I still don't understand what causes the files in the
www.nxp.com/products/discretes-and-logic/logic/hct remote directory to be parsed but not downloaded to my local directory.
George Robinson 6/20/2016 12:40:31 PM
WT_TYPE=Data Sheets
...seems to work as well for limiting the PDF downloads to the Datasheets only.
George Robinson 6/20/2016 12:49:07 PM
I wrote a detailed reply with links to project properties but it did not appear here.
When I was posting it, I got a message that a moderator must approve it :/
Oleg Chernavin 6/22/2016 6:44:12 PM
Yes, I just approved it. I tried to download the link with Project settings, but the server asks for a password.

You can easily post Project settings here - select it, press Ctrl+C on keyboard and paste to the forum message.

Could it be that the setting to prevent duplicates skipped loading the PDFs for 74HC4049 and 74HC4050?

Yes, the internal code is faster, because in browser mode Offline Explorer has to wait for a page to open, then wait a few more seconds to let all scripts finish their initialization and work, then scroll down to let the page and scripts load more data, and so on.

Regarding the directory for PDFs: you may change the .pdf Included filename keyword to:

/documents/data_sheet*/*.pdf*

Oleg.
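
To see which URLs a wildcard keyword like the one above would select, here is a small Python sketch using the standard fnmatch module. It only approximates the idea. It assumes the keyword may match anywhere in the URL and that matching is case-insensitive, neither of which is confirmed for Offline Explorer in this thread. The EXAMPLE.pdf URLs are made up.

```python
from fnmatch import fnmatchcase

# The keyword suggested above for restricting downloads to datasheets.
PATTERN = "/documents/data_sheet*/*.pdf*"


def matches_keyword(url, pattern=PATTERN):
    """Return True if the wildcard keyword occurs somewhere in the URL."""
    # Assumption: the keyword can match any part of the URL, so it is
    # wrapped in '*' on both sides; both strings are lowercased first.
    return fnmatchcase(url.lower(), "*" + pattern.lower() + "*")


for url in [
    "http://cache.nxp.com/documents/data_sheet/EXAMPLE.pdf",
    "http://cache.nxp.com/documents/data_sheet/EXAMPLE.pdf?x=1",
    "http://cache.nxp.com/documents/application_note/EXAMPLE.pdf",
]:
    print(matches_keyword(url), url)
```

Under these assumptions the first two URLs are selected and the application note is not, which is the restriction to /documents/data_sheet that was asked for.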
George Robinson 6/22/2016 6:59:44 PM
Oleg wrote:
>Yes, I just approved it. I tried to download the link with Project settings, but the server asks for a password.

O o! It does not happen to me. This server does not require any passwords.
George Robinson 6/22/2016 7:44:29 PM
[Object]
OEVersion=Pro 7.2.4506
Type=0
IID=62807
Caption=NXP_Internal_Most_Datasheets
URL=http://www.nxp.com/products/discretes-and-logic/logic:MC_50808
Dir=P:\OE\
MVer=5
Lev=5
Weekday=257
EnableForms=True
FMGroup=1
LastFMGroup=1
FTText.Exts=htmlhtmaspaspxjspstmstmlidcshtmlhtxtxttextxspxmlrxmlcfmwmlphpphp3
FTImages.Exts=gifjpgjpegtiftiffxbmfifbmppngipxjp2j2cj2kwbmplwfwebp
FTVideo.Exts=mpgavianimpegmovflvfliflcvivrmramrvasfasxwmvm1vm2vvobsmilmp4m4v
FTAudio.Exts=wavriffmp3midmp2m3uravocwmaapeoggm4aaif
FTArchive.Exts=7zacearcarjcabexegzisojarlayleilhapakpdfrartartgzzzip oooooooooooooxooooo
FTUDef.Exts=jsaxdcssssivbsdtdxslswfclassent
FTText.B=ooxooo
FTImages.B=xoxooo
FTVideo.B=xoxooo
FTAudio.B=xoxooo
FTArchive.B=ooxooo
FTUDef.B=xoxooo
FTOther.B=ooxooo
FTSizes=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,3,3,0,3,0,0,0,0,0,0,0,0
NotIgnoreLogout=False
RSrvsBx=3
RPathIn=data_sheet/products/discretes-and-logic/logic xx
RFileIn=tab=documentation_tabtab=productsnodeid=fileext=.pdfgrp_ xxxxx
RFileEx=nodeid=code=wt_asset=searchlabel=partnerid=lang_cd=orderablepartnum=packageid=partnumber=part_number=fromsearch=tab=package_quality_tab?$tab=design_tools_tabbased-on-pip-[0-9]_[0-9]uc= oxoxxxxxxxxxxxox
RProt=255
LastStart=135:238:88:31:1:198:228:64:
LastEnd=179:93:82:35:2:198:228:64:
PrjStart=6:214:172:141:218:252:227:64:
LastStarted=2016.6.23 0:50:30
LastEnded=2016.6.23 1:36:12
S200=18234
S400=42
SPar=18234
SSav=18234
SLast=200
SSiz=2335237689
SMdf=18234
SHTML=17448
SSuccDowns=38
LFiles=18276
LSize=2335240797
Flags=1
Descr=e1xydGYxXGFuc2lcZGVmZjB7XGZvbnR0Ymx7XGYwXGZuaWwgTVMgU2FucyBTZXJpZjt9fQ0KXHZpZXdraW5kNFx1YzFccGFyZFxsYW5nMTA0OVxmMFxmczE2IA0KXHBhciB9DQo=
ImgDim=0,0,0,0
PrevURL=http://www.nxp.com/products/discretes-and-logic/logic:MC_50808
ExploreDirs=True
ConvertRSS=True
SkipVideoBlocks=True
ConvertFileNames=True
SavePageScroll=16
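
For anyone who wants to compare two pasted Project dumps like the one above, here is a minimal Python sketch that reads the key=value lines into a dictionary. It relies only on the visible layout (a [Object] header followed by Key=Value lines). The function name is made up, and the sketch does not interpret the values, since the meaning of the individual keys is not documented in this thread.

```python
def parse_project_settings(text):
    """Parse 'Key=Value' lines from a pasted Project dump into a dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as [Object].
        if not line or line.startswith("["):
            continue
        # Split on the first '=' only, because values (URLs) may contain '='.
        key, separator, value = line.partition("=")
        if separator:
            settings[key] = value
    return settings


# A few lines copied from the dump above.
sample = """[Object]
OEVersion=Pro 7.2.4506
Caption=NXP_Internal_Most_Datasheets
URL=http://www.nxp.com/products/discretes-and-logic/logic:MC_50808
Lev=5
SavePageScroll=16
"""

settings = parse_project_settings(sample)
for key in ("OEVersion", "Caption", "Lev", "SavePageScroll"):
    print(key, "=", settings[key])
```

Two parsed dumps can then be compared key by key to see which settings differ between projects.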
Oleg Chernavin 6/23/2016 7:36:34 PM
Yes, your Project downloads lots of files. Regarding these two parts that didn't download - use the Find Contents dialog to find web pages that contain these parts. And see if they contain the links to download PDFs or not.

Oleg.
George Robinson 6/29/2016 10:15:23 AM
Yes, however with such a wide scope, OE finds all of the relevant pdf files.

I have another challenge for you:
How can all of the PDF datasheets be downloaded from the following URL?
https://www.renesas.com/en-eu/products/general-purpose-logic/rd74hc.html#documents
Oleg Chernavin 6/30/2016 6:28:22 PM
What about a Project with Level=1, File Filters - Archive - Download from any site, run in the "Open pages in the browser and save" mode?

Oleg.
George Robinson 7/1/2016 8:42:47 PM
Nope, such a project downloads only a few PDF files from
https://www.renesas.com/en-eu/products/

The problem is that OE never goes to the 2nd or 3rd page of the product documentation.

For example the documentation URL for just this 1 example product at:
https://www.renesas.com/en-eu/products/general-purpose-logic/rd74hc/device/HD74HCT541T.html#documents
...has 3 pages of documentation. The product's PDF datasheet is usually on the 2nd documentation page, but OE never gets there :(
Oleg Chernavin 7/5/2016 7:18:01 PM
Yes, this is tough! Maybe it can only be done manually: open this page in the Internal Browser, select show 50 items per page, and then use the Save Page button - Save page with its links. It is on the Internal Browser toolbar.

Oleg.