parsing HTML

One Star

Hi everybody.
I need to extract some information (all of it legally) from HTML pages. Is there a way to keep just the information I need and strip out the HTML tags?
Thanks in advance.
Seventeen Stars

Re: parsing HTML

hi,
Use tFileFetch to retrieve the HTML file:
https://help.talend.com/search/all?query=tFileFetch&content-lang=en
You can use a library like jSoup to parse the HTML (this means writing some Java code).
There are also Exchange components such as tHTTPBot or tHTTPTableInput (to read tables).
If the page is well-formed (X)HTML, you can use the Talend XML components after loading the HTML pages.
hope it helps
regards
laurent
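
A minimal sketch of the "write some Java code" route mentioned above, for readers who want to see the idea before wiring up a library. It uses only the Java standard library and a crude regex approach, so it is an illustration under assumptions, not a replacement for a real parser such as jSoup: it will mishandle comments or attribute values that contain ">" and only decodes a handful of entities. The class and method names (TagStripper, stripTags) are hypothetical.

```java
import java.util.regex.Pattern;

class TagStripper {
    // <script>...</script> and <style>...</style> blocks, content included.
    private static final Pattern SCRIPT_OR_STYLE =
        Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");

    // Any remaining tag. Crude: assumes no '>' inside attribute values.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    static String stripTags(String html) {
        String noScripts = SCRIPT_OR_STYLE.matcher(html).replaceAll("");
        String noTags = TAG.matcher(noScripts).replaceAll("");
        // Decode a few common entities; &amp; goes last so "&amp;lt;" stays "&lt;".
        return noTags.replace("&nbsp;", " ")
                     .replace("&lt;", "<")
                     .replace("&gt;", ">")
                     .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p{color:red}</style></head>"
                    + "<body><h1>Price list</h1>\n"
                    + "<p>Apples &amp; pears: <b>3</b></p></body></html>";
        System.out.println(stripTags(html));
    }
}
```

Inside a Talend job, logic like this would typically live in a custom Java step; how it is wired in depends on the job and is not covered by this sketch.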
One Star

Re: parsing HTML

I already tried, but ran into some problems. I started using Talend last week, so I don't have much practice yet.
Could you please help me with an example?
thank you very much.
One Star

Re: parsing HTML

Hi everybody,
I tried tTikaExtractor and it works, but it doesn't remove the blank lines between the lines of text...
Is there a component that writes to the output file sequentially?
thanks in advance.
Moderator

Re: parsing HTML

Hi rob911,
Does the tReplace component (TalendHelpCenter:tReplace) solve your blank-space issue?
You asked: "Is there a component that write in the output file sequentially?"
Could you give an example of what you mean by "write in the output file sequentially"?
Best regards
Sabrina
Seventeen Stars

Re: parsing HTML

hi all,
Perhaps you could pre-process your HTML file by reading it with tFileInputFullRow with the "Skip empty rows" option checked.
hope it helps
regards
laurent
One Star

Re: parsing HTML

I can't upload a screenshot of my file and Talend job.
kzone, where should I put tFileInputFullRow? In my job I have tTikaExtractor -> tFixedFlowInput -> tFileOutputDelimited...
xdshi, with tTikaExtractor I can remove every line of code from my HTML file, but the useful lines stay at the positions they had in the original file.
Thanks to you both, I hope you can help me get to a solution :)
Moderator

Re: parsing HTML

Hi,
You should register and log in as a Community member first; then you'll get an image upload box that lets you upload screen captures and images (limits: 20 images per post, each image smaller than 1024x768 pixels and 200 KB).
Best regards
Sabrina
One Star

Re: parsing HTML

I'm already registered, but I can't log in and I don't know why.
Anyway, the problem is that the lines I'm interested in are not arranged consecutively in the file: there are too many empty rows, left where the code used to be.
I give Tika the URL I'm interested in and get the useful lines in a txt file, but they sit at the same positions as in the HTML file, and I want them in consecutive rows.
I followed this post: http://www.talendforge.org/forum/viewtopic.php?id=22254
But my output is different and I don't know why! :(
Seventeen Stars

Re: parsing HTML

You wrote: "Anyway the problem is that the line which I'm interested in are not disposed in the right sequence in the file, I mean that there are too many empty row, in this empty row there was the code."

Since you have:
tTikaExtractor -> tFixedFlowInput -> tFileOutputDelimited
next, read the delimited file with tFileInputFullRow, skipping empty rows...
I'm not sure it's the most efficient way (in fact, I'm sure it isn't :)), and I'm not sure what result you're expecting.
regards
laurent
One Star

Re: parsing HTML

Hi,
I tried tFileInputFullRow -> tFileOutputDelimited with "Skip empty rows" enabled, but it doesn't remove the empty rows... :(
Regards
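
One possible reason a "skip empty rows" setting seems to do nothing is that the rows are not truly empty but hold spaces, tabs, or non-breaking spaces left over from the HTML. That is an assumption; the thread does not confirm it. Under that assumption, a rough Java sketch of dropping such rows could look like this (BlankLineFilter and dropBlankLines are hypothetical names):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class BlankLineFilter {
    // Keeps only lines that contain something other than whitespace.
    // Non-breaking spaces (U+00A0) are common in text extracted from HTML
    // and are not removed by trim(), so they are mapped to plain spaces first.
    static List<String> dropBlankLines(List<String> lines) {
        return lines.stream()
            .filter(line -> !line.replace('\u00A0', ' ').trim().isEmpty())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> extracted = Arrays.asList("Title", "", "   ", "\t", "First useful line", "\u00A0", "Second useful line");
        for (String line : dropBlankLines(extracted)) {
            System.out.println(line);
        }
    }
}
```

The surviving lines keep their original order, so the output ends up in consecutive rows.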
One Star

Re: parsing HTML

Hi everybody,
Fine, I no longer need the file to be in order.
I just need to extract some lines... Is there a component that can help me with that? I need to specify some start words and some end words.
Thanks in advance.
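
The "start words / end words" idea can be sketched in plain Java as a small state machine: start collecting at the first line containing the start word and stop after the line containing the end word. This is only an illustration of the logic, not a Talend component; SectionExtractor, between, and the marker strings are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SectionExtractor {
    // Returns every line from a line containing startWord up to and including
    // the next line containing endWord. Several such sections may be collected.
    static List<String> between(List<String> lines, String startWord, String endWord) {
        List<String> out = new ArrayList<>();
        boolean inside = false;
        for (String line : lines) {
            if (!inside && line.contains(startWord)) {
                inside = true;              // section opens on this line
            }
            if (inside) {
                out.add(line);
                if (line.contains(endWord)) {
                    inside = false;         // section closes after this line
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "header", "BEGIN prices", "apples 3", "pears 4 END", "footer");
        for (String line : between(lines, "BEGIN", "END")) {
            System.out.println(line);
        }
    }
}
```

If the start and end words sit on the same line, that single line is returned; an unmatched start word collects everything to the end of the input.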
One Star

Re: parsing HTML

I'm using tFileInputRegex and it matches the lines I need... but how can I write these lines to an output file?
Using tFileInputRegex -> tFileOutputDelimited doesn't work.
regards
One Star

Re: parsing HTML

Hi Everybody,
I am reading an HTML file using tFileInputFullRow, but it doesn't read the file from the beginning. It should start reading at the <html> tag, but it starts somewhere else and I'm not sure where. Note: I have not checked the component's random option.