Re: How to strip HTML tags and just get the text



Hi Jon,

Thanks so much for the code. It's much appreciated.

Andrew

"Jon E. Scott" <NOSPAMsupport@xxxxxxxxxxxxxxxxxxxxx> wrote in message
news:42c064bc@xxxxxxxxxxxxxxxxxxxxxxxxx
> "Andrew Diabo" <aadiabo@xxxxxxxxxxxxx> wrote in message
> news:42c05cd0$1@xxxxxxxxxxxxxxxxxxxxxxxxx
>> I'm looking for code or a component that can just get the body text
>> from an HTML file. I'd appreciate any resources for this.
>>
>> Thanks in advance.
>>
>> Andrew
>
> There are quite a few HTML parsers out there, but you can use Microsoft's
> MSHTML parser if IE4 or higher is installed:
>
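> // Variants is needed for VarArrayCreate in Delphi 6 and later
> uses MSHTML, ComObj, ActiveX, Variants;
>
> function ExtractHTMLText(const HTMLString: string): string;
> var
>   HTMLDoc: IHTMLDocument2;
>   v: Variant;
> begin
>   // Default result: the unmodified input, returned if MSHTML is unavailable
>   Result := HTMLString;
>   try
>     // Try to use the IE IHTMLDocument2 interface for text extraction
>     // (IE4 or higher is required)
>     HTMLDoc := CreateComObject(CLASS_HTMLDocument) as IHTMLDocument2;
>     v := VarArrayCreate([0, 0], varVariant);
>     v[0] := HTMLString;
>     HTMLDoc.write(PSafeArray(TVarData(v).VArray));
>     HTMLDoc.close; // finish parsing before reading the body
>     if Assigned(HTMLDoc.body) then
>       Result := HTMLDoc.body.innerText;
>     VarClear(v);
>     HTMLDoc := nil;
>   except
>     // IHTMLDocument2 not registered; a manual parsing method would go here.
>     // As written, the original HTML is returned unchanged.
>   end;
> end;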
>
> Example usage:
>
> procedure TForm1.Button1Click(Sender: TObject);
> var
>   sl: TStringList;
> begin
>   sl := TStringList.Create;
>   try
>     sl.LoadFromFile('c:\downloads\samplefile.html');
>     Memo1.Lines.Text := ExtractHTMLText(sl.Text);
>   finally
>     sl.Free;
>   end;
> end;
>
> --
> Thanks,
> Jon E. Scott
> Blue Orb Software
> http://www.blueorbsoft.com
>
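
For the case the except block mentions (IHTMLDocument2 not registered), a
manual fallback might look roughly like the sketch below. StripTagsManually
is a hypothetical name, not part of Jon's code. It is deliberately naive: it
just drops everything between '<' and '>', so it does not decode entities
such as &amp;, does not skip script or style content, and will be confused
by a '>' inside an attribute value.

```pascal
// Hypothetical fallback, not from the original post: a naive tag stripper.
function StripTagsManually(const HTMLString: string): string;
var
  i: Integer;
  InTag: Boolean;
begin
  Result := '';
  InTag := False;
  for i := 1 to Length(HTMLString) do
  begin
    if InTag then
    begin
      // Inside a tag: discard characters until the closing '>'
      if HTMLString[i] = '>' then
        InTag := False;
    end
    else if HTMLString[i] = '<' then
      InTag := True
    else
      // Outside any tag: keep the character
      Result := Result + HTMLString[i];
  end;
end;
```

If something like this is good enough for your files, the except block in
ExtractHTMLText could call Result := StripTagsManually(HTMLString);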

