One of the great things about Java is the extensive set of standard libraries that ship with the platform, and those libraries certainly include strong support for XML. However, for your particular need, there is no direct support in the standard libraries.
So really you have two options:
- Go and build something from scratch yourself. This is generally painful and time consuming.
- Check the 'community' and see if someone else has already encountered the problem (highly likely) and been kind enough to share it.
In this case, there is a useful little project on SourceForge called JTidy. The JTidy web site can be found at http://sourceforge.net/projects/jtidy/
JTidy provides HTML syntax checking and "pretty printing" of HTML, but for our purposes here it also allows you to take an HTML file as input and convert it into XML. JTidy reads through the input file and, if it finds any mismatched or missing end tags, corrects them and outputs a well-formed XML document.
As you can see from the sample code below, it is quite straightforward to use. Simply set the JTidy instance to output XML, supply an input URL, output file and error file, start up the conversion and you are pretty much done.
import java.io.*;
import java.net.URL;

import org.w3c.tidy.Tidy;

public class TestHTML2XML {
    private final String url;
    private final String outFileName;
    private final String errOutFileName;

    public TestHTML2XML(String url, String outFileName, String errOutFileName) {
        this.url = url;
        this.outFileName = outFileName;
        this.errOutFileName = errOutFileName;
    }

    public void convert() {
        Tidy tidy = new Tidy();
        // Tell Tidy to convert HTML to XML
        tidy.setXmlOut(true);

        // try-with-resources closes the error file and both streams,
        // even if the conversion fails part-way through
        try (PrintWriter errOut = new PrintWriter(new FileWriter(errOutFileName), true);
             InputStream in = new BufferedInputStream(new URL(url).openStream());
             OutputStream out = new FileOutputStream(outFileName)) {
            // Set file for error messages
            tidy.setErrout(errOut);
            // Convert files
            tidy.parse(in, out);
        } catch (IOException e) {
            System.err.println("Conversion failed: " + e);
        }
    }

    public static void main(String[] args) {
        /*
         * Parameters are:
         *   URL of HTML file
         *   Filename of output file
         *   Filename of error file
         */
        if (args.length != 3) {
            System.err.println("Usage: java TestHTML2XML <url> <outFile> <errFile>");
            return;
        }
        TestHTML2XML t = new TestHTML2XML(args[0], args[1], args[2]);
        t.convert();
    }
}
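Since the whole point of the conversion is a well-formed XML document, it can be useful to confirm that with the XML parser that ships in the standard libraries. The following is only a minimal sketch: the class name WellFormedCheck and the method isWellFormed are hypothetical helpers invented for illustration and are not part of JTidy or the sample above. It uses only the standard javax.xml.parsers API and checks a string rather than a file to keep it short.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of JTidy.
final class WellFormedCheck {

    // Returns true if the text parses as well-formed XML, false otherwise.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Replace the default error handler so parse errors are not
            // printed to System.err; fatal errors are still thrown
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the input: it is not well-formed
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("XML parser unavailable", e);
        }
    }

    public static void main(String[] args) {
        // Typical "tag soup": <br> and <p> are never closed
        System.out.println(isWellFormed("<p>Hello<br>world"));      // prints false
        // The same content with every tag closed
        System.out.println(isWellFormed("<p>Hello<br/>world</p>")); // prints true
    }
}
```

The first input is the kind of HTML a browser accepts but an XML parser rejects, which is the gap JTidy is meant to close. To check a converted file instead of a string, the same DocumentBuilder can be pointed at the output file.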
Do you have a Java related question for Michael that you want answered? Forward your questions to builder@zdnet.com.au
Michael Geisler is a senior systems engineer with Sun Microsystems and has more than 14 years of experience in the IT and telecommunications industry. He has been working with Java since the first public beta and is currently the vice-president of the Australian Java Users Group (AJUG).
1
K.ganesan - 13/02/07
thanks! your code used to understand xml in java.
2
Rashmi - 14/04/08
Hi
I am having problems converting a web page to XML using JTidy on Windows XP. I have downloaded the latest version of JTidy, but I get errors when I run the above code because it does not recognise the line
Tidy tidy = new Tidy();
I am new to JTidy. Can anybody tell me the detailed steps to take a web page and convert it into the corresponding XML?
Any help would be greatly appreciated....
3
woutboeing - 17/04/08
hello,
at school im learning java with conTEXT, a java text editing soft. we save the files as java, javac or txt, but now my question is, if i have a working program, how do i make it work for ppl that dont use context, so i can run it from any machine, (like some sort of EXE file, or html)
could you tell me what format to use, and how to do that?
(pls mail me)
4
karthi - 10/08/08
nothing
5
Nike - 17/11/08
I have problem in printing an html file ( browser look ) thru a printer . .
I tried coding with respect to document rendering n all but nothing worked
plz help me out and provide me wid a code which will print an html file wid image directly thru a printer . . . .
6
sengalvarayan - 27/09/09
i have the doubt i write html file and how i call the java file from the html file
7
Hassan - 18/02/10
Sometimes your source HTML page is so messed up that you can't convert it to XML directly. What i did worked for me and maybe will work for others. Look at the code
8
vinay - 22/02/10
I have all the code that is written here, but I couldn't run it properly. Please tell me how to run this code, step by step. Thanks.
9
abhijeet - 24/02/11
I want to save my html data into XML file..
eg. if i write text in textbox then it should be added to respective XML tag
pls help
10
abhijeet - 24/02/11
I want to save my html data into XML file..
eg. if i write text in textbox then it should be added to respective XML tag.I want to do this using JS
pls help