
Datafile-Format discussion

Posted: 16 Dec 2006, 16:13
by Troman
karmazilla wrote: I cringe at the idea of a homegrown XML parser
You shouldn't get too close to a scripting language parser, Karma, I can't foresee the results.  :)

Actually I already had a parser for txt files ready in my head, that shouldn't be a problem. The only thing that scares me is if we were to use XML for something else with a more complicated syntax, which might be difficult to parse with a LALR(1) parser. As I don't have much experience with XML, I'm not sure if that's going to work well; I wouldn't want to sweat over it.

Datafile-Format discussion

Posted: 16 Dec 2006, 17:06
by karmazilla
The warzone scripting language isn't standardized the same way XML is; besides, it's already there, implemented, and scripts have been written to run in it - that's the difference.
The only thing that scares me is if we were to use XML for something else with a more complicated syntax
Such as... maps? Or what do you have in mind? (karmazilla hopes Troman isn't talking about creating a procedural scripting language in XML >_< )
XML itself defines the syntax, which can be built upon to create new markup languages (by constraining validity in the form of DTDs or Schemas).

Datafile-Format discussion

Posted: 16 Dec 2006, 17:21
by Watermelon
So we have 2 ideal candidates for a new data format: custom plain text and xml

list of their pros and cons imo (a concrete example of both follows the lists):

custom plain text:
pros:
1. smaller, faster to parse
2. easy to read/edit for ppl with little or no scripting skills
3. easier to implement: some functions to read a custom 'start mark', 'end mark', 'delimiter' etc

cons:
1. not universal
2. no relational values support
3. it's a 'new mini scripting language' itself

xml:
pros:
1. universal, can be converted into database/web formats etc easily
2. supports more features and is extensible
3. friendly to anyone who already knows xml
3. friendly to anyone who already knows xml

cons:
1. slow parse speed
2. some of the features might not be useful for 'raw data parsing' like those text files
3. much harder to implement
4. it requires modders to have base knowledge of xml
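
to make the comparison concrete, one record might look roughly like this in each style (made-up field names, not actual wz stats):

current wz-style comma line:

MG1Mk1,100,500,machinegun.pie

a hypothetical xml markup of the same record:

<weapon name="MG1Mk1">
  <cost>100</cost>
  <range>500</range>
  <piefilename>machinegun.pie</piefilename>
</weapon>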

Datafile-Format discussion

Posted: 16 Dec 2006, 18:02
by Troman
karmazilla wrote: The warzone scripting language isn't standardized the same way XML is; besides, it's already there, implemented, and scripts have been written to run in it - that's the difference.
Not sure I understood you. XML has its own syntax which can be parsed like anything else; from a parser's point of view it would be no different from a script.
karmazilla wrote: Such as... maps? Or what do you have in mind?
Actually lnd maps should be easy to parse. I didn't mean anything in particular.

With a parser like Bison you can parse anything, be it an XML or HTML file, a shell script, command-line arguments, mail headers or anything else; it doesn't really matter. A parser is there to parse text which is built upon some rules, and it should work as long as the syntax isn't too ambiguous. Bison can't deal well with ambiguity because of its parsing algorithm, although some other parsers, like Accent, which is based on the generalized LR algorithm, can.

The XML syntax I'm familiar with should be an easy job for Bison compared to what it normally has to deal with. What I don't know is whether there are any XML extensions I'm not familiar with, which we might want to use (that's what I had in mind), that could be too ambiguous for Bison. Hence for a more complicated syntax an external library is a better choice; for just the data txt files wz uses it wouldn't matter. If we decide to use XML for data I agree with you, Karma: an external library sounds like a better solution to me.

Datafile-Format discussion

Posted: 16 Dec 2006, 18:16
by Kamaze
Maybe we should split this into its own topic?
Anyway, we're already discussing this heavily on the Dev-List.

Datafile-Format discussion

Posted: 17 Dec 2006, 09:26
by kage
lav_coyote25 wrote: hmmmm... k - question!

these txt files - weapons etc - would/could they be incorporated into what is being attempted ( tech tree as part of the gui ) ??

as i said its just a question... needed to be asked. ;D
no idea what you mean...
karmazilla wrote: I cringe at the idea of a homegrown XML parser :( libxml is a very small dependency - I'm looking at the ubuntu package right now, and libxml1 only depends on libc6 and zlib1g. Besides, because of the SGML legacy in XML, there are some very hairy syntax rules in XML, like DTDs, Entity resolving, xml:id and CDATA sections. Plus, if we're going to create XSD Schemas and validate against them, then libxml might (I'm not entirely sure) have some functionality to solve that.
i entirely agree with karma. if you create your own syntax parser, you want it to accept any valid xml, even if you choose to ignore namespace declarations, processor directives, attributes, etc. it's not worth the effort to make a new parser, as you'll spend 3 months just making it fit the spec and accept valid xml. if all you really want is basic tags, with pretty much no support for anything else, then go ahead and make a parser, as you'll save a lot of time ignoring the stuff you won't use, but don't call it xml (call it wzml or something), because if you call it xml, people will get pissed when they can't use output from their xml editor.
karmazilla wrote: And, if performance in XML parsing is an issue, then wouldn't you expect the lads and lassies behind libxml to know a thing or two about it?
the libxml people aren't the people behind xml, so they can't really decide such things, but yes, they do know a thing or two about it: aside from keeping up with the frequently changing specs, most of their time goes into trying to shave off every last calculation. in either case though, if it turned out to be "slow", then what are they going to do? turn libxml into an audio resampling library? most of the stuff people are familiar with when it comes to sgml derivatives is html, which generally has 7-8 orders of magnitude more content than it does meta-data (the tags), so of course it parses quickly. storing game data would have about 3-4x more bytes put into the metadata than it does the actual data, and there'd be a much different performance curve for that.

in case i need to refresh memories: what's fast and ideal for machines is exactly the opposite of what's fast and ideal for humans -- csv files are much closer to the machine side, and if you don't allow for meaningless whitespace, it's very fast for a machine to parse, but impossibly slow for a human to understand. xml was never designed to be a "speedy alternative to the other stuff" -- it was designed for abstract compatibility: whether or not the content would make sense to a machine, the format definitely does, and same for a human.

what most people don't know is that xml comes in two forms: the textual representation, and the post-parser map that is usually stored in memory (dom). both are as much a part of xml as the other, and you can convert from one to the other without loss of data -- this binary representation is necessary because parsing the xml text (taking into account all parts of the spec, such as namespaces, processor directives, cdata sections, comments, and a few other little bits) *is* so damned slow. when you use dtds or schemas, what they do immediately after the parser goes through the text is validate the entire document by iterating through the dom from start to finish... if you're using xml schemas, then the schema itself first needs to be parsed and iterated so that the schema can first be validated against a hard-coded dtd, so that the schema can be confirmed as a valid schema before the main xml document can be validated against this schema; every single xml resource you use goes through this process of being validated and then being validated against its dtd or schema if any is present.

after all that, the dom is finally presented to the program, which usually iterates the entire dom once more -- all xml documents that use a schema must be iterated through 3 times from start to finish, and there is absolutely no way to optimize this and cut out one or two iterations since the xml spec is very clear and solid about this very point: the schema should not even try to validate an xml document if it's not first determined to be valid xml, and the xml document should not be presented to the program until the xml is determined to be both valid xml, and valid as per any optional schemas: their being draconian about it is really a good thing (because document authors tend to slip through the cracks wherever they can if a format isn't draconian), but it does remove potential for optimization. so, if you use the dom approach, and you have a separate schema for each type of xml document, then you end up with 6 implicit iterations for each xml data file used in warzone (3 for the schema, and 3 for the actual document).

now you have sax: the only reason it was created is that one of the people working at the w3c on the xml spec realized that typical xml parsing ops both used way too much memory, and had too much overhead involved in parsing an xml file (took too long), and that most features in the dom were not used, being that most programs just iterated through the dom to grab data and then freed it from memory. so david megginson, the original developer of sax, made an api that didn't validate the document before initial program access: it instead fired off callbacks every time it encountered a different kind of data (such as "start of element", "end of element", or an attribute, or a namespace declaration), and the program can choose to ignore any kind of data, and beyond that, stuff like the logical hierarchy is something that the receiving program must keep track of if it is of any importance, and if any errors crop up, the program can choose to ignore it. downside is that xml documents only now have as much logical structure as the receiving program chooses to give them. upside is the same as the downside, plus it's really really really fast compared to the dom approach, and it only involves a single iteration. other downside is that you can't validate the document without reinventing the wheel, but this isn't of great concern.
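
to make the callback style concrete, a minimal sketch of sax parsing with libxml2 might look like this (made-up element name and data file, no validation, no real error handling -- a sketch of the api style, not warzone code):

/* minimal libxml2 SAX sketch: print the text inside every <cost> element.
 * hypothetical element/file names; no validation, no error handling. */
#include <stdio.h>
#include <string.h>
#include <libxml/parser.h>

static int in_cost = 0;            /* crude state: inside a <cost> element? */

static void on_start(void *ctx, const xmlChar *name, const xmlChar **attrs)
{
    (void)ctx; (void)attrs;
    if (strcmp((const char *)name, "cost") == 0)
        in_cost = 1;
}

static void on_end(void *ctx, const xmlChar *name)
{
    (void)ctx;
    if (strcmp((const char *)name, "cost") == 0)
        in_cost = 0;
}

static void on_chars(void *ctx, const xmlChar *ch, int len)
{
    (void)ctx;
    if (in_cost)
        printf("cost: %.*s\n", len, (const char *)ch);
}

int main(void)
{
    xmlSAXHandler sax;
    memset(&sax, 0, sizeof(sax));
    sax.startElement = on_start;   /* fired for every opening tag */
    sax.endElement   = on_end;     /* fired for every closing tag */
    sax.characters   = on_chars;   /* fired for text between tags */

    if (xmlSAXUserParseFile(&sax, NULL, "weapons.xml") < 0)
        fprintf(stderr, "parse failed\n");
    return 0;
}

note how any notion of hierarchy (that a <cost> belongs to a particular weapon) is entirely our problem to track.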

sax is still nowhere near as fast as the scanf method, but it'd take that 50x longer processing time and cut it down to about 5-7x.

i'd say that if we do use xml, then we really don't need an external form of validation, such as a schema (internal validation is just as good for this kind of use), and should go with sax. sax probably would be fast enough, and if not, as watermelon said, we could switch to something faster, or as kamaze said, we could just parse once and cache all results for future use (using timestamp checks, of course).
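
the timestamp check itself is trivial -- something along these lines (made-up file names and cache layout; only the freshness test is shown):

/* rough sketch of the "parse once, cache, check timestamps" idea.
 * hypothetical file names; the cache format itself isn't shown. */
#include <stdio.h>
#include <sys/stat.h>

/* returns 1 if the cached copy is at least as new as the source file */
static int cache_is_fresh(const char *src, const char *cache)
{
    struct stat s, c;
    if (stat(src, &s) != 0 || stat(cache, &c) != 0)
        return 0;                          /* missing file -> rebuild */
    return c.st_mtime >= s.st_mtime;
}

int main(void)
{
    if (cache_is_fresh("weapons.xml", "weapons.cache"))
        printf("load the pre-parsed cache\n");            /* fast path */
    else
        printf("parse weapons.xml and rewrite weapons.cache\n");
    return 0;
}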

Datafile-Format discussion

Posted: 17 Dec 2006, 11:23
by karmazilla
Maybe we should split this into its own topic?
might be an idea.
the libxml people aren't the people behind xml
For all I know, one or two of their developers could be members of W3C, and thus in the position to influence the standard.
EDIT: just had a look at their AUTHORS file and the XML spec, and no, no one from libxml helped define the XML spec.
so they can't really decide such things
Decide what things?
keeping up with the frequently changing specs
Oh, it's not that bad unless you venture into the WS-* type of specs. XML itself is pretty solid and mature, and so are DTD and XSD Schema.
so, if you use the dom approach, and you have a separate schema for each type of xml document, then you end up with 6 implicit iterations for each xml data file used in warzone (3 for the schema, and 3 for the actual document).
Unless you have a validating SAX parser and/or use hard-coded DTD validators. If both are employed you can cut it down to 2 iterations: 1) parse XSD Schema using hard-coded DTD validators, and 2) run validating parse on the actual document using the newly created Schema validator.

Hard-coded DTD validators are used in some advanced XML tools, and validating SAX parsers, I think, are the norm in Java land. ;)
downside (of SAX) is that xml documents only now have as much logical structure as the receiving program chooses to give them. upside is the same as the downside, plus it's really really really fast compared to the dom approach, and it only involves a single iteration. other downside is that you can't validate the document without reinventing the wheel, but this isn't of great concern.
Another upside is that SAX can be used to read streaming and arbitrarily large documents (though that isn't terribly useful to us). Plus, most DOM implementations use a SAX parser under the hood.
i'd say that if we do use xml, then we really don't need an external form of validation, such as a schema
Even if warzone itself doesn't validate against a schema, they still have merit. They can be used to define a standard for the format we expect warzone to read, and even if warzone doesn't use them, they will empower the rest of the tool chain: XML editors can get autocompletion, code generators can create object models based on them, and peripheral warzone editors and utilities can verify that their output will be readable to warzone.
But first and foremost, they define a standard that is testable for conformance.
as kamaze said, we could just parse once and cache all results for future use (using timestamp checks, of course).
+1, obviously. I wouldn't dream of doing it any different.

Datafile-Format discussion

Posted: 17 Dec 2006, 13:03
by Troman
kage wrote: if all you really want is basic tags, with pretty much no support for anything else, then go ahead and make a parser, as you'll save a lot of time ignoring the stuff you won't use, but don't call it xml (call it wzml or something), because if you call it xml, people will get pissed when they can't use output from their xml editor.
What will make people pissed is when they have to scroll and count commas, and load the WZ source to find out whether they are about to edit the sensor range or the weapon cost field. I have no intention to deceive people; what I want is human-readable data. What the format is called or what it is parsed with is a secondary matter. Xml, wzml or lua tables, any of them would do the trick.

Datafile-Format discussion

Posted: 18 Dec 2006, 02:24
by kage
i agree with troman and the rest who want something human readable: there are a lot of options (hell, even .ini's would work), and it doesn't matter to me which ones we use, but we should think about, as kamaze pointed out, caching the raw data (which would make the speed penalties negligible).
karmazilla wrote: Another upside is that SAX can be used to read streaming and arbitrarily large documents (though that isn't terribly useful to us). Plus, most DOM implementations use a SAX parser under the hood.
i wasn't suggesting sax was fast because the code was fast (the code is fast): rather, i was suggesting it was dead fast (only compared to other xml implementations) because it doesn't do all the stuff that is common with xml, and doesn't rigidly adhere to the parsing and loading spec (it does adhere fully to the syntax spec, though): dom implementations using sax under the hood is all well and good, but immediately those implementations gain that dom penchant for being slower and using a lot more memory, simply because they are dom.
karmazilla wrote: +1, obviously. I wouldn't dream of doing it any different.
in that case, i won't complain if you validate each xml data file against 17 different schemas and it takes 5 minutes: i'll just fire up a skirmish to precache everything, grab a bite to eat, and then jump into an mp game.

Datafile-Format discussion

Posted: 18 Dec 2006, 02:46
by karmazilla
kage wrote:i wasn't suggesting sax was fast because the code was fast (the code is fast): rather, i was suggesting it was dead fast (only compared to other xml implementations) because it doesn't do all the stuff that is common with xml, and doesn't rigidly adhere to the parsing and loading spec (it does adhere fully to the syntax spec, though): dom implementations using sax under the hood is all well and good, but immediately those implementations gain that dom penchant for being slower and using a lot more memory, simply because they are dom.
I believe DOM is slow because it's building a lossless memory model of the XML document, with all the struct/object allocations that follow in the wake of this strategy.
Sure, we can turn off validation to get an extra edge in performance - this might depend on whether we're building a debug or a release build, where the latter is preferably as fast as possible.
But I wonder what difference in load speed this will make.
kage wrote:in that case, i won't complain if you validate each xml data file against 17 different schemas and it takes 5 minutes: i'll just fire up a skirmish to precache everything, grab a bite to eat, and then jump into an mp game.
17 is grossly overrated, methinks. There would be at most a schema per xml data file, plus the XSD Schema DTD (in a perfect world, the Schema DTD would be loaded and parsed only once; in real life, I think it is either loaded every time a Schema is loaded, or it's hardcoded and not loaded at all... who knows).

Datafile-Format discussion

Posted: 18 Dec 2006, 03:04
by kage
karmazilla wrote: 17 is grossly overrated, methinks. There would be at most a schema per xml data file, plus the XSD Schema DTD (in a perfect world, the Schema DTD would be loaded and parsed only once; in real life, I think it is either loaded every time a Schema is loaded, or it's hardcoded and not loaded at all... who knows).
lol. i was going for a joke -- i'm pretty new at this humor thing though, so i don't always come across as intended.

anyways, most schema use is specified within the xml document: if we use this approach, the xml spec pretty much forces us to reload the schema, even if it's the same one, for each file. however, many libs support validating against an arbitrary schema after the dom is loaded, in which case we can get away with loading a monolithic schema once and reusing it each time. the after-the-fact validation is much more efficient, but requires that you know what kind of data you're dealing with (in this case we would).
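
with libxml2's xsd support, that "load the schema once, validate each document after the fact" approach would look roughly like this (made-up file names, minimal error handling -- a sketch, not a proposal for the real loader):

/* sketch: compile one schema, then reuse it to validate every data file.
 * hypothetical file names; minimal error handling. */
#include <stdio.h>
#include <libxml/parser.h>
#include <libxml/xmlschemas.h>

int main(void)
{
    int i;
    const char *files[] = { "weapons.xml", "structures.xml" };

    /* parse and compile the schema once */
    xmlSchemaParserCtxtPtr pctxt = xmlSchemaNewParserCtxt("wzdata.xsd");
    xmlSchemaPtr schema = pctxt ? xmlSchemaParse(pctxt) : NULL;
    if (pctxt)
        xmlSchemaFreeParserCtxt(pctxt);
    if (!schema) {
        fprintf(stderr, "could not load schema\n");
        return 1;
    }

    /* reuse one validation context for every document */
    xmlSchemaValidCtxtPtr vctxt = xmlSchemaNewValidCtxt(schema);

    for (i = 0; i < 2; i++) {
        xmlDocPtr doc = xmlReadFile(files[i], NULL, 0);
        if (doc && xmlSchemaValidateDoc(vctxt, doc) == 0)
            printf("%s is valid\n", files[i]);
        else
            printf("%s failed to parse or validate\n", files[i]);
        if (doc)
            xmlFreeDoc(doc);
    }

    xmlSchemaFreeValidCtxt(vctxt);
    xmlSchemaFree(schema);
    return 0;
}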

Datafile-Format discussion

Posted: 18 Dec 2006, 14:05
by Watermelon
Troman wrote: What will make people pissed is when they have to scroll and count commas, and load the WZ source to find out whether they are about to edit the sensor range or the weapon cost field. I have no intention to deceive people; what I want is human-readable data. What the format is called or what it is parsed with is a secondary matter. Xml, wzml or lua tables, any of them would do the trick.
yea, I agree with Troman, anything will be better than the one wz uses: all kinds of data 'types' mixed together in a line and delimited by commas...

maybe a custom plain text format similar to 'ini' can be used as a temporary workaround, since xml has become controversial lately...

for example:
[internal stats name 1]
name = "name";
cost = 100;
piefilename = "piefile.pie";
[internal stats name 2]
name = "name2";
cost = 10;
piefilename = "piefile2.pie";

Datafile-Format discussion

Posted: 18 Dec 2006, 14:20
by ratarf
At least, that's human readable... But what if I wanted to add an extra field, e.g. "damage", to each object? And will it be easy to show everything in html tables on a website from this file?

Datafile-Format discussion

Posted: 18 Dec 2006, 15:02
by Watermelon
ratarf wrote: At least, that's human readable... But what if I wanted to add an extra field, e.g. "damage", to each object? And will it be easy to show everything in html tables on a website from this file?
I think we'd better keep the bison/flex file parser for now, and add a mini interpreter to 'translate' the ini-like human-readable data into wz's arbitrary line format (var1,var2,var3....varLast\n). Web compatibility shouldn't be a major concern, since php can read any data type, ranging from binaries to custom plain text.
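
such a mini interpreter could be as dumb as this sketch (field set taken from the example above, hard-coded for illustration; the real thing would of course be table-driven and have proper error handling):

/* sketch: translate the ini-like records above into the comma-separated
 * line format wz already parses.  hard-coded fields, no error handling. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *in = fopen("weapons.ini", "r");     /* hypothetical input file */
    char line[256], name[64] = "", pie[64] = "";
    int cost = 0, have_record = 0;

    if (!in)
        return 1;

    while (fgets(line, sizeof(line), in)) {
        if (line[0] == '[') {                 /* a new [record] starts */
            if (have_record)
                printf("%s,%d,%s\n", name, cost, pie);
            have_record = 1;
            name[0] = pie[0] = '\0';
            cost = 0;
        } else if (sscanf(line, " name = \"%63[^\"]\"", name) == 1) {
            /* captured the display name */
        } else if (sscanf(line, " cost = %d", &cost) == 1) {
            /* captured the cost */
        } else if (sscanf(line, " piefilename = \"%63[^\"]\"", pie) == 1) {
            /* captured the model file name */
        }
    }
    if (have_record)                          /* flush the last record */
        printf("%s,%d,%s\n", name, cost, pie);

    fclose(in);
    return 0;
}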

Datafile-Format discussion

Posted: 18 Dec 2006, 16:24
by ratarf
I agree.