[510 pls]Optimize datum variable accessing

BYOND Forums

Announcements · BYOND Help · Bug Reports · Feature Requests · Beta Testers · Beta Bugs · Developer Help · Design Philosophy · Demos & Libraries · Tutorials & Snippets · Art & Sound · Classified Ads · Game Updates · Contests & Events · Linux Talk · On Topic · Off Topic

Page: 1 2 3 4

Jul 30 2016, 12:56 pm

Ter13

To answer the question, 3 times for every iteration, But I should point out, that none of the code he showed involves proc level vars, only user created vars inside objects (like datums, atoms, etc), proc level vars resolve to what is basically procvars[indexnumber] with procvars being a flat array. You might be onto something here about the optimization, but there already exists syntax to do that. for (index in i to j step k).

The real issue is that BYOND's variables are a list. Most OO languages literally store objects are solid memory regions where each variable itself is simply a pointer offset for where the value begins. BYOND doesn't use primitive data types so to resolve an object's variables we have to:

Seek object pointer.
(seek prototype structure pointer?)
Seek variable list pointer.
Seek variable list index.
Seek pointer to value contained.
Return value.

BYOND's structure is implicit. That's why this particular thing is slow.

Whereas proc memory is simply:

Seek index.
Seek pointer to value contained.
return value.

A normal OO language compiler would resolve all of this at compile time.

The object instance would be stored as a contiguous memory region, and the compilation would use the object pointer + the offset determined by the byte width of members defined prior to the member desired.

Almost everything in BYOND is abstracted into some sort of table, even numeric values for some absurd reason. Dan's imperatives were very much on memory and not grunt processing power when he wrote DM's core structure. The world has changed since, where memory is cheap and processing power is the gold standard.

Jul 31 2016, 11:32 am

Hiead

Since variable lookups are relatively expensive, then it follows that a real performance booster would be implementing some sort of runtime caching scheme for frequently used variables inside of a proc. For instance, anything in a loop header automatically qualifies.

Giving the runtime a little bit of lookahead to see what variables are involved in the next few operations is an extremely cheap operation in itself: it only involves analyzing a few bytes in memory. "Hey, I'm going to use this for this operation and the next one too, so I don't need to look it up twice." Even a relatively "dumb" implementation of such a thing should reap tangible results.

Aug 2 2016, 11:46 am (Edited on Aug 2 2016, 12:04 pm)

MrStonedOne

I mean, the thing to note here, is that this only applies to vars attached to things (like datums/atoms/etc). This isn't such an issue for loops because those are almost always proc vars, and lookup of those is super cheap that it really can't be optimized further.

The question becomes, would the cost of checking to see if an object var is cached be lower then the gain, considering that the majority of datum var accesses are singular, as in aren't repeated.

And how would it check? you can't store it on the var as it has to resolve the var to see that, so you gain nothing. You could store it on the proc, but now you have to loop thru a list to see if the var is there, and you've now only just moved the overhead. You could change how it compiles so that you have a list of every 'var' in the proc, and have that point to the var on the datum (and have it look this up if it's null), but now you have to figure out how to detect if the thing the var is attached to changes in such a way to invalidate that pointer.

I suppose you could just do this for src. vars, since src is highly unlikely to change mid proc run.

You also need to factor in the cost of freeing the cached pointers to the end of the proc, furthering proc overhead, for what you hope is a net gain.

I thought of this and didn't suggest it because I just don't think the gain will be worth it, and if you need such a thing, you can cache it as a proc var yourself in softcode.

How ever only lummox could really remark on this. It could be worthwhile to store an array of src vars accessed in the proc, compile the actual var accesses to an index of that array, if the pointer in the array is zero, look up the var, otherwise, use the pointer.

The big issue is this only optimizes a subcase of a subcase of var access scenarios. var accesses that are both datum vars, and repeated, but only when accessing src.var (implied src. or not)

Is that worth it for the effort it would take to code? Especially when you can get what is basically the same optimization in softcode with a few lines?

The world has changed since, where memory is cheap and processing power is the gold standard.

Until byond goes 64bit, both are at a premium. Lummox could relax the memory constraints by disabling the per-process memory limit of 2GB using the LARGEADDRESSAWARE compile flag, giving byond 4gb of memory to work with, but even then, ss13 without the memory optimization in datum vars would exceed that, and you would be surprised how bad it would be for some of your projects if you tested. (just make something to loop world.contents, loop thru their vars, and assign them to random data unless its a built in (doing the same for datums in world too))

Aug 2 2016, 5:02 pm
Ter13	(just make something to loop world.contents, loop thru their vars, and assign them to random data unless its a built in (doing the same for datums in world too)) I'd never dream of such a thing. You have a truly perverse mind.

Aug 2 2016, 6:40 pm
MrStonedOne	do it anyways! It will really make you appreciate the memory savings byond is giving you.

Aug 2 2016, 6:45 pm

In response to MrStonedOne

Hiead

MrStonedOne wrote:

I mean, the thing to note here, is that this only applies to vars attached to things (like datums/atoms/etc).

But the very test you're using to determine whether these optimizations are successful uses vars attached to things, and the very topic of this thread is "Optimize datum variable accessing."

And how would it check?

I already answered that in my post.

Giving the runtime a little bit of lookahead to see what variables are involved in the next few operations

Presumably, the runtime loads the bytecode into memory and then runs some kind of interpreter across the bytes, executing the operations it comes across in sequence. So the most basic solution is to run some analyzer/translator over the in-memory representation of a proc.

Imagine you take what you understand to be the table of proc vars, and replicate the same thing to create a fast-access table of proc cached vars.

The analyzer looks up frequently-referenced vars in the instructions to come and places them into a cache table. It swaps the in-memory opcode to lookup a datum var and replaces it with a reference to the cache table. This would involve another bytecode operator; like how there is something similar to "[proc var] [0x0001]" maybe this is "[proc cache] [0x0001]." (Rough recall on the "[proc var] [0x0001]" part, from when I last reverse-engineered the DMB format, years ago.)

The end result is a minor overhead in forward-analyzing a few instructions; however, each proc only needs to be analyzed once, and now instead of searching a list or array of variables, the instructions execute in O(1) time. Obviously, there has to be some cost in the analysis phase, just as there is some cost in JIT compilation, etc. But JIT compilation has proven wildly successful, as I imagine something like this would be.

The only "gotcha" case the analyzer really has to account for is in reference changes, but that's trivial due to DM's (relatively) limited instruction set.

You also need to factor in the cost of freeing the cached pointers to the end of the proc, furthering proc overhead, for what you hope is a net gain.

You don't need to re-allocate the variable objects. BYOND uses reference counting. Caching something should be as simple as storing a pointer and incrementing the ref count. At the end, the ref counts are decremented again. I would again chalk this up to negligible overhead.

I thought of this and didn't suggest it because I just don't think the gain will be worth it

Except in embedded systems (increasingly less-so!) or extremely dynamic or high-fps 3D environments, focusing on such micro-optimizations as whether or not a processor can branch-predict a series of instructions is hardly ever worth it, either. That's because 99% of performance issues are not rooted in micro-optimizations such as these, but rather a sign of overall architectural/design issues.

How ever only lummox could really remark on this.

That's true. I'd always be willing to take a gander and see what I could tune up as well---free of charge---if he asked. He is a busy man, after all. I wouldn't wish maintaining this suite and community on any one person. #feelsbadman

Aug 2 2016, 8:57 pm

Lummox JR

I don't think scanning the proc code and pre-optimizing would be doable. Object vars are looked up by name (technically, by string ID), and I can't conceive of any way to speed that up beyond eking more speed out of the current operations.

The language internally simply does not give datums a strongly-defined structure; I see a potential value in that, if there were some way to flag a datum as working that way. (I wouldn't want to do it universally for lots of backward-compatibility reasons.) The trick would then be getting the compiler to save a different kind of var index value for cases where it had every reason to believe it was dealing with a certain type. And if it turned out wrong, like if the var contained a different type, that'd be very hard to catch.

Aug 2 2016, 9:46 pm
MrStonedOne	The thought i had was make it so you could define a datum var as inline, and that var could only be accessed inside the datum (as in, only via src.var), but be compile time resolved. It's limited in functionality, but it would be the easiest* to implement as it would just be an extension of a proc var or global var _{_{*easiest != easy in general}}

Aug 3 2016, 1:57 am (Edited on Aug 3 2016, 7:42 am)

In response to Lummox JR

Hiead

Lummox JR wrote:

I don't think scanning the proc code and pre-optimizing would be doable. Object vars are looked up by name (technically, by string ID)

I disagree. I know that they are looked up by name ID. I'm having trouble figuring out why it's not doable to change the opcodes in memory before executing them.

and I can't conceive of any way to speed that up beyond eking more speed out of the current operations.

The idea for me isn't to speed up a single var access beyond either the binary search tree that has already come up, or a hash map. It's rather to deduce when the same var access could be reused multiple times. When multiple operations need to look up the same var (name ID), then subsequent usages should be able to reuse the reference attained in the first lookup. A very basic version might have pseudocode similar to:

optimize_datum_vars():
   while ( next_var_id = get_next_datum_var() ):
      if ( vars_seen.contains(next_var_id) ):
         if ( !vars_cached.contains(next_var_id) ):
            cache_var(next_var_id)
            vars_cached += next_var_id
      else:
         vars_seen += next_var_id

cache_var(var_id):
   find first reference to var_id in proc in-memory bytecode:
      // store this var at an index in the proc cache
      pre-inject some command that looks like:
         "store [datum.var_id] -> proc_cache[current_cache_index]"
      // access this var via index in proc cache
      transform datum.var_id to proc_cache[current_cache_index]
   for every subsequent reference to var_id:
      transform reference to proc_cache[current_cache_index]
   increment current_cache_index

Obviously, that's a very rough sketch and could use work of its own. An optimal analyzer / binary translator is beyond the scope of this post. This type of function would need to be called on a proc only one time. It is probably feasible to run such an analyzer on every proc once at startup, though it would be trivial to flag whether or not a proc has been analyzed or not.

Accesses could likewise be interpreted inside loop bodies and headers based on some other heuristic, such that a variable accessed in 10000 iterations of a loop is only looked up once, cached, and then accessed via the cache subsequently.

Edit: It was non-obvious so I wanted to clarify: What you would be caching is a pointer to the Var2 and not the value of it itself, so value changes on it would have little consequence.

Aug 3 2016, 9:58 am

In response to Hiead

Lummox JR

Hiead wrote:

Lummox JR wrote:

I don't think scanning the proc code and pre-optimizing would be doable. Object vars are looked up by name (technically, by string ID)

I disagree. I know that they are looked up by name ID. I'm having trouble figuring out why it's not doable to change the opcodes in memory before executing them.

Doing that at runtime would be impossible; there's just no way that could be done with any kind of speed. That's the kind of optimization that would make sense at compile-time only. But even so, such a change would mean that all datums--or at least all datums marked for such a thing--would need to be handled differently. The following issues come into play:

1) The datum would need to have space for all of its vars, or at least all of its vars that were marked for this optimization, allocated and assigned whether they were changed from their original values or not. This additional assignment, and dereferencing when the datum was deleted, would be a drain on efficiency; the only way around that that I can see would be to only init the block of vars once an access occurs.

2) The compiler would have to be aware of which vars it was dealing with, by var type (this BTW is completely non-trivial, as it gets into parts of the compiler I don't understand well), and optimize var indexes accordingly so they were something like [pre-evalulated index][12].

3) At runtime, accessing a pre-evaluated index on a non-matching type would be disastrous. If the index was in range, you'd be reading/writing a completely arbitrary var on the datum you were working with. If it was out of range, a bound check would be critical.

These aren't show-stoppers, but addressing the bigger concerns would involve adding some additional checks that would (very slightly) blunt the efficiency of the access. For me probably the biggest problem is #2: the compiler has to be way, way more context-aware when dealing with var indexes than it is now.

Also, and this is potentially an issue: I don't know how a new var index code would impact things in terms of efficiency. No idea. The current codes are tightly grouped, making the switch() statement that handles them fairly well-optimized. I fear the logic behind handling a new code would probably lead to tiny slowdowns in all other var accesses across the board. If the new code were made to be in the same block as the others, that would fix the efficiency issue but it'd tack another tiny slowdown onto these optimized cases.

The idea for me isn't to speed up a single var access beyond either the binary search tree that has already come up, or a hash map. It's rather to deduce when the same var access could be reused multiple times. When multiple operations need to look up the same var (name ID), then subsequent usages should be able to reuse the reference attained in the first lookup.

This can be accomplished by the programmer themselves, by assigning the value to a local var. The compiler is quite incapable of making that kind of leap, as it requires a very high-level sense of optimization. Implementing such a thing would be nightmarish, far more so than the work I discussed above which is at least mostly approachable.

Aug 17 2016, 1:51 pm (Edited on Aug 17 2016, 3:00 pm)

Lummox JR

I'm doing some testing, and it appears doing the binary search with conditional moves is tricky--but probably doable. I was able to force at least one cmov into place with this code:

if(varname == (vname2=v1->name)) return v1;
last = (varname < vname2 ? i : last);
first = (varname > vname2 ? i+1 : first);

This is the assembly that produces, with optimizations turned on:

            if(varname == (vname2=v1->name)) return v1;
007F2094  mov         ebx,dword ptr [eax+4]
007F2097  cmp         dword ptr [varname],ebx
007F209A  je          GetVarNamed+59h (07F20B9h)
            last = (varname < vname2 ? i : last);
007F209C  movzx       ecx,dx
007F209F  mov         eax,edi
007F20A1  cmovb       ecx,eax
007F20A4  movzx       edx,cx
            first = (varname > vname2 ? i+1 : first);
007F20A7  jbe         GetVarNamed+4Fh (07F20AFh)
007F20A9  lea         eax,[edi+1]
007F20AC  movzx       esi,ax

As you can see, this is not ideal because there's still a branch on one side. That branch exists because of the i+1, which is done via lea. If I increment i in between these two lines, then the comparison has to be done all over again. So I need to find a way that the cmov is done in both cases.

[edit]
I found a way to do this that used two cmovs, that looks to be fairly elegant in the compiled code so I think it's a winner. I've done zero profiling (to be honest, I have no desire to go there), but I like the looks of it.

Unfortunately cmov optimization will not help with bag searches and insertions, which sucks because that would have had a huge impact on appearances. I found out that my ternary statement wouldn't compile to a cmov, or in this case two of them, and I think it's clear now why.

x = (c <= 0) ? x->bag_left : x->bag_right;

Both of those require a load from memory. I could probably force the cmov issue by splitting this into two lines, but then it would grab data from memory twice instead of just once (since cmov does the read regardless), which I suspect would result in a minor performance loss.

Even more annoying, I really want c to be a register var because it's used later in the routine and it gets saved to memory (on the stack) on each loop. Sadly it appears the compiler does not want to do that.

Aug 21 2016, 10:24 pm
Lummox JR	I forgot to mention it in this thread, but the cmov optimization I did for vars is in place in 511.1353. Please let me know what you find in terms of performance.

Aug 22 2016, 10:24 am (Edited on Aug 23 2016, 4:09 pm)
MrStonedOne	I got a profile running, Each version is gonna take about 5 to 10 minutes to test and i have 3 byond versions to test (1347, 1352, and 1353) edit, I had to re-design the profile setup because the results have too much noise. Those are running now.

Aug 23 2016, 11:03 pm (Edited on Aug 23 2016, 11:10 pm)

MrStonedOne

1347:

Proc Name                                        Self CPU    Total CPU    Real Time        Calls
/datum/testdatum48vars/proc/testvaroverhead        46.478       46.483       46.496        50000
/datum/testdatum48vars/proc/testvaroverheadR       42.690       42.692       42.728        50000
/datum/testdatum24vars/proc/testvaroverheadR       35.821       35.828       35.845        50000
/datum/testdatum24vars/proc/testvaroverhead        32.659       32.666       32.679        50000
/datum/testdatum8vars/proc/testvaroverhead         31.059       31.060       31.070        50000
/datum/testdatum8vars/proc/testvaroverheadR        30.915       30.920       30.923        50000
/datum/testdatum1var/proc/testvaroverheadR         23.652       23.656       23.665        50000
/datum/testdatum1var/proc/testvaroverhead          22.024       22.027       22.036        50000

1352:

Proc Name                                        Self CPU    Total CPU    Real Time        Calls
/datum/testdatum48vars/proc/testvaroverhead        44.587       44.592       44.605        50000
/datum/testdatum48vars/proc/testvaroverheadR       38.824       38.828       38.840        50000
/datum/testdatum24vars/proc/testvaroverheadR       35.836       35.839       35.852        50000
/datum/testdatum24vars/proc/testvaroverhead        32.601       32.604       32.611        50000
/datum/testdatum8vars/proc/testvaroverhead         30.804       30.804       30.817        50000
/datum/testdatum8vars/proc/testvaroverheadR        30.009       30.009       30.021        50000
/datum/testdatum1var/proc/testvaroverhead          23.447       23.454       23.465        50000
/datum/testdatum1var/proc/testvaroverheadR         23.435       23.437       23.442        50000

1353:

Proc Name                                        Self CPU    Total CPU    Real Time        Calls
/datum/testdatum48vars/proc/testvaroverhead        45.564       45.568       45.585        50000
/datum/testdatum48vars/proc/testvaroverheadR       40.128       40.134       40.148        50000
/datum/testdatum24vars/proc/testvaroverheadR       36.040       36.044       36.055        50000
/datum/testdatum24vars/proc/testvaroverhead        32.729       32.732       32.748        50000
/datum/testdatum8vars/proc/testvaroverhead         30.771       30.774       30.781        50000
/datum/testdatum8vars/proc/testvaroverheadR        30.507       30.509       30.518        50000
/datum/testdatum1var/proc/testvaroverheadR         23.558       23.562       23.575        50000
/datum/testdatum1var/proc/testvaroverhead          23.270       23.272       23.281        50000

(Each call triggers 10000 random var reads)
(R version of the proc is just when the call order is reversed.)
(100,000 datums of each type is created, each call is called on a random datum from the pool)

Other interesting notes, 510 didn't want to free the memory used up at the end the proc (in an earlier version of the code), and 510's peak usage was markedly lower than 511, 180MB vs 230MB for this final test run.

Page: 1 2 3 4