To answer the question, 3 times for every iteration, But I should point out, that none of the code he showed involves proc level vars, only user created vars inside objects (like datums, atoms, etc), proc level vars resolve to what is basically procvars[indexnumber] with procvars being a flat array. You might be onto something here about the optimization, but there already exists syntax to do that. for (index in i to j step k).

The real issue is that BYOND's variables are a list. Most OO languages literally store objects are solid memory regions where each variable itself is simply a pointer offset for where the value begins. BYOND doesn't use primitive data types so to resolve an object's variables we have to:

Seek object pointer.
(seek prototype structure pointer?)
Seek variable list pointer.
Seek variable list index.
Seek pointer to value contained.
Return value.

BYOND's structure is implicit. That's why this particular thing is slow.

Whereas proc memory is simply:

Seek index.
Seek pointer to value contained.
return value.

A normal OO language compiler would resolve all of this at compile time.

The object instance would be stored as a contiguous memory region, and the compilation would use the object pointer + the offset determined by the byte width of members defined prior to the member desired.

Almost everything in BYOND is abstracted into some sort of table, even numeric values for some absurd reason. Dan's imperatives were very much on memory and not grunt processing power when he wrote DM's core structure. The world has changed since, where memory is cheap and processing power is the gold standard.
Since variable lookups are relatively expensive, then it follows that a real performance booster would be implementing some sort of runtime caching scheme for frequently used variables inside of a proc. For instance, anything in a loop header automatically qualifies.

Giving the runtime a little bit of lookahead to see what variables are involved in the next few operations is an extremely cheap operation in itself: it only involves analyzing a few bytes in memory. "Hey, I'm going to use this for this operation and the next one too, so I don't need to look it up twice." Even a relatively "dumb" implementation of such a thing should reap tangible results.
I mean, the thing to note here, is that this only applies to vars attached to things (like datums/atoms/etc). This isn't such an issue for loops because those are almost always proc vars, and lookup of those is super cheap that it really can't be optimized further.

The question becomes, would the cost of checking to see if an object var is cached be lower then the gain, considering that the majority of datum var accesses are singular, as in aren't repeated.

And how would it check? you can't store it on the var as it has to resolve the var to see that, so you gain nothing. You could store it on the proc, but now you have to loop thru a list to see if the var is there, and you've now only just moved the overhead. You could change how it compiles so that you have a list of every 'var' in the proc, and have that point to the var on the datum (and have it look this up if it's null), but now you have to figure out how to detect if the thing the var is attached to changes in such a way to invalidate that pointer.

I suppose you could just do this for src. vars, since src is highly unlikely to change mid proc run.

You also need to factor in the cost of freeing the cached pointers to the end of the proc, furthering proc overhead, for what you hope is a net gain.

I thought of this and didn't suggest it because I just don't think the gain will be worth it, and if you need such a thing, you can cache it as a proc var yourself in softcode.

How ever only lummox could really remark on this. It could be worthwhile to store an array of src vars accessed in the proc, compile the actual var accesses to an index of that array, if the pointer in the array is zero, look up the var, otherwise, use the pointer.

The big issue is this only optimizes a subcase of a subcase of var access scenarios. var accesses that are both datum vars, and repeated, but only when accessing src.var (implied src. or not)

Is that worth it for the effort it would take to code? Especially when you can get what is basically the same optimization in softcode with a few lines?



The world has changed since, where memory is cheap and processing power is the gold standard.

Until byond goes 64bit, both are at a premium. Lummox could relax the memory constraints by disabling the per-process memory limit of 2GB using the LARGEADDRESSAWARE compile flag, giving byond 4gb of memory to work with, but even then, ss13 without the memory optimization in datum vars would exceed that, and you would be surprised how bad it would be for some of your projects if you tested. (just make something to loop world.contents, loop thru their vars, and assign them to random data unless its a built in (doing the same for datums in world too))
(just make something to loop world.contents, loop thru their vars, and assign them to random data unless its a built in (doing the same for datums in world too))

I'd never dream of such a thing. You have a truly perverse mind.
do it anyways! It will really make you appreciate the memory savings byond is giving you.
In response to MrStonedOne
MrStonedOne wrote:
I mean, the thing to note here, is that this only applies to vars attached to things (like datums/atoms/etc).

But the very test you're using to determine whether these optimizations are successful uses vars attached to things, and the very topic of this thread is "Optimize datum variable accessing."

And how would it check?

I already answered that in my post.

Giving the runtime a little bit of lookahead to see what variables are involved in the next few operations

Presumably, the runtime loads the bytecode into memory and then runs some kind of interpreter across the bytes, executing the operations it comes across in sequence. So the most basic solution is to run some analyzer/translator over the in-memory representation of a proc.

Imagine you take what you understand to be the table of proc vars, and replicate the same thing to create a fast-access table of proc cached vars.

The analyzer looks up frequently-referenced vars in the instructions to come and places them into a cache table. It swaps the in-memory opcode to lookup a datum var and replaces it with a reference to the cache table. This would involve another bytecode operator; like how there is something similar to "[proc var] [0x0001]" maybe this is "[proc cache] [0x0001]." (Rough recall on the "[proc var] [0x0001]" part, from when I last reverse-engineered the DMB format, years ago.)

The end result is a minor overhead in forward-analyzing a few instructions; however, each proc only needs to be analyzed once, and now instead of searching a list or array of variables, the instructions execute in O(1) time. Obviously, there has to be some cost in the analysis phase, just as there is some cost in JIT compilation, etc. But JIT compilation has proven wildly successful, as I imagine something like this would be.

The only "gotcha" case the analyzer really has to account for is in reference changes, but that's trivial due to DM's (relatively) limited instruction set.

You also need to factor in the cost of freeing the cached pointers to the end of the proc, furthering proc overhead, for what you hope is a net gain.

You don't need to re-allocate the variable objects. BYOND uses reference counting. Caching something should be as simple as storing a pointer and incrementing the ref count. At the end, the ref counts are decremented again. I would again chalk this up to negligible overhead.

I thought of this and didn't suggest it because I just don't think the gain will be worth it

Except in embedded systems (increasingly less-so!) or extremely dynamic or high-fps 3D environments, focusing on such micro-optimizations as whether or not a processor can branch-predict a series of instructions is hardly ever worth it, either. That's because 99% of performance issues are not rooted in micro-optimizations such as these, but rather a sign of overall architectural/design issues.

How ever only lummox could really remark on this.

That's true. I'd always be willing to take a gander and see what I could tune up as well---free of charge---if he asked. He is a busy man, after all. I wouldn't wish maintaining this suite and community on any one person. #feelsbadman
I don't think scanning the proc code and pre-optimizing would be doable. Object vars are looked up by name (technically, by string ID), and I can't conceive of any way to speed that up beyond eking more speed out of the current operations.

The language internally simply does not give datums a strongly-defined structure; I see a potential value in that, if there were some way to flag a datum as working that way. (I wouldn't want to do it universally for lots of backward-compatibility reasons.) The trick would then be getting the compiler to save a different kind of var index value for cases where it had every reason to believe it was dealing with a certain type. And if it turned out wrong, like if the var contained a different type, that'd be very hard to catch.
The thought i had was make it so you could define a datum var as inline, and that var could only be accessed inside the datum (as in, only via src.var), but be compile time resolved.

It's limited in functionality, but it would be the easiest* to implement as it would just be an extension of a proc var or global var
*easiest != easy in general
In response to Lummox JR
Lummox JR wrote:
I don't think scanning the proc code and pre-optimizing would be doable. Object vars are looked up by name (technically, by string ID)

I disagree. I know that they are looked up by name ID. I'm having trouble figuring out why it's not doable to change the opcodes in memory before executing them.

and I can't conceive of any way to speed that up beyond eking more speed out of the current operations.

The idea for me isn't to speed up a single var access beyond either the binary search tree that has already come up, or a hash map. It's rather to deduce when the same var access could be reused multiple times. When multiple operations need to look up the same var (name ID), then subsequent usages should be able to reuse the reference attained in the first lookup. A very basic version might have pseudocode similar to:
optimize_datum_vars():
while ( next_var_id = get_next_datum_var() ):
if ( vars_seen.contains(next_var_id) ):
if ( !vars_cached.contains(next_var_id) ):
cache_var(next_var_id)
vars_cached += next_var_id
else:
vars_seen += next_var_id

cache_var(var_id):
find first reference to var_id in proc in-memory bytecode:
// store this var at an index in the proc cache
pre-inject some command that looks like:
"store [datum.var_id] -> proc_cache[current_cache_index]"
// access this var via index in proc cache
transform datum.var_id to proc_cache[current_cache_index]
for every subsequent reference to var_id:
transform reference to proc_cache[current_cache_index]
increment current_cache_index

Obviously, that's a very rough sketch and could use work of its own. An optimal analyzer / binary translator is beyond the scope of this post. This type of function would need to be called on a proc only one time. It is probably feasible to run such an analyzer on every proc once at startup, though it would be trivial to flag whether or not a proc has been analyzed or not.

Accesses could likewise be interpreted inside loop bodies and headers based on some other heuristic, such that a variable accessed in 10000 iterations of a loop is only looked up once, cached, and then accessed via the cache subsequently.

Edit: It was non-obvious so I wanted to clarify: What you would be caching is a pointer to the Var2 and not the value of it itself, so value changes on it would have little consequence.
In response to Hiead
Hiead wrote:
Lummox JR wrote:
I don't think scanning the proc code and pre-optimizing would be doable. Object vars are looked up by name (technically, by string ID)

I disagree. I know that they are looked up by name ID. I'm having trouble figuring out why it's not doable to change the opcodes in memory before executing them.

Doing that at runtime would be impossible; there's just no way that could be done with any kind of speed. That's the kind of optimization that would make sense at compile-time only. But even so, such a change would mean that all datums--or at least all datums marked for such a thing--would need to be handled differently. The following issues come into play:

1) The datum would need to have space for all of its vars, or at least all of its vars that were marked for this optimization, allocated and assigned whether they were changed from their original values or not. This additional assignment, and dereferencing when the datum was deleted, would be a drain on efficiency; the only way around that that I can see would be to only init the block of vars once an access occurs.

2) The compiler would have to be aware of which vars it was dealing with, by var type (this BTW is completely non-trivial, as it gets into parts of the compiler I don't understand well), and optimize var indexes accordingly so they were something like [pre-evalulated index][12].

3) At runtime, accessing a pre-evaluated index on a non-matching type would be disastrous. If the index was in range, you'd be reading/writing a completely arbitrary var on the datum you were working with. If it was out of range, a bound check would be critical.

These aren't show-stoppers, but addressing the bigger concerns would involve adding some additional checks that would (very slightly) blunt the efficiency of the access. For me probably the biggest problem is #2: the compiler has to be way, way more context-aware when dealing with var indexes than it is now.

Also, and this is potentially an issue: I don't know how a new var index code would impact things in terms of efficiency. No idea. The current codes are tightly grouped, making the switch() statement that handles them fairly well-optimized. I fear the logic behind handling a new code would probably lead to tiny slowdowns in all other var accesses across the board. If the new code were made to be in the same block as the others, that would fix the efficiency issue but it'd tack another tiny slowdown onto these optimized cases.

The idea for me isn't to speed up a single var access beyond either the binary search tree that has already come up, or a hash map. It's rather to deduce when the same var access could be reused multiple times. When multiple operations need to look up the same var (name ID), then subsequent usages should be able to reuse the reference attained in the first lookup.

This can be accomplished by the programmer themselves, by assigning the value to a local var. The compiler is quite incapable of making that kind of leap, as it requires a very high-level sense of optimization. Implementing such a thing would be nightmarish, far more so than the work I discussed above which is at least mostly approachable.
I'm doing some testing, and it appears doing the binary search with conditional moves is tricky--but probably doable. I was able to force at least one cmov into place with this code:

if(varname == (vname2=v1->name)) return v1;
last = (varname < vname2 ? i : last);
first = (varname > vname2 ? i+1 : first);

This is the assembly that produces, with optimizations turned on:

            if(varname == (vname2=v1->name)) return v1;
007F2094 mov ebx,dword ptr [eax+4]
007F2097 cmp dword ptr [varname],ebx
007F209A je GetVarNamed+59h (07F20B9h)
last = (varname < vname2 ? i : last);
007F209C movzx ecx,dx
007F209F mov eax,edi
007F20A1 cmovb ecx,eax
007F20A4 movzx edx,cx
first = (varname > vname2 ? i+1 : first);
007F20A7 jbe GetVarNamed+4Fh (07F20AFh)
007F20A9 lea eax,[edi+1]
007F20AC movzx esi,ax

As you can see, this is not ideal because there's still a branch on one side. That branch exists because of the i+1, which is done via lea. If I increment i in between these two lines, then the comparison has to be done all over again. So I need to find a way that the cmov is done in both cases.

[edit]
I found a way to do this that used two cmovs, that looks to be fairly elegant in the compiled code so I think it's a winner. I've done zero profiling (to be honest, I have no desire to go there), but I like the looks of it.

Unfortunately cmov optimization will not help with bag searches and insertions, which sucks because that would have had a huge impact on appearances. I found out that my ternary statement wouldn't compile to a cmov, or in this case two of them, and I think it's clear now why.

x = (c <= 0) ? x->bag_left : x->bag_right;

Both of those require a load from memory. I could probably force the cmov issue by splitting this into two lines, but then it would grab data from memory twice instead of just once (since cmov does the read regardless), which I suspect would result in a minor performance loss.

Even more annoying, I really want c to be a register var because it's used later in the routine and it gets saved to memory (on the stack) on each loop. Sadly it appears the compiler does not want to do that.
I forgot to mention it in this thread, but the cmov optimization I did for vars is in place in 511.1353. Please let me know what you find in terms of performance.
I got a profile running, Each version is gonna take about 5 to 10 minutes to test and i have 3 byond versions to test (1347, 1352, and 1353)

edit, I had to re-design the profile setup because the results have too much noise.

Those are running now.
1347:
Proc Name                                        Self CPU    Total CPU    Real Time        Calls
/datum/testdatum48vars/proc/testvaroverhead 46.478 46.483 46.496 50000
/datum/testdatum48vars/proc/testvaroverheadR 42.690 42.692 42.728 50000
/datum/testdatum24vars/proc/testvaroverheadR 35.821 35.828 35.845 50000
/datum/testdatum24vars/proc/testvaroverhead 32.659 32.666 32.679 50000
/datum/testdatum8vars/proc/testvaroverhead 31.059 31.060 31.070 50000
/datum/testdatum8vars/proc/testvaroverheadR 30.915 30.920 30.923 50000
/datum/testdatum1var/proc/testvaroverheadR 23.652 23.656 23.665 50000
/datum/testdatum1var/proc/testvaroverhead 22.024 22.027 22.036 50000


1352:
Proc Name                                        Self CPU    Total CPU    Real Time        Calls
/datum/testdatum48vars/proc/testvaroverhead 44.587 44.592 44.605 50000
/datum/testdatum48vars/proc/testvaroverheadR 38.824 38.828 38.840 50000
/datum/testdatum24vars/proc/testvaroverheadR 35.836 35.839 35.852 50000
/datum/testdatum24vars/proc/testvaroverhead 32.601 32.604 32.611 50000
/datum/testdatum8vars/proc/testvaroverhead 30.804 30.804 30.817 50000
/datum/testdatum8vars/proc/testvaroverheadR 30.009 30.009 30.021 50000
/datum/testdatum1var/proc/testvaroverhead 23.447 23.454 23.465 50000
/datum/testdatum1var/proc/testvaroverheadR 23.435 23.437 23.442 50000

1353:
Proc Name                                        Self CPU    Total CPU    Real Time        Calls
/datum/testdatum48vars/proc/testvaroverhead 45.564 45.568 45.585 50000
/datum/testdatum48vars/proc/testvaroverheadR 40.128 40.134 40.148 50000
/datum/testdatum24vars/proc/testvaroverheadR 36.040 36.044 36.055 50000
/datum/testdatum24vars/proc/testvaroverhead 32.729 32.732 32.748 50000
/datum/testdatum8vars/proc/testvaroverhead 30.771 30.774 30.781 50000
/datum/testdatum8vars/proc/testvaroverheadR 30.507 30.509 30.518 50000
/datum/testdatum1var/proc/testvaroverheadR 23.558 23.562 23.575 50000
/datum/testdatum1var/proc/testvaroverhead 23.270 23.272 23.281 50000

(Each call triggers 10000 random var reads)
(R version of the proc is just when the call order is reversed.)
(100,000 datums of each type is created, each call is called on a random datum from the pool)

Other interesting notes, 510 didn't want to free the memory used up at the end the proc (in an earlier version of the code), and 510's peak usage was markedly lower than 511, 180MB vs 230MB for this final test run.
Page: 1 2 3 4