Wednesday, October 7, 2015

Curves

So what have I been working on lately? For some time now I have been working again, so I have fewer hours to experiment and work on things, and I actually feel the need to turn off the computer sometimes.

Anyway, I'm a geek and geeks can't turn off the computer for too long before curiosity takes over.

Lately I have been investigating curves. I know what you are thinking, and it's not that; I'm not stupid, I know to keep my distance.

I'm talking about another type of curve: mathematically drawn curved lines.

Now you may be thinking I'm talking about y=(x*scalefactor)^2, or y=sin(x)*scalefactor.
No, formulas like that only draw curves in the y direction; you can't have a diagonal curve, or a curve between two arbitrary coordinates, using formulas like that.

There is a video from Steven Wittens on YouTube that I really enjoyed.

https://www.youtube.com/watch?v=Zkx1aKv2z8o&index=6&list=WL

Bezier curves!!!!
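
To give an idea of what is involved, here is a minimal sketch in C of how one point on a cubic Bezier curve can be calculated. It is not the code from the editor; the names point2d and bezier_cubic are made up for this example. The curve starts at p0 and ends at p3, p1 and p2 are the control points you drag around, and t runs from 0 to 1.

```c
/* Hypothetical 2D point type, used only for this example. */
typedef struct { double x, y; } point2d;

/* Evaluate a cubic Bezier curve at t in [0,1].
   p0 and p3 are the end points, p1 and p2 are the control points. */
point2d bezier_cubic(point2d p0, point2d p1, point2d p2, point2d p3, double t)
{
    double u  = 1.0 - t;
    double b0 = u * u * u;          /* weight of p0 */
    double b1 = 3.0 * u * u * t;    /* weight of p1 */
    double b2 = 3.0 * u * t * t;    /* weight of p2 */
    double b3 = t * t * t;          /* weight of p3 */
    point2d r;

    r.x = b0 * p0.x + b1 * p1.x + b2 * p2.x + b3 * p3.x;
    r.y = b0 * p0.y + b1 * p1.y + b2 * p2.y + b3 * p3.y;
    return r;
}
```

Calling this for t = 0, 0.05, 0.10 and so on up to 1, and drawing a straight line between each pair of points, gives a curve between any two coordinates, diagonal or not.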

So why on earth are curves important? Well, you can draw things in a paint program, save the image and use that, with no need to know how it was done, but realize that the paint tools themselves use curves.

Even the fonts used for this text use curves. SVG and other vector images use curves, and so do Flash videos and lots of other things.

The best way to play around with curves is an editor, a program where you drag curves around, and this is what I made. I will soon complete the program, as a simple toy.



So why not use a library someone else has made? Why spend this amount of time on something you can get for free? Well, what is the fun in that? Coding is about figuring things out, playing with things and learning something. The most fun I have had was maybe with programs I did not complete, just experimenting with things.

So why don't I work on mplayer? Well, the answer is that I need a break from mplayer; it has become work, and coding should be fun. So that is why I'm doing this, I guess.

Tuesday, August 11, 2015

PowerPC code optimization, experiment, code generator part 2.




Lately I have been experimenting with a sort of JIT engine to generate machine code; the graph above is an example of what I can do.

The graph shows the performance of the code depending on the x and y factors, to find the ideal conditions for the best speed.

The X axis is the number of unrolls (floating point registers used); the Y axis is the maximum number of code blocks per loop.

The test runs 64000 int to float conversions with a floating point scale factor.
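
To make it concrete, the operation being measured is roughly the following, written here as plain C. This is only a sketch: the function name and argument order are made up for illustration, and the test itself runs generated PowerPC machine code rather than a C loop like this one.

```c
#include <stddef.h>

/* Hypothetical name. Converts 'count' ints to floats and multiplies
   each one by a floating point scale factor. */
void ints_to_floats_scaled(const int *src, float *dst, size_t count, float scale)
{
    size_t i;

    for (i = 0; i < count; i++) {
        dst[i] = (float)src[i] * scale;
    }
}
```

In the test, count would be 64000.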

What you see is that the number of unrolls helps, but if the code in the loop gets too big, the speed goes down.

If I wrote this kind of test by hand it would take a month, but as I'm generating the code, I can try different combinations in a few seconds.

The same kind of code generator test can be done on any type of assembler code; it works with AltiVec, FPU or CPU instructions.


Sunday, July 26, 2015

PowerPC Machine Code generator Experiment.

Lately I have been looking at ways to convert float to int fast, and I have been thinking about this video.

https://www.youtube.com/watch?v=So-m4NUzKLw

C64 demo coders trying to make the C64 do more than it should be able to do.
There has also been lots of talk about JIT compilers in the Amiga community, such as the EUAE JIT compiler by Álmos Rajnai.

JIT compilers made a big difference in speed.

What I want to find out.

1)
If the loops that GCC can't unroll are unrolled completely by the machine code generator, what difference does it make?

2)
Is it possible to generate code scaled to the instruction cache, to eliminate unnecessary machine code and cache misses?

3)
Does it make a difference whether you unroll loops in C code or not?

Disclaimer:

I'm no expert in PowerPC assembler, but I have some experience with assembler: coding for the MC68000 using inline assembler in BlitzBasic2 on the Amiga500, from my school years at "VK2" Data Technical, and from 2 years at technician school at Kongsberg, coding the Z80 and 6802/6804 chips as an educational tool. I also have some experience with PowerPC, trying to optimize things in Basilisk II. So I'm not a hardcore machine code head.

If you ask me about the name of some opcode, I will not know; it was too many years ago. But the fundamentals are the same: interrupts, machine code, things get stored, things get loaded and added, registers, flags.

The rules of the game.

Normally when you write inline assembler, you let the compiler pick the registers and handle saving and restoring them, but we are not going to do this.

This is how GCC normally generates code.

R0 is used for temporary storage.
R1 is reserved for the stack.
R2 is reserved; you're not allowed to use this one.
R3 to R10 can be used for arguments.
R3 is also used as return value.

With optimization level 3 in GCC (the O3 flag), GCC expects R4 to R10 to be unchanged.
So these are the basic rules of the game we need to keep in mind.

Another thing we need to think about is that memory we are going to execute needs to be flagged as executable, unlike on 680x0.

Before the memory is executed, we need to flush the instruction cache. If we don't, we can't be sure the machine code in the cache is correct.
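
As a rough illustration of those two steps, here is a sketch for a POSIX system such as Linux. On AmigaOS the allocation and cache calls are different, so mmap(), mprotect() and the GCC/Clang builtin __builtin___clear_cache() are used here purely as stand-ins, and make_executable_copy is a made-up name.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/* Copy machine code into freshly mapped memory, flag it as executable
   and flush the instruction cache for that range.
   Returns NULL on failure. POSIX-only sketch, not AmigaOS code. */
void *make_executable_copy(const uint32_t *code, size_t bytes)
{
    void *mem = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return NULL;

    memcpy(mem, code, bytes);

    /* Step 1: flag the memory as executable (and no longer writable). */
    if (mprotect(mem, bytes, PROT_READ | PROT_EXEC) != 0) {
        munmap(mem, bytes);
        return NULL;
    }

    /* Step 2: flush, so the CPU does not run stale instructions. */
    __builtin___clear_cache((char *)mem, (char *)mem + bytes);
    return mem;
}
```

The returned pointer would then be cast to a function pointer and called, which of course only makes sense on a machine whose CPU matches the machine code that was copied in.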

Next, I'm going to explain the procedure for finding the machine code and understanding what's going on, and the tools.

Well, there is GCC: write some C code, compile it, then disassemble it and look at the result.
To disassemble, I use objdump.

objdump -d a.out
or
objdump -d -S a.out

What you see is the relative address, then the machine code, and then the assembler name of the command.

To generate machine code, I need the machine code itself; the assembler name is only useful for understanding what the machine code does. Who remembers hex numbers?

Well, you can look up the IBM documentation, but it is not good if you want to write code. It explains what each assembler opcode does, but not which other opcodes go with it (there are no "see also" references).

So as a noob you get stuck quickly if you only try to read their documents.

Time for the code


So here we have a typical function, to convert x number of floats to ints.
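
A minimal sketch of such a function could look like this. The name floats_to_ints and the argument order are chosen just for this illustration and may differ from the code that was actually measured.

```c
#include <stddef.h>

/* Hypothetical name. Converts 'count' floats to ints with a plain C cast. */
void floats_to_ints(const float *src, int *dst, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++) {
        dst[i] = (int)src[i];   /* the cast truncates toward zero */
    }
}
```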








This is the same function, but we have unrolled it so that it takes fewer loop iterations to execute. Some people say this does not make any difference, that the C compiler does that anyway. Well, we'll see :-)
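
As a sketch of what unrolling by hand means in C, here is a version that does four conversions per loop iteration. The unroll factor of 4 and the name floats_to_ints_unrolled4 are assumptions made for this example, not necessarily what was benchmarked.

```c
#include <stddef.h>

/* Hypothetical 4x unrolled float to int conversion. */
void floats_to_ints_unrolled4(const float *src, int *dst, size_t count)
{
    size_t i = 0;

    /* Main loop: four conversions per iteration. */
    for (; i + 4 <= count; i += 4) {
        dst[i]     = (int)src[i];
        dst[i + 1] = (int)src[i + 1];
        dst[i + 2] = (int)src[i + 2];
        dst[i + 3] = (int)src[i + 3];
    }

    /* Tail: whatever is left when count is not a multiple of 4. */
    for (; i < count; i++) {
        dst[i] = (int)src[i];
    }
}
```

The loop counter is compared and incremented a quarter as often, and the four conversions in the body do not depend on each other.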












This is the typical machine code you get when you disassemble a C program compiled with no optimization. I have added two extra machine code instructions to load R3 and R4, because of the GCC O3 flag.

When compiled with the O3 flag, the stw instructions for r3 and r4 are stripped away by the compiler. What stw does is store the r3 and r4 registers on the stack.
















This is the actual code we need to run many times. Again there are two extra instructions I have added to move to the next source and destination address: addi r3,r3,4 and addi r4,r4,4.

I was shocked at the number of instructions needed to cast between float and int.
Converting between int and float should be avoided at all costs on the PowerPC.











This is a disassembled for loop. I can't copy the code as it is, because the offsets of the "blt-" and "b" opcodes need to be calculated.

The Set_jit_loop() function does this. It takes the number of loops and the amount of code that goes inside the loop, so that the code can be inserted between the two tables.
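
To show what calculating the offsets involves, here is a small sketch of how the two branch opcodes can be encoded once the byte distance to the target is known. This is not the real Set_jit_loop(); the function names are made up, and the branch prediction hint bit is simply left at zero.

```c
#include <stdint.h>

/* Encode "b target": primary opcode 18, 24-bit word offset, AA=0, LK=0.
   'offset' is the byte distance from the branch to its target
   and must be a multiple of 4. */
uint32_t ppc_encode_b(int32_t offset)
{
    return 0x48000000u | ((uint32_t)offset & 0x03FFFFFCu);
}

/* Encode "blt target" testing cr0: primary opcode 16, BO=12, BI=0,
   14-bit word offset, AA=0, LK=0. */
uint32_t ppc_encode_blt(int32_t offset)
{
    return 0x41800000u | ((uint32_t)offset & 0x0000FFFCu);
}

/* Returns 1 if 'offset' can be reached by the conditional branch,
   which only covers -32768..+32764 bytes, and 0 otherwise. */
int ppc_blt_offset_fits(int32_t offset)
{
    return offset >= -32768 && offset <= 32764 && (offset & 3) == 0;
}
```

For example, a "b" that jumps back 4 bytes encodes as 0x4BFFFFFC, and a "blt" that jumps forward 12 bytes encodes as 0x4180000C. Note that the conditional form only reaches about 32 KB in either direction, while the plain "b" reaches about 32 MB.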




















This function allocates memory for the code;
the result is:

"function start"
"float to int" * loops
"function end"
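
The layout above can be sketched in C as three tables of 32-bit opcodes copied into one buffer. This is an illustration, not the original function: build_unrolled and its parameters are hypothetical, and plain malloc() is used only to keep the example short, so the buffer would still have to be copied to executable memory before it could be run.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Build "start" + "body" * copies + "end" in one buffer of opcodes.
   Returns a malloc'ed buffer (the caller frees it) and stores the
   total number of 32-bit words in *out_words. NULL on failure. */
uint32_t *build_unrolled(const uint32_t *start, size_t start_words,
                         const uint32_t *body,  size_t body_words,
                         const uint32_t *end,   size_t end_words,
                         size_t copies, size_t *out_words)
{
    size_t total = start_words + body_words * copies + end_words;
    uint32_t *buf = malloc(total * sizeof *buf);
    uint32_t *p = buf;
    size_t i;

    if (buf == NULL)
        return NULL;

    memcpy(p, start, start_words * sizeof *p);      /* "function start" */
    p += start_words;

    for (i = 0; i < copies; i++) {                  /* "float to int" * loops */
        memcpy(p, body, body_words * sizeof *p);
        p += body_words;
    }

    memcpy(p, end, end_words * sizeof *p);          /* "function end" */

    *out_words = total;
    return buf;
}
```

With a body of two opcodes such as addi r3,r3,4 (0x38630004) and addi r4,r4,4 (0x38840004), and an end table holding only blr (0x4E800020), three copies give a buffer of start_words + 6 + 1 words.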





















This function also allocates memory for the code;
the result is:

"function start"
"for n=1 to num_loops_needed"
"float to int" * in_cache
"next"
"float to int" * don’t_fit
"function end"

So this one is a bit smarter: we try to find out the maximum number of float_to_int blocks we can fit into the instruction cache. What I found out is that it's not the cache size that is the limiting factor; the maximum length of the indirect jump is the real limit :-/
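
The arithmetic behind that layout can be sketched like this. The struct, the function name and the idea of passing a single byte limit are assumptions made for this example; the limit could be the instruction cache size or the maximum branch distance, whichever turns out to be smaller.

```c
#include <stddef.h>

/* Hypothetical plan for splitting 'total' conversions into a loop
   plus a remainder placed after the loop. */
typedef struct {
    size_t in_cache;   /* copies of the block inside the loop body */
    size_t num_loops;  /* how many times the loop has to run */
    size_t rest;       /* copies that don't fit, placed after the loop */
} unroll_plan;

unroll_plan plan_unroll(size_t total, size_t block_bytes, size_t limit_bytes)
{
    unroll_plan p = { 0, 0, 0 };
    size_t max_copies;

    if (total == 0 || block_bytes == 0)
        return p;

    max_copies = limit_bytes / block_bytes;
    if (max_copies == 0)
        max_copies = 1;                 /* always place at least one copy */

    p.in_cache  = total < max_copies ? total : max_copies;
    p.num_loops = total / p.in_cache;
    p.rest      = total % p.in_cache;
    return p;
}
```

With made-up numbers, 64000 conversions of 40 bytes each and a 32768 byte limit give 819 copies in the loop, 78 loop iterations and 118 copies left over after the loop.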






















Just free the memory after we are done








The main code that runs all the tests










































Now for the results.















Trying to fit code inside the instruction cache made little difference; it is mostly a waste of time trying to do it.

The assembler-optimized code made a big difference compared to C code compiled without optimization flags.

Compiling standard C code with the O3 flag made a big difference; there is almost no difference between the assembler-optimized code and the standard C code.

However, look at the unrolled C function: it was slower without optimization, but with O3 it just beats everything.

I guess it is because GCC is able to take advantage of out-of-order execution on PowerPC. GCC cannot unroll the normal C code, because the number of loops is not a static constant but a variable.

So the conclusion is that what you write and how you compile the code make the most difference. Beating the C compiler is hard, even if you are an expert.

Wednesday, January 14, 2015

AmigaOS4.1 boot process UBOOT.

I was thinking it would be nice to write a useful blog post about the boot process of AmigaOS4.1.
Many people have problems understanding the boot process, whether they come from the Classic Amiga (500/1200/4000) or are used to a PC that has Windows preinstalled.

AmigaOS4.1 can boot from three different firmware types (on PC there is BIOS or UEFI; on PowerPC there are different firmwares).

On Pegasus II, there is OpenFirmware.
On AmigaONE-X1000 there is CFE.
On AmigaONE-XE/SE/Mini there is UBOOT.
On Sam440/460 (AmigaONE-500), there is UBOOT.

The most common firmware is UBOOT, and this is the one most people have problems with. CFE and OpenFirmware are easier to use.



The UBOOT boot process.

UBOOT checks the hardware and scans the SATA, IDE and USB buses for devices.
Once it knows what is connected, it goes to the next step; this is where things can go wrong.

For Linux the boot process is simple: the UBOOT diskboot command loads the kernel from a PPCBoot image on the device. All you do is define the controller number and the partition number, and it finds it.

For AmigaOS this is a bit more complicated. BOOTA is not fed the controller or the partition number; instead, the boot process depends on the BOOT1 and BOOT2 variables. If the device defined in BOOT1 is not found, it goes on to the next one, BOOT2.

This sounds simple, but here is the problem: BOOTA does not know what is a CDROM or an HD unless you have defined it. Therefore you need a UBOOT variable for each controller that defines which connector is wired to a CDROM, which connector is not in use, and which connector is wired to a hard drive.

For the AmigaONE-XE/SE IDE controller this is defined by the UBOOT variable a1ide_conf. For other controllers the name of the variable is different, but the suffix is always _conf.

The variable defines what is connected to the A1 IDE controller:
0 is not connected
1 is HD
2 is CDROM

Once the BOOTA command knows what is a CDROM and what is an HD, the chance of success increases.

The SLB2 is loaded from the RDB (the Amiga partition table). The SLB will look for kickstart files on the first bootable partition it finds on the hard drive.

So if the kickstart modules are not found, you have done something wrong when partitioning the hard drive and setting up boot priorities.

If everything works out, the kickstart modules are loaded and the AmigaOS4 kickstart takes over.

For the AmigaOS4 kickstart to continue booting, kickstart drivers must be loaded that support the controllers you have connected, and it also needs to know that things are connected. Once it has found the CDROMs and hard drives, it scans the partition tables to boot the partition or CDROM with the highest boot priority.

Here again you might have done something wrong: if you set the boot priority of a partition too high, the kickstart will never boot from the CDROM. It can get really tricky to reset the boot priorities. Luckily the AmigaOS kickstart was designed as the firmware for the Classic Amiga, and it has not lost the ability to let the user select the boot device.

By pressing the “Scroll Lock” key on the keyboard you can get into the boot menu of the kickstart, or you can hold down both mouse buttons.

Monday, January 12, 2015

Mplayer 6.4 userguide for AmigaOS4.1

LiveForIt-Mplayer 6.x requires: AmigaOS4.1 Final and Radeon™ HD v2.4 drivers from Hans/A-EON, to take full advantage of COMP, COMP_YUV and COMP_YUV2.

News_Release_RadeonHD24.pdf
www.vesalia.de
amigakit

If you do not plan to upgrade just yet, stick with the older LiveForIt-Mplayer version 5.5, or you will be forced to use cgx_wpa, SDL or p96_pip.

Some useful information that only relates to LiveForIt-Mplayer 6.4 for AmigaOS4.1

Video outputs mplayer for AmigaOS4.1 supports:

Comp
This video output was written by Kjetil Hvalstrand and is based on the cgx_wpa output, but was changed to use composition instead of WritePixelArray(). This gives the video output scalable windows and a full-screen mode where the video is stretched to fit the screen mode. The video output converts yuv420/yv12 bitmaps into 32-bit ARGB bitmaps using the CPU, just like CGX_WPA.
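
To give a feeling for the CPU work this conversion means, here is a sketch of converting a single pixel from YUV to 32-bit ARGB. It is not the code used in mplayer; it assumes the common BT.601 "video range" coefficients in integer form, and yuv_to_argb and clamp_u8 are made-up names.

```c
#include <stdint.h>

/* Clamp an int to the 0..255 range of one color channel. */
static uint8_t clamp_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Convert one YUV sample to ARGB with alpha set to 0xFF.
   Assumes BT.601 video range (Y 16..235, U/V centered on 128),
   with the coefficients scaled by 256. */
uint32_t yuv_to_argb(uint8_t y, uint8_t u, uint8_t v)
{
    int c = (int)y - 16;
    int d = (int)u - 128;
    int e = (int)v - 128;

    int r = (298 * c + 409 * e + 128) / 256;
    int g = (298 * c - 100 * d - 208 * e + 128) / 256;
    int b = (298 * c + 516 * d + 128) / 256;

    return 0xFF000000u
         | ((uint32_t)clamp_u8(r) << 16)
         | ((uint32_t)clamp_u8(g) << 8)
         |  (uint32_t)clamp_u8(b);
}
```

In yuv420 each U and V sample is shared by a 2x2 block of pixels, so a full conversion runs something like this for every pixel of every frame, which is why skipping the conversion is attractive.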

Comp_yuv
This is based on Comp, but was rewritten to use the new yuv420p color space that is now supported by the Radeon HD 2.4 drivers for AmigaOS4.1, so this video output does not need to convert into ARGB format. In addition, DRI support was added for codecs that support it, and video transfer to the graphics card is accelerated using DMA from the new graphics library 54.153. This video output supports window scaling and full-screen mode.

Comp_yuv2
Same as comp_yuv, but mplayer does not wait for vsync to complete. The window refresh has been moved into its own thread, so mplayer can continue doing something else while that thread waits for vsync.

Cgx_wpa
This video output was originally written by DET Nicolas and Fabian Coeurjoly to use CyberGraphicsX on MorphOS and AmigaOS. AmigaOS4.x uses Picasso96, so this is a heavily modified version of the original, though most of the code is the same. The video output supports window mode, but you can't resize the window; it also supports full-screen, but with no scaling to fit the screen.

P96_PIP
This is the good old Picasso96 overlay video output from Jorge Strohmayer. Originally it did not support double buffering; I added that. This video output supports window mode and full screen.
The video output does not support DRI or DMA transfer.

PIP
PIP is an experimental video output from Jorge Strohmayer. Full-screen mode is not working at the moment, and PIP is therefore not included in the mplayer build by default. Some optional color spaces are supported.

SDL
SDL (Simple DirectMedia Layer) is a non-native GUI system that sits on top of graphics.library.
SDL should support overlay, but this is not implemented on AmigaOS4.1, so SDL is slow at rendering graphics. The SDL video output supports CPU-scaled video in window mode, but not in full screen.


Using video outputs:

Options for comp/comp_yuv/comp_yuv2.


Mplayer mymovie.avi -vo comp_yuv2:help

Shows a list of video output options.

Mplayer mymovie.avi -vo comp_yuv2:monitor=0

This opens mplayer on monitor 0 (the first monitor) when going full screen; there is no need for screen promotion in Workbench.

Mplayer mymovie.avi -vo comp_yuv2:monitor=1

This opens mplayer on monitor 1 (the second monitor).

Mplayer mymovie.avi -vo comp_yuv2:nodma

Disables DMA; this can be used for debugging.
The image is now written directly to VRAM.
(If the codec supports DRI, the video output will be black when DMA is disabled.)

Mplayer mymovie.avi -vo comp_yuv2:nodri

Disables DRI rendering and forces mplayer to draw images using slices.
(Many codecs do not support DRI.)

Mplayer mymovie.avi -vo comp_yuv2:pubscreen=dopus.1

This should open the mplayer window on the public screen dopus.1.


Options for P96_PIP 

This one does not have any options.


Most common options for PIP
 (not compiled into mplayer by default).

Mplayer mymovie.avi -vo pip:mode=0

Displays video in YUV410 format (default).

Mplayer mymovie.avi -vo pip:mode=1

Displays video in YUV420 format.

Mplayer mymovie.avi -vo pip:mode=2

Displays video in YUV422 format.


Using AREXX with mplayer:

Start mplayer from shell.

Now you can send ARexx commands like this.

Rx Arexx/Volume100.rx

This script sets mplayer's volume to 100%.

Rx Arexx/GetTimeLength.rx

Gets the length of the film currently playing.

Rx Arexx/GetPercentPos.Rx

Gets the position in the film as a percentage.

Rx Arexx/Help.rx

Gets a list of ARexx commands from mplayer.


Q+A

What is DRI?
DRI is short for Direct Rendering Interface. When a codec supports DRI, the codec asks the video output for an image buffer instead of allocating a buffer of its own. As a result, mplayer doesn't need to copy image slices from the codec buffer to the video output.

What is DMA?

DMA is short for Direct Memory Access. DMA allows hardware to copy or access memory without using the CPU. You can find more about it on Wikipedia:
http://en.wikipedia.org/wiki/Direct_memory_access

What is VSync?

VSync lets the program sync to the refresh rate, so you don't get a half-drawn image displayed.

LCD monitors typically use a 60Hz refresh rate; older CRT monitors sometimes went up to 100Hz.
60Hz equals 60 frames per second; most videos are recorded at 25 FPS (frames per second).

LCD monitors have a static picture, and the image is changed at 60Hz, while the older CRT has to paint the pixels repeatedly. This can make the screen flicker (you can see the pixels turn on and off) if the refresh rate is 60Hz or lower; with a higher refresh rate you trick the eye into not seeing it.




For general questions about mplayer, you may find the Linux man pages on MplayerHQ useful:

http://www.mplayerhq.hu/DOCS/man/en/mplayer.1.html