Re: gains from vectorization




> vectorization certainly won't help every kind of code, so it is important
> to understand what part of the code is consuming most CPU time.
> (compile the code with -pg, run it, and examine the output of gprof
> executable_name gmon.out).



The flat profile generated by gprof is given below. The total
computation time on my system was about 25 minutes (1500 sec), yet the
profiler shows that the intrinsic function matmul alone took 2899.24
seconds. How do I interpret that? Also, going by the percent-time
column, the intrinsic matmul takes the largest share of the time. Are
there matrix multiplication routines that are faster than the intrinsic
one? Some of the multiplications in my code involve diagonal matrices.
Would it improve the efficiency of the code if I wrote a separate
routine for multiplications involving diagonal matrices?
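
To make the diagonal-matrix question concrete: a product with
D = diag(d) only scales the rows (D*A) or the columns (A*D) of A, so a
dedicated routine needs about n^2 multiplications instead of the
roughly n^3 multiply-adds of a general product. Below is a minimal
sketch of what such a routine might look like. It assumes default
(single-precision) reals, which the r4 suffix in the profile seems to
suggest, and the names diag_left_mul and diag_right_mul are
hypothetical, not taken from my code or from any library.

```fortran
! Sketch only: diag_left_mul and diag_right_mul are hypothetical names.
module diag_mult
  implicit none
contains

  ! C = D*A with D = diag(d): row i of A is scaled by d(i).
  ! Assumes size(d) == size(a,1).
  function diag_left_mul(d, a) result(c)
    real, intent(in) :: d(:), a(:,:)
    real :: c(size(a,1), size(a,2))
    integer :: j
    do j = 1, size(a,2)
      c(:,j) = d * a(:,j)        ! elementwise product down column j
    end do
  end function diag_left_mul

  ! C = A*D with D = diag(d): column j of A is scaled by d(j).
  ! Assumes size(d) == size(a,2).
  function diag_right_mul(a, d) result(c)
    real, intent(in) :: a(:,:), d(:)
    real :: c(size(a,1), size(a,2))
    integer :: j
    do j = 1, size(a,2)
      c(:,j) = a(:,j) * d(j)
    end do
  end function diag_right_mul

end module diag_mult

program check_diag
  use diag_mult
  implicit none
  integer, parameter :: n = 4
  real :: a(n,n), d(n), dfull(n,n)
  integer :: i

  call random_number(a)
  call random_number(d)

  ! Build the full diagonal matrix only to compare against matmul.
  dfull = 0.0
  do i = 1, n
    dfull(i,i) = d(i)
  end do

  print *, 'max diff, D*A:', maxval(abs(matmul(dfull, a) - diag_left_mul(d, a)))
  print *, 'max diff, A*D:', maxval(abs(matmul(a, dfull) - diag_right_mul(a, d)))
end program check_diag
```

The small driver program only checks the two routines against matmul
on a random 4x4 case. Whether this gives a noticeable speedup depends
on how many of the matmul calls actually involve diagonal matrices.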

---------------------
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ks/call  Ks/call  name
 43.59   2899.24  2899.24                             _g95_matmul22_r4r4
  9.77   3549.37   650.13   300000     0.00     0.00  matrix_MP_ludcmp_
  8.14   4091.03   541.66                             _g95_spread
  6.04   4492.47   401.44        1     0.40     1.96  MAIN_
  5.91   4885.51   393.04                             _g95_section_array
  5.02   5219.56   334.05  7200000     0.00     0.00  nrutil_MP_outerprod_r__
  2.80   5405.94   186.38  7200000     0.00     0.00  matrix_MP_lubksb_
  2.75   5588.73   182.80                             _g95_transpose
  2.21   5735.72   146.98                             _g95_dot_product_r4
  2.04   5871.72   136.00   150000     0.00     0.00  mcmc_MP_gibbsnorm_
  1.41   5965.81    94.09   300000     0.00     0.00  matrix_MP_pdsymminv_
  1.32   6053.45    87.64                             _g95_bump_element
  1.11   6127.09    73.64 80750000     0.00     0.00  random_MP_random_normal__
  0.85   6183.61    56.53                             _g95_random_4
  0.85   6240.11    56.50                             xorshf96
  0.79   6292.59    52.47  5691750     0.00     0.00  nrutil_MP_swap_rv__
  0.52   6326.93    34.34                             insert_mem
  0.51   6360.89    33.96                             malloc
  0.37   6385.62    24.73   300000     0.00     0.00  matrix_MP_identity_
  0.32   6407.11    21.49                             free
  0.32   6428.43    21.32                             _g95_maxvald1_r4
  0.26   6446.00    17.57                             _g95_maxloc_r4
  0.25   6462.96    16.96                             _g95_init_assumed_shape
  0.23   6478.44    15.48                             get_user_mem
  0.23   6493.64    15.20                             section_size
  0.21   6507.59    13.95                             compare
  0.21   6521.25    13.66                             delete_treap
  0.18   6533.13    11.88                             _g95_rand
  0.18   6544.86    11.73                             free_user_mem
  0.18   6556.50    11.64                             _g95_init_multipliers
  0.16   6567.10    10.60                             _g95_allocate_array
  0.15   6577.04     9.94                             _g95_array_from_section
  0.14   6586.56     9.52                             _g95_deallocate_array
  0.11   6593.94     7.38                             _g95_xorshift128
  0.11   6601.30     7.36  7200000     0.00     0.00  nrutil_MP_imaxloc_r__
  0.08   6606.59     5.29                             initialize_memory
  0.07   6611.44     4.85                             _g95_size
  0.07   6615.90     4.47                             delete_root
  0.06   6619.65     3.75                             largebin_index
  0.04   6622.64     2.98                             _g95_write_real
  0.04   6625.48     2.85                             put_field
  0.03   6627.80     2.31                             _g95_temp_array
  0.03   6629.81     2.01                             malloc_consolidate
  0.02   6631.45     1.65                             _g95_temp_alloc
  0.02   6632.94     1.49                             huge
  0.02   6634.34     1.39                             rotate_left
  0.02   6635.49     1.16                             _g95_huge_4
  0.01   6636.27     0.77                             _g95_temp_free
  0.01   6637.01     0.74                             get_field
  0.01   6637.70     0.69   100000     0.00     0.00  matrix_MP_normsquare_
  0.01   6638.39     0.69                             _g95_list_formatted_write
  0.01   6639.03     0.64   215596     0.00     0.00  ignlgi_
  0.01   6639.67     0.64                             _g95_write_block
  0.01   6640.31     0.64                             size_record_buffer
  0.01   6640.94     0.63                             _g95_any_4
  0.01   6641.50     0.56                             _g95_bump_element_dim
  0.01   6642.02     0.52   100000     0.00     0.00  sgamma_
  0.01   6642.53     0.51                             data_transfer_init
  0.01   6643.03     0.50   100000     0.00     0.00  snorm_
  0.01   6643.52     0.49        7     0.00     0.00  io_MP_writebuff_
  0.01   6644.01     0.48  7500000     0.00     0.00  nrutil_MP_assert_eq3__
  0.01   6644.48     0.47                             matrix_MP_choldc_
  0.01   6644.91     0.44                             write_fixed
  0.01   6645.32     0.41                             write_separator
  0.01   6645.70     0.38                             _g95_is_internal_unit
  0.01   6646.07     0.36                             random_MP_random_gamma__
  0.01   6646.41     0.34                             _g95_find_unit
  0.00   6646.70     0.29                             _g95_write_integer
  0.00   6646.99     0.29                             write_free
  0.00   6647.25     0.26   100000     0.00     0.00  gengam_
  0.00   6647.51     0.26                             _g95_salloc_w
  0.00   6647.72     0.21                             _g95_transfer_real
  0.00   6647.92     0.20                             _g95_get_float_flavor
  0.00   6648.11     0.19                             start_transfer
  0.00   6648.30     0.19                             write_formatted_sequential
  0.00   6648.48     0.18                             _g95_free_fnodes
  0.00   6648.65     0.17                             _g95_st_write
  0.00   6648.81     0.16                             _g95_extract_mint
  0.00   6648.97     0.16                             fd_flush
  0.00   6649.13     0.16                             rotate_right
  0.00   6649.29     0.16                             write_record
  0.00   6649.42     0.13                             _g95_get_ioparm
  0.00   6649.55     0.13                             _g95_get_sign
  0.00   6649.68     0.13                             _g95_st_write_done
  0.00   6649.80     0.12   215596     0.00     0.00  rgnqsd_
  0.00   6649.92     0.12                             _g95_get_unit
  0.00   6650.04     0.12                             _g95_sfree
  0.00   6650.15     0.11                             _g95_library_end
  0.00   6650.26     0.11                             free_fnode
  0.00   6650.36     0.10   215596     0.00     0.00  ranf_
  0.00   6650.46     0.10                             init_write
  0.00   6650.56     0.10                             nrutil_MP_outerprod_d__
  0.00   6650.65     0.09                             writen
  0.00   6650.73     0.08   215661     0.00     0.00  __g95_master_0__
  0.00   6650.81     0.07                             nrutil_MP_imaxloc_i__
  0.00   6650.88     0.07                             _g95_library_start
  0.00   6650.94     0.07                             _g95_transfer_integer
  0.00   6651.01     0.06                             recursive_io
  0.00   6651.06     0.05                             nrutil_MP_ifirstloc_
  0.00   6651.10     0.04   215597     0.00     0.00  __g95_master_0__
  0.00   6651.14     0.04                             itoa_4
  0.00   6651.18     0.04                             matrix_MP_printmatrix_
  0.00   6651.21     0.04   215629     0.00     0.00  getcgn_
  0.00   6651.24     0.04                             nrutil_MP_assert_eq4__
  0.00   6651.27     0.03                             matrix_MP_diag_
  0.00   6651.30     0.03       32     0.00     0.00  setcgn_
  0.00   6651.32     0.02   215630     0.00     0.00  __g95_master_0__
  0.00   6651.34     0.02    15411     0.00     0.00  sexpo_
  0.00   6651.35     0.01        3     0.00     0.00  io_MP_readbuff_
  0.00   6651.36     0.01                             _g95_next_list_char
  0.00   6651.37     0.01                             _g95_sign_r4
  0.00   6651.38     0.01                             finalize_transfer
  0.00   6651.38     0.00   215629     0.00     0.00  qrgnin_
  0.00   6651.38     0.00       62     0.00     0.00  mltmod_
  0.00   6651.38     0.00       32     0.00     0.00  initgn_
  0.00   6651.38     0.00        1     0.00     0.00  inrgcm_
  0.00   6651.38     0.00        1     0.00     0.00  qrgnsn_
  0.00   6651.38     0.00        1     0.00     0.00  setall_
