Simplified APRA MySuper Heatmaps with Python


In Australia, retirement funds are known as superannuation or Super for short. During your working life, you and your employer make contributions into a fund, generally of your choice.

Around a million working Australians cannot choose their fund because of enterprise bargaining arrangements or workplace determination.[Final Report] For the remaining ~14 million, we can jump ship and are free to move to the best fund available with the lowest fees.

2021 Update: The ATO have launched the YourSuper comparison comparison site https://www.ato.gov.au/YourSuper-Comparison-Tool/

According to the report, about 40% of Aussies have multiple accounts, which increases your fees! At one point I had over 3 accounts, as I moved companies I just ticked the default fund provided at each location.

There are 3 types of Super:

  • self-managed superannuation trusts (SMSFs) (regulated by the ATO),
  • exempt public sector superannuation schemes (regulated by Commonwealth, state or territory legislation)
  • and APRA-regulated funds regulated by the Australian Prudential Regulation Authority (APRA) under the Superannuation Industry (Supervision) Act

The APRA funds, and their MySuper products are what I will cover, as it contains the bulk of the funds of which one can choose from.

MySuper products are Low-cost, simple superannuation products for members who make no active choice about their superannuation.[Final Report]

From the Royal Commission, the report recommended that APRA be given a new role which was more focused on member outcomes, and proposed that APRA would oversee and promote the overall efficiency and transparency of the superannuation system, to the ultimate benefit of members, i.e. Us and our money.

APRA came out with their MySuper Product Heatmap, which is a good start but good but is overwhelming to understand.

While APRA recognises that net return has value as a headline indicator, it is important to look separately at the drivers of net returns – namely, net investment returns and administration fees – to draw further insights.[APRA Data Insights]

What I will cover here is some slicing and dicing of APRA’s data in Python and producing some simplified heatmaps. I have included the final heatmaps at the start, then I walkthrough the steps in Python.

APRA MySuper heatmaps in Excel
APRA MySuper Product Heatmap in Microsoft Excel

Performance Simplified

Net Investment Return – Heatmap of Top 15 Funds

Simplified Performance Heatmap using Python Seaborn
Simplified Performance Heatmap of Top 15 MySuper Funds Grouped by Risk Tolerance (L, M, H) and their Ranking

Total Fees per Investment Size – Simplified

$10,000 Investment

Simplified Total Fee Heatmap of Top 15 MySuper Funds Grouped by Risk Tolerance (L, M, H) and their Performance Ranking - $10k Investment
Simplified Total Fee Heatmap of Top 15 MySuper Funds Grouped by Risk Tolerance (L, M, H) and their Performance Ranking – $10k Investment

$25,000 Investment

Simplified Total Fee Heatmap of Top 15 MySuper Funds Grouped by Risk Tolerance (L, M, H) and their Performance Ranking - $25k Investment
Simplified Total Fee Heatmap of Top 15 MySuper Funds Grouped by Risk Tolerance (L, M, H) and their Performance Ranking – $25k Investment

$50,000 Investment

Simplified Total Fee Heatmap of Top 15 MySuper Funds Grouped by Risk Tolerance (L, M, H) and their Performance Ranking - $50k Investment
Simplified Total Fee Heatmap of Top 15 MySuper Funds Grouped by Risk Tolerance (L, M, H) and their Performance Ranking – $50k Investment

$100,000 Investment

Simplified Total Fee Heatmap of Top 15 MySuper Funds Grouped by Risk Tolerance (L, M, H) and their Performance Ranking - $100k Investment
Simplified Total Fee Heatmap of Top 15 MySuper Funds Grouped by Risk Tolerance (L, M, H) and their Performance Ranking – $100k Investment

$250,000 Investment

Simplified Total Fee Heatmap of Top 15 MySuper Funds Grouped by Risk Tolerance (L, M, H) and their Performance Ranking - $250k Investment
Simplified Total Fee Heatmap of Top 15 MySuper Funds Grouped by Risk Tolerance (L, M, H) and their Performance Ranking – $250k Investment

Python Walkthrough

Read in the Data

APRA provides both a Microsoft Excel XLSX file and a comma-separated CSV file.

fp_1 = r'C:UsersbernoOneDriveLearningPythonScriptsdataMySuper Product Heatmap.csv'
skiprows_1 = 0

# Read in csv file
df_raw = pd.read_csv(Path(fp_1), skiprows=skiprows_1)

Let’s take a quick look at the data frame to check that everything was read in correctly or not.

Read in of the APRA CSV - Looking good
Original APRA data read into a Python Pandas Dataframe

Next is to see all the column names and a count of missing values. I am only interested in the total fees and funds that have a 5-year net investment return.

fn_check_missing_data(df_raw)
rse_net_rollover_ratio_3_year_average                        320
rse_adjusted_total_accounts_growth_rate_3_year_average       320
single_strategy_lifecycle_indicator                          320
rse_net_assets_000                                           320
proportion_of_total_assets_in_mysuper                        320
rse_total_accounts                                           320
rse_net_cash_flow_ratio_3_year_average                       320
5_year_net_investment_return_nir_p_a                         154
5_year_nir_relative_to_simple_reference_portfolio_p_a        154
5_year_nir_relative_to_listed_saa_benchmark_portfolio_p_a    154
5_year_net_return_50_000_rep_member_p_a                      154
3_year_nir_relative_to_listed_saa_benchmark_portfolio_p_a    138
strategic_growth_asset_allocation                            138
3_year_net_investment_return_nir_p_a                         138
3_year_nir_relative_to_simple_reference_portfolio_p_a        138
3_year_net_return_50_000_rep_member_p_a                      138
lifecycle_stage_name                                          97
administration_fees_disclosed_10_000_account_balance          35
administration_fees_disclosed_25_000_account_balance          35
administration_fees_disclosed_50_000_account_balance          35
administration_fees_disclosed_100_000_account_balance         35
administration_fees_disclosed_250_000_account_balance         35
total_fees_disclosed_10_000_account_balance                   35
total_fees_disclosed_25_000_account_balance                   35
total_fees_disclosed_50_000_account_balance                   35
total_fees_disclosed_100_000_account_balance                  35
total_fees_disclosed_250_000_account_balance                  35
mysuper_product_name                                           0
rse_name                                                       0
rse_licensee                                                   0

I will now assign a new dataframe to those columns I wish to keep.

# Define columns of interest
cols_1 = ['rse_name', 'mysuper_product_name',
                   'strategic_growth_asset_allocation', 
                   '5_year_net_investment_return_nir_p_a', 
                    'total_fees_disclosed_10_000_account_balance',
                    'total_fees_disclosed_25_000_account_balance',
                    'total_fees_disclosed_50_000_account_balance',
                    'total_fees_disclosed_100_000_account_balance',
                    'total_fees_disclosed_250_000_account_balance']

# Assign dataframe with only columns of interest
df = df_raw.loc[:,cols_1]

Another quick review of the new dataframe.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 417 entries, 0 to 416
Data columns (total 9 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   rse_name                                      417 non-null    object 
 1   mysuper_product_name                          417 non-null    object 
 2   strategic_growth_asset_allocation             279 non-null    float64
 3   5_year_net_investment_return_nir_p_a          263 non-null    float64
 4   total_fees_disclosed_10_000_account_balance   382 non-null    float64
 5   total_fees_disclosed_25_000_account_balance   382 non-null    float64
 6   total_fees_disclosed_50_000_account_balance   382 non-null    float64
 7   total_fees_disclosed_100_000_account_balance  382 non-null    float64
 8   total_fees_disclosed_250_000_account_balance  382 non-null    float64
dtypes: float64(7), object(2)
memory usage: 29.4+ KB

Of the 417 funds, there are only 263 which have reported results for at least the last 5-years, so I will drop those funds that haven’t reported.

# Drop empty rows
df.dropna(axis=0, how='any', inplace=True)
df.info()

Let’s have a quick glance over two columns to see if things make sense.

df[['5_year_net_investment_return_nir_p_a', 'total_fees_disclosed_100_000_account_balance']].describe()
  5 year NIR p.a. Total fees 100k balance
count 263.0000 263.0000
mean 0.0729 0.0101
std 0.0152 0.0023
min 0.0375 0.0050
25% 0.0624 0.0088
50% 0.0768 0.0098
75% 0.0834 0.0113
max 0.0983 0.0182
Python pandas describe on 2 columns. Values are expressed as decimals hence you need to multiple by 100 to get in %

The 5-year return p.a. ranges from 3.75% to 9.83%. We want to be near the top end of this stat. The total fees on a $100k balance range from 0.5% to an eye-watering 1.82% (that is a whopping +300% difference). We want to be near the bottom end of this stat!

All figures appear reasonable, no negative percentages (which could happen in the returns but never in the fees)

File Specific Functions

Refer to the code on GitHub. I don’t think it warrants dumping the slew of code here.

Cleaning

The original dataset was clean, so no cleaning required. I’ve added a few columns to the dataframe for later use. I included some comments for each.

# General cleanups

# Replace a really long name
df['mysuper_product_name'] = df['mysuper_product_name'].str.replace(
    'Goldman Sachs & JBWere Superannuation Fund_MySuper Product', 'MySuper Product')

# Add a column to group the growth asset allocations into Low, Medium, High
# Change from categorical type to string for use in charts
df['growth'] = pd.qcut(df['strategic_growth_asset_allocation'], q=3, 
                                             labels=['L', 'M', 'H']).astype('str')

# Assign a ranking to the 5-year NIR for later comparisons
df['5_yr_nir_ranking'] = df['5_year_net_investment_return_nir_p_a'].rank(method='min',                                        ascending=False).astype('int')

# Assign a ranking column to string type for later use in charts
df['5_yr_nir_rank_str'] = df['5_yr_nir_ranking'].astype('str')

# Add a concatenated column for use in the heatmap
df['MySuper_Details'] = df['rse_name'] + ' - ' +                                                  df['mysuper_product_name'] + ' - '
                       + df['growth'] + ' (NIR Rank ' +                                          df['5_yr_nir_rank_str'] + ')'

To get an idea of the returns distribution, the value_counts will do the trick.

# Check the breakdown of the NIR
df['5_year_net_investment_return_nir_p_a'].value_counts(normalize=True, bins=3)
(0.078, 0.0983]    0.4639
(0.0578, 0.078]    0.3460
(0.0364, 0.0578]   0.1901
Name: 5_year_net_investment_return_nir_p_a, dtype: float64

About 19% has returns less than 5.8% with the remaining 81% returning over 5.8% p.a. over the past 5-years.

I set the risk tolerance (growth column) into 3 equal buckets according to their strategic growth asset allocation. We could refine this, but I think it provides a reasonable split.

Analysis

Now comes the most fun bits, seeing what we can garner from the data.

For starters, I will run some checks and comparisons with the source data material to ensure everything is jiving well.

Scatterplot – NIR vs Fees for a set Growth Allocation

APRA prepared a Data Insights file with graphs and commentary to supplement the heatmap data. On page 9 of the pdf, Figure 3 below shows a negative relationship between net returns and total fees over 5 years with an allocation to growth assets greater than 60%. Let’s see if we can create something similar.

APRA Data Insights Figure 3 - Downward trend of returns for increased fees
APRA Data Insights Figure 3 – Downward trend of returns for increased fees

First, I will set a strategic growth asset allocation of 0.6 (60%) to match that in ARPA Figure 3. Then I will assign a dataframe based on this condition and pass it into a scatterplot with code sourced from StackOverflow how-to-overplot-a-line-on-a-scatter-plot-in-python and format-y-axis-as-percent.

Our version of APRA Data Insights Figure 3 - Downward trend of returns for increased fees
Our version of APRA Data Insights Figure 3 – Downward trend of returns for increased fees

The result is like Figure 3 keeping in mind the dates of the data vary from 10 December vs. 30 June. The trend is heading downward, which goes to show that the higher the fees the lower the $ return.

Heatmap – Variable Correlations

APRA Figure 4 (not shown here) showed a weak correlation between returns versus fees (correlation was -0.1). This is a good sign, which means paying higher fees has a weak link to your expected returns, or find a low fee fund!

To see if the data has any other potential relationships (or correlations), we can either run a df.corr() or produce this cool looking half masked heatmap that I discovered by Manu Kalia.

Half Masked Variable Correlations Heatmap using Seaborn
Half Masked Variable Correlations Heatmap using Seaborn

While the heatmap is aesthetically pleasing, it contains a lot of useless info for this case. So let’s just feed in the returns columns into the plotting function for evaluation.

 Net Investment Returns Variable Correlations Heatmap using Seaborn
Net Investment Returns Variable Correlations Heatmap using Seaborn

There appears to be a strong correlation between returns and the growth allocation as suspected. There is little to no correlation with returns and fees as noted by APRA.

Ranking of Top Funds Based on Return Performance

In the cleaning section, I added a column to rank the funds based on the highest 5-year return as the top-ranking using rank. But a scan through the output doesn’t tell you much as one fund is showing across the board. So grabbing the top 15 will be useless as it will only show one fund cause they provided multiple investment options with varying growth asset allocations.

Pandas ranking method ranks the records the same when their values are the same. The first two records in the table both have a rank of 1 cause their 5-year returns are the same value. Refer to the pandas docs for more details.

Net Investment Returns Sorting - Only 1 Fund across the board!
Net Investment Returns Sorting – Only 1 Fund across the board!

Getting the top 15 using nlargest again captures the same fund multiple times. What we want to achieve is a single fund per risk tolerance as defined as High, Medium, Low and output their best respective return/rating. For example, we want to get AON Mater Trust with High growth as a single line with their highest return of 9.83% then followed by QSuper with Medium growth with a return of 9.77%, and so forth.

apra_df_cols_nlargest
Net Investment Returns using nlargest – Still not very useful

This is a perfect candidate for pandas pivot_table. Pivot the funds to group them and extract the desired values for returns, fees and ranking.

# Assign some columns to display. Adjust to suit
cols = ['rse_name', 
       '5_year_net_investment_return_nir_p_a', 'growth', '5_yr_nir_ranking', 'MySuper_Details']

# Assign a dictionary to be passed into the pivot_table aggfunc. The keys are the columns to 
# aggregate and the values are the aggregate function or list of functions. 


aggfunc_dict = {'5_year_net_investment_return_nir_p_a': 'max',
                # get the max return of a fund grouping   

                '5_yr_nir_ranking': 'min', 
                # get the min ranking as lower the number the better

                'total_fees_disclosed_10_000_account_balance': 'max',
                'total_fees_disclosed_25_000_account_balance': 'max',
                'total_fees_disclosed_50_000_account_balance': 'max',
                'total_fees_disclosed_100_000_account_balance': 'max',
                'total_fees_disclosed_250_000_account_balance': 'max', 
                # get the max fees of a fund grouping

                'MySuper_Details': 'first'}
                # get the first name in the group.

# Assign some index columns to pivot on 
index=['rse_name', 'mysuper_product_name', 'growth']

# Assign the column name for which to get the top N funds
performance_column = '5_year_net_investment_return_nir_p_a'

# run a pivot on the dataframe with and selecting the top N funds
pd.pivot_table(df, values=aggfunc_dict.keys(), index=index, aggfunc=aggfunc_dict
              ).nlargest(top_funds_to_show, performance_column
                        ).reset_index()[cols]
apra_df_top_performers
Net Investment Returns Pivoted – Getting better but what’s up with those mismatches

Good grief! Why are some rankings not matching? This was a bit puzzling. The purpose of the combined MySuper_Details column is the label text for the heatmap.

During pivoting the MySuper_Details function I set it to first, which seems to make sense as the first row should be towards the top.

I tried to find a clear list of functions that we can pass into the aggfunc but wasn’t able to come up with a complete list, so I had to open up the source file and have a peruse. I found the following on lines 1419 to 1424 of the groupby.py documentation: sum add np.sum prod np.prod min np.min np.max first last.

The first function sends values to numpy and returns the first value at position [0]. Based on this, I concluded that the best-ranked fund was not always at the top during the pivot operation. Sorting the dataframe in the function fixed the issue.

Net Investment Returns Cleaned up and ready for use in a Heatmap
Net Investment Returns Cleaned up and ready for use in a Heatmap

Note that I left the rankings as they appear in the original dataframe. I assume it could be reset to 1, 2, 3 etc. if that is what you desire.

As a last check, I like to confirm results from the original dataframe to ensure I hadn’t messed up my pivot tables and cleaning operations. idxmax() should do the trick.

df.loc[df['5_year_net_investment_return_nir_p_a'].idxmax(),:]
rse_name                                                                       AON Master Trust
mysuper_product_name                                                                Aon MySuper
strategic_growth_asset_allocation                                                        0.8700
5_year_net_investment_return_nir_p_a                                                     0.0983
total_fees_disclosed_10_000_account_balance                                              0.0170
total_fees_disclosed_25_000_account_balance                                              0.0115
total_fees_disclosed_50_000_account_balance                                              0.0097
total_fees_disclosed_100_000_account_balance                                             0.0087
total_fees_disclosed_250_000_account_balance                                             0.0082
growth                                                                                        H
5_yr_nir_ranking                                                                              1
5_yr_nir_rank_str                                                                             1
MySuper_Details                                 AON Master Trust - Aon MySuper - H (NIR Rank 1)
Name: 125, dtype: object

We have a match. All looking good, so now onto creating some simple, easy to read heatmaps to replace that convoluted one provided by APRA.

Net Investment Return – Heatmap Top N Funds

To create the heatmap of top-performing funds shown at the start of the article, I called a function by passing in the dataframe, a top number of funds to show and a filename to save the figure to the computer.

fn_pivot_table_for_heatmap(df, top_funds_to_show, 'img/apra_5_year_top_perform_heat.png')

This function calls another pivot table function, which pivots the top performers as discussed above.

def fn_pivot_table_for_heatmap(df, top_funds_to_show, file=None):
    """Seaborn heatmap of top funds based on NIR
    df: dataframe
    top_funds_to_show_show: interger
    file: str value of a figure filename to save. Default is None, which doesn't save.
    """
    nir='5_year_net_investment_return_nir_p_a'

    # Get top N performers unique to each fund / risk tolerance (L, M, H)
    df_top_perform = fn_pivot_table(df, top_funds_to_show)
    df_top_perform.reset_index(inplace=True)

    # Pivot the top N performers with their details to be passed to the heatmap labels
    df_pt = df_top_perform[['MySuper_Details', nir]]

    # Set the index for use in the heatmap
    df_pt.set_index('MySuper_Details', inplace=True)

    fig, ax = plt.subplots(figsize=(4,10))

    sns.heatmap(df_pt.sort_values(nir, ascending=False), annot=True,
                cmap='coolwarm', cbar=False, fmt='.2%', ax=ax )

    plt.title(f'Heatmap of 5-year Net Investment Return p.a. - Top {top_funds_to_show} n
             Data from: APRA MySuper Products 10 December 2019',
             fontsize=20)
    plt.xlabel(f' 5-year Net Investment Return p.a.',
               fontsize=12)
    plt.ylabel('MySuper Product Name with Growth Asset Allocation: Low (L), Medium (M), High (H)',
               fontsize=12);
    if file: plt.savefig(file, bbox_inches='tight')
    plt.show();

Total Fees – Heatmaps

A function call for the fees is very similar to the performance function.

fn_pivot_table_for_heatmap_fees(df, 'total_fees_disclosed_10_000_account_balance', 10,
                                top_funds_to_show, 'img/apra_top_perform_fees_10k.png')

The function to produce the fee heatmap is like the performance function above, with a swap of NIR values and titles with fees.

Comments

I sourced the data from APRA of which they received from registered funds. Based on the commentary above, some funds provided a heap of product names with tonnes of allocation percentages and other funds only provided one. So even though the data appears to be plenty, there are a gazillion other possibilities.

Use this information to compare against your super fund to see how they place and decide if a move is on the cards. Given the current climate, I think it will affect all funds performances, so I will lean towards the lowest fees. Again, you need to research further those funds under consideration.

For example, my current fund is HostPlus and they are ranking mid to the higher end of the fees of the top performers. With a $50,000 investment, their Balanced option is showing 1.22%. I wouldn’t be paying this, so I will dig deeper. Running a DuckDuckGo search on HostPlus Fees brings up: Fees and Costs

Example – HOSTPLUS Balanced option BALANCE OF $50,000  
Investment fees 0.58% For every $50,000 you have in the superannuation product you will be charged $290 each year
PLUS Administration fees $78 ($1.50 per week) And, you will be charged $78 in administration fees regardless of your balance
PLUS Indirect costs for the MySuper product 0.33% And, indirect investment costs of $165 each year will be deducted from your investment
EQUALS Cost of product 1.07% If your balance was 50,000, then for that year you will be charged fees of $533 for the superannuation product.
HostPlus Example of Fees for their Balanced option product of $50k Investment Balance

This 1.07% is not 1.22% in APRA data, so I can only assume they dropped their fees between periods.

I chose HostPlus Indexed Balances (not included in APRA) and the fees on the site are:

# Estimated Hostplus Superannuation Fund - 30 June 2019 Fees and costs
# Investment option: Indexed Balanced
# https://pds.hostplus.com.au/6-fees-and-costs

invest_amt = 50000
invest_fees = 0.02/100 # %
admin_fee = 78 # p.a.
indirect_fee = 0.03/100 # %

fn_fee_checks(invest_amt, invest_fees, admin_fee,  indirect_fee)
For every $50000 you have in the superannuation product you will be charged $10 each year
And, you will be charged $78 in administration fees regardless of your balance
And, indirect investment costs of $15 each year will be deducted from your investment
If your balance was $50000, then for that year you will be charged fees of $103 for the superannuation product.
FINAL TOTAL COST FEE: 0.21%

The total fee for this fund is 0.21% which is lower than any of the top funds. So I will not be making a move at this stage.

This showed me I have to be careful when picking the fund options. With all my other funds, I just completed their questionnaire about my risk tolerance during joining and didn’t bother to check the fees of the selected investment option selected by the computer, based on my answers.

Now your turn. Get out there and get the best deal for you.

Resources

References

[Final Report]: Royal Commission into Misconduct in the Banking, Superannuation and Financial Services Industry.
[APRA Data Insights]: APRA Data Insights MySuper Product Heatmap 10 December 2019