Why The Repository Pattern Fails For Legacy Databases Like Yours And What To Do About It

Introduction: A Classic Example (spot the flaw?)
Implicit Architecture Rules
The Unnecessary Abstraction
Cutting Out the Middleman
Your Data Sources Will Change...

Introduction: A Classic Example (spot the flaw?)

One of the hallmarks of creating a loosely coupled architecture is using the repository pattern to return data. Each model's repository is implemented from an interface, and that interface is injected into, typically, a service.

namespace Company.Domain.DomainModels
{
    public class Customer
    {
        public int CustomerId {get; set;}
        public string Name {get; set;}
    }
}

namespace Company.Domain.DataInterfaces
{
    public interface IRepository<T> where T: class
    {
        T Single(Expression<Func<T, bool> where);
        IEnumerable<T> Many(Expression<Func<T, bool> where);
    }
}

The concrete implementation, using Entity Framework 6.

namespace Company.Data
    public class CustomerRepository: IRepository<Customer>
    {
        CompanyDbContext _context = new CompanyDbContext();
        Customer Single(Expression<Func<Customer, bool> where)
        {
            return _context.Customers.SingleOrDefault(where);   
        }
        IEnumerable<Customer> Many(Expression<Func<Customer, bool> where)
        {
            return _context.Customers.Where(where);
        }
    }
}

The concrete customer service with the repository interface injected.

namespace Company.Services
{
    public class CustomerService: ICustomerService
    {
        IRepository<Customer> _customerRepository;
        
        public CustomerService(IRepository<Customer> customerRepository)
        {
            _customerRepository = customerRepository
        }
        
        public Customer GetCustomer(int id)
        {
            return _customerRepository.Single(a => a.CustomerId == id);
        }
        
        public List<Customers> GetCustomers(Expression<Func<Customer, bool>> where)
        {
            return _customerRepository.Many(where);
        }
    }
}

A unit test. (Note this is a lousy unit test because it's "testing" a data passthrough, but it illustrates the point.)

using XUnit;
using NSubstitute;
using FluentAssertions;

namespace Company.Tests
{
    public class CustomerService_Should
    {
        [Fact]
        public void Return_one_customer()
        {
            int id = 1;
            Customer expected = new Customer() {CustomerId = 1;}
            var customerRepository = Substitute.For<IRepository<Customer>>();
            var customerService = new CustomerService(customerRepository);
            customerRepository.Single((a => a.CustomerId == id).Returns(expected);
            var result = customerService.GetCustomer(id);
            result.Should().BeEquivalentTo(expected);
        }
    }
}

The namespaces indicate where in the Onion Architecture the classes belong.

Core: Customer represents a company customer. It has no behavior.
Core: IRepository is our generic standard for getting data. By injecting this standard into services, we can swap out the database by simply implementing it in a new concrete repository.
Core: ICustomerService is our contract for what a CustomerService should implement
Data: CustomerRepository is the concrete instance that will return models from the database.
Service: CustomerService is the concrete instance that returns models to the client.

Hint The flaw with the repository pattern is #2 and #4.

Image Copyright Jeffrey Palermo

Implicit Architecture Rules

Implicit in the Onion architecture are these rules:

Domain Models represent the way the organization really works.
Domain Services return Domain Models. (e.g. CustomerService returns Customer)
Data Services return Data Models (e.g. CustomerRepository returns...Customer, in this case)
APIs return with View Models (MVC Controllers, WebApi return something like CustomerMaintModel)
UIs work with View Models

This is a loosely coupled architecture. By using interfaces, it's also highly testable.

So, what's wrong with it?

The repository expects to return a domain model from the database. And that's not how most databases, especially legacy databases, are designed.

Let's make sure this is really clear, because it's the flaw in every example I've seen of this architecture. Here's how the domain and database stay loosely coupled.

The domain models represent how the organization sees information.
The database models represent how the data is stored.
The domain should have no knowledge of the database.

The mistake I've seen is that organizations' repositories return their data models, not domain models. They then rely on the service to map to the domain models. This means the repositories are tightly coupled to the data model.

In this light, look at the classic example again. The repository returns a Customer object. Customer is a domain model. This means that, if the database tables aren't one-to-one matches to the domain models, we have two choices in mapping the table model to the domain model.

Use IRepository and have the CustomerService do the mapping. For instance, the List<Customer> GetCustomers() method would get the data from the CustFileRepository and populate a Customer list.

This is how I've seen most organizations try to do it. But we've tightly coupled our data model to our domain's IRepository. The data model is now part of the domain model. If we wanted to swap out the database, it would have to have a CustFile table. If it doesn't, we're rewriting the repositories and injecting new interfaces into the services.
Use IRepository and have the CustomerRepository do the mapping. For instance, the IEnumerable<Customer> Many() method would get the data using a data model such as CustFile and populate a Customer list. There's no guarantee CustFile has the same properties as Customer.

This would be "more right" since there'd be loose coupling to the data. We could supposedly swap out the database. But as you'll see it won't work the way you think, and becomes an unnecessary layer of abstraction.

Let's try doing the second way. But let's use a more real-world scenario.

The customer data is stored in two Microsoft SQL Server tables that were created over the course of ten years. The extended table was used to handle new columns for the original table. Note the different primary key names and storing name information in a weird way. Also, there's no foreign key for the one-to-zero-one relationship.

CustFile
=========
CustId   int PK
FullName varchar(100)

CustFileExtended
===============
CustomerNumber int PK
Zip5           int

We need data models specific to these tables.

namespace Company.Data.DataModels
{
    public class CustFile
    {
        public int CustId { get; set; }
        public string FullName { get; set; }
    }
    
    public class CustFileExtended
    {
        public int CustNumber { get; set; }
        public int Zip5 { get; set; }
    }
}

To keep the example clear, we'll use an explicit ICustomerRepository. Remember, in order to be loosely coupled, this repository returns a Customer domain model.

If it returned CustFile and CustFileExtended models, then those data models would have to be part of the Domain. And that would tightly couple the abstract domain model to the concrete data model.

namespace Company.Domain.DataInterfaces
{
    public interface ICustomerRepository
    {
        // Create
        Customer Create(Customer entity);

        // Retrieve
        Customer RetrieveSingle(int id);
        IEnumerable<Customer> RetrieveMany();
        IEnumerable<Customer> RetrieveMany(Expression<Func<Customer, bool>> where);

        // Update
        Customer Update(Customer entity);

        // Delete
        void Delete(int id);
    }
}

Note the use of Expression. That's coupling to Queryables, which are provider-dependent.

Now we'll try to implement the repository using Entity Framework. To make the problem clear, we'll only look at the RetrieveMany methods.

namespace Company.Data
{
    public class CustomerRepository : ICustomerRepository
    {
        private CompanyDbContext _context;
        public CustomerRepository(CompanyDbContext context)
        {
            _context = context;
        }
        
        public IEnumerable<Customer> RetrieveMany()
        {
            // Now what?
        }
        
        public IEnumerable<Customer> RetrieveMany(Expression<Func<Customer, bool>> where)
        {
            // Now what?
        }
    }
}

We'll assume we can mock the context somehow. But this is where things go awry.

Many organizations end up adding IRepository and IUnitOfWork to DbSet and DbContext. Patterns on top of patterns. Abstractions on top of abstractions!

How do we return a collection of Customers? Remember, we're getting the data from two different tables.

Important The calls to the database have no knowledge of what a Customer is.

public IEnumerable<Customer> RetrieveMany()
{
    List<Customer> customers = new List<Customer>();
    
    // there's no filtering, so get all records. That's right. ALL OF THEM.
    var custFiles = _context.CustFiles.ToList();
    int[] ids = custFiles.Select(a => a.CustId).ToArray();
    var custFileExtendeds = _context.CustFileExtendeds.Where(a => ids.Contains(a.CustNumber)).ToList();
    
    customers = custFiles.Select(a => new Customer()
    {
       CustomerId = a.CustId,
       Name = a.FullName,
       ZipCode = custFileExtendeds.SingleOrDefault(b => b.CustNumber == a.CustId)?.Zip5
    });
    
    return customers;
}

We're using a generic repository pattern, which is typically going to include a method such as above because sometimes we do want all the records. But not if there are a hundred million. OK, let's try filtering.

public IEnumerable<Customer> RetrieveMany(Expression<Func<Customer, bool>> where)
{
    List<Customer> customers = new List<Customer>();
    
    // And...now what?
}

We have a filter expression that can be examined for its properties and conditions. So at first it seems like all we have to do is

Extract the Customer properties
Map them to the CustFile and CustFileExtended properties
Pull out the boolean conditions
Create Expressions from the properties and conditions and apply the filters to the appropriate CustFile or CustFileExtended DbSets.
Build and return the Customer list.

That's...actually, a lot of work. And wait...that's not complete, is it? There could be a few more steps after 4.

Programmatically handle a CustFile property that's filtering based on a CustFileExtended property. E.g. Customer.Discount == Customer.MaxDiscount. Except in the database it's CustFile.Discount and CustFileExtended.MaxDiscount. How's that going to work?
Programmatically handle nested conditions that go across table properties. Good luck to you.
Hope to heck a Customer property being filtered on isn't made up of two properties from two different tables.

and so on until...

Build and return the Customer list and pray it's right.

OK, so maybe we need custom repository interfaces for each domain model. Then we can make the client supply known filters like customer ids, get our initial datasets down to a managable size, then filter on the domain model.

//No need for an Expression since List<Customer> is generic.
public IEnumerable<Customer> RetrieveMany(int zipCode, Func<Customer, bool> where)
{
    List<Customer> customers = new List<Customer>();
    var custFileExtendeds = _context.CustFileExtendeds.Where(a => a.Zip5 == zipCode).ToList();
    //Build the customer list and add the filter at the end.
    return custFileExtendeds.Select(a => new Customer()
    {
       CustomerId = a.CustNumber,
       ZipCode = a.Zip5,
       Name = _context.CustFiles.Single(a => a.CustId == a.CustNumber).FullName
    }).Where(where);
}

And the Service method would look like this:

public List<Customer> GetCustomers(int zipCode, Func<Customer, bool> where)
{
    return _customerRepository.RetrieveMany(zipCode, where);
}

Seeing that service method, you're hopefully asking some important questions. Like...

What if my domain model was being populated by data from multiple databases and maybe a web service?

What's the repository pattern gaining me?

The Unnecessary Abstraction

We've discovered a few things about having separate domain and data models.

No matter what, we need to map data models to domain models.
We can't have fully flexible filtering.
The mapping problems using IRepository methods are the same as with CustomerService methods. Specifically, the methods need to restrict the data that will come from the database in order for performance to be good.

There's one other requirement: whatever data sources are injected into CustomerService need to be mockable.

Whenever you see a passthrough query such as shown above in CustomerService.GetCustomers, it's worth asking, "Can I cut out the middleman?"

The answer here is "Yes." The repository is just getting in the way. We know the concrete CustomerService will depend on concrete data. The repository is just an abstraction of that data so we can unit test. We've already determined that if we want to swap out the database, we're going to be rewriting something. Why add to our troubles by essentially writing our CustomerService methods twice?

Solution Inject the mockable data source directly into the service

If the data source uses an interface, great. If there are multiple data sources, do whatever it takes to make them mockable. But don't use a repository abstraction. It just adds complexity for no gain.

Cutting Out the Middleman

Here's how our CustomerService example looks with the repository removed, organized by onion layers. These could be different assemblies.

Domain

namespace Company.Domain.DomainModels
{
    public class Customer
    {
        public int CustomerId { get; set; }
        public string Name { get; set; }
        public int ZipCode { get; set; }

    }
}

namespace Company.Domain.ServiceInterfaces
{
    public interface ICustomerService
    {
        //Create
        Customer Create(Customer customer);

        //Retrieve
        Customer RetrieveSingleById(int id);
        Customer RetrieveSingleByName(string name);
        List<Customer> RetrieveManyByIds(int[] ids);
        List<Customer> RetrieveManyByPartialName(string partialName);
        List<Customer> RetrieveManyByZipCode(int zipCode, Func<Customer, bool> where);

        //Update
        Customer Update(Customer customer);

        //Delete
        void Delete(int id);

    }
}

Data

namespace Company.Data.DataModels
{
    public class CustFile
    {
        public int CustId { get; set; }
        public string FullName { get; set; }
    }
    
    public class CustFileExtended
    {
        public int CustNumber { get; set; }
        public int Zip5 { get; set; }
    }
}

namespace Company.Data.SqlDatabase
{
    public class SqlDbContext : DbContext
    {
        public IDbSet<CustFile> CustFiles { get; set; }
        public IDbSet<CustFileExtended> CustFileExtendeds {get;set;}

        public SqlDbContext(string nameOrConnectionString) : base(nameOrConnectionString) { }
        
        //This constructor is used by Effort in unit testing
        public SqlDbContext(DbConnection existingConnection) : base(existingConnection, true) { }

        protected override void OnModelCreating(DbModelBuilder modelBuilder)
        {
            modelBuilder.Entity<CustFile>()
                .HasKey(a => a.CustId);

            modelBuilder.Entity<CustFileExtended>()
                .HasKey(a => a.CustNumber);
        }
    }
}

Services

This is only showing one method.

namespace Company.Services
{
    public class CustomerService : ICustomerService
    {
        SqlDbContext _context = new EF6Context("SqlDb");

        public CustomerService(SqlDbContext context)
        {
            _context = context;
        }

        public List<Customer> RetrieveManyByZipCode(int zipCode, Func<Customer, bool> where)
        {
            var custFileExtendeds = _context.CustFileExtendeds.Where(a => a.Zip5 == zipCode).ToList();
            return GetCustomers(custFileExtendeds)
                .Where(where).ToList();
        }
        
        // Helpers for mapping
        private List<Customer> GetCustomers(List<CustFile> custFiles)
        {
            List<Customer> customers = new List<Customer>();
            var custFileExtendeds = _context.CustFileExtendeds
                .Where(a => customers.Select(b => b.CustomerId).ToList().Contains(a.CustNumber)).ToList();

            customers.AddRange(custFiles.Select(a =>
            {
                var custFileExtended = custFileExtendeds.Single(b => b.CustNumber == a.CustId);
                return new Customer()
                {
                    CustomerId = a.CustId,
                    Name = a.FullName,
                    ZipCode = custFileExtended.Zip5
                };
            }));

            return customers;
        }

        private List<Customer> GetCustomers(List<CustFileExtended> custFileExtendeds)
        {
            int[] ids = custFileExtendeds.Select(a => a.CustNumber).ToArray();
            var custFiles = _context.CustFiles.Where(a => ids.Contains(a.CustId)).ToList();
            return GetCustomers(custFiles);
        }
    }
}

Test

There are various ways to mock Entity Framework. I like Effort because you don't need to add interfaces to your DbContext, you just need a particular constructor.

//other usings above, these are the testing dependencies
using Company.Data;
using Company.Data.DataModels;
using Company.Domain.DomainModels;
using Company.Domain.ServiceInterfaces;
using Company.Services;
using Xunit;
using Effort;
using FluentAssertions;

namespace DealingWithData.Tests
{
    public class CustomerService_Should
    {
        SqlDbContext _context;
        ICustomerService _customerService;

        public CustomerService_Should()
        {
            //This is what makes Effort the in-memory database
            _context = new SqlDbContext(DbConnectionFactory.CreateTransient());
            _customerService = new CustomerService(_context);
        }

        ~CustomerService_Should() 
        {
            _context.Dispose();
        }

        [Fact]
        public void Return_a_single_customer_by_id()
        {
            // arrange
            Customer expected = new Customer()
            {
                CustomerId = 1,
                Name = "Herbert",
                ZipCode = 12345
            };
            // mock the data the service works with
            _context.CustFiles.Add(new CustFile() { CustId = 1, FullName = "Herbert" });
            _context.CustFileExtendeds.Add(new CustFileExtended() { CustNumber = 1, Zip5 = 12345 });
            _context.SaveChanges();

            // act
            var actual = _customerService.RetrieveSingleById(expected.CustomerId);

            // assert
            actual.Should().BeEquivalentTo(expected);
        }
    }
}

If...when...we have changes in our data sources, we'll have to update the service and the unit tests. That's OK. Before, we'd have been updating repositories and unit tests. This removes the repository pattern dead weight.

Your Data Sources Will Change...

Sometimes radically. Your domain model will change, too, but it shouldn't be driven by the backend data. You should be able to code and unit test without any concrete dependencies, and it shouldn't be painful.

Do you really need the Repository pattern? Probably not.