Why the Repository Pattern Fails for Legacy Databases Like Yours and What to Do About It
2020-02-13 15:11
Contents
- Introduction: A Classic Example (spot the flaw?)
- Implicit Architecture Rules
- The Unnecessary Abstraction
- Cutting Out the Middleman
- Your Data Sources Will Change...
Introduction: A Classic Example (spot the flaw?)
One of the hallmarks of creating a loosely coupled architecture is using the repository pattern to return data. Each model's repository is implemented from an interface, and that interface is injected into, typically, a service.
namespace Company.Domain.DomainModels
{
public class Customer
{
public int CustomerId {get; set;}
public string Name {get; set;}
}
}
namespace Company.Domain.DataInterfaces
{
public interface IRepository<T> where T: class
{
T Single(Expression<Func<T, bool> where);
IEnumerable<T> Many(Expression<Func<T, bool> where);
}
}
The concrete implementation, using Entity Framework 6.
namespace Company.Data
public class CustomerRepository: IRepository<Customer>
{
CompanyDbContext _context = new CompanyDbContext();
Customer Single(Expression<Func<Customer, bool> where)
{
return _context.Customers.SingleOrDefault(where);
}
IEnumerable<Customer> Many(Expression<Func<Customer, bool> where)
{
return _context.Customers.Where(where);
}
}
}
The concrete customer service with the repository interface injected.
namespace Company.Services
{
public class CustomerService: ICustomerService
{
IRepository<Customer> _customerRepository;
public CustomerService(IRepository<Customer> customerRepository)
{
_customerRepository = customerRepository
}
public Customer GetCustomer(int id)
{
return _customerRepository.Single(a => a.CustomerId == id);
}
public List<Customers> GetCustomers(Expression<Func<Customer, bool>> where)
{
return _customerRepository.Many(where);
}
}
}
A unit test. (Note this is a lousy unit test because it's "testing" a data passthrough, but it illustrates the point.)
using XUnit;
using NSubstitute;
using FluentAssertions;
namespace Company.Tests
{
public class CustomerService_Should
{
[Fact]
public void Return_one_customer()
{
int id = 1;
Customer expected = new Customer() {CustomerId = 1;}
var customerRepository = Substitute.For<IRepository<Customer>>();
var customerService = new CustomerService(customerRepository);
customerRepository.Single((a => a.CustomerId == id).Returns(expected);
var result = customerService.GetCustomer(id);
result.Should().BeEquivalentTo(expected);
}
}
}
The namespaces indicate where in the Onion Architecture the classes belong.
- Core:
Customer
represents a company customer. It has no behavior. - Core: IRepository is our generic standard for getting data. By injecting this standard into services, we can swap out the database by simply implementing it in a new concrete repository.
- Core: ICustomerService is our contract for what a CustomerService should implement
- Data: CustomerRepository is the concrete instance that will return models from the database.
- Service: CustomerService is the concrete instance that returns models to the client.
Hint The flaw with the repository pattern is #2 and #4.
Image Copyright Jeffrey Palermo
Implicit Architecture Rules
Implicit in the Onion architecture are these rules:
- Domain Models represent the way the organization really works.
- Domain Services return Domain Models. (e.g. CustomerService returns Customer)
- Data Services return Data Models (e.g. CustomerRepository returns...Customer, in this case)
- APIs return with View Models (MVC Controllers, WebApi return something like CustomerMaintModel)
- UIs work with View Models
This is a loosely coupled architecture. By using interfaces, it's also highly testable.
So, what's wrong with it?
The repository expects to return a domain model from the database. And that's not how most databases, especially legacy databases, are designed.
Let's make sure this is really clear, because it's the flaw in every example I've seen of this architecture. Here's how the domain and database stay loosely coupled.
- The domain models represent how the organization sees information.
- The database models represent how the data is stored.
- The domain should have no knowledge of the database.
The mistake I've seen is that organizations' repositories return their data models, not domain models. They then rely on the service to map to the domain models. This means the repositories are tightly coupled to the data model.
In this light, look at the classic example again. The repository returns a Customer object. Customer is a domain model. This means that, if the database tables aren't one-to-one matches to the domain models, we have two choices in mapping the table model to the domain model.
Use IRepository
and have the CustomerService do the mapping. For instance, the List<Customer> GetCustomers()
method would get the data from the CustFileRepository and populate a Customer list.This is how I've seen most organizations try to do it. But we've tightly coupled our data model to our domain's IRepository. The data model is now part of the domain model. If we wanted to swap out the database, it would have to have a CustFile table. If it doesn't, we're rewriting the repositories and injecting new interfaces into the services.
Use IRepository
and have the CustomerRepository do the mapping. For instance, the IEnumerable<Customer> Many()
method would get the data using a data model such as CustFile and populate a Customer list. There's no guarantee CustFile has the same properties as Customer.This would be "more right" since there'd be loose coupling to the data. We could supposedly swap out the database. But as you'll see it won't work the way you think, and becomes an unnecessary layer of abstraction.
Let's try doing the second way. But let's use a more real-world scenario.
The customer data is stored in two Microsoft SQL Server tables that were created over the course of ten years. The extended table was used to handle new columns for the original table. Note the different primary key names and storing name information in a weird way. Also, there's no foreign key for the one-to-zero-one relationship.
CustFile
=========
CustId int PK
FullName varchar(100)
CustFileExtended
===============
CustomerNumber int PK
Zip5 int
We need data models specific to these tables.
namespace Company.Data.DataModels
{
public class CustFile
{
public int CustId { get; set; }
public string FullName { get; set; }
}
public class CustFileExtended
{
public int CustNumber { get; set; }
public int Zip5 { get; set; }
}
}
To keep the example clear, we'll use an explicit ICustomerRepository. Remember, in order to be loosely coupled, this repository returns a Customer domain model.
If it returned CustFile and CustFileExtended models, then those data models would have to be part of the Domain. And that would tightly couple the abstract domain model to the concrete data model.
namespace Company.Domain.DataInterfaces
{
public interface ICustomerRepository
{
// Create
Customer Create(Customer entity);
// Retrieve
Customer RetrieveSingle(int id);
IEnumerable<Customer> RetrieveMany();
IEnumerable<Customer> RetrieveMany(Expression<Func<Customer, bool>> where);
// Update
Customer Update(Customer entity);
// Delete
void Delete(int id);
}
}
Note the use of Expression. That's coupling to Queryables, which are provider-dependent.
Now we'll try to implement the repository using Entity Framework. To make the problem clear, we'll only look at the RetrieveMany methods.
namespace Company.Data
{
public class CustomerRepository : ICustomerRepository
{
private CompanyDbContext _context;
public CustomerRepository(CompanyDbContext context)
{
_context = context;
}
public IEnumerable<Customer> RetrieveMany()
{
// Now what?
}
public IEnumerable<Customer> RetrieveMany(Expression<Func<Customer, bool>> where)
{
// Now what?
}
}
}
We'll assume we can mock the context somehow. But this is where things go awry.
Many organizations end up adding IRepository and IUnitOfWork to DbSet and DbContext. Patterns on top of patterns. Abstractions on top of abstractions!
How do we return a collection of Customers? Remember, we're getting the data from two different tables.
Important The calls to the database have no knowledge of what a Customer is.
public IEnumerable<Customer> RetrieveMany()
{
List<Customer> customers = new List<Customer>();
// there's no filtering, so get all records. That's right. ALL OF THEM.
var custFiles = _context.CustFiles.ToList();
int[] ids = custFiles.Select(a => a.CustId).ToArray();
var custFileExtendeds = _context.CustFileExtendeds.Where(a => ids.Contains(a.CustNumber)).ToList();
customers = custFiles.Select(a => new Customer()
{
CustomerId = a.CustId,
Name = a.FullName,
ZipCode = custFileExtendeds.SingleOrDefault(b => b.CustNumber == a.CustId)?.Zip5
});
return customers;
}
We're using a generic repository pattern, which is typically going to include a method such as above because sometimes we do want all the records. But not if there are a hundred million. OK, let's try filtering.
public IEnumerable<Customer> RetrieveMany(Expression<Func<Customer, bool>> where)
{
List<Customer> customers = new List<Customer>();
// And...now what?
}
We have a filter expression that can be examined for its properties and conditions. So at first it seems like all we have to do is
- Extract the Customer properties
- Map them to the CustFile and CustFileExtended properties
- Pull out the boolean conditions
- Create Expressions from the properties and conditions and apply the filters to the appropriate CustFile or CustFileExtended DbSets.
- Build and return the Customer list.
That's...actually, a lot of work. And wait...that's not complete, is it? There could be a few more steps after 4.
- Programmatically handle a CustFile property that's filtering based on a CustFileExtended property. E.g.
Customer.Discount == Customer.MaxDiscount
. Except in the database it's CustFile.Discount and CustFileExtended.MaxDiscount. How's that going to work? - Programmatically handle nested conditions that go across table properties. Good luck to you.
- Hope to heck a Customer property being filtered on isn't made up of two properties from two different tables.
and so on until...
- Build and return the Customer list and pray it's right.
OK, so maybe we need custom repository interfaces for each domain model. Then we can make the client supply known filters like customer ids, get our initial datasets down to a managable size, then filter on the domain model.
//No need for an Expression since List<Customer> is generic.
public IEnumerable<Customer> RetrieveMany(int zipCode, Func<Customer, bool> where)
{
List<Customer> customers = new List<Customer>();
var custFileExtendeds = _context.CustFileExtendeds.Where(a => a.Zip5 == zipCode).ToList();
//Build the customer list and add the filter at the end.
return custFileExtendeds.Select(a => new Customer()
{
CustomerId = a.CustNumber,
ZipCode = a.Zip5,
Name = _context.CustFiles.Single(a => a.CustId == a.CustNumber).FullName
}).Where(where);
}
And the Service method would look like this:
public List<Customer> GetCustomers(int zipCode, Func<Customer, bool> where)
{
return _customerRepository.RetrieveMany(zipCode, where);
}
Seeing that service method, you're hopefully asking some important questions. Like...
What if my domain model was being populated by data from multiple databases and maybe a web service?
What's the repository pattern gaining me?
The Unnecessary Abstraction
We've discovered a few things about having separate domain and data models.
- No matter what, we need to map data models to domain models.
- We can't have fully flexible filtering.
- The mapping problems using IRepository
methods are the same as with CustomerService methods. Specifically, the methods need to restrict the data that will come from the database in order for performance to be good.
There's one other requirement: whatever data sources are injected into CustomerService need to be mockable.
Whenever you see a passthrough query such as shown above in CustomerService.GetCustomers, it's worth asking, "Can I cut out the middleman?"
The answer here is "Yes." The repository is just getting in the way. We know the concrete CustomerService will depend on concrete data. The repository is just an abstraction of that data so we can unit test. We've already determined that if we want to swap out the database, we're going to be rewriting something. Why add to our troubles by essentially writing our CustomerService methods twice?
Solution Inject the mockable data source directly into the service
If the data source uses an interface, great. If there are multiple data sources, do whatever it takes to make them mockable. But don't use a repository abstraction. It just adds complexity for no gain.
Cutting Out the Middleman
Here's how our CustomerService example looks with the repository removed, organized by onion layers. These could be different assemblies.
Domain
namespace Company.Domain.DomainModels
{
public class Customer
{
public int CustomerId { get; set; }
public string Name { get; set; }
public int ZipCode { get; set; }
}
}
namespace Company.Domain.ServiceInterfaces
{
public interface ICustomerService
{
//Create
Customer Create(Customer customer);
//Retrieve
Customer RetrieveSingleById(int id);
Customer RetrieveSingleByName(string name);
List<Customer> RetrieveManyByIds(int[] ids);
List<Customer> RetrieveManyByPartialName(string partialName);
List<Customer> RetrieveManyByZipCode(int zipCode, Func<Customer, bool> where);
//Update
Customer Update(Customer customer);
//Delete
void Delete(int id);
}
}
Data
namespace Company.Data.DataModels
{
public class CustFile
{
public int CustId { get; set; }
public string FullName { get; set; }
}
public class CustFileExtended
{
public int CustNumber { get; set; }
public int Zip5 { get; set; }
}
}
namespace Company.Data.SqlDatabase
{
public class SqlDbContext : DbContext
{
public IDbSet<CustFile> CustFiles { get; set; }
public IDbSet<CustFileExtended> CustFileExtendeds {get;set;}
public SqlDbContext(string nameOrConnectionString) : base(nameOrConnectionString) { }
//This constructor is used by Effort in unit testing
public SqlDbContext(DbConnection existingConnection) : base(existingConnection, true) { }
protected override void OnModelCreating(DbModelBuilder modelBuilder)
{
modelBuilder.Entity<CustFile>()
.HasKey(a => a.CustId);
modelBuilder.Entity<CustFileExtended>()
.HasKey(a => a.CustNumber);
}
}
}
Services
This is only showing one method.
namespace Company.Services
{
public class CustomerService : ICustomerService
{
SqlDbContext _context = new EF6Context("SqlDb");
public CustomerService(SqlDbContext context)
{
_context = context;
}
public List<Customer> RetrieveManyByZipCode(int zipCode, Func<Customer, bool> where)
{
var custFileExtendeds = _context.CustFileExtendeds.Where(a => a.Zip5 == zipCode).ToList();
return GetCustomers(custFileExtendeds)
.Where(where).ToList();
}
// Helpers for mapping
private List<Customer> GetCustomers(List<CustFile> custFiles)
{
List<Customer> customers = new List<Customer>();
var custFileExtendeds = _context.CustFileExtendeds
.Where(a => customers.Select(b => b.CustomerId).ToList().Contains(a.CustNumber)).ToList();
customers.AddRange(custFiles.Select(a =>
{
var custFileExtended = custFileExtendeds.Single(b => b.CustNumber == a.CustId);
return new Customer()
{
CustomerId = a.CustId,
Name = a.FullName,
ZipCode = custFileExtended.Zip5
};
}));
return customers;
}
private List<Customer> GetCustomers(List<CustFileExtended> custFileExtendeds)
{
int[] ids = custFileExtendeds.Select(a => a.CustNumber).ToArray();
var custFiles = _context.CustFiles.Where(a => ids.Contains(a.CustId)).ToList();
return GetCustomers(custFiles);
}
}
}
Test
There are various ways to mock Entity Framework. I like Effort because you don't need to add interfaces to your DbContext, you just need a particular constructor.
//other usings above, these are the testing dependencies
using Company.Data;
using Company.Data.DataModels;
using Company.Domain.DomainModels;
using Company.Domain.ServiceInterfaces;
using Company.Services;
using Xunit;
using Effort;
using FluentAssertions;
namespace DealingWithData.Tests
{
public class CustomerService_Should
{
SqlDbContext _context;
ICustomerService _customerService;
public CustomerService_Should()
{
//This is what makes Effort the in-memory database
_context = new SqlDbContext(DbConnectionFactory.CreateTransient());
_customerService = new CustomerService(_context);
}
~CustomerService_Should()
{
_context.Dispose();
}
[Fact]
public void Return_a_single_customer_by_id()
{
// arrange
Customer expected = new Customer()
{
CustomerId = 1,
Name = "Herbert",
ZipCode = 12345
};
// mock the data the service works with
_context.CustFiles.Add(new CustFile() { CustId = 1, FullName = "Herbert" });
_context.CustFileExtendeds.Add(new CustFileExtended() { CustNumber = 1, Zip5 = 12345 });
_context.SaveChanges();
// act
var actual = _customerService.RetrieveSingleById(expected.CustomerId);
// assert
actual.Should().BeEquivalentTo(expected);
}
}
}
If...when...we have changes in our data sources, we'll have to update the service and the unit tests. That's OK. Before, we'd have been updating repositories and unit tests. This removes the repository pattern dead weight.
Your Data Sources Will Change...
Sometimes radically. Your domain model will change, too, but it shouldn't be driven by the backend data. You should be able to code and unit test without any concrete dependencies, and it shouldn't be painful.
Do you really need the Repository pattern? Probably not.